Custom aggregates: a couple tips and ORDER BY in 9.0

A friend asked about a way to report the first three semesters that a group of students were documented as being present, and report those values each in a column.

The tricky thing is that the semesters students attend are rarely the same. I started out with a very naive query (and sorry for the bad formatting that follows.. i need to find some good SQL formatting markup) just to get some initial results:


select student,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1) as sem1,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 1) as sem2,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 2) as sem3
FROM assoc a1
WHERE
student IN ( select student from assoc group by student HAVING count(*) > 2)
GROUP BY student;

That query pretty much sucks, requiring five sequential scans of ‘assoc’:

                                     QUERY PLAN                                     
 HashAggregate  (cost=3913.13..315256.94 rows=78 width=2)
   ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=2)
         Hash Cond: (a1.student = assoc.student)
         ->  Seq Scan on assoc a1  (cost=0.00..1126.17 rows=78017 width=2)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)
   SubPlan 1
     ->  Limit  (cost=1326.21..1326.22 rows=1 width=3)
           ->  Sort  (cost=1326.21..1328.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 2
     ->  Limit  (cost=1331.22..1331.22 rows=1 width=3)
           ->  Sort  (cost=1331.21..1333.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 3
     ->  Limit  (cost=1334.14..1334.14 rows=1 width=3)
           ->  Sort  (cost=1334.14..1336.64 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)

So, he reminded me about custom aggregates! I did a little searching and found an example function that I added an extra CASE statement that stops the aggregate from adding more than three items to the array returned:


CREATE FUNCTION array_append_not_null(anyarray,anyelement)
RETURNS anyarray
AS '
SELECT CASE WHEN $2 IS NULL THEN $1 WHEN array_upper($1, 1) > 2 THEN $1 ELSE array_append($1,$2) END
'
LANGUAGE sql IMMUTABLE RETURNS NULL ON NULL INPUT;

And finally, I declared an aggregate:


CREATE AGGREGATE three_semesters_not_null (
sfunc = array_append_not_null,
basetype = anyelement,
stype = anyarray,
initcond = '{}'
);

One problem though – we want the array returned to be only the first three semesters, rather than any three semesters a student has a record for. Meaning, we need to sort the information passed to the aggregate function. We could do this inside the aggregate itself (bubble sort, anyone?) or we can presort the input! I chose presorting, to avoid writing a real ugly case statement.

My query (compatible with 8.3 or higher):


SELECT sorted.student, three_semesters_not_null(sorted.semester)
FROM (SELECT student, semester from assoc order by semester ) as sorted
WHERE
sorted.student IN (select a.student from assoc a group by a.student HAVING count(*) > 2)
GROUP BY sorted.student;

Which yields the much nicer query plan, requiring just two sequential scans:

                                      QUERY PLAN                                      
 HashAggregate  (cost=11722.96..11725.46 rows=200 width=64)
   ->  Hash Semi Join  (cost=10052.32..11570.82 rows=30427 width=64)
         Hash Cond: (assoc.student = a.student)
         ->  Sort  (cost=8533.14..8728.18 rows=78017 width=5)
               Sort Key: assoc.semester
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=2)

I ran my queries by Magnus, and he reminded me that what I really needed was ORDER BY in my aggregate! Fortunately, 9.0 has exactly this feature:


SELECT student,
three_semesters_not_null(semester order by semester asc ) as first_three_semesters
FROM assoc
WHERE student IN (select student from assoc group by student HAVING count(*) > 2)
GROUP BY student;

Which results in the following plan:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: public.assoc.student
         ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (public.assoc.student = public.assoc.student)
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

A final alternative would be to transform the IN query into a JOIN:


SELECT a.student,
three_semesters_not_null(a.semester order by a.semester asc ) as first_three_semesters
FROM assoc a
JOIN (select student from assoc group by student HAVING count(*) > 2) as b ON b.student = a.student
GROUP BY a.student;

And the plan isn’t much different:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: a.student
         ->  Hash Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (a.student = assoc.student)
               ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

Any other suggestions for this type of query?

I’ve attached the file I was using to test this out.
custom_aggregates.sql

11 thoughts on Custom aggregates: a couple tips and ORDER BY in 9.0

Comments are closed.

  1. Your query is pretty neat, any advantage using a cte?

    with students_g2 as (
    select student from assoc group by student HAVING count(*) > 2
    )
    SELECT a.student,
    three_semesters_not_null(a.semester order by a.semester asc ) as first_three_semesters
    FROM assoc a
    JOIN students_g2 as b ON b.student = a.student
    GROUP BY a.student;

  2. What about some windowing functions?

    SELECT
    student, array_agg(semester)
    FROM
    (
    SELECT
    student, semester,
    count(*) OVER w,
    row_number() OVER w
    FROM
    assoc
    WINDOW
    w AS (PARTITION BY student ORDER BY semester ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) ss
    WHERE
    count >= 3 AND
    row_number <= 3
    GROUP BY
    student
    ;

    While using a CTE is nicer to look at, in my experience it's almost always slower because it always has to be materialized.

  3. Isn’t the requirement it be in 3 separate columns rather than an array? Also do they have to have at least 3 semesters? I’m assuming they do since that seems to be a requirement in everyones query.

    I would go with a window function if available (if not probably a self-join — eg. 8.3) , but instead I would use lead or lag functions to split into 3 columns.

    So something like:

    SELECT student, sem1,sem2, sem3
    FROM
    (SELECT student, row_number() OVER w As rn,
    lag(semester, 2) OVER w AS sem1,
    lag(semester, 1) OVER w AS sem2, semester As sem3
    FROM assoc
    WINDOW
    w AS (PARTITION BY student ORDER BY semester ) ) As foo
    WHERE rn = 3;

  4. @ Regina

    I was trying to convert the query as closely as possible. An array is easy to convert into columns with a subquery, but if columns are desired, your query is better.

  5. hm… and what is the reason for running a separate HAVING count(*) > 2 query and then joining it?

    Why not to put the HAVING count(*) > 2 in the main query? Your approach would make sence only (if at all) there are way too many groups with count(*) <= 2…

  6. @seb – I thought about using CTE, but I had to implement the first version in 8.3, which doesn’t have them. Thanks for the query!

  7. @Marko: Thanks! Again – thought about windowing functions, but didn’t have them in 8.3. Great example. I will write all of these up on the PostgreSQL wiki as that seems like a nicer place for this than in my blog. I’ll add them to a cleaned up version of my script too.

  8. @Regina: Thanks for that example. It’s true – the first requirement was given to me as separate columns, and as I was talking with my friend, we started just using an array. Sorry for the discontinuity! Thanks for the example.

    And Re: requiring at least three – I left that in, but later the customer decided that they could accept students with less than three semesters as well.

  9. In that case I guess I would change my version to

    WHERE rn = 1

    and use semester As sem1, lead(semester,1) As sem2, lead(semester,2) As sem3

    I’m not sure how the performance would be for doing a self-join in to simulate the window approach compared to your original — from what I recall I think the performance is better (but varies depending on num records per student) .. The 8.3 compatible approach would look something like below:

    So it would be something of the form

    SELECT student, MAX(CASE WHEN rnum =1 THEN semester ELSE NULL END) As sem1, .. MAX(…rnum = ..n) As semn
    (SELECT a.student, a.semester, COUNT(*) As rnum
    assoc As a INNER JOIN assoc As b
    ON (a.student = b.student AND a.semester >= b.semester)
    GROUP BY a.student, a.semester
    HAVING COUNT(*) <= 3) As foo

    GROUP BY student
    :)

  10. I think, second SELECT query is conceptually wrong. It can work for a while, but break once the planner will change.
    One can only hope that
    SELECT some_aggregate(f) FROM (SELECT f FROM t ORDER BY f) AS sub;
    would call aggregate function in order of f, but it’s up to current planner/executor implementation and actually there is no guarantee on it. For your SELECT with IN/JOIN it is allowed and can choose any join method which does not preserve order.