{"id":2059,"date":"2010-09-30T16:09:17","date_gmt":"2010-10-01T00:09:17","guid":{"rendered":"http:\/\/www.chesnok.com\/daily\/?p=2059"},"modified":"2012-03-26T02:51:55","modified_gmt":"2012-03-26T10:51:55","slug":"custom-aggregates-a-couple-tips-and-order-by-in-9-0","status":"publish","type":"post","link":"https:\/\/www.chesnok.com\/daily\/2010\/09\/30\/custom-aggregates-a-couple-tips-and-order-by-in-9-0\/","title":{"rendered":"Custom aggregates: a couple tips and ORDER BY in 9.0"},"content":{"rendered":"<p>A friend asked about a way to report the first three semesters that a group of students were documented as being present, and report those values each in a column.  <\/p>\n<p>The tricky thing is that the semesters students attend are rarely the same. I started out with a very naive query (and sorry for the bad formatting that follows.. i need to find some good SQL formatting markup) just to get some initial results: <\/p>\n<p><code><br \/>\nselect student,<br \/>\n(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1) as sem1,<br \/>\n(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 1) as sem2,<br \/>\n(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 2) as sem3<br \/>\nFROM assoc a1<br \/>\nWHERE<br \/>\nstudent IN ( select student from assoc group by student HAVING count(*) > 2)<br \/>\nGROUP BY student;<br \/>\n<\/code><\/p>\n<p>That query pretty much sucks, requiring five sequential scans of &#8216;assoc&#8217;:<\/p>\n<pre>\r\n                                     QUERY PLAN                                     \r\n HashAggregate  (cost=3913.13..315256.94 rows=78 width=2)\r\n   ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=2)\r\n         Hash Cond: (a1.student = assoc.student)\r\n         ->  Seq Scan on assoc a1  (cost=0.00..1126.17 rows=78017 width=2)\r\n         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)\r\n               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)\r\n                     Filter: (count(*) > 2)\r\n                     ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)\r\n   SubPlan 1\r\n     ->  Limit  (cost=1326.21..1326.22 rows=1 width=3)\r\n           ->  Sort  (cost=1326.21..1328.71 rows=1000 width=3)\r\n                 Sort Key: a2.semester\r\n                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)\r\n                       Filter: (student = a1.student)\r\n   SubPlan 2\r\n     ->  Limit  (cost=1331.22..1331.22 rows=1 width=3)\r\n           ->  Sort  (cost=1331.21..1333.71 rows=1000 width=3)\r\n                 Sort Key: a2.semester\r\n                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)\r\n                       Filter: (student = a1.student)\r\n   SubPlan 3\r\n     ->  Limit  (cost=1334.14..1334.14 rows=1 width=3)\r\n           ->  Sort  (cost=1334.14..1336.64 rows=1000 width=3)\r\n                 Sort Key: a2.semester\r\n                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)\r\n                       Filter: (student = a1.student)\r\n<\/pre>\n<p>So, he reminded me about custom aggregates!  I did a little searching and found an example function that I added an extra CASE statement that stops the aggregate from adding more than three items to the array returned:<\/p>\n<p><code><br \/>\nCREATE FUNCTION array_append_not_null(anyarray,anyelement)<br \/>\nRETURNS anyarray<br \/>\n  AS '<br \/>\nSELECT CASE WHEN $2 IS NULL THEN $1 WHEN array_upper($1, 1) > 2 THEN $1 ELSE array_append($1,$2) END<br \/>\n'<br \/>\nLANGUAGE sql IMMUTABLE RETURNS NULL ON NULL INPUT;<br \/>\n<\/code><\/p>\n<p>And finally, I declared an aggregate:<\/p>\n<p><code><br \/>\nCREATE AGGREGATE three_semesters_not_null (<br \/>\n  sfunc = array_append_not_null,<br \/>\n  basetype = anyelement,<br \/>\n  stype = anyarray,<br \/>\n  initcond = '{}'<br \/>\n);<br \/>\n<\/code><\/p>\n<p>One problem though &#8211; we want the array returned to be only the <b>first<\/b> three semesters, rather than any three semesters a student has a record for. Meaning, we need to sort the information passed to the aggregate function. We could do this inside the aggregate itself (bubble sort, anyone?) or we can presort the input!  I chose presorting, to avoid writing a real ugly case statement.<\/p>\n<p>My query (compatible with 8.3 or higher): <\/p>\n<p><code><br \/>\nSELECT sorted.student, three_semesters_not_null(sorted.semester)<br \/>\n\tFROM (SELECT student, semester from assoc order by semester ) as sorted<br \/>\nWHERE<br \/>\n\tsorted.student IN (select a.student from assoc a group by a.student HAVING count(*) > 2)<br \/>\nGROUP BY sorted.student;<br \/>\n<\/code><\/p>\n<p>Which yields the much nicer query plan, requiring just two sequential scans: <\/p>\n<pre>\r\n                                      QUERY PLAN                                      \r\n HashAggregate  (cost=11722.96..11725.46 rows=200 width=64)\r\n   ->  Hash Semi Join  (cost=10052.32..11570.82 rows=30427 width=64)\r\n         Hash Cond: (assoc.student = a.student)\r\n         ->  Sort  (cost=8533.14..8728.18 rows=78017 width=5)\r\n               Sort Key: assoc.semester\r\n               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)\r\n         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)\r\n               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)\r\n                     Filter: (count(*) > 2)\r\n                     ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=2)\r\n\r\n<\/pre>\n<p>I ran my queries by Magnus, and he reminded me that what I really needed was ORDER BY in my aggregate!  Fortunately, 9.0 has exactly this feature: <\/p>\n<p><code><br \/>\nSELECT student,<br \/>\n       three_semesters_not_null(semester order by semester asc ) as first_three_semesters<br \/>\nFROM assoc<br \/>\nWHERE student IN (select student from assoc group by student HAVING count(*) > 2)<br \/>\n\tGROUP BY student;<br \/>\n<\/code><\/p>\n<p>Which results in the following plan:<\/p>\n<pre>\r\n                                        QUERY PLAN                                        \r\n GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)\r\n   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)\r\n         Sort Key: public.assoc.student\r\n         ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=5)\r\n               Hash Cond: (public.assoc.student = public.assoc.student)\r\n               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)\r\n               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)\r\n                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)\r\n                           Filter: (count(*) > 2)\r\n                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)\r\n\r\n<\/pre>\n<p>A final alternative would be to transform the IN query into a JOIN:<\/p>\n<p><code><br \/>\nSELECT a.student,<br \/>\n\tthree_semesters_not_null(a.semester order by a.semester asc ) as first_three_semesters<br \/>\nFROM assoc a<br \/>\n\tJOIN (select student from assoc group by student HAVING count(*) > 2) as b ON b.student = a.student<br \/>\nGROUP BY a.student;<br \/>\n<\/code><\/p>\n<p>And the plan isn&#8217;t much different:<\/p>\n<pre>\r\n                                        QUERY PLAN                                        \r\n GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)\r\n   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)\r\n         Sort Key: a.student\r\n         ->  Hash Join  (cost=1519.18..3718.08 rows=78017 width=5)\r\n               Hash Cond: (a.student = assoc.student)\r\n               ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=5)\r\n               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)\r\n                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)\r\n                           Filter: (count(*) > 2)\r\n                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)\r\n\r\n<\/pre>\n<p>Any other suggestions for this type of query? <\/p>\n<p>I&#8217;ve attached the file I was using to test this out.<br \/>\n<a href='http:\/\/www.chesnok.com\/daily\/wp-content\/uploads\/2010\/09\/custom_aggregates.sql_1.txt'>custom_aggregates.sql<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A friend asked about a way to report the first three semesters that a group of students were documented as being present, and report those values each in a column. The tricky thing is that the semesters students attend are &hellip; <a href=\"https:\/\/www.chesnok.com\/daily\/2010\/09\/30\/custom-aggregates-a-couple-tips-and-order-by-in-9-0\/\">Continue reading &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[97,9],"tags":[328,358,360,359,647,624],"class_list":["post-2059","post","type-post","status-publish","format-standard","hentry","category-postgres","category-postgresql","tag-9-0","tag-custom-aggregates","tag-features","tag-order-by","tag-postgres","tag-postgresql"],"_links":{"self":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/2059","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/comments?post=2059"}],"version-history":[{"count":10,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/2059\/revisions"}],"predecessor-version":[{"id":2084,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/2059\/revisions\/2084"}],"wp:attachment":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/media?parent=2059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/categories?post=2059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/tags?post=2059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}