{"id":5046,"date":"2013-11-04T14:13:59","date_gmt":"2013-11-04T22:13:59","guid":{"rendered":"http:\/\/www.chesnok.com\/daily\/?p=5046"},"modified":"2013-11-05T07:41:37","modified_gmt":"2013-11-05T15:41:37","slug":"joining-against-views-can-be-harmful","status":"publish","type":"post","link":"https:\/\/www.chesnok.com\/daily\/2013\/11\/04\/joining-against-views-can-be-harmful\/","title":{"rendered":"JOINing against VIEWs can be harmful"},"content":{"rendered":"<p>I had a recent code review problem that was very curious at first glance, but came down to the use of complex VIEW in an even more complicated and frequently used reporting query.<\/p>\n<p>I&#8217;ll just paste <a href=\"https:\/\/github.com\/mozilla\/socorro\/pull\/1613#discussion_r7203097\">a edited version of the review below<\/a>.<\/p>\n<p>tl;dr: Don&#8217;t use <code>product_info<\/code> (a view, not a table) in this query, move WHERE clauses for product_name and version_string into the <code>infos<\/code> CTE, strictly limit the number of columns in tables being joined<\/p>\n<p>This query is unfortunately doomed because it is using <code>product_info<\/code> &#8212; a view which already contains data from product_versions. There are four other tables which we don&#8217;t care about for the query that are included in the view.<\/p>\n<p>As a result, you get a self-join many times over. A hint at the horrors of what Postgres decides to do with this is here:<\/p>\n<pre><code>Unique  (cost=10248.32..10248.35 rows=1 width=294)\n   CTE infos\n     -&gt;  Hash Right Join  (cost=301.82..1683.83 rows=40195 width=96)\n           Hash Cond: (pvb.product_version_id = pv.product_version_id)\n           -&gt;  Seq Scan on product_version_builds pvb  (cost=0.00..768.71 rows=42271 width=16)\n           -&gt;  Hash  (cost=282.46..282.46 rows=1549 width=84)\n                 -&gt;  Hash Right Join  (cost=218.53..282.46 rows=1549 width=84)\n                       Hash Cond: (pv.product_version_id = pi.product_version_id)\n                       -&gt;  Seq Scan on product_versions pv  (cost=0.00..40.29 rows=1629 width=35)\n                       -&gt;  Hash  (cost=199.17..199.17 rows=1549 width=53)\n                             -&gt;  Subquery Scan on pi  (cost=179.81..199.17 rows=1549 width=53)\n                                   -&gt;  Sort  (cost=179.81..183.68 rows=1549 width=62)\n                                         Sort Key: product_versions.product_name, product_versions.version_string\n                                         -&gt;  Hash Join  (cost=5.70..97.73 rows=1549 width=62)\n                                               Hash Cond: ((product_versions.product_name = product_release_channels.product_name) AND (product_versions.build_type = product_release_channels.release_channel))\n                                               -&gt;  Seq Scan on product_versions  (cost=0.00..40.29 rows=1629 width=52)\n                                               -&gt;  Hash  (cost=5.03..5.03 rows=45 width=42)\n                                                     -&gt;  Hash Join  (cost=2.34..5.03 rows=45 width=42)\n                                                           Hash Cond: (product_release_channels.release_channel = release_channels.release_channel)\n                                                           -&gt;  Hash Join  (cost=1.23..3.29 rows=45 width=34)\n                                                                 Hash Cond: (product_release_channels.product_name = products.product_name)\n                                                                 -&gt;  Seq Scan on product_release_channels  (cost=0.00..1.45 rows=45 width=22)\n                                                                 -&gt;  Hash  (cost=1.10..1.10 rows=10 width=12)\n                                                                       -&gt;  Seq Scan on products  (cost=0.00..1.10 rows=10 width=12)\n                                                           -&gt;  Hash  (cost=1.05..1.05 rows=5 width=8)\n                                                                 -&gt;  Seq Scan on release_channels  (cost=0.00..1.05 rows=5 width=8)\n<\/code><\/pre>\n<p>Whenever you see so many nested joins, subquery sorts and sequence scans mushed together in a staircase, that&#8217;s a signal that we should investigate whether the query we&#8217;re running is really what we thought it was.<\/p>\n<p>While @peterbe dug through code with me, he mentioned that <code>product_info<\/code> was a view! Now all the self-JOINs made sense and I started refactoring.<\/p>\n<p>The <code>product_info<\/code> view was being deconstructed into it&#8217;s component parts, which already included product_versions (resulting in a self-join) and including a bunch of junk that for the purposes of this query, we don&#8217;t really care about. So, as the first step, I just made a copy of the SELECT query from the view (you can get that by running <code>\\d+ product_info<\/code> in <code>psql<\/code> or you can dig it out of the <code>socorro\/external\/postgresql\/procs\/views<\/code> section of our code.<\/p>\n<p>Here&#8217;s my proposal for what should go into <code>infos<\/code>:<\/p>\n<pre><code>         SELECT \n                product_versions.product_version_id\n                , product_versions.version_string\n                , 'new'::text AS which_table\n                , product_versions.product_name\n                , product_versions.release_version\n                , product_versions.build_type\n                , product_version_builds.build_id\n                , product_versions.is_rapid_beta\n                , product_versions.rapid_beta_id\n                , product_versions.version_sort\n        FROM product_versions\n                LEFT JOIN product_version_builds USING (product_version_id)\n        WHERE  %(product name and versions)s\n<\/code><\/pre>\n<p>We really need to move the product name and version filtering to this portion of the query because otherwise we end up doing a horrible self join on a 42,000 row table! :watch:<\/p>\n<p>Here&#8217;s what the self-join looks like in the EXPLAIN:<\/p>\n<pre><code>   -&gt;  Sort  (cost=8564.48..8564.49 rows=1 width=294)\n         Sort Key: i1.version_sort, i1.product_version_id, i1.product_name, i1.version_string, i1.which_table, i1.release_version, i1.build_type, i1.build_id, i1.is_rapid_beta, i2.is_rapid_beta, ((((i2.product_nam\ne)::text || ':'::text) || (i2.version_string)::text))\n         -&gt;  Merge Join  (cost=7755.52..8564.47 rows=1 width=294)\n               Merge Cond: ((i1.product_name = i2.product_name) AND (i1.release_version = i2.release_version) AND (i1.build_type = i2.build_type))\n               Join Filter: (((i1.product_name = 'Firefox'::citext) AND (i1.version_string = '26.0a2'::citext) AND (i1.version_string = i2.version_string)) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.pr\noduct_name = 'Firefox'::citext) AND (i2.version_string = '26.0a2'::citext) AND (i2.is_rapid_beta IS TRUE)))\n               -&gt;  Sort  (cost=3877.76..3978.25 rows=40195 width=233)\n                     Sort Key: i1.product_name, i1.release_version, i1.build_type\n                     -&gt;  CTE Scan on infos i1  (cost=0.00..803.90 rows=40195 width=233)\n               -&gt;  Sort  (cost=3877.76..3978.25 rows=40195 width=133)\n                     Sort Key: i2.product_name, i2.release_version, i2.build_type\n                     -&gt;  CTE Scan on infos i2  (cost=0.00..803.90 rows=40195 width=133)\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/f.cloud.github.com\/assets\/54803\/1403698\/ba78397a-3cf7-11e3-81d4-9d76ada8f290.jpg\" alt=\"sad_kitten\" \/><\/p>\n<p>This is pretty sad. The <em>Sort<\/em> at the top of Mt. Sadness. There are a series of sorts further down that are just HUGE because we&#8217;re tossing 45k records that must be joined to each other, and the width of the query is 294 &#8212; <em>294 columns<\/em> in addition to our 45k rows.<\/p>\n<p>The obvious (but sadly not always effective) thing to try is to see if we can filter our rows out earlier. Because we&#8217;re using <code>infos<\/code>, conveniently, that looks possible without too much trouble.<\/p>\n<p>That just leaves sorting out the rapid beta self-join, which based on my tests should be pretty easy to continue to do in the body of the main SELECT, at line 125.<\/p>\n<p>With the changes I proposed, the estimated duration of this query is ~200 ms in stage and the query plan looks like:<\/p>\n<pre><code>                                                                                QUERY PLAN                                                                                 \n---------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n HashAggregate  (cost=37.07..37.08 rows=1 width=294) (actual time=221.131..221.149 rows=31 loops=1)\n   CTE infos\n     -&gt;  Nested Loop Left Join  (cost=0.00..35.18 rows=26 width=64) (actual time=0.136..0.459 rows=150 loops=1)\n           -&gt;  Index Scan using product_version_version_key on product_versions  (cost=0.00..7.27 rows=1 width=52) (actual time=0.111..0.112 rows=1 loops=1)\n                 Index Cond: ((product_name = 'Firefox'::citext) AND (version_string = '26.0a2'::citext))\n           -&gt;  Index Only Scan using product_version_builds_key on product_version_builds  (cost=0.00..27.58 rows=33 width=16) (actual time=0.019..0.268 rows=150 loops=1)\n                 Index Cond: (product_version_id = product_versions.product_version_id)\n                 Heap Fetches: 150\n   -&gt;  Hash Join  (cost=0.84..1.86 rows=1 width=294) (actual time=0.943..47.334 rows=22500 loops=1)\n         Hash Cond: (i1.product_version_id = i2.product_version_id)\n         Join Filter: ((i1.version_string = i2.version_string) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.is_rapid_beta IS TRUE)))\n         -&gt;  CTE Scan on infos i1  (cost=0.00..0.52 rows=26 width=233) (actual time=0.141..0.236 rows=150 loops=1)\n         -&gt;  Hash  (cost=0.52..0.52 rows=26 width=69) (actual time=0.778..0.778 rows=150 loops=1)\n               Buckets: 1024  Batches: 1  Memory Usage: 8kB\n               -&gt;  CTE Scan on infos i2  (cost=0.00..0.52 rows=26 width=69) (actual time=0.002..0.664 rows=150 loops=1)\n Total runtime: 221.321 ms\n(16 rows)\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I had a recent code review problem that was very curious at first glance, but came down to the use of complex VIEW in an even more complicated and frequently used reporting query. I&#8217;ll just paste a edited version of &hellip; <a href=\"https:\/\/www.chesnok.com\/daily\/2013\/11\/04\/joining-against-views-can-be-harmful\/\">Continue reading &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[603,647,602],"class_list":["post-5046","post","type-post","status-publish","format-standard","hentry","category-postgresql","tag-ctes","tag-postgres","tag-query-optimization"],"_links":{"self":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/5046","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/comments?post=5046"}],"version-history":[{"count":4,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/5046\/revisions"}],"predecessor-version":[{"id":5055,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/posts\/5046\/revisions\/5055"}],"wp:attachment":[{"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/media?parent=5046"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/categories?post=5046"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.chesnok.com\/daily\/wp-json\/wp\/v2\/tags?post=5046"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}