JOINing against VIEWs can be harmful

I had a recent code review problem that was very curious at first glance, but came down to the use of complex VIEW in an even more complicated and frequently used reporting query.

I’ll just paste a edited version of the review below.

tl;dr: Don’t use product_info (a view, not a table) in this query, move WHERE clauses for product_name and version_string into the infos CTE, strictly limit the number of columns in tables being joined

This query is unfortunately doomed because it is using product_info — a view which already contains data from product_versions. There are four other tables which we don’t care about for the query that are included in the view.

As a result, you get a self-join many times over. A hint at the horrors of what Postgres decides to do with this is here:

Unique  (cost=10248.32..10248.35 rows=1 width=294)
   CTE infos
     ->  Hash Right Join  (cost=301.82..1683.83 rows=40195 width=96)
           Hash Cond: (pvb.product_version_id = pv.product_version_id)
           ->  Seq Scan on product_version_builds pvb  (cost=0.00..768.71 rows=42271 width=16)
           ->  Hash  (cost=282.46..282.46 rows=1549 width=84)
                 ->  Hash Right Join  (cost=218.53..282.46 rows=1549 width=84)
                       Hash Cond: (pv.product_version_id = pi.product_version_id)
                       ->  Seq Scan on product_versions pv  (cost=0.00..40.29 rows=1629 width=35)
                       ->  Hash  (cost=199.17..199.17 rows=1549 width=53)
                             ->  Subquery Scan on pi  (cost=179.81..199.17 rows=1549 width=53)
                                   ->  Sort  (cost=179.81..183.68 rows=1549 width=62)
                                         Sort Key: product_versions.product_name, product_versions.version_string
                                         ->  Hash Join  (cost=5.70..97.73 rows=1549 width=62)
                                               Hash Cond: ((product_versions.product_name = product_release_channels.product_name) AND (product_versions.build_type = product_release_channels.release_channel))
                                               ->  Seq Scan on product_versions  (cost=0.00..40.29 rows=1629 width=52)
                                               ->  Hash  (cost=5.03..5.03 rows=45 width=42)
                                                     ->  Hash Join  (cost=2.34..5.03 rows=45 width=42)
                                                           Hash Cond: (product_release_channels.release_channel = release_channels.release_channel)
                                                           ->  Hash Join  (cost=1.23..3.29 rows=45 width=34)
                                                                 Hash Cond: (product_release_channels.product_name = products.product_name)
                                                                 ->  Seq Scan on product_release_channels  (cost=0.00..1.45 rows=45 width=22)
                                                                 ->  Hash  (cost=1.10..1.10 rows=10 width=12)
                                                                       ->  Seq Scan on products  (cost=0.00..1.10 rows=10 width=12)
                                                           ->  Hash  (cost=1.05..1.05 rows=5 width=8)
                                                                 ->  Seq Scan on release_channels  (cost=0.00..1.05 rows=5 width=8)

Whenever you see so many nested joins, subquery sorts and sequence scans mushed together in a staircase, that’s a signal that we should investigate whether the query we’re running is really what we thought it was.

While @peterbe dug through code with me, he mentioned that product_info was a view! Now all the self-JOINs made sense and I started refactoring.

The product_info view was being deconstructed into it’s component parts, which already included product_versions (resulting in a self-join) and including a bunch of junk that for the purposes of this query, we don’t really care about. So, as the first step, I just made a copy of the SELECT query from the view (you can get that by running \d+ product_info in psql or you can dig it out of the socorro/external/postgresql/procs/views section of our code.

Here’s my proposal for what should go into infos:

         SELECT 
                product_versions.product_version_id
                , product_versions.version_string
                , 'new'::text AS which_table
                , product_versions.product_name
                , product_versions.release_version
                , product_versions.build_type
                , product_version_builds.build_id
                , product_versions.is_rapid_beta
                , product_versions.rapid_beta_id
                , product_versions.version_sort
        FROM product_versions
                LEFT JOIN product_version_builds USING (product_version_id)
        WHERE  %(product name and versions)s

We really need to move the product name and version filtering to this portion of the query because otherwise we end up doing a horrible self join on a 42,000 row table! :watch:

Here’s what the self-join looks like in the EXPLAIN:

   ->  Sort  (cost=8564.48..8564.49 rows=1 width=294)
         Sort Key: i1.version_sort, i1.product_version_id, i1.product_name, i1.version_string, i1.which_table, i1.release_version, i1.build_type, i1.build_id, i1.is_rapid_beta, i2.is_rapid_beta, ((((i2.product_nam
e)::text || ':'::text) || (i2.version_string)::text))
         ->  Merge Join  (cost=7755.52..8564.47 rows=1 width=294)
               Merge Cond: ((i1.product_name = i2.product_name) AND (i1.release_version = i2.release_version) AND (i1.build_type = i2.build_type))
               Join Filter: (((i1.product_name = 'Firefox'::citext) AND (i1.version_string = '26.0a2'::citext) AND (i1.version_string = i2.version_string)) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.pr
oduct_name = 'Firefox'::citext) AND (i2.version_string = '26.0a2'::citext) AND (i2.is_rapid_beta IS TRUE)))
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=233)
                     Sort Key: i1.product_name, i1.release_version, i1.build_type
                     ->  CTE Scan on infos i1  (cost=0.00..803.90 rows=40195 width=233)
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=133)
                     Sort Key: i2.product_name, i2.release_version, i2.build_type
                     ->  CTE Scan on infos i2  (cost=0.00..803.90 rows=40195 width=133)

sad_kitten

This is pretty sad. The Sort at the top of Mt. Sadness. There are a series of sorts further down that are just HUGE because we’re tossing 45k records that must be joined to each other, and the width of the query is 294 — 294 columns in addition to our 45k rows.

The obvious (but sadly not always effective) thing to try is to see if we can filter our rows out earlier. Because we’re using infos, conveniently, that looks possible without too much trouble.

That just leaves sorting out the rapid beta self-join, which based on my tests should be pretty easy to continue to do in the body of the main SELECT, at line 125.

With the changes I proposed, the estimated duration of this query is ~200 ms in stage and the query plan looks like:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=37.07..37.08 rows=1 width=294) (actual time=221.131..221.149 rows=31 loops=1)
   CTE infos
     ->  Nested Loop Left Join  (cost=0.00..35.18 rows=26 width=64) (actual time=0.136..0.459 rows=150 loops=1)
           ->  Index Scan using product_version_version_key on product_versions  (cost=0.00..7.27 rows=1 width=52) (actual time=0.111..0.112 rows=1 loops=1)
                 Index Cond: ((product_name = 'Firefox'::citext) AND (version_string = '26.0a2'::citext))
           ->  Index Only Scan using product_version_builds_key on product_version_builds  (cost=0.00..27.58 rows=33 width=16) (actual time=0.019..0.268 rows=150 loops=1)
                 Index Cond: (product_version_id = product_versions.product_version_id)
                 Heap Fetches: 150
   ->  Hash Join  (cost=0.84..1.86 rows=1 width=294) (actual time=0.943..47.334 rows=22500 loops=1)
         Hash Cond: (i1.product_version_id = i2.product_version_id)
         Join Filter: ((i1.version_string = i2.version_string) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.is_rapid_beta IS TRUE)))
         ->  CTE Scan on infos i1  (cost=0.00..0.52 rows=26 width=233) (actual time=0.141..0.236 rows=150 loops=1)
         ->  Hash  (cost=0.52..0.52 rows=26 width=69) (actual time=0.778..0.778 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  CTE Scan on infos i2  (cost=0.00..0.52 rows=26 width=69) (actual time=0.002..0.664 rows=150 loops=1)
 Total runtime: 221.321 ms
(16 rows)
  • Christopher Browne

    My favorite ever was a view that unioned together 9 “rotor” tables, and we then had a query that did a self join, turning into an 81-way join :-) .

  • Josh Berkus

    Selena, FWIW, product_info was supposed to be a transitional view which covered the period where we moved from Socorro 1 to Socorro 2, and our whole way of looking at products changed, but we needed to support legacy UI code. So it was designed to be backwards-compatible at the cost of efficiency.

    • http://chesnok.com/daily Selena Deckelmann

      Sure. That the view was mistaken for a JOIN-able table was a common-enough performance problem, I figured it would be interesting to share how to diagnose the problem. Esp given that my review of the patch ended up being blog length itself.

      • Josh Berkus

        Yeah, I just feel defensive whenever you trash my code in public. Even if nobody else knows that it’s my code. ;-)

        • http://chesnok.com/daily Selena Deckelmann

          DUDE. LOL — please explain where I trashed your code.

          If anything, I thought I might have trashed the developers’ code I was reviewing, who used the view in the new query. But he was happy with the review because it solved the problem, and it was bonus to him that I explained why. :D

          • Josh Berkus

            Oh, it’s just me. Because I know how horrible that view is.

  • http://www.ledgersmb.org/ Chris Travers

    I think the key in querying with views generally is to be aware of what you are doing. I think it’s fine to join against views as long as the views are built for that. In fact, fairly frequently I build views specifically for managing joins. The problem, as you put it, is that views can encapsulate a lot of logic and if you aren’t clear both regarding what a view was designed to do, and what the view is doing, you can dig a deep hole very fast….

    The one time I was badly bitten by this was when I used a created a view which was a union of four other views, only to find out that each one was doing a separate sequential scan on the same large table (pulling, between them, every record but in distinct subsets).

    • http://chesnok.com/daily Selena Deckelmann

      Yeah, I agree. Our schema is now mostly defined in a model, but most of the views, all the stored procs and DOMAINs are in raw SQL. This situation highlights the difficulties developers face when they need to dive into areas of the code they don’t normally touch. We’re using raw SQL for our backend API, so there’s not a lot of barriers to using views and tables interchangeably. Also, not many clues that DANGER LURKS. :)

      I’ve been thinking about what I can do to help either document or make the schema more friendly in this regard. I know I can’t make every situation like this go away, but it seems as though I could probably do some useful things.