Everyday Postgres: How I write queries using psql: Common Table Expressions

This this series of posts about using Postgres every day. The last post was about \ commands in psql.

I’m now going to share in a series of posts my workflow for writing queries, and some of the things about working with Postgres that I take for granted in writing queries.

Shortcuts I can’t live without

Three important shortcuts you should learn are:

  • \e: Pulls the last query you executed into a buffer in your favorite editor
  • \df+ [function]: This displays [function] information, and the + dumps the function itself to STDOUT
  • \ef [function]: This pulls [function] into a buffer in your favorite editor. This is the most convenient way to grab a copy of an individual function for me.
  • \ef: This opens your favorite editor and puts a template for a function (in any supported procedural language) in a buffer

I’ll talk about writing functions in a future post.

Thinking in CTEs

In searching through my recent psql history, I found quite a few WITH queries. These are Common Table Expressions, a useful feature supported by many databases that allows you to embed subqueries in your SQL in a very readable format. CTEs have a lot more interesting features and properties, like RECURSIVE.

However, I tend to just use CTEs as a more convenient form of a subquery. This allows me to break apart long queries into smaller, testable chunks. I usually will write a subquery so that it’s in my command history, generate some fake data for testing, and go back to that query in my history to test edge cases.

I iterate on the smaller tables until I have a set of understandable “paragraphs” of SQL. This makes it easier for me to explain the logic of the query with others, and makes testing each piece easier in the event that something breaks. Usually, when a CTE breaks, I’ve made an assumption about incoming data that’s incorrect.

The composability of SQL is often terrible. CTEs help break apart the complexity visually. There’s some warnings about CTEs not performing well under certain circumstances. My approach is to design with CTEs and optimize for performance only if needed.

Other advantages of CTEs

In case you’re not yet convinced CTEs are worth learning, I made a bullet list of advantages from some useful comments about how others are using CTEs:

  • Alternative to throwaway VIEWs and temporary tables when querying replicas (comment from bma)
  • Variable declaration – to emulate DECLARE in SQL Server, for example
  • Easier to understand queries and faster development time (ME)

An example of the kinds of queries I write

Something you’d see a lot in my command history are queries that look like this:

WITH crashes AS (                                                               
    SELECT uptime_string AS category                                                      
        , sum(report_count) AS report_count                                     
    FROM signature_summary_uptime                                               
        JOIN signatures USING (signature_id)                                       
    WHERE                                                                           
        signatures.signature = 'Fake Signature #1'                                             
        AND report_date >= '2013-08-05T00:00:00+00:00'::timestamptz             
        AND report_date < '2013-08-12T00:00:00+00:00'::timestamptz              
        AND product_name IN ('Firefox')  AND version_string IN ('1')            
    GROUP BY category                                                           
),                                                                              
totals AS (                                                                     
    SELECT                                                                      
        category                                                                
        , report_count                                                          
        , sum(report_count) OVER () as total_count                              
    FROM crashes                                                                
)                                                                               
SELECT category                                                                 
    , report_count                                                              
    , round((report_count * 100::numeric)/total_count,3)::TEXT                  
as percentage                                                                   
FROM totals                                                                     
ORDER BY report_count DESC                                                      
;

You’ll see that I have one or more WITH clauses, and then a query that performs a final summary query using the data from the CTEs.

This query probably was asked for something like this:

Please provide counts of crashes with the same uptime, for Firefox version 1, and the signature ‘Fake Signature #1′ for the last week, including a percentage of all of the sampled crashes.

While I’m sure there are better ways to write the query above, I wanted to show how I have made a pattern for myself to speed up query writing. I’m not always interested in the best possible query. Hopefully, the Postgres planner makes up for many of my sins as a developer!

What I am interested in is finding answers to problems quickly for my coworkers.

In answering the question I was asked, I first dig out an appropriate summary table (we have quite a few in Socorro). I found the signature_summary_uptime table, and fortunately it has product_name and version_string available in the table. I only need to join signatures to fulfill the request. (Yay for denormalized data that supports the kinds of queries we often run!)

Next, I see that I’m being asked for a total percentage, so I need to calculate a sum across all the rows that I retrieve. That can be very slow, so I create a second CTE that uses data from the first CTE (rather than doing two full table scans to calculate the total). I use a window function instead of SUM() here because I’ve done experiments to see which tends to be faster.

And, finally once I have all the data together, I run my final query using my two CTE tables.

How CTEs and breaking down this process have helped me

So, I’ve had about a year to practice. A query like this today takes me 10-15 minutes to assemble and test. They are typically slightly more complex — with more dependencies, and maybe 2-3 more tables involved in JOINs. But they follow the same basic pattern.

Most queries on my data sets conform to recognizable patterns.

After a few months, we recognized that moving JSON for crash data into Postgres also would be a win, and was easy to process using very similar queries.

That’s all helped make finding answers about Firefox crashes easier and faster!

8 thoughts on Everyday Postgres: How I write queries using psql: Common Table Expressions

Comments are closed.

  1. Thanks for sharing those useful tips, Selena. While I normally agree with principles about premature optimization and the root of all evil, I have run into CTE performance issues in my own and my colleague’s code so frequently that I no longer consider it premature to avoid them. Plus it just seems like such a shame to throw away so much of the higher intelligence of the query planner. At least until someone adds the optional capability to disable to the optimization fence and push down qualifiers. In the mean time, I’ve taken to advising everyone to use a throwaway view for each CTE as a similarly readable replacement. (By throw-away, I mean start transaction, create views, run query, then drop views or rollback.) It’s not quite as clear as a CTE, but at least it uses the full power of the Postgres query planner. –Daniel Browning

    • Performance issues aren’t typically a problem in my case.

      Despite your experience, I strongly recommend developers consider CTEs.

      The advantages imo are: development time and being able to quickly understand what a query is for. Those are huge advantages in an environment where most people aren’t well-versed in SQL. In my experience, that is most workplaces.

    • If you are querying hot standby systems “throwaway” views and temp tables are not eligible to be used, so CTEs, functions, and more complex SQL end up being your primary options.

  2. FWIW, I think it’s a myth that CTE-as-an-optimization-fence is a feature. It’s simply a limitation of the current implementation. I’ve even heard it said that this is required by the SQL standard several times. I don’t read the standard, but based on what little I know about it, I doubt this is true.

    I think that the sooner this is addressed the better, because they really are a huge boon for expressiveness.

  3. CTEs saved my day using it as a quick way for translating several DECLAREs from a SQL Server test script. Put another way, I’ve used CTEs as a form of variable declaration:

    with
    g1 as
    (select ST_GeomFromText(‘POLYGON(( … ))’) as geo1),
    gBox as
    (select ST_GeomFromText(‘POLYGON(( … ))’) as geoBox),
    gOk as
    (select ‘POLYGON(( … ))’ as geoOk),
    geoQuery as
    (select ST_Intersection(geo1, geoBox) as geoResult from g1, gBox)

    select ST_Astext(ST_Reverse(geoResult)) = geoOk::varchar as r2 from geoQuery, gOk;

  4. Pingback: Everyday Postgres: INSERT with SELECT