JOINing against VIEWs can be harmful

I had a recent code review problem that was very curious at first glance, but came down to the use of complex VIEW in an even more complicated and frequently used reporting query.

I’ll just paste a edited version of the review below.

tl;dr: Don’t use product_info (a view, not a table) in this query, move WHERE clauses for product_name and version_string into the infos CTE, strictly limit the number of columns in tables being joined

This query is unfortunately doomed because it is using product_info — a view which already contains data from product_versions. There are four other tables which we don’t care about for the query that are included in the view.

As a result, you get a self-join many times over. A hint at the horrors of what Postgres decides to do with this is here:

Unique  (cost=10248.32..10248.35 rows=1 width=294)
   CTE infos
     ->  Hash Right Join  (cost=301.82..1683.83 rows=40195 width=96)
           Hash Cond: (pvb.product_version_id = pv.product_version_id)
           ->  Seq Scan on product_version_builds pvb  (cost=0.00..768.71 rows=42271 width=16)
           ->  Hash  (cost=282.46..282.46 rows=1549 width=84)
                 ->  Hash Right Join  (cost=218.53..282.46 rows=1549 width=84)
                       Hash Cond: (pv.product_version_id = pi.product_version_id)
                       ->  Seq Scan on product_versions pv  (cost=0.00..40.29 rows=1629 width=35)
                       ->  Hash  (cost=199.17..199.17 rows=1549 width=53)
                             ->  Subquery Scan on pi  (cost=179.81..199.17 rows=1549 width=53)
                                   ->  Sort  (cost=179.81..183.68 rows=1549 width=62)
                                         Sort Key: product_versions.product_name, product_versions.version_string
                                         ->  Hash Join  (cost=5.70..97.73 rows=1549 width=62)
                                               Hash Cond: ((product_versions.product_name = product_release_channels.product_name) AND (product_versions.build_type = product_release_channels.release_channel))
                                               ->  Seq Scan on product_versions  (cost=0.00..40.29 rows=1629 width=52)
                                               ->  Hash  (cost=5.03..5.03 rows=45 width=42)
                                                     ->  Hash Join  (cost=2.34..5.03 rows=45 width=42)
                                                           Hash Cond: (product_release_channels.release_channel = release_channels.release_channel)
                                                           ->  Hash Join  (cost=1.23..3.29 rows=45 width=34)
                                                                 Hash Cond: (product_release_channels.product_name = products.product_name)
                                                                 ->  Seq Scan on product_release_channels  (cost=0.00..1.45 rows=45 width=22)
                                                                 ->  Hash  (cost=1.10..1.10 rows=10 width=12)
                                                                       ->  Seq Scan on products  (cost=0.00..1.10 rows=10 width=12)
                                                           ->  Hash  (cost=1.05..1.05 rows=5 width=8)
                                                                 ->  Seq Scan on release_channels  (cost=0.00..1.05 rows=5 width=8)

Whenever you see so many nested joins, subquery sorts and sequence scans mushed together in a staircase, that’s a signal that we should investigate whether the query we’re running is really what we thought it was.

While @peterbe dug through code with me, he mentioned that product_info was a view! Now all the self-JOINs made sense and I started refactoring.

The product_info view was being deconstructed into it’s component parts, which already included product_versions (resulting in a self-join) and including a bunch of junk that for the purposes of this query, we don’t really care about. So, as the first step, I just made a copy of the SELECT query from the view (you can get that by running \d+ product_info in psql or you can dig it out of the socorro/external/postgresql/procs/views section of our code.

Here’s my proposal for what should go into infos:

         SELECT 
                product_versions.product_version_id
                , product_versions.version_string
                , 'new'::text AS which_table
                , product_versions.product_name
                , product_versions.release_version
                , product_versions.build_type
                , product_version_builds.build_id
                , product_versions.is_rapid_beta
                , product_versions.rapid_beta_id
                , product_versions.version_sort
        FROM product_versions
                LEFT JOIN product_version_builds USING (product_version_id)
        WHERE  %(product name and versions)s

We really need to move the product name and version filtering to this portion of the query because otherwise we end up doing a horrible self join on a 42,000 row table! :watch:

Here’s what the self-join looks like in the EXPLAIN:

   ->  Sort  (cost=8564.48..8564.49 rows=1 width=294)
         Sort Key: i1.version_sort, i1.product_version_id, i1.product_name, i1.version_string, i1.which_table, i1.release_version, i1.build_type, i1.build_id, i1.is_rapid_beta, i2.is_rapid_beta, ((((i2.product_nam
e)::text || ':'::text) || (i2.version_string)::text))
         ->  Merge Join  (cost=7755.52..8564.47 rows=1 width=294)
               Merge Cond: ((i1.product_name = i2.product_name) AND (i1.release_version = i2.release_version) AND (i1.build_type = i2.build_type))
               Join Filter: (((i1.product_name = 'Firefox'::citext) AND (i1.version_string = '26.0a2'::citext) AND (i1.version_string = i2.version_string)) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.pr
oduct_name = 'Firefox'::citext) AND (i2.version_string = '26.0a2'::citext) AND (i2.is_rapid_beta IS TRUE)))
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=233)
                     Sort Key: i1.product_name, i1.release_version, i1.build_type
                     ->  CTE Scan on infos i1  (cost=0.00..803.90 rows=40195 width=233)
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=133)
                     Sort Key: i2.product_name, i2.release_version, i2.build_type
                     ->  CTE Scan on infos i2  (cost=0.00..803.90 rows=40195 width=133)

sad_kitten

This is pretty sad. The Sort at the top of Mt. Sadness. There are a series of sorts further down that are just HUGE because we’re tossing 45k records that must be joined to each other, and the width of the query is 294 — 294 columns in addition to our 45k rows.

The obvious (but sadly not always effective) thing to try is to see if we can filter our rows out earlier. Because we’re using infos, conveniently, that looks possible without too much trouble.

That just leaves sorting out the rapid beta self-join, which based on my tests should be pretty easy to continue to do in the body of the main SELECT, at line 125.

With the changes I proposed, the estimated duration of this query is ~200 ms in stage and the query plan looks like:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=37.07..37.08 rows=1 width=294) (actual time=221.131..221.149 rows=31 loops=1)
   CTE infos
     ->  Nested Loop Left Join  (cost=0.00..35.18 rows=26 width=64) (actual time=0.136..0.459 rows=150 loops=1)
           ->  Index Scan using product_version_version_key on product_versions  (cost=0.00..7.27 rows=1 width=52) (actual time=0.111..0.112 rows=1 loops=1)
                 Index Cond: ((product_name = 'Firefox'::citext) AND (version_string = '26.0a2'::citext))
           ->  Index Only Scan using product_version_builds_key on product_version_builds  (cost=0.00..27.58 rows=33 width=16) (actual time=0.019..0.268 rows=150 loops=1)
                 Index Cond: (product_version_id = product_versions.product_version_id)
                 Heap Fetches: 150
   ->  Hash Join  (cost=0.84..1.86 rows=1 width=294) (actual time=0.943..47.334 rows=22500 loops=1)
         Hash Cond: (i1.product_version_id = i2.product_version_id)
         Join Filter: ((i1.version_string = i2.version_string) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.is_rapid_beta IS TRUE)))
         ->  CTE Scan on infos i1  (cost=0.00..0.52 rows=26 width=233) (actual time=0.141..0.236 rows=150 loops=1)
         ->  Hash  (cost=0.52..0.52 rows=26 width=69) (actual time=0.778..0.778 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  CTE Scan on infos i2  (cost=0.00..0.52 rows=26 width=69) (actual time=0.002..0.664 rows=150 loops=1)
 Total runtime: 221.321 ms
(16 rows)

VPN Problems and Ubuntu: killing off the dnsmasq zombie

I’ve been having problems with VPN, DNS and Ubuntu for a year. But, I’m also pretty lazy when it comes to spending time on configuration. And configuring VPNs is like last on my list of ways I’d like to spend my time.

In short, I’d rather reboot than figure out exactly why my networking just stopped working.

REBOOT.

Fortunately, I had an easy (for me) work-around for most of my VPN needs: use SSH and a jump-host for getting to servers. I found it annoying when I wanted to look at a website on protected network space, or had a service on an unusual port that I wanted to test things against. I would work around with SSH tunnels, or I would fire up my Mac, whose VPN settings worked flawlessly.

That all said, I thought today, a sunny, lovely fall day in Portland, I would fix my VPN.

And so, my buddy @uberj_ helped me get things sorted.

The root cause of all my VPN heartache was the dnsmasq daemon controlling my DNS. And, related, network-manager. There are a few places that document exactly how to disable dnsmasq

  • DNS in Ubuntu 12.04 http://www.stgraber.org/2012/02/24/dns-in-ubuntu-12-04/
  • Disabling dnsmasq as your local DNS server in Ubuntu http://mark.orbum.net/2012/05/14/disabling-dnsmasq-as-your-local-dns-server-in-ubuntu/

However, they leave out one important step: killing off the existing dnsmasq process. For the unlucky, restarting network-manager does not kill off dnsmasq.

So, to find and kill dnsmasq, do the following:

 sudo service network-manager stop
 kill `ps -C dnsmasq -o pid=`
 sudo service network-manager start

Then, start your VPN and check out the contents of the /etc/resolv.conf. If all went well, you’ve got nameserver addresses other than 127.0.0.1 in the file.

Yay!

Sadly, this was not the end of my story.

After a few minutes, NetworkManager started dnsmasq up again!

Zombie dnsmasq

So, like any reasonable sysadmin, I opened up the /etc/NetworkManager/NetworkManager.conf file, uncommented the dns=dnsmasq line, and replaced it with dns=/dev/null. My guess was that you can probably put just about anything other than dnsmasq into that line to permanently disable the plugin.

I ran sudo service network-manager restart, checked /etc/resolv.conf and felt pretty smug.

I tried also uninstalling dnsmasq-base package, but unfortunately that takes out a number of other packages I appear to need. So, I left /dev/null in my NetworkManager.conf, and updated this blog post.

But wait...

While editing this blog post, dnsmasq took over my DNS settings again.

A clue as to what was happening was in /var/log/syslog:

Oct 18 10:20:10 localhost dnsmasq[30535]: started, version 2.59 cache disabled
Oct 18 10:20:10 localhost dnsmasq[30535]: compile time options: IPv6 GNU-getopt DBus i18n DHCP TFTP conntrack IDN
Oct 18 10:20:10 localhost dnsmasq[30535]: DBus support enabled: connected to system bus
Oct 18 10:20:10 localhost dnsmasq[30535]: warning: no upstream servers configured

It turns out that dnsmasq was still getting revived by NetworkManager. Why NetworkManager doesn’t seem to care about configuration settings was beyond my willingness to investigate today. So, I did some more searching about truly killing of dnsmasq for good.

And I found this thread, and this sample configuration file. In the output for the dnsmasq process from ps:

nobody   30777 30759  0 10:21 ?        00:00:00 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/var/run/sendsigs.omit.d/network-manager.dnsmasq.pid --listen-address=127.0.0.1 --conf-file=/var/run/nm-dns-dnsmasq.conf --cache-size=0 --proxy-dnssec --enable-dbus --conf-dir=/etc/NetworkManager/dnsmasq.d

I dug into the thread, and the suggestion was to set port=0 in the config. I created a file called custom in /etc/NetworkManager/dnsmasq.d. And ran sudo service network-manager restart.

And then I got this in my syslog:

Oct 18 10:21:10 localhost dnsmasq[30777]: started, version 2.59 DNS disabled

FINALLY.

FINALLY!

An experiment in attention

I’ve had reoccurring thoughts about attention and who I give mine to. In the last week, I’ve been mentioned in a couple “women in tech” twitter lists. This seems to happen about quarterly and someone will create a list of 50 or so, or maybe 100+ women on a list.

I spent a couple days looking through my 5k followers and assembled a list of the women and women’s groups who are following me. I probably missed a few, and I know I followed a few people who don’t consider themselves women. Sorry! Just let me know and I’ll add/drop as needed.

So, why bother with a list like this?

Before I created this list, I was following about 920 people. Which, in itself is a sort of ridiculous number. How could I possibly pay attention to that many people?

I really don’t, right? I just check into twitter, sample the firehose, and then step away for minutes, hours or days.

When I do sample the tweet stream, whose voices do I listen to? There are certainly a few close friends whose feeds I look at directly, and a few other people I’m interested in who I will catch up on a backlog. Otherwise, it’s whoever is the most vocal.

What I noticed is: most of the voices I hear from when I sample the feed are men. It wasn’t anywhere near balanced. That’s on social activity, technical rants, technical praise and blogging.

I’d like to be skewed toward women’s voices for a while. Particularly on tech issues. So, I just added about 450 women to my feed who were already following me.

My next step may be replacing my primary feed with this list I’ve made. I wrote a tool a while back to extract URLs and RSS feeds from my friend’s twitter profiles and feeds. The code isn’t awesome, just rough and practical. But you could do something similar using it. You need a set of read/write API keys (create an app, then make it read/write), but I did the “hard” work for you:

./opml.py --consumer_key [key here] --consumer_secret [secret here] --access_token [token here] --access_token_secret [token secret here]  --to_follow [file of twitter handles]

I left some crud in there that links the script directly to my account for the list. Sorry! If anyone actually wants to use this, I’ll clean it up. (just ping me on github)

Anyway, doing this kind of attention hacking for yourself isn’t hard. It is drudgery to go through all your followers and guess who is what gender. But it is interesting to spend a few hours contemplating what the people you’re giving your attention to have in common, and how you might hack it a bit to hear from different perspectives from time to time.

I’m turning comments off because I don’t care to hear from anyone who thinks I’m somehow being sexist by changing who I pay attention to. Cheer up, haters! I’m sure plenty of people are already paying attention to you.

Fancy SQL Monday: format() instead of quote_*()

In the comments, Isaac pointed out that using format() dramatically increases the readability of SQL. I liked the look of his query, so I dug a little deeper.

As of version 9.1 (first released in 2010), a new function is listed in Postgres’ built-in string function documentation:

format(formatstr text [, str “any” [, ...] ]): Format a string. This function is similar to the C function sprintf; but only the following conversion specifications are recognized: %s interpolates the corresponding argument as a string; %I escapes its argument as an SQL identifier; %L escapes its argument as an SQL literal; %% outputs a literal %. A conversion can reference an explicit parameter position by preceding the conversion specifier with n$, where n is the argument position.

We also have examples linked in the definition for various quoting strategies for dynamic SQL.

This is an example where the Postgres documentation probably should have reversed the order what is mentioned.

It turns out that format() makes it much easier to avoid using the quote_*() functions. The code looks a lot more like a python """ string (you can have arbitrary whitespace in there!), with flexible options for usage. The only feature missing is named parameters.

My application requires Postgres 9.2 at this point (for JSON datatype), so my plan is to refactor a few functions using format() instead of quote_ident() in particular.

Are there situations where you’d prefer to use quote_*() other than for backward compatibility? It seems as though format() is far safer, particularly for the quoting and nullable problems mentioned on the Quote Literal Example documentation.

Catalog SQL Friday: using DO when we’ve got tables as variables

Just a quick note about modifying constraints:

There’s no such thing as ALTER CONSTRAINT. So, if you want to safely change a CHECK constraint, like on a partition, you need to DROP and ADD it in a single transaction.

Below is a snippet for finding partitions, their CHECK constraints based on a WHERE clause. Then we DROP the existing constraint and add back the correct constraint. It doesn’t take much sleuthing to figure out what the problem was. :)

 DO $$
  DECLARE myrecord record;
  DECLARE theweek text;
  BEGIN
    FOR myrecord IN SELECT relname, conname from pg_constraint
      JOIN pg_class ON pg_constraint.conrelid = pg_class.oid
      WHERE consrc ~ 'without' and split_part(relname, '_201', 1)
      IN (select table_name from report_partition_info
      WHERE partition_column = 'date_processed') LIMIT 1
    LOOP
       EXECUTE 'ALTER TABLE ' || quote_ident(myrecord.relname)
       || ' DROP CONSTRAINT IF EXISTS '
       || quote_ident(myrecord.conname) || ';';</p>

   theweek = substring(myrecord.relname from '........$');

   EXECUTE 'ALTER TABLE ' || quote_ident(myrecord.relname)
   || ' ADD CONSTRAINT ' || quote_ident(myrecord.conname)
   || ' CHECK ((date_processed >= timestamptz('
   || quote_literal(to_char(date(theweek), 'YYYY-MM-DD')) || '))'
   || ' AND (date_processed < timestamptz('
   || quote_literal(to_char(date(theweek) + 7, 'YYYY-MM-DD'))
   || ')));';

   RAISE NOTICE 'DONE: %', myrecord.relname;
END LOOP;

END$$; 

Here’s the gist version for easier reading.

The couple things I learned in this process was a nice feature in substring() allowing me to return the date portion of my partition names easily, and split_part() which allowed me to return the parent table name and compare it to my list of partitionable tables for the specific partition column. I recently added support for partitioning on a different column for certain tables, so I have to differentiate for this fix. The string function docs are pretty great.

I didn’t do any optimization of this — just got it working and am now testing it in our stage environment. The final script is going to perform the changes on a month’s worth of partitions at a time to help reduce the chance of deadlocking.

If you have thoughts on how I could have done this more efficiently, let me know in the comments!

Why I gave $1024 to the Ada Initiative

AdaCamp participants

I’m one of the founding board members and advisors to the Ada Initiative, a 501(c)3 charitable non-profit that supports women in open technology and culture. This year, I am supporting the organization through a donation of $1024.

My reasons for supporting the Ada Initiative, advocating for their work, and now donating are many. I started out working with open source software in 1996, as a sysadmin setting up backups for a small research lab at the University of Oregon. I’d installed Linux from floppies in 1994, and surrounded myself with other open source advocates, sysadmins, hackers and finally, computer science students. Open source software has defined my career, and most of what I’ve chosen to do with my life.

In those few years in college, I met only two women who were interested in open source software – one that I brought with me from Chemistry into the Computer Science program and a woman who worked in Academic Services. I met no other professors, fellow students or friends online to collaborate or talk with.

All my mentors and friends were men. And, to be honest, I never thought much about that. Growing up, I tended to hang out with boys or spend a lot of time alone. As I started to think about computers and programming as fundamentally changing society, as tools that enabled and could dramatically change lives, I felt discomfort that only men seemed to know the things I knew.

A critically important aspect of being involved in the work that the Ada Initiative does is simply introducing women to each other in our communities. I’ve attended all three AdaCamps – in Melbourne, AU, Washington DC and San Francisco. Each gathering was larger than the last, and I’ve had important conversations at each that have changed my thinking and my activism in significant ways.

My most striking realization is how little work is done to educate adult women about computer science, about open source and open culture. Most outreach efforts focus on children, with the implication that adults are some how “not worth the effort”, “beyond help”, or that adults will simply find the things that interest them on their own.

My experience with the Ada Initiative, and with PyLadies contradicts everything that I had assumed, and everything that some advocates for early CS education for girls have implied — that the only way to change diversity is to start with a new generation of kids.

Adult women want the same skills that anyone interested in computers want. But there are important social and economic barriers to pursuing those goals.

The grassroots movement to educate adult women in writing software is inspiring. It reinforces my belief that it is never too late to learn these skills, and the women who seek out these classes inspire me with their dedication and fearlessness. Because of what I’ve learned and experienced, I’m now taking a break from the Ada Initiative advisors board to focus on my work with PyLadies.

Many people share my discomfort with a world whose code is only written by men. We’re making important changes in the way women view open technology and culture. We’re showing women that they definitely can learn and master these skills, that there are huge benefits to doing so and that women fundamentally belong in our communities. That work, I believe, also leads to the kind of intersectional awakening about diversity that all community builders should have.

The work of the Ada Initiative is difficult, wonderful and making a huge difference in my life. Please join me in supporting the Ada Initiative in 2013.

Using JSON data type in production with Socorro

Back in June, we started using the JSON datatype in production for Mozilla’s Socorro. Our implementation analyzes crashes from Firefox and other Mozilla products, configured with a HBase backend and Postgres version 9.2 processing and serving reports to middleware and a Django front-end.

The guts of the application that stores raw data into Postgres is implemented in a crashstorage class, and an example of how we use this field in report generation is in a stored procedure for rolling up our “Top Crashers By Signature” reports.

The idea is to get the metadata we store for each crash into Postgres in its raw JSON form available to SQL queries. We are currently still storing core crash report data in a normalized table (extracted from the original JSON). We are considering re-writing everything to just use a JSON column at the core. There may be some significant negative performance impacts, so more testing is needed before we move forward.

Previously, we only stored metadata in HBase. While it is convenient to have a year’s worth of data in HBase for running arbitrary queries against, once we settle on reports, it is far faster and more convenient to be able to use Postgres with JSON functions and SQL to rollup aggregates. I’m sure this isn’t true for everyone. We’re working with large, but manageable data sets – around 30 GB per week of JSON, and only 2-3 weeks of data at a time.

The advantage to having the JSON in Postgres is that we no longer are adding columns to our base tables. It’s not terribly difficult to add columns, but it requires some special work to make the database schema migrations as low-lock as possible on tables with weekly partitions. Cutting out the DBA as an intermediary before developers can deploy new features and reports is a huge win for automation and sustainability of the application.

My preference is that developers never require a DBA to be present when deploying an app — even when there are schema changes. There are lots of features in Postgres that significantly reduce the need for DBA involvement — streaming replication, pg_basebackup, zero downtime column ADD, DROP and ALTER, transactional DDL and CREATE INDEX CONCURRENTLY to name a few. We’re not quite there for Socorro, but my plan is to get there.

The ideal role of the Postgres DBA is more to write database code, monitor and improve performance. That kind of role feels a lot more like a partner to developers, rather than a separate, mysterious entity. In much the same way that DevOps and configuration management enable great cooperation between IT and developers, many of Postgres’ maintenance features enable greater cooperation and less friction between DBAs and developers deploying new features.

There are still a few configuration and initialization issues to work through before Postgres can be seen as completely past it’s reputation as a difficult to deploy database. Some of that can be solved with more widely used configuration recipes for tools like Puppet and Chef. My colleagues in IT have been working on a Puppet module that they plan to release that automatically sets up a master database and as many replicas as you’d like to configure.

My hope is that people more widely share their automation stories and code!

In anycase, the JSON datatype is solving an important problem for our team, and significantly reducing the demand for schema changes. I’d love to hear more reports from users trying out JSON the field.

Fancy SQL Monday: Why use OVER() instead of a CROSS JOIN?

EDIT: I had to add this, because it was cracking me up:

FOR LIKE EVERY REASON EVER

I’m reimplementing some expensive database queries, moving them from our middleware into materialized view tables. We make pretty extensive use of Common Table Expressions (CTEs). And we generate many reporting queries that calculate averages and percentages of a total. One way this could be done is with a CROSS JOIN, which is a cartesian product of two tables, adding the total and then the percentage calculation to our original table that produces counts of events. For information about JOIN types supported by PostgreSQL, see SELECT…FROM documentation.

Or we could use OVER(), one of several Window Functions supported by Postgres. From the Postgres documentation, Window Functions are:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

A few other folks have written about Window Functions: Postgres Guide to Window Functions, Window Functions: Postgres’s best kept secret, SQL Window Functions (examples in Ruby). “All built-in and user-defined aggregate functions — such as count, max, bit_or, or xmlagg” — are available for computation as a window function, quoting Open Logic, who phrased it well.

If you read about OVER(), you’ll see that all the examples specify a PARTITION clause. But you don’t have to specify a PARTITION to take advantage of the feature.

We had a query containing two CTEs and a final SELECT that produced the matview data. Here’s an example of that query:

WITH crashes as (
  SELECT
    product_name as category
    , version_string
    , SUM(report_count) as report_count
  FROM signature_summary_products
  JOIN signatures USING (signature_id)
  WHERE signatures.signature = 'libflashplayer.so@0x1f2a14'
  AND report_date >= now()::date - '15 day'::interval
  AND report_date < now()::date 
  GROUP BY product_name, version_string
),
totals as (
  SELECT
    category
    , version_string
    , report_count
    , SUM(report_count) OVER () as total_count
  FROM crashes
)
SELECT category
  , version_string
  , report_count
  , round((report_count * 100::numeric)/total_count,3)::TEXT
  as percentage
FROM totals
ORDER BY report_count DESC;

The part under consideration is:

    , SUM(report_count) OVER () as total_count

So, you see that OVER() has no PARTITION defined, meaning that the SUM will be calculated over the entire result.

Here is that same query, implemented using SUM() and a CROSS JOIN:

WITH crashes as (
  SELECT
    product_name as category
    , version_string
     , SUM(report_count) as report_count
  FROM signature_summary_products
  JOIN signatures USING (signature_id)
  WHERE signatures.signature = 'libflashplayer.so@0x1f2a14'
    AND report_date >= now()::date - '15 day'::interval
    AND report_date < now()::date 
  GROUP BY product_name, version_string
),
totals as (
  SELECT
    SUM(report_count) AS total_count
  FROM crashes
)
SELECT category
  , version_string
  , report_count
  , round((report_count * 100::numeric)/total_count,3)::TEXT
as percentage
FROM crashes CROSS JOIN totals
ORDER BY report_count DESC;

What’s the difference to Postgres between that and a SUM() plus a CROSS JOIN?

Here’s the EXPLAIN output from this query, pared down to the relevant section:

   CTE totals
     ->  WindowAgg  (cost=0.00..0.03 rows=1 width=72) (actual time=0.112..0.114 rows=3 loops=1)
           ->  CTE Scan on crashes  (cost=0.00..0.02 rows=1 width=72) (actual time=0.097..0.100 rows=3 loops=1)
   ->  CTE Scan on totals  (cost=0.00..0.04 rows=1 width=104) (actual time=0.121..0.129 rows=3 loops=1)

The important bit to have a look at is WindowAgg right after CTE totals.

Now compare to the EXPLAIN output from a SUM() plus a CROSS JOIN query:

   CTE totals
     ->  Aggregate  (cost=0.02..0.03 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
           ->  CTE Scan on crashes  (cost=0.00..0.02 rows=1 width=8) (actual time=0.001..0.004 rows=3 loops=1)
   ->  Nested Loop  (cost=0.00..0.07 rows=1 width=104) (actual time=0.191..0.205 rows=3 loops=1)
         ->  CTE Scan on crashes  (cost=0.00..0.02 rows=1 width=72) (actual time=0.162..0.164 rows=3 loops=1)
         ->  CTE Scan on totals  (cost=0.00..0.02 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=3)

You can see here that we now have an Aggregate (for the SUM()) plus a Nested Loop (for the CROSS JOIN). This example query only has three results returned, but our more typical queries involve 10k or more rows returned.

Avoiding the Aggregate and Nested Loop will save us lots of memory and processing time on every run of a very expensive query.

Speaking at the Cash Music Summit today

I’m headed off to the Cash Music Summit in Portland, OR today.

Here’s the blurb I wrote out for Jesse’s zine:

What’s open source got to do with it?

Free software sounds like a 70s era free-love pipe dream. The idea and a copyright hack to enable it were born in 1984, brought to term by an academic who just wanted to fix a problem he had with his printer.

Free and open source software underpin Cash Music’s platform. Why should that matter to you? Software developers prefer working with code they can freely read, understand and modify. But the benefits to end users are not always as clear.

There’s a method to the madness that is giving away something many believe is worth more if kept secret. It’s not just about sharing, but also accountability, choice and freedom. And, it’s about creating communities we love contributing to, with the kind of people we love to collaborate with.

I’m not sure exactly which story I’m going to tell today – I’ve got 10 minutes though and a pretty sweet picture of Tina Turner.

If you’ve not seen Cash Music or learned about it’s mission, check out this NYTimes article about their work.

What I mean when I talk about collaboration with teachers: part I

I’ve given a few talks about my experience learning to teach. This is an edited version of my speaking notes for the keynote I gave at the Computer Science Teachers Association conference. This is the best distillation of my thoughts about the value of open source in my life, and what motivates me to contribute and teach. The first half is over 2000 words, so I’m breaking this into two posts. The next part I publish will be the second half of the talk – about the classes I’ve taught, and my lessons learned about what people need to know to get started in free and open source software.

I am a beginner teacher. I’ve only just started writing lessons and teaching classes to adult women who are learning or practicing their programming. All of what I share today is based on my personal experiences working with first time, adult programmers. My plan today is to tell you a little bit about me and what motivates me to teach and contribute to open source, share with you the successes of some of our beginner adult programming efforts and finally what I think open source communities offer teachers.

And I want to start by giving away my punchline. When it comes to working with open source community – of which I’m a member and a leader, and there are many, many leaders without any kind of central authority – I can say for sure today that we’ll come to you.

I’ve been working for the past couple years to find like-minded open source community members, and for those of you in the audience today, I am making a commitment to you – if you want it – to find an open source person to come and talk about what it is that they do to your classroom.

Just contact me (you can leave a comment below – just indicate if you’d prefer I not make your comment public), and I will make this happen, either through Mozilla or through my open source collaborators. I’ve spent the last 16 years going to conferences, and I would like to introduce that network of people to you.

I want to start with something Julie Horvath said recently. She wrote a blog post about women in tech and it struck such a chord with me. The first sentence really stopped me dead in my tracks.

I didn’t grow up thinking I could do anything I wanted to.

When I look at this again, I feel overwhelmed by how much it matches what the women I’ve taught said that they think about programming.

Screen Shot 2013-07-25 at 5.31.27 AM

I see this every time I walk into a classroom to teach beginning programmers. This is a photo from a class on algorithms, people doing a pen and pencil exercise in groups. Several women said afterward they finally felt confident that they could explain what algorithms were. That before coming and working on this in these groups, they literally had never really thought about how algorithms related to programming or what it might mean to implement or create their own algorithms.

I’ve come to think of this as a possibilities problem. People truly have no idea what is possible for them in computer science. And in my teaching experience in particular, many women coming to these classes have a very limited view of what they can accomplish. They don’t know what the job opportunities are, they don’t realize how programming can be used in their lives outside of work, and they know very little about how a computer works or what the main components of a computer are.

When I think about what I really need to do — what my focus is in teaching that I do — I think about changing the scope of what people think is possible. Broadening the scope, and enhancing whatever details I can that make studying programming and ultimately computer science relevant to the lives of the people coming to these classes.

So if we were to just to attach a little overdeveloped importance to this idea of expanding the scope of possibility, we could call this “possibility engineering.”

In my experience, there’s two basic things I have to do – I need to raise awareness, and then I need to offer encouragement. It’s in addition, of course, to teaching real skills that people need. And as classroom teachers, you’re all aware of the need for these two things. I’ve found that these issues are often left out of how outreach and teaching in open source communities is structured.

Screen Shot 2013-07-25 at 5.09.46 AM

This is a picture of a sticky note I drew of how I felt while learning to use a new programming language or trying learn a new module in Python. The top of the graph is “euphoria” or “happiness”, and the bottom of the graph is “despair” or unhappiness. You can see I have a lot of ups and downs!

The peaks are when I’m reading documentation for the first time, succeeding with experiments and implementing code. The valleys are when I actually try the tutorials and they don’t completely work, when I write code that fails and when I’m trying to refactor my test suite. In the end, my emotions level out and if I’m lucky, I end up satisfied with the tool I chose to work with.

Screen Shot 2013-07-25 at 5.11.08 AM

Here’s what I think happens sometimes with the women who come to PyLadies and then never come back. They initially are very happy, but then something happens that causes them to give up.

Screen Shot 2013-07-25 at 5.11.51 AM

In one case, I know exactly what happened — a woman attended the workshops, tried things on their own that didn’t work, and then finally had something break with Python on her Windows laptop and she never came back.

What happens when PyLadies succeeds? What does the emotional graph look like?

Screen Shot 2013-07-25 at 5.12.49 AM

What I’ve seen in the 60+ women that keep coming to meetings is that they continue to have difficult experiences – things break, they don’t know how to fix them.

But they all come back to the group. They ask questions, they commiserate over things that don’t work and they get the help they need to see that they are improving at the same time as they feel as though they are getting better, making friends and being supported. The in-person experiences are key.

Screen Shot 2013-07-25 at 5.13.48 AM

Before we go on, it’s important to acknowledge a key truth about what it is that teachers are teaching. Computer Science is a way of thinking and solving problems. It’s not a company or a product.

This is of course obvious to all of you in this room – but it’s such an important idea to come back to to in all of our work. It’s about getting kids or adults to understand the basics of what a computer is and what it does, and how it stores data about what, where, how and when we do things. We need people to understand these concepts in the same way that we need people to be able to read. When our society is increasingly assisted, augmented and controlled with the help of computers, democracy is at stake when most people have no idea how a computer and software works.

The role of open source groups like PyLadies, of non-profits like Mozilla, is ultimately to empower people: to spread knowledge, dispel myths and invite exploration.

But these groups are mostly helping out people who are already out of high school.

There’s a fair amount of research at this point about what many people think about computers when they’re in high school. I’ve mostly read about what girls think, and try to keep that in mind when I’m advertising my courses. Which brings me to what I thought computer science was all about when I was in high school.

What I knew was:

  • Computers were for playing games
  • Computers were for anti-social boys
  • You’ll find lots of inappropriate, animated ASCII art on computers

And I think that highlights a problem with how we’re collectively handling explaining computer science to the world. We can’t rely on ad-hoc self-education, or discovery learning to help people understand how the whole world is changing.

Screen Shot 2013-07-25 at 5.16.37 AM

Here’s a list of job titles from my colleagues in the industry. Many of these are jobs that didn’t exist 20 years ago, some are jobs that didn’t exist five years ago. So much is changing so fast.

Despite that, we have some real principles – computer science principles – underpinning it all. That’s where we need to focus, while at the same time exposing people to this wealth of possibility.

So, how did I, a person who thought the computers were for gaming, for boys and probably a little bit seedy, get from there to thinking it might be possible to join an open source community and move on to actually changing something I cared about?

In 2000, I made my first contribution to an open source project. I was working at Intel, managing network equipment monitoring and I’d found a problem with how I’d set everything up and needed to modify something like 7000 files to fix it. So I wrote a simple script.

Not too long after that, someone else had a similar problem and posted about it to a mailing list. So I decided, I might as well help that guy out and post the script. Then, I did.

And what happened next totally changed my life. The maintainer of the project not only thanked me, and asked a bunch of questions, he accepted my patch committed it to the main repo, and added me as a contributor to the project’s site.

Mind Blown

I changed the source code of a tool I used every day.

I felt deliciously powerful, so important! And incredulous that something that I’d written that was so obviously terrible, was good enough to be part of a piece of software that I not only used every day, but thought was incredibly great.

And other people used it! I know because I got bug reports later.

Today, I’m a major contributor to the PostgreSQL community, and I founded a chapter of PyLadies in Portland. I’m also deeply involved in many aspects of open source community organizing, like running conferences and helping out with the Ada Initiative. It’s hard to understate how much that patch affected the rest of my life.

I’m super passionate about open source software and I really think collaborating with teachers is awesome. And what I think in particular is great about collaboration between us is getting the open source community to understand teaching at scale. By that, I mean learning how to teach everyone — the way that we teach in our public education system.

Public education is a grand experiment, and a very successful one. Despite the many issues we have with the administration of it, we have a literacy rate that enables us to sustain a democracy and a system for getting an incredible amount of information to most of our republic’s citizens.

We should be using this system to teach everyone about computer science.

Beyond that, I want open source communities to figure out how to teach at that kind of scale. Not only do we need computer science in the classrooms — we need free and open source principles and tools to be taught as well.

We can get there with community members reaching out to teachers as a first step.

And an important part of that is learning what the process of developing lessons and teaching students in classrooms is all about. This is the huge thing we (free and open source developers and community members) can learn from you (teachers).

Teachers and open source community have a lot in common. Some of the more important things are:

  • Minimal resources
  • Teach anyone who shows up
  • Change the world by sharing ideas

My dream in this is that we’ll find a way to provide effective computer science education for everyone.

We’re trying to find that minimal set of concepts that will make people feel empowered at a keyboard, a kiosk or any computer they interact with in their lives. That they understand what’s being said in the newspaper about computers, that they can ask questions without feeling shamed or stupid, and that they can learn more if they choose.

And I don’t mean at all to say that you’re all signing up for teaching everyone. But I am signing up to at least try to do this for the adults in my life that need and want it.

Second half of this talk coming shortly…