Report from first day at PgEast and hoping for another tool to be opened up

I wrote up some quick notes from talks and conversations over at the Emma Tech blog.

The most exciting talk I sat in on so far today was about an Oracle PL/SQL to Postgres PL/PgSQL translation tool that I’m hoping the company that created it will open source. We’ll see. Fortunately, a fellow conference-goer had an inspirational story to share about open sourcing another tool for Postgres, which led to incredible adoption in our community in just a few months.

Not every project will see that kind of immediate benefit and growth from open sourcing, but there is a certain class of project where most people can build 80% of a useful tool themselves and don’t bother to put in the additional effort for the remaining 20% of the features that they’d really like to have.

But when someone does finally release a tool that provides that extra 20% of features, adopting the new tool is a no-brainer, particularly if it is open source. I think this PL/SQL conversion tool falls into this sweet spot.

Now I’m sitting in the Foreign Data Wrappers talk and very excited to see what Andrew is announcing. Great to see people creating things that make the crowd here clap, smile and celebrate.

GSoC 2011, accepting submissions starting March 28!

The PostgreSQL project has been accepted into the Google Summer of Code 2011.

Students may begin submitting proposals starting March 28, concluding
on April 8.

Development work runs from May 23 through August 15. For students,
suggested projects, ideas and details are at:
http://wiki.postgresql.org/wiki/GSoC_2011
Our GSoC landing page is at:
http://www.google-melange.com/gsoc/org/show/google/gsoc2011/postgresql

We encourage students to contact project admins – me, Josh Berkus and
Robert Treat this year – if they have questions. Once students have a
proposal in mind, we will encourage them to engage with pgsql-hackers
to flesh out their proposals and get feedback the same way that all
contributors do. For those of you who have been around for previous
GSoCs, this should be familiar to you. 🙂

Many thanks to the 15 volunteer mentors and admins this year (in no
particular order):

  • Dave Page – Past mentor – pgAdmin, Windows, Packaging, Infrastructure
  • Heikki Linnakangas – Postgres Committer
  • Magnus Hagander – Postgres Committer, pgAdmin
  • Guillaume Lelarge – pgAdmin
  • Jehan-Guillaume de Rorthais – phpPgAdmin
  • Joe Abbate – Python-related, catalog-related projects
  • David E. Wheeler – Perl-related, extensions, PGXN
  • Mark Wong – benchmarking, monitoring, performance
  • Tatsuo Ishii – Postgres Committer, pgpool-II
  • Stephen Frost – Postgres contributor
  • Devrim Gündüz – Administration related software (dashboard)
  • Josh Berkus – auto-configuration, performance testing
  • Selena Deckelmann – configuration, testing
  • Andreas Scherbaum – performance, configuration, testing
  • Robert Treat – Past mentor 2x, co-admin, Mentor Summit attendee.

We can always accept more mentors! Actual assignment to projects
depends greatly on the proposals from students. Please contact me if
you are interested.

Intro to PostgreSQL class starts March 7!

Remember that class I announced about a month ago?

Well, it’s happening for real. We’re starting March 7th and going for 6 weeks. Sign up now if you want to join us for this first edition of the class.

I’m planning to do screencasts for a lot of the content, and have just started playing around with ScreenFlow.

The first couple weeks are primarily about using psql and learning key features of PostgreSQL, with some history sprinkled in. The next two weeks dive into features like: full text search, built-in functions, our many datatypes, indexing and transactional DDL. I’ll be surveying students as we go along to add detail where I can on key features they’re interested in. The last few weeks go into administration, maintenance and configuration. I’ll also be throwing in details about the PostgreSQL community – people, the best places to go for help, and hopefully some cameos from Postgres community members.

So, don’t forget to sign up today! Especially because this pudding says so:

Image courtesy of @thesethings

PL Developer Summit at PgCon, May 21!

UPDATE: We have 18 PLs. Added to the list from comments. 🙂

You’re probably aware that PostgreSQL supports a few procedural languages, with PL/PgSQL being the best known, largely for its similarity to Oracle’s PL/SQL.

Interest in PostgreSQL Procedural Languages (PLs) has grown significantly in the last few years and so PgCon is hosting a special PL summit on Saturday May 21, 2011.

Did you know that 17 other procedural languages are currently implemented?

  1. PL/Tcl and PL/Tclu
  2. PL/Perl and PL/Perlu
  3. PL/Python and PL/Pythonu
  4. PL/Ruby
  5. PL/Java
  6. PL/Lua
  7. PL/LOLCODE
  8. PL/Js
  9. PL/Proxy
  10. PL/PHP
  11. PL/sh
  12. PL/R
  13. PL/Parrot
  14. PL/scheme
  15. PL/Perl6
  16. PL/PSM
  17. PL/XSLT

And we have at least one proprietary PL from EnterpriseDB:

We invite PL developers, PostgreSQL core hackers, those interested in future PL development and PgCon attendees interested in learning more to attend!

Before we decided to create this summit, I put together a survey for PL developers. All survey respondents wanted a summit to happen!

The most popular topics were:

  • Postgres PL Interface Improvements
  • Connecting with other PL developers
  • New features in PLs
  • Hacking together
  • State of PLs
  • Distributions and builds
  • PG9.1 extensions vs PL languages
  • Security (pl vs plu)
  • PGXN

The most popular PLs were:

  • PL/PgSQL
  • PL/Perl
  • PL/Python
  • PL/R

The summit is open to attendees of PgCon and special guests. Please RSVP and help set the agenda.

The agenda and any results of the summit will be published on the wiki.

PostgreSQL at MySQL Users Conference: the sessions!

You’ve probably seen a few posts about this – from the CFP, to Baron’s recent pointer to the release of the schedule. And now Josh Berkus just posted a Meetup for the event, so that spurred me on for this post…

So, just to make things even easier for you, I thought I’d summarize the awesome talks we’re having at the O’Reilly MySQL Users Conference this year related to PostgreSQL.

We’re also having a Birds of a Feather session, and staffing a booth on the exhibit floor!

If you’re planning to attend, you can use my code & save 25% in addition to early registration savings: mys11fsd: http://oreil.ly/goaqst

Hope to see you there!

Hot Standby features for 9.1, just committed: Pause and Resume

On February 8th, Simon Riggs committed a couple of new functions that allow Hot Standby to be paused and resumed. You can already *read* from a Hot Standby without pausing, but until now there was no way to pause the application of changes. You might want to do this if you have a very high-write-volume server and some very expensive queries that you want to run on a slave.

Basic Recovery Control functions for use in Hot Standby. Pause, Resume,
Status check functions only. Also, new recovery.conf parameter to
pause_at_recovery_target, default on.

The basic idea is that if you have a read-only standby system, you can give it the command: pg_xlog_replay_pause() and the standby will stop applying changes. Then you can use the database in read-only mode without new changes being applied. When you’re done you can issue the command: pg_xlog_replay_resume() and proceed with applying logs.
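As a rough sketch of what such a session might look like on the standby: the pause and resume functions are the ones named above, while pg_is_xlog_replay_paused() is assumed here to be the "status check" function from the commit message, and pg_is_in_recovery() is the existing 9.0 function. Calling these will most likely require superuser rights.

```sql
-- Run on the read-only standby, not on the master.
SELECT pg_is_in_recovery();         -- expect true on a standby

SELECT pg_xlog_replay_pause();      -- stop applying WAL changes
SELECT pg_is_xlog_replay_paused();  -- assumed status check; expect true while paused

-- ... run the expensive read-only queries here ...

SELECT pg_xlog_replay_resume();     -- continue applying logs
```

Keep in mind that while replay is paused the standby keeps falling further behind the master, so it will need time to catch up after resuming.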

There are some related features that I can’t wait to test out around named restore points for replay. But the ability to pause replay and run queries is just awesome.

This is a feature that Simon talked about back in 2009 at FOSDEM, and I am very excited to see it implemented.

Offering an Intro to PostgreSQL class

UPDATE: See below for pricing.

I’m working with Code Lesson to offer an Introduction to PostgreSQL class.

Code Lesson is pretty cool – it’s an online course system, and the idea is you get a couple assignments and lessons taught by me each week, and there’s a midterm and final evaluation. I love conferences, but the nice thing about an online course is you don’t have to spend an entire workday taking a tutorial at a conference, or travelling to a particular location, and you can finish assignments when it’s convenient for you.

My current working outline is:

Intro to Postgres

Hello, world!
* History of PostgreSQL project
* Features
* Basic SQL

Usage
* psql
* Drivers: Perl and Python examples
* GUIs
* Documentation

Survey of features
* Full text search
* Built-in functions
* Datatypes
* Indexes
* Transactional DDL

Community
* Mailing lists & IRC
* Asking questions
* Modules, add-ons, tools

Operations
* System and hardware
* Installation and configuration
* Maintenance and operation
* Replication

Our plan is to provide students with login access to a shared database. During the course, I’ll be available to answer questions and I’m considering making short videos to go along with the course material.

We haven’t set the price for it just yet, but should be figuring that out in the next week or so.

Anyway, if you’re interested, sign up and you’ll get an email when we set the price. I’m happy to answer any questions you have about content.

Another thing that was requested in the Hacker News thread was more advanced material. I think the advanced material falls into two categories – PostgreSQL core functionality, and administration/tuning.

Update! Pricing is set at $325/student, with a 10% discount if you register 2 or more students at the same time.

PostgreSQL 9.0.1 released, includes security fix & maintenance releases for 6 other versions

The PostgreSQL Global Development Group released new maintenance versions today: 9.0.1, 8.4.5, 8.3.12, 8.2.18, 8.1.22, 8.0.26 and 7.4.30. This is the final update for PostgreSQL versions 7.4 and 8.0. There’s a security issue in there involving procedural languages, and a detailed description of the vulnerability is on our wiki. A key thing to remember is that the issue primarily affects people who use SECURITY DEFINER along with a procedural language function. PL/PgSQL is not affected, but any other procedural language with a “trusted” mode is. This includes PL/Perl, PL/Tcl, PL/Python (7.4 or earlier) and others. The new versions fix issues in PL/Perl and PL/Tcl. A patch for PL/PHP is currently in the works.

Most developers feel that the security issue is relatively obscure. If you aren’t using a procedural language with some mechanism for altering privileges (SET ROLE or SECURITY DEFINER, for example), you aren’t vulnerable to the security issue and can upgrade Postgres during your next regularly scheduled downtime. If you *are* vulnerable, we recommend investigating the use of the functions that may be vulnerable, and taking steps to prevent their exploitation by upgrading as soon as you can.

From the FAQ:

What is the level of risk associated with this exploit?

Low. It requires all of the following:

  • An attacker must have an authenticated connection to the database server.
  • The attacker must be able to execute arbitrary statements over that connection.
  • The attacker must have a strong knowledge of PostgreSQL.
  • Your application must include procedures or functions in an external procedural language.
  • These functions and procedures must be executed by users with greater privileges than the attacker, using SECURITY DEFINER or SET ROLE, and using the same connection as the attacker.
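To get a first idea of whether a database meets the conditions above, a catalog query along these lines may help. It is only a sketch: it lists SECURITY DEFINER functions written in anything other than the built-in languages and PL/PgSQL, and it does not detect applications that switch privileges with SET ROLE. The column aliases are made up for readability.

```sql
-- List SECURITY DEFINER functions in external procedural languages.
SELECT n.nspname                   AS schema_name,
       p.proname                   AS function_name,
       l.lanname                   AS language_name,
       pg_get_userbyid(p.proowner) AS owner_name
FROM pg_proc p
JOIN pg_language l  ON l.oid = p.prolang
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE p.prosecdef                  -- true for SECURITY DEFINER functions
  AND l.lanname NOT IN ('internal', 'c', 'sql', 'plpgsql')
ORDER BY 1, 2;
```

An empty result suggests the SECURITY DEFINER path does not apply to that database, but the query has to be run in each database separately because pg_proc is per-database.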

This was also the first release for which I generated release notes! 😀

Here was my list of interesting changes for the announcement:

  • Prevent show_session_authorization() from crashing within autovacuum processes, backpatched to all supported versions;
  • Fix connection leak after duplicate connection name errors, fix handling of connection names longer than 62 bytes and improve contrib/dblink’s handling of tables containing dropped columns, backpatched to all supported versions;
  • Defend against functions returning setof record where not all the returned rows are actually of the same rowtype, backpatched to 8.0;
  • Fix possible duplicate scans of UNION ALL member relations, backpatched to 8.2;
  • Reduce PANIC to ERROR on infrequent btree failure cases, backpatched to 8.2;
  • Add hstore(text, text) function to contrib/hstore, to support migration away from the => operator, which was deprecated in 9.0. Function support backpatched to 8.2;
  • Treat exit code 128 as non-fatal on Win32, backpatched to 8.2;
  • Fix failure to mark cached plans as transient, which could keep an index built with CREATE INDEX CONCURRENTLY from being used right away, backpatched to 8.3;
  • Fix evaluation of the inner side of an outer join that is a sub-select with non-strict expressions in its output list, backpatched to 8.4;
  • Allow full SSL certificate verification to succeed in the case where both host and hostaddr are specified, backpatched to 8.4;
  • Improve parallel restore’s ability to cope with selective restore (-L option), backpatched to 8.4 with caveats;
  • Fix failure of “ALTER TABLE t ADD COLUMN c serial” when done by non-owner, 9.0 only.
  • Several bugfixes for join removal, 9.0 only.

If you have a look at a new tool that Robert Haas and Tom Lane committed to the repo called git_changelog, you can use it to find the commit IDs for the various features (you need the whole source tree to do it :)).

You’ll find that there are a lot of commits in these sets. We haven’t had a minor release since May 2010, so they kind of added up.

Any other changes in there you think we should have mentioned in the announcement? Let me know in the comments.

Download new versions now:

Custom aggregates: a couple tips and ORDER BY in 9.0

A friend asked about a way to report the first three semesters that a group of students were documented as being present, and report those values each in a column.

The tricky thing is that the semesters students attend are rarely the same. I started out with a very naive query (and sorry for the bad formatting that follows; I need to find some good SQL formatting markup) just to get some initial results:


select student,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1) as sem1,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 1) as sem2,
(SELECT semester as sem1 FROM assoc a2 WHERE a2.student IN (a1.student) ORDER BY sem1 LIMIT 1 offset 2) as sem3
FROM assoc a1
WHERE
student IN ( select student from assoc group by student HAVING count(*) > 2)
GROUP BY student;

That query pretty much sucks, requiring five sequential scans of ‘assoc’:

                                     QUERY PLAN                                     
 HashAggregate  (cost=3913.13..315256.94 rows=78 width=2)
   ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=2)
         Hash Cond: (a1.student = assoc.student)
         ->  Seq Scan on assoc a1  (cost=0.00..1126.17 rows=78017 width=2)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)
   SubPlan 1
     ->  Limit  (cost=1326.21..1326.22 rows=1 width=3)
           ->  Sort  (cost=1326.21..1328.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 2
     ->  Limit  (cost=1331.22..1331.22 rows=1 width=3)
           ->  Sort  (cost=1331.21..1333.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 3
     ->  Limit  (cost=1334.14..1334.14 rows=1 width=3)
           ->  Sort  (cost=1334.14..1336.64 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)

So, he reminded me about custom aggregates! I did a little searching and found an example function, to which I added an extra CASE branch that stops the aggregate from adding more than three items to the array it returns:


CREATE FUNCTION array_append_not_null(anyarray,anyelement)
RETURNS anyarray
AS '
SELECT CASE WHEN $2 IS NULL THEN $1 WHEN array_upper($1, 1) > 2 THEN $1 ELSE array_append($1,$2) END
'
LANGUAGE sql IMMUTABLE RETURNS NULL ON NULL INPUT;

And finally, I declared an aggregate:


CREATE AGGREGATE three_semesters_not_null (
sfunc = array_append_not_null,
basetype = anyelement,
stype = anyarray,
initcond = '{}'
);
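A quick way to sanity-check the aggregate, assuming the function and aggregate above have been created, is to feed it a few literal values. The VALUES list and the alias v(x) are invented for this check.

```sql
-- Five inputs, one of them NULL: the result array should hold
-- at most three elements and should not contain the NULL.
SELECT three_semesters_not_null(x) AS capped
FROM (VALUES (1), (NULL), (2), (3), (4)) AS v(x);
```

Because the transition function is declared RETURNS NULL ON NULL INPUT and the aggregate has a non-null initcond, PostgreSQL skips NULL inputs before the function is even called, so the CASE branch for NULL acts as an extra safeguard. Which three values end up in the array depends on the order in which rows reach the aggregate.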

One problem though – we want the array returned to be only the first three semesters, rather than any three semesters a student has a record for. Meaning, we need to sort the information passed to the aggregate function. We could do this inside the aggregate itself (bubble sort, anyone?) or we can presort the input! I chose presorting, to avoid writing a real ugly case statement.

My query (compatible with 8.3 or higher):


SELECT sorted.student, three_semesters_not_null(sorted.semester)
FROM (SELECT student, semester from assoc order by semester ) as sorted
WHERE
sorted.student IN (select a.student from assoc a group by a.student HAVING count(*) > 2)
GROUP BY sorted.student;

Which yields the much nicer query plan, requiring just two sequential scans:

                                      QUERY PLAN                                      
 HashAggregate  (cost=11722.96..11725.46 rows=200 width=64)
   ->  Hash Semi Join  (cost=10052.32..11570.82 rows=30427 width=64)
         Hash Cond: (assoc.student = a.student)
         ->  Sort  (cost=8533.14..8728.18 rows=78017 width=5)
               Sort Key: assoc.semester
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=2)

I ran my queries by Magnus, and he reminded me that what I really needed was ORDER BY in my aggregate! Fortunately, 9.0 has exactly this feature:


SELECT student,
three_semesters_not_null(semester order by semester asc ) as first_three_semesters
FROM assoc
WHERE student IN (select student from assoc group by student HAVING count(*) > 2)
GROUP BY student;

Which results in the following plan:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: public.assoc.student
         ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (public.assoc.student = public.assoc.student)
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

A final alternative would be to transform the IN query into a JOIN:


SELECT a.student,
three_semesters_not_null(a.semester order by a.semester asc ) as first_three_semesters
FROM assoc a
JOIN (select student from assoc group by student HAVING count(*) > 2) as b ON b.student = a.student
GROUP BY a.student;

And the plan isn’t much different:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: a.student
         ->  Hash Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (a.student = assoc.student)
               ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

Any other suggestions for this type of query?
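One more variation, as an untested sketch: window functions (available since 8.4) can number each student’s semesters and count them in a single pass, and the first three can then be pivoted into separate columns, which is closer to the original request. The names rn, n and ranked are made up here, and the sketch assumes semester is a type that max() accepts.

```sql
SELECT student,
       max(CASE WHEN rn = 1 THEN semester END) AS sem1,
       max(CASE WHEN rn = 2 THEN semester END) AS sem2,
       max(CASE WHEN rn = 3 THEN semester END) AS sem3
FROM (
    SELECT student,
           semester,
           -- position of this semester within the student's history
           row_number() OVER (PARTITION BY student ORDER BY semester) AS rn,
           -- total number of rows recorded for this student
           count(*) OVER (PARTITION BY student) AS n
    FROM assoc
) AS ranked
WHERE rn <= 3   -- keep only the first three semesters
  AND n > 2     -- and only students with more than two rows
GROUP BY student;
```

This references assoc only once, so it should avoid the second sequential scan, though the window sort has its own cost and the plan is worth checking with EXPLAIN on real data.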

I’ve attached the file I was using to test this out.
custom_aggregates.sql

Using logger with pg_standby

Piping logs to syslog is pretty useful for automating log rotation and forwarding lots of different logs to a central log server.

To that end, the command-line utility ‘logger’ is nice for piping output from utilities like pg_standby into syslog without having to add syslogging code to the utility itself. It also comes installed by default with modern syslog packages.

Here’s an easy way to implement this:


restore_command = 'pg_standby -d -s 2 -t /pgdata/trigger /shared/wal_archive/ %f %p %r 2>&1 | logger -p local3.info -t pgstandby'