I am asked fairly frequently to give a health assessment of Postgres databases. Below is the process I’ve used and continue to refine.
The list isn’t exhaustive, but it covers the main issues a DBA needs to address.
- Run boxinfo.pl on a system
Fetch the script from http://bucardo.org/wiki/Boxinfo. Run as the postgres user on the system (or a user that has access to the postgres config).
- Check network.
What is the network configuration of the system? What is the network topology between database and application servers? Any errors?
- Check hardware.
How many disks? What is the RAID level? What is the SLA for disk replacement? How many spares? What is the SLA for providing data to the application? Can we meet that with the hardware we have?
- Check operating system.
IO scheduler set to ‘noop’ or ‘deadline’, swappiness set to 0 (http://www.pythian.com/news/1913/what-exactly-is-swappiness/)
- Check filesystems.
Which filesystem is being used? What parameters are used with the filesystem? Typical things: noatime, ‘
tune2fs -m 0 /dev/sdXY‘ (get rid of root reserved space on database partition), readahead – set to at least 1MB, 8MB might be better.
- Check partitions.
What are the partition sizes? Are the
pgdatadirectories separated? Are they of sufficient size for production, SLAs, error management, backups?
- Check Postgres.
What is the read/write mix of the application? What is our available memory? What is the anticipated transpactions per second? Where are stats being written (
- Check connection pooler.
Which connection pooler is being used? Which system is it running on? Where will clients connect from? Which connection style (single statement, single transaction, multi-transaction)?
- Backups, disaster recovery, HA
Big issues. Must be tailored to each situation.
What’s your checklist for analyzing a system?
So, on the connection pooler part — you say “which … is being used?” implying this is a must-have.
Can you talk a little more about why that is?
I understand that a CGI-style web app that makes a new connection on every page request is going to be inefficient, but if a webapp makes a dozen direct persist connections, and then recycles those for each request, is that sufficient?
What pooler do you like for webapps?
That’s a lot of questions 😀
== Why Pool? ==
With persistent connections you may want to abstract this away to have more fine-grained control over failover. Like, if your application doesn’t support online reconfiguration of which database it’s connecting to, maybe the pooler could take care of that. There are lots of other situations.
So, in a situation when you’ve got a very small webapp, it’s more about reliability and continuity than it is about performance management.
== Why assume there’s a pooler? ==
Many web frameworks have a pooler of some kind that comes with the ORM, or is just part of their existing infrastructure. Sometimes people don’t even realize they’ve got one enabled.
== Which pooler do I recommend? ==
The two Postgres-specific poolers that I’m most familiar with are pgpool-II (and it’s older cousin pgpool), and pgbouncer.
The big advantage to pgpool-II is that it can detect failure and trigger failover. I’ve set this up in situation where other HA solutions didn’t fit the use case. It’s a bit more flexible when you have lots of different Postgres database with varying policies on the same server. It’s configuration is pretty simple and similar to configuring Postgres itself.
Most people I know who are supporting web applications tend to use pgbouncer – because it supports online reconfiguration, is very lightweight, and fits many of the common web application patterns: lots of db servers, lots of clients, and lots of them going up and down, a need to upgrade services without user-detectable downtime.
Boxinfo looks useful. Is there any output from the aspersa summary tool that would supplement it?
That is a cool tool 🙂
Integration would be awesome. I will check it out later and see what the delta is.
There used to be powerpostgresql.com, a quite nice site with similar recommendations about performance. Too bad it’s not available anymore 🙁
“Are the /root, pg_xlog and pgdata directories separated”
Why does /root have to be separated? I would think “/” ( the general OS partition) would be separated, but not specifically the root user’s home directory.
I just meant ‘/’ not /root 🙂
The zeroth item on the list, which you probably because it’s so natural and obvious to you, is that little conversation that happens before you start with all that other stuff. Its value just can’t be overstated.
Any insights into how to guide that conversation?