It is nice if you can draw data from memory (thanks for the memories!), but ultimately, persistent data needs to persist on disk.
Move WAL to a separate disk
The WAL (Write-Ahead Log) files are stored in the pg_xlog directory. A symbolic link may be used to point that directory at a different disk drive, allowing these sequential writes to take place without being delayed by the seeks caused by the other database updates happening on the same disk.
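For instance, the relocation might look roughly like the following; the data directory path and the mount point are, of course, hypothetical:

    # stop the postmaster before touching pg_xlog
    pg_ctl -D /var/lib/pgsql/data stop

    # move the WAL directory onto the dedicated disk, then symlink it back
    mv /var/lib/pgsql/data/pg_xlog /mnt/wal_disk/pg_xlog
    ln -s /mnt/wal_disk/pg_xlog /var/lib/pgsql/data/pg_xlog

    pg_ctl -D /var/lib/pgsql/data start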
One of the common "operating principles" is that PostgreSQL can only commit transactions as rapidly as the spinning of the disks will allow. Absent "fancy intelligent" disk controllers, if you have one disk spinning at 15,000 RPM, then the theoretical maximum number of transaction commits per minute is that same 15,000. Having multiple disks multiplies this, more or less.
Moving tables to separate disks
Table contents are stored in table-specific files, which may be identified and shifted to a separate disk, making use of symbolic links to point at the new location.
Indices
Indices, just like tables, are stored in their own files, and may be moved to separate disks.
With a table that is updated heavily, it may be very worthwhile to split both the table and its indices to separate disks.
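As a sketch of the technique, the file backing a table or index can be located via its relfilenode in pg_class; the database name, table name, OIDs, and paths below are purely illustrative, and the postmaster must be stopped while files are moved:

    # look up the files backing a (hypothetical) table and its index
    psql -t -c "SELECT relname, relfilenode FROM pg_class
                WHERE relname IN ('orders', 'orders_pkey');" mydb

    # with the postmaster stopped, move each file and symlink it back
    mv /var/lib/pgsql/data/base/16384/17001 /mnt/index_disk/17001
    ln -s /mnt/index_disk/17001 /var/lib/pgsql/data/base/16384/17001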
RAID
At the simplest level, RAID 0, where writes are striped across multiple disks, multiplying speed, and RAID 1, where writes are mirrored across multiple disks, are the "cheap" choices that may provide minor improvements in performance.
More realistically, RAID 5, which stripes data and parity across a "whole bunch of disks," and RAID 10, which combines striping with mirroring, will provide, at nominally increased cost, significantly improved performance along with good reliability even in the wake of a disk (hopefully not "or two") flaking out.
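If you are building the array in software rather than with a dedicated controller, a Linux RAID 10 set can be assembled with mdadm; the device names here are merely examples:

    # assemble a four-disk software RAID 10 array (device names are examples)
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs -t ext3 /dev/md0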
If there are enough disks involved in a RAID array, the previously-discussed "clever hacks" to move parts of the database to other devices become less and less important. In some loose benchmarking that I have seen, the "rule of thumb" about having WAL on a separate device goes away if you're using a FibreChannel disk array.
In effect, if your disk controller has enough disks to stripe the data across, there is no point in doing clever things yourself.
Battery-backed Cache
The really enormous performance increases that have been found for update-heavy database loads come from another hardware enhancement, namely a RAID controller with battery-backed cache. At the time of writing, the major vendors are LSI Logic, Compaq, and Intel, and the usual form of the product is a PCI SCSI or SATA controller supporting 16 or more drives, with 128MB or more of cache on the controller along with a battery so that the cache contents survive power outages.
On a test involving a workload heavy on updates, with transactions involving relatively few updates, the introduction of this cache increased performance by a factor of 20.
Prices for these controllers vary; a number of them are available for less than $1000 USD. For systems with heavy update traffic, that is likely to be a small price to pay.
Solid State Disk
Another option that should have an immense effect on performance is to install an SSD (Solid State Disk) device. These are typically a board or "box" with a CPU, a whole bunch of memory, and some combination of battery and "nonvolatile storage" (in case the power is out for longer than the battery can cope with).
It is normally interfaced to your computer system as a SCSI device which happens to have latency measured in nanoseconds instead of milliseconds.
Put your WAL on this and you should see, if anything, even more improvement in performance.
Unfortunately, the designs tend to be somewhat exotic, with the result that prices are staggeringly high. Be prepared to need to pay tens of thousands of dollars for fairly nominal amounts of "disk space".
Exotic Disk arrays
Numerous vendors sell disk array hardware which integrates together the ability to slot in extra disks on demand, whether to increase capacity, speed, or to recover from some disk having failed.
As prices go up into the hundreds of thousands of dollars, capabilities increase with corresponding, erm, "wildness". A decent array will have gigabytes of cache and built-in batteries; this is quite likely to provide better performance than even SSD could offer.
Choosing filesystems
There is not widespread agreement as to which choices are universally the "best". The traditional "ideal" behaviour comes from traditional Unix filesystems like the Berkeley BSD UFS/FFS; they use, by default, the very same 8K block size that PostgreSQL does.
Note: PostgreSQL has traditionally used 8K disk blocks; recent changes on FreeBSD move the block size to 16K, which encourages the use of the same block size with PostgreSQL. This may be a little wasteful for tables that remain small, as one would expect, on average, half a block to be wasted; modern disk drives hundreds of GB in size can afford such losses. An early report from performance testing on FreeBSD 4.7 suggested that moving to 16K blocks sped up write-intensive activity by about 8%. While anecdotal, that indicates that this sort of tuning can provide meaningful incremental improvements in performance. It may not be as good as a query optimization that makes the query 10x as fast, but it can be worth going for all the same.
On Linux, the choices are more difficult. There are quite a number of filesystems to choose from, and each has characteristics that keep it from being considered optimal:
ext2
This filesystem suffers from the "long fsck problem"; if your system goes down, for whatever reason, and the filesystem is very large, it may take a very long time to run fsck, and, in some cases, may require user intervention in order to recover.
The largest block size supported by ext2 is 4K; you may wish to create ext2 filesystems with that maximum size so that the 8K blocks PostgreSQL typically uses are stored as efficiently as possible. The appropriate option is: mke2fs -b 4096
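Spelled out in full against a hypothetical partition, that is:

    # create an ext2 filesystem with 4K blocks (the partition name is an example)
    mke2fs -b 4096 /dev/sdb1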
ext3
This uses the same format as ext2, but adds in journalling, which has the merit of improving fsck times if you reboot unexpectedly.
As with ext2, increasing the block size may improve efficiency a little bit. Unfortunately, Linux filesystems don't usually support 8K or larger blocks.
Unfortunately, journalling has the demerit that it replicates the careful effort that the DBMS is making to try to ensure that all updates are pushed out to disk in a consistent manner. In effect, both the filesystem and the database are trying to maintain journals with equivalent intent, if not identical implementation, and it tends to lead to somewhat lowered performance. Anything that was not committed to the database files resides in the PostgreSQL WAL files, so that any journalling of database file updates is redundant.
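One way to reduce that duplication on ext3 is to journal only metadata rather than file data, via the data=writeback mount option; the device and mount point in this sketch of an /etc/fstab entry are hypothetical:

    # journal metadata only on the PostgreSQL data partition
    # (device and mount point are examples)
    /dev/sdb1   /var/lib/pgsql   ext3   defaults,data=writeback   1 2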
Reiserfs
Reiserfs was designed to provide improved performance for directories containing very large numbers of small files; its performance degrades somewhat when dealing with the rather large files that a database produces. It is almost certainly not the ideal choice.
JFS
Suffering from the demerits of journalling, this filesystem, originally based on what IBM used in OS/2, probably won't be the first choice from a performance perspective.
It also appears that development efforts are lagging those on ext3.
XFS
SGI's XFS filesystem was expected to become the best option available on Linux; it supports metadata-only journalling, which is what we would want, and was designed to handle very large files efficiently, which is also what we would want.
Up until recently, it had the significant demerit that adding it in required applying custom patches to produce your own kernel, and that's not what the guys wearing suits at head office want to hear. They want to hear: "Our distribution vendor includes XFS support in the kernels burned onto the CDs that they sell." (Of course, it might well be beneficial to compile your own kernel anyways, to eliminate unneeded device drivers and such, but they won't be happy with that answer...)
Integration of XFS into "official" Linux kernels proceeded in version 2.6, which should resolve the concerns of the non-technical decision-makers. Unfortunately, the fiscal woes of SGI have limited its ability to put serious effort into improvements.
On BSD-related systems, the Berkeley FFS combined with "soft updates" provides something close enough to "journalling" to allow quick fsck runs, whilst not adding the redundant journalling of database file updates.
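Soft updates are enabled per filesystem; on FreeBSD, for instance, it is a tunefs flag applied while the filesystem is unmounted (the device name here is an example):

    # enable soft updates on an (unmounted) FreeBSD UFS filesystem
    umount /dev/da0s1e
    tunefs -n enable /dev/da0s1e
    mount /dev/da0s1e        # remount, assuming an /etc/fstab entry exists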
Mounting noatime
On virtually any Unix-like system, a performance boost may be attained by using the noatime mount option, which disables the recording of access times on inodes. This eliminates the nearly continual filesystem metadata updates that would otherwise result from every read.
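On Linux, for example, the option can be applied to an already-mounted filesystem in place; the mount point here is an example, and the corresponding /etc/fstab entry should be updated as well:

    # remount the PostgreSQL data area without access-time updates
    mount -o remount,noatime /var/lib/pgsql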