Christopher B. Browne's Home Page
cbbrowne@gmail.com

PostgreSQL

Christopher Browne


Table of Contents
1. PostgreSQL
2. Backing up PostgreSQL Databases

1. PostgreSQL

PostgreSQL

This is an extensible "object-relational" database system, the most sophisticated and probably the best supported of the free databases. It has been chosen as Linux Journal "Editors' Choice" in 2001 , 2003 , 2004 , 2005 .

At one time, this was the Postgres research RDBMS engine, by Michael Stonebraker (creator of Ingres, one of the first commonly-available relational databases, more recently a "senior technical dude" at Informix , of late teaching at MIT.). People liked Postgres enough that some people decided to make the query language SQL-compliant, at which point, in keeping with other product names at the time, was called PostgreSQL-95 . Subsequent work dropped the "best before" dating.

There was a commercialized version of Postgres that was called Illustra; the company was bought out by Informix, and its object-oriented features have been integrated into Informix Universal Server. People occasionally try to draw out "family inheritance" in the code bases between PostgreSQL, Informix, and Ingres due to Stonebraker's involvement with them all, while there may be some internal similarities, the respective projects were considerably independent from one another, and real similarities aren't likely to be overly conspicuous.

More " free third-party tools" have been created for PostgreSQL than any of the other free databases, and include:

1.1. Replication Systems

1.2. Interesting Extensions

  • pgpool

    pgpool is a connection pool server for PostgreSQL, standing between PostgreSQL clients and a server. Any PostgreSQL client may connect to pgpool as if it were a real PostgreSQL server. It can serve a number of functions:

    • pgpool caches database connections to reduce the overhead involved in establishing connections.

      It may also be used to diminish the number of connections that need to be opened, which can improve the efficiency of server usage.

    • pgpool may be used for failover purposes. If the "main" server goes down, pgpool can automatically switch connections over to the secondary server.

    • Replication

      Update queries may be applied to multiple servers, so that this would represent a sort of "poor-man's replication scheme."

    • Load balancing

      SELECT queries can be submitted in some form of "round robin" against multiple servers.

    There are plans afoot to build a more sophisticated scheme to dynamically configure pgpool in conjunction with Slony-I . With Slony-I, you may submit a request to shift the "master" role from one database to another, perhaps for handling a system failure, or to take a server out for maintenance. A useful enhancement to pgpool would be to have Slony-I inform it to defer all requests when that switchover process starts, and then to start to submit them to the new master as soon as it is ready. That potentially allows applications to be switched to a new "master" fairly much invisibly to the users.

  • SQL-Trees

  • Celko-style Nested Set Tree PostgreSQL Functions

  • pg_autovacuum

    I have contributed code improving logging, execution as a daemon, as well as getting it to run on Solaris and AIX.

  • PostgreSQL Certificates

    Along with Documentation This provides a set of extensions to PostgreSQL allowing you to store digital certificates, and to have database fields that are stored in encrypted form that may only be extracted by users that are authenticated using digital certificates.

  • Distributed PostgreSQL

    Working on 2 Phase Commit, with a view to providing an XA -compliant interface.

  • initdb in C

  • ExtenDB - Cost-effective Data Warehousing and BI

  • uuid type for PostgreSQL

    Based on RFC 4122 - Universally Unique IDentifier (UUID) URN Namespace

  • Enumerated Fields in PostgreSQL

  • Using materialized views for commonly-queried subsets

  • pgmemcache

    This is a PostgreSQL-based implementation of Memcached . The general idea is that you have a transactional database as the "official store" for vital information.

    If you have applications which really frequently access this information, where is isn't a forcible disaster if they're not completely up to date, you can improve performance by having a distributed cache that takes the "heat" of heavy queries. Users first try to hit the cache; if they fail, then they hit the database, and draw the results into the cache. Updates will naturally invalidate the cache; either your applications must explicitly mark entries as dead or repair them, or triggers on tables are used for cache entry invalidation.

    This sort of algorithm has proven very useful for various sorts of query-heavy database-driven applications where entries are much more frequently read than they are written:

    • Encyclopedia-like applications like Wikipedia;

    • Discussion boards like Slashdot ;

    • The WHOIS service

    Another "home" for pgmemcache may be found here.

1.3. Tools

1.4. Programming PostgreSQL

PostgreSQL provides an interface for tying languages to the database server, allowing the Gentle User to develop stored functions in quite a variety of languages.

  • C - included internally in standard distribution

  • SQL - included internally in standard distribution

  • PL/pgsql - included internally in standard distribution

  • Tcl - included internally in standard distribution

  • Perl - included internally in standard distribution

  • Python - included internally in standard distribution

  • Ruby

  • PL/R User's Guide - R Procedural Language

    On using the R statistical language in conjunction with PostgreSQL.

  • Embedding R in Postgres

    Articles on how languages were embedded in with the PostgreSQL engine so that statistical functions implemented in R could directly access database data and then be exported as SQL functions.

  • plPHP - programming in PHP

  • PL/sh handler

  • PL/Java

  • PL/Java

  • PL/PGJ

    A Java Stored Procedure execution system for PostgreSQL.

  • PL/mono

    plMono is a PostgreSQL language using the embedded Mono runtime. It provides support for writing functions in C#, or any other language that supports .NET. At the time of this writing that includes: APL , C++, COBOL , Eiffel , F# (a Caml derivative), Forth , Fortran , Haskell, ML , and probably others.

1.5. Tuning PostgreSQL

PostgreSQL often gets accused of being rather slow. Some of that represents the way things were years ago. When observed recently, it is typically the result of taking an install with all of the defaults from what is generated by the source distribution.

Note

The defaults are configured to be able to function on all platforms on which PostgreSQL runs, and are definitely not optimal for typical usage.

Fortunately, it is quite straightforward to achieve massive improvements by looking at a very few things. Here is a sort of " top ten" list of things that should be done to improve performance, and why:

  • Increase, in postgresql.conf, the value of shared_buffers.

    The defaults are low because on some platforms (Solaris and SGI, notably) it requires invasive action like recompiling the kernel in order to use larger values.

    The shared buffers are used to hold query results that are in progress. The default of 512K is quite likely to be low. The "rule of thumb" is to estimate the " right value" to be between 5% and 15% of total system memory.

    Note that this value does not include any OS file caching that may take place; look below at effective_cache_size, which addresses that. Note that any time records are updated, they must be pushed out to disk, thereby invalidating both this cache as well as the OSes filesystem cache.

  • Increase, in postgresql.conf, the value of effective_cache_size.

    This parameter indicates to the query optimizer how much OS filesystem cache you expect to have available. On a server used solely for running PostgreSQL, this might represent most of your system's physical memory.

  • Run VACUUM on database tables frequently.

    Any time records are deleted or updated in a table, this leads to generating " dead tuples" for this table. They are not automatically cleaned out; that requires running VACUUM. If the table is not vacuumed, the postmaster will have to rummage through the dead data, only to discard it as useless. It may be no big deal if there are 10,000 dead tuples in a table with a million entries, but performance will be bad if (say) a table of financial balances has a few hundred live entries combined with thousands of obsolete dead tuples.

    At one time, VACUUM locked whatever table was being cleaned; that has long changed, so that it now does not block activity on the system. Seemingly paradoxically, the more often you run VACUUM, the faster it tends to run, and the more likely it is that it will do some good.

    If there get to be really a lot of dead tuples in a table, there may be more free space than the free space map knows to manage, and a proper cleanup may require the other version, called VACUUM FULL.

    A recently developed program, pg_autovacuum, monitors database updates, and automatically initiates VACUUMs on just those tables that need cleaning.

  • Run ANALYZE on the database periodically.

    PostgreSQL's query optimizer is a cost-based system, and hence needs to have some statistics on what data the various tables on the system contain.

    Until you run ANALYZE, the system may have no better ideas than to do sequential scans of all the tables involved in queries.

    The same is true for other cost-based optimizers; I have also seen Oracle bog down horribly on queries until some statistics built up at which point those queries sped up immensely.

  • Move the WAL files to a separate physical disk.

    That is, set the directory pg_xlog to point to a directory on a separate disk from that which the data is stored on. These files, that contain the most recent updates to the database, are heavily updated, and if you can separate these updates from those concurrently taking place on database tables, you can get an easy performance boost, often on the order of 25%.

    It may also be helpful to ensure text logs are being written to a separate physical device.

    If your requirements are more sophisticated, it may even be worthwhile to move individual database files to separate disks if you have enough disk drives to do so.

    Mind you, if you have such sophisticated requirements, it is likely to make even more sense to build fairly sizable arrays of RAID disk, at which point all the disk activity will be automatically heavily striped. In such a situation, it is fairly likely that separating certain database files to specific disks would lead to less use of striping, and perhaps lower performance a little.

    My boss did some benchmarking where he compared having WAL and the rest of the data all "shared" on a big disk array with splitting off a couple of disks to just do WAL; he couldn't see any measurable performance difference.

  • Increase, in postgresql.conf, the value of sort_mem

    If yo