PostgreSQL uses statistics on the contents of tables to help it decide how to optimize queries.
Consider the statistics for table contact, which consists of some 20,000 contacts, randomly generated.
performance=# \x
Expanded display is on.
performance=# select * from pg_stats where tablename = 'contact' and attname = 'name';
-[ RECORD 1 ]-----+-------------------------------------------------------------------------
schemaname        | public
tablename         | contact
attname           | name
null_frac         | 0
avg_width         | 17
n_distinct        | -0.63596
most_common_vals  | {"Chroseus Warren","Ebotte Williams","Emery Booth","Garnett Preston","Lionel Brooks","Aaron Sims","Adlard Preston","Adrian Day","Agnes Tripp","Albert Brooks"}
most_common_freqs | {0.001,0.001,0.001,0.001,0.001,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
histogram_bounds  | {"Aaron Ayala","Basil Momjian","Daniel Nicholson","Ellis Pratt","Gawin Thompson","Hugh Bauer","Joseph Brown","Margarete Crow","Philip Holmes","Sander Peck","Zachary Petkovic"}
correlation       | -0.00923928
The statistics on individuals' names don't show anything terribly skewed. They are fairly distinct; the negative n_distinct value is a ratio, indicating that the number of distinct names is estimated at about 64% of the number of rows, and the 10 most common names together account for less than 1% of the total (the most_common_freqs values sum to roughly 0.008), so the names are pretty evenly spread out. null_frac is 0, indicating that the statistics collector didn't find a single name that was left blank.
The histogram_bounds field shows us something of what the distribution of names looks like by dividing it into ten roughly equal buckets: the first 10% of the names fall between "Aaron Ayala" and "Basil Momjian", the next 10% between "Basil Momjian" and "Daniel Nicholson", and so forth.
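As a sketch of how the planner uses those bounds (run against the same contact table), a range predicate that covers just the first bucket should be estimated at roughly a tenth of the table:

-- 'Basil Momjian' is the second histogram bound, so the planner's row
-- estimate for this predicate should come out near 10% of the table.
explain select * from contact where name < 'Basil Momjian';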
All in all, this could be a pretty good field to index on, if you do a lot of queries on names.
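Creating such an index is straightforward; in this sketch the index name contact_name_idx is made up, and the name searched for is one of the common values from the statistics above:

-- Hypothetical index on the name column.
create index contact_name_idx on contact (name);
analyze contact;

-- A selective lookup like this can then use the index:
explain analyze select * from contact where name = 'Emery Booth';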
performance=# select * from pg_stats where tablename = 'contact' and attname = 'country';
-[ RECORD 1 ]-----+-----------------------------------
schemaname        | public
tablename         | contact
attname           | country
null_frac         | 0
avg_width         | 6
n_distinct        | 238
most_common_vals  | {US,CA}
most_common_freqs | {0.449333,0.0516667}
histogram_bounds  | {AD,BF,CM,FM,HR,KR,MO,OM,SE,TP,ZM}
correlation       | 0.209325
In contrast, country is heavily skewed. Nearly half the entries in this field are 'US'; about 5% are 'CA'. A query looking for country = 'US' will find it more effective to scan the whole table, since it is likely to need to hit virtually all of its pages anyway.
performance=# explain analyze select * from contact where country = 'US'::character(2);
                                 QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on contact  (cost=0.00..8182.81 rows=81679 width=301) (actual time=407.67..2315.16 rows=79864 loops=1)
   Filter: ((country)::bpchar = 'US'::bpchar)
 Total runtime: 2395.11 msec
(3 rows)
This query selects data for US contacts. It uses whatever query path the optimizer prefers, which turns out to be the Seq Scan. You might be curious as to whether this is truly a better choice than using the index.
We can use the PostgreSQL runtime parameter enable_seqscan to disable it. (When you set it to false, the optimizer puts an outrageously high cost on Seq Scans.) The result is as follows:
performance=# set enable_seqscan to false;  -- Force query to not do Seq Scan
SET
performance=# explain analyze select * from contact where country = 'US'::character(2);
                                 QUERY PLAN
------------------------------------------------------------------
 Index Scan using contact_country_idx on contact  (cost=0.00..273818.66 rows=81679 width=301) (actual time=0.68..4114.60 rows=79864 loops=1)
   Index Cond: ((country)::bpchar = 'US'::bpchar)
 Total runtime: 4171.22 msec
(3 rows)
We can, in principle, improve on this; if most queries on this table are done on the basis of country, then we can use the CLUSTER command to group entries together based on that index.
performance=# cluster contact_country_idx on contact;
CLUSTER
performance=# explain analyze select * from contact where country = 'US'::character(2);
                                 QUERY PLAN
------------------------------------------------------------
 Index Scan using contact_country_idx on contact  (cost=0.00..273800.55 rows=81679 width=301) (actual time=0.42..2386.19 rows=79864 loops=1)
   Index Cond: ((country)::bpchar = 'US'::bpchar)
 Total runtime: 2441.15 msec
(3 rows)
The runtime is quite a lot better than before, although still not better than with the Seq Scan. The data set was small enough, and the system it was running on busy enough, that the relative timings are not statistically significant.
The principle that falls out of this is a pretty typical one: the PostgreSQL query optimizer does a pretty good job, and, so long as it is working with reasonable statistics, it is quite likely to do the "Right Thing." If it's doing a Seq Scan when you were expecting otherwise, you should consider the possibility that the optimizer might well be right.
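If the statistics seem stale or too coarse for a skewed column, the first thing to try is simply refreshing them. The sketch below assumes the contact table from earlier; the statistics target of 100 is an arbitrary illustrative value.

-- Refresh the planner's statistics for this table (plain ANALYZE, with
-- no table named, does the whole database).
analyze contact;

-- Optionally gather finer-grained statistics on a skewed column; 100 is
-- just an illustrative target (the historical default is 10).
alter table contact alter column country set statistics 100;
analyze contact;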
In older (pre-7.4) versions of PostgreSQL, EXISTS clauses would run considerably faster than IN clauses.
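As a sketch of what such a rewrite looks like (the orders table and its contact_id column are hypothetical, invented for illustration), the same question can be phrased either way:

-- Hypothetical example: contacts that have placed at least one order.

-- IN form, which pre-7.4 versions handled poorly for large subqueries:
select name from contact
 where id in (select contact_id from orders);

-- Equivalent EXISTS form, which older versions ran considerably faster:
select name from contact c
 where exists (select 1 from orders o where o.contact_id = c.id);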
One of the "rules" of efficient use of client/server applications is to try to minimize the number of requests submitted; to maximize the amount of useful data processed per query. The usual way you would expect this to be useful would be by replacing a series of SELECT statements that each pull in one row with a single query that pulls all the rows you need.
Thus, if there are ten selects, each with an ID (SELECT * FROM contact WHERE id = 15001; then SELECT * FROM contact WHERE id = 15002; and so forth), it is doubtless less expensive to pull all the rows at once, as in SELECT * FROM contact WHERE id BETWEEN 15001 AND 15011; or SELECT * FROM contact WHERE id IN (15001, 15002, 15003, 15004, ... 15011);.
Selecting 10,000 rows at once will be substantially cheaper than selecting them one at a time, and this should be completely unsurprising.
What you may not have considered is that it might be quicker to JOIN multiple queries together.
-- Do 5 selects
explain analyze select name from contact where id = 20004;
explain analyze select voice from contact where id = 311111;
explain analyze select fax from contact where id = 241111;
explain analyze select cell from contact where id = 32111;
explain analyze select postcode from contact where id = 271111;

-- Join this into one select
explain analyze
  select a.name, b.voice, c.fax, d.cell, e.postcode
    from contact a, contact b, contact c, contact d, contact e
   where a.id = 20004
     and b.id = 311111
     and c.id = 241111
     and d.id = 32111
     and e.id = 271111;
When I ran these, the five queries each took about 0.15 seconds, thus totalling 0.75s. Joining them together led to a runtime of about 0.56s.
That is certainly not an immense improvement. The JOIN introduces some additional costs in parsing and optimizing the more complex query. More likely to disqualify this approach is the fact that the more complex query is more expensive for programmers to manage, and that is a cost that should not be ignored.
The example may be a little frivolous, but if you have a situation where a particularly "shaped" set of data is being selected, a database administrator might set up a VIEW that joins together all the data so that a single SELECT gets at the data efficiently in one query.
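A sketch of that idea follows; the orders table, its columns, and the view name are all hypothetical, invented for illustration:

-- Hypothetical view capturing a commonly requested "shape" of data, so
-- that applications can fetch it with a single SELECT.
create view contact_orders as
  select c.id, c.name, c.country, o.order_date, o.total
    from contact c, orders o
   where o.contact_id = c.id;

-- One query now retrieves the joined data:
select * from contact_orders where id = 20004;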
Using Aggregate Functions and Operators in PostgreSQL describes the MAX() and MIN() aggregates thus:
... this simply prevents you from selecting everything from the database, iterating through each row one by one, and calculating your results by hand.
While it is strictly true that aggregates do avoid this, sparing you the work associated with:
Marshalling and shipping all the data across the network interface;
Processing the data once it has been transferred;
Writing code to process the data;
aggregates are nonetheless not as efficient as we might like, since the postmaster still has to "select everything from the database, iterating through each row one by one". On a large table, this can be very expensive.
Some database systems track such statistics and store them for you, so that querying them is a quick in-memory lookup. PostgreSQL's use of MVCC makes this impractical, since queries that are active concurrently might actually need different values: MAX() and MIN() can vary between the snapshots of the various transactions that are concurrently in progress.
If you are looking for the maximum or minimum value of a particular field that has a useful index, there is a much more efficient query than the aggregate. Consider the following example:
create table names (
  id   serial not null unique,
  name character varying not null unique
);

insert into names (name) values ('Tux the Penguin');
insert into names (name) values ('Puff the Friendly Giant');
insert into names (name) values ('Bill the Cat');
insert into names (name) values ('Schrodinger''s Cat');
-- And a further cast of tens of thousands ...

-- Aggregates - which require a Seq Scan
select min(name), max(name), min(id), max(id) from names;

-- Select via index
select name as min from names order by name limit 1;
select name as max from names order by name desc limit 1;
select id as min from names order by id limit 1;
select id as max from names order by id desc limit 1;
Selecting via the index this way also allows the system to potentially start returning data much earlier, since it only needs to read the first (or last) entry of the index rather than scan the whole table.
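You can watch that difference in the plans; this is a sketch against the names table above, and in the versions discussed here the plans should look roughly as the comments describe:

-- Aggregate form: expect a plan that scans the whole table to compute MAX.
explain analyze select max(id) from names;

-- Index form: expect a backward scan of the unique index on id, stopping
-- after the first row it returns.
explain analyze select id from names order by id desc limit 1;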