We saw that filtering on a column with or without an index changes the way the query planner plans and executes the query.
Indexes are used to speed up queries by facilitating the filtering part for the planner.
The promise is that a condition on an indexed column is faster than on a non-indexed column.
The underlying assumption is that when filtering over an indexed column, that index will always be used.
But in reality this is not always the case.
In this session we look at:
An index is invisible, redundant data that speeds up query execution.
Which means:
If it were a superhero, the index would be a mix of
- the Invisible man
- Flash
- Robin (as a redundant sidekick to Batman)
- and Chaos (for randomness of usage)

First use EXPLAIN to understand what algorithms the optimizer chooses and to detect if there are any slow operations. Then decide on the type of index and whether to create it or not.
So when you want to make a query faster and you see that its explain plan is lacking index support, think about the query in terms of SLA in your application.
Does this query need to run as fast as possible, even when it means that you now have to maintain more indexes?
An index adds overhead to queries and database operations and requires extra storage. So it's not advised to just create a bunch of indexes in the hope that the queries will just speed up magically. Creating an index must have some sort of justification.
Indexes can be created in two ways: implicitly via a constraint, or explicitly with CREATE INDEX.
Adding a UNIQUE or PRIMARY KEY constraint on a column (or set of columns) automatically creates a UNIQUE index on it.
Without an index, checking uniqueness requires a full table scan, which is not efficient for large datasets.
By using a UNIQUE index, PostgreSQL can quickly determine whether a value already exists in the column.
Note that PostgreSQL doesn't create an index automatically for foreign keys, and it's good practice to create one manually. Without an index on the foreign key column, operations like deleting or updating rows in the referenced table can be slow, as PostgreSQL needs to perform a sequential scan on the referencing table to find matching rows.
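As a minimal sketch, for the foreign key we'll meet just below (trees.domain_id referencing tree_domains(id)), such a supporting index could be created with (the index name is just an illustration):

CREATE INDEX idx_trees_domain_id ON trees (domain_id);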
We work here on the denormalized version of the trees database: treesdb_v03
Schema | Name | Type | Owner
--------+----------------+-------+--------
public | locations | table | alexis
public | taxonomy | table | alexis
public | tree_domains | table | alexis
public | tree_genres | table | alexis
public | tree_names | table | alexis
public | tree_species | table | alexis
public | tree_stages | table | alexis
public | tree_varieties | table | alexis
public | trees | table | alexis
public | version | table | alexis
Let's add a UNIQUE constraint on the domain column of the tree_domains table.
alter table tree_domains add constraint unique_domain UNIQUE (domain);
Then we can check for the presence of a new index on that table with \d tree_domains :
Table "public.tree_domains"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+------------------------------------------
id | integer | | not null | nextval('tree_domains_id_seq'::regclass)
domain | character varying | | |
Indexes:
"tree_domains_pkey" PRIMARY KEY, btree (id)
"unique_domain" UNIQUE CONSTRAINT, btree (domain)
Referenced by:
TABLE "trees" CONSTRAINT "fk_tree_domain" FOREIGN KEY (domain_id) REFERENCES tree_domains(id)
Now the table has 2 indexes, one for the primary key and one for the UNIQUE domain.
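We can see the UNIQUE index at work by trying to insert a domain value that already exists (the value 'Alignement' is assumed to be present, just for illustration):

INSERT INTO tree_domains (domain) VALUES ('Alignement');
-- fails with an error like: duplicate key value violates unique constraint "unique_domain"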
Note:
The unique_domain UNIQUE CONSTRAINT appears as a btree index.
To sum up:
Adding a UNIQUE constraint on a column automatically creates a B-tree index on that column.
You can create an index on any columns or sets of columns with
CREATE INDEX index_name ON table (column);
or with (column1, column2, column3, ...) for indexes on multiple columns.
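For instance, a composite index on two columns of the trees table could look like this (the column choice is purely illustrative):

CREATE INDEX idx_trees_domain_stage ON trees (domain_id, stage_id);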
By default this creates a B-tree index, which is the most common type of index in PostgreSQL.
You can create other types of indexes with USING type_of_index (column). For instance a HASH index:
CREATE INDEX name ON table USING HASH (column);
PostgreSQL offers multiple types of indexes (see chapter 11 of the documentation).
We'll focus on B-tree and Hash Indexes.
Let's start with the most common index used in PostgreSQL, the B-tree or balanced tree, and see how it works.
When we do a Sequential Scan the query engine scans all the rows sequentially. This lookup has a cost of O(n) for n rows. The time it takes to look something up doubles when the data doubles.
With a B-tree index, the search has a cost of O(log(n)): when the data doubles, the search time grows only by a small additive amount (log(2)) rather than doubling.
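To put numbers on it: for n = 1,000,000 rows, a sequential scan inspects all one million rows, whereas a balanced-tree lookup needs on the order of log2(1,000,000) ≈ 20 comparisons; doubling the table to 2,000,000 rows adds only one more comparison (log2(2,000,000) ≈ 21).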
The B-tree index has the structure of a ... tree! But upside down.
The B-tree index organizes the information in hierarchical blocks, following an upside-down tree.
A B-tree index in PostgreSQL consists of the following key elements: nodes, keys, and pointers.
Root Node: The topmost node that serves as the entry point for all searches.
Internal Nodes: Nodes between the root and leaf nodes, used for routing searches by storing pointers and key ranges.
Leaf Nodes: The bottom-level nodes that store the actual index entries (key-value pairs) and pointers to the corresponding table rows.
Keys: Ordered values in the nodes used to guide searches, ensuring efficient data retrieval.
Pointers: Links between nodes (in internal nodes) and to table rows (in leaf nodes) for fast navigation through the index.

B-tree stores data such that each node contains keys in ascending order.
Each of these keys has two references to two child nodes.
The block size is 8 kB in PostgreSQL. An 8 kB block can hold dozens of index records.
So:
An index with six to seven levels can accommodate billions of index records!
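For example, assuming each 8 kB block holds around 100 index entries, a tree only 5 levels deep can already address about 100^5 = 10 billion entries (the exact figures depend on key size).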
B-trees are very powerful. They have huge storage capacity and allow for super fast retrieval
B-trees follow certain rules that make them super efficient.
Let's say we have an index on the domain column of the trees table.
A query like SELECT * FROM trees WHERE domain = 'garden'; initiates the search using the domain B-tree index.
The engine compares the search key ('garden') with the key ranges stored in the internal nodes, following pointers down to the appropriate child nodes. It then locates the key ('garden') in a leaf node and retrieves the associated row pointers.
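A minimal way to observe this (the index name is hypothetical and the exact plan depends on your data):

CREATE INDEX idx_trees_domain ON trees (domain);
EXPLAIN SELECT * FROM trees WHERE domain = 'garden';
-- with the index in place, the plan typically shows an Index Scan or
-- Bitmap Index Scan on idx_trees_domain instead of a Seq Scan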
Although update, insert and delete queries force some reorganization of the index elements, the impact is limited thanks to the B-tree structure rules.
Each time a new entry is inserted, these rules are applied to keep the index consistent.
So the modification does not add significant overhead for index reorganization.
In the worst case, the number of modified blocks cannot exceed the depth of the tree.
In PostgreSQL, a B-tree index can be created for any ordinal data type; that is, for any two distinct values of the data type, one value is less than the other. This includes user-defined types.
However in some cases B-trees are not efficient:
For instance, a column that changes very frequently, such as a constantly updated datetime, can be too dynamic for B-tree indexing to pay off. The documentation says:
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of these operators: < <= = >= > Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also be implemented with a B-tree index search. Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index.
So B-trees are efficient when we want to search using a filter on an ordered column, typically a categorical column with many values, or a numeric column.
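As a quick sketch using the domain column from the example above (the filter values are illustrative), all of these conditions can use the same B-tree index:

SELECT * FROM trees WHERE domain = 'garden';
SELECT * FROM trees WHERE domain IN ('garden', 'forest');
SELECT * FROM trees WHERE domain >= 'a' AND domain < 'h';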
We'll use the airdb database,
available here. The most recent dump is 1.32 GB.
The schema is

The airdb database is part of the excellent book PostgreSQL Query Optimization (Apress, 2021) by Henrietta Dombrovskaya, Boris Novikov, and Anna Bailliekova (GitHub, book).
The flight table (640k+ rows) contains 6734 flights departing from Logan Airport in Boston (departure_airport = 'BOS').
Let's check out the performance of the query with and without a B-tree index on the departure_airport attribute.
EXPLAIN select * from flight where departure_airport = 'BOS';
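The B-tree index used for this comparison can be created with a statement along these lines (the index name is just an illustration):

CREATE INDEX idx_flight_departure_airport ON flight (departure_airport);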
Adding an index reduced the execution time by a factor of 3.
If we look at an airport with much less traffic (for instance AGA has only 78 departures), then the query time is reduced by a factor of ... 140!!!
Before we dive into Hash indexes, we need to understand what a hash function is.
A hash function is a fundamental concept in computer science with wide-ranging applications.
A hash function is an algorithm that takes an input (or 'message') of arbitrary size and produces a fixed-size output, typically a string of characters or a number.
This output is called a hash value, hash code, or simply a hash.

The key properties of a good hash function are:

from https://wizardzines.com/comics/hash-functions/
Hash strings are used to uniquely identify Git commits.

Hash functions are used in many computer science domains, including hash tables, cryptography, checksums, and caching.
Being able to map large and complex data to a simple fixed size string quickly allows for super fast comparisons on all kinds of data.
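You can get a feel for this directly in PostgreSQL with the built-in md5() function (just an illustration of hashing in general, not the 32-bit hash code that hash indexes use internally):

SELECT md5('hello');   -- returns a fixed-size 32-character hexadecimal string
SELECT md5('hello!');  -- a tiny change in the input yields a completely different hash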
From the documentation:
Hash indexes store a 32-bit hash code derived from the value of the indexed column.
As a direct consequence of the nature of hash strings, HASH indexes can only handle simple equality comparisons.
The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the = operator.
So what's a hash index and why can't we just use a B-tree index ?
A hash index is a type of database index that uses a hash table data structure to store and retrieve data.
It's designed for equality comparisons and can be very efficient for certain types of queries.
Key characteristics of hash indexes:
When to use hash indexes:
When not to use hash indexes:
In short, HASH indexes work because it's faster to compare two fixed-size hash codes such as
b359f1067efa76f37863778f7b6d0e8d911e3ee8efa807ad01fbf5dc1ef9006b
xeefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d
than to check equality for
[email protected]
[email protected]
Example of creating a hash index in PostgreSQL:
CREATE INDEX idx_<name of the index> ON <table> USING HASH (<column>);
The only thing that changes compared to a B-tree index is the keyword USING HASH after the table name.
A HASH index is useful for queries like:
SELECT * FROM users WHERE email = '[email protected]';
It's worth noting that in many cases, especially in PostgreSQL, B-tree indexes are often preferred over hash indexes due to their versatility.
Modern B-tree implementations are highly optimized and can perform nearly as well as hash indexes for equality comparisons while also supporting range queries and sorting.
Before deciding to use a hash index, it's recommended to benchmark your specific use case to ensure it provides a significant performance benefit over a B-tree index.
Other examples where HASH indexes make sense include
- Equality queries on large tables: when you frequently perform exact-match queries on a column with many unique values, for instance email_address (varchar), a HASH index can outperform a B-tree index:

SELECT * FROM Users WHERE email_address = '[email protected]';

- Join operations on unique columns: when joining tables based on columns with unique values, for instance employee_id (in the Employees table) and manager_id (in the Departments table), HASH indexes can speed up the operation:

SELECT * FROM Employees e JOIN Departments d ON e.employee_id = d.manager_id;

It's important to note that HASH indexes are only useful for equality comparisons. They don't support range queries, pattern matching, or ordering operations. In these cases, B-tree indexes would still be preferred.
Let's connect to the airdb database on the remote server.
The booking table has over 5 million rows and the following columns:
Table "postgres_air.booking"
Column | Type | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
booking_id | bigint | | not null |
booking_ref | text | | not null |
booking_name | text | | |
account_id | integer | | |
email | text | | not null |
phone | text | | not null |
update_ts | timestamp with time zone | | |
price | numeric(7,2) | | |
Indexes:
"booking_pkey" PRIMARY KEY, btree (booking_id)
"booking_booking_ref_key" UNIQUE CONSTRAINT, btree (booking_ref)
Foreign-key constraints:
"booking_account_id_fk" FOREIGN KEY (account_id) REFERENCES account(account_id)
Referenced by:
TABLE "booking_leg" CONSTRAINT "booking_id_fk" FOREIGN KEY (booking_id) REFERENCES booking(booking_id)
TABLE "passenger" CONSTRAINT "pass_booking_id_fk" FOREIGN KEY (booking_id) REFERENCES booking(booking_id)
Let's pick a random phone number and compare the query plans before and after creating a HASH index.
select phone from booking order by random() limit 1;
gives us the phone number 1092288610. This phone corresponds to 2729 rows.
select count(*) from booking where phone='1092288610';
EXPLAIN select * from booking where phone='1092288610';
The query plan is
EXPLAIN ANALYZE select * from booking where phone='1092288610';
QUERY PLAN
---------------------------------------------------------------------------------
Gather (cost=1000.00..119071.27 rows=955 width=92) (actual time=5.399..769.893 rows=2729 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on booking (cost=0.00..117975.77 rows=398 width=92) (actual time=12.208..707.440 rows=910 loops=3)
Filter: (phone = '1092288610'::text)
Rows Removed by Filter: 1880162
Planning Time: 0.087 ms
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.288 ms, Inlining 0.000 ms, Optimization 0.904 ms, Emission 32.169 ms, Total 34.360 ms
Execution Time: 770.573 ms
So:
Let's create a HASH index with
CREATE INDEX hash_on_phone ON booking USING HASH (phone);
⏳⏳⏳ Takes a bit of time ⏳⏳⏳ (~ 15940.361 ms)
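Indexes also consume disk space; you can check how much this one takes with (the figure will of course depend on your data):

SELECT pg_size_pretty(pg_relation_size('hash_on_phone'));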
And let's re-run the same analysis
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on booking (cost=31.40..3548.20 rows=955 width=92) (actual time=1.242..47.670 rows=2729 loops=1)
Recheck Cond: (phone = '1092288610'::text)
Heap Blocks: exact=2686
-> Bitmap Index Scan on hash_on_phone (cost=0.00..31.16 rows=955 width=0) (actual time=0.729..0.738 rows=2729 loops=1)
Index Cond: (phone = '1092288610'::text)
Planning Time: 7.061 ms
Execution Time: 47.952 ms
Wow!
The actual time is down from 770 ms to 48 ms, a factor of 16!
The algorithm is no longer a Seq Scan but a pair of Bitmap Index Scan on hash_on_phone followed by a Bitmap Heap Scan.
Also note that the query plan became simpler and didn't require parallel processing or JIT compilation.
The GATHER node is a part of PostgreSQL's parallel query execution framework. Its primary functions are:
Key characteristics:
Limitations:
When you see a GATHER node, it indicates that PostgreSQL is attempting to speed up the query by distributing the workload across multiple CPU cores.
Now, let's analyze why the GATHER node is present in the first query plan (without the HASH index) but not in the second one (with the HASH index):
Gather (cost=1000.00..119071.45 rows=957 width=92) (actual time=4.937..888.074 rows=1955 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on booking (cost=0.00..117975.75 rows=399 width=92)
Here, PostgreSQL decided to use parallel query execution. The GATHER node is coordinating two worker processes to perform a Parallel Sequential Scan on the booking table. This decision was made because:
a) The table is large (over 5 million rows). b) Without an index, a full table scan is necessary. c) The full table scan is an operation that can benefit from parallelization.
Bitmap Heap Scan on booking (cost=31.42..3555.31 rows=957 width=92)
-> Bitmap Index Scan on idx_booking_phone_hash (cost=0.00..31.18 rows=957 width=0)
Here, there's no GATHER node. PostgreSQL didn't choose to use parallel execution because:
a) The HASH index allows for efficient direct access to the relevant rows. b) The amount of work to be done (scanning the index and fetching specific rows) is much less than a full table scan. c) The overhead of starting and coordinating parallel workers would likely outweigh the benefits for this faster operation.
The presence of the GATHER node in the first plan shows that PostgreSQL was trying to optimize the expensive full table scan by parallelizing it. However, once the HASH index was available, the query became so much more efficient that parallelization was no longer beneficial.
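If you want to convince yourself that the first plan's speed really comes from parallelism, you can temporarily disable parallel workers in your session and re-run the EXPLAIN ANALYZE without the index (a sketch; remember to reset the setting afterwards):

SET max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE select * from booking where phone='1092288610';
RESET max_parallel_workers_per_gather;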
Let's also compare to a B-tree index to see if it's true that B-trees are that efficient.
First drop the HASH index.
drop index hash_on_phone;
and create a B-tree INDEX
CREATE INDEX idx_booking_phone ON booking (phone);
Check that a B-tree index was indeed created with \d booking:
Indexes:
"booking_pkey" PRIMARY KEY, btree (booking_id)
"booking_booking_ref_key" UNIQUE CONSTRAINT, btree (booking_ref)
"idx_booking_phone" btree (phone)
How does the query perform?
Bitmap Heap Scan on booking (cost=11.83..3528.63 rows=955 width=92) (actual time=0.937..4.060 rows=2729 loops=1)
Recheck Cond: (phone = '1092288610'::text)
Heap Blocks: exact=2686
-> Bitmap Index Scan on idx_booking_phone (cost=0.00..11.59 rows=955 width=0) (actual time=0.504..0.505 rows=2729 loops=1)
Index Cond: (phone = '1092288610'::text)
Planning Time: 0.368 ms
Execution Time: 4.183 ms
Well ... the B-tree executes the same query in 4 ms instead of 48 ms, nearly another factor of 10. Compared to the query without any index, that's a speedup factor of 187!
With a phone number that matches only one row (for instance 987765263), the results are:
| Index type | Execution Time | Boost factor |
|---|---|---|
| No index | 939.108 ms | - |
| HASH | 0.084 ms | 11190 |
| B-tree | 0.137 ms | 6855 |
In that case, when the equality condition returns only 1 row (or just a few), the HASH index beats the B-tree! Woohoo Go HASH INDEX!
The boost factors are just insane!
This comparison highlights that while HASH indexes can be very fast for equality lookups, they don't always lead to better overall performance.
Factors like planning time, accuracy of statistics, and how well the database can optimize the rest of the query execution all play a role.
In practice, you'd want to test with your specific data and query patterns to determine which index type performs best for your use case.
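A simple way to run such a benchmark in psql is to combine \timing with EXPLAIN ANALYZE and repeat each query a few times to smooth out caching effects (a sketch):

\timing on
EXPLAIN ANALYZE select * from booking where phone='1092288610';
-- repeat several times, then drop the index, create the other type, and compare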
We won't go into these other types of indexes. Just know that they exist.
From the documentation: GiST indexes are an infrastructure within which many different indexing strategies can be implemented.
GiST (Generalized Search Tree) indexes are useful for indexing complex data types and implementing custom indexing methods. GiST indexes allow us to define how data should be arranged in the index and how searches should be conducted.
GiST can be used for geometric data types, range types, full-text search, and other complex data types.
Let's look at a simple example with geometric data, using the geolocation column of the trees table.
Now, let's create a GiST index on the geolocation column:
CREATE INDEX trees_geolocation_idx ON trees USING GIST (geolocation);
With this index in place, we can efficiently perform spatial queries, such as finding trees within a certain distance of a given point:
SELECT id, geolocation
FROM trees
WHERE geolocation <@ circle '((lat, long), radius)'::circle;
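For example, with illustrative coordinates, and assuming geolocation is stored as a point in (longitude, latitude) order:

SELECT id, geolocation
FROM trees
WHERE geolocation <@ circle '((2.35, 48.85), 0.01)';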
Assume you have a products table with a tags column of type TEXT[].
You can create a GiST index on the tags array
CREATE INDEX idx_product_tags ON products USING GIST (tags);
and query to find products that have all specified tags
SELECT name, tags FROM products WHERE tags @> ARRAY['electronics', 'portable'];
or any of the specified tags
SELECT name, tags FROM products WHERE tags && ARRAY['footwear', 'kitchen'];
In short use GiST when your query involves these operators
<< &< &> >> <<| &<| |&> |>> @> <@ ~= &&
An extension of GiST for more exotic data structures, such as quadtrees, k-d trees, and radix trees
Also for Arrays but limited to a smaller set of operators:
@> <@ = &&