Indexing Practice Lab

B-tree, Hash indexes, query performance, INSERT overhead

Practice on the Energy database

See ./sql/create_database_energydb.sql to create :

see also ./docs/06/S06.03.energy-db-a-new-database.md for an explanation.

Indexing Exercises

The goal of this exercise is to illustrate the impact of B-tree and Hash indexes on a table attribute.

Start by EXPLAIN ANALYZE a simple select query and note the real performance of the query in terms of cost and total execution time.

We’ll add indexes (B-tree and Hash) on that column to show that adding indexes can or sometimes cannot improve the query performance.

We’ll look at different types of filtering :

and compare the EXPLAIN ANALYZE plans.

We’ll also look at INSERTS queries to understand the overhead brought by the presence of indexes.

Keep in mind

Here are some key points for you to keep in mind:

  1. Hash indexes are generally best for equality conditions, while B-tree indexes are versatile and can handle equality, range, and sorting operations.

  2. The effectiveness of an index depends on various factors, including the selectivity of the query (i.e., how many rows it’s expected to return relative to the total number of rows in the table).

  3. Sometimes, the query planner might choose not to use an index if it determines that a sequential scan would be faster (e.g., when returning a large portion of the table).

  4. Multi-column indexes can be particularly useful for queries that filter on multiple columns simultaneously.

  5. Function-based indexes can improve performance for queries that use functions or expressions in their WHERE clauses.

The database

The results will depends on the volume of data your version of the energydb has generated.

If you don’t see major differences in the performance time of the queries with or without indexes, then don’t hesitate to generate more data using the generate_daily_production and generate_production_entities functions.

The results presented in this document are based on 400k production entities.

Exercise 1: Equality Condition with Hash Index vs. B-tree Index

The candidate query is :

SELECT * FROM production_entities
WHERE energy_source_id = 1;

There are only 6 types of energy sources in the database. So this query will return quite a large amount of rows.

Our baseline consists of this query without any index on the energy_source_id column.

No Index

Run the following query and use EXPLAIN ANALYZE to evaluate its performance:

EXPLAIN ANALYZE SELECT * FROM production_entities
WHERE energy_source_id = 1;

results : here’s a possible query plan (yours may differ)

 Seq Scan on production_entities  (cost=0.00..20187.25 rows=66643 width=81) (actual time=0.009..57.848 rows=66692 loops=1)
   Filter: (energy_source_id = 1)
   Rows Removed by Filter: 333408
 Planning Time: 0.055 ms
 Execution Time: 61.401 ms

Interpretation : as expected, the planner uses a Sequential Scan to scan all the table rows and apply the filtering condition. The baseline query time is 61.401 ms with a cost of 20187.25.

Add B-tree

Create a B-tree index with

create index idx_energy_source_id on  production_entities (energy_source_id);

check that the index has been created. \d production_entities should inclued this line

Indexes:
    "idx_energy_source_id" btree (energy_source_id)

Now repeat the EXPLAIN ANALYZE query.

results : A possible query plan (yours may differ)

 Bitmap Heap Scan on production_entities  (cost=744.91..16763.94 rows=66643 width=81) (actual time=5.098..23.919 rows=66692 loops=1)
   Recheck Cond: (energy_source_id = 1)
   Heap Blocks: exact=5981
   ->  Bitmap Index Scan on idx_energy_source_id  (cost=0.00..728.25 rows=66643 width=0) (actual time=4.120..4.121 rows=66692 loops=1)
         Index Cond: (energy_source_id = 1)
 Planning Time: 0.373 ms
 Execution Time: 29.159 ms

Interpretation : as expected, the planner uses a Bitmap Index Scan + Bitmap Heap Scan (since there are quite a lot of rows in the result) leveraging the index. The time has gone down to 29.159 ms and the cost to 16763.94.

adding the index reduces the query execution time by a rough factor of 2

Add Hash Index

Let’s do the same but now with a Hash Index.

But first drop the b-tree inedx with

drop index idx_energy_source_id;

Create the hash index with

create index idx_energy_source_id  on  production_entities USING HASH (energy_source_id);

and repeat the EXPLAIN ANALYZE of the query.

results : A possible query plan (yours may differ)

 Bitmap Heap Scan on production_entities  (cost=2196.48..18215.52 rows=66643 width=81) (actual time=4.966..27.260 rows=66692 loops=1)
   Recheck Cond: (energy_source_id = 1)
   Heap Blocks: exact=5981
   ->  Bitmap Index Scan on idx_energy_source_id  (cost=0.00..2179.82 rows=66643 width=0) (actual time=3.797..3.797 rows=66692 loops=1)
         Index Cond: (energy_source_id = 1)
 Planning Time: 0.281 ms
 Execution Time: 32.655 ms

Interpretation : the Hash Index brings a similar performance boost to a B-tree index with a cost of 18215 and an execution time of around 30ms.

Conclusion

When the number of records is significant there is not a lot of difference between Hash and B-tree indexes on a categorical column condition.

Exercise 2: Join query

Let’s do the same exercise on a query that produces the same result but that uses the values of the energy_sources table.

The query is :

EXPLAIN ANALYZE SELECT *
FROM production_entities pe
join energy_sources es on es.id = pe.energy_source_id
WHERE es.source_name = 'Solar';

Exercise 3: Range Condition with B-tree Index

Run the following query and use EXPLAIN ANALYZE to analyze its performance:

EXPLAIN ANALYZE
SELECT * FROM production_entities
WHERE capacity_mw BETWEEN 100 AND 500;

Now create a B-tree index on the capacity_mw column:

CREATE INDEX idx_btree_capacity_mw ON production_entities USING BTREE (capacity_mw);

Run the EXPLAIN ANALYZE query again and compare the results. What is the index not put to use ?

Now reduce the range in the query until the the planner uses the index.

What kind of boost do you get in terms of execution time ?

Exercise 4: Multi-column Index

Run the following query and use EXPLAIN to analyze its performance:

EXPLAIN ANALYZE
SELECT * FROM production_entities
WHERE country_id = 1 AND built_date > '2010-01-01';

Create a multi-column B-tree index:

CREATE INDEX idx_btree_country_built_date ON production_entities USING BTREE (country_id, built_date);

Exercise 5: Function-based Index

Run the following query and use EXPLAIN to analyze its performance:

EXPLAIN ANALYZE
SELECT * FROM production_entities
WHERE EXTRACT(YEAR FROM built_date) = 2015;

Create a function-based index:

CREATE INDEX idx_btree_built_year ON production_entities USING BTREE (EXTRACT(YEAR FROM built_date));

Run the EXPLAIN ANALYZE query again and compare the results.

Exercise 6: HASH Index on JOIN Condition

Run the following query and use EXPLAIN ANALYZE to analyze its performance:

EXPLAIN ANALYZE
SELECT pe.entity_name, epd.date, epd.energy_produced_mwh
FROM energy_production_daily epd
JOIN production_entities pe ON epd.production_entity_id = pe.id
WHERE epd.date BETWEEN '2023-01-01' AND '2023-12-31';

Create a HASH index on the production_entity_id column in the energy_production_daily table:

CREATE INDEX idx_hash_production_entity_id ON energy_production_daily USING HASH (production_entity_id);

Run the EXPLAIN ANALYZE query again and compare the results.

Optionally, create a B-tree index on the date column to further improve performance:

CREATE INDEX idx_btree_date ON energy_production_daily USING BTREE (date);

Run the EXPLAIN ANALYZE query a third time and compare the results.

For this exercise, students should:

Other experiments

You can experiment with other types of queries with different

and also increasing the volume of generated data in the tables production_entities and energy_production_daily;