Postgresql windows functions and CTEs

Windows functions

Window function

What are Window functions ?

A window function ( also referred to as Analytic Function) performs a calculation across a set of table rows that are somehow related to the current row. … However, window functions do not cause rows to become grouped into a single output row (not like group by). Instead, the rows retain their separate identities.

The window function is being applied to more rows than just the current row of the query result.

Let’s say we run a SUM() of some column on a whole table and then run a select query on the database.

The goal is not only to calculate the SUM() …

SELECT SUM(some_column ) from some_table;
# 12

… but to have the result as an extra column collated to the rows returned by the query.

col 1 col 2 result of SUM()
abc cde 12
wer sdf 12
zyh yup 12

Aggregate functions like MAX(), MIN(), AVG(), SUM(), COUNT() etc can also be used as window functions. And then functions such as ROW NUMBER(), RANK(), DENSE RANK(), LEAD(), LAG() are specific to window functions.

These functions are commonly used across most of the popular RDBMS such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server etc.

Windows function use 2 clauses:

start by loading the data

Attention:

We need to work on the worldhits database in its original denormalized version. Just one track table. Reload the csv file into a newly created worldhitsdb database if needed.

psql into worldhitsdb on your local

Max popularity and the OVER() clause

Say, we want to find the maximum popularity for all tracks.

select max(popularity) as max_pop from tracks t

Now we want to add that max_pop value as a column when returning all the rows.

select t.track, t.artist, t.popularity, (select max(popularity) from tracks t)  as max_pop  from tracks t;

Adding subqueries for extra columns is fine but makes queries hard to understand. (and is not super efficient)

We can have the exact same result by using the OVER() clause in the previous query

select t.track, t.artist, t.popularity,
max(popularity) OVER() as max_pop
from tracks t ;

Here the OVER() clause just creates a window OVER the whole table and applies the MAX() function OVER that window.

Max Popularity by Artist

Now we want to find the max popularity for each artist. Instead of just returning a 2 column results with max(pop) and artist name that we get with a GROUP BY

select max(popularity) as max_pop, artist from tracks group by artist
max_pop artist
36 Afro Celt Sound System
46 Al Di Meola
53 Ali Farka Touré

… we want to return more information per row in the db and add the extra max_pop column calculated for each artist.

We want to have:

id artist track max_pop
301 Salif Keita Sina 32
302 Salif Keita Tekere 32
302 Salif Keita something else 32
203 Jónsi Tornado 30
201 Jónsi Kolniður 30

Partition BY

This is where partition by comes in:

partition by <column name> specifies the window, the subset of rows, over which to apply the max() function.

This query add max_pop by artist column to the list of tracks.

select t.track, t.artist, t.popularity,
max(popularity) over(partition by artist) as max_pop
from tracks t ;

In short, the OVER() clause specifies to SQL that you need to create a a subset of records

In the above query,

General case

The general query is :

SELECT <list of columns>,
SOME_FCT(<main_column_name>) OVER(PARTITION BY <other_column_name> ) AS some_alias
FROM <table> ;

This works with other aggregate functions such as: max(), min(), avg(), count(), sum()

Stats

So we can combine them to get some stats for each artist.

select t.artist, t.popularity,
max(popularity) over(partition by artist) as max_pop,
min(popularity) over(partition by artist) as min_pop,
avg(popularity) over(partition by artist) as avg_pop
from tracks t ;

functions specific to window functions

There are 5 functions that are specific to window functions.

They are used to

ROW_NUMBER()

ROW_NUMBER() assigns a unique value to each records in each window

Without specifying a partition, we get an index over all the records

select t.id,  t.artist,
row_number() over() as rn
from tracks t limit 10;

rn is exactly the same as the current id since id is sequential without gaps.

Note that, if the id, the primary key, was not sequential, you could use that to generate a new primary key that is sequential.

More interestingly if we want to restart the row number per artist, we add PARTITION BY artist in the OVER clause:

select t.id, t.artist, LEFT(t.track,20),
row_number() over(partition by artist) as rn
from tracks t
order by artist asc
limit 20;

The rn is reset to 1 for each artist.

When is that useful ?

When ordering the rows by some column, you can fetch the top or bottom artists.

For instance, let’s fetch

the top 2 tracks with the highest tempo from each artist

by:

  1. add order by tempo desc in the OVER() clause
select t.id, t.artist, LEFT(t.track,10), t.tempo,
row_number() OVER(partition by artist order by tempo desc) as rn
from tracks t
order by artist asc ;
  1. then use that as a subquery (called subset) and filter
select * from (
    select t.id, t.artist, LEFT(t.track,10), t.tempo,
    row_number() over(partition by artist order by tempo desc) as rn
    from tracks t
    order by artist asc
    ) subset
where subset.rn < 3;

Deleting duplicates

ROW_NUMBER() is also useful to find and delete duplicates in a table.

A general duplicates deletion query would look like this

WITH duplicates AS (
    SELECT
       id,
      ROW_NUMBER() OVER (
        PARTITION BY <columns that are duplicated>
        ORDER BY insert_date DESC
      ) AS rn
    FROM trees
)
DELETE FROM trees
WHERE id IN (
    SELECT id FROM duplicates
    WHERE rn > 1
);

Using a a named subquery (WITH duplicates AS) or CTE and partitioning by the columns that are representative of duplicated rows and then deleting from the results of that subquery where the row number indicates a duplicate (i.e. rn > 1).

RANK()

To fetch the most popular tracks each year and number the tracks by popularity per year :

select * from (
    select t.id, t.artist, LEFT(t.track,10), t.year, t.popularity,
    row_number() over(partition by year order by popularity desc) as rn
    from tracks t) X
where X.rn < 3;

but some tracks have the same popularity score andd yet have different row numbers.

Using rank() we can assign equal row numbers to tracks with equal popularity scores.

select t.id, t.artist, t.track, t.year, t.popularity,
rank() over(partition by year order by popularity desc) as rnk
from tracks t;

Equal values will have the same rank (see year 2005);

so rank can be 1, 2, 2, 4

so top 2 most popular tracks per year

select * from (
    select t.id, t.artist, t.track, t.year, t.popularity,
    rank() over(partition by year order by popularity desc) as rnk
    from tracks t
) pop
where pop.rnk < 2;

1985 and 2005 have 2 tracks ex-aequo for most popular.

dense_rank()

dense_rank() similar to rank but will increment the rank without jumps in numbers

We can compare row_number(), rank() and dense_rank() in the following query

select t.id, t.artist, t.track, t.year, t.popularity,
row_number() over(partition by year order by popularity desc) as rn,
rank() over(partition by year order by popularity desc) as rnk,
dense_rank() over(partition by year order by popularity desc) as drnk
from tracks t
where t.year in (1985, 2005);

And now a release date

To illustrate LAG and LEAD, we need an album release date column.

We could use id as proxy but … meh :(;

Let’s create a fake album release date using:

First create or fetch the day, month, year values for each track

select t.id, LEFT(t.track,20),
    t.year as rel_year,
    floor(random() * 12 + 1) as rel_month,
    row_number() over ( partition by t.year) as rel_day
from tracks t
order by album;

We need to add the release_date column as DATE data type to the tracks table

see https://www.postgresql.org/docs/current/datatype-datetime.html

ALTER TABLE tracks add column release_date DATE ;

and now :

We use the PostgreSQL MAKE_DATE() function to transform a set of strings (year, month, day) .

By the way, to get info on a function in PostgreSQL you can \df function_name:

\df make_date

returns

                                      List of functions
   Schema   |   Name    | Result data type |           Argument data types            | Type
------------+-----------+------------------+------------------------------------------+------
 pg_catalog | make_date | date             | day integer, month integer, day integer | func

Notice that the function accepts only INTs as input values.

So we need to cast the year, month, day values as INTs

Then we use a CTE to define the subquery

WITH subquery AS (
    SELECT t.id, t.track,
           t.year as rel_year,
           floor(random() * 10 + 1)::int as rel_month,
           row_number() over (partition by t.year) as rel_day
    FROM tracks t
    ORDER BY album
)

UPDATE tracks
SET release_date = MAKE_DATE(rel_year::int, rel_month::int, rel_day::int)
FROM subquery sq
WHERE tracks.id = sq.id;

If you don’t cast the rel_day, rel_month, … as int.

you get

ERROR:  42883: function make_date(integer, double precision, bigint) does not exist

LEAD() and LAG()

Now that we have a release date for each track we can demo lead() and lag()

We want to find if the popularity of a track is higher, lower or equal than the previously released track year after year.

start with getting the popularity of tracks year by year

select t.id, LEFT(t.track, 10), t.release_date, t.popularity,
lag(popularity) over(partition by year order by release_date) as prev_track_pop
from tracks t
where year BETWEEN 1964 and  1970 ;

We see null values when the album was first that year and the previous row popularity in the column prev_track_pop

Note: if we don’t partition by year, but just use OVER(order by release_date), we have just one NULL value for the first row.

Now we can find out if the popularity was higher, lower or equal than the previously released track.

select t.id, LEFT(t.track,10), t.release_date, t.popularity,
case when t.popularity > lag(popularity) over(order by release_date) then 'higher'
when t.popularity < lag(popularity) over(order by release_date) then 'lower'
when t.popularity = lag(popularity) over(order by release_date) then 'equal'
else 'unknown'
END as pop_delta
from tracks t
where year BETWEEN 1964 and  1970;

LAG() and LEAD() take arguments:

so lag(popularity, 2, 0) looks 2 rows before the currrent row and 0 is a default value

You can also set a column name as the default value. For instance

Then the default value will be the current row popularity

LEAD() is the same as LAG() except that it gets the value of the row(s) following the current record.

Further readings