In this lesson we look at
Relational database come from relational algebra theory.
A small example:
In that context, a database table is a subset of all the possible combinations of the values of its columns. (in our case above with 2 sets of 3 elements each, we get 9 elements in total)
For instance, a table with columns A and B that can respectively take the values {1,2,3} and {a,b,c} is :
| A | B |
| --- | --- |
| 1 | c |
| 1 | b |
| 3 | a |
| 2 | b |
One important concept from relation theory, is that
an operation on a relation produces ... another relation!
This is called the CLOSURE property
When you query a table, the result is another table (a relation). This is fundamental to why SQL works the way it does.
Because the output is always a relation, you can use it as input for another operation:
-- Each step produces a relation (table-like result)
SELECT * FROM trees -- produces a relation
WHERE height > 15 -- produces a relation
ORDER BY circumference DESC -- produces a relation
LIMIT 10; -- produces a relation
Since a query returns a relation, you can use it wherever you would use a table:
-- The subquery returns a relation, so we can query it like a table
SELECT arrondissement, avg_height
FROM (
SELECT
l.arrondissement,
AVG(t.height) as avg_height
FROM trees t
JOIN locations l ON t.location_id = l.id
GROUP BY l.arrondissement
) AS arrond_heights
WHERE avg_height > 20;
WITH tall_trees AS ( -- produces a relation
SELECT * FROM trees
WHERE height > 20
),
remarkable_tall_trees AS ( -- operates on a relation, produces a relation
SELECT * FROM tall_trees
WHERE remarkable = true
)
SELECT * FROM remarkable_tall_trees; -- operates on a relation
SELECT id, height, circumference
FROM trees
WHERE height IS NOT NULL;
This result IS A TABLE with columns: id, height, circumference
We can treat it like any other table!
WITH tree_measurements AS (
SELECT id, height, circumference
FROM trees
WHERE height IS NOT NULL
)
-- Now query this relation like it's a regular table
SELECT
COUNT(*) as tree_count,
AVG(height) as avg_height,
MAX(circumference) as max_circumference
FROM tree_measurements;
A relation has:
Closure means:
This creates a composable system where you can build complex queries from simple parts:
-- Each WITH clause: relation → operation → relation
WITH
-- Relation 1: Large trees
large_trees AS (
SELECT * FROM trees WHERE height > 25
),
-- Relation 2: Remarkable large trees (operates on relation 1)
remarkable_large AS (
SELECT * FROM large_trees WHERE remarkable = true
),
-- Relation 3: With location info (operates on relation 2)
with_locations AS (
SELECT
r.*,
l.arrondissement
FROM remarkable_large r
JOIN locations l ON r.location_id = l.id
)
-- Final operation on relation 3
SELECT
arrondissement,
COUNT(*) as remarkable_large_count
FROM with_locations
GROUP BY arrondissement;
If operations didn't return relations, you couldn't chain them:
❌ If SELECT returned just a number → can't join it
❌ If GROUP BY returned a list → can't filter it
✅ All operations return relations → infinitely composable
The closure property (operations on relations produce relations) is what makes SQL:
Every
SELECT,JOIN,WHERE,GROUP BY, etc. takes relations as input and produces a relation as output. This is the foundation of relational algebra and why CTEs, subqueries, and views all work seamlessly together.
Every table has two dimensions:
← WIDTH (columns) →
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
↑ │ 1 │ 15 │ 120 │ false │
│ │ 2 │ 22 │ 180 │ true │
DEPTH │ 3 │ 18 │ 150 │ false │
(rows)│ 4 │ 25 │ 200 │ true │
│ │ 5 │ 12 │ 100 │ false │
↓ └────┴────────┴────────┴───────────┘
Depth filtering reduces the number of rows by applying conditions.
-- Original: 100,000 trees (depth = 100,000)
SELECT * FROM trees;
-- Filtered depth: Only tall trees (depth = ~20,000)
SELECT * FROM trees
WHERE height > 20;
-- Further filtered depth (depth = ~5,000)
SELECT * FROM trees
WHERE height > 20 AND remarkable = true;
Result: Fewer rows, same columns
Before (depth = 5):
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 1 │ 15 │ 120 │ false │
│ 2 │ 22 │ 180 │ true │
│ 3 │ 18 │ 150 │ false │
│ 4 │ 25 │ 200 │ true │
│ 5 │ 12 │ 100 │ false │
└────┴────────┴────────┴───────────┘
After WHERE height > 20 => depth = 2:
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 2 │ 22 │ 180 │ true │
│ 4 │ 25 │ 200 │ true │
└────┴────────┴────────┴───────────┘
Width filtering reduces the number of columns by specifying which attributes to include.
-- Full width: all columns (width = 11)
SELECT * FROM trees;
-- Reduced width: only 3 columns (width = 3)
SELECT id, height, circumference FROM trees;
-- Minimal width: just 1 column (width = 1)
SELECT height FROM trees;
Result: Same rows, fewer columns
Before (width = 4):
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 1 │ 15 │ 120 │ false │
│ 2 │ 22 │ 180 │ true │
│ 3 │ 18 │ 150 │ false │
└────┴────────┴────────┴───────────┘
After SELECT id, height (width = 2):
┌────┬────────┐
│ id │ height │
├────┼────────┤
│ 1 │ 15 │
│ 2 │ 22 │
│ 3 │ 18 │
└────┴────────┘
A complete SELECT query creates a subset in both dimensions:
-- Reduce BOTH depth and width
SELECT id, height, circumference -- ← WIDTH: 3 columns
FROM trees
WHERE height > 20 AND remarkable = true -- ← DEPTH: filtered rows
LIMIT 10; -- ← DEPTH: further limit
Depth (rows) is controlled by:
WHERE clausesLIMITJOIN conditionsHAVING (after GROUP BY)Width (columns) is controlled by:
SELECT* vs specific columnsEvery SELECT produces a subset:
Optimization: Select only what you need
Select only what you need
This is why SQL is based on relational algebra SQL is about creating subsets of relations in two dimensions!
Relational algebra is what makes SQL a declarative language
You describe WHAT you want, not HOW to get it
You say: "Give me tall remarkable trees"
SELECT height, circumference
FROM trees
WHERE height > 20 AND remarkable = true;
You don't say:
You describe HOW to do it, step by step
You write the exact steps
results = []
for tree in trees:
if tree.height > 20 and tree.remarkable == True:
results.append({
'height': tree.height,
'circumference': tree.circumference
})
return results
SQL is declarative because it operates on relations (tables):
DECLARATIVE: You describe the result you want
Operations produce relations → composable → declarative
Because operations always return relations, you can describe complex queries declaratively:
-- Each part describes WHAT, not HOW
WITH tall_trees AS (
SELECT * FROM trees WHERE height > 20 -- WHAT: tall trees
),
remarkable_tall AS (
SELECT * FROM tall_trees WHERE remarkable = true -- WHAT: remarkable ones
)
SELECT
COUNT(*) as total, -- WHAT: count them
AVG(circumference) as avg_circum -- WHAT: average circumference
FROM remarkable_tall;
Imperative equivalent would be:
HOW: step by step instructions
tall_trees = []
for tree in all_trees:
if tree.height > 20:
tall_trees.append(tree)
remarkable_tall = []
for tree in tall_trees:
if tree.remarkable:
remarkable_tall.append(tree)
total = len(remarkable_tall)
sum_circum = 0
for tree in remarkable_tall:
sum_circum += tree.circumference
avg_circum = sum_circum / total if total > 0 else 0
Depth × Width and Declarative are connected concepts:
| Concept | What it means | Example |
|---|---|---|
| Depth × Width | The shape/size of results | "100 rows × 3 columns" |
| Declarative | How you express queries | "Give me X where Y" not "loop, check, add" |
The connection:
The relational model (operations on relations produce relations) enables the declarative style (describe what, not how), which lets you think in terms of depth × width (the shape of your result set) rather than procedural steps!
Relational algebra is based on set theory, and in mathematics, sets cannot have duplicates by definition.
Mathematical Set:
{1, 2, 3, 3, 4} ❌ Not a valid set
{1, 2, 3, 4} ✅ Valid set (no duplicates)
A set either contains an element or it doesn't.
You can't have "two copies" of the same element.
A relation is defined as:
Since it's a set, by definition:
In relational theory, a tuple represents a fact or proposition.
Trees relation represents facts: "There exists a tree with id=1, height=15, circumference=120" "There exists a tree with id=2, height=22, circumference=180"
If you have duplicates, what does it mean?
This doesn't make sense in relational theory:
id | height | circumference
---|--------|-------------
1 | 15 | 120
1 | 15 | 120 ← Same fact stated twice?
Questions:
The duplicate adds no new information: it's the same fact repeated.
Mathematical set operations (union, intersection, difference) assume no duplicates:
Set A = {1, 2, 3}
Set B = {2, 3, 4}
A ∪ B = {1, 2, 3, 4} ✅ Well-defined
A ∩ B = {2, 3} ✅ Well-defined
A - B = {1} ✅ Well-defined
If sets had duplicates, these operations become ambiguous:
"Set" A = {1, 2, 2, 3} ❌ Not really a set
"Set" B = {2, 3, 3, 4} ❌ Not really a set
A ∪ B = ???
- {1, 2, 3, 4}?
- {1, 2, 2, 3, 3, 4}?
- {1, 2, 2, 2, 3, 3, 3, 4}?
Which is correct? No way to know.
In pure relational theory, every relation must have a key: a set of attributes that uniquely identifies each tuple.
-- Trees table with proper key
CREATE TABLE trees (
id INTEGER PRIMARY KEY, -- Ensures uniqueness
height INTEGER,
circumference INTEGER
);
-- Each tuple is unique by definition
-- No two rows can have the same id
Key properties:
If duplicates existed, the key wouldn't be unique!
SQL allows duplicates because it's practical, but it's technically not pure relational algebra.
SQL allows this:
SELECT height FROM trees;
Result:
height
------
15
15 ← Duplicate!
22
15 ← Another duplicate!
18
This is a bag (multiset), not a set.
Why SQL allows this:
SQL gives you DISTINCT to get pure set behavior:
-- Bag behavior (SQL default)
SELECT height FROM trees;
-- Result: 15, 15, 22, 15, 18
-- Set behavior (relational theory)
SELECT DISTINCT height FROM trees;
-- Result: 15, 18, 22
-- With duplicates, results are misleading
SELECT arrondissement
FROM locations
WHERE arrondissement IS NOT NULL;
-- Returns 100,000 rows with many duplicates
-- '16ème', '16ème', '16ème', '16ème', ...
-- What you actually want:
SELECT DISTINCT arrondissement
FROM locations
WHERE arrondissement IS NOT NULL;
-- Returns 20 unique arrondissements
-- UNION (set behavior) - removes duplicates
SELECT height FROM trees WHERE height > 20
UNION
SELECT height FROM trees WHERE remarkable = true;
-- Result: unique heights
-- UNION ALL (bag behavior) - keeps duplicates
SELECT height FROM trees WHERE height > 20
UNION ALL
SELECT height FROM trees WHERE remarkable = true;
-- Result: duplicates if a tree is both tall AND remarkable
-- Duplicates affect counts
SELECT COUNT(*) FROM (
SELECT height FROM trees
) subquery;
-- Counts duplicates: 100,000
SELECT COUNT(*) FROM (
SELECT DISTINCT height FROM trees
) subquery;
-- Counts unique values: 150
✅ No duplicates (it's a set)
✅ Clean mathematical operations
✅ Unambiguous semantics
❌ Expensive to enforce (requires checking)
❌ Less flexible for real-world needs
✅ Fast (no duplicate checking by default)
✅ Flexible (bags when needed, sets when needed)
✅ Can count/aggregate duplicates
❌ Not pure relational algebra
❌ Can lead to unexpected results
-- This probably has duplicates you don't want
SELECT
tn.name,
t.height
FROM trees t
JOIN taxonomy tax ON t.taxonomy_id = tax.id
JOIN tree_names tn ON tax.name_id = tn.id;
-- If a tree name appears 5000 times, you get 5000 rows
-- When calculating average, this skews results!
-- Better:
SELECT
tn.name,
AVG(t.height) as avg_height -- Handles duplicates correctly
FROM trees t
JOIN taxonomy tax ON t.taxonomy_id = tax.id
JOIN tree_names tn ON tax.name_id = tn.id
GROUP BY tn.name;
-- Here duplicates are meaningful - we want to count them
SELECT
l.arrondissement,
COUNT(*) as tree_count -- Count includes duplicates
FROM trees t
JOIN locations l ON t.location_id = l.id
GROUP BY l.arrondissement;
-- Each row represents one tree
-- Multiple trees in same arrondissement = valid duplicates
Best practice: Think about whether your query should return a set or a bag:
SELECT DISTINCT when you want unique values (set)SELECT when duplicates are meaningful (bag)GROUP BY + aggregates when processing groups of duplicates