SQL and relational Algebra
SQL and relational Algebra
Scope
In this lesson we look at
- how SQL relates to relational algebra (and why we call them relational databases)
- assessing and improving query performance with the use of the EXPLAIN command and indexes.
The relation in Relational database
Relational database come from relational algebra theory.
A small example:
- A and B are sets of objects: for instance A = {1,2,3}, B = {a,b,c}
- tuples are combination of elements of these objects : (1,a), (1,b), …, (3, c)
In that context, a database table is a subset of all the possible combinations of the values of its columns. (in our case above with 2 sets of 3 elements each, we get 9 elements in total)
For instance, a table with columns A and B that can respectively take the values {1,2,3} and {a,b,c} is :
| A | B |
| --- | --- |
| 1 | c |
| 1 | b |
| 3 | a |
| 2 | b |
Relations Producing Relations: The Closure Property
One important concept from relation theory, is that
an operation on a relation produces … another relation!
This is called the CLOSURE property
When you query a table, the result is another table (a relation). This is fundamental to why SQL works the way it does.
Why This Matters
1. You can chain operations
Because the output is always a relation, you can use it as input for another operation:
-- Each step produces a relation (table-like result)
SELECT * FROM trees -- produces a relation
WHERE height > 15 -- produces a relation
ORDER BY circumference DESC -- produces a relation
LIMIT 10; -- produces a relation
2. Subqueries work naturally
Since a query returns a relation, you can use it wherever you’d use a table:
-- The subquery returns a relation, so we can query it like a table
SELECT arrondissement, avg_height
FROM (
SELECT
l.arrondissement,
AVG(t.height) as avg_height
FROM trees t
JOIN locations l ON t.location_id = l.id
GROUP BY l.arrondissement
) AS arrond_heights
WHERE avg_height > 20;
3. CTEs are just named relations
WITH tall_trees AS ( -- produces a relation
SELECT * FROM trees
WHERE height > 20
),
remarkable_tall_trees AS ( -- operates on a relation, produces a relation
SELECT * FROM tall_trees
WHERE remarkable = true
)
SELECT * FROM remarkable_tall_trees; -- operates on a relation
Practical Example with Trees
- Step 1: Query trees → produces a RELATION
SELECT id, height, circumference
FROM trees
WHERE height IS NOT NULL;
This result IS A TABLE with columns: id, height, circumference
We can treat it like any other table!
- Step 2: Use that relation as input
WITH tree_measurements AS (
SELECT id, height, circumference
FROM trees
WHERE height IS NOT NULL
)
-- Now query this relation like it's a regular table
SELECT
COUNT(*) as tree_count,
AVG(height) as avg_height,
MAX(circumference) as max_circumference
FROM tree_measurements;
What Makes It a “Relation”?
A relation has:
- Rows (tuples) - individual records
- Columns (attributes) - defined structure
- No duplicate rows (in theory, though SQL allows duplicates)
- No inherent order (though we can sort for display)
Why “Closure” Matters
Closure means:
- Input: Relation(s)
- Operation: SELECT, JOIN, UNION, etc.
- Output: Always a Relation
This creates a composable system where you can build complex queries from simple parts:
-- Each WITH clause: relation → operation → relation
WITH
-- Relation 1: Large trees
large_trees AS (
SELECT * FROM trees WHERE height > 25
),
-- Relation 2: Remarkable large trees (operates on relation 1)
remarkable_large AS (
SELECT * FROM large_trees WHERE remarkable = true
),
-- Relation 3: With location info (operates on relation 2)
with_locations AS (
SELECT
r.*,
l.arrondissement
FROM remarkable_large r
JOIN locations l ON r.location_id = l.id
)
-- Final operation on relation 3
SELECT
arrondissement,
COUNT(*) as remarkable_large_count
FROM with_locations
GROUP BY arrondissement;
Contrast with Non-Relational Operations
If operations didn’t return relations, you couldn’t chain them:
❌ If SELECT returned just a number → can't join it
❌ If GROUP BY returned a list → can't filter it
✅ All operations return relations → infinitely composable
The closure property (operations on relations produce relations) is what makes SQL:
- Composable - chain operations together
- Flexible - use query results anywhere you’d use a table
- Powerful - build complex queries from simple building blocks
Every
SELECT
,JOIN
,WHERE
,GROUP BY
, etc. takes relations as input and produces a relation as output. This is the foundation of relational algebra and why CTEs, subqueries, and views all work seamlessly together.
Depth and Width in SQL SELECT Queries
The Two Dimensions of a Relation (Table)
Every table has two dimensions:
- Depth = Number of rows (records/tuples)
- Width = Number of columns (attributes/fields)
Visual Representation
← WIDTH (columns) →
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
↑ │ 1 │ 15 │ 120 │ false │
│ │ 2 │ 22 │ 180 │ true │
DEPTH │ 3 │ 18 │ 150 │ false │
(rows)│ 4 │ 25 │ 200 │ true │
│ │ 5 │ 12 │ 100 │ false │
↓ └────┴────────┴────────┴───────────┘
Depth: Filtering Rows with WHERE
Depth filtering reduces the number of rows by applying conditions.
-- Original: 100,000 trees (depth = 100,000)
SELECT * FROM trees;
-- Filtered depth: Only tall trees (depth = ~20,000)
SELECT * FROM trees
WHERE height > 20;
-- Further filtered depth (depth = ~5,000)
SELECT * FROM trees
WHERE height > 20 AND remarkable = true;
Result: Fewer rows, same columns
Before (depth = 5):
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 1 │ 15 │ 120 │ false │
│ 2 │ 22 │ 180 │ true │
│ 3 │ 18 │ 150 │ false │
│ 4 │ 25 │ 200 │ true │
│ 5 │ 12 │ 100 │ false │
└────┴────────┴────────┴───────────┘
After WHERE height > 20 (depth = 2):
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 2 │ 22 │ 180 │ true │
│ 4 │ 25 │ 200 │ true │
└────┴────────┴────────┴───────────┘
Width: Selecting Columns
Width filtering reduces the number of columns by specifying which attributes to include.
-- Full width: all columns (width = 11)
SELECT * FROM trees;
-- Reduced width: only 3 columns (width = 3)
SELECT id, height, circumference FROM trees;
-- Minimal width: just 1 column (width = 1)
SELECT height FROM trees;
Result: Same rows, fewer columns
Before (width = 4):
┌────┬────────┬────────┬───────────┐
│ id │ height │ circum │ remarkable│
├────┼────────┼────────┼───────────┤
│ 1 │ 15 │ 120 │ false │
│ 2 │ 22 │ 180 │ true │
│ 3 │ 18 │ 150 │ false │
└────┴────────┴────────┴───────────┘
After SELECT id, height (width = 2):
┌────┬────────┐
│ id │ height │
├────┼────────┤
│ 1 │ 15 │
│ 2 │ 22 │
│ 3 │ 18 │
└────┴────────┘
Combining Depth × Width
A complete SELECT query creates a subset in both dimensions:
-- Reduce BOTH depth and width
SELECT id, height, circumference -- ← WIDTH: 3 columns
FROM trees
WHERE height > 20 AND remarkable = true -- ← DEPTH: filtered rows
LIMIT 10; -- ← DEPTH: further limit
Key Takeaways
- Depth (rows) is controlled by:
WHERE
clausesLIMIT
JOIN
conditionsHAVING
(after GROUP BY)
- Width (columns) is controlled by:
- Column list in
SELECT
- Using
*
vs specific columns
- Column list in
- Every SELECT produces a subset:
- Subset of rows (depth filtering)
- Subset of columns (width filtering)
- Or both
- Optimization: Select only what you need
- Fewer columns = less data transfer
- Fewer rows = faster processing
Select only what you need
Relational Algebra Terminology
- Depth filtering (rows) = Selection operation ($\sigma$)
- Width filtering (columns) = Projection operation ($\pi$)
This is why SQL is based on relational algebra SQL is about creating subsets of relations in two dimensions!
SQL is a declarative language
Relational algebra is what makes SQL a declarative language
What “Declarative” Means
You describe WHAT you want, not HOW to get it
You say: “Give me tall remarkable trees”
SELECT height, circumference
FROM trees
WHERE height > 20 AND remarkable = true;
You don’t say:
- ❌ “Loop through each tree”
- ❌ “Check if height > 20”
- ❌ “If true, add to results array”
- ❌ “Return the array”
vs Imperative (Python, JavaScript, etc.)
You describe HOW to do it, step by step
You write the exact steps
results = []
for tree in trees:
if tree.height > 20 and tree.remarkable == True:
results.append({
'height': tree.height,
'circumference': tree.circumference
})
return results
SQL is declarative because it operates on relations (tables):
- Relations are sets - no order, no duplicates (in theory, see below)
- Operations on sets are mathematical - they describe transformations, not procedures
- The database engine decides HOW - you just describe WHAT
DECLARATIVE: You describe the result you want
SELECT
l.arrondissement, -- What columns (WIDTH)
COUNT(*) as tree_count
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE t.height > 20 -- What rows (DEPTH)
GROUP BY l.arrondissement;
The database decides:
- Which index to use
- Join algorithm (hash join? nested loop?)
- Whether to scan or seek
- Execution order
Declarative Works with Relations
Operations produce relations → composable → declarative
Because operations always return relations, you can describe complex queries declaratively:
-- Each part describes WHAT, not HOW
WITH tall_trees AS (
SELECT * FROM trees WHERE height > 20 -- WHAT: tall trees
),
remarkable_tall AS (
SELECT * FROM tall_trees WHERE remarkable = true -- WHAT: remarkable ones
)
SELECT
COUNT(*) as total, -- WHAT: count them
AVG(circumference) as avg_circum -- WHAT: average circumference
FROM remarkable_tall;
Imperative equivalent would be:
HOW: step by step instructions
tall_trees = []
for tree in all_trees:
if tree.height > 20:
tall_trees.append(tree)
remarkable_tall = []
for tree in tall_trees:
if tree.remarkable:
remarkable_tall.append(tree)
total = len(remarkable_tall)
sum_circum = 0
for tree in remarkable_tall:
sum_circum += tree.circumference
avg_circum = sum_circum / total if total > 0 else 0
Summary
Depth × Width and Declarative are connected concepts:
Concept | What it means | Example |
---|---|---|
Depth × Width | The shape/size of results | “100 rows × 3 columns” |
Declarative | How you express queries | “Give me X where Y” not “loop, check, add” |
The connection:
- SQL operates on relations (tables)
- Relations enable set-based thinking
- Set-based operations are naturally declarative
- You describe the WHAT (depth × width of results)
- Database figures out the HOW (execution plan)
The relational model (operations on relations produce relations) enables the declarative style (describe what, not how), which lets you think in terms of depth × width (the shape of your result set) rather than procedural steps!
Appendix: Why No Duplicates in Relational Algebra?
The Mathematical Foundation
Relational algebra is based on set theory, and in mathematics, sets cannot have duplicates by definition.
Mathematical Set:
{1, 2, 3, 3, 4} ❌ Not a valid set
{1, 2, 3, 4} ✅ Valid set (no duplicates)
A set either contains an element or it doesn't.
You can't have "two copies" of the same element.
What is a Relation in Theory?
A relation is defined as:
- A set of tuples (rows)
- Over a set of attributes (columns)
Since it’s a set, by definition:
- Each tuple must be unique
- Order doesn’t matter
- No duplicates allowed
Why This Makes Sense Theoretically
1. Identity and Meaning
In relational theory, a tuple represents a fact or proposition.
Trees relation represents facts: “There exists a tree with id=1, height=15, circumference=120” “There exists a tree with id=2, height=22, circumference=180”
If you have duplicates, what does it mean?
This doesn’t make sense in relational theory:
id | height | circumference
---|--------|-------------
1 | 15 | 120
1 | 15 | 120 ← Same fact stated twice?
Questions:
- Does the same tree exist twice?
- Are there two identical trees?
- Is this a counting error?
The duplicate adds no new information: it’s the same fact repeated.
Set Operations Should Work Cleanly
Mathematical set operations (union, intersection, difference) assume no duplicates:
Set A = {1, 2, 3}
Set B = {2, 3, 4}
A ∪ B = {1, 2, 3, 4} ✅ Well-defined
A ∩ B = {2, 3} ✅ Well-defined
A - B = {1} ✅ Well-defined
If sets had duplicates, these operations become ambiguous:
"Set" A = {1, 2, 2, 3} ❌ Not really a set
"Set" B = {2, 3, 3, 4} ❌ Not really a set
A ∪ B = ???
- {1, 2, 3, 4}?
- {1, 2, 2, 3, 3, 4}?
- {1, 2, 2, 2, 3, 3, 3, 4}?
Which is correct? Mathematics can’t answer this!
Primary Keys Enforce Uniqueness
In pure relational theory, every relation must have a key: a set of attributes that uniquely identifies each tuple.
-- Trees table with proper key
CREATE TABLE trees (
id INTEGER PRIMARY KEY, -- Ensures uniqueness
height INTEGER,
circumference INTEGER
);
-- Each tuple is unique by definition
-- No two rows can have the same id
Key properties:
- Minimal: no attribute can be removed and still identify the tuple
- Unique: no two tuples have the same key value
If duplicates existed, the key wouldn’t be unique!
SQL’s Deviation from Theory
SQL allows duplicates because it’s practical, but it’s technically not pure relational algebra.
SQL Tables are “Bags” (Multisets), not Sets
SQL allows this:
SELECT height FROM trees;
Result:
height
------
15
15 ← Duplicate!
22
15 ← Another duplicate!
18
This is a bag (multiset), not a set.
Why SQL allows this:
- Performance - removing duplicates is expensive (requires sorting or hashing)
- Practical need - sometimes you want to count duplicates
- Convenience - forcing DISTINCT everywhere would be annoying
When SQL Enforces Set Behavior
SQL gives you DISTINCT to get pure set behavior:
-- Bag behavior (SQL default)
SELECT height FROM trees;
-- Result: 15, 15, 22, 15, 18
-- Set behavior (relational theory)
SELECT DISTINCT height FROM trees;
-- Result: 15, 18, 22
Practical Examples: Why Duplicates are Problematic
Example 1: Counting
-- With duplicates, results are misleading
SELECT arrondissement
FROM locations
WHERE arrondissement IS NOT NULL;
-- Returns 100,000 rows with many duplicates
-- '16ème', '16ème', '16ème', '16ème', ...
-- What you actually want:
SELECT DISTINCT arrondissement
FROM locations
WHERE arrondissement IS NOT NULL;
-- Returns 20 unique arrondissements
Example 2: Set Operations
-- UNION (set behavior) - removes duplicates
SELECT height FROM trees WHERE height > 20
UNION
SELECT height FROM trees WHERE remarkable = true;
-- Result: unique heights
-- UNION ALL (bag behavior) - keeps duplicates
SELECT height FROM trees WHERE height > 20
UNION ALL
SELECT height FROM trees WHERE remarkable = true;
-- Result: duplicates if a tree is both tall AND remarkable
Example 3: Aggregations
-- Duplicates affect counts
SELECT COUNT(*) FROM (
SELECT height FROM trees
) subquery;
-- Counts duplicates: 100,000
SELECT COUNT(*) FROM (
SELECT DISTINCT height FROM trees
) subquery;
-- Counts unique values: 150
Theoretical vs Practical Trade-off
Relational Theory (Pure)
✅ No duplicates (it's a set)
✅ Clean mathematical operations
✅ Unambiguous semantics
❌ Expensive to enforce (requires checking)
❌ Less flexible for real-world needs
SQL (Practical)
✅ Fast (no duplicate checking by default)
✅ Flexible (bags when needed, sets when needed)
✅ Can count/aggregate duplicates
❌ Not pure relational algebra
❌ Can lead to unexpected results
When Duplicates Actually Matter
Problem Case: Aggregations without GROUP BY
-- This probably has duplicates you don't want
SELECT
tn.name,
t.height
FROM trees t
JOIN taxonomy tax ON t.taxonomy_id = tax.id
JOIN tree_names tn ON tax.name_id = tn.id;
-- If a tree name appears 5000 times, you get 5000 rows
-- When calculating average, this skews results!
-- Better:
SELECT
tn.name,
AVG(t.height) as avg_height -- Handles duplicates correctly
FROM trees t
JOIN taxonomy tax ON t.taxonomy_id = tax.id
JOIN tree_names tn ON tax.name_id = tn.id
GROUP BY tn.name;
Safe Case: When Duplicates Are Meaningful
-- Here duplicates are meaningful - we want to count them
SELECT
l.arrondissement,
COUNT(*) as tree_count -- Count includes duplicates
FROM trees t
JOIN locations l ON t.location_id = l.id
GROUP BY l.arrondissement;
-- Each row represents one tree
-- Multiple trees in same arrondissement = valid duplicates
Summary
Why no duplicates in relational algebra?
- Mathematical foundation - relations are sets, sets have no duplicates
- Semantics - duplicate tuples add no new information
- Operations - set operations (∪, ∩, −) are well-defined without duplicates
- Keys - every relation has a key that ensures uniqueness
Why SQL allows duplicates?
- Performance - checking for duplicates is expensive
- Practicality - sometimes you need to count/process duplicates
- Flexibility - use DISTINCT when you need set behavior
Key Takeaway
- Relational Theory: Relations are SETS (no duplicates)
- SQL Implementation: Tables are BAGS (duplicates allowed)
- SQL gives you DISTINCT to enforce set behavior when needed.
Best practice: Think about whether your query should return a set or a bag:
- Use
SELECT DISTINCT
when you want unique values (set) - Use
SELECT
when duplicates are meaningful (bag) - Use
GROUP BY
+ aggregates when processing groups of duplicates