A serverless, large-scale analytical database from Google Cloud
Hyperscalers: GCP, AWS, Azure
Alternatives:
As of Nov 2025, Google Cloud offers 12 different databases
You should have $300 in credits to begin exploring.
BigQuery also offers a sandbox
Caution: use at your own risk. It’s very easy to spin up large resources and blow up your budget.
Do not try this if you are not comfortable using cloud resources with your own credit card.
Make sure you can monitor your usage and consumption in Billing.
Turn off all resources you’ve set up.
Do not publish API keys or access credentials.
Three pricing tiers: storage + analysis (querying) + transfer
Storage is cheap. Querying is what you pay for.
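Since querying is what you pay for, it is worth checking how many bytes your queries actually bill. A hedged sketch using BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view (the `region-us` qualifier assumes your jobs run in the US multi-region; adjust it to your own region):
-- Most expensive queries of the last 7 days, by bytes billed.
SELECT user_email, job_id, total_bytes_billed, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;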
Example: this lists 10 datasets from bigquery-public-data:
bq ls --project_id=bigquery-public-data --max_results=10
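Once the public datasets are visible, a first sandbox query could look like this (usa_names.usa_1910_2013 is one of the bigquery-public-data datasets; its column names are assumed to still match the published schema):
-- Top 10 most common names registered in Texas across all years.
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name
ORDER BY total DESC
LIMIT 10;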
BQ is made for analytical workloads: big scans and aggregations over very large tables.
BQ is not made for heavy transactional workloads.
In short: BigQuery = big reads, PostgreSQL = frequent small writes.
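A rough sketch of the two workload shapes (table names here are purely illustrative):
-- Frequent small writes: the OLTP pattern PostgreSQL is built for.
UPDATE accounts SET balance = balance - 10 WHERE account_id = 42;
-- Big reads: the OLAP pattern BigQuery is built for.
SELECT city, SUM(amount) AS total_amount FROM orders GROUP BY city;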
Throughput: how many results you get per second. Latency: how long you wait for the first result.
BQ is not trying to return each individual query as fast as possible (low latency).
BQ is designed to handle very large amounts of data efficiently, even if that means each individual query takes longer to start or finish.
You might wait 2–5 seconds for the first result (higher latency), but the system can process huge datasets (terabytes) in one go (high throughput).
From wikipedia: Data orientation is the representation of tabular data in a linear memory model such as in-disk or in-memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format).

Data orientation => important tradeoffs in performance and storage.
In a row database, how does the system know when a row ends and a new one starts?
The database stores metadata with the data, not just the raw values. In a row-oriented DB (e.g., PostgreSQL):
The DB stores rows inside pages (e.g., 8 KB in PostgreSQL).
So even if the data looks like:
id, name, age, city
1, Alice, 30, New York
2, Bob, 25, Boston
On disk it’s more like:
[row header][ID][Name length][Name][Age][City length][City]
[row header][ID][Name length][Name][Age][City length][City]
The row header + schema = how the system knows where each row starts and ends.
=> easy writes: adding a new row just means appending after the last row
Column-based databases store data by columns instead of rows.
Each column’s values are stored together.
In a columnar system, the data is stored as (also with appropriate headers):
[1, 2][Alice, Bob][30, 25][New York, Boston]
See also this article: What is a columnar database?
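One practical consequence: a columnar engine scans only the columns a query references, so narrower queries read (and bill) fewer bytes. A sketch, assuming the toy table above lives in BigQuery as project.dataset.customers (the bytes-processed estimate is shown in the console before you run a query):
-- Reads only the name column: bytes scanned ≈ size of that one column.
SELECT name FROM project.dataset.customers;
-- Reads every column: bytes scanned ≈ size of the whole table.
SELECT * FROM project.dataset.customers;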
Let’s compare the same join query on a row database (PostgreSQL) vs a columnar database (BigQuery) and explain the difference in execution plans.
Example Schema
Two tables:
customers (customer_id, name, city)
orders (order_id, customer_id, amount)
We want total money spent per customer:
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.name;
How PostgreSQL (row-oriented) executes it:
It uses indexes on customer_id if they exist.
Execution behavior:
Fetch row from customers → Lookup matching rows in orders using index → Aggregate
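On the PostgreSQL side you can see this plan directly with EXPLAIN ANALYZE; the exact nodes depend on table sizes and on whether an index on orders(customer_id) exists, but you would typically see an Index Scan or Hash Join feeding a HashAggregate / GroupAggregate:
EXPLAIN ANALYZE
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;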
How BigQuery (columnar) executes it:
It reads only the columns needed: customer_id, name, and amount.
Execution style:
Scan columnar segments → Build distributed hash table → Parallel join → Parallel aggregate
A distributed hash join means BigQuery splits the data across many machines.
Each machine handles its own shard of the rows.
Hashing ensures that matching rows from two tables go to the same machine, so the join can be done in parallel.
We hash the join key to keep the workload evenly distributed across all workers and avoid hotspots (highly imbalanced workload distributions).
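A conceptual sketch of that bucketing, not how BigQuery exposes its internals: hashing the join key with FARM_FINGERPRINT and taking it modulo a made-up worker count of 4 shows that equal keys always land in the same bucket and that rows spread roughly evenly across buckets:
-- How many orders rows would land on each of 4 hypothetical workers.
SELECT
  MOD(ABS(FARM_FINGERPRINT(CAST(customer_id AS STRING))), 4) AS bucket,
  COUNT(*) AS rows_in_bucket
FROM project.dataset.orders
GROUP BY bucket
ORDER BY bucket;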

BigQuery is massively parallel because both the data and the query execution are distributed across many machines working at the same time.
So instead of one computer doing all the work, hundreds or thousands do small parts at once → massively parallel.
Columnar orientation splits the data with respect to columns: width / column reduction.
BQ has a second level of data reduction, partitioning: height / row reduction.
Partitioning is splitting a large table into smaller chunks (partitions) based on one column, usually a date, timestamp, or integer range.
Instead of scanning the whole table, BigQuery will only scan the partitions needed for the query.
The query
SELECT * FROM events
WHERE event_date = "2024-01-02";
→ BigQuery reads only the 2024-01-02 partition, not the entire table.
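For that pruning to happen, events must be declared with a partition column. A minimal sketch (only event_date comes from the query above; the other columns are made up):
CREATE TABLE project.dataset.events (
  event_id INT64,
  event_date DATE,
  payload STRING
)
PARTITION BY event_date;
The same pattern applies to the orders table used in the rest of this section: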
CREATE TABLE project.dataset.orders (
order_id INT64,
customer_id INT64,
order_date DATE,
amount FLOAT64
)
PARTITION BY order_date;
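Once data is loaded, you can confirm the partitions exist through the dataset's INFORMATION_SCHEMA.PARTITIONS view (one row per partition; project and dataset names follow the example above):
SELECT table_name, partition_id, total_rows
FROM project.dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'orders'
ORDER BY partition_id;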
Or, with extra clustering:
CREATE TABLE project.dataset.customer_orders (
order_id INT64,
customer_id INT64,
order_date DATE,
region STRING,
amount FLOAT64
)
PARTITION BY order_date
CLUSTER BY customer_id, region;
CLUSTER BY further organizes the data within each partition into physical chunks.
Without clustering:
Partition: 2025-01-01
├── All rows for that date stored together
└── Mixed customer_ids and regions scattered throughout
With clustering:
Partition: 2025-01-01
├── Cluster 1: customer_id 1-100, region "US"
├── Cluster 2: customer_id 1-100, region "EU"
├── Cluster 3: customer_id 101-200, region "US"
└── Cluster 4: customer_id 101-200, region "EU"
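You can check which columns a table is clustered on through INFORMATION_SCHEMA.COLUMNS, where clustering_ordinal_position is NULL for columns that are not part of the clustering key:
SELECT column_name, clustering_ordinal_position
FROM project.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'customer_orders'
  AND clustering_ordinal_position IS NOT NULL
ORDER BY clustering_ordinal_position;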
When you query with a filter like:
SELECT * FROM customer_orders
WHERE order_date = '2025-01-01'
AND customer_id = 50
AND region = 'US';
BigQuery does this: prune down to the 2025-01-01 partition, then skip the clustered blocks that cannot contain customer_id = 50 and region = 'US'.
This dramatically reduces the amount of data read from disk: you’re scanning maybe 5-10% of the partition instead of 100%.
Clustering creates sort order within the partition files. It doesn’t create separate sub-partitions you can see; it’s transparent internal organization.
But the physical data is ordered so that rows with similar customer_id and region values sit close together, which lets BigQuery skip blocks whose value ranges cannot match the filter.
So partitioning is subsetting a table with respect to rows, while columnar data organization is subsetting a table with respect to columns.
When you run a query:
SELECT col1, col3
FROM my_table
WHERE date = '2024-01-02';
BigQuery does this:
Keep only the row blocks for the 2024-01-02 partition → ignore all other row groups.

| Step | What is filtered | Effect |
|---|---|---|
| 1. Partition pruning | Rows | Skip irrelevant row blocks |
| 2. Column pruning | Columns | Skip irrelevant columns |
| 3. Join columns back | Based on row index | Reconstruct the requested rows |
Partitioning removes rows you don’t need; columnar storage skips columns you don’t need. → BQ can handle very large datasets.
But that’s not the only thing:
1) Distributed Storage + Compute Separation
2) Massively Parallel Processing (MPP)
3) Data Locality & Sharding
4) Compression + Encodings
Columns often contain similar values → BigQuery applies compression and encodings such as dictionary and run-length encoding.
Smaller data → lower scan cost → faster reads.
5) Broadcast + Distributed Hash Joins
Small tables are broadcast to all workers: it is faster to copy the small table to every worker machine. Broadcasting means BigQuery copies the small table to all workers so each worker can join locally without moving large data around (see the sketch below).
Large tables are joined in parallel using hash partitioning.
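A hedged sketch of when a broadcast typically happens (the regions lookup table and its columns are illustrative assumptions; BigQuery's optimizer picks the join strategy on its own):
-- regions is tiny (a few hundred rows), customer_orders is large:
-- the small side is typically broadcast to every worker instead of
-- shuffling the large table.
SELECT r.region_name, SUM(o.amount) AS total_amount
FROM project.dataset.customer_orders AS o
JOIN project.dataset.regions AS r
  ON o.region = r.region_code
GROUP BY r.region_name;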
PostgreSQL: optimized for many small, precise lookups.
BigQuery: subsets the data before processing; optimized for large scans and aggregations.
PostgreSQL goes fast by finding the right rows efficiently;
BigQuery goes fast by processing huge amounts of data in parallel while skipping everything it doesn’t need.
The user-facing SQL looks the same, but the execution model underneath is completely different.
You can write:
SELECT city, AVG(amount)
FROM orders
GROUP BY city;
This works in PostgreSQL and BigQuery.
| Database | How it executes the SQL | What it’s optimized for |
|---|---|---|
| PostgreSQL | Uses indexes, row-level access, and a single-machine query planner | Fast small lookups and transactions |
| BigQuery | Skips columns, skips partitions, shards data, and runs work in parallel on many machines | Fast large scans and aggregations |