BigQuery

A serverless, large-scale analytical database from Google Cloud

BQ includes a built-in query engine capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure and without having to create or rebuild indexes.

console.cloud.google.com/bigquery


Google Cloud, AWS, Azure

hyperscalers: GCP, AWS, Azure

Alternatives:


Google Cloud database offerings

As of November 2025, Google Cloud offers 12 different databases

Analytical / OLAP

In-memory / Caching

Key-Value / High-throughput


Google Cloud Database Solutions

Solution | Description | Target Use Case
Cloud SQL | Fully managed relational database (MySQL, PostgreSQL, SQL Server). | Traditional applications requiring SQL, migrations from on-premises databases, OLTP workloads.
Cloud Spanner | Globally distributed relational database with strong consistency and horizontal scalability. | Mission-critical applications requiring global availability and transactional consistency.
Firestore | Serverless NoSQL document database with offline synchronization. | Mobile/web apps requiring real-time sync and automatic scaling.
Firebase Realtime Database | Real-time NoSQL database with instant synchronization across clients. | Mobile and web apps needing real-time updates (e.g., chat, gaming).
Bigtable | High-performance NoSQL database for massive, low-latency workloads. | Real-time analytics, IoT, time-series data, high-throughput applications.
Memorystore | Managed in-memory cache (Redis or Memcached). | Caching to reduce application latency and speed up access to frequently used data.
BigQuery | Serverless data warehouse for large-scale analytics. | Big data analysis, reporting, machine learning, and data warehousing.
Datastore | Managed NoSQL document database with automatic scaling. | Web and mobile apps requiring simple scalability and serverless management.
AlloyDB | PostgreSQL-compatible relational database optimized for performance and AI. | Demanding analytical and transactional workloads, integration with Vertex AI.
Database Migration Service | Service for migrating databases to Google Cloud with minimal downtime. | Migrating on-premises or cloud databases to Google Cloud.

To play along

You should have $300 in credits to begin exploring.

BigQuery also offers a sandbox


Handle with caution

Caution: use at your own risk. It's very easy to spin up large resources and blow up your budget.

Do not try this if you are not comfortable with using cloud resources on your own credit card.

Make sure you can monitor your usage and consumption in Billing.

Turn off all resources you've set up.

Do not publish API keys and access credentials.


BQ Billing

3 tiers: storage + analysis (querying) + transfer

Storage is cheap. Querying is what you pay for.


How to work with BQ

example: this will list 10 datasets from bigquery-public-data

 bq ls --project_id=bigquery-public-data --max_results=10

BQ public datasets: bigquery-public-data

public datasets

You can use these datasets to complement your own datasets.

it suffices to join on the public dataset:

JOIN `bigquery-public-data.new_york_citibike.citibike_trips` ON ...

BQ vs PostgreSQL, MySQL, etc.

BQ is:

and


BQ is OLAP, postgres is OLTP

BQ is not made for heavy transactional workloads.

In short: BigQuery = big reads, PostgreSQL = frequent small writes.


Latency vs Throughput

With BQ, latency is higher — results are optimized for throughput, not response time.

BQ does not try to return each individual query as fast as possible.

BQ is designed to handle very large amounts of data efficiently, even if that means each individual query takes longer to start or finish.

You might wait 2–5 seconds for the first result (higher latency), but the system can process huge datasets (terabytes) in one go (high throughput).


Data orientation: rows vs columns

From wikipedia: Data orientation is the representation of tabular data in a linear memory model such as in-disk or in-memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format).

row vs columnar storage

Data orientation => important tradeoffs in performance and storage.




Rows orientation, storage

In a row database, how does the system know when a row ends and a new one starts?

row storage diagram

The database stores metadata with the data, not just the raw values.

In a row-oriented DB (e.g., PostgreSQL):

The DB stores rows inside pages (e.g., 8 KB in PostgreSQL).


So even if data looks like:

id, name, age, city
1, Alice, 30, New York
2, Bob, 25, Boston

On disk it’s more like:

[row header][ID][Name length][Name][Age][City length][City]
[row header][ID][Name length][Name][Age][City length][City]

The row header + schema = how the system knows where each row starts and ends.

=> easy writes; adding a new row = just add to the end of the last row
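A minimal Python sketch of that length-prefixed row layout. The 4-byte little-endian fields and the sample data are illustrative assumptions, not PostgreSQL's actual page format, but they show how length prefixes let the reader find each row boundary and why appending a row is cheap.

```python
import struct

def encode_row(row_id, name, age, city):
    """Serialize one row as [id][name length][name][age][city length][city]."""
    name_b, city_b = name.encode(), city.encode()
    return (struct.pack("<i", row_id)
            + struct.pack("<i", len(name_b)) + name_b
            + struct.pack("<i", age)
            + struct.pack("<i", len(city_b)) + city_b)

def decode_rows(buf):
    """Walk the buffer; the length prefixes say where each field ends."""
    rows, pos = [], 0
    while pos < len(buf):
        row_id, = struct.unpack_from("<i", buf, pos); pos += 4
        n, = struct.unpack_from("<i", buf, pos); pos += 4
        name = buf[pos:pos + n].decode(); pos += n
        age, = struct.unpack_from("<i", buf, pos); pos += 4
        n, = struct.unpack_from("<i", buf, pos); pos += 4
        city = buf[pos:pos + n].decode(); pos += n
        rows.append((row_id, name, age, city))
    return rows

page = encode_row(1, "Alice", 30, "New York") + encode_row(2, "Bob", 25, "Boston")
# Adding a new row is just appending bytes at the end of the page.
page += encode_row(3, "Carol", 41, "Chicago")
```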


Columnar database storage

Column-based databases store data by columns instead of rows.

Each column’s values are stored together.

In a columnar system, the data is stored as (with appropriate headers as well):

[1, 2][Alice, Bob][30, 25][New York, Boston]
columnar storage diagram
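The same sample data in column orientation, as a small Python sketch: each column's values sit together, so an aggregate over one column never touches the others.

```python
rows = [
    (1, "Alice", 30, "New York"),
    (2, "Bob",   25, "Boston"),
]

# Column orientation: each column's values are stored together.
columns = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age":  [r[2] for r in rows],
    "city": [r[3] for r in rows],
}

# An aggregate like AVG(age) reads one contiguous list;
# the name and city values are never read.
avg_age = sum(columns["age"]) / len(columns["age"])
```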

Columnar databases in 2 minutes

See also this article: what is a columnar database?


Comparison on a simple JOIN

Let’s compare the same join query on a row database (PostgreSQL) vs a columnar database (BigQuery) and explain the difference in execution plans.

Example Schema

Two tables:

customers (customer_id, name, city)
orders    (order_id, customer_id, amount)

We want total money spent per customer:

SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o
  ON c.customer_id = o.customer_id
GROUP BY c.name;

In PostgreSQL (Row-oriented)

How it executes:

Fetch a row from customers → look up matching rows in orders using an index → aggregate

In BigQuery (Columnar / Distributed Engine)

How it executes:

Scan columnar segments → build a distributed hash table → parallel join → parallel aggregate

Massively parallel

BigQuery is massively parallel because both the data and the query execution are distributed across many machines working at the same time.

So instead of one computer doing all the work, hundreds or thousands do small parts at once.

massively parallel


Distributed hash joins

A distributed hash join means BigQuery splits the data across many machines.

Each machine:

Hashing ensures that matching rows from two tables go to the same machine, so the join can be done in parallel.

We hash the join key to keep the workload evenly distributed across all workers and avoid hotspots: highly imbalanced workload distributions.
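A minimal Python sketch of this idea. SHA-256 here is only a stand-in for BigQuery's internal hash function (an assumption); the point is that any deterministic, well-mixed hash spreads a contiguous key range evenly and always sends the same key to the same worker.

```python
import hashlib

N_WORKERS = 4

def worker_for(key):
    """Route a join key to a worker by hashing it (deterministic)."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % N_WORKERS

# Range-based routing would send user_ids 1-100 to a single worker;
# hashing spreads those same ids across all workers.
load = [0] * N_WORKERS
for user_id in range(1, 101):
    load[worker_for(user_id)] += 1

# The same key always hashes to the same worker, so matching rows
# from both tables meet on the same machine.
```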


Example

SELECT  u.user_id, u.name, u.email, o.order_id, o.order_date
FROM    users u
JOIN    orders o ON u.user_id = o.user_id
WHERE   u.age > 18;

Shuffling Phase: BigQuery partitions both tables (users and orders) based on the join key (user_id). Each worker node processes a subset of the data, hashing the join key to distribute rows across nodes.

Build Phase: The smaller table (e.g., users) is loaded into a hash table in memory on each worker node. If the table is too large, BigQuery spills the hash table to disk.

Probe Phase: The larger table (e.g., orders) is scanned. For each row in orders, BigQuery probes the hash table to find matching rows from users using the join key (user_id). Matches are combined into the result set.

Filtering: The WHERE clause (age > 18) is applied before or during the join, depending on the query plan, to reduce the amount of data processed.
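The build and probe phases above can be sketched in Python on toy data (the table contents are hypothetical, and this single-process sketch omits the shuffle across workers):

```python
# Toy tables mirroring the query above.
users = [
    (1, "Ada", "ada@example.com", 34),
    (2, "Ben", "ben@example.com", 17),
    (3, "Cyn", "cyn@example.com", 29),
]
orders = [(101, 1, "2024-01-05"), (102, 3, "2024-01-06"), (103, 1, "2024-01-07")]

# Filtering: apply WHERE u.age > 18 first to shrink the build side.
adults = [u for u in users if u[3] > 18]

# Build phase: in-memory hash table on the join key (user_id).
hash_table = {user_id: (name, email) for user_id, name, email, _age in adults}

# Probe phase: scan the larger table and look up each row's user_id.
result = []
for order_id, user_id, order_date in orders:
    if user_id in hash_table:
        name, email = hash_table[user_id]
        result.append((user_id, name, email, order_id, order_date))
```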


Key Points


But why hash?

Why not distribute the workload across workers based on the values of the key?

1. Uniform Distribution


with distributed hash joins

The load distribution is no longer based on the raw values of the join key.

If users 1-100 have lots of orders, then the worker responsible for that range of values would be overloaded.

But with hashing, each id in 1 to 100 gets its own hash, and the ids get spread across several workers.

However, if a single key (e.g., user_id = 1) has too many rows, even hashing won’t help—one worker will still be overloaded.
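This caveat is easy to demonstrate (SHA-256 again stands in for the engine's internal hash, as an assumption): hashing cannot split rows that share one key value, because identical keys always hash identically.

```python
import hashlib

N_WORKERS = 4

def worker_for(key):
    return int(hashlib.sha256(str(key).encode()).hexdigest(), 16) % N_WORKERS

# 1000 order rows that all belong to user_id = 1: every row hashes
# to the same value, so a single worker receives the entire load.
skewed = [0] * N_WORKERS
for _ in range(1000):
    skewed[worker_for(1)] += 1
```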


2. Avoiding Hotspots


3. Efficient Lookup


4. Handling Large Datasets


What Happens If You Don’t Hash?


Summary

Hashing the join key ensures:

  1. Balanced workloads across workers.
  2. Fast lookups using hash tables.
  3. Minimal data movement during the join.
  4. Scalability for large datasets.

BigQuery (and other distributed databases) use hashing because it’s the most efficient way to distribute and join data in parallel.


distributed hash join diagram

Partitioning

Columnar orientation splits the data with respect to columns: width / column reduction. The engine only reads the columns it needs.

BQ has a 2nd level of data reduction: partitioning, which is height / row reduction.


Partitioning

Partitioning is splitting a large table into smaller chunks (partitions) based on a granular column — usually date, timestamp, or integer range.

Instead of scanning the whole table, BigQuery will only scan the partitions needed for the query.

The query

SELECT * FROM events
WHERE event_date = "2024-01-02";

→ BigQuery reads only the 2024-01-02 partition, not the entire table.

/data/2024-01-01/ (rows for Jan 1, 2024)
/data/2024-01-02/ (rows for Jan 2, 2024)
...
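A toy Python sketch of partition pruning, mirroring the per-date layout above (table contents are hypothetical):

```python
# One group of rows per partition, keyed by event_date.
partitions = {
    "2024-01-01": [("click", "u1"), ("view", "u2")],
    "2024-01-02": [("click", "u3")],
    "2024-01-03": [("view", "u4"), ("click", "u5")],
}

def query(event_date):
    """WHERE event_date = ... : open only the matching partition."""
    return partitions.get(event_date, [])

rows = query("2024-01-02")
total_rows = sum(len(p) for p in partitions.values())
# The query touched 1 row out of the 5 stored.
```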

Why it matters:


bigquery partitioning



SELECT name, age
FROM users
WHERE date = '2023-01-01';

How to create a table with partitioning

CREATE TABLE project.dataset.orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  amount FLOAT64
)
PARTITION BY DATE(order_date);

Or with extra clustering

CREATE TABLE project.dataset.customer_orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  region STRING,
  amount FLOAT64
)
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, region;

Cluster by

CLUSTER BY further organizes the data within each partition into physical chunks.

Without clustering:

Partition: 2025-01-01
├── All rows for that date stored together
└── Mixed customer_ids and regions scattered throughout

With clustering:

Partition: 2025-01-01
├── Cluster 1: customer_id 1-100, region "US"
├── Cluster 2: customer_id 1-100, region "EU"
├── Cluster 3: customer_id 101-200, region "US"
└── Cluster 4: customer_id 101-200, region "EU"

The Physical Storage Benefit

When you query with a filter like:

SELECT * FROM customer_orders
WHERE DATE(order_date) = '2025-01-01'
  AND customer_id = 50
  AND region = 'US';

BigQuery does this:

  1. Partition pruning: Only reads the 2025-01-01 partition (skips all other dates)
  2. Cluster pruning: Within that partition, only reads Cluster 1 & 2 (skips clusters with customer_id > 100)
  3. Region pruning: Further narrows to just the "US" cluster

This dramatically reduces the amount of data read from disk: you're scanning maybe 5-10% of the partition instead of 100%.

Key Point

Clustering creates sort order within the partition files. It doesn't create separate sub-partitions you can see; it's transparent internal organization.

But the physical data is ordered so that blocks irrelevant to the filter can be skipped entirely.
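One way to picture the effect, as a hedged Python sketch: assume each clustered block keeps min/max statistics for customer_id (the exact block metadata BigQuery keeps is internal; the rows here are made up). Blocks whose range cannot contain the filtered value are never read.

```python
# Inside one partition, rows sorted by customer_id, split into blocks.
blocks = [
    {"min": 1,   "max": 100, "rows": [(50, "US", 19.9), (60, "EU", 5.0)]},
    {"min": 101, "max": 200, "rows": [(150, "US", 7.5)]},
    {"min": 201, "max": 300, "rows": [(250, "EU", 12.0)]},
]

def scan(customer_id):
    """Read only the blocks whose [min, max] range can contain the id."""
    hits, blocks_read = [], 0
    for b in blocks:
        if b["min"] <= customer_id <= b["max"]:
            blocks_read += 1
            hits += [r for r in b["rows"] if r[0] == customer_id]
    return hits, blocks_read

hits, blocks_read = scan(50)  # only 1 of 3 blocks is read
```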


No indexing needed

-> Partitioning and Clustering Replace Indexes

BigQuery does not support manual indexes because its architecture (columnar + partitioning + clustering) makes them unnecessary for most analytical queries. Design your tables with partitioning/clustering to optimize performance.


Recap

So partitioning is subsetting a table with respect to rows, while columnar data organization is subsetting it with respect to columns.

When you run a query:

SELECT col1, col3
FROM my_table
WHERE date = '2024-01-02';

BigQuery does this:

  1. Identify only the partition for 2024-01-02 → ignore all other row groups.
  2. Inside that partition, read only col1 and col3 → ignore all other columns.
  3. Reassemble just those columns using row index alignment.
Step | What is filtered | Effect
1. Partition pruning | Rows | Skip irrelevant row blocks
2. Column pruning | Columns | Skip irrelevant columns
3. Join columns back | Based on row index | Reconstruct the requested rows
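The three steps can be sketched in Python (a toy layout with hypothetical dates and column names, one columnar chunk per partition):

```python
# One partition per date; each stores its columns as separate lists
# aligned by row index.
table = {
    "2024-01-01": {"col1": [1, 2], "col2": ["a", "b"], "col3": [10.0, 20.0]},
    "2024-01-02": {"col1": [3],    "col2": ["c"],      "col3": [30.0]},
}

def query(date, cols):
    part = table[date]                  # 1. partition pruning: skip other dates
    picked = [part[c] for c in cols]    # 2. column pruning: skip other columns
    return list(zip(*picked))           # 3. reassemble rows by index alignment

result = query("2024-01-02", ["col1", "col3"])
```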

Partitioning removes rows you don’t need, columnar storage skips columns you don’t need. -> BQ can handle super large datasets

But it's not the only thing:

  1. Distributed Storage + Compute Separation
  2. Massively Parallel Processing (MPP)
  3. Data Locality & Sharding
  4. Compression + Encodings

Columns often contain similar values → BigQuery applies:

Smaller data → lower scan cost → faster reads.
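Run-length encoding is one such encoding; a minimal sketch (the sample column values are made up) shows why columns of similar values compress so well:

```python
def rle_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

# A sorted/clustered column has long runs, so it compresses well.
city = ["NY", "NY", "NY", "NY", "Boston", "Boston", "NY"]
encoded = rle_encode(city)  # 3 entries instead of 7 values
```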

  5. Broadcast + Distributed Hash Joins

Small tables are broadcast to all workers: it is faster to copy the small table to every worker machine. Broadcasting means BigQuery copies the small table to all workers so each worker can join locally without moving large data around.

Large tables are joined in parallel using hash partitioning.
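A broadcast join in miniature (hypothetical tables; two simulated workers): each worker holds a shard of the large table plus a full copy of the small one, so all joins happen locally.

```python
# `small` is the dimension table to broadcast; `large_shards` is the
# fact table already sharded across two workers: (order_id, user_id).
small = {1: "Ada", 2: "Ben"}
large_shards = [
    [(101, 1), (102, 2)],   # worker 0's shard
    [(103, 1)],             # worker 1's shard
]

def local_join(shard, broadcast_copy):
    """Each worker joins its shard against its own copy of the small table."""
    return [(order_id, user_id, broadcast_copy[user_id])
            for order_id, user_id in shard if user_id in broadcast_copy]

# Broadcast: every worker gets its own copy of `small`, so no
# large-table rows ever move between workers.
result = [row for shard in large_shards for row in local_join(shard, dict(small))]
```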


PostgreSQL performance is mainly about:

Optimized for many small, precise lookups.


BigQuery performance is mainly about:

Optimized for large scans and aggregations.


PostgreSQL goes fast by finding the right rows efficiently;

BigQuery goes fast by processing huge amounts of data in parallel while skipping everything it doesn’t need.

The user-facing SQL looks the same, but the execution model underneath is completely different.

You can write:

SELECT city, AVG(amount)
FROM orders
GROUP BY city;

This works in PostgreSQL and BigQuery.


But the engines behave differently

Database | How it executes the SQL | What it's optimized for
PostgreSQL | Uses indexes, row-level access, and a single-machine query planner | Fast small lookups and transactions
BigQuery | Skips columns, skips partitions, shards data, and runs work in parallel on many machines | Fast large scans and aggregations

Machine Learning in BQ

---

BigQuery ML: Machine Learning in SQL


How It Works

  1. Data Preparation: Store your dataset in a BigQuery table.
  2. Model Training: Use CREATE MODEL with SQL to train models.
  3. Evaluation: Assess model performance with ML.EVALUATE.
  4. Prediction: Generate predictions using ML.PREDICT.
  5. Integration: Embed models in queries, dashboards, or workflows.

Training a Model

-- Example: Logistic Regression
CREATE OR REPLACE MODEL `mydataset.my_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['label']
) AS
SELECT * FROM `mydataset.training_data`;

Evaluating and Predicting

Use ML.EVALUATE to assess the trained model's metrics and ML.PREDICT to generate predictions on new data.


Key Features

Use Cases

Advanced Options

