A serverless, large-scale analytical database from Google Cloud
BQ includes a built-in query engine capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure and without having to create or rebuild indexes.
console.cloud.google.com/bigquery
Hyperscalers: GCP, AWS, Azure
Alternatives:
In November 2025, Google Cloud offers 12 different database products:
| Solution | Description | Target Use Case |
|---|---|---|
| Cloud SQL | Fully managed relational database (MySQL, PostgreSQL, SQL Server). | Traditional applications requiring SQL, migrations from on-premises databases, OLTP workloads. |
| Cloud Spanner | Globally distributed relational database with strong consistency and horizontal scalability. | Mission-critical applications requiring global availability and transactional consistency. |
| Firestore | Serverless NoSQL document database with offline synchronization. | Mobile/web apps requiring real-time sync and automatic scaling. |
| Firebase Realtime Database | Real-time NoSQL database with instant synchronization across clients. | Mobile and web apps needing real-time updates (e.g., chat, gaming). |
| Bigtable | High-performance NoSQL database for massive, low-latency workloads. | Real-time analytics, IoT, time-series data, high-throughput applications. |
| Memorystore | Managed in-memory cache (Redis or Memcached). | Caching to reduce application latency and speed up access to frequently used data. |
| BigQuery | Serverless data warehouse for large-scale analytics. | Big data analysis, reporting, machine learning, and data warehousing. |
| Datastore | Managed NoSQL document database with automatic scaling. | Web and mobile apps requiring simple scalability and serverless management. |
| AlloyDB | PostgreSQL-compatible relational database optimized for performance and AI. | Demanding analytical and transactional workloads, integration with Vertex AI. |
| Database Migration Service | Service for migrating databases to Google Cloud with minimal downtime. | Migrating on-premises or cloud databases to Google Cloud. |
You should have $300 in credits to begin exploring.
BigQuery also offers a sandbox.
Caution: use at your own risk. It's very easy to spin up large resources and blow up your budget.
Do not try this if you are not comfortable using cloud resources with your own credit card.
Make sure you can monitor your usage and spend in Billing.
Turn off all resources you've set up.
Do not publish API keys or access credentials.
3 pricing tiers: storage + analysis (querying) + transfer
Storage is cheap. Querying is what you pay for.
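One way to keep an eye on what your queries cost (a sketch: INFORMATION_SCHEMA.JOBS_BY_USER is a standard BigQuery view, but adjust the region qualifier to where your datasets live):
-- Bytes processed/billed by your recent jobs (last 24 hours).
-- Replace `region-us` with your own region.
SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY total_bytes_billed DESC;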
Example: this lists 10 datasets from the bigquery-public-data project:
bq ls --project_id=bigquery-public-data --max_results=10
You can use these datasets to complement your own datasets.
It suffices to join on the public dataset:
JOIN `bigquery-public-data.new_york_citibike.citibike_trips` ON ...
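For example (a sketch: `my_project.my_dataset.stations_of_interest` and its `station_id` column are hypothetical, and the citibike column name is assumed from the public dataset's schema, so check it in the console):
-- Hypothetical: enrich your own station list with trip counts from the public dataset.
SELECT
  s.station_id,
  COUNT(*) AS trip_count
FROM `my_project.my_dataset.stations_of_interest` AS s
JOIN `bigquery-public-data.new_york_citibike.citibike_trips` AS t
  ON s.station_id = t.start_station_id  -- start_station_id assumed from the public schema
GROUP BY s.station_id
ORDER BY trip_count DESC;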
BQ is built for analytical workloads (big scans and aggregations), and it is not made for heavy transactional workloads.
In short: BigQuery = big reads, PostgreSQL = frequent small writes.
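To make the contrast concrete, here is an illustrative pair of statements (table and column names are made up):
-- Typical OLTP work (PostgreSQL territory): many small, frequent writes.
UPDATE accounts SET balance = balance - 10 WHERE account_id = 42;

-- Typical OLAP work (BigQuery territory): one big scan and aggregation.
SELECT city, COUNT(*) AS n_users, AVG(age) AS avg_age
FROM users
GROUP BY city;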
With BQ, latency is higher — results are optimized for throughput, not response time.
BQ is not trying to return each query very fast.
BQ is designed to handle very large amounts of data efficiently, even if that means each individual query takes longer to start or finish.
You might wait 2–5 seconds for the first result (higher latency), but the system can process huge datasets (terabytes) in one go (high throughput).
From wikipedia: Data orientation is the representation of tabular data in a linear memory model such as in-disk or in-memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format).
Data orientation => important tradeoffs in performance and storage.
In a row database, how does the system know when a row ends and a new one starts?
The database stores metadata with the data, not just the raw values.
In a row-oriented DB (e.g., PostgreSQL):
The DB stores rows inside pages (e.g., 8 KB in PostgreSQL).
So even if data looks like:
id, name, age, city
1, Alice, 30, New York
2, Bob, 25, Boston
On disk it’s more like:
[row header][ID][Name length][Name][Age][City length][City]
[row header][ID][Name length][Name][Age][City length][City]
The row header + schema = how the system knows where each row starts and ends.
=> easy writes: adding a new row = just append it after the last row
Column-based databases store data by columns instead of rows.
Each column’s values are stored together.
In a columnar system, the data is stored as (also with appropriate headers):
[1, 2][Alice, Bob][30, 25][New York, Boston]
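As an illustration (assuming a `users` table with the columns above), a query that touches a single column benefits directly from this layout:
-- Columnar storage: only the `age` column is read from disk;
-- the name and city columns are never touched.
SELECT AVG(age) AS avg_age
FROM users;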
See also this article: What is a columnar database?
Let’s compare the same join query on a row database (PostgreSQL) vs a columnar database (BigQuery) and explain the difference in execution plans.
Example Schema
Two tables:
customers (customer_id, name, city)
orders (order_id, customer_id, amount)
We want total money spent per customer:
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.name;
How it executes in PostgreSQL (row-oriented):
It uses indexes on customer_id if they exist. Execution behavior:
Fetch row from customers → Lookup matching rows in orders using index → Aggregate
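A sketch of the PostgreSQL side (the index name is made up; EXPLAIN ANALYZE shows the plan the planner actually chooses):
-- PostgreSQL (illustrative): an index on the join key enables index lookups
-- into orders instead of a full table scan.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Inspect the execution plan (e.g., index scans feeding a nested loop or hash join).
EXPLAIN ANALYZE
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;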
How it executes in BigQuery (columnar):
It reads only the columns needed: customer_id, name, and amount. Execution style:
Scan columnar segments → Build distributed hash table → Parallel join → Parallel aggregate
BigQuery is massively parallel because both the data and the query execution are distributed across many machines working at the same time.
So instead of one computer doing all the work, hundreds or thousands do small parts at once → massively parallel.
A distributed hash join means BigQuery splits the data across many machines.
Each machine hashes the join key of its rows, sends each row to the worker responsible for that hash bucket, and joins the rows it receives locally.
Hashing ensures that matching rows from two tables go to the same machine, so the join can be done in parallel.
We hash the join key to keep the workload evenly distributed across all workers and avoid hotspots: highly imbalanced workload distributions.
SELECT u.user_id, u.name, u.email, o.order_id, o.order_date
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.age > 18;
Shuffling Phase: BigQuery partitions both tables (users and orders) based on the join key (user_id). Each worker node processes a subset of the data, hashing the join key to distribute rows across nodes.
Build Phase: The smaller table (e.g., users) is loaded into a hash table in memory on each worker node. If the table is too large, BigQuery spills the hash table to disk.
Probe Phase: The larger table (e.g., orders) is scanned. For each row in orders, BigQuery probes the hash table to find matching rows from users using the join key (user_id). Matches are combined into the result set.
Filtering: The WHERE clause (age > 18) is applied before or during the join, depending on the query plan, to reduce the amount of data processed.
Why not distribute the workload across workers based on the values of the key?
If the distribution were based on the raw key values (e.g., user_id), it might be skewed: some worker nodes would receive significantly more data than others, leading to uneven workloads and slower query performance. A hash function (e.g., SHA-256 or a simpler internal hash) converts the join key into a uniformly distributed integer. This ensures that rows are evenly distributed across worker nodes, balancing the load. The load distribution is no longer based on the raw values of the join key.
If users 1-100 have lots of orders, the worker responsible for this range of values would be overloaded,
but with hashing, each id in 1 to 100 gets its own hash value and is spread across several workers (see the sketch further below).
However, if a single key (e.g., user_id = 1) has too many rows, even hashing won’t help—one worker will still be overloaded.
If user_id values are sequential (1, 2, 3, ...) and you distribute them by value ranges (e.g., Node 1 gets 1-1000, Node 2 gets 1001-2000, etc.), a few nodes might end up processing most of the data if the keys are not uniformly distributed.
Partitioning: hashing allows BigQuery to partition the data into smaller chunks that can be processed independently by different workers. This is critical for distributed systems where data is too large to fit on a single machine.
Minimizing Data Movement: By hashing the join key, each worker only needs to communicate with other workers handling the same hash bucket, reducing network overhead.
Hashing the join key ensures even load distribution, co-location of matching rows on the same worker, and minimal data movement.
BigQuery (and other distributed databases) use hashing because it’s the most efficient way to distribute and join data in parallel.
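To see the effect of hashing on sequential keys, here is a purely illustrative query (the 4-bucket split is not BigQuery's actual internal scheme; FARM_FINGERPRINT just stands in for the hash function):
-- Sequential user_ids land in different buckets, so a "hot" range of ids
-- gets spread across workers instead of piling onto one.
SELECT
  user_id,
  MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 4) AS bucket
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS user_id
ORDER BY user_id;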
Columnar orientation splits the data along the column axis (width / column reduction): the engine only reads the columns it needs.
BQ has a 2nd level of data reduction: partitioning, which works along the row axis (height / row reduction).
Partitioning is splitting a large table into smaller chunks (partitions) based on a granular column — usually date, timestamp, or integer range.
Instead of scanning the whole table, BigQuery will only scan the partitions needed for the query.
The query
SELECT * FROM events
WHERE event_date = "2024-01-02";
→ BigQuery reads only the 2024-01-02 partition, not the entire table.
Conceptually, partitions are stored like separate directories:
/data/2024-01-01/ (rows for Jan 1, 2024)
/data/2024-01-02/ (rows for Jan 2, 2024)
...
Why it matters: partition pruning (on date) combines with column pruning, so the following query scans only one partition and only the columns it needs:
SELECT name, age
FROM users
WHERE date = '2023-01-01';
CREATE TABLE project.dataset.orders (
order_id INT64,
customer_id INT64,
order_date DATE,
amount FLOAT64
)
PARTITION BY order_date;
Or with extra clustering
CREATE TABLE project.dataset.customer_orders (
order_id INT64,
customer_id INT64,
order_date DATE,
region STRING,
amount FLOAT64
)
PARTITION BY order_date
CLUSTER BY customer_id, region;
CLUSTER BY further organizes the data within each partition into physical chunks.
Without clustering:
Partition: 2025-01-01
├── All rows for that date stored together
└── Mixed customer_ids and regions scattered throughout
With clustering:
Partition: 2025-01-01
├── Cluster 1: customer_id 1-100, region "US"
├── Cluster 2: customer_id 1-100, region "EU"
├── Cluster 3: customer_id 101-200, region "US"
└── Cluster 4: customer_id 101-200, region "EU"
When you query with a filter like:
SELECT * FROM customer_orders
WHERE order_date = '2025-01-01'
AND customer_id = 50
AND region = 'US';
BigQuery does this: it prunes to the 2025-01-01 partition, then within that partition reads only the clustered blocks whose customer_id / region ranges can contain customer_id = 50 and region = 'US'.
This dramatically reduces the amount of data read from disk. You're scanning maybe 5-10% of the partition instead of 100%.
Clustering creates sort order within the partition files. It doesn't create separate sub-partitions you can see; it's transparent internal organization.
But the physical data is ordered so that rows with similar customer_id and region values sit next to each other, and block-level metadata (min/max values) lets BigQuery skip blocks that cannot match the filter.
-> Partitioning and Clustering Replace Indexes
BigQuery does not support manual indexes because its architecture (columnar + partitioning + clustering) makes them unnecessary for most analytical queries. Design your tables with partitioning/clustering to optimize performance.
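A useful safeguard on top of this (a sketch; `require_partition_filter` is a standard BigQuery table option): force every query to filter on the partition column, so nobody accidentally scans the whole table.
CREATE TABLE project.dataset.orders_safe (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  amount FLOAT64
)
PARTITION BY order_date
-- Queries that don't filter on order_date will be rejected.
OPTIONS (require_partition_filter = TRUE);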
So partitioning is subsetting a table with respect to rows, while columnar data organization is subsetting it with respect to columns.
When you run a query:
SELECT col1, col3
FROM my_table
WHERE date = '2024-01-02';
BigQuery does this: it keeps only the 2024-01-02 partition → ignores all other row groups, then reads only col1, col3, and the date filter column.

| Step | What is filtered | Effect |
|---|---|---|
| 1. Partition pruning | Rows | Skip irrelevant row blocks |
| 2. Column pruning | Columns | Skip irrelevant columns |
| 3. Join columns back | Based on row index | Reconstruct the requested rows |
Partitioning removes rows you don’t need, columnar storage skips columns you don’t need. -> BQ can handle super large datasets
But that's not the only optimization. Columns often contain similar values → BigQuery applies compression (e.g., run-length and dictionary encoding).
Smaller data → lower scan cost → faster reads.
Small tables are broadcast to all workers: it is faster to copy the small table to every worker machine, so each worker can join locally without moving the large table around.
Large tables are joined in parallel using hash partitioning.
PostgreSQL: optimized for many small, precise lookups.
BigQuery:
- Subsetting the data before processing
- Massively parallel execution across many workers
- Broadcast / distributed hash joins
- Vectorized execution + compression
→ Optimized for large scans and aggregations.
PostgreSQL goes fast by finding the right rows efficiently;
BigQuery goes fast by processing huge amounts of data in parallel while skipping everything it doesn’t need.
The user-facing SQL looks the same, but the execution model underneath is completely different.
You can write:
SELECT city, AVG(amount)
FROM orders
GROUP BY city;
This works in PostgreSQL and BigQuery.
| Database | How it executes the SQL | What it's optimized for |
|---|---|---|
| PostgreSQL | Uses indexes, row-level access, and a single-machine query planner | Fast small lookups and transactions |
| BigQuery | Skips columns, skips partitions, shards data, and runs work in parallel on many machines | Fast large scans and aggregations |
BigQuery ML: use CREATE MODEL with SQL to train models, ML.EVALUATE to evaluate them, and ML.PREDICT to get predictions.

-- Example: Logistic Regression
CREATE OR REPLACE MODEL `mydataset.my_model`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['label']
) AS
SELECT * FROM `mydataset.training_data`;
Supported model types include:
- LINEAR_REG (regression)
- LOGISTIC_REG (classification)
- KMEANS (clustering)
- ARIMA_PLUS (time series)
- AUTOML_CLASSIFIER / AUTOML_REGRESSOR (AutoML)

SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_model`);
SELECT * FROM ML.PREDICT(MODEL `mydataset.my_model`, TABLE `mydataset.new_data`);