Neo4j Practice

Aura db

AuraDB offers free neo4j hosting.

Create an account on AuraDB and then create an instance.

Important, save the password, you won't be able to see it afterwards

Click on the "Learning" button bottom left side of the bage,

then select Query fundamentals

And connect to your instance

On page 2 of the tutorial, Load the Northwind dataset.

and click "run import", upper right

What does the Load dataset do

we can see that multiple files have been uploaded

one file per node
one file per relation

click on "show cypher" to understand how the data is loaded

cypher

The 1st cypher statements is

CYPHER 5 CREATE CONSTRAINT `territoryID_Territory_uniq` IF NOT EXISTS
FOR (n: `Territory`)
REQUIRE (n.`territoryID`) IS UNIQUE;

This creates a constraint named supplierID_Supplier_uniq that ensures uniqueness of the supplierID property across all nodes labeled Supplier. The constraint enforces data integrity and prevents duplicate suppliers in the dataset

the 2nd statement

Load statement

CYPHER 5 UNWIND $nodeRecords AS nodeRecord
WITH *
WHERE NOT nodeRecord.`supplierID` IN $idsToSkip AND NOT nodeRecord.`supplierID` IS NULL
MERGE (n: `Supplier` { `supplierID`: nodeRecord.`supplierID` })
SET n.`companyName` = nodeRecord.`companyName`
SET n.`contactName` = nodeRecord.`contactName`
SET n.`contactTitle` = nodeRecord.`contactTitle`
SET n.`address` = nodeRecord.`address`
SET n.`city` = nodeRecord.`city`
SET n.`region` = nodeRecord.`region`
SET n.`postalCode` = nodeRecord.`postalCode`
SET n.`country` = nodeRecord.`country`
SET n.`phone` = nodeRecord.`phone`
SET n.`fax` = nodeRecord.`fax`
SET n.`homePage` = nodeRecord.`homePage`;

Loads the dataset into the collection.

UNWIND $nodeRecords AS nodeRecord : Takes a parameter array $nodeRecords and processes each record individually
WHERE NOT nodeRecord.supplierID IN $idsToSkip and NOT nodeRecord.supplierID IS NULL - Filters out records that should be skipped (via $idsToSkip parameter) and records with null supplier IDs

Node Creation/Update:

MERGE (n: Supplier{supplierID: nodeRecord.supplierID }): Either finds an existing Supplier node with the given supplierID or creates a new one if it doesn't exist

The subsequent SET statements populate all the supplier properties:

companyName - The supplier's company name
contactName - Primary contact person ...
It uses MERGE instead of CREATE to handle potential duplicates gracefully: If a supplier with the same ID already exists, it will update the properties rather than create a duplicate The $idsToSkip parameter allows you to exclude certain suppliers from being processed (useful for incremental loads or handling problematic data)

This is a typical bulk data loading pattern for Neo4j - you pass in an array of records as a parameter and process them all in a single transaction for efficiency.

The original $nodeRecords object is probably loaded with

LOAD CSV WITH HEADERS FROM 'file:///suppliers.csv' AS row
WITH collect({
  supplierID: toInteger(row.SupplierID),
  companyName: row.CompanyName,
  contactName: row.ContactName
  // ... other fields
}) AS nodeRecords

Then the UNWIND statement would follow

Loading a csv dataset

For nodes

3 steps:

load the csv data into a node object
[opt] create uniqueness constraints
Unwind the node opbject to populate each element into the node collection

For relations

the cypher Load statement is

CYPHER 5 UNWIND $relRecords AS relRecord
MATCH (source: `Territory` { `territoryID`: relRecord.`territoryID` })
MATCH (target: `Region` { `regionID`: relRecord.`regionID` })
MERGE (source)-[r: `IN_REGION`]->(target);

We match the Ids of both sides of the relations with exiting node Ids.

And we must do that for every relations.

A lot more work than loading a csv or json into MongoDB, weaviate or even PostgresQL (although we need to create tables first).

Data loading prep

Data Preparation Challenges

we need

one csv per node
one csv per relationship
All IDs must match and exist

You can add constraints to enforce data integrity but schema validation really relies on upstream data cleaning and formatting work.

Including:

Normalize IDs across all files
Handle missing references (what if an orderID doesn't exist?)
Decide on ID strategy (keep original IDs vs. generate new ones)
Plan import order (nodes first, then relationships)

IDs can be of any type as long as they are unique.

integer
strings
- UUID
- Hash strings
- any unique string

Northwind dataset schema

The Northwind dataset is a fictional trading company database originally created by Microsoft for SQL Server tutorials. It represents a small trading company that imports and exports specialty foods.

Key entities in the Northwind dataset include:

Products: Specialty food items like beverages, condiments, and seafood
Categories: Product groupings (e.g., Beverages, Confections, Dairy Products)
Suppliers: Companies that provide the products
Customers: Companies that buy the products
Orders: Purchase transactions
Employees: Staff members who handle orders
Shippers: Companies that deliver the orders

The relationships between these entities demonstrate:

One-to-many relationships (e.g., a Category contains multiple Products)
Many-to-many relationships (e.g., Orders contain multiple Products, Products appear in multiple Orders)
Self-referential relationships (e.g., Employees have a reporting structure)

The relationships are defined as

IN_REGION: a territory belongs to a region
IN_TERRITORY: an employee is in a territory
product is PART_OF a category
an order ORDERS (includes) products
- the relationship ORDERS has properties: unit price, quantity, discount
a customer PURCHASED orders
suppliers SUPPLIES products
employees REPORT_TO employees
an employee SOLD orders

Raw csv

The northwind dataset is available from the neo4j github repo: https://github.com/neo4j-contrib/northwind-neo4j

This other repo holds the csv files and the code to load the dataset https://github.com/neo4j-graph-examples/northwind

Schema viz

Once loaded you can see the whole schema with

CALL db.schema.visualization()

Northwind Schema

Let's Explore

Click on the PURCHASED relations and notice the query

MATCH p=()-[:PURCHASED]->() RETURN p LIMIT 25;

Edit the query and remove the LIMIT 25 to see the whole graph with the PURCHASED relationship

Click on a Product node in the sidebar and then on one of the nodes in the graph

Note the cypher query

MATCH (n:Product) RETURN n LIMIT 25;

find all the orders for all customers

A customer - purchased -> orders

So the cypher is

MATCH path=(c:Customer)-[pr:PURCHASED]->(o:Order) RETURN path LIMIT 25;

If you don't want the graph but want to return the table instead, just return the entities and relations that you need from the query.

For instance if we want to return the customer's contactName and companyName, the order's orderID and shipName, the query becomes

MATCH (c:Customer)-[pr:PURCHASED]->(o:Order) RETURN c.contactName, c.companyName, o.orderID, o.shipName LIMIT 25;

and you get the table

Cypher query syntax

A basic Cypher query follows this structure:

MATCH [pattern] - Define what to find in the graph WHERE [conditions] - Optional: Filter the results RETURN [what to show] - Specify output LIMIT [number] - Optional: Restrict number of results

A MATCH pattern has three key components:

Nodes: In parentheses ()
- With labels: (p:Person)
- Properties: (p:Person {name: "John"})
- Anonymous: ()
Relationships: In square brackets []
- With types: -[:KNOWS]->
- Direction: -> or <-
- Anonymous: --
Variables: For referencing
- Node variables: (p:Person)
- Relationship variables: -[r:KNOWS]->
- Path variables: path=()-->()

Filtering, ordering, limiting, coalescing, ... in cypher with the MATCH clause, is not that different from regular SQL queries syntax .

See the cheat sheet for the MATCH clause

Your turn

Knowing the relations:

Product PART_OF Category
Suppliers :SUPPLIES Products
Supplier SUPPLIES Product PART_OF Category
Order :ORDERS Product

Write the following queries for graph and tables

find all products over $50
for seafood products,
- return product name and price
- display the graph
Who supplies seafood product
- return productName, unitPrice, companyName, country

Returning more entities

This query returns the graph of suppliers who supply product > $50 and their categories.

MATCH path=(s:Supplier)-[:SUPPLIES]->(p:Product)-[:PART_OF]->(c:Category)
WHERE p.unitPrice > 50
RETURN path

What if we also want to see the orders for these products ?

Using WITH

MATCH path=(s:Supplier)-[:SUPPLIES]->(p:Product)-[:PART_OF]->(c:Category)
WHERE p.unitPrice > 50
WITH path, p
MATCH orderPath=(o:Order)-[: ORDERS ]->(p)
RETURN path, orderPath

this returns the first graph of suppliers, products and their categories. But also the orders for these products.

Back to practicing

Which categories have the most expensive products?
- use max(p.unitPrice) AS maxPrice
Find all orders shipped to London
find all the products shipped to London
Who are our top 5 customers by number of orders?

Optional MATCH

OPTIONAL MATCH works like a regular MATCH but when no matching relationships/patterns are found, it returns NULL instead of filtering out the record.

This Returns ALL products with their orders (if they exist)

MATCH (p:Product)
WHERE p.productName = 'Chai'
OPTIONAL MATCH (p)<-[o:ORDERS]-()
RETURN p.productName, o;

Will show:

Product1, order_relationship1
Product1, order_relationship2
Product2, order_relationship3
Product3, NULL // Product with no orders
Product4, order_relationship5

OPTIONAL MATCH is like a LEFT JOIN in SQL : it keeps records from the left side (Products) even when there's no match on the right side (Orders).

So for instance looking at order counts

MATCH (p:Product)
WHERE p.produtName = "Aniseed Syrup"
OPTIONAL MATCH (p)<-[o:ORDERS]-()
RETURN p.productName,
       COUNT(o) as order_count;

Property Filters

These queries are equivalent and provide the same results.

MATCH (o:Order)-[:ORDERS]->(p:Product)
WHERE o.shipCity = 'London'
RETURN o.orderID, o.shipDate, p.productName, p.unitPrice

property pattern:

MATCH (o:Order { shipCity : 'London' })-[:ORDERS]->(p:Product)
RETURN o.orderID, o.shipDate, p.productName, p.unitPrice

The property pattern is slightly more performant as it filters earlier in the query execution. However, WHERE clauses offer more flexibility for complex conditions.

Back to you

find all the employees reporting to mr / mrs 'Fuller'
which product categories each supplier provides ?
- you can omit the relations between entities
- (s:Supplier)-->(:Product)-->(c:Category)

MATCH (s:Supplier)-->(:Product)-->(c:Category)
RETURN s.companyName AS company, collect( DISTINCT c.categoryName) AS categories

Create new relationships with MERGE

The pattern to create relationship VERB with properties

MATCH <source to target relation>
WITH source, target, some way of counting
WHERE define threshold or some filtering
MERGE (source)<->r:VERB</-/>(target)
SET r.<property name> = <property value>

Let's create new relationships : FREQUENTLY_BOUGHT_WITH: as in belong to the same order

Notice how you can point back to the product through the order.

First we find all the products that often (freq > 10) bought together.

MATCH (p1:Product)<-[:ORDERS]-(Order)-[:ORDERS]->(p2:Product)
WHERE p1 < p2
WITH p1, p2, count(*) AS frequency
WHERE frequency > 3
return p1.productName, p2.productName, frequency
ORDER BY frequency desc

Let's use the results of that query to create a new relation called FREQUENTLY_BOUGHT_WITH and add the property weight = frequency

We do not specify any arrows which makes the relation bidirectional

MATCH (p1:Product)<-[:ORDERS]-(Order)-[:ORDERS]->(p2:Product)
WHERE p1 < p2
WITH p1, p2, count(*) AS frequency
WHERE frequency > 3
MERGE (p1)-[r:FREQUENTLY_BOUGHT_WITH]-(p2)
 SET r.weight= frequency

We see : Created 58 relationships, set 58 properties

now we can visualize the products using this relation

MATCH p=()-[:FREQUENTLY_BOUGHT_WITH]->() RETURN p

Your turn to create a new relationship

create new relations

Employee to Employee (worked together)
- as in employees who PROCESSED the same orders

Other example

MATCH (c:Customer)-[:PURCHASED]->(:Order)-[:ORDERS]->(p:Product)<-[:ORDERS]-(:Order)<-[:PURCHASED]-(c2:Customer)
WHERE c < c2
// find similar customers
WITH c, c2, count(*) as similarity
// with at least 50 shared product purchases
WHERE similarity > 50
// create a relationship between the two without specifying direction
MERGE (c)-[sim:SIMILAR_TO]-(c2)
// set relationship weight from similairity
ON CREATE SET sim.weight = similarity

Advanced databases - Graph database Neo4j - practice