Project - Create, Validate, Import, Explore

In this project you will create a new MongoDB database from a dataset of your choice.

Your task consists in:

You can work solo or in groups of 2 max.

The deliverable is a report on these tasks.

Avoid using Atlas, prefer local MongoDB.

Deliverable

Please fill in the spreadsheet with a link to the dataset and a link to the Google Doc that you will submit.

This project is graded.

You will have 5 minutes to present your work on Friday Nov 28th. No slides needed: just share your screen and talk us through it.

The project report: A Tutorial

You will write a tutorial based on your experience.

The document must include the following elements:

Throughout the project,

and finish with

The most interesting part of the report is the issues you encountered and how you solved them.

Worth repeating:

genAI usage

If (hahaha) you use LLMs to write, it’s ok, but please

As the bard said: Brevity is the soul of Wit

Dataset Selection

The dataset choice is yours. The only condition is that it must be fairly large: several GB of data.

Application Design

Start by choosing the dataset and imagining the use of this data by an application.

For example, if we use the data from the 210k trees of Paris:

When you have an idea of your application, think about how it will consume the data.

What will be the most frequent operations and queries in the application?

For example for the first “field user” application

etc…

Preparation Phase

Prepare your working environment.

Note the technical characteristics of your computer that will influence your import strategy.

Note for example:
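Some of these characteristics can be read programmatically. The snippet below is a minimal sketch, assuming Node.js or mongosh (both expose the built-in os module); the helper name machineSummary is purely illustrative.

```javascript
// Collect basic hardware facts with the built-in "os" module.
const os = require("os");

function machineSummary() {
  // Convert bytes to GiB, rounded to one decimal place.
  const toGiB = (bytes) => Math.round((bytes / 1024 ** 3) * 10) / 10;
  return {
    platform: os.platform(),
    arch: os.arch(),
    cpuCores: os.cpus().length,
    totalMemGiB: toGiB(os.totalmem()),
    freeMemGiB: toGiB(os.freemem()),
  };
}

console.log(machineSummary());
```

The os module does not report free disk space, so that figure still has to come from your operating system's own tools.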

Install MongoDB and mongoimport

Then install MongoDB Community Edition.

On Windows 11, you will use the MSI installer available on the official website.

On macOS, you can use Homebrew: brew tap mongodb/brew, then brew install mongodb-community.

Verify that the installation is successful by starting the MongoDB server and connecting to it.

To check whether mongoimport is already installed on your local machine, run mongoimport --version in a terminal. If it prints a version number, you’re fine; otherwise, you must install it.

The command line utilities are distributed separately as the MongoDB Database Tools. They include mongoimport and mongoexport; see also mongostat and mongotop as diagnostic tools.

Installing the MongoDB server and tools on Windows is usually more difficult than on macOS or Linux. ChatGPT is often a great help for troubleshooting.

Schema Analysis and Design

Before starting the import, examine the original dataset files.

In your documentation, describe:

From this analysis, design your schema validation rules.

Think about the level of validation you want to implement and justify your choice in your documentation.
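As an illustration of what such rules can look like, here is a minimal sketch of a $jsonSchema validator. The collection name trees and the fields species, height_m and arrondissement are hypothetical and not taken from any real dataset; the guarded part only runs inside mongosh, where the global db exists.

```javascript
// Hypothetical validation rules for a "trees" collection (illustrative field names).
const treeValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["species", "height_m"],
    properties: {
      species: { bsonType: "string", description: "species name, required" },
      height_m: {
        bsonType: ["int", "double"],
        minimum: 0,
        description: "height in meters, required and non-negative",
      },
      arrondissement: { bsonType: "string", description: "optional district label" },
    },
  },
};

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  db.createCollection("trees", {
    validator: treeValidator,
    validationLevel: "strict", // apply the rules to all inserts and updates
    validationAction: "error", // reject invalid documents ("warn" would only log them)
  });
}
```

Choosing "warn" instead of "error" lets a first import go through while still logging which documents break the rules, which can be a reasonable intermediate step to document in the report.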

Import Planning

In your report, note:

Indexing Strategy

Identify fields that will be frequently queried in your requests.
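Once those fields are identified, each frequent query pattern can be mapped to an index. The sketch below assumes a hypothetical trees collection with species, arrondissement and height_m fields; the index names are also illustrative. The guarded part only runs inside mongosh.

```javascript
// One index definition per (hypothetical) frequent query pattern.
const indexSpecs = [
  // Equality lookups on species.
  { keys: { species: 1 }, options: { name: "species_asc" } },
  // Filter by district, then sort by height, tallest first.
  { keys: { arrondissement: 1, height_m: -1 }, options: { name: "arr_height" } },
];

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  for (const { keys, options } of indexSpecs) {
    db.trees.createIndex(keys, options);
  }
  // Inspect the winning plan: an IXSCAN stage means an index is used.
  const plan = db.trees.find({ species: "Platanus" }).explain("executionStats");
  printjson(plan.queryPlanner.winningPlan);
}
```

Comparing the explain() output before and after creating an index gives concrete numbers to include in the report.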

In your report, specify:

Import Process

Start by testing your approach with a small sample of data.

Monitor system resources during import and note:

If you encounter memory constraints, reduce the import batch size or tune the --numInsertionWorkers option of mongoimport, which sets the number of concurrent insert workers.
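When inserting from mongosh rather than with mongoimport, the batch size can be controlled by hand. The following is a minimal sketch: the helpers chunk and insertInBatches are illustrative names, and the batch size of 200 is an arbitrary assumption. It still parses the whole file in memory, so it only suits samples, not a multi-GB file.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(docs, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Send the documents batch by batch. `collection` is any object exposing
// insertMany(), such as db.trees in mongosh. Returns the number of documents sent.
function insertInBatches(collection, docs, size) {
  let sent = 0;
  for (const batch of chunk(docs, size)) {
    collection.insertMany(batch);
    sent += batch.length;
  }
  return sent;
}

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  const fs = require("fs");
  const docs = JSON.parse(fs.readFileSync("./trees_1k.json", "utf8"));
  print(`sent ${insertInBatches(db.trees, docs, 200)} documents`);
}
```

Timing the same sample with a few different batch sizes is a simple experiment to report.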

Verification and Validation

After import, verify the integrity of your data.
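Two basic checks are comparing the document count with the expected number and looking for documents that lack required fields. This is a minimal sketch under stated assumptions: the expected count of 1000 matches a 1k sample, and the field names species and height_m are hypothetical. The guarded part only runs inside mongosh.

```javascript
// Return the documents that are missing (or have null for) any required field.
function findIncomplete(docs, requiredFields) {
  return docs.filter((doc) =>
    requiredFields.some((field) => doc[field] === undefined || doc[field] === null)
  );
}

// Hypothetical values: adjust to your own dataset.
const expectedCount = 1000;
const requiredFields = ["species", "height_m"];

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  const actualCount = db.trees.countDocuments({});
  print(`expected ${expectedCount}, found ${actualCount}`);
  // { field: null } matches documents where the field is null or absent.
  const incomplete = db.trees.countDocuments({
    $or: requiredFields.map((field) => ({ [field]: null })),
  });
  print(`${incomplete} document(s) missing a required field`);
}
```

The findIncomplete helper applies the same check to an in-memory sample, which is handy before the import, while the countDocuments query does it on the server afterwards.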

Include in your report:

Summary and Reflection

Your report can include

Errors and difficulties are part of the learning process.

Importing data in MongoDB

We have several options to load the data from a JSON file.

Either in mongosh with insertMany() or from the command line with mongoimport.

In mongosh with insertMany()

The following script loads a JSON file, inserts its documents with insertMany(), and prints the elapsed time:

// Load the JSON data (an array of documents) from disk
const fs = require("fs");
const dataPath = "./trees_1k.json";
const treesData = JSON.parse(fs.readFileSync(dataPath, "utf8"));

// Insert the documents into the trees collection and time the operation
const startTime = new Date();
db.trees.insertMany(treesData);
const endTime = new Date();
print(`Operation took ${endTime - startTime} milliseconds`);

Using mongoimport command line

mongoimport is usually the fastest option for large volumes.

By default, mongoimport takes an NDJSON file (one document per line) as input.

But you can also use a JSON file (an array of documents) if you add the flag --jsonArray.
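To make the difference between the two formats concrete, here is a minimal sketch that turns an array of documents into NDJSON text. The function name toNdjson and the sample documents are illustrative.

```javascript
// Serialize an array of documents as NDJSON: one JSON document per line.
function toNdjson(docs) {
  if (docs.length === 0) return "";
  return docs.map((doc) => JSON.stringify(doc)).join("\n") + "\n";
}

// Illustrative sample documents.
const sampleDocs = [
  { species: "Platanus", height_m: 12 },
  { species: "Tilia", height_m: 8 },
];

console.log(toNdjson(sampleDocs));
```

The same array written with JSON.stringify(sampleDocs) is the form that needs --jsonArray, while the output of toNdjson is the default input format.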

The overall syntax for mongoimport follows:

mongoimport --uri="mongodb+srv://<username>:<password>@<cluster-url>" \
--db <database_name> \
--collection <collection_name> \
--file <path to ndjson file>

Here are other useful, self-explanatory flags that may come in handy:

In our context, here is a version of the command line, using the MONGO_ATLAS_URI environment variable and loading the JSON file trees_1k.json in the current folder.

time mongoimport --uri="${MONGO_ATLAS_URI}" \
--db treesdb  \
--collection trees \
--jsonArray \
--file ./trees_1k.json

which results in

2024-12-13T11:41:45.941+0100 connected to: mongodb+srv://[**REDACTED**]@skatai.w932a.mongodb.net/
2024-12-13T11:41:48.942+0100 [########################] treesdb.trees	558KB/558KB (100.0%)
2024-12-13T11:41:52.087+0100 [########################] treesdb.trees	558KB/558KB (100.0%)
2024-12-13T11:41:52.087+0100 1000 document(s) imported successfully. 0 document(s) failed to import.
mongoimport --uri="${MONGO_ATLAS_URI}" --db treesdb --collection trees  --fil  0.15s user 0.09s system 3% cpu 6.869 total

Summary

You will present your project on Friday Nov 28th. You have 5 minutes to present your work. No slides, just walk us through it.

Since there are many of us, it is very important that you stick to the 5-minute presentation time.

And don’t forget to have fun.