In this project you will create a new MongoDB database from a dataset of your choice.
Your task consists of the steps detailed below.
You can work solo or in groups of 2 max.
The deliverable is a report on these tasks.
Avoid using Atlas; prefer a local MongoDB instance.
Please fill in the spreadsheet with a link to the dataset and a link to the Google Doc that you will submit.
This project is graded.
You will have 5' to present your work on Friday Dec 13th. No slides needed, just screen sharing and talking.
You will write a tutorial based on your experience.
The document must include the elements listed below. Take notes throughout the project, and finish with a summary of what you learned.
The most interesting element of the report is the issues you encountered and how you solved them.
Worth repeating: if (hahaha) you use LLMs to write, that's OK, but please keep it short.
As the Bard said: brevity is the soul of wit.

The dataset choice is yours. The only condition is that it must be fairly large: at least 1 GB of data, ideally several GB.
The dataset can include long texts (articles, books, ...), IoT data, etc.
The dataset must be large enough to justify using MongoDB.
Avoid datasets that are mainly images or videos, as these would require a cloud storage solution to avoid storing binary files directly in MongoDB documents.
Start by choosing the dataset and imagining the use of this data by an application.
For example, suppose we use the data from the 210k trees of Paris and imagine, for instance, a "field user" application.
When you have an idea of your application, think about how it will consume the data.
What will be the most frequent operations and queries in the application?
For example, for the "field user" application mentioned above, list the queries it would run most often, and so on (see the sketch below).
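As an illustration, here are a few hypothetical frequent queries for such a trees app in mongosh. The field names (species, geoloc) are assumptions to adapt to your actual dataset:
// find trees of a given species
db.trees.find({ species: "Platanus x hispanica" }).limit(10)
// find trees within 500 m of a point (requires a 2dsphere index on geoloc)
db.trees.find({
  geoloc: {
    $near: {
      $geometry: { type: "Point", coordinates: [2.3522, 48.8566] },
      $maxDistance: 500
    }
  }
})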
Prepare your working environment.
Note the technical characteristics of your computer that will influence your import strategy.
Note, for example, the available RAM, free disk space, and CPU.
Then install MongoDB Community Edition.
On Windows 11, you will use the MSI installer available on the official website.
On MacOS, you can use Homebrew with the appropriate command.
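A sketch of the Homebrew route, using the official MongoDB tap (the formula may carry a version suffix on your machine):
brew tap mongodb/brew
brew install mongodb-community
brew services start mongodb-community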
Verify that the installation is successful by starting the MongoDB server and connecting to it.
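For example, assuming a default local install listening on port 27017:
mongosh --eval "db.runCommand({ ping: 1 })"
If the server replies with ok: 1, you are good to go.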
To check whether mongoimport is already installed on your machine, run mongoimport --version in a terminal. If it prints a version you're fine; otherwise you must install it.
See the MongoDB Database Tools download page for the command-line utilities. The package includes mongoimport and mongoexport; see also mongostat and mongotop as diagnostic tools.
Installing the MongoDB server and tools on Windows is usually more difficult than on macOS or Linux. ChatGPT is often a great help for troubleshooting.
Before starting the import, examine the original dataset files.
In your documentation, describe the structure of the files: format, size, number of records, field types, and any anomalies you spot.
From this analysis, design your schema validation rules.
Think about the level of validation you want to implement and justify your choice in your documentation.
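As a starting point, here is a minimal sketch of a validator in mongosh. The collection and field names are hypothetical, adapt them to your dataset:
// create the collection with a JSON Schema validator
db.createCollection("trees", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "height"],
      properties: {
        name: { bsonType: "string" },
        height: { bsonType: ["int", "double"], minimum: 0 }
      }
    }
  },
  validationLevel: "moderate", // existing invalid documents are tolerated
  validationAction: "warn"     // log violations instead of rejecting writes
})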
In your report, note the validation rules you chose and why.
Identify the fields that your application will query most frequently.
In your report, specify these fields and the indexes you plan to create for them (see the sketch below).
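A minimal sketch in mongosh, again with hypothetical field names:
// index the fields the application queries most often
db.trees.createIndex({ height: -1 })         // range queries and sorting on height
db.trees.createIndex({ geoloc: "2dsphere" }) // geospatial $near queries
// verify that a query actually uses the index
db.trees.find({ height: { $gt: 30 } }).explain("executionStats")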
Start by testing your approach with a small sample of data.
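For example, a sketch assuming an ndjson source file and hypothetical file names:
# extract a 1000-document sample and import it into a throwaway collection
head -n 1000 trees.ndjson > sample.ndjson
mongoimport --uri="mongodb://localhost:27017" \
  --db treesdb --collection trees_sample \
  --file ./sample.ndjson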
Monitor system resources during import and note memory usage, CPU load, and import duration.
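The diagnostic tools mentioned earlier are handy here; for example, in a second terminal while the import runs:
mongostat 5   # server-wide stats (inserts/s, memory, ...) every 5 seconds
mongotop 5    # per-collection read/write time every 5 seconds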
If you encounter memory constraints, adjust the import batch size or use the --numInsertionWorkers option of mongoimport.
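For example, a sketch with a hypothetical local URI and file name:
mongoimport --uri="mongodb://localhost:27017" \
  --db treesdb --collection trees \
  --numInsertionWorkers 4 \
  --file ./trees.ndjson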
After import, verify the integrity of your data.
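For example, a few quick checks in mongosh (field names are hypothetical):
db.trees.countDocuments()                 // compare with the record count of the source file
db.trees.findOne()                        // eyeball one document
db.trees.countDocuments({ height: null }) // look for missing values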
Include the results of these checks in your report.
Your report can also include the dead ends and failed attempts.
Errors and difficulties are part of the learning process.
We have several options to load the data from a JSON file.
Either in mongosh with insertMany() or from the command line with mongoimport.
insertMany()
The following script loads the trees_1k.json file into the trees collection with insertMany():
// load the JSON data
const fs = require("fs");
const dataPath = "./trees_1k.json"
const treesData = JSON.parse(fs.readFileSync(dataPath, "utf8"));
// Insert data into the desired collection
let startTime = new Date()
db.trees.insertMany(treesData);
let endTime = new Date()
print(`Operation took ${endTime - startTime} milliseconds`)
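To run it, save the script (the file name load_trees.js is hypothetical) and pass it to mongosh together with the database name:
mongosh treesdb load_trees.js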
mongoimport command line
mongoimport is usually the fastest option for large volumes.
By default mongoimport takes an ndjson file (one document per line) as input.
But you can also use a JSON file (an array of documents) if you add the flag --jsonArray.
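If your source is a JSON array and you prefer ndjson, one way to convert it, assuming jq is installed (file names are hypothetical):
jq -c '.[]' trees.json > trees.ndjson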
The overall syntax for mongoimport follows:
mongoimport --uri="mongodb+srv://<username>:<password>@<cluster-url>" \
--db <database_name> \
--collection <collection_name> \
--file <path to ndjson file>
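Since the advice above is to prefer a local MongoDB over Atlas, note that the same command works with a local URI (default port 27017):
mongoimport --uri="mongodb://localhost:27017" \
  --db <database_name> \
  --collection <collection_name> \
  --file <path to ndjson file>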
Here are other interesting, self-explanatory flags that may come in handy:
--mode=[insert|upsert|merge|delete]
--stopOnError
--drop (drops the collection first)
In our context, here is a version of the command line, using the MONGO_ATLAS_URI environment variable and loading the JSON file trees_1k.json from the current folder.
time mongoimport --uri="${MONGO_ATLAS_URI}" \
--db treesdb \
--collection trees \
--jsonArray \
--file ./trees_1k.json
which results in
2024-12-13T11:41:45.941+0100 connected to: mongodb+srv://[**REDACTED**]@skatai.w932a.mongodb.net/
2024-12-13T11:41:48.942+0100 [########################] treesdb.trees 558KB/558KB (100.0%)
2024-12-13T11:41:52.087+0100 [########################] treesdb.trees 558KB/558KB (100.0%)
2024-12-13T11:41:52.087+0100 1000 document(s) imported successfully. 0 document(s) failed to import.
mongoimport --uri="${MONGO_ATLAS_URI}" --db treesdb --collection trees --fil 0.15s user 0.09s system 3% cpu 6.869 total
You will present your project on Friday Dec 13th. You have 5' to present your work. No slides, just walk us through.
Since there are many of you, it is very important that you stick to the 5' presentation time.
And don't forget to have fun.