Investigating with AI

Wikipedia API Workshop Instructions

Introduction

Imagine you're a digital anthropologist tasked with studying the world's major cities through the lens of collective human knowledge. Wikipedia, the modern Library of Alexandria, contains millions of interconnected articles written by people from every corner of the globe. But reading through pages manually would take years!

Today, you'll learn to harness the Wikipedia API - a powerful tool that lets you programmatically access this vast repository of knowledge. Like a detective gathering clues, you'll extract data about cities, uncover hidden connections between them, and discover patterns that would be impossible to spot by eye. By the end of this workshop, you'll have built your own mini research database and gained insights into how these urban centers are documented and connected in our collective digital memory.

Your mission: Use code to explore, analyze, and visualize how Wikipedia represents the world's great cities. Let's begin your journey as a data explorer!

Good habits

  • Always start small (one city, one exploration).
  • Make sure you understand what's happening.
  • Ask yourself whether you like the results.
  • Look for weird things that feel off.
  • Follow your instinct.
  • Then scale up.

It's all about building intuition and familiarity with the tools.

Setup

  1. Open Google Colab
  2. Install required library:
!pip install wikipedia-api

Note the ! before pip: in Colab it runs the line as a shell command rather than as Python code.

Part 1: Basic Page Retrieval

Task 1: Get a Wikipedia Page

# Import the library
import wikipediaapi

# Instantiate the wiki object (put your own contact details in the user agent)
wiki = wikipediaapi.Wikipedia(user_agent="[email protected]", language='en')

# Try any topic: person, city, country, sport, company - anything
page = wiki.page('Paris')
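
If the title doesn't match an article exactly, the lookup can come back empty. A quick check saves confusion later; a minimal sketch, assuming the page object created above:

# Check that the title actually resolved to an article
if page.exists():
    print(page.title)
else:
    print("No article found for that title")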

Task 2: Explore Page Properties

Print and examine these page attributes:

  • page.title
  • page.summary
  • page.fullurl
  • page.categories
  • page.sections # Table of contents
  • Try others using dir(page)
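
For example, a minimal sketch assuming the page object from Task 1 (the 300-character slice is just to keep the output short):

print(page.title)
print(page.summary[:300])                   # first 300 characters of the summary
print(page.fullurl)                         # full URL of the article
print(list(page.categories.keys())[:5])     # a few category names
print([s.title for s in page.sections])     # top-level section titles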

Part 2: Bulk Data Collection

Task 3: Create a List of Topics

For instance, a list of cities. You can use any topic you want.

cities = ["Paris", "New York City", "Tokyo", "London", "Berlin"]
  • '[' and ']' are used to create a list.
  • Each element is separated by a comma `,`.
  • Each element is a string, so it goes between double quotes `"..."`.

Task 4: Collect Multiple Data Points

For each city, retrieve:

  • Summary (page.summary)
  • Full text (page.text)
  • All links (page.links)
  • Categories (page.categories)
  • Number of sections
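
A minimal sketch of the collection loop, assuming the wiki object from Task 1 and the cities list from Task 3; the data dictionary is just a suggested name:

data = {}
for city in cities:
    page = wiki.page(city)
    if not page.exists():                       # skip titles that don't resolve to an article
        print("Page not found:", city)
        continue
    data[city] = {
        "summary": page.summary,
        "text": page.text,
        "links": list(page.links.keys()),       # link titles only
        "categories": list(page.categories.keys()),
        "section_count": len(page.sections),
    }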

Part 3: Data Analysis

Task 5: Analyze Your Data

  • Count total links per page
  • Find the most common links
  • Find common links between pages
  • Calculate text length for each page
  • Extract section titles
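
A minimal sketch of some of these counts, assuming the data dictionary built in Task 4:

from collections import Counter

# Links and text length per page
for city, info in data.items():
    print(city, "-", len(info["links"]), "links,", len(info["text"]), "characters")

# Most common link titles across all pages
link_counts = Counter()
for info in data.values():
    link_counts.update(info["links"])
print(link_counts.most_common(10))

# Link titles shared by every page
shared = set.intersection(*(set(info["links"]) for info in data.values()))
print(sorted(shared)[:20])

# Section titles for one page
print([s.title for s in wiki.page("Paris").sections])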

Task 6: Text Processing

  • Find most common words in summaries (exclude stop words)
  • Search for specific keywords across all pages
  • Compare summary length vs full text length
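
A rough sketch, assuming the data dictionary from Task 4; the tiny stop-word set is only a placeholder and you will want a longer one:

import re
from collections import Counter

stop_words = {"the", "a", "an", "and", "of", "in", "is", "to", "as", "by", "for", "with", "on", "it", "its"}

# Most common words across all summaries
word_counts = Counter()
for info in data.values():
    words = re.findall(r"[a-z']+", info["summary"].lower())
    word_counts.update(w for w in words if w not in stop_words)
print(word_counts.most_common(15))

# Search for a keyword in every full text
keyword = "river"
for city, info in data.items():
    print(city, "-", info["text"].lower().count(keyword), "occurrences of", keyword)

# Summary length vs full-text length
for city, info in data.items():
    print(city, "-", len(info["summary"]), "vs", len(info["text"]), "characters")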

Content analysis

  • Which people are mentioned in each page?
  • What are the most common adjectives?
  • etc.
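
These questions need some light natural-language processing. One possible route (an assumption, not part of the workshop setup) is spaCy, which is usually available in Colab:

import spacy
from collections import Counter

# Small English pipeline; if it is missing, run: !python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(data["Paris"]["summary"])

# Entities tagged as persons
print([ent.text for ent in doc.ents if ent.label_ == "PERSON"])

# Most common adjectives
adjectives = Counter(tok.text.lower() for tok in doc if tok.pos_ == "ADJ")
print(adjectives.most_common(10))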

Part 4: Save and Visualize

Task 7: Create an Enhanced DataFrame

import pandas as pd

Build a DataFrame with columns: url, title, summary, text_length, link_count, section_count
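
A minimal sketch, assuming the data dictionary from Task 4, the wiki object from Task 1, and the pandas import above:

rows = []
for city, info in data.items():
    page = wiki.page(city)
    rows.append({
        "url": page.fullurl,
        "title": page.title,
        "summary": info["summary"],
        "text_length": len(info["text"]),
        "link_count": len(info["links"]),
        "section_count": info["section_count"],
    })

df = pd.DataFrame(rows)
df.head()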

Task 8: Basic Visualization

  • Create a bar chart comparing text lengths
  • Plot link counts per city
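
One way to do this is with matplotlib; a sketch, assuming the df DataFrame from Task 7:

import matplotlib.pyplot as plt

df.plot(kind="bar", x="title", y="text_length", legend=False)
plt.ylabel("Text length (characters)")
plt.title("Article length per page")
plt.show()

df.plot(kind="bar", x="title", y="link_count", legend=False)
plt.ylabel("Number of links")
plt.title("Links per page")
plt.show()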

Task 9: Export Your Work

  • Save DataFrame to CSV
  • Create a summary report of your findings
  • Download both files
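
A sketch of the export step in Colab, assuming the df DataFrame from Task 7; the file names are just examples:

df.to_csv("cities.csv", index=False)

# A very small text report - add whatever findings you want
with open("report.txt", "w") as f:
    f.write(f"Pages analysed: {len(df)}\n")
    f.write(f"Longest article: {df.loc[df['text_length'].idxmax(), 'title']}\n")

# Download both files to your machine (works in Colab)
from google.colab import files
files.download("cities.csv")
files.download("report.txt")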

More general tasks

  • Find all cities mentioned in each page's links (see the sketch after this list)
  • Ask Gemini to identify patterns in your data!
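
For the first task, a rough sketch that only looks for the workshop's own city names, assuming the cities list and data dictionary from earlier:

for city, info in data.items():
    mentioned = [other for other in cities if other != city and other in info["links"]]
    print(city, "links to:", mentioned)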