DHUM 25A43 - Fall 2025 Course Introduction
Wikipedia API Workshop Instructions
Introduction
Imagine you’re a digital anthropologist tasked with studying the world’s major cities through the lens of collective human knowledge. Wikipedia, the modern Library of Alexandria, contains millions of interconnected articles written by people from every corner of the globe. But reading through pages manually would take years!
Today, you’ll learn to harness the Wikipedia API - a powerful tool that lets you programmatically access this vast repository of knowledge. Like a detective gathering clues, you’ll extract data about cities, uncover hidden connections between them, and discover patterns that would be impossible to spot by eye. By the end of this workshop, you’ll have built your own mini research database and gained insights into how these urban centers are documented and connected in our collective digital memory.
Your mission: Use code to explore, analyze, and visualize how Wikipedia represents the world’s great cities. Let’s begin your journey as a data explorer!
Good habits
- Always start small (one city, one exploration).
- Make sure you understand what's happening.
- Ask yourself: do you like the results?
- Look for weird things that feel off.
- Follow your instinct.
- Then scale up.
It’s all about building intuition and familiarity with the tools.
Setup
- Open Google Colab
- Install required library:
!pip install wikipedia-api
Note the `!` before `pip`.
Part 1: Basic Page Retrieval
Task 1: Get a Wikipedia Page
# Import the library
import wikipediaapi
# Instantiate the wiki object
wiki = wikipediaapi.Wikipedia(user_agent="[email protected]", language='en')
# Try any topic: a person, city, country, sport, company, anything you like
page = wiki.page('Paris')
Task 2: Explore Page Properties
Print and examine these page attributes:
page.title
page.summary
page.url
page.categories
page.sections  # table of contents
Try others using dir(page).
Part 2: Bulk Data Collection
Task 3: Create a List of Topics
For instance, a list of cities. You can use any topic you want.
cities = ["Paris", "New York", "Tokyo", "London", "Berlin"]
- `[` and `]` are used to create a list
- each element is separated by a comma `,`
- each element is a string, so it's written between double quotes `"..."`
Task 4: Collect Multiple Data Points
For each city, retrieve:
- Summary (`page.summary`)
- Full text (`page.text`)
- All links (`page.links`)
- Categories (`page.categories`)
- Number of sections
Part 3: Data Analysis
Task 5: Analyze Your Data
- Count total links per page
- Find the most common links
- Find common links between pages
- Calculate text length for each page
- Extract section titles
Task 6: Text Processing
- Find most common words in summaries (exclude stop words)
- Search for specific keywords across all pages
- Compare summary length vs full text length
Content analysis
- Which people are mentioned in each page?
- What are the most common adjectives?
- etc.
Part 4: Save and Visualize
Task 7: Create an Enhanced DataFrame
import pandas as pd
Build a DataFrame with columns: `url`, `title`, `summary`, `text_length`, `link_count`, `section_count`
Task 8: Basic Visualization
- Create a bar chart comparing text lengths
- Plot link counts per city
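A sketch of the bar chart using pandas' built-in matplotlib plotting. The DataFrame here is a stand-in with made-up numbers; in the workshop, plot the `df` you built in Task 7.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary in Colab
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data for illustration -- use your Task 7 DataFrame instead
df = pd.DataFrame({
    "title": ["Paris", "Tokyo"],
    "text_length": [120000, 95000],
    "link_count": [2500, 1800],
})

df.plot.bar(x="title", y="text_length", legend=False)
plt.ylabel("Text length (characters)")
plt.title("Article length per city")
plt.tight_layout()
plt.savefig("text_lengths.png")  # in Colab, plt.show() displays it inline
```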
Task 9: Export Your Work
- Save DataFrame to CSV
- Create a summary report of your findings
- Download both files
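A sketch of the export step, assuming a Task 7-style DataFrame (a stand-in is used here). `google.colab.files.download` triggers a browser download but only exists inside Colab, so it is wrapped in a `try`/`except`:

```python
import pandas as pd

# Stand-in data for illustration -- use your Task 7 DataFrame instead
df = pd.DataFrame({
    "title": ["Paris", "Tokyo"],
    "text_length": [120000, 95000],
    "link_count": [2500, 1800],
})

df.to_csv("cities.csv", index=False)

# A minimal plain-text summary report
with open("report.txt", "w") as f:
    f.write(f"Pages analysed: {len(df)}\n")
    f.write(f"Longest article: {df.loc[df['text_length'].idxmax(), 'title']}\n")

# In Colab, trigger browser downloads of both files
try:
    from google.colab import files  # only available inside Colab
    files.download("cities.csv")
    files.download("report.txt")
except ImportError:
    pass
```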
More general tasks
- Find all cities mentioned in each page’s links
- Ask Gemini to identify patterns in your data!