Investigating with AI

Wikipedia API Workshop Instructions

Introduction

Imagine you're a digital anthropologist tasked with studying the world's major cities through the lens of collective human knowledge. Wikipedia, the modern Library of Alexandria, contains millions of interconnected articles written by people from every corner of the globe. But reading through pages manually would take years!

Today, you'll learn to harness the Wikipedia API - a powerful tool that lets you programmatically access this vast repository of knowledge. Like a detective gathering clues, you'll extract data about cities, uncover hidden connections between them, and discover patterns that would be impossible to spot by eye. By the end of this workshop, you'll have built your own mini research database and gained insights into how these urban centers are documented and connected in our collective digital memory.

Your mission: Use code to explore, analyze, and visualize how Wikipedia represents the world's great cities. Let's begin your journey as a data explorer!

Good habits

  • Always start small (one city, one exploration).
  • Make sure you understand what's happening.
  • Ask yourself whether you like the results.
  • Look for weird things that feel off.
  • Follow your instinct.
  • Then scale up.

It's all about building intuition and familiarity with the tools.

Setup

  1. Open Google Colab
  2. Install required library:
!pip install wikipedia-api

Note the ! before pip: in Colab it runs the line as a shell command rather than as Python code.

Part 1: Basic Page Retrieval

Task 1: Get a Wikipedia Page

# Import the library
import wikipediaapi

# Instantiate the wiki object (put your own contact details in the user agent)
wiki = wikipediaapi.Wikipedia(user_agent="[email protected]", language='en')

# Try any topic: person, city, country, sport, company - anything
page = wiki.page('Paris')
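
If the title doesn't match an article exactly, the lookup can come back empty. A quick check saves confusion later; a minimal sketch, assuming the page object created above:

# Check that the title actually resolved to an article
if page.exists():
    print(page.title)
else:
    print("No article found for that title")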

Task 2: Explore Page Properties

Print and examine these page attributes:

  • page.title
  • page.summary
  • page.fullurl
  • page.categories
  • page.sections # Table of contents
  • Try others using dir(page)
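
For example, a minimal sketch assuming the page object from Task 1 (the 300-character slice is just to keep the output short):

print(page.title)
print(page.summary[:300])                   # first 300 characters of the summary
print(page.fullurl)                         # full URL of the article
print(list(page.categories.keys())[:5])     # a few category names
print([s.title for s in page.sections])     # top-level section titles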

Part 2: Bulk Data Collection

Task 3: Create a List of Topics

For instance, a list of cities. You can use any topic you want.

cities = ["Paris", "New York City", "Tokyo", "London", "Berlin"]
  • '[' and ']' are used to create a list.
  • Each element is separated by a comma `,`.
  • Each element is a string, so it goes between double quotes `"..."`.

Task 4: Collect Multiple Data Points

For each city, retrieve:

  • Summary (page.summary)
  • Full text (page.text)
  • All links (page.links)
  • Categories (page.categories)
  • Number of sections
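
A minimal sketch of the collection loop, assuming the wiki object from Task 1 and the cities list from Task 3; the data dictionary is just a suggested name:

data = {}
for city in cities:
    page = wiki.page(city)
    if not page.exists():                       # skip titles that don't resolve to an article
        print("Page not found:", city)
        continue
    data[city] = {
        "summary": page.summary,
        "text": page.text,
        "links": list(page.links.keys()),       # link titles only
        "categories": list(page.categories.keys()),
        "section_count": len(page.sections),
    }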

Part 3: Data Analysis

Task 5: Analyze Your Data

  • Count total links per page
  • Find the most common links
  • Find common links between pages
  • Calculate text length for each page
  • Extract section titles
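
A minimal sketch of some of these counts, assuming the data dictionary built in Task 4:

from collections import Counter

# Links and text length per page
for city, info in data.items():
    print(city, "-", len(info["links"]), "links,", len(info["text"]), "characters")

# Most common link titles across all pages
link_counts = Counter()
for info in data.values():
    link_counts.update(info["links"])
print(link_counts.most_common(10))

# Link titles shared by every page
shared = set.intersection(*(set(info["links"]) for info in data.values()))
print(sorted(shared)[:20])

# Section titles for one page
print([s.title for s in wiki.page("Paris").sections])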

Task 6: Text Processing

  • Find most common words in summaries (exclude stop words)
  • Search for specific keywords across all pages
  • Compare summary length vs full text length
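
A rough sketch, assuming the data dictionary from Task 4; the tiny stop-word set is only a placeholder and you will want a longer one:

import re
from collections import Counter

stop_words = {"the", "a", "an", "and", "of", "in", "is", "to", "as", "by", "for", "with", "on", "it", "its"}

# Most common words across all summaries
word_counts = Counter()
for info in data.values():
    words = re.findall(r"[a-z']+", info["summary"].lower())
    word_counts.update(w for w in words if w not in stop_words)
print(word_counts.most_common(15))

# Search for a keyword in every full text
keyword = "river"
for city, info in data.items():
    print(city, "-", info["text"].lower().count(keyword), "occurrences of", keyword)

# Summary length vs full-text length
for city, info in data.items():
    print(city, "-", len(info["summary"]), "vs", len(info["text"]), "characters")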

Content analysis

  • Which people are mentioned in each page?
  • What are the most common adjectives?
  • etc.
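
These questions need some light natural-language processing. One possible route (an assumption, not part of the workshop setup) is spaCy, which is usually available in Colab:

import spacy
from collections import Counter

# Small English pipeline; if it is missing, run: !python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(data["Paris"]["summary"])

# Entities tagged as persons
print([ent.text for ent in doc.ents if ent.label_ == "PERSON"])

# Most common adjectives
adjectives = Counter(tok.text.lower() for tok in doc if tok.pos_ == "ADJ")
print(adjectives.most_common(10))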

Part 4: Save and Visualize

Task 7: Create an Enhanced DataFrame

import pandas as pd

Build a DataFrame with columns: url, title, summary, text_length, link_count, section_count
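
A minimal sketch, assuming the data dictionary from Task 4, the wiki object from Task 1, and the pandas import above:

rows = []
for city, info in data.items():
    page = wiki.page(city)
    rows.append({
        "url": page.fullurl,
        "title": page.title,
        "summary": info["summary"],
        "text_length": len(info["text"]),
        "link_count": len(info["links"]),
        "section_count": info["section_count"],
    })

df = pd.DataFrame(rows)
df.head()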

Task 8: Basic Visualization

  • Create a bar chart comparing text lengths
  • Plot link counts per city
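
One way to do this is with matplotlib; a sketch, assuming the df DataFrame from Task 7:

import matplotlib.pyplot as plt

df.plot(kind="bar", x="title", y="text_length", legend=False)
plt.ylabel("Text length (characters)")
plt.title("Article length per page")
plt.show()

df.plot(kind="bar", x="title", y="link_count", legend=False)
plt.ylabel("Number of links")
plt.title("Links per page")
plt.show()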

Task 9: Export Your Work

  • Save DataFrame to CSV
  • Create a summary report of your findings
  • Download both files
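
A sketch of the export step in Colab, assuming the df DataFrame from Task 7; the file names are just examples:

df.to_csv("cities.csv", index=False)

# A very small text report - add whatever findings you want
with open("report.txt", "w") as f:
    f.write(f"Pages analysed: {len(df)}\n")
    f.write(f"Longest article: {df.loc[df['text_length'].idxmax(), 'title']}\n")

# Download both files to your machine (works in Colab)
from google.colab import files
files.download("cities.csv")
files.download("report.txt")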

More general tasks

  • Find all cities mentioned in each page's links (see the sketch after this list)
  • Ask Gemini to identify patterns in your data!
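
For the first task, a rough sketch that only looks for the workshop's own city names, assuming the cities list and data dictionary from earlier:

for city, info in data.items():
    mentioned = [other for other in cities if other != city and other in info["links"]]
    print(city, "links to:", mentioned)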