Imagine you're a digital anthropologist tasked with studying the world's major cities through the lens of collective human knowledge. Wikipedia, the modern Library of Alexandria, contains millions of interconnected articles written by people from every corner of the globe. But reading through pages manually would take years!
Today, you'll learn to harness the Wikipedia API - a powerful tool that lets you programmatically access this vast repository of knowledge. Like a detective gathering clues, you'll extract data about cities, uncover hidden connections between them, and discover patterns that would be impossible to spot by eye. By the end of this workshop, you'll have built your own mini research database and gained insights into how these urban centers are documented and connected in our collective digital memory.
Your mission: Use code to explore, analyze, and visualize how Wikipedia represents the world's great cities. Let's begin your journey as a data explorer!
This workshop is all about building intuition and familiarity with the tools. First, install the library:
!pip install wikipedia-api
Note the ! before pip: in a notebook environment (Jupyter or Colab) it runs the command in the system shell rather than as Python code.
# Import the library
import wikipediaapi
# Instantiate the wiki object with a user agent and language
wiki = wikipediaapi.Wikipedia(user_agent="[email protected]", language='en')
# Try any topic: a person, a city, a country, a sport, a company, anything
page = wiki.page('Paris')
Print and examine these page attributes:
page.title
page.summary
page.fullurl       # the article's URL (the attribute is named fullurl in wikipedia-api)
page.categories
page.sections      # table of contents (top-level sections)
dir(page)          # everything the page object exposes
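For example, here is one quick way to inspect these attributes for the Paris page. This is a minimal sketch that reuses the wiki object created above; the 300-character slice of the summary and the list of section titles are just illustrative choices.

# Quick inspection of a single page (reuses the `wiki` object from above)
page = wiki.page('Paris')
print(page.title)                         # article title
print(page.summary[:300])                 # first 300 characters of the lead section
print(page.fullurl)                       # full URL of the article
print(len(page.categories))               # number of categories the page belongs to
print([s.title for s in page.sections])   # top-level section titles (the table of contents)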
For instance, a list of cities (you can use any topic you want):
cities = ["Paris", "New York", "Tokyo", "London", "Berlin"]
For each city, retrieve:
page.summary
page.text
page.links
page.categories
page.sections

import pandas as pd
Build a DataFrame with columns: url, title, summary, text_length, link_count, section_count
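One possible way to assemble that table is sketched below, assuming the wiki object and cities list from above; mapping text_length, link_count, and section_count to len() of the corresponding attributes is an assumption about how the columns are meant to be computed.

# Build one row of statistics per city, then turn the rows into a DataFrame
rows = []
for city in cities:
    page = wiki.page(city)
    rows.append({
        "url": page.fullurl,                  # full URL of the article
        "title": page.title,
        "summary": page.summary,
        "text_length": len(page.text),        # characters in the full article text
        "link_count": len(page.links),        # number of outgoing links
        "section_count": len(page.sections),  # number of top-level sections
    })

df = pd.DataFrame(rows)
df

From here you can sort by link_count or text_length to see which cities are documented most extensively.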