BAN API Integration Workshop

BAN API, HTTP extension, geolocation, JSONB

Workshop : Tree Locations & BAN API

The goal of this workshop is to complement the location information of each tree with a properly formatted address.

We will use the tree geolocation coordinates (longitude and latitude) of each tree and query the French government Official Address database BAN API to retrieve a properly formatted address for each single tree.

We will implement that process in a series of PL/pgSQL functions using the http PostgreSQL extension.

And store the results using the JSONB native data type which is well adapted to storing API responses.

The project has 5 steps. At the end of each step you will dump your database and submit the database dump file in a Google form.

Learning Outcomes

In this workshop you will practice:

The problem

The tree addresses data in the database is a mess.

To illustrate this, here are 2 locations randomly picked from the locations table in the normalized version of the trees database.

---------------+----------------------------------------------------
suppl_address  | [null]
address        | CIMETIERE DE PANTIN / AVENUE DES MERISIERS / DIV 33
arrondissement | SEINE-SAINT-DENIS
geolocation    | (2.404061886887479,48.90484356530813)
---------------+----------------------------------------------------
suppl_address  | 61
address        | RUE DE BERCY
arrondissement | PARIS 12E ARRDT
geolocation    | (2.3820653541062367,48.837817565831166)

We can see that :

To use that dataset in a real application with proper localization we need properly formatted addresses.

Luckily for us, there is a way to retrieve the official address associated with geolocation coordinates (latitude an longitude).

The BAN

The official list of addresses in France is called the BAN: Base Nationale d’Adresses.

https://adresse.data.gouv.fr/api-doc/adresse

The BAN API

The BAN website offers a free API that can convert geolocation coordinates to a real address.

For instance, this is the location of a beautiful Japonicum, Styphnolobium tree in the 15th arrondissement.

We send the geolocation coordinates of the tree to the API endpoint api-adresse.data.gouv.fr/reverse with the GET request:

curl "https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457"

This returns a JSON formatted response

{
  "type": "FeatureCollection",
  "version": "draft",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          2.292559,
          48.846781
        ]
      },
      "properties": {
        "id": "75115_3266_00130",
        "type": "housenumber",
        "score": 0.9999999905435162,
        "housenumber": "130",
        "street": "Avenue Emile Zola",
        "name": "130 Avenue Emile Zola",
        "postcode": "75015",
        "citycode": "75115",
        "city": "Paris",
        "district": "Paris 15e Arrondissement",
        "context": "75, Paris, Île-de-France",
        "label": "130 Avenue Emile Zola 75015 Paris",
        "x": 648083.32,
        "y": 6860981.04,
        "importance": 0.80895,
        "distance": 4
      }
    },
    ...

The returned attributes for the properties key are :

The API documentation also mentions these 2 fields which are not in the example above:

But the documentation lacks any description of the distance attribute.

API calls often return results of varying length with dynamic fields. Sometimes, for some APIs, some fields may be missing from the output while others, not expected, creep up. This is why the JSON or JSONB format is appropriate for storing API outputs. You don’t have to stick to a fixed set of attributes and handle all the variations.

Note: INSEE stands for “National Institute of Statistics and Economic Studies” (Institut National de la Statistique et des Études Économiques), which is the organization that assigns unique codes to French administrative divisions.

The BAN API returns multiple addresses per coordinates

Given a geolocation point (longitude, latitude), the BAN API returns multiple addresses within the features array. You can view the full response at the url https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457

In the rest of the worksheet, we will only consider and record the first of these addresses and discard the others.

Precision of Latitude and Longitude

The number of digits in latitude and longitude coordinates directly relates to the precision with which you can pinpoint a location on Earth. Here’s a breakdown of this relationship:

Decimal places precision
Whole number Country or large region (~100 km precision)
1 decimal place Large city or district (~10 km)
2 decimal places Town or village (~1 km)
3 decimal places Neighborhood, street (~100 m)
4 decimal places Individual building (~10 m)
5 decimal places Individual trees (~1 m)
6 decimal places Individual humans (~10 cm)
7 decimal places ~1 cm
8 decimal places ~1 mm

The geolocation coordinates (2.3820653541062367,48.837817565831166) in the database have 16 digits after the decimal place !

This StackOverflow answer nails why this is absurd:

Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!

low quality processing ? hahaha you bet!

On the other hand, the BAN API returns coordinates with 6 digits, which would put the precision at the 10 cm range whereas 16 digits is closer to the angstrom level!

"coordinates": [
  2.292559,
  48.846781
]

Overview of the project

The overall process is simple, the implementation details less so.

For a set of trees

The first thing we need is to activate or install the http extension. This part is crucial and probably not that straightforward.

But with grit and hard work you can do it!

You can do it!

Once that’s done, we write a series PL/pgSQL functions:

We need to do that, not just for one tree but for all the trees in the database. However, the BAN API is limited to 50 requests per seconds per API. So we also need to find a way to throttle or time the calls to the API.

Project parts

The project consists of 4 parts

  1. setup : create a new database, load the data and activate the http extension
  2. API call: for a tree, build the url and make the API call
  3. log and store: log the API call, parse and store the result in a new table
  4. process all the trees: get the address for all the trees in the database

Throughout the project you will have to figure out how to make things work:

It won’t be a walk in the park but I’m sure you can definitely do it.

How to dump and submit your database

At the end of each part of the project, you will be asked to 1) dump the database and 2) submit it via a google form.

Dump the database

I need you to use a terminal to dump your database and not the pgAdmin interface. This is important. Do not use pgAdmin to dump the database.

So in a powershell or iTerm terminal you execute the following command

pg_dump -h localhost  \
  --no-owner \
  --format=plain \
  --exclude-schema='information_schema' \
  --exclude-schema='pg_*' \
  --section=pre-data \
  --section=data \
  --section=post-data \
  --no-privileges \
  --no-security-labels \
  --no-tablespaces \
  --no-unlogged-table-data \
  --no-publications \
  --no-subscriptions \
  --no-toast-compression \
  --no-comments \
  treesdb_v04 > treesdb_v04.01.sql

where

The dump file is a plain SQL file.

The name of the dump file must include

For instance, as alexis, I will upload 4 files:

Verify the dump

To test that the dump worked you should be able to open the file in an editor (Vscode, vim, …) and check that the file contains the proper CREATE TABLE and CREATE FUNCTION statements.

Then head over to this google form, fill in your name and upload the dump file

https://forms.gle/4N21CW91P9smSM658

The file should not be larger than 10 MB.

IMPORTANT


PART I: setup

Setup a new database: treesdb_v04

In this workshop we work on a subset of the trees database in a normalize structure.

Postal addresses are not relevant to trees in wide spaces (gardens, cemeteries, …).

This version of the database only has trees in the domain called Alignement, which roughly corresponds to streets.

The new database only has 1000 trees, all situated in Paris with domain as Alignement.

Then restore the data into the database:

In a psql session (terminal or pgAdmin), run the following command

\i <absolute path to the sql file>/treesdb_v04.01.sql

The \i psql command executes the sql statements contained in the sql file.

Check the data upload

To check that everything is as expected, make sure that you have 1000 trees and 1000 locations in your new database.

Now install the HTTP extension

With a bit of luck the http extension is already installed. Let’s check

first check if the extension is available

In a psql session (pgAdmin or terminal), run:

CREATE EXTENSION http;

If that returns CREATE EXTENSION or ERROR extension "http" already exists , you’re in luck the extension is installed.

You can list the installed extensions with \dx.

If the http extension is installed and active \dx returns:

List of installed extensions
   Name   | Version |   Schema   |                                Description
----------+---------+------------+----------------------------------------------------------------------------
 http     | 1.6     | public     | HTTP client for PostgreSQL, allows web page retrieval inside the database.

Using the http extension

In this project, we only need to send GET requests to the endpoint. For that we use the http_get function.

You can test that the http extension works by sending a get request to http://httpbun.com/ip with the http_get function.

SELECT http_get.status, http_get.content
FROM http_get('http://httpbun.com/ip');

This should return something like that

-[ RECORD 1 ]-------------------------
status  | 200
content | {                           +
        |   "origin": "146.70.119.165"+
        | }                           +
        |

The status = 200 indicates that the API call was successful. See here for a complete list of http codes.

More examples are available in the http extension Github repo: https://github.com/pramsey/pgsql-http

Submit Part I

This is the end of the first part of the project.

Dump your database and upload it in the google form. (see instructions above )


Part II: API call

In the second part of the project, you will write 2 functions.

The 1st function build_api_url() builds the URL endpoint The 2nd function call_ban_api() takes the output of the first function and calls the API

1st function: build_api_url()

The role of the build_api_url() function is to return the properly formatted endpoint string that will be used by the http_get function.

https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457``

Your task: build the endpoint url

Your task is to write a PL/pgSQL function which

  1. takes a tree id as input
  2. finds the geolocation coordinates of the tree: (lon, lat)
  3. concatenates the url and the coordinates to return the full url as https://api-adresse.data.gouv.fr/reverse/?lon=<longitude>&lat=<latitude>

note: Double check which element of the geolocation point is the latitude and which is the longitude. We mostly speak of latitude and longitude, but the coordinates are stored as (long, lat) in the locations table.

You should proceed step by step to build the function:

A possible function template follows

CREATE OR REPLACE FUNCTION build_api_url(p_tree_id integer)
RETURNS VARCHAR
AS $$
  DECLARE
    -- intermediate variables
    v_coordinates point;
    v_lat float;
    v_lon float;
    -- the returned variable
    v_url VARCHAR;
    -- declare a constant
    c_endpoint CONSTANT text := 'https://api-adresse.data.gouv.fr/reverse/';
  BEGIN
    -- the query to get the locations.geolocation into v_coordinates
    ...
    -- extract coordinates into lon and lat
    ...
    -- concatenate
    ...
    -- return the url
    RETURN v_url;
  END;
$$ LANGUAGE plpgsql;

test your build_api_url function

Test that your function works correctly with a specific tree id (808):

select build_api_url(808);
-[ RECORD 1 ]-+----------------------------------------------------------------------------------------
build_api_url |  https://api-adresse.data.gouv.fr/reverse/?lon=2.2691198735270324&lat=48.85835185927335

2nd function: call_ban_api()

Next, we send a get request to the API using the output of the build_api_url() function.

As in the example above, we want to catch both the status and content returned by the API call.

The query to send a get request to the url is:

SELECT http_get.status, http_get.content
INTO v_api_result
FROM http_get(p_url);

where v_api_result is a variable you need to declare with type RECORD

Your task: write the function that calls the API

Write a function called: call_ban_api()

Note you can string concatenate the message with the status using %: 'API call failed with status code %', v_api_status;

test your function

Test a call to the url that was returned by the select build_api_url(2048);.

you should either get

 {"error" : "Error: Resolving timed out after 1073 milliseconds"}

or something like (truncated)

{"type": "FeatureCollection", "limit": 1, "center": [2.2691198735270324, 48.85835185927335], "licence": "ETALAB-2.0",
"version": "draft", "features": [{"type": "Feature",
"geometry": {"type": "Point", "coordinates": [2.269179, 48.858286]},
"properties": {"x": 646379.43, "y": 6862275.9, "id": "75116_8028_00001", "city": "Paris",
"name": "1 Avenue du Ranelagh", "type": "housenumber", "label": "1 Avenue du Ranelagh 75016 Paris",
...

Submit Part II

This is the end of the second part of the project.

At this point, you have uploaded the data in the treedb_v04 database and created 2 functions build_api_url(integer) and call_ban_api(text)

Dump your database and upload it in the google form. (see instructions above )

The name of the file should be: treesdb.<your username>.02.sql


Part III: log and store

Log the API calls

Let’s say this works and you productionize these 2 functions.

Each time a new tree is added to the database, a trigger calls the functions and stores the result in the database.

Then a few weeks, or months later, you notice that something went wrong at some point in time and the BAN addresses are no longer fetched from the BAN API. And you have no clue when and why that happened.

This is why you need to log in each call to the API and record

This way you can find out when the problem happened and what error occurred.

First create a api_logs table

Start by creating a new table called api_logs with the following columns

Note that:

Write the function log_api_calls()

Write a function called log_api_call which takes (tree_id, status, error) as arguments and INSERTs them into the api_logs table. It can be a SQL function. It does not have to be a PL/pgSQL function.

Make sure the function returns the id of the row that was just inserted with RETURNING.

Test the function log_api_calls()

Call the log_api_calls function with the following arguments and check that

  1. it does not error out
  2. the info actually gets into the table
-- happy path, no error message
select log_api_calls(808, 200, '');

And

-- something went wrong
select log_api_calls(808, 500, 'Error, something went wrong');

Check that both calls made into the api_logs table.

Integrate log_api_calls() into call_ban_api()

Next, modify the call_ban_api function to integrate the log_api_calls() function so that all API calls are recorded. Even the ones that succeed.

A few modifications to think of:

test the new version of the call_ban_api(integer, text) function

Verify that successive calls to call_ban_api all end up recording rows in the api_logs table.

Let’s recapitulate

So at this point you should have

Awesome!

The next step is a walk down the park. 😇😇😇

Note: you can use \df to list the functions and \sf build_api_url or \sf call_ban_api, … to see the function definition.

To finish this Part III, we need to parse the response from the API and store it in a address table.

Record the BAN address

Start by creating a table called ban_addresses dedicated to the BAN addresses. It will be linked to the trees table through a tree_id foreign key constraint.

The columns of the table ban_locations are:

The address column will hold the 1st address returned by the BAN API as a JSONB dictionary.

By default the address column should be an empty JSONB dictionary : '{}'::jsonb

Don’t forge to create the FOREIGN KEY constraint on the tree_id with respect to the trees table

[solution]

ALTER TABLE ban_addresses
ADD CONSTRAINT trees_ban_addresses_id_fkey
FOREIGN KEY (tree_id) REFERENCES trees(id);

Check that the foreign key has been created with \d+ ban_addresses

Parse the JSON response

The API returns a JSON dictionary with multiple elements.

{
  ...
  "features": [
    {
      ...,
      "properties": {
        the address we're interested in
      }
    },
    ...

We don’t want to record the whole dictionary, just the first properties in the features array.

So we need to parse the JSON response within a PL/pgSQl function.

Parsing JSON in PL/pgSQL

You parse jSON dictionaries in PostgreSQL with a sequence of arrows ->

Let’s say you have a JSON column in a table that follows the above structure : features is an array of JSON objects, that have a properties key.

We want to select features, then the first element (‘0’) then the properties key in that first element

In our case

json_data->'features'->0->'properties'

So we can parse the output of the call_ban_api() function with

select call_ban_api(some url)::jsonb -> 'features' -> 0 -> 'properties';

try it!

Your task : write a function that parses the output of the API

write a function, call it parse_api_result(), that takes

You can use a block with

BEGIN
  BEGIN
    -- try parsing the json b
  EXCEPTION WHEN others THEN
    RETURN '{}'::jsonb;
  END;
END;

Note also that if the parsing does not correspond to the structure of the dictionary of the API response, then it will be NULL. So you want to coalesce the result of the parsing as such

COALESCE(v_parsed_address, '{}'::jsonb)

so that it returns '{}'::jsonb if the parsing is not valid

Don’t forget to test the parsing function

Wrap all these functions into one main function

It is best to have a wrapping function which given a tree id encapsulate the functions build_api_url, call_ban_api and parse_api_result.

Let’s call this function retrieve_tree_address_from_ban(p_id INTEGER).

Test that everything works

Call the retrieve_tree_address_from_ban function on multiple tree ids

For each call, check the api_logs and the ban_addresses tables

Submit Part III

Congrats! You made it to the end of Part III! You rock!

At this point, you should have created the following functions

You also have 2 new tables:

Dump your database and upload it in the google form. (see instructions above )

The name of the file should be: treesdb.<your username>.03.sql


Part IV: process the whole set of trees

We are able to get the address tree by tree and we need the addresses for all 1000 trees.

There are many ways on how to batch process the sequence of api calls.

In general we want

However, we do not want to simply send many requests to an API as fast as possible. that will almost always certainly fails at some point. APIs do not like being pushed around.

We want each successive API call to be separated by one or more seconds.

Write the function : fetch_addresses_for_trees(ARRAY_AGG(id))

So the function we want to write in this last part of the project has to

the tree selection

To select and process the trees that do not yet have an address you can use a main query as such

WITH sample_trees AS (
    SELECT id
    FROM trees
    ORDER BY RANDOM()
    LIMIT 2
)
SELECT fetch_addresses_for_trees(ARRAY_AGG(id))
FROM sample_trees;

You should change the filtering of the trees to select trees that do not yet have a valid address. Remember that the address attribute of the ban_addresses table, has the value '{}'::jsonb by default.

test the fetch_addresses_for_trees function and selection query

Run the query that selects the trees and calls the fetch_addresses_for_trees function for a limited number of trees and verify that

process all the trees

Your last task is to process as many trees as possible before you submit the database

Submit the Part IV

Congratulations you’ve made it till the end of the project. I hope that was fun and that you learned a few things along the way.

Now dump your database and submit it in the google form. You know the drill!

At this point, you should have created the following functions

An the tables:

Dump your database and upload it in the google form. (see instructions above )

The name of the file should be: treesdb.<your username>.04.sql