← Back to dbsql

BAN API Integration Workshop

22 min read

Workshop : Tree Locations & BAN API

The goal of this workshop is to complement the location information of each tree with a properly formatted address.

We will use the tree geolocation coordinates (longitude and latitude) of each tree and query the French government Official Address database BAN API to retrieve a properly formatted address for each single tree.

We will implement that process in a series of PL/pgSQL functions using the http PostgreSQL extension.

And store the results using the JSONB native data type which is well adapted to storing API responses.

The project has 5 steps.

Learning Outcomes

In this workshop you will practice:

  • installing extensions
  • writing semi-complex PL/pgSQL functions
  • logging successful outcomes and catching failures in PL/pgSQL functions
  • storing data and querying in the JSONB format
  • dealing with the challenges of API calls.

The problem

The tree addresses data in the database is a mess.

To illustrate this, here are 2 locations randomly picked from the locations table in the normalized version of the trees database.

---------------+----------------------------------------------------
suppl_address  | [null]
address        | CIMETIERE DE PANTIN / AVENUE DES MERISIERS / DIV 33
arrondissement | SEINE-SAINT-DENIS
geolocation    | (2.404061886887479,48.90484356530813)
---------------+----------------------------------------------------
suppl_address  | 61
address        | RUE DE BERCY
arrondissement | PARIS 12E ARRDT
geolocation    | (2.3820653541062367,48.837817565831166)

We can see that :

  • Arrondissement values are literals instead of zip codes.
  • Arrondissement values extend outside of Paris (SEINE-SAINT-DENIS is NOT an arrondissement!).
  • Address fields are missing zip codes.
  • suppl_address is used for the street number and is very often NULL

To use that dataset in a real application with proper localization we need properly formatted addresses.

Luckily for us, there is a way to retrieve the official address associated with geolocation coordinates (latitude an longitude).

The BAN

The official list of addresses in France is called the BAN: Base Nationale d'Adresses.

https://adresse.data.gouv.fr/api-doc/adresse

The BAN API

The BAN website offers a free API that can convert geolocation coordinates to a real address.

For instance, this is the location of a beautiful Japonicum, Styphnolobium tree in the 15th arrondissement.

We send the geolocation coordinates of the tree to the API endpoint api-adresse.data.gouv.fr/reverse with the GET request:

curl "https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457" | jq

This returns a JSON formatted response

{
  "type": "FeatureCollection",
  "version": "draft",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          2.292559,
          48.846781
        ]
      },
      "properties": {
        "id": "75115_3266_00130",
        "type": "housenumber",
        "score": 0.9999999905435162,
        "housenumber": "130",
        "street": "Avenue Emile Zola",
        "name": "130 Avenue Emile Zola",
        "postcode": "75015",
        "citycode": "75115",
        "city": "Paris",
        "district": "Paris 15e Arrondissement",
        "context": "75, Paris, Île-de-France",
        "label": "130 Avenue Emile Zola 75015 Paris",
        "x": 648083.32,
        "y": 6860981.04,
        "importance": 0.80895,
        "distance": 4
      }
    },
    ...

The returned attributes for the properties key are :

  • id: address identifier (interoperability key)
  • type: type of result found
    • housenumber: "to the house number" position
    • street: "to the street" position, placed approximately at its center
    • locality: hamlet or place name
    • municipality: "to the municipality" position
  • score: value from 0 to 1 indicating the relevance of the result
  • housenumber: number with possible repetition index (bis, ter, A, B)
  • street: street name
  • name: possible number and street name or place name
  • postcode: postal code
  • citycode: INSEE code of the municipality
  • city: name of the municipality
  • district: name of the district (Paris/Lyon/Marseille)
  • context: department number, department name, and region name
  • label: full address label
  • x: geographic coordinates in legal projection
  • y: geographic coordinates in legal projection
  • importance: importance indicator (technical field)

The API documentation also mentions these 2 fields which are not in the example above:

  • oldcitycode: INSEE code of the former municipality (if applicable)
  • oldcity: name of the former municipality (if applicable)

But the documentation lacks any description of the distance attribute.

API calls often return results of varying length with dynamic fields. Sometimes, for some APIs, some fields may be missing from the output while others, not expected, creep up. This is why the JSON or JSONB format is appropriate for storing API outputs. You don't have to stick to a fixed set of attributes and handle all the variations.

Note: INSEE stands for "National Institute of Statistics and Economic Studies" (Institut National de la Statistique et des Γ‰tudes Γ‰conomiques), which is the organization that assigns unique codes to French administrative divisions.

The BAN API returns multiple addresses per coordinates

Given a geolocation point (longitude, latitude), the BAN API returns multiple addresses within the features array. You can view the full response at the url https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457

In the rest of the worksheet, we will only consider and record the first of these addresses and discard the others.

Precision of Latitude and Longitude

The number of digits in latitude and longitude coordinates directly relates to the precision with which you can pinpoint a location on Earth. Here's a breakdown of this relationship:

Decimal placesprecision
Whole numberCountry or large region (~100 km precision)
1 decimal placeLarge city or district (~10 km)
2 decimal placesTown or village (~1 km)
3 decimal placesNeighborhood, street (~100 m)
4 decimal placesIndividual building (~10 m)
5 decimal placesIndividual trees (~1 m)
6 decimal placesIndividual humans (~10 cm)
7 decimal places~1 cm
8 decimal places~1 mm

The geolocation coordinates (2.3820653541062367,48.837817565831166) in the database have 16 digits after the decimal place !

This StackOverflow answer nails why this is absurd:

Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!

low quality processing ? hahaha you bet!

On the other hand, the BAN API returns coordinates with 6 digits, which would put the precision at the 10 cm range whereas 16 digits is closer to the angstrom level!

"coordinates": [
  2.292559,
  48.846781
]

Or we are dealing with very very little Bonzai trees

Bonzai

Overview of the project

The overall process is simple, the implementation details less so.

For a set of trees

  • get the geolocation from the database
  • send a get request to the BAN API
  • parse and store the response

The first thing we need is to activate or install the http extension. This part is crucial and probably not that straightforward.

see here for instructions

Once that's done, we write a series PL/pgSQL functions:

  • retrieve geolocation data for a given tree id: A simple query. This function should return the full URL that will be queried taking the tree id as input.
  • interrogate the BAN API: more complex as it involves sending the right message to the API, handling exceptions and then catching the response
  • parsing and inserting the response as a JSONB record related to the tree

We need to do that, not just for one tree but for all the trees in the database. However, the BAN API is limited to 50 requests per seconds per API. So we also need to find a way to throttle or time the calls to the API.

Project parts

The project consists of 4 parts

  1. setup : create a new database, load the data and activate the http extension
  2. API call: for a tree, build the url and make the API call
  3. log and store: log the API call, parse and store the result in a new table
  4. process all the trees: get the address for all the trees in the database

Throughout the project you will have to figure out how to make things work:

  • how to write PL/pgSQL function
  • choosing appropriate data types
  • managing API call errors
  • parsing JSONB in PL/pgSQL

It won't be a walk in the park but I'm sure you can definitely do it.

walk in the park


PART I: setup

Setup a new database: treesdb_v04

In this workshop we work on a subset of the trees database in a normalize structure.

Postal addresses are not relevant to trees in wide spaces (gardens, cemeteries, ...).

This version of the database only has trees in the domain called Alignement, which roughly corresponds to streets.

The new database only has 1000 trees, all situated in Paris with domain as Alignement.

Then restore the data into the database:

In a psql session (terminal or pgAdmin), run the following command

\i <absolute path to the sql file>/treesdb_v04.sql

The \i psql command executes the sql statements contained in the sql file.

Check the data upload

To check that everything is as expected, make sure that you have 1000 trees and 1000 locations in your new database.

Now install the HTTP extension

Full instructions are available here

With a bit of luck the http extension is already installed. Let's check

first check if the extension is available

In a psql session (pgAdmin or terminal), run:

CREATE EXTENSION http;

If that returns CREATE EXTENSION or ERROR extension "http" already exists , you're in luck the extension is installed.

You can list the installed extensions with \dx.

If the http extension is installed and active \dx returns:

List of installed extensions
   Name   | Version |   Schema   |                                Description
----------+---------+------------+----------------------------------------------------------------------------
 http     | 1.6     | public     | HTTP client for PostgreSQL, allows web page retrieval inside the database.

Using the http extension

In this project, we only need to send GET requests to the endpoint. For that we use the http_get function.

You can test that the http extension works by sending a get request to http://httpbun.com/ip with the http_get function.

SELECT http_get.status, http_get.content
FROM http_get('http://httpbun.com/ip');

This should return something like that

-[ RECORD 1 ]-------------------------
status  | 200
content | {                           +
        |   "origin": "146.70.119.165"+
        | }                           +
        |

The status = 200 indicates that the API call was successful. See here for a complete list of http codes.

More examples are available in the http extension Github repo: https://github.com/pramsey/pgsql-http


Part II: API call

In the second part of the project, you will write 2 functions.

The 1st function build_api_url() builds the URL endpoint The 2nd function call_ban_api() takes the output of the first function and calls the API

1st function: build_api_url()

The role of the build_api_url() function is to return the properly formatted endpoint string that will be used by the http_get function.

  • The input argument is the tree id,
  • the output is the following string (with the appropriate lon and lat arguments)
https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457``

Your task: build the endpoint url

Your task is to write a PL/pgSQL function which

  1. takes a tree id as input
  2. finds the geolocation coordinates of the tree: (lon, lat)
  3. concatenates the url and the coordinates to return the full url as https://api-adresse.data.gouv.fr/reverse/?lon=<longitude>&lat=<latitude>

note: Double check which element of the geolocation point is the latitude and which is the longitude. We mostly speak of latitude and longitude, but the coordinates are stored as (long, lat) in the locations table.

You should proceed step by step to build the function:

  • start with a simple function that returns the geolocation point.
    • specify what data type will be returned (point),
    • declare the variable (call it coordinates) that will hold the result,
    • and use the statement SELECT ... INTO coordinates
  • then extract the elements of the coordinates variable
  • finally build the URL by concatenating these elements into a url variable that the function returns. change the data type that is returned by the function as varchar.

A possible function template follows

CREATE OR REPLACE FUNCTION build_api_url(p_tree_id integer)
RETURNS VARCHAR
AS $$
  DECLARE
    -- intermediate variables
    v_coordinates point;
    v_lat float;
    v_lon float;
    -- the returned variable
    v_url VARCHAR;
    -- declare a constant
    c_endpoint CONSTANT text := 'https://api-adresse.data.gouv.fr/reverse/';
  BEGIN
    -- the query to get the locations.geolocation into v_coordinates
    ...
    -- extract coordinates into lon and lat
    ...
    -- concatenate
    ...
    -- return the url
    RETURN v_url;
  END;
$$ LANGUAGE plpgsql;

test your build_api_url function

Test that your function works correctly with a specific tree id (808):

select build_api_url(808);
-[ RECORD 1 ]-+----------------------------------------------------------------------------------------
build_api_url |  https://api-adresse.data.gouv.fr/reverse/?lon=2.2691198735270324&lat=48.85835185927335

[solution]

CREATE FUNCTION public.build_api_url(p_tree_id integer) RETURNS character varying
    LANGUAGE plpgsql
    AS $$
  DECLARE
    v_coordinates point;
    v_lat float;
    v_lon float;
    v_url VARCHAR;
    c_endpoint CONSTANT text := 'https://api-adresse.data.gouv.fr/reverse/';
  BEGIN
    WITH get_location AS (
      SELECT loc.geolocation::point
      FROM locations loc
      JOIN trees t ON t.location_id = loc.id
      WHERE t.id = p_tree_id
    )
    SELECT * INTO v_coordinates
    FROM get_location;

    v_lon := v_coordinates[0];
    v_lat := v_coordinates[1];

    v_url := c_endpoint || '?lon=' || v_lon::text || '&lat=' || v_lat::text;

    RETURN v_url;
  END;
$$;

2nd function: call_ban_api()

Next, we send a get request to the API using the output of the build_api_url() function.

  • Input: the url as output of the build_api_url() function.
  • Output: The response from the API or an error message

As in the example above, we want to catch both the status and content returned by the API call.

  • The status will indicate if the call was successful. You want to test whether status = 200 (πŸ‘) or not (πŸ‘Ž).
  • The content is a JSON dictionary

The query to send a get request to the url is:

SELECT http_get.status, http_get.content
INTO v_api_result
FROM http_get(p_url);

where v_api_result is a variable you need to declare with type RECORD

Your task: write the function that calls the API

Write a function called: call_ban_api()

  • input: the url
  • output: the data returned by the API. The returned type is JSON.
  • if the status returned by the API call is not 200, then raise an EXCEPTION to indicate that the call was not successful. use RAISE EXCEPTION 'some message';

Note you can string concatenate the message with the status using %: 'API call failed with status code %', v_api_status;

test your function

Test a call to the url that was returned by the select build_api_url(2048);.

you should either get

 {"error" : "Error: Resolving timed out after 1073 milliseconds"}

or something like (truncated)

{"type": "FeatureCollection", "limit": 1, "center": [2.2691198735270324, 48.85835185927335], "licence": "ETALAB-2.0",
"version": "draft", "features": [{"type": "Feature",
"geometry": {"type": "Point", "coordinates": [2.269179, 48.858286]},
"properties": {"x": 646379.43, "y": 6862275.9, "id": "75116_8028_00001", "city": "Paris",
"name": "1 Avenue du Ranelagh", "type": "housenumber", "label": "1 Avenue du Ranelagh 75016 Paris",
...

This is the end of the second part of the project.

At this point, you have uploaded the data in the treedb_v04 database and created 2 functions build_api_url(integer) and call_ban_api(text)


Part III: log and store

Log the API calls

Let's say this works and you productionize these 2 functions.

Each time a new tree is added to the database, a trigger calls the functions and stores the result in the database.

Then a few weeks, or months later, you notice that something went wrong at some point in time and the BAN addresses are no longer fetched from the BAN API. And you have no clue when and why that happened.

This is why you need to log in each call to the API and record

  • the time of the call
  • the status
  • the input data: the tree id
  • the error message if there is one

This way you can find out when the problem happened and what error occurred.

First create a api_logs table

Start by creating a new table called api_logs with the following columns

  • tree_id: integer, NOT NULL
  • status: integer, NOT NULL
  • error: varchar default ''
  • created_at: date default now(), NOT NULL

Note that:

  • Don't forget the primary key id as serial.
  • There is no need to add a foreign constraint on the tree_id with respect to the tree table.
  • tree_id does not have to be UNIQUE since we may call the API on the same tree multiple times if it fails the first time.

Write the function log_api_calls()

Write a function called log_api_call which takes (tree_id, status, error) as arguments and INSERTs them into the api_logs table. It can be a SQL function. It does not have to be a PL/pgSQL function.

Make sure the function returns the id of the row that was just inserted with RETURNING.

Test the function log_api_calls()

Call the log_api_calls function with the following arguments and check that

  1. it does not error out
  2. the info actually gets into the table
-- happy path, no error message
select log_api_calls(808, 200, '');

And

-- something went wrong
select log_api_calls(808, 500, 'Error, something went wrong');

Check that both calls made into the api_logs table.

Integrate log_api_calls() into call_ban_api()

Next, modify the call_ban_api function to integrate the log_api_calls() function so that all API calls are recorded. Even the ones that succeed.

A few modifications to think of:

  • you need to pass the tree_id as an argument to the function call_ban_api. Since that will overload the existing call_ban_api(text) function, you may want to first drop the existing call_ban_api(text) function.

  • when the api call fails and returns an error we want to log the content attribute which should contain the error message. We don't know in advance if the content is a string or a more structured JSON object. You may have to experiment with type casting. see https://stackoverflow.com/questions/27215216/how-to-convert-a-json-string-to-text for help

  • in PL/pgSQL, it's best to call the log_api_calls function with PERFORM instead of SELECT since you don't need the output of that function

    • in case of error: PERFORM log_api_calls(tree_id, api_status, api_response::text);
    • if the API call succeeded then just log an empty string for the error argument: PERFORM log_api_calls(tree_id, api_status, '');

test the new version of the call_ban_api(integer, text) function

Verify that successive calls to call_ban_api all end up recording rows in the api_logs table.

Let's recapitulate

So at this point you should have

  • a function that creates the URL for the API call: build_api_url()
  • a function that calls the API with that URL and returns a JSON dictionary with multiple addresses: call_ban_api()
  • a function that logs all the API calls: log_api_calls()
  • a log table recording all the API calls: api_log

Awesome!

The next step is a walk down the park. πŸ˜‡πŸ˜‡πŸ˜‡

Note: you can use \df to list the functions and \sf build_api_url or \sf call_ban_api, ... to see the function definition.

To finish this Part III, we need to parse the response from the API and store it in a address table.

Record the BAN address

Start by creating a table called ban_addresses dedicated to the BAN addresses. It will be linked to the trees table through a tree_id foreign key constraint.

The columns of the table ban_locations are:

  • id: primary key, serial
  • tree_id : integer, NOT NULL
  • address : JSONB default : empty json dictionary : '{}'::jsonb
  • updated_at : date default now() NOT NULL

The address column will hold the 1st address returned by the BAN API as a JSONB dictionary.

By default the address column should be an empty JSONB dictionary : '{}'::jsonb

Don't forge to create the FOREIGN KEY constraint on the tree_id with respect to the trees table

[solution]

ALTER TABLE ban_addresses
ADD CONSTRAINT trees_ban_addresses_id_fkey
FOREIGN KEY (tree_id) REFERENCES trees(id);

Check that the foreign key has been created with \d+ ban_addresses

Parse the JSON response

The API returns a JSON dictionary with multiple elements.

{
  ...
  "features": [
    {
      ...,
      "properties": {
        the address we're interested in
      }
    },
    ...

We don't want to record the whole dictionary, just the first properties in the features array.

So we need to parse the JSON response within a PL/pgSQl function.

Parsing JSON in PL/pgSQL

You parse jSON dictionaries in PostgreSQL with a sequence of arrows ->

Let's say you have a JSON column in a table that follows the above structure : features is an array of JSON objects, that have a properties key.

We want to select features, then the first element ('0') then the properties key in that first element

In our case

json_data->'features'->0->'properties'

So we can parse the output of the call_ban_api() function with

select call_ban_api(some url)::jsonb -> 'features' -> 0 -> 'properties';

try it!

Your task : write a function that parses the output of the API

write a function, call it parse_api_result(), that takes

  • p_api_output JSONB as input
  • and returns properties
  • If parsing fails, return an empty JSONB object

You can use a block with

BEGIN
  BEGIN
    -- try parsing the json b
  EXCEPTION WHEN others THEN
    RETURN '{}'::jsonb;
  END;
END;

Note also that if the parsing does not correspond to the structure of the dictionary of the API response, then it will be NULL. So you want to coalesce the result of the parsing as such

COALESCE(v_parsed_address, '{}'::jsonb)

so that it returns '{}'::jsonb if the parsing is not valid

Don't forget to test the parsing function

Wrap all these functions into one main function

It is best to have a wrapping function which given a tree id encapsulate the functions build_api_url, call_ban_api and parse_api_result.

Let's call this function retrieve_tree_address_from_ban(p_id INTEGER).

  • It takes a tree_id as input
  • calls the build_api_url, call_ban_api and parse_api_result in sequence
  • inserts the tree_id and returned address into the ban_addresses table
  • returns the id of the new created row

Test that everything works

Call the retrieve_tree_address_from_ban function on multiple tree ids

For each call, check the api_logs and the ban_addresses tables

Submit Part III

Congrats! You made it to the end of Part III! You rock!

At this point, you should have created the following functions

  • build_api_url()
  • call_ban_api()
  • log_api_calls()
  • parse_api_result
  • and the main wrapper function : retrieve_tree_address_from_ban

You also have 2 new tables:

  • api_logs
  • ban addresses

Dump your database and upload it in the google form. (see instructions above )

The name of the file should be: treesdb.<your username>.03.sql


Part IV: process the whole set of trees

We are able to get the address tree by tree and we need the addresses for all 1000 trees.

There are many ways on how to batch process the sequence of api calls.

In general we want

  • a query that selects the trees that we need processing for. that way we can limit the number of calls to the API. Emeber that they are limited in time (50 requests per second per IP)
  • a function that gets the address for multiple trees

However, we do not want to simply send many requests to an API as fast as possible. that will almost always certainly fails at some point. APIs do not like being pushed around.

We want each successive API call to be separated by one or more seconds.

Write the function : fetch_addresses_for_trees(ARRAY_AGG(id))

So the function we want to write in this last part of the project has to

  • take an array of tree ids : ARRAY_AGG(id)
  • LOOP over each tree id with : FOREACH curr_tree_id IN ARRAY tree_ids LOOP
  • sleep for some time with : PERFORM pg_sleep(sleep_duration);

the tree selection

To select and process the trees that do not yet have an address you can use a main query as such

WITH sample_trees AS (
    SELECT id
    FROM trees
    ORDER BY RANDOM()
    LIMIT 2
)
SELECT fetch_addresses_for_trees(ARRAY_AGG(id))
FROM sample_trees;

You should change the filtering of the trees to select trees that do not yet have a valid address. Remember that the address attribute of the ban_addresses table, has the value '{}'::jsonb by default.

test the fetch_addresses_for_trees function and selection query

Run the query that selects the trees and calls the fetch_addresses_for_trees function for a limited number of trees and verify that

  • the api calls are properly logged
  • the addresses actually get recorded in the ban_addresses table
  • the trees with valid addresses are not re processed

process all the trees

Your last task is to process as many trees as possible before you submit the database

Submit the Part IV

Congratulations you've made it till the end of the project. I hope that was fun and that you learned a few things along the way.

Now dump your database and submit it in the google form. You know the drill!

At this point, you should have created the following functions

  • build_api_url()
  • call_ban_api()
  • log_api_calls()
  • parse_api_result
  • retrieve_tree_address_from_ban
  • fetch_addresses_for_trees

An the tables:

  • api_logs
  • ban addresses

Dump your database and upload it in the google form. (see instructions above )

The name of the file should be: treesdb.<your username>.04.sql