BAN API Integration Workshop
BAN API, HTTP extension, geolocation, JSONB
Workshop : Tree Locations & BAN API
The goal of this workshop is to complement the location information of each tree with a properly formatted address.
We will use the tree geolocation coordinates (longitude and latitude) of each tree and query the French government Official Address database BAN API to retrieve a properly formatted address for each single tree.
We will implement that process in a series of PL/pgSQL functions using the http
PostgreSQL extension.
And store the results using the JSONB native data type which is well adapted to storing API responses.
The project has 5 steps. At the end of each step you will dump your database and submit the database dump file in a Google form.
Learning Outcomes
In this workshop you will practice:
- installing extensions
- writing semi-complex PL/pgSQL functions
- logging successful outcomes and catching failures in PL/pgSQL functions
- storing data and querying in the JSONB format
- dealing with the challenges of API calls.
The problem
The tree addresses data in the database is a mess.
To illustrate this, here are 2 locations randomly picked from the locations
table in the normalized version of the trees database.
---------------+----------------------------------------------------
suppl_address | [null]
address | CIMETIERE DE PANTIN / AVENUE DES MERISIERS / DIV 33
arrondissement | SEINE-SAINT-DENIS
geolocation | (2.404061886887479,48.90484356530813)
---------------+----------------------------------------------------
suppl_address | 61
address | RUE DE BERCY
arrondissement | PARIS 12E ARRDT
geolocation | (2.3820653541062367,48.837817565831166)
We can see that :
- Arrondissement values are literals instead of zip codes.
- Arrondissement values extend outside of Paris (SEINE-SAINT-DENIS is NOT an arrondissement!).
- Address fields are missing zip codes.
suppl_address
is used for the street number and is very oftenNULL
To use that dataset in a real application with proper localization we need properly formatted addresses.
Luckily for us, there is a way to retrieve the official address associated with geolocation coordinates (latitude an longitude).
The BAN
The official list of addresses in France is called the BAN: Base Nationale d’Adresses.
https://adresse.data.gouv.fr/api-doc/adresse
The BAN API
The BAN website offers a free API that can convert geolocation coordinates to a real address.
For instance, this is the location of a beautiful Japonicum, Styphnolobium tree in the 15th arrondissement.
We send the geolocation coordinates of the tree
to the API endpoint api-adresse.data.gouv.fr/reverse
with the GET request:
curl "https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457"
This returns a JSON formatted response
{
"type": "FeatureCollection",
"version": "draft",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
2.292559,
48.846781
]
},
"properties": {
"id": "75115_3266_00130",
"type": "housenumber",
"score": 0.9999999905435162,
"housenumber": "130",
"street": "Avenue Emile Zola",
"name": "130 Avenue Emile Zola",
"postcode": "75015",
"citycode": "75115",
"city": "Paris",
"district": "Paris 15e Arrondissement",
"context": "75, Paris, Île-de-France",
"label": "130 Avenue Emile Zola 75015 Paris",
"x": 648083.32,
"y": 6860981.04,
"importance": 0.80895,
"distance": 4
}
},
...
The returned attributes for the properties
key are :
- id: address identifier (interoperability key)
- type: type of result found
- housenumber: “to the house number” position
- street: “to the street” position, placed approximately at its center
- locality: hamlet or place name
- municipality: “to the municipality” position
- score: value from 0 to 1 indicating the relevance of the result
- housenumber: number with possible repetition index (bis, ter, A, B)
- street: street name
- name: possible number and street name or place name
- postcode: postal code
- citycode: INSEE code of the municipality
- city: name of the municipality
- district: name of the district (Paris/Lyon/Marseille)
- context: department number, department name, and region name
- label: full address label
- x: geographic coordinates in legal projection
- y: geographic coordinates in legal projection
- importance: importance indicator (technical field)
The API documentation also mentions these 2 fields which are not in the example above:
- oldcitycode: INSEE code of the former municipality (if applicable)
- oldcity: name of the former municipality (if applicable)
But the documentation lacks any description of the distance
attribute.
API calls often return results of varying length with dynamic fields. Sometimes, for some APIs, some fields may be missing from the output while others, not expected, creep up. This is why the JSON or JSONB format is appropriate for storing API outputs. You don’t have to stick to a fixed set of attributes and handle all the variations.
Note: INSEE stands for “National Institute of Statistics and Economic Studies” (Institut National de la Statistique et des Études Économiques), which is the organization that assigns unique codes to French administrative divisions.
The BAN API returns multiple addresses per coordinates
Given a geolocation point (longitude, latitude), the BAN API returns multiple addresses within the features
array.
You can view the full response at the url https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457
In the rest of the worksheet, we will only consider and record the first of these addresses and discard the others.
Precision of Latitude and Longitude
The number of digits in latitude and longitude coordinates directly relates to the precision with which you can pinpoint a location on Earth. Here’s a breakdown of this relationship:
Decimal places | precision |
---|---|
Whole number | Country or large region (~100 km precision) |
1 decimal place | Large city or district (~10 km) |
2 decimal places | Town or village (~1 km) |
3 decimal places | Neighborhood, street (~100 m) |
4 decimal places | Individual building (~10 m) |
5 decimal places | Individual trees (~1 m) |
6 decimal places | Individual humans (~10 cm) |
7 decimal places | ~1 cm |
8 decimal places | ~1 mm |
The geolocation coordinates (2.3820653541062367,48.837817565831166)
in the database have 16 digits after the decimal place !
This StackOverflow answer nails why this is absurd:
Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!
low quality processing ? hahaha you bet!
On the other hand, the BAN API returns coordinates with 6 digits, which would put the precision at the 10 cm range whereas 16 digits is closer to the angstrom level!
"coordinates": [
2.292559,
48.846781
]
Overview of the project
The overall process is simple, the implementation details less so.
For a set of trees
- get the geolocation from the database
- send a get request to the BAN API
- parse and store the response
The first thing we need is to activate or install the http
extension. This part is crucial and probably not that straightforward.
But with grit and hard work you can do it!
Once that’s done, we write a series PL/pgSQL functions:
- retrieve geolocation data for a given tree id: A simple query. This function should return the full URL that will be queried taking the tree id as input.
- interrogate the BAN API: more complex as it involves sending the right message to the API, handling exceptions and then catching the response
- parsing and inserting the response as a JSONB record related to the tree
We need to do that, not just for one tree but for all the trees in the database. However, the BAN API is limited to 50 requests per seconds per API. So we also need to find a way to throttle or time the calls to the API.
Project parts
The project consists of 4 parts
- setup : create a new database, load the data and activate the
http
extension - API call: for a tree, build the url and make the API call
- log and store: log the API call, parse and store the result in a new table
- process all the trees: get the address for all the trees in the database
Throughout the project you will have to figure out how to make things work:
- how to write PL/pgSQL function
- choosing appropriate data types
- managing API call errors
- parsing JSONB in PL/pgSQL
It won’t be a walk in the park but I’m sure you can definitely do it.
How to dump and submit your database
At the end of each part of the project, you will be asked to 1) dump the database and 2) submit it via a google form.
Dump the database
I need you to use a terminal to dump your database and not the pgAdmin interface. This is important. Do not use pgAdmin to dump the database.
So in a powershell or iTerm terminal you execute the following command
pg_dump -h localhost \
--no-owner \
--format=plain \
--exclude-schema='information_schema' \
--exclude-schema='pg_*' \
--section=pre-data \
--section=data \
--section=post-data \
--no-privileges \
--no-security-labels \
--no-tablespaces \
--no-unlogged-table-data \
--no-publications \
--no-subscriptions \
--no-toast-compression \
--no-comments \
treesdb_v04 > treesdb_v04.01.sql
where
treesdb_v04
is the name of your databasetreesdb_v04.01.sql
is the name of the database dump file.
The dump file is a plain SQL file.
The name of the dump file must include
- your username
- the part number it relates to
For instance, as alexis
, I will upload 4 files:
treesdb_alexis.01.sql
treesdb_alexis.02.sql
treesdb_alexis.03.sql
treesdb_alexis.04.sql
Verify the dump
To test that the dump worked you should be able to open the file in an editor (Vscode, vim, …) and check that the file contains the proper CREATE TABLE
and CREATE FUNCTION
statements.
Then head over to this google form, fill in your name and upload the dump file
https://forms.gle/4N21CW91P9smSM658
- Fill in your name and email
- select the part of the project that you submit for
- upload the file
The file should not be larger than 10 MB.
IMPORTANT
- do not rename the functions or change the type of their arguments
- do not use pgAdmin to dump the database, only the terminal
- pay attention to the naming of your dump file
PART I: setup
Setup a new database: treesdb_v04
In this workshop we work on a subset of the trees database in a normalize structure.
Postal addresses are not relevant to trees in wide spaces (gardens, cemeteries, …).
This version of the database only has trees in the domain called Alignement
, which roughly corresponds to streets.
The new database only has 1000 trees, all situated in Paris with domain as Alignement
.
- download the
treesdb_v04.01.sql
from the github repo: https://github.com/SkatAI/epitadb/blob/master/data/treesdb_v04.01.sql - create a new database in pgAdmin, name it
treesdb_v04
Then restore the data into the database:
In a psql
session (terminal or pgAdmin), run the following command
\i <absolute path to the sql file>/treesdb_v04.01.sql
The \i
psql command executes the sql statements contained in the sql file.
Check the data upload
To check that everything is as expected, make sure that you have 1000 trees and 1000 locations in your new database.
Now install the HTTP extension
With a bit of luck the http
extension is already installed. Let’s check
first check if the extension is available
In a psql
session (pgAdmin or terminal), run:
CREATE EXTENSION http;
If that returns CREATE EXTENSION
or ERROR extension "http" already exists
, you’re in luck the extension is installed.
You can list the installed extensions with \dx
.
If the http
extension is installed and active \dx
returns:
List of installed extensions
Name | Version | Schema | Description
----------+---------+------------+----------------------------------------------------------------------------
http | 1.6 | public | HTTP client for PostgreSQL, allows web page retrieval inside the database.
Using the http
extension
In this project, we only need to send GET
requests to the endpoint. For that we use the http_get
function.
You can test that the http
extension works by sending a get request to http://httpbun.com/ip
with the http_get
function.
SELECT http_get.status, http_get.content
FROM http_get('http://httpbun.com/ip');
This should return something like that
-[ RECORD 1 ]-------------------------
status | 200
content | { +
| "origin": "146.70.119.165"+
| } +
|
The status = 200
indicates that the API call was successful. See here for a complete list of http codes.
More examples are available in the http
extension Github repo: https://github.com/pramsey/pgsql-http
Submit Part I
This is the end of the first part of the project.
Dump your database and upload it in the google form. (see instructions above )
Part II: API call
In the second part of the project, you will write 2 functions.
The 1st function build_api_url()
builds the URL endpoint
The 2nd function call_ban_api()
takes the output of the first function and calls the API
1st function: build_api_url()
The role of the build_api_url()
function is to return the properly formatted endpoint string that will be used by the http_get
function.
- The input argument is the tree
id
, - the output is the following string (with the appropriate
lon
andlat
arguments)
https://api-adresse.data.gouv.fr/reverse/?lon=2.29257146066964&lat=48.84682397923457``
Your task: build the endpoint url
Your task is to write a PL/pgSQL function which
- takes a tree id as input
- finds the geolocation coordinates of the tree: (lon, lat)
- concatenates the url and the coordinates to return the full url as
https://api-adresse.data.gouv.fr/reverse/?lon=<longitude>&lat=<latitude>
note: Double check which element of the geolocation point is the latitude and which is the longitude. We mostly speak of
latitude and longitude
, but the coordinates are stored as (long, lat) in the locations table.
You should proceed step by step to build the function:
- start with a simple function that returns the geolocation point.
- specify what data type will be returned (
point
), - declare the variable (call it
coordinates
) that will hold the result, - and use the statement
SELECT ... INTO coordinates
- specify what data type will be returned (
- then extract the elements of the
coordinates
variable - finally build the URL by concatenating these elements into a
url
variable that the function returns. change the data type that is returned by the function asvarchar
.
A possible function template follows
CREATE OR REPLACE FUNCTION build_api_url(p_tree_id integer)
RETURNS VARCHAR
AS $$
DECLARE
-- intermediate variables
v_coordinates point;
v_lat float;
v_lon float;
-- the returned variable
v_url VARCHAR;
-- declare a constant
c_endpoint CONSTANT text := 'https://api-adresse.data.gouv.fr/reverse/';
BEGIN
-- the query to get the locations.geolocation into v_coordinates
...
-- extract coordinates into lon and lat
...
-- concatenate
...
-- return the url
RETURN v_url;
END;
$$ LANGUAGE plpgsql;
test your build_api_url
function
Test that your function works correctly with a specific tree id (808
):
select build_api_url(808);
-[ RECORD 1 ]-+----------------------------------------------------------------------------------------
build_api_url | https://api-adresse.data.gouv.fr/reverse/?lon=2.2691198735270324&lat=48.85835185927335
2nd function: call_ban_api()
Next, we send a get request to the API using the output of the build_api_url()
function.
- Input: the url as output of the
build_api_url()
function. - Output: The response from the API or an error message
As in the example above, we want to catch both the status
and content
returned by the API call.
- The
status
will indicate if the call was successful. You want to test whetherstatus = 200
(👍) or not (👎). - The content is a JSON dictionary
The query to send a get request to the url
is:
SELECT http_get.status, http_get.content
INTO v_api_result
FROM http_get(p_url);
where v_api_result
is a variable you need to declare with type RECORD
Your task: write the function that calls the API
Write a function called: call_ban_api()
- input: the url
- output: the data returned by the API. The returned type is JSON.
- if the status returned by the API call is not
200
, then raise an EXCEPTION to indicate that the call was not successful. useRAISE EXCEPTION 'some message';
Note you can string concatenate the message with the status using %
: 'API call failed with status code %', v_api_status;
test your function
Test a call to the url that was returned by the select build_api_url(2048);
.
you should either get
{"error" : "Error: Resolving timed out after 1073 milliseconds"}
or something like (truncated)
{"type": "FeatureCollection", "limit": 1, "center": [2.2691198735270324, 48.85835185927335], "licence": "ETALAB-2.0",
"version": "draft", "features": [{"type": "Feature",
"geometry": {"type": "Point", "coordinates": [2.269179, 48.858286]},
"properties": {"x": 646379.43, "y": 6862275.9, "id": "75116_8028_00001", "city": "Paris",
"name": "1 Avenue du Ranelagh", "type": "housenumber", "label": "1 Avenue du Ranelagh 75016 Paris",
...
Submit Part II
This is the end of the second part of the project.
At this point, you have uploaded the data in the treedb_v04
database and created 2 functions build_api_url(integer)
and call_ban_api(text)
Dump your database and upload it in the google form. (see instructions above )
The name of the file should be: treesdb.<your username>.02.sql
Part III: log and store
Log the API calls
Let’s say this works and you productionize these 2 functions.
Each time a new tree is added to the database, a trigger calls the functions and stores the result in the database.
Then a few weeks, or months later, you notice that something went wrong at some point in time and the BAN addresses are no longer fetched from the BAN API. And you have no clue when and why that happened.
This is why you need to log in each call to the API and record
- the time of the call
- the status
- the input data: the tree id
- the error message if there is one
This way you can find out when the problem happened and what error occurred.
First create a api_logs
table
Start by creating a new table called api_logs
with the following columns
- tree_id: integer, NOT NULL
status
: integer, NOT NULL- error:
varchar
default ‘’ - created_at:
date
defaultnow()
, NOT NULL
Note that:
- Don’t forget the primary key
id
asserial
. - There is no need to add a foreign constraint on the
tree_id
with respect to the tree table. tree_id
does not have to be UNIQUE since we may call the API on the same tree multiple times if it fails the first time.
Write the function log_api_calls()
Write a function called log_api_call
which takes (tree_id
, status
, error
) as arguments and INSERTs them into the api_logs
table. It can be a SQL function. It does not have to be a PL/pgSQL function.
Make sure the function returns the id
of the row that was just inserted with RETURNING
.
Test the function log_api_calls()
Call the log_api_calls
function with the following arguments and check that
- it does not error out
- the info actually gets into the table
-- happy path, no error message
select log_api_calls(808, 200, '');
And
-- something went wrong
select log_api_calls(808, 500, 'Error, something went wrong');
Check that both calls made into the api_logs
table.
Integrate log_api_calls()
into call_ban_api()
Next, modify the call_ban_api
function to integrate the log_api_calls()
function so that all API calls are recorded. Even the ones that succeed.
A few modifications to think of:
- you need to pass the
tree_id
as an argument to the functioncall_ban_api
. Since that will overload the existingcall_ban_api(text)
function, you may want to first drop the existingcall_ban_api(text)
function. -
when the api call fails and returns an error we want to log the
content
attribute which should contain the error message. We don’t know in advance if the content is a string or a more structured JSON object. You may have to experiment with type casting. see https://stackoverflow.com/questions/27215216/how-to-convert-a-json-string-to-text for help - in PL/pgSQL, it’s best to call the
log_api_calls
function withPERFORM
instead ofSELECT
since you don’t need the output of that function- in case of error:
PERFORM log_api_calls(tree_id, api_status, api_response::text);
- if the API call succeeded then just log an empty string for the error argument:
PERFORM log_api_calls(tree_id, api_status, '');
- in case of error:
test the new version of the call_ban_api(integer, text)
function
Verify that successive calls to call_ban_api
all end up recording rows in the api_logs
table.
Let’s recapitulate
So at this point you should have
- a function that creates the URL for the API call:
build_api_url()
- a function that calls the API with that URL and returns a JSON dictionary with multiple addresses:
call_ban_api()
- a function that logs all the API calls:
log_api_calls()
- a log table recording all the API calls:
api_log
Awesome!
The next step is a walk down the park. 😇😇😇
Note: you can use
\df
to list the functions and\sf build_api_url
or\sf call_ban_api
, … to see the function definition.
To finish this Part III, we need to parse the response from the API and store it in a address table.
Record the BAN address
Start by creating a table called ban_addresses
dedicated to the BAN addresses. It will be linked to the trees table through a tree_id
foreign key constraint.
The columns of the table ban_locations
are:
id
: primary key, serialtree_id
: integer, NOT NULLaddress
: JSONB default : empty json dictionary :'{}'::jsonb
updated_at
:date
default now() NOT NULL
The address
column will hold the 1st address returned by the BAN API as a JSONB dictionary.
By default the address
column should be an empty JSONB dictionary : '{}'::jsonb
Don’t forge to create the FOREIGN KEY constraint on the tree_id
with respect to the trees table
[solution]
ALTER TABLE ban_addresses
ADD CONSTRAINT trees_ban_addresses_id_fkey
FOREIGN KEY (tree_id) REFERENCES trees(id);
Check that the foreign key has been created with \d+ ban_addresses
Parse the JSON response
The API returns a JSON dictionary with multiple elements.
{
...
"features": [
{
...,
"properties": {
the address we're interested in
}
},
...
We don’t want to record the whole dictionary, just the first properties
in the features
array.
So we need to parse the JSON response within a PL/pgSQl function.
Parsing JSON in PL/pgSQL
You parse jSON dictionaries in PostgreSQL with a sequence of arrows ->
Let’s say you have a JSON column in a table that follows the above structure : features
is an array of JSON objects, that have a properties
key.
We want to select features
, then the first element (‘0’) then the properties key in that first element
In our case
json_data->'features'->0->'properties'
So we can parse the output of the call_ban_api()
function with
select call_ban_api(some url)::jsonb -> 'features' -> 0 -> 'properties';
try it!
Your task : write a function that parses the output of the API
write a function, call it parse_api_result()
, that takes
p_api_output JSONB
as input- and returns
properties
- If parsing fails, return an empty JSONB object
You can use a block with
BEGIN
BEGIN
-- try parsing the json b
EXCEPTION WHEN others THEN
RETURN '{}'::jsonb;
END;
END;
Note also that if the parsing does not correspond to the structure of the dictionary of the API response, then it will be NULL. So you want to coalesce the result of the parsing as such
COALESCE(v_parsed_address, '{}'::jsonb)
so that it returns '{}'::jsonb
if the parsing is not valid
Don’t forget to test the parsing function
Wrap all these functions into one main function
It is best to have a wrapping function which given a tree id encapsulate the functions build_api_url
, call_ban_api
and parse_api_result
.
Let’s call this function retrieve_tree_address_from_ban(p_id INTEGER)
.
- It takes a
tree_id
as input - calls the
build_api_url
,call_ban_api
andparse_api_result
in sequence - inserts the tree_id and returned address into the
ban_addresses
table - returns the id of the new created row
Test that everything works
Call the retrieve_tree_address_from_ban
function on multiple tree ids
For each call, check the api_logs
and the ban_addresses
tables
Submit Part III
Congrats! You made it to the end of Part III! You rock!
At this point, you should have created the following functions
build_api_url()
call_ban_api()
log_api_calls()
parse_api_result
- and the main wrapper function :
retrieve_tree_address_from_ban
You also have 2 new tables:
api_logs
ban addresses
Dump your database and upload it in the google form. (see instructions above )
The name of the file should be: treesdb.<your username>.03.sql
Part IV: process the whole set of trees
We are able to get the address tree by tree and we need the addresses for all 1000 trees.
There are many ways on how to batch process the sequence of api calls.
In general we want
- a query that selects the trees that we need processing for. that way we can limit the number of calls to the API. Emeber that they are limited in time (50 requests per second per IP)
- a function that gets the address for multiple trees
However, we do not want to simply send many requests to an API as fast as possible. that will almost always certainly fails at some point. APIs do not like being pushed around.
We want each successive API call to be separated by one or more seconds.
Write the function : fetch_addresses_for_trees(ARRAY_AGG(id))
So the function we want to write in this last part of the project has to
- take an array of tree ids :
ARRAY_AGG(id)
- LOOP over each tree id with :
FOREACH curr_tree_id IN ARRAY tree_ids LOOP
- sleep for some time with :
PERFORM pg_sleep(sleep_duration);
the tree selection
To select and process the trees that do not yet have an address you can use a main query as such
WITH sample_trees AS (
SELECT id
FROM trees
ORDER BY RANDOM()
LIMIT 2
)
SELECT fetch_addresses_for_trees(ARRAY_AGG(id))
FROM sample_trees;
You should change the filtering of the trees to select trees that do not yet have a valid address. Remember that the address
attribute of the ban_addresses
table, has the value '{}'::jsonb
by default.
test the fetch_addresses_for_trees
function and selection query
Run the query that selects the trees and calls the fetch_addresses_for_trees
function for a limited number of trees and verify that
- the api calls are properly logged
- the addresses actually get recorded in the
ban_addresses
table - the trees with valid addresses are not re processed
process all the trees
Your last task is to process as many trees as possible before you submit the database
Submit the Part IV
Congratulations you’ve made it till the end of the project. I hope that was fun and that you learned a few things along the way.
Now dump your database and submit it in the google form. You know the drill!
At this point, you should have created the following functions
build_api_url()
call_ban_api()
log_api_calls()
parse_api_result
retrieve_tree_address_from_ban
fetch_addresses_for_trees
An the tables:
api_logs
ban addresses
Dump your database and upload it in the google form. (see instructions above )
The name of the file should be: treesdb.<your username>.04.sql