SHA256 Functions Worksheet
SHA256, pgcrypto, hashing, data integrity
Worksheet: Using Functions in PL/pgSQL
In this exercise, the goal is to generate a unique sha256 hash string for each tree entry in the treesdb_v03 database.
The SHA hash string that will be generated takes multiple columns describing tree characteristics as input.
sha256 in PostgreSQL
The sha256 function in PostgreSQL is used to generate a cryptographic hash of data,. It is part of the pgcrypto extension, which provides various cryptographic functions, including hashing algorithms.
To use the sha256 function, you utilize the digest() function provided by pgcrypto. It requires two input parameters:
- Data: The string or data to be hashed.
- Algorithm: The name of the hash algorithm, in this case,
'sha256'.
Before using the function, you must install the pgcrypto extension, which can be done by running:
CREATE EXTENSION pgcrypto;
Once installed, you can hash a string as shown in the following example:
SELECT encode(digest('example_string', 'sha256'), 'hex') AS sha256_hash;
This example generates the SHA-256 hash of the string example_string and returns it in a human-readable hexadecimal format.
SELECT encode(digest('example_string', 'sha256'), 'hex') AS sha256_hash;
sha256_hash
------------------------------------------------------------------
9bfb55a8406617ff3e6767ec5d27fc6b5682c3a79c415ef6e084bf7d050273e6
Benefits of sha256
Hashing the columns that fully describe a tree (such as name, domain, genre, species, variety, arrondissement) has several advantages:
1. Unique Identifier Creation
- By concatenating all descriptive columns and applying a hash function like
sha256, you create a unique, fixed-length identifier (hash) for each tree. This ensures that even with varying column lengths, the hash size remains constant.
Very useful for content that varies a lot in length like articles, blog posts, etc
2. Efficient Comparison
- Instead of comparing multiple columns to check if two trees are the same, you can compare a single hash value. This speeds up querying and comparisons, especially with large datasets.
Hashing a record also helps with:
- Data Integrity: Hashing ensures that even a small change in any of the descriptive fields results in a completely different hash. This helps maintain data integrity and detect changes or tampering.
- Data Handling: Hashes are useful when you need a compact way to reference records without needing to expose or transmit all individual columns. This simplifies indexing and referencing trees in other operations.
In summary, hashing provides a compact, efficient, and secure way to uniquely identify and handle records in a database.
Your task
1. Concatenate Categorical Columns
- Write a SQL query that concatenates the following columns for each tree:
name,domain,genre,species,variety,arrondissement.- Use
COALESCEto replace anyNULLvalues with'UNK'in the concatenated result. - Example structure:
COALESCE(column_name, 'UNK').
- Use
2. Generate SHA256 Hash
- display the
digestandencodefunctions definition
You can use \df+ digest to find which version of the function you want to inspect.
and then
SELECT pg_catalog.pg_get_functiondef('digest(bytea, text)'::regprocedure);
To see the definition for digest(bytea, text).
- Extend the SQL query to pass the concatenated string into the
sha256function. - The query should return the resulting
sha256hash string.
3. Create SQL Function for Tree Hash
- Write a SQL function that:
- Accepts a tree’s
idas input. - Returns the
sha256hash generated from the concatenated columns of that tree.
- Accepts a tree’s
4. Add sha_id Column
- Add a new column
sha_idof typetextto thetreestable .
This column will store the hash values for each tree.
5. Modify Function: Insert Hash Value
- Modify the function from task 3 so that it:
- Inserts the hash value into the
sha_idcolumn. - Includes exception handling to catch any errors (e.g., inserting a duplicate hash).
- Inserts the hash value into the
6. Test the Function on Several Trees
- Run the modified function for several rows in the database to insert their hash values into the
sha_idcolumn.
7. Update Function: Handle Subset of Trees
- Extend the function to handle a set of trees, rather than just a single tree.
- Input could be a list of
ids or a query that selects a subset of trees.
- Input could be a list of
8. Ensure Uniqueness of Hash Values
- Add a database constraint to ensure that the
sha_idcolumn only contains unique values.
9. PL/pgSQL Function: Generate Hash for All Trees
- Write a
PL/pgSQLfunction that:- Runs the
sha256hash on all rows in the database. - Inserts the hash values into the
sha_idcolumn. - Checks for duplicates and ensures that no tree is processed more than once.
- Runs the
Additional Notes
- Make sure to test your functions thoroughly.
- Consider edge cases like:
- Trees with missing or
NULLvalues in categorical columns. - Duplicates in
sha_idvalues.
- Trees with missing or
- You can use
RAISE NOTICEstatements within PL/pgSQL to help with debugging.