Cryptographic Longterm Key
A Cryptographic Longterm Key (CLK) is the name given to a Bloom filter used as a privacy-preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity: two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.
CLKs are created independently of the entity service, following a keyed hashing process.
A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service's API.
The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.
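To make the keyed hashing process concrete, here is a simplified sketch in Python. It is not the exact scheme used by the service or by Schnell et al.; the key values, filter size, and number of hash functions are illustrative assumptions.

```python
import hashlib
import hmac

FILTER_BITS = 64  # illustrative; production CLKs are typically much longer
KEYS = [b'secret-key-1', b'secret-key-2']  # hypothetical shared secrets

def bigrams(value):
    """Split a field value into overlapping 2-grams, padded with spaces."""
    padded = ' {} '.format(value.lower())
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def toy_clk(fields):
    """Set one bit per (bigram, key) pair using a keyed hash (HMAC-SHA256)."""
    bits = 0
    for field in fields:
        for gram in bigrams(field):
            for key in KEYS:
                digest = hmac.new(key, gram.encode(), hashlib.sha256).digest()
                position = int.from_bytes(digest[:4], 'big') % FILTER_BITS
                bits |= 1 << position
    return bits

clk = toy_clk(['Alice Smith', '1987-03-21'])
print(format(clk, '064b'))
```

Because similar field values share most of their bigrams, they set mostly the same bit positions, which is what makes the similarity comparison described below possible.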
Bloom Filter Format
A Bloom filter is simply an encoding of PII as a bit array. This can easily be represented as bytes (each being an 8-bit number between 0 and 255). We serialize by Base64 encoding the raw bytes of the bit array.
An example with a 64-bit filter:
```
# bloom filter binary value
'0100110111010000101111011111011111011000110010101010010010100110'
# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]
# which gets base64 encoded to
'TdC999jKpKY=\n'
```
As with standard Base64 encodings, a newline is introduced every 76 characters.
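The steps in the example above can be reproduced with the Python standard library. This is a minimal sketch, not the service's own serialization code:

```python
import base64

bits = '0100110111010000101111011111011111011000110010101010010010100110'

# pack the bit string into bytes, 8 bits per byte, most significant bit first
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
print(list(raw))                # [77, 208, 189, 247, 216, 202, 164, 166]

# encodebytes inserts a newline every 76 characters, as noted above
print(base64.encodebytes(raw))  # b'TdC999jKpKY=\n'
```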
Comparing Cryptographic Longterm Keys
The similarity metric used is the Sørensen–Dice index, although this may become a configurable option in the future.
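As a rough illustration, the Sørensen–Dice index of two bit arrays is twice the number of common set bits divided by the total number of set bits. The following sketch is for illustration only and is not the service's implementation:

```python
def dice_coefficient(a, b):
    """2 * |A intersection B| / (|A| + |B|), counting set bits."""
    common = bin(a & b).count('1')
    total = bin(a).count('1') + bin(b).count('1')
    return 2.0 * common / total if total else 0.0

clk_a = int('0100110111010000', 2)
clk_b = int('0100110111010001', 2)
print(dice_coefficient(clk_a, clk_b))  # ~0.93 for these similar filters
```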
Blocking is a technique that makes large-scale record linkage practical. Blocking partitions datasets into groups, called blocks, and only the records in corresponding blocks are compared. This can massively reduce the total number of comparisons needed to find matching records.
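A toy example of the idea (the blocking key and records here are hypothetical; real systems use carefully chosen blocking keys):

```python
from collections import defaultdict
from itertools import product

dataset_a = [{'name': 'Alice Smith'}, {'name': 'Bob Jones'}]
dataset_b = [{'name': 'Alicia Smith'}, {'name': 'Bobby Jones'}]

def block_key(record):
    # a deliberately crude blocking key: first letter of the name
    return record['name'][0]

def build_blocks(dataset):
    blocks = defaultdict(list)
    for index, record in enumerate(dataset):
        blocks[block_key(record)].append(index)
    return blocks

blocks_a, blocks_b = build_blocks(dataset_a), build_blocks(dataset_b)
pairs = sorted((i, j)
               for key in blocks_a.keys() & blocks_b.keys()
               for i, j in product(blocks_a[key], blocks_b[key]))
print(pairs)  # [(0, 0), (1, 1)] -- 2 comparisons instead of all 4
```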
The Entity Service supports different result types, which affect what output is produced and who may see it.
The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.
Similarity Score

Similarity scores are computed between all CLKs from each organisation; the scores above a given threshold are returned. This output type is currently the only way to work with 1-to-many relationships.
The result_token (generated when creating the mapping) is required. The result_type should be set to "similarity_scores".
Results are a JSON array of JSON arrays of three elements:
```
[
    [[party_id_0, row_index_0], [party_id_1, row_index_1], score],
    ...
]
```
Where the index values will be the 0-based dataset index and row index from the uploaded CLKs, and the score will be a Number between the provided threshold and 1.0. A score of 1.0 means the CLKs were identical. Threshold values are usually between 0.5 and 1.0.

The maximum number of results returned is the product of the two data set lengths. For example, comparing two data sets each containing 1 million records with a threshold of 0.0 will return 1 trillion results (1e+12).
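For example, a returned payload for two small data sets might be decoded as follows (all values are illustrative, not real output):

```python
import json

payload = '''
[
  [[0, 0], [1, 2], 0.95],
  [[0, 1], [1, 0], 0.83]
]
'''
for (party_a, row_a), (party_b, row_b), score in json.loads(payload):
    print('dataset %d row %d ~ dataset %d row %d: %.2f'
          % (party_a, row_a, party_b, row_b, score))
```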
Groups

The groups result type has been created for multi-party linkage, and will replace the direct mapping result for two parties, as it contains the same information in a different format.
The result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party index and the row index:
```
[
    [
        [party_id, row_index],
        ...
    ],
    ...
]
```
The result_token (generated when creating the mapping) is required to retrieve the results. The result_type should be set to "groups".
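An illustrative (hypothetical) groups result for three parties, decoded in Python:

```python
import json

payload = '[[[0, 1], [1, 0], [2, 2]], [[0, 3], [2, 0]]]'
for group in json.loads(payload):
    members = ', '.join('party %d row %d' % (p, r) for p, r in group)
    print('same entity:', members)
```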
Permutation and Mask
This protocol creates a random reordering for both organizations and creates a mask revealing where the reordered rows line up.
Accessing the mask requires the result_token, and accessing the permutation requires a receipt-token (provided to each organization when they upload data).
Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
The result_type should be set to "permutations".
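The following toy sketch shows the idea; the row values, permutations, and mask here are made up for illustration (the service generates the permutations and mask, and raw row values are never shared):

```python
rows_a = ['alice', 'bob', 'carol', 'dave']  # larger data set
rows_b = ['alice', 'eve', 'bob']            # smaller data set

perm_a = [1, 2, 0, 3]  # hypothetical permutation for the larger data set
perm_b = [2, 1, 0]     # hypothetical permutation for the smaller data set
mask = [1, 0, 1]       # length of the smaller data set

permuted_a = [rows_a[i] for i in perm_a]  # ['bob', 'carol', 'alice', 'dave']
permuted_b = [rows_b[i] for i in perm_b]  # ['bob', 'eve', 'alice']

# positions where the mask is set hold rows that refer to the same entity
for position, bit in enumerate(mask):
    if bit:
        print(permuted_a[position], '<->', permuted_b[position])
```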