Concepts

Cryptographic Longterm Key

A Cryptographic Longterm Key (CLK) is the name given to a Bloom filter used as a privacy-preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity: two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.

CLKs are created independently of the entity service, following a keyed hashing process.

A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service’s API.

Note

The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.

Bloom Filter Format

A Bloom filter is simply an encoding of PII as a bitarray.

The bit array can be represented as bytes (each an 8-bit number between 0 and 255). We serialize by Base64 encoding the raw bytes of the bit array.

An example with a 64 bit filter:

# the bloom filter's binary value
'0100110111010000101111011111011111011000110010101010010010100110'

# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]

# which gets base64 encoded to
'TdC999jKpKY=\n'

As with MIME Base64 encoding, a newline is introduced every 76 characters.
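The serialization above can be reproduced in a few lines of Python (a minimal sketch; `base64.encodebytes` matches the newline behaviour described above):

```python
import base64

# The 64 bit example filter from above, as a bit string.
bits = '0100110111010000101111011111011111011000110010101010010010100110'

# Pack the bits into bytes, 8 bits per byte, most significant bit first.
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
assert list(raw) == [77, 208, 189, 247, 216, 202, 164, 166]

# encodebytes inserts a newline every 76 characters (and a trailing one).
encoded = base64.encodebytes(raw)
print(encoded)  # b'TdC999jKpKY=\n'
```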

Schema

It is important that participating organisations agree on how personally identifiable information is processed to create the CLKs. We call the configuration for creating CLKs a linkage schema. The organisations have to agree on a schema to ensure their CLKs are comparable.

The linkage schema is documented in clkhash, our reference implementation written in Python.

Note

Due to the one-way nature of hashing, the entity service can’t determine whether the linkage schema was followed when clients generated CLKs.

Comparing Cryptographic Longterm Keys

The similarity metric used is the Sørensen–Dice index, although this may become a configurable option in the future.
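As a sketch, the Sørensen–Dice index between two serialized CLKs can be computed as follows (the function name and the byte-string representation are illustrative, not the service's implementation):

```python
def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sørensen–Dice index of two equal-length Bloom filters.

    2 * |A & B| / (|A| + |B|), where |X| counts the set bits of X.
    """
    a = int.from_bytes(clk_a, 'big')
    b = int.from_bytes(clk_b, 'big')
    ones_a = bin(a).count('1')
    ones_b = bin(b).count('1')
    if ones_a + ones_b == 0:
        return 0.0
    common = bin(a & b).count('1')
    return 2 * common / (ones_a + ones_b)

# Identical filters score 1.0; filters with no common bits score 0.0.
print(dice_coefficient(b'\x0f\x0f', b'\x0f\x0f'))  # 1.0
print(dice_coefficient(b'\xf0', b'\x0f'))          # 0.0
```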

Output Types

The Entity Service supports different result types, which affect what output is produced and who may see the output.

Warning

The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.

Similarity Score

Similarity scores are computed between the two organisations' CLKs; the scores above a given threshold are returned. This output type is currently the only way to work with 1-to-many relationships.

The result_token (generated when creating the mapping) is required. The result_type should be set to "similarity_scores".

Results are a simple JSON array of arrays:

[
    [index_a, index_b, score],
    ...
]

Where the index values are the 0-based row indices from the uploaded CLKs, and the score is a Number between the provided threshold and 1.0.

A score of 1.0 means the CLKs were identical. Threshold values are usually between 0.5 and 1.0.

Note

The maximum number of results returned is the product of the two data set lengths.

For example:

Comparing two data sets each containing 1 million records with a threshold of 0.0 will return 1 trillion results (1e+12).

Direct Mapping Table

The direct mapping takes the similarity scores and assigns the highest-scoring pairs as links.

The links are exposed as a lookup table using indices from the two organizations:

{
    index_a: index_b,
    ...
}

The result_token (generated when creating the mapping) is required to retrieve the results. The result_type should be set to "mapping".
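The idea of turning similarity scores into a mapping can be sketched as a greedy assignment (an illustrative reconstruction that links each row at most once, not necessarily the service's actual algorithm):

```python
def greedy_mapping(scores):
    """Assign links greedily from the highest similarity score downward.

    ``scores`` is a list of (index_a, index_b, score) triples, as in the
    similarity score output. Each row index is used at most once.
    """
    mapping = {}
    used_b = set()
    for index_a, index_b, score in sorted(scores, key=lambda s: -s[2]):
        if index_a not in mapping and index_b not in used_b:
            mapping[index_a] = index_b
            used_b.add(index_b)
    return mapping

# Toy scores: row 1 of A links to row 0 of B, and row 0 of A to row 1 of B.
scores = [(0, 1, 0.9), (0, 0, 0.8), (1, 0, 0.95)]
print(greedy_mapping(scores))  # {1: 0, 0: 1}
```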

Permutation and Mask

This protocol creates a random reordering of records for both organizations, and a mask revealing where the reordered rows line up.

Accessing the mask requires the result_token, and accessing the permutation requires a receipt-token (provided to each organization when they upload data).

Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of its rows which are not in the smaller data set.

The result_type should be set to "permutations".
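A toy illustration of how recipients might apply their permutations and the shared mask (the data, and the convention that position i of the reordered list holds record perm[i], are assumptions for illustration only):

```python
# Assumed example outputs: each organization receives a permutation over
# its own rows; the mask has the length of the smaller data set.
perm_a = [2, 0, 1]        # org A's permutation (3 records)
perm_b = [1, 3, 0, 2]     # org B's permutation (4 records)
mask = [1, 0, 1]          # shared mask, length of the smaller data set

records_a = ['a0', 'a1', 'a2']
records_b = ['b0', 'b1', 'b2', 'b3']

# Each organization reorders its own records independently.
reordered_a = [records_a[i] for i in perm_a]
reordered_b = [records_b[i] for i in perm_b]

# Where the mask is 1, the rows at that position refer to the same entity.
linked = [(ra, rb) for ra, rb, m in zip(reordered_a, reordered_b, mask) if m]
print(linked)  # [('a2', 'b1'), ('a1', 'b0')]
```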