Entity Service - v1.15.0¶
The Entity Service allows two organizations to carry out private record linkage — finding matching records of entities between their respective datasets without disclosing personally identifiable information.
Overview¶
The Entity Service is based on the concept of Anonymous Linking Codes (ALC). These can be seen as bit-arrays representing an entity, with the property that the similarity of the bits of two ALCs reflects the similarity of the corresponding entities.
An anonymous linking code that has been shown to produce good results and is widely used in practice is the so-called *Cryptographic Longterm Key*, or CLK for short.
Note
From now on, we will use CLK exclusively instead of ALC, as our reference implementation of the private record linkage process uses CLKs as anonymous linking codes. The Entity Service is, however, not limited to CLKs.
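As a toy illustration of this similarity property, the sketch below compares two short bit-arrays using the Sørensen-Dice coefficient, the measure anonlink uses to compare CLKs. The bit patterns are made up for illustration; real CLKs are typically 1024 bits long.

from bitarray import bitarray

def dice_similarity(clk_a: bitarray, clk_b: bitarray) -> float:
    """Sørensen-Dice coefficient of two equal-length bit-arrays."""
    common = (clk_a & clk_b).count()       # bits set in both encodings
    total = clk_a.count() + clk_b.count()  # bits set in each encoding, counted separately
    return 2 * common / total if total else 0.0

# Two made-up 16-bit "encodings" of similar entities.
alice_clk = bitarray('1011001110001011')
bob_clk   = bitarray('1011001010001011')
print(dice_similarity(alice_clk, bob_clk))  # close to 1.0, so probably the same entity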
Private record linkage - using the Entity Service - is a two-stage process:
- First, each party locally encodes their entities’ data (e.g. using the clkhash tool to produce CLKs, using blocklib to group similar entities into subgroups). These CLKs are then uploaded to the service. All these tasks can be conveniently initiated by the anonlink-client tool.
- The service then calculates the similarity between entities, using the probabilistic matching library anonlink. Depending on configuration, the output is returned as a mapping, permutations and mask, or similarity scores.
Table Of Contents¶
Tutorials¶
Command line example¶
This brief example shows using anonlink - the command line tool that is packaged with the anonlink-client library. It is not a requirement to use anonlink-client with the Entity Service REST API.
We assume you have access to a command line prompt with Python and Pip installed.
Install anonlink-client
:
$ pip install anonlink-client
Generate and split some mock personally identifiable data:
$ anonlink generate 2000 raw_pii_2k.csv
$ head -n 1 raw_pii_2k.csv > alice.txt
$ tail -n 1500 raw_pii_2k.csv >> alice.txt
$ head -n 1000 raw_pii_2k.csv > bob.txt
A corresponding hashing schema can be generated as well:
$ anonlink generate-default-schema schema.json
Process the personally identifying data into Cryptographic Longterm Keys:
$ anonlink hash alice.txt horse_staple schema.json alice-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 1.50K/1.50K [00:00<00:00, 6.69Kclk/s, mean=522, std=34.4]
CLK data written to alice-hashed.json
$ anonlink hash bob.txt horse_staple schema.json bob-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 999/999 [00:00<00:00, 5.14Kclk/s, mean=520, std=34.2]
CLK data written to bob-hashed.json
Now to interact with an Entity Service. First check that the service is healthy and responds to a status check:
$ anonlink status --server https://anonlink.easd.data61.xyz
{"rate": 53129, "status": "ok", "project_count": 1410}
Then create a new linkage project and set the output type (to groups):
$ anonlink create-project \
--server https://anonlink.easd.data61.xyz \
--type groups \
--schema schema.json \
--output credentials.json
The entity service replies with a project id and credentials which get saved into the file credentials.json. The contents are two upload tokens and a result token:
{
"update_tokens": [
"21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55",
"3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905"
],
"project_id": "809b12c7e141837c3a15be758b016d5a7826d90574f36e74",
"result_token": "230a303b05dfd186be87fa65bf7b0970fb786497834910d1"
}
These credentials get substituted in the following commands. Each CLK dataset gets uploaded to the Entity Service:
$ anonlink upload --server https://anonlink.easd.data61.xyz \
--apikey 21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
alice-hashed.json
{"receipt_token": "05ac237462d86bc3e2232ae3db71d9ae1b9e99afe840ee5a", "message": "Updated"}
$ anonlink upload --server https://anonlink.easd.data61.xyz \
--apikey 3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
bob-hashed.json
{"receipt_token": "6d9a0ee7fc3a66e16805738097761d38c62ea01a8c6adf39", "message": "Updated"}
Now we can compute linkages using various thresholds. For example, to only see relationships where the similarity is above 0.9:
$ anonlink create --server https://anonlink.easd.data61.xyz \
--apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
--name "Tutorial mapping run" \
--threshold 0.9
{"run_id": "31a6d3c775151a877dcac625b4b91a6659317046ea45ad11", "notes": "Run created by anonlink-client 0.1.2", "name": "Tutorial mapping run", "threshold": 0.9}
After a small delay the linkage result will have been computed and we can use anonlink to retrieve it:
$ anonlink results --server https://anonlink.easd.data61.xyz \
--apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
--run 31a6d3c775151a877dcac625b4b91a6659317046ea45ad11
State: completed
Stage (3/3): compute output
Downloading result
Received result
{
"groups": [
[
[0, 403],
[1, 903]
],
[
[0, 402],
[1, 902]
],
[
[0, 401],
[1, 901]
],
...
This output shows the linked pairs between Alice and Bob that have a similarity above 0.9.
Looking at the corresponding entities in Alice’s data:
$ head -n 405 alice.txt | tail -n 3
901,Sandra Boone,1974/10/30,F
902,Lucas Hernandez,1937/06/11,M
903,Ellis Stevens,2008/06/02,M
And the corresponding entities in Bob’s data:
$ head -n 905 bob.txt | tail -n 3
901,Sandra Boone,1974/10/30,F
902,Lucas Hernandez,1937/06/11,M
903,Ellis Stevens,2008/06/02,M
Anonlink Entity Service API¶
This tutorial demonstrates directly interacting with the entity service via the REST API. The primary alternative is to use a library or command line tool such as anonlink-client (https://anonlink-client.readthedocs.io/), which can handle the communication with the anonlink entity service.
Dependencies¶
In this tutorial we interact with the REST API using the requests
Python library. Additionally we use the clkhash
Python library to define the linkage schema and to encode the PII. The synthetic dataset comes from the recordlinkage
package. All the dependencies can be installed with pip:
pip install requests clkhash recordlinkage
Steps¶
- Check connection to Anonlink Entity Service
- Synthetic Data generation and encoding
- Create a new linkage project
- Upload the encodings
- Create a run
- Retrieve and analyse results
[1]:
import json
import os
import time
import requests
from IPython.display import clear_output
Check Connection¶
If you are connecting to a custom entity service, change the address here.
[2]:
server = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
url = server + "/api/v1/"
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz/api/v1/
[3]:
requests.get(url + 'status').json()
[3]:
{'project_count': 777, 'rate': 113057931, 'status': 'ok'}
Data preparation¶
This section won’t be explained in great detail as it directly follows the clkhash tutorials.
We encode a synthetic dataset from the recordlinkage library using clkhash.
[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()
[6]:
with open('a.csv', 'w') as a_csv:
dfA.to_csv(a_csv, line_terminator='\n')
with open('b.csv', 'w') as b_csv:
dfB.to_csv(b_csv, line_terminator='\n')
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash
how to treat each column for encoding PII into CLKs. A detailed description of the hashing schema can be found in the clkhash documentation.
A linkage schema can either be defined as Python code as shown here, or as a JSON file (shown in other tutorials). The importance of each field is controlled by the hashing strategy (here BitsPerTokenStrategy) in its FieldHashingProperties. We ignore the record id and social security id fields so they won’t be incorporated into the encoding.
[7]:
import clkhash
from clkhash.comparators import *
from clkhash.field_formats import *
schema = clkhash.randomnames.NameList.SCHEMA
_missing = MissingValueSpec(sentinel='')
schema.fields = [
Ignore('rec_id'),
StringSpec('given_name',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
StringSpec('surname',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
IntegerSpec('street_number',
FieldHashingProperties(
NgramComparison(1, positional=True),
BitsPerTokenStrategy(15),
missing_value=_missing)),
StringSpec('address_1',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
StringSpec('address_2',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
StringSpec('suburb',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
IntegerSpec('postcode',
FieldHashingProperties(
NgramComparison(1, positional=True),
BitsPerTokenStrategy(15))),
StringSpec('state',
FieldHashingProperties(
NgramComparison(2),
BitsPerTokenStrategy(15))),
IntegerSpec('date_of_birth',
FieldHashingProperties(
NgramComparison(1, positional=True),
BitsPerTokenStrategy(15),
missing_value=_missing)),
Ignore('soc_sec_id')
]
Encoding¶
Transforming the raw personally identifiable information into CLK encodings following the defined schema. See the clkhash documentation for further details on this.
[8]:
from clkhash import clk
from clkhash.serialization import serialize_bitarray
with open('a.csv') as a_pii:
# clkhash generates bitarrays
hashed_data_a = clk.generate_clk_from_csv(a_pii, 'secret', schema, validate=False)
# we serialize them into a json friendly format
serialized_data_a = [serialize_bitarray(bf) for bf in hashed_data_a]
with open('clks_a.json', 'w') as f:
json.dump({'clks': serialized_data_a}, f)
with open('b.csv') as b_pii:
serialized_data_b = [serialize_bitarray(bf) for bf in clk.generate_clk_from_csv(b_pii, 'secret', schema, validate=False)]
with open('clks_b.json', 'w') as f:
json.dump({'clks': serialized_data_b}, f)
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 2.98kclk/s, mean=643, std=45.
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 3.45kclk/s, mean=631, std=52.
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[9]:
project_spec = {
"schema": {},
"result_type": "groups",
"number_parties": 2,
"name": "API Tutorial Test"
}
credentials = requests.post(url + 'projects', json=project_spec).json()
project_id = credentials['project_id']
a_token, b_token = credentials['update_tokens']
credentials
[9]:
{'project_id': '30b77f6dd474f4b222f47c487b93b6d71abd8467629d647d',
'result_token': '80ab26ffe3a2bf3421a3bc4fbccfc3521e6de4026de80ffa',
'update_tokens': ['00408bb1820a71a6752a969446518c05dc6648eff936b3dd',
'3553e6eb51f65ae7a5e4e40a134a7039dadc4246acc89619']}
The server returns a project_id, a result_token and a set of update_tokens, one for each data provider.
- The project_id references the project uniquely on the server.
- The result_token authorises project API requests, i.e., access to the result of the linkage.
- The update_tokens authorise the data upload. There is one update_token for each data provider, and each token can only be used once.
Note: the analyst will need to pass on the project_id
(the id of the linkage project) and one of the update_tokens
to each data provider.
The result_token
can also be used to carry out project API requests:
[10]:
requests.get(url + 'projects/{}'.format(project_id),
headers={"Authorization": credentials['result_token']}).json()
[10]:
{'error': False,
'name': 'API Tutorial Test',
'notes': '',
'number_parties': 2,
'parties_contributed': 0,
'project_id': '30b77f6dd474f4b222f47c487b93b6d71abd8467629d647d',
'result_type': 'groups',
'schema': {},
'uses_blocking': False}
Now the two clients can upload their data providing the appropriate upload tokens.
CLK Upload¶
There are currently two different ways of uploading CLKs to the entity service.
Method 1: Direct Upload¶
The ‘clks’ endpoint accepts CLKs in both JSON and binary format. However, this method is not recommended for large datasets, as uploads cannot be resumed and might time out.
[11]:
a_response = requests.post(
'{}projects/{}/clks'.format(url, project_id),
json={'clks': serialized_data_a},
headers={"Authorization": a_token}
).json()
[12]:
b_response = requests.post(
'{}projects/{}/clks'.format(url, project_id),
json={'clks': serialized_data_b},
headers={"Authorization": b_token}
).json()
Method 2: Upload to Object Store¶
The entity service can be deployed with an object store, which the data providers can use to upload their CLKs. First, a data provider has to request a set of temporary credentials which authorise the upload to the object store. The returned temporary object store credentials can be used with any S3 compatible client, for example boto3 in Python (a boto3 sketch is shown after the upload cells below). The credentials are restricted to allow only uploading data to a particular path in a particular bucket for a finite period (defaulting to 12 hours). After uploading the data, the client informs the entity service.
Note: this feature may be disabled by the administrator; in this case the endpoint will return a 500 server error.
[11]:
from minio import Minio
upload_response = requests.get(
url + 'projects/{}/authorize-external-upload'.format(project_id),
headers={'Authorization': a_token},
).json()
upload_credentials = upload_response['credentials']
upload_info = upload_response['upload']
# Use Minio python client to upload data
mc = Minio(
upload_info['endpoint'],
access_key=upload_credentials['AccessKeyId'],
secret_key=upload_credentials['SecretAccessKey'],
session_token=upload_credentials['SessionToken'],
region='us-east-1',
secure=upload_info['secure']
)
etag = mc.fput_object(
upload_info['bucket'],
upload_info['path'] + "/clks_a.json",
'clks_a.json',
metadata={
"hash-count": 5000,
"hash-size": 128
})
# Should be able to notify the service that we've uploaded data
res = requests.post(url + f"projects/{project_id}/clks",
headers={'Authorization': a_token},
json={
'encodings': {
'file': {
'bucket': upload_info['bucket'],
'path': upload_info['path'] + "/clks_a.json",
}
}
})
print(f'Upload party A: {"OK" if res.status_code == 201 else "ERROR"}')
#party B:
upload_response = requests.get(
url + 'projects/{}/authorize-external-upload'.format(project_id),
headers={'Authorization': b_token},
).json()
upload_credentials = upload_response['credentials']
upload_info = upload_response['upload']
# Use Minio python client to upload data
mc = Minio(
upload_info['endpoint'],
access_key=upload_credentials['AccessKeyId'],
secret_key=upload_credentials['SecretAccessKey'],
session_token=upload_credentials['SessionToken'],
region='us-east-1',
secure=upload_info['secure']
)
etag = mc.fput_object(
upload_info['bucket'],
upload_info['path'] + "/clks_b.json",
'clks_b.json',
metadata={
"hash-count": 5000,
"hash-size": 128
})
# Should be able to notify the service that we've uploaded data
res = requests.post(url + f"projects/{project_id}/clks",
headers={'Authorization': b_token},
json={
'encodings': {
'file': {
'bucket': upload_info['bucket'],
'path': upload_info['path'] + "/clks_b.json",
}
}
})
print(f'Upload party B: {"OK" if res.status_code == 201 else "ERROR"}')
Upload party A: OK
Upload party B: OK
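The Minio client above is just one choice; because these are standard S3 temporary credentials, the upload could equally be done with boto3. A rough, untested sketch, reusing the upload_info and upload_credentials dictionaries returned by the authorize-external-upload endpoint:

import boto3

def upload_clks_with_boto3(filename, upload_info, upload_credentials):
    """Upload a CLK file to the object store path we were authorised for."""
    scheme = 'https://' if upload_info['secure'] else 'http://'
    s3 = boto3.client(
        's3',
        endpoint_url=scheme + upload_info['endpoint'],
        aws_access_key_id=upload_credentials['AccessKeyId'],
        aws_secret_access_key=upload_credentials['SecretAccessKey'],
        aws_session_token=upload_credentials['SessionToken'],
        region_name='us-east-1',
    )
    s3.upload_file(
        filename,
        upload_info['bucket'],
        upload_info['path'] + '/' + filename,
        ExtraArgs={'Metadata': {'hash-count': '5000', 'hash-size': '128'}},
    )

# e.g. upload_clks_with_boto3('clks_a.json', upload_info, upload_credentials)

As before, the service still has to be notified via the projects/{project_id}/clks endpoint once the file is in place.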
Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
Create a run¶
Now that the project has been created and the CLK encodings have been uploaded, we can carry out some privacy preserving record linkage. The same encoding data can be linked using different threshold values by creating runs.
[13]:
run_response = requests.post(
"{}projects/{}/runs".format(url, project_id),
headers={"Authorization": credentials['result_token']},
json={
'threshold': 0.80,
'name': "Tutorial Run #1"
}
).json()
[14]:
run_id = run_response['run_id']
Run Status¶
[15]:
requests.get(
'{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
headers={"Authorization": credentials['result_token']}
).json()
[15]:
{'current_stage': {'description': 'compute output', 'number': 3},
'stages': 3,
'state': 'completed',
'time_added': '2021-08-30T23:21:37.005204',
'time_completed': '2021-08-30T23:21:39.863173',
'time_started': '2021-08-30T23:21:37.050874'}
Results¶
Now after some delay (depending on the size) we can fetch the results. This can of course be done by directly polling the REST API using requests; however, for simplicity we will just use the watch_run_status function provided in anonlinkclient.rest_client.
Note that server (the bare address) is passed to RestClient rather than url (which includes the /api/v1/ prefix).
[16]:
from anonlinkclient.rest_client import RestClient, format_run_status
rest_client = RestClient(server)
for update in rest_client.watch_run_status(project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (3/3): compute output
[17]:
data = json.loads(rest_client.run_get_result_text(
project_id,
run_id,
credentials['result_token']))
This result is the 1-1 mapping between rows that were more similar than the given threshold.
[18]:
for i in range(10):
((_, a_index), (_, b_index)) = sorted(data['groups'][i])
print("a[{}] maps to b[{}]".format(a_index, b_index))
print("...")
a[870] maps to b[1723]
a[2570] maps to b[3737]
a[1920] maps to b[4157]
a[3090] maps to b[4797]
a[2940] maps to b[3663]
a[2228] maps to b[1095]
a[2623] maps to b[3447]
a[3297] maps to b[4053]
a[3853] maps to b[4795]
a[672] maps to b[2795]
...
In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:
[19]:
len(data['groups'])
[19]:
4842
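Since the two datasets are known to share 5000 entities, a rough recall estimate follows directly. This assumes every returned group is a correct match, which is optimistic but a useful first approximation:

# Rough recall estimate, assuming every returned group is a true match.
found = len(data['groups'])
print("Estimated recall: {} / 5000 = {:.1%}".format(found, found / 5000))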
Cleanup¶
If you want you can delete the run and project from the anonlink-entity-service.
[20]:
requests.delete("{}projects/{}".format(url, project_id), headers={"Authorization": credentials['result_token']})
[20]:
<Response [204]>
Entity Service Permutation Output¶
This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties Alice and Bob have a dataset of personally identifiable information (PII) of several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the Analyst.
The chosen output type is permutations, which consists of two permutations and one mask.
Who learns what?¶
After the linkage has been carried out Alice and Bob will be able to retrieve a permutation
- a reordering of their respective data sets such that shared entities line up.
The Analyst - who creates the linkage project - learns the mask
. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
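To make this output format concrete, here is a tiny made-up example with five records per party and three shared entities; the real permutations and mask are produced by the service later in this tutorial.

# Made-up illustration of the permutations-and-mask output type.
alice_records = ['anna', 'bob', 'carol', 'dave', 'erin']    # Alice's original order
bob_records   = ['carol', 'frank', 'anna', 'gina', 'erin']  # Bob's original order

# Example permutations: the i-th entry is the new position of original row i.
alice_permutation = [0, 3, 1, 4, 2]
bob_permutation   = [1, 3, 0, 4, 2]

def apply_permutation(items, permutation):
    out = [None] * len(items)
    for item, new_position in zip(items, permutation):
        out[new_position] = item
    return out

alice_permuted = apply_permutation(alice_records, alice_permutation)
bob_permuted   = apply_permutation(bob_records, bob_permutation)

# The mask (seen only by the analyst) marks positions where the permuted rows
# refer to the same entity.
mask = [1, 1, 1, 0, 0]
for m, a, b in zip(mask, alice_permuted, bob_permuted):
    print(m, a, b)   # prints 1 for the three shared entities, 0 otherwise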
Steps¶
These steps are usually run by different companies - but for illustration everything is carried out in this one file. The participants providing data are Alice and Bob, and the Analyst acts as the integration authority.
- Check connection to Entity Service
- Data preparation
- Write CSV files with PII
- Create a Linkage Schema
- Create Linkage Project
- Generate CLKs from PII
- Upload the PII
- Create a run
- Retrieve and analyse results
Check Connection¶
If you’re connecting to a custom entity service, change the address here, or set the environment variable SERVER before launching the Jupyter notebook.
[1]:
import os
url = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz
[2]:
!anonlink status --server "{url}"
{"project_count": 846, "rate": 593838, "status": "ok"}
Data preparation¶
Following the anonlink-client command line tutorial we will use a dataset from the recordlinkage
library. We will just write both datasets out to temporary CSV files.
[3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[4]:
dfA, dfB = load_febrl4()
a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)
b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)
dfA.head(3)
[4]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the clkhash schema docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[5]:
schema = NamedTemporaryFile('wt')
[6]:
%%writefile {schema.name}
{
"version": 3,
"clkConfig": {
"l": 1024,
"xor_folds": 0,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "street_number",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "address_1",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "address_2",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "suburb",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "postcode",
"format": {
"type": "integer",
"minimum": 100,
"maximum": 9999
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "state",
"format": {
"type": "string",
"encoding": "utf-8",
"maxLength": 3
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "date_of_birth",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting /tmp/tmplm0udc70
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[7]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)
!anonlink create-project \
--schema "{schema.name}" \
--output "{creds.name}" \
--type "permutations" \
--server "{url}"
creds.seek(0)
import json
with open(creds.name, 'r') as f:
credentials = json.load(f)
project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmp_d0pcu7x
Project created
[7]:
{'project_id': 'd9ffdb48df4cc0acb4f0ab29f56be0873dff50f95ba15ada',
'result_token': '030796ecdf1fdf600f6751ca2bd2aee98c360aafcea56934',
'update_tokens': ['4b138f6464315179e08e3d08e403b1da0be27ab3e478ece4',
'61c9e2ddd1a053c99af4f5c09e224a43723a4dfd9dceafd7']}
Note: the analyst will need to pass on the project_id
(the id of the linkage project) and one of the two update_tokens
to each data provider.
Hash and Upload¶
At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. We need:
- the command line encoding tool provided by anonlink-client,
- the linkage schema from above, and
- a secret which is only known to Alice and Bob (here: my_secret).
Full command line documentation can be found here; see the clkhash documentation for further details on the encoding itself.
[8]:
!anonlink hash "{a_csv.name}" my_secret "{schema.name}" "{a_clks.name}"
!anonlink hash "{b_csv.name}" my_secret "{schema.name}" "{b_clks.name}"
CLK data written to /tmp/tmpgso1v_7b.json
CLK data written to /tmp/tmpamtsmico.json
Now the two clients can upload their data providing the appropriate upload tokens and the project_id. As with all commands in anonlink
we can output help:
[9]:
!anonlink upload --help
Usage: anonlink upload [OPTIONS] CLK_JSON
Upload CLK data to entity matching server.
Given a json file containing hashed clk data as CLK_JSON, upload to the
entity resolution service.
Use "-" to read from stdin.
Options:
--project TEXT Project identifier
--apikey TEXT Authentication API key for the server.
-o, --output FILENAME
--blocks FILENAME Generated blocks JSON file
--server TEXT Server address including protocol. Default
https://anonlink.easd.data61.xyz.
--retry-multiplier INTEGER <milliseconds> If receives a 503 from
server, minimum waiting time before
retrying. Default 100.
--retry-exponential-max INTEGER
<milliseconds> If receives a 503 from
server, maximum time interval between
retries. Default 10000.
--retry-max-time INTEGER <milliseconds> If receives a 503 from
server, retry only within this period.
Default 20000.
-v, --verbose Script is more talkative
--help Show this message and exit.
Alice uploads her data¶
[10]:
with NamedTemporaryFile('wt') as f:
!anonlink upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][0]}" \
--server "{url}" \
--output "{f.name}" \
"{a_clks.name}"
res = json.load(open(f.name))
alice_receipt_token = res['receipt_token']
Every upload gets a receipt token. This token is required to access the results.
Bob uploads his data¶
[11]:
with NamedTemporaryFile('wt') as f:
!anonlink upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][1]}" \
--server "{url}" \
--output "{f.name}" \
"{b_clks.name}"
bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
[12]:
with NamedTemporaryFile('wt') as f:
!anonlink create \
--project="{project_id}" \
--apikey="{credentials['result_token']}" \
--server "{url}" \
--threshold 0.85 \
--output "{f.name}"
run_id = json.load(open(f.name))['run_id']
Results¶
Now after some delay (depending on the size) we can fetch the mask. Results can be fetched with the anonlink
command line tool:
!anonlink results --server "{url}" \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" --output results.txt
However for this tutorial we are going to wait for the run to complete using the anonlinkclient.rest_client
then pull the raw results using the requests
library:
[13]:
import requests
from anonlinkclient.rest_client import RestClient
from anonlinkclient.rest_client import format_run_status
from IPython.display import clear_output
[14]:
rest_client = RestClient(url)
for update in rest_client.watch_run_status(project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (3/3): compute output
[15]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()
[16]:
mask = results['mask']
This mask is a boolean array that specifies where rows of permuted data line up.
[17]:
print(mask[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The number of 1s in the mask will tell us how many matches were found.
[18]:
sum([1 for m in mask if m == 1])
[18]:
4851
We also use requests
to fetch the permutations for each data provider:
[19]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()
Now Alice and Bob both have a new permutation - a new ordering for their data.
[20]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]
[20]:
[1525, 1775, 4695, 1669, 1816, 2778, 1025, 2069, 4358, 4217]
This permutation says the first row of Alice’s data should be moved to position 1525.
[21]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]
[21]:
[2882, 3332, 3654, 300, 1949, 765, 4356, 1049, 2325, 4964]
[22]:
def reorder(items, order):
    """Apply a permutation: move items[i] to position order[i]."""
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder
[23]:
with open(a_csv.name, 'r') as f:
alice_raw = f.readlines()[1:]
alice_reordered = reorder(alice_raw, alice_permutation)
with open(b_csv.name, 'r') as f:
bob_raw = f.readlines()[1:]
bob_reordered = reorder(bob_raw, bob_permutation)
Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don’t.
[24]:
alice_reordered[:10]
[24]:
['rec-1977-org,aidan,morrison,2,broadsmith street,,cloverdale,2787,act,19140202,8821751\n',
'rec-51-org,ella,blunden,37,freda gibson circuit,croyde,paddington,2770,sa,19401209,3593307\n',
'rec-1151-org,courtney,gilbertson,35,maccallum circuit,barley hill,granville,2646,qld,19910105,5257049\n',
'rec-2037-org,freya,mason,10,barnett close,dianella masonic village (cnr cornwell s,clifton springs,7301,wa,19241109,1571902\n',
'rec-903-org,brianna,barisic,1502,haddon street,parish talowahl,launching place,3220,vic,19750703,5367822\n',
'rec-2883-org,jackson,clarke,1,cargelligo street,summerset,bellevue hill,3835,qld,19571105,7943648\n',
'rec-2856-org,chloe,setlhong,4,nunki place,yacklin,cronulla,6164,act,19950628,2829638\n',
'rec-4831-org,caleb,thorpe,4,river street,,granville,2641,nsw,19590118,7916934\n',
'rec-317-org,amber,nicolakopoulos,38,atkinson street,mount patrick,edgewater,2905,sa,19910707,9220881\n',
'rec-2685-org,joel,lodge,200,steinwedel street,kmart p plaza,toowoomba,4012,wa,19710830,2655513\n']
[25]:
bob_reordered[:10]
[25]:
['rec-1977-dup-0,aidan,morrison,2,broadsmit hstreet,,clovedale,2787,act,19140202,8821751\n',
'rec-51-dup-0,adam,,37,freda gibson circuit,cryode,paddington,2770,sa,19401209,3593307\n',
'rec-1151-dup-0,courtney,dabinet,240,feathertopstreet,barley hill,tardun,2646,qld,19910105,5257049\n',
'rec-2037-dup-0,beth,maso,10,barnett close,dianella masonic vlilage (cnr cornwell s,clifton springs,7320,wa,19241109,1571902\n',
'rec-903-dup-0,barisic,brianna,1502,haddon street,parish talowahl,launching place,3220,vic,19750703,5367822\n',
'rec-2883-dup-0,jackon,clareke,1,cargelligo street,summerset,bellevueh ill,3835,qdl,19571105,7943648\n',
'rec-2856-dup-0,chloe,setlhong,4,nunki place,yacklin,cronulla,6614,act,19950628,2829638\n',
'rec-4831-dup-0,cleb,thorpe,4,river street,,granville,2641,nsw,19590118,7916134\n',
'rec-317-dup-0,amber,nicolakopoulos,38,atkinson street,mount patrick,edgewter,2905,sa,19910707,9220881\n',
'rec-2685-dup-0,joe,lodgw,200,steinwedel street,kmart p plaza,toowoomba,4016,wa,19710830,2655513\n']
Accuracy¶
To compute how well the matching went we will use the first index as our reference. For example, rec-1396-org is the original record which has a match in rec-1396-dup-0. To satisfy ourselves we can preview the first few supposed matches:
[26]:
for i, m in enumerate(mask[:10]):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
name_a = ' '.join(entity_a[1:3]).title()
name_b = ' '.join(entity_b[1:3]).title()
print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))
Aidan Morrison (rec-1977-org) =? Aidan Morrison (rec-1977-dup-0)
Ella Blunden (rec-51-org) =? Adam (rec-51-dup-0)
Courtney Gilbertson (rec-1151-org) =? Courtney Dabinet (rec-1151-dup-0)
Freya Mason (rec-2037-org) =? Beth Maso (rec-2037-dup-0)
Brianna Barisic (rec-903-org) =? Barisic Brianna (rec-903-dup-0)
Jackson Clarke (rec-2883-org) =? Jackon Clareke (rec-2883-dup-0)
Chloe Setlhong (rec-2856-org) =? Chloe Setlhong (rec-2856-dup-0)
Caleb Thorpe (rec-4831-org) =? Cleb Thorpe (rec-4831-dup-0)
Amber Nicolakopoulos (rec-317-org) =? Amber Nicolakopoulos (rec-317-dup-0)
Joel Lodge (rec-2685-org) =? Joe Lodgw (rec-2685-dup-0)
Metrics¶
If you know the ground truth — the correct mapping between the two datasets — you can compute performance metrics of the linkage.
Precision: the proportion of found matches that are actual matches (tp/(tp+fp)).
Recall: the proportion of actual matches that we have found (tp/(tp+fn)).
[27]:
tp = 0
fp = 0
for i, m in enumerate(mask):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
tp += 1
else:
fp += 1
#print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])
print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000
print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))
Found 4851 correct matches out of 5000. Incorrectly linked 0 matches.
Precision: 100.0%
Recall: 97.0%
[28]:
# Deleting the project
!anonlink delete-project \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--server="{url}"
Project deleted
Entity Service Similarity Scores Output¶
This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and how to retrieve the results. The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.
The sections are usually run by different participants - but for illustration everything is carried out in this one file. The participants providing data are Alice and Bob, and the analyst is acting as the integration authority.
Who learns what?¶
Alice and Bob will both generate and upload their CLKs.
The analyst - who creates the linkage project - learns the similarity scores. Be aware that this is a lot of information and is subject to frequency attacks.
Steps¶
- Check connection to Entity Service
- Data preparation
- Write CSV files with PII
- Create a Linkage Schema
- Create Linkage Project
- Generate CLKs from PII
- Upload the PII
- Create a run
- Retrieve and analyse results
[1]:
%matplotlib inline
import json
import os
import time
import pandas as pd
import matplotlib.pyplot as plt
import requests
import anonlinkclient.rest_client
from IPython.display import clear_output
Check Connection¶
If you are connecting to a custom entity service, change the address here.
[2]:
url = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz
[3]:
!anonlink status --server "{url}"
{"project_count": 845, "rate": 593838, "status": "ok"}
Data preparation¶
Following the anonlink client command line tutorial we will use a dataset from the recordlinkage
library. We will just write both datasets out to temporary CSV files.
If you are following along yourself you may have to adjust the file names in all the !anonlink
commands.
[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()
a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)
b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)
dfA.head(3)
[5]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns rec_id
and soc_sec_id
for CLK generation.
[6]:
schema = NamedTemporaryFile('wt')
[7]:
%%writefile {schema.name}
{
"version": 3,
"clkConfig": {
"l": 1024,
"xor_folds": 0,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerFeature": 200
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerFeature": 200
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "street_number",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "address_1",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "address_2",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "suburb",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "postcode",
"format": {
"type": "integer",
"minimum": 100,
"maximum": 9999
},
"hashing": {
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "state",
"format": {
"type": "string",
"encoding": "utf-8",
"maxLength": 3
},
"hashing": {
"strategy": {
"bitsPerFeature": 100
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "date_of_birth",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerFeature": 200
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting /tmp/tmprrzuvk7f
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[8]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)
!anonlink create-project \
--schema "{schema.name}" \
--output "{creds.name}" \
--type "similarity_scores" \
--server "{url}"
creds.seek(0)
with open(creds.name, 'r') as f:
credentials = json.load(f)
project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmp1h8qppks
Project created
[8]:
{'project_id': '8f8347b3e97665ebc87f4a1744a2a62e0ae4c999184bc754',
'result_token': 'f6ef6b121f3e5861bceddc36cf1cfdebf9c25a2352937c90',
'update_tokens': ['192e38e1f9e773b945c882799e5490502b9454c711b66e2d',
'f1ca0cbbdc3055c731e898a4ebe6121be9e8d82541fb78fd']}
Note: the analyst will need to pass on the project_id
(the id of the linkage project) and one of the two update_tokens
to each data provider.
Hash and Upload¶
At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. Please see the clkhash documentation for further details on this.
[9]:
!anonlink hash "{a_csv.name}" secret "{schema.name}" "{a_clks.name}"
!anonlink hash "{b_csv.name}" secret "{schema.name}" "{b_clks.name}"
CLK data written to /tmp/tmp63vp_3mj.json
CLK data written to /tmp/tmpr4cqqglj.json
Now the two clients can upload their data providing the appropriate upload tokens.
Alice uploads her data¶
[10]:
with NamedTemporaryFile('wt') as f:
!anonlink upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][0]}" \
--server "{url}" \
--output "{f.name}" \
"{a_clks.name}"
res = json.load(open(f.name))
alice_receipt_token = res['receipt_token']
Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
Bob uploads his data¶
[11]:
with NamedTemporaryFile('wt') as f:
!anonlink upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][1]}" \
--server "{url}" \
--output "{f.name}" \
"{b_clks.name}"
bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
[12]:
with NamedTemporaryFile('wt') as f:
!anonlink create \
--project="{project_id}" \
--apikey="{credentials['result_token']}" \
--server "{url}" \
--threshold 0.75 \
--output "{f.name}"
run_id = json.load(open(f.name))['run_id']
Results¶
Now after some delay (depending on the size) we can fetch the result. This can be done with anonlink
:
!anonlink results --server "{url}" \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" --output results.txt
However for this tutorial we are going to use the anonlinkclient.rest_client
module:
[13]:
from anonlinkclient.rest_client import RestClient
from anonlinkclient.rest_client import format_run_status
rest_client = RestClient(url)
for update in rest_client.watch_run_status(project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (2/2): compute similarity scores
Progress: 100.00%
[14]:
data = json.loads(rest_client.run_get_result_text(
project_id,
run_id,
credentials['result_token']))['similarity_scores']
This result is a large list of tuples recording the similarity of every pair of rows whose similarity is above the given threshold.
[15]:
for row in data[:10]:
print(row)
[[0, 76], [1, 2345], 1.0]
[[0, 83], [1, 3439], 1.0]
[[0, 103], [1, 863], 1.0]
[[0, 154], [1, 2391], 1.0]
[[0, 177], [1, 4247], 1.0]
[[0, 192], [1, 1176], 1.0]
[[0, 270], [1, 4516], 1.0]
[[0, 312], [1, 1253], 1.0]
[[0, 407], [1, 3743], 1.0]
[[0, 670], [1, 3550], 1.0]
Note there can be a lot of similarity scores:
[16]:
len(data)
[16]:
280116
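For further exploration it can be handy to load these score tuples into a pandas DataFrame (an optional sketch; the rest of the tutorial keeps working with the raw list):

# Flatten [[party_a, row_a], [party_b, row_b], score] entries into a table.
scores_df = pd.DataFrame(
    [(a, b, score) for (_, a), (_, b), score in data],
    columns=['row_a', 'row_b', 'score'])
scores_df['score'].describe()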
We will display a sample of these similarity scores in a histogram using matplotlib:
[17]:
plt.style.use('seaborn-deep')
plt.hist([score for _, _, score in data], bins=50)
plt.xlabel('similarity score')
plt.show()

The vast majority of these similarity scores are for non-matches. We expect the matches to have a high similarity score. So let’s zoom into the right side of the distribution.
[18]:
plt.hist([score for _, _, score in data if score >= 0.79], bins=50);
plt.xlabel('similarity score')
plt.show()

Indeed, there is a cluster of scores between 0.9 and 1.0. To better visualize that these are indeed the scores for the matches, we will now extract the true matches from the datasets and group the similarity scores into those for the matches and the non-matches (we can do this because we know the ground truth of the datasets).
[19]:
# rec_id in dfA has the form 'rec-1070-org'. We only want the number. Additionally, as we are
# interested in the position of the records, we create a new index which contains the row numbers.
dfA_ = dfA.rename(lambda x: x[4:-4], axis='index').reset_index()
dfB_ = dfB.rename(lambda x: x[4:-6], axis='index').reset_index()
# now we can merge dfA_ and dfB_ on the record_id.
a = pd.DataFrame({'ida': dfA_.index, 'rec_id': dfA_['rec_id']})
b = pd.DataFrame({'idb': dfB_.index, 'rec_id': dfB_['rec_id']})
dfj = a.merge(b, on='rec_id', how='inner').drop(columns=['rec_id'])
# and build a set of the corresponding row numbers.
true_matches = set((row[0], row[1]) for row in dfj.itertuples(index=False))
[20]:
scores_matches = []
scores_non_matches = []
for (_, a), (_, b), score in data:
if score < 0.79:
continue
if (a, b) in true_matches:
scores_matches.append(score)
else:
scores_non_matches.append(score)
[21]:
plt.hist([scores_matches, scores_non_matches], bins=50, label=['matches', 'non-matches'])
plt.legend(loc='upper right')
plt.xlabel('similarity score')
plt.show()

We can see that the similarity scores for the matches and the ones for the non-matches form two different distributions. With a suitable linkage schema, these two distributions hardly overlap.
When choosing a similarity threshold for solving, the valley between these two distributions is a good starting point. In this example, it is around 0.82. We can see that almost all similarity scores above 0.82 are from matches, thus the solver will produce a linkage result with high precision. However, recall will not be optimal, as there are still some scores from matches below 0.82. By moving the threshold to either side, you can favour either precision or recall.
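One rough way to locate that valley programmatically is to histogram the scores and take the centre of the emptiest interior bin between the two peaks. A small sketch using numpy; the 0.8 lower cut-off and the bin count are ad-hoc choices:

import numpy as np

scores = np.array([score for _, _, score in data])
# Histogram only the upper part of the range, where the two modes sit.
counts, edges = np.histogram(scores[scores >= 0.8], bins=40)
# Centre of the least populated interior bin is a candidate threshold.
valley = np.argmin(counts[1:-1]) + 1
candidate_threshold = (edges[valley] + edges[valley + 1]) / 2
print("candidate threshold: {:.2f}".format(candidate_threshold))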
[22]:
# Deleting the project
!anonlink delete-project --project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--server="{url}"
Project deleted
Entity Service: Multiparty linkage demo¶
This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.
We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.
[1]:
import csv
import itertools
import os
import pandas as pd
import requests
Each party has a dataset of the following form:
[2]:
pd.read_csv('data/dataset-1.csv', index_col='id').head()
[2]:
| id | givenname | surname | dob | gender | city | income | phone number |
|---|---|---|---|---|---|---|---|
| 0 | tara | hilton | 27-08-1941 | male | canberra | 84052.973 | 08 2210 0298 |
| 3 | saJi | vernre | 22-12-2972 | mals | perth | 50104.118 | 02 1090 1906 |
| 7 | sliver | paciorek | NaN | mals | sydney | 31750.893 | NaN |
| 9 | ruby | george | 09-05-1939 | male | sydney | 135099.875 | 07 4698 6255 |
| 10 | eyrinm | campbell | 29-1q-1983 | male | perth | NaN | 08 299y 1535 |
Comparing the beginning of the first dataset to the second, we can see that the quality of the data is not very good. There are a lot of spelling mistakes and missing information. Let’s see how well the entity service does with linking those entities.
[3]:
pd.read_csv('data/dataset-2.csv', index_col='id').head()
[3]:
| id | givenname | surname | dob | gender | city | income | phone number |
|---|---|---|---|---|---|---|---|
| 3 | zali | verner | 22-12-1972 | male | perth | 50104.118 | 02 1090 1906 |
| 4 | samuel | tremellen | 21-12-1923 | male | melbourne | 159316.091 | 03 3605 9336 |
| 5 | amy | lodge | 16-01-1958 | male | canberra | 70170.456 | 07 8286 9372 |
| 7 | oIji | pacioerk | 10-02-1959 | mal3 | sydney | 31750.893 | 04 4220 5949 |
| 10 | erin | kampgell | 29-12-1983 | make | perth | 331476.598 | 08 2996 1445 |
Check the status of the Entity Service¶
Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.
[4]:
SERVER = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())
{'project_count': 839, 'rate': 550410, 'status': 'ok'}
{'anonlink': '0.12.5', 'entityservice': 'v1.13.0-beta2', 'python': '3.8.2'}
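If a script should fail early on an older deployment, a small guard can compare the reported version against the multiparty minimum. A sketch, assuming the packaging library is available and the entityservice version string follows the format shown above:

from packaging.version import Version

entityservice_version = requests.get(f"{PREFIX}/version").json()["entityservice"]
# Multiparty linkage needs entity service 1.11.0 or later.
assert Version(entityservice_version.lstrip("v")) >= Version("1.11.0"), \
    "this deployment does not support multiparty linkage"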
Create a new project¶
We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the groups output type supports multiparty linkage). Retain the project_id, so we can find the project later. Also retain the result_token, so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the update_tokens identify the five data providers and permit them to upload CLKs.
[5]:
project_info = requests.post(
f"{PREFIX}/projects",
json={
"schema": {},
"result_type": "groups",
"number_parties": 5,
"name": "example project"
}
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]
print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)
project_id: 35697a8223f98ed4112488ae3c87e8134d169a364d35e2e7
result_token: 075faf5822cfbe3abe4ce47510a7d3190f518768282f83a7
update_tokens: ['26a30750ba4b7124bc3fd8a36e57bf6211af3fda960c6fb0', '27d17421a4f01c61e4b6ec782486c550da93d350a8d2dbf1', '5c0f98cd55acd48c99bd7f2ddd26af46f6afd31095c7a8a1', 'dcc87296257cb13c9ac3da1e0905c1448a5d51bc9f1fbec3', '9937b6e17abe516e9364cbc88a22593ef78ccdf3d045a907']
Upload the hashed data¶
This is where each party uploads their CLKs into the service. Here, we do the work of all five data providers inside this for loop. In a deployment scenario, each data provider would be uploading their own CLKs using their own update token.
These CLKs are already hashed using clkhash (with this linkage schema), so for each data provider, we just need to upload their corresponding hash file.
[6]:
for i, token in enumerate(update_tokens, start=1):
with open(f"data/clks-{i}.json") as f:
r = requests.post(
f"{PREFIX}/projects/{project_id}/clks",
data=f,
headers={
"Authorization": token,
"content-type": "application/json"
}
)
print(f"Data provider {i}: {r.text}")
Data provider 1: {
"message": "Updated",
"receipt_token": "be6ab1dd0833283ec78ce829f7276b53926588d86c503534"
}
Data provider 2: {
"message": "Updated",
"receipt_token": "74a3f479949d5bb2537c5cab01db9d8d08bf0f7aad991c4d"
}
Data provider 3: {
"message": "Updated",
"receipt_token": "5a88765376836d57e37489e9f205e0d5bb8d9abd6d9cfc7a"
}
Data provider 4: {
"message": "Updated",
"receipt_token": "e005523285d21cfec2927d17050faffb1c249a5b8784f2a4"
}
Data provider 5: {
"message": "Updated",
"receipt_token": "e2c10b8f9f5f6ea90978d9cf0f3b25700fbd222658b704bb"
}
Begin a run¶
The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.
Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.
[7]:
r = requests.post(
f"{PREFIX}/projects/{project_id}/runs",
headers={
"Authorization": result_token
},
json={
"threshold": 0.8
}
)
run_id = r.json()["run_id"]
Check the status¶
Let’s see whether the run has finished (‘state’ is ‘completed’)!
[8]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
headers={
"Authorization": result_token
}
)
r.json()
[8]:
{'current_stage': {'description': 'waiting for CLKs',
'number': 1,
'progress': {'absolute': 5,
'description': 'number of parties already contributed',
'relative': 1.0}},
'stages': 3,
'state': 'created',
'time_added': '2020-04-03T01:20:55.141739+00:00',
'time_started': None}
Now after some delay (depending on the size) we can fetch the results. Waiting for completion can be achieved by directly polling the REST API using requests
, however for simplicity we will just use the watch_run_status
function provided in anonlinkclient.rest_client
.
[9]:
from IPython.display import clear_output
from anonlinkclient.rest_client import RestClient, format_run_status
rest_client = RestClient(SERVER)
for update in rest_client.watch_run_status(project_id, run_id, result_token, timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (3/3): compute output
Retrieve the results¶
We retrieve the results of the linkage. As we selected earlier, the result is a list of groups of records. All records in such a group refer to the same entity; each record consists of two values, the party id and the row index.
The last 20 groups look like this.
[10]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
headers={
"Authorization": result_token
}
)
groups = r.json()
groups['groups'][-20:]
[10]:
[[[0, 781], [4, 780]],
[[2, 3173], [4, 3176], [3, 3163], [0, 3145], [1, 3161]],
[[2, 1617], [3, 1620]],
[[0, 444], [1, 423]],
[[4, 391], [1, 409]],
[[1, 347], [4, 332], [2, 353], [0, 352]],
[[1, 3171], [4, 3185], [0, 3153], [2, 3182], [3, 3172]],
[[2, 1891], [4, 1906], [3, 1889]],
[[0, 2139], [4, 2147]],
[[0, 1206], [4, 1205], [2, 1206]],
[[2, 2726], [4, 2710], [3, 2722]],
[[3, 2040], [4, 2059], [2, 2059]],
[[1, 899], [4, 924], [0, 923]],
[[0, 2482], [1, 2494], [4, 2483], [3, 2488], [2, 2509]],
[[3, 741], [4, 736], [2, 749], [1, 722]],
[[1, 1587], [4, 1638]],
[[1, 1157], [4, 1209]],
[[1, 2027], [3, 740]],
[[1, 1260], [2, 1311], [3, 1281], [4, 1326]],
[[1, 1323], [2, 1362], [4, 1384], [0, 1396]]]
To sanity check, we print their records’ corresponding PII:
[11]:
def load_dataset(i):
dataset = []
with open(f"data/dataset-{i}.csv") as f:
reader = csv.reader(f)
next(reader) # ignore header
for row in reader:
dataset.append(row[1:])
return dataset
datasets = list(map(load_dataset, range(1, 6)))
for group in itertools.islice(groups["groups"][-20:], 20):
for (i, j) in group:
print(i, datasets[i][j])
print()
0 ['kain', 'mason', '09-07-1932', 'male', 'sydnev', '119435.710', '08 8537 7448']
4 ['kaim', 'iiiazon', '09-07-1932', 'male', 'sydnev', '119445.720', '08 8638 7448']
2 ['harriyon', 'micyelmor', '21-04-1971', 'male', 'pert1>', '291889.942', '04 5633 5749']
4 ['harri5on', 'micyelkore', '21-04-1971', '', 'pertb', '291880.942', '04 5633 5749']
3 ['hariso17', 'micelmore', '21-04-1971', 'male', 'pertb', '291880.042', '04 5633 5749']
0 ['harrison', 'michelmore', '21-04-1981', 'malw', 'preth', '291880.942', '04 5643 5749']
1 ['harris0n', 'michelmoer', '21-04-1971', '', '', '291880.942', '04 5633 5749']
2 ['lauren', 'macgowan', '08-01-1960', 'male', '', '43779.493', '03 6533 7075']
3 ['lauren', 'macgowan', '08-01-1950', 'male', 'sydney', '43770.493', '03 6532 7075']
0 ['joshai', 'browne', '30-10-2904', '', 'melbounfe', '522585.205', '03 7150 7587']
1 ['joshua', 'browne', '30-10-2004', 'female', 'melbourne', '522585.205', '03 7150 7587']
4 ['feliciti', 'green', '23-02-1909', 'male', '', '183205.299', '08 4794 9870']
1 ['feljcitv', 'greery', '23-02-1998', 'male', '', '183205.299', '08 4794 9970']
1 ['alannah', 'gully', '15-04-1903', 'make', 'meobourne', '134518.814', '04 5104 4572']
4 ['alana', 'gully', '15-04-1903', 'male', 'melbourne', '134518.814', '04 5104 4582']
2 ['alama', 'gulli', '15-04-1903', 'mald', 'melbourne', '134518.814', '04 5104 5582']
0 ['alsna', 'gullv', '15-04-1903', 'male', '', '134518.814', '04 5103 4582']
1 ['madison', 'crosswell', '11-06-1990', 'male', 'perth', '151347.559', '03 0936 9125']
4 ['madisori', 'crossw4ll', '11-96-1990', 'male', 'perth', '151347.559', '03 0926 9125']
0 ['madispn', 'crossvvell', '11-06-2990', 'male', 'bperth', '151347.559', '03 0936 9125']
2 ['badisoj', 'cross2ell', '11-06-1990', 'malw', 'eprth', '151347.559', '03 0936 9125']
3 ['mad9son', 'crosswell', '11-06-1990', '', '', '151347.559', '03 0937 9125']
2 ['harley', 'krin', '29-05-1967', 'maoe', 'melbourne', '120938.846', '08 8095 4760']
4 ['harley', 'green', '29-05-1967', 'male', 'melbourne', '120937.846', '08 8096 4760']
3 ['harley', 'gfeen', '29-04-1967', 'mslr', 'melbourne', '120937.856', '08 8096 4760']
0 ['nicho1as', 'mak0nw', '06-06-1977', 'male', '', '91255.089', '08 2404 9176']
4 ['nicol', 'maano', '06-06-1977', '', '', '91155.089', '08 2404 9176']
0 ['james', 'lavender', '08-02-2000', 'male', 'canberra', '88844.369', '02 5862 9827']
4 ['jaiiies', 'lvender', '08-02-2900', 'male', 'canberra', '88844.369', '02 5862 982u']
2 ['jimmy', 'lavendre', '08-02-2000', 'malw', 'canberra', '88844.369', '02 5863 9827']
2 ['ara', 'hite', '01-05-1994', 'femzle', 'canberra', '29293.820', '03 0641 9597']
4 ['tara', 'white', '01-05-1984', 'female', 'canberra', '29293.820', '03 0641 9597']
3 ['tara', 'white', '01-05-1974', 'femzle', '', '29293.820', '03 0641 0697']
3 ['spericer', 'pize', '03-04-1983', 'male', 'canberra', '', '03 5691 5970']
4 ['spencer', 'paize', '03-04-1983', 'male', 'canberra', '56328.357', '03 6691 5970']
2 ['spenfer', 'pai2e', '03-04-1893', 'male', 'can1>erra', '56328.357', '03 6691 5970']
1 ['isbaella', 'darby-cocks', '14-09-1921', 'male', 'pergh', '87456.184', '03 0678 5513']
4 ['isabella', 'darby-cocks', '14-09-1921', 'male', 'perth', '87456.194', '03 0679 5513']
0 ['isabeloa', 'darby-cocks', '14-09-2921', 'make', 'perth', '87456.194', '04 0678 6513']
0 ['jarrod', 'brone', '09-08-1967', 'mal3', 'perth', '1075t6.775', '08 2829 1110']
1 ['jarrod', 'browne', '09-08-1967', 'male', 'perth', '107556.775', '08 2820 1110']
4 ['jarrod', 'brownb', '09-08-1967', 'mqle', 'pertb', '107556.775', '08 2820 2110']
3 ['jarr0d', 'brown', '09-08-1967', 'male', '', '107546.775', '08 2820 1110']
2 ['jarr0d', 'borwne', '09-08-1067', 'male', 'pertb', '107556.775', '08 2820 1110']
3 ['marko', 'matthews', '11-04-1992', 'male', 'melbourne', '106467.902', '03 1460 7673']
4 ['marko', 'matthews', '11-0r-1992', 'maoe', 'melhourne', '106467.992', '03 1460 7673']
2 ['marko', 'matthevvs', '11-94-1992', 'mals', 'melbourne', '', '03 1460 7673']
1 ['makro', 'matthews', '11-04-1992', '', 'emlbourne', '106467.903', '03 1460 7673']
1 ['nkiki', 'spers', '10-02-2007', 'fenale', '', '156639.106', '07 9447 1767']
4 ['nikkui', 'pezes', '10-02-20p7', 'female', '', '156639.106', '07 9447 1767']
1 ['roby', 'felepa', '25-19-1959', 'male', 'aclonerra', '85843.631', '07 5804 7920']
4 ['robert', 'felepa', '25-10-1959', 'male', 'can1>erra', '85842.631', '07 5804 7929']
1 ['shai', 'dixon', '24-09-1979', 'female', 'melbourne', '609473.955', '08 4533 9404']
3 ['mia', 'dixon', '24-09-1979', 'female', 'melbourne', '1198037.556', '08 3072 7335']
1 ['livia', 'riaj', '13-03-1907', 'malw', 'melbovrne', '73305.107', '07 3846 2530']
2 ['livia', 'ryank', '13-03-1907', 'malw', 'melbuorne', '73305.107', '07 3946 2630']
3 ['ltvia', 'ryan', '13-03-1907', 'maoe', 'melbourne', '73305.197', '07 3046 2530']
4 ['livia', 'ryan', '13-03-1907', 'male', 'melbourne', '73305.107', '07 3946 2530']
1 ['brock', 'budge', '27-09-1960', 'male', 'perth', '209428.166', '02 5106 4056']
2 ['brocck', 'bud9e', '27-09-1960', 'male', 'pertb', '208428.166', '02 5106 4056']
4 ['brock', 'budge', '27-09-1970', 'male', '', '209428.167', '02 5206 4056']
0 ['brock', 'bwudge', '27-09-2860', '', 'perth', '209428.166', '02 5106 3056']
Despite the high amount of noise in the data, the Anonlink Entity Service was able to produce a fairly accurate matching. Note, however, that the group pairing 'shai dixon' with 'mia dixon' above is most likely not an actual match.
We may be able to improve on this result by fine-tuning the hashing schema or by changing the threshold.
Delete the project¶
[12]:
r = requests.delete(
f"{PREFIX}/projects/{project_id}",
headers={
"Authorization": result_token
}
)
print(r.status_code)
204
Usage¶
You can download the tutorials from github.
The dependencies are listed in tutorial-requirements.txt
.
The code is often evolving and may include some breaking changes not yet deployed in our testing deployment (at the
URL https://anonlink.easd.data61.xyz). So to run the tutorials, you can either:
- use the tutorials from the master branch of this repository, which will work with the currently deployed testing service,
- or build and deploy the service from the same branch as the tutorials you would like to run, providing its URL to the tutorials via the environment variable SERVER (e.g. SERVER=http://0.0.0.0:8851 if deployed locally).
Other use-cases are not supported and may fail for non-obvious reasons.
External Tutorials¶
The clkhash
library includes a tutorial of carrying out record linkage on perturbed data.
http://clkhash.readthedocs.io/en/latest/tutorial_cli.html
Concepts¶
Cryptographic Longterm Key¶
A Cryptographic Longterm Key is the name given to a Bloom filter used as a privacy preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity - meaning two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.
CLKs are created independently of the entity service, following a keyed hashing process.
A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service's API.
Note
The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.
Bloom Filter Format¶
A Bloom filter is simply an encoding of PII as a bitarray.
This can easily be represented as bytes (each being an 8 bit number between 0 and 255). We serialize by base64 encoding the raw bytes of the bit array.
An example with a 64 bit filter:
# bloom filters binary value
'0100110111010000101111011111011111011000110010101010010010100110'
# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]
# which gets base64 encoded to
'TdC999jKpKY=\n'
As with standard Base64 encodings, a newline is introduced every 76 characters.
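The example above can be reproduced with a few lines of standard-library Python (a minimal sketch, not the clkhash implementation):
import base64

bits = '0100110111010000101111011111011111011000110010101010010010100110'

# pack the bit string into bytes, 8 bits per byte, most significant bit first
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
assert list(raw) == [77, 208, 189, 247, 216, 202, 164, 166]

# base64.encodebytes inserts a newline every 76 characters and at the end
print(base64.encodebytes(raw).decode())  # 'TdC999jKpKY=\n'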
Linkage Schema¶
It is important that participating organisations agree on how personally identifiable information is processed to create the CLKs. We call the configuration for creating CLKs a linkage schema. The organisations have to agree on a schema to ensure their CLKs are comparable.
The linkage schema is documented in clkhash, our reference implementation written in Python.
Note
Due to the one way nature of hashing, the entity service can’t determine whether the linkage schema was followed when clients generated CLKs.
Comparing Cryptographic Longterm Keys¶
The similarity metric used is the Sørensen–Dice index - although this may become a configurable option in the future.
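As an illustrative sketch (the service itself uses anonlink's optimised implementation), the Sørensen–Dice index of two bit arrays is twice the number of common set bits divided by the total number of set bits:
def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sørensen–Dice index of two equal-length bit arrays given as bytes."""
    ones_a = sum(bin(b).count('1') for b in clk_a)
    ones_b = sum(bin(b).count('1') for b in clk_b)
    common = sum(bin(x & y).count('1') for x, y in zip(clk_a, clk_b))
    return 2 * common / (ones_a + ones_b)

clk = bytes([77, 208, 189, 247, 216, 202, 164, 166])
print(dice_coefficient(clk, clk))  # identical CLKs score 1.0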
Blocking¶
Blocking is a technique that makes large-scale record linkage practical. Blocking partitions datasets into groups, called blocks, and only the records in corresponding blocks are compared. This can massively reduce the total number of comparisons needed to find matching records.
In the Anonlink Entity Service blocking is optional, and is carried out by the client e.g., using the blocklib library. See the blocklib documentation for more information including tutorials.
Output Types¶
The Entity Service supports different result types which affect what output is produced, and who may see the output.
Warning
The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.
Similarity Score¶
Similarity scores are computed between all CLKs from each organisation - the scores above a given threshold are returned. This output type is currently the only way to work with 1-to-many relationships.
The result_token
(generated when creating the mapping) is required. The result_type
should
be set to "similarity_scores"
.
Results are a JSON array of JSON arrays of three elements:
[
[[party_id_0, row_index_0], [party_id_1, row_index_1], score],
...
]
Where the index values will be the 0-based dataset index and row index from the uploaded CLKs, and the score will be a number between the provided threshold and 1.0.
A score of 1.0
means the CLKs were identical. Threshold values are usually between
0.5
and 1.0
.
Note
The maximum number of results returned is the product of the two data set lengths.
For example:
Comparing two data sets each containing 1 million records with a threshold of 0.0
will return 1 trillion results (1e+12
).
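As a sketch of how a client might retrieve and iterate over this output with requests: the endpoint is the runs result endpoint used in the tutorial above, the placeholder values are assumptions to replace with your own project details, and the exact JSON envelope may differ between versions (this assumes the documented form, a bare JSON array of triples).
import requests

PREFIX = "https://anonlink.easd.data61.xyz/api/v1"  # example server REST API prefix
project_id = "..."                                   # returned when creating the project
run_id = "..."                                       # returned when creating the run
result_token = "..."                                 # from the project credentials

r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={"Authorization": result_token},
)
# each element is [[party_id_0, row_index_0], [party_id_1, row_index_1], score]
for (party_a, row_a), (party_b, row_b), score in r.json():
    print(f"dataset {party_a} row {row_a} ~ dataset {party_b} row {row_b}: {score:.3f}")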
Groups Result¶
The groups result has been created for multi-party linkage, and will replace the direct mapping result for two parties as it contains the same information in a different format.
The result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party index and the row index:
[
[
[party_id, row_index],
...
],
...
]
The result_token
(generated when creating the mapping) is required to retrieve the results. The
result_type
should be set to "groups"
.
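For a two-party project, the groups output can be flattened into a simple row-to-row mapping. The snippet below is an illustrative sketch using made-up groups:
groups = [
    [[0, 781], [1, 780]],
    [[0, 444], [1, 423]],
]

mapping = {}
for group in groups:
    rows = dict(group)          # party index -> row index
    if 0 in rows and 1 in rows:
        mapping[rows[0]] = rows[1]

print(mapping)  # {781: 780, 444: 423}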
Permutation and Mask¶
This protocol creates a random reordering of the rows for both organizations, and a mask revealing where the reordered rows line up.
Accessing the mask requires the result_token
, and accessing the permutation requires a
receipt-token
(provided to each organization when they upload data).
Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
The result_type
should be set to "permutations"
.
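The toy sketch below illustrates how a permutation and mask line up two datasets. The values are made up (not service output), and for illustration we treat permutation[i] as the original row index placed at position i:
rows_a = ['a0', 'a1', 'a2', 'a3']   # larger dataset, known only to party A
rows_b = ['b0', 'b1', 'b2']         # smaller dataset, known only to party B

permutation_a = [2, 0, 3, 1]        # given privately to party A
permutation_b = [1, 2, 0]           # given privately to party B
mask = [1, 0, 1]                    # length of the smaller dataset

reordered_a = [rows_a[i] for i in permutation_a]
reordered_b = [rows_b[i] for i in permutation_b]

# mask[i] == 1 means the records aligned at position i refer to the same entity
for pos, bit in enumerate(mask):
    if bit:
        print(f"position {pos}: A's {reordered_a[pos]} lines up with B's {reordered_b[pos]}")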
Security¶
The service isn’t given any personally identifying information in raw form - rather clients must locally compute a CLK which is a hashed version of the data to be linked.
Considerations for each output type¶
Groups¶
The default output of the Entity Service comprises a list of edges - connections between rows in the different datasets. This assumes at most a 1-1 correspondence - each entity will only be present in zero or one edge.
This output is only available to the client who created the mapping, but it is worth highlighting that it does (by design) leak information about the intersection of the sets of entities.
Knowledge about set intersection: This output contains information about which particular entities are shared, and which are not. Potentially, knowing the overlap between the organizations is disclosive. This is mitigated by using unique authorization codes generated for each mapping, which are required to retrieve the results.
Row indices exposed: The output directly exposes the row indices provided to the service, which, if not randomized, may be disclosive. For example, entities simply exported from a database might be ordered by age, patient admittance date, salary band, etc.
Similarity Score¶
All calculated similarities (above a given threshold) between entities are returned. This output comprises a list of weighted edges - similarity between rows in dataset A to rows in dataset B. This is a many to many relationship where entities can appear in multiple edges.
Recovery from the distance measurements: This output type includes the plaintext distance measurements between entities. This additional information can be used to fingerprint individual entities based on their ordered similarity scores; in combination with public information this can lead to recovery of identity. This attack is described in section 3 of Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics' Privacy-Preserving Record Linkage by Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague.
In order to prevent this attack it is important not to provide the similarity table to untrusted parties.
Permutation and Mask¶
This output type involves creating a random reordering of the entities for both organizations; and creating a binary mask vector revealing where the reordered rows line up. This output is designed for use in multi-party computation algorithms.
This mitigates the Knowledge about set intersection problem from the direct mapping output - assuming the mask is not made available to the data providers.
Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
Authentication / Authorization¶
The entity service does not yet support authentication; this is planned for a future version.
All sensitive data is protected by token-based authorization. That is, you need to provide the correct token to access different resources. A token is a unique random 192 bit string.
There are three different types of tokens:
- update_token: required to upload a party’s CLKs.
- result_token: required to access the result of the entity resolution process. This is, depending on the output type, either similarity scores, a group output, or a mask.
- receipt-token: this token is returned to either party after uploading their respective CLKs. With this receipt-token they can then access their respective permutations, if the output type of the mapping is set to permutation and mask.
Important
These tokens are the only artifacts that protect the sensitive data. Therefore it is paramount to make sure that only authorized parties have access to these tokens!
Attack Vectors¶
The following attack vectors need to be considered for all output types.
Stealing/Leaking uploaded CLKs
The uploaded CLKs for one organization could be leaked to the partner organization, who possesses the HMAC secret, breaking semantic security. The entity service doesn't expose an API that allows users to access any CLKs; the object store (MinIO or S3) and the database (PostgreSQL) are configured to not allow public access.
Deployment¶
Local Deployment¶
Dependencies¶
Docker and docker-compose
Build¶
From the project folder, run:
./tools/build.sh
This will create the docker images tagged with latest, which are used by docker-compose.
Run¶
Run docker compose:
docker-compose -p anonlink -f tools/docker-compose.yml up
This will start the following containers:
- nginx frontend
- gunicorn/flask backend
- celery backend worker
- postgres database
- redis job queue
- minio object store
- jaeger opentracing
A temporary container that initializes the database will also be created and soon exit.
The REST api for the service is exposed on port 8851
of the nginx container, which docker
will map to a high numbered port on your host.
The address of the REST API endpoint can be found with:
docker-compose -p anonlink -f tools/docker-compose.yml port nginx 8851
For example to GET the service status:
$ export ENTITY_SERVICE=`docker-compose -p anonlink -f tools/docker-compose.yml port nginx 8851`
$ curl $ENTITY_SERVICE/api/v1/status
{
"status": "ok",
"number_mappings": 0,
"rate": 1
}
The service can be taken down by hitting CTRL+C. This doesn’t clear the DB volumes, which will persist and conflict with the next call to docker-compose … up unless they are removed. Removing these volumes is easy, just run:
docker-compose -p anonlink -f tools/docker-compose.yml down -v
in between calls to docker-compose … up.
Monitoring¶
A celery monitor tool flower is also part of the docker-compose file - this graphical interface allows administration and monitoring of the celery tasks and workers. Access this via the monitor container.
Testing with docker-compose¶
An additional docker-compose config file can be found in ./tools/ci.yml; this can be added in to run along with the rest of the service:
docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml up -d
docker logs -f n1estest_tests_1
docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml down
Docker Compose Tips¶
A collection of development tips.
Volumes¶
You might need to destroy the docker volumes used for the object store and the postgres database:
docker-compose -f tools/docker-compose.yml rm -s -v [-p <project-name>]
Restart one service¶
Docker compose can modify an existing deployment; this can be particularly effective when you modify and rebuild the backend and want to restart it without changing anything else:
docker-compose -f tools/docker-compose.yml up -d --no-deps es_backend
Scaling¶
You can run additional worker containers by scaling with docker-compose:
docker-compose -f tools/docker-compose.yml scale es_worker=2
Mix and match docker compose¶
During development you can run the redis and database containers with docker-compose, and directly run the celery and flask applications with Python.
docker-compose -f tools/docker-compose.yml run es_db
docker-compose -f tools/docker-compose.yml run es_redis
Production deployment¶
Production deployment assumes a Kubernetes cluster.
The entity service has been deployed to kubernetes clusters on Azure, GCE, minikube, and AWS. The system has been designed to scale across multiple nodes and handle node failure without data loss.
Overview¶
At a high level the main custom components are:
- REST API Server - a gunicorn/flask backend web service hosting the REST api.
- PPRL Worker instances - using celery for task scheduling.
The components that are used in support are:
- Postgresql database holds all match metadata
- Redis is used for the celery job queue and as a cache
- An object store (e.g. AWS S3, or Minio) stores the raw CLKs, intermediate files, and results.
- nginx provides upload buffering and request rate limiting.
- An ingress controller (e.g. nginx-ingress/traefik) provides TLS termination.
The rest of this document goes into how to deploy in a production setting.
Requirements¶
A Kubernetes Cluster is required - creating and setting up a Kubernetes cluster is out of scope for this documentation.
Hardware requirements
Recommended AWS worker instance type
is r3.4xlarge
- spot instances are fine as we handle node failure. The
number of nodes depends on the size of the expected jobs, as well as the
memory on each node. For testing we recommend starting with at least two nodes, with each
node having at least 8 GiB of memory and 2 vCPUs.
Software to interact with the cluster
You will need to install the kubectl command line tool, and helm.
Helm¶
The Anonlink Entity Service has been packaged using helm; to install helm, follow the helm installation documentation.
Ingress Controller¶
For external API access the deployment optionally includes an Ingress
resource.
This can be enabled with the api.ingress.enabled
setting.
Note the ingress requires configuration specifically for the
ingress controller
installed on the Kubernetes cluster, usually via annotations which can be provided in the
api.ingress.annotations
setting.
Note
If clients are pushing or pulling large amounts of data (e.g. large encodings or many raw similarity scores), the ingress may need to be configured with a large buffer and long timeouts. Using the NGINX ingress controller we found the following ingress annotations to be a good starting point:
ingress.kubernetes.io/proxy-body-size: 4096m
nginx.ingress.kubernetes.io/proxy-body-size: 4096m
nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
Deploy the system¶
Helm can be used to deploy the system to a kubernetes cluster. There are two options: if you would like to deploy from source, run the helm dependency update command from your deployment/entity-service directory; otherwise (the recommended approach) add the Data61 helm chart repository:
helm repo add data61 https://data61.github.io/charts
helm repo update
Configuring the deployment¶
Create a new blank yaml file to hold your custom deployment settings, e.g. my-deployment.yaml.
Carefully read through the chart’s default values.yaml
file and override any values in your deployment
configuration file.
At a minimum, consider setting up an ingress by changing api.ingress, changing the number of workers in workers.replicaCount (and workers.highmemory.replicaCount), checking you're happy with the workers' cpu and memory limits in workers.resources, and finally setting the credentials (a sketch of such a file follows the list):
- global.postgresql.postgresqlPassword
- redis.password (and redis-ha.redisPassword if provisioning redis)
- minio.accessKey and minio.secretKey
- anonlink.objectstore.uploadAccessKey and anonlink.objectstore.uploadSecretKey
Configuration of the celery workers¶
Celery is highly configurable, and wrong configurations can lead to a number of runtime issues, such as exhausting the number of connections the database can handle, or thread exhaustion blocking the underlying machine.
We therefore recommend some sets of attributes, but note that every deployment is different and may require its own tweaking.
Celery is not always the best at sharing resources; we recommend deployments specify a limit of CPU resources each worker can use, and correspondingly set the concurrency of the workers to this limit. More information is
provided directly in the values.yaml
file.
Before Installation¶
Before installation, it is best practice to run some checks that helm provides. The first one is to execute:
helm lint -f extraValues.yaml
Note that it uses all the default deployment values provided in the values.yaml file, and overwrites them with the given values in extraValues.yaml. It should return some information if some values are missing, e.g.:
2019/09/11 15:13:10 [INFO] Missing required value: global.postgresql.postgresqlPassword must be provided.
2019/09/11 15:13:10 [INFO] Missing required value: minio.accessKey must be provided.
2019/09/11 15:13:10 [INFO] Missing required value: minio.secretKey must be provided.
==> Linting .
Lint OK
1 chart(s) linted, no failures
- Notes:
  - the lint command does not exit with a non-zero exit code, and our templates are currently failing if linting with the option --strict.
  - if the Charts folder is not deleted, the linting may throw some errors from the dependent charts without a clear description when a value is missing. For example, if the redis password is missing, the following error is returned from the redis-ha template, because the method b64enc requires a non-empty string and the template does not first check whether the value is empty:
==> Linting . [ERROR] templates/: render error in "entity-service/charts/redis-ha/templates/redis-auth-secret.yaml": template: entity-service/charts/redis-ha/templates/redis-auth-secret.yaml:10:35: executing "entity-service/charts/redis-ha/templates/redis-auth-secret.yaml" at <b64enc>: invalid value; expected string Error: 1 chart(s) linted, 1 chart(s) failed
Then, it is advised to use the --dry-run and --debug options before deploying with helm, which will return the yaml descriptions of all the resources.
Installation¶
To install the whole system assuming you have a configuration file my-deployment.yaml
in the current
directory:
$ helm upgrade --install anonlink data61/entity-service -f my-deployment.yaml
This can take several minutes the first time you deploy to a new cluster.
Run integration tests and an end to end test¶
Integration tests can be carried out in the same Kubernetes cluster by creating an integration test Job
.
Create an integration-test-job.yaml
file with the following content:
apiVersion: batch/v1
kind: Job
metadata:
name: anonlinkintegrationtest
labels:
jobgroup: integration-test
spec:
completions: 1
parallelism: 1
template:
metadata:
labels:
jobgroup: integration-test
spec:
restartPolicy: Never
containers:
- name: entitytester
image: data61/anonlink-app:v1.12.0
imagePullPolicy: Always
env:
- name: SERVER
value: https://anonlink.easd.data61.xyz
command:
- "python"
- "-m"
- "pytest"
- "entityservice/tests"
- "-x"
Update the SERVER
url then create the new job on the cluster with:
kubectl create -f integration-test-job.yaml
Upgrade Deployment with Helm¶
Updating a running chart is usually straightforward. For example if the release is called
anonlink
in namespace testing
execute the following to increase the number of workers
to 20:
helm upgrade anonlink entity-service --namespace=testing --set workers.replicaCount="20"
However, note you may wish to instead keep all configurable values in a yaml
file and track
the changes in version control.
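For example, the equivalent upgrade driven by a tracked values file (assuming the worker count has been changed in the my-deployment.yaml used earlier) could be:
helm upgrade anonlink data61/entity-service --namespace=testing -f my-deployment.yaml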
Minimal Deployment¶
To run with minikube for local testing we have provided a minimal-values.yaml
configuration file that will
set small resource limits. Install the minimal system with:
helm install entity-service --name="mini-es" --values entity-service/minimal-values.yaml
Database Deployment Options¶
At deployment time you must set the postgresql password in global.postgresql.postgresqlPassword
.
You can decide to deploy a postgres database along with the anonlink entity service or instead use an existing
database. To configure a deployment to use an external postgres database, simply set provision.postgresql
to false
, set the database server in postgresql.nameOverride
, and add credentials to the
global.postgresql
section.
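For example, a my-deployment.yaml fragment for an external database might look like this sketch; the keys are those described above, the values are placeholders, and any additional credential keys should be checked against the chart's values.yaml:
provision:
  postgresql: false

postgresql:
  nameOverride: "my-external-postgres.example.com"   # placeholder database server

global:
  postgresql:
    postgresqlPassword: "change-me"
    # further credential keys (e.g. username/database) - see values.yaml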
Object Store Deployment Options¶
At deployment time you can decide to deploy MinIO or instead use an existing object store service compatible with AWS S3.
Note that there is a trade-off between using a local deployment of MinIO and AWS S3. In our AWS-based experimentation, MinIO is noticeably faster, but more expensive and less reliable than AWS S3; your own mileage may vary.
To configure a deployment to use an external object store, set provision.minio
to false
and add
appropriate connection configuration in the minio
section. For example to use AWS S3 simply provide your access
credentials (and disable provisioning minio):
helm install entity-service --name="es-s3" --set provision.minio=false --set minio.accessKey=XXX --set minio.secretKey=YYY --set minio.bucket=<bucket>
Object Store for client use¶
Optionally, clients can upload and download data via an object store instead of via the REST API. This requires external access to an object store, and the service must have authorization to create temporary restricted credentials.
The following settings control this optional feature:
Environment Variable | Helm Config
---|---
UPLOAD_OBJECT_STORE_ENABLED | anonlink.objectstore.uploadEnabled
UPLOAD_OBJECT_STORE_SERVER | anonlink.objectstore.uploadServer
UPLOAD_OBJECT_STORE_SECURE | anonlink.objectstore.uploadSecure
UPLOAD_OBJECT_STORE_BUCKET | anonlink.objectstore.uploadBucket.name
UPLOAD_OBJECT_STORE_ACCESS_KEY | anonlink.objectstore.uploadAccessKey
UPLOAD_OBJECT_STORE_SECRET_KEY | anonlink.objectstore.uploadSecretKey
UPLOAD_OBJECT_STORE_STS_DURATION | - (default 43200 seconds)
DOWNLOAD_OBJECT_STORE_SERVER | anonlink.objectstore.downloadServer
DOWNLOAD_OBJECT_STORE_SECURE | anonlink.objectstore.downloadSecure
DOWNLOAD_OBJECT_STORE_ACCESS_KEY | anonlink.objectstore.downloadAccessKey
DOWNLOAD_OBJECT_STORE_SECRET_KEY | anonlink.objectstore.downloadSecretKey
DOWNLOAD_OBJECT_STORE_STS_DURATION | - (default 43200 seconds)
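As a sketch, enabling the upload feature via Helm values using the keys from the table above (all values here are placeholders):
anonlink:
  objectstore:
    uploadEnabled: true
    uploadSecure: true
    uploadServer: "s3.example.com"        # placeholder external object store
    uploadBucket:
      name: "anonlink-uploads"            # placeholder bucket name
    uploadAccessKey: "change-me"
    uploadSecretKey: "change-me"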
Note
If the uploadServer and downloadServer configuration values are not provided, the deployment will assume that MinIO has been deployed along with the service and fall back to using the MinIO ingress host (if present), otherwise the cluster-internal address of the deployed MinIO service. This last fallback is in place simply to make e2e testing easier.
Redis Deployment Options¶
At deployment time you can decide to provision redis using our chart, or instead use an existing redis installation or managed service. The provisioned redis is a highly available 3 node redis cluster using the redis-ha helm chart.
Directly connecting to redis and discovery via the sentinel protocol are both supported. When using the sentinel protocol for redis discovery, read-only requests are dispatched to redis replicas.
Carefully read the comments in the redis
section of the default values.yaml
file.
To use a separate install of redis using the server shared-redis-ha-redis-ha.default.svc.cluster.local
:
helm install entity-service --name="es-shared-redis" \
--set provision.redis=false \
--set redis.server=shared-redis-ha-redis-ha.default.svc.cluster.local \
--set redis.use_sentinel=true
Note these settings can also be provided via a values.yaml
deployment configuration file.
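For example, the values file fragment equivalent to the helm command above would be (a sketch):
provision:
  redis: false

redis:
  server: shared-redis-ha-redis-ha.default.svc.cluster.local
  use_sentinel: true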
Uninstalling¶
To uninstall a release called es
in the default namespace:
helm del es
Or if the anonlink-entity-service has been installed into its own namespace you can simply delete
the whole namespace with kubectl
:
kubectl delete namespace miniestest
Deployment Risks¶
The purpose of this document is to record known deployment risks of the entity service and our mitigations. It references the OWASP 2017 Top 10 security risks - https://www.owasp.org/index.php/Top_10-2017_Top_10
Risks¶
Unauthorized user accesses results¶
A6 - Security misconfiguration.
A2 - Broken authentication.
A5 - Broken access control.
Authorized user attacks the system¶
A10 - Insufficient Logging & Monitoring.
A3 - Sensitive Data Exposure.
An admin can access the raw CLKs uploaded by both parties. However, a standard user cannot.
User coerces N1 to execute attacking code¶
Insecure deserialization. Compromised shared host.
An underlying component has a vulnerability¶
Dependencies including anonlink could have vulnerabilities.
Development¶
Changelog¶
Next Version¶
Custom Celery Routes
Added support to customise the celery routing with environment variable CELERY_ROUTES.
Version 1.15.1¶
Dependency updates
Implemented in #687
Delete upload files on object store after ingestion
If a data provider uploads its data via the object store, we now clean up afterwards.
Implemented in #686
Fixed Record Linkage API tutorial
Adjusted to changes in the clkhash library.
Implemented in #684
Delete encodings from database at project deletion
Encodings will be deleted at project deletion, but only for projects created with this version or higher.
Implemented in #683
Version 1.15.0¶
Highlights¶
Similarity scores are deduplicated
Previously, candidate pairs that appear in more than one block would produce more than one similarity score. The iterator that processes similarity scores now de-duplicates them before storing.
Implemented in: #660
Provided Block Identifiers are now hashed
We now hash the user provided block identifier before storing in DB.
Implemented in: #633
Failed runs return message indicating the failure reason
The run status for a failed run now includes a message attribute with information on what went wrong.
Implemented in: #624
Other changes¶
The run status endpoint now includes total_number_of_comparisons for completed runs. Implemented in: #651
As usual lots of version upgrades - now using the latest stable redis and postgresql.
Version 1.14.0¶
Highlights¶
API now supports directly downloading similarity scores from the internal object store
If the request includes the header RETURN-OBJECT-STORE-ADDRESS, the response will be a small json payload with temporary download credentials to pull the binary similarity scores directly from the object store. The json object has credentials and object keys:
{
"credentials": {
"AccessKeyId": "",
"SecretAccessKey": "",
"SessionToken": "",
"Expiration": "<ISO 8601 datetime string>"
},
"object": {
"endpoint": "<config.DOWNLOAD_OBJECT_STORE_SERVER>",
"secure": "<config.DOWNLOAD_OBJECT_STORE_SECURE>",
"bucket": "bucket_name",
"path": "path"
}
}
The binary file is serialized using anonlink.serialization; you can convert the stream into Python types with:
from minio import Minio
import anonlink.serialization

# file_info: the 'object' section of the json response above
mc = Minio(file_info['endpoint'], ...)  # pass the temporary credentials here
candidate_pair_stream = mc.get_object(file_info['bucket'], file_info['path'])
sims, (dset_is0, dset_is1), (rec_is0, rec_is1) = anonlink.serialization.load_candidate_pairs(candidate_pair_stream)
The following settings control the optional feature of using an external object store:
Environment Variable | Helm Config
---|---
DOWNLOAD_OBJECT_STORE_SERVER | anonlink.objectstore.downloadServer
DOWNLOAD_OBJECT_STORE_SECURE | anonlink.objectstore.downloadSecure
DOWNLOAD_OBJECT_STORE_ACCESS_KEY | anonlink.objectstore.downloadAccessKey
DOWNLOAD_OBJECT_STORE_SECRET_KEY | anonlink.objectstore.downloadSecretKey
DOWNLOAD_OBJECT_STORE_STS_DURATION | - (default 43200 seconds)
Implemented in: #594, #612, #613, #614
Service now uses sqlalchemy for database migrations
SQLAlchemy models have been added for all database tables; the initial database setup now uses alembic for migrations. The database and object store init scripts can now be run multiple times without causing issues.
Implemented in #603, #611
New configurable limits on maximum number of candidate pairs
Protects the service from running out of memory due to excessive numbers of candidate pairs being processed. An added side effect is the service now keeps track of the number of candidate pairs in a run (as well as the number of comparisons).
The limits are controlled by the following two environment variables, shown with their initial default values:
SOLVER_MAX_CANDIDATE_PAIRS="100_000_000"
SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS="500_000_000"
If a run exceeds these limits, the run is put into an error state and further processing is abandoned to protect the service from running out of memory.
Implemented in #595, #605
Other changes¶
- Ingress now supports a user supplied path. We no longer assume an nginx ingress controller. #587
- Migrate off deprecated k8s chart repos #596, #588
- Helm chart now uses standard recommended Kubernetes labels. #616
- Fix an issue with case sensitivity in object store metadata #590
- If the object store bucket doesn’t exist it is now automatically created. #577
- Ignore but log failures to delete from object store #576
- Many dependency updates #578, #579, #580, #582, #581, #583, #596, #604, #609, #615
- Update the base image, all base dependencies and migrated from minio-py v5 to v7 #601, #608, #610
- CI e2e tests on Kubernetes will now correctly fail if the tests don’t run. #618
- Add optional pod annotations to init jobs. #619
Version 1.13.0¶
- extended tutorial to include upload to object store #573
- chart update #572
Version 1.13.0-beta3¶
- Improved performance for blocks of small size #563
- fix a problem with the upload to the external object store #564
- updated documentation #567, #569
Version 1.13.0-beta2¶
Adds support for users to supply blocking information along with encodings. Data can now be uploaded to an object store and pulled by the Anonlink Entity Service instead of uploaded via the REST API. This release includes substantial internal changes as encodings are now stored in Postgres instead of the object store.
- Feature to pull data from an object store and create temporary upload credentials. #537, #544, #551
- Blocking implementation #510, #527
- Benchmarking #478, #541
- Encodings are now stored in Postgres database instead of files in an object store. #516, #522
- Start to add integration tests to complement our end to end tests. #520, #528
- Use anonlink-client instead of clkhash #536
- Use Python 3.8 in base image. #518
- A base image is now used for all our Docker images. #506, #511, #517, #519
- Binary encodings now stored internally with their encoding id. #505
- REST API implementation for accepting clknblocks #503
- Update Open API spec to version 3. Add Blocking API #479
- CI Updates #476
- Chart updates #496, #497, #539
- Documentation updates (production deployment, debugging with PyCharm) #473, #504
- Fix Jaeger #500, #523
Misc changes/fixes:
- Detect invalid encoding size as early as possible #507
- Use local benchmark cache #531
- Cleanup docker-compose #533, #534, #547
- Calculate number of comparisons accounting for user supplied blocks. #543
Version 1.13.0-beta¶
Fixed a bug where a dataprovider could upload their clks multiple times in a project using the same upload token. (#463)
Fixed a bug where workers accepted work after failing to initialize their database connection pool. (#477)
Modified similarity_score output to follow the group format in preparation for extending this output type to more parties. (#464)
Tutorials have been improved following an internal review. (#467)
Database schema and CLK upload api has been modified to support blocking. (#470)
Benchmarking results can now be saved to an object store without authentication. Allowing an AWS user to save to S3 using node permissions. (#490)
Removed duplicate/redundant tests. (#466)
Updated dependencies:
- We have enabled dependabot on GitHub to keep our Python dependencies up to date.
- anonlinkclient now used for benchmarking. (#490)
- Chart dependencies redis-ha, postgres and minio all updated. (#496, #497)
Breaking Changes¶
- the similarity_score output type has been modified; it now returns a JSON array of JSON arrays, where such an array looks like [[party_id_0, row_index_0], [party_id_1, row_index_1], score]. (#464)
- Integration test configuration is now consistent with benchmark config. Instead of setting ENTITY_SERVICE_URL including /api/v1, now just set the host address in SERVER. (#495)
Database Changes (Internal)¶
- the dataproviders table uploaded field has been modified from a BOOL to an ENUM type (#463)
- The projects table has a new uses_blocking field. (#470)
Version 1.13.0-alpha¶
fixed bug where invalid state changes could occur when starting a run (#459)
matching output type has been removed as redundant with the groups output with 2 parties. (#458)
Update dependencies:
- requests from 2.21.0 to 2.22.0 (#459)
Breaking Change¶
matching
output type is not available anymore. (#458)
Version 1.12.0¶
Logging configurable in the deployed entity service by using the key loggingCfg. (#448)
Several old settings have been removed from the default values.yaml and docker files, having been replaced by CHUNK_SIZE_AIM (#414): SMALL_COMPARISON_CHUNK_SIZE, LARGE_COMPARISON_CHUNK_SIZE, SMALL_JOB_SIZE, LARGE_JOB_SIZE
Remove ENTITY_MATCH_THRESHOLD environment variable (#444)
Celery configuration updates to solve threads and memory leaks in deployment. (#427)
Update docker-compose files to use these new preferred configurations.
Update helm charts with preferred configuration; the default deployment is a minimal working deployment.
New environment variables: CELERY_DB_MIN_CONNECTIONS, FLASK_DB_MIN_CONNECTIONS, CELERY_DB_MAX_CONNECTIONS and FLASK_DB_MAX_CONNECTIONS to configure the database connections pool. (#405)
Simplify access to the database from services, relying on a single way to get a connection via a connection pool. (#405)
Deleting a run is now implemented. (#413)
Added some missing documentation about the output type groups (#449)
Sentinel name is configurable. (#436)
Improvement on the Kubernetes deployment test stage on Azure DevOps:
- Re-order cleaning steps to first purge the deployment and then deleting the remaining. (#426)
- Run integration tests in parallel, reducing pipeline stage Kubernetes deployment tests from 30 minutes to 15 minutes. (#438)
- Tests running on a deployed entity-service on k8s creates an artifact containing all the logs of all the containers, useful for debugging. (#445)
- Test container not restarted on test failure. (#434)
Benchmark improvements:
- Benchmark output has been modified to handle multi-party linkage.
- Benchmark can handle more than 2 parties, repeat experiments, and push the results to the minio object store. (#406, #424 and #425)
- Azure DevOps benchmark stage runs a 3 parties linkage. (#433)
Improvements on Redis cache:
- Refactor the cache. (#430)
- Run state kept in cache (instead of fully relying on database) (#431 and #432)
Update dependencies:
- anonlink to v0.12.5. (#423)
- redis from 3.2.0 to 3.2.1 (#415)
- alpine from 3.9 to 3.10.1 (#404)
Add some release documentation. (#455)
Version 1.11.2¶
- Switch to Azure Devops pipeline for CI.
- Switch to docker hub for container hosting.
Version 1.11.1¶
- Include multiparty linkage tutorial/example.
- Tightened up how we use a database connection from the flask app.
- Deployment and logging documentation updates.
Version 1.11.0¶
- Adds support for multiparty record linkage.
- Logging is now configurable from a file.
Other improvements¶
- Another tutorial for directly using the REST api was added.
- K8s deployment updated to use the 3.15.0 Postgres chart. Postgres configuration now uses a global namespace so subcharts can all use the same configuration as documented here.
- Jenkins testing now fails if the benchmark exits incorrectly or if the benchmark results contain failed results.
- Jenkins will now execute the tutorials notebooks and fail if any cells error.
Version 1.10.0¶
- Updates Anonlink and switches to using Anonlink’s default format for serialization of similarity scores.
- Sorts similarity scores before solving, improving accuracy.
- Uses Anonlink’s new API for similarity score computation and solving.
- Add support for using an external Postgres database.
- Added optional support for redis discovery via the sentinel protocol.
- Kubernetes deployment no longer includes a default postgres password. Ensure that you set your own postgresqlPassword.
- The Kubernetes deployment documentation has been extended.
Version 1.9.4¶
- Introduces configurable logging of HTTP headers.
- Dependency issue resolved.
Version 1.9.3¶
- Redis can now be used in highly available mode. Includes upstream fix where the redis sentinels crash.
- The custom kubernetes certificate management templates have been removed.
- Minor updates to the kubernetes resources. No longer using beta apis.
Version 1.9.2¶
- 2 race conditions have been identified and fixed.
- Integration tests are sped up and more focused. The test suite now fails after the first test failure.
- Code tidy-ups to be more pep8 compliant.
Version 1.9.1¶
- Adds support for (almost) arbitrary sized encodings. A minimum and maximum can be set at deployment time, and currently anonlink requires the size to be a multiple of 8.
- Adds support for opentracing with Jaeger.
- improvements to the benchmarking container
- internal refactoring of tasks
Version 1.9.0¶
- minio and redis services are now optional for kubernetes deployment.
- Introduction of a high memory worker and associated task queue.
- Fix issue where we could start tasks twice.
- Structlog now used for celery workers.
- CI now tests a kubernetes deployment.
- Many Jenkins CI updates and fixes.
- Updates to Jupyter notebooks and docs.
- Updates to Python and Helm chart dependencies and docker base images.
Version 1.8.1¶
Improve system stability while handling large intermediate results. Intermediate results are now stored in files instead of in Redis. This permits us to stream them instead of loading everything into memory.
Version 1.8¶
Version 1.8 introduces breaking changes to the REST API to allow an analyst to reuse uploaded CLKs.
Instead of a linkage project only having one result, we introduce a new sub-resource runs. A project holds the schema and CLKs from all data providers; and multiple runs can be created with different parameters. A run has a status and a result endpoint. Runs can be queued before the CLK data has been uploaded.
We also introduced changes to the result types. The result type permutation, which produced permutations and an encrypted mask, was removed, and the result type permutation_unencrypted_mask was renamed to permutations.
Brief summary of API changes:
- the mapping endpoint has been renamed to projects
- To carry out a linkage computation you must post to a project's runs endpoint: /api/v1/project/<PROJECT_ID>/runs
- Results are now accessed under the runs endpoint: /api/v1/project/<PROJECT_ID>/runs/<RUN_ID>/result
- result type permutation_unencrypted_mask was renamed to permutations
- result type permutation was removed
For all the updated API details check the Open API document.
Other improvements¶
- The documentation is now served at the root.
- The flower monitoring tool for celery is now included with the docker-compose deployment. Note this will be disabled for production deployment with kubernetes by default.
- The docker containers have been migrated to alpine linux to be much leaner.
- Substantial internal refactoring - especially of views.
- Move to pytest for end to end tests.
Version 1.7.3¶
Deployment and documentation sprint.
- Fixes a bug where only the top k results of a chunk were being requested from anonlink. #59 #84
- Updates to helm deployment templates to support a single namespace having multiple entityservices. Helm charts are more standard, some config has moved into a configmap and an experimental cert-manager configuration option has been added. #83, #90
- More sensible logging during testing.
- Every http request now has a (globally configurable) timeout
- Minor update regarding handling uploading empty CLKs. #92
- Update to latest versions of anonlink and clkhash. #94
- Documentation updates.
Version 1.7.2¶
Dependency and deployment updates. We now pin versions of Python, anonlink, clkhash, phe and docker images nginx and postgres.
Version 1.7.0¶
Added a view type that returns similarity scores of potential matches.
Version 1.6.8¶
Scalability sprint.
- Much better chunking of work.
- Security hardening by modifying the response from the server. Now there is no difference between an invalid token and an unknown resource - both return a 403 response status.
- Mapping information includes the time it was started.
- Update and add tests.
- Update the deployment to use Helm.
Devops¶
Continuous Integration¶
Azure DevOps¶
anonlink-entity-service
is automatically built and tested using Azure DevOps
in the project Anonlink (https://dev.azure.com/data61/Anonlink).
It consists of a build pipeline (https://dev.azure.com/data61/Anonlink/_build?definitionId=1).
The build pipeline is defined in the script azure-pipelines.yml which uses resources from the folder .azurePipeline.
The continuous integration stages are:
- building and pushing the following docker images:
  - the frontend data61/anonlink-nginx
  - the Python base image data61/anonlink-base
  - the backend data61/anonlink-app
  - the tutorials data61/anonlink-docs-tutorials (used to test the tutorial Python Notebooks)
  - the benchmark data61/anonlink-benchmark (used to run the benchmark)
- runs the benchmark using docker-compose and publishes the results as an artifact in Azure
- runs the tutorial tests using docker-compose and publishes the results in Azure
- runs the end to end tests by deploying the whole service on Kubernetes, running the tests found in backend/entityservice/tests and publishing the results in Azure. The pod logs are also available in Azure DevOps.
The build pipeline is triggered for every push on every branch. It is not triggered by Pull Requests to avoid duplicate testing and building potentially untrusted external code.
The build pipeline requires two environment variables provided by Azure environment:
- dockerHubId: username for the pipeline to push images to Data61’s Docker hub account.
- dockerHubPassword: password for the corresponding username (this is a secret variable).
It also requires a service connection to a k8s
cluster to be configured.
Base Image¶
The CI system builds and pushes the base image, before building downstream images. The CI
system builds the application images using the current base VERSION
. If a base image with the given
digest is already present on Docker Hub the base image won’t be rebuilt.
For additional details see Dependencies.
Debugging¶
There are a few ways to debug the Anonlink Entity Service; one of the easiest is to use docker-compose to take care of all the dependent services.
Debugging in PyCharm¶
Roughly following the JetBrains tutorial (https://www.jetbrains.com/help/pycharm/using-docker-compose-as-a-remote-interpreter.html) will work, with one deviation: before debugging, launch the nginx service manually from the docker-compose.yml file.
The following steps through this process using PyCharm 2020.
Add Python Interpreter¶
Start by adding a new Python interpreter. In new versions of PyCharm look for the interpreter down the bottom right of the screen.
Make a docker-compose interpreter¶
Adding a Python interpreter from a docker-compose service is straightforward.
Manually start nginx¶
Because the Anonlink Entity Service has an nginx container in front of the backend API, we manually start nginx.
Note
An alternative would be to expose the port from the backend.
Create a Run Configuration¶
Create a new Python run configuration for the API. It should default to using the docker-compose Python interpreter; add the script path to entityservice/__init__.py.
Debug¶
Add a breakpoint and start debugging!
Visit the url in a browser (e.g. http://localhost:8851/api/v1/status) or cause the breakpoint in a notebook or separate unit test etc. If you want the interactive terminal, just click "Console" in the debugger and enjoy auto-completion.
Road map for the entity service¶
- baseline benchmarking vs known datasets (accuracy and speed), e.g. recordspeed datasets
- Schema specification and tooling
- Algorithmic improvements. e.g., implementing canopy clustering solver
- A web front end including authentication and access control
- Uploading multiple hashes per entity. Handle multiple schemas.
- Check how we deal with missing information, old addresses etc
- Semi supervised machine learning methods to learn thresholds
- Handle 1 to many relationships. E.g. familial groups
- Larger scale graph solving methods
- optimise anonlink memory management and C++ code
Bigger Projects
- GPU implementation of core similarity scoring
- somewhat homomorphic encryption could be used for similarity score
- consider allowing users to upload raw PII
Releasing¶
Releasing a version of the Anonlink Entity Service¶
We follow gitflow. Each release has a GitHub milestone associated with it which groups all the features and bug fixes together.
Multiple docker images are contained within this repository (e.g., backend
, frontend
, benchmark
) which
are independently versioned. In general a release involves a new version of both the backend
and the frontend
.
This is because the documentation is baked into the frontend so user visible changes to the backend require a new
frontend.
- Choose a new version using semantic versioning.
- Create a branch off the latest develop called release-x.y.z.
- Update the versions in the code base (e.g., backend/entityservice/VERSION) of any components that have been changed. As above note if the backend version has changed you must release a new frontend too.
- Update the versions in the Chart.yaml file.
- Update the changelog to include user friendly information on all features, taking special care to mention any breaking changes.
- Open a PR to merge these changes into develop, and get a code review. Make any requested changes, and merge the changes into develop (don't close the branch).
- Open a PR to merge the release branch into master, only proceed if the CI tests all pass. Merge, rather than squashing the commits.
- Create a git tag of the form vX.Y.Z[-aN|-bN] (e.g. using GitHub's releases ui).
- Tag and push release versions of docker images from this tag and the tag latest (manually for now but ideally using CI).
- Commit to develop (via a PR) creating a new "Next Version" section in the changelog.
- Proudly announce the new release on the anonlink google group https://groups.google.com/forum/#!forum/anonlink
Implementation Details¶
Components¶
The entity service is implemented in Python and comprises the following components:
- A gunicorn/flask backend that implements the HTTP REST api.
- Celery backend worker/s that do the actual work. This interfaces with
the
anonlink
library. - An nginx frontend to reverse proxy the gunicorn/flask backend application.
- A Minio object store (large files such as raw uploaded hashes, results)
- A postgres database stores the linking metadata.
- A redis task queue that interfaces between the flask app and the celery backend. Redis also acts as an ephemeral cache.
Each of these has been packaged as a docker image, however the use of external services (redis, postgres, minio) can be configured through environment variables. Multiple workers can be used to distribute the work beyond one machine - by default all cores will be used for computing similarity scores and encrypting the mask vector.
Dependencies¶
Anonlink Entity Service uses Python dependencies found in base/requirements.txt
. These can be
manually installed using pip
:
pip install -r base/requirements.txt
Docker is used for packaging the application; we rely on a base image that includes the operating system level and Python level dependencies. To update a dependency, change the pinned version in base/requirements.txt or base/Dockerfile. Our CI system will bake the base image and tag it with a digest.
If you were so inclined you could generate the digest yourself with bash (example digest shown):
$ cd base
$ sha256sum requirements.txt Dockerfile | sha256sum | cut -f 1 -d " " | tr [:upper:] [:lower:]
3814723844e4b359f0b07e86a57093ad4f88aa434c42ced9c72c611bbcf9819a
Then a microservice can be updated to use this base image. In the application Dockerfile
there will
be an overridable digest:
ARG VERSION=4b497c1a0b2a6cc3ea848338a67c3a129050d32d9c532373e3301be898920b55
FROM data61/anonlink-base:${VERSION}
Either update this digest in the Dockerfile
, or when building with docker build
pass in an extra
argument:
--build-arg VERSION=3814723844e4b359f0b07e86a57093ad4f88aa434c42ced9c72c611bbcf9819a
Note the CI system automatically uses the current base image when building the application images.
Redis¶
Redis is used as the default message broker for celery, as well as a cross-container in-memory cache.
Redis key/values used directly by the Anonlink Entity Service:
Key | Redis Type | Description |
---|---|---|
“entityservice-status” | String | pickled status |
“run:{run_id}” | Hash | run info |
“clk-pkl-{dp_id}” | String | pickled encodings |
Redis Cache: Run Info¶
The run info HASH
stores:
- similarity scoring progress for each run under "progress"
- run state under "state"; current valid states are {active, complete, deleted}. See backend/entityservice/cache/active_runs.py for implementation.
Object Store¶
Write access to an AWS S3 compatible object store is required to store intermediate files for the Anonlink Entity Service. The optional feature for data upload via object store also requires access to an AWS S3 compatible object store - along with authorization to create temporary credentials.
MinIO is an open source object store implementation which can be used with both Docker Compose and Kubernetes deployments instead of AWS S3.
Deployment Testing¶
Testing Local Deployment¶
The docker compose file tools/ci.yml
is deployed along with tools/docker-compose.yml
. This compose file
defines additional containers which run benchmarks and tests after a short delay.
Testing K8s Deployment¶
The kubernetes deployment uses helm
with the template found in deployment/entity-service
. Jenkins additionally
defines the docker image versions to use and ensures an ingress is not provisioned. The deployment is configured to be
quite conservative in terms of cluster resources.
The k8s deployment test is limited to 30 minutes and an effort is made to clean up all created resources.
After a few minutes waiting for the deployment a
Kubernetes Job is created using
kubectl create
.
This job includes a 1GiB
persistent volume claim
to which the results are written (as results.xml
). During the testing the pytest output will be rendered,
and then the Job’s pod terminates. We create a temporary pod which mounts the same results volume and then we copy
across the produced test result artifact.
Benchmarking¶
In the benchmarking folder is a benchmarking script and associated Dockerfile.
The docker image is published at data61/anonlink-benchmark
The container/script is configured via environment variables.
- SERVER: (required) the url of the server.
- EXPERIMENT: json file containing a list of experiments to run. The schema of experiments is defined in ./schema/experiments.json.
- DATA_PATH: path to a directory to store test data (useful as a cache).
- RESULTS_PATH: full filename to write the results file.
- SCHEMA: path to the linkage schema file used when creating projects. If not provided it is assumed to be in the data directory.
- TIMEOUT: the time to wait for the result of a run, in seconds. Default is 1200 (20 min).
Run Benchmarking Container¶
Run the container directly with docker - substituting configuration information as required:
docker run -it \
-e SERVER=https://anonlink.easd.data61.xyz \
-e RESULTS_PATH=/app/results.json \
quay.io/n1analytics/entity-benchmark:latest
By default the container will pull synthetic datasets from an S3 bucket and run the default benchmark experiments against the configured SERVER. The default experiments (listed below) are set in benchmarking/default-experiments.json.
The output will be printed and saved to the file pointed to by RESULTS_PATH (e.g. /app/results.json).
Cache Volume¶
To speed up benchmarking across multiple runs, you may wish to mount a volume at the DATA_PATH to store the downloaded test data. Note the container runs as user 1000, so any mounted volume must be readable and writable by that user. To create a volume using docker:
docker volume create linkage-benchmark-data
To copy data from a local directory and change owner:
docker run --rm -v `pwd`:/src \
-v linkage-benchmark-data:/data busybox \
sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"
To run the benchmarks using the cache volume:
docker run \
--name ${benchmarkContainerName} \
--network ${networkName} \
-e SERVER=${localserver} \
-e DATA_PATH=/cache \
-e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
-e RESULTS_PATH=/app/results.json \
--mount source=linkage-benchmark-data,target=/cache \
quay.io/n1analytics/entity-benchmark:latest
Experiments¶
The benchmarking script will run a range of experiments, defined in a json file.
Data¶
The experiments use synthetic data generated with the febrl tool. The data is stored in the S3 bucket s3://public-linkage-data. The naming convention is {type_of_data}_{party}_{size}.
You'll find:
- the PII data for various dataset sizes, and the CLKs in binary and JSON format, generated with the linkage schema defined in schema.json
- the corresponding linkage schema in schema.json
- the blocks, generated with P-Sig blocking
- the corresponding blocking schema in psig_schema.json
- the combined clknblocks files for the different parties and dataset sizes
This particular blocking schema creates blocks with a median size of 1. The average size does not exceed 10 for any dataset, and each entity is part of 5 different blocks.
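Assuming the bucket permits anonymous reads, the available files can be listed with the AWS CLI; the exact object names follow the naming convention described above:
# List the synthetic datasets without AWS credentials
aws s3 ls s3://public-linkage-data/ --no-sign-request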
Config¶
The experiments are configured in a JSON document. Currently, you can specify the dataset sizes, the linkage threshold, the number of repetitions and whether blocking should be used. The default is:
[
{
"sizes": ["100K", "100K"],
"threshold": 0.95
},
{
"sizes": ["100K", "100K"],
"threshold": 0.80
},
{
"sizes": ["100K", "1M"],
"threshold": 0.95
}
]
The schema of the experiments can be found in benchmarking/schema/experiments.json.
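One way to check a custom experiments file against this schema before mounting it into the benchmark container is the third-party check-jsonschema tool; it is not part of this project, and my-experiments.json is a placeholder file name:
# Install the validator and check a custom experiments definition against the bundled schema
pip install check-jsonschema
check-jsonschema --schemafile benchmarking/schema/experiments.json my-experiments.json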
Logging¶
The entity service uses the standard Python logging library for logging.
The following named loggers are used:
- entityservice
  - entityservice.views
  - entityservice.models
  - entityservice.database
- celery.es
The following environment variables affect logging:
- LOG_CFG - sets the path to a logging configuration file. There are two examples:
  - entityservice/default_logging.yaml
  - entityservice/verbose_logging.yaml
- DEBUG - sets the logging level to debug for all application code.
- LOGFILE - directs the log output to this file instead of stdout.
- LOG_HTTP_HEADER_FIELDS - HTTP headers to include in the application logs.
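For example, a local run of the backend might enable verbose logging like so; whether DEBUG expects a particular truthy value depends on the service's configuration parsing, so treat these values as illustrative:
# Use the bundled verbose logging configuration and write debug output to a file
export LOG_CFG=entityservice/verbose_logging.yaml
export DEBUG=true
export LOGFILE=/tmp/entityservice.log
export LOG_HTTP_HEADER_FIELDS=User-Agent,Host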
Example logging output with LOG_HTTP_HEADER_FIELDS=User-Agent,Host:
[2019-02-02 23:17:23 +0000] [10] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=6c2a3730
[2019-02-02 23:17:23 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] 0 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [11] [INFO] Receiving CLK data. [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Storing user 25895 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:24 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] 1 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [10] [INFO] Receiving CLK data. [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Storing user 25896 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:25 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] 2 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=df791527
[2019-02-02 23:17:26 +0000] [12] [INFO] request description of a run [entityservice.views.run.description] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Requested project or run resource with invalid identifier token [entityservice.views.auth_checks] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Request to delete project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Marking project for deletion [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
With DEBUG enabled there are a lot of logs from the backend and workers:
[2019-02-02 23:14:47 +0000] [10] [INFO] Marking project for deletion [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:47 +0000] [10] [DEBUG] Trying to connect to postgres db [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [DEBUG] Database connection established [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [9] [INFO] Request to delete project [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=5486c153
Tracing¶
TRACING_CFG overrides the path to an OpenTracing configuration file.
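A minimal sketch of pointing the service at a custom tracing configuration; the path is illustrative, and the expected file format depends on the tracing client configured for your deployment:
# Override the default tracing configuration location
export TRACING_CFG=/path/to/tracing.yaml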