Entity Service - v1.11.2¶
The Entity Service allows two organizations to carry out private record linkage — finding matching records of entities between their respective datasets without disclosing personally identifiable information.
Overview¶
The Entity Service is based on the concept of Anonymous Linking Codes (ALC). These can be seen as bit-arrays representing an entity, with the property that the similarity of the bits of two ALCs reflects the similarity of the corresponding entities.
An anonymous linking code that has been shown to produce good results and is widely used in practice is the so-called *Cryptographic Longterm Key*, or CLK for short.
Note
From now on, we will use CLK exclusively instead of ALC, as our reference implementation of the private record linkage process uses CLKs as anonymous linking codes. The Entity Service is, however, not limited to CLKs.
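To make the similarity property concrete, here is a minimal, self-contained Python sketch (not part of the service API) of the Dice coefficient, the measure anonlink uses to compare bit-arrays; the two 8-bit encodings below are toy values, not real CLKs:
def dice_coefficient(clk_a, clk_b):
    # Dice coefficient of two equal-length bit-arrays given as lists of 0/1.
    common = sum(a & b for a, b in zip(clk_a, clk_b))
    return 2 * common / (sum(clk_a) + sum(clk_b))

# Two toy encodings that share most of their set bits score highly:
print(dice_coefficient([1, 1, 0, 1, 0, 0, 1, 0],
                       [1, 1, 0, 1, 0, 1, 0, 0]))  # 0.75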
Private record linkage - using the Entity Service - is a two-stage process:
- First, each party locally computes the CLKs for their entities’ data (e.g. using the clkhash tool). These CLKs are then uploaded to the service.
- The service then calculates the similarity between entities, using the probabilistic matching library anonlink. Depending on configuration, the output is returned as a mapping, permutations and mask, or similarity scores.
Table Of Contents¶
Tutorials¶
Anonlink Entity Service API¶
This tutorial demonstrates interacting with the entity service via the REST API. The primary alternative is to use a library or command line tool such as clkhash (http://clkhash.readthedocs.io/), which can handle the communication with the anonlink entity service.
Dependencies¶
In this tutorial we interact with the REST API using the requests Python library. We also use the clkhash Python library to define the linkage schema and to encode the PII. The synthetic dataset comes from the recordlinkage package. All the dependencies can be installed with pip:
pip install requests clkhash recordlinkage
Steps¶
- Check connection to Anonlink Entity Service
- Synthetic Data generation and encoding
- Create a new linkage project
- Upload the encodings
- Create a run
- Retrieve and analyse results
[1]:
import json
import os
import time
import requests
from IPython.display import clear_output
Check Connection¶
If you are connecting to a custom entity service, change the address here.
[2]:
server = os.getenv("SERVER", "https://testing.es.data61.xyz")
url = server + "/api/v1/"
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz/api/v1/
[3]:
requests.get(url + 'status').json()
[3]:
{'project_count': 2278, 'rate': 3863861, 'status': 'ok'}
Data preparation¶
This section won’t be explained in great detail as it directly follows the clkhash tutorials.
We encode a synthetic dataset from the recordlinkage library using clkhash.
[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()
[6]:
with open('a.csv', 'w') as a_csv:
dfA.to_csv(a_csv, line_terminator='\n')
with open('b.csv', 'w') as b_csv:
dfB.to_csv(b_csv, line_terminator='\n')
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for encoding PII into CLKs. A detailed description of the hashing schema can be found in the clkhash documentation.
A linkage schema can either be defined as Python code as shown here, or as a JSON file (shown in other tutorials). The importance of each field is controlled by the k parameter in the FieldHashingProperties. We ignore the record id and social security id fields so they won’t be incorporated into the encoding.
[7]:
import clkhash
from clkhash.field_formats import *
schema = clkhash.randomnames.NameList.SCHEMA
_missing = MissingValueSpec(sentinel='')
schema.fields = [
Ignore('rec_id'),
StringSpec('given_name',
FieldHashingProperties(ngram=2, k=15)),
StringSpec('surname',
FieldHashingProperties(ngram=2, k=15)),
IntegerSpec('street_number',
FieldHashingProperties(ngram=1,
positional=True,
k=15,
missing_value=_missing)),
StringSpec('address_1',
FieldHashingProperties(ngram=2, k=15)),
StringSpec('address_2',
FieldHashingProperties(ngram=2, k=15)),
StringSpec('suburb',
FieldHashingProperties(ngram=2, k=15)),
IntegerSpec('postcode',
FieldHashingProperties(ngram=1, positional=True, k=15)),
StringSpec('state',
FieldHashingProperties(ngram=2, k=15)),
IntegerSpec('date_of_birth',
FieldHashingProperties(ngram=1, positional=True, k=15, missing_value=_missing)),
Ignore('soc_sec_id')
]
Encoding¶
Transform the raw personally identifiable information into CLK encodings following the defined schema. See the clkhash documentation for further details on this.
[8]:
from clkhash import clk
with open('a.csv') as a_pii:
hashed_data_a = clk.generate_clk_from_csv(a_pii, ('key1',), schema, validate=False)
with open('b.csv') as b_pii:
hashed_data_b = clk.generate_clk_from_csv(b_pii, ('key1',), schema, validate=False)
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.78kclk/s, mean=645, std=43.8]
generating CLKs: 100%|██████████| 5.00k/5.00k [00:02<00:00, 1.35kclk/s, mean=634, std=50.3]
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[9]:
project_spec = {
"schema": {},
"result_type": "mapping",
"number_parties": 2,
"name": "API Tutorial Test"
}
credentials = requests.post(url + 'projects', json=project_spec).json()
project_id = credentials['project_id']
a_token, b_token = credentials['update_tokens']
credentials
[9]:
{'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
'result_token': '693c423c0c021f92a9f7b1658ef8f19beaa7b9c1b27ea22c',
'update_tokens': ['57401d6c0edfa78abf3bd4a87936159f8c974f93dc352d21',
'8c44139db950ca88f58f18d18e219f001fa105543a7b25e6']}
Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.
The result_token can also be used to carry out project API requests:
[10]:
requests.get(url + 'projects/{}'.format(project_id),
headers={"Authorization": credentials['result_token']}).json()
[10]:
{'error': False,
'name': 'API Tutorial Test',
'notes': '',
'number_parties': 2,
'parties_contributed': 0,
'project_id': 'e98ababc1a02a4057a13b39c846e9f219acf71bd0a4143c7',
'result_type': 'mapping',
'schema': {}}
Now the two clients can upload their data providing the appropriate upload tokens.
CLK Upload¶
[12]:
a_response = requests.post(
'{}projects/{}/clks'.format(url, project_id),
json={'clks': hashed_data_a},
headers={"Authorization": a_token}
).json()
[13]:
b_response = requests.post(
'{}projects/{}/clks'.format(url, project_id),
json={'clks': hashed_data_b},
headers={"Authorization": b_token}
).json()
Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
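For output types where the receipt token is needed later (for example the permutations output in the next tutorial), it can be read straight from the upload responses. A minimal sketch, assuming the responses have the shape shown in the later tutorials:
a_receipt_token = a_response['receipt_token']
b_receipt_token = b_response['receipt_token']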
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
[21]:
run_response = requests.post(
"{}projects/{}/runs".format(url, project_id),
headers={"Authorization": credentials['result_token']},
json={
'threshold': 0.80,
'name': "Tutorial Run #1"
}
).json()
[22]:
run_id = run_response['run_id']
Run Status¶
[23]:
requests.get(
'{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
headers={"Authorization": credentials['result_token']}
).json()
[23]:
{'current_stage': {'description': 'compute similarity scores',
'number': 2,
'progress': {'absolute': 25000000,
'description': 'number of already computed similarity scores',
'relative': 1.0}},
'stages': 3,
'state': 'running',
'time_added': '2019-04-30T12:18:44.633541+00:00',
'time_started': '2019-04-30T12:18:44.778142+00:00'}
Now after some delay (depending on the size) we can fetch the results. This can of course be done by directly polling the REST API using requests, however for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.
Note that the server address is passed here rather than url (which includes the /api/v1/ prefix).
[24]:
import clkhash.rest_client
for update in clkhash.rest_client.watch_run_status(server, project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
[25]:
data = json.loads(clkhash.rest_client.run_get_result_text(
server,
project_id,
run_id,
credentials['result_token']))
This result is the 1-1 mapping between rows that were more similar than the given threshold.
[30]:
for i in range(10):
print("a[{}] maps to b[{}]".format(i, data['mapping'][str(i)]))
print("...")
a[0] maps to b[1449]
a[1] maps to b[2750]
a[2] maps to b[4656]
a[3] maps to b[4119]
a[4] maps to b[3306]
a[5] maps to b[2305]
a[6] maps to b[3944]
a[7] maps to b[992]
a[8] maps to b[4612]
a[9] maps to b[3629]
...
In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:
[31]:
len(data['mapping'])
[31]:
4853
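As a quick sanity check (not part of the original notebook), the febrl4 record ids encode the ground truth: row i of dfA is rec-N-org and its true duplicate in dfB is rec-N-dup-0. A minimal sketch comparing the returned mapping against this ground truth:
a_ids = list(dfA.index)
b_ids = list(dfB.index)
# Count mapped pairs whose record numbers agree, e.g. rec-1070-org <-> rec-1070-dup-0
correct = sum(a_ids[int(i)].split('-')[1] == b_ids[int(j)].split('-')[1]
              for i, j in data['mapping'].items())
print("{} of {} found matches are correct".format(correct, len(data['mapping'])))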
Cleanup¶
If you want you can delete the run and project from the anonlink-entity-service.
[44]:
requests.delete(
"{}/projects/{}".format(url, project_id),
headers={"Authorization": credentials['result_token']})
[44]:
<Response [403]>
Entity Service Permutation Output¶
This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, Alice and Bob, each have a dataset of personally identifiable information (PII) for several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the Analyst.
The chosen output type is permutations, which consists of two permutations and one mask.
Who learns what?¶
After the linkage has been carried out, Alice and Bob will be able to retrieve a permutation - a reordering of their respective data sets such that shared entities line up.
The Analyst - who creates the linkage project - learns the mask. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
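As a toy illustration (hypothetical values, not output of the service), the mask is applied position-wise to the two permuted datasets:
alice_permuted = ['alice row x', 'alice row y', 'alice row z']
bob_permuted = ['bob row p', 'bob row q', 'bob row r']
mask = [1, 0, 1]  # 1 means the rows at this position refer to the same entity
shared_pairs = [(a, b) for a, b, m in zip(alice_permuted, bob_permuted, mask) if m]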
Steps¶
These steps are usually run by different companies - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the Analyst acts as the integration authority.
- Check connection to Entity Service
- Data preparation
- Write CSV files with PII
- Create a Linkage Schema
- Create Linkage Project
- Generate CLKs from PII
- Upload the PII
- Create a run
- Retrieve and analyse results
Check Connection¶
If you’re connecting to a custom entity service, change the address here.
[1]:
import os
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[2]:
!clkutil status --server "{url}"
{"project_count": 1021, "rate": 2453247, "status": "ok"}
Data preparation¶
Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.
[3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[4]:
dfA, dfB = load_febrl4()
a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)
b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)
dfA.head(3)
[4]:
given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id | |
---|---|---|---|---|---|---|---|---|---|---|
rec_id | ||||||||||
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[5]:
schema = NamedTemporaryFile('wt')
[6]:
%%writefile {schema.name}
{
"version": 1,
"clkConfig": {
"l": 1024,
"k": 30,
"hash": {
"type": "doubleHash"
},
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 0.5, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 0.5 }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 0.5 }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 0.5 }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 100, "maximum": 9999 },
"hashing": { "ngram": 1, "positional": true, "weight": 0.5 }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting /tmp/tmptfalxkiq
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[7]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)
!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "permutations" --server "{url}"
creds.seek(0)
import json
with open(creds.name, 'r') as f:
credentials = json.load(f)
project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpyr8dc2pf
Project created
[7]:
{'project_id': 'b8211d1450c8d0d631dbdc1fb482af106b8cbdebed5b7fd3',
'result_token': '8fe1fc01f7ac3a3406d1e031b7d120800aa6460d0da62abb',
'update_tokens': ['1c39c6972626bd34729812f0b9cf6e467461824dbbd0682c',
'901c12061cf621b67df5b9de2719b8806636364d3fdc1765']}
Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.
Hash and Upload¶
At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. We need:
- the clkhash library
- the linkage schema from above
- two secret passwords which are only known to Alice and Bob (here: horse and staple)
Please see the clkhash documentation for further details on this.
[8]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.32kclk/s, mean=765, std=37.1]
CLK data written to /tmp/tmpc_4k553j.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 4.28kclk/s, mean=756, std=43.3]
CLK data written to /tmp/tmpv7eo2tfp.json
Now the two clients can upload their data providing the appropriate upload tokens and the project_id. As with all commands in clkhash we can output help:
[9]:
!clkutil upload --help
Usage: clkutil upload [OPTIONS] CLK_JSON
Upload CLK data to entity matching server.
Given a json file containing hashed clk data as CLK_JSON, upload to the
entity resolution service.
Use "-" to read from stdin.
Options:
--project TEXT Project identifier
--apikey TEXT Authentication API key for the server.
--server TEXT Server address including protocol
-o, --output FILENAME
-v, --verbose Script is more talkative
--help Show this message and exit.
Alice uploads her data¶
[10]:
with NamedTemporaryFile('wt') as f:
!clkutil upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][0]}" \
--server "{url}" \
--output "{f.name}" \
"{a_clks.name}"
res = json.load(open(f.name))
alice_receipt_token = res['receipt_token']
Every upload gets a receipt token. This token is required to access the results.
Bob uploads his data¶
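The original upload cell for Bob is missing from this notebook; the following sketch mirrors Alice’s upload above, using Bob’s update token and CLK file, and keeps the receipt token that is needed to fetch Bob’s permutation later:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"
    bob_receipt_token = json.load(open(f.name))['receipt_token']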
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
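The run-creation cell is also not shown here; a sketch based on the identical step in the similarity scores tutorial below (the threshold value is only an example):
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.8 \
        --output "{f.name}"
    run_id = json.load(open(f.name))['run_id']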
Now after some delay (depending on the size) we can fetch the mask. This can be done with clkutil:
!clkutil results --server "{url}" \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" --output results.txt
However for this tutorial we are going to use the Python requests library:
[13]:
import requests
import clkhash.rest_client
from IPython.display import clear_output
[14]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
[15]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()
[16]:
mask = results['mask']
This mask is a boolean array that specifies where rows of permuted data line up.
[17]:
print(mask[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The number of 1s in the mask will tell us how many matches were found.
[18]:
sum([1 for m in mask if m == 1])
[18]:
4858
We also use requests to fetch the permutations for each data provider:
[19]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()
Now Alice and Bob both have a new permutation - a new ordering for their data.
[20]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]
[20]:
[2333, 1468, 559, 274, 653, 3385, 278, 3568, 3617, 4356]
This permutation says the first row of Alice’s data should be moved to position 2333.
[21]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]
[21]:
[2083, 1106, 3154, 1180, 2582, 375, 3533, 1046, 316, 2427]
[22]:
def reorder(items, order):
    """
    Apply a permutation: the item at position i is moved to position order[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder
[23]:
with open(a_csv.name, 'r') as f:
alice_raw = f.readlines()[1:]
alice_reordered = reorder(alice_raw, alice_permutation)
with open(b_csv.name, 'r') as f:
bob_raw = f.readlines()[1:]
bob_reordered = reorder(bob_raw, bob_permutation)
Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don’t.
[24]:
alice_reordered[:10]
[24]:
['rec-2689-org,ainsley,robison,23,atherton street,villa 1/4,deer park,3418,nsw,19310531,4102867\n',
'rec-1056-org,chloe,imgraben,47,curlewis crescent,dragon rising,burleigh waters,2680,qld,19520516,6111417\n',
'rec-1820-org,liam,cullens,121,chandler street,the burrows,safety bay,3073,qld,19910811,7828812\n',
'rec-2192-org,ellie,fearnall,31,fishburn street,colbara,cherrybrook,5171,wa,,7745948\n',
'rec-2696-org,campbell,nguyen,6,diselma place,villa 2,collinswood,4343,nsw,19630325,2861961\n',
'rec-968-org,aidan,blake,15,namatjira drive,cooramin,dromana,4074,vic,19270928,4317464\n',
'rec-3833-org,nicholas,clarke,13,gaylard place,tryphinia view,wetherill park,2810,nsw,19041223,3927795\n',
'rec-4635-org,isabella,white,8,cooling place,,rosebud,6151,sa,19990911,2206317\n',
'rec-3549-org,harry,thorpe,11,kambalda crescent,louisa tor 4,angaston,2777,qld,19421128,2701790\n',
'rec-1220-org,lauren,weltman,6,tewksbury circuit,heritage estate,evans head,6330,nsw,19840930,9462453\n']
[25]:
bob_reordered[:10]
[25]:
['rec-2689-dup-0,ainsley,labalck,23,atherto n street,villa 1/4,deer park,3418,nsw,19310531,4102867\n',
'rec-1056-dup-0,james,imgrapen,47,curlewiscrescent,dragon rising,burleigh waters,2680,qld,19520516,6111417\n',
'rec-1820-dup-0,liam,cullens,121,chandlerw street,the burrows,safety bay,3073,qld,19910811,7828812\n',
'rec-2192-dup-0,elpie,fearnull,31,fishbunestreet,,cherrybrook,5171,wa,,7745948\n',
'rec-2696-dup-0,jenna,nguyen,85,diselmaplace,villz2,collinswood,4343,nsw,19630325,2861961\n',
'rec-968-dup-0,aidan,blake,15,namatjifra drive,cooramin,dromana,4074,vic,19270928,4317464\n',
'rec-3833-dup-0,nicholas,clarke,,gaylard place,tryphinia view,wetherill park,2810,nsw,19041223,3972795\n',
'rec-4635-dup-0,isaeblla,white,8,cooling place,massey green,rosebud,6151,sa,19990911,2206317\n',
'rec-3549-dup-0,taylor,thorpe,11,kambalda c rescent,louisa tor 4,angasgon,2777,qld,19421128,2701790\n',
'rec-1220-dup-0,lauren,welman,6,tewksburl circuit,heritage estate,evans head,6330,nsw,19840930,9462453\n']
Accuracy¶
To compute how well the matching went we will use the first index as our reference.
For example, rec-1396-org is an original record which has a match in rec-1396-dup-0. To satisfy ourselves we can preview the first few supposed matches:
[26]:
for i, m in enumerate(mask[:10]):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
name_a = ' '.join(entity_a[1:3]).title()
name_b = ' '.join(entity_b[1:3]).title()
print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))
Ainsley Robison (rec-2689-org) =? Ainsley Labalck (rec-2689-dup-0)
Chloe Imgraben (rec-1056-org) =? James Imgrapen (rec-1056-dup-0)
Liam Cullens (rec-1820-org) =? Liam Cullens (rec-1820-dup-0)
Ellie Fearnall (rec-2192-org) =? Elpie Fearnull (rec-2192-dup-0)
Campbell Nguyen (rec-2696-org) =? Jenna Nguyen (rec-2696-dup-0)
Aidan Blake (rec-968-org) =? Aidan Blake (rec-968-dup-0)
Nicholas Clarke (rec-3833-org) =? Nicholas Clarke (rec-3833-dup-0)
Isabella White (rec-4635-org) =? Isaeblla White (rec-4635-dup-0)
Harry Thorpe (rec-3549-org) =? Taylor Thorpe (rec-3549-dup-0)
Lauren Weltman (rec-1220-org) =? Lauren Welman (rec-1220-dup-0)
Metrics¶
If you know the ground truth — the correct mapping between the two datasets — you can compute performance metrics of the linkage.
Precision: the percentage of actual matches out of all found matches (tp/(tp+fp)).
Recall: how many of the actual matches have we found? (tp/(tp+fn))
[27]:
tp = 0
fp = 0
for i, m in enumerate(mask):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
tp += 1
else:
fp += 1
#print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])
print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000
print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))
Found 4858 correct matches out of 5000. Incorrectly linked 0 matches.
Precision: 100.0%
Recall: 97.2%
Entity Service Similarity Scores Output¶
This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and retrieving the results. The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.
The sections are usually run by different participants - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the analyst acts as the integration authority.
Who learns what?¶
Alice and Bob will both generate and upload their CLKs.
The analyst - who creates the linkage project - learns the similarity scores. Be aware that this is a lot of information and is subject to frequency attacks.
Steps¶
- Check connection to Entity Service
- Data preparation
- Write CSV files with PII
- Create a Linkage Schema
- Create Linkage Project
- Generate CLKs from PII
- Upload the PII
- Create a run
- Retrieve and analyse results
[1]:
%matplotlib inline
import json
import os
import time
import matplotlib.pyplot as plt
import requests
import clkhash.rest_client
from IPython.display import clear_output
Check Connection¶
If you are connecting to a custom entity service, change the address here.
[2]:
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[3]:
!clkutil status --server "{url}"
{"project_count": 2115, "rate": 7737583, "status": "ok"}
Data preparation¶
Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.
If you are following along yourself you may have to adjust the file names in all the !clkutil commands.
[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()
a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)
b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)
dfA.head(3)
[5]:
given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id | |
---|---|---|---|---|---|---|---|---|---|---|
rec_id | ||||||||||
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[6]:
schema = NamedTemporaryFile('wt')
[7]:
%%writefile {schema.name}
{
"version": 1,
"clkConfig": {
"l": 1024,
"k": 30,
"hash": {
"type": "doubleHash"
},
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 100, "maximum": 9999 },
"hashing": { "ngram": 1, "positional": true, "weight": 1 }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
"hashing": { "ngram": 2, "weight": 1 }
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting /tmp/tmpvlivqdcf
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[8]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)
!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "similarity_scores" --server "{url}"
creds.seek(0)
with open(creds.name, 'r') as f:
credentials = json.load(f)
project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpcwpvq6kj
Project created
[8]:
{'project_id': '1eb3da44f73440c496ab42217381181de55e9dcd6743580c',
'result_token': '846c6c25097c7794131de0d3e2c39c04b7de9688acedc383',
'update_tokens': ['52aae3f1dfa8a4ec1486d8f7d63a8fe708876b39a8ec585b',
'92e2c9c1ce52a2c2493b5e22953600735a07553f7d00a704']}
Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.
Hash and Upload¶
At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. Please see the clkhash documentation for further details on this.
[9]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.06kclk/s, mean=883, std=33.6]
CLK data written to /tmp/tmpj8m1dvxj.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.30kclk/s, mean=875, std=39.7]
CLK data written to /tmp/tmpi2y_ogl9.json
Now the two clients can upload their data providing the appropriate upload tokens.
Alice uploads her data¶
[10]:
with NamedTemporaryFile('wt') as f:
!clkutil upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][0]}" \
--server "{url}" \
--output "{f.name}" \
"{a_clks.name}"
res = json.load(open(f.name))
alice_receipt_token = res['receipt_token']
Every upload gets a receipt token. In some operating modes this receipt is required to access the results.
Bob uploads his data¶
[11]:
with NamedTemporaryFile('wt') as f:
!clkutil upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][1]}" \
--server "{url}" \
--output "{f.name}" \
"{b_clks.name}"
bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
[12]:
with NamedTemporaryFile('wt') as f:
!clkutil create \
--project="{project_id}" \
--apikey="{credentials['result_token']}" \
--server "{url}" \
--threshold 0.9 \
--output "{f.name}"
run_id = json.load(open(f.name))['run_id']
Results¶
Now after some delay (depending on the size) we can fetch the results. This can be done with clkutil:
!clkutil results --server "{url}" \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" --output results.txt
However for this tutorial we are going to use the clkhash library:
[13]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(clkhash.rest_client.format_run_status(update))
time.sleep(3)
State: completed
Stage (2/2): compute similarity scores
Progress: 1.000%
[17]:
data = json.loads(clkhash.rest_client.run_get_result_text(
url,
project_id,
run_id,
credentials['result_token']))['similarity_scores']
This result is a large list of tuples recording the similarity between all rows above the given threshold.
[18]:
for row in data[:10]:
print(row)
[76, 2345, 1.0]
[83, 3439, 1.0]
[103, 863, 1.0]
[154, 2391, 1.0]
[177, 4247, 1.0]
[192, 1176, 1.0]
[270, 4516, 1.0]
[312, 1253, 1.0]
[407, 3743, 1.0]
[670, 3550, 1.0]
Note there can be a lot of similarity scores:
[19]:
len(data)
[19]:
1572906
We will display a sample of these similarity scores in a histogram using matplotlib:
[20]:
plt.hist([_[2] for _ in data[::100]], bins=50);

The vast majority of these similarity scores are for non-matches. Let’s zoom into the right side of the distribution.
[21]:
plt.hist([_[2] for _ in data[::1] if _[2] > 0.94], bins=50);

Now it looks like a good threshold should be above 0.95. Let’s have a look at some of the candidate matches around there.
[22]:
def sample(data, threshold, num_samples, epsilon=0.01):
samples = []
for row in data:
if abs(row[2] - threshold) <= epsilon:
samples.append(row)
if len(samples) >= num_samples:
break
return samples
def lookup_originals(candidate_pair):
a = dfA.iloc[candidate_pair[0]]
b = dfB.iloc[candidate_pair[1]]
return a, b
[23]:
def look_at_per_field_accuracy(threshold = 0.999, num_samples = 100):
results = []
for i, candidate in enumerate(sample(data, threshold, num_samples, 0.01), start=1):
record_a, record_b = lookup_originals(candidate)
results.append(record_a == record_b)
print("Proportion of exact matches for each field using threshold: {}".format(threshold))
print(sum(results)/num_samples)
So we should expect a very high proportion of matches across all fields for high thresholds:
[24]:
look_at_per_field_accuracy(threshold = 0.999, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.999
given_name 0.93
surname 0.96
street_number 0.88
address_1 0.92
address_2 0.80
suburb 0.92
postcode 0.95
state 1.00
date_of_birth 0.96
soc_sec_id 0.40
dtype: float64
But if we look at a threshold which is closer to the boundary between real matches and non-matches, we should see a lot more errors:
[25]:
look_at_per_field_accuracy(threshold = 0.95, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.95
given_name 0.49
surname 0.57
street_number 0.81
address_1 0.55
address_2 0.44
suburb 0.70
postcode 0.84
state 0.93
date_of_birth 0.84
soc_sec_id 0.92
dtype: float64
[1]:
import csv
import json
import os
import pandas as pd
[2]:
KEY1 = 'correct'
KEY2 = 'horse'
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
Scenario¶
There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, but with incomplete overlap. The common features describing those entities are given name, surname, date of birth, and phone number.
They all have some additional information about those entities in their respective datasets: Alice has a person’s gender, Bob has their city, and Charlie has their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don’t want to share any additional information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers).
Alice, Bob, and Charlie: agree on secret keys and a linkage schema¶
They keep the keys to themselves, but the schema may be revealed to the analyst.
[3]:
print(f'keys: {KEY1}, {KEY2}')
keys: correct, horse
[4]:
with open('data/schema_ABC.json') as f:
print(f.read())
{
"version": 2,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"info": "c2NoZW1hX2V4YW1wbGU=",
"keySize": 64
}
},
"features": [
{
"identifier": "id",
"ignored": true
},
{
"identifier": "givenname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": false,
"strategy": {"k": 15}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": false,
"strategy": {"k": 15}
}
},
{
"identifier": "dob",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": true,
"strategy": {"k": 15}
}
},
{
"identifier": "phone number",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 1,
"positional": true,
"strategy": {"k": 8}
}
},
{
"identifier": "ignoredForLinkage",
"ignored": true
}
]
}
Sneak peek at input data¶
[5]:
pd.read_csv('data/dataset-alice.csv').head()
[5]:
id | givenname | surname | dob | phone number | gender | |
---|---|---|---|---|---|---|
0 | 0 | tara | hilton | 27-08-1941 | 08 2210 0298 | male |
1 | 3 | saJi | vernre | 22-12-2972 | 02 1090 1906 | mals |
2 | 7 | sliver | paciorek | NaN | NaN | mals |
3 | 9 | ruby | george | 09-05-1939 | 07 4698 6255 | male |
4 | 10 | eyrinm | campbell | 29-1q-1983 | 08 299y 1535 | male |
[6]:
pd.read_csv('data/dataset-bob.csv').head()
[6]:
id | givenname | surname | dob | phone number | city | |
---|---|---|---|---|---|---|
0 | 3 | zali | verner | 22-12-1972 | 02 1090 1906 | perth |
1 | 4 | samuel | tremellen | 21-12-1923 | 03 3605 9336 | melbourne |
2 | 5 | amy | lodge | 16-01-1958 | 07 8286 9372 | canberra |
3 | 7 | oIji | pacioerk | 10-02-1959 | 04 4220 5949 | sydney |
4 | 10 | erin | kampgell | 29-12-1983 | 08 2996 1445 | perth |
Charlie¶
[7]:
pd.read_csv('data/dataset-charlie.csv').head()
[7]:
id | givenname | surname | dob | phone number | income | |
---|---|---|---|---|---|---|
0 | 1 | joshua | arkwright | 16-02-1903 | 04 8511 9580 | 70189.446 |
1 | 3 | zal: | verner | 22-12-1972 | 02 1090 1906 | 50194.118 |
2 | 7 | oliyer | paciorwk | 10-02-1959 | 04 4210 5949 | 31750.993 |
3 | 8 | nacoya | ranson | 17-08-1925 | 07 6033 4580 | 102446.131 |
4 | 10 | erih | campbell | 29-12-1i83 | 08 299t 1435 | 331476.599 |
Analyst: create the project¶
The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone.
[8]:
!clkutil create-project --server $SERVER --type groups --schema data/schema_ABC.json --parties 3 --output credentials.json
with open('credentials.json') as f:
credentials = json.load(f)
project_id = credentials['project_id']
result_token = credentials['result_token']
update_token_alice = credentials['update_tokens'][0]
update_token_bob = credentials['update_tokens'][1]
update_token_charlie = credentials['update_tokens'][2]
Project created
Alice: hash the data and upload it to the server¶
The data is hashed according to the schema and the keys. Alice’s update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes.
[9]:
!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema_ABC.json dataset-alice-hashed.json --check-header false
generating CLKs: 0%| | 0.00/3.23k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs: 6%|6 | 200/3.23k [00:02<00:31, 96.1clk/s, mean=372, std=32.6]
generating CLKs: 25%|##4 | 800/3.23k [00:02<00:17, 136clk/s, mean=371, std=35.5]
generating CLKs: 63%|######2 | 2.03k/3.23k [00:02<00:06, 193clk/s, mean=372, std=34.7]
generating CLKs: 100%|##########| 3.23k/3.23k [00:02<00:00, 1.29kclk/s, mean=372, std=34.9]
CLK data written to dataset-alice-hashed.json
[10]:
!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json
{"message": "Updated", "receipt_token": "c54597f32fd969603efba706af1556abee3cc35f2718bcb6"}
Bob: hash the data and upload it to the server¶
[11]:
!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema_ABC.json dataset-bob-hashed.json --check-header false
generating CLKs: 0%| | 0.00/3.24k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs: 6%|6 | 200/3.24k [00:01<00:25, 119clk/s, mean=369, std=32.4]
generating CLKs: 31%|### | 1.00k/3.24k [00:01<00:13, 168clk/s, mean=371, std=35]
generating CLKs: 56%|#####5 | 1.80k/3.24k [00:01<00:06, 238clk/s, mean=371, std=35.5]
generating CLKs: 100%|##########| 3.24k/3.24k [00:02<00:00, 1.45kclk/s, mean=372, std=35.3]
CLK data written to dataset-bob-hashed.json
[12]:
!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json
{"message": "Updated", "receipt_token": "6ee2fe5df850b795ee6ddff1aaf4dfb03f6d4398dedcc248"}
Charlie: hash the data and upload it to the server¶
[13]:
!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema_ABC.json dataset-charlie-hashed.json --check-header false
generating CLKs: 0%| | 0.00/3.26k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs: 6%|6 | 200/3.26k [00:01<00:24, 122clk/s, mean=371, std=33.3]
generating CLKs: 55%|#####5 | 1.80k/3.26k [00:01<00:08, 174clk/s, mean=372, std=34.5]
generating CLKs: 100%|##########| 3.26k/3.26k [00:01<00:00, 1.73kclk/s, mean=372, std=34.8]
CLK data written to dataset-charlie-hashed.json
[14]:
!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json
{"message": "Updated", "receipt_token": "064664ed9fd1f58c4da05c62a4832b813276d09342137a42"}
Analyst: start the linkage run¶
This will start the linkage computation. We will wait a little bit and then retrieve the results.
[15]:
!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.7 --output=run-credentials.json
with open('run-credentials.json') as f:
run_credentials = json.load(f)
run_id = run_credentials['run_id']
Analyst: retrieve the results¶
[16]:
!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --watch --output linkage-output.json
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[17]:
with open('linkage-output.json') as f:
linkage_output = json.load(f)
linkage_groups = linkage_output['groups']
Everyone: make table of interesting information¶
We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII.
[18]:
with open('data/dataset-alice.csv') as f:
r = csv.reader(f)
next(r) # Skip header
genders = tuple(row[-1] for row in r)
with open('data/dataset-bob.csv') as f:
r = csv.reader(f)
next(r) # Skip header
cities = tuple(row[-1] for row in r)
with open('data/dataset-charlie.csv') as f:
r = csv.reader(f)
next(r) # Skip header
incomes = tuple(row[-1] for row in r)
[19]:
table = []
for group in linkage_groups:
row = [''] * 3
for i, j in group:
row[i] = [genders, cities, incomes][i][j]
if sum(map(bool, row)) > 1:
table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)
[19]:
gender | city | income | |
---|---|---|---|
0 | peGh | 395273.665 | |
1 | sydnev | 77367.636 | |
2 | pertb | 323383.650 | |
3 | syd1e7y | 79745.538 | |
4 | perth | 28019.494 | |
5 | canberra | 78961.675 | |
6 | female | brisnane | |
7 | male | canbetra | |
8 | sydme7 | 106849.526 | |
9 | melbourne | 68548.966 |
The last 15 groups look like this.
[20]:
linkage_groups[-15:]
[20]:
[[[0, 2111], [1, 2100]],
[[0, 2121], [2, 2131], [1, 2111]],
[[1, 1146], [2, 1202], [0, 1203]],
[[1, 2466], [2, 2478], [0, 2460]],
[[0, 429], [1, 412]],
[[0, 2669], [1, 1204]],
[[1, 1596], [2, 1623]],
[[0, 487], [1, 459]],
[[1, 1776], [2, 1800], [0, 1806]],
[[1, 2586], [2, 2602]],
[[0, 919], [1, 896]],
[[0, 100], [2, 107], [1, 100]],
[[0, 129], [1, 131], [2, 135]],
[[0, 470], [1, 440]],
[[0, 1736], [1, 1692], [2, 1734]]]
Sneak peek at the result¶
We obviously can’t do this in a real-world setting, but let’s view the linkage using the PII. If the IDs match, then we are correct.
[21]:
with open('data/dataset-alice.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_alice = tuple(r)
with open('data/dataset-bob.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_bob = tuple(r)
with open('data/dataset-charlie.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_charlie = tuple(r)
[22]:
table = []
for group in linkage_groups:
for i, j in sorted(group):
table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
table.append([''] * 6)
pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)
[22]:
id | given name | surname | dob | phone number | non-linking | |
---|---|---|---|---|---|---|
6426 | 1171 | isabelle | bridgland | 30-03-1994 | 04 5318 6471 | mal4 |
6427 | 1171 | isalolIe | riahgland | 30-02-1994 | 04 5318 6471 | sydnry |
6428 | 1171 | isabelle | bridgland | 30-02-1994 | 04 5318 6471 | 63514.217 |
6429 | ||||||
6430 | 1243 | thmoas | doaldson | 13-04-1900 | 09 6963 1944 | male |
6431 | 1243 | thoma5 | donaldson | 13-04-1900 | 08 6962 1944 | perth |
6432 | 1243 | thomas | donalsdon | 13-04-2900 | 08 6963 2944 | 489229.297 |
6433 | ||||||
6434 | 2207 | annah | aslea | 02-11-2906 | 04 5501 5973 | male |
6435 | 2207 | hannah | easlea | 02-11-2006 | 04 5501 5973 | canberra |
6436 | ||||||
6437 | 5726 | rhys | clarke | 19-05-1929 | 02 9220 9635 | mqle |
6438 | 5726 | ry5 | clarke | 19-05-1939 | 02 9120 9635 | |
6439 | 5726 | rhys | klark | 19-05-2938 | 02 9220 9635 | 118197.119 |
6440 |
[1]:
import csv
import itertools
import os
import requests
Entity Service: Multiparty linkage demo¶
This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.
We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.
Check the status of the Entity Service¶
Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.
[2]:
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())
{'project_count': 10, 'rate': 20496894, 'status': 'ok'}
{'anonlink': '0.11.2', 'entityservice': 'v1.11.0', 'python': '3.6.8'}
Create a new project¶
We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the groups output type supports multiparty linkage). Retain the project_id, so we can find the project later. Also retain the result_token, so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the update_tokens identify the five data providers and permit them to upload CLKs.
[3]:
project_info = requests.post(
f"{PREFIX}/projects",
json={
"schema": {},
"result_type": "groups",
"number_parties": 5,
"name": "example project"
}
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]
print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)
project_id: 8eeb1050f5add8f78ff4a0da04219fead48f22220fb0f15e
result_token: c8f22b577aac9432871eeea02cbe504d399a9776add1de9f
update_tokens: ['6bf0f1c84c17116eb9f93cf8a4cfcb13d49d288a1f376dd8', '4b9265070849af1f0546f2adaeaa85a7d0e60b10f9b4afbc', '3ff03cadd750ce1b40cc4ec2b99db0132f62d8687328eeb9', 'c1b562ece6bbef6cd1a0541301bb1f82bd697bce04736296', '8cfdebbe12c65ae2ff20fd0c0ad5de4feb06c9a9dd1209c8']
Upload the hashed data¶
This is where each party uploads their CLKs into the service. Here, we do the work of all five data providers inside this for loop. In a deployment scenario, each data provider would be uploading their own CLKs using their own update token.
These CLKs are already hashed using clkhash, so for each data provider, we just need to upload their corresponding hash file.
[4]:
for i, token in enumerate(update_tokens, start=1):
with open(f"data/clks-{i}.json") as f:
r = requests.post(
f"{PREFIX}/projects/{project_id}/clks",
data=f,
headers={
"Authorization": token,
"content-type": "application/json"
}
)
print(f"Data provider {i}: {r.text}")
Data provider 1: {
"message": "Updated",
"receipt_token": "c7d9ba71260863f13af55e12603f8694c29e935262b15687"
}
Data provider 2: {
"message": "Updated",
"receipt_token": "70e4ed1b403c4e628183f82548a9297f8417ca3de94648bf"
}
Data provider 3: {
"message": "Updated",
"receipt_token": "b56fe568b93dc4522444e503078e16c18573adecbc086b6a"
}
Data provider 4: {
"message": "Updated",
"receipt_token": "7e3c80e554cfde23847d9aa2cff1323aa8f411e4033c0562"
}
Data provider 5: {
"message": "Updated",
"receipt_token": "8bde91367ee52b5c6804d5ce2d2d3350ce3c3766b8625bbc"
}
Begin a run¶
The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.
Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.
[5]:
r = requests.post(
f"{PREFIX}/projects/{project_id}/runs",
headers={
"Authorization": result_token
},
json={
"threshold": 0.8
}
)
run_id = r.json()["run_id"]
Check the status¶
Let’s see whether the run has finished (‘state’ is ‘completed’)!
[6]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
headers={
"Authorization": result_token
}
)
r.json()
[6]:
{'current_stage': {'description': 'waiting for CLKs',
'number': 1,
'progress': {'absolute': 5,
'description': 'number of parties already contributed',
'relative': 1.0}},
'stages': 3,
'state': 'queued',
'time_added': '2019-06-23T11:17:27.646642+00:00',
'time_started': None}
Now after some delay (depending on the size) we can fetch the results. Waiting for completion can be achieved by directly polling the REST API using requests, however for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.
[7]:
import clkhash.rest_client
from IPython.display import clear_output
for update in clkhash.rest_client.watch_run_status(SERVER, project_id, run_id, result_token, timeout=30):
clear_output(wait=True)
print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
Retrieve the results¶
We retrieve the results of the linkage. As we selected earlier, the result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party id and the row index.
The last 20 groups look like this.
[8]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
headers={
"Authorization": result_token
}
)
groups = r.json()
groups['groups'][-20:]
[8]:
[[[0, 3127], [3, 3145], [2, 3152], [1, 3143]],
[[2, 1653], [3, 1655], [1, 1632], [0, 1673], [4, 1682]],
[[0, 2726], [1, 2737], [3, 2735]],
[[1, 837], [3, 864]],
[[0, 1667], [4, 1676], [1, 1624], [3, 1646]],
[[1, 1884], [2, 1911], [4, 1926], [0, 1916]],
[[0, 192], [2, 198]],
[[3, 328], [4, 330], [0, 350], [2, 351], [1, 345]],
[[2, 3173], [4, 3176], [3, 3163], [0, 3145], [1, 3161]],
[[1, 347], [4, 332], [2, 353], [0, 352]],
[[1, 736], [3, 761], [2, 768], [0, 751], [4, 754]],
[[1, 342], [2, 349]],
[[3, 899], [2, 913]],
[[1, 465], [3, 477]],
[[0, 285], [1, 293]],
[[0, 785], [3, 794]],
[[3, 2394], [4, 2395], [0, 2395]],
[[1, 1260], [2, 1311], [3, 1281], [4, 1326]],
[[0, 656], [2, 663]],
[[1, 2468], [2, 2479]]]
To sanity check, we print their records’ corresponding PII:
[17]:
def load_dataset(i):
dataset = []
with open(f"data/dataset-{i}.csv") as f:
reader = csv.reader(f)
next(reader) # ignore header
for row in reader:
dataset.append(row[1:])
return dataset
datasets = list(map(load_dataset, range(1, 6)))
for group in itertools.islice(groups["groups"][-20:], 20):
for (i, j) in group:
print(i, datasets[i][j])
print()
0 ['samual', 'mason', '05-12-1917', 'male', 'pertb', '405808.756', '07 2284 3649']
3 ['samuAl', 'mason', '05-12-1917', 'male', 'peryh', '4058o8.756', '07 2274 3549']
2 ['samie', 'mazon', '05-12-1917', 'male', '', '405898.756', '07 2275 3649']
1 ['zamusl', 'mason', '05-12-2917', 'male', '', '405898.756', '07 2274 2649']
2 ['thomas', 'burfrod', '08-04-1999', '', 'pertj', '182174.209', '02 3881 9666']
3 ['thomas', 'burfrod', '09-04-1999', 'male', '', '182174.209', '02 3881 9666']
1 ['thomas', 'burford', '08-04-19o9', 'mal4', '', '182175.109', '02 3881 9666']
0 ['thomas', 'burford', '08-04-1999', 'male', 'perth', '182174.109', '02 3881 9666']
4 ['thomas', 'burf0rd', '08-04-q999', 'mske', 'perrh', '182174.109', '02 3881 9666']
0 ['kaitlin', 'bondza', '03-08-1961', 'male', 'sydney', '41168.999', '02 4632 1380']
1 ['kaitlin', 'bondja', '03-08-1961', 'malr', 'sydmey', '41168.999', '02 4632 1370']
3 ["k'latlin", 'bonklza', '03-08-1961', 'male', 'sydaney', '', '02 4632 1380']
1 ['chr8stian', 'jolly', '22-08-2009', 'male', '', '178371.991', '04 5868 7703']
3 ['chr8stian', 'jolly', '22-09-2099', 'malr', 'melbokurne', '178271.991', '04 5868 7703']
0 ['oaklrigh', 'ngvyen', '24-07-1907', 'mslr', 'sydney', '63175.398', '04 9019 6235']
4 ['oakleith', 'ngvyen', '24-97-1907', 'male', 'sydiney', '63175.498', '04 9019 6235']
1 ['oajleigh', 'ngryen', '24-07-1007', 'male', 'sydney', '63175.498', '04 9919 6235']
3 ['oakleigh', 'nguyrn', '34-07-1907', 'male', 'sbdeney', '63175.r98', '04 9019 6235']
1 ['georgia', 'nguyen', '06-11-1930', 'male', 'perth', '247847.799', '08 6560 4063']
2 ['georia', 'nfuyen', '06-11-1930', 'male', 'perrh', '247847.799', '08 6560 4963']
4 ['geortia', 'nguyea', '06-11-1930', 'male', 'pertb', '247847.798', '08 6560 4063']
0 ['egorgia', 'nguyqn', '06-11-1930', 'male', 'peryh', '247847.799', '08 6460 4963']
0 ['connor', 'mcneill', '05-09-1902', 'male', 'sydney', '108473.824', '02 6419 9472']
2 ['connro', 'mcnell', '05-09-1902', 'male', 'sydnye', '108474.824', '02 6419 9472']
3 ['alessandria', 'sherriff', '25-91-1951', 'male', 'melb0urne', '5224r.762', '03 3077 2019']
4 ['alessandria', 'sherriff', '25-01-1951', 'male', 'melbourne', '52245.762', '03 3077 1019']
0 ['alessandria', "sherr'lff", '25-01-1951', 'malr', 'melbourne', '', '03 3977 1019']
2 ['alessandria', 'shernff', '25-01-1051', 'mzlr', 'melbourne', '52245.663', '03 3077 1019']
1 ['alessandrya', 'sherrif', '25-01-1961', 'male', 'jkelbouurne', '52245.762', '03 3077 1019']
2 ['harriyon', 'micyelmor', '21-04-1971', 'male', 'pert1>', '291889.942', '04 5633 5749']
4 ['harri5on', 'micyelkore', '21-04-1971', '', 'pertb', '291880.942', '04 5633 5749']
3 ['hariso17', 'micelmore', '21-04-1971', 'male', 'pertb', '291880.042', '04 5633 5749']
0 ['harrison', 'michelmore', '21-04-1981', 'malw', 'preth', '291880.942', '04 5643 5749']
1 ['harris0n', 'michelmoer', '21-04-1971', '', '', '291880.942', '04 5633 5749']
1 ['alannah', 'gully', '15-04-1903', 'make', 'meobourne', '134518.814', '04 5104 4572']
4 ['alana', 'gully', '15-04-1903', 'male', 'melbourne', '134518.814', '04 5104 4582']
2 ['alama', 'gulli', '15-04-1903', 'mald', 'melbourne', '134518.814', '04 5104 5582']
0 ['alsna', 'gullv', '15-04-1903', 'male', '', '134518.814', '04 5103 4582']
1 ['sraah', 'bates-brownsword', '26-11-1905', 'malr', '', '59685.979', '03 8545 5584']
3 ['sarah', 'bates-brownswort', '26-11-1905', 'male', '', '59686.879', '03 8545 6584']
2 ['sara0>', 'bates-browjsword', '26-11-1905', 'male', '', '59685.879', '']
0 ['saran', 'bates-brownsvvord', '26-11-1905', 'malr', 'sydney', '59685.879', '03 8555 5584']
4 ['snrah', 'bates-bro2nsword', '26-11-1005', 'male', 'sydney', '58685.879', '03 8545 5584']
1 ['beth', 'lette', '18-01-2000', 'female', 'sydney', '179719.049', '07 1868 6031']
2 ['beth', 'lette', '18-02-2000', 'femal4', 'stdq7ey', '179719.049', '07 1868 6931']
3 ['tahlia', 'bishlp', '', 'female', 'sydney', '101203.290', '03 886u 1916']
2 ['ahlia', 'bishpp', '', 'female', 'syriey', '101204.290', '03 8867 1916']
1 ['fzachary', 'mydlalc', '20-95-1916', 'male', 'sydney', '121209.129', '08 3807 4717']
3 ['zachary', 'mydlak', '20-05-1016', 'malr', 'sydhey', '121200.129', '08 3807 4627']
0 ['jessica', 'white', '04-07-1979', 'male', 'perth', '385632.266', '04 8026 8748']
1 ['jezsica', 'whi5e', '05-07-1979', 'male', 'perth', '385632.276', '04 8026 8748']
0 ['beriiamin', 'musoluno', '21-0y-1994', 'female', 'sydney', '81857.391', '08 8870 e498']
3 ['byenzakin', 'musoljno', '21-07-1995', 'female', 'sydney', '81857.392', '']
3 ['ella', 'howie', '26-03-2003', 'male', 'melbourne', '97556.316', '03 3655 1171']
4 ['ela', 'howie', '26-03-2003', 'male', 'melboirne', '', '03 3555 1171']
0 ['lela', 'howie', '26-03-2903', 'male', 'melbourhe', '', '03 3655 1171']
1 ['livia', 'riaj', '13-03-1907', 'malw', 'melbovrne', '73305.107', '07 3846 2530']
2 ['livia', 'ryank', '13-03-1907', 'malw', 'melbuorne', '73305.107', '07 3946 2630']
3 ['ltvia', 'ryan', '13-03-1907', 'maoe', 'melbourne', '73305.197', '07 3046 2530']
4 ['livia', 'ryan', '13-03-1907', 'male', 'melbourne', '73305.107', '07 3946 2530']
0 ['coby', 'ibshop', '', 'msle', 'sydney', '211655.118', '02 0833 7777']
2 ['coby', 'bishop', '15-08-1948', 'male', 'sydney', '211655.118', '02 9833 7777']
1 ['emjkly', 'pareemore', '01-03-2977', 'female', 'rnelbourne', '1644487.925', '03 5761 5483']
2 ['emiily', 'parremore', '01-03-1977', 'female', 'melbourne', '1644487.925', '03 5761 5483']
Despite the high amount of noise in the data, the entity service was able to produce a fairly accurate matching. However, Isabella George and Mia/Talia Galbraith are most likely not an actual match.
We may be able to improve on these results by fine-tuning the hashing schema or by changing the threshold.
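For example, a second run with a different threshold can be created on the same project without re-uploading the encodings. The following is an illustrative sketch only: it reuses the PREFIX, project_id and result_token variables (and the requests import) already defined in this notebook, and assumes the run-creation request accepts threshold and name fields, matching the run parameters shown in the command line example later in this document.
new_run = requests.post(
    f"{PREFIX}/projects/{project_id}/runs",
    headers={"Authorization": result_token},
    # assumed request body: the same parameters used when creating the first run
    json={"threshold": 0.85, "name": "run with a different threshold"},
)
print(new_run.status_code, new_run.json())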
Delete the project¶
[18]:
r = requests.delete(
f"{PREFIX}/projects/{project_id}",
headers={
"Authorization": result_token
}
)
print(r.status_code)
204
External Tutorials¶
The clkhash library includes a tutorial on carrying out record linkage on perturbed data: http://clkhash.readthedocs.io/en/latest/tutorial_cli.html
Command line example¶
This brief example shows using clkutil
- the command line tool that is packaged with the
clkhash
library. It is not a requirement to use clkhash
with the Entity Service REST API.
We assume you have access to a command line prompt with Python and Pip installed.
Install clkhash
:
$ pip install clkhash
Generate and split some mock personally identifiable data:
$ clkutil generate 2000 raw_pii_2k.csv
$ head -n 1 raw_pii_2k.csv > alice.txt
$ tail -n 1500 raw_pii_2k.csv >> alice.txt
$ head -n 1000 raw_pii_2k.csv > bob.txt
A corresponding hashing schema can be generated as well:
$ clkutil generate-default-schema schema.json
Process the personally identifying data into Cryptographic Longterm Key:
$ clkutil hash alice.txt horse staple schema.json alice-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 1.50K/1.50K [00:00<00:00, 6.69Kclk/s, mean=522, std=34.4]
CLK data written to alice-hashed.json
$ clkutil hash bob.txt horse staple schema.json bob-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 999/999 [00:00<00:00, 5.14Kclk/s, mean=520, std=34.2]
CLK data written to bob-hashed.json
Now to interact with an Entity Service. First check that the service is healthy and responds to a status check:
$ clkutil status --server https://testing.es.data61.xyz
{"rate": 53129, "status": "ok", "project_count": 1410}
Then create a new linkage project and set the output type (to mapping
):
$ clkutil create-project \
--server https://testing.es.data61.xyz \
--type mapping \
--schema schema.json \
--output credentials.json
The entity service replies with a project id and credentials which get saved into the file credentials.json
.
The contents are two upload tokens and a result token:
{
"update_tokens": [
"21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55",
"3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905"
],
"project_id": "809b12c7e141837c3a15be758b016d5a7826d90574f36e74",
"result_token": "230a303b05dfd186be87fa65bf7b0970fb786497834910d1"
}
These credentials get substituted in the following commands. Each CLK dataset gets uploaded to the Entity Service:
$ clkutil upload --server https://testing.es.data61.xyz \
--apikey 21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
alice-hashed.json
{"receipt_token": "05ac237462d86bc3e2232ae3db71d9ae1b9e99afe840ee5a", "message": "Updated"}
$ clkutil upload --server https://testing.es.data61.xyz \
--apikey 3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
bob-hashed.json
{"receipt_token": "6d9a0ee7fc3a66e16805738097761d38c62ea01a8c6adf39", "message": "Updated"}
Now we can compute mappings using various thresholds. For example to only see relationships where the
similarity is above 0.9
:
$ clkutil create --server https://testing.es.data61.xyz \
--apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
--name "Tutorial mapping run" \
--threshold 0.9
{"run_id": "31a6d3c775151a877dcac625b4b91a6659317046ea45ad11", "notes": "Run created by clkhash 0.11.2", "name": "Tutorial mapping run", "threshold": 0.9}
After a small delay the mapping will have been computed and we can use clkutil
to retrieve the
results:
$ clkutil results --server https://testing.es.data61.xyz \
--apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
--project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
--run 31a6d3c775151a877dcac625b4b91a6659317046ea45ad11
State: completed
Stage (3/3): compute output
Downloading result
Received result
{
"mapping": {
"0": "500",
"1": "501",
"10": "510",
"100": "600",
"101": "601",
"102": "602",
"103": "603",
"104": "604",
"105": "605",
"106": "606",
"107": "607",
...
This mapping output tells us which rows of Alice’s and Bob’s data sets have similarity above our threshold.
Looking at the first two entities in Alice’s data:
head alice.txt -n 3
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M
And looking at the corresponding 500th and 501st entities in Bob’s data:
tail -n 499 bob.txt | head -n 2
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M
Concepts¶
Cryptographic Longterm Key¶
A Cryptographic Longterm Key is the name given to a Bloom filter used as a privacy preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity - meaning two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.
CLKs are created independently of the entity service, following a keyed hashing process.
A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service’s api.
Note
The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.
Bloom Filter Format¶
A Bloom filter is simply an encoding of PII as a bitarray.
This can easily be represented as bytes (each being an 8 bit number between 0 and 255). We serialize by base64 encoding the raw bytes of the bit array.
An example with a 64 bit filter:
# bloom filters binary value
'0100110111010000101111011111011111011000110010101010010010100110'
# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]
# which gets base64 encoded to
'TdC999jKpKY=\n'
As with standard Base64 encodings, a newline is introduced every 76 characters.
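To make the serialization concrete, here is a small Python sketch (not part of the service) that reproduces the example above using only the standard library:
import base64

# the 64 bit bloom filter from the example above, as a string of bits
bits = '0100110111010000101111011111011111011000110010101010010010100110'

# pack the bits into bytes, 8 bits per byte, most significant bit first
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
assert list(raw) == [77, 208, 189, 247, 216, 202, 164, 166]

# base64 encode; encodebytes inserts newlines every 76 characters of output
encoded = base64.encodebytes(raw)
assert encoded == b'TdC999jKpKY=\n'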
Schema¶
It is important that participating organisations agree on how personally identifiable information is processed to create the CLKs. We call the configuration for creating CLKs a linkage schema. The organisations have to agree on a schema to ensure their CLKs are comparable.
The linkage schema is documented in clkhash, our reference implementation written in Python.
Note
Due to the one way nature of hashing, the entity service can’t determine whether the linkage schema was followed when clients generated CLKs.
Comparing Cryptographic Longterm Keys¶
The similarity metric used is the Sørensen–Dice index - although this may become a configurable option in the future.
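As an illustration of the metric (a minimal sketch, not the anonlink implementation), the Sørensen–Dice index of two bit arrays is twice the number of common set bits divided by the total number of set bits:
def dice_coefficient(clk_a: int, clk_b: int) -> float:
    """Sørensen–Dice index of two CLKs represented as Python ints (bit arrays)."""
    common = bin(clk_a & clk_b).count('1')
    total = bin(clk_a).count('1') + bin(clk_b).count('1')
    return 2 * common / total if total else 0.0

a = 0b0100110111010000
b = 0b0100110111010010   # differs from a in a single bit position
print(dice_coefficient(a, a))   # 1.0 - identical CLKs
print(dice_coefficient(a, b))   # close to 1.0 - similar CLKs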
Output Types¶
The Entity Service supports different result types which affect what output is produced, and who may see the output.
Warning
The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.
Similarity Score¶
Similarity scores are computed between all CLKs in each organisation - the scores above a given threshold are returned. This output type is currently the only way to work with 1 to many relationships.
The result_token
(generated when creating the mapping) is required. The result_type
should
be set to "similarity_scores"
.
Results are a simple JSON array of arrays:
[
[index_a, index_b, score],
...
]
Where the index values will be the 0 based row index from the uploaded CLKs, and
the score will be a Number between the provided threshold and 1.0
.
A score of 1.0
means the CLKs were identical. Threshold values are usually between
0.5
and 1.0
.
Note
The maximum number of results returned is the product of the two data set lengths.
For example:
Comparing two data sets each containing 1 million records with a threshold of 0.0 will return 1 trillion results (1e+12).
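A client will usually post-process the similarity scores locally. The snippet below is a minimal sketch assuming scores holds the decoded JSON array described above; it keeps only the highest scoring candidate in dataset B for each row of dataset A:
scores = [
    [0, 500, 0.97],
    [0, 612, 0.81],   # the same row of A can appear in several candidate pairs
    [1, 501, 0.93],
]

best = {}
for index_a, index_b, score in scores:
    if score > best.get(index_a, (None, 0.0))[1]:
        best[index_a] = (index_b, score)

print(best)   # {0: (500, 0.97), 1: (501, 0.93)}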
Direct Mapping Table¶
The direct mapping takes the similarity scores and simply assigns the highest scores as links.
The links are exposed as a lookup table using indices from the two organizations:
{
index_a: index_b,
...
}
The result_token
(generated when creating the mapping) is required to retrieve the results. The
result_type
should be set to "mapping"
.
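A small sketch of consuming this output (note the indices are returned as strings, as in the tutorial output above):
mapping = {"0": "500", "1": "501", "10": "510"}   # excerpt of a retrieved result

# pair up the two parties' row indices as integers
pairs = sorted((int(a), int(b)) for a, b in mapping.items())
print(pairs)   # [(0, 500), (1, 501), (10, 510)]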
Permutation and Mask¶
This protocol creates a random reordering for both organizations; and creates a mask revealing where the reordered rows line up.
Accessing the mask requires the result_token
, and accessing the permutation requires a
receipt-token
(provided to each organization when they upload data).
Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
The result_type
should be set to "permutations"
.
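The following is a minimal sketch (under the assumption that permutation[i] gives the new position of row i) of how a party could apply its permutation and how the mask then marks the aligned rows:
def apply_permutation(rows, permutation):
    """Reorder rows so that row i ends up at position permutation[i]."""
    out = [None] * len(rows)
    for i, new_position in enumerate(permutation):
        out[new_position] = rows[i]
    return out

rows_a = ['a0', 'a1', 'a2', 'a3']
perm_a = [2, 0, 3, 1]        # hypothetical permutation received by party A
mask   = [1, 0, 1, 1]        # 1 where the permuted rows of A and B line up

permuted_a = apply_permutation(rows_a, perm_a)
print(permuted_a)                                   # ['a1', 'a3', 'a0', 'a2']
print([r for r, m in zip(permuted_a, mask) if m])   # rows of A with a match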
Security¶
The service isn’t given any personally identifying information in raw form - rather clients must locally compute a CLK which is a hashed version of the data to be linked.
Considerations for each output type¶
Direct Mapping Table¶
The default output of the Entity Service comprises a list of edges - connections between rows in dataset A to rows in dataset B. This assumes at most a 1-1 correspondence - each entity will only be present in zero or one edge.
This output is only available to the client who created the mapping, but it is worth highlighting that it does (by design) leak information about the intersection of the two sets of entities.
Knowledge about set intersection: This output contains information about which particular entities are shared, and which are not. Knowing the overlap between the organizations is potentially disclosive. This is mitigated by the unique authorization codes generated for each mapping, which are required to retrieve the results.
Row indices exposed: The output directly exposes the row indices provided to the service, which if not randomized may be disclosive. For example, entities simply exported from a database might be ordered by age, patient admittance date, salary band etc.
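One possible mitigation (a sketch, not a feature of the service) is for each data provider to shuffle their records before encoding and uploading, keeping the shuffled file locally so results can be mapped back. The file name below is hypothetical:
import csv
import random

# hypothetical input file of PII records with a header row
with open('pii.csv') as f:
    rows = list(csv.reader(f))
header, records = rows[0], rows[1:]

random.shuffle(records)   # row order no longer reflects age, admission date, ...

with open('pii_shuffled.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(records)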
Similarity Score¶
All calculated similarities (above a given threshold) between entities are returned. This output comprises a list of weighted edges - similarity between rows in dataset A to rows in dataset B. This is a many to many relationship where entities can appear in multiple edges.
Recovery from the distance measurements: This output type includes the plaintext distance measurements between entities; this additional information can be used to fingerprint individual entities based on their ordered similarity scores. In combination with public information this can lead to recovery of identity. This attack is described in section 3 of Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics’ Privacy-Preserving Record Linkage by Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague.
In order to prevent this attack it is important not to provide the similarity table to untrusted parties.
Permutation and Mask¶
This output type involves creating a random reordering of the entities for both organizations; and creating a binary mask vector revealing where the reordered rows line up. This output is designed for use in multi-party computation algorithms.
This mitigates the Knowledge about set intersection problem from the direct mapping output - assuming the mask is not made available to the data providers.
Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
Authentication / Authorization¶
The entity service does not yet support authentication; this is planned for a future version.
All sensitive data is protected by token-based authorization. That is, you need to provide the correct token to access different resources. A token is a unique random 192 bit string.
There are three different types of tokens:
- update_token: required to upload a party’s CLKs.
- result_token: required to access the result of the entity resolution process. This is, depending on the output type, either similarity scores, a direct mapping table, or a mask.
- receipt-token: this token is returned to either party after uploading their respective CLKs. With this receipt-token they can then access their respective permutations, if the output type of the mapping is set to permutation and mask.
Important
These tokens are the only artifacts that protect the sensitive data. Therefore it is paramount to make sure that only authorized parties have access to these tokens!
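For example, accessing a project resource with the requests library only requires supplying the appropriate token in the Authorization header. The identifiers below are placeholders, reusing the values from the command line example above:
import requests

url = "https://testing.es.data61.xyz/api/v1/"
project_id = "809b12c7e141837c3a15be758b016d5a7826d90574f36e74"      # placeholder
result_token = "230a303b05dfd186be87fa65bf7b0970fb786497834910d1"    # placeholder

r = requests.get(url + f"projects/{project_id}",
                 headers={"Authorization": result_token})
# an invalid token and an unknown resource both yield a 403 response
print(r.status_code)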
Attack Vectors¶
The following attack vectors need to be considered for all output types.
Stealing/Leaking uploaded CLKs
The uploaded CLKs for one organization could be leaked to the partner organization - who possesses the HMAC secret - breaking semantic security. The entity service doesn’t expose an API that allows users to access any CLKs; the object store (MINIO or S3) and the database (postgresql) are configured to not allow public access.
Deployment¶
Local Deployment¶
Dependencies¶
Docker and docker-compose
Build¶
From the project folder, run:
./tools/build.sh
This will create the docker images tagged with latest, which are used by docker-compose
.
Run¶
Run docker compose:
docker-compose -p n1es -f tools/docker-compose.yml up
This will start the following containers:
- nginx frontend (named n1es_nginx_1)
- gunicorn/flask backend (named n1es_backend_1)
- celery backend worker (named n1es_worker_1)
- postgres database (named n1es_db_1)
- redis job queue (named n1es_redis_1)
- minio object store
- jaeger opentracing
The REST api for the service is exposed on port 8851
of the nginx container, which docker
will map to a high numbered port on your host.
The address of the nginx endpoint can be found with:
docker port n1es_nginx_1 "8851"
For example to GET the service status:
$ export ENTITY_SERVICE=`docker port n1es_nginx_1 "8851"`
$ curl $ENTITY_SERVICE/api/v1/status
{
"status": "ok",
"number_mappings": 0,
"rate": 1
}
The service can be taken down by hitting CTRL+C. This doesn’t clear the DB volumes, which will persist and conflict with the next call to docker-compose … up unless they are removed. Removing these volumes is easy; just run:
docker-compose -p n1es -f tools/docker-compose.yml down -v
in between calls to docker-compose … up.
Monitoring¶
A celery monitor tool flower is also part of the docker-compose file - this graphical interface allows administration and monitoring of the celery tasks and workers. Access this via the monitor container.
Testing with docker-compose¶
An additional docker-compose config file can be found in ./tools/ci.yml; this can be added in to run along with the rest of the service:
docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml up -d
docker logs -f n1estest_tests_1
docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml down
Docker Compose Tips¶
A collection of development tips.
Volumes¶
You might need to destroy the docker volumes used for the object store and the postgres database:
docker-compose -f tools/docker-compose.yml rm -s -v [-p <project-name>]
Restart one service¶
Docker compose can modify an existing deployment; this can be particularly effective when you modify and rebuild the backend and want to restart it without changing anything else:
docker-compose -f tools/docker-compose.yml up -d --no-deps es_backend
Scaling¶
You can run additional worker containers by scaling with docker-compose:
docker-compose -f tools/docker-compose.yml scale es_worker=2
Mix and match docker compose¶
During development you can run the redis and database containers with docker-compose, and directly run the celery and flask applications with Python.
docker-compose -f tools/docker-compose.yml run es_db
docker-compose -f tools/docker-compose.yml run es_redis
Production deployment¶
Production deployment assumes a multi node Kubernetes cluster.
The entity service has been deployed to kubernetes clusters on Azure, GCE, minikube and AWS. The system has been designed to scale across multiple nodes and handle node failure without data loss.
At a high level the main custom components are:
- ES App - a gunicorn/flask backend web service hosts the REST api
- Entity Match Worker instances - uses celery for task scheduling
The components that are used in support are:
- Postgresql database holds all match metadata
- Redis is used for the celery job queue and as a cache
- An object store (e.g. AWS S3, or Minio) stores the raw CLKs, intermediate files, and results.
- nginx provides upload buffering and request rate limiting.
- An ingress controller (e.g. nginx-ingress/traefik) provides TLS termination.
The rest of this document goes into how to deploy in a production setting.
Provision a Kubernetes cluster¶
Creating a Kubernetes cluster is out of scope for this documentation.
Hardware requirements
Recommended AWS worker instance type
is r3.4xlarge
- spot instances are fine as we handle node failure. The
number of nodes depends on the size of the expected jobs, as well as the
memory on each node. For testing we recommend starting with at least two nodes, with each
node having at least 8 GiB of memory and 2 vCPUs.
Software to interact with the cluster
You will need to install the kubectl command line tool and helm.
Install Helm¶
The entity service system has been packaged using helm; there is a client program that needs to be installed.
At the very least you will need to install tiller into the cluster:
helm init
Ingress Controller¶
We assume the cluster has an ingress controller; if this isn’t the case, first add one. We suggest using Traefik or NGINX Ingress Controller. Both can be installed using helm.
Deploy the system¶
Helm can be used to deploy the system to a kubernetes cluster.
From the deployment/entity-service directory pull the dependencies:
helm dependency update
Configuring the deployment¶
Create a new blank yaml file to hold your custom deployment settings my-deployment.yaml
.
Carefully read through the default values.yaml
file and override any values in your deployment
configuration file.
At a minimum consider setting up an ingress by changing api.ingress
, change the number of
workers in workers.replicaCount
(and possibly workers.highmemory.replicaCount
), check
you’re happy with the workers’ cpu and memory limits in workers.resources
, and finally set
the credentials:
- global.postgresql.postgresqlPassword
- redis.password (and redis-ha.redisPassword if provisioning redis)
- minio.accessKey and minio.secretKey
Installation¶
To install the whole system execute:
cd deployment
helm install entityservice --name="anonlink" --values my-deployment.yaml
This can take several minutes the first time you deploy to a new cluster.
Run integration tests and an end to end test¶
Update the server url by editing the jobs/integration-test-job.yaml
file then create a
new job on the cluster:
kubectl create -f jobs/integration-test-job.yaml
To view the celery monitor:¶
Note the monitor must be enabled at deployment. Find the pod that the celery monitor is running on then forward the port. For example:
$ kubectl get -n default pod --selector=run=celery-monitor -o jsonpath='{.items..metadata.name}'
entityservice-monitor-4045544268-s34zl
$ kubectl port-forward entityservice-monitor-4045544268-s34zl 8888:8888
Upgrade Deployment with Helm¶
Updating a running chart is usually straightforward. For example if the release is called
anonlink
in namespace testing
execute the following to increase the number of workers
to 20:
helm upgrade anonlink entity-service --namespace=testing --set workers.replicas="20"
However, note you may wish to instead keep all configurable values in a yaml file and track that in version control.
Minimal Deployment¶
To run with minikube for local testing we have provided a minimal-values.yaml
file that will
set very small resource limits. Install the minimal system with:
helm install entity-service --name="mini-es" --values entity-service/minimal-values.yaml
Database Deployment Options¶
At deployment time you must set the postgresql password in global.postgresql.postgresqlPassword
.
You can decide to deploy a postgres database along with the anonlink entity service or instead use an existing
database. To configure a deployment to use an external postgres database, simply set provision.postgresql
to false
, set the database server in postgresql.nameOverride
, and add credentials to the
global.postgresql
section.
Object Store Deployment Options¶
At deployment time you can decide to deploy MINIO or instead use an existing service such as AWS S3.
Note that there is a trade-off between using a local deployment of minio vs S3. In our AWS based experimentation Minio is noticeably faster, but more expensive and less reliable than AWS S3; your own mileage may vary.
To configure a deployment to use an external object store, set provision.minio
to false
and add
appropriate connection configuration in the minio
section. For example to use AWS S3 simply provide your access
credentials (and disable provisioning minio):
helm install entity-service --name="es-s3" --set provision.minio=false --set minio.accessKey=XXX --set minio.secretKey=YYY --set minio.bucket=<bucket>
Redis Deployment Options¶
At deployment time you can decide to provision redis using our chart, or instead use an existing redis installation or managed service. The provisioned redis is a highly available 3 node redis cluster using the redis-ha helm chart. Directly connecting to redis, and discovery via the sentinel protocol are supported. When using sentinel protocol for redis discovery read only requests are dispatched to redis replicas.
Carefully read the comments in the redis
section of the default values.yaml
file.
To use a separate install of redis using the server shared-redis-ha-redis-ha.default.svc.cluster.local
:
helm install entity-service --name="es-shared-redis" \
--set provision.redis=false \
--set redis.server=shared-redis-ha-redis-ha.default.svc.cluster.local \
--set redis.use_sentinel=true
Uninstalling¶
To uninstall a release called es
in the default namespace:
helm del es
Or if the anonlink-entity-service has been installed into its own namespace you can simply delete
the whole namespace with kubectl
:
kubectl delete namespace miniestest
Deployment Risks¶
The purpose of this document is to record known deployment risks of the entity service and our mitigations. It references the OWASP 2017 Top 10 security risks: https://www.owasp.org/index.php/Top_10-2017_Top_10
Risks¶
Unauthorized user accesses results¶
A6 - Security misconfiguration.
A2 - Broken authentication.
A5 - Broken access control.
Authorized user attacks the system¶
A10 - Insufficient Logging & Monitoring
A3 - Sensitive Data Exposure
An admin can access the raw CLKs uploaded by both parties. However, a standard user cannot.
User coerces N1 to execute attacking code¶
Insecure deserialization. Compromised shared host.
An underlying component has a vulnerability¶
Dependencies including anonlink could have vulnerabilities.
Development¶
Changelog¶
Version 1.11.2¶
- Switch to Azure Devops pipeline for CI.
- Switch to docker hub for container hosting.
Version 1.11.1¶
- Include multiparty linkage tutorial/example.
- Tightened up how we use a database connection from the flask app.
- Deployment and logging documentation updates.
Version 1.11.0¶
- Adds support for multiparty record linkage.
- Logging is now configurable from a file.
Other improvements¶
- Another tutorial for directly using the REST api was added.
- K8s deployment updated to use 3.15.0 Postgres chart. Postgres configuration now uses a global namespace so subcharts can all use the same configuration as documented here.
- Jenkins testing now fails if the benchmark exits incorrectly or if the benchmark results contain failed results.
- Jenkins will now execute the tutorials notebooks and fail if any cells error.
Version 1.10.0¶
- Updates Anonlink and switches to using Anonlink’s default format for serialization of similarity scores.
- Sorts similarity scores before solving, improving accuracy.
- Uses Anonlink’s new API for similarity score computation and solving.
- Add support for using an external Postgres database.
- Added optional support for redis discovery via the sentinel protocol.
- Kubernetes deployment no longer includes a default postgres password. Ensure that you set your own postgresqlPassword.
- The Kubernetes deployment documentation has been extended.
Version 1.9.4¶
- Introduces configurable logging of HTTP headers.
- Dependency issue resolved.
Version 1.9.3¶
- Redis can now be used in highly available mode. Includes upstream fix where the redis sentinels crash.
- The custom kubernetes certificate management templates have been removed.
- Minor updates to the kubernetes resources. No longer using beta apis.
Version 1.9.2¶
- 2 race conditions have been identified and fixed.
- Integration tests are sped up and more focused. The test suite now fails after the first test failure.
- Code tidy-ups to be more pep8 compliant.
Version 1.9.1¶
- Adds support for (almost) arbitrary sized encodings. A minimum and maximum can be set at deployment time, and currently anonlink requires the size to be a multiple of 8.
- Adds support for opentracing with Jaeger.
- improvements to the benchmarking container
- internal refactoring of tasks
Version 1.9.0¶
- minio and redis services are now optional for kubernetes deployment.
- Introduction of a high memory worker and associated task queue.
- Fix issue where we could start tasks twice.
- Structlog now used for celery workers.
- CI now tests a kubernetes deployment.
- Many Jenkins CI updates and fixes.
- Updates to Jupyter notebooks and docs.
- Updates to Python and Helm chart dependencies and docker base images.
Version 1.8.1¶
Improve system stability while handling large intermediate results. Intermediate results are now stored in files instead of in Redis. This permits us to stream them instead of loading everything into memory.
Version 1.8¶
Version 1.8 introduces breaking changes to the REST API to allow an analyst to reuse uploaded CLKs.
Instead of a linkage project only having one result, we introduce a new sub-resource runs. A project holds the schema and CLKs from all data providers; and multiple runs can be created with different parameters. A run has a status and a result endpoint. Runs can be queued before the CLK data has been uploaded.
We also introduced changes to the result types. The result type permutation, which was producing permutations and an encrypted mask, was removed. And the result type permutation_unecrypyted_mask was renamed to permutations.
Brief summary of API changes:
- the mapping endpoint has been renamed to projects
- To carry out a linkage computation you must post to a project’s runs endpoint: /api/v1/projects/<PROJECT_ID>/runs
- Results are now accessed under the runs endpoint: /api/v1/projects/<PROJECT_ID>/runs/<RUN_ID>/result
- result type permutation_unecrypyted_mask was renamed to permutations
- result type permutation was removed
For all the updated API details check the Open API document.
Other improvements¶
- The documentation is now served at the root.
- The flower monitoring tool for celery is now included with the docker-compose deployment. Note this will be disabled for production deployment with kubernetes by default.
- The docker containers have been migrated to alpine linux to be much leaner.
- Substantial internal refactoring - especially of views.
- Move to pytest for end to end tests.
Version 1.7.3¶
Deployment and documentation sprint.
- Fixes a bug where only the top k results of a chunk were being requested from anonlink. #59 #84
- Updates to helm deployment templates to support a single namespace having multiple entityservices. Helm charts are more standard, some config has moved into a configmap and an experimental cert-manager configuration option has been added. #83, #90
- More sensible logging during testing.
- Every http request now has a (globally configurable) timeout
- Minor update regarding handling uploading empty CLKs. #92
- Update to latest versions of anonlink and clkhash. #94
- Documentation updates.
Version 1.7.2¶
Dependency and deployment updates. We now pin versions of Python, anonlink, clkhash, phe and docker images nginx and postgres.
Version 1.7.0¶
Added a view type that returns similarity scores of potential matches.
Version 1.6.8¶
Scalability sprint.
- Much better chunking of work.
- Security hardening by modifying the response from the server. Now there is no difference between an invalid token and an unknown resource - both return a 403 response status.
- Mapping information includes the time it was started.
- Update and add tests.
- Update the deployment to use Helm.
Road map for the entity service¶
- baseline benchmarking vs known datasets (accuracy and speed), e.g. recordspeed datasets
- blocking
- Schema specification and tooling
- Algorithmic improvements. e.g., implementing canopy clustering solver
- A web front end including authentication and access control
- Uploading multiple hashes per entity. Handle multiple schemas.
- Check how we deal with missing information, old addresses etc
- Semi supervised machine learning methods to learn thresholds
- Handle 1 to many relationships. E.g. familial groups
- Larger scale graph solving methods
- Remove bottleneck of sparse links having to fit in redis.
- improve uploads by allowing direct binary file transfer into object store
- optimise anonlink memory management and C++ code
Bigger Projects:
- consider more than 2 organizations participating in one mapping
- GPU implementation of core similarity scoring
- somewhat homomorphic encryption could be used for similarity score
- consider allowing users to upload raw PII
Implementation Details¶
Components¶
The entity service is implemented in Python and comprises the following components:
- A gunicorn/flask backend that implements the HTTP REST api.
- Celery backend worker/s that do the actual work. This interfaces with the anonlink library.
- An nginx frontend to reverse proxy the gunicorn/flask backend application.
- A Minio object store (large files such as raw uploaded hashes, results)
- A postgres database stores the linking metadata.
- A redis task queue that interfaces between the flask app and the celery backend. Redis also acts as an ephemeral cache.
Each of these has been packaged as a docker image; however, the use of external services (redis, postgres, minio) can be configured through environment variables. Multiple workers can be used to distribute the work beyond one machine - by default all cores will be used for computing similarity scores and encrypting the mask vector.
Continuous Integration Testing¶
We test the service using Jenkins. Every pull request gets deployed in the local configuration using Docker Compose, as well as in the production deployment to kubernetes.
At a high level the testing covers:
- building the docker containers
- deploying using Docker Compose
- testing the tutorial notebooks don’t error
- running the integration tests against the local deployment
- running a benchmark suite against the local deployment
- building and packaging the documentation
- publishing the containers to quay.io
- deploying to kubernetes
- running the integration tests against the kubernetes deployment
All of this is orchestrated using the jenkins pipeline script at Jenkinsfile.groovy. There is one custom library, n1-pipeline, a collection of helpers that we created for common jenkins tasks.
The integration tests currently take around 30 minutes.
Testing Local Deployment¶
The docker compose file tools/ci.yml
is deployed along with tools/docker-compose.yml
. This simply defines an
additional container (from the same backend image) which runs the integration tests after a short delay.
The logs from the various containers (nginx, backend, worker, database) are all collected, archived and are made available in the Jenkins UI for introspection.
Testing K8s Deployment¶
The kubernetes deployment uses helm
with the template found in deployment/entity-service
. Jenkins additionally
defines the docker image versions to use and ensures an ingress is not provisioned. The deployment is configured to be
quite conservative in terms of cluster resources. Currently this logic all resides in Jenkinsfile.groovy
.
The k8s deployment test is limited to 30 minutes and an effort is made to clean up all created resources.
After a few minutes waiting for the deployment a Kubernetes Job is created using kubectl create
.
This job includes a 1GiB
persistent volume claim
to which the results are written (as results.xml
). During the testing the pytest output will be rendered in jenkins,
and then the Job’s pod terminates. We create a temporary pod which mounts the same results volume and then we copy
across the produced artifact for rendering in Jenkins. This dance is only necessary to retrieve files from the cluster
to our Jenkins instance; it would be straightforward if we only wanted the stdout from each pod/job.
Devops¶
Continuous Integration¶
Azure DevOps¶
anonlink-entity-service
is automatically built and tested using Azure DevOps
in the project Anonlink (https://dev.azure.com/data61/Anonlink).
It consists only of a build pipeline (https://dev.azure.com/data61/Anonlink/_build?definitionId=1).
The build pipeline is defined in the script azure-pipelines.yml which uses resources from the folder .azurePipeline.
The continuous integration stages are:
- building and pushing the following docker images:
  - the frontend data61/anonlink-nginx
  - the backend data61/anonlink-app
  - the tutorials data61/anonlink-docs-tutorials (used to test the tutorial Python Notebooks)
  - the benchmark data61/anonlink-benchmark (used to run the benchmark)
- runs the benchmark using docker-compose and publishes the results as an artifact in Azure
- runs the tutorial tests using docker-compose and publishes the results in Azure
- runs the integration tests by deploying the whole service on Kubernetes, running the integration tests and publishing the results in Azure.
The build pipeline is triggered for every push on every branch. It is not triggered by Pull Requests to avoid duplicate testing and building potentially untrusted external code.
The build pipeline requires two environment variables provided by Azure environment:
- dockerHubId: username for the pipeline to push images to Data61 dockerhub
- dockerHubPassword: password for the corresponding username (this is a secret variable).
It also requires a connection to a k8s
cluster to be configured.
Benchmarking¶
In the benchmarking folder is a benchmarking script and associated Dockerfile.
The docker image is published at https://quay.io/repository/n1analytics/entity-benchmark
The container/script is configured via environment variables.
- SERVER: (required) the url of the server.
- EXPERIMENT: json file containing a list of experiments to run. Schema of experiments is defined in ./schema/experiments.json.
- DATA_PATH: path to a directory to store test data (useful to cache).
- RESULT_PATH: full filename to write results file.
- SCHEMA: path to the linkage schema file used when creating projects. If not provided it is assumed to be in the data directory.
- TIMEOUT: this timeout defines the time to wait for the result of a run in seconds. Default is 1200 (20min).
Run Benchmarking Container¶
Run the container directly with docker - substituting configuration information as required:
docker run -it \
-e SERVER=https://testing.es.data61.xyz \
-e RESULTS_PATH=/app/results.json \
quay.io/n1analytics/entity-benchmark:latest
By default the container will pull synthetic datasets from an S3 bucket and run default benchmark experiments
against the configured SERVER
. The default experiments (listed below) are set in
benchmarking/default-experiments.json
.
The output will be printed and saved to a file pointed to by RESULTS_PATH
(e.g. to /app/results.json
).
Cache Volume¶
For speeding up benchmarking when running multiple times you may wish to mount a volume at the DATA_PATH
to store the downloaded test data. Note the container runs as user 1000
, so any mounted volume must be readable
and writable by that user. To create a volume using docker:
docker volume create linkage-benchmark-data
To copy data from a local directory and change owner:
docker run --rm -v `pwd`:/src \
-v linkage-benchmark-data:/data busybox \
sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"
To run the benchmarks using the cache volume:
docker run \
--name ${benchmarkContainerName} \
--network ${networkName} \
-e SERVER=${localserver} \
-e DATA_PATH=/cache \
-e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
-e RESULTS_PATH=/app/results.json \
--mount source=linkage-benchmark-data,target=/cache \
quay.io/n1analytics/entity-benchmark:latest
Experiments¶
Experiments to run can be configured as a simple json document. The default is:
[
{
"sizes": ["100K", "100K"],
"threshold": 0.95
},
{
"sizes": ["100K", "100K"],
"threshold": 0.80
},
{
"sizes": ["100K", "1M"],
"threshold": 0.95
}
]
The schema of the experiments can be found in benchmarking/schema/experiments.json
.
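A custom experiments file can be checked against that schema before running the benchmark. The snippet below is a sketch using the third-party jsonschema library and a hypothetical my-experiments.json file:
import json
import jsonschema   # pip install jsonschema

with open("benchmarking/schema/experiments.json") as f:
    schema = json.load(f)
with open("my-experiments.json") as f:       # hypothetical custom experiments
    experiments = json.load(f)

jsonschema.validate(experiments, schema)     # raises ValidationError if invalid
print(f"{len(experiments)} experiments look valid")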
Logging¶
The entity service uses the standard Python logging library for logging.
The following named loggers are used:
- entityservice
  - entityservice.views
  - entityservice.models
  - entityservice.database
- celery.es
The following environment variables affect logging:
- LOG_CFG - sets the path to a logging configuration file. There are two examples:
  - entityservice/default_logging.yaml
  - entityservice/verbose_logging.yaml
- DEBUG - sets the logging level to debug for all application code.
- LOGFILE - directs the log output to this file instead of stdout.
- LOG_HTTP_HEADER_FIELDS - HTTP headers to include in the application logs.
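For local experiments outside the service, the named loggers listed above can also be tuned directly with the standard logging library (the deployed service itself is configured via LOG_CFG and the environment variables above). A minimal sketch:
import logging

logging.basicConfig(level=logging.INFO)
# turn up detail for the views module, quieten the celery worker logger
logging.getLogger("entityservice.views").setLevel(logging.DEBUG)
logging.getLogger("celery.es").setLevel(logging.WARNING)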
Example logging output with LOG_HTTP_HEADER_FIELDS=User-Agent,Host:
[2019-02-02 23:17:23 +0000] [10] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=6c2a3730
[2019-02-02 23:17:23 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] 0 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [11] [INFO] Receiving CLK data. [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Storing user 25895 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:24 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] 1 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [10] [INFO] Receiving CLK data. [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Storing user 25896 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:25 +0000] [12] [INFO] Getting detail for a project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Checking credentials [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] 2 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=df791527
[2019-02-02 23:17:26 +0000] [12] [INFO] request description of a run [entityservice.views.run.description] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Requested project or run resource with invalid identifier token [entityservice.views.auth_checks] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Request to delete project [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Marking project for deletion [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
With DEBUG enabled there are a lot of logs from the backend and workers:
[2019-02-02 23:14:47 +0000] [10] [INFO] Marking project for deletion [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:47 +0000] [10] [DEBUG] Trying to connect to postgres db [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [DEBUG] Database connection established [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [9] [INFO] Request to delete project [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=5486c153
Tracing¶
- TRACING_HOST
- TRACING_PORT