[1]:
import csv
import itertools
import os
import pandas as pd
import requests
Entity Service: Multiparty linkage demo¶
This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.
We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.
Each party has a dataset of the following form:
[2]:
pd.read_csv('data/dataset-1.csv', index_col='id').head()
[2]:
givenname | surname | dob | gender | city | income | phone number | |
---|---|---|---|---|---|---|---|
id | |||||||
0 | tara | hilton | 27-08-1941 | male | canberra | 84052.973 | 08 2210 0298 |
3 | saJi | vernre | 22-12-2972 | mals | perth | 50104.118 | 02 1090 1906 |
7 | sliver | paciorek | NaN | mals | sydney | 31750.893 | NaN |
9 | ruby | george | 09-05-1939 | male | sydney | 135099.875 | 07 4698 6255 |
10 | eyrinm | campbell | 29-1q-1983 | male | perth | NaN | 08 299y 1535 |
Comparing the beginning of the first dataset to the second, we can see that the quality of the data is not very good: there are a lot of spelling mistakes and missing values. Let’s see how well the entity service does at linking these entities.
[3]:
pd.read_csv('data/dataset-2.csv', index_col='id').head()
[3]:
givenname | surname | dob | gender | city | income | phone number | |
---|---|---|---|---|---|---|---|
id | |||||||
3 | zali | verner | 22-12-1972 | male | perth | 50104.118 | 02 1090 1906 |
4 | samuel | tremellen | 21-12-1923 | male | melbourne | 159316.091 | 03 3605 9336 |
5 | amy | lodge | 16-01-1958 | male | canberra | 70170.456 | 07 8286 9372 |
7 | oIji | pacioerk | 10-02-1959 | mal3 | sydney | 31750.893 | 04 4220 5949 |
10 | erin | kampgell | 29-12-1983 | make | perth | 331476.598 | 08 2996 1445 |
Check the status of the Entity Service¶
Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.
[4]:
SERVER = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())
{'project_count': 7107, 'rate': 2884208, 'status': 'ok'}
{'anonlink': '0.12.5', 'entityservice': 'v1.13.0-alpha', 'python': '3.7.5'}
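Since multiparty support requires version 1.11.0 or later, a client script could guard on the version reported above. A minimal sketch (the `at_least` helper is hypothetical, and assumes the `v`-prefixed, optionally pre-release-tagged format shown in the output):

```python
def at_least(version: str, minimum: tuple) -> bool:
    """Check a version string like 'v1.13.0-alpha' against a minimum (major, minor, patch)."""
    core = version.lstrip("v").split("-")[0]          # drop the 'v' prefix and any pre-release tag
    parts = tuple(int(p) for p in core.split("."))
    return parts >= minimum

print(at_least("v1.13.0-alpha", (1, 11, 0)))  # → True
print(at_least("v1.10.2", (1, 11, 0)))        # → False
```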
Create a new project¶
We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the group output type supports multiparty linkage). Retain the project_id so we can find the project later, and the result_token so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the update_tokens identify the five data providers and permit them to upload CLKs.
[5]:
project_info = requests.post(
f"{PREFIX}/projects",
json={
"schema": {},
"result_type": "groups",
"number_parties": 5,
"name": "example project"
}
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]
print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)
project_id: e3602cac3940582e87c636f3a3827176ca7abe8d5b4e0096
result_token: ca19df31d445fade86390f38c5d8f578d545c5f92376ffb3
update_tokens: ['c24cab922055e8dd2c7ea639c342b9fce706fbbe7a531f8e', '7712f77f2ab2c2d7210ffa09465de5209ac9f50657fac0a8', 'ae41434b182d2ac82fc0646bf4e49e0e6c5e8f52f6350ba1', 'd8419a8c0f4b274ed1aca56d6adc8b8743c681b7eb02af9a', 'baefc60676a830b648fd176cc1c6d18248b048825036f8d6']
Upload the hashed data¶
This is where each party uploads their CLKs into the service. Here, we do the work of all five data providers inside this for loop. In a deployment scenario, each data provider would be uploading their own CLKs using their own update token.
These CLKs are already hashed using clkhash (with this linkage schema), so for each data provider, we just need to upload their corresponding hash file.
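As a rough sketch of what such a hash file contains: the files written by clkhash are, to the best of my knowledge, JSON objects with a `clks` key holding base64-encoded Bloom filters. The two filter values below are fabricated for illustration:

```python
import base64
import json

# Hypothetical miniature CLK upload payload; real files hold one base64-encoded
# Bloom filter per record in the data provider's dataset.
payload = {
    "clks": [
        base64.b64encode(bytes(128)).decode("ascii"),         # an all-zero 1024-bit filter
        base64.b64encode(bytes([255] * 128)).decode("ascii"), # an all-one 1024-bit filter
    ]
}
serialized = json.dumps(payload)
print(len(json.loads(serialized)["clks"]))  # → 2
```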
[6]:
for i, token in enumerate(update_tokens, start=1):
with open(f"data/clks-{i}.json") as f:
r = requests.post(
f"{PREFIX}/projects/{project_id}/clks",
data=f,
headers={
"Authorization": token,
"content-type": "application/json"
}
)
print(f"Data provider {i}: {r.text}")
Data provider 1: {
"message": "Updated",
"receipt_token": "b060225db2fb1edda39bcc2153a9310392f87abcacd9db2b"
}
Data provider 2: {
"message": "Updated",
"receipt_token": "db94c740c469a9bda9931829d1ba58210426134a46ba1edb"
}
Data provider 3: {
"message": "Updated",
"receipt_token": "ad60b956a4f90c8dd16fb7d278c0a8670d0bb3348a19f70a"
}
Data provider 4: {
"message": "Updated",
"receipt_token": "2ce533e0a87020654d150084389529ba05bb1ad1628a0bd4"
}
Data provider 5: {
"message": "Updated",
"receipt_token": "ce6b281666226d181a9b8bb191daf57128400096d59bfd4c"
}
Begin a run¶
The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.
Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.
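Under the hood, anonlink compares pairs of Bloom-filter encodings using the Sorensen-Dice coefficient. A minimal sketch of that computation, using Python ints as toy bit vectors (the filter values are made up; real CLKs are far longer):

```python
def dice_similarity(a: int, b: int) -> float:
    """Sorensen-Dice coefficient of two Bloom filters represented as Python ints."""
    common = bin(a & b).count("1")                  # bits set in both filters
    total = bin(a).count("1") + bin(b).count("1")   # bits set in each filter
    return 2 * common / total

clk_a = 0b1011_0110  # 5 bits set
clk_b = 0b1011_0100  # 4 bits set, all shared with clk_a
print(dice_similarity(clk_a, clk_b))  # → 0.888..., above the 0.8 threshold
```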
[7]:
r = requests.post(
f"{PREFIX}/projects/{project_id}/runs",
headers={
"Authorization": result_token
},
json={
"threshold": 0.8
}
)
run_id = r.json()["run_id"]
Check the status¶
Let’s see whether the run has finished (‘state’ is ‘completed’)!
[8]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
headers={
"Authorization": result_token
}
)
r.json()
[8]:
{'current_stage': {'description': 'compute similarity scores',
'number': 2,
'progress': {'absolute': 0,
'description': 'number of already computed similarity scores',
'relative': 0.0}},
'stages': 3,
'state': 'running',
'time_added': '2019-11-24T23:12:37.412183+00:00',
'time_started': '2019-11-24T23:12:37.436726+00:00'}
Now, after some delay (depending on the size of the datasets), we can fetch the results. Waiting for completion can be achieved by directly polling the REST API using requests; however, for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.
[9]:
from IPython.display import clear_output
from clkhash.rest_client import RestClient
from clkhash.rest_client import format_run_status
rest_client = RestClient(SERVER)
for update in rest_client.watch_run_status(project_id, run_id, result_token, timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (3/3): compute output
Retrieve the results¶
We retrieve the results of the linkage. As we selected the groups output type earlier, the result is a list of groups of records. Every record in such a group is believed to belong to the same entity; each record consists of two values: the dataset index and the row index.
The last 20 groups look like this.
[10]:
r = requests.get(
f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
headers={
"Authorization": result_token
}
)
groups = r.json()
groups['groups'][-20:]
[10]:
[[[3, 1831], [4, 1854]],
[[0, 2362], [2, 2369]],
[[2, 2910], [4, 2915]],
[[3, 1885], [4, 1902]],
[[2, 11], [3, 10]],
[[0, 3085], [3, 3117]],
[[1, 815], [3, 838]],
[[1, 450], [2, 474]],
[[0, 1253], [2, 1252], [1, 1191], [4, 1261]],
[[1, 1967], [2, 1985]],
[[1, 4], [4, 2]],
[[1, 468], [2, 489], [3, 482], [4, 469]],
[[2, 2384], [3, 2378], [0, 2378]],
[[3, 2102], [4, 2115]],
[[1, 2215], [2, 2221]],
[[0, 1993], [4, 1994]],
[[0, 474], [4, 437], [1, 443], [2, 466]],
[[1, 1034], [2, 1090]],
[[0, 1835], [4, 1847]],
[[0, 2496], [4, 2498]]]
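A quick way to summarize a linkage result is to tally how many parties each group spans. A sketch over a handful of the groups shown above:

```python
from collections import Counter

# A few of the groups from the output above; each inner pair is (dataset index, row index).
sample_groups = [
    [[3, 1831], [4, 1854]],
    [[0, 1253], [2, 1252], [1, 1191], [4, 1261]],
    [[2, 2384], [3, 2378], [0, 2378]],
]
sizes = Counter(len(group) for group in sample_groups)
print(sorted(sizes.items()))  # → [(2, 1), (3, 1), (4, 1)]
```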
As a sanity check, we print the corresponding PII for these groups:
[11]:
def load_dataset(i):
dataset = []
with open(f"data/dataset-{i}.csv") as f:
reader = csv.reader(f)
next(reader) # ignore header
for row in reader:
dataset.append(row[1:])
return dataset
datasets = list(map(load_dataset, range(1, 6)))
for group in groups["groups"][-20:]:
for (i, j) in group:
print(i, datasets[i][j])
print()
3 ['joshua', 'tremellen', '05-01-1988', 'male', 'sydney', '156320.936', '03 7154 7258']
4 ['joua', 'dreemleln', '05-01-1988', 'male', 'sydnru', '156320.936', '03 8154 7258']
0 ['katharine', 'procter', '03-02-2003', 'female', 'sydney', '116172.524', '08 4057 0794']
2 ['katharine', 'procter', '03-02-3003', 'femald', 'sydnev', '116172.524', '08 4057 0694']
2 ['georgi3', "wytk'ln", '01-06-1927', 'male', 'sydriry', '35625.897', '08 2668 2433']
4 ['georgja', 'ytkkn', '01-06-1927', 'male', 'sydrirv', '35626.797', '08 2668 2433']
3 ['heath', 'ryan', '20-02-1949', 'male', 'canberra', '70507.784', '04 9913 1283']
4 ['heath', 'rya17', '20-02-2949', '', 'canbcera4', '70507.784', '04 9913 1283']
2 ['siaitlyn', 'robezon', '31-12-1937', 'male', 'sdvnev', '105108.052', '07 2226 8544']
3 ['kaitlyn', 'robeson', '31-12-1937', 'maoe', 'sydney', '105107.051', '07 2226 8545']
0 ['holly', 'reih', '22-06-2009', 'msle', 'syconey', '131184.582', '']
3 ['holly', 'reicl', '21-06-2009', 'male', 'sydey', '131184.582', '']
1 ['sasmine', 'bridqland', '20-06-1942', 'msle', 'syclney', '155539.109', '04 5020 4447']
3 ['ajsmine', 'bridgland', '20-06-2942', 'male', 's6dney', '155539.100', '04 5020 4447']
1 ['ella', 'mo1davt5ev', '01-93-1985', 'male', 'pertj', '', '03 1427 7602']
2 ['ella', 'moldavtsev', '01-03-1985', 'male', 'perth', '171412.470', '03 1427 7602']
0 ['courtney', 'mashberg', '30-05-1908', 'male', 'perth', '277942.921', '03 1022 1796']
2 ['courtne', 'mazhberg', '30-05-1908', 'mzle', 'perth', '277942.021', '03 1022 1796']
1 ['courtnev', 'mashbcrg', '30-05-1808', 'male', 'perth', '277941.921', '03 1022 1796']
4 ['kourtney', 'msshperg', '30-05-1907', 'male', 'per6b', '277942.921', '03 1022 1796']
1 ['ary', 'relkos', '26-10-2003', 'male', 'melbonrrie', '136614.506', '02 2102 6467']
2 ['arru', 'rellos', '26-10-2093', 'male', 'melbouthd', '136614.506', '02 1192 6367']
1 ['erin', 'kampgell', '29-12-1983', 'make', 'perth', '331476.598', '08 2996 1445']
4 ['wrin', 'kampbwll', '29-22-1983', 'male', 'pertl0', '331476.599', '08 2996 1435']
1 ['stephnaie', 'goldsworthy', '03-06-1958', '', 'canbrrra', '83372.67q', '02 4093 4044']
2 ['sttepbanie', 'goldsworthy', '03-06-1958', 'mald', 'canbedra', '83372.772', '02 4093 4044']
3 ['stefanie', 'goldsworthy', '03-06-1958', 'male', 'camberra', '83372.572', '']
4 ['stefanie', 'go|dsworthy', '03-06-1958', '', 'cabr:erra', '83372.672', '02 4093 4044']
2 ['ro5y', 'whitr', '30-12-1933', 'mal4', 'sydney', '91104.885', '02 2375 0175']
3 ['rory', 'white', '30-12-1933', 'male', 'sydney', '91104.785', '02 2375 0175']
0 ['mory', 'wh:te', '30-12-1033', 'male', 'sydhey', '91104.785', '02 2375 0175']
3 ['antony', 'riean', '18-01-1908', 'male', 'canberra', '59633.334', '07 2734 8270']
4 ['anthnoy', 'ryari', '18-01-1908', 'male', 'cajberra', '58633.434', '07 2734 8370']
1 ['ryan', 'allxhin', '20-10-2011', 'male', 'melbounre', '267843.384', '']
2 ['ryan', 'allchin', '20-10-2011', 'male', 'melbourne', '167843.484', '08 7962 6255']
0 ['haery', 'reklos', '26-10-2003', 'malw', 'mlebourne', '136614.506', '02 1102 6467']
4 ['harey', 'eelloz', '26-10-2003', 'mame', 'melbourne', '136614.506', '02 110w 6467']
0 ['larizsa', 'morrison', '16-04-2960', 'maje', 'melbouene', '196846.869', '04 3434 7115']
4 ['larissa', 'morrison', '16-04-1960', 'male', 'melbourne', '196846.869', '04 3434 7115']
1 ['lairssa', 'mornson', '16-04-1960', 'male', '', '196836.869', '04 3434 7115']
2 ['larissa', 'morrijon', '16-04-1960', 'make', '', '196846.859', '04 3434 7115']
1 ["ke'Irx", 'chappel', '19-05-1966', 'male', '', '138869.396', '']
2 ['keira', 'chapepl', '19-05-1966', 'male', '', '148869.296', '']
0 ['meagan', 'vrahn', '26-05-2950', '', 'melbourne', '154858.094', '04 1222 9254']
4 ['meagan', 'frahn', '26-05-1950', 'male', 'melbourne', '154856.094', '04 1222 9254']
0 ['zoel', 'ev', '06-09-1990', 'gemale', 'ysdnvvy', '183366.696', '02 5578 4520']
4 ['joel', 'everett', '06-09-1990', 'female', 'sydney', '183366.696', '02 5578 4520']
Despite the high amount of noise in the data, the entity service was able to produce a fairly accurate matching, although a close look at the groups above suggests that a few of them are not actual matches.
We may be able to improve on these results by fine-tuning the hashing schema or by changing the threshold.
Delete the project¶
[12]:
r = requests.delete(
f"{PREFIX}/projects/{project_id}",
headers={
"Authorization": result_token
}
)
print(r.status_code)
204