[1]:
import csv
import json
import os
import pandas as pd
[2]:
KEY1 = 'correct'
KEY2 = 'horse'
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
Scenario¶
There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they hold some entities in common, but the overlap is incomplete. The features describing those entities that all parties share are given name, surname, date of birth, and phone number.
Each party also holds some additional information about those entities: Alice has a person's gender, Bob their city, and Charlie their income. They wish to create a table for analysis in which each row has a gender, a city, and an income, but they don't want to share any further information. Anonlink lets them do this in a privacy-preserving way, without revealing given names, surnames, dates of birth, or phone numbers.
Alice, Bob, and Charlie: agree on secret keys and a linkage schema¶
They keep the keys to themselves, but the schema may be revealed to the analyst.
[3]:
print(f'keys: {KEY1}, {KEY2}')
keys: correct, horse
[4]:
with open('data/schema_ABC.json') as f:
    print(f.read())
{
"version": 2,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"info": "c2NoZW1hX2V4YW1wbGU=",
"keySize": 64
}
},
"features": [
{
"identifier": "id",
"ignored": true
},
{
"identifier": "givenname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": false,
"strategy": {"k": 15}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": false,
"strategy": {"k": 15}
}
},
{
"identifier": "dob",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 2,
"positional": true,
"strategy": {"k": 15}
}
},
{
"identifier": "phone number",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"ngram": 1,
"positional": true,
"strategy": {"k": 8}
}
},
{
"identifier": "ignoredForLinkage",
"ignored": true
}
]
}
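The schema above drives the encoding: each feature value is split into n-grams (`"ngram": 2` means bigrams), and each n-gram sets `k` bit positions in an `l = 1024`-bit Bloom filter using a key-derived hash. Similar values therefore share bit positions, and the Dice coefficient of two filters approximates their n-gram overlap. Here is a minimal illustrative sketch of that idea — not clkhash's actual implementation, which derives per-feature keys via HKDF and uses more careful hashing strategies:

```python
import hashlib

L = 1024  # Filter length, matching "l" in the schema.

def bigrams(value):
    """Split a padded string into overlapping 2-grams ("ngram": 2)."""
    padded = f' {value} '
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def encode(value, key, k=15):
    """Set k bit positions per bigram using a keyed hash (illustrative only)."""
    bits = set()
    for gram in bigrams(value):
        for i in range(k):
            digest = hashlib.sha256(f'{key}|{gram}|{i}'.encode()).digest()
            bits.add(int.from_bytes(digest[:4], 'big') % L)
    return bits

def dice(a, b):
    """Dice coefficient of two bit sets; the run threshold compares this score."""
    return 2 * len(a & b) / (len(a) + len(b))
```

With this sketch, `dice(encode('verner', KEY1), encode('vernre', KEY1))` is high because the two misspellings share most bigrams, while unrelated surnames score near zero — which is why the dirty records below can still be linked.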
Sneak peek at input data¶
Alice¶
[5]:
pd.read_csv('data/dataset-alice.csv').head()
[5]:
 | id | givenname | surname | dob | phone number | gender
---|---|---|---|---|---|---
0 | 0 | tara | hilton | 27-08-1941 | 08 2210 0298 | male |
1 | 3 | saJi | vernre | 22-12-2972 | 02 1090 1906 | mals |
2 | 7 | sliver | paciorek | NaN | NaN | mals |
3 | 9 | ruby | george | 09-05-1939 | 07 4698 6255 | male |
4 | 10 | eyrinm | campbell | 29-1q-1983 | 08 299y 1535 | male |
Bob¶
[6]:
pd.read_csv('data/dataset-bob.csv').head()
[6]:
 | id | givenname | surname | dob | phone number | city
---|---|---|---|---|---|---
0 | 3 | zali | verner | 22-12-1972 | 02 1090 1906 | perth |
1 | 4 | samuel | tremellen | 21-12-1923 | 03 3605 9336 | melbourne |
2 | 5 | amy | lodge | 16-01-1958 | 07 8286 9372 | canberra |
3 | 7 | oIji | pacioerk | 10-02-1959 | 04 4220 5949 | sydney |
4 | 10 | erin | kampgell | 29-12-1983 | 08 2996 1445 | perth |
Charlie¶
[7]:
pd.read_csv('data/dataset-charlie.csv').head()
[7]:
 | id | givenname | surname | dob | phone number | income
---|---|---|---|---|---|---
0 | 1 | joshua | arkwright | 16-02-1903 | 04 8511 9580 | 70189.446 |
1 | 3 | zal: | verner | 22-12-1972 | 02 1090 1906 | 50194.118 |
2 | 7 | oliyer | paciorwk | 10-02-1959 | 04 4210 5949 | 31750.993 |
3 | 8 | nacoya | ranson | 17-08-1925 | 07 6033 4580 | 102446.131 |
4 | 10 | erih | campbell | 29-12-1i83 | 08 299t 1435 | 331476.599 |
Analyst: create the project¶
The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob, and Charlie, respectively. The project ID is known to everyone.
[8]:
!clkutil create-project --server $SERVER --type groups --schema data/schema_ABC.json --parties 3 --output credentials.json
with open('credentials.json') as f:
    credentials = json.load(f)
project_id = credentials['project_id']
result_token = credentials['result_token']
update_token_alice = credentials['update_tokens'][0]
update_token_bob = credentials['update_tokens'][1]
update_token_charlie = credentials['update_tokens'][2]
Project created
Alice: hash the data and upload it to the server¶
The data is hashed according to the schema and the keys. Alice’s update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes.
[9]:
!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema_ABC.json dataset-alice-hashed.json --check-header false
generating CLKs: 100%|##########| 3.23k/3.23k [00:02<00:00, 1.29kclk/s, mean=372, std=34.9]
CLK data written to dataset-alice-hashed.json
[10]:
!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json
{"message": "Updated", "receipt_token": "c54597f32fd969603efba706af1556abee3cc35f2718bcb6"}
Bob: hash the data and upload it to the server¶
[11]:
!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema_ABC.json dataset-bob-hashed.json --check-header false
generating CLKs: 100%|##########| 3.24k/3.24k [00:02<00:00, 1.45kclk/s, mean=372, std=35.3]
CLK data written to dataset-bob-hashed.json
[12]:
!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json
{"message": "Updated", "receipt_token": "6ee2fe5df850b795ee6ddff1aaf4dfb03f6d4398dedcc248"}
Charlie: hash the data and upload it to the server¶
[13]:
!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema_ABC.json dataset-charlie-hashed.json --check-header false
generating CLKs: 100%|##########| 3.26k/3.26k [00:01<00:00, 1.73kclk/s, mean=372, std=34.8]
CLK data written to dataset-charlie-hashed.json
[14]:
!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json
{"message": "Updated", "receipt_token": "064664ed9fd1f58c4da05c62a4832b813276d09342137a42"}
Analyst: start the linkage run¶
This starts the linkage computation on the server. We wait for it to finish and then retrieve the results.
[15]:
!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.7 --output=run-credentials.json
with open('run-credentials.json') as f:
    run_credentials = json.load(f)
run_id = run_credentials['run_id']
Analyst: retrieve the results¶
[16]:
!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --watch --output linkage-output.json
State: completed
Stage (3/3): compute output
Downloading result
Received result
[17]:
with open('linkage-output.json') as f:
    linkage_output = json.load(f)
linkage_groups = linkage_output['groups']
Everyone: make table of interesting information¶
We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII.
[18]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    genders = tuple(row[-1] for row in r)
with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    cities = tuple(row[-1] for row in r)
with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    incomes = tuple(row[-1] for row in r)
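The cell above relies on the extra attribute being the last column of each file. A small hypothetical helper using `csv.DictReader` selects the column by name instead, which is more robust if the column order ever changes:

```python
import csv

def read_column(path, column):
    """Return one named column from a CSV as a tuple (header-based, not positional)."""
    with open(path, newline='') as f:
        return tuple(row[column] for row in csv.DictReader(f))

# e.g. genders = read_column('data/dataset-alice.csv', 'gender')
```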
[19]:
table = []
for group in linkage_groups:
    row = [''] * 3
    for i, j in group:
        # i is the party (0=Alice, 1=Bob, 2=Charlie); j is a row in that party's dataset.
        row[i] = [genders, cities, incomes][i][j]
    # Keep only entities matched in at least two datasets.
    if sum(map(bool, row)) > 1:
        table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)
[19]:
 | gender | city | income
---|---|---|---
0 | | peGh | 395273.665
1 | | sydnev | 77367.636
2 | | pertb | 323383.650
3 | | syd1e7y | 79745.538
4 | | perth | 28019.494
5 | | canberra | 78961.675
6 | female | brisnane |
7 | male | canbetra |
8 | | sydme7 | 106849.526
9 | | melbourne | 68548.966
The last 15 groups look like this.
[20]:
linkage_groups[-15:]
[20]:
[[[0, 2111], [1, 2100]],
[[0, 2121], [2, 2131], [1, 2111]],
[[1, 1146], [2, 1202], [0, 1203]],
[[1, 2466], [2, 2478], [0, 2460]],
[[0, 429], [1, 412]],
[[0, 2669], [1, 1204]],
[[1, 1596], [2, 1623]],
[[0, 487], [1, 459]],
[[1, 1776], [2, 1800], [0, 1806]],
[[1, 2586], [2, 2602]],
[[0, 919], [1, 896]],
[[0, 100], [2, 107], [1, 100]],
[[0, 129], [1, 131], [2, 135]],
[[0, 470], [1, 440]],
[[0, 1736], [1, 1692], [2, 1734]]]
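Each entry in a group is a `[party_index, row_index]` pair, where the party index follows the upload order (0 = Alice, 1 = Bob, 2 = Charlie). As a quick sanity aid, a small helper (the names here are illustrative, not part of the Anonlink API) renders a group in readable form:

```python
PARTIES = ('Alice', 'Bob', 'Charlie')  # Order follows the upload order above.

def describe_group(group):
    """Render one linkage group as readable party/row references."""
    return ', '.join(f'{PARTIES[p]} row {r}' for p, r in sorted(group))

# describe_group([[0, 2111], [1, 2100]]) -> 'Alice row 2111, Bob row 2100'
```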
Sneak peek at the result¶
We obviously couldn’t do this in a real-world setting, but let’s view the linkage using the PII. If the IDs within a group match, then the linkage is correct.
[21]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_alice = tuple(r)
with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_bob = tuple(r)
with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_charlie = tuple(r)
[22]:
table = []
for group in linkage_groups:
    for i, j in sorted(group):
        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
    # Blank row to separate the groups visually.
    table.append([''] * 6)
pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)
[22]:
 | id | given name | surname | dob | phone number | non-linking
---|---|---|---|---|---|---
6426 | 1171 | isabelle | bridgland | 30-03-1994 | 04 5318 6471 | mal4 |
6427 | 1171 | isalolIe | riahgland | 30-02-1994 | 04 5318 6471 | sydnry |
6428 | 1171 | isabelle | bridgland | 30-02-1994 | 04 5318 6471 | 63514.217 |
6429 | ||||||
6430 | 1243 | thmoas | doaldson | 13-04-1900 | 09 6963 1944 | male |
6431 | 1243 | thoma5 | donaldson | 13-04-1900 | 08 6962 1944 | perth |
6432 | 1243 | thomas | donalsdon | 13-04-2900 | 08 6963 2944 | 489229.297 |
6433 | ||||||
6434 | 2207 | annah | aslea | 02-11-2906 | 04 5501 5973 | male |
6435 | 2207 | hannah | easlea | 02-11-2006 | 04 5501 5973 | canberra |
6436 | ||||||
6437 | 5726 | rhys | clarke | 19-05-1929 | 02 9220 9635 | mqle |
6438 | 5726 | ry5 | clarke | 19-05-1939 | 02 9120 9635 | |
6439 | 5726 | rhys | klark | 19-05-2938 | 02 9220 9635 | 118197.119 |
6440 |