[1]:
import csv
import json
import os

import pandas as pd
[2]:
KEY1 = 'correct'
KEY2 = 'horse'

SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")

Scenario

There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, though the overlap is incomplete. The common features describing those entities are given name, surname, date of birth, and phone number.

They each hold additional information about those entities: Alice has a person’s gender, Bob their city, and Charlie their income. They wish to create a table for analysis, where each row has a gender, a city, and an income, without sharing any of the identifying information. Anonlink lets them do this in a privacy-preserving way, without revealing given names, surnames, dates of birth, or phone numbers.

Alice, Bob, and Charlie: agree on secret keys and a linkage schema

They keep the keys to themselves, but the schema may be revealed to the analyst.
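
The schema’s kdf section (shown in the next cell) tells clkhash to derive per-feature hashing keys from these shared secrets using HKDF-SHA256. A rough stdlib sketch of HKDF itself (RFC 5869); illustrative only, since clkhash’s actual key derivation mixes in more inputs such as the salt and info strings from the schema:

```python
import hashlib
import hmac

def hkdf(key_material, salt, info, length):
    """Illustrative HKDF (RFC 5869): extract a pseudorandom key,
    then expand it to `length` bytes of keying material."""
    # Extract: PRK = HMAC-SHA256(salt, key material)
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block = b'', b''
    for i in range(1, -(-length // 32) + 1):
        block = hmac.new(prk, block + info + bytes([i]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

# keySize 64 in the schema means 64 bytes of derived key material.
derived = hkdf(b'correct horse', b'some salt', b'schema_example', 64)
print(len(derived))  # 64
```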

[3]:
print(f'keys: {KEY1}, {KEY2}')
keys: correct, horse
[4]:
with open('data/schema_ABC.json') as f:
    print(f.read())

{
  "version": 2,
  "clkConfig": {
    "l": 1024,
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "id",
      "ignored": true
    },
    {
      "identifier": "givenname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "surname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "dob",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": true,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "phone number",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 1,
        "positional": true,
        "strategy": {"k": 8}
      }
    },
    {
      "identifier": "ignoredForLinkage",
      "ignored": true
    }
  ]
}
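
The `ngram` and `positional` settings control how each feature is tokenised before encoding: names become overlapping bigrams, while the date of birth and phone number use position-tagged tokens so digit order matters. A rough sketch of this tokenisation (illustrative only; clkhash’s actual tokeniser may pad and tag tokens differently):

```python
def ngrams(value, n, positional=False):
    """Illustrative n-gram tokeniser: pad, slide a window of size n,
    and optionally prefix each token with its position."""
    padded = ' ' * (n - 1) + value + ' ' * (n - 1)
    tokens = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        tokens = [f'{i} {t}' for i, t in enumerate(tokens, start=1)]
    return tokens

print(ngrams('tara', 2))        # [' t', 'ta', 'ar', 'ra', 'a ']
print(ngrams('0298', 1, True))  # ['1 0', '2 2', '3 9', '4 8']
```

Positional unigrams mean that two phone numbers only share tokens where the same digit appears in the same place.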

Sneak peek at input data

[5]:
pd.read_csv('data/dataset-alice.csv').head()
[5]:
id givenname surname dob phone number gender
0 0 tara hilton 27-08-1941 08 2210 0298 male
1 3 saJi vernre 22-12-2972 02 1090 1906 mals
2 7 sliver paciorek NaN NaN mals
3 9 ruby george 09-05-1939 07 4698 6255 male
4 10 eyrinm campbell 29-1q-1983 08 299y 1535 male
[6]:
pd.read_csv('data/dataset-bob.csv').head()
[6]:
id givenname surname dob phone number city
0 3 zali verner 22-12-1972 02 1090 1906 perth
1 4 samuel tremellen 21-12-1923 03 3605 9336 melbourne
2 5 amy lodge 16-01-1958 07 8286 9372 canberra
3 7 oIji pacioerk 10-02-1959 04 4220 5949 sydney
4 10 erin kampgell 29-12-1983 08 2996 1445 perth

Charlie

[7]:
pd.read_csv('data/dataset-charlie.csv').head()
[7]:
id givenname surname dob phone number income
0 1 joshua arkwright 16-02-1903 04 8511 9580 70189.446
1 3 zal: verner 22-12-1972 02 1090 1906 50194.118
2 7 oliyer paciorwk 10-02-1959 04 4210 5949 31750.993
3 8 nacoya ranson 17-08-1925 07 6033 4580 102446.131
4 10 erih campbell 29-12-1i83 08 299t 1435 331476.599

Analyst: create the project

The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone.

[8]:
!clkutil create-project --server $SERVER --type groups --schema data/schema_ABC.json --parties 3 --output credentials.json

with open('credentials.json') as f:
    credentials = json.load(f)
    project_id = credentials['project_id']
    result_token = credentials['result_token']
    update_token_alice = credentials['update_tokens'][0]
    update_token_bob = credentials['update_tokens'][1]
    update_token_charlie = credentials['update_tokens'][2]
Project created

Alice: hash the data and upload it to the server

The data is hashed according to the schema and the keys. Alice’s update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes.
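
Under the hood, each record’s tokens are encoded into an l=1024-bit Bloom filter (a cryptographic longterm key, or CLK) by setting k bits per token. A minimal sketch of the idea, with a hypothetical derived key (clkhash’s real scheme uses HMAC-based hashing keyed per feature; this is illustrative only):

```python
import hashlib

def encode_clk(tokens, key, l=1024, k=15):
    """Illustrative Bloom-filter encoding: set k bits per token.
    `key` is a stand-in for a per-feature key derived via the schema's KDF."""
    bits = [0] * l
    for token in tokens:
        for i in range(k):
            # Derive k bit positions per token from a keyed hash.
            h = hashlib.sha256(key + token.encode() + bytes([i])).digest()
            bits[int.from_bytes(h[:4], 'big') % l] = 1
    return bits

clk = encode_clk([' t', 'ta', 'ar', 'ra', 'a '], b'derived-key')
print(sum(clk))  # popcount of the filter, analogous to the mean/std in the progress bar
```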

[9]:
!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema_ABC.json dataset-alice-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.23k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.23k [00:02<00:31, 96.1clk/s, mean=372, std=32.6]
generating CLKs:  25%|##4       | 800/3.23k [00:02<00:17, 136clk/s, mean=371, std=35.5]
generating CLKs:  63%|######2   | 2.03k/3.23k [00:02<00:06, 193clk/s, mean=372, std=34.7]
generating CLKs: 100%|##########| 3.23k/3.23k [00:02<00:00, 1.29kclk/s, mean=372, std=34.9]
CLK data written to dataset-alice-hashed.json
[10]:
!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json
{"message": "Updated", "receipt_token": "c54597f32fd969603efba706af1556abee3cc35f2718bcb6"}

Bob: hash the data and upload it to the server

[11]:
!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema_ABC.json dataset-bob-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.24k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.24k [00:01<00:25, 119clk/s, mean=369, std=32.4]
generating CLKs:  31%|###       | 1.00k/3.24k [00:01<00:13, 168clk/s, mean=371, std=35]
generating CLKs:  56%|#####5    | 1.80k/3.24k [00:01<00:06, 238clk/s, mean=371, std=35.5]
generating CLKs: 100%|##########| 3.24k/3.24k [00:02<00:00, 1.45kclk/s, mean=372, std=35.3]
CLK data written to dataset-bob-hashed.json
[12]:
!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json
{"message": "Updated", "receipt_token": "6ee2fe5df850b795ee6ddff1aaf4dfb03f6d4398dedcc248"}

Charlie: hash the data and upload it to the server

[13]:
!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema_ABC.json dataset-charlie-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.26k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.26k [00:01<00:24, 122clk/s, mean=371, std=33.3]
generating CLKs:  55%|#####5    | 1.80k/3.26k [00:01<00:08, 174clk/s, mean=372, std=34.5]
generating CLKs: 100%|##########| 3.26k/3.26k [00:01<00:00, 1.73kclk/s, mean=372, std=34.8]
CLK data written to dataset-charlie-hashed.json
[14]:
!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json
{"message": "Updated", "receipt_token": "064664ed9fd1f58c4da05c62a4832b813276d09342137a42"}

Analyst: start the linkage run

This will start the linkage computation. We will wait a little bit and then retrieve the results.
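
The service compares CLKs using Sørensen–Dice similarity; candidate matches scoring below the threshold (0.7 here) are not linked. A minimal sketch on toy bit lists:

```python
def dice(a, b):
    """Sørensen–Dice similarity of two bit lists:
    twice the overlap divided by the total number of set bits."""
    overlap = sum(x & y for x, y in zip(a, b))
    return 2 * overlap / (sum(a) + sum(b))

print(dice([1, 1, 0, 1], [1, 0, 0, 1]))  # 2*2/(3+2) = 0.8, above the 0.7 threshold
```

Raising the threshold trades recall for precision: fewer false links, but more genuinely matching records left unlinked.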

[15]:
!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.7 --output=run-credentials.json

with open('run-credentials.json') as f:
    run_credentials = json.load(f)
    run_id = run_credentials['run_id']

Analyst: retrieve the results

[16]:
!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --watch --output linkage-output.json
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[17]:
with open('linkage-output.json') as f:
    linkage_output = json.load(f)
    linkage_groups = linkage_output['groups']

Everyone: make table of interesting information

We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII. A group matched across only two parties leaves the third column blank.

[18]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    genders = tuple(row[-1] for row in r)

with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    cities = tuple(row[-1] for row in r)

with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    incomes = tuple(row[-1] for row in r)
[19]:
table = []
for group in linkage_groups:
    # One column per party: Alice's gender, Bob's city, Charlie's income.
    row = [''] * 3
    for i, j in group:
        # i is the party index (0=Alice, 1=Bob, 2=Charlie); j is the row index.
        row[i] = [genders, cities, incomes][i][j]
    # Keep only groups that were matched across at least two parties.
    if sum(map(bool, row)) > 1:
        table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)
[19]:
gender city income
0 peGh 395273.665
1 sydnev 77367.636
2 pertb 323383.650
3 syd1e7y 79745.538
4 perth 28019.494
5 canberra 78961.675
6 female brisnane
7 male canbetra
8 sydme7 106849.526
9 melbourne 68548.966

The last 15 groups look like this.

[20]:
linkage_groups[-15:]
[20]:
[[[0, 2111], [1, 2100]],
 [[0, 2121], [2, 2131], [1, 2111]],
 [[1, 1146], [2, 1202], [0, 1203]],
 [[1, 2466], [2, 2478], [0, 2460]],
 [[0, 429], [1, 412]],
 [[0, 2669], [1, 1204]],
 [[1, 1596], [2, 1623]],
 [[0, 487], [1, 459]],
 [[1, 1776], [2, 1800], [0, 1806]],
 [[1, 2586], [2, 2602]],
 [[0, 919], [1, 896]],
 [[0, 100], [2, 107], [1, 100]],
 [[0, 129], [1, 131], [2, 135]],
 [[0, 470], [1, 440]],
 [[0, 1736], [1, 1692], [2, 1734]]]
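
Each group is a list of [party, row] pairs, with parties numbered in upload order (0 = Alice, 1 = Bob, 2 = Charlie). A small helper (illustrative) to decode one of the groups above:

```python
# Party order matches the upload order of the update tokens.
PARTIES = ['Alice', 'Bob', 'Charlie']

def decode_group(group):
    """Return a group as (party name, row index) pairs, in party order."""
    return [(PARTIES[i], j) for i, j in sorted(group)]

# Rows 100, 100, and 107 of the three datasets were linked as one entity.
print(decode_group([[0, 100], [2, 107], [1, 100]]))
# [('Alice', 100), ('Bob', 100), ('Charlie', 107)]
```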

Sneak peek at the result

We obviously couldn’t do this in a real-world setting, but let’s check the linkage against the PII: if the ids within a group match, the link is correct.

[21]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_alice = tuple(r)

with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_bob = tuple(r)

with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_charlie = tuple(r)
[22]:
table = []
for group in linkage_groups:
    # Sort so each group's rows appear in party order: Alice, Bob, Charlie.
    for i, j in sorted(group):
        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
    table.append([''] * 6)  # blank separator row between groups

pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)
[22]:
id given name surname dob phone number non-linking
6426 1171 isabelle bridgland 30-03-1994 04 5318 6471 mal4
6427 1171 isalolIe riahgland 30-02-1994 04 5318 6471 sydnry
6428 1171 isabelle bridgland 30-02-1994 04 5318 6471 63514.217
6429
6430 1243 thmoas doaldson 13-04-1900 09 6963 1944 male
6431 1243 thoma5 donaldson 13-04-1900 08 6962 1944 perth
6432 1243 thomas donalsdon 13-04-2900 08 6963 2944 489229.297
6433
6434 2207 annah aslea 02-11-2906 04 5501 5973 male
6435 2207 hannah easlea 02-11-2006 04 5501 5973 canberra
6436
6437 5726 rhys clarke 19-05-1929 02 9220 9635 mqle
6438 5726 ry5 clarke 19-05-1939 02 9120 9635
6439 5726 rhys klark 19-05-2938 02 9220 9635 118197.119
6440