[1]:
import csv
import json
import os
import pandas as pd
[2]:
SECRET = 'my_secret'
SERVER = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
Multiparty Linkage with Clkhash¶
Scenario¶
There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, but with incomplete overlap. The common features describing those entities are given name, surname, date of birth, and phone number.
They all have some additional information about those entities in their respective datasets: Alice has a person’s gender, Bob their city, and Charlie their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don’t want to share any additional information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers).
Alice, Bob, and Charlie: agree on secret keys and a linkage schema¶
They keep the keys to themselves, but the schema may be revealed to the analyst.
[3]:
print(f'keys: {SECRET}')
keys: my_secret
[4]:
with open('data/schema_ABC.json') as f:
print(f.read())
{
"version": 3,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"info": "c2NoZW1hX2V4YW1wbGU=",
"keySize": 64
}
},
"features": [
{
"identifier": "id",
"ignored": true
},
{
"identifier": "givenname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "dob",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": true
}
}
},
{
"identifier": "phone number",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 8
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "ignoredForLinkage",
"ignored": true
}
]
}
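For intuition, the `ngram` comparisons in this schema split each field value into n-grams before encoding. With `"positional": true` (as for `dob` and `phone number`), each n-gram is tagged with its position, so the same characters at different offsets do not match. The sketch below illustrates the idea with a hypothetical `ngrams` helper; clkhash’s actual tokenizer may pad values and format positional tags differently.

```python
def ngrams(value: str, n: int, positional: bool = False) -> list:
    """Split a string into n-grams, optionally tagged with their position."""
    grams = [value[i:i + n] for i in range(len(value) - n + 1)]
    if positional:
        grams = [f"{i + 1} {g}" for i, g in enumerate(grams)]
    return grams

# Non-positional bigrams, as used for given names and surnames:
print(ngrams("tara", 2))        # ['ta', 'ar', 'ra']
# Positional unigrams, as used for the phone number:
print(ngrams("0822", 1, True))  # ['1 0', '2 8', '3 2', '4 2']
```

Positional tagging makes the date-of-birth comparison stricter: swapping day and month produces entirely different tokens.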
Sneak peek at input data¶
Alice¶
[5]:
pd.read_csv('data/dataset-alice.csv').head()
[5]:
id | givenname | surname | dob | phone number | gender | |
---|---|---|---|---|---|---|
0 | 0 | tara | hilton | 27-08-1941 | 08 2210 0298 | male |
1 | 3 | saJi | vernre | 22-12-2972 | 02 1090 1906 | mals |
2 | 7 | sliver | paciorek | NaN | NaN | mals |
3 | 9 | ruby | george | 09-05-1939 | 07 4698 6255 | male |
4 | 10 | eyrinm | campbell | 29-1q-1983 | 08 299y 1535 | male |
Bob¶
[6]:
pd.read_csv('data/dataset-bob.csv').head()
[6]:
id | givenname | surname | dob | phone number | city | |
---|---|---|---|---|---|---|
0 | 3 | zali | verner | 22-12-1972 | 02 1090 1906 | perth |
1 | 4 | samuel | tremellen | 21-12-1923 | 03 3605 9336 | melbourne |
2 | 5 | amy | lodge | 16-01-1958 | 07 8286 9372 | canberra |
3 | 7 | oIji | pacioerk | 10-02-1959 | 04 4220 5949 | sydney |
4 | 10 | erin | kampgell | 29-12-1983 | 08 2996 1445 | perth |
Charlie¶
[7]:
pd.read_csv('data/dataset-charlie.csv').head()
[7]:
id | givenname | surname | dob | phone number | income | |
---|---|---|---|---|---|---|
0 | 1 | joshua | arkwright | 16-02-1903 | 04 8511 9580 | 70189.446 |
1 | 3 | zal: | verner | 22-12-1972 | 02 1090 1906 | 50194.118 |
2 | 7 | oliyer | paciorwk | 10-02-1959 | 04 4210 5949 | 31750.993 |
3 | 8 | nacoya | ranson | 17-08-1925 | 07 6033 4580 | 102446.131 |
4 | 10 | erih | campbell | 29-12-1i83 | 08 299t 1435 | 331476.599 |
Analyst: create the project¶
The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob, and Charlie. The project ID is known to everyone.
[8]:
!clkutil create-project \
--server $SERVER \
--type groups \
--schema data/schema_ABC.json \
--parties 3 \
--output credentials.json
with open('credentials.json') as f:
credentials = json.load(f)
project_id = credentials['project_id']
result_token = credentials['result_token']
update_token_alice = credentials['update_tokens'][0]
update_token_bob = credentials['update_tokens'][1]
update_token_charlie = credentials['update_tokens'][2]
Project created
Alice: hash the data and upload it to the server¶
The data is hashed according to the schema and the keys. Alice’s update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes.
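Conceptually, hashing turns each record into a cryptographic longterm key (CLK): every field is tokenized into n-grams, and each n-gram sets up to `bitsPerToken` bits in an `l = 1024`-bit Bloom filter using keyed hashes of the secret. The sketch below is a simplified illustration of that idea, not clkhash’s exact algorithm (the real implementation derives per-feature keys with HKDF per the schema’s `kdf` section and uses its own hashing strategy).

```python
import hashlib
import hmac

L = 1024  # Bloom filter length, matching "l" in the schema

def encode(grams, secret: bytes, bits_per_token: int) -> int:
    """Set bits_per_token bits per n-gram in an L-bit Bloom filter (as an int)."""
    bloom = 0
    for gram in grams:
        for k in range(bits_per_token):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            index = int.from_bytes(digest[:4], "big") % L
            bloom |= 1 << index
    return bloom

a = encode(["ta", "ar", "ra"], b"my_secret", 15)
b = encode(["ta", "ar", "ka"], b"my_secret", 15)  # one differing bigram
# Shared bigrams set identical bits in both filters, enabling fuzzy comparison:
overlap = bin(a & b).count("1")
```

Because only these bit patterns are uploaded, the service can measure similarity between records without ever seeing the underlying names, dates, or phone numbers.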
[9]:
!clkutil hash \
data/dataset-alice.csv \
$SECRET \
data/schema_ABC.json \
dataset-alice-hashed.json \
--check-header false
CLK data written to dataset-alice-hashed.json
[10]:
!clkutil upload \
--server $SERVER \
--apikey $update_token_alice \
--project $project_id \
dataset-alice-hashed.json
{"message": "Updated", "receipt_token": "c202d98eb83c7e55e6177ba9bcf55cb35f40ac1d21714897"}
Bob: hash the data and upload it to the server¶
[11]:
!clkutil hash \
data/dataset-bob.csv \
$SECRET \
data/schema_ABC.json \
dataset-bob-hashed.json \
--check-header false
CLK data written to dataset-bob-hashed.json
[12]:
!clkutil upload \
--server $SERVER \
--apikey $update_token_bob \
--project $project_id \
dataset-bob-hashed.json
{"message": "Updated", "receipt_token": "75083f544df8e944cc590089bb3e31c134e810992f08ea80"}
Charlie: hash the data and upload it to the server¶
[13]:
!clkutil hash \
data/dataset-charlie.csv \
$SECRET \
data/schema_ABC.json \
dataset-charlie-hashed.json \
--check-header false
CLK data written to dataset-charlie-hashed.json
[14]:
!clkutil upload \
--server $SERVER \
--apikey $update_token_charlie \
--project $project_id \
dataset-charlie-hashed.json
{"message": "Updated", "receipt_token": "814b4a226453d7261348a403e134b0764501432bf679658f"}
Analyst: start the linkage run¶
This will start the linkage computation. We then watch the run until it completes and retrieve the results.
[15]:
!clkutil create \
--server $SERVER \
--project $project_id \
--apikey $result_token \
--threshold 0.7 \
--output=run-credentials.json
with open('run-credentials.json') as f:
run_credentials = json.load(f)
run_id = run_credentials['run_id']
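The `--threshold 0.7` sets the minimum similarity for two records to be considered a match. Anonlink compares CLKs using the Sørensen–Dice coefficient of their bit vectors: twice the number of bits set in both, divided by the total number of bits set in each. A minimal sketch, representing bit vectors as Python ints:

```python
def dice(a: int, b: int) -> float:
    """Sørensen–Dice coefficient of two bit vectors represented as ints."""
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * common / total if total else 0.0

# Two 8-bit filters, each with 4 bits set, sharing 3 of them:
x = 0b10110100
y = 0b10110001
print(dice(x, y))  # 2*3/8 = 0.75, above the 0.7 threshold
```

Lowering the threshold links more records at the cost of more false matches; raising it does the opposite.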
Analyst: retrieve the results¶
[16]:
!clkutil results \
--server $SERVER \
--project $project_id \
--apikey $result_token \
--run $run_id \
--watch \
--output linkage-output.json
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[17]:
with open('linkage-output.json') as f:
linkage_output = json.load(f)
linkage_groups = linkage_output['groups']
linkage_groups[-15:]
[17]:
[[[0, 1787], [1, 1751], [2, 1784]],
[[0, 565], [1, 557], [2, 564]],
[[0, 836], [1, 815], [2, 850]],
[[0, 505], [2, 495]],
[[0, 536], [2, 525], [1, 512]],
[[0, 1641], [2, 1608], [1, 1584]],
[[0, 2234], [1, 2228], [2, 2242]],
[[0, 781], [1, 762], [2, 799]],
[[0, 918], [2, 2840]],
[[1, 1393], [2, 1421], [0, 1451]],
[[1, 1587], [2, 1609], [0, 1642]],
[[1, 1730], [2, 1767]],
[[1, 2808], [2, 2813]],
[[0, 2765], [2, 2794], [1, 2789]],
[[1, 351], [2, 356]]]
The result is a list of groups of records. All records within a group are believed to belong to the same entity. Each record is a pair of the party index and the row index in that party’s dataset:
[
[[party_id, row_index], ... ],
...
]
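For example, a single group can be turned into a per-party lookup of matched row indices (the group below is copied from the output above; the party names are our own labels):

```python
# One group as returned in linkage_output['groups']:
group = [[0, 1787], [1, 1751], [2, 1784]]

party_names = ['Alice', 'Bob', 'Charlie']
# Map each party to the matched row index in its dataset:
rows = {party_names[party]: row for party, row in group}
print(rows)  # {'Alice': 1787, 'Bob': 1751, 'Charlie': 1784}

# Groups need not contain all three parties; a pair such as [[0, 505], [2, 495]]
# means only Alice and Charlie hold records for that entity.
```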
Everyone: make table of interesting information¶
We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII.
[18]:
with open('data/dataset-alice.csv') as f:
r = csv.reader(f)
next(r) # Skip header
genders = tuple(row[-1] for row in r)
with open('data/dataset-bob.csv') as f:
r = csv.reader(f)
next(r) # Skip header
cities = tuple(row[-1] for row in r)
with open('data/dataset-charlie.csv') as f:
r = csv.reader(f)
next(r) # Skip header
incomes = tuple(row[-1] for row in r)
[19]:
table = []
for group in linkage_groups:
row = [''] * 3
for i, j in group:
row[i] = [genders, cities, incomes][i][j]
if sum(map(bool, row)) > 1:
table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)
[19]:
gender | city | income | |
---|---|---|---|
0 | male | melbourne | |
1 | femalr | 277039.294 | |
2 | pertb | 21407e.192 | |
3 | mlebourne | 56899.522 | |
4 | male | canberra | |
5 | femaoe | sydn3y | |
6 | male | 154195.553 | |
7 | female | 44652.704 | |
8 | male | sydnely | |
9 | mal3 | sydney |
The first ten rows of the resulting table are shown above.
Sneak peek at the result¶
We obviously can’t do this in a real-world setting, but let’s view the linkage using the PII. If the IDs match, then we are correct.
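Because these synthetic datasets keep a ground-truth `id` column, a group is correct exactly when all of its records share the same `id`. The sketch below scores a couple of groups against tiny hypothetical datasets (the `datasets` and `groups` values here are made up for illustration; the real check would use the full CSVs and `linkage_groups`):

```python
# Hypothetical ground truth: each record is (id, givenname); real rows are longer.
datasets = [
    [('3', 'saJi'), ('7', 'sliver')],   # Alice
    [('3', 'zali'), ('7', 'oIji')],     # Bob
    [('3', 'zal:'), ('8', 'nacoya')],   # Charlie
]
groups = [
    [[0, 0], [1, 0], [2, 0]],  # ids 3, 3, 3 -> correct
    [[0, 1], [1, 1], [2, 1]],  # ids 7, 7, 8 -> incorrect
]

def group_correct(group):
    """A group is correct if all of its records carry the same ground-truth id."""
    ids = {datasets[party][row][0] for party, row in group}
    return len(ids) == 1

accuracy = sum(map(group_correct, groups)) / len(groups)
print(accuracy)  # 0.5
```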
[20]:
with open('data/dataset-alice.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_alice = tuple(r)
with open('data/dataset-bob.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_bob = tuple(r)
with open('data/dataset-charlie.csv') as f:
r = csv.reader(f)
next(r) # Skip header
dataset_charlie = tuple(r)
[21]:
table = []
for group in linkage_groups:
for i, j in sorted(group):
table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
table.append([''] * 6)
pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)
[21]:
id | given name | surname | dob | phone number | non-linking | |
---|---|---|---|---|---|---|
6450 | 5436 | nikki | spears | 10-02-2097 | 06 9447 1767 | 156639.106 |
6451 | ||||||
6452 | 5833 | nell | rud | 06-1p-1956 | 08 5510 5369 | sydnev |
6453 | 5833 | ned | reif | 06-20-1956 | 08 5510 5369 | 117275.089 |
6454 | ||||||
6455 | 872 | jackson | green | 06-09-1920 | ||
6456 | 872 | jackson | gnn | 06-00-1920 | 08 3409 2246 | 147663.277 |
6457 | ||||||
6458 | 8662 | luct | pulfort | 05-03-1903 | 02 0726 9479 | male |
6459 | 8662 | lucy | pulford | 05-03-1903 | melbourrie | |
6460 | 8662 | lusy | pulford | 05-03-1993 | 02 0726 0489 | 192230.309 |
6461 | ||||||
6462 | 1885 | nicholas | robson | 06-01-1914 | 02 7799 6803 | canberra |
6463 | 1885 | nicho|as | robson | 06-91-1914 | 02 7799 6803 | 61333.218 |
6464 ||||||
[22]:
# Deleting the project
!clkutil delete-project --project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--server="{SERVER}"
Project deleted