[1]:
import csv
import os

import requests

Entity Service: Multiparty linkage demo

This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.

We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.

Check the status of the Entity Service

Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.

[2]:
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())
{'project_count': 10, 'rate': 20496894, 'status': 'ok'}
{'anonlink': '0.11.2', 'entityservice': 'v1.11.0', 'python': '3.6.8'}

Create a new project

We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the 'groups' output type supports multiparty linkage). Retain the project_id so we can find the project later, and the result_token so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the update_tokens identify the five data providers and permit them to upload CLKs.

[3]:
project_info = requests.post(
    f"{PREFIX}/projects",
    json={
        "schema": {},
        "result_type": "groups",
        "number_parties": 5,
        "name": "example project"
    }
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]

print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)
project_id: 8eeb1050f5add8f78ff4a0da04219fead48f22220fb0f15e

result_token: c8f22b577aac9432871eeea02cbe504d399a9776add1de9f

update_tokens: ['6bf0f1c84c17116eb9f93cf8a4cfcb13d49d288a1f376dd8', '4b9265070849af1f0546f2adaeaa85a7d0e60b10f9b4afbc', '3ff03cadd750ce1b40cc4ec2b99db0132f62d8687328eeb9', 'c1b562ece6bbef6cd1a0541301bb1f82bd697bce04736296', '8cfdebbe12c65ae2ff20fd0c0ad5de4feb06c9a9dd1209c8']

Upload the hashed data

This is where each party uploads their CLKs to the service. Here, we do the work of all five data providers inside a single for loop; in a real deployment, each data provider would upload their own CLKs using their own update token.

These CLKs have already been generated with clkhash, so for each data provider we just need to upload the corresponding hash file.
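For reference, a file like data/clks-1.json could have been produced along the following lines. This is a minimal sketch, not part of this demo: it assumes a linkage schema file schema.json agreed on by all parties, and the exact arguments of generate_clk_from_csv vary between clkhash versions.

import json

from clkhash import clk, schema

# Load the linkage schema shared by all data providers (hypothetical file name).
with open("schema.json") as f:
    linkage_schema = schema.from_json_file(f)

# Hash the raw PII into CLKs. Older clkhash versions take a pair of secret
# keys; newer versions take a single secret string.
with open("data/dataset-1.csv") as f:
    clks = clk.generate_clk_from_csv(f, ("key1", "key2"), linkage_schema)

# Assumption: the service expects a JSON object with a "clks" list,
# matching the data/clks-*.json files uploaded below.
with open("data/clks-1.json", "w") as f:
    json.dump({"clks": clks}, f)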

[4]:
for i, token in enumerate(update_tokens, start=1):
    with open(f"data/clks-{i}.json") as f:
        r = requests.post(
            f"{PREFIX}/projects/{project_id}/clks",
            data=f,
            headers={
                "Authorization": token,
                "content-type": "application/json"
            }
        )
    print(f"Data provider {i}: {r.text}")
Data provider 1: {
  "message": "Updated",
  "receipt_token": "c7d9ba71260863f13af55e12603f8694c29e935262b15687"
}

Data provider 2: {
  "message": "Updated",
  "receipt_token": "70e4ed1b403c4e628183f82548a9297f8417ca3de94648bf"
}

Data provider 3: {
  "message": "Updated",
  "receipt_token": "b56fe568b93dc4522444e503078e16c18573adecbc086b6a"
}

Data provider 4: {
  "message": "Updated",
  "receipt_token": "7e3c80e554cfde23847d9aa2cff1323aa8f411e4033c0562"
}

Data provider 5: {
  "message": "Updated",
  "receipt_token": "8bde91367ee52b5c6804d5ce2d2d3350ce3c3766b8625bbc"
}

Begin a run

The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.
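For instance, several thresholds could be compared by creating one run per value; a sketch using the same endpoint as the cell below:

# Hypothetical sweep over candidate thresholds; each POST creates one run.
run_ids = {}
for t in [0.7, 0.8, 0.9]:
    resp = requests.post(
        f"{PREFIX}/projects/{project_id}/runs",
        headers={"Authorization": result_token},
        json={"threshold": t},
    )
    run_ids[t] = resp.json()["run_id"]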

Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.

[5]:
r = requests.post(
    f"{PREFIX}/projects/{project_id}/runs",
    headers={
        "Authorization": result_token
    },
    json={
        "threshold": 0.8
    }
)
run_id = r.json()["run_id"]

Check the status

Let’s see whether the run has finished (‘state’ is ‘completed’)!

[6]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
    headers={
        "Authorization": result_token
    }
)
r.json()
[6]:
{'current_stage': {'description': 'waiting for CLKs',
  'number': 1,
  'progress': {'absolute': 5,
   'description': 'number of parties already contributed',
   'relative': 1.0}},
 'stages': 3,
 'state': 'queued',
 'time_added': '2019-06-23T11:17:27.646642+00:00',
 'time_started': None}

After some delay (depending on the size of the datasets) the results will be available. We could wait for completion by polling the REST API directly with requests (see the sketch below); for simplicity, however, we use the watch_run_status function provided in clkhash.rest_client.
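A polling loop might look like this (a minimal sketch; it assumes the run's terminal states include 'completed' and 'error', of which only 'queued' and 'completed' appear in this notebook):

import time

def wait_for_run(prefix, project_id, run_id, token, poll_seconds=2.0):
    # Poll the status endpoint used above until the run reaches a
    # terminal state.
    while True:
        status = requests.get(
            f"{prefix}/projects/{project_id}/runs/{run_id}/status",
            headers={"Authorization": token},
        ).json()
        if status["state"] in ("completed", "error"):
            return status
        time.sleep(poll_seconds)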

[7]:
import clkhash.rest_client
from IPython.display import clear_output

for update in clkhash.rest_client.watch_run_status(SERVER, project_id, run_id, result_token, timeout=30):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))

State: completed
Stage (3/3): compute output

Retrieve the results

We retrieve the results of the linkage. As we requested when creating the project, the result is a list of groups of records. All records in a group are believed to belong to the same entity; each record is a pair of the data provider index and the row index within that provider's dataset.

The last 20 groups look like this.

[8]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={
        "Authorization": result_token
    }
)
groups = r.json()
groups['groups'][-20:]
[8]:
[[[0, 3127], [3, 3145], [2, 3152], [1, 3143]],
 [[2, 1653], [3, 1655], [1, 1632], [0, 1673], [4, 1682]],
 [[0, 2726], [1, 2737], [3, 2735]],
 [[1, 837], [3, 864]],
 [[0, 1667], [4, 1676], [1, 1624], [3, 1646]],
 [[1, 1884], [2, 1911], [4, 1926], [0, 1916]],
 [[0, 192], [2, 198]],
 [[3, 328], [4, 330], [0, 350], [2, 351], [1, 345]],
 [[2, 3173], [4, 3176], [3, 3163], [0, 3145], [1, 3161]],
 [[1, 347], [4, 332], [2, 353], [0, 352]],
 [[1, 736], [3, 761], [2, 768], [0, 751], [4, 754]],
 [[1, 342], [2, 349]],
 [[3, 899], [2, 913]],
 [[1, 465], [3, 477]],
 [[0, 285], [1, 293]],
 [[0, 785], [3, 794]],
 [[3, 2394], [4, 2395], [0, 2395]],
 [[1, 1260], [2, 1311], [3, 1281], [4, 1326]],
 [[0, 656], [2, 663]],
 [[1, 2468], [2, 2479]]]
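Since each group is just a list of (party, row) pairs, simple summary statistics are easy to compute. For example, a quick sketch of the distribution of group sizes:

from collections import Counter

# Count how many groups contain each possible number of records.
size_counts = Counter(len(group) for group in groups["groups"])
for size, count in sorted(size_counts.items()):
    print(f"{count} groups of size {size}")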

As a sanity check, we print the PII corresponding to the records in these groups:

[17]:
def load_dataset(i):
    dataset = []
    with open(f"data/dataset-{i}.csv") as f:
        reader = csv.reader(f)
        next(reader)  # ignore header
        for row in reader:
            dataset.append(row[1:])  # drop the first column
    return dataset

datasets = list(map(load_dataset, range(1, 6)))

for group in groups["groups"][-20:]:
    for (i, j) in group:
        print(i, datasets[i][j])
    print()
0 ['samual', 'mason', '05-12-1917', 'male', 'pertb', '405808.756', '07 2284 3649']
3 ['samuAl', 'mason', '05-12-1917', 'male', 'peryh', '4058o8.756', '07 2274 3549']
2 ['samie', 'mazon', '05-12-1917', 'male', '', '405898.756', '07 2275 3649']
1 ['zamusl', 'mason', '05-12-2917', 'male', '', '405898.756', '07 2274 2649']

2 ['thomas', 'burfrod', '08-04-1999', '', 'pertj', '182174.209', '02 3881 9666']
3 ['thomas', 'burfrod', '09-04-1999', 'male', '', '182174.209', '02 3881 9666']
1 ['thomas', 'burford', '08-04-19o9', 'mal4', '', '182175.109', '02 3881 9666']
0 ['thomas', 'burford', '08-04-1999', 'male', 'perth', '182174.109', '02 3881 9666']
4 ['thomas', 'burf0rd', '08-04-q999', 'mske', 'perrh', '182174.109', '02 3881 9666']

0 ['kaitlin', 'bondza', '03-08-1961', 'male', 'sydney', '41168.999', '02 4632 1380']
1 ['kaitlin', 'bondja', '03-08-1961', 'malr', 'sydmey', '41168.999', '02 4632 1370']
3 ["k'latlin", 'bonklza', '03-08-1961', 'male', 'sydaney', '', '02 4632 1380']

1 ['chr8stian', 'jolly', '22-08-2009', 'male', '', '178371.991', '04 5868 7703']
3 ['chr8stian', 'jolly', '22-09-2099', 'malr', 'melbokurne', '178271.991', '04 5868 7703']

0 ['oaklrigh', 'ngvyen', '24-07-1907', 'mslr', 'sydney', '63175.398', '04 9019 6235']
4 ['oakleith', 'ngvyen', '24-97-1907', 'male', 'sydiney', '63175.498', '04 9019 6235']
1 ['oajleigh', 'ngryen', '24-07-1007', 'male', 'sydney', '63175.498', '04 9919 6235']
3 ['oakleigh', 'nguyrn', '34-07-1907', 'male', 'sbdeney', '63175.r98', '04 9019 6235']

1 ['georgia', 'nguyen', '06-11-1930', 'male', 'perth', '247847.799', '08 6560 4063']
2 ['georia', 'nfuyen', '06-11-1930', 'male', 'perrh', '247847.799', '08 6560 4963']
4 ['geortia', 'nguyea', '06-11-1930', 'male', 'pertb', '247847.798', '08 6560 4063']
0 ['egorgia', 'nguyqn', '06-11-1930', 'male', 'peryh', '247847.799', '08 6460 4963']

0 ['connor', 'mcneill', '05-09-1902', 'male', 'sydney', '108473.824', '02 6419 9472']
2 ['connro', 'mcnell', '05-09-1902', 'male', 'sydnye', '108474.824', '02 6419 9472']

3 ['alessandria', 'sherriff', '25-91-1951', 'male', 'melb0urne', '5224r.762', '03 3077 2019']
4 ['alessandria', 'sherriff', '25-01-1951', 'male', 'melbourne', '52245.762', '03 3077 1019']
0 ['alessandria', "sherr'lff", '25-01-1951', 'malr', 'melbourne', '', '03 3977 1019']
2 ['alessandria', 'shernff', '25-01-1051', 'mzlr', 'melbourne', '52245.663', '03 3077 1019']
1 ['alessandrya', 'sherrif', '25-01-1961', 'male', 'jkelbouurne', '52245.762', '03 3077 1019']

2 ['harriyon', 'micyelmor', '21-04-1971', 'male', 'pert1>', '291889.942', '04 5633 5749']
4 ['harri5on', 'micyelkore', '21-04-1971', '', 'pertb', '291880.942', '04 5633 5749']
3 ['hariso17', 'micelmore', '21-04-1971', 'male', 'pertb', '291880.042', '04 5633 5749']
0 ['harrison', 'michelmore', '21-04-1981', 'malw', 'preth', '291880.942', '04 5643 5749']
1 ['harris0n', 'michelmoer', '21-04-1971', '', '', '291880.942', '04 5633 5749']

1 ['alannah', 'gully', '15-04-1903', 'make', 'meobourne', '134518.814', '04 5104 4572']
4 ['alana', 'gully', '15-04-1903', 'male', 'melbourne', '134518.814', '04 5104 4582']
2 ['alama', 'gulli', '15-04-1903', 'mald', 'melbourne', '134518.814', '04 5104 5582']
0 ['alsna', 'gullv', '15-04-1903', 'male', '', '134518.814', '04 5103 4582']

1 ['sraah', 'bates-brownsword', '26-11-1905', 'malr', '', '59685.979', '03 8545 5584']
3 ['sarah', 'bates-brownswort', '26-11-1905', 'male', '', '59686.879', '03 8545 6584']
2 ['sara0>', 'bates-browjsword', '26-11-1905', 'male', '', '59685.879', '']
0 ['saran', 'bates-brownsvvord', '26-11-1905', 'malr', 'sydney', '59685.879', '03 8555 5584']
4 ['snrah', 'bates-bro2nsword', '26-11-1005', 'male', 'sydney', '58685.879', '03 8545 5584']

1 ['beth', 'lette', '18-01-2000', 'female', 'sydney', '179719.049', '07 1868 6031']
2 ['beth', 'lette', '18-02-2000', 'femal4', 'stdq7ey', '179719.049', '07 1868 6931']

3 ['tahlia', 'bishlp', '', 'female', 'sydney', '101203.290', '03 886u 1916']
2 ['ahlia', 'bishpp', '', 'female', 'syriey', '101204.290', '03 8867 1916']

1 ['fzachary', 'mydlalc', '20-95-1916', 'male', 'sydney', '121209.129', '08 3807 4717']
3 ['zachary', 'mydlak', '20-05-1016', 'malr', 'sydhey', '121200.129', '08 3807 4627']

0 ['jessica', 'white', '04-07-1979', 'male', 'perth', '385632.266', '04 8026 8748']
1 ['jezsica', 'whi5e', '05-07-1979', 'male', 'perth', '385632.276', '04 8026 8748']

0 ['beriiamin', 'musoluno', '21-0y-1994', 'female', 'sydney', '81857.391', '08 8870 e498']
3 ['byenzakin', 'musoljno', '21-07-1995', 'female', 'sydney', '81857.392', '']

3 ['ella', 'howie', '26-03-2003', 'male', 'melbourne', '97556.316', '03 3655 1171']
4 ['ela', 'howie', '26-03-2003', 'male', 'melboirne', '', '03 3555 1171']
0 ['lela', 'howie', '26-03-2903', 'male', 'melbourhe', '', '03 3655 1171']

1 ['livia', 'riaj', '13-03-1907', 'malw', 'melbovrne', '73305.107', '07 3846 2530']
2 ['livia', 'ryank', '13-03-1907', 'malw', 'melbuorne', '73305.107', '07 3946 2630']
3 ['ltvia', 'ryan', '13-03-1907', 'maoe', 'melbourne', '73305.197', '07 3046 2530']
4 ['livia', 'ryan', '13-03-1907', 'male', 'melbourne', '73305.107', '07 3946 2530']

0 ['coby', 'ibshop', '', 'msle', 'sydney', '211655.118', '02 0833 7777']
2 ['coby', 'bishop', '15-08-1948', 'male', 'sydney', '211655.118', '02 9833 7777']

1 ['emjkly', 'pareemore', '01-03-2977', 'female', 'rnelbourne', '1644487.925', '03 5761 5483']
2 ['emiily', 'parremore', '01-03-1977', 'female', 'melbourne', '1644487.925', '03 5761 5483']

Despite the large amount of noise in the data, the entity service was able to produce a fairly accurate matching. A few of the resulting groups are nevertheless likely to be false matches, where records of different people were grouped together.

We may be able to improve on these results by fine-tuning the hashing schema or by adjusting the similarity threshold.

Delete the project

[18]:
r = requests.delete(
    f"{PREFIX}/projects/{project_id}",
    headers={
        "Authorization": result_token
    }
)
print(r.status_code)
204