Entity Service - v1.11.0

The Entity Service allows two organizations to carry out private record linkage — finding matching records of entities between their respective datasets without disclosing personally identifiable information.

Overview

The Entity Service is based on the concept of Anonymous Linking Codes (ALC). These can be seen as bit-arrays representing an entity, with the property that the similarity of the bits of two ALCs reflects the similarity of the corresponding entities.

An anonymous linking code that has been shown to produce good results and is widely used in practice is the so-called *Cryptographic Longterm Key*, or CLK for short.

Note

From now on, we will use CLK exclusively instead of ALC, as our reference implementation of the private record linkage process uses CLKs as the anonymous linking code. The Entity Service is, however, not limited to CLKs.

Entity Service Overview

Schematic overview of the process of private record linkage using the Entity Service

Private record linkage using the Entity Service is a two stage process:

  • First, each data provider locally transforms its PII into anonymous linking codes (CLKs), for example using the clkhash tool.
  • Second, the CLKs are uploaded to the Entity Service, which compares them and produces the agreed output (a mapping, similarity scores, or permutations with a mask).

Tutorials

Run Status

[23]:
requests.get(
        '{}projects/{}/runs/{}/status'.format(url, project_id, run_id),
        headers={"Authorization": credentials['result_token']}
    ).json()
[23]:
{'current_stage': {'description': 'compute similarity scores',
  'number': 2,
  'progress': {'absolute': 25000000,
   'description': 'number of already computed similarity scores',
   'relative': 1.0}},
 'stages': 3,
 'state': 'running',
 'time_added': '2019-04-30T12:18:44.633541+00:00',
 'time_started': '2019-04-30T12:18:44.778142+00:00'}

Now after some delay (depending on the size) we can fetch the results. This can of course be done by directly polling the REST API using requests; however, for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.

Note that the server address (server) is passed to watch_run_status rather than the url used above.
[24]:
import clkhash.rest_client
for update in clkhash.rest_client.watch_run_status(server, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))

State: completed
Stage (3/3): compute output
[25]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    server,
    project_id,
    run_id,
    credentials['result_token']))

This result is the 1-1 mapping between rows that were more similar than the given threshold.

[30]:
for i in range(10):
    print("a[{}] maps to b[{}]".format(i, data['mapping'][str(i)]))
print("...")
a[0] maps to b[1449]
a[1] maps to b[2750]
a[2] maps to b[4656]
a[3] maps to b[4119]
a[4] maps to b[3306]
a[5] maps to b[2305]
a[6] maps to b[3944]
a[7] maps to b[992]
a[8] maps to b[4612]
a[9] maps to b[3629]
...

In this dataset there are 5000 records in common. With the chosen threshold and schema we currently retrieve:

[31]:
len(data['mapping'])
[31]:
4853
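
Since we know there are 5000 true matches in this dataset, a quick back-of-the-envelope check (a sketch, not part of the original notebook) gives an upper bound on the recall at this threshold:

# Assuming the retrieved pairs are (mostly) correct, as the permutation tutorial
# below suggests, recall is bounded by the number of retrieved pairs.
true_matches = 5000
retrieved = len(data['mapping'])                                      # 4853 above
print("recall is at most {:.1%}".format(retrieved / true_matches))   # 97.1%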

Cleanup

If you want you can delete the run and project from the anonlink-entity-service.

[44]:
requests.delete(
    "{}/projects/{}".format(url, project_id),
    headers={"Authorization": credentials['result_token']})
[44]:
<Response [403]>

Entity Service Permutation Output

This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, Alice and Bob, each have a dataset of personally identifiable information (PII) of several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the Analyst.

The chosen output type is permutations, which consists of two permutations and one mask.

Who learns what?

After the linkage has been carried out Alice and Bob will be able to retrieve a permutation - a reordering of their respective data sets such that shared entities line up.

The Analyst - who creates the linkage project - learns the mask. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
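
To make this concrete, here is a purely illustrative sketch with toy data (not the tutorial datasets): each party ends up with a permuted copy of its own rows, while the Analyst holds only the mask.

# Toy illustration: aligned positions refer to the same entity where mask == 1.
alice_permuted = ["alice row 7", "alice row 2", "alice row 5", "alice row 0"]
bob_permuted   = ["bob row 3",   "bob row 9",   "bob row 1",   "bob row 4"]
mask           = [1, 0, 1, 1]

# The Analyst learns how many entities are shared, but not who they are.
print("shared entities:", sum(mask))   # 3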

Steps

These steps are usually run by different companies - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the Analyst acts as the integration authority.

Check Connection

If you’re connecting to a custom entity service, change the address here.
[1]:
import os
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[2]:
!clkutil status --server "{url}"
{"project_count": 2109, "rate": 8216626, "status": "ok"}

Data preparation

Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.

[3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[4]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head(3)

[4]:
given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id
rec_id
rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218
rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625
rec-4405-org charles green 38 salkauskas crescent kela dapto 4566 nsw 19480930 4365168

The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.

[5]:
schema = NamedTemporaryFile('wt')
[6]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
        "info": "c2NoZW1hX2V4YW1wbGU=",
        "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
        "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}
Overwriting /tmp/tmpu8y0vxd4

Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.

[7]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "permutations" --server "{url}"
creds.seek(0)

import json
with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpngtrvblo
Project created
[7]:
{'project_id': '539a612e09bbac7fc5178f7798e15dfc310bc06878ff25fe',
 'result_token': '2a52a9729facd2fd4e547b8029697e3ab7a464c32f3ada7e',
 'update_tokens': ['47f701f76e06e2283f68dfddfb15da4b56bb05a43d6c5acb',
  '0b2228ff49ef9caeb29744f9ce97b39280873919a60a8765']}

Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.

Hash and Upload

At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. We need:

  • the clkhash library
  • the linkage schema from above
  • two secret passwords which are only known to Alice and Bob (here: horse and staple)

Please see clkhash documentation for further details on this.

[8]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 3.31kclk/s, mean=765, std=37.1]
CLK data written to /tmp/tmpy3s8f407.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 3.53kclk/s, mean=756, std=43.3]
CLK data written to /tmp/tmp0fdoothg.json

Now the two clients can upload their data providing the appropriate upload tokens and the project_id. As with all commands in clkhash we can output help:

[9]:
!clkutil upload --help
Usage: clkutil upload [OPTIONS] CLK_JSON

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as CLK_JSON, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --project TEXT         Project identifier
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.
Alice uploads her data
[10]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload gets a receipt token. This token is required to access the results.

Bob uploads his data
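
Bob's upload cell is not shown in this excerpt; it mirrors Alice's upload but uses the second update token and Bob's CLK file, and keeps his receipt token (used below to fetch his permutation). A sketch, reusing the variables defined above:

with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"

    bob_receipt_token = json.load(open(f.name))['receipt_token']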

Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
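
The run-creation cell is omitted from this excerpt; it follows the same pattern as in the similarity scores tutorial below and defines the run_id used by the following cells. A sketch (the threshold of 0.9 is an example value, not necessarily the one used to produce the outputs below):

with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.9 \
        --output "{f.name}"

    run_id = json.load(open(f.name))['run_id']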

Now after some delay (depending on the size) we can fetch the mask. This can be done with clkutil:

!clkutil results --server "{url}" \
    --project="{credentials['project_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

However for this tutorial we are going to use the Python requests library:

[14]:
import requests
import clkhash.rest_client

from IPython.display import clear_output
[15]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
[17]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()
[18]:
mask = results['mask']

This mask is a boolean array that specifies where rows of permuted data line up.

[19]:
print(mask[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The number of 1s in the mask will tell us how many matches were found.

[20]:
sum([1 for m in mask if m == 1])
[20]:
4858

We also use requests to fetch the permutations for each data provider:

[21]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation - a new ordering for their data.

[22]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]
[22]:
[4659, 4076, 1898, 868, 3271, 2486, 1078, 3774, 2656, 4324]

This permutation says the first row of Alice’s data should be moved to position 4659.

[23]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]
[23]:
[3074, 1996, 4523, 500, 3384, 1115, 746, 1165, 2999, 2204]
[24]:
def reorder(items, order):
    """
    Reorder items, assuming order is a list of new indices:
    items[i] is moved to position order[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item

    return neworder
[25]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()[1:]
    alice_reordered = reorder(alice_raw, alice_permutation)

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()[1:]
    bob_reordered = reorder(bob_raw, bob_permutation)

Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don’t.

[26]:
alice_reordered[:10]
[26]:
['rec-4746-org,gabrielle,fargahry-tolba,10,northbourne avenue,pia place,st georges basin,2011,vic,19640424,7326839\n',
 'rec-438-org,alison,hearn,4,macdonnell street,cabrini medical centre,adelaide,2720,vic,19191230,2937695\n',
 'rec-3902-org,,oreilly,,paul coe crescent,wylarah,tuart hill,3219,vic,19500925,4201497\n',
 'rec-920-org,benjamin,clarke,122,archibald street,locn 1487,nickol,2535,nsw,19010518,1978760\n',
 'rec-2152-org,emiily,fitzpatrick,,aland place,keralland,rowville,2219,vic,19270130,1148897\n',
 'rec-3434-org,alex,clarke,12,fiveash street,emerald garden,homebush,2321,nsw,19840627,7280280\n',
 'rec-4197-org,talan,stubbs,21,augustus way,ashell,croydon north,3032,wa,19221022,7550622\n',
 'rec-2875-org,luke,white,31,outtrim avenue,glenora farm,flinders bay,2227,sa,19151010,6925269\n',
 'rec-2559-org,emiily,binns,24,howell place,sec 142 hd rounsevell,ryde,2627,wa,19941108,8919080\n',
 'rec-2679-org,thomas,brain,108,brewster place,geelong grove,eight mile plains,2114,qld,19851127,8873329\n']
[27]:
bob_reordered[:10]
[27]:
['rec-4746-dup-0,gabrielle,fargahry-tolba,11,northbourne avenue,pia place,st georges basin,2011,vic,19640424,7326839\n',
 'rec-438-dup-0,heatn,alison,4,macdonnell street,cabrini medicalb centre,adelaide,2270,vic,19191230,2937695\n',
 'rec-3902-dup-0,,oreilly,,paul coe cerscent,wylrah,tuart hill,3219,vic,19500925,4201497\n',
 'rec-920-dup-0,scott,clarke,122,archibald street,locn 1487,nickol,2553,nsw,19010518,1978760\n',
 'rec-2152-dup-0,megna,fitzpatrick,,aland place,keralalnd,rowville,2219,vic,19270130,1148897\n',
 'rec-3434-dup-0,alex,clarke,12,,emeral dgarden,homebush,2321,nsw,19840627,7280280\n',
 'rec-4197-dup-0,talan,stubbs,21,binns street,ashell,croydon north,3032,wa,19221022,7550622\n',
 'rec-2875-dup-0,luke,white,31,outtrim aqenue,glenora farm,flinedrs bay,2227,sa,19151010,6925269\n',
 'rec-2559-dup-0,binns,emiilzy,24,howell place,sec 142 hd rounsevell,ryde,2627,wa,19941108,8919080\n',
 'rec-2679-dup-0,dixon,thomas,108,brewster place,geelong grove,eight mile plains,2114,qld,19851127,8873329\n']
Accuracy

To compute how well the matching went we will use the record identifier in the first column as our reference.

For example, rec-1396-org is an original record which has a match in rec-1396-dup-0. To satisfy ourselves we can preview the first few supposed matches:

[28]:
for i, m in enumerate(mask[:10]):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        name_a = ' '.join(entity_a[1:3]).title()
        name_b = ' '.join(entity_b[1:3]).title()

        print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))
Gabrielle Fargahry-Tolba (rec-4746-org) =? Gabrielle Fargahry-Tolba (rec-4746-dup-0)
Alison Hearn (rec-438-org) =? Heatn Alison (rec-438-dup-0)
 Oreilly (rec-3902-org) =?  Oreilly (rec-3902-dup-0)
Benjamin Clarke (rec-920-org) =? Scott Clarke (rec-920-dup-0)
Emiily Fitzpatrick (rec-2152-org) =? Megna Fitzpatrick (rec-2152-dup-0)
Alex Clarke (rec-3434-org) =? Alex Clarke (rec-3434-dup-0)
Talan Stubbs (rec-4197-org) =? Talan Stubbs (rec-4197-dup-0)
Luke White (rec-2875-org) =? Luke White (rec-2875-dup-0)
Emiily Binns (rec-2559-org) =? Binns Emiilzy (rec-2559-dup-0)
Thomas Brain (rec-2679-org) =? Dixon Thomas (rec-2679-dup-0)
Metrics

If you know the ground truth — the correct mapping between the two datasets — you can compute performance metrics of the linkage.

Precision: The percentage of actual matches out of all found matches. (tp/(tp+fp))

Recall: How many of the actual matches have we found? (tp/(tp+fn))

[29]:
tp = 0
fp = 0

for i, m in enumerate(mask):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
            tp += 1
        else:
            fp += 1
            #print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])

print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000

print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))
Found 4858 correct matches out of 5000. Incorrectly linked 0 matches.
Precision: 100.0%
Recall: 97.2%

Entity Service Similarity Scores Output

This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and how to retrieve the results. The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.

The sections are usually run by different participants - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the analyst is acting as the integration authority.

Who learns what?

Alice and Bob will both generate and upload their CLKs.

The analyst - who creates the linkage project - learns the similarity scores. Be aware that this is a lot of information and it is subject to frequency attacks.

Steps
  • Check connection to Entity Service
  • Data preparation
  • Write CSV files with PII
  • Create a Linkage Schema
  • Create Linkage Project
  • Generate CLKs from PII
  • Upload the PII
  • Create a run
  • Retrieve and analyse results
[1]:
%matplotlib inline

import json
import os
import time

import matplotlib.pyplot as plt
import requests
import clkhash.rest_client
from IPython.display import clear_output
Check Connection

If you are connecting to a custom entity service, change the address here.

[2]:
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[3]:
!clkutil status --server "{url}"
{"project_count": 2115, "rate": 7737583, "status": "ok"}
Data preparation

Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the !clkutil commands.

[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head(3)
[5]:
given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id
rec_id
rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218
rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625
rec-4405-org charles green 38 salkauskas crescent kela dapto 4566 nsw 19480930 4365168
Schema Preparation

The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.

[6]:
schema = NamedTemporaryFile('wt')
[7]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
        "info": "c2NoZW1hX2V4YW1wbGU=",
        "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
        "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 1 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}
Overwriting /tmp/tmpvlivqdcf
Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.

[8]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "similarity_scores" --server "{url}"
creds.seek(0)

with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpcwpvq6kj
Project created
[8]:
{'project_id': '1eb3da44f73440c496ab42217381181de55e9dcd6743580c',
 'result_token': '846c6c25097c7794131de0d3e2c39c04b7de9688acedc383',
 'update_tokens': ['52aae3f1dfa8a4ec1486d8f7d63a8fe708876b39a8ec585b',
  '92e2c9c1ce52a2c2493b5e22953600735a07553f7d00a704']}

Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.

Hash and Upload

At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. Please see clkhash documentation for further details on this.

[9]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.06kclk/s, mean=883, std=33.6]
CLK data written to /tmp/tmpj8m1dvxj.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.30kclk/s, mean=875, std=39.7]
CLK data written to /tmp/tmpi2y_ogl9.json

Now the two clients can upload their data providing the appropriate upload tokens.

Alice uploads her data
[10]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.

Bob uploads his data
[11]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"

    bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run

Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:

[12]:
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.9 \
        --output "{f.name}"

    run_id = json.load(open(f.name))['run_id']
Results

Now after some delay (depending on the size) we can fetch the mask. This can be done with clkutil:

!clkutil results --server "{url}" \
    --project="{credentials['project_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

However for this tutorial we are going to use the clkhash library:

[13]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))
time.sleep(3)
State: completed
Stage (2/2): compute similarity scores
Progress: 1.000%
[17]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    url,
    project_id,
    run_id,
    credentials['result_token']))['similarity_scores']

This result is a large list of tuples recording the similarity between all rows above the given threshold.

[18]:
for row in data[:10]:
    print(row)
[76, 2345, 1.0]
[83, 3439, 1.0]
[103, 863, 1.0]
[154, 2391, 1.0]
[177, 4247, 1.0]
[192, 1176, 1.0]
[270, 4516, 1.0]
[312, 1253, 1.0]
[407, 3743, 1.0]
[670, 3550, 1.0]

Note there can be a lot of similarity scores:

[19]:
len(data)
[19]:
1572906

We will display a sample of these similarity scores in a histogram using matplotlib:

[20]:
plt.hist([_[2] for _ in data[::100]], bins=50);
[Figure: histogram of a sample of the similarity scores]

The vast majority of these similarity scores are for non matches. Let’s zoom into the right side of the distribution.

[21]:
plt.hist([_[2] for _ in data[::1] if _[2] > 0.94], bins=50);
[Figure: histogram of the similarity scores above 0.94]

Now it looks like a good threshold should be above 0.95. Let’s have a look at some of the candidate matches around there.

[22]:
def sample(data, threshold, num_samples, epsilon=0.01):
    samples = []
    for row in data:
        if abs(row[2] - threshold) <= epsilon:
            samples.append(row)
        if len(samples) >= num_samples:
            break
    return samples

def lookup_originals(candidate_pair):
    a = dfA.iloc[candidate_pair[0]]
    b = dfB.iloc[candidate_pair[1]]
    return a, b
[23]:
def look_at_per_field_accuracy(threshold = 0.999, num_samples = 100):
    results = []
    for i, candidate in enumerate(sample(data, threshold, num_samples, 0.01), start=1):
        record_a, record_b = lookup_originals(candidate)
        results.append(record_a == record_b)

    print("Proportion of exact matches for each field using threshold: {}".format(threshold))
    print(sum(results)/num_samples)

So we should expect a very high proportion of matches across all fields for high thresholds:

[24]:
look_at_per_field_accuracy(threshold = 0.999, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.999
given_name       0.93
surname          0.96
street_number    0.88
address_1        0.92
address_2        0.80
suburb           0.92
postcode         0.95
state            1.00
date_of_birth    0.96
soc_sec_id       0.40
dtype: float64

But if we look at a threshold which is closer to the boundary between real matches and non-matches, we should see a lot more errors:

[25]:
look_at_per_field_accuracy(threshold = 0.95, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.95
given_name       0.49
surname          0.57
street_number    0.81
address_1        0.55
address_2        0.44
suburb           0.70
postcode         0.84
state            0.93
date_of_birth    0.84
soc_sec_id       0.92
dtype: float64

External Tutorials

The clkhash library includes a tutorial on carrying out record linkage on perturbed data: <http://clkhash.readthedocs.io/en/latest/tutorial_cli.html>

Command line example

This brief example shows how to use clkutil, the command line tool that is packaged with the clkhash library. Using clkhash is not a requirement for working with the Entity Service REST API.

We assume you have access to a command line prompt with Python and Pip installed.

Install clkhash:

$ pip install clkhash

Generate and split some mock personally identifiable data:

$ clkutil generate 2000 raw_pii_2k.csv

$ head -n 1 raw_pii_2k.csv > alice.txt
$ tail -n 1500 raw_pii_2k.csv >> alice.txt
$ head -n 1000 raw_pii_2k.csv > bob.txt

A corresponding hashing schema can be generated as well:

$ clkutil generate-default-schema schema.json

Process the personally identifying data into Cryptographic Longterm Keys:

$ clkutil hash alice.txt horse staple schema.json alice-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 1.50K/1.50K [00:00<00:00, 6.69Kclk/s, mean=522, std=34.4]
CLK data written to alice-hashed.json

$ clkutil hash bob.txt horse staple schema.json bob-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 999/999 [00:00<00:00, 5.14Kclk/s, mean=520, std=34.2]
CLK data written to bob-hashed.json

Now to interact with an Entity Service. First check that the service is healthy and responds to a status check:

$ clkutil status --server https://testing.es.data61.xyz
{"rate": 53129, "status": "ok", "project_count": 1410}

Then create a new linkage project and set the output type (to mapping):

$ clkutil create-project \
    --server https://testing.es.data61.xyz \
    --type mapping \
    --schema schema.json \
    --output credentials.json

The entity service replies with a project id and credentials which get saved into the file credentials.json. The contents are two upload tokens and a result token:

{
    "update_tokens": [
        "21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55",
        "3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905"
    ],
    "project_id": "809b12c7e141837c3a15be758b016d5a7826d90574f36e74",
    "result_token": "230a303b05dfd186be87fa65bf7b0970fb786497834910d1"
}

These credentials get substituted in the following commands. Each CLK dataset gets uploaded to the Entity Service:

$ clkutil upload --server https://testing.es.data61.xyz \
                --apikey 21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                alice-hashed.json
{"receipt_token": "05ac237462d86bc3e2232ae3db71d9ae1b9e99afe840ee5a", "message": "Updated"}

$ clkutil upload --server https://testing.es.data61.xyz \
                --apikey 3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                bob-hashed.json
{"receipt_token": "6d9a0ee7fc3a66e16805738097761d38c62ea01a8c6adf39", "message": "Updated"}

Now we can compute mappings using various thresholds. For example to only see relationships where the similarity is above 0.9:

$ clkutil create --server https://testing.es.data61.xyz \
                 --apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
                 --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                 --name "Tutorial mapping run" \
                 --threshold 0.9
{"run_id": "31a6d3c775151a877dcac625b4b91a6659317046ea45ad11", "notes": "Run created by clkhash 0.11.2", "name": "Tutorial mapping run", "threshold": 0.9}

After a small delay the mapping will have been computed and we can use clkutil to retrieve the results:

$ clkutil results --server https://testing.es.data61.xyz \
                --apikey  230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                --run 31a6d3c775151a877dcac625b4b91a6659317046ea45ad11
State: completed
Stage (3/3): compute output
Downloading result
Received result
{
  "mapping": {
    "0": "500",
    "1": "501",
    "10": "510",
    "100": "600",
    "101": "601",
    "102": "602",
    "103": "603",
    "104": "604",
    "105": "605",
    "106": "606",
    "107": "607",
    ...

This mapping output tells us which rows of Alice’s and Bob’s data sets have a similarity above our threshold.

Looking at the first two entities in Alice’s data:

head alice.txt -n 3
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M

And looking at the corresponding 500th and 501st entities in Bob’s data:

tail -n 499 bob.txt | head -n 2
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M

Concepts

Cryptographic Longterm Key

A Cryptographic Longterm Key is the name given to a Bloom filter used as a privacy preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity - meaning two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.

CLKs are created independently of the entity service, following a keyed hashing process.

A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service’s api.

Note

The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.

Bloom Filter Format

A Bloom filter is simply an encoding of PII as a bitarray.

This can easily be represented as bytes (each being an 8 bit number between 0 and 255). We serialize by base64 encoding the raw bytes of the bit array.

An example with a 64 bit filter:

# bloom filters binary value
'0100110111010000101111011111011111011000110010101010010010100110'

# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]

# which gets base64 encoded to
'TdC999jKpKY=\n'

As with standard Base64 encodings, a newline is introduced every 76 characters.
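
The example above can be reproduced with a few lines of Python (a sketch using only the standard library; in practice clkhash takes care of this serialization):

import base64

# The 64 bit example filter from above.
bits = '0100110111010000101111011111011111011000110010101010010010100110'

# Pack the bit string into bytes (big-endian, most significant bit first).
raw = int(bits, 2).to_bytes(len(bits) // 8, byteorder='big')
print(list(raw))                # [77, 208, 189, 247, 216, 202, 164, 166]

# base64.encodebytes inserts a newline every 76 characters and one at the end.
print(base64.encodebytes(raw))  # b'TdC999jKpKY=\n'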

Schema

It is important that participating organisations agree on how personally identifiable information is processed to create the CLKs. We call the configuration for creating CLKs a linkage schema. The organisations have to agree on a schema to ensure their CLKs are comparable.

The linkage schema is documented in clkhash, our reference implementation written in Python.

Note

Due to the one way nature of hashing, the entity service can’t determine whether the linkage schema was followed when clients generated CLKs.

Comparing Cryptographic Longterm Keys

The similarity metric used is the Sørensen–Dice index - although this may become a configurable option in the future.
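
For reference, the Sørensen–Dice index of two bit arrays is twice the number of bits set in both, divided by the total number of set bits in each. A minimal sketch (illustrative only; the service relies on the optimised anonlink library for the actual comparisons):

def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sørensen–Dice index of two equal-length bit arrays given as bytes."""
    ones_a = sum(bin(byte).count('1') for byte in clk_a)
    ones_b = sum(bin(byte).count('1') for byte in clk_b)
    common = sum(bin(a & b).count('1') for a, b in zip(clk_a, clk_b))
    return 2 * common / (ones_a + ones_b)

# Identical filters score 1.0, disjoint filters score 0.0.
print(dice_coefficient(b'\xf0\x0f', b'\xf0\x0f'))  # 1.0
print(dice_coefficient(b'\xf0\x00', b'\x0f\x00'))  # 0.0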

Output Types

The Entity Service supports different result types which affect what output is produced, and who may see the output.

Warning

The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.

Similarity Score

Similarity scores are computed between all pairs of CLKs from the two organisations - the scores above a given threshold are returned. This output type is currently the only way to work with 1 to many relationships.

The result_token (generated when creating the mapping) is required. The result_type should be set to "similarity_scores".

Results are a simple JSON array of arrays:

[
    [index_a, index_b, score],
    ...
]

Where the index values will be the 0 based row index from the uploaded CLKs, and the score will be a Number between the provided threshold and 1.0.

A score of 1.0 means the CLKs were identical. Threshold values are usually between 0.5 and 1.0.
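
Working with this output in Python is straightforward; for example, filtering the parsed JSON array for strong candidate pairs (a sketch with made-up rows and an arbitrary cut-off):

# Each row is [index_a, index_b, score], as described above.
similarity_scores = [[0, 1449, 1.0], [1, 2750, 0.97], [2, 4656, 0.81]]

strong = [(a, b) for a, b, score in similarity_scores if score >= 0.95]
print(strong)  # [(0, 1449), (1, 2750)]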

Note

The maximum number of results returned is the product of the two data set lengths.

For example:

Comparing two data sets each containing 1 million records with a threshold of 0.0 will return 1 trillion results (1e+12).
Direct Mapping Table

The direct mapping takes the similarity scores and simply assigns the highest scores as links.

The links are exposed as a lookup table using indices from the two organizations:

{
    index_a: index_b,
    ...
}
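
Conceptually the assignment can be pictured as taking candidate pairs from highest to lowest score and keeping a pair only if neither row has been matched yet. The following is a simplified sketch of that idea, not the solver the service actually uses (which lives in the anonlink library):

def greedy_mapping(similarity_scores):
    """similarity_scores: iterable of (index_a, index_b, score) tuples."""
    mapping = {}
    used_b = set()
    # Visit candidate pairs from highest to lowest score.
    for index_a, index_b, score in sorted(similarity_scores, key=lambda s: s[2], reverse=True):
        if index_a not in mapping and index_b not in used_b:
            mapping[index_a] = index_b
            used_b.add(index_b)
    return mapping

print(greedy_mapping([(0, 1, 0.9), (0, 2, 0.95), (1, 2, 0.8)]))  # {0: 2}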

The result_token (generated when creating the mapping) is required to retrieve the results. The result_type should be set to "mapping".

Permutation and Mask

This protocol creates a random reordering for both organizations; and creates a mask revealing where the reordered rows line up.

Accessing the mask requires the result_token, and accessing the permutation requires a receipt-token (provided to each organization when they upload data).

Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.
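
In code terms the mechanics look roughly like the following sketch with toy data (mirroring the reorder step in the permutation tutorial above): rows are first reordered by the permutation, and the mask then selects the aligned rows. In the actual protocol no single data provider holds both the permutation and the mask.

rows = ["r0", "r1", "r2", "r3"]
permutation = [2, 0, 3, 1]   # row i moves to position permutation[i]
mask = [1, 1, 0, 1]          # 1 where the aligned (permuted) rows match across parties

permuted = [None] * len(rows)
for row, new_pos in zip(rows, permutation):
    permuted[new_pos] = row   # ['r1', 'r3', 'r0', 'r2']

aligned = [row for row, bit in zip(permuted, mask) if bit]
print(aligned)                # ['r1', 'r3', 'r2']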

The result_type should be set to "permutations".

Security

The service isn’t given any personally identifying information in raw form - rather clients must locally compute a CLK which is a hashed version of the data to be linked.

Considerations for each output type

Direct Mapping Table

The default output of the Entity Service comprises a list of edges - connections between rows in dataset A and rows in dataset B. This assumes at most a 1-1 correspondence - each entity will only be present in zero or one edge.

This output is only available to the client who created the mapping, but it is worth highlighting that it does (by design) leak information about the intersection of the two sets of entities.

Knowledge about set intersection: This output contains information about which particular entities are shared, and which are not. Knowing the overlap between the organizations is potentially disclosive. This is mitigated by using unique authorization codes generated for each mapping, which are required to retrieve the results.

Row indices exposed: The output directly exposes the row indices provided to the service, which, if not randomized, may be disclosive. For example, entities simply exported from a database might be ordered by age, patient admittance date, salary band, etc.

Similarity Score

All calculated similarities (above a given threshold) between entities are returned. This output comprises a list of weighted edges - similarity between rows in dataset A to rows in dataset B. This is a many to many relationship where entities can appear in multiple edges.

Recovery from the distance measurements: This output type includes the plaintext distance measurements between entities. This additional information can be used to fingerprint individual entities based on their ordered similarity scores. In combination with public information this can lead to recovery of identity. This attack is described in section 3 of Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics’ Privacy-Preserving Record Linkage by Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague.

In order to prevent this attack it is important not to provide the similarity table to untrusted parties.

Permutation and Mask

This output type involves creating a random reordering of the entities for both organizations; and creating a binary mask vector revealing where the reordered rows line up. This output is designed for use in multi-party computation algorithms.

This mitigates the Knowledge about set intersection problem from the direct mapping output - assuming the mask is not made available to the data providers.

Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.

Authentication / Authorization

The entity service does not support authentication yet. This is planned for a future version.

All sensitive data is protected by token-based authorization. That is, you need to provide the correct token to access different resources. A token is a unique random 192 bit string.

There are three different types of tokens:

  • update_token: required to upload a party’s CLKs.
  • result_token: required to access the result of the entity resolution process. This is, depending on the output type, either similarity scores, a direct mapping table, or a mask.
  • receipt-token: this token is returned to either party after uploading their respective CLKs. With this receipt-token they can then access their respective permutations, if the output type of the mapping is set to permutation and mask.
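
As in the tutorials above, a token is simply sent as the Authorization header of the relevant HTTP request. For example (a sketch with placeholder values):

import requests

url = "https://testing.es.data61.xyz"   # example server used in the tutorials
project_id = "<project id>"             # placeholders: substitute your own values
run_id = "<run id>"
result_token = "<result token>"

# The result_token authorizes access to the run's result.
response = requests.get(
    "{}/api/v1/projects/{}/runs/{}/result".format(url, project_id, run_id),
    headers={"Authorization": result_token},
)
print(response.status_code)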

Important

These tokens are the only artifacts that protect the sensitive data. Therefore it is paramount to make sure that only authorized parties have access to these tokens!

Attack Vectors

The following attack vectors need to be considered for all output types.

Stealing/Leaking uploaded CLKs

The uploaded CLKs for one organization could be leaked to the partner organization, who possesses the HMAC secret, breaking semantic security. The entity service doesn’t expose an API that allows users to access any CLKs; the object store (MINIO or S3) and the database (postgresql) are configured to not allow public access.

Deployment

Local Deployment

Dependencies

Docker and docker-compose

Build

From the project folder, run:

./tools/build.sh

This will create the docker images tagged with latest which are used by docker-compose.

Run

Run docker compose:

docker-compose -p n1es -f tools/docker-compose.yml up

This will start the following containers:

  • nginx frontend (named n1es_nginx_1)
  • gunicorn/flask backend (named n1es_backend_1)
  • celery backend worker (named n1es_worker_1)
  • postgres database (named n1es_db_1)
  • redis job queue (named n1es_redis_1)
  • minio object store
  • jaeger opentracing

The REST api for the service is exposed on port 8851 of the nginx container, which docker will map to a high numbered port on your host.

The address of the nginx endpoint can be found with:

docker port n1es_nginx_1 "8851"

For example to GET the service status:

$ export ENTITY_SERVICE=`docker port n1es_nginx_1 "8851"`
$ curl $ENTITY_SERVICE/api/v1/status
{
    "status": "ok",
    "number_mappings": 0,
    "rate": 1
}

The service can be taken down by hitting CTRL+C. This doesn’t clear the DB volumes, which will persist and conflict with the next call to docker-compose … up unless they are removed. Removing these volumes is easy, just run:

docker-compose -p n1es -f tools/docker-compose.yml down -v

in between calls to docker-compose … up.

Monitoring

The celery monitoring tool flower is also part of the docker-compose file - this graphical interface allows administration and monitoring of the celery tasks and workers. Access this via the monitor container.

Testing with docker-compose

An additional docker-compose config file can be found in ./tools/ci.yml; this can be added in to run along with the rest of the service:

docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml  up -d

docker logs -f n1estest_tests_1

docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml down
Docker Compose Tips

A collection of development tips.

Volumes

You might need to destroy the docker volumes used for the object store and the postgres database:

docker-compose -f tools/docker-compose.yml rm -s -v [-p <project-name>]
Restart one service

Docker compose can modify an existing deployment; this can be particularly effective when you modify and rebuild the backend and want to restart it without changing anything else:

docker-compose -f tools/docker-compose.yml up -d --no-deps es_backend
Scaling

You can run additional worker containers by scaling with docker-compose:

docker-compose -f tools/docker-compose.yml scale es_worker=2
Mix and match docker compose

During development you can run the redis and database containers with docker-compose, and directly run the celery and flask applications with Python.

docker-compose -f tools/docker-compose.yml run es_db
docker-compose -f tools/docker-compose.yml run es_redis

Production deployment

Production deployment assumes a multi node Kubernetes cluster.

The entity service has been deployed to kubernetes clusters on GCE, minikube and AWS. The system has been designed to scale across multiple nodes and handle node failure without data loss.

Entity Service Kubernetes Deployment

At a high level the main custom components are:

  • ES App - a gunicorn/flask backend web service hosts the REST api
  • Entity Match Worker instances - uses celery for task scheduling

The components that are used in support are:

  • postgresql database holds all match metadata
  • redis is used for the celery job queue and as a cache
  • (optionally) minio object store stores the raw CLKs, intermediate files, and results.
  • nginx provides upload buffering, request rate limiting.
  • an ingress controller (e.g. nginx-ingress/traefik) provides TLS termination

The rest of this document goes into how to deploy in a production setting.

Provision a Kubernetes cluster

Creating a Kubernetes cluster is out of scope for this documentation. For AWS there is a good tutorial here.

Hardware requirements

Recommended AWS worker instance type is r3.4xlarge - spot instances are fine as we handle node failure. The number of nodes depends on the size of the expected jobs, as well as the memory on each node. For testing we recommend starting with at least two nodes, with each node having at least 8 GiB of memory and 2 vCPUs.

Software to interact with the cluster

You will need to install the kubectl command line tool and helm.

Cluster Storage

An existing kubernetes cluster may already have dynamically provisioned storage. If not, create a default storage class. For AWS execute:

kubectl create -f aws-storage.yaml

Dynamically provisioned storage

When pods require persistent storage this can be dynamically provided by the cluster. The default settings (in values.yaml) assumes the existence of a "default" storageClass.

For a cluster on AWS the aws-storage.yaml resource will dynamically provision elastic block store volumes.

Install Helm

The entity service system has been packaged using helm; there is a client program that needs to be installed.

At the very least you will need to install tiller into the cluster:

helm init
Ingress Controller

We assume the cluster has an ingress controller; if this isn’t the case, first add one. We suggest using Traefik or the NGINX Ingress Controller. Both can be installed using helm.

Deploy the system

Helm can be used to easily deploy the system to a kubernetes cluster.

From the deployment/entity-service directory pull the dependencies:

helm dependency update
Configuring the deployment

Create a new blank yaml file to hold your custom deployment settings my-deployment.yaml. Carefully read through the default values.yaml file and override any values in your deployment configuration file.

At a minimum consider setting up an ingress by changing api.ingress, change the number of workers in workers.replicaCount (and possibly workers.highmemory.replicaCount), check you’re happy with the workers’ cpu and memory limits in workers.resources, and finally set the credentials:

  • postgresql.postgresqlPassword
  • redis.password (and redis-ha.redisPassword if provisioning redis)
  • minio.accessKey and minio.secretKey

You may additionally want to check the persistent volume storageClass and sizes.

Installation

To install the whole system execute:

cd deployment
helm install entityservice --namespace=es --name="n1entityservice" --values my-deployment.yaml

This can take around 10 minutes the first time you deploy to a new cluster.

Run integration tests and an end to end test

Update the server url by editing the yaml file then create a new job on the cluster:

kubectl create -f jobs/integration-test-job.yaml
To view the celery monitor:

Find the pod that the monitor is running on then forward the port:

kubectl port-forward entityservice-monitor-4045544268-s34zl 8888:8888
Upgrade Deployment with Helm

Updating a running chart is usually straightforward. For example, if the release is called es in namespace testing, execute the following to increase the number of workers:

helm upgrade es entity-service --namespace=testing --set workers.replicas="20"

However note you may wish to instead keep all configurable values in a yaml file and track that in version control.

Minimal Deployment

To run with minikube for local testing we have provided a minimal.yaml file that will set very small resource limits. Install the minimal system with:

helm install entity-service --name="mini-es" --values entity-service/minimal-values.yaml
Database Deployment Options

At deployment time you can configure the deployed postgresql database.

In particular you should set the postgresql.postgresqlPassword in values.yaml.

Object Store Deployment Options

At deployment time you can decide to deploy MINIO or instead use an existing service such as AWS S3. Note that there is a trade off between using a local deployment of minio vs S3.

In our AWS based experimentation, Minio is noticeably faster but more expensive and less reliable than AWS S3; your own mileage may vary.

To configure a deployment to use an external object store, simply set provision.minio to false and add appropriate connection configuration in the minio section. For example to use AWS S3 simply provide your access credentials (and disable provisioning minio):

helm install entity-service --name="es-s3" --set provision.minio=false --set minio.accessKey=XXX --set minio.secretKey=YYY --set minio.bucket=<bucket>
Redis Deployment Options

At deployment time you can decide to provision redis using our chart, or instead use an existing redis installation or managed service. The provisioned redis is a highly available 3 node redis cluster using the redis-ha helm chart. Directly connecting to redis, and discovery via the sentinel protocol are supported. When using sentinel protocol for redis discovery read only requests are dispatched to redis replicas.

Carefully read the comments in the default values.yaml file.

To use a separate install of redis using the server shared-redis-ha-redis-ha.default.svc.cluster.local:

helm install entity-service --name="es-shared-redis" \
    --set provision.redis=false \
    --set redis.server=shared-redis-ha-redis-ha.default.svc.cluster.local \
    --set redis.use_sentinel=true
Uninstalling

To uninstall a release called es:

helm del es

If it has been installed into its own namespace you can simply delete the whole namespace with kubectl:

kubectl delete namespace miniestest

Deployment Risks

The purpose of this document is to record known deployment risks of the entity service and our mitigations. It references the OWASP 2017 Top 10 security risks: https://www.owasp.org/index.php/Top_10-2017_Top_10

Risks

  • User accesses unit record data: A1 - Injection; A3 - Sensitive Data Exposure.
  • Unauthorized user accesses results: A6 - Security misconfiguration; A2 - Broken authentication; A5 - Broken access control.
  • Authorized user attacks the system: A10 - Insufficient Logging & Monitoring; A3 - Sensitive Data Exposure. An admin can access the raw CLKs uploaded by both parties; however, a standard user cannot.
  • User coerces N1 to execute attacking code: Insecure deserialization; compromised shared host.
  • An underlying component has a vulnerability: Dependencies, including anonlink, could have vulnerabilities.

Development

Changelog

Version 1.11.0
  • Adds support for multiparty record linkage.
  • Logging is now configurable from a file.
Other improvements
  • Another tutorial for directly using the REST api was added.
  • K8s deployment updated to use 3.15.0 Postgres chart. Postgres configuration now uses a global namespace so subcharts can all use the same configuration as documented here.
  • Jenkins testing now fails if the benchmark exits incorrectly or if the benchmark results contain failed results.
  • Jenkins will now execute the tutorials notebooks and fail if any cells error.
Version 1.10.0
  • Updates Anonlink and switches to using Anonlink’s default format for serialization of similarity scores.
  • Sorts similarity scores before solving, improving accuracy.
  • Uses Anonlink’s new API for similarity score computation and solving.
  • Add support for using an external Postgres database.
  • Added optional support for redis discovery via the sentinel protocol.
  • Kubernetes deployment no longer includes a default postgres password. Ensure that you set your own postgresqlPassword.
  • The Kubernetes deployment documentation has been extended.
Version 1.9.4
  • Introduces configurable logging of HTTP headers.
  • Dependency issue resolved.
Version 1.9.3
  • Redis can now be used in highly available mode. Includes upstream fix where the redis sentinels crash.
  • The custom kubernetes certificate management templates have been removed.
  • Minor updates to the kubernetes resources. No longer using beta apis.
Version 1.9.2
  • 2 race conditions have been identified and fixed.
  • Integration tests are sped up and more focused. The test suite now fails after the first test failure.
  • Code tidy-ups to be more pep8 compliant.
Version 1.9.1
  • Adds support for (almost) arbitrary sized encodings. A minimum and maximum can be set at deployment time, and currently anonlink requires the size to be a multiple of 8.
  • Adds support for opentracing with Jaeger.
  • Improvements to the benchmarking container.
  • Internal refactoring of tasks.
Version 1.9.0
  • minio and redis services are now optional for kubernetes deployment.
  • Introduction of a high memory worker and associated task queue.
  • Fix issue where we could start tasks twice.
  • Structlog now used for celery workers.
  • CI now tests a kubernetes deployment.
  • Many Jenkins CI updates and fixes.
  • Updates to Jupyter notebooks and docs.
  • Updates to Python and Helm chart dependencies and docker base images.
Version 1.8.1

Improve system stability while handling large intermediate results. Intermediate results are now stored in files instead of in Redis. This permits us to stream them instead of loading everything into memory.

Version 1.8

Version 1.8 introduces breaking changes to the REST API to allow an analyst to reuse uploaded CLKs.

Instead of a linkage project only having one result, we introduce a new sub-resource runs. A project holds the schema and CLKs from all data providers; and multiple runs can be created with different parameters. A run has a status and a result endpoint. Runs can be queued before the CLK data has been uploaded.

We also introduced changes to the result types. The result type permutation, which was producing permutations and an encrypted mask, was removed. And the result type permutation_unecrypyted_mask was renamed to permutations.

Brief summary of API changes:

  • The mapping endpoint has been renamed to projects.
  • To carry out a linkage computation you must post to a project’s runs endpoint: /api/v1/project/<PROJECT_ID>/runs
  • Results are now accessed under the runs endpoint: /api/v1/project/<PROJECT_ID>/runs/<RUN_ID>/result
  • The result type permutation_unecrypyted_mask was renamed to permutations.
  • The result type permutation was removed.

For all the updated API details check the Open API document.
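
For example, under the new API an analyst who already holds a project's result_token can queue a run and poll its status directly with requests. This is a minimal sketch only: the server URL, identifiers and threshold are placeholders, and the exact request and response fields (including the identifier returned when a run is created) are defined in the Open API document.

import requests

url = 'https://testing.es.data61.xyz/api/v1/'  # placeholder deployment URL
project_id = '<PROJECT_ID>'                    # placeholder identifiers
result_token = '<RESULT_TOKEN>'

# Queue a run against an existing project. The threshold shown here is
# illustrative; consult the Open API document for the full request body.
run = requests.post(
    '{}projects/{}/runs'.format(url, project_id),
    headers={'Authorization': result_token},
    json={'threshold': 0.9}
).json()

# Poll the run's status. We assume the creation response contains the new
# run's identifier as 'run_id'; check the Open API document to confirm.
status = requests.get(
    '{}projects/{}/runs/{}/status'.format(url, project_id, run['run_id']),
    headers={'Authorization': result_token}
).json()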

Other improvements
  • The documentation is now served at the root.
  • The flower monitoring tool for celery is now included with the docker-compose deployment. Note that this is disabled by default for the production deployment with kubernetes.
  • The docker containers have been migrated to Alpine Linux to be much leaner.
  • Substantial internal refactoring - especially of views.
  • Move to pytest for end to end tests.
Version 1.7.3

Deployment and documentation sprint.

  • Fixes a bug where only the top k results of a chunk were being requested from anonlink. #59 #84
  • Updates to helm deployment templates to support a single namespace having multiple entity services. The Helm charts are more standard, some config has moved into a configmap, and an experimental cert-manager configuration option has been added. #83, #90
  • More sensible logging during testing.
  • Every HTTP request now has a (globally configurable) timeout.
  • Minor update regarding handling uploading empty CLKs. #92
  • Update to latest versions of anonlink and clkhash. #94
  • Documentation updates.
Version 1.7.2

Dependency and deployment updates. We now pin the versions of Python, anonlink, clkhash and phe, as well as the nginx and postgres docker images.

Version 1.7.0

Added a view type that returns similarity scores of potential matches.

Version 1.6.8

Scalability sprint.

  • Much better chunking of work.
  • Security hardening by modifying the response from the server. There is now no difference between an invalid token and an unknown resource - both return a 403 response status.
  • Mapping information includes the time it was started.
  • Update and add tests.
  • Update the deployment to use Helm.

Road map for the entity service

  • Baseline benchmarking vs known datasets (accuracy and speed), e.g. recordspeed datasets
  • Blocking
  • Schema specification and tooling
  • Algorithmic improvements, e.g. implementing a canopy clustering solver
  • A web front end including authentication and access control
  • Uploading multiple hashes per entity. Handle multiple schemas.
  • Check how we deal with missing information, old addresses etc.
  • Semi-supervised machine learning methods to learn thresholds
  • Handle 1-to-many relationships, e.g. familial groups
  • Larger scale graph solving methods
  • Remove the bottleneck of sparse links having to fit in redis.
  • Improve uploads by allowing direct binary file transfer into the object store
  • Optimise anonlink memory management and C++ code

Bigger Projects:

  • Consider more than 2 organizations participating in one mapping
  • GPU implementation of core similarity scoring
  • Somewhat homomorphic encryption could be used for similarity scores
  • Consider allowing users to upload raw PII

Implementation Details

Components

The entity service is implemented in Python and comprises the following components:

  • A gunicorn/flask backend that implements the HTTP REST API.
  • One or more Celery backend workers that do the actual work; these interface with the anonlink library.
  • An nginx frontend that reverse proxies the gunicorn/flask backend application.
  • A Minio object store for large files such as raw uploaded hashes and results.
  • A postgres database that stores the linking metadata.
  • A redis task queue that interfaces between the flask app and the celery backend. Redis also acts as an ephemeral cache.

Each of these has been packaged as a docker image; however, the use of external services (redis, postgres, minio) can be configured through environment variables. Multiple workers can be used to distribute the work beyond one machine - by default all cores will be used for computing similarity scores and encrypting the mask vector.

Continuous Integration Testing

We test the service using Jenkins. Every pull request gets deployed in the local configuration using Docker Compose, as well as to kubernetes using the production deployment.

At a high level the testing covers:

  • building the docker containers
  • deploying using Docker Compose
  • testing the tutorial notebooks don’t error
  • running the integration tests against the local deployment
  • running a benchmark suite against the local deployment
  • building and packaging the documentation
  • publishing the containers to quay.io
  • deploying to kubernetes
  • running the integration tests against the kubernetes deployment

All of this is orchestrated using the Jenkins pipeline script at Jenkinsfile.groovy. There is one custom library, n1-pipeline, a collection of helpers that we created for common Jenkins tasks.

The integration tests currently take around 30 minutes.

Testing Local Deployment

The docker compose file tools/ci.yml is deployed along with tools/docker-compose.yml. This simply defines an additional container (from the same backend image) which runs the integration tests after a short delay.

The logs from the various containers (nginx, backend, worker, database) are all collected, archived and are made available in the Jenkins UI for introspection.

Testing K8s Deployment

The kubernetes deployment uses helm with the template found in deployment/entity-service. Jenkins additionally defines the docker image versions to use and ensures an ingress is not provisioned. The deployment is configured to be quite conservative in terms of cluster resources. Currently this logic all resides in Jenkinsfile.groovy.

The k8s deployment test is limited to 30 minutes and an effort is made to clean up all created resources.

After a few minutes of waiting for the deployment, a Kubernetes Job is created using kubectl create.

This job includes a 1GiB persistent volume claim to which the results are written (as results.xml). During testing the pytest output is rendered in Jenkins, and the Job’s pod then terminates. We create a temporary pod which mounts the same results volume and copy the produced artifact across for rendering in Jenkins. This dance is only necessary to retrieve files from the cluster to our Jenkins instance; it would be straightforward if we only wanted the stdout from each pod/job.

Benchmarking

In the benchmarking folder is a benchmarking script and associated Dockerfile. The docker image is published at https://quay.io/repository/n1analytics/entity-benchmark

The container/script is configured via environment variables.

  • SERVER: (required) the URL of the server.
  • EXPERIMENT: JSON file containing a list of experiments to run. The schema of experiments is defined in ./schema/experiments.json.
  • DATA_PATH: path to a directory in which to store test data (useful as a cache).
  • RESULTS_PATH: full filename to which the results file is written.
  • SCHEMA: path to the linkage schema file used when creating projects. If not provided it is assumed to be in the data directory.
  • TIMEOUT: the time to wait for the result of a run, in seconds. Default is 1200 (20 minutes).

Run Benchmarking Container

Run the container directly with docker - substituting configuration information as required:

docker run -it \
    -e SERVER=https://testing.es.data61.xyz \
    -e RESULTS_PATH=/app/results.json \
    quay.io/n1analytics/entity-benchmark:latest

By default the container will pull synthetic datasets from an S3 bucket and run default benchmark experiments against the configured SERVER. The default experiments (listed below) are set in benchmarking/default-experiments.json.

The output will be printed and saved to a file pointed to by RESULTS_PATH (e.g. to /app/results.json).

Cache Volume

To speed up benchmarking when running multiple times, you may wish to mount a volume at the DATA_PATH to store the downloaded test data. Note that the container runs as user 1000, so any mounted volume must be readable and writable by that user. To create a volume using docker:

docker volume create linkage-benchmark-data

To copy data from a local directory and change owner:

docker run --rm -v `pwd`:/src \
    -v linkage-benchmark-data:/data busybox \
    sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"

To run the benchmarks using the cache volume:

docker run \
    --name ${benchmarkContainerName} \
    --network ${networkName} \
    -e SERVER=${localserver} \
    -e DATA_PATH=/cache \
    -e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
    -e RESULTS_PATH=/app/results.json \
    --mount source=linkage-benchmark-data,target=/cache \
    quay.io/n1analytics/entity-benchmark:latest

Experiments

Experiments to run can be configured as a simple JSON document. The default is:

[
  {
    "sizes": ["100K", "100K"],
    "threshold": 0.95
  },
  {
    "sizes": ["100K", "100K"],
    "threshold": 0.80
  },
  {
    "sizes": ["100K", "1M"],
    "threshold": 0.95
  }
]

The schema of the experiments can be found in benchmarking/schema/experiments.json.
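
If you write your own experiments file, it can be validated against that schema before starting a long benchmark run. The following is a minimal sketch using the third-party jsonschema package; the file paths are illustrative and assume a checkout of the repository.

import json
import jsonschema  # third-party package: pip install jsonschema

# Load the published schema and a custom experiments file (paths are examples).
with open('benchmarking/schema/experiments.json') as f:
    schema = json.load(f)
with open('my-experiments.json') as f:
    experiments = json.load(f)

# Raises jsonschema.ValidationError if the experiments file is malformed.
jsonschema.validate(experiments, schema)
print('{} experiments validated'.format(len(experiments)))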

Logging

The entity service uses the standard Python logging library for logging.

The following named loggers are used (see the sketch below for adjusting them individually):

  • entityservice
  • entityservice.views
  • entityservice.models
  • entityservice.database
  • celery.es
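
Because these are ordinary Python loggers, their verbosity can be tuned individually with the standard logging API when running or debugging the service code locally. This is a minimal sketch only, not part of the service's own configuration:

import logging

# Basic configuration for the root logger; the service itself configures
# logging via the environment variables described below.
logging.basicConfig(level=logging.INFO)

# Turn up database logging while keeping the celery worker logger quieter.
logging.getLogger('entityservice.database').setLevel(logging.DEBUG)
logging.getLogger('celery.es').setLevel(logging.WARNING)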

The following environment variables affect logging:

  • DEBUG - sets the logging level to debug for all application code.
  • LOGFILE - directs the log output to this file.
  • LOG_HTTP_HEADER_FIELDS - HTTP headers to include in the application logs.

Example logging output with LOG_HTTP_HEADER_FIELDS=User-Agent,Host:

[2019-02-02 23:17:23 +0000] [10] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=6c2a3730
[2019-02-02 23:17:23 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] 0 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [11] [INFO] Receiving CLK data.            [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Storing user 25895 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:24 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] 1 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [10] [INFO] Receiving CLK data.            [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Storing user 25896 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:25 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] 2 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=df791527
[2019-02-02 23:17:26 +0000] [12] [INFO] request description of a run   [entityservice.views.run.description] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Requested project or run resource with invalid identifier token [entityservice.views.auth_checks] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Request to delete project      [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Marking project for deletion   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9

With DEBUG enabled there are a lot of logs from the backend and workers:

[2019-02-02 23:14:47 +0000] [10] [INFO] Marking project for deletion   [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:47 +0000] [10] [DEBUG] Trying to connect to postgres db [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [DEBUG] Database connection established [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [9] [INFO] Request to delete project      [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=5486c153

Tracing

The following environment variables configure opentracing (support for opentracing with Jaeger was added in Version 1.9.1):

  • TRACING_HOST
  • TRACING_PORT
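
As a rough illustration of how these variables are typically used, the sketch below points an opentracing tracer (via the third-party jaeger_client package) at the agent host and port they describe. This is illustrative only and is not the service's actual startup code.

import os
from jaeger_client import Config

# Configure a Jaeger tracer to report to the agent described by
# TRACING_HOST / TRACING_PORT (defaults shown here are assumptions).
config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},
        'local_agent': {
            'reporting_host': os.environ.get('TRACING_HOST', 'localhost'),
            'reporting_port': int(os.environ.get('TRACING_PORT', '6831')),
        },
    },
    service_name='entityservice',
)
tracer = config.initialize_tracer()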