Entity Service - v1.12.0

The Entity Service allows two or more organizations to carry out private record linkage - finding matching records of entities between their respective datasets without disclosing personally identifiable information.

Overview

The Entity Service is based on the concept of Anonymous Linking Codes (ALC). These can be seen as bit-arrays representing an entity, with the property that the similarity of the bits of two ALCs reflects the similarity of the corresponding entities.

An anonymous linking code that has been shown to produce good results and is widely used in practice is the so-called *Cryptographic Longterm Key*, or CLK for short.

Note

From now on, we will use CLK exclusively instead of ALC, as our reference implementation of the private record linkage process uses CLKs as anonymous linking codes. The Entity Service is, however, not limited to CLKs.

[Figure: Schematic overview of the process of private record linkage using the Entity Service]

Private record linkage - using the Entity Service - is a two stage process: first, each data provider locally hashes their PII into anonymous linking codes according to an agreed linkage schema; second, they upload these codes to the Entity Service, which carries out the linkage computation and makes the results available in the chosen output type.


Tutorials

Command line example

This brief example shows how to use clkutil - the command line tool that is packaged with the clkhash library. Using clkhash is not a requirement for working with the Entity Service REST API.

We assume you have access to a command line prompt with Python and Pip installed.

Install clkhash:

$ pip install clkhash

Generate and split some mock personally identifiable data:

$ clkutil generate 2000 raw_pii_2k.csv

$ head -n 1 raw_pii_2k.csv > alice.txt
$ tail -n 1500 raw_pii_2k.csv >> alice.txt
$ head -n 1000 raw_pii_2k.csv > bob.txt

A corresponding hashing schema can be generated as well:

$ clkutil generate-default-schema schema.json

Process the personally identifying data into Cryptographic Longterm Keys:

$ clkutil hash alice.txt horse staple schema.json alice-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 1.50K/1.50K [00:00<00:00, 6.69Kclk/s, mean=522, std=34.4]
CLK data written to alice-hashed.json

$ clkutil hash bob.txt horse staple schema.json bob-hashed.json
generating CLKs: 100%|████████████████████████████████████████████| 999/999 [00:00<00:00, 5.14Kclk/s, mean=520, std=34.2]
CLK data written to bob-hashed.json

Now to interact with an Entity Service. First check that the service is healthy and responds to a status check:

$ clkutil status --server https://testing.es.data61.xyz
{"rate": 53129, "status": "ok", "project_count": 1410}

Then create a new linkage project and set the output type (to mapping):

$ clkutil create-project \
    --server https://testing.es.data61.xyz \
    --type mapping \
    --schema schema.json \
    --output credentials.json

The Entity Service replies with a project id and credentials, which get saved into the file credentials.json. The file contains two upload tokens and a result token:

{
    "update_tokens": [
        "21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55",
        "3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905"
    ],
    "project_id": "809b12c7e141837c3a15be758b016d5a7826d90574f36e74",
    "result_token": "230a303b05dfd186be87fa65bf7b0970fb786497834910d1"
}

These credentials get substituted in the following commands. Each CLK dataset gets uploaded to the Entity Service:

$ clkutil upload --server https://testing.es.data61.xyz \
                --apikey 21d4c9249e1c70ac30f9ce03893983c493d7e90574980e55 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                alice-hashed.json
{"receipt_token": "05ac237462d86bc3e2232ae3db71d9ae1b9e99afe840ee5a", "message": "Updated"}

$ clkutil upload --server https://testing.es.data61.xyz \
                --apikey 3ad6ae9028c09fcbc7fbca36d19743294bfaf215f1464905 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                bob-hashed.json
{"receipt_token": "6d9a0ee7fc3a66e16805738097761d38c62ea01a8c6adf39", "message": "Updated"}

Now we can compute mappings using various thresholds. For example, to only see relationships where the similarity is above 0.9:

$ clkutil create --server https://testing.es.data61.xyz \
                 --apikey 230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
                 --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                 --name "Tutorial mapping run" \
                 --threshold 0.9
{"run_id": "31a6d3c775151a877dcac625b4b91a6659317046ea45ad11", "notes": "Run created by clkhash 0.11.2", "name": "Tutorial mapping run", "threshold": 0.9}

After a small delay the mapping will have been computed and we can use clkutil to retrieve the results:

$ clkutil results --server https://testing.es.data61.xyz \
                --apikey  230a303b05dfd186be87fa65bf7b0970fb786497834910d1 \
                --project 809b12c7e141837c3a15be758b016d5a7826d90574f36e74 \
                --run 31a6d3c775151a877dcac625b4b91a6659317046ea45ad11
State: completed
Stage (3/3): compute output
Downloading result
Received result
{
  "mapping": {
    "0": "500",
    "1": "501",
    "10": "510",
    "100": "600",
    "101": "601",
    "102": "602",
    "103": "603",
    "104": "604",
    "105": "605",
    "106": "606",
    "107": "607",
    ...

This mapping output tells us which rows of Alice's and Bob's data sets had a similarity above our threshold.

Looking at the first two entities in Alice’s data:

$ head -n 3 alice.txt
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M

And looking at the corresponding rows 500 and 501 in Bob's data:

$ tail -n 499 bob.txt | head -n 2
500,Arjun Efron,1990/01/14,M
501,Sedrick Verlinden,1954/11/28,M
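
The same check can be scripted. Here is a minimal Python sketch, assuming the results shown above were saved to a file with --output results.json and that alice.txt and bob.txt are laid out as created earlier in this example:

import csv
import json

# Load the mapping (assumes: clkutil results ... --output results.json)
with open('results.json') as f:
    mapping = json.load(f)['mapping']

# Drop the header row; the mapping indices are 0-based positions of the
# uploaded CLKs, i.e. of the data rows in alice.txt and bob.txt.
with open('alice.txt') as f:
    alice_rows = list(csv.reader(f))[1:]
with open('bob.txt') as f:
    bob_rows = list(csv.reader(f))[1:]

# Print a few linked rows side by side
for a_index, b_index in sorted(mapping.items(), key=lambda kv: int(kv[0]))[:3]:
    print(alice_rows[int(a_index)], '<->', bob_rows[int(b_index)])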

Entity Service Permutation Output

This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, Alice and Bob, each have a dataset of personally identifiable information (PII) describing several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the Analyst.

The chosen output type is permutations, which consists of two permutations and one mask.

Who learns what?

After the linkage has been carried out Alice and Bob will be able to retrieve a permutation - a reordering of their respective data sets such that shared entities line up.

The Analyst - who creates the linkage project - learns the mask. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note this reveals how many entities are shared.
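
As a toy illustration with made-up values (not output of the service): suppose Alice and Bob each hold three records and share a single entity, carol. Each of them only ever sees their own reordered rows; the Analyst only sees the mask.

# Toy example - Alice and Bob each hold 3 records; only 'carol' is shared.
alice_rows = ['carol', 'dan', 'erin']
bob_rows = ['frank', 'carol', 'grace']

# Illustrative permutations: row i is moved to position permutation[i].
alice_permutation = [0, 2, 1]
bob_permutation = [2, 0, 1]

def apply_permutation(rows, permutation):
    out = [None] * len(rows)
    for row, new_position in zip(rows, permutation):
        out[new_position] = row
    return out

alice_reordered = apply_permutation(alice_rows, alice_permutation)  # ['carol', 'erin', 'dan']
bob_reordered = apply_permutation(bob_rows, bob_permutation)        # ['carol', 'grace', 'frank']

# The Analyst's mask: only position 0 holds a shared entity.
mask = [1, 0, 0]
print(sum(mask), 'shared entity')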

Steps

These steps are usually run by different companies - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the Analyst acts as the integration authority.

Check Connection

If you’re connecting to a custom entity service, change the address here.
[1]:
import os
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[2]:
!clkutil status --server "{url}"
{"project_count": 1021, "rate": 2453247, "status": "ok"}

Data preparation

Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.

[3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[4]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head(3)

[4]:
given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id
rec_id
rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218
rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625
rec-4405-org charles green 38 salkauskas crescent kela dapto 4566 nsw 19480930 4365168

The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.

[5]:
schema = NamedTemporaryFile('wt')
[6]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
        "info": "c2NoZW1hX2V4YW1wbGU=",
        "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
        "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 0.5 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 0.5 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}
Overwriting /tmp/tmptfalxkiq

Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.

[7]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "permutations" --server "{url}"
creds.seek(0)

import json
with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpyr8dc2pf
Project created
[7]:
{'project_id': 'b8211d1450c8d0d631dbdc1fb482af106b8cbdebed5b7fd3',
 'result_token': '8fe1fc01f7ac3a3406d1e031b7d120800aa6460d0da62abb',
 'update_tokens': ['1c39c6972626bd34729812f0b9cf6e467461824dbbd0682c',
  '901c12061cf621b67df5b9de2719b8806636364d3fdc1765']}

Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.

Hash and Upload

At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. We need:
  • the clkhash library
  • the linkage schema from above
  • two secret passwords which are only known to Alice and Bob (here: horse and staple)

Please see clkhash documentation for further details on this.

[8]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.32kclk/s, mean=765, std=37.1]
CLK data written to /tmp/tmpc_4k553j.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 4.28kclk/s, mean=756, std=43.3]
CLK data written to /tmp/tmpv7eo2tfp.json

Now the two clients can upload their data providing the appropriate upload tokens and the project_id. As with all commands in clkhash we can output help:

[9]:
!clkutil upload --help
Usage: clkutil upload [OPTIONS] CLK_JSON

  Upload CLK data to entity matching server.

  Given a json file containing hashed clk data as CLK_JSON, upload to the
  entity resolution service.

  Use "-" to read from stdin.

Options:
  --project TEXT         Project identifier
  --apikey TEXT          Authentication API key for the server.
  --server TEXT          Server address including protocol
  -o, --output FILENAME
  -v, --verbose          Script is more talkative
  --help                 Show this message and exit.
Alice uploads her data
[10]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload gets a receipt token. This token is required to access the results.

Bob uploads his data
[11]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"

    bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run

Now the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:

[12]:
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.9 \
        --output "{f.name}"

    run_id = json.load(open(f.name))['run_id']
Results

Now after some delay (depending on the size) we can fetch the mask. This can be done with clkutil:

!clkutil results --server "{url}" \
    --project="{credentials['project_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

However for this tutorial we are going to use the Python requests library:

[13]:
import requests
import clkhash.rest_client

from IPython.display import clear_output
[14]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))
State: completed
Stage (3/3): compute output
[15]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()
[16]:
mask = results['mask']

This mask is a boolean array that specifies where rows of permuted data line up.

[17]:
print(mask[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The number of 1s in the mask will tell us how many matches were found.

[18]:
sum([1 for m in mask if m == 1])
[18]:
4858

We also use requests to fetch the permutations for each data provider:

[19]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()

Now Alice and Bob both have a new permutation - a new ordering for their data.

[20]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]
[20]:
[2333, 1468, 559, 274, 653, 3385, 278, 3568, 3617, 4356]

This permutation says the first row of Alice's data should be moved to position 2333.

[21]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]
[21]:
[2083, 1106, 3154, 1180, 2582, 375, 3533, 1046, 316, 2427]
[22]:
def reorder(items, order):
    """
    Reorder items, placing items[i] at position order[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item

    return neworder
[23]:
with open(a_csv.name, 'r') as f:
    alice_raw = f.readlines()[1:]
    alice_reordered = reorder(alice_raw, alice_permutation)

with open(b_csv.name, 'r') as f:
    bob_raw = f.readlines()[1:]
    bob_reordered = reorder(bob_raw, bob_permutation)

Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don’t.

[24]:
alice_reordered[:10]
[24]:
['rec-2689-org,ainsley,robison,23,atherton street,villa 1/4,deer park,3418,nsw,19310531,4102867\n',
 'rec-1056-org,chloe,imgraben,47,curlewis crescent,dragon rising,burleigh waters,2680,qld,19520516,6111417\n',
 'rec-1820-org,liam,cullens,121,chandler street,the burrows,safety bay,3073,qld,19910811,7828812\n',
 'rec-2192-org,ellie,fearnall,31,fishburn street,colbara,cherrybrook,5171,wa,,7745948\n',
 'rec-2696-org,campbell,nguyen,6,diselma place,villa 2,collinswood,4343,nsw,19630325,2861961\n',
 'rec-968-org,aidan,blake,15,namatjira drive,cooramin,dromana,4074,vic,19270928,4317464\n',
 'rec-3833-org,nicholas,clarke,13,gaylard place,tryphinia view,wetherill park,2810,nsw,19041223,3927795\n',
 'rec-4635-org,isabella,white,8,cooling place,,rosebud,6151,sa,19990911,2206317\n',
 'rec-3549-org,harry,thorpe,11,kambalda crescent,louisa tor 4,angaston,2777,qld,19421128,2701790\n',
 'rec-1220-org,lauren,weltman,6,tewksbury circuit,heritage estate,evans head,6330,nsw,19840930,9462453\n']
[25]:
bob_reordered[:10]
[25]:
['rec-2689-dup-0,ainsley,labalck,23,atherto n street,villa 1/4,deer park,3418,nsw,19310531,4102867\n',
 'rec-1056-dup-0,james,imgrapen,47,curlewiscrescent,dragon rising,burleigh waters,2680,qld,19520516,6111417\n',
 'rec-1820-dup-0,liam,cullens,121,chandlerw street,the burrows,safety bay,3073,qld,19910811,7828812\n',
 'rec-2192-dup-0,elpie,fearnull,31,fishbunestreet,,cherrybrook,5171,wa,,7745948\n',
 'rec-2696-dup-0,jenna,nguyen,85,diselmaplace,villz2,collinswood,4343,nsw,19630325,2861961\n',
 'rec-968-dup-0,aidan,blake,15,namatjifra drive,cooramin,dromana,4074,vic,19270928,4317464\n',
 'rec-3833-dup-0,nicholas,clarke,,gaylard place,tryphinia view,wetherill park,2810,nsw,19041223,3972795\n',
 'rec-4635-dup-0,isaeblla,white,8,cooling place,massey green,rosebud,6151,sa,19990911,2206317\n',
 'rec-3549-dup-0,taylor,thorpe,11,kambalda c rescent,louisa tor 4,angasgon,2777,qld,19421128,2701790\n',
 'rec-1220-dup-0,lauren,welman,6,tewksburl circuit,heritage estate,evans head,6330,nsw,19840930,9462453\n']
Accuracy

To compute how well the matching went we will use the record identifier in the first column as our reference.

For example, rec-1396-org is an original record which has its match in rec-1396-dup-0. To satisfy ourselves we can preview the first few supposed matches:

[26]:
for i, m in enumerate(mask[:10]):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        name_a = ' '.join(entity_a[1:3]).title()
        name_b = ' '.join(entity_b[1:3]).title()

        print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))
Ainsley Robison (rec-2689-org) =? Ainsley Labalck (rec-2689-dup-0)
Chloe Imgraben (rec-1056-org) =? James Imgrapen (rec-1056-dup-0)
Liam Cullens (rec-1820-org) =? Liam Cullens (rec-1820-dup-0)
Ellie Fearnall (rec-2192-org) =? Elpie Fearnull (rec-2192-dup-0)
Campbell Nguyen (rec-2696-org) =? Jenna Nguyen (rec-2696-dup-0)
Aidan Blake (rec-968-org) =? Aidan Blake (rec-968-dup-0)
Nicholas Clarke (rec-3833-org) =? Nicholas Clarke (rec-3833-dup-0)
Isabella White (rec-4635-org) =? Isaeblla White (rec-4635-dup-0)
Harry Thorpe (rec-3549-org) =? Taylor Thorpe (rec-3549-dup-0)
Lauren Weltman (rec-1220-org) =? Lauren Welman (rec-1220-dup-0)
Metrics

If you know the ground truth — the correct mapping between the two datasets — you can compute performance metrics of the linkage.

Precision: the proportion of found matches that are actual matches (tp/(tp+fp)).

Recall: the proportion of actual matches that we have found (tp/(tp+fn)).

[27]:
tp = 0
fp = 0

for i, m in enumerate(mask):
    if m:
        entity_a = alice_reordered[i].split(',')
        entity_b = bob_reordered[i].split(',')
        if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
            tp += 1
        else:
            fp += 1
            #print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])

print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000

print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))
Found 4858 correct matches out of 5000. Incorrectly linked 0 matches.
Precision: 100.0%
Recall: 97.2%

Entity Service Similarity Scores Output

This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and how to retrieve the results. The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.

The sections are usually run by different participants - but for illustration all is carried out in this one file. The participants providing data are Alice and Bob, and the analyst is acting as the integration authority.

Who learns what?

Alice and Bob will both generate and upload their CLKs.

The analyst - who creates the linkage project - learns the similarity scores. Be aware that this is a lot of information and is subject to frequency attacks.

Steps
  • Check connection to Entity Service
  • Data preparation
  • Write CSV files with PII
  • Create a Linkage Schema
  • Create Linkage Project
  • Generate CLKs from PII
  • Upload the PII
  • Create a run
  • Retrieve and analyse results
[1]:
%matplotlib inline

import json
import os
import time

import matplotlib.pyplot as plt
import requests
import clkhash.rest_client
from IPython.display import clear_output
Check Connection

If you are connecting to a custom entity service, change the address here.

[2]:
url = os.getenv("SERVER", "https://testing.es.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://testing.es.data61.xyz
[3]:
!clkutil status --server "{url}"
{"project_count": 2115, "rate": 7737583, "status": "ok"}
Data preparation

Following the clkhash tutorial we will use a dataset from the recordlinkage library. We will just write both datasets out to temporary CSV files.

If you are following along yourself you may have to adjust the file names in all the !clkutil commands.

[4]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[5]:
dfA, dfB = load_febrl4()

a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)

b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)

dfA.head(3)
[5]:
given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id
rec_id
rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218
rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625
rec-4405-org charles green 38 salkauskas crescent kela dapto 4566 nsw 19480930 4365168
Schema Preparation

The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns rec_id and soc_sec_id for CLK generation.

[6]:
schema = NamedTemporaryFile('wt')
[7]:
%%writefile {schema.name}
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 30,
    "hash": {
      "type": "doubleHash"
    },
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
        "info": "c2NoZW1hX2V4YW1wbGU=",
        "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
        "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "rec_id",
      "ignored": true
    },
    {
      "identifier": "given_name",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "surname",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "street_number",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "address_1",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "address_2",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "suburb",
      "format": { "type": "string", "encoding": "utf-8" },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "postcode",
      "format": { "type": "integer", "minimum": 100, "maximum": 9999 },
      "hashing": { "ngram": 1, "positional": true, "weight": 1 }
    },
    {
      "identifier": "state",
      "format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
      "hashing": { "ngram": 2, "weight": 1 }
    },
    {
      "identifier": "date_of_birth",
      "format": { "type": "integer" },
      "hashing": { "ngram": 1, "positional": true, "weight": 1, "missingValue": {"sentinel": ""} }
    },
    {
      "identifier": "soc_sec_id",
      "ignored": true
    }
  ]
}
Overwriting /tmp/tmpvlivqdcf
Create Linkage Project

The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.

[8]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)

!clkutil create-project --schema "{schema.name}" --output "{creds.name}" --type "similarity_scores" --server "{url}"
creds.seek(0)

with open(creds.name, 'r') as f:
    credentials = json.load(f)

project_id = credentials['project_id']
credentials
Credentials will be saved in /tmp/tmpcwpvq6kj
Project created
[8]:
{'project_id': '1eb3da44f73440c496ab42217381181de55e9dcd6743580c',
 'result_token': '846c6c25097c7794131de0d3e2c39c04b7de9688acedc383',
 'update_tokens': ['52aae3f1dfa8a4ec1486d8f7d63a8fe708876b39a8ec585b',
  '92e2c9c1ce52a2c2493b5e22953600735a07553f7d00a704']}

Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.

Hash and Upload

At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. Please see clkhash documentation for further details on this.

[9]:
!clkutil hash "{a_csv.name}" horse staple "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" horse staple "{schema.name}" "{b_clks.name}"
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.06kclk/s, mean=883, std=33.6]
CLK data written to /tmp/tmpj8m1dvxj.json
generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.30kclk/s, mean=875, std=39.7]
CLK data written to /tmp/tmpi2y_ogl9.json

Now the two clients can upload their data providing the appropriate upload tokens.

Alice uploads her data
[10]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][0]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{a_clks.name}"
    res = json.load(open(f.name))
    alice_receipt_token = res['receipt_token']

Every upload gets a receipt token. In some operating modes this receipt is required to access the results.

Bob uploads his data
[11]:
with NamedTemporaryFile('wt') as f:
    !clkutil upload \
        --project="{project_id}" \
        --apikey="{credentials['update_tokens'][1]}" \
        --server "{url}" \
        --output "{f.name}" \
        "{b_clks.name}"

    bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run

Now the project has been created and the CLK data has been uploaded we can carry out some privacy preserving record linkage. Try with a few different threshold values:

[12]:
with NamedTemporaryFile('wt') as f:
    !clkutil create \
        --project="{project_id}" \
        --apikey="{credentials['result_token']}" \
        --server "{url}" \
        --threshold 0.9 \
        --output "{f.name}"

    run_id = json.load(open(f.name))['run_id']
Results

Now after some delay (depending on the size) we can fetch the mask. This can be done with clkutil:

!clkutil results --server "{url}" \
    --project="{credentials['project_id']}" \
    --apikey="{credentials['result_token']}" --output results.txt

However for this tutorial we are going to use the clkhash library:

[13]:
for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))
time.sleep(3)
State: completed
Stage (2/2): compute similarity scores
Progress: 1.000%
[17]:
data = json.loads(clkhash.rest_client.run_get_result_text(
    url,
    project_id,
    run_id,
    credentials['result_token']))['similarity_scores']

This result is a large list of tuples recording the similarity between all rows above the given threshold.

[18]:
for row in data[:10]:
    print(row)
[76, 2345, 1.0]
[83, 3439, 1.0]
[103, 863, 1.0]
[154, 2391, 1.0]
[177, 4247, 1.0]
[192, 1176, 1.0]
[270, 4516, 1.0]
[312, 1253, 1.0]
[407, 3743, 1.0]
[670, 3550, 1.0]

Note there can be a lot of similarity scores:

[19]:
len(data)
[19]:
1572906

We will display a sample of these similarity scores in a histogram using matplotlib:

[20]:
plt.hist([_[2] for _ in data[::100]], bins=50);
[Figure: histogram of the sampled similarity scores]

The vast majority of these similarity scores are for non matches. Let’s zoom into the right side of the distribution.

[21]:
plt.hist([_[2] for _ in data[::1] if _[2] > 0.94], bins=50);
[Figure: histogram of similarity scores above 0.94]

Now it looks like a good threshold should be above 0.95. Let’s have a look at some of the candidate matches around there.

[22]:
def sample(data, threshold, num_samples, epsilon=0.01):
    samples = []
    for row in data:
        if abs(row[2] - threshold) <= epsilon:
            samples.append(row)
        if len(samples) >= num_samples:
            break
    return samples

def lookup_originals(candidate_pair):
    a = dfA.iloc[candidate_pair[0]]
    b = dfB.iloc[candidate_pair[1]]
    return a, b
[23]:
def look_at_per_field_accuracy(threshold = 0.999, num_samples = 100):
    results = []
    for i, candidate in enumerate(sample(data, threshold, num_samples, 0.01), start=1):
        record_a, record_b = lookup_originals(candidate)
        results.append(record_a == record_b)

    print("Proportion of exact matches for each field using threshold: {}".format(threshold))
    print(sum(results)/num_samples)

So we should expect a very high proportion of matches across all fields for high thresholds:

[24]:
look_at_per_field_accuracy(threshold = 0.999, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.999
given_name       0.93
surname          0.96
street_number    0.88
address_1        0.92
address_2        0.80
suburb           0.92
postcode         0.95
state            1.00
date_of_birth    0.96
soc_sec_id       0.40
dtype: float64

But if we look at a threshold which is closer to the boundary between matches and non-matches, we should see a lot more errors:

[25]:
look_at_per_field_accuracy(threshold = 0.95, num_samples = 100)
Proportion of exact matches for each field using threshold: 0.95
given_name       0.49
surname          0.57
street_number    0.81
address_1        0.55
address_2        0.44
suburb           0.70
postcode         0.84
state            0.93
date_of_birth    0.84
soc_sec_id       0.92
dtype: float64
[1]:
import csv
import json
import os

import pandas as pd
[2]:
KEY1 = 'correct'
KEY2 = 'horse'

SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")

Multiparty Linkage with Clkhash

Scenario

There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, but with incomplete overlap. The common features describing those entities are given name, surname, date of birth, and phone number.

They all have some additional information about those entities in their respective datasets: Alice has a person's gender, Bob has their city, and Charlie has their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don't want to share any additional information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers).

Alice, Bob, and Charlie: agree on secret keys and a linkage schema

They keep the keys to themselves, but the schema may be revealed to the analyst.

[3]:
print(f'keys: {KEY1}, {KEY2}')
keys: correct, horse
[4]:
with open('data/schema_ABC.json') as f:
    print(f.read())

{
  "version": 2,
  "clkConfig": {
    "l": 1024,
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "c2NoZW1hX2V4YW1wbGU=",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "id",
      "ignored": true
    },
    {
      "identifier": "givenname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "surname",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": false,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "dob",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 2,
        "positional": true,
        "strategy": {"k": 15}
      }
    },
    {
      "identifier": "phone number",
      "format": {
        "type": "string",
        "encoding": "utf-8"
      },
      "hashing": {
        "ngram": 1,
        "positional": true,
        "strategy": {"k": 8}
      }
    },
    {
      "identifier": "ignoredForLinkage",
      "ignored": true
    }
  ]
}

Sneak peek at input data
Alice
[5]:
pd.read_csv('data/dataset-alice.csv').head()
[5]:
id givenname surname dob phone number gender
0 0 tara hilton 27-08-1941 08 2210 0298 male
1 3 saJi vernre 22-12-2972 02 1090 1906 mals
2 7 sliver paciorek NaN NaN mals
3 9 ruby george 09-05-1939 07 4698 6255 male
4 10 eyrinm campbell 29-1q-1983 08 299y 1535 male
Bob
[6]:
pd.read_csv('data/dataset-bob.csv').head()
[6]:
id givenname surname dob phone number city
0 3 zali verner 22-12-1972 02 1090 1906 perth
1 4 samuel tremellen 21-12-1923 03 3605 9336 melbourne
2 5 amy lodge 16-01-1958 07 8286 9372 canberra
3 7 oIji pacioerk 10-02-1959 04 4220 5949 sydney
4 10 erin kampgell 29-12-1983 08 2996 1445 perth
Charlie
[7]:
pd.read_csv('data/dataset-charlie.csv').head()
[7]:
id givenname surname dob phone number income
0 1 joshua arkwright 16-02-1903 04 8511 9580 70189.446
1 3 zal: verner 22-12-1972 02 1090 1906 50194.118
2 7 oliyer paciorwk 10-02-1959 04 4210 5949 31750.993
3 8 nacoya ranson 17-08-1925 07 6033 4580 102446.131
4 10 erih campbell 29-12-1i83 08 299t 1435 331476.599
Analyst: create the project

The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone.

[8]:
!clkutil create-project --server $SERVER --type groups --schema data/schema_ABC.json --parties 3 --output credentials.json

with open('credentials.json') as f:
    credentials = json.load(f)
    project_id = credentials['project_id']
    result_token = credentials['result_token']
    update_token_alice = credentials['update_tokens'][0]
    update_token_bob = credentials['update_tokens'][1]
    update_token_charlie = credentials['update_tokens'][2]
Project created
Alice: hash the data and upload it to the server

The data is hashed according to the schema and the keys. Alice’s update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes.

[9]:
!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema_ABC.json dataset-alice-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.23k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.23k [00:02<00:31, 96.1clk/s, mean=372, std=32.6]
generating CLKs:  25%|##4       | 800/3.23k [00:02<00:17, 136clk/s, mean=371, std=35.5]
generating CLKs:  63%|######2   | 2.03k/3.23k [00:02<00:06, 193clk/s, mean=372, std=34.7]
generating CLKs: 100%|##########| 3.23k/3.23k [00:02<00:00, 1.29kclk/s, mean=372, std=34.9]
CLK data written to dataset-alice-hashed.json
[10]:
!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json
{"message": "Updated", "receipt_token": "c54597f32fd969603efba706af1556abee3cc35f2718bcb6"}
Bob: hash the data and upload it to the server
[11]:
!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema_ABC.json dataset-bob-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.24k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.24k [00:01<00:25, 119clk/s, mean=369, std=32.4]
generating CLKs:  31%|###       | 1.00k/3.24k [00:01<00:13, 168clk/s, mean=371, std=35]
generating CLKs:  56%|#####5    | 1.80k/3.24k [00:01<00:06, 238clk/s, mean=371, std=35.5]
generating CLKs: 100%|##########| 3.24k/3.24k [00:02<00:00, 1.45kclk/s, mean=372, std=35.3]
CLK data written to dataset-bob-hashed.json
[12]:
!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json
{"message": "Updated", "receipt_token": "6ee2fe5df850b795ee6ddff1aaf4dfb03f6d4398dedcc248"}
Charlie: hash the data and upload it to the server
[13]:
!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema_ABC.json dataset-charlie-hashed.json --check-header false

generating CLKs:   0%|          | 0.00/3.26k [00:00<?, ?clk/s, mean=0, std=0]
generating CLKs:   6%|6         | 200/3.26k [00:01<00:24, 122clk/s, mean=371, std=33.3]
generating CLKs:  55%|#####5    | 1.80k/3.26k [00:01<00:08, 174clk/s, mean=372, std=34.5]
generating CLKs: 100%|##########| 3.26k/3.26k [00:01<00:00, 1.73kclk/s, mean=372, std=34.8]
CLK data written to dataset-charlie-hashed.json
[14]:
!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json
{"message": "Updated", "receipt_token": "064664ed9fd1f58c4da05c62a4832b813276d09342137a42"}
Analyst: start the linkage run

This will start the linkage computation. We will wait a little bit and then retrieve the results.

[15]:
!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.7 --output=run-credentials.json

with open('run-credentials.json') as f:
    run_credentials = json.load(f)
    run_id = run_credentials['run_id']
Analyst: retrieve the results
[16]:
!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --watch --output linkage-output.json
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[17]:
with open('linkage-output.json') as f:
    linkage_output = json.load(f)
    linkage_groups = linkage_output['groups']
Everyone: make table of interesting information

We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII.

[18]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    genders = tuple(row[-1] for row in r)

with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    cities = tuple(row[-1] for row in r)

with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    incomes = tuple(row[-1] for row in r)
[19]:
table = []
for group in linkage_groups:
    row = [''] * 3
    for i, j in group:
        row[i] = [genders, cities, incomes][i][j]
    if sum(map(bool, row)) > 1:
        table.append(row)
pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)
[19]:
gender city income
0 peGh 395273.665
1 sydnev 77367.636
2 pertb 323383.650
3 syd1e7y 79745.538
4 perth 28019.494
5 canberra 78961.675
6 female brisnane
7 male canbetra
8 sydme7 106849.526
9 melbourne 68548.966

The last 15 groups look like this.

[20]:
linkage_groups[-15:]
[20]:
[[[0, 2111], [1, 2100]],
 [[0, 2121], [2, 2131], [1, 2111]],
 [[1, 1146], [2, 1202], [0, 1203]],
 [[1, 2466], [2, 2478], [0, 2460]],
 [[0, 429], [1, 412]],
 [[0, 2669], [1, 1204]],
 [[1, 1596], [2, 1623]],
 [[0, 487], [1, 459]],
 [[1, 1776], [2, 1800], [0, 1806]],
 [[1, 2586], [2, 2602]],
 [[0, 919], [1, 896]],
 [[0, 100], [2, 107], [1, 100]],
 [[0, 129], [1, 131], [2, 135]],
 [[0, 470], [1, 440]],
 [[0, 1736], [1, 1692], [2, 1734]]]
Sneak peek at the result

We obviously can’t do this in a real-world setting, but let’s view the linkage using the PII. If the IDs match, then we are correct.

[21]:
with open('data/dataset-alice.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_alice = tuple(r)

with open('data/dataset-bob.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_bob = tuple(r)

with open('data/dataset-charlie.csv') as f:
    r = csv.reader(f)
    next(r)  # Skip header
    dataset_charlie = tuple(r)
[22]:
table = []
for group in linkage_groups:
    for i, j in sorted(group):
        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])
    table.append([''] * 6)

pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)
[22]:
id given name surname dob phone number non-linking
6426 1171 isabelle bridgland 30-03-1994 04 5318 6471 mal4
6427 1171 isalolIe riahgland 30-02-1994 04 5318 6471 sydnry
6428 1171 isabelle bridgland 30-02-1994 04 5318 6471 63514.217
6429
6430 1243 thmoas doaldson 13-04-1900 09 6963 1944 male
6431 1243 thoma5 donaldson 13-04-1900 08 6962 1944 perth
6432 1243 thomas donalsdon 13-04-2900 08 6963 2944 489229.297
6433
6434 2207 annah aslea 02-11-2906 04 5501 5973 male
6435 2207 hannah easlea 02-11-2006 04 5501 5973 canberra
6436
6437 5726 rhys clarke 19-05-1929 02 9220 9635 mqle
6438 5726 ry5 clarke 19-05-1939 02 9120 9635
6439 5726 rhys klark 19-05-2938 02 9220 9635 118197.119
6440
[1]:
import csv
import itertools
import os

import requests

Entity Service: Multiparty linkage demo

This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.

We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.

Check the status of the Entity Service

Ensure that it is running and that we have the correct version. Multiparty support was introduced in version 1.11.0.

[2]:
SERVER = os.getenv("SERVER", "https://testing.es.data61.xyz")
PREFIX = f"{SERVER}/api/v1"
print(requests.get(f"{PREFIX}/status").json())
print(requests.get(f"{PREFIX}/version").json())
{'project_count': 10, 'rate': 20496894, 'status': 'ok'}
{'anonlink': '0.11.2', 'entityservice': 'v1.11.0', 'python': '3.6.8'}
Create a new project

We create a new multiparty project for five parties by specifying the number of parties and the output type (currently only the group output type supports multiparty linkage). Retain the project_id, so we can find the project later. Also retain the result_token, so we can retrieve the results (careful: anyone with this token has access to the results). Finally, the update_tokens identify the five data providers and permit them to upload CLKs.

[3]:
project_info = requests.post(
    f"{PREFIX}/projects",
    json={
        "schema": {},
        "result_type": "groups",
        "number_parties": 5,
        "name": "example project"
    }
).json()
project_id = project_info["project_id"]
result_token = project_info["result_token"]
update_tokens = project_info["update_tokens"]

print("project_id:", project_id)
print()
print("result_token:", result_token)
print()
print("update_tokens:", update_tokens)
project_id: 8eeb1050f5add8f78ff4a0da04219fead48f22220fb0f15e

result_token: c8f22b577aac9432871eeea02cbe504d399a9776add1de9f

update_tokens: ['6bf0f1c84c17116eb9f93cf8a4cfcb13d49d288a1f376dd8', '4b9265070849af1f0546f2adaeaa85a7d0e60b10f9b4afbc', '3ff03cadd750ce1b40cc4ec2b99db0132f62d8687328eeb9', 'c1b562ece6bbef6cd1a0541301bb1f82bd697bce04736296', '8cfdebbe12c65ae2ff20fd0c0ad5de4feb06c9a9dd1209c8']
Upload the hashed data

This is where each party uploads their CLKs into the service. Here, we do the work of all five data providers inside this for loop. In a deployment scenario, each data provider would be uploading their own CLKs using their own update token.

These CLKs are already hashed using clkhash, so for each data provider, we just need to upload their corresponding hash file.

[4]:
for i, token in enumerate(update_tokens, start=1):
    with open(f"data/clks-{i}.json") as f:
        r = requests.post(
            f"{PREFIX}/projects/{project_id}/clks",
            data=f,
            headers={
                "Authorization": token,
                "content-type": "application/json"
            }
        )
    print(f"Data provider {i}: {r.text}")
Data provider 1: {
  "message": "Updated",
  "receipt_token": "c7d9ba71260863f13af55e12603f8694c29e935262b15687"
}

Data provider 2: {
  "message": "Updated",
  "receipt_token": "70e4ed1b403c4e628183f82548a9297f8417ca3de94648bf"
}

Data provider 3: {
  "message": "Updated",
  "receipt_token": "b56fe568b93dc4522444e503078e16c18573adecbc086b6a"
}

Data provider 4: {
  "message": "Updated",
  "receipt_token": "7e3c80e554cfde23847d9aa2cff1323aa8f411e4033c0562"
}

Data provider 5: {
  "message": "Updated",
  "receipt_token": "8bde91367ee52b5c6804d5ce2d2d3350ce3c3766b8625bbc"
}

Begin a run

The data providers have uploaded their CLKs, so we may begin the computation. This computation may be repeated multiple times, each time with different parameters. Each such repetition is called a run. The most important parameter to vary between runs is the similarity threshold. Two records whose similarity is above this threshold will be considered to describe the same entity.

Here, we perform one run. We (somewhat arbitrarily) choose the threshold to be 0.8.

[5]:
r = requests.post(
    f"{PREFIX}/projects/{project_id}/runs",
    headers={
        "Authorization": result_token
    },
    json={
        "threshold": 0.8
    }
)
run_id = r.json()["run_id"]
Check the status

Let’s see whether the run has finished (‘state’ is ‘completed’)!

[6]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/status",
    headers={
        "Authorization": result_token
    }
)
r.json()
[6]:
{'current_stage': {'description': 'waiting for CLKs',
  'number': 1,
  'progress': {'absolute': 5,
   'description': 'number of parties already contributed',
   'relative': 1.0}},
 'stages': 3,
 'state': 'queued',
 'time_added': '2019-06-23T11:17:27.646642+00:00',
 'time_started': None}

Now after some delay (depending on the size) we can fetch the results. Waiting for completion can be achieved by directly polling the REST API using requests, however for simplicity we will just use the watch_run_status function provided in clkhash.rest_client.

[7]:
import clkhash.rest_client
from IPython.display import clear_output

for update in clkhash.rest_client.watch_run_status(SERVER, project_id, run_id, result_token, timeout=30):
    clear_output(wait=True)
    print(clkhash.rest_client.format_run_status(update))

State: completed
Stage (3/3): compute output
Retrieve the results

We retrieve the results of the linkage. As we selected earlier, the result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party id and the row index.

The last 20 groups look like this.

[8]:
r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={
        "Authorization": result_token
    }
)
groups = r.json()
groups['groups'][-20:]
[8]:
[[[0, 3127], [3, 3145], [2, 3152], [1, 3143]],
 [[2, 1653], [3, 1655], [1, 1632], [0, 1673], [4, 1682]],
 [[0, 2726], [1, 2737], [3, 2735]],
 [[1, 837], [3, 864]],
 [[0, 1667], [4, 1676], [1, 1624], [3, 1646]],
 [[1, 1884], [2, 1911], [4, 1926], [0, 1916]],
 [[0, 192], [2, 198]],
 [[3, 328], [4, 330], [0, 350], [2, 351], [1, 345]],
 [[2, 3173], [4, 3176], [3, 3163], [0, 3145], [1, 3161]],
 [[1, 347], [4, 332], [2, 353], [0, 352]],
 [[1, 736], [3, 761], [2, 768], [0, 751], [4, 754]],
 [[1, 342], [2, 349]],
 [[3, 899], [2, 913]],
 [[1, 465], [3, 477]],
 [[0, 285], [1, 293]],
 [[0, 785], [3, 794]],
 [[3, 2394], [4, 2395], [0, 2395]],
 [[1, 1260], [2, 1311], [3, 1281], [4, 1326]],
 [[0, 656], [2, 663]],
 [[1, 2468], [2, 2479]]]

To sanity check, we print their records’ corresponding PII:

[17]:
def load_dataset(i):
    dataset = []
    with open(f"data/dataset-{i}.csv") as f:
        reader = csv.reader(f)
        next(reader)  # ignore header
        for row in reader:
            dataset.append(row[1:])
    return dataset

datasets = list(map(load_dataset, range(1, 6)))

for group in itertools.islice(groups["groups"][-20:], 20):
    for (i, j) in group:
        print(i, datasets[i][j])
    print()
0 ['samual', 'mason', '05-12-1917', 'male', 'pertb', '405808.756', '07 2284 3649']
3 ['samuAl', 'mason', '05-12-1917', 'male', 'peryh', '4058o8.756', '07 2274 3549']
2 ['samie', 'mazon', '05-12-1917', 'male', '', '405898.756', '07 2275 3649']
1 ['zamusl', 'mason', '05-12-2917', 'male', '', '405898.756', '07 2274 2649']

2 ['thomas', 'burfrod', '08-04-1999', '', 'pertj', '182174.209', '02 3881 9666']
3 ['thomas', 'burfrod', '09-04-1999', 'male', '', '182174.209', '02 3881 9666']
1 ['thomas', 'burford', '08-04-19o9', 'mal4', '', '182175.109', '02 3881 9666']
0 ['thomas', 'burford', '08-04-1999', 'male', 'perth', '182174.109', '02 3881 9666']
4 ['thomas', 'burf0rd', '08-04-q999', 'mske', 'perrh', '182174.109', '02 3881 9666']

0 ['kaitlin', 'bondza', '03-08-1961', 'male', 'sydney', '41168.999', '02 4632 1380']
1 ['kaitlin', 'bondja', '03-08-1961', 'malr', 'sydmey', '41168.999', '02 4632 1370']
3 ["k'latlin", 'bonklza', '03-08-1961', 'male', 'sydaney', '', '02 4632 1380']

1 ['chr8stian', 'jolly', '22-08-2009', 'male', '', '178371.991', '04 5868 7703']
3 ['chr8stian', 'jolly', '22-09-2099', 'malr', 'melbokurne', '178271.991', '04 5868 7703']

0 ['oaklrigh', 'ngvyen', '24-07-1907', 'mslr', 'sydney', '63175.398', '04 9019 6235']
4 ['oakleith', 'ngvyen', '24-97-1907', 'male', 'sydiney', '63175.498', '04 9019 6235']
1 ['oajleigh', 'ngryen', '24-07-1007', 'male', 'sydney', '63175.498', '04 9919 6235']
3 ['oakleigh', 'nguyrn', '34-07-1907', 'male', 'sbdeney', '63175.r98', '04 9019 6235']

1 ['georgia', 'nguyen', '06-11-1930', 'male', 'perth', '247847.799', '08 6560 4063']
2 ['georia', 'nfuyen', '06-11-1930', 'male', 'perrh', '247847.799', '08 6560 4963']
4 ['geortia', 'nguyea', '06-11-1930', 'male', 'pertb', '247847.798', '08 6560 4063']
0 ['egorgia', 'nguyqn', '06-11-1930', 'male', 'peryh', '247847.799', '08 6460 4963']

0 ['connor', 'mcneill', '05-09-1902', 'male', 'sydney', '108473.824', '02 6419 9472']
2 ['connro', 'mcnell', '05-09-1902', 'male', 'sydnye', '108474.824', '02 6419 9472']

3 ['alessandria', 'sherriff', '25-91-1951', 'male', 'melb0urne', '5224r.762', '03 3077 2019']
4 ['alessandria', 'sherriff', '25-01-1951', 'male', 'melbourne', '52245.762', '03 3077 1019']
0 ['alessandria', "sherr'lff", '25-01-1951', 'malr', 'melbourne', '', '03 3977 1019']
2 ['alessandria', 'shernff', '25-01-1051', 'mzlr', 'melbourne', '52245.663', '03 3077 1019']
1 ['alessandrya', 'sherrif', '25-01-1961', 'male', 'jkelbouurne', '52245.762', '03 3077 1019']

2 ['harriyon', 'micyelmor', '21-04-1971', 'male', 'pert1>', '291889.942', '04 5633 5749']
4 ['harri5on', 'micyelkore', '21-04-1971', '', 'pertb', '291880.942', '04 5633 5749']
3 ['hariso17', 'micelmore', '21-04-1971', 'male', 'pertb', '291880.042', '04 5633 5749']
0 ['harrison', 'michelmore', '21-04-1981', 'malw', 'preth', '291880.942', '04 5643 5749']
1 ['harris0n', 'michelmoer', '21-04-1971', '', '', '291880.942', '04 5633 5749']

1 ['alannah', 'gully', '15-04-1903', 'make', 'meobourne', '134518.814', '04 5104 4572']
4 ['alana', 'gully', '15-04-1903', 'male', 'melbourne', '134518.814', '04 5104 4582']
2 ['alama', 'gulli', '15-04-1903', 'mald', 'melbourne', '134518.814', '04 5104 5582']
0 ['alsna', 'gullv', '15-04-1903', 'male', '', '134518.814', '04 5103 4582']

1 ['sraah', 'bates-brownsword', '26-11-1905', 'malr', '', '59685.979', '03 8545 5584']
3 ['sarah', 'bates-brownswort', '26-11-1905', 'male', '', '59686.879', '03 8545 6584']
2 ['sara0>', 'bates-browjsword', '26-11-1905', 'male', '', '59685.879', '']
0 ['saran', 'bates-brownsvvord', '26-11-1905', 'malr', 'sydney', '59685.879', '03 8555 5584']
4 ['snrah', 'bates-bro2nsword', '26-11-1005', 'male', 'sydney', '58685.879', '03 8545 5584']

1 ['beth', 'lette', '18-01-2000', 'female', 'sydney', '179719.049', '07 1868 6031']
2 ['beth', 'lette', '18-02-2000', 'femal4', 'stdq7ey', '179719.049', '07 1868 6931']

3 ['tahlia', 'bishlp', '', 'female', 'sydney', '101203.290', '03 886u 1916']
2 ['ahlia', 'bishpp', '', 'female', 'syriey', '101204.290', '03 8867 1916']

1 ['fzachary', 'mydlalc', '20-95-1916', 'male', 'sydney', '121209.129', '08 3807 4717']
3 ['zachary', 'mydlak', '20-05-1016', 'malr', 'sydhey', '121200.129', '08 3807 4627']

0 ['jessica', 'white', '04-07-1979', 'male', 'perth', '385632.266', '04 8026 8748']
1 ['jezsica', 'whi5e', '05-07-1979', 'male', 'perth', '385632.276', '04 8026 8748']

0 ['beriiamin', 'musoluno', '21-0y-1994', 'female', 'sydney', '81857.391', '08 8870 e498']
3 ['byenzakin', 'musoljno', '21-07-1995', 'female', 'sydney', '81857.392', '']

3 ['ella', 'howie', '26-03-2003', 'male', 'melbourne', '97556.316', '03 3655 1171']
4 ['ela', 'howie', '26-03-2003', 'male', 'melboirne', '', '03 3555 1171']
0 ['lela', 'howie', '26-03-2903', 'male', 'melbourhe', '', '03 3655 1171']

1 ['livia', 'riaj', '13-03-1907', 'malw', 'melbovrne', '73305.107', '07 3846 2530']
2 ['livia', 'ryank', '13-03-1907', 'malw', 'melbuorne', '73305.107', '07 3946 2630']
3 ['ltvia', 'ryan', '13-03-1907', 'maoe', 'melbourne', '73305.197', '07 3046 2530']
4 ['livia', 'ryan', '13-03-1907', 'male', 'melbourne', '73305.107', '07 3946 2530']

0 ['coby', 'ibshop', '', 'msle', 'sydney', '211655.118', '02 0833 7777']
2 ['coby', 'bishop', '15-08-1948', 'male', 'sydney', '211655.118', '02 9833 7777']

1 ['emjkly', 'pareemore', '01-03-2977', 'female', 'rnelbourne', '1644487.925', '03 5761 5483']
2 ['emiily', 'parremore', '01-03-1977', 'female', 'melbourne', '1644487.925', '03 5761 5483']

Despite the high amount of noise in the data, the entity service was able to produce a fairly accurate matching. However, a few of the groups above most likely contain records that do not actually refer to the same entity.

We may be able to improve on these results by fine-tuning the hashing schema or by changing the threshold.

Delete the project
[18]:
r = requests.delete(
    f"{PREFIX}/projects/{project_id}",
    headers={
        "Authorization": result_token
    }
)
print(r.status_code)
204

External Tutorials

The clkhash library includes a tutorial on carrying out record linkage on perturbed data: <http://clkhash.readthedocs.io/en/latest/tutorial_cli.html>

Concepts

Cryptographic Longterm Key

A Cryptographic Longterm Key is the name given to a Bloom filter used as a privacy preserving representation of an entity. Unlike a cryptographic hash function, a CLK preserves similarity - meaning two similar entities will have similar CLKs. This property is necessary for probabilistic record linkage.

CLKs are created independently of the entity service, following a keyed hashing process.

A CLK incorporates information from multiple identifying fields (e.g., name, date of birth, phone number) for each entity. The schema section details how to capture the configuration for creating CLKs from PII, and the next section outlines how to serialize CLKs for use with this service’s api.
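
As a rough illustration of that idea (this is not clkhash's actual algorithm - the real scheme uses keyed hashing with the agreed secrets, per-field weights, and the parameters given in the linkage schema), bigrams of each field can be hashed into a fixed-length bit array:

import hashlib

def toy_clk(fields, l=1024, k=30):
    # Illustrative, unkeyed sketch of a CLK-like Bloom filter.
    bits = [0] * l
    for value in fields:
        padded = ' {} '.format(value.lower())
        bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
        for gram in bigrams:
            for seed in range(k):
                digest = hashlib.sha256('{}{}'.format(seed, gram).encode()).digest()
                bits[int.from_bytes(digest[:4], 'big') % l] = 1
    return bits

clk_a = toy_clk(['John Smith', '1984/03/01'])
clk_b = toy_clk(['Jon Smith', '1984/03/01'])  # a similar entity yields a similar bit pattern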

Note

The Cryptographic Longterm Key was introduced in A Novel Error-Tolerant Anonymous Linking Code by Rainer Schnell, Tobias Bachteler, and Jörg Reiher.

Bloom Filter Format

A Bloom filter is simply an encoding of PII as a bitarray.

This can easily be represented as bytes (each being an 8 bit number between 0 and 255). We serialize by base64 encoding the raw bytes of the bit array.

An example with a 64 bit filter:

# bloom filter's binary value
'0100110111010000101111011111011111011000110010101010010010100110'

# which corresponds to the following bytes
[77, 208, 189, 247, 216, 202, 164, 166]

# which gets base64 encoded to
'TdC999jKpKY=\n'

As with standard Base64 encodings, a newline is introduced every 76 characters.
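
As an illustration, the example above can be reproduced with a few lines of Python using only the standard library (clkhash has its own serialization code; this is just a sketch):

# Sketch of the serialization described above, standard library only.
import base64

bits = '0100110111010000101111011111011111011000110010101010010010100110'

# Pack the bit string into bytes, 8 bits at a time.
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
print(list(raw))                # [77, 208, 189, 247, 216, 202, 164, 166]

# encodebytes inserts a newline every 76 characters, as noted above.
print(base64.encodebytes(raw))  # b'TdC999jKpKY=\n'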

Schema

It is important that participating organisations agree on how personally identifiable information is processed to create the CLKs. We call the configuration for creating CLKs a linkage schema. The organisations have to agree on a schema to ensure their CLKs are comparable.

The linkage schema is documented in clkhash, our reference implementation written in Python.

Note

Due to the one-way nature of hashing, the entity service can’t determine whether the linkage schema was followed when clients generated CLKs.

Comparing Cryptographic Longterm Keys

The similarity metric used is the Sørensen–Dice index - although this may become a configurable option in the future.
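
As an illustration of the metric (the production implementation lives in the anonlink library), the Sørensen–Dice index of two bit arrays is twice the number of bits set in both, divided by the total number of set bits:

# Illustrative only - the production implementation is part of anonlink.
def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sørensen-Dice index of two equal-length bit arrays given as bytes."""
    common = sum(bin(a & b).count('1') for a, b in zip(clk_a, clk_b))
    total = (sum(bin(x).count('1') for x in clk_a)
             + sum(bin(x).count('1') for x in clk_b))
    return 2 * common / total if total else 0.0

# 0b1100 and 0b1010 share one set bit out of four in total: score 0.5
print(dice_coefficient(bytes([0b1100]), bytes([0b1010])))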

Output Types

The Entity Service supports different result types which affect what output is produced, and who may see the output.

Warning

The security guarantees differ substantially for each output type. See the Security document for a treatment of these concerns.

Similarity Score

Similarity scores are computed between the CLKs of the participating organisations - the scores above a given threshold are returned. This output type is currently the only way to work with 1 to many relationships.

The result_token (generated when creating the mapping) is required. The result_type should be set to "similarity_scores".

Results are a simple JSON array of arrays:

[
    [index_a, index_b, score],
    ...
]

Where the index values will be the 0 based row index from the uploaded CLKs, and the score will be a Number between the provided threshold and 1.0.

A score of 1.0 means the CLKs were identical. Threshold values are usually between 0.5 and 1.0.
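
For example, following the REST pattern from the tutorial above (the names PREFIX, project_id, run_id and result_token are assumed to be defined as in that tutorial), the scores for a completed run could be fetched along these lines:

# Sketch only - assumes PREFIX, project_id, run_id and result_token are
# defined as in the tutorial above.
import requests

r = requests.get(
    f"{PREFIX}/projects/{project_id}/runs/{run_id}/result",
    headers={"Authorization": result_token}
)
# As described above, the body is an array of [index_a, index_b, score].
for index_a, index_b, score in r.json():
    print(index_a, index_b, score)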

Note

The maximum number of results returned is the product of the two data set lengths.

For example:

Comparing two data sets each containing 1 million records with a threshold of 0.0 will return 1 trillion results (1e+12).

Direct Mapping Table (Deprecated for Groups Result)

The direct mapping takes the similarity scores and simply assigns the highest scores as links.

The links are exposed as a lookup table using indices from the two organizations:

{
    index_a: index_b,
    ...
}

The result_token (generated when creating the mapping) is required to retrieve the results. The result_type should be set to "mapping".
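
A minimal sketch of the idea behind this assignment (the service's actual solver is part of the anonlink library): sort candidate pairs by score and greedily accept a pair only if neither row has been linked yet.

# Illustrative greedy assignment; the production solver is part of anonlink.
def greedy_mapping(similarity_scores):
    """similarity_scores: iterable of (index_a, index_b, score) triples."""
    mapping, used_a, used_b = {}, set(), set()
    for index_a, index_b, score in sorted(similarity_scores,
                                          key=lambda t: t[2], reverse=True):
        if index_a not in used_a and index_b not in used_b:
            mapping[index_a] = index_b
            used_a.add(index_a)
            used_b.add(index_b)
    return mapping

print(greedy_mapping([(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.7)]))  # {0: 1, 1: 2}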

Groups Result

The groups result has been created for multi-party linkage, and will replace the direct mapping result for two parties as it contains the same information in a different format.

The result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party index and the row index:

[
  [
    [party_id, row_index],
    ...
  ],
  ...
]

The result_token (generated when creating the mapping) is required to retrieve the results. The result_type should be set to "groups".
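
For example, the group [[0, 10], [1, 3], [2, 27]] says that row 10 of party 0, row 3 of party 1 and row 27 of party 2 all belong to the same entity. A small sketch of walking over such a result (the values here are made up):

# A made-up groups result: two entities, the first linked across three
# parties and the second across two.
groups = [
    [[0, 10], [1, 3], [2, 27]],
    [[0, 11], [2, 30]],
]
for entity_number, group in enumerate(groups):
    members = ", ".join(f"party {party_id} row {row_index}"
                        for party_id, row_index in group)
    print(f"entity {entity_number}: {members}")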

Permutation and Mask

This protocol creates a random reordering for both organizations; and creates a mask revealing where the reordered rows line up.

Accessing the mask requires the result_token, and accessing the permutation requires a receipt-token (provided to each organization when they upload data).

Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.

The result_type should be set to "permutations".
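
A toy illustration of how the permutations and mask fit together, with made-up values and assuming the permutation gives the new position of each row (in practice the service generates these and reveals them to different parties):

# Toy example with made-up values; in practice the service generates these.
rows_a = ['a0', 'a1', 'a2', 'a3']   # larger data set
rows_b = ['b0', 'b1', 'b2']         # smaller data set

permutation_a = [2, 0, 3, 1]        # assumed: new position of each row of A
permutation_b = [1, 2, 0]           # assumed: new position of each row of B
mask = [1, 0, 1]                    # one bit per row of the smaller data set

def apply_permutation(rows, permutation):
    permuted = [None] * len(rows)
    for old_index, new_index in enumerate(permutation):
        permuted[new_index] = rows[old_index]
    return permuted

permuted_a = apply_permutation(rows_a, permutation_a)
permuted_b = apply_permutation(rows_b, permutation_b)

# Where the mask is 1, the aligned (permuted) rows refer to the same entity.
for position, bit in enumerate(mask):
    if bit:
        print(permuted_a[position], 'lines up with', permuted_b[position])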

Security

The service isn’t given any personally identifying information in raw form - rather clients must locally compute a CLK which is a hashed version of the data to be linked.

Considerations for each output type

Direct Mapping Table

The default output of the Entity Service comprises a list of edges - connections between rows in dataset A to rows in dataset B. This assumes at most a 1-1 correspondence - each entity will only be present in zero or one edge.

This output is only available to the client who created the mapping, but it is worth highlighting that it does (by design) leak information about the intersection of the two sets of entities.

Knowledge about set intersection
This output contains information about which particular entities are shared, and which are not. Potentially knowing the overlap between the organizations is disclosive. This is mitigated by using unique authorization codes generated for each mapping, which are required to retrieve the results.

Row indices exposed
The output directly exposes the row indices provided to the service, which if not randomized may be disclosive. For example, entities simply exported from a database might be ordered by age, patient admittance date, salary band etc.

Similarity Score

All calculated similarities (above a given threshold) between entities are returned. This output comprises a list of weighted edges - similarity between rows in dataset A to rows in dataset B. This is a many to many relationship where entities can appear in multiple edges.

Recovery from the distance measurements
This output type includes the plaintext distance measurements between entities; this additional information can be used to fingerprint individual entities based on their ordered similarity scores. In combination with public information this can lead to recovery of identity. This attack is described in section 3 of Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics’ Privacy-Preserving Record Linkage by Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague.

In order to prevent this attack it is important not to provide the similarity table to untrusted parties.

Permutation and Mask

This output type involves creating a random reordering of the entities for both organizations; and creating a binary mask vector revealing where the reordered rows line up. This output is designed for use in multi-party computation algorithms.

This mitigates the Knowledge about set intersection problem from the direct mapping output - assuming the mask is not made available to the data providers.

Note the mask will be the length of the smaller data set and is applied after permuting the entities. This means the owner of the larger data set learns a subset of her rows which are not in the smaller data set.

Authentication / Authorization

The entity service does not support authentication yet; this is planned for a future version.

All sensitive data is protected by token-based authorization. That is, you need to provide the correct token to access different resources. A token is a unique random 192-bit string.
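
For illustration only (not necessarily how the service generates its tokens), a 192-bit random token can be produced with Python's standard library:

# Illustration only - not necessarily how the service generates its tokens.
import secrets

token = secrets.token_hex(24)   # 24 bytes = 192 bits, hex encoded
print(token)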

There are three different types of tokens:

  • update_token: required to upload a party’s CLKs.
  • result_token: required to access the result of the entity resolution process. This is, depending on the output type, either similarity scores, a direct mapping table, or a mask.
  • receipt-token: this token is returned to either party after uploading their respective CLKs. With this receipt-token they can then access their respective permutations, if the output type of the mapping is set to permutation and mask.

Important

These tokens are the only artifacts that protect the sensitive data. Therefore it is paramount to make sure that only authorized parties have access to these tokens!

Attack Vectors

The following attack vectors need to be considered for all output types.

Stealing/Leaking uploaded CLKs

The uploaded CLKs for one organization could be leaked to the partner organization, who possesses the HMAC secret, breaking semantic security. The entity service doesn’t expose an API that allows users to access any CLKs, and the object store (MINIO or S3) and the database (postgresql) are configured to not allow public access.

Deployment

Local Deployment

Dependencies

Docker and docker-compose

Build

From the project folder, run:

./tools/build.sh

This will create the docker images tagged with latest which are used by docker-compose.

Run

Run docker compose:

docker-compose -p n1es -f tools/docker-compose.yml up

This will start the following containers:

  • nginx frontend (named n1es_nginx_1)
  • gunicorn/flask backend (named n1es_backend_1)
  • celery backend worker (named n1es_worker_1)
  • postgres database (named n1es_db_1)
  • redis job queue (named n1es_redis_1)
  • minio object store
  • jaeger opentracing

The REST api for the service is exposed on port 8851 of the nginx container, which docker will map to a high numbered port on your host.

The address of the nginx endpoint can be found with:

docker port n1es_nginx_1 "8851"

For example to GET the service status:

$ export ENTITY_SERVICE=`docker port n1es_nginx_1 "8851"`
$ curl $ENTITY_SERVICE/api/v1/status
{
    "status": "ok",
    "number_mappings": 0,
    "rate": 1
}

The service can be taken down by hitting CTRL+C. This doesn’t clear the DB volumes, which will persist and conflict with the next call to docker-compose … up unless they are removed. Removing these volumes is easy; just run:

docker-compose -p n1es -f tools/docker-compose.yml down -v

in between calls to docker-compose … up.

Monitoring

A celery monitor tool flower is also part of the docker-compose file - this graphical interface allows administration and monitoring of the celery tasks and workers. Access this via the monitor container.

Testing with docker-compose

An additional docker-compose config file can be found in ./tools/ci.yml; this can be added in to run along with the rest of the service:

docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml  up -d

docker logs -f n1estest_tests_1

docker-compose -p n1estest -f tools/docker-compose.yml -f tools/ci.yml down
Docker Compose Tips

A collection of development tips.

Local Scaling

You can run additional worker containers by scaling with docker-compose:

docker-compose -f tools/docker-compose.yml scale es_worker=2

Volumes

You might need to destroy the docker volumes used for the object store and the postgres database:

docker-compose -f tools/docker-compose.yml rm -s -v [-p <project-name>]
Restart one service

Docker compose can modify an existing deployment, this can be particularly effective when you modify and rebuild the backend and want to restart it without changing anything else:

docker-compose -f tools/docker-compose.yml up -d --no-deps es_backend
Mix and match docker compose

During development you can run the redis and database containers with docker-compose, and directly run the celery and flask applications with Python.

docker-compose -f tools/docker-compose.yml run es_db
docker-compose -f tools/docker-compose.yml run es_redis

Production deployment

Production deployment assumes a multi node Kubernetes cluster.

The entity service has been deployed to kubernetes clusters on Azure, GCE, minikube and AWS. The system has been designed to scale across multiple nodes and handle node failure without data loss.

Entity Service Kubernetes Deployment

At a high level the main custom components are:

  • ES App - a gunicorn/flask backend web service hosts the REST api
  • Entity Match Worker instances - uses celery for task scheduling

The components that are used in support are:

  • Postgresql database holds all match metadata
  • Redis is used for the celery job queue and as a cache
  • An object store (e.g. AWS S3, or Minio) stores the raw CLKs, intermediate files, and results.
  • nginx provides upload buffering and request rate limiting.
  • An ingress controller (e.g. nginx-ingress/traefik) provides TLS termination.

The rest of this document goes into how to deploy in a production setting.

Provision a Kubernetes cluster

Creating a Kubernetes cluster is out of scope for this documentation.

Hardware requirements

Recommended AWS worker instance type is r3.4xlarge - spot instances are fine as we handle node failure. The number of nodes depends on the size of the expected jobs, as well as the memory on each node. For testing we recommend starting with at least two nodes, with each node having at least 8 GiB of memory and 2 vCPUs.

Software to interact with the cluster

You will need to install the kubectl command line tool, and helm.

Install Helm

The entity service system has been packaged using helm; there is a client program that needs to be installed.

At the very least you will need to install tiller into the cluster:

helm init
Ingress Controller

We assume the cluster has an ingress controller; if this isn’t the case, first add one. We suggest using Traefik or NGINX Ingress Controller. Both can be installed using helm.

Deploy the system

Helm can be used to deploy the system to a kubernetes cluster.

From the deployment/entity-service directory pull the dependencies:

helm dependency update
Configuring the deployment

Create a new blank yaml file to hold your custom deployment settings, e.g. my-deployment.yaml. Carefully read through the default values.yaml file and override any values in your deployment configuration file.

At a minimum, consider setting up an ingress by changing api.ingress, changing the number of workers in workers.replicaCount (and possibly workers.highmemory.replicaCount), checking that you’re happy with the workers’ cpu and memory limits in workers.resources, and finally setting the credentials:

  • global.postgresql.postgresqlPassword
  • redis.password (and redis-ha.redisPassword if provisioning redis)
  • minio.accessKey and minio.secretKey
Configuration of the celery workers

Celery is highly configurable, and wrong configurations can lead to a number of runtime issues, from exhausting the number of connections the database can handle, to thread exhaustion blocking the underlying machine.

We thus recommend some sets of attributes, but note that every deployment is different and may require its own tweaking.

First observation: celery is not a good sharer of resources (cf. issue <https://github.com/data61/anonlink-entity-service/issues/410>). We thus recommend specifying a limit on the number of CPUs each worker can use, and setting the workers’ concurrency to match this limit. More help is provided directly in the values.yaml file.

Before Installation

Before installation, it is best practice to run some checks that helm provides. The first one is to execute:

helm lint -f extraValues.yaml

Note that it uses all the default deployment values provided in the values.yaml file, and overwrites them with the given values in extraValues.yaml. It should return some information if some values are missing, e.g.:

2019/09/11 15:13:10 [INFO] Missing required value: global.postgresql.postgresqlPassword must be provided.
2019/09/11 15:13:10 [INFO] Missing required value: minio.accessKey must be provided.
2019/09/11 15:13:10 [INFO] Missing required value: minio.secretKey must be provided.
==> Linting .
Lint OK

1 chart(s) linted, no failures
Notes:
  • the lint command does not exit with a non-zero exit code, and our templates are currently failing if linting with the option --strict.
  • if the folder Charts is not deleted, the linting may throw some errors from the dependent charts if a value is missing without clear description, e.g. if the redis password is missing, the following error is returned from the redis-ha template because the method b64enc requires a non empty string, but the template does not check first if the value is empty:

 ==> Linting .
[ERROR] templates/: render error in "entity-service/charts/redis-ha/templates/redis-auth-secret.yaml": template: entity-service/charts/redis-ha/templates/redis-auth-secret.yaml:10:35: executing "entity-service/charts/redis-ha/templates/redis-auth-secret.yaml" at <b64enc>: invalid value; expected string

Error: 1 chart(s) linted, 1 chart(s) failed

Then, it is advised to use the --dry-run and --debug options before deploying with helm, which will return the yaml descriptions of all the resources.

Installation

To install the whole system execute:

cd deployment
helm install entityservice --name="anonlink" --values my-deployment.yaml

This can take several minutes the first time you deploy to a new cluster.

Run integration tests and an end to end test

Update the server url by editing the jobs/integration-test-job.yaml file then create a new job on the cluster:

kubectl create -f jobs/integration-test-job.yaml
To view the celery monitor:

Note the monitor must be enabled at deployment. Find the pod that the celery monitor is running on then forward the port. For example:

$ kubectl get -n default pod --selector=run=celery-monitor -o jsonpath='{.items..metadata.name}'
entityservice-monitor-4045544268-s34zl

$ kubectl port-forward entityservice-monitor-4045544268-s34zl 8888:8888
Upgrade Deployment with Helm

Updating a running chart is usually straightforward. For example, if the release is called anonlink in namespace testing, execute the following to increase the number of workers to 20:

helm upgrade anonlink entity-service --namespace=testing --set workers.replicas="20"

However, note you may wish to instead keep all configurable values in a yaml file and track that in version control.

Minimal Deployment

To run with minikube for local testing we have provided a minimal-values.yaml file that will set very small resource limits. Install the minimal system with:

helm install entity-service --name="mini-es" --values entity-service/minimal-values.yaml
Database Deployment Options

At deployment time you must set the postgresql password in global.postgresql.postgresqlPassword.

You can decide to deploy a postgres database along with the anonlink entity service or instead use an existing database. To configure a deployment to use an external postgres database, simply set provision.postgresql to false, set the database server in postgresql.nameOverride, and add credentials to the global.postgresql section.

Object Store Deployment Options

At deployment time you can decide to deploy MINIO or instead use an existing service such as AWS S3.

Note that there is a trade off between using a local deployment of minio vs S3. In our AWS based experimentation Minio is noticeably faster, but more expensive and less reliable than AWS S3; your own mileage may vary.

To configure a deployment to use an external object store, set provision.minio to false and add appropriate connection configuration in the minio section. For example to use AWS S3 simply provide your access credentials (and disable provisioning minio):

helm install entity-service --name="es-s3" --set provision.minio=false --set minio.accessKey=XXX --set minio.secretKey=YYY --set minio.bucket=<bucket>
Redis Deployment Options

At deployment time you can decide to provision redis using our chart, or instead use an existing redis installation or managed service. The provisioned redis is a highly available 3 node redis cluster using the redis-ha helm chart. Both directly connecting to redis and discovery via the sentinel protocol are supported. When using the sentinel protocol for redis discovery, read-only requests are dispatched to redis replicas.

Carefully read the comments in the redis section of the default values.yaml file.

To use a separate install of redis using the server shared-redis-ha-redis-ha.default.svc.cluster.local:

helm install entity-service --name="es-shared-redis" \
     --set provision.redis=false \
     --set redis.server=shared-redis-ha-redis-ha.default.svc.cluster.local \
     --set redis.use_sentinel=true
Uninstalling

To uninstall a release called es in the default namespace:

helm del es

Or if the anonlink-entity-service has been installed into its own namespace you can simply delete the whole namespace with kubectl:

kubectl delete namespace miniestest

Deployment Risks

The purpose of this document is to record known deployment risks of the entity service and our mitigations. It references the OWASP 2017 Top 10 security risks - https://www.owasp.org/index.php/Top_10-2017_Top_10

Risks

User accesses unit record data

  • A1 - Injection
  • A3 - Sensitive Data Exposure

Unauthorized user accesses results

  • A6 - Security misconfiguration.
  • A2 - Broken authentication.
  • A5 - Broken access control.

Authorized user attacks the system

  • A10 - Insufficient Logging & Monitoring
  • A3 - Sensitive Data Exposure

An admin can access the raw CLKs uploaded by both parties; however a standard user cannot.

User coerces N1 to execute attacking code

  • Insecure deserialization.
  • Compromised shared host.

An underlying component has a vulnerability

  • Dependencies including anonlink could have vulnerabilities.

Development

Changelog

Next Version
Version 1.12.0
  • Logging configurable in the deployed entity service by using the key loggingCfg. (#448)

  • Several old settings have been removed from the default values.yaml and docker files, having been replaced by CHUNK_SIZE_AIM (#414): SMALL_COMPARISON_CHUNK_SIZE, LARGE_COMPARISON_CHUNK_SIZE, SMALL_JOB_SIZE and LARGE_JOB_SIZE.

  • Remove ENTITY_MATCH_THRESHOLD environment variable (#444)

  • Celery configuration updates to solve threads and memory leaks in deployment. (#427)

  • Update docker-compose files to use these new preferred configurations.

  • Update helm charts with preferred configuration; the default deployment is now a minimal working deployment.

  • New environment variables: CELERY_DB_MIN_CONNECTIONS, FLASK_DB_MIN_CONNECTIONS, CELERY_DB_MAX_CONNECTIONS and FLASK_DB_MAX_CONNECTIONS to configure the database connections pool. (#405)

  • Simplify access to the database from services relying on a single way to get a connection via a connection pool. (#405)

  • Deleting a run is now implemented. (#413)

  • Added some missing documentation about the output type groups (#449)

  • Sentinel name is configurable. (#436)

  • Improvements on the Kubernetes deployment test stage on Azure DevOps:
    - Re-order cleaning steps to first purge the deployment and then delete the remaining resources. (#426)
    - Run integration tests in parallel, reducing the pipeline stage Kubernetes deployment tests from 30 minutes to 15 minutes. (#438)
    - Tests running on a deployed entity-service on k8s create an artifact containing all the logs of all the containers, useful for debugging. (#445)
    - Test container not restarted on test failure. (#434)

  • Benchmark improvements:
    - Benchmark output has been modified to handle multi-party linkage.
    - Benchmark can handle more than 2 parties, repeat experiments, and push the results to the minio object store. (#406, #424 and #425)
    - Azure DevOps benchmark stage runs a 3 party linkage. (#433)
  • Improvements on Redis cache:
    - Refactor the cache. (#430)
    - Run state kept in cache (instead of fully relying on database). (#431 and #432)

  • Update dependencies:
    - anonlink to v0.12.5. (#423)
    - redis from 3.2.0 to 3.2.1. (#415)
    - alpine from 3.9 to 3.10.1. (#404)

  • Add some release documentation. (#455)

Version 1.11.2
  • Switch to Azure Devops pipeline for CI.
  • Switch to docker hub for container hosting.
Version 1.11.1
  • Include multiparty linkage tutorial/example.
  • Tightened up how we use a database connection from the flask app.
  • Deployment and logging documentation updates.
Version 1.11.0
  • Adds support for multiparty record linkage.
  • Logging is now configurable from a file.
Other improvements
  • Another tutorial for directly using the REST api was added.
  • K8s deployment updated to use 3.15.0 Postgres chart. Postgres configuration now uses a global namespace so subcharts can all use the same configuration as documented here.
  • Jenkins testing now fails if the benchmark exits incorrectly or if the benchmark results contain failed results.
  • Jenkins will now execute the tutorials notebooks and fail if any cells error.
Version 1.10.0
  • Updates Anonlink and switches to using Anonlink’s default format for serialization of similarity scores.
  • Sorts similarity scores before solving, improving accuracy.
  • Uses Anonlink’s new API for similarity score computation and solving.
  • Add support for using an external Postgres database.
  • Added optional support for redis discovery via the sentinel protocol.
  • Kubernetes deployment no longer includes a default postgres password. Ensure that you set your own postgresqlPassword.
  • The Kubernetes deployment documentation has been extended.
Version 1.9.4
  • Introduces configurable logging of HTTP headers.
  • Dependency issue resolved.
Version 1.9.3
  • Redis can now be used in highly available mode. Includes upstream fix where the redis sentinels crash.
  • The custom kubernetes certificate management templates have been removed.
  • Minor updates to the kubernetes resources. No longer using beta apis.
Version 1.9.2
  • 2 race conditions have been identified and fixed.
  • Integration tests are sped up and more focused. The test suite now fails after the first test failure.
  • Code tidy-ups to be more pep8 compliant.
Version 1.9.1
  • Adds support for (almost) arbitrary sized encodings. A minimum and maximum can be set at deployment time, and currently anonlink requires the size to be a multiple of 8.
  • Adds support for opentracing with Jaeger.
  • improvements to the benchmarking container
  • internal refactoring of tasks
Version 1.9.0
  • minio and redis services are now optional for kubernetes deployment.
  • Introduction of a high memory worker and associated task queue.
  • Fix issue where we could start tasks twice.
  • Structlog now used for celery workers.
  • CI now tests a kubernetes deployment.
  • Many Jenkins CI updates and fixes.
  • Updates to Jupyter notebooks and docs.
  • Updates to Python and Helm chart dependencies and docker base images.
Version 1.8.1

Improve system stability while handling large intermediate results. Intermediate results are now stored in files instead of in Redis. This permits us to stream them instead of loading everything into memory.

Version 1.8

Version 1.8 introduces breaking changes to the REST API to allow an analyst to reuse uploaded CLKs.

Instead of a linkage project only having one result, we introduce a new sub-resource runs. A project holds the schema and CLKs from all data providers; and multiple runs can be created with different parameters. A run has a status and a result endpoint. Runs can be queued before the CLK data has been uploaded.

We also introduced changes to the result types. The result type permutation, which was producing permutations and an encrypted mask, was removed. And the result type permutation_unencrypted_mask was renamed to permutations.

Brief summary of API changes:
  • the mapping endpoint has been renamed to projects
  • To carry out a linkage computation you must post to a project’s runs endpoint: /api/v1/project/<PROJECT_ID>/runs
  • Results are now accessed under the runs endpoint: /api/v1/project/<PROJECT_ID>/runs/<RUN_ID>/result
  • result type permutation_unencrypted_mask was renamed to permutations
  • result type permutation was removed

For all the updated API details check the Open API document.

Other improvements
  • The documentation is now served at the root.
  • The flower monitoring tool for celery is now included with the docker-compose deployment. Note this will be disabled for production deployment with kubernetes by default.
  • The docker containers have been migrated to alpine linux to be much leaner.
  • Substantial internal refactoring - especially of views.
  • Move to pytest for end to end tests.
Version 1.7.3

Deployment and documentation sprint.

  • Fixes a bug where only the top k results of a chunk were being requested from anonlink. #59 #84
  • Updates to helm deployment templates to support a single namespace having multiple entityservices. Helm charts are more standard, some config has moved into a configmap and an experimental cert-manager configuration option has been added. #83, #90
  • More sensible logging during testing.
  • Every http request now has a (globally configurable) timeout
  • Minor update regarding handling uploading empty CLKs. #92
  • Update to latest versions of anonlink and clkhash. #94
  • Documentation updates.
Version 1.7.2

Dependency and deployment updates. We now pin versions of Python, anonlink, clkhash, phe and docker images nginx and postgres.

Version 1.7.0

Added a view type that returns similarity scores of potential matches.

Version 1.6.8

Scalability sprint.

  • Much better chunking of work.
  • Security hardening by modifying the response from the server. Now there is no difference between an invalid token and an unknown resource - both return a 403 response status.
  • Mapping information includes the time it was started.
  • Update and add tests.
  • Update the deployment to use Helm.

Road map for the entity service

  • baseline benchmarking vs known datasets (accuracy and speed) e.g recordspeed datasets
  • blocking
  • Schema specification and tooling
  • Algorithmic improvements. e.g., implementing canopy clustering solver
  • A web front end including authentication and access control
  • Uploading multiple hashes per entity. Handle multiple schemas.
  • Check how we deal with missing information, old addresses etc
  • Semi supervised machine learning methods to learn thresholds
  • Handle 1 to many relationships. E.g. familial groups
  • Larger scale graph solving methods
  • Remove bottleneck of sparse links having to fit in redis.
  • improve uploads by allowing direct binary file transfer into object store
  • optimise anonlink memory management and C++ code

Bigger Projects

  • consider more than 2 organizations participating in one mapping
  • GPU implementation of core similarity scoring
  • somewhat homomorphic encryption could be used for similarity score
  • consider allowing users to upload raw PII

Releasing

Implementation Details

Components

The entity service is implemented in Python and comprises the following components:

  • A gunicorn/flask backend that implements the HTTP REST api.
  • Celery backend worker/s that do the actual work. This interfaces with the anonlink library.
  • An nginx frontend to reverse proxy the gunicorn/flask backend application.
  • A Minio object store that holds large files such as raw uploaded hashes and results.
  • A postgres database stores the linking metadata.
  • A redis task queue that interfaces between the flask app and the celery backend. Redis also acts as an ephemeral cache.

Each of these has been packaged as a docker image, however the use of external services (redis, postgres, minio) can be configured through environment variables. Multiple workers can be used to distribute the work beyond one machine - by default all cores will be used for computing similarity scores and encrypting the mask vector.

Redis

Redis is used as the default message broker for celery as well as a cross-container in memory cache.

Redis key/values used directly by the Anonlink Entity Service:

Key                        Redis Type   Description
“entityservice-status”     String       pickled status
“run:{run_id}”             Hash         run info
“clk-pkl-{dp_id}”          String       pickled encodings
Redis Cache: Run Info

The run info HASH stores:

  • similarity scoring progress for each run under "progress"
  • run state under "state", current valid states are {active, complete, deleted}. See backend/entityservice/cache/active_runs.py for implementation.
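
For debugging, the run info hash can be inspected directly with redis-py. This is a sketch only (the connection details and run identifier are placeholders); normal clients never need to touch redis directly:

# Sketch for debugging only; host, password and run_id are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379, password="...")
run_id = "example-run-id"
info = r.hgetall(f"run:{run_id}")     # key name as listed in the table above
print(info.get(b"state"), info.get(b"progress"))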

Continuous Integration Testing

We test the service using Jenkins. Every pull request gets deployed in the local configuration using Docker Compose, as well as in the production deployment to kubernetes.

At a high level the testing covers:

  • building the docker containers
  • deploying using Docker Compose
  • testing that the tutorial notebooks don’t error
  • running the integration tests against the local deployment
  • running a benchmark suite against the local deployment
  • building and packaging the documentation
  • publishing the containers to quay.io
  • deploying to kubernetes
  • running the integration tests against the kubernetes deployment

All of this is orchestrated using the jenkins pipeline script at Jenkinsfile.groovy. There is one custom library, n1-pipeline, a collection of helpers that we created for common jenkins tasks.

The integration tests currently take around 30 minutes.

Testing Local Deployment

The docker compose file tools/ci.yml is deployed along with tools/docker-compose.yml. This simply defines an additional container (from the same backend image) which runs the integration tests after a short delay.

The logs from the various containers (nginx, backend, worker, database) are all collected, archived and are made available in the Jenkins UI for introspection.

Testing K8s Deployment

The kubernetes deployment uses helm with the template found in deployment/entity-service. Jenkins additionally defines the docker image versions to use and ensures an ingress is not provisioned. The deployment is configured to be quite conservative in terms of cluster resources. Currently this logic all resides in Jenkinsfile.groovy.

The k8s deployment test is limited to 30 minutes and an effort is made to clean up all created resources.

After a few minutes waiting for the deployment a Kubernetes Job is created using kubectl create.

This job includes a 1GiB persistent volume claim to which the results are written (as results.xml). During the testing the pytest output will be rendered in jenkins, and then the Job’s pod terminates. We create a temporary pod which mounts the same results volume and then we copy across the produced artifact for rendering in Jenkins. This dance is only necessary to retrieve files from the cluster to our Jenkins instance; it would be straightforward if we only wanted the stdout from each pod/job.

Devops

Continuous Integration

Azure DevOps

anonlink-entity-service is automatically built and tested using Azure DevOps in the project Anonlink <https://dev.azure.com/data61/Anonlink>.

It consists only of a build pipeline <https://dev.azure.com/data61/Anonlink/_build?definitionId=1>.

The build pipeline is defined in the script azure-pipelines.yml which uses resources from the folder .azurePipeline.

The continuous integration stages are:

  • building and pushing the following docker images:
    - the frontend data61/anonlink-nginx
    - the backend data61/anonlink-app
    - the tutorials data61/anonlink-docs-tutorials (used to test the tutorial Python Notebooks)
    - the benchmark data61/anonlink-benchmark (used to run the benchmark)
  • runs the benchmark using docker-compose and publishes the results as an artifact in Azure
  • runs the tutorial tests using docker-compose and publishes the results in Azure
  • runs the integration tests by deploying the whole service on Kubernetes, running the integration tests and publishing the results in Azure. It also publishes the pods logs if some tests failed.

The build pipeline is triggered for every push on every branch. It is not triggered by Pull Requests to avoid duplicate testing and building potentially untrusted external code.

The build pipeline requires two environment variables provided by Azure environment:

  • dockerHubId: username for the pipeline to push images to Data61 dockerhub
  • dockerHubPassword: password for the corresponding username (this is a secret variable).

It also requires a connection to a k8s cluster to be configured.

Benchmarking

In the benchmarking folder is a benchmarking script and associated Dockerfile. The docker image is published at https://quay.io/repository/n1analytics/entity-benchmark

The container/script is configured via environment variables.

  • SERVER: (required) the url of the server.
  • EXPERIMENT: json file containing a list of experiments to run. Schema of experiments is defined in ./schema/experiments.json.
  • DATA_PATH: path to a directory to store test data (useful to cache).
  • RESULTS_PATH: full filename to write the results file.
  • SCHEMA: path to the linkage schema file used when creating projects. If not provided it is assumed to be in the data directory.
  • TIMEOUT: this timeout defines the time to wait for the result of a run in seconds. Default is 1200 (20min).

Run Benchmarking Container

Run the container directly with docker - substituting configuration information as required:

docker run -it \
    -e SERVER=https://testing.es.data61.xyz \
    -e RESULTS_PATH=/app/results.json \
    quay.io/n1analytics/entity-benchmark:latest

By default the container will pull synthetic datasets from an S3 bucket and run default benchmark experiments against the configured SERVER. The default experiments (listed below) are set in benchmarking/default-experiments.json.

The output will be printed and saved to a file pointed to by RESULTS_PATH (e.g. to /app/results.json).

Cache Volume

For speeding up benchmarking when running multiple times you may wish to mount a volume at the DATA_PATH to store the downloaded test data. Note the container runs as user 1000, so any mounted volume must be readable and writable by that user. To create a volume using docker:

docker volume create linkage-benchmark-data

To copy data from a local directory and change owner:

docker run --rm -v `pwd`:/src \
    -v linkage-benchmark-data:/data busybox \
    sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"

To run the benchmarks using the cache volume:

docker run \
    --name ${benchmarkContainerName} \
    --network ${networkName} \
    -e SERVER=${localserver} \
    -e DATA_PATH=/cache \
    -e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
    -e RESULTS_PATH=/app/results.json \
    --mount source=linkage-benchmark-data,target=/cache \
    quay.io/n1analytics/entity-benchmark:latest

Experiments

Experiments to run can be configured as a simple json document. The default is:

[
  {
    "sizes": ["100K", "100K"],
    "threshold": 0.95
  },
  {
    "sizes": ["100K", "100K"],
    "threshold": 0.80
  },
  {
    "sizes": ["100K", "1M"],
    "threshold": 0.95
  }
]

The schema of the experiments can be found in benchmarking/schema/experiments.json.

Logging

The entity service uses the standard Python logging library for logging.

The following named loggers are used:

  • entityservice
    - entityservice.views
    - entityservice.models
    - entityservice.database
  • celery.es
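
Application code obtains these loggers by name from the standard library, so any extra logging configuration only needs to reference the names above. A minimal sketch:

# Minimal sketch; the service itself loads its logging configuration from
# the file referenced by LOG_CFG (see below).
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('entityservice.views')
logger.info('Checking credentials')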

The following environment variables affect logging:

  • LOG_CFG - sets the path to a logging configuration file. There are two examples: entityservice/default_logging.yaml and entityservice/verbose_logging.yaml.
  • DEBUG - sets the logging level to debug for all application code.
  • LOGFILE - directs the log output to this file instead of stdout.
  • LOG_HTTP_HEADER_FIELDS - HTTP headers to include in the application logs.

Example logging output with LOG_HTTP_HEADER_FIELDS=User-Agent,Host:

[2019-02-02 23:17:23 +0000] [10] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=6c2a3730
[2019-02-02 23:17:23 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [12] [INFO] 0 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=a7e2554a
[2019-02-02 23:17:23 +0000] [11] [INFO] Receiving CLK data.            [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Storing user 25895 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:23 +0000] [11] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25895 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=d61c3138
[2019-02-02 23:17:24 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [12] [INFO] 1 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=c13ecc77
[2019-02-02 23:17:24 +0000] [10] [INFO] Receiving CLK data.            [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Storing user 25896 supplied clks from json [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Received 100 encodings. Uploading 16.89 KiB to object store [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Adding metadata on encoded entities to database [entityservice.database.insertions] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:24 +0000] [10] [INFO] Job scheduled to handle user uploaded hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 dp_id=25896 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=352c4409
[2019-02-02 23:17:25 +0000] [12] [INFO] Getting detail for a project   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Checking credentials           [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] 2 parties have contributed hashes [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=6408f4ceb90e25cdf910b00daff3dcf23e4c891c1cfa2383 request=8e67e62a
[2019-02-02 23:17:25 +0000] [12] [INFO] Adding new project to database [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=df791527
[2019-02-02 23:17:26 +0000] [12] [INFO] request description of a run   [entityservice.views.run.description] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Requested project or run resource with invalid identifier token [entityservice.views.auth_checks] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=bf5b2544 rid=invalid
[2019-02-02 23:17:26 +0000] [12] [INFO] Request to delete project      [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Marking project for deletion   [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9
[2019-02-02 23:17:26 +0000] [12] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] Host=nginx User-Agent=python-requests/2.18.4 pid=7f302255ff3e2ce78273a390997f38ba8979965043c23581 request=d5b766a9

With DEBUG enabled there are a lot of logs from the backend and workers:

[2019-02-02 23:14:47 +0000] [10] [INFO] Marking project for deletion   [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:47 +0000] [10] [DEBUG] Trying to connect to postgres db [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [DEBUG] Database connection established [entityservice.database.util] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [10] [INFO] Queuing authorized request to delete project resources [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=31a6449e
[2019-02-02 23:14:48 +0000] [9] [INFO] Request to delete project      [entityservice.views.project] User-Agent=python-requests/2.18.4 pid=bd0e0cf51a979f78ad8912758f20cc05d0d9129ab0f3552f request=5486c153

Tracing

  • TRACING_HOST
  • TRACING_PORT