Entity Service Permutation Output¶
This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, Alice and Bob, each have a dataset of personally identifiable information (PII) describing a number of entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the Analyst.
The chosen output type is permutations, which consists of two permutations and one mask.
Who learns what?¶
After the linkage has been carried out, Alice and Bob will be able to retrieve a permutation - a reordering of their respective data sets such that shared entities line up.
The Analyst - who creates the linkage project - learns the mask. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note that this reveals how many entities are shared.
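To make the permutations output concrete, here is a small illustrative sketch. It uses toy data only and does not involve the service; the names alice_rows, bob_rows and apply_permutation are invented for this example. It shows how two permutations and a mask align shared entities:

# Toy example: three records per party. In practice the permutations and
# mask come from the entity service; here we invent small ones.
alice_rows = ['alice-0', 'alice-1', 'alice-2']
bob_rows = ['bob-0', 'bob-1', 'bob-2']

alice_permutation = [1, 0, 2]   # new position for each of Alice's rows
bob_permutation = [2, 0, 1]     # new position for each of Bob's rows
mask = [1, 1, 0]                # 1 where the permuted rows refer to the same entity

def apply_permutation(rows, permutation):
    """Place the i-th row at position permutation[i]."""
    out = [None] * len(rows)
    for row, new_position in zip(rows, permutation):
        out[new_position] = row
    return out

for a, b, matched in zip(apply_permutation(alice_rows, alice_permutation),
                         apply_permutation(bob_rows, bob_permutation),
                         mask):
    print(a, '<->' if matched else ' x ', b)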
Steps¶
These steps are usually run by different companies - but for illustration they are all carried out in this one notebook. The participants providing data are Alice and Bob, with the Analyst acting as the integration authority.
- Check connection to Entity Service
- Data preparation
- Write CSV files with PII
- Create a Linkage Schema
- Create Linkage Project
- Generate CLKs from PII
- Upload the PII
- Create a run
- Retrieve and analyse results
Check Connection¶
If you’re connecting to a custom entity service, change the address here.
[1]:
import os
url = os.getenv("SERVER", "https://anonlink.easd.data61.xyz")
print(f'Testing anonlink-entity-service hosted at {url}')
Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz
[2]:
!clkutil status --server "{url}"
{"project_count": 7050, "rate": 2824020, "status": "ok"}
Data preparation¶
Following the clkhash tutorial we will use a dataset from the recordlinkage
library. We will just write both datasets out to temporary CSV files.
[3]:
from tempfile import NamedTemporaryFile
from recordlinkage.datasets import load_febrl4
[4]:
dfA, dfB = load_febrl4()
a_csv = NamedTemporaryFile('w')
a_clks = NamedTemporaryFile('w', suffix='.json')
dfA.to_csv(a_csv)
a_csv.seek(0)
b_csv = NamedTemporaryFile('w')
b_clks = NamedTemporaryFile('w', suffix='.json')
dfB.to_csv(b_csv)
b_csv.seek(0)
dfA.head(3)
[4]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
Schema Preparation¶
The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[5]:
schema = NamedTemporaryFile('wt')
[6]:
%%writefile {schema.name}
{
"version": 3,
"clkConfig": {
"l": 1024,
"xor_folds": 0,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "surname",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "street_number",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "address_1",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "address_2",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "suburb",
"format": {
"type": "string",
"encoding": "utf-8"
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "postcode",
"format": {
"type": "integer",
"minimum": 100,
"maximum": 9999
},
"hashing": {
"strategy": {
"bitsPerToken": 15
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "state",
"format": {
"type": "string",
"encoding": "utf-8",
"maxLength": 3
},
"hashing": {
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 2,
"positional": false
}
}
},
{
"identifier": "date_of_birth",
"format": {
"type": "integer"
},
"hashing": {
"missingValue": {
"sentinel": ""
},
"strategy": {
"bitsPerToken": 30
},
"hash": {
"type": "doubleHash"
},
"comparison": {
"type": "ngram",
"n": 1,
"positional": true
}
}
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Overwriting /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp3jpcxxrs
Create Linkage Project¶
The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.
[7]:
creds = NamedTemporaryFile('wt')
print("Credentials will be saved in", creds.name)
!clkutil create-project \
--schema "{schema.name}" \
--output "{creds.name}" \
--type "permutations" \
--server "{url}"
creds.seek(0)
import json
with open(creds.name, 'r') as f:
credentials = json.load(f)
project_id = credentials['project_id']
credentials
Credentials will be saved in /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp_tz_feve
Project created
[7]:
{'project_id': '7c942add9259b0c61fc06ce24afc6ee9c99355cc5a5eae7a',
'result_token': '4552074bebabf66a19e707ef64aa35638fc1eb2cd3b9a768',
'update_tokens': ['1045c9dda873d3cccf37181bcff7c61a5e82c6051d0da2c0',
'fc27160c4e4736c1dbbecbedd6bc5e4117a3626c1f2eda9c']}
Note: the analyst will need to pass on the project_id (the id of the linkage project) and one of the two update_tokens to each data provider.
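For example, the analyst might split the credentials along these lines before handing them out (a sketch only - the mechanism for actually sharing the tokens is up to the participants; the result_token stays with the analyst):

alice_credentials = {'project_id': credentials['project_id'],
                     'update_token': credentials['update_tokens'][0]}
bob_credentials = {'project_id': credentials['project_id'],
                   'update_token': credentials['update_tokens'][1]}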
Hash and Upload¶
At the moment both data providers have raw personally identifiable information. We first have to generate CLKs from the raw entity information. For this we need:
- the clkhash library
- the linkage schema from above
- a secret which is only known to Alice and Bob (here: my_secret)
Please see clkhash documentation for further details on this.
[8]:
!clkutil hash "{a_csv.name}" my_secret "{schema.name}" "{a_clks.name}"
!clkutil hash "{b_csv.name}" my_secret "{schema.name}" "{b_clks.name}"
CLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmppybfm62c.json
CLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpu4jx4mjv.json
Now the two clients can upload their data, providing the appropriate upload tokens and the project_id. As with all commands in clkhash we can output help:
[9]:
!clkutil upload --help
Usage: clkutil upload [OPTIONS] CLK_JSON
Upload CLK data to entity matching server.
Given a json file containing hashed clk data as CLK_JSON, upload to the
entity resolution service.
Use "-" to read from stdin.
Options:
--project TEXT Project identifier
--apikey TEXT Authentication API key for the server.
-o, --output FILENAME
--server TEXT Server address including protocol. Default
https://anonlink.easd.data61.xyz.
--retry-multiplier INTEGER <milliseconds> If receives a 503 from
server, minimum waiting time before
retrying. Default 100.
--retry-exponential-max INTEGER
<milliseconds> If receives a 503 from
server, maximum time interval between
retries. Default 10000.
--retry-max-time INTEGER <milliseconds> If receives a 503 from
server, retry only within this period.
Default 20000.
-v, --verbose Script is more talkative
--help Show this message and exit.
Alice uploads her data¶
[10]:
with NamedTemporaryFile('wt') as f:
!clkutil upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][0]}" \
--server "{url}" \
--output "{f.name}" \
"{a_clks.name}"
res = json.load(open(f.name))
alice_receipt_token = res['receipt_token']
Every upload gets a receipt token. This token is required to access the results.
Bob uploads his data¶
[11]:
with NamedTemporaryFile('wt') as f:
!clkutil upload \
--project="{project_id}" \
--apikey="{credentials['update_tokens'][1]}" \
--server "{url}" \
--output "{f.name}" \
"{b_clks.name}"
bob_receipt_token = json.load(open(f.name))['receipt_token']
Create a run¶
Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy preserving record linkage. Try with a few different threshold values:
[12]:
with NamedTemporaryFile('wt') as f:
!clkutil create \
--project="{project_id}" \
--apikey="{credentials['result_token']}" \
--server "{url}" \
--threshold 0.85 \
--output "{f.name}"
run_id = json.load(open(f.name))['run_id']
Results¶
Now, after some delay (depending on the size of the datasets), we can fetch the mask. This can be done with clkutil:
!clkutil results --server "{url}" \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" --output results.txt
However, for this tutorial we are going to use the Python requests library:
[13]:
import requests
from clkhash.rest_client import RestClient
from clkhash.rest_client import format_run_status
from IPython.display import clear_output
[14]:
rest_client = RestClient(url)
for update in rest_client.watch_run_status(project_id, run_id, credentials['result_token'], timeout=300):
clear_output(wait=True)
print(format_run_status(update))
State: completed
Stage (3/3): compute output
[15]:
results = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': credentials['result_token']}).json()
[16]:
mask = results['mask']
This mask is a boolean array that specifies where rows of permuted data line up.
[17]:
print(mask[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The number of 1s in the mask will tell us how many matches were found.
[18]:
sum([1 for m in mask if m == 1])
[18]:
4851
We also use requests to fetch the permutations for each data provider:
[19]:
alice_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': alice_receipt_token}).json()
bob_res = requests.get('{}/api/v1/projects/{}/runs/{}/result'.format(url, project_id, run_id), headers={'Authorization': bob_receipt_token}).json()
Now Alice and Bob both have a new permutation - a new ordering for their data.
[20]:
alice_permutation = alice_res['permutation']
alice_permutation[:10]
[20]:
[3645, 1068, 4371, 465, 1533, 987, 343, 53, 3298, 2515]
This permutation says the first row of Alice’s data should be moved to position 3645.
[21]:
bob_permutation = bob_res['permutation']
bob_permutation[:10]
[21]:
[3857, 4827, 3267, 4934, 1958, 3682, 4576, 4895, 4867, 1188]
[22]:
def reorder(items, order):
    """
    Reorder items by placing the i-th item at position order[i].
    """
    neworder = items.copy()
    for item, newpos in zip(items, order):
        neworder[newpos] = item
    return neworder
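As a quick sanity check of the semantics (toy values, not from the tutorial's data), the i-th item ends up at position order[i]:

assert reorder(['x', 'y', 'z'], [2, 0, 1]) == ['y', 'z', 'x']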
[23]:
with open(a_csv.name, 'r') as f:
alice_raw = f.readlines()[1:]
alice_reordered = reorder(alice_raw, alice_permutation)
with open(b_csv.name, 'r') as f:
bob_raw = f.readlines()[1:]
bob_reordered = reorder(bob_raw, bob_permutation)
Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don’t.
[24]:
alice_reordered[:10]
[24]:
['rec-3302-org,blaize,koopman,17,allison place,aldersyde estate,balwyn north,4650,nsw,19110608,7823755\n',
'rec-1385-org,joel,bishop,10,french street,cedarview,orange,3223,nt,,1324854\n',
'rec-190-org,,alias,24,elkington street,pangani,isle of capri,2145,sa,19650429,8261472\n',
'rec-4781-org,jacob,waller,89,dalley crescent,the willows,mosman,2480,qld,19580408,6317326\n',
'rec-4881-org,alexandra,nguyen,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\n',
'rec-4770-org,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2250,wa,19801011,9351309\n',
'rec-3385-org,shanaye,carbone,41,haystack crescent,st vincents hospital,matong,3690,nsw,19300519,1632237\n',
'rec-3738-org,imogen,carlington,45,mcinnes street,parish talowahl,girilambone,2154,nsw,19781117,7912921\n',
'rec-831-org,laura,flannery,54,sid barnes crescent,weemilah,winston hills,5073,qld,19581023,9712180\n',
'rec-815-org,holly,campbell,21,casey crescent,nestor,westmead,4573,qld,19911007,4424335\n']
[25]:
bob_reordered[:10]
[25]:
['rec-3302-dup-0,blaize,koopman,17,allison place,aldersydeestate,balwyn north,4650,nsw,19110608,7823755\n',
'rec-1385-dup-0,elton,bishop,10,french street,,orange,3223,nt,,1324854\n',
'rec-190-dup-0,,alias,24,elkington street,panganu,isle of capri,2145,sa,19650429,8261472\n',
'rec-4781-dup-0,jacob,waliler,89,dalley crescent,the ui llows,mosman,2487,qld,19580408,6317326\n',
'rec-4881-dup-0,nguyen,alexandra,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\n',
'rec-4770-dup-0,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2550,nsw,19801011,9351309\n',
'rec-3385-dup-0,shanaye,lonto,41,haystack crescent,,leetob,3680,nsw,19300519,1632237\n',
'rec-3738-dup-0,imogen,carlington,45,mcinnes treet,parish talowahl,girilabmone,2154,nsw,19781117,7912921\n',
'rec-831-dup-0,laura,flannery,54,sid barnes crescent,,winstonhills,5073,qld,19581023,9712180\n',
'rec-815-dup-0,holyl,campbell,21,casey crescent,,westmead,4573,qld,19911007,4424335\n']
Accuracy¶
To compute how well the matching went we will use the record identifiers (the first column) as our reference.
For example, rec-1396-org is the original record which has a match in rec-1396-dup-0. To satisfy ourselves we can preview the first few supposed matches:
[26]:
for i, m in enumerate(mask[:10]):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
name_a = ' '.join(entity_a[1:3]).title()
name_b = ' '.join(entity_b[1:3]).title()
print("{} ({})".format(name_a, entity_a[0]), '=?', "{} ({})".format(name_b, entity_b[0]))
Blaize Koopman (rec-3302-org) =? Blaize Koopman (rec-3302-dup-0)
Joel Bishop (rec-1385-org) =? Elton Bishop (rec-1385-dup-0)
Alias (rec-190-org) =? Alias (rec-190-dup-0)
Jacob Waller (rec-4781-org) =? Jacob Waliler (rec-4781-dup-0)
Alexandra Nguyen (rec-4881-org) =? Nguyen Alexandra (rec-4881-dup-0)
Tegan Rosendale (rec-4770-org) =? Tegan Rosendale (rec-4770-dup-0)
Shanaye Carbone (rec-3385-org) =? Shanaye Lonto (rec-3385-dup-0)
Imogen Carlington (rec-3738-org) =? Imogen Carlington (rec-3738-dup-0)
Laura Flannery (rec-831-org) =? Laura Flannery (rec-831-dup-0)
Holly Campbell (rec-815-org) =? Holyl Campbell (rec-815-dup-0)
Metrics¶
If you know the ground truth — the correct mapping between the two datasets — you can compute performance metrics of the linkage.
Precision: the percentage of actual matches out of all found matches (tp/(tp+fp)).
Recall: how many of the actual matches have we found? (tp/(tp+fn))
[27]:
tp = 0
fp = 0
for i, m in enumerate(mask):
if m:
entity_a = alice_reordered[i].split(',')
entity_b = bob_reordered[i].split(',')
if entity_a[0].split('-')[1] == entity_b[0].split('-')[1]:
tp += 1
else:
fp += 1
#print('False positive:',' '.join(entity_a[1:3]).title(), '?', ' '.join(entity_b[1:3]).title(), entity_a[-1] == entity_b[-1])
print("Found {} correct matches out of 5000. Incorrectly linked {} matches.".format(tp, fp))
precision = tp/(tp+fp)
recall = tp/5000
print("Precision: {:.1f}%".format(100*precision))
print("Recall: {:.1f}%".format(100*recall))
Found 4851 correct matches out of 5000. Incorrectly linked 0 matches.
Precision: 100.0%
Recall: 97.0%
[28]:
# Deleting the project
!clkutil delete-project \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--server="{url}"
Project deleted