{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Entity Service Permutation Output\n", "\n", "This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, _Alice_ and _Bob_, each have a dataset of personally identifiable information (PII) about several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the _Analyst_.\n", "\n", "The chosen output type is `permutations`, which consists of two permutations and one mask.\n", "\n", "\n", "### Who learns what?\n", "\n", "After the linkage has been carried out, Alice and Bob will each be able to retrieve a `permutation` - a reordering of their respective data sets such that shared entities line up.\n", "\n", "The Analyst - who creates the linkage project - learns the `mask`. The mask is a binary vector that indicates which rows in the permuted data sets are aligned. Note that this reveals how many entities are shared.\n", "\n", "\n", "### Steps\n", "These steps are usually run by different companies, but for illustration everything is carried out in this one file. The participants providing data are _Alice_ and _Bob_, and the _Analyst_ acts as the integration authority.\n", "\n", "* [Check connection to Entity Service](#check_con)\n", "* [Data preparation](#data_prep)\n", "    * Write CSV files with PII\n", "    * [Create a Linkage Schema](#schema_prep)\n", "* [Create Linkage Project](#create_pro)\n", "* [Generate CLKs from PII](#hash_n_up)\n", "* [Upload the CLKs](#hash_n_up)\n", "* [Create a run](#create_run)\n", "* [Retrieve and analyse results](#results)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": "\u003ca id\u003d\"check_con\"\u003e\u003c/a\u003e\n## Check Connection\n\n\u003e If you\u0027re connecting to a custom entity service, change the address here." 
}, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Testing anonlink-entity-service hosted at https://testing.es.data61.xyz\n" ], "output_type": "stream" } ], "source": "import os\nurl \u003d os.getenv(\"SERVER\", \"https://testing.es.data61.xyz\")\nprint(f\u0027Testing anonlink-entity-service hosted at {url}\u0027)" }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "{\"project_count\": 2109, \"rate\": 8216626, \"status\": \"ok\"}\r\n" ], "output_type": "stream" } ], "source": [ "!clkutil status --server \"{url}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": "\u003ca id\u003d\"data_prep\"\u003e\u003c/a\u003e\n## Data preparation\n\nFollowing the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. 
We will just write both datasets out to temporary CSV files.\n" }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from tempfile import NamedTemporaryFile\n", "from recordlinkage.datasets import load_febrl4" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": " given_name surname street_number address_1 \\\nrec_id \nrec-1070-org michaela neumann 8 stanley street \nrec-1016-org courtney painter 12 pinkerton circuit \nrec-4405-org charles green 38 salkauskas crescent \n\n address_2 suburb postcode state date_of_birth \\\nrec_id \nrec-1070-org miami winston hills 4223 nsw 19151111 \nrec-1016-org bega flats richlands 4560 vic 19161214 \nrec-4405-org kela dapto 4566 nsw 19480930 \n\n soc_sec_id \nrec_id \nrec-1070-org 5304218 \nrec-1016-org 4066625 \nrec-4405-org 4365168 ", "text/html": "\u003cdiv\u003e\n\u003cstyle scoped\u003e\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n\u003c/style\u003e\n\u003ctable border\u003d\"1\" class\u003d\"dataframe\"\u003e\n \u003cthead\u003e\n \u003ctr style\u003d\"text-align: right;\"\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003egiven_name\u003c/th\u003e\n \u003cth\u003esurname\u003c/th\u003e\n \u003cth\u003estreet_number\u003c/th\u003e\n \u003cth\u003eaddress_1\u003c/th\u003e\n \u003cth\u003eaddress_2\u003c/th\u003e\n \u003cth\u003esuburb\u003c/th\u003e\n \u003cth\u003epostcode\u003c/th\u003e\n \u003cth\u003estate\u003c/th\u003e\n \u003cth\u003edate_of_birth\u003c/th\u003e\n \u003cth\u003esoc_sec_id\u003c/th\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth\u003erec_id\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n 
\u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003cth\u003e\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003cth\u003erec-1070-org\u003c/th\u003e\n \u003ctd\u003emichaela\u003c/td\u003e\n \u003ctd\u003eneumann\u003c/td\u003e\n \u003ctd\u003e8\u003c/td\u003e\n \u003ctd\u003estanley street\u003c/td\u003e\n \u003ctd\u003emiami\u003c/td\u003e\n \u003ctd\u003ewinston hills\u003c/td\u003e\n \u003ctd\u003e4223\u003c/td\u003e\n \u003ctd\u003ensw\u003c/td\u003e\n \u003ctd\u003e19151111\u003c/td\u003e\n \u003ctd\u003e5304218\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth\u003erec-1016-org\u003c/th\u003e\n \u003ctd\u003ecourtney\u003c/td\u003e\n \u003ctd\u003epainter\u003c/td\u003e\n \u003ctd\u003e12\u003c/td\u003e\n \u003ctd\u003epinkerton circuit\u003c/td\u003e\n \u003ctd\u003ebega flats\u003c/td\u003e\n \u003ctd\u003erichlands\u003c/td\u003e\n \u003ctd\u003e4560\u003c/td\u003e\n \u003ctd\u003evic\u003c/td\u003e\n \u003ctd\u003e19161214\u003c/td\u003e\n \u003ctd\u003e4066625\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth\u003erec-4405-org\u003c/th\u003e\n \u003ctd\u003echarles\u003c/td\u003e\n \u003ctd\u003egreen\u003c/td\u003e\n \u003ctd\u003e38\u003c/td\u003e\n \u003ctd\u003esalkauskas crescent\u003c/td\u003e\n \u003ctd\u003ekela\u003c/td\u003e\n \u003ctd\u003edapto\u003c/td\u003e\n \u003ctd\u003e4566\u003c/td\u003e\n \u003ctd\u003ensw\u003c/td\u003e\n \u003ctd\u003e19480930\u003c/td\u003e\n \u003ctd\u003e4365168\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e" }, "metadata": {}, "output_type": "execute_result", "execution_count": 4 } ], "source": "dfA, dfB \u003d load_febrl4()\n\na_csv \u003d NamedTemporaryFile(\u0027w\u0027)\na_clks \u003d NamedTemporaryFile(\u0027w\u0027, 
suffix\u003d\u0027.json\u0027)\ndfA.to_csv(a_csv)\na_csv.seek(0)\n\nb_csv \u003d NamedTemporaryFile(\u0027w\u0027)\nb_clks \u003d NamedTemporaryFile(\u0027w\u0027, suffix\u003d\u0027.json\u0027)\ndfB.to_csv(b_csv)\nb_csv.seek(0)\n\ndfA.head(3)\n" }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": "\u003ca id\u003d\"schema_prep\"\u003e\u003c/a\u003e\n## Schema Preparation\n\nThe linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the [API docs](http://clkhash.readthedocs.io/en/latest/schema.html). We will ignore the columns `rec_id` and `soc_sec_id` for CLK generation." }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "schema \u003d NamedTemporaryFile(\u0027wt\u0027)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Overwriting /tmp/tmpu8y0vxd4\n" ], "output_type": "stream" } ], "source": [ "%%writefile {schema.name}\n", "{\n", " \"version\": 1,\n", " \"clkConfig\": {\n", " \"l\": 1024,\n", " \"k\": 30,\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"kdf\": {\n", " \"type\": \"HKDF\",\n", " \"hash\": \"SHA256\",\n", " \"info\": \"c2NoZW1hX2V4YW1wbGU\u003d\",\n", " \"salt\": \"SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA\u003d\u003d\",\n", " \"keySize\": 64\n", " }\n", " },\n", " \"features\": [\n", " {\n", " \"identifier\": \"rec_id\",\n", " \"ignored\": true\n", " },\n", " {\n", " \"identifier\": \"given_name\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"surname\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"street_number\",\n", " \"format\": { \"type\": \"integer\" },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 0.5, \"missingValue\": {\"sentinel\": \"\"} }\n", " },\n", " {\n", " \"identifier\": \"address_1\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 0.5 }\n", " },\n", " {\n", " \"identifier\": \"address_2\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 0.5 }\n", " },\n", " {\n", " \"identifier\": \"suburb\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 0.5 }\n", " },\n", " {\n", " \"identifier\": \"postcode\",\n", " \"format\": { \"type\": \"integer\", \"minimum\": 100, \"maximum\": 9999 },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 0.5 }\n", " },\n", " {\n", " \"identifier\": \"state\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\", \"maxLength\": 3 },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"date_of_birth\",\n", " \"format\": { \"type\": \"integer\" },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 1, \"missingValue\": {\"sentinel\": \"\"} }\n", " },\n", " {\n", " \"identifier\": \"soc_sec_id\",\n", " \"ignored\": true\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "\u003ca id\u003d\"create_pro\"\u003e\u003c/a\u003e\n", "## Create Linkage Project\n", "\n", "The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Credentials will be saved in /tmp/tmpngtrvblo\n", "\u001b[31mProject created\u001b[0m\r\n" ], "output_type": "stream" }, { "data": { "text/plain": "{\u0027project_id\u0027: \u0027539a612e09bbac7fc5178f7798e15dfc310bc06878ff25fe\u0027,\n \u0027result_token\u0027: \u00272a52a9729facd2fd4e547b8029697e3ab7a464c32f3ada7e\u0027,\n \u0027update_tokens\u0027: [\u002747f701f76e06e2283f68dfddfb15da4b56bb05a43d6c5acb\u0027,\n \u00270b2228ff49ef9caeb29744f9ce97b39280873919a60a8765\u0027]}" }, "metadata": {}, "output_type": "execute_result", "execution_count": 7 } ], "source": [ "creds \u003d NamedTemporaryFile(\u0027wt\u0027)\n", "print(\"Credentials will be saved in\", creds.name)\n", "\n", "!clkutil create-project --schema \"{schema.name}\" --output \"{creds.name}\" --type \"permutations\" --server \"{url}\"\n", "creds.seek(0)\n", "\n", "import json\n", "with open(creds.name, \u0027r\u0027) as f:\n", "    credentials \u003d json.load(f)\n", "\n", "project_id \u003d credentials[\u0027project_id\u0027]\n", "credentials" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.\n", "\n", "\u003ca id\u003d\"hash_n_up\"\u003e\u003c/a\u003e\n", "## Hash and Upload\n", "\n", "At the moment both data providers have *raw* personally identifiable information (PII). We first have to generate CLKs from the raw entity information. We need:\n", "- the *clkhash* library\n", "- the linkage schema from above\n", "- two secret passwords known only to Alice and Bob (here: `horse` and `staple`)\n", "\n", "Please see the [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "\rgenerating CLKs: 0%| | 0.00/5.00k [00:00\u003c?, ?clk/s, mean\u003d0, std\u003d0]", "\rgenerating CLKs: 4%| | 200/5.00k [00:00\u003c00:11, 435clk/s, mean\u003d767, std\u003d34.9]", "\rgenerating CLKs: 36%|▎| 1.80k/5.00k [00:00\u003c00:05, 596clk/s, mean\u003d765, std\u003d37.5]", "\rgenerating CLKs: 56%|▌| 2.80k/5.00k [00:00\u003c00:02, 826clk/s, mean\u003d764, std\u003d37.3]", "\rgenerating CLKs: 68%|▋| 3.40k/5.00k [00:01\u003c00:01, 997clk/s, mean\u003d765, std\u003d37.5]", "\rgenerating CLKs: 84%|▊| 4.20k/5.00k [00:01\u003c00:00, 1.34kclk/s, mean\u003d764, std\u003d37.4]", "\rgenerating CLKs: 96%|▉| 4.80k/5.00k [00:01\u003c00:00, 1.74kclk/s, mean\u003d765, std\u003d37.2]", "\rgenerating CLKs: 100%|█| 5.00k/5.00k [00:01\u003c00:00, 3.31kclk/s, mean\u003d765, std\u003d37.1]\r\n\u001b[31mCLK data written to /tmp/tmpy3s8f407.json\u001b[0m\r\n", "\rgenerating CLKs: 0%| | 0.00/5.00k [00:00\u003c?, ?clk/s, mean\u003d0, std\u003d0]", "\rgenerating CLKs: 4%| | 200/5.00k [00:00\u003c00:10, 463clk/s, mean\u003d756, std\u003d44.4]", "\rgenerating CLKs: 32%|▎| 1.60k/5.00k [00:00\u003c00:05, 645clk/s, mean\u003d755, std\u003d43.4]", "\rgenerating CLKs: 40%|▍| 2.00k/5.00k [00:00\u003c00:03, 823clk/s, mean\u003d755, std\u003d43.3]", "\rgenerating CLKs: 56%|▌| 2.80k/5.00k [00:00\u003c00:01, 1.11kclk/s, mean\u003d755, std\u003d43.5]", "\rgenerating CLKs: 64%|▋| 3.20k/5.00k [00:01\u003c00:01, 1.26kclk/s, mean\u003d755, std\u003d43.6]", "\rgenerating CLKs: 84%|▊| 4.20k/5.00k [00:01\u003c00:00, 1.69kclk/s, mean\u003d755, std\u003d43.3]", "\rgenerating CLKs: 96%|▉| 4.80k/5.00k [00:01\u003c00:00, 2.10kclk/s, mean\u003d756, std\u003d43.1]", "\rgenerating CLKs: 100%|█| 5.00k/5.00k [00:01\u003c00:00, 3.53kclk/s, mean\u003d756, std\u003d43.3]\r\n\u001b[31mCLK data written to /tmp/tmp0fdoothg.json\u001b[0m\r\n" ], "output_type": "stream" } ], 
"source": [ "!clkutil hash \"{a_csv.name}\" horse staple \"{schema.name}\" \"{a_clks.name}\"\n", "!clkutil hash \"{b_csv.name}\" horse staple \"{schema.name}\" \"{b_clks.name}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now the two clients can upload their data providing the appropriate *upload tokens* and the *project_id*. As with all commands in `clkhash` we can output help:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Usage: clkutil upload [OPTIONS] CLK_JSON\r\n\r\n Upload CLK data to entity matching server.\r\n\r\n Given a json file containing hashed clk data as CLK_JSON, upload to the\r\n entity resolution service.\r\n\r\n Use \"-\" to read from stdin.\r\n\r\nOptions:\r\n --project TEXT Project identifier\r\n --apikey TEXT Authentication API key for the server.\r\n --server TEXT Server address including protocol\r\n -o, --output FILENAME\r\n -v, --verbose Script is more talkative\r\n --help Show this message and exit.\r\n" ], "output_type": "stream" } ], "source": [ "!clkutil upload --help" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Alice uploads her data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile(\u0027wt\u0027) as f:\n", " !clkutil upload \\\n", " --project\u003d\"{project_id}\" \\\n", " --apikey\u003d\"{credentials[\u0027update_tokens\u0027][0]}\" \\\n", " --server \"{url}\" \\\n", " --output \"{f.name}\" \\\n", " \"{a_clks.name}\"\n", " res \u003d json.load(open(f.name))\n", " alice_receipt_token \u003d res[\u0027receipt_token\u0027]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Every upload gets a receipt token. This token is required to access the results." 
] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Bob uploads his data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile(\u0027wt\u0027) as f:\n", "    !clkutil upload \\\n", "        --project\u003d\"{project_id}\" \\\n", "        --apikey\u003d\"{credentials[\u0027update_tokens\u0027][1]}\" \\\n", "        --server \"{url}\" \\\n", "        --output \"{f.name}\" \\\n", "        \"{b_clks.name}\"\n", "    \n", "    bob_receipt_token \u003d json.load(open(f.name))[\u0027receipt_token\u0027]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "\u003ca id\u003d\"create_run\"\u003e\u003c/a\u003e\n", "## Create a run\n", "\n", "Now that the project has been created and the CLK data has been uploaded, we can carry out privacy-preserving record linkage. The run below uses a similarity threshold of 0.85; try a few different threshold values to see how the results change:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile(\u0027wt\u0027) as f:\n", "    !clkutil create \\\n", "        --project\u003d\"{project_id}\" \\\n", "        --apikey\u003d\"{credentials[\u0027result_token\u0027]}\" \\\n", "        --server \"{url}\" \\\n", "        --threshold 0.85 \\\n", "        --output \"{f.name}\"\n", "    \n", "    run_id \u003d json.load(open(f.name))[\u0027run_id\u0027]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "\u003ca id\u003d\"results\"\u003e\u003c/a\u003e\n", "## Results\n", "\n", "After some delay (depending on the size of the data sets), we can fetch the mask.\n", "This can be done with clkutil:\n", "\n", "    !clkutil results --server \"{url}\" \\\n", "        --project\u003d\"{credentials[\u0027project_id\u0027]}\" \\\n", "        --apikey\u003d\"{credentials[\u0027result_token\u0027]}\" --output results.txt\n", "\n", "However, for this tutorial we are going to use the Python `requests` library:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { 
"pycharm": { "is_executing": false } }, "outputs": [], "source": "import requests\nimport clkhash.rest_client\n\nfrom IPython.display import clear_output" }, { "cell_type": "code", "execution_count": 15, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "State: completed\nStage (3/3): compute output\n" ], "output_type": "stream" } ], "source": [ "for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials[\u0027result_token\u0027], timeout\u003d300):\n", " clear_output(wait\u003dTrue)\n", " print(clkhash.rest_client.format_run_status(update))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "results \u003d requests.get(\u0027{}/api/v1/projects/{}/runs/{}/result\u0027.format(url, project_id, run_id), headers\u003d{\u0027Authorization\u0027: credentials[\u0027result_token\u0027]}).json()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "mask \u003d results[\u0027mask\u0027]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "This mask is a boolean array that specifies where rows of permuted data line up." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n" ], "output_type": "stream" } ], "source": [ "print(mask[:10])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "The number of 1s in the mask will tell us how many matches were found." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": "4858" }, "metadata": {}, "output_type": "execute_result", "execution_count": 20 } ], "source": [ "sum([1 for m in mask if m \u003d\u003d 1])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "We also use `requests` to fetch the permutations for each data provider:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "alice_res \u003d requests.get(\u0027{}/api/v1/projects/{}/runs/{}/result\u0027.format(url, project_id, run_id), headers\u003d{\u0027Authorization\u0027: alice_receipt_token}).json()\n", "bob_res \u003d requests.get(\u0027{}/api/v1/projects/{}/runs/{}/result\u0027.format(url, project_id, run_id), headers\u003d{\u0027Authorization\u0027: bob_receipt_token}).json()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now Alice and Bob each have a permutation - a new ordering for their data." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": "[4659, 4076, 1898, 868, 3271, 2486, 1078, 3774, 2656, 4324]" }, "metadata": {}, "output_type": "execute_result", "execution_count": 22 } ], "source": [ "alice_permutation \u003d alice_res[\u0027permutation\u0027]\n", "alice_permutation[:10]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "This permutation says the first row of Alice\u0027s data should be moved to position 4659." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": "[3074, 1996, 4523, 500, 3384, 1115, 746, 1165, 2999, 2204]" }, "metadata": {}, "output_type": "execute_result", "execution_count": 23 } ], "source": [ "bob_permutation \u003d bob_res[\u0027permutation\u0027]\n", "bob_permutation[:10]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def reorder(items, order):\n", "    \"\"\"\n", "    Return a copy of items with items[i] moved to position order[i].\n", "    \"\"\"\n", "    neworder \u003d items.copy()\n", "    for item, newpos in zip(items, order):\n", "        neworder[newpos] \u003d item\n", "    \n", "    return neworder" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with open(a_csv.name, \u0027r\u0027) as f:\n", "    alice_raw \u003d f.readlines()[1:]\n", "    alice_reordered \u003d reorder(alice_raw, alice_permutation)\n", "\n", "with open(b_csv.name, \u0027r\u0027) as f:\n", "    bob_raw \u003d f.readlines()[1:]\n", "    bob_reordered \u003d reorder(bob_raw, bob_permutation)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now that the two data sets have been permuted, the mask reveals where the rows line up, and where they don\u0027t." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": "[\u0027rec-4746-org,gabrielle,fargahry-tolba,10,northbourne avenue,pia place,st georges basin,2011,vic,19640424,7326839\\n\u0027,\n \u0027rec-438-org,alison,hearn,4,macdonnell street,cabrini medical centre,adelaide,2720,vic,19191230,2937695\\n\u0027,\n \u0027rec-3902-org,,oreilly,,paul coe crescent,wylarah,tuart hill,3219,vic,19500925,4201497\\n\u0027,\n \u0027rec-920-org,benjamin,clarke,122,archibald street,locn 1487,nickol,2535,nsw,19010518,1978760\\n\u0027,\n \u0027rec-2152-org,emiily,fitzpatrick,,aland place,keralland,rowville,2219,vic,19270130,1148897\\n\u0027,\n \u0027rec-3434-org,alex,clarke,12,fiveash street,emerald garden,homebush,2321,nsw,19840627,7280280\\n\u0027,\n \u0027rec-4197-org,talan,stubbs,21,augustus way,ashell,croydon north,3032,wa,19221022,7550622\\n\u0027,\n \u0027rec-2875-org,luke,white,31,outtrim avenue,glenora farm,flinders bay,2227,sa,19151010,6925269\\n\u0027,\n \u0027rec-2559-org,emiily,binns,24,howell place,sec 142 hd rounsevell,ryde,2627,wa,19941108,8919080\\n\u0027,\n \u0027rec-2679-org,thomas,brain,108,brewster place,geelong grove,eight mile plains,2114,qld,19851127,8873329\\n\u0027]" }, "metadata": {}, "output_type": "execute_result", "execution_count": 26 } ], "source": [ "alice_reordered[:10]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": "[\u0027rec-4746-dup-0,gabrielle,fargahry-tolba,11,northbourne avenue,pia place,st georges basin,2011,vic,19640424,7326839\\n\u0027,\n \u0027rec-438-dup-0,heatn,alison,4,macdonnell street,cabrini medicalb centre,adelaide,2270,vic,19191230,2937695\\n\u0027,\n \u0027rec-3902-dup-0,,oreilly,,paul coe cerscent,wylrah,tuart hill,3219,vic,19500925,4201497\\n\u0027,\n \u0027rec-920-dup-0,scott,clarke,122,archibald street,locn 
1487,nickol,2553,nsw,19010518,1978760\\n\u0027,\n \u0027rec-2152-dup-0,megna,fitzpatrick,,aland place,keralalnd,rowville,2219,vic,19270130,1148897\\n\u0027,\n \u0027rec-3434-dup-0,alex,clarke,12,,emeral dgarden,homebush,2321,nsw,19840627,7280280\\n\u0027,\n \u0027rec-4197-dup-0,talan,stubbs,21,binns street,ashell,croydon north,3032,wa,19221022,7550622\\n\u0027,\n \u0027rec-2875-dup-0,luke,white,31,outtrim aqenue,glenora farm,flinedrs bay,2227,sa,19151010,6925269\\n\u0027,\n \u0027rec-2559-dup-0,binns,emiilzy,24,howell place,sec 142 hd rounsevell,ryde,2627,wa,19941108,8919080\\n\u0027,\n \u0027rec-2679-dup-0,dixon,thomas,108,brewster place,geelong grove,eight mile plains,2114,qld,19851127,8873329\\n\u0027]" }, "metadata": {}, "output_type": "execute_result", "execution_count": 27 } ], "source": [ "bob_reordered[:10]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Accuracy\n", "\n", "To compute how well the matching went, we will use the record identifier in the first column as our reference.\n", "\n", "For example, `rec-1396-org` is the original record, which has a match in `rec-1396-dup-0`. To satisfy ourselves, we can preview the first few supposed matches:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Gabrielle Fargahry-Tolba (rec-4746-org) \u003d? Gabrielle Fargahry-Tolba (rec-4746-dup-0)\nAlison Hearn (rec-438-org) \u003d? Heatn Alison (rec-438-dup-0)\n Oreilly (rec-3902-org) \u003d? Oreilly (rec-3902-dup-0)\nBenjamin Clarke (rec-920-org) \u003d? Scott Clarke (rec-920-dup-0)\nEmiily Fitzpatrick (rec-2152-org) \u003d? Megna Fitzpatrick (rec-2152-dup-0)\nAlex Clarke (rec-3434-org) \u003d? Alex Clarke (rec-3434-dup-0)\nTalan Stubbs (rec-4197-org) \u003d? Talan Stubbs (rec-4197-dup-0)\nLuke White (rec-2875-org) \u003d? Luke White (rec-2875-dup-0)\nEmiily Binns (rec-2559-org) \u003d? Binns Emiilzy (rec-2559-dup-0)\nThomas Brain (rec-2679-org) \u003d? Dixon Thomas (rec-2679-dup-0)\n" ], "output_type": "stream" } ], "source": [ "for i, m in enumerate(mask[:10]):\n", "    if m:\n", "        entity_a \u003d alice_reordered[i].split(\u0027,\u0027)\n", "        entity_b \u003d bob_reordered[i].split(\u0027,\u0027)\n", "        name_a \u003d \u0027 \u0027.join(entity_a[1:3]).title()\n", "        name_b \u003d \u0027 \u0027.join(entity_b[1:3]).title()\n", "        \n", "        print(\"{} ({})\".format(name_a, entity_a[0]), \u0027\u003d?\u0027, \"{} ({})\".format(name_b, entity_b[0]))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Metrics\n", "If you know the ground truth (the correct mapping between the two datasets), you can compute performance metrics of the linkage.\n", "\n", "**Precision**: The fraction of found matches that are actual matches. (`tp/(tp+fp)`)\n", "\n", "**Recall**: The fraction of actual matches that were found. (`tp/(tp+fn)`) The code below counts 5000 actual matches, so `tp+fn` is 5000." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "text": [ "Found 4858 correct matches out of 5000. Incorrectly linked 0 matches.\nPrecision: 100.0%\nRecall: 97.2%\n" ], "output_type": "stream" } ], "source": [ "tp \u003d 0\n", "fp \u003d 0\n", "\n", "for i, m in enumerate(mask):\n", "    if m:\n", "        entity_a \u003d alice_reordered[i].split(\u0027,\u0027)\n", "        entity_b \u003d bob_reordered[i].split(\u0027,\u0027)\n", "        if entity_a[0].split(\u0027-\u0027)[1] \u003d\u003d entity_b[0].split(\u0027-\u0027)[1]:\n", "            tp +\u003d 1\n", "        else:\n", "            fp +\u003d 1\n", "            #print(\u0027False positive:\u0027,\u0027 \u0027.join(entity_a[1:3]).title(), \u0027?\u0027, \u0027 \u0027.join(entity_b[1:3]).title(), entity_a[-1] \u003d\u003d entity_b[-1])\n", "\n", "print(\"Found {} correct matches out of 5000. Incorrectly linked {} matches.\".format(tp, fp))\n", "precision \u003d tp/(tp+fp)\n", "recall \u003d tp/5000\n", "\n", "print(\"Precision: {:.1f}%\".format(100*precision))\n", "print(\"Recall: {:.1f}%\".format(100*recall))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }