{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "import os\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "KEY1 = 'correct'\n",
    "KEY2 = 'horse'\n",
    "\n",
    "SERVER = os.getenv(\"SERVER\", \"https://testing.es.data61.xyz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "# Scenario\n",
    "\n",
    "There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, but with incomplete overlap. The common features describing those entities are given name, surname, date of birth, and phone number.\n",
    "\n",
    "They all have some additional information about those entities in their respective datasets, Alice has a person's gender, Bob has their city, and Charlie has their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don't want to share any additional information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Alice, Bob, and Charlie: agree on secret keys and a linkage schema\n",
    "\n",
    "They keep the keys to themselves, but the schema may be revealed to the analyst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "keys: correct, horse\n"
     ]
    }
   ],
   "source": [
    "print(f'keys: {KEY1}, {KEY2}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "{\n",
      "  \"version\": 2,\n",
      "  \"clkConfig\": {\n",
      "    \"l\": 1024,\n",
      "    \"kdf\": {\n",
      "      \"type\": \"HKDF\",\n",
      "      \"hash\": \"SHA256\",\n",
      "      \"salt\": \"SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==\",\n",
      "      \"info\": \"c2NoZW1hX2V4YW1wbGU=\",\n",
      "      \"keySize\": 64\n",
      "    }\n",
      "  },\n",
      "  \"features\": [\n",
      "    {\n",
      "      \"identifier\": \"id\",\n",
      "      \"ignored\": true\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"givenname\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"ngram\": 2,\n",
      "        \"positional\": false,\n",
      "        \"strategy\": {\"k\": 15}\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"surname\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"ngram\": 2,\n",
      "        \"positional\": false,\n",
      "        \"strategy\": {\"k\": 15}\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"dob\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"ngram\": 2,\n",
      "        \"positional\": true,\n",
      "        \"strategy\": {\"k\": 15}\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"phone number\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"ngram\": 1,\n",
      "        \"positional\": true,\n",
      "        \"strategy\": {\"k\": 8}\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"ignoredForLinkage\",\n",
      "      \"ignored\": true\n",
      "    }\n",
      "  ]\n",
      "}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "with open('data/schema_ABC.json') as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "# Sneak peek at input data\n",
    "### Alice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>tara</td>\n",
       "      <td>hilton</td>\n",
       "      <td>27-08-1941</td>\n",
       "      <td>08 2210 0298</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>saJi</td>\n",
       "      <td>vernre</td>\n",
       "      <td>22-12-2972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>mals</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>sliver</td>\n",
       "      <td>paciorek</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>mals</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>9</td>\n",
       "      <td>ruby</td>\n",
       "      <td>george</td>\n",
       "      <td>09-05-1939</td>\n",
       "      <td>07 4698 6255</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10</td>\n",
       "      <td>eyrinm</td>\n",
       "      <td>campbell</td>\n",
       "      <td>29-1q-1983</td>\n",
       "      <td>08 299y 1535</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname   surname         dob  phone number gender\n",
       "0   0      tara    hilton  27-08-1941  08 2210 0298   male\n",
       "1   3      saJi    vernre  22-12-2972  02 1090 1906   mals\n",
       "2   7    sliver  paciorek         NaN           NaN   mals\n",
       "3   9      ruby    george  09-05-1939  07 4698 6255   male\n",
       "4  10    eyrinm  campbell  29-1q-1983  08 299y 1535   male"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-alice.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "### Bob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>city</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>zali</td>\n",
       "      <td>verner</td>\n",
       "      <td>22-12-1972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>perth</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>samuel</td>\n",
       "      <td>tremellen</td>\n",
       "      <td>21-12-1923</td>\n",
       "      <td>03 3605 9336</td>\n",
       "      <td>melbourne</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>amy</td>\n",
       "      <td>lodge</td>\n",
       "      <td>16-01-1958</td>\n",
       "      <td>07 8286 9372</td>\n",
       "      <td>canberra</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7</td>\n",
       "      <td>oIji</td>\n",
       "      <td>pacioerk</td>\n",
       "      <td>10-02-1959</td>\n",
       "      <td>04 4220 5949</td>\n",
       "      <td>sydney</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10</td>\n",
       "      <td>erin</td>\n",
       "      <td>kampgell</td>\n",
       "      <td>29-12-1983</td>\n",
       "      <td>08 2996 1445</td>\n",
       "      <td>perth</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname    surname         dob  phone number       city\n",
       "0   3      zali     verner  22-12-1972  02 1090 1906      perth\n",
       "1   4    samuel  tremellen  21-12-1923  03 3605 9336  melbourne\n",
       "2   5       amy      lodge  16-01-1958  07 8286 9372   canberra\n",
       "3   7      oIji   pacioerk  10-02-1959  04 4220 5949     sydney\n",
       "4  10      erin   kampgell  29-12-1983  08 2996 1445      perth"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-bob.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Charlie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>joshua</td>\n",
       "      <td>arkwright</td>\n",
       "      <td>16-02-1903</td>\n",
       "      <td>04 8511 9580</td>\n",
       "      <td>70189.446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>zal:</td>\n",
       "      <td>verner</td>\n",
       "      <td>22-12-1972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>50194.118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>oliyer</td>\n",
       "      <td>paciorwk</td>\n",
       "      <td>10-02-1959</td>\n",
       "      <td>04 4210 5949</td>\n",
       "      <td>31750.993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>nacoya</td>\n",
       "      <td>ranson</td>\n",
       "      <td>17-08-1925</td>\n",
       "      <td>07 6033 4580</td>\n",
       "      <td>102446.131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10</td>\n",
       "      <td>erih</td>\n",
       "      <td>campbell</td>\n",
       "      <td>29-12-1i83</td>\n",
       "      <td>08 299t 1435</td>\n",
       "      <td>331476.599</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname    surname         dob  phone number      income\n",
       "0   1    joshua  arkwright  16-02-1903  04 8511 9580   70189.446\n",
       "1   3      zal:     verner  22-12-1972  02 1090 1906   50194.118\n",
       "2   7    oliyer   paciorwk  10-02-1959  04 4210 5949   31750.993\n",
       "3   8    nacoya     ranson  17-08-1925  07 6033 4580  102446.131\n",
       "4  10      erih   campbell  29-12-1i83  08 299t 1435  331476.599"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-charlie.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: create the project\n",
    "\n",
    "The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Project created\n"
     ]
    }
   ],
   "source": [
    "!clkutil create-project --server $SERVER --type groups --schema data/schema_ABC.json --parties 3 --output credentials.json\n",
    "\n",
    "with open('credentials.json') as f:\n",
    "    credentials = json.load(f)\n",
    "    project_id = credentials['project_id']\n",
    "    result_token = credentials['result_token']\n",
    "    update_token_alice = credentials['update_tokens'][0]\n",
    "    update_token_bob = credentials['update_tokens'][1]\n",
    "    update_token_charlie = credentials['update_tokens'][2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Alice: hash the data and upload it to the server\n",
    "The data is hashed according to the schema and the keys. Alice's update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "generating CLKs:   0%|          | 0.00/3.23k [00:00<?, ?clk/s, mean=0, std=0]\n",
      "generating CLKs:   6%|6         | 200/3.23k [00:02<00:31, 96.1clk/s, mean=372, std=32.6]\n",
      "generating CLKs:  25%|##4       | 800/3.23k [00:02<00:17, 136clk/s, mean=371, std=35.5] \n",
      "generating CLKs:  63%|######2   | 2.03k/3.23k [00:02<00:06, 193clk/s, mean=372, std=34.7]\n",
      "generating CLKs: 100%|##########| 3.23k/3.23k [00:02<00:00, 1.29kclk/s, mean=372, std=34.9]\n",
      "CLK data written to dataset-alice-hashed.json\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash data/dataset-alice.csv $KEY1 $KEY2 data/schema_ABC.json dataset-alice-hashed.json --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"c54597f32fd969603efba706af1556abee3cc35f2718bcb6\"}\n"
     ]
    }
   ],
   "source": [
    "!clkutil upload --server $SERVER --apikey $update_token_alice --project $project_id dataset-alice-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Bob: hash the data and upload it to the server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "generating CLKs:   0%|          | 0.00/3.24k [00:00<?, ?clk/s, mean=0, std=0]\n",
      "generating CLKs:   6%|6         | 200/3.24k [00:01<00:25, 119clk/s, mean=369, std=32.4]\n",
      "generating CLKs:  31%|###       | 1.00k/3.24k [00:01<00:13, 168clk/s, mean=371, std=35]\n",
      "generating CLKs:  56%|#####5    | 1.80k/3.24k [00:01<00:06, 238clk/s, mean=371, std=35.5]\n",
      "generating CLKs: 100%|##########| 3.24k/3.24k [00:02<00:00, 1.45kclk/s, mean=372, std=35.3]\n",
      "CLK data written to dataset-bob-hashed.json\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash data/dataset-bob.csv $KEY1 $KEY2 data/schema_ABC.json dataset-bob-hashed.json --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"6ee2fe5df850b795ee6ddff1aaf4dfb03f6d4398dedcc248\"}\n"
     ]
    }
   ],
   "source": [
    "!clkutil upload --server $SERVER --apikey $update_token_bob --project $project_id dataset-bob-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Charlie: hash the data and upload it to the server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "generating CLKs:   0%|          | 0.00/3.26k [00:00<?, ?clk/s, mean=0, std=0]\n",
      "generating CLKs:   6%|6         | 200/3.26k [00:01<00:24, 122clk/s, mean=371, std=33.3]\n",
      "generating CLKs:  55%|#####5    | 1.80k/3.26k [00:01<00:08, 174clk/s, mean=372, std=34.5]\n",
      "generating CLKs: 100%|##########| 3.26k/3.26k [00:01<00:00, 1.73kclk/s, mean=372, std=34.8]\n",
      "CLK data written to dataset-charlie-hashed.json\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash data/dataset-charlie.csv $KEY1 $KEY2 data/schema_ABC.json dataset-charlie-hashed.json --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"064664ed9fd1f58c4da05c62a4832b813276d09342137a42\"}\n"
     ]
    }
   ],
   "source": [
    "!clkutil upload --server $SERVER --apikey $update_token_charlie --project $project_id dataset-charlie-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: start the linkage run\n",
    "\n",
    "This will start the linkage computation. We will wait a little bit and then retrieve the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "!clkutil create --server $SERVER --project $project_id --apikey $result_token --threshold 0.7 --output=run-credentials.json\n",
    "\n",
    "with open('run-credentials.json') as f:\n",
    "    run_credentials = json.load(f)\n",
    "    run_id = run_credentials['run_id']"
   ]
  },
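  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `--threshold 0.7` option above sets the minimum similarity score a pair of encodings must reach to be kept as a candidate match. Anonlink scores pairs with the Sørensen-Dice coefficient. The snippet below is only an illustrative sketch of that measure on tiny made-up bit vectors, not the service's implementation; the name `dice`, the 8-bit vectors, and the inclusive comparison are assumptions.\n",
    "\n",
    "```python\n",
    "def dice(a, b):\n",
    "    # Dice coefficient of two equal-length bit sequences:\n",
    "    # twice the number of shared set bits over the total number of set bits.\n",
    "    common = sum(x & y for x, y in zip(a, b))\n",
    "    total = sum(a) + sum(b)\n",
    "    return 2 * common / total if total else 0.0\n",
    "\n",
    "# Made-up 8-bit encodings (the schema in this notebook uses 1024 bits).\n",
    "clk_a = [1, 1, 0, 1, 0, 1, 1, 0]\n",
    "clk_b = [1, 1, 0, 1, 1, 0, 1, 0]\n",
    "\n",
    "score = dice(clk_a, clk_b)\n",
    "# Assumption: a pair at or above the threshold counts as a candidate.\n",
    "is_candidate = score >= 0.7\n",
    "```\n",
    "\n",
    "Here the two vectors share 4 set bits out of 5 each, so `score` is 0.8 and the pair would clear a threshold of 0.7."
   ]
  },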
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: retreve the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "State: completed\n",
      "Stage (3/3): compute output\n",
      "State: completed\n",
      "Stage (3/3): compute output\n",
      "State: completed\n",
      "Stage (3/3): compute output\n",
      "Downloading result\n",
      "Received result\n"
     ]
    }
   ],
   "source": [
    "!clkutil results --server $SERVER --project $project_id --apikey $result_token --run $run_id --watch --output linkage-output.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "with open('linkage-output.json') as f:\n",
    "    linkage_output = json.load(f)\n",
    "    linkage_groups = linkage_output['groups']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Everyone: make table of interesting information\n",
    "\n",
    "We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "with open('data/dataset-alice.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    genders = tuple(row[-1] for row in r)\n",
    "    \n",
    "with open('data/dataset-bob.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    cities = tuple(row[-1] for row in r)\n",
    "    \n",
    "with open('data/dataset-charlie.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    incomes = tuple(row[-1] for row in r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>city</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td></td>\n",
       "      <td>peGh</td>\n",
       "      <td>395273.665</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td></td>\n",
       "      <td>sydnev</td>\n",
       "      <td>77367.636</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td></td>\n",
       "      <td>pertb</td>\n",
       "      <td>323383.650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td></td>\n",
       "      <td>syd1e7y</td>\n",
       "      <td>79745.538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td></td>\n",
       "      <td>perth</td>\n",
       "      <td>28019.494</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td></td>\n",
       "      <td>canberra</td>\n",
       "      <td>78961.675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>female</td>\n",
       "      <td>brisnane</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>male</td>\n",
       "      <td>canbetra</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td></td>\n",
       "      <td>sydme7</td>\n",
       "      <td>106849.526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td></td>\n",
       "      <td>melbourne</td>\n",
       "      <td>68548.966</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   gender       city      income\n",
       "0               peGh  395273.665\n",
       "1             sydnev   77367.636\n",
       "2              pertb  323383.650\n",
       "3            syd1e7y   79745.538\n",
       "4              perth   28019.494\n",
       "5           canberra   78961.675\n",
       "6  female   brisnane            \n",
       "7    male   canbetra            \n",
       "8             sydme7  106849.526\n",
       "9          melbourne   68548.966"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table = []\n",
    "for group in linkage_groups:\n",
    "    row = [''] * 3\n",
    "    for i, j in group:\n",
    "        row[i] = [genders, cities, incomes][i][j]\n",
    "    if sum(map(bool, row)) > 1:\n",
    "        table.append(row)\n",
    "pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)"
   ]
  },
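  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the group format concrete, here is a minimal sketch of the same assembly step on made-up data. It assumes, as in the cell above, that each group is a list of `[party_index, row_index]` pairs with the parties ordered Alice, Bob, Charlie. The helper name `groups_to_rows` and all toy values are illustrative only.\n",
    "\n",
    "```python\n",
    "def groups_to_rows(groups, columns):\n",
    "    # Build one row per group, with one slot per party.\n",
    "    rows = []\n",
    "    for group in groups:\n",
    "        row = [''] * len(columns)\n",
    "        for party, index in group:\n",
    "            row[party] = columns[party][index]\n",
    "        # Keep only rows with at least two non-empty values.\n",
    "        if sum(map(bool, row)) > 1:\n",
    "            rows.append(row)\n",
    "    return rows\n",
    "\n",
    "# Toy attribute columns; an empty string stands for a missing value.\n",
    "toy_genders = ('male', 'female')\n",
    "toy_cities = ('perth', 'sydney')\n",
    "toy_incomes = ('50000.0', '')\n",
    "\n",
    "toy_groups = [\n",
    "    [[0, 0], [1, 1]],          # Alice row 0 with Bob row 1\n",
    "    [[1, 0], [2, 1]],          # Bob row 0 with Charlie row 1 (income missing)\n",
    "    [[0, 1], [1, 0], [2, 0]],  # one record from each party\n",
    "]\n",
    "\n",
    "rows = groups_to_rows(toy_groups, [toy_genders, toy_cities, toy_incomes])\n",
    "```\n",
    "\n",
    "With these toy inputs the second group is dropped because only one of its values is non-empty, leaving two rows."
   ]
  },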
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last 20 groups look like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {},
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[[0, 2111], [1, 2100]],\n",
       " [[0, 2121], [2, 2131], [1, 2111]],\n",
       " [[1, 1146], [2, 1202], [0, 1203]],\n",
       " [[1, 2466], [2, 2478], [0, 2460]],\n",
       " [[0, 429], [1, 412]],\n",
       " [[0, 2669], [1, 1204]],\n",
       " [[1, 1596], [2, 1623]],\n",
       " [[0, 487], [1, 459]],\n",
       " [[1, 1776], [2, 1800], [0, 1806]],\n",
       " [[1, 2586], [2, 2602]],\n",
       " [[0, 919], [1, 896]],\n",
       " [[0, 100], [2, 107], [1, 100]],\n",
       " [[0, 129], [1, 131], [2, 135]],\n",
       " [[0, 470], [1, 440]],\n",
       " [[0, 1736], [1, 1692], [2, 1734]]]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linkage_groups[-15:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "# Sneak peek at the result\n",
    "\n",
    "We obviously can't do this in a real-world setting, but let's view the linkage using the PII. If the IDs match, then we are correct."
   ]
  },
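  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the ID check described above, the snippet below counts how many groups contain a single distinct ID. It runs on made-up records shaped like the CSV rows (ID in the first column); `group_is_correct` and the toy data are hypothetical and not part of Anonlink.\n",
    "\n",
    "```python\n",
    "def group_is_correct(group, datasets):\n",
    "    # A group is consistent when all of its records carry the same ID.\n",
    "    ids = {datasets[party][index][0] for party, index in group}\n",
    "    return len(ids) == 1\n",
    "\n",
    "# Toy records: [id, givenname]; only the ID matters for the check.\n",
    "toy_alice = (['3', 'saJi'], ['7', 'sliver'])\n",
    "toy_bob = (['3', 'zali'], ['8', 'amy'])\n",
    "toy_datasets = [toy_alice, toy_bob]\n",
    "\n",
    "toy_groups = [\n",
    "    [[0, 0], [1, 0]],  # IDs 3 and 3: consistent\n",
    "    [[0, 1], [1, 1]],  # IDs 7 and 8: inconsistent\n",
    "]\n",
    "\n",
    "correct = [group_is_correct(g, toy_datasets) for g in toy_groups]\n",
    "fraction_correct = sum(correct) / len(correct)\n",
    "```\n",
    "\n",
    "Here the first group is consistent and the second is not, so `fraction_correct` is 0.5. The same helper could be applied to `linkage_groups` once the three datasets are loaded as tuples of rows."
   ]
  },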
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "with open('data/dataset-alice.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_alice = tuple(r)\n",
    "    \n",
    "with open('data/dataset-bob.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_bob = tuple(r)\n",
    "    \n",
    "with open('data/dataset-charlie.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_charlie = tuple(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>given name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>non-linking</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6426</th>\n",
       "      <td>1171</td>\n",
       "      <td>isabelle</td>\n",
       "      <td>bridgland</td>\n",
       "      <td>30-03-1994</td>\n",
       "      <td>04 5318 6471</td>\n",
       "      <td>mal4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6427</th>\n",
       "      <td>1171</td>\n",
       "      <td>isalolIe</td>\n",
       "      <td>riahgland</td>\n",
       "      <td>30-02-1994</td>\n",
       "      <td>04 5318 6471</td>\n",
       "      <td>sydnry</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6428</th>\n",
       "      <td>1171</td>\n",
       "      <td>isabelle</td>\n",
       "      <td>bridgland</td>\n",
       "      <td>30-02-1994</td>\n",
       "      <td>04 5318 6471</td>\n",
       "      <td>63514.217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6429</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6430</th>\n",
       "      <td>1243</td>\n",
       "      <td>thmoas</td>\n",
       "      <td>doaldson</td>\n",
       "      <td>13-04-1900</td>\n",
       "      <td>09 6963 1944</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6431</th>\n",
       "      <td>1243</td>\n",
       "      <td>thoma5</td>\n",
       "      <td>donaldson</td>\n",
       "      <td>13-04-1900</td>\n",
       "      <td>08 6962 1944</td>\n",
       "      <td>perth</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6432</th>\n",
       "      <td>1243</td>\n",
       "      <td>thomas</td>\n",
       "      <td>donalsdon</td>\n",
       "      <td>13-04-2900</td>\n",
       "      <td>08 6963 2944</td>\n",
       "      <td>489229.297</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6433</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6434</th>\n",
       "      <td>2207</td>\n",
       "      <td>annah</td>\n",
       "      <td>aslea</td>\n",
       "      <td>02-11-2906</td>\n",
       "      <td>04 5501 5973</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6435</th>\n",
       "      <td>2207</td>\n",
       "      <td>hannah</td>\n",
       "      <td>easlea</td>\n",
       "      <td>02-11-2006</td>\n",
       "      <td>04 5501 5973</td>\n",
       "      <td>canberra</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6436</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6437</th>\n",
       "      <td>5726</td>\n",
       "      <td>rhys</td>\n",
       "      <td>clarke</td>\n",
       "      <td>19-05-1929</td>\n",
       "      <td>02 9220 9635</td>\n",
       "      <td>mqle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6438</th>\n",
       "      <td>5726</td>\n",
       "      <td>ry5</td>\n",
       "      <td>clarke</td>\n",
       "      <td>19-05-1939</td>\n",
       "      <td>02 9120 9635</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6439</th>\n",
       "      <td>5726</td>\n",
       "      <td>rhys</td>\n",
       "      <td>klark</td>\n",
       "      <td>19-05-2938</td>\n",
       "      <td>02 9220 9635</td>\n",
       "      <td>118197.119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6440</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id given name    surname         dob  phone number non-linking\n",
       "6426  1171   isabelle  bridgland  30-03-1994  04 5318 6471        mal4\n",
       "6427  1171   isalolIe  riahgland  30-02-1994  04 5318 6471      sydnry\n",
       "6428  1171   isabelle  bridgland  30-02-1994  04 5318 6471   63514.217\n",
       "6429                                                                  \n",
       "6430  1243     thmoas   doaldson  13-04-1900  09 6963 1944        male\n",
       "6431  1243     thoma5  donaldson  13-04-1900  08 6962 1944       perth\n",
       "6432  1243     thomas  donalsdon  13-04-2900  08 6963 2944  489229.297\n",
       "6433                                                                  \n",
       "6434  2207      annah      aslea  02-11-2906  04 5501 5973        male\n",
       "6435  2207     hannah     easlea  02-11-2006  04 5501 5973    canberra\n",
       "6436                                                                  \n",
       "6437  5726       rhys     clarke  19-05-1929  02 9220 9635        mqle\n",
       "6438  5726        ry5     clarke  19-05-1939  02 9120 9635            \n",
       "6439  5726       rhys      klark  19-05-2938  02 9220 9635  118197.119\n",
       "6440                                                                  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table = []\n",
    "for group in linkage_groups:\n",
    "    for i, j in sorted(group):\n",
    "        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])\n",
    "    table.append([''] * 6)\n",
    "    \n",
    "pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}