{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "import os\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "SECRET = 'my_secret'\n",
    "\n",
    "SERVER = os.getenv(\"SERVER\", \"https://anonlink.easd.data61.xyz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "# Multiparty Linkage with Clkhash\n",
    "\n",
    "\n",
    "## Scenario\n",
    "\n",
    "There are three parties named Alice, Bob, and Charlie, each holding a dataset of about 3200 records. They know that they have some entities in common, but with incomplete overlap. The common features describing those entities are given name, surname, date of birth, and phone number.\n",
    "\n",
    "They all have some additional information about those entities in their respective datasets, Alice has a person's gender, Bob has their city, and Charlie has their income. They wish to create a table for analysis: each row has a gender, city, and income, but they don't want to share any additional information. They can use Anonlink to do this in a privacy-preserving way (without revealing given names, surnames, dates of birth, and phone numbers)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Alice, Bob, and Charlie: agree on secret keys and a linkage schema\n",
    "\n",
    "They keep the keys to themselves, but the schema may be revealed to the analyst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "keys: my_secret\n"
     ]
    }
   ],
   "source": [
    "print(f'keys: {SECRET}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"version\": 3,\n",
      "  \"clkConfig\": {\n",
      "    \"l\": 1024,\n",
      "    \"kdf\": {\n",
      "      \"type\": \"HKDF\",\n",
      "      \"hash\": \"SHA256\",\n",
      "      \"salt\": \"SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==\",\n",
      "      \"info\": \"c2NoZW1hX2V4YW1wbGU=\",\n",
      "      \"keySize\": 64\n",
      "    }\n",
      "  },\n",
      "  \"features\": [\n",
      "    {\n",
      "      \"identifier\": \"id\",\n",
      "      \"ignored\": true\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"givenname\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"strategy\": {\n",
      "          \"bitsPerToken\": 15\n",
      "        },\n",
      "        \"comparison\": {\n",
      "          \"type\": \"ngram\",\n",
      "          \"n\": 2,\n",
      "          \"positional\": false\n",
      "        }\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"surname\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"strategy\": {\n",
      "          \"bitsPerToken\": 15\n",
      "        },\n",
      "        \"comparison\": {\n",
      "          \"type\": \"ngram\",\n",
      "          \"n\": 2,\n",
      "          \"positional\": false\n",
      "        }\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"dob\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"strategy\": {\n",
      "          \"bitsPerToken\": 15\n",
      "        },\n",
      "        \"comparison\": {\n",
      "          \"type\": \"ngram\",\n",
      "          \"n\": 2,\n",
      "          \"positional\": true\n",
      "        }\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"phone number\",\n",
      "      \"format\": {\n",
      "        \"type\": \"string\",\n",
      "        \"encoding\": \"utf-8\"\n",
      "      },\n",
      "      \"hashing\": {\n",
      "        \"strategy\": {\n",
      "          \"bitsPerToken\": 8\n",
      "        },\n",
      "        \"comparison\": {\n",
      "          \"type\": \"ngram\",\n",
      "          \"n\": 1,\n",
      "          \"positional\": true\n",
      "        }\n",
      "      }\n",
      "    },\n",
      "    {\n",
      "      \"identifier\": \"ignoredForLinkage\",\n",
      "      \"ignored\": true\n",
      "    }\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "with open('data/schema_ABC.json') as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Sneak peek at input data\n",
    "\n",
    "### Alice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>tara</td>\n",
       "      <td>hilton</td>\n",
       "      <td>27-08-1941</td>\n",
       "      <td>08 2210 0298</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>saJi</td>\n",
       "      <td>vernre</td>\n",
       "      <td>22-12-2972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>mals</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>sliver</td>\n",
       "      <td>paciorek</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>mals</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>9</td>\n",
       "      <td>ruby</td>\n",
       "      <td>george</td>\n",
       "      <td>09-05-1939</td>\n",
       "      <td>07 4698 6255</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>eyrinm</td>\n",
       "      <td>campbell</td>\n",
       "      <td>29-1q-1983</td>\n",
       "      <td>08 299y 1535</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname   surname         dob  phone number gender\n",
       "0   0      tara    hilton  27-08-1941  08 2210 0298   male\n",
       "1   3      saJi    vernre  22-12-2972  02 1090 1906   mals\n",
       "2   7    sliver  paciorek         NaN           NaN   mals\n",
       "3   9      ruby    george  09-05-1939  07 4698 6255   male\n",
       "4  10    eyrinm  campbell  29-1q-1983  08 299y 1535   male"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-alice.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "### Bob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>city</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>zali</td>\n",
       "      <td>verner</td>\n",
       "      <td>22-12-1972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>perth</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>samuel</td>\n",
       "      <td>tremellen</td>\n",
       "      <td>21-12-1923</td>\n",
       "      <td>03 3605 9336</td>\n",
       "      <td>melbourne</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>amy</td>\n",
       "      <td>lodge</td>\n",
       "      <td>16-01-1958</td>\n",
       "      <td>07 8286 9372</td>\n",
       "      <td>canberra</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>oIji</td>\n",
       "      <td>pacioerk</td>\n",
       "      <td>10-02-1959</td>\n",
       "      <td>04 4220 5949</td>\n",
       "      <td>sydney</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>erin</td>\n",
       "      <td>kampgell</td>\n",
       "      <td>29-12-1983</td>\n",
       "      <td>08 2996 1445</td>\n",
       "      <td>perth</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname    surname         dob  phone number       city\n",
       "0   3      zali     verner  22-12-1972  02 1090 1906      perth\n",
       "1   4    samuel  tremellen  21-12-1923  03 3605 9336  melbourne\n",
       "2   5       amy      lodge  16-01-1958  07 8286 9372   canberra\n",
       "3   7      oIji   pacioerk  10-02-1959  04 4220 5949     sydney\n",
       "4  10      erin   kampgell  29-12-1983  08 2996 1445      perth"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-bob.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Charlie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>givenname</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>joshua</td>\n",
       "      <td>arkwright</td>\n",
       "      <td>16-02-1903</td>\n",
       "      <td>04 8511 9580</td>\n",
       "      <td>70189.446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>zal:</td>\n",
       "      <td>verner</td>\n",
       "      <td>22-12-1972</td>\n",
       "      <td>02 1090 1906</td>\n",
       "      <td>50194.118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>oliyer</td>\n",
       "      <td>paciorwk</td>\n",
       "      <td>10-02-1959</td>\n",
       "      <td>04 4210 5949</td>\n",
       "      <td>31750.993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "      <td>nacoya</td>\n",
       "      <td>ranson</td>\n",
       "      <td>17-08-1925</td>\n",
       "      <td>07 6033 4580</td>\n",
       "      <td>102446.131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>erih</td>\n",
       "      <td>campbell</td>\n",
       "      <td>29-12-1i83</td>\n",
       "      <td>08 299t 1435</td>\n",
       "      <td>331476.599</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id givenname    surname         dob  phone number      income\n",
       "0   1    joshua  arkwright  16-02-1903  04 8511 9580   70189.446\n",
       "1   3      zal:     verner  22-12-1972  02 1090 1906   50194.118\n",
       "2   7    oliyer   paciorwk  10-02-1959  04 4210 5949   31750.993\n",
       "3   8    nacoya     ranson  17-08-1925  07 6033 4580  102446.131\n",
       "4  10      erih   campbell  29-12-1i83  08 299t 1435  331476.599"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/dataset-charlie.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: create the project\n",
    "\n",
    "The analyst keeps the result token to themselves. The three update tokens go to Alice, Bob and Charlie. The project ID is known by everyone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mProject created\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "!clkutil create-project \\\n",
    "    --server $SERVER \\\n",
    "    --type groups \\\n",
    "    --schema data/schema_ABC.json \\\n",
    "    --parties 3 \\\n",
    "    --output credentials.json\n",
    "\n",
    "with open('credentials.json') as f:\n",
    "    credentials = json.load(f)\n",
    "    project_id = credentials['project_id']\n",
    "    result_token = credentials['result_token']\n",
    "    update_token_alice = credentials['update_tokens'][0]\n",
    "    update_token_bob = credentials['update_tokens'][1]\n",
    "    update_token_charlie = credentials['update_tokens'][2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Alice: hash the data and upload it to the server\n",
    "The data is hashed according to the schema and the keys. Alice's update token is needed to upload the hashed data. No PII is uploaded to the service—only the hashes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mCLK data written to dataset-alice-hashed.json\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash \\\n",
    "    data/dataset-alice.csv \\\n",
    "    $SECRET \\\n",
    "    data/schema_ABC.json \\\n",
    "    dataset-alice-hashed.json \\\n",
    "    --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"c202d98eb83c7e55e6177ba9bcf55cb35f40ac1d21714897\"}"
     ]
    }
   ],
   "source": [
    "!clkutil upload \\\n",
    "    --server $SERVER \\\n",
    "    --apikey $update_token_alice \\\n",
    "    --project $project_id \\\n",
    "    dataset-alice-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Bob: hash the data and upload it to the server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mCLK data written to dataset-bob-hashed.json\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash \\\n",
    "    data/dataset-bob.csv \\\n",
    "    $SECRET \\\n",
    "    data/schema_ABC.json \\\n",
    "    dataset-bob-hashed.json \\\n",
    "    --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"75083f544df8e944cc590089bb3e31c134e810992f08ea80\"}"
     ]
    }
   ],
   "source": [
    "!clkutil upload \\\n",
    "    --server $SERVER \\\n",
    "    --apikey $update_token_bob \\\n",
    "    --project $project_id \\\n",
    "    dataset-bob-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Charlie: hash the data and upload it to the server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mCLK data written to dataset-charlie-hashed.json\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "!clkutil hash \\\n",
    "    data/dataset-charlie.csv \\\n",
    "    $SECRET \\\n",
    "    data/schema_ABC.json \\\n",
    "    dataset-charlie-hashed.json \\\n",
    "    --check-header false"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"message\": \"Updated\", \"receipt_token\": \"814b4a226453d7261348a403e134b0764501432bf679658f\"}"
     ]
    }
   ],
   "source": [
    "!clkutil upload \\\n",
    "    --server $SERVER \\\n",
    "    --apikey $update_token_charlie \\\n",
    "    --project $project_id \\\n",
    "    dataset-charlie-hashed.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: start the linkage run\n",
    "\n",
    "This will start the linkage computation. We will wait a little bit and then retrieve the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "!clkutil create \\\n",
    "    --server $SERVER \\\n",
    "    --project $project_id \\\n",
    "    --apikey $result_token \\\n",
    "    --threshold 0.7 \\\n",
    "    --output=run-credentials.json\n",
    "\n",
    "with open('run-credentials.json') as f:\n",
    "    run_credentials = json.load(f)\n",
    "    run_id = run_credentials['run_id']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Analyst: retrieve the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mState: completed\n",
      "Stage (3/3): compute output\u001b[0m\n",
      "\u001b[31mState: completed\n",
      "Stage (3/3): compute output\u001b[0m\n",
      "\u001b[31mState: completed\n",
      "Stage (3/3): compute output\u001b[0m\n",
      "\u001b[31mDownloading result\u001b[0m\n",
      "\u001b[31mReceived result\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!clkutil results \\\n",
    "    --server $SERVER \\\n",
    "    --project $project_id \\\n",
    "    --apikey $result_token \\\n",
    "    --run $run_id \\\n",
    "    --watch \\\n",
    "    --output linkage-output.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[[0, 1787], [1, 1751], [2, 1784]],\n",
       " [[0, 565], [1, 557], [2, 564]],\n",
       " [[0, 836], [1, 815], [2, 850]],\n",
       " [[0, 505], [2, 495]],\n",
       " [[0, 536], [2, 525], [1, 512]],\n",
       " [[0, 1641], [2, 1608], [1, 1584]],\n",
       " [[0, 2234], [1, 2228], [2, 2242]],\n",
       " [[0, 781], [1, 762], [2, 799]],\n",
       " [[0, 918], [2, 2840]],\n",
       " [[1, 1393], [2, 1421], [0, 1451]],\n",
       " [[1, 1587], [2, 1609], [0, 1642]],\n",
       " [[1, 1730], [2, 1767]],\n",
       " [[1, 2808], [2, 2813]],\n",
       " [[0, 2765], [2, 2794], [1, 2789]],\n",
       " [[1, 351], [2, 356]]]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('linkage-output.json') as f:\n",
    "    linkage_output = json.load(f)\n",
    "    linkage_groups = linkage_output['groups']\n",
    "linkage_groups[-15:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a list of groups of records. Every record in such a group belongs to the same entity and consists of two values, the party index and the row index:\n",
    "```\n",
    "[\n",
    "  [[party_id, row_index], ... ],\n",
    "  ...\n",
    "]\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Everyone: make table of interesting information\n",
    "\n",
    "We use the linkage result to make a table of genders, cities, and incomes without revealing any other PII."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "with open('data/dataset-alice.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    genders = tuple(row[-1] for row in r)\n",
    "    \n",
    "with open('data/dataset-bob.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    cities = tuple(row[-1] for row in r)\n",
    "    \n",
    "with open('data/dataset-charlie.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    incomes = tuple(row[-1] for row in r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gender</th>\n",
       "      <th>city</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>melbourne</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>femalr</td>\n",
       "      <td></td>\n",
       "      <td>277039.294</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td></td>\n",
       "      <td>pertb</td>\n",
       "      <td>21407e.192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td></td>\n",
       "      <td>mlebourne</td>\n",
       "      <td>56899.522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>male</td>\n",
       "      <td>canberra</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>femaoe</td>\n",
       "      <td>sydn3y</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>male</td>\n",
       "      <td></td>\n",
       "      <td>154195.553</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>female</td>\n",
       "      <td></td>\n",
       "      <td>44652.704</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>male</td>\n",
       "      <td>sydnely</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>mal3</td>\n",
       "      <td>sydney</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   gender       city      income\n",
       "0    male  melbourne            \n",
       "1  femalr             277039.294\n",
       "2              pertb  21407e.192\n",
       "3          mlebourne   56899.522\n",
       "4    male   canberra            \n",
       "5  femaoe     sydn3y            \n",
       "6    male             154195.553\n",
       "7  female              44652.704\n",
       "8    male    sydnely            \n",
       "9    mal3     sydney            "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table = []\n",
    "for group in linkage_groups:\n",
    "    row = [''] * 3\n",
    "    for i, j in group:\n",
    "        row[i] = [genders, cities, incomes][i][j]\n",
    "    if sum(map(bool, row)) > 1:\n",
    "        table.append(row)\n",
    "pd.DataFrame(table, columns=['gender', 'city', 'income']).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last 20 groups look like this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## Sneak peek at the result\n",
    "\n",
    "We obviously can't do this in a real-world setting, but let's view the linkage using the PII. If the IDs match, then we are correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "with open('data/dataset-alice.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_alice = tuple(r)\n",
    "    \n",
    "with open('data/dataset-bob.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_bob = tuple(r)\n",
    "    \n",
    "with open('data/dataset-charlie.csv') as f:\n",
    "    r = csv.reader(f)\n",
    "    next(r)  # Skip header\n",
    "    dataset_charlie = tuple(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>given name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>phone number</th>\n",
       "      <th>non-linking</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>6450</td>\n",
       "      <td>5436</td>\n",
       "      <td>nikki</td>\n",
       "      <td>spears</td>\n",
       "      <td>10-02-2097</td>\n",
       "      <td>06 9447 1767</td>\n",
       "      <td>156639.106</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6451</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6452</td>\n",
       "      <td>5833</td>\n",
       "      <td>nell</td>\n",
       "      <td>rud</td>\n",
       "      <td>06-1p-1956</td>\n",
       "      <td>08 5510 5369</td>\n",
       "      <td>sydnev</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6453</td>\n",
       "      <td>5833</td>\n",
       "      <td>ned</td>\n",
       "      <td>reif</td>\n",
       "      <td>06-20-1956</td>\n",
       "      <td>08 5510 5369</td>\n",
       "      <td>117275.089</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6454</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6455</td>\n",
       "      <td>872</td>\n",
       "      <td>jackson</td>\n",
       "      <td>green</td>\n",
       "      <td>06-09-1920</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6456</td>\n",
       "      <td>872</td>\n",
       "      <td>jackson</td>\n",
       "      <td>gnn</td>\n",
       "      <td>06-00-1920</td>\n",
       "      <td>08 3409 2246</td>\n",
       "      <td>147663.277</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6457</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6458</td>\n",
       "      <td>8662</td>\n",
       "      <td>luct</td>\n",
       "      <td>pulfort</td>\n",
       "      <td>05-03-1903</td>\n",
       "      <td>02 0726 9479</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6459</td>\n",
       "      <td>8662</td>\n",
       "      <td>lucy</td>\n",
       "      <td>pulford</td>\n",
       "      <td>05-03-1903</td>\n",
       "      <td></td>\n",
       "      <td>melbourrie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6460</td>\n",
       "      <td>8662</td>\n",
       "      <td>lusy</td>\n",
       "      <td>pulford</td>\n",
       "      <td>05-03-1993</td>\n",
       "      <td>02 0726 0489</td>\n",
       "      <td>192230.309</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6461</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6462</td>\n",
       "      <td>1885</td>\n",
       "      <td>nicholas</td>\n",
       "      <td>robson</td>\n",
       "      <td>06-01-1914</td>\n",
       "      <td>02 7799 6803</td>\n",
       "      <td>canberra</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6463</td>\n",
       "      <td>1885</td>\n",
       "      <td>nicho|as</td>\n",
       "      <td>robson</td>\n",
       "      <td>06-91-1914</td>\n",
       "      <td>02 7799 6803</td>\n",
       "      <td>61333.218</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6464</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id given name  surname         dob  phone number non-linking\n",
       "6450  5436      nikki   spears  10-02-2097  06 9447 1767  156639.106\n",
       "6451                                                                \n",
       "6452  5833       nell      rud  06-1p-1956  08 5510 5369      sydnev\n",
       "6453  5833        ned     reif  06-20-1956  08 5510 5369  117275.089\n",
       "6454                                                                \n",
       "6455   872    jackson    green  06-09-1920                          \n",
       "6456   872    jackson      gnn  06-00-1920  08 3409 2246  147663.277\n",
       "6457                                                                \n",
       "6458  8662       luct  pulfort  05-03-1903  02 0726 9479        male\n",
       "6459  8662       lucy  pulford  05-03-1903                melbourrie\n",
       "6460  8662       lusy  pulford  05-03-1993  02 0726 0489  192230.309\n",
       "6461                                                                \n",
       "6462  1885   nicholas   robson  06-01-1914  02 7799 6803    canberra\n",
       "6463  1885   nicho|as   robson  06-91-1914  02 7799 6803   61333.218\n",
       "6464                                                                "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table = []\n",
    "for group in linkage_groups:\n",
    "    for i, j in sorted(group):\n",
    "        table.append([dataset_alice, dataset_bob, dataset_charlie][i][j])\n",
    "    table.append([''] * 6)\n",
    "    \n",
    "pd.DataFrame(table, columns=['id', 'given name', 'surname', 'dob', 'phone number', 'non-linking']).tail(15)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mProject deleted\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "# Deleting the project\n",
    "!clkutil delete-project --project=\"{credentials['project_id']}\" \\\n",
    "        --apikey=\"{credentials['result_token']}\" \\\n",
    "        --server=\"{SERVER}\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}