{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Entity Service Permutation Output\n", "\n", "This tutorial demonstrates the workflow for private record linkage using the entity service. Two parties, _Alice_ and _Bob_, each have a dataset of personally identifiable information (PII) of several entities. They want to learn the linkage of corresponding entities between their respective datasets with the help of the entity service and an independent party, the _Analyst_.\n", "\n", "The chosen output type is `permutations`, which consists of two permutations and one mask.\n", "\n", "\n", "### Who learns what?\n", "\n", "After the linkage has been carried out, Alice and Bob will each be able to retrieve a `permutation` - a reordering of their respective datasets such that shared entities line up.\n", "\n", "The Analyst - who creates the linkage project - learns the `mask`. The mask is a binary vector that indicates which rows in the permuted datasets are aligned. Note that the mask also reveals how many entities are shared.\n", "\n", "\n", "### Steps\n", "These steps are usually run by different companies - but for illustration, all of them are carried out in this one file. The participants providing data are _Alice_ and _Bob_, with the _Analyst_ acting as the integration authority.\n", "\n", "* [Check connection to Entity Service](#Check-Connection)\n", "* [Data preparation](#Data-preparation)\n", "  * Write CSV files with PII\n", "  * [Create a Linkage Schema](#Schema-Preparation)\n", "* [Create Linkage Project](#Create-Linkage-Project)\n", "* [Generate CLKs from PII](#Hash-and-Upload)\n", "* [Upload the CLKs](#Hash-and-Upload)\n", "* [Create a run](#Create-a-run)\n", "* [Retrieve and analyse results](#Results)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Check Connection\n", "\n", "> If you're connecting to a custom entity service, change the address here. Alternatively, set the environment variable `SERVER` before launching the Jupyter notebook." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz\n" ] } ], "source": [ "import os\n", "url = os.getenv(\"SERVER\", \"https://anonlink.easd.data61.xyz\")\n", "print(f'Testing anonlink-entity-service hosted at {url}')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"project_count\": 846, \"rate\": 593838, \"status\": \"ok\"}\r\n" ] } ], "source": [ "!anonlink status --server \"{url}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Data preparation\n", "\n", "Following the [anonlink-client command line tutorial](https://anonlink-client.readthedocs.io/en/latest/tutorial/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from tempfile import NamedTemporaryFile\n", "from recordlinkage.datasets import load_febrl4" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |\n",
"| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |\n",
"| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |\n", "