Benchmarking
============

In the `benchmarking` folder is a benchmarking script and associated Dockerfile.
The docker image is published at ``data61/anonlink-benchmark``

The container/script is configured via environment variables.

- ``SERVER``: (required) the url of the server.
- ``EXPERIMENT``: json file containing a list of experiments to run. Schema of experiments is
  defined in `./schema/experiments.json`.
- ``DATA_PATH``: path to a directory to store test data (useful to cache).
- ``RESULT_PATH``: full filename to write results file.
- ``SCHEMA``: path to the linkage schema file used when creating projects. If not provided it is assumed
  to be in the data directory.
- ``TIMEOUT``: this timeout defined the time to wait for the result of a run in seconds. Default is 1200 (20min).


Run Benchmarking Container
--------------------------

Run the container directly with docker - substituting configuration information as required::

    docker run -it
        -e SERVER=https://anonlink.easd.data61.xyz \
        -e RESULTS_PATH=/app/results.json \
        quay.io/n1analytics/entity-benchmark:latest


By default the container will pull synthetic datasets from an S3 bucket and run default benchmark experiments
against the configured ``SERVER``. The default experiments (listed below) are set in
``benchmarking/default-experiments.json``.

The output will be printed and saved to a file pointed to by ``RESULTS_PATH`` (e.g. to ``/app/results.json``).

Cache Volume
~~~~~~~~~~~~

For speeding up benchmarking when running multiple times you may wish to mount a volume at the ``DATA_PATH``
to store the downloaded test data. Note the container runs as user ``1000``, so any mounted volume must be read
and writable by that user. To create a volume using docker::

    docker volume create linkage-benchmark-data

To copy data from a local directory and change owner::

    docker run --rm -v `pwd`:/src \
        -v linkage-benchmark-data:/data busybox \
        sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"

To run the benchmarks using the cache volume::

    docker run \
        --name ${benchmarkContainerName} \
        --network ${networkName} \
        -e SERVER=${localserver} \
        -e DATA_PATH=/cache \
        -e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
        -e RESULTS_PATH=/app/results.json \
        --mount source=linkage-benchmark-data,target=/cache \
        quay.io/n1analytics/entity-benchmark:latest


Experiments
-----------
The benchmarking script will run a range of experiments, defined in a json file.

Data
~~~~

The experiments use synthetic data generated with the febrl tool. The data is stored in the S3 bucket
s3://public-linkage-data. The naming convention is {type_of_data}_{party}_{size}.

You'll find

- the PII data for various dataset sizes, the CLKs in binary and json format, generated with the linkage schema defined in ``schema.json``.
- the corresponding linkage schema in ``schema.json``
- the blocks, generated with P-Sig blocking.
- the corresponding blocking schema ``psig_schema.json``
- the combined ``clknblocks`` files for the different parties and dataset sizes.

This particular blocking schema creates blocks with a median size of 1. The average size does not exceed 10 for any
dataset, and each entity is part of 5 different blocks.

Config
~~~~~~

The experiments are configured in a json document. Currently, you can specify the dataset sizes, the linkage threshold,
the number of repetitions and if blocking should be used. The default is::

    [
      {
        "sizes": ["100K", "100K"],
        "threshold": 0.95
      },
      {
        "sizes": ["100K", "100K"],
        "threshold": 0.80
      },
      {
        "sizes": ["100K", "1M"],
        "threshold": 0.95
      }
    ]

The schema of the experiments can be found in ``benchmarking/schema/experiments.json``.