Benchmarking¶
In the benchmarking folder is a benchmarking script and associated Dockerfile.
The docker image is published at data61/anonlink-benchmark
The container/script is configured via environment variables.
SERVER: (required) the url of the server.EXPERIMENT: json file containing a list of experiments to run. Schema of experiments is defined in ./schema/experiments.json.DATA_PATH: path to a directory to store test data (useful to cache).RESULT_PATH: full filename to write results file.SCHEMA: path to the linkage schema file used when creating projects. If not provided it is assumed to be in the data directory.TIMEOUT: this timeout defined the time to wait for the result of a run in seconds. Default is 1200 (20min).
Run Benchmarking Container¶
Run the container directly with docker - substituting configuration information as required:
docker run -it
-e SERVER=https://anonlink.easd.data61.xyz \
-e RESULTS_PATH=/app/results.json \
quay.io/n1analytics/entity-benchmark:latest
By default the container will pull synthetic datasets from an S3 bucket and run default benchmark experiments
against the configured SERVER. The default experiments (listed below) are set in
benchmarking/default-experiments.json.
The output will be printed and saved to a file pointed to by RESULTS_PATH (e.g. to /app/results.json).
Cache Volume¶
For speeding up benchmarking when running multiple times you may wish to mount a volume at the DATA_PATH
to store the downloaded test data. Note the container runs as user 1000, so any mounted volume must be read
and writable by that user. To create a volume using docker:
docker volume create linkage-benchmark-data
To copy data from a local directory and change owner:
docker run --rm -v `pwd`:/src \
-v linkage-benchmark-data:/data busybox \
sh -c "cp -r /src/linkage-bench-cache-experiments.json /data; chown -R 1000:1000 /data"
To run the benchmarks using the cache volume:
docker run \
--name ${benchmarkContainerName} \
--network ${networkName} \
-e SERVER=${localserver} \
-e DATA_PATH=/cache \
-e EXPERIMENT=/cache/linkage-bench-cache-experiments.json \
-e RESULTS_PATH=/app/results.json \
--mount source=linkage-benchmark-data,target=/cache \
quay.io/n1analytics/entity-benchmark:latest
Experiments¶
The benchmarking script will run a range of experiments, defined in a json file.
Data¶
The experiments use synthetic data generated with the febrl tool. The data is stored in the S3 bucket s3://public-linkage-data. The naming convention is {type_of_data}_{party}_{size}.
You’ll find
- the PII data for various dataset sizes, the CLKs in binary and json format, generated with the linkage schema defined in
schema.json. - the corresponding linkage schema in
schema.json - the blocks, generated with P-Sig blocking.
- the corresponding blocking schema
psig_schema.json - the combined
clknblocksfiles for the different parties and dataset sizes.
This particular blocking schema creates blocks with a median size of 1. The average size does not exceed 10 for any dataset, and each entity is part of 5 different blocks.
Config¶
The experiments are configured in a json document. Currently, you can specify the dataset sizes, the linkage threshold, the number of repetitions and if blocking should be used. The default is:
[
{
"sizes": ["100K", "100K"],
"threshold": 0.95
},
{
"sizes": ["100K", "100K"],
"threshold": 0.80
},
{
"sizes": ["100K", "1M"],
"threshold": 0.95
}
]
The schema of the experiments can be found in benchmarking/schema/experiments.json.