Development

Implementation Details

Components

The entity service is implemented in Python and comprises the following components:

  • A Gunicorn/Flask backend that implements the HTTP REST API.
  • Celery backend workers that do the actual work; these interface with the anonlink library.
  • An Nginx frontend to reverse proxy the Gunicorn/Flask backend application.
  • A MinIO object store for large files such as raw uploaded hashes and results.
  • A Postgres database that stores the linking metadata.
  • A Redis task queue that interfaces between the Flask app and the Celery workers. Redis also acts as an ephemeral cache.

Each of these has been packaged as a Docker image; however, external services (Redis, Postgres, MinIO) can be used instead, configured through environment variables. Multiple workers can be used to distribute the work across more than one machine - by default all cores will be used for computing similarity scores and encrypting the mask vector.
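
As a rough illustration only (the variable names below are placeholders; the authoritative names are defined in the service's configuration code), a component could pick up the locations of external services like this:

import os

# Placeholder environment variable names, for illustration only.
database_server = os.environ.get("DATABASE_SERVER", "db:5432")
redis_server = os.environ.get("REDIS_SERVER", "redis:6379")
object_store_server = os.environ.get("MINIO_SERVER", "minio:9000")

print(f"Postgres: {database_server}, Redis: {redis_server}, MinIO: {object_store_server}")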

Dependencies

The Anonlink Entity Service uses the Python dependencies listed in base/requirements.txt. These can be installed manually using pip:

pip install -r base/requirements.txt

Docker is used for packaging the application; we rely on a base image that includes the operating-system-level and Python-level dependencies. To update a dependency, change the pinned version in base/requirements.txt or base/Dockerfile. Our CI system will bake the base image and tag it with a digest.

If you are so inclined, you can generate the digest yourself with bash (an example digest is shown):

$ cd base
$ sha256sum requirements.txt Dockerfile | sha256sum | cut -f 1 -d " " | tr '[:upper:]' '[:lower:]'
3814723844e4b359f0b07e86a57093ad4f88aa434c42ced9c72c611bbcf9819a
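
The same digest can also be computed in Python. This is a minimal sketch that mirrors the shell pipeline above (it assumes the standard two-space sha256sum output format):

import hashlib
from pathlib import Path

def base_image_digest(directory="base"):
    # Hash each file, reproduce sha256sum's "<digest>  <name>" lines,
    # then hash the concatenation of those lines (as piping into sha256sum does).
    lines = []
    for name in ("requirements.txt", "Dockerfile"):
        digest = hashlib.sha256((Path(directory) / name).read_bytes()).hexdigest()
        lines.append(f"{digest}  {name}\n")
    return hashlib.sha256("".join(lines).encode()).hexdigest()

print(base_image_digest())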

A microservice can then be updated to use this base image. In the application's Dockerfile there is an overridable digest:

ARG VERSION=4b497c1a0b2a6cc3ea848338a67c3a129050d32d9c532373e3301be898920b55
FROM data61/anonlink-base:${VERSION}

Either update this digest in the Dockerfile, or pass an extra argument when building with docker build:

--build-arg VERSION=3814723844e4b359f0b07e86a57093ad4f88aa434c42ced9c72c611bbcf9819a

Note the CI system automatically uses the current base image when building the application images.

Redis

Redis is used as the default message broker for Celery, as well as a cross-container in-memory cache.

Redis key/values used directly by the Anonlink Entity Service:

Key                        Redis Type   Description
"entityservice-status"     String       pickled status
"run:{run_id}"             Hash         run info
"clk-pkl-{dp_id}"          String       pickled encodings
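
As a minimal sketch (the connection settings and dp_id value are placeholders, while the keys are those from the table above), these values can be read back with redis-py and unpickled:

import pickle
import redis

r = redis.Redis(host="localhost", port=6379)  # deployment-specific connection settings

raw_status = r.get("entityservice-status")
if raw_status is not None:
    status = pickle.loads(raw_status)  # stored pickled by the service

dp_id = 1  # placeholder data provider id
raw_encodings = r.get(f"clk-pkl-{dp_id}")  # pickled encodings for one data provider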

Redis Cache: Run Info

The run info HASH stores:

  • similarity scoring progress for each run under "progress"
  • run state under "state"; the currently valid states are {active, complete, deleted}. See backend/entityservice/cache/active_runs.py for the implementation, and the sketch below for an example of reading and updating this hash.
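
A minimal sketch of working with this hash, assuming a reachable Redis instance and a placeholder run id (the service's real logic lives in the cache module referenced above):

import redis

r = redis.Redis(host="localhost", port=6379)  # deployment-specific connection settings
run_id = "example-run"  # placeholder run identifier
key = f"run:{run_id}"

# Mark the run active and record similarity scoring progress.
r.hset(key, "state", "active")
r.hset(key, "progress", 1000)

# Read the fields back; redis-py returns bytes.
state = r.hget(key, "state")
progress = int(r.hget(key, "progress") or 0)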

Object Store

Write access to an AWS S3-compatible object store is required to store intermediate files for the Anonlink Entity Service. The optional feature of data upload via the object store also requires access to an AWS S3-compatible object store, along with authorization to create temporary credentials.

MinIO is an open-source object store implementation that can be used instead of AWS S3 in both Docker Compose and Kubernetes deployments.
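
For example, a minimal sketch using the MinIO Python client against a local MinIO instance (the endpoint, credentials and names below are placeholders) could look like:

from minio import Minio

client = Minio(
    "localhost:9000",              # placeholder S3-compatible endpoint
    access_key="minio-access-key", # placeholder credentials
    secret_key="minio-secret-key",
    secure=False,
)

bucket = "entityservice-files"  # placeholder bucket name
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Store an intermediate file, e.g. raw uploaded encodings.
client.fput_object(bucket, "encodings.bin", "/tmp/encodings.bin")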

Deployment Testing

Testing Local Deployment

The Docker Compose file tools/ci.yml is deployed along with tools/docker-compose.yml. It defines additional containers that run benchmarks and tests after a short delay.
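
The same combination can be brought up locally by passing both files to docker-compose, for example:

docker-compose -f tools/docker-compose.yml -f tools/ci.yml up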

Testing K8s Deployment

The Kubernetes deployment uses Helm with the template found in deployment/entity-service. Jenkins additionally defines the Docker image versions to use and ensures that an ingress is not provisioned. The deployment is configured to be quite conservative in its use of cluster resources.

The k8s deployment test is limited to 30 minutes and an effort is made to clean up all created resources.

After waiting a few minutes for the deployment, a Kubernetes Job is created using kubectl create.

This Job includes a 1 GiB persistent volume claim to which the results are written (as results.xml). During testing the pytest output is rendered, and then the Job's pod terminates. We create a temporary pod that mounts the same results volume and copy out the produced test result artifact.
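
As a rough sketch of that final step (the manifest, pod name and paths below are placeholders), copying the artifact out might look like:

kubectl create -f results-reader-pod.yaml        # temporary pod mounting the results volume
kubectl cp results-reader:/mnt/results.xml ./results.xml
kubectl delete pod results-reader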