Changelog

Next Version

Version 1.15.0

Highlights

Similarity scores are deduplicated

Previously candidate pairs that appear in more than one block would produce more than one similarity score. The iterator that processing similarity scores now de-duplicates before storing them.

Implemented in: #660

Provided Block Identifiers are now hashed

We now hash the user provided block identifier before storing in DB.

Implemented in: #633

Failed runs return message indicating the failure reason

The run status for a failed run now includes a message attribute with information on what went wrong.

Implemented in: #624

Other changes

The run status endpoint now includes total_number_of_comparisons for completed runs. Implemented in: #651

As usual lots of version upgrades - now using the latest stable redis and postgresql.

Version 1.14.0

Highlights

API now supports directly downloading similarity scores from the internal object store

If the request includes the header RETURN-OBJECT-STORE-ADDRESS, the response will be a small json payload with temporary download credentials to pull the _binary_ similarity scores directly from the object store. The json object has credentials and object keys:

{
  "credentials": {
    "AccessKeyId": "",
    "SecretAccessKey": "",
    "SessionToken": "",
    "Expiration": "<ISO 8601 datetime string>"
  },
  "object": {
      "endpoint": "<config.DOWNLOAD_OBJECT_STORE_SERVER>",
      "secure": "<config.DOWNLOAD_OBJECT_STORE_SECURE>",
      "bucket": "bucket_name",
      "path": "path"
  }
}

The binary file is serialized using anonlink.serialization, you can convert the stream into Python types with:

mc = Minio(file_info['endpoint'], ...)
candidate_pair_stream = mc.get_object(file_info['bucket'], file_info['path'])
sims, (dset_is0, dset_is1), (rec_is0, rec_is1) = anonlink.serialization.load_candidate_pairs(candidate_pair_stream)

The following settings control the optional feature of using an external object store:

Environment Variable Helm Config
DOWNLOAD_OBJECT_STORE_SERVER anonlink.objectstore.downloadServer
DOWNLOAD_OBJECT_STORE_SECURE anonlink.objectstore.downloadSecure
DOWNLOAD_OBJECT_STORE_ACCESS_KEY anonlink.objectstore.downloadAccessKey
DOWNLOAD_OBJECT_STORE_SECRET_KEY anonlink.objectstore.downloadSecretKey
DOWNLOAD_OBJECT_STORE_STS_DURATION - (default 43200 seconds)

Implemented in: #594, #612, #613, #614

Service now uses sqlalchemy for database migrations

Sqlalchemy models have been added for all database tables, initial database setup now uses alembic for migrations. The database and object store init scripts can now be run multiple times without causing issues.

Implemented in #603, #611

New configurable limits on maximum number of candidate pairs

Protects the service from running out of memory due to excessive numbers of candidate pairs being processed. An added side effect is the service now keeps track of the number of candidate pairs in a run (as well as the number of comparisons).

The configurable is controlled by the following two environment variables, and their initial default values:

SOLVER_MAX_CANDIDATE_PAIRS="100_000_000"
SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS="500_000_000"

If a run exceeds these limits, the run is put into an error state and further processing is abandoned to protect the service from running out of memory.

Implemented in #595, #605

Other changes

  • Ingress now supports a user supplied path. We no longer assume an nginx ingress controller. #587
  • Migrate off deprecated k8s chart repos #596, #588
  • Helm chart now uses standard recommended Kubernetes labels. #616
  • Fix an issue with case sensitivity in object store metadata #590
  • If the object store bucket doesn’t exist it is now automatically created. #577
  • Ignore but log failures to delete from object store #576
  • Many dependency updates #578, #579, #580, #582, #581, #583, #596, #604, #609, #615
  • Update the base image, all base dependencies and migrated from minio-py v5 to v7 #601, #608, #610
  • CI e2e tests on Kubernetes will now correctly fail if the tests don’t run. #618
  • Add optional pod annotations to init jobs. #619

Version 1.13.0

  • extended tutorial to include upload to object store #573
  • chart update #572

Version 1.13.0-beta3

  • Improved performance for blocks of small size #563
  • fix a problem with the upload to the external object store #564
  • updated documentation #567, $569

Version 1.13.0-beta2

Adds support for users to supply blocking information along with encodings. Data can now be uploaded to an object store and pulled by the Anonlink Entity Service instead of uploaded via the REST API. This release includes substantial internal changes as encodings are now stored in Postgres instead of the object store.

  • Feature to pull data from an object store and create temporary upload credentials. #537, #544, #551
  • Blocking implementation #510 #527,
  • Benchmarking #478, #541
  • Encodings are now stored in Postgres database instead of files in an object store. #516, #522
  • Start to add integration tests to complement our end to end tests. #520, #528
  • Use anonlink-client instead of clkhash #536
  • Use Python 3.8 in base image. #518
  • A base image is now used for all our Docker images. #506, #511, #517, #519
  • Binary encodings now stored internally with their encoding id. #505
  • REST API implementation for accepting clknblocks #503
  • Update Open API spec to version 3. Add Blocking API #479
  • CI Updates #476
  • Chart updates #496, #497, #539
  • Documentation updates (production deployment, debugging with PyCharm) #473, #504
  • Fix Jaeger #500, #523

Misc changes/fixes: - Detect invalid encoding size as early as possible #507 - Use local benchmark cache #531 - Cleanup docker-compose #533, #534, #547 - Calculate number of comparisons accounting for user supplied blocks. #543

Version 1.13.0-beta

  • Fixed a bug where a dataprovider could upload their clks multiple times in a project using the same upload token. (#463)

  • Fixed a bug where workers accepted work after failing to initialize their database connection pool. (#477)

  • Modified similarity_score output to follow the group format in preparation to extending this output type to more parties. (#464)

  • Tutorials have been improved following an internal review. (#467)

  • Database schema and CLK upload api has been modified to support blocking. (#470)

  • Benchmarking results can now be saved to an object store without authentication. Allowing an AWS user to save to S3 using node permissions. (#490)

  • Removed duplicate/redundant tests. (#466)

  • Updated dependencies:

    • We have enabled dependabot on GitHub to keep our Python dependencies up to date.
    • anonlinkclient now used for benchmarking. (#490)
    • Chart dependencies redis-ha, postgres and minio all updated. (#496, #497)

Breaking Changes

  • the similarity_score output type has been modified, it now returns a JSON array of JSON objects, where such an object looks like [[party_id_0, row_index_0], [party_id_1, row_index_1], score]. (#464)
  • Integration test configuration is now consistent with benchmark config. Instead of setting ENTITY_SERVICE_URL including /api/v1 now just set the host address in SERVER. (#495)

Database Changes (Internal)

  • the dataproviders table uploaded field has been modified from a BOOL to an ENUM type (#463)
  • The projects table has a new uses_blocking field. (#470)

Version 1.13.0-alpha

  • fixed bug where invalid state changes could occur when starting a run (#459)

  • matching output type has been removed as redundant with the groups output with 2 parties. (#458)

  • Update dependencies:

    • requests from 2.21.0 to 2.22.0 (#459)

Breaking Change

  • matching output type is not available anymore. (#458)

Version 1.12.0

  • Logging configurable in the deployed entity service by using the key loggingCfg. (#448)

  • Several old settings have been removed from the default values.yaml and docker files which have been replaced by CHUNK_SIZE_AIM (#414):

    • SMALL_COMPARISON_CHUNK_SIZE
    • LARGE_COMPARISON_CHUNK_SIZE
    • SMALL_JOB_SIZE
    • LARGE_JOB_SIZE
  • Remove ENTITY_MATCH_THRESHOLD environment variable (#444)

  • Celery configuration updates to solve threads and memory leaks in deployment. (#427)

  • Update docker-compose files to use these new preferred configurations.

  • Update helm charts with preferred configuration default deployment is a minimal working deployment.

  • New environment variables: CELERY_DB_MIN_CONNECTIONS, FLASK_DB_MIN_CONNECTIONS, CELERY_DB_MAX_CONNECTIONS and FLASK_DB_MAX_CONNECTIONS to configure the database connections pool. (#405)

  • Simplify access to the database from services relying on a single way to get a connection via a connection pool. (#405)

  • Deleting a run is now implemented. (#413)

  • Added some missing documentation about the output type groups (#449)

  • Sentinel name is configurable. (#436)

  • Improvement on the Kubernetes deployment test stage on Azure DevOps:

    • Re-order cleaning steps to first purge the deployment and then deleting the remaining. (#426)
    • Run integration tests in parallel, reducing pipeline stage Kubernetes deployment tests from 30 minutes to 15 minutes. (#438)
    • Tests running on a deployed entity-service on k8s creates an artifact containing all the logs of all the containers, useful for debugging. (#445)
    • Test container not restarted on test failure. (#434)
  • Benchmark improvements:

    • Benchmark output has been modified to handle multi-party linkage.
    • Benchmark to handle more than 2 parties, being able to repeat experiments. and pushing the results to minio object store. (#406, #424 and #425)
    • Azure DevOps benchmark stage runs a 3 parties linkage. (#433)
  • Improvements on Redis cache:

    • Refactor the cache. (#430)
    • Run state kept in cache (instead of fully relying on database) (#431 and #432)
  • Update dependencies:

    • anonlink to v0.12.5. (#423)
    • redis from 3.2.0 to 3.2.1 (#415)
    • alpine from 3.9 to 3.10.1 (#404)
  • Add some release documentation. (#455)

Version 1.11.2

  • Switch to Azure Devops pipeline for CI.
  • Switch to docker hub for container hosting.

Version 1.11.1

  • Include multiparty linkage tutorial/example.
  • Tightened up how we use a database connection from the flask app.
  • Deployment and logging documentation updates.

Version 1.11.0

  • Adds support for multiparty record linkage.
  • Logging is now configurable from a file.

Other improvements

  • Another tutorial for directly using the REST api was added.
  • K8s deployment updated to use 3.15.0 Postgres chart. Postgres configuration now uses a global namespace so subcharts can all use the same configuration as documented here.
  • Jenkins testing now fails if the benchmark exits incorrectly or if the benchmark results contain failed results.
  • Jenkins will now execute the tutorials notebooks and fail if any cells error.

Version 1.10.0

  • Updates Anonlink and switches to using Anonlink’s default format for serialization of similarity scores.
  • Sorts similarity scores before solving, improving accuracy.
  • Uses Anonlink’s new API for similarity score computation and solving.
  • Add support for using an external Postgres database.
  • Added optional support for redis discovery via the sentinel protocol.
  • Kubernetes deployment no longer includes a default postgres password. Ensure that you set your own postgresqlPassword.
  • The Kubernetes deployment documentation has been extended.

Version 1.9.4

  • Introduces configurable logging of HTTP headers.
  • Dependency issue resolved.

Version 1.9.3

  • Redis can now be used in highly available mode. Includes upstream fix where the redis sentinels crash.
  • The custom kubernetes certificate management templates have been removed.
  • Minor updates to the kubernetes resources. No longer using beta apis.

Version 1.9.2

  • 2 race conditions have been identified and fixed.
  • Integration tests are sped up and more focused. The test suite now fails after the first test failure.
  • Code tidy-ups to be more pep8 compliant.

Version 1.9.1

  • Adds support for (almost) arbitrary sized encodings. A minimum and maximum can be set at deployment time, and currently anonlink requires the size to be a multiple of 8.
  • Adds support for opentracing with Jaeger.
  • improvements to the benchmarking container
  • internal refactoring of tasks

Version 1.9.0

  • minio and redis services are now optional for kubernetes deployment.
  • Introduction of a high memory worker and associated task queue.
  • Fix issue where we could start tasks twice.
  • Structlog now used for celery workers.
  • CI now tests a kubernetes deployment.
  • Many Jenkins CI updates and fixes.
  • Updates to Jupyter notebooks and docs.
  • Updates to Python and Helm chart dependencies and docker base images.

Version 1.8.1

Improve system stability while handling large intermediate results. Intermediate results are now stored in files instead of in Redis. This permits us to stream them instead of loading everything into memory.

Version 1.8

Version 1.8 introduces breaking changes to the REST API to allow an analyst to reuse uploaded CLKs.

Instead of a linkage project only having one result, we introduce a new sub-resource runs. A project holds the schema and CLKs from all data providers; and multiple runs can be created with different parameters. A run has a status and a result endpoint. Runs can be queued before the CLK data has been uploaded.

We also introduced changes to the result types. The result type permutation, which was producing permutations and an encrypted mask, was removed. And the result type permutation_unecrypyted_mask was renamed to permutations.

Brief summary of API changes: - the mapping endpoint has been renamed to projects - To carry out a linkage computation you must post to a project’s runs endpoint: /api/v1/project/<PROJECT_ID>/runs - Results are now accessed under the `runs endpoint: /api/v1/project/<PROJECT_ID>/runs/<RUN_ID>/result - result type permutation_unecrypyted_mask was renamed to permutations - result type permutation was removed

For all the updated API details check the Open API document.

Other improvements

  • The documentation is now served at the root.
  • The flower monitoring tool for celery is now included with the docker-compose deployment. Note this will be disabled for production deployment with kubernetes by default.
  • The docker containers have been migrated to alpine linux to be much leaner.
  • Substantial internal refactoring - especially of views.
  • Move to pytest for end to end tests.

Version 1.7.3

Deployment and documentation sprint.

  • Fixes a bug where only the top k results of a chunk were being requested from anonlink. #59 #84
  • Updates to helm deployment templates to support a single namespace having multiple entityservices. Helm charts are more standard, some config has moved into a configmap and an experimental cert-manager configuration option has been added. #83, #90
  • More sensible logging during testing.
  • Every http request now has a (globally configurable) timeout
  • Minor update regarding handling uploading empty CLKs. #92
  • Update to latest versions of anonlink and clkhash. #94
  • Documentation updates.

Version 1.7.2

Dependency and deployment updates. We now pin versions of Python, anonlink, clkhash, phe and docker images nginx and postgres.

Version 1.7.0

Added a view type that returns similarity scores of potential matches.

Version 1.6.8

Scalability sprint.

  • Much better chunking of work.
  • Security hardening by modifing the response from the server. Now there is no differences between invalid token and unknown resource - both return a 403 response status.
  • Mapping information includes the time it was started.
  • Update and add tests.
  • Update the deployment to use Helm.