Not all data was sent to Gains in time for nightly processing

Incident Report for Cary Group AB - Statuspage

Postmortem

The root cause of the degraded performance has been identified as a combination of distributed lock congestion and line duplication. A more detailed analysis, and an update on the performance of the integration platform, will be posted here after the summer vacations (week starting August 10th).

Update July 30th, 2025

The integration platform contains an "order aggregator" function that creates an "aggregated order line" message from an order line and its corresponding header. This aggregated order line combines the information from both the order header and line.

This greatly simplifies integrations that deal only with order lines but also need some information from the order header. With the aggregated order line, they do not have to keep any state themselves.

Gains is such an integration.
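
As an illustration only, the aggregated order line can be thought of as a flat record that carries the line fields together with the header fields. The sketch below is not the platform's actual message schema: the class names, the `aggregate` function and every field other than the order number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderHeader:
    order_number: str
    market: str        # hypothetical header field
    customer_id: str   # hypothetical header field


@dataclass(frozen=True)
class OrderLine:
    order_number: str
    line_number: int
    article: str       # hypothetical line field
    quantity: int      # hypothetical line field


@dataclass(frozen=True)
class AggregatedOrderLine:
    # Line fields and header fields side by side, so a consumer
    # never has to look up the header separately.
    order_number: str
    line_number: int
    article: str
    quantity: int
    market: str
    customer_id: str


def aggregate(header: OrderHeader, line: OrderLine) -> AggregatedOrderLine:
    """Combine one order line with its header into a single flat message."""
    if header.order_number != line.order_number:
        raise ValueError("line does not belong to this header")
    return AggregatedOrderLine(
        order_number=line.order_number,
        line_number=line.line_number,
        article=line.article,
        quantity=line.quantity,
        market=header.market,
        customer_id=header.customer_id,
    )
```

A consumer that only receives `AggregatedOrderLine` messages can read `market` or `customer_id` directly from each line.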

The implementation of the order aggregator is scaled out and runs concurrent processes for performance reasons. It uses a distributed locking mechanism to coordinate aggregation within the same order number between these processes.

This implementation is fast for orders with only a few lines and gets slower for orders with many lines. For example, if we run 50 order aggregator processes and an order with 50 order lines arrives on the service bus, then one process holds the distributed lock for that order number while up to 49 processes wait for it.
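
The effect can be illustrated with a simplified model. The sketch below is not a measurement of the real system: it assumes every line takes the same time to aggregate, that lines of the same order must run one after another because of the lock, and it computes only a lower bound on the wall-clock time. The function name and the one-second cost per line are assumptions made for the example.

```python
from collections import Counter


def estimate_seconds(order_numbers, workers, seconds_per_line=1.0):
    """Lower bound on processing time when each order number has its own lock.

    order_numbers holds one entry per order line (the order it belongs to).
    """
    if not order_numbers:
        return 0.0
    lines_per_order = Counter(order_numbers)
    # Best case if all work could be spread evenly over the workers.
    parallel_bound = len(order_numbers) * seconds_per_line / workers
    # Lines of the largest order are serialized by the lock.
    lock_bound = max(lines_per_order.values()) * seconds_per_line
    return max(parallel_bound, lock_bound)


one_big_order = ["A"] * 50                              # 50 lines, same order
fifty_small_orders = [f"order-{i}" for i in range(50)]  # 50 lines, 50 orders

print(estimate_seconds(one_big_order, workers=50))       # 50.0
print(estimate_seconds(fifty_small_orders, workers=50))  # 1.0
```

In this model, the same number of lines takes 50 times longer when they all belong to one order, no matter how many processes are running.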

This "lock congestion" is a consequence of keeping the implementation simple. It had not caused problems at any point since the order aggregator was implemented. To mitigate lock congestion, we had planned for "batch processing", where each process receives an array of orders instead of a single order. This was a documented idea but had not been implemented: performance had been good enough, and the added complexity was not deemed worth it.

When the UK ran a full historical load on July 9th, we experienced severe lock congestion due to the number and shape of the orders. Eventually, all orders were correctly processed, but the order aggregator was too slow.

To mitigate this performance problem, we have since implemented "batch processing" in the order aggregator (version 1.3.5042) and expect throughput to scale linearly with the number of processes. This new version of the order aggregator is expected to be deployed in mid-August, when the team is back from summer vacations.
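
One plausible reading of batch processing, sketched below, is that a process groups the messages in its batch by order number and takes the distributed lock once per order rather than once per line. This is an assumption about the approach, not the code in version 1.3.5042. The message shape and the `acquire_lock`, `release_lock` and `aggregate_order` callables are hypothetical stand-ins for the real lock and aggregation logic.

```python
from collections import defaultdict


def process_batch(messages, acquire_lock, release_lock, aggregate_order):
    """Aggregate a batch of order line messages, locking once per order.

    messages: iterable of dicts with at least an "order_number" key
              (hypothetical message shape).
    """
    # Group the batch so all lines of an order are handled together.
    by_order = defaultdict(list)
    for message in messages:
        by_order[message["order_number"]].append(message)

    results = []
    for order_number, lines in by_order.items():
        acquire_lock(order_number)
        try:
            results.extend(aggregate_order(order_number, lines))
        finally:
            # Always release, even if aggregation fails.
            release_lock(order_number)
    return results
```

With this shape, a batch that contains 50 lines of one order causes one lock acquisition instead of 50 competing ones.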

Early load tests indicate that the UK data load from July 9th would have been processed in 2-3 hours with the new version of the order aggregator.

Update August 13th, 2025

The update has been deployed and the sustained throughput during normal operations is measured to be 6 000 orders per minute. We can manually scale out to 30 000 orders per minute within a few minutes. The 2-3 hour estimate above is well within our limits.
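
For scale, the rates above can be converted into the number of orders that fit in a 2 to 3 hour window. This is plain arithmetic on the figures in this update. The size of the UK load is not stated in this report, so the sketch says nothing about how long that load would actually take. The function and constant names are made up for the example.

```python
SUSTAINED_PER_MINUTE = 6_000    # measured during normal operations
SCALED_OUT_PER_MINUTE = 30_000  # after manual scale-out


def orders_in(hours, per_minute):
    """Number of orders processed in the given time at a constant rate."""
    return hours * 60 * per_minute


for hours in (2, 3):
    print(hours, "h sustained: ", orders_in(hours, SUSTAINED_PER_MINUTE))
    print(hours, "h scaled out:", orders_in(hours, SCALED_OUT_PER_MINUTE))
```

At the sustained rate this is 720 000 orders in 2 hours and 1 080 000 orders in 3 hours. Scaled out, it is 3 600 000 and 5 400 000 orders respectively.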

Posted Jul 11, 2025 - 17:05 CEST

Resolved

Gains is processing the data manually at approximately 09:00 UTC, so all data should be available in Gains shortly.
Posted Jul 10, 2025 - 10:59 CEST

Identified

Not all data was sent to Gains in time to meet the deadline for nightly processing on Wednesday night. Due to high load on the integration platform, some data for the following markets missed this deadline:

* Sweden
* Denmark
* Spain
* UK

All data was eventually sent successfully and will be processed on Thursday night.
Posted Jul 10, 2025 - 10:56 CEST
This incident affected: Cary Group - Gains integration.