Unable to process full data load from UK

Incident Report for Cary Group AB - Statuspage

Postmortem

The root cause of the incident was that the workers receiving messages from the service bus topics were unable to keep up with the incoming traffic. After two hours, the service bus reached its storage limit, and the incoming API started to reject traffic.

The number of workers scales out automatically, but only up to a set limit. The storage size of the service bus is also capped. The development team has tuned these limits over time to keep costs down for the kind of traffic we normally see.

These scale limits were not sufficient for the full data load on Friday.

We have added a new alert that triggers when the storage size of the service bus reaches a configured threshold. Such an alert would have notified the team before the storage limit was exceeded, giving them time to scale out.
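To illustrate how a storage threshold buys reaction time, here is a minimal sketch of the check such an alert performs. It is not the team's actual alert configuration: the `QueueStorage` model, the helper names, and the 80% threshold are hypothetical, and the 2.5 GB/h growth rate is an average inferred from the 5 GB limit filling in roughly two hours, assuming linear growth.

```python
from dataclasses import dataclass


@dataclass
class QueueStorage:
    """Snapshot of service bus storage usage (hypothetical model)."""
    used_gb: float   # current size of unprocessed messages
    limit_gb: float  # configured storage limit (5 GB during the incident)


def should_alert(storage: QueueStorage, threshold: float = 0.8) -> bool:
    """Return True once usage reaches `threshold`, a fraction of the limit.

    The 0.8 default is an illustrative assumption, not the team's setting.
    """
    return storage.used_gb >= threshold * storage.limit_gb


def hours_until_full(storage: QueueStorage, net_growth_gb_per_h: float) -> float:
    """Estimate hours left before the limit is hit, assuming linear growth."""
    if net_growth_gb_per_h <= 0:
        return float("inf")  # backlog is stable or shrinking
    return (storage.limit_gb - storage.used_gb) / net_growth_gb_per_h


if __name__ == "__main__":
    # Backlog at the assumed 80% threshold of the original 5 GB limit.
    snapshot = QueueStorage(used_gb=4.0, limit_gb=5.0)
    print(should_alert(snapshot))                              # True
    # Minutes of warning at an assumed average net growth of 2.5 GB/h.
    print(round(hours_until_full(snapshot, 2.5) * 60, 1))      # 24.0
```

Under these assumptions, an alert at 80% of the old 5 GB limit would have left roughly 24 minutes to scale out; a lower threshold or a larger limit stretches that window proportionally.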

Posted Jun 13, 2025 - 15:15 CEST

Resolved

At 11:50 UTC, the integration platform started to reject incoming requests. Data processing failed to keep up with incoming traffic, and the backlog eventually exceeded the storage limit of the underlying service bus (set to 5 GB).

The storage limit has been increased to 40 GB, and operations have been back to normal since 12:08 UTC.

* 09:50 UTC full data load started.
* 11:50 UTC incoming requests started to be rejected.
* 12:08 UTC operations back to normal.
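The timeline allows a rough estimate of how quickly the backlog grew and how much headroom the new 40 GB limit provides. The sketch below assumes the queue was empty when the full data load started, that the whole 5 GB was consumed by this backlog, and that it grew linearly; real growth was likely uneven, so treat the result as an average. The helper names are hypothetical, and the date is taken from the posting date.

```python
from datetime import datetime


def implied_growth_rate(start: datetime, full: datetime, limit_gb: float) -> float:
    """Average net backlog growth (GB/h), assuming an empty queue at `start`."""
    hours = (full - start).total_seconds() / 3600
    return limit_gb / hours


def hours_to_fill(limit_gb: float, rate_gb_per_h: float) -> float:
    """Hours for a backlog growing linearly at `rate_gb_per_h` to reach `limit_gb`."""
    return limit_gb / rate_gb_per_h


if __name__ == "__main__":
    start = datetime(2025, 6, 13, 9, 50)  # full data load started (timeline)
    full = datetime(2025, 6, 13, 11, 50)  # incoming requests started to be rejected
    rate = implied_growth_rate(start, full, limit_gb=5.0)
    print(rate)                       # 2.5
    print(hours_to_fill(40.0, rate))  # 16.0
```

On this rough estimate, a load like Friday's would take about 16 hours rather than two to fill the 40 GB limit.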

Posted Jun 13, 2025 - 13:00 CEST