The root cause of the incident is that the workers that receive messages from the service bus topics were unable to keep up with the incoming traffic. After two hours, the service bus reached its storage limit, and the incoming API started to reject traffic.
The number of workers scales out automatically, but only up to a configured limit. The storage size of the service bus is also capped. The development team has tuned these limits over time to optimize cost for the kind of traffic that we normally have.
These scale limits were not sufficient for the full data load on Friday.
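The failure mode described above, where a capped worker pool drains more slowly than traffic arrives until storage fills, can be sketched as a simple rate calculation. This is only an illustration: the function name, parameters, and all numbers are hypothetical and do not reflect the actual rates or limits in our system.

```python
from typing import Optional


def seconds_until_full(
    ingress_bytes_per_s: float,
    drain_bytes_per_worker_per_s: float,
    max_workers: int,
    used_bytes: float,
    limit_bytes: float,
) -> Optional[float]:
    """Estimate the time until the storage limit is hit (hypothetical model).

    Assumes constant rates and that the worker pool is already at its
    scale-out cap. Returns None when the workers keep up with traffic.
    """
    # Total drain capacity once autoscaling has reached its cap.
    drain_bytes_per_s = drain_bytes_per_worker_per_s * max_workers
    # Net growth of the backlog per second.
    net_growth = ingress_bytes_per_s - drain_bytes_per_s
    if net_growth <= 0:
        return None  # Backlog is stable or shrinking.
    remaining = max(limit_bytes - used_bytes, 0.0)
    return remaining / net_growth


# Illustrative numbers only: 5 workers draining 100 B/s each against
# 1000 B/s of ingress fill 3.6 MB of storage in 7200 s (two hours).
print(seconds_until_full(1000, 100, 5, 0, 3_600_000))  # 7200.0
```

Under this simplified model, raising either the worker cap or the storage limit extends the time available before the API starts to reject traffic.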
We have added a new alert that triggers an alarm when the storage size of the service bus reaches a certain threshold. This would have notified the team before the storage limit was exceeded, giving them the opportunity to scale out in time.
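As a rough sketch of the idea behind such an alert, the check compares current storage usage against a fraction of the limit. The function name and the 80% default below are assumptions for illustration, not the actual alert configuration, which lives in our monitoring setup rather than in application code.

```python
def storage_alarm_triggered(
    used_bytes: float,
    limit_bytes: float,
    threshold_ratio: float = 0.8,  # hypothetical default, not the real setting
) -> bool:
    """Return True when storage usage has reached the alert threshold."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if not 0 < threshold_ratio <= 1:
        raise ValueError("threshold_ratio must be in (0, 1]")
    # Fire the alarm before the hard limit so there is time to scale out.
    return used_bytes / limit_bytes >= threshold_ratio


# Illustrative values: 50% usage stays quiet, 90% usage raises the alarm.
print(storage_alarm_triggered(50, 100))  # False
print(storage_alarm_triggered(90, 100))  # True
```

The threshold needs to sit far enough below the hard limit that the team has time to react at the expected backlog growth rate.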