Stuck workflows
Resolved
Jun 10, 2026 at 6:31am UTC
Post-Mortem: Chat & Workflow Outage (June 10, 2026)
Summary
On the morning of June 10, our AI workflows and chat were non-functional for roughly two hours and forty minutes (04:00 to 06:40 CET) due to a misconfiguration introduced during a database migration. Messages sent through chat and AI-driven workflows were not processed during this window. The issue is fully resolved, and we've put additional safeguards in place to catch this class of problem much earlier.
What happened
During a migration of our main caching database, the key eviction strategy was accidentally set to noeviction. With that setting the database never drops old keys to make room for new ones, so RAM usage built up steadily. At around 04:00 CET, the database began rejecting new writes. Critically, it did not crash: memory load was only at ~75%, so it continued to appear healthy and stable from the outside.
Because the database was rejecting writes, our workflows were unable to "lock" a new run in it. Locking is a required step before a run can execute, so in practice workflows were non-functional, even though the underlying systems all looked online.
This is also why the problem wasn't obvious at first: every database appeared to be running normally, when in reality writes were being silently rejected.
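To make the failure mode concrete, here is a minimal Python sketch of a store that stays "up" while refusing writes. It is a toy model, not our production code: `TinyCache`, `try_lock_run`, the `evict_oldest` policy name, and the entry-count limit standing in for a memory limit are all assumptions made for illustration.

```python
class WriteRejected(Exception):
    """Raised when the store is full and its policy forbids eviction."""


class TinyCache:
    """Toy in-memory store (hypothetical). Capacity is counted in entries,
    which stands in for a real memory limit."""

    def __init__(self, max_entries, policy):
        self.max_entries = max_entries
        self.policy = policy  # "noeviction" or "evict_oldest" (illustrative names)
        self.data = {}  # dicts keep insertion order, so the first key is the oldest

    def ping(self):
        # A liveness probe only proves that the process answers.
        return True

    def set_if_absent(self, key, value):
        if key in self.data:
            return False  # someone else already holds this key
        if len(self.data) >= self.max_entries:
            if self.policy == "noeviction":
                raise WriteRejected("limit reached and eviction is disabled")
            del self.data[next(iter(self.data))]  # make room: drop the oldest key
        self.data[key] = value
        return True


def try_lock_run(cache, run_id):
    """Return True if the run was locked, False if the write failed or the lock exists."""
    try:
        return cache.set_if_absent(f"lock:{run_id}", "held")
    except WriteRejected:
        return False


if __name__ == "__main__":
    cache = TinyCache(max_entries=2, policy="noeviction")
    cache.set_if_absent("a", "1")
    cache.set_if_absent("b", "2")
    print("healthy:", cache.ping())                  # healthy: True
    print("locked:", try_lock_run(cache, "run-1"))   # locked: False
```

In this model the health probe and the lock attempt disagree, which is the situation described above: a check that only asks "is the database responding?" cannot see that writes are failing.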
Timeline (CET)
- 04:00 — Database begins rejecting new writes; workflows can no longer lock new runs.
- 06:00 — On-call team becomes aware of the issue and begins investigating.
- 06:40 — Database configuration corrected; workflows and chat fully restored.
Resolution
We corrected the database eviction configuration and confirmed that workflows could lock runs again. The team is actively monitoring RAM usage to ensure the situation remains stable.
What we're doing to prevent this
- We've lowered our RAM alerting threshold from 80% to 65%, so we get a much earlier warning before a database approaches a problematic state.
- We've added dedicated checks that escalate immediately when run-locking begins failing at scale, so a silent failure like this surfaces right away rather than going unnoticed.
- We are working with our cloud provider to understand why the eviction strategy was changed during the migration in the first place, so we can prevent the root cause from recurring.
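The first two safeguards can be sketched roughly as follows. This is an illustrative Python outline, not our actual alerting code: only the 65% and 80% RAM thresholds come from this post, while `LockFailureMonitor`, its window size, and its failure ratio are hypothetical values chosen for the example.

```python
from collections import deque

RAM_ALERT_THRESHOLD = 0.65  # new threshold from this post (previously 0.80)


def ram_alert(used_bytes, total_bytes, threshold=RAM_ALERT_THRESHOLD):
    """Return True when memory usage is at or above the alert threshold."""
    return used_bytes / total_bytes >= threshold


class LockFailureMonitor:
    """Tracks recent run-lock attempts and flags widespread failure.

    The window size and failure ratio are assumed values, not our real settings.
    """

    def __init__(self, window=20, max_failure_ratio=0.5):
        self.results = deque(maxlen=window)  # keeps only the newest attempts
        self.max_failure_ratio = max_failure_ratio

    def record(self, locked):
        self.results.append(bool(locked))

    def should_escalate(self):
        # Wait for a full window so one early failure does not page anyone.
        if len(self.results) < self.results.maxlen:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) >= self.max_failure_ratio


if __name__ == "__main__":
    # At ~75% usage, the old 80% threshold stays silent; the new 65% one fires.
    print(ram_alert(75, 100, threshold=0.80))  # False
    print(ram_alert(75, 100))                  # True
```

The lock-failure check measures the symptom directly (runs that cannot be locked) instead of inferring it from resource usage, so it would still fire if writes were rejected for some other reason.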
We know reliability is everything, and we're sorry for the disruption this caused. If you have any questions or are still seeing issues, please reach out.
Affected services
Created
Jun 10, 2026 at 6:31am UTC
Workflows are stuck and loading forever.