Restart Recovery
Beam persists intent and message state so operators can restart the directory or message bus without silently losing active work.
Directory recovery
On boot, the directory inspects every non-terminal intent in `intent_log`.
- `received`, `validated`, `queued`, and `dispatched` intents are finalized as `failed` with a retryable recovery error. These requests were interrupted before Beam could prove a delivery outcome.
- `delivered` intents are treated differently. Beam keeps them open if the original result timeout has not expired yet, because the recipient may still reconnect and return a late result for the same `nonce`.
- Recovered `delivered` intents are swept in the background. If the original timeout window expires and no late result arrives, Beam finalizes them as `failed` with `TIMEOUT`.
- If a late result does arrive after restart, Beam reconciles the persisted record instead of creating a second delivery. The original `nonce` stays authoritative.
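The boot-time rules above can be sketched as a single decision function. This is an illustration only, not Beam's implementation: the function name, the `delivered_at_ms` and `timeout_ms` parameters, and the returned action labels are all hypothetical. Only the status names and the `TIMEOUT` outcome come from the description above.

```python
from typing import Optional

# Statuses that were interrupted before a delivery outcome could be proven.
INTERRUPTED = {"received", "validated", "queued", "dispatched"}


def recovery_action(
    status: str,
    delivered_at_ms: Optional[int],  # hypothetical: when the intent reached "delivered"
    timeout_ms: Optional[int],       # hypothetical: timeout stored with the intent, if any
    now_ms: int,
    default_timeout_ms: int,
) -> str:
    """Return a label for what recovery should do with one persisted intent."""
    if status in INTERRUPTED:
        # Finalize as failed with a retryable recovery error.
        return "fail_retryable"
    if status == "delivered":
        if delivered_at_ms is None:
            raise ValueError("delivered intent needs a delivery timestamp")
        window = timeout_ms if timeout_ms is not None else default_timeout_ms
        if now_ms - delivered_at_ms >= window:
            # Window expired with no late result: finalize as failed with TIMEOUT.
            return "fail_timeout"
        # The recipient may still return a late result for the same nonce.
        return "keep_open"
    # Terminal intents are not touched by recovery.
    return "leave_unchanged"
```

The same function could serve both the boot pass and the background sweep, since each only needs to compare the elapsed time against the original window.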
Operator expectation:
- a restart may turn interrupted dispatches into retryable failures
- already delivered work can still complete successfully after the process comes back
- replaying the same `nonce` after recovery never creates a second trace for a cached success
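The last expectation is an idempotency guarantee keyed by `nonce`. A minimal model of it, assuming an in-memory dictionary as a stand-in for the persisted record (the `IntentStore` class and its `submit` method are hypothetical names, not part of Beam):

```python
from typing import Callable, Dict


class IntentStore:
    """Toy stand-in for the persisted intent record, keyed by nonce."""

    def __init__(self) -> None:
        self._final: Dict[str, dict] = {}
        self.deliveries = 0  # counts real delivery attempts, for illustration

    def submit(self, nonce: str, deliver: Callable[[], dict]) -> dict:
        cached = self._final.get(nonce)
        if cached is not None:
            # Replay of a known nonce: return the stored outcome, deliver nothing.
            return cached
        self.deliveries += 1
        result = deliver()
        self._final[nonce] = result
        return result
```

Submitting the same `nonce` twice in this model calls `deliver` once and returns the stored outcome the second time, which is the behavior an operator should observe after a restart.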
Message bus recovery
The message bus replays persisted work on startup.
- `received` and `dispatched` bus messages are requeued immediately on boot
- the original `nonce` is preserved
- queued retry windows remain intact because `next_retry_at` is stored in SQLite
- `delivered`, `acked`, `failed`, and `dead_letter` messages are left unchanged
This means a crash during an outbound delivery attempt does not strand the message in an in-memory state that the retry worker can no longer see.
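As a rough sketch of the startup replay, the query below selects the rows that would be requeued. The table name `beam_messages` and the `next_retry_at` field come from this document. The `id`, `nonce`, and `status` column names, the column types, and both function names are assumptions made for the example.

```python
import sqlite3
from typing import List, Tuple

# Statuses that are replayed on boot; all others are left unchanged.
REPLAYABLE = ("received", "dispatched")


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with an assumed beam_messages layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE beam_messages ("
        "id INTEGER PRIMARY KEY, nonce TEXT NOT NULL, "
        "status TEXT NOT NULL, next_retry_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO beam_messages (nonce, status, next_retry_at) VALUES (?, ?, ?)",
        [
            ("n1", "received", None),
            ("n2", "dispatched", 1700000000000),
            ("n3", "delivered", None),
            ("n4", "dead_letter", None),
            ("n5", "acked", None),
        ],
    )
    conn.commit()
    return conn


def load_requeue_candidates(conn: sqlite3.Connection) -> List[Tuple]:
    """Return (nonce, status, next_retry_at) for messages to replay on boot."""
    return conn.execute(
        "SELECT nonce, status, next_retry_at FROM beam_messages "
        "WHERE status IN (?, ?) ORDER BY id",
        REPLAYABLE,
    ).fetchall()
```

Because the rows are read back with their stored `nonce` and `next_retry_at`, nothing about the retry schedule has to be reconstructed from memory after a crash.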
Operational notes
- Keep the SQLite database on durable storage. Recovery depends on persisted `intent_log`, `intent_trace_events`, and `beam_messages` rows surviving the restart.
- `RELAY_TIMEOUT_MS` controls the default timeout used when Beam has to recover an orphaned `delivered` intent and no explicit timeout was stored in the trace details.
- `RELAY_RECOVERY_SWEEP_INTERVAL_MS` controls how often the directory checks recovered `delivered` intents for expiry after restart.
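The timeout fallback described above might be resolved along these lines. The two environment variable names are from this document. The helper names and the fallback values are placeholders chosen for the sketch, since the actual defaults are not stated here.

```python
import os
from typing import Mapping, Optional

# Placeholder fallbacks for this sketch only; the real defaults are not documented above.
FALLBACK_TIMEOUT_MS = 30_000
FALLBACK_SWEEP_INTERVAL_MS = 5_000


def read_ms(env: Mapping[str, str], name: str, fallback: int) -> int:
    """Read a positive millisecond value from env, or return the fallback if unset."""
    raw = env.get(name, "").strip()
    if not raw:
        return fallback
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of milliseconds")
    return value


def effective_timeout_ms(
    stored_timeout_ms: Optional[int],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Prefer the timeout stored with the intent, else fall back to RELAY_TIMEOUT_MS."""
    if stored_timeout_ms is not None:
        return stored_timeout_ms
    return read_ms(env, "RELAY_TIMEOUT_MS", FALLBACK_TIMEOUT_MS)
```

The sweep interval could be read the same way, for example `read_ms(os.environ, "RELAY_RECOVERY_SWEEP_INTERVAL_MS", FALLBACK_SWEEP_INTERVAL_MS)`.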
Recommended verification
For release or deployment checks, verify at least these scenarios:
- restart the directory after a recipient has received an intent but before it has returned a result
- restart the message bus while a message is in `dispatched`
- confirm that resending the same `nonce` returns the cached result or cached failure instead of creating a duplicate delivery