Durable Semantics
This document defines the current durability guarantees and limits in BubuStack. It is a contract for operators, SDK users, and workflow authors.
Who this is for
- Operators who need to understand durability and recovery guarantees.
- Workflow authors who care about retries and idempotency.
- SDK/component authors implementing correct behavior on failures.
What you'll get
- The exact delivery guarantees and failure boundaries.
- How retries, redrive, and cleanup behave.
- The constraints you must design around for correctness.
Delivery model (current)
- StoryRun creation is at-least-once by default. If the same external trigger is retried without an idempotency token, multiple StoryRuns may be created.
- When a trigger token is provided via the SDK, the SDK derives a deterministic StoryRun name and treats AlreadyExists as idempotent when inputs match.
- Trigger tokens must only be reused with identical inputs; mismatches are rejected.
- Impulses can configure delivery behavior (dedupe + retry schedule) via spec.deliveryPolicy. The SDK enforces this when BUBU trigger policy environment variables are present.
- Impulses can throttle trigger submission via spec.throttle. The SDK enforces per-pod rate and concurrency limits and records throttled events in Impulse.status.throttledTriggers and Impulse.status.lastThrottled.
- StepRun creation is idempotent for Engram-backed steps because controllers derive deterministic StepRun names from StoryRun name and step name.
- Step execution is at-least-once. A StepRun may execute multiple times because retries and job recreation can re-run the same step.
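The deterministic naming described above can be illustrated with a minimal sketch. The hashing scheme, the name layout, and the helper names `storyRunName` and `stepRunName` are assumptions made for illustration; the SDK and controllers may derive names differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// storyRunName derives a deterministic StoryRun name from a story name and a
// trigger token. The SHA-256 suffix is an illustrative assumption, not the
// SDK's actual algorithm.
func storyRunName(story, token string) string {
	sum := sha256.Sum256([]byte(story + "\x00" + token))
	// Keep a short hex suffix so the result stays a readable object name.
	return fmt.Sprintf("%s-%s", story, hex.EncodeToString(sum[:])[:12])
}

// stepRunName derives a deterministic StepRun name from the StoryRun name and
// the step name (hypothetical layout).
func stepRunName(storyRun, step string) string {
	return fmt.Sprintf("%s-%s", storyRun, step)
}

func main() {
	first := storyRunName("my-story", "source-event-id-123")
	retry := storyRunName("my-story", "source-event-id-123")
	// A retried trigger with the same token maps to the same name, so the
	// second create call would hit AlreadyExists instead of a new StoryRun.
	fmt.Println(first == retry)
	fmt.Println(stepRunName(first, "charge-card"))
}
```

Because the name is a pure function of its inputs, a create call that fails with AlreadyExists can be treated as "this run was already started" rather than as an error.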
Explicit delivery guarantees (current)
The guarantees below are the explicit contract boundaries for BubuStack today. If you need stronger guarantees (exactly-once effects), you must use idempotency keys and ledgering as described later in this document.
- StoryRun creation
  - At-least-once when no trigger token is used.
  - Idempotent for identical inputs when a trigger token is used and the deterministic StoryRun name matches the token-derived name.
- StepRun creation
  - Idempotent per step for Engram-backed steps: StepRun names are derived deterministically from StoryRun name + step name.
- Step execution
  - At-least-once. Retries and Job recreation can re-run the same step.
- Signals
  - Best-effort delivery. No ordering or replay guarantee unless signal sequences are used and persisted in the StepRun status.
- Streaming transport (bobravoz-grpc)
  - Best-effort by default. When delivery.semantics=at_least_once and replay is enabled, the hub provides at-least-once delivery with replay on reconnect. In-memory buffers can still drop messages on overflow in best-effort modes. See Transport Settings.
Delivery matrix (current)
| Operation | Guarantee | Notes |
|---|---|---|
| StoryRun creation (no trigger token) | At-least-once | Retries can create multiple StoryRuns. |
| StoryRun creation (with trigger token) | Idempotent for identical inputs | Reusing a token with different inputs is rejected. |
| StepRun creation (Engram-backed step) | Idempotent per step | Deterministic names prevent duplicate StepRuns for the same step. |
| Step execution | At-least-once | Retries and job recreation can re-run steps. |
| Signals | Best-effort | No ordering or replay guarantees unless signal sequences are used and persisted in the StepRun status. |
| Streaming transport (default) | Best-effort | At-least-once when delivery semantics + replay are enabled. |
Trigger delivery policy
Impulse delivery policy controls how triggers dedupe and retry StoryRun creation. It is configured on ImpulseTemplate.spec.deliveryPolicy and can be overridden per Impulse via Impulse.spec.deliveryPolicy.
Dedupe modes:
- none: no deduplication; repeated triggers may create multiple StoryRuns.
- token: a trigger token must be provided; missing tokens are rejected.
- key: the SDK derives a token from dedupe.keyTemplate and uses it for idempotency.
Key templates are evaluated deterministically with the SDK template engine. The template can reference:
- inputs (trigger payload map)
- story.name and story.namespace
- impulse.name and impulse.namespace
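As a rough illustration of key-template evaluation, the sketch below renders a template against the three context objects listed above. It uses Go's standard `text/template` as a stand-in for the SDK template engine, which is an assumption; the template syntax, the helper name `renderDedupeKey`, and the sample values are all hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderDedupeKey evaluates a key template against the trigger context.
// text/template stands in for the SDK template engine (an assumption).
func renderDedupeKey(keyTemplate string, data map[string]any) (string, error) {
	// missingkey=error makes a reference to an absent input fail loudly
	// instead of silently producing a shared, colliding key.
	tmpl, err := template.New("dedupe").Option("missingkey=error").Parse(keyTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := map[string]any{
		"inputs":  map[string]any{"orderId": "A-42"},
		"story":   map[string]any{"name": "checkout", "namespace": "shop"},
		"impulse": map[string]any{"name": "webhook", "namespace": "shop"},
	}
	key, err := renderDedupeKey("{{.story.namespace}}/{{.story.name}}/{{.inputs.orderId}}", data)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(key) // shop/checkout/A-42
}
```

A key built only from stable fields of the event (here an order ID) yields the same token on every redelivery, which is what lets the key mode behave like an explicit trigger token.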
Retry schedule (trigger delivery, not step execution):
- maxAttempts: total attempts including the first.
- baseDelay: initial retry delay (Go duration string).
- maxDelay: cap for computed delays.
- backoff: exponential, linear, or constant.
Retries are only attempted for retryable Kubernetes API errors. If a trigger
token is used (explicitly or via dedupe.keyTemplate), repeated attempts map to
the same StoryRun; if inputs differ, the SDK rejects the retry.
When a trigger token is set, the StoryRun must include a trigger input hash
annotation (storyrun.bubustack.io/trigger-input-hash) that matches the inputs.
The SDK sets this automatically; non-SDK clients must compute and supply it.
Custom clients that do not use the SDK must implement the same behavior to respect the policy.
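For non-SDK clients, the input-hash check might look like the sketch below. The annotation key comes from this document; the hash format (hex SHA-256 over the JSON encoding of the inputs) and the helper names `hashInputs` and `checkReuse` are assumptions, so a real client must match whatever format the SDK actually writes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

const triggerInputHashAnnotation = "storyrun.bubustack.io/trigger-input-hash"

// hashInputs returns a hex SHA-256 digest of the JSON encoding of inputs.
// encoding/json sorts map keys, so equal maps produce equal digests.
// The real hash format is defined by the SDK; this one is an assumption.
func hashInputs(inputs map[string]any) (string, error) {
	raw, err := json.Marshal(inputs)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// checkReuse rejects a trigger token reuse when the annotation stored on the
// existing StoryRun does not match the hash of the new inputs.
func checkReuse(existingAnnotations map[string]string, inputs map[string]any) error {
	want, err := hashInputs(inputs)
	if err != nil {
		return err
	}
	if existingAnnotations[triggerInputHashAnnotation] != want {
		return errors.New("trigger token reused with different inputs")
	}
	return nil
}

func main() {
	original := map[string]any{"orderId": "A-42", "amount": 100}
	stored, _ := hashInputs(original)
	annotations := map[string]string{triggerInputHashAnnotation: stored}

	same := map[string]any{"amount": 100, "orderId": "A-42"} // key order does not matter
	changed := map[string]any{"orderId": "A-42", "amount": 250}

	fmt.Println(checkReuse(annotations, same))    // <nil>
	fmt.Println(checkReuse(annotations, changed)) // trigger token reused with different inputs
}
```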
Retry and idempotency expectations
- Retries are controlled by StepRun retry policies and can re-execute steps.
- Trigger delivery retries are separate from StepRun retries and only govern StoryRun creation.
- Safe retry requires idempotent external side effects or external idempotency keys.
- Use stable identifiers derived from StoryRun and StepRun identity for idempotency keys.
- Preserve the original trigger token at the event source and reuse it for retries so replays resolve to the same StoryRun.
Exactly-once via idempotency (explicit model)
BubuStack does not provide native exactly-once execution for step side effects. Instead, it supports effectively exactly-once behavior only when you:
- Use stable idempotency keys derived from StoryRun/StepRun identity.
- Record side effects in a durable ledger (StepRun status or external system).
- Make external calls idempotent or transactional using those keys.
What “exactly-once” means in BubuStack:
- Exactly-once side effects are achieved by the caller using idempotency keys and ledgering. BubuStack provides the identifiers and persistence hooks, but it does not prevent duplicates by itself.
- Exactly-once does not apply to step execution. A step can run multiple times under retry/recreate conditions; only the effects can be deduped.
Known failure modes / boundaries:
- Job retries, controller restarts, or kube-apiserver errors can re-run a step.
- If your external system ignores idempotency keys, duplicate effects can occur.
- If you emit effects before recording them durably, you can observe duplicates on retry.
Recommended pattern:
- Generate a stable idempotency key.
- Check your effect ledger (or external system) to see if the effect exists.
- Write durable state before side effects when possible.
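The recommended pattern can be sketched with an in-memory ledger standing in for StepRun status or an external store. The `Ledger` type and `ApplyOnce` method are hypothetical and are not part of bubu-sdk-go; a real ledger must be durable, which a Go map is not.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is an in-memory stand-in for a durable effect ledger (hypothetical).
type Ledger struct {
	mu      sync.Mutex
	entries map[string]string // idempotency key -> "pending" | "succeeded"
}

func NewLedger() *Ledger { return &Ledger{entries: map[string]string{}} }

// ApplyOnce runs effect unless the ledger already shows it succeeded.
// It reports whether the effect ran during this call.
func (l *Ledger) ApplyOnce(key string, effect func(idempotencyKey string) error) (bool, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Check the ledger first.
	if l.entries[key] == "succeeded" {
		return false, nil
	}
	// Record intent before the side effect. A crash after this point leaves a
	// "pending" entry, so the retry re-sends the same key and relies on the
	// external system to dedupe it.
	l.entries[key] = "pending"
	if err := effect(key); err != nil {
		return false, err
	}
	l.entries[key] = "succeeded"
	return true, nil
}

func main() {
	ledger := NewLedger()
	// Stable key derived from StoryRun and step identity.
	key := fmt.Sprintf("storyrun/%s/step/%s", "my-story-abc", "charge-card")

	calls := 0
	charge := func(string) error { calls++; return nil }

	// Simulate the step being executed three times (at-least-once execution).
	for attempt := 1; attempt <= 3; attempt++ {
		applied, err := ledger.ApplyOnce(key, charge)
		fmt.Println(attempt, applied, err)
	}
	fmt.Println("effect calls:", calls) // effect calls: 1
}
```

The step body still runs on every retry; only the side effect is suppressed, which matches the distinction between exactly-once execution and exactly-once effects drawn above.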
SDK usage patterns
The following examples use bubu-sdk-go. See Go SDK for the full API reference.
Example: idempotent StoryRun creation with a trigger token.

    // Attach the trigger token so retries of the same event resolve to the same StoryRun.
    ctx := sdk.WithTriggerToken(ctx, "source-event-id-123")
    run, err := sdk.StartStory(ctx, "my-story", inputs)

Example: stable idempotency keys for external side effects.

    key := fmt.Sprintf("storyrun/%s/step/%s", run.Name, stepID)

Example: record side effects in the StepRun ledger.

    if err := sdk.RecordEffect(ctx, key, "succeeded", map[string]any{"providerId": id}); err != nil {
        // Treat as soft failure if you can tolerate missing ledger entries.
    }
Recovery rules (current)
- On bobrapet controller restart, StoryRun reconciliation rehydrates StepState from existing StepRuns and merges terminal phases without clobbering completed steps.
- StepRun reconciliation reattaches to the Job by name when it exists.
- If a Job is missing while a StepRun is still non-terminal, a new Job is created and the step is re-executed.
- Resume vs restart rules:
  - Resume: if the Job exists, the controller resumes monitoring and sets the StepRun to Running when it was still Pending.
  - Restart: if the Job is missing after a prior execution, the controller recreates the Job and records restart metadata on the StepRun.
  - Restart metadata is tracked via annotations:
    - runs.bubustack.io/job-uid (last observed Job UID)
    - runs.bubustack.io/restart-count (monotonic restart counter)
    - runs.bubustack.io/restarted-at (RFC3339 timestamp)
- gate and wait steps remain paused until their conditions are satisfied or timeouts apply; gate decisions live in StoryRun status.
- StoryRun redrive is annotation-driven: set storyrun.bubustack.io/redrive-token to a new value. The controller deletes child StepRuns/StoryRuns, clears step timers, resets StoryRun status, and re-runs with the same spec/inputs. StoryRun spec remains immutable; redrive uses metadata only. The controller records the last processed token in storyrun.bubustack.io/redrive-observed.
Timers and schedules
- Story timeouts are enforced by the DAG reconciler.
- wait and gate support poll intervals and timeouts.
- sleep uses a durable timer persisted on the StoryRun (annotation-backed) and pauses execution until the deadline is reached.
- Timer precision is bounded by reconcile cadence and controller requeue delays.
- Cron/schedules are implemented as an external impulse (see cron-impulse for implementation details).
State persistence and history
- Durable state is stored in StoryRun and StepRun status.
- Large payloads are stored via storage references instead of inline status data.
- There is no durable event history log today; retention is managed by StoryRun retention settings and controller cleanup.
- Status updates are eventually consistent at the object level and follow a last-writer-wins model.
- Operational visibility relies on Kubernetes Events (best-effort, not durable, not replayable). BubuStack does not add new CRDs or persist a workflow event history log.
- Resource size guardrails are intentional: signals/effects are bounded lists, signal payloads are capped, and large payloads must be offloaded to storage refs. Avoid writing large aggregates to status.
Signals and events
- Step-level signals are written to StepRun status and merged into step context.
- Signal delivery is best-effort. Payloads are capped and may be truncated.
- Signal events are appended to status.signalEvents with a monotonic sequence number for replay. The list is bounded; older events may be trimmed.
- The SDK exposes a replay helper that reads status.signalEvents and returns events after a given sequence number.
- Ordering is by signalEvents[].seq when available. The status.signals map is last-writer-wins and is intended for “latest value” lookups.
- Streaming transport buffers (bobravoz-grpc) are in-memory and can drop messages on overflow.
- Kubernetes Events are used for operational diagnostics (e.g., retries, restarts, blocked templates) and should not be treated as a durable signal channel.
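The bounded signal list and replay-by-sequence behavior can be illustrated with a small in-memory model. The `SignalLog` type and its methods are hypothetical and do not reflect the SDK's replay helper API; the gap flag is an added illustration of why trimming makes replay best-effort.

```go
package main

import "fmt"

// SignalEvent is a simplified entry of status.signalEvents (hypothetical shape).
type SignalEvent struct {
	Seq     uint64
	Name    string
	Payload string
}

// SignalLog is a bounded, append-only list with monotonic sequence numbers.
type SignalLog struct {
	max     int
	nextSeq uint64
	events  []SignalEvent
}

// Append adds an event and trims the oldest entries beyond the bound.
func (l *SignalLog) Append(name, payload string) uint64 {
	l.nextSeq++
	l.events = append(l.events, SignalEvent{Seq: l.nextSeq, Name: name, Payload: payload})
	if len(l.events) > l.max {
		l.events = l.events[len(l.events)-l.max:]
	}
	return l.nextSeq
}

// EventsAfter returns retained events with Seq > after, in sequence order.
// gap is true when events the caller has not seen were already trimmed.
func (l *SignalLog) EventsAfter(after uint64) (out []SignalEvent, gap bool) {
	for _, e := range l.events {
		if e.Seq > after {
			out = append(out, e)
		}
	}
	if len(l.events) > 0 && l.events[0].Seq > after+1 {
		gap = true
	}
	return out, gap
}

func main() {
	log := &SignalLog{max: 3}
	for i := 1; i <= 5; i++ {
		log.Append("progress", fmt.Sprintf("%d0%%", i*2))
	}
	// Only seq 3, 4, 5 are retained.
	events, gap := log.EventsAfter(3)
	fmt.Println(len(events), gap) // 2 false
	events, gap = log.EventsAfter(1)
	fmt.Println(len(events), gap) // 3 true
}
```

A consumer that sees a gap has lost events permanently, so anything that must not be missed belongs in a durable store rather than in signals.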
External side effects guidance
- Write durable state before invoking external side effects when possible.
- Use idempotency keys derived from StoryRun or StepRun identity for external calls.
- Record effects in the StepRun status.effects ledger (or your own outbox) so retries can detect already-applied side effects.
- Prefer transactional outbox patterns or external systems that provide exactly-once guarantees when needed.
- SDK helper for effect dedupe:

      result, already, err := sdk.ExecuteEffectOnce(ctx, key, func(ctx context.Context) (any, error) {
          // perform side effect, return safe details for the effect ledger
          return map[string]any{"providerId": "abc"}, nil
      })
      if errors.Is(err, sdk.ErrEffectAlreadyRecorded) || already {
          // effect already recorded; skip duplicate side effects
      }
      _ = result

Related references
- Core — Core resources and execution flow.
- Architecture — Module map and dependency graph.
- Component Ecosystem — SDK usage, contracts, and component catalog.
- Primitives — Step semantics and gate/wait behavior.
- Lifecycle — Phase and terminal rules.
- Inputs and Payloads — Size limits and storage refs.
- CRD Design — Resource model and policy resolution chains.
- Error Contract — Structured error contract for StepRuns.
- Go SDK — SDK entry points and usage patterns.
- Streaming Contract — Streaming message rules.
- Transport Settings — Backpressure, routing, replay, and delivery semantics.
- Operator Configuration — Controller defaults and scheduling keys.
- Roadmap — Durable execution and checkpointing are on the roadmap.