Observability Runbook
Use this page when a workflow or streaming pipeline behaves unexpectedly in production.
If StoryRun is stuck in Pending
- Check controller logs for reconcile errors.
- Verify queue and concurrency limits in Operator Configuration.
- Verify referenced Story, Engram, and Impulse resources exist and pass validation.
If traces are missing
- Confirm telemetry.enabled in operator config.
- Confirm OTEL_EXPORTER_OTLP_ENDPOINT and exporter settings on the controller deployment.
- Verify collector reachability from cluster workloads.
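A quick way to test collector reachability is to run a small probe from a pod in the affected namespace. The sketch below is an illustration only: the helper names parse_otlp_endpoint and can_reach are hypothetical, and the fallback port 4317 (the conventional OTLP/gRPC port) is an assumption used only when the endpoint omits a port. It checks TCP connectivity, not whether the collector accepts spans.

```python
import socket
from urllib.parse import urlparse


def parse_otlp_endpoint(endpoint: str, default_port: int = 4317) -> tuple[str, int]:
    """Split an OTLP endpoint string into (host, port).

    default_port is an assumption: 4317 is the conventional OTLP/gRPC port.
    """
    # Accept both "http://host:4317/path" and bare "host:4317" forms.
    parsed = urlparse(endpoint if "://" in endpoint else f"//{endpoint}")
    if not parsed.hostname:
        raise ValueError(f"cannot parse a host from {endpoint!r}")
    return parsed.hostname, parsed.port or default_port


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, call parse_otlp_endpoint(os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]) (after import os) and pass the result to can_reach. A False result points at DNS, network policy, or the collector Service rather than at the exporter settings.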
If streaming packets are dropped
- Check buffer and drop metrics (hub/connector metrics in observability docs).
- Review flowControl, buffer caps, and replay settings in Transport Settings.
- Inspect connector reconnect behavior and hub health.
If retries keep looping
- Inspect StepRun.status.error.exitClass and retryable values.
- Compare Step retry policy with Job backoff limits.
- Confirm idempotency key and effect-ledger usage for external side effects.
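When confirming idempotency, the property to look for is that a retried step reuses the same key and the ledger short-circuits the repeated side effect. The sketch below illustrates that general pattern only. It is not this project's effect-ledger implementation: EffectLedger, run_once, and idempotency_key are hypothetical names, and the in-memory dict stands in for whatever durable store is actually used.

```python
import hashlib
from typing import Callable, Dict


class EffectLedger:
    """In-memory stand-in for a durable effect ledger (hypothetical)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        # If this key already completed, return the recorded result
        # instead of repeating the external side effect.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


def idempotency_key(step_run: str, operation: str) -> str:
    # The key must be stable across retries, so it is derived only from
    # identifiers that do not change per attempt (no attempt counter,
    # no timestamp).
    return hashlib.sha256(f"{step_run}/{operation}".encode()).hexdigest()
```

If a retry loop produces duplicate external effects, check whether the key includes something that changes per attempt, or whether the ledger write happens before the effect is confirmed.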