Reliability

What this chapter covers

Trucks drive through dead zones, networks drop, and a database can stall for a few minutes — none of which is allowed to lose a position or corrupt a truck's state. This chapter explains, in product terms, how Korido stays correct under those conditions: messages that are delivered at least once and retried until they land, a dead-letter archive that recovers what a long outage set aside, processing that is strictly ordered per truck, and a family of scheduled jobs that close whatever real life left hanging open.

The picture

Every position rides a queue between the moment it is received and the moment it is processed. That queue promises at-least-once delivery: it will keep handing a message back until the work succeeds. A transient failure — a database blip — is simply retried, with the gap between attempts growing each time, so a routine outage rides out inside the queue and never loses data.

The retries are patient by design: fourteen retries after the first delivery (fifteen delivery attempts in all), with the gap between attempts growing from two seconds up to a five-minute ceiling. Those retry delays sum to roughly thirty-eight minutes of grace before a batch is considered stuck. A database or connection hiccup shorter than that window is absorbed entirely: the trucks keep reporting, the queue keeps the work, and once the database recovers everything drains through with nothing lost.
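The arithmetic behind that window can be checked with a short sketch. It assumes the delay doubles on each retry and is capped at the ceiling. The chapter states only the two-second start, the five-minute ceiling, and the fourteen retries, so the doubling rule and every name below are assumptions, not Korido's actual configuration.

```typescript
// Assumed values taken from the prose above; the constant names are hypothetical.
const BASE_DELAY_SECONDS = 2;
const MAX_DELAY_SECONDS = 300; // five-minute ceiling
const MAX_RETRIES = 14; // retries after the first delivery

// Delay before the given retry (1-based), assuming the delay doubles each time.
function retryDelaySeconds(retryNumber: number): number {
  const uncapped = BASE_DELAY_SECONDS * Math.pow(2, retryNumber - 1);
  return Math.min(uncapped, MAX_DELAY_SECONDS);
}

// The full schedule: one delay per retry.
const schedule: number[] = [];
for (let retry = 1; retry <= MAX_RETRIES; retry += 1) {
  schedule.push(retryDelaySeconds(retry));
}

// Total grace period covered by the retries, in seconds and minutes.
const totalSeconds = schedule.reduce((sum, delay) => sum + delay, 0);
const totalMinutes = totalSeconds / 60;

console.log(schedule.join(", "));
console.log(`${totalSeconds} s, about ${totalMinutes.toFixed(1)} min`);
```

Under these assumptions the fourteen delays sum to 2,310 seconds, about thirty-eight and a half minutes, which agrees with the figure quoted above.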

When patience runs out: the dead-letter archive

If a batch exhausts its retries (in practice, only during an outage longer than the retry window), it moves down a chain of dead-letter queues and is finally archived to durable object storage, where it waits safely for as long as it takes. A scheduled replay job then closes the loop automatically.

Current behavior

Replay is deliberately idempotent: the replay job re-enqueues a batch before deleting it from the archive, and position deduplication makes a repeated replay harmless.

The replay job lists the archive a bounded page at a time, re-validates each batch, and puts it back at the start of the live pipeline, where it is processed exactly as a fresh arrival. It enqueues a batch before deleting it from the archive, so a crash mid-replay safely re-lists and re-sends it next time. Anything that cannot be parsed as a valid batch is parked as "poison" in a separate place and counted, never retried in an infinite loop. Because the archive does not age out, while the queues themselves keep messages for only a few days, the archive is what turns a survivable-for-days problem into a survivable-indefinitely one.
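The order of operations in that loop can be sketched as follows. This is a minimal illustration, not the actual job: the archive, the poison area, and the live queue are stood in for by in-memory collections, the batch shape is hypothetical, and real storage and queue calls would be asynchronous.

```typescript
// Hypothetical batch shape; the real format is not described in this chapter.
interface ArchivedPosition {
  truckId: string;
  capturedAt: string; // capture instant, e.g. an ISO-8601 timestamp
}

interface ReplayResult {
  replayed: number;
  poisoned: number;
}

// Re-validate an archived body; return null when it is not a valid batch.
function parseArchivedBatch(body: string): ArchivedPosition[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return null;
  }
  if (!Array.isArray(parsed)) return null;
  const valid = parsed.every(
    (p) =>
      typeof p === "object" &&
      p !== null &&
      typeof p.truckId === "string" &&
      typeof p.capturedAt === "string",
  );
  return valid ? (parsed as ArchivedPosition[]) : null;
}

// Replay one bounded page of the archive.
function replayArchivePage(
  archive: Map<string, string>,
  poison: Map<string, string>,
  liveQueue: ArchivedPosition[][],
  pageSize: number,
): ReplayResult {
  const pageKeys = Array.from(archive.keys()).slice(0, pageSize);
  const result: ReplayResult = { replayed: 0, poisoned: 0 };
  for (const key of pageKeys) {
    const body = archive.get(key);
    if (body === undefined) continue;
    const batch = parseArchivedBatch(body);
    if (batch === null) {
      // Park unparseable content separately and count it; never re-enqueue it.
      poison.set(key, body);
      archive.delete(key);
      result.poisoned += 1;
      continue;
    }
    // Enqueue first, delete second: a crash between the two steps
    // leaves the batch in the archive to be re-listed on the next run.
    liveQueue.push(batch);
    archive.delete(key);
    result.replayed += 1;
  }
  return result;
}
```

Because the enqueue comes before the delete, the worst case after a crash is a batch that is sent twice, which is safe only if downstream processing tolerates duplicates.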

This whole recovery loop is only safe because replay can be repeated harmlessly — which is the next idea.

Idempotency everywhere: a retry never double-counts

At-least-once delivery means the same message will sometimes be processed twice. The system is built so that reprocessing changes nothing the first pass already did. Every write path has a way to recognize "I have seen this before":

Raw positions are keyed to one row per truck per capture instant, so a replayed or buffered-then-flushed batch inserts only the instants missing and silently discards the duplicates.
Alerts and events are deduplicated: only one open event of a given kind can exist for a truck at a time, and one-off notifications carry a key that makes a repeat a no-op.
The bulk sync path from the mobile apps carries a per-batch key so a retried upload applies exactly once.
Scheduled jobs take a short-lived lock before running, so two ticks cannot do the same sweep at once.
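The first of these guards, one row per truck per capture instant, can be illustrated with a small sketch. An in-memory map stands in for a table with a unique key on the truck and the capture instant; the type and class names are hypothetical.

```typescript
// Hypothetical position shape for illustration.
interface RawPosition {
  truckId: string;
  capturedAt: string; // capture instant
  lat: number;
  lon: number;
}

// In-memory stand-in for a table with a unique key on (truckId, capturedAt).
class PositionStore {
  private readonly rows = new Map<string, RawPosition>();

  // Insert only the instants not yet stored; return how many rows were new.
  insertBatch(batch: RawPosition[]): number {
    let inserted = 0;
    for (const position of batch) {
      const key = `${position.truckId}|${position.capturedAt}`;
      if (this.rows.has(key)) continue; // duplicate instant: discard silently
      this.rows.set(key, position);
      inserted += 1;
    }
    return inserted;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

Inserting the same batch twice adds rows only the first time, and a batch that overlaps earlier data adds only the instants that were missing.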

The discipline extends to jobs that emit something and then mark it done. Such a job always sends first and marks done only on a confirmed send, never the reverse. Marking done first would strand the record if the send then failed. Sending first risks only a duplicate, and because the consumer on the other end is itself idempotent, a resend is always safe. The rule throughout: never add a retry to a path until you have confirmed the path can be safely repeated.
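A minimal sketch of the send-then-mark rule follows. The record shape and the `send` callback are hypothetical, and a real send would be asynchronous; the point is only the ordering of the two steps.

```typescript
// Hypothetical outbox record for illustration.
interface OutboxRecord {
  id: string;
  payload: string;
  done: boolean;
}

// Sweep pending records. `send` must return true only on a confirmed delivery.
// Returns the number of records confirmed and marked done in this sweep.
function sweepOutbox(
  records: OutboxRecord[],
  send: (record: OutboxRecord) => boolean,
): number {
  let sent = 0;
  for (const record of records) {
    if (record.done) continue;
    let confirmed = false;
    try {
      confirmed = send(record);
    } catch {
      confirmed = false; // a thrown error counts as an unconfirmed send
    }
    if (confirmed) {
      // Mark done only after the send is confirmed, never before.
      record.done = true;
      sent += 1;
    }
  }
  return sent;
}
```

A record whose send fails stays pending and is picked up by the next sweep, so the only possible fault is a duplicate delivery, which the idempotent consumer absorbs.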

Strictly serial, per truck

Turning a stream of positions into trips, stops, and gaps is the work of a state machine — and a state machine only makes sense if it reads events in order. Frames must be processed oldest-to-newest for one truck, because "the truck started moving" only means something relative to "the truck was stopped" a moment earlier. Feed the same truck's frames out of order, or two at once, and the machine reasons about a past that has already been overwritten.

Boundary

Queue concurrency is part of the correctness model, not just performance tuning. Changing it requires a per-truck locking design that preserves oldest-to-newest state-machine input.

So the positions pipeline processes one batch at a time, never overlapping work for the same truck. Two guards enforce this. The queue that carries positions runs without concurrency, so the engine sees deliveries in sequence. And the database itself will only permit one open trip, one open stop, and one open gap per truck — a structural rule that would reject a duplicate the moment overlapping processing tried to create one. This is why raising queue concurrency is not a tuning knob: the engine's ordering assumptions and those one-open-record-per-truck rules both depend on serial processing, and changing it would require a per-truck locking design first.
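Why order matters can be seen in a deliberately tiny state machine. This sketch assumes each frame carries a numeric capture time and a speed, and uses a hypothetical speed threshold to decide between "moving" and "stopped"; the real engine's inputs and rules are not described in this chapter.

```typescript
// Hypothetical frame shape and threshold, for illustration only.
interface Frame {
  truckId: string;
  capturedAt: number; // capture time, e.g. epoch seconds
  speedKmh: number;
}

type TruckState = "stopped" | "moving";

interface Transition {
  truckId: string;
  at: number;
  to: TruckState;
}

const MOVING_THRESHOLD_KMH = 5; // assumed value

// Process one batch serially, oldest frame first, updating per-truck state.
function processBatch(
  frames: Frame[],
  states: Map<string, TruckState>,
): Transition[] {
  // Sort a copy so each truck's frames are seen oldest-to-newest.
  const ordered = frames.slice().sort((a, b) => a.capturedAt - b.capturedAt);
  const transitions: Transition[] = [];
  for (const frame of ordered) {
    const previous = states.get(frame.truckId) ?? "stopped";
    const next: TruckState =
      frame.speedKmh > MOVING_THRESHOLD_KMH ? "moving" : "stopped";
    if (next !== previous) {
      transitions.push({ truckId: frame.truckId, at: frame.capturedAt, to: next });
      states.set(frame.truckId, next);
    }
  }
  return transitions;
}
```

Each transition is computed relative to the state left by the previous frame, so calling `processBatch` for two batches of the same truck at the same time, or feeding it frames newest-first, would compare against a state that no longer reflects the truck's actual past.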

The scheduled jobs, in families

Ingestion handles what the trucks actively report. A second engine — a set of scheduled jobs running on fixed cadences from every couple of minutes to weekly — handles what should have happened but didn't, and the slow work the fast path deliberately defers. The jobs fall into families by what they protect.

Detection turns silence and geometry into events, because no dispatcher watches every truck continuously. A single liveness pass, every couple of minutes, notices both a truck whose location has gone stale (opening a data gap) and a tracker whose heartbeat has stopped (a device-offline alert) — two different failures it is careful to tell apart. Other detectors surface corridor disruptions, recurring slowdowns, missions predicted to run late, and missions that were scheduled to depart but have not yet started.
Safety nets bound how long anything can stay open. If a truck vanishes mid-trip, the state machine has an open trip with no natural close in sight; a safety net closes trips, stops, gaps, and visits that have stayed open past their reasonable life, cancels missions after a long tracker silence, and clears stale deviations and fuel events. A watchdog job even checks that the other jobs are still running. These are backstops that step in only once something has already gone wrong.
Reconciliation completes work the fast path defers so ingestion is never blocked on something slow. Turning a coordinate into a place name asks a mapping provider, so ingestion stores the coordinate now and a job attaches the label moments later. Learning fleet-wide patterns and hotspots likewise runs on its own cadence, and a decay job lets old evidence fade so intelligence stays current. A promotion pass lifts well-observed tenant hotspots into the shared global set every fleet benefits from.
Delivery carries results outward — dispatching immediate notifications, storing push eligibility for the rollout-controlled mobile channel, rolling lower-urgency alerts into an hourly digest, reconciling push receipts when outbound delivery is enabled, prompting a mission's owner and driver to confirm a suspected pause, and sweeping the event outbox — and includes the dead-letter replay loop that recovers parked telemetry.

Two operating rules keep this second engine safe. Every job takes a soft lock so overlapping ticks do not collide, and any job that scans a wide window or does heavy geographic work must bound its work per tick — paging through its backlog rather than attempting everything at once — so a single job can never saturate the database.
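Both rules fit in a few lines. The sketch below is an assumption-laden illustration: an in-memory map stands in for the key-value store that holds locks, the clock is passed in explicitly, and the lock lifetime is an invented value. A real distributed lock would also need to account for the store's consistency guarantees.

```typescript
// In-memory stand-in for the key-value store that holds job locks.
class SoftLockStore {
  private readonly expiries = new Map<string, number>();

  // Take the lock if it is free or expired; return whether it was acquired.
  tryAcquire(name: string, nowMs: number, ttlMs: number): boolean {
    const expiresAt = this.expiries.get(name);
    if (expiresAt !== undefined && expiresAt > nowMs) return false;
    this.expiries.set(name, nowMs + ttlMs);
    return true;
  }

  release(name: string): void {
    this.expiries.delete(name);
  }
}

const LOCK_TTL_MS = 60000; // hypothetical lock lifetime

// Run one tick of a job: skip if another tick holds the lock,
// otherwise handle at most `maxPerTick` items from the backlog.
function runTick<T>(
  locks: SoftLockStore,
  jobName: string,
  backlog: T[],
  maxPerTick: number,
  nowMs: number,
  handle: (item: T) => void,
): number {
  if (!locks.tryAcquire(jobName, nowMs, LOCK_TTL_MS)) return 0;
  try {
    let processed = 0;
    while (processed < maxPerTick && backlog.length > 0) {
      handle(backlog[0] as T);
      backlog.shift(); // remove the item only after it was handled
      processed += 1;
    }
    return processed;
  } finally {
    locks.release(jobName);
  }
}
```

The lock expiry means a crashed tick cannot block the job forever, and the per-tick cap means a large backlog is worked off over several runs instead of one long scan.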

Environments and deployment

Korido runs entirely on Cloudflare Workers — managed, serverless compute with no machines to keep alive. The backend is a two-Worker pipeline: an ingress Worker at the edge that authenticates callers, takes in telemetry and webhooks, and forwards work; and an engine Worker placed near the database in Europe that does the data-heavy processing, runs the scheduled jobs, and owns the queues. The fleet app, admin portal, and customer tracking portal each run as their own Worker, and the mobile apps ship through their own app-store release channel. Every Worker invocation is short-lived and stateless, with tight memory and code-size budgets, which is why durable state lives in the database and coordination lives in the fast key-value store rather than in any running process.

Deploys are staged and independent, in an order that keeps a running system consistent while it changes underneath. Schema changes go first, and a breaking one is split into phases — add the new shape in a backward-compatible way, ship code that reads both shapes, backfill, then tighten — so every version of the code stays able to read whatever database state it meets. The engine deploys ahead of the ingress Worker whenever the ingress Worker is about to forward something new, then the web surfaces that depend on the new behavior follow, and mobile ships on its own cadence. Each surface is verified after it lands. This staging is what lets Korido evolve continuously without a maintenance window.
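The "reads both shapes" phase of a breaking schema change can be pictured with a small sketch. The column names here are entirely hypothetical, chosen only to show a rename in progress.

```typescript
// Hypothetical row during a rename from `driver_name` to `driver_display_name`.
// During the transition either column may be populated.
interface MissionRow {
  driver_name?: string | null; // old shape
  driver_display_name?: string | null; // new shape
}

// Prefer the new column, fall back to the old one, so this code
// works before, during, and after the backfill.
function readDriverName(row: MissionRow): string | null {
  return row.driver_display_name ?? row.driver_name ?? null;
}
```

Once the backfill has populated the new column everywhere, the fallback and the old column can be removed in the final tightening phase.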

Edge cases

A database outage shorter than the retry window (roughly thirty-eight minutes). Fully absorbed by queue retries. No batch reaches the dead-letter chain; everything drains through when the database recovers.
An outage longer than the retry window. Batches divert to the dead-letter chain and are archived to durable storage, then automatically replayed once the root cause is fixed. Replay is a no-op for positions already stored, so it is safe to let it run.
A malformed archived message. It is parked as poison in a separate location and counted, never retried forever, so one bad message cannot stall the recovery of good ones.
A replay that overlaps a partial success. A batch that dead-lettered after some of its rows were stored re-inserts only the missing rows on replay; the already-stored instants collide on their key and are discarded.
A truck that disappears mid-trip. Ingestion leaves an open trip with no close; a safety-net job closes it once it has stayed open too long, and a long enough silence cancels the mission within a bounded window.
A job that emits then marks done, interrupted between the two. Because the emit is confirmed before the record is marked, an interruption leaves the record unmarked and the next tick re-emits it; the idempotent consumer makes the resend harmless.
A heavy scheduled job on a large window. It pages its work per tick, so it advances a bounded slice each run instead of holding the database under a long, saturating scan.

How it connects

Data Architecture — the append-only, deduplicated observed plane that makes at-least-once delivery and replay safe.
Tenancy and Security — the cross-tenant service paths the ingestion pipeline and scheduled jobs run on.
Part 3 — The fleet engine — the per-truck state machine whose ordering assumptions require strictly serial processing.
