CRDTs Are Not Enough When Your Coworker Is an AI Agent

Kadhir Mani


<section data-section-id='5969f96e-e470-4a5c-8f27-ddc8d6ea1de4'><h2 id="problem">Problem</h2>CRDTs converge; they do not coordinate. That distinction becomes critical the moment an AI agent joins the editor. We hit this while building a multiplayer editor where humans and agents could both modify the same document. A CRDT guarantees that replicas reach the same state, but it says nothing about whether the agent should write right now, whether it should write while a human is active in the same section, or where its output should land when its context is stale. Human-only multiplayer assumes every keystroke reflects fresh, local intent. An agent breaks that assumption: it operates on a snapshot taken before inference begins, may rewrite large spans at once, and has no inherent presence awareness. </section>
<section data-section-id='d4169e3e-a334-4db4-9e84-735750da7315'><h2 id="why-the-naive-solution-fails">Why the Naive Solution Fails</h2>The naive solution is to treat the agent as just another peer on the CRDT graph, a faster user: let the CRDT merge concurrent edits and converge on a consistent state. Problem solved. It turned out not to be that simple. CRDTs guarantee convergence; they do not guarantee that an edit should have been made in the first place. In practice, agents repeatedly overwrote sections that humans were actively editing. With several people in the same document, each running their own agent concurrently, the collisions compounded. Agents exposed the gap in three specific ways:<ul>
<li value="1">Stale context. The agent reads at T₀, infers, then writes at T₁. Human edits made between those moments are invisible to its prompt. The CRDT merges output that was reasoned against a state that no longer exists.</li>
<li value="2">Large-span rewrites. Humans edit words; agents rewrite paragraphs. A wider edit span raises collision probability and can silently invalidate comment anchors and structured blocks.</li>
<li value="3">No presence awareness. 
Human collaborators spot a cursor in a section and adjust their own editing accordingly. Agents have no equivalent signal. Without it, an agent writes into an actively edited section, and the CRDT faithfully merges the result.</li></ul> </section>
<section data-section-id='fd135455-2113-40b4-9040-4f52fac64bdd'><h2 id="practical-solution-shape">Practical Solution Shape</h2>We took a different route: a coordination layer that gates every write by a non-human participant before it reaches the shared state and persistence. <mermaid data-height="400" data-card="true">
flowchart TD
    H[Human Editor] --> C["Coordination Layer (Presence, Locks, Approvals)"]
    A[AI Agent] --> C
    R[Reviewer] --> C
    C --> S[Collaborative Document State]
    S --> P[Durable Persistence]
</mermaid> </section>
<section data-section-id='cec9faae-06bf-417f-9c1a-6efcace8d73c'>The coordination layer sits above CRDT convergence and answers three questions before the agent writes:<ul>
<li value="1">Is the target section free?</li>
<li value="2">Is a human actively focused there?</li>
<li value="3">Does the agent have a current snapshot of document state?</li></ul> Once cleared, the agent takes a narrow, expiring section-level lease (not a whole-document lock), acquired against a stable snapshot. While the agent holds the lease, humans cannot edit that section, which avoids unpredictable collisions. A human can, however, override the lease and evict the agent at any time. Because leases are scoped to sections, many humans and agents can edit the shared canvas in parallel, with fewer conflicts than when agents are treated as ordinary peers. 
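The three checks and the lease described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the production implementation: `Coordinator`, `Lease`, the status strings, and the `approved` flag are hypothetical names, and a real system would keep this state on the server rather than in a local object.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Lease:
    section_id: str
    holder: str
    expires_at: float  # clock value after which the lease is void


@dataclass
class Coordinator:
    """Hypothetical in-memory coordination layer (illustration only)."""
    clock: Callable[[], float] = time.monotonic
    versions: Dict[str, int] = field(default_factory=dict)  # section -> version
    focus: Dict[str, str] = field(default_factory=dict)     # user -> focused section
    leases: Dict[str, Lease] = field(default_factory=dict)  # section -> lease

    def _active_lease(self, section_id: str) -> Optional[Lease]:
        lease = self.leases.get(section_id)
        if lease is not None and lease.expires_at <= self.clock():
            del self.leases[section_id]  # TTL elapsed: the lease auto-expires
            return None
        return lease

    def acquire(self, section_id: str, agent_id: str, base_version: int,
                ttl: float, approved: bool = False) -> Tuple[Optional[Lease], str]:
        # Question 1: is the target section free?
        if self._active_lease(section_id) is not None:
            return None, "section_busy"
        # Question 2: is a human focused there? If so, approval is required.
        if not approved and section_id in self.focus.values():
            return None, "needs_approval"
        # Question 3: does the agent hold a current snapshot?
        if self.versions.get(section_id, 0) != base_version:
            return None, "stale_snapshot"
        lease = Lease(section_id, agent_id, self.clock() + ttl)
        self.leases[section_id] = lease
        return lease, "ok"

    def human_override(self, section_id: str) -> None:
        # A human can evict the agent at any time.
        self.leases.pop(section_id, None)
```

Returning a status instead of raising lets the caller decide whether to re-read the section, request approval, or back off and retry later.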
Four invariants keep the system safe:<ul>
<li value="1">No stale writes: the version is checked at lease time; a mismatch forces a re-read.</li>
<li value="2">No unbounded locks: leases carry a TTL and auto-expire on crash or stall.</li>
<li value="3">Approval on conflict: if a human is active in the target section, the lease is granted only after an explicit approval signal.</li>
<li value="4">No live-state/storage coupling: reconnect reconciles against server state rather than replaying local events.</li></ul> The trade-off is added latency and protocol surface area in exchange for safety and a more predictable user experience. TTLs must be calibrated to inference latency: too short, and a lease expires while the agent is still working; too long, and a stalled agent blocks the section. Large rewrites that span anchored comments need a coherence pass after publishing; lease scope alone does not protect anchor references. <mermaid data-height="380" data-card="true">
flowchart LR
    N1(Request Edit) --> N2(Check Presence/Focus)
    N2 --> N3{Human Active?}
    N3 -->|Yes| N4(Request Approval)
    N4 -->|Approved| N5(Acquire Expiring Lease)
    N3 -->|No| N5
    N5 --> N6(Edit Stable Snapshot)
    N6 --> N7(Publish Collaborative Event)
    N7 --> N8(Release Lock)
    classDef p fill:#3E63DD,stroke:#263c85,color:#fff
    classDef d fill:#F59E0B,stroke:#b47408,color:#fff
    class N1,N2,N4,N5,N6,N7,N8 p
    class N3 d
</mermaid></section>
<section data-section-id='fed7b334-5bc1-4088-8ba9-8adcfb6bfbe2'><h2 id="a-small-protocol">A Small Protocol</h2>A minimal protocol for agent edits can be sketched without reference to any specific framework. 
The core steps are always the same: <pre spellcheck="false" data-language="python" data-highlight-language="python">def agent_edit(section_id, base_version):
    # read_presence, request_approval, acquire_lease, etc. are provided by
    # the coordination layer; TTLs are in seconds.
    state = read_presence(section_id)
    if state.human_active:
        # A human is active here: ask before taking the lease.
        approval = request_approval(section_id, ttl=45)
        if not approval.granted:
            return "conflict"

    # The version check happens here; a stale base_version fails the lease.
    lease = acquire_lease(section_id, base_version, ttl=90)
    if not lease.ok:
        return "retry_with_fresh_snapshot"

    try:
        snapshot = read_section(section_id)
        update = generate_edit(snapshot)
        publish_collaborative_update(section_id, update)
    finally:
        # Release the lease even if inference or publishing fails.
        release_lease(section_id)</pre> The critical distinction is what this protocol does not do. The CRDT decides whether two updates can be merged into a consistent state; it resolves the mathematical question of convergence. This protocol handles a prior question: should the agent be allowed to produce an update at all, given what is currently happening in the document? Both layers are necessary. Without the coordination protocol, the agent will write, the CRDT will merge, and the result will be technically consistent but semantically wrong. The lease TTL is the main tuning surface. Inference latency varies widely across model calls; a TTL that works for a short edit may frequently expire on a longer rewrite. The safer approach is a short initial TTL with a single extension path, rather than a long default that holds the section unnecessarily. A note on the generate-and-publish step above: the agent edits the snapshot and then publishes the changes as ordinary collaborative events. The trickiest part of this publish step was expressing the agent's edits as CRDT events in a purely backend operation, especially given all the complex nodes we want to support. Getting text with light formatting to work was easy enough, but it became a real headache when the same system needed to support mermaid diagrams, XYZ flow charts, feedback nodes, and much more. 
The system and tooling we built to make this work ended up becoming quite complex over time. We'll make a separate post on that topic later. </section> <section data-section-id='fba2ac2e-c8fb-436f-a771-04aa37726ecd'><h2 id="edge-cases">Edge Cases</h2>Several failure modes require explicit handling beyond the happy path. <ul><li value="1">Stale read, then human edit, then agent write. The agent reads at T₀, a human edits at T₁, and the agent writes at T₂ with a prompt grounded in T₀. The version check at lease acquisition catches this if the human’s edit incremented the section version. If it did not (e.g., the edit was to a different field), the race may still produce incoherent output. Version granularity should match the edit granularity.</li><li value="2">Orphaned lease after model timeout or crash. If the agent process dies mid-inference, the lease must self-expire via TTL. Without automatic expiry, a crashed agent holds the section indefinitely. Monitor lease age and alert on leases approaching the TTL ceiling.</li><li value="3">Reconnect storms after offline editing. When a client reconnects after an extended offline period, replaying a local event queue against a diverged server state can produce a burst of conflicting updates. Reconcile against the server snapshot at reconnect time rather than replaying local events in order.</li><li value="4">Multiple tabs for the same user. Presence tracked per-session rather than per-user can report the same human as active in multiple sections simultaneously. The coordination layer must deduplicate presence by user identity, not by connection, or approval requests will fire when the “other” editor is the same person in a different tab.</li></ul> </section> <section data-section-id='961f552d-cea7-419c-808a-284fba9ad566'><h2 id="lessons-learned">Lessons Learned</h2><ul><li value="1">Convergence ≠ coordination. CRDTs guarantee merge; they don’t decide whether an agent should write. 
That layer must be built separately.</li><li value="2">Test stale-context and reconnect paths first. They are the most likely failure modes and the hardest to retrofit.</li><li value="3">Prefer expiring section-level soft locks. Narrow scope, automatic expiry, and an approval fallback — not whole-document hard locks.</li><li value="4">Make presence and transport state explicit. Agents must know when they are offline or behind; silence is not a safe assumption.</li><li value="5">Agent writes are multiplayer events, not background jobs. Same presence broadcast, same lock protocol, same conflict surface.</li></ul> The hard part of AI document editing is not making the model write text. It is making the model behave inside a multiplayer system where humans are already thinking, typing, reviewing, disconnecting, reconnecting, and changing their minds. CRDTs are still necessary. They are just not enough. </section>