Technical MSPs · May 20, 2026
Why stalled runtime prompts need recovery, not just longer timeouts
Most remote troubleshooting platforms handle a stalled prompt the same way: wait longer. Double the timeout, triple it, hope the endpoint eventually coughs up a response. This works until it doesn't. A machine goes to sleep mid-session, a network partition splits the control plane from the runtime, or a tool call returns malformed JSON that chokes the parser. At that point, longer timeouts just mean longer failures. What technicians actually need is a recovery path that preserves context and lets work continue.
Caisey approaches this differently because its architecture separates the control plane from headless runtimes enrolled on each endpoint. The Cloudflare Worker control plane and SQLite Durable Objects maintain session state independently of whatever is happening on the client device. This separation is what makes productive recovery possible.
The failure modes that timeouts cannot fix
A timeout assumes the problem is slowness. In practice, stalled prompts fail in at least four distinct ways that slowness does not explain.
**Empty responses.** The runtime receives the prompt, executes the tool call, but returns nothing. This might happen when a PowerShell command runs successfully but produces no stdout, or when a macOS osascript invocation exits zero with blank output. The runtime is alive, the network is fine, but the payload is hollow.
**Malformed tool calls.** The runtime attempts to execute a function with arguments that serialize incorrectly, or the response schema drifts from what the control plane expects. The prompt technically completes, but the result is structurally useless. A longer timeout cannot repair bad JSON.
**Orphaned prompts.** The control plane dispatches a prompt, but the runtime never receives it. Perhaps the WebSocket dropped, or a Durable Object migration left a stale routing table. The prompt sits in limbo, neither failed nor fulfilled.
**Runtime death mid-prompt.** The endpoint agent crashes, the machine reboots, or an antivirus quarantines the runtime process. The prompt was sent, but nothing is left to answer it.
Each of these requires a different recovery strategy. Extending timeouts addresses none of them.
What productive recovery looks like
Recovery means more than retrying the same prompt and hoping. It means understanding what happened, preserving what context exists, and giving the technician actionable next steps.
For empty responses, Caisey can distinguish between "the command ran and produced nothing" and "the command never ran at all." The runtime reports execution metadata separately from output payload. If the metadata confirms execution with exit code zero, the technician knows to adjust the query rather than resend it. If metadata is absent, nothing confirms that the command ran, so the prompt is treated as orphaned and requeued.
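The distinction above can be sketched in a few lines of Python. This is an illustration only, not Caisey's code: `ExecutionMetadata`, `PromptResult`, `classify_empty`, and the returned labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionMetadata:
    """Hypothetical metadata a runtime might report separately from output."""
    exit_code: int
    duration_ms: int


@dataclass
class PromptResult:
    metadata: Optional[ExecutionMetadata]  # None means no execution report arrived
    output: str = ""


def classify_empty(result: PromptResult) -> str:
    """Decide what an empty-looking result means for the technician."""
    if result.output.strip():
        return "has_output"
    if result.metadata is None:
        # No evidence the command ran: treat the prompt as orphaned.
        return "requeue"
    if result.metadata.exit_code == 0:
        # Ran cleanly and printed nothing: resending will not help.
        return "adjust_query"
    # Ran, failed, and printed nothing: surface the exit code instead.
    return "inspect_failure"
```

The point of the design is that the decision keys off metadata, so a blank payload never has to be interpreted on its own.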
Malformed responses trigger schema validation at the control plane before any downstream processing. Caisey logs the exact shape of the failure, which helps identify whether a particular tool definition needs revision or whether a specific endpoint has an environmental issue. Technicians see the malformed payload, not a generic error code.
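A minimal sketch of this kind of validation follows, assuming a made-up result shape with `tool`, `exit_code`, and `stdout` fields. The field names and the `validate_tool_result` helper are assumptions for illustration; the article does not describe Caisey's actual schema.

```python
import json
from typing import Any

# Hypothetical expected shape for a tool result: field name -> required type.
EXPECTED_FIELDS = {"tool": str, "exit_code": int, "stdout": str}


def validate_tool_result(raw: str) -> dict[str, Any]:
    """Validate a raw runtime payload and record exactly how it failed."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        problem = f"invalid JSON: {exc.msg} at char {exc.pos}"
        return {"ok": False, "problems": [problem], "raw": raw}

    if not isinstance(parsed, dict):
        problem = f"expected object, got {type(parsed).__name__}"
        return {"ok": False, "problems": [problem], "raw": raw}

    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in parsed:
            problems.append(f"missing field: {field}")
        elif not isinstance(parsed[field], expected_type):
            actual = type(parsed[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # The raw payload is always kept so the technician can see what arrived.
    return {"ok": not problems, "problems": problems, "raw": raw}
```

Returning the list of problems alongside the untouched payload is what lets a log entry describe the shape of a failure rather than a generic error code.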
Orphaned prompts are detected through RPC-level acknowledgments. The control plane tracks which prompts have received runtime confirmation. Unacknowledged prompts beyond a short window are candidates for requeue or escalation, not indefinite waiting. The Durable Object's SQLite state makes this tracking durable across Worker invocations.
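As a rough illustration of acknowledgment tracking on top of SQLite, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The table layout, function names, and the 10-second window are assumptions; the article only says the window is short, and a Durable Object would use its own storage API rather than this module.

```python
import sqlite3

ACK_WINDOW_SECONDS = 10  # assumed value; the article only says "a short window"


def open_tracker() -> sqlite3.Connection:
    """Create an in-memory table of dispatched prompts (hypothetical layout)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE prompts ("
        " id TEXT PRIMARY KEY,"
        " dispatched_at REAL NOT NULL,"
        " acked_at REAL)"  # stays NULL until the runtime confirms receipt
    )
    return db


def record_dispatch(db: sqlite3.Connection, prompt_id: str, now: float) -> None:
    db.execute("INSERT INTO prompts (id, dispatched_at) VALUES (?, ?)", (prompt_id, now))


def record_ack(db: sqlite3.Connection, prompt_id: str, now: float) -> None:
    db.execute("UPDATE prompts SET acked_at = ? WHERE id = ?", (now, prompt_id))


def find_orphans(db: sqlite3.Connection, now: float) -> list[str]:
    """Return unacknowledged prompts dispatched before the ack window opened."""
    rows = db.execute(
        "SELECT id FROM prompts"
        " WHERE acked_at IS NULL AND dispatched_at < ?"
        " ORDER BY dispatched_at",
        (now - ACK_WINDOW_SECONDS,),
    )
    return [row[0] for row in rows]
```

Because the acknowledgment state lives in a table rather than in process memory, the orphan check gives the same answer no matter which invocation runs it.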
Runtime death is caught through heartbeat monitoring. Enrolled endpoints report liveness independently of prompt traffic. If a runtime disappears mid-session, the control plane notices within a few missed heartbeats and can surface that status to the technician rather than leaving them staring at a spinner.
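Heartbeat-based liveness can be reduced to simple arithmetic on the last report time. In this sketch the 15-second interval, the two-beat threshold, and the status labels are all assumed values for illustration, not documented Caisey settings.

```python
HEARTBEAT_INTERVAL_SECONDS = 15.0  # assumed interval
MISSED_BEATS_BEFORE_DEAD = 2       # assumed threshold


def missed_heartbeats(last_seen: float, now: float) -> int:
    """Count whole heartbeat intervals that have passed without a report."""
    elapsed = max(0.0, now - last_seen)
    return int(elapsed // HEARTBEAT_INTERVAL_SECONDS)


def runtime_status(last_seen: float, now: float) -> str:
    """Map time since the last heartbeat to a status a technician can read."""
    missed = missed_heartbeats(last_seen, now)
    if missed == 0:
        return "alive"
    if missed < MISSED_BEATS_BEFORE_DEAD:
        return "late"
    return "presumed_dead"
```

Since the check depends only on heartbeat timestamps, it works the same whether or not a prompt is in flight.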
Why session history enables safe retry
Blind retry is dangerous in remote troubleshooting. Re-running a script that already partially modified system state can make problems worse. Caisey preserves full session history in the Durable Object, including which prompts completed, which failed, and what state changes resulted. This lets technicians retry with awareness rather than hope.
Consider a technician diagnosing a print spooler issue. They send a prompt to stop the service, inspect the queue, and restart it. The runtime dies after stopping the service but before the restart. With history, the next technician, or the same one after reconnection, sees exactly what executed. They can issue a targeted restart command rather than blindly repeating the full sequence and potentially disrupting a queue that has since changed.
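The spooler scenario can be expressed as a small planning step over recorded history. The `StepRecord` type, the status strings, and the step names below are hypothetical; the sketch assumes the history records a status per step, which the article implies but does not specify.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One entry in a hypothetical session history."""
    name: str
    status: str  # "completed", "failed", or "unknown" (in flight when contact was lost)


def plan_resume(planned: list[str], history: list[StepRecord]) -> dict[str, list[str]]:
    """Split a planned sequence into steps to skip, re-check, or run fresh."""
    status = {record.name: record.status for record in history}
    return {
        # Already applied to the machine: do not repeat blindly.
        "skip": [s for s in planned if status.get(s) == "completed"],
        # Outcome uncertain: inspect current state before acting.
        "recheck": [s for s in planned if status.get(s) in ("failed", "unknown")],
        # Never dispatched: safe to issue as a targeted command.
        "not_started": [s for s in planned if s not in status],
    }
```

With `stop_spooler` recorded as completed and `inspect_queue` as unknown, the plan would skip the stop, re-check the queue, and leave `restart_spooler` as the one step still to send.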
Practical patterns for MSPs
MSPs managing hundreds or thousands of endpoints need recovery to be systematic, not artisanal. Several patterns emerge from Caisey's approach.
**Differentiate retry from requeue.** Retry means the same prompt to the same runtime. Requeue means the prompt goes to any available runtime for that endpoint, potentially after a fresh enrollment check. Caisey tracks which runtimes have acknowledged which prompts, making this distinction automatic.
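One way to picture the retry/requeue split is a function that takes the acknowledging runtime, if any, and the runtimes currently live for the endpoint. The function name, the `"hold"` fallback, and the choice of the first live runtime are assumptions made for this sketch.

```python
from typing import Optional


def choose_recovery(
    acked_by: Optional[str], live_runtimes: list[str]
) -> tuple[str, Optional[str]]:
    """Pick retry (same runtime) or requeue (any live runtime for the endpoint)."""
    if acked_by is not None and acked_by in live_runtimes:
        # The runtime that acknowledged the prompt is still reachable.
        return ("retry", acked_by)
    if live_runtimes:
        # Never acknowledged, or the acknowledging runtime is gone.
        return ("requeue", live_runtimes[0])
    # Nothing to send to; wait for a runtime to come back.
    return ("hold", None)
```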
**Surface runtime health to dispatch decisions.** If an endpoint's runtime has missed two heartbeats, new prompts should probably not target it. The control plane can hold prompts pending runtime recovery or route them through alternative bridges. Client grouping lets MSPs set these policies per customer or per environment.
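A per-group dispatch policy might look like the following sketch. The `DispatchPolicy` fields, the group names, and the thresholds are invented for illustration; only the "two missed heartbeats" default comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchPolicy:
    """Hypothetical per-client-group dispatch policy."""
    unhealthy_after_missed: int   # missed heartbeats at which dispatch stops
    allow_alternate_bridge: bool  # whether prompts may be rerouted


POLICIES = {
    "default": DispatchPolicy(unhealthy_after_missed=2, allow_alternate_bridge=False),
    "acme-production": DispatchPolicy(unhealthy_after_missed=1, allow_alternate_bridge=True),
}


def dispatch_decision(client_group: str, missed: int, bridge_available: bool) -> str:
    """Decide whether a new prompt is dispatched, rerouted, or held."""
    policy = POLICIES.get(client_group, POLICIES["default"])
    if missed < policy.unhealthy_after_missed:
        return "dispatch"
    if policy.allow_alternate_bridge and bridge_available:
        return "route_via_bridge"
    return "hold"
```

Keeping the thresholds in data rather than in code is what would let a stricter policy apply to one customer without touching the others.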
**Preserve partial output.** Even malformed or incomplete responses may contain useful fragments. Caisey stores whatever the runtime returned before failure, not just success cases. A truncated registry query might still reveal the key value a technician needs.
**Make recovery visible in the transcript.** Public reviewed transcript shares and audit logs should show not just final results but recovery actions taken. This builds trust with customers who want to understand why a session took longer than expected.
Moving past timeout configuration as policy
Many MSPs currently manage prompt reliability through timeout settings in their remote access tool: 30 seconds for quick commands, 5 minutes for installers, 15 minutes for Windows Updates. This is configuration as a substitute for understanding. It papers over failures without addressing them.
Caisey's model inverts this. Short timeouts with explicit recovery paths are preferable to long timeouts with silent failures. A 10-second timeout that cleanly identifies an orphaned prompt and requeues it teaches the system something. A 5-minute timeout that eventually returns "unknown error" teaches nothing.
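To make the contrast concrete, here is a minimal sketch of a short timeout that ends in a named outcome instead of an opaque error. It uses Python's `asyncio.wait_for`; the `send_prompt` and `was_acknowledged` callables, the outcome labels, and the 10-second default are assumptions standing in for whatever the real control plane provides.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_with_recovery(
    send_prompt: Callable[[], Awaitable[Any]],
    was_acknowledged: Callable[[], bool],
    timeout_seconds: float = 10.0,
) -> dict:
    """Run a prompt with a short timeout and name the failure if it expires."""
    try:
        result = await asyncio.wait_for(send_prompt(), timeout=timeout_seconds)
        return {"outcome": "completed", "result": result}
    except asyncio.TimeoutError:
        if was_acknowledged():
            # The runtime received the prompt but has not answered.
            return {"outcome": "stalled_after_ack", "next": "check_runtime_health"}
        # The runtime never confirmed receipt: treat as orphaned.
        return {"outcome": "orphaned", "next": "requeue"}
```

The timeout value matters less than what happens when it fires: each branch returns something a technician or a dispatcher can act on.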
For technical MSPs, the operational win is cumulative. Every recovered prompt with preserved context is a ticket that does not need escalation. Every malformed response logged with its exact shape is a tool definition that can be improved. Every runtime death detected through heartbeat loss is a machine that gets proactive attention before the customer calls.
Timeouts are easy to configure and hard to debug. Recovery is harder to build and easier to operate. The choice between them defines whether remote troubleshooting scales with endpoint count or drowns in it.