MSPs · May 20, 2026
How update agents should repair without disrupting active support
Every MSP has lived through the 2 p.m. call where a technician loses connection mid-session because the endpoint agent decided it was time to update itself. The user was watching. The issue was half-diagnosed. Now the technician is blind, the user is confused, and the ticket just got longer. Agent updates are necessary, but the way most tools handle them treats active support as an acceptable casualty. It should not be.
The problem starts with how update mechanisms are designed. Many treat the agent as a simple Windows service or background process that can be stopped, replaced, and restarted without considering what that process is actually doing at the moment. If a technician has an active query running, a file transfer in progress, or a conversation thread open, a hard restart wipes state that may not be reconstructible. The user context, the diagnostic path, and the approval chain can all be lost when the agent vanishes without warning.
What non-disruptive check-ins actually mean
A non-disruptive check-in is not just "check if someone is logged in." It is a runtime-aware poll that asks: is this agent currently participating in a support session? Is there an open bridge? Has there been RPC activity in the last N minutes? Caisey's runtime tracks this explicitly because the control plane maintains session state in Durable Objects. The agent knows whether it is idle or engaged because the cloud tells it, and the cloud knows because the session record is live.
This matters for updates because the agent can defer its own restart when it sees active engagement. Not by guessing based on CPU usage or network traffic, but by checking a specific session flag. The update downloads in the background, stages the new binaries, and waits for a clean break: session closed, technician explicitly disconnected, or a configurable idle timeout. The user never sees a service stutter. The technician never loses their place.
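The deferral rule can be sketched in a few lines. This is a minimal illustration, not Caisey's actual code: the `SessionStatus` fields, the function name, and the 15-minute default are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SessionStatus:
    """Snapshot of what the control plane reports (hypothetical field names)."""
    session_active: bool           # a support session record is open
    technician_connected: bool     # the technician has not explicitly disconnected
    seconds_since_last_rpc: float  # time since the last RPC on this agent


def can_apply_staged_update(status: SessionStatus,
                            idle_timeout_s: float = 900.0) -> bool:
    """Return True only at a clean break; otherwise keep the update staged."""
    if not status.session_active:
        return True   # session closed
    if not status.technician_connected:
        return True   # technician explicitly disconnected
    # Session is open and attached: only the idle timeout can release it.
    return status.seconds_since_last_rpc >= idle_timeout_s
```

The point of the design is that the decision reads explicit session state. Nothing here looks at CPU load or network traffic.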
Why unchanged-package restart avoidance matters
Some update systems restart the agent even when the package has not meaningfully changed. A certificate rotation, a metadata update, a version bump with no runtime impact—all trigger the same hard restart cycle. This is lazy engineering with real operational cost. Each unnecessary restart is a chance for the service to fail to come back up, for the machine to hit a new firewall rule, for the user to notice something happened.
Caisey's approach is to diff what matters. If the runtime binary is unchanged, if the bridge protocol version is compatible, if the Durable Object schema does not require a migration, the agent skips the restart entirely. It applies what changed (maybe a config value, maybe a rotated token) and continues running. This sounds small until you multiply it across five hundred endpoints that check in every hour. That is 12,000 check-ins a day, and each one that skips a needless restart is one less chance for a service to fail to come back up.
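One way to picture that diff is as a comparison of two package manifests. The sketch below is an assumption-laden illustration: the `PackageManifest` fields and the three result strings are invented for the example and are not a documented Caisey format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PackageManifest:
    """Hypothetical manifest; every field name here is illustrative."""
    runtime_sha256: str        # hash of the runtime binary
    bridge_protocol: int       # protocol version this package speaks
    min_bridge_protocol: int   # oldest protocol version it still accepts
    schema_version: int        # session-state schema version
    config: Dict[str, str] = field(default_factory=dict)


def plan_update(current: PackageManifest, incoming: PackageManifest) -> str:
    """Return 'restart', 'apply_in_place', or 'no_op'."""
    if incoming.runtime_sha256 != current.runtime_sha256:
        return "restart"         # the binary itself changed
    if current.bridge_protocol < incoming.min_bridge_protocol:
        return "restart"         # running protocol is no longer compatible
    if incoming.schema_version != current.schema_version:
        return "restart"         # schema migration required
    if incoming.config != current.config:
        return "apply_in_place"  # e.g. a config value or rotated token
    return "no_op"               # metadata-only bump: nothing to do
```

Only the first three checks can justify a restart. A rotated token or a changed config value falls through to `apply_in_place`, and a pure version bump falls through to `no_op`.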
Staged updates and rollback readiness
Even when an update is necessary, the agent should not replace itself in place. It should stage the new version alongside the running one, validate the staging environment, and only then perform a hot handoff. If the new version fails to initialize, the old version is still there, still running, still connected. The update is marked failed, the incident is logged, and the MSP knows without a user ever reporting "my remote support disappeared."
This requires the agent to be more than a single executable. It needs a lightweight supervisor that understands versioning, can enumerate installed runtimes, and can make a forward-or-backward decision based on health checks. The supervisor itself should change rarely and update through a separate, more conservative channel. This is how you avoid the nightmare scenario where the update mechanism breaks and now nothing can fix it.
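The forward-or-backward decision can be sketched as a small supervisor that only promotes a staged runtime after its health check passes. This is a simplified model under stated assumptions: the `Supervisor` class, its `promote` method, and the idea of a health probe as a plain callable are hypothetical, and a real supervisor would also manage processes and connections.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

HealthCheck = Callable[[], bool]  # hypothetical probe: True means healthy


@dataclass
class Supervisor:
    runtimes: Dict[str, HealthCheck]  # installed version -> health probe
    active: str                       # version currently serving sessions
    log: List[str] = field(default_factory=list)

    def promote(self, staged: str) -> str:
        """Hand off to `staged` only if it proves healthy; else stay put."""
        probe = self.runtimes.get(staged)
        try:
            healthy = probe is not None and probe()
        except Exception as exc:  # a crashing probe counts as unhealthy
            healthy = False
            self.log.append(f"health check for {staged} raised {exc!r}")
        if healthy:
            self.log.append(f"promoted {staged}, retiring {self.active}")
            self.active = staged
        else:
            # The old runtime was never stopped, so "rollback" is a no-op.
            self.log.append(f"update to {staged} failed, keeping {self.active}")
        return self.active
```

Because the running version is never torn down before the staged one passes, a failed update leaves the endpoint exactly where it started, with a log entry for the MSP instead of a ticket from the user.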
The Windows-specific considerations
On Windows, interactions with the Service Control Manager (SCM) add friction. Stopping a service that has active threads can hang if those threads are waiting on network I/O. Restarting a service with dependent services can cascade into failures you did not anticipate. The agent should register itself with the SCM in a way that allows for graceful shutdown signals, and it should handle SERVICE_CONTROL_STOP by finishing in-flight RPCs, closing bridges cleanly, and only then exiting.
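The drain-then-exit behavior can be modeled without any Windows APIs. The sketch below is a platform-neutral illustration: `DrainingRuntime` and its methods are hypothetical names, and in a real service `on_stop` would be invoked from the control handler that receives SERVICE_CONTROL_STOP, which would also need to report stop-pending status to the SCM. That reporting is left out here.

```python
import threading


class DrainingRuntime:
    """Tracks in-flight RPCs so a stop request can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._stopping = False

    def begin_rpc(self) -> bool:
        """Register a new RPC; refused once a stop has been requested."""
        with self._cond:
            if self._stopping:
                return False
            self._in_flight += 1
            return True

    def end_rpc(self) -> None:
        """Mark one RPC finished and wake the stop handler if we are drained."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def on_stop(self, deadline_s: float) -> bool:
        """Stop accepting work, then wait up to deadline_s for a full drain.

        Returns True if all RPCs finished, False if the deadline expired.
        """
        with self._cond:
            self._stopping = True
            return self._cond.wait_for(lambda: self._in_flight == 0,
                                       timeout=deadline_s)
```

The deadline matters as much as the drain. A thread stuck on network I/O should delay shutdown for a bounded time, not hang the stop request indefinitely.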
Windows also has the wrinkle of Session 0 isolation: services run in Session 0, apart from the interactive session the user is logged in to. If the agent runs UI-adjacent operations, an update that restarts the service may lose access to the user's session context. Caisey's agent separates the persistent runtime, which handles cloud communication and session state, from any interactive components, so a runtime update does not disturb session-specific handles.
What MSPs should ask their vendors
If you are evaluating remote support tools, ask specific questions about update behavior. Does the agent check for active sessions before restarting? Can it stage and validate an update without stopping the current runtime? What happens if an update fails mid-process—is there an automatic rollback, and does that rollback require manual intervention? How often does the agent restart for updates that do not change its core functionality?
Vendors who have not thought about this will give vague answers about "seamless updates" without explaining the mechanism. The mechanism matters because it determines whether your 2 p.m. session survives or dies.
Building repair into the agent's normal rhythm
Updates are one form of repair, but agents also need to recover from their own failures without human triage. A network blip should not require a manual service restart. A corrupted local cache should not brick the endpoint until someone remotes in to fix it. The agent should have a periodic self-check—distinct from the cloud check-in—that validates its own state: can I reach the control plane? Is my local database consistent? Are my bridges responding to health probes?
When the self-check finds a problem, the agent attempts graduated recovery. Reconnect the bridge. Rebuild the cache from cloud state. If that fails, escalate to the supervisor for a controlled restart, but only if no session is active. If a session is active, the agent flags itself as degraded and notifies the technician, who can decide whether to close the session and allow repair or work around the limitation. The user is never left with a broken agent and no explanation.
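The recovery ladder described above can be sketched as an ordered list of attempts followed by a session-aware escalation. The function name, the step tuples, and the return strings are assumptions made for this example, not part of any real agent API.

```python
from typing import Callable, List, Tuple

# A recovery step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_recovery(steps: List[Step],
                 session_active: bool,
                 request_restart: Callable[[], None],
                 notify_technician: Callable[[str], None]) -> str:
    """Try the cheapest repairs first; escalate only when it is safe."""
    for name, attempt in steps:
        if attempt():
            return f"recovered:{name}"
    if session_active:
        # Never restart under a live session: degrade and let a human decide.
        notify_technician("agent degraded; close the session to allow repair")
        return "degraded"
    request_restart()  # controlled restart via the supervisor
    return "restarted"
```

A caller would pass something like `[("reconnect_bridge", reconnect), ("rebuild_cache", rebuild)]`, where both callables are whatever the agent uses for those repairs. The ordering encodes the rule that cheap, local fixes always run before anything that interrupts the runtime.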
The operational payoff
MSPs that get this right stop treating agent updates as a source of tickets. The update stream becomes invisible infrastructure, like DNS or certificate renewal—something that happens, works, and never appears in a post-mortem. Technicians trust that the endpoint will be there when they need it. Users stop experiencing mysterious "the support program closed itself" moments. The MSP's reputation for reliability compounds because the foundation is solid.
Caisey is designed around this stability model because remote troubleshooting only works if the tooling itself is not another variable that can fail. The agent's update and repair behavior is not an afterthought; it is core to the architecture, shaped by the same Durable Object session awareness that makes the rest of the platform predictable. When the cloud knows what every endpoint is doing, the endpoint can make smarter decisions about when it is safe to change.