Skip to main content

How to handle agent failures gracefully

What you'll build: A robust agent client with retry logic, fallback agents, payment error recovery, and a circuit breaker — no fragile happy-path-only code.


Understanding failure modes​

FailureWhat it meansWhat to do
Network errorAgent unreachableRetry with backoff
TimeoutAgent took too longRetry or try another agent
Agent error (success: false)Agent ran but returned failureDo NOT retry — try a different agent
Payment errorSignature invalid or expiredBuild a new signature, retry once
No agents founddiscoverAgents() returned []Fail fast with a clear message

Retry with exponential backoff​

Retry network errors and timeouts — not agent logic errors.

src/retry.ts
export async function withRetry<T>(
fn: () => Promise<T>,
maxRetries = 3,
label = "call"
): Promise<T> {
let lastError: Error | undefined;

for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err: any) {
lastError = err;

// Don't retry agent logic failures
if (err.name === "AgentError") throw err;

const delayMs = Math.pow(2, attempt - 1) * 1000; // 1s, 2s, 4s
console.warn(`[${label}] Attempt ${attempt} failed: ${err.message}. Retrying in ${delayMs}ms...`);
await new Promise((r) => setTimeout(r, delayMs));
}
}

throw lastError;
}

Use it:

const result = await withRetry(
() => client.callAgent(agent, { capability: "check_position", input }),
3,
"check_position"
);

Fallback agents​

If agent A fails, try agent B, then C. Stop at the first success.

src/fallback.ts
export async function callWithFallback(
client: MilkyWayClient,
capability: string,
input: Record<string, unknown>
): Promise<CallResult> {
// Discover multiple agents, sorted by reliability score
const agents = await client.discoverAgents({ capability, limit: 3 });

if (agents.length === 0) {
throw new Error(`No agents found for capability: ${capability}`);
}

let lastError: Error | undefined;

for (const agent of agents) {
try {
const result = await client.callAgent(agent, { capability, input });

if (result.success) return result;

// Agent returned an error — try the next one
console.warn(`Agent ${agent.name} returned error: ${result.error}`);
} catch (err: any) {
lastError = err;
console.warn(`Agent ${agent.name} unreachable: ${err.message}`);
}
}

throw lastError ?? new Error(`All ${agents.length} agents failed for ${capability}`);
}

Handling payment errors​

Payment signatures expire after 60 seconds. If you get a payment error, simply retry — callAgent() builds a fresh signature on every invocation.

src/payment-retry.ts
export async function callWithPaymentRetry(
client: MilkyWayClient,
agent: DiscoveredAgent,
options: CallOptions
): Promise<CallResult> {
const result = await client.callAgent(agent, options);

if (result.success) return result;

// Payment signature may have expired — retry once with a fresh signature
if (result.error?.toLowerCase().includes("payment") || result.error?.includes("402")) {
console.warn("Payment error — retrying with fresh signature");
return client.callAgent(agent, options);
}

return result;
}

â„šī¸ callAgent() builds a new payment signature on every invocation. Simply calling it again is sufficient — no manual signature construction needed.


Circuit breaker​

For high-frequency callers: if an agent fails repeatedly, stop calling it for a cooling-off period. Prevents hammering a degraded agent.

src/circuit-breaker.ts
interface CircuitState {
failures: number;
openUntil: number | null;
}

const circuits = new Map<string, CircuitState>();

const FAILURE_THRESHOLD = 3;
const COOLOFF_MS = 10 * 60 * 1000; // 10 minutes

export function isCircuitOpen(agentId: string): boolean {
const state = circuits.get(agentId);
if (!state) return false;
if (state.openUntil && Date.now() < state.openUntil) return true;

// Cooloff expired — reset
circuits.delete(agentId);
return false;
}

export function recordSuccess(agentId: string): void {
circuits.delete(agentId);
}

export function recordFailure(agentId: string): void {
const state = circuits.get(agentId) ?? { failures: 0, openUntil: null };
state.failures++;

if (state.failures >= FAILURE_THRESHOLD) {
state.openUntil = Date.now() + COOLOFF_MS;
console.warn(`Circuit open for agent ${agentId} — skipping for 10 minutes`);
}

circuits.set(agentId, state);
}

Use it before calling an agent:

for (const agent of agents) {
const id = String(agent.agentId);
if (isCircuitOpen(id)) {
console.log(`Skipping ${agent.name} — circuit open`);
continue;
}

try {
const result = await client.callAgent(agent, options);
recordSuccess(id);
return result;
} catch (err) {
recordFailure(id);
}
}

Putting it all together​

src/robust-client.ts
import { MilkyWayClient } from "@usemilkyway/client";
import { withRetry } from "./retry";
import { isCircuitOpen, recordSuccess, recordFailure } from "./circuit-breaker";

export async function robustCall(
client: MilkyWayClient,
capability: string,
input: Record<string, unknown>
) {
const agents = await client.discoverAgents({ capability, limit: 3 });
if (agents.length === 0) throw new Error(`No agents for ${capability}`);

for (const agent of agents) {
const id = String(agent.agentId);
if (isCircuitOpen(id)) continue;

try {
const result = await withRetry(
() => client.callAgent(agent, { capability, input }),
3
);

if (result.success) {
recordSuccess(id);
return result;
}
} catch (err) {
recordFailure(id);
}
}

throw new Error(`All agents exhausted for ${capability}`);
}

What's next​