Traffic management · Retry mechanism

Recover from transient failures before your users see them

When a subgraph call hits a transient error, the router retries it with exponential backoff and jitter, and leaves mutations alone so nothing gets duplicated. You decide when retries kick in and how many to allow.

Per-subgraph retry rules. GraphQL-aware: mutations stay safe.

Available on Free, Pro, Enterprise

The problem

Retries belong in one place with guardrails

Per-service retry logic and blind retries both break under load. The router fixes both.

Retries per service produce inconsistent behavior

Team A uses three attempts, Team B uses one, Team C forgot. The client experience depends on which subgraph happened to blip.

Naïve retries make outages worse

If every client retries the same failing service at the same interval, you get a retry storm: synchronized traffic waves that keep the service from recovering.

Retrying a mutation is dangerous

Mutations are not always idempotent. Retry the same charge operation twice and you have double-charged a customer. A retry layer that does not know queries from mutations is a liability.

Our solution

Transient failures retry automatically without duplicating mutations

Mutations are never retried.

  1. Cosmo Router automatically retries failed GraphQL queries using exponential backoff with jitter, configured once at the router, with expression-based conditions for when retries fire.

  2. When a subgraph request fails with a retryable condition, the router waits using the AWS-recommended Backoff and Jitter pattern, then tries again. Limits cap attempts, intervals, and total retry duration.

  3. Mutations are never retried. An expression evaluates each failure; defaults cover connection errors and HTTP 502/503/504. Extend with `statusCode == 429` or boolean logic as needed.

One policy for every subgraph, tuned with expressions when you need nuance.

Tradeoffs

Before & After

| Before Cosmo | With Cosmo |
| --- | --- |
| Retry logic scattered across subgraph services | Centralized retry policy at the router |
| Fixed retry intervals causing retry storms | Exponential backoff with jitter distributes retry load |
| All failures retried, including non-idempotent mutations | Only queries retry; mutations never do |
| Hard to tune which errors should trigger a retry | Expression-based conditions with built-in helpers |

Reference

Expression helpers

Use these in retry expression strings; combine with ||, &&, and comparisons.

| Function | Returns true for |
| --- | --- |
| IsRetryableStatusCode() | HTTP 500, 502, 503, 504 |
| IsConnectionError() | Connection refused, reset, DNS, TLS failures |
| IsTimeout() | Any timeout (HTTP, network, deadline exceeded) |
| IsHttpReadTimeout() | HTTP read timeouts specifically |
| IsConnectionRefused() | ECONNREFUSED |
| IsConnectionReset() | ECONNRESET |

Full reference: retry documentation.

How the retry mechanism works

01
Expressions pick retryable vs. final failure.

Evaluate the failure

When a subgraph request fails, the router evaluates the retry condition expression.
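To make the evaluation step concrete, here is a minimal Python sketch that models a retry condition as a plain predicate over a failure record. It is an illustration, not the router's implementation: the `Failure` class, its fields, and the function names are hypothetical stand-ins for the expression helpers, and the status-code set is taken from the helper reference table.

```python
from dataclasses import dataclass
from typing import Optional

# Status codes listed for IsRetryableStatusCode() in the helper reference.
RETRYABLE_STATUS = {500, 502, 503, 504}


@dataclass
class Failure:
    """Hypothetical record of one failed subgraph call."""
    status_code: Optional[int] = None  # None when no HTTP response arrived
    connection_error: bool = False     # refused, reset, DNS, TLS


def is_retryable_status_code(f: Failure) -> bool:
    # Stand-in for the IsRetryableStatusCode() helper.
    return f.status_code in RETRYABLE_STATUS


def is_connection_error(f: Failure) -> bool:
    # Stand-in for the IsConnectionError() helper.
    return f.connection_error


def example_condition(f: Failure) -> bool:
    # Mirrors: IsRetryableStatusCode() || IsConnectionError()
    return is_retryable_status_code(f) or is_connection_error(f)


def with_rate_limits(f: Failure) -> bool:
    # Mirrors the same expression extended with: || statusCode == 429
    return example_condition(f) or f.status_code == 429
```

A failure is retried only when the predicate returns true; anything else is treated as final and returned to the client.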

02
AWS backoff-and-jitter pattern.

Wait with backoff and jitter

On a retry, the router waits for the configured interval. Each subsequent retry follows the pattern; jitter adds randomness so parallel retries do not align.
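The AWS write-up on backoff and jitter describes several variants, and this page does not say which one the router uses. As an assumption, the sketch below shows the "full jitter" variant: the delay is drawn uniformly between zero and an exponentially growing ceiling. The names `backoff_delay`, `base`, and `cap` are illustrative, not router settings.

```python
import random


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full jitter: a uniform random delay in [0, min(cap, base * 2**attempt)].

    attempt is zero-based, base is the initial interval in seconds,
    and cap bounds how large the ceiling can grow.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random()
    for attempt in range(5):
        # Each run prints different delays; that randomness is the point.
        print(attempt, round(backoff_delay(attempt, base=0.5, cap=10.0, rng=rng), 3))
```

Because every caller draws its own random delay, retries that start at the same moment spread out instead of arriving in synchronized waves.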

03
Queries only; never mutations.

Retry the subgraph call

The retried request goes to the same subgraph. If it succeeds, the client sees the success.

04
Bounded blast radius.

Stop at the limit

If `max_attempts` or `max_duration` is reached without success, the router returns the final error to the client.
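The four steps can be sketched as one bounded loop. This is a simplified model under stated assumptions, not the router's code: `call_with_retries` and its parameters are hypothetical, `max_attempts` is assumed to count the first try, and `max_duration` is checked only after a failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    send: Callable[[], T],
    should_retry: Callable[[Exception], bool],
    operation_type: str,
    max_attempts: int,
    max_duration: float,
    delay_for: Callable[[int], float],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call send() until it succeeds, the condition says stop, or a limit is hit."""
    start = clock()
    attempt = 1
    while True:
        try:
            return send()
        except Exception as err:
            # Stop when either limit is reached (assumed semantics, see lead-in).
            exhausted = attempt >= max_attempts or clock() - start >= max_duration
            # Only queries are retried; mutations always surface the first error.
            if operation_type != "query" or exhausted or not should_retry(err):
                raise
            sleep(delay_for(attempt))
            attempt += 1
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in production use the defaults apply.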

Use case patterns

When teams tune retries first

Same mechanism, different expressions for network blips, deploys, rate limits, and fine-grained exclusions.

Recovering from a network blip

Default expression

The default expression catches connection errors via IsConnectionError(). The router waits with jitter and retries, so the client never sees the error.

Riding out a deployment

503

A subgraph returns 503 during a rolling deploy. IsRetryableStatusCode() includes 503, so the router retries. Up to five attempts over ten seconds usually lands on a healthy pod.

Honoring rate limits

429

Include `statusCode == 429` in the expression and enable Retry-After handling. The router waits the requested amount and tries again.
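A Retry-After header carries either a number of seconds or an HTTP date. The sketch below shows one way to turn it into a wait time in Python. It is illustrative only: the `retry_after_seconds` function and its `max_wait` cap are assumptions, and this page does not describe how the router bounds the wait.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: datetime, max_wait: float) -> Optional[float]:
    """Return how long to wait, in seconds, or None if the header is unusable.

    now must be timezone-aware. max_wait is a hypothetical safety cap.
    """
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        wait = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        wait = (when - now).total_seconds()
    # Never wait a negative amount, and never longer than the cap.
    return min(max(wait, 0.0), max_wait)
```

Capping the wait keeps a misbehaving subgraph from stalling a request far beyond the overall retry budget.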

Excluding slow business logic

Fine-grained

An expression like `!IsHttpReadTimeout() && IsTimeout()` retries connection-level timeouts but skips HTTP read timeouts.
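The exclusion is easier to see as a truth table. This Python sketch mirrors the boolean logic of the expression only; `TimeoutFailure` and its fields are hypothetical, standing in for what IsTimeout() and IsHttpReadTimeout() would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutFailure:
    """Hypothetical view of a failure, reduced to the two helper results."""
    timed_out: bool          # what IsTimeout() would report
    http_read_timeout: bool  # what IsHttpReadTimeout() would report


def retry_unless_read_timeout(f: TimeoutFailure) -> bool:
    # Mirrors: !IsHttpReadTimeout() && IsTimeout()
    return (not f.http_read_timeout) and f.timed_out


if __name__ == "__main__":
    cases = {
        "connect timeout": TimeoutFailure(timed_out=True, http_read_timeout=False),
        "HTTP read timeout": TimeoutFailure(timed_out=True, http_read_timeout=True),
        "not a timeout": TimeoutFailure(timed_out=False, http_read_timeout=False),
    }
    for name, failure in cases.items():
        print(name, retry_unless_read_timeout(failure))
```

A slow resolver that trips the read timeout is left alone, while a timeout that happens before the response starts is still retried.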

Cosmo vs generic HTTP retry vs service mesh

Centralized router retries stay GraphQL-aware and expression-driven without bolting policy onto every subgraph.

| Aspect | Cosmo | Generic HTTP retry | Service mesh |
| --- | --- | --- | --- |
| GraphQL-aware | Yes (queries vs mutations) | No | No |
| Condition expressions | Yes | Usually no | Limited |
| Backoff algorithm | Jitter built in | Varies | Varies |
| 429 Retry-After support | Yes | Varies | Varies |

Ship retries once at the router

Jitter beats storms. Expressions beat one-size-fits-all. Queries only, always.

FAQ

GraphQL router retries

Full detail in the retry documentation.