Does the router retry mutations?

No. Mutations are never retried automatically. Only queries (which should be idempotent) are retried.

What errors does the default configuration retry?

HTTP 500, 502, 503, 504 and connection-level errors. The default expression is IsRetryableStatusCode() || IsConnectionError().

Does the router retry on unexpected EOF?

Yes. Unexpected EOF from the subgraph is treated as a retryable connection-level failure alongside the other defaults.

Can I retry on 429 Too Many Requests?

Yes. Add statusCode == 429 to the retry expression. The router can also honor the Retry-After header when enabled.

What prevents retry storms during an outage?

Jitter. Each retry wait is randomized within a range, so parallel retries do not align. The algorithm follows the AWS Backoff and Jitter pattern.

How do I see which retries are happening?

Enable debug logging. Retry attempts are logged with the failure reason and wait interval.

Does retry interact with circuit breakers?

Yes. When a circuit breaker is open for a subgraph, retries to that subgraph stop; the breaker takes precedence. Once the breaker closes, retries resume.
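
The answers above imply an order of precedence: operation type first, then breaker state, then the retry expression. The sketch below is only a mental model of that order, not the router's code; `Attempt` and `may_retry` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """Hypothetical summary of one failed subgraph request."""
    operation_type: str       # "query" or "mutation"
    breaker_open: bool        # circuit breaker state for the target subgraph
    expression_matched: bool  # result of the retry expression for this failure


def may_retry(attempt: Attempt) -> bool:
    """Apply the precedence described in the FAQ answers."""
    if attempt.operation_type != "query":
        return False  # only queries are retried; mutations never are
    if attempt.breaker_open:
        return False  # an open breaker takes precedence over the retry policy
    return attempt.expression_matched  # otherwise the expression decides
```

With this model, a query that fails with a retryable error is retried only while the breaker for its subgraph is closed; a mutation is never retried, whatever the expression says.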

GraphQL Router Retry Mechanism | Cosmo by WunderGraph

The problem

Retries belong in one place with guardrails

Per-service retry logic and blind retries both break under load. The router fixes both.

Retries per service produce inconsistent behavior

Team A uses three attempts, Team B uses one, Team C forgot. The client experience depends on which subgraph happened to blip.

Naïve retries make outages worse

If every client retries the same failing service at the same interval, you get a retry storm: synchronized traffic waves that keep the service from recovering.

Retrying a mutation is dangerous

Mutations are not always idempotent. Retry the same charge operation twice and you have double-charged a customer. A retry layer that does not know queries from mutations is a liability.

Our solution

Transient failures retry automatically without duplicating mutations

Mutations are never retried.

Cosmo Router automatically retries failed GraphQL queries using exponential backoff with jitter, configured once at the router, with expression-based conditions for when retries fire.
When a subgraph request fails with a retryable condition, the router waits using the AWS-recommended Backoff and Jitter pattern, then tries again. Limits cap attempts, intervals, and total retry duration.
An expression evaluates each failure; the defaults cover connection errors and HTTP 500, 502, 503, and 504. Extend the expression with `statusCode == 429` or boolean logic as needed.

One policy for every subgraph, tuned with expressions when you need nuance.

Tradeoffs

Before & After

Before Cosmo	With Cosmo
Retry logic scattered across subgraph services	Centralized retry policy at the router
Fixed retry intervals causing retry storms	Exponential backoff with jitter distributes retry load
All failures retried, including non-idempotent mutations	Only queries retry; mutations never do
Hard to tune which errors should trigger a retry	Expression-based conditions with built-in helpers

Reference

Expression helpers

Use these in retry expression strings; combine with ||, &&, and comparisons.

Function	Returns true for
IsRetryableStatusCode()	HTTP 500, 502, 503, 504
IsConnectionError()	Connection refused, reset, DNS, TLS failures
IsTimeout()	Any timeout (HTTP, network, deadline exceeded)
IsHttpReadTimeout()	HTTP read timeouts specifically
IsConnectionRefused()	ECONNREFUSED
IsConnectionReset()	ECONNRESET

Full reference: retry documentation.
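
To see how the helpers compose, here is a rough Python model of the table above. It is a sketch under assumptions: the `Failure` record and its `kind` labels are hypothetical, and the real expressions are evaluated by the router, not by Python. Only the status codes and failure categories come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Failure:
    """Hypothetical failure record used only for this illustration."""
    status_code: Optional[int] = None  # set when the subgraph answered over HTTP
    kind: Optional[str] = None         # hypothetical label for a transport failure


CONNECTION_KINDS = {"refused", "reset", "dns", "tls"}
TIMEOUT_KINDS = {"read_timeout", "network_timeout", "deadline"}


def is_retryable_status_code(f: Failure) -> bool:
    return f.status_code in {500, 502, 503, 504}


def is_connection_error(f: Failure) -> bool:
    return f.kind in CONNECTION_KINDS


def is_timeout(f: Failure) -> bool:
    return f.kind in TIMEOUT_KINDS


def is_http_read_timeout(f: Failure) -> bool:
    return f.kind == "read_timeout"


def default_expression(f: Failure) -> bool:
    # Models: IsRetryableStatusCode() || IsConnectionError()
    return is_retryable_status_code(f) or is_connection_error(f)


def with_rate_limits(f: Failure) -> bool:
    # Models: IsRetryableStatusCode() || IsConnectionError() || statusCode == 429
    return default_expression(f) or f.status_code == 429


def timeouts_except_read(f: Failure) -> bool:
    # Models: !IsHttpReadTimeout() && IsTimeout()
    return not is_http_read_timeout(f) and is_timeout(f)
```

In this model a 503 or a connection reset matches the default expression, a 429 matches only once `statusCode == 429` is added, and a read timeout is excluded by the last expression while other timeouts still match.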

How the retry mechanism works

01

Expressions pick retryable vs. final failure.

Evaluate the failure

When a subgraph request fails, the router evaluates the retry condition expression.

02

AWS backoff-and-jitter pattern.

Wait with backoff and jitter

Before each retry, the router waits. The configured interval sets the base wait, and each subsequent wait grows exponentially up to the configured cap; jitter randomizes every wait so parallel retries do not align.

03

Queries only; never mutations.

Retry the subgraph call

The retried request goes to the same subgraph. If it succeeds, the client sees the success.

04

Bounded blast radius.

Stop at the limit

If `max_attempts` or `max_duration` is reached without success, the router returns the final error to the client.
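
The four steps above can be sketched as a small retry loop. This is a minimal illustration, not the router's implementation: it uses the "Full Jitter" variant from the AWS Backoff and Jitter write-up (the page does not say which variant the router uses), it assumes `max_attempts` counts every call including the first, and `interval` and `max_interval` are hypothetical parameter names and values.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5       # assumed to count every call, including the first
    max_duration: float = 10.0  # seconds allowed for the whole retry sequence
    interval: float = 0.5       # hypothetical base wait in seconds
    max_interval: float = 4.0   # hypothetical cap on a single wait


def backoff_with_jitter(policy: RetryPolicy, retry_number: int, rng: random.Random) -> float:
    """Full Jitter: a uniform wait between 0 and min(cap, base * 2**retry_number)."""
    ceiling = min(policy.max_interval, policy.interval * (2 ** retry_number))
    return rng.uniform(0.0, ceiling)


def call_with_retries(
    call: Callable[[], T],
    should_retry: Callable[[Exception], bool],
    policy: RetryPolicy,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    rng = rng or random.Random()
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()  # step 03: (re)issue the subgraph call
        except Exception as err:
            # Step 01: evaluate the failure; step 04: stop at the attempt limit.
            if attempt >= policy.max_attempts or not should_retry(err):
                raise
            # Step 02: wait with backoff and jitter.
            wait = backoff_with_jitter(policy, attempt - 1, rng)
            # Step 04: stop if the wait would exceed the total duration budget.
            if clock() - start + wait > policy.max_duration:
                raise
            sleep(wait)
```

Because each wait is drawn at random below an exponentially growing ceiling, many callers failing at the same instant spread their retries out instead of returning in synchronized waves.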

Use case patterns

When teams tune retries first

Same mechanism, different expressions for network blips, deploys, rate limits, and fine-grained exclusions.

Recovering from a network blip

Default expression

The default expression catches connection errors via IsConnectionError(); the router waits with jitter, then retries. If a retry succeeds, the client never sees the error.

Riding out a deployment

503

Subgraph returns 503 during rolling deploy. IsRetryableStatusCode() includes 503. Up to five attempts over ten seconds usually lands on a healthy pod.

Honoring rate limits

429

Include statusCode == 429 in the expression and enable Retry-After handling. Router waits the requested amount and tries again.

Excluding slow business logic

Fine-grained

An expression like !IsHttpReadTimeout() && IsTimeout() retries every timeout except HTTP read timeouts, so a subgraph that is slow to respond is not called again.
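
For the rate-limit pattern above, the wait comes from the Retry-After header, which HTTP defines in two forms: a number of seconds or an HTTP date. The sketch below shows one way a client could turn either form into a wait in seconds. How the router parses the header is not described on this page, so treat `retry_after_seconds` as a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait requested by a Retry-After value, or None if it cannot be parsed."""
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # neither form; the caller falls back to normal backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())  # never wait a negative time
```

A date in the past yields a wait of zero, and an unparseable value yields None so the caller can fall back to its regular backoff interval.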

Cosmo vs generic HTTP retry vs service mesh

Centralized router retries stay GraphQL-aware and expression-driven without bolting policy onto every subgraph.

Aspect	Cosmo	Generic HTTP retry	Service mesh
GraphQL-aware	Yes (queries vs mutations)	No	No
Condition expressions	Yes	Usually no	Limited
Backoff algorithm	Jitter built in	Varies	Varies
429 Retry-After support	Yes	Varies	Varies

Ship retries once at the router

Jitter beats storms. Expressions beat one-size-fits-all. Queries only, always.


Recover from transient failures before your users see them