Router · Traffic shaping

Keep your federated API up when a subgraph gets flaky

Retries, timeouts, and circuit breakers in one YAML file, with defaults for every subgraph and overrides for the ones that need different treatment.

Built into the router. No service mesh, no per-service retry library.

Router-level traffic rules

[Diagram: the Router's traffic_shaping config applies retry, timeout, and circuit breaker rules to Subgraph A, Subgraph B, and Subgraph C; one subgraph is annotated with a timeout: 60s override.]

Available on: Free, Pro, Enterprise

The problem

Why one failing subgraph brings down the whole API

Federated GraphQL amplifies whatever reliability problems your subgraphs already have. One flaky service and the router turns every query that touches it into an error.

One slow subgraph slows every query that touches it.

Without a timeout budget at the router, a 30-second backend hang cascades into a 30-second client wait. Your p99 inherits the worst-behaving service.

Transient failures look like real failures to the client.

A momentary 503 during a subgraph restart becomes a user-visible error. Implementing retries in every client or every service produces inconsistent behavior and duplicated code.

One failing subgraph can take down the rest.

When requests pile up against a service returning errors, threads and connections fill up on the router. Other queries, including ones that didn't touch the broken service, start timing out too.

Our solution

One place to set retry, timeout, and circuit breaker rules for every subgraph

Cosmo's traffic shaping is a router-level configuration layer that controls retries, timeouts, and circuit breakers for every subgraph. Set global defaults once and override them per subgraph where needed, so one failing service can't take down your entire federated API.

What happens at request time

  1. The all section sets defaults (retry attempts, timeout budgets, circuit breaker thresholds) applied to every subgraph.

  2. The subgraphs section lets you override specific values for specific services.

  3. Retries use exponential backoff with jitter; mutations are never retried.

  4. Timeouts cover each stage of the request lifecycle: dial, TLS handshake, response headers, and the request as a whole.

  5. Circuit breakers use a time-based sliding window and open automatically when a subgraph fails consistently.
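
To make step 3 concrete, here is a minimal Python sketch of exponential backoff with jitter plus a guard that skips mutations. It is an illustration of the general technique, not the router's implementation: the function names, the doubling factor, the "full jitter" variant, and the parameters `max_attempts`, `base_interval`, and `max_delay` are all assumptions made for this example.

```python
import random


def backoff_delays(max_attempts, base_interval, max_delay, rng=random.random):
    """Return the wait in seconds before each retry attempt.

    Hypothetical helper: the ceiling doubles with every attempt
    (exponential backoff) and the actual wait is a random fraction of
    that ceiling (full jitter), so retries from many requests spread out.
    """
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(max_delay, base_interval * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays


def should_retry(operation_type):
    """Mutations are never retried, because repeating a write may not be safe.

    The page only states the rule for mutations; treating every other
    operation type as retryable is an assumption of this sketch.
    """
    return operation_type != "mutation"


if __name__ == "__main__":
    if should_retry("query"):
        for delay in backoff_delays(max_attempts=4, base_interval=0.5, max_delay=5.0):
            print(f"would wait {delay:.2f}s before the next attempt")
```

The jitter matters as much as the backoff: without it, every request that failed at the same moment would retry at the same moment and hit the recovering subgraph in a synchronized burst.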

One YAML file, GraphQL-aware.


Before & After

Before Cosmo | With Cosmo
Retry logic duplicated across services | Centralized retry config at the router with exponential backoff and jitter
Inconsistent timeouts, client sees worst case | Unified timeout settings with per-subgraph overrides
Cascading failures from one broken subgraph | Circuit breakers isolate the bad service until it recovers
Traffic rules scattered across code, mesh, and sidecars | One YAML file, GraphQL-aware

Router controls

What traffic shaping controls

These apply

  • Request timeouts
  • Retry attempts + backoff
  • Circuit breaker thresholds
  • Per-subgraph overrides

These do not

  • HTTP 4xx/5xx status codes
  • Client-side cancellations
  • Service mesh rules

If the subgraph responds, the router considers the request delivered.

How GraphQL traffic shaping works

01
Exponential backoff with jitter. Mutations never retried.

Define the defaults

Set retry, timeout, and circuit breaker defaults in the all section, which applies to every subgraph request.

02
One slow outlier shouldn't set the rules for everyone.

Override where needed

Under subgraphs, add entries keyed by subgraph name. Any field set there overrides the default for that subgraph only.
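
The defaults-plus-overrides behavior can be pictured as a merge. The Python sketch below mirrors the all and subgraphs sections as a dictionary and resolves the effective settings for one subgraph. It is a mental model only: `request_timeout` and `legacy-service` come from this page, while the `retry`, `enabled`, and `max_attempts` keys, the values, and the field-by-field merge of nested sections are assumptions for illustration. Check the traffic shaping documentation for the actual keys and semantics.

```python
def merge(base, override):
    """Return a copy of base with override applied field by field."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge recursively (an assumption of this sketch).
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_settings(traffic_shaping, subgraph_name):
    """Resolve the settings one subgraph ends up with."""
    defaults = traffic_shaping.get("all", {})
    override = traffic_shaping.get("subgraphs", {}).get(subgraph_name, {})
    return merge(defaults, override)


# Hypothetical config, shaped like the YAML but written as a dict.
config = {
    "all": {
        "request_timeout": "10s",
        "retry": {"enabled": True, "max_attempts": 3},
    },
    "subgraphs": {
        "legacy-service": {"request_timeout": "60s"},
    },
}

if __name__ == "__main__":
    print(effective_settings(config, "legacy-service"))
    print(effective_settings(config, "products"))
```

In this model the legacy service gets the longer timeout while keeping the default retry settings, and a subgraph with no entry simply inherits everything from all.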

03
No subgraph code changes. No sidecar. Transparent to the services.

Runtime enforcement

Timeouts abort long-running calls; retries re-issue failed calls with backoff; circuit breakers track per-subgraph health and open when thresholds are breached.
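
For readers new to the pattern, the following Python sketch shows one way a circuit breaker with a time-based sliding window can work. It is a simplified model, not the router's code: the class, its parameter names (`request_threshold`, `error_threshold_percentage`, `window_seconds`, `sleep_seconds`), and the single-probe recovery rule are assumptions chosen for this example.

```python
import time
from collections import deque


class CircuitBreaker:
    """Tracks recent outcomes for one subgraph and opens on sustained failure."""

    def __init__(self, request_threshold, error_threshold_percentage,
                 window_seconds, sleep_seconds, clock=time.monotonic):
        self.request_threshold = request_threshold
        self.error_threshold_percentage = error_threshold_percentage
        self.window_seconds = window_seconds
        self.sleep_seconds = sleep_seconds
        self._clock = clock
        self._events = deque()   # (timestamp, succeeded) pairs inside the window
        self._opened_at = None   # None means the circuit is closed

    def allow_request(self):
        """Closed: always allow. Open: allow a probe once the sleep window passed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.sleep_seconds

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self._clock()
        if self._opened_at is not None:
            # Probe result: success closes the circuit, failure restarts the sleep.
            if succeeded:
                self._opened_at = None
                self._events.clear()
            else:
                self._opened_at = now
            return
        self._events.append((now, succeeded))
        # Slide the window: drop outcomes older than window_seconds.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        if (total >= self.request_threshold
                and failures * 100 >= self.error_threshold_percentage * total):
            self._opened_at = now


if __name__ == "__main__":
    breaker = CircuitBreaker(request_threshold=4, error_threshold_percentage=50,
                             window_seconds=10, sleep_seconds=5)
    for _ in range(4):
        breaker.record(False)
    print("request allowed:", breaker.allow_request())
```

The request threshold keeps a handful of early errors from tripping the breaker, and the time-based window means old failures age out on their own instead of counting against a subgraph that has since recovered.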

04
Open circuit = a subgraph needs attention.

Observe

Circuit breaker state is exposed as a metric; retry counts and timeout errors appear in traces.

Use cases

Patterns teams enable first

Set limits per subgraph, ship changes from one config, and watch traffic behavior in your existing metrics stack.

Baseline reliability across every subgraph

Traffic shaping

Set global retry, timeout, and circuit breaker defaults in the all section. Every subgraph benefits with no per-service configuration needed.

Accommodating a known slow service

Traffic shaping

Override request_timeout under subgraphs.legacy-service. Other subgraphs keep strict budgets.

Protecting a high-value service during incidents

Traffic shaping

Give the payments subgraph stricter circuit breaker thresholds: a lower request threshold, a shorter sleep window, and a higher success bar.

Metrics & monitoring

Observability

Circuit breaker open/closed transitions are emitted as metrics. Retry counts are reported per subgraph. Alert on a sustained circuit-breaker-open state.

Traffic shaping metrics documentation

Where traffic shaping runs

System boundary
  • Runs inside the GraphQL router.
  • Tracks retries, timeouts, and circuit breaker state per subgraph on every request.
  • No reliance on upstream retries or clients.

Subgraph services receive no new dependencies. The router does all the work.

Add traffic shaping in minutes

Built into Cosmo Router: no new services, no subgraph changes.

FAQ

GraphQL traffic shaping

More detail in the traffic shaping documentation.