GraphQL Circuit Breaker

Keep your API running when a subgraph fails

The router detects the outage, stops traffic to the broken service, and restores traffic once the service recovers. The rest of your graph keeps serving. One config file. No code changes.

Built into the router. Works at the subgraph level. No service mesh required.

Per-subgraph protection

[Diagram: circuit breaker state per subgraph. Subgraph A: closed (OK). Subgraph B: open (isolated). Subgraph C: half-open (probe). Traffic stops to failed subgraphs; the rest of the graph keeps serving.]

Available on Free, Pro, and Enterprise

The problem

One failing subgraph can take down your whole GraphQL API

Each service in your federated graph can fail on its own. When one does, the failure can cascade across your GraphQL API.

Failures cascade across your entire API

When a subgraph becomes slow or unresponsive, requests pile up, resources run out, and the router itself can become unresponsive.

Timeouts waste resources on failing services

Without circuit breakers, requests continue waiting on a degraded dependency. Router resources stay tied up instead of being freed for healthy parts of the graph.

Recovery depends on manual intervention

Without automatic circuit protection, teams have to stop traffic to failing services by hand and decide when it is safe to restore it.

Our solution

Contain subgraph failures without losing the whole API

The router tracks the outcome of every request to each subgraph. When failures exceed your thresholds, it opens the circuit, rejects requests immediately, and probes for recovery after the sleep window.

What happens when the circuit breaker trips

  1. A request to a subgraph fails, and the error rate rises inside the rolling window.

  2. When the configured threshold is exceeded and the minimum request count is met, the circuit opens.

  3. Requests to that subgraph are rejected immediately instead of waiting for timeouts.

  4. Healthy subgraphs continue resolving normally while the failed service is isolated.

  5. After the sleep window, the circuit moves to half-open and allows limited test requests.

  6. If enough test requests succeed, the circuit closes. If a test request fails, it opens again.

The failure stays contained while recovery happens through configurable half-open testing.
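
The six steps above can be sketched as a small state machine. This is a minimal illustration, not the router's implementation: the class, its parameter names (error_threshold, min_requests, sleep_window, probes), and the default values are all hypothetical, and for brevity it counts requests since the circuit last closed instead of using a true rolling window.

```python
import enum


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Illustrative per-subgraph breaker. All names and defaults are hypothetical."""

    def __init__(self, error_threshold=0.5, min_requests=20,
                 sleep_window=5.0, probes=2):
        self.error_threshold = error_threshold  # failure ratio that opens the circuit
        self.min_requests = min_requests        # minimum sample size before opening
        self.sleep_window = sleep_window        # seconds to stay open before probing
        self.probes = probes                    # test requests allowed (and required) in half-open
        self.state = State.CLOSED
        self.total = 0
        self.failures = 0
        self.probes_sent = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent to the subgraph at time `now`."""
        if self.state is State.OPEN:
            if now - self.opened_at < self.sleep_window:
                return False  # step 3: reject immediately
            # step 5: sleep window elapsed, move to half-open
            self.state = State.HALF_OPEN
            self.probes_sent = 0
            self.probe_successes = 0
        if self.state is State.HALF_OPEN:
            if self.probes_sent >= self.probes:
                return False  # only a limited number of test requests
            self.probes_sent += 1
        return True

    def record(self, ok, now):
        """Record the outcome of a request that was allowed through."""
        if self.state is State.HALF_OPEN:
            if not ok:
                self._open(now)  # step 6: a failed probe reopens the circuit
                return
            self.probe_successes += 1
            if self.probe_successes >= self.probes:
                self._close()    # step 6: enough probes succeeded
            return
        if self.state is State.OPEN:
            return  # late result from a request sent before the circuit opened
        # steps 1 and 2: track the error rate and open when both conditions hold
        self.total += 1
        if not ok:
            self.failures += 1
        if (self.total >= self.min_requests
                and self.failures / self.total > self.error_threshold):
            self._open(now)

    def _open(self, now):
        self.state = State.OPEN
        self.opened_at = now
        self.total = self.failures = 0

    def _close(self):
        self.state = State.CLOSED
        self.total = self.failures = 0
```

In this sketch the caller asks allow() before each subgraph request and reports the result with record(). Passing the clock in as `now` keeps the example deterministic.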

Partial outages

Built for partial outages

In federated GraphQL, not every outage affects the whole graph. A database issue, network failure, or timeout in one subgraph can still consume router resources and slow unrelated requests.

Cosmo Router uses circuit breakers to isolate the affected subgraph once failure thresholds are exceeded. Requests to that subgraph are rejected immediately while healthy parts of the graph continue operating.

After the sleep window, the router tests recovery through the half-open state before restoring normal traffic.
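
To make the isolation concrete, here is a rough sketch of failing fast for one subgraph while the others still resolve. It assumes a standard GraphQL-style result with `data` and `errors`; the function, the plan format, and the error message are hypothetical and do not describe how Cosmo Router plans or reports a query.

```python
def resolve_fields(plan, open_circuits, fetch):
    """Resolve each (subgraph, field) pair, skipping subgraphs whose circuit is open.

    `plan`, `open_circuits`, and `fetch` are illustrative stand-ins, not router APIs.
    """
    data, errors = {}, []
    for subgraph, field in plan:
        if subgraph in open_circuits:
            # Fail fast: no network call, no waiting for a timeout.
            data[field] = None
            errors.append({
                "message": f"subgraph '{subgraph}' is unavailable (circuit open)",
                "path": [field],
            })
            continue
        data[field] = fetch(subgraph, field)
    return {"data": data, "errors": errors}
```

The point of the sketch is the early `continue`: the isolated subgraph costs the router nothing beyond a set lookup, so healthy fields are not delayed by it.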

Circuit logic

What counts as a failure

The circuit breaker tracks network-level and transport-level failures:

  • Connection refused errors
  • DNS errors
  • TLS failures
  • Broken connections
  • Read/write timeouts
  • Circuit breaker execution timeouts

These contribute to the error rate.

These do not

  • HTTP 4xx or 5xx responses when a response is received
  • Request cancellations
  • Client-side timeouts

If the subgraph returns an HTTP response, the circuit breaker does not treat that response as a transport failure.
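
The two lists above boil down to one rule: a request counts against the subgraph only when no HTTP response came back because of a transport problem. A minimal sketch of that rule, using Python's standard exception types as stand-ins for the router's own error handling (the function and its parameters are hypothetical):

```python
import socket
import ssl

# Stand-ins for the transport-level failures listed above.
TRANSPORT_ERRORS = (
    ConnectionError,   # refused, reset, or broken connections
    socket.gaierror,   # DNS resolution errors
    ssl.SSLError,      # TLS failures
    TimeoutError,      # read/write or execution timeouts
    socket.timeout,    # same as TimeoutError on Python 3.10+
)


def counts_as_failure(error=None, status_code=None, client_aborted=False):
    """Return True if this outcome should raise the breaker's error rate.

    `client_aborted` models request cancellations and client-side timeouts.
    """
    if client_aborted:
        return False  # the caller gave up; the subgraph is not to blame
    if status_code is not None:
        return False  # a response arrived, even if it was a 4xx or 5xx
    return isinstance(error, TRANSPORT_ERRORS)
```

For example, `counts_as_failure(status_code=503)` is False in this sketch, while `counts_as_failure(error=ConnectionRefusedError())` is True.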

How a GraphQL circuit breaker works

01
Failures hide in aggregate latency.

Detect

Set the rolling window and error threshold. The circuit opens when the threshold is exceeded and the minimum request count is met.

02
One slow subgraph can stall the graph.

Isolate

Requests to the affected subgraph are rejected immediately while the circuit is open, freeing router resources for the rest of the graph.

03
Blind retries amplify outages.

Test

After the sleep window, the router enters half-open state and allows a limited number of test requests.

04
Recovery should not be guesswork.

Restore

If enough test requests succeed, the circuit closes and normal traffic resumes. If a test request fails, the circuit opens again.

Router controls

Tune the circuit breaker per subgraph

Set limits per subgraph, ship changes from one config, and watch circuit state in your existing metrics stack.

Set rules per subgraph

Traffic shaping

Different services fail in different ways, so Cosmo lets you tune circuit breaker behavior independently.

Error rate

Open the circuit when failures exceed a threshold.

Time window

Controls how fast the circuit reacts.

Short window = faster response.

Long window = more stability.
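
The trade-off between short and long windows is easy to see with a small rolling-window sketch. The class below is illustrative only (its name and its per-event storage are assumptions, not how the router keeps its statistics): it remembers request outcomes for `duration` seconds and reports the failure ratio over what remains.

```python
from collections import deque


class RollingWindow:
    """Keeps request outcomes for the last `duration` seconds (hypothetical helper)."""

    def __init__(self, duration):
        self.duration = duration
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, now, ok):
        self.events.append((now, ok))
        self._evict(now)

    def error_rate(self, now):
        self._evict(now)
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.duration:
            self.events.popleft()


# One request per second: 20 successes, then 5 failures.
short, long_ = RollingWindow(duration=5), RollingWindow(duration=30)
for t in range(25):
    ok = t < 20
    short.record(t, ok)
    long_.record(t, ok)
```

At t = 24 the 5-second window holds only the five failures, so its error rate is 1.0 and a 50% threshold would already be exceeded. The 30-second window still holds all 25 requests, so its error rate is 5/25 = 0.2 and the same threshold would not trip on this brief burst.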

Retry behavior

Retries stop when the circuit opens.

Recovery requires successful probes before traffic returns.

Metrics & monitoring

Observability

Monitor circuit breaker state, error rates, and recovery behavior through router metrics. Use those signals to alert on open circuits and confirm recovery.

Circuit breaker metrics documentation

One config. No code changes.

Cosmo Router

Add circuit breakers in a single config file.

No changes to your subgraphs.

Where the circuit breaker runs

System boundary

The circuit breaker runs inside the GraphQL router, where it tracks request outcomes and applies protection before failures spread across the graph.

When a circuit opens:

  • requests to that subgraph are rejected immediately
  • traffic is cut at the router
  • healthy subgraphs continue resolving normally

Failure is isolated to the subgraph. Not the whole API.

Example YAML configuration

Configuration

The Cosmo Router handles federated API resilience at the subgraph level: no service mesh required, no code changes in your subgraphs.

Configure circuit breakers in Cosmo Router

Built into Cosmo Router and configured through traffic shaping.

FAQ

GraphQL circuit breaker

More detail in the circuit breaker documentation.