How NerdWallet Eliminated 10-Second GraphQL Query Planning Delays

Company Overview
NerdWallet is a personal finance platform that helps people make smarter decisions about their money by comparing and evaluating products like credit cards, banking, investing, mortgages, and insurance. It offers tools, content, and a mobile app to help consumers understand options, choose financial products, and track their finances with greater confidence.
The Challenge
A Gateway Architecture Under Pressure
NerdWallet ran its GraphQL platform on Apollo Gateway in front of roughly 10 subgraphs, mostly Node.js services plus one written in Go. The setup worked early on, but as more product teams attached subgraphs to the graph, the gateway began to hit operational scaling limits.
Upgrading was blocked.
Legacy code tied the team to an old Federation version. New subgraphs followed current Apollo Federation patterns, but the gateway couldn't validate or plan schemas that used them. Teams building subgraphs found themselves constrained by a gateway that couldn't keep up.
Query planning started blocking traffic.
Complex queries took more than 10 seconds to plan. While those queries waited, other requests backed up behind them. The slowdown spread across the graph and began affecting users.
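A toy model shows why one slow plan stalls unrelated requests. The Python sketch below assumes, for illustration only, that planning runs serially on a single worker (as synchronous work on one Node.js event loop would) while subgraph fetches do not hold that worker. `Request` and `completion_times` are hypothetical names, not part of Apollo Gateway.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_s: float  # when the request reaches the gateway
    plan_s: float     # query planning time; holds the single worker
    fetch_s: float    # subgraph fetch time; assumed NOT to hold the worker


def completion_times(requests: list[Request]) -> dict[str, float]:
    """Finish time of each request when planning runs serially, in arrival
    order, on one worker (a toy model of a blocked event loop)."""
    worker_free_at = 0.0
    finished: dict[str, float] = {}
    for req in sorted(requests, key=lambda r: r.arrival_s):
        start = max(worker_free_at, req.arrival_s)  # wait for earlier planning
        worker_free_at = start + req.plan_s
        finished[req.name] = worker_free_at + req.fetch_s
    return finished
```

In this model, a cheap request that arrives 100 ms after a query with a 10-second plan (5 ms of planning, 50 ms of fetching) finishes at about 10.055 s instead of about 0.155 s. Its own work never changed; it simply waited behind someone else's planning.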
To compensate, the team wrote their own internal prewarming logic, driven by a manually maintained list of known slow queries. It helped until a new slow query appeared. If that query wasn't added to the list, the performance of the entire graph could suffer.
The Solution
A Phased Migration to WunderGraph Cosmo
When NerdWallet evaluated alternatives, they focused on three things: Federation support, performance, and cost (both infrastructure spend and team time). The migration also had to be safe.
They chose WunderGraph Cosmo and migrated in a three-phase rollout.
When roadblocks appeared, such as missing Federation features or behavior gaps tied to newer Apollo Federation spec versions, the WunderGraph team patched them or built the missing support directly.
"The WunderGraph team jumped in to patch issues and build features that unblocked our migration."
Subgraph teams retained ownership of their services and features throughout the migration. Internal QA ran regression tests at each stage. The phased approach allowed them to migrate while keeping user-facing behavior stable.
The Result
The Graph Got Faster. The Team Got Time Back.
- Cache warming eliminated repeated 10+ second planning delays and brought planning latency down dramatically after warmup.
- Removed fragile internal prewarming scripts that needed manual updates every time a new slow query appeared.
- Gained high confidence in Supergraph updates and horizontal scaling, while still handling router version upgrades carefully.
- Improved visibility into slow queries and bottlenecks via Cosmo Studio and OpenTelemetry (OTEL) tracing.
- Reduced operational overhead: no more emergency prewarm-list maintenance and fewer brittle workarounds around the old gateway.
Cache Warmer
The Performance Breakthrough
The biggest improvement came from Cosmo's Cache Warmer.
Previously, NerdWallet's internal script tried to prewarm the worst offenders. It worked, but it required manual updates.
Cosmo Cache Warmer replaced this entirely. It automatically identifies the most expensive query plans and prewarms them before the router serves traffic. Subsequent requests for those operations are served from the plan cache and skip the planning phase completely.
"Caching query planning was a game-changer in terms of performance for us."
For some of their slowest production operations, cache warming eliminated repeated 10+ second planning delays and dramatically improved performance after warmup. Similar cold-start planning behavior was later documented publicly in WunderGraph’s Super Bowl scaling analysis, where cache warm-up reduced planning spikes from 8-15 seconds to below one second.
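The cache warming described above can be sketched as a plan cache that is filled before the router accepts traffic. This is a minimal Python sketch under stated assumptions, not Cosmo's implementation: the planner is passed in as a stand-in for the real (expensive) query planner, per-operation planning times are assumed to come from collected analytics, and `PlanCache`, `warm`, and `top_n` are hypothetical names.

```python
import hashlib
from typing import Callable


def operation_key(query: str) -> str:
    """Cache key for an operation. Production routers typically normalize the
    GraphQL document first; this sketch only collapses whitespace."""
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode()).hexdigest()


class PlanCache:
    """Hypothetical in-memory plan cache with hit/miss counters."""

    def __init__(self, planner: Callable[[str], str]) -> None:
        self._planner = planner  # stand-in for the expensive query planner
        self._plans: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, query: str) -> str:
        """Serve a cached plan, or plan on the request path (the slow case)."""
        key = operation_key(query)
        if key in self._plans:
            self.hits += 1
        else:
            self.misses += 1
            self._plans[key] = self._planner(query)
        return self._plans[key]

    def warm(self, planning_ms: dict[str, float], top_n: int) -> list[str]:
        """Before serving traffic, pre-plan the top_n operations with the highest
        observed planning time (milliseconds). Returns the warmed queries."""
        slowest = sorted(planning_ms, key=lambda q: planning_ms[q], reverse=True)[:top_n]
        for query in slowest:
            self._plans[operation_key(query)] = self._planner(query)
        return slowest
```

For example, warming with observed times `{"query A": 12000.0, "query B": 40.0, "query C": 9500.0}` and `top_n=2` pre-plans `query A` and `query C`; a later request for `query A` is a cache hit, while `query B` is still planned on the request path. In this sketch, ranking by observed planning time is what replaces a hand-maintained list: an operation that becomes slow shows up in the data and is warmed on the next run without anyone editing a script.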
Observability
Improved Observability Across the Graph
As the platform grew, the team needed better visibility into what was happening inside the graph.
Cosmo Studio provided traces, analytics, and schema visibility in one place. The team could see slow requests, identify complex execution paths, and quickly spot where subgraphs were struggling, turning guesswork into targeted fixes.
OTEL tagging confirmed that performance stayed on par or improved through the migration. Cache-hit metrics helped tune the router after rollout.
Schema Checks help reduce the risk of introducing breaking changes during Supergraph updates.
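Part of what a schema check does can be shown with a toy comparison of the current and proposed schemas. This Python sketch is an assumption-laden illustration, not how Cosmo implements Schema Checks: it models a schema as a hypothetical mapping from type names to field names and only flags removed types and fields. Real checks typically cover more, such as changed field types, arguments, and composition errors.

```python
def breaking_changes(current: dict[str, set[str]], proposed: dict[str, set[str]]) -> list[str]:
    """Flag removals that could break existing clients. Schemas are modeled as
    {type_name: {field_name, ...}}; additions are treated as safe."""
    problems: list[str] = []
    for type_name, fields in current.items():
        if type_name not in proposed:
            problems.append(f"type {type_name} removed")
            continue
        for field in sorted(fields - proposed[type_name]):
            problems.append(f"field {type_name}.{field} removed")
    return problems
```

Comparing `{"CreditCard": {"apr", "annualFee", "name"}, "Bank": {"name"}}` with a proposal of `{"CreditCard": {"apr", "name", "rewards"}}` flags the removed `CreditCard.annualFee` field and the removed `Bank` type, while the added `rewards` field passes.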
The team now has high confidence in Supergraph updates and horizontal scaling, supported by OTEL signals and Studio analytics.
Scaling
Scaling Predictably Under Load
NerdWallet already ran traffic forecasting, load tests, and infrastructure prescaling ahead of high-traffic events. But expensive query planning was always the hardest variable to control. A single slow planning operation could stall unrelated traffic across the graph.
Cache warming removed that variable. With the most expensive plans precomputed and cached before traffic arrived, new router instances came online with a populated planning cache instead of a cold start.
We see this same pattern in other federated graphs at scale. A router starting from a cold cache will hit the same planning delays, regardless of infrastructure capacity. In our experience, eliminating that cold-start window is one of the most direct ways to stabilize performance during scale events. WunderGraph's Super Bowl scale-up analysis documents this in detail: under high traffic, cold-start planning followed the same dynamic NerdWallet had seen before adopting cache warming.
The result for NerdWallet: more predictable scaling, fewer surprises during planned traffic events.
From Legacy Gateway to a Future-Ready Federation Platform
NerdWallet's migration replaced an upgrade-blocked gateway with a modern federation platform. It avoided a big-bang cutover, kept user-facing behavior stable, and preserved subgraph teams' ownership of their services and features.
The key outcomes:
- Removed the query planning bottleneck that degraded the graph, along with the legacy gateway that blocked upgrades
- Eliminated manual operational work tied to custom prewarming infrastructure
- Gained direct visibility into graph behavior through Cosmo Studio and OTEL tracing
- Established a scaling posture that handles traffic spikes without cold-start surprises
If your gateway is showing the early signs (slow planning, blocked upgrades, a growing number of operational workarounds), those problems tend to compound over time. Cosmo is designed to address these bottlenecks at the router and query planning layer.


