Apollo GraphQL Federation with Subscriptions - production grade and highly scalable

Editor's Note

This post was written in 2021 and describes an earlier version of WunderGraph that is no longer available. The product is now WunderGraph Cosmo. For current documentation on federated subscriptions, see the Cosmo Router subscriptions docs.

TL;DR

Federation and subscriptions used to be hard to run together, and in 2021 few gateways supported both. That has changed. With WunderGraph Cosmo, real-time subscriptions are built into the Router, and the modern approach is Cosmo Streams (event-driven federated subscriptions, or EDFS): the Router subscribes to an event source like Kafka, NATS, or Redis, and your subgraphs stay HTTP-only and stateless instead of holding WebSocket connections. This post keeps the still-useful parts on what Federation solves and how to adopt it, and updates the subscriptions architecture to how Cosmo does it today. The 2021 WunderNode design later in the post is kept for history.

Apollo GraphQL Federation is the concept of building a single GraphQL API using a distributed microservice architecture.

In this blog post, you'll discover what problems Federation tries to solve and why it's not just a simple extension to APIs but creates a whole new experience that can help API-driven companies scale. We'll take an in-depth look at why you should use it and why you shouldn't. We'll talk about how you can adopt Federation if you decide it's a good fit for your company. Finally, we'll discuss why Subscriptions matter for Federation and how the WunderGraph architecture enables you to leverage both Federation and Subscriptions without any extra work.

What problems does Apollo GraphQL Federation solve?

GraphQL has proven countless times that it's helping companies build better APIs. Developers struggle to find consensus on how they should build their REST APIs. With GraphQL, despite being a compromise, developers are able to rely on a simple and clear specification that doesn't leave room for interpretation. GraphQL might not always be the best choice. However, in many cases it's easier to build a good GraphQL API compared to REST because devs have to make fewer choices.

Alright, enough GraphQL "hype". You've probably heard enough arguments in favor of GraphQL, so here's some balance: no API style is one-size-fits-all. Depending on your use case, you should always consider alternatives like REST, OpenAPI, gRPC, etc., familiarize yourself with the pros and cons of each, and pick the best tool for your project. Deciding on tooling before you know the resulting architecture won't make your project more successful.

GraphQL itself is just a query language. Paired with HTTP, you could put it in the same basket as other API styles like REST, gRPC, or SOAP. All of them have their pros and cons, but the patterns are quite similar.

However, when Federation comes into play, it's clear that we're operating at a completely different level. It's no longer just another API style. Federation goes beyond just architecture. Federation enables organizations to scale in a way that was not possible with other API styles.

Most if not all API styles have some kind of concept of a "Resource". I'm borrowing the term from REST APIs as it makes most sense in this context. What all these API styles have in common is that a Resource is usually tied to one Endpoint, Instance or Service.

Tying a Resource to a specific service means the Resource is tied to a specific team.

With Federation, the implementation of a Resource can be distributed across many teams.

Imagine you have the type User in your domain model. Let's say Users are able to write Posts as well as add Comments to Posts.

If it's a simple Rails app, of course there's no issue implementing this design. But what if your application scales to the point where Posts and Comments become large, complex topics, and you have to build individual teams for each of them?

How can you make sure that the API scales well and is still as easy to use and understand as a single monolith? This is the kind of problem that Federation tries to solve. It allows both teams to collaborate on a common API schema while letting each team implement its part of the schema the way it wants.

This comes with some benefits but also has a cost obviously.

Why you should use Apollo GraphQL Federation

Are you trying to scale your organization? Do you find yourself in a similar position to the example above? Do you have to split teams and hire more people because your API surface is getting bigger and bigger?

If you can answer these questions with Yes, you should definitely take a closer look!

Apollo GraphQL Federation can help you in this case to build a coherent API surface. Thanks to Federation, API consumers won't notice when they cross the boundaries of individual teams. As all teams agree on a common schema, it's easy to navigate the different types as they all follow a common language.

At the same time, teams of API producers are free to implement their part of the schema the way they want. They can use any language or framework they like, as long as the result is a Federation compliant service.

That said, it might still be a very good idea to not give each team too much freedom in terms of technology choice. If the stack is similar across teams, it's easier for team members to move between teams.

Why you should avoid Apollo GraphQL Federation

Adopting Apollo GraphQL Federation comes at a cost and is not for everyone.

Looking at the questions above, if you're mostly answering them with "No", it's very likely that you don't have the problems that Federation tries to solve.

If your API surface is small, and you only have a single team of three developers working on the API, Federation just adds complexity without doing you any good.

Another situation where you have to carefully think about adopting Federation is when your company is not yet using GraphQL at all. In order to be able to leverage Federation, all services have to be rewritten as GraphQL services. This can take a huge effort, and it'll take long until the investment pays off.

Actually, with WunderGraph you don't have to rewrite all your services as GraphQL APIs but we'll come to that later.

Another important factor when it comes to adopting Federation is whether you have API consumers who are not familiar with GraphQL. Rewriting your services as GraphQL APIs means you have to teach your API consumers a new technology. If your API consumers are mainly in-house, this might work. What if your API is used by partners or even public API consumers who are unknown to you? Can you convince all of them to adopt GraphQL? If not, you might have to build a RESTful facade on top of your federated GraphQL API.

In the end, an API is like a product, and your customers are developers. What matters most is that you deliver a great product to your customers. GraphQL might be a solution, and so might Federation. Your customers might not care about your technology choices.

How can you safely adopt Apollo GraphQL Federation?

Migrating from non-GraphQL to Federation

In case you don't already have any GraphQL APIs, the transition is a multi-step process. You can keep your existing architecture running while you start to "strangle" (wrap) it with GraphQL services. Once the first bit of your new facade is running, you can ask your API consumers to migrate to the new GraphQL API. This allows you to test your Federation environment and identify problems.

In case of issues, you're always able to switch your API consumers back to the "old" APIs. This way, you're able to build up confidence that your new architecture works under real life conditions.

You'll then migrate more and more services to the new architecture until the whole "old" API is eaten up by Federated services. At this point, you're able to migrate API consumers completely off of the "old" API.

Once this step is complete, you're able to move all business logic from the "old" API into the federated services. When no federated service or API consumer relies on a part of the "old" API anymore, it's time to finally switch that service off. Iterate through all the services until nobody is relying on the "old" API anymore. You can and should verify this by looking at your API analytics.

Speaking of API analytics: it really helps if your "old" API has analytics enabled. Additionally, you should give every API consumer an identifier, like a client ID or token, which they have to present whenever they use your API and by which you can identify them. If you don't have such information, it can become hard to identify some of your API consumers and help them migrate off of your old API. A very good practice is to use OAuth2. This allows you to create a client ID and secret for all your API consumers. Each app and each partner can have their own client ID, making it easy for you to track them down. Keep a list of client IDs and their contact info and you're well prepared for the upcoming migration.

Migrating existing GraphQL services towards a federated architecture

Assuming that you have one or more existing GraphQL services, you very likely have either a monolithic service or multiple services glued together with schema stitching. Either way, migrating them towards a Federated architecture is rather simple.

WunderGraph allows you to combine multiple services into a single GraphQL API. It's possible to use Federation and schema stitching at the same time. You can even add REST APIs and databases like PostgreSQL or MySQL to your API.

WunderGraph does so by creating a facade on top of all your services. Once the facade is established, you're able to "move" the implementation of your API out of the monolithic architecture into microservices. As long as the schema stays the same, you're able to move the implementation wherever you want. WunderGraph will glue all services together without introducing any breaking changes for API consumers.

This means you're able to migrate type by type, field by field, in tiny steps out of your monolith. You're able to test the new architecture very early on and can always switch back to the monolith if anything goes wrong.

Once all logic is migrated off of the monolith, your schema will solely depend on the federated services. You're then able to shut down your old systems. This is rather easy compared to the multi-step process above because you're able to keep your API contract intact all the time.

At the very beginning, you're introducing the WunderGraph facade between your clients and your existing infrastructure. WunderGraph makes sure this facade stays intact throughout the whole process of migrating business logic from one place to another.

The Elephant in the room: Apollo GraphQL Federation with Subscriptions

Implementing a GraphQL Gateway with support for Federation is hard. Implementing the same functionality with support for Subscriptions is an even bigger challenge. When this post was written in 2021, few gateways supported both.

That has since changed across the ecosystem, and federated subscriptions are now common. What sets the WunderGraph approach apart today is how the Router handles them. The engine is open source under Apache 2.0, and the Cosmo Router supports subscriptions over WebSockets, Server-Sent Events, and Multipart HTTP out of the box. The bigger shift is event-driven federated subscriptions, covered next.

How Cosmo does federated subscriptions today: Cosmo Streams (EDFS)

The modern way to run subscriptions across a federated graph is Cosmo Streams (event-driven federated subscriptions, or EDFS). Instead of each subgraph holding open WebSocket connections and pushing data through the graph, the Cosmo Router connects directly to your event source and streams updates to subscribers itself.

The Router supports Kafka, NATS, and Redis as event sources. You declare subscriptions in your schema with directives like @edfs__kafkaSubscribe, @edfs__natsSubscribe, or @edfs__redisSubscribe, and the Router does the rest. It holds the subscriber connections, applies per-subscriber filtering and authorization centrally, and keeps your subgraphs HTTP-only and stateless.
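The division of labor described above can be illustrated with a small, self-contained Go sketch. It is not the Cosmo Router's code and does not talk to Kafka, NATS, or Redis: the Hub type, the Event fields, and the filter callback are all hypothetical names, and an in-memory Publish call stands in for the event source. It only shows the idea that one central component owns the subscriber connections and applies a per-subscriber filter, so the services that produce events never hold client connections.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical message read from an event source topic.
type Event struct {
	Topic   string
	UserID  string
	Payload string
}

// subscriber pairs an output channel with a per-subscriber filter,
// which stands in for filtering and authorization checks.
type subscriber struct {
	allow func(Event) bool
	out   chan Event
}

// Hub plays the role of the router in this sketch: it owns every
// subscriber channel, so event producers stay free of client connections.
type Hub struct {
	mu   sync.Mutex
	subs map[string][]subscriber // topic -> subscribers
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string][]subscriber)}
}

// Subscribe registers a subscriber for a topic and returns its event channel.
func (h *Hub) Subscribe(topic string, allow func(Event) bool) <-chan Event {
	// A small buffer keeps the sketch simple; a real system needs a
	// strategy for slow consumers.
	out := make(chan Event, 8)
	h.mu.Lock()
	h.subs[topic] = append(h.subs[topic], subscriber{allow: allow, out: out})
	h.mu.Unlock()
	return out
}

// Publish delivers one event to every subscriber of the topic whose
// filter accepts it.
func (h *Hub) Publish(e Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, s := range h.subs[e.Topic] {
		if s.allow(e) {
			s.out <- e
		}
	}
}

func main() {
	hub := NewHub()
	alice := hub.Subscribe("orders", func(e Event) bool { return e.UserID == "alice" })
	all := hub.Subscribe("orders", func(Event) bool { return true })

	hub.Publish(Event{Topic: "orders", UserID: "alice", Payload: "order-1"})
	hub.Publish(Event{Topic: "orders", UserID: "bob", Payload: "order-2"})

	fmt.Println("alice received:", len(alice), "all received:", len(all))
}
```

In this sketch the subscriber filtered to "alice" ends up with one buffered event and the unfiltered subscriber with two.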

On the client side, the Router speaks WebSockets (graphql-ws), Server-Sent Events, and Multipart HTTP, so clients pick whatever transport fits.

For the full picture, see Event-Driven Federated Subscriptions in the docs, the EDFS announcement, and native subscriptions with the Cosmo Router.

But GraphQL and Federation are not the peak of WunderGraph. The idea behind WunderGraph is to be a generic GraphQL engine, supporting many other data sources.

So far, we're able to connect PostgreSQL, MySQL, REST APIs (through OpenAPI Specification) and GraphQL, with more to come soon.

There are other tools trying to solve similar problems.

WunderGraph, being a generic Query Compiler, can scale to infinite schema and operation sizes.

Other implementations of GraphQL Gateways usually "interpret" GraphQL Queries at runtime and are written using tools like Node.js, making it hard to scale to large schemas and high numbers of operations.

WunderGraph, being a generic Query Compiler, decouples planning from execution. This means it's possible to process any size of schema and any number of operations without increasing runtime overhead because all the heavy lifting happens at build/compile time.

How WunderGraph makes Subscriptions possible for Apollo GraphQL Federation

History: the 2021 WunderNode design

The sections below describe how the original WunderGraph (the WunderNode) implemented federated subscriptions in 2021, using a thunk-based engine and HTTP/2 to the client. They are kept here for reference. For how subscriptions work in the current product, see the EDFS section above and the Cosmo Router subscriptions docs.

If you want to understand how WunderGraph is capable of supporting Subscriptions, we have to take a closer look at the architecture.

WunderGraph is written in Go (Golang), a language well suited to building highly scalable, highly concurrent network services. Go makes it easy to scale an application across all cores of a computer. Additionally, in the case of Subscriptions, Go doesn't have to create an operating system thread for each concurrent Subscription. Instead, Go has the concept of Goroutines. A Goroutine can be seen as a lightweight thread that consumes far fewer resources than a real OS thread. All this is possible because Go comes with its own scheduler embedded in the runtime. Instead of leaving concurrency to the operating system, the Go runtime abstracts this away and lets the developer start new lightweight threads with the "go" keyword. Behind the scenes, the Go scheduler runs many Goroutines on a much smaller number of OS threads. Luckily, as a developer using Go, you don't have to think about this or understand it in depth. Some knowledge of the runtime helps, but in most cases it's not required.
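To make the Goroutine model a bit more tangible, here is a minimal sketch that starts one Goroutine per simulated Subscription. The function name runSubscriptions and the numbers are made up for illustration, and the "work" is just a counter; the point is only that starting thousands of Goroutines is an ordinary thing to do, while the number of OS threads running Go code stays bounded by GOMAXPROCS.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runSubscriptions starts one goroutine per simulated subscription.
// Each goroutine "delivers" the given number of updates, and the
// function returns the total number of updates delivered.
func runSubscriptions(subs, updates int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < subs; i++ {
		wg.Add(1)
		go func() { // the "go" keyword starts a new goroutine
			defer wg.Done()
			delivered := 0
			for u := 0; u < updates; u++ {
				delivered++
			}
			mu.Lock()
			total += delivered
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println("updates delivered:", runSubscriptions(10000, 3))
	// GOMAXPROCS(0) reports the current limit without changing it.
	fmt.Println("OS threads executing Go code at once:", runtime.GOMAXPROCS(0))
}
```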

Your takeaway should be that Go makes it easy to build API Gateways.

A language alone doesn't make a fast API Gateway, though. The architecture of the application contributes a lot. The underlying engine of WunderGraph is fully open source, has existed for multiple years now, and is used in production by many companies.

A word on open source

I started working on the first parts of the engine more than three years ago. Since then, many contributors have helped it evolve into what it is now. Without open sourcing this project, I think it would never have been able to grow to this point. I'm very thankful to everybody involved and thrilled to see how companies build on top of it.

Recently, I was looking into the Network graph of my repository to identify interesting forks.

To my surprise, there was actually someone working on implementing some missing parts to fully support Federation. I figured out their contact info (you can do so by adding ".patch" to the URL of one of their commits on GitHub), and we started a discussion.

This discussion led to a Pull Request that really brought graphql-go-tools forward.

Thank you Vasyl Domanchuk for your contribution. If you're ever looking for a new job, please use this blog post as a reference. I was amazed to see how you were able to digest the complexity of graphql-go-tools without asking any questions. Your contribution was of high quality, and you added a lot of tests to keep the coverage up. If I could, I'd hire you on the spot, but you seem to be happy with what you have. Fair enough.

The Architecture to make Apollo GraphQL Federation and Subscriptions highly scalable

Contrary to most if not all GraphQL implementations, the engine underneath WunderGraph takes a "thunk-based" approach to resolving GraphQL Queries. I might write an in-depth blog post on how to design efficient GraphQL resolvers where I'd like to cover more details. For now, I'd like to stick to the essentials to not bloat this post.

When designing the engine of WunderGraph, I've taken inspiration from database management systems like for example PostgreSQL. If you ever see Resolvers that return data, you know that the underlying framework is not thunk-based. A thunk-based GraphQL Engine divides the whole operation of resolving a Query, Mutation or Subscription into multiple steps.

The first step is to analyze the request and build a plan for the execution. Ideally, this execution plan is stateless and allows for variable injection. If done right, this allows execution plans to be cached.

In order to generate an execution plan from a GraphQL Operation, you have to run a number of tasks in a very specific order.

In most cases, a GraphQL Operation is transmitted over HTTP using a JSON-encoded representation of the Request. The Request contains three fields. The "query" field contains one or more GraphQL Operations. The "operationName" field lets the client specify exactly which Operation to run if there are multiple. Finally, the "variables" field contains an arbitrary number of variables.
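As a small illustration of that request shape, the following sketch decodes such a JSON body in Go. The struct and function names (GraphQLRequest, parseRequest) and the example query are assumptions made for this example, not the engine's actual types. Variables are kept as raw JSON so they can be injected later without being interpreted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GraphQLRequest mirrors the three fields of the JSON request body.
type GraphQLRequest struct {
	Query         string                     `json:"query"`
	OperationName string                     `json:"operationName,omitempty"`
	Variables     map[string]json.RawMessage `json:"variables,omitempty"`
}

// parseRequest decodes the body and rejects requests without a query.
func parseRequest(body []byte) (GraphQLRequest, error) {
	var req GraphQLRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return GraphQLRequest{}, fmt.Errorf("invalid request body: %w", err)
	}
	if req.Query == "" {
		return GraphQLRequest{}, fmt.Errorf("request is missing the query field")
	}
	return req, nil
}

func main() {
	// The query field holds two Operations; operationName selects one.
	body := []byte(`{
		"query": "query GetUser($id: ID!) { user(id: $id) { name } } query Ping { ping }",
		"operationName": "GetUser",
		"variables": {"id": "42"}
	}`)
	req, err := parseRequest(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.OperationName, string(req.Variables["id"]))
}
```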

After parsing the JSON, the engine can start lexing the text of the query field. Lexing means to turn the text into a list of tokens. Once you have that list, you can move on and parse the list of tokens into an AST (abstract syntax tree). Now that you have this AST, you should make sure that it's valid. Validation means to check if the AST makes sense semantically. Actually, there's a step before validation. First, we have to normalize the AST. Normalization means e.g. to inline all fragments and remove duplicate fields so that other processing steps don't have to take care of edge cases when an AST is "dirty". Finally, you can take the clean and validated AST and turn it into an execution plan.
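To give a feeling for the first of these steps, here is a deliberately tiny lexer sketch. It is not the lexer from graphql-go-tools, and it only understands names, integers, and a handful of punctuators; strings, comments, floats, and the spread operator are left out. Token, lex, and the helper functions are names invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is one lexical unit: a Name, an Int, or a punctuator.
type Token struct {
	Kind  string
	Value string
}

func isNameStart(r rune) bool {
	return r == '_' || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
}

func isDigit(r rune) bool { return r >= '0' && r <= '9' }

// lex turns query text into a flat list of tokens.
func lex(src string) ([]Token, error) {
	var tokens []Token
	runes := []rune(src)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case r == ' ' || r == '\t' || r == '\n' || r == '\r' || r == ',':
			i++ // whitespace and commas carry no meaning in GraphQL
		case isNameStart(r):
			start := i
			for i < len(runes) && (isNameStart(runes[i]) || isDigit(runes[i])) {
				i++
			}
			tokens = append(tokens, Token{Kind: "Name", Value: string(runes[start:i])})
		case isDigit(r):
			start := i
			for i < len(runes) && isDigit(runes[i]) {
				i++
			}
			tokens = append(tokens, Token{Kind: "Int", Value: string(runes[start:i])})
		case strings.ContainsRune("{}()[]:$!=@", r):
			tokens = append(tokens, Token{Kind: "Punct", Value: string(r)})
			i++
		default:
			return nil, fmt.Errorf("unexpected character %q at position %d", r, i)
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := lex("query { user(id: 1) { name } }")
	if err != nil {
		panic(err)
	}
	for _, t := range tokens {
		fmt.Printf("%s(%s) ", t.Kind, t.Value)
	}
	fmt.Println()
}
```

A parser would consume this token list to build the AST, and normalization, validation, and planning would then operate on that tree.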

Now that you have this execution plan, you can store it somewhere, e.g. in a cache. You can do so by hashing the initial payload and using the hash as the key and the prepared execution plan as the value. Doing so allows you to skip all the complicated steps above for subsequent operations. If you ask yourself how WunderGraph can be so fast, that's the answer. By skipping a lot of CPU-intensive work in the hot path, we're able to save a lot of time.
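A plan cache along those lines could be sketched as follows. Plan, PlanCache, and the build callback are placeholder names; the real engine's plan type is far richer. The choice of SHA-256 and of what exactly goes into the hashed payload are assumptions made for this sketch. The Builds counter exists only to show that the expensive path runs once per distinct payload.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// Plan stands in for a prepared execution plan.
type Plan struct {
	Operation string
}

// PlanCache maps the hash of a payload to its prepared plan.
type PlanCache struct {
	mu     sync.Mutex
	plans  map[[32]byte]*Plan
	Builds int // how often the expensive build path ran
}

func NewPlanCache() *PlanCache {
	return &PlanCache{plans: make(map[[32]byte]*Plan)}
}

// Get returns the cached plan for the payload. On a miss it calls build,
// which represents lexing, parsing, normalization, validation and planning.
func (c *PlanCache) Get(payload []byte, build func([]byte) *Plan) *Plan {
	key := sha256.Sum256(payload)
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.plans[key]; ok {
		return p // hot path: no parsing or planning at all
	}
	p := build(payload)
	c.plans[key] = p
	c.Builds++
	return p
}

func main() {
	cache := NewPlanCache()
	build := func(payload []byte) *Plan { return &Plan{Operation: string(payload)} }

	cache.Get([]byte("query { user { name } }"), build)
	cache.Get([]byte("query { user { name } }"), build) // served from the cache
	cache.Get([]byte("query { posts { title } }"), build)

	fmt.Println("plans built:", cache.Builds)
}
```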

Building an execution plan for Queries and Mutations is rather simple. You start with one or more root nodes that need to fetch data. Once the data is fetched, you can start building a JSON Response object. If there are any child nodes that need additional fetches, you'll resolve these too and continue building the JSON response. If you're curious about the details, run one of these tests with a debugger attached.

The execution plan of a GraphQL Subscription looks slightly different. Subscriptions only have one single root node. The execution of Subscriptions is also not started by the engine itself. Instead, it's triggered by the origin which is pushing data to the engine.

For that reason, I've added the concept of a "Trigger" to implement Subscriptions. In case of a GraphQL upstream, the execution is triggered when new data arrives from the upstream GraphQL server. This data is directly fed into the root resolver which starts building the JSON response. This is the biggest difference compared to executing a Query or Mutation. The engine waits for the trigger, then starts resolving. For child Nodes it's possible to have additional fetches attached. This allows WunderGraph to resolve GraphQL Subscriptions, even for Federation.

A few details on how WunderGraph executes federated Subscriptions

WunderGraph clients connect via HTTP/2 (HTTP/1.1 Chunked-Encoding as a fallback) when they want to execute a Subscription. This makes handling multiple Subscriptions a lot more efficient and simpler than using WebSockets. WebSocket connections are usually established through an HTTP/1.1 upgrade and require the client application to multiplex multiple GraphQL Subscriptions over the same WebSocket connection. With HTTP/2, multiple Subscriptions can be multiplexed over a single TCP connection without the client doing any additional work.

Once connected, the WunderNode (GraphQL Gateway / Engine) checks for an existing WebSocket connection to the origin. If there's no connection, it will start one and initiate the Subscription. If a connection exists, and the security context allows it, the engine will reuse the existing connection and multiplex multiple Subscriptions over it. If all clients disconnect from the WunderNode, the WebSocket connection to the upstream will be closed.

If multiple clients request the exact same Subscription, and the security context allows it, e.g. no authorization required, the Engine will only start one single WebSocket connection with one single Subscription. This means the only component you'd have to scale in this scenario is the WunderNode. The origin GraphQL server can be quite weak because, thanks to deduplication, it doesn't receive many requests. If you're using WunderGraph as a managed service, there's actually nothing to scale for you.
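The deduplication idea can be sketched with a small in-memory broker. This is an illustration under simplifying assumptions, not the WunderNode implementation: Broker and its methods are invented names, the subscription key is just the query text (a real gateway would also have to fold the security context into it), and counters stand in for opening and closing the upstream WebSocket.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker shares one upstream subscription among all clients that
// request the same subscription key.
type Broker struct {
	mu             sync.Mutex
	nextID         int
	clients        map[string]map[int]chan string // key -> client id -> channel
	UpstreamStarts int                            // upstream subscriptions opened
	UpstreamStops  int                            // upstream subscriptions closed
}

func NewBroker() *Broker {
	return &Broker{clients: make(map[string]map[int]chan string)}
}

// Subscribe adds a client. Only the first client for a key would cause
// an upstream subscription to be started.
func (b *Broker) Subscribe(key string) (int, <-chan string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.clients[key] == nil {
		b.clients[key] = make(map[int]chan string)
		b.UpstreamStarts++
	}
	b.nextID++
	ch := make(chan string, 8)
	b.clients[key][b.nextID] = ch
	return b.nextID, ch
}

// Dispatch copies one upstream message to every client of the key.
func (b *Broker) Dispatch(key, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.clients[key] {
		ch <- msg
	}
}

// Unsubscribe removes a client. When the last client of a key leaves,
// the upstream subscription would be closed.
func (b *Broker) Unsubscribe(key string, id int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch, ok := b.clients[key][id]
	if !ok {
		return
	}
	close(ch)
	delete(b.clients[key], id)
	if len(b.clients[key]) == 0 {
		delete(b.clients, key)
		b.UpstreamStops++
	}
}

func main() {
	const key = "subscription { priceUpdated }"
	b := NewBroker()
	id1, c1 := b.Subscribe(key)
	id2, c2 := b.Subscribe(key)

	b.Dispatch(key, `{"price":10}`)
	fmt.Println(<-c1, <-c2, "upstream subscriptions started:", b.UpstreamStarts)

	b.Unsubscribe(key, id1)
	b.Unsubscribe(key, id2)
	fmt.Println("upstream subscriptions stopped:", b.UpstreamStops)
}
```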

If the execution of a federated subscription requires multiple sub-fetches, the execution looks like this. First, the engine sets up the trigger and waits for new data from the federated GraphQL service that is responsible for the subscription root field. Once new data arrives, the engine resolves all possible fields. If a child field requires additional fetches from another federated GraphQL service, these fetches will be executed and then the engine continues resolving the rest of the fields.
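The flow just described can be sketched in a few lines of Go. Everything here is hypothetical: the postAdded subscription, the PostEvent type, and the fetchAuthor callback are invented for the example, a channel plays the role of the trigger, and a function call stands in for the HTTP fetch to a second federated service.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostEvent is the root data pushed by the service that owns the
// subscription root field.
type PostEvent struct {
	ID       string
	AuthorID string
}

type author struct {
	Name string `json:"name"`
}

type post struct {
	ID     string `json:"id"`
	Author author `json:"author"`
}

type data struct {
	PostAdded post `json:"postAdded"`
}

type response struct {
	Data data `json:"data"`
}

// runSubscription waits for the trigger. For every event it resolves the
// root fields, runs the child fetch against a second service, and emits
// one JSON response. It returns when the trigger is closed.
func runSubscription(trigger <-chan PostEvent, fetchAuthor func(id string) (string, error), out chan<- []byte) error {
	defer close(out)
	for ev := range trigger {
		name, err := fetchAuthor(ev.AuthorID) // sub-fetch for the child field
		if err != nil {
			return err
		}
		msg, err := json.Marshal(response{Data: data{PostAdded: post{ID: ev.ID, Author: author{Name: name}}}})
		if err != nil {
			return err
		}
		out <- msg
	}
	return nil
}

func main() {
	trigger := make(chan PostEvent, 2)
	trigger <- PostEvent{ID: "p1", AuthorID: "u1"}
	trigger <- PostEvent{ID: "p2", AuthorID: "u2"}
	close(trigger)

	authors := map[string]string{"u1": "Ada", "u2": "Linus"}
	fetchAuthor := func(id string) (string, error) {
		name, ok := authors[id]
		if !ok {
			return "", fmt.Errorf("unknown author %q", id)
		}
		return name, nil
	}

	out := make(chan []byte, 2)
	if err := runSubscription(trigger, fetchAuthor, out); err != nil {
		panic(err)
	}
	for msg := range out {
		fmt.Println(string(msg))
	}
}
```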

If you divide the problem into multiple steps, you'll realize that individually, most of these tiny steps are quite simple, easy to understand and easy to test. Only if you look at the problem as a whole you might be overwhelmed.

Divide and conquer is an efficient technique to solve complex problems. First, turn complex tasks into complicated ones. It's hard to solve complex tasks at once. Once the problems are complicated, that is, well-defined, it's easy to write tests that define the expected outcome.

WunderGraph as a whole is complex. All individual parts are just complicated.

What does it all mean for you?

Subscriptions are an essential part of GraphQL, allowing you to build applications that update in real time.

With our advanced resolver techniques it's possible to resolve Queries, Mutations as well as Subscriptions with minimal overhead.

Out-of-the-box support for Subscriptions, even in federated environments, means you're able to implement GraphQL services without any extra steps. Just implement the Subscription resolvers in your language and framework of choice, and WunderGraph glues it all together.

You don't have to make any changes to your architecture just because you want to adopt federation.

WunderGraph comes with a full bag of goodies

WunderGraph is not just fast. It's the result of my frustration over the complexity of setting up new projects and maintaining them. From Getting Started to running in production, I was super unsatisfied with the complexity involved.

If you look at the features, you'll realize that there's nothing essential missing. Give it a try and let me know what you think.

Frequently Asked Questions (FAQ)

What is Cosmo Streams (EDFS)?

Cosmo Streams (EDFS) lets the Cosmo Router deliver real-time updates across a federated graph by subscribing directly to an event source. Instead of subgraphs pushing data over WebSockets, the Router connects to Kafka, NATS, or Redis, holds the subscriber connections, and streams updates to clients.

Do my subgraphs need to hold WebSocket connections for federated subscriptions?

No. With Cosmo Streams (EDFS) the Cosmo Router connects to the event source and holds the subscriber connections. Your subgraphs stay HTTP-only and stateless.

Which event sources does Cosmo Streams support?

Kafka, NATS, and Redis, using the @edfs__kafkaSubscribe, @edfs__natsSubscribe, and @edfs__redisSubscribe directives.

Which client transports does the Cosmo Router support for subscriptions?

The Cosmo Router supports WebSockets (graphql-ws), Server-Sent Events (SSE), and Multipart HTTP.