
Announcing EDFS - Event Driven Federated Subscriptions

Jens Neuse

In the last few months, we've put a lot of effort into supporting the most advanced use cases of Federated GraphQL, like Entity Interfaces, Shareable, Composite Keys, Overrides and more. All of these features are now available in Cosmo and drove a lot of adoption.

We're very happy to see that Cosmo is getting adopted by more and more companies and household names. The feedback we've received so far has been overwhelmingly positive and we're very grateful for that. Most users mention the ease of use and the performance of Cosmo Router as the outstanding characteristics.

What's great about driving adoption is that we get to learn a lot about the use cases and struggles of our users. One such use case is a proper solution for (federated) GraphQL Subscriptions.

Subscriptions are a very powerful feature of GraphQL which allows you to build real-time applications. Unfortunately, a lot of people struggle to implement Subscriptions. As you'll see in this article, Federation makes Subscriptions especially challenging.

Let's take a look at the current state of Subscriptions in GraphQL and how Cosmo Router solves these problems with EDFS.

Cosmo: Full Lifecycle GraphQL API Management

Are you looking for an Open Source Graph Manager? Cosmo is the most complete solution including Schema Registry, Router, Studio, Metrics, Analytics, Distributed Tracing, Breaking Change detection and more.

Problems and Challenges with (federated) GraphQL Subscriptions

Here are some of the problems and challenges you'll face when implementing Subscriptions in a (federated) GraphQL API:

  1. Subscriptions make your Subgraphs stateful
  2. Subscriptions can only have a single root field (Ownership?)
  3. A lot of frameworks are very inefficient and consume a lot of resources (CPU, Memory) when implementing Subscriptions

We heard a lot of stories where people tried to implement Subscriptions but abandoned the idea because their servers couldn't handle the load. Others reported that memory consumption got out of hand and they simply couldn't justify the cost of running Subscriptions.

In contrast to that, you're probably familiar with Event Driven Architectures (EDA). In an EDA, you have a lot of small services that communicate with each other by sending events. These events are usually sent to a message broker like Kafka, NATS, SQS, RabbitMQ and more.

The great thing about EDA is that it scales very well. You can easily add more producers & consumers to a topic and scale horizontally.

On the other hand, we really like the way GraphQL allows us to define our API with a Schema. Combined with Federation, we can build a federated Graph composed of multiple Subgraphs. But Federation is missing some pieces that make EDA so powerful.

So we asked ourselves: Is there a way to combine the best of both worlds? Can we combine the power of Federation, Entities and Subgraphs with an Event-Driven Architecture? The answer is yes, and we call it EDFS - Event Driven Federated Subscriptions.

But before we dive into the details, let's take a look at the current state of Subscriptions in GraphQL and break down the problems.

Subscriptions make your Subgraphs stateful

When we're only implementing Queries and Mutations, our Subgraphs are stateless. This means that we only have short-lived requests that are handled by our Subgraphs. The response is usually sent back to the client in a few milliseconds up to a few seconds. There's no state that needs to be maintained between requests.

Stateless Subgraphs are very easy to deploy and scale. Add a load balancer in front of your Subgraphs and deploy as many instances as you need to handle the load. In a stateless system, it doesn't matter which instance handles the request.

With Subscriptions on the other hand, things get a lot more complicated. Subscriptions are long-lived and can be open for minutes, hours, or even days, making your Subgraphs stateful.

Compared to a stateless system, you have a lot more things to consider when deploying and scaling your Subgraphs.

For example, a platform team might be capable of deploying, running and monitoring a stateful Router, but what about the backend teams that are responsible for the Subgraphs? Do they have the knowledge and resources to deploy and run a stateful Subgraph? Popular languages like NodeJS, Python, Ruby, PHP, and others are not well known for their ability to handle stateful workloads efficiently.

A lot of people like the simplicity of AWS Lambda and other serverless deployment options. The problem is that these services have limitations on how long a function can run. In addition to that, they might not be able to upgrade a regular HTTP request to a WebSocket connection.

Summarizing, Subscriptions require a lot more resources and knowledge to implement and operate properly.

Subscriptions can only have a single root field (Ownership?)

Another problem with Subscriptions is that a Subscription operation can only have a single root field, and in GraphQL Federation, a root field on the Subscription type can only be owned by a single Subgraph. Unfortunately, this defeats the purpose of Federation. The whole point of Federation is to allow multiple Subgraphs to contribute fields to an Entity, allowing multiple teams to collaborate on a single Entity without stepping on each other's toes.

However, with Subscriptions, this is not possible. Only a single Subgraph can own a root field, so what's the problem with that? It's about ownership and coordination.

If a Subgraph owns a root field, it's responsible for "invalidating" the Subscription. So, if a different Subgraph wants to "invalidate" the Subscription for an Entity, it has to coordinate with the Subgraph that owns the root field. And that's not the only problem.

How do you know which Subgraph has to be notified? How do you know which Subgraph has clients connected that subscribed to a specific Entity? This is a technical challenge that is not easy to solve. Even if you solve it, it adds a lot of overhead and coordination between services: teams need to collaborate to make it work, which is the exact opposite of what Federation is trying to achieve, namely allowing teams to work independently.

Solving this problem is about as complex as solving the distributed transaction problem, which you ideally want to avoid. But that's not all, there's another problem.

GraphQL Subscriptions consume a lot of resources (CPU, Memory)

The third big problem with Subscriptions is that they consume a lot of resources.

Here's the architecture of a typical GraphQL Subscription implementation:

The client opens a WebSocket connection to the Router (first Connection). The Router then opens a WebSocket connection to the Subgraph (second Connection). The Subgraph itself needs to handle the WebSocket connection that is opened by the Router (third Connection).

So, for every client that opens a Subscription, you have three WebSocket connections. But that's not the whole story.

For each WebSocket connection, each service needs to run at least two threads to be able to read from and write to the connection concurrently. In addition to that, you usually need one buffer per thread, so that's a total of 6 threads and 6 buffers per client.

We're still not done, though. This is really just the overhead of an idle WebSocket connection that has completed the initial handshake. What if the client starts one or more GraphQL Subscriptions? In that case, we need to run a trigger for each active Subscription, which is another thread per Subscription. Each Subscription also needs a buffer to prepare the response. If a Subscription requires additional Entity requests, which is usually the case, we need more buffers and potentially more threads, e.g. if we want to fetch Entity fields in parallel.

Summarizing, for each active Subscription, we need a bunch of threads and buffers, some of which can leverage resource pooling, some of which can't.

I'd also like to point out that blocking threads to read from or write to a WebSocket connection is not free. It consumes CPU cycles because the scheduler needs to switch between threads, and it consumes memory because each thread that's reading or writing needs a buffer.

Wouldn't it be great if we could get rid of all these allocations and threads? Yes, it would, and that's exactly what EDFS does!

Introducing Event Driven Federated Subscriptions (EDFS)

What if we could create a "virtual" Subgraph that is managed by the Router and bridges the gap between Federation and EDA? That's exactly what EDFS does.

Event Driven Federated Subscriptions, or EDFS for short, allows us to completely rethink Subscriptions. With EDFS you get the following benefits:

  1. EDFS makes your Subgraphs stateless
  2. Subscription root fields are not owned by Subgraphs, but by the Event Broker
  3. EDFS leverages Epoll/Kqueue to handle tens of thousands of Subscriptions with a small number of threads and buffers

Let's take a look at how EDFS works.

First, you need to connect your Cosmo Router to an Event Broker. Currently, we only support NATS, but we're planning to add support for Kafka, SQS, RabbitMQ and more.

Once you've connected your Router to an Event Broker, you can start defining "virtual" Subgraphs that connect your GraphQL Schema to the Event Broker. Let's take a look at an example Schema:

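Here's a minimal sketch of what such a schema can look like. The directive usage matches the description below, while the concrete type, field, and topic names (including the argument templating inside the topics) are purely illustrative:

```graphql
directive @eventsRequest(topic: String!) on FIELD_DEFINITION
directive @eventsPublish(topic: String!) on FIELD_DEFINITION
directive @eventsSubscribe(topic: String!) on FIELD_DEFINITION

type PublishEventResult {
  success: Boolean!
}

type Query {
  # Request/response over the Event Broker, resolved via a response topic.
  employeeFromEvent(id: Int!): Employee! @eventsRequest(topic: "getEmployee.{{ args.id }}")
}

input UpdateEmployeeInput {
  name: String!
  email: String!
}

type Mutation {
  # Publishes a JSON representation of all field arguments to the topic.
  updateEmployee(id: Int!, update: UpdateEmployeeInput!): PublishEventResult! @eventsPublish(topic: "updateEmployee.{{ args.id }}")
}

type Subscription {
  # Creates a trigger on the topic and resolves all sub-fields when an event arrives.
  employeeUpdated(employeeID: Int!): Employee! @eventsSubscribe(topic: "employeeUpdated.{{ args.employeeID }}")
}

type Employee @key(fields: "id") {
  id: Int!
}
```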

We've introduced three new directives: @eventsRequest, @eventsPublish and @eventsSubscribe. These directives allow us to connect our Schema to the Event Broker.

With @eventsRequest we can implement request/response patterns. When a client invokes this field, the Router will create a response topic, send the request to the Event Broker using the configured topic and wait for a response on the response topic. As you can guess from the topic name, we can use field arguments to dynamically resolve the topic.
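On the other side of the broker, any service can answer these requests. Here's a hedged sketch of such a responder using the NATS Go client; the topic is taken from the illustrative schema sketch above, and the reply payload simply follows the entity format described further below:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the same NATS server the Router is configured against.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Answer the requests the Router publishes for fields annotated
	// with @eventsRequest. The subject is illustrative.
	_, err = nc.Subscribe("getEmployee.*", func(m *nats.Msg) {
		// A real responder would derive the id from the subject and look
		// the entity up; here we just reply with an entity representation
		// (__typename plus the key fields defined in the schema).
		_ = m.Respond([]byte(`{"__typename":"Employee","id":1}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the responder running
}
```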

The @eventsPublish directive allows us to publish events to a specific topic on the Event Broker. Again, we can use field arguments to dynamically resolve the topic. The Router will publish a JSON representation of all field arguments to the Event Broker, e.g.:

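(assuming the illustrative `updateEmployee(id: Int!, update: UpdateEmployeeInput!)` field from the schema sketch above)

```json
{
  "id": 3,
  "update": {
    "name": "Jens",
    "email": "jens@example.com"
  }
}
```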

Finally, the @eventsSubscribe directive allows us to subscribe to events on the Event Broker. The Router will create a trigger that listens to the configured topic and starts resolving all sub-fields when an event is received. Again, we can use field arguments to dynamically resolve the topic.

If you're familiar with federated GraphQL, you'll notice that we've made Employee an Entity by adding the @key directive. This is what allows us to link between our Event Broker and our Subgraphs.

The way we define our Events Schema is important, as it defines the contract between the Router and the Event Broker and enables resolvability of Entities across Subgraphs. In our case, we've defined the Employee Entity with a single key field, id. This means that we're expecting the Event Broker to send us events in the following format:

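```json
{
  "__typename": "Employee",
  "id": 3
}
```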

The __typename field is required to be able to resolve the Entity. In case of Interfaces or Unions, the __typename field is required to be able to resolve the concrete type. Furthermore, the id field is required to be able to "jump" to additional Subgraphs to resolve additional fields.
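In practice, this means a client can subscribe through the event-driven root field and still select fields that other Subgraphs own. Here's a hedged example reusing the illustrative field names from the sketch above, where `name` and `role` stand in for fields resolved from other Subgraphs:

```graphql
subscription OnEmployeeUpdated {
  employeeUpdated(employeeID: 3) {
    id   # delivered by the event payload itself
    name # resolved by "jumping" to another Subgraph via the "id" key
    role # same: fetched with an Entity request after the event arrives
  }
}
```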

That's all we need to know about EDFS to be able to use it.

For those of you who are interested in the technical details, here's how EDFS works under the hood:

How do Event Driven Federated Subscriptions work under the hood?

When a client opens a Subscription, the Router upgrades the HTTP connection to a WebSocket connection. The WebSocket connection is then added to an Epoll/Kqueue event loop. We use Epoll on Linux and Kqueue on macOS, but both work in a similar way.

Epoll/Kqueue allows us to delegate the handling of the WebSocket connection to the operating system. Instead of blocking a thread while trying to read from a connection, we register the connection with the event loop and wait for the operating system to notify us when there's data available to read.

To bring this into perspective, imagine you're in a restaurant and you want to order something. Instead of waiting for the waiter to come to your table, you're registering your table with the waiter. When the waiter is ready to take your order, he'll come to your table and take your order. In the meantime, you can do whatever you want, e.g. read a book, talk to your friends, etc.

This makes a huge difference in terms of resource consumption. Imagine you have 10,000 clients connected to your Router. Each WebSocket connection requires at least 2 threads to be able to read and write on the connection. That's 20,000 threads and 20,000 buffers just to handle the WebSocket connections. With Epoll/Kqueue, we can handle 10,000 WebSocket connections with a single thread.
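To make this concrete, here's a heavily simplified, Linux-only sketch of the pattern using golang.org/x/sys/unix. The Router's actual implementation is more involved, but the core idea is the same: one loop waits for the kernel to report readable connections instead of one blocking reader per connection.

```go
package eventloop

import "golang.org/x/sys/unix"

// runLoop demonstrates the Epoll pattern: many connections, one reader thread.
// fds are the file descriptors of already-accepted WebSocket connections.
func runLoop(fds []int, onReadable func(fd int)) error {
	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		return err
	}
	defer unix.Close(epfd)

	// Register every connection once; no goroutine is parked in a Read call.
	for _, fd := range fds {
		ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
		if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
			return err
		}
	}

	events := make([]unix.EpollEvent, 128)
	for {
		// Block once, for all connections, until the kernel reports activity.
		n, err := unix.EpollWait(epfd, events, -1)
		if err != nil {
			if err == unix.EINTR {
				continue
			}
			return err
		}
		for i := 0; i < n; i++ {
			// Only now do we read from the connection that actually has data.
			onReadable(int(events[i].Fd))
		}
	}
}
```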

Another important aspect of EDFS is the deduplication of triggers. If multiple clients subscribe to the same topic, we only create a single trigger and attach all Subscriptions to it. So, if 10,000 clients subscribe to the same topic, we still create only a single trigger, which means that we save another 9,999 threads.
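A simplified sketch of what this deduplication can look like (not the Router's actual code): one trigger per topic, and every client Subscription on that topic attaches to it.

```go
package triggers

import "sync"

// trigger fans events for one topic out to all attached Subscriptions.
type trigger struct {
	subscribers map[int]chan []byte // one channel per client Subscription
}

// Manager deduplicates triggers: all Subscriptions on the same topic share
// a single trigger, and therefore a single subscription on the Event Broker.
type Manager struct {
	mu       sync.Mutex
	triggers map[string]*trigger
	nextID   int
}

func NewManager() *Manager {
	return &Manager{triggers: map[string]*trigger{}}
}

// Subscribe attaches a client to a topic, creating the trigger only if it
// doesn't exist yet. It returns the client's event channel.
func (m *Manager) Subscribe(topic string) <-chan []byte {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.triggers[topic]
	if !ok {
		t = &trigger{subscribers: map[int]chan []byte{}}
		m.triggers[topic] = t
		// In the real system, this is the point where the single
		// broker subscription for the topic would be created.
	}
	m.nextID++
	ch := make(chan []byte, 1)
	t.subscribers[m.nextID] = ch
	return ch
}

// Dispatch forwards an incoming event to every Subscription on the topic.
func (m *Manager) Dispatch(topic string, event []byte) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.triggers[topic]
	if !ok {
		return
	}
	for _, ch := range t.subscribers {
		select {
		case ch <- event:
		default: // sketch only: drop if the subscriber is not keeping up
		}
	}
}
```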

EDFS is not limited to just GraphQL

One other aspect of EDFS is that you're not limited to just GraphQL and Subgraphs to invalidate Subscriptions. Let's say you're implementing a long-running task that takes a few minutes to complete. Multiple systems might be involved in completing the task. These can be Subgraphs, but they don't have to be.

You can use EDFS to publish events to the Event Broker from any system. The Router will then invalidate all Subscriptions that are listening to the topic of the event and resolve additional fields from Subgraphs.
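For example, a worker that completes a long-running task can publish the event straight to the broker. A hedged sketch using the NATS Go client, reusing the illustrative topic and the entity format described above:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Publish the entity representation the Router expects (__typename + key).
	// The subject is illustrative; it just has to match the @eventsSubscribe topic.
	if err := nc.Publish("employeeUpdated.3", []byte(`{"__typename":"Employee","id":3}`)); err != nil {
		log.Fatal(err)
	}

	// Make sure the message is flushed before the process exits.
	if err := nc.Flush(); err != nil {
		log.Fatal(err)
	}
}
```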

Performance Characteristics and Memory Consumption of EDFS

We've already talked about the performance characteristics of EDFS, but let's make this a bit more tangible by looking at some numbers.

We've tested EDFS with 10,000 concurrent Subscriptions listening to a single topic. We used pprof to profile the Router after all Subscriptions were established. We measured the Heap at ~150MB-200MB and the number of goroutines at ~40. CPU usage was at 0% when no events were published and jumped to ~300% when 50,000 events per second were published, meaning that we fully utilized 3 CPU cores.

Summarizing, EDFS is very efficient in terms of memory consumption and scales well across multiple CPU cores when publishing a lot of events.

Getting started with EDFS

We're very excited about EDFS and we hope you are too. If you're interested in trying it out, please check out the Documentation and join our Discord to share your feedback.

We'd love to hear from you! What can you build with EDFS that you couldn't build before? What other Brokers would you like to see supported?

Conclusion

In this article, we've talked about the problems and challenges with (federated) GraphQL Subscriptions. We've introduced EDFS, blurring the lines between Federated GraphQL and Event Driven Architectures. In an upcoming article, we'll dive deeper into the technical details of EDFS. In addition to that, we'll add more content and examples on how to leverage EDFS to build real-time applications.