The Future of Federation: Replacing GraphQL Subgraphs with gRPC Services
Federation is a very powerful concept. It lets you split a unified API into multiple subsystems. But the longer we've worked with it, the more often one question has come up:
Is splitting the GraphQL schema at the Subgraph level and having the Router talk to Subgraphs via GraphQL the right approach? What if we removed GraphQL from the Subgraphs, simplified the protocol between the Router and backend services, and replaced the transport layer with gRPC?
There are efforts from the community to formalize a GraphQL Federation spec. However, it's becoming clear to us that GraphQL itself is the limiting factor in improving the developer experience.
That said, this is not an easy topic to unpack. Let's step back a bit and start by explaining the problems with GraphQL Subgraphs.
How GraphQL Subgraphs Limit Federation and Developer Experience
We can break down the problems into four main categories:
- Technical differences between GraphQL and gRPC
- Performance of GraphQL vs gRPC as a transport layer
- Apollo gatekeeping the Subgraph Specification & Subgraph Frameworks
- Adoption of new features across Subgraph Frameworks
Let's go through each of these categories in more detail.
GraphQL Subgraphs are not type safe
Here's a simple example of a Subgraph SDL:
If the Router needs to fetch the name for three users, it sends the following JSON to the Subgraph:
The Subgraph has to implement a handler that accepts a list of _Any objects. If the __typename fields match the User type, the Subgraph framework will call the User resolver with the id field.
This might sound simple and straightforward, but it can very easily cause problems. Subgraph frameworks will oftentimes only inform you through runtime errors that there's a mismatch between the Subgraph SDL and the implementation.
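To make the failure mode concrete, here is a minimal sketch of how a Subgraph framework might dispatch entity representations. It is not taken from any real framework: the resolver registry, the function names, and the sample data are all hypothetical, and the request body is an assumed example that follows the Federation _entities convention.

```python
# Hypothetical sketch of `_entities` dispatch inside a Subgraph framework.
# None of these names come from a real framework.

# Assumed request for three users, following the `_entities` convention.
request = {
    "query": (
        "query($representations: [_Any!]!) { "
        "_entities(representations: $representations) { ... on User { name } } }"
    ),
    "variables": {
        "representations": [
            {"__typename": "User", "id": "1"},
            {"__typename": "User", "id": "2"},
            {"__typename": "User", "id": "3"},
        ]
    },
}

# Made-up sample data.
USERS = {"1": "Ada", "2": "Grace", "3": "Linus"}


def resolve_user(representation):
    # `_Any` is untyped: nothing verifies up front that "id" is present.
    return {"name": USERS[representation["id"]]}


# Resolvers are registered under plain string type names.
RESOLVERS = {"User": resolve_user}


def resolve_entities(representations):
    results = []
    for rep in representations:
        resolver = RESOLVERS.get(rep["__typename"])
        if resolver is None:
            # A mismatch between SDL and implementation only surfaces here,
            # while a request is being served.
            raise LookupError(f"no resolver registered for {rep['__typename']}")
        results.append(resolver(rep))
    return results


entities = resolve_entities(request["variables"]["representations"])
```

In this sketch, a misspelled type name or a missing key field is not detected until a request actually reaches the handler, which is the kind of runtime-only error described above.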
While GraphQL embraces type safety between the client and the server, many Federation Frameworks don't enforce type safety strictly enough between the Subgraph SDL and its implementation. By using gRPC, we can catch a whole class of these errors at code generation or compile time instead of at runtime.
GraphQL Subgraphs Require Manual Data Loader Implementation
Another problem we're seeing is that Subgraph frameworks don't solve data loading problems for you out of the box. They make data loading the responsibility of the implementer, leaving them to solve N+1 issues.
This is a huge problem because it doesn't just require awareness, it also means that every Subgraph implementer has to build a solution for it. Wouldn't it be great if the architecture solved the problem for everyone?
This is where our new approach comes in. Cosmo Router already solves data loading and batching. The Query Planner and Execution Engine are designed to solve the N+1 problem. By moving GraphQL to the Router level, we can reuse the existing data loading and batching logic, meaning that gRPC services, by design, don't have to worry about data loading or batching. Requests to gRPC services come in batches out of the box.
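The difference between per-item loading and batching can be sketched in a few lines. This is an illustration of the general technique, not Cosmo Router's implementation: the UserStore class, the review data, and the deduplication step are all assumptions made for the example.

```python
# Hypothetical in-memory backend that counts round trips.
class UserStore:
    def __init__(self):
        self.calls = 0
        self._names = {"1": "Ada", "2": "Grace", "3": "Linus"}

    def get_one(self, user_id):
        self.calls += 1
        return self._names[user_id]

    def get_many(self, user_ids):
        self.calls += 1
        return [self._names[user_id] for user_id in user_ids]


# Parent objects whose authors need to be resolved (made-up data).
reviews = [
    {"author_id": "1"},
    {"author_id": "2"},
    {"author_id": "1"},
    {"author_id": "3"},
]

# N+1 style: one backend call per parent object.
naive = UserStore()
naive_names = [naive.get_one(review["author_id"]) for review in reviews]

# Batched style: collect the keys, deduplicate them while keeping order,
# and make a single backend call for the whole list.
batched = UserStore()
unique_ids = list(dict.fromkeys(review["author_id"] for review in reviews))
names_by_id = dict(zip(unique_ids, batched.get_many(unique_ids)))
batched_names = [names_by_id[review["author_id"]] for review in reviews]
```

Both variants produce the same names, but the naive one makes one call per review while the batched one makes a single call. When the component that plans the request does the collecting, the service only ever sees the batched form.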
GraphQL vs gRPC: Performance as a Transport Layer
Another topic I'd like to address is the difference in performance between GraphQL and gRPC as the protocol between the Router and the Subgraphs.
In the Router, we've implemented a highly optimized GraphQL parsing, normalization, and validation pipeline. Once the pipeline is done, we cache the results where possible to avoid running the same query through the pipeline multiple times. After normalization, we generate a query plan, which is a very CPU-intensive operation. As such, we also implement a query plan cache to reduce latency and CPU usage.
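The idea behind a query plan cache can be sketched as follows. This is only an illustration under simplifying assumptions: the PlanCache class, the whitespace-only normalize function, and the stand-in planner are hypothetical and far simpler than what a real Router does.

```python
import hashlib


class PlanCache:
    """Caches plans keyed by a hash of the normalized query text."""

    def __init__(self, planner):
        self._planner = planner
        self._plans = {}
        self.misses = 0

    def get(self, normalized_query):
        key = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()
        plan = self._plans.get(key)
        if plan is None:
            # Only a cache miss pays the CPU cost of planning.
            self.misses += 1
            plan = self._planner(normalized_query)
            self._plans[key] = plan
        return plan


def normalize(query):
    # Stand-in for real normalization: collapse whitespace only.
    return " ".join(query.split())


def expensive_planner(normalized_query):
    # Stand-in for real query planning.
    return ("plan", normalized_query)


cache = PlanCache(expensive_planner)
first = cache.get(normalize("query { user(id: 1) { name } }"))
second = cache.get(normalize("query {\n  user(id: 1) {\n    name\n  }\n}"))
```

The two queries differ only in formatting, so they normalize to the same text and the second lookup reuses the plan built for the first.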
All in all, the Router is probably one of the most optimized GraphQL execution engines out there.
That said, there's a huge problem with the Apollo Subgraph approach. We can optimize the Router as much as we want, but if we're sending a GraphQL request to an unoptimized Subgraph, or a Subgraph implemented in a less performant language, our performance will always be limited by the Subgraph language and Framework implementation.
By using gRPC and removing GraphQL from the Subgraphs, we solve multiple problems at once. First, we no longer have to parse, normalize, validate, plan and execute GraphQL at the Subgraph level. It's just a simple gRPC request. Second, we can much more reliably predict the performance of gRPC services because we're no longer relying on Subgraph Frameworks to implement GraphQL efficiently.
Apollo Vendor Lock-in and GraphQL Federation
Another problem for the wider community is that Apollo is gatekeeping the Subgraph Specification and some Subgraph Frameworks. You can only ever innovate at the pace of Apollo, and to be frank, their primary goal seems to be adding GraphOS Enterprise features to their Router, locked behind a paywall.
Our philosophy is to make Cosmo Router the most open and flexible Federation Router on the market. Cosmo Router is open source, and we're collaborating with FAANG companies on the implementation to make it more modular and extensible. Our business model is to help teams build and evolve great APIs through collaboration.
By removing GraphQL from the Subgraphs, we're not just removing the dependency on Apollo's Subgraph Specification and Subgraph Frameworks, we're also opening the door for faster innovation. We can roll out new features to Cosmo Router and every language that supports gRPC will be able to use them immediately.
Why Subgraph Frameworks Lag Behind Apollo’s Federation Spec
This is an extension of the previous point. When a new feature is added to the Subgraph Specification, there will always be a delay until every Subgraph Framework implements it.
If you look at the Reference for compatible GraphQL server libraries, you'll notice two things:
- There are gaps in features supported by the different Subgraph Frameworks
- A lot of frameworks depend on Apollo
Like I said previously, there's a huge risk if you're depending on Frameworks maintained by a single company that forces you into Enterprise contracts. Just recently it was announced that they had to lay off 25% of their staff. If they further stumble with their business model, this could negatively affect the long-term support of all these Frameworks.
Wouldn't it be much better if we could reduce the number of Subgraph Frameworks that need maintenance to zero? With our approach of replacing GraphQL with gRPC, our engineering team can focus on a single component, Cosmo Router, and every language that supports gRPC will always be able to use the latest features with no maintenance overhead.
The Challenge of Evolving GraphQL Subgraph Frameworks
Adding features to Subgraph Frameworks is a lot of work. You have to extend the Specification, which is gatekept by Apollo, and then every Subgraph Framework must catch up and implement the new features.
This means that it's a community-wide effort to improve the developer experience of Federation and roll out new features. Federation users are often stuck with the same features for years, with no meaningful way to contribute.
When the gatekeeper of the Subgraph Specification is focused on adding Enterprise features to support their business model, you're probably stuck forever.
With the Cosmo gRPC approach, adding a new feature to Federation is as simple as updating the Router, and it's automatically available to everyone. Just update the Router and the GraphQL SDL-to-proto compiler, and you're good to go.
The Future of GraphQL Federation Runs on gRPC
Now that we've extensively covered the problems we're trying to solve, let's enter a new era of GraphQL Federation. It's simpler, easier to use, strictly typed, and much more performant. But at the same time, it's still GraphQL on the client side and the workflow is very similar, so you don't have to learn a whole new way of building APIs.
Here's a quick overview of the new approach:
Like with Apollo Federation, you start with a Subgraph SDL. Here's a quick example:
Now, you can use the GraphQL SDL-to-proto compiler to generate a gRPC service. The result looks like this:
Note how LookupUserByIdRequestKey is prefixed with the repeated keyword. We're not just sending a single key to the resolver; we're automatically creating a batch.
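From the service's point of view, a batched lookup could look roughly like the sketch below. It uses plain Python dataclasses as stand-ins for generated gRPC message types. Only the name LookupUserByIdRequestKey comes from the text above; the other message names, the field names, the sample data, and the rule that results are returned positionally (with None for unknown keys) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LookupUserByIdRequestKey:
    id: str


@dataclass
class LookupUserByIdRequest:
    # Stand-in for a `repeated` field: the request always carries a list of keys.
    keys: List[LookupUserByIdRequestKey] = field(default_factory=list)


@dataclass
class User:
    id: str
    name: str


@dataclass
class LookupUserByIdResponse:
    # Assumption: one entry per key, in the same order as the request.
    result: List[Optional[User]] = field(default_factory=list)


# Made-up sample data standing in for a database.
NAMES = {"1": "Ada", "2": "Grace"}


def lookup_user_by_id(request: LookupUserByIdRequest) -> LookupUserByIdResponse:
    wanted = [key.id for key in request.keys]
    # One pass over the data source for the whole batch.
    rows = {user_id: NAMES[user_id] for user_id in set(wanted) if user_id in NAMES}
    users = [
        User(id=user_id, name=rows[user_id]) if user_id in rows else None
        for user_id in wanted
    ]
    return LookupUserByIdResponse(result=users)


response = lookup_user_by_id(
    LookupUserByIdRequest(
        keys=[
            LookupUserByIdRequestKey(id="2"),
            LookupUserByIdRequestKey(id="9"),
            LookupUserByIdRequestKey(id="1"),
        ]
    )
)
```

Because the handler receives every key at once, there is no per-key call for the implementer to optimize away, and the typed request and response make the expected shape explicit.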
Now, all you have to do is implement the gRPC service in your language of choice and tell Cosmo Router where to find it. Here's an example configuration:
That's it. You can now start the Router and begin querying your GraphQL API.
Conclusion
If you're one of our users, you might be wondering if we're going to remove GraphQL support at some point. We're absolutely not. GraphQL will be used by many teams for a long time. Under the hood of the Router, gRPC is really just an extension of our existing GraphQL execution engine.
That being said, we absolutely recommend trying out the new approach to see how it works for you. In the long run, we expect gRPC to become the standard for implementing federated APIs.
In summary, we're beyond excited to go this route because we believe it will not only improve the developer experience and Federation performance, but also free us from Apollo’s gatekeeping and open the door to much faster innovation.
If you're curious about how we built the Subgraph to gRPC Compiler and the GraphQL-to-gRPC mapping layer, we cover it in The Next Generation of GraphQL Federation Speaks gRPC.
If you're as excited as we are, take a look at the Cosmo gRPC Service Quickstart and start building your next generation of federated APIs.
We'd love to hear your feedback on this approach. You can join our Discord or create an issue on GitHub.
We're looking forward to seeing you on the other side!