
The Next Generation of GraphQL Federation Speaks gRPC

Jens Neuse

Apollo introduced the amazing concept of Entities to GraphQL, allowing you to split a single monolithic GraphQL Schema into multiple Subgraphs and federate them through a Router.

But as amazing as this idea is, it always felt a bit like a hack. We're fixing this with our new Subgraph to gRPC Compiler.

tl;dr:

We've been building a completely new approach to implementing GraphQL Federation. We take Subgraph SDLs and compile them to gRPC services with out-of-the-box support for data loading, so you never run into N+1 problems again. At the same time, this approach lets you leverage the type safety and tooling benefits of the gRPC ecosystem. It is not just many times faster than GraphQL Subgraphs; it also makes implementations strictly typed and easier to reason about.

The Problem: Subgraph Entity Representation fetches are not type-safe

The way Apollo distributes Entities across Subgraphs looks like this:

One Subgraph defines a root field that returns an Entity, which is defined by adding a @key directive to the type. Then, another Subgraph defines the same Entity and adds an additional field. Once we compose the two Subgraphs, we get a Schema that incorporates the fields from both.

The first Subgraph could look like this, defining a User Entity:

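```graphql
# A sketch of such a Subgraph SDL; fields beyond the `id` key are illustrative.
type Query {
  me: User
}

type User @key(fields: "id") {
  id: ID!
  name: String!
}
```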

The second Subgraph could look like this, defining a User Entity with an additional field:

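```graphql
# A sketch; the second Subgraph re-declares the User Entity and adds `posts`.
type User @key(fields: "id") {
  id: ID!
  posts: [Post!]!
}

type Post {
  id: ID!
  title: String!
}
```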

Now, when we compose the two Subgraphs, we get a Schema that combines the fields from both:

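```graphql
# A sketch of the composed Schema, based on the two example Subgraphs above.
type Query {
  me: User
}

type User {
  id: ID!
  name: String!
  posts: [Post!]!
}

type Post {
  id: ID!
  title: String!
}
```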

Now, let's say we make the following Query:

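```graphql
# An illustrative Query that touches fields from both Subgraphs.
query {
  me {
    id
    name
    posts {
      id
      title
    }
  }
}
```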

The Router would make the following request to the User Subgraph:

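```graphql
# The Router first fetches the fields owned by the User Subgraph, including the key.
query {
  me {
    id
    name
  }
}
```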

Let's assume the response from the User Subgraph looks like this:

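```json
{
  "data": {
    "me": {
      "id": "1",
      "name": "Jens"
    }
  }
}
```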

The Router would then follow up with the following fetch to the Posts Subgraph:

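```graphql
# The Router resolves the remaining fields through the _entities field.
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on User {
      posts {
        id
        title
      }
    }
  }
}
```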

The full request, including the variables, would look like this:

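```json
{
  "query": "query($representations: [_Any!]!){ _entities(representations: $representations){ ... on User { posts { id title } } } }",
  "variables": {
    "representations": [
      {
        "__typename": "User",
        "id": "1"
      }
    ]
  }
}
```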

The main problem with this approach is that the Subgraph must implement an _entities field that accepts a list of _Any objects. There's not much you can do at compile time to ensure that the Subgraph is implemented correctly. Discussions with our customers show that this "hack" is a common source of bugs. We've been asked more than once to advise customers on solutions to "prove" the correctness of a Subgraph implementation.

So, let's take a look at the new approach.

The Solution: Replacing Apollo Subgraphs with gRPC Services

We have been in the GraphQL Federation market for a few years now, so we've been able to build a lot of relationships with Federation users and learn about their architecture, processes, tooling, and workflows.

One very common pattern we've seen is that many organizations build GraphQL Subgraph shim services that sit on top of gRPC services. They felt that gRPC was the more mature technology for implementing internal APIs, while GraphQL was more of a frontend-facing technology that sits on top of those internal APIs. For backend engineers, gRPC is an approach that is easy to reason about and implement.

All of this made us think about removing the "intermediate" layer of GraphQL Subgraph shim services. If frontend engineers prefer to work with GraphQL, and backend engineers want to work with gRPC, why can't the Router take the responsibility of directly fetching data from the gRPC services and translating between GraphQL Queries and gRPC Messages?

So that's what we've been working on: a Subgraph to gRPC Compiler, and an Adapter in the Router that translates between the two API styles.

The Subgraph GraphQL SDL to gRPC Compiler

Let's tackle the problem step by step. First, we needed to build a compiler that takes a GraphQL SDL and compiles it into a gRPC proto document.

Let's take a look at what the User Subgraph looks like:

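```graphql
# The same sketch as before; fields beyond the `id` key are illustrative.
type Query {
  me: User
}

type User @key(fields: "id") {
  id: ID!
  name: String!
}
```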

If we run this through the compiler, the proto document would look like this:

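```protobuf
// A sketch of the generated proto; exact package naming and options may differ.
syntax = "proto3";

package service.v1;

service UserService {
  // Generated from the @key directive, so the Router can resolve User entities.
  rpc LookupUserById(LookupUserByIdRequest) returns (LookupUserByIdResponse) {}
  // Generated from the `me` root field on the Query type.
  rpc QueryMe(QueryMeRequest) returns (QueryMeResponse) {}
}

// Key message for looking up a single User entity.
message LookupUserByIdRequestKey {
  string id = 1;
}

// The request is batched: the Router sends many keys in a single call.
message LookupUserByIdRequest {
  repeated LookupUserByIdRequestKey keys = 1;
}

// The response returns one User per key, in the same order as the keys.
message LookupUserByIdResponse {
  repeated User result = 1;
}

message QueryMeRequest {
}

message QueryMeResponse {
  User me = 1;
}

message User {
  string id = 1;
  string name = 2;
}
```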

You'll notice two RPC methods, LookupUserById and QueryMe. These are the two entry points we need to implement. The QueryMe method is responsible for returning the me root field. The LookupUserById method was generated to satisfy the @key directive by creating an entry point to extend the User type.

One observation you might have is that the LookupUserByIdRequest message carries a repeated list of LookupUserByIdRequestKey messages. We're not sending a single key per request, which would create an N+1 problem. Instead, we're implementing the data loader pattern at the Router level, which automatically batches "lookups" for the same entity and Subgraph.

This is not just a performance improvement; it also makes the implementation much easier to reason about. It's best practice to always implement the data loader pattern at the Subgraph level. With gRPC replacing GraphQL Subgraphs, data loading is one less problem backend engineers have to worry about. It just works, allowing them to focus on the business logic.

Just for completeness, here's the proto document for the Posts Subgraph:

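```protobuf
// A sketch of the generated proto for the Posts Subgraph; naming may differ.
syntax = "proto3";

package service.v1;

service PostsService {
  // Generated from the @key directive: this Subgraph extends the User Entity.
  rpc LookupUserById(LookupUserByIdRequest) returns (LookupUserByIdResponse) {}
}

message LookupUserByIdRequestKey {
  string id = 1;
}

message LookupUserByIdRequest {
  repeated LookupUserByIdRequestKey keys = 1;
}

message LookupUserByIdResponse {
  repeated User result = 1;
}

message User {
  string id = 1;
  repeated Post posts = 2;
}

message Post {
  string id = 1;
  string title = 2;
}
```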

Implementing the gRPC to GraphQL Adapter at the Router level

Mapping between GraphQL and gRPC isn't trivial due to fundamental differences in how both ecosystems model data and APIs. Both styles are schema-first, but GraphQL is way more flexible and dynamic, inheriting the good and bad parts of JSON. gRPC, on the other hand, is way more rigid and static. It's optimized for performance and memory usage, and it comes with data types that aren't available in JSON, such as differentiating between floats, doubles, and integers, while JSON only has numbers.

Core Mapping Rules between GraphQL and gRPC

Types:

Every GraphQL type becomes a message in Protocol Buffers. Each field in a GraphQL type gets a corresponding field in the message, annotated with a unique field number (like `string name = 1`).

Scalars:

Built-in GraphQL scalars are mapped to their Protocol Buffers equivalents:

  • String → string
  • Int → int32
  • Float → float
  • Boolean → bool
  • ID → string

Custom scalars require explicit definitions or fallback types.

Enums:

GraphQL enums are directly mapped to protobuf enums, with values given numeric identifiers.
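For example (a minimal sketch; the Role enum and its values are illustrative, and proto3 requires a zero value, which is why generators typically add an UNSPECIFIED entry):

```protobuf
// GraphQL: enum Role { ADMIN USER }
enum Role {
  ROLE_UNSPECIFIED = 0;
  ROLE_ADMIN = 1;
  ROLE_USER = 2;
}
```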

Input Types:

GraphQL input objects are mapped in the same way as regular messages, since both are essentially request payloads.

Queries and Mutations:

These are translated into gRPC service methods, where the operation name becomes the method name, and the input/output types are derived from the GraphQL schema.

Example:

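```graphql
# An illustrative schema; the userById field is an assumption for this example.
type Query {
  userById(id: ID!): User
}

type User {
  id: ID!
  name: String!
}
```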

Will roughly become:

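```protobuf
// A rough sketch of the generated service; exact naming conventions may differ.
service UserService {
  rpc QueryUserById(QueryUserByIdRequest) returns (QueryUserByIdResponse) {}
}

message QueryUserByIdRequest {
  string id = 1;
}

message QueryUserByIdResponse {
  User user_by_id = 1;
}

message User {
  string id = 1;
  string name = 2;
}
```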

Challenges of mapping GraphQL to gRPC

One major challenge is the loss of expressiveness. GraphQL's optionality and unions don't translate directly into Protobuf's more rigid model. The mapping must enforce stricter typing (e.g., non-null vs. nullable fields) and resolve things like default values and repeated fields carefully. Field numbering in Protobuf also demands discipline and uniqueness, a concept that has no counterpart in GraphQL.

To handle these, the Cosmo system uses a deterministic field ordering strategy, translates optional/nullable semantics via wrappers, and ensures that field numbers are stable across schema generations. It also enforces constraints (e.g., disallowing untagged types) to ensure the mapped proto is valid and future-proof.
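As a minimal sketch of the nullability strategy (the `nickname` field is an assumption for illustration), a nullable GraphQL field can be carried in one of Protobuf's well-known wrapper types, which can distinguish "unset" from a default value:

```protobuf
import "google/protobuf/wrappers.proto";

message User {
  // User.id: ID! (non-nullable) maps to a plain scalar.
  string id = 1;
  // User.nickname: String (nullable) maps to a wrapper that can be left unset.
  google.protobuf.StringValue nickname = 2;
}
```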

Proto lock file: Keeping track of changes to avoid breaking changes

Another challenge we faced was the significance of field numbers in Protobuf. In a GraphQL SDL, the order of fields doesn't matter. You can add new fields to the end of the type definition and later move them to the top, without affecting the API at all. In comparison, each field in a Protobuf message has a unique field number.

Imagine the following scenario:

You create a message in a proto file to represent the id and name of a User type.

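```protobuf
// A sketch; the field types are assumed for illustration.
message User {
  string id = 1;
  string name = 2;
}
```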

Now, you decide to remove the name field. Your proto file now looks like this:

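```protobuf
message User {
  string id = 1;
}
```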

Now, you want to add a new field to the User type to represent the age of the user.

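```protobuf
message User {
  string id = 1;
  int32 age = 2;  // reuses number 2, which previously belonged to `name`
}
```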

We now have one version of the message where the number 2 is assigned to the age field, and a previous version where the number 2 was assigned to the name field. The two fields are not just semantically different; they also have incompatible types.

To solve this problem, we've introduced a "proto lock file" that keeps track of previous versions of the proto file. This way, we can use the reserved keyword to prevent field numbers from being reused.

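```protobuf
message User {
  reserved 2;  // blocks reuse of the number previously assigned to `name`
  string id = 1;
  int32 age = 3;
}
```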

Conclusion

As a next step, you can learn more about gRPC Services in the documentation.

We've also prepared a quickstart tutorial so you can try out the new approach yourself.

This is our first step towards improving the Federation experience and making it less dependent on GraphQL. We're very eager to hear your feedback. Please join our Discord and let us know what you think. You can also open an issue on GitHub if you have any questions or feedback.

Thanks for reading!