Designing a Multi-Tenant Federated GraphQL Schema

One topic that keeps coming up in discussions about GraphQL Schemas is how to design a schema that supports multi-tenancy. Let's explore the different approaches!

What is Multi-Tenancy in a GraphQL API?

Multi-tenancy is a software architecture where a single instance of the software runs on a server and serves multiple tenants. Each tenant is a group of users who share common access, with specific privileges, to the software instance.

In the context of a GraphQL API, multi-tenancy means that the API serves multiple tenants, each with its own data and access rules. In a monolithic GraphQL API, multi-tenancy can be achieved by implementing a middleware at the root level that checks the authentication and adds the tenant information to the request context. The resolvers can then use this context to implement tenant-specific logic.
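To make the middleware idea concrete, here's a minimal sketch in Python. The token table, the `RequestContext` class, and all function names are assumptions made for illustration; a real API would verify a signed token instead of looking it up in a dictionary.

```python
from dataclasses import dataclass

# Hypothetical token-to-tenant lookup; a real API would verify a signed token.
TOKENS = {"token-acme": "acme", "token-globex": "globex"}

# Hypothetical per-tenant data store.
PRODUCTS = {"acme": ["anvil", "rocket"], "globex": ["reactor"]}


@dataclass(frozen=True)
class RequestContext:
    tenant_id: str


def tenant_middleware(headers: dict) -> RequestContext:
    """Authenticate the request once and attach the tenant to the context."""
    token = headers.get("Authorization", "").removeprefix("Bearer ")
    tenant_id = TOKENS.get(token)
    if tenant_id is None:
        raise PermissionError("unknown or missing token")
    return RequestContext(tenant_id=tenant_id)


def resolve_products(context: RequestContext) -> list:
    """Resolver that scopes its result to the tenant found by the middleware."""
    return PRODUCTS.get(context.tenant_id, [])


context = tenant_middleware({"Authorization": "Bearer token-acme"})
print(resolve_products(context))
```

Because everything runs in one process, the context object is simply handed from the middleware to every resolver.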

So, while multi-tenancy in a monolithic GraphQL API is relatively straightforward, things are a little more complicated when it comes to federated GraphQL schemas.

Challenges of Multi-Tenancy in a Federated GraphQL Schema

In a federated GraphQL API, the schema is split into multiple services, each responsible for a subset of the overall schema. This means that we can't just have a shared memory object that holds the tenant information, as each service is independent and doesn't have access to the context of the other services.

Additionally, it's possible that some parts of the Schema are shared across tenants, while others are tenant-specific, so our ideal solution should allow for both approaches.

Finally, we don't want to duplicate the tenant-specific logic in each service, as this would lead to code duplication and make it harder to maintain the schema. The tenant-specific logic should be centralized in a single place.

The last point leads us to another important consideration: Should the tenant-specific logic be implemented in the Gateway or in one of the services?

Approaches to Multi-Tenancy in a Federated GraphQL Schema

There are multiple approaches to implementing multi-tenancy in a federated GraphQL API. They can be broadly categorized into two groups: Schema-based and Transport-based.

You can make multi-tenancy part of the GraphQL Schema, or you can handle it at the transport level, making the Schema itself tenant-agnostic. Let's explore both approaches in more detail.

Transport-based Multi-Tenancy in a Federated GraphQL API

In the transport-based approach, the tenant information is passed as part of the request. This can be done in an opaque way, e.g. by forwarding a JWT from the Gateway to all Subgraphs, or by extracting the tenant information from the client request and passing it along to the Subgraphs as a header.

Both approaches have the advantage that the Subgraphs don't have to work out the tenant on their own, as they receive the tenant information as part of the request.

If you're validating the JWT in the Gateway, you can already terminate the request if the JWT is invalid. This way, all Subgraphs can trust that the JWT is valid and that the tenant information is correct.

By forwarding the JWT to the Subgraphs, you also establish a zero-trust architecture, as the Subgraphs don't have to trust the Gateway to provide the correct tenant information. The downside is that all Subgraphs have to validate the JWT themselves, but this is a small price to pay for the added security. If the JWT is not opaque, but contains the tenant information in a readable format, it's an efficient way to pass the tenant information along.
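Here's a rough sketch of both variants using only the Python standard library. It assumes an HS256-signed JWT with a `tenant_id` claim and an `X-Tenant-ID` header; both names are made up for this example, and expiry and audience checks are left out to keep it short.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a token for the demo; a real identity provider would issue it."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if the signature matches, otherwise raise ValueError."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(payload))


def subgraph_headers(client_headers: dict, secret: bytes) -> dict:
    """Gateway step: reject invalid tokens, then forward the JWT and the tenant."""
    authorization = client_headers.get("Authorization", "")
    claims = verify_hs256(authorization.removeprefix("Bearer "), secret)
    return {
        "Authorization": authorization,  # opaque variant: Subgraphs can re-validate
        "X-Tenant-ID": claims["tenant_id"],  # extracted variant (assumed header name)
    }


SECRET = b"demo-secret"
token = sign_hs256({"sub": "user-1", "tenant_id": "acme"}, SECRET)
print(subgraph_headers({"Authorization": f"Bearer {token}"}, SECRET)["X-Tenant-ID"])
```

A Subgraph that doesn't want to trust the Gateway could call the same `verify_hs256` function on the forwarded `Authorization` header before reading the claim.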

On the other hand, handling multi-tenancy at the transport level also has some downsides.

First, none of the GraphQL tools involved will be aware of the tenant information. This means that your analytics solution, monitoring, GraphQL Playground, and other tools won't be able to differentiate between tenants. The tenant will always be a concern that has to be handled on top of the GraphQL Schema.

Second, all Subgraphs have to implement the same logic to extract the tenant information from the request and use it in their resolvers. Although it might seem like a minor detail, this can easily lead to inconsistencies and bugs if not implemented correctly. If a request depends on the tenant information, why is that not part of the arguments of the resolver? Why hide this important information in the context?

So, while the transport-based approach is a very simple and lightweight way to implement multi-tenancy in a federated GraphQL API, it also has some downsides that you should be aware of.

Let's now contrast this with the Schema-based approach.

Schema-based Multi-Tenancy in a Federated GraphQL API

In the Schema-based approach, the tenant information is part of the Schema itself. This means that the Schema is aware of the tenant information and can use it to implement tenant-specific logic. We don't have to think about passing the tenant information along with the request, as the Schema already knows which tenant the request belongs to.

Let's look at an example to illustrate this. Imagine we're building a multi-tenant Schema for an e-commerce platform. Each tenant has its own products, orders, and customers.

The resulting Client Schema would look like this:
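As a minimal sketch, such a schema could have the following shape. Every type and field name here is an assumption based on the description above (a tenant with products, orders, and customers); the SDL is kept in a Python string so that a small helper can list the types it declares.

```python
import re

# Hypothetical client schema: every type and field name below is an assumption.
CLIENT_SCHEMA = """
type Query {
  tenant: Tenant!
}

type Tenant {
  id: ID!
  name: String!
  products: [Product!]!
  orders: [Order!]!
  customers: [Customer!]!
}

type Product {
  id: ID!
  name: String!
}

type Order {
  id: ID!
  total: Float!
}

type Customer {
  id: ID!
  email: String!
}
"""


def type_names(schema: str) -> list:
    """List the object types declared in the SDL, in order of appearance."""
    return re.findall(r"^type (\w+)", schema, flags=re.MULTILINE)


print(type_names(CLIENT_SCHEMA))
```

Note that `tenant` takes no arguments in this sketch: the tenant is derived from the request, and everything tenant-specific hangs off the `Tenant` type.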

Let's make a query to fetch the products of a tenant:
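A sketch of such an operation, assuming the tenant exposes an `id` and a `products` list whose items have `id` and `name` fields (all of these names are illustrative), could be built like this in Python:

```python
import json

# Hypothetical operation: the field names are assumptions, not a published schema.
TENANT_PRODUCTS_QUERY = """
query TenantProducts {
  tenant {
    id
    products {
      id
      name
    }
  }
}
"""


def build_request_body(query: str) -> str:
    """Serialize a GraphQL operation as the JSON body of an HTTP POST request."""
    return json.dumps({"query": query.strip()})


print(build_request_body(TENANT_PRODUCTS_QUERY))
```

The client never passes a tenant ID; it only asks for `tenant` and selects the fields it needs below it.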

Let's take a look at how this Schema is implemented in the Subgraphs. For the Tenant Subgraph to work, we need to enable Header Propagation in the Router. This way, the Router will forward the Authorization header to the Tenant Subgraph, which can then choose the correct tenant based on a claim in the JWT.

Here's the Router configuration:

Now, the Tenant Subgraph can use the Authorization header to determine the tenant and return it in the tenant resolver.

Next, we'd like to implement the products resolver in the Product Subgraph. As we've already resolved the tenant, there's no need to implement any extra logic except the products resolver itself.

Here's what the products Service implementation could look like:

We're not handling JWTs or dealing with untyped context objects. We implement the function to resolve a Tenant by its ID (reference) and load the products for this tenant.
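To illustrate the idea, here's a framework-free sketch of the two functions in Python. The `Tenant` class, the in-memory product store, and the function names are assumptions for this example; in a real Subgraph, your GraphQL framework would call the reference resolver for each representation it receives from the Router.

```python
from dataclasses import dataclass

# Hypothetical in-memory product store keyed by tenant ID.
PRODUCTS_BY_TENANT = {
    "acme": [{"id": "p1", "name": "Anvil"}, {"id": "p2", "name": "Rocket"}],
    "globex": [{"id": "p3", "name": "Reactor"}],
}


@dataclass(frozen=True)
class Tenant:
    id: str


def resolve_tenant_reference(representation: dict) -> Tenant:
    """Reference resolver: rebuild a Tenant from the key sent by the Router."""
    if representation.get("__typename") != "Tenant":
        raise ValueError("unexpected entity type")
    return Tenant(id=representation["id"])


def resolve_tenant_products(tenant: Tenant) -> list:
    """Field resolver for Tenant.products: the tenant arrives as a typed value."""
    return PRODUCTS_BY_TENANT.get(tenant.id, [])


tenant = resolve_tenant_reference({"__typename": "Tenant", "id": "acme"})
print([product["name"] for product in resolve_tenant_products(tenant)])
```

The tenant is an explicit, typed input to both functions, so there is nothing to read from headers or from a shared context.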

The Order Subgraph works similarly. We can use the tenant information to load the orders for this tenant.

The Tenant Entity becomes the Entry Point for Multi-Tenancy in a Federated Graph

If you follow the Schema-based approach closely, you'll notice that the Tenant entity becomes the entry point for multi-tenancy in the federated graph. Instead of adding fields to the Query type or using a middleware to determine the tenant, all Subgraphs extend the Tenant type and use it to resolve the tenant-specific data.

The key fields of the Tenant Entity are used as arguments in all Subgraphs that extend the Tenant type. Instead of using a JWT or a generic context object, we're able to establish a clear contract between all Subgraphs on how to resolve the tenant.

Even in cases where you need to share more complex information to resolve the tenant, we're able to accomplish this with a composite key.

With this composite key in place, here's an example of what a request from the Router to the Product Subgraph could look like:
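As a sketch, the request body could be assembled like this. The `_entities` field and the `representations` variable follow the Federation convention for entity requests, while the key fields `id` and `region` are purely hypothetical stand-ins for whatever your composite key contains.

```python
import json

# Entity query in the usual Federation shape; the selected fields are illustrative.
ENTITIES_QUERY = """
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Tenant {
      products {
        id
        name
      }
    }
  }
}
"""


def build_entities_request(tenant_id: str, region: str) -> dict:
    """Build the request body; `id` and `region` are assumed key fields."""
    return {
        "query": ENTITIES_QUERY.strip(),
        "variables": {
            "representations": [
                {"__typename": "Tenant", "id": tenant_id, "region": region}
            ]
        },
    }


print(json.dumps(build_entities_request("acme", "eu-west-1"), indent=2))
```

Everything the Product Subgraph needs to identify the tenant travels in the `representations` variable, which makes the request easy to log, replay, and test.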

Comparing a Schema-based and Transport-based Approach to achieve Multi-Tenancy in a Federated Graph

With the Transport-based approach, we're typically passing the tenant information along with every Subgraph request in the form of a JWT. This JWT is validated by the Gateway and forwarded to the Subgraphs. All Subgraphs can then parse and validate the JWT themselves to determine the tenant in a middleware, which then injects the tenant information into the resolver context.

In the Schema-based approach, the tenant information becomes a first-class citizen of the Schema. If you're using a Schema-first approach, you can generate boilerplate code for the Subgraphs to make the resolvers aware of the tenant information. With a Code-first approach, you can implement integration tests to ensure that the Resolvers align with the Schema.

The Schema-based approach is more declarative and allows for a clear contract between the Subgraphs. The transport-based approach, on the other hand, is more flexible and allows Subgraphs to leverage other claims in the JWT to implement more complex logic.

This very last point is crucial when you're thinking about scaling your Schema across more teams and Subgraphs. While the transport-based approach is more lightweight and easier to implement, it's also more error-prone and harder to govern at scale.

Everything that's part of the Schema is part of the contract between all Subgraphs, which means that all teams agree on this contract. If a team decides that they need to change or extend this contract, they can propose a change to the Schema and all other teams can discuss it.

With a Schema Registry like Cosmo, you have schema checks in place that ensure that no Subgraph gets deployed that doesn't adhere to the Schema. If we introduce a breaking change in our JWT structure, or add custom logic that depends on unofficial claims, we risk breaking our Federated Graph without any checks in place.

Conclusion

Everything that's part of the Schema is part of the contract between all Subgraphs, which is a powerful concept to establish a framework for teams to work together on a federated graph.

If a feature like multi-tenancy is implemented in a way that's not part of the Schema, it's harder to govern and maintain at scale.

In an ideal world, the results of a Subgraph resolver should be predictable and reproducible. If we're introducing external dependencies like JWTs or context objects, we're introducing a level of uncertainty that's hard to manage and govern.

In a nutshell, by having all arguments that influence the result of a resolver as part of the Schema, it's much easier to reason about the behavior of our Subgraphs, and to ensure that they work together in harmony.

If you like the work we're doing at WunderGraph, please take a look at our open positions.
