GraphQL Federation Field-level Metrics 101


Prithwish Nath



Federated GraphQL has been an invaluable tool for enterprise systems because it offers a scalable, decentralized architecture that accommodates evolving data requirements across distributed teams — and their diverse microservices. You have your independent services, merge them into a unified schema with a single endpoint, and now all of your clients can access exactly the data they want, as if it were all coming from a single GraphQL API.

Instead of having a single, monolithic GraphQL server that eventually becomes difficult to scale, you’ve divided your schema and functionality across multiple services.

But the agility, collaboration, and adaptability afforded by such an architecture would hold little value if you didn’t also have crucial metrics that let you optimize your data fetching strategies, minimize downtime by patching issues fast, allocate your resources efficiently, and — in general — make informed decisions in the context of your system.

Field-level metrics in federated GraphQL are precisely this detailed insight.

To demonstrate field usage metrics in Federation, I’ll be using WunderGraph Cosmo, a fully open-source, fully self-hostable platform for Federation V1/V2 that is a drop-in replacement for Apollo GraphOS.

Field Usage Metrics 101

What’s a ‘field’ in GraphQL, anyway?

A field is just an atomic unit of information that can be queried using GraphQL. Suppose we have these two very simple subgraphs, Users and Posts:

[Schema listing: Posts subgraph]
[Schema listing: Users subgraph]

From these two graphs, we can tell that Users have ids and names, and Posts have ids, content, and authorIds. The shape of each piece of data is given by the type of its field: name is a String, one of GraphQL’s built-in scalar types, while the author of a Post is a compound type, represented by the User object type.

The relationship in this example is simple enough: each Post has a User who authored it, and the authorId field uniquely identifies that User for each Post.

Let’s not go too deep into Federation-specific directives here. TL;DR: @key declares the field(s) that uniquely identify an object type, @external marks a field that is defined in another subgraph and resolved there, and @requires names the field(s) that must be available in order to resolve the field it annotates (here, authorId).

So if you wanted to query for all Posts along with their User authors, you would request these fields in your GraphQL query:

  • posts is a root field on the Query type, served by the Posts subgraph, and returns a list of Post objects.
  • id and content are fields on the Post type.
  • author is a field on the Post type that returns a User. Within the posts query, the relation via authorId references a User from the Users subgraph.

Field-level usage metrics in GraphQL track how often these specific fields, across different subgraphs, are requested in queries on the federated graph. And for fields that return object types, like posts, we can get even more fine-grained and look at the usage of each of their individual fields in turn.
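
To make that counting concrete, here is a minimal sketch in Python. It is not how Cosmo's Router collects metrics; it only illustrates the idea. The schema map, the nested-dict representation of the query, and the count_fields helper are all assumptions made up for this example, with the query shaped like { posts { id content author { name } } }.

```python
from collections import Counter

# Hypothetical, simplified schema map: parent type -> field -> returned type.
SCHEMA = {
    "Query": {"posts": "Post"},
    "Post": {"id": "ID", "content": "String", "author": "User"},
    "User": {"id": "ID", "name": "String"},
}

# Assumed selection set for: { posts { id content author { name } } }
# Each key is a requested field; its value is the nested selection (empty for leaves).
QUERY = {"posts": {"id": {}, "content": {}, "author": {"name": {}}}}


def count_fields(selection, parent_type, counts):
    """Walk a selection set and count each requested 'Type.field' coordinate."""
    for field, sub_selection in selection.items():
        counts[f"{parent_type}.{field}"] += 1
        if sub_selection:
            # Descend into the type this field returns.
            count_fields(sub_selection, SCHEMA[parent_type][field], counts)
    return counts


usage = Counter()
for _ in range(3):  # pretend the same operation was executed three times
    count_fields(QUERY, "Query", usage)

for coordinate, hits in sorted(usage.items()):
    print(coordinate, hits)
```

Note that User.id never shows up in the tally: it exists in the schema, but this query never asks for it, which is exactly the kind of gap field-level usage data surfaces.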

What does all this information get us?

  • We’d be able to debug issues faster, because thanks to our metrics we’d know exactly which fields were having trouble resolving data.
  • Even if there were no immediate fatal errors, specific performance data for each field would still allow us to pinpoint bottlenecks or optimization opportunities — and then we could ship fixes/improvements at different levels: our resolver functions, database queries, or network calls associated with those specific fields.
  • Just knowing how many times a specific field has been requested or resolved (taking into account potential caching) within a given timeframe would provide valuable insights into user behavior and needs, help us streamline the schema and reduce infrastructure costs, or just help us make informed decisions about pricing tiers and resource allocation.
  • We’d have insight into the performance trends (error rates, latency, etc.) of specific fields. We could use this to proactively improve scalability based on anticipated demand (e.g., one field might need more compute power, another more database throughput), before problems ever get bad enough to impact user experience.
  • Tracking field-level metrics is crucial for enterprises that need to comply with SLAs: it lets them verify that the performance of individual fields meets predefined service-level expectations.
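
As a rough illustration of the latency, error-rate, and SLA points above, the sketch below groups per-field resolver samples and checks them against targets. Everything in it is an assumption for the sake of the example: the sample data, the 200 ms p95 target, the 5% error budget, and the helper names. It is not Cosmo's data model.

```python
import math
from collections import defaultdict

# Hypothetical samples: (field coordinate, resolver latency in ms, succeeded?)
SAMPLES = [
    ("Post.content", 12.0, True),
    ("Post.content", 15.0, True),
    ("Post.author", 80.0, True),
    ("Post.author", 240.0, False),
    ("Post.author", 95.0, True),
    ("Post.author", 110.0, True),
]

SLA_P95_MS = 200.0         # assumed latency target
SLA_MAX_ERROR_RATE = 0.05  # assumed error budget


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def sla_report(samples):
    """Compute p95 latency and error rate per field, and compare to the targets."""
    by_field = defaultdict(list)
    for field, latency_ms, ok in samples:
        by_field[field].append((latency_ms, ok))

    report = {}
    for field, rows in by_field.items():
        p95 = percentile([latency for latency, _ in rows], 95)
        error_rate = sum(1 for _, ok in rows if not ok) / len(rows)
        report[field] = {
            "p95_ms": p95,
            "error_rate": error_rate,
            "within_sla": p95 <= SLA_P95_MS and error_rate <= SLA_MAX_ERROR_RATE,
        }
    return report


report = sla_report(SAMPLES)
for field, stats in sorted(report.items()):
    print(field, stats)
```

With this made-up data, Post.content stays within both targets while Post.author misses both, which is the signal that would send us to look at the resolver, database query, or network call behind that one field.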

TL;DR: less reactive firefighting, more proactive optimization. Let’s show off these metrics for a second.

Field-usage Metrics with WunderGraph Cosmo

I’ll use WunderGraph Cosmo to federate those two subgraphs. Cosmo is an all-in-one platform for GraphQL Federation that comes with composition checks, routing, analytics, and distributed tracing, all under the Apache 2.0 license and runnable entirely on-prem. It’s essentially a drop-in, open-source replacement for Apollo GraphOS, and helpfully offers a one-click migration option from it.

👉 Cosmo on GitHub: The code

The Cosmo platform comprises:

  1. the Studio — a GUI web interface for managing schemas, users, projects, and metrics/traces,
  2. the Router — a Go server that implements Federation V1/V2, routing requests and aggregating responses,
  3. and the Control Plane — a layer that houses core Cosmo APIs.

The key to managing your Federation with the Cosmo stack is its CLI tool, wgc. You install it from the npm registry, and your subsequent workflow would look something like this:

  • Create subgraphs from your independently deployed and managed GraphQL services using wgc subgraph create.
  • Publish the created subgraphs to the Cosmo platform (or more accurately, to its Control Plane) with wgc subgraph publish. This makes the subgraphs available for consumption. Note that the Cosmo “platform” here can be entirely on-prem.
  • Once you have all your subgraphs created and published, federate them into a unified graph using wgc federated-graph create.
  • Configure and deploy the Cosmo Router to make your federated graph available to be queried at the routing URL you specified. The Router, in addition to being a stateless gateway that intelligently routes client requests to subgraphs that can resolve them, also generates the field usage metrics for our federated Graph as it’s being queried.

Then we run a few queries against our federated graph and fire up Studio, our web interface.

Studio contains the Schema Explorer, which is the control room for your federated GraphQL ecosystem. Here, you can view and download the schemas of all of your subgraphs and federated graphs, and (more importantly in our case) view the usage of every single type in your federated ecosystem, from Objects (Users, Posts) to the Scalars that they’re made of (Boolean, ID, and String), and even the root operation types (each query, mutation, and subscription).

This is an incredibly fine-grained look at your system. Want to know exactly how many times the author relation (via authorId) was actually accessed when querying for one or more Posts? Go right ahead.

The field usage metrics for the author relation here tell you exactly how many clients and operations requested it, along with a histogram for usage. You get to see exactly which operations accessed it, how many times they did so, which subgraphs were involved in resolving requests for this field, and finally, the first and last time the relation was accessed.
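
If it helps to picture what sits behind such a view, the following sketch rolls raw usage records up into the same kind of summary: request count, distinct clients and operations, subgraphs involved, and first/last seen. The UsageRecord shape, the client and operation names, and the timestamps are all hypothetical; Cosmo's actual storage schema is not shown in this article.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageRecord:
    """One hypothetical observation of a field being requested."""
    field: str
    client: str
    operation: str
    subgraph: str
    seen_at: datetime


# Made-up records for illustration only.
RECORDS = [
    UsageRecord("Post.author", "web", "GetPosts", "users", datetime(2024, 1, 1, 9, 0)),
    UsageRecord("Post.author", "mobile", "GetPosts", "users", datetime(2024, 1, 2, 14, 30)),
    UsageRecord("Post.author", "mobile", "GetFeed", "users", datetime(2024, 1, 3, 8, 15)),
    UsageRecord("Post.content", "web", "GetPosts", "posts", datetime(2024, 1, 1, 9, 0)),
]


def summarize(records, field):
    """Aggregate all records for one field; returns None if it was never seen."""
    rows = [r for r in records if r.field == field]
    if not rows:
        return None
    return {
        "requests": len(rows),
        "clients": sorted({r.client for r in rows}),
        "operations": sorted({r.operation for r in rows}),
        "subgraphs": sorted({r.subgraph for r in rows}),
        "first_seen": min(r.seen_at for r in rows),
        "last_seen": max(r.seen_at for r in rows),
    }


summary = summarize(RECORDS, "Post.author")
print(summary)
```

Filtering the same records by client or operation before summarizing would give the per-client and per-operation breakdowns.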

What could these metrics tell us, anyway?

The first thing that jumps out from these numbers is that, in a real-world scenario, certain posts will always be more popular than others, so the same author gets looked up again and again across multiple posts. Those repeated lookups are redundant, and they can and will strain the Users subgraph and its backend. A simple solution could be to implement caching on the Users subgraph and cache author (User) data for the most popular posts, instead of retrieving it every single time.
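
Here is a minimal sketch of that caching idea, assuming an in-memory dictionary keyed by user id. The UsersBackend and CachedUserLookup classes, the sample users, and the sample posts are all invented for illustration; a real Users subgraph would more likely use a shared cache with expiry.

```python
class UsersBackend:
    """Stand-in for the Users subgraph's data source (hypothetical)."""

    def __init__(self):
        self.calls = 0  # how many times the backend was actually hit
        self._users = {
            "u1": {"id": "u1", "name": "Ada"},
            "u2": {"id": "u2", "name": "Linus"},
        }

    def fetch_user(self, user_id):
        self.calls += 1
        return self._users[user_id]


class CachedUserLookup:
    """Caches users by id so repeated author lookups skip the backend."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def get(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._backend.fetch_user(user_id)
        return self._cache[user_id]


# Four posts, but only two distinct authors.
POSTS = [
    {"id": "p1", "authorId": "u1"},
    {"id": "p2", "authorId": "u1"},
    {"id": "p3", "authorId": "u2"},
    {"id": "p4", "authorId": "u1"},
]

backend = UsersBackend()
lookup = CachedUserLookup(backend)
authors = [lookup.get(post["authorId"])["name"] for post in POSTS]

print(authors)
print("backend calls:", backend.calls)
```

The sketch never evicts or refreshes entries, so stale author data is a real risk; a time-to-live or explicit invalidation would be the first thing to add before using anything like this.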

Since Cosmo lets you filter field usage by client and operation, you might find that your mobile client predominantly accesses the content and author fields, while your analytics dashboard frequently retrieves likes and shares. Now, you can create specialized queries on each client, optimizing for speed and minimizing unnecessary data transfer. Field usage numbers here let you recognize unique requirements of each client type, and their unique field access patterns.

These metrics also show you exactly when a field was accessed over a 7-day retention period (free tier default; extensible), and this is useful in more ways than one. Historical usage data, of course, can be used to align caching strategies with predicted future demand, which means proactive infra scaling (up or down) to avoid bottlenecks during peaks.

But also, the timestamps provide a historical perspective on the adoption and usage patterns of the features each field represents. If you’re not seeing the expected usage rate for a certain field/feature, perhaps you need to reassess its relevance to user needs, its value proposition, or even its pricing/monetization strategy.
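
One simple way to act on this is to compare the fields your schema exposes against the usage counts for the retention window and flag anything below an expected floor. The sketch below does that with made-up numbers; the field list, the counts, and the threshold of 10 requests are assumptions, not data from Cosmo.

```python
# Hypothetical list of field coordinates exposed by the federated schema.
SCHEMA_FIELDS = [
    "Post.id",
    "Post.content",
    "Post.author",
    "Post.likes",
    "Post.shares",
    "User.id",
    "User.name",
]

# Made-up request counts for the retention window; missing keys mean zero requests.
USAGE_COUNTS = {
    "Post.id": 1200,
    "Post.content": 1150,
    "Post.author": 900,
    "Post.likes": 4,
    "User.name": 880,
}

MIN_EXPECTED_REQUESTS = 10  # assumed floor for "this field is being used"


def underused_fields(schema_fields, usage_counts, threshold):
    """Return fields whose request count falls below the threshold."""
    return sorted(
        field for field in schema_fields if usage_counts.get(field, 0) < threshold
    )


candidates = underused_fields(SCHEMA_FIELDS, USAGE_COUNTS, MIN_EXPECTED_REQUESTS)
print(candidates)
```

A field on this list is a prompt for a conversation rather than an automatic deprecation: low usage might mean the feature is new, seasonal, or simply not wired into a client yet.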

Simply put, engineers and stakeholders make better decisions on how to evolve the organization’s graphs when they have relevant data to back it up.

In Summary…

Field-level metrics, along with Cosmo’s suite of analytics, ultimately help organizations evolve their federated architecture and deliver a better, more valuable product.

In fact, with deeper insights into how fields are accessed over time, orgs can go beyond just performance optimization to answer questions like: Which fields are consistently accessed together, suggesting potential customization opportunities? Do usage patterns evolve over time? Can we identify underutilized fields for potential streamlining? And these insights inform content strategy, personalization efforts, and even the data model itself.

Of course, the Cosmo platform goes beyond metrics. It includes a blazingly fast, Federation V1/V2-compatible Router, visualizes data flow, automates deployments, and integrates seamlessly with your existing infrastructure; you could even migrate over from Apollo GraphOS with one click if you wanted to. And the whole stack is open source and completely self-hostable.