WunderGraph Cloud Waitlist
Before we get into the blog post. WunderGraph Cloud is being released very soon. We’re looking for Alpha and Beta testers for WunderGraph Cloud.
Testers will receive access to WunderGraph Cloud and 3 months Cloud Pro for free.
Analyzing public GraphQL APIs is a Series of blog posts to learn from big public GraphQL implementations, starting with Twitch.tv, the popular streaming platform.
We usually assume that GraphQL is just GraphQL. With REST, there's a lot of confusion what it actually is. Build a REST API and the first response you get is that someone says this is not really REST but just JSON over HTTP, etc...
But is this really exclusively a REST thing? Is there really just one way of doing GraphQL?
I've looked at many publicly available GraphQL APIs of companies whose name you're familiar with and analyzed how they "do GraphQL". I quickly realized that everybody does it a bit differently. With this series of posts, I want to extract good and bad patterns from large GraphQL production deployments.
At the end of the series, we'll conclude with a WhitePaper, summarizing all the best practices on how to run GraphQL in production. Make sure to sign up with our WhitePaper early access list. We'll keep you updated on the next post of this series and send you the WhitePaper once it's out.
I'm not using any special equipment to do this. You can use your preferred browser with the browser dev tools to follow along.
Let's dive into the first candidate: Twitch.tv
Analyzing the GraphQL API of Twitch.tv
The first thing you notice is that twitch hosts their GraphQL API on the subdomain
https://gql.twitch.tv/gql. Looking at the URL patterns and Headers, it seems that twitch is not versioning their API.
If you look at the Chrome Devtools or similar, you'll notice that for each new "route" on the website, multiple requests are being made to the gql subdomain. In my case, I can count 12 requests on the initial load of the site.
What's interesting is that these requests are being queued sequentially. Starting with the first one at 313ms, then 1.27s, 1.5s, 2.15s, ... , and the last one at 4.33s. One of the promises of GraphQL is to solve the Waterfall problem. However, this only works if all the data required for the website is available in a single GraphQL Operation.
In case of twitch, we've counted 12 requests, but we're not yet at the operation level. Twitch batches requests, but we'll come to that in a minute.
I've noticed another problem with the twitch API. It's using HTTP/1.1 for all requests, not HTTP/2. Why is it a problem? HTTP/2 multiplexes multiple Requests over a single TCP connection, HTTP/1.1 doesn't. You can see this if you look at the timings in Chrome DevTools. Most of the requests can (re-)use an existing TCP Connection, while others initiate a new one. Most of the requests have ~300ms latency while the ones with a connection init and TLS handshake clock in at around 430ms.
Now let's have a closer look at the requests itself. Twitch sends GraphQL Queries using HTTP POST. Their preferred Content-Encoding for Responses is gzip, they don't support brotli.
If you're not logged in, the client sends the Header "Authorization: undefined", which looks like a frontend glitch. Content-Type of the Request is "text/plain" although the payload is JSON.
Some of their requests are single GraphQL requests with a JSON Object. Others are using a batching mechanism, meaning, they send multiple Operations as an Array. The response also comes back as an Array as well, so the client then matches all batched operations to the same response index.
Here's an example of such a batch request:
Counting all GraphQL Operations for the initial Website load, I get at 74 Operations in total.
Here's a list of all Operations in order of appearance:
- Single 1 (1.2kb Response gzip)
- Batch 1 (5.9kb Response gzip)
- PersonalSections (different arguments)
- Batch 2 (0.7kb Response gzip)
- Batch 3 (20.4 Response gzip)
- Batch 4 (0.5kb Response gzip)
- Batch 5 (15.7kb Response gzip)
- Batch 6 (2kb Response gzip)
- Batch 7 (1.1kb Response gzip)
- Batch 8 (1.5kb Response gzip)
- Batch 9 (1.0kb Response gzip)
- Batch 10 (1.3kb Response gzip)
- Batch 11 (11.7kb Response gzip)
All responses cumulated clock in at 63kb gzipped.
Note that all of these Requests are HTTP POST and therefore don't make any use of Cache-Control Headers. The batch requests use transfer-encoding chunked.
However, on subsequent routes, there seems to be some client-side caching happening. If I change the route to another channel, I can only count 69 GraphQL Operations.
Another observation I can make is that twitch uses APQ, Automatic Persisted Queries. On the first request, the client sends the complete Query to the server. The server then uses the "extends" field on the response object to tell the client the Persisted Operation Hash. Subsequent client requests will then omit the Query payload and instead just send the Hash of the Persisted Operation. This saves bandwidth for subsequent requests.
Looking at the Batch Requests, it seems that the "registration" of Operations happens at build time. So there's no initial registration step. The client only sends the Operation Name as well the Query Hash using the extensions field in the JSON request. (see the example request from above)
Next, I've tried to use Postman to talk to the GraphQL Endpoint.
The first response I've got was a 400, Bad Request.
I've copy-pasted the Client-ID from Chrome Devtools to solve the "problem".
I then wanted to explore their schema. Unfortunately, I wasn't able to use the Introspection Query, it seems to be silently blocked.
However, you could still easily extract the schema from their API using a popular exploit of the graphql-js library.
If you send the following Query:
You'll get this response:
Using these suggestions, we're able to reconstruct the Schema. I don't really think this is a security risk though. They are storing all GraphQL Queries in the client, and their API is public.
Finally, I've tried to figure out how their chat works and if they are using GraphQL Subscriptions as well. Switching the Chrome Dev Tools view to "WS" (WebSocket) shows us two WebSocket connections.
One is hosted on the URL
wss://pubsub-edge.twitch.tv/v1. It seems to be using versioning, or at least they expect to version this API. Looking at the messages going back and forth between client and server, I can say that the communication protocol is not GraphQL. The information exchanged over this connection is mainly around video playback, server time and view count, so it's keeping the player information in sync.
The second WebSocket connection connects to this URL:
wss://irc-ws.chat.twitch.tv/ IRC stands for "Internet Relay Chat". I can only assume that this WebSocket connection is a bridge to an IRC server which hosts all the chats for twitch. The protocol is also not GraphQL. Here's an example message:
Let's start with the things that surprised me the most.
HTTP 1.1 vs. HTTP2 - GraphQL Request Batching
If you need to run more than 70 GraphQL Operations, it's obvious that you have to implement some sort of optimizations to handle the load when there could be hundreds of thousands or even millions of viewers per channel.
Batching can be achieved in different ways. One way of batching leverages the HTTP protocol, but batching is also possible in the application layer itself.
Batching has the advantage that it can reduce the number of HTTP requests. In case of twitch, they are batching their 70+ Operations over 12 HTTP requests. Without batching, the Waterfall could be even more extreme. So, it's a very good solution to reduce the number of Requests.
However, batching in the application layer also has its downsides. If you batch 20 Operations into one single Request, you always have to wait for all Operations to resolve before the first byte of the response can be sent to the client. If a single resolver is slow or times out, I assume there are timeouts, all other Operations must wait for the timeout until the responses can be delivered to the client.
Another downside is that batch requests almost always defeat the possibility of HTTP caching. As the API from twitch uses HTTP POST for READ (Query) requests, this option is already gone though.
Additionally, batching can also lead to a slower perceived user experience. A small response can be parsed and processed very quickly by a client. A large response with 20+ kb of gzipped JSON takes longer to parse, leading to longer processing times until the data can be presented in the UI.
So, batching can reduce network latency, but it's not free.
Another way of batching makes use of HTTP/2. It's a very elegant way and almost invisible.
HTTP/2 allows browsers to send hundreds of individual HTTP Requests over the same TCP connection. Additionally, the protocol implements Header Compression, which means that client and server can build a dictionary of words in addition to some well known terms to reduce the size of Headers dramatically.
This means, if you're using HTTP/2 for your API, there's no real benefit of "batching at the application layer".
The opposite is actually the case, "batching" over HTTP/2 comes with big advantages over HTTP/1.1 application layer batching.
First, you don't have to wait for all Requests to finish or time out. Each individual request can return a small portion of the required data, which the client can then render immediately.
Second, serving READ Requests over HTTP GET allows for some extra optimizations. You're able to use Cache-Control Headers as well as ETags. Let's discuss these in the next section.
HTTP POST, the wrong way of doing READ requests
Twitch is sending all of their GraphQL Requests over HTTP/1.1 POST. I've investigated the payloads and found out that many of the Requests are loading public data that uses the current channel as a variable. This data seems to be always the same, for all users.
In a high-traffic scenario where millions of users are watching a game, I'd assume that thousands of watchers will continually leave and join the same channel. With HTTP POST and no Cache-Control or ETag Headers, all these Requests will hit the origin server. Depending on the complexity of the backend, this could actually work, e.g. with a REST API and an in memory database.
However, these POST Requests hit the origin server which then executes the persisted GraphQL Operations. This can only work with thousands of servers, combined with a well-defined Resolver architecture using the Data-Loader pattern and application-side caching, e.g. using Redis.
I've looked into the Response timings, and they are coming back quite fast! So, the twitch engineers must have done a few things quite well to handle this kind of load with such a low latency.
Let's assume that twitch used HTTP GET Requests for Queries over HTTP/2. Even with a MaxAge of just 1 second, we'd be able to use a CDN like Cloudflare which could turn 50k "channel joins" into a single Request. Reducing 50k RPS hitting the GraphQL origin can result in a dramatic cost reduction, and we're just talking about a single twitch channel.
However, this is not yet the end of the story. If we add ETags to our environment, we can reduce the load even further. With ETags, the browser can send an "If-None-Match" Header with the value received from a previous network Request. If the response did not change, and therefore the ETag also didn't change, the server simply returns a 304 Not Modified response without a body.
So, if not much has changed when hopping between channels, we're able to save most of the 60kb gzipped JSON per channel switch.
Keep in mind that this is only possible if we don't do batching at the application layer. The larger the batch, the smaller the likelyhood that an ETag for a whole batch doesn't change.
As you've learned, using HTTP/2 with GET for READS can reduce the load on the origin as well as reduce the bandwidth to load the website. For those watching twitch from their mobile or on a low bandwidth connection, this could make the difference.
Does GraphQL really solve the Waterfall problem?
One of my pet peeves is when developers glorify GraphQL. One of these glorifications is that GraphQL solves the Waterfall problem of REST APIs.
I've read it in many blog posts on GraphQL vs REST that the Query language allows you to Query all the data in one single Request and solves the Waterfall problem this way.
Then tell me why the engineers decided to send 70 GraphQL Operations over 12 batch requests with a Waterfall of more than 4 seconds? Don't they understand the capabilities of GraphQL? Why do they use GraphQL if they still fall into the same traps as with REST APIs?
The reality is, it's probably not a single team of 3 Frontend Developers and 2 Backend Developers who develop the website.
If you were a single developer who builds a simple blog, you're probably able to Request all the data you need in a single GraphQL Request. Clients like Relay can help achieve this goal.
However, I think every larger (not all) batch Request can be understood as a pointer to Conway's Law .
Different parts of the website could be implemented by different teams. Each component, e.g. the Chat, has some specific Operations which are batched together.
Obviously, these are just assumptions, but I want to be fair and not judge their implementation only by looking at it from the outside.
In terms of the Waterfall problem, GraphQL doesn't really solve it for twitch. That said, I don't think this is their biggest issue. I just wanted to point out that it's not always possible to leverage technologies to their full extend if organizational structures don't allow for it.
If you want to improve the architecture of your application, look at the organization first.
Two teams will probably build a two-step compiler. The teams will probably build an application with three big batch requests. If you want to optimize how individual parts of your application communicate, think about the communication within your company first.
APQ - Automatic Persisted Queries, are they worth it?
With APQ, GraphQL Operations will be stored on the server to reduce bandwidth and increase performance. Instead of sending the complete Query, the client only sends the Hash of the registered Operation. There's an example above.
While APQ reduce the Request size slightly, we've already learned that they don't help with the Response size as ETags do.
On the server-side, most implementations don't really optimize. They look up the Operation from a dictionary, parse and execute it. The operation will not be pre-processes or anything.
The twitch GraphQL API allows you to send arbitrary, non-persisted, Operations as well, so they are not using APQ as a security mechanism.
My personal opinion is that APQ add complexity without much benefit.
Disabling introspection without fixing the recommendations bug
I don't want to deep dive into security in this post, so this is just a quick note on disabling introspection.
In general, it could make sense to disable introspection to not allow every API user to explore your GraphQL Schema. The schema might leak sensitive information. That said, there's a problem with some implementations, like the graphql-js reference implementation, that leak Schema information even with introspection disabled.
If your implementation uses these suggestions, and you want to disable introspection entirely, make sure to tackle this problem. We'll discuss a solution in the suggestions section of this post.
Should you use GraphQL Subscriptions for Realtime Updates?
GraphQL Subscriptions allow you to stream updates to the client using the Query Language. Twitch is not leveraging this feature though.
In terms of the Chat, it looks like they are using IRC underneath. They've probably started using it before they looked at GraphQL. Wrapping this implementation with GraphQL Subscriptions might not add any extra benefits.
It would obviously be a lot cleaner if all the traffic was handled by GraphQL but making the switch might not be worth it.
One thing to keep in mind is that twitch is using WebSockets for Realtime updates. I've tackled this topic in another blog post, the gist is that WebSockets are a terrible solution for Realtime Updates for many reasons. As an alternative, I suggest using HTTP/2 streams.
That's enough for the discussion. Next, I'll share some of my recommendations on how you can build production-grade GraphQL APIs using the twitch API as an example.
READ Requests should always use HTTP GET over HTTP/2
READ Requests or GraphQL Queries should always use HTTP GET Requests over HTTP/2. This solves almost all problems I've described above.
With this in place, there's no need to do application layer batching.
How can you achieve this?
For each GraphQL Operation that you define within your application, create a dedicated JSON API Endpoint and make your API client use GET requests for Queries, variables can be sent using a Query parameter.
For each Endpoint, you can then add specific Cache-Control configurations, and a middleware to handle ETags to improve performance for individual operations without sacrificing a good User Experience.
You might be thinking that this adds complexity to your application. Keeping client and server in sync might be complicated. Doesn't this break all of the existing GraphQL clients?
Yes, it does add complexity. It doesn't just break existing clients, it's against everything you've probably heard about GraphQL.
Yet, it makes so much sense to leverage HTTP to its full extend, allow Browsers to do their Job as well as Proxies and CDNs. They all understand Cache-Control Headers and ETags, let them do their work!
But please, without the additional complexity. At least, that's what we thought, so we solved this problem, the solution is way too simple.
First, define all the Operations you need for your application, just like the twitch engineers did. WunderGraph then generates a GraphQL Gateway that exposes a secure JSON RPC API. Additionally, we generate a type-safe API client / SDK in any language so that you can easily "call" into your pre-defined Operations.
This setup uses HTTP/2 and leverages all the capabilities of Browsers, CDNs and Proxies. Because we're not talking GraphQL over the wire, it also increases security. Introspection leaks? Impossible. Denial of Service attacks using complex Queries? Impossible.
You're still defining GraphQL Operations, it still feels like GraphQL, it's just not sending Queries over POST Requests.
APQ < Compiled Operations
Automatic Persisted Queries are a good idea to improve performance, however, they are not really thought out well.
Looking up a persisted Operation in a hashmap to then parse and execute them still means you're "interpreting" with all its downsides.
With WunderGraph we're going a different route. When you define an Operation, we're actually validating and compiling it into extremely efficient code, at runtime.
When executing a pre-defined Operation in WunderGraph, all we do is to insert the variables and then execute a tree of operations. There's no parsing and validation happening at runtime.
WunderGraph works like a database with prepared statements, it's just not using tables as storage but talks to APIs.
This way, we're adding almost no overhead at runtime. Instead, with the ETag & Caching middlewares, we can easily speed up your GraphQL APIs.
Subscriptions over HTTP/2 Streams
We've linked another post above outlining the problems with WebSockets . In a nutshell, WebSockets are stateful, make authentication complicated and require an extra TCP connection per socket.
To solve this issue for you, both WunderGraph client and server implement Subscriptions and Realtime Streams over HTTP/2.
We're fully compatible to "standard" GraphQL Subscription implementations using WebSockets, when talking to your origins though. We'll just hide these behind our secure JSON RPC API, streaming responses to clients over HTTP/2.
This way, your Subscriptions are kept stateless and authentication is properly handled for you. Another problem you don't have to solve.
I hope this new series helps you see through glorified blog posts, and you realize that reality looks differently.
I think it needs a standard to run GraphQL in production. If you follow this series, you'll realize that all big players do it differently. It's really inefficient if every company tries to find their own ways of building their API infrastructure.
That's why we're here! We're establishing this standard. We can give you a tool that lets you leverage all the best practices you'll discover in this series. Ask yourself if solving all these problems is the core domain of your business. Your answer should be "no", otherwise you're probably an API or Dev-Tool vendor.
If you need help with your GraphQL implementation, please get in touch!
If you liked this new series, make sure to sign up with the WhitePaper or follow us on Twitter. Feel free suggest another API which we should analyze.
By the way, if you're working at twitch, we'd love to talk to you and get some more insights on the internals of your GraphQL API.