What does observability mean in Cosmo Router?

Cosmo Router emits three signals: distributed traces, Prometheus metrics, and access logs. Each signal is GraphQL-aware: spans, metrics, and log fields carry operation name, operation type, subgraph identity, and client information. The same data can be viewed in Cosmo Studio or exported to your own observability stack.

Is OpenTelemetry supported natively?

Yes. The router ships with built-in OpenTelemetry instrumentation that exports traces and metrics over HTTP or gRPC. W3C Trace Context propagates by default; Jaeger, B3, and Baggage are available as options. No additional SDK or instrumentation library is required.
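To make the default propagation format concrete, here is a minimal sketch of reading a W3C Trace Context traceparent header in a downstream service. It handles only the version 00 layout from the W3C specification (version, trace ID, parent ID, flags). The function and class names are illustrative and are not part of the router.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00 layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A subgraph that logs the trace_id returned here can be correlated with the router span that carries the same ID.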

Can I send telemetry to multiple backends?

Yes. Configure multiple exporters in the router itself, or run an OpenTelemetry Collector as a central hub that routes data to Cosmo Cloud, Datadog, Jaeger, Prometheus, and any other OTEL-compatible backend from one place.

What Prometheus metrics does the router expose?

Rate, errors, and duration for both router and subgraph requests, following the R.E.D. method. The default endpoint is http://localhost:8088/metrics. Core series include router_http_requests_total, router_http_request_duration_milliseconds, and router_http_requests_error_total, with dimensions for operation name, operation type, client, subgraph, and HTTP status code.
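As a rough illustration of how the request and error counters relate, the sketch below sums both series from a scrape of the metrics endpoint and derives an error ratio. It is a simplified reader of the Prometheus text format, not a full parser, and the helper names are hypothetical. Only the two metric names come from the list above.

```python
from collections import defaultdict
from typing import Dict


def sum_series(exposition: str) -> Dict[str, float]:
    """Sum sample values per metric name across all label sets."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            # Everything after the closing brace is "value [timestamp]".
            fields = line[line.rindex("}") + 1 :].split()
        else:
            name, *fields = line.split()
        totals[name] += float(fields[0])
    return dict(totals)


def error_ratio(exposition: str) -> float:
    """Errors divided by requests, or 0.0 when no requests were counted."""
    totals = sum_series(exposition)
    requests = totals.get("router_http_requests_total", 0.0)
    errors = totals.get("router_http_requests_error_total", 0.0)
    return errors / requests if requests else 0.0
```

Both series are counters, so this ratio covers the whole lifetime of the process. For a rate over a time window, compare two scrapes or let Prometheus compute it with rate().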

How do I debug a slow federated query?

Use Distributed Tracing to see the full request path across subgraphs in Cosmo Studio. For deeper visibility, enable Advanced Request Tracing per request via the X-WG-Trace header, which returns the execution plan, fetch types, and timing in the GraphQL response extensions.
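A per-request trace could be requested roughly as follows. This is a sketch under stated assumptions: the header value "true", the extensions key "trace", and any endpoint URL you pass in are not confirmed by the text above, so check them against the router documentation before relying on them.

```python
import json
import urllib.request
from typing import Any, Optional


def build_traced_request(url: str, query: str) -> urllib.request.Request:
    """Build a GraphQL POST request that asks for Advanced Request Tracing."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Assumption: the feature is switched on with the value "true".
    request.add_header("X-WG-Trace", "true")
    return request


def extract_trace(response_body: bytes) -> Optional[Any]:
    """Pull the trace out of the GraphQL response extensions, if present."""
    payload = json.loads(response_body)
    # Assumption: the trace is returned under extensions.trace.
    return (payload.get("extensions") or {}).get("trace")
```

To use it, pass the request to urllib.request.urlopen and hand the response body to extract_trace.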

Is observability available on the free plan?

Mostly. OpenTelemetry, Prometheus Metrics, Grafana Integration, OTEL Collector Integration, Access Logs, Advanced Request Tracing, and Profiling are available on every plan: Free, Pro, and Enterprise. Distributed Tracing in Cosmo Studio requires Pro or Enterprise.

How do I diagnose CPU hotspots, memory leaks, or deadlocks?

Set the PPROF_ADDR environment variable to enable Go pprof endpoints on the router. You can then capture heap, CPU, goroutine, block, and thread-creation profiles and visualize them with go tool pprof or any Go-compatible profiling viewer.

Cosmo Observability

See every span, metric, and log in your federated graph

OpenTelemetry-native traces, Prometheus metrics, structured access logs, and pprof, built into the Cosmo Router. Every signal carries GraphQL operation, subgraph, and client context out of the box.

Start Free Read the Docs

OpenTelemetry-native. Exports to any OTEL-compatible backend.

Overview

What Cosmo Observability is

Cosmo Observability is the telemetry layer of the Cosmo Router: distributed traces, Prometheus metrics, structured access logs, and Go pprof endpoints. Every signal is GraphQL-aware: spans, metrics, and log fields carry the operation name, operation type, subgraph identity, and client information that matter for federated APIs.

The implementation is OpenTelemetry-native. The router uses the OTEL Go SDK to generate traces and metrics and exports them over HTTP or gRPC to any OTEL-compatible backend: Cosmo Cloud, Jaeger, Datadog, Prometheus, or an OpenTelemetry Collector that routes the data further.

Why GraphQL-aware observability matters

Why teams choose federation-native telemetry

Generic HTTP telemetry treats every GraphQL request the same. In a federated graph, the labels and spans that matter (operation, subgraph, client, fetch type) get lost. The result: long incident response times and metric dashboards that cannot answer simple questions.

Teams running federated GraphQL through generic tooling tend to hit the same four walls.

Federated requests are invisible across services.

One query can touch dozens of subgraphs. Without correlated trace context, teams spend hours per incident scanning logs across services to find the failing one.

Generic HTTP metrics miss GraphQL.

Operation name, operation type, subgraph, and client are the labels that matter. Stock router metrics rarely carry any of them.

DIY instrumentation rots.

Building custom OTEL spans, label schemes, and exporters in every service is its own maintenance burden, and the labels drift over time as teams change.

High-cardinality metrics explode.

Without exclusion patterns or a cardinality limit, GraphQL operation labels can overwhelm a monitoring backend. Then teams strip the dimensions that made the metrics useful in the first place.

Cosmo Router handles all of this natively. Traces, metrics, logs, and profiles in one binary, one config.

Cosmo Observability capabilities

OpenTelemetry (OTEL)

Native OTEL instrumentation for traces and metrics. Export over HTTP or gRPC to any OTEL-compatible backend. W3C Trace Context by default, with optional Jaeger, B3, and Baggage propagation.

Free / Pro / Enterprise

OTEL Collector Integration

Run an OpenTelemetry Collector as a single export hub. The router sends to one endpoint; the Collector routes traces and metrics to Cosmo Cloud, Jaeger, Prometheus, and any other OTEL-compatible backend.

Free / Pro / Enterprise

Which GraphQL observability capability do you need?

If you are…	Start here
Standing up GraphQL observability from scratch	OpenTelemetry
Debugging a slow or failing federated query in production	Distributed Tracing
Understanding the exact execution plan for a specific query	Advanced Request Tracing
Monitoring p99 latency, error rate, and request volume	Prometheus Metrics
Wanting production dashboards working in under 30 minutes	Grafana Integration
Routing traces and metrics to multiple backends from one place	OTEL Collector Integration
Capturing per-request logs with GraphQL operation context	Access Logs
Diagnosing CPU hotspots, memory leaks, or goroutine blocks	Profiling (pprof)

How Cosmo Observability compares

	Cosmo Observability	Apollo Router	DIY instrumentation
OpenTelemetry native	Yes	Partial	Manual
GraphQL-aware metric dimensions	Native	Limited	Manual
Multi-exporter	Yes	Limited	Custom
Built-in cardinality controls	Yes (default limit: 2,000 per metric)	Manual	Manual
Pre-built Grafana dashboards	Cache, Go runtime	N/A	Self-built
Go pprof exposed	Yes (on-demand)	N/A (Rust)	Varies

Use cases

GraphQL observability use cases

Real debugging, monitoring, and performance patterns, and the Cosmo capability behind each one.

Incident response

Checkout starts returning errors at peak traffic

Scenario

A critical checkout API returns intermittent errors during peak load. The team needs to find the failing subgraph fast.

How Cosmo handles it

Filter traces in Cosmo Studio by error status. The span tree shows the inventory subgraph timing out, and the span details carry the error message, extension codes, and stack trace. The auto-refreshing trace view updates every 10 seconds while the incident is live.

Outcome

Root cause identified in five minutes instead of two hours: database connection exhaustion in the inventory subgraph.

SLO monitoring

Track p99 latency for a critical operation against an SLO

Scenario

The platform team needs an alert when p99 latency on the most important GraphQL operation drifts above its SLO budget.

How Cosmo handles it

Query router_http_request_duration_milliseconds with histogram_quantile(), filtered by the wg_operation_name label. Wire the result into Alertmanager.

Outcome

Automated alerting fires the moment p99 latency exceeds the SLO threshold. No custom instrumentation, no schema changes.
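The p99 query in this use case might look like the expression built below, followed by a simplified re-implementation of the linear interpolation that histogram_quantile() performs, to show where the number comes from. The _bucket suffix assumes the router exposes the duration metric as a classic Prometheus histogram. The operation name, window, and helper names are placeholders.

```python
import math
from typing import List, Tuple


def p99_query(operation: str, window: str = "5m") -> str:
    """PromQL for the p99 latency of one operation (assumes a classic histogram)."""
    selector = f'{{wg_operation_name="{operation}"}}[{window}]'
    series = "router_http_request_duration_milliseconds_bucket" + selector
    return f"histogram_quantile(0.99, sum by (le) (rate({series})))"


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must end with the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the estimate is interpolated within a bucket, its precision depends on how closely the bucket boundaries bracket the SLO threshold.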

Multi-backend

Send telemetry to Cosmo Cloud and an existing observability stack

Scenario

An organization uses Cosmo Cloud for GraphQL analytics and Datadog for company-wide dashboards. They don't want a duplicate exporter in every service.

How Cosmo handles it

Configure two exporters in the router, or run an OpenTelemetry Collector as a single intermediary with two pipelines: one to Cosmo Cloud, one to Datadog. Either path needs zero application code changes.

Outcome

One router configuration. Data flows to both platforms automatically. Credentials and protocol translation stay centralized.

Performance

Diagnose memory growth in a long-running router

Scenario

Router instances grow heap usage over several days and need periodic restarts. Metrics show the symptom; the team needs the cause.

How Cosmo handles it

Enable pprof with PPROF_ADDR=:6060, capture heap profiles at intervals, and compare them with go tool pprof in diff mode to see which allocations grow.

Outcome

A subscription handler not releasing resources is identified directly from the heap profile. The fix eliminates the memory growth pattern.
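The capture-and-compare loop from this use case can be scripted along these lines. The sketch assumes the router serves the standard Go net/http/pprof path /debug/pprof/heap on the PPROF_ADDR port. The helper names and the output file naming are illustrative.

```python
import time
import urllib.request
from pathlib import Path
from typing import List


def heap_profile_url(addr: str = "localhost:6060") -> str:
    # Assumption: standard net/http/pprof path on the PPROF_ADDR port.
    return f"http://{addr}/debug/pprof/heap"


def capture_heap_profile(out_dir: Path, addr: str = "localhost:6060") -> Path:
    """Download one heap profile and store it under a timestamped name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"heap-{int(time.time())}.pb.gz"
    with urllib.request.urlopen(heap_profile_url(addr), timeout=30) as response:
        target.write_bytes(response.read())
    return target


def diff_command(base: Path, latest: Path) -> List[str]:
    """Command line that shows what grew between two heap profiles."""
    return ["go", "tool", "pprof", f"-diff_base={base}", str(latest)]
```

Calling capture_heap_profile a few hours apart and running the command from diff_command on the first and last files shows which allocations grew between the two snapshots.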

Why teams run Cosmo Observability

OpenTelemetry-native, no vendor lock-in. Native OTEL SDK, W3C Trace Context propagation, export over HTTP or gRPC to any OTEL-compatible backend.
GraphQL-aware dimensions on every signal. Operation name, operation type, subgraph, and client labels on metrics. Span attributes carry the same context. Access logs include operation hash and per-stage timing.
One config exports to many backends. Configure multiple exporters in the router, or run one OTEL Collector pipeline that fans out to Cosmo Cloud, Jaeger, Prometheus, Datadog, and beyond.

Observability FAQ

Common questions about traces, metrics, logs, and profiling on the Cosmo Router.

Get started

Run federated GraphQL with full observability on the Cosmo Router
