Cosmo Observability

See every span, metric, and log in your federated graph

OpenTelemetry-native traces, Prometheus metrics, structured access logs, and pprof, built into the Cosmo Router. Every signal carries GraphQL operation, subgraph, and client context out of the box.

OpenTelemetry-native. Exports to any OTEL-compatible backend.

Overview

What Cosmo Observability is

Cosmo Observability is the telemetry layer of the Cosmo Router: distributed traces, Prometheus metrics, structured access logs, and Go pprof endpoints. Every signal is GraphQL-aware: spans, metrics, and log fields carry the operation name, operation type, subgraph identity, and client information that matter for federated APIs.

The implementation is OpenTelemetry-native. The router uses the OTEL Go SDK to generate traces and metrics and exports them over HTTP or gRPC to any OTEL-compatible backend: Cosmo Cloud, Jaeger, Datadog, Prometheus, or an OpenTelemetry Collector that routes the data further.

Why GraphQL-aware observability matters

Why teams choose federation-native telemetry

Generic HTTP telemetry treats every GraphQL request the same. In a federated graph, the labels and spans that matter (operation, subgraph, client, fetch type) get lost. The result: slow incident response and dashboards that cannot answer basic questions, such as which operation or subgraph is failing.

Teams running federated GraphQL through generic tooling tend to hit the same four walls.

Federated requests are invisible across services.

One query can touch dozens of subgraphs. Without correlated trace context, teams spend hours per incident scanning logs across services to find the failing one.

Generic HTTP metrics miss GraphQL.

Operation name, operation type, subgraph, and client are the labels that matter. Stock router metrics rarely carry any of them.

DIY instrumentation rots.

Building custom OTEL spans, label schemes, and exporters in every service is its own maintenance burden, and the labels drift over time as teams change.

High-cardinality metrics explode.

Without exclusion patterns or a cardinality limit, GraphQL operation labels can overwhelm a monitoring backend. Then teams strip the dimensions that made the metrics useful in the first place.

Cosmo Router handles all of this natively. Traces, metrics, logs, and profiles in one binary, one config.
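To make the cardinality problem above concrete, here is a minimal Python sketch of a per-metric cardinality cap. It is an illustration of the general technique, not the router's implementation: the `CappedCounter` class and the `overflow` label are hypothetical names, and the default of 2000 simply mirrors the figure quoted in the comparison table on this page.

```python
class CappedCounter:
    """Counter that keeps at most `limit` distinct label sets (hypothetical helper)."""

    # Illustrative catch-all series for label sets that arrive after the cap is hit.
    OVERFLOW = (("overflow", "true"),)

    def __init__(self, limit=2000):
        self.limit = limit
        self.series = {}  # label-set key -> running total

    def add(self, labels, value=1):
        """Record `value` for `labels`, folding new label sets into OVERFLOW once full."""
        key = tuple(sorted(labels.items()))
        if key not in self.series and self._distinct() >= self.limit:
            key = self.OVERFLOW
        self.series[key] = self.series.get(key, 0) + value

    def _distinct(self):
        # Count real label sets only; the overflow series does not use up the budget.
        return len(self.series) - (1 if self.OVERFLOW in self.series else 0)
```

With a cap in place, existing series keep their full dimensions and only late-arriving label sets lose detail, which is the trade-off that lets teams keep operation-level labels without overwhelming the backend.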

Cosmo Observability capabilities

OpenTelemetry (OTEL)

Native OTEL instrumentation for traces and metrics. Export over HTTP or gRPC to any OTEL-compatible backend. W3C Trace Context by default, with optional Jaeger, B3, and Baggage propagation.

Free / Pro / Enterprise
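For readers unfamiliar with W3C Trace Context, the sketch below shows what the propagated `traceparent` header looks like and how its fields can be read. It is a minimal Python illustration that handles only the version `00` layout defined by the W3C specification; the function name `parse_traceparent` is hypothetical and is not part of the router.

```python
import re

# Version-00 layout: version, trace-id, parent-id, trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the fields of a version-00 traceparent header, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch handles version 00 only; all-zero IDs are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of trace-flags is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A subgraph that receives this header and continues the same `trace_id` is what lets one federated request appear as a single span tree.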

OTEL Collector Integration

Run an OpenTelemetry Collector as a single export hub. The router sends to one endpoint; the Collector routes traces and metrics to Cosmo Cloud, Jaeger, Prometheus, and any other OTEL-compatible backend.

Free / Pro / Enterprise

Which GraphQL observability capability do you need?

If you are… | Start here
Standing up GraphQL observability from scratch | OpenTelemetry
Debugging a slow or failing federated query in production | Distributed Tracing
Understanding the exact execution plan for a specific query | Advanced Request Tracing
Monitoring p99 latency, error rate, and request volume | Prometheus Metrics
Trying to get production dashboards working in under 30 minutes | Grafana Integration
Routing traces and metrics to multiple backends from one place | OTEL Collector Integration
Capturing per-request logs with GraphQL operation context | Access Logs
Diagnosing CPU hotspots, memory leaks, or goroutine blocks | Profiling (pprof)

How Cosmo Observability compares

Capability | Cosmo Observability | Apollo Router | DIY instrumentation
OpenTelemetry native | Yes | Partial | Manual
GraphQL-aware metric dimensions | Native | Limited | Manual
Multi-exporter | Yes | Limited | Custom
Built-in cardinality controls | Yes (default 2000 per metric) | Manual | Manual
Pre-built Grafana dashboards | Cache, Go runtime | N/A | Self-built
Go pprof exposed | Yes (on-demand) | N/A (Rust) | Varies

Use cases

GraphQL observability use cases

Real debugging, monitoring, and performance patterns, and the Cosmo capability behind each one.

Incident response

Checkout starts returning errors at peak traffic

Scenario

A critical checkout API returns intermittent errors during peak load. The team needs to find the failing subgraph fast.

How Cosmo handles it

Filter traces in Cosmo Studio by error status. The span tree shows the inventory subgraph timing out, and the span details carry the error message, extension codes, and stack trace. The auto-refreshing trace view updates every 10 seconds while the incident is live.

Outcome

Root cause identified in five minutes instead of two hours: database connection exhaustion in the inventory subgraph.

SLO monitoring

Track p99 latency for a critical operation against an SLO

Scenario

The platform team needs an alert when p99 latency on the most important GraphQL operation drifts above its SLO budget.

How Cosmo handles it

Query the router_http_request_duration_milliseconds histogram with histogram_quantile(), filtered by the wg_operation_name label. Define a Prometheus alerting rule on the result and route it through Alertmanager.

Outcome

Automated alerting fires the moment p99 latency exceeds the SLO threshold. No custom instrumentation, no schema changes.
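The SLO query in this use case can be sketched as follows. This minimal Python example builds the PromQL expression and a Prometheus instant-query URL. Several details are assumptions rather than facts from this page: the `_bucket` suffix (the usual Prometheus histogram convention), the 5-minute rate window, the `Checkout` operation name, and the `localhost:9090` Prometheus address. The helper names are hypothetical.

```python
from urllib.parse import urlencode

METRIC = "router_http_request_duration_milliseconds"


def p99_query(operation, window="5m"):
    """Build a PromQL p99 expression for one GraphQL operation.

    Assumes the histogram is exposed with the conventional `_bucket` suffix.
    """
    selector = f'{METRIC}_bucket{{wg_operation_name="{operation}"}}'
    return f"histogram_quantile(0.99, sum by (le) (rate({selector}[{window}])))"


def instant_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL for the expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    query = p99_query("Checkout")  # hypothetical operation name
    print(query)
    print(instant_query_url("http://localhost:9090", query))
```

In an alerting rule, the same expression would be compared against the SLO threshold in milliseconds, for example by appending `> 500` for a hypothetical 500 ms budget.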

Multi-backend

Send telemetry to Cosmo Cloud and an existing observability stack

Scenario

An organization uses Cosmo Cloud for GraphQL analytics and Datadog for company-wide dashboards. They don't want a duplicate exporter in every service.

How Cosmo handles it

Configure two exporters in the router, or run an OpenTelemetry Collector as a single intermediary with two pipelines: one to Cosmo Cloud, one to Datadog. Either path needs zero application code changes.

Outcome

One router configuration. Data flows to both platforms automatically. Credentials and protocol translation stay centralized.

Performance

Diagnose memory growth in a long-running router

Scenario

Router instances grow heap usage over several days and need periodic restarts. Metrics show the symptom; the team needs the cause.

How Cosmo handles it

Enable pprof with PPROF_ADDR=:6060, capture heap profiles at intervals, and compare them with go tool pprof in diff mode to see which allocations grow.

Outcome

A subscription handler not releasing resources is identified directly from the heap profile. The fix eliminates the memory growth pattern.
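The capture step in this use case can be scripted. The Python sketch below saves heap profiles at a fixed interval so they can be diffed later. It assumes the router serves the standard Go `net/http/pprof` path `/debug/pprof/heap` on the address given in `PPROF_ADDR`; the function names, file naming scheme, and one-hour interval are hypothetical choices for illustration.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def heap_url(addr="localhost:6060"):
    """URL of the heap profile, assuming the standard net/http/pprof path."""
    return f"http://{addr}/debug/pprof/heap"


def capture_heap_profiles(out_dir, count=2, interval_s=3600,
                          addr="localhost:6060", fetch=None, sleep=time.sleep):
    """Save `count` heap profiles, waiting `interval_s` seconds between captures.

    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=30).read()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        if i > 0:
            sleep(interval_s)
        path = out / f"heap-{i:03d}.pb.gz"
        path.write_bytes(fetch(heap_url(addr)))
        paths.append(path)
    return paths
```

Two captured files can then be compared with `go tool pprof -diff_base=heap-000.pb.gz heap-001.pb.gz`, which shows the allocations that grew between the two snapshots.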

Why teams run Cosmo Observability

  • OpenTelemetry-native, no vendor lock-in. Native OTEL SDK, W3C Trace Context propagation, export over HTTP or gRPC to any OTEL-compatible backend.
  • GraphQL-aware dimensions on every signal. Operation name, operation type, subgraph, and client labels on metrics. Span attributes carry the same context. Access logs include operation hash and per-stage timing.
  • One config exports to many backends. Configure multiple exporters in the router, or run one OTEL Collector pipeline that fans out to Cosmo Cloud, Jaeger, Prometheus, Datadog, and beyond.

Observability FAQ

Common questions about traces, metrics, logs, and profiling on the Cosmo Router.

Get started

Run federated GraphQL with full observability on the Cosmo Router