Observability · Prometheus Metrics

R.E.D. metrics for every GraphQL operation and subgraph

The Cosmo Router emits rate, errors, and duration metrics with GraphQL-aware labels. Default endpoint: http://localhost:8088/metrics. Built-in cardinality cap of 2000 label combinations per metric.

Built into the router. OpenTelemetry foundation. Core R.E.D. metrics on the Prometheus endpoint and via OTEL export at the same time.

/metrics on :8088

Cosmo Router (:8088/metrics) → Prometheus (scrape · 15s) → Grafana (dashboards) and Alertmanager (SLO alerts)

Available on: Free, Pro, Enterprise

The problem

Generic metrics cannot answer GraphQL questions

Operations teams need per-operation, per-subgraph, per-client metrics. Generic HTTP series do not carry those dimensions.

Generic HTTP metrics miss GraphQL

Stock router metrics report request counts and durations but rarely carry the dimensions that matter: operation name, operation type, client, and subgraph.

Subgraph latency is invisible

When a federated query slows down, the only metric is the aggregate request duration. Without per-subgraph series, you cannot tell which service is causing the spike.

Cardinality grows until something breaks

Adding GraphQL operation labels without limits can blow up the time-series database. Then teams strip the labels and lose the visibility that justified them.

Our solution

GraphQL-aware Prometheus metrics, built in

The router exports R.E.D. metrics for router and subgraph traffic with the GraphQL labels operators actually need. Cardinality is bounded out of the box.

What happens on every request

  1. The router collects R.E.D. metrics via the OpenTelemetry SDK on every request.

  2. A Prometheus exporter publishes them on http://localhost:8088/metrics by default.

  3. Labels carry GraphQL context: wg_operation_name, wg_operation_type, wg_operation_protocol, wg_client_name, wg_client_version, wg_subgraph_name, wg_subgraph_id, and http_status_code.

  4. Prometheus scrapes the endpoint on its configured interval and stores the series.

  5. Grafana queries the series for dashboards; Prometheus evaluates alerting rules against them and sends firing SLO alerts to Alertmanager.

  6. A built-in cardinality limit of 2000 combinations per metric and regex exclusions keep the series count bounded.

Plug Prometheus into the endpoint. Dashboards and alerts follow.
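To make the steps above concrete, here is a minimal Python sketch that reads a scrape response and pulls out the GraphQL labels. The sample payload is hypothetical: the metric and label names come from this page, but the values and the exact label sets per series are invented for illustration. The parser handles only a simplified subset of the Prometheus text format (no timestamps, no unescaping of label values).

```python
import re

# Hypothetical scrape output; the values are invented for illustration.
SAMPLE = '''\
# TYPE router_http_requests_error_total counter
router_http_requests_error_total{wg_operation_name="GetUser",wg_operation_type="query",wg_client_name="web"} 42
router_http_requests_error_total{wg_operation_name="AddItem",wg_operation_type="mutation",wg_client_name="ios",wg_subgraph_name="cart"} 7
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # outside the simplified subset handled here
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


if __name__ == '__main__':
    for name, labels, value in parse_exposition(SAMPLE):
        print(name, labels.get('wg_operation_name'), value)
```

In practice Prometheus does this parsing for you; the sketch only shows where the wg_* labels sit on each series.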

Prometheus metrics

Before & After

Before Cosmo → With Cosmo

  • Before: Generic HTTP metrics without operation or subgraph dimensions. With Cosmo: R.E.D. metrics with wg_operation_name, wg_subgraph_name, and related GraphQL labels.
  • Before: Aggregate request duration hides which subgraph is slow. With Cosmo: Per-subgraph latency and error series on the same endpoint.
  • Before: High-cardinality labels overwhelm the time-series database. With Cosmo: Built-in 2000-combination limit and regex exclusions per metric.
  • Before: Custom instrumentation to expose federation metrics. With Cosmo: /metrics on :8088 by default: scrape and query.

Optional metrics

Beyond R.E.D.

  • Cache. Hit and miss ratios, costs, and key statistics.
  • Engine. Connections, subscriptions, and triggers.
  • Connection pool. Utilization and acquisition duration.
  • Circuit breaker. State and short-circuits.

Go runtime metrics (memory, GC, goroutines) are available via OTEL export with router_runtime enabled, not on the Prometheus scrape endpoint.

How Prometheus metrics work in Cosmo Router

01
R.E.D. for router and every subgraph.

Emit

Rate, errors, and duration for router and subgraph requests. Default endpoint http://localhost:8088/metrics. No custom instrumentation required.

02
GraphQL dimensions on every series.

Label

Series carry wg_operation_name, wg_operation_type, wg_operation_protocol, wg_client_name, wg_client_version, wg_subgraph_name, wg_subgraph_id, and http_status_code.

03
2000-combination cap, regex exclusions.

Bound

A default cardinality limit of 2000 unique combinations per metric bounds label growth. Once the limit is reached, further datapoints are stored without attributes. Regex exclusions remove labels or whole metrics by pattern.
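The overflow behaviour described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the router's implementation: the class name CappedCounter and its fields are hypothetical, and the only behaviour taken from this page is the limit of 2000 and the rule that datapoints past the limit are stored without attributes.

```python
class CappedCounter:
    """Counter that keeps at most `limit` distinct label combinations (hypothetical sketch)."""

    def __init__(self, limit=2000):
        self.limit = limit
        self.series = {}       # sorted tuple of (label, value) pairs -> running total
        self.unlabelled = 0.0  # datapoints recorded after the limit was reached

    def add(self, labels, value=1.0):
        key = tuple(sorted(labels.items()))
        if key in self.series or len(self.series) < self.limit:
            # Known combination, or still room for a new one.
            self.series[key] = self.series.get(key, 0.0) + value
        else:
            # Limit reached: keep the datapoint but drop its attributes.
            self.unlabelled += value


if __name__ == '__main__':
    counter = CappedCounter(limit=2)
    for op in ['A', 'B', 'C', 'A']:
        counter.add({'wg_operation_name': op})
    print(len(counter.series), counter.unlabelled)
```

With a limit of 2, operations A and B keep their own series, the second A still lands on its existing series, and C is counted without labels. Totals stay correct while the series count stops growing.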

04
Alertmanager-ready out of the box.

Alert

Wire the metrics into Grafana for dashboards and into Prometheus alerting rules that notify Alertmanager for SLO alerting. Common queries (p99 latency, subgraph error rate) fit on one screen.

Telemetry controls

PromQL, cardinality, and runtime

Query by operation and subgraph. Bound label cardinality. Export Go runtime metrics via OTEL when you need them.

p99 latency by operation

histogram_quantile(
  0.99,
  sum by (le, wg_operation_name) (
    rate(router_http_request_duration_milliseconds_bucket[5m])
  )
)
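For readers who want to see what histogram_quantile does with the bucket series, the Python sketch below reproduces the linear interpolation Prometheus applies to classic cumulative histograms. It is a simplified model that assumes non-negative bucket bounds, which holds for durations. The bucket boundaries and counts in the example are invented, not taken from the router's defaults.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last entry being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[i]
    if upper == math.inf:
        # The quantile falls beyond the last finite bound; report that bound.
        return buckets[i - 1][0] if i > 0 else math.nan
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == '__main__':
    # Hypothetical buckets in milliseconds: (le, cumulative count).
    example = [(50.0, 60), (100.0, 90), (250.0, 98), (500.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.99, example))
```

With these invented buckets the 99th of 100 observations falls halfway through the 250 to 500 ms bucket, so the estimate is 375 ms. The result can never be more precise than the bucket boundaries allow.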

Error rate by subgraph

sum by (wg_subgraph_name) (
  rate(router_http_requests_error_total[5m])
)
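The query above returns errors per second for each subgraph. As a rough model of what rate() and sum by compute, the Python sketch below diffs two scrapes of a counter and groups the result by one label. It is deliberately simplified: real rate() works over all samples in the window and extrapolates to the window edges. The function name, the input shape, and the sample values are all hypothetical.

```python
from collections import defaultdict


def simple_rate_by(samples_then, samples_now, seconds, label):
    """Per-second counter increase between two scrapes, summed by `label`.

    Each samples list holds (labels_dict, counter_value) pairs.
    """
    then = {frozenset(labels.items()): value for labels, value in samples_then}
    totals = defaultdict(float)
    for labels, value_now in samples_now:
        value_then = then.get(frozenset(labels.items()))
        if value_then is None:
            continue  # new series: nothing earlier to diff against
        if value_now >= value_then:
            increase = value_now - value_then
        else:
            increase = value_now  # counter reset: count from zero again
        totals[labels.get(label, "")] += increase / seconds
    return dict(totals)


if __name__ == '__main__':
    earlier = [
        ({"wg_subgraph_name": "products", "wg_operation_name": "A"}, 10),
        ({"wg_subgraph_name": "products", "wg_operation_name": "B"}, 5),
        ({"wg_subgraph_name": "reviews", "wg_operation_name": "A"}, 0),
    ]
    later = [
        ({"wg_subgraph_name": "products", "wg_operation_name": "A"}, 40),
        ({"wg_subgraph_name": "products", "wg_operation_name": "B"}, 35),
        ({"wg_subgraph_name": "reviews", "wg_operation_name": "A"}, 15),
    ]
    print(simple_rate_by(earlier, later, 300, "wg_subgraph_name"))
```

To turn this into an error ratio rather than an error throughput, divide by the same calculation over a total-requests counter.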

Cardinality cap

A default limit of 2000 unique label combinations per metric. Regex exclusions remove labels or whole metrics by pattern.
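As an illustration of what pattern-based exclusion means, the Python sketch below drops whole metrics and individual labels by regex. The function and parameter names are hypothetical, and the choice of full-string matching is an assumption; the router's actual configuration keys and matching semantics are covered in its documentation, not here.

```python
import re


def apply_exclusions(samples, exclude_metrics=(), exclude_labels=()):
    """Filter (metric_name, labels_dict, value) tuples by regex patterns.

    Assumes a pattern must match the whole metric or label name.
    """
    metric_patterns = [re.compile(p) for p in exclude_metrics]
    label_patterns = [re.compile(p) for p in exclude_labels]
    kept = []
    for name, labels, value in samples:
        if any(p.fullmatch(name) for p in metric_patterns):
            continue  # whole metric excluded
        trimmed = {
            key: val
            for key, val in labels.items()
            if not any(p.fullmatch(key) for p in label_patterns)
        }
        kept.append((name, trimmed, value))
    return kept


if __name__ == '__main__':
    samples = [
        ("router_http_requests_error_total",
         {"wg_operation_name": "GetUser", "wg_client_version": "1.2.3"}, 3.0),
        ("router_http_request_duration_milliseconds_bucket",
         {"wg_operation_name": "GetUser", "le": "100"}, 9.0),
    ]
    print(apply_exclusions(
        samples,
        exclude_metrics=[r"router_http_request_duration_.*"],
        exclude_labels=[r"wg_client_version"],
    ))
```

Dropping a label is the cheaper lever: the metric survives and only the dimension that was multiplying the series count goes away.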

Runtime metrics

Go runtime statistics (memory, GC, goroutines) are available via OTEL export with router_runtime enabled. They are not on the Prometheus /metrics scrape endpoint.

Scrape Cosmo Router today

The Prometheus endpoint is on by default. Add the scrape job and start querying.

FAQ

Prometheus metrics on Cosmo Router

Deep dive in the metrics and monitoring documentation.