
Overview

The Service Health page provides visibility into the reliability of your services. Track health scores, uptime percentages, and incident patterns for each service to identify areas needing attention.

Key capabilities:
  • Health Scores: Composite reliability scores for each service
  • Uptime Tracking: Monitor service availability percentages
  • Incident Impact: See which services have the most incidents
  • MTTR by Service: Compare resolution times across services

Summary Statistics

Four metrics provide an overview of service health:
  • Avg Health Score: Average health score across all services
  • Healthy Services: Services with health score ≥ 80
  • Degraded Services: Services with health score 50-79
  • Critical Services: Services with health score < 50
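These four tiles follow directly from the per-service scores and the thresholds above. A minimal sketch in Python (the function name and result keys are illustrative, not the product's API):

```python
def summarize(scores):
    """Compute the four summary statistics from a list of health scores."""
    return {
        "avg_health_score": round(sum(scores) / len(scores), 1),
        "healthy": sum(1 for s in scores if s >= 80),
        "degraded": sum(1 for s in scores if 50 <= s < 80),
        "critical": sum(1 for s in scores if s < 50),
    }

summarize([95, 72, 45, 88])
# → {'avg_health_score': 75.0, 'healthy': 2, 'degraded': 1, 'critical': 1}
```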

Health Score Calculation

The health score (0-100) combines multiple factors:
  • Uptime (40%): Percentage of time without incidents
  • Incident Count (30%): Lower is better
  • MTTR (20%): Faster resolution improves score
  • Severity Mix (10%): Critical incidents have more impact
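The published weights combine the four factors into one number. How each factor is normalized to a 0-100 scale is not documented here, so this sketch assumes the normalization has already been done:

```python
# Weights from the table above; per-factor normalization is an assumption.
WEIGHTS = {"uptime": 0.40, "incident_count": 0.30, "mttr": 0.20, "severity_mix": 0.10}

def health_score(factors):
    """Weighted sum of factor scores, each already normalized to 0-100."""
    return round(sum(WEIGHTS[name] * score for name, score in factors.items()), 1)

health_score({"uptime": 99.0, "incident_count": 80.0, "mttr": 70.0, "severity_mix": 90.0})
# → 86.6
```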

Health Status Levels

  • Healthy (80-100, 🟢 Green): Service is performing well
  • Degraded (50-79, 🟡 Amber): Service needs attention
  • Critical (0-49, 🔴 Red): Service requires immediate action
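The status thresholds map directly to a small classifier; a sketch:

```python
def health_status(score):
    """Map a 0-100 health score to its status badge."""
    if score >= 80:
        return "Healthy"   # 🟢 performing well
    if score >= 50:
        return "Degraded"  # 🟡 needs attention
    return "Critical"      # 🔴 immediate action required

[health_status(s) for s in (92, 63, 41)]
# → ['Healthy', 'Degraded', 'Critical']
```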

Service Cards

Each service displays a detailed card with:

Card Information

  • Header: Service name and health status badge
  • Health Score: Visual progress bar with numeric score
  • Metrics: Uptime %, incident count, average MTTR

Understanding the Metrics

The health score is the overall reliability indicator:
  • 90-100 — Excellent reliability
  • 80-89 — Good, minor issues
  • 70-79 — Degraded, needs attention
  • 50-69 — Significant problems
  • < 50 — Critical, requires immediate action

Using Service Health Data

Prioritizing Improvements

1. Identify Critical Services: Start with any services showing “Critical” status.
2. Review Degraded Services: Plan improvements for “Degraded” services.
3. Investigate Root Causes: For unhealthy services, analyze:
  • Recurring incident patterns
  • Common failure modes
  • Resource constraints
4. Track Improvement: Monitor health scores over time to verify fixes.

Comparing Services

Compare services with similar functions:
  • Why does API Service A have 95% uptime while API Service B has 99%?
  • What practices from healthy services can be adopted?
Pay extra attention to services that:
  • Support revenue-generating features
  • Are dependencies for many other services
  • Have external SLA commitments
New services may naturally have lower scores:
  • Track improvement trajectory
  • Ensure adequate monitoring is in place
  • Document expected stabilization timeline

Improving Service Health

Quick Wins

  • Tune noisy alerts that don’t require action
  • Fix recurring issues identified in postmortems
  • Implement preventive monitoring
  • Create and maintain runbooks
  • Improve logging and observability
  • Cross-train team members
  • Add redundancy for single points of failure
  • Implement graceful degradation
  • Improve deployment practices

Long-term Improvements

For persistently unhealthy services:
  • Evaluate technical debt
  • Consider refactoring or rewriting
  • Review dependencies and failure domains
Health issues may indicate:
  • Insufficient resources
  • Scaling limitations
  • Need for performance optimization
To reduce change-related failures:
  • Implement better change management
  • Improve deployment practices
  • Enhance pre-production testing

Best Practices

Establish health score targets based on service criticality:
  • Customer-facing critical: 95+
  • Internal critical: 90+
  • Non-critical: 80+
Include service health in:
  • Weekly team standups
  • Monthly reliability reviews
  • Quarterly planning
If a healthy service suddenly becomes degraded:
  • Check for recent deployments
  • Review infrastructure changes
  • Look for external factors (dependencies, traffic)
Don’t over-invest in already-healthy services. Focus improvement effort on degraded and critical services.
Maintain documentation explaining:
  • Expected health baselines
  • Known limitations
  • Improvement roadmaps

Troubleshooting

A service is not appearing:
  • Verify incidents exist with this service name
  • Check service tagging in integrations
  • Ensure consistent service naming across alerts
A health score looks incorrect:
  • Review incident data for the service
  • Check that all incidents are properly attributed
  • Verify the calculation period matches expectations
Uptime or MTTR shows no data:
  • Check whether incident durations are being recorded
  • Verify resolution times are being set
  • Review how uptime is calculated for your setup
Too many fragmented service names:
  • Review alert naming conventions
  • Consolidate similar service names
  • Consider service grouping strategies
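When reviewing how uptime is calculated, a common definition is the percentage of the reporting window not covered by incident downtime. A simplified sketch, assuming each incident has start and end timestamps (the field layout is illustrative; overlapping incidents are not merged here):

```python
from datetime import datetime

def uptime_pct(incidents, window_start, window_end):
    """Uptime = 1 - (summed incident downtime / reporting window), as a %."""
    window = (window_end - window_start).total_seconds()
    downtime = sum(
        (min(end, window_end) - max(start, window_start)).total_seconds()
        for start, end in incidents
    )
    return round(100 * (1 - downtime / window), 2)

# One 3-hour incident in a 30-day window
uptime_pct(
    [(datetime(2024, 1, 10, 2), datetime(2024, 1, 10, 5))],
    datetime(2024, 1, 1),
    datetime(2024, 1, 31),
)
# → 99.58
```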

Service Naming Best Practices

Consistent service naming improves analytics accuracy:
  • Environment prefix (e.g. prod-api, staging-api): Separate production metrics
  • Team ownership (e.g. payments-gateway): Easy team attribution
  • Functional grouping (e.g. auth-service, auth-cache): Group related services
Establish service naming conventions and document them. Inconsistent naming creates fragmented analytics.
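A documented convention is easiest to enforce mechanically, for example in a CI check on alert configuration. A sketch validator for the hyphenated lowercase pattern used in the examples above (the exact rule is an assumption; adapt it to your own convention):

```python
import re

# Lowercase words separated by single hyphens, e.g. "prod-api", "payments-gateway"
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

def valid_service_name(name):
    """True if the name follows the hyphenated lowercase convention."""
    return bool(SERVICE_NAME.fullmatch(name))

[valid_service_name(n) for n in ("prod-api", "auth-cache", "Prod_API")]
# → [True, True, False]
```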