Overview
The Service Health page provides visibility into the reliability of your services. Track health scores, uptime percentages, and incident patterns for each service to identify areas needing attention.

- Health Scores: Composite reliability scores for each service
- Uptime Tracking: Monitor service availability percentages
- Incident Impact: See which services have the most incidents
- MTTR by Service: Compare resolution times across services
Summary Statistics
Four metrics provide an overview of service health:

| Metric | Description |
|---|---|
| Avg Health Score | Average health score across all services |
| Healthy Services | Services with health score ≥ 80 |
| Degraded Services | Services with health score 50-79 |
| Critical Services | Services with health score < 50 |
Health Score Calculation
The health score (0-100) combines multiple factors:

| Factor | Weight | Description |
|---|---|---|
| Uptime | 40% | Percentage of time without incidents |
| Incident Count | 30% | Lower is better |
| MTTR | 20% | Faster resolution improves score |
| Severity Mix | 10% | Critical incidents have more impact |
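As a sketch, the weighted combination can be expressed in code. Only the weights come from the table above; how each factor is normalized to a 0-100 sub-score is an assumption for illustration.

```python
def health_score(uptime_score, incident_score, mttr_score, severity_score):
    """Combine factor sub-scores (each assumed 0-100) into a 0-100 health score.

    Weights match the table above; the normalization of each factor to
    0-100 is a hypothetical detail, not the documented formula.
    """
    return (
        0.40 * uptime_score +     # uptime: percentage of time without incidents
        0.30 * incident_score +   # fewer incidents -> higher sub-score
        0.20 * mttr_score +       # faster resolution -> higher sub-score
        0.10 * severity_score     # fewer critical incidents -> higher sub-score
    )

# Example: strong uptime with moderate incident and MTTR sub-scores.
score = health_score(99.5, 90, 85, 95)  # weighted sum, roughly 93.3
```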
Health Status Levels
| Status | Score Range | Badge Color | Meaning |
|---|---|---|---|
| Healthy | 80-100 | 🟢 Green | Service is performing well |
| Degraded | 50-79 | 🟡 Amber | Service needs attention |
| Critical | 0-49 | 🔴 Red | Service requires immediate action |
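The score-to-status mapping in the table above is simple enough to state directly as code:

```python
def health_status(score: float) -> str:
    """Map a 0-100 health score to its status level per the table above."""
    if score >= 80:
        return "Healthy"   # 80-100: performing well
    if score >= 50:
        return "Degraded"  # 50-79: needs attention
    return "Critical"      # 0-49: requires immediate action
```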
Service Cards
Each service displays a detailed card with:

Card Information
| Section | Details |
|---|---|
| Header | Service name and health status badge |
| Health Score | Visual progress bar with numeric score |
| Metrics | Uptime %, Incident count, Average MTTR |
Understanding the Metrics
Each card reports four metrics: Health Score, Uptime, Incident Count, and Avg MTTR.

Health Score
The overall reliability indicator:
- 90-100 — Excellent reliability
- 80-89 — Good, minor issues
- 70-79 — Degraded, needs attention
- 50-69 — Significant problems
- < 50 — Critical, requires immediate action
Using Service Health Data
Prioritizing Improvements
Investigate Root Causes
For unhealthy services, analyze:
- Recurring incident patterns
- Common failure modes
- Resource constraints
Comparing Services
Similar Services
Compare services with similar functions:
- Why does API Service A have 95% uptime while API Service B has 99%?
- What practices from healthy services can be adopted?
Critical Path Services
Pay extra attention to services that:
- Support revenue-generating features
- Are dependencies for many other services
- Have external SLA commitments
New vs. Established
New services may naturally have lower scores:
- Track improvement trajectory
- Ensure adequate monitoring is in place
- Document expected stabilization timeline
Improving Service Health
Quick Wins
Reduce Incident Volume
- Tune noisy alerts that don’t require action
- Fix recurring issues identified in postmortems
- Implement preventive monitoring
Improve MTTR
- Create and maintain runbooks
- Improve logging and observability
- Cross-train team members
Increase Uptime
- Add redundancy for single points of failure
- Implement graceful degradation
- Improve deployment practices
Long-term Improvements
Architecture Review
For persistently unhealthy services:
- Evaluate technical debt
- Consider refactoring or rewriting
- Review dependencies and failure domains
Capacity Planning
Health issues may indicate:
- Insufficient resources
- Scaling limitations
- Need for performance optimization
Process Improvements
- Implement better change management
- Improve deployment practices
- Enhance pre-production testing
Best Practices
Set Service-Level Targets
Establish health score targets based on service criticality:
- Customer-facing critical: 95+
- Internal critical: 90+
- Non-critical: 80+
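These tiers could be encoded as a simple lookup for dashboards or alerting; the tier keys here are hypothetical labels, not names the product defines.

```python
# Hypothetical tier labels mapping to the targets listed above.
HEALTH_TARGETS = {
    "customer-facing-critical": 95,
    "internal-critical": 90,
    "non-critical": 80,
}

def meets_target(score: float, criticality: str) -> bool:
    """True when a service's health score meets its tier's target."""
    return score >= HEALTH_TARGETS[criticality]
```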
Regular Health Reviews
Include service health in:
- Weekly team standups
- Monthly reliability reviews
- Quarterly planning
Track Trends
A service improving from 60 to 75 is progress, even if not yet “healthy.”
Investigate Sudden Drops
If a healthy service suddenly becomes degraded:
- Check for recent deployments
- Review infrastructure changes
- Look for external factors (dependencies, traffic)
Balance Investment
Don’t over-invest in already-healthy services. Focus improvement effort on degraded and critical services.
Document Service Context
Maintain documentation explaining:
- Expected health baselines
- Known limitations
- Improvement roadmaps
Troubleshooting
Service not appearing
- Verify incidents exist with this service name
- Check service tagging in integrations
- Ensure consistent service naming across alerts
Health score seems incorrect
- Review incident data for the service
- Check if all incidents are properly attributed
- Verify the calculation period matches expectations
Uptime shows 100% despite incidents
- Check if incident durations are being recorded
- Verify resolution times are being set
- Review how uptime is calculated for your setup
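To illustrate why missing durations can pin uptime at 100%, here is a sketch of one common way uptime is derived from incident windows; your setup may calculate it differently.

```python
from datetime import datetime

def uptime_percent(incidents, period_start, period_end):
    """Uptime as the share of the period not covered by incident downtime.

    `incidents` is a list of (started_at, resolved_at) pairs. An incident
    with no recorded resolution contributes zero downtime, which is how
    uptime can read 100% even when incidents exist.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for started, resolved in incidents:
        if resolved is None:
            continue  # no duration recorded -> no measured downtime
        overlap_start = max(started, period_start)
        overlap_end = min(resolved, period_end)
        if overlap_end > overlap_start:
            down += (overlap_end - overlap_start).total_seconds()
    return 100.0 * (1.0 - down / period)

start = datetime(2024, 1, 1)
end = datetime(2024, 1, 31)
incidents = [
    (datetime(2024, 1, 10, 9), datetime(2024, 1, 10, 12)),  # 3h outage
    (datetime(2024, 1, 20, 9), None),                        # never resolved
]
# 3 hours of downtime over 30 days; the unresolved incident adds nothing.
```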
Too many services listed
- Review alert naming conventions
- Consolidate similar service names
- Consider service grouping strategies
Service Naming Best Practices
Consistent service naming improves analytics accuracy:

| Pattern | Example | Benefit |
|---|---|---|
| Environment prefix | prod-api, staging-api | Separate production metrics |
| Team ownership | payments-gateway | Easy team attribution |
| Functional grouping | auth-service, auth-cache | Group related services |
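Consolidating near-duplicate names can be done with a small normalizer before attribution. This is a hypothetical sketch, not a built-in feature: it lowercases names and collapses spaces and underscores so variants like `Prod_API` and `prod-api` roll up together.

```python
import re

def normalize_service_name(name: str) -> str:
    """Normalize a service name: lowercase, and turn runs of whitespace
    or underscores into single hyphens, matching the prod-api style above."""
    return re.sub(r"[\s_]+", "-", name.strip().lower())
```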