# Services

> Monitor service reliability scores, uptime, and incident impact by service

## Overview

The Service Health page provides visibility into the reliability of your services. Track health scores, uptime percentages, and incident patterns for each service to identify areas needing attention.

* Composite reliability scores for each service
* Monitor service availability percentages
* See which services have the most incidents
* Compare resolution times across services

***

## Summary Statistics

Four metrics provide an overview of service health:

| Metric                | Description                              |
| --------------------- | ---------------------------------------- |
| **Avg Health Score**  | Average health score across all services |
| **Healthy Services**  | Services with health score ≥ 80          |
| **Degraded Services** | Services with health score 50-79         |
| **Critical Services** | Services with health score \< 50         |

***

## Health Score Calculation

The health score (0-100) combines multiple factors:

| Factor         | Weight | Description                          |
| -------------- | ------ | ------------------------------------ |
| Uptime         | 40%    | Percentage of time without incidents |
| Incident Count | 30%    | Lower is better                      |
| MTTR           | 20%    | Faster resolution improves score     |
| Severity Mix   | 10%    | Critical incidents have more impact  |

### Health Status Levels

| Status       | Score Range | Badge Color | Meaning                           |
| ------------ | ----------- | ----------- | --------------------------------- |
| **Healthy**  | 80-100      | 🟢 Green    | Service is performing well        |
| **Degraded** | 50-79       | 🟡 Amber    | Service needs attention           |
| **Critical** | 0-49        | 🔴 Red      | Service requires immediate action |

***

## Service Cards

Each service displays a detailed card with:

### Card Information

| Section          | Details                                |
| ---------------- | -------------------------------------- |
| **Header**       | Service name and health status badge   |
| **Health Score** | Visual progress bar with numeric score |
| **Metrics**      | Uptime %, Incident count, Average MTTR |

### Understanding the Metrics

**Health Score.** The overall reliability indicator:

* **90-100**: Excellent reliability
* **80-89**: Good, minor issues
* **70-79**: Degraded, needs attention
* **50-69**: Significant problems
* **\< 50**: Critical, requires immediate action

**Uptime.** Percentage of time without active incidents:

* **99.9%+**: High availability target met
* **99-99.9%**: Generally reliable
* **95-99%**: Room for improvement
* **\< 95%**: Significant reliability issues

**Incident Count.** Total incidents for this service in the period:

* Compare to similar services
* Track trends over time
* High counts may indicate systemic issues

**MTTR.** Average time to resolve incidents for this service:

* Shorter is better
* Compare to organization average
* Long MTTR may indicate complexity or knowledge gaps

***

## Using Service Health Data

### Prioritizing Improvements

1. Start with any services showing "Critical" status
2. Plan improvements for "Degraded" services
3. For unhealthy services, analyze:
   * Recurring incident patterns
   * Common failure modes
   * Resource constraints
4. Monitor health scores over time to verify fixes

### Comparing Services

Compare services with similar functions:

* Why does API Service A have 95% uptime while API Service B has 99%?
* What practices from healthy services can be adopted?
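
As a rough illustration of how the factor weights, status bands, and review order described on this page fit together, the sketch below combines four sub-scores into a health score, maps it to a status, and sorts services so the lowest scores come first. It rests on stated assumptions: the `Service` class, the service names, and the sample numbers are hypothetical, and each factor is assumed to be already normalized to a 0-100 sub-score where higher is better. This page does not document that normalization, so this is not Easyalert's actual implementation.

```python
from dataclasses import dataclass

# Weights from the Health Score Calculation table on this page.
WEIGHTS = {"uptime": 0.40, "incident_count": 0.30, "mttr": 0.20, "severity_mix": 0.10}


def health_score(sub_scores: dict[str, float]) -> float:
    """Combine per-factor sub-scores into a 0-100 health score.

    Assumption: every sub-score is already on a 0-100 scale where
    higher is better. How the product normalizes each factor is not
    documented on this page.
    """
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


def status(score: float) -> str:
    """Map a score to the status levels listed on this page."""
    if score >= 80:
        return "Healthy"
    if score >= 50:
        return "Degraded"
    return "Critical"


@dataclass
class Service:  # hypothetical container, not an Easyalert type
    name: str
    sub_scores: dict[str, float]

    @property
    def score(self) -> float:
        return health_score(self.sub_scores)


def review_order(services: list[Service]) -> list[Service]:
    """Lowest score first, so Critical services are reviewed before Degraded ones."""
    return sorted(services, key=lambda s: s.score)


# Made-up sample data for two similar API services.
services = [
    Service("api-b", {"uptime": 99, "incident_count": 90, "mttr": 85, "severity_mix": 80}),
    Service("api-a", {"uptime": 95, "incident_count": 60, "mttr": 70, "severity_mix": 50}),
]

for svc in review_order(services):
    print(f"{svc.name}: {svc.score:.1f} ({status(svc.score)})")
```

With this sample data, `api-a` scores 75.0 (Degraded) and `api-b` scores 91.6 (Healthy), so `api-a` is listed first. Comparing the sub-scores of the two services side by side shows which factor accounts for most of the gap.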

Pay extra attention to services that:

* Support revenue-generating features
* Are dependencies for many other services
* Have external SLA commitments

New services may naturally have lower scores:

* Track improvement trajectory
* Ensure adequate monitoring is in place
* Document expected stabilization timeline

***

## Improving Service Health

### Quick Wins

* Tune noisy alerts that don't require action
* Fix recurring issues identified in postmortems
* Implement preventive monitoring
* Create and maintain runbooks
* Improve logging and observability
* Cross-train team members
* Add redundancy for single points of failure
* Implement graceful degradation
* Improve deployment practices

### Long-term Improvements

For persistently unhealthy services:

* Evaluate technical debt
* Consider refactoring or rewriting
* Review dependencies and failure domains

Health issues may indicate:

* Insufficient resources
* Scaling limitations
* Need for performance optimization

Process changes to consider:

* Implement better change management
* Improve deployment practices
* Enhance pre-production testing

***

## Best Practices

Establish health score targets based on service criticality:

* Customer-facing critical: 95+
* Internal critical: 90+
* Non-critical: 80+

Include service health in:

* Weekly team standups
* Monthly reliability reviews
* Quarterly planning

A service improving from 60 to 75 is progress, even if not yet "healthy."

If a healthy service suddenly becomes degraded:

* Check for recent deployments
* Review infrastructure changes
* Look for external factors (dependencies, traffic)

Don't over-invest in already-healthy services. Focus improvement effort on degraded and critical services.
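
One way to decide where to focus is to compare each service's current score with the criticality-based targets listed earlier in this section. The following is a minimal sketch under stated assumptions: the tier keys, service names, and scores are hypothetical, and only the three target values (95, 90, 80) come from this page.

```python
# Target scores by criticality tier, taken from the Best Practices targets.
# The tier key names are hypothetical labels chosen for this example.
TARGETS = {
    "customer-facing-critical": 95,
    "internal-critical": 90,
    "non-critical": 80,
}


def gap_to_target(score: float, tier: str) -> float:
    """Return how many points a service is below its target (0 if it meets it)."""
    return max(0.0, TARGETS[tier] - score)


def services_below_target(services: dict[str, tuple[float, str]]) -> list[tuple[str, float]]:
    """List (name, gap) for services under target, largest gap first."""
    gaps = [(name, gap_to_target(score, tier)) for name, (score, tier) in services.items()]
    return sorted((g for g in gaps if g[1] > 0), key=lambda g: g[1], reverse=True)


# Hypothetical services: name -> (current health score, criticality tier).
sample = {
    "prod-api": (92, "customer-facing-critical"),
    "auth-service": (91, "internal-critical"),
    "batch-reports": (60, "non-critical"),
}

for name, gap in services_below_target(sample):
    print(f"{name} is {gap:.0f} points below target")
```

In this sample, `auth-service` already meets its target and is left out, which matches the advice not to over-invest in services that are already healthy.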

Maintain documentation explaining:

* Expected health baselines
* Known limitations
* Improvement roadmaps

***

## Troubleshooting

If a service does not appear on the page:

* Verify incidents exist with this service name
* Check service tagging in integrations
* Ensure consistent service naming across alerts

If a health score looks wrong:

* Review incident data for the service
* Check if all incidents are properly attributed
* Verify the calculation period matches expectations

If uptime or MTTR looks wrong:

* Check if incident durations are being recorded
* Verify resolution times are being set
* Review how uptime is calculated for your setup

If one service appears under several names:

* Review alert naming conventions
* Consolidate similar service names
* Consider service grouping strategies

***

## Service Naming Best Practices

Consistent service naming improves analytics accuracy:

| Pattern             | Example                      | Benefit                     |
| ------------------- | ---------------------------- | --------------------------- |
| Environment prefix  | `prod-api`, `staging-api`    | Separate production metrics |
| Team ownership      | `payments-gateway`           | Easy team attribution       |
| Functional grouping | `auth-service`, `auth-cache` | Group related services      |

Establish service naming conventions and document them. Inconsistent naming creates fragmented analytics.

***

## Related Pages

* Detailed incident analysis
* Document and learn from incidents
* Configure service metadata