# Services

> Monitor service reliability scores, uptime, and incident impact by service

## Overview

The Service Health page shows how reliable each of your services is. Track health scores, uptime percentages, and incident patterns per service to find the ones that need attention.

<CardGroup cols={2}>
  <Card title="Health Scores" icon="heart-pulse">
    Composite reliability scores for each service
  </Card>

  <Card title="Uptime Tracking" icon="check-circle">
    Monitor service availability percentages
  </Card>

  <Card title="Incident Impact" icon="exclamation-triangle">
    See which services have the most incidents
  </Card>

  <Card title="MTTR by Service" icon="clock">
    Compare resolution times across services
  </Card>
</CardGroup>

***

## Summary Statistics

Four metrics provide an overview of service health:

| Metric                | Description                              |
| --------------------- | ---------------------------------------- |
| **Avg Health Score**  | Average health score across all services |
| **Healthy Services**  | Services with health score ≥ 80          |
| **Degraded Services** | Services with health score 50-79         |
| **Critical Services** | Services with health score \< 50         |
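
As a rough illustration of how these four metrics relate to per-service scores, the sketch below derives them from a plain mapping of service name to health score. The function name `summarize` and the sample scores are hypothetical; only the thresholds (80 and 50) come from the table above.

```python
from statistics import mean

def summarize(scores):
    """Derive the four summary metrics from a {service_name: health_score} mapping."""
    values = list(scores.values())
    return {
        # Average across all services, rounded for display.
        "avg_health_score": round(mean(values), 1) if values else 0.0,
        # Thresholds taken from the Summary Statistics table.
        "healthy_services": sum(1 for s in values if s >= 80),
        "degraded_services": sum(1 for s in values if 50 <= s < 80),
        "critical_services": sum(1 for s in values if s < 50),
    }

# Hypothetical sample data.
summary = summarize({"prod-api": 92, "auth-service": 71, "payments-gateway": 43})
print(summary)
```

With the sample data, each of the three buckets contains exactly one service.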

***

## Health Score Calculation

The health score (0-100) is a weighted combination of four factors:

| Factor         | Weight | Description                                            |
| -------------- | ------ | ------------------------------------------------------ |
| Uptime         | 40%    | Percentage of time without active incidents            |
| Incident Count | 30%    | Fewer incidents yield a higher score                   |
| MTTR           | 20%    | Faster resolution yields a higher score                |
| Severity Mix   | 10%    | Critical incidents count more than lower-severity ones |
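
The weighting can be pictured as a weighted sum. The sketch below is a minimal illustration, not the product's actual implementation: it assumes each factor has already been converted to a sub-score on a 0-100 scale where higher is better. How incident count, MTTR, and severity mix are normalized is not documented here, so the sub-score inputs and the function name `health_score` are assumptions.

```python
# Weights from the table above.
WEIGHTS = {"uptime": 0.40, "incident_count": 0.30, "mttr": 0.20, "severity_mix": 0.10}

def health_score(subscores):
    """Weighted sum of four factor sub-scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    total = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    # Clamp to the documented 0-100 range.
    return round(min(100.0, max(0.0, total)), 1)

# Hypothetical sub-scores: 0.4*99 + 0.3*70 + 0.2*60 + 0.1*80
score = health_score(
    {"uptime": 99.0, "incident_count": 70.0, "mttr": 60.0, "severity_mix": 80.0}
)
print(score)
```

Because uptime carries 40% of the weight, a service with near-perfect uptime can still land close to the Healthy boundary when its incident count and MTTR sub-scores are mediocre, as in this example.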

### Health Status Levels

| Status       | Score Range | Badge Color | Meaning                           |
| ------------ | ----------- | ----------- | --------------------------------- |
| **Healthy**  | 80-100      | 🟢 Green    | Service is performing well        |
| **Degraded** | 50-79       | 🟡 Amber    | Service needs attention           |
| **Critical** | 0-49        | 🔴 Red      | Service requires immediate action |

***

## Service Cards

Each service has its own card showing its status, health score, and key metrics.

### Card Information

| Section          | Details                                |
| ---------------- | -------------------------------------- |
| **Header**       | Service name and health status badge   |
| **Health Score** | Visual progress bar with numeric score |
| **Metrics**      | Uptime %, Incident count, Average MTTR |

### Understanding the Metrics

<Tabs>
  <Tab title="Health Score">
    The overall reliability indicator:

    * **90-100** — Excellent reliability
    * **80-89** — Good, minor issues
    * **70-79** — Degraded, needs attention
    * **50-69** — Significant problems
    * **\< 50** — Critical, requires immediate action
  </Tab>

  <Tab title="Uptime">
    Percentage of time without active incidents:

    * **99.9%+** — High availability target met
    * **99-99.9%** — Generally reliable
    * **95-99%** — Room for improvement
    * **\< 95%** — Significant reliability issues
  </Tab>

  <Tab title="Incident Count">
    Total incidents for this service in the period:

    * Compare to similar services
    * Track trends over time
    * High counts may indicate systemic issues
  </Tab>

  <Tab title="Avg MTTR">
    Average time to resolve incidents for this service:

    * Shorter is better
    * Compare to organization average
    * Long MTTR may indicate complexity or knowledge gaps
  </Tab>
</Tabs>
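
To make the uptime and MTTR definitions concrete, here is a hedged sketch that computes both from a list of incident intervals. It assumes incidents are available as `(started_at, resolved_at)` pairs and that overlapping incidents should count as downtime only once; both assumptions, and the function names, are illustrative rather than a description of how the page computes its figures.

```python
from datetime import datetime, timedelta

def uptime_percent(incidents, period_start, period_end):
    """Share of the period with no active incident, as a percentage."""
    # Keep incidents that touch the period, clipped to its bounds, sorted by start.
    spans = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in incidents
        if start < period_end and end > period_start
    )
    downtime = timedelta()
    cursor = period_start
    for start, end in spans:
        start = max(start, cursor)  # skip time already covered by an earlier incident
        if end > start:
            downtime += end - start
            cursor = end
    return 100.0 * (1 - downtime / (period_end - period_start))

def avg_mttr(incidents):
    """Mean time from start to resolution, or None when there are no incidents."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations) if durations else None

# Hypothetical 30-day period; the first two incidents overlap.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),
    (datetime(2024, 1, 5, 10, 30), datetime(2024, 1, 5, 12, 0)),
    (datetime(2024, 1, 20, 0, 0), datetime(2024, 1, 20, 0, 30)),
]
uptime = uptime_percent(incidents, datetime(2024, 1, 1), datetime(2024, 1, 31))
mttr = avg_mttr(incidents)
print(round(uptime, 3), mttr)
```

In this example 2.5 hours of merged downtime over 720 hours gives roughly 99.65% uptime, which falls in the "Generally reliable" band, while the average MTTR is one hour.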

***

## Using Service Health Data

### Prioritizing Improvements

<Steps>
  <Step title="Identify Critical Services">
    Start with any services showing "Critical" status
  </Step>

  <Step title="Review Degraded Services">
    Plan improvements for "Degraded" services
  </Step>

  <Step title="Investigate Root Causes">
    For unhealthy services, analyze:

    * Recurring incident patterns
    * Common failure modes
    * Resource constraints
  </Step>

  <Step title="Track Improvement">
    Monitor health scores over time to verify fixes
  </Step>
</Steps>
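
The first two steps amount to a simple ordering rule. As a minimal sketch, assuming you have exported health scores into a mapping of service name to score, the hypothetical `triage` function below lists Critical services first, then Degraded ones, with the lowest score at the top.

```python
def triage(scores):
    """Return unhealthy services as (name, status, score), lowest score first."""
    queue = []
    for name, score in scores.items():
        if score < 50:
            queue.append((name, "Critical", score))
        elif score < 80:
            queue.append((name, "Degraded", score))
    # Critical scores are always below Degraded ones, so sorting by score
    # puts every Critical service ahead of every Degraded service.
    return sorted(queue, key=lambda item: item[2])

# Hypothetical sample data.
queue = triage({"prod-api": 92, "auth-service": 71, "payments-gateway": 43, "auth-cache": 55})
for name, status, score in queue:
    print(f"{status:9} {score:3} {name}")
```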

### Comparing Services

<AccordionGroup>
  <Accordion title="Similar Services">
    Compare services with similar functions:

    * Why does API Service A have 95% uptime while API Service B has 99%?
    * What practices from healthy services can be adopted?
  </Accordion>

  <Accordion title="Critical Path Services">
    Pay extra attention to services that:

    * Support revenue-generating features
    * Are dependencies for many other services
    * Have external SLA commitments
  </Accordion>

  <Accordion title="New vs. Established">
    New services may naturally have lower scores:

    * Track improvement trajectory
    * Ensure adequate monitoring is in place
    * Document expected stabilization timeline
  </Accordion>
</AccordionGroup>

***

## Improving Service Health

### Quick Wins

<AccordionGroup>
  <Accordion title="Reduce Incident Volume">
    * Tune noisy alerts that don't require action
    * Fix recurring issues identified in postmortems
    * Implement preventive monitoring
  </Accordion>

  <Accordion title="Improve MTTR">
    * Create and maintain runbooks
    * Improve logging and observability
    * Cross-train team members
  </Accordion>

  <Accordion title="Increase Uptime">
    * Add redundancy for single points of failure
    * Implement graceful degradation
    * Improve deployment practices
  </Accordion>
</AccordionGroup>

### Long-term Improvements

<AccordionGroup>
  <Accordion title="Architecture Review">
    For persistently unhealthy services:

    * Evaluate technical debt
    * Consider refactoring or rewriting
    * Review dependencies and failure domains
  </Accordion>

  <Accordion title="Capacity Planning">
    Health issues may indicate:

    * Insufficient resources
    * Scaling limitations
    * Need for performance optimization
  </Accordion>

  <Accordion title="Process Improvements">
    * Implement better change management
    * Improve deployment practices
    * Enhance pre-production testing
  </Accordion>
</AccordionGroup>

***

## Best Practices

<AccordionGroup>
  <Accordion title="Set Service-Level Targets">
    Establish health score targets based on service criticality:

    * Customer-facing critical: 95+
    * Internal critical: 90+
    * Non-critical: 80+
  </Accordion>

  <Accordion title="Regular Health Reviews">
    Include service health in:

    * Weekly team standups
    * Monthly reliability reviews
    * Quarterly planning
  </Accordion>

  <Accordion title="Track Trends">
    A service improving from 60 to 75 is progress, even if not yet "healthy."
  </Accordion>

  <Accordion title="Investigate Sudden Drops">
    If a healthy service suddenly becomes degraded:

    * Check for recent deployments
    * Review infrastructure changes
    * Look for external factors (dependencies, traffic)
  </Accordion>

  <Accordion title="Balance Investment">
    Don't over-invest in already-healthy services. Focus improvement effort on degraded and critical services.
  </Accordion>

  <Accordion title="Document Service Context">
    Maintain documentation explaining:

    * Expected health baselines
    * Known limitations
    * Improvement roadmaps
  </Accordion>
</AccordionGroup>
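
If you keep a criticality tier for each service, the targets under "Set Service-Level Targets" can be checked mechanically. The sketch below is one possible approach; the tier keys, the `below_target` function, and the sample services are hypothetical, and only the target values (95, 90, 80) come from this page.

```python
# Target health scores per criticality tier, as suggested above.
TARGETS = {"customer-facing-critical": 95, "internal-critical": 90, "non-critical": 80}

def below_target(services):
    """Return (name, score, target) for each service scoring under its tier target."""
    misses = []
    for name, (tier, score) in services.items():
        target = TARGETS[tier]
        if score < target:
            misses.append((name, score, target))
    # Largest shortfall first.
    return sorted(misses, key=lambda m: m[1] - m[2])

# Hypothetical sample data: name -> (tier, current health score).
misses = below_target({
    "payments-gateway": ("customer-facing-critical", 91),
    "auth-service": ("internal-critical", 93),
    "staging-api": ("non-critical", 62),
})
print(misses)
```

Note that `payments-gateway` is flagged even though a score of 91 counts as Healthy, because its tier sets a stricter target.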

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="Service not appearing">
    * Verify incidents exist with this service name
    * Check service tagging in integrations
    * Ensure consistent service naming across alerts
  </Accordion>

  <Accordion title="Health score seems incorrect">
    * Review incident data for the service
    * Check if all incidents are properly attributed
    * Verify the calculation period matches expectations
  </Accordion>

  <Accordion title="Uptime shows 100% despite incidents">
    * Check if incident durations are being recorded
    * Verify resolution times are being set
    * Review how uptime is calculated for your setup
  </Accordion>

  <Accordion title="Too many services listed">
    * Review alert naming conventions
    * Consolidate similar service names
    * Consider service grouping strategies
  </Accordion>
</AccordionGroup>
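
Several of the checks above come down to spotting incomplete incident records. The following sketch assumes you have a list of already-resolved incidents as dictionaries with hypothetical `id`, `service`, `started_at`, and `resolved_at` fields; these names are illustrative and do not describe a specific export format.

```python
from datetime import datetime

def find_attribution_gaps(incidents):
    """Group incident IDs by the data problem that could distort service metrics."""
    gaps = {"missing_service": [], "missing_resolved_at": [], "zero_duration": []}
    for inc in incidents:
        # No service name: the incident cannot be attributed to any service.
        if not (inc.get("service") or "").strip():
            gaps["missing_service"].append(inc["id"])
        resolved = inc.get("resolved_at")
        if resolved is None:
            # No resolution time: duration, and therefore downtime, is unknown.
            gaps["missing_resolved_at"].append(inc["id"])
        elif resolved <= inc["started_at"]:
            # Zero or negative duration contributes no downtime.
            gaps["zero_duration"].append(inc["id"])
    return gaps

# Hypothetical sample data.
gaps = find_attribution_gaps([
    {"id": "INC-1", "service": "prod-api",
     "started_at": datetime(2024, 1, 5, 10), "resolved_at": datetime(2024, 1, 5, 11)},
    {"id": "INC-2", "service": "",
     "started_at": datetime(2024, 1, 5, 12), "resolved_at": datetime(2024, 1, 5, 13)},
    {"id": "INC-3", "service": "auth-service",
     "started_at": datetime(2024, 1, 6, 9), "resolved_at": None},
    {"id": "INC-4", "service": "auth-service",
     "started_at": datetime(2024, 1, 7, 9), "resolved_at": datetime(2024, 1, 7, 9)},
])
print(gaps)
```

Incidents in the last two groups are likely candidates when uptime shows 100% despite incidents.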

***

## Service Naming Best Practices

Consistent service naming improves analytics accuracy:

| Pattern             | Example                      | Benefit                     |
| ------------------- | ---------------------------- | --------------------------- |
| Environment prefix  | `prod-api`, `staging-api`    | Separate production metrics |
| Team ownership      | `payments-gateway`           | Easy team attribution       |
| Functional grouping | `auth-service`, `auth-cache` | Group related services      |

<Tip>
  Establish service naming conventions and document them. Inconsistent naming creates fragmented analytics.
</Tip>
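
One way to find fragmented names is to normalize every service name to a single convention and look for names that collapse to the same value. The sketch below assumes lowercase, hyphen-separated names as the target convention, matching the examples in the table; the `normalize` and `find_variants` helpers are hypothetical.

```python
import re
from collections import defaultdict

def normalize(name):
    """Convert a service name to lowercase, hyphen-separated form."""
    # Insert a hyphen at camelCase boundaries, e.g. authService -> auth-Service.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name.strip())
    # Collapse spaces, underscores, and other separators into single hyphens.
    name = re.sub(r"[^A-Za-z0-9]+", "-", name)
    return name.strip("-").lower()

def find_variants(names):
    """Map each canonical name to its raw spellings, when there is more than one."""
    groups = defaultdict(set)
    for raw in names:
        groups[normalize(raw)].add(raw)
    return {canon: sorted(raws) for canon, raws in groups.items() if len(raws) > 1}

# Hypothetical service names as they might appear across alert sources.
variants = find_variants(
    ["prod-api", "Prod API", "prod_api", "auth-service", "authService", "auth-cache"]
)
print(variants)
```

Each group in the result is a set of spellings that probably refer to one service and are worth consolidating at the alert source.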

***

## Related Pages

<CardGroup cols={3}>
  <Card title="Alert Analytics" icon="bell" href="/analytics/incidents">
    Detailed incident analysis
  </Card>

  <Card title="Postmortems" icon="clipboard-list" href="/analytics/postmortems">
    Document and learn from incidents
  </Card>

  <Card title="Integrations" icon="plug" href="/integrations/overview">
    Configure service metadata
  </Card>
</CardGroup>
