SigNoz: Distributed tracing, metrics, and logs

- Distributed tracing, metrics, and logs are the **three pillars of observability**.
  • Together, they help us understand, debug, and optimize modern distributed systems.

1. Distributed Tracing

Concept:

  • Distributed tracing is a ==method used to track the flow of a single request== or transaction across different services or components in a distributed system.
  • It helps you see how a request moves through the system, which services it interacts with, and how long each step takes.

How It Works:

  • Each request is assigned a ==unique trace ID==.
  • As the request moves through services, ==span IDs== are generated for each service call or operation.
  • The trace ID ties all spans together to represent the complete request flow.
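The ID scheme above can be sketched in a few lines. This is a toy illustration, not how any particular tracing library is implemented: the function names and the service names are made up, and the ID sizes (16 bytes for a trace ID, 8 bytes for a span ID) follow the W3C Trace Context convention.

```python
import secrets

def new_trace_id() -> str:
    # 16 random bytes rendered as 32 hex characters
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 8 random bytes rendered as 16 hex characters
    return secrets.token_hex(8)

def handle_request(service_names):
    """Simulate one request passing through several services in order."""
    trace_id = new_trace_id()  # assigned once, when the request enters the system
    spans = []
    for name in service_names:
        spans.append({
            "trace_id": trace_id,      # identical for every span of this request
            "span_id": new_span_id(),  # unique per service call
            "service": name,
        })
    return spans

# Hypothetical service names, used only for illustration
spans = handle_request(["gateway", "orders", "payments"])
for s in spans:
    print(s["trace_id"], s["span_id"], s["service"])
```

Every printed row shares one trace ID but has its own span ID, which is what lets a backend group the rows into a single request flow.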

Key Terminology:

  • Trace: Represents the entire lifecycle of a request.
  • Span: ==A single operation or unit of work within a trace==. Each span includes details like start time, end time, and metadata.
  • Parent-Child Relationships: Spans can have parent-child relationships to show dependencies (e.g., Service A → Service B → Service C).
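To make these terms concrete, here is a minimal sketch of a span record and a trace built from three spans. The `Span` class, its field names, and the timings are assumptions chosen for illustration; real tracing SDKs carry more fields.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None  # None marks the root span

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

# One trace: Service A calls Service B, which calls Service C
trace = [
    Span("a", "Service A", 0, 120),
    Span("b", "Service B", 10, 110, parent_id="a"),
    Span("c", "Service C", 30, 90, parent_id="b"),
]

def render(spans, parent_id=None, depth=0):
    """Return the spans as indented lines, children nested under parents."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines

tree = render(trace)
print("\n".join(tree))
```

The indentation in the output mirrors the parent-child chain, which is the same information a trace viewer draws as a waterfall.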

Benefits:

  • Identifies bottlenecks by showing which service or operation is slowing down the request.
  • Helps visualize dependencies between services.
  • Useful for debugging errors in complex, distributed architectures.
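One simple way to locate a bottleneck is to compute each span's "self time": its own duration minus the time spent in its direct children. The sketch below assumes children run one after another (no concurrency) and uses made-up durations; it is an illustration of the idea, not a feature of any specific tool.

```python
# Each span: (span_id, parent_id, name, duration_ms). Values are illustrative.
spans = [
    ("a", None, "Service A", 120),
    ("b", "a", "Service B", 100),
    ("c", "b", "Service C", 60),
]

def self_times(spans):
    """Return {name: time spent in the span itself, excluding direct children}."""
    child_total = {}
    for _span_id, parent_id, _name, duration in spans:
        if parent_id is not None:
            child_total[parent_id] = child_total.get(parent_id, 0) + duration
    return {
        name: duration - child_total.get(span_id, 0)
        for span_id, _parent_id, name, duration in spans
    }

times = self_times(spans)
bottleneck = max(times, key=times.get)  # span with the largest self time
print(times)
print("Slowest step:", bottleneck)
```

Here Service A looks slowest by total duration, but most of that time is spent waiting on its children; self time points at Service C instead.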

2. Metrics

Concept:

  • Metrics are numerical data points that represent the state or performance of your system over time.
  • They are quantitative and can be aggregated to provide insights into system health.

Common Metrics Types:

  1. Infrastructure Metrics:
    • CPU usage, memory consumption, disk I/O, network traffic.
  2. Application Metrics:
    • Request rates (RPS), response times (latency), error rates.
  3. Business Metrics:
    • User sign-ups, transactions per second, revenue per hour.
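As a rough sketch of how the application metrics above are derived, the snippet below computes request rate, error rate, and average latency from a handful of request records. The records, the 10-second window, and the rule "status 500 or above counts as an error" are all assumptions for the example.

```python
# Each record: (timestamp_seconds, http_status, latency_ms). Values are made up.
requests = [
    (0.5, 200, 120),
    (2.0, 200, 80),
    (4.5, 500, 300),
    (7.0, 200, 100),
    (9.5, 503, 400),
]
window_seconds = 10

total = len(requests)
errors = sum(1 for _ts, status, _latency in requests if status >= 500)

rps = total / window_seconds  # request rate
error_rate = errors / total   # fraction of failed requests
avg_latency_ms = sum(latency for _ts, _status, latency in requests) / total

print(f"RPS={rps}, error rate={error_rate:.0%}, avg latency={avg_latency_ms} ms")
```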

Characteristics:

  • Metrics are time-series data: each value is recorded with a timestamp, which enables trend analysis.
  • Metrics are typically predefined and emitted at regular intervals (e.g., every second or minute).
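The time-series nature of metrics can be illustrated by grouping timestamped samples into fixed windows and aggregating each window. The sample values and the function name `average_per_window` are assumptions for this sketch.

```python
from collections import defaultdict
from statistics import mean

# (timestamp_seconds, latency_ms) samples emitted every 20 seconds; values are made up
samples = [
    (0, 100), (20, 140), (40, 120),
    (60, 200), (80, 220), (100, 240),
]

def average_per_window(samples, window_seconds=60):
    """Group samples into fixed-size windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in samples:
        window_start = ts - (ts % window_seconds)
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}

per_minute = average_per_window(samples)
print(per_minute)
```

Comparing the first window with the second shows latency trending upward, which is the kind of trend analysis that time-stamped data makes possible.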

Use Cases:

  • Monitoring system performance and availability.
  • Detecting anomalies, like sudden spikes in traffic or errors.
  • Setting up alerts for threshold breaches (e.g., CPU > 80%).
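A threshold alert such as "CPU > 80%" can be sketched as below. To avoid firing on a single noisy sample, this version waits for several consecutive breaches; the function name, the `min_consecutive` parameter, and the readings are assumptions, not the behavior of a specific alerting system.

```python
def breaches(readings, threshold=80.0, min_consecutive=3):
    """Return the indexes at which a sustained breach is first confirmed,
    i.e. the reading has exceeded the threshold min_consecutive times in a row."""
    alerts = []
    streak = 0
    for i, value in enumerate(readings):
        streak = streak + 1 if value > threshold else 0
        if streak == min_consecutive:  # fire once per sustained breach
            alerts.append(i)
    return alerts

cpu_percent = [42, 85, 79, 88, 91, 95, 97, 60]  # illustrative samples
alert_indexes = breaches(cpu_percent)
print(alert_indexes)
```

The lone spike at index 1 is ignored, and the alert fires only once the third consecutive reading above 80 arrives.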

3. Logs

Concept:

  • Logs are text-based records, either unstructured, semi-structured, or fully structured (e.g., JSON), that applications or systems generate to describe specific events or states.
  • Logs are contextual and provide detailed information about what happened at a specific time.
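The difference between a free-text log line and one with explicit fields can be shown with the same event written both ways. The helper names, the field names, and the `user_id` value are hypothetical.

```python
import json
from datetime import datetime, timezone

def plain_line(ts, level, message):
    # Free text: easy to read, harder to query by field
    return f"{ts} {level} {message}"

def json_line(ts, level, message, **context):
    # Key-value form: each piece of context becomes a searchable field
    return json.dumps({"timestamp": ts, "level": level, "message": message, **context})

# Fixed timestamp so the example is reproducible
ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc).isoformat()

print(plain_line(ts, "ERROR", "User authentication failed"))
print(json_line(ts, "ERROR", "User authentication failed", user_id="u-123"))
```

Both lines record what happened and when; the JSON form additionally lets a log backend filter on `level` or `user_id` without parsing the message text.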

Types of Logs:

  1. Application Logs:
    • Messages generated by the application, e.g., “User authentication failed.”
  2. System Logs:
    • Logs from the operating system or hardware, e.g., kernel logs, systemd logs.
  3. Security Logs:
    • Logs for auditing, e.g., failed login attempts, unauthorized access.

Log Levels:

  • DEBUG: Detailed information for developers.
  • INFO: General system information (e.g., “Server started on port 8080”).
  • WARNING: Indications of potential problems.
  • ERROR: Errors that need attention but do not crash the system.
  • CRITICAL/FATAL: Severe errors that may crash the application.
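Python's standard `logging` module uses these same levels, so it makes a convenient illustration. In this sketch the logger name `demo_app` and the messages are made up, and the output is captured in memory so the filtering effect is easy to inspect.

```python
import io
import logging

# Capture log output in memory instead of writing to the console
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("demo_app")  # hypothetical logger name
logger.setLevel(logging.INFO)           # anything below INFO is dropped
logger.addHandler(handler)
logger.propagate = False                # keep output out of the root logger

logger.debug("Cache lookup for key user:42")
logger.info("Server started on port 8080")
logger.warning("Disk usage is high")
logger.error("Payment request failed")
logger.critical("Cannot connect to database, shutting down")

output = stream.getvalue()
print(output, end="")
```

Because the logger level is set to INFO, the DEBUG message never reaches the output, while the four higher-severity messages do.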

Benefits:

  • Helps troubleshoot issues by providing detailed context.
  • Useful for auditing and compliance.
  • Complements metrics and tracing for a holistic view.