- Distributed tracing, metrics, and logs are the **three pillars of observability**. Together they help us understand, debug, and optimize modern distributed systems.
1. Distributed Tracing
Concept:
Distributed tracing is a ==method used to track the flow of a single request== or transaction across different services or components in a distributed system.
It helps you see how a request moves through the system, which services it interacts with, and how long each step takes.
How It Works:
Each request is assigned a ==unique trace ID==.
As the request moves through services, ==span IDs== are generated for each service call or operation.
The trace ID ties all spans together to represent the complete request flow.
Key Terminology:
Trace: Represents the entire lifecycle of a request.
Span: ==A single operation or unit of work within a trace==. Each span includes details like start time, end time, and metadata.
Parent-Child Relationships: Spans can have parent-child relationships to show dependencies (e.g., Service A → Service B → Service C).
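A minimal sketch of the trace/span model described above, assuming a simplified in-memory `Span` class (hypothetical, not any tracing library's API). It shows how one trace ID is shared by every span and how `parent_id` records the Service A → Service B → Service C chain:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work within a trace (hypothetical, simplified model)."""
    trace_id: str                      # shared by every span of the same request
    name: str                          # service or operation name
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and records this span as its parent.
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.monotonic()


# Service A receives the request and starts the trace.
root = Span(trace_id=uuid.uuid4().hex, name="service-a")
# Service A calls Service B, which in turn calls Service C.
span_b = root.child("service-b")
span_c = span_b.child("service-c")

for span in (span_c, span_b, root):    # inner calls finish first
    span.finish()

spans = [root, span_b, span_c]
```

In a real system the trace ID and parent span ID would travel between services in request headers; here everything lives in one process to keep the sketch short.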
Benefits:
Identifies bottlenecks by showing which service or operation is slowing down the request.
Helps visualize dependencies between services.
Useful for debugging errors in complex, distributed architectures.
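To illustrate the bottleneck point, here is a small sketch that computes each span's "self time" (its duration minus the time spent in its direct children). The span tuples and timings are made up, and the calculation assumes child calls do not overlap in time:

```python
# (name, parent, start_ms, end_ms): made-up timings for one trace
spans = [
    ("service-a", None, 0, 120),
    ("service-b", "service-a", 10, 110),
    ("service-c", "service-b", 20, 100),
]


def self_times(spans):
    """Time spent in each span, excluding its direct children."""
    totals = {name: end - start for name, _, start, end in spans}
    result = dict(totals)
    for name, parent, _, _ in spans:
        if parent is not None:
            result[parent] -= totals[name]
    return result


times = self_times(spans)
slowest = max(times, key=times.get)   # span with the largest self time
```

With these numbers Service A and Service B each spend 20 ms of their own time, while Service C spends 80 ms, so Service C is the place to look first.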
2. Metrics
Concept:
Metrics are numerical data points that represent the state or performance of your system over time.
They are quantitative and can be aggregated to provide insights into system health.
Common Metric Types:
Infrastructure Metrics:
CPU usage, memory consumption, disk I/O, network traffic.
Application Metrics:
Request rates (RPS), response times (latency), error rates.
Business Metrics:
User sign-ups, transactions per second, revenue per hour.
Characteristics:
Metrics are time-series data: They are tracked over time, enabling trend analysis.
Metrics are typically predefined and emitted at regular intervals (e.g., every second or minute).
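As a rough sketch of the time-series idea, the snippet below stores each metric as a list of `(timestamp, value)` points and aggregates over a time window. `MetricStore`, the metric name, and the CPU values are all hypothetical; real systems use a dedicated time-series database:

```python
from collections import defaultdict


class MetricStore:
    """Hypothetical in-memory store: metric name -> list of (timestamp, value)."""

    def __init__(self):
        self.series = defaultdict(list)

    def record(self, name, timestamp, value):
        self.series[name].append((timestamp, value))

    def average(self, name, since):
        # Aggregate only the points at or after the `since` timestamp.
        values = [v for t, v in self.series[name] if t >= since]
        return sum(values) / len(values) if values else None


store = MetricStore()
# Simulated CPU samples emitted every 60 seconds (values are made up).
for i, cpu in enumerate([35.0, 42.0, 55.0, 61.0]):
    store.record("cpu_percent", timestamp=i * 60, value=cpu)

recent_avg = store.average("cpu_percent", since=120)   # last two samples
```

Because every point carries a timestamp, the same data can answer both "what is the value now" and "how has it trended over the last N minutes".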
Use Cases:
Monitoring system performance and availability.
Detecting anomalies, like sudden spikes in traffic or errors.
Setting up alerts for threshold breaches (e.g., CPU > 80%).
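A threshold alert like "CPU > 80%" could be sketched as below. Requiring several consecutive breaches before alerting is an assumption added here to avoid firing on a single spike; the function name, default values, and sample data are hypothetical:

```python
def breaches(samples, threshold=80.0, consecutive=3):
    """Return the indices at which `value > threshold` has held
    for exactly `consecutive` samples in a row."""
    alerts = []
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == consecutive:         # fire once per sustained breach
            alerts.append(i)
    return alerts


cpu = [70, 85, 90, 79, 82, 88, 91, 95, 60]
alert_indices = breaches(cpu)
```

Here the first two high readings (85, 90) do not trigger anything because the run is broken by 79; the alert fires only at the third consecutive reading above 80.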
3. Logs
Concept:
Logs are timestamped, text-based records generated by applications or systems to describe specific events or states. They can be unstructured (free text), semi-structured, or fully structured (e.g., JSON).
Logs are contextual and provide detailed information about what happened at a specific time.
Types of Logs:
Application Logs:
Messages generated by the application, e.g., “User authentication failed.”
System Logs:
Logs from the operating system or hardware, e.g., kernel logs, systemd logs.
Security Logs:
Logs for auditing, e.g., failed login attempts, unauthorized access.
Log Levels:
DEBUG: Detailed information for developers.
INFO: General system information (e.g., “Server started on port 8080”).
WARNING: Indications of potential problems.
ERROR: Errors that need attention but do not crash the system.
CRITICAL/FATAL: Severe errors that may crash the application.
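The levels above map directly onto Python's standard `logging` module (which names the highest level `CRITICAL`). A short sketch, using a hypothetical logger name and made-up messages, that sets the threshold to INFO so DEBUG records are dropped:

```python
import io
import logging

buffer = io.StringIO()                       # capture output so it can be inspected
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("demo.service")   # hypothetical logger name
logger.setLevel(logging.INFO)                # anything below INFO is discarded
logger.addHandler(handler)
logger.propagate = False                     # keep records out of the root logger

logger.debug("Cache lookup for key=user:42")     # dropped: below INFO
logger.info("Server started on port 8080")
logger.warning("Retrying database connection")
logger.error("User authentication failed")
logger.critical("Out of memory, shutting down")

lines = buffer.getvalue().splitlines()
```

Raising the level in production (e.g., to WARNING) and lowering it to DEBUG while investigating is a common way to balance log volume against detail.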
Benefits:
Helps troubleshoot issues by providing detailed context.
Useful for auditing and compliance.
Complements metrics and tracing for a holistic view.