Observability Isn’t Just Monitoring Anymore - Here’s Why
If you’ve been on-call during a production incident, you know how quickly things can spiral. Metrics say one thing, logs tell half the story, and traces? Maybe they’re not even configured. When you’re trying to pinpoint causality across microservices in a distributed system, relying on a single pillar of information is like debugging with one eye closed.
That’s where modern observability practices come in.
Observability goes beyond traditional monitoring by pulling together logs, metrics, and traces - the three pillars - to give you a real-time, high-fidelity picture of your systems. What’s surprising to many teams is how seamlessly this can be done using the LGTM stack - short for Loki, Grafana, Tempo, and Mimir.
In this guide, I’ll show you how to harness the full power of these tools - integrated with OpenTelemetry - to build robust observability pipelines, connect logs to traces, define SLIs and SLOs that matter, and most importantly, build intuitive and actionable dashboards. Whether you’re a seasoned SRE or a DevOps engineer mid-journey, this one’s for you.
Why Traditional Monitoring Comes Up Short
Monitoring tools of the past were built to track infrastructure: CPU usage, memory allocation, disk I/O. That was fine when we were deploying monoliths onto a handful of VMs.
Today? We’re dealing with polyglot microservices, container orchestration layers, event-driven transactions, and third-party APIs. Metrics still play a role, but alone, they’re not enough.
Observability enables you to ask new questions without having predefined all possible ones. It’s about being able to understand what’s happening inside your system just by looking at its outputs - logs, metrics, and traces.
Let’s briefly define these:
- Metrics: Structured numeric data over time, like request count or memory usage.
- Logs: Unstructured or semi-structured text records of events - the bread and butter of troubleshooting.
- Traces: End-to-end records of how a request travels through services - price calculation, checkout, inventory lookup - span by span.
Integrated properly, these three provide a layered context you just can’t get from traditional tools.
Meet the LGTM Stack
Let’s break down the components that make LGTM the go-to for modern, scalable observability:
Loki – Logs, Simplified and Scalable
Loki is Grafana’s log aggregation system, purpose-built for cloud-native workloads. Unlike something like Elasticsearch (used in ELK), Loki doesn’t index the full text of your logs. Instead, it indexes only a small set of labels - just like Prometheus handles metrics.
What makes Loki a game-changer:
- Cost-effective: Lower storage overhead than full-text indexing
- Label-based filtering: Match logs with metrics seamlessly
- Built for Prometheus users: Feels familiar if you’ve used PromQL
- Promtail integration: Easily ships logs and attaches metadata like pod name and namespace
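To make that concrete, here’s a minimal Promtail scrape config sketch for Kubernetes, assuming Loki is reachable at loki:3100 - real deployments usually start from the Helm chart defaults rather than writing this by hand:

clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki push endpoint

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Promote pod metadata to Loki labels so logs line up with your metrics
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # Tell Promtail where the container log files live on the node
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log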
Grafana – The Visualization Engine
Grafana is at the heart of this stack. It’s not just pretty charts. It’s the platform that brings logs, metrics, and traces together in actionable dashboards.
With Grafana, you get:
- Multi-source queries using PromQL, LogQL, and TraceQL (Tempo’s query language)
- Cross-data-source linking: Click from a metric spike directly into related logs or traces
- Alerting engines with Prometheus-style syntax
- Dashboard annotation with deployment markers, alarms, and trace links
It becomes your home base for observability.
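One way to wire up that home base declaratively is Grafana’s datasource provisioning. A minimal sketch, assuming the stack’s default ports and the hostnames loki, tempo, and mimir:

apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus        # Mimir exposes a Prometheus-compatible API under /prometheus
    access: proxy
    url: http://mimir:8080/prometheus
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200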
Tempo – Tracing Without the Storage Drama
Tempo is a scalable tracing backend that plays well with OpenTelemetry, Jaeger, and Zipkin.
Unlike Jaeger, which traditionally needs a separate storage backend such as Cassandra or Elasticsearch, Tempo writes trace data straight to object storage (like S3). That means fewer moving parts.
Why Tempo sings:
- Trace ingestion at scale (millions per day)
- Minimal indexing - great for cost, offset by trace-ID lookup and TraceQL search
- Tightly integrated with Grafana
- Supports various ingestion protocols (OTLP, Jaeger, Zipkin)
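A minimal Tempo config sketch showing both sides of that story - multi-protocol ingestion and an object-storage backend (bucket name, endpoint, and region are placeholders):

distributor:
  receivers:                 # Tempo reuses OpenTelemetry receiver definitions here
    otlp:
      protocols:
        grpc:
        http:
    jaeger:
      protocols:
        thrift_http:
    zipkin:

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces                 # placeholder bucket
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1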
Mimir – Metrics Storage that Scales with You
Mimir is the long-term storage engine behind Prometheus-style metrics in LGTM. It’s horizontally scalable and multi-tenant - perfect for large teams or organizations.
Key things I love about Mimir:
- Works with raw Prometheus or remote write
- Efficient even with high-cardinality labels (like Kubernetes pod names)
- Built-in compression, compaction, and durable object-storage-backed retention
If you’ve ever had a Prometheus server melt under scale - this is your answer.
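Pointing an existing Prometheus at Mimir is usually just a remote_write block. A sketch, assuming Mimir’s distributor listens on mimir:8080 and multi-tenancy is enabled:

remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      X-Scope-OrgID: team-checkout   # tenant ID; drop this if multi-tenancy is disabled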
OpenTelemetry: Your Observability Secret Weapon
So how do you get tracing, metrics, and logs from your app into the LGTM stack?
Say hello to OpenTelemetry (OTel). It’s the industry-standard open-source framework for instrumenting code and emitting observability signals.
Here’s how you can integrate it:
- Instrumentation: Use OpenTelemetry SDKs in languages like Go, Java, Python, or Node.js to produce spans and metrics.
- Context propagation: It automatically passes trace context across HTTP, gRPC, or message queues.
- Exporters: Send traces to Tempo, metrics to Mimir, logs to Loki - all via the OTel Collector.
Sample OTel Collector Config for Tempo
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317   # Tempo's OTLP gRPC port
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
This makes it easy to start piping trace data into your LGTM pipeline with almost no modifications to your services. Note that there’s no dedicated “tempo” exporter - Tempo speaks OTLP natively, so the Collector’s standard otlp exporter is all you need.
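If you want metrics and logs flowing through the same Collector, you can add pipelines for them too. A sketch using the contrib distribution’s prometheusremotewrite and loki exporters, with endpoints assuming the hostnames used above (newer Loki versions can also ingest OTLP directly):

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:8080/api/v1/push        # Mimir remote-write endpoint
  loki:
    endpoint: http://loki:3100/loki/api/v1/push    # Loki push endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]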
Correlating Logs and Traces for Real Insights
Here’s where the magic happens - and what separates observability pros from dashboard decorators.
Distributed traces show the path of a request. Logs explain what’s happening along the journey.
How to Make Them Talk:
- Inject trace IDs into your logs (trace_id, span_id)
- Use structured logging so Loki can ingest these tags
- In Grafana, set up queries that link a log event back to its trace - or vice versa
Pro tip: use Loki queries like this to zero in on events tied to a specific request. Keep trace_id out of your label set (it’s far too high-cardinality for Loki’s index) and filter on it at query time instead - for JSON-structured logs:
{app="checkout"} | json | trace_id="abc123def456"
This is incredibly useful during incident response. Something went wrong? Start at the trace, jump to the specific logs from the failing span, and boom - you’re in business.
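The log-to-trace jump works because Grafana can extract the trace ID from a log line and turn it into a Tempo link. A sketch of that wiring via datasource provisioning, assuming JSON logs and a Tempo datasource with uid tempo:

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Pull the trace_id out of each log line and link it to the Tempo datasource
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'       # $$ escapes $ in provisioning files
          datasourceUid: tempo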
Defining SLIs and SLOs That Keep You Honest
Big picture observability is about more than dashboards - it’s about accountability to the customer experience.
That’s where SLIs (Service Level Indicators) and SLOs (Service Level Objectives) come in.
What Are They?
- SLIs are measurable signals that reflect service health (e.g., 95th-percentile latency, rate of HTTP 500 errors).
- SLOs are your target performance objectives (e.g., a 99.95% success rate over a rolling 7-day window).
How to Implement in LGTM:
- Use Mimir to track availability, error rates, or latency as Prometheus-style metrics.
- Create alert rules in Grafana tied to SLO breaches.
- Visualize SLO burn over time using dashboards or heatmaps.
- Use error budgets to prioritize engineering work vs. reliability fixes.
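As a concrete example, an availability SLI can be precomputed with a recording rule loaded into the Mimir ruler - a sketch that assumes a counter named http_requests_total with service and code labels:

groups:
  - name: checkout-sli
    rules:
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="checkout", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))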
SLO Alert Expression Example:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
You’d alert if this exceeds, say, 500ms for more than 2 out of 10 minutes.
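Wrapped into a Prometheus-style alerting rule that the Mimir ruler can evaluate, that looks roughly like this - the for: 2m clause is a simpler stand-in for the “2 out of 10 minutes” condition:

groups:
  - name: checkout-slo
    rules:
      - alert: RequestLatencyP99High
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 2m
        labels:
          severity: page
        annotations:
          summary: p99 request latency has been above 500ms for 2 minutes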
Build Dashboards That Do More Than Look Pretty
Here’s what changes when you treat your Grafana dashboards as live instrumentation panels:
- Highlight critical SLIs front and center
- Use templated variables (e.g., service, cluster, region) to explore context
- Embed trace panels and logs right below metrics
- Add annotations for deploys or alerts
- Preview current alerts, not just trends
My Ideal Failure Dashboard
| Panel Name | Type | Why It Matters |
|---|---|---|
| Request Latency | Line Graph | Watch real-user impact |
| Error Rate | Time Series | See spikes quickly |
| Trace Panel | Trace Viewer | Dive into requests fast |
| Recent Logs | Log Stream | View current events |
| Pod CPU & Memory | Gauge/Graph | Spot degraded services |
| Active Alerts | Table | Surface what’s firing |
You want clarity under pressure, not visual noise.
Gotchas to Avoid (And How to Fix Them)
| Problem | Why It Happens | What to Do About It |
|---|---|---|
| Logs don’t show in Grafana | Misconfigured Promtail or missing labels | Check relabeling rules; test with logcli |
| Traces end abruptly | Sampling is too aggressive | Reduce sampling rate or use dynamic logic |
| High dashboard latency | Overly complex queries | Pre-aggregate data; tune Mimir/Loki backends |
| Metrics look flat | No data, or a wrong PromQL expression | Query Prometheus/Mimir directly (e.g., curl the /api/v1/query endpoint) |
| Alert fatigue | SLOs too tight or noisy rules | Define realistic thresholds & group alerts |
Best Practices Checklist
Before rolling into production, make sure you:
- Instrument your apps with OpenTelemetry
- Correlate trace IDs in logs
- Tag logs and metrics using consistent labels (service, env, version)
- Archive long-term metrics in Mimir
- Build dashboards that emphasize failures over vanity metrics
- Define SLIs that truly represent user experience
- Review alert noise quarterly and tune!
- Continuously audit trace fidelity and coverage
Want to Go Deeper?
You’ll find these resources helpful:
- Grafana LGTM Docs
- OpenTelemetry Getting Started
- Prometheus Instrumentation Best Practices
- Google SRE Book – SLIs & SLOs
Final Thoughts: Observability is Culture, Not Just Tools
There’s one truth about building highly available systems: you will never catch every bug in testing.
That’s why observability matters. It’s your runtime x-ray, your postmortem lens, your system intuition.
With the LGTM stack plus OpenTelemetry, you empower your team to diagnose the real root causes, not just guess from noisy alerts.
Start small: instrument a single service, add trace IDs to logs, set up one failure-mode dashboard. Learn, iterate, and expand. The payoff isn’t just faster debugging - it’s customer trust, better sleep during on-call shifts, and systems that scale with confidence.
Stay observant, and your systems will thank you.