Log Sampling Without Losing the Logs That Matter
Structured logs. Clear thinking.
Every line of code we write generates a potential signal, but not every signal carries the weight of a critical failure. In a microservices architecture, the volume of logs can easily outpace a human operator's ability to parse them in real time. We face a dilemma: suppress the noise to save costs and reduce alert fatigue, or keep everything so that we never miss the next outage?
The volume problem: why logging everything is not sustainable
When you instrument every request, every database query, and every cache miss, you generate terabytes of data. The problem isn't just storage costs; it's the cognitive load on your team. When an incident occurs, sifting through millions of log lines to find the root cause is a time-consuming process that directly impacts Mean Time to Resolution (MTTR).
Standard centralized logging solutions often require complex pipelines (Kafka clusters, heavy indexing, and expensive storage tiers) to handle this throughput. The result is slower queries and hidden costs that scale linearly with your traffic.
Choosing the right filter
Random Sampling
Each log event is kept independently with a fixed probability. Simple to implement, but blind to content: a rare error line is dropped with the same probability as routine traffic, so at a 1% rate you lose about 99% of your error events.
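As a minimal sketch of the idea (the function name make_random_sampler is illustrative and not part of any LogKit API), fixed-probability sampling comes down to one comparison per event:

```python
import random

def make_random_sampler(rate, rng=None):
    """Return a function that keeps each event independently with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = rng or random.Random()

    def should_keep(_event):
        # The event's content is ignored: errors and routine lines get the same odds.
        return rng.random() < rate

    return should_keep

keep = make_random_sampler(0.01)
kept = [line for line in ("GET /a 200", "GET /b 500") if keep(line)]
```

Because the decision never inspects the event, nothing protects the 500 line from being dropped.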
Rate-Limited Sampling
Emits at most a fixed number of logs per second, regardless of traffic volume. Caps output volume, but the fraction of events dropped grows with load, so context is lost exactly when traffic spikes.
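One simple way to get this behaviour is a fixed one-second window with a counter. The sketch below assumes that approach; the class name RateLimitedSampler and the injectable clock parameter are hypothetical, chosen to keep the example testable:

```python
import time

class RateLimitedSampler:
    """Keep at most `max_per_second` events in each one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.window_start = clock()
        self.kept_in_window = 0

    def should_keep(self, _event):
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new window begins: reset the budget.
            self.window_start = now
            self.kept_in_window = 0
        if self.kept_in_window < self.max_per_second:
            self.kept_in_window += 1
            return True
        # Budget exhausted: everything else in this window is dropped.
        return False
```

During a spike the budget is spent in the first few milliseconds of each window, and every later event in that second is lost.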
Head-Based Sampling
Keeps the first N events of a log stream. Useful for debugging the start of a session, but blind to anything that happens later in the stream, such as a failure near the end of a long request.
Tail-Based Sampling
Keeps the last N events of a stream. Excellent for diagnosing the end of a request, but misses the initiation of the issue.
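The difference between the two, as defined here, shows up clearly in code. The following sketch uses hypothetical HeadSampler and TailBuffer classes: keeping the first N events needs only a counter, while keeping the last N requires buffering events until the stream ends.

```python
from collections import defaultdict, deque

class HeadSampler:
    """Keep the first `n` events of each stream; decide immediately."""

    def __init__(self, n):
        self.n = n
        self.seen = defaultdict(int)

    def should_keep(self, stream_id):
        self.seen[stream_id] += 1
        return self.seen[stream_id] <= self.n

class TailBuffer:
    """Hold the last `n` events of each stream until the stream is flushed."""

    def __init__(self, n):
        # A deque with maxlen discards its oldest entry when full.
        self.buffers = defaultdict(lambda: deque(maxlen=n))

    def add(self, stream_id, event):
        self.buffers[stream_id].append(event)

    def flush(self, stream_id):
        # Called when the stream ends; returns the retained events in order.
        return list(self.buffers.pop(stream_id, ()))
```

The tail variant costs memory proportional to the number of open streams times N, which is the price of deferring the decision.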
Adaptive Sampling
Dynamically adjusts the sample rate based on system health indicators like error rates or latency percentiles. Keeps high-value logs during anomalies.
Key-Value Sampling
Filters based on specific attributes (e.g., sampling only requests with HTTP 5xx or a specific user ID).
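A minimal sketch of such a filter, assuming events are dictionaries with integer status and string user_id fields (both field names and the helper make_attribute_filter are illustrative assumptions):

```python
def make_attribute_filter(always_keep_user_ids=frozenset()):
    """Return a predicate that keeps HTTP 5xx events and events for chosen users."""

    def should_keep(event):
        status = event.get("status", 0)
        if 500 <= status <= 599:
            # Server errors are always kept.
            return True
        # Otherwise keep only events for explicitly listed users.
        return event.get("user_id") in always_keep_user_ids

    return should_keep

keep = make_attribute_filter(frozenset({"user-42"}))
```

Unlike random sampling, this decision depends entirely on the event's content, so it never drops an event that matches a rule.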
The danger of naive sampling: what you miss when errors are rare events
The most common mistake is applying a static random sampling rate (e.g., 1%) across all traffic. In a stable system, the vast majority of logs are benign, and errors are rare events. If an error occurs in only 0.1% of requests, a 1% sampling rate keeps roughly one log of that error per 100,000 requests and discards the other 99 occurrences. On a low-traffic service, hours can pass before a single instance of the error reaches your logs.
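To put numbers on a 0.1% error rate combined with a 1% sample rate, the short calculation below assumes that errors and sampling decisions are independent. The helper names are illustrative.

```python
def expected_sampled_errors(requests, error_rate, sample_rate):
    """Expected number of error events that survive sampling."""
    return requests * error_rate * sample_rate

def prob_at_least_one(requests, error_rate, sample_rate):
    """Probability that at least one error event survives sampling."""
    # Each request is a sampled error with probability error_rate * sample_rate.
    return 1.0 - (1.0 - error_rate * sample_rate) ** requests

p_100k = prob_at_least_one(100_000, 0.001, 0.01)
p_10k = prob_at_least_one(10_000, 0.001, 0.01)
```

Under these assumptions, 100,000 requests yield about one sampled error on average and only about a 63% chance of seeing any at all; with 10,000 requests the chance falls to roughly 9.5%, even though about ten errors actually occurred.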
This creates a false sense of security. You might believe your system is healthy when it is actually failing silently for a small subset of users. Adaptive sampling addresses this by detecting anomalies and raising the sample rate while error rates are elevated, so failures are captured in far more detail than a static rate allows.
LogKit's adaptive sampler
LogKit's SDK exposes a Sampler interface that hooks into your logging pipeline. It doesn't just drop logs; it tracks context. When an error threshold is breached, the sampler automatically increases its capture probability for subsequent events within the same correlation context.
Under the hood, the sampler uses a Token Bucket algorithm modified for anomaly detection. It maintains a baseline sample rate but creates a "burst" capacity when specific metrics (like 500 status codes or P99 latency) exceed configured thresholds. This allows you to capture the full context of a failure storm without permanently bloating your storage.
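The following is not LogKit's implementation, only a rough sketch of the behaviour described above under simplifying assumptions: the bucket is refilled by anomalies rather than by a clock, anomalies are detected per event rather than from aggregate metrics such as P99, and the class name AdaptiveSampler, its parameters, and the event fields status, latency_ms, and trace_id are all hypothetical.

```python
import random

class AdaptiveSampler:
    """Baseline random sampling plus a per-context burst budget after an anomaly."""

    def __init__(self, baseline_rate=0.01, burst_tokens=50,
                 latency_threshold_ms=500, rng=None):
        self.baseline_rate = baseline_rate
        self.burst_tokens = burst_tokens
        self.latency_threshold_ms = latency_threshold_ms
        self.rng = rng or random.Random()
        self.tokens = {}  # correlation key -> remaining burst tokens

    def _is_anomalous(self, event):
        return (event.get("status", 0) >= 500
                or event.get("latency_ms", 0) > self.latency_threshold_ms)

    def should_keep(self, event):
        # Events without a trace_id all share the None key in this sketch.
        key = event.get("trace_id")
        if self._is_anomalous(event):
            # Refill this context's bucket and keep the triggering event.
            self.tokens[key] = self.burst_tokens
            return True
        remaining = self.tokens.get(key, 0)
        if remaining > 0:
            # Spend one token to keep a follow-up event from the same context.
            if remaining == 1:
                del self.tokens[key]
            else:
                self.tokens[key] = remaining - 1
            return True
        # No active burst: fall back to the baseline probability.
        return self.rng.random() < self.baseline_rate
```

Deleting a bucket once it is empty keeps memory bounded by the number of contexts with an active burst, which is one way to avoid permanently bloating state.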
YAML examples
Configure adaptive policies directly in your application config. Here is how to set up a policy that samples aggressively during latency spikes.
type: adaptive
baseline_rate: 0.01 # 1% base rate
spike_threshold: 0.05 # Spike trigger (e.g., P99 > 500ms); confirm the expected unit for this value
max_rate: 1.0 # Cap at 100% during spikes
context_keys:
- trace_id
- user_id
Alternatively, for a simpler "keep the last N events" approach:
type: tail
window_size: 100 # Keep last 100 events per stream
context_keys:
- request_id
Monitoring your sampler
You must verify that your sampling configuration is actually working as intended. LogKit provides built-in metrics for the sampler itself.
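LogKit's own metric names are not listed here. As a sketch of what such a check amounts to, the hypothetical SamplerStats class below counts kept and dropped events so that the effective sample rate can be compared with the configured one:

```python
class SamplerStats:
    """Count sampling decisions and report the observed keep rate."""

    def __init__(self):
        self.kept = 0
        self.dropped = 0

    def record(self, kept):
        if kept:
            self.kept += 1
        else:
            self.dropped += 1

    @property
    def effective_rate(self):
        total = self.kept + self.dropped
        # Report 0.0 before any decision has been recorded.
        return self.kept / total if total else 0.0

stats = SamplerStats()
for decision in (True, False, False, False):
    stats.record(decision)
```

An effective rate that stays far above the baseline outside of incidents suggests the spike threshold is set too low; one that never rises during an incident suggests the trigger is not firing.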
Stop losing logs in the noise.
Implement adaptive sampling in your stack today with the open-source LogKit SDK.