Engineering Deep Dive

Log Sampling Without Losing the Logs That Matter

Structured logs. Clear thinking.

Every line of code we write can emit a signal, but not every signal carries the weight of a critical failure. In a microservices architecture, log volume can easily outpace a human operator's ability to parse it in real time. We face a dilemma: suppress the noise to cut costs and reduce alert fatigue, or keep everything so that we never miss the next outage?

The volume problem: why logging everything is not sustainable

When you instrument every request, every database query, and every cache miss, you generate terabytes of data. The problem isn't just storage costs; it's the cognitive load on your team. When an incident occurs, sifting through millions of log lines to find the root cause is a time-consuming process that directly impacts Mean Time to Resolution (MTTR).

Standard centralized logging solutions often require complex pipelines (Kafka clusters, heavy indexing, and expensive storage tiers) to handle this throughput. The result is slower queries and hidden costs that grow with your traffic.

Sampling strategies overview

Choosing the right filter

Random Sampling

Each log event is emitted with a fixed probability, independent of its content. Simple to implement, but blind to importance: an error is dropped at the same rate as routine traffic, so a rare error can go missing entirely.
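As a minimal sketch of the idea, assuming nothing about LogKit's own implementation, a random sampler needs only a rate and a random number generator. The class and method names here are hypothetical.

```python
import random
from typing import Optional


class RandomSampler:
    """Keep each event independently with probability `rate` (hypothetical helper)."""

    def __init__(self, rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self._rng = random.Random(seed)  # private RNG so a seed gives repeatable runs

    def should_emit(self) -> bool:
        # random() is uniform on [0.0, 1.0), so this is True with probability `rate`.
        return self._rng.random() < self.rate


sampler = RandomSampler(rate=0.01, seed=42)
kept = sum(sampler.should_emit() for _ in range(100_000))
print(f"kept {kept} of 100000 events")  # close to 1,000; the exact count depends on the seed
```

The decision never looks at the event itself, which is why an error has no better chance of surviving than a routine request.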

Rate-Limited Sampling

Emits at most a fixed number of logs per second, regardless of traffic volume. This caps throughput and cost, but during traffic spikes a growing share of events is dropped, which loses context exactly when load is unusual.
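One simple way to sketch rate limiting is a fixed one-second window with a budget. This is an illustrative assumption, not LogKit's mechanism; the names are hypothetical, and the clock is injectable so the behavior can be shown without waiting.

```python
import time
from typing import Callable


class RateLimitedSampler:
    """Emit at most `per_second` events in each one-second window (hypothetical helper)."""

    def __init__(self, per_second: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.per_second = per_second
        self._clock = clock
        self._window_start = clock()
        self._emitted = 0

    def should_emit(self) -> bool:
        now = self._clock()
        if now - self._window_start >= 1.0:
            # A new window begins: reset the budget.
            self._window_start = now
            self._emitted = 0
        if self._emitted < self.per_second:
            self._emitted += 1
            return True
        return False  # budget spent, so the event is dropped


# Simulate a burst of 1,000 events inside one second using a fake clock.
fake_now = [0.0]
limiter = RateLimitedSampler(per_second=100, clock=lambda: fake_now[0])
burst_kept = sum(limiter.should_emit() for _ in range(1000))
print(burst_kept)  # 100: the other 900 events in the burst are dropped
```

The budget is spent on whichever events arrive first in the window, so the later part of a spike is invisible.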

Head-Based Sampling

Keeps the first N events of a log stream. Useful for debugging the start of a session, but it records nothing about how the stream ends, so it cannot help with failures or latency that surface late.

Tail-Based Sampling

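Keeps the last N events of a stream. Excellent for diagnosing how a request ended, but it can miss the events where the problem began. (In distributed tracing, "head-based" and "tail-based" usually describe when the keep-or-drop decision is made: at the start of a trace or after it completes. Here the terms refer to position within a log stream.)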
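Both positional strategies, as described here, fit in a few lines using the standard library. The function names are hypothetical and the sketch operates on an in-memory stream.

```python
from collections import deque
from itertools import islice


def head_sample(events, n):
    """Keep the first n events of a stream."""
    return list(islice(events, n))


def tail_sample(events, n):
    """Keep the last n events of a stream using a bounded buffer."""
    # A deque with maxlen discards the oldest item whenever a new one arrives.
    return list(deque(events, maxlen=n))


stream = [f"event-{i}" for i in range(10)]
print(head_sample(stream, 3))  # ['event-0', 'event-1', 'event-2']
print(tail_sample(stream, 3))  # ['event-7', 'event-8', 'event-9']
```

The tail variant only ever holds n events in memory, which is what makes it practical for long-running streams.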

Adaptive Sampling

Dynamically adjusts the sample rate based on system health indicators like error rates or latency percentiles. Keeps high-value logs during anomalies.
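A rough sketch of the idea, assuming the health indicator is the error fraction over a sliding window of recent events. The class name, parameter names, and the two-level rate are illustrative assumptions rather than LogKit behavior.

```python
from collections import deque


class ErrorRateAdaptiveRate:
    """Pick a sample rate from the error fraction of the last `window` events."""

    def __init__(self, baseline=0.01, max_rate=1.0, error_threshold=0.05, window=1000):
        self.baseline = baseline
        self.max_rate = max_rate
        self.error_threshold = error_threshold
        self._recent = deque(maxlen=window)  # 1 for an error event, 0 otherwise

    def observe(self, is_error: bool) -> None:
        # Every event is observed, including ones that will not be emitted.
        self._recent.append(1 if is_error else 0)

    def current_rate(self) -> float:
        if not self._recent:
            return self.baseline
        error_fraction = sum(self._recent) / len(self._recent)
        return self.max_rate if error_fraction > self.error_threshold else self.baseline


policy = ErrorRateAdaptiveRate(window=100)
for _ in range(100):
    policy.observe(is_error=False)
print(policy.current_rate())  # 0.01 while the window is healthy

for _ in range(10):
    policy.observe(is_error=True)
print(policy.current_rate())  # 1.0 once errors exceed 5% of the window
```

A production policy would more likely interpolate between the two rates and add hysteresis so the rate does not flap around the threshold.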

Key-Value Sampling

Filters based on specific attributes (e.g., sampling only requests with HTTP 5xx or a specific user ID).
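Attribute-based filtering reduces to a predicate over the event's fields. In this sketch the field names `status` and `user_id` and the watched user ID are hypothetical.

```python
from typing import Any, Dict, FrozenSet

WATCHED_USERS: FrozenSet[str] = frozenset({"user-123"})  # hypothetical user ID


def should_emit(event: Dict[str, Any]) -> bool:
    """Keep events with an HTTP 5xx status or from a watched user."""
    status = event.get("status", 0)
    if isinstance(status, int) and 500 <= status <= 599:
        return True
    return event.get("user_id") in WATCHED_USERS


print(should_emit({"status": 503, "user_id": "user-9"}))    # True: server error
print(should_emit({"status": 200, "user_id": "user-123"}))  # True: watched user
print(should_emit({"status": 200, "user_id": "user-9"}))    # False: matches no rule
```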

The danger of naive sampling: what you miss when errors are rare events

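The most common mistake is applying a static random sampling rate (e.g., 1%) across all traffic. In a stable system, the vast majority of logs are benign, and errors are rare events. If an error occurs in only 0.1% of requests, a 1% sampling rate retains roughly one error log per 100,000 requests and drops 99% of error occurrences. A failure that happens only a handful of times will most likely never appear in your logs: with 10 occurrences, the chance that all of them are dropped is 0.99^10, or about 90%.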
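The effect of a static rate on rare errors can be checked with a few lines of arithmetic. The helper names are hypothetical, and the calculation assumes each event is sampled independently.

```python
def expected_kept(occurrences: int, sample_rate: float) -> float:
    """Expected number of occurrences that survive sampling."""
    return occurrences * sample_rate


def prob_all_missed(occurrences: int, sample_rate: float) -> float:
    """Probability that every occurrence is dropped, assuming independent sampling."""
    return (1.0 - sample_rate) ** occurrences


requests = 1_000_000
errors = requests // 1000  # an error rate of 0.1% gives 1,000 errors

print(round(expected_kept(errors, 0.01), 2))  # 10.0 error logs kept out of 1,000
print(round(prob_all_missed(10, 0.01), 3))    # 0.904 for an error seen only 10 times
print(round(prob_all_missed(100, 0.01), 3))   # 0.366 for an error seen 100 times
```

The rarer the failure, the more likely it is to vanish completely, which is the opposite of what an operator wants.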

This creates a false sense of security. You might believe your system is healthy when it is actually failing silently for a small subset of users. Adaptive sampling mitigates this by detecting anomalies and raising the sample rate when error rates rise, so that you capture more detail when it matters most. For this to work, the anomaly signal has to be computed from all events (or from metrics), not only from the logs that survived sampling.

[Figure: LogKit adaptive sampler visualization showing the sample rate increasing during error spikes]

How it works

LogKit's adaptive sampler

LogKit's SDK exposes a Sampler interface that hooks into your logging pipeline. It doesn't just drop logs; it tracks context. When an error threshold is breached, the sampler automatically increases its capture probability for subsequent events within the same correlation context.
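The article does not show the Sampler interface itself, so the following is only a sketch of the context-tracking idea under simplifying assumptions: a single error event counts as breaching the threshold, `trace_id` is the correlation key, and the class, method, and field names are all hypothetical.

```python
import random
from typing import Any, Dict, Optional, Set


class ContextBoostSampler:
    """Raise the capture probability for a correlation context after it logs an error."""

    def __init__(self, baseline: float = 0.01, boosted: float = 1.0,
                 seed: Optional[int] = None) -> None:
        self.baseline = baseline
        self.boosted = boosted
        self._hot: Set[Any] = set()  # contexts that have seen an error (unbounded in this sketch)
        self._rng = random.Random(seed)

    def should_emit(self, event: Dict[str, Any]) -> bool:
        ctx = event.get("trace_id")
        if event.get("level") == "error":
            self._hot.add(ctx)  # later events in this context use the boosted rate
            return True         # the error itself is always kept
        rate = self.boosted if ctx in self._hot else self.baseline
        return self._rng.random() < rate


# Extreme rates make the demo deterministic.
sampler = ContextBoostSampler(baseline=0.0, boosted=1.0)
print(sampler.should_emit({"trace_id": "t1", "level": "info"}))   # False: baseline drops it
print(sampler.should_emit({"trace_id": "t1", "level": "error"}))  # True: errors are kept
print(sampler.should_emit({"trace_id": "t1", "level": "info"}))   # True: t1 is now boosted
print(sampler.should_emit({"trace_id": "t2", "level": "info"}))   # False: t2 is unaffected
```

A real implementation would also need to expire old contexts so the tracked set does not grow without bound.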

Under the hood, the sampler uses a token bucket algorithm modified for anomaly detection. It maintains a baseline sample rate and adds "burst" capacity when specific metrics (such as HTTP 500 status codes or P99 latency) exceed configured thresholds. This lets you capture the full context of a failure storm without permanently bloating your storage.
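LogKit's internals are not shown here, so the following is only one plausible reading of a token bucket with burst capacity. It assumes the anomaly flag is set by some external detector; the class name, the `burst_multiplier` parameter, and the choice to scale both capacity and refill rate are illustrative assumptions.

```python
class BurstTokenBucket:
    """Token bucket whose capacity and refill rate rise while an anomaly is active."""

    def __init__(self, base_rate: float, base_capacity: float,
                 burst_multiplier: float = 10.0) -> None:
        self.base_rate = base_rate          # tokens (log events) added per second
        self.base_capacity = base_capacity  # maximum tokens held in normal operation
        self.burst_multiplier = burst_multiplier
        self.anomaly = False
        self._tokens = base_capacity
        self._last = 0.0

    def set_anomaly(self, active: bool) -> None:
        # Called by whatever detects threshold breaches (assumed, not shown).
        self.anomaly = active

    def allow(self, now: float) -> bool:
        scale = self.burst_multiplier if self.anomaly else 1.0
        capacity = self.base_capacity * scale
        # Refill for the elapsed time, never exceeding the current capacity.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(capacity, self._tokens + elapsed * self.base_rate * scale)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


bucket = BurstTokenBucket(base_rate=10, base_capacity=10)
print(sum(bucket.allow(0.0) for _ in range(100)))  # 10: normal budget
bucket.set_anomaly(True)
print(sum(bucket.allow(1.0) for _ in range(200)))  # 100: burst budget during the anomaly
bucket.set_anomaly(False)
print(sum(bucket.allow(2.0) for _ in range(100)))  # 10: back to baseline
```

Because the extra capacity exists only while the anomaly flag is set, storage grows during the failure storm and returns to the baseline afterwards.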

Configuration

YAML examples

Configure adaptive policies directly in your application config. Here is how to set up a policy that samples aggressively during latency spikes.

logkit.yaml

sampler:
  type: adaptive
  baseline_rate: 0.01 # 1% base rate
  spike_threshold: 0.05 # Spike trigger, described as "P99 > 500ms"; confirm the unit this key expects
  max_rate: 1.0 # Cap at 100% during spikes
  context_keys:
    - trace_id
    - user_id

Alternatively, for a simpler "keep the last N events" approach:

logkit.yaml

sampler:
  type: tail
  window_size: 100 # Keep last 100 events per stream
  context_keys:
    - request_id
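Conceptually, a tail policy keyed on `request_id` keeps one bounded buffer per stream. The sketch below illustrates that idea only; the class and its `flush` method are hypothetical and do not describe how LogKit stores events.

```python
from collections import defaultdict, deque


class PerStreamTailBuffer:
    """Keep the last `window_size` events for each request_id (hypothetical helper)."""

    def __init__(self, window_size: int = 100) -> None:
        # Each stream gets its own bounded deque; old events fall off the front.
        self._buffers = defaultdict(lambda: deque(maxlen=window_size))

    def add(self, request_id: str, event: str) -> None:
        self._buffers[request_id].append(event)

    def flush(self, request_id: str) -> list:
        """Return and discard the buffered events for one stream."""
        return list(self._buffers.pop(request_id, ()))


buf = PerStreamTailBuffer(window_size=3)
for i in range(5):
    buf.add("req-1", f"step-{i}")
buf.add("req-2", "step-0")

print(buf.flush("req-1"))  # ['step-2', 'step-3', 'step-4']
print(buf.flush("req-2"))  # ['step-0']
print(buf.flush("req-1"))  # []: already flushed
```

Memory use is bounded per stream but not across streams, so a real deployment would also need to evict buffers for requests that never finish.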

Observability

Monitoring your sampler

You must verify that your sampling configuration is actually working as intended. LogKit provides built-in metrics for the sampler itself.

Sampled rate (avg): 99.2%

Dropped events: 0.8%

Spikes detected today: 14

Sampler overhead: 0.4ms
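If you want to cross-check figures like these from inside your own service, a pair of counters is enough to derive the sampled rate and the drop count. This sketch is independent of LogKit's built-in metrics, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SamplerStats:
    """Count sampler decisions so the effective rate can be compared with the configured one."""

    seen: int = 0
    emitted: int = 0

    def record(self, emitted: bool) -> None:
        self.seen += 1
        if emitted:
            self.emitted += 1

    @property
    def dropped(self) -> int:
        return self.seen - self.emitted

    @property
    def sampled_rate(self) -> float:
        # Guard against division by zero before any event has been seen.
        return self.emitted / self.seen if self.seen else 0.0


stats = SamplerStats()
for i in range(1000):
    stats.record(emitted=(i % 100 == 0))  # pretend 1 in 100 events was emitted

print(stats.sampled_rate)  # 0.01
print(stats.dropped)       # 990
```

A sampled rate that drifts far from the configured baseline outside of detected spikes is a sign that the policy is not behaving as intended.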