What Monitoring Systems See

Zekari
Tags: Logging, Monitoring, Stripe, Webhook, Production, Security

Production logs showed 47 errors in the last hour. The monitoring system sent alerts. The on-call engineer woke up at 3 AM.

None of them were real errors.

Test webhooks triggering validation failures. Expected business logic rejections. All logged as console.error. The monitoring system treats them all the same: something is broken, wake someone up.

This is not a monitoring problem. This is a logging problem.

When Everything is an Error

Stripe sends test webhooks during development. They contain minimal data—no metadata, no customer email, sometimes no subscription ID. The system validates these fields and rejects the webhook. This is expected behavior.

But the log says:

console.error("api-gateway: Missing user_id or plan_id in Stripe session metadata.");

Error level. The monitoring system increments the error counter. If this happens enough times, it triggers an alert.

A user's email doesn't match the payment session email. Security validation working as designed. The webhook is rejected. The log says:

console.error(
  `api-gateway: 🚨 Email mismatch detected! User: ${userId}, Stripe: ${stripeEmail}`
);

Error level again. Another increment to the error counter.

The monitoring system can't distinguish between:

  • Database connection failure (real error, needs immediate attention)
  • Test webhook missing metadata (expected behavior, no action needed)
  • Email validation rejection (security working correctly, maybe needs review)

They're all errors in the logs. They're all the same to the monitoring system.
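
To make this concrete, here is a minimal sketch of what a level-based monitor effectively does. The `shouldAlert` function, the entry shape, and the threshold of 3 are assumptions for illustration, not the actual monitoring system's implementation.

```javascript
// Hypothetical stand-in for a log-based monitor: it sees only the level of each entry.
const ALERT_THRESHOLD = 3; // assumed threshold, not taken from the real system

function shouldAlert(entries, threshold = ALERT_THRESHOLD) {
  // Count entries by level only; the message text is never inspected.
  const errorCount = entries.filter((entry) => entry.level === "error").length;
  return { errorCount, alert: errorCount >= threshold };
}

const lastHour = [
  { level: "error", message: "Database connection failed" },
  { level: "error", message: "Missing user_id or plan_id in Stripe session metadata." },
  { level: "error", message: "Email mismatch detected" },
];

const result = shouldAlert(lastHour);
console.log(result); // { errorCount: 3, alert: true }
```

Only one of the three entries is a real failure, but the count that drives the alert is 3.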

When most alerts are false positives, the real ones get missed. Engineers start ignoring alerts. Response time increases. Actual incidents get delayed.

This isn't a people problem. It's a signal-to-noise problem. If 90% of your error logs aren't errors, your alerts are effectively useless.

The Other Problem: Leaking What Shouldn't Be Logged

An investigation of the error levels turned up something worse in the production logs:

console.log('🔍 [Stripe Webhook] checkout.session.completed received:', {
  sessionId: object.id,
  metadata: object.metadata,
  customer: object.customer,
  subscription: object.subscription,
  customer_email: object.customer_email,
  customer_details_email: object.customer_details?.email,
});

This is a debug log. It prints the entire Stripe session object. Including:

  • Customer email addresses
  • Payment intent IDs
  • Customer IDs
  • Billing details

In production. In logs that are indexed, searchable, and retained for weeks.

The intent was debugging. The result was a privacy leak.

Debug logs are helpful during development. They show you what's happening. But in production, they become a liability. Every piece of sensitive data logged is a compliance risk and a security vulnerability.

Production logs should contain:

  • Event types (what happened)
  • Identifiers (which resource, but not sensitive details)
  • Outcomes (success/failure)
  • Timing (when it happened)

Production logs should NOT contain:

  • Customer personal information (emails, names, addresses)
  • Payment details (payment intent IDs, customer IDs, detailed transaction records)
  • Authentication tokens (JWT, session IDs, API keys)
  • Full request/response bodies (unless you're certain they're safe)

When debugging production issues, use distributed tracing with proper redaction. Don't print everything to logs.
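
One way to enforce the two lists above is to build log entries from an allowlist instead of passing objects through. The sketch below is an assumption about how that could look: `toSafeLogEntry` and its field names are hypothetical, and whether a session ID counts as a safe identifier depends on your own compliance rules.

```javascript
// Hypothetical helper: build a production-safe log entry from a Stripe
// checkout session by copying only allowlisted, non-sensitive facts.
function toSafeLogEntry(eventType, session, outcome) {
  const metadata = session.metadata || {};
  return {
    event: eventType, // what happened
    sessionId: session.id, // which resource (assumed safe to log)
    hasMetadata: Object.keys(metadata).length > 0, // presence only, never values
    hasCustomerEmail: Boolean(
      session.customer_email || session.customer_details?.email
    ),
    outcome, // e.g. "accepted" or "rejected"
    at: new Date().toISOString(), // when it happened
  };
}

// Made-up sample data, for illustration only.
const sampleSession = {
  id: "cs_test_123",
  customer: "cus_123",
  customer_email: "person@example.com",
  metadata: {},
};

const entry = toSafeLogEntry("checkout.session.completed", sampleSession, "rejected");
console.log("stripe-webhook", entry);
```

The entry records that an email was present without recording the email itself, which is usually enough to diagnose a rejected webhook.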

The Semantic Boundary of Log Levels

The fix isn't just removing debug logs or changing a few console.error calls to console.warn. The fix is understanding what each level means.

console.error - System Failure

Use this only for conditions that require immediate human intervention:

  • Database connection lost
  • External service unreachable after retries
  • Critical RPC call failed
  • Unhandled exceptions that crash request handling

These are conditions where the system cannot fulfill its core function. They should be rare. When they happen, someone should be alerted.

console.warn - Business Validation Failure

Use this for conditions that are rejected by business rules but don't indicate system malfunction:

  • Email mismatch in payment verification
  • User not found during webhook processing
  • Duplicate order detection
  • Rate limit exceeded

These need attention but not immediate action. They should be monitored for patterns (e.g., sudden increase in email mismatches might indicate an attack), but individual occurrences are expected.
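
Monitoring warnings for patterns could look something like the sliding-window counter below. This is a sketch under assumptions: `createSpikeDetector`, the one-minute window, and the threshold of 3 are all illustrative, and a real setup would more likely use the alerting rules of the monitoring platform itself.

```javascript
// Hypothetical sliding-window counter for warn-level events.
// windowMs and threshold are illustrative values, not tuned recommendations.
function createSpikeDetector(windowMs, threshold) {
  const timestamps = [];
  return function record(nowMs) {
    timestamps.push(nowMs);
    // Drop events that have fallen out of the window.
    while (timestamps.length > 0 && timestamps[0] <= nowMs - windowMs) {
      timestamps.shift();
    }
    // True when the number of recent events reaches the threshold.
    return timestamps.length >= threshold;
  };
}

const recordEmailMismatch = createSpikeDetector(60000, 3);

console.log(recordEmailMismatch(0)); // false
console.log(recordEmailMismatch(10000)); // false
console.log(recordEmailMismatch(20000)); // true: three mismatches within one minute
console.log(recordEmailMismatch(200000)); // false: the earlier events have expired
```

A single mismatch stays quiet. Only a burst crosses the threshold, which matches the idea that individual occurrences are expected.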

console.log - Expected Behavior

Use this for normal operation, including expected rejections:

  • Test webhooks missing required fields
  • Routine validation passes
  • Successful processing steps
  • Configuration-driven behavior

These are for troubleshooting and auditing, not alerting.
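
The three levels above can be encoded once so that call sites choose a category, not a console method. The sketch below is hypothetical: the category names and the `logEvent` helper are assumptions, not part of the original codebase.

```javascript
// Hypothetical mapping from event category to console method.
const LEVEL_BY_KIND = new Map([
  ["system_failure", "error"], // needs immediate human intervention
  ["business_rejection", "warn"], // rejected by business rules, watch for patterns
  ["expected", "log"], // normal operation, including expected rejections
]);

function levelFor(kind) {
  const level = LEVEL_BY_KIND.get(kind);
  if (level === undefined) {
    // Fail loudly so that new categories get a deliberate level.
    throw new Error(`Unknown log kind: ${kind}`);
  }
  return level;
}

function logEvent(kind, message) {
  const level = levelFor(kind);
  console[level](message); // dispatches to console.error, console.warn, or console.log
  return level;
}

logEvent("expected", "Test webhook missing user_id or plan_id; skipping.");
logEvent("business_rejection", "Email mismatch in payment verification.");
logEvent("system_failure", "Database connection lost.");
```

Forcing every call through a named category turns "which level?" into "what kind of event is this?", which is the question the sections above are really asking.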

Before:

// Test webhook with no metadata
console.error("Missing user_id or plan_id in Stripe session metadata.");
// → Triggers error alert

// Email validation rejection
console.error(`Email mismatch detected! User: ${email}, Stripe: ${stripeEmail}`);
// → Triggers error alert

// Debug log in production
console.log('Full session object:', JSON.stringify(session, null, 2));
// → Leaks customer email, payment IDs

After:

// Test webhook with no metadata
console.log("Missing user_id or plan_id in Stripe session metadata.");
// → Normal log, no alert

// Email validation rejection
console.warn(`Email mismatch detected! User: ${userId}, Order: ${orderId}`);
// → Warning level, monitored for patterns

// No debug logs in production
// Removed entirely

Result:

  • Error count dropped significantly (only real failures remain)
  • Only system failures trigger alerts, not validation failures
  • No sensitive data in logs
  • Monitoring system becomes useful again

Monitoring Systems Are Literal

Traditional log-based monitoring systems primarily rely on log levels, not semantic context. They count error logs and trigger alerts when thresholds are exceeded. If you log expected behavior as errors, you're training your monitoring system to cry wolf.

The problem isn't the monitoring system. The problem is what we're telling it.

Every log statement is an input to your monitoring infrastructure. console.error tells the system "something is broken." If you use it for test webhooks and validation failures, you're saying those are breakages.

The monitoring system believes you. It alerts. Engineers respond. They find nothing broken. This repeats until alert fatigue sets in.

Then when something actually breaks, the alert comes, and everyone assumes it's another false positive.

The Fix is Simple, The Discipline is Hard

Fixing these issues took 7 changes across 2 files:

  • 2 debug log removals
  • 4 console.error → console.warn or console.log changes
  • 1 missing validation check

The changes are straightforward. The hard part is maintaining the discipline:

  • Before using console.error, ask: "Is this a system failure or a business validation?"
  • Before logging any object, ask: "Does this contain sensitive data?"
  • Before shipping to production, review logs for information leakage
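
The second question can be partly automated. The following sketch scans an object for key names that look sensitive before it is logged. The `findSensitivePaths` helper and its pattern are assumptions for illustration: the pattern is deliberately over-inclusive, it is not a complete inventory, and the function assumes plain, acyclic objects.

```javascript
// Hypothetical pre-log check: list the paths in an object whose key names look sensitive.
// The pattern is an illustrative assumption, not a complete inventory.
const SENSITIVE_KEY = /email|name|address|token|secret|password|api[_-]?key/i;

function findSensitivePaths(value, path = "") {
  // Primitives and null have no keys to inspect.
  if (value === null || typeof value !== "object") {
    return [];
  }
  const found = [];
  for (const [key, child] of Object.entries(value)) {
    const childPath = path ? `${path}.${key}` : key;
    if (SENSITIVE_KEY.test(key)) {
      found.push(childPath); // flag the key and skip its children
    } else {
      found.push(...findSensitivePaths(child, childPath));
    }
  }
  return found;
}

// Made-up sample data, for illustration only.
const candidate = {
  sessionId: "cs_test_123",
  customer_email: "person@example.com",
  customer_details: { email: "person@example.com" },
};

console.log(findSensitivePaths(candidate));
// [ 'customer_email', 'customer_details.email' ]
```

A check like this could run in a unit test or a logging wrapper. It catches obvious key names only, so it supplements the review step and does not replace it.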

These aren't one-time fixes. They're ongoing practices.

Related: "Defensive Programming in Stripe Webhooks" discusses the code-level validations that prevent errors. This article addresses what happens when you log those validations.

What We Tell the System

Logs are not just for developers. They're inputs to monitoring systems, which make decisions about when to alert humans.

If we log carelessly:

  • Monitoring systems generate false alerts
  • Engineers develop alert fatigue
  • Real incidents get missed
  • Sensitive data gets exposed

If we log intentionally:

  • Monitoring systems accurately identify real failures
  • Alerts have high signal-to-noise ratio
  • Response times improve
  • Compliance risks decrease

The monitoring system sees what we tell it to see. If we tell it that test webhooks are errors, it believes us. If we tell it that email validations are system failures, it believes us.

The question is not "what went wrong?" The question is "what did we tell the system was wrong?"

Often, we're lying to our monitoring systems. Not maliciously, but carelessly. Using console.error because it's there, because it's easy, because "it's just a log."

But logs aren't "just logs." They're the primary input to the system that wakes people up at 3 AM.

Choose your log levels carefully. Your future on-call self will thank you.