What Monitoring Systems See
Production logs showed 47 errors in the last hour. The monitoring system sent alerts. The on-call engineer woke up at 3 AM.
None of them were real errors.
Test webhooks triggering validation failures. Expected business logic rejections. All logged as console.error. The monitoring system treats them all the same: something is broken, wake someone up.
This is not a monitoring problem. This is a logging problem.
When Everything is an Error
Stripe sends test webhooks during development. They contain minimal data—no metadata, no customer email, sometimes no subscription ID. The system validates these fields and rejects the webhook. This is expected behavior.
But the log says:
console.error("api-gateway: Missing user_id or plan_id in Stripe session metadata.");
Error level. The monitoring system increments the error counter. If this happens enough times, it triggers an alert.
A user's email doesn't match the payment session email. Security validation working as designed. The webhook is rejected. The log says:
console.error(
  `api-gateway: 🚨 Email mismatch detected! User: ${userId}, Stripe: ${stripeEmail}`
);
Error level again. Another increment to the error counter.
The monitoring system can't distinguish between:
- Database connection failure (real error, needs immediate attention)
- Test webhook missing metadata (expected behavior, no action needed)
- Email validation rejection (security working correctly, maybe needs review)
They're all errors in the logs. They're all the same to the monitoring system.
When most alerts are false positives, the real ones get missed. Engineers start ignoring alerts. Response time increases. Actual incidents get delayed.
This isn't a people problem. It's a signal-to-noise problem. If 90% of your error logs aren't errors, your alerts no longer tell you anything, however well the monitoring system itself works.
The Other Problem: Leaking What Shouldn't Be Logged
While investigating the error levels, we found something worse in the production logs:
console.log('🔍 [Stripe Webhook] checkout.session.completed received:', {
  sessionId: object.id,
  metadata: object.metadata,
  customer: object.customer,
  subscription: object.subscription,
  customer_email: object.customer_email,
  customer_details_email: object.customer_details?.email,
});
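This is a debug log. It prints identifying fields straight from the Stripe session object, and a second debug log dumped the entire session as JSON. Between them, the logs included: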
- Customer email addresses
- Payment intent IDs
- Customer IDs
- Billing details
In production. In logs that are indexed, searchable, and retained for weeks.
The intent was debugging. The result was a privacy leak.
Debug logs are helpful during development. They show you what's happening. But in production, they become a liability. Every piece of sensitive data logged is a compliance risk and a security vulnerability.
Production logs should contain:
- Event types (what happened)
- Identifiers (which resource, but not sensitive details)
- Outcomes (success/failure)
- Timing (when it happened)
Production logs should NOT contain:
- Customer personal information (emails, names, addresses)
- Payment details (payment intent IDs, customer IDs, detailed transaction records)
- Authentication tokens (JWT, session IDs, API keys)
- Full request/response bodies (unless you're certain they're safe)
When debugging production issues, use distributed tracing with proper redaction. Don't print everything to logs.
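One way to apply the two lists above is to build log records from an explicit allowlist of fields instead of printing objects wholesale. The sketch below is illustrative only: `toSafeLogRecord` and its field choices are hypothetical, not taken from the production code, and whether a given identifier is safe to log depends on your own compliance rules.

```javascript
// Hypothetical helper: build a log record from an explicit allowlist of fields.
// Nothing from the session is copied unless it is named here.
function toSafeLogRecord(eventType, session, outcome) {
  const metadata = session.metadata || {};
  return {
    event: eventType,              // what happened
    sessionId: session.id,         // which resource
    outcome,                       // e.g. "success", "rejected", "failed"
    at: new Date().toISOString(),  // when it happened
    // For sensitive fields, record presence only, never the value.
    hasCustomerEmail: Boolean(session.customer_email),
    hasMetadata: Object.keys(metadata).length > 0,
  };
}

// Example with made-up data.
const exampleSession = {
  id: "cs_test_123",
  customer_email: "person@example.com",
  customer: "cus_abc",
  metadata: {},
};

console.log(
  "stripe-webhook",
  JSON.stringify(toSafeLogRecord("checkout.session.completed", exampleSession, "rejected"))
);
```

The record still answers "what happened, to which session, and when", but the email address and customer ID never reach the log pipeline.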
The Semantic Boundary of Log Levels
The fix isn't just removing debug logs or changing some error to warn. The fix is understanding what each level means.
console.error - System Failure
Use this only for conditions that require immediate human intervention:
- Database connection lost
- External service unreachable after retries
- Critical RPC call failed
- Unhandled exceptions that crash request handling
These are conditions where the system cannot fulfill its core function. They should be rare. When they happen, someone should be alerted.
console.warn - Business Validation Failure
Use this for conditions that are rejected by business rules but don't indicate system malfunction:
- Email mismatch in payment verification
- User not found during webhook processing
- Duplicate order detection
- Rate limit exceeded
These need attention but not immediate action. They should be monitored for patterns (e.g., sudden increase in email mismatches might indicate an attack), but individual occurrences are expected.
console.log - Expected Behavior
Use this for normal operation, including expected rejections:
- Test webhooks missing required fields
- Routine validation passes
- Successful processing steps
- Configuration-driven behavior
These are for troubleshooting and auditing, not alerting.
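The three levels above can be captured in a small helper, so the level follows from the kind of outcome and not from how alarming the message feels. This is a minimal sketch: the category names and `logOutcome` are hypothetical and only mirror the definitions in this section.

```javascript
// Hypothetical outcome categories, mirroring the three levels described above.
const LEVEL_BY_KIND = {
  system_failure: "error",     // cannot fulfil core function: alert someone
  business_rejection: "warn",  // rejected by business rules: watch for patterns
  expected: "log",             // normal operation, including expected rejections
};

// Pick the console method from the kind of outcome.
function logOutcome(kind, message, context = {}) {
  const level = LEVEL_BY_KIND[kind];
  if (!level) {
    throw new Error(`Unknown outcome kind: ${kind}`);
  }
  console[level](message, JSON.stringify(context));
  return level;
}

// Example calls with made-up identifiers.
logOutcome("expected", "Webhook missing user_id or plan_id in metadata", { sessionId: "cs_test_123" });
logOutcome("business_rejection", "Email mismatch in payment verification", { userId: "user_42" });
logOutcome("system_failure", "Database connection lost", { attempt: 3 });
```

Forcing every call site to name a category makes the question "is this a system failure or a business validation?" part of writing the log line.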
Before:
// Test webhook with no metadata
console.error("Missing user_id or plan_id in Stripe session metadata.");
// → Triggers error alert
// Email validation rejection
console.error(`Email mismatch detected! User: ${email}, Stripe: ${stripeEmail}`);
// → Triggers error alert
// Debug log in production
console.log('Full session object:', JSON.stringify(session, null, 2));
// → Leaks customer email, payment IDs
After:
// Test webhook with no metadata
console.log("Missing user_id or plan_id in Stripe session metadata.");
// → Normal log, no alert
// Email validation rejection
console.warn(`Email mismatch detected! User: ${userId}, Order: ${orderId}`);
// → Warning level, monitored for patterns
// No debug logs in production
// Removed entirely
Result:
- Error count dropped significantly (only real failures remain)
- Only system failures trigger alerts, not validation failures
- No sensitive data in logs
- Monitoring system becomes useful again
Monitoring Systems Are Literal
Traditional log-based monitoring systems primarily rely on log levels, not semantic context. They count error logs and trigger alerts when thresholds are exceeded. If you log expected behavior as errors, you're training your monitoring system to cry wolf.
The problem isn't the monitoring system. The problem is what we're telling it.
Every log statement is an input to your monitoring infrastructure. console.error tells the system "something is broken." If you use it for test webhooks and validation failures, you're saying those are breakages.
The monitoring system believes you. It alerts. Engineers respond. They find nothing broken. This repeats until alert fatigue sets in.
Then when something actually breaks, the alert comes, and everyone assumes it's another false positive.
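The literal behavior described in this section can be modeled in a few lines. This is a toy sketch, not how any particular monitoring product works: `shouldAlert`, the entry shape, and the threshold are all assumptions made for illustration.

```javascript
// Toy model of a threshold alert: it sees only the level, never the meaning.
function shouldAlert(entries, threshold) {
  const errorCount = entries.filter((entry) => entry.level === "error").length;
  return errorCount >= threshold;
}

// Hypothetical entries: three expected rejections, all logged at error level.
const mislabelled = [
  { level: "error", message: "Missing user_id or plan_id" },
  { level: "error", message: "Email mismatch detected" },
  { level: "error", message: "Missing user_id or plan_id" },
];

// The same three events, logged at the level their meaning calls for.
const relabelled = [
  { level: "log", message: "Missing user_id or plan_id" },
  { level: "warn", message: "Email mismatch detected" },
  { level: "log", message: "Missing user_id or plan_id" },
];

console.log(shouldAlert(mislabelled, 3)); // true: someone gets paged
console.log(shouldAlert(relabelled, 3));  // false: nothing is actually broken
```

Nothing about the events changed between the two arrays. Only the level did, and the level is all the alert rule looks at.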
The Fix is Simple, The Discipline is Hard
Fixing these issues took 7 changes across 2 files:
- 2 debug log removals
- 4 console.error → console.warn or console.log changes
- 1 missing validation check
The changes are straightforward. The hard part is maintaining the discipline:
- Before using console.error, ask: "Is this a system failure or a business validation?"
- Before logging any object, ask: "Does this contain sensitive data?"
- Before shipping to production, review logs for information leakage
These aren't one-time fixes. They're ongoing practices.
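Part of that review can be automated. The sketch below scans log lines for a couple of obviously sensitive patterns. `findLeaks` and the patterns are hypothetical and deliberately rough: they will miss things and may flag harmless text, so treat them as a starting point, not a guarantee.

```javascript
// Rough patterns for data that should not reach production logs.
// Illustrative only: not exhaustive, and tuned for nothing in particular.
const SENSITIVE_PATTERNS = [
  // Anything shaped like local@domain.tld
  { name: "email", pattern: /[^\s@"']+@[^\s@"']+\.[^\s@"']+/ },
  // Three dot-separated base64url segments starting with "eyJ" (typical JWT shape)
  { name: "jwt", pattern: /eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/ },
];

// Return the names of every pattern found in a log line.
function findLeaks(line) {
  return SENSITIVE_PATTERNS
    .filter(({ pattern }) => pattern.test(line))
    .map(({ name }) => name);
}

// The original message leaks an address; the rewritten one does not.
console.log(findLeaks("Email mismatch detected! User: a@example.com"));
console.log(findLeaks("Email mismatch detected! User: user_42, Order: order_7"));
```

Run against a sample of captured log output before a release, a check like this catches the most common slip: a raw email address interpolated into a message.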
Related: Defensive Programming in Stripe Webhooks discusses the code-level validations that prevent errors. This article addresses what happens when you log those validations.
What We Tell the System
Logs are not just for developers. They're inputs to monitoring systems, which make decisions about when to alert humans.
If we log carelessly:
- Monitoring systems generate false alerts
- Engineers develop alert fatigue
- Real incidents get missed
- Sensitive data gets exposed
If we log intentionally:
- Monitoring systems accurately identify real failures
- Alerts have high signal-to-noise ratio
- Response times improve
- Compliance risks decrease
The monitoring system sees what we tell it to see. If we tell it that test webhooks are errors, it believes us. If we tell it that email validations are system failures, it believes us.
The question is not "what went wrong?" The question is "what did we tell the system was wrong?"
Often, we're lying to our monitoring systems. Not maliciously, but carelessly. Using console.error because it's there, because it's easy, because "it's just a log."
But logs aren't "just logs." They're the primary input to the system that wakes people up at 3 AM.
Choose your log levels carefully. Your future on-call self will thank you.