The Job Queue That Silently Ate 12,000 Emails

A client's notification queue was draining normally and all dashboards showed green. But three weeks of transactional emails had vanished into a catch block nobody thought to monitor.


I got a call from a client's VP of Customer Success. Not the engineering lead, not the CTO — the person whose team deals with angry customers. "Our users say they're not getting order confirmation emails. Some of them haven't gotten one in weeks."

The engineering team's response was predictable: "The queue is healthy. Jobs are processing. Grafana shows zero backlog." And they were right about all of it.

Everything looked fine

The system was straightforward. When a customer placed an order, the API published a message to a BullMQ queue in Redis. A worker process picked up the job, rendered an email template with the order details, and sent it through SendGrid. The queue dashboard showed jobs going in and coming out at a steady rate. No failures. No retries. No backlog.
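
For orientation, here's roughly what the producer side of a setup like this looks like. This is a sketch, not the client's code; the queue name, connection details, and order shape are illustrative, chosen to line up with the fields the worker reads later.

import { Queue } from 'bullmq';

// Illustrative only: queue name, connection, and order shape are assumptions.
const emailQueue = new Queue('email', {
  connection: { host: 'localhost', port: 6379 },
});

async function onOrderPlaced(order: { id: string; customerEmail: string; total: number }) {
  // Publish a job whose payload matches what the worker expects:
  // templateId, recipient, subject, and a template context.
  await emailQueue.add('order-confirmation', {
    templateId: 'order-confirmation',
    recipient: order.customerEmail,
    subject: `Order ${order.id} confirmed`,
    context: { order },
  });
}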

I asked to see the SendGrid activity log. The team hadn't checked it — they monitored the queue, not the downstream service. When we pulled up the last 30 days of SendGrid data, the picture changed fast. Email volume had dropped by about 60% three weeks earlier. On March 3rd, specifically. Right around the time someone had deployed a change to the email template rendering.

The catch block that caught everything

The worker code looked like this:

worker.on('completed', (job) => {
  logger.info(`Job ${job.id} completed`);
});
 
worker.on('failed', (job, err) => {
  logger.error(`Job ${job.id} failed: ${err.message}`);
  metrics.increment('email.jobs.failed');
});

Looks reasonable. Failed jobs get logged and counted. Except the actual processing function had its own try-catch:

async function processEmailJob(job: Job) {
  try {
    const html = await renderTemplate(job.data.templateId, job.data.context);
    await sendGrid.send({
      to: job.data.recipient,
      subject: job.data.subject,
      html,
    });
  } catch (err) {
    logger.warn('Email processing issue', { jobId: job.id });
    // Don't rethrow — we don't want to clog the retry queue
    // with transient SendGrid errors
  }
}

See the problem? The catch block swallows the error. The job completes successfully from BullMQ's perspective. The failed event never fires. The failure metric never increments. The dashboard stays green.

The comment tells the whole story. At some point retries had piled up — probably during a SendGrid outage — and someone decided to "fix" it by swallowing the error. The intent was to avoid flooding the retry queue. The effect was to make every failure invisible.

What actually broke

The March 3rd deploy had migrated the email template engine from Handlebars to React Email. The variable syntax changed with it — Handlebars' {{orderTotal}} became {order.total} — but three templates still referenced a bare {orderTotal}. The render function threw a ReferenceError. The catch block ate it. The job reported success.

For three weeks, roughly 12,000 order confirmation emails, shipping notifications, and password reset links had been silently discarded. The queue metrics said everything was perfect.

Warning

A job that completes without doing its work is worse than a job that fails loudly. At least a failed job shows up in your metrics.

The fix was boring

We did three things. First, the obvious: remove the blanket catch. If the job fails, let it fail. BullMQ has built-in retry with exponential backoff — that's what it's for.

async function processEmailJob(job: Job) {
  const html = await renderTemplate(job.data.templateId, job.data.context);
  await sendGrid.send({
    to: job.data.recipient,
    subject: job.data.subject,
    html,
  });
}
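
For the retry behavior mentioned above, BullMQ takes attempt and backoff settings as job options. A minimal sketch with illustrative values, set once as defaults on the queue:

import { Queue } from 'bullmq';

// Illustrative retry defaults; the right numbers depend on how long a SendGrid
// blip you want to ride out before a job lands in the failed set.
const emailQueue = new Queue('email', {
  connection: { host: 'localhost', port: 6379 },
  defaultJobOptions: {
    attempts: 5, // retry a failing job up to 5 times before marking it failed
    backoff: { type: 'exponential', delay: 30_000 }, // 30s, 60s, 120s, ...
  },
});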

Second, we added a delivery-side metric. Not "did the job complete" but "did SendGrid accept the email." The queue draining is an implementation detail. The email arriving at SendGrid is the actual outcome.
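
Here's a sketch of what that looks like in the worker. It assumes the SendGrid SDK surfaces the underlying HTTP response (as @sendgrid/mail does), and it reuses the renderTemplate, sendGrid, and metrics objects from the worker above; the metric names are made up for illustration.

import { Job } from 'bullmq';

// renderTemplate, sendGrid, and metrics are the same objects the worker
// already uses; declared here only so the sketch stands alone.
declare function renderTemplate(templateId: string, context: unknown): Promise<string>;
declare const sendGrid: {
  send(msg: { to: string; subject: string; html: string }): Promise<[{ statusCode: number }, unknown]>;
};
declare const metrics: { increment(name: string): void };

async function processEmailJob(job: Job) {
  const html = await renderTemplate(job.data.templateId, job.data.context);

  const [response] = await sendGrid.send({
    to: job.data.recipient,
    subject: job.data.subject,
    html,
  });

  // 202 means SendGrid accepted the message for delivery. That acceptance,
  // not the job completing, is the outcome worth counting.
  if (response.statusCode === 202) {
    metrics.increment('email.delivery.accepted');
  } else {
    metrics.increment('email.delivery.rejected');
    throw new Error(`SendGrid returned ${response.statusCode} for job ${job.id}`);
  }
}

Acceptance by SendGrid still isn't delivery; bounces and blocks happen downstream, which is what the third step covers.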

Third — and this is the one that felt like overkill until it wasn't — we set up a reconciliation check. A nightly job that compares orders created in the past 24 hours against SendGrid's activity API. If an order exists without a corresponding delivered email event, it gets re-queued. Belt and suspenders, but after three weeks of silent data loss, the team wasn't in the mood for minimalism.
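
The shape of that reconciliation job, sketched below. The order store, the SendGrid activity lookup, and the queue are stubbed out as declared stand-ins; in the real version the lookup went through SendGrid's activity API.

import { Queue } from 'bullmq';

type Order = { id: string; customerEmail: string };

// Stand-ins for the client's order store, the SendGrid activity lookup,
// and the queue from earlier; names and shapes are illustrative.
declare function findOrdersSince(since: Date): Promise<Order[]>;
declare function findDeliveredRecipientsSince(since: Date): Promise<Set<string>>;
declare const emailQueue: Queue;

async function reconcileOrderEmails() {
  const since = new Date(Date.now() - 24 * 60 * 60 * 1000);

  const orders = await findOrdersSince(since);
  const deliveredTo = await findDeliveredRecipientsSince(since);

  for (const order of orders) {
    // No delivered event for this recipient: put the confirmation back on the queue.
    if (!deliveredTo.has(order.customerEmail)) {
      await emailQueue.add('order-confirmation', {
        templateId: 'order-confirmation',
        recipient: order.customerEmail,
        subject: `Order ${order.id} confirmed`,
        context: { order },
      });
    }
  }
}

Matching on recipient alone is a simplification; in practice you'd attach a per-order correlation id to the message (SendGrid supports custom args) so two orders from the same customer don't mask each other.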

The deeper problem

The real failure wasn't the catch block. It was monitoring the mechanism instead of the outcome. The queue draining at a healthy rate told us the machinery was running. It told us nothing about whether customers were actually getting their emails.

This pattern shows up everywhere. Teams monitor deployment success rates but not whether the deployed feature actually works. They watch database query counts but not whether the results are correct. They track API response times but not whether the response body has the data the client needs.

The queue was never the product. The email was.

I think about this engagement whenever I see a dashboard full of green metrics. Green means the system is doing what you measured. It doesn't mean it's doing what you meant. How many of your monitoring dashboards are watching the mechanism when they should be watching the outcome?