
John Bennett's blog

Our service bus – Part 5: Handling poison messages and dead letters

Sunday, February 6, 2011

See the Intro to this series, which has links to all the parts.

In Part 4, I described our convention for declaring a TimeToLive on each message type. Why do we need a TTL?

Consider a process that subscribes to one or more message types and then goes down. Messages would accumulate in its in queue until the MSMQ disk quota on its machine was full. Messages would then start accumulating in the out queues on the servers where publishing processes are running.

At some point, the MSMQ disk quota on those machines would fill up too. Publishes and sends would start to throw MSMQ “insufficient resources” exceptions. With the publishing processes failing, messages would start accumulating in their own in queues as well. The vicious cycle continues…

Not only can this quickly bring your entire system to its knees, it eliminates the fundamental benefit of a message-based system: the logical, physical, and temporal decoupling of components. A component that is publishing messages should never be brought down by a misbehaving subscriber.
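To make the cascade concrete, here is a toy simulation in Python. It is only a sketch: the Machine class, the tiny quotas, and the publish function are hypothetical stand-ins rather than MSMQ APIs, and real MSMQ quotas are measured in bytes rather than message counts. With no TTL, nothing ever leaves either machine, so the publisher eventually fails.

```python
class InsufficientResources(Exception):
    """Stands in for MSMQ's "insufficient resources" error (hypothetical)."""


class Machine:
    """A server with a fixed storage quota, counted in messages for simplicity."""

    def __init__(self, name, quota):
        self.name = name
        self.quota = quota
        self.stored = 0

    def store(self):
        if self.stored >= self.quota:
            raise InsufficientResources(self.name)
        self.stored += 1


def publish(publisher_machine, subscriber_machine):
    """Deliver to the subscriber's in queue if there is room;
    otherwise hold the message in the publisher's out queue."""
    try:
        subscriber_machine.store()
    except InsufficientResources:
        # Raises in turn once the publisher's own quota is exhausted.
        publisher_machine.store()


# The subscribing process is down, so nothing drains its in queue.
subscriber = Machine("subscriber", quota=3)
publisher = Machine("publisher", quota=2)

failed_at = None
for i in range(1, 11):
    try:
        publish(publisher, subscriber)
    except InsufficientResources:
        failed_at = i  # first publish that throws
        break

print(f"publish #{failed_at} failed")
```

The publisher breaks as soon as both quotas are used up, even though the only misbehaving component is the subscriber.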

Like whether a message is transactional, the TTL is part of the contract of the message. Messages that aren’t handled within their TTL are moved to the dead-letter queue (DLQ) of the sender, so that the sender knows the message wasn’t successfully delivered. Our applications monitor their DLQs, log the problem, and throw the message out.

(Throwing out messages works in the context of our business requirements. If the message represented a $1 million order, we’d probably choose not to throw it out, but rather alert someone to the problem!)
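The TTL-and-DLQ behavior described above can be sketched in a few lines of Python. This is an illustrative model only: the Message and Sender classes, the expire function, and the example TTL values are assumptions made for the sketch, and in the real system MSMQ itself moves expired messages, not application code.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    sent_at: float       # seconds on a hypothetical shared clock
    time_to_live: float  # seconds; part of the message contract


@dataclass
class Sender:
    name: str
    dead_letters: list = field(default_factory=list)  # stands in for the sender's DLQ
    log: list = field(default_factory=list)

    def drain_dead_letters(self):
        """Log each dead letter, then discard it."""
        while self.dead_letters:
            msg = self.dead_letters.pop(0)
            self.log.append(f"undelivered: {msg.body}")


def expire(in_queue, sender, now):
    """Move messages whose TTL has elapsed to the sender's DLQ; return the rest."""
    remaining = []
    for msg in in_queue:
        if now - msg.sent_at > msg.time_to_live:
            sender.dead_letters.append(msg)
        else:
            remaining.append(msg)
    return remaining


publisher = Sender("publisher")
stalled_queue = [
    Message("price update", sent_at=0.0, time_to_live=30.0),
    Message("order status", sent_at=0.0, time_to_live=600.0),
]

# Sixty seconds later the subscriber is still down.
stalled_queue = expire(stalled_queue, publisher, now=60.0)
publisher.drain_dead_letters()
print(publisher.log)
```

Because expired messages leave the stalled queue, a dead subscriber can no longer fill it without bound.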

Poison messages

Another problem is poison messages: a malformed message (or a bug in the handling code) can cause processing to fail every time the message is received. After retrying such a message a certain number of times, we declare it poison and stop trying.

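WCF’s NetMsmqBinding allows us to specify the number of retries, the time between retries, and what happens when a message is declared poison, through settings such as ReceiveRetryCount, MaxRetryCycles, RetryCycleDelay, and ReceiveErrorHandling.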

For transactional (tx) messages, we return the message to the sender’s DLQ, where it is handled just like a timeout (logged and thrown out). For non-transactional (non-tx) messages, we don’t even bother returning it to the sender’s DLQ. We simply throw it out.

(Once again, throwing out messages works for us. If you use a similar bus, you need to determine what kind of poison message handling works for your business needs. One size does not fit all.)
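As a rough illustration of the retry-then-dispose policy above, here is a small Python sketch. It does not use WCF: the process function, the Disposition enum, and the default of three attempts are hypothetical, and the delay between retries is left out.

```python
from enum import Enum


class Disposition(Enum):
    HANDLED = "handled"
    SENDER_DLQ = "returned to sender's DLQ"
    DROPPED = "dropped"


def process(message, handler, transactional, max_attempts=3):
    """Try the handler up to max_attempts times, then treat the message as poison."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return Disposition.HANDLED, attempt
        except Exception:
            # A real bus would wait between attempts; the delay is omitted here.
            continue
    # Every attempt failed, so the message is declared poison.
    if transactional:
        return Disposition.SENDER_DLQ, max_attempts
    return Disposition.DROPPED, max_attempts


def always_fails(message):
    raise ValueError(f"cannot parse {message!r}")


def make_flaky(failures):
    """Return a handler that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def handler(message):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")

    return handler


tx_result = process("<bad xml", always_fails, transactional=True)
non_tx_result = process("<bad xml", always_fails, transactional=False)
flaky_result = process("ok", make_flaky(2), transactional=True)
```

The flaky handler shows why retries are worth having at all: a transient failure succeeds on a later attempt, and only a message that fails every attempt is treated as poison.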

In Part 6, I’ll talk about how we manage subscriptions and how we register message handlers.