Surviving poison messages in MSMQ

by Dejan Grujic

Introduction

It's not hard to find articles about MSMQ. They usually focus on the good sides - what MSMQ is good for, and how, when and why to use it. Although MSMQ really helps in many distributed scenarios, it's not without problems. If you have a complex system with external dependencies, where subsystems are under constant development with lots of updates, sooner or later something will break. One subsystem will send a message which the receiver will not be able to process, for one reason or another. You'll have a poison message.

Note:

We have two MSMQ products which are great for fighting poison messages. QueueMonitor can permanently watch your queues. When a poison message shows up, it will move it somewhere else so that it doesn't block other messages, and send you an alert email. You can later use QueueExplorer to investigate the problematic message, edit it if there's something wrong, and send it back to its destination queue.

What are poison messages?

Poison messages are messages that cannot be processed. If they stay at the top of the queue, other messages will never get a chance. This can happen in any architecture involving first-in first-out queues, including MSMQ.

What happens when a message fails? Sometimes the failure is transient - for example when processing involves a db transaction which was terminated as a deadlock victim, or when some external component is temporarily unavailable. The solution is simple - we receive messages transactionally, and if any exception is thrown during processing, we roll back the entire transaction. The message goes back to the queue and after some time we retry processing. Hopefully the transient problem will go away in the meantime. MSMQ used with transactions works very well in this scenario, and with a little help from COM+, database operations can be within the same transaction too.

However, sometimes failures are permanent. For instance, a message says "Update user 1234", but before it's received user 1234 is deleted from the database. That update command will always fail. Versioning issues are also common - for example we could send a message serialized with an old version of a class but try to deserialize it with a new version which expects an additional field. If we apply the same strategy of rolling back the transaction, that same message will always return to the top of the queue, and other pending messages will never be processed.

There are cases in between - a message that will succeed after some delay or after some manual intervention. Technically only a message that could never succeed should be called a poison message, but even if it would work after manual intervention or after too many retries, we can name it and treat it as poisonous.
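The receive-and-roll-back approach described in this section could look roughly like this in C#. This is a minimal sketch, assuming the .NET Framework System.Messaging assembly and a transactional queue. ReceiveOne and the process delegate are hypothetical names used for illustration, not taken from the sample application.

```csharp
using System;
using System.Messaging; // assumes a reference to System.Messaging.dll (.NET Framework)

static class TransactionalReceiver
{
    // Receives one message inside an MSMQ transaction and hands it to 'process'.
    // Returns true if the message was processed and the transaction committed.
    // On any exception (including a receive timeout on an empty queue) the
    // transaction is aborted, so a received message returns to the queue.
    public static bool ReceiveOne(MessageQueue queue, Action<Message> process)
    {
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            try
            {
                Message message = queue.Receive(TimeSpan.FromSeconds(5), tx);
                process(message);
            }
            catch (Exception)
            {
                // A real application would log the failure here.
                tx.Abort();
                return false;
            }
            tx.Commit();
            return true;
        }
    }
}
```

Note that this loop alone gives no protection against a permanent failure: the same message comes back on every call.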

How to detect a poison message?

It's not easy to determine whether a failure is transient or permanent. It's usually a good idea to retry a failed message, but not indefinitely. If it doesn't work after a couple of retries, something should be done - a notification sent to the admin, the message moved to some "poison messages" queue, or something else. We'll discuss handling of poison messages later - the first challenge is to determine when a message should not be retried any more.

Using TimeToBeReceived message property

Each message has a TimeToBeReceived property. The default is InfiniteTimeout, but the sender can set it to something else, such as 5 minutes. The best thing about this property is that when the time expires, Message Queuing itself takes care of the message, by deleting it or moving it to the dead-letter queue. Which of these two happens depends on the UseDeadLetterQueue property.

One problem arises in offline scenarios. An application could work without a connection to the main server, in which case messages are collected and sent when a connection becomes available. A 5-minute timeout will obviously not work in that case, but if we increase the time, poison messages could be retried for hours before we detect them. Another, more serious problem is that other pending messages get older too, although they may be perfectly valid and never retried. As we'll see later, whether this approach is usable depends on the strategy we choose for handling poison messages.
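Setting the expiry is done on the sending side. A minimal sketch, assuming System.Messaging and a transactional destination queue; SendWithExpiry is a hypothetical helper name:

```csharp
using System;
using System.Messaging; // assumes a reference to System.Messaging.dll (.NET Framework)

static class ExpiringSender
{
    // Sends 'body' so that MSMQ itself removes the message if nobody
    // has received it within 'lifetime'.
    public static void SendWithExpiry(MessageQueue queue, object body, TimeSpan lifetime)
    {
        var message = new Message(body)
        {
            TimeToBeReceived = lifetime,   // counted from the moment of sending
            UseDeadLetterQueue = true      // keep a copy in a dead-letter queue instead of deleting
        };
        // Single = wrap this one send in its own internal transaction.
        queue.Send(message, MessageQueueTransactionType.Single);
    }
}
```

For example, SendWithExpiry(queue, order, TimeSpan.FromMinutes(5)) would give the 5-minute limit discussed above.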

Retry counter

In simple scenarios we need to keep only one retry counter per queue - if we always retry only the last failed message. However, in some strategies we try the second message after the first one fails, and so on - then we have to remember a retry counter for each message that's still in processing.

In both cases the question is where to keep these counters - in memory, the registry, a db, or somewhere else? Wherever we put them there's a possibility of a resource leak - for instance when an administrator manually deletes messages but the application still keeps their counters. Another problem - if we keep them in memory we'll lose all counters when the application is restarted.
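An in-memory variant with one counter per message might be sketched like this. It is only an illustration: RetryCounter and its members are hypothetical names, the key is assumed to be something like the message's Id, and it has exactly the weaknesses listed above (counters are lost on restart, and leak unless Forget is called).

```csharp
using System;
using System.Collections.Generic;

// In-memory failure counters keyed by message id.
public sealed class RetryCounter
{
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();
    private readonly int maxAttempts;

    public RetryCounter(int maxAttempts)
    {
        this.maxAttempts = maxAttempts;
    }

    // Records one more failed attempt for the message.
    // Returns true once the message has failed maxAttempts times,
    // i.e. when it should be treated as poison.
    public bool RegisterFailure(string messageId)
    {
        int count;
        counts.TryGetValue(messageId, out count); // count stays 0 if the id is unknown
        count++;
        counts[messageId] = count;
        return count >= maxAttempts;
    }

    // Call after success, or after the message was moved elsewhere,
    // so that the counter does not leak.
    public void Forget(string messageId)
    {
        counts.Remove(messageId);
    }

    public int TrackedMessages
    {
        get { return counts.Count; }
    }
}
```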

Message modification

Some other messaging systems automatically keep a retry count as part of each message, but MSMQ does not. We can build something similar - for instance by putting the retry count in the AppSpecific property (if we don't use it for other purposes). The major limitation is that we cannot modify a message that will be rolled back - we have to send the message to the back of the queue and commit the transaction. This makes it unsuitable for most handling strategies.

Exception filter

We can analyze the exception thrown from the processing code. For example, if we catch a SerializationException or some custom exception which says something like "couldn't find row 1234 in users table", we immediately know it's a poison message. However, for some exceptions it's hard to tell whether the problem is temporary or permanent, and making sure that every exception is treated as it should be is hard and error prone. Therefore exception filtering is best used in combination with some other method, as a shortcut to detect at least some positives as early as possible.
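Such a filter can be as small as one function. In the sketch below, FailureKind, ExceptionFilter and RowNotFoundException are hypothetical names invented for illustration. Anything the filter does not recognize is reported as Unknown, so that another method (such as a retry counter) can decide.

```csharp
using System;
using System.Runtime.Serialization;

public enum FailureKind { Permanent, Unknown }

// Hypothetical custom exception thrown by processing code when a row is missing.
public sealed class RowNotFoundException : Exception
{
    public RowNotFoundException(string table, int id)
        : base("couldn't find row " + id + " in " + table + " table")
    {
    }
}

public static class ExceptionFilter
{
    // Returns Permanent only for exceptions we are sure can never succeed on retry.
    public static FailureKind Classify(Exception ex)
    {
        if (ex is SerializationException) return FailureKind.Permanent;
        if (ex is RowNotFoundException) return FailureKind.Permanent;
        return FailureKind.Unknown; // might be transient - leave it to the retry logic
    }
}
```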

External monitoring

Some external application or another thread could monitor the queue and raise an alarm if the same message stays at the top of the queue for too long. This works if we always return failed messages to the top of the queue. The biggest advantage of this method is that it also detects when our application is not processing any messages at all, not only poison messages - maybe our Windows service is not running for some reason?
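A monitor of this kind only needs to peek at the head of the queue periodically and remember what it saw. A rough sketch, assuming System.Messaging; QueueHeadMonitor and its members are hypothetical names, and the caller is expected to invoke them from a timer or a polling loop:

```csharp
using System;
using System.Messaging; // assumes a reference to System.Messaging.dll (.NET Framework)

public sealed class QueueHeadMonitor
{
    private string lastHeadId;
    private DateTime firstSeen;

    // Feed this with the id currently at the head of the queue (null if empty).
    // Returns true when the same message has been at the head longer than 'limit'.
    public bool IsStuck(string headId, DateTime now, TimeSpan limit)
    {
        if (headId == null)
        {
            lastHeadId = null;
            return false;
        }
        if (headId != lastHeadId)
        {
            // A different message reached the head - restart the clock.
            lastHeadId = headId;
            firstSeen = now;
            return false;
        }
        return now - firstSeen > limit;
    }

    // Peeks without removing the message; returns null if the queue is empty.
    public static string PeekHeadId(MessageQueue queue)
    {
        try
        {
            return queue.Peek(TimeSpan.Zero).Id;
        }
        catch (MessageQueueException ex)
        {
            if (ex.MessageQueueErrorCode == MessageQueueErrorCode.IOTimeout) return null;
            throw;
        }
    }
}
```

A typical call would be monitor.IsStuck(QueueHeadMonitor.PeekHeadId(queue), DateTime.UtcNow, TimeSpan.FromMinutes(10)), with the 10-minute limit chosen purely as an example.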

Poison message handling strategies

So far we have covered ways of detecting poison messages. In most cases some retrying was needed. Are there ways to improve the performance and robustness of our application with different retry strategies? What should we do when a message should not be retried any more - delete it, put it somewhere else, or something different?

All handling strategies described here kick in whenever a message fails. It's usually a good idea to log details of why processing failed and send a notification to the admin. After the system is fixed you will probably want to delete the poison message or move it back to the queue to be processed again. The problem is that you can't perform these operations from the management console; there is only one operation available - deleting all messages from the queue (purge). You'll have to use one of the existing third-party tools or write some utilities yourself to move and delete messages.

We'll start with simpler strategies first.

Discard poison messages

If the primary concern is to process messages as quickly as possible, and it's not important if some messages are lost, read no more - just drop the poison message and move on. I believe not many people use MSMQ in this fashion.

Always roll back

The simplest message-preserving solution is to do nothing except roll back - the poison message returns to the top of the queue and stays there, potentially forever. There is no risk of losing a message, and it's up to the administrator to fix the problem. Obviously no other messages will be processed in the meantime. Another problem is that after a failure we have to pause before retrying, so even if a message failed only once for some temporary reason, processing is delayed.

This solution is satisfactory if the most important thing is that every message is processed, and we don't care when. It is also the only solution that maintains the ordering of messages. On the other hand, constant monitoring and manual interventions are unavoidable. This strategy fits nicely with external monitoring which checks whether the same message has been at the top of the queue for too long.

Retry, move to dead-letter queue

This strategy is an extension of the previous one. After a message fails several times we move it to a special dead-letter queue. We can put all poison messages into the system dead-letter queue, or make a separate dead-letter queue for each queue we have. What happens if there's some temporary issue? If, for instance, the connection to some external resource is down, some perfectly valid messages will end up in the dead-letter queue after a couple of retries. They will not be put back in the original queue unless an administrator moves them manually.

Therefore this solution also demands constant monitoring and manual interventions, and valid messages may end up in the dead-letter queue, but at least the queue can never be blocked for too long.

Send to back

The obvious problem with the previous two strategies is that a failed message goes back to the top of the queue, preventing other messages from being processed. What if we could move it to the bottom of the queue after a failure? Other messages would then have a chance to be processed.

Since roll back returns the message to the top of the queue, we cannot use it. Instead we could complete the receive operation (thus removing the message from the top) and send the message to the queue again - which puts it at the bottom. It is best to do both the Receive and the Send in a single transaction.

Special care must be taken with message priorities - if we send a high priority poison message back to the queue, it will still end up in front of older but lower priority messages, blocking them. Therefore, reduce the priority of the poison message to the lowest when it's sent back.

Another problem is that it's harder to tell which message is arriving for the first time and which has already been retried several times. The easiest way to deal with this is to inject the retry count into the message itself. We can modify the message any way we want; for MSMQ it's as if a completely new message were sent. The AppSpecific field is ideal for this purpose, so it's used in the sample application.

If only poison messages are in the queue, we will kill our system if we retry them immediately - we will keep processing the same messages in a circle. Some delay must be introduced. One approach is to pause when we receive a message that has failed before. These delays inevitably affect valid messages too.
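The receive-then-resend step with a retry count in AppSpecific could be sketched as follows. This is not the sample application's code; SendToBack and Requeue are hypothetical names, and only the body, label and counter are copied, which is an assumption about what the application needs. Two details are worth checking against the System.Messaging documentation: AppSpecific is, as far as I know, not retrieved by default, so the receiving queue's MessageReadPropertyFilter.AppSpecific has to be set to true before receiving; and MSMQ is documented to treat transactional messages as lowest priority anyway, so the Priority line should matter only for non-transactional sends.

```csharp
using System;
using System.Messaging; // assumes a reference to System.Messaging.dll (.NET Framework)

static class SendToBack
{
    // Sends a copy of the failed message to the back of 'queue' within 'tx'
    // (the same transaction that received it), with the retry count increased.
    // Returns the new retry count so the caller can compare it with its limit.
    public static int Requeue(MessageQueue queue, Message failed, MessageQueueTransaction tx)
    {
        // Reading AppSpecific requires queue.MessageReadPropertyFilter.AppSpecific = true
        // to have been set before the message was received.
        int retries = failed.AppSpecific + 1;

        var copy = new Message
        {
            BodyStream = failed.BodyStream,     // reuse the serialized body as-is
            Label = failed.Label,
            AppSpecific = retries,              // retry counter travels with the message
            Priority = MessagePriority.Lowest   // do not jump ahead of older messages
        };
        queue.Send(copy, tx);
        return retries;
    }
}
```

The caller then commits the transaction instead of rolling it back, so that the receive and the send take effect together.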

Separate retry queue

All solutions so far block the main queue at least temporarily. If that's an issue, we could use another queue for retrying, and immediately move all failed messages there. The main queue will never be blocked. Now we also need a separate receiver for the retry queue, which will have delays (using one of the previous strategies), but valid messages will be processed as soon as possible.

The Queued Components way

Queued Components are a part of COM+ that uses MSMQ for transporting remote method invocations. Even if you're not using QC, it's good to know how it works, because considerable attention was given to poison message handling. It uses a combination of the "separate retry queue" and "retry, move to dead-letter queue" strategies.

The general idea is to have more than one retry queue. After a message fails, it is moved to retry queue 1. It's retried there 3 times with a 1 minute pause between retries. If it doesn't succeed, it's moved to retry queue 2, where the delay is 2 minutes. Retry queue 3 has a delay of 4 minutes, and so on. By default 5 retry queues are used. After a message drops out of the last retry queue it goes to the dead-letter queue, where no retries are performed.

Although you can increase or reduce the number of retry queues, and so change what happens before a message is moved to the dead-letter queue, you cannot change the number of retries or the delay for each queue - so in the default configuration a message will reach the dead-letter queue in (1+2+4+8+16)*3=93 minutes. Queued Components don't completely solve the problem of valid messages in the dead-letter queue; they only reduce it. Manual interventions are still required from time to time.

Also, you would have to deal with COM+, but that's inevitable if you want to integrate database and MSMQ transactions.
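The 93-minute figure follows from the doubling delays and the fixed 3 retries per queue. A small sketch of that arithmetic, with RetrySchedule as a hypothetical helper that is not part of COM+:

```csharp
using System;

public static class RetrySchedule
{
    // Total waiting time when each retry queue performs 'retriesPerQueue' retries
    // and the delay doubles from one retry queue to the next.
    public static int TotalMinutes(int retryQueues, int retriesPerQueue, int firstDelayMinutes)
    {
        int total = 0;
        int delay = firstDelayMinutes;
        for (int i = 0; i < retryQueues; i++)
        {
            total += delay * retriesPerQueue;
            delay *= 2;
        }
        return total;
    }
}
```

With the defaults described above, RetrySchedule.TotalMinutes(5, 3, 1) gives (1+2+4+8+16)*3 = 93. Under the same rules, reducing the number of retry queues to 3 would give (1+2+4)*3 = 21 minutes.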

Testing application

You can try out different poison handling strategies in the sample application. You can send valid and poison messages and choose how poison messages are handled. Three private transactional queues are created automatically: test_queue, test_queue_retry, and test_queue_dead_letter. Messages in any of these queues can be viewed, so you can check how messages move between queues.

All failure handlers implement the IFailedMessageHandler interface:


public interface IFailedMessageHandler
{
    TransactionAction HandleFailedMessage ( Message message, MessageQueueTransaction transaction );
}

This method is called any time a message fails. The current transaction is passed as a parameter so that the message can be sent to another queue within the same transaction. Each handler returns what should be performed at the end of processing - commit or roll back. An architecture like this allows the poison handling strategy to be chosen at run time. You can also pick a strategy for each queue separately.
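To illustrate how two of the strategies fit this interface, here is a rough sketch of possible handlers. It is not the code from the download: the members of TransactionAction (Commit, RollBack) and the handler class names are assumptions, and the interface is repeated only to keep the example self-contained.

```csharp
using System;
using System.Messaging; // assumes a reference to System.Messaging.dll (.NET Framework)

// Assumed shape of the enum; the article only says "commit or roll back".
public enum TransactionAction { Commit, RollBack }

public interface IFailedMessageHandler
{
    TransactionAction HandleFailedMessage(Message message, MessageQueueTransaction transaction);
}

// "Always roll back": leave the message where it was.
public sealed class AlwaysRollBackHandler : IFailedMessageHandler
{
    public TransactionAction HandleFailedMessage(Message message, MessageQueueTransaction transaction)
    {
        return TransactionAction.RollBack;
    }
}

// "Separate retry queue" or "move to dead-letter queue": forward a copy to
// another queue in the same transaction, then commit so that the original
// is removed from the main queue.
public sealed class MoveToQueueHandler : IFailedMessageHandler
{
    private readonly MessageQueue target;

    public MoveToQueueHandler(MessageQueue target)
    {
        this.target = target;
    }

    public TransactionAction HandleFailedMessage(Message message, MessageQueueTransaction transaction)
    {
        var copy = new Message
        {
            BodyStream = message.BodyStream, // reuse the serialized body as-is
            Label = message.Label
        };
        target.Send(copy, transaction);
        return TransactionAction.Commit;
    }
}
```

The receiving loop would call the configured handler from its catch block and then commit or abort the transaction according to the returned value.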

Conclusion

The optimal poison handling strategy depends on application requirements - sometimes all messages must be handled in the order they arrived, and sometimes they must be handled as fast as possible even if some are lost. There is no ideal solution to the poison message problem - a human must still be involved when things go wrong. At least we can make sure that bad messages are put aside so that other messages can flow. Poison messages can wait in some other queue for an administrator to delete or retry them.

In any case, MSMQ, if used properly, allows our systems to survive various problems and outages. Some additional effort to deal with poison messages shouldn't stop us from using it.

Download sample app - 8K

Download source - 10K