[FrameworkBundle] Removed RejectRedeliveredMessageMiddleware to avoid message to be lost if it cannot be processed by both handler and failed transport #40249

Status: Open. Wants to merge 3 commits into base branch 7.3 from avoid-lost-messages.

Conversation

theravel
Contributor

This pull request fixes the situation described in #36870.

| Q             | A      |
| ------------- | ------ |
| Branch?       | 5.x    |
| Bug fix?      | yes    |
| New feature?  | no     |
| Deprecations? | no     |
| Tickets       | #36870 |
| License       | MIT    |
| Doc PR        | -      |

…ost if it cannot be processed by both handler and failed transport
@theravel theravel force-pushed the avoid-lost-messages branch from f2395f0 to 354ba2a on February 18, 2021 22:10
@weaverryan
Member

Hey @theravel!

I understand your opinion on this - it makes sense to me. However, I believe there IS a use case for RejectRedeliveredMessageMiddleware that needs to be considered along with your case. When it was introduced in #34107, it was (I believe) to avoid a problem with super-long-running handlers. The flow is:

A) Message is received and passed to the handler
B) The handler takes a long time to process it. So, we have not ack'ed the message yet.
C) RabbitMQ thinks that the worker must have died, so it re-delivers it.
D) Now, another worker starts working on the same message... and works for a long time.

For a very long-running message, eventually all of your workers could get caught trying to handle the same message, leaving no workers available to handle other messages (this is especially a problem because when Rabbit redelivers, it puts the message at the front of the queue).

So if I understand the situation fully, we have 2 different scenarios:

  1. How to avoid handling long-running messages that are still being processed
  2. How to guarantee that no messages are lost (high availability).

Do you have a thought about how to handle this? Is there a way that could satisfy everyone?

Thanks!

@theravel
Contributor Author

@weaverryan thank you for looking.
RabbitMQ documentation clearly says:

There aren't any message timeouts; RabbitMQ will redeliver the message when the consumer dies. It's fine even if processing a message takes a very, very long time.

So the use case with redelivery while another consumer is processing a message should not be valid. Looking at the page
https://blog.forma-pro.com/rabbitmq-redelivery-pitfalls-440e0347f4e0
it looks like they tried to solve a problem with redelivery to the head of the queue in case of an exception in the consumer. In that case the message would be stuck in the queue forever and processing of further messages would be blocked. But this use case should be solved via a retry strategy (and Symfony Messenger has everything for it), so that the message is moved from the head of the queue to the failed transport.

So it is still not clear why messages should be lost on purpose.
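
The retry strategy and failure transport referred to above can be wired roughly like this; a minimal sketch in Symfony's PHP config format, where the transport names and DSNs are placeholders rather than anything from this PR:

// config/packages/messenger.php - illustrative sketch only; names and DSNs are placeholders
use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->extension('framework', [
        'messenger' => [
            // messages that exhaust their retries are sent here instead of being dropped
            'failure_transport' => 'failed',
            'transports' => [
                'async' => [
                    'dsn' => '%env(MESSENGER_TRANSPORT_DSN)%', // e.g. an AMQP DSN
                    // retries take the message off the head of the queue, with a delay
                    'retry_strategy' => ['max_retries' => 3, 'delay' => 1000, 'multiplier' => 2],
                ],
                'failed' => 'doctrine://default?queue_name=failed',
            ],
        ],
    ]);
};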

@weaverryan
Member

Hey @theravel!

So the use-case with redelivery while another consumer is processing a message should not be valid

We need to be really sure on this. From the original issue about this #32055:

A similar situation can happen when the handling of a message takes longer than the rabbitmq connection heartbeat or
timeout. Ref. #31707 and php-enqueue/enqueue-dev#658 (comment)

But perhaps the problem is not quite what I was describing... but rather that once a long-running message is done, the connection to Rabbit is already lost, so it fails to ack it. In that case, you could increase your Rabbit timeout (#31707 (comment)) or we could even potentially add a hook where users could opt into a "before handling" ack (that would be opt-in for sure).
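
(A minimal illustration of that timeout tweak, assuming the AMQP transport's DSN options; host and values here are placeholders:)

// illustrative placeholder DSN; heartbeat/read_timeout are AMQP connection options
$dsn = 'amqp://guest:guest@rabbitmq:5672/%2f/messages?heartbeat=60&read_timeout=120';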

Looking at the page https://blog.forma-pro.com/rabbitmq-redelivery-pitfalls-440e0347f4e0
it looks like they tried to solve a problem with redelivery to the head of the queue in case of an exception in the consumer

Hmm. So one original issue that RejectRedeliveredMessageMiddleware may have been trying to solve was "what happens if an event listener on WorkerMessageFailedEvent throws an exception?" With how the current code is written right now (if we remove RejectRedeliveredMessageMiddleware), it's possible that a listener could throw an exception, which would cause (I think?) the listener to die and the message to not be rejected (so then Rabbit would just redeliver it). We could probably fix this in a smarter way, however: put a try/catch around dispatching WorkerMessageFailedEvent. One tricky detail is that redelivery happens during that event. And so, if there IS an exception, we would need to know whether the message WAS redelivered or not. If it WAS, then we are ok to reject the message after the try/catch. If it was NOT, then we would not reject or ack the message - we would do nothing and allow Rabbit to redeliver. Does that sound correct?

To track if the message was actually redelivered, we could have the SendFailedMessageForRetryListener modify the WorkerMessageFailedEvent object - maybe calling $event->markRetryAsSuccessfullySent(). Then, in the catch in Worker, we could use that to know if the message was redelivered or not. However, with this setup, it would still be possible for a message to get redelivered over and over again by Rabbit (and it would continually be at the front of the queue I believe) because some listener to this event is exploding repeatedly. That doesn't sound desirable... but neither does dropping the message.

Cheers!

@jderusse
Member

So if I understand the situation fully, we have 2 different scenarios:

  • How to avoid handling long-running messages that are still being processed
  • How to guarantee that no messages are lost (high availability).

IMHO, losing messages is unacceptable.

People can deal with the long-running process issue, either at the infra level (defining the right TTL before AMQP re-sends the message) or at the application level (with a lock, or by making the application resilient to multiple deliveries).

But they have no way to work around a lost message.
The issue talks about the database being down, but the exact same issue exists with a segfault, a power outage, a network loss...

@theravel
Contributor Author

Hi @weaverryan

  1. As for the original intention, it gets fuzzier; e.g. it was solved at the infra level at the time: [Messenger] Long task after receiving the message - message is not removed from queue #31707 (comment)

  2. As for my pull request: do you suggest doing it like that?

$failedEvent = new WorkerMessageFailedEvent($envelope, $transportName, $throwable);
try {
    $this->dispatchEvent($failedEvent);
} catch (\Throwable $e) {
    // but what do we do in this case?
}

In my opinion, if dispatching the failed event failed (so we are not sure that the failed message has been delivered to the failed transport), we should not reject the message from the original queue. A compromise has to be made: we cannot make the two actions transactional (remove the original message AND put it into the failed transport, simply because two different storage systems may be involved), so we would rather duplicate messages in the failed transport than lose data. WDYT?

@weaverryan
Member

IMHO, losing messages is unacceptable.

@jderusse Ok then, let's get this taken care of. Unfortunately, I think (for some very good reasons) @Tobion is less available these days to jump in with his thoughts.

Do you suggest doing it like that?

@theravel That is what I was thinking. But with one addition:

$failedEvent = new WorkerMessageFailedEvent($envelope, $transportName, $throwable);
try {
    $this->dispatchEvent($failedEvent);
} catch (\Throwable $e) {
    if (!$failedEvent->wasMessageRedelivered()) {
        // we would do nothing, allow Rabbit to naturally redeliver
        // hmm... unless there is no "retry" configured. In that case, this is expected
        // we would need to differentiate that
        // perhaps we combine this with $failedEvent->willRetry()
        // or we call the method $failedEvent->wasMessageRedeliveredIfDesired();
        // SendFailedMessageForRetryListener would set this IF it was redelivered or shouldn't have been
    } else {
        // here is where we would "reject" the original, because we know the message was properly redelivered
    }
}

This $failedEvent->wasMessageRedelivered() is something I'm inventing. The idea would be that the SendFailedMessageForRetryListener would call a method on the event object to mark that it DID deliver the retry.

I'm also trying not to make things "too fancy". The best scenario is to be as close to "industry standard" as possible. With my proposed setup, would we basically be saying:

If message handling fails, AND the message should be redelivered, ONLY "reject" the original message if we can
verify that the redelivery was successful. Else, do nothing, and fall back to the transport's native "redelivery" to
be safe.

Thoughts?

@theravel
Contributor Author

@weaverryan why do you think that we can reject a message if it was redelivered? Redelivery doesn't mean that it was successfully handled, so rejecting this message would cause data loss.

Contributor

@Tobion Tobion left a comment


I don't see any alternative suggestion for how to solve the problem that the current solution fixes.
Currently this PR just removes everything, which is not acceptable.
If you don't agree with this solution, you can simply remove the messenger.middleware.reject_redelivered_message_middleware middleware in your app.

@theravel
Contributor Author

@Tobion can you please provide more information about the original problem the middleware is supposed to solve? This would help a lot with the fix.
I asked about the original problem in #36870 almost a year ago; no answer has been given.

Removing the middleware from a specific application will not solve the problem: since the middleware is enabled by default, engineers (users of the library) simply do not know about it, and they do not expect a reliable Symfony library to lose messages. The middleware should be either removed or disabled by default (so that enabling it is an educated decision).

@Tobion
Contributor

Tobion commented Feb 22, 2021

@theravel
Contributor Author

@Tobion would the following solution work from your perspective?

  • If a failed transport is configured: disable RejectRedeliveredMessageMiddleware. Messages should be removed from the queue via the failed transport. If this does not work (e.g. the failed transport is misconfigured or not available at the moment), we let the message stay in the queue forever and avoid losing data at any cost.
  • If a failed transport is not configured: keep everything as is. The middleware will remove the message because the developer does not care about it (maybe losing data is acceptable in this case, e.g. streaming logs or so). A rough sketch of this rule follows below.
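
A rough, self-contained sketch of that proposed rule (this is not the actual patch in this PR; the middleware names are only illustrative of FrameworkBundle's defaults):

// Sketch only: include the reject-redelivered middleware in the default bus stack
// only when no failure transport has been configured.
function defaultBusMiddleware(?string $failureTransport): array
{
    $middleware = [
        'add_bus_name_stamp_middleware',
        'dispatch_after_current_bus',
        'failed_message_processing_middleware',
        'send_message',
        'handle_message',
    ];

    if (null === $failureTransport) {
        // no failure transport: keep today's behaviour and drop redelivered messages early
        array_unshift($middleware, 'reject_redelivered_message_middleware');
    }

    return $middleware;
}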

@nicolas-grekas nicolas-grekas added this to the 4.4 milestone Feb 22, 2021
… to be lost if it cannot be processed by both handler and failed transport"

This reverts commit 354ba2a.
…is not configured. Otherwise redelivery can happen only when failed transport is not available and message cannot be lost
@theravel
Contributor Author

@Tobion please check one more version; I implemented it as described in the previous comment.
The expected behaviour from my perspective: if a failure transport is configured, the developer wants to avoid data loss. Also, in this case redelivery can happen only if the failure transport is not healthy. With all these preconditions it should be safe to leave the redelivered message in the queue.

If this works, I will squash the commits (for now it might be easier for review).

@Nyholm
Member

Nyholm commented Feb 28, 2021

Correct me if I'm wrong. We are discussing the scenario where you can lose messages if your database is down AND your failure queue is down.

These are two catastrophic things that should never happen, much less at the same time. It seems like you have an infrastructure problem and not a problem with the application logic.

I vote 👎 on making any changes in 4.4. If this problem should be handled, it will be a new feature that goes into 5.x. But I'm currently very hesitant to make any changes for this scenario.

@theravel
Contributor Author

@Nyholm not entirely so. The scenario is:

  • You have a consumer which cannot handle a specific message (logic exception, NPE, whatever). Let's say, for example, that as part of the handling logic you need to write something to a DB, which is currently down.
  • You have a failure transport configured (a Doctrine transport, so the same DB). The message goes (maybe after a few retries) to the failure transport (to the same DB).
  • The DB is still down, so the failure transport cannot store this message.

The current behaviour is: the message will be forcefully removed from the queue, even though it is then completely lost. So a single component (the DB) goes down, and you lose potentially critical data. This is not the behaviour you intend when building resilient software.

What are we supposed to do if a message can be handled neither by the handler nor by the failure transport? My opinion is that it should stay in the queue until the infrastructure is back to normal.

Whether it should go into 4.4: I agree, it would be safer to make this change in 5.x.

@theravel
Contributor Author

Any updates on this? It would be great to have a reliable messaging layer in Symfony 5.3.

@Nyholm
Member

Nyholm commented Mar 24, 2021

I still don't agree with your scenario. If you use RabbitMQ/SQS as a failure queue, then this problem is nonexistent.

You are trying to recover from the scenario where both your failure queue and your database are down. The fact that you are putting your failure queue in the database is an infrastructure decision you have made.


If we were to merge this PR, how would we solve the problem of infinite retries that block valid messages? I.e., MessageA always fails but MessageB would pass, yet it never gets a chance because the worker keeps retrying MessageA.

See the class doc of RejectRedeliveredMessageMiddleware:

The purpose of this middleware is to prevent infinite redelivery loops and to unblock the queue by republishing the redelivered messages as retries with a retry limit and potential delay.
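
For context, here is a simplified sketch of what that middleware does, reconstructed from this discussion rather than copied from the Symfony source:

// Simplified sketch, not the verbatim Symfony class: if the incoming AMQP envelope is
// flagged as redelivered, throw so the Worker rejects it immediately; retries with a
// limit and delay are then expected to come from the retry strategy instead.
use Symfony\Component\Messenger\Bridge\Amqp\Transport\AmqpReceivedStamp;
use Symfony\Component\Messenger\Envelope;
use Symfony\Component\Messenger\Exception\RejectRedeliveredMessageException;
use Symfony\Component\Messenger\Middleware\MiddlewareInterface;
use Symfony\Component\Messenger\Middleware\StackInterface;

final class RejectRedeliveredMessageMiddlewareSketch implements MiddlewareInterface
{
    public function handle(Envelope $envelope, StackInterface $stack): Envelope
    {
        $stamp = $envelope->last(AmqpReceivedStamp::class);
        if ($stamp instanceof AmqpReceivedStamp && $stamp->getAmqpEnvelope()->isRedelivery()) {
            throw new RejectRedeliveredMessageException('Redelivered message rejected; it should be retried via the retry strategy.');
        }

        return $stack->next()->handle($envelope, $stack);
    }
}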

@theravel
Contributor Author

What I am trying to make clear is that the failure transport is currently not reliable in Symfony Messenger because of this middleware. It is up to me as a developer to decide where I want to store my failed messages: in the same RabbitMQ/SQS/whatever instance, or in external storage. This is a general-purpose framework library which should be applicable in a variety of configurations. Not to mention that the documentation shows a default example with the main transport as AMQP and the failure transport as a DB. Otherwise we should clearly declare in the documentation that reliability is not guaranteed except in specific configurations (the default example configuration does not guarantee it, for example).

As an example, let's consider a scenario with any other storage, not a DB. For example, my application consumes messages from different sources (PubSub, SQS, RabbitMQ, etc.), but I want to have a unified way of working with failed messages, therefore I declare my failure transport as a RabbitMQ queue (a DB is not involved at all).
Everything works fine until the moment the application tries to consume a message from SQS and fails (maybe due to a business logic mistake), then tries to put the message into the failure RabbitMQ queue, which is down (for whatever reason; we can never guarantee the absence of network issues in 100% of cases, for example). What happens in this case? The message is lost.


If I were to merge this PR, the problem of infinite redelivery loops should be solved in the following way. A developer would have a choice:

  1. Do you deal with non-business-critical data (e.g. application logs)? Do not define a failure transport. In this case the middleware will be enabled and infinite redelivery will not happen. It is not guaranteed that 100% of messages will be successfully processed, though.
  2. Do you deal with business-critical data (e.g. financial transactions or orders)? Define a failure transport. Normally it should just work and infinite redelivery will not happen. If the failure transport is down, you would prefer to have 2 million messages stuck rather than 1 message lost.

This problem is similar to at-least-once vs. at-most-once delivery: you have to make a compromise.

@Nyholm
Member

Nyholm commented Mar 24, 2021

Hm. What you fail to understand is that your database being down is an extreme event. That should be your biggest issue.

You will lose data if your database is down.

The same goes if another critical part of your infrastructure is unavailable.


I am 👎 on the current patch in this PR.

Could this be solved in a different way? I.e., we ack the message only after we have successfully written it to the fail queue?

@theravel
Contributor Author

Sorry, but when you live in the world of distributed applications, outages of databases, networks, microservices, etc. are a normal thing. One should not hope for everything to be okay; one should design the application in a resilient way.

@Nyholm
Member

Nyholm commented Mar 24, 2021

I understand your point of view. Do you have any comments on this question:

Could this be solved in a different way? I.e., we ack the message only after we have successfully written it to the fail queue?

@theravel
Contributor Author

Hi @Nyholm,
thank you for your attention and responses.

I have taken a few days to think, but unfortunately I do not see alternative solutions.

When a message should be written to the failure transport (e.g. max retries exceeded, or retries were disabled), we should definitely acknowledge it only after the write has been successful. This is exactly what my pull request does: when the middleware is disabled, this block is disabled: https://github.com/symfony/symfony/blob/5.x/src/Symfony/Component/Messenger/Worker.php#L133-L138
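
For readers without the code at hand, this is roughly the block being referred to (paraphrased from this discussion, not a verbatim copy of Worker.php):

// Paraphrased sketch of the catch block in the Worker:
} catch (\Throwable $throwable) {
    if ($throwable instanceof RejectRedeliveredMessageException) {
        // the redelivered message is rejected *before* WorkerMessageFailedEvent is
        // dispatched, i.e. before the failure transport ever gets a chance to store it
        $receiver->reject($envelope);
    }

    // ... dispatch WorkerMessageFailedEvent; retry / failure-transport listeners run here ...
}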

@theravel
Contributor Author

theravel commented Jun 1, 2021

What should we do to get this reliability fix into Symfony Messenger?

@nicolas-grekas nicolas-grekas changed the title Removed RejectRedeliveredMessageMiddleware to avoid message to be l… Removed RejectRedeliveredMessageMiddleware to avoid message to be lost if it cannot be processed by both handler and failed transport Jul 12, 2021
@carsonbot carsonbot changed the title Removed RejectRedeliveredMessageMiddleware to avoid message to be lost if it cannot be processed by both handler and failed transport [FrameworkBundle] Removed RejectRedeliveredMessageMiddleware to avoid message to be lost if it cannot be processed by both handler and failed transport Jul 12, 2021
@fabpot fabpot modified the milestones: 4.4, 5.4 Aug 26, 2021
@fabpot fabpot modified the milestones: 5.4, 6.1 Nov 16, 2021
@klnjmm

klnjmm commented May 10, 2023

Hi,

I'm also facing this issue.
I understand that the behaviour implemented in Symfony Messenger is meant to avoid an infinite loop on a message, but as has already been said, there are situations where we need to keep the message in the queue.

We are moving from a Perl daemon to a PHP daemon with Messenger, and this issue is a big problem for us.

Is there any news about it?

Thank you.

@nicolas-grekas nicolas-grekas modified the milestones: 6.3, 6.4 May 23, 2023
@nicolas-grekas nicolas-grekas modified the milestones: 6.4, 7.1 Nov 15, 2023
@motoronik

motoronik commented Jun 4, 2024

I agree with @theravel. I also encountered a similar problem.

Example:

I need to set up an integration with an EXTERNAL queue that does not use PHP serialization but exchanges messages in JSON format. For this I wrote my own serializer, following the example at https://symfonycasts.com/screencast/messenger/transport-serializer

This example does not work, because the native serializer cannot cope with the serialization task due to a Closure:

Serialization of 'Closure' is not allowed

I decided to discard stamps from the message to disable the error-transport logic. But a problem remains with RejectRedeliveredMessageMiddleware.
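
A rough sketch of the kind of stamp-free JSON serializer described here (the class, message and header names are hypothetical, loosely following the linked SymfonyCasts example):

// Hypothetical sketch: OrderPlaced and the header values are invented for illustration.
use Symfony\Component\Messenger\Envelope;
use Symfony\Component\Messenger\Transport\Serialization\SerializerInterface;

final class OrderPlaced
{
    public function __construct(public string $orderId) {}
}

final class JsonMessageSerializer implements SerializerInterface
{
    public function decode(array $encodedEnvelope): Envelope
    {
        $data = json_decode($encodedEnvelope['body'], true, 512, \JSON_THROW_ON_ERROR);

        return new Envelope(new OrderPlaced($data['orderId']));
    }

    public function encode(Envelope $envelope): array
    {
        /** @var OrderPlaced $message */
        $message = $envelope->getMessage();

        // stamps are deliberately not encoded, so nothing unserializable ends up in the payload
        return [
            'body' => json_encode(['orderId' => $message->orderId], \JSON_THROW_ON_ERROR),
            'headers' => ['type' => 'order.placed'],
        ];
    }
}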

If a message with isRedelivered: true arrives in the queue, it is deleted from the queue without a trace. This behaviour is unacceptable. If I have to choose between two options:

  1. Stopping message processing due to an error
  2. Discarding erroneous messages

then I'll choose option 1: I'd rather deal with the problem manually than silently lose data.

And yes, if you use the RabbitMQ web UI to read a message from the queue, it is returned to the queue with the isRedelivered: true attribute. This means that even a CORRECT and VALID message can carry isRedelivered: true, so I conclude that RejectRedeliveredMessageMiddleware is a big bug and a Messenger problem that needs to be removed completely.
