[Doctrine][Messenger] Remove old MySQL special handling that causes deadlocks #61963
+17
−33
We run over 3 million queue items a day and ran into major issues with the current implementation deadlocking regularly. No amount of adjusting the purge threads and other settings fixed the root cause: the messenger_messages table does not have a proper covering index for the SELECT FOR UPDATE query.
Because the MySQL implementation has been special-cased to batch deletes by delivered_at having a special value, at least in MySQL 8.0.* and up (we ran 8.0.42 and are now running 8.4.6) this results in row range locks that basically lock the whole table: the delivered_at index has extremely low cardinality, so every row where delivered_at is NULL gets locked.
Then UPDATE queries try to update delivered_at while the DELETE runs with a delivered_at condition, which eventually results in a deadlock.
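To make the contention easier to picture, here is a simplified sketch of the statement pattern described above. It is reconstructed from this description rather than copied from the transport code, so the exact SQL the component generates may differ. The queue name, id, redelivery cutoff and the sentinel timestamp are all assumed placeholder values.

```sql
-- Simplified sketch only; real statements are generated by the Doctrine transport.

-- 1. A worker claims the next available message.
SELECT *
FROM messenger_messages
WHERE queue_name = 'default'                       -- placeholder queue name
  AND (delivered_at IS NULL
       OR delivered_at < '2025-01-01 00:00:00')    -- placeholder redelivery cutoff
  AND available_at <= NOW()
ORDER BY available_at ASC
LIMIT 1
FOR UPDATE;

-- 2. The worker marks that message as in flight.
UPDATE messenger_messages SET delivered_at = NOW() WHERE id = 42;  -- placeholder id

-- 3. MySQL-only path: acknowledging flags the row with a special value
--    instead of deleting it right away (sentinel value assumed here).
UPDATE messenger_messages SET delivered_at = '9999-12-31 23:59:59' WHERE id = 42;

-- 4. A later batched delete removes every flagged row through the
--    low-cardinality delivered_at index.
DELETE FROM messenger_messages WHERE delivered_at = '9999-12-31 23:59:59';
```

Statements 1 and 4 both take range locks via delivered_at, while statements 2 and 3 move rows around inside that same index, which is where the lock ordering conflicts come from.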
At our scale this led to deadlocks completely overwhelming the server within an hour and hard-locking it to the point where we had to kill -9 <mysql pid>; even running very aggressive deadlock timeouts did not help.
Our database machine has plenty of resources and RAM free, so it was never a CPU, RAM or I/O issue: the server barely uses over 15% of the CPU, and the InnoDB buffer is only 40% full, so everything fits into memory. I/O never rose above 3%, mostly sitting below 1% (we have InnoDB I/O capacity set at 6000 baseline and 12000 peak, which is only a fraction of what the storage layer is capable of).
Adding a covering index on delivered_at, id does help to delay the onset of the issue, but it still resulted in hard deadlocks; it just took about 14-16 hours under our workloads.
I was unable to find the original reason why delete batching was added, but I suspect it is some MySQL 4/5-era shenanigans that are outdated and no longer true.
So this PR is what I deployed 6 days ago to our production environment, and it has been running trouble-free since then without a single deadlock recorded against the messenger component table. Collecting statistics also shows that this is the correct way to solve this; here are performance schema queries that show before and after:
I removed all batched handling and let MySQL run the same way all other databases do, which works like a charm if we also add a proper index on queue_name + available_at + delivered_at + id. This allows MySQL to lock only the specifically required row by its primary id, removing all lock contention issues (the id field in the index is needed; that is what gives the index the cardinality to do the job right).
Before (notice the average lock ms column, it is bad):
After
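For anyone who wants to collect comparable numbers on their own server, a statement-digest query along these lines reports the average lock time per statement. This is a sketch and not necessarily the query used for the figures in this PR. It assumes performance_schema statement digests are enabled, and the column aliases are made up for readability.

```sql
-- Sketch: average lock time per statement digest touching the messenger table.
-- Timer columns in performance_schema are in picoseconds; dividing by 1e9 gives ms.
SELECT
    DIGEST_TEXT,
    COUNT_STAR                                  AS executions,
    ROUND(SUM_LOCK_TIME / COUNT_STAR / 1e9, 3)  AS avg_lock_ms,
    ROUND(AVG_TIMER_WAIT / 1e9, 3)              AS avg_total_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE '%messenger_messages%'
ORDER BY SUM_LOCK_TIME DESC
LIMIT 10;
```

Running it before and after the change (optionally truncating the summary table in between) should show whether lock time on the SELECT, UPDATE and DELETE statements actually drops.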
I imagine the same covering index for the SELECT query should give similar results on other databases, as this comes down to the basics of indexing columns for database performance, but obviously some help with validating this would be appreciated.
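For reference, the index described above could be added by hand with DDL roughly like the following. The index name is hypothetical, and the table and column names assume the default messenger_messages schema; the PR itself changes the schema setup in the transport, so this is only a sketch for testing on an existing table.

```sql
-- Hypothetical index name; columns follow the order given in the description.
ALTER TABLE messenger_messages
    ADD INDEX idx_messenger_queue_available_delivered_id
        (queue_name, available_at, delivered_at, id);

-- Optional check: see whether the optimizer picks the new index for a
-- claim-style query (simplified, with a placeholder queue name).
EXPLAIN
SELECT *
FROM messenger_messages
WHERE queue_name = 'default'
  AND delivered_at IS NULL
  AND available_at <= NOW()
ORDER BY available_at ASC
LIMIT 1
FOR UPDATE;
```

On a large, busy table it is worth trying this on a replica or staging copy first, since building the index takes time and the optimizer's choice depends on the actual data distribution.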
I also believe this should be backported all the way down to the 6.4 branch, as this is an issue I have seen a lot of people run into, with the common advice being "just use RabbitMQ instead" while the root cause is not investigated properly. I had the environment and authority to dig into the root cause, and this is the result of that investigation.