fix(cluster): reconnect to nodes that restart without slot changes #2096
maxbronnikov10 wants to merge 10 commits into redis:main from maxbronnikov10:feat/cluster-node-retry-strategy
Conversation
When a cluster node (typically a replica) restarts without any slot changes, ioredis permanently removes it from the connection pool and never reconnects. This happens because `retryStrategy` is hardcoded to `null` for every node in the pool — by design, to force MOVED-based topology refresh. However, MOVED errors are never triggered when slots don't move, so the node is lost forever.
Pull request overview
Adds a new Cluster option to allow per-node reconnection retries so cluster nodes (notably replicas) that restart without slot changes aren’t permanently dropped from the pool—improving resiliency in rolling-restart environments.
Changes:
- Introduces `clusterNodeRetryStrategy` (Cluster option) to control `retryStrategy` for individual node connections in the connection pool.
- Updates cluster command dispatch to reject commands (when `enableOfflineQueue: false`) if the chosen node exists but is not ready (e.g., reconnecting), instead of silently queueing on the node's offline queue.
- Adds unit + functional tests covering default behavior, node retention vs. removal, and offline-queue rejection behavior.
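The rejection behavior in the second bullet can be sketched as a small decision helper. This is an illustration only — the function name, the status model, and the overall shape are invented here, not the PR's actual code:

```typescript
type NodeStatus = "ready" | "connecting" | "reconnecting" | "end";

// Hypothetical sketch of what cluster command dispatch decides for a
// chosen node. With the offline queue disabled, a node that exists but
// is not ready causes an immediate rejection instead of the command
// silently landing on that node's offline queue.
function dispatchDecision(
  nodeStatus: NodeStatus | undefined,
  enableOfflineQueue: boolean
): "send" | "queue" | "reject" {
  if (nodeStatus === "ready") {
    return "send";
  }
  // Node is missing from the pool, or present but not ready
  // (e.g. reconnecting after a restart).
  return enableOfflineQueue ? "queue" : "reject";
}
```

The key change described above is the middle case: previously a present-but-reconnecting node would accept the command into its own offline queue even when `enableOfflineQueue` was `false` at the cluster level.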
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `lib/cluster/ClusterOptions.ts` | Adds the new `clusterNodeRetryStrategy` option (and updates typing to allow `clusterRetryStrategy: null`). |
| `lib/cluster/ConnectionPool.ts` | Applies `clusterNodeRetryStrategy` (when provided) as the per-node `retryStrategy`; otherwise keeps `retryStrategy: null`. |
| `lib/cluster/index.ts` | Passes `clusterNodeRetryStrategy` into the pool and rejects commands to non-ready nodes when the cluster offline queue is disabled. |
| `test/unit/clusters/ConnectionPool.ts` | Unit tests verifying how `retryStrategy` is set based on `clusterNodeRetryStrategy`. |
| `test/functional/cluster/node_reconnect.ts` | Functional tests for node removal vs. retention, command rejection with `enableOfflineQueue: false`, and reconnect after restart. |
Comments suppressed due to low confidence (1)
lib/cluster/ConnectionPool.ts:86
`clusterNodeRetryStrategy` is being passed into `defaults(..., this.redisOptions, ...)`, which means it will also be copied onto the per-node `Redis` instance's options object (as an unknown option). Consider stripping `clusterNodeRetryStrategy` out before passing options to `new Redis(...)` to avoid leaking Cluster-only configuration into Redis options and to reduce confusion when inspecting `redis.options`.
```typescript
createRedisFromOptions(node: RedisOptions, readOnly: boolean) {
  const redis = new Redis(
    defaults(
      {
        // By default, never try to reconnect when a node is lost;
        // instead, wait for a `MOVED` error and fetch slots again.
        // When `clusterNodeRetryStrategy` is set, use it to allow
        // reconnection (e.g. for replica nodes that restart without
        // any slot changes).
        retryStrategy:
          typeof this.redisOptions.clusterNodeRetryStrategy === "function"
            ? this.redisOptions.clusterNodeRetryStrategy
            : null,
        // The offline queue should be enabled so that we don't need
        // to wait for the `ready` event before sending commands to
        // the node.
        enableOfflineQueue: true,
        readOnly: readOnly,
      },
      node,
      this.redisOptions,
      { lazyConnect: true }
    )
  );
  // …
}
```
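A minimal sketch of the stripping suggested in the review comment, using plain object destructuring rather than lodash. This is illustrative only — `stripClusterOnlyOptions` is not a function in the PR, and the options shape is simplified:

```typescript
// Remove Cluster-only keys before the options object is handed to
// `new Redis(...)`, so they don't surface as unknown options on
// `redis.options`. `clusterNodeRetryStrategy` is the option added
// by this PR; the helper name is hypothetical.
function stripClusterOnlyOptions<
  T extends { clusterNodeRetryStrategy?: unknown }
>(options: T): Omit<T, "clusterNodeRetryStrategy"> {
  const { clusterNodeRetryStrategy, ...nodeOptions } = options;
  return nodeOptions;
}
```

The returned object could then be passed through `defaults(...)` and into the `Redis` constructor in place of the raw `this.redisOptions`.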
Force-pushed from b5d8c27 to fac3eaa.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
@PavelPashov hello! Can you take a look at this PR, please?
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 31ebcd1.
Thanks for the PR. I will review it and follow up once I have gone through it.
run sharded pub sub tests
Hello! @uglide Thanks for running the tests. I believe these tests were already flaky before my changes — I've checked and confirmed that this PR doesn't touch any of the related logic. However, I'm happy to investigate the issue and submit a fix in a separate PR. @PavelPashov what do you think?
Closes:
- #587
- #1732
Problem
When a cluster node (typically a replica) restarts without any slot changes, ioredis permanently removes it from the connection pool and never reconnects. This happens because `retryStrategy` is hardcoded to `null` for every node in the pool — by design, to force MOVED-based topology refresh. However, MOVED errors are never triggered when slots don't move, so the node is lost forever.

This causes two downstream issues:
- `scaleReads: "slave"` silently drops replicas, degrading read throughput
- With `enableOfflineQueue: false`, commands fail with `"Cluster isn't ready and enableOfflineQueue options is false"` because `getInstanceByKey()` returns `undefined` for the removed node

This is common in Kubernetes environments during rolling restarts.
Changes
`clusterNodeRetryStrategy` — new option (mirrors `clusterRetryStrategy`) that controls the `retryStrategy` for individual cluster node connections. When set, a node that loses its connection stays in the pool and retries instead of being permanently removed. If the node is later removed from the cluster topology via `reset()` (triggered by MOVED or `slotsRefreshInterval`), it is still cleaned up correctly.
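Assuming `clusterNodeRetryStrategy` has the same shape as ioredis's other retry strategies — a function from the attempt count to a delay in milliseconds, or `null` to stop retrying — a caller might supply exponential backoff like this sketch (the base delay, cap, and attempt limit here are arbitrary choices, not values from the PR):

```typescript
// Hypothetical per-node retry strategy: exponential backoff starting at
// 100 ms, capped at 2 s, giving up (returning null, which removes the
// node) after 20 attempts.
function nodeRetryStrategy(times: number): number | null {
  if (times > 20) {
    return null;
  }
  return Math.min(100 * 2 ** times, 2000);
}
```

It would presumably be passed as `new Cluster(nodes, { clusterNodeRetryStrategy: nodeRetryStrategy })`, so a replica that restarts during a rolling deploy keeps retrying for a while instead of being dropped from the pool.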
Note
Medium Risk
Touches Redis Cluster connection lifecycle and command routing when nodes disconnect, which can affect availability and error behavior under failure conditions. Changes are scoped and covered by new functional/unit tests, but reconnection timing and edge cases warrant review.
Overview
Adds `clusterNodeRetryStrategy` to `ClusterOptions` (default `null`) to control the per-node `retryStrategy` so restarted nodes (often replicas) can reconnect without requiring slot changes. `ConnectionPool` now accepts this strategy and applies it when creating node `Redis` instances, and `Cluster` passes the option through while also tightening `sendCommand` rejection/queueing logic when `enableOfflineQueue: false` and a chosen node is not ready.

Includes new functional tests for node restart/reconnect and MOVED redirection scenarios plus unit tests for `ConnectionPool` strategy wiring and `nodeError` emission; also ignores `.history` in `.gitignore`.

Reviewed by Cursor Bugbot for commit 0e31bf3.