fix(cluster): reconnect to nodes that restart without slot changes #2096

Open
maxbronnikov10 wants to merge 10 commits into redis:main from maxbronnikov10:feat/cluster-node-retry-strategy

Conversation

maxbronnikov10 (Contributor) commented Apr 9, 2026

Closes:
#587
#1732

Problem

When a cluster node (typically a replica) restarts without any slot changes,
ioredis permanently removes it from the connection pool and never reconnects.
This happens because retryStrategy is hardcoded to null for every node in
the pool — by design, to force MOVED-based topology refresh. However, MOVED
errors are never triggered when slots don't move, so the node is lost forever.

This causes two downstream issues:

  • scaleReads: "slave" silently drops replicas, degrading read throughput
  • With enableOfflineQueue: false, commands fail with
    "Cluster isn't ready and enableOfflineQueue options is false" because
    getInstanceByKey() returns undefined for the removed node

Common in Kubernetes environments during rolling restarts.

Changes

clusterNodeRetryStrategy — new option (mirrors clusterRetryStrategy)
that controls the retryStrategy for individual cluster node connections.
When set, a node that loses its connection stays in the pool and retries
instead of being permanently removed. If the node is later removed from the
cluster topology via reset() (triggered by MOVED or slotsRefreshInterval),
it is still cleaned up correctly.

new Cluster(nodes, {
  clusterNodeRetryStrategy: (times) => Math.min(100 + times * 2, 2000),
})
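
For reference, the strategy in the example above is a linear backoff capped at 2 seconds. A minimal sketch of the delays it produces (the `nodeRetryDelay` name is ours, for illustration only; it is not part of the ioredis API):

```typescript
// Same formula as the example strategy above: 100 ms base plus 2 ms
// per attempt, capped at 2000 ms.
const nodeRetryDelay = (times: number): number =>
  Math.min(100 + times * 2, 2000);

// The first attempt waits 102 ms, the delay grows linearly, and it
// reaches the 2000 ms cap at attempt 950.
const delays = [1, 10, 100, 950, 1000].map(nodeRetryDelay);
```

Returning a number keeps retrying forever; a real strategy may also return nothing after some attempt count to give up on the node.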

Note

Medium Risk
Touches Redis Cluster connection lifecycle and command routing when nodes disconnect, which can affect availability and error behavior under failure conditions. Changes are scoped and covered by new functional/unit tests, but reconnection timing and edge cases warrant review.

Overview
Adds clusterNodeRetryStrategy to ClusterOptions (default null) to control per-node retryStrategy so restarted nodes (often replicas) can reconnect without requiring slot changes.

ConnectionPool now accepts this strategy and applies it when creating node Redis instances, and Cluster passes the option through while also tightening sendCommand rejection/queueing logic when enableOfflineQueue: false and a chosen node is not ready.
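
The tightened dispatch behavior described above can be sketched as a pure decision function. This is a hypothetical illustration, not the actual ioredis code; the `dispatchDecision` helper and its inputs are invented here:

```typescript
type Decision = "send" | "queue" | "reject";

// Sketch of the command-dispatch decision when a node is chosen for a
// slot, assuming the behavior described in this PR's overview.
function dispatchDecision(
  nodeExists: boolean,
  nodeReady: boolean,
  enableOfflineQueue: boolean
): Decision {
  if (!nodeExists) {
    // No node for the slot: queue on the cluster's offline queue if
    // allowed, otherwise reject.
    return enableOfflineQueue ? "queue" : "reject";
  }
  if (!nodeReady && !enableOfflineQueue) {
    // Changed behavior: a known-but-reconnecting node now rejects
    // promptly instead of silently parking the command on the node's
    // own offline queue.
    return "reject";
  }
  return "send";
}
```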

Includes new functional tests for node restart/reconnect and MOVED redirection scenarios plus unit tests for ConnectionPool strategy wiring and nodeError emission; also ignores .history in .gitignore.

Reviewed by Cursor Bugbot for commit 0e31bf3. Bugbot is set up for automated code reviews on this repo.

Copilot AI review requested due to automatic review settings April 9, 2026 18:59

Copilot AI left a comment

Pull request overview

Adds a new Cluster option to allow per-node reconnection retries so cluster nodes (notably replicas) that restart without slot changes aren’t permanently dropped from the pool—improving resiliency in rolling-restart environments.

Changes:

  • Introduces clusterNodeRetryStrategy (Cluster option) to control retryStrategy for individual node connections in the connection pool.
  • Updates cluster command dispatch to reject commands (when enableOfflineQueue: false) if the chosen node exists but is not ready (e.g., reconnecting), instead of silently queueing on the node’s offline queue.
  • Adds unit + functional tests covering default behavior, node retention vs removal, and offline-queue rejection behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • lib/cluster/ClusterOptions.ts: Adds the new clusterNodeRetryStrategy option (and updates typing to allow clusterRetryStrategy: null).
  • lib/cluster/ConnectionPool.ts: Applies clusterNodeRetryStrategy (when provided) as the per-node retryStrategy; otherwise keeps retryStrategy: null.
  • lib/cluster/index.ts: Passes clusterNodeRetryStrategy into the pool and rejects commands to non-ready nodes when the cluster offline queue is disabled.
  • test/unit/clusters/ConnectionPool.ts: Unit tests verifying how retryStrategy is set based on clusterNodeRetryStrategy.
  • test/functional/cluster/node_reconnect.ts: Functional tests for node removal vs retention, command rejection with enableOfflineQueue: false, and reconnect after restart.
Comments suppressed due to low confidence (1)

lib/cluster/ConnectionPool.ts:86

  • clusterNodeRetryStrategy is being passed into defaults(..., this.redisOptions, ...), which means it will also be copied onto the per-node Redis instance options object (as an unknown option). Consider stripping clusterNodeRetryStrategy out before passing options to new Redis(...) to avoid leaking Cluster-only configuration into Redis options and reduce confusion when inspecting redis.options.
  createRedisFromOptions(node: RedisOptions, readOnly: boolean) {
    const redis = new Redis(
      defaults(
        {
          // By default, never try to reconnect when a node is lost;
          // instead, wait for a `MOVED` error and fetch slots again.
          // When `clusterNodeRetryStrategy` is set, use it to allow
          // reconnection (e.g. for replica nodes that restart without
          // any slot changes).
          retryStrategy:
            typeof this.redisOptions.clusterNodeRetryStrategy === "function"
              ? this.redisOptions.clusterNodeRetryStrategy
              : null,
          // The offline queue is enabled so that we don't need to wait
          // for the `ready` event before sending commands to the node.
          enableOfflineQueue: true,
          readOnly: readOnly,
        },
        node,
        this.redisOptions,
        { lazyConnect: true }
      )
    );
    // ...
  }
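
One way to act on the reviewer's suggestion above is to drop the Cluster-only key from the merged options before constructing the Redis instance. A rough sketch under that assumption (the `stripClusterOnlyOptions` helper and the `PoolNodeOptions` shape are invented here for illustration, not part of this PR):

```typescript
interface PoolNodeOptions {
  port?: number;
  retryStrategy?: ((times: number) => number | void) | null;
  // Cluster-only option that should not leak into per-node Redis options.
  clusterNodeRetryStrategy?: (times: number) => number | void;
}

// Return a copy of the options with the Cluster-only key removed, so
// inspecting redis.options on a pool node shows only Redis options.
function stripClusterOnlyOptions(
  opts: PoolNodeOptions
): Omit<PoolNodeOptions, "clusterNodeRetryStrategy"> {
  // Destructure the key away; the rest is safe to hand to new Redis(...).
  const { clusterNodeRetryStrategy, ...redisOnly } = opts;
  return redisOnly;
}
```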


Comment thread lib/cluster/ClusterOptions.ts Outdated
Comment thread lib/cluster/index.ts Outdated
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.



Comment thread lib/cluster/index.ts Outdated
Comment thread lib/cluster/index.ts Outdated
Comment thread test/functional/cluster/node_reconnect.ts Outdated
maxbronnikov10 (Contributor, Author) commented:

@PavelPashov hello! Could you take a look at this PR, please?

Comment thread lib/cluster/index.ts Outdated
Comment thread lib/cluster/index.ts
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.



Comment thread lib/cluster/index.ts
cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 31ebcd1.

Comment thread lib/cluster/index.ts
PavelPashov (Contributor) commented:

Thanks for the PR. I will review it and follow up once I have gone through it.

PavelPashov (Contributor) commented:

run sharded pub sub tests

uglide (Contributor) commented Apr 11, 2026

Testcase                                                     Errors  Failures  Skipped  Total
Root Suite                                                        0         0        0      0
Sharded Pub/Sub E2E - Basic                                       0         0        0      4
Sharded Pub/Sub E2E - Connection Lifecycle                        0         0        0      3
Sharded Pub/Sub E2E - Failure Recovery Multiple Subscribers       0         0        0      4
Sharded Pub/Sub E2E - Failure Recovery Single Subscriber          0         0        0      4


1 similar comment

uglide (Contributor) commented Apr 11, 2026

Testcase                                                     Errors  Failures  Skipped  Total
Root Suite                                                        0         0        0      0
Sharded Pub/Sub E2E - Basic                                       0         0        0      4
Sharded Pub/Sub E2E - Connection Lifecycle                        0         0        0      3
Sharded Pub/Sub E2E - Failure Recovery Multiple Subscribers       0         3        0      4
Sharded Pub/Sub E2E - Failure Recovery Single Subscriber          0         2        0      4


maxbronnikov10 (Contributor, Author) commented:

Hello! @uglide Thanks for running the tests.

I believe these tests were already flaky before my changes — I've checked and confirmed that this PR doesn't touch any of the related logic.

However, I'm happy to investigate the issue and submit a fix in a separate PR. @PavelPashov what do you think?
