Make transport action invocations more safely retryable #136538

As described in #95100, we could make Elasticsearch less sensitive to TCP channel disconnects. The ideas described there will certainly help, but they leave one issue open: outstanding requests on a broken channel must still be failed, because we cannot know whether they were delivered before the channel broke and they are not safe to retry in general.

We should also consider whether we can improve the retryability of transport messages to the point where occasional channel drops are invisible to the user.

One possible solution would be to assign a channel-local sequence number to each request and response sent over the channel, allowing subsequent messages to ack the receipt of earlier ones. On disconnect, rather than failing outstanding requests immediately, the two nodes could first attempt to open a fresh channel which they use to determine which messages were or weren't delivered on the now-closed channel. Undelivered messages could then be re-sent automatically, and the client side of the channel could re-wire any pending listeners to send their eventual responses over the new channel. Note that this would require nodes to retain transport messages in memory until acked.
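The sender-side bookkeeping this implies could look roughly like the sketch below. It is only an illustration under stated assumptions, not a proposed implementation: `UnackedMessageTracker` and all of its methods are hypothetical names rather than existing Elasticsearch APIs, messages are modelled as plain strings, and acks are assumed to be cumulative (the peer reports the highest sequence number it received, which relies on the channel delivering messages in order).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sender-side state for one channel; not an Elasticsearch API. */
final class UnackedMessageTracker {
    // Next channel-local sequence number to assign.
    private long nextSeqNo = 0;
    // Messages retained in memory until acked, ordered by sequence number.
    private final TreeMap<Long, String> unacked = new TreeMap<>();

    /** Assigns the next sequence number and retains the message until it is acked. */
    long send(String message) {
        long seqNo = nextSeqNo++;
        unacked.put(seqNo, message);
        return seqNo;
    }

    /** Assumed cumulative ack: releases every message with seqNo <= ackedUpTo. */
    void onAck(long ackedUpTo) {
        unacked.headMap(ackedUpTo, true).clear();
    }

    /**
     * Called after a disconnect, once the peer has reported over a fresh channel
     * the highest seqNo it received on the old one (-1 if it received nothing).
     * Returns the undelivered messages in their original order.
     */
    List<String> messagesToResend(long highestReceivedByPeer) {
        onAck(highestReceivedByPeer);
        return new ArrayList<>(unacked.values());
    }

    /** Number of messages currently held in memory awaiting an ack. */
    int retainedCount() {
        return unacked.size();
    }
}
```

In this sketch the caller would re-send the returned messages through a new tracker belonging to the replacement channel. The retained-message map is where the memory cost noted above shows up: it grows until the peer acks.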
