Description
As described in #95100, we could make Elasticsearch less sensitive to TCP channel disconnects. The ideas described there will certainly help, but they still leave open the issue that outstanding requests on a broken channel must be failed since we cannot know whether or not they were delivered before the channel broke, and they are not safe to retry in general.
We should also consider whether we can improve the retryability of transport messages to the point where occasional channel drops are invisible to the user.
One possible solution would be to assign a channel-local sequence number to each request and response sent over the channel, allowing subsequent messages to ack the receipt of earlier ones. On disconnect, rather than failing outstanding requests immediately, the two nodes could first attempt to open a fresh channel which they use to determine which messages were or weren't delivered on the now-closed channel. Undelivered messages could then be re-sent automatically, and the client side of the channel could re-wire any pending listeners to send their eventual responses over the new channel. Note that this would require nodes to retain transport messages in memory until acked.
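The bookkeeping this would need can be sketched roughly as follows. This is a minimal illustration, not Elasticsearch code: the names `ChannelSender`, `ChannelReceiver`, `on_ack` and `to_resend` are hypothetical, and it assumes that messages on a single channel arrive in order, so a cumulative ack ("everything up to sequence number N was received") is enough.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelSender:
    """Sending side of one channel (hypothetical): numbers each message
    and retains it in memory until the peer acks it."""
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)  # seq -> payload

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload  # retained until acked
        return seq, payload

    def on_ack(self, acked_up_to):
        # The peer confirmed receipt of every message with seq <= acked_up_to.
        for seq in [s for s in self.unacked if s <= acked_up_to]:
            del self.unacked[seq]

    def to_resend(self, peer_highest_received):
        # After a disconnect, the peer reports over a fresh channel the
        # highest seq it received on the old one; the rest must be re-sent.
        self.on_ack(peer_highest_received)
        return sorted(self.unacked.items())


@dataclass
class ChannelReceiver:
    """Receiving side (hypothetical): tracks the highest seq seen so far."""
    highest_received: int = -1

    def receive(self, seq):
        # Returns False for a duplicate, so a re-sent message is not
        # processed twice.
        if seq <= self.highest_received:
            return False
        self.highest_received = seq
        return True


sender = ChannelSender()
receiver = ChannelReceiver()
wire = [sender.send(p) for p in ("a", "b", "c")]

# Only the first two messages arrive before the channel breaks.
for seq, _ in wire[:2]:
    receiver.receive(seq)

# On the fresh channel, the nodes work out what was never delivered.
resend = sender.to_resend(receiver.highest_received)
```

Here only the third message is left to re-send. The `unacked` map is the memory cost mentioned above: every message stays in it until the peer acks it.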