Fix race conditions in FTP socket closure that cause intermittent errors #17540
Conversation
I just ran a test with Cyberduck on Windows - on one of the FTP servers, there are NO issues - hundreds of transfers are fast and don't cause 426 errors. On the other server, the first file just hangs at 100% forever. Other clients (e.g. WinSCP) work fine with both servers. This is likely a separate issue though, so let's ignore it for now.
I'm trying to compile the Windows version with this PR to see if it happens to resolve the hanging issue. However, I'm getting errors pulling the NuGet dependencies from nuget.config:

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <packageSources>
        <clear />
        <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
        <add key="gh-iterate-ch" value="https://nuget.pkg.github.com/iterate-ch/index.json" protocolVersion="3" />
        <add key="gh-ikvmnet" value="https://nuget.pkg.github.com/ikvmnet/index.json" protocolVersion="3" />
      </packageSources>
      <packageSourceMapping>
        <packageSource key="nuget.org">
          <package pattern="IKVM.ByteCode" />
          <package pattern="*" />
        </packageSource>
        <packageSource key="gh-ikvmnet">
          <package pattern="IKVM*" />
        </packageSource>
        <packageSource key="gh-iterate-ch">
          <package pattern="iterate-ch.*" />
        </packageSource>
      </packageSourceMapping>
    </configuration>

It gives me a 401, and if I put in my GH token, it gives a 403.
Sorry for the unclear instructions, I'll update the building part of the readme. In the meantime, make sure to use a "Legacy/Classic" personal access token with at least "read:packages" permission, then run the Terminal commands in the Cyberduck repo root.
After that, building with
@AliveDevil thanks, that worked. Two more notes:
Sadly, even with the changes from this PR applied, the transfer still hangs on one of the FTP servers on Windows. However, I looked at its Wireshark dump, and this issue seems directly connected to the one on macOS. The problem on Windows is that it sometimes doesn't send a FIN with the last FTP-DATA packet, so the server naturally waits for more data. The client, on the other side, is waiting for the server's response, so it hangs. I believe that my fix from this PR should have worked here too, but perhaps due to the Windows networking stack and/or IKVM cross-compilation, it has no effect or behaves differently. I collected TCP dumps of various FTP clients that work properly, both on macOS and Windows, and they all shared the same flow:
To be honest, there might be a better way to achieve a correct flow here. Could this perhaps be a bug somewhere outside of the read/write close() functions? Maybe a race between when data is drained and when close is called? I'm not familiar with Cyberduck's codebase. Sadly, as things stand now, Cyberduck's FTP implementation seems very broken on both macOS and Windows.
Good news, I came up with a proper implementation that fixes both Windows and macOS across all the servers I tested. Additionally, the macOS delay side effect is now gone. I will polish it up and update this PR. Ultimately, this turned out to be a bug in Apache Commons Net's implementation of FTPClient - more specifically, its socket handling.
Force-pushed the new fix. To summarize, the issue was in Apache Commons Net's SocketOutputStream:

    // org.apache.commons.net.io.SocketOutputStream (Apache Commons Net)
    public class SocketOutputStream extends FilterOutputStream {
        private final Socket socket;

        public SocketOutputStream(Socket socket, OutputStream stream) {
            super(stream);
            this.socket = socket;
        }

        @Override
        public void close() throws IOException {
            super.close();
            this.socket.close();
        }

        @Override
        public void write(byte[] buffer, int offset, int length) throws IOException {
            this.out.write(buffer, offset, length);
        }
    }

Its close() closes the underlying socket outright, immediately after closing the stream, without any orderly shutdown of the connection.

There are two ways I could think of fixing this - either ensure the connection is finalized in every place where the stream or socket is closed, or wrap the socket so that closing it always performs a proper shutdown.

I wasn't sure just how many places would need changing for the first approach, and whether it's possible at all, since at least some functions of Apache's FTPClient perform the closure on their own, so you have no control over them.

So I opted to wrap the socket instead - since we have a nice hook at the beginning of the connection, the new socket will be used throughout Cyberduck's and Apache's code, and anywhere it is closed, proper shutdown will be enforced.

I went a bit heavy with the comments in the code given the low-level and quirky nature. If something doesn't make sense, please let me know. I was also unsure about the socket timeout part, but I will leave a comment in the PR for us to discuss it.

As mentioned in my previous comment, this fix works with all of my FTP servers with no side effects. However, it's worth noting that I only tested plaintext FTP over LAN. I haven't tested FTP+TLS, SFTP, Active mode, etc. I don't expect there to be issues, but you might want to check any more exotic configurations on your end.
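For reference, here is a minimal sketch of what such a wrapping socket could look like - the class name and structure are my own illustration, not the actual code in this PR. The idea is that every close(), including the blind one in SocketOutputStream, goes through an orderly release:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Illustration only: a Socket subclass whose close() performs an orderly TCP release
// (half-close, drain, then close) instead of an abrupt close that can turn into an RST.
public class OrderlyCloseSocket extends Socket {

    @Override
    public void close() throws IOException {
        if(isClosed()) {
            return;
        }
        try {
            if(isConnected() && !isOutputShutdown()) {
                // Half-close: send our FIN, telling the server we are done writing
                shutdownOutput();
            }
            if(isConnected() && !isInputShutdown()) {
                // Wait for the server to close its side; a read timeout should bound this in real code
                final InputStream in = getInputStream();
                final byte[] buffer = new byte[8192];
                while(in.read(buffer) != -1) {
                    // discard any remaining data
                }
            }
        }
        finally {
            super.close();
        }
    }
}
```

In practice, such a socket would be handed to the FTP client through its socket factory, so both Cyberduck's and Commons Net's code end up closing the same wrapped instance.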
Alright, with the latest changes, I'm happy to announce that Cyberduck works with all of my test FTP servers, on both Windows and macOS, with no performance regressions. I stress-tested all of them with 1,000 small files back-to-back, and there were no issues. What's more, Cyberduck's performance is significantly better than that of all other FTP clients I tested, both for many small files and for a single large file. I think the PR is now ready to merge, at least from my point of view. Please let me know if you have any questions.
finally {
    log.debug("Closing socket {}", delegate);
    // Work around macOS quirk where Java NIO's SocketDispatcher.close0() has a 1,000ms delay
    CompletableFuture.runAsync(() -> {
Running this in a background thread and swallowing any error looks dangerous to me, as we will not report an error when closing the stream fails, which may indicate that the file is not properly persisted on the remote server.
I fully agree. I tried really hard to find a workaround, but couldn't come up with one. Having a whole one-second delay between file uploads is really bad when you upload lots of small files, and it makes Cyberduck upwards of 10x slower than competing FTP clients. If you have any suggestions here, I'm more than happy to take them.
Actually, I don't think a failure here specifically could result in data loss. By the time we've reached this point, we have already flushed our data, closed our stream, and waited for the server to close its side.
The only case where data loss could occur is if "Failed to shutdown output for socket" is hit. Maybe we can change that to throw instead of logging and discarding the error.
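A minimal sketch of that suggestion - the surrounding method is my own illustration, not the exact code in this PR:

```java
import java.io.IOException;
import java.net.Socket;

// Illustration only: rethrow a failed output shutdown instead of just logging it,
// so a potentially incomplete upload surfaces as a transfer error.
final class ShutdownOutput {
    static void shutdownOrFail(final Socket socket) throws IOException {
        try {
            socket.shutdownOutput();
        }
        catch(final IOException e) {
            System.err.printf("Failed to shutdown output for socket %s%n", socket);
            throw e; // propagate rather than discard
        }
    }
}
```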
You are probably right that there is no risk here in the real world. But it should probably not be in the finally block, as it does not need to be executed when the code above fails with an I/O error.
Is that close quirk on macOS specific to certain server implementations?
I wouldn't want to include the workaround if it can only be reproduced with a single FTP server implementation.
No other FTP client experiences this delay though, so it's clearly on Cyberduck's side. I'm away from home this week but will look more into it next week.
@dkocher I did a deep dive on this.
- I compared Forklift's traffic dump with Cyberduck + this PR, and it is completely identical. The packets are the same, and so are their flags (SYN, ACK, FIN).
- I debugged the JVM native code and traced the 1s delay to the macOS close() syscall.
- I wrote a script to dump netstat -anl to a file and confirmed that in both apps the socket transitions SYN_SENT -> ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT. Forklift remains in TIME_WAIT for 5s+, while Cyberduck stays for 1-5s+ depending on the SO_LINGER flag. I'm unsure why SO_LINGER(true, 0) results in 1s of TIME_WAIT, as I thought it should be skipped entirely - this could be a quirk in netstat on macOS, though.
- I tried to reproduce this issue using a minimal TCP client/server in C, but close() does not block there regardless of SO_LINGER.
- The only difference between the "1s hang" server and the "no hang" server is that the "1s hang" server completes the 4-way shutdown of the data socket first and then sends "OK" on the control socket, while the "no hang" server sends "OK" on the control socket first and then sends the FIN on the data socket. This, plus some packet timing differences.
If the control socket is truly independent of the data socket, as I understand it is, then my conclusion is that the Socket.close() blocking is purely a timing issue. There should be nothing preventing it from happening with any FTP server that hits the same timing.
So, unless you have better ideas, I vote to keep the current workaround.
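To make the timing measurements concrete, here is a minimal, self-contained sketch (my own illustration, not code from this PR) of the kind of experiment described above - timing Socket.close() under different SO_LINGER settings against a local listener:

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Illustration only: measure how long Socket.close() takes with different SO_LINGER settings.
public class LingerCloseTiming {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // {enabled, seconds}: default (disabled), abortive close (0), short orderly linger (1)
            int[][] settings = { {0, 0}, {1, 0}, {1, 1} };
            for (int[] s : settings) {
                Socket socket = new Socket();
                socket.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
                Socket accepted = server.accept();
                if (s[0] == 1) {
                    socket.setSoLinger(true, s[1]);
                }
                socket.getOutputStream().write(new byte[8192]);
                long start = System.nanoTime();
                socket.close();
                System.out.printf("SO_LINGER(%s, %d): close() took %d ms%n",
                        s[0] == 1, s[1], (System.nanoTime() - start) / 1_000_000);
                accepted.close();
            }
        }
    }
}
```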
I have just one more suggestion to try out, to see what difference it makes with different SO_LINGER options:
- Are you testing with SSL sockets? We enabled SO_LINGER for this use case if I remember correctly.
- What is the behaviour with connection.socket.linger set to false in default.properties, or explicitly in DefaultSocketConfigurator?
- Enable SO_LINGER but modify DefaultSocketConfigurator to set a shorter timeout value of 1. With 0 I see 426 errors for data transfers.
I have opened #17562 to make the SO_LINGER timeout configurable.
Can you comment on which server implementations you tested this changeset with, to avoid regressions?
Found related documentation in "Orderly Versus Abortive Connection Release in Java".
I tested with the following FTP servers:
However, I just noticed a peculiar issue exclusively with FileZilla Server - there is insane read amplification. I confirmed that our draining of the input on download operations actually triggers the server to re-send the file again and again. Since we already wrap the input/output streams, I'm thinking we could track how many bytes they have processed, and on close only perform the full close if the output byte count is greater than zero (see the sketch below). Does that make sense to you? The only caveat is that if a socket is used for both input and output, and draining the input triggers re-sending, we can't avoid this. But at least in FTP this should not be the case - data sockets are either input-only or output-only.
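A minimal sketch of that idea - the class and field names are my own illustration, not the actual change: count the bytes written through the wrapped output stream, and only run the orderly drain-and-close when something was actually uploaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Illustration only: track bytes written on the data socket so close() can decide
// between a plain close (downloads) and a full orderly shutdown with input drain (uploads).
public class CountingDataSocket {
    private final Socket socket;
    private long bytesWritten = 0;

    public CountingDataSocket(final Socket socket) {
        this.socket = socket;
    }

    public OutputStream getOutputStream() throws IOException {
        final OutputStream out = socket.getOutputStream();
        return new OutputStream() {
            @Override
            public void write(final int b) throws IOException {
                out.write(b);
                bytesWritten++;
            }

            @Override
            public void write(final byte[] buffer, final int offset, final int length) throws IOException {
                out.write(buffer, offset, length);
                bytesWritten += length;
            }
        };
    }

    public void close() throws IOException {
        try {
            if(bytesWritten > 0) {
                // Upload: half-close and wait for the server's FIN before fully closing
                socket.shutdownOutput();
                final InputStream in = socket.getInputStream();
                final byte[] buffer = new byte[8192];
                while(in.read(buffer) != -1) {
                    // drain
                }
            }
            // Download: nothing was written, so close without draining to avoid re-sends
        }
        finally {
            socket.close();
        }
    }
}
```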
Pushed my fix as described above. It seems to work everywhere now, with no more read amplification.
We will want to test interoperability with common FTP servers in use such as
The embedded integration tests run against Apache FTP Server.
finally {
    log.debug("Closing socket {}", delegate);
    // Work around macOS quirk where Java NIO's SocketDispatcher.close0() has a 1,000ms delay
    CompletableFuture.runAsync(() -> {
I mean, do you see this delay regardless of which server implementation you connect to? If it shows up only when connected to a specific server implementation, we would not want to implement a workaround.
finally {
    log.debug("Closing socket {}", delegate);
    // Work around macOS quirk where Java NIO's SocketDispatcher.close0() has a 1,000ms delay
    CompletableFuture.runAsync(() -> {
But I agree that we will want to close the socket regardless of any previous error.
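On that point, a minimal sketch of closing the socket regardless of earlier failures, without letting a close() error mask the original one - my own illustration, not the PR's code:

```java
import java.io.IOException;
import java.net.Socket;

// Illustration only: always close the socket, but keep the first failure as the
// reported error and attach any close() failure as a suppressed exception.
final class CloseAlways {
    static void finish(final Socket socket) throws IOException {
        IOException primary = null;
        try {
            socket.shutdownOutput(); // an earlier step that may fail
        }
        catch(final IOException e) {
            primary = e;
        }
        finally {
            try {
                socket.close();
            }
            catch(final IOException e) {
                if(primary != null) {
                    primary.addSuppressed(e);
                }
                else {
                    primary = e;
                }
            }
        }
        if(primary != null) {
            throw primary;
        }
    }
}
```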
Hello!
First things first, I want to clarify that this issue was only experienced on macOS. I don't know if Windows is also affected.
Over the past years, I've been observing intermittent instances of "426 Connection closed; transfer aborted." across various FTP servers (TLS and not). Recently, I tried to use Cyberduck to transfer files to my Nintendo Switch, where I host an FTP server using Sphaira or FTPd. There I realized that the same intermittent issue now has roughly a 50% chance of happening on every single file, rendering Cyberduck basically useless. I tried a variety of other clients, and they never experienced such issues, so I just abandoned Cyberduck for this purpose.

However, today I decided to dig into this properly. For some context, I am connecting over plain FTP, passive mode, over LAN. I noticed that the 426 occurs right at the end of a file transfer (e.g. a 10 GB file transfers fine for minutes and then fails at 100%). I sniffed the packets with Wireshark and noticed something interesting - Cyberduck was sometimes closing the socket right after the data is sent, before the server ACKs, resulting in an RST. Here are dumps of a working and a non-working case:
TCP logs
The fix in this PR definitely doesn't follow proper OOP - it's merely a PoC - but I think it gets the idea right. We half-close the socket (output only), signaling to the FTP server that we're done. Then the server ACKs and responds with "226 OK", and only then do we fully close the socket (input as well).
I re-ran my previously failing test cases a few hundred times, and they all worked fine with this PR! The only caveat is that, with this PR, transfers to one of the FTP servers have a small delay between a file upload completing and a new file transfer starting. I compared with other FTP clients (Forklift on macOS), and there was no such delay. On the other FTP server, neither Cyberduck nor Forklift has this delay.