Opened 18 months ago

Closed 18 months ago

Last modified 18 months ago

#12939 closed Bug report (rejected)

Active Filetransfer terminated after TCP Window Size 0 reply

Reported by: agowa Owned by:
Priority: normal Component: FileZilla Client
Keywords: tcp window size zero reset Cc: agowa
Component version: 3.64.0 Operating system type: Windows
Operating system version: Windows 11 (build 25393), 64-bit edition

Description

Steps to reproduce

  1. Upload a large file (I tried a 1 TiB file) to a proftpd server with "slow" disks using default settings in active mode.
  2. Once all disk caches on the server are filled (which happened always somewhere around 12 GiB) and the server has to sync the data to disk, it'll set the TCP Window Size to 0 while the data is written and synced.
  3. After that, it'll increase the window size again to continue the transfer.

Expected behavior

FileZilla should consider this an active connection (as long as the tcp window size probes are still ACked ofc), report a 0 byte/s current transfer speed, and continue normally once the window size has been increased again.

Current behavior

FileZilla terminates the file transfer with the reason "connection timed out", probably because of oversight within the timeout logic. It probably only checks for 0 bytes/s transferred instead of also checking if the TCP connection is still alive and only set to a window size of 0.
Funnily FileZilla tries to restart the connection but then fails because the temporary ".in.*" file still exists (A 2nd bug, or do we hit some kind of edge case here?)
Also noteworthy, the retry setting was set to 5, but FileZilla did only one automatic retry and then failed the transfer. Also, the 10-second delay between retries was not honored here, as the retry happened instantly after the reset was sent to the server.
And also noteworthy is that FileZilla will delete the partial upload from the server once it failed the transfer...

Workaround

Go into the options and set the timeout to 0.

But this has still undesirable side effects like no timeouts are detected anymore and the transfer will therefore get stuck completely once an actual timeout happens.

(Admittedly, another workaround is to set the delay kinda high to 5 minutes or so, but even though it would work for this connection, it'll unnecessarily slow down other transfers to different servers and also not correctly detect the connection state. It'd only work if the server manages to sync and accept data again within 5 minutes.)

Proposed fix

1st Add a TCP connection state check into the timeout detection logic (not a timeout if still packages are exchanged, see screenshot for example connection)
2nd And (maybe) add an additional option within the connection timeout section for backward compatibility like "☑️ treat a Zero TCP Window Size as timed out" with a default value of being unchecked (Enabling this option will result in the current behavior again).
3rd (purely visual), fix the connection speed that is shown in the UI, currently it never shows 0, not even if "Display momentary transfer speed instead of average speed" is enabled.
4th If the reconnect logic triggers a reset of an active file transfer, it should expect the temporary ".in.*" file already be present and automatically resume it instead of failing the transfer because it already exists. (My default file exists action is set to "Ask for action" btw, but it also didn't ask, it just failed the (unnecessarily triggered) retry)
5th honor the reconnect settings also for automatic reconnects (amount and delay in-between).

Attachments (2)

Screenshot 2023-06-26 214253.png (198.9 KB ) - added by agowa 18 months ago.
Wireshark screenshot of an FTP transfer with tcp windowsize zero
capture.pcapng.gz (1016 bytes ) - added by agowa 18 months ago.

Download all attachments as: .zip

Change History (7)

by agowa, 18 months ago

Wireshark screenshot of an FTP transfer with tcp windowsize zero

by agowa, 18 months ago

Attachment: capture.pcapng.gz added

comment:1 by Tim Kosse, 18 months ago

Status: newmoreinfo

using [...] active mode.

Apart from the direction of data connection establishment, active vs. passive mode does not have any bearing on the rest of the transfer.

Once all disk caches on the server are filled (which happened always somewhere around 12 GiB) and the server has to sync the data to disk, it'll set the TCP Window Size to 0 while the data is written and synced.

Some crucial information is missing here, why isn't the write-out happening in parallel with receiving new data? In particular, during steady-state in the middle of the transfer, the disk write speed on the server should be the same as the speed data is read from the network.

Please provide inside on this peculiar server behavior.

instead of also checking if the TCP connection is still alive and only set to a window size of 0.

This information isn't typically available to ordinary programs, all of this is hidden inside the operating system's TCP stack.

Funnily FileZilla tries to restart the connection but then fails because the temporary ".in.*" file still exists (A 2nd bug, or do we hit some kind of edge case here?)

Why temporary files? The server should immediately write to the final file name. If transfer gets interrupted, client will reconnect, check the size of the target file and resume from what's on the server.

The protocol itself has no concept of incomplete transfers under temporary names. Using temporary names as a server is a non-portable and implementation specific, doomed to fail approach.

How is the client to know whether the server uses temporary files, and if so, what the temporary file name is supposed to be? What is the escapement scheme to distinguish temporary files from those that merely happen to have the same filename as what the server considers a temporary file?

honor the reconnect settings also for automatic reconnects (amount and delay in-between).

Automatic retry depends on progress made. If progress isn't made, e.g. due to there not being a file because of an implementation specific, non-portable temporary file name, automatic resume isn't attempted. Otherwise on large files, terabytes of the same data would transferred over and over again.

comment:2 by agowa, 18 months ago

Status: moreinfonew

Some crucial information is missing here, why isn't the write-out happening in parallel with receiving new data? In particular, during steady-state in the middle of the transfer, the disk write speed on the server should be the same as the speed data is read from the network.

BTRFS + cryptsetup + LVM (RAID 5) + filesystem caching is to blame, I suppose.
Also, the LVM raid consists of multiple very slow but high-capacity HDDs. And the file transfer was via a fast direct link (2 bonded Mellanox interfaces, I.E. 50000mbps total with a link mtu of 9000)

This information isn't typically available to ordinary programs, all of this is hidden inside the operating system's TCP stack.

Other apps can handle it, and from a quick look into the documentation it should be possible by setting SO_RCVTIMEO and SO_SNDTIMEO and then checking EAGAIN or EWOULDBLOCK, or EINPROGRESS.
And according to the tcp documentation, the default timeout in such cases TCP_USER_TIMEOUT should be 20 minutes and for tcp_retries2 it is at least 13 minutes, so I suppose you're handling it in FileZilla code somehow already, as this issue wouldn't occur with these kernel defaults.

https://linux.die.net/man/7/socket
https://man7.org/linux/man-pages/man7/tcp.7.html

Why temporary files?

I don't know, but it is the FileZilla client that wants to upload the data to a temporary file, at least that's what I thought when I opened this ticket. As that's what the log file of (the client and the server) showed. Filezilla itself ultimately starts the upload to the temporary filename, so it should know the name even in case of a retry (the error message filezille generates and that causes the transfer to fail is also about the temp filename and not about the actual filename).

But you're right, it's indeed related to a kind of server feature (even though it's FileZilla at the ends that initiate the upload to the temp/alternative name and not the server redirecting the writes without FileZilla knowing).
http://www.proftpd.org/docs/modules/mod_xfer.html#HiddenStores

Automatic retry depends on progress made. If progress isn't made, e.g. due to there not being a file because of an implementation specific, non-portable temporary file name, automatic resume isn't attempted.

As FileZilla is uploading to the ".in" name and the error it throws is about "the file existing", I'd think it should be able to handle this as usual.

If you want to try to reproduce these issues, the Data Transfer Nodes use a fully patched Alpine Linux with basically just this proftpd config (except for the slow disks in the raid 5 LVM and the "unusually" fast network link)

/etc/proftpd/proftpd.conf:

ModulePath /usr/lib/proftpd
ModuleControlsACLs              insmod,rmmod allow user root
ModuleControlsACLs              lsmod allow user *
ServerName                      "storage001.local"
ServerType                      standalone
WtmpLog                         off
Port                            21
UseIPv6                         on
Umask                           022
MaxInstances                    30
User                            proftpd
Group                           proftpd
AllowOverwrite                  on
# proftpd supports utf8, but MultilineRFC2228 being on causes issues with the detection logic in clients and caused a lot of damaged file names, so disabling it for now...
MultilineRFC2228                off
DefaultServer                   on
ShowSymlinks                    on
TimeoutNoTransfer               600
TimeoutStalled                  600
TimeoutIdle                     1200
DisplayLogin                    welcome.msg
DisplayChdir                    .message true
ListOptions                     "-l"
DenyFilter                      \*.*/
AllowStoreRestart               On
HiddenStores                    on
DeleteAbortedStores             on
DefaultRoot                     /mnt/data

comment:3 by Tim Kosse, 18 months ago

Resolution: rejected
Status: newclosed

BTRFS + cryptsetup + LVM (RAID 5) + filesystem caching is to blame, I suppose

That's a performance killer for a check-summed, copy-on-write filesystem due to write amplification.

I found that a Btrfs raid 1 (avoid Btrfs' raid 5/6 due to the write hole issue) on top of multiple individually encrypted disks performs a lot better.

Block-layer raid must replicate everything, whereas raid on filesystem level only replicates what needs to be replicated.

Needless to say, always use the most recent kernel version, there have been significant performance improvements lately.

SO_RCVTIMEO and SO_SNDTIMEO

These aren't applicable here. FileZilla does not use blocking socket i/o.

TCP_USER_TIMEOUT

For handling timeouts, FileZilla users the timeout value configured in the settings dialog and keeps track of the last time data was sent/received. Due to it using non-blocking I/O it does not need to depend on TCP_USER_TIMEOUTS. In particular, using a non-zero TCP_USER_TIMEOUT with the same value as configured as timeout in FileZilla would not have changed the outcome in any way.

HiddenStores on
DeleteAbortedStores on

That makes it a server-side configuration issue that makes resume impossible.

As FileZilla is uploading to the ".in" name

No, it isn't. FileZilla does not use temporary files, it uploads directly to the target name, as you can see in the STOR command it sends.

Closing this issue, as you can configure custom timeouts in FileZilla should the default value not fit the particular use-case. The inability to resume is due to bad server configuration.

comment:4 by agowa, 18 months ago

Am I misinterpreting the log of the STOR command then? "Response: 550 test.dat: Temporary hidden file /upload_test/.in.test.dat."

16:08:27	Response:	200 EPRT command successful
16:08:27	Trace:	CFtpRawTransferOpData::ParseResponse() in state 2
16:08:27	Trace:	CControlSocket::SendNextCommand()
16:08:27	Trace:	CFtpRawTransferOpData::Send() in state 4
16:08:27	Command:	STOR test.dat
16:08:27	Trace:	CTransferSocket::OnAccept(0)
16:08:27	Trace:	CTransferSocket::OnConnect
16:08:27	Trace:	First EAGAIN in CTransferSocket::OnSend()
16:08:27	Trace:	Made progress in CTransferSocket::OnSend()
16:08:27	Trace:	CFtpControlSocket::OnReceive()
16:08:27	Response:	150 Opening BINARY mode data connection for test.dat
16:08:27	Trace:	CFtpRawTransferOpData::ParseResponse() in state 4
16:08:27	Trace:	CControlSocket::SendNextCommand()
16:08:27	Trace:	CFtpRawTransferOpData::Send() in state 5
16:08:55	Error:	Connection timed out after 20 seconds of inactivity
16:08:55	Trace:	CRealControlSocket::DoClose(2050)
16:08:55	Trace:	CControlSocket::DoClose(2050)
16:08:55	Trace:	CFtpControlSocket::ResetOperation(2114)
16:08:55	Trace:	CControlSocket::ResetOperation(2114)
16:08:55	Trace:	CFtpRawTransferOpData::Reset(2114) in state 5
16:08:55	Trace:	CFtpControlSocket::ResetOperation(2114)
16:08:55	Trace:	CControlSocket::ResetOperation(2114)
16:08:55	Trace:	CFtpFileTransferOpData::Reset(2114) in state 7
16:08:55	Error:	File transfer failed after transferring 5909315872 bytes in 27 seconds
16:08:55	Trace:	CFileZillaEnginePrivate::ResetOperation(2114)
16:08:55	Trace:	CFtpControlSocket::FileTransfer()
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 0
16:08:55	Status:	Resolving address of storage001.local
16:08:55	Status:	Connecting to [2001:*****:5]:21...
16:08:55	Status:	Connection established, waiting for welcome message...
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	220 ProFTPD Server (ProFTPD Default Installation) [2001:*****:5]
16:08:55	Trace:	CFtpLogonOpData::ParseResponse() in state 1
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 2
16:08:55	Command:	AUTH TLS
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	500 AUTH not understood
16:08:55	Trace:	CFtpLogonOpData::ParseResponse() in state 2
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 3
16:08:55	Command:	AUTH SSL
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	500 AUTH not understood
16:08:55	Trace:	CFtpLogonOpData::ParseResponse() in state 3
16:08:55	Status:	Insecure server, it does not support FTP over TLS.
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 5
16:08:55	Trace:	CFtpControlSocket::SetAsyncRequestReply
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 6
16:08:55	Command:	USER user
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	331 Password required for user
16:08:55	Trace:	CFtpLogonOpData::ParseResponse() in state 6
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpLogonOpData::Send() in state 6
16:08:55	Command:	PASS ************
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	230 User user logged in
16:08:55	Trace:	CFtpLogonOpData::ParseResponse() in state 6
16:08:55	Status:	Server does not support non-ASCII characters.
16:08:55	Status:	Logged in
16:08:55	Trace:	Measured latency of 2 ms
16:08:55	Trace:	CFtpControlSocket::ResetOperation(0)
16:08:55	Trace:	CControlSocket::ResetOperation(0)
16:08:55	Trace:	CFtpLogonOpData::Reset(0) in state 15
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpFileTransferOpData::Send() in state 0
16:08:55	Status:	Starting upload of /tmp/test.dat
16:08:55	Trace:	CFtpChangeDirOpData::Send() in state 0
16:08:55	Trace:	CFtpChangeDirOpData::Send() in state 2
16:08:55	Command:	CWD /upload_test
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	250 CWD command successful
16:08:55	Trace:	CFtpChangeDirOpData::ParseResponse() in state 2
16:08:55	Trace:	CFtpControlSocket::ResetOperation(0)
16:08:55	Trace:	CControlSocket::ResetOperation(0)
16:08:55	Trace:	CFtpChangeDirOpData::Reset(0) in state 2
16:08:55	Trace:	CFtpFileTransferOpData::SubcommandResult(0) in state 1
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpListOpData::Send() in state 0
16:08:55	Status:	Retrieving directory listing of "/upload_test"...
16:08:55	Trace:	CFtpChangeDirOpData::Send() in state 0
16:08:55	Trace:	CFtpControlSocket::ResetOperation(0)
16:08:55	Trace:	CControlSocket::ResetOperation(0)
16:08:55	Trace:	CFtpChangeDirOpData::Reset(0) in state 0
16:08:55	Trace:	CFtpListOpData::SubcommandResult(0) in state 1
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpListOpData::Send() in state 2
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 0
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 1
16:08:55	Command:	TYPE I
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	200 Type set to I
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 1
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 2
16:08:55	Command:	EPRT |2|2001:***:f|56637|
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	200 EPRT command successful
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 2
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 4
16:08:55	Command:	MLSD
16:08:55	Trace:	CTransferSocket::OnAccept(0)
16:08:55	Trace:	CTransferSocket::OnConnect
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	150 Opening BINARY mode data connection for MLSD
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 4
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 5
16:08:55	Trace:	CTransferSocket::OnReceive(), m_transferMode=0
16:08:55	Trace:	CTransferSocket::OnReceive(), m_transferMode=0
16:08:55	Trace:	CTransferSocket::OnReceive(), m_transferMode=0
16:08:55	Trace:	CTransferSocket::OnReceive(), m_transferMode=0
16:08:55	Trace:	CTransferSocket::OnReceive(), m_transferMode=0
16:08:55	Trace:	CTransferSocket::TransferEnd(1)
16:08:55	Trace:	CFtpControlSocket::TransferEnd()
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	226 Transfer complete
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 7
16:08:55	Trace:	CFtpControlSocket::ResetOperation(0)
16:08:55	Trace:	CControlSocket::ResetOperation(0)
16:08:55	Trace:	CFtpRawTransferOpData::Reset(0) in state 7
16:08:55	Trace:	CFtpListOpData::SubcommandResult(0) in state 3
16:08:55	Trace:	CFtpControlSocket::ResetOperation(0)
16:08:55	Trace:	CControlSocket::ResetOperation(0)
16:08:55	Trace:	CFtpListOpData::Reset(0) in state 3
16:08:55	Trace:	CFtpFileTransferOpData::SubcommandResult(0) in state 2
16:08:55	Trace:	CControlSocket::CheckOverwriteFile()
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpFileTransferOpData::Send() in state 5
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 0
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 2
16:08:55	Command:	EPRT |2|2001:***:f|46289|
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	200 EPRT command successful
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 2
16:08:55	Trace:	CControlSocket::SendNextCommand()
16:08:55	Trace:	CFtpRawTransferOpData::Send() in state 4
16:08:55	Command:	STOR test.dat
16:08:55	Trace:	CFtpControlSocket::OnReceive()
16:08:55	Response:	550 test.dat: Temporary hidden file /upload_test/.in.test.dat. already exists
16:08:55	Trace:	CFtpRawTransferOpData::ParseResponse() in state 4
16:08:55	Trace:	CFtpControlSocket::ResetOperation(2)
16:08:55	Trace:	CControlSocket::ResetOperation(2)
16:08:55	Trace:	CFtpRawTransferOpData::Reset(2) in state 4
16:08:55	Trace:	CFtpFileTransferOpData::SubcommandResult(2) in state 7
16:08:55	Trace:	CFtpControlSocket::ResetOperation(2)
16:08:55	Trace:	CControlSocket::ResetOperation(6)
16:08:55	Trace:	CFtpFileTransferOpData::Reset(6) in state 7
16:08:55	Error:	Critical file transfer error
16:08:55	Trace:	CFileZillaEnginePrivate::ResetOperation(6)
Version 0, edited 18 months ago by agowa (next)

comment:5 by Tim Kosse, 18 months ago

The human-readable part of the response to the STOR command has no meaning. The only thing clients look at is the response code, and of that only the very first digit.

Note: See TracTickets for help on using tickets.