Opened 12 years ago
Closed 12 years ago
#8595 closed Bug report (wontfix)
UTF-16 line endings get corrupted by uploading and downloading
Reported by: | Jens Mühlenhoff | Owned by: | |
---|---|---|---|
Priority: | normal | Component: | FileZilla Client |
Keywords: | utf-16 unicode text corruption | Cc: | |
Component version: | | Operating system type: | Windows |
Operating system version: | Windows 7 Professional x64 | | |
Description
I uploaded a UTF-16 text file with the following content:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 57 00 6F 00 72 00 6C 00 64 00
Note the correct BOM for UTF-16 little endian and that every second byte in the text is set to zero.
After downloading it again the content changes like this:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0D 0A 00 57 00 6F 00 72 00 6C 00 64 00
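FileZilla transmits the file in text (ASCII) mode, so the bytes of the UTF-16 line ending are handled as if they were 8-bit ASCII line ending characters. In this case only 0A 00 was corrupted, to 0D 0A 00, by the Windows -> Unix -> Windows round trip: presumably the upload finds no adjacent 0D 0A pair to convert, while the download expands the lone 0A byte to 0D 0A.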
The file is now 1 byte longer (an odd byte count for UTF-16 files is already invalid) and the content is broken.
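The byte-level effect can be reproduced with a small sketch. It is a simplified model of the net result of a text-mode round trip, not FileZilla's actual code: the two helper functions are hypothetical and only assume that an upload turns CR LF pairs into LF and that a download turns every LF into CR LF.

```python
# Bytes of the file before upload, copied from the hex dump in this report.
original = bytes.fromhex(
    "FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 "
    "57 00 6F 00 72 00 6C 00 64 00"
)

def upload_text_mode(data: bytes) -> bytes:
    """Hypothetical upload step: CR LF pairs become a bare LF on the Unix side."""
    return data.replace(b"\r\n", b"\n")

def download_text_mode(data: bytes) -> bytes:
    """Hypothetical download step: every LF becomes CR LF on the Windows side."""
    # Normalise first so that an existing CR LF pair is not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

# 0D 00 0A 00 contains no adjacent 0D 0A pair, so the upload changes nothing;
# the download then inserts 0D in front of the lone 0A byte.
round_tripped = download_text_mode(upload_text_mode(original))

print(round_tripped.hex(" ").upper())
print(len(original), "->", len(round_tripped), "bytes")
```

Under these assumptions the result matches the downloaded content shown above, including the odd byte count.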
Uploading in binary mode avoids this problem. I therefore suggest that FileZilla detect UTF-16 files, either by their BOM or by line endings in 0D 00 0A 00 format, and use binary mode for them by default to avoid corruption.
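A minimal sketch of the suggested check might look like the following. The function name `looks_like_utf16` is hypothetical and not part of FileZilla; it only inspects the leading bytes of a file and is a heuristic, not a reliable encoding detector.

```python
import codecs

def looks_like_utf16(head: bytes) -> bool:
    """Heuristic (assumption, not FileZilla behaviour): guess UTF-16 from
    the first bytes of a file."""
    # A UTF-16 BOM at the start: FF FE (little endian) or FE FF (big endian).
    # Note that a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE; for the
    # purpose of choosing binary mode that false match is harmless.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # No BOM: look for a CR LF pair encoded as UTF-16,
    # 0D 00 0A 00 (little endian) or 00 0D 00 0A (big endian).
    return b"\r\x00\n\x00" in head or b"\x00\r\x00\n" in head

sample = bytes.fromhex("FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00")
print(looks_like_utf16(sample))            # BOM present
print(looks_like_utf16(b"Hello\r\nWorld")) # plain 8-bit text
```

A client using such a check would transfer matching files in binary mode and leave everything else to the configured default.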
Attachments (2)
Change History (3)
by , 12 years ago
Attachment: | Before.txt added |
---|
comment:1 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Do you have a pointer to the FTP specification mentioning UTF-16 and text mode transfers? The problem is the very same with any binary data. Just rename an image to ".png.txt" and upload it in text mode (or automatic mode). It will break.
A UTF-16 text file could in theory be detected by its embedded zeros. But imagine a UTF-16 file with no embedded zeros because it uses non-ASCII characters only. There is no way to get auto-detection done properly.
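To illustrate the point about embedded zeros, here is a short sketch with sample strings chosen for this example (they do not come from the report). It shows UTF-16 data without a single zero byte, and a character whose UTF-16 encoding contains the byte 0A even though it is not a line ending.

```python
# Three CJK characters encoded as UTF-16 little endian, without a BOM.
cjk = "日本語".encode("utf-16-le")
print(cjk.hex(" ").upper())
print(0 in cjk)  # is there any zero byte?

# U+0A0A (a Gurmukhi letter) encodes to the two bytes 0A 0A, so a text-mode
# transfer could still alter the data although no zero byte is present.
gurmukhi = "\u0a0a".encode("utf-16-le")
print(gurmukhi.hex(" ").upper())
```

A zero-byte test would classify both byte strings as ordinary 8-bit data, which is why it cannot serve as a dependable signal on its own.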
The best and most deterministic way is to always use binary transfers. FTP was designed with US-ASCII in mind.
Feel free to re-open this issue if you find a draft, proposal or other kind of specification regarding FTP and arbitrary character encodings.
File before upload