Opened 11 years ago

Closed 11 years ago

#8595 closed Bug report (wontfix)

UTF-16 line endings get corrupted by uploading and downloading

Reported by: Jens Mühlenhoff
Owned by:
Priority: normal
Component: FileZilla Client
Keywords: utf-16 unicode text corruption
Cc:
Component version:
Operating system type: Windows
Operating system version: Windows 7 Professional x64

Description

I uploaded a UTF-16 text file with the following content:

FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 57 00 6F 00 72 00 6C 00 64 00

Note the correct BOM for UTF-16 little endian and that every second byte in the text is set to zero.

After downloading it again the content changes like this:

FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0D 0A 00 57 00 6F 00 72 00 6C 00 64 00

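FileZilla transmits the file in text mode, so 8-bit ASCII line-ending conversion is applied to the UTF-16 data and corrupts the line endings. In this case only 0A 00 was corrupted to 0D 0A 00, because of the Windows -> Unix -> Windows round trip: the file contains no adjacent 0D 0A byte pair, so nothing changes on upload, but the lone 0A byte is expanded to 0D 0A on download.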

The file is now 1 byte longer (an odd byte count for UTF-16 files is already invalid) and the content is broken.
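
The round trip described above can be reproduced with a small sketch. It assumes a simplified model of text mode in which an upload collapses each CR LF byte pair to LF and a download expands each LF byte to CR LF; the function names are hypothetical and this is not FileZilla's or any server's actual code.

```python
# Hypothetical model of a text-mode round trip, for illustration only.
original = bytes.fromhex(
    "FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 "
    "57 00 6F 00 72 00 6C 00 64 00"
)

def upload_to_unix(data: bytes) -> bytes:
    """Assumed upload behaviour: collapse each CR LF byte pair to LF."""
    return data.replace(b"\r\n", b"\n")

def download_to_windows(data: bytes) -> bytes:
    """Assumed download behaviour: expand each LF byte to CR LF."""
    return data.replace(b"\n", b"\r\n")

stored = upload_to_unix(original)          # unchanged: 0D and 0A are never adjacent
downloaded = download_to_windows(stored)   # the lone 0A byte becomes 0D 0A

print(stored == original)
print(downloaded.hex(" ").upper())
print(len(original), len(downloaded))
```

Under this model the downloaded bytes match the corrupted dump above, and the length grows from 26 to 27 bytes.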

Uploading in binary mode avoids this problem. I therefore suggest that FileZilla detect UTF-16 files, either by the BOM or by line endings in 0D 00 0A 00 format, and use binary mode for them by default to avoid corruption.
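
A minimal sketch of the suggested detection, assuming the client can inspect the first bytes of a file before choosing a transfer type. The names `looks_like_utf16` and `choose_transfer_mode` are hypothetical and not part of FileZilla; the check is a heuristic and will miss UTF-16 files that have neither a BOM nor a line break.

```python
import codecs

# Hypothetical helpers for illustration; not part of FileZilla.
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
UTF16_CRLF = (b"\r\x00\n\x00", b"\x00\r\x00\n")  # CR LF in UTF-16 LE and BE

def looks_like_utf16(head: bytes) -> bool:
    """Return True if the leading bytes of a file suggest UTF-16 text."""
    if head.startswith(UTF16_BOMS):
        return True
    return any(pattern in head for pattern in UTF16_CRLF)

def choose_transfer_mode(head: bytes) -> str:
    """Pick binary mode for likely UTF-16 data, text mode otherwise."""
    return "binary" if looks_like_utf16(head) else "ascii"

print(choose_transfer_mode(bytes.fromhex("FF FE 48 00 65 00")))
print(choose_transfer_mode(b"Hello\r\nWorld"))
```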

Attachments (2)

Before.txt (26 bytes) - added by Jens Mühlenhoff 11 years ago.
File before upload
After.txt (27 bytes) - added by Jens Mühlenhoff 11 years ago.
File after download


Change History (3)

by Jens Mühlenhoff, 11 years ago

Attachment: Before.txt added

File before upload

by Jens Mühlenhoff, 11 years ago

Attachment: After.txt added

File after download

comment:1 by Alexander Schuch, 11 years ago

Resolution: wontfix
Status: new → closed

Do you have a pointer to the FTP specification mentioning UTF-16 and text mode transfers? The problem is the very same with any binary data. Just rename an image to ".png.txt" and upload it in text mode (or automatic mode). It will break.

In theory, a UTF-16 text file could be detected by its embedded zero bytes. But imagine a UTF-16 file that has no embedded zeros because it contains only non-ASCII characters. There is no way to make auto-detection reliable.
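
As a small illustration of that point, the sketch below encodes a short non-ASCII string as UTF-16 little endian without a BOM and checks it for zero bytes. The sample text is an arbitrary choice for this example, not taken from the ticket.

```python
# Illustration only: a zero-byte heuristic cannot see this UTF-16 data.
text = "日本語"                     # U+65E5 U+672C U+8A9E, no ASCII characters
data = text.encode("utf-16-le")    # encoded without a BOM

has_zero_byte = 0 in data

print(data.hex(" ").upper())
print(has_zero_byte)
```

None of the six bytes is zero, so a detector that relies on embedded zeros would treat this data as something other than UTF-16.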

The best and most deterministic way is to always use binary transfers. FTP was designed with US-ASCII in mind.

Feel free to re-open this issue if you find a draft, proposal or other kind of specification regarding FTP and arbitrary character encodings.
