Opened 14 years ago

Last modified 6 years ago

#2295 new Feature request

Ideas for improved ASCII detection

Reported by: hutyerah
Owned by:
Priority: normal
Component: FileZilla Client
Keywords:
Cc: hutyerah, elsapo
Component version:
Operating system type:
Operating system version:

Description

Currently there are lists of file extensions that
should be transferred in ASCII mode. This has
well-known problems: you have to explicitly make sure
that each extension is in the list, and if a file has
no extension, like many UNIX files, you have to set the
transfer mode manually.

I propose an addition to this system that comes into
play when a file's extension is unknown, or when the
file has no extension. FileZilla would examine which
ASCII and non-ASCII character codes are used within
the file being transferred. More on that in a little
bit. Once it has detected the type, it would confirm
with the user that this is the correct type for the
file/extension, and also allow them to add the
extension or file name to the list of extensions/filenames.

This system would be easier to implement than you
probably think. ASCII files generally use a limited
subset of the 256 possible values for each byte. A
good example is that ASCII files *generally* use only
characters you can see on your keyboard. You would scan
the entire file for bytes outside this range (for
example, ASCII control codes other than line breaks and
tabs, anything with a decimal value greater than 127,
anything not visible on the keyboard), and if one is
detected the file would be labeled as binary for
confirmation by the user. Text files are generally
significantly smaller than binary files, so scanning a
text file would take limited time, and scanning a
binary file would usually halt quickly because it would
soon hit a non-ASCII byte. Even so, there could be an
option to limit how much of a file is scanned.
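
As a rough illustration of the scan described above, here is a
minimal Python sketch. The names `first_non_text_offset` and
`guess_mode`, the 64 KiB default for `scan_limit`, and the choice
to accept tab, line feed and carriage return are all assumptions
made for illustration; none of this is existing FileZilla code.

```python
# Hypothetical sketch: names and the 64 KiB limit are assumptions.
TEXT_CONTROLS = frozenset({0x09, 0x0A, 0x0D})  # tab, LF, CR occur in ordinary text


def first_non_text_offset(data: bytes) -> int:
    """Return the offset of the first byte that is not plain ASCII text, or -1."""
    for offset, byte in enumerate(data):
        if byte in TEXT_CONTROLS or 0x20 <= byte <= 0x7E:
            continue  # printable ASCII (space through tilde) or an allowed control
        return offset  # stop at the first offending byte
    return -1


def guess_mode(path: str, scan_limit: int = 64 * 1024) -> str:
    """Read at most scan_limit bytes of the file and guess "ascii" or "binary"."""
    with open(path, "rb") as f:
        sample = f.read(scan_limit)
    return "ascii" if first_non_text_offset(sample) == -1 else "binary"
```

The guess would only be a starting point for the confirmation
dialog, since a strict ASCII check like this one flags any file
containing accented or other non-English characters as binary.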

Of course this would produce false positives, and this
is why a dialog (the confirmation from the user that I
mention above) should appear upon detection, with the
detected type pre-selected and the possible values
ASCII, binary, ASCII [save extension] and binary
[save extension]. Which save extension options are
offered would depend on whether the file has an
extension or not.
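
One possible reading of the dialog logic above, as a small Python
sketch. The function name `dialog_choices` and the rule that the
save extension options are simply hidden for files without an
extension are assumptions; the description leaves that detail open.

```python
import os


def dialog_choices(filename, detected):
    """Return (choices, preselected) for a hypothetical confirmation dialog.

    detected is the scanner's guess, either "ascii" or "binary".
    """
    choices = ["ASCII", "binary"]
    _, ext = os.path.splitext(filename)
    if ext:
        # Assumption: only offer to save when there is an extension to save.
        choices += ["ASCII [save extension]", "binary [save extension]"]
    preselected = "ASCII" if detected == "ascii" else "binary"
    return choices, preselected
```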

Thanks for reading :D

Change History (3)

comment:1 Changed 14 years ago by elsapo

Are you speaking of pure ASCII files (i.e., using only the 128
ASCII character codes), or generally of text files, to be
handled in FTP's carriage return translation mode, "ASCII"?

If the latter, then the assumption that no bytes occur above
127 is obviously only good in an English-only environment.

comment:2 Changed 14 years ago by hutyerah

I was talking generally of text files, so you're right about
"the assumption that no bytes occur above
127 is obviously only good in an English-only environment".
But I wasn't specifically talking about using bytes greater
than 127 as the check; that was just an example. There may
be something better to use, for example, checking for
control characters, as these wouldn't be used in a
non-English environment either, right? Or perhaps some kind
of selector that would choose which kind of environment you
are in.

Obviously there are issues like this with my proposal, but
it's probably better than me uploading text files without
extensions to a UNIX-based server and then getting
annoyed that they're not working :D
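
The control-character check suggested in this comment might look
something like the Python sketch below. The name
`has_disallowed_control` and the set of tolerated controls (tab,
line feed, form feed, carriage return) are assumptions. Bytes above
127 are deliberately ignored so that UTF-8 or Latin-1 text passes.

```python
# Hypothetical sketch; the allowed-control set is an assumption.
ALLOWED_CONTROLS = frozenset({0x09, 0x0A, 0x0C, 0x0D})  # tab, LF, form feed, CR


def has_disallowed_control(data: bytes) -> bool:
    """True if data contains a C0 control code (or DEL) outside the allowed set."""
    return any(
        (b < 0x20 and b not in ALLOWED_CONTROLS) or b == 0x7F
        for b in data
    )
```

One known limitation: UTF-16 text contains NUL bytes for ordinary
Latin characters, so a check like this would classify it as binary.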

comment:3 Changed 14 years ago by hutyerah

Perhaps some kind of statistical analysis may be useful as
well. Binaries would tend to have a more random distribution
across the available byte values, whereas text files would
concentrate on a smaller range. I dunno, there needs to be a
bit more thought here, it seems.
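
One way to put a number on that idea is the Shannon entropy of the
byte histogram, sketched below in Python. This is a swapped-in,
standard measure rather than anything specified in the ticket; the
function names are assumptions, and any cutoff between "text" and
"binary" would have to be tuned on real files. Note that only
compressed or encrypted data is close to uniformly random; many
executables contain long runs of zero bytes and score lower.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps each byte value to its number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def distinct_byte_count(data: bytes) -> int:
    """Number of different byte values used, a cruder 'smaller range' measure."""
    return len(set(data))
```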
