Post by Sam Berlin
Post by Philippe Verdy
No, not on outgoing responses! It's too much processing at that time.
True. So yes, we should just store the normalized name within the FileDesc,
then all will work. All the other code that does normalization can then be
simplified, since the FileDesc will store an already-normalized string.
Once this is done (on MacOS only), we will no longer have to modify the
display or routing, and Windows/Linux users will be able to see the
correct filenames.
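As a minimal sketch of that idea (hypothetical code: the real FileDesc
constructor takes more arguments, and java.text.Normalizer is a Java 6
API; older JREs would use ICU4J's com.ibm.icu.text.Normalizer instead):

    import java.text.Normalizer;

    // Hypothetical sketch: normalize once, at construction time, so the
    // display, routing and response code can use the name as-is.
    public class FileDesc {
        private final String name;

        public FileDesc(java.io.File file) {
            this.name = Normalizer.normalize(file.getName(),
                                             Normalizer.Form.NFC);
        }

        public String getFileName() {
            return name;
        }
    }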
Note that MacOS has several filesystems, two of which are native to Macs:
HFS and HFS+.
When it is HFS, it uses the legacy 8-bit Mac encoding, where no Apple
decomposition occurs. So if you create a file named "café.txt" (with a
precomposed e with acute accent) on an HFS volume, the volume will store a
single character for "é", and File.list() will return the unmodified
filename with the precomposed accented letter.
When it is HFS+, or a Unicode-compatible filesystem such as a mounted FAT32
partition or diskette, it uses the HFS+ decomposition, and the file created
will be stored as "cafe'.txt" (with a separate accent). The application can
still open and find the file with either the precomposed name "café.txt" or
the decomposed name "cafe'.txt". But File.list() in Java on MacOS will only
return the decomposed name "cafe'.txt". That name will still be indexed as
"cafe.txt" by the HashFunction, and can still be found on the network with
queries for "cafe" or "café" or "cafe'", but the current LimeWire returns
it to others in Responses containing "cafe'.txt", which displays badly for
other Windows or Linux users...
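To make the two forms concrete, here is a small demo (again assuming
java.text.Normalizer; ICU4J offers the same operations on older JREs):

    import java.text.Normalizer;

    public class CafeDemo {
        public static void main(String[] args) {
            String precomposed = "caf\u00E9.txt";  // "é" as one char (U+00E9)
            String decomposed  = "cafe\u0301.txt"; // "e" + combining acute

            // Different char sequences, different lengths...
            System.out.println(precomposed.equals(decomposed)); // false
            System.out.println(precomposed.length());           // 8
            System.out.println(decomposed.length());            // 9

            // ...but equal once the decomposed name is put back in NFC:
            String nfc = Normalizer.normalize(decomposed,
                                              Normalizer.Form.NFC);
            System.out.println(nfc.equals(precomposed));        // true
        }
    }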
This means that the behavior of the same program on the same MacOS system
depends on which volume stores the local file. This is unlike NTFS or
FAT32 on Windows, or Linux/UFS filesystems, where the two filenames
"cafe'.txt" and "café.txt" are distinct (those filesystems force no
normalization; at most they may change the letter case, limit the filename
length, or remove a final dot).
Post by Sam Berlin
Post by Philippe Verdy
This must be done even before the detected filenames are indexed in the
QRP hash table...
It already is done prior to insertion in the hash table. See
HashFunction.keywords.
The HashFunction performs another type of normalization: it removes most
combining accents by going through a limited form of NFD, and it
suppresses some base characters or replaces them with spaces.
But it will be faster and will use fewer resources (shorter strings) if
the HashFunction input string is already in NFC form (on Mac only, because
there's nothing to do on Windows or Linux filesystems).
The algorithm in HashFunction uses specific tables made for "simplifying"
keywords, and they do not strictly conform to what we need to produce NFC
forms in FileDescs constructed from File.list() results.
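These are not HashFunction's actual tables, but the general technique it
approximates looks like this: decompose to NFD, then drop the combining
marks, so that the precomposed and the decomposed names both index as
"cafe":

    import java.text.Normalizer;

    public class KeywordDemo {
        // Sketch of the technique only, not HashFunction's real tables.
        static String stripAccents(String s) {
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
            return nfd.replaceAll("\\p{M}+", ""); // drop combining marks
        }

        public static void main(String[] args) {
            System.out.println(stripAccents("caf\u00E9"));  // "cafe"
            System.out.println(stripAccents("cafe\u0301")); // "cafe"
        }
    }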
The only problem I see: whether we should normalize File.list() results to
NFC is easy to decide from the OS type, but the real discriminant should
be the filesystem used on the volume that stores the directory whose
filenames are enumerated.
However, it seems that Java has no function in its java.io.File class to
determine which filesystem type is used; the internal java.io.File
implementation uses an OS-specific driver whose behavior depends on the
OS, not on the filesystem, assuming that all filesystems on that OS follow
the same rules (which is wrong...).
If there were such an API in Java, or in Mac-specific Java classes, we
could determine whether filenames stored in a directory and retrieved by
File.list() really need to (and must) be normalized first to NFC, or need
not (and must not) be: normalization to NFC should be enabled only for
directories stored on HFS+ volumes, not on HFS volumes...
In the absence of such an API, the only way to test is to create a dummy
filename containing at least one character that gets decomposed on HFS+,
and then to check whether it was stored in precomposed or decomposed form.
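A sketch of such a probe (the probe filename is an arbitrary choice of
this example):

    import java.io.File;
    import java.io.IOException;

    public class DecompositionProbe {
        // Create a file whose name contains a precomposed "é" (U+00E9)
        // and check whether the filesystem hands it back decomposed
        // ("e" + U+0301), as HFS+ does.
        static boolean volumeDecomposesNames(File dir) throws IOException {
            File probe = new File(dir, "probe-caf\u00E9.tmp");
            try {
                if (!probe.createNewFile())
                    throw new IOException("cannot create probe file");
                String[] names = dir.list();
                for (int i = 0; names != null && i < names.length; i++) {
                    if (names[i].startsWith("probe-caf")) {
                        // Decomposed if the accent came back as U+0301.
                        return names[i].indexOf('\u0301') >= 0;
                    }
                }
                throw new IOException("probe file not listed");
            } finally {
                probe.delete();
            }
        }
    }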
But for compatibility with other systems that will receive our responses,
it seems that NFC could be forced in all cases when parsing local
directories on MacOS. My opinion is that the decomposition forced in HFS+
was a design error by Apple. And Apple has not implemented in its port of
java.io.File a way to transparently reverse this HFS+ transformation in
File.list(), so we have to do it ourselves.
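A minimal sketch of that workaround, normalizing File.list() results only
when running on MacOS (same Normalizer caveat as above):

    import java.io.File;
    import java.text.Normalizer;

    public class NfcDirectoryLister {
        static final boolean MAC =
            System.getProperty("os.name").toLowerCase().startsWith("mac");

        // List a directory, undoing the HFS+ decomposition that Apple's
        // port of java.io.File does not reverse for us.
        static String[] listNfc(File dir) {
            String[] names = dir.list();
            if (names == null || !MAC)
                return names;
            for (int i = 0; i < names.length; i++) {
                names[i] = Normalizer.normalize(names[i],
                                                Normalizer.Form.NFC);
            }
            return names;
        }
    }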
This is certainly a problem for lots of other Mac developers who expect
that the filenames they create and successfully store on Mac volumes will
be the same names retrieved when listing directories, for example to see
if a file is still present in that list. A simple scan of this list may
return false (filename no longer present), even though a file with the
original name can still be opened with its original content. (This
developer assumption was valid on Mac Classic with only Mac-encoded HFS
volumes, but is no longer true on more recent MacOS versions or on MacOS X,
which prefer HFS+, a Unicode-capable filesystem not limited to the local
Mac 8-bit charset.)
Thankfully, on Windows and Linux/Unix, we don't have those complications:
filenames are preferably created in NFC form, but other forms are possible
and are treated as distinct.