Discussion:
[gui-dev] LimeWire 4.2.3 international font display patch
heavy_baby
2004-11-27 15:06:22 UTC
Permalink
LimeWire 4.2.3 still has a problem. In search results, some combining
characters (such as Japanese accents and Korean Hangul) are displayed as
broken boxes. These filenames come from Macs.
If I download these files, my NTFS volume keeps the broken filenames.

So, I have added this code, and it is fixed!
Normalizer.compose(RemoteFileDesc.getFileName(), false)

But this is hardcoded; I don't know the best place to add the normalization
patch. Sorry.

My suggestion is
1) Apply Unicode normalization to search results
2) Apply Unicode normalization when creating the download filename
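
A minimal sketch of the composition step described above, for readers on a current JDK. The call in the post looks like ICU4J's Normalizer.compose(String, boolean); the sketch swaps in java.text.Normalizer (available since Java 6, so not at the time of this thread). The class and method names are hypothetical and are not LimeWire code.

```java
import java.text.Normalizer;

class DisplayNames {
    // Hypothetical helper: recompose a filename received from the network to NFC.
    static String compose(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, as a Mac might send it
        String latin = "cafe\u0301.txt";
        // Three conjoining Hangul jamo that compose to one syllable
        String hangul = "\u1112\u1161\u11ab";
        System.out.println(latin.length() + " -> " + compose(latin).length());
        System.out.println(hangul.length() + " -> " + compose(hangul).length());
    }
}
```

The same helper could be applied at either of the two places suggested: when a search result is displayed, or when the download filename is created.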

regards,
heavy


Sam Berlin
2004-11-27 16:10:25 UTC
Permalink
Hi heavy,

Do you recall if you saw this problem with earlier LimeWire versions?
Nothing has changed with normalization that should have affected this, so
we're hesitant to make a change for the 4.2 series. Your two suggestions
are indeed the correct places to add the normalization. We're just hesitant
to add it to all incoming search results because it may be a very
time-consuming process.

Thanks,
Sam
_______________________________________________
gui-dev mailing list
http://www.limewire.org/mailman/listinfo/gui-dev
Philippe Verdy
2004-11-27 20:54:16 UTC
Permalink
Post by Sam Berlin
Hi heavy,
Do you recall if you saw this problem with earlier LimeWire versions?
Nothing has changed with normalization that should have affected this, so
we're hesitant to make a change for the 4.2 series. Your two suggestions
are indeed the correct places to add the normalization. We're just hesitant
to add it to all incoming search results because it may be a very
time-consuming process.
No Sam, you just need to add normalized composition when reading directories
on Mac filesystems (which are not consistent: the behavior depends on the
filesystem type and version actually used on MacOSX).
MacOS or MacOSX will let you create files with precomposed filenames in
NFC form, but the HFS+ filesystem driver will remap these names to the
Apple-specific decomposed form (a variant of NFD, based on preliminary
versions of Unicode canonical decompositions, but which is incomplete and
not fully conforming to the normalized NFD). However you can be sure that
the Apple decompositions created by HFS+ are canonically equivalent to an NFC
form, and that all filenames given to HFS+ that are canonically equivalent
to the Apple decomposition will be stored on HFS+ with the same
Apple decomposition.

What this means is that it is always safe on MacOS filesystems to force the
recomposition to NFC, when reading directory entries. I just wonder why the
java.io.* packages ported to MacOS or MacOSX do not perform that normalized
composition by default, so that Java programs would not have to worry about
these details, and could safely create files with names in NFC form, and
then parse directories to find the same names.
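
To make "canonically equivalent" concrete, here is a small sketch, assuming a current JDK with java.text.Normalizer. It uses standard NFD as a stand-in for the Apple-specific decomposition, which the message above describes as only a variant of NFD; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class CanonicalEquivalence {
    // Two names are treated as the same name when they are equal after NFC.
    static boolean sameName(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00e9.txt";  // single code point for the accented e
        String decomposed = "cafe\u0301.txt";  // base letter plus combining accent
        System.out.println(precomposed.equals(decomposed));    // false
        System.out.println(sameName(precomposed, decomposed)); // true
    }
}
```

The two strings differ as Java strings, which is why a name read back from a directory listing may not match the name used to create the file unless both are brought to the same form first.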

The FAT32 or NTFS filesystems on Windows, or Linux filesystems, do not
perform such forced composition or decomposition, so an application that
creates a file using any valid form will be able to retrieve the same form
when reading directory entries.

On all these filesystems however, whether the FS driver performs some
normalization or not, you can create a file with any form, and then open
that file for reading using the same form, even if the filesystem stored the
name in a different form (the difference only shows when we read directory
entries).

So the correct place to do that normalization is only when parsing shared
directory contents...
Sam Berlin
2004-11-27 21:16:32 UTC
Permalink
Hi Philippe,

Performing the normalization will indeed fix things for the future.
Unfortunately, there's a ton of existing clients out there that aren't
sending correctly formatted results. To display those properly,
normalization would indeed need to be done on every incoming result.

Thanks,
Sam
Philippe Verdy
2004-11-27 21:37:16 UTC
Permalink
Not necessarily: you can normalize filenames read from filesystems so that
they will be shared easily with Windows or Unix users, and this won't
prevent other Mac users from accessing these files or reading these
filenames.

On the contrary, it is the fact that we keep the denormalized filenames read
from Mac HFS+ filesystems that causes interoperability problems with other
users (the denormalized names cause no problem only for Mac users, who won't
see the difference either if they now receive normalized filenames from
Windows/Linux or from corrected Mac versions...)

So there's no need to normalize queries or results in Gnutella, only when
reading files from local shared directories...

Normalizing incoming results would only help reading these results on
non-MacOS systems.

And of course we should not normalize queries during query matching, but
instead assume that all files accessible on Gnutella are normalized,
preferably in NFC form, as well as all sent queries and results... The only
case where we will need normalization is when those strings are FIRST
generated:
- on user input within search forms: the keyboard driver normally
generates precomposed forms, but if we are not sure, we could normalize the
user input string before processing.
- when reading or writing files: we can directly use filenames in NFC form
on all filesystem types, including HFS+!
- when reading directories: the entries should already be in NFC form, except
on MacOS HFS+ where it will be extremely useful to reconvert these
denormalized strings back to NFC before processing and inclusion in our
shared library list.

So there is no need to modify the query routing/matching algorithms, whose
performance is already critical...

Sam Berlin
2004-11-27 22:09:11 UTC
Permalink
Hi Philippe,

There is no problem with query routing or matching. All those routines
correctly use the normalized form of the filename. The problem is in the
actual 'Response' object -- it is constructed with a non-normalized name.
If the name is sent from an OSX system to another OSX system, then it is
displayed fine (according to your email). However, if that response is sent
to a non-OSX system, then it is displayed with the additional characters
(again, according to your email). We can correct this for the future by
having the name normalized in the Response object. Unfortunately, it is not
possible for us to correct all the existing LimeWire clients out there sending
non-normalized responses. In order to display those correctly, we'd have to
normalize the filename of the search result prior to displaying it.

For reference points in the code, see the various places in FileManager that
a new Response is constructed and then in the Response constructor how it is
not normalized (and is sent over the wire un-normalized). However,
insertion into the QueryRouteTable does use the normalized string
(HashFunction.keywords normalizes it, and is used by QueryRouteTable.add,
which is used by FileManager.buildQRT). The internal Trie in FileManager
also uses the normalized strings (done by FileManager.extractKeywords, which
is used by FileManager.addFile). Queries sent out are also normalized (done
by the various constructors of QueryRequest) and XML queries are also
normalized (done by InputPanel.getInput).

Because the problems are not in routing or matching, this problem is purely
a display issue (unless the file can't be created on disk because of the
improper characters).

heavy -- If you've managed to read this far, can you test whether or not
these results can be downloaded okay?

Thanks,
Sam
Philippe Verdy
2004-11-27 22:29:19 UTC
Permalink
OK but you still ignore the fact that the source of these denormalized names
is purely the internal HFS+ storage, when you parse its directory entries.
If we simply normalized strings coming from directory entries, we would no
longer have any denormalized strings anywhere, including in the Responses we
send! When storing an NFC filename on HFS+, HFS+ would denormalize it
internally, and we would not need to change anything in our code, which uses
NFC forms to access these files individually.

The only case where we will see these denormalized names is when we add
the shared folder to our library again, because we will use File.list() to
enumerate them from the HFS+ filesystem; unfortunately at that time, HFS+
will not return to us the NFC forms we used to create the file, or the NFC
forms we want to share to the network and through which the file remains
accessible. So the solution is just to normalize immediately the
denormalized entries coming from File.list().

OK we will not be able to display the denormalized strings coming from older
servents, but at least we will stop sending denormalized strings to the
network in our responses... which are inconvenient, including for non-MacOS
users: notably for Windows 95/98 or Linux users trying to download from a Mac,
who get disappointing Query Replies with many square boxes after the base
letter for each decomposed accent!

Trying to normalize these responses may help, but it may slow down the
processing of incoming results (that's why we should not do that during the
routing of responses, but only by the final recipient that initiated the
query, when the response will be displayed in the search result list). But
even this will not be necessary, if all servents on the Gnet stop sending
denormalized results (the way LimeWire does on MacOS).

Sam Berlin
2004-11-27 23:10:49 UTC
Permalink
Exactly. Your email pretty much restates what mine said. We can add
normalization on outgoing responses to help for the future, but for existing
clients we'll need to normalize all responses prior to displaying.
Unfortunately, normalizing all results prior to display might even be too
much processing -- we'll have to look closely at it.
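
One way the display-time cost might be kept down is to skip the work for names that cannot need it. The sketch below is an illustration only, using java.text.Normalizer from a current JDK rather than whatever normalizer LimeWire ships; the class and method names are hypothetical, and no measurement backs any claim about how much time this would save.

```java
import java.text.Normalizer;

class LazyDisplayName {
    // Pure ASCII text is never changed by NFC, so it can be returned as is.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: normalize a result name only when it is not NFC yet.
    static String forDisplay(String name) {
        if (isAscii(name)) {
            return name;
        }
        if (Normalizer.isNormalized(name, Normalizer.Form.NFC)) {
            return name;
        }
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }
}
```

Only names that actually arrive decomposed pay for a full normalization pass; everything else is returned unchanged.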

Thanks,
Sam
Philippe Verdy
2004-11-27 23:27:37 UTC
Permalink
Post by Sam Berlin
Exactly. Your email pretty much restates what mine said. We can add
normalization on outgoing responses to help for the future,
No, not on outgoing responses! it's too much processing at that time.

Make it much earlier, just when reading the local directory contents with
File.list(), because it is a one-time transformation performed at boot-time,
or when adding a new shared directory, which is then no longer needed when
creating responses from the already loaded and normalized library.

This must be done even before the detected filenames are indexed in the QRP
hash table, and can just consist in a class derived from File, to perform
NFC normalization that File.list() in the java.io core package should have
performed itself.

However this is only needed if running on MacOS because other non-Mac
filesystems (FAT32, NTFS, Linux UFS, ...) retain the distinction between
filenames with distinct normalization forms (doing normalization of
File.list() results on these systems may cause some compatibility problems,
which should be extremely rare for Latin/Greek/Cyrillic languages, but these
problems may happen with Hebrew or Arabic or Chinese filenames, where the
normalization is not always enforced by the input methods used initially to
create the filenames.)
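
A rough sketch of the "class derived from File" idea, with normalization switched on only when the caller asks for it (for example, when running on a Mac). It assumes java.text.Normalizer from a current JDK; the class name NfcFile and the os.name check are illustrative assumptions, not LimeWire code, and only list() is overridden here (listFiles() and the filtered variants are left untouched).

```java
import java.io.File;
import java.text.Normalizer;
import java.util.Locale;

class NfcFile extends File {
    private final boolean normalize;

    NfcFile(String path, boolean normalize) {
        super(path);
        this.normalize = normalize;
    }

    // Assumption: the standard os.name property starts with "Mac" on MacOS X.
    static boolean runningOnMac() {
        return System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT).startsWith("mac");
    }

    // Same contract as File.list(), including the null result for a
    // non-directory, but with each entry recomposed to NFC when enabled.
    @Override
    public String[] list() {
        String[] names = super.list();
        if (names == null || !normalize) {
            return names;
        }
        for (int i = 0; i < names.length; i++) {
            names[i] = Normalizer.normalize(names[i], Normalizer.Form.NFC);
        }
        return names;
    }
}
```

A shared directory could then be scanned with new NfcFile(path, NfcFile.runningOnMac()).list(), so that listings on other systems are passed through unchanged.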
Sam Berlin
2004-11-27 23:35:31 UTC
Permalink
Post by Philippe Verdy
No, not on outgoing responses! it's too much processing at that time.
True. So yes, we should just store the normalized name within the FileDesc,
then all will work. All the other code that does normalization can then be
simplified, since the FileDesc will store an already-normalized string.
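
As an illustration of storing the normalized name once, here is a hypothetical stand-in for a file descriptor. It is not LimeWire's actual FileDesc class; the names are invented for the example, and java.text.Normalizer from a current JDK is assumed.

```java
import java.io.File;
import java.text.Normalizer;

// Hypothetical stand-in, not LimeWire's FileDesc.
final class SharedFileEntry {
    private final File file;    // path as the filesystem reported it
    private final String name;  // NFC form, computed once at construction

    SharedFileEntry(File file) {
        this.file = file;
        this.name = Normalizer.normalize(file.getName(), Normalizer.Form.NFC);
    }

    File getFile() {
        return file;
    }

    // Name to index, display, and send in responses.
    String getName() {
        return name;
    }
}
```

Code that builds responses or keywords would read getName() and never see the decomposed form, while getFile() still carries the path used to open the file.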
Post by Philippe Verdy
This must be done even before the detected filenames are indexed in the QRP
hash table, and can just consist in a class derived from File, to perform
NFC normalization that File.list() in the java.io core package should have
performed itself.
It already is done prior to insertion in the hash table. See
HashFunction.keywords.

Thanks,
Sam
Philippe Verdy
2004-11-28 00:17:32 UTC
Permalink
Post by Sam Berlin
Post by Philippe Verdy
No, not on outgoing responses! it's too much processing at that time.
True. So yes, we should just store the normalized name within the FileDesc,
then all will work. All the other code that does normalization can then be
simplified, since the FileDesc will store an already-normalized string.
This done (on MacOS only), we will no longer have to modify the display or
routing, and Windows/Linux users will be able to see the correct filenames.

Note that MacOS has several filesystems, two of them being native to Macs:
HFS and HFS+.

When it is HFS, it uses the legacy 8-bit Mac encoding, where no
Apple decomposition occurs. So if you create a file named "café.txt" (with a
precomposed e with acute accent) on an HFS volume, the volume will store a
single character for "é", and File.list() will return the unmodified
filename with the precomposed accented letter.

When it is HFS+, or a Unicode compatible filesystem such as a mounted FAT32
partition or diskette, it uses the HFS+ decomposition, and the file created
will be stored as "cafe'.txt" (with a separate accent). The application can
still open and find the file equally well with the precomposed name
"café.txt" or with the decomposed name "cafe'.txt". But File.list() in Java
on MacOS will only return the decomposed name "cafe'.txt" (which will still
be indexed as "cafe.txt" by the HashFunction, and can still be searched on
the network with queries for "cafe" or "café" or "cafe'", but which the
current LimeWire returns to others in Responses containing "cafe'.txt", which
displays so badly for other Windows or Linux users...)

This means that the volume on which a local file is stored changes what the
same program sees on the same MacOS system. This is unlike NTFS or FAT32 on
Windows, or Linux/UFS filesystems, where the two filenames "cafe'.txt" and
"café.txt" are distinct (these filesystems don't force any normalization,
though they may change the letter case, limit the filename length, or remove
a final dot).
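
To make the two spellings concrete, here is a minimal sketch. It assumes the
JDK's own java.text.Normalizer (part of the standard library since Java 6)
rather than the ICU Normalizer.compose call quoted at the top of this thread,
and the class name NameForms is invented for illustration. It shows that the
precombined and decomposed names are different Java strings until both are
brought to NFC.

```java
import java.text.Normalizer;

final class NameForms {
    // "café.txt" spelled with the single precombined letter U+00E9.
    static final String COMPOSED = "caf\u00e9.txt";
    // The same visible name spelled as plain 'e' followed by U+0301 (combining acute accent).
    static final String DECOMPOSED = "cafe\u0301.txt";

    /** Returns the NFC (composed) form of a name. */
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));                       // false
        System.out.println(COMPOSED.length() + " vs " + DECOMPOSED.length()); // 8 vs 9
        System.out.println(COMPOSED.equals(toNfc(DECOMPOSED)));                // true
    }
}
```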
Post by Sam Berlin
Post by Philippe Verdy
This must be done even before the detected filenames are indexed in the
QRP hash table...
It already is done prior to insertion in the hash table. See
HashFunction.keywords.
The HashFunction performs another type of normalization. In fact it removes
most combining accents, by going through a limited form of NFD, and
suppression of some base characters, or their replacement by spaces.

But this will be faster and will use less resources (shorter strings) if the
HashFunction input string is already in NFC form (on Mac only, because
there's nothing to do on Windows or Linux filesystems).

The algorithm in HashFunction uses specific tables made for "simplifying"
keywords, and they are not strictly conforming to what we need for NFC forms
in FileDesc constructed from File.list() results.
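
As a rough illustration of this kind of keyword "simplification" (decompose,
then drop the accents), here is a sketch built on java.text.Normalizer. It is
an assumption-laden stand-in, not LimeWire's HashFunction: the real code uses
its own tables, and the class and method names below are invented. Blindly
removing every combining mark, as done here, would also damage scripts whose
vowel signs are combining marks, which is one reason a real implementation
needs more specific rules.

```java
import java.text.Normalizer;
import java.util.Locale;

final class KeywordSketch {
    /** Decomposes, drops combining marks and lower-cases. Illustration only. */
    static String simplify(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches every combining mark (categories Mn, Mc, Me).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Precombined and decomposed input end up as the same keyword.
        System.out.println(simplify("Caf\u00e9"));   // cafe
        System.out.println(simplify("cafe\u0301"));  // cafe
    }
}
```

Because the input is decomposed first, the result is the same whichever form
the filename arrives in; only the amount of work differs.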

The only problem I see is that the case when we should perform normalization
to NFC of File.list() results is easy to determine from the OS type, but the
real discriminant should be the filesystem used on the volume that stores
the directory for the enumerated filenames.

However it seems that Java has no function in its java.io.File class to
determine which filesystem type is used; the internal java.io.File
implementation uses an OS-specific driver that does not seem to have a
filesystem-specific behavior, but an OS-specific behavior instead, assuming
that all filesystems on that OS have the same rules (which is wrong...).

If there is such an API in Java, or in Mac-specific Java classes, we could
determine whether filenames stored in a directory and retrieved by
File.list() need to (and must) be normalized to NFC first, or need not (and
must not): normalization to NFC should be enabled only for directories
stored on HFS+ volumes, not on HFS volumes...

In the absence of such an API, the only way to test it would be to create a
dummy filename containing at least one character that gets decomposed on
HFS+, and then check whether it was stored in precomposed or decomposed
form.
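
The dummy-file test described above could look roughly like the sketch below.
Everything named here (FilenameFormProbe, the Result enum, the probe file
name) is hypothetical, and the sketch only uses plain java.io.File calls. It
reports UNKNOWN rather than guessing when the file cannot be created or the
name comes back in some third form (for example when the platform encoding
cannot represent "é").

```java
import java.io.File;
import java.io.IOException;
import java.text.Normalizer;

final class FilenameFormProbe {
    enum Result { PRESERVED, DECOMPOSED, UNKNOWN }

    /** Creates a throwaway file with a precomposed name in dir and checks how dir.list() reports it. */
    static Result probe(File dir) {
        String composed = "lw-probe-\u00e9.tmp";
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        File probe = new File(dir, composed);
        try {
            if (!probe.createNewFile()) {
                return Result.UNKNOWN;   // name already taken: leave that file alone
            }
            try {
                String[] names = dir.list();
                if (names == null) {
                    return Result.UNKNOWN;
                }
                for (String name : names) {
                    if (name.equals(composed)) return Result.PRESERVED;
                    if (name.equals(decomposed)) return Result.DECOMPOSED;
                }
                return Result.UNKNOWN;
            } finally {
                probe.delete();          // always remove the dummy file we created
            }
        } catch (IOException | SecurityException e) {
            return Result.UNKNOWN;
        }
    }
}
```

A caller would run the probe once per shared directory and enable
normalization of File.list() results only where it returns DECOMPOSED.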

But for compatibility with other systems that will receive our responses, it
seems that NFC could be forced in all cases when parsing local directories
on MacOS. My opinion is that the decomposition forced in HFS+ was a design
error by Apple. And Apple has not implemented in its port of java.io.File a
way to reverse transparently this HFS+ transformation when implementing
File.list(), so we have to do it ourselves.

This is certainly a problem for lots of other Mac developers who expect
that the filenames they create and successfully store on Mac volumes will
be the same names retrieved when listing directories, to see if a file is
still present in that list. A simple search of this list may return false
(filename no longer present), even though the original filename can still be
opened with its original content. (This developer assumption was valid on Mac
Classic with only Mac-encoded HFS volumes, but is no longer true on more
recent MacOS versions or on MacOSX, which prefer HFS+ because it is Unicode
capable and not limited to the local Mac 8-bit charset.)

Thankfully, on Windows and Linux/Unix, we don't have those complications:
filenames are preferably created in NFC form, but other forms are possible
and considered distinct.
Sam Berlin
2004-11-28 00:47:57 UTC
Permalink
Post by Philippe Verdy
The HashFunction performs another type of normalization. In fact it removes
most combining accents, by going through a limited form of NFD, and
suppression of some base characters, or their replacement by spaces.
The HashFunction just delegates the call to IBM's icu package. Is there
anything not done by this package that needs to be done for
normalization? (I ask this with respect to all normalization -- not just
prior to putting it in the query routing tables.)
Post by Philippe Verdy
But this will be faster and will use less resources (shorter strings) if the
HashFunction input string is already in NFC form (on Mac only, because
there's nothing to do on Windows or Linux filesystems).
Yes, this is what is already done. The normalization is only performed once
right now -- before inserting it into the QRT. What we're likely to do is push
that back to when the FileDesc itself is created.

I doubt that we're going to do anything with File.list() specifically.
Instead, we'll likely just change FileDesc.getPath to return the normalized
path. That will fix all the problems we're seeing (except for results from
older LimeWires, which is another issue entirely).
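
A minimal sketch of the "normalize once, when the descriptor is created" idea
follows. SharedFileEntry is a hypothetical stand-in and not LimeWire's
FileDesc; it exposes a normalized name rather than a full path, and it uses
java.text.Normalizer instead of the icu package mentioned above. The original
File is kept untouched for disk access, while everything that displays,
indexes or sends the name reads the cached NFC string.

```java
import java.io.File;
import java.text.Normalizer;

/** Hypothetical stand-in for a shared-file descriptor. */
final class SharedFileEntry {
    private final File file;     // exactly as the filesystem reported it, used for I/O
    private final String name;   // NFC form, used for display, indexing and responses

    SharedFileEntry(File file) {
        this.file = file;
        // Normalize once here so that later callers never repeat the work.
        this.name = Normalizer.normalize(file.getName(), Normalizer.Form.NFC);
    }

    File getFile() {
        return file;
    }

    String getName() {
        return name;
    }
}
```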

Thanks,
Sam
Philippe Verdy
2004-11-27 21:49:34 UTC
Permalink
Post by heavy_baby
My suggestion is
1) Apply Unicode normalization to search results
2) Apply Unicode normalization to create download filename
My suggestion is different: apply normalization only when reading the local
directory contents when loading the shared library... This will be enough to
let a MacOS user handle all their strings in NFC form, including for display
(because they will only receive strings in NFC form too, once no MacOS
clients are sending decomposed strings any more...)

The implication on performance is minimal, because it is a one time
operation which won't affect query routing or HTTP file requests and
transfers.
heavy_baby
2004-11-28 15:47:00 UTC
Permalink
Hi, Sam, and Philippe.
First, thanks for your support and discussion.

The normalization issue is quite interesting to me.
I understand that applying the normalize patch to every search result would
hurt performance.
So we can wait for every Mac user to update to a LimeWire with the normalize patch.
Philippe's way is the best-optimized solution.


Next suggestion:
Currently the LimeWire network has a mix of 3 types of filenames that a Win client can see.
1) Correct composed filenames, from Windows clients
2) Decomposed filenames with a square box character, from Mac clients
3) Decomposed filenames with a broken character such as "_", from Win clients
who downloaded a decomposed file from Mac clients
!!!(3) are increasing on the network!!!
So is it possible to apply the normalize patch when creating the download file?
Patching at this point doesn't cost much performance, I guess.

best regards,
heavy


Sam Berlin
2005-01-13 23:25:36 UTC
Permalink
This is fixed and will be in LimeWire 4.3.3. Thanks for letting us
know about the problem, heavy.

We're applying composition when:
a) Showing search results
b) Sending replies
c) Making a file on disk

a) Should fix viewing search results from older LimeWires.
b) Should fix older LimeWires viewing newer LimeWire's results
c) Should fix saving downloads from older LimeWires
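
Composing at all three points is safe because composing an already-composed
name changes nothing. The sketch below illustrates that with one shared
helper; ComposeAtBoundaries, compose and downloadTarget are invented names,
and java.text.Normalizer stands in for whatever LimeWire actually calls. The
isNormalized check lets names that are already in NFC skip the conversion.

```java
import java.io.File;
import java.text.Normalizer;

final class ComposeAtBoundaries {
    /** Composes to NFC, returning the input unchanged when it is already NFC. */
    static String compose(String name) {
        return Normalizer.isNormalized(name, Normalizer.Form.NFC)
                ? name
                : Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    /** Case (c): choose the on-disk target for a download from a remote-supplied name. */
    static File downloadTarget(File saveDir, String remoteName) {
        return new File(saveDir, compose(remoteName));
    }
}
```

The same compose call would be used for (a) and (b); applying it twice to the
same name gives the same result as applying it once.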

Thanks,
Sam
_______________________________________________
gui-dev mailing list
http://www.limewire.org/mailman/listinfo/gui-dev
Philippe VERDY
2005-01-14 19:08:10 UTC
Permalink
I thought it was discussed many months ago, and fixed. This was a known caveat of Mac's HFS+ filesystem, which uses its own decomposed form. That form does not really conform to NFD: it is based on a preliminary beta version of the NFD algorithm, and it was then frozen and never updated with the Unicode characters added since HFS+ was released. This HFS+ behavior was really a design error that will persist until MacOS adopts a newer filesystem that does not enforce a possibly broken normalization form. NTFS and Unix filesystems do not force a normalization form; however, it is a common convention to use NFC as the preferred form, including for the web and almost all XML, HTML or SGML related standards.

The idea of using some decomposed form was bad: it was used to simplify the implementation of what is really called "collation" in Unicode (and used in MacOS to search and sort files), but the scheme was broken as collation is locale-sensitive. HFS+ should not have enforced the normalization. Instead, collation for sorting or searching files should be done in the user interface according to the user locale; the filesystem should not do that.

NTFS or FAT32 still perform similar transformations: notably case conversions and mappings to locate case-insensitive filenames, used as unique keys to open files. This is also broken, because this conversion does not follow exactly all the standard case mappings defined in Unicode. This was, however, fixed in newer versions of NTFS (but it creates some interoperability problems with older legacy systems, notably when sharing folders in a network, or mapping network drives).

Unix does not have such difficulties in its filesystems: filenames are unique keys, with no aliases. It may look more difficult for users, but users can be helped by the user interface providing locale-sensitive search algorithms for identifying existing filenames.
Sam Berlin
2005-01-14 19:11:25 UTC
Permalink
It was discussed a while ago, but was not fixed because we were very
late in the release cycle for LimeWire 4.2. Now that we are starting a
new cycle, it is fixed.

Thanks,
Sam