Discussion:
[gui-dev] QueryRouteTable.java.patch (second version)
Sam Berlin
2004-11-28 00:55:44 UTC
Thanks for sending this in, Philippe. Right now we're more concerned with
the memory that the BitSets use as opposed to the speed with which they're
processed. We'd like to increase the default QRT size in order to
accommodate ultrapeers holding more leaves, but cannot do this until we come
up with a more memory-efficient table. Currently, the QRTs of each leaf &
ultrapeer are using more memory than anything else while LimeWire is running
as an ultrapeer.

The problem is that the BitSet implementation that we're using (a customized
version of java.util.BitSet) is not sparse. So, if the table is 1000
entries large and only the 999th entry is set, the bitset will still
allocate storage for all 998 preceding entries. The storage is extremely
compact (one long per 64 entries), but the size requirements are still too
large. You can view
the waste of the BitSet by enabling the 'QRP %' and 'QRP Empty' columns in
the 'connections' table in LimeWire. The QRP % column is a measure of how
full the table is with respect to the total size. The QRP Empty column is
two numbers -- the first is the number of unused longs (a long that is
storing no 'set' bits), and the second is the number of longs allocated. If
LimeWire is running as an ultrapeer, you'll see that for leaves the 'QRP
Empty' column has a very high first number and just a slightly higher second
number -- this means that the BitSet has allocated many unused longs in
order to store just the last few 'set' bits.
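To make the 'QRP Empty' measurement concrete, here is an illustrative sketch of counting unused longs, using plain java.util.BitSet (and its toLongArray accessor) rather than our customized class:

```java
import java.util.BitSet;

// Illustrative sketch of what the 'QRP Empty' column measures, using plain
// java.util.BitSet rather than LimeWire's customized class: count allocated
// longs that store no 'set' bit.
public class QrtWaste {
    // Returns {unusedLongs, allocatedLongs} for the given bit set.
    static int[] emptyStats(BitSet bits) {
        long[] words = bits.toLongArray(); // backing words up to the highest set bit
        int unused = 0;
        for (long w : words) if (w == 0L) unused++;
        return new int[] { unused, words.length };
    }

    public static void main(String[] args) {
        BitSet table = new BitSet(1000);
        table.set(999);                    // only the last entry is set...
        int[] stats = emptyStats(table);
        // ...yet 16 longs are allocated to reach it, 15 of them entirely unused
        System.out.println(stats[0] + " / " + stats[1]);
    }
}
```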

Thanks,
Sam
-----Original Message-----
Sent: Saturday, November 27, 2004 7:44 PM
Subject: [gui-dev] QueryRouteTable.java.patch (second version)
Hello Sam (Berlin) and LimeWire developers,
You may have noticed the QueryRouteTable patch I proposed yesterday.
After some reflection, I have refined this patch for faster computing of
patches, by reducing the number of calls to BitSet methods so that updates
to the generated data are grouped.
Some more optimizations are possible, notably in the patch decoders.
But for the 1-bit and 2-bit formats the speed difference is very
significant, and it is still visible for the 4-bit format as well as in the
8-bit patch format.
See the attachment...
(Note that the newly supported 2-bit and 1-bit formats can easily be
disabled by enabling the few commented-out lines starting at "2-BIT FORMAT
PARSING" and "2-BIT FORMAT GENERATION")
Philippe Verdy
2004-11-28 01:13:48 UTC
Post by Sam Berlin
The problem is that the BitSet implementation that we're using (a customized
version of java.util.BitSet) is not sparse. So, if the table is 1000
I see... Couldn't we use multiple sub-bitsets? The complexity lies in
creating the indirection table.
I had another idea in the past, using something similar to an Integer-Trie
(the way they are used in the implementation of Unicode normalization
tables), but dynamic maintenance and even generation of these tables is
quite complex...
The problem is that, although the tables are sparsely filled, the
distribution of set bits is still random and the coverage still uniform, so
solutions with double indirection will too often hit the worst case, where
nothing can be gained.
A sparse table will only save memory provided that it is nearly empty or
nearly full (we can ignore this second case if we want to create larger
tables).

So maybe we could use two alternate representations within a container
class: a bitset above some fill threshold, or an array of integers or an
integer hash table (with collisions) below this threshold, the
transformation back to an array of integers or hash table occurring only
after a complete patch has been applied that has reduced the fill level.

Another representation could use binary trees for low fill levels.
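A minimal sketch of that two-representation container, assuming an arbitrary switch threshold (the class name and the threshold value are both illustrative, not measured):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of the two-representation container described above: a hash set of
// indices while sparse, switching to a packed BitSet past a fill threshold.
// The threshold and class name are illustrative, not measured values.
public class AdaptiveBits {
    private final int size;
    private final int threshold;      // number of set bits that triggers densifying
    private Set<Integer> sparse = new HashSet<Integer>();
    private BitSet dense;             // non-null once we have switched

    public AdaptiveBits(int size) {
        this.size = size;
        this.threshold = size / 16;   // e.g. switch above ~6% fill (arbitrary)
    }

    public void set(int i) {
        if (dense != null) { dense.set(i); return; }
        sparse.add(i);
        if (sparse.size() > threshold) {
            dense = new BitSet(size); // densify: copy indices into a packed BitSet
            for (int idx : sparse) dense.set(idx);
            sparse = null;
        }
    }

    public boolean get(int i) {
        return dense != null ? dense.get(i) : sparse.contains(i);
    }
}
```

The transformation back to the sparse form, after a patch lowers the fill level, would be the symmetric conversion.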
Sam Berlin
2004-11-28 01:45:01 UTC
Post by Philippe Verdy
I see... Couldn't we use multiple sub-bitsets?
Yup, we could. Greg's been suggesting this idea. Perhaps we'll actually
get around to doing it sometime soon. :)
Post by Philippe Verdy
The problem is that, although the tables are sparsely filled, the
distribution of set bits is still random and the coverage still uniform, so
solutions with double indirection will too often hit the worst case, where
nothing can be gained.
Exactly -- the extreme optimizations that are required in order to make
patch processing bearable on the CPU make it very difficult to come up with
a workable solution.

Thanks,
Sam
z***@limepeer.com
2004-11-28 02:43:47 UTC
Any container abstraction suffers from the need to switch very quickly from
one concrete representation to another. In the specific case of an integer
array, we need to keep track of a certain fill level after which the bitset
representation becomes more efficient, and be able to convert the array to
that bitset on the fly using minimal amounts of cpu. Needless to say, any
such conversion must be ultra-efficient in both directions. Furthermore, the
managing code which decides when to do the conversion must be lightweight
and optimal.

Regarding the various sub-bitset representations, we would ideally have an
algorithm which finds the optimal one for a given load factor/distribution.
My hunch is that such an algorithm would require memoization, but if anyone
is aware of something greedy that would work, let us know.
_______________________________________________
gui-dev mailing list
http://www.limewire.org/mailman/listinfo/gui-dev
Philippe Verdy
2004-11-29 09:45:06 UTC
Post by z***@limepeer.com
Any container abstraction suffers from the need to very quickly be able to
switch from one concrete representation to another. In the specific case of
Integer array, we need to keep track of certain fill level after which the
bitset representation becomes more efficient and be able to convert the array
to that bitset on the fly using minimal amounts of cpu. Needless to say, any
such conversion must be ultra-efficient when done both directions.
Furthermore, the managing code which decides when to do the conversion must be
lightweight and optimal.
Regarding various sub-bitset representations, we would ideally have an algorithm
which finds the optimal one for a given load factor/distribution. My hunch is
that such algorithm would require memoization, but if anyone is aware of
something greedy that would work, let us know.
My feeling is that we should manage large bitsets by splitting them into
large enough (but not too large) fragments of fixed size, each of which
could have its own alternate representation but could be converted quickly
from one to the other.

Also, each fragment should be stored in a class inheriting from a common
BitSetBased parent class. Large bitsets would also inherit from this parent
class, but could simply be an array of small bitsets represented in one of
the other BitSetBased derived classes.

Because, in the end, only small bitsets will be stored in the leaf nodes, it
becomes easy to represent them in several alternate forms: an uncompressed
packed array of bits, no array at all (if all bits are set to 0 or to 1),
RLE packing, or even a deflated form...

The parent node will decide which alternate form to use, performing the
necessary cleanup for each fragment without requiring too much working
memory (because fragments have a maximum size, a single buffer can be reused
to compute intermediate representations for every fragment).

Now some experimentation would be needed to determine which fragment size to
use. If we want to be able to use the most compact forms, we need statistics
on bitset fill rates. It is immediately apparent that there are two types of
QRP tables, with distinct average fill rates:
- bitsets for QRP from leaf nodes, which are about 2 to 10% filled (with 64K
tables), meaning that there's 1 bit set every 10 to 50 bits, i.e. with
average unused gaps of 9 to 49 bits; not enough to perform a good
compression with no array at all, but for which a Lempel-Ziv based
compression (deflate?) would achieve significant compression.
- bitsets for QRP from ultrapeers, which are about 40 to 60% filled (with 64K
tables), meaning that there are about as many bits set to 1 as to 0, in a
pseudo-random distribution (as bits are set from a pseudo-random hash
function), for which compression is likely to be very disappointing.

I think that byte-RLE compression of fragments will rarely be efficient,
unless the bitset is really very sparse (fill rates below 1 bit set per 32,
i.e. fill rates below 3%). But even at that level, a Lempel-Ziv compression
may still be more effective...

The choice of compression function will also affect performance: there's a
tradeoff to find between the expected compression level (to maximize the
memory saved) and performance:

- bitsets for QRP tables from leaf nodes that share the least (lowest fill
rates) could be maximally compressed,

- but bitsets for QRP tables from ultrapeers, or from leaf nodes that share
a lot, would probably use a lower compression level.

If we split large bitsets into small manageable fixed-size bitsets, we can
cache a fill rate for each of them, and then determine which representation
is most appropriate. Additionally, when updating fragments dynamically, we
are not required to compress these fragments to their best representation
immediately (this can be deferred and performed by a background process,
which will need a constant, reusable maximum of working memory, minimizing
VM memory fragmentation).
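The selection rule described above might be sketched as follows; the fragment size and thresholds are illustrative guesses taken from the fill-rate estimates in this message, not measured constants:

```java
// Sketch of choosing a per-fragment representation from its cached fill
// count. The fragment size and thresholds are illustrative guesses, not
// measured constants.
public class FragmentPolicy {
    enum Repr { ALL_ZERO, ALL_ONE, SPARSE_INDICES, RLE, PACKED_BITS }

    static final int FRAGMENT_BITS = 16 * 1024;        // fixed fragment size

    static Repr choose(int setBits) {
        if (setBits == 0) return Repr.ALL_ZERO;        // no array at all
        if (setBits == FRAGMENT_BITS) return Repr.ALL_ONE;
        double fill = (double) setBits / FRAGMENT_BITS;
        if (fill < 0.03) return Repr.SPARSE_INDICES;   // below ~1 bit set per 32
        if (fill < 0.10) return Repr.RLE;              // leaf-table range (2-10%)
        return Repr.PACKED_BITS;                       // ultrapeer-range fill
    }
}
```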


What do you think of this idea?
Zlatin Balevsky
2004-11-29 16:32:15 UTC
Post by Philippe Verdy
My feeling is that we should manage large bitsets by splitting them into
large enough (but not too large) fragments of fixed size, each of which
could have its own alternate representation but could be converted quickly
from one to the other.
The constants involved in this splitting should ideally be determined in
some automated fashion, unless through experimentation we come up with a set
of constants which performs reasonably well. Keep in mind that while we may
be able to do this easily for leaves, we don't know how the fill rates of
ultrapeer tables will change in the future if we increase the number of leaf
connections.

Compression is a little trickier to get right because of the transient
memory usage, and the fact that routing happens on several threads - so we
may end up decompressing the same data many times. Only if message
processing is serialized to a single thread (which will hopefully be the
case with NIO) can we be sure that uncompressing, reading and re-compressing
each table will indeed use less memory. Therefore I'm more optimistic about
dynamically finding some optimal representation of each table in the form of
a bitset tree and simply not allocating those fragments that are all 0s or
1s.

A small note on RLE - we could relatively easily modify our table reading
code to read directly from an RLE'd table; this alone could save us a lot of
memory on the leaf tables. This seems like the easiest first step to take.
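A sketch of what reading directly from an RLE'd table could look like, assuming a hypothetical encoding of alternating run lengths (zeros first); lookup cost is proportional to the number of runs:

```java
// Sketch of reading directly from an RLE'd table without expanding it.
// The encoding is hypothetical: alternating run lengths, starting with a
// run of 0s, so odd-indexed runs are runs of 1s.
public class RleTable {
    private final int[] runs;

    public RleTable(int[] runs) { this.runs = runs; }

    public boolean get(int index) {
        int pos = 0;
        for (int r = 0; r < runs.length; r++) {
            pos += runs[r];
            if (index < pos) return (r % 2) == 1;  // index falls inside run r
        }
        return false;                              // past the last run: all 0s
    }
}
```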
Philippe Verdy
2004-11-29 21:04:46 UTC
Post by Zlatin Balevsky
A small note on RLE - we could relatively easily modify our table reading
code to read directly from an RLE'd table; this alone could save us a lot
of memory on the leaf tables. This seems like the easiest first step to
take.
It's just that I really doubt the efficiency of RLE encoding for bitsets
filled by QRP, whose set bits are more or less regularly distributed
throughout the table.
Same thing for the tree encoding, for the same reason: the tree structure
itself will be hard and costly to compute over many levels, and with a
limited number of levels almost all branches will necessarily be filled with
an actual node, even at a very moderate fill level.
For example, with a 64Kbit table, a 5% fill level means that there's one bit
set every 20 bits (so RLE compression will not perform very well, due to
very short compressible runs, and 20 bits will not be enough to keep all
branches from being present in a tree representation whose leaf nodes must
each represent some significant group of bits).

What is interesting to see first is how many bits are set for leaf nodes
(those for which we need the highest number of QRP tables, and so the
highest number of large bitsets). On average this should represent 2 or 3
times the number of shared files on each leaf node. If we expect (and
measure) that, on average, leaf nodes share about 300 files, this will set
about 1000 bits in their bitset. To start being significantly compressible,
a large bitset should be about 50 times this number, which gives an initial
bitset size of at least 50,000 bits; the current size of 64K bits is then
only moderately compressible with RLE (I think we could gain no more than 10
to 20% of total size for 64K bitsets, i.e. a total size of about 80 to 90%
of an uncompressed bitset).

But if you double the size of the virtual bitset, you don't modify the total
number of bits set, so the compression rate will be significantly higher,
because all the additional bits will be zeroes; the compressed data can be
kept at roughly the same size as for a 64Kbit table, effectively doubling
the compression rate: suppose as above that we compressed the 64Kbit table
to 80 to 90% of its size; the compressed data would then be around 51 to 57
Kbit. With a doubled virtual table size of 128Kbit, the compressed data
would remain at roughly the same 51 to 57 Kbit, i.e. a compressed size of 40
to 45% of the uncompressed bitset.

The more we increase the virtual bit size, the more effective RLE
compression becomes at maintaining (or not significantly increasing) the
average compressed size.
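The arithmetic above can be checked in a few lines (0.85 here is just the midpoint of the 80 to 90% estimate):

```java
// Worked check of the virtual-size argument above: doubling the virtual
// table size adds only zero runs, so the compressed size stays roughly put
// while the ratio halves. 0.85 is the midpoint of the 80-90% estimate.
public class VirtualSizeMath {
    static double ratio(double compressedBits, int virtualBits) {
        return compressedBits / virtualBits;
    }

    public static void main(String[] args) {
        int physical = 64 * 1024;
        double compressed = physical * 0.85;    // ~55.7 Kbit after RLE
        System.out.println(ratio(compressed, physical));      // 0.85
        System.out.println(ratio(compressed, 2 * physical));  // 0.425
    }
}
```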

The problem with this scheme is its maintenance cost: the compressed bitsets
now have dynamic sizes, so we need to reallocate very often, each time a
bitset is updated. This can cause significant performance penalties due to
excessive fragmentation if we don't split the bitsets into fragments of
fixed virtual size (I think that fragments around 16Kbits, i.e. a max of
2KB, should work OK to greatly limit the memory fragmentation costs in the
VM). Splitting into fragments also offers another opportunity: completely
empty fragments need not be stored at all, since their position in the table
of fragments can be left null, saving additional memory. Splitting is very
simple to compute with simple bitmasks and shifts, and it greatly limits the
impact of the necessary data copies and allocations.

Also, each fragment may have its own compression scheme if fragments are
represented as distinct objects; there is no need to add extra fields,
because each instance's class will implement a common interface to decode
it. This also allows computing fragments one by one in an uncompressed form,
easy to compute and set for OR/XOR/AND operations, and then finishing the
work by computing each fragment's optimal compression scheme before moving
to the next fragment. The temporary uncompressed fragment is easily reusable
to compute the next actual fragment, which limits the number of allocations
needed to process the whole bitset.
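The per-fragment polymorphism described here might be sketched as follows; the interface, class names, and scratch-buffer protocol are all hypothetical:

```java
// Sketch of per-fragment compression schemes selected via the instance's
// class, as described above. All names are hypothetical. Shared scratch
// buffers let fragments be decoded one at a time for OR/XOR/AND work.
interface Fragment {
    void decodeInto(long[] scratch);   // expand this fragment into a reusable buffer
}

final class EmptyFragment implements Fragment {        // "no array at all" form
    public void decodeInto(long[] scratch) { java.util.Arrays.fill(scratch, 0L); }
}

final class PackedFragment implements Fragment {       // uncompressed packed bits
    private final long[] words;
    PackedFragment(long[] words) { this.words = words.clone(); }
    public void decodeInto(long[] scratch) {
        System.arraycopy(words, 0, scratch, 0, words.length);
    }
}

class FragmentOps {
    // OR fragment b into the scratch decoding of fragment a, reusing buffers.
    static long[] or(Fragment a, Fragment b, long[] scratchA, long[] scratchB) {
        a.decodeInto(scratchA);
        b.decodeInto(scratchB);
        for (int i = 0; i < scratchA.length; i++) scratchA[i] |= scratchB[i];
        return scratchA;   // caller re-encodes with the best scheme, then reuses buffers
    }
}
```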
Zlatin Balevsky
2004-11-29 21:23:20 UTC
Post by Philippe Verdy
Post by Zlatin Balevsky
A small note on RLE - we could relatively easily modify our table
reading code to read directly from a RLE'd table; this alone could
save us a lot of memory on the leaf tables. This seems like the
easiest first step to take.
<...>
The more we increase the virtual bit size, the more effective RLE
compression becomes at maintaining (or not significantly increasing) the
average compressed size.
Let's hold this thought for a moment - suppose we want to use RLE only for
the tables of leaves. We could make their virtual size some large number
like 1MB - if it is a single bitset it will compress wonderfully with RLE.
And since leaf tables tend to change rarely, we are not going to re-allocate
storage too often. The reading complexity is proportional to the fill rate
and can be implemented as a simple iterative read of the compressed table.

While this may not be a great improvement, it will greatly reduce the number
of false positives sent to leaves with minimal increase in cpu usage and
virtually no increase in memory usage. It's certainly worth experimenting
with ;)
