Some Measurements on Direct Connect File Lists

=pod

(Published on B<2014-01-09>.)

=head1 Introduction

I've been working on Direct Connect related projects for a while now. This
includes maintaining L<ncdc|http://dev.yorhel.nl/ncdc> and
L<Globster|http://dev.yorhel.nl/globster>, and doing a bit of research into
improving the downloading performance and scalability (to be published at some
later date). Whether I'm writing code or trying to set up experiments for
research, there's one thing that helps a lot in making decisions: measurements
from an actual network.

Because useful measurements are often missing, I decided to do some myself.
There's a lot to measure in an actual P2P network, but I restricted myself to
information that can be gathered quite easily from file lists.


=head1 Obtaining the Data

Different hubs will likely have totally different patterns in terms of what is
being shared. In order to keep this experiment simple, I limited myself to a
single hub. And in order to get as much data as possible, I chose the hub that
is commonly known as "Ozerki", famous for being one of the larger hubs in
existence.

My approach to getting as many file lists as possible from this hub was perhaps
a bit too simple. I simply modified ncdc to have an "add the file list from all
users to the download queue" key, and to save all downloaded lists to a
directory instead of opening them.

I started this downloading process on a Monday around noon when there were a
little over 11k users online. I hit my hacked download-all-filelists-key two
more times later that day in order to get the file lists from those users who
joined the hub at a later time. I let this downloading process run until the
evening.

One thing I learned from this experience was that the downloading algorithm in
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
to open a connection with B<all> users listed in the download queue. You can
imagine that trying to connect to 11k users simultaneously put a significantly
heavier load on the hub than would have been necessary. Not good. Not something
a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
mind too much and handled the load fine. This might have been because Mondays
are typically not the most busy days in P2P land. Weekends tend to be busier.

Despite that scalability issue, I successfully managed to download the file
lists of almost everyone who remained online for long enough to finally get
their list downloaded. In total I managed to download 14143 file lists (that's
one list too many for C<10000*sqrt(2)>, I should have stopped the process a bit
earlier). The total bzip2-compressed size of these lists is 6.5 GiB.

For obvious reasons, I won't be sharing my modifications to ncdc. I already
tarnished the reputation of ncdc enough in that single day.  If you wish to
repeat this experiment, please do so with a scalable downloading
implementation. :-)


=head1 Obtaining the Stats

And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
XML files. This didn't really sound like much of a challenge. After all, all
one needs to do is decompress the file lists, do some XML parsing and update
some values.  Most of the CPU time in this process would likely be spent on
bzip2 decompression, so I figured I'd just pipe the output of L<bzcat(1)> to a
Perl script and be done with it.

To get the statistics on the sizes and the distribution of unique files, a data
structure containing information on all unique files in the lists was
necessary. Perl being the perfect language for data manipulation, I made use of
its great support for hash tables to store this information. It turned out,
rather unsurprisingly, that Perl isn't all that conservative with respect to
memory usage. Neither my 4GB of RAM nor the extra 4GB of swap turned out to be
enough to run the script to completion. I tried rewriting the script to use a
disk-based data structure, but that slowed things down to a crawl. Some other
solution was needed.

When faced with such a problem, some people will try to optimize the algorithm,
others will throw extra hardware at it, and I did what I do best: Optimize away
the constants. That is, I rewrote the data analysis program in C. Using the
excellent L<khash|https://github.com/attractivechaos/klib> hash table library
to keep track of the file information and the equally awesome
L<yxml|http://dev.yorhel.nl/yxml> library (a little bit of self-promotion
doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary
processing in 30 minutes using at most 3.6GB of RAM.

Long story short, here's my analysis program:
L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.


=head1 A Look at the Stats

Some lists didn't decompress/parse correctly, so the actual number of file
lists used in these stats is B<14137>. The total compressed size of these lists
is B<6,945,269,469> bytes (6.5 GiB), and uncompressed B<25,533,519,352> bytes
(24 GiB). In total these lists mentioned B<197,413,253> files. After taking
duplicate listings into account, there are still B<84,131,932> unique files.

And now for some graphs...

=head2 Size of the File Lists

Behold, the compressed and uncompressed size of the downloaded file lists:

[img graph dclistsize.png ]

Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
compressed file list, but lists of 1 MiB aren't too weird, either. The largest
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
easily infer the compression ratio from this graph. Hence, another graph:

[img graph dclistcomp.png ]

Most file lists compress to about 24% - 35% of their original size. This seems
to be consistent with L<similar
measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.

The raw data for these graphs is found in
L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
compressed and uncompressed size, respectively, for each file list. The gnuplot
script for the first graph is
L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for the
second.

=head2 Number of Files Per List

So how many files are people sharing? Let's find out.

[img graph dcnumfiles.png ]

As expected, this graph looks very similar to the one about the size of the
file list. The size of a list tends to be linear in the number of items it
holds, after all.

The raw data for this graph is found in
L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
unique and total number of files, respectively, for each file list. The gnuplot
script is
L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.

=head2 File Sizes

And how large are the files being shared? Well,

[img graph dcfilesize.png ]

This graph is fun, and rather hard to explain without knowing what kind of
files we're dealing with. I'm not going to do any further analysis on what kind
of files these file sizes represent exactly, but I am going to make some
guesses. The files below 1 MiB could be anything: text files, images,
subtitles, source code, etc. And considering that the hub in question doesn't
put a whole lot of effort in weeding out spammers and bots, it's likely that
some malicious users will be sharing small variations of the same virus within
the 100 KiB range. The peak of files between 7 and 10 MiB would likely be
audio files. The number of files larger than, say, 20 MiB drops significantly,
but there are still a few million files in the 20 MiB to 1 GiB range.

I cut off the graph after 10 GiB, but there's apparently someone who claims to
share a file between 1 and 2 TiB (don't know the exact size due to the
binning). Since I can't imagine why someone would share a file that large, I
expect it to be a fake file list entry. Note that there could be more fakes in
my data set. I can't tell which files are fake and which are genuine from the
information in the file lists, but I don't expect the number of fake files to
be very significant.

The "raw" data for this graph is found in
L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I wasn't
interested in dealing with a text file of 84 million lines, the data is already
binned. The first column is the bin number and the second column the number of
unique files in that bin. The file sizes that each bin represents are between
C<2^(bin+9)> and C<2^(bin+10)>, with the exception of bin 0, which starts at a
file size of 0. The source of the gnuplot script is
L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.

=head2 Distribution of Files

Another interesting thing to measure is how often files are shared. That is,
how many users have the same file?

[img graph dcfiledist.png ]

Many files are only available from a single user. That's not really a good sign
when you wish to download such a file, but luckily there are also tons of files
that I<are> available from multiple users. What is interesting in this graph
isn't that it follows the L<power law|https://en.wikipedia.org/wiki/Power_law>,
but wondering what those outliers could possibly be. There's a collection
of 269 files that has been shared among 831 users, and there appears to be a
similar group of around 510-515 files that is shared among 20 or so users. I've
honestly no idea what those collections could be. Well, yes, I could probably
figure that out from the file lists, but my analysis program doesn't tell me
which files it's talking about and I'm too lazy to fix that.

The graph has been clipped to 600, but there's another interesting outlier: a
single file that has been shared by 5668 users. I'm going to guess that this is
the empty file. There are so many ways to get an empty file somewhere in your
filesystem, after all.

The raw data for this graph is found in
L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists the
number of times shared and the aggregate number of files. The gnuplot script is
L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.


=head1 Final Notes

So, erm, what conclusions can we draw from this? That stats are fun, I guess.
If anyone (including me) is going to repeat this experiment on a fresh data
set, make sure to use a more scalable downloading process than I did. My
approach shouldn't be repeated if we wish to keep the Direct Connect network
alive.

Furthermore, keep in mind that this is just a snapshot of a single day on a
single hub. The graphs may look very different when the file lists are
harvested at some other time. And it's also quite likely that different hubs
will have very different share profiles. It could be interesting to try and
graph everything, but I don't have I<that> kind of free time.

