Some Measurements on Direct Connect File Lists

=pod

(Published on B<2014-01-09>.)

=head1 Introduction

I've been working on Direct Connect related projects for a while now. This
includes maintaining L<ncdc|http://dev.yorhel.nl/ncdc> and
L<Globster|http://dev.yorhel.nl/globster>, and doing a bit of research into
improving the downloading performance and scalability (to be published at some
later date). Whether I'm writing code or trying to set up experiments for
research, there's one thing that helps a lot in making decisions: measurements
from an actual network.

Because useful measurements are often missing, I decided to do some myself.
There's a lot to measure in an actual P2P network, but I restricted myself to
information that can be gathered quite easily from file lists.

=head1 Obtaining the Data

Different hubs will likely have totally different patterns in terms of what is
being shared. In order to keep this experiment simple, I limited myself to a
single hub, and to get as much data as possible, I chose the hub that is
commonly known as "Ozerki", famous for being one of the larger hubs in
existence.

My approach to getting as many file lists as possible from this hub was
perhaps a bit too simple. I modified ncdc to have an "add the file list from
all users to the download queue" key, and to save all downloaded lists to a
directory instead of opening them.

I started this downloading process on a Monday around noon, when there were a
little over 11k users online. I hit my hacked download-all-filelists key two
more times later that day in order to get the file lists from users who joined
the hub later, and I let the downloading process run until the evening.

One thing I learned from this experience is that the downloading algorithm in
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
to open a connection to B<all> users listed in the download queue. You can
imagine that trying to connect to 11k users simultaneously put a significantly
heavier load on the hub than would have been necessary. Not good. Not
something a well-behaving netizen would do. Surprisingly enough, the hub
didn't seem to mind too much and handled the load fine. This might have been
because Mondays are typically not the busiest days in P2P land; weekends tend
to be busier.

Despite that scalability issue, I managed to download the file lists of almost
everyone who stayed online long enough for their list to be fetched. In total
I downloaded 14143 file lists (that's one list too many for C<10000*sqrt(2)>;
I should have stopped the process a bit earlier). The total bzip2-compressed
size of these lists is 6.5 GiB.

For obvious reasons, I won't be sharing my modifications to ncdc. I already
tarnished the reputation of ncdc enough in that single day. If you wish to
repeat this experiment, please do so with a scalable downloading
implementation. :-)

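To give an idea of what "scalable" means here: rather than attempting a
connection to every queued user on each timer tick, a better-behaved
downloader would cap the number of new attempts per tick and simply continue
where it left off on the next one. The sketch below is hypothetical, not
ncdc's actual code; C<try_connect()> and the limit of 50 attempts per tick are
made up for illustration.

  #include <stddef.h>
  #include <stdio.h>

  /* Hypothetical sketch, not ncdc's actual code: cap the number of new
   * connection attempts per 60-second tick and walk the queue round-robin. */
  #define MAX_ATTEMPTS_PER_TICK 50

  struct user { const char *nick; };

  static void try_connect(struct user *u) {
      /* stand-in for the real connection setup */
      printf("connecting to %s\n", u->nick);
  }

  static void download_timer_tick(struct user **queue, size_t queue_len) {
      static size_t next = 0;      /* where the previous tick left off */
      if (queue_len == 0)
          return;
      next %= queue_len;           /* the queue may have shrunk since last tick */
      size_t n = queue_len < MAX_ATTEMPTS_PER_TICK ? queue_len : MAX_ATTEMPTS_PER_TICK;
      for (size_t i = 0; i < n; i++) {
          try_connect(queue[next]);
          next = (next + 1) % queue_len;
      }
  }

With 11k queued users, a limit like this turns a burst of thousands of
simultaneous connection attempts into a steady trickle spread over many ticks.
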
=head1 Obtaining the Stats

And then comes the challenge of aggregating statistics on 6.5 GiB of
compressed XML files. This didn't really sound like much of a challenge: after
all, all one needs to do is decompress the file lists, do some XML parsing,
and update some values. Most of the CPU time in this process would likely be
spent on bzip2 decompression, so I figured I'd just pipe the output of
L<bzcat(1)> to a Perl script and be done with it.

To get the statistics on the sizes and the distribution of unique files, a
data structure containing information on all unique files in the lists was
necessary. Perl being the perfect language for data manipulation, I made use
of its great support for hash tables to store this information. It turned out,
rather unsurprisingly, that Perl isn't all that conservative with respect to
memory usage. With over 84 million unique files (as it later turned out), even
a hundred bytes of overhead per hash entry already adds up to more than 8 GB,
so neither my 4 GB of RAM nor the extra 4 GB of swap was enough to run the
script to completion. I tried rewriting the script to use a disk-based data
structure, but that slowed things down to a crawl. Some other solution was
needed.

When faced with such a problem, some people will try to optimize the
algorithm, others will throw extra hardware at it, and I did what I do best:
optimize away the constants. That is, I rewrote the data analysis program in
C. Using the excellent L<khash|https://github.com/attractivechaos/klib> hash
table library to keep track of the file information and the equally awesome
L<yxml|http://dev.yorhel.nl/yxml> library (a little bit of self-promotion
doesn't hurt, right?) to do the XML parsing, I was able to do all the
necessary processing in 30 minutes using at most 3.6 GB of RAM.

Long story short, here's my analysis program:
L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.

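For those who just want the gist of it, here's a minimal sketch of how the two
libraries fit together: feed the decompressed XML to yxml one byte at a time
and, whenever the C<TTH> attribute of a C<File> element is complete, throw it
into a khash string set. This is not dcfilestats.c itself (the real program
also tracks file sizes and how often each file is shared), just the rough
shape of the approach, assuming a single decompressed file list on stdin (e.g.
C<bzcat list.xml.bz2 | ./count>):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include "khash.h"   /* https://github.com/attractivechaos/klib */
  #include "yxml.h"    /* http://dev.yorhel.nl/yxml */

  KHASH_SET_INIT_STR(tth)              /* hash set of TTH strings */

  int main(void) {
      static char xmlstack[64*1024];   /* yxml element/attribute name stack */
      yxml_t x;
      yxml_init(&x, xmlstack, sizeof(xmlstack));

      khash_t(tth) *seen = kh_init(tth);
      char val[64];                    /* current attribute value */
      size_t vallen = 0;
      int in_tth = 0, c, ret;
      long total = 0;

      while ((c = getchar()) != EOF) {
          yxml_ret_t r = yxml_parse(&x, c);
          if (r < 0) {
              fprintf(stderr, "XML parse error\n");
              return 1;
          }
          if (r == YXML_ATTRSTART) {
              in_tth = strcmp(x.elem, "File") == 0 && strcmp(x.attr, "TTH") == 0;
              vallen = 0;
          } else if (r == YXML_ATTRVAL && in_tth) {
              size_t l = strlen(x.data);   /* yxml hands over a few bytes at a time */
              if (vallen + l < sizeof(val)) {
                  memcpy(val + vallen, x.data, l + 1);
                  vallen += l;
              }
          } else if (r == YXML_ATTREND && in_tth) {
              total++;
              char *key = strdup(val);
              kh_put(tth, seen, key, &ret);
              if (!ret) free(key);         /* TTH already seen: a duplicate file */
              in_tth = 0;
          }
      }
      printf("total files:  %ld\n", total);
      printf("unique files: %u\n", kh_size(seen));
      return 0;
  }

Swap the set for a C<KHASH_MAP_INIT_STR> from TTH to a counter and you're most
of the way towards the per-file share counts used in the "Distribution of
Files" section below.
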
=head1 A Look at the Stats

Some lists didn't decompress or parse correctly, so the actual number of file
lists used in these stats is B<14137>. The total compressed size of these
lists is B<6,945,269,469> bytes (6.5 GiB), and uncompressed they take up
B<25,533,519,352> bytes (24 GiB). In total these lists mention B<197,413,253>
files. After taking duplicate listings into account, there are still
B<84,131,932> unique files.

And now for some graphs...

=head2 Size of the File Lists

Behold, the compressed and uncompressed size of the downloaded file lists:

[img graph dclistsize.png ]

Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
compressed file list, but lists of 1 MiB aren't too weird, either. The largest
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
easily infer the compression ratio from this graph. Hence, another graph:

[img graph dclistcomp.png ]

Most file lists compress to about 24% to 35% of their original size; the
largest list mentioned above, for example, compresses from 120 MiB down to
34.8 MiB, or roughly 29%. This seems to be consistent with L<similar
measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.

The raw data for these graphs is found in
L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
compressed and uncompressed size, respectively, for each file list. The
gnuplot script for the first graph is
L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for
the second.

=head2 Number of Files Per List

So how many files are people sharing? Let's find out.

[img graph dcnumfiles.png ]

As expected, this graph looks very similar to the one about the size of the
file list. The size of a list tends to be linear in the number of items it
holds, after all.

The raw data for this graph is found in
L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
unique and total number of files, respectively, for each file list. The
gnuplot script is
L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.

=head2 File Sizes

And how large are the files being shared? Well,

[img graph dcfilesize.png ]

This graph is fun, and rather hard to explain without knowing what kind of
files we're dealing with. I'm not going to do any further analysis on what
kind of files these file sizes represent exactly, but I am going to make some
guesses. The files below 1 MiB could be anything: text files, images,
subtitles, source code, etc. And considering that the hub in question doesn't
put a whole lot of effort into weeding out spammers and bots, it's likely that
some malicious users will be sharing small variations of the same virus within
the 100 KiB range. The peak of files between 7 and 10 MiB would likely be
audio files. The number of files larger than, say, 20 MiB drops significantly,
but there are still a few million files in the 20 MiB to 1 GiB range.

I cut off the graph after 10 GiB, but there's apparently someone who claims to
share a file between 1 and 2 TiB (I don't know the exact size due to the
binning). Since I can't imagine why someone would share a file that large, I
expect it to be a fake file list entry. Note that there could be more fakes in
my data set. I can't tell which files are fake and which are genuine from the
information in the file lists, but I don't expect the number of fake files to
be very significant.

The "raw" data for this graph is found in
L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I
wasn't interested in dealing with a text file of 84 million lines, the data is
already binned. The first column is the bin number and the second column is
the number of unique files in that bin. The file sizes that each bin
represents are between C<2^(bin+9)> and C<2^(bin+10)>, with the exception of
bin 0, which starts at a file size of 0. The source of the gnuplot script is
L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.

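In other words, bin 0 covers everything below 1 KiB and bin I<n> covers file
sizes from C<2^(n+9)> up to C<2^(n+10)>. I don't know the exact expression
dcfilestats.c uses, but assuming half-open bins, the mapping from file size to
bin number amounts to something like this:

  /* Sketch of the binning described above, not code taken from
   * dcfilestats.c: bin 0 covers [0, 2^10) bytes and bin n covers
   * [2^(n+9), 2^(n+10)) bytes for n >= 1. */
  static int size_to_bin(unsigned long long size) {
      int bin = 0;
      while (size >= 1024) {   /* one extra bin for every doubling beyond 1 KiB */
          size >>= 1;
          bin++;
      }
      return bin;
  }

For example, a 7 MiB file needs 13 halvings to drop below 1 KiB, so it lands
in bin 13, which covers the 4 MiB to 8 MiB range.
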
=head2 Distribution of Files

Another interesting thing to measure is how often files are shared. That is,
how many users have the same file?

[img graph dcfiledist.png ]

Many files are only available from a single user. That's not really a good
sign when you wish to download such a file, but luckily there are also tons of
files that I<are> available from multiple users. What is interesting in this
graph isn't that it follows a
L<power law|https://en.wikipedia.org/wiki/Power_law>, but what those outliers
could possibly be. There's a collection of 269 files that has been shared
among 831 users, and there appears to be a similar group of around 510-515
files that is shared among 20 or so users. I honestly have no idea what those
collections could be. Well, yes, I could probably figure that out from the
file lists, but my analysis program doesn't tell me which files it's talking
about and I'm too lazy to fix that.

The graph has been clipped to 600, but there's another interesting outlier: a
single file that has been shared by 5668 users. I'm going to guess that this
is the empty file. There are so many ways to get an empty file somewhere in
your filesystem, after all.

The raw data for this graph is found in
L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists,
for each share count, the number of files that are shared by exactly that many
users. The gnuplot script is
L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.

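For completeness: aggregating the per-file share counts into those two columns
is another small khash exercise. The sketch below is hypothetical and not part
of dcfilestats.c; it assumes an intermediate file on stdin with one TTH per
line, one line per (user, file) pair and already deduplicated within each
user's list, and it prints the (unsorted) histogram of how many files are
shared by exactly N users:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include "khash.h"

  KHASH_MAP_INIT_STR(users, unsigned)   /* TTH -> number of users sharing it */
  KHASH_MAP_INIT_INT(hist, unsigned)    /* share count -> number of such files */

  int main(void) {
      khash_t(users) *u = kh_init(users);
      khash_t(hist) *h = kh_init(hist);
      char line[128];
      int ret;
      khiter_t k;

      /* Count, for every TTH, how many lines (= users) mention it. */
      while (fgets(line, sizeof(line), stdin)) {
          line[strcspn(line, "\r\n")] = 0;   /* strip the newline */
          char *key = strdup(line);
          k = kh_put(users, u, key, &ret);
          if (ret) kh_value(u, k) = 0;       /* first user sharing this TTH */
          else free(key);                    /* TTH already known, keep the old key */
          kh_value(u, k)++;
      }

      /* Histogram of the share counts. */
      for (k = kh_begin(u); k != kh_end(u); ++k) {
          if (!kh_exist(u, k)) continue;
          khiter_t j = kh_put(hist, h, kh_value(u, k), &ret);
          if (ret) kh_value(h, j) = 0;
          kh_value(h, j)++;
      }

      /* Times shared, followed by the number of files shared that often. */
      for (k = kh_begin(h); k != kh_end(h); ++k)
          if (kh_exist(h, k))
              printf("%u %u\n", kh_key(h, k), kh_value(h, k));
      return 0;
  }
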
=head1 Final Notes

So, erm, what conclusions can we draw from this? That stats are fun, I guess.
If anyone (including me) is going to repeat this experiment on a fresh data
set, make sure to use a more scalable downloading process than I did. My
approach shouldn't be repeated if we wish to keep the Direct Connect network
alive.

Furthermore, keep in mind that this is just a snapshot of a single day on a
single hub. The graphs may look very different when the file lists are
harvested at some other time. And it's also quite likely that different hubs
will have very different share profiles. It could be interesting to try and
graph everything, but I don't have I<that> kind of free time.