Rewrite to static site
With a complete reorganisation of the directory structure and most of the content converted to pandoc-flavoured markdown. Some TODO's left before this can go live: - Main page - Atom feeds - Bug tracker
This commit is contained in:
parent
5c85a7d32f
commit
6242b2ee9c
291 changed files with 4346 additions and 6141 deletions
222
dat/doc-dcstats
222
dat/doc-dcstats
|
|
@ -1,222 +0,0 @@
|
|||
Some Measurements on Direct Connect File Lists
|
||||
|
||||
=pod
|
||||
|
||||
(Published on B<2014-01-09>.)
|
||||
|
||||
=head1 Introduction
|
||||
|
||||
I've been working on Direct Connect related projects for a while now. This
|
||||
includes maintaining L<ncdc|https://dev.yorhel.nl/ncdc> and
|
||||
L<Globster|https://dev.yorhel.nl/globster>, and doing a bit of research into
|
||||
improving the downloading performance and scalability (to be published at some
|
||||
later date). Whether I'm writing code or trying to setup experiments for
|
||||
research, there's one thing that helps a lot in making decisions. Measurements
|
||||
from an actual network.
|
||||
|
||||
Because useful measurements are often missing, I decided to do some myself.
|
||||
There's a lot to measure in an actual P2P network, but I restricted myself to
|
||||
information that can be gathered quite easily from file lists.
|
||||
|
||||
|
||||
=head1 Obtaining the Data
|
||||
|
||||
Different hubs will likely have totally different patterns in terms of what is
|
||||
being shared. In order to keep this experiment simple, I limited myself to a
|
||||
single hub. And in order to get as much data as possible, I chose the hub that
|
||||
is commonly known as "Ozerki", famous for being one of the larger hubs in
|
||||
existence.
|
||||
|
||||
My approach to getting as many file lists as possible from this hub was perhaps
|
||||
a bit too simple. I simply modified ncdc to have an "add the file list from all
|
||||
users to the download queue" key, and to save all downloaded lists to a
|
||||
directory instead of opening them.
|
||||
|
||||
I started this downloading process on a Monday around noon when there were a
|
||||
little over 11k users online. I hit my hacked download-all-filelists-key two
|
||||
more times later that day in order to get the file lists from those users who
|
||||
joined the hub at a later time. I let this downloading process running until
|
||||
the evening.
|
||||
|
||||
One thing I learned from this experience was that the downloading algorithm in
|
||||
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
|
||||
to open a connection with B<all> users listed in the download queue. You can
|
||||
imagine that trying to connect to 11k users simultaneously put a significantly
|
||||
heavier load on the hub than would have been necessary. Not good. Not something
|
||||
a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
|
||||
mind too much and handled the load fine. This might have been because Mondays
|
||||
are typically not the most busy days in P2P land. Weekends tend to be busier.
|
||||
|
||||
Despite that scalability issue, I successfully managed to download the file
|
||||
lists of almost everyone who remained online for long enough to finally get
|
||||
their list downloaded. In total I managed to download 14143 file lists (that's
|
||||
one list too many for C<10000*sqrt(2)>, I should have stopped the process a bit
|
||||
earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
|
||||
|
||||
For obvious reasons, I won't be sharing my modifications to ncdc. I already
|
||||
tarnished the reputation of ncdc enough in that single day. If you wish to
|
||||
repeat this experiment, please do so with a scalable downloading
|
||||
implementation. :-)
|
||||
|
||||
|
||||
=head1 Obtaining the Stats
|
||||
|
||||
And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
|
||||
XML files. This didn't really sound like much of a challenge. After all, all
|
||||
one needs to do is decompress the file lists, do some XML parsing and update
|
||||
some values. Most of the CPU time in this process would likely be spent on
|
||||
bzip2 decompression, so I figured I'd just pipe the output of L<bzcat(1)> to a
|
||||
Perl script and be done with it.
|
||||
|
||||
To get the statistics on the sizes and the distribution of unique files, a data
|
||||
structure containing information on all unique files in the lists was
|
||||
necessary. Perl being the perfect language for data manipulation, I made use of
|
||||
its great support for hash tables to store this information. It turned out,
|
||||
rather unsurprisingly, that Perl isn't all that conservative with respect to
|
||||
memory usage. Neither my 4GB or RAM nor the extra 4GB of swap turned out to be
|
||||
enough to run the script to completion. I tried rewriting the script to use a
|
||||
disk-based data structure, but that slowed things down to a crawl. Some other
|
||||
solution was needed.
|
||||
|
||||
When faced with such a problem, some people will try to optimize the algorithm,
|
||||
others will throw extra hardware at it, and I did what I do best: Optimize away
|
||||
the constants. That is, I rewrote the data analysis program in C. Using the
|
||||
excellent L<khash|https://github.com/attractivechaos/klib> hash table library
|
||||
to keep track of the file information and the equally awesome
|
||||
L<yxml|https://dev.yorhel.nl/yxml> library (a little bit of self-promotion
|
||||
doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary
|
||||
processing in 30 minutes using at most 3.6GB of RAM.
|
||||
|
||||
Long story short, here's my analysis program:
|
||||
L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.
|
||||
|
||||
|
||||
=head1 A Look at the Stats
|
||||
|
||||
Some lists didn't decompress/parse correctly, so the actual number of file
|
||||
lists used in these stats is B<14137>. The total compressed size of these lists
|
||||
is B<6,945,269,469> bytes (6.5 GiB), and uncompressed B<25,533,519,352> bytes
|
||||
(24 GiB). In total these lists mentioned B<197,413,253> files. After taking
|
||||
duplicate listings in account, there's still B<84,131,932> unique files.
|
||||
|
||||
And now for some graphs...
|
||||
|
||||
=head2 Size of the File Lists
|
||||
|
||||
Behold, the compressed and uncompressed size of the downloaded file lists:
|
||||
|
||||
[img graph dclistsize.png ]
|
||||
|
||||
Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
|
||||
compressed file lists, but lists of 1 MiB aren't too weird, either. The largest
|
||||
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
|
||||
uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
|
||||
easily infer the compression ratio from this graph. Hence, another graph:
|
||||
|
||||
[img graph dclistcomp.png ]
|
||||
|
||||
Most file lists compress to about 24% - 35% of their original size. This seems
|
||||
to be consistent with L<similar
|
||||
measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.
|
||||
|
||||
The raw data for these graphs is found in
|
||||
L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
|
||||
compressed and uncompressed size, respectively, for each file list. The gnuplot
|
||||
script for the first graph is
|
||||
L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
|
||||
L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for the
|
||||
second.
|
||||
|
||||
=head2 Number of Files Per List
|
||||
|
||||
So how many files are people sharing? Let's find out.
|
||||
|
||||
[img graph dcnumfiles.png ]
|
||||
|
||||
As expected, this graph looks very similar to the one about the size of the
|
||||
file list. The size of a list tends to be linear in the number of items it
|
||||
holds, after all.
|
||||
|
||||
The raw data for this graph is found in
|
||||
L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
|
||||
unique and total number of files, respectively, for each file list. The gnuplot
|
||||
script is
|
||||
L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.
|
||||
|
||||
=head2 File Sizes
|
||||
|
||||
And how large are the files being shared? Well,
|
||||
|
||||
[img graph dcfilesize.png ]
|
||||
|
||||
This graph is fun, and rather hard to explain without knowing what kind of
|
||||
files we're dealing with. I'm not going to do any further analysis on what kind
|
||||
of files these file sizes represent exactly, but I am going to make some
|
||||
guesses. The files below 1 MiB could be anything, text files, images,
|
||||
subtitles, source code, etc. And considering that the hub in question doesn't
|
||||
put a whole lot of effort in weeding out spammers and bots, it's likely that
|
||||
some malicious users will be sharing small variations of the same virus within
|
||||
the 100 KiB range. The peak of files between 7 and 10 MiB would likely be
|
||||
audio files. The number of files larger than, say, 20 MiB drop significantly,
|
||||
but there are still a few million files in the 20 MiB to 1 GiB range.
|
||||
|
||||
I cut off the graph after 10 GiB, but there's apparently someone who claims to
|
||||
share a file between 1 and 2 TiB (don't know the exact size due to the
|
||||
binning). Since I can't imagine why someone would share a file that large, I
|
||||
expect it to be a fake file list entry. Note that there could be more fakes in
|
||||
my data set. I can't tell which files are fake and which are genuine from the
|
||||
information in the file lists, but I don't expect the number of fake files to
|
||||
be very significant.
|
||||
|
||||
The "raw" data for this graph is found in
|
||||
L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I wasn't
|
||||
interested in dealing with a text file of 84 million lines, the data is already
|
||||
binned. The first column is the bin number and the second column the number of
|
||||
unique files in that bin. The file sizes that each bin represents are between
|
||||
C<2^(bin+9)> and C<2^(bin+10)>, with the exception of bin 0, which starts at a
|
||||
file size of 0. The source of the gnuplot script is
|
||||
L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.
|
||||
|
||||
=head2 Distribution of Files
|
||||
|
||||
Another interesting thing to measure is how often files are shared. That is,
|
||||
how many users have the same file?
|
||||
|
||||
[img graph dcfiledist.png ]
|
||||
|
||||
Many files are only available from a single user. That's not really a good sign
|
||||
when you wish to download such a file, but luckily there are also tons of files
|
||||
that I<are> available from multiple users. What is interesting in this graph
|
||||
isn't that it follows the L<power law|https://en.wikipedia.org/wiki/Power_law>,
|
||||
but it's wondering what those outliers could possibly be. There's a collection
|
||||
of 269 files that has been shared among 831 users, and there appears to be a
|
||||
similar group of around 510-515 files that is shared among 20 or so users. I've
|
||||
honestly no idea what those collections could be. Well, yes, I could probably
|
||||
figure that out from the file lists, but my analysis program doesn't tell me
|
||||
which files it's talking about and I'm too lazy to fix that.
|
||||
|
||||
The graph has been clipped to 600, but there's another interesting outlier. A
|
||||
single file that has been shared by 5668 users. I'm going to guess that this is
|
||||
the empty file. There are so many ways to get an empty file somewhere in your
|
||||
filesystem, after all.
|
||||
|
||||
The raw data for this graph is found in
|
||||
L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists the
|
||||
number of times shared and the aggregate number of files. The gnuplot script is
|
||||
L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.
|
||||
|
||||
|
||||
=head1 Final Notes
|
||||
|
||||
So, erm, what conclusions can we draw from this? That stats are fun, I guess.
|
||||
If anyone (including me) is going to repeat this experiment on a fresh data
|
||||
set, make sure to use a more scalable downloading process than I did. My
|
||||
approach shouldn't be repeated if we wish to keep the Direct Connect network
|
||||
alive.
|
||||
|
||||
Furthermore, keep in mind that this is just a snapshot of a single day on a
|
||||
single hub. The graphs may look very different when the file lists are
|
||||
harvested at some other time. And it's also quite likely that different hubs
|
||||
will have very different share profiles. It could be interesting to try and
|
||||
graph everything, but I don't have I<that> kind of free time.
|
||||
|
||||
Loading…
Add table
Add a link
Reference in a new issue