Rewrite to static site

With a complete reorganisation of the directory structure and most of the content converted to pandoc-flavoured markdown. Some TODO's left before this can go live: - Main page - Atom feeds - Bug tracker
2019-03-23 11:52:08 +01:00 · 2019-03-23 11:52:08 +01:00 · 6242b2ee9c
commit 6242b2ee9c
parent 5c85a7d32f
291 changed files with 4346 additions and 6141 deletions
--- a/dat/doc-dcstats
+++ b/dat/doc-dcstats
@ -1,222 +0,0 @@
-Some Measurements on Direct Connect File Lists
-
-=pod
-
-(Published on B<2014-01-09>.)
-
-=head1 Introduction
-
-I've been working on Direct Connect related projects for a while now. This
-includes maintaining L<ncdc|https://dev.yorhel.nl/ncdc> and
-L<Globster|https://dev.yorhel.nl/globster>, and doing a bit of research into
-improving the downloading performance and scalability (to be published at some
-later date). Whether I'm writing code or trying to setup experiments for
-research, there's one thing that helps a lot in making decisions. Measurements
-from an actual network.
-
-Because useful measurements are often missing, I decided to do some myself.
-There's a lot to measure in an actual P2P network, but I restricted myself to
-information that can be gathered quite easily from file lists.
-
-
-=head1 Obtaining the Data
-
-Different hubs will likely have totally different patterns in terms of what is
-being shared. In order to keep this experiment simple, I limited myself to a
-single hub. And in order to get as much data as possible, I chose the hub that
-is commonly known as "Ozerki", famous for being one of the larger hubs in
-existence.
-
-My approach to getting as many file lists as possible from this hub was perhaps
-a bit too simple. I simply modified ncdc to have an "add the file list from all
-users to the download queue" key, and to save all downloaded lists to a
-directory instead of opening them.
-
-I started this downloading process on a Monday around noon when there were a
-little over 11k users online. I hit my hacked download-all-filelists-key two
-more times later that day in order to get the file lists from those users who
-joined the hub at a later time. I let this downloading process running until
-the evening.
-
-One thing I learned from this experience was that the downloading algorithm in
-ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
-to open a connection with B<all> users listed in the download queue. You can
-imagine that trying to connect to 11k users simultaneously put a significantly
-heavier load on the hub than would have been necessary. Not good. Not something
-a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
-mind too much and handled the load fine. This might have been because Mondays
-are typically not the most busy days in P2P land. Weekends tend to be busier.
-
-Despite that scalability issue, I successfully managed to download the file
-lists of almost everyone who remained online for long enough to finally get
-their list downloaded. In total I managed to download 14143 file lists (that's
-one list too many for C<10000*sqrt(2)>, I should have stopped the process a bit
-earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
-
-For obvious reasons, I won't be sharing my modifications to ncdc. I already
-tarnished the reputation of ncdc enough in that single day.  If you wish to
-repeat this experiment, please do so with a scalable downloading
-implementation. :-)
-
-
-=head1 Obtaining the Stats
-
-And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
-XML files. This didn't really sound like much of a challenge. After all, all
-one needs to do is decompress the file lists, do some XML parsing and update
-some values.  Most of the CPU time in this process would likely be spent on
-bzip2 decompression, so I figured I'd just pipe the output of L<bzcat(1)> to a
-Perl script and be done with it.
-
-To get the statistics on the sizes and the distribution of unique files, a data
-structure containing information on all unique files in the lists was
-necessary. Perl being the perfect language for data manipulation, I made use of
-its great support for hash tables to store this information. It turned out,
-rather unsurprisingly, that Perl isn't all that conservative with respect to
-memory usage. Neither my 4GB or RAM nor the extra 4GB of swap turned out to be
-enough to run the script to completion. I tried rewriting the script to use a
-disk-based data structure, but that slowed things down to a crawl. Some other
-solution was needed.
-
-When faced with such a problem, some people will try to optimize the algorithm,
-others will throw extra hardware at it, and I did what I do best: Optimize away
-the constants. That is, I rewrote the data analysis program in C. Using the
-excellent L<khash|https://github.com/attractivechaos/klib> hash table library
-to keep track of the file information and the equally awesome
-L<yxml|https://dev.yorhel.nl/yxml> library (a little bit of self-promotion
-doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary
-processing in 30 minutes using at most 3.6GB of RAM.
-
-Long story short, here's my analysis program:
-L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.
-
-
-=head1 A Look at the Stats
-
-Some lists didn't decompress/parse correctly, so the actual number of file
-lists used in these stats is B<14137>. The total compressed size of these lists
-is B<6,945,269,469> bytes (6.5 GiB), and uncompressed B<25,533,519,352> bytes
-(24 GiB). In total these lists mentioned B<197,413,253> files. After taking
-duplicate listings in account, there's still B<84,131,932> unique files.
-
-And now for some graphs...
-
-=head2 Size of the File Lists
-
-Behold, the compressed and uncompressed size of the downloaded file lists:
-
-[img graph dclistsize.png ]
-
-Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
-compressed file lists, but lists of 1 MiB aren't too weird, either. The largest
-file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
-uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
-easily infer the compression ratio from this graph. Hence, another graph:
-
-[img graph dclistcomp.png ]
-
-Most file lists compress to about 24% - 35% of their original size. This seems
-to be consistent with L<similar
-measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.
-
-The raw data for these graphs is found in
-L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
-compressed and uncompressed size, respectively, for each file list. The gnuplot
-script for the first graph is
-L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
-L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for the
-second.
-
-=head2 Number of Files Per List
-
-So how many files are people sharing? Let's find out.
-
-[img graph dcnumfiles.png ]
-
-As expected, this graph looks very similar to the one about the size of the
-file list. The size of a list tends to be linear in the number of items it
-holds, after all.
-
-The raw data for this graph is found in
-L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
-unique and total number of files, respectively, for each file list. The gnuplot
-script is
-L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.
-
-=head2 File Sizes
-
-And how large are the files being shared? Well,
-
-[img graph dcfilesize.png ]
-
-This graph is fun, and rather hard to explain without knowing what kind of
-files we're dealing with. I'm not going to do any further analysis on what kind
-of files these file sizes represent exactly, but I am going to make some
-guesses.  The files below 1 MiB could be anything, text files, images,
-subtitles, source code, etc. And considering that the hub in question doesn't
-put a whole lot of effort in weeding out spammers and bots, it's likely that
-some malicious users will be sharing small variations of the same virus within
-the 100 KiB range.  The peak of files between 7 and 10 MiB would likely be
-audio files.  The number of files larger than, say, 20 MiB drop significantly,
-but there are still a few million files in the 20 MiB to 1 GiB range.
-
-I cut off the graph after 10 GiB, but there's apparently someone who claims to
-share a file between 1 and 2 TiB (don't know the exact size due to the
-binning). Since I can't imagine why someone would share a file that large, I
-expect it to be a fake file list entry. Note that there could be more fakes in
-my data set. I can't tell which files are fake and which are genuine from the
-information in the file lists, but I don't expect the number of fake files to
-be very significant.
-
-The "raw" data for this graph is found in
-L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I wasn't
-interested in dealing with a text file of 84 million lines, the data is already
-binned. The first column is the bin number and the second column the number of
-unique files in that bin. The file sizes that each bin represents are between
-C<2^(bin+9)> and C<2^(bin+10)>, with the exception of bin 0, which starts at a
-file size of 0. The source of the gnuplot script is
-L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.
-
-=head2 Distribution of Files
-
-Another interesting thing to measure is how often files are shared. That is,
-how many users have the same file?
-
-[img graph dcfiledist.png ]
-
-Many files are only available from a single user. That's not really a good sign
-when you wish to download such a file, but luckily there are also tons of files
-that I<are> available from multiple users. What is interesting in this graph
-isn't that it follows the L<power law|https://en.wikipedia.org/wiki/Power_law>,
-but it's wondering what those outliers could possibly be. There's a collection
-of 269 files that has been shared among 831 users, and there appears to be a
-similar group of around 510-515 files that is shared among 20 or so users. I've
-honestly no idea what those collections could be. Well, yes, I could probably
-figure that out from the file lists, but my analysis program doesn't tell me
-which files it's talking about and I'm too lazy to fix that.
-
-The graph has been clipped to 600, but there's another interesting outlier. A
-single file that has been shared by 5668 users. I'm going to guess that this is
-the empty file. There are so many ways to get an empty file somewhere in your
-filesystem, after all.
-
-The raw data for this graph is found in
-L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists the
-number of times shared and the aggregate number of files. The gnuplot script is
-L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.
-
-
-=head1 Final Notes
-
-So, erm, what conclusions can we draw from this? That stats are fun, I guess.
-If anyone (including me) is going to repeat this experiment on a fresh data
-set, make sure to use a more scalable downloading process than I did. My
-approach shouldn't be repeated if we wish to keep the Direct Connect network
-alive.
-
-Furthermore, keep in mind that this is just a snapshot of a single day on a
-single hub. The graphs may look very different when the file lists are
-harvested at some other time. And it's also quite likely that different hubs
-will have very different share profiles. It could be interesting to try and
-graph everything, but I don't have I<that> kind of free time.
-