ncdc 1.18.1 + yxml manual + dcstats + minor restyle

...I need to commit more often.
This commit is contained in:
Yorhel 2014-02-11 10:28:26 +01:00
parent 610b0fb31c
commit 57e7bb546e
20 changed files with 339 additions and 56 deletions


@@ -6,6 +6,11 @@ rare occasions are published on this page.
=over
=item C<2014-01-09 > - L<Some Measurements on Direct Connect File Lists|http://dev.yorhel.nl/doc/dcstats>
The report of a short measurement study on the file lists obtained from a
Direct Connect hub. Lots of graphs!
=item C<2012-02-15 > - L<A Distributed Communication System for Modular Applications|http://dev.yorhel.nl/doc/commvis>
In this article I explain a vision of mine, and the results of a small research

dat/doc-dcstats Normal file

@@ -0,0 +1,222 @@
Some Measurements on Direct Connect File Lists
=pod
(Published on B<2014-01-09>.)
=head1 Introduction
I've been working on Direct Connect related projects for a while now. This
includes maintaining L<ncdc|http://dev.yorhel.nl/ncdc> and
L<Globster|http://dev.yorhel.nl/globster>, and doing a bit of research into
improving the downloading performance and scalability (to be published at some
later date). Whether I'm writing code or trying to set up experiments for
research, there's one thing that helps a lot in making decisions:
measurements from an actual network.
Because useful measurements are often missing, I decided to do some myself.
There's a lot to measure in an actual P2P network, but I restricted myself to
information that can be gathered quite easily from file lists.
=head1 Obtaining the Data
Different hubs will likely have totally different patterns in terms of what is
being shared. In order to keep this experiment simple, I limited myself to a
single hub. And in order to get as much data as possible, I chose the hub that
is commonly known as "Ozerki", famous for being one of the larger hubs in
existence.
My approach to getting as many file lists as possible from this hub was perhaps
a bit too simple. I simply modified ncdc to have an "add the file list from all
users to the download queue" key, and to save all downloaded lists to a
directory instead of opening them.
I started this downloading process on a Monday around noon when there were a
little over 11k users online. I hit my hacked download-all-filelists-key two
more times later that day in order to get the file lists from those users who
joined the hub at a later time. I let this downloading process run until
the evening.
One thing I learned from this experience was that the downloading algorithm in
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
to open a connection with B<all> users listed in the download queue. You can
imagine that trying to connect to 11k users simultaneously put a significantly
heavier load on the hub than would have been necessary. Not good. Not something
a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
mind too much and handled the load fine. This might have been because Mondays
are typically not the most busy days in P2P land. Weekends tend to be busier.
Despite that scalability issue, I successfully managed to download the file
lists of almost everyone who remained online for long enough to finally get
their list downloaded. In total I managed to download 14143 file lists (that's
one list too many for C<10000*sqrt(2)>, I should have stopped the process a bit
earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
For obvious reasons, I won't be sharing my modifications to ncdc. I already
tarnished the reputation of ncdc enough in that single day. If you wish to
repeat this experiment, please do so with a scalable downloading
implementation. :-)
=head1 Obtaining the Stats
And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
XML files. This didn't really sound like much of a challenge. After all, all
one needs to do is decompress the file lists, do some XML parsing and update
some values. Most of the CPU time in this process would likely be spent on
bzip2 decompression, so I figured I'd just pipe the output of L<bzcat(1)> to a
Perl script and be done with it.
To get the statistics on the sizes and the distribution of unique files, a data
structure containing information on all unique files in the lists was
necessary. Perl being the perfect language for data manipulation, I made use of
its great support for hash tables to store this information. It turned out,
rather unsurprisingly, that Perl isn't all that conservative with respect to
memory usage. Neither my 4GB of RAM nor the extra 4GB of swap turned out to be
enough to run the script to completion. I tried rewriting the script to use a
disk-based data structure, but that slowed things down to a crawl. Some other
solution was needed.
When faced with such a problem, some people will try to optimize the algorithm,
others will throw extra hardware at it, and I did what I do best: Optimize away
the constants. That is, I rewrote the data analysis program in C. Using the
excellent L<khash|https://github.com/attractivechaos/klib> hash table library
to keep track of the file information and the equally awesome
L<yxml|http://dev.yorhel.nl/yxml> library (a little bit of self-promotion
doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary
processing in 30 minutes using at most 3.6GB of RAM.
Long story short, here's my analysis program:
L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.
=head1 A Look at the Stats
Some lists didn't decompress/parse correctly, so the actual number of file
lists used in these stats is B<14137>. The total compressed size of these lists
is B<6,945,269,469> bytes (6.5 GiB), and uncompressed B<25,533,519,352> bytes
(24 GiB). In total these lists mentioned B<197,413,253> files. After taking
duplicate listings into account, there are still B<84,131,932> unique files.
And now for some graphs...
=head2 Size of the File Lists
Behold, the compressed and uncompressed size of the downloaded file lists:
[img graph dclistsize.png ]
Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
compressed file list, but lists of 1 MiB aren't too weird, either. The largest
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
easily infer the compression ratio from this graph. Hence, another graph:
[img graph dclistcomp.png ]
Most file lists compress to about 24% - 35% of their original size. This seems
to be consistent with L<similar
measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.
The raw data for these graphs is found in
L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
compressed and uncompressed size, respectively, for each file list. The gnuplot
script for the first graph is
L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for the
second.
=head2 Number of Files Per List
So how many files are people sharing? Let's find out.
[img graph dcnumfiles.png ]
As expected, this graph looks very similar to the one about the size of the
file list. The size of a list tends to be linear in the number of items it
holds, after all.
The raw data for this graph is found in
L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
unique and total number of files, respectively, for each file list. The gnuplot
script is
L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.
=head2 File Sizes
And how large are the files being shared? Well,
[img graph dcfilesize.png ]
This graph is fun, and rather hard to explain without knowing what kind of
files we're dealing with. I'm not going to do any further analysis on what kind
of files these file sizes represent exactly, but I am going to make some
guesses. The files below 1 MiB could be anything: text files, images,
subtitles, source code, etc. And considering that the hub in question doesn't
put a whole lot of effort in weeding out spammers and bots, it's likely that
some malicious users will be sharing small variations of the same virus within
the 100 KiB range. The peak of files between 7 and 10 MiB would likely be
audio files. The number of files larger than, say, 20 MiB drops significantly,
but there are still a few million files in the 20 MiB to 1 GiB range.
I cut off the graph after 10 GiB, but there's apparently someone who claims to
share a file between 1 and 2 TiB (don't know the exact size due to the
binning). Since I can't imagine why someone would share a file that large, I
expect it to be a fake file list entry. Note that there could be more fakes in
my data set. I can't tell which files are fake and which are genuine from the
information in the file lists, but I don't expect the number of fake files to
be very significant.
The "raw" data for this graph is found in
L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I wasn't
interested in dealing with a text file of 84 million lines, the data is already
binned. The first column is the bin number and the second column the number of
unique files in that bin. The file sizes that each bin represents are between
C<2^(bin+9)> and C<2^(bin+10)>, with the exception of bin 0, which starts at a
file size of 0. The source of the gnuplot script is
L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.
=head2 Distribution of Files
Another interesting thing to measure is how often files are shared. That is,
how many users have the same file?
[img graph dcfiledist.png ]
Many files are only available from a single user. That's not really a good sign
when you wish to download such a file, but luckily there are also tons of files
that I<are> available from multiple users. What is interesting in this graph
isn't that it follows the L<power law|https://en.wikipedia.org/wiki/Power_law>,
but wondering what those outliers could possibly be. There's a collection
of 269 files that has been shared among 831 users, and there appears to be a
similar group of around 510-515 files that is shared among 20 or so users. I've
honestly no idea what those collections could be. Well, yes, I could probably
figure that out from the file lists, but my analysis program doesn't tell me
which files it's talking about and I'm too lazy to fix that.
The graph has been clipped to 600, but there's another interesting outlier: a
single file that has been shared by 5668 users. I'm going to guess that this is
the empty file. There are so many ways to get an empty file somewhere in your
filesystem, after all.
The raw data for this graph is found in
L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists the
number of times shared and the aggregate number of files. The gnuplot script is
L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.
=head1 Final Notes
So, erm, what conclusions can we draw from this? That stats are fun, I guess.
If anyone (including me) is going to repeat this experiment on a fresh data
set, make sure to use a more scalable downloading process than I did. My
approach shouldn't be repeated if we wish to keep the Direct Connect network
alive.
Furthermore, keep in mind that this is just a snapshot of a single day on a
single hub. The graphs may look very different when the file lists are
harvested at some other time. And it's also quite likely that different hubs
will have very different share profiles. It could be interesting to try and
graph everything, but I don't have I<that> kind of free time.


@@ -10,14 +10,14 @@ ncurses interface.
=item Latest version
1.18 ([dllink ncdc-1.18.tar.gz download]
1.18.1 ([dllink ncdc-1.18.1.tar.gz download]
- L<changes|http://dev.yorhel.nl/ncdc/changes>
- L<mirror|https://sourceforge.net/projects/ncdc/files/ncdc/>)
Convenient static binaries for Linux:
L<64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.tar.gz> -
L<32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.tar.gz> -
L<ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.tar.gz>. Check the
L<64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.1.tar.gz> -
L<32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.1.tar.gz> -
L<ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.1.tar.gz>. Check the
L<installation instructions|http://dev.yorhel.nl/ncdc/install> for more info.
=item Development version
@@ -45,6 +45,7 @@ C<adc://dc.blicky.net:2780/> - If the mailing list is too slow for you.
Are available for the following systems:
L<Arch Linux|http://aur.archlinux.org/packages.php?ID=50949> -
L<Fedora|https://apps.fedoraproject.org/packages/ncdc/overview/> -
L<FreeBSD|http://www.freshports.org/net-p2p/ncdc/> -
L<Frugalware|http://frugalware.org/packages/136807> -
L<Gentoo|http://packages.gentoo.org/package/net-p2p/ncdc> -
@@ -55,6 +56,9 @@ L<OpenSUSE|http://packman.links2linux.org/package/ncdc>
I also have a few packages on the L<Open Build
Service|https://build.opensuse.org/package/show?package=ncdc&project=home%3Ayorhel>.
A convenient installer is available for
L<Android|http://code.ivysaur.me/ncdcinstaller.html>.
=back
=cut


@@ -1,3 +1,8 @@
1.18.1 - 2013-10-05
- Fix crash when downloading files from multiple sources
- Use the yxml library to parse files.xml.bz2 files
- Fix various XML conformance bugs in parsing files.xml.bz2 files
1.18 - 2013-09-25
- Add support for segmented downloading
- Support $MyINFO without flags byte on NMDC hubs


@@ -38,11 +38,11 @@ compiling and/or installing it, I also offer statically linked binaries:
=over
=item * L<Linux, 64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.tar.gz>
=item * L<Linux, 64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.1.tar.gz>
=item * L<Linux, 32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.tar.gz>
=item * L<Linux, 32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.1.tar.gz>
=item * L<Linux, ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.tar.gz>
=item * L<Linux, ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.1.tar.gz>
=back
@@ -58,6 +58,12 @@ architecture, please bug me and I'll see what I can do.
=head1 System-specific instructions
=head2 Android
A L<convenient installer|http://code.ivysaur.me/ncdcinstaller.html> is
available for Android 2.3 and later, which makes use of the static binary.
=head2 Arch Linux
Ncdc is available on L<AUR|https://aur.archlinux.org/packages.php?ID=50949>, to
@@ -70,6 +76,15 @@ favorite, go for the manual approach:
makepkg -si
=head2 Fedora
There's a L<package|https://apps.fedoraproject.org/packages/ncdc/overview/>
available for Fedora.
Alternatively, I also have packages on the L<Open Build
Service|http://software.opensuse.org/download/package?project=home:yorhel&package=ncdc>.
=head2 FreeBSD
@@ -115,9 +130,9 @@ First install some required packages (as root):
Then, fetch the ncdc source tarball, extract and build as follows:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
export PATH="$PATH:/usr/perl5/5.10.0/bin"
./configure --prefix=/usr LDFLAGS='-L/usr/gnu/lib -R/usr/gnu/lib'
make
@@ -165,9 +180,9 @@ required libraries:
Then run the following commands to download and install ncdc:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
./configure --prefix=/usr
make
sudo make install
@@ -209,8 +224,8 @@ website|http://cygwin.com/> and use it to install the following packages:
Then open a Cygwin terminal and run the following commands to download,
compile, and install ncdc:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
./configure --prefix=/usr
make install


@@ -41,7 +41,7 @@ notifications for new releases.
=head2 Packages and ports
Ncdu has been packaged for quite a few systems already, here's a list of the ones I am aware of:
Ncdu has been packaged for quite a few systems, here's a list of the ones I am aware of:
L<AgiliaLinux|http://packages.agilialinux.ru/search.php?tag=sys-fs> -
L<AIX|http://www.perzl.org/aix/index.php?n=Main.Ncdu> -
@@ -68,7 +68,7 @@ L<Zenwalk|http://zur.zenwalk.org/view/package/name/ncdu>
Packages for CentOS, RHEL and (open)SUSE can be found on the
L<Open Build Service|https://build.opensuse.org/package/show?package=ncdu&project=utilities>.
Packages for NetBSD, DragonFlyBSD, MirBSD and others and be found on
Packages for NetBSD, DragonFlyBSD, MirBSD and others can be found on
L<pkgsrc|http://pkgsrc.se/sysutils/ncdu>.


@@ -11,14 +11,15 @@ The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
is available under a permissive MIT license. The only two files you need are
L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h>, which can easily be
included and compiled as part of your project. Minimal documentation is
included in yxml.h, more complete documentation is pending.
included and compiled as part of your project. Complete API documentation is
available in L<the manual|http://dev.yorhel.nl/yxml/man>.
The API follows a simple, mostly buffer-less design and only consists of two
functions:
The API follows a simple and mostly buffer-less design, and only consists of
three functions:
void yxml_init(yxml_t *x, char *stack, size_t stacksize);
void yxml_init(yxml_t *x, void *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
Be aware that I<simple> is not necessarily I<easy> or I<convenient>. The API is
relatively low-level and designed to integrate into pretty much any application
@@ -28,11 +29,9 @@ devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.
Yxml is still in a beta stage and hasn't been very thoroughly tested yet. There
are no tarball releases available at the moment. The API and ABI may still
change a bit, so I strongly advise against dynamic linking (I'm not sure if
I'll ever promise a stable ABI, but the API should certainly get stabilized at
some point).
There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.
=head3 Features
@@ -95,11 +94,11 @@ using C<< <!ENTITY> >>.
=back
These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. All of
the above mentioned issues can be fixed on top of yxml (by the application, or
by a wrapper) if strict conformance is required. With the exception of custom
entity references, but I have a simple idea on how to support that in the
future, too.
of yxml, and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.
=head3 Non-features
@@ -136,7 +135,7 @@ implementation is also included as an indication of the "theoretical" minimum.
expat 2.1.0 MIT 162 139 194 432 1.47 1.09
libxml2 2.9.1 MIT 464 328 518 816 2.53 1.75
mxml 2.7 LGPL2+static 32 733 75 832 12.38 7.80
yxml git MIT 5 935 31 384 1.14 0.74
yxml git MIT 5 971 31 416 1.15 0.74
The code for these benchmarks is available in the
L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
@@ -177,7 +176,7 @@ with C<-Os> than with C<-O2>.
expat 2.1.0 MIT 113 314 145 632 1.58 1.20
libxml2 2.9.1 MIT 356 948 412 256 3.01 2.08
mxml 2.7 LGPL2+static 27 725 71 704 11.70 7.44
yxml git MIT 4 835 30 264 1.72 1.05
yxml git MIT 4 955 30 392 1.67 1.02
=head2 Validating vs. non-validating
@@ -204,6 +203,6 @@ It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
schema. Wikipedia L<has more
information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on this.
Using a validating parser for such documents would only introduce bloat and may
Using a validating parser for such documents would only add bloat and may
introduce L<potential security
vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.

dat/yxml-man Symbolic link

@@ -0,0 +1 @@
../../yxml/yxml.pod