ncdc 1.18.1 + yxml manual + dcstats + minor restyle

...I need to commit more often.
This commit is contained in:
Yorhel 2014-02-11 10:28:26 +01:00
parent 610b0fb31c
commit 57e7bb546e
20 changed files with 339 additions and 56 deletions


@@ -6,6 +6,11 @@ rare occasions are published on this page.
=over
=item C<2014-01-09 > - L<Some Measurements on Direct Connect File Lists|http://dev.yorhel.nl/doc/dcstats>
The report of a short measurement study on the file lists obtained from a
Direct Connect hub. Lots of graphs!
=item C<2012-02-15 > - L<A Distributed Communication System for Modular Applications|http://dev.yorhel.nl/doc/commvis>
In this article I explain a vision of mine, and the results of a small research

dat/doc-dcstats Normal file

@@ -0,0 +1,222 @@
Some Measurements on Direct Connect File Lists
=pod
(Published on B<2014-01-09>.)
=head1 Introduction
I've been working on Direct Connect related projects for a while now. This
includes maintaining L<ncdc|http://dev.yorhel.nl/ncdc> and
L<Globster|http://dev.yorhel.nl/globster>, and doing a bit of research into
improving the downloading performance and scalability (to be published at some
later date). Whether I'm writing code or trying to set up experiments for
research, there's one thing that helps a lot in making decisions:
measurements from an actual network.
Because useful measurements are often missing, I decided to do some myself.
There's a lot to measure in an actual P2P network, but I restricted myself to
information that can be gathered quite easily from file lists.
=head1 Obtaining the Data
Different hubs will likely have totally different patterns in terms of what is
being shared. In order to keep this experiment simple, I limited myself to a
single hub. And in order to get as much data as possible, I chose the hub that
is commonly known as "Ozerki", famous for being one of the larger hubs in
existence.
My approach to getting as many file lists as possible from this hub was perhaps
a bit too simple. I simply modified ncdc to have an "add the file list from all
users to the download queue" key, and to save all downloaded lists to a
directory instead of opening them.
I started this downloading process on a Monday around noon when there were a
little over 11k users online. I hit my hacked download-all-filelists-key two
more times later that day in order to get the file lists from those users who
joined the hub at a later time. I let this downloading process run until
the evening.
One thing I learned from this experience was that the downloading algorithm in
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
to open a connection with B<all> users listed in the download queue. You can
imagine that trying to connect to 11k users simultaneously put a significantly
heavier load on the hub than would have been necessary. Not good. Not something
a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
mind too much and handled the load fine. This might have been because Mondays
are typically not the most busy days in P2P land. Weekends tend to be busier.
Despite that scalability issue, I successfully managed to download the file
lists of almost everyone who remained online for long enough to finally get
their list downloaded. In total I managed to download 14143 file lists (that's
one list too many for C<10000*sqrt(2)>, I should have stopped the process a bit
earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
For obvious reasons, I won't be sharing my modifications to ncdc. I already
tarnished the reputation of ncdc enough in that single day. If you wish to
repeat this experiment, please do so with a scalable downloading
implementation. :-)
=head1 Obtaining the Stats
And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
XML files. This didn't really sound like much of a challenge. After all, all
one needs to do is decompress the file lists, do some XML parsing and update
some values. Most of the CPU time in this process would likely be spent on
bzip2 decompression, so I figured I'd just pipe the output of L<bzcat(1)> to a
Perl script and be done with it.
To get the statistics on the sizes and the distribution of unique files, a data
structure containing information on all unique files in the lists was
necessary. Perl being the perfect language for data manipulation, I made use of
its great support for hash tables to store this information. It turned out,
rather unsurprisingly, that Perl isn't all that conservative with respect to
memory usage. Neither my 4GB of RAM nor the extra 4GB of swap turned out to be
enough to run the script to completion. I tried rewriting the script to use a
disk-based data structure, but that slowed things down to a crawl. Some other
solution was needed.
When faced with such a problem, some people will try to optimize the algorithm,
others will throw extra hardware at it, and I did what I do best: Optimize away
the constants. That is, I rewrote the data analysis program in C. Using the
excellent L<khash|https://github.com/attractivechaos/klib> hash table library
to keep track of the file information and the equally awesome
L<yxml|http://dev.yorhel.nl/yxml> library (a little bit of self-promotion
doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary
processing in 30 minutes using at most 3.6GB of RAM.
Long story short, here's my analysis program:
L<dcfilestats.c|http://g.blicky.net/dcstats.git/tree/dcfilestats.c>.
=head1 A Look at the Stats
Some lists didn't decompress/parse correctly, so the actual number of file
lists used in these stats is B<14137>. The total compressed size of these lists
is B<6,945,269,469> bytes (6.5 GiB), and uncompressed B<25,533,519,352> bytes
(24 GiB). In total these lists mentioned B<197,413,253> files. After taking
duplicate listings into account, there are still B<84,131,932> unique files.
And now for some graphs...
=head2 Size of the File Lists
Behold, the compressed and uncompressed size of the downloaded file lists:
[img graph dclistsize.png ]
Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
compressed file list, but lists of 1 MiB aren't too weird, either. The largest
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't
easily infer the compression ratio from this graph. Hence, another graph:
[img graph dclistcomp.png ]
Most file lists compress to about 24% - 35% of their original size. This seems
to be consistent with L<similar
measurements|http://forum.dcbase.org/viewtopic.php?f=18&t=667> done in 2010.
The raw data for these graphs is found in
L<dclistsize|http://g.blicky.net/dcstats.git/tree/dclistsize>, which lists the
compressed and uncompressed size, respectively, for each file list. The gnuplot
script for the first graph is
L<dclistsize.plot|http://g.blicky.net/dcstats.git/tree/dclistsize.plot> and
L<dclistcomp.plot|http://g.blicky.net/dcstats.git/tree/dclistcomp.plot> for the
second.
=head2 Number of Files Per List
So how many files are people sharing? Let's find out.
[img graph dcnumfiles.png ]
As expected, this graph looks very similar to the one about the size of the
file list. The size of a list tends to be linear in the number of items it
holds, after all.
The raw data for this graph is found in
L<dcnumfiles|http://g.blicky.net/dcstats.git/tree/dcnumfiles>, which lists the
unique and total number of files, respectively, for each file list. The gnuplot
script is
L<dcnumfiles.plot|http://g.blicky.net/dcstats.git/tree/dcnumfiles.plot>.
=head2 File Sizes
And how large are the files being shared? Well,
[img graph dcfilesize.png ]
This graph is fun, and rather hard to explain without knowing what kind of
files we're dealing with. I'm not going to do any further analysis on what kind
of files these file sizes represent exactly, but I am going to make some
guesses. The files below 1 MiB could be anything: text files, images,
subtitles, source code, etc. And considering that the hub in question doesn't
put a whole lot of effort in weeding out spammers and bots, it's likely that
some malicious users will be sharing small variations of the same virus within
the 100 KiB range. The peak of files between 7 and 10 MiB would likely be
audio files. The number of files larger than, say, 20 MiB drops significantly,
but there are still a few million files in the 20 MiB to 1 GiB range.
I cut off the graph after 10 GiB, but there's apparently someone who claims to
share a file between 1 and 2 TiB (don't know the exact size due to the
binning). Since I can't imagine why someone would share a file that large, I
expect it to be a fake file list entry. Note that there could be more fakes in
my data set. I can't tell which files are fake and which are genuine from the
information in the file lists, but I don't expect the number of fake files to
be very significant.
The "raw" data for this graph is found in
L<dcfilesize|http://g.blicky.net/dcstats.git/tree/dcfilesize>. Because I wasn't
interested in dealing with a text file of 84 million lines, the data is already
binned. The first column is the bin number and the second column the number of
unique files in that bin. The file sizes that each bin represents are between
C<2^(bin+9)> and C<2^(bin+10)>, with the exception of bin 0, which starts at a
file size of 0. The source of the gnuplot script is
L<dcfilesize.plot|http://g.blicky.net/dcstats.git/tree/dcfilesize.plot>.
=head2 Distribution of Files
Another interesting thing to measure is how often files are shared. That is,
how many users have the same file?
[img graph dcfiledist.png ]
Many files are only available from a single user. That's not really a good sign
when you wish to download such a file, but luckily there are also tons of files
that I<are> available from multiple users. What is interesting in this graph
isn't that it follows the L<power law|https://en.wikipedia.org/wiki/Power_law>,
but wondering what those outliers could possibly be. There's a collection
of 269 files that has been shared among 831 users, and there appears to be a
similar group of around 510-515 files that is shared among 20 or so users. I've
honestly no idea what those collections could be. Well, yes, I could probably
figure that out from the file lists, but my analysis program doesn't tell me
which files it's talking about and I'm too lazy to fix that.
The graph has been clipped to 600, but there's another interesting outlier: a
single file that has been shared by 5668 users. I'm going to guess that this is
the empty file. There are so many ways to get an empty file somewhere in your
filesystem, after all.
The raw data for this graph is found in
L<dcfiledist|http://g.blicky.net/dcstats.git/tree/dcfiledist>, which lists the
number of times shared and the aggregate number of files. The gnuplot script is
L<dcfiledist.plot|http://g.blicky.net/dcstats.git/tree/dcfiledist.plot>.
=head1 Final Notes
So, erm, what conclusions can we draw from this? That stats are fun, I guess.
If anyone (including me) is going to repeat this experiment on a fresh data
set, make sure to use a more scalable downloading process than I did. My
approach shouldn't be repeated if we wish to keep the Direct Connect network
alive.
Furthermore, keep in mind that this is just a snapshot of a single day on a
single hub. The graphs may look very different when the file lists are
harvested at some other time. And it's also quite likely that different hubs
will have very different share profiles. It could be interesting to try and
graph everything, but I don't have I<that> kind of free time.


@@ -10,14 +10,14 @@ ncurses interface.
=item Latest version
1.18 ([dllink ncdc-1.18.tar.gz download]
1.18.1 ([dllink ncdc-1.18.1.tar.gz download]
- L<changes|http://dev.yorhel.nl/ncdc/changes>
- L<mirror|https://sourceforge.net/projects/ncdc/files/ncdc/>)
Convenient static binaries for Linux:
L<64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.tar.gz> -
L<32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.tar.gz> -
L<ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.tar.gz>. Check the
L<64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.1.tar.gz> -
L<32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.1.tar.gz> -
L<ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.1.tar.gz>. Check the
L<installation instructions|http://dev.yorhel.nl/ncdc/install> for more info.
=item Development version
@@ -45,6 +45,7 @@ C<adc://dc.blicky.net:2780/> - If the mailing list is too slow for you.
Are available for the following systems:
L<Arch Linux|http://aur.archlinux.org/packages.php?ID=50949> -
L<Fedora|https://apps.fedoraproject.org/packages/ncdc/overview/> -
L<FreeBSD|http://www.freshports.org/net-p2p/ncdc/> -
L<Frugalware|http://frugalware.org/packages/136807> -
L<Gentoo|http://packages.gentoo.org/package/net-p2p/ncdc> -
@@ -55,6 +56,9 @@ L<OpenSUSE|http://packman.links2linux.org/package/ncdc>
I also have a few packages on the L<Open Build
Service|https://build.opensuse.org/package/show?package=ncdc&project=home%3Ayorhel>.
A convenient installer is available for
L<Android|http://code.ivysaur.me/ncdcinstaller.html>.
=back
=cut


@@ -1,3 +1,8 @@
1.18.1 - 2013-10-05
- Fix crash when downloading files from multiple sources
- Use the yxml library to parse files.xml.bz2 files
- Fix various XML conformance bugs in parsing files.xml.bz2 files
1.18 - 2013-09-25
- Add support for segmented downloading
- Support $MyINFO without flags byte on NMDC hubs


@@ -38,11 +38,11 @@ compiling and/or installing it, I also offer statically linked binaries:
=over
=item * L<Linux, 64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.tar.gz>
=item * L<Linux, 64-bit|http://dev.yorhel.nl/download/ncdc-linux-x86_64-1.18.1.tar.gz>
=item * L<Linux, 32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.tar.gz>
=item * L<Linux, 32-bit|http://dev.yorhel.nl/download/ncdc-linux-i486-1.18.1.tar.gz>
=item * L<Linux, ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.tar.gz>
=item * L<Linux, ARM|http://dev.yorhel.nl/download/ncdc-linux-arm-1.18.1.tar.gz>
=back
@@ -58,6 +58,12 @@ architecture, please bug me and I'll see what I can do.
=head1 System-specific instructions
=head2 Android
A L<convenient installer|http://code.ivysaur.me/ncdcinstaller.html> is
available for Android 2.3 and later, which makes use of the static binary.
=head2 Arch Linux
Ncdc is available on L<AUR|https://aur.archlinux.org/packages.php?ID=50949>, to
@@ -70,6 +76,15 @@ favorite, go for the manual approach:
makepkg -si
=head2 Fedora
There's a L<package|https://apps.fedoraproject.org/packages/ncdc/overview/>
available for Fedora.
Alternatively, I also have packages on the L<Open Build
Service|http://software.opensuse.org/download/package?project=home:yorhel&package=ncdc>.
=head2 FreeBSD
@@ -115,9 +130,9 @@ First install some required packages (as root):
Then, fetch the ncdc source tarball, extract and build as follows:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
export PATH="$PATH:/usr/perl5/5.10.0/bin"
./configure --prefix=/usr LDFLAGS='-L/usr/gnu/lib -R/usr/gnu/lib'
make
@@ -165,9 +180,9 @@ required libraries:
Then run the following commands to download and install ncdc:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
./configure --prefix=/usr
make
sudo make install
@@ -209,8 +224,8 @@ website|http://cygwin.com/> and use it to install the following packages:
Then open a Cygwin terminal and run the following commands to download,
compile, and install ncdc:
wget http://dev.yorhel.nl/download/ncdc-1.18.tar.gz
tar -xf ncdc-1.18.tar.gz
cd ncdc-1.18
wget http://dev.yorhel.nl/download/ncdc-1.18.1.tar.gz
tar -xf ncdc-1.18.1.tar.gz
cd ncdc-1.18.1
./configure --prefix=/usr
make install


@@ -41,7 +41,7 @@ notifications for new releases.
=head2 Packages and ports
Ncdu has been packaged for quite a few systems already, here's a list of the ones I am aware of:
Ncdu has been packaged for quite a few systems, here's a list of the ones I am aware of:
L<AgiliaLinux|http://packages.agilialinux.ru/search.php?tag=sys-fs> -
L<AIX|http://www.perzl.org/aix/index.php?n=Main.Ncdu> -
@@ -68,7 +68,7 @@ L<Zenwalk|http://zur.zenwalk.org/view/package/name/ncdu>
Packages for CentOS, RHEL and (open)SUSE can be found on the
L<Open Build Service|https://build.opensuse.org/package/show?package=ncdu&project=utilities>.
Packages for NetBSD, DragonFlyBSD, MirBSD and others and be found on
Packages for NetBSD, DragonFlyBSD, MirBSD and others can be found on
L<pkgsrc|http://pkgsrc.se/sysutils/ncdu>.


@@ -11,14 +11,15 @@ The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
is available under a permissive MIT license. The only two files you need are
L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h>, which can easily be
included and compiled as part of your project. Minimal documentation is
included in yxml.h, more complete documentation is pending.
included and compiled as part of your project. Complete API documentation is
available in L<the manual|http://dev.yorhel.nl/yxml/man>.
The API follows a simple, mostly buffer-less design and only consists of two
functions:
The API follows a simple and mostly buffer-less design, and only consists of
three functions:
void yxml_init(yxml_t *x, char *stack, size_t stacksize);
void yxml_init(yxml_t *x, void *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
Be aware that I<simple> is not necessarily I<easy> or I<convenient>. The API is
relatively low-level and designed to integrate into pretty much any application
@@ -28,11 +29,9 @@ devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.
Yxml is still in a beta stage and hasn't been very thoroughly tested yet. There
are no tarball releases available at the moment. The API and ABI may still
change a bit, so I strongly advise against dynamic linking (I'm not sure if
I'll ever promise a stable ABI, but the API should certainly get stabilized at
some point).
There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.
=head3 Features
@@ -95,11 +94,11 @@ using C<< <!ENTITY> >>.
=back
These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. All of
the above mentioned issues can be fixed on top of yxml (by the application, or
by a wrapper) if strict conformance is required. With the exception of custom
entity references, but I have a simple idea on how to support that in the
future, too.
of yxml, and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.
=head3 Non-features
@@ -136,7 +135,7 @@ implementation is also included as an indication of the "theoretical" minimum.
expat 2.1.0 MIT 162 139 194 432 1.47 1.09
libxml2 2.9.1 MIT 464 328 518 816 2.53 1.75
mxml 2.7 LGPL2+static 32 733 75 832 12.38 7.80
yxml git MIT 5 935 31 384 1.14 0.74
yxml git MIT 5 971 31 416 1.15 0.74
The code for these benchmarks is available in the
L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
@@ -177,7 +176,7 @@ with C<-Os> than with C<-O2>.
expat 2.1.0 MIT 113 314 145 632 1.58 1.20
libxml2 2.9.1 MIT 356 948 412 256 3.01 2.08
mxml 2.7 LGPL2+static 27 725 71 704 11.70 7.44
yxml git MIT 4 835 30 264 1.72 1.05
yxml git MIT 4 955 30 392 1.67 1.02
=head2 Validating vs. non-validating
@@ -204,6 +203,6 @@ It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
schema. Wikipedia L<has more
information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on this.
Using a validating parser for such documents would only introduce bloat and may
Using a validating parser for such documents would only add bloat and may
introduce L<potential security
vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.

dat/yxml-man Symbolic link

@@ -0,0 +1 @@
../../yxml/yxml.pod