This commit is contained in:
Yorhel 2024-09-27 11:18:10 +02:00
parent 477101e733
commit b3c9c7407c
27 changed files with 523 additions and 26 deletions


@ -17,6 +17,12 @@ respective issue tracker or send a mail to
# Entries
`2024-08-17` - Young Lee
: hey, thanks for the nice app!
`2024-08-08` - raksO
: Thanks! It is really helpful! ❤️
`2024-07-12` - DIANA PUNKY
: ncdu rocks!


@ -10,9 +10,14 @@ crap I've written over the years. :)
<!-- These announcements are parsed by mkfeed.pl, see that file for formatting -->
## Announcements <a href="/feed.atom"><img src="/img/feed_icon.png" alt="Atom feed"></a>
`2024-07-24` - 2.5 released <!-- tags: ncdu, link: /ncdu -->
`2024-09-27` - ncdu 2.6 released <!-- tags: ncdu, link: /ncdu -->
: Adds a new binary export format that works better with parallel scanning,
offers built-in compression and supports browsing directory trees that are
too large to fit in memory. [Homepage](/ncdu) - [Changelog](/ncdu/changes2).
`2024-07-24` - ncdu 2.5 released <!-- tags: ncdu, link: /ncdu -->
: Adds support for parallel scanning, improves import/export performance and
fixes a number of bugs. [Ncdu homepage](/ncdu) - [Changelog](/ncdu/changes).
fixes a number of bugs. [Homepage](/ncdu) - [Changelog](/ncdu/changes2).
`2024-07-18` - ncdc 1.24.1 released <!-- tags: ncdc, link: /ncdc -->
: Just fixes a build error. [Homepage](/ncdc) - [Changelog](/ncdc/changes).


@ -1,34 +1,49 @@
% NCurses Disk Usage
Ncdu is a disk usage analyzer with an ncurses interface. It is designed to find
space hogs on a remote server where you don't have an entire graphical setup
available, but it is a useful tool even on regular desktop systems. Ncdu aims
to be fast, simple and easy to use, and should be able to run in any minimal
POSIX-like environment with ncurses installed.
Ncdu is a disk usage analyzer with a text-mode user interface. It is designed
to find space hogs on a remote server where you don't have an entire graphical
setup available, but it is a useful tool even on regular desktop systems. Ncdu
aims to be fast, simple, and easy to use, and should be able to run on any
POSIX-like system.
**NEWS FLASH!** Ncdu 2.5 adds support for parallel scanning, but it's not
(yet?) enabled by default. To give it a try, run with `-t8` to scan with 8
threads. If you're running an unusual setup, such as networked storage, odd
filesystems, complex RAID configurations, etc, I'd love to hear about the
performance impact of this new feature. Feedback is welcome on the [issue
tracker](https://code.blicky.net/yorhel/ncdu/issues) or through mail @
[projects@yorhel.nl](mailto:projects@yorhel.nl).
<br>
If you want to run benchmarks, `-0 --quit-after-scan` can be useful to disable
the browser interface, or run with `-0o/dev/null` to benchmark JSON export.
## Notable updates
Parallel scanning
: Ncdu 2.5 adds support for parallel scanning, but it's not enabled by
default. To give it a try, run with `-t8` to scan with 8 threads. If you're
running an unusual setup, such as networked storage, odd filesystems,
complex RAID configurations, etc, I'd love to hear about the performance
impact of this new feature. Feedback is welcome on the [issue
tracker](https://code.blicky.net/yorhel/ncdu/issues) or to
[projects@yorhel.nl](mailto:projects@yorhel.nl).[^1]
Binary export
: Ncdu 2.6 adds a new binary export format that works better with parallel
scanning, offers built-in compression and supports browsing directory
trees that are too large to fit in memory. To give it a try, use the `-O`
flag instead of `-o`.
Colors
: Ncdu has had color support since version 1.13. Colors were enabled by
default in 1.17 and 2.0, and then later disabled again in 1.20 and 2.4
because the text was not legible in all terminal configurations.
If you do prefer the colors, add `--color=dark` to your [config
file](/ncdu/man#configuration). Maybe at some point in the future we'll
have colors that *are* readable in every setup.
## Download <a href="/ncdu/feed.atom"><img src="/img/feed_icon.png" alt="Atom feed"></a>
Static binaries
: Convenient static binaries for Linux. Download, extract and run; no
compilation or installation necessary:
[x86](/download/ncdu-2.5-linux-x86.tar.gz) -
[x86_64](/download/ncdu-2.5-linux-x86_64.tar.gz) -
[ARM](/download/ncdu-2.5-linux-arm.tar.gz) -
[AArch64](/download/ncdu-2.5-linux-aarch64.tar.gz).
[x86](/download/ncdu-2.6-linux-x86.tar.gz) -
[x86_64](/download/ncdu-2.6-linux-x86_64.tar.gz) -
[ARM](/download/ncdu-2.6-linux-arm.tar.gz) -
[AArch64](/download/ncdu-2.6-linux-aarch64.tar.gz).
Zig version (stable)
: 2.5 (2024-07-24 - [ncdu-2.5.tar.gz](/download/ncdu-2.5.tar.gz) - [changes](/ncdu/changes2))
: 2.6 (2024-09-27 - [ncdu-2.6.tar.gz](/download/ncdu-2.6.tar.gz) - [changes](/ncdu/changes2))
Requires Zig 0.12 or 0.13.
@ -106,3 +121,8 @@ There's no shortage of alternatives to ncdu nowadays. In no particular order:
- [K4DirStat](https://github.com/jeromerobert/k4dirstat) - Qt, treemap.
- [xdiskusage](http://xdiskusage.sourceforge.net/) - FLTK, with a treemap display.
- [fsv](http://fsv.sourceforge.net/) - 3D visualization.
[^1]: If you want to run benchmarks, `-0 --quit-after-scan` can be useful to
disable the browser interface, or run with `-0o/dev/null` to benchmark JSON
export.

dat/ncdu/binfmt.md (new normal file, 364 lines)

@ -0,0 +1,364 @@
% Ncdu Binary Export File Format
This document describes the new binary file format added in ncdu 2.6. This
format offers the following advantages compared to the [JSON export file
format](/ncdu/jsonfmt):
- Support for exporting data from a multithreaded filesystem scan with minimal
thread-local buffering and minimal synchronisation between threads.
- Support for reading the directory tree in depth-first, breadth-first and mixed
iteration order, thus permitting interactive browsing through the tree
without reading the entire file.
- Cumulative directory sizes are included in the exported data, allowing
readers to display this data without walking through the entire tree.
- Built-in support for compression.
These features come at the cost of increased complexity. The JSON format is
generally easier to work with and therefore still the recommended approach for
external tooling to interact with ncdu's export/import functionality.
A binary export can be created with the `-O` option to ncdu. It is also
possible to convert to and from the JSON format:
```
ncdu -O export.ncdu / # Scan root, write to 'export.ncdu'
ncdu -f in.json -O out.ncdu # Convert from JSON to binary
ncdu -f in.ncdu -o out.json # Convert from binary to JSON
```
# Format description
## File signature
An exported file starts with the following file signature (in hex):
```
bf 6e 63 64 75 45 58 31
```
Formatted as a C string, that is `"\xbfncduEX1"`.
Non-backwards compatible changes to the export format should use a different
file signature. N.B. A different compression algorithm is a non-backwards
compatible change.
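As an illustration, a reader can verify the signature before parsing anything
else. This is a Python sketch, not part of the format; the names are mine:

```python
# The 8-byte file signature: bf 6e 63 64 75 45 58 31.
SIGNATURE = b"\xbfncduEX1"

def has_ncdu_signature(data: bytes) -> bool:
    """Return True if `data` starts with the binary export signature."""
    return data[:8] == SIGNATURE
```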
## Block format
The file signature is followed by one or more *blocks*. A block has the
following format:
------- ------- ---------------
TypeLen 4 bytes Big-endian type + length of this block
Content n bytes n = Length - 8
TypeLen 4 bytes Repeat of TypeLen
------- ------- ---------------
The high 4 bits of the *TypeLen* indicate the block type; the lower 28 bits
encode the length of the block, including the header and footer.
The *TypeLen* is repeated at the end of the block to allow for reading the file
in both forwards and backwards direction.
The block type determines how the *Content* should be interpreted. There are
currently two block types:
Type Meaning
---- --
0 Data block
1 Index block
---- --
Parsers should ignore blocks with an unknown type.
A valid file must have at least one data block and exactly one index block. The
index block must be the last block in the file.
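To make the layout concrete, here is a sketch of a forward block walk in
Python, validating the repeated *TypeLen* footer as it goes. The function name
is mine and error handling is minimal; this is an illustration, not a
reference implementation:

```python
import struct

def iter_blocks(data: bytes):
    """Yield (block_type, content) for each block after the 8-byte signature."""
    pos = 8                            # skip the file signature
    while pos < len(data):
        (typelen,) = struct.unpack_from(">I", data, pos)
        btype = typelen >> 28          # high 4 bits: block type
        blen = typelen & 0x0FFFFFFF    # low 28 bits: total block length
        if blen < 8 or pos + blen > len(data):
            raise ValueError("block length out of bounds")
        (footer,) = struct.unpack_from(">I", data, pos + blen - 4)
        if footer != typelen:
            raise ValueError("TypeLen footer does not match header")
        yield btype, data[pos + 4 : pos + blen - 4]
        pos += blen
```

Because the *TypeLen* is repeated at the end of each block, the same walk can
be done backwards from the end of the file, which is how a reader can locate
the index block without scanning everything.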
## Data blocks
Data blocks have the following contents:
---------------- ------- ------
Number 4 bytes Big-endian unsigned block number
Compressed\_data n bytes
---------------- ------- ------
Every data block must have a unique number, starting from zero and ideally (but
not necessarily) allocated without gaps. Data blocks may appear in a different
order than their numbering.
Data is compressed with [Zstandard](http://www.zstd.net/). Data must be
compressed in a single frame and the uncompressed size must be available
through `ZSTD_getFrameContentSize()`, so that readers can pre-allocate a
properly-sized buffer for decompression.
The total length of a data block, including block header and footer, must not
exceed 16 MiB minus one byte. The total size of the decompressed data must also
not exceed 16 MiB minus one byte.
The decompressed data consists of a stream of one or more *Items* (see below).
## Index block
The index block provides a lookup table for data blocks and a reference to the
root item:
--------------- ----------
Block\_pointers n\*8 bytes
Root\_itemref 8 bytes
--------------- ----------
*Block\_pointers* is an array containing an 8-byte pointer for each data block
in the file. Pointers are indexed by block number, so the first pointer is for
block number 0, the second pointer for block number 1, etc. Each pointer is
interpreted as a 64-bit big-endian unsigned integer. The higher 40 bits indicate
the byte offset of the data block header, relative to the start of the file.
The lower 24 bits indicate the block length and must be equivalent to the
length in the *TypeLen* of the corresponding data block. An all-zero value
indicates that there is no block with this number in the file.
The last 8 bytes of the index block represent an unsigned big-endian integer
that refers to the root item of the directory tree. See *Itemref* below.
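A sketch of decoding the index block content follows directly from the layout
above (Python, names are mine):

```python
import struct

def parse_index(content: bytes):
    """Split index block content into block pointers and the root itemref.

    Each pointer packs a 40-bit file offset (high bits) and a 24-bit block
    length (low bits); an all-zero entry means there is no such block.
    """
    n = (len(content) - 8) // 8        # number of Block_pointers entries
    vals = struct.unpack(f">{n + 1}Q", content)
    pointers = [None if v == 0 else (v >> 24, v & 0xFFFFFF) for v in vals[:n]]
    root_itemref = vals[n]
    return pointers, root_itemref
```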
## Itemref
An *Itemref* encodes a reference to an *Item*. There are two types:
Absolute
: An absolute *Itemref* is a 64-bit unsigned integer that encodes a block
number in the higher 40 bits and a byte offset of the start of the item
within the block in the lower 24 bits. Every item in the file has exactly
one absolute *Itemref* value. The *Root\_itemref* in the index block must
be absolute.
Relative
: A relative *Itemref* is a negative integer that represents the byte offset
of the referenced item relative to the start of the item containing the
reference. Relative references can only reference a previously written item
within the same block.
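The two variants can be resolved with a few bit operations. A Python sketch,
where `cur_block` and `cur_item_offset` identify the start of the item that
contains the reference:

```python
def resolve_itemref(ref: int, cur_block: int, cur_item_offset: int):
    """Resolve an itemref to (block_number, offset_in_block).

    Absolute refs (>= 0) pack the block number in the high 40 bits and the
    byte offset in the low 24 bits; relative refs are negative offsets from
    the start of the containing item, within the same block.
    """
    if ref >= 0:
        return ref >> 24, ref & 0xFFFFFF
    return cur_block, cur_item_offset + ref   # ref is negative here
```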
## Item
An *Item* represents a file or directory entry, encoded as a
[CBOR](https://cbor.io/) map. Key/value pairs may be encoded in any order and
unknown keys are ignored. Summary of keys recognized by ncdu:
Key Field Value
---- -------- --------
0 type i32
1 name String
2 prev Itemref
3 asize u64
4 dsize u64
5 dev u64
6 rderr bool
7 cumasize u64
8 cumdsize u64
9 shrasize u64
10 shrdsize u64
11 items u64
12 sub Itemref
13 ino u64
14 nlink u32
15 uid u32
16 gid u32
17 mode u16
18 mtime u64
---- -------- --------
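The CBOR subset needed here is small: unsigned and negative integers, byte and
text strings, booleans, and definite-length maps. A minimal Python decoder
sketch for a single item map; it makes no claim to completeness and omits the
bounds checks that a real reader needs (see the security considerations):

```python
def _uint(buf, i):
    """Decode a CBOR unsigned argument; return (value, next_index)."""
    ai = buf[i] & 0x1F
    i += 1
    if ai < 24:
        return ai, i
    size = 1 << (ai - 24)              # 24->1, 25->2, 26->4, 27->8 bytes
    return int.from_bytes(buf[i : i + size], "big"), i + size

def decode_item(buf, i=0):
    """Decode one CBOR item map at offset i; return (dict, next_index)."""
    if buf[i] >> 5 != 5:
        raise ValueError("expected a CBOR map")
    npairs, i = _uint(buf, i)
    item = {}
    for _ in range(npairs):
        key, i = _uint(buf, i)         # keys are small unsigned ints
        mt = buf[i] >> 5
        if mt == 0:                    # unsigned integer
            val, i = _uint(buf, i)
        elif mt == 1:                  # negative integer (type, relative itemref)
            n, i = _uint(buf, i)
            val = -1 - n
        elif mt in (2, 3):             # byte / text string (name field)
            n, i = _uint(buf, i)
            val = buf[i : i + n]
            if mt == 3:
                val = val.decode("utf-8")
            i += n
        elif mt == 7:                  # simple values: false/true (rderr)
            val = (buf[i] & 0x1F) == 21
            i += 1
        else:
            raise ValueError("unsupported CBOR type")
        item[key] = val
    return item, i
```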
**Common fields for all items**
type
: Mandatory. A negative value indicates that the item has been excluded
from the size calculations for some reason; non-negative values indicate
different item types:
--- --
-4 Excluded with `--exclude-kernfs`
-3 Excluded with `-x`
-2 Excluded by pattern match
-1 Error while reading this entry
0 Directory
1 Regular file
2 Non-regular file (symlink, device, etc)
3 Hardlink candidate (i.e. stat().st_nlink > 1)
--- --
Unrecognized negative values are treated as equivalent to -2; unrecognized
positive values are treated as a non-regular file (type=2).
name
: Mandatory. Ncdu always encodes the name as a byte string, but also accepts
UTF-8 text strings. Ncdu does not support indefinite-length CBOR strings;
the name must be encoded with a known length.
prev
: Reference to the previous item in the same directory. This field must be
absent if this is the first item in a directory. This field forms a
singly-linked list of all items in a directory.
**Fields for type >= 0**
asize
: Apparent size of this file/directory as reported by `stat().st_size`.
Optional, defaults to 0.
dsize
: Disk usage of this file/directory as reported by `stat().st_blocks`
multiplied by the block size. Optional, defaults to 0.
**Fields for type = 0**
dev
: Device number. Optional, defaults to the same device number as the parent
directory, or 0 if this is the root item.
rderr
: Whether an error occurred while reading this directory. When *true*, an
error occurred while reading the directory list itself and the list may
therefore be incomplete. When *false*, an error occurred while reading a
child item. This implies that somewhere in this sub-tree there must be at
least one item of `type=-1` or a directory with `rderr=true`.
cumasize
: Cumulative apparent size of this directory. Optional, defaults to 0.
cumdsize
: Cumulative disk usage of this directory. Optional, defaults to 0.
shrasize
: Shared apparent size. Optional, defaults to 0.
shrdsize
: Shared disk usage. Optional, defaults to 0.
items
: Cumulative number of items in this directory. Ncdu currently caps this
number to `2^32-1` when reading, but supports larger numbers when
exporting. Optional, defaults to 0.
sub
: Reference to the last item in this directory, or absent if the directory is
empty.
**Fields for type=3**
ino
: Inode number.
nlink
: Number of links to this inode.
**Extended information**
These fields are only exported when the `-e` flag is passed to ncdu. They are
relevant to all items with type >= 0.
uid
: User id.
gid
: Group id.
mode
: File mode.
mtime
: Last modification time as a UNIX timestamp.
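Putting *sub* and *prev* together: a directory's children form a
reverse-linked list starting at the directory's *sub* reference. A Python
sketch, assuming items have already been resolved and decoded into dicts keyed
by the integer field numbers above (`items` maps itemref to item):

```python
def list_children(items, directory):
    """Return a directory's children, first to last, via sub/prev."""
    children = []
    ref = directory.get(12)            # sub: last item in the directory
    while ref is not None:
        item = items[ref]
        children.append(item)
        ref = item.get(2)              # prev: next-older sibling, or absent
    children.reverse()                 # the chain runs last-to-first
    return children
```

A real reader must also guard against itemref loops here, as noted under
security considerations.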
# Limitations
Compressed data block size
: 16 MiB minus 1 byte. This limit comes from *Block\_pointers* in the index
block using 24 bits to encode the block length.
Uncompressed data block size
: 16 MiB minus 1 byte. This limit comes from *Itemref* encoding item offset
in 24 bits.
Largest data block number
: 33,554,428. The size of the index block is limited by the 28-bit length in
the block's *TypeLen* header, which limits the number of *Block\_pointers*
it can hold to `((2^28 - 1) - 16) / 8` (subtract one to get the maximum
block number because counting starts at 0).
Compressed data size
: Excluding block overhead, the total amount of compressed data is limited to
about 1 TiB. This is limited by *Block\_pointers* using 40 bits to encode
the data block offset within the file.
Uncompressed data size
: Limited by either the maximum number of data blocks or the compressed data
size, depending on compression ratio and the chosen data block size.
Assuming the number of data blocks is the limit, about 512 TiB of
uncompressed data can be stored with the maximum data block size of 16 MiB.
Ncdu's adaptive block size selection has a limit of about 40 TiB.
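The maximum data block number above can be reproduced directly from the index
block geometry (Python, mirroring the arithmetic in the text):

```python
max_index_block_len = 2**28 - 1      # 28-bit length field in TypeLen
overhead = 4 + 4 + 8                 # TypeLen header + footer + Root_itemref
max_pointers = (max_index_block_len - overhead) // 8
max_block_number = max_pointers - 1  # block numbering starts at 0
# max_block_number == 33_554_428
```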
The real question is how many items an export can hold with the above limits in
place. This will heavily depend on the average encoded item size and the
compression ratio, both of which can vary wildly from one directory structure
to another.
I've had one report with ~1.4 billion files resulting in a ~21 GiB file.
Extrapolating from that and assuming the compressed data size is the limiting
factor, this format could hold ~68 billion items. Increasing the compression
level and using larger data block sizes to further improve compression ratio,
one could perhaps store about 100 billion items. On the one hand, that sounds
like an insane number nobody will ever reach. On the other hand, a decade ago I
couldn't imagine people having more than 100 million files, yet here we are.
On the upside, all the major limitations can be attributed to the maximum size
and format of the index block. It's possible to implement an alternative index
format in the future that can be automatically switched to whenever any of the
above limits are exceeded, thus providing a seamless upgrade path without
breaking compatibility for the existing exports that do fit within the limits.
# Security considerations
Directory trees can get very large and you can easily exceed available RAM when
attempting to read everything into memory. Reading only small parts of the tree
can help cut down on memory use, but it's still a good idea to implement limits
or detect and handle when you're about to run out of memory.
There are several places in the format where byte offsets are used to refer to
blocks or items. These offsets must be validated to ensure that they stay within
the bounds of the respective file or block. In particular, itemref offsets
could potentially refer to memory before (in the case of a relative itemref) or
after (absolute itemref) the decompressed data, and pointers in the index block
could refer to offsets beyond the end of the file.
The CBOR encoding used for items is self-delimiting, but a badly formatted item
may not be properly terminated before the end of the decompressed block
contents. Readers should take care that this does not lead to reading past the
allocated buffer.
In a well-formed directory tree, each item is referenced exactly once by either
the *Root\_itemref* or a *prev* or *sub* field. However, it is also possible to
construct a file where this is not the case, and implementers should be aware
that itemref loops are possible.
# Implementation notes
Data block size
: It is up to the file writer to choose a suitable data block size. This is a
compromise between compression efficiency and memory use: larger blocks
compress better but also require more memory, both for reading and writing.
Ncdu currently keeps 8 uncompressed blocks in memory when reading and one
block per thread when writing. Ncdu starts with blocks of 64 KiB, but
gradually increases the size to 2 MiB for very large directory trees in
order to not bloat the index size too much and to prevent running into the
maximum data block number limit.
Testing
: If you're implementing a custom writer for this format, make sure to check
out the
[ncdubinexp.pl](https://code.blicky.net/yorhel/ncdu/src/ncdubinexp.pl)
script in the git repository. Ncdu only reads the parts of a file that it
actually needs, so passing a file to ncdu is no guarantee that it is
well-formed. The ncdubinexp.pl script is more thorough in validating file
correctness but misses a few invariants that ncdu does check for, so
the best way to verify a file is to run both:
```
ncdu -f file.ncdu -o/dev/null # Read entire tree and export to /dev/null
ncdubinexp.pl <file.ncdu # Read and verify the entire file
```