fu/FU/Util.pm

package FU::Util 0.1;

use v5.36;
use FU::XS;
use Carp 'confess';
use Exporter 'import';

our @EXPORT_OK = qw/
    json_format json_parse
    utf8_decode uri_escape uri_unescape
    fdpass_send fdpass_recv
/;

sub utf8_decode :prototype($) {
    return if !defined $_[0];
    confess 'Invalid UTF-8' if !utf8::decode($_[0]);
    confess 'Invalid control character' if $_[0] =~ /[\x00-\x08\x0b\x0c\x0e-\x1f]/;
    $_[0]
}

sub uri_escape :prototype($) {
    utf8::encode(local $_ = shift);
    s/([^A-Za-z0-9._~-])/sprintf '%%%02X', ord $1/eg;
    $_;
}

sub uri_unescape :prototype($) {
    return if !defined $_[0];
    utf8::encode(local $_ = shift);
    s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    utf8_decode $_;
}

1;
__END__

=head1 NAME

FU::Util - Miscellaneous utility functions that really should have been part of
a core Perl installation but aren't for some reason because the Perl community
doesn't believe in the concept of a "batteries included" standard library.
</rant>

=head1 SYNOPSIS

  use FU::Util qw/json_format/;

  my $data = json_format [1, 2, 3];

=head1 DESCRIPTION

=head2 JSON parsing & formatting

This module comes with a custom C-based JSON parser and formatter. These
functions conform strictly to L<RFC-8259|https://tools.ietf.org/html/rfc8259>,
non-standard extensions are not supported and never will be. It also happens to
be pretty fast, refer to L<FU::Benchmarks> for some numbers.

JSON booleans are parsed into C<builtin::true> and C<builtin::false>. When
formatting, those builtin constants are the I<only> recognized boolean values -
alternative representations such as C<JSON::PP::true> and C<JSON::PP::false>
are not recognized and attempting to format such values will croak.

JSON numbers that are too large fit into a Perl integer are parsed into a
floating point value instead. This obviously loses precision, but is consistent
with C<JSON.parse()> in JavaScript land - except Perl does support the full
range of a 64bit integer. JSON numbers with a fraction or exponent are also
converted into floating point, which may lose precision as well.
L<Math::BigInt> and L<Math::BigFloat> are not currently supported. Attempting
to format a floating point C<NaN> or C<Inf> results in an error.

=over

=item json_parse($string, %options)

Parse a JSON string and return a Perl value. With the default options, this
function is roughly similar to:

  JSON::PP->new->allow_nonref->core_bools-decode($string);

Croaks on invalid JSON, but the error messages are not super useful.  This
function also throws an error on JSON objects with duplicate keys, which is
consistent with the default behavior of L<Cpanel::JSON::XS> but inconsistent
with other modules.

Supported C<%options>:

=over

=item utf8

Boolean, interpret the input C<$string> as a UTF-8 encoded byte string instead
of a Perl Unicode string.

=item max_depth

Maximum permitted nesting depth of arrays and objects. Defaults to 512.

=item max_size

Throw an error if the JSON data is larger than the given size in bytes.
Defaults to 1 GiB.

=item offset

Takes a reference to a scalar that indicates from which byte offset in
C<$string> to start parsing. On success, the offset is updated to point to the
next non-whitespace character or C<undef> if the string has been fully
consumed.

This option can be used to parse a stream of JSON values:

  my $data = '{"obj":1}{"obj":2}';
  my $offset = 0;
  my $obj1 = json_parse($data, offset => \$offset);
  # $obj1 = {obj=>1};  $offset = 9;
  my $obj2 = json_parse($data, offset => \$offset);
  # $obj2 = {obj=>2};  $offset = undef;

=back


=item json_format($scalar, %options)

Format a Perl value as JSON. With the default options, this function behaves
roughly similar to:

  JSON::PP->new->allow_nonref->core_bools->convert_blessed->encode($scalar);

Some modules escape the slash character in encoded strings to prevent a
potential XSS vulnerability when embedding JSON inside C<< <script> ..
</script> >> tags.  This function does I<not> do that because it might not even
be sufficient. The following is probably an improvement:

  json_format($data) =~ s{</}{<\\/}rg =~ s/<!--/<\\u0021--/rg;

The following C<%options> are supported:

=over

=item canonical

Boolean, write hash keys in deterministic (sorted) order. This option currently
has no effect on tied hashes.

=item pretty

Boolean, format JSON with newlines and indentation for easier reading.  Beauty
is in the eye of the beholder, this option currently follows the convention
used by L<JSON::XS> and others: 3 space indent and one space around the C<:>
separating object keys and values. The exact format might change in later
versions.

=item utf8

Boolean, returns a UTF-8 encoded byte string instead of a Perl Unicode string.

=item max_size

Maximum permitted size, in bytes, of the generated JSON string. Defaults to 1 GiB.

=item max_depth

Maximum permitted nesting depth of Perl values. Defaults to 512.

=back

=back

(Why the hell yet another JSON codec when CPAN is already full of them!? Well,
L<JSON::XS> is pretty cool but isn't going to be updated to support Perl's new
builtin booleans. L<JSON::PP> is slow and while L<Cpanel::JSON::XS> is
perfectly adequate, its codebase is too large and messy for my taste - too many
unnecessary features and C<#ifdef>s to support ancient perls and esoteric
configurations. Still, if you need anything not provided by these functions,
L<JSON::PP> and L<Cpanel::JSON::XS> are perfectly fine alternatives.
L<JSON::SIMD> and L<Mojo::JSON> also look like good and maintained candidates.)


=head2 String Utility Functions

=over

=item utf8_decode($bytes)

Convert a (perl-UTF-8 encoded) byte string into a sanitized perl Unicode
string. The conversion is performed in-place, so the C<$bytes> argument is
turned into a Unicode string. Returns the same string for convenience.

This function throws an error if the input is not valid UTF-8 or if it contains
ASCII control characters - that is, any character between C<0x00> and C<0x1f>
except for tab, newline and carriage return.

(This is a tiny wrapper around C<utf8::decode()> with some extra checks)

=item uri_escape($string)

Takes an Unicode string and returns a percent-encoded ASCII string, suitable
for use in a query parameter.

=item uri_unescape($string)

Takes an Unicode string potentially containing percent-encoding and returns a
decoded Unicode string. Also checks for ASCII control characters as per
C<utf8_decode()>.

=back


=head2 File Descriptor Passing

UNIX sockets (see L<IO::Socket::UNIX>) have the fancy property of letting you
send file descriptors over them, allowing you to pass, for example, a socket
from one process to another. This is a pretty low-level operation and not
something you'll often need, but two functions to use that feature are provided
here anyway because the L<FU> supervisor uses them:

=over

=item fdpass_send($send_fd, $pass_fd, $message)

Send a message and a file descriptor (C<$pass_fd>) over the given socket
(<$send_fd>). C<$message> must not be empty, even if you don't intend to do
anything with it on receipt. Both C<$send_fd> and C<$pass_fd> must be numeric
file descriptors, as obtained by C<fileno()>.

=item ($fd, $message) = fdpass_recv($recv_fd, $max_message_len)

Read a file descriptor and message from the given C<$recv_fd>, which must be
the numeric file descriptor of a socket. This function can be used as a
replacement for C<sysread()>: the returned C<$fd> is undef if no file
descriptor was received. The returned C<$message> is undef on error or an empty
string on EOF.

Like regular socket I/O, a single C<fdpass_send()> message may be split across
multiple C<fdpass_recv()> calls; in that case the C<$fd> will only be received
on the first call.

Don't use this function if the sender may include multiple file descriptors in
a single message, weird things can happen. File descriptors received this way
do not have the C<CLOEXEC> flag and will thus survive a call to C<exec()>.
Refer to L<this wonderful
discussion|https://gist.github.com/kentonv/bc7592af98c68ba2738f4436920868dc>
for more weirdness and edge cases.

=back

See also L<IO::FDPass> for a more portable solution, although that one does not
support passing along regular data.