Commit graph

860 commits

Author SHA1 Message Date
David Gibson
de93acbe70 tcp: Correct function comments for address types
A number of functions describe themselves as taking a pointer to 'sin_addr
or sin6_addr'.  Those are field names, not type names.  Replace them with
the correct type names, in_addr or in6_addr.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:30 +01:00
David Gibson
f7653a1446 Use endian-safer typing in struct tap4_l4_t
We recently converted to using struct in_addr rather than bare in_addr_t
or uint32_t to represent IPv4 addresses in network order.  This makes it
harder forget to apply the correct endian conversions.

We omitted the IPv4 addresses stored in struct tap4_l4_t, however.  Convert
those as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:26 +01:00
David Gibson
7c7b68dbe0 Use typing to reduce chances of IPv4 endianness errors
We recently corrected some errors handling the endianness of IPv4
addresses.  These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.

To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses.  This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.

We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values.  This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:24 +01:00
David Gibson
dd3470d9a9 Use IPV4_IS_LOOPBACK more widely
This macro checks if an IPv4 address is in the loopback network
(127.0.0.0/8).  There are two places where we open code an identical check,
use the macro instead.

There are also a number of places we specifically exclude the loopback
address (127.0.0.1), but we should actually be excluding anything in the
loopback network.  Change those sites to use the macro as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:21 +01:00
David Gibson
dd09cceaee Minor improvements to IPv4 netmask handling
There are several minor problems with our parsing of IPv4 netmasks (-n).

First, we don't reject nonsensical netmasks like 0.255.0.255.  Address this
structurally by using prefix length instead of netmask as the primary
variable, only converting (and validating) when we need to.  This has the
added benefit of making some things more uniform with the IPv6 path.

Second, when the user specifies a prefix length, we truncate the output
from strtol() to an integer, which means we would treat -n 4294967320 as
valid (equivalent to 24).  Fix types to check for this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:19 +01:00
David Gibson
2b793d94ca Correct some missing endian conversions of IPv4 addresses
The INADDR_LOOPBACK constant is in host endianness, and similarly the
IN_MULTICAST macro expects a host endian address.  However, there are some
places in passt where we use those with network endian values.  This means
that passt will incorrectly allow you to set 127.0.0.1 or a multicast
address as the guest address or DNS forwarding address.  Add the necessary
conversions to correct this.

INADDR_ANY and INADDR_BROADCAST logically behave the same way, although
because they're palindromes it doesn't have an effect in practice.  Change
them to be logically correct while we're there, though.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:03:58 +01:00
Stefano Brivio
40fc9e6e7b test: Add memory/passt test cases
These show a summary of memory usage in kernel and userspace with
different port forwarding configurations, details of userspace usage
using 'nm' (passt only uses statically allocated memory), and details
of kernel memory from slab reporting facilities.

This adds a new test image, mbuto.mem.img, with harcoded IPv4 and
IPv6 addresses and routes, and just the tools we need to start and
stop passt, to report from /proc/slabinfo, /proc/meminfo, and to
print and parse symbol sizes using nm(1).

passt can't pivot_root() for sandboxing purposes on ramfs, so we need
to create another filesystem and chroot into it, first.

We don't want to use pane context functions, as we're checking memory
usage for sockets: resort to screen-scraping.

Configure a dummy interface to provide passt with an appearance of
working IPv4 and IPv6 connectivity, contributed by David.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:27 +01:00
Stefano Brivio
ce2a0a5bb4 test/lib: Add "td" directive, handled by table_value()
This can be used for generic cell values with an arbitrary scale.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:18 +01:00
Stefano Brivio
bfd311aec7 test/lib/perf_report: Use own flag to track initialisation
Instead of just disabling performance reports if running in demo
mode. This allows us to use table functions outside of performance
reports.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:17 +01:00
Stefano Brivio
2d4468ebb7 tap: Support for detection of existing sockets on ramfs
On ramfs, connecting to a non-existent UNIX domain socket yields
EACCESS, instead of ENOENT. This is visible if we use passt directly
on rootfs (a ramfs instance) from an initramfs image.

It's probably wrong for ramfs to return EACCES, but given the
simplicity of the filesystem, I doubt we should try to fix it there
at the possible cost of added complexity.

Also, this whole beauty should go away once qrap-less usage is
established, so just accept EACCES as indication that a conflicting
socket does not, in fact, exist.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:09 +01:00
Stefano Brivio
e76e65a36e test/lib: Move screen-scraping setup and layout functions to _ugly files
I'm going to add yet another one of those, for which I have no quick
solution. It's a regression in some sense, but at least if we make
this regression more observable and defined, it should be easier to
find a comprehensive solution later, within this or another testing
framework.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:05 +01:00
Stefano Brivio
ea5e046646 README: Add Podman, vhost-user links, and links to Bugzilla queries
Unfortunately Bugzilla doesn't enable sharing of queries to
unregistered users:
  https://bugzilla.mozilla.org/show_bug.cgi?id=400063

...but we can still use ugly search links.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-27 22:41:37 +02:00
Stefano Brivio
10cabe3dbf passt.1: Fix typo: "addressses", reported by Lintian
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-27 14:28:00 +02:00
Stefano Brivio
f212044940 icmp: Don't discard first reply sequence for a given echo ID
In pasta mode, ICMP and ICMPv6 echo sockets relay back to us any
reply we send: we're on the same host as the target, after all. We
discard them by comparing the last sequence we sent with the sequence
we receive.

However, on the first reply for a given identifier, the sequence
might be zero, depending on the implementation of ping(8): we need
another value to indicate we haven't sent any sequence number, yet.

Use -1 as initialiser in the echo identifier map.

This is visible with Busybox's ping, and was reported by Paul on the
integration at https://github.com/containers/podman/pull/16141, with:

  $ podman run --net=pasta alpine ping -c 2 192.168.188.1

...where only the second reply would be routed back.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 33482d5bf2 ("passt: Add PASTA mode, major rework")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:18:21 +02:00
Stefano Brivio
b062ee47d1 icmp: Add debugging messages for handled replies and requests
...instead of just reporting errors.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:18:18 +02:00
Stefano Brivio
947d756747 tap: Trace received (outbound) ICMP packets in debug mode, too
This only worked for ICMPv6: ICMP packets have no TCP-style header,
so they are handled as a special case before packet sequences are
formed, and the call to tap_packet_debug() was missing.

Fixes: bb70811183 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:18:16 +02:00
Stefano Brivio
7402951658 conf, passt.1: Don't imply --foreground with --debug
Having -f implied by -d (and --trace) usually saves some typing, but
debug mode in background (with a log file) is quite useful if pasta
is started by Podman, and is probably going to be handy for passt
with libvirt later, too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:17:56 +02:00
Stefano Brivio
e4df8b0844 test/run: Temporarily disable distribution tests
They're too slow to cope with current release cycles, and they
haven't found bugs in months, also because clang-tidy and cppcheck
would find most of them earlier.

Disable them for the moment. We should pre-install gcc and make in
non-x86 images, as those run on my test machine with qemu TCG, and
that's the real slow-down here. Then we can re-enable them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-26 07:03:56 +02:00
Stefano Brivio
fb820ebb2e hooks: Temporarily disable demo generation in pre-push
The out-of-tree Podman patch needs to be rebased every second week or
so, and I'm currently trying to get that upstream:
  https://github.com/containers/podman/pull/16141

Disable demo generation for the moment, so that I avoid wasting time
with those rebases. We'll re-enable it later.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-26 06:56:25 +02:00
Stefano Brivio
d472476caa test: Add log file tests for pasta plus corresponding layout and setup
To test log files on a tmpfs mount, we need to unshare the mount
namespace, which means using a context for the passt pane is not
really practical at the moment, as we can't open a shell there, so
we would have to encapsulate all the commands under 'unshare -rUm',
plus the "inner" pasta command, running in turn a tcp_rr server.

It might be worth fixing this by e.g. detecting we are trying to
spawn an interactive shell and adding a special path in the context
setup with some form of stdin redirection -- I'm not sure it's doable
though.

For this reason, add a new layout, using a context only for the host
pane, while keeping the old command dispatch mechanism for the passt
pane.

We also need a new setup function that doesn't start pasta: we want
to start and restart it with different options.

Further, we need a 'pint' directive, to send an interrupt to the
passt pane: add that in lib/test.

All the tests before the one involving tmpfs and a detached mount
namespace were also tested with the context mechanism. To make an
eventual conversion easier, pass tcp_crr directly as a command on
pasta's command line where feasible.

While at it, fix the comment to the teardown_pasta() function.

The new test set can be semi-conveniently run as:

  ./run pasta_options/log_to_file

and it checks basic log creation, size of the log file after flooding
it with debug entries, rotations, and basic consistency after
rotations, on both an existing filesystem and a tmpfs, chosen as
it doesn't support collapsing data ranges via fallocate(), hence
triggering the fall-back mechanism for logging rotation.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-26 06:28:41 +02:00
Stefano Brivio
e67039f712 checksum: Fix calculation for ICMP checksum on IPv4
We need to zero out the checksum field before calculating the
checksum, of course. I have no idea how this passed the "icmp" test
set, looking into it.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 67ab617172 ("Add csum_icmp4() helper for calculating ICMP checksums")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-26 06:28:06 +02:00
Stefano Brivio
c11277b94f conf: Don't pass leading ~ to parse_port_range() on exclusions
Commit 84fec4e998 ("Clean up parsing of port ranges") drops the
strspn() call before the parsing of excluded port ranges, because now
we're checking against any stray characters at every step.

However, that also has the effect of passing ~ as first character to
the new parse_port_range(), which makes no sense: we already checked
that ~ is the first character before the call, so skip it.

Alona reported this output:
  Invalid port specifier ~15000,~15001,~15006,~15008,~15020,~15021,~15090

while the whole specifier is indeed valid.

Reported-by: Alona Paz <alkaplan@redhat.com>
Fixes: 84fec4e998 ("Clean up parsing of port ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-24 14:37:22 +02:00
Stefano Brivio
b68da100ba util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
...instead of one fourth. On the main() -> conf() -> nl_sock_init()
call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
everything: nl_sock_init() is effectively part of main(), after
commit 3e2eb4337b ("conf: Bind inbound ports with
CAP_NET_BIND_SERVICE before isolate_user()").

This means we exceed the maximum stack size, and we get SIGSEGV,
under any condition, at start time, as reported by Andrea on a recent
build for CentOS Stream 9.

The calculation of NS_FN_STACK_SIZE, which is the stack size we
reserve for clones, was previously obtained by dividing the maximum
stack size by two, to avoid an explicit check on architecture (on
PA-RISC, also known as hppa, the stack grows up, so we point the
clone to the middle of this area), and then further divided by two
to allow for any additional usage in the caller.

Well, if there are essentially no function calls anymore, this is
not enough. Divide it by eight, which is anyway much more than
possibly needed by any clone()d callee.

I think this is robust, so it's a fix in some sense. Strictly
speaking, though, we have no formal guarantees that this isn't
either too little or too much.

What we should do, eventually: check cloned() callees, there are just
thirteen of them at the moment. Note down any stack usage (they are
mostly small helpers), bonus points for an automated way at build
time, quadruple that or so, to allow for extreme clumsiness, and use
as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.

Reported-by: Andrea Bolognani <abologna@redhat.com>
Fixes: 3e2eb4337b ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-22 08:46:57 +02:00
Andrea Bolognani
5715a297a7 Add git-publish configuration file
Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-22 03:45:50 +02:00
Andrea Bolognani
b944ca1855 qrap: Support JSON syntax for -device
Starting with version 8.1.0, libvirt uses JSON syntax when
generating the arguments to -device, so they will now look like

  {"driver":"virtio-scsi-pci","bus":"pci.3","addr":"0x0"}

instead of

  virtio-scsi-pci,bus=pci.3,addr=0x0

qrap needs to parse these arguments and extract the bus number
in order to figure out what address to use for the virtio-net
device it adds, and the libvirt change described above has
broken this parsing logic.

Tweak the code so that both styles are accepted and handled
correctly.

Note that, when JSON is in use, qrap needs to generate its own
command line options in that format as well or things will not
work as expected.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-21 11:43:45 +02:00
David Gibson
c6845f60a0 dhcp: Use tap_udp4_send() helper in dhcp()
The IPv4 specific dhcp() manually constructs L2 and IP headers to send its
DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the
tap_udp6_send() helper.  Now that we've broaded the parameters to
tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:35:00 +02:00
David Gibson
2dbc622f54 tap: Split tap_ip4_send() into UDP and ICMP variants
tap_ip4_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation.  By using a suitable
helper we can split it into tap_udp4_send() and tap_icmp4_send() functions
without greatly increasing the code size, this removing that layering
violation.

We make some small changes to the interface while there.  In both cases
we make the destination IPv4 address a parameter, which will be useful
later.  For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header.  For the ICMP variant we pass in the ICMP
header as before.  The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:56 +02:00
David Gibson
db07804d26 ndp: Use tap_icmp6_send() helper
We send ICMPv6 packets to the guest from both icmp.c and from ndp.c.  The
case in ndp() manually constructs L2 and IPv6 headers, unlike the version
in icmp.c which uses the tap_icmp6_send() helper from tap.c  Now that we've
broaded the parameters of tap_icmp6_send() we can use it in ndp() as well
saving some duplicated logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:53 +02:00
David Gibson
cb1edae3b5 ndp: Remove unneeded eh_source parameter
ndp() takes a parameter giving the ethernet source address of the packet
it is to respond to, which it uses to determine the destination address to
send the reply packet to.

This is not necessary, because the address will always be the guest's
MAC address.  Even if the guest has just changed MAC address, then either
tap_handler_passt() or tap_handler_pasta() - which are the only call paths
leading to ndp() will have updated c->mac_guest with the new value.

So, remove the parameter, and just use c->mac_guest, making it more
consistent with other paths where we construct packets to send inwards.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:51 +02:00
David Gibson
9d8dd8b6f4 tap: Split tap_ip6_send() into UDP and ICMP variants
tap_ip6_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation.  By using a suitable
helper we can split it into tap_udp6_send() and tap_icmp6_send() functions
without greatly increasing the code size, this removing that layering
violation.

We make some small changes to the interface while there.  In both cases
we make the destination IPv6 address a parameter, which will be useful
later.  For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header.  For the ICMP variant we pass in the ICMP
header as before.  The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:48 +02:00
David Gibson
f616ca231e Split tap_ip_send() into IPv4 and IPv6 specific functions
The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and
it turns out that every caller (statically) knows if it is using IPv4 or
IPv6.  So split into separate tap_ip4_send() and tap_ip6_send() functions.
Use a new tap_l2_hdr() function for the very small common part.

While we're there, make some minor cleanups:
  - We were double writing some fields in the IPv6 header, so that it
    temporary matched the pseudo-header for checksum calculation.  With
    recent checksum reworks, this isn't neccessary any more.
  - We don't use any IPv4 header options, so use some sizeof() constructs
    instead of some open coded values for header length.
  - The comment used to say that the flow label was for TCP over IPv6, but
    in fact the only thing we used it for was DHCPv6 over UDP traffic

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:45 +02:00
David Gibson
fb5d1c5d7d tap: Remove unhelpeful vnet_pre optimization from tap_send()
Callers of tap_send() can optionally use a small optimization by adding
extra space for the 4 byte length header used on the qemu socket interface.
tap_ip_send() is currently the only user of this, but this is used only
for "slow path" ICMP and DHCP packets, so there's not a lot of value to
the optimization.

Worse, having the two paths here complicates the interface and makes future
cleanups difficult, so just remove it.  I have some plans to bring back the
optimization in a more general way in future, but for now it's just in the
way.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:43 +02:00
David Gibson
f72b63e92f Remove support for TCP packets from tap_ip_send()
tap_ip_send() is never used for TCP packets, we're unlikely to use it for
that in future, and the handling of TCP packets makes other cleanups
unnecessarily awkward.  Remove it.

This is the only user of csum_tcp4(), so we can remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:40 +02:00
David Gibson
a2eb2d310a Add helpers for normal inbound packet destination addresses
tap_ip_send() doesn't take a destination address, because it's specifically
for inbound packets, and the IP addresses of the guest/namespace are
already known to us.  Rather than open-coding this destination address
logic, make helper functions for it which will enable some later cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:38 +02:00
David Gibson
3d8ccb44a6 Add csum_ip4_header() helper to calculate IPv4 header checksums
We calculate IPv4 header checksums in at least two places, in dhcp() and
in tap_ip_send.  Add a helper to handle this calculation in both places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:34 +02:00
David Gibson
bd4be308fc Add csum_udp4() helper for calculating UDP over IPv4 checksums
At least two places in passt fill in UDP over IPv4 checksums, although
since UDP checksums are optional with IPv4 that just amounts to storing
a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in
dhcp()).  For consistency, add a helper for this "calculation".

Just for the heck of it, add the option (compile time disabled for now) to
calculate real UDP checksums.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:32 +02:00
David Gibson
6905ac75ec Add csum_udp6() helper for calculating UDP over IPv6 checksums
Add a helper for calculating UDP checksums when used over IPv6
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed.  It also allows the UDP header and payload to
be in separate buffers, although we don't use this yet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:29 +02:00
David Gibson
67ab617172 Add csum_icmp4() helper for calculating ICMP checksums
Although tap_ip_send() is currently the only place calculating ICMP
checksums, create a helper function for symmetry with ICMPv6.  For
future flexibility it allows the ICMPv6 header and payload to be in
separate buffers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:26 +02:00
David Gibson
7abd2b0d72 Add csum_icmp6() helper for calculating ICMPv6 checksums
At least two places in passt calculate ICMPv6 checksums, ndp() and
tap_ip_send().  Add a helper to handle this calculation in both places.
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed.  It also allows the ICMPv6 header and payload to
be in separate buffers, although we don't use this yet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:21 +02:00
Stefano Brivio
b3f359167b passt.1: Add David to AUTHORS
I just realised while reading the man page.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
3e2eb4337b conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()
Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in
the target user namespace as we isolate the process, which means
we're unable to bind to low ports at that point.

Bind inbound ports, and only those, before isolate_user(). Keep the
handling of outbound ports (for pasta mode only) after the setup of
the namespace, because that's where we'll bind them.

To this end, initialise the netlink socket for the init namespace
before isolate_user() as well, as we actually need to know the
addresses of the upstream interface before binding ports, in case
they're not explicitly passed by the user.

As we now call nl_sock_init() twice, checking its return code from
conf() twice looks a bit heavy: make it exit(), instead, as we
can't do much if we don't have netlink sockets.

While at it:

- move the v4_only && v6_only options check just after the first
  option processing loop, as this is more strictly related to
  option parsing proper

- update the man page, explaining that CAP_NET_BIND_SERVICE is
  *not* the preferred way to bind ports, because passt and pasta
  can be abused to allow other processes to make effective usage
  of it. Add a note about the recommended sysctl instead

- simplify nl_sock_init_do() now that it's called once for each
  case

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
40abd447c8 Rename pasta_setup_ns() to pasta_spawn_cmd()
pasta_setup_ns() no longer has much to do with setting up a namespace.
Instead it's really about starting the shell or other command we want to
run with pasta connectivity.  Rename it and its argument structure to be
less misleading.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
eb3d03a588 isolation: Only configure UID/GID mappings in userns when spawning shell
When in passt mode, or pasta mode spawning a command, we create a userns
for ourselves.  This is used both to isolate the pasta/passt process itself
and to run the spawned command, if any.

Since eed17a47 "Handle userns isolation and dropping root at the same time"
we've handled both cases the same, configuring the UID and GID mappings in
the new userns to map whichever UID we're running as to root within the
userns.

This mapping is desirable when spawning a shell or other command, so that
the user gets a root shell with reasonably clear abilities within the
userns and netns.  It's not necessarily essential, though.  When not
spawning a shell, it doesn't really have any purpose: passt itself doesn't
need to be root and can operate fine with an unmapped user (using some of
the capabilities we get when entering the userns instead).

Configuring the uid_map can cause problems if passt is running with any
capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to
allow it to forward low ports.  In this case the kernel makes files in
/proc/pid owned by root rather than the starting user to prevent the user
from interfering with the operation of the capability-enhanced process.
This includes uid_map meaning we are not able to write to it.

Whether this behaviour is correct in the kernel is debatable, but in any
case we might as well avoid problems by only initializing the user mappings
when we really want them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
fb449b16bd isolation: Prevent any child processes gaining capabilities
We drop our own capabilities, but it's possible that processes we exec()
could gain extra privilege via file capabilities.  It shouldn't be possible
for us to exec() anyway due to seccomp() and our filesystem isolation.  But
just in case, zero the bounding and inheritable capability sets to prevent
any such child from gainin privilege.

Note that we do this *after* spawning the pasta shell/command (if any),
because we do want the user to be able to give that privilege if they want.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
c22ebccba8 isolation: Replace drop_caps() with a version that actually does something
The current implementation of drop_caps() doesn't really work because it
attempts to drop capabilities from the bounding set.  That's not the set
that really matters, it's about limiting the abilities of things we might
later exec() rather than our own capabilities.  It also requires
CAP_SETPCAP which we won't usually have.

Replace it with a new version which uses setcap(2) to drop capabilities
from the effective and permitted sets.  For now we leave the inheritable
set as is, since we don't want to preclude the user from passing
inheritable capabilities to the command spawed by pasta.

Correctly dropping caps reveals that we were relying on some capabilities
we'd supposedly dropped.  Re-divide the dropping of capabilities between
isolate_initial(), isolate_user() and isolate_prefork() to make this work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
ceb2061587 isolation: Refactor isolate_user() to allow for a common exit path
Currently, isolate_user() exits early if the --netns-only option is given.
That works for now, but shortly we're going to want to add some logic to
go at the end of isolate_user() that needs to run in all cases: joining a
given userns, creating a new userns, or staying in our original userns
(--netns-only).

To avoid muddying those changes, here we reorganize isolate_user() to have
a common exit path for all cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
ea5936dd3f Replace FWRITE with a function
In a few places we use the FWRITE() macro to open a file, replace it's
contents with a given string and close it again.  There's no real
reason this needs to be a macro rather than just a function though.
Turn it into a function 'write_file()' and make some ancillary
cleanups while we're there:
  - Add a return code so the caller can handle giving a useful error message
  - Handle the case of short write()s (unlikely, but possible)
  - Add O_TRUNC, to make sure we replace the existing contents entirely

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
096e48669b isolation: Clarify various self-isolation steps
We have a number of steps of self-isolation scattered across our code.
Improve function names and add comments to make it clearer what the self
isolation model is, what the steps do, and why they happen at the points
they happen.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
6909a8e339 Remove unhelpful drop_caps() call in pasta_start_ns()
drop_caps() has a number of bugs which mean it doesn't do what you'd
expect.  However, even if we fixed those, the call in pasta_start_ns()
doesn't do anything useful:

* In the common case, we're UID 0 at this point.  In this case drop_caps()
  doesn't accomplish anything, because even with capabilities dropped, we
  are still privileged.
* When attaching to an existing namespace with --userns or --netns-only
  we might not be UID 0.  In this case it's too early to drop all
  capabilities: we need at least CAP_NET_ADMIN to configure the
  tap device in the namespace.

Remove this call - we will still drop capabilities a little later in
sandbox().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
01b4e71f7a pasta_start_ns() always ends in parent context
The end of pasta_start_ns() has a test against pasta_child_pid, testing
if we're in the parent or the child.  However we started the child running
the pasta_setup_ns function which always exec()s or exit()s, so if we
return from the clone() we are always in the parent, making that test
unnecessary.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00