passt

Author	SHA1	Message	Date
David Gibson	de93acbe70	tcp: Correct function comments for address types A number of functions describe themselves as taking a pointer to 'sin_addr or sin6_addr'. Those are field names, not type names. Replace them with the correct type names, in_addr or in6_addr. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:30 +01:00
David Gibson	f7653a1446	Use endian-safer typing in struct tap4_l4_t We recently converted to using struct in_addr rather than bare in_addr_t or uint32_t to represent IPv4 addresses in network order. This makes it harder to forget to apply the correct endian conversions. We omitted the IPv4 addresses stored in struct tap4_l4_t, however. Convert those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:26 +01:00
David Gibson	7c7b68dbe0	Use typing to reduce chances of IPv4 endianness errors We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:24 +01:00
David Gibson	dd3470d9a9	Use IPV4_IS_LOOPBACK more widely This macro checks if an IPv4 address is in the loopback network (127.0.0.0/8). There are two places where we open code an identical check, use the macro instead. There are also a number of places we specifically exclude the loopback address (127.0.0.1), but we should actually be excluding anything in the loopback network. Change those sites to use the macro as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:21 +01:00
David Gibson	dd09cceaee	Minor improvements to IPv4 netmask handling There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:19 +01:00
David Gibson	2b793d94ca	Correct some missing endian conversions of IPv4 addresses The INADDR_LOOPBACK constant is in host endianness, and similarly the IN_MULTICAST macro expects a host endian address. However, there are some places in passt where we use those with network endian values. This means that passt will incorrectly allow you to set 127.0.0.1 or a multicast address as the guest address or DNS forwarding address. Add the necessary conversions to correct this. INADDR_ANY and INADDR_BROADCAST logically behave the same way, although because they're palindromes it doesn't have an effect in practice. Change them to be logically correct while we're there, though. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:03:58 +01:00
Stefano Brivio	40fc9e6e7b	test: Add memory/passt test cases These show a summary of memory usage in kernel and userspace with different port forwarding configurations, details of userspace usage using 'nm' (passt only uses statically allocated memory), and details of kernel memory from slab reporting facilities. This adds a new test image, mbuto.mem.img, with hardcoded IPv4 and IPv6 addresses and routes, and just the tools we need to start and stop passt, to report from /proc/slabinfo, /proc/meminfo, and to print and parse symbol sizes using nm(1). passt can't pivot_root() for sandboxing purposes on ramfs, so we need to create another filesystem and chroot into it, first. We don't want to use pane context functions, as we're checking memory usage for sockets: resort to screen-scraping. Configure a dummy interface to provide passt with an appearance of working IPv4 and IPv6 connectivity, contributed by David. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:27 +01:00
Stefano Brivio	ce2a0a5bb4	test/lib: Add "td" directive, handled by table_value() This can be used for generic cell values with an arbitrary scale. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:18 +01:00
Stefano Brivio	bfd311aec7	test/lib/perf_report: Use own flag to track initialisation Instead of just disabling performance reports if running in demo mode. This allows us to use table functions outside of performance reports. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:17 +01:00
Stefano Brivio	2d4468ebb7	tap: Support for detection of existing sockets on ramfs On ramfs, connecting to a non-existent UNIX domain socket yields EACCES, instead of ENOENT. This is visible if we use passt directly on rootfs (a ramfs instance) from an initramfs image. It's probably wrong for ramfs to return EACCES, but given the simplicity of the filesystem, I doubt we should try to fix it there at the possible cost of added complexity. Also, this whole beauty should go away once qrap-less usage is established, so just accept EACCES as indication that a conflicting socket does not, in fact, exist. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:09 +01:00
Stefano Brivio	e76e65a36e	test/lib: Move screen-scraping setup and layout functions to _ugly files I'm going to add yet another one of those, for which I have no quick solution. It's a regression in some sense, but at least if we make this regression more observable and defined, it should be easier to find a comprehensive solution later, within this or another testing framework. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:05 +01:00
Stefano Brivio	ea5e046646	README: Add Podman, vhost-user links, and links to Bugzilla queries Unfortunately Bugzilla doesn't enable sharing of queries to unregistered users: https://bugzilla.mozilla.org/show_bug.cgi?id=400063 ...but we can still use ugly search links. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-27 22:41:37 +02:00
Stefano Brivio	10cabe3dbf	passt.1: Fix typo: "addressses", reported by Lintian Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-27 14:28:00 +02:00
Stefano Brivio	f212044940	icmp: Don't discard first reply sequence for a given echo ID In pasta mode, ICMP and ICMPv6 echo sockets relay back to us any reply we send: we're on the same host as the target, after all. We discard them by comparing the last sequence we sent with the sequence we receive. However, on the first reply for a given identifier, the sequence might be zero, depending on the implementation of ping(8): we need another value to indicate we haven't sent any sequence number, yet. Use -1 as initialiser in the echo identifier map. This is visible with Busybox's ping, and was reported by Paul on the integration at https://github.com/containers/podman/pull/16141, with: $ podman run --net=pasta alpine ping -c 2 192.168.188.1 ...where only the second reply would be routed back. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: `33482d5bf2` ("passt: Add PASTA mode, major rework") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-27 00:18:21 +02:00
Stefano Brivio	b062ee47d1	icmp: Add debugging messages for handled replies and requests ...instead of just reporting errors. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-27 00:18:18 +02:00
Stefano Brivio	947d756747	tap: Trace received (outbound) ICMP packets in debug mode, too This only worked for ICMPv6: ICMP packets have no TCP-style header, so they are handled as a special case before packet sequences are formed, and the call to tap_packet_debug() was missing. Fixes: `bb70811183` ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-27 00:18:16 +02:00
Stefano Brivio	7402951658	conf, passt.1: Don't imply --foreground with --debug Having -f implied by -d (and --trace) usually saves some typing, but debug mode in background (with a log file) is quite useful if pasta is started by Podman, and is probably going to be handy for passt with libvirt later, too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-27 00:17:56 +02:00
Stefano Brivio	e4df8b0844	test/run: Temporarily disable distribution tests They're too slow to cope with current release cycles, and they haven't found bugs in months, also because clang-tidy and cppcheck would find most of them earlier. Disable them for the moment. We should pre-install gcc and make in non-x86 images, as those run on my test machine with qemu TCG, and that's the real slow-down here. Then we can re-enable them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-26 07:03:56 +02:00
Stefano Brivio	fb820ebb2e	hooks: Temporarily disable demo generation in pre-push The out-of-tree Podman patch needs to be rebased every second week or so, and I'm currently trying to get that upstream: https://github.com/containers/podman/pull/16141 Disable demo generation for the moment, so that I avoid wasting time with those rebases. We'll re-enable it later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-26 06:56:25 +02:00
Stefano Brivio	d472476caa	test: Add log file tests for pasta plus corresponding layout and setup To test log files on a tmpfs mount, we need to unshare the mount namespace, which means using a context for the passt pane is not really practical at the moment, as we can't open a shell there, so we would have to encapsulate all the commands under 'unshare -rUm', plus the "inner" pasta command, running in turn a tcp_rr server. It might be worth fixing this by e.g. detecting we are trying to spawn an interactive shell and adding a special path in the context setup with some form of stdin redirection -- I'm not sure it's doable though. For this reason, add a new layout, using a context only for the host pane, while keeping the old command dispatch mechanism for the passt pane. We also need a new setup function that doesn't start pasta: we want to start and restart it with different options. Further, we need a 'pint' directive, to send an interrupt to the passt pane: add that in lib/test. All the tests before the one involving tmpfs and a detached mount namespace were also tested with the context mechanism. To make an eventual conversion easier, pass tcp_crr directly as a command on pasta's command line where feasible. While at it, fix the comment to the teardown_pasta() function. The new test set can be semi-conveniently run as: ./run pasta_options/log_to_file and it checks basic log creation, size of the log file after flooding it with debug entries, rotations, and basic consistency after rotations, on both an existing filesystem and a tmpfs, chosen as it doesn't support collapsing data ranges via fallocate(), hence triggering the fall-back mechanism for logging rotation. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-26 06:28:41 +02:00
Stefano Brivio	e67039f712	checksum: Fix calculation for ICMP checksum on IPv4 We need to zero out the checksum field before calculating the checksum, of course. I have no idea how this passed the "icmp" test set, looking into it. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: `67ab617172` ("Add csum_icmp4() helper for calculating ICMP checksums") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-26 06:28:06 +02:00
Stefano Brivio	c11277b94f	conf: Don't pass leading ~ to parse_port_range() on exclusions Commit `84fec4e998` ("Clean up parsing of port ranges") drops the strspn() call before the parsing of excluded port ranges, because now we're checking against any stray characters at every step. However, that also has the effect of passing ~ as first character to the new parse_port_range(), which makes no sense: we already checked that ~ is the first character before the call, so skip it. Alona reported this output: Invalid port specifier ~15000,~15001,~15006,~15008,~15020,~15021,~15090 while the whole specifier is indeed valid. Reported-by: Alona Paz <alkaplan@redhat.com> Fixes: `84fec4e998` ("Clean up parsing of port ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-24 14:37:22 +02:00
Stefano Brivio	b68da100ba	util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size ...instead of one fourth. On the main() -> conf() -> nl_sock_init() call path, LTO from gcc 12 on (at least) x86_64 decides to inline... everything: nl_sock_init() is effectively part of main(), after commit `3e2eb4337b` ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()"). This means we exceed the maximum stack size, and we get SIGSEGV, under any condition, at start time, as reported by Andrea on a recent build for CentOS Stream 9. The calculation of NS_FN_STACK_SIZE, which is the stack size we reserve for clones, was previously obtained by dividing the maximum stack size by two, to avoid an explicit check on architecture (on PA-RISC, also known as hppa, the stack grows up, so we point the clone to the middle of this area), and then further divided by two to allow for any additional usage in the caller. Well, if there are essentially no function calls anymore, this is not enough. Divide it by eight, which is anyway much more than possibly needed by any clone()d callee. I think this is robust, so it's a fix in some sense. Strictly speaking, though, we have no formal guarantees that this isn't either too little or too much. What we should do, eventually: check cloned() callees, there are just thirteen of them at the moment. Note down any stack usage (they are mostly small helpers), bonus points for an automated way at build time, quadruple that or so, to allow for extreme clumsiness, and use as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. Reported-by: Andrea Bolognani <abologna@redhat.com> Fixes: `3e2eb4337b` ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-22 08:46:57 +02:00
Andrea Bolognani	5715a297a7	Add git-publish configuration file Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-22 03:45:50 +02:00
Andrea Bolognani	b944ca1855	qrap: Support JSON syntax for -device Starting with version 8.1.0, libvirt uses JSON syntax when generating the arguments to -device, so they will now look like {"driver":"virtio-scsi-pci","bus":"pci.3","addr":"0x0"} instead of virtio-scsi-pci,bus=pci.3,addr=0x0 qrap needs to parse these arguments and extract the bus number in order to figure out what address to use for the virtio-net device it adds, and the libvirt change described above has broken this parsing logic. Tweak the code so that both styles are accepted and handled correctly. Note that, when JSON is in use, qrap needs to generate its own command line options in that format as well or things will not work as expected. Signed-off-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-21 11:43:45 +02:00
David Gibson	c6845f60a0	dhcp: Use tap_udp4_send() helper in dhcp() The IPv4 specific dhcp() manually constructs L2 and IP headers to send its DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the tap_udp6_send() helper. Now that we've broadened the parameters to tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:35:00 +02:00
David Gibson	2dbc622f54	tap: Split tap_ip4_send() into UDP and ICMP variants tap_ip4_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp4_send() and tap_icmp4_send() functions without greatly increasing the code size, thus removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv4 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:56 +02:00
David Gibson	db07804d26	ndp: Use tap_icmp6_send() helper We send ICMPv6 packets to the guest from both icmp.c and from ndp.c. The case in ndp() manually constructs L2 and IPv6 headers, unlike the version in icmp.c which uses the tap_icmp6_send() helper from tap.c. Now that we've broadened the parameters of tap_icmp6_send() we can use it in ndp() as well, saving some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:53 +02:00
David Gibson	cb1edae3b5	ndp: Remove unneeded eh_source parameter ndp() takes a parameter giving the ethernet source address of the packet it is to respond to, which it uses to determine the destination address to send the reply packet to. This is not necessary, because the address will always be the guest's MAC address. Even if the guest has just changed MAC address, then either tap_handler_passt() or tap_handler_pasta() - which are the only call paths leading to ndp() will have updated c->mac_guest with the new value. So, remove the parameter, and just use c->mac_guest, making it more consistent with other paths where we construct packets to send inwards. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:51 +02:00
David Gibson	9d8dd8b6f4	tap: Split tap_ip6_send() into UDP and ICMP variants tap_ip6_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp6_send() and tap_icmp6_send() functions without greatly increasing the code size, thus removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv6 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:48 +02:00
David Gibson	f616ca231e	Split tap_ip_send() into IPv4 and IPv6 specific functions The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and it turns out that every caller (statically) knows if it is using IPv4 or IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions. Use a new tap_l2_hdr() function for the very small common part. While we're there, make some minor cleanups: - We were double writing some fields in the IPv6 header, so that it temporarily matched the pseudo-header for checksum calculation. With recent checksum reworks, this isn't necessary any more. - We don't use any IPv4 header options, so use some sizeof() constructs instead of some open coded values for header length. - The comment used to say that the flow label was for TCP over IPv6, but in fact the only thing we used it for was DHCPv6 over UDP traffic Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:45 +02:00
David Gibson	fb5d1c5d7d	tap: Remove unhelpful vnet_pre optimization from tap_send() Callers of tap_send() can optionally use a small optimization by adding extra space for the 4 byte length header used on the qemu socket interface. tap_ip_send() is currently the only user of this, but this is used only for "slow path" ICMP and DHCP packets, so there's not a lot of value to the optimization. Worse, having the two paths here complicates the interface and makes future cleanups difficult, so just remove it. I have some plans to bring back the optimization in a more general way in future, but for now it's just in the way. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:43 +02:00
David Gibson	f72b63e92f	Remove support for TCP packets from tap_ip_send() tap_ip_send() is never used for TCP packets, we're unlikely to use it for that in future, and the handling of TCP packets makes other cleanups unnecessarily awkward. Remove it. This is the only user of csum_tcp4(), so we can remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:40 +02:00
David Gibson	a2eb2d310a	Add helpers for normal inbound packet destination addresses tap_ip_send() doesn't take a destination address, because it's specifically for inbound packets, and the IP addresses of the guest/namespace are already known to us. Rather than open-coding this destination address logic, make helper functions for it which will enable some later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:38 +02:00
David Gibson	3d8ccb44a6	Add csum_ip4_header() helper to calculate IPv4 header checksums We calculate IPv4 header checksums in at least two places, in dhcp() and in tap_ip_send. Add a helper to handle this calculation in both places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:34 +02:00
David Gibson	bd4be308fc	Add csum_udp4() helper for calculating UDP over IPv4 checksums At least two places in passt fill in UDP over IPv4 checksums, although since UDP checksums are optional with IPv4 that just amounts to storing a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in dhcp()). For consistency, add a helper for this "calculation". Just for the heck of it, add the option (compile time disabled for now) to calculate real UDP checksums. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:32 +02:00
David Gibson	6905ac75ec	Add csum_udp6() helper for calculating UDP over IPv6 checksums Add a helper for calculating UDP checksums when used over IPv6. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the UDP header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:29 +02:00
David Gibson	67ab617172	Add csum_icmp4() helper for calculating ICMP checksums Although tap_ip_send() is currently the only place calculating ICMP checksums, create a helper function for symmetry with ICMPv6. For future flexibility it allows the ICMPv6 header and payload to be in separate buffers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:26 +02:00
David Gibson	7abd2b0d72	Add csum_icmp6() helper for calculating ICMPv6 checksums At least two places in passt calculate ICMPv6 checksums, ndp() and tap_ip_send(). Add a helper to handle this calculation in both places. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the ICMPv6 header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:21 +02:00
Stefano Brivio	b3f359167b	passt.1: Add David to AUTHORS I just realised while reading the man page. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-15 02:10:36 +02:00
Stefano Brivio	3e2eb4337b	conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user() Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in the target user namespace as we isolate the process, which means we're unable to bind to low ports at that point. Bind inbound ports, and only those, before isolate_user(). Keep the handling of outbound ports (for pasta mode only) after the setup of the namespace, because that's where we'll bind them. To this end, initialise the netlink socket for the init namespace before isolate_user() as well, as we actually need to know the addresses of the upstream interface before binding ports, in case they're not explicitly passed by the user. As we now call nl_sock_init() twice, checking its return code from conf() twice looks a bit heavy: make it exit(), instead, as we can't do much if we don't have netlink sockets. While at it: - move the v4_only && v6_only options check just after the first option processing loop, as this is more strictly related to option parsing proper - update the man page, explaining that CAP_NET_BIND_SERVICE is not the preferred way to bind ports, because passt and pasta can be abused to allow other processes to make effective usage of it. Add a note about the recommended sysctl instead - simplify nl_sock_init_do() now that it's called once for each case Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	40abd447c8	Rename pasta_setup_ns() to pasta_spawn_cmd() pasta_setup_ns() no longer has much to do with setting up a namespace. Instead it's really about starting the shell or other command we want to run with pasta connectivity. Rename it and its argument structure to be less misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	eb3d03a588	isolation: Only configure UID/GID mappings in userns when spawning shell When in passt mode, or pasta mode spawning a command, we create a userns for ourselves. This is used both to isolate the pasta/passt process itself and to run the spawned command, if any. Since `eed17a47` "Handle userns isolation and dropping root at the same time" we've handled both cases the same, configuring the UID and GID mappings in the new userns to map whichever UID we're running as to root within the userns. This mapping is desirable when spawning a shell or other command, so that the user gets a root shell with reasonably clear abilities within the userns and netns. It's not necessarily essential, though. When not spawning a shell, it doesn't really have any purpose: passt itself doesn't need to be root and can operate fine with an unmapped user (using some of the capabilities we get when entering the userns instead). Configuring the uid_map can cause problems if passt is running with any capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to allow it to forward low ports. In this case the kernel makes files in /proc/pid owned by root rather than the starting user to prevent the user from interfering with the operation of the capability-enhanced process. This includes uid_map meaning we are not able to write to it. Whether this behaviour is correct in the kernel is debatable, but in any case we might as well avoid problems by only initializing the user mappings when we really want them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	fb449b16bd	isolation: Prevent any child processes gaining capabilities We drop our own capabilities, but it's possible that processes we exec() could gain extra privilege via file capabilities. It shouldn't be possible for us to exec() anyway due to seccomp() and our filesystem isolation. But just in case, zero the bounding and inheritable capability sets to prevent any such child from gaining privilege. Note that we do this after spawning the pasta shell/command (if any), because we do want the user to be able to give that privilege if they want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	c22ebccba8	isolation: Replace drop_caps() with a version that actually does something The current implementation of drop_caps() doesn't really work because it attempts to drop capabilities from the bounding set. That's not the set that really matters, it's about limiting the abilities of things we might later exec() rather than our own capabilities. It also requires CAP_SETPCAP which we won't usually have. Replace it with a new version which uses capset(2) to drop capabilities from the effective and permitted sets. For now we leave the inheritable set as is, since we don't want to preclude the user from passing inheritable capabilities to the command spawned by pasta. Correctly dropping caps reveals that we were relying on some capabilities we'd supposedly dropped. Re-divide the dropping of capabilities between isolate_initial(), isolate_user() and isolate_prefork() to make this work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	ceb2061587	isolation: Refactor isolate_user() to allow for a common exit path Currently, isolate_user() exits early if the --netns-only option is given. That works for now, but shortly we're going to want to add some logic to go at the end of isolate_user() that needs to run in all cases: joining a given userns, creating a new userns, or staying in our original userns (--netns-only). To avoid muddying those changes, here we reorganize isolate_user() to have a common exit path for all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	ea5936dd3f	Replace FWRITE with a function In a few places we use the FWRITE() macro to open a file, replace its contents with a given string and close it again. There's no real reason this needs to be a macro rather than just a function though. Turn it into a function 'write_file()' and make some ancillary cleanups while we're there: - Add a return code so the caller can handle giving a useful error message - Handle the case of short write()s (unlikely, but possible) - Add O_TRUNC, to make sure we replace the existing contents entirely Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	096e48669b	isolation: Clarify various self-isolation steps We have a number of steps of self-isolation scattered across our code. Improve function names and add comments to make it clearer what the self isolation model is, what the steps do, and why they happen at the points they happen. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	6909a8e339	Remove unhelpful drop_caps() call in pasta_start_ns() drop_caps() has a number of bugs which mean it doesn't do what you'd expect. However, even if we fixed those, the call in pasta_start_ns() doesn't do anything useful: * In the common case, we're UID 0 at this point. In this case drop_caps() doesn't accomplish anything, because even with capabilities dropped, we are still privileged. * When attaching to an existing namespace with --userns or --netns-only we might not be UID 0. In this case it's too early to drop all capabilities: we need at least CAP_NET_ADMIN to configure the tap device in the namespace. Remove this call - we will still drop capabilities a little later in sandbox(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
David Gibson	01b4e71f7a	pasta_start_ns() always ends in parent context The end of pasta_start_ns() has a test against pasta_child_pid, testing if we're in the parent or the child. However we started the child running the pasta_setup_ns function which always exec()s or exit()s, so if we return from the clone() we are always in the parent, making that test unnecessary. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
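
The endianness fixes listed above (2b793d94ca, dd3470d9a9, 7c7b68dbe0) rest on one rule: INADDR_LOOPBACK and IN_MULTICAST() work in host byte order, while struct in_addr holds network byte order. A minimal sketch of that rule follows. The addr4_*() names are hypothetical and chosen for illustration only; they are not passt's IN4_IS_ADDR_*() helpers, whose definitions are not shown in this log.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration, not taken from passt */

/* True if a (network order) lies anywhere in the loopback network 127.0.0.0/8 */
static inline bool addr4_is_loopback(struct in_addr a)
{
	/* Convert to host order first, then compare the top octet */
	return (ntohl(a.s_addr) >> 24) == 127;
}

/* True if a (network order) is exactly 127.0.0.1 */
static inline bool addr4_is_localhost(struct in_addr a)
{
	/* INADDR_LOOPBACK is a host-order constant: convert it before comparing */
	return a.s_addr == htonl(INADDR_LOOPBACK);
}

/* True if a (network order) is a multicast address (224.0.0.0/4) */
static inline bool addr4_is_multicast(struct in_addr a)
{
	/* IN_MULTICAST() expects a host-order value */
	return IN_MULTICAST(ntohl(a.s_addr));
}
```

Taking a struct in_addr by value, rather than a bare uint32_t, means the compiler rejects accidental arithmetic or comparison on the raw address, which is the point made in 7c7b68dbe0.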

860 commits
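
Commit dd09cceaee describes two netmask problems: non-contiguous masks such as 0.255.0.255 were accepted, and a prefix length was truncated to int after strtol(), so 4294967320 (2^32 + 24) was treated as 24. The sketch below shows one way to check both, assuming hypothetical function names that do not come from passt's conf.c.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration, not taken from passt */

/* Parse an IPv4 prefix length; returns 0..32, or -1 on any error */
static inline int parse_prefix_len4(const char *s)
{
	char *end;
	long len;

	errno = 0;
	len = strtol(s, &end, 10);
	/* Range-check while the value is still a long, before narrowing to int */
	if (errno || end == s || *end != '\0' || len < 0 || len > 32)
		return -1;

	return (int)len;
}

/* Convert a netmask (network order) to a prefix length; -1 if not contiguous */
static inline int netmask_to_prefix_len4(struct in_addr mask)
{
	uint32_t m = ntohl(mask.s_addr);
	uint32_t inv = ~m;
	int len = 0;

	/* A valid mask inverts to 2^k - 1, so inv and inv + 1 share no bits */
	if (inv & (inv + 1))
		return -1;

	while (m & 0x80000000u) {
		len++;
		m <<= 1;
	}

	return len;
}

/* Convert a prefix length (0..32) to a netmask in network order */
static inline struct in_addr prefix_len_to_netmask4(int len)
{
	struct in_addr mask;

	/* Shifting a 32-bit value by 32 is undefined, so handle len == 0 apart */
	mask.s_addr = len ? htonl((uint32_t)(UINT32_MAX << (32 - len))) : 0;

	return mask;
}
```

Keeping the prefix length as the primary value and deriving the netmask only when needed mirrors the approach the commit message describes, and matches how IPv6 prefixes are usually handled.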