passt

Author	SHA1	Message	Date
David Gibson	d836d9e345	util: Remove possible quadratic behaviour from write_remainder() write_remainder() steps through the buffers in an IO vector writing out everything past a certain byte offset. However, on each iteration it rescans the buffer from the beginning to find out where we're up to. With an unfortunate set of write sizes this could lead to quadratic behaviour. In an even less likely set of circumstances (total vector length > maximum size_t) the 'skip' variable could overflow. This is one factor in a longstanding Coverity error we've seen (although I still can't figure out the remainder of its complaint). Rework write_remainder() to always work out our new position in the vector relative to our old/current position, rather than starting from the beginning each time. As a bonus this seems to fix the Coverity error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:15:03 +02:00
David Gibson	bfc294b90d	util: Add helper to write() all of a buffer write(2) might not write all the data it is given. Add a write_all_buf() helper to keep calling it until all the given data is written, or we get an error. Currently we use write_remainder() to do this operation in pcap_frame(). That's a little awkward since it requires constructing an iovec, and future changes we want to make to write_remainder() will be easier in terms of this single buffer helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:59 +02:00
David Gibson	bb41901c71	tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean This parameter is already treated as a boolean internally. Make it a 'bool' type for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:55 +02:00
David Gibson	265b2099c7	tcp: Simplify ifdef logic in tcp_update_seqack_wnd() This function has a block conditional on !snd_wnd_cap shortly before an snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:50 +02:00
David Gibson	4aff6f9392	tcp: Clean up tcpi_snd_wnd probing When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is a perfectly possible window the peer could report, so we can get false negatives * We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retrieved. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:47 +02:00
David Gibson	7d8804beb8	tcp: Make some extra functions private tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c, and have no prototype in a header. Make them static. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:33 +02:00
David Gibson	5ff5d55291	tcp: Avoid overlapping memcpy() in DUP_ACK handling When handling the DUP_ACK flag, we copy all the buffers making up the ack frame. However, all our frames share the same buffer for the Ethernet header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will result in a (perfectly) overlapping memcpy(). This seems to have been harmless so far, but overlapping ranges to memcpy() is undefined behaviour, so we really should avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-12 09:13:59 +02:00
David Gibson	1f414ed8f0	tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base This initialisation for IPv4 flags buffers is redundant with the very next line which sets both iov_base and iov_len. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-12 09:13:46 +02:00
Stefano Brivio	6b38f07239	apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range ...for both passt and pasta: use passt's abstraction for this. Fixes: `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 15:34:06 +02:00
Stefano Brivio	116bc8266d	selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range Since commit `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports"), we might need to read from /proc/sys/net/ipv4/ip_local_port_range in both passt and pasta. While pasta was already allowed to open and write /proc/sys/net entries, read access was missing in SELinux's type enforcement: add that. In passt, instead, this is the first time we need to access an entry there: add everything we need. Fixes: `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 15:34:06 +02:00
David Gibson	a33ecafbd9	tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:46 +02:00
David Gibson	d2a1dc744b	tap: Restructure in tap_pasta_input() tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:43 +02:00
David Gibson	11e29054fe	tap: Improve handling of EINTR in tap_passt_input() When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewed EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:41 +02:00
David Gibson	49fc4e0414	tap: Split out handling of EPOLLIN events Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:37 +02:00
Stefano Brivio	63513e54f3	util: Fix order of operands and carry of one second in timespec_diff_us() If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 13:01:34 +02:00
David Gibson	748ef4cd6e	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:20 +02:00
Stefano Brivio	afedc2412e	tcp: Use EPOLLET for any state of not established connections Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the connection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:16 +02:00
David Gibson	aff5a49b0e	udp: Handle more error conditions in udp_sock_errs() udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:38 +02:00
David Gibson	bd99f02a64	udp: Treat errors getting errors as unrecoverable We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:35 +02:00
David Gibson	bd092ca421	udp: Split socket error handling out from udp_sock_recv() Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:33 +02:00
David Gibson	88bfa3801e	flow: Helpers to log details of a flow The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:30 +02:00
David Gibson	1166401c2f	udp: Allow UDP flows to be prematurely closed Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:24 +02:00
David Gibson	7ad9f9bd2b	flow: Fix incorrect hash probe in flowside_lookup() Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has meant that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of hash near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: `acca4235c4` ("flow, tcp: Generalise TCP hash table to general flow hash table") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:52:31 +02:00
Stefano Brivio	0ea60e5a77	log: Don't prefix log file messages with time and severity if they're continuations In `fecb1b65b1` ("log: Don't prefix message with timestamp on --debug if it's a continuation"), I fixed this for --debug on standard error, but not for log files: if messages are continuations, they shouldn't be prefixed by timestamp and severity. Otherwise, we'll print stuff like this: 0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 12:52:21 +02:00
Michal Privoznik	38363964fc	Makefile: Enable _FORTIFY_SOURCE iff needed On some systems source fortification is enabled whenever code optimization is enabled (e.g. with -O2). Since code fortification is explicitly enabled too (with possibly different value than the system wants, there are three levels [1]), distros are required to patch our Makefile, e.g. [2]. Detect whether fortification is not already enabled and enable it explicitly only if really needed. 1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html 2: `edfeb8763a` Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:21 +02:00
David Gibson	eedc81b6ef	fwd, conf: Probe host's ephemeral ports When we forward "all" ports (-t all or -u all), or use an exclude-only range, we don't actually forward all ports - that wouldn't leave local ports to use for outgoing connections. Rather we forward all non-ephemeral ports - those that won't be used for outgoing connections or datagrams. Currently we assume the range of ephemeral ports is that recommended by RFC 6335, 49152-65535. However, that's not the range used by default on Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range sysctl. We can't really know what range the guest will consider ephemeral, but if it differs too much from the host it's likely to cause problems we can't avoid anyway. So, using the host's ephemeral range is a better guess than using the RFC 6335 range. Therefore, add logic to probe the host's ephemeral range, falling back to the RFC 6335 range if that fails. This has the bonus advantage of reducing the number of ports bound by -t all -u all on most Linux machines thereby reducing kernel memory usage. Specifically this reduces kernel memory usage with -t all -u all from ~380MiB to ~289MiB. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:08 +02:00
David Gibson	4a41dc58d6	conf, fwd: Don't attempt to forward port 0 When using -t all, -u all or exclude-only ranges, we'll attempt to forward all non-ephemeral port numbers, including port 0. However, this won't work as intended: bind() treats a zero port not as literal port 0, but as "pick a port for me". Because of the special meaning of port 0, we mostly outright exclude it in our handling. Do the same for setting up forwards, not attempting to forward for port 0. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:05 +02:00
David Gibson	1daf6f4615	conf, fwd: Make ephemeral port logic more flexible "Ephemeral" ports are those which the kernel may allocate as local port numbers for outgoing connections or datagrams. Because of that, they're generally not good choices for listening servers to bind to. Therefore when using -t all, -u all or exclude-only ranges, we map only non-ephemeral ports. Our logic for this is a bit rigid though: we assume the ephemeral ports are always a fixed range at the top of the port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of 8, or we won't set the forward bitmap correctly. Make the logic in conf.c more flexible, using a helper moved into fwd.[ch], although we don't change which ports we consider ephemeral (yet). The new handling is undoubtedly more computationally expensive, but since it's a once-off operation at start off, I don't think it really matters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:25:51 +02:00
Stefano Brivio	712ca32353	seccomp.sh: Try to account for terminal width while formatting list of system calls Avoid excess lines on wide terminals, but make sure we don't fail if we can't fetch the number of columns for any reason, as it's not a fundamental feature and we don't want to break anything with it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 14:30:17 +02:00
David Gibson	e0be6bc2f4	udp: Use dual stack sockets for port forwarding when possible Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this can significantly reduce the amount of kernel memory that passt consumes. We've used such dual stack sockets for TCP since `8e914238b` "tcp: Use dual stack sockets for port forwarding when possible". Add similar support for UDP "listening" sockets. Since UDP sockets don't use as much kernel memory as TCP sockets this isn't as big a saving, but it's still significant. When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all), this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version 6.10.6 on Fedora 40, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:41 +02:00
David Gibson	c78b194001	udp: Remove unnecessary local from udp_sock_init() The 's' variable is always redundant with either 'r4' or 'r6', so remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:38 +02:00
David Gibson	620e19a1b4	udp: Merge udp[46]_mh_recv arrays We've already gotten rid of most of the IPv4/IPv6 specific data structures in udp.c by merging them with each other. One significant one remains: udp[46]_mh_recv. This was a bit awkward to remove because of a subtle interaction. We initialise the msg_namelen fields to represent the total size we have for a socket address, but when we receive into the arrays those are modified to the actual length of the sockaddr we received. That meant that naively merging the arrays meant that if we received IPv4 datagrams, then IPv6 datagrams, the addresses for the latter would be truncated. In this patch address that by resetting the received msg_namelen as soon as we've found a flow for the datagram. Finding the flow is the only thing that might use the actual sockaddr length, although we in fact don't need it for the time being. This also removes the last use of the 'v6' field from udp_listen_epoll_ref, so remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:25 +02:00
Stefano Brivio	418feb37ec	test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile Some distributions already have OpenSSH 9.8, which introduces split sshd/sshd-session binaries, and there we need to copy the binary from the host, which can be /usr/libexec/openssh/sshd-session (Fedora Rawhide), /usr/lib/ssh/sshd-session (Arch Linux), /usr/lib/openssh/sshd-session (Debian), and possibly other paths. Add at least those three, and, if we don't find sshd-session, assume we don't need it: it could very well be an older version of OpenSSH, as reported by David for Fedora 40, or perhaps another daemon (would Dropbear even work? I'm not sure). Reported-by: David Gibson <david@gibson.dropbear.id.au> Fixes: `d6817b3930` ("test/passt.mbuto: Install sshd-session OpenSSH's split process") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-27 09:03:47 +02:00
Stefano Brivio	1d6142f362	README: pasta is indeed a supported back-end for rootless Docker ...https://github.com/moby/moby/issues/48257 just reminded me. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:05:26 +02:00
Stefano Brivio	f00ebda369	util: Don't stop on unrelated values when looking for --fd in close_open_files() Seen with krun: we get a file descriptor via --fd, but we close it and happily use the same number for TCP files. The issue is that if we also get other options before --fd, with arguments, getopt_long() stops parsing them because it sees them as non-option values. Use the - modifier at the beginning of optstring (before :, which is needed to avoid printing errors) instead of +, which means we'll continue parsing after finding unrelated option values, but getopt_long() won't reorder them anyway: they'll be passed with option value '1', which we can ignore. By the way, we also need to add : after F in the optstring, so that we're able to parse the option when given as short name as well. Now that we change the parsing mode between close_open_files() and conf(), we need to reset optind to 0, not to 1, whenever we call getopt_long() again in conf(), so that the internal initialisation of getopt_long() evaluating GNU extensions is re-triggered. Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828 Fixes: `baccfb95ce` ("conf: Stop parsing options at first non-option argument") Fixes: `09603cab28` ("passt, util: Close any open file that the parent might have leaked") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:53 +02:00
Stefano Brivio	05453ea590	test: Update list of dependencies in README.md Mostly packages we now need to run Podman-based tests. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:30 +02:00
Stefano Brivio	1a66806c18	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf) These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:17 +02:00
Stefano Brivio	6e9ecf5741	util: Provide own version of close_range(), and no-op fallback musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial) close_range() implementation. This will probably be added to musl soon, by the way: https://www.openwall.com/lists/musl/2024/08/01/9 Add a weakly-aliased implementation, if it's supported by the kernel. If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31 file descriptors calling close() on them is probably not a good idea. Reported-by: lemmi <lemmi@nerd2nerd.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:48 +02:00
Stefano Brivio	7291b70ba7	udp_flow: Add missing unistd.h include for close() For some reason, this is reported only with musl, and older glibc versions (2.31, at least). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:34 +02:00
Stefano Brivio	396307541e	test: Duplicate existing recvfrom() valgrind suppression for recv() Some architectures, including i686, actually have a recv() system call, not just a recvfrom(), and we need to cover the recv() with MSG_TRUNC into a NULL buffer for them as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:23 +02:00
Stefano Brivio	d6817b3930	test/passt.mbuto: Install sshd-session OpenSSH's split process OpenSSH now ships a per-session binary, sshd-session, with sshd acting as mere listener. It's typically not found in $PATH, so specify the whole path at which it's commonly installed in $PROGS. Link: https://www.openssh.com/releasenotes.html#9.8p1 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:03 +02:00
Stefano Brivio	34be8eeb38	test/passt.mbuto: Run sshd from vsock proxy with absolute path ...OpenSSH >= 9.8 otherwise complains that: sshd requires execution with an absolute path Link: https://bugs.gentoo.org/936041 Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:02:37 +02:00
Stefano Brivio	aded2b671c	test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386) It's qemu-system-i386, but uname -m reports i686. I didn't test i486 and i586. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:01:48 +02:00
Stefano Brivio	2aea1da143	treewide: Allow additional system calls for i386/i686 I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:00:43 +02:00
David Gibson	57b7bd2a48	fwd, conf: Allow NAT of the guest's assigned address The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --map-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable. Add a new --map-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback. If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:40 +02:00
David Gibson	8436c0d61b	fwd: Distinguish translatable from untranslatable addresses on inbound fwd_nat_from_host() needs to adjust the source address for new flows coming from an address which is not accessible to the guest. Currently we always use our_tap_addr or our_tap_ll. However in cases where the address is accessible to the guest via translation (i.e. via --map-host-loopback) then it makes more sense to use that translation, rather than the fallback mapping of our_tap_*. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:37 +02:00
David Gibson	e813a4df7d	conf: Allow address remapped to host to be configured Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback. Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --map-host-loopback option. In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:35 +02:00
David Gibson	dbaaebbe00	test: Reconfigure IPv6 address after changing MTU In the TCP throughput tests, we adjust the guest's MTU in order to test various packet sizes. Some of those are below 1280 which causes IPv6 to be deconfigured on the guest interface. When we increase it above 1280 again, IPv6 is re-enabled and we get an address in the right prefix with NDP, but we don't get exactly the expected address back - that's only communicated with --config-net or DHCPv6. With changes to how we handle NAT this can cause some of the IPv6 tests to fail, because they don't use the address that passt/pasta expects, and the guest doesn't initiate any traffic which allows us to learn what the new address is. Work around this by re-invoking dhclient -6 between adjusting the MTU and running IPv6 test cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:33 +02:00
David Gibson	935bd81936	conf, fwd: Split notion of gateway/router from guest-visible host address The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host. Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback. Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:31 +02:00
David Gibson	90e83d50a9	Don't take "our" MAC address from the host When sending frames to the guest over the tap link, we need a source MAC address. Currently we take that from the MAC address of the main interface on the host, but that doesn't actually make much sense: * We can't preserve the real MAC address of packets from anywhere external so there's no transparency case here * In fact, it's confusingly different from how we handle IP addresses: whereas we give the guest the same IP as the host, we're making the host's MAC the one MAC that the guest can't use for itself. * We already need a fallback case if the host doesn't have an Ethernet like MAC (e.g. if it's connected via a point to point interface, such as a wireguard VPN). Change to just use an arbitrary fixed MAC address - I've picked 9a:55:9a:55:9a:55. It's simpler and has the small advantage of making the fact that passt/pasta is in use typically obvious from guest side packet dumps. This can still, of course, be overridden with the -M option. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:28 +02:00

1737 commits