passt

mirror of https://passt.top/passt synced 2025-06-05 23:45:34 +02:00

Author	SHA1	Message	Date
David Gibson	a33ecafbd9	tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:46 +02:00
David Gibson	d2a1dc744b	tap: Restructure in tap_pasta_input() tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:43 +02:00
David Gibson	11e29054fe	tap: Improve handling of EINTR in tap_passt_input() When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:41 +02:00
David Gibson	49fc4e0414	tap: Split out handling of EPOLLIN events Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:37 +02:00
Stefano Brivio	63513e54f3	util: Fix order of operands and carry of one second in timespec_diff_us() If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 13:01:34 +02:00
David Gibson	748ef4cd6e	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:20 +02:00
Stefano Brivio	afedc2412e	tcp: Use EPOLLET for any state of not established connections Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the conection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:16 +02:00
David Gibson	aff5a49b0e	udp: Handle more error conditions in udp_sock_errs() udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:38 +02:00
David Gibson	bd99f02a64	udp: Treat errors getting errors as unrecoverable We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:35 +02:00
David Gibson	bd092ca421	udp: Split socket error handling out from udp_sock_recv() Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:33 +02:00
David Gibson	88bfa3801e	flow: Helpers to log details of a flow The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:30 +02:00
David Gibson	1166401c2f	udp: Allow UDP flows to be prematurely closed Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:24 +02:00
David Gibson	7ad9f9bd2b	flow: Fix incorrect hash probe in flowside_lookup() Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has mean that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of has near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: `acca4235c4` ("flow, tcp: Generalise TCP hash table to general flow hash table") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:52:31 +02:00
Stefano Brivio	0ea60e5a77	log: Don't prefix log file messages with time and severity if they're continuations In `fecb1b65b1` ("log: Don't prefix message with timestamp on --debug if it's a continuation"), I fixed this for --debug on standard error, but not for log files: if messages are continuations, they shouldn't be prefixed by timestamp and severity. Otherwise, we'll print stuff like this: 0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 12:52:21 +02:00
Michal Privoznik	38363964fc	Makefile: Enable _FORTIFY_SOURCE iff needed On some systems source fortification is enabled whenever code optimization is enabled (e.g. with -O2). Since code fortification is explicitly enabled too (with possibly different value than the system wants, there are three levels [1]), distros are required to patch our Makefile, e.g. [2]. Detect whether fortification is not already enabled and enable it explicitly only if really needed. 1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html 2: `edfeb8763a` Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:21 +02:00
David Gibson	eedc81b6ef	fwd, conf: Probe host's ephemeral ports When we forward "all" ports (-t all or -u all), or use an exclude-only range, we don't actually forward all ports - that wouln't leave local ports to use for outgoing connections. Rather we forward all non-ephemeral ports - those that won't be used for outgoing connections or datagrams. Currently we assume the range of ephemeral ports is that recommended by RFC 6335, 49152-65535. However, that's not the range used by default on Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range sysctl. We can't really know what range the guest will consider ephemeral, but if it differs too much from the host it's likely to cause problems we can't avoid anyway. So, using the host's ephemeral range is a better guess than using the RFC 6335 range. Therefore, add logic to probe the host's ephemeral range, falling back to the RFC 6335 range if that fails. This has the bonus advantage of reducing the number of ports bound by -t all -u all on most Linux machines thereby reducing kernel memory usage. Specifically this reduces kernel memory usage with -t all -u all from ~380MiB to ~289MiB. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:08 +02:00
David Gibson	4a41dc58d6	conf, fwd: Don't attempt to forward port 0 When using -t all, -u all or exclude-only ranges, we'll attempt to forward all non-ephemeral port numbers, including port 0. However, this won't work as intended: bind() treats a zero port not as literal port 0, but as "pick a port for me". Because of the special meaning of port 0, we mostly outright exclude it in our handling. Do the same for setting up forwards, not attempting to forward for port 0. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:05 +02:00
David Gibson	1daf6f4615	conf, fwd: Make ephemeral port logic more flexible "Ephemeral" ports are those which the kernel may allocate as local port numbers for outgoing connections or datagrams. Because of that, they're generally not good choices for listening servers to bind to. Thefore when using -t all, -u all or exclude-only ranges, we map only non-ephemeral ports. Our logic for this is a bit rigid though: we assume the ephemeral ports are always a fixed range at the top of the port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of 8, or we won't set the forward bitmap correctly. Make the logic in conf.c more flexible, using a helper moved into fwd.[ch], although we don't change which ports we consider ephemeral (yet). The new handling is undoubtedly more computationally expensive, but since it's a once-off operation at start off, I don't think it really matters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:25:51 +02:00
Stefano Brivio	712ca32353	seccomp.sh: Try to account for terminal width while formatting list of system calls Avoid excess lines on wide terminals, but make sure we don't fail if we can't fetch the number of columns for any reason, as it's not a fundamental feature and we don't want to break anything with it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 14:30:17 +02:00
David Gibson	e0be6bc2f4	udp: Use dual stack sockets for port forwarding when possible Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this can significantly reduce the amount of kernel memory that passt consumes. We've used such dual stack sockets for TCP since `8e914238b` "tcp: Use dual stack sockets for port forwarding when possible". Add similar support for UDP "listening" sockets. Since UDP sockets don't use as much kernel memory as TCP sockets this isn't as big a saving, but it's still significant. When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all), this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version 6.10.6 on Fedora 40, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:41 +02:00
David Gibson	c78b194001	udp: Remove unnnecessary local from udp_sock_init() The 's' variable is always redundant with either 'r4' or 'r6', so remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:38 +02:00
David Gibson	620e19a1b4	udp: Merge udp[46]_mh_recv arrays We've already gotten rid of most of the IPv4/IPv6 specific data structures in udp.c by merging them with each other. One significant one remains: udp[46]_mh_recv. This was a bit awkward to remove because of a subtle interaction. We initialise the msg_namelen fields to represent the total size we have for a socket address, but when we receive into the arrays those are modified to the actual length of the sockaddr we received. That meant that naively merging the arrays meant that if we received IPv4 datagrams, then IPv6 datagrams, the addresses for the latter would be truncated. In this patch address that by resetting the received msg_namelen as soon as we've found a flow for the datagram. Finding the flow is the only thing that might use the actual sockaddr length, although we in fact don't need it for the time being. This also removes the last use of the 'v6' field from udp_listen_epoll_ref, so remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-27 09:04:25 +02:00
Stefano Brivio	418feb37ec	test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile Some distributions already have OpenSSH 9.8, which introduces split sshd/sshd-session binaries, and there we need to copy the binary from the host, which can be /usr/libexec/openssh/sshd-session (Fedora Rawhide), /usr/lib/ssh/sshd-session (Arch Linux), /usr/lib/openssh/sshd-session (Debian), and possibly other paths. Add at least those three, and, if we don't find sshd-session, assume we don't need it: it could very well be an older version of OpenSSH, as reported by David for Fedora 40, or perhaps another daemon (would Dropbear even work? I'm not sure). Reported-by: David Gibson <david@gibson.dropbear.id.au> Fixes: `d6817b3930` ("test/passt.mbuto: Install sshd-session OpenSSH's split process") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-27 09:03:47 +02:00
Stefano Brivio	1d6142f362	README: pasta is indeed a supported back-end for rootless Docker ...https://github.com/moby/moby/issues/48257 just reminded me. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:05:26 +02:00
Stefano Brivio	f00ebda369	util: Don't stop on unrelated values when looking for --fd in close_open_files() Seen with krun: we get a file descriptor via --fd, but we close it and happily use the same number for TCP files. The issue is that if we also get other options before --fd, with arguments, getopt_long() stops parsing them because it sees them as non-option values. Use the - modifier at the beginning of optstring (before :, which is needed to avoid printing errors) instead of +, which means we'll continue parsing after finding unrelated option values, but getopt_long() won't reorder them anyway: they'll be passed with option value '1', which we can ignore. By the way, we also need to add : after F in the optstring, so that we're able to parse the option when given as short name as well. Now that we change the parsing mode between close_open_files() and conf(), we need to reset optind to 0, not to 1, whenever we call getopt_long() again in conf(), so that the internal initialisation of getopt_long() evaluating GNU extensions is re-triggered. Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828 Fixes: `baccfb95ce` ("conf: Stop parsing options at first non-option argument") Fixes: `09603cab28` ("passt, util: Close any open file that the parent might have leaked") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:53 +02:00
Stefano Brivio	05453ea590	test: Update list of dependencies in README.md Mostly packages we now need to run Podman-based tests. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:30 +02:00
Stefano Brivio	1a66806c18	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf) These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:04:17 +02:00
Stefano Brivio	6e9ecf5741	util: Provide own version of close_range(), and no-op fallback musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial) close_range() implementation. This will probably be added to musl soon, by the way: https://www.openwall.com/lists/musl/2024/08/01/9 Add a weakly-aliased implementation, if it's supported by the kernel. If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31 file descriptors calling close() on them is probably not a good idea. Reported-by: lemmi <lemmi@nerd2nerd.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:48 +02:00
Stefano Brivio	7291b70ba7	udp_flow: Add missing unistd.h include for close() For some reason, this is reported only with musl, and older glibc versions (2.31, at least). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:34 +02:00
Stefano Brivio	396307541e	test: Duplicate existing recvfrom() valgrind suppression for recv() Some architectures, including i686, actually have a recv() system call, not just a recvfrom(), and we need to cover the recv() with MSG_TRUNC into a NULL buffer for them as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:23 +02:00
Stefano Brivio	d6817b3930	test/passt.mbuto: Install sshd-session OpenSSH's split process OpenSSH now ships a per-session binary, sshd-session, with sshd acting as mere listener. It's typically not found in $PATH, so specify the whole path at which it's commonly installed in $PROGS. Link: https://www.openssh.com/releasenotes.html#9.8p1 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:03:03 +02:00
Stefano Brivio	34be8eeb38	test/passt.mbuto: Run sshd from vsock proxy with absolute path ...OpenSSH >= 9.8 otherwise complains that: sshd requires execution with an absolute path Link: https://bugs.gentoo.org/936041 Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:02:37 +02:00
Stefano Brivio	aded2b671c	test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386) It's qemu-system-i386, but uname -m reports i686. I didn't test i486 and i586. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:01:48 +02:00
Stefano Brivio	2aea1da143	treewide: Allow additional system calls for i386/i686 I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:00:43 +02:00
David Gibson	57b7bd2a48	fwd, conf: Allow NAT of the guest's assigned address The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --map-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable. Add a new --map-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback. If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:40 +02:00
David Gibson	8436c0d61b	fwd: Distinguish translatable from untranslatable addresses on inbound fwd_nat_from_host() needs to adjust the source address for new flows coming from an address which is not accessible to the guest. Currently we always use our_tap_addr or our_tap_ll. However in cases where the address is accessible to the guest via translation (i.e. via --map-host-loopback) then it makes more sense to use that translation, rather than the fallback mapping of our_tap_*. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:37 +02:00
David Gibson	e813a4df7d	conf: Allow address remapped to host to be configured Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback. Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --map-host-loopback option. In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:35 +02:00
David Gibson	dbaaebbe00	test: Reconfigure IPv6 address after changing MTU In the TCP throughput tests, we adjust the guest's MTU in order to test various packet sizes. Some of those are below 1280 which causes IPv6 to be deconfigured on the guest interface. When we increase it above 1280 again, IPv6 is re-enabled and we get an address in the right prefix with NDP, but we don't get exactly the expected address back - that's only communicated with --config-net or DHCPv6. With changes to how we handle NAT this can cause some of the IPv6 tests to fail, because they don't use the address that passt/pasta expects, and the guest doesn't initiate any traffic which allows us to learn what the new address is. Work around this by re-invoking dhclient -6 between adjusting the MTU and running IPv6 test cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:33 +02:00
David Gibson	935bd81936	conf, fwd: Split notion of gateway/router from guest-visible host address The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host. Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback. Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:31 +02:00
David Gibson	90e83d50a9	Don't take "our" MAC address from the host When sending frames to the guest over the tap link, we need a source MAC address. Currently we take that from the MAC address of the main interface on the host, but that doesn't actually make much sense: * We can't preserve the real MAC address of packets from anywhere external so there's no transparency case here * In fact, it's confusingly different from how we handle IP addresses: whereas we give the guest the same IP as the host, we're making the host's MAC the one MAC that the guest can't use for itself. * We already need a fallback case if the host doesn't have an Ethernet like MAC (e.g. if it's connected via a point to point interface, such as a wireguard VPN). Change to just just use an arbitrary fixed MAC address - I've picked 9a:55:9a:55:9a:55. It's simpler and has the small advantage of making the fact that passt/pasta is in use typically obvious from guest side packet dumps. This can still, of course, be overridden with the -M option. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:28 +02:00
David Gibson	356de97e43	fwd: Split notion of "our tap address" from gateway for IPv4 ip4.gw conflates 3 conceptually different things, which (for now) have the same value: 1. The router/gateway address as seen by the guest 2. An address to NAT to the host with --no-map-gw isn't specified 3. An address to use as source when nothing else makes sense Case 3 occurs in two situations: a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address). (b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it. For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well. Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here (e.g. if the host is disconnected or only has a point to point link with no gateway address). In that case we have to disable forwarding of inbound connections with guest-inaccessible source addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:26 +02:00
David Gibson	4d8dd1fbe7	fwd: Helpers to clarify what host addresses aren't guest accessible We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view. Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future. While we're there fix one unneccessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:23 +02:00
David Gibson	975cfa5f32	Initialise our_tap_ll to ip6.gw when suitable In every place we use our_tap_ll, we only use it as a fallback if the IPv6 gateway address is not link-local. We can avoid that conditional at use time by doing it at initialisation of our_tap_ll instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:22 +02:00
David Gibson	8d4baa4446	Clarify which addresses in ip[46]_ctx are meaningful where Some are guest visible addresses and may not be valid on the host, others are host visible addresses and may not be valid on the guest. Rearrange and comment the ip[46]_ctx definitions to make it clearer which is which. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:19 +02:00
David Gibson	a42fb9c000	treewide: Change misleading 'addr_ll' name c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:16 +02:00
David Gibson	c9f0ec3227	util: Correct sock_l4() binding for link local addresses When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the host interface's link local address, the actual purpose of it is to provide a link local address for passt's private use on the tap interface. Instead set the scope id for any link-local address we're binding to. We're going to need something and this is what makes sense for sockets on the host. It doesn't make sense for PIF_SPLICE sockets, but those should always have loopback, not link-local addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:13 +02:00
David Gibson	57532f1ded	conf: Remove incorrect initialisation of addr_ll_seen Despite the names, addr_ll_seen does not relate to addr_ll the same way addr_seen relates to addr. addr_ll_seen is an observed address from the guest, whereas addr_ll is our link-local address for use on the tap link when we can't use an external endpoint address. It's used both for passt provided services (DHCPv6, NDP) and in some cases for connections from addresses the guest can't access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:10 +02:00
David Gibson	0b25cac94e	conf: Treat --dns addresses as guest visible addresses Although it's not 100% explicit in the man page, addresses given to the --dns option are intended to be addresses as seen by the guest. This differs from addresses taken from the host's /etc/resolv.conf, which must be translated to guest accessible versions in some cases. Our implementation is currently inconsistent on this: when using --dns-forward, you must usually also give --dns with the matching address, which is meaningful only in the guest's address view. However if you give --dns with a loopback addres, it will be translated like a host view address. Move the remapping logic for DNS addresses out of add_dns4() and add_dns6() into add_dns_resolv() so that it is only applied for host nameserver addresses, not for nameservers given explicitly with --dns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:08 +02:00
David Gibson	a6066f4e27	conf: Correct setting of dns_match address in add_dns6() add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to the given address, rather than the gateway address. This is doubly wrong: - We've just established the given address is a host loopback address the guest can't access - We've just set ip6.dns[] to tell the guest to use the gateway address, so it won't use the dns_match address we're setting Correct this to use the gateway address, like IPv4. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:06 +02:00
David Gibson	7c083ee41c	conf: Move adding of a nameserver from resolv.conf into subfunction get_dns() is already quite deeply nested, and future changes I have in mind will add more complexity. Prepare for this by splitting out the adding of a single nameserver to the configuration into its own function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:04 +02:00

1 2 3 4 5 ...

1677 commits