passt

mirror of https://passt.top/passt synced 2025-04-11 15:45:03 +02:00

Author	SHA1	Message	Date
David Gibson	f6d5a52392	conf: Delay handling -D option until after addresses are configured add_dns[46]() rely on the gateway address and c->no_map_gw being already initialised, in order to properly handle DNS servers which need NAT to be accessed from the guest. Usually these are called from get_dns() which is well after the addresses are configured, so that's fine. However, they can also be called earlier if an explicit -D command line option is given. In this case no_map_gw and/or c->ip[46].gw may not get be initialised properly, leading to this doing the wrong thing. Luckily we already have a second pass of option parsing for things which need addresses to already be configured. Move handling of -D to there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-12 21:29:36 +02:00
David Gibson	86bdd968ea	Correct inaccurate comments on ip[46]_ctx::addr These fields are described as being an address for an external, routable interface. That's not necessarily the case when using -a. But, more importantly, saying where the value comes from is not as useful as what it's used for. The real purpose of this field is as the address which we assign to the guest via DHCP or --config-net. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-12 21:29:21 +02:00
Stefano Brivio	fecb1b65b1	log: Don't prefix message with timestamp on --debug if it's a continuation If we prefix the second part of messages printed through logmsg_perror() by the timestamp, on debug, we'll have two timestamps and a weird separator in the result, such as this beauty: 0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted Add a parameter to logmsg() and vlogmsg() which indicates a message continuation. If that's set, don't print the timestamp in vlogmsg(). Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-12 16:21:53 +02:00
Stefano Brivio	baccfb95ce	conf: Stop parsing options at first non-option argument Given that pasta supports specifying a command to be executed on the command line, even without the usual -- separator as long as there's no ambiguity, we shouldn't eat up options that are not meant for us. Paul reports, for instance, that with: pasta --config-net ip -6 route -6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'. That's because getopt_long(), by default, shuffles the argument list to shift non-option arguments at the end. Avoid that by adding '+' at the beginning of 'optstring'. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-08 21:34:06 +02:00
Stefano Brivio	09603cab28	passt, util: Close any open file that the parent might have leaked If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>	2024-08-08 21:31:25 +02:00
David Gibson	755f9fd911	nstool: Propagate SIGTERM to processes executed in the namespace Particularly in shell it's sometimes natural to save the pid from a process run and later kill it. If doing this with nstool exec, however, it will kill nstool itself, not the program it is running, which isn't usually what you want or expect. Address this by having nstool propagate SIGTERM to its child process. It may make sense to propagate some other signals, but some introduce extra complications, so we'll worry about them when and if it seems useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:48 +02:00
David Gibson	5ca61c2f34	nstool: Fix some trivial typos Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:45 +02:00
David Gibson	a628cb93a7	log: Avoid duplicate calls to logtime() We use logtime() to get a timestamp for the log in two places: - in vlogmsg(), which is used only for debug_print messages - in logfile_write() which is only used messages to the log file These cases are mutually exclusive, so we don't ever print the same message with different timestamps, but that's not particularly obvious to see. It's possible future tweaks to logging logic could mean we log to two different places with different timestamps, which would be confusing. Refactor to have a single logtime() call in vlogmsg() and use it for all the places we need it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:40 +02:00
David Gibson	2c7558dc43	log: Handle errors from clock_gettime() clock_gettime() can, theoretically, fail, although it probably won't until 2038 on old 32-bit systems. Still, it's possible someone could run with a wildly out of sync clock, or new errors could be added, or it could fail due to a bug in libc or the kernel. We don't handle this well. In the debug_print case in vlogmsg we'll just ignore the failure, and print a timestamp based on uninitialised garbage. In logfile_write() we exit early and won't log anything at all, which seems like a good way to make an already weird situation undebuggable. Add some helpers to instead handle this by using "<error>" in place of a timestamp if something goes wrong with clock_gettime(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:29 +02:00
David Gibson	b91bae1ded	log: Correct formatting of timestamps logtime_fmt_and_arg() is a rather odd macro, producing both a format string and an argument, which can only be used in quite specific printf() like formulations. It also has a significant bug: it tries to display 4 digits after the decimal point (so down to tenths of milliseconds) using %04i. But the field width in printf() is always a minimum not maximum field width, so this will not truncate the given value, but will redisplay the entire tenth-of-milliseconds difference again after the decimal point. Replace the macro with an snprintf() like function which will format the timestamp, and use an explicit % to correct the display. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Make logtime_fmt() static] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:10 +02:00
David Gibson	95569e4aa4	util: Some corrections for timespec_diff_us The comment for timespec_diff_us() claims it will wrap after 2^64µs. This is incorrect for two reasons: * It returns a long long, which is probably 64-bits, but might not be * It returns a signed value, so even if it is 64 bits it will wrap after 2^63µs Correct the comment and use an explicitly 64-bit type to avoid that imprecision. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:15:40 +02:00
Stefano Brivio	fbb0c9523e	conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only Paul reports that setting IPv4 address and gateway manually, using --address and --gateway, causes pasta to fail inserting IPv6 routes in a setup where multiple, inter-dependent IPv6 routes are present on the host. That's because, currently, any -g option implies --no-copy-routes altogether, and any -a implies --no-copy-addrs. Limit this implication to the matching IP version, instead, by having two copies of no_copy_routes and no_copy_addrs in the context structure, separately for IPv4 and IPv6. While at it, change them to 'bool': we had them as 'int' because getopt_long() used to set them directly, but it hasn't been the case for a while already. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-07 09:15:25 +02:00
Stefano Brivio	ee36266a55	log, passt: Keep printing to stderr when passt is running in foreground There are two cases where we want to stop printing to stderr: if it's closed, and if pasta spawned a shell (and --debug wasn't given). But if passt is running in foreground, we currently stop to report any message, even error messages, once we're ready, as reported by Laurent, because we set the log_runtime flag, which we use to indicate we're ready, regardless of whether we're running in foreground or not. Turn that flag (back) to log_stderr, and set it only when we really want to stop printing to stderr. Reported-by: Laurent Vivier <lvivier@redhat.com> Fixes: `afd9cdc9bb` ("log, passt: Always print to stderr before initialisation is complete") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-06 15:03:48 +02:00
Stefano Brivio	3a082c4ecb	tcp_splice: Fix side in OUT_WAIT flag setting If the "from" (input) side for a given transfer is 0, and we can't complete the write right away, what we need to be waiting for is for output readiness on side 1, not 0, and the other way around as well. This causes random transfer failures for local TCP connections, depending if we ever need to wait for output readiness. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/23517 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com>	2024-08-06 15:03:31 +02:00
David Gibson	031df332e9	util: Use unsigned (size_t) value for iov length The "correct" type for the length of an IOV is unclear: writev() and readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the unsigned size_t has some advantages, though, and it makes more sense for the case of write_remainder. Using size_t throughout here means we don't have a signed vs. unsigned comparison, and we don't have to deal with the case of iov_skip_bytes() returning a value which becomes negative when assigned to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-06 15:01:46 +02:00
Laurent Vivier	e877f905e5	udp_flow: move all udp_flow functions to udp_flow.c No code change. They need to be exported to be available by the vhost-user version of passt. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 17:38:17 +02:00
Laurent Vivier	623ceb1f2b	udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock() To be used with the vhost-user version of udp.c, we need to export the udp_flow functions. To avoid to export udp_meta_t too that is specific to the socket version of udp.c, don't pass udp_meta_t to it, but the only needed field, s_in. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 17:37:53 +02:00
David Gibson	a5bbefa6fb	log: Make logfile_write() private logfile_write() is not used outside log.c, nor should it be. It should only be used externall via the general logging functions. Make it static in log.c. To avoid forward declarations this requires moving a bunch of functions earlier in the file. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 15:03:33 +02:00
Stefano Brivio	f30ed68c52	pasta: Save errno on signal handler entry, restore on return when needed Ed reported this: # Error: pasta failed with exit code 1: # Couldn't drop cap 3 from bounding set # : No child processes in a Podman CI run with tests being run in parallel. The error message itself, by the way, is fixed by commit `1cd773081f` ("log: Drop newlines in the middle of the perror()-like messages"), but how can we possibly get ECHILD as failure code for prctl()? Well, we don't, but if we exit early enough, pasta_child_handler() might run before we're even done with isolation steps, and it calls waitid(), which sets errno. We need to restore it before returning from the signal handler (if we return after calling functions that might set it), as signal-safety(7) also implies: Fetching and setting the value of errno is async-signal-safe provided that the signal handler saves errno on entry and restores its value before returning. Eventually, we'll probably need to switch to signalfd(2) the day we want to implement multithreading, but this will do for the moment. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/23478 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-05 15:02:36 +02:00
Danish Prakash	0149d11cc5	pasta: modify hostname when detaching new namespace When invoking pasta without any arguments, it's difficult to tell whether we are in the new namespace or not leaving users a bit confused. This change modifies the host namespace to add a prefix "pasta-" to make it a bit more obvious. Signed-off-by: Danish Prakash <contact@danishpraka.sh> [sbrivio: coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-30 17:27:43 +02:00
AbdAlRahman Gad	8fae3b73cb	Fix typo in README file - remove duplicated 'the' in the 'Services' section Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-29 19:02:35 +02:00
Stefano Brivio	f87b11c7be	fedora/rpkg: List myself as author for changelog entries ...instead of the latest author for contrib/fedora. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 16:40:41 +02:00
David Gibson	57a21d2df1	tap: Improve handling of partially received frames on qemu socket Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send then, but we can't rely on it. Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly: * If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu * If a the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:42 +02:00
David Gibson	37e3b24d90	tap: Correctly handle frames of odd length The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address. Currently we load the frame length by just casting a char pointer to (uint32_t ) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't casting an incorrectly aligned pointer to (uint32_t ) is undefined behaviour, strictly speaking. Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:38 +02:00
David Gibson	4684f60344	tap: Don't use EPOLLET on Qemu sockets Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:20 +02:00
David Gibson	9e3f2355c4	tap: Don't attempt to carry on if we get a bad frame length from qemu If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elswhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:05:44 +02:00
David Gibson	a06db27c49	tap: Better report errors receiving from QEMU socket If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we realy should report something for debugability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 13:51:23 +02:00
Stefano Brivio	77c092ee5e	log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME We report relative timestamps in logs, so we want to avoid jumps in the system time. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:36 +02:00
Stefano Brivio	e5c37ba0f4	log: Initialise timestamp for relative log time also if we use a log file ...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:34 +02:00
Stefano Brivio	327d9d482f	log, util: Fix sub-second part in relative log time calculation For some reason, in commit `01efc71ddd` ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part. Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating the difference from the starting time. Do this in a macro as we need the same format in a few places. To calculate the difference, turn the existing timespec_diff_ms() to microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to use that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:43:19 +02:00
Stefano Brivio	2ce1d37831	test/lib/perf_report: Fix highlight Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-25 17:56:47 +02:00
David Gibson	e9a542321f	test: Fix spurious test failure with systemd-resolved systemd-resolved has the rather strange behaviour of listening on the non-standard loopback address 127.0.0.53. Various changes we've made in passt mean that we now usually work fine on a host using systemd-resolved. However our tests still fail in this case. We have a special case for when the guest's resolv.conf needs to differ from the host's because the resolver is on a host loopback address. However, we only consider the case where the host resolver is on 127.0.0.1, not other loopback addresses. Correct this with a different test condition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-25 12:39:00 +02:00
David Gibson	becf81ab88	fwd: Broaden what we consider for DNS specific forwarding rules passt/pasta has options to redirect DNS requests from the guest to a different server address on the host side. Currently, however, only UDP packets to port 53 are considered "DNS requests". This ignores DNS requests over TCP - less common, but certainly possible. It also ignores encrypted DNS requests on port 853. Extend the DNS forwarding logic to handle both of those cases. Link: https://github.com/containers/podman/issues/23239 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-25 12:38:11 +02:00
David Gibson	0ada84e3f8	fwd: Refactor tests in fwd_nat_from_tap() for clarity Currently, we start by handling the common case, where we don't translate the destination address, then we modify the tgt side for the special cases. In the process we do comparisons on the tentatively set fields in tgt, which obscures the fact that tgt should be an essentially pure function of ini, and risks people examining fields of tgt that are not yet initialized. To make this clearer, do all our tests on 'ini', constructing tgt from scratch on that basis. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-25 12:37:57 +02:00
Stefano Brivio	4a333c88d7	conf: Accept addresses enclosed by square brackets in port forwarding specifiers Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit. Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools). Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:56 +02:00
Stefano Brivio	6ff702f325	tap: Exit if we fail to bind a UNIX domain socket with explicit path In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:42 +02:00
Stefano Brivio	f72d35a78d	test: iperf3 3.16 introduces multiple threads, drop our own implementation of that Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread. So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit. Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:38 +02:00
Stefano Brivio	606e0c7b95	test: Update names of symbols and slabinfo entries Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed. Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:29 +02:00
Stefano Brivio	f16f8f5bf6	test: Fix memory/passt tests, --netns-only is not a valid option for passt This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit `65923ba798` ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option. I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs. Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces. Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0. Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:08 +02:00
Stefano Brivio	1cd773081f	log: Drop newlines in the middle of the perror()-like messages Calling vlogmsg() twice from logmsg_perror() results in this beauty: $ ./pasta -i foo Invalid interface name foo : No such device because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it. Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:25:31 +02:00
Stefano Brivio	13295583f8	tcp: Change SO_PEEK_OFF support message to debug() This: $ ./pasta SO_PEEK_OFF not supported # is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue. Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context. Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:25:26 +02:00
Stefano Brivio	d19b396f11	tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:25:05 +02:00
David Gibson	a09aeb4bd6	tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards. Although it was discussed during development, somewhere during the shuffles the case where we move the pointer backwards because we lost frames while sending them to the guest. This can happen, for example, if the socket buffer on the Unix socket to qemu overflows. Fixing this is slightly complicated because we need to pass a non-const context pointer to some places we previously didn't need it. While we're there also fix a small stylistic issue in the function comment for tcp_revert_seq() - it was using spaces instead of tabs. Fixes: `e63d281871` ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-24 09:27:46 +02:00
Jon Maloy	9cb6b50815	tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6 Based on an original patch by Jon Maloy: -- The recently added socket option SO_PEEK_OFF is not supported for TCP/IPv6 sockets. Until we get that support into the kernel we need to test for support in both protocols to set the global 'peek_offset_cap´ to true. -- Compared to the original patch: - only check for SO_PEEK_OFF support for enabled IP versions - use sa_family_t instead of int to pass the address family around Fixes: `e63d281871` ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-23 16:42:27 +02:00
David Gibson	882599e180	udp: Rename UDP listening sockets EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:34:01 +02:00
David Gibson	d29fa0856e	udp: Remove rdelta port forwarding maps In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:57 +02:00
David Gibson	d89b3aa097	udp: Remove obsolete socket tracking Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[ or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:55 +02:00
David Gibson	898f797174	udp: Direct datagrams from host to guest via flow table This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:51 +02:00
David Gibson	b7ad19347f	udp: Find or create flows for datagrams from tap interface Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons (perhaps increased stack usage?), this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_sigaction and others in the valgrind target, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:48 +02:00
David Gibson	8126f7a660	udp: Remove obsolete splice tracking Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:45 +02:00

1 2 3 4 5 ...

1709 commits