passt

mirror of https://passt.top/passt synced 2025-05-15 14:25:34 +02:00

Author	SHA1	Message	Date
David Gibson	0b25cac94e	conf: Treat --dns addresses as guest visible addresses Although it's not 100% explicit in the man page, addresses given to the --dns option are intended to be addresses as seen by the guest. This differs from addresses taken from the host's /etc/resolv.conf, which must be translated to guest accessible versions in some cases. Our implementation is currently inconsistent on this: when using --dns-forward, you must usually also give --dns with the matching address, which is meaningful only in the guest's address view. However if you give --dns with a loopback addres, it will be translated like a host view address. Move the remapping logic for DNS addresses out of add_dns4() and add_dns6() into add_dns_resolv() so that it is only applied for host nameserver addresses, not for nameservers given explicitly with --dns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:08 +02:00
David Gibson	a6066f4e27	conf: Correct setting of dns_match address in add_dns6() add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to the given address, rather than the gateway address. This is doubly wrong: - We've just established the given address is a host loopback address the guest can't access - We've just set ip6.dns[] to tell the guest to use the gateway address, so it won't use the dns_match address we're setting Correct this to use the gateway address, like IPv4. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:06 +02:00
David Gibson	7c083ee41c	conf: Move adding of a nameserver from resolv.conf into subfunction get_dns() is already quite deeply nested, and future changes I have in mind will add more complexity. Prepare for this by splitting out the adding of a single nameserver to the configuration into its own function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:04 +02:00
David Gibson	1d10760c9f	conf: Move DNS array bounds checks into add_dns[46] Every time we call add_dns[46] we need to first check if there's space in the c->ip[46].dns array for the new entry. We might as well make that check in add_dns[46]() itself. In fact it looks like the calls in get_dns() had an off by one error, not allowing the last entry of the array to be filled. So, that bug is also fixed by the change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:02 +02:00
David Gibson	6852bd07cc	conf: More accurately count entries added in get_dns() get_dns() counts the number of guest DNS servers it adds, and gives an error if it couldn't add any. However, this count ignores the fact that add_dns[46]() may in some cases not add an entry. Use the array indices we're already tracking to get an accurate count. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 12:00:00 +02:00
David Gibson	c679894668	conf: Use array indices rather than pointers for DNS array slots Currently add_dns[46]() take a somewhat awkward double pointer to the entry in the c->ip[46].dns array to update. It turns out to be easier to work with indices into that array instead. This diff does add some lines, but it's comments, and will allow some future code reductions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:58 +02:00
David Gibson	ceea52ca93	treewide: Use struct assignment instead of memcpy() for IP addresses We rely on C11 already, so we can use clearer and more type-checkable struct assignment instead of mempcy() for copying IP addresses around. This exposes some "pointer could be const" warnings from cppcheck, so address those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:56 +02:00
David Gibson	905ecd2b0b	treewide: Rename MAC address fields for clarity c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:54 +02:00
David Gibson	066e69986b	util: Helper for formatting MAC addresses There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:51 +02:00
David Gibson	e6feb5a892	treewide: Use "our address" instead of "forwarding address" The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead. (While we're there correct an error in flow_initiate_af()s comments where we referred to parameters by the wrong name). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:29 +02:00
Stefano Brivio	32c386834d	netlink: Fix typo in function comment for nl_addr_set() Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-18 01:29:52 +02:00
Stefano Brivio	f4e9f26480	pasta: Disable neighbour solicitations on device up to prevent DAD As soon as we the kernel notifier for IPv6 address configuration (addrconf_notify()) sees that we bring the target interface up (NETDEV_UP), it will schedule duplicate address detection, so, by itself, setting the nodad flag later is useless, because that won't stop a detection that's already in progress. However, if we disable neighbour solicitations with IFF_NOARP (which is a misnomer for IPv6 interfaces, but there's no possibility of mixing things up), the notifier will not trigger DAD, because it can't be done, of course, without neighbour solicitations. Set IFF_NOARP as we bring up the device, and drop it after we had a chance to set the nodad attribute on the link. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-18 01:29:52 +02:00
Stefano Brivio	d6f0220731	netlink, pasta: Fetch link-local address from namespace interface once it's up As soon as we bring up the interface, the Linux kernel will set up a link-local address for it, so we can fetch it and start using right away, if we need a link-local address to communicate to the container before we see any traffic coming from it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-18 01:29:52 +02:00
Stefano Brivio	74e508cf79	netlink, pasta: Disable DAD for link-local addresses on namespace interface It makes no sense for a container or a guest to try and perform duplicate address detection for their link-local address, as we'll anyway not relay neighbour solicitations with an unspecified source address. While they perform duplicate address detection, the link-local address is not usable, which prevents us from bringing up especially containers and communicate with them right away via IPv6. This is not enough to prevent DAD and reach the container right away: we'll need a couple more patches. As we send NLM_F_REPLACE requests right away, while we still have to read out other addresses on the same socket, we can't use nl_do(): keep track of the last sequence we sent (last address we changed), and deal with the answers to those NLM_F_REPLACE requests in a separate loop, later. Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-18 01:29:38 +02:00
Stefano Brivio	0c74068f56	netlink, pasta: Turn nl_link_up() into a generic function to set link flags In the next patches, we'll reuse it to set flags other than IFF_UP. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-15 09:14:47 +02:00
Stefano Brivio	8231ce54c3	netlink, pasta: Split MTU setting functionality out of nl_link_up() As we'll use nl_link_up() for more than just bringing up devices, it will become awkward to carry empty MTU values around whenever we call it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-15 09:14:43 +02:00
Stefano Brivio	b91d3373ac	netlink: Fix typo in function comment for nl_addr_get() Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-15 09:14:29 +02:00
Stefano Brivio	946206437a	test: Speed up by cutting on eye candy and performance test duration We have a number of delays when we switch to new layouts that were added to make the tests visually easier to follow, together with blinking status bars. Shorten the delays and avoid blinking the status bar if $FAST is set to 1 (no demo mode). Shorten delays in busy loops to 10ms, instead of 100ms, and skip the one-second fixed delay when we wait for the status of a command. Cut the duration of throughput and latency tests to one second, down from ten. Somewhat surprisingly, the results we get are rather consistent, and not significantly different from what we'd get with 10 seconds. This, together with Podman's commit 20f3e8909e3a ("test/system: pasta_test_do add explicit port check"), cuts the time needed on my setup for full test run from approximately 37 minutes to...: $ time ./run [exited] PASS: 165, FAIL: 0 Log at /home/sbrivio/passt/test/test_logs/test.log real 15m34.253s user 0m0.011s sys 0m0.011s Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-15 09:13:15 +02:00
David Gibson	61c0b0d0f1	flow: Don't crash if guest attempts to connect to port 0 Using a zero port on TCP or UDP is dubious, and we can't really deal with forwarding such a flow within the constraints of the socket API. Hence we ASSERT()ed that we had non-zero ports in flow_hash(). The intention was to make sure that the protocol code sanitizes such ports before completing a flow entry. Unfortunately, flow_hash() is also called on new packets to see if they have an existing flow, so the unsanitized guest packet can crash passt with the assert. Correct this by moving the assert from flow_hash() to flow_sidx_hash() which is only used on entries already in the table, not on unsanitized data. Reported-by: Matt Hamilton <matt@thmail.io> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-14 12:20:31 +02:00
David Gibson	baba284912	conf: Don't ignore -t and -u options after -D `f6d5a52392` moved handling of -D into a later loop. However as a side effect it moved this from a switch block to an if block. I left a couple of 'break' statements that don't make sense in the new context. They should be 'continue' so that we go onto the next option, rather than leaving the loop entirely. Fixes: `f6d5a52392` ("conf: Delay handling -D option until after addresses are configured") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-14 09:14:12 +02:00
AbdAlRahman Gad	c16141eda5	ndp.c: Turn NDP responder into more declarative implementation - Add structs for NA, RA, NS, MTU, prefix info, option header, link-layer address, RDNSS, DNSSL and link-layer for RA message. - Turn NA message from purely imperative, going byte by byte, to declarative by filling it's struct. - Turn part of RA message into declarative. - Move packet_add() to be before the call of ndp() in tap6_handler() if the protocol of the packet is ICMPv6. - Add a pool of packets as an additional parameter to ndp(). - Check the size of NS packet with packet_get() before sending an NA packet. - Add documentation for the structs. - Add an enum for NDP option types. Link: https://bugs.passt.top/show_bug.cgi?id=21 Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> [sbrivio: Minor coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-13 19:46:16 +02:00
David Gibson	f6d5a52392	conf: Delay handling -D option until after addresses are configured add_dns[46]() rely on the gateway address and c->no_map_gw being already initialised, in order to properly handle DNS servers which need NAT to be accessed from the guest. Usually these are called from get_dns() which is well after the addresses are configured, so that's fine. However, they can also be called earlier if an explicit -D command line option is given. In this case no_map_gw and/or c->ip[46].gw may not get be initialised properly, leading to this doing the wrong thing. Luckily we already have a second pass of option parsing for things which need addresses to already be configured. Move handling of -D to there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-12 21:29:36 +02:00
David Gibson	86bdd968ea	Correct inaccurate comments on ip[46]_ctx::addr These fields are described as being an address for an external, routable interface. That's not necessarily the case when using -a. But, more importantly, saying where the value comes from is not as useful as what it's used for. The real purpose of this field is as the address which we assign to the guest via DHCP or --config-net. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-12 21:29:21 +02:00
Stefano Brivio	fecb1b65b1	log: Don't prefix message with timestamp on --debug if it's a continuation If we prefix the second part of messages printed through logmsg_perror() by the timestamp, on debug, we'll have two timestamps and a weird separator in the result, such as this beauty: 0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted Add a parameter to logmsg() and vlogmsg() which indicates a message continuation. If that's set, don't print the timestamp in vlogmsg(). Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-12 16:21:53 +02:00
Stefano Brivio	baccfb95ce	conf: Stop parsing options at first non-option argument Given that pasta supports specifying a command to be executed on the command line, even without the usual -- separator as long as there's no ambiguity, we shouldn't eat up options that are not meant for us. Paul reports, for instance, that with: pasta --config-net ip -6 route -6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'. That's because getopt_long(), by default, shuffles the argument list to shift non-option arguments at the end. Avoid that by adding '+' at the beginning of 'optstring'. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-08 21:34:06 +02:00
Stefano Brivio	09603cab28	passt, util: Close any open file that the parent might have leaked If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>	2024-08-08 21:31:25 +02:00
David Gibson	755f9fd911	nstool: Propagate SIGTERM to processes executed in the namespace Particularly in shell it's sometimes natural to save the pid from a process run and later kill it. If doing this with nstool exec, however, it will kill nstool itself, not the program it is running, which isn't usually what you want or expect. Address this by having nstool propagate SIGTERM to its child process. It may make sense to propagate some other signals, but some introduce extra complications, so we'll worry about them when and if it seems useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:48 +02:00
David Gibson	5ca61c2f34	nstool: Fix some trivial typos Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:45 +02:00
David Gibson	a628cb93a7	log: Avoid duplicate calls to logtime() We use logtime() to get a timestamp for the log in two places: - in vlogmsg(), which is used only for debug_print messages - in logfile_write() which is only used messages to the log file These cases are mutually exclusive, so we don't ever print the same message with different timestamps, but that's not particularly obvious to see. It's possible future tweaks to logging logic could mean we log to two different places with different timestamps, which would be confusing. Refactor to have a single logtime() call in vlogmsg() and use it for all the places we need it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:40 +02:00
David Gibson	2c7558dc43	log: Handle errors from clock_gettime() clock_gettime() can, theoretically, fail, although it probably won't until 2038 on old 32-bit systems. Still, it's possible someone could run with a wildly out of sync clock, or new errors could be added, or it could fail due to a bug in libc or the kernel. We don't handle this well. In the debug_print case in vlogmsg we'll just ignore the failure, and print a timestamp based on uninitialised garbage. In logfile_write() we exit early and won't log anything at all, which seems like a good way to make an already weird situation undebuggable. Add some helpers to instead handle this by using "<error>" in place of a timestamp if something goes wrong with clock_gettime(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:29 +02:00
David Gibson	b91bae1ded	log: Correct formatting of timestamps logtime_fmt_and_arg() is a rather odd macro, producing both a format string and an argument, which can only be used in quite specific printf() like formulations. It also has a significant bug: it tries to display 4 digits after the decimal point (so down to tenths of milliseconds) using %04i. But the field width in printf() is always a minimum not maximum field width, so this will not truncate the given value, but will redisplay the entire tenth-of-milliseconds difference again after the decimal point. Replace the macro with an snprintf() like function which will format the timestamp, and use an explicit % to correct the display. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Make logtime_fmt() static] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:16:10 +02:00
David Gibson	95569e4aa4	util: Some corrections for timespec_diff_us The comment for timespec_diff_us() claims it will wrap after 2^64µs. This is incorrect for two reasons: * It returns a long long, which is probably 64-bits, but might not be * It returns a signed value, so even if it is 64 bits it will wrap after 2^63µs Correct the comment and use an explicitly 64-bit type to avoid that imprecision. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-07 09:15:40 +02:00
Stefano Brivio	fbb0c9523e	conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only Paul reports that setting IPv4 address and gateway manually, using --address and --gateway, causes pasta to fail inserting IPv6 routes in a setup where multiple, inter-dependent IPv6 routes are present on the host. That's because, currently, any -g option implies --no-copy-routes altogether, and any -a implies --no-copy-addrs. Limit this implication to the matching IP version, instead, by having two copies of no_copy_routes and no_copy_addrs in the context structure, separately for IPv4 and IPv6. While at it, change them to 'bool': we had them as 'int' because getopt_long() used to set them directly, but it hasn't been the case for a while already. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-07 09:15:25 +02:00
Stefano Brivio	ee36266a55	log, passt: Keep printing to stderr when passt is running in foreground There are two cases where we want to stop printing to stderr: if it's closed, and if pasta spawned a shell (and --debug wasn't given). But if passt is running in foreground, we currently stop to report any message, even error messages, once we're ready, as reported by Laurent, because we set the log_runtime flag, which we use to indicate we're ready, regardless of whether we're running in foreground or not. Turn that flag (back) to log_stderr, and set it only when we really want to stop printing to stderr. Reported-by: Laurent Vivier <lvivier@redhat.com> Fixes: `afd9cdc9bb` ("log, passt: Always print to stderr before initialisation is complete") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-06 15:03:48 +02:00
Stefano Brivio	3a082c4ecb	tcp_splice: Fix side in OUT_WAIT flag setting If the "from" (input) side for a given transfer is 0, and we can't complete the write right away, what we need to be waiting for is for output readiness on side 1, not 0, and the other way around as well. This causes random transfer failures for local TCP connections, depending if we ever need to wait for output readiness. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/23517 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com>	2024-08-06 15:03:31 +02:00
David Gibson	031df332e9	util: Use unsigned (size_t) value for iov length The "correct" type for the length of an IOV is unclear: writev() and readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the unsigned size_t has some advantages, though, and it makes more sense for the case of write_remainder. Using size_t throughout here means we don't have a signed vs. unsigned comparison, and we don't have to deal with the case of iov_skip_bytes() returning a value which becomes negative when assigned to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-06 15:01:46 +02:00
Laurent Vivier	e877f905e5	udp_flow: move all udp_flow functions to udp_flow.c No code change. They need to be exported to be available by the vhost-user version of passt. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 17:38:17 +02:00
Laurent Vivier	623ceb1f2b	udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock() To be used with the vhost-user version of udp.c, we need to export the udp_flow functions. To avoid to export udp_meta_t too that is specific to the socket version of udp.c, don't pass udp_meta_t to it, but the only needed field, s_in. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 17:37:53 +02:00
David Gibson	a5bbefa6fb	log: Make logfile_write() private logfile_write() is not used outside log.c, nor should it be. It should only be used externall via the general logging functions. Make it static in log.c. To avoid forward declarations this requires moving a bunch of functions earlier in the file. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-05 15:03:33 +02:00
Stefano Brivio	f30ed68c52	pasta: Save errno on signal handler entry, restore on return when needed Ed reported this: # Error: pasta failed with exit code 1: # Couldn't drop cap 3 from bounding set # : No child processes in a Podman CI run with tests being run in parallel. The error message itself, by the way, is fixed by commit `1cd773081f` ("log: Drop newlines in the middle of the perror()-like messages"), but how can we possibly get ECHILD as failure code for prctl()? Well, we don't, but if we exit early enough, pasta_child_handler() might run before we're even done with isolation steps, and it calls waitid(), which sets errno. We need to restore it before returning from the signal handler (if we return after calling functions that might set it), as signal-safety(7) also implies: Fetching and setting the value of errno is async-signal-safe provided that the signal handler saves errno on entry and restores its value before returning. Eventually, we'll probably need to switch to signalfd(2) the day we want to implement multithreading, but this will do for the moment. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/23478 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-05 15:02:36 +02:00
Danish Prakash	0149d11cc5	pasta: modify hostname when detaching new namespace When invoking pasta without any arguments, it's difficult to tell whether we are in the new namespace or not leaving users a bit confused. This change modifies the host namespace to add a prefix "pasta-" to make it a bit more obvious. Signed-off-by: Danish Prakash <contact@danishpraka.sh> [sbrivio: coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-30 17:27:43 +02:00
AbdAlRahman Gad	8fae3b73cb	Fix typo in README file - remove duplicated 'the' in the 'Services' section Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-29 19:02:35 +02:00
Stefano Brivio	f87b11c7be	fedora/rpkg: List myself as author for changelog entries ...instead of the latest author for contrib/fedora. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 16:40:41 +02:00
David Gibson	57a21d2df1	tap: Improve handling of partially received frames on qemu socket Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send then, but we can't rely on it. Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly: * If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu * If a the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:42 +02:00
David Gibson	37e3b24d90	tap: Correctly handle frames of odd length The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address. Currently we load the frame length by just casting a char pointer to (uint32_t ) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't casting an incorrectly aligned pointer to (uint32_t ) is undefined behaviour, strictly speaking. Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:38 +02:00
David Gibson	4684f60344	tap: Don't use EPOLLET on Qemu sockets Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:20 +02:00
David Gibson	9e3f2355c4	tap: Don't attempt to carry on if we get a bad frame length from qemu If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elswhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:05:44 +02:00
David Gibson	a06db27c49	tap: Better report errors receiving from QEMU socket If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we realy should report something for debugability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 13:51:23 +02:00
Stefano Brivio	77c092ee5e	log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME We report relative timestamps in logs, so we want to avoid jumps in the system time. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:36 +02:00
Stefano Brivio	e5c37ba0f4	log: Initialise timestamp for relative log time also if we use a log file ...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:34 +02:00

1 2 3 4 5 ...

1730 commits