passt

Author	SHA1	Message	Date
Jon Maloy	a740e16fd1	tcp: handle shrunk window advertisements from guest A bug in kernel TCP may lead to a deadlock where a zero window is sent from the guest peer, while it is unable to send out window updates even after socket reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by data packets arriving from this side. However, currently such packets are never sent, because the zero-window condition prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered only after the peer has dropped one or more arriving packets because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that the implementation goes against the RFC-9293 recommendation that a previously advertised window never should shrink. RFC-9293 seems to permit that we can continue sending up to the right edge of the last advertised non-zero window in such situations, so that is what we do to resolve this situation. It turns out that this solution is extremely simple to implememt in the code: We just omit to save the advertised zero-window when we see that it has shrunk, i.e., if the acknowledged sequence number in the advertisement message is lower than that of the last data byte sent from our side. When that is the case, the following happens: - The 'retr' flag in tcp_data_from_tap() will be 'false', so no retransmission will occur at this occasion. - The data stream will soon reach the right edge of the previously advertised window. In fact, in all observed cases we have seen that it is already there when the zero-advertisement arrives. - At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set, unless they already have been, meaning that only the next timer expiration will open for data retransmission or transmission. - When that happens, the memory squeeze at the guest will normally have abated, and the data flow can resume. It should be noted that although this solves the problem we have at hand, it is a work-around, and not a genuine solution to the described kernel bug. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor fix in commit title and commit reference in comment to workaround Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 18:05:08 +02:00
Jon Maloy	e63d281871	tcp: leverage support of SO_PEEK_OFF socket option when available >From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). This new feature makes is possible to call recv_msg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->name space direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 shows that throughput increases with 15-20 percent in the host->namespace direction when this feature is used. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 17:57:03 +02:00
David Gibson	8bd57bf25b	doc: Trivial fix for reuseaddr-priority This test program checks for particular behaviour regardless of order of operations. So, we step through the test with all possible orders for a number of different of parts. Or at least, we're supposed to, a copy pasta error led to using the same order for two things which should be independent. Fixes: `299c407501` ("doc: Add program to document and test assumptions about SO_REUSEADDR") Reported-by: David Taylor <davidt@yadt.co.uk> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 17:55:52 +02:00
David Gibson	ec2691a12e	doc: Test behaviour of zero length datagram recv()s Add a test program verifying that we're able to discard datagrams from a socket without needing a big discard buffer, by using a zero length recv(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:48 +02:00
David Gibson	299c407501	doc: Add program to document and test assumptions about SO_REUSEADDR For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us. Add a test program which verifies things work the way we expect, which documents what those expectations are in the process. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:43 +02:00
David Gibson	be0214cca6	udp: Consolidate datagram batching When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send(). Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler(). This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:41 +02:00
David Gibson	69e5393c37	udp: Move some more of sock_handler tasks into sub-functions udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four things between them: 1. Receive some datagrams from a socket 2. Split those datagrams into batches depending on how they need to be sent (via tap or via a specific splice socket) 3. Prepare buffers for each datagram to send it onwards 4. Actually send it onwards Split (1) and (3) into specific helper functions. This isn't immediately useful (udp_splice_prepare(), in particular, is trivial), but it will make further reworks clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:39 +02:00
David Gibson	c6c61a9e1a	udp: Don't repeatedly initialise udp[46]_eth_hdr Since we split our packet frame buffers into different pieces, we have a single buffer per IP version for the ethernet header, rather than one per frame. This makes sense since our ethernet header is alwaus the same. However we initialise those buffers udp[46]_eth_hdr inside a per frame loop. Pull that outside the loop so we just initialise them once. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:37 +02:00
David Gibson	55aff45bc1	udp: Unify udp[46]_l2_iov The only differences between these arrays are that udp4_l2_iov is pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame header and udp6_l2_iov points to the IPv6 versions. We already have to set up a bunch of headers per-frame, including updating udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust the IOV entries to point at the correct headers for the frame than to have two complete sets of iovecs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:35 +02:00
David Gibson	9f9b15f949	udp: Unify udp[46]_mh_splice We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where the only difference is that they point to different sockaddr buffers for the destination address. Unify these by having the common array point at a sockaddr_inany as the address. This does mean slightly more work when we're about to splice, because we need to write the whole socket address, rather than just the port. However it removes 32 mmsghdr structures and we're going to need more flexibility constructing that target address for the flow table. Because future changes might mean that the address isn't always loopback, change the name of the common address from *_localname to udp_splicename. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:33 +02:00
David Gibson	fbd78b6f3e	udp: Rename IOV and mmsghdr arrays Make the salient points about these various arrays clearer with renames: * udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2. They are, however, specific to receiving not sending. Rename to udp_iov_recv and udp[46]_mh_recv. * udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov * udp[46]_localname are (for now) pre-populated with the local address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_splice_to Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:30 +02:00
David Gibson	f62c33d85f	udp: Pass full epoll reference through more of sock handler path udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:28 +02:00
David Gibson	8f8eb73482	flow: Add flow_sidx_valid() helper To implement the TCP hash table, we need an invalid (NULL-like) value for flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we treat (usually) anything with an out of bounds flow index the same way. That's not always done consistently though. In flow_at_sidx() we open code a check on the flow index. In tcp_hash_probe() we instead compare against FLOW_SIDX_NONE, and in some other places we use the fact that flow_at_sidx() will return NULL in this case, even if we don't otherwise need the flow it returns. Clean this up a bit, by adding an explicit flow_sidx_valid() test function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:25 +02:00
David Gibson	74c1c5efcf	util: sock_l4() determine protocol from epoll type rather than the reverse sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:09 +02:00
Stefano Brivio	b625ed5fee	conf: Use the right maximum buffer size for c->sock_path UNIX_SOCK_MAX is the maximum number we'll append to the socket path if we generate it automatically. If it's given on the command line, it can be up to UNIX_PATH_MAX (including the terminating character) long. UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of 108). Commit `ceddcac74a` ("conf, tap: False "Buffer not null terminated" positives, CWE-170") fixed the wrong problem: the right fix for the problem at hand was actually commit `cc287af173` ("conf: Fix incorrect bounds checking for sock_path parameter"). Fixes: `ceddcac74a` ("conf, tap: False "Buffer not null terminated" positives, CWE-170") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:34:06 +02:00
Stefano Brivio	403a7c14a0	tcp_splice: Check return value of setsockopt() for SO_RCVLOWAT Spotted by Coverity, harmless as we would consider that successful and check on the socket later from the timer, but printing a debug message in that case is definitely wise, should it ever happen. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:33:57 +02:00
Stefano Brivio	21ee1eb2de	conf: Copy up to MAXDNSRCH - 1 bytes, not MAXDNSRCH Spotted by Coverity just recently. Not that it really matters as MAXDNSRCH always appears to be defined as 1025, while a full domain name can have up to 253 characters: it would be a bit pointless to have a longer search domain. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:33:29 +02:00
Stefano Brivio	1ee2ecade3	udp: Reduce scope of rport in udp_invert_portmap() cppcheck 2.14 warns that the scope of the rport variable could be reduced: do that, as reverted commit `c80fa6a6bb` ("udp: Make rport calculation more local") did, but keep the temporary variable of in_port_t type, otherwise the sum gets promoted to int. While at it, add a comment explaining why we calculate rport like this instead of directly using the sum as array index. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-24 15:41:38 +02:00
Stefano Brivio	054697598f	Revert "udp: Make rport calculation more local" This reverts commit `c80fa6a6bb`, as it reintroduces the issue fixed by commit `1e6f92b995` ("udp: Fix 16-bit overflow in udp_invert_portmap()"). Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=80 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-24 15:41:38 +02:00
Stefano Brivio	c66f0341d9	log: Don't report syslog failures to stderr after initialisation If we daemonised, we can't use standard error. If we didn't, it's rather annoying to have all those messages on standard error anyway, and kind of pointless too, as the messages we wanted to print were printed to standard error anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-21 15:32:48 +02:00
Stefano Brivio	e7323e515a	conf, passt: Don't call __openlog() if a log file is used If a log file is configured, we would otherwise open a connection to the system logger (if any), print any message that we might have before we initialise the log file, and then keep that connection around for no particular reason. Call __openlog() as an alternative to the log file setup, instead. This way, we might skip printing some messages during the initialisation phase, but they're probably not really valuable to have in a system log, and we're going to print them to standard error anyway. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:46 +02:00
Stefano Brivio	dba7f0f5ce	treewide: Replace strerror() calls Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:44 +02:00
Stefano Brivio	92a22fef93	treewide: Replace perror() calls with calls to logging functions perror() prints directly to standard error, but in many cases standard error might be already closed, or we might want to skip logging, based on configuration. Our logging functions provide all that. While at it, make errors more descriptive, replacing some of the existing basic perror-style messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:43 +02:00
Stefano Brivio	c1140df889	log: Add _perror() logging function variants In many places, we have direct perror() calls, which completely bypass logging functions and log files. They are definitely convenient: offer similar convenience with _perror() logging variants, so that we can drop those direct perror() calls. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:40 +02:00
Stefano Brivio	afd9cdc9bb	log, passt: Always print to stderr before initialisation is complete After commit `15001b39ef` ("conf: set the log level much earlier"), we had a phase during initialisation when messages wouldn't be printed to standard error anymore. Commit `f67238aa86` ("passt, log: Call __openlog() earlier, log to stderr until we detach") fixed that, but only for the case where no log files are given. If a log file is configured, vlogmsg() will not call passt_vsyslog(), but during initialisation, LOG_PERROR is set, so to avoid duplicated prints (which would result from passt_vsyslog() printing to stderr), we don't call fprintf() from vlogmsg() either. This is getting a bit too complicated. Instead of abusing LOG_PERROR, define an internal logging flag that clearly represents that we're not done with the initialisation phase yet. If this flag is not set, make sure we always print to stderr, if the log mask matches. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:34 +02:00
Stefano Brivio	8c2f24a560	conf, log: Instead of abusing log levels, add log_conf_parsed flag We currently use a LOG_EMERG log mask to represent the fact that we don't know yet what the mask resulting from configuration should be, before the command line is parsed. However, we have the necessity of representing another phase as well, that is, configuration is parsed but we didn't daemonise yet, or we're not ready for operation yet. The next patch will add that notion explicitly. Mapping these cases to further log levels isn't really practical. Introduce boolean log flags to represent them, instead of abusing log priorities. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:31 +02:00
Stefano Brivio	bca0fefa32	conf, passt: Make --stderr do nothing, and deprecate it The original behaviour of printing messages to standard error by default when running from a non-interactive terminal was introduced because the first KubeVirt integration draft used to start passt in foreground and get messages via standard error. For development purposes, the system logger was more convenient at that point, and passt was running from interactive terminals only if not started by the KubeVirt integration. This behaviour was introduced by `84a62b79a2` ("passt: Also log to stderr, don't fork to background if not interactive"). Later, I added command-line options in `1e49d194d0` ("passt, pasta: Introduce command-line options and port re-mapping") and accidentally reversed this condition, which wasn't a problem as --stderr could force printing to standard error anyway (and it was used by KubeVirt). Nowadays, the KubeVirt integration uses a log file (requested via libvirt configuration), and the same applies for Podman if one actually needs to look at runtime logs. There are no use cases left, as far as I know, where passt runs in foreground in non-interactive terminals. Seize the chance to reintroduce some sanity here. If we fork to background, standard error is closed, so --stderr is useless in that case. If we run in foreground, there's no harm in printing messages to standard error, and that accidentally became the default behaviour anyway, so --stderr is not needed in that case. It would be needed for non-interactive terminals, but there are no use cases, and if there were, let's log to standard error anyway: the user can always redirect standard error to /dev/null if needed. Before we're up and running, we need to print to standard error anyway if something happens, otherwise we can't report failure to start in any kind of usage, stand-alone or in integrations. So, make --stderr do nothing, and deprecate it. While at it, drop a left-over comment about --foreground being the default only for interactive terminals, because it's not the case anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:28 +02:00
Stefano Brivio	b74801645c	conf, passt: Don't try to log to stderr after we close it If we don't run in foreground, we close standard error as we daemonise, so it makes no sense to check if the controlling terminal is an interactive terminal or if --force-stderr was given, to decide if we want to log to standard error. Make --force-stderr depend on --foreground. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:15 +02:00
Stefano Brivio	65923ba798	conf: Accept duplicate and conflicting options, the last one wins In multiple occasions, especially when passt(1) and pasta(1) are used in integrations such as the one with Podman, the ability to override earlier options on the command line with later one would have been convenient. Recently, to debug a number of issues happening with Podman, I would have liked to ask users to share a debug log by passing --debug as additional option, but pasta refuses --quiet (always passed by Podman) and --debug at the same time. On top of this, Podman lets users specify other pasta options in its containers.conf(5) file, as well as on the command line. The options from the configuration files are appended together with the ones from the command line, which makes it impossible for users to override options from the configuration file, if duplicated options are refused, unless Podman takes care of sorting them, which is clearly not sustainable. For --debug and --trace, somebody took care of this on Podman side at: https://github.com/containers/common/pull/2052 but this doesn't fix the issue with other options, and we'll have anyway older versions of Podman around, too. I think there's some value in telling users about duplicated or conflicting options, because that might reveal issues in integrations or accidental misconfigurations, but by now I'm fairly convinced that the downsides outweigh this. Drop checks about duplicate options and mutually exclusive ones. In some cases, we need to also undo a couple of initialisations caused by earlier options, but this looks like a simplification, overall. Notable exception: --stderr still conflicts with --log-file, because users might have the expectation that they don't actually conflict. But they do conflict in the existing implementation, so it's safer to make sure that the users notice that. Suggested-by: Paul Holzinger <pholzing@redhat.com> Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com>	2024-06-21 15:31:46 +02:00
Stefano Brivio	62de6140d9	netlink: Strip nexthop identifiers when duplicating routes If routing daemons set up host routes, for example FRR via OSPF as in the reported issue, they might add nexthop identifiers (not objects) that are generally not valid in the target namespace. Strip them off as well, otherwise we'll get EINVAL from the kernel. Link: https://github.com/containers/podman/issues/22960 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-20 17:03:28 +02:00
Danish Prakash	1544a43863	passt.1, qrap.1: align license description with SPDX identifier The SPDX identifier states GPL-2.0-or-later but the copyright section mentions GPL-3.0 or later causing a mismatch. Also, only correctly refers to GPL instead of AGPL. Signed-off-by: Danish Prakash <contact@danishpraka.sh> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-19 15:00:55 +02:00
Stefano Brivio	f301bb18b5	netlink: Ignore EHOSTUNREACH failures when duplicating routes To implicitly resolve possible dependencies between routes as we duplicate them into the target namespace, we go through a set of n routes n times, and ignore EEXIST responses to netlink messages (we already inserted the route) and ENETUNREACH (we didn't insert the route yet, but we need to insert another one first). Until now, we didn't ignore EHOSTUNREACH responses. However, NetworkManager users with multiple non-subnet routes for the same interface report that pasta exits with "no route to host" while duplicating routes. This happens because NetworkManager sets the 'noprefixroute' attribute on addresses, meaning that the kernel won't create subnet routes automatically depending on the prefix length of the address. We copy this attribute as we copy the address into the target namespace, and as a result, the kernel doesn't create subnet routes in the target namespace either. This means that the gateway for routes that are inserted later can be unreachable at some points during the sequence of route duplication. That is, we don't just have dependencies between regular routes, but we can also have dependencies between regular routes and subnet routes, as subnet routes are not automatically inserted in advance. Link: https://github.com/containers/podman/issues/22824 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-19 15:00:55 +02:00
Stefano Brivio	450a6131be	netlink: With no default route, pick the first interface with a route While commit `f919dc7a4b` ("conf, netlink: Don't require a default route to start") sounded reasonable in the assumption that, if we don't find default routes for a given address family, we can still proceed by selecting an interface with any route iff it's the only one for that protocol family, Jelle reported a further issue in a similar setup. There, multiple interfaces are present, and while remote container connectivity doesn't matter for the container, local connectivity is desired. There are no default routes, but those multiple interfaces all have non-default routes, so we should just pick one and start. Pick the first interface reported by the kernel with any route, if there are no default routes. There should be no harm in doing so. Reported-by: Jelle van der Waa <jvanderwaa@redhat.com> Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>	2024-06-19 15:00:55 +02:00
Stefano Brivio	54a9d3801b	tcp: Don't rely on bind() to fail to decide that connection target is valid Commit `e1a2e2780c` ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-19 15:00:55 +02:00
David Gibson	020ff7a40e	siphash: Remove stale prototypes In `fc8f0f8c` ("siphash: Use incremental rather than all-at-once siphash functions") we removed the older interface to the SipHash implementation, which took fixed sized blocks of data. However, we forgot to remove the prototypes for those functions, so do that now. Fixes: `fc8f0f8c48` ("siphash: Use incremental rather than all-at-once siphash functions") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-19 14:55:42 +02:00
David Gibson	7e87bd98ac	udp: Move management of udp[46]_localname into udp_splice_send() Mostly, udp_sock_handler() is independent of how the datagrams it processes will be forwarded (tap or splice). However, it also updates the msg_name fields for spliced sends, which doesn't really make sense here. Move it into udp_splice_send() which is all about spliced sends. This does potentially mean we'll update the field to the same value several times, but we're going to need this in future anyway: with the extensions the flow table allows, it might not be the same value each time after all. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:11:46 +02:00
David Gibson	ff57f8ddc6	udp: Rework how we divide queued datagrams between sending methods udp_sock_handler() takes a number of datagrams from sockets that depending on their addresses could be forwarded either to the L2 interface ("tap") or to another socket ("spliced"). In the latter case we can also only send packets together if they have the same source port, and therefore are sent via the same socket. To reduce the total number of system calls we gather contiguous batches of datagrams with the same destination interface and socket where applicable. The determination of what the target is is made by udp_mmh_splice_port(). It returns the source port for splice packets and -1 for "tap" packets. We find batches by looking ahead in our queue until we find a datagram whose "splicefrom" port doesn't match the first in our current batch. udp_mmh_splice_port() is moderately expensive, and unfortunately we can call it twice on the same datagram: once as the (last + 1) entry in one batch (to check it's not in that batch), then again as the first entry in the next batch. Avoid this by keeping track of the "splice port" in the metadata structure, and filling it in one entry ahead of the one we're currently considering. This is a bit subtle, but not that hard. It will also generalise better when we have more complex possibilities based on the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:11:42 +02:00
David Gibson	63db7dcdbf	udp: Fold checking of splice flag into udp_mmh_splice_port() udp_mmh_splice_port() is used to determine if a UDP datagram can be "spliced" (forwarded via a socket instead of tap). We only invoke it if the origin socket has the 'splice' flag set. Fold the checking of the flag into the helper itself, which makes the caller simpler. It does mean we have a loop looking for a batch of spliceable or non-spliceable packets even in the case where the flag is clear. This shouldn't be that expensive though, since each call to udp_mmh_splice_port() will return without accessing memory in that case. In any case we're going to need a similar loop in more cases with upcoming flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:11:39 +02:00
David Gibson	523fbc5af7	util: Split construction of bind socket address from the rest of sock_l4() sock_l4() creates, binds and otherwise prepares a new socket. It builds the socket address to bind from separately provided address and port. However, we have use cases coming up where it's more natural to construct the socket address in the caller. Prepare for this by adding sock_l4_sa() which takes a pre-constructed socket address, and rewriting sock_l4() in terms of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:10:52 +02:00
Laurent Vivier	4070bac7a4	tap: use in->buf_size rather than sizeof(pkt_buf) buf_size is set to sizeof(pkt_buf) by default. And it seems more correct to provide the actual size of the buffer. Later a buf_size of 0 will allow vhost-user mode to detect guest memory buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:42 +02:00
Laurent Vivier	7290335b14	iov: remove iov_copy() it was needed by a draft version of vhost-user, it is not needed anymore. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:40 +02:00
Laurent Vivier	0c335d751a	vhost-user: compare mode MODE_PASTA and not MODE_PASST As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:38 +02:00
Laurent Vivier	377b666dc9	udp: rename udp_sock_handler() to udp_buf_sock_handler() We are going to introduce a variant of the function to use vhost-user buffers rather than passt internal buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:34 +02:00
Laurent Vivier	e7ac995217	udp: refactor UDP header update functions This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions to improve code portability by replacing the udp_meta_t parameter with more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr) and the source socket address (sockaddr_in/sockaddr_in6). It also moves the tap_hdr_update() function call inside the udp_tap_send() function not to have to pass the TAP header to udp_update_hdr4() and udp_update_hdr6() This refactor reduces complexity by making the functions more modular and ensuring that each function operates on more narrowly scoped data structures. This will facilitate future backend introduction like vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:32 +02:00
Laurent Vivier	9ecf7fedc5	tap: refactor packets handling functions Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(), and tap4_handler() and tap6_handler() into tap_handler(). Create a generic tap_add_packet() to consolidate packet addition logic and reduce code duplication. The purpose is to ease the export of these functions to use them with the vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:19 +02:00
Laurent Vivier	fba2b544b6	tcp: move buffers management functions to their own file Move all the TCP parts using internal buffers to tcp_buf.c and keep generic TCP management functions in tcp.c. Add tcp_internal.h to export needed functions from tcp.c and tcp_buf.h from tcp_buf.c With this change we can use existing TCP functions with a different kind of memory storage as for instance the shared memory provided by the guest via vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:05 +02:00
Laurent Vivier	ec26fa013a	tcp: extract buffer management from tcp_send_flag() This commit isolates the internal data structure management used for storing data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[], tcp4_flags[], ...) from the tcp_send_flag() function. The extracted functionality is relocated to a new function named tcp_fill_flag_header(). tcp_fill_flag_header() is now a generic function that accepts parameters such as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[]. This separation sets the stage for utilizing tcp_prepare_flags() to set the memory provided by the guest via vhost-user in future developments. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:43:35 +02:00
David Gibson	d949667436	cppcheck: Suppress constParameterCallback errors We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-08 13:06:20 +02:00
Derek Schrock	8a83b530fe	selinux: Allow access to user_devpts Allow access to user_devpts. $ pasta --version pasta 0^20240510.g7288448-1.fc40.x86_64 ... $ awk '' < /dev/null $ pasta --version $ While this might be a awk bug it appears pasta should still have access to devpts. Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	ec416fdcc4	tcp, flow: Fix some error paths which didn't clean up flows properly Flow table entries need to be fully initialised before returning to the main epoll loop. Commit `0060acd1` ("flow: Clarify and enforce flow state transitions") now enforces that: once a flow is allocated we must either cancel it, or activate it before returning to the main loop, or we will hit an ASSERT(). Some error paths in tcp_conn_from_tap() weren't correctly updated for this requirement - we can exit with a flow entry incompletely initialised. Correct that by cancelling the flows in those situations. I don't have enough information to be certain if this is the cause for podman bug 22925, but it plausibly could be. Fixes: `0060acd11b` ("flow: Clarify and enforce flow state transitions") Link: https://github.com/containers/podman/issues/22925 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00

1 2 3 4 5 ...

1627 commits