passt

mirror of https://passt.top/passt synced 2025-04-11 07:45:01 +02:00

Author	SHA1	Message	Date
David Gibson	4e2d36e83f	flow: Common address information for target side Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:32:37 +02:00
David Gibson	8012f5ff55	flow: Common address information for initiating side Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as seen by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flow enters INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:32:32 +02:00
David Gibson	ba74b1fea1	doc: Extend zero-recv test with methods using msghdr This test program verifies that we can receive and discard datagrams by using recv() with a NULL buffer and zero-length. Extend it to verify it also works using recvmsg() and either an iov with a zero-length NULL buffer or an iov that itself is NULL and zero-length. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fixed printf() message in main of recv-zero.c] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:31:02 +02:00
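A minimal sketch of the behaviour these test programs check, under assumptions: it uses an AF_UNIX datagram socketpair rather than UDP so that it needs no network setup, and the helper names are hypothetical, not taken from recv-zero.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Discard one queued datagram using recv() with a NULL, zero-length buffer */
static ssize_t discard_recv(int s)
{
	return recv(s, NULL, 0, MSG_DONTWAIT);
}

/* Same, using recvmsg() with an iov array that is itself NULL and empty */
static ssize_t discard_recvmsg(int s)
{
	struct msghdr mh = { .msg_iov = NULL, .msg_iovlen = 0 };

	return recvmsg(s, &mh, MSG_DONTWAIT);
}

/* Queue "a", "b", "c", discard the first two datagrams, and return the
 * payload byte of the next datagram read, or -1 on any failure.
 */
static int zero_recv_demo(void)
{
	int sv[2], ret = -1;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv))
		return -1;
	assert(sv[0] >= 0 && sv[1] >= 0);

	if (send(sv[0], "a", 1, 0) == 1 && send(sv[0], "b", 1, 0) == 1 &&
	    send(sv[0], "c", 1, 0) == 1 &&
	    discard_recv(sv[1]) == 0 && discard_recvmsg(sv[1]) == 0 &&
	    recv(sv[1], &c, 1, MSG_DONTWAIT) == 1)
		ret = c;

	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

If both discards consume exactly one datagram each, zero_recv_demo() returns 'c'.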
David Gibson	01e5611ec3	doc: Test behaviour of closing duplicate UDP sockets To simplify lifetime management of "listening" UDP sockets, UDP flow support needs to duplicate existing bound sockets. Those duplicates will be close()d when their corresponding flow expires, but we expect the original to still receive datagrams as always. That is, we expect the close() on the duplicate to remove the duplicated fd, but not to close the underlying UDP socket. Add a test program to doc/platform-requirements to verify this requirement. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:30:14 +02:00
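The requirement can be sketched as follows. This is an illustration under assumptions, not the program added to doc/platform-requirements: it uses an AF_UNIX datagram socketpair instead of a bound UDP socket, and dup_close_demo() is a hypothetical name.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if the original socket still receives a datagram after its
 * duplicate has been closed, 0 if not, -1 if the sockets can't be created.
 */
static int dup_close_demo(void)
{
	int sv[2], d, ok;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv))
		return -1;

	d = dup(sv[1]);		/* second fd for the same underlying socket */
	assert(d >= 0 && d != sv[1]);
	close(d);		/* expected to drop only the duplicated fd */

	ok = send(sv[0], "x", 1, 0) == 1 &&
	     recv(sv[1], &c, 1, MSG_DONTWAIT) == 1 && c == 'x';

	close(sv[0]);
	close(sv[1]);
	return ok;
}
```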
David Gibson	66a02c9f7c	tcp_splice: Use parameterised macros for per-side event/flag bits Both the events and flags fields in tcp_splice_conn have several bits which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1. This necessitates some rather awkward ternary expressions when we need to get the relevant bit for a particular side. Simplify this by using a parameterised macro for the bit values. This needs a ternary expression inside the macros, but makes the places we use it substantially clearer. That simplification in turn allows us to use a loop across each side to implement several things which are currently open coded to do equivalent things for each side in turn. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:30:11 +02:00
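A sketch of the pattern described above. The bit positions and the two helper functions are hypothetical; only the OUT_WAIT name comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)		(1U << (n))

/* Hypothetical bit positions: the ternary lives inside the macro once */
#define OUT_WAIT(sidei)	((sidei) ? BIT(5) : BIT(4))

/* Set OUT_WAIT for one side, with no ternary needed at the call site */
static uint8_t set_out_wait(uint8_t events, unsigned sidei)
{
	assert(sidei < 2);
	return events | OUT_WAIT(sidei);
}

/* Count the sides that have OUT_WAIT set, looping across both sides
 * instead of open coding the check once per side.
 */
static unsigned count_out_wait(uint8_t events)
{
	unsigned sidei, n = 0;

	for (sidei = 0; sidei < 2; sidei++)
		n += !!(events & OUT_WAIT(sidei));

	return n;
}
```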
David Gibson	5235c47c79	flow: Introduce flow_foreach_sidei() macro We have a handful of places where we use a loop to step through each side of a flow or flows, and we're probably going to have more in future. Introduce a macro to implement this loop for convenience. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:30:07 +02:00
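One plausible shape for such a macro, as a sketch. SIDES is an assumed name for the number of sides, total_bytes() is a hypothetical caller, and the real definition in the tree may differ.

```c
#include <assert.h>

#define INISIDE	0
#define TGTSIDE	1
#define SIDES	2	/* assumed name for the number of sides */

/* Step a side index variable through each side of a flow */
#define flow_foreach_sidei(sidei_) \
	for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)

/* Hypothetical caller: sum a per-side byte counter */
static unsigned total_bytes(const unsigned bytes[SIDES])
{
	unsigned sidei, sum = 0;

	assert(bytes);
	flow_foreach_sidei(sidei)
		sum += bytes[sidei];

	return sum;
}
```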
David Gibson	71d7985188	flow, tcp_splice: Prefer 'sidei' for variables referring to side index In various places we have variables named 'side' or similar which always have the value 0 or 1 (INISIDE or TGTSIDE). Given a flow, this refers to a specific side of it. Upcoming flow table work will make it more useful for "side" to refer to a specific side of a specific flow. To make things less confusing then, prefer the term "side index" and the name 'sidei' for variables with just the 0 or 1 value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fixed minor detail in comment to struct flow_common] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:29:47 +02:00
David Gibson	9b125e7776	flow, icmp, tcp: Clean up helpers for getting flow from index TCP (both regular and spliced) and ICMP both have macros to retrieve the relevant protocol specific flow structure from a flow index. In most cases what we actually want is to get the specific flow from a sidx. Replace those simple macros with a more precise inline, which also asserts that the flow is of the type we expect. While we're there, also add a pif_at_sidx() helper to get the interface of a specific flow & side, which is useful in some places. Finally, fix some minor style issues in the comments on some of the existing sidx related helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 15:27:27 +02:00
David Gibson	2fa91ee391	udp: Handle errors on UDP sockets Currently we ignore all events other than EPOLLIN on UDP sockets. This means that if we ever receive an EPOLLERR event, we'll enter an infinite loop on epoll, because we'll never do anything to clear the error. Luckily that doesn't seem to have happened in practice, but it's certainly fragile. Furthermore changes in how we handle UDP sockets with the flow table mean we will start receiving error events. Add handling of EPOLLERR events. For now we just read the error from the error queue (thereby clearing the error state) and print a debug message. We can add more substantial handling of specific events in future if we want to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 07:05:21 +02:00
David Gibson	6bd8283bf9	util: Add AF_UNSPEC support to sockaddr_ntop() Allow sockaddr_ntop() to format AF_UNSPEC socket addresses. There do exist a few cases where we might legitimately have either an AF_UNSPEC or a real address, such as the origin address from MSG_ERRQUEUE. Even in cases where we shouldn't get an AF_UNSPEC address, formatting it is likely to make things easier to debug if we ever somehow do. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 07:05:18 +02:00
David Gibson	4e1f850f61	udp, tcp: Tweak handling of no_udp and no_tcp flags We abort the UDP socket handler if the no_udp flag is set. But if UDP was disabled we should never have had a UDP socket to trigger the handler in the first place. If we somehow did, ignoring it here isn't really going to help because aborting without doing anything is likely to lead to an epoll loop. The same is the case for the TCP socket and timer handlers and the no_tcp flag. Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to several other entry points to the protocol specific code which should never be called if the protocol is disabled. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 07:05:15 +02:00
David Gibson	272d1d033c	udp: Make udp_sock_recv static Through an oversight this was previously declared as a public function although it's only used in udp.c and there is no prototype in any header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 07:05:13 +02:00
David Gibson	f79c42317f	conf: Don't configure port forwarding for a disabled protocol UDP and/or TCP can be disabled with the --no-udp and --no-tcp options. However, when this is specified, it's still possible to configure forwarded ports for the disabled protocol. In some cases this will open sockets and perform other actions, which might not be safe since the entire protocol won't be initialised. Check for this case, and explicitly forbid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-17 07:04:55 +02:00
Jon Maloy	a740e16fd1	tcp: handle shrunk window advertisements from guest A bug in kernel TCP may lead to a deadlock where a zero window is sent from the guest peer, while it is unable to send out window updates even after socket reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by data packets arriving from this side. However, currently such packets are never sent, because the zero-window condition prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered only after the peer has dropped one or more arriving packets because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that the implementation goes against the RFC-9293 recommendation that a previously advertised window never should shrink. RFC-9293 seems to permit that we can continue sending up to the right edge of the last advertised non-zero window in such situations, so that is what we do to resolve this situation. It turns out that this solution is extremely simple to implement in the code: We just omit to save the advertised zero-window when we see that it has shrunk, i.e., if the acknowledged sequence number in the advertisement message is lower than that of the last data byte sent from our side. When that is the case, the following happens: - The 'retr' flag in tcp_data_from_tap() will be 'false', so no retransmission will occur at this occasion. - The data stream will soon reach the right edge of the previously advertised window. In fact, in all observed cases we have seen that it is already there when the zero-advertisement arrives. - At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set, unless they already have been, meaning that only the next timer expiration will open for data retransmission or transmission. 
- When that happens, the memory squeeze at the guest will normally have abated, and the data flow can resume. It should be noted that although this solves the problem we have at hand, it is a work-around, and not a genuine solution to the described kernel bug. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor fix in commit title and commit reference in comment to workaround] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 18:05:08 +02:00
Jon Maloy	e63d281871	tcp: leverage support of SO_PEEK_OFF socket option when available From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). This new feature makes it possible to call recv_msg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->namespace direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 show that throughput increases by 15-20 percent in the host->namespace direction when this feature is used. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 17:57:03 +02:00
David Gibson	8bd57bf25b	doc: Trivial fix for reuseaddr-priority This test program checks for particular behaviour regardless of order of operations. So, we step through the test with all possible orders for a number of different parts. Or at least, we're supposed to, a copy pasta error led to using the same order for two things which should be independent. Fixes: `299c407501` ("doc: Add program to document and test assumptions about SO_REUSEADDR") Reported-by: David Taylor <davidt@yadt.co.uk> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-15 17:55:52 +02:00
David Gibson	ec2691a12e	doc: Test behaviour of zero length datagram recv()s Add a test program verifying that we're able to discard datagrams from a socket without needing a big discard buffer, by using a zero length recv(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:48 +02:00
David Gibson	299c407501	doc: Add program to document and test assumptions about SO_REUSEADDR For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us. Add a test program which verifies things work the way we expect, which documents what those expectations are in the process. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:43 +02:00
David Gibson	be0214cca6	udp: Consolidate datagram batching When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send(). Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler(). This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:41 +02:00
David Gibson	69e5393c37	udp: Move some more of sock_handler tasks into sub-functions udp_buf_sock_handler(), udp_splice_send() and udp_tap_send() loosely do four things between them: 1. Receive some datagrams from a socket 2. Split those datagrams into batches depending on how they need to be sent (via tap or via a specific splice socket) 3. Prepare buffers for each datagram to send it onwards 4. Actually send it onwards Split (1) and (3) into specific helper functions. This isn't immediately useful (udp_splice_prepare(), in particular, is trivial), but it will make further reworks clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:39 +02:00
David Gibson	c6c61a9e1a	udp: Don't repeatedly initialise udp[46]_eth_hdr Since we split our packet frame buffers into different pieces, we have a single buffer per IP version for the ethernet header, rather than one per frame. This makes sense since our ethernet header is always the same. However we initialise those buffers udp[46]_eth_hdr inside a per frame loop. Pull that outside the loop so we just initialise them once. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:37 +02:00
David Gibson	55aff45bc1	udp: Unify udp[46]_l2_iov The only differences between these arrays are that udp4_l2_iov is pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame header and udp6_l2_iov points to the IPv6 versions. We already have to set up a bunch of headers per-frame, including updating udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust the IOV entries to point at the correct headers for the frame than to have two complete sets of iovecs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:35 +02:00
David Gibson	9f9b15f949	udp: Unify udp[46]_mh_splice We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where the only difference is that they point to different sockaddr buffers for the destination address. Unify these by having the common array point at a sockaddr_inany as the address. This does mean slightly more work when we're about to splice, because we need to write the whole socket address, rather than just the port. However it removes 32 mmsghdr structures and we're going to need more flexibility constructing that target address for the flow table. Because future changes might mean that the address isn't always loopback, change the name of the common address from *_localname to udp_splicename. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:33 +02:00
David Gibson	fbd78b6f3e	udp: Rename IOV and mmsghdr arrays Make the salient points about these various arrays clearer with renames: * udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2. They are, however, specific to receiving not sending. Rename to udp_iov_recv and udp[46]_mh_recv. * udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov * udp[46]_localname are (for now) pre-populated with the local address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_splice_to Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:30 +02:00
David Gibson	f62c33d85f	udp: Pass full epoll reference through more of sock handler path udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:28 +02:00
David Gibson	8f8eb73482	flow: Add flow_sidx_valid() helper To implement the TCP hash table, we need an invalid (NULL-like) value for flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we treat (usually) anything with an out of bounds flow index the same way. That's not always done consistently though. In flow_at_sidx() we open code a check on the flow index. In tcp_hash_probe() we instead compare against FLOW_SIDX_NONE, and in some other places we use the fact that flow_at_sidx() will return NULL in this case, even if we don't otherwise need the flow it returns. Clean this up a bit, by adding an explicit flow_sidx_valid() test function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:25 +02:00
David Gibson	74c1c5efcf	util: sock_l4() determine protocol from epoll type rather than the reverse sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-05 15:26:09 +02:00
Stefano Brivio	b625ed5fee	conf: Use the right maximum buffer size for c->sock_path UNIX_SOCK_MAX is the maximum number we'll append to the socket path if we generate it automatically. If it's given on the command line, it can be up to UNIX_PATH_MAX (including the terminating character) long. UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of 108). Commit `ceddcac74a` ("conf, tap: False "Buffer not null terminated" positives, CWE-170") fixed the wrong problem: the right fix for the problem at hand was actually commit `cc287af173` ("conf: Fix incorrect bounds checking for sock_path parameter"). Fixes: `ceddcac74a` ("conf, tap: False "Buffer not null terminated" positives, CWE-170") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:34:06 +02:00
Stefano Brivio	403a7c14a0	tcp_splice: Check return value of setsockopt() for SO_RCVLOWAT Spotted by Coverity, harmless as we would consider that successful and check on the socket later from the timer, but printing a debug message in that case is definitely wise, should it ever happen. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:33:57 +02:00
Stefano Brivio	21ee1eb2de	conf: Copy up to MAXDNSRCH - 1 bytes, not MAXDNSRCH Spotted by Coverity just recently. Not that it really matters as MAXDNSRCH always appears to be defined as 1025, while a full domain name can have up to 253 characters: it would be a bit pointless to have a longer search domain. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-02 21:33:29 +02:00
Stefano Brivio	1ee2ecade3	udp: Reduce scope of rport in udp_invert_portmap() cppcheck 2.14 warns that the scope of the rport variable could be reduced: do that, as reverted commit `c80fa6a6bb` ("udp: Make rport calculation more local") did, but keep the temporary variable of in_port_t type, otherwise the sum gets promoted to int. While at it, add a comment explaining why we calculate rport like this instead of directly using the sum as array index. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-24 15:41:38 +02:00
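The integer promotion issue mentioned here can be shown in isolation. This is a sketch with hypothetical helper names, not the body of udp_invert_portmap(): adding two in_port_t values is done in int, so the sum can exceed 65535 unless it is stored back into a 16-bit variable.

```c
#include <assert.h>
#include <netinet/in.h>

#define NUM_PORTS	(1U << 16)

/* Sum used directly: both operands are promoted to int, so the result
 * can be 65536 or more, that is, past the end of a 65536-entry table.
 */
static unsigned rport_promoted(in_port_t port, in_port_t delta)
{
	return port + delta;
}

/* Sum stored in an in_port_t temporary: truncated to 16 bits on
 * assignment, so it wraps modulo 2^16 and stays a valid index.
 */
static in_port_t rport_wrapped(in_port_t port, in_port_t delta)
{
	in_port_t rport = port + delta;

	assert(rport == (port + delta) % NUM_PORTS);
	return rport;
}
```

For example, port 65000 with delta 1000 gives 66000 when promoted, but 464 once wrapped.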
Stefano Brivio	054697598f	Revert "udp: Make rport calculation more local" This reverts commit `c80fa6a6bb`, as it reintroduces the issue fixed by commit `1e6f92b995` ("udp: Fix 16-bit overflow in udp_invert_portmap()"). Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=80 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-24 15:41:38 +02:00
Stefano Brivio	c66f0341d9	log: Don't report syslog failures to stderr after initialisation If we daemonised, we can't use standard error. If we didn't, it's rather annoying to have all those messages on standard error anyway, and kind of pointless too, as the messages we wanted to print were printed to standard error anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-21 15:32:48 +02:00
Stefano Brivio	e7323e515a	conf, passt: Don't call __openlog() if a log file is used If a log file is configured, we would otherwise open a connection to the system logger (if any), print any message that we might have before we initialise the log file, and then keep that connection around for no particular reason. Call __openlog() as an alternative to the log file setup, instead. This way, we might skip printing some messages during the initialisation phase, but they're probably not really valuable to have in a system log, and we're going to print them to standard error anyway. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:46 +02:00
Stefano Brivio	dba7f0f5ce	treewide: Replace strerror() calls Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:44 +02:00
Stefano Brivio	92a22fef93	treewide: Replace perror() calls with calls to logging functions perror() prints directly to standard error, but in many cases standard error might be already closed, or we might want to skip logging, based on configuration. Our logging functions provide all that. While at it, make errors more descriptive, replacing some of the existing basic perror-style messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:43 +02:00
Stefano Brivio	c1140df889	log: Add _perror() logging function variants In many places, we have direct perror() calls, which completely bypass logging functions and log files. They are definitely convenient: offer similar convenience with _perror() logging variants, so that we can drop those direct perror() calls. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:40 +02:00
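As an illustration only, a perror()-like variant can be built by appending strerror(errno) to a formatted message. fmt_perror() below is a hypothetical helper that writes to a buffer; the actual _perror() logging variants and their signatures are not shown in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a message, then append ": <error string>"
 * for the current errno, as perror() would. Returns the length written,
 * or the vsnprintf() result if the message alone didn't fit.
 */
static int fmt_perror(char *buf, size_t size, const char *fmt, ...)
{
	int errno_copy = errno;	/* formatting might clobber errno */
	va_list ap;
	int n;

	assert(buf && size);

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= size)
		return n;

	return n + snprintf(buf + n, size - (size_t)n, ": %s",
			    strerror(errno_copy));
}

/* Returns 1 if the output matches "<message>: <strerror(ENOENT)>" */
static int perror_demo(void)
{
	char got[128], want[128];

	errno = ENOENT;
	fmt_perror(got, sizeof(got), "Failed to open %s", "/nonexistent");
	snprintf(want, sizeof(want), "Failed to open /nonexistent: %s",
		 strerror(ENOENT));

	return !strcmp(got, want);
}
```

A logging function built this way can then route the resulting string to a log file, the system logger or standard error, which a direct perror() call cannot do.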
Stefano Brivio	afd9cdc9bb	log, passt: Always print to stderr before initialisation is complete After commit `15001b39ef` ("conf: set the log level much earlier"), we had a phase during initialisation when messages wouldn't be printed to standard error anymore. Commit `f67238aa86` ("passt, log: Call __openlog() earlier, log to stderr until we detach") fixed that, but only for the case where no log files are given. If a log file is configured, vlogmsg() will not call passt_vsyslog(), but during initialisation, LOG_PERROR is set, so to avoid duplicated prints (which would result from passt_vsyslog() printing to stderr), we don't call fprintf() from vlogmsg() either. This is getting a bit too complicated. Instead of abusing LOG_PERROR, define an internal logging flag that clearly represents that we're not done with the initialisation phase yet. If this flag is not set, make sure we always print to stderr, if the log mask matches. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:34 +02:00
Stefano Brivio	8c2f24a560	conf, log: Instead of abusing log levels, add log_conf_parsed flag We currently use a LOG_EMERG log mask to represent the fact that we don't know yet what the mask resulting from configuration should be, before the command line is parsed. However, we have the necessity of representing another phase as well, that is, configuration is parsed but we didn't daemonise yet, or we're not ready for operation yet. The next patch will add that notion explicitly. Mapping these cases to further log levels isn't really practical. Introduce boolean log flags to represent them, instead of abusing log priorities. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:31 +02:00
Stefano Brivio	bca0fefa32	conf, passt: Make --stderr do nothing, and deprecate it The original behaviour of printing messages to standard error by default when running from a non-interactive terminal was introduced because the first KubeVirt integration draft used to start passt in foreground and get messages via standard error. For development purposes, the system logger was more convenient at that point, and passt was running from interactive terminals only if not started by the KubeVirt integration. This behaviour was introduced by `84a62b79a2` ("passt: Also log to stderr, don't fork to background if not interactive"). Later, I added command-line options in `1e49d194d0` ("passt, pasta: Introduce command-line options and port re-mapping") and accidentally reversed this condition, which wasn't a problem as --stderr could force printing to standard error anyway (and it was used by KubeVirt). Nowadays, the KubeVirt integration uses a log file (requested via libvirt configuration), and the same applies for Podman if one actually needs to look at runtime logs. There are no use cases left, as far as I know, where passt runs in foreground in non-interactive terminals. Seize the chance to reintroduce some sanity here. If we fork to background, standard error is closed, so --stderr is useless in that case. If we run in foreground, there's no harm in printing messages to standard error, and that accidentally became the default behaviour anyway, so --stderr is not needed in that case. It would be needed for non-interactive terminals, but there are no use cases, and if there were, let's log to standard error anyway: the user can always redirect standard error to /dev/null if needed. Before we're up and running, we need to print to standard error anyway if something happens, otherwise we can't report failure to start in any kind of usage, stand-alone or in integrations. So, make --stderr do nothing, and deprecate it. 
While at it, drop a left-over comment about --foreground being the default only for interactive terminals, because it's not the case anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:28 +02:00
Stefano Brivio	b74801645c	conf, passt: Don't try to log to stderr after we close it If we don't run in foreground, we close standard error as we daemonise, so it makes no sense to check if the controlling terminal is an interactive terminal or if --force-stderr was given, to decide if we want to log to standard error. Make --force-stderr depend on --foreground. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:15 +02:00
Stefano Brivio	65923ba798	conf: Accept duplicate and conflicting options, the last one wins On multiple occasions, especially when passt(1) and pasta(1) are used in integrations such as the one with Podman, the ability to override earlier options on the command line with later ones would have been convenient. Recently, to debug a number of issues happening with Podman, I would have liked to ask users to share a debug log by passing --debug as additional option, but pasta refuses --quiet (always passed by Podman) and --debug at the same time. On top of this, Podman lets users specify other pasta options in its containers.conf(5) file, as well as on the command line. The options from the configuration files are appended together with the ones from the command line, which makes it impossible for users to override options from the configuration file, if duplicated options are refused, unless Podman takes care of sorting them, which is clearly not sustainable. For --debug and --trace, somebody took care of this on Podman side at: https://github.com/containers/common/pull/2052 but this doesn't fix the issue with other options, and we'll have anyway older versions of Podman around, too. I think there's some value in telling users about duplicated or conflicting options, because that might reveal issues in integrations or accidental misconfigurations, but by now I'm fairly convinced that the downsides outweigh this. Drop checks about duplicate options and mutually exclusive ones. In some cases, we need to also undo a couple of initialisations caused by earlier options, but this looks like a simplification, overall. Notable exception: --stderr still conflicts with --log-file, because users might have the expectation that they don't actually conflict. But they do conflict in the existing implementation, so it's safer to make sure that the users notice that. 
Suggested-by: Paul Holzinger <pholzing@redhat.com> Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com>	2024-06-21 15:31:46 +02:00
Stefano Brivio	62de6140d9	netlink: Strip nexthop identifiers when duplicating routes If routing daemons set up host routes, for example FRR via OSPF as in the reported issue, they might add nexthop identifiers (not objects) that are generally not valid in the target namespace. Strip them off as well, otherwise we'll get EINVAL from the kernel. Link: https://github.com/containers/podman/issues/22960 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-20 17:03:28 +02:00
Danish Prakash	1544a43863	passt.1, qrap.1: align license description with SPDX identifier The SPDX identifier states GPL-2.0-or-later but the copyright section mentions GPL-3.0 or later causing a mismatch. Also, only correctly refers to GPL instead of AGPL. Signed-off-by: Danish Prakash <contact@danishpraka.sh> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-19 15:00:55 +02:00
Stefano Brivio	f301bb18b5	netlink: Ignore EHOSTUNREACH failures when duplicating routes To implicitly resolve possible dependencies between routes as we duplicate them into the target namespace, we go through a set of n routes n times, and ignore EEXIST responses to netlink messages (we already inserted the route) and ENETUNREACH (we didn't insert the route yet, but we need to insert another one first). Until now, we didn't ignore EHOSTUNREACH responses. However, NetworkManager users with multiple non-subnet routes for the same interface report that pasta exits with "no route to host" while duplicating routes. This happens because NetworkManager sets the 'noprefixroute' attribute on addresses, meaning that the kernel won't create subnet routes automatically depending on the prefix length of the address. We copy this attribute as we copy the address into the target namespace, and as a result, the kernel doesn't create subnet routes in the target namespace either. This means that the gateway for routes that are inserted later can be unreachable at some points during the sequence of route duplication. That is, we don't just have dependencies between regular routes, but we can also have dependencies between regular routes and subnet routes, as subnet routes are not automatically inserted in advance. Link: https://github.com/containers/podman/issues/22824 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-19 15:00:55 +02:00
Stefano Brivio	450a6131be	netlink: With no default route, pick the first interface with a route While commit `f919dc7a4b` ("conf, netlink: Don't require a default route to start") sounded reasonable in the assumption that, if we don't find default routes for a given address family, we can still proceed by selecting an interface with any route iff it's the only one for that protocol family, Jelle reported a further issue in a similar setup. There, multiple interfaces are present, and while remote container connectivity doesn't matter for the container, local connectivity is desired. There are no default routes, but those multiple interfaces all have non-default routes, so we should just pick one and start. Pick the first interface reported by the kernel with any route, if there are no default routes. There should be no harm in doing so. Reported-by: Jelle van der Waa <jvanderwaa@redhat.com> Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>	2024-06-19 15:00:55 +02:00
Stefano Brivio	54a9d3801b	tcp: Don't rely on bind() to fail to decide that connection target is valid Commit `e1a2e2780c` ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCES, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-19 15:00:55 +02:00
David Gibson	020ff7a40e	siphash: Remove stale prototypes In `fc8f0f8c` ("siphash: Use incremental rather than all-at-once siphash functions") we removed the older interface to the SipHash implementation, which took fixed sized blocks of data. However, we forgot to remove the prototypes for those functions, so do that now. Fixes: `fc8f0f8c48` ("siphash: Use incremental rather than all-at-once siphash functions") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-19 14:55:42 +02:00
David Gibson	7e87bd98ac	udp: Move management of udp[46]_localname into udp_splice_send() Mostly, udp_sock_handler() is independent of how the datagrams it processes will be forwarded (tap or splice). However, it also updates the msg_name fields for spliced sends, which doesn't really make sense here. Move it into udp_splice_send() which is all about spliced sends. This does potentially mean we'll update the field to the same value several times, but we're going to need this in future anyway: with the extensions the flow table allows, it might not be the same value each time after all. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:11:46 +02:00
David Gibson	ff57f8ddc6	udp: Rework how we divide queued datagrams between sending methods udp_sock_handler() takes a number of datagrams from sockets that depending on their addresses could be forwarded either to the L2 interface ("tap") or to another socket ("spliced"). In the latter case we can also only send packets together if they have the same source port, and therefore are sent via the same socket. To reduce the total number of system calls we gather contiguous batches of datagrams with the same destination interface and socket where applicable. The determination of what the target is is made by udp_mmh_splice_port(). It returns the source port for splice packets and -1 for "tap" packets. We find batches by looking ahead in our queue until we find a datagram whose "splicefrom" port doesn't match the first in our current batch. udp_mmh_splice_port() is moderately expensive, and unfortunately we can call it twice on the same datagram: once as the (last + 1) entry in one batch (to check it's not in that batch), then again as the first entry in the next batch. Avoid this by keeping track of the "splice port" in the metadata structure, and filling it in one entry ahead of the one we're currently considering. This is a bit subtle, but not that hard. It will also generalise better when we have more complex possibilities based on the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-14 12:11:42 +02:00
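
The one-entry-ahead caching described in commit ff57f8ddc6 can be illustrated with a minimal sketch. This is not the passt implementation: `struct meta`, `classify()`, `split_batches()` and the `TAP` marker are hypothetical names, and `classify()` is only a stand-in for udp_mmh_splice_port(). The sketch shows how storing the result in the metadata, and filling it in for the next entry before comparing, means each datagram is classified exactly once.

```c
#include <assert.h>
#include <stddef.h>

#define TAP	(-1)	/* hypothetical marker: datagram is forwarded to the tap interface */

/* Hypothetical per-datagram metadata; the real structure in passt differs */
struct meta {
	int src_port;	/* source port of the datagram */
	int local;	/* stand-in criterion: non-zero means "spliced" */
	int splicefrom;	/* cached classification: source port, or TAP */
};

static unsigned classify_calls;	/* counts calls to the "expensive" function */

/* Stand-in for udp_mmh_splice_port(): source port if spliced, TAP otherwise */
static int classify(const struct meta *m)
{
	classify_calls++;
	return m->local ? m->src_port : TAP;
}

/* Split q[0..n) into contiguous batches sharing the same splicefrom value.
 * Stores the length of each batch in lens[] (which needs room for n entries)
 * and returns the number of batches.
 */
static size_t split_batches(struct meta *q, size_t n, size_t *lens)
{
	size_t nb = 0, i = 0;

	assert(q && lens);

	if (!n)
		return 0;

	/* Classify the first entry up front; every later entry is
	 * classified one step ahead, inside the loop below.
	 */
	q[0].splicefrom = classify(&q[0]);

	while (i < n) {
		size_t start = i;
		int batch = q[i].splicefrom;

		do {
			i++;
			if (i < n)	/* fill in the next entry before comparing */
				q[i].splicefrom = classify(&q[i]);
		} while (i < n && q[i].splicefrom == batch);

		/* q[i], if any, already carries its cached value, so the
		 * next batch starts without classifying it again.
		 */
		lens[nb++] = i - start;
	}

	return nb;
}
```

In this sketch, a caller would send each batch of `lens[k]` datagrams with a single system call, choosing the sending method from the `splicefrom` value of the first entry in the batch.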


1590 commits