passt

mirror of https://passt.top/passt synced 2025-08-15 03:03:14 +02:00

Author	SHA1	Message	Date
Stefano Brivio	587980ca1e	udp: Actually discard datagrams we can't forward Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint addresses for datagrams, if we can't forward one of these datagrams, we need to make sure we actually discard it. Otherwise, with MSG_PEEK, we won't dequeue and loop on it forever. For example, if we fail to create a socket for a new flow, because, say, the destination of an inbound packet is multicast, and we can't bind() to a multicast address, the loop will look like this: 18.0563: Flow 0 (NEW): FREE -> NEW 18.0563: Flow 0 (INI): NEW -> INI 18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ? 18.0563: Flow 0 (TGT): INI -> TGT 18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0563: Flow 0 (UDP flow): TGT -> TYPED 18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument 18.0564: Flow 0 (FREE): TYPED -> FREE 18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0564: Discarding datagram without flow 18.0564: Flow 0 (NEW): FREE -> NEW 18.0564: Flow 0 (INI): NEW -> INI 18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ? 18.0564: Flow 0 (TGT): INI -> TGT 18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0564: Flow 0 (UDP flow): TGT -> TYPED 18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument 18.0564: Flow 0 (FREE): TYPED -> FREE 18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997 18.0564: Discarding datagram without flow and seen from strace: epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1 recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_DONTWAIT) = 0 socket(AF_INET6, SOCK_DGRAM\|SOCK_NONBLOCK, IPPROTO_UDP) = 12 setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0 setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0 setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0 bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument) close(12) = 0 recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_DONTWAIT) = 0 socket(AF_INET6, SOCK_DGRAM\|SOCK_NONBLOCK, IPPROTO_UDP) = 12 setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0 setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0 setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0 bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument) close(12) = 0 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-05-03 10:21:20 +02:00
David Gibson	436afc3044	udp: Translate offender addresses for ICMP messages We've recently added support for propagating ICMP errors related to a UDP flow from the host to the guest, by handling the extended UDP error on the socket and synthesizing a suitable ICMP on the tap interface. Currently we create that ICMP with a source address of the "offender" from the extended error information - the source of the ICMP error received on the host. However, we don't translate this address for cases where we NAT between host and guest. This means (amongst other things) that we won't get a "Connection refused" error as expected if send data from the guest to the --map-host-loopback address. The error comes from 127.0.0.1 on the host, which doesn't make sense on the tap interface and will be discarded by the guest. Because ICMP errors can be sent by an intermediate host, not just by the endpoints of the flow, we can't handle this translation purely with the information in the flow table entry. We need to explicitly translate this address by our NAT rules, which we can do with the nat_inbound() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-22 12:42:05 +02:00
David Gibson	08e617ec2b	udp: Rework offender address handling in udp_sock_recverr() Make a number of changes to udp_sock_recverr() to improve the robustness of how we handle addresses. * Get the "offender" address (source of the ICMP packet) using the SO_EE_OFFENDER() macro, reducing assumptions about structure layout. * Parse the offender sockaddr using inany_from_sockaddr() * Check explicitly that the source and destination pifs are what we expect. Previously we checked something that was probably equivalent in practice, but isn't strictly speaking what we require for the rest of the code. * Verify that for an ICMPv4 error we also have an IPv4 source/offender and destination/endpoint address * Verify that for an ICMPv6 error we have an IPv6 endpoint * Improve debug reporting of any failures Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-22 12:42:03 +02:00
David Gibson	2340bbf867	udp: Propagate errors on listening and brand new sockets udp_sock_recverr() processes errors on UDP sockets and attempts to propagate them as ICMP packets on the tap interface. To do this it currently requires the flow with which the error is associated as a parameter. If that's missing it will clear the error condition, but not propagate it. That means that we largely ignore errors on "listening" sockets. It also means we may discard some errors on flow specific sockets if they occur very shortly after the socket is created. In udp_flush_flow() we need to clear any datagrams received between bind() and connect() which might not be associated with the "final" flow for the socket. If we get errors before that point we'll ignore them in the same way because we don't know the flow they're associated with in advance. This can happen in practice if we have errors which occur almost immediately after connect(), such as ECONNREFUSED when we connect() to a local address where nothing is listening. Between the extended error message itself and the PKTINFO information we do actually have enough information to find the correct flow. So, rather than ignoring errors where we don't have a flow "hint", determine the flow the hard way in udp_sock_recverr(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change warn() to debug() in udp_sock_recverr()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:56:16 +02:00
David Gibson	cfc0ee145a	udp: Minor re-organisation of udp_sock_recverr() Usually we work with the "exit early" flow style, where we return early on "error" conditions in functions. We don't currently do this in udp_sock_recverr() for the case where we don't have a flow to associate the error with. Reorganise to use the "exit early" style, which will make some subsequent changes less awkward. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:49:06 +02:00
David Gibson	f107a86cc0	udp: Add udp_pktinfo() helper Currently we open code parsing the control message for IP_PKTINFO in udp_peek_addr(). We have an upcoming case where we want to parse PKTINFO in another place, so split this out into a helper function. While we're there, make the parsing a bit more robust: scan all cmsgs to look for the one we want, rather than assuming there's only one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:48:35 +02:00
David Gibson	04984578b0	udp: Deal with errors as we go in udp_sock_fwd() When we get an epoll event on a listening socket, we first deal with any errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()). However, it's theoretically possible that new errors could get flagged on the socket after we call udp_sock_errs(), in which case we could get errors returned in in udp_sock_fwd() -> udp_peek_addr() -> recvmsg(). In fact, we do deal with this correctly, although the path is somewhat non-obvious. The recvmsg() error will cause us to bail out of udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come back here next epoll loop and call udp_sock_errs(). Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from epoll events. This is to deal with any packets that arrived between bind() and connect(), and so might not be associated with the socket's intended flow. This expects udp_sock_fwd() to flush _all_ queued datagrams, so that anything received later must be for the correct flow. At the moment, udp_sock_errs() might fail to flush all datagrams if errors occur. In particular this can happen in practice for locally reported errors which occur immediately after connect() (e.g. connecting to a local port with nothing listening). We can deal with the problem case, and also make the flow a little more natural for the common case by having udp_sock_fwd() call udp_sock_errs() to handle errors as the occur, rather than trying to deal with all errors in advance. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:45:19 +02:00
David Gibson	3f995586b3	udp: Pass socket & flow information direction to error handling functions udp_sock_recverr() and udp_sock_errs() take an epoll reference from which they obtain both the socket fd to receive errors from, and - for flow specific sockets - the flow and side the socket is associated with. We have some upcoming cases where we want to clear errors when we're not directly associated with receiving an epoll event, so it's not natural to have an epoll reference. Therefore, make these functions take the socket and flow from explicit parameters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:45:09 +02:00
David Gibson	1bb8145c22	udp: Be quieter about errors on UDP receive If we get an error on UDP receive, either in udp_peek_addr() or udp_sock_recv(), we'll print an error message. However, this could be a perfectly routine UDP error triggered by an ICMP, which need not go to the error log. This doesn't usually happen, because before receiving we typically clear the error queue from udp_sock_errs(). However, it's possible an error could be flagged after udp_sock_errs() but before we receive. So it's better to handle this error "silently" (trace level only). We'll bail out of the receive, return to the epoll loop, and get an EPOLLERR where we'll handle and report the error properly. In particular there's one situation that can trigger this case much more easily. If we start a new outbound UDP flow to a local destination with nothing listening, we'll get a more or less immediate connection refused error. So, we'll get that error on the very first receive after the connect(). That will occur in udp_flow_defer() -> udp_flush_flow() -> udp_sock_fwd() -> udp_peek_addr() -> recvmsg(). This path doesn't call udp_sock_errs() first, so isn't (imperfectly) protected the way we are most of the time. Fixes: `84ab1305fa` ("udp: Polish udp_vu_sock_info() and remove from vu specific code") Fixes: `69e5393c37` ("udp: Move some more of sock_handler tasks into sub-functions") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:43:56 +02:00
David Gibson	baf049f8e0	udp: Fix breakage of UDP error handling by PKTINFO support We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our UDP sockets. This lets us obtain and properly handle the specific local address used when we're "listening" with a socket on 0.0.0.0 or ::. However, the PKTINFO cmsgs this option generates appear on error queue messages as well as regular datagrams. udp_sock_recverr() doesn't expect this and so flags an unrecoverable error when it can't parse the control message. Correct this by adding space in udp_sock_recverr()s control buffer for the additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather than only looking at the first one. Link: https://bugs.passt.top/show_bug.cgi?id=99 Fixes: `f4b0dd8b06` ("udp: Use PKTINFO cmsgs to get destination address for received datagrams") Reported-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-15 19:43:00 +02:00
David Gibson	59cc89f4cc	udp, udp_flow: Track our specific address on socket interfaces So far for UDP flows (like TCP connections) we didn't record our address (oaddr) in the flow table entry for socket based pifs. That's because we didn't have that information when a flow was initiated by a datagram coming to a "listening" socket with 0.0.0.0 or :: address. Even when we did have the information, we didn't record it, to simplify address matching on lookups. This meant that in some circumstances we could send replies on a UDP flow from a different address than the originating request came to, which is surprising and breaks certain setups. We now have code in udp_peek_addr() which does determine our address for incoming UDP datagrams. We can use that information to properly populate oaddr in the flow table for flow initiated from a socket. In order to be able to consistently match datagrams to flows, we must always have a specific oaddr, not an unspecified address (that's how the flow hash table works). So, we also need to fill in oaddr correctly for flows we initiate to sockets. Our forwarding logic doesn't specify oaddr here, letting the kernel decide based on the routing table. In this case we need to call getsockname() after connect()ing the socket to find which local address the kernel picked. This adds getsockname() to our seccomp profile for all variants. Link: https://bugs.passt.top/show_bug.cgi?id=99 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-10 19:46:16 +02:00
David Gibson	f4b0dd8b06	udp: Use PKTINFO cmsgs to get destination address for received datagrams Currently we get the source address for received datagrams from recvmsg(), but we don't get the local destination address. Sometimes we implicitly know this because the receiving socket is bound to a specific address, but when listening on 0.0.0.0 or ::, we don't. We need this information to properly direct replies to flows which come in to a non-default local address. So, enable the IP_PKTINFO and IPV6_PKTINFO control messages to obtain this information in udp_peek_addr(). For now we log a trace messages but don't do anything more with the information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-10 19:45:59 +02:00
David Gibson	9725e79888	udp_flow: Don't discard packets that arrive between bind() and connect() When we establish a new UDP flow we create connect()ed sockets that will only handle datagrams for this flow. However, there is a race between bind() and connect() where they might get some packets queued for a different flow. Currently we handle this by simply discarding any queued datagrams after the connect. UDP protocols should be able to handle such packet loss, but it's not ideal. We now have the tools we need to handle this better, by redirecting any datagrams received during that race to the appropriate flow. We need to use a deferred handler for this to avoid unexpectedly re-ordering datagrams in some edge cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Update comment to udp_flow_defer()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:44:31 +02:00
David Gibson	9eb5406260	udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock udp_splice() prepare and udp_splice_send() are both quite simple functions that now have only one caller: udp_sock_to_sock(). Fold them both into that caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:44:31 +02:00
David Gibson	bd6a41ee76	udp: Rework udp_listen_sock_data() into udp_sock_fwd() udp_listen_sock_data() forwards datagrams from a "listening" socket until there are no more (for now). We have an upcoming use case where we want to do that for a socket that's not a "listening" socket, and uses a different epoll reference. So, adjust the function to take the pieces it needs from the reference as direct parameters and rename to udp_sock_fwd(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:43:53 +02:00
David Gibson	159beefa36	udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock() Currently udp_flow_from_sock() is only used when receiving a datagram from a "listening" socket. It takes the listening socket's epoll reference to get the interface and port on which the datagram arrived. We have some upcoming cases where we want to use this in different contexts, so make it take the pif and port as direct parameters instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Drop @ref from comment to udp_flow_from_sock()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:43:53 +02:00
David Gibson	fd844a90bc	udp: Move UDP_MAX_FRAMES to udp.c Recent changes mean that this define is no longer used anywhere except in udp.c. Move it back into udp.c from udp_internal.h. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:43:53 +02:00
David Gibson	fc6ee68ad3	udp: Merge vhost-user and "buf" listening socket paths udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have effectively identical structure. The forwarding functions used for flow specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and udp_sock_to_sock()) also now take a number of datagrams. This means we can re-use them for the listening socket path, just passing '1' so they handle a single datagram at a time. This allows us to merge both the vhost-user and flow specific paths into a single, simpler udp_listen_sock_data(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:43:52 +02:00
David Gibson	0304dd9c34	udp: Split spliced forwarding path from udp_buf_reply_sock_data() udp_buf_reply_sock_data() can handle forwarding data either from socket to socket ("splicing") or from socket to tap. It has a test on each datagram for which case we're in, but that will be the same for everything in the batch. Split out the spliced path into a separate udp_sock_to_sock() function. This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from socket to tap, so rename and simplify them accordingly. This makes the code slightly longer for now, but will allow future cleanups to shrink it back down again. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fix typos in comments to udp_sock_recv() and udp_vu_listen_sock_data()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:41:32 +02:00
David Gibson	5221e177e1	udp: Parameterize number of datagrams handled by udp_*_reply_sock_data() Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally decide what the maximum number of datagrams they will forward is. We have some upcoming reasons to allow the caller to decide that instead, so make the maximum number of datagrams a parameter for both of them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:31:54 +02:00
David Gibson	3a0881dfd0	udp: Don't bother to batch datagrams from "listening" socket A "listening" UDP socket can receive datagrams from multiple flows. So, we currently have some quite subtle and complex code in udp_buf_listen_sock_data() to group contiguously received packets for the same flow into batches for forwarding. However, since we are now always using flow specific connect()ed sockets once a flow is established, handling of datagrams on listening sockets is essentially a slow path. Given that, it's not worth the complexity. Substantially simplify the code by using an approach more like vhost-user, and "peeking" at the address of the next datagram, one at a time to determine the correct flow before we actually receive the data, This removes all meaningful use of the s_in and tosidx fields in udp_meta_t, so they too can be removed, along with setting of msg_name and msg_namelen in the msghdr arrays which referenced them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:30:17 +02:00
David Gibson	84ab1305fa	udp: Polish udp_vu_sock_info() and remove from vu specific code udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be received and gets its source address. Currently we only use it in the vhost-user path, but there's nothing inherently vhost-user specific about it. We have upcoming uses for it elsewhere so rename and move to udp.c. While we're there, polish its error reporting a litle. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Drop excess newline before udp_sock_recv()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:29:23 +02:00
David Gibson	1d7bbb101a	udp: Make udp_sock_recv() take max number of frames as a parameter Currently udp_sock_recv() decides the maximum number of frames it is willing to receive based on the mode. However, we have upcoming use cases where we will have different criteria for how many frames we want with information that's not naturally available here but is in the caller. So make the maximum number of frames a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fix typo in comment in udp_buf_reply_sock_data()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:25:33 +02:00
David Gibson	d74b5a7c10	udp: Use connect()ed sockets for initiating side Currently we have an asymmetry in how we handle UDP sockets. For flows where the target side is a socket, we create a new connect()ed socket - the "reply socket" specifically for that flow used for sending and receiving datagrams on that flow and only that flow. For flows where the initiating side is a socket, we continue to use the "listening" socket (or rather, a dup() of it). This has some disadvantages: * We need a hash lookup for every datagram on the listening socket in order to work out what flow it belongs to * The dup() keeps the socket alive even if automatic forwarding removes the listening socket. However, the epoll data remains the same including containing the now stale original fd. This causes bug 103. * We can't (easily) set flow-specific options on an initiating side socket, because that could affect other flows as well Alter the code to use a connect()ed socket on the initiating side as well as the target side. There's no way to "clone and connect" the listening socket (a loose equivalent of accept() for UDP), so we have to create a new socket. We have to bind() this socket before we connect() it, which is allowed thanks to SO_REUSEADDR, but does leave a small window where it could receive datagrams not intended for this flow. For now we handle this by simply discarding any datagrams received between bind() and connect(), but I intend to improve this in a later patch. Link: https://bugs.passt.top/show_bug.cgi?id=103 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:24:36 +02:00
Jon Maloy	a7775e9550	udp: support traceroute in direction tap-socket Now that ICMP pass-through from socket-to-tap is in place, it is easy to support UDP based traceroute functionality in direction tap-to-socket. We fix that in this commit. Link: https://bugs.passt.top/show_bug.cgi?id=64 Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-07 21:23:35 +02:00
David Gibson	684870a766	udp: Correct some seccomp filter annotations Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments stating they use recvmmsg(). That's not correct, they only do so via udp_sock_recv() which lists recvmmsg() itself. In contrast udp_splice_send() and udp_tap_handler() both directly use sendmmsg(), but only the latter lists it. Add it to the former as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-02 11:30:29 +02:00
David Gibson	76e554d9ec	udp: Simplify updates to UDP flow timestamp Since UDP has no built in knowledge of connections, the only way we know when we're done with a UDP flow is a timeout with no activity. To keep track of this struct udp_flow includes a timestamp to record the last time we saw traffic on the flow. For data from listening sockets and from tap, this is done implicitly via udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly. However, that logic is duplicated between the vhost-user and "buf" paths. Make it common in udp_reply_sock_handler() instead. Technically this is a behavioural change: previously if we got an EPOLLIN event, but there wasn't actually any data we wouldn't update the timestamp, now we will. This should be harmless: if there's an EPOLLIN we expect there to be data, and even if there isn't the worst we can do is mildly delay the cleanup of a stale flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-02 11:30:26 +02:00
David Gibson	8aa2d90c8d	udp: Remove redundant udp_at_sidx() call in udp_tap_handler() We've already have a pointer to the UDP flow in variable uflow, we can just re-use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-04-02 11:30:14 +02:00
David Gibson	3de5af6e41	udp: Improve name of UDP related ICMP sending functions udp_send_conn_fail_icmp[46]() aren't actually specific to connections failing: they can propagate a variety of ICMP errors, which might or might not break a "connection". They are, however, specific to sending ICMP errors to the tap connection, not splice or host. Rename them to better reflect that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-28 13:26:11 +01:00
David Gibson	025a3c2686	udp: Don't attempt to forward ICMP socket errors to other sockets Recently we added support for detecting ICMP triggered errors on UDP sockets and forwarding them to the tap interface. However, in udp_sock_recverr() where this is handled we don't know for certain that the tap interface is the other side of the UDP flow. It could be a spliced connection with another socket on the other side. To forward errors in that case, we'd need to force the other side's socket to trigger issue an ICMP error. I'm not sure if there's a way to do that; probably not for an arbitrary ICMP but it might be possible for certain error conditions. Nonetheless what we do now - synthesise an ICMP on the tap interface - is certainly wrong. It's probably harmless; for a spliced connection it will have loopback addresses meaning we can expect the guest to discard it. But, correct this for now, by not attempting to propagate errors when the other side of the flow is a socket. Fixes: `55431f0077` ("udp: create and send ICMPv4 to local peer when applicable") Fixes: `68b04182e0` ("udp: create and send ICMPv6 to local peer when applicable") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-28 13:25:51 +01:00
Jon Maloy	65cca54be8	udp: correct source address for ICMP messages While developing traceroute forwarding tap-to-sock we found that struct msghdr.msg_name for the ICMPs in the opposite direction always contains the destination address of the original UDP message, and not, as one might expect, the one of the host which created the error message. Study of the kernel code reveals that this address instead is appended as extra data after the received struct sock_extended_err area. We now change the ICMP receive code accordingly. Fixes: `55431f0077` ("udp: create and send ICMPv4 to local peer when applicable") Fixes: `68b04182e0` ("udp: create and send ICMPv6 to local peer when applicable") Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-27 05:05:20 +01:00
David Gibson	f67c488b81	udp: Better handling of failure to forward from reply socket In udp_reply_sock_handler() if we're unable to forward the datagrams we just print an error. Generally this means we have an unsupported pair of pifs in the flow table, though, and that hasn't change. So, next time we get a matching packet we'll just get the same failure. In vhost-user mode we don't even dequeue the incoming packets which triggered this so we're likely to get the same failure immediately. Instead, close the flow, in the same we we do for an unrecoverable error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-26 21:34:30 +01:00
David Gibson	269cf6a12a	udp: Share more logic between vu and non-vu reply socket paths Share some additional miscellaneous logic between the vhost-user and "buf" paths for data on udp reply sockets. The biggest piece is error handling of cases where we can't forward between the two pifs of the flow. We also make common some more simple logic locating the correct flow and its parameters. This adds some lines of code due to extra comment lines, but nonetheless reduces logic duplication. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-26 21:34:28 +01:00
David Gibson	5a977c2f4e	udp: Simplify checking of epoll event bits udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN events. However, unlike most epoll event handlers we don't check the event bits right there. EPOLLERR is checked within udp_sock_errs() which we call unconditionally. Checking EPOLLIN is still more buried: it is checked within both udp_sock_recv() and udp_vu_sock_recv(). We can simplify the logic and pass less extraneous parameters around by moving the checking of the event bits to the top level event handlers. This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general event handlers, but specific to EPOLLIN events, meaning new data. So, rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better reflect their function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-26 21:34:23 +01:00
David Gibson	89b203b851	udp: Common invocation of udp_sock_errs() for vhost-user and "buf" paths The vhost-user and non-vhost-user paths for both udp_listen_sock_handler() and udp_reply_sock_handler() are more or less completely separate. Both, however, start with essentially the same invocation of udp_sock_errs(), so that can be made common. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-26 21:34:11 +01:00
Jon Maloy	68b04182e0	udp: create and send ICMPv6 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMPv6 message containing the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header and the first 1232 bytes of the original message, if any. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv6 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error types and codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't modify saddr which can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	55431f0077	udp: create and send ICMPv4 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMP message containing the error code ICMP_PORT_UNREACH, plus the header and the first eight bytes of the original message. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv4 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't modify 'in', it can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:19 +01:00
Stefano Brivio	09478d55fe	treewide: Dodge dynamic memory allocation in strerror() from glibc > 2.40 With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot use buffer after dlmopen (bug 32026)"), strerror() now needs, at least on x86, the getrandom() and brk() system calls, in order to fill in the locale-translated error message. But getrandom() and brk() are not allowed by our seccomp profiles. This became visible on Fedora Rawhide with the "podman login and logout" Podman tests, defined at test/e2e/login_logout_test.go in the Podman source tree, where pasta would terminate upon printing error descriptions (at least the ones related to the SO_ERROR queue for spliced connections). Avoid dynamic memory allocation by calling strerrordesc_np() instead, which is a GNU function returning a static, untranslated version of the error description. If it's not available, keep calling strerror(), which at that point should be simple enough as to be usable (at least, that's currently the case for musl). Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/24804 Analysed-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com>	2024-12-11 12:21:23 +01:00
David Gibson	67151090bc	iov, checksum: Replace csum_iov() with csum_iov_tail() We usually want to checksum only the tail part of a frame, excluding at least some headers. csum_iov() does that for a frame represented as an IO vector, not actually summing the entire IO vector. We now have struct iov_tail to explicitly represent this construct, so replace csum_iov() with csum_iov_tail() taking that representation rather than 3 parameters. We propagate the same change to csum_udp4() and csum_udp6() which take similar parameters. This slightly simplifies the code, and will allow some further simplifications as struct iov_tail is more widely used. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
Laurent Vivier	28997fcb29	vhost-user: add vhost-user add virtio and vhost-user functions to connect with QEMU. $ ./passt --vhost-user and # qemu-system-x86_64 ... -m 4G \ -object memory-backend-memfd,id=memfd0,share=on,size=4G \ -numa node,memdev=memfd0 \ -chardev socket,id=chr0,path=/tmp/passt_1.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \ ... Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: as suggested by lvivier, include <netinet/if_ether.h> before including <linux/if_ether.h> as C libraries such as musl __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have a definition of struct ethhdr] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:47:32 +01:00
Laurent Vivier	5a8b33c667	udp: Prepare udp.c to be shared with vhost-user Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and udp_sock_errs(). Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and udp_reply_sock_handler to udp_buf_reply_sock_handler(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:21 +01:00
David Gibson	6f913b3af0	udp: Don't dereference uflow before NULL check in udp_reply_sock_handler() We have an ASSERT() verifying that we're able to look up the flow in udp_reply_sock_handler(). However, we dereference uflow before that in an initializer, rather defeating the point. Rearrange to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:47:22 +01:00
Stefano Brivio	b1a607fba1	udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx /home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors] 171 \| enum udp_iov_idx { \| ^ 172 \| UDP_IOV_TAP = 0, 173 \| UDP_IOV_ETH = 1, 174 \| UDP_IOV_IP = 2, 175 \| UDP_IOV_PAYLOAD = 3, 176 \| UDP_NUM_IOVS \| \| = 4 Don't initialise any value, so that it's obvious that constants map to unique values. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:31 +01:00
Laurent Vivier	151dbe0d3d	udp: Update UDP checksum using an iovec array As for tcp_update_check_tcp4()/tcp_update_check_tcp6(), change csum_udp4() and csum_udp6() to use an iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:13 +02:00
David Gibson	cbde4192ee	tcp, udp: Make {tcp,udp}_sock_init() take an inany address tcp_sock_init() and udp_sock_init() take an address to bind to as an address family and void * pair. Use an inany instead. Formerly AF_UNSPEC was used to indicate that we want to listen on both 0.0.0.0 and ::, now use a NULL inany to indicate that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:16 +02:00
David Gibson	b8d4fac6a2	util, pif: Replace sock_l4() with pif_sock_l4() The sock_l4() function is very convenient for creating sockets bound to a given address, but its interface has some problems. Most importantly, the address and port alone aren't enough in some cases. For link-local addresses (at least) we also need the pif in order to properly construct a socket adddress. This case doesn't yet arise, but it might cause us trouble in future. Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it should attempt to create a "dual stack" socket which will respond to both IPv4 and IPv6 traffic. This only makes sense if there is no specific address given. We verify this at runtime, but it would be nicer if we could enforce it structurally. For sockets associated specifically with a single flow we already replaced sock_l4() with flowside_sock_l4() which avoids those problems. Now, replace all the remaining users with a new pif_sock_l4() which also takes an explicit pif. The new function takes the address as an inany *, with NULL indicating the dual stack case. This does add some complexity in some of the callers, however future planned cleanups should make this go away again. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:15 +02:00
David Gibson	204e77cd11	udp: Don't attempt to get dual-stack sockets in nonsensical cases To save some kernel memory we try to use "dual stack" sockets (that listen to both IPv4 and IPv6 traffic) when possible. However udp_sock_init() attempts to do this in some cases that can't work. Specifically we can only do this when listening on any address. That's never true for the ns (splicing) case, because we always listen on loopback. For the !ns case and AF_UNSPEC case, addr should always be NULL, but add an assert to verify. This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll fall back to the other path. But, it's messy and makes some upcoming changes harder, so avoid attempting this in cases we know can't work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:15 +02:00
Laurent Vivier	4fe5f4e813	udp: Allow checksum to be disabled We can need not to set the UDP checksum. Add a parameter to udp_update_hdr4() and udp_update_hdr6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:15:20 +02:00
David Gibson	748ef4cd6e	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:20 +02:00
David Gibson	aff5a49b0e	udp: Handle more error conditions in udp_sock_errs() udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:38 +02:00

1 2 3 4 5 ...

265 commits