passt

mirror of https://passt.top/passt synced 2025-06-19 21:45:34 +02:00

Author	SHA1	Message	Date
Laurent Vivier	2c883498b5	Correct various function comment headers This commit refines function comment headers for improved accuracy and consistency. Key changes include: - Corrected parameter/return descriptions (e.g., `logtime`, `__daemon`). - Added missing and removed incorrect parameter documentation (e.g., `tcp_vu_sock_recv`, `ndp`). - Standardized comments to the `/** ... */` style for functions like `udp_flow_close` and `ns_enter`. - Ensured function names in comments consistently use `()`. - Addressed minor typos and updated comments for renamed functions. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-06-04 12:32:04 +02:00
David Gibson	cb5b593563	tcp, flow: Better use flow specific logging helpers A number of places in the TCP code use general logging functions, instead of the flow specific ones. That includes a few older ones as well as many places in the new migration code. Thus they either don't identify which flow the problem happened on, or identify it in a non-standard way. Convert many of these to use the existing flow specific helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-14 23:40:40 +01:00
David Gibson	5a07eb3ccc	tcp_vu: head_cnt need not be global head_cnt is a global variable which tracks how many entries in head[] are currently used. The fact that it's global obscures the fact that the lifetime over which it has a meaningful value is quite short: a single call to tcp_vu_data_from_sock(). Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer. We keep the head[] array global for now - although technically it has the same valid lifetime - because it's large enough we might not want to put it on the stack. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 11:28:37 +01:00
Laurent Vivier	def7de4690	tcp_vu: Fix off-by-one in header count array adjustment head_cnt represents the number of frames we're going to forward to the guest in tcp_vu_sock_recv(), each of which could require multiple buffers ("elements"). We initialise it with as many frames as we can find space for in vu buffers, and we then need to adjust it down to the number of frames we actually (partially) filled. We adjust it down based on number of individual buffers used by the data from recvmsg(). At this point 'i' is one greater than that number of buffers, so we need to discard all (unused) frames with a buffer index >= i, instead of > i. Reported-by: David Gibson <david@gibson.dropbear.id.au> [david: Contributed actual commit message] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:44:25 +01:00
Stefano Brivio	ec5c4d936d	tcp: Set PSH flag for last incoming packets in a batch So far we omitted setting PSH flags for inbound traffic altogether: as we ignore the nature of the data we're sending, we can't conclude that some data is more or less urgent. This works fine with Linux guests, as the Linux kernel doesn't do much with it, on input: it will generally deliver data to the application layer without delay. However, with Windows, things change: if we don't set the PSH flag on interactive inbound traffic, we can expect long delays before the data is delivered to the application. This is very visible with RDP, where packets we send on behalf of the RDP client are delivered with delays exceeding one second: $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls 33170 0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0 33171 0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51 33172 0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0 33173 1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370 in this example (packet capture taken by passt), frame #33172 is a mouse event sent by the RDP client, and frame #33173 is the first event (display reacting to click) sent back by the server. This appears as a 1.4 s delay before we get frame #33173. If we set PSH, instead: $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 
317 }' --disable-protocol tls 314 0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116 315 0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0 316 0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116 317 0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53 here, in frame #316, our mouse event is delivered without a delay and receives a response in approximately 12 ms. Set PSH on the last segment for any batch we dequeue from the socket, that is, set it whenever we know that we might not be sending data to the same port for a while. Reported-by: NN708 Link: https://bugs.passt.top/show_bug.cgi?id=107 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-21 14:28:44 +01:00
Stefano Brivio	a8f4fc481c	tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest There are pretty much two cases of the (misnomer) STALLED: in one case, we could send more data to the guest if it becomes available, and in another case, we can't, because we filled the window. If, in this second case, we keep EPOLLIN enabled, but never read from the socket, we get short but CPU-annoying storms of EPOLLIN events, upon which we reschedule the ACK timeout handler, never read from the socket, go back to epoll_wait(), and so on: timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 also known as: 29.1517: Flow 2 (TCP connection): timer expires in 2.000s 29.1517: Flow 2 (TCP connection): timer expires in 2.000s 29.1517: Flow 2 (TCP connection): timer expires in 2.000s which, for some reason, becomes very visible with muvm and aria2c downloading from a server nearby in parallel chunks. That's because EPOLLIN isn't cleared if we don't read from the socket, and even with EPOLLET, epoll_wait() will repeatedly wake us up until we actually read something. In this case, we don't want to subscribe to EPOLLIN at all: all we're waiting for is an ACK segment from the guest. Differentiate this case with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just indicate that we're waiting for an ACK from the guest (ACK_FROM_TAP_DUE), but also that we're blocked waiting for it. 
If this flag is set before we set STALLED, EPOLLIN will be masked while we set EPOLLET because of STALLED. Whenever we clear STALLED, we also clear this flag. This is definitely not elegant, but it's a minimal fix. We can probably simplify this at a later point by having a category of connection flags directly corresponding to epoll flags, and dropping STALLED altogether, or, perhaps, always using EPOLLET (but we need a mechanism to re-check sockets for pending data if we can't temporarily write to the guest). I suspect that this might also be implied in https://github.com/containers/podman/issues/23686, hence the Link: tag. It doesn't necessarily mean I'm fixing it (I can't reproduce that). Link: https://github.com/containers/podman/issues/23686 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-16 21:15:33 +01:00
Stefano Brivio	b8f573cdc2	tcp: Set EPOLLET when reading from a socket fails with EAGAIN Before SO_PEEK_OFF support was introduced by commit `e63d281871` ("tcp: leverage support of SO_PEEK_OFF socket option when available"), we would peek data from sockets using a "discard" buffer as first iovec element, so that, unless we had no pending data at all, we would always get a positive return code from recvmsg() (except for closing connections or errors). If we couldn't send more data to the guest, in the window, we would set the STALLED flag (causing the epoll descriptor to switch to edge-triggered mode), and return early from tcp_data_from_sock(). With SO_PEEK_OFF, we don't have a discard buffer, and if there's data on the socket, but nothing beyond our current peeking offset, we'll get EAGAIN instead of our current "discard" length. In that case, we return even earlier, and we don't set EPOLLET on the socket as a result. As reported by Asahi Lina, this causes event loops where the kernel is signalling socket readiness, because there's data we didn't dequeue yet (waiting for the guest to acknowledge it), but we won't actually peek anything new, and return early without setting EPOLLET. This is the original report, mentioning the originally proposed fix: -- When there is unacknowledged data in the inbound socket buffer, passt leaves the socket in the epoll instance to accept new data from the server. 
Since there is already data in the socket buffer, an epoll without EPOLLET will repeatedly fire while no data is processed, busy-looping the CPU: epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) Add in the missing EPOLLET flag for this case. This brings CPU usage down from around ~80% when downloading over TCP, to ~5% (use case: passt as network transport for muvm, downloading Steam games). -- we can't set EPOLLET unconditionally though, at least right now, because we don't monitor the guest tap for EPOLLOUT in case we fail to write on that side because we filled up that buffer (and not the window of a TCP connection). Instead, rely on the observation that, once a connection is established, we only get EAGAIN on recvmsg() if we are attempting to peek data from a socket with a non-zero peeking offset: we only peek when there's pending data on a socket, and in that case, if we peek without offset, we'll always see some data. And if we peek data with a non-zero offset and get EAGAIN, that means that we're either waiting for more data to arrive on the socket (which would cause further wake-ups, even with EPOLLET), or we're waiting for the guest to acknowledge some of it, which would anyway cause a wake-up. 
In that case, it's safe to set STALLED and, in turn, EPOLLET on the socket, which fixes the EPOLLIN event loop. While we're establishing a connection from the socket side, though, we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got any data while we were waiting for SYN, ACK from the guest. See the comment at the end of tcp_conn_from_sock_finish(). And if there's no data queued on the socket as we check, we'll also get EAGAIN, even if our peeking offset is zero. For this reason, we need to additionally check that 'already_sent' is not zero, meaning, explicitly, that our peeking offset is not zero. Reported-by: Asahi Lina <lina@asahilina.net> Fixes: `e63d281871` ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-16 21:15:33 +01:00
Laurent Vivier	020c8b7127	tcp_vu: Compute IPv4 header checksum if dlen changes In tcp_vu_data_from_sock() we compute IPv4 header checksum only for the first and the last packets, and re-use the first packet checksum for all the other packets as the content of the header doesn't change. It's more accurate to check the dlen value to know if the checksum should change as dlen is the only information that can change in the loop. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	b6e79efa0b	tcp_vu: Remove unnecessary tcp_vu_update_check() function Because the vhost-user <-> virtio-net path ignores checksums, we usually don't calculate them when sending packets to the guest. So, we always pass no_tcp_csum=true to tcp_fill_headers(). We do want accurate checksums when capturing packets though, so the captures don't show bogus values. Currently we handle this by updating the checksum field immediately before writing the packet to the capture file, using tcp_vu_update_check(). This is unnecessary, though: in each case tcp_fill_headers() is called not very long before, so we can alter its no_tcp_csum parameter based on whether we're generating captures or not. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	a6348cad51	tcp: Merge tcp_fill_headers[46]() with each other We have different versions of this function for IPv4 and IPv6, but the caller already requires some IP version specific code to get the right header pointers. Instead, have a common function that fills either an IPv4 or an IPv6 header based on which header pointer it is passed. This allows us to remove a small amount of code duplication and make a few slightly ugly conditionals. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	2abf5ab7f3	tcp: Merge tcp_update_check_tcp[46]() The only reason we need separate functions for the IPv4 and IPv6 case is to calculate the checksum of the IP pseudo-header, which is different for the two cases. However, the caller already knows which path it's on and can access the values needed for the pseudo-header partial sum more easily than tcp_update_check_tcp[46]() can. So, merge these functions into a single tcp_update_csum() function that just takes the pseudo-header partial sum, calculated in the caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	08ea3cc581	tcp: Pass TCP header and payload separately to tcp_fill_headers[46]() At the moment these take separate pointers to the tap specific and IP headers, but expect the TCP header and payload as a single tcp_payload_t. As well as being slightly inconsistent, this involves some slightly iffy pointer shenanigans when called on the flags path with a tcp_flags_t instead of a tcp_payload_t. More importantly, it's inconvenient for the upcoming vhost-user case, where the TCP header and payload might not be contiguous. Furthermore, the payload itself might not be contiguous. So, pass the TCP header as its own pointer, and the TCP payload as an IO vector. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	2ee07697c4	tcp: Pass TCP header and payload separately to tcp_update_check_tcp[46]() Currently these expect both the TCP header and payload in a single IOV, and go to some trouble to locate the checksum field within it. In the current caller we already know where the TCP header is, so we might as well just pass it in. This will need to work a bit differently for vhost-user, but that code already needs to locate the TCP header for other reasons, so again we can just pass it in. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
Stefano Brivio	804a7ce94a	tcp_vu: Change 'dlen' to ssize_t in tcp_vu_data_from_sock() ...to quickly suppress a false positive from Coverity, which assumes that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Laurent Vivier <lvivier@redhat.com>	2024-11-27 16:49:21 +01:00
David Gibson	7e131e920c	tcp: Move tcp_l2_buf_fill_headers() to tcp_buf.c This function only has callers in tcp_buf.c. More importantly, it's inherently tied to the "buf" path, because it uses internal knowledge of how we lay out the various headers across our locally allocated buffers. Therefore, move it to tcp_buf.c. Slightly reformat the prototypes while we're at it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:49:21 +01:00
Laurent Vivier	28997fcb29	vhost-user: add vhost-user add virtio and vhost-user functions to connect with QEMU. $ ./passt --vhost-user and # qemu-system-x86_64 ... -m 4G \ -object memory-backend-memfd,id=memfd0,share=on,size=4G \ -numa node,memdev=memfd0 \ -chardev socket,id=chr0,path=/tmp/passt_1.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \ ... Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: as suggested by lvivier, include <netinet/if_ether.h> before including <linux/if_ether.h> as C libraries such as musl define __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have a definition of struct ethhdr] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:47:32 +01:00
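
Note on 2abf5ab7f3: the idea of one checksum routine that takes a pseudo-header partial sum computed by the caller can be illustrated with a minimal sketch. This is not the passt implementation: the names sum16(), csum_fold(), psum_v4(), psum_v6() and tcp_csum_sketch() are hypothetical, and the segment is assumed to sit in one contiguous buffer, whereas the log above mentions IO vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the big-endian 16-bit words of buf to a running partial sum */
static uint32_t sum16(uint32_t sum, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	assert(len <= 65535);	/* keeps the 32-bit accumulator from overflowing */
	for (; len > 1; p += 2, len -= 2)
		sum += (uint32_t)p[0] << 8 | p[1];
	if (len)		/* odd trailing byte, padded with zero */
		sum += (uint32_t)p[0] << 8;
	return sum;
}

/* Fold carries back in and return the one's complement (RFC 1071) */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* IPv4 pseudo-header: addresses, protocol (6 for TCP), TCP length */
static uint32_t psum_v4(const uint8_t src[4], const uint8_t dst[4],
			uint16_t l4len)
{
	uint32_t sum = 6 + (uint32_t)l4len;

	sum = sum16(sum, src, 4);
	return sum16(sum, dst, 4);
}

/* IPv6 pseudo-header: addresses, 32-bit length, next header (6) */
static uint32_t psum_v6(const uint8_t src[16], const uint8_t dst[16],
			uint32_t l4len)
{
	uint32_t sum = 6 + (l4len >> 16) + (l4len & 0xffff);

	sum = sum16(sum, src, 16);
	return sum16(sum, dst, 16);
}

/* Common part: no knowledge of the IP version is needed here.
 * The checksum field inside the segment must be zero on entry.
 */
static uint16_t tcp_csum_sketch(uint32_t psum, const void *seg, size_t len)
{
	return csum_fold(sum16(psum, seg, len));
}
```

Only the two small psum_*() helpers differ between IP versions in this sketch, which is the property the commit message describes. Running the same routine over a segment that already carries its checksum yields zero, which gives a simple self-check.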

16 commits
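
Note on def7de4690: the boundary condition described in that message can be shown with a small sketch. It assumes that head[] stores, for each frame, the index of the first buffer belonging to it, and that `used` plays the role of 'i' in the message (one greater than the index of the last buffer that received data). The function name trim_frames() is hypothetical and not taken from passt.

```c
#include <assert.h>

/* Return how many frames remain once every frame whose first buffer
 * index is >= used has been discarded. Comparing with '>' instead of
 * '>=' would keep one frame that received no data at all.
 */
static int trim_frames(const int *head, int head_cnt, int used)
{
	assert(head_cnt >= 0 && used >= 0);

	while (head_cnt > 0 && head[head_cnt - 1] >= used)
		head_cnt--;

	return head_cnt;
}
```

With frames starting at buffers 0, 2, 4 and 6 and four buffers filled (indices 0 to 3), only the first two frames hold data, so the frame starting at index 4 has to go as well.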