passt

mirror of https://passt.top/passt synced 2025-08-14 19:03:12 +02:00

Author	SHA1	Message	Date
David Gibson	9866d146e6	tap: Clarify calculation of TAP_MSGS The rationale behind the calculation of TAP_MSGS isn't necessarily obvious. It's supposed to be the maximum number of packets that can fit in pkt_buf. However, the calculation is wrong in several ways: * It's based on ETH_ZLEN which isn't meaningful for virtual devices * It always includes the qemu socket header which isn't used for pasta * The size of pkt_buf isn't relevant for vhost-user We've already made sure this is just a tuning parameter, not a hard limit. Clarify what we're calculating here and why. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-20 20:33:12 +01:00
David Gibson	a41d6d125e	tap: Make size of pool_tap[46] purely a tuning parameter Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf (TAP_MSGS). However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But ETH_ZLEN is based on physical constraints of Ethernet, which don't apply to our virtual devices. It is possible to generate a legitimate frame smaller than this, for example an empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long. Further more, the same limit applies for vhost-user, which is not limited by the size of pkt_buf like the other backends. In that case we don't even have full control of the maximum buffer size, so we can't really calculate how many packets could fit in there. If we exceed do TAP_MSGS we'll drop packets, not just use more batches, which is moderately bad. The fact that this needs to be sized just so for correctness not merely for tuning is a fairly non-obvious coupling between different parts of the code. To make this more robust, alter the tap code so it doesn't rely on everything fitting in a single batch of TAP_MSGS packets, instead breaking into multiple batches as necessary. This leaves TAP_MSGS as purely a tuning parameter, which we can freely adjust based on performance measures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-20 20:33:09 +01:00
David Gibson	9d1a6b3eba	pcap: Correctly set snaplen based on tap backend type The pcap header includes a value indicating how much of each frame is captured. We always capture the entire frame, so we want to set this to the maximum possible frame size. Currently we do that by setting it to ETH_MAX_MTU, but that's a confusingly named constant which might not always be correct depending on the details of our tap backend. Instead add a tap_l2_max_len() function that explicitly returns the maximum frame size for the current mode and use that to set snaplen. While we're there, there's no particular need for the pcap header to be defined in a global; make it local to pcap_init() instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-12 23:08:33 +01:00
David Gibson	c4bfa3339c	tap: Use explicit defines for maximum length of L2 frame Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of an L2 frame. This define comes from the kernel, but it's badly named and used confusingly. First, it doesn't really have anything to do with Ethernet, which has no structural limit on frame lengths. It comes more from either a) IP which imposes a 64k datagram limit or b) from internal buffers used in various places in the kernel (and in passt). Worse, MTU generally means the maximum size of the IP (L3) datagram which may be transferred, _not_ counting the L2 headers. In the kernel ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as a maximum frame length, _including_ L2 headers. In tap.c we're mostly using it in the second way. Finally, each of our tap backends could have different limits on the frame size imposed by the mechanisms they're using. Start clearing up this confusion by replacing it in tap.c with new L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame length for each backend. Signed-off-by: David Gibson <dgibson@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-12 23:08:33 +01:00
David Gibson	1eda8de438	packet: Remove redundant TAP_BUF_BYTES define Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially the same thing. They'll be different only if TAP_BUF_BYTES is negative, which makes no sense. So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES. In addition, most places we use this to just mean the size of the main packet buffer (pkt_buf) for which we can just directly use sizeof. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-12 23:08:33 +01:00
Jon Maloy	68b04182e0	udp: create and send ICMPv6 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMPv6 message containing the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header and the first 1232 bytes of the original message, if any. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv6 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error types and codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't modify saddr which can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	87e6a46442	tap: break out building of udp header from tap_udp6_send function We will need to build the UDP header at other locations than in function tap_udp6_send(), so we break that part out to a separate function. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	55431f0077	udp: create and send ICMPv4 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMP message containing the error code ICMP_PORT_UNREACH, plus the header and the first eight bytes of the original message. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv4 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't modify 'in', it can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:19 +01:00
Jon Maloy	82a839be98	tap: break out building of udp header from tap_udp4_send function We will need to build the UDP header at other locations than in function tap_udp4_send(), so we break that part out to a separate function. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-06 20:17:36 +01:00
David Gibson	672d786de1	tcp: Send RST in response to guest packets that match no connection Currently, if a non-SYN TCP packet arrives which doesn't match any existing connection, we simply ignore it. However RFC 9293, section 3.10.7.1 says we should respond with an RST to a non-SYN, non-RST packet that's for a CLOSED (i.e. non-existent) connection. This can arise in practice with migration, in cases where some error means we have to discard a connection. We destroy the connection with tcp_rst() in that case, but because the guest is stopped, we may not be able to deliver the RST packet on the tap interface immediately. This change ensures an RST will be sent if the guest tries to use the connection again. A similar situation can arise if a passt/pasta instance is killed or crashes, but is then replaced with another attached to the same guest. This can leave the guest with stale connections that the new passt instance isn't aware of. It's better to send an RST so the guest knows quickly these are broken, rather than letting them linger until they time out. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:32 +01:00
David Gibson	1f236817ea	tap: Consider IPv6 flow label when building packet sequences To allow more batching, we group together related packets into "seqs" in the tap layer, before passing them to the L4 protocol layers. Currently we consider the IP protocol, both IP addresses and also the L4 ports when grouping things into seqs. We ignore the IPv6 flow label. We have some future cases where we want to consider the the flow label in the L4 code, which is awkward if we could be given a single batch with multiple labels. Add the flow label to tap6_l4_t and group by it as well as the other criteria. In future we could possibly use the flow label _instead_ of peeking into the L4 header for the ports, but we don't do so for now. The guest should use the same flow label for all packets in a low, but if it doesn't this change won't break anything, it just means we'll batch things a bit sub-optimally. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:29 +01:00
David Gibson	008175636c	ip: Helpers to access IPv6 flow label The flow label is a 20-bit field in the IPv6 header. The length and alignment make it awkward to pass around as is. Obviously, it can be packed into a 32-bit integer though, and we do this in two places. We have some further upcoming places where we want to manipulate the flow label, so make some helpers for marshalling and unmarshalling it to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:17 +01:00
Jon Maloy	ea69ca6a20	tap: always set the no_frag flag in IPv4 headers When studying the Linux source code and Wireshark dumps it seems like the no_frag flag in the IPv4 header is always set. Following discussions in the Internet on this subject indicates that modern routers never fragment packets, and that it isn't even supported in many cases. Adding to this that incoming messages forwarded on the tap interface never even pass through a router it seems safe to always set this flag. This makes the IPv4 headers of forwarded messages identical to those sent by the external sockets, something we must consider desirable. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-20 12:43:00 +01:00
Stefano Brivio	b899141ad5	Add interfaces and configuration bits for passt-repair In vhost-user mode, by default, create a second UNIX domain socket accepting connections from passt-repair, with the usual listener socket. When we need to set or clear TCP_REPAIR on sockets, we'll send them via SCM_RIGHTS to passt-repair, who sets the socket option values we ask for. To that end, introduce batched functions to request TCP_REPAIR settings on sockets, so that we don't have to send a single message for each socket, on migration. When needed, repair_flush() will send the message and check for the reply. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:47:28 +01:00
Paul Holzinger	d0006fa784	treewide: use _exit() over exit() In the podman CI I noticed many seccomp denials in our logs even though tests passed: comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000 Which is futex being called and blocked by the pasta profile. After a few tries I managed to reproduce locally with this loop in ~20 min: while :; do podman run -d --network bridge quay.io/libpod/testimage:20241011 \ sleep 100 && \ sleep 10 && \ podman rm -fa -t0 done And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the following stack trace: Stack trace of thread 1: #0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b) #1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de) #2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #3 0x000055f31b78c50b n/a (n/a + 0x0) #4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #5 0x000055f31b78d5a2 n/a (n/a + 0x0) Pasta got killed in exit(), it seems glibc is trying to use a lock when running exit handlers even though no exit handlers are defined. Given no exit handlers are needed we can call _exit() instead. This skips exit handlers and does not flush stdio streams compared to exit() which should be fine for the use here. Based on the input from Stefano I did not change the test/doc programs or qrap as they do not use seccomp filters. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-05 15:19:02 +01:00
David Gibson	0349cf637f	util: Rename and make global vu_remove_watch() vu_remove_watch() is used in vhost_user.c to remove an fd from the global epoll set. There's nothing really vhost user specific about it though, so rename, move to util.c and use it in a bunch of places outside vhost_user.c where it makes things marginally more readable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-03 07:32:51 +01:00
Laurent Vivier	947f5cdb93	tap: Call vu_init() with --fd We need to initialize vhost-user structures with --fd too. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-12-10 12:26:56 +01:00
Laurent Vivier	2139ad33fc	tap: Use a common function to start a new connection Merge code from tap_backend_init(), tap_sock_tun_init() and tap_listen_handler() to set epoll_ref entry and to add it to epollfd. No functionality change Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-12-10 12:26:34 +01:00
David Gibson	67151090bc	iov, checksum: Replace csum_iov() with csum_iov_tail() We usually want to checksum only the tail part of a frame, excluding at least some headers. csum_iov() does that for a frame represented as an IO vector, not actually summing the entire IO vector. We now have struct iov_tail to explicitly represent this construct, so replace csum_iov() with csum_iov_tail() taking that representation rather than 3 parameters. We propagate the same change to csum_udp4() and csum_udp6() which take similar parameters. This slightly simplifies the code, and will allow some further simplifications as struct iov_tail is more widely used. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
Laurent Vivier	28997fcb29	vhost-user: add vhost-user add virtio and vhost-user functions to connect with QEMU. $ ./passt --vhost-user and # qemu-system-x86_64 ... -m 4G \ -object memory-backend-memfd,id=memfd0,share=on,size=4G \ -numa node,memdev=memfd0 \ -chardev socket,id=chr0,path=/tmp/passt_1.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \ ... Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: as suggested by lvivier, include <netinet/if_ether.h> before including <linux/if_ether.h> as C libraries such as musl __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have a definition of struct ethhdr] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:47:32 +01:00
Laurent Vivier	b2e62f7e85	passt: rename tap_sock_init() to tap_backend_init() Extract pool storage initialization loop to tap_sock_update_pool(), extract QEMU hints to tap_backend_show_hints(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:27 +01:00
Stefano Brivio	14b84a7f07	treewide: Introduce 'local mode' for disconnected setups There are setups where no host interface is available or configured at all, intentionally or not, temporarily or not, but users expect (Podman) containers to run in any case as they did with slirp4netns, and we're now getting reports that we broke such setups at a rather alarming rate. To this end, if we don't find any usable host interface, instead of exiting: - for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2 as default gateway - for IPv6, don't assign any address (forcibly disable DHCPv6), and use the first link-local address we observe to represent the guest/container. Advertise fe80::1 as default gateway - use 'tap0' as default interface name for pasta Change ifi4 and ifi6 in struct ctx to int and accept a special -1 value meaning that no host interface was selected, but the IP family is enabled. The fact that the kernel uses unsigned int values for those is not an issue as 1. one can't create so many interfaces anyway and 2. we otherwise handle those values transparently. Fix a botched conditional in conf_print() to actually skip printing DHCPv6 information if DHCPv6 is disabled (and skip printing NDP information if NDP is disabled). Link: https://github.com/containers/podman/issues/24614 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 05:16:38 +01:00
Stefano Brivio	58fa5508bd	tap, tcp, util: Add some missing SOCK_CLOEXEC flags I have no idea why, but these are reported by clang-tidy (19.2.1) on Alpine (x86) only: /home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1139 \| int fd = socket(AF_UNIX, SOCK_STREAM, 0); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1158 \| ex = socket(AF_UNIX, SOCK_STREAM \| SOCK_NONBLOCK, 0); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1413 \| s = socket(af, SOCK_STREAM \| SOCK_NONBLOCK, IPPROTO_TCP); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 188 \| if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) { \| ^ \| \| SOCK_CLOEXEC Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:58 +01:00
Stefano Brivio	87940f9aa7	tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES Given that we're comparing against 'n', which is signed, we cast TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be signed. This doesn't necessarily happen on 32-bit architectures, though. On armhf and i686, clang-tidy 18.1.8 and 19.1.2 report: /home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors] 1087 \| for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) { \| ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cast the whole difference to ssize_t, as we know it's going to be positive anyway, instead of relying on that side effect. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:45 +01:00
Stefano Brivio	9afce0b45c	tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than INT_MAX (2^31 - 1), and musl declares the second argument of ioctl() as 'int', not 'unsigned long' like glibc does, probably because of how POSIX specifies the equivalent argument, int dcmd, in posix_devctl(), so gcc reports a warning: tap.c: In function 'tap_ns_tun': tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow] 1291 \| rc = ioctl(fd, TUNSETIFF, &ifr); \| ^~~~~~~~~ We don't care about that overflow, so explicitly cast TUNSETIFF to int. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-05 23:46:33 +01:00
Stefano Brivio	98efe7c2fd	treewide: Comply with CERT C rule ERR33-C for snprintf() clang-tidy, starting from LLVM version 16, up to at least LLVM version 19, now checks that we detect and handle errors for snprintf() as requested by CERT C rule ERR33-C. These warnings were logged with LLVM version 19.1.2 (at least Debian and Fedora match): /home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 43 \| snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning /home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 577 \| snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning /home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 579 \| snprintf(userns, PATH_MAX, "/proc/%ld/ns/user", \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 580 \| pidval); \| ~~~~~~~ /home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 105 \| snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 242 \| snprintf(uidmap, BUFSIZ, "0 %u 1", uid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 243 \| snprintf(gidmap, BUFSIZ, "0 %u 1", gid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 1155 \| snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning Don't silence the warnings as they might actually have some merit. Add an snprintf_check() function, instead, checking that we're not truncating messages while printing to buffers, and terminate if the check fails. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:25 +01:00
Laurent Vivier	151dbe0d3d	udp: Update UDP checksum using an iovec array As for tcp_update_check_tcp4()/tcp_update_check_tcp6(), change csum_udp4() and csum_udp6() to use an iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:13 +02:00
David Gibson	a33ecafbd9	tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:46 +02:00
David Gibson	d2a1dc744b	tap: Restructure in tap_pasta_input() tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:43 +02:00
David Gibson	11e29054fe	tap: Improve handling of EINTR in tap_passt_input() When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:41 +02:00
David Gibson	49fc4e0414	tap: Split out handling of EPOLLIN events Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:37 +02:00
David Gibson	905ecd2b0b	treewide: Rename MAC address fields for clarity c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:54 +02:00
AbdAlRahman Gad	c16141eda5	ndp.c: Turn NDP responder into more declarative implementation - Add structs for NA, RA, NS, MTU, prefix info, option header, link-layer address, RDNSS, DNSSL and link-layer for RA message. - Turn NA message from purely imperative, going byte by byte, to declarative by filling it's struct. - Turn part of RA message into declarative. - Move packet_add() to be before the call of ndp() in tap6_handler() if the protocol of the packet is ICMPv6. - Add a pool of packets as an additional parameter to ndp(). - Check the size of NS packet with packet_get() before sending an NA packet. - Add documentation for the structs. - Add an enum for NDP option types. Link: https://bugs.passt.top/show_bug.cgi?id=21 Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> [sbrivio: Minor coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-13 19:46:16 +02:00
David Gibson	57a21d2df1	tap: Improve handling of partially received frames on qemu socket Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send then, but we can't rely on it. Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly: * If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu * If a the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:42 +02:00
David Gibson	37e3b24d90	tap: Correctly handle frames of odd length The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address. Currently we load the frame length by just casting a char pointer to (uint32_t ) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't casting an incorrectly aligned pointer to (uint32_t ) is undefined behaviour, strictly speaking. Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:38 +02:00
David Gibson	4684f60344	tap: Don't use EPOLLET on Qemu sockets Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:07:20 +02:00
David Gibson	9e3f2355c4	tap: Don't attempt to carry on if we get a bad frame length from qemu If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elswhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 14:05:44 +02:00
David Gibson	a06db27c49	tap: Better report errors receiving from QEMU socket If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we realy should report something for debugability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-26 13:51:23 +02:00
Stefano Brivio	6ff702f325	tap: Exit if we fail to bind a UNIX domain socket with explicit path In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:30:42 +02:00
Stefano Brivio	d19b396f11	tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-25 12:25:05 +02:00
David Gibson	6d76278c21	icmp: Obtain destination addresses from the flowsides icmp_sock_handler() obtains the guest address from it's most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:13 +02:00
Stefano Brivio	dba7f0f5ce	treewide: Replace strerror() calls Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:44 +02:00
Laurent Vivier	4070bac7a4	tap: use in->buf_size rather than sizeof(pkt_buf) buf_size is set to sizeof(pkt_buf) by default. And it seems more correct to provide the actual size of the buffer. Later a buf_size of 0 will allow vhost-user mode to detect guest memory buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:42 +02:00
Laurent Vivier	0c335d751a	vhost-user: compare mode MODE_PASTA and not MODE_PASST As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:38 +02:00
Laurent Vivier	9ecf7fedc5	tap: refactor packets handling functions Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(), and tap4_handler() and tap6_handler() into tap_handler(). Create a generic tap_add_packet() to consolidate packet addition logic and reduce code duplication. The purpose is to ease the export of these functions to use them with the vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:19 +02:00
David Gibson	0e36fe1a43	clang-tidy: Enable the bugprone-macro-parentheses check We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together, the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning, because of other conventions its been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
Stefano Brivio	c9b2413465	conf, passt, tap: Open socket and PID files before switching UID/GID Otherwise, if the user runs us as root, and gives us paths that are only accessible by root, we'll fail to open them, which might in turn encourage users to change permissions or ownerships: definitely a bad idea in terms of security. Reported-by: Minxi Hou <mhou@redhat.com> Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:26 +02:00
Stefano Brivio	cbca08cd38	tap: Split tap_sock_unix_init() into opening and listening parts We'll need to open and bind the socket a while before listening to it, so split that into two different functions. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:42:43 +02:00
Stefano Brivio	fcfb592adc	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initalisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-05-23 16:42:27 +02:00
Stefano Brivio	d02bb6ca05	tap: Move all-ones initialisation of mac_guest to tap_sock_init() It has nothing to do with tap_sock_unix_init(). It used to be there as that function could be called multiple times per passt instance, but it's not the case anymore. This also takes care of the fact that, with --fd, we wouldn't set the initial MAC address, so we would need to wait for the guest to send us an ARP packet before we could exchange data. Fixes: `6b4e68383c` ("passt, tap: Add --fd option") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:42:06 +02:00

1 2 3 4

187 commits