passt

mirror of https://passt.top/passt synced 2025-07-07 20:48:43 +02:00

Author	SHA1	Message	Date
Stefano Brivio	0470170247	passt-repair: Add directory watch It might not be feasible for users to start passt-repair after passt is started, on a migration target, but before the migration process starts. For instance, with libvirt, the guest domain (and, hence, passt) is started on the target as part of the migration process. At least for the moment being, there's no hook a libvirt user (including KubeVirt) can use to start passt-repair before the migration starts. Add a directory watch using inotify: if PATH is a directory, instead of connecting to it, we'll watch for a .repair socket file to appear in it, and then attempt to connect to that socket. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-03-12 21:34:36 +01:00
David Gibson	2b58b22845	cppcheck: Add suppressions for "logically" exported functions We have some functions in our headers which are definitely there on purpose. However, they're not yet used outside the files in which they're defined. That causes sufficiently recent cppcheck versions (2.17) to complain they should be static. Suppress the errors for these "logically" exported functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
David Gibson	a83c806d17	vhost_user: Don't export several functions vhost-user added several functions which are exposed in headers, but not used outside the file where they're defined. I can't tell if these are really internal functions, or if they're logically supposed to be exported, but we don't happen to have anything using them yet. For the time being, just remove the exports. We can add them back if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
David Gibson	27395e67c2	tcp: Don't export tcp_update_csum() tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c. Remove the unneeded prototype and make it static. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
David Gibson	12d5b36b2f	checksum: Don't export various functions Several of the exposed functions in checksum.h are no longer directly used. Remove them from the header, and make static. In particular sum_16b() should not be used outside: generally csum_unfolded() should be used which will automatically use either the AVX2 optimized version or sum_16b() as necessary. csum_fold() and csum() could have external uses, but they're not used right now. We can expose them again if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
David Gibson	e36c35c952	log: Don't export passt_vsyslog() passt_vsyslog() is an exposed function in log.h. However it shouldn't be called from outside log.c: it writes specifically to the system log, and most code should call passt's logging helpers which might go to the syslog or to a log file. Make passt_vsyslog() local to log.c. This requires a code motion to avoid a forward declaration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
David Gibson	57d2db370b	treewide: Mark assorted functions static This marks static a number of functions which are only used in their .c file, have no prototypes in a .h and were never intended to be globally exposed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	68b04182e0	udp: create and send ICMPv6 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMPv6 message containing the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header and the first 1232 bytes of the original message, if any. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv6 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error types and codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't modify saddr which can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	87e6a46442	tap: break out building of udp header from tap_udp6_send function We will need to build the UDP header at other locations than in function tap_udp6_send(), so we break that part out to a separate function. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:24 +01:00
Jon Maloy	55431f0077	udp: create and send ICMPv4 to local peer when applicable When a local peer sends a UDP message to a non-existing port on an existing remote host, that host will return an ICMP message containing the error code ICMP_PORT_UNREACH, plus the header and the first eight bytes of the original message. If the sender socket has been connected, it uses this message to issue a "Connection Refused" event to the user. Until now, we have only read such events from the externally facing socket, but we don't forward them back to the local sender because we cannot read the ICMP message directly to user space. Because of this, the local peer will hang and wait for a response that never arrives. We now fix this for IPv4 by recreating and forwarding a correct ICMP message back to the internal sender. We synthesize the message based on the information in the extended error structure, plus the returned part of the original message body. Note that for the sake of completeness, we even produce ICMP messages for other error codes. We have noticed that at least ICMP_PROT_UNREACH is propagated as an error event back to the user. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> [sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't modify 'in', it can be declared as const] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-07 02:21:19 +01:00
Jon Maloy	82a839be98	tap: break out building of udp header from tap_udp4_send function We will need to build the UDP header at other locations than in function tap_udp4_send(), so we break that part out to a separate function. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-06 20:17:36 +01:00
David Gibson	1924e25f07	conf: Be more precise about minimum MTUs Currently we reject the -m option if given a value less than ETH_MIN_MTU (68). That define is derived from the kernel, but its name is misleading: it doesn't really have anything to do with Ethernet per se, but is rather the minimum payload any L2 link must be able to handle in order to carry IPv4. For IPv6, it's not sufficient: that requires an MTU of at least 1280. Newer kernels have better named constants IPV4_MIN_MTU and IPV6_MIN_MTU. Copy and use those constants instead, along with some more specific error messages. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-06 20:17:23 +01:00
David Gibson	672d786de1	tcp: Send RST in response to guest packets that match no connection Currently, if a non-SYN TCP packet arrives which doesn't match any existing connection, we simply ignore it. However RFC 9293, section 3.10.7.1 says we should respond with an RST to a non-SYN, non-RST packet that's for a CLOSED (i.e. non-existent) connection. This can arise in practice with migration, in cases where some error means we have to discard a connection. We destroy the connection with tcp_rst() in that case, but because the guest is stopped, we may not be able to deliver the RST packet on the tap interface immediately. This change ensures an RST will be sent if the guest tries to use the connection again. A similar situation can arise if a passt/pasta instance is killed or crashes, but is then replaced with another attached to the same guest. This can leave the guest with stale connections that the new passt instance isn't aware of. It's better to send an RST so the guest knows quickly these are broken, rather than letting them linger until they time out. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:32 +01:00
David Gibson	1f236817ea	tap: Consider IPv6 flow label when building packet sequences To allow more batching, we group together related packets into "seqs" in the tap layer, before passing them to the L4 protocol layers. Currently we consider the IP protocol, both IP addresses and also the L4 ports when grouping things into seqs. We ignore the IPv6 flow label. We have some future cases where we want to consider the flow label in the L4 code, which is awkward if we could be given a single batch with multiple labels. Add the flow label to tap6_l4_t and group by it as well as the other criteria. In future we could possibly use the flow label _instead_ of peeking into the L4 header for the ports, but we don't do so for now. The guest should use the same flow label for all packets in a flow, but if it doesn't, this change won't break anything, it just means we'll batch things a bit sub-optimally. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:29 +01:00
David Gibson	008175636c	ip: Helpers to access IPv6 flow label The flow label is a 20-bit field in the IPv6 header. The length and alignment make it awkward to pass around as is. Obviously, it can be packed into a 32-bit integer though, and we do this in two places. We have some further upcoming places where we want to manipulate the flow label, so make some helpers for marshalling and unmarshalling it to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-03-05 21:46:17 +01:00
David Gibson	52419a64f2	migrate, tcp: Don't flow_alloc_cancel() during incoming migration In tcp_flow_migrate_target(), if we're unable to create and bind the new socket, we print an error, cancel the flow and carry on. This seems to make sense based on our policy of generally letting the migration complete even if some or all flows are lost in the process. But it doesn't quite work: the flow_alloc_cancel() means that the flows in the target's flow table are no longer one to one match to the flows which the source is sending data for. This means that data for later flows will be mismatched to a different flow. Most likely that will cause some nasty error later, but even worse it might appear to succeed but lead to data corruption due to incorrectly restoring one of the flows. Instead, we should leave the flow in the table until we've read all the data for it, then discard it. Technically removing the flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket() fails it leaves conn->sock == -1, which will cause the restore functions in tcp_flow_migrate_target_ext() to fail, discarding the flow. To make what's going on clearer (and with less extraneous error messages), put several explicit tests for a missing socket later in the migration path to read the data associated with the flow but explicitly discard it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:32:38 +01:00
David Gibson	b2708218a6	tcp: Unconditionally move to CLOSED state on tcp_rst() tcp_rst() attempts to send an RST packet to the guest, and if that succeeds moves the flow to CLOSED state. However, even if the tcp_send_flag() fails the flow is still dead: we've usually closed the socket already, and something has already gone irretrievably wrong. So we should still mark the flow as CLOSED. That will cause it to be cleaned up, meaning any future packets from the guest for it won't match a flow, so should generate new RSTs (they don't at the moment, but that's a separate bug). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:32:36 +01:00
David Gibson	56ce03ed0a	tcp: Correct error code handling from tcp_flow_repair_socket() There are two small bugs in error returns from tcp_flow_repair_socket(), which is supposed to return a negative errno code: 1) On bind() failures, we directly pass on the return code from bind(), which is just 0 or -1, instead of an error code. 2) In the caller, tcp_flow_migrate_target() we call strerror_() directly on the negative error code, but strerror() requires a positive error code. Correct both of these. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:32:35 +01:00
David Gibson	39f85bce1a	migrate, flow: Don't attempt to migrate TCP flows without passt-repair Migrating TCP flows requires passt-repair in order to use TCP_REPAIR. If passt-repair is not started, our failure mode is pretty ugly though: we'll attempt the migration, hitting various problems when we can't enter repair mode. In some cases we may not roll back these changes properly, meaning we break network connections on the source. Our general approach is not to completely block migration if there are problems, but simply to break any flows we can't migrate. So, if we have no connection from passt-repair carry on with the migration, but don't attempt to migrate any TCP connections. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:32:33 +01:00
David Gibson	7b92f2e852	migrate, flow: Trivially succeed if migrating with no flows We could get a migration request when we have no active flows; or at least none that we need or are able to migrate. In this case after sending or receiving the number of flows we continue to step through various lists. In the target case, this could include communication with passt-repair. If passt-repair wasn't started that could cause further errors, but of course they shouldn't matter if we have nothing to repair. Make it more obvious that there's nothing to do and avoid such errors by short-circuiting flow_migrate_{source,target}() if there are no migratable flows. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:32:21 +01:00
Stefano Brivio	87471731e6	selinux: Fixes/workarounds for passt and passt-repair, mostly for libvirt usage Here are a bunch of workarounds and a couple of fixes for libvirt usage which are rather hard to split into single logical patches as there appear to be some obscure dependencies between some of them: - passt-repair needs to have an exec_type typeattribute (otherwise the policy for lsmd(1) causes a violation on getattr on its executable) file, and that typeattribute just happened to be there for passt as a result of init_daemon_domain(), but passt-repair isn't a daemon, so we need an explicit corecmd_executable_file() - passt-repair needs a workaround, which I'll revisit once https://github.com/fedora-selinux/selinux-policy/issues/2579 is solved, for usage with libvirt: allow it to use qemu_var_run_t and virt_var_run_t sockets - add 'bpf' and 'dac_read_search' capabilities for passt-repair: they are needed (for whatever reason I didn't investigate) to actually receive socket files via SCM_RIGHTS - passt needs further workarounds in the sense of https://github.com/fedora-selinux/selinux-policy/issues/2579: allow it to use map and use svirt_tmpfs_t (not just svirt_image_t): it depends on where the libvirt guest image is - ...it also needs to map /dev/null if <access mode='shared'/> is enabled in libvirt's XML for the memoryBacking object, for vhost-user operation - and 'ioctl' on the TCP socket appears to be actually needed, on top of 'getattr', to dump some socket parameters Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-28 01:14:01 +01:00
Michal Privoznik	be86232f72	seccomp.sh: Silence stty errors When printing list of allowed syscalls the width of terminal is obtained for nicer output (see commit below). The width is obtained by running 'stty'. While this works when building from a console, it doesn't work during rpmbuild/emerge/.. as stdout is usually not a console but a logfile and stdin is usually /dev/null or something. This results in stty reporting errors like this: stty: 'standard input': Inappropriate ioctl for device Redirect stty's stderr to /dev/null to silence it. Fixes: `712ca32353` ("seccomp.sh: Try to account for terminal width while formatting list of system calls") Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-24 18:46:28 +01:00
Jon Maloy	ea69ca6a20	tap: always set the no_frag flag in IPv4 headers When studying the Linux source code and Wireshark dumps it seems like the no_frag flag in the IPv4 header is always set. Discussions on the Internet on this subject indicate that modern routers never fragment packets, and that fragmentation isn't even supported in many cases. Given also that incoming messages forwarded on the tap interface never even pass through a router, it seems safe to always set this flag. This makes the IPv4 headers of forwarded messages identical to those sent by the external sockets, something we must consider desirable. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-20 12:43:00 +01:00
Stefano Brivio	4dac2351fa	contrib/fedora: Actually install passt-repair SELinux policy file Otherwise we build it, but we don't install it. Not an issue that warrants a release right away as it's anyway usable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 23:33:53 +01:00
Stefano Brivio	16553c8280	dhcp: Add option code byte in calculation for OPT_MAX boundary check Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as intended: $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this" $ hostname="__eighteen_bytes__" $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4 Saving packet capture to dhcp.pcap $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length" Total Length: 577 This was hidden by the issue fixed by commit `bcc4908c2b` ("dhcp: Remove option 255 length byte") until now. Fixes: `31e8109a86` ("dhcp, dhcpv6: Add hostname and client fqdn ops") Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Enrique Llorente <ellorent@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 23:33:19 +01:00
Stefano Brivio	183bedf478	Makefile: Use mmap2() as alternative for mmap() in valgrind extra syscalls ...instead of unconditionally trying to enable both: mmap2() is the 32-bit ARM variant for mmap() (and perhaps for other architectures), but if mmap() is available, valgrind will use that one. This avoids seccomp.sh warning us about missing mmap2() if mmap() is present, and is consistent with what we do in vhost-user code. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 16:36:47 +01:00
David Gibson	1cc5d4c9fe	conf: Use 0 instead of -1 as "unassigned" mtu value On the command line -m 0 means "don't assign an MTU" (letting the guest use its default). However, internally we use (c->mtu == -1) to represent that state. We use (c->mtu == 0) to represent "the user didn't specify on the command line, so use the default" - but this is only used during conf(), never afterwards. This is unnecessarily confusing. We can instead just initialise c->mtu to its default (65520) before parsing options and use 0 on both the command line and internally to represent the "don't assign" special case. This ensures that c->mtu is always 0..65535, so we can store it in a uint16_t which is more natural. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 06:35:41 +01:00
David Gibson	3dc7da68a2	conf: More thorough error checking when parsing --mtu option We're a bit sloppy with parsing MTU which can lead to some surprising, though fairly harmless, results: * Passing a non-number like '-m xyz' will not give an error and act like -m 0 * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than giving an error * We parse the MTU as a long, then immediately assign to an int, so on some platforms certain ludicrously out of bounds values will be silently truncated, rather than giving an error Be a bit more thorough with the error checking to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 06:35:38 +01:00
David Gibson	65e317a8fc	flow: Clean up and generalise flow traversal macros The migration code introduced a number of 'foreach' macros to traverse the flow table. These aren't inherently tied to migration, so polish up their naming, move them to flow_table.h and also use in flow_defer_handler() which is the other place we need to traverse the whole table. For now we keep foreach_established_tcp_flow() as is. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 06:35:36 +01:00
David Gibson	b79a22d360	flow: Remove unneeded bound parameter from flow traversal macros The foreach macros used to step through flows each take a 'bound' parameter to only scan part of the flow table. Only one place actually passes a bound different from FLOW_MAX. So we can simplify every other invocation by having that one case manually handle the bound. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 06:35:34 +01:00
David Gibson	7ffca35fdd	flow: Remove unneeded index from foreach_* macros The foreach macros are odd in that they take two loop counters: an integer index, and a pointer to the flow. We nearly always want the latter, not the former, and we can get the index from the pointer trivially when we need it. So, rearrange the macros not to need the integer index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-19 06:35:20 +01:00
David Gibson	adb46c11d0	flow: Add flow_perror() helper Our general logging helpers include a number of _perror() variants which, like perror(3) include the description of the current errno. We didn't have those for our flow specific logging helpers, though. Fill this gap with flow_perror() and flow_dbg_perror(), and use them where it's useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 13:33:12 +01:00
David Gibson	ba0823f8a0	tcp: Don't pass both flow pointer and flow index tcp_flow_migrate_source_ext() is passed both the index of the flow it operates on and the pointer to the connection structure. However, the former is trivially derived from the latter. Simplify the interface. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 13:33:10 +01:00
David Gibson	854bc7b1a3	tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window This function existed in drafts of the migration code, but not the final version. Get rid of the prototype. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 13:33:08 +01:00
David Gibson	e56c8038fc	tcp: More type safety for tcp_flow_migrate_target_ext() tcp_flow_migrate_target_ext() takes a raw union flow , although it is TCP specific, and requires a FLOW_TYPE_TCP entry. Our usual convention is that such functions should take a struct tcp_tap_conn instead. Convert it to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 13:32:52 +01:00
David Gibson	5a07eb3ccc	tcp_vu: head_cnt need not be global head_cnt is a global variable which tracks how many entries in head[] are currently used. The fact that it's global obscures the fact that the lifetime over which it has a meaningful value is quite short: a single call to tcp_vu_data_from_sock(). Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer. We keep the head[] array global for now - although technically it has the same valid lifetime - because it's large enough we might not want to put it on the stack. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 11:28:37 +01:00
David Gibson	6b4065153c	tap: Remove unused ETH_HDR_INIT() macro The uses of this macro were removed in `d4598e1d18` ("udp: Use the same buffer for the L2 header for all frames"). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:18 +01:00
David Gibson	354bc0bab1	packet: Don't pass start and offset separately to packet_check_range() Fundamentally what packet_check_range() does is to check whether a given memory range is within the allowed / expected memory set aside for packets from a particular pool. That range could represent a whole packet (from packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't really matter which. However, we pass the start of the range as two parameters: @start which is the start of the packet, and @offset which is the offset within the packet of the range we're interested in. We never use these separately, only as (start + offset). Simplify the interface of packet_check_range() and vu_packet_check_range() to directly take the start of the relevant range. This will allow some additional future improvements. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:12 +01:00
David Gibson	0a51060f7a	packet: Use flexible array member in struct pool Currently we have a dummy pkt[1] array, which we alias with an array of a different size via various macros. However, we already require C11 which includes flexible array members, so we can do better. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:04 +01:00
Enrique Llorente	bcc4908c2b	dhcp: Remove option 255 length byte Option 255 (end of options) does not need a length byte. This change removes it, leaving one extra byte for other dynamic options. Signed-off-by: Enrique Llorente <ellorent@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:42:35 +01:00
Stefano Brivio	a1e48a02ff	test: Add migration tests PCAP=1 ./run migrate/bidirectional gives an overview of how the whole thing is working. Add 12 tests in total, checking basic functionality with and without flows in both directions, with and without sockets in half-closed states (both inbound and outbound), migration behaviour under traffic flood, under traffic flood with > 253 flows, and strict checking of sequences under flood with ramp patterns in both directions. These tests need preparation and teardown for each case, as we need to restore the source guest in its own context and pane before we can test again. Eventually, we could consider alternating source and target so that we don't need to restart from scratch every time, but that's beyond the scope of this initial test implementation. Trick: './run migrate/*' runs all the tests with preparation and teardown steps. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:29:36 +01:00
Stefano Brivio	89ecf2fd40	migrate: Migrate TCP flows This implements flow preparation on the source, transfer of data with a format roughly inspired by struct tcp_tap_conn, plus a specific structure for parameters that don't fit in the flow table, and flow insertion on the target, with all the appropriate window options, window scaling, MSS, etc. Contents of pending queues are transferred as well. The target side is rather convoluted because we first need to create sockets and switch them to repair mode, before we can apply options that are not stored in the flow table. This also means that, if we're testing this on the same machine, in the same namespace, we need to close the listening socket on the source before we can start moving data. Further, we need to connect() the socket on the target before we can restore data queues, but we can't do that (again, on the same machine) as long as the matching source socket is open, which implies an arbitrary limit on queue sizes we can transfer, because we can only dump pending queues on the source as long as the socket is open, of course. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:29:03 +01:00
Stefano Brivio	3e903bbb1f	repair, passt-repair: Build and warning fixes for musl Checked against musl 1.2.5. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-17 08:28:48 +01:00
Stefano Brivio	01b6a164d9	tcp_splice: A typo three years ago and SO_RCVLOWAT is gone In commit `e5eefe7743` ("tcp: Refactor to use events instead of states, split out spliced implementation"), this: if (!bitmap_isset(rcvlowat_set, conn - ts) && readlen > (long)c->tcp.pipe_size / 10) { (note the !) became: if (conn->flags & lowat_set_flag && readlen > (long)c->tcp.pipe_size / 10) { in the new tcp_splice_sock_handler(). We want to check, there, if we should set SO_RCVLOWAT, only if we haven't set it already. But, instead, we're checking if it's already set before we set it, so we'll never set it, of course. Fix the check and re-enable the functionality, which should give us improved CPU utilisation in non-interactive cases where we are not transferring at full pipe capacity. Fixes: `e5eefe7743` ("tcp: Refactor to use events instead of states, split out spliced implementation") Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:28:45 +01:00
Stefano Brivio	667caa09c6	tcp_splice: Don't wake up on input data if we can't write it anywhere If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a given flow, it means that we're blocked, waiting for the receiver to actually receive data, with a full pipe. In that case, if we keep EPOLLIN set for the socket on the other side (our receiving side), we'll get into a loop such as: 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 leading to 100% CPU usage, of course. Drop EPOLLIN on our receiving side as long as we're waiting for output readiness on the other side. Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584 Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/ Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:27:30 +01:00
David Gibson	7c33b12086	vhost_user: Clear ring address on GET_VRING_BASE GET_VRING_BASE stops the queue, clearing the call and kick fds. However, we don't clear vring.avail. That means that if vu_queue_notify() is called it won't realise the queue isn't ready and will die with an EBADFD. We get this during migration, because for some reason, qemu reconfigures the vhost-user device when a migration is triggered. There's a window between the GET_VRING_BASE and re-establishing the call fd where the notify function can be called, causing a crash. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-15 05:34:21 +01:00
Stefano Brivio	71249ef3f9	tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values I added this a long long time ago because it dramatically improved throughput back then: with rmem_max and wmem_max >= 4 MiB, we would force send and receive buffer sizes for TCP sockets to the maximum allowed value. This effectively disables TCP auto-tuning, which would otherwise allow us to exceed those limits, as crazy as it might sound. But in any case, it made sense. Now that we have zero (internal) copies on every path, plus vhost-user support, it turns out that these settings are entirely obsolete. I get substantially the same throughput in every test we perform, even with very short durations (one second). The settings are not just useless: they actually cause us quite some trouble on guest state migration, because they lead to huge queues that need to be moved as well. Drop those settings. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 12:02:55 +01:00
Stefano Brivio	30f1e082c3	tcp: Keep updating window and checking for socket data after FIN from guest Once we get a FIN segment from the container/guest, we enter something resembling CLOSE_WAIT (from the perspective of the peer), but that doesn't mean that we should stop processing window updates from the guest and checking for socket data if the guest acknowledges something. If we don't do that, we can very easily run into a situation where we send a burst of data to the tap, get a zero window update, along with a FIN segment, because the flow is meant to be unidirectional, and now the connection will be stuck forever, because we'll ignore updates. Reproducer, server: $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO' and client: $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997 2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe while at it, update the message string for the third passive close state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 10:04:39 +01:00
Stefano Brivio	98d474c895	contrib/selinux: Enable mapping guest memory for libvirt guests This doesn't actually belong to passt's own policy: we should export an interface and libvirt's policy should use it, because passt's policy shouldn't be aware of svirt_image_t at all. However, libvirt doesn't maintain its own policy, which makes policy updates rather involved. Add this workaround to ensure --vhost-user is working in combination with libvirt, as it might take ages before we can get the proper rule in libvirt's policy. Reported-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 10:04:39 +01:00
Stefano Brivio	9a84df4c3f	selinux: Add rules needed to run tests ...other than being convenient, they might be reasonably representative of typical stand-alone usage. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-13 00:42:52 +01:00

1932 commits