1
0
Fork 0
mirror of https://passt.top/passt synced 2025-06-19 05:25:34 +02:00
Commit graph

1932 commits

Author SHA1 Message Date
Stefano Brivio
0470170247 passt-repair: Add directory watch
It might not be feasible for users to start passt-repair after passt
is started, on a migration target, but before the migration process
starts.

For instance, with libvirt, the guest domain (and, hence, passt) is
started on the target as part of the migration process. At least for
the moment being, there's no hook a libvirt user (including KubeVirt)
can use to start passt-repair before the migration starts.

Add a directory watch using inotify: if PATH is a directory, instead
of connecting to it, we'll watch for a .repair socket file to appear
in it, and then attempt to connect to that socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 21:34:36 +01:00
David Gibson
2b58b22845 cppcheck: Add suppressions for "logically" exported functions
We have some functions in our headers which are definitely there on
purpose.  However, they're not yet used outside the files in which they're
defined.  That causes sufficiently recent cppcheck versions (2.17) to
complain they should be static.

Suppress the errors for these "logically" exported functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
a83c806d17 vhost_user: Don't export several functions
vhost-user added several functions which are exposed in headers, but not
used outside the file where they're defined.  I can't tell if these are
really internal functions, or of they're logically supposed to be exported,
but we don't happen to have anything using them yet.

For the time being, just remove the exports.  We can add them back if we
need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
27395e67c2 tcp: Don't export tcp_update_csum()
tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneded prototype and make it static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
12d5b36b2f checksum: Don't export various functions
Several of the exposed functions in checksum.h are no longer directly used.
Remove them from the header, and make static.  In particular sum_16b()
should not be used outside: generally csum_unfolded() should be used which
will automatically use either the AVX2 optimized version or sum_16b() as
necessary.

csum_fold() and csum() could have external uses, but they're not used right
now.  We can expose them again if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
e36c35c952 log: Don't export passt_vsyslog()
passt_vsyslog() is an exposed function in log.h.  However it shouldn't
be called from outside log.c: it writes specifically to the system log,
and most code should call passt's logging helpers which might go to the
syslog or to a log file.

Make passt_vsyslog() local to log.c.  This requires a code motion to avoid
a forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
57d2db370b treewide: Mark assorted functions static
This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
68b04182e0 udp: create and send ICMPv6 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
87e6a46442 tap: break out building of udp header from tap_udp6_send function
We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
55431f0077 udp: create and send ICMPv4 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:19 +01:00
Jon Maloy
82a839be98 tap: break out building of udp header from tap_udp4_send function
We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:36 +01:00
David Gibson
1924e25f07 conf: Be more precise about minimum MTUs
Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68).  That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4.  For IPv6, it's not sufficient: that requires an MTU of at least
1280.

Newer kernels have better named constants IPV4_MIN_MTU and IPv6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:23 +01:00
David Gibson
672d786de1 tcp: Send RST in response to guest packets that match no connection
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:32 +01:00
David Gibson
1f236817ea tap: Consider IPv6 flow label when building packet sequences
To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers.  Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs.  We ignore the IPv6 flow label.

We have some future cases where we want to consider the the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels.  Add the flow label to tap6_l4_t and group by it as well
as the other criteria.  In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.

The guest should use the same flow label for all packets in a low, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:29 +01:00
David Gibson
008175636c ip: Helpers to access IPv6 flow label
The flow label is a 20-bit field in the IPv6 header.  The length and
alignment make it awkward to pass around as is.  Obviously, it can be
packed into a 32-bit integer though, and we do this in two places.  We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:17 +01:00
David Gibson
52419a64f2 migrate, tcp: Don't flow_alloc_cancel() during incoming migration
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer one to one match to the flows which the source is
sending data for.  This means that data for later flows will be mismatched
to a different flow.  Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:38 +01:00
David Gibson
b2708218a6 tcp: Unconditionally move to CLOSED state on tcp_rst()
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state.  However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong.  So we should still mark
the flow as CLOSED.  That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:36 +01:00
David Gibson
56ce03ed0a tcp: Correct error code handling from tcp_flow_repair_socket()
There are two small bugs in error returns from tcp_low_repair_socket(),
which is supposed to return a negative errno code:

1) On bind() failures, wedirectly pass on the return code from bind(),
   which is just 0 or -1, instead of an error code.

2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.

Correct both of these.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:35 +01:00
David Gibson
39f85bce1a migrate, flow: Don't attempt to migrate TCP flows without passt-repair
Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:33 +01:00
David Gibson
7b92f2e852 migrate, flow: Trivially succeed if migrating with no flows
We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:21 +01:00
Stefano Brivio
87471731e6 selinux: Fixes/workarounds for passt and passt-repair, mostly for libvirt usage
Here are a bunch of workarounds and a couple of fixes for libvirt
usage which are rather hard to split into single logical patches
as there appear to be some obscure dependencies between some of them:

- passt-repair needs to have an exec_type typeattribute (otherwise
  the policy for lsmd(1) causes a violation on getattr on its
  executable) file, and that typeattribute just happened to be there
  for passt as a result of init_daemon_domain(), but passt-repair
  isn't a daemon, so we need an explicit corecmd_executable_file()

- passt-repair needs a workaround, which I'll revisit once
  https://github.com/fedora-selinux/selinux-policy/issues/2579 is
  solved, for usage with libvirt: allow it to use qemu_var_run_t
  and virt_var_run_t sockets

- add 'bpf' and 'dac_read_search' capabilities for passt-repair:
  they are needed (for whatever reason I didn't investigate) to
  actually receive socket files via SCM_RIGHTS

- passt needs further workarounds in the sense of
  https://github.com/fedora-selinux/selinux-policy/issues/2579:
  allow it to use map and use svirt_tmpfs_t (not just svirt_image_t):
  it depends on where the libvirt guest image is

- ...it also needs to map /dev/null if <access mode='shared'/> is
  enabled in libvirt's XML for the memoryBacking object, for
  vhost-user operation

- and 'ioctl' on the TCP socket appears to be actually needed, on top
  of 'getattr', to dump some socket parameters

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:14:01 +01:00
Michal Privoznik
be86232f72 seccomp.sh: Silence stty errors
When printing list of allowed syscalls the width of terminal is
obtained for nicer output (see commit below). The width is
obtained by running 'stty'. While this works when building from a
console, it doesn't work during rpmbuild/emerge/.. as stdout is
usually not a console but a logfile and stdin is usually
/dev/null or something. This results in stty reporting errors
like this:

  stty: 'standard input': Inappropriate ioctl for device

Redirect stty's stderr to /dev/null to silence it.

Fixes: 712ca32353 ("seccomp.sh: Try to account for terminal width while formatting list of system calls")
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-24 18:46:28 +01:00
Jon Maloy
ea69ca6a20 tap: always set the no_frag flag in IPv4 headers
When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Following discussions
in the Internet on this subject indicates that modern routers never
fragment packets, and that it isn't even supported in many cases.

Adding to this that incoming messages forwarded on the tap interface
never even pass through a router it seems safe to always set this flag.

This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-20 12:43:00 +01:00
Stefano Brivio
4dac2351fa contrib/fedora: Actually install passt-repair SELinux policy file
Otherwise we build it, but we don't install it. Not an issue that
warrants a a release right away as it's anyway usable.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:53 +01:00
Stefano Brivio
16553c8280 dhcp: Add option code byte in calculation for OPT_MAX boundary check
Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as
intended:

  $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this"
  $ hostname="__eighteen_bytes__"
  $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4
  Saving packet capture to dhcp.pcap
  $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length"
      Total Length: 577

This was hidden by the issue fixed by commit bcc4908c2b ("dhcp
Remove option 255 length byte") until now.

Fixes: 31e8109a86 ("dhcp, dhcpv6: Add hostname and client fqdn ops")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:19 +01:00
Stefano Brivio
183bedf478 Makefile: Use mmap2() as alternative for mmap() in valgrind extra syscalls
...instead of unconditionally trying to enable both: mmap2() is the
32-bit ARM variant for mmap() (and perhaps for other architectures),
bot if mmap() is available, valgrind will use that one.

This avoids seccomp.sh warning us about missing mmap2() if mmap() is
present, and is consistent with what we do in vhost-user code.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 16:36:47 +01:00
David Gibson
1cc5d4c9fe conf: Use 0 instead of -1 as "unassigned" mtu value
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:41 +01:00
David Gibson
3dc7da68a2 conf: More thorough error checking when parsing --mtu option
We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
  * Passing a non-number like '-m xyz' will not give an error and act like
    -m 0
  * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
    giving an error
  * We parse the MTU as a long, then immediately assign to an int, so on
    some platforms certain ludicrously out of bounds values will be
    silently truncated, rather than giving an error

Be a bit more thorough with the error checking to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:38 +01:00
David Gibson
65e317a8fc flow: Clean up and generalise flow traversal macros
The migration code introduced a number of 'foreach' macros to traverse the
flow table.  These aren't inherently tied to migration, so polish up their
naming, move them to flow_table.h and also use in flow_defer_handler()
which is the other place we need to traverse the whole table.

For now we keep foreach_established_tcp_flow() as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:36 +01:00
David Gibson
b79a22d360 flow: Remove unneeded bound parameter from flow traversal macros
The foreach macros used to step through flows each take a 'bound' parameter
to only scan part of the flow table.  Only one place actually passes a
bound different from FLOW_MAX.  So we can simplify every other invocation
by having that one case manually handle the bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:34 +01:00
David Gibson
7ffca35fdd flow: Remove unneeded index from foreach_* macros
The foreach macros are odd in that they take two loop counters: an integer
index, and a pointer to the flow.  We nearly always want the latter, not
the former, and we can get the index from the pointer trivially when we
need it.  So, rearrange the macros not to need the integer index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:20 +01:00
David Gibson
adb46c11d0 flow: Add flow_perror() helper
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:12 +01:00
David Gibson
ba0823f8a0 tcp: Don't pass both flow pointer and flow index
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:10 +01:00
David Gibson
854bc7b1a3 tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:08 +01:00
David Gibson
e56c8038fc tcp: More type safety for tcp_flow_migrate_target_ext()
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:32:52 +01:00
David Gibson
5a07eb3ccc tcp_vu: head_cnt need not be global
head_cnt is a global variable which tracks how many entries in head[] are
currently used.  The fact that it's global obscures the fact that the
lifetime over which it has a meaningful value is quite short: a single
call to of tcp_vu_data_from_sock().

Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer.
We keep the head[] array global for now - although technically it has the
same valid lifetime - because it's large enough we might not want to put
it on the stack.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 11:28:37 +01:00
David Gibson
6b4065153c tap: Remove unused ETH_HDR_INIT() macro
The uses of this macro were removed in d4598e1d18 ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:18 +01:00
David Gibson
354bc0bab1 packet: Don't pass start and offset separately to packet_check_range()
Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:12 +01:00
David Gibson
0a51060f7a packet: Use flexible array member in struct pool
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:04 +01:00
Enrique Llorente
bcc4908c2b dhcp: Remove option 255 length byte
The option 255 (end of options) do not need the length byte, this change
remove that allowing to have one extra byte at other dynamic options.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:42:35 +01:00
Stefano Brivio
a1e48a02ff test: Add migration tests
PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:36 +01:00
Stefano Brivio
89ecf2fd40 migrate: Migrate TCP flows
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:03 +01:00
Stefano Brivio
3e903bbb1f repair, passt-repair: Build and warning fixes for musl
Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-17 08:28:48 +01:00
Stefano Brivio
01b6a164d9 tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
In commit e5eefe7743 ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:28:45 +01:00
Stefano Brivio
667caa09c6 tcp_splice: Don't wake up on input data if we can't write it anywhere
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long when we're waiting for
output readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:27:30 +01:00
David Gibson
7c33b12086 vhost_user: Clear ring address on GET_VRING_BASE
GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-15 05:34:21 +01:00
Stefano Brivio
71249ef3f9 tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values
I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 12:02:55 +01:00
Stefano Brivio
30f1e082c3 tcp: Keep updating window and checking for socket data after FIN from guest
Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

while at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
98d474c895 contrib/selinux: Enable mapping guest memory for libvirt guests
This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
9a84df4c3f selinux: Add rules needed to run tests
...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-13 00:42:52 +01:00