Commit graph

1627 commits

Author SHA1 Message Date
Jon Maloy
a740e16fd1 tcp: handle shrunk window advertisements from guest
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.

However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.

RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.

It turns out that this solution is extremely simple to implememt in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.

When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
  retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
  advertised window. In fact, in all observed cases we have seen that
  it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
  unless they already have been, meaning that only the next timer
  expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
  abated, and the data flow can resume.

It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.

Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
 to workaround
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 18:05:08 +02:00
Jon Maloy
e63d281871 tcp: leverage support of SO_PEEK_OFF socket option when available
>From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes is possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->name space direction.

In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.

Measurements with iperf3 shows that throughput increases with 15-20
percent in the host->namespace direction when this feature is used.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:57:03 +02:00
David Gibson
8bd57bf25b doc: Trivial fix for reuseaddr-priority
This test program checks for particular behaviour regardless of order of
operations.  So, we step through the test with all possible orders for
a number of different of parts.  Or at least, we're supposed to, a copy
pasta error led to using the same order for two things which should be
independent.

Fixes: 299c407501 ("doc: Add program to document and test assumptions about SO_REUSEADDR")
Reported-by: David Taylor <davidt@yadt.co.uk>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:55:52 +02:00
David Gibson
ec2691a12e doc: Test behaviour of zero length datagram recv()s
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:48 +02:00
David Gibson
299c407501 doc: Add program to document and test assumptions about SO_REUSEADDR
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram.  Which one will actually receive it is important to us.

Add a test program which verifies things work the way we expect, which
documents what those expectations are in the process.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:43 +02:00
David Gibson
be0214cca6 udp: Consolidate datagram batching
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap).  The logic to do this, is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().

Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch.  udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().

This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:41 +02:00
David Gibson
69e5393c37 udp: Move some more of sock_handler tasks into sub-functions
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four
things between them:
  1. Receive some datagrams from a socket
  2. Split those datagrams into batches depending on how they need to be
     sent (via tap or via a specific splice socket)
  3. Prepare buffers for each datagram to send it onwards
  4. Actually send it onwards

Split (1) and (3) into specific helper functions.  This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:39 +02:00
David Gibson
c6c61a9e1a udp: Don't repeatedly initialise udp[46]_eth_hdr
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame.  This makes sense since our ethernet header is alwaus the same.

However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop.  Pull that outside the loop so we just initialise them once.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:37 +02:00
David Gibson
55aff45bc1 udp: Unify udp[46]_l2_iov
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame
header and udp6_l2_iov points to the IPv6 versions.

We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len.  It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:35 +02:00
David Gibson
9f9b15f949 udp: Unify udp[46]_mh_splice
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.

Unify these by having the common array point at a sockaddr_inany as the
address.  This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port.  However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.

Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname  to udp_splicename.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:33 +02:00
David Gibson
fbd78b6f3e udp: Rename IOV and mmsghdr arrays
Make the salient points about these various arrays clearer with renames:

* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
  with L2.  They are, however, specific to receiving not sending.  Rename
  to udp_iov_recv and udp[46]_mh_recv.

* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
  Rename to udp[46]_l2_iov

* udp[46]_localname are (for now) pre-populated with the local address but
  the more salient point is that these are the destination address for the
  splice arrays.  Rename to udp[46]_splice_to

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:30 +02:00
David Gibson
f62c33d85f udp: Pass full epoll reference through more of sock handler path
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions.  Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions.  To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:28 +02:00
David Gibson
8f8eb73482 flow: Add flow_sidx_valid() helper
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t.  We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.

That's not always done consistently though.  In flow_at_sidx() we open code
a check on the flow index.  In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.

Clean this up a bit, by adding an explicit flow_sidx_valid() test function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:25 +02:00
David Gibson
74c1c5efcf util: sock_l4() determine protocol from epoll type rather than the reverse
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state.  Currently it determines the correct tag for the epoll
data based on the protocol.  However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol.  So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:09 +02:00
Stefano Brivio
b625ed5fee conf: Use the right maximum buffer size for c->sock_path
UNIX_SOCK_MAX is the maximum number we'll append to the socket path
if we generate it automatically. If it's given on the command line,
it can be up to UNIX_PATH_MAX (including the terminating character)
long.

UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of
108).

Commit ceddcac74a ("conf, tap: False "Buffer not null terminated"
positives, CWE-170") fixed the wrong problem: the right fix for the
problem at hand was actually commit cc287af173 ("conf: Fix
incorrect bounds checking for sock_path parameter").

Fixes: ceddcac74a ("conf, tap: False "Buffer not null terminated" positives, CWE-170")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:34:06 +02:00
Stefano Brivio
403a7c14a0 tcp_splice: Check return value of setsockopt() for SO_RCVLOWAT
Spotted by Coverity, harmless as we would consider that successful
and check on the socket later from the timer, but printing a debug
message in that case is definitely wise, should it ever happen.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:33:57 +02:00
Stefano Brivio
21ee1eb2de conf: Copy up to MAXDNSRCH - 1 bytes, not MAXDNSRCH
Spotted by Coverity just recently. Not that it really matters as
MAXDNSRCH always appears to be defined as 1025, while a full domain
name can have up to 253 characters: it would be a bit pointless to
have a longer search domain.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:33:29 +02:00
Stefano Brivio
1ee2ecade3 udp: Reduce scope of rport in udp_invert_portmap()
cppcheck 2.14 warns that the scope of the rport variable could be
reduced: do that, as reverted commit c80fa6a6bb ("udp: Make rport
calculation more local") did, but keep the temporary variable of
in_port_t type, otherwise the sum gets promoted to int.

While at it, add a comment explaining why we calculate rport like
this instead of directly using the sum as array index.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-24 15:41:38 +02:00
Stefano Brivio
054697598f Revert "udp: Make rport calculation more local"
This reverts commit c80fa6a6bb, as it
reintroduces the issue fixed by commit 1e6f92b995 ("udp: Fix 16-bit
overflow in udp_invert_portmap()").

Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=80
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-24 15:41:38 +02:00
Stefano Brivio
c66f0341d9 log: Don't report syslog failures to stderr after initialisation
If we daemonised, we can't use standard error. If we didn't, it's
rather annoying to have all those messages on standard error anyway,
and kind of pointless too, as the messages we wanted to print were
printed to standard error anyway.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-21 15:32:48 +02:00
Stefano Brivio
e7323e515a conf, passt: Don't call __openlog() if a log file is used
If a log file is configured, we would otherwise open a connection to
the system logger (if any), print any message that we might have
before we initialise the log file, and then keep that connection
around for no particular reason.

Call __openlog() as an alternative to the log file setup, instead.

This way, we might skip printing some messages during the
initialisation phase, but they're probably not really valuable to
have in a system log, and we're going to print them to standard
error anyway.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:46 +02:00
Stefano Brivio
dba7f0f5ce treewide: Replace strerror() calls
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.

While at it, convert a few error messages from a scant perror style
to proper failure descriptions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:44 +02:00
Stefano Brivio
92a22fef93 treewide: Replace perror() calls with calls to logging functions
perror() prints directly to standard error, but in many cases standard
error might be already closed, or we might want to skip logging, based
on configuration. Our logging functions provide all that.

While at it, make errors more descriptive, replacing some of the
existing basic perror-style messages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:43 +02:00
Stefano Brivio
c1140df889 log: Add _perror() logging function variants
In many places, we have direct perror() calls, which completely bypass
logging functions and log files.

They are definitely convenient: offer similar convenience with
_perror() logging variants, so that we can drop those direct perror()
calls.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:40 +02:00
Stefano Brivio
afd9cdc9bb log, passt: Always print to stderr before initialisation is complete
After commit 15001b39ef ("conf: set the log level much earlier"), we
had a phase during initialisation when messages wouldn't be printed to
standard error anymore.

Commit f67238aa86 ("passt, log: Call __openlog() earlier, log to
stderr until we detach") fixed that, but only for the case where no
log files are given.

If a log file is configured, vlogmsg() will not call passt_vsyslog(),
but during initialisation, LOG_PERROR is set, so to avoid duplicated
prints (which would result from passt_vsyslog() printing to stderr),
we don't call fprintf() from vlogmsg() either.

This is getting a bit too complicated. Instead of abusing LOG_PERROR,
define an internal logging flag that clearly represents that we're not
done with the initialisation phase yet.

If this flag is not set, make sure we always print to stderr, if the
log mask matches.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:34 +02:00
Stefano Brivio
8c2f24a560 conf, log: Instead of abusing log levels, add log_conf_parsed flag
We currently use a LOG_EMERG log mask to represent the fact that we
don't know yet what the mask resulting from configuration should be,
before the command line is parsed.

However, we have the necessity of representing another phase as well,
that is, configuration is parsed but we didn't daemonise yet, or
we're not ready for operation yet. The next patch will add that
notion explicitly.

Mapping these cases to further log levels isn't really practical.
Introduce boolean log flags to represent them, instead of abusing
log priorities.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:31 +02:00
Stefano Brivio
bca0fefa32 conf, passt: Make --stderr do nothing, and deprecate it
The original behaviour of printing messages to standard error by
default when running from a non-interactive terminal was introduced
because the first KubeVirt integration draft used to start passt in
foreground and get messages via standard error.

For development purposes, the system logger was more convenient at
that point, and passt was running from interactive terminals only if
not started by the KubeVirt integration.

This behaviour was introduced by 84a62b79a2 ("passt: Also log to
stderr, don't fork to background if not interactive").

Later, I added command-line options in 1e49d194d0 ("passt, pasta:
Introduce command-line options and port re-mapping") and accidentally
reversed this condition, which wasn't a problem as --stderr could
force printing to standard error anyway (and it was used by KubeVirt).

Nowadays, the KubeVirt integration uses a log file (requested via
libvirt configuration), and the same applies for Podman if one
actually needs to look at runtime logs. There are no use cases left,
as far as I know, where passt runs in foreground in non-interactive
terminals.

Seize the chance to reintroduce some sanity here. If we fork to
background, standard error is closed, so --stderr is useless in that
case.

If we run in foreground, there's no harm in printing messages to
standard error, and that accidentally became the default behaviour
anyway, so --stderr is not needed in that case.

It would be needed for non-interactive terminals, but there are no
use cases, and if there were, let's log to standard error anyway:
the user can always redirect standard error to /dev/null if needed.

Before we're up and running, we need to print to standard error anyway
if something happens, otherwise we can't report failure to start in
any kind of usage, stand-alone or in integrations.

So, make --stderr do nothing, and deprecate it.

While at it, drop a left-over comment about --foreground being the
default only for interactive terminals, because it's not the case
anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:28 +02:00
Stefano Brivio
b74801645c conf, passt: Don't try to log to stderr after we close it
If we don't run in foreground, we close standard error as we
daemonise, so it makes no sense to check if the controlling terminal
is an interactive terminal or if --force-stderr was given, to decide
if we want to log to standard error.

Make --force-stderr depend on --foreground.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:15 +02:00
Stefano Brivio
65923ba798 conf: Accept duplicate and conflicting options, the last one wins
In multiple occasions, especially when passt(1) and pasta(1) are used
in integrations such as the one with Podman, the ability to override
earlier options on the command line with later one would have been
convenient.

Recently, to debug a number of issues happening with Podman, I would
have liked to ask users to share a debug log by passing --debug as
additional option, but pasta refuses --quiet (always passed by Podman)
and --debug at the same time.

On top of this, Podman lets users specify other pasta options in its
containers.conf(5) file, as well as on the command line.

The options from the configuration files are appended together with
the ones from the command line, which makes it impossible for users to
override options from the configuration file, if duplicated options
are refused, unless Podman takes care of sorting them, which is
clearly not sustainable.

For --debug and --trace, somebody took care of this on Podman side at:
  https://github.com/containers/common/pull/2052

but this doesn't fix the issue with other options, and we'll have
anyway older versions of Podman around, too.

I think there's some value in telling users about duplicated or
conflicting options, because that might reveal issues in integrations
or accidental misconfigurations, but by now I'm fairly convinced that
the downsides outweigh this.

Drop checks about duplicate options and mutually exclusive ones. In
some cases, we need to also undo a couple of initialisations caused
by earlier options, but this looks like a simplification, overall.

Notable exception: --stderr still conflicts with --log-file, because
users might have the expectation that they don't actually conflict.
But they do conflict in the existing implementation, so it's safer
to make sure that the users notice that.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-06-21 15:31:46 +02:00
Stefano Brivio
62de6140d9 netlink: Strip nexthop identifiers when duplicating routes
If routing daemons set up host routes, for example FRR via OSPF as in
the reported issue, they might add nexthop identifiers (not objects)
that are generally not valid in the target namespace. Strip them off
as well, otherwise we'll get EINVAL from the kernel.

Link: https://github.com/containers/podman/issues/22960
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-20 17:03:28 +02:00
Danish Prakash
1544a43863 passt.1, qrap.1: align license description with SPDX identifier
The SPDX identifier states GPL-2.0-or-later but the copyright section
mentions GPL-3.0 or later causing a mismatch.

Also, only correctly refers to GPL instead of AGPL.

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-19 15:00:55 +02:00
Stefano Brivio
f301bb18b5 netlink: Ignore EHOSTUNREACH failures when duplicating routes
To implicitly resolve possible dependencies between routes as we
duplicate them into the target namespace, we go through a set of n
routes n times, and ignore EEXIST responses to netlink messages (we
already inserted the route) and ENETUNREACH (we didn't insert the
route yet, but we need to insert another one first).

Until now, we didn't ignore EHOSTUNREACH responses. However,
NetworkManager users with multiple non-subnet routes for the same
interface report that pasta exits with "no route to host" while
duplicating routes.

This happens because NetworkManager sets the 'noprefixroute' attribute
on addresses, meaning that the kernel won't create subnet routes
automatically depending on the prefix length of the address. We copy
this attribute as we copy the address into the target namespace, and
as a result, the kernel doesn't create subnet routes in the target
namespace either.

This means that the gateway for routes that are inserted later can be
unreachable at some points during the sequence of route duplication.
That is, we don't just have dependencies between regular routes, but
we can also have dependencies between regular routes and subnet
routes, as subnet routes are not automatically inserted in advance.

Link: https://github.com/containers/podman/issues/22824
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-19 15:00:55 +02:00
Stefano Brivio
450a6131be netlink: With no default route, pick the first interface with a route
While commit f919dc7a4b ("conf, netlink: Don't require a default
route to start") sounded reasonable in the assumption that, if we
don't find default routes for a given address family, we can still
proceed by selecting an interface with any route *iff it's the only
one for that protocol family*, Jelle reported a further issue in a
similar setup.

There, multiple interfaces are present, and while remote container
connectivity doesn't matter for the container, local connectivity is
desired. There are no default routes, but those multiple interfaces
all have non-default routes, so we should just pick one and start.

Pick the first interface reported by the kernel with any route, if
there are no default routes. There should be no harm in doing so.

Reported-by: Jelle van der Waa <jvanderwaa@redhat.com>
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-06-19 15:00:55 +02:00
Stefano Brivio
54a9d3801b tcp: Don't rely on bind() to fail to decide that connection target is valid
Commit e1a2e2780c ("tcp: Check if connection is local or low RTT
was seen before using large MSS") added a call to bind() before we
issue a connect() to the target for an outbound connection.

If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we
can conclude that the target address is a local (host) address, and we
can use an unlimited MSS.

While at it, according to the reasoning of that commit, if bind()
succeeds, we would know right away that nobody is listening at that
(local) address and port, and we don't even need to call connect(): we
can just fail early and reset the connection attempt.

But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or
net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will
actually succeed, so we can't rely on it to fail in general.

The visible issue with the existing behaviour is that we would reset
any outbound connection to non-local addresses, if non-local binds are
enabled.

Keep the significant optimisation for local addresses along with the
bind() call, but if it succeeds, don't draw any conclusion: close the
socket, grab another one, and proceed normally.

This will incur a small latency penalty if non-local binds are
enabled (we'll likely fetch an existing socket from the pool but
additionally call close()), or if the target is local but not bound:
we'll need to call connect() and get a failure before relaying that
failure back.

Link: https://github.com/containers/podman/issues/23003
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-19 15:00:55 +02:00
David Gibson
020ff7a40e siphash: Remove stale prototypes
In fc8f0f8c ("siphash: Use incremental rather than all-at-once siphash
functions") we removed the older interface to the SipHash implementation,
which took fixed sized blocks of data.  However, we forgot to remove the
prototypes for those functions, so do that now.

Fixes: fc8f0f8c48 ("siphash: Use incremental rather than all-at-once siphash functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-19 14:55:42 +02:00
David Gibson
7e87bd98ac udp: Move management of udp[46]_localname into udp_splice_send()
Mostly, udp_sock_handler() is independent of how the datagrams it processes
will be forwarded (tap or splice).  However, it also updates the msg_name
fields for spliced sends, which doesn't really make sense here.  Move it
into udp_splice_send() which is all about spliced sends.  This does
potentially mean we'll update the field to the same value several times,
but we're going to need this in future anyway: with the extensions the
flow table allows, it might not be the same value each time after all.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:46 +02:00
David Gibson
ff57f8ddc6 udp: Rework how we divide queued datagrams between sending methods
udp_sock_handler() takes a number of datagrams from sockets that depending
on their addresses could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced").  In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.

To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The determination of what the target is is made by udp_mmh_splice_port().
It returns the source port for splice packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.

udp_mmh_splice_port() is moderately expensive, and unfortunately we
can call it twice on the same datagram: once as the (last + 1) entry
in one batch (to check it's not in that batch), then again as the
first entry in the next batch.

Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard.  It will also generalise better
when we have more complex possibilities based on the flow table.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:42 +02:00
David Gibson
63db7dcdbf udp: Fold checking of splice flag into udp_mmh_splice_port()
udp_mmh_splice_port() is used to determine if a UDP datagram can be
"spliced" (forwarded via a socket instead of tap).  We only invoke it if
the origin socket has the 'splice' flag set.

Fold the checking of the flag into the helper itself, which makes the
caller simpler.  It does mean we have a loop looking for a batch of
spliceable or non-spliceable packets even in the case where the flag is
clear.  This shouldn't be that expensive though, since each call to
udp_mmh_splice_port() will return without accessing memory in that case.
In any case we're going to need a similar loop in more cases with upcoming
flow table work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:39 +02:00
David Gibson
523fbc5af7 util: Split construction of bind socket address from the rest of sock_l4()
sock_l4() creates, binds and otherwise prepares a new socket.  It builds
the socket address to bind from separately provided address and port.
However, we have use cases coming up where it's more natural to construct
the socket address in the caller.

Prepare for this by adding sock_l4_sa() which takes a pre-constructed
socket address, and rewriting sock_l4() in terms of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:10:52 +02:00
Laurent Vivier
4070bac7a4 tap: use in->buf_size rather than sizeof(pkt_buf)
buf_size is set to sizeof(pkt_buf) by default. And it seems more correct
to provide the actual size of the buffer.

Later a buf_size of 0 will allow vhost-user mode to detect
guest memory buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:42 +02:00
Laurent Vivier
7290335b14 iov: remove iov_copy()
it was needed by a draft version of vhost-user, it is not needed
anymore.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:40 +02:00
Laurent Vivier
0c335d751a vhost-user: compare mode MODE_PASTA and not MODE_PASST
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:38 +02:00
Laurent Vivier
377b666dc9 udp: rename udp_sock_handler() to udp_buf_sock_handler()
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:34 +02:00
Laurent Vivier
e7ac995217 udp: refactor UDP header update functions
This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions
to improve code portability by replacing the udp_meta_t parameter with
more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr)
and the source socket address (sockaddr_in/sockaddr_in6).
It also moves the tap_hdr_update() function call inside the udp_tap_send()
function not to have to pass the TAP header to udp_update_hdr4() and
udp_update_hdr6()

This refactor reduces complexity by making the functions more modular and
ensuring that each function operates on more narrowly scoped data structures.
This will facilitate future backend introduction like vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:32 +02:00
Laurent Vivier
9ecf7fedc5 tap: refactor packets handling functions
Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(),
and tap4_handler() and tap6_handler() into tap_handler().
Create a generic tap_add_packet() to consolidate packet
addition logic and reduce code duplication.

The purpose is to ease the export of these functions to use
them with the vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:19 +02:00
Laurent Vivier
fba2b544b6 tcp: move buffers management functions to their own file
Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c

With this change we can use existing TCP functions with a
different kind of memory storage as for instance the shared
memory provided by the guest via vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:05 +02:00
Laurent Vivier
ec26fa013a tcp: extract buffer management from tcp_send_flag()
This commit isolates the internal data structure management used for storing
data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[],
tcp4_flags[], ...) from the tcp_send_flag() function. The extracted
functionality is relocated to a new function named tcp_fill_flag_header().

tcp_fill_flag_header() is now a generic function that accepts parameters such
as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to
pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[].

This separation sets the stage for utilizing tcp_prepare_flags() to
set the memory provided by the guest via vhost-user in future developments.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:43:35 +02:00
David Gibson
d949667436 cppcheck: Suppress constParameterCallback errors
We have several functions which are used as callbacks for NS_CALL() which
only read their void * parameter, they don't write it.  The
constParameterCallback warning in cppcheck 2.14.1 complains that this
parameter could be const void *, also pointing out that that would require
casting the function pointer when used as a callback.

Casting the function pointers seems substantially uglier than using a
non-const void * as the parameter, especially since in each case we cast
the void * to a const pointer of specific type immediately.  So, suppress
these errors.

I think it would make logical sense to suppress this globally, but that
would cause unmatchedSuppression errors on earlier cppcheck versions.  So,
instead individually suppress it, along with unmatchedSuppression in the
relevant places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-08 13:06:20 +02:00
Derek Schrock
8a83b530fe selinux: Allow access to user_devpts
Allow access to user_devpts.

	$ pasta --version
	pasta 0^20240510.g7288448-1.fc40.x86_64
	...
	$ awk '' < /dev/null
	$ pasta --version
	$

While this might be a awk bug it appears pasta should still have access
to devpts.

Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
ec416fdcc4 tcp, flow: Fix some error paths which didn't clean up flows properly
Flow table entries need to be fully initialised before returning to the
main epoll loop.  Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().

Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.

I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.

Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00