Commit graph

1478 commits

Author SHA1 Message Date
David Gibson
ec416fdcc4 tcp, flow: Fix some error paths which didn't clean up flows properly
Flow table entries need to be fully initialised before returning to the
main epoll loop.  Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().

Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.

I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.

Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
3f63743a65 util: Use 'long' to represent millisecond durations
timespec_diff_ms() returns an int representing a duration in milliseconds.
This will overflow in about 25 days when an int is 32 bits.  The way we
use this function, we're probably not going to get a result that long, but
it's not outrageously implausible.  Use a long for safety.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
f9e8ee0777 lineread: Use ssize_t for line lengths
Functions and structures in lineread.c use plain int to record and report
the length of lines we receive.  This means we truncate the result from
read(2) in some circumstances.  Use ssize_t to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
c919bbbdd3 conf: Safer parsing of MAC addresses
In conf() we parse a MAC address in two places, for the --ns-mac-addr and
the -M options.  As well as duplicating code, the logic for this parsing
has several bugs:
  * The most serious is that if the given string is shorter than a MAC
    address should be, we'll access past the end of it.
  * We don't check the endptr supplied by strtol() which means we could
    ignore certain erroneous contents
  * We never check the separator characters between each octet
  * We ignore certain sorts of garbage that follow the MAC address

Correct all these bugs in a new parse_mac() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
bda80ef53f util: Use unsigned indices for bits in bitmaps
A negative bit index in a bitmap doesn't make sense.  Avoid this by
construction by using unsigned indices.  While we're there adjust
bitmap_isset() to return a bool instead of an int.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
0e36fe1a43 clang-tidy: Enable the bugprone-macro-parentheses check
We globally disabled this, with a justification lumped together with
several checks about braces.  They don't really go together, the others
are essentially a stylistic choice which doesn't match our style.  Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.

We've only gotten away with the macros which trigger the warning, because
of other conventions its been unlikely to invoke them with anything other
than a simple identifier.  Fix the macros, and enable the warning for the
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
7094b91d10 Remove pointless macro parameters in CALL_PROTO_HANDLER
The 'c' parameter is always passed exactly 'c'.  The 'now' parameter is
always passed exactly 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
c80fa6a6bb udp: Make rport calculation more local
cppcheck 2.14.1 complains about the rport variable not being in as small
as scope as it could be.  It's also only used once, so we might as well
just open code the calculation for it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
d2afb4b625 tcp: Make pointer const in tcp_revert_seq
The th pointer could be const, which causes a cppcheck warning on at least
some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40).

Fixes: e84a01e94c ("tcp: move seq_to_tap update to when frame is queued")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
b3aeb004ea log: Remove log_to_stdout option
Now that we've simplified how usage() works, nothing ever sets the
log_to_stdout flag. Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:09 +02:00
David Gibson
7cb2088835 conf: Don't print usage via the logging subsystem
The message from usage() when given invalid options, or the -h / --help
option is currently printed by many calls to the info() function, also
used for runtime logging of informational messages.

That isn't useful: the usage message should always go to the terminal
(stdout or stderr), never syslog or a logfile.  It should never be
filtered by priority.  Really the only thing using the common logging
functions does is give more opportunities for something to go wrong.

Replace all the info() calls with direct fprintf() calls.  This does mean
manually adding "\n" to each message.  A little messy, but worth it for the
simplicity in other dimensions.  While we're there make much heavier use
of single strings containing multiple lines of output text.  That reduces
the number of fprintf calls, reducing visual clutter and making it easier
to see what the output will look like from the source.

Link: https://bugs.passt.top/show_bug.cgi?id=90
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:06 +02:00
David Gibson
e651197b5c conf: Remove unhelpful usage() wrapper
usage() does nothing but call print_usage() with EXIT_FAILURE as a
parameter.  It's no more complex to just give that parameter at the single
call site.  Eliminate it and rename print_usage() to just usage().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:03 +02:00
Jon Maloy
e84a01e94c tcp: move seq_to_tap update to when frame is queued
commit a469fc393f ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
immediately can make a new attempt to transmit a frame after a failed
trasnmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.

This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.

We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had do be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:13:16 +02:00
Stefano Brivio
765eb0bf16 apparmor: Fix comments after PID file and AF_UNIX socket creation refactoring
Now:
- we don't open the PID file in main() anymore
- PID file and AF_UNIX socket are opened by pidfile_open() and
  tap_sock_unix_open()
- write_pidfile() becomes pidfile_write()

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:44:21 +02:00
Stefano Brivio
0608ec42f2 conf, passt.h: Rename pid_file in struct ctx to pidfile
We have pidfile_fd now, pid_file_fd would be quite ugly.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:44:14 +02:00
Stefano Brivio
c9b2413465 conf, passt, tap: Open socket and PID files before switching UID/GID
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.

Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:26 +02:00
Stefano Brivio
ba23b05545 passt, util: Move opening of PID file to its own function
We won't call it from main() any longer: move it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:13 +02:00
Stefano Brivio
57d8aa8ffe util: Rename write_pidfile() to pidfile_write()
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:05 +02:00
Stefano Brivio
cbca08cd38 tap: Split tap_sock_unix_init() into opening and listening parts
We'll need to open and bind the socket a while before listening to it,
so split that into two different functions. No functional changes
intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:42:43 +02:00
Stefano Brivio
fcfb592adc passt, tap: Don't use -1 as uninitialised value for fd_tap_listen
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.

Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.

Drop the -1 initalisation and the corresponding check.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-23 16:42:27 +02:00
Stefano Brivio
d02bb6ca05 tap: Move all-ones initialisation of mac_guest to tap_sock_init()
It has nothing to do with tap_sock_unix_init(). It used to be there as
that function could be called multiple times per passt instance, but
it's not the case anymore.

This also takes care of the fact that, with --fd, we wouldn't set the
initial MAC address, so we would need to wait for the guest to send us
an ARP packet before we could exchange data.

Fixes: 6b4e68383c ("passt, tap: Add --fd option")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:42:06 +02:00
Stefano Brivio
45b8632dcc conf: Don't lecture user about starting us as root
libguestfs tools have a good reason to run as root: if the guest image
is owned by root, it would be counterproductive to encourage users to
invoke them as non-root, as it would require changing permissions or
ownership of the image file.

And if they run as root, we'll start as root, too. Warn users we'll
switch to 'nobody', but don't tell them what to do.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:40:33 +02:00
David Gibson
3f917b326b netlink, test: Ignore deprecated addresses
When we retrieve or copy host addresses we can include deprecated
addresses, which is not what we want.  Adjust our logic to exclude them.
Similarly our tests can retrieve deprecated addresses, so exclude them
there too.

I hit this in practice because my router sometimes temporarily advertises
an fd00:: prefix before the real delegated IPv6 prefix.  The deprecated
address can hang around for some time messing up my tests.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:09 +02:00
David Gibson
cc801fb38f tcp: Remove interim 'tapside' field from connection
We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one.  Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:06 +02:00
David Gibson
8a2accb847 flow: Record the pifs for each side of each flow
Currently we have no generic information flows apart from the type and
state, everything else is specific to the flow type.  Start introducing
generic flow information by recording the pifs which the flow connects.

To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and TGT for
when both sides information is complete, but we haven't chosen the
flow type yet.  For now, these states don't do an awful lot, but
they'll become more important as we add more generic information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:03 +02:00
David Gibson
43571852e6 flow: Make side 0 always be the initiating side
Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code:  TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the target side.  ICMP also uses 0 for the host/"sock" side and 1 for the
guest/"tap" side, but in its case the latter is always also the initiating
side.

Make this generically consistent by always using side 0 for the initiating
side and 1 for the target side.  This doesn't simplify a lot for now, and
arguably makes TCP slightly more complex, since we add an extra field to
the connection structure to record which is the guest facing side. This is
an interim change, which we'll be able to remove later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:01 +02:00
David Gibson
0060acd11b flow: Clarify and enforce flow state transitions
Flows move over several different states in their lifetime.  The rules for
these are documented in comments, but they're pretty complex and a number
of the transitions are implicit, which makes this pretty fragile and
error prone.

Change the code to explicitly track the states in a field.  Make all
transitions explicit and logged.  To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.

While we're at it, tweak the docs to clarify the restrictions on each state
a bit.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:58 +02:00
David Gibson
a63199832a inany: Better helpers for using inany and specific family addrs together
This adds some extra inany helpers for comparing an inany address to
addresses of a specific family (including special addresses), and building
an inany from an IPv4 address (either statically or at runtime).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:55 +02:00
David Gibson
7a832a8a0e flow: Properly type callbacks to protocol specific handlers
The flow dispatches deferred and timer handling for flows centrally, but
needs to call into protocol specific code for the handling of individual
flows.  Currently this passes a general union flow *.  It makes more sense
to pass the specific relevant flow type structure.  That brings the check
on the flow type adjacent to casting to the union variant which it tags.

Arguably, this is a slight abstraction violation since it involves the
generic flow code using protocol specific types.  It's already calling into
protocol specific functions, so I don't think this really makes any
difference.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:52 +02:00
David Gibson
1a20370b36 util, tcp: Add helper to display socket addresses
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler.  For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:37 +02:00
Maxime Bélair
3ff3a8a467 apparmor: Fix passt abstraction
Commit b686afa2 introduced the invalid apparmor rule
`mount options=(rw, runbindable) /,` since runbindable mount rules
cannot have a source.

Therefore running aa-logprof/aa-genprof will trigger errors (see
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685)

    $ sudo aa-logprof
    ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/')

This patch fixes it to the intended behavior.

Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685
Fixes: b686afa23e ("apparmor: Explicitly pass options we use while remounting root filesystem")
Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:16:27 +02:00
Paul Holzinger
6cdc9fd51b apparmor: allow netns paths on /tmp
For some unknown reason "owner" makes it impossible to open bind mounted
netns references as apparmor denies it. In the kernel denied log entry
we see ouid=0 but it is not clear why that is as the actual file is
owned by the real (rootless) user id.

In abstractions/pasta there is already `@{run}/user/@{uid}/**` without
owner set for the same reason as this path contains the netns path by
default when running under Podman.

Fixes: 72884484b0 ("apparmor: allow read access on /tmp for pasta")
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:18 +02:00
David Gibson
80f7ff2996 clang-tidy: Suppress macro to enum conversion warnings
clang-tidy 18.1.1 in Fedora 40 complains about every #define of an integral
value, suggesting it be converted to an enum.  Although that's certainly
possible, it's of dubious value and results in some awkward arrangements on
out codebase.  Suppress it globally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:07 +02:00
David Gibson
29bd08ff0f conf: Fix clang-tidy warning about using an undefined enum value
In conf() we temporarily set the forwarding mode variables to 0 - an
invalid value, so that we can check later if they've been set by the
intervening logic.  clang-tidy 18.1.1 in Fedora 40 now complains about
this.  Satisfy it by giving an name in the enum to the 0 value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:05 +02:00
lemmi
26c71db332 passt.c: explicitly include libgen.h for basename
fixes implicit declaration warning on musl

Signed-off-by: lemmi <lemmi@nerd2nerd.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:00:52 +02:00
Stefano Brivio
623c2fd621 netlink: Don't duplicate routes referring to unrelated host interfaces
We take care of this in nl_addr_dup(): if the interface index
associated to an address doesn't match the selected host interface
(ifa->ifa_index != ifi_src), we don't copy that address.

But for routes, we just unconditionally update the interface index to
match the index in the target namespace, even if the source interface
didn't match.

This might happen in two cases: with a pre-4.20 kernel without support
for NETLINK_GET_STRICT_CHK, which won't filter routes based on the
interface we pass in the request, as reported by runsisi, and any
kernel with support for multipath routes where any of the nexthops
refers to an unrelated host interface.

In both cases, check the index of the source interface, and avoid
copying unrelated routes.

Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=86
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: runsisi <runsisi@hust.edu.cn>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-11 00:52:19 +02:00
Paul Holzinger
72884484b0 apparmor: allow read access on /tmp for pasta
The podman CI on debian runs tests based on /tmp but pasta is failing
there because it is unable to open the netns path as the open for read
access is denied.

Link: https://github.com/containers/podman/issues/22625
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-10 16:53:35 +02:00
Stefano Brivio
7e6a606c32 tcp_splice: Set OUT_WAIT_ flag whenever pipe isn't emptied
In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(),
from pipe to receiving socket, that doesn't necessarily mean that the
pipe is empty: the receiver buffer might be full instead.

Hence, we can't use the 'never_read' flag to decide that there's
nothing to wait for: even if we didn't read anything from the sending
side in a given iteration, we might still have data to send in the
pipe. Use read/written counters, instead.

This fixes an issue where large bulk transfers would occasionally
hang. From a corresponding strace:

     0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1
     0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0
     0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0
     0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1048576
     0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 520415
     0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 epoll_wait(4, [], 8, 1000) = 0

we're reading from socket 211 into to the pipe end numbered 151,
which connects to pipe end 150, and from there we're writing into
socket 115.

We initially drop EPOLLOUT from the set of monitored flags for socket
115, because it already signaled it's ready for output. Then we read
nothing from socket 211 (the sender had nothing to send), and we keep
emptying the pipe into socket 115 (first 1048576 bytes, then 520415
bytes).

This call of tcp_splice_sock_handler() ends with EAGAIN on the writing
side, and we just exit this function without setting the OUT_WAIT_1
flag (and, in turn, EPOLLOUT for socket 115). However, it turns out,
the pipe wasn't actually emptied, and while socket 211 had nothing
more to send, we should have waited on socket 115 to be ready for
output again.

As a further step, we could consider not clearing EPOLLOUT at all,
unless the read/written counters match, but I'm first trying to fix
this ugly issue with a minimal patch.

Link: https://github.com/containers/podman/issues/22575
Link: https://github.com/containers/podman/issues/22593
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-10 16:52:29 +02:00
David Gibson
1ba76c9e8c udp: Single buffer for IPv4, IPv6 headers and metadata
Currently we have separate arrays for IPv4 and IPv6 which contain the
headers for guest-bound packets, and also the originating socket address.
We can combine these into a single array of "metadata" structures with
space for both pre-cooked IPv4 and IPv6 headers, as well as shared space
for the tap specific header and socket address (using sockaddr_inany).

Because we're using IOVs to separately address the pieces of each frame,
these structures don't need to be packed to keep the headers contiguous
so we can more naturally arrange for the alignment we want.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:52 +02:00
David Gibson
d4598e1d18 udp: Use the same buffer for the L2 header for all frames
Currently each tap-bound frame buffer has room for its own ethernet header.
However the ethernet header is always the same for such frames, so now
that we're indirectly referencing the ethernet header via iov, we can use
a single buffer for all of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:49 +02:00
David Gibson
6170688616 udp: Share payload buffers between IPv4 and IPv6
Currently the IPv4 and IPv6 paths unnecessarily use different buffers for
the UDP payload.  Now that we're handling the various pieces of the UDP
packets with an iov, we can split the payload part of the buffers off into
its own array shared between IPv4 and IPv6.  As well as saving a little
memory, this allows the payload buffers to be neatly page aligned.

With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing
as each other and can also be merged.  Likewise udp[46]_iov_splice can be
merged together.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:46 +02:00
David Gibson
2d16946bac udp: Explicitly set checksum in guest-bound UDP headers
For IPv4, UDP checksums are optional and can just be set to 0.
udp_update_hdr4() ignores the checksum field entirely.  Since these are set
to 0 during startup, this works as intended for now.

However, we'd like to share payload and UDP header buffers betweem IPv4 and
IPv6, which does calculate UDP checksums.  Therefore, for robustness, we
should explicitly set the checksum field to 0 for guest-bound UDP packets.

In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums
to be calculated as a compile time option.  For consistency, use the same
thing in the udp_update_hdr4() path, which will typically initialize to 0,
but calculate a real checksum if configured to do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:44 +02:00
David Gibson
6c4d26a364 udp: Combine initialisation of IPv4 and IPv6 iovs
We're going to introduce more sharing between the IPv4 and IPv6 buffer
structures.  Prepare for this by combinng the initialisation functions.
While we're at it remove the misleading "sock" from the name since these
initialise both tap side and sock side structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:41 +02:00
David Gibson
3f9bd867b5 udp: Split tap-bound UDP packets into multiple buffers using io vector
When sending to the tap device, currently we assemble the headers and
payload into a single contiguous buffer.  Those are described by a single
struct iovec, then a batch of frames is sent to the device with
tap_send_frames().

In order to better integrate the IPv4 and IPv6 paths, we want the IP
header in a different buffer that might not be contiguous with the
payload.  To prepare for that, split the UDP packet into an iovec of
buffers.  We use the same split that Laurent recently introduced for
TCP for convenience.

This removes the last use of tap_hdr_len_(), tap_frame_base() and
tap_frame_len(), so remove those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:38 +02:00
David Gibson
fcd9308856 test: Allow sftp via vsock-ssh in tests
During some debugging recently, I wanted to extact a file from a test
guest and found it was tricky, since the ssh-over-vsock setup we had didn't
allow sftp/scp.  We can fix this by adding a line to the guest side sshd
config from mbuto.  While we're there correct an inaccurate comment.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:36 +02:00
David Gibson
eea5d3ef2d tcp: Update tap specific header too in tcp_fill_headers[46]()
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
  * Removes a little redundancy between the tcp_send_flag() and
    tcp_data_to_tap() path
  * Makes calculation of the correct length a little easier
  * Removes the now misleadingly named 'vnet_len' variable in
    tcp_send_flag()

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:34 +02:00
David Gibson
3559899586 iov: Helper macro to construct iovs covering existing variables or fields
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code.  In many of those cases, and few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields.  We can make initializing such iovs shorter and
clearer with a macro for the purpose.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:31 +02:00
David Gibson
40f8b2976a tap, tcp: (Re-)abstract TAP specific header handling
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.

The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:29 +02:00
David Gibson
68d1b0a152 tcp: Simplify packet length calculation when preparing headers
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length.  We can just use the L4 length throughout.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:26 +02:00
David Gibson
5566386f5f treewide: Standardise variable names for various packet lengths
At various points we need to track the lengths of a packet including or
excluding various different sets of headers.  We don't always use the same
variable names for doing so.  Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.

To improve clarity, standardise on these names:
   dlen:		L4 protocol payload length ("data length")
   l4len:		plen + length of L4 protocol header
   l3len:		l4len + length of IPv4/IPv6 header
   l2len:		l3len + length of L2 (ethernet) header

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:23 +02:00