Commit graph

1749 commits

Author SHA1 Message Date
David Gibson
95569e4aa4 util: Some corrections for timespec_diff_us
The comment for timespec_diff_us() claims it will wrap after 2^64µs.  This
is incorrect for two reasons:
  * It returns a long long, which is probably 64-bits, but might not be
  * It returns a signed value, so even if it is 64 bits it will wrap after
    2^63µs

Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:15:40 +02:00
Stefano Brivio
fbb0c9523e conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.

That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.

Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.

While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-07 09:15:25 +02:00
Stefano Brivio
ee36266a55 log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).

But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.

Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.

Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:03:48 +02:00
Stefano Brivio
3a082c4ecb tcp_splice: Fix side in OUT_WAIT flag setting
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to be waiting for is for
output readiness on side 1, not 0, and the other way around as well.

This causes random transfer failures for local TCP connections,
depending if we ever need to wait for output readiness.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-08-06 15:03:31 +02:00
David Gibson
031df332e9 util: Use unsigned (size_t) value for iov length
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t.  Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder.  Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:01:46 +02:00
Laurent Vivier
e877f905e5 udp_flow: move all udp_flow functions to udp_flow.c
No code change.

They need to be exported to be available by the vhost-user version of
passt.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:38:17 +02:00
Laurent Vivier
623ceb1f2b udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid to export udp_meta_t too that is specific
to the socket version of udp.c, don't pass udp_meta_t to it,
but the only needed field, s_in.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:37:53 +02:00
David Gibson
a5bbefa6fb log: Make logfile_write() private
logfile_write() is not used outside log.c, nor should it be.  It should
only be used externall via the general logging functions.  Make it static
in log.c.  To avoid forward declarations this requires moving a bunch of
functions earlier in the file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 15:03:33 +02:00
Stefano Brivio
f30ed68c52 pasta: Save errno on signal handler entry, restore on return when needed
Ed reported this:

  # Error: pasta failed with exit code 1:
  # Couldn't drop cap 3 from bounding set
  # : No child processes

in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?

Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:

       Fetching and setting the value of errno is async-signal-safe
       provided that the signal handler saves errno on entry and
       restores its value before returning.

Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-05 15:02:36 +02:00
Danish Prakash
0149d11cc5 pasta: modify hostname when detaching new namespace
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not leaving
users a bit confused. This change modifies the host namespace
to add a prefix "pasta-" to make it a bit more obvious.

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-30 17:27:43 +02:00
AbdAlRahman Gad
8fae3b73cb Fix typo in README file
- remove duplicated 'the' in the 'Services' section

Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-29 19:02:35 +02:00
Stefano Brivio
f87b11c7be fedora/rpkg: List myself as author for changelog entries
...instead of the latest author for contrib/fedora.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 16:40:41 +02:00
David Gibson
57a21d2df1 tap: Improve handling of partially received frames on qemu socket
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If a the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:42 +02:00
David Gibson
37e3b24d90 tap: Correctly handle frames of odd length
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:38 +02:00
David Gibson
4684f60344 tap: Don't use EPOLLET on Qemu sockets
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket.  It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling.  That consideration
doesn't apply to the way we handle the qemu socket however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
   we *do* set it when using pasta mode with --fd.  This inconsistency
   doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available
   is read (otherwise we may have data in the buffer but never get an event
   causing us to read it).  We do that with a rather ugly goto.

   Worse, our condition for that goto appears to be incorrect.  We'll only
   loop if rem is non-zero, which will only happen if we perform a blocking
   recv() for a partially received frame.  We'll only perform that second
   recv() if the original recv() resulted in a partially read frame.  As
   far as I can tell the original recv() could end on a frame boundary
   (never triggering the second recv()) even if there is additional data in
   the socket buffer.  In that circumstance we wouldn't goto redo and could
   leave unprocessed frames in the qemu socket buffer indefinitely.

   This doesn't seem to have caused any problems in practice, but since
   there's no obvious reason to use EPOLLET here anyway, we might as well
   get rid of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:20 +02:00
David Gibson
9e3f2355c4 tap: Don't attempt to carry on if we get a bad frame length from qemu
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on.  That sounds sensible on first blush, but
probably isn't wise in practice.  If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket.  Or more likely (c), we have a bug elswhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.

Neither (b) nor (c) is really salvageable with the same stream.  Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.

So, instead of just skipping the frame and trying to carry on, log an error
and close the socket.  As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:05:44 +02:00
David Gibson
a06db27c49 tap: Better report errors receiving from QEMU socket
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error.  Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we realy should report
something for debugability.

Add an error message in this case.  Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 13:51:23 +02:00
Stefano Brivio
77c092ee5e log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME
We report relative timestamps in logs, so we want to avoid jumps in
the system time.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:36 +02:00
Stefano Brivio
e5c37ba0f4 log: Initialise timestamp for relative log time also if we use a log file
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:34 +02:00
Stefano Brivio
327d9d482f log, util: Fix sub-second part in relative log time calculation
For some reason, in commit 01efc71ddd ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not for
accounting for the fractional part.

Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.

To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:43:19 +02:00
Stefano Brivio
2ce1d37831 test/lib/perf_report: Fix highlight
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 17:56:47 +02:00
David Gibson
e9a542321f test: Fix spurious test failure with systemd-resolved
systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53.  Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case.  We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address.  However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.

Correct this with a different test condition.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:39:00 +02:00
David Gibson
becf81ab88 fwd: Broaden what we consider for DNS specific forwarding rules
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side.  Currently, however, only UDP
packets to port 53 are considered "DNS requests".  This ignores DNS
requests over TCP - less common, but certainly possible.  It also ignores
encrypted DNS requests on port 853.

Extend the DNS forwarding logic to handle both of those cases.

Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:38:11 +02:00
David Gibson
0ada84e3f8 fwd: Refactor tests in fwd_nat_from_tap() for clarity
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.

To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:37:57 +02:00
Stefano Brivio
4a333c88d7 conf: Accept addresses enclosed by square brackets in port forwarding specifiers
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.

Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).

Allow square brackets for both IPv4 and IPv6 addresses, correct or
not, they're harmless anyway.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:56 +02:00
Stefano Brivio
6ff702f325 tap: Exit if we fail to bind a UNIX domain socket with explicit path
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.

But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:42 +02:00
Stefano Brivio
f72d35a78d test: iperf3 3.16 introduces multiple threads, drop our own implementation of that
Starting from iperf3 version 3.16, -P / --parallel spawns multiple
clients as separate threads, instead of multiple streams serviced by
the same thread.

So we can drop our lib/test implementation to spawn several iperf3
client and server processes and finally simplify things quite a bit.

Adjust number of threads and UDP sending bandwidth to values that seem
to be more or less matching previous throughput tests on my setup.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:38 +02:00
Stefano Brivio
606e0c7b95 test: Update names of symbols and slabinfo entries
Differences in allocated Acpi-Parse entries are gone (at least) since
the 6.1 Linux kernel series. I should run this on a 6.10 kernel,
eventually, and adjust things further, as needed.

Userspace symbols are also fairly different now: show whatever is more
than 1 MiB at the moment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:29 +02:00
Stefano Brivio
f16f8f5bf6 test: Fix memory/passt tests, --netns-only is not a valid option for passt
This used to work on my setup as I kept reusing an old mbuto
(initramfs) image, but since commit 65923ba798 ("conf: Accept
duplicate and conflicting options, the last one wins"), --netns-only
is, as originally intended, a pasta-only option.

I had used --netns-only, here, to prevent passt from trying to detach
its own user namespace, which is not permitted as we're in a chroot,
see unshare(2). In turn, we need the chroot because passt can't pivot
root directly into its own empty filesystem using an initramfs.

Use switch_root into the tmpfs mountpoint instead of chroot, so that
we can still detach user namespaces.

Note that in the mbuto images, we can't switch to nobody as we have
no password entries at all, so we need to detach a further user
namespace before starting passt, to trick passt into running as UID
0.

Given the new sequence, it's now more convenient to directly switch
to a detached network namespace as well, which means we need to move
the initialisation of the dummy network from the init script into the
test script.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:08 +02:00
Stefano Brivio
1cd773081f log: Drop newlines in the middle of the perror()-like messages
Calling vlogmsg() twice from logmsg_perror() results in this beauty:

  $ ./pasta -i foo
  Invalid interface name foo
  : No such device

because the first part of the message, corresponding to the first
call, doesn't end with a newline, and vlogmsg() adds it.

Given that we can't easily append an argument (error description) to
a variadic list, add a 'newline' parameter to all the functions that
currently add a newline if missing, and disable that on the first call
to vlogmsg() from logmsg_perror(). Not very pretty but I can't think
of any solution that's less messy than this.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:31 +02:00
Stefano Brivio
13295583f8 tcp: Change SO_PEEK_OFF support message to debug()
This:

  $ ./pasta
  SO_PEEK_OFF not supported
  #

is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.

Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.

Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:26 +02:00
Stefano Brivio
d19b396f11 tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down
If we start pasta with some ports forwarded, but no --config-net, say:

  $ ./pasta -u 10001

and then use a local, non-loopback address to send traffic to that
port, say:

  $ socat -u FILE:test UDP4:192.0.2.1:10001

pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.

By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:05 +02:00
David Gibson
a09aeb4bd6 tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames
When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust
the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards.
Although it was discussed during development, somewhere during the shuffles
the case where we move the pointer backwards because we lost frames while
sending them to the guest.  This can happen, for example, if the socket
buffer on the Unix socket to qemu overflows.

Fixing this is slightly complicated because we need to pass a non-const
context pointer to some places we previously didn't need it.  While we're
there also fix a small stylistic issue in the function comment for
tcp_revert_seq() - it was using spaces instead of tabs.

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-24 09:27:46 +02:00
Jon Maloy
9cb6b50815 tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6
Based on an original patch by Jon Maloy:

--
The recently added socket option SO_PEEK_OFF is not supported for
TCP/IPv6 sockets. Until we get that support into the kernel we need to
test for support in both protocols to set the global 'peek_offset_cap´
to true.
--

Compared to the original patch:
- only check for SO_PEEK_OFF support for enabled IP versions
- use sa_family_t instead of int to pass the address family around

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-23 16:42:27 +02:00
David Gibson
882599e180 udp: Rename UDP listening sockets
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows.  Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match.  Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:34:01 +02:00
David Gibson
d29fa0856e udp: Remove rdelta port forwarding maps
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map.  This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port.  This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.

So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:57 +02:00
David Gibson
d89b3aa097 udp: Remove obsolete socket tracking
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[ or udp_act[] arrays.  Remove them and connected
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:55 +02:00
David Gibson
898f797174 udp: Direct datagrams from host to guest via flow table
This replaces the last piece of existing UDP port tracking with the
common flow table.  Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface.  Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:51 +02:00
David Gibson
b7ad19347f udp: Find or create flows for datagrams from tap interface
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams.  We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them.  Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.

As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.

For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter.  Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:48 +02:00
David Gibson
8126f7a660 udp: Remove obsolete splice tracking
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used.  With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[].  The latter are still
used for tracking of automatic port forwarding.

Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:45 +02:00
David Gibson
e0647ad80c udp: Handle "spliced" datagrams with per-flow sockets
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it.  Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.

For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow.  We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else.  We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c

When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:42 +02:00
David Gibson
a45a7e9798 udp: Create flows for datagrams from originating sockets
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps.  Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).

When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc.  This may result in smaller batches, but makes things
easier down the line.  We can re-optimise this later if necessary.  For now
we don't do anything else with the flow, not even match reply packets to
the same flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:39 +02:00
David Gibson
8abd06e9fa fwd: Update flow forwarding logic for UDP
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets.  The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases.  Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:35 +02:00
David Gibson
c000f2aba6 flow, icmp: Use general flow forwarding rules for ICMP
Current ICMP hard codes its forwarding rules, and never applies any
translations.  Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).

This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine.  This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.

This removes the last user of flow_target_af(), so that's removed as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:33 +02:00
David Gibson
060f24e310 flow, tcp: Flow based NAT and port forwarding for TCP
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side.  This performs any NAT or port forwarding needed.

We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:29 +02:00
David Gibson
4cd753e65c icmp: Manage outbound socket address via flow table
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.

To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses.  It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address.  This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out.  We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:25 +02:00
David Gibson
781164e25b flow: Helper to create sockets based on flowside
We have upcoming use cases where it's useful to create new bound socket
based on information from the flow table.  Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:23 +02:00
David Gibson
2faf6fcd8b icmp: Eliminate icmp_id_map
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything.  Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:20 +02:00
David Gibson
2f40a01944 icmp: Look up ping flows using flow hash
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an anciliary data structure, the
icmp_id_map[] tables.  However, we can look this up using the flow hash
table - that's what it's for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:16 +02:00
David Gibson
6d76278c21 icmp: Obtain destination addresses from the flowsides
icmp_sock_handler() obtains the guest address from it's most recently
observed IP.  However, this can now be obtained from the common flowside
information.

icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.

Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:13 +02:00