Commit graph

136 commits

Author SHA1 Message Date
David Gibson
5cada56186 treewide: Avoid in_addr_t
IPv4 addresses can be stored in an in_addr_t or a struct in_addr.  The
former is just a type alias to a 32-bit integer, so doesn't really give us
any type checking.  Therefore we generally prefer the structure, since we
mostly want to treat IP address as opaque objects.  Fix a few places where
we still use in_addr_t, but can just as easily use struct in_addr.

Note there are still some uses of in_addr_t in conf.c, but those are
justified: since they're doing prefix calculations, they actually need to
look at the internals of the address as an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
546332786c treewide: Use IN4ADDR_LOOPBACK_INIT more widely
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address without delving into its internals.  However there are
some places we don't use it, and explicitly look at the internal structure
of struct in_addr, which we generally want to avoid.  Use the define more
widely to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
4117bd94f9 port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports
When pasta periodically scans bound ports and binds them on the other
side in order to forward traffic, we bind UDP ports for corresponding
TCP port numbers, too, to support protocols and applications such as
iperf3 which use UDP port numbers matching the ones used by the TCP
data connection.

If we scan UDP ports in order to bind UDP ports, we skip detection of
the UDP ports we already bound ourselves, to avoid looping back our
own ports. Same with scanning and binding TCP ports.

But if we scan for TCP ports in order to bind UDP ports, we need to
skip bound TCP ports too, otherwise, as David pointed out:

- we find a bound TCP port on side A, and bind the corresponding TCP
  and UDP ports on side B

- at the next periodic scan, we find that UDP port bound on side B,
  and we bind the corresponding UDP port on side A

- at this point, we unbind that UDP port on side B: we would
  otherwise loop back our own port.

To fix this, we need to avoid binding UDP ports that we already
bound, on the other side, as a consequence of finding a corresponding
bound TCP port.

Reproducing this issue is straightforward:

  ./pasta -- iperf3 -s

  # Wait one second, then from another terminal:
  iperf3 -c ::1 -u

Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 457ff122e3 ("udp,pasta: Periodically scan for ports to automatically forward")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-22 07:19:36 +01:00
David Gibson
457ff122e3 udp,pasta: Periodically scan for ports to automatically forward
pasta supports automatic port forwarding, where we look for listening
sockets in /proc/net (in both namespace and outside) and establish port
forwarding to match.

For TCP we do this scan both at initial startup, then periodically
thereafter.  For UDP however, we currently only scan at start.  So unlike
TCP we won't update forwarding to handle services that start after pasta
has begun.

There's no particular reason for that, other than that we didn't implement
it.  So, remove that difference, by scanning for new UDP forwards
periodically too.  The logic is basically identical to that for TCP, but it
needs some changes to handle the mildly different data structures in the
UDP case.

Link: https://bugs.passt.top/show_bug.cgi?id=45
Link: https://github.com/rootless-containers/rootlesskit/issues/383
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:39 +01:00
David Gibson
de974f0cf1 udp: Remove socket from udp_{tap,splice}_map when timed out
We save sockets bound to particular ports in udp_{tap,splice}_map for
reuse later.  If they're not used for a time, we time them out and close
them. However, when that happened, we weren't actually removing the fds
from the relevant map.  That meant that later interactions on the same port
could get a stale fd from the map.

The stale fd might be closed, leading to unexpected EBADF errors, or it
could have been re-used by a completely different socket bound to a
different port, which could lead to us incorrectly forwarding packets.

Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu>
Reported-by: Jay <bugs.passt.top@bitsbetwixt.com>
Link: https://bugs.passt.top/show_bug.cgi?id=57
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:55:08 +01:00
David Gibson
480aa4a108 udp: Consistently use -1 to indicate un-opened sockets in maps
udp uses the udp_tap_map, udp_splice_ns and udp_splice_init tables to keep
track of already opened sockets bound to specific ports.  We need a way to
indicate entries where a socket hasn't been opened, but the code isn't
consistent if this is indicated by a 0 or a -1:
  * udp_splice_sendfrom() and udp_tap_handler() assume that 0 indicates
    an unopened socket
  * udp_sock_init() fills in -1 for a failure to open a socket
  * udp_timer_one() is somewhere in between, treating only strictly
    positive fds as valid

-1 (or, at least, negative) is really the correct choice here, since 0 is
a theoretically valid fd value (if very unlikely in practice).  Change to
use that consistently throughout.

The table does need to be initialised to all -1 values before any calls to
udp_sock_init() which can happen from conf_ports().  Because C doesn't make
it easy to statically initialise non zero values in large tables, this does
require a somewhat awkward call to initialise the table from conf().  This
is the best approach I could see for the short term, with any luck it will
go away at some point when those socket tables are replaced by a unified
flow table.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:55:03 +01:00
David Gibson
f6d8dc2355 pif: Pass originating pif to tap handler functions
For now, packets passed to the various *_tap_handler() functions always
come from the single "tap" interface.  We want to allow the possibility to
broaden that in future.  As preparation for that, have the code in tap.c
pass the pif id of the originating interface to each of those handler
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:45 +01:00
David Gibson
732e249376 pif: Record originating pif in listening socket refs
For certain socket types, we record in the epoll ref whether they're
sockets in the namespace, or on the host.  We now have the notion of "pif"
to indicate what "place" a socket is associated with, so generalise the
simple one-bit 'ns' to a pif id.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:41 +01:00
David Gibson
0d73fa2225 udp: Clean up ref initialisation in udp_sock_init()
udp_sock_init() has a number of paths that initialise uref differently.
However some of the fields are initialised the same way in all of them.
Move those fields into the original initialiser to save a few lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:35 +01:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
7b56117dae udp, tap: Correctly advance through packets in udp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each DUP
sequence calling udp_tap_handler().  Or so it appears.

In fact, udp_tap_handler() doesn't take an index and always starts with
packet 0 of the sequence, even if called repeatedly.  It appears to be
written with the idea that the struct pool is a queue, from which it
consumes packets as it processes them, but that's not how the pool data
structure works.

Correct this by adding an index parameter to udp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:04 +02:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
5bf200ae8a tcp, udp: Don't include destination address in partially precomputed csums
We partially prepopulate IP and TCP header structures including, amongst
other things the destination address, which for IPv4 is always the known
address of the guest/namespace.  We partially precompute both the IPv4
header checksum and the TCP checksum based on this.

In future we're going to want more flexibility with controlling the
destination for IPv4 (as we already do for IPv6), so this precomputed value
gets in the way.  Therefore remove the IPv4 destination from the
precomputed checksum and fold it into the checksum update when we actually
send a packet.

Doing this means we no longer need to recompute those partial sums when
the destination address changes ({tcp,udp}_update_l2_buf()) and instead
the computation can be moved to compile time.  This means while we perform
slightly more computations on each packet, we slightly reduce the amount of
memory we need to access.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:30 +02:00
David Gibson
cee4a2da48 tap: Pass source address to protocol handler functions
The tap code passes the IPv4 or IPv6 destination address of packets it
receives to the protocol specific code.  Currently that protocol code
doesn't use the source address, but we want it to in future.  So, in
preparation, pass the IPv4/IPv6 source address of tap packets to those
functions as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:21 +02:00
David Gibson
8271a2ed57 epoll: Tiny cleanup to udp_sock_handler()
Move the test for c->no_udp into the function itself, rather than in the
dispatching switch statement to better localize the UDP specific logic, and
make for greated consistency with other handler functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:10 +02:00
David Gibson
3401644453 epoll: Generalize epoll_ref to cover things other than sockets
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd.  However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future.  Rename these fields to
an abstract "fd type" and file descriptor for more generality.

Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space.  For now these just correspond to the
supported protocols, but we'll expand on that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:51 +02:00
David Gibson
6920adda0d util: Make ns_enter() a void function and report setns() errors
ns_enter() returns an integer... but it's always zero.  If we actually fail
the function doesn't return.  Therefore it makes more sense for this to be
a function returning void, and we can remove the cases where we pointlessly
checked its return value.

In addition ns_enter() is usually called from an ephemeral thread created
by NS_CALL().  That means that the exit(EXIT_FAILURE) there usually won't
be reported (since NS_CALL() doesn't wait() for the thread).  So, use die()
instead to print out some information in the unlikely event that our
setns() here does fail.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:18:02 +02:00
David Gibson
8218d99013 Use C11 anonymous members to make poll refs less verbose to use
union epoll_ref has a deeply nested set of structs and unions to let us
subdivide it into the various different fields we want.  This means that
referencing elements can involve an awkward long string of intermediate
fields.

Using C11 anonymous structs and unions lets us do this less clumsily.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:17:57 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
David Gibson
34ade90957 Work around weird false positives with cppcheck-2.9.1
Commit 89e38f55 "treewide: Fix header includes to build with musl" added
extra #includes to work with musl.  Unfortunately with the cppcheck version
I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false
positives: specifically cppcheck seems to hit a #error in <bits/unistd.h>
complaining about including it directly instead of via <unistd.h> (which is
not something we're doing).

I have no idea why that would be happening; but I'm guessing it has to be
a bug in the cpp implementation in that cppcheck version.  In any case,
it's possible to work around this by moving the include of <unistd.h>
before the include of <signal.h>.  So, do that.

Fixes: 89e38f5540 ("treewide: Fix header includes to build with musl")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:38:06 +01:00
Stefano Brivio
ccf6d2a7b4 udp: Actually bind detected namespace ports in init namespace
When I reworked udp_init() to move most of the port binding logic
to conf_ports, I accidentally dropped this bit of automatic port
detection (and binding) at start-up.

On -U auto, in pasta mode, udp_sock_init_ns() binds ports in the
namespace that correspond to ports bound in the init namespace,
but on -u auto, nothing actually happens after port detection.

Add udp_sock_init_init() to deal with this, and while at it fix
the comment to udp_sock_init_ns(): the latter takes care of
outbound "connections".

This is currently not covered by tests, and the UDP port needs to
be already bound in the namespace when pasta starts (periodic
detection for UDP is a missing feature at the moment). It can be
checked like this:

  $ unshare -rUn
  # echo $$
  590092
  # socat -u UDP-LISTEN:5555 STDOUT

  $ pasta -q -u auto 590092
  $ echo "test" | socat -u STDIN UDP:localhost:5555

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 3c6ae62510 ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:19:04 +01:00
Stefano Brivio
5aea2f88ab tcp, udp: Fix partial success return codes in {tcp,udp}_sock_init()
The comments say we should return 0 on partial success, and an error
code on complete failure. Rationale: if the user configures a port
forwarding, and we succeed to bind that port for IPv4 or IPv6 only,
that might actually be what the user intended.

Adjust the two functions to reflect the comments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
73992c42ce tcp, udp, util: Pass socket creation errors all the way up
...starting from sock_l4(), pass negative error (errno) codes instead
of -1. They will only be used in two commits from now, no functional
changes intended here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Chris Kuhn
89e38f5540 treewide: Fix header includes to build with musl
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.

Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.

Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
a9c59dd91b conf, icmp, tcp, udp: Add options to bind to outbound address and interface
I didn't notice earlier: libslirp (and slirp4netns) supports binding
outbound sockets to specific IPv4 and IPv6 addresses, to force the
source addresse selection. If we want to claim feature parity, we
should implement that as well.

Further, Podman supports specifying outbound interfaces as well, but
this is simply done by resolving the primary address for an interface
when the network back-end is started. However, since kernel version
5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for
non-root users"), we can actually bind to a specific interface name,
which doesn't need to be validated in advance.

Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses,
and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets
to given interfaces.

Given that it probably makes little sense to select addresses and
routes from interfaces different than the ones given for outbound
sockets, also assign those as "template" interfaces, by default,
unless explicitly overridden by '-i'.

For ICMP and UDP, we call sock_l4() to open outbound sockets, as we
already needed to bind to given ports or echo identifiers, and we
can bind() a socket only once: there, pass address (if any) and
interface (if any) for the existing bind() and setsockopt() calls.

For TCP, in general, we wouldn't otherwise bind sockets. Add a
specific helper to do that.

For UDP outbound sockets, we need to know if the final destination
of the socket is a loopback address, before we decide whether it
makes sense to bind the socket at all: move the block mangling the
address destination before the creation of the socket in the IPv4
path. This was already the case for the IPv6 path.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:43:59 +01:00
Stefano Brivio
83236216c4 udp: Fix signedness warning on 32-bits architectures
When a ssize_t is an int:

udp.c: In function ‘udp_sock_handler’:
udp.c:774:23: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare]
  774 |         for (i = 0; i < n; i += m) {
      |                       ^
udp.c:781:43: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare]
  781 |                         for (m = 1; i + m < n; m++) {
      |

Change 'i' and 'm' counters in udp_sock_handler() to signed versions,
to match ssize_t n.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 00:36:06 +01:00
Stefano Brivio
bad2526872 conf, udp: Allow any loopback address to be used as resolver
Andrea reports that with a Fedora 37 guest running on a Fedora 37
host, both using systemd-resolved, with passt connecting them,
running with default options, DNS queries don't work.

systemd-resolved on the host is reachable only at the loopback
address 127.0.0.53.

We advertise the default gateway address to the guest as resolver,
because our local address is of course unreachable from there, which
means we see DNS queries directed to the default gateway, and we
redirect them to 127.0.0.1. However, systemd-resolved doesn't answer
on 127.0.0.1.

To fix this, set @dns_match to the address of the default gateway,
unless a different resolver address is explicitly configured, so that
we know we explicitly have to map DNS queries, in this case, to the
address of the local resolver.

This means that in udp_tap_handler() we need to check, first, if
the destination address of packets matches @dns_match: even if it's
the address of the local gateway, we want to map that to a specific
address, which isn't necessarily 127.0.0.1.

Do the same for IPv6 for consistency, even though IPv6 defines a
single loopback address.

Reported-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-27 18:52:56 +01:00
Stefano Brivio
ddf7097a71 udp: Actually use host resolver to forward DNS queries
Instead of the address of the first resolver we advertise to
the guest or namespace.

This was one of the intentions behind commit 3a2afde87d ("conf,
udp: Drop mostly duplicated dns_send arrays, rename related fields"),
but I forgot to implement this part. In practice, they are usually
the same thing, unless /etc/resolv.conf points to a loopback address.

Fixes: 3a2afde87d ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-27 18:52:04 +01:00
Stefano Brivio
3d0de2c1d7 conf, tcp, udp: Exit if we fail to bind sockets for all given ports
passt supports ranges of forwarded ports as well as 'all' for TCP and
UDP, so it might be convenient to proceed if we fail to bind only
some of the desired ports.

But if we fail to bind even a single port for a given specification,
we're clearly, unexpectedly, conflicting with another network
service. In that case, report failure and exit.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-16 17:33:49 +01:00
David Gibson
7a8ed9459d Make assertions actually useful
There are some places in passt/pasta which #include <assert.h> and make
various assertions.  If we hit these something has already gone wrong, but
they're there so that we a useful message instead of cryptic misbehaviour
if assumptions we thought were correct turn out not to be.

Except.. the glibc implementation of assert() uses syscalls that aren't in
our seccomp filter, so we'll get a SIGSYS before it actually prints the
message.  Work around this by adding our own ASSERT() implementation using
our existing err() function to log the message, and an abort().  The
abort() probably also won't work exactly right with seccomp, but once we've
printed the message, dying with a SIGSYS works just as well as dying with
a SIGABRT.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 23:42:24 +01:00
David Gibson
54502cca7f udp: Use tap_send_frames()
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching.  First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg().  We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends.  Use
that to considerably simplify things.

This does make a couple of behavioural changes:
  * We used to split messages to keep them under 32kiB (except when a
    single frame was longer than that).  The comments claim this is
    needed to stop qemu from closing the connection, but we don't have any
    equivalent logic for TCP.  I wasn't able to reproduce the problem with
    this series, although it was apparently easy to reproduce earlier.

    My suspicion is that there was never an inherent need to keep messages
    small, however with larger messages (and default kernel buffer sizes)
    the chances of needing more than one resend for partial send()s is
    greatly increased.  We used not to correctly handle that case of
    multiple resends, but now we do.

  * Previously when we got a partial send on UDP, we would resend the
    remainder of the entire "message", including multiple frames.  The
    common code now only resends the remainder of a single frame, simply
    dropping any frames which weren't even partially sent.  This is what
    TCP always did and is probably a better idea for UDP too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:55:04 +01:00
David Gibson
97d1ca2ed6 udp: Use abstracted tap header
Update the UDP code to use the tap layer abstractions for initializing and
updating the L2 and lower headers.  This will make adding other tap
backends in future easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:59 +01:00
David Gibson
716a926ef4 util: Parameterize ethernet header initializer macro
We have separate IPv4 and IPv6 versions of a macro to construct an
initializer for ethernet headers.  However, now that we have htons_constant
it's easy to simply paramterize this with the ethernet protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:46 +01:00
David Gibson
67afaab411 tcp, udp: Use named field initializers in iov_init functions
Both the TCP and UDP iov_init functions have some large structure literals
defined in "field order" style.  These are pretty hard to read since it's
not obvious what value corresponds to what field.  Use named field style
initializers instead to make this clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:44 +01:00
David Gibson
b93d025d50 udp: Don't use separate sockets to listen for spliced packets
Currently, when ports are forwarded inbound in pasta mode, we open two
sockets for incoming traffic: one listens on the public IP address and will
forward packets to the tuntap interface.  The other listens on localhost
and forwards via "splicing" (resending directly via sockets in the ns).

Now that we've improved the logic about whether we "splice" any individual
packet, we don't need this.  Instead we can have a single socket bound to
0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal
with each packet as appropriate.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:12 +01:00
David Gibson
8d503e825f udp: Decide whether to "splice" per datagram rather than per socket
Currently we have special sockets for receiving datagrams from locahost
which can use the optimized "splice" path rather than going across the tap
interface.

We want to loosen this so that sockets can receive sockets that will be
forwarded by both the spliced and non-spliced paths.  To do this, we alter
the meaning of the @splice bit in the reference to mean that packets
receieved on this socket *can* be spliced, not that they *will* be spliced.
They'll only actually be spliced if they come from 127.0.0.1 or ::1.

We can't (for now) remove the splice bit entirely, unlike with TCP.  Our
gateway mapping means that if the ns initiates communication to the gw
address, we'll translate that to target 127.0.0.1 on the host side.  Reply
packets will therefore have source address 127.0.0.1 when received on the
host, but these need to go via the tap path where that will be translated
back to the gateway address.  We need the @splice bit to distinguish that
case from packets going from localhost to a port mapped explicitly with
-u which should be spliced.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:09 +01:00
David Gibson
8a10f23720 udp: Unify udp_sock_handler_splice() with udp_sock_handler()
These two functions now have a very similar structure, and their first
part (calling recvmmsg()) is functionally identical.  So, merge the two
functions into one.

This does have the side effect of meaning we no longer receive multiple
packets at once for splice (we already didn't for tap).  This does hurt
throughput for small spliced packets, but improves it for large spliced
packets and tap packets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:06 +01:00
David Gibson
f1ed8dbfa7 udp: Pre-populate msg_names with local address
udp_splice_namebuf is now used only for spliced sending, and so it is
only ever populated with the localhost address, either IPv4 or IPv6.
So, replace the awkward initialization in udp_sock_handler_splice()
with statically initialized versions for IPv4 and IPv6.  We then just
need to update the port number in udp_sock_handler_splice().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:03 +01:00
David Gibson
c9e193b5ae udp: Don't handle tap receive batch size calculation within a #define
UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a
batch for sizing our buffers and control structures.  The subtly different
UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to
receive at once for tap packets in the current configuration.

This depends on the mode, meaning that the macro has a non-obvious
dependency on the usual 'c' context variable being available.  We only use
it in one place, so it makes more sense to open code this.  Add an
explanatory comment while we're there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:01 +01:00
David Gibson
4eb54fd2e7 udp: Split receive from preparation and send in udp_sock_handler()
The receive part of udp_sock_handler() and udp_sock_handler_splice() is now
almost identical.  In preparation for merging that, split the receive part
of udp_sock_handler() from the part preparing and sending the frames for
sending on the tap interface.  The latter goes into a new udp_tap_send()
function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:58 +01:00
David Gibson
09c00f1d2a udp: Split sending to passt tap interface into separate function
The last part of udp_sock_handler() does the actual sending of frames
to the tap interface.  For pasta that's just a call to
udp_tap_send_pasta() but for passt, it's moderately complex and open
coded.

For symmetry, move the passt send path into its own function,
udp_tap_send_passt().  This will make it easier to abstract the tap
interface in future (e.g. when we want to add vhost-user).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:55 +01:00
David Gibson
a6e919a951 udp: Move sending pasta tap frames to the end of udp_sock_handler()
udp_sock_handler() has a surprising difference in flow between pasta and
passt mode: For pasta we send each frame to the tap interface as we prepare
it.  For passt, though, we prepare all the frames, then send them with a
single sendmmsg().

Alter the pasta path to also prepare all the frames, then send them at the
end.  We already have a suitable data structure for the passt case.  This
will make it easier to abstract out the tap backend difference in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:51 +01:00
David Gibson
310bdbdcf4 udp: Factor out control structure management from udp_sock_fill_data_v[46]
The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP
and other headers we'll need to forward data onto the tap interface.  In
addition they update the control structures (iovec and mmsghdr) we'll need
to send the messages, and in the case of pasta actually sends it.

This leads the control structure management and the send itself awkwardly
split between udp_sock_fill_data_v[46]() and their caller
udp_sock_handler().  In addition, this tail part of udp_sock_fill_datav[46]
is essentially common between the IPv4 and IPv6 versions, apart from which
control array we're working on.

Clean this up by reducing these functions to just construct the headers
and renaming them to udp_update_hdr[46]() accordingly.  The control
structure updates are now all in the caller, and common for IPv4 and IPv6.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:22 +01:00
David Gibson
4b2d227c86 udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode
Currently, we always populate udp[46]_l2_iov_tap[].iov_base with the
very start of the header buffers, including space for the qemu vnet_len
tag suitable for passt mode.  That's ok because we don't actually use these
iovecs for pasta mode.

However, we do know the mode in udp_sock[46]_iov_init() so adjust these
to the beginning of the headers we'll actually need for the mode: including
the vnet_len tag for passt, but excluding it for pasta.

This allows a slightly nicer way to locate the right buffer to send in the
pasta case, and will allow some additional cleanups later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:19 +01:00
David Gibson
d533807027 udp: Better factor IPv4 and IPv6 paths in udp_sock_handler()
Apart from which mh array they're operating on the recvmmsg() calls in
udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are
some of the control structure updates.

By using some local variables to refer to the IP version specific control
arrays, make some more logic common between the IPv4 and IPv6 paths.  As
well as slightly reducing the code size, this makes it less likely that
we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we
did in a recently fixed bug.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:16 +01:00
David Gibson
6af7ee74cf udp: Fix incorrect use of IPv6 mh buffers in IPv4 path
udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path.  In
fact this is harmless because this assignment is redundant (the 0th entry
msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and
won't change).

There is also an incorrect usage of udp6_l2_mh_tap[] in
udp_sock_fill_data_v4.  This one can cause real problems, because we'll
use stale iov_len values if we send multiple messages to the qemu socket.
Most of the time that will be relatively harmless - we're likely to either
drop UDP packets, or send duplicates.  However, if the stale iov_len we
use ends up referencing an uninitialized buffer we could desynchronize the
qemu stream socket.

Correct both these bugs.  The UDP6 path appears to be correct, but it does
have some comments that incorrectly reference the IPv4 versions, so fix
those as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:07 +01:00
David Gibson
34764ea4f3 udp: Correct splice forwarding when receiving from multiple sources
udp_sock_handler_splice() reads a whole batch of datagrams at once with
recvmmsg().  It then forwards them all via a single socket on the other
side, based on the source port.

However, it's entirely possible that the datagrams in the set have
different source ports, and thus ought to be forwarded via different
sockets on the destination side.  In fact this situation arises with the
iperf -P4 throughput tests in our own test suite.  AFAICT we only get away
with this because iperf3 is strictly one way and doesn't send reply packets
which would be misdirected because of the incorrect source ports.

Alter udp_sock_handler_splice() to split the packets it receives into
batches with the same source address and send each batch with a separate
sendmmsg().

For now we only look for already contiguous batches, which means that if
there are multiple active flows interleaved this is likely to degenerate
to batches of size 1.  For now this is the simplest way to correct the
behaviour and we can try to optimize later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:03 +01:00
David Gibson
2dec914209 udp: Split send half of udp_sock_handler_splice() from the receive half
Move the part of udp_sock_handler_splice() concerned with sending out the
datagrams into a new udp_splice_sendfrom() helper.  This will make later
cleanups easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:00 +01:00
David Gibson
fc7f91e709 udp: Unify buffers for tap and splice paths
We maintain a set of buffers for UDP packets to be forwarded via the tap
interface in udp[46]_l2_buf.  We then have a separate set of buffers for
packets to be "spliced" in udp_splice_buf[].  However, we only use one of
these at a time, so we can share the buffer space.

For the receiving splice packets we can not only re-use the data buffers
but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures.

For sending the splice packets we keep the same data buffers, but we need
specific control structures.  We create udp[46]_iov_splice - we can't
reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing
spliced packets, but the tap path expects iov_len to remain the same (it
only uses it for receive).  Likewise we create udp[46]_mh_splice with the
mmsghdr structures for sending spliced packets.  As well as needing to
reference different iovs, these need to all reference udp_splice_namebuf
instead of individual msg_name fields for each slot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:56 +01:00
David Gibson
c52ca4aecf udp: Add helper to extract port from a sockaddr_in or sockaddr_in6
udp_sock_handler_splice() has a somewhat clunky if to extract the port from
a socket address which could be either IPv4 or IPv6.  Future changes are
going to make this even more clunky, so introduce a helper function to
do this extraction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:53 +01:00