Commit graph

1693 commits

Author SHA1 Message Date
David Gibson
96590b0560 flow: Make unified version of flow table compaction
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed.  The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow.  Therefore, move it to flow.c rather than being
purportedly TCP specific.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:09 +01:00
David Gibson
9d44aba7e0 util: MAX_FROM_BITS() should be unsigned
MAX_FROM_BITS() computes the maximum value representable in a number of
bits.  The expression for that is an unsigned value, but we explicitly cast
it to a signed int.  It looks like this is because one of the main users is
for FD_REF_MAX, which is used to bound fd values, typically stored as a
signed int.

The value MAX_FROM_BITS() is calculating is naturally non-negative, though,
so it makes more sense for it to be unsigned, and to move the case to the
definition of FD_REF_MAX.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:06 +01:00
David Gibson
e2e8219f13 flow, tcp: Consolidate flow pointer<->index helpers
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table.  We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.

They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.

In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense.  tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match.  That in
turn means we can remove a check for negative values from it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:04 +01:00
David Gibson
f08ce92a13 flow, tcp: Move TCP connection table to unified flow table
We want to generalise "connection" tracking to things other than true TCP
connections.  Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c.  The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:02 +01:00
David Gibson
16ae032608 flow, tcp: Generalise connection types
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure.  We want to generalise the TCP
connection table to other types of flows in other protocols.  Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:50:59 +01:00
David Gibson
ba84a3b17a treewide: Add messages to static_assert() calls
A while ago, we updated passt to require C11, for several reasons, but one
was to be able to use static_assert() for build time checks.  The C11
version of static_assert() requires a message to print in case of failure
as well as the test condition it self.  It was extended in C23 to make the
message optional, but we don't want to require C23 at this time.

Unfortunately we missed that in some of the static_assert()s we already
added which use the C23 form without a message.  clang-tidy has a warning
for this, but for some reason it's not seeming to trigger in the current
cases (but did for some I'm working on adding).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:50:53 +01:00
Laurent Vivier
d902bb6288 tcp: remove useless assignment
In tcp_send_flag(), a4826ee04b has replaced:

    th->doff = sizeof(*th) / 4;
    th->doff += OPT_MSS_LEN / 4;
    th->doff += (1 + OPT_WS_LEN) / 4;

by

    optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
    th->doff = (sizeof(*th) + optlen) / 4;

but forgot to remove the useless "th->doff += (1 + OPT_WS_LEN) / 4;"

Fixes: a4826ee04b ("tcp: Defer and coalesce all segments with no data (flags) to handler")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:50:51 +01:00
Stefano Brivio
bae676a44e port_fwd, util: Include additional headers to fix build with musl
lseek() is declared in unistd.h, and stdio.h provides sscanf().
Include these two headers in port_fwd.c.

SIGCHLD, even if used exclusively for clone(), is defined in
signal.h: add the include to util.h, as NS_CALL needs it.

Reported-by: lemmi <lemmi@nerd2nerd.org>
Link: https://github.com/void-linux/void-packages/actions/runs/6999782606/job/19039526604#step:7:57
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-02 03:54:55 +01:00
Stefano Brivio
7e175f32c1 packet: Offset plus length is not always uint32_t, but it's always size_t
According to gcc, PRIu32 matches the type of the argument we're
printing here on both 64 and 32-bits architectures. According to
Clang, though, that's not the case, as the result of the sum is an
unsigned long on 64-bit.

Use the z modifier, given that we're summing uint32_t to size_t, and
the result is at most promoted to size_t.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-02 03:54:47 +01:00
Stefano Brivio
06559048e7 treewide: Use 'z' length modifier for size_t/ssize_t conversions
Types size_t and ssize_t are not necessarily long, it depends on the
architecture.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-02 03:54:42 +01:00
Stefano Brivio
4117bd94f9 port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports
When pasta periodically scans bound ports and binds them on the other
side in order to forward traffic, we bind UDP ports for corresponding
TCP port numbers, too, to support protocols and applications such as
iperf3 which use UDP port numbers matching the ones used by the TCP
data connection.

If we scan UDP ports in order to bind UDP ports, we skip detection of
the UDP ports we already bound ourselves, to avoid looping back our
own ports. Same with scanning and binding TCP ports.

But if we scan for TCP ports in order to bind UDP ports, we need to
skip bound TCP ports too, otherwise, as David pointed out:

- we find a bound TCP port on side A, and bind the corresponding TCP
  and UDP ports on side B

- at the next periodic scan, we find that UDP port bound on side B,
  and we bind the corresponding UDP port on side A

- at this point, we unbind that UDP port on side B: we would
  otherwise loop back our own port.

To fix this, we need to avoid binding UDP ports that we already
bound, on the other side, as a consequence of finding a corresponding
bound TCP port.

Reproducing this issue is straightforward:

  ./pasta -- iperf3 -s

  # Wait one second, then from another terminal:
  iperf3 -c ::1 -u

Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 457ff122e3 ("udp,pasta: Periodically scan for ports to automatically forward")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-22 07:19:36 +01:00
David Gibson
4f1709db1b valgrind: Don't disable optimizations for valgrind builds
When we plan to use valgrind, we need to build passt a bit differently:
  * We need debug symbols so that valgrind can match problems it finds to
    meaningful locations
  * We need to allow additional syscalls in the seccomp filter, because
    valgrind's wrappers need them

Currently we also disable optimization (-O0).  This is unfortunate, because
it will make valgrind tests even slower than they'd otherwise be.  Worse,
it's possible that the asm behaviour without optimization might be
different enough that valgrind could miss a use of uninitialized variable
or other fault it would detect.

I suspect this was originally done because without it inlining could mean
that suppressions we use don't reliably match the places we want them to.
Alas, it turns out this is true even with -O0.  We've now implemented a
more robust workaround for this (explicit ((noinline)) attributes when
compiled with -DVALGRIND).  So, we can re-enable optimization for valgrind
builds, making them closer to regular builds.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:10:30 +01:00
David Gibson
f7724647b1 valgrind: Adjust suppression for MSG_TRUNC with NULL buffer
valgrind complains if we pass a NULL buffer to recv(), even if we use
MSG_TRUNC, in which case it's actually safe.  For a long time we've had
a valgrind suppression for this.  It singles out the recv() in
tcp_sock_consume(), the only place we use MSG_TRUNC.

However, tcp_sock_consume() only has a single caller, which makes it a
prime candidate for inlining.  If inlined, it won't appear on the stack and
valgrind won't match the correct suppression.

It appears that certain compiler versions (for example gcc-13.2.1 in
Fedora 39) will inline this function even with the -O0 we use for valgrind
builds.  This breaks the suppression leading to a spurious failure in the
tests.

There's not really any way to adjust the suppression itself without making
it overly broad (we don't want to match other recv() calls).  So, as a hack
explicitly prevent inlining of this function when we're making a valgrind
build.  To accomplish this add an explicit -DVALGRIND when making such a
build.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:10:12 +01:00
David Gibson
457ff122e3 udp,pasta: Periodically scan for ports to automatically forward
pasta supports automatic port forwarding, where we look for listening
sockets in /proc/net (in both namespace and outside) and establish port
forwarding to match.

For TCP we do this scan both at initial startup, then periodically
thereafter.  For UDP however, we currently only scan at start.  So unlike
TCP we won't update forwarding to handle services that start after pasta
has begun.

There's no particular reason for that, other than that we didn't implement
it.  So, remove that difference, by scanning for new UDP forwards
periodically too.  The logic is basically identical to that for TCP, but it
needs some changes to handle the mildly different data structures in the
UDP case.

Link: https://bugs.passt.top/show_bug.cgi?id=45
Link: https://github.com/rootless-containers/rootlesskit/issues/383
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:39 +01:00
David Gibson
4ccdeecb74 tcp: Simplify away tcp_port_rebind()
tcp_port_rebind() is desgined to be called from NS_CALL() and has two
disjoint cases: one where it enters the namespace (outbound forwards) and
one where it doesn't (inbound forwards).

We only actually need the NS_CALL() framing for the outbound case, for
inbound we can just call tcp_port_do_rebind() directly.  So simplify
tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate
an awkward parameters structure.

With that done we can safely rename tcp_port_do_rebind() to
tcp_port_rebind() for brevity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:37 +01:00
David Gibson
1776e7af9b tcp: Use common helper for rebinding inbound and outbound ports
tcp_port_rebind() has two cases with almost but not quite identical code.
Simplify things a bit by factoring this out into a single parameterised
helper, tcp_port_do_rebind().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:32 +01:00
David Gibson
3be9e0010e clang-tidy: Suppress silly misc-include-cleaner warnings
clang-tidy from LLVM 17.0.3 (which is in Fedora 39) includes a new
"misc-include-cleaner" warning that tries to make sure that headers
*directly* provide the things that are used in the .c file.  That sounds
great in theory but is in practice unusable:

Quite a few common things in the standard library are ultimately provided
by OS-specific system headers, but for portability should be accessed via
closer-to-standardised library headers.  This will warn constantly about
such cases: e.g. it will want you to include <linux/limits.h> instead of
<limits.h> to get PATH_MAX.

So, suppress this warning globally in the Makefile.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:18 +01:00
David Gibson
5ec3634b07 tap, pasta: Handle short writes to /dev/tap
tap_send_frames_pasta() sends frames to the namespace by sending them to
our the /dev/tap device.  If that write() returns an error, we already
handle it.  However we don't handle the case where the write() returns
short, meaning we haven't successfully transmitted the whole frame.

I don't know if this can ever happen with the kernel tap device, but we
should at least report the case so we don't get a cryptic failure.  For
the purposes of the return value for tap_send_frames_pasta() we treat this
case as though it was an error (on the grounds that a partial frame is no
use to the namespace).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2023-11-10 16:51:33 +01:00
David Gibson
f0776eac07 tap, pasta: Handle incomplete tap sends for pasta too
Since a469fc39 ("tcp, tap: Don't increase tap-side sequence counter for
dropped frames") we've handled more gracefully the case where we get data
from the socket side, but are temporarily unable to send it all to the tap
side (e.g. due to full buffers).

That code relies on tap_send_frames() returning the number of frames it
successfully sent, which in turn gets it from tap_send_frames_passt() or
tap_send_frames_pasta().

While tap_send_frames_passt() has returned that information since b62ed9ca
("tap: Don't pcap frames that didn't get sent"), tap_send_frames_pasta()
always returns as though it succesfully sent every frame.  However there
certainly are cases where it will return early without sending all frames.
Update it report that properly, so that the calling functions can handle it
properly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2023-11-10 16:51:33 +01:00
David Gibson
cf3eeba6c0 tcp: Don't use TCP_WINDOW_CLAMP
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising.  In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.

However, TCP_WINDOW_CLAMP just doesn't work like this.  Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets.  After that commit, it does affect
established sockets but doesn't behave the way we need:
  * It appears to be designed only to shrink the window, not to allow it to
    re-expand.
  * More importantly, that commit has a serious bug where if the
    setsockopt() is made when the existing kernel advertised window for the
    socket happens to be zero, it will now become locked at zero, stopping
    any further data from being received on the socket.

Since this has never worked as intended, simply remove it.  It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.

This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug.  We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.

Fixes: d3192f67c4 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-10 06:42:19 +01:00
David Gibson
930bc3b0f2 tcp: Rename and small cleanup to tcp_clamp_window()
tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the
sock side advertised window, but it is also responsible for actually
updating the conn->wnd_from_tap value.

Rename to tcp_tap_window_update() to reflect that broader purpose, and pull
the logic that's not TCP_WINDOW_CLAMP related out to the front.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-10 06:42:10 +01:00
Stefano Brivio
2c1554c994 test/lib/perf_report: Fix up table highlight for pasta's local flows
As commit 2926970523 ("test/perf: Small MTUs for spliced TCP
aren't interesting") drops all columns for TCP test MTUs except for
one, in throughput test for pasta's local flows, the first column we
need to highlight in that table is now the second one.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-10 06:35:54 +01:00
Stefano Brivio
56d9f6d588 Revert "selinux: Drop user_namespace class rules for Fedora 37"
This reverts commit 3fb3f0f7a5: it was
meant as a patch for Fedora 37 (and no later versions), not something
I should have merged upstream.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 14:58:02 +01:00
Stefano Brivio
74e6f48038 selinux: Allow passt to talk over unconfined_t UNIX domain socket for --fd
If passt is started with --fd to talk over a pre-opened UNIX domain
socket, we don't really know what label might be associated to it,
but at least for an unconfined_t socket, this bit of policy wouldn't
belong to anywhere else: enable that here.

This is rather loose, of course, but on the other hand passt will
sandbox itself into an empty filesystem, so we're not really adding
much to the attack surface except for what --fd is supposed to do.

Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2247221
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 12:28:27 +01:00
Stefano Brivio
50bc25a23c log: Match implicit va_start() with va_end() in vlogmsg()
According to C99, 7.15.1:

  Each invocation of the va_start and va_copy macros shall be matched
  by a corresponding invocation of the va_end macro in the same
  function

and the same applies to C11. I still have to come across a platform
where va_end() actually does something, but thus spake the standard.
This would be reported by Coverity as "Missing varargs init or
cleanup" (CWE-573).

Fixes: c0426ff10b ("log: Add vlogmsg()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 12:24:27 +01:00
Stefano Brivio
9494a51a4e port_fwd: Don't try to read bound ports from invalid file handles
This is a minimal fix for what would be reported by Coverity as
"Improper use of negative value" (CWE-394): port_fwd_init() doesn't
guarantee that all the pre-opened file handles are actually valid.

We should probably warn on failing open() and open_in_ns() in
port_fwd_init(), too, but that's outside the scope of this minimal
fix.

Before commit 5a0485425b ("port_fwd: Pre-open /proc/net/* files
rather than on-demand"), we used to have a single open() call and
a check after it.

Fixes: 5a0485425b ("port_fwd: Pre-open /proc/net/* files rather than on-demand")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 12:24:27 +01:00
Stefano Brivio
b944622969 netlink: Sequence numbers are actually 32 bits wide
Harmless, as we use sequence numbers monotonically anyway, but now
clang-tidy reports:

/home/sbrivio/passt/netlink.c:155:7: error: format specifies type 'unsigned short' but the argument has type '__u32' (aka 'unsigned int') [clang-diagnostic-format,-warnings-as-errors]
                    nh->nlmsg_seq, seq);
                    ^
/home/sbrivio/passt/log.h:26:7: note: expanded from macro 'die'
                err(__VA_ARGS__);                                       \
                    ^~~~~~~~~~~
/home/sbrivio/passt/log.h:19:34: note: expanded from macro 'err'
                                        ^~~~~~~~~~~
Suppressed 222820 warnings (222816 in non-user code, 4 NOLINT).
Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well.
1 warning treated as error
make: *** [Makefile:255: clang-tidy] Error 1

Fixes: 9d4ab98d53 ("netlink: Add nl_do() helper for simple operations with error checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 12:22:13 +01:00
David Gibson
53ff387156 test/perf: Simplify calculation of "omit" time for TCP throughput
For the TCP throughput tests, we use iperf3's -O "omit" option which
ignores results for the given time at the beginning of the test.  Currently
we calculate this as 1/6th of the test measurement time.  The purpose of
-O, however, is to skip over the TCP slow start period, which in no way
depends on the overall length of the test.

The slow start time is roughly speaking
    log_2 ( max_window_size / MSS ) * round_trip_time
These factors all vary between tests and machines we're running on, but we
can estimate some reasonable bounds for them:
  * The maximum window size is bounded by the buffer sizes at each end,
    which shouldn't exceed 16MiB
  * The mss varies with the MTU we use, but the smallest we use in tests is
    ~256 bytes
  * Round trip time will vary with the system, but with these essentially
    local transfers it will typically be well under 1ms (on my laptop it is
    closer to 0.03ms)

That gives a worst case slow start time of about 16ms.  Setting an omit
time of 0.1s uniformly is therefore more than enough, and substantially
smaller than what we calculate now for the default case (10s / 6 ~= 1.7s).

This reduces total time for the standard benchmark run by around 30s.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:24 +01:00
David Gibson
aa0bb9f471 test/perf: Remove unnecessary --pacing-timer options
We always set --pacing-timer when invoking iperf3.  However, the iperf3
man page implies this is only relevant for the -b option.  We only use the
-b option for the UDP tests, not TCP, so remove --pacing-timer from the TCP
cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:21 +01:00
David Gibson
498108ad70 test/perf: "MTU" changes in passt_tcp host to guest aren't useful
The TCP packet size used on the passt L2 link (qemu socket) makes a huge
difference to passt/pasta throughput; many of passt's overheads (chiefly
syscalls) are per-packet.

That packet size is largely determined by the MTU on the L2 link, so we
benchmark for a number of different MTUs.  That works well for the guest to
host transfers.  For the host to guest transfers, we purport to test for
different MTUs, but we're not actually adjusting anything interesting.

The host to guest transfers adjust the MTU on the "host's" (actually ns)
loopback interface.  However, that only affects the packet size for the
socket going to passt, not the packet size for the L2 link that passt
manages - passt can and will repack the stream into packets of its own
size.  Since the depacketization on that socket is handled by the kernel it
doesn't have a lot of bearing on passt's performance.

We can't fix this by changing the L2 link MTU from the guest side (as we do
for guest to host), because that would only change the guest's view of the
MTU, passt would still think it has the large MTU.  We could test this by
using the --mtu option to passt, but that would require restarting passt
for each run, which is awkward in the current setup.  So, for now, drop all
the "small MTU" tests for host to guest.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:18 +01:00
David Gibson
f94adb121a test/perf: Explicitly control UDP packet length, instead of MTU
Packet size can make a big difference to UDP throughput, so it makes sense
to measure it for a variety of different sizes.  Currently we do this by
adjusting the MTU on the relevant interface before running iperf3.

However, the UDP packet size has no inherent connection to the MTU - it's
controlled by the sender, and the MTU just affects whether the packet will
make it through or be fragmented.  The only reason adjusting the MTU works
is because iperf3 bases its default packet size on the (path) MTU.

We can test this more simply by using the -l option to the iperf3 client
to directly control the packet size, instead of adjusting the MTU.

As well as simplifying this lets us test different packet sizes for host to
ns traffic.  We couldn't do that previously because we don't have
permission to change the MTU on the host.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:16 +01:00
David Gibson
2926970523 test/perf: Small MTUs for spliced TCP aren't interesting
Currently we make TCP throughput measurements for spliced connections with
a number of different MTU values.  However, the results from this aren't
really interesting.

Unlike with tap connections, spliced connections only involve the loopback
interface on host and container, not a "real" external interface.  lo
typically has an MTU of 65535 and there is very little reason to ever
change that.  So, the measurements for smaller MTUs are rarely going to be
relevant.

In addition, the fact that we can offload all the {de,}packetization to the
kernel with splice(2) means that the throughput difference between these
MTUs isn't very great anyway.

Remove the short MTUs and only show spliced throughput for the normal
65535 byte loopback MTU.  This reduces runtime of the performance tests on
my laptop by about 1 minute (out of ~24 minutes).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:13 +01:00
David Gibson
e516809a74 test/perf: Start iperf3 server less often
Currently we start both the iperf3 server(s) and client(s) afresh each time
we want to make a bandwidth measurement.  That's not really necessary as
usually a whole batch of bandwidth measurements can use the same server.

Split up the iperf3 directive into 3 directives: iperf3s to start the
server, iperf3 to make a measurement and iperf3k to kill the server, so
that we can start the server less often.  This - and more importantly, the
reduced number of waits for the server to be ready - reduces runtime of the
performance tests on my laptop by about 4m (out of ~28minutes).

For now we still restart the server between IPv4 and IPv6 tests.  That's
because in some cases the latency measurements we make in between use the
same ports.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:10 +01:00
David Gibson
f9ff6678d4 test/perf: Get iperf3 stats from client side
iperf3 generates statistics about its run on both the client and server
sides.  They don't have exactly the same information, but both have the
pieces we need (AFAICT the server communicates some nformation to the
client over the control socket, so the most important information is in the
client side output, even if measured by the server).

Currently we use the server side information for our measurements. Using
the client side information has several advantages though:

 * We can directly wait for the client to complete and we know we'll have
   the output we want.  We don't need to sleep to give the server time to
   write out the results.
 * That in turn means we can wrap up as soon as the client is done, we
   don't need to wait overlong to make sure everything is finished.
 * The slightly different organisation of the data in the client output
   means that we always want the same json value, rather than requiring
   slightly different onces for UDP and TCP.

The fact that we avoid some extra delays speeds up the overal run of the
perf tests by around 7 minutes (out of around 35 minutes) on my laptop.

The fact that we no longer unconditionally kill client and server after
a certain time means that the client could run indefinitely if the server
doesn't respond.  We mitigate that by setting 1s connect timeout on the
client.  This isn't foolproof - if we get an initial response, but then
lose connectivity this could still run indefinitely, however it does cover
by far the most likely failure cases.  --snd-timeout would provide more
robustness, but I've hit odd failures when trying to use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:06 +01:00
David Gibson
8a41a8b20f test/perf: Remove stale iperf3c/iperf3s directives
Some older revisions used separate iperf3c and iperf3s test directives to
invoke the iperf3 client and server.  Those were combined into a single
iperf3 directive some time ago, but a couple of places still have the old
syntax.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:56:03 +01:00
David Gibson
de974f0cf1 udp: Remove socket from udp_{tap,splice}_map when timed out
We save sockets bound to particular ports in udp_{tap,splice}_map for
reuse later.  If they're not used for a time, we time them out and close
them. However, when that happened, we weren't actually removing the fds
from the relevant map.  That meant that later interactions on the same port
could get a stale fd from the map.

The stale fd might be closed, leading to unexpected EBADF errors, or it
could have been re-used by a completely different socket bound to a
different port, which could lead to us incorrectly forwarding packets.

Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu>
Reported-by: Jay <bugs.passt.top@bitsbetwixt.com>
Link: https://bugs.passt.top/show_bug.cgi?id=57
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:55:08 +01:00
David Gibson
480aa4a108 udp: Consistently use -1 to indicate un-opened sockets in maps
udp uses the udp_tap_map, udp_splice_ns and udp_splice_init tables to keep
track of already opened sockets bound to specific ports.  We need a way to
indicate entries where a socket hasn't been opened, but the code isn't
consistent if this is indicated by a 0 or a -1:
  * udp_splice_sendfrom() and udp_tap_handler() assume that 0 indicates
    an unopened socket
  * udp_sock_init() fills in -1 for a failure to open a socket
  * udp_timer_one() is somewhere in between, treating only strictly
    positive fds as valid

-1 (or, at least, negative) is really the correct choice here, since 0 is
a theoretically valid fd value (if very unlikely in practice).  Change to
use that consistently throughout.

The table does need to be initialised to all -1 values before any calls to
udp_sock_init() which can happen from conf_ports().  Because C doesn't make
it easy to statically initialise non zero values in large tables, this does
require a somewhat awkward call to initialise the table from conf().  This
is the best approach I could see for the short term, with any luck it will
go away at some point when those socket tables are replaced by a unified
flow table.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:55:03 +01:00
David Gibson
c0426ff10b log: Add vlogmsg()
Currently logmsg() is only available as a variadic function.  This is fine
for normal use, but is awkward if we ever want to write wrappers around it
which (for example) add standardised prefix information.  To allow that,
add a vlogmsg() function which takes a va_list instead, and implement
logmsg() in terms of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:59 +01:00
David Gibson
5972203174 log: Enable format warnings
logmsg() takes printf like arguments, but because it's not a built in, the
compiler won't generate warnings if the format string and parameters don't
match.  Enable those by using the format attribute.

Strictly speaking this is a gcc extension, but I believe it is also
supported by some other common compilers.  We already use some other
attributes in various places.  For now, just use it and we can worry about
compilers that don't support it if it comes up.

This exposes some warnings from existing callers, both in gcc and in
clang-tidy:
 - Some are straight out bugs, which we correct
 - It's occasionally useful to invoke the logging functions with an empty
   string, which gcc objects to, so disable that specific warning in the
   Makefile
 - Strictly speaking the C standard requires that the parameter for a %p
   be a (void *), not some other pointer type.  That's only likely to cause
   problems in practice on weird architectures with different sized
   representations for pointers to different types.  Nonetheless add the
   casts to make it happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:56 +01:00
David Gibson
50d46ec847 log: Don't define logging function 4 times
In log.c we use a macro to define logging functions for each of 4 priority
levels.  The only difference between these is the priority we pass to
vsyslog() and similar functions.  Because it's done as a macro, however,
the entire functions code is included in the binary 4 times.

Rearrange this to take the priority level as a parameter to a regular
function, then just use macros to define trivial wrappers which pass the
priority level.

This saves about 600 bytes of text in the executable (x86, non-AVX2).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:53 +01:00
Laurent Vivier
0ad54e1043 tcp: Remove remaining declaration of tcp_l2_mh
Use of tcp_l2_mh has been removed in commit 38fbfdbcb9,
but its declaration and initialization are always in the code.
Remove them as they are useless.

Fixes: 38fbfdbcb9 ("tcp: Get rid of iov with cached MSS, drop sendmmsg(), add deferred flush")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:23 +01:00
David Gibson
515db1ecc4 tcp_splice: Simplify selection of socket and pipe sides in socket handler
tcp_splice_sock_handler() uses the tcp_splice_dir() helper to select
which of the socket, pipe and counter fields to use depending on which
side of the connection the socket event is coming from.

Now that we are using arrays for the two sides, rather than separate named
fields, we can instead just use a variable indicating the side and use
that to index the arrays whever we need a particular side's field.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:21 +01:00
David Gibson
7486cd13af tcp_splice: Exploit side symmetry in tcp_splice_destroy()
tcp_splice_destroy() has some close-to-duplicated logic handling closing of
the socket and pipes for each side of the connection.  We can use a loop
across the sides to reduce the duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:19 +01:00
David Gibson
69db3b3b29 tcp_splice: Exploit side symmetry in tcp_splice_connect_finish()
tcp_splice_connect_finish() has two very similar blocks opening the two
pipes for each direction of the connection.  We can deduplicate this with
a loop across the two sides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:15 +01:00
David Gibson
1b76257147 tcp_splice: Exploit side symmetry in tcp_splice_timer()
tcp_splice_timer() has two very similar blocks one after another that
handle the SO_RCVLOWAT flags for the two sides of the connection.  We can
deduplicate this with a loop across the two sides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:13 +01:00
David Gibson
8545058fbe tcp_splice: Rename sides of connection from a/b to 0/1
Each spliced connection has two mostly, although not entirely, symmetric
sides.  We currently call those "a" and "b" and have different fields in
the connection structure for each one.

We can better exploit that symmetry if we use two element arrays rather
thatn separately named fields.  Do that in the places we can, and for the
others change the "a"/"b" terminology to 0/1 to match.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:10 +01:00
David Gibson
0e8e534850 tcp_splice: Don't pool pipes in pairs
To reduce latencies, the tcp splice code maintains a pool of pre-opened
pipes to use for new connections.  This is structured as an array of pairs
of pipes, with each pipe, of course, being a pair of fds.  Thus when we
use the pool, a single pool "slot" provides both the a->b and b->a pipes.

There's no strong reason to store the pool in pairs, though - we can
with not much difficulty instead take the a->b and b->a pipes for a new
connection independently from separate slots in the pool, or even take one
from the the pool and create the other as we need it, if there's only one
pipe left in the pool.

This marginally increases the length of code, but simplifies the structure
of the pipe pool.  We should be able to re-shrink the code with later
changes, too.

In the process we also fix some minor bugs:
- If we both failed to find a pipe in the pool and to create a new one, we
  didn't log an error and would silently drop the connection.  That could
  make debugging such a situation difficult.  Add in an error message for
  that case
- When refilling the pool, if we were only able to open a single pipe in
  the pair, we attempted to rollback, but instead of closing the opened
  pipe, we instead closed the pipe we failed to open (probably leading to
  some ignored EBADFD errors).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:06 +01:00
David Gibson
6357010cab tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl()
We initialise the events_a and events_b variables with
tcp_splice_conn_epoll_events() function, then immediately copy the values
into ev_a.events and ev_b.events.  We can't simply pass &ev_[ab].events to
tcp_splice_conn_epoll_events(), because struct epoll_event is packed,
leading to 'pointer may be unaligned' warnings if we attempt that.

We can, however, make tcp_splice_conn_epoll_events() take struct
epoll_event pointers rather than raw u32 pointers, avoiding the awkward
temporaries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:03 +01:00
David Gibson
409d3ca87f tcp_splice: Remove unnecessary forward declaration
In tcp_splice.c we forward declare tcp_splice_epoll_ctl() then define it
later on.  However, there are no circular dependencies which prevent us
from simply having the full definition in place of the forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:00 +01:00
David Gibson
5a79ba6272 tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl()
tcp_splice_epoll_ctl() removes both sockets from the epoll set if called
when conn->flags & CLOSING.  This will always happen immediately after
setting that flag, since conn_flag_do() makes the call itself.  That's also
the _only_ time it can happen: we perform the EPOLL_CTL_DEL without
clearing the conn->in_epoll flag, meaning that any further calls to
tcp_splice_epoll_ctl() would attempt EPOLL_CTL_MOD, which would necessarily
fail since the fds are no longer in the epoll.

The EPOLL_CTL_DEL path in tcp_splice_epoll_ctl() has essentially zero
overlap with anything else the function does, so just move them to be
open coded in conn_flag_do().

This does require kernel 2.6.9 or later, in order to pass NULL as the
event structure for epoll_ctl().  However, we already require at least
3.13 to allow unprivileged user namespaces.

Given that, simply directly perform the EPOLL_CTL_DEL operations from
conn_flag_do() rather than unnecessarily multiplexini

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:58 +01:00