Commit graph

1754 commits

Author SHA1 Message Date
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions sinmultaneously, with patch:
========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions sinmultaneously, without patch:
============================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
David Gibson
9a0e544f05 test: Improve test for NDP assigned prefix
In the NDP tests we search explicitly for a guest address with prefix
length 64.  AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means.  We can do
that more explicitly by checking for .protocol == "kernel_ra". however.

The SLAAC prefixes we assigned *will* always be 64-bit, that's hard-coded
into our NDP implementation.  RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64-bits.

Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:37 +01:00
David Gibson
910f4f9103 test: Don't require 64-bit prefixes in perf tests
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length.  There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:34 +01:00
David Gibson
1699083f29 test: Make nstool hold robust against interruptions to control clients
Currently nstool die()s on essentially any error.  In most cases that's
fine for our purposes.  However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket.  This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill of the namespace we're holding.

Adjust these to print an error, close() the control client socket and
carry on.  In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:30 +01:00
David Gibson
b456ee1b53 test: Rename propagating signal handler
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes.  The signal handler which
accomplishes this is called simply sig_handler().  However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:27 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
6f913b3af0 udp: Don't dereference uflow before NULL check in udp_reply_sock_handler()
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler().  However, we dereference uflow before that in
an initializer, rather defeating the point.  Rearrange to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:22 +01:00
David Gibson
d8e05a3fe0 ndp: Use const pointer for ndp_ns packet
We don't modify this structure at all.  For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:19 +01:00
David Gibson
0d7b8201ed linux_dep: Generalise tcp_info.h to handling Linux extension compatibility
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers.  There are several other cases where we can remove similar
compile time dependencies on kernel version.  Prepare for that by renaming
tcp_info.h to linux_dep.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:16 +01:00
David Gibson
c5f4e4d146 fwd: Squash different-signedness comparison warning
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral().  This is because
NUM_PORTS evaluates to an unsigned integer.  It's a fixed value, though
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:14 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except.. we don't have any users for it any more, so remove it.

make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
1d7cff3779 clang: Add rudimentary clangd configuration
clangd's default configuration seems to try to treat .h files as C++ not
C.  There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:07 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
David Gibson
13fc6d511e Makefile: Use -DARCH for qrap only
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile.  However, this is only used in qrap.c, not anywhere else in
passt or pasta.  Only supply this -D when compiling qrap specifically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:59 +01:00
David Gibson
7917159005 seccomp: Simplify handling of AUDIT_ARCH
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D.  The only place that uses it, though is the
BPF filter generated by seccomp.sh.  seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define.  Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:55 +01:00
David Gibson
93bce404c1 Makefile: Move NETNS_RUN_DIR definition to C code
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D.  But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile.  Just move it to a plain #define in conf.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:52 +01:00
David Gibson
c938d8a93e netlink: RTA_PAYLOAD() returns int, not size_t
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t.  However, it doesn't - it explicitly casts
its result to an int.  RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter compares it to an int, so using size_t can
result in comparison of different-signed integer warnings from clang.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:48 +01:00
David Gibson
f6b546c6e4 flow: Correct type of flowside_at_sidx()
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case.  PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it.  This removes a clang warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:44 +01:00
David Gibson
30b4f88167 arch: Avoid explicit access to 'environ'
We pass 'environ' to execve() in arch_avc2_exec(), so that we retain the
environment in the current process.  But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration.  But, we can avoid needing to reference it
explicitly by using execv() instead of execve().  This removes a clang
warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:29 +01:00
David Gibson
b78e72da0b clang: Move clang-tidy configuration from Makefile to .clang-tidy
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable).  Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well.  As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:19 +01:00
David Gibson
8346216c9a Makefile: Simplify exclusion of qrap from static checks
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing.  So, we currently exclude it using $(filter-out).  However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.

Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:07 +01:00
David Gibson
8f1b6a0ca6 clang: Add .clang-format file
I've been experimenting with clangd, but its default format style is
horrid.  Since our style is basically that of the Linux kernel, copy the
.clang-format from the kernel, minus reference to a bunch of kernel
specific macros.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:45:16 +01:00
David Gibson
5e93bcd8bf test: Adjust misplaced sleeps in two_guests code
Most of our transfer tests using socat use 'sleep' waaiting for the server
side to be ready before starting the client.  However in two_guests/basic
the sleep is in the wrong place: rather than being between starting the
server and starting the client, it's after waiting for the server to
complete.  This causes occasional hangs when the client runs before the
server is ready - in that case the receiving guest sends an RST, which we
don't (currently) propagate back to the sender.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-05 23:46:38 +01:00
Stefano Brivio
9afce0b45c tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:

tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
 1291 |         rc = ioctl(fd, TUNSETIFF, &ifr);
      |                        ^~~~~~~~~

We don't care about that overflow, so explicitly cast TUNSETIFF to
int.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:33 +01:00
Stefano Brivio
d165d36a0c tcp: Fix build against musl, __sum16 comes from linux/types.h
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).

Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:24 +01:00
Stefano Brivio
ee7d0b62a7 util: Don't use errno after a successful call in __daemon()
I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:

/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
  446 |         if (pid == -1) {
      |             ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
  446 |         if (pid == -1) {
      |         ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
  451 |         if (pid) {
      |             ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
  451 |         if (pid) {
      |         ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
  463 |         close(devnull_fd);
      |         ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~

And the LLVM documentation for the unix.Errno checker, 1.1.8.3
unix.Errno (C), mentions, at:

  https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno

that:

  The C and POSIX standards often do not define if a standard library
  function may change value of errno if the call does not fail.
  Therefore, errno should only be used if it is known from the return
  value of a function that the call has failed.

which is, somewhat surprisingly, the case for close().

Instead of using errno, check the actual return values of the calls
we issue here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
b1a607fba1 udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx
/home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors]
  171 | enum udp_iov_idx {
      | ^
  172 |         UDP_IOV_TAP     = 0,
  173 |         UDP_IOV_ETH     = 1,
  174 |         UDP_IOV_IP      = 2,
  175 |         UDP_IOV_PAYLOAD = 3,
  176 |         UDP_NUM_IOVS
      |
      |                      = 4

Don't initialise any value, so that it's obvious that constants map to
unique values.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
099ace64ce treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.

As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.

For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
59fe34ee36 treewide: Suppress clang-tidy warning if we already use O_CLOEXEC
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.

In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).

The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.

Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
134b4d58b4 Makefile: Disable readability-math-missing-parentheses clang-tidy check
With clang-tidy and LLVM 19:

/home/sbrivio/passt/conf.c:1218:29: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
 1218 |                 const char *octet = str + 3 * i;
      |                                           ^~~~~~
      |                                           (    )
/home/sbrivio/passt/ndp.c:285:18: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  285 |                                         .len            = 1 + 2 * n,
      |                                                               ^~~~~~
      |                                                               (    )
/home/sbrivio/passt/ndp.c:329:23: error: '%' has higher precedence than '-'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  329 |                         memset(ptr, 0, 8 - dns_s_len % 8);      /* padding */
      |                                            ^~~~~~~~~~~~~~
      |                                            (            )
/home/sbrivio/passt/pcap.c:131:20: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  131 |                 pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
      |                                  ^~~~~~~~~~~~~~~~
      |                                  (              )
/home/sbrivio/passt/util.c:216:10: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  216 |                 return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                            )
/home/sbrivio/passt/util.c:217:10: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  217 |                        (a->tv_sec - b->tv_sec - 1) * 1000000;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                    )
/home/sbrivio/passt/util.c:220:9: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  220 |         return (a->tv_nsec - b->tv_nsec) / 1000 +
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                               )
/home/sbrivio/passt/util.c:221:9: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  221 |                (a->tv_sec - b->tv_sec) * 1000000;
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                                )
/home/sbrivio/passt/util.c:545:32: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  545 |         return clone(fn, stack_area + stack_size / 2, flags, arg);
      |                                       ^~~~~~~~~~~~~~~
      |                                       (             )

Just... no.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
744247856d treewide: Silence cert-err33-c clang-tidy warnings for fprintf()
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.

Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
98efe7c2fd treewide: Comply with CERT C rule ERR33-C for snprintf()
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):

/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
   43 |                 snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  577 |                         snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  579 |                                 snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  580 |                                          pidval);
      |                                          ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  105 |         snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  242 |         snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  243 |         snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
 1155 |                         snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning

Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:25 +01:00
Stefano Brivio
988a4d75f8 Makefile: Exclude qrap.c from clang-tidy checks
We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as
of LLVM versions 16 and later would need a bunch of changes there to
be addressed, mostly around CERT C rule ERR33-C and checking return
code from snprintf().

It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 08:21:19 +01:00
Jon Maloy
ba38e67cf4 tcp: unify l2 TCPv4 and TCPv6 queues and structures
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.

Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.

Tests with iperf3 shows no measurable change in performance after this
change.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:44:08 +01:00
Jon Maloy
2053c36dec tcp: set ip and eth headers in l2 tap queues on the fly
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through its whole life time. The only
fields we need to update per message are things like payload size
and checksums.

If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.

The same needs to be done for the ethernet header.

We do these changes here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:43:24 +01:00
Laurent Vivier
5563d5f668 test: remove obsolete images
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:30:06 +02:00
Laurent Vivier
f43f7d5e89 tcp: cleanup tcp_buf_data_from_sock()
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can user directly len.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:51 +02:00
David Gibson
e7fcd0c348 tcp: Use runtime tests for TCP_INFO fields
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in structure returned by the runtime kernel.  We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.

However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel.  We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd.  Switch from the compile time flags to a runtime
test.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:46 +02:00
David Gibson
81143813a6 tcp: Generalise probing for tcpi_snd_wnd field
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).

In fact we should check that for for every tcp_info field we want to use,
beyond the very old ones shared with BSD.  Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure.  We can then use that recorded
value to check for the presence of any field we need.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:27:17 +02:00
David Gibson
13f0291ede tcp: Remove compile-time dependency on struct tcp_info version
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info.  These defines are used for two
purposes, neither of which they accomplish well:

1) Determining if the tcp_info fields are available at runtime.  For this
   purpose the defines are Just Plain Wrong, since the runtime kernel may
   not be the same as the compile time kernel. We corrected this for
   tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt

2) Allowing the source to compile against older kernel headers which don't
   have the fields in question.  This works in theory, but it does mean
   we won't be able to use the fields, even if later run against a
   newer kernel.  Furthermore, it's quite fragile: without much more
   thorough tests of builds in different environments that we're currently
   set up for, it's very easy to miss cases where we're accessing a field
   without protection from an #ifdef.  For example we currently access
   tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().

Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems).  Don't compile against linux/tcp.h, using
netinet/tcp.h instead.  Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in.  That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.

This allows us to remove the HAS_SND_WND define entirely.  We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:26:48 +02:00
Stefano Brivio
9e4615b40b tcp_splice: fcntl(2) returns the size of the pipe, if F_SETPIPE_SZ succeeds
Don't report bogus failures (with --trace) just because the return
value is not zero.

Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:30 +02:00
Stefano Brivio
149f457b23 tcp_splice: splice() all we have to the writing side, not what we just read
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97547 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.

Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:18 +02:00
David Gibson
9e5df350d6 tcp: Use structures to construct initial TCP options
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte.  However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).

Rework this to use a structure and initialisers which make it a bit
clearer what's going on.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by; Stefano Brivio <sbrivio@redhat.com>
2024-10-21 18:51:04 +02:00
David Gibson
b4dace8f46 fwd: Direct inbound spliced forwards to the guest's external address
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing.  This gives us a very large performance improvement when it's
possible.

Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address.  However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
forwarding exposes those services to the host, which isn't generally what
we want.

Correct this by instead forwarding inbound "splice" flows to the guest's
external address.

Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:03 +02:00
David Gibson
58e6d68599 test: Clarify test for spliced inbound transfers
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1.  This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace.  That's not an inherent property for "spliced" connections
and arguably an undesirable one.  Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.

Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.

Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:00 +02:00
David Gibson
1fa421192c passt.1: Clarify and update "Handling of local addresses" section
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour.  Update it accordingly.

It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host.  Use "loopback
address" to be more precise about this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:57 +02:00
David Gibson
ef8a5161d0 passt.1: Mark --stderr as deprecated more prominently
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label.  Add
one to make it easier to spot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:54 +02:00
David Gibson
53176ca91d test: Wait for DAD on DHCPv6 addresses
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable.  That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one), however it's not true with the Debian
dhclient-script.  The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes.  That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.

We usually get away with this, because by the time we do anything where the
address matters, DAD has completed.  However, it's not robust, so we should
explicitly wait for DAD to complete when we get an DHCPv6 address.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:51 +02:00
David Gibson
75b9c0feb0 test: Explicitly wait for DAD to complete on SLAAC addresses
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready.  In several places we have an explicit 'sleep 2' to wait for that
to complete.

Fixed length delays are never a great idea, although this one is pretty
solid.  Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.

Replace the fixed sleeps with a loop waiting for DAD to complete.  We do
this by looping waiting for all tentative addresses to disappear.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:47 +02:00
David Gibson
f9d677bff6 arp: Fix a handful of small warts
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
 * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
   example.
 * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
   sizeof(am->tip) (same value, but makes more logical sense)
 * Described the guest's assigned address as such, rather than as "our
   address" - that's not usually what we mean by "our address" these days
 * Remove "we might have the same IP address" comment which I can't make
   sense of in context (possibly it's relating to the statement below,
   which already has its own comment?)

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:04 +02:00