We pass -I options to cppcheck so that it will find the system headers.
Then we need to pass a bunch more options to suppress the zillions of
cppcheck errors found in those headers.
It turns out, however, that it's not recommended to give the system headers
to cppcheck anyway. Instead it has built-in knowledge of the ANSI libc and
uses that as the basis of its checks. We do need to suppress
missingIncludeSystem warnings instead though.
Not bothering with the system headers makes the cppcheck runtime go from
~37s to ~14s on my machine, which is a pretty nice win.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of
close_range() which is a (successful) no-op. This is broken in several
ways:
* It doesn't actually fix compile if using old kernel headers, because
the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE
unprotected by ifdefs
* Even if it did fix the compile, it means inconsistent behaviour between
a compile time failure to find the value (we silently don't close files)
and a runtime failure (we die with an error from close_range())
* Silently not closing the files we intend to close for security reasons
is probably not a good idea in any case
We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't
available, because that would require running on kernel >= 5.9. On the
other hand there's not really any other way to flush all possible fds
leaked by the parent (close() in a loop takes over a minute). So in this
case print a warning and carry on.
As a bonus, this fixes a cppcheck error I see with some different options I'm
looking to apply in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions. Move this to linux_dep.h which handles several
other similar cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt
to use it if not defined. But even if the value is defined at compile
time, it might not be available in the runtime kernel, so we need to check
for errors from a fallocate() call and fall back to other methods.
Simplify this to only need the runtime check by using linux_dep.h to define
FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I have no idea why, but these are reported by clang-tidy (19.1.2) on
Alpine (x86) only:
/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1139 | int fd = socket(AF_UNIX, SOCK_STREAM, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1158 | ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1413 | s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
188 | if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
| ^
| | SOCK_CLOEXEC
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:
/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
314 | nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
| ^
We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.
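As a rough sketch of how the two suppressions can coexist: the interval
values and macro names below are hypothetical stand-ins, and the cppcheck
suppression identifier is an assumption, not taken from this commit.

```c
/* Hypothetical stand-ins: both intervals happen to be equal, so the
 * ternary hidden in the macro has identical branches.
 */
#define EXAMPLE_TCP_INTERVAL	1000
#define EXAMPLE_UDP_INTERVAL	1000
#define EXAMPLE_MIN(a, b)	(((a) < (b)) ? (a) : (b))
#define EXAMPLE_TIMER_INTERVAL						\
	EXAMPLE_MIN(EXAMPLE_TCP_INTERVAL, EXAMPLE_UDP_INTERVAL)

static int example_timeout(void)
{
	int timeout;

	/* The line right before the statement is taken by the cppcheck
	 * suppression (identifier assumed), so clang-tidy gets a
	 * BEGIN/END pair around both instead of a NOLINTNEXTLINE.
	 */
	/* NOLINTBEGIN(bugprone-branch-clone): intervals can be the same */
	/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
	timeout = EXAMPLE_TIMER_INTERVAL;
	/* NOLINTEND(bugprone-branch-clone) */

	return timeout;
}
```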
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On 32-bit architectures, clang-tidy reports:
/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
728 | if (v >= SNDBUF_BIG)
| ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
158 | #define SNDBUF_BIG (4UL * 1024 * 1024)
| ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
728 | if (v >= SNDBUF_BIG)
| ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
158 | #define SNDBUF_BIG (4UL * 1024 * 1024)
| ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
728 | if (v >= SNDBUF_BIG)
| ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
158 | #define SNDBUF_BIG (4UL * 1024 * 1024)
| ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
730 | else if (v > SNDBUF_SMALL)
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
730 | else if (v > SNDBUF_SMALL)
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
730 | else if (v > SNDBUF_SMALL)
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
731 | v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
731 | v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
731 | v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
| ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
159 | #define SNDBUF_SMALL (128UL * 1024)
| ^~~~~
because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.
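A minimal sketch of the resulting definitions follows. The helper name is
hypothetical, and the halving in the first branch is an assumption: the
warnings above only show the conditions and the second branch.

```c
#include <stdint.h>

/* Thresholds as unsigned long long, so that the multiplications are
 * performed in a 64-bit type on 32-bit targets as well
 */
#define SNDBUF_BIG	(4ULL * 1024 * 1024)
#define SNDBUF_SMALL	(128ULL * 1024)

/* Hypothetical helper scaling a send buffer size reported by the kernel */
static uint64_t example_sndbuf_scale(uint64_t v)
{
	if (v >= SNDBUF_BIG)
		v /= 2;		/* Assumed: not shown in the warnings */
	else if (v > SNDBUF_SMALL)
		v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;

	return v;
}
```

With these definitions, every operand in the comparisons and in the
interpolation is 64 bits wide, regardless of the width of unsigned long.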
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.
This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:
/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
1087 | for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
| ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.
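For illustration only, with hypothetical buffer and MTU sizes (the real
values of TAP_BUF_BYTES and ETH_MAX_MTU are not shown here), the loop bound
could look like this:

```c
#include <sys/types.h>

/* Hypothetical sizes, standing in for TAP_BUF_BYTES and ETH_MAX_MTU */
#define EXAMPLE_BUF_BYTES	4096U
#define EXAMPLE_MAX_MTU		1500U

/* Count how many frames of 'len' bytes we would start reading before the
 * buffer can no longer hold a full-sized frame
 */
static int example_count_frames(ssize_t len)
{
	int frames = 0;
	ssize_t n;

	if (len <= 0)
		return 0;

	/* Cast the whole (known positive) difference, not just one operand:
	 * both sides of the comparison are now ssize_t on any architecture
	 */
	for (n = 0; n <= (ssize_t)(EXAMPLE_BUF_BYTES - EXAMPLE_MAX_MTU);
	     n += len)
		frames++;

	return frames;
}
```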
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
cppcheck 2.14.2 on Alpine reports:
dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer]
struct opt_hdr *ia, *bad_ia, *client_id;
^
It's not only 'client_id': we can declare 'ia' as a pointer to const too.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
cppcheck 2.16.0 reports:
dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
if (ia_type == OPT_IA_NA) {
^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
ia_type = OPT_IA_NA;
^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
if (ia_type == OPT_IA_NA) {
^
this is not really the case as we set ia_type to OPT_IA_TA and then
jump back.
Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.
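One possible shape for such a macro is sketched below. The definition is
an assumption, not necessarily the one added here. The value 3 for
OPT_IA_NA comes from the cppcheck output above, and 4 for OPT_IA_TA is the
option code from RFC 8415.

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Assumed definition: walk 'item' over all the elements of 'array',
 * which must be a real array, not a pointer
 */
#define foreach(item, array)						\
	for ((item) = (array);						\
	     (item) - (array) < (ptrdiff_t)ARRAY_SIZE(array);		\
	     (item)++)

static int example_count_ia_types(void)
{
	static const int ia_types[] = { 3 /* IA_NA */, 4 /* IA_TA */ };
	const int *type;
	int visited = 0;

	/* Replaces "set the type, goto the top, set the other type" */
	foreach(type, ia_types)
		visited += (*type == 3 || *type == 4);

	return visited;
}
```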
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In the NDP tests we search explicitly for a guest address with prefix
length 64. AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means. We can do
that more explicitly by checking for .protocol == "kernel_ra", however.
The SLAAC prefixes we assign *will* always be 64-bit, that's hard-coded
into our NDP implementation. RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64 bits.
Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length. There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently nstool die()s on essentially any error. In most cases that's
fine for our purposes. However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket. This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill off the namespace we're holding.
Adjust these to print an error, close() the control client socket and
carry on. In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.
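The behaviour can be sketched as follows. The function names are
hypothetical, and a socketpair() stands in for the accept()ed control
socket: with SIGPIPE ignored, writing to a client that went away returns
an error instead of killing us.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reply to one control client. On failure, report
 * it and drop this client only, instead of terminating.
 */
static int example_client_reply(int fd, const void *buf, size_t len)
{
	ssize_t rc = write(fd, buf, len);

	if (rc < 0)
		perror("control client write");
	else if ((size_t)rc < len)
		fprintf(stderr, "short write to control client\n");
	else
		return 0;

	close(fd);
	return -1;
}

/* Simulate a client aborting before we reply: returns -1 if the error
 * was handled gracefully, -2 if the setup itself failed
 */
static int example_aborted_client(void)
{
	int sv[2];

	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
		return -2;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -2;

	close(sv[1]);	/* The "client" goes away */

	return example_client_reply(sv[0], "x", 1);
}
```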
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes. The signal handler which
accomplishes this is called simply sig_handler(). However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler(). However, we dereference uflow before that in
an initializer, rather defeating the point. Rearrange to avoid that.
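The pattern, reduced to a sketch with a hypothetical structure and the
standard assert() in place of our ASSERT():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the flow structure */
struct example_flow {
	int s;
};

static int example_flow_sock(const struct example_flow *uflow)
{
	int s;	/* Not: int s = uflow->s; that dereferences too early */

	assert(uflow);

	s = uflow->s;	/* Only dereference once the check has passed */

	return s;
}
```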
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We don't modify this structure at all. For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers. There are several other cases where we can remove similar
compile time dependencies on kernel version. Prepare for that by renaming
tcp_info.h to linux_dep.h.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral(). This is because
NUM_PORTS evaluates to an unsigned integer. It's a fixed value, though,
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except... we don't have any users for it any more, so remove it.
make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clangd's default configuration seems to try to treat .h files as C++ not
C. There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads. But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.
Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions. We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile. However, this is only used in qrap.c, not anywhere else in
passt or pasta. Only supply this -D when compiling qrap specifically.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D. The only place that uses it, though, is the
BPF filter generated by seccomp.sh. seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define. Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D. But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile. Just move it to a plain #define in conf.c.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t. However, it doesn't - it explicitly casts
its result to an int. RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter compares it to an int, so using size_t can
result in comparison of different-signed integer warnings from clang.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case. PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it. This removes a clang warning.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We pass 'environ' to execve() in arch_avx2_exec(), so that we retain the
environment in the current process. But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration. But, we can avoid needing to reference it
explicitly by using execv() instead of execve(). This removes a clang
warning.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable). Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well. As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing. So, we currently exclude it using $(filter-out). However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.
Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I've been experimenting with clangd, but its default format style is
horrid. Since our style is basically that of the Linux kernel, copy the
.clang-format from the kernel, minus reference to a bunch of kernel
specific macros.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Most of our transfer tests using socat use 'sleep' waiting for the server
side to be ready before starting the client. However in two_guests/basic
the sleep is in the wrong place: rather than being between starting the
server and starting the client, it's after waiting for the server to
complete. This causes occasional hangs when the client runs before the
server is ready - in that case the receiving guest sends an RST, which we
don't (currently) propagate back to the sender.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:
tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
1291 | rc = ioctl(fd, TUNSETIFF, &ifr);
| ^~~~~~~~~
We don't care about that overflow, so explicitly cast TUNSETIFF to
int.
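To illustrate the arithmetic, using the ppc64le value quoted in the
warning: the out-of-range conversion is implementation-defined in C, and
this sketch assumes the modulo 2^32 behaviour that gcc documents. The
names are hypothetical.

```c
/* Value of TUNSETIFF on ppc64le, as quoted in the gcc warning */
#define EXAMPLE_TUNSETIFF_PPC64LE	2147767498UL

static int example_ioctl_request(void)
{
	/* An explicit cast states that the wrap-around is intended: the bit
	 * pattern, which is what the kernel looks at, stays the same
	 */
	return (int)EXAMPLE_TUNSETIFF_PPC64LE;
}
```

The result is 2147767498 - 2^32 = -2147199798, matching the value gcc
reports.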
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).
Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:
/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
465 | if (errno)
| ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
38 | # define errno (*__errno_location ())
| ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
446 | if (pid == -1) {
| ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
446 | if (pid == -1) {
| ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
451 | if (pid) {
| ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
451 | if (pid) {
| ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
463 | close(devnull_fd);
| ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
465 | if (errno)
| ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
38 | # define errno (*__errno_location ())
| ^~~~~~~~~~~~~~~~~~~~~~
And the LLVM documentation for the unix.Errno checker, 1.1.8.3
unix.Errno (C), mentions, at:
https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno
that:
The C and POSIX standards often do not define if a standard library
function may change value of errno if the call does not fail.
Therefore, errno should only be used if it is known from the return
value of a function that the call has failed.
which is, somewhat surprisingly, the case for close().
Instead of using errno, check the actual return values of the calls
we issue here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.
As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.
For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.
In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).
The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.
Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.
Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.
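Such a macro can be as small as the sketch below. The exact definition is
an assumption, and the caller is a hypothetical example.

```c
#include <stdio.h>

/* Explicitly discard the return value of fprintf() */
#define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)

/* Hypothetical caller: print a usage hint and return an exit status */
static int example_usage(FILE *f, const char *name)
{
	FPRINTF(f, "Usage: %s [OPTION]...\n", name);

	return 1;
}
```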
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):
/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
43 | snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
577 | snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
579 | snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 | pidval);
| ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
105 | snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
242 | snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
243 | snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
1155 | snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning
Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
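The check itself can be sketched like this. The signature and return
convention are assumptions, and terminating on failure is left to the
caller in this version.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed convention: return 0 if the whole formatted string fit in
 * 'str', -1 on truncation or on an output error
 */
static int snprintf_check(char *str, size_t size, const char *format, ...)
{
	va_list ap;
	int rc;

	va_start(ap, format);
	rc = vsnprintf(str, size, format, ap);
	va_end(ap);

	/* vsnprintf() returns the length it would have written, excluding
	 * the terminating NUL: anything >= size means truncation
	 */
	if (rc < 0 || (size_t)rc >= size)
		return -1;

	return 0;
}
```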
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as
of LLVM versions 16 and later would need a bunch of changes there to
be addressed, mostly around CERT C rule ERR33-C and checking return
code from snprintf().
It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.
Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.
Tests with iperf3 show no measurable change in performance after this
change.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
l2 tap queue entries are currently initialized at system start, and
reused with preset headers throughout their whole lifetime. The only
fields we need to update per message are things like payload size
and checksums.
If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.
The same needs to be done for the ethernet header.
We do these changes here.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can use len directly.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in the structure returned by the runtime kernel. We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.
However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel. We instead should check against
the size of the structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd. Switch from the compile time flags to a runtime
test.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).
In fact we should check that for every tcp_info field we want to use,
beyond the very old ones shared with BSD. Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure. We can then use that recorded
value to check for the presence of any field we need.
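The check against the recorded size can be sketched with offsetof(). The
structure below is a shortened, hypothetical stand-in for struct tcp_info,
and the macro name is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for the kernel's struct tcp_info */
struct example_tcp_info {
	uint8_t		state;
	uint8_t		pad[3];
	uint32_t	rtt;
	uint64_t	bytes_acked;
	uint32_t	min_rtt;
	uint32_t	snd_wnd;
};

/* A field is usable only if the length the kernel reported for the
 * structure covers it entirely
 */
#define EXAMPLE_INFO_HAS(len, field)					\
	((len) >= offsetof(struct example_tcp_info, field) +		\
		  sizeof(((struct example_tcp_info *)0)->field))

/* 'len' would be the length written back by getsockopt() in a probe at
 * start-up, recorded once and reused for every connection
 */
static int example_can_use_min_rtt(size_t len)
{
	return EXAMPLE_INFO_HAS(len, min_rtt);
}
```

A kernel returning only the first 16 bytes of this layout would provide
bytes_acked, but neither min_rtt nor snd_wnd.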
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info. These defines are used for two
purposes, neither of which they accomplish well:
1) Determining if the tcp_info fields are available at runtime. For this
purpose the defines are Just Plain Wrong, since the runtime kernel may
not be the same as the compile time kernel. We corrected this for
tcpi_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt
2) Allowing the source to compile against older kernel headers which don't
have the fields in question. This works in theory, but it does mean
we won't be able to use the fields, even if later run against a
newer kernel. Furthermore, it's quite fragile: without much more
thorough tests of builds in different environments than we're currently
set up for, it's very easy to miss cases where we're accessing a field
without protection from an #ifdef. For example we currently access
tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().
Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems). Don't compile against linux/tcp.h, using
netinet/tcp.h instead. Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in. That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.
This allows us to remove the HAS_SND_WND define entirely. We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Don't report bogus failures (with --trace) just because the return
value is not zero.
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.
However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 98304 from read-side call
Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
Flow 0 (TCP connection (spliced)): -1 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
Flow 0 (TCP connection (spliced)): OUT_WAIT_0
we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 32768 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
we splice() 32768 more bytes from our receiving side to the pipe. At
some point:
pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
Flow 0 (TCP connection (spliced)): 1320 from read-side call
Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97457 bytes: we'll actually
send 1320 of those).
This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.
Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.
So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.
Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.
Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.
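
The point that the kernel already tracks how much data sits in the pipe can
be checked from user space. This is a minimal sketch, not taken from the
patch: pipe_pending() and pending_demo() are hypothetical helpers, the
FIONREAD ioctl stands in for the accounting splice() relies on, and the byte
counts are made up for illustration.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: bytes currently queued in the pipe, -1 on error */
static int pipe_pending(int pipe_rd)
{
	int n;

	if (ioctl(pipe_rd, FIONREAD, &n) < 0)
		return -1;

	return n;
}

/* Queue 3000 bytes, drain 1000, queue 500 more: report what is pending */
static int pending_demo(void)
{
	char buf[3000] = { 0 };
	int p[2], n;

	if (pipe(p))
		return -1;

	if (write(p[1], buf, 3000) != 3000 ||	/* earlier read-side call */
	    read(p[0], buf, 1000) != 1000 ||	/* partial write-side call */
	    write(p[1], buf, 500) != 500)	/* latest read-side call */
		n = -1;
	else
		n = pipe_pending(p[0]);

	close(p[0]);
	close(p[1]);

	return n;
}
```

In this sketch the latest read-side call moved only 500 bytes, yet 2500 are
pending: sizing the write on the last read alone would leave 2000 behind.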
Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte. However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).
Rework this to use a structure and initialisers which make it a bit
clearer what's going on.
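
As an illustration of the structure-plus-initialiser style, here is a
minimal sketch. The struct layout, the names (syn_opts, syn_opts_init) and
the choice of MSS plus window scaling options are assumptions for the
example, not necessarily what the patch defines.

```c
#include <arpa/inet.h>
#include <stdint.h>

#define OPT_NOP	1	/* TCP option kinds, as assigned by IANA */
#define OPT_MSS	2
#define OPT_WS	3

/* Hypothetical layout: MSS option, one NOP for padding, window scale */
struct syn_opts {
	struct {
		uint8_t kind;
		uint8_t len;
		uint16_t value;		/* network byte order */
	} __attribute__((packed)) mss;
	uint8_t nop;
	struct {
		uint8_t kind;
		uint8_t len;
		uint8_t shift;
	} __attribute__((packed)) ws;
} __attribute__((packed));

static struct syn_opts syn_opts_init(uint16_t mss, uint8_t shift)
{
	struct syn_opts o = {
		.mss = { .kind = OPT_MSS, .len = 4, .value = htons(mss) },
		.nop = OPT_NOP,
		.ws  = { .kind = OPT_WS, .len = 3, .shift = shift },
	};

	return o;
}
```

The eight bytes produced are the same as a byte-by-byte construction would
give, but each field is named at the point where it is set.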
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1. This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace. That's not an inherent property for "spliced" connections
and arguably an undesirable one. Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.
Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.
Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour. Update it accordingly.
It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host. Use "loopback
address" to be more precise about this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add
one to make it easier to spot.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable. That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one), however it's not true with the Debian
dhclient-script. The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes. That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.
We usually get away with this, because by the time we do anything where the
address matters, DAD has completed. However, it's not robust, so we should
explicitly wait for DAD to complete when we get a DHCPv6 address.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready. In several places we have an explicit 'sleep 2' to wait for that
to complete.
Fixed length delays are never a great idea, although this one is pretty
solid. Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.
Replace the fixed sleeps with a loop waiting for DAD to complete. We do
this by looping waiting for all tentative addresses to disappear.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
* Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
example.
* When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
sizeof(am->tip) (same value, but makes more logical sense)
* Describe the guest's assigned address as such, rather than as "our
address" - that's not usually what we mean by "our address" these days
* Remove "we might have the same IP address" comment which I can't make
sense of in context (possibly it's relating to the statement below,
which already has its own comment?)
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.
We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.
However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.
This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude it's
actually causing problems.
Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests
with TRACE=1 will not actually enable tracing output.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...instead of echo: otherwise, bash won't handle escape sequences we
use to colour messages (and 'echo -e' is not specified by POSIX).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to. This is
controlled by the c->ip[46].dns_host variables. This is set to the
first nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.
Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.
Add a new --dns-host option to restore this functionality in a more
sensible way.
Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future. That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place. We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.
Add a --freebind command line argument to enable this socket option on
all listening sockets.
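
For reference, the per-socket mechanism looks roughly like this. It is a
minimal IPv4-only sketch: freebind_sock() is a hypothetical helper, not a
function from the patch, and 192.0.2.1 in the usage note is a documentation
address assumed not to be configured on the host.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP socket bound to addr:port with IP_FREEBIND set.
 * Returns the socket, or -1 on error.
 */
static int freebind_sock(const char *addr, uint16_t port)
{
	struct sockaddr_in sa;
	int s, one = 1;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &sa.sin_addr) != 1)
		return -1;

	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (s < 0)
		return -1;

	/* Without IP_FREEBIND, bind() to a non-local address would fail
	 * with EADDRNOTAVAIL (unless ip_nonlocal_bind is set).
	 */
	if (setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one)) ||
	    bind(s, (struct sockaddr *)&sa, sizeof(sa))) {
		close(s);
		return -1;
	}

	return s;
}
```

Calling freebind_sock("192.0.2.1", 0) is expected to succeed without any
capability, even though that address isn't configured locally.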
Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.
In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We provide also an offset to be able to provide an array that contains
the TCP frame embedded in a lower level frame, and this offset points
to the TCP header inside the iovec array.
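
The idea of checksumming data scattered over an iovec array, starting at an
offset, can be sketched as follows. csum_iov_sketch() is a hypothetical,
deliberately simple byte-at-a-time version written for this example: it
leaves out the pseudo-header and any optimisation, and returns the checksum
as a host-order value.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Internet checksum (RFC 1071) of all bytes in iov[] after the first
 * 'offset' bytes. Buffer boundaries may fall anywhere, including in the
 * middle of a 16-bit word.
 */
static uint16_t csum_iov_sketch(const struct iovec *iov, size_t n,
				size_t offset)
{
	uint64_t sum = 0;
	size_t pos = 0;		/* byte index within the checksummed data */
	size_t i, j;

	for (i = 0; i < n; i++) {
		const uint8_t *p = iov[i].iov_base;
		size_t len = iov[i].iov_len;

		if (offset >= len) {	/* buffer lies before the offset */
			offset -= len;
			continue;
		}
		p += offset;
		len -= offset;
		offset = 0;

		/* Even bytes are the high half of each 16-bit word */
		for (j = 0; j < len; j++, pos++)
			sum += (pos & 1) ? p[j] : (uint64_t)p[j] << 8;
	}

	while (sum >> 16)	/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Because the position parity is tracked across buffers, the result doesn't
depend on how the frame is split between them.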
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The offset allows any headers that are not part of the data
to checksum to be skipped.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The offset is passed directly to pcap_frame() and allows
any headers that are not part of the frame to
capture to be skipped.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).
Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is quite useful at least for myself as I'm usually running tests
using a guest kernel that's not the same as the one on the host.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We already have an inany_ntop() function to format inany addresses into
text. Add inany_pton() to parse them from text, and use it in
conf_ports().
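
A parser of this kind can be sketched as below. This is not the patch's
code: any_pton() is a hypothetical name, struct in6_addr stands in for the
inany type, and storing IPv4 addresses in IPv4-mapped form is an assumption
made for the example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Parse an IPv4 or IPv6 address from text. IPv4 addresses are stored as
 * IPv4-mapped IPv6 addresses (::ffff:a.b.c.d). Returns 1 on success, 0 if
 * 'src' is neither a valid IPv4 nor a valid IPv6 address.
 */
static int any_pton(const char *src, struct in6_addr *dst)
{
	struct in_addr a4;

	if (inet_pton(AF_INET, src, &a4) == 1) {
		memset(dst, 0, sizeof(*dst));
		dst->s6_addr[10] = dst->s6_addr[11] = 0xff;
		memcpy(&dst->s6_addr[12], &a4, sizeof(a4));
		return 1;
	}

	return inet_pton(AF_INET6, src, dst) == 1;
}
```

The caller doesn't need to know the address family in advance, which is
what makes this convenient for option parsing.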
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.
Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket address. This case doesn't yet arise, but
it might cause us trouble in future.
Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic. This only makes sense if there is no specific
address given. We verify this at runtime, but it would be nicer if we
could enforce it structurally.
For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems. Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.
The new function takes the address as an inany *, with NULL indicating the
dual stack case. This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible. However udp_sock_init()
attempts to do this in some cases that can't work. Specifically we can
only do this when listening on any address. That's never true for the
ns (splicing) case, because we always listen on loopback. For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.
This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path. But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We might not need to set the TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We might not need to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
write_remainder() steps through the buffers in an IO vector writing out
everything past a certain byte offset. However, on each iteration it
rescans the buffer from the beginning to find out where we're up to. With
an unfortunate set of write sizes this could lead to quadratic behaviour.
In an even less likely set of circumstances (total vector length > maximum
size_t) the 'skip' variable could overflow. This is one factor in a
longstanding Coverity error we've seen (although I still can't figure out
the remainder of its complaint).
Rework write_remainder() to always work out our new position in the vector
relative to our old/current position, rather than starting from the
beginning each time. As a bonus this seems to fix the Coverity error.
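
The "advance from the current position" approach can be sketched like this.
write_remainder_sketch() is a hypothetical simplification written for this
example (no EINTR handling, for instance), not the reworked function itself.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write everything in iov[] past the first 'skip' bytes to fd.
 * Returns 0 on success, -1 on error.
 */
static int write_remainder_sketch(int fd, const struct iovec *iov,
				  size_t iovcnt, size_t skip)
{
	size_t i = 0;	/* first buffer not yet completely written */
	ssize_t rc;

	for (;;) {
		/* Step forward from where we are, never from buffer 0:
		 * 'skip' is always relative to the start of iov[i].
		 */
		while (i < iovcnt && skip >= iov[i].iov_len)
			skip -= iov[i++].iov_len;

		if (i == iovcnt)
			return 0;

		if (skip)	/* finish a partially written buffer */
			rc = write(fd, (char *)iov[i].iov_base + skip,
				   iov[i].iov_len - skip);
		else		/* the rest starts on a buffer boundary */
			rc = writev(fd, &iov[i], iovcnt - i);

		if (rc < 0)
			return -1;

		skip += rc;
	}
}
```

Each buffer is stepped over at most once, so the total work is linear in
the number of buffers, and 'skip' never exceeds what is left in the vector
plus one write's worth of progress.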
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
write(2) might not write all the data it is given. Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.
Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This parameter is already treated as a boolean internally. Make it a
'bool' type for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function has a block conditional on !snd_wnd_cap shortly before an
snd_wnd_cap is statically false).
Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest. That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
Currently our probing for this is a bit odd. The HAS_SND_WND define
determines if our headers include the tcpi_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it. Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field. This is a bit hit and miss in two ways:
* Zero is a perfectly possible window the peer could report, so we can
get false negatives
* We're reading TCP_INFO into a local variable, which might not be zero
initialised, so if the kernel _doesn't_ write it it could have non-zero
garbage, giving us false positives.
We can use a more direct way of probing for this: getsockopt() reports the
length of the information retrieved. So, check whether that's long enough
to include the field. This lets us probe the availability of the field
once and for all during initialisation. That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.
We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header. Make them static.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame. However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy(). This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since commit eedc81b6ef ("fwd, conf: Probe host's ephemeral ports"),
we might need to read from /proc/sys/net/ipv4/ip_local_port_range in
both passt and pasta.
While pasta was already allowed to open and write /proc/sys/net
entries, read access was missing in SELinux's type enforcement: add
that.
In passt, instead, this is the first time we need to access an entry
there: add everything we need.
Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full. However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer. If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
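
The loop condition can be sketched as below. This is an illustration under
assumptions, not the function itself: read_frames() is hypothetical, the
sizes are tiny made-up values, and recv() on a datagram socket stands in for
read() on the tap device (both discard whatever doesn't fit in the buffer).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_FRAME	100	/* hypothetical maximum frame size */
#define BUF_SIZE	250	/* hypothetical buffer size */

static char frame_buf[BUF_SIZE];

/* Read whole frames from fd into frame_buf, stopping as soon as there is
 * no longer room for a maximum size frame. Returns the number of frames
 * read, or -1 on error.
 */
static int read_frames(int fd)
{
	size_t used = 0;
	int count = 0;

	while (used + MAX_FRAME <= sizeof(frame_buf)) {
		ssize_t n = recv(fd, frame_buf + used,
				 sizeof(frame_buf) - used, MSG_DONTWAIT);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* retry right away */
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				break;		/* no more data for now */
			return -1;
		}
		if (n == 0)
			break;

		used += n;
		count++;
	}

	return count;
}
```

With three 100-byte frames queued, the first call takes two of them and
leaves the third intact for the next call, rather than truncating it into
the last 50 bytes of the buffer.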
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around. This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass. This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.
Along the way handle a couple of extra edge cases:
- Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
ports where those might not have the same value
- Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all
three cases it returns from the function. That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again. For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewed EPOLLIN event immediately in
that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN. We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the nanoseconds of the minuend timestamp are less than the
nanoseconds of the subtrahend timestamp, we need to carry one second
in the subtraction.
I subtracted this second from the minuend, but didn't actually carry
it in the subtraction of nanoseconds, and logged timestamps would jump
back whenever we switched to the first branch of timespec_diff_us()
from the second one.
Most likely, the reason why I didn't carry the second is that I
instinctively thought that swapping the operands would have the same
effect. But it doesn't, in general: that only happens with arithmetic
in modulo powers of 2. Undo the swap as well.
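
A subtraction with the carry done properly looks roughly like this. It is a
sketch of the technique, not a copy of the fixed function: the name
timespec_diff_us_sketch() and the long long return type are choices made for
this example.

```c
#include <time.h>

/* Microseconds from b to a (a - b), assuming a is not earlier than b */
static long long timespec_diff_us_sketch(const struct timespec *a,
					 const struct timespec *b)
{
	if (a->tv_nsec < b->tv_nsec) {
		/* Borrow one second from the minuend and actually add it,
		 * as 1000000000 ns, to the nanoseconds being subtracted
		 * from.
		 */
		return (a->tv_nsec + 1000000000LL - b->tv_nsec) / 1000 +
		       (a->tv_sec - b->tv_sec - 1) * 1000000LL;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       (a->tv_sec - b->tv_sec) * 1000000LL;
}
```

For example, 2.1s minus 1.9s takes the first branch: (100000000 +
1000000000 - 900000000) / 1000 = 200000us, plus zero whole seconds.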
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.
This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.
That said, in the two places this shows up, it's pretty easy to work
around. The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.
So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).
While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.
On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.
At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.
Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.
In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the connection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.
Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_errs() reads out everything in the socket error queue. However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.
One possibility is that the error is reported instead by the SO_ERROR
sockopt. Check for that case and report it as best we can. If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.
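
Reading the pending error with SO_ERROR can be sketched as below.
sock_pending_error() is a hypothetical helper for illustration, and its
return convention is an assumption of this example.

```c
#include <errno.h>
#include <sys/socket.h>

/* Fetch (and clear) the pending error on socket s.
 * Returns 0 if none is pending, a positive errno value if one is, or a
 * negative errno value if getsockopt() itself failed.
 */
static int sock_pending_error(int s)
{
	int err;
	socklen_t sl = sizeof(err);

	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &sl) < 0)
		return -errno;

	return err;
}
```

A zero return while EPOLLERR is still being reported corresponds to the
"no visible error" case above, where there is nothing left to clear.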
[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010
Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can get network errors, usually transient, reported via the socket error
queue. However, at least theoretically, we could get errors trying to
read the queue itself. Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding. That made sense initially, since
both listening and reply sockets need to do this. However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.
So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors". We don't
have any of these so far, but some cases we're working on might require it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging. We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions. Add some
helper functions and macros to make it easy to do that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Unlike TCP, UDP has no in-band signalling for the end of a flow. So the
only way we remove flows is on a timer if they have no activity for 180s.
However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow. We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash). However it will still wait 3 minutes
to clear away the stale entry.
Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up. We also publish udp_flow_close() so it
can be called from other places to abort UDP flows.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions. Usually
that's implemented by flow_hash_probe(). However, due to some details we
need a second implementation in flowside_lookup(). An embarrassing
oversight in rebasing from earlier versions has meant that version is
incorrect, trying to step forward through clusters rather than backward.
In situations with the right sorts of hash near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state. If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.
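As an illustration only, here is a toy table with backward linear
probing. TABLE_SIZE, toy_insert() and toy_lookup() are hypothetical names,
not the flow table's own code. The point is that insertion and lookup
must step in the same direction:

```c
#include <stddef.h>

#define TABLE_SIZE	8	/* hypothetical, tiny for illustration */
#define EMPTY		(-1)

/* Step one slot backwards, wrapping around at the start of the table */
size_t probe_prev(size_t b)
{
	return b ? b - 1 : TABLE_SIZE - 1;
}

/* Insert: start at the hash bucket, step backwards to the first free slot.
 * Assumes the table is never completely full.
 */
size_t toy_insert(int *table, size_t hash, int key)
{
	size_t b = hash % TABLE_SIZE;

	while (table[b] != EMPTY)
		b = probe_prev(b);

	table[b] = key;
	return b;
}

/* Lookup: walk the cluster in the same direction as insertion, otherwise
 * entries displaced by a near collision are never found.
 */
int toy_lookup(const int *table, size_t hash, int key)
{
	size_t b = hash % TABLE_SIZE;

	while (table[b] != EMPTY) {
		if (table[b] == key)
			return (int)b;
		b = probe_prev(b);
	}

	return -1;
}
```

A lookup stepping forward from the same bucket would hit an empty slot,
or an unrelated cluster, and report the entry as missing.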
Fixes: acca4235c4 ("flow, tcp: Generalise TCP hash table to general flow hash table")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In fecb1b65b1 ("log: Don't prefix message with timestamp on --debug
if it's a continuation"), I fixed this for --debug on standard error,
but not for log files: if messages are continuations, they shouldn't
be prefixed by timestamp and severity.
Otherwise, we'll print stuff like this:
0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On some systems source fortification is enabled whenever code
optimization is enabled (e.g. with -O2). Since code fortification
is explicitly enabled too (with possibly different value than the
system wants, there are three levels [1]), distros are required
to patch our Makefile, e.g. [2].
Detect whether fortification is already enabled, and enable it
explicitly only if really needed.
1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html
2: edfeb8763a
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouldn't leave local
ports to use for outgoing connections. Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.
Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535. However, that's not the range used by default on
Linux, which is 32768-60999 and configurable with the
net.ipv4.ip_local_port_range sysctl.
We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway. So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.
Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails. This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage. Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.
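A rough sketch of such a probe, assuming the sysctl is read through
/proc/sys/net/ipv4/ip_local_port_range. struct port_range,
parse_ephemeral() and probe_ephemeral() are hypothetical names used for
illustration, not the functions added by this change:

```c
#include <stdio.h>

#define RFC6335_EPHEMERAL_MIN	49152
#define RFC6335_EPHEMERAL_MAX	65535

/* Hypothetical container for an inclusive port range */
struct port_range {
	unsigned min, max;
};

/* Parse "MIN MAX" as exposed by the sysctl; on anything unexpected, keep
 * the RFC 6335 range as fallback.
 */
struct port_range parse_ephemeral(const char *s)
{
	struct port_range r = { RFC6335_EPHEMERAL_MIN, RFC6335_EPHEMERAL_MAX };
	unsigned min, max;

	if (s && sscanf(s, "%u %u", &min, &max) == 2 &&
	    min >= 1 && min <= max && max <= 65535) {
		r.min = min;
		r.max = max;
	}

	return r;
}

/* Read the range from 'path'; an unreadable file yields the fallback */
struct port_range probe_ephemeral(const char *path)
{
	char buf[64] = "";
	FILE *f = fopen(path, "r");

	if (f) {
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);
	}

	return parse_ephemeral(buf);
}
```

A caller would pass "/proc/sys/net/ipv4/ip_local_port_range" as path, and
gets the RFC 6335 range back if the file is missing or malformed.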
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When using -t all, -u all or exclude-only ranges, we'll attempt to forward
all non-ephemeral port numbers, including port 0. However, this won't work
as intended: bind() treats a zero port not as literal port 0, but as
"pick a port for me". Because of the special meaning of port 0, we mostly
outright exclude it in our handling.
Do the same for setting up forwards, not attempting to forward for port 0.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams. Because of that,
they're generally not good choices for listening servers to bind to.
Therefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports. Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.
Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).
The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start-up, I don't think it really
matters.
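To illustrate the more flexible approach, here is a sketch that sets one
bit per port instead of filling whole bytes, so the ephemeral range needs
neither to sit at the top of the port space nor to start on a multiple of
8. struct eph_range, port_is_ephemeral() and map_non_ephemeral() are
hypothetical names, not the helper moved into fwd.[ch]:

```c
#include <stdint.h>
#include <string.h>

#define NUM_PORTS	65536

/* Hypothetical inclusive ephemeral range, any bounds allowed */
struct eph_range {
	unsigned min, max;
};

/* Return non-zero if 'port' falls within the ephemeral range */
int port_is_ephemeral(const struct eph_range *e, unsigned port)
{
	return port >= e->min && port <= e->max;
}

/* Fill a NUM_PORTS-bit map with one bit set per non-ephemeral port */
void map_non_ephemeral(uint8_t *map, const struct eph_range *e)
{
	unsigned port;

	memset(map, 0, NUM_PORTS / 8);

	for (port = 0; port < NUM_PORTS; port++) {
		if (!port_is_ephemeral(e, port))
			map[port / 8] |= 1U << (port % 8);
	}
}
```

The loop visits all 65536 ports once, which is the extra start-up cost
mentioned above.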
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Avoid excess lines on wide terminals, but make sure we don't fail if
we can't fetch the number of columns for any reason, as it's not a
fundamental feature and we don't want to break anything with it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections. By doing this we halve the number of
listening sockets we need (assuming passt/pasta is listening on the same
ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this
can significantly reduce the amount of kernel memory that passt consumes.
We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual
stack sockets for port forwarding when possible". Add similar support for
UDP "listening" sockets. Since UDP sockets don't use as much kernel memory
as TCP sockets this isn't as big a saving, but it's still significant.
When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all),
this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version
6.10.6 on Fedora 40, x86_64).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The 's' variable is always redundant with either 'r4' or 'r6', so remove
it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We've already gotten rid of most of the IPv4/IPv6 specific data structures
in udp.c by merging them with each other. One significant one remains:
udp[46]_mh_recv. This was a bit awkward to remove because of a subtle
interaction. We initialise the msg_namelen fields to represent the total
size we have for a socket address, but when we receive into the arrays
those are modified to the actual length of the sockaddr we received.
That meant that naively merging the arrays meant that if we received IPv4
datagrams, then IPv6 datagrams, the addresses for the latter would be
truncated. In this patch address that by resetting the received
msg_namelen as soon as we've found a flow for the datagram. Finding the
flow is the only thing that might use the actual sockaddr length, although
we in fact don't need it for the time being.
This also removes the last use of the 'v6' field from udp_listen_epoll_ref,
so remove that as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Some distributions already have OpenSSH 9.8, which introduces split
sshd/sshd-session binaries, and there we need to copy the binary from
the host, which can be /usr/libexec/openssh/sshd-session (Fedora
Rawhide), /usr/lib/ssh/sshd-session (Arch Linux),
/usr/lib/openssh/sshd-session (Debian), and possibly other paths.
Add at least those three, and, if we don't find sshd-session, assume
we don't need it: it could very well be an older version of OpenSSH,
as reported by David for Fedora 40, or perhaps another daemon (would
Dropbear even work? I'm not sure).
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: d6817b3930 ("test/passt.mbuto: Install sshd-session OpenSSH's split process")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Seen with krun: we get a file descriptor via --fd, but we close it and
happily use the same number for TCP files.
The issue is that if we also get other options before --fd, with
arguments, getopt_long() stops parsing them because it sees them as
non-option values.
Use the - modifier at the beginning of optstring (before :, which is
needed to avoid printing errors) instead of +, which means we'll
continue parsing after finding unrelated option values, but
getopt_long() won't reorder them anyway: they'll be passed with option
value '1', which we can ignore.
By the way, we also need to add : after F in the optstring, so that
we're able to parse the option when given as short name as well.
Now that we change the parsing mode between close_open_files() and
conf(), we need to reset optind to 0, not to 1, whenever we call
getopt_long() again in conf(), so that the internal initialisation
of getopt_long() evaluating GNU extensions is re-triggered.
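A minimal sketch of this parsing mode, assuming a GNU-compatible
getopt_long(). find_fd_option() is a hypothetical function written for
illustration, not close_open_files() itself:

```c
#include <getopt.h>
#include <stdlib.h>

/* Return the argument of --fd / -F as an integer, or -1 if not given.
 * Other options, and their values, are skipped rather than ending the scan.
 */
int find_fd_option(int argc, char **argv)
{
	const struct option opts[] = {
		{ "fd", required_argument, NULL, 'F' },
		{ 0 },
	};
	int name, fd = -1;

	optind = 0;	/* not 1: makes getopt_long() re-read the mode prefix */

	/* Leading '-': non-option values are returned in order, as option 1.
	 * Following ':': no error messages for unknown options.
	 */
	while ((name = getopt_long(argc, argv, "-:F:", opts, NULL)) != -1) {
		if (name == 'F')
			fd = atoi(optarg);
		/* 1, '?' and ':' are deliberately ignored here */
	}

	return fd;
}
```

With a leading + instead, the scan would stop at the first value that
doesn't look like an option, and a later --fd would never be seen.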
Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828
Fixes: baccfb95ce ("conf: Stop parsing options at first non-option argument")
Fixes: 09603cab28 ("passt, util: Close any open file that the parent might have leaked")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Mostly packages we now need to run Podman-based tests.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.
Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.
Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial)
close_range() implementation. This will probably be added to musl
soon, by the way:
https://www.openwall.com/lists/musl/2024/08/01/9
Add a weakly-aliased implementation, if it's supported by the kernel.
If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31
file descriptors calling close() on them is probably not a good idea.
Reported-by: lemmi <lemmi@nerd2nerd.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For some reason, this is reported only with musl, and older glibc
versions (2.31, at least).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Some architectures, including i686, actually have a recv() system
call, not just a recvfrom(), and we need to cover the recv() with
MSG_TRUNC into a NULL buffer for them as well.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
OpenSSH now ships a per-session binary, sshd-session, with sshd
acting as mere listener. It's typically not found in $PATH, so specify
the whole path at which it's commonly installed in $PROGS.
Link: https://www.openssh.com/releasenotes.html#9.8p1
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It's qemu-system-i386, but uname -m reports i686. I didn't test i486
and i586.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The guest is usually assigned one of the host's IP addresses. That means
it can't access the host itself via its usual address. The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host. However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.
Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
fwd_nat_from_host() needs to adjust the source address for new flows coming
from an address which is not accessible to the guest. Currently we always
use our_tap_addr or our_tap_ll. However in cases where the address is
accessible to the guest via translation (i.e. via --map-host-loopback) then
it makes more sense to use that translation, rather than the fallback
mapping of our_tap_*.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the host and guest share the same IP address with passt/pasta, it's
not possible for the guest to directly address the host. Therefore we
allow packets from the guest going to a special "NAT to host" address to be
redirected to the host, appearing there as though they have both source and
destination address of loopback.
Currently that special address is always the address of the default
gateway (or none). That can be a problem if we want that gateway to be
addressable by the guest. Therefore, allow the special "NAT to host"
address to be overridden on the command line with a new --map-host-loopback
option.
In order to exercise and test it, update the passt_in_ns and perf
tests to use this option and give different mapping addresses for the
two layers of the environment.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the TCP throughput tests, we adjust the guest's MTU in order to test
various packet sizes. Some of those are below 1280 which causes IPv6 to
be deconfigured on the guest interface. When we increase it above 1280
again, IPv6 is re-enabled and we get an address in the right prefix with
NDP, but we don't get exactly the expected address back - that's only
communicated with --config-net or DHCPv6.
With changes to how we handle NAT this can cause some of the IPv6 tests to
fail, because they don't use the address that passt/pasta expects, and the
guest doesn't initiate any traffic which allows us to learn what the new
address is.
Work around this by re-invoking dhclient -6 between adjusting the MTU and
running IPv6 test cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway. We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT. So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.
Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances. So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However
when --no-map-gw is specified we leave them unspecified instead. This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When sending frames to the guest over the tap link, we need a source MAC
address. Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
* We can't preserve the real MAC address of packets from anywhere
external so there's no transparency case here
* In fact, it's confusingly different from how we handle IP addresses:
whereas we give the guest the same IP as the host, we're making the
host's MAC the one MAC that the guest *can't* use for itself.
* We already need a fallback case if the host doesn't have an Ethernet
like MAC (e.g. if it's connected via a point to point interface, such
as a wireguard VPN).
Change to just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55. It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps. This can still, of course, be overridden with the -M option.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense
Case 3 occurs in two situations:
a) for our DHCP responses - since they come from passt internally there's
no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
accessible (localhost or the guest's own address).
(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.
For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw). For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.
Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here (e.g. if the host is disconnected or only has a point to point
link with no gateway address). In that case we have to disable forwarding
of inbound connections with guest-inaccessible source addresses.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually avoid NAT, but in a few cases we need to apply address
translations. For inbound connections that happens for addresses which
make sense to the host but are either inaccessible, or mean a different
location from the guest's point of view.
Add some helper functions to determine such addresses, and use them in
fwd_nat_from_host(). In doing so clarify some of the reasons for the
logic. We'll also have further use for these helpers in future.
While we're there fix one unnecessary inconsistency between IPv4 and IPv6.
We always translated the guest's observed address, but for IPv4 we didn't
translate the guest's assigned address, whereas for IPv6 we did. Change
this to translate both in all cases for consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In every place we use our_tap_ll, we only use it as a fallback if the
IPv6 gateway address is not link-local. We can avoid that conditional at
use time by doing it at initialisation of our_tap_ll instead.
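Roughly, the selection done once at initialisation could look like the
sketch below. pick_tap_ll() and its parameters are hypothetical and only
illustrate the idea of moving the conditional out of the users:

```c
#include <netinet/in.h>

/* Choose the link-local address to use on the tap link, once, at
 * initialisation: the gateway address if it's link-local, otherwise the
 * given fallback.
 */
struct in6_addr pick_tap_ll(const struct in6_addr *gw,
			    const struct in6_addr *fallback)
{
	return IN6_IS_ADDR_LINKLOCAL(gw) ? *gw : *fallback;
}
```

Users of the resulting value then need no check of their own.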
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Some are guest visible addresses and may not be valid on the host, others
are host visible addresses and may not be valid on the guest. Rearrange
and comment the ip[46]_ctx definitions to make it clearer which is which.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the
guest, but the former is an address for our use on the tap link. Rename it
accordingly, to 'our_tap_ll'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When binding an IPv6 socket in sock_l4() we need to supply a scope id
if the address is link-local. We check for this by comparing the
given address to c->ip6.addr_ll. This is correct only by accident:
while c->ip6.addr_ll is typically set to the host interface's link
local address, the actual purpose of it is to provide a link local
address for passt's private use on the tap interface.
Instead set the scope id for any link-local address we're binding to.
We're going to need something and this is what makes sense for sockets
on the host. It doesn't make sense for PIF_SPLICE sockets, but those
should always have loopback, not link-local addresses.
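A sketch of the resulting check, assuming the interface index is already
known. fill_bind_addr6() is a hypothetical helper, not the code in
sock_l4():

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Prepare an IPv6 address for bind(): any link-local address gets the
 * scope id of the given interface, other addresses keep scope id 0.
 */
void fill_bind_addr6(struct sockaddr_in6 *sa, const struct in6_addr *addr,
		     in_port_t port, unsigned ifindex)
{
	memset(sa, 0, sizeof(*sa));

	sa->sin6_family = AF_INET6;
	sa->sin6_port = htons(port);
	sa->sin6_addr = *addr;

	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sin6_scope_id = ifindex;
}
```

The test depends on the kind of address only, not on whether it matches
one particular configured address.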
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Despite the names, addr_ll_seen does not relate to addr_ll the same
way addr_seen relates to addr. addr_ll_seen is an observed address
from the guest, whereas addr_ll is *our* link-local address for use on
the tap link when we can't use an external endpoint address. It's
used both for passt provided services (DHCPv6, NDP) and in some cases
for connections from addresses the guest can't access.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Although it's not 100% explicit in the man page, addresses given to
the --dns option are intended to be addresses as seen by the guest.
This differs from addresses taken from the host's /etc/resolv.conf,
which must be translated to guest accessible versions in some cases.
Our implementation is currently inconsistent on this: when using
--dns-forward, you must usually also give --dns with the matching address,
which is meaningful only in the guest's address view. However if you give
--dns with a loopback address, it will be translated like a host view
address.
Move the remapping logic for DNS addresses out of add_dns4() and add_dns6()
into add_dns_resolv() so that it is only applied for host nameserver
addresses, not for nameservers given explicitly with --dns.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to
the given address, rather than the gateway address. This is doubly wrong:
- We've just established the given address is a host loopback address
the guest can't access
- We've just set ip6.dns[] to tell the guest to use the gateway address,
so it won't use the dns_match address we're setting
Correct this to use the gateway address, like IPv4.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
get_dns() is already quite deeply nested, and future changes I have in
mind will add more complexity. Prepare for this by splitting out the
adding of a single nameserver to the configuration into its own function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Every time we call add_dns[46] we need to first check if there's space in
the c->ip[46].dns array for the new entry. We might as well make that
check in add_dns[46]() itself.
In fact it looks like the calls in get_dns() had an off by one error, not
allowing the last entry of the array to be filled. So, that bug is also
fixed by the change.
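For illustration, a bounds check kept inside the helper might look like
this sketch. MAXNS_TOY, struct toy_dns and toy_add_dns4() are hypothetical
stand-ins for the real definitions:

```c
#include <netinet/in.h>

#define MAXNS_TOY	3	/* hypothetical array size */

/* Hypothetical, reduced context: just the nameserver array */
struct toy_dns {
	struct in_addr dns[MAXNS_TOY];
};

/* Store 'addr' at index 'idx' if there's room, and return the next free
 * index. Every slot, including the last one, can be filled.
 */
unsigned toy_add_dns4(struct toy_dns *d, const struct in_addr *addr,
		      unsigned idx)
{
	if (idx >= MAXNS_TOY)
		return idx;	/* array full: nothing added */

	d->dns[idx] = *addr;
	return idx + 1;
}
```

Callers no longer repeat the check, so there's a single comparison to get
right.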
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
get_dns() counts the number of guest DNS servers it adds, and gives an
error if it couldn't add any. However, this count ignores the fact that
add_dns[46]() may in some cases *not* add an entry. Use the array indices
we're already tracking to get an accurate count.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently add_dns[46]() take a somewhat awkward double pointer to the
entry in the c->ip[46].dns array to update. It turns out to be easier to
work with indices into that array instead.
This diff does add some lines, but it's comments, and will allow some
future code reductions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We rely on C11 already, so we can use clearer and more type-checkable
struct assignment instead of memcpy() for copying IP addresses around.
This exposes some "pointer could be const" warnings from cppcheck, so
address those too.
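A small before/after sketch of the idea, with a hypothetical struct
toy_ctx standing in for the real context:

```c
#include <netinet/in.h>
#include <string.h>

/* Hypothetical context holding a single address */
struct toy_ctx {
	struct in6_addr addr;
};

/* Before: memcpy() accepts any pointer types and any length */
void set_addr_memcpy(struct toy_ctx *c, const struct in6_addr *a)
{
	memcpy(&c->addr, a, sizeof(c->addr));
}

/* After: the compiler checks that both sides are struct in6_addr */
void set_addr_assign(struct toy_ctx *c, const struct in6_addr *a)
{
	c->addr = *a;
}
```

Both functions copy the same bytes, but only the second one fails to
compile if the source has the wrong type.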
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display. Add an eth_ntop() helper for
this.
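One possible shape for such a helper is sketched below. The signature and
the name eth_ntop_sketch() are assumptions modelled on inet_ntop(), not
necessarily what the patch adds:

```c
#include <stddef.h>
#include <stdio.h>

/* Format a 6-byte Ethernet-like MAC address as aa:bb:cc:dd:ee:ff into
 * 'dst'. Return 'dst' on success, NULL if 'size' is too small.
 */
const char *eth_ntop_sketch(const unsigned char *mac, char *dst, size_t size)
{
	int len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
			   mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

	if (len < 0 || (size_t)len >= size)
		return NULL;

	return dst;
}
```

An 18-byte buffer is enough for the 17 characters plus the terminating
NUL.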
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there correct an error in flow_initiate_af()'s comments where
we referred to parameters by the wrong name).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As soon as the kernel notifier for IPv6 address configuration
(addrconf_notify()) sees that we bring the target interface up
(NETDEV_UP), it will schedule duplicate address detection, so, by
itself, setting the nodad flag later is useless, because that won't
stop a detection that's already in progress.
However, if we disable neighbour solicitations with IFF_NOARP (which
is a misnomer for IPv6 interfaces, but there's no possibility of
mixing things up), the notifier will not trigger DAD, because it can't
be done, of course, without neighbour solicitations.
Set IFF_NOARP as we bring up the device, and drop it after we had a
chance to set the nodad attribute on the link.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As soon as we bring up the interface, the Linux kernel will set up a
link-local address for it, so we can fetch it and start using it right
away, if we need a link-local address to communicate to the container
before we see any traffic coming from it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we'll
anyway not relay neighbour solicitations with an unspecified source
address.
While they perform duplicate address detection, the link-local address
is not usable, which prevents us from bringing up containers, in
particular, and communicating with them right away via IPv6.
This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.
As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.
Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In the next patches, we'll reuse it to set flags other than IFF_UP.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As we'll use nl_link_up() for more than just bringing up devices, it
will become awkward to carry empty MTU values around whenever we call
it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We have a number of delays when we switch to new layouts that were
added to make the tests visually easier to follow, together with
blinking status bars. Shorten the delays and avoid blinking the
status bar if $FAST is set to 1 (no demo mode).
Shorten delays in busy loops to 10ms, instead of 100ms, and skip the
one-second fixed delay when we wait for the status of a command.
Cut the duration of throughput and latency tests to one second, down
from ten. Somewhat surprisingly, the results we get are rather
consistent, and not significantly different from what we'd get with
10 seconds.
This, together with Podman's commit 20f3e8909e3a ("test/system:
pasta_test_do add explicit port check"), cuts the time needed on my
setup for a full test run from approximately 37 minutes to...:
$ time ./run
[exited]
PASS: 165, FAIL: 0
Log at /home/sbrivio/passt/test/test_logs/test.log
real 15m34.253s
user 0m0.011s
sys 0m0.011s
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Using a zero port on TCP or UDP is dubious, and we can't really deal with
forwarding such a flow within the constraints of the socket API. Hence
we ASSERT()ed that we had non-zero ports in flow_hash().
The intention was to make sure that the protocol code sanitizes such ports
before completing a flow entry. Unfortunately, flow_hash() is also called
on new packets to see if they have an existing flow, so the unsanitized
guest packet can crash passt with the assert.
Correct this by moving the assert from flow_hash() to flow_sidx_hash()
which is only used on entries already in the table, not on unsanitized
data.
Reported-by: Matt Hamilton <matt@thmail.io>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
f6d5a52392 moved handling of -D into a later loop. However as a side
effect it moved this from a switch block to an if block. I left a couple
of 'break' statements that don't make sense in the new context. They
should be 'continue' so that we go onto the next option, rather than
leaving the loop entirely.
Fixes: f6d5a52392 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- Add structs for NA, RA, NS, MTU, prefix info, option header,
link-layer address, RDNSS, DNSSL and link-layer for RA message.
- Turn NA message from purely imperative, going byte by byte,
to declarative by filling its struct.
- Turn part of RA message into declarative.
- Move packet_add() to be before the call of ndp() in tap6_handler()
if the protocol of the packet is ICMPv6.
- Add a pool of packets as an additional parameter to ndp().
- Check the size of NS packet with packet_get() before sending an NA
packet.
- Add documentation for the structs.
- Add an enum for NDP option types.
Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.
Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine. However, they can also be called earlier
if an explicit -D command line option is given. In this case no_map_gw
and/or c->ip[46].gw may not yet be initialised properly, leading to this
doing the wrong thing.
Luckily we already have a second pass of option parsing for things which
need addresses to already be configured. Move handling of -D to there.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These fields are described as being an address for an external, routable
interface. That's not necessarily the case when using -a. But, more
importantly, saying where the value comes from is not as useful as what
it's used for. The real purpose of this field is as the address which we
assign to the guest via DHCP or --config-net.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we prefix the second part of messages printed through
logmsg_perror() by the timestamp, on debug, we'll have two timestamps
and a weird separator in the result, such as this beauty:
0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted
Add a parameter to logmsg() and vlogmsg() which indicates a message
continuation. If that's set, don't print the timestamp in vlogmsg().
Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that pasta supports specifying a command to be executed on the
command line, even without the usual -- separator as long as there's
no ambiguity, we shouldn't eat up options that are not meant for us.
Paul reports, for instance, that with:
pasta --config-net ip -6 route
-6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'.
That's because getopt_long(), by default, shuffles the argument list
to shift non-option arguments to the end.
Avoid that by adding '+' at the beginning of 'optstring'.
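A minimal sketch of the effect, not the actual conf() code: the struct,
the function name and the value 1 returned for --config-net are
assumptions made for illustration. It relies on getopt's initial state
(optind == 1), so it's meant to be called once per process.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the options found before the command */
struct leading_opts {
	bool config_net;	/* --config-net was given */
	bool ipv6_only;		/* -6 / --ipv6-only was given */
	int command;		/* Index of first argument left untouched */
};

static struct leading_opts leading_opts_parse(int argc, char **argv)
{
	static const struct option options[] = {
		{ "config-net",	no_argument,	NULL,	1 },
		{ "ipv6-only",	no_argument,	NULL,	'6' },
		{ NULL,		0,		NULL,	0 },
	};
	struct leading_opts o = { false, false, 0 };
	int name;

	/* Leading '+': stop at the first non-option, don't reorder argv */
	while ((name = getopt_long(argc, argv, "+6", options, NULL)) != -1) {
		if (name == 1)
			o.config_net = true;
		else if (name == '6')
			o.ipv6_only = true;
	}

	o.command = optind;	/* Everything from here on is the command */
	return o;
}
```

With 'pasta --config-net ip -6 route', parsing stops at 'ip', so -6 stays
part of the command instead of being taken as --ipv6-only.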
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams
and the one from --fd.
Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.
As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
takes), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptor numbers that match standard streams,
while at it.
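As a rough sketch of the two checks described above (function and struct
names are hypothetical, and base 10 is an assumption): validate the --fd
argument, then work out which descriptor ranges are left to close around
it. Each resulting range would then be handed to close_range(2).

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical: inclusive range of descriptors to close */
struct fd_range {
	unsigned int first;
	unsigned int last;
};

/* Return the descriptor given as --fd argument, -1 if it's not usable */
static int fd_arg_parse(const char *arg)
{
	char *end;
	long fd;

	errno = 0;
	fd = strtol(arg, &end, 10);
	if (errno || end == arg || *end)
		return -1;

	/* Reject standard streams (0 to 2) and anything not fitting an int */
	if (fd <= 2 || fd > INT_MAX)
		return -1;

	return (int)fd;
}

/* Fill ranges covering all descriptors above stderr except 'keep'
 * (-1: keep nothing), return the number of ranges filled in
 */
static int fd_close_ranges(int keep, struct fd_range ranges[2])
{
	int n = 0;

	if (keep < 0) {
		ranges[n].first = 3;
		ranges[n].last = ~0U;
		return n + 1;
	}

	if (keep > 3) {
		ranges[n].first = 3;
		ranges[n].last = (unsigned int)keep - 1;
		n++;
	}

	ranges[n].first = (unsigned int)keep + 1;
	ranges[n].last = ~0U;

	return n + 1;
}
```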
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Particularly in shell it's sometimes natural to save the pid from a process
run and later kill it. If doing this with nstool exec, however, it will
kill nstool itself, not the program it is running, which isn't usually what
you want or expect.
Address this by having nstool propagate SIGTERM to its child process. It
may make sense to propagate some other signals, but some introduce extra
complications, so we'll worry about them when and if it seems useful.
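A minimal sketch of the idea, not the nstool code itself: all names here
are made up, and a real implementation would also need to deal with a
SIGTERM arriving between fork() and the handler being installed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: PID of the child we forward signals to, -1 if none */
static volatile sig_atomic_t child_pid = -1;

/* Signal handler: pass the signal on to the child, if there is one */
static void sigterm_forward(int sig)
{
	int errno_save = errno;

	if (child_pid > 0)
		kill(child_pid, sig);

	errno = errno_save;
}

/* Run argv as a child, forward SIGTERM to it, return its wait status
 * or -1 on failure
 */
static int run_forwarding_sigterm(char *const argv[])
{
	struct sigaction sa;
	int status;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* Only reached if execvp() failed */
	}

	child_pid = pid;

	sa.sa_handler = sigterm_forward;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGTERM, &sa, NULL))
		return -1;

	/* The handler may interrupt waitpid(): retry on EINTR */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}

	return status;
}
```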
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use logtime() to get a timestamp for the log in two places:
- in vlogmsg(), which is used only for debug_print messages
- in logfile_write() which is only used for messages to the log file
These cases are mutually exclusive, so we don't ever print the same message
with different timestamps, but that's not particularly obvious to see.
It's possible future tweaks to logging logic could mean we log to two
different places with different timestamps, which would be confusing.
Refactor to have a single logtime() call in vlogmsg() and use it for all
the places we need it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clock_gettime() can, theoretically, fail, although it probably won't until
2038 on old 32-bit systems. Still, it's possible someone could run with
a wildly out of sync clock, or new errors could be added, or it could fail
due to a bug in libc or the kernel.
We don't handle this well. In the debug_print case in vlogmsg we'll just
ignore the failure, and print a timestamp based on uninitialised garbage.
In logfile_write() we exit early and won't log anything at all, which seems
like a good way to make an already weird situation undebuggable.
Add some helpers to instead handle this by using "<error>" in place of a
timestamp if something goes wrong with clock_gettime().
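Roughly, the pattern looks like the sketch below. The helper names and
the seconds-only output are assumptions for illustration, not the actual
helpers added to log.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Return ts filled with the current time, NULL if clock_gettime() fails */
static const struct timespec *now_or_null(clockid_t clock,
					  struct timespec *ts)
{
	if (clock_gettime(clock, ts))
		return NULL;

	return ts;
}

/* Format a timestamp (seconds only, in this sketch) into buf, or
 * "<error>" if we couldn't get one. Returns like snprintf().
 */
static int stamp_fmt(char *buf, size_t size, const struct timespec *ts)
{
	if (!ts)
		return snprintf(buf, size, "<error>");

	return snprintf(buf, size, "%lli", (long long)ts->tv_sec);
}
```

This way the message itself is always logged, with or without a usable
timestamp.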
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logtime_fmt_and_arg() is a rather odd macro, producing both a format
string and an argument, which can only be used in quite specific printf()
like formulations. It also has a significant bug: it tries to display 4
digits after the decimal point (so down to tenths of milliseconds) using
%04i. But the field width in printf() is always a *minimum* not maximum
field width, so this will not truncate the given value, but will redisplay
the entire tenth-of-milliseconds difference again after the decimal point.
Replace the macro with an snprintf() like function which will format the
timestamp, and use an explicit % to correct the display.
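To illustrate the fix (a sketch with a hypothetical name and signature,
not the actual logtime_fmt()): the fractional part needs an explicit
modulo, because a field width of 4 only pads, it never truncates.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a difference in microseconds as seconds, with four decimal
 * digits (tenths of milliseconds). Returns like snprintf().
 */
static int timestamp_fmt_sketch(char *buf, size_t size, int64_t delta_us)
{
	int64_t delta = delta_us / 100;	/* Tenths of milliseconds */

	/* Without the % 10000, the second field would print the whole of
	 * delta again, as %04 is a minimum width only
	 */
	return snprintf(buf, size, "%lli.%04lli",
			(long long)(delta / 10000),
			(long long)(delta % 10000));
}
```

For a difference of 12345678µs this gives "12.3456", where the old
formatting would have shown the entire 123456 after the decimal point.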
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Make logtime_fmt() static]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The comment for timespec_diff_us() claims it will wrap after 2^64µs. This
is incorrect for two reasons:
* It returns a long long, which is probably 64-bits, but might not be
* It returns a signed value, so even if it is 64 bits it will wrap after
2^63µs
Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.
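A sketch of what this could look like with an explicit int64_t return
type. The name and body are assumptions, not copied from util.c.

```c
#include <stdint.h>
#include <time.h>

/* Difference a - b in microseconds, rounded down, assuming a >= b.
 * The result is signed and 64-bit, so it wraps after 2^63µs.
 */
static int64_t timespec_diff_us_sketch(const struct timespec *a,
				       const struct timespec *b)
{
	if (a->tv_nsec < b->tv_nsec) {
		/* Borrow one second for the nanoseconds part */
		return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
		       ((int64_t)a->tv_sec - b->tv_sec - 1) * 1000000;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       ((int64_t)a->tv_sec - b->tv_sec) * 1000000;
}
```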
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.
That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.
Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.
While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop reporting any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to be waiting for is for
output readiness on side 1, not 0, and the other way around as well.
This causes random transfer failures for local TCP connections,
depending on whether we ever need to wait for output readiness.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder. Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
No code change.
They need to be exported to be available to the vhost-user version of
passt.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid also exporting udp_meta_t, which is specific
to the socket version of udp.c, don't pass udp_meta_t to them, but only
the one field they need, s_in.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logfile_write() is not used outside log.c, nor should it be. It should
only be used externally via the general logging functions. Make it static
in log.c. To avoid forward declarations this requires moving a bunch of
functions earlier in the file.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ed reported this:
# Error: pasta failed with exit code 1:
# Couldn't drop cap 3 from bounding set
# : No child processes
in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?
Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:
Fetching and setting the value of errno is async-signal-safe
provided that the signal handler saves errno on entry and
restores its value before returning.
Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.
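The pattern, as a minimal sketch (the handler name and the reaping loop
are assumptions, this is not the actual pasta_child_handler()):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

/* SIGCHLD handler sketch: reap children without clobbering errno */
static void child_handler_sketch(int signal)
{
	int errno_save = errno;	/* Save on entry... */
	siginfo_t infop;

	(void)signal;

	for (;;) {
		infop.si_pid = 0;

		/* Fails with ECHILD once there are no children left */
		if (waitid(P_ALL, 0, &infop, WEXITED | WNOHANG))
			break;

		/* Children exist, but none has exited yet */
		if (!infop.si_pid)
			break;
	}

	errno = errno_save;	/* ...and restore before returning */
}
```

Without the save and restore, code interrupted right after a failing
call could report ECHILD from waitid() instead of its own error.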
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not leaving
users a bit confused. This change modifies the host namespace
to add a prefix "pasta-" to make it a bit more obvious.
Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
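The approach can be sketched as follows. Buffer size, struct and function
names are made up for illustration, frames are only counted rather than
processed, and frame lengths aren't validated here: a length bigger than
the buffer would never complete.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define STREAM_BUF_SIZE	256	/* Hypothetical, small for illustration */

struct stream_buf {
	uint8_t data[STREAM_BUF_SIZE];
	size_t len;		/* Bytes currently held in data */
};

/* Stand-in for recv() into the free space after the saved partial frame */
static int stream_append(struct stream_buf *b, const void *src, size_t n)
{
	if (n > sizeof(b->data) - b->len)
		return -1;

	memcpy(b->data + b->len, src, n);
	b->len += n;
	return 0;
}

/* Consume all complete frames, keep any trailing partial frame at the
 * start of the buffer for next time. Returns number of complete frames.
 */
static unsigned stream_consume(struct stream_buf *b)
{
	unsigned frames = 0;
	size_t off = 0;

	while (b->len - off >= sizeof(uint32_t)) {
		uint32_t l2len;

		/* The length might be unaligned: don't cast and load */
		memcpy(&l2len, b->data + off, sizeof(l2len));
		l2len = ntohl(l2len);

		if (b->len - off - sizeof(l2len) < l2len)
			break;	/* Partial frame: wait for the rest */

		/* A real handler would process the frame here */
		off += sizeof(l2len) + l2len;
		frames++;
	}

	/* Move the remainder, including a partial length, to the start */
	memmove(b->data, b->data + off, b->len - off);
	b->len -= off;

	return frames;
}
```

Nothing here blocks: whatever is left over, including fewer than four
bytes of length, simply waits for the next batch of data.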
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
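One common way to write such a helper is a memcpy() into a local
variable. The sketch below uses a hypothetical name and also converts
from network order, which the actual helper might leave to the caller.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit big-endian value from a possibly unaligned address */
static inline uint32_t load_be32_unaligned(const void *p)
{
	uint32_t val;

	/* memcpy() has no alignment requirement, and compilers typically
	 * turn a fixed-size copy like this into a plain load where the
	 * platform allows it
	 */
	memcpy(&val, p, sizeof(val));

	return ntohl(val);
}
```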
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible on first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elsewhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we really should report
something for debuggability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We report relative timestamps in logs, so we want to avoid jumps in
the system time.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting
timestamps are relative, as closely as possible, to when we start.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For some reason, in commit 01efc71ddd ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not
accounting for the fractional part.
Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.
To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53. Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case. We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address. However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.
Correct this with a different test condition.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side. Currently, however, only UDP
packets to port 53 are considered "DNS requests". This ignores DNS
requests over TCP - less common, but certainly possible. It also ignores
encrypted DNS requests on port 853.
Extend the DNS forwarding logic to handle both of those cases.
Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.
To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses, correct or
not, they're harmless anyway.
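A sketch of how such parsing could look, with hypothetical names (the
actual option handling in conf() may differ): strip one pair of
surrounding brackets, if present, then try both address families.

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical: storage for either address family */
union addr_any {
	struct in_addr a4;
	struct in6_addr a6;
};

/* Parse an IPv4 or IPv6 address with optional square brackets.
 * Returns AF_INET or AF_INET6 on success, -1 on failure.
 */
static int addr_parse_bracketed(const char *arg, union addr_any *addr)
{
	char buf[INET6_ADDRSTRLEN];
	size_t len = strlen(arg);

	/* Only strip brackets if both of them are there */
	if (len >= 2 && arg[0] == '[' && arg[len - 1] == ']') {
		arg++;
		len -= 2;
	}

	if (len >= sizeof(buf))
		return -1;

	memcpy(buf, arg, len);
	buf[len] = '\0';

	if (inet_pton(AF_INET, buf, &addr->a4) == 1)
		return AF_INET;

	if (inet_pton(AF_INET6, buf, &addr->a6) == 1)
		return AF_INET6;

	return -1;
}
```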
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Starting from iperf3 version 3.16, -P / --parallel spawns multiple
clients as separate threads, instead of multiple streams serviced by
the same thread.
So we can drop our lib/test implementation to spawn several iperf3
client and server processes and finally simplify things quite a bit.
Adjust number of threads and UDP sending bandwidth to values that seem
to be more or less matching previous throughput tests on my setup.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Differences in allocated Acpi-Parse entries are gone (at least) since
the 6.1 Linux kernel series. I should run this on a 6.10 kernel,
eventually, and adjust things further, as needed.
Userspace symbols are also fairly different now: show whatever is more
than 1 MiB at the moment.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
This used to work on my setup as I kept reusing an old mbuto
(initramfs) image, but since commit 65923ba798 ("conf: Accept
duplicate and conflicting options, the last one wins"), --netns-only
is, as originally intended, a pasta-only option.
I had used --netns-only, here, to prevent passt from trying to detach
its own user namespace, which is not permitted as we're in a chroot,
see unshare(2). In turn, we need the chroot because passt can't pivot
root directly into its own empty filesystem using an initramfs.
Use switch_root into the tmpfs mountpoint instead of chroot, so that
we can still detach user namespaces.
Note that in the mbuto images, we can't switch to nobody as we have
no password entries at all, so we need to detach a further user
namespace before starting passt, to trick passt into running as UID
0.
Given the new sequence, it's now more convenient to directly switch
to a detached network namespace as well, which means we need to move
the initialisation of the dummy network from the init script into the
test script.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Calling vlogmsg() twice from logmsg_perror() results in this beauty:
$ ./pasta -i foo
Invalid interface name foo
: No such device
because the first part of the message, corresponding to the first
call, doesn't end with a newline, and vlogmsg() adds it.
Given that we can't easily append an argument (error description) to
a variadic list, add a 'newline' parameter to all the functions that
currently add a newline if missing, and disable that on the first call
to vlogmsg() from logmsg_perror(). Not very pretty but I can't think
of any solution that's less messy than this.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
This:
$ ./pasta
SO_PEEK_OFF not supported
#
is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.
Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.
Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we start pasta with some ports forwarded, but no --config-net, say:
$ ./pasta -u 10001
and then use a local, non-loopback address to send traffic to that
port, say:
$ socat -u FILE:test UDP4:192.0.2.1:10001
pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.
By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust
the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards.
Although it was discussed during development, somewhere during the shuffles
we missed the case where we move the pointer backwards because we lost
frames while sending them to the guest. This can happen, for example, if
the socket
buffer on the Unix socket to qemu overflows.
Fixing this is slightly complicated because we need to pass a non-const
context pointer to some places we previously didn't need it. While we're
there also fix a small stylistic issue in the function comment for
tcp_revert_seq() - it was using spaces instead of tabs.
Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Based on an original patch by Jon Maloy:
--
The recently added socket option SO_PEEK_OFF is not supported for
TCP/IPv6 sockets. Until we get that support into the kernel we need to
test for support in both protocols to set the global 'peek_offset_cap'
to true.
--
Compared to the original patch:
- only check for SO_PEEK_OFF support for enabled IP versions
- use sa_family_t instead of int to pass the address family around
Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match. Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[] or udp_act[] arrays. Remove them and connected
code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This replaces the last piece of existing UDP port tracking with the
common flow table. Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface. Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams. We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them. Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.
For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter. Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still
used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it. Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow. We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else. We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps. Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc. This may result in smaller batches, but makes things
easier down the line. We can re-optimise this later if necessary. For now
we don't do anything else with the flow, not even match reply packets to
the same flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add logic to the fwd_nat_from_*() functions to forward UDP packets. The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases. Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently ICMP hard codes its forwarding rules, and never applies any
translations. Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).
This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine. This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.
This removes the last user of flow_target_af(), so that's removed as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generate the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.
To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses. It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address. This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out. We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have upcoming use cases where it's useful to create a new bound socket
based on information from the flow table. Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything. Eliminate it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an ancillary data structure, the
icmp_id_map[] tables. However, we can look this up using the flow hash
table - that's what it's for.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
icmp_sock_handler() obtains the guest address from its most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp. Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
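As a rough illustration of the idea (not the actual code), the sketch
below computes one hash per flow side and derives both the hash bucket
and the initial sequence number from it. The key layout, TABLE_SIZE,
the FNV-1a stand-in hash and all function names are assumptions made
for this example; a real implementation would use a keyed hash with a
secret.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE	1024	/* hypothetical number of hash buckets */

/* Hypothetical key: 12 bytes, no padding, so hashing raw bytes is safe */
struct side_key {
	uint32_t eaddr, faddr;
	uint16_t eport, fport;
};

/* Stand-in hash (FNV-1a), for illustration only */
static uint64_t side_hash(const struct side_key *key)
{
	const uint8_t *p = (const uint8_t *)key;
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < sizeof(*key); i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}

	return h;
}

/* Bucket index for the hash table, from the already computed hash */
static size_t hash_bucket(uint64_t hash)
{
	return hash % TABLE_SIZE;
}

/* Initial sequence number: fold the same hash to 32 bits, add a clock */
static uint32_t isn_from_hash(uint64_t hash, uint32_t ticks)
{
	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ticks;
}
```

The caller hashes once, then passes the result to both helpers instead
of hashing the same addresses and ports twice.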
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_lookup_af() now needs to verify that the entry has a matching
protocol and interface as well as matching addresses and ports.
* We double the size of the hash table, because it's now at least
theoretically possible for both sides of each flow to be hashed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag.
As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address. This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address. That won't
work; it will probably result in an RST from the guest. Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.
Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure. Access it from there instead, and
remove it from the TCP specific structure. With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP. For now this leaves some information
redundantly recorded in both generic and type specific fields. We'll
fix that in later patches.
For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).
To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as seen by the host.
Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that
information be completed when a flow enters INI. Later patches will
populate the target side.
For now this leaves some information redundantly recorded in both generic
and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This test program verifies that we can receive and discard datagrams by
using recv() with a NULL buffer and zero-length. Extend it to verify it
also works using recvmsg() and either an iov with a zero-length NULL
buffer or an iov that itself is NULL and zero-length.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed printf() message in main of recv-zero.c]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To simplify lifetime management of "listening" UDP sockets, UDP flow
support needs to duplicate existing bound sockets. Those duplicates will
be close()d when their corresponding flow expires, but we expect the
original to still receive datagrams as always. That is, we expect the
close() on the duplicate to remove the duplicated fd, but not to close the
underlying UDP socket.
Add a test program to doc/platform-requirements to verify this requirement.
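The requirement can be sketched roughly as below. This is not the test
program itself: the helper name and its return convention are made up
for illustration, and it works on any pair of connected datagram
sockets rather than specifically on bound UDP sockets.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical check: duplicate @orig, close the duplicate, then verify
 * that a datagram sent from @peer still arrives on @orig.
 *
 * Returns 1 if @orig still receives, 0 if not, -1 on setup failure.
 */
static int dup_close_keeps_socket(int orig, int peer)
{
	char c = 'x', r = 0;
	int d = dup(orig);

	if (d < 0)
		return -1;

	/* This must only drop the extra fd, not the underlying socket */
	if (close(d))
		return -1;

	if (send(peer, &c, 1, 0) != 1)
		return 0;

	return recv(orig, &r, 1, 0) == 1 && r == c;
}
```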
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both the events and flags fields in tcp_splice_conn have several bits
which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1.
This necessitates some rather awkward ternary expressions when we need
to get the relevant bit for a particular side.
Simplify this by using a parameterised macro for the bit values. This
needs a ternary expression inside the macros, but makes the places we use
it substantially clearer.
That simplification in turn allows us to use a loop across each side to
implement several things which are currently open coded to do equivalent
things for each side in turn.
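A minimal sketch of the pattern follows. The bit positions, the
FIN_RCVD name and the sides_pending() helper are assumptions for
illustration, not the real values from tcp_splice_conn.

```c
#include <stdint.h>

#define BIT(n)			(1U << (n))
#define SIDES			2

/* Hypothetical bit assignments: one bit per side for each property */
#define OUT_WAIT(sidei)		((sidei) ? BIT(1) : BIT(0))
#define FIN_RCVD(sidei)		((sidei) ? BIT(3) : BIT(2))

/* Count sides which have both bits set, with a loop instead of writing
 * out the same test once for side 0 and once for side 1
 */
static unsigned sides_pending(uint8_t events)
{
	unsigned sidei, n = 0;

	for (sidei = 0; sidei < SIDES; sidei++) {
		if ((events & FIN_RCVD(sidei)) && (events & OUT_WAIT(sidei)))
			n++;
	}

	return n;
}
```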
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have a handful of places where we use a loop to step through each side
of a flow or flows, and we're probably going to have more in future.
Introduce a macro to implement this loop for convenience.
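Such a macro might look roughly like this. The macro name and the
sum_sides() user are assumptions made for the example.

```c
enum { INISIDE = 0, TGTSIDE = 1, SIDES = 2 };

/* Step @sidei_ through each side index of a flow */
#define flow_foreach_sidei(sidei_) \
	for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)

/* Example user: add up one per-side value over both sides */
static int sum_sides(const int val[SIDES])
{
	unsigned sidei;
	int sum = 0;

	flow_foreach_sidei(sidei)
		sum += val[sidei];

	return sum;
}
```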
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In various places we have variables named 'side' or similar which always
have the value 0 or 1 (INISIDE or TGTSIDE). Given a flow, this refers to
a specific side of it. Upcoming flow table work will make it more useful
for "side" to refer to a specific side of a specific flow. To make things
less confusing then, prefer the term "side index" and the name 'sidei' for
variables with just the 0 or 1 value.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed minor detail in comment to struct flow_common]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP (both regular and spliced) and ICMP both have macros to retrieve the
relevant protocol specific flow structure from a flow index. In most cases
what we actually want is to get the specific flow from a sidx. Replace
those simple macros with a more precise inline, which also asserts that
the flow is of the type we expect.
While we're there, also add a pif_at_sidx() helper to get the interface of
a specific flow & side, which is useful in some places.
Finally, fix some minor style issues in the comments on some of the
existing sidx related helpers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we ignore all events other than EPOLLIN on UDP sockets. This
means that if we ever receive an EPOLLERR event, we'll enter an infinite
loop on epoll, because we'll never do anything to clear the error.
Luckily that doesn't seem to have happened in practice, but it's certainly
fragile. Furthermore changes in how we handle UDP sockets with the flow
table mean we will start receiving error events.
Add handling of EPOLLERR events. For now we just read the error from the
error queue (thereby clearing the error state) and print a debug message.
We can add more substantial handling of specific events in future if we
want to.
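A minimal, Linux-specific sketch of draining the error queue is shown
below. The function name and the debug message are made up for this
example, and the control data is received but deliberately not decoded.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: read and discard everything on the error queue of
 * socket @s, which clears the pending error state.
 *
 * Returns the number of entries drained, or -1 on unexpected failure.
 */
static int sock_drain_errors(int s)
{
	char ctrl[256];
	int n = 0;

	for (;;) {
		struct msghdr mh;
		ssize_t rc;

		memset(&mh, 0, sizeof(mh));
		mh.msg_control = ctrl;
		mh.msg_controllen = sizeof(ctrl);

		rc = recvmsg(s, &mh, MSG_ERRQUEUE | MSG_DONTWAIT);
		if (rc < 0) {
			/* Queue is empty: we're done */
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return n;
			return -1;
		}

		fprintf(stderr, "socket %i: drained error queue entry\n", s);
		n++;
	}
}
```

This would be called from the epoll handler when EPOLLERR is reported,
before looking at any other events on the socket.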
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Allow sockaddr_ntop() to format AF_UNSPEC socket addresses. There do exist
a few cases where we might legitimately have either an AF_UNSPEC or a real
address, such as the origin address from MSG_ERRQUEUE. Even in cases where
we shouldn't get an AF_UNSPEC address, formatting it is likely to make
things easier to debug if we ever somehow do.
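For illustration only, a simplified formatter with an AF_UNSPEC case
could look like the sketch below. The function name, the output
formats and the "<unspecified>" string are assumptions, not
necessarily what sockaddr_ntop() prints.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical formatter: write a printable form of socket address @sa
 * into @dst. Returns @dst, or NULL for families we can't format.
 */
static const char *sa_format(const void *sa, char *dst, size_t size)
{
	const struct sockaddr *s = sa;
	char a[INET6_ADDRSTRLEN];

	switch (s->sa_family) {
	case AF_UNSPEC:
		/* No address at all: say so instead of failing */
		snprintf(dst, size, "<unspecified>");
		break;
	case AF_INET: {
		const struct sockaddr_in *sa4 = sa;

		if (!inet_ntop(AF_INET, &sa4->sin_addr, a, sizeof(a)))
			return NULL;
		snprintf(dst, size, "%s:%u", a,
			 (unsigned)ntohs(sa4->sin_port));
		break;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *sa6 = sa;

		if (!inet_ntop(AF_INET6, &sa6->sin6_addr, a, sizeof(a)))
			return NULL;
		snprintf(dst, size, "[%s]:%u", a,
			 (unsigned)ntohs(sa6->sin6_port));
		break;
	}
	default:
		return NULL;
	}

	return dst;
}
```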
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We abort the UDP socket handler if the no_udp flag is set. But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place. If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop. The same is the case for the TCP socket and timer handlers and
the no_tcp flag.
Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Through an oversight this was previously declared as a public function
although it's only used in udp.c and there is no prototype in any header.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
UDP and/or TCP can be disabled with the --no-udp and --no-tcp options.
However, when this is specified, it's still possible to configure forwarded
ports for the disabled protocol. In some cases this will open sockets and
perform other actions, which might not be safe since the entire protocol
won't be initialised.
Check for this case, and explicitly forbid it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.
However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.
We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.
RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.
It turns out that this solution is extremely simple to implement in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.
When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
advertised window. In fact, in all observed cases we have seen that
it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
unless they already have been, meaning that only the next timer
expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
abated, and the data flow can resume.
It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.
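The check described above can be sketched as follows. The structure,
the field names and wnd_update() are simplified assumptions for
illustration; the real connection state carries much more.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence comparison with wrap-around */
#define SEQ_LT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)

/* Hypothetical, cut-down connection state */
struct conn_sketch {
	uint32_t seq_to_tap;	/* next sequence we would send to the peer */
	uint32_t wnd_from_tap;	/* last window we saved from the peer */
};

/* Save the window advertised by the peer, unless it's a zero window which
 * acknowledges less than we already sent, that is, a shrunk window.
 *
 * Returns true if the window was saved, false if it was ignored.
 */
static bool wnd_update(struct conn_sketch *c, uint32_t ack_seq, uint32_t wnd)
{
	if (!wnd && SEQ_LT(ack_seq, c->seq_to_tap))
		return false;

	c->wnd_from_tap = wnd;
	return true;
}
```

With the zero window ignored, the sender keeps going up to the right
edge of the previously advertised window, as described in the list
above.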
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
to workaround]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
This new feature makes it possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->namespace direction.
In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.
Measurements with iperf3 show that throughput increases by 15-20
percent in the host->namespace direction when this feature is used.
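One way to detect the feature at runtime, sketched under the
assumption that a throwaway TCP socket is acceptable for probing, is
to try setting the option and see whether the kernel accepts it. The
function name is hypothetical.

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical probe: does the kernel accept SO_PEEK_OFF on TCP sockets
 * of address family @af?
 */
static bool peek_offset_supported(int af)
{
#ifdef SO_PEEK_OFF
	int off = 0, s;
	bool ok;

	s = socket(af, SOCK_STREAM, IPPROTO_TCP);
	if (s < 0)
		return false;

	/* Kernels without TCP support for the option reject this call */
	ok = !setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	close(s);

	return ok;
#else
	(void)af;
	return false;
#endif
}
```

If the probe fails, the caller keeps using the previous behaviour,
peeking from the start of the receive queue each time.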
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This test program checks for particular behaviour regardless of order of
operations. So, we step through the test with all possible orders for
a number of different parts. Or at least, we're supposed to: a copy
pasta error led to using the same order for two things which should be
independent.
Fixes: 299c407501 ("doc: Add program to document and test assumptions about SO_REUSEADDR")
Reported-by: David Taylor <davidt@yadt.co.uk>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().
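The technique amounts to something like the sketch below. The helper
name is made up, and the non-blocking flag is a choice made for this
example.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: drop the next queued datagram on socket @s without
 * copying any of it, by receiving into a zero-length buffer.
 *
 * Returns 0 on success, -1 if nothing could be discarded.
 */
static int discard_datagram(int s)
{
	return recv(s, NULL, 0, MSG_DONTWAIT) < 0 ? -1 : 0;
}
```

Each call removes exactly one datagram from the queue, whatever its
size, so later datagrams are still delivered intact.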
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which
documents what those expectations are in the process.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap). The logic to do this is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch. udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send() loosely do four
things between them:
1. Receive some datagrams from a socket
2. Split those datagrams into batches depending on how they need to be
sent (via tap or via a specific splice socket)
3. Prepare buffers for each datagram to send it onwards
4. Actually send it onwards
Split (1) and (3) into specific helper functions. This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame. This makes sense since our ethernet header is always the same.
However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop. Pull that outside the loop so we just initialise them once.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header and IPv4 per-frame
header, and udp6_l2_iov points to the IPv6 versions.
We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.
Unify these by having the common array point at a sockaddr_inany as the
address. This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port. However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.
Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname to udp_splicename.
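As a sketch of what writing the whole socket address involves, assume
a union along these lines. The member names and splice_name_set() are
illustrative assumptions, and the loopback target reflects the current
behaviour only.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical union able to hold either an IPv4 or IPv6 socket address */
union sockaddr_inany {
	sa_family_t		sa_family;
	struct sockaddr_in	sa4;
	struct sockaddr_in6	sa6;
};

/* Fill @sa with the loopback address of the given family and @port (host
 * byte order). Returns the length to use as msg_namelen.
 */
static socklen_t splice_name_set(union sockaddr_inany *sa, int v6,
				 in_port_t port)
{
	memset(sa, 0, sizeof(*sa));

	if (v6) {
		sa->sa6.sin6_family = AF_INET6;
		sa->sa6.sin6_addr = in6addr_loopback;
		sa->sa6.sin6_port = htons(port);
		return sizeof(sa->sa6);
	}

	sa->sa4.sin_family = AF_INET;
	sa->sa4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa->sa4.sin_port = htons(port);
	return sizeof(sa->sa4);
}
```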
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
with L2. They are, however, specific to receiving not sending. Rename
to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the local address but
the more salient point is that these are the destination address for the
splice arrays. Rename to udp[46]_splice_to
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions. Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions. To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.
That's not always done consistently though. In flow_at_sidx() we open code
a check on the flow index. In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.
Clean this up a bit, by adding an explicit flow_sidx_valid() test function.
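Roughly, the test function could look like this sketch. The bit width,
FLOW_MAX value and structure layout here are assumptions for the
example rather than the definitions in flow_table.h.

```c
#include <stdbool.h>

#define FLOW_INDEX_BITS		17	/* hypothetical */
#define FLOW_MAX		((1U << FLOW_INDEX_BITS) - 1)

/* Index of one side of one flow */
typedef struct flow_sidx {
	unsigned	sidei :1;
	unsigned	flowi :FLOW_INDEX_BITS;
} flow_sidx_t;

/* NULL-like value: the flow index is out of bounds */
#define FLOW_SIDX_NONE		((flow_sidx_t){ .flowi = FLOW_MAX })

/* Any out of bounds flow index counts as invalid, not just the exact
 * FLOW_SIDX_NONE value
 */
static inline bool flow_sidx_valid(flow_sidx_t sidx)
{
	return sidx.flowi < FLOW_MAX;
}
```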
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state. Currently it determines the correct tag for the epoll
data based on the protocol. However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol. So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
UNIX_SOCK_MAX is the maximum number we'll append to the socket path
if we generate it automatically. If it's given on the command line,
it can be up to UNIX_PATH_MAX (including the terminating character)
long.
UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of
108).
Commit ceddcac74a ("conf, tap: False "Buffer not null terminated"
positives, CWE-170") fixed the wrong problem: the right fix for the
problem at hand was actually commit cc287af173 ("conf: Fix
incorrect bounds checking for sock_path parameter").
Fixes: ceddcac74a ("conf, tap: False "Buffer not null terminated" positives, CWE-170")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Spotted by Coverity, harmless as we would consider that successful
and check on the socket later from the timer, but printing a debug
message in that case is definitely wise, should it ever happen.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Spotted by Coverity just recently. Not that it really matters as
MAXDNSRCH always appears to be defined as 1025, while a full domain
name can have up to 253 characters: it would be a bit pointless to
have a longer search domain.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
cppcheck 2.14 warns that the scope of the rport variable could be
reduced: do that, as reverted commit c80fa6a6bb ("udp: Make rport
calculation more local") did, but keep the temporary variable of
in_port_t type, otherwise the sum gets promoted to int.
While at it, add a comment explaining why we calculate rport like
this instead of directly using the sum as array index.
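The point about integer promotion can be shown with a cut-down version
of the loop. The function signature and array arguments are
simplifications made for this sketch.

```c
#include <netinet/in.h>
#include <stdint.h>

#define NUM_PORTS	(1U << 16)

/* Build the reverse mapping @rdelta from the forward port deltas @delta.
 * Both arrays have NUM_PORTS entries.
 */
static void invert_portmap(const in_port_t *delta, in_port_t *rdelta)
{
	unsigned int i;

	for (i = 0; i < NUM_PORTS; i++) {
		in_port_t d = delta[i];

		if (d) {
			/* Keep rport as in_port_t so the sum wraps at 16
			 * bits. Used directly as an index, i + d is wider
			 * than 16 bits and could exceed the array bounds.
			 */
			in_port_t rport = i + d;

			rdelta[rport] = NUM_PORTS - d;
		}
	}
}
```

For example, port 65535 with a delta of 2 maps to port 1, not to the
out of bounds index 65537.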
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
This reverts commit c80fa6a6bb, as it
reintroduces the issue fixed by commit 1e6f92b995 ("udp: Fix 16-bit
overflow in udp_invert_portmap()").
Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=80
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we daemonised, we can't use standard error. If we didn't, it's
rather annoying to have all those messages on standard error anyway,
and kind of pointless too, as the messages we wanted to print were
printed to standard error anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If a log file is configured, we would otherwise open a connection to
the system logger (if any), print any message that we might have
before we initialise the log file, and then keep that connection
around for no particular reason.
Call __openlog() as an alternative to the log file setup, instead.
This way, we might skip printing some messages during the
initialisation phase, but they're probably not really valuable to
have in a system log, and we're going to print them to standard
error anyway.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.
While at it, convert a few error messages from a scant perror style
to proper failure descriptions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
perror() prints directly to standard error, but in many cases standard
error might be already closed, or we might want to skip logging, based
on configuration. Our logging functions provide all that.
While at it, make errors more descriptive, replacing some of the
existing basic perror-style messages.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In many places, we have direct perror() calls, which completely bypass
logging functions and log files.
They are definitely convenient: offer similar convenience with
_perror() logging variants, so that we can drop those direct perror()
calls.
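As a rough sketch of the idea, not the actual log.c implementation: a
hypothetical helper, format_perror(), appends the errno description
the way perror() would, but writes into a buffer so that a logging
backend can decide where the message goes:

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Format "<message>: <strerror(errno)>" into buf. If the message alone
 * doesn't fit (or formatting fails), no error description is appended.
 */
static void format_perror(char *buf, size_t size, const char *fmt, ...)
{
	int errno_copy = errno;	/* the calls below might change errno */
	va_list ap;
	int n;

	assert(fmt != NULL);

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= size)
		return;

	snprintf(buf + n, size - (size_t)n, ": %s", strerror(errno_copy));
}
```

A caller would then pass buf to whatever logging function applies,
instead of relying on standard error being open.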
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
After commit 15001b39ef ("conf: set the log level much earlier"), we
had a phase during initialisation when messages wouldn't be printed to
standard error anymore.
Commit f67238aa86 ("passt, log: Call __openlog() earlier, log to
stderr until we detach") fixed that, but only for the case where no
log files are given.
If a log file is configured, vlogmsg() will not call passt_vsyslog(),
but during initialisation, LOG_PERROR is set, so to avoid duplicated
prints (which would result from passt_vsyslog() printing to stderr),
we don't call fprintf() from vlogmsg() either.
This is getting a bit too complicated. Instead of abusing LOG_PERROR,
define an internal logging flag that clearly represents that we're not
done with the initialisation phase yet.
If this flag is not set, make sure we always print to stderr, if the
log mask matches.
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We currently use a LOG_EMERG log mask to represent the fact that we
don't know yet what the mask resulting from configuration should be,
before the command line is parsed.
However, we have the necessity of representing another phase as well,
that is, configuration is parsed but we didn't daemonise yet, or
we're not ready for operation yet. The next patch will add that
notion explicitly.
Mapping these cases to further log levels isn't really practical.
Introduce boolean log flags to represent them, instead of abusing
log priorities.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The original behaviour of printing messages to standard error by
default when running from a non-interactive terminal was introduced
because the first KubeVirt integration draft used to start passt in
foreground and get messages via standard error.
For development purposes, the system logger was more convenient at
that point, and passt was running from interactive terminals only if
not started by the KubeVirt integration.
This behaviour was introduced by 84a62b79a2 ("passt: Also log to
stderr, don't fork to background if not interactive").
Later, I added command-line options in 1e49d194d0 ("passt, pasta:
Introduce command-line options and port re-mapping") and accidentally
reversed this condition, which wasn't a problem as --stderr could
force printing to standard error anyway (and it was used by KubeVirt).
Nowadays, the KubeVirt integration uses a log file (requested via
libvirt configuration), and the same applies for Podman if one
actually needs to look at runtime logs. There are no use cases left,
as far as I know, where passt runs in foreground in non-interactive
terminals.
Seize the chance to reintroduce some sanity here. If we fork to
background, standard error is closed, so --stderr is useless in that
case.
If we run in foreground, there's no harm in printing messages to
standard error, and that accidentally became the default behaviour
anyway, so --stderr is not needed in that case.
It would be needed for non-interactive terminals, but there are no
use cases, and if there were, let's log to standard error anyway:
the user can always redirect standard error to /dev/null if needed.
Before we're up and running, we need to print to standard error anyway
if something happens, otherwise we can't report failure to start in
any kind of usage, stand-alone or in integrations.
So, make --stderr do nothing, and deprecate it.
While at it, drop a left-over comment about --foreground being the
default only for interactive terminals, because it's not the case
anymore.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we don't run in foreground, we close standard error as we
daemonise, so it makes no sense to check if the controlling terminal
is an interactive terminal or if --force-stderr was given, to decide
if we want to log to standard error.
Make --force-stderr depend on --foreground.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On multiple occasions, especially when passt(1) and pasta(1) are used
in integrations such as the one with Podman, the ability to override
earlier options on the command line with later ones would have been
convenient.
Recently, to debug a number of issues happening with Podman, I would
have liked to ask users to share a debug log by passing --debug as
additional option, but pasta refuses --quiet (always passed by Podman)
and --debug at the same time.
On top of this, Podman lets users specify other pasta options in its
containers.conf(5) file, as well as on the command line.
The options from the configuration files are appended together with
the ones from the command line, which makes it impossible for users to
override options from the configuration file, if duplicated options
are refused, unless Podman takes care of sorting them, which is
clearly not sustainable.
For --debug and --trace, somebody took care of this on Podman side at:
https://github.com/containers/common/pull/2052
but this doesn't fix the issue with other options, and we'll have
anyway older versions of Podman around, too.
I think there's some value in telling users about duplicated or
conflicting options, because that might reveal issues in integrations
or accidental misconfigurations, but by now I'm fairly convinced that
the downsides outweigh this.
Drop checks about duplicate options and mutually exclusive ones. In
some cases, we need to also undo a couple of initialisations caused
by earlier options, but this looks like a simplification, overall.
Notable exception: --stderr still conflicts with --log-file, because
users might have the expectation that they don't actually conflict.
But they do conflict in the existing implementation, so it's safer
to make sure that the users notice that.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
If routing daemons set up host routes, for example FRR via OSPF as in
the reported issue, they might add nexthop identifiers (not objects)
that are generally not valid in the target namespace. Strip them off
as well, otherwise we'll get EINVAL from the kernel.
Link: https://github.com/containers/podman/issues/22960
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The SPDX identifier states GPL-2.0-or-later but the copyright section
mentions GPL-3.0 or later causing a mismatch.
Also, only correctly refers to GPL instead of AGPL.
Signed-off-by: Danish Prakash <contact@danishpraka.sh>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To implicitly resolve possible dependencies between routes as we
duplicate them into the target namespace, we go through a set of n
routes n times, and ignore EEXIST responses to netlink messages (we
already inserted the route) and ENETUNREACH (we didn't insert the
route yet, but we need to insert another one first).
Until now, we didn't ignore EHOSTUNREACH responses. However,
NetworkManager users with multiple non-subnet routes for the same
interface report that pasta exits with "no route to host" while
duplicating routes.
This happens because NetworkManager sets the 'noprefixroute' attribute
on addresses, meaning that the kernel won't create subnet routes
automatically depending on the prefix length of the address. We copy
this attribute as we copy the address into the target namespace, and
as a result, the kernel doesn't create subnet routes in the target
namespace either.
This means that the gateway for routes that are inserted later can be
unreachable at some points during the sequence of route duplication.
That is, we don't just have dependencies between regular routes, but
we can also have dependencies between regular routes and subnet
routes, as subnet routes are not automatically inserted in advance.
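The retry scheme can be sketched with a toy model, where the netlink
request is replaced by a stand-in function and each route may depend
on another one being present first. All names here are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_route {
	int needs;	/* index of a route that must exist first, or -1 */
	bool present;	/* already inserted in the target namespace */
};

/* Stand-in for the netlink request: returns 0 or a negative errno */
static int toy_route_add(struct toy_route *r, size_t i)
{
	if (r[i].present)
		return -EEXIST;
	if (r[i].needs >= 0 && !r[r[i].needs].present)
		return -EHOSTUNREACH;	/* gateway not reachable yet */

	r[i].present = true;
	return 0;
}

/* Go through n routes n times: each pass inserts at least one more route
 * of any dependency chain, whatever the order of the original dump.
 */
static int toy_route_dup(struct toy_route *r, size_t n)
{
	size_t pass, i;

	assert(r != NULL || n == 0);

	for (pass = 0; pass < n; pass++) {
		for (i = 0; i < n; i++) {
			int rc = toy_route_add(r, i);

			if (rc && rc != -EEXIST && rc != -ENETUNREACH &&
			    rc != -EHOSTUNREACH)
				return rc;	/* anything else is fatal */
		}
	}

	return 0;
}
```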
Link: https://github.com/containers/podman/issues/22824
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
While commit f919dc7a4b ("conf, netlink: Don't require a default
route to start") sounded reasonable in the assumption that, if we
don't find default routes for a given address family, we can still
proceed by selecting an interface with any route *iff it's the only
one for that protocol family*, Jelle reported a further issue in a
similar setup.
There, multiple interfaces are present, and while remote container
connectivity doesn't matter for the container, local connectivity is
desired. There are no default routes, but those multiple interfaces
all have non-default routes, so we should just pick one and start.
Pick the first interface reported by the kernel with any route, if
there are no default routes. There should be no harm in doing so.
Reported-by: Jelle van der Waa <jvanderwaa@redhat.com>
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Commit e1a2e2780c ("tcp: Check if connection is local or low RTT
was seen before using large MSS") added a call to bind() before we
issue a connect() to the target for an outbound connection.
If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCES, we
can conclude that the target address is a local (host) address, and we
can use an unlimited MSS.
While at it, according to the reasoning of that commit, if bind()
succeeds, we would know right away that nobody is listening at that
(local) address and port, and we don't even need to call connect(): we
can just fail early and reset the connection attempt.
But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or
net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will
actually succeed, so we can't rely on it to fail in general.
The visible issue with the existing behaviour is that we would reset
any outbound connection to non-local addresses, if non-local binds are
enabled.
Keep the significant optimisation for local addresses along with the
bind() call, but if it succeeds, don't draw any conclusion: close the
socket, grab another one, and proceed normally.
This will incur a small latency penalty if non-local binds are
enabled (we'll likely fetch an existing socket from the pool but
additionally call close()), or if the target is local but not bound:
we'll need to call connect() and get a failure before relaying that
failure back.
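The resulting decision can be summarised by a small sketch. The enum
and function names are made up for illustration, and the real code
operates on sockets rather than on bare return values:

```c
#include <assert.h>
#include <errno.h>

enum bind_hint {
	BIND_HINT_NONE,		/* no conclusion: connect() normally */
	BIND_HINT_LOCAL,	/* target looks local: large MSS is fine */
};

/* Interpret the probing bind(): rc is its return value, err the errno it
 * set. Success is deliberately not taken to mean "nobody is listening",
 * because non-local binds might be enabled.
 */
static enum bind_hint bind_probe_hint(int rc, int err)
{
	assert(rc == 0 || rc == -1);	/* bind() returns nothing else */

	if (rc == 0)
		return BIND_HINT_NONE;

	if (err == EADDRNOTAVAIL || err == EACCES)
		return BIND_HINT_NONE;

	return BIND_HINT_LOCAL;
}
```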
Link: https://github.com/containers/podman/issues/23003
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In fc8f0f8c ("siphash: Use incremental rather than all-at-once siphash
functions") we removed the older interface to the SipHash implementation,
which took fixed sized blocks of data. However, we forgot to remove the
prototypes for those functions, so do that now.
Fixes: fc8f0f8c48 ("siphash: Use incremental rather than all-at-once siphash functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Mostly, udp_sock_handler() is independent of how the datagrams it processes
will be forwarded (tap or splice). However, it also updates the msg_name
fields for spliced sends, which doesn't really make sense here. Move it
into udp_splice_send() which is all about spliced sends. This does
potentially mean we'll update the field to the same value several times,
but we're going to need this in future anyway: with the extensions the
flow table allows, it might not be the same value each time after all.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_handler() takes a number of datagrams from sockets that depending
on their addresses could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced"). In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.
To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The determination of what the target is is made by udp_mmh_splice_port().
It returns the source port for splice packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.
udp_mmh_splice_port() is moderately expensive, and unfortunately we
can call it twice on the same datagram: once as the (last + 1) entry
in one batch (to check it's not in that batch), then again as the
first entry in the next batch.
Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard. It will also generalise better
when we have more complex possibilities based on the flow table.
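A simplified sketch of the look-ahead, assuming a toy classification
function in place of udp_mmh_splice_port() and a hypothetical
splice_port metadata field:

```c
#include <assert.h>
#include <stddef.h>

struct toy_meta {
	int splice_port;	/* cached classification, once filled in */
};

static unsigned classify_calls;	/* counts invocations, for illustration */

/* Stand-in for the moderately expensive per-datagram classification */
static int toy_classify(const int *ports, size_t i)
{
	classify_calls++;
	return ports[i];
}

/* Return the length of the batch starting at 'start'. meta[start] must be
 * filled in already; on return, meta[start + len] is filled in as well (if
 * it exists), ready to be the first entry of the next batch.
 */
static size_t toy_batch_len(struct toy_meta *meta, const int *ports,
			    size_t start, size_t n)
{
	size_t i = start;

	assert(start < n);

	do {
		if (++i >= n)
			break;
		meta[i].splice_port = toy_classify(ports, i);
	} while (meta[i].splice_port == meta[start].splice_port);

	return i - start;
}

/* Walk the whole queue, returning the number of batches found */
static size_t toy_count_batches(struct toy_meta *meta, const int *ports,
				size_t n)
{
	size_t i = 0, batches = 0;

	if (!n)
		return 0;

	meta[0].splice_port = toy_classify(ports, 0);
	while (i < n) {
		i += toy_batch_len(meta, ports, i, n);
		batches++;
	}

	return batches;
}
```

In this model, each datagram is classified exactly once, regardless of
where the batch boundaries fall.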
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_mmh_splice_port() is used to determine if a UDP datagram can be
"spliced" (forwarded via a socket instead of tap). We only invoke it if
the origin socket has the 'splice' flag set.
Fold the checking of the flag into the helper itself, which makes the
caller simpler. It does mean we have a loop looking for a batch of
spliceable or non-spliceable packets even in the case where the flag is
clear. This shouldn't be that expensive though, since each call to
udp_mmh_splice_port() will return without accessing memory in that case.
In any case we're going to need a similar loop in more cases with upcoming
flow table work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
sock_l4() creates, binds and otherwise prepares a new socket. It builds
the socket address to bind from separately provided address and port.
However, we have use cases coming up where it's more natural to construct
the socket address in the caller.
Prepare for this by adding sock_l4_sa() which takes a pre-constructed
socket address, and rewriting sock_l4() in terms of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
buf_size is set to sizeof(pkt_buf) by default. And it seems more correct
to provide the actual size of the buffer.
Later a buf_size of 0 will allow vhost-user mode to detect
guest memory buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It was needed by a draft version of vhost-user, but it is not needed
anymore.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As we are going to introduce MODE_VU, which will act like
MODE_PASST, compare to MODE_PASTA rather than adding a comparison
to MODE_VU wherever we check for MODE_PASST.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions
to improve code portability by replacing the udp_meta_t parameter with
more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr)
and the source socket address (sockaddr_in/sockaddr_in6).
It also moves the tap_hdr_update() function call inside the udp_tap_send()
function, so that we don't have to pass the TAP header to udp_update_hdr4()
and udp_update_hdr6().
This refactor reduces complexity by making the functions more modular and
ensuring that each function operates on more narrowly scoped data structures.
This will facilitate future backend introduction like vhost-user.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(),
and tap4_handler() and tap6_handler() into tap_handler().
Create a generic tap_add_packet() to consolidate packet
addition logic and reduce code duplication.
The purpose is to ease the export of these functions to use
them with the vhost-user backend.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c
With this change we can use existing TCP functions with a
different kind of memory storage as for instance the shared
memory provided by the guest via vhost-user.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This commit isolates the internal data structure management used for storing
data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[],
tcp4_flags[], ...) from the tcp_send_flag() function. The extracted
functionality is relocated to a new function named tcp_fill_flag_header().
tcp_fill_flag_header() is now a generic function that accepts parameters such
as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to
pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[].
This separation sets the stage for utilizing tcp_prepare_flags() to
set the memory provided by the guest via vhost-user in future developments.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have several functions which are used as callbacks for NS_CALL() which
only read their void * parameter, they don't write it. The
constParameterCallback warning in cppcheck 2.14.1 complains that this
parameter could be const void *, also pointing out that that would require
casting the function pointer when used as a callback.
Casting the function pointers seems substantially uglier than using a
non-const void * as the parameter, especially since in each case we cast
the void * to a const pointer of specific type immediately. So, suppress
these errors.
I think it would make logical sense to suppress this globally, but that
would cause unmatchedSuppression errors on earlier cppcheck versions. So,
instead individually suppress it, along with unmatchedSuppression in the
relevant places.
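For illustration, an individual suppression could look like the
following sketch. The callback type and function names are invented
stand-ins for what NS_CALL() uses, and inline suppressions are assumed
to be enabled with --inline-suppr:

```c
#include <assert.h>
#include <stddef.h>

/* Callback type with a non-const void * argument (stand-in only) */
typedef int (*toy_cb_t)(void *arg);

struct toy_arg {
	int value;
};

/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int toy_read_only_cb(void *arg)
{
	/* Cast to a const pointer of the specific type right away */
	const struct toy_arg *a = (const struct toy_arg *)arg;

	return a->value;
}

/* Invoke the callback without any function pointer cast */
static int toy_call(toy_cb_t fn, void *arg)
{
	assert(fn != NULL);

	return fn(arg);
}
```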
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Allow access to user_devpts.
$ pasta --version
pasta 0^20240510.g7288448-1.fc40.x86_64
...
$ awk '' < /dev/null
$ pasta --version
$
While this might be an awk bug, it appears pasta should still have access
to devpts.
Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Flow table entries need to be fully initialised before returning to the
main epoll loop. Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().
Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.
I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.
Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
timespec_diff_ms() returns an int representing a duration in milliseconds.
This will overflow in about 25 days when an int is 32 bits. The way we
use this function, we're probably not going to get a result that long, but
it's not outrageously implausible. Use a long for safety.
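A sketch of such a helper, assuming a 64-bit long as on common 64-bit
Linux targets. This is not the actual function body, and the toy_ name
is used to mark that:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference a - b in milliseconds. A 32-bit int would overflow after
 * 2147483647 ms, that is, a bit less than 25 days.
 */
static long toy_timespec_diff_ms(const struct timespec *a,
				 const struct timespec *b)
{
	long long ns;

	assert(a != NULL && b != NULL);

	/* Work in nanoseconds first, so that a negative tv_nsec difference
	 * is absorbed by the seconds term before we scale down.
	 */
	ns = (long long)(a->tv_sec - b->tv_sec) * 1000000000LL +
	     (a->tv_nsec - b->tv_nsec);

	return (long)(ns / 1000000);
}
```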
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Functions and structures in lineread.c use plain int to record and report
the length of lines we receive. This means we truncate the result from
read(2) in some circumstances. Use ssize_t to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In conf() we parse a MAC address in two places, for the --ns-mac-addr and
the -M options. As well as duplicating code, the logic for this parsing
has several bugs:
* The most serious is that if the given string is shorter than a MAC
address should be, we'll access past the end of it.
* We don't check the endptr supplied by strtol() which means we could
ignore certain erroneous contents
* We never check the separator characters between each octet
* We ignore certain sorts of garbage that follow the MAC address
Correct all these bugs in a new parse_mac() helper.
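A sketch of what such a helper might check, under the assumption of
the usual colon-separated notation. This is an illustration, not the
parse_mac() added by this patch:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ETH_ALEN	6

/* Parse "xx:xx:xx:xx:xx:xx" into mac[]. Returns false on short input, bad
 * digits, over-long octets, wrong separators or trailing characters.
 */
static bool toy_parse_mac(unsigned char mac[TOY_ETH_ALEN], const char *str)
{
	const char *p = str;
	int i;

	assert(mac != NULL && str != NULL);

	for (i = 0; i < TOY_ETH_ALEN; i++) {
		unsigned long b;
		char *end;

		/* strtoul() would skip whitespace and accept a sign */
		if (!isxdigit((unsigned char)*p))
			return false;

		b = strtoul(p, &end, 16);
		if (end - p > 2 || b > 0xff)
			return false;	/* also rejects a "0x" prefix */

		/* Check the separator, or the end of string after the last
		 * octet: we never read past a terminating NUL.
		 */
		if (*end != (i == TOY_ETH_ALEN - 1 ? '\0' : ':'))
			return false;

		mac[i] = (unsigned char)b;
		p = end + 1;
	}

	return true;
}
```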
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A negative bit index in a bitmap doesn't make sense. Avoid this by
construction by using unsigned indices. While we're there adjust
bitmap_isset() to return a bool instead of an int.
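As an illustration, with helper names and word size chosen for this
sketch rather than taken from util.c:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORD(bit)		((bit) / TOY_BITS_PER_WORD)
#define TOY_MASK(bit)		(1UL << ((bit) % TOY_BITS_PER_WORD))

/* An unsigned index can't be negative, by construction */
static void toy_bitmap_set(unsigned long *map, unsigned bit)
{
	assert(map != NULL);
	map[TOY_WORD(bit)] |= TOY_MASK(bit);
}

static void toy_bitmap_clear(unsigned long *map, unsigned bit)
{
	assert(map != NULL);
	map[TOY_WORD(bit)] &= ~TOY_MASK(bit);
}

/* The answer is a truth value, so return it as bool */
static bool toy_bitmap_isset(const unsigned long *map, unsigned bit)
{
	assert(map != NULL);
	return !!(map[TOY_WORD(bit)] & TOY_MASK(bit));
}
```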
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We globally disabled this, with a justification lumped together with
several checks about braces. They don't really go together, the others
are essentially a stylistic choice which doesn't match our style. Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.
We've only gotten away with the macros which trigger the warning because,
given our other conventions, it's been unlikely to invoke them with
anything other than a simple identifier. Fix the macros, and enable the
warning for the future.
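A minimal example of the kind of bug this guards against, using
made-up macros rather than the ones fixed here:

```c
#include <assert.h>

/* Unparenthesised parameter: TOY_HALF_BAD(a + b) expands to (a + b / 2) */
#define TOY_HALF_BAD(x)		(x / 2)

/* Parenthesised parameter: TOY_HALF_OK(a + b) expands to ((a + b) / 2) */
#define TOY_HALF_OK(x)		((x) / 2)

/* Same argument, different results once an expression is passed */
static void toy_half_example(void)
{
	assert(TOY_HALF_BAD(4 + 4) == 6);	/* 4 + 4 / 2 */
	assert(TOY_HALF_OK(4 + 4) == 4);	/* (4 + 4) / 2 */
}
```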
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The 'c' parameter is always passed exactly 'c'. The 'now' parameter is
always passed exactly 'now'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
cppcheck 2.14.1 complains about the rport variable not being in as small
a scope as it could be. It's also only used once, so we might as well
just open code the calculation for it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The th pointer could be const, which causes a cppcheck warning on at least
some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40).
Fixes: e84a01e94c ("tcp: move seq_to_tap update to when frame is queued")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we've simplified how usage() works, nothing ever sets the
log_to_stdout flag. Eliminate it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The message from usage() when given invalid options, or the -h / --help
option is currently printed by many calls to the info() function, also
used for runtime logging of informational messages.
That isn't useful: the usage message should always go to the terminal
(stdout or stderr), never syslog or a logfile. It should never be
filtered by priority. Really the only thing using the common logging
functions does is give more opportunities for something to go wrong.
Replace all the info() calls with direct fprintf() calls. This does mean
manually adding "\n" to each message. A little messy, but worth it for the
simplicity in other dimensions. While we're there make much heavier use
of single strings containing multiple lines of output text. That reduces
the number of fprintf calls, reducing visual clutter and making it easier
to see what the output will look like from the source.
Link: https://bugs.passt.top/show_bug.cgi?id=90
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
usage() does nothing but call print_usage() with EXIT_FAILURE as a
parameter. It's no more complex to just give that parameter at the single
call site. Eliminate it and rename print_usage() to just usage().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Commit a469fc393f ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
immediately can make a new attempt to transmit a frame after a failed
transmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.
This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.
We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had to be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now:
- we don't open the PID file in main() anymore
- PID file and AF_UNIX socket are opened by pidfile_open() and
tap_sock_unix_open()
- write_pidfile() becomes pidfile_write()
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
We have pidfile_fd now, pid_file_fd would be quite ugly.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.
Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
We won't call it from main() any longer: move it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
We'll need to open and bind the socket a while before listening to it,
so split that into two different functions. No functional changes
intended.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.
Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.
Drop the -1 initialisation and the corresponding check.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It has nothing to do with tap_sock_unix_init(). It used to be there as
that function could be called multiple times per passt instance, but
it's not the case anymore.
This also takes care of the fact that, with --fd, we wouldn't set the
initial MAC address, so we would need to wait for the guest to send us
an ARP packet before we could exchange data.
Fixes: 6b4e68383c ("passt, tap: Add --fd option")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
libguestfs tools have a good reason to run as root: if the guest image
is owned by root, it would be counterproductive to encourage users to
invoke them as non-root, as it would require changing permissions or
ownership of the image file.
And if they run as root, we'll start as root, too. Warn users we'll
switch to 'nobody', but don't tell them what to do.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
When we retrieve or copy host addresses we can include deprecated
addresses, which is not what we want. Adjust our logic to exclude them.
Similarly our tests can retrieve deprecated addresses, so exclude them
there too.
I hit this in practice because my router sometimes temporarily advertises
an fd00:: prefix before the real delegated IPv6 prefix. The deprecated
address can hang around for some time messing up my tests.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one. Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have no generic information about flows apart from the type and
state, everything else is specific to the flow type. Start introducing
generic flow information by recording the pifs which the flow connects.
To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and TGT for
when both sides' information is complete, but we haven't chosen the
flow type yet. For now, these states don't do an awful lot, but
they'll become more important as we add more generic information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code: TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the target side. ICMP also uses 0 for the host/"sock" side and 1 for the
guest/"tap" side, but in its case the latter is always also the initiating
side.
Make this generically consistent by always using side 0 for the initiating
side and 1 for the target side. This doesn't simplify a lot for now, and
arguably makes TCP slightly more complex, since we add an extra field to
the connection structure to record which is the guest facing side. This is
an interim change, which we'll be able to remove later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Flows move through several different states in their lifetime. The rules for
these are documented in comments, but they're pretty complex and a number
of the transitions are implicit, which makes this pretty fragile and
error prone.
Change the code to explicitly track the states in a field. Make all
transitions explicit and logged. To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.
While we're at it, tweak the docs to clarify the restrictions on each state
a bit.
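
As a rough illustration only, explicit state tracking could look like the
sketch below. The state names, struct flow_sketch and flow_set_state_sketch()
are hypothetical and not taken from the flow code; assert() stands in for
ASSERT().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical subset of flow states, for illustration only */
enum flow_state_sketch {
	FLOW_SKETCH_FREE,
	FLOW_SKETCH_NEW,
	FLOW_SKETCH_ACTIVE,
};

struct flow_sketch {
	enum flow_state_sketch state;
};

/* Make a transition explicit: check the state we expect to leave, log the
 * change, then record the new state in the field
 */
static void flow_set_state_sketch(struct flow_sketch *f,
				  enum flow_state_sketch from,
				  enum flow_state_sketch to)
{
	assert(f->state == from);	/* stands in for ASSERT() */
	fprintf(stderr, "flow: state %d -> %d\n", (int)from, (int)to);
	f->state = to;
}
```

A caller names both the state it expects to leave and the one it enters, so
an unexpected transition fails at the point where it happens.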
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This adds some extra inany helpers for comparing an inany address to
addresses of a specific family (including special addresses), and building
an inany from an IPv4 address (either statically or at runtime).
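
A minimal sketch of what such helpers might look like, assuming IPv4
addresses are stored in IPv4-mapped IPv6 form. The union layout and both
function names are assumptions for illustration, not the actual helpers.

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical IPv4-or-IPv6 address: IPv4 kept as ::ffff:a.b.c.d */
union inany_sketch {
	struct in6_addr a6;
	struct {
		unsigned char zero[10];
		unsigned char one[2];
		struct in_addr a4;
	} v4mapped;
};

/* Build an address from an IPv4 address at runtime */
static union inany_sketch inany_sketch_from_v4(struct in_addr a4)
{
	union inany_sketch aa;

	memset(&aa, 0, sizeof(aa));
	memset(aa.v4mapped.one, 0xff, sizeof(aa.v4mapped.one));
	aa.v4mapped.a4 = a4;

	return aa;
}

/* Compare against a specific IPv4 address: true only for a mapped match */
static int inany_sketch_equals4(const union inany_sketch *aa,
				const struct in_addr *a4)
{
	return IN6_IS_ADDR_V4MAPPED(&aa->a6) &&
	       aa->v4mapped.a4.s_addr == a4->s_addr;
}
```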
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The flow code dispatches deferred and timer handling for flows centrally, but
needs to call into protocol specific code for the handling of individual
flows. Currently this passes a general union flow *. It makes more sense
to pass the specific relevant flow type structure. That brings the check
on the flow type adjacent to casting to the union variant which it tags.
Arguably, this is a slight abstraction violation since it involves the
generic flow code using protocol specific types. It's already calling into
protocol specific functions, so I don't think this really makes any
difference.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler. For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.
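
For illustration, a helper of this kind could be sketched as follows. The
name sockaddr_ntop_sketch(), its signature and the output format are
assumptions; the actual sockaddr_ntop() may differ.

```c
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Format an IPv4 or IPv6 socket address as "ADDR:PORT" or "[ADDR]:PORT".
 * Returns dst, or NULL for other families or if dst is too small.
 */
static const char *sockaddr_ntop_sketch(const void *sa, char *dst, size_t size)
{
	sa_family_t family = ((const struct sockaddr *)sa)->sa_family;
	char addr[INET6_ADDRSTRLEN];
	int n;

	if (family == AF_INET) {
		const struct sockaddr_in *sa4 = sa;

		if (!inet_ntop(AF_INET, &sa4->sin_addr, addr, sizeof(addr)))
			return NULL;
		n = snprintf(dst, size, "%s:%u", addr,
			     (unsigned)ntohs(sa4->sin_port));
	} else if (family == AF_INET6) {
		const struct sockaddr_in6 *sa6 = sa;

		if (!inet_ntop(AF_INET6, &sa6->sin6_addr, addr, sizeof(addr)))
			return NULL;
		n = snprintf(dst, size, "[%s]:%u", addr,
			     (unsigned)ntohs(sa6->sin6_port));
	} else {
		return NULL;
	}

	return (n < 0 || (size_t)n >= size) ? NULL : dst;
}
```

An error path can then print the result with a single "%s" instead of
picking apart the address and port fields by hand.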
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Commit b686afa2 introduced the invalid apparmor rule
`mount options=(rw, runbindable) /,` since runbindable mount rules
cannot have a source.
Therefore running aa-logprof/aa-genprof will trigger errors (see
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685)
$ sudo aa-logprof
ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/')
This patch fixes it to the intended behavior.
Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685
Fixes: b686afa23e ("apparmor: Explicitly pass options we use while remounting root filesystem")
Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For some unknown reason "owner" makes it impossible to open bind mounted
netns references as apparmor denies it. In the kernel denied log entry
we see ouid=0 but it is not clear why that is as the actual file is
owned by the real (rootless) user id.
In abstractions/pasta there is already `@{run}/user/@{uid}/**` without
owner set for the same reason as this path contains the netns path by
default when running under Podman.
Fixes: 72884484b0 ("apparmor: allow read access on /tmp for pasta")
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clang-tidy 18.1.1 in Fedora 40 complains about every #define of an integral
value, suggesting it be converted to an enum. Although that's certainly
possible, it's of dubious value and results in some awkward arrangements on
our codebase. Suppress it globally.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In conf() we temporarily set the forwarding mode variables to 0 - an
invalid value, so that we can check later if they've been set by the
intervening logic. clang-tidy 18.1.1 in Fedora 40 now complains about
this. Satisfy it by giving a name in the enum to the 0 value.
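
A minimal sketch of the idea, with hypothetical names: enum fwd_mode_sketch,
its values and fwd_mode_default_sketch() are assumptions for illustration,
not necessarily what conf() uses.

```c
/* Hypothetical forwarding modes: the 0 value gets an explicit name */
enum fwd_mode_sketch {
	FWD_SKETCH_UNSET = 0,	/* placeholder: not configured yet */
	FWD_SKETCH_NONE,
	FWD_SKETCH_AUTO,
	FWD_SKETCH_ALL,
};

/* Apply a default only if the intervening logic left the mode unset */
static enum fwd_mode_sketch
fwd_mode_default_sketch(enum fwd_mode_sketch mode, enum fwd_mode_sketch dflt)
{
	return mode == FWD_SKETCH_UNSET ? dflt : mode;
}
```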
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We take care of this in nl_addr_dup(): if the interface index
associated to an address doesn't match the selected host interface
(ifa->ifa_index != ifi_src), we don't copy that address.
But for routes, we just unconditionally update the interface index to
match the index in the target namespace, even if the source interface
didn't match.
This might happen in two cases: with a pre-4.20 kernel without support
for NETLINK_GET_STRICT_CHK, which won't filter routes based on the
interface we pass in the request, as reported by runsisi, and any
kernel with support for multipath routes where any of the nexthops
refers to an unrelated host interface.
In both cases, check the index of the source interface, and avoid
copying unrelated routes.
Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=86
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: runsisi <runsisi@hust.edu.cn>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The podman CI on debian runs tests based on /tmp but pasta is failing
there because it is unable to open the netns path as the open for read
access is denied.
Link: https://github.com/containers/podman/issues/22625
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(),
from pipe to receiving socket, that doesn't necessarily mean that the
pipe is empty: the receiver buffer might be full instead.
Hence, we can't use the 'never_read' flag to decide that there's
nothing to wait for: even if we didn't read anything from the sending
side in a given iteration, we might still have data to send in the
pipe. Use read/written counters, instead.
This fixes an issue where large bulk transfers would occasionally
hang. From a corresponding strace:
0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1
0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0
0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0
0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1048576
0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 520415
0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
0.000044 epoll_wait(4, [], 8, 1000) = 0
we're reading from socket 211 into the pipe end numbered 151,
which connects to pipe end 150, and from there we're writing into
socket 115.
We initially drop EPOLLOUT from the set of monitored flags for socket
115, because it already signaled it's ready for output. Then we read
nothing from socket 211 (the sender had nothing to send), and we keep
emptying the pipe into socket 115 (first 1048576 bytes, then 520415
bytes).
This call of tcp_splice_sock_handler() ends with EAGAIN on the writing
side, and we just exit this function without setting the OUT_WAIT_1
flag (and, in turn, EPOLLOUT for socket 115). However, it turns out,
the pipe wasn't actually emptied, and while socket 211 had nothing
more to send, we should have waited on socket 115 to be ready for
output again.
As a further step, we could consider not clearing EPOLLOUT at all,
unless the read/written counters match, but I'm first trying to fix
this ugly issue with a minimal patch.
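
The decision described above can be sketched as below, assuming running
per-connection totals of bytes read into and written out of the pipe. The
struct and function names are hypothetical, not the ones used in
tcp_splice_sock_handler().

```c
#include <stddef.h>

/* Hypothetical running totals for one direction of a spliced connection */
struct splice_count_sketch {
	size_t read;	/* bytes moved from the sending socket into the pipe */
	size_t written;	/* bytes moved from the pipe to the receiving socket */
};

/* On EAGAIN from the pipe-to-socket splice(): the pipe still holds data
 * whenever the totals differ, so we need to wait for EPOLLOUT on the
 * receiving socket, no matter whether this iteration read anything
 */
static int splice_need_out_wait_sketch(const struct splice_count_sketch *c)
{
	return c->read != c->written;
}
```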
Link: https://github.com/containers/podman/issues/22575
Link: https://github.com/containers/podman/issues/22593
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Currently we have separate arrays for IPv4 and IPv6 which contain the
headers for guest-bound packets, and also the originating socket address.
We can combine these into a single array of "metadata" structures with
space for both pre-cooked IPv4 and IPv6 headers, as well as shared space
for the tap specific header and socket address (using sockaddr_inany).
Because we're using IOVs to separately address the pieces of each frame,
these structures don't need to be packed to keep the headers contiguous
so we can more naturally arrange for the alignment we want.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently each tap-bound frame buffer has room for its own ethernet header.
However the ethernet header is always the same for such frames, so now
that we're indirectly referencing the ethernet header via iov, we can use
a single buffer for all of them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently the IPv4 and IPv6 paths unnecessarily use different buffers for
the UDP payload. Now that we're handling the various pieces of the UDP
packets with an iov, we can split the payload part of the buffers off into
its own array shared between IPv4 and IPv6. As well as saving a little
memory, this allows the payload buffers to be neatly page aligned.
With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing
as each other and can also be merged. Likewise udp[46]_iov_splice can be
merged together.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For IPv4, UDP checksums are optional and can just be set to 0.
udp_update_hdr4() ignores the checksum field entirely. Since these are set
to 0 during startup, this works as intended for now.
However, we'd like to share payload and UDP header buffers between IPv4 and
IPv6, which does calculate UDP checksums. Therefore, for robustness, we
should explicitly set the checksum field to 0 for guest-bound UDP packets.
In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums
to be calculated as a compile time option. For consistency, use the same
thing in the udp_update_hdr4() path, which will typically initialize to 0,
but calculate a real checksum if configured to do so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We're going to introduce more sharing between the IPv4 and IPv6 buffer
structures. Prepare for this by combining the initialisation functions.
While we're at it remove the misleading "sock" from the name since these
initialise both tap side and sock side structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When sending to the tap device, currently we assemble the headers and
payload into a single contiguous buffer. Those are described by a single
struct iovec, then a batch of frames is sent to the device with
tap_send_frames().
In order to better integrate the IPv4 and IPv6 paths, we want the IP
header in a different buffer that might not be contiguous with the
payload. To prepare for that, split the UDP packet into an iovec of
buffers. We use the same split that Laurent recently introduced for
TCP for convenience.
This removes the last use of tap_hdr_len_(), tap_frame_base() and
tap_frame_len(), so remove those too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
During some debugging recently, I wanted to extract a file from a test
guest and found it was tricky, since the ssh-over-vsock setup we had didn't
allow sftp/scp. We can fix this by adding a line to the guest side sshd
config from mbuto. While we're there correct an inaccurate comment.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
* Removes a little redundancy between the tcp_send_flag() and
tcp_data_to_tap() path
* Makes calculation of the correct length a little easier
* Removes the now misleadingly named 'vnet_len' variable in
tcp_send_flag()
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code. In many of those cases, and a few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields. We can make initializing such iovs shorter and
clearer with a macro for the purpose.
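
One way such a macro could be written is sketched below. The macro name, the
example header type and fill_iov_sketch() are assumptions made for
illustration only.

```c
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical macro: an iovec exactly covering the given lvalue */
#define IOV_OF_LVALUE_SKETCH(lval) \
	(struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) }

/* Made-up header type, only here to have something to point at */
struct hdr_sketch {
	uint16_t sport, dport;
};

/* Fill a two-entry vector covering a length field and a header */
static void fill_iov_sketch(struct iovec iov[2], struct hdr_sketch *h,
			    uint32_t *len)
{
	iov[0] = IOV_OF_LVALUE_SKETCH(*len);
	iov[1] = IOV_OF_LVALUE_SKETCH(*h);
}
```

Because base and length both come from the same expression, they can't get
out of sync when the type of the variable or field changes.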
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.
The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length. We can just use the L4 length throughout.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At various points we need to track the lengths of a packet including or
excluding various different sets of headers. We don't always use the same
variable names for doing so. Worse, in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.
To improve clarity, standardise on these names:
dlen: L4 protocol payload length ("data length")
l4len: dlen + length of L4 protocol header
l3len: l4len + length of IPv4/IPv6 header
l2len: l3len + length of L2 (ethernet) header
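
As a worked example of how these names relate, assuming TCP over IPv4 with
no IP or TCP options (so 20-byte headers) and a 14-byte Ethernet header. The
struct, constants and function are hypothetical, for illustration only.

```c
#include <stddef.h>

/* Header sizes assumed for this example: no IP or TCP options */
#define ETH_HLEN_SKETCH		14
#define IPV4_HLEN_SKETCH	20
#define TCP_HLEN_SKETCH		20

struct pkt_lens_sketch {
	size_t dlen;	/* L4 payload only */
	size_t l4len;	/* dlen + TCP header */
	size_t l3len;	/* l4len + IPv4 header */
	size_t l2len;	/* l3len + Ethernet header */
};

/* Derive each length from the previous layer's length */
static struct pkt_lens_sketch tcp4_lens_sketch(size_t dlen)
{
	struct pkt_lens_sketch l;

	l.dlen = dlen;
	l.l4len = l.dlen + TCP_HLEN_SKETCH;
	l.l3len = l.l4len + IPV4_HLEN_SKETCH;
	l.l2len = l.l3len + ETH_HLEN_SKETCH;

	return l;
}
```

With a 100-byte payload this gives l4len 120, l3len 140 and l2len 154.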
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
csum_ip4_header() takes the packet length as a network endian value. In
general it's very error-prone to pass non-native-endian values as a raw
integer. It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.
It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.
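
To illustrate the convention, here is a sketch of an IPv4 header checksum
where every argument is in host order. It is not the actual
csum_ip4_header(): the name, the parameters, and the fixed header fields
(IHL 5, TOS 0, ID 0, DF set) are assumptions.

```c
#include <stdint.h>

/* All arguments are host order. The result is the numeric value of the
 * checksum field: the caller stores it in the header with htons().
 */
static uint16_t ip4_header_csum_sketch(uint16_t l3len, uint8_t ttl,
				       uint8_t proto, uint32_t saddr,
				       uint32_t daddr)
{
	uint32_t sum = 0x4500;		/* version 4, IHL 5, TOS 0 */

	sum += l3len;			/* total length */
	sum += 0x4000;			/* ID 0 adds nothing; flags: DF */
	sum += ((uint32_t)ttl << 8) | proto;
	sum += (saddr >> 16) + (saddr & 0xffff);
	sum += (daddr >> 16) + (daddr & 0xffff);

	/* Fold carries back in, as one's complement addition requires */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

With this shape, no caller handles a byte-swapped length as a plain integer:
the only conversion left is the htons() on the final result.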
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In general, it's much less error-prone to have the endianness of values
implied by the type, rather than just noting it in comments. We can't
always easily avoid it, because C, but we can do so when possible. struct
in_addr and in6_addr are always encoded network endian, so noting it
explicitly isn't useful. Remove them.
In some cases we also have endianness notes on uint8_t parameters, which
doesn't make sense: for a single byte endianness is irrelevant. Remove
those too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use of these structures was removed in bb70811183 ("treewide: Packet
abstraction with mandatory boundary checks"). Remove the stale
declarations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In some places (well, actually only UDP now) we use struct tap_hdr to
represent both tap backend specific and L2 ethernet headers. Handling
these together seemed like a good idea at the time, but Laurent's changes
in the TCP code working towards vhost-user support suggest that treating
them separately is more useful, more often.
Alter struct tap_hdr to represent only the TAP backend specific headers.
Update related helpers and the UDP code to match.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
7df624e79 ("checksum: introduce functions to compute the header part
checksum for TCP/UDP") introduced a helper to compute the partial checksum
for the IPv6 pseudo-header used in L4 protocol checksums. It used it in
csum_udp6() for UDP packets, but not for the identical calculation in
csum_icmp6() for ICMPv6 packets.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Somewhat confusingly, RTNH_NEXT(), as defined by <linux/rtnetlink.h>,
doesn't take an attribute length parameter like RTA_NEXT() does, and
I just modelled loops over nexthops after RTA loops, forgetting to
decrease the remaining length we pass to RTNH_OK().
In practice, this didn't cause issues in any of the combinations I
checked, at least without the next patch.
We seem to be the only user of RTNH_OK(): even iproute2 has an
open-coded version of it in print_rta_multipath() (ip/iproute.c).
Introduce RTNH_NEXT_AND_DEC(), similar to RTA_NEXT(), and use it.
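
A sketch of how such a macro and a loop using it might look. The macro body
is an assumption based on the description above, and
count_nexthops_sketch() is a hypothetical example user.

```c
#include <stddef.h>
#include <linux/rtnetlink.h>

/* Like RTNH_NEXT(), but also decrease the remaining length, the way
 * RTA_NEXT() does for attributes
 */
#define RTNH_NEXT_AND_DEC(rtnh, attrlen)				\
	((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))

/* Count nexthop objects in @len bytes, e.g. the RTA_MULTIPATH payload */
static int count_nexthops_sketch(struct rtnexthop *rtnh, int len)
{
	int n = 0;

	/* The explicit size check avoids reading rtnh_len past the buffer */
	for (; len >= (int)sizeof(*rtnh) && RTNH_OK(rtnh, len);
	     rtnh = RTNH_NEXT_AND_DEC(rtnh, len))
		n++;

	return n;
}
```

Without the decrement, RTNH_OK() would keep comparing against the full
original length and the loop could walk past the end of the attribute.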
Fixes: 6c7623d07b ("netlink: Add support to fetch default gateway from multipath routes")
Fixes: f4e38b5cd2 ("netlink: Adjust interface index inside copied nexthop objects too")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...not just for a single set address (legacy operation with
--no-copy-addrs). I forgot to add this to nl_addr_dup().
Note that we can have two versions of flags: the 8-bit ifa_flags in
ifaddrmsg, and the newer 32-bit version as IFA_FLAGS attribute, which
is given priority if present. Make sure IFA_F_NODAD is set in both.
Without this, a Podman user reports that something along the lines of:
pasta --config-net -- ping -c1 -6 passt.top
would fail as the kernel would start Duplicate Address Detection
once we configure the address, which can't really work (and doesn't
make sense) in this case.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
See the comment to the unnamed enum in linux/if_addr.h, which
currently states:
/*
* Important comment:
* IFA_ADDRESS is prefix address, rather than local interface address.
* It makes no difference for normally configured broadcast interfaces,
* but for point-to-point IFA_ADDRESS is DESTINATION address,
* local address is supplied in IFA_LOCAL attribute.
*
* [...]
*/
if we fetch IFA_ADDRESS, and we have a point-to-point link with a peer
address configured, we'll source the peer address as "our" address,
and refuse to resolve it in arp().
This was reported with pasta and a tun upstream interface configured
by OpenVPN in "p2p" topology: the target namespace will have similar
addresses and routes as the host, which is fine, and will try to
resolve the point-to-point peer address (because it's the default
gateway).
Given that we configure it as our address (only internally, not
visibly in the namespace), we'll fail to resolve that and traffic
doesn't go anywhere.
Note that this is not the case for IPv6: there, IFA_ADDRESS is the
actual, local address of the interface, and IFA_LOCAL is not
necessarily present, so the comment in linux/if_addr.h doesn't apply
either.
Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
test/pasta_options/log_to_file checks that pasta truncates its log file
when started. It does that by starting pasta with a log file once, then
starting it again and checking that after the second round, the log file
has only one line: the startup banner from the second invocation.
However, this test will break if the second invocation logs any additional
messages at startup. This can easily happen on a host with multiple
network interfaces due to the "Multiple default route" informational
messages added in 639fdf06e ("netlink: Fix selection of template
interface"). I believe it could also happen on a host without IPv6
connectivity due to the "Couldn't pick external interface" messages, though
I haven't confirmed this.
Make the log file test more robust, by not testing for a single line, but
instead explicitly testing for the PID of the second pasta invocation in
the banner line.
Link: https://bugs.passt.top/show_bug.cgi?id=88
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
test/pasta_options/log_to_file contains a couple of rudimentary tests
where we start pasta with an interactive shell, then immediately exit it.
We can achieve the same thing by using /bin/true as the command to pasta.
This also means that waiting for pasta to start, waiting for the executed
command to complete and for pasta to clean up are all handled by simply
waiting for pasta to complete in the foreground, so there's no need for an
additional sleep.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Commit bb9bf0bb ("tcp, udp: Don't precompute port remappings in epoll
references") changed the epoll reference for UDP sockets to include the
bound port as seen by the socket itself, rather than the bound port it
would be translated to on the guest side. As a side effect, it also means
that udp_tap_map[] is indexed by the bound port on the host side, rather
than on the guest side. This is consistent and a good idea, however we
forgot to account for it when finding the correct outgoing socket for
packets originating in the guest. This means that if forwarding UDP
inbound with a port number change, reply packets would be misdirected.
Fix this by applying the reverse mapping before looking up the socket in
udp_tap_handler(). While we're at it, use 'port' directly instead of
'uref.port' in udp_sock_init(). Those now always have the same value -
failing to realise that is the same error as above.
Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=87
Fixes: bb9bf0bb8f ("tcp, udp: Don't precompute port remappings in epoll references")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.
Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.
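
The last point can be sketched as follows. The entry layout, the enum names
and frame_iov_sketch() are hypothetical, chosen only to show the idea of
skipping the first entry.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical layout of the per-frame vector */
enum {
	IOV_SKETCH_VNET,	/* vnet length header */
	IOV_SKETCH_ETH,
	IOV_SKETCH_IP,
	IOV_SKETCH_PAYLOAD,
	IOV_SKETCH_NUM,
};

/* Return the part of the vector to send and set *cnt to its number of
 * entries: skip the vnet length entry unless it's needed
 */
static const struct iovec *frame_iov_sketch(const struct iovec *iov,
					    int need_vnet, size_t *cnt)
{
	if (need_vnet) {
		*cnt = IOV_SKETCH_NUM;
		return iov;
	}

	*cnt = IOV_SKETCH_NUM - 1;
	return iov + IOV_SKETCH_ETH;
}
```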
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...simply resort to using a locally-administered address (LAA) as
host-side source, instead.
Pick 02:00:00:00:00:00, to make it clear that we don't actually care
about that address, and also to match the 00 (Administratively
Assigned Identifier) quadrant of SLAP (RFC 8948).
Otherwise, pasta refuses to start if the template is a tun or
Wireguard interface.
Link: https://bugs.passt.top/show_bug.cgi?id=49
Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Given that we use this stack pointer as a location to store arbitrary
data types from the cloned process, we need to guarantee that its
alignment matches any of those possible data types.
runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on
aarch64, where the alignment requirement for stack pointers is
16 bytes (same as the size of a long double), and similar requirements
actually apply to most architectures we run on.
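
A minimal sketch of one way to get that guarantee in C11, assuming a
downward-growing stack. The struct, the size and ns_stack_top_sketch() are
hypothetical and not the actual declarations.

```c
#include <stddef.h>
#include <stdint.h>

#define NS_STACK_SIZE_SKETCH	(64 * 1024)

/* Stack area for a cloned child, aligned for any fundamental type */
struct ns_stack_sketch {
	_Alignas(max_align_t) char buf[NS_STACK_SIZE_SKETCH];
};

/* Hand out the top of the area: the size is a multiple of the alignment,
 * so the top is as strictly aligned as the start
 */
static void *ns_stack_top_sketch(struct ns_stack_sketch *s)
{
	return s->buf + sizeof(s->buf);
}
```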
Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=85
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the
architecture name for, among others, seccomp.sh, I didn't realise
that "armv6l" and "armv7l" are just Linux kernel names -- compilers
just call that "arm".
Fix the "syscalls" annotation we use to define seccomp profiles
accordingly, otherwise pasta will be terminated on sigreturn() on
armv6l and armv7l.
Fixes: 213c397492 ("passt, pasta: Run-time selection of AVX2 build")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Paul Holzinger pointed out that when we invoke the podman tests inside the
passt testsuite, the way we point podman at the newly built pasta binary
is kind of indirect. It's therefore prudent to check that podman is
actually using the binary we expect it to - in particular that it is using
the binary built in this tree, not some system installed pasta binary.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The pasta_podman/bats test script looks for 'catatonit' amongst other tools
to be available on the host. However, while the podman tests do require
catatonit, it doesn't necessarily need to be in the regular path. For
example Fedora and RHEL place catatonit in /usr/libexec and podman finds it
there fine.
Therefore, remove it as an htools dependency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The pasta_podman/bats test script downloads and builds podman, then runs its
pasta specific tests. Downloading from within a test case has some
drawbacks:
* It can be very tedious if you have poor connectivity to the server
* It makes a test that's ostensibly for pasta itself dependent on the
state of the github server
* It precludes running the tests in an isolated network environment
The same concerns largely apply to building podman too, because it's pretty
common for Go builds to download dependencies themselves. Therefore move
the download and build of podman from the test itself, to the Makefile
where we prepare other test assets.
To avoid cryptic failures if something went wrong with the build, make
running the test dependent on having the built podman binary.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We download and use mbuto to build trivial boot images for our VM tests.
However, if mbuto is already cloned, we won't update it to the current
version. Add some make logic to ensure that we do this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently "make cppcheck" invokes cppcheck on ".", so it will check all the
.c and .h files it can find in the source tree. This isn't ideal, because
it can find files that aren't actually part of the real build, or even
stale files which aren't in git.
More practically, some upcoming changes are looking at downloading other
source trees for some tests. Static errors in there are Not Our Problem,
so checking them is both slow and pointless.
So, change the Makefile to invoke cppcheck only on the specific source
files that are part of the build. For some reason in this format the
badBitmaskCheck warnings in seccomp.h which were suppressed by 5beb3472e
("cppcheck: Avoid errors due to zeroes in bitwise ORs") no longer trigger.
That means we get unmatchedSuppression warnings instead. We add an
unmatchedSuppression suppression instead of simply removing the original
suppressions, just in case this odd behaviour isn't the same for all
cppcheck versions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), and since 639fdf06ed ("netlink: Fix selection of template
interface") less buggily, we haven't required a default route on the host
in order to operate. Instead, if we lack a default route we'll pick an
interface with any route, as long as there's only one such interface. If
there's more than one, we don't have a good criterion to pick, so we give
up with an informational message.
Paul Holzinger pointed out that this code considers it ambiguous even if
all but one of the interfaces has only routes to link-local addresses
(fe80::/10). A route to link-local addresses isn't really useful from
pasta's point of view, so ignore them instead. This removes a misleading
message in many cases, and a spurious failure in some cases.
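
The check itself can be as small as the sketch below, where
route6_is_useful_sketch() is a hypothetical name and the surrounding
interface selection logic is left out.

```c
#include <netinet/in.h>

/* A route counts towards picking a template interface only if its
 * destination isn't within fe80::/10
 */
static int route6_is_useful_sketch(const struct in6_addr *dst)
{
	return !IN6_IS_ADDR_LINKLOCAL(dst);
}
```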
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have a few places where we want to include the name of the internet
protocol version (IPv4 or IPv6) in a message, which we handle with an
open-coded ?: expression.
This seems like something that might be more widely useful, so make a
trivial helper to return the correct string based on the address family.
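
Such a helper might look like the sketch below. The name af_name_sketch()
and the fallback string for other families are assumptions.

```c
#include <sys/socket.h>

/* Return a printable protocol version name for an address family */
static const char *af_name_sketch(sa_family_t af)
{
	switch (af) {
	case AF_INET:
		return "IPv4";
	case AF_INET6:
		return "IPv6";
	default:
		return "<unknown>";
	}
}
```

A message can then use "%s" with af_name_sketch(af) in place of an
open-coded ?: expression.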
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As pasta duplicates host routes into the target namespaces, interface
indices might not match, so we go through RTA_OIF attributes and fix
them up to match the identifier in the namespace.
But RTA_OIF is not the only attribute specifying interfaces for routes:
multipath routes use RTA_MULTIPATH attributes with nexthop objects,
which contain in turn interface indices. Fix them up as well.
If we don't, and we have at least two host interfaces, and the host
interface we use as template isn't the first one (hence the
mismatching indices), we'll fail to insert multipath routes with
nexthop objects, and ultimately refuse to start as the kernel
unexpectedly gives us ENODEV.
Link: https://github.com/containers/podman/issues/22192
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
From an original patch by Danish Prakash.
With commit ff22a78d7b ("pasta: Don't try to watch namespaces in
procfs with inotify, use timer instead"), if a filesystem-bound
target namespace is passed on the command line, we'll grab a handle
on its parent directory. That commit, however, didn't introduce a
matching AppArmor rule. Add it here.
To access a network namespace procfs entry, we also need a 'ptrace'
rule. See commit 594dce66d3 ("isolation: keep CAP_SYS_PTRACE when
required") for details as to when we need this -- essentially, it's
about operation with Buildah.
Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Fixes: ff22a78d7b ("pasta: Don't try to watch namespaces in procfs with inotify, use timer instead")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With Podman's custom networks, pasta will typically need to open the
target network namespace at /run/user/<UID>/containers/networks:
grant access to anything under /run/user/<UID> instead of limiting it
to some subpath.
Note that in this case, Podman will need pasta to write out a PID
file, so we need write access, for similar locations, too.
Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For the policy to work as expected across both AppArmor commit
9d3f8c6cc05d ("parser: fix parsing of source as mount point for
propagation type flags") and commit 300889c3a4b7 ("parser: fix option
flag processing for single conditional rules"), we need one mount
rule with matching mount options as "source" (that is, without
source), and one without mount options and an explicit, empty source.
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window. This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct. RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.
The reason for the existing logic, rather than always forcing an ACK, seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit. However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.
[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1
Link: https://github.com/containers/podman/issues/22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets.
This doesn't make sense for RST packets, where plain RST and RST+ACK have
somewhat different meanings. AIUI, RST+ACK indicates an abrupt end to
a connection, but acknowledges data already sent. Plain RST indicates an
abort, when one end receives a packet that doesn't seem to make sense in
the context of what it knows about the connection. All of the cases where
we send RSTs are the second, so we don't want an ACK flag, but we currently
could add one anyway.
Change that, so we won't add an ACK to an RST unless the caller explicitly
requests it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have different paths for controlling the ACK flag for the SYN and !SYN
paths. This amounts to sometimes forcing on the ACK flag in the !SYN path
regardless of options. We can rearrange things to make that explicit, which
will make things neater for some future changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the
setting of the ACK flag in the packet, even if we otherwise wouldn't.
Secondly, it causes a duplicate of the flags packet to be sent immediately
after the first.
Passing the ACK flag to tcp_send_flag() also has the first effect, so
instead of having DUP_ACK also do that, pass both flags when we need both
operations. This slightly simplifies the logic of tcp_send_flag() in a way
that makes some future changes easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In write_remainder() 'skip' is the offset to start the operation from
in the iovec array.
In iov_skip_bytes(), 'skip' is also the offset in the iovec array but
'offset' is the first unskipped byte in the iovec entry.
As write_remainder() uses 'skip' for both, 'skip' is reset to the
first unskipped byte in the iovec entry rather than staying the first
unskipped byte in the iovec array.
Fix the problem by introducing a new variable not to overwrite 'skip'
on each loop.
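The distinction between the two offsets can be sketched as follows.
This is a simplified illustration under assumed signatures, not the
code from util.c or iov.c: 'skip' always counts bytes from the start
of the whole array, while 'offset' is local to one entry.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Find the entry holding byte 'skip' of the whole array. Returns the
 * entry index and stores the position within that entry in *offset.
 */
static size_t iov_skip_bytes(const struct iovec *iov, size_t n,
			     size_t skip, size_t *offset)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (skip < iov[i].iov_len)
			break;
		skip -= iov[i].iov_len;
	}
	*offset = skip;

	return i;
}

/* Write everything in the array from byte 'skip' onwards */
static int write_remainder(int fd, const struct iovec *iov,
			   size_t iovcnt, size_t skip)
{
	size_t offset, i;

	while ((i = iov_skip_bytes(iov, iovcnt, skip, &offset)) < iovcnt) {
		ssize_t rc;

		if (offset)	/* Partially written entry: finish it alone */
			rc = write(fd, (char *)iov[i].iov_base + offset,
				   iov[i].iov_len - offset);
		else
			rc = writev(fd, &iov[i], (int)(iovcnt - i));

		if (rc <= 0)	/* Sketch: treat no progress as failure */
			return -1;

		skip += rc;	/* Still an offset into the whole array */
	}

	return 0;
}
```

Reusing 'skip' to receive the per-entry offset would make the next
iteration search from the wrong position whenever the first entries
were already fully written.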
Fixes: 8bdb0883b4 ("util: Add write_remainder() helper")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), if there is only one host interface with routes, we will pick that
as the template interface, even if there are no default routes for an IP
version. Unfortunately this selection had a serious flaw: in some cases
it would 'return' in the middle of an nl_foreach() loop, meaning we
wouldn't consume all the netlink responses for our query. This could cause
later netlink operations to fail as we read leftover responses from the
aborted query.
Rewrite the interface detection to avoid this problem. While we're there:
* Perform detection of both default and non-default routes in a single
pass, avoiding an ugly goto
* Give more detail on error and working but unusual paths about the
situation (no suitable interface, multiple possible candidates, etc.).
Fixes: f919dc7a4b ("conf, netlink: Don't require a default route to start")
Link: https://bugs.passt.top/show_bug.cgi?id=83
Link: https://github.com/containers/podman/issues/22052
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2270257
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Use info(), not warn() for somewhat expected cases where one
IP version has no default routes, or no routes at all]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A recent kernel change 87d381973e49 ("genetlink: fit NLMSG_DONE into
same read() as families") changed netlink behaviour so that the
NLMSG_DONE terminating a bunch of responses can go in the same
datagram as those responses, rather than in a separate one.
Our netlink code is supposed to handle that behaviour, and indeed does
so for most cases, using the nl_foreach() macro. However, there was a
subtle error in nl_route_dup() which doesn't work with this change.
f00b1534 ("netlink: Don't try to get further datagrams in
nl_route_dup() on NLMSG_DONE") attempted to fix this, but has its own
subtle error.
The problem arises because nl_route_dup(), unlike other cases, doesn't
just make a single pass through all the responses to a netlink
request. It needs to get all the routes, then make multiple passes
through them. We don't really have anywhere to buffer multiple
datagrams, so we only support the case where all the routes fit in a
single datagram - but we need to fail gracefully when that's not the
case.
After receiving the first datagram of responses (with nl_next()) we
have a first loop scanning them. It needs to exit when either we run
out of messages in the datagram (!NLMSG_OK()) or when we get a message
indicating the last response (nl_status() <= 0).
What we do after the loop depends on which exit case we had. If we
saw the last response, we're done, but otherwise we need to receive
more datagrams to discard the rest of the responses.
We attempt to check for that second case by re-checking NLMSG_OK(nh,
status). However in the got-last-response case, we've altered status
from the number of remaining bytes to the error code (usually 0). That
means NLMSG_OK() now returns false even if it didn't during the loop
check. To fix this we need separate variables for the number of bytes
left and the final status code.
We also checked status after the loop, but this was redundant: we can
only exit the loop with NLMSG_OK() == true if status <= 0.
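The idea of keeping the two values apart can be sketched like this.
The function name more_expected() and its interface are hypothetical
and do not match nl_route_dup(): 'left' only ever counts remaining
bytes in the datagram, and '*status' only ever carries the final
status code.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Scan one datagram of responses. Returns true if the terminating
 * message wasn't in it, so further datagrams must be received.
 */
static bool more_expected(struct nlmsghdr *nh, int left, int *status)
{
	*status = 1;	/* > 0: no terminating message seen so far */

	for (; NLMSG_OK(nh, left); nh = NLMSG_NEXT(nh, left)) {
		if (nh->nlmsg_type == NLMSG_DONE) {
			*status = 0;
			return false;
		}

		if (nh->nlmsg_type == NLMSG_ERROR) {
			const struct nlmsgerr *err = NLMSG_DATA(nh);

			*status = err->error;
			return false;
		}

		/* ...a real caller would process one route here... */
	}

	/* Ran out of bytes before seeing the last response */
	return true;
}
```

With a single variable, a terminating status of 0 is
indistinguishable from "zero bytes left", which is the confusion
described above.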
Reported-by: Martin Pitt <mpitt@redhat.com>
Fixes: f00b153414 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE")
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Link: https://github.com/containers/podman/issues/22052
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The spec file patch by Dan Čermák was originally contributed at:
https://src.fedoraproject.org/rpms/passt/pull-request/1
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Paul reports that if pasta is configured with --dns-forward, and the
container queries a resolver which is configured on the host directly,
without using the address given for --dns-forward, we'll translate
the source address of the response pretending it's coming from the
address passed as --dns-forward, and the client will discard the
reply.
That is,
$ cat /etc/resolv.conf
198.51.100.1
$ pasta --config-net --dns-forward 192.0.2.1 nslookup passt.top
will not work, because we change the source address of the reply from
198.51.100.1 to 192.0.2.1. But the client contacted 198.51.100.1, and
it's from that address that it expects an answer.
Add a PORT_DNS_FWD flag for tap-facing ports, which is triggered by
activity in the opposite direction as the other flags. If the
tap-facing port was seen sending a DNS query that was remapped, we'll
remap the source address of the response, otherwise we'll leave it
unaffected.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There might be isolated testing environments where default routes and
global connectivity are not needed, a single interface has all
non-loopback addresses and routes, and still passt and pasta are
expected to work.
In this case, it's pretty obvious what our upstream interface should
be, so go ahead and select the only interface with at least one
route, disabling DHCP and implying --no-map-gw as the documentation
already states.
If there are multiple interfaces with routes, though, refuse to start,
because at that point it's really not clear what we should do.
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/21896
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Martin reports that, with Fedora Linux kernel version
kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
read() as families"), pasta doesn't exit once the network namespace
is gone.
Actually, pasta is completely non-functional, at least with default
options, because nl_route_dup(), which duplicates routes from the
parent namespace into the target namespace at start-up, is stuck on
a second receive operation for RTM_GETROUTE.
However, with that commit, the kernel is now able to fit the whole
response, including the NLMSG_DONE message, into a single datagram,
so no further messages will be received.
It turns out that commit 4d6e9d0816 ("netlink: Always process all
responses to a netlink request") accidentally relied on the fact that
we would always get at least two datagrams as a response to
RTM_GETROUTE.
That is, the test to check if we expect another datagram, is based
on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
but we'll also expect another datagram if NLMSG_OK on the last
message is false. But NLMSG_OK with a zero length is always false.
The problem is that we don't distinguish if status is zero because
we got a NLMSG_DONE message, or because we processed all the
available datagram bytes.
Introduce an explicit check on NLMSG_DONE. We should probably
refactor this slightly, for example by introducing a special return
code from nl_status(), but this is probably the least invasive fix
for the issue at hand.
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/22052
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec. They don't have to be used
for that, though. Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Most times we send frames to the guest it goes via tap_send_frames().
However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go
via tap_send().
As well as being a semantic duplication, tap_send() contains at least one
serious problem: it doesn't properly handle short sends, which can be fatal
on the qemu socket connection, since frame boundaries will get out of sync.
Rewrite tap_send() to call tap_send_frames(). While we're there, rename it
tap_send_single() for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can both remove some variables which differ from others only in type,
and slightly improve type safety.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer. We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Paul reports that, with commit 15001b39ef ("conf: set the log level
much earlier"), early messages aren't reported to standard error
anymore.
The reason is that, once the log mask is changed from LOG_EARLY, we
don't force logging to stderr, and this mechanism was abused to have
early errors on stderr. Now that we drop LOG_EARLY earlier on, this
doesn't work anymore.
Call __openlog() as soon as we know the mode we're running as, using
LOG_PERROR. Then, once we detach, if we're not running from an
interactive terminal and logging to standard error is not forced,
drop LOG_PERROR from the options.
While at it, check if the standard error descriptor refers to a
terminal, instead of checking standard output: if the user redirects
standard output to /dev/null, they might still want to see messages
from standard error.
Further, make sure we don't print messages to standard error reporting
that we couldn't log to the system logger, if we didn't open a
connection yet. That's expected.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 15001b39ef ("conf: set the log level much earlier")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
POSIX.1-2008 declared gettimeofday() as obsolete, but I'm a dinosaur.
Usually, C libraries translate that to the clock_gettime() system
call anyway, but this doesn't happen in Jon's environment, and,
there, seccomp happily kills pasta(1) when started with --pcap,
because we didn't add gettimeofday() to our seccomp profiles.
Use clock_gettime() instead.
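As an illustration only, a timestamp with microsecond resolution
(the resolution gettimeofday() provides) can be derived from
clock_gettime() as below. The helper name and the output types are
assumptions and are not taken from pcap.c:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/* Hypothetical helper: seconds and microseconds since the epoch */
static int timestamp_usec(uint32_t *sec, uint32_t *usec)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_REALTIME, &ts))
		return -1;

	*sec = (uint32_t)ts.tv_sec;
	*usec = (uint32_t)(ts.tv_nsec / 1000);	/* ns to µs */

	return 0;
}
```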
Reported-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We might have read from resolv.conf, or from the command line, a
resolver that's reachable via loopback address, but that doesn't mean
we can offer that via DHCP, NDP or DHCPv6: warn if there are no
resolvers we can offer for a given IP version.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...that is, call add_dns4() and add_dns6() instead of simply adding
those to the list of servers we advertise.
Most importantly, this will set the 'dns_host' field for the matching
IP version, so that, as mentioned in the man page, servers passed via
--dns are used for DNS mapping as well, if used in combination with
--dns-forward.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=82
Fixes: 89678c5157 ("conf, udp: Introduce basic DNS forwarding")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In tap_send_frames(), if we failed to send all the frames, we must
only log the frames that have been sent, not all the frames we wanted
to send.
Fixes: dda7945ca9 ("pcap: Handle short writes in pcap_frame()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we open code the calculation of the UDP checksum in
udp_update_hdr6(). We call a helper to handle the IPv6 pseudo-header,
and preset the checksum field to 0 so an uninitialised value doesn't get
folded in. We already have a helper to do this: csum_udp6() which we use
in some slow paths. Use it here as well.
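For reference, the steps such a helper has to cover can be sketched
as follows. This is a plain, unoptimised illustration with
hypothetical names (sum16(), csum_fold16(), udp6_csum_sketch()), not
the csum_udp6() implementation from checksum.c:

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/udp.h>

/* Add big-endian 16-bit words of a buffer to a running sum */
static uint32_t sum16(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;

	for (; len > 1; p += 2, len -= 2)
		sum += (uint32_t)p[0] << 8 | p[1];
	if (len)	/* Odd trailing byte, padded with zero */
		sum += (uint32_t)p[0] << 8;

	return sum;
}

/* Fold carries back in and return the one's complement */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Fill in uh->check, assuming uh->len is already set */
static void udp6_csum_sketch(struct udphdr *uh,
			     const struct in6_addr *src,
			     const struct in6_addr *dst,
			     const void *payload, size_t plen)
{
	uint32_t sum = 0;
	uint16_t csum;

	uh->check = 0;	/* Don't fold in a stale or uninitialised value */

	/* Pseudo-header: addresses, upper-layer length, next header */
	sum = sum16(src, sizeof(*src), sum);
	sum = sum16(dst, sizeof(*dst), sum);
	sum += ntohs(uh->len);
	sum += IPPROTO_UDP;

	sum = sum16(uh, sizeof(*uh), sum);
	sum = sum16(payload, plen, sum);

	csum = csum_fold16(sum);
	uh->check = htons(csum ? csum : 0xffff); /* 0 means "no checksum" */
}
```

Clearing the checksum field before summing is the step the commit
message refers to: the field is part of the summed header.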
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We carry around the source address as a pointer to a constant struct
in_addr. But it's silly to carry around a 4 or 8 byte pointer to a 4 byte
IPv4 address. Just copy the IPv4 address around by value.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The order of things in these functions is a bit odd for historical reasons.
We initialise some IP header fields early, then more later after making
some tests. Likewise we declare some variables without initialisation,
but then unconditionally set them to values we could calculate at the
start of the function.
Previous cleanups have removed the reasons for some of these choices, so
reorder for clarity, and where possible move the first assignment into an
initialiser.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These functions take an index to the L2 buffer whose header information to
update. They use that for two things: to locate the buffer pointer itself,
and to retrieve the length of the received message from the parallel
udp[46]_l2_mh_sock array. The latter is arguably a failure to separate
concerns. Change these functions to explicitly take a buffer pointer and
payload length as parameters.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In these functions we have 'dstport' for the destination port, but
'src_port' for the source port. Change the latter to 'srcport' for
consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Each of these functions has 3 essentially identical loops in a row.
Merge the loops into a single common udp_sock_iov_init() function, calling
udp_sock[46]_iov_init_one() helpers to initialize each "slot" in the
various parallel arrays. This is slightly neater now, and more naturally
allows changes we want to make where more initialization will become common
between IPv4 and IPv6.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Starting from commit 3a2afde87d ("conf, udp: Drop mostly duplicated
dns_send arrays, rename related fields"), we won't add to c->ip4.dns
and c->ip6.dns nameservers that can't be used by the guest or
container, and we won't advertise them.
However, the fact that we don't advertise any nameserver doesn't mean
that we didn't find any, and we should warn only if we couldn't find
any.
This is particularly relevant in case both --dns-forward and
--no-map-gw are passed, and a single loopback address is listed in
/etc/resolv.conf: we'll forward queries directed to the address
specified by --dns-forward to the loopback address we found, and we
won't advertise that address, so we shouldn't warn: this is a
perfectly legitimate usage.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/19213
Fixes: 3a2afde87d ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Currently ping sockets use a custom epoll reference type which includes
the ICMP id. However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id. Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.
Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use flow_dbg() and flow_err() helpers to generate flow-linked error
messages in most places. Make a few small improvements to the messages
while we're at it. This allows us to avoid the awkward 'pname' variables
since whether we're dealing with ICMP or ICMPv6 is already built into the
flow type which these helpers include.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style fix in icmp_tap_handler()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently icmp_id_map[][] stores information about ping sockets in a
bespoke structure. Move the same information into new types of flow
in the flow table. To match that change, replace the existing ICMP
timer with a flow-based timer for expiring ping sockets. This has the
advantage that we only need to scan the active flows, not all possible
ids.
We convert icmp_id_map[][] to point to the flow table entries, rather
than containing its own information. We do still use that array for
locating the right ping flows, rather than using a "flow native" form
of lookup for the time being.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update id_sock description in comment to icmp_ping_new()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
instead of htons_constant(), which is for... constants.
Fixes: 5bf200ae8a ("tcp, udp: Don't include destination address in partially precomputed csums")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-08 10:31:14 +01:00
120 changed files with 8200 additions and 7442 deletions
Throughput in Gbps, latency in µs. Threads are <span style="font-family: monospace;">iperf3</span> processes, <i>passt</i> and <i>pasta</i> are currently single-threaded.<br/>
Throughput in Gbps, latency in µs. Threads are <span style="font-family: monospace;">iperf3</span> threads, <i>passt</i> and <i>pasta</i> are currently single-threaded.<br/>
Click on numbers to show test execution. Measured at head, commit <span style="font-family: monospace;">__commit__</span>.