Compare commits

396 commits

Author SHA1 Message Date
David Gibson
0588163b1f cppcheck: Don't check the system headers
We pass -I options to cppcheck so that it will find the system headers.
Then we need to pass a bunch more options to suppress the zillions of
cppcheck errors found in those headers.

It turns out, however, that it's not recommended to give the system headers
to cppcheck anyway.  Instead it has built-in knowledge of the ANSI libc and
uses that as the basis of its checks.  We do need to suppress
missingIncludeSystem warnings instead though.

Not bothering with the system headers makes the cppcheck runtime go from
~37s to ~14s on my machine, which is a pretty nice win.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:21 +01:00
David Gibson
14dd70e2b3 linux_dep: Fix CLOSE_RANGE_UNSHARE availability handling
If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of
close_range() which is a (successful) no-op.  This is broken in several
ways:
 * It doesn't actually fix the compile if using old kernel headers, because
   the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE
   unprotected by ifdefs
 * Even if it did fix the compile, it means inconsistent behaviour between
   a compile time failure to find the value (we silently don't close files)
   and a runtime failure (we die with an error from close_range())
 * Silently not closing the files we intend to close for security reasons
   is probably not a good idea in any case

We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't
available, because that would require running on kernel >= 5.9.  On the
other hand there's not really any other way to flush all possible fds
leaked by the parent (close() in a loop takes over a minute).  So in this
case print a warning and carry on.
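
A minimal sketch of the behaviour described above, not the code from this
commit: close_leaked_fds() and its return convention are hypothetical, and
treating ENOSYS and EINVAL as "not available" is an assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Fallback for old kernel headers, value as in <linux/close_range.h> */
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE	(1U << 1)
#endif

/* Hypothetical helper: close every fd >= first in a private fd table.
 * Returns 0 on success, 1 if we could only warn, -1 on any other error.
 */
static int close_leaked_fds(unsigned int first)
{
#ifdef SYS_close_range
	if (!syscall(SYS_close_range, first, ~0U, CLOSE_RANGE_UNSHARE))
		return 0;

	if (errno != ENOSYS && errno != EINVAL)
		return -1;	/* genuine runtime failure */
#else
	(void)first;
#endif
	/* Kernel or headers too old: carry on, but don't stay silent */
	fprintf(stderr, "Can't use close_range(), leaked fds stay open\n");
	return 1;
}
```

Compile-time and runtime unavailability end up on the same path here, so
both cases behave the same way.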

As a bonus this fixes a cppcheck error I see with some different options I'm
looking to apply in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:17 +01:00
David Gibson
d64f257243 linux_dep: Move close_range() conditional handling to linux_dep.h
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions.  Move this to linux_dep.h which handles several
other similar cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:15 +01:00
David Gibson
b84cd05098 log: Only check for FALLOC_FL_COLLAPSE_RANGE availability at runtime
log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt
to use it if not defined.  But even if the value is defined at compile
time, it might not be available in the runtime kernel, so we need to check
for errors from a fallocate() call and fall back to other methods.

Simplify this to only need the runtime check by using linux_dep.h to define
FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers.
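
A sketch of the runtime-only check, assuming a hypothetical
file_drop_head() helper; the copy-based fallback is illustrative and is not
the method log.c uses.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Fallback for old kernel headers, value as in <linux/falloc.h> */
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE	0x08
#endif

/* Hypothetical helper: drop the first 'cut' bytes of a 'size' byte file */
static int file_drop_head(int fd, off_t cut, off_t size)
{
	char buf[4096];
	off_t off;
	ssize_t n;

	/* No #ifdef: just try it, the running kernel has the final say */
	if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, cut))
		return 0;

	/* Unsupported or unaligned: shift the remaining data by hand */
	for (off = cut; off < size; off += n) {
		n = pread(fd, buf, sizeof(buf), off);
		if (n <= 0 || pwrite(fd, buf, n, off - cut) != n)
			return -1;
	}

	return ftruncate(fd, size - cut);
}
```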

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:25:58 +01:00
Stefano Brivio
58fa5508bd tap, tcp, util: Add some missing SOCK_CLOEXEC flags
I have no idea why, but these are reported by clang-tidy (19.1.2) on
Alpine (x86) only:

/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1139 |         int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      |                                             ^
      |                                              | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1158 |                 ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
      |                                                                 ^
      |                                                                  | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1413 |         s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
      |                                                   ^
      |                                                    | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
  188 |         if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
      |                                             ^
      |                                              | SOCK_CLOEXEC
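
For illustration only, the shape of the fix the checker asks for;
unix_sock_cloexec() is a hypothetical wrapper, not a function from the
tree.

```c
#include <fcntl.h>
#include <sys/socket.h>

/* Hypothetical wrapper: the socket is created close-on-exec atomically,
 * with no window between socket() and a later fcntl(F_SETFD).
 */
static int unix_sock_cloexec(void)
{
	return socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
}
```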

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:58 +01:00
Stefano Brivio
71869e2912 passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:

/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
  314 |         nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
      |                                                            ^

We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.
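
A sketch of the block form, with hypothetical interval macros standing in
for whatever TIMER_INTERVAL expands to. The block markers leave the line
right before the statement free for another tool's suppression comment.

```c
#define TIMER_INTERVAL_A	1000	/* hypothetical values, in ms */
#define TIMER_INTERVAL_B	1000
#define MIN(x, y)		(((x) < (y)) ? (x) : (y))

static int timer_interval(void)
{
	int ms;

	/* NOLINTBEGIN(bugprone-branch-clone) */
	/* a line-based suppression for a second tool can sit here */
	ms = MIN(TIMER_INTERVAL_A, TIMER_INTERVAL_B);
	/* NOLINTEND(bugprone-branch-clone) */

	return ms;
}
```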

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:52 +01:00
Stefano Brivio
d4f09c9b96 util: Define small and big thresholds for socket buffers as unsigned long long
On 32-bit architectures, clang-tidy reports:

/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~

because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.
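
A sketch of the matched types. The macros and the 'else if' branch follow
the lines quoted above, while sndbuf_scale() and the halving in the first
branch are assumptions made for illustration.

```c
#include <stdint.h>

#define SNDBUF_BIG		(4ULL * 1024 * 1024)
#define SNDBUF_SMALL		(128ULL * 1024)

/* Hypothetical helper: every operand is 64 bits wide on any architecture,
 * so no multiplication result is widened after the fact.
 */
static uint64_t sndbuf_scale(uint64_t v)
{
	if (v >= SNDBUF_BIG)
		v /= 2;		/* assumed, not shown in the output above */
	else if (v > SNDBUF_SMALL)
		v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;

	return v;
}
```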

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:49 +01:00
Stefano Brivio
87940f9aa7 tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.

This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:

/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
 1087 |         for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
      |                     ~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.
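
A sketch of the loop bound with the cast around the whole difference. The
macro values and tap_frames_fit() are placeholders, not the definitions
from the tree.

```c
#include <sys/types.h>

#define ETH_MAX_MTU	65535U			/* placeholder value */
#define TAP_BUF_BYTES	(ETH_MAX_MTU * 4U)	/* placeholder value */

/* Hypothetical helper: count frames of 'len' bytes (len > 0) that may
 * start in the buffer. The subtraction happens in unsigned arithmetic,
 * then the whole result is converted once, so both sides of the
 * comparison are signed regardless of the width of ssize_t.
 */
static int tap_frames_fit(ssize_t len)
{
	int count = 0;
	ssize_t n;

	for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len)
		count++;

	return count;
}
```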

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:45 +01:00
Stefano Brivio
1feb90fe62 dhcpv6: Turn some option headers pointers to const
cppcheck 2.14.2 on Alpine reports:

dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer]
 struct opt_hdr *ia, *bad_ia, *client_id;
                               ^

It's not only 'client_id': we can declare 'ia' as pointer to const too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:41 +01:00
Stefano Brivio
5f5e814cfc dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning
cppcheck 2.16.0 reports:

dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
 if (ia_type == OPT_IA_NA) {
             ^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
 ia_type = OPT_IA_NA;
           ^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
 if (ia_type == OPT_IA_NA) {
             ^

This is not really the case as we set ia_type to OPT_IA_TA and then
jump back.

Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.
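
One possible shape for such a macro, as a sketch: the definition below and
ia_types_visited() are assumptions, not necessarily what the commit adds.
OPT_IA_NA is 3 as in the output above, and OPT_IA_TA is 4 as assigned by
RFC 8415.

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical definition: 'item' points to each element in turn */
#define foreach(item, array)						\
	for ((item) = (array);						\
	     (item) - (array) < (ptrdiff_t)ARRAY_SIZE(array);		\
	     (item)++)

enum { OPT_IA_NA = 3, OPT_IA_TA = 4 };

/* Visit both IA types without a goto, returning a bitmask of them */
static int ia_types_visited(void)
{
	const int ia_types[] = { OPT_IA_NA, OPT_IA_TA };
	const int *type;
	int mask = 0;

	foreach(type, ia_types)
		mask |= 1 << *type;

	return mask;
}
```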

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:11 +01:00
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions simultaneously, with patch:
=======================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions simultaneously, without patch:
===========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
David Gibson
9a0e544f05 test: Improve test for NDP assigned prefix
In the NDP tests we search explicitly for a guest address with prefix
length 64.  AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means.  We can do
that more explicitly by checking for .protocol == "kernel_ra", however.

The SLAAC prefixes we assign *will* always be 64-bit: that's hard-coded
into our NDP implementation.  RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64-bits.

Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:37 +01:00
David Gibson
910f4f9103 test: Don't require 64-bit prefixes in perf tests
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length.  There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:34 +01:00
David Gibson
1699083f29 test: Make nstool hold robust against interruptions to control clients
Currently nstool die()s on essentially any error.  In most cases that's
fine for our purposes.  However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket.  This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill off the namespace we're holding.

Adjust these to print an error, close() the control client socket and
carry on.  In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:30 +01:00
David Gibson
b456ee1b53 test: Rename propagating signal handler
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes.  The signal handler which
accomplishes this is called simply sig_handler().  However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:27 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
6f913b3af0 udp: Don't dereference uflow before NULL check in udp_reply_sock_handler()
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler().  However, we dereference uflow before that in
an initializer, rather defeating the point.  Rearrange to avoid that.
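
The pattern, reduced to a sketch with a hypothetical structure and
accessor, and plain assert() standing in for ASSERT():

```c
#include <assert.h>
#include <stddef.h>

struct udp_flow_sketch {	/* hypothetical stand-in for the flow */
	int ttl;
};

static int flow_ttl(const struct udp_flow_sketch *uflow)
{
	int ttl;		/* not: int ttl = uflow->ttl; */

	assert(uflow);		/* check first... */
	ttl = uflow->ttl;	/* ...then dereference */

	return ttl;
}
```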

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:22 +01:00
David Gibson
d8e05a3fe0 ndp: Use const pointer for ndp_ns packet
We don't modify this structure at all.  For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:19 +01:00
David Gibson
0d7b8201ed linux_dep: Generalise tcp_info.h to handling Linux extension compatibility
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers.  There are several other cases where we can remove similar
compile time dependencies on kernel version.  Prepare for that by renaming
tcp_info.h to linux_dep.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:16 +01:00
David Gibson
c5f4e4d146 fwd: Squash different-signedness comparison warning
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral().  This is because
NUM_PORTS evaluates to an unsigned integer.  It's a fixed value, though, and
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:14 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except... we don't have any users for it any more, so remove it.

make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
1d7cff3779 clang: Add rudimentary clangd configuration
clangd's default configuration seems to try to treat .h files as C++ not
C.  There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:07 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.
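
A sketch of clone() with a fixed-size stack. The macro name, the helpers,
and passing the top of the buffer are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NS_FN_STACK_SIZE	(1024 * 1024)	/* 1MiB, fixed at build time */

/* Trivial function to run in the child: exit with the given value */
static int exit_with_arg(void *arg)
{
	return *(int *)arg;
}

/* Hypothetical helper: run fn(arg) in a clone()d child, return its status */
static int run_cloned(int (*fn)(void *), void *arg)
{
	static char stack[NS_FN_STACK_SIZE] __attribute__((aligned(16)));
	int status;
	pid_t pid;

	/* Assumes a downward-growing stack: pass the top of the buffer */
	pid = clone(fn, stack + sizeof(stack), SIGCHLD, arg);
	if (pid < 0 || waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```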

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
David Gibson
13fc6d511e Makefile: Use -DARCH for qrap only
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile.  However, this is only used in qrap.c, not anywhere else in
passt or pasta.  Only supply this -D when compiling qrap specifically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:59 +01:00
David Gibson
7917159005 seccomp: Simplify handling of AUDIT_ARCH
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D.  The only place that uses it, though, is the
BPF filter generated by seccomp.sh.  seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define.  Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:55 +01:00
David Gibson
93bce404c1 Makefile: Move NETNS_RUN_DIR definition to C code
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D.  But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile.  Just move it to a plain #define in conf.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:52 +01:00
David Gibson
c938d8a93e netlink: RTA_PAYLOAD() returns int, not size_t
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t.  However, it doesn't - it explicitly casts
its result to an int.  RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter, compares it to an int, so using size_t can
result in comparison of different-signed integer warnings from clang.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:48 +01:00
David Gibson
f6b546c6e4 flow: Correct type of flowside_at_sidx()
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case.  PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it.  This removes a clang warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:44 +01:00
David Gibson
30b4f88167 arch: Avoid explicit access to 'environ'
We pass 'environ' to execve() in arch_avx2_exec(), so that we retain the
environment in the current process.  But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration.  But, we can avoid needing to reference it
explicitly by using execv() instead of execve().  This removes a clang
warning.
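
As a sketch, with a hypothetical reexec() wrapper: execv() passes on the
calling process's environment by itself, so no declaration of 'environ' is
needed.

```c
#include <unistd.h>

/* Hypothetical wrapper: replace the current process image, keeping the
 * current environment. Returns only on failure.
 */
static int reexec(const char *path, char **argv)
{
	execv(path, argv);	/* environment is inherited implicitly */

	return -1;		/* only reached if execv() failed */
}
```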

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:29 +01:00
David Gibson
b78e72da0b clang: Move clang-tidy configuration from Makefile to .clang-tidy
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable).  Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well.  As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:19 +01:00
David Gibson
8346216c9a Makefile: Simplify exclusion of qrap from static checks
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing.  So, we currently exclude it using $(filter-out).  However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.

Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:07 +01:00
David Gibson
8f1b6a0ca6 clang: Add .clang-format file
I've been experimenting with clangd, but its default format style is
horrid.  Since our style is basically that of the Linux kernel, copy the
.clang-format from the kernel, minus reference to a bunch of kernel
specific macros.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:45:16 +01:00
David Gibson
5e93bcd8bf test: Adjust misplaced sleeps in two_guests code
Most of our transfer tests using socat use 'sleep' waiting for the server
side to be ready before starting the client.  However in two_guests/basic
the sleep is in the wrong place: rather than being between starting the
server and starting the client, it's after waiting for the server to
complete.  This causes occasional hangs when the client runs before the
server is ready - in that case the receiving guest sends an RST, which we
don't (currently) propagate back to the sender.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-05 23:46:38 +01:00
Stefano Brivio
9afce0b45c tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:

tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
 1291 |         rc = ioctl(fd, TUNSETIFF, &ifr);
      |                        ^~~~~~~~~

We don't care about that overflow, so explicitly cast TUNSETIFF to
int.
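
To illustrate why the overflow is harmless, a sketch with a hypothetical
helper, using the values from the warning above. It assumes the usual
two's complement wrap-around on conversion, as gcc implements it: the low
32 bits of the request are unchanged.

```c
/* Hypothetical helper: what the explicit cast does to the request value */
static int ioctl_req_as_int(unsigned long req)
{
	return (int)req;
}
```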

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:33 +01:00
Stefano Brivio
d165d36a0c tcp: Fix build against musl, __sum16 comes from linux/types.h
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).

Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:24 +01:00
Stefano Brivio
ee7d0b62a7 util: Don't use errno after a successful call in __daemon()
I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:

/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
  446 |         if (pid == -1) {
      |             ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
  446 |         if (pid == -1) {
      |         ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
  451 |         if (pid) {
      |             ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
  451 |         if (pid) {
      |         ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
  463 |         close(devnull_fd);
      |         ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~

And the LLVM documentation for the unix.Errno checker, 1.1.8.3
unix.Errno (C), mentions, at:

  https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno

that:

  The C and POSIX standards often do not define if a standard library
  function may change value of errno if the call does not fail.
  Therefore, errno should only be used if it is known from the return
  value of a function that the call has failed.

which is, somewhat surprisingly, the case for close().

Instead of using errno, check the actual return values of the calls
we issue here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
b1a607fba1 udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx
/home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors]
  171 | enum udp_iov_idx {
      | ^
  172 |         UDP_IOV_TAP     = 0,
  173 |         UDP_IOV_ETH     = 1,
  174 |         UDP_IOV_IP      = 2,
  175 |         UDP_IOV_PAYLOAD = 3,
  176 |         UDP_NUM_IOVS
      |
      |                      = 4

Don't initialise any value, so that it's obvious that constants map to
unique values.
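
The result would look something like this sketch, with enumerator names as
in the output above:

```c
/* No explicit initialisers: values are 0, 1, 2, ... in order, so the
 * last enumerator is the number of entries before it.
 */
enum udp_iov_idx {
	UDP_IOV_TAP,
	UDP_IOV_ETH,
	UDP_IOV_IP,
	UDP_IOV_PAYLOAD,
	UDP_NUM_IOVS
};
```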

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
099ace64ce treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.

Once we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.

For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
59fe34ee36 treewide: Suppress clang-tidy warning if we already use O_CLOEXEC
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.

In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).

The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.

Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().
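
A helper along these lines might look as follows. This is a sketch: the signature, the exact flag set and the placement of the suppression comment are assumptions, and only the name output_file_open() comes from the message above:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of a helper in the spirit of output_file_open(): signature and
 * flags are assumptions made for this example.
 */
static int output_file_open_sketch(const char *path, int flags)
{
	/* O_CLOEXEC belongs in 'flags' (second argument), not in 'mode' */
	return open(path, O_CREAT | O_TRUNC | O_CLOEXEC | flags,
		    /* NOLINTNEXTLINE(android-cloexec-open) */
		    S_IRUSR | S_IWUSR);
}
```

The resulting descriptor has FD_CLOEXEC set, which can be checked with fcntl(fd, F_GETFD).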

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
134b4d58b4 Makefile: Disable readability-math-missing-parentheses clang-tidy check
With clang-tidy and LLVM 19:

/home/sbrivio/passt/conf.c:1218:29: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
 1218 |                 const char *octet = str + 3 * i;
      |                                           ^~~~~~
      |                                           (    )
/home/sbrivio/passt/ndp.c:285:18: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  285 |                                         .len            = 1 + 2 * n,
      |                                                               ^~~~~~
      |                                                               (    )
/home/sbrivio/passt/ndp.c:329:23: error: '%' has higher precedence than '-'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  329 |                         memset(ptr, 0, 8 - dns_s_len % 8);      /* padding */
      |                                            ^~~~~~~~~~~~~~
      |                                            (            )
/home/sbrivio/passt/pcap.c:131:20: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  131 |                 pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
      |                                  ^~~~~~~~~~~~~~~~
      |                                  (              )
/home/sbrivio/passt/util.c:216:10: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  216 |                 return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                            )
/home/sbrivio/passt/util.c:217:10: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  217 |                        (a->tv_sec - b->tv_sec - 1) * 1000000;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                    )
/home/sbrivio/passt/util.c:220:9: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  220 |         return (a->tv_nsec - b->tv_nsec) / 1000 +
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                               )
/home/sbrivio/passt/util.c:221:9: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  221 |                (a->tv_sec - b->tv_sec) * 1000000;
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                                )
/home/sbrivio/passt/util.c:545:32: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  545 |         return clone(fn, stack_area + stack_size / 2, flags, arg);
      |                                       ^~~~~~~~~~~~~~~
      |                                       (             )

Just... no.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
744247856d treewide: Silence cert-err33-c clang-tidy warnings for fprintf()
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.

Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.
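
One possible definition of such a macro is sketched below. The macro name comes from the message above, while the exact body is an assumption:

```c
#include <stdio.h>

/* Discard the return value explicitly: there is nothing useful to do
 * if printing to stdout or stderr fails or gets truncated.
 */
#define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
```

It is used exactly like fprintf(), for example FPRINTF(stderr, "%s\n", msg), and the cast to void is what tells the cert-err33-c check that the result is ignored on purpose.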

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
98efe7c2fd treewide: Comply with CERT C rule ERR33-C for snprintf()
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):

/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
   43 |                 snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  577 |                         snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  579 |                                 snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  580 |                                          pidval);
      |                                          ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  105 |         snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  242 |         snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  243 |         snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
 1155 |                         snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning

Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
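
The truncation check itself can be sketched like this. snprintf_fits() is a hypothetical, non-terminating variant written for illustration, whereas the snprintf_check() described above terminates instead of returning a status:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* snprintf_fits() is a hypothetical helper: it returns 1 if the
 * formatted string fit in 'size' bytes (terminator included), and 0 on
 * truncation or output error.
 */
static int snprintf_fits(char *str, size_t size, const char *format, ...)
{
	va_list ap;
	int rc;

	va_start(ap, format);
	rc = vsnprintf(str, size, format, ap);
	va_end(ap);

	/* vsnprintf() returns the length it would have written, without
	 * the terminator, so anything >= size means truncation
	 */
	return rc >= 0 && (size_t)rc < size;
}
```

A terminating wrapper would simply exit with an error message whenever this returns 0.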

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:25 +01:00
Stefano Brivio
988a4d75f8 Makefile: Exclude qrap.c from clang-tidy checks
We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as
of LLVM versions 16 and later would need a bunch of changes there to
be addressed, mostly around CERT C rule ERR33-C and checking return
code from snprintf().

It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 08:21:19 +01:00
Jon Maloy
ba38e67cf4 tcp: unify l2 TCPv4 and TCPv6 queues and structures
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.

Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.

Tests with iperf3 show no measurable change in performance after this
change.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:44:08 +01:00
Jon Maloy
2053c36dec tcp: set ip and eth headers in l2 tap queues on the fly
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through its whole life time. The only
fields we need to update per message are things like payload size
and checksums.

If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.

The same needs to be done for the ethernet header.

We do these changes here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:43:24 +01:00
Laurent Vivier
5563d5f668 test: remove obsolete images
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:30:06 +02:00
Laurent Vivier
f43f7d5e89 tcp: cleanup tcp_buf_data_from_sock()
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can use len directly.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:51 +02:00
David Gibson
e7fcd0c348 tcp: Use runtime tests for TCP_INFO fields
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in the structure returned by the runtime kernel.  We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.

However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel.  We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd.  Switch from the compile time flags to a runtime
test.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:46 +02:00
David Gibson
81143813a6 tcp: Generalise probing for tcpi_snd_wnd field
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).

In fact we should check that for every tcp_info field we want to use,
beyond the very old ones shared with BSD.  Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure.  We can then use that recorded
value to check for the presence of any field we need.
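
The size-based presence test can be sketched as below. The structure is a shortened, hypothetical stand-in for a local copy of the kernel's struct tcp_info, and INFO_HAS() is a made-up name. In real use, 'len' would be the length that getsockopt() writes back for TCP_INFO:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for a locally defined copy of the
 * kernel's struct tcp_info: layout is illustrative only.
 */
struct tcp_info_sketch {
	uint8_t  tcpi_state;
	uint8_t  pad[3];
	uint32_t tcpi_rtt;
	uint64_t tcpi_bytes_acked;	/* stands in for a newer field */
};

/* True if a structure of 'len' bytes, as reported by the kernel, is
 * long enough to fully contain 'field'.
 */
#define INFO_HAS(len, field)						\
	((size_t)(len) >= offsetof(struct tcp_info_sketch, field) +	\
			  sizeof(((struct tcp_info_sketch *)0)->field))
```

With this, a kernel that returns only the first 8 bytes would pass INFO_HAS(8, tcpi_rtt) but not INFO_HAS(8, tcpi_bytes_acked).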

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:27:17 +02:00
David Gibson
13f0291ede tcp: Remove compile-time dependency on struct tcp_info version
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info.  These defines are used for two
purposes, neither of which they accomplish well:

1) Determining if the tcp_info fields are available at runtime.  For this
   purpose the defines are Just Plain Wrong, since the runtime kernel may
   not be the same as the compile time kernel. We corrected this for
   tcpi_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt.

2) Allowing the source to compile against older kernel headers which don't
   have the fields in question.  This works in theory, but it does mean
   we won't be able to use the fields, even if later run against a
   newer kernel.  Furthermore, it's quite fragile: without much more
   thorough tests of builds in different environments than we're currently
   set up for, it's very easy to miss cases where we're accessing a field
   without protection from an #ifdef.  For example we currently access
   tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().

Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems).  Don't compile against linux/tcp.h, using
netinet/tcp.h instead.  Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in.  That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.

This allows us to remove the HAS_SND_WND define entirely.  We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:26:48 +02:00
Stefano Brivio
9e4615b40b tcp_splice: fcntl(2) returns the size of the pipe, if F_SETPIPE_SZ succeeds
Don't report bogus failures (with --trace) just because the return
value is not zero.
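
A minimal sketch of the corrected check, assuming Linux: fcntl() with F_SETPIPE_SZ returns the resulting pipe capacity on success and -1 on failure, so only a negative value indicates an error. pipe_set_size() is a hypothetical helper, not a function from tcp_splice.c:

```c
#define _GNU_SOURCE		/* for F_SETPIPE_SZ with glibc */
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031	/* Linux value, in case headers hide it */
#endif

/* pipe_set_size() is a hypothetical helper: 0 on success, -1 on failure.
 * On success fcntl() returns the capacity that was actually set, which
 * is not zero, so comparing the result against zero reports bogus
 * failures.
 */
static int pipe_set_size(int fd, int size)
{
	int rc = fcntl(fd, F_SETPIPE_SZ, size);

	return rc < 0 ? -1 : 0;
}
```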

Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:30 +02:00
Stefano Brivio
149f457b23 tcp_splice: splice() all we have to the writing side, not what we just read
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97457 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.
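
The byte counts from the traces above can be replayed with a small model of the pipe. Everything in this sketch is hypothetical (it does not call splice() at all); it only shows why writing "what we just read" leaves 97457 bytes behind, while offering the whole pipe size drains them:

```c
#include <stddef.h>

/* Model of the pipe as a byte counter; all names are hypothetical.
 * The kernel keeps this count itself, which is why the fix can simply
 * pass the whole pipe size to splice().
 */
struct pipe_model {
	size_t pending;		/* bytes read into the pipe, not yet written */
};

static void model_read(struct pipe_model *p, size_t n)
{
	p->pending += n;
}

/* Try to write 'want' bytes, of which the receiver accepts at most
 * 'accepted'. Returns the number of bytes actually moved.
 */
static size_t model_write(struct pipe_model *p, size_t want, size_t accepted)
{
	size_t n = want < accepted ? want : accepted;

	if (n > p->pending)
		n = p->pending;
	p->pending -= n;
	return n;
}
```

Replaying the trace: after reading 98304 and writing 33615, 64689 bytes are pending; reading 32768 more with nothing accepted gives 97457; reading 1320 and writing only 1320 still leaves 97457, whereas a write offered with the full pipe size (524288) would move all of them.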

Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:18 +02:00
David Gibson
9e5df350d6 tcp: Use structures to construct initial TCP options
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte.  However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).

Rework this to use a structure and initialisers which make it a bit
clearer what's going on.
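
To illustrate the idea, here is a sketch of SYN options (MSS, a NOP for padding, window scaling) described by a packed structure and a designated initialiser. The type, field and macro names are made up for this example and are not necessarily those used in tcp.c; the option kinds and lengths are the standard TCP ones:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical layout for the options of a SYN segment */
struct syn_opts_sketch {
	struct {
		uint8_t kind;	/* 2: Maximum Segment Size */
		uint8_t len;	/* 4 */
		uint16_t mss;	/* network byte order */
	} __attribute__((packed)) mss;
	uint8_t nop;		/* 1: No-Operation, pads to 8 bytes */
	struct {
		uint8_t kind;	/* 3: Window Scale */
		uint8_t len;	/* 3 */
		uint8_t shift;
	} __attribute__((packed)) ws;
} __attribute__((packed));

/* Initialiser for use inside a function body (htons() is not a
 * constant expression)
 */
#define SYN_OPTS_SKETCH(mss_, shift_)					\
	((struct syn_opts_sketch){					\
		.mss = { .kind = 2, .len = 4, .mss = htons(mss_) },	\
		.nop = 1,						\
		.ws  = { .kind = 3, .len = 3, .shift = (shift_) },	\
	})
```

Compared with filling a byte array by index, each field states which option it belongs to, and the structure's size (8 bytes here) matches the options' length on the wire.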

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-21 18:51:04 +02:00
David Gibson
b4dace8f46 fwd: Direct inbound spliced forwards to the guest's external address
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing.  This gives us a very large performance improvement when it's
possible.

Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address.  However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
forwarding exposes those services to the host, which isn't generally what
we want.

Correct this by instead forwarding inbound "splice" flows to the guest's
external address.

Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:03 +02:00
David Gibson
58e6d68599 test: Clarify test for spliced inbound transfers
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1.  This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace.  That's not an inherent property for "spliced" connections
and arguably an undesirable one.  Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.

Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.

Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:00 +02:00
David Gibson
1fa421192c passt.1: Clarify and update "Handling of local addresses" section
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour.  Update it accordingly.

It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host.  Use "loopback
address" to be more precise about this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:57 +02:00
David Gibson
ef8a5161d0 passt.1: Mark --stderr as deprecated more prominently
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label.  Add
one to make it easier to spot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:54 +02:00
David Gibson
53176ca91d test: Wait for DAD on DHCPv6 addresses
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable.  That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one), however it's not true with the Debian
dhclient-script.  The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes.  That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.

We usually get away with this, because by the time we do anything where the
address matters, DAD has completed.  However, it's not robust, so we should
explicitly wait for DAD to complete when we get a DHCPv6 address.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:51 +02:00
David Gibson
75b9c0feb0 test: Explicitly wait for DAD to complete on SLAAC addresses
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready.  In several places we have an explicit 'sleep 2' to wait for that
to complete.

Fixed length delays are never a great idea, although this one is pretty
solid.  Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.

Replace the fixed sleeps with a loop waiting for DAD to complete.  We do
this by looping waiting for all tentative addresses to disappear.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:47 +02:00
David Gibson
f9d677bff6 arp: Fix a handful of small warts
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
 * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
   example.
 * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
   sizeof(am->tip) (same value, but makes more logical sense)
 * Describe the guest's assigned address as such, rather than as "our
   address" - that's not usually what we mean by "our address" these days
 * Remove "we might have the same IP address" comment which I can't make
   sense of in context (possibly it's relating to the statement below,
   which already has its own comment?)

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:04 +02:00
Stefano Brivio
2d7f734c45 tcp: Send "empty" handshake ACK before first data segment
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.

We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.

However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.

This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude it's
actually causing problems.

Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-15 20:34:26 +02:00
Stefano Brivio
7612cb80fe test: Pass TRACE from run_term() into ./run from_term
Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests
with TRACE=1 will not actually enable tracing output.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:19 +02:00
Stefano Brivio
b40880c157 test/lib/term: Always use printf for messages with escape sequences
...instead of echo: otherwise, bash won't handle escape sequences we
use to colour messages (and 'echo -e' is not specified by POSIX).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:00 +02:00
David Gibson
ff63ac922a conf: Add --dns-host option to configure host side nameserver
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to.  This is
controlled by the c->ip[46].dns_host variables.  This is set to the first
nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.

Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.

Add a new --dns-host option to restore this functionality in a more
sensible way.

Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
David Gibson
9d66df9a9a conf: Add command line switch to enable IP_FREEBIND socket option
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future.  That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place.  We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.

Add a --freebind command line argument to enable this socket option on
all listening sockets.
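
Setting the option on a single IPv4 socket can be sketched as follows, assuming Linux. sock_set_freebind() is a hypothetical helper, and how passt wires the option to --freebind is not shown:

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* sock_set_freebind() is a hypothetical helper: it allows bind() to
 * addresses not (yet) configured on the host, for this socket only.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sock_set_freebind(int s)
{
	int one = 1;

	return setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one));
}
```

Unlike the net.ipv4.ip_nonlocal_bind sysctl, this affects only the socket it is applied to. An IPv6 socket would need the IPV6_FREEBIND counterpart mentioned above.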

Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
Laurent Vivier
151dbe0d3d udp: Update UDP checksum using an iovec array
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:13 +02:00
Laurent Vivier
3d484aa370 tcp: Update TCP checksum using an iovec array
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.

In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We also provide an offset to be able to provide an array that contains
the TCP frame embedded in a lower level frame, and this offset points
to the TCP header inside the iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:10 +02:00
Laurent Vivier
e6548c6437 checksum: Add an offset argument in csum_iov()
The offset allows any headers that are not part of the data
to checksum to be skipped.
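
What an offset argument means for a checksum over an iovec array can be sketched as below. This is a deliberately simple, byte-at-a-time Internet checksum written for illustration; csum_iov_sketch() is a hypothetical name, and the real csum_iov() interface and implementation are not reproduced here:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical sketch: ones' complement checksum of the data in 'iov',
 * skipping the first 'offset' bytes. The result is the numeric value
 * of the checksum with data read as big-endian 16-bit words.
 */
static uint16_t csum_iov_sketch(const struct iovec *iov, size_t n,
				size_t offset)
{
	uint32_t sum = 0;
	size_t pos = 0;	/* position within the checksummed data */
	size_t i, j;

	for (i = 0; i < n; i++) {
		const uint8_t *p = iov[i].iov_base;

		for (j = 0; j < iov[i].iov_len; j++) {
			if (offset) {	/* still inside skipped headers */
				offset--;
				continue;
			}

			/* Even positions are high-order bytes of a word */
			sum += (pos++ & 1) ? p[j] : (uint32_t)p[j] << 8;
			sum = (sum & 0xffff) + (sum >> 16);
		}
	}

	return (uint16_t)~sum;
}
```

Because the position is tracked across buffers, the result does not depend on where the data happens to be split between iovec entries.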

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:08 +02:00
Laurent Vivier
fd8334b25d pcap: Add an offset argument in pcap_iov()
The offset is passed directly to pcap_frame() and allows
any headers that are not part of the frame to
capture to be skipped.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:02 +02:00
Laurent Vivier
72e7d3024b tcp: Use tcp_payload_t rather than tcphdr
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).

Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:50:46 +02:00
Stefano Brivio
def8acdcd8 test: Kernel binary can now be passed via the KERNEL environmental variable
This is quite useful at least for myself as I'm usually running tests
using a guest kernel that's not the same as the one on the host.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-02 14:50:34 +02:00
David Gibson
b55013b1a7 inany: Add inany_pton() helper
We already have an inany_ntop() function to format inany addresses into
text.  Add inany_pton() to parse them from text, and use it in
conf_ports().
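
A rough sketch of such a parser, assuming an address type that stores IPv4
addresses in IPv4-mapped form: struct sketch_inany and sketch_inany_pton()
are hypothetical stand-ins, not the definitions from inany.h.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical stand-in for an "inany" address: always 16 bytes, with IPv4
 * addresses stored as IPv4-mapped IPv6 addresses (::ffff:a.b.c.d).
 */
struct sketch_inany {
	struct in6_addr a6;
};

/* Parse 'src' as IPv4 or IPv6 text.  Returns 1 on success, 0 otherwise. */
int sketch_inany_pton(const char *src, struct sketch_inany *dst)
{
	struct in_addr a4;

	assert(src && dst);

	if (inet_pton(AF_INET, src, &a4) == 1) {
		memset(dst, 0, sizeof(*dst));
		dst->a6.s6_addr[10] = 0xff;
		dst->a6.s6_addr[11] = 0xff;
		memcpy(&dst->a6.s6_addr[12], &a4, sizeof(a4));
		return 1;
	}

	return inet_pton(AF_INET6, src, &dst->a6) == 1;
}
```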

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:17 +02:00
David Gibson
cbde4192ee tcp, udp: Make {tcp,udp}_sock_init() take an inany address
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair.  Use an inany instead.  Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:16 +02:00
David Gibson
b8d4fac6a2 util, pif: Replace sock_l4() with pif_sock_l4()
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.

Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket address.  This case doesn't yet arise, but
it might cause us trouble in future.

Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic.  This only makes sense if there is no specific
address given.  We verify this at runtime, but it would be nicer if we
could enforce it structurally.

For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems.  Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.

The new function takes the address as an inany *, with NULL indicating the
dual stack case.  This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
David Gibson
204e77cd11 udp: Don't attempt to get dual-stack sockets in nonsensical cases
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible.   However udp_sock_init()
attempts to do this in some cases that can't work.  Specifically we can
only do this when listening on any address.  That's never true for the
ns (splicing) case, because we always listen on loopback.  For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.

This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path.  But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
Laurent Vivier
8f8c4d27eb tcp: Allow checksum to be disabled
We may need to leave the TCP checksum unset. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:28 +02:00
Laurent Vivier
4fe5f4e813 udp: Allow checksum to be disabled
We may need to leave the UDP checksum unset. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:20 +02:00
David Gibson
d836d9e345 util: Remove possible quadratic behaviour from write_remainder()
write_remainder() steps through the buffers in an IO vector writing out
everything past a certain byte offset.  However, on each iteration it
rescans the buffer from the beginning to find out where we're up to.  With
an unfortunate set of write sizes this could lead to quadratic behaviour.

In an even less likely set of circumstances (total vector length > maximum
size_t) the 'skip' variable could overflow.  This is one factor in a
longstanding Coverity error we've seen (although I still can't figure out
the remainder of its complaint).

Rework write_remainder() to always work out our new position in the vector
relative to our old/current position, rather than starting from the
beginning each time.  As a bonus this seems to fix the Coverity error.
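
The idea can be sketched as follows.  This is a simplified illustration
with a hypothetical name, sketch_write_remainder(), which writes one buffer
at a time; it is not the code in util.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical sketch: write everything in iov[] past the first 'skip'
 * bytes.  Returns 0 on success, -1 on error (with errno set by write()).
 */
int sketch_write_remainder(int fd, const struct iovec *iov, size_t iovcnt,
			   size_t skip)
{
	size_t i = 0;	/* current buffer; 'skip' is an offset within it */

	assert(iov || !iovcnt);

	for (;;) {
		ssize_t rc;

		/* Advance from the current position, never from iov[0] */
		while (i < iovcnt && skip >= iov[i].iov_len)
			skip -= iov[i++].iov_len;
		if (i == iovcnt)
			return 0;

		rc = write(fd, (const char *)iov[i].iov_base + skip,
			   iov[i].iov_len - skip);
		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		skip += (size_t)rc;	/* never exceeds iov[i].iov_len */
	}
}
```

Because 'skip' is always relative to the current buffer, it stays bounded
by that buffer's length, so each buffer is visited once and the offset
cannot grow to the total vector length.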

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:03 +02:00
David Gibson
bfc294b90d util: Add helper to write() all of a buffer
write(2) might not write all the data it is given.  Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.

Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.
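
A minimal sketch of this kind of helper, under the assumption that EINTR
should simply be retried: sketch_write_all_buf() is a hypothetical name
and may differ from the real helper in its details.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: keep calling write() until all 'len' bytes of 'buf'
 * are written.  Returns 0 on success, -1 on error (errno set by write()).
 */
int sketch_write_all_buf(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t left = len;

	assert(buf || !len);

	while (left) {
		ssize_t rc = write(fd, p, left);

		if (rc < 0) {
			if (errno == EINTR)	/* interrupted: try again */
				continue;
			return -1;
		}

		p += rc;
		left -= (size_t)rc;
	}

	return 0;
}
```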

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:59 +02:00
David Gibson
bb41901c71 tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:55 +02:00
David Gibson
265b2099c7 tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
This function has a block conditional on !snd_wnd_cap shortly before an
snd_wnd_cap is statically false).

Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:50 +02:00
David Gibson
4aff6f9392 tcp: Clean up tcpi_snd_wnd probing
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcpi_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is a perfectly possible window for the peer to report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retrieved.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.
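
The length check can be illustrated with the sketch below.  Both function
names are hypothetical, and the caller is assumed to pass the offset and
size of the field of interest (for instance via offsetof() on a struct
tcp_info definition that includes tcpi_snd_wnd).

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

/* True if a getsockopt() result of 'len' bytes fully covers a field that
 * starts at 'off' and is 'size' bytes long.
 */
int sketch_field_covered(socklen_t len, size_t off, size_t size)
{
	return (size_t)len >= off + size;
}

/* Hypothetical probe: ask the kernel for TCP_INFO on socket 's' and check
 * whether it filled in the field at the given offset.  Returns 1 if it did,
 * 0 if not, -1 if getsockopt() failed.
 */
int sketch_probe_tcp_info_field(int s, size_t off, size_t size)
{
	unsigned char buf[1024];  /* assumed larger than any struct tcp_info */
	socklen_t len = sizeof(buf);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, buf, &len) < 0)
		return -1;

	return sketch_field_covered(len, off, size);
}
```

Since the answer depends only on the running kernel, a probe like this can
run once at start-up and the result can be kept in a global flag.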

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:47 +02:00
David Gibson
7d8804beb8 tcp: Make some extra functions private
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header.  Make them static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:33 +02:00
David Gibson
5ff5d55291 tcp: Avoid overlapping memcpy() in DUP_ACK handling
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame.  However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy().  This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:59 +02:00
David Gibson
1f414ed8f0 tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:46 +02:00
Stefano Brivio
6b38f07239 apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
...for both passt and pasta: use passt's abstraction for this.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
Stefano Brivio
116bc8266d selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
Since commit eedc81b6ef ("fwd, conf: Probe host's ephemeral ports"),
we might need to read from /proc/sys/net/ipv4/ip_local_port_range in
both passt and pasta.

While pasta was already allowed to open and write /proc/sys/net
entries, read access was missing in SELinux's type enforcement: add
that.

In passt, instead, this is the first time we need to access an entry
there: add everything we need.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
David Gibson
a33ecafbd9 tap: Don't risk truncating frames on full buffer in tap_pasta_input()
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full.  However, this has an ugly edge case: when we get close
to buffer full, we will provide just the remaining space as a read()
buffer.  If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:46 +02:00
David Gibson
d2a1dc744b tap: Restructure in tap_pasta_input()
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around.  This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass.  This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.

Along the way handle a couple of extra edge cases:
 - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
   ports where those might not have the same value
 - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:43 +02:00
David Gibson
11e29054fe tap: Improve handling of EINTR in tap_passt_input()
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
three cases it returns from the function.  That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again.  For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewed EPOLLIN event immediately in
that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.
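
In isolation, the retry pattern looks roughly like this.
sketch_recv_retry() is a hypothetical wrapper used only for illustration;
EAGAIN and EWOULDBLOCK are still returned to the caller, which would then
wait for the next EPOLLIN event.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical wrapper: call recv(), retrying immediately for as long as
 * it fails with EINTR.  Any other result, including EAGAIN or EWOULDBLOCK,
 * is passed back to the caller unchanged.
 */
ssize_t sketch_recv_retry(int fd, void *buf, size_t len, int flags)
{
	ssize_t n;

	do
		n = recv(fd, buf, len, flags);
	while (n < 0 && errno == EINTR);

	return n;
}
```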

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:41 +02:00
David Gibson
49fc4e0414 tap: Split out handling of EPOLLIN events
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:37 +02:00
Stefano Brivio
63513e54f3 util: Fix order of operands and carry of one second in timespec_diff_us()
If the nanoseconds of the minuend timestamp are less than the
nanoseconds of the subtrahend timestamp, we need to carry one second
in the subtraction.

I subtracted this second from the minuend, but didn't actually carry
it in the subtraction of nanoseconds, and logged timestamps would jump
back whenever we switched to the first branch of timespec_diff_us()
from the second one.

Most likely, the reason why I didn't carry the second is that I
instinctively thought that swapping the operands would have the same
effect. But it doesn't, in general: that only happens with arithmetic
modulo powers of 2. Undo the swap as well.
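
For illustration, a subtraction with the carry applied on both sides could
be written as below.  sketch_timespec_diff_us() is a hypothetical name and
the sketch assumes normalised timestamps with a not earlier than b.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch: microseconds elapsed from 'b' to 'a', that is,
 * a - b, assuming 'a' is not earlier than 'b'.
 */
int64_t sketch_timespec_diff_us(const struct timespec *a,
				const struct timespec *b)
{
	assert(a->tv_nsec < 1000000000 && b->tv_nsec < 1000000000);

	if (a->tv_nsec < b->tv_nsec) {
		/* Take one second from the minuend and credit it, as
		 * 10^9 ns, to the subtraction of nanoseconds
		 */
		return (int64_t)(a->tv_sec - b->tv_sec - 1) * 1000000 +
		       (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000;
	}

	return (int64_t)(a->tv_sec - b->tv_sec) * 1000000 +
	       (a->tv_nsec - b->tv_nsec) / 1000;
}
```

With a = 2.1s and b = 1.6s, the first branch gives 0 whole seconds plus
(100000000 + 1000000000 - 600000000) / 1000 = 500000 microseconds.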

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 13:01:34 +02:00
David Gibson
748ef4cd6e cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.

This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.

That said, in the two places this shows up, it's pretty easy to work
around.  The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.

So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:20 +02:00
Stefano Brivio
afedc2412e tcp: Use EPOLLET for any state of not established connections
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).

While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.

On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.

At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.

Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.

In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the connection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.

Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:16 +02:00
David Gibson
aff5a49b0e udp: Handle more error conditions in udp_sock_errs()
udp_sock_errs() reads out everything in the socket error queue.  However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.

One possibility is that the error is reported instead by the SO_ERROR
sockopt.  Check for that case and report it as best we can.  If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.
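
Reading the pending error with SO_ERROR can be sketched as below.
sketch_sock_pending_error() is a hypothetical helper, not part of
udp_sock_errs(), and how the result is reported is left to the caller.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: fetch (and thereby clear) the pending error on
 * socket 's'.  Returns the error number, 0 if there is none, or -1 if
 * getsockopt() itself failed.
 */
int sketch_sock_pending_error(int s)
{
	int err = 0;
	socklen_t len = sizeof(err);

	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;

	return err;
}
```

A caller seeing EPOLLERR with an empty error queue and a zero result here
has nothing left to clear, which is the case treated as unrecoverable.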

[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010

Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:38 +02:00
David Gibson
bd99f02a64 udp: Treat errors getting errors as unrecoverable
We can get network errors, usually transient, reported via the socket error
queue.  However, at least theoretically, we could get errors trying to
read the queue itself.  Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:35 +02:00
David Gibson
bd092ca421 udp: Split socket error handling out from udp_sock_recv()
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding.  That made sense initially, since
both listening and reply sockets need to do this.  However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.

So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors".  We don't
have any of these so far, but some cases we're working on might require it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:33 +02:00
David Gibson
88bfa3801e flow: Helpers to log details of a flow
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging.  We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions.  Add some
helper functions and macros to make it easy to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:30 +02:00
David Gibson
1166401c2f udp: Allow UDP flows to be prematurely closed
Unlike TCP, UDP has no in-band signalling for the end of a flow.  So the
only way we remove flows is on a timer if they have no activity for 180s.

However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow.  We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash).  However it will still wait 3 minutes
to clear away the stale entry.

Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up.  We also publish udp_flow_close() so it
can be called from other places to abort UDP flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:24 +02:00
David Gibson
7ad9f9bd2b flow: Fix incorrect hash probe in flowside_lookup()
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions.  Usually
that's implemented by flow_hash_probe().  However, due to some details we
need a second implementation in flowside_lookup().  An embarrassing
oversight in rebasing from earlier versions has meant that version is
incorrect, trying to step forward through clusters rather than backward.

In situations with the right sorts of hash near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state.  If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.
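
To show why the direction matters, here is a toy table with backward
linear probing.  All names are hypothetical and the table is far simpler
than the flow hash table: insertion and lookup share one probe function,
so they necessarily step the same way.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HASH_SIZE 8	/* tiny table, for illustration only */

/* 0 marks an empty slot, so keys must be non-zero in this sketch */
struct sketch_table {
	unsigned keys[SKETCH_HASH_SIZE];
};

static size_t sketch_hash(unsigned key)
{
	return key % SKETCH_HASH_SIZE;
}

/* Step one slot backwards, wrapping around at the start of the table */
static size_t sketch_prev(size_t b)
{
	return b ? b - 1 : SKETCH_HASH_SIZE - 1;
}

/* Find the slot holding 'key', or the empty slot where it would go.
 * Assumes the table is never completely full.
 */
size_t sketch_probe(const struct sketch_table *t, unsigned key)
{
	size_t b = sketch_hash(key);

	while (t->keys[b] && t->keys[b] != key)
		b = sketch_prev(b);

	return b;
}

void sketch_insert(struct sketch_table *t, unsigned key)
{
	assert(key != 0);
	t->keys[sketch_probe(t, key)] = key;
}

int sketch_lookup(const struct sketch_table *t, unsigned key)
{
	return t->keys[sketch_probe(t, key)] == key;
}
```

If a second lookup routine stepped forward instead, it would leave the
cluster built by insertion and could report an existing key as missing.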

Fixes: acca4235c4 ("flow, tcp: Generalise TCP hash table to general flow hash table")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:52:31 +02:00
Stefano Brivio
0ea60e5a77 log: Don't prefix log file messages with time and severity if they're continuations
In fecb1b65b1 ("log: Don't prefix message with timestamp on --debug
if it's a continuation"), I fixed this for --debug on standard error,
but not for log files: if messages are continuations, they shouldn't
be prefixed by timestamp and severity.

Otherwise, we'll print stuff like this:

  0.0028: ERROR:   Receive error on guest connection, reset0.0028:  ERROR:   : Bad file descriptor

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 12:52:21 +02:00
Michal Privoznik
38363964fc Makefile: Enable _FORTIFY_SOURCE iff needed
On some systems source fortification is enabled whenever code
optimization is enabled (e.g. with -O2). Since code fortification
is explicitly enabled too (with a possibly different value than the
system wants, there are three levels [1]), distros are required
to patch our Makefile, e.g. [2].

Detect whether fortification is already enabled and enable it
explicitly only if really needed.

1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html
2: edfeb8763a

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:21 +02:00
David Gibson
eedc81b6ef fwd, conf: Probe host's ephemeral ports
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouldn't leave local
ports to use for outgoing connections.  Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.

Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535.  However, that's not the range Linux uses: the
default there is 32768-60999, configurable with the
net.ipv4.ip_local_port_range sysctl.

We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway.  So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.

Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails.  This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage.  Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.
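
A probe with fallback might look like the sketch below.  The names and the
validation rules are assumptions made for illustration; the path would be
/proc/sys/net/ipv4/ip_local_port_range, which holds two numbers separated
by whitespace.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_RFC6335_MIN	49152
#define SKETCH_RFC6335_MAX	65535

struct sketch_port_range {
	unsigned min, max;
};

/* Parse "min<whitespace>max".  Returns 0 on success; on failure returns -1
 * and leaves *r untouched.
 */
int sketch_parse_port_range(const char *s, struct sketch_port_range *r)
{
	unsigned min, max;

	assert(s && r);

	if (sscanf(s, "%u %u", &min, &max) != 2 ||
	    min < 1 || max > 65535 || min > max)
		return -1;

	r->min = min;
	r->max = max;
	return 0;
}

/* Read the range from 'path', falling back to the RFC 6335 range if the
 * file can't be opened, read or parsed.
 */
struct sketch_port_range sketch_probe_ephemeral(const char *path)
{
	struct sketch_port_range r = { SKETCH_RFC6335_MIN, SKETCH_RFC6335_MAX };
	char line[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return r;

	if (fgets(line, sizeof(line), f))
		(void)sketch_parse_port_range(line, &r);

	fclose(f);
	return r;
}
```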

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:08 +02:00
David Gibson
4a41dc58d6 conf, fwd: Don't attempt to forward port 0
When using -t all, -u all or exclude-only ranges, we'll attempt to forward
all non-ephemeral port numbers, including port 0.  However, this won't work
as intended: bind() treats a zero port not as literal port 0, but as
"pick a port for me".  Because of the special meaning of port 0, we mostly
outright exclude it in our handling.

Do the same for setting up forwards, not attempting to forward for port 0.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:05 +02:00
David Gibson
1daf6f4615 conf, fwd: Make ephemeral port logic more flexible
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams.  Because of that,
they're generally not good choices for listening servers to bind to.

Therefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports.  Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space.  We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.

Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).

The new handling is undoubtedly more computationally expensive, but
since it's a one-off operation at startup, I don't think it really
matters.
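
A per-port loop along these lines handles an ephemeral range that starts
or ends anywhere, not just on a multiple of 8.  The names are hypothetical
and the bitmap layout (one bit per port, lowest bit first) is an assumption
of the sketch, not a statement about the forward bitmap in fwd.[ch].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_PORTS	65536

/* Set the bit for every port outside [eph_min, eph_max].  'map' must have
 * room for SKETCH_NUM_PORTS / 8 bytes.
 */
void sketch_map_non_ephemeral(uint8_t *map, unsigned eph_min, unsigned eph_max)
{
	unsigned port;

	assert(eph_min <= eph_max);

	memset(map, 0, SKETCH_NUM_PORTS / 8);

	for (port = 0; port < SKETCH_NUM_PORTS; port++) {
		if (port >= eph_min && port <= eph_max)
			continue;	/* ephemeral: leave unmapped */

		map[port / 8] |= 1 << (port % 8);
	}
}

/* Return 1 if the bit for 'port' is set, 0 otherwise */
int sketch_port_is_set(const uint8_t *map, unsigned port)
{
	return !!(map[port / 8] & (1 << (port % 8)));
}
```

This visits all 65536 ports instead of filling whole bytes at once, which
is the extra cost the message refers to.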

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:25:51 +02:00
Stefano Brivio
712ca32353 seccomp.sh: Try to account for terminal width while formatting list of system calls
Avoid excess lines on wide terminals, but make sure we don't fail if
we can't fetch the number of columns for any reason, as it's not a
fundamental feature and we don't want to break anything with it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 14:30:17 +02:00
David Gibson
e0be6bc2f4 udp: Use dual stack sockets for port forwarding when possible
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections.  By doing this we halve the number of
listening sockets we need (assuming passt/pasta is listening on the same
ports for IPv4 and IPv6).  When forwarding many ports (e.g. -u all) this
can significantly reduce the amount of kernel memory that passt consumes.

We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual
stack sockets for port forwarding when possible".  Add similar support for
UDP "listening" sockets.  Since UDP sockets don't use as much kernel memory
as TCP sockets this isn't as big a saving, but it's still significant.
When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all),
this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version
6.10.6 on Fedora 40, x86_64).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:41 +02:00
David Gibson
c78b194001 udp: Remove unnecessary local from udp_sock_init()
The 's' variable is always redundant with either 'r4' or 'r6', so remove
it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:38 +02:00
David Gibson
620e19a1b4 udp: Merge udp[46]_mh_recv arrays
We've already gotten rid of most of the IPv4/IPv6 specific data structures
in udp.c by merging them with each other.  One significant one remains:
udp[46]_mh_recv.  This was a bit awkward to remove because of a subtle
interaction.  We initialise the msg_namelen fields to represent the total
size we have for a socket address, but when we receive into the arrays
those are modified to the actual length of the sockaddr we received.

That meant that if we naively merged the arrays and received IPv4
datagrams, then IPv6 datagrams, the addresses for the latter would be
truncated.  In this patch, address that by resetting the received
msg_namelen as soon as we've found a flow for the datagram.  Finding the
flow is the only thing that might use the actual sockaddr length, although
we in fact don't need it for the time being.

This also removes the last use of the 'v6' field from udp_listen_epoll_ref,
so remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:25 +02:00
Stefano Brivio
418feb37ec test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile
Some distributions already have OpenSSH 9.8, which introduces split
sshd/sshd-session binaries, and there we need to copy the binary from
the host, which can be /usr/libexec/openssh/sshd-session (Fedora
Rawhide), /usr/lib/ssh/sshd-session (Arch Linux),
/usr/lib/openssh/sshd-session (Debian), and possibly other paths.

Add at least those three, and, if we don't find sshd-session, assume
we don't need it: it could very well be an older version of OpenSSH,
as reported by David for Fedora 40, or perhaps another daemon (would
Dropbear even work? I'm not sure).

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: d6817b3930 ("test/passt.mbuto: Install sshd-session OpenSSH's split process")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-27 09:03:47 +02:00
Stefano Brivio
1d6142f362 README: pasta is indeed a supported back-end for rootless Docker
...https://github.com/moby/moby/issues/48257 just reminded me.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:05:26 +02:00
Stefano Brivio
f00ebda369 util: Don't stop on unrelated values when looking for --fd in close_open_files()
Seen with krun: we get a file descriptor via --fd, but we close it and
happily use the same number for TCP files.

The issue is that if we also get other options before --fd, with
arguments, getopt_long() stops parsing them because it sees them as
non-option values.

Use the - modifier at the beginning of optstring (before :, which is
needed to avoid printing errors) instead of +, which means we'll
continue parsing after finding unrelated option values, but
getopt_long() won't reorder them anyway: they'll be passed with option
value '1', which we can ignore.

By the way, we also need to add : after F in the optstring, so that
we're able to parse the option when given as short name as well.

Now that we change the parsing mode between close_open_files() and
conf(), we need to reset optind to 0, not to 1, whenever we call
getopt_long() again in conf(), so that the internal initialisation
of getopt_long() evaluating GNU extensions is re-triggered.
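
The scan described above can be sketched like this, assuming glibc's
getopt_long().  sketch_find_fd_arg() is a hypothetical function written for
illustration, not close_open_files() itself.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical pre-scan: return the value given to --fd or -F, or -1 if
 * the option is not present.
 */
int sketch_find_fd_arg(int argc, char **argv)
{
	const struct option opts[] = {
		{ "fd",	required_argument,	NULL,	'F' },
		{ 0 },
	};
	int fd = -1, name;

	assert(argc > 0 && argv);

	optind = 0;	/* make getopt_long() re-evaluate the optstring mode */

	/* '-': report non-option values as option 1, in order
	 * ':': don't print errors for options we don't know about
	 */
	while ((name = getopt_long(argc, argv, "-:F:", opts, NULL)) != -1) {
		if (name == 'F')
			fd = atoi(optarg);
		/* 1, '?' and ':' are unrelated to --fd: ignore them */
	}

	optind = 0;	/* the next parser starts from a clean state */

	return fd;
}
```

With arguments such as "--foo bar --fd 5", the unknown option and its value
are skipped and the scan still reaches --fd, instead of stopping at "bar".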

Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828
Fixes: baccfb95ce ("conf: Stop parsing options at first non-option argument")
Fixes: 09603cab28 ("passt, util: Close any open file that the parent might have leaked")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:53 +02:00
Stefano Brivio
05453ea590 test: Update list of dependencies in README.md
Mostly packages we now need to run Podman-based tests.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:30 +02:00
Stefano Brivio
1a66806c18 tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.

Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.

Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:17 +02:00
Stefano Brivio
6e9ecf5741 util: Provide own version of close_range(), and no-op fallback
musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial)
close_range() implementation. This will probably be added to musl
soon, by the way:
  https://www.openwall.com/lists/musl/2024/08/01/9

Add a weakly-aliased implementation, if it's supported by the kernel.
If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31
file descriptors calling close() on them is probably not a good idea.
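
A minimal sketch of the idea, assuming a Linux build environment. The name
close_range_compat() is hypothetical (chosen to avoid clashing with a libc
that already provides close_range()), and the weak aliasing used by the
patch is left out:

```c
#define _GNU_SOURCE
#include <errno.h>		/* errno is set if the system call fails */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical name: the actual patch provides close_range() itself */
static int close_range_compat(unsigned int first, unsigned int last, int flags)
{
#ifdef SYS_close_range
	/* Kernel headers know the system call: invoke it directly */
	return syscall(SYS_close_range, first, last, flags);
#else
	/* Headers predate the system call: succeed without closing anything */
	(void)first;
	(void)last;
	(void)flags;
	return 0;
#endif
}
```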

Reported-by: lemmi <lemmi@nerd2nerd.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:48 +02:00
Stefano Brivio
7291b70ba7 udp_flow: Add missing unistd.h include for close()
For some reason, this is reported only with musl, and older glibc
versions (2.31, at least).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:34 +02:00
Stefano Brivio
396307541e test: Duplicate existing recvfrom() valgrind suppression for recv()
Some architectures, including i686, actually have a recv() system
call, not just a recvfrom(), and we need to cover the recv() with
MSG_TRUNC into a NULL buffer for them as well.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:23 +02:00
Stefano Brivio
d6817b3930 test/passt.mbuto: Install sshd-session OpenSSH's split process
OpenSSH now ships a per-session binary, sshd-session, with sshd
acting as mere listener. It's typically not found in $PATH, so specify
the whole path at which it's commonly installed in $PROGS.

Link: https://www.openssh.com/releasenotes.html#9.8p1
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:03 +02:00
Stefano Brivio
34be8eeb38 test/passt.mbuto: Run sshd from vsock proxy with absolute path
...OpenSSH >= 9.8 otherwise complains that:

  sshd requires execution with an absolute path

Link: https://bugs.gentoo.org/936041
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:02:37 +02:00
Stefano Brivio
aded2b671c test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386)
It's qemu-system-i386, but uname -m reports i686. I didn't test i486
and i586.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:01:48 +02:00
Stefano Brivio
2aea1da143 treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.

Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:

  https://wiki.debian.org/ReleaseGoals/64bit-time

that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.

Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).

I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.

Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:00:43 +02:00
David Gibson
57b7bd2a48 fwd, conf: Allow NAT of the guest's assigned address
The guest is usually assigned one of the host's IP addresses.  That means
it can't access the host itself via its usual address.  The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host.  However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.

Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:40 +02:00
David Gibson
8436c0d61b fwd: Distinguish translatable from untranslatable addresses on inbound
fwd_nat_from_host() needs to adjust the source address for new flows coming
from an address which is not accessible to the guest.  Currently we always
use our_tap_addr or our_tap_ll.  However in cases where the address is
accessible to the guest via translation (i.e. via --map-host-loopback) then
it makes more sense to use that translation, rather than the fallback
mapping of our_tap_*.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:37 +02:00
David Gibson
e813a4df7d conf: Allow address remapped to host to be configured
Because the host and guest share the same IP address with passt/pasta, it's
not possible for the guest to directly address the host.  Therefore we
allow packets from the guest going to a special "NAT to host" address to be
redirected to the host, appearing there as though they have both source and
destination address of loopback.

Currently that special address is always the address of the default
gateway (or none).  That can be a problem if we want that gateway to be
addressable by the guest.  Therefore, allow the special "NAT to host"
address to be overridden on the command line with a new --map-host-loopback
option.

In order to exercise and test it, update the passt_in_ns and perf
tests to use this option and give different mapping addresses for the
two layers of the environment.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:35 +02:00
David Gibson
dbaaebbe00 test: Reconfigure IPv6 address after changing MTU
In the TCP throughput tests, we adjust the guest's MTU in order to test
various packet sizes.  Some of those are below 1280 which causes IPv6 to
be deconfigured on the guest interface.  When we increase it above 1280
again, IPv6 is re-enabled and we get an address in the right prefix with
NDP, but we don't get exactly the expected address back - that's only
communicated with --config-net or DHCPv6.

With changes to how we handle NAT this can cause some of the IPv6 tests to
fail, because they don't use the address that passt/pasta expects, and the
guest doesn't initiate any traffic which allows us to learn what the new
address is.

Work around this by re-invoking dhclient -6 between adjusting the MTU and
running IPv6 test cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:33 +02:00
David Gibson
935bd81936 conf, fwd: Split notion of gateway/router from guest-visible host address
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway.  We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT.  So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.

Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances.  So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw.  However
when --no-map-gw is specified we leave them unspecified instead.  This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:31 +02:00
David Gibson
90e83d50a9 Don't take "our" MAC address from the host
When sending frames to the guest over the tap link, we need a source MAC
address.  Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
 * We can't preserve the real MAC address of packets from anywhere
   external so there's no transparency case here
 * In fact, it's confusingly different from how we handle IP addresses:
   whereas we give the guest the same IP as the host, we're making the
   host's MAC the one MAC that the guest *can't* use for itself.
 * We already need a fallback case if the host doesn't have an Ethernet
   like MAC (e.g. if it's connected via a point to point interface, such
   as a wireguard VPN).

Change to just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55.  It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps.  This can still, of course, be overridden with the -M option.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:28 +02:00
David Gibson
356de97e43 fwd: Split notion of "our tap address" from gateway for IPv4
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
  1. The router/gateway address as seen by the guest
  2. An address to NAT to the host when --no-map-gw isn't specified
  3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's
   no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
   accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw).  For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now.  Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here (e.g. if the host is disconnected or only has a point to point
link with no gateway address).  In that case we have to disable forwarding
of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:26 +02:00
David Gibson
4d8dd1fbe7 fwd: Helpers to clarify what host addresses aren't guest accessible
We usually avoid NAT, but in a few cases we need to apply address
translations.  For inbound connections that happens for addresses which
make sense to the host but are either inaccessible, or mean a different
location from the guest's point of view.

Add some helper functions to determine such addresses, and use them in
fwd_nat_from_host().  In doing so clarify some of the reasons for the
logic.  We'll also have further use for these helpers in future.

While we're there fix one unnecessary inconsistency between IPv4 and IPv6.
We always translated the guest's observed address, but for IPv4 we didn't
translate the guest's assigned address, whereas for IPv6 we did.  Change
this to translate both in all cases for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:23 +02:00
David Gibson
975cfa5f32 Initialise our_tap_ll to ip6.gw when suitable
In every place we use our_tap_ll, we only use it as a fallback if the
IPv6 gateway address is not link-local.  We can avoid that conditional at
use time by doing it at initialisation of our_tap_ll instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:22 +02:00
David Gibson
8d4baa4446 Clarify which addresses in ip[46]_ctx are meaningful where
Some are guest visible addresses and may not be valid on the host, others
are host visible addresses and may not be valid on the guest.  Rearrange
and comment the ip[46]_ctx definitions to make it clearer which is which.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:19 +02:00
David Gibson
a42fb9c000 treewide: Change misleading 'addr_ll' name
c->ip6.addr_ll is not like c->ip6.addr.  The latter is an address for the
guest, but the former is an address for our use on the tap link.  Rename it
accordingly, to 'our_tap_ll'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:16 +02:00
David Gibson
c9f0ec3227 util: Correct sock_l4() binding for link local addresses
When binding an IPv6 socket in sock_l4() we need to supply a scope id
if the address is link-local.  We check for this by comparing the
given address to c->ip6.addr_ll.  This is correct only by accident:
while c->ip6.addr_ll is typically set to the host interface's link
local address, the actual purpose of it is to provide a link local
address for passt's private use on the tap interface.

Instead set the scope id for any link-local address we're binding to.
We're going to need something and this is what makes sense for sockets
on the host.  It doesn't make sense for PIF_SPLICE sockets, but those
should always have loopback, not link-local addresses.
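
A minimal sketch of that rule, with a hypothetical helper name and an
interface index supplied by the caller (how sock_l4() actually obtains it
is not shown here):

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical helper: fill @sa for bind(), setting the scope only if needed */
static void sa6_for_bind(struct sockaddr_in6 *sa, const struct in6_addr *addr,
			 in_port_t port, unsigned int ifindex)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;
	sa->sin6_port = htons(port);
	sa->sin6_addr = *addr;

	/* Any link-local address is ambiguous without an interface index */
	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sin6_scope_id = ifindex;
}
```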

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:13 +02:00
David Gibson
57532f1ded conf: Remove incorrect initialisation of addr_ll_seen
Despite the names, addr_ll_seen does not relate to addr_ll the same
way addr_seen relates to addr.  addr_ll_seen is an observed address
from the guest, whereas addr_ll is *our* link-local address for use on
the tap link when we can't use an external endpoint address.  It's
used both for passt provided services (DHCPv6, NDP) and in some cases
for connections from addresses the guest can't access.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:10 +02:00
David Gibson
0b25cac94e conf: Treat --dns addresses as guest visible addresses
Although it's not 100% explicit in the man page, addresses given to
the --dns option are intended to be addresses as seen by the guest.
This differs from addresses taken from the host's /etc/resolv.conf,
which must be translated to guest accessible versions in some cases.

Our implementation is currently inconsistent on this: when using
--dns-forward, you must usually also give --dns with the matching address,
which is meaningful only in the guest's address view.  However if you give
--dns with a loopback address, it will be translated like a host view
address.

Move the remapping logic for DNS addresses out of add_dns4() and add_dns6()
into add_dns_resolv() so that it is only applied for host nameserver
addresses, not for nameservers given explicitly with --dns.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:08 +02:00
David Gibson
a6066f4e27 conf: Correct setting of dns_match address in add_dns6()
add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to
the given address, rather than the gateway address.  This is doubly wrong:
 - We've just established the given address is a host loopback address
   the guest can't access
 - We've just set ip6.dns[] to tell the guest to use the gateway address,
   so it won't use the dns_match address we're setting

Correct this to use the gateway address, like IPv4.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:06 +02:00
David Gibson
7c083ee41c conf: Move adding of a nameserver from resolv.conf into subfunction
get_dns() is already quite deeply nested, and future changes I have in
mind will add more complexity.  Prepare for this by splitting out the
adding of a single nameserver to the configuration into its own function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:04 +02:00
David Gibson
1d10760c9f conf: Move DNS array bounds checks into add_dns[46]
Every time we call add_dns[46] we need to first check if there's space in
the c->ip[46].dns array for the new entry.  We might as well make that
check in add_dns[46]() itself.

In fact it looks like the calls in get_dns() had an off by one error, not
allowing the last entry of the array to be filled.  So, that bug is also
fixed by the change.
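
For illustration only, a bounds check done by the callee might be sketched
as follows. The structure, the MAXNS value and the dns4_add() name are
assumptions, not the definitions from conf.c:

```c
#include <netinet/in.h>

#define MAXNS 3	/* assumed capacity, for illustration only */

struct dns4_list {
	struct in_addr dns[MAXNS];
};

/* Hypothetical stand-in for add_dns4(): store @addr at slot @idx.
 * Return: index of the next free slot, or @idx unchanged if the array is full
 */
static unsigned int dns4_add(struct dns4_list *l, const struct in_addr *addr,
			     unsigned int idx)
{
	/* Valid slots are 0 .. MAXNS - 1: the last one must remain usable */
	if (idx >= sizeof(l->dns) / sizeof(l->dns[0]))
		return idx;

	l->dns[idx] = *addr;
	return idx + 1;
}
```

With the check in one place, callers can simply pass the index they are
tracking and can't get the comparison wrong by one.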

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:02 +02:00
David Gibson
6852bd07cc conf: More accurately count entries added in get_dns()
get_dns() counts the number of guest DNS servers it adds, and gives an
error if it couldn't add any.  However, this count ignores the fact that
add_dns[46]() may in some cases *not* add an entry.  Use the array indices
we're already tracking to get an accurate count.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:00 +02:00
David Gibson
c679894668 conf: Use array indices rather than pointers for DNS array slots
Currently add_dns[46]() take a somewhat awkward double pointer to the
entry in the c->ip[46].dns array to update.  It turns out to be easier to
work with indices into that array instead.

This diff does add some lines, but it's comments, and will allow some
future code reductions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:58 +02:00
David Gibson
ceea52ca93 treewide: Use struct assignment instead of memcpy() for IP addresses
We rely on C11 already, so we can use clearer and more type-checkable
struct assignment instead of memcpy() for copying IP addresses around.
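
As a small illustration (struct ip6_example and ip6_example_init() are
made up for this sketch, they are not part of the tree):

```c
#include <netinet/in.h>

/* Hypothetical context, reduced to the two fields used here */
struct ip6_example {
	struct in6_addr addr;
	struct in6_addr addr_seen;
};

static void ip6_example_init(struct ip6_example *ip6,
			     const struct in6_addr *addr)
{
	/* Was: memcpy(&ip6->addr, addr, sizeof(ip6->addr));
	 * Assignment needs no size argument, and the compiler rejects a source
	 * of the wrong type, such as a struct in_addr.
	 */
	ip6->addr = *addr;
	ip6->addr_seen = ip6->addr;
}
```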

This exposes some "pointer could be const" warnings from cppcheck, so
address those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:56 +02:00
David Gibson
905ecd2b0b treewide: Rename MAC address fields for clarity
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it.  Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac".  Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:54 +02:00
David Gibson
066e69986b util: Helper for formatting MAC addresses
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display.  Add an eth_ntop() helper for
this.
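
A helper of this kind could be sketched as below. mac_to_str() is a
hypothetical name, and the signature of the real eth_ntop() may differ:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper in the spirit of eth_ntop().
 * Return: @dst, holding @mac as colon-separated hex, or NULL if it doesn't fit
 */
static const char *mac_to_str(const unsigned char *mac, char *dst, size_t size)
{
	int len;

	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (len < 0 || (size_t)len >= size)
		return NULL;

	return dst;
}
```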

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:51 +02:00
David Gibson
e6feb5a892 treewide: Use "our address" instead of "forwarding address"
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing.  As discussed on a
recent call, let's try "our" instead.

(While we're there correct an error in flow_initiate_af()'s comments where
we referred to parameters by the wrong name).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:29 +02:00
Stefano Brivio
32c386834d netlink: Fix typo in function comment for nl_addr_set()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
f4e9f26480 pasta: Disable neighbour solicitations on device up to prevent DAD
As soon as the kernel notifier for IPv6 address configuration
(addrconf_notify()) sees that we bring the target interface up
(NETDEV_UP), it will schedule duplicate address detection, so, by
itself, setting the nodad flag later is useless, because that won't
stop a detection that's already in progress.

However, if we disable neighbour solicitations with IFF_NOARP (which
is a misnomer for IPv6 interfaces, but there's no possibility of
mixing things up), the notifier will not trigger DAD, because it can't
be done, of course, without neighbour solicitations.

Set IFF_NOARP as we bring up the device, and drop it after we had a
chance to set the nodad attribute on the link.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
d6f0220731 netlink, pasta: Fetch link-local address from namespace interface once it's up
As soon as we bring up the interface, the Linux kernel will set up a
link-local address for it, so we can fetch it and start using it right
away, if we need a link-local address to communicate to the container
before we see any traffic coming from it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
74e508cf79 netlink, pasta: Disable DAD for link-local addresses on namespace interface
It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we'll
anyway not relay neighbour solicitations with an unspecified source
address.

While they perform duplicate address detection, the link-local address
is not usable, which prevents us from bringing up containers, in
particular, and communicating with them right away via IPv6.

This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.

As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.

Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:38 +02:00
Stefano Brivio
0c74068f56 netlink, pasta: Turn nl_link_up() into a generic function to set link flags
In the next patches, we'll reuse it to set flags other than IFF_UP.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:47 +02:00
Stefano Brivio
8231ce54c3 netlink, pasta: Split MTU setting functionality out of nl_link_up()
As we'll use nl_link_up() for more than just bringing up devices, it
will become awkward to carry empty MTU values around whenever we call
it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:43 +02:00
Stefano Brivio
b91d3373ac netlink: Fix typo in function comment for nl_addr_get()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:29 +02:00
Stefano Brivio
946206437a test: Speed up by cutting on eye candy and performance test duration
We have a number of delays when we switch to new layouts that were
added to make the tests visually easier to follow, together with
blinking status bars. Shorten the delays and avoid blinking the
status bar if $FAST is set to 1 (no demo mode).

Shorten delays in busy loops to 10ms, instead of 100ms, and skip the
one-second fixed delay when we wait for the status of a command.

Cut the duration of throughput and latency tests to one second, down
from ten. Somewhat surprisingly, the results we get are rather
consistent, and not significantly different from what we'd get with
10 seconds.

This, together with Podman's commit 20f3e8909e3a ("test/system:
pasta_test_do add explicit port check"), cuts the time needed on my
setup for a full test run from approximately 37 minutes to...:

  $ time ./run
  [exited]
  PASS: 165, FAIL: 0
  Log at /home/sbrivio/passt/test/test_logs/test.log

  real	15m34.253s
  user	0m0.011s
  sys	0m0.011s

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:13:15 +02:00
David Gibson
61c0b0d0f1 flow: Don't crash if guest attempts to connect to port 0
Using a zero port on TCP or UDP is dubious, and we can't really deal with
forwarding such a flow within the constraints of the socket API.  Hence
we ASSERT()ed that we had non-zero ports in flow_hash().

The intention was to make sure that the protocol code sanitizes such ports
before completing a flow entry.  Unfortunately, flow_hash() is also called
on new packets to see if they have an existing flow, so the unsanitized
guest packet can crash passt with the assert.

Correct this by moving the assert from flow_hash() to flow_sidx_hash()
which is only used on entries already in the table, not on unsanitized
data.
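
The split can be pictured with a much reduced sketch. The key layout, the
function names and the multiplicative hash are placeholders, not what
flow.c uses:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, much reduced flow key: ports only, no addresses */
struct key_example {
	uint16_t src_port;
	uint16_t dst_port;
};

/* Used on incoming packets too: must tolerate anything a guest can send */
static uint32_t key_hash(const struct key_example *k)
{
	uint32_t ports = ((uint32_t)k->src_port << 16) | k->dst_port;

	return ports * 2654435761u;	/* placeholder hash function */
}

/* Used only on entries already in the table: here, zero ports are a bug */
static uint32_t entry_hash(const struct key_example *entry)
{
	assert(entry->src_port != 0 && entry->dst_port != 0);

	return key_hash(entry);
}
```

A lookup for a packet with port 0 then simply finds no matching entry,
instead of aborting the process.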

Reported-by: Matt Hamilton <matt@thmail.io>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-14 12:20:31 +02:00
David Gibson
baba284912 conf: Don't ignore -t and -u options after -D
f6d5a52392 moved handling of -D into a later loop.  However as a side
effect it moved this from a switch block to an if block.  I left a couple
of 'break' statements that don't make sense in the new context.  They
should be 'continue' so that we go onto the next option, rather than
leaving the loop entirely.
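
The difference is easy to miss, so here is a reduced sketch of the loop
shape. count_port_options() and its arguments are hypothetical, the real
loop in conf() calls getopt_long():

```c
#include <stddef.h>

/* Hypothetical second pass: count -t and -u, skipping over -D.
 * @names holds one option character per parsed option.
 */
static int count_port_options(const char *names, size_t n)
{
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (names[i] == 'D') {
			/* ...handle the DNS option here... */

			/* Inside a switch, 'break' would only end the case.
			 * Inside an if, it would end the whole loop and drop
			 * every option after -D.
			 */
			continue;
		}

		if (names[i] == 't' || names[i] == 'u')
			count++;
	}

	return count;
}
```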

Fixes: f6d5a52392 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-14 09:14:12 +02:00
AbdAlRahman Gad
c16141eda5 ndp.c: Turn NDP responder into more declarative implementation
- Add structs for NA, RA, NS, MTU, prefix info, option header,
  link-layer address, RDNSS, DNSSL and link-layer for RA message.

- Turn NA message from purely imperative, going byte by byte,
  to declarative by filling its struct.

- Turn part of RA message into declarative.

- Move packet_add() to be before the call of ndp() in tap6_handler()
  if the protocol of the packet is ICMPv6.

- Add a pool of packets as an additional parameter to ndp().

- Check the size of NS packet with packet_get() before sending an NA
  packet.

- Add documentation for the structs.

- Add an enum for NDP option types.

Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-13 19:46:16 +02:00
David Gibson
f6d5a52392 conf: Delay handling -D option until after addresses are configured
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.

Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine.  However, they can also be called earlier
if an explicit -D command line option is given.  In this case no_map_gw
and/or c->ip[46].gw may not yet be initialised properly, leading to this
doing the wrong thing.

Luckily we already have a second pass of option parsing for things which
need addresses to already be configured.  Move handling of -D to there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-12 21:29:36 +02:00
David Gibson
86bdd968ea Correct inaccurate comments on ip[46]_ctx::addr
These fields are described as being an address for an external, routable
interface.  That's not necessarily the case when using -a.  But, more
importantly, saying where the value comes from is not as useful as what
it's used for.  The real purpose of this field is as the address which we
assign to the guest via DHCP or --config-net.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-12 21:29:21 +02:00
Stefano Brivio
fecb1b65b1 log: Don't prefix message with timestamp on --debug if it's a continuation
If we prefix the second part of messages printed through
logmsg_perror() by the timestamp, on debug, we'll have two timestamps
and a weird separator in the result, such as this beauty:

  0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted

Add a parameter to logmsg() and vlogmsg() which indicates a message
continuation. If that's set, don't print the timestamp in vlogmsg().
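
Reduced to a formatter writing into a buffer, the idea is roughly the
following. format_debug() is a hypothetical function, the real logmsg()
and vlogmsg() take different arguments:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical formatter: write @msg to @buf, with a timestamp unless @cont
 * marks it as the continuation of a message already started.
 * Return: number of characters snprintf() would have written
 */
static int format_debug(char *buf, size_t size, double ts, bool cont,
			const char *msg)
{
	if (cont)
		return snprintf(buf, size, "%s", msg);

	return snprintf(buf, size, "%.4f: %s", ts, msg);
}
```

Calling it once for the message and once, with @cont set, for the error
description yields a single timestamp at the start of the line.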

Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-12 16:21:53 +02:00
Stefano Brivio
baccfb95ce conf: Stop parsing options at first non-option argument
Given that pasta supports specifying a command to be executed on the
command line, even without the usual -- separator as long as there's
no ambiguity, we shouldn't eat up options that are not meant for us.

Paul reports, for instance, that with:

  pasta --config-net ip -6 route

-6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'.
That's because getopt_long(), by default, shuffles the argument list
to shift non-option arguments to the end.

Avoid that by adding '+' at the beginning of 'optstring'.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-08 21:34:06 +02:00
Stefano Brivio
09603cab28 passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.

This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.

Use close_range(2) to close all open files except for standard streams
and the one from --fd.

Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.

As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
takes), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptor numbers that match standard streams,
while at it.
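
The range check might be sketched as follows. parse_fd_arg() is a
hypothetical helper, and rejecting trailing characters is an assumption of
this sketch rather than something stated above:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: parse the argument of --fd.
 * Return: file descriptor number, or -1 if @arg is not acceptable
 */
static int parse_fd_arg(const char *arg)
{
	char *end;
	long fd;

	errno = 0;
	fd = strtol(arg, &end, 0);

	/* Not a number at all, trailing garbage, or out of range for a long */
	if (end == arg || *end != '\0' || errno)
		return -1;

	/* Standard streams are never valid, and descriptors are ints */
	if (fd <= STDERR_FILENO || fd > INT_MAX)
		return -1;

	return (int)fd;
}
```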

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-08-08 21:31:25 +02:00
David Gibson
755f9fd911 nstool: Propagate SIGTERM to processes executed in the namespace
Particularly in shell it's sometimes natural to save the pid from a process
run and later kill it.  If doing this with nstool exec, however, it will
kill nstool itself, not the program it is running, which isn't usually what
you want or expect.

Address this by having nstool propagate SIGTERM to its child process.  It
may make sense to propagate some other signals, but some introduce extra
complications, so we'll worry about them when and if it seems useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:48 +02:00
David Gibson
5ca61c2f34 nstool: Fix some trivial typos
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:45 +02:00
David Gibson
a628cb93a7 log: Avoid duplicate calls to logtime()
We use logtime() to get a timestamp for the log in two places:
  - in vlogmsg(), which is used only for debug_print messages
  - in logfile_write() which is only used for messages to the log file

These cases are mutually exclusive, so we don't ever print the same message
with different timestamps, but that's not particularly obvious to see.
It's possible future tweaks to logging logic could mean we log to two
different places with different timestamps, which would be confusing.

Refactor to have a single logtime() call in vlogmsg() and use it for all
the places we need it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:40 +02:00
David Gibson
2c7558dc43 log: Handle errors from clock_gettime()
clock_gettime() can, theoretically, fail, although it probably won't until
2038 on old 32-bit systems.  Still, it's possible someone could run with
a wildly out of sync clock, or new errors could be added, or it could fail
due to a bug in libc or the kernel.

We don't handle this well.  In the debug_print case in vlogmsg we'll just
ignore the failure, and print a timestamp based on uninitialised garbage.
In logfile_write() we exit early and won't log anything at all, which seems
like a good way to make an already weird situation undebuggable.

Add some helpers to instead handle this by using "<error>" in place of a
timestamp if something goes wrong with clock_gettime().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:29 +02:00
David Gibson
b91bae1ded log: Correct formatting of timestamps
logtime_fmt_and_arg() is a rather odd macro, producing both a format
string and an argument, which can only be used in quite specific printf()
like formulations.  It also has a significant bug: it tries to display 4
digits after the decimal point (so down to tenths of milliseconds) using
%04i.  But the field width in printf() is always a *minimum* not maximum
field width, so this will not truncate the given value, but will redisplay
the entire tenth-of-milliseconds difference again after the decimal point.

Replace the macro with an snprintf() like function which will format the
timestamp, and use an explicit % to correct the display.
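
To illustrate the field width issue (a sketch with a hypothetical
fmt_delta() helper, not the logtime_fmt() added here): the modulo is
what limits the fractional part to four digits, while the width in
"%04" only pads.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical formatter: writes "seconds.ffff" (four fractional
 * digits, that is, tenths of milliseconds) for a time difference
 * given in microseconds. Returns what snprintf() returns.
 */
static int fmt_delta(char *buf, size_t size, int64_t delta_us)
{
	int64_t delta = delta_us / 100;	/* tenths of milliseconds */

	/* Passing the whole 'delta' as fractional part would print
	 * 12345 as "1.12345": the width is a minimum, not a maximum.
	 * The explicit modulo keeps exactly four digits.
	 */
	return snprintf(buf, size, "%lli.%04lli",
			(long long)(delta / 10000),
			(long long)(delta % 10000));
}
```

For instance, 1234500 microseconds are formatted as "1.2345", and
1005000 microseconds as "1.0050".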

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Make logtime_fmt() static]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:10 +02:00
David Gibson
95569e4aa4 util: Some corrections for timespec_diff_us
The comment for timespec_diff_us() claims it will wrap after 2^64µs.  This
is incorrect for two reasons:
  * It returns a long long, which is probably 64-bits, but might not be
  * It returns a signed value, so even if it is 64 bits it will wrap after
    2^63µs

Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.
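
A sketch of such a calculation with an explicit 64-bit return type
follows. ts_diff_us() is a hypothetical name, and the code is written
from the description above, not copied from util.c.

```c
#include <stdint.h>
#include <time.h>

/* Sketch: difference a - b in microseconds, as an explicitly 64-bit
 * signed value, so it wraps after 2^63 microseconds
 */
static int64_t ts_diff_us(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_nsec < b->tv_nsec) {
		/* Borrow one second for the nanoseconds part */
		return (int64_t)(a->tv_sec - b->tv_sec - 1) * 1000000 +
		       (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000;
	}

	return (int64_t)(a->tv_sec - b->tv_sec) * 1000000 +
	       (a->tv_nsec - b->tv_nsec) / 1000;
}
```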

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:15:40 +02:00
Stefano Brivio
fbb0c9523e conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.

That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.

Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.

While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-07 09:15:25 +02:00
Stefano Brivio
ee36266a55 log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).

But if passt is running in foreground, we currently stop reporting any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.

Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.

Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:03:48 +02:00
Stefano Brivio
3a082c4ecb tcp_splice: Fix side in OUT_WAIT flag setting
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to be waiting for is for
output readiness on side 1, not 0, and the other way around as well.

This causes random transfer failures for local TCP connections,
depending on whether we ever need to wait for output readiness.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-08-06 15:03:31 +02:00
David Gibson
031df332e9 util: Use unsigned (size_t) value for iov length
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t.  Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder.  Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:01:46 +02:00
Laurent Vivier
e877f905e5 udp_flow: move all udp_flow functions to udp_flow.c
No code change.

They need to be exported to be available by the vhost-user version of
passt.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:38:17 +02:00
Laurent Vivier
623ceb1f2b udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid also exporting udp_meta_t, which is specific
to the socket version of udp.c, don't pass udp_meta_t to it, but only
the needed field, s_in.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:37:53 +02:00
David Gibson
a5bbefa6fb log: Make logfile_write() private
logfile_write() is not used outside log.c, nor should it be.  It should
only be used externally via the general logging functions.  Make it static
in log.c.  To avoid forward declarations this requires moving a bunch of
functions earlier in the file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 15:03:33 +02:00
Stefano Brivio
f30ed68c52 pasta: Save errno on signal handler entry, restore on return when needed
Ed reported this:

  # Error: pasta failed with exit code 1:
  # Couldn't drop cap 3 from bounding set
  # : No child processes

in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?

Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:

       Fetching and setting the value of errno is async-signal-safe
       provided that the signal handler saves errno on entry and
       restores its value before returning.

Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-05 15:02:36 +02:00
Danish Prakash
0149d11cc5 pasta: modify hostname when detaching new namespace
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not, leaving
users a bit confused. This change modifies the hostname in the
new namespace, adding a "pasta-" prefix to make it a bit more obvious.

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-30 17:27:43 +02:00
AbdAlRahman Gad
8fae3b73cb Fix typo in README file
- remove duplicated 'the' in the 'Services' section

Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-29 19:02:35 +02:00
Stefano Brivio
f87b11c7be fedora/rpkg: List myself as author for changelog entries
...instead of the latest author for contrib/fedora.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 16:40:41 +02:00
David Gibson
57a21d2df1 tap: Improve handling of partially received frames on qemu socket
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.
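
The approach can be sketched like this, with a hypothetical
consume_frames() helper that only counts frames instead of processing
them, and that omits the bounds checks on the frame length:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: walk the complete frames in buf[0..len), each
 * prefixed by a 32-bit big-endian length, then move any trailing
 * partial frame to the start of buf so that the next recv() can append
 * to it. Returns the number of bytes kept for the next call.
 */
static size_t consume_frames(unsigned char *buf, size_t len,
			     unsigned int *frames)
{
	size_t off = 0;

	*frames = 0;
	while (len - off >= 4) {
		uint32_t l2len = (uint32_t)buf[off] << 24 |
				 (uint32_t)buf[off + 1] << 16 |
				 (uint32_t)buf[off + 2] << 8 |
				 (uint32_t)buf[off + 3];

		if (l2len > len - off - 4)
			break;	/* frame body not fully received yet */

		/* A real handler would process the frame body here */
		off += 4 + l2len;
		(*frames)++;
	}

	/* Keep the partial frame, if any, for the next call */
	memmove(buf, buf + off, len - off);

	return len - off;
}
```

A partial length field (fewer than 4 bytes left) is kept in the same
way as a partial frame body, which covers the "tiny partial frame" case.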

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:42 +02:00
David Gibson
37e3b24d90 tap: Correctly handle frames of odd length
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.
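
One common way to write such a helper is with memcpy(). This is a
sketch with a hypothetical name, load_be32_unaligned(), and not
necessarily the helper introduced by this change:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value in network (big-endian)
 * order from any address, aligned or not
 */
static uint32_t load_be32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* no alignment requirement */

	return ntohl(v);
}
```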

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:38 +02:00
David Gibson
4684f60344 tap: Don't use EPOLLET on Qemu sockets
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket.  It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling.  That consideration
doesn't apply to the way we handle the qemu socket however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
   we *do* set it when using pasta mode with --fd.  This inconsistency
   doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available
   is read (otherwise we may have data in the buffer but never get an event
   causing us to read it).  We do that with a rather ugly goto.

   Worse, our condition for that goto appears to be incorrect.  We'll only
   loop if rem is non-zero, which will only happen if we perform a blocking
   recv() for a partially received frame.  We'll only perform that second
   recv() if the original recv() resulted in a partially read frame.  As
   far as I can tell the original recv() could end on a frame boundary
   (never triggering the second recv()) even if there is additional data in
   the socket buffer.  In that circumstance we wouldn't goto redo and could
   leave unprocessed frames in the qemu socket buffer indefinitely.

   This doesn't seem to have caused any problems in practice, but since
   there's no obvious reason to use EPOLLET here anyway, we might as well
   get rid of it.
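
The difference between the two modes can be observed with a small
experiment on a Unix stream socket pair. This is a sketch:
ready_after_partial_read() is a hypothetical helper, unrelated to
tap_handler_passt().

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: queue two bytes, read one, then ask epoll again.
 * Returns the number of events reported after the partial read
 * (1: still reported as readable, 0: not reported), or -1 on error.
 */
static int ready_after_partial_read(uint32_t extra_flags)
{
	struct epoll_event ev = { .events = EPOLLIN | extra_flags };
	int s[2], epfd, ret = -1;
	char c;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, s))
		return -1;

	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out;

	ev.data.fd = s[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, s[0], &ev))
		goto out;

	/* First event: data arrived */
	if (write(s[1], "ab", 2) != 2 || epoll_wait(epfd, &ev, 1, 0) != 1)
		goto out;

	/* Leave one byte in the socket buffer */
	if (read(s[0], &c, 1) != 1)
		goto out;

	ret = epoll_wait(epfd, &ev, 1, 0);
out:
	if (epfd >= 0)
		close(epfd);
	close(s[0]);
	close(s[1]);

	return ret;
}
```

Without EPOLLET the socket is reported as readable again as long as data
is left, so the handler doesn't need to loop until the buffer is empty.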

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:20 +02:00
David Gibson
9e3f2355c4 tap: Don't attempt to carry on if we get a bad frame length from qemu
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on.  That sounds sensible on first blush, but
probably isn't wise in practice.  If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket.  Or more likely (c), we have a bug elsewhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.

Neither (b) nor (c) is really salvageable with the same stream.  Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.

So, instead of just skipping the frame and trying to carry on, log an error
and close the socket.  As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:05:44 +02:00
David Gibson
a06db27c49 tap: Better report errors receiving from QEMU socket
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error.  Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we really should report
something for debuggability.

Add an error message in this case.  Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 13:51:23 +02:00
Stefano Brivio
77c092ee5e log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME
We report relative timestamps in logs, so we want to avoid jumps in
the system time.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:36 +02:00
Stefano Brivio
e5c37ba0f4 log: Initialise timestamp for relative log time also if we use a log file
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:34 +02:00
Stefano Brivio
327d9d482f log, util: Fix sub-second part in relative log time calculation
For some reason, in commit 01efc71ddd ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not
accounting for the fractional part.

Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.

To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:43:19 +02:00
Stefano Brivio
2ce1d37831 test/lib/perf_report: Fix highlight
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 17:56:47 +02:00
David Gibson
e9a542321f test: Fix spurious test failure with systemd-resolved
systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53.  Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case.  We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address.  However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.

Correct this with a different test condition.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:39:00 +02:00
David Gibson
becf81ab88 fwd: Broaden what we consider for DNS specific forwarding rules
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side.  Currently, however, only UDP
packets to port 53 are considered "DNS requests".  This ignores DNS
requests over TCP - less common, but certainly possible.  It also ignores
encrypted DNS requests on port 853.

Extend the DNS forwarding logic to handle both of those cases.
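
As a sketch of the broadened condition (the is_dns_flow() name and its
signature are assumptions for this example, with the destination port
in host byte order):

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: plain DNS (port 53) or encrypted DNS
 * (port 853), over either UDP or TCP
 */
static bool is_dns_flow(uint8_t proto, in_port_t dport)
{
	if (proto != IPPROTO_UDP && proto != IPPROTO_TCP)
		return false;

	return dport == 53 || dport == 853;
}
```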

Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:38:11 +02:00
David Gibson
0ada84e3f8 fwd: Refactor tests in fwd_nat_from_tap() for clarity
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.

To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-25 12:37:57 +02:00
Stefano Brivio
4a333c88d7 conf: Accept addresses enclosed by square brackets in port forwarding specifiers
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.

Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).

Allow square brackets for both IPv4 and IPv6 addresses, correct or
not, they're harmless anyway.
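
A possible way to handle this, as a sketch only (parse_addr() is a
hypothetical function, not the parsing code in conf.c):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical parser: accepts "addr" or "[addr]", IPv4 or IPv6.
 * 'addr' must have room for a struct in6_addr. On success, stores
 * the address family in 'af' and returns true.
 */
static bool parse_addr(const char *str, int *af, void *addr)
{
	char buf[INET6_ADDRSTRLEN];
	size_t len = strlen(str);

	/* Strip one pair of enclosing square brackets, if present */
	if (len >= 2 && str[0] == '[' && str[len - 1] == ']') {
		str++;
		len -= 2;
	}

	if (len >= sizeof(buf))
		return false;

	memcpy(buf, str, len);
	buf[len] = '\0';

	if (inet_pton(AF_INET, buf, addr) == 1) {
		*af = AF_INET;
		return true;
	}

	if (inet_pton(AF_INET6, buf, addr) == 1) {
		*af = AF_INET6;
		return true;
	}

	return false;
}
```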

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:56 +02:00
Stefano Brivio
6ff702f325 tap: Exit if we fail to bind a UNIX domain socket with explicit path
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.

But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:42 +02:00
Stefano Brivio
f72d35a78d test: iperf3 3.16 introduces multiple threads, drop our own implementation of that
Starting from iperf3 version 3.16, -P / --parallel spawns multiple
clients as separate threads, instead of multiple streams serviced by
the same thread.

So we can drop our lib/test implementation to spawn several iperf3
client and server processes and finally simplify things quite a bit.

Adjust number of threads and UDP sending bandwidth to values that seem
to be more or less matching previous throughput tests on my setup.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:38 +02:00
Stefano Brivio
606e0c7b95 test: Update names of symbols and slabinfo entries
Differences in allocated Acpi-Parse entries are gone (at least) since
the 6.1 Linux kernel series. I should run this on a 6.10 kernel,
eventually, and adjust things further, as needed.

Userspace symbols are also fairly different now: show whatever is more
than 1 MiB at the moment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:29 +02:00
Stefano Brivio
f16f8f5bf6 test: Fix memory/passt tests, --netns-only is not a valid option for passt
This used to work on my setup as I kept reusing an old mbuto
(initramfs) image, but since commit 65923ba798 ("conf: Accept
duplicate and conflicting options, the last one wins"), --netns-only
is, as originally intended, a pasta-only option.

I had used --netns-only, here, to prevent passt from trying to detach
its own user namespace, which is not permitted as we're in a chroot,
see unshare(2). In turn, we need the chroot because passt can't pivot
root directly into its own empty filesystem using an initramfs.

Use switch_root into the tmpfs mountpoint instead of chroot, so that
we can still detach user namespaces.

Note that in the mbuto images, we can't switch to nobody as we have
no password entries at all, so we need to detach a further user
namespace before starting passt, to trick passt into running as UID
0.

Given the new sequence, it's now more convenient to directly switch
to a detached network namespace as well, which means we need to move
the initialisation of the dummy network from the init script into the
test script.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:30:08 +02:00
Stefano Brivio
1cd773081f log: Drop newlines in the middle of the perror()-like messages
Calling vlogmsg() twice from logmsg_perror() results in this beauty:

  $ ./pasta -i foo
  Invalid interface name foo
  : No such device

because the first part of the message, corresponding to the first
call, doesn't end with a newline, and vlogmsg() adds it.

Given that we can't easily append an argument (error description) to
a variadic list, add a 'newline' parameter to all the functions that
currently add a newline if missing, and disable that on the first call
to vlogmsg() from logmsg_perror(). Not very pretty but I can't think
of any solution that's less messy than this.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:31 +02:00
Stefano Brivio
13295583f8 tcp: Change SO_PEEK_OFF support message to debug()
This:

  $ ./pasta
  SO_PEEK_OFF not supported
  #

is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.

Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.

Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:26 +02:00
Stefano Brivio
d19b396f11 tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down
If we start pasta with some ports forwarded, but no --config-net, say:

  $ ./pasta -u 10001

and then use a local, non-loopback address to send traffic to that
port, say:

  $ socat -u FILE:test UDP4:192.0.2.1:10001

pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.

By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:05 +02:00
David Gibson
a09aeb4bd6 tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames
When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust
the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards.
Although it was discussed during development, somewhere during the shuffles
we lost the case where we move the pointer backwards because we lost frames
while sending them to the guest.  This can happen, for example, if the socket
buffer on the Unix socket to qemu overflows.

Fixing this is slightly complicated because we need to pass a non-const
context pointer to some places we previously didn't need it.  While we're
there also fix a small stylistic issue in the function comment for
tcp_revert_seq() - it was using spaces instead of tabs.

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-24 09:27:46 +02:00
Jon Maloy
9cb6b50815 tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6
Based on an original patch by Jon Maloy:

--
The recently added socket option SO_PEEK_OFF is not supported for
TCP/IPv6 sockets. Until we get that support into the kernel we need to
test for support in both protocols to set the global 'peek_offset_cap'
to true.
--

Compared to the original patch:
- only check for SO_PEEK_OFF support for enabled IP versions
- use sa_family_t instead of int to pass the address family around

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-23 16:42:27 +02:00
David Gibson
882599e180 udp: Rename UDP listening sockets
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows.  Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match.  Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:34:01 +02:00
David Gibson
d29fa0856e udp: Remove rdelta port forwarding maps
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map.  This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port.  This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.

So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:57 +02:00
David Gibson
d89b3aa097 udp: Remove obsolete socket tracking
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[] or udp_act[] arrays.  Remove them and connected
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:55 +02:00
David Gibson
898f797174 udp: Direct datagrams from host to guest via flow table
This replaces the last piece of existing UDP port tracking with the
common flow table.  Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface.  Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:51 +02:00
David Gibson
b7ad19347f udp: Find or create flows for datagrams from tap interface
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams.  We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them.  Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.

As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.

For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter.  Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:48 +02:00
David Gibson
8126f7a660 udp: Remove obsolete splice tracking
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used.  With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[].  The latter are still
used for tracking of automatic port forwarding.

Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:45 +02:00
David Gibson
e0647ad80c udp: Handle "spliced" datagrams with per-flow sockets
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it.  Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.

For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow.  We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else.  We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c.

When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.
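
As a rough illustration of the bind() plus connect() idea described above,
here is a minimal sketch.  The helper name flow_sock_sketch() and the use of
IPv4 loopback addresses are assumptions made for the example; this is not the
code the patch adds.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket bound to 127.0.0.1:lport (0 lets
 * the kernel pick a port) and connected to 127.0.0.1:rport.  Once connected,
 * the kernel only delivers datagrams from that one peer to this socket.
 * Returns the socket fd, or -1 on error. */
static int flow_sock_sketch(in_port_t lport, in_port_t rport)
{
	struct sockaddr_in a;
	int s;

	assert(rport != 0);	/* a connected socket needs a real peer port */

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	a.sin_port = htons(lport);
	if (bind(s, (struct sockaddr *)&a, sizeof(a)) < 0)
		goto fail;

	a.sin_port = htons(rport);
	if (connect(s, (struct sockaddr *)&a, sizeof(a)) < 0)
		goto fail;

	return s;

fail:
	close(s);
	return -1;
}
```

Connecting the socket is what narrows it to a single peer; binding alone
would leave it eligible for datagrams from any source.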

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:42 +02:00
David Gibson
a45a7e9798 udp: Create flows for datagrams from originating sockets
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps.  Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).

When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc.  This may result in smaller batches, but makes things
easier down the line.  We can re-optimise this later if necessary.  For now
we don't do anything else with the flow, not even match reply packets to
the same flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:39 +02:00
David Gibson
8abd06e9fa fwd: Update flow forwarding logic for UDP
Add logic to the fwd_nat_from_*() functions to forward UDP packets.  The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases.  Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:35 +02:00
David Gibson
c000f2aba6 flow, icmp: Use general flow forwarding rules for ICMP
Current ICMP hard codes its forwarding rules, and never applies any
translations.  Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).

This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine.  This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.

This removes the last user of flow_target_af(), so that's removed as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:33 +02:00
David Gibson
060f24e310 flow, tcp: Flow based NAT and port forwarding for TCP
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side.  This performs any NAT or port forwarding needed.

We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:29 +02:00
David Gibson
4cd753e65c icmp: Manage outbound socket address via flow table
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.

To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses.  It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address.  This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out.  We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:25 +02:00
David Gibson
781164e25b flow: Helper to create sockets based on flowside
We have upcoming use cases where it's useful to create new bound socket
based on information from the flow table.  Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:23 +02:00
David Gibson
2faf6fcd8b icmp: Eliminate icmp_id_map
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything.  Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:20 +02:00
David Gibson
2f40a01944 icmp: Look up ping flows using flow hash
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an ancillary data structure, the
icmp_id_map[] tables.  However, we can look this up using the flow hash
table - that's what it's for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:16 +02:00
David Gibson
6d76278c21 icmp: Obtain destination addresses from the flowsides
icmp_sock_handler() obtains the guest address from its most recently
observed IP.  However, this can now be obtained from the common flowside
information.

icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.

Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:13 +02:00
David Gibson
5cffb1bf64 icmp: Remove redundant id field from flow table entry
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:06 +02:00
David Gibson
508adde342 tcp: Re-use flow hash for initial sequence number generation
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp.  Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.

With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one.  It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:01 +02:00
David Gibson
acca4235c4 flow, tcp: Generalise TCP hash table to general flow hash table
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows.  This is
largely code motion and straightforward renames.  There are two semantic
changes:

 * flow_lookup_af() now needs to verify that the entry has a matching
   protocol and interface as well as matching addresses and ports.

 * We double the size of the hash table, because it's now at least
   theoretically possible for both sides of each flow to be hashed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:59 +02:00
David Gibson
163a339214 tcp, flow: Replace TCP specific hash function with general flow hash
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports.  We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well.  We also want a unified hash table
to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow.  This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
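
To make the shape of such a hash concrete, here is a small sketch.  The
struct side_key layout and the use of FNV-1a are assumptions chosen for
brevity; they stand in for the real struct flowside and for whatever hash
function the code actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key for one side of a flow (not the real layout) */
struct side_key {
	uint8_t  eaddr[16];	/* endpoint address */
	uint8_t  faddr[16];	/* forwarding address */
	uint16_t eport;		/* endpoint port */
	uint16_t fport;		/* forwarding port */
	uint8_t  proto;		/* L4 protocol number */
	uint8_t  pif;		/* interface the side lives on */
};

/* One FNV-1a step over a byte range, continuing from state h */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Hash each field separately, so struct padding never enters the hash */
static uint64_t side_hash(const struct side_key *k)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	h = fnv1a(h, k->eaddr, sizeof(k->eaddr));
	h = fnv1a(h, k->faddr, sizeof(k->faddr));
	h = fnv1a(h, &k->eport, sizeof(k->eport));
	h = fnv1a(h, &k->fport, sizeof(k->fport));
	h = fnv1a(h, &k->proto, sizeof(k->proto));
	h = fnv1a(h, &k->pif, sizeof(k->pif));
	return h;
}
```

Because the protocol and pif are part of the key, a TCP side and a UDP side
with identical addresses and ports land in different hash buckets.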

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:56 +02:00
David Gibson
f19a8f71f9 tcp_splice: Eliminate SPLICE_V6 flag
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection.  Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:53 +02:00
David Gibson
528a6517f8 tcp: Simplify endpoint validation using flowside information
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:50 +02:00
David Gibson
e2ea10e246 tcp: Manage outbound address via flow table
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:47 +02:00
David Gibson
52d45f1737 tcp: Obtain guest address from flowside
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will probably result in an RST from the guest.  Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:44 +02:00
David Gibson
f9fe212b1f tcp, flow: Remove redundant information, repack connection structures
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure.  Access it from there instead, and
remove it from the TCP specific structure.   With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:41 +02:00
David Gibson
4e2d36e83f flow: Common address information for target side
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP.  For now this leaves some information
redundantly recorded in both generic and type specific fields.  We'll
fix that in later patches.

For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:37 +02:00
David Gibson
8012f5ff55 flow: Common address information for initiating side
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow.  Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as seen by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.

For now we only populate the initiating side, requiring that
information be completed when a flow enters INI.  Later patches will
populate the target side.

For now this leaves some information redundantly recorded in both generic
and type specific fields.  We'll fix that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:32 +02:00
David Gibson
ba74b1fea1 doc: Extend zero-recv test with methods using msghdr
This test program verifies that we can receive and discard datagrams by
using recv() with a NULL buffer and zero-length.  Extend it to verify it
also works using recvmsg() and either an iov with a zero-length NULL
buffer or an iov that itself is NULL and zero-length.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed printf() message in main of recv-zero.c]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:31:02 +02:00
David Gibson
01e5611ec3 doc: Test behaviour of closing duplicate UDP sockets
To simplify lifetime management of "listening" UDP sockets, UDP flow
support needs to duplicate existing bound sockets.  Those duplicates will
be close()d when their corresponding flow expires, but we expect the
original to still receive datagrams as always.  That is, we expect the
close() on the duplicate to remove the duplicated fd, but not to close the
underlying UDP socket.

Add a test program to doc/platform-requirements to verify this requirement.
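
The expectation can be sketched as follows.  This is an independent
illustration rather than the test program added by the patch; the function
name and the one second receive timeout are assumptions.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if a bound UDP socket still receives datagrams after a dup()
 * of it was closed, 0 if it does not, -1 if setup failed. */
static int dup_close_keeps_socket(void)
{
	struct timeval tv = { .tv_sec = 1 };	/* don't block forever */
	struct sockaddr_in a;
	socklen_t alen = sizeof(a);
	int orig, tx, copy, ret = -1;
	char buf[8];

	orig = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (orig < 0 || tx < 0)
		goto out;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* port 0: any */
	if (bind(orig, (struct sockaddr *)&a, sizeof(a)) < 0 ||
	    getsockname(orig, (struct sockaddr *)&a, &alen) < 0 ||
	    setsockopt(orig, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		goto out;

	copy = dup(orig);
	if (copy < 0)
		goto out;
	close(copy);	/* releases one fd; the socket itself stays open */

	if (sendto(tx, "ping", 4, 0, (struct sockaddr *)&a, sizeof(a)) != 4)
		goto out;

	ret = recv(orig, buf, sizeof(buf), 0) == 4;
out:
	if (orig >= 0)
		close(orig);
	if (tx >= 0)
		close(tx);
	return ret;
}
```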

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:14 +02:00
David Gibson
66a02c9f7c tcp_splice: Use parameterised macros for per-side event/flag bits
Both the events and flags fields in tcp_splice_conn have several bits
which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1.
This necessitates some rather awkward ternary expressions when we need
to get the relevant bit for a particular side.

Simplify this by using a parameterised macro for the bit values.  This
needs a ternary expression inside the macros, but makes the places we use
it substantially clearer.

That simplification in turn allows us to use a loop across each side to
implement several things which are currently open coded to do equivalent
things for each side in turn.
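
A sketch of the pattern, with made-up bit values (the real positions of
OUT_WAIT_0 and OUT_WAIT_1 are not given here) and a hypothetical helper
that loops over both sides:

```c
#include <assert.h>
#include <stdint.h>

/* Bit values are assumptions, for illustration only */
#define OUT_WAIT_0	(1U << 2)
#define OUT_WAIT_1	(1U << 3)

/* Parameterised form: the ternary lives here, once */
#define OUT_WAIT(sidei)	((sidei) ? OUT_WAIT_1 : OUT_WAIT_0)

/* Count the sides of a connection that are waiting to write */
static unsigned sides_waiting(uint32_t events)
{
	unsigned sidei, n = 0;

	for (sidei = 0; sidei < 2; sidei++)
		if (events & OUT_WAIT(sidei))
			n++;

	return n;
}
```

Call sites then read OUT_WAIT(sidei) instead of repeating the ternary, and
per-side logic collapses into a single loop body.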

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:11 +02:00
David Gibson
5235c47c79 flow: Introduce flow_foreach_sidei() macro
We have a handful of places where we use a loop to step through each side
of a flow or flows, and we're probably going to have more in future.
Introduce a macro to implement this loop for convenience.
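
One plausible shape for such a macro is sketched below.  The exact
definition, the SIDES constant and struct flow_sketch are assumptions for
the example, not a copy of the patch.

```c
#include <assert.h>

#define INISIDE	0	/* initiating side */
#define TGTSIDE	1	/* target side */
#define SIDES	2

/* Step sidei_ through each side index of a flow */
#define flow_foreach_sidei(sidei_) \
	for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)

/* Hypothetical flow with one socket fd per side, -1 meaning "none" */
struct flow_sketch {
	int fd[SIDES];
};

/* Count how many sides of a flow have an open socket */
static unsigned flow_open_sides(const struct flow_sketch *f)
{
	unsigned sidei, n = 0;

	flow_foreach_sidei(sidei)
		if (f->fd[sidei] >= 0)
			n++;

	return n;
}
```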

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:07 +02:00
David Gibson
71d7985188 flow, tcp_splice: Prefer 'sidei' for variables referring to side index
In various places we have variables named 'side' or similar which always
have the value 0 or 1 (INISIDE or TGTSIDE).  Given a flow, this refers to
a specific side of it.  Upcoming flow table work will make it more useful
for "side" to refer to a specific side of a specific flow.  To make things
less confusing then, prefer the term "side index" and the name 'sidei' for
variables with just the 0 or 1 value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed minor detail in comment to struct flow_common]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:29:47 +02:00
David Gibson
9b125e7776 flow, icmp, tcp: Clean up helpers for getting flow from index
TCP (both regular and spliced) and ICMP both have macros to retrieve the
relevant protocol specific flow structure from a flow index.  In most cases
what we actually want is to get the specific flow from a sidx.  Replace
those simple macros with a more precise inline, which also asserts that
the flow is of the type we expect.

While we're there, also add a pif_at_sidx() helper to get the interface of
a specific flow & side, which is useful in some places.

Finally, fix some minor style issues in the comments on some of the
existing sidx related helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:27:27 +02:00
David Gibson
2fa91ee391 udp: Handle errors on UDP sockets
Currently we ignore all events other than EPOLLIN on UDP sockets.  This
means that if we ever receive an EPOLLERR event, we'll enter an infinite
loop on epoll, because we'll never do anything to clear the error.

Luckily that doesn't seem to have happened in practice, but it's certainly
fragile.  Furthermore changes in how we handle UDP sockets with the flow
table mean we will start receiving error events.

Add handling of EPOLLERR events.  For now we just read the error from the
error queue (thereby clearing the error state) and print a debug message.
We can add more substantial handling of specific events in future if we
want to.
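
Reading the error queue might look roughly like this.  The function name is
hypothetical, and the sketch assumes a Linux socket on which errors are
queued (for example with IP_RECVERR enabled); it only counts and discards
the queued errors instead of logging them.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Drain the error queue of socket s, which clears the error condition that
 * EPOLLERR reported.  Returns the number of queued errors consumed, or -1
 * if recvmsg() failed for a reason other than "queue empty". */
static int sock_drain_errors(int s)
{
	char cbuf[256];	/* room for the control message carrying the error */
	int n = 0;

	for (;;) {
		struct msghdr mh;
		ssize_t rc;

		memset(&mh, 0, sizeof(mh));
		mh.msg_control = cbuf;
		mh.msg_controllen = sizeof(cbuf);

		rc = recvmsg(s, &mh, MSG_ERRQUEUE | MSG_DONTWAIT);
		if (rc < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return n;	/* queue is empty: done */
			return -1;
		}
		n++;
	}
}
```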

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:21 +02:00
David Gibson
6bd8283bf9 util: Add AF_UNSPEC support to sockaddr_ntop()
Allow sockaddr_ntop() to format AF_UNSPEC socket addresses.  There do exist
a few cases where we might legitimately have either an AF_UNSPEC or a real
address, such as the origin address from MSG_ERRQUEUE.  Even in cases where
we shouldn't get an AF_UNSPEC address, formatting it is likely to make
things easier to debug if we ever somehow do.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:18 +02:00
David Gibson
4e1f850f61 udp, tcp: Tweak handling of no_udp and no_tcp flags
We abort the UDP socket handler if the no_udp flag is set.  But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place.  If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop.  The same is the case for the TCP socket and timer handlers and
the no_tcp flag.

Change these checks on the flag to ASSERT()s.  Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:15 +02:00
David Gibson
272d1d033c udp: Make udp_sock_recv static
Through an oversight this was previously declared as a public function
although it's only used in udp.c and there is no prototype in any header.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:13 +02:00
David Gibson
f79c42317f conf: Don't configure port forwarding for a disabled protocol
UDP and/or TCP can be disabled with the --no-udp and --no-tcp options.
However, when this is specified, it's still possible to configure forwarded
ports for the disabled protocol.  In some cases this will open sockets and
perform other actions, which might not be safe since the entire protocol
won't be initialised.

Check for this case, and explicitly forbid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:04:55 +02:00
Jon Maloy
a740e16fd1 tcp: handle shrunk window advertisements from guest
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.

However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.

RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.

It turns out that this solution is extremely simple to implement in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.
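
The check can be sketched as below.  struct conn_sketch, its field names and
update_window() are hypothetical stand-ins for the real connection state;
only the comparison follows the description above.

```c
#include <assert.h>
#include <stdint.h>

/* True if sequence number a comes before b, modulo 2^32 */
static int seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical slice of per-connection state */
struct conn_sketch {
	uint32_t seq_sent;	/* sequence just past the last byte we sent */
	uint32_t wnd;		/* last window we accepted from the peer */
};

/* Record the window advertised by the peer, except for a zero window that
 * arrives while the peer has not yet acknowledged everything we sent: that
 * is a shrunk window, so keep the previously advertised one instead. */
static void update_window(struct conn_sketch *c, uint32_t ack_seq,
			  uint32_t wnd)
{
	if (!wnd && seq_lt(ack_seq, c->seq_sent))
		return;

	c->wnd = wnd;
}
```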

When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
  retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
  advertised window. In fact, in all observed cases we have seen that
  it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
  unless they already have been, meaning that only the next timer
  expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
  abated, and the data flow can resume.

It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.

Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
 to workaround]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 18:05:08 +02:00
Jon Maloy
e63d281871 tcp: leverage support of SO_PEEK_OFF socket option when available
From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes it possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->namespace direction.

In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.

Measurements with iperf3 show that throughput increases by 15-20
percent in the host->namespace direction when this feature is used.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:57:03 +02:00
David Gibson
8bd57bf25b doc: Trivial fix for reuseaddr-priority
This test program checks for particular behaviour regardless of order of
operations.  So, we step through the test with all possible orders for
a number of different parts.  Or at least, we're supposed to; a copy
pasta error led to using the same order for two things which should be
independent.

Fixes: 299c407501 ("doc: Add program to document and test assumptions about SO_REUSEADDR")
Reported-by: David Taylor <davidt@yadt.co.uk>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:55:52 +02:00
David Gibson
ec2691a12e doc: Test behaviour of zero length datagram recv()s
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().
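
The behaviour being relied on can be sketched as follows.  This is an
independent illustration, not the test program from the patch; the function
name and the receive timeout are assumptions.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Queue two datagrams on a loopback UDP socket, drop the first with a zero
 * length recv(), then read the second.  Returns 1 if the first datagram was
 * discarded as expected, 0 if not, -1 if setup failed. */
static int zero_recv_discards(void)
{
	struct timeval tv = { .tv_sec = 1 };	/* don't block forever */
	struct sockaddr_in a;
	socklen_t alen = sizeof(a);
	int rx, tx, ret = -1;
	char buf[16];

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* port 0: any */
	if (bind(rx, (struct sockaddr *)&a, sizeof(a)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&a, &alen) < 0 ||
	    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		goto out;

	if (sendto(tx, "first", 5, 0, (struct sockaddr *)&a, sizeof(a)) != 5 ||
	    sendto(tx, "second", 6, 0, (struct sockaddr *)&a, sizeof(a)) != 6)
		goto out;

	/* Zero length receive: dequeues "first" without copying any data */
	if (recv(rx, NULL, 0, 0) != 0) {
		ret = 0;
		goto out;
	}

	/* The next datagram in the queue should now be "second" */
	ret = recv(rx, buf, sizeof(buf), 0) == 6 && !memcmp(buf, "second", 6);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```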

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:48 +02:00
David Gibson
299c407501 doc: Add program to document and test assumptions about SO_REUSEADDR
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram.  Which one will actually receive it is important to us.

Add a test program which verifies things work the way we expect, which
documents what those expectations are in the process.
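
The starting assumption, that two UDP sockets can hold the same bind at
once, can be sketched like this.  The helper names are hypothetical, and the
sketch deliberately stops short of checking which socket receives a given
datagram, which is what the real test program is concerned with.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket with SO_REUSEADDR set, bound to 127.0.0.1:port.
 * Returns the fd, or -1 on error. */
static int reuse_sock(in_port_t port)
{
	struct sockaddr_in a;
	int one = 1, s;

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	a.sin_port = htons(port);

	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
	    bind(s, (struct sockaddr *)&a, sizeof(a)) < 0) {
		close(s);
		return -1;
	}

	return s;
}

/* Returns 1 if two sockets can hold the same bind at once, 0 if the second
 * bind is refused, -1 if setup failed. */
static int overlapping_binds_allowed(void)
{
	struct sockaddr_in a;
	socklen_t alen = sizeof(a);
	int s1 = reuse_sock(0), s2, ret = -1;	/* port 0: kernel picks */

	if (s1 < 0)
		return -1;

	if (!getsockname(s1, (struct sockaddr *)&a, &alen)) {
		s2 = reuse_sock(ntohs(a.sin_port));
		ret = s2 >= 0;
		if (s2 >= 0)
			close(s2);
	}

	close(s1);
	return ret;
}
```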

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:43 +02:00
David Gibson
be0214cca6 udp: Consolidate datagram batching
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap).  The logic to do this is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().

Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch.  udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().

This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:41 +02:00
David Gibson
69e5393c37 udp: Move some more of sock_handler tasks into sub-functions
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send() loosely do four
things between them:
  1. Receive some datagrams from a socket
  2. Split those datagrams into batches depending on how they need to be
     sent (via tap or via a specific splice socket)
  3. Prepare buffers for each datagram to send it onwards
  4. Actually send it onwards

Split (1) and (3) into specific helper functions.  This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:39 +02:00
David Gibson
c6c61a9e1a udp: Don't repeatedly initialise udp[46]_eth_hdr
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame.  This makes sense since our ethernet header is always the same.

However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop.  Pull that outside the loop so we just initialise them once.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:37 +02:00
David Gibson
55aff45bc1 udp: Unify udp[46]_l2_iov
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header and IPv4 per-frame
header, and udp6_l2_iov points to the IPv6 versions.

We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len.  It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:35 +02:00
David Gibson
9f9b15f949 udp: Unify udp[46]_mh_splice
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.

Unify these by having the common array point at a sockaddr_inany as the
address.  This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port.  However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.

Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname  to udp_splicename.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:33 +02:00
David Gibson
fbd78b6f3e udp: Rename IOV and mmsghdr arrays
Make the salient points about these various arrays clearer with renames:

* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
  with L2.  They are, however, specific to receiving not sending.  Rename
  to udp_iov_recv and udp[46]_mh_recv.

* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
  Rename to udp[46]_l2_iov

* udp[46]_localname are (for now) pre-populated with the local address but
  the more salient point is that these are the destination address for the
  splice arrays.  Rename to udp[46]_splice_to

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:30 +02:00
David Gibson
f62c33d85f udp: Pass full epoll reference through more of sock handler path
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions.  Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions.  To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:28 +02:00
David Gibson
8f8eb73482 flow: Add flow_sidx_valid() helper
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t.  We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.

That's not always done consistently though.  In flow_at_sidx() we open code
a check on the flow index.  In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.

Clean this up a bit, by adding an explicit flow_sidx_valid() test function.
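
As a rough illustration of the idea (not the actual passt code), the
helper could look like the sketch below.  The type layout, the table
size and all the *_sketch names are assumptions made up for this
example; only the principle, one shared test that treats any
out-of-bounds flow index like the "none" value, comes from the
description above.

```c
#include <stdbool.h>

/* All names and sizes below are assumptions made for illustration */
#define FLOW_MAX_SKETCH		(128 * 1024)	/* hypothetical table size */

typedef struct {
	unsigned side :1;	/* which side of the flow */
	unsigned flow :31;	/* index into the flow table */
} flow_sidx_sketch_t;

/* NULL-like value: a flow index that is out of bounds by construction */
#define FLOW_SIDX_NONE_SKETCH \
	((flow_sidx_sketch_t){ .flow = FLOW_MAX_SKETCH })

/* One explicit test, so that FLOW_SIDX_NONE_SKETCH and any other
 * out-of-bounds index are rejected in the same way everywhere.
 */
bool flow_sidx_valid_sketch(flow_sidx_sketch_t sidx)
{
	return sidx.flow < FLOW_MAX_SKETCH;
}
```

Callers that only need to know whether an index is usable can then call
the helper instead of comparing against the "none" value or fetching a
flow they don't otherwise need.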

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:25 +02:00
David Gibson
74c1c5efcf util: sock_l4() determine protocol from epoll type rather than the reverse
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state.  Currently it determines the correct tag for the epoll
data based on the protocol.  However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol.  So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:09 +02:00
Stefano Brivio
b625ed5fee conf: Use the right maximum buffer size for c->sock_path
UNIX_SOCK_MAX is the maximum number we'll append to the socket path
if we generate it automatically. If it's given on the command line,
it can be up to UNIX_PATH_MAX (including the terminating character)
long.

UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of
108).

Commit ceddcac74a ("conf, tap: False "Buffer not null terminated"
positives, CWE-170") fixed the wrong problem: the right fix for the
problem at hand was actually commit cc287af173 ("conf: Fix
incorrect bounds checking for sock_path parameter").

Fixes: ceddcac74a ("conf, tap: False "Buffer not null terminated" positives, CWE-170")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:34:06 +02:00
Stefano Brivio
403a7c14a0 tcp_splice: Check return value of setsockopt() for SO_RCVLOWAT
Spotted by Coverity, harmless as we would consider that successful
and check on the socket later from the timer, but printing a debug
message in that case is definitely wise, should it ever happen.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:33:57 +02:00
Stefano Brivio
21ee1eb2de conf: Copy up to MAXDNSRCH - 1 bytes, not MAXDNSRCH
Spotted by Coverity just recently. Not that it really matters as
MAXDNSRCH always appears to be defined as 1025, while a full domain
name can have up to 253 characters: it would be a bit pointless to
have a longer search domain.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-02 21:33:29 +02:00
Stefano Brivio
1ee2ecade3 udp: Reduce scope of rport in udp_invert_portmap()
cppcheck 2.14 warns that the scope of the rport variable could be
reduced: do that, as reverted commit c80fa6a6bb ("udp: Make rport
calculation more local") did, but keep the temporary variable of
in_port_t type, otherwise the sum gets promoted to int.

While at it, add a comment explaining why we calculate rport like
this instead of directly using the sum as array index.
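
The promotion issue can be shown with a minimal sketch.  The function
name below is hypothetical and uint16_t stands in for in_port_t; this is
not the code from udp_invert_portmap(), just the same arithmetic in
isolation.

```c
#include <stdint.h>

/* Hypothetical helper: apply a forwarding delta to a port so that the
 * result wraps around at 16 bits, as port numbers do.
 */
uint16_t port_add_delta_sketch(uint16_t port, uint16_t delta)
{
	/* Because of integer promotion, port + delta is computed as int
	 * and can be as large as 131070.  Storing it in a 16-bit
	 * temporary reduces it modulo 65536 before it's used as an
	 * array index.
	 */
	uint16_t rport = (uint16_t)(port + delta);

	return rport;
}
```

Using port + delta directly as an index would instead address entries
beyond 65535 whenever the sum wraps.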

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-24 15:41:38 +02:00
Stefano Brivio
054697598f Revert "udp: Make rport calculation more local"
This reverts commit c80fa6a6bb, as it
reintroduces the issue fixed by commit 1e6f92b995 ("udp: Fix 16-bit
overflow in udp_invert_portmap()").

Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=80
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-24 15:41:38 +02:00
Stefano Brivio
c66f0341d9 log: Don't report syslog failures to stderr after initialisation
If we daemonised, we can't use standard error. If we didn't, it's
rather annoying to have all those messages on standard error anyway,
and kind of pointless too, as the messages we wanted to print were
printed to standard error anyway.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-21 15:32:48 +02:00
Stefano Brivio
e7323e515a conf, passt: Don't call __openlog() if a log file is used
If a log file is configured, we would otherwise open a connection to
the system logger (if any), print any message that we might have
before we initialise the log file, and then keep that connection
around for no particular reason.

Call __openlog() as an alternative to the log file setup, instead.

This way, we might skip printing some messages during the
initialisation phase, but they're probably not really valuable to
have in a system log, and we're going to print them to standard
error anyway.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:46 +02:00
Stefano Brivio
dba7f0f5ce treewide: Replace strerror() calls
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.

While at it, convert a few error messages from a scant perror style
to proper failure descriptions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:44 +02:00
Stefano Brivio
92a22fef93 treewide: Replace perror() calls with calls to logging functions
perror() prints directly to standard error, but in many cases standard
error might be already closed, or we might want to skip logging, based
on configuration. Our logging functions provide all that.

While at it, make errors more descriptive, replacing some of the
existing basic perror-style messages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:43 +02:00
Stefano Brivio
c1140df889 log: Add _perror() logging function variants
In many places, we have direct perror() calls, which completely bypass
logging functions and log files.

They are definitely convenient: offer similar convenience with
_perror() logging variants, so that we can drop those direct perror()
calls.
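
A minimal sketch of what such a variant does, assuming a hypothetical
format_perror_sketch() that writes into a caller-supplied buffer rather
than to an actual log: format the message, then append the description
of the error code, as perror() would.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<formatted message>: <strerror(errnum)>"
 * in buf.  Returns 0 on success, -1 if the message doesn't fit.
 */
int format_perror_sketch(char *buf, size_t size, int errnum,
			 const char *fmt, ...)
{
	va_list ap;
	int n, m;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= size)
		return -1;

	/* Append the error description, perror() style */
	m = snprintf(buf + n, size - (size_t)n, ": %s", strerror(errnum));
	if (m < 0 || (size_t)m >= size - (size_t)n)
		return -1;

	return 0;
}
```

A real logging variant would save errno on entry and hand the resulting
string to the usual logging path, so that log files and log levels are
honoured.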

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:40 +02:00
Stefano Brivio
afd9cdc9bb log, passt: Always print to stderr before initialisation is complete
After commit 15001b39ef ("conf: set the log level much earlier"), we
had a phase during initialisation when messages wouldn't be printed to
standard error anymore.

Commit f67238aa86 ("passt, log: Call __openlog() earlier, log to
stderr until we detach") fixed that, but only for the case where no
log files are given.

If a log file is configured, vlogmsg() will not call passt_vsyslog(),
but during initialisation, LOG_PERROR is set, so to avoid duplicated
prints (which would result from passt_vsyslog() printing to stderr),
we don't call fprintf() from vlogmsg() either.

This is getting a bit too complicated. Instead of abusing LOG_PERROR,
define an internal logging flag that clearly represents that we're not
done with the initialisation phase yet.

If this flag is not set, make sure we always print to stderr, if the
log mask matches.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:34 +02:00
Stefano Brivio
8c2f24a560 conf, log: Instead of abusing log levels, add log_conf_parsed flag
We currently use a LOG_EMERG log mask to represent the fact that we
don't know yet what the mask resulting from configuration should be,
before the command line is parsed.

However, we have the necessity of representing another phase as well,
that is, configuration is parsed but we didn't daemonise yet, or
we're not ready for operation yet. The next patch will add that
notion explicitly.

Mapping these cases to further log levels isn't really practical.
Introduce boolean log flags to represent them, instead of abusing
log priorities.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:31 +02:00
Stefano Brivio
bca0fefa32 conf, passt: Make --stderr do nothing, and deprecate it
The original behaviour of printing messages to standard error by
default when running from a non-interactive terminal was introduced
because the first KubeVirt integration draft used to start passt in
foreground and get messages via standard error.

For development purposes, the system logger was more convenient at
that point, and passt was running from interactive terminals only if
not started by the KubeVirt integration.

This behaviour was introduced by 84a62b79a2 ("passt: Also log to
stderr, don't fork to background if not interactive").

Later, I added command-line options in 1e49d194d0 ("passt, pasta:
Introduce command-line options and port re-mapping") and accidentally
reversed this condition, which wasn't a problem as --stderr could
force printing to standard error anyway (and it was used by KubeVirt).

Nowadays, the KubeVirt integration uses a log file (requested via
libvirt configuration), and the same applies for Podman if one
actually needs to look at runtime logs. There are no use cases left,
as far as I know, where passt runs in foreground in non-interactive
terminals.

Seize the chance to reintroduce some sanity here. If we fork to
background, standard error is closed, so --stderr is useless in that
case.

If we run in foreground, there's no harm in printing messages to
standard error, and that accidentally became the default behaviour
anyway, so --stderr is not needed in that case.

It would be needed for non-interactive terminals, but there are no
use cases, and if there were, let's log to standard error anyway:
the user can always redirect standard error to /dev/null if needed.

Before we're up and running, we need to print to standard error anyway
if something happens, otherwise we can't report failure to start in
any kind of usage, stand-alone or in integrations.

So, make --stderr do nothing, and deprecate it.

While at it, drop a left-over comment about --foreground being the
default only for interactive terminals, because it's not the case
anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:28 +02:00
Stefano Brivio
b74801645c conf, passt: Don't try to log to stderr after we close it
If we don't run in foreground, we close standard error as we
daemonise, so it makes no sense to check if the controlling terminal
is an interactive terminal or if --force-stderr was given, to decide
if we want to log to standard error.

Make --force-stderr depend on --foreground.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:15 +02:00
Stefano Brivio
65923ba798 conf: Accept duplicate and conflicting options, the last one wins
On multiple occasions, especially when passt(1) and pasta(1) are used
in integrations such as the one with Podman, the ability to override
earlier options on the command line with later ones would have been
convenient.

Recently, to debug a number of issues happening with Podman, I would
have liked to ask users to share a debug log by passing --debug as
additional option, but pasta refuses --quiet (always passed by Podman)
and --debug at the same time.

On top of this, Podman lets users specify other pasta options in its
containers.conf(5) file, as well as on the command line.

The options from the configuration files are appended together with
the ones from the command line, which makes it impossible for users to
override options from the configuration file, if duplicated options
are refused, unless Podman takes care of sorting them, which is
clearly not sustainable.

For --debug and --trace, somebody took care of this on Podman side at:
  https://github.com/containers/common/pull/2052

but this doesn't fix the issue with other options, and we'll have
anyway older versions of Podman around, too.

I think there's some value in telling users about duplicated or
conflicting options, because that might reveal issues in integrations
or accidental misconfigurations, but by now I'm fairly convinced that
the downsides outweigh this.

Drop checks about duplicate options and mutually exclusive ones. In
some cases, we need to also undo a couple of initialisations caused
by earlier options, but this looks like a simplification, overall.

Notable exception: --stderr still conflicts with --log-file, because
users might have the expectation that they don't actually conflict.
But they do conflict in the existing implementation, so it's safer
to make sure that the users notice that.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-06-21 15:31:46 +02:00
Stefano Brivio
62de6140d9 netlink: Strip nexthop identifiers when duplicating routes
If routing daemons set up host routes, for example FRR via OSPF as in
the reported issue, they might add nexthop identifiers (not objects)
that are generally not valid in the target namespace. Strip them off
as well, otherwise we'll get EINVAL from the kernel.

Link: https://github.com/containers/podman/issues/22960
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-20 17:03:28 +02:00
Danish Prakash
1544a43863 passt.1, qrap.1: align license description with SPDX identifier
The SPDX identifier states GPL-2.0-or-later but the copyright section
mentions GPL-3.0 or later causing a mismatch.

Also, correctly refer to the GPL instead of the AGPL.

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-19 15:00:55 +02:00
Stefano Brivio
f301bb18b5 netlink: Ignore EHOSTUNREACH failures when duplicating routes
To implicitly resolve possible dependencies between routes as we
duplicate them into the target namespace, we go through a set of n
routes n times, and ignore EEXIST responses to netlink messages (we
already inserted the route) and ENETUNREACH (we didn't insert the
route yet, but we need to insert another one first).

Until now, we didn't ignore EHOSTUNREACH responses. However,
NetworkManager users with multiple non-subnet routes for the same
interface report that pasta exits with "no route to host" while
duplicating routes.

This happens because NetworkManager sets the 'noprefixroute' attribute
on addresses, meaning that the kernel won't create subnet routes
automatically depending on the prefix length of the address. We copy
this attribute as we copy the address into the target namespace, and
as a result, the kernel doesn't create subnet routes in the target
namespace either.

This means that the gateway for routes that are inserted later can be
unreachable at some points during the sequence of route duplication.
That is, we don't just have dependencies between regular routes, but
we can also have dependencies between regular routes and subnet
routes, as subnet routes are not automatically inserted in advance.
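
The n-passes approach can be sketched with a small simulation.  Nothing
below talks to netlink: the route structure, the dependency field and
the function names are all made up for illustration, and a missing
dependency is simply reported as -EHOSTUNREACH.

```c
#include <errno.h>
#include <stdbool.h>

/* Simulated route: 'dep' is the index of a route that must be present
 * before this one can be inserted, or -1 if there is none.
 */
struct route_sketch {
	int dep;
	bool inserted;
};

/* Stand-in for a netlink request: returns 0 or a negative error code */
static int route_insert_sketch(struct route_sketch *r, int i)
{
	if (r[i].inserted)
		return -EEXIST;
	if (r[i].dep >= 0 && !r[r[i].dep].inserted)
		return -EHOSTUNREACH;	/* gateway not reachable yet */

	r[i].inserted = true;
	return 0;
}

/* Go through n routes n times: each pass can resolve at least one more
 * dependency, so n passes are enough whatever the original order is.
 */
int dup_routes_sketch(struct route_sketch *r, int n)
{
	int pass, i;

	for (pass = 0; pass < n; pass++) {
		for (i = 0; i < n; i++) {
			int rc = route_insert_sketch(r, i);

			if (rc && rc != -EEXIST && rc != -ENETUNREACH &&
			    rc != -EHOSTUNREACH)
				return rc;	/* unexpected failure */
		}
	}

	return 0;
}
```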

Link: https://github.com/containers/podman/issues/22824
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-19 15:00:55 +02:00
Stefano Brivio
450a6131be netlink: With no default route, pick the first interface with a route
While commit f919dc7a4b ("conf, netlink: Don't require a default
route to start") sounded reasonable in the assumption that, if we
don't find default routes for a given address family, we can still
proceed by selecting an interface with any route *iff it's the only
one for that protocol family*, Jelle reported a further issue in a
similar setup.

There, multiple interfaces are present, and while remote container
connectivity doesn't matter for the container, local connectivity is
desired. There are no default routes, but those multiple interfaces
all have non-default routes, so we should just pick one and start.

Pick the first interface reported by the kernel with any route, if
there are no default routes. There should be no harm in doing so.

Reported-by: Jelle van der Waa <jvanderwaa@redhat.com>
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-06-19 15:00:55 +02:00
Stefano Brivio
54a9d3801b tcp: Don't rely on bind() to fail to decide that connection target is valid
Commit e1a2e2780c ("tcp: Check if connection is local or low RTT
was seen before using large MSS") added a call to bind() before we
issue a connect() to the target for an outbound connection.

If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCES, we
can conclude that the target address is a local (host) address, and we
can use an unlimited MSS.

While at it, according to the reasoning of that commit, if bind()
succeeds, we would know right away that nobody is listening at that
(local) address and port, and we don't even need to call connect(): we
can just fail early and reset the connection attempt.

But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or
net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will
actually succeed, so we can't rely on it to fail in general.

The visible issue with the existing behaviour is that we would reset
any outbound connection to non-local addresses, if non-local binds are
enabled.

Keep the significant optimisation for local addresses along with the
bind() call, but if it succeeds, don't draw any conclusion: close the
socket, grab another one, and proceed normally.

This will incur a small latency penalty if non-local binds are
enabled (we'll likely fetch an existing socket from the pool but
additionally call close()), or if the target is local but not bound:
we'll need to call connect() and get a failure before relaying that
failure back.
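
The resulting decision can be summarised by the sketch below.  The enum
and function names are hypothetical, and the real code acts on sockets
rather than on a return value passed around like this: the point is
only that a successful bind() no longer leads to any conclusion.

```c
#include <errno.h>

enum bind_probe_sketch {
	PROBE_LOCAL,	/* target is a local address: no MSS limit */
	PROBE_UNKNOWN	/* no conclusion: proceed with connect() */
};

/* ret and err are the return value of bind() and the errno it set */
enum bind_probe_sketch bind_probe_classify_sketch(int ret, int err)
{
	/* Success might just mean that non-local binds are enabled */
	if (ret == 0)
		return PROBE_UNKNOWN;

	/* Not a local address, or we're not allowed to bind there */
	if (err == EADDRNOTAVAIL || err == EACCES)
		return PROBE_UNKNOWN;

	/* Any other failure, typically EADDRINUSE: address is local */
	return PROBE_LOCAL;
}
```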

Link: https://github.com/containers/podman/issues/23003
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-19 15:00:55 +02:00
David Gibson
020ff7a40e siphash: Remove stale prototypes
In fc8f0f8c ("siphash: Use incremental rather than all-at-once siphash
functions") we removed the older interface to the SipHash implementation,
which took fixed sized blocks of data.  However, we forgot to remove the
prototypes for those functions, so do that now.

Fixes: fc8f0f8c48 ("siphash: Use incremental rather than all-at-once siphash functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-19 14:55:42 +02:00
David Gibson
7e87bd98ac udp: Move management of udp[46]_localname into udp_splice_send()
Mostly, udp_sock_handler() is independent of how the datagrams it processes
will be forwarded (tap or splice).  However, it also updates the msg_name
fields for spliced sends, which doesn't really make sense here.  Move it
into udp_splice_send() which is all about spliced sends.  This does
potentially mean we'll update the field to the same value several times,
but we're going to need this in future anyway: with the extensions the
flow table allows, it might not be the same value each time after all.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:46 +02:00
David Gibson
ff57f8ddc6 udp: Rework how we divide queued datagrams between sending methods
udp_sock_handler() takes a number of datagrams from sockets that, depending
on their addresses, could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced").  In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.

To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The determination of what the target is is made by udp_mmh_splice_port().
It returns the source port for splice packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.

udp_mmh_splice_port() is moderately expensive, and unfortunately we
can call it twice on the same datagram: once as the (last + 1) entry
in one batch (to check it's not in that batch), then again as the
first entry in the next batch.

Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard.  It will also generalise better
when we have more complex possibilities based on the flow table.
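
The look-ahead can be illustrated with a simplified sketch.  Here
classify_sketch() stands in for udp_mmh_splice_port(), datagrams are
plain integers, and the splicefrom[] array plays the role of the
metadata field; all of these names are assumptions made for the
example.

```c
#include <stddef.h>

/* Counts lookups, to show that each datagram is classified once */
unsigned classify_calls_sketch;

/* Stand-in for the moderately expensive lookup: here, the datagram
 * value itself is the "splice port", with -1 meaning "tap".
 */
static int classify_sketch(int dgram)
{
	classify_calls_sketch++;
	return dgram;
}

/* Split n datagrams into contiguous batches with the same target and
 * return the number of batches.  splicefrom[i] is always filled in one
 * entry ahead of the datagram that's currently being considered.
 */
size_t count_batches_sketch(const int *dgram, int *splicefrom, size_t n)
{
	size_t i = 0, batches = 0;

	if (!n)
		return 0;

	splicefrom[0] = classify_sketch(dgram[0]);

	while (i < n) {
		int first = splicefrom[i];

		do {
			i++;
			if (i < n)
				splicefrom[i] = classify_sketch(dgram[i]);
		} while (i < n && splicefrom[i] == first);

		batches++;	/* a real handler would send the batch */
	}

	return batches;
}
```

The entry that terminates one batch already has its target stored when
it becomes the first entry of the next batch, so no lookup is repeated.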

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:42 +02:00
David Gibson
63db7dcdbf udp: Fold checking of splice flag into udp_mmh_splice_port()
udp_mmh_splice_port() is used to determine if a UDP datagram can be
"spliced" (forwarded via a socket instead of tap).  We only invoke it if
the origin socket has the 'splice' flag set.

Fold the checking of the flag into the helper itself, which makes the
caller simpler.  It does mean we have a loop looking for a batch of
spliceable or non-spliceable packets even in the case where the flag is
clear.  This shouldn't be that expensive though, since each call to
udp_mmh_splice_port() will return without accessing memory in that case.
In any case we're going to need a similar loop in more cases with upcoming
flow table work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:11:39 +02:00
David Gibson
523fbc5af7 util: Split construction of bind socket address from the rest of sock_l4()
sock_l4() creates, binds and otherwise prepares a new socket.  It builds
the socket address to bind from separately provided address and port.
However, we have use cases coming up where it's more natural to construct
the socket address in the caller.

Prepare for this by adding sock_l4_sa() which takes a pre-constructed
socket address, and rewriting sock_l4() in terms of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-14 12:10:52 +02:00
Laurent Vivier
4070bac7a4 tap: use in->buf_size rather than sizeof(pkt_buf)
buf_size is set to sizeof(pkt_buf) by default. And it seems more correct
to provide the actual size of the buffer.

Later a buf_size of 0 will allow vhost-user mode to detect
guest memory buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:42 +02:00
Laurent Vivier
7290335b14 iov: remove iov_copy()
It was needed by a draft version of vhost-user; it is not needed
anymore.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:40 +02:00
Laurent Vivier
0c335d751a vhost-user: compare mode MODE_PASTA and not MODE_PASST
As we are going to introduce MODE_VU, which will act like MODE_PASST,
compare against MODE_PASTA rather than adding a comparison against
MODE_VU wherever we check for MODE_PASST.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:38 +02:00
Laurent Vivier
377b666dc9 udp: rename udp_sock_handler() to udp_buf_sock_handler()
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:34 +02:00
Laurent Vivier
e7ac995217 udp: refactor UDP header update functions
This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions
to improve code portability by replacing the udp_meta_t parameter with
more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr)
and the source socket address (sockaddr_in/sockaddr_in6).
It also moves the tap_hdr_update() function call inside the udp_tap_send()
function, so that the TAP header doesn't have to be passed to
udp_update_hdr4() and udp_update_hdr6().

This refactor reduces complexity by making the functions more modular and
ensuring that each function operates on more narrowly scoped data structures.
This will facilitate future backend introduction like vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:32 +02:00
Laurent Vivier
9ecf7fedc5 tap: refactor packets handling functions
Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(),
and tap4_handler() and tap6_handler() into tap_handler().
Create a generic tap_add_packet() to consolidate packet
addition logic and reduce code duplication.

The purpose is to ease the export of these functions to use
them with the vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:19 +02:00
Laurent Vivier
fba2b544b6 tcp: move buffers management functions to their own file
Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c

With this change we can use existing TCP functions with a
different kind of memory storage, such as the shared
memory provided by the guest via vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:05 +02:00
Laurent Vivier
ec26fa013a tcp: extract buffer management from tcp_send_flag()
This commit isolates the internal data structure management used for storing
data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[],
tcp4_flags[], ...) from the tcp_send_flag() function. The extracted
functionality is relocated to a new function named tcp_fill_flag_header().

tcp_fill_flag_header() is now a generic function that accepts parameters such
as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to
pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[].

This separation sets the stage for utilizing tcp_prepare_flags() to
set the memory provided by the guest via vhost-user in future developments.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:43:35 +02:00
David Gibson
d949667436 cppcheck: Suppress constParameterCallback errors
We have several functions which are used as callbacks for NS_CALL() which
only read their void * parameter, they don't write it.  The
constParameterCallback warning in cppcheck 2.14.1 complains that this
parameter could be const void *, also pointing out that that would require
casting the function pointer when used as a callback.

Casting the function pointers seems substantially uglier than using a
non-const void * as the parameter, especially since in each case we cast
the void * to a const pointer of specific type immediately.  So, suppress
these errors.

I think it would make logical sense to suppress this globally, but that
would cause unmatchedSuppression errors on earlier cppcheck versions.  So,
instead individually suppress it, along with unmatchedSuppression in the
relevant places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-08 13:06:20 +02:00
Derek Schrock
8a83b530fe selinux: Allow access to user_devpts
Allow access to user_devpts.

	$ pasta --version
	pasta 0^20240510.g7288448-1.fc40.x86_64
	...
	$ awk '' < /dev/null
	$ pasta --version
	$

While this might be an awk bug, it appears pasta should still have access
to devpts.

Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
ec416fdcc4 tcp, flow: Fix some error paths which didn't clean up flows properly
Flow table entries need to be fully initialised before returning to the
main epoll loop.  Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().

Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.

I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.

Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
3f63743a65 util: Use 'long' to represent millisecond durations
timespec_diff_ms() returns an int representing a duration in milliseconds.
This will overflow in about 25 days when an int is 32 bits.  The way we
use this function, we're probably not going to get a result that long, but
it's not outrageously implausible.  Use a long for safety.
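
As a rough illustration of the arithmetic (a sketch only: the names and
bodies below are assumptions, not the actual timespec_diff_ms() code), a
32-bit int counting milliseconds runs out after 2^31 - 1 ms, a bit under 25
days, whereas a 64-bit long has no practical limit:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: a - b in milliseconds, as a long rather than an int.
 * Where long is 64 bits this cannot realistically overflow; where long is
 * only 32 bits, the limit of roughly 25 days still applies.
 */
long diff_ms_sketch(const struct timespec *a, const struct timespec *b)
{
	long sec, nsec;

	assert(a && b);

	sec = (long)(a->tv_sec - b->tv_sec);
	nsec = a->tv_nsec - b->tv_nsec;

	/* nsec may be negative; the whole-second term compensates for it */
	return sec * 1000 + nsec / 1000000;
}

/* Whole days that fit in a 32-bit signed count of milliseconds */
int int32_ms_limit_days_sketch(void)
{
	return INT32_MAX / (1000 * 60 * 60 * 24);
}
```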

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
f9e8ee0777 lineread: Use ssize_t for line lengths
Functions and structures in lineread.c use plain int to record and report
the length of lines we receive.  This means we truncate the result from
read(2) in some circumstances.  Use ssize_t to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
c919bbbdd3 conf: Safer parsing of MAC addresses
In conf() we parse a MAC address in two places, for the --ns-mac-addr and
the -M options.  As well as duplicating code, the logic for this parsing
has several bugs:
  * The most serious is that if the given string is shorter than a MAC
    address should be, we'll access past the end of it.
  * We don't check the endptr supplied by strtol() which means we could
    ignore certain erroneous contents
  * We never check the separator characters between each octet
  * We ignore certain sorts of garbage that follow the MAC address

Correct all these bugs in a new parse_mac() helper.
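
The checks listed above could look roughly like the following.  This is a
minimal sketch under my own assumptions, not the parse_mac() added by the
patch: the function name, the fixed "xx:xx:xx:xx:xx:xx" format and the
return convention are all hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define MAC_LEN_SKETCH 6

/* Hypothetical sketch: parse exactly "xx:xx:xx:xx:xx:xx" into mac[].
 * Returns 0 on success, -1 on any error (mac[] may be partly written).
 */
int parse_mac_sketch(unsigned char mac[MAC_LEN_SKETCH], const char *str)
{
	size_t i;

	assert(mac && str);

	for (i = 0; i < MAC_LEN_SKETCH; i++) {
		const char *p = str + i * 3;
		unsigned long octet;
		char *end;

		/* Two hex digits: checking them in order means we stop at
		 * the terminating NUL and never read past a short string.
		 */
		if (!isxdigit((unsigned char)p[0]) ||
		    !isxdigit((unsigned char)p[1]))
			return -1;

		/* Check the end pointer: exactly two characters consumed */
		octet = strtoul(p, &end, 16);
		if (end != p + 2)
			return -1;

		/* ':' between octets, nothing at all after the last one */
		if (*end != (i < MAC_LEN_SKETCH - 1 ? ':' : '\0'))
			return -1;

		mac[i] = (unsigned char)octet;
	}

	return 0;
}
```

Using one helper for both options means a short string, a wrong separator
or trailing garbage is rejected the same way in both places.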

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
bda80ef53f util: Use unsigned indices for bits in bitmaps
A negative bit index in a bitmap doesn't make sense.  Avoid this by
construction by using unsigned indices.  While we're there adjust
bitmap_isset() to return a bool instead of an int.
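
For illustration, helpers of this shape might look as follows.  This is a
sketch only: the names and the byte-based layout are assumptions, and the
real helpers in util.c may use a different word size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: set bit number 'bit' in a byte-array bitmap.  The unsigned
 * index makes a negative bit number unrepresentable.
 */
void bitmap_set_sketch(uint8_t *map, unsigned bit)
{
	assert(map);
	map[bit / 8] |= (uint8_t)(1U << (bit % 8));
}

/* Hypothetical: report whether bit number 'bit' is set, as a bool */
bool bitmap_isset_sketch(const uint8_t *map, unsigned bit)
{
	assert(map);
	return (map[bit / 8] >> (bit % 8)) & 1U;
}
```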

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
0e36fe1a43 clang-tidy: Enable the bugprone-macro-parentheses check
We globally disabled this, with a justification lumped together with
several checks about braces.  They don't really go together, the others
are essentially a stylistic choice which doesn't match our style.  Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.

We've only gotten away with the macros which trigger the warning because,
thanks to other conventions, it's been unlikely to invoke them with anything
other than a simple identifier.  Fix the macros, and enable the warning for
the future.
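
A small illustration of the kind of bug this check guards against.  The
macros here are made up for the example and are not taken from the code
base:

```c
#include <assert.h>

/* Hypothetical macros, for illustration only */
#define DOUBLE_UNSAFE(x)	(x * 2)		/* parameter not parenthesised */
#define DOUBLE_SAFE(x)		((x) * 2)

/* With a plain identifier or literal, both behave the same... */
static_assert(DOUBLE_UNSAFE(3) == 6, "simple argument");
static_assert(DOUBLE_SAFE(3) == 6, "simple argument");

/* ...but with an expression, the unsafe form expands to (1 + 2 * 2) */
static_assert(DOUBLE_UNSAFE(1 + 2) == 5, "precedence surprise");
static_assert(DOUBLE_SAFE(1 + 2) == 6, "expected result");
```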

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
7094b91d10 Remove pointless macro parameters in CALL_PROTO_HANDLER
The 'c' parameter is always passed exactly 'c'.  The 'now' parameter is
always passed exactly 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
c80fa6a6bb udp: Make rport calculation more local
cppcheck 2.14.1 complains about the rport variable not being in as small
a scope as it could be.  It's also only used once, so we might as well
just open code the calculation for it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
d2afb4b625 tcp: Make pointer const in tcp_revert_seq
The th pointer could be const, which causes a cppcheck warning on at least
some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40).

Fixes: e84a01e94c ("tcp: move seq_to_tap update to when frame is queued")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
b3aeb004ea log: Remove log_to_stdout option
Now that we've simplified how usage() works, nothing ever sets the
log_to_stdout flag. Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:09 +02:00
David Gibson
7cb2088835 conf: Don't print usage via the logging subsystem
The message from usage() when given invalid options, or the -h / --help
option is currently printed by many calls to the info() function, also
used for runtime logging of informational messages.

That isn't useful: the usage message should always go to the terminal
(stdout or stderr), never syslog or a logfile.  It should never be
filtered by priority.  Really the only thing using the common logging
functions does is give more opportunities for something to go wrong.

Replace all the info() calls with direct fprintf() calls.  This does mean
manually adding "\n" to each message.  A little messy, but worth it for the
simplicity in other dimensions.  While we're there make much heavier use
of single strings containing multiple lines of output text.  That reduces
the number of fprintf calls, reducing visual clutter and making it easier
to see what the output will look like from the source.
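
As a sketch of the style described above (the function name, signature and
option list here are illustrative assumptions, not the real help text),
adjacent string literals let one fprintf() call carry several output lines
while the source still shows the layout:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical usage printer: writes straight to the given stream, with
 * no logging layer in between.  Returns the fprintf() result, that is the
 * number of bytes written, or a negative value on error.
 */
int usage_sketch(FILE *f, const char *name)
{
	assert(f && name);

	return fprintf(f,
		"Usage: %s [OPTION]...\n"
		"\n"
		"  -h, --help\tDisplay this message and exit\n"
		"  -d, --debug\tBe verbose\n",
		name);
}
```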

Link: https://bugs.passt.top/show_bug.cgi?id=90
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:06 +02:00
David Gibson
e651197b5c conf: Remove unhelpful usage() wrapper
usage() does nothing but call print_usage() with EXIT_FAILURE as a
parameter.  It's no more complex to just give that parameter at the single
call site.  Eliminate it and rename print_usage() to just usage().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:14:03 +02:00
Jon Maloy
e84a01e94c tcp: move seq_to_tap update to when frame is queued
commit a469fc393f ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
immediately can make a new attempt to transmit a frame after a failed
transmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.

This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.

We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had to be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:13:16 +02:00
Stefano Brivio
765eb0bf16 apparmor: Fix comments after PID file and AF_UNIX socket creation refactoring
Now:
- we don't open the PID file in main() anymore
- PID file and AF_UNIX socket are opened by pidfile_open() and
  tap_sock_unix_open()
- write_pidfile() becomes pidfile_write()

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:44:21 +02:00
Stefano Brivio
0608ec42f2 conf, passt.h: Rename pid_file in struct ctx to pidfile
We have pidfile_fd now, pid_file_fd would be quite ugly.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:44:14 +02:00
Stefano Brivio
c9b2413465 conf, passt, tap: Open socket and PID files before switching UID/GID
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.

Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:26 +02:00
Stefano Brivio
ba23b05545 passt, util: Move opening of PID file to its own function
We won't call it from main() any longer: move it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:13 +02:00
Stefano Brivio
57d8aa8ffe util: Rename write_pidfile() to pidfile_write()
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:05 +02:00
Stefano Brivio
cbca08cd38 tap: Split tap_sock_unix_init() into opening and listening parts
We'll need to open and bind the socket a while before listening to it,
so split that into two different functions. No functional changes
intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:42:43 +02:00
Stefano Brivio
fcfb592adc passt, tap: Don't use -1 as uninitialised value for fd_tap_listen
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.

Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.

Drop the -1 initialisation and the corresponding check.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-23 16:42:27 +02:00
Stefano Brivio
d02bb6ca05 tap: Move all-ones initialisation of mac_guest to tap_sock_init()
It has nothing to do with tap_sock_unix_init(). It used to be there as
that function could be called multiple times per passt instance, but
it's not the case anymore.

This also takes care of the fact that, with --fd, we wouldn't set the
initial MAC address, so we would need to wait for the guest to send us
an ARP packet before we could exchange data.

Fixes: 6b4e68383c ("passt, tap: Add --fd option")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:42:06 +02:00
Stefano Brivio
45b8632dcc conf: Don't lecture user about starting us as root
libguestfs tools have a good reason to run as root: if the guest image
is owned by root, it would be counterproductive to encourage users to
invoke them as non-root, as it would require changing permissions or
ownership of the image file.

And if they run as root, we'll start as root, too. Warn users we'll
switch to 'nobody', but don't tell them what to do.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:40:33 +02:00
David Gibson
3f917b326b netlink, test: Ignore deprecated addresses
When we retrieve or copy host addresses we can include deprecated
addresses, which is not what we want.  Adjust our logic to exclude them.
Similarly our tests can retrieve deprecated addresses, so exclude them
there too.

I hit this in practice because my router sometimes temporarily advertises
an fd00:: prefix before the real delegated IPv6 prefix.  The deprecated
address can hang around for some time messing up my tests.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:09 +02:00
David Gibson
cc801fb38f tcp: Remove interim 'tapside' field from connection
We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one.  Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:06 +02:00
David Gibson
8a2accb847 flow: Record the pifs for each side of each flow
Currently we have no generic information in flows apart from the type and
state, everything else is specific to the flow type.  Start introducing
generic flow information by recording the pifs which the flow connects.

To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and TGT for
when both sides' information is complete, but we haven't chosen the
flow type yet.  For now, these states don't do an awful lot, but
they'll become more important as we add more generic information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:03 +02:00
David Gibson
43571852e6 flow: Make side 0 always be the initiating side
Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code:  TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the target side.  ICMP also uses 0 for the host/"sock" side and 1 for the
guest/"tap" side, but in its case the latter is always also the initiating
side.

Make this generically consistent by always using side 0 for the initiating
side and 1 for the target side.  This doesn't simplify a lot for now, and
arguably makes TCP slightly more complex, since we add an extra field to
the connection structure to record which is the guest facing side. This is
an interim change, which we'll be able to remove later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:01 +02:00
David Gibson
0060acd11b flow: Clarify and enforce flow state transitions
Flows move over several different states in their lifetime.  The rules for
these are documented in comments, but they're pretty complex and a number
of the transitions are implicit, which makes this pretty fragile and
error prone.

Change the code to explicitly track the states in a field.  Make all
transitions explicit and logged.  To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.

While we're at it, tweak the docs to clarify the restrictions on each state
a bit.
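
To give an idea of what explicit, asserted transitions can look like, here
is a minimal sketch.  The state names, the allowed transitions and the
function names are simplified assumptions for illustration, not the actual
flow table code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical state set; the real flow table may differ */
enum flow_state_sketch {
	FS_FREE,	/* entry unused */
	FS_NEW,		/* allocated, nothing filled in yet */
	FS_TYPED,	/* protocol type chosen, still being set up */
	FS_ACTIVE,	/* fully initialised, visible to the main loop */
};

struct flow_sketch {
	enum flow_state_sketch state;
};

/* Table of permitted transitions; going back to FS_FREE means cancel/free */
bool flow_transition_ok_sketch(enum flow_state_sketch from,
			       enum flow_state_sketch to)
{
	switch (from) {
	case FS_FREE:	return to == FS_NEW;
	case FS_NEW:	return to == FS_TYPED || to == FS_FREE;
	case FS_TYPED:	return to == FS_ACTIVE || to == FS_FREE;
	case FS_ACTIVE:	return to == FS_FREE;
	}
	return false;
}

/* Every state change goes through here, so illegal ones trip the assert */
void flow_set_state_sketch(struct flow_sketch *f, enum flow_state_sketch to)
{
	assert(f);
	assert(flow_transition_ok_sketch(f->state, to));
	f->state = to;
}
```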

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:58 +02:00
David Gibson
a63199832a inany: Better helpers for using inany and specific family addrs together
This adds some extra inany helpers for comparing an inany address to
addresses of a specific family (including special addresses), and building
an inany from an IPv4 address (either statically or at runtime).
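
As an illustration of building such an address at runtime, the sketch below
assumes that an IPv4 address is held as an IPv4-mapped IPv6 address
(::ffff:a.b.c.d).  That representation and the function name are
assumptions of mine, not something this message states:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: wrap an IPv4 address as ::ffff:a.b.c.d */
struct in6_addr inany_from_v4_sketch(struct in_addr a4)
{
	struct in6_addr a6;

	memset(&a6, 0, sizeof(a6));
	a6.s6_addr[10] = 0xff;
	a6.s6_addr[11] = 0xff;
	/* struct in_addr is already in network order: copy the bytes as-is */
	memcpy(&a6.s6_addr[12], &a4, sizeof(a4));

	return a6;
}
```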

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:55 +02:00
David Gibson
7a832a8a0e flow: Properly type callbacks to protocol specific handlers
The flow dispatches deferred and timer handling for flows centrally, but
needs to call into protocol specific code for the handling of individual
flows.  Currently this passes a general union flow *.  It makes more sense
to pass the specific relevant flow type structure.  That brings the check
on the flow type adjacent to casting to the union variant which it tags.

Arguably, this is a slight abstraction violation since it involves the
generic flow code using protocol specific types.  It's already calling into
protocol specific functions, so I don't think this really makes any
difference.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:52 +02:00
David Gibson
1a20370b36 util, tcp: Add helper to display socket addresses
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler.  For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:37 +02:00
Maxime Bélair
3ff3a8a467 apparmor: Fix passt abstraction
Commit b686afa2 introduced the invalid apparmor rule
`mount options=(rw, runbindable) /,` since runbindable mount rules
cannot have a source.

Therefore running aa-logprof/aa-genprof will trigger errors (see
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685)

    $ sudo aa-logprof
    ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/')

This patch fixes it to the intended behavior.

Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685
Fixes: b686afa23e ("apparmor: Explicitly pass options we use while remounting root filesystem")
Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:16:27 +02:00
Paul Holzinger
6cdc9fd51b apparmor: allow netns paths on /tmp
For some unknown reason "owner" makes it impossible to open bind mounted
netns references as apparmor denies it. In the kernel denied log entry
we see ouid=0 but it is not clear why that is as the actual file is
owned by the real (rootless) user id.

In abstractions/pasta there is already `@{run}/user/@{uid}/**` without
owner set for the same reason as this path contains the netns path by
default when running under Podman.

Fixes: 72884484b0 ("apparmor: allow read access on /tmp for pasta")
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:18 +02:00
David Gibson
80f7ff2996 clang-tidy: Suppress macro to enum conversion warnings
clang-tidy 18.1.1 in Fedora 40 complains about every #define of an integral
value, suggesting it be converted to an enum.  Although that's certainly
possible, it's of dubious value and results in some awkward arrangements on
our codebase.  Suppress it globally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:07 +02:00
David Gibson
29bd08ff0f conf: Fix clang-tidy warning about using an undefined enum value
In conf() we temporarily set the forwarding mode variables to 0 - an
invalid value, so that we can check later if they've been set by the
intervening logic.  clang-tidy 18.1.1 in Fedora 40 now complains about
this.  Satisfy it by giving a name in the enum to the 0 value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:02:05 +02:00
lemmi
26c71db332 passt.c: explicitly include libgen.h for basename
fixes implicit declaration warning on musl

Signed-off-by: lemmi <lemmi@nerd2nerd.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:00:52 +02:00
Stefano Brivio
623c2fd621 netlink: Don't duplicate routes referring to unrelated host interfaces
We take care of this in nl_addr_dup(): if the interface index
associated to an address doesn't match the selected host interface
(ifa->ifa_index != ifi_src), we don't copy that address.

But for routes, we just unconditionally update the interface index to
match the index in the target namespace, even if the source interface
didn't match.

This might happen in two cases: with a pre-4.20 kernel without support
for NETLINK_GET_STRICT_CHK, which won't filter routes based on the
interface we pass in the request, as reported by runsisi, and any
kernel with support for multipath routes where any of the nexthops
refers to an unrelated host interface.

In both cases, check the index of the source interface, and avoid
copying unrelated routes.

Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=86
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: runsisi <runsisi@hust.edu.cn>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-11 00:52:19 +02:00
Paul Holzinger
72884484b0 apparmor: allow read access on /tmp for pasta
The podman CI on debian runs tests based on /tmp but pasta is failing
there because it is unable to open the netns path as the open for read
access is denied.

Link: https://github.com/containers/podman/issues/22625
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-10 16:53:35 +02:00
Stefano Brivio
7e6a606c32 tcp_splice: Set OUT_WAIT_ flag whenever pipe isn't emptied
In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(),
from pipe to receiving socket, that doesn't necessarily mean that the
pipe is empty: the receiver buffer might be full instead.

Hence, we can't use the 'never_read' flag to decide that there's
nothing to wait for: even if we didn't read anything from the sending
side in a given iteration, we might still have data to send in the
pipe. Use read/written counters, instead.

This fixes an issue where large bulk transfers would occasionally
hang. From a corresponding strace:

     0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1
     0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0
     0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0
     0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1048576
     0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 520415
     0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 epoll_wait(4, [], 8, 1000) = 0

we're reading from socket 211 into the pipe end numbered 151,
which connects to pipe end 150, and from there we're writing into
socket 115.

We initially drop EPOLLOUT from the set of monitored flags for socket
115, because it already signaled it's ready for output. Then we read
nothing from socket 211 (the sender had nothing to send), and we keep
emptying the pipe into socket 115 (first 1048576 bytes, then 520415
bytes).

This call of tcp_splice_sock_handler() ends with EAGAIN on the writing
side, and we just exit this function without setting the OUT_WAIT_1
flag (and, in turn, EPOLLOUT for socket 115). However, it turns out,
the pipe wasn't actually emptied, and while socket 211 had nothing
more to send, we should have waited on socket 115 to be ready for
output again.

As a further step, we could consider not clearing EPOLLOUT at all,
unless the read/written counters match, but I'm first trying to fix
this ugly issue with a minimal patch.
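
The idea of deciding from counters rather than from a 'never_read' flag can
be sketched like this.  The structure and function names are hypothetical,
and the byte counts in any use of it are made up; this is not the patched
tcp_splice_sock_handler() logic itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-direction counters */
struct splice_dir_sketch {
	uint64_t read;		/* bytes spliced from sending socket to pipe */
	uint64_t written;	/* bytes spliced from pipe to receiving socket */
};

/* After EAGAIN on the pipe-to-socket splice(): data is still queued in the
 * pipe exactly when more was read than written, whether or not the current
 * iteration read anything.  In that case we need to wait for EPOLLOUT.
 */
bool need_out_wait_sketch(const struct splice_dir_sketch *d)
{
	assert(d && d->written <= d->read);
	return d->read != d->written;
}
```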

Link: https://github.com/containers/podman/issues/22575
Link: https://github.com/containers/podman/issues/22593
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-10 16:52:29 +02:00
David Gibson
1ba76c9e8c udp: Single buffer for IPv4, IPv6 headers and metadata
Currently we have separate arrays for IPv4 and IPv6 which contain the
headers for guest-bound packets, and also the originating socket address.
We can combine these into a single array of "metadata" structures with
space for both pre-cooked IPv4 and IPv6 headers, as well as shared space
for the tap specific header and socket address (using sockaddr_inany).

Because we're using IOVs to separately address the pieces of each frame,
these structures don't need to be packed to keep the headers contiguous
so we can more naturally arrange for the alignment we want.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:52 +02:00
David Gibson
d4598e1d18 udp: Use the same buffer for the L2 header for all frames
Currently each tap-bound frame buffer has room for its own ethernet header.
However the ethernet header is always the same for such frames, so now
that we're indirectly referencing the ethernet header via iov, we can use
a single buffer for all of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:49 +02:00
David Gibson
6170688616 udp: Share payload buffers between IPv4 and IPv6
Currently the IPv4 and IPv6 paths unnecessarily use different buffers for
the UDP payload.  Now that we're handling the various pieces of the UDP
packets with an iov, we can split the payload part of the buffers off into
its own array shared between IPv4 and IPv6.  As well as saving a little
memory, this allows the payload buffers to be neatly page aligned.

With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing
as each other and can also be merged.  Likewise udp[46]_iov_splice can be
merged together.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:46 +02:00
David Gibson
2d16946bac udp: Explicitly set checksum in guest-bound UDP headers
For IPv4, UDP checksums are optional and can just be set to 0.
udp_update_hdr4() ignores the checksum field entirely.  Since these are set
to 0 during startup, this works as intended for now.

However, we'd like to share payload and UDP header buffers between IPv4 and
IPv6, which does calculate UDP checksums.  Therefore, for robustness, we
should explicitly set the checksum field to 0 for guest-bound UDP packets.

In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums
to be calculated as a compile time option.  For consistency, use the same
thing in the udp_update_hdr4() path, which will typically initialize to 0,
but calculate a real checksum if configured to do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:44 +02:00
David Gibson
6c4d26a364 udp: Combine initialisation of IPv4 and IPv6 iovs
We're going to introduce more sharing between the IPv4 and IPv6 buffer
structures.  Prepare for this by combining the initialisation functions.
While we're at it remove the misleading "sock" from the name since these
initialise both tap side and sock side structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:41 +02:00
David Gibson
3f9bd867b5 udp: Split tap-bound UDP packets into multiple buffers using io vector
When sending to the tap device, currently we assemble the headers and
payload into a single contiguous buffer.  Those are described by a single
struct iovec, then a batch of frames is sent to the device with
tap_send_frames().

In order to better integrate the IPv4 and IPv6 paths, we want the IP
header in a different buffer that might not be contiguous with the
payload.  To prepare for that, split the UDP packet into an iovec of
buffers.  We use the same split that Laurent recently introduced for
TCP for convenience.

This removes the last use of tap_hdr_len_(), tap_frame_base() and
tap_frame_len(), so remove those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:38 +02:00
David Gibson
fcd9308856 test: Allow sftp via vsock-ssh in tests
During some debugging recently, I wanted to extract a file from a test
guest and found it was tricky, since the ssh-over-vsock setup we had didn't
allow sftp/scp.  We can fix this by adding a line to the guest side sshd
config from mbuto.  While we're there correct an inaccurate comment.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:36 +02:00
David Gibson
eea5d3ef2d tcp: Update tap specific header too in tcp_fill_headers[46]()
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
  * Removes a little redundancy between the tcp_send_flag() and
    tcp_data_to_tap() path
  * Makes calculation of the correct length a little easier
  * Removes the now misleadingly named 'vnet_len' variable in
    tcp_send_flag()

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:34 +02:00
David Gibson
3559899586 iov: Helper macro to construct iovs covering existing variables or fields
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code.  In many of those cases, and a few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields.  We can make initializing such iovs shorter and
clearer with a macro for the purpose.
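
A macro of this kind might look like the sketch below.  The macro name, the
example structure and its fields are assumptions for illustration and may
not match what the patch adds:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical macro: build a struct iovec covering exactly one existing
 * variable or field.  The argument must be an lvalue.
 */
#define IOV_OF_SKETCH(lval) \
	((struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) })

/* Made-up header layout, just to have two fields to point at */
struct hdr_sketch {
	uint32_t vnet_len;
	uint8_t eth[14];
};

/* Fill iov[0..1] to cover the two fields of *h; returns the entry count */
int hdr_iov_sketch(struct hdr_sketch *h, struct iovec iov[2])
{
	assert(h && iov);

	iov[0] = IOV_OF_SKETCH(h->vnet_len);
	iov[1] = IOV_OF_SKETCH(h->eth);

	return 2;
}
```

Compared to spelling out .iov_base and .iov_len by hand, the base and the
length can't get out of sync, because both come from the same expression.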

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:31 +02:00
David Gibson
40f8b2976a tap, tcp: (Re-)abstract TAP specific header handling
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.

The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:29 +02:00
David Gibson
68d1b0a152 tcp: Simplify packet length calculation when preparing headers
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length.  We can just use the L4 length throughout.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:26 +02:00
David Gibson
5566386f5f treewide: Standardise variable names for various packet lengths
At various points we need to track the lengths of a packet including or
excluding various different sets of headers.  We don't always use the same
variable names for doing so.  Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.

To improve clarity, standardise on these names:
   dlen:		L4 protocol payload length ("data length")
   l4len:		dlen + length of L4 protocol header
   l3len:		l4len + length of IPv4/IPv6 header
   l2len:		l3len + length of L2 (ethernet) header
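
As a worked example of how these names nest, the sketch below assumes UDP
over IPv4 without options over untagged Ethernet.  The structure, function
and constant names are hypothetical, chosen only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Header sizes assumed for this example */
#define UDP_HDR_LEN_SKETCH	8	/* UDP header */
#define IP4_HDR_LEN_SKETCH	20	/* IPv4 header, no options */
#define ETH_HDR_LEN_SKETCH	14	/* Ethernet header, no VLAN tag */

struct pkt_lens_sketch {
	size_t dlen;	/* L4 payload only */
	size_t l4len;	/* dlen + L4 header */
	size_t l3len;	/* l4len + IP header */
	size_t l2len;	/* l3len + Ethernet header */
};

/* Derive every length from the payload length, one layer at a time */
struct pkt_lens_sketch udp4_lens_sketch(size_t dlen)
{
	struct pkt_lens_sketch l;

	l.dlen = dlen;
	l.l4len = l.dlen + UDP_HDR_LEN_SKETCH;
	l.l3len = l.l4len + IP4_HDR_LEN_SKETCH;
	l.l2len = l.l3len + ETH_HDR_LEN_SKETCH;

	return l;
}
```

For a 100 byte payload this gives l4len 108, l3len 128 and l2len 142.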

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:23 +02:00
David Gibson
9e22c53aa9 checksum: Make csum_ip4_header() take a host endian length
csum_ip4_header() takes the packet length as a network endian value.  In
general it's very error-prone to pass non-native-endian values as a raw
integer.  It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.

It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.
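
To show the idea, here is a hedged sketch of a header fill routine that
takes the total length in host order and only converts it at the point
where bytes are stored.  The function names, the fixed TTL and the DF flag
are assumptions for the example; this is not the csum_ip4_header()
implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum over a byte buffer, independent of host endianness */
uint16_t csum_bytes(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Fill a 20-byte IPv4 header; l3len is a plain host-endian integer */
uint16_t ip4_hdr_fill(uint8_t hdr[20], uint16_t l3len, uint8_t proto,
		      const uint8_t saddr[4], const uint8_t daddr[4])
{
	uint16_t csum;

	memset(hdr, 0, 20);
	hdr[0] = 0x45;			/* version 4, header length 5 words */
	hdr[2] = l3len >> 8;		/* conversion to network order... */
	hdr[3] = l3len & 0xff;		/* ...happens only here */
	hdr[6] = 0x40;			/* don't fragment */
	hdr[8] = 64;			/* TTL */
	hdr[9] = proto;
	memcpy(hdr + 12, saddr, 4);
	memcpy(hdr + 16, daddr, 4);

	csum = csum_bytes(hdr, 20);	/* checksum field is still zero */
	hdr[10] = csum >> 8;
	hdr[11] = csum & 0xff;

	return csum;
}
```

Callers pass the length they already have in native form, so there is no
raw integer whose byte order must be remembered.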

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:21 +02:00
David Gibson
1095a7b0c9 treewide: Remove misleading and redundant endianness notes
In general, it's much less error-prone to have the endianness of values
implied by the type, rather than just noting it in comments.  We can't
always easily avoid it, because C, but we can do so when possible.  struct
in_addr and in6_addr are always encoded network endian, so noting it
explicitly isn't useful.  Remove them.

In some cases we also have endianness notes on uint8_t parameters, which
doesn't make sense: for a single byte endianness is irrelevant.  Remove
those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:16 +02:00
David Gibson
5d37dab012 tap: Remove unused structs tap_msg, tap_l4_msg
Use of these structures was removed in bb70811183 ("treewide: Packet
abstraction with mandatory boundary checks").  Remove the stale
declarations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:13 +02:00
David Gibson
34fb381b5a tap: Split tap specific and L2 (ethernet) headers
In some places (well, actually only UDP now) we use struct tap_hdr to
represent both tap backend specific and L2 ethernet headers.  Handling
these together seemed like a good idea at the time, but Laurent's changes
in the TCP code working towards vhost-user support suggest that treating
them separately is more useful, more often.

Alter struct tap_hdr to represent only the TAP backend specific headers.
Updated related helpers and the UDP code to match.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:08 +02:00
David Gibson
c27ca91564 checksum: Use proto_ipv6_header_psum() for ICMPv6 as well
7df624e79 ("checksum: introduce functions to compute the header part
checksum for TCP/UDP") introduced a helper to compute the partial checksum
for the IPv6 pseudo-header used in L4 protocol checksums.  It used it in
csum_udp6() for UDP packets, but not for the identical calculation in
csum_icmp6() for ICMPv6 packets.  Use it there as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:03 +02:00
Stefano Brivio
76e32022c4 netlink: Fix iterations over nexthop objects
Somewhat confusingly, RTNH_NEXT(), as defined by <linux/rtnetlink.h>,
doesn't take an attribute length parameter like RTA_NEXT() does, and
I just modelled loops over nexthops after RTA loops, forgetting to
decrease the remaining length we pass to RTNH_OK().

In practice, this didn't cause issue in any of the combinations I
checked, at least without the next patch.

We seem to be the only user of RTNH_OK(): even iproute2 has an
open-coded version of it in print_rta_multipath() (ip/iproute.c).

Introduce RTNH_NEXT_AND_DEC(), similar to RTA_NEXT(), and use it.
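
A minimal sketch of what such a macro and a loop using it could look like
follows.  The macro body and the count_nexthops() helper are assumptions
for illustration, not copied from the tree; RTNH_OK(), RTNH_NEXT() and
RTNH_ALIGN() come from <linux/rtnetlink.h>.

```c
#include <stddef.h>
#include <linux/rtnetlink.h>

/* Step to the next nexthop and shrink the remaining length, like RTA_NEXT() */
#define RTNH_NEXT_AND_DEC(rtnh, attrlen)				\
	((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))

/* Count nexthop objects in 'len' bytes of RTA_MULTIPATH payload */
int count_nexthops(const struct rtnexthop *rtnh, int len)
{
	int n = 0;

	/* The explicit size check avoids reading a header past the buffer */
	for (; len >= (int)sizeof(*rtnh) && RTNH_OK(rtnh, len);
	     rtnh = RTNH_NEXT_AND_DEC(rtnh, len))
		n++;

	return n;
}
```

Without the decrement, RTNH_OK() would keep comparing each object against
the full original length and the loop could walk past the attribute.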

Fixes: 6c7623d07b ("netlink: Add support to fetch default gateway from multipath routes")
Fixes: f4e38b5cd2 ("netlink: Adjust interface index inside copied nexthop objects too")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-02 16:12:45 +02:00
Stefano Brivio
d03c4e2020 netlink: Use IFA_F_NODAD also while duplicating addresses from the host
...not just for a single set address (legacy operation with
--no-copy-addrs). I forgot to add this to nl_addr_dup().

Note that we can have two versions of flags: the 8-bit ifa_flags in
ifaddrmsg, and the newer 32-bit version as IFA_FLAGS attribute, which
is given priority if present. Make sure IFA_F_NODAD is set in both.
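
Setting the flag in both places might look roughly like the sketch below.
The function name is hypothetical and this is not the nl_addr_dup() code;
the RTA_*() macros, IFA_FLAGS and IFA_F_NODAD are the usual definitions
from the Linux UAPI headers.

```c
#include <stdint.h>
#include <string.h>
#include <linux/if_addr.h>
#include <linux/rtnetlink.h>

/* Set IFA_F_NODAD in the 8-bit header flags and in any IFA_FLAGS attribute */
void addr_set_nodad(struct ifaddrmsg *ifa, struct rtattr *rta, int len)
{
	ifa->ifa_flags |= IFA_F_NODAD;

	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		uint32_t flags;

		if (rta->rta_type != IFA_FLAGS)
			continue;

		/* memcpy() avoids assumptions about payload alignment */
		memcpy(&flags, RTA_DATA(rta), sizeof(flags));
		flags |= IFA_F_NODAD;
		memcpy(RTA_DATA(rta), &flags, sizeof(flags));
	}
}
```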

Without this, a Podman user reports, something along the lines of:
  pasta --config-net -- ping -c1 -6 passt.top

would fail as the kernel would start Duplicate Address Detection
once we configure the address, which can't really work (and doesn't
make sense) in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-26 07:46:54 +02:00
Stefano Brivio
bfc83b54c4 netlink: For IPv4, IFA_LOCAL is the interface address, not IFA_ADDRESS
See the comment to the unnamed enum in linux/if_addr.h, which
currently states:

  /*
   * Important comment:
   * IFA_ADDRESS is prefix address, rather than local interface address.
   * It makes no difference for normally configured broadcast interfaces,
   * but for point-to-point IFA_ADDRESS is DESTINATION address,
   * local address is supplied in IFA_LOCAL attribute.
   *
   * [...]
   */

If we fetch IFA_ADDRESS, and we have a point-to-point link with a peer
address configured, we'll source the peer address as "our" address,
and refuse to resolve it in arp().

This was reported with pasta and a tun upstream interface configured
by OpenVPN in "p2p" topology: the target namespace will have similar
addresses and routes as the host, which is fine, and will try to
resolve the point-to-point peer address (because it's the default
gateway).

Given that we configure it as our address (only internally, not
visibly in the namespace), we'll fail to resolve that and traffic
doesn't go anywhere.

Note that this is not the case for IPv6: there, IFA_ADDRESS is the
actual, local address of the interface, and IFA_LOCAL is not
necessarily present, so the comment in linux/if_addr.h doesn't apply
either.

Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-26 07:46:42 +02:00
David Gibson
ff2ff2fbca test: Make log truncation test more robust
test/pasta_options/log_to_file checks that pasta truncates its log file
when started.  It does that by starting pasta with a log file once, then
starting it again and checking that after the second round, the log file
has only one line: the startup banner from the second invocation.

However, this test will break if the second invocation logs any additional
messages at startup.  This can easily happen on a host with multiple
network interfaces due to the "Multiple default route" informational
messages added in 639fdf06e ("netlink: Fix selection of template
interface").  I believe it could also happen on a host without IPv6
connectivity due to the "Couldn't pick external interface" messages, though
I haven't confirmed this.

Make the log file test more robust, by not testing for a single line, but
instead explicitly testing for the PID of the second pasta invocation in
the banner line.

Link: https://bugs.passt.top/show_bug.cgi?id=88
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
David Gibson
2681366966 test: Slight simplification to pasta log tests
test/pasta_options/log_to_file contains a couple of rudimentary tests
where we start pasta with an interactive shell, then immediately exit it.
We can achieve the same thing by using /bin/true as the command to pasta.
This also means that waiting for pasta to start, waiting for the executed
command to complete and for pasta to clean up are all handled by simply
waiting for pasta to complete in the foreground, so there's no need for an
additional sleep.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
David Gibson
0804fdbc28 udp: Correctly look up outbound socket with port remappings
Commit bb9bf0bb ("tcp, udp: Don't precompute port remappings in epoll
references") changed the epoll reference for UDP sockets to include the
bound port as seen by the socket itself, rather than the bound port it
would be translated to on the guest side.  As a side effect, it also means
that udp_tap_map[] is indexed by the bound port on the host side, rather
than on the guest side.  This is consistent and a good idea, however we
forgot to account for it when finding the correct outgoing socket for
packets originating in the guest.  This means that if forwarding UDP
inbound with a port number change, reply packets would be misdirected.

Fix this by applying the reverse mapping before looking up the socket in
udp_tap_handler().  While we're at it, use 'port' directly instead of
'uref.port' in udp_sock_init().  Those now always have the same value -
failing to realise that is the same error as above.
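
The lookup problem can be sketched with a toy model, shown below.  All
names here are hypothetical and the tables are much simpler than
udp_tap_map[]: the socket table is indexed by the host-side bound port, so
a packet from the guest must have its source port mapped back to the host
side before the lookup.

```c
#include <stdint.h>

#define NUM_PORTS	(1 << 16)

static uint16_t rev_delta[NUM_PORTS];		/* guest port -> host port */
static int sock_by_host_port[NUM_PORTS];	/* fd + 1, 0 means unset */

/* Forward host_port to guest_port, using socket fd bound to host_port */
void fwd_add(uint16_t host_port, uint16_t guest_port, int fd)
{
	rev_delta[guest_port] = (uint16_t)(host_port - guest_port);
	sock_by_host_port[host_port] = fd + 1;
}

/* Find the socket for a packet the guest sent from guest_src, or -1 */
int sock_for_guest_src(uint16_t guest_src)
{
	/* Apply the reverse mapping first: the table is host-indexed */
	uint16_t host_port = (uint16_t)(guest_src + rev_delta[guest_src]);

	return sock_by_host_port[host_port] - 1;
}
```

Indexing the table directly with the guest-side port is the mistake
described above: with a mapping from host port 8080 to guest port 80,
entry 80 is simply empty.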

Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=87
Fixes: bb9bf0bb8f ("tcp, udp: Don't precompute port remappings in epoll references")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
Laurent Vivier
95601237ef tcp: Replace TCP buffer structure by an iovec array
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.

Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.
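
A hedged sketch of that layout is shown below.  The enum and function
names are invented for the example and do not claim to match the
identifiers used in the TCP code: one iovec entry per frame part, with the
length header first so it can be skipped by starting one entry later.

```c
#include <stddef.h>
#include <sys/uio.h>

/* One entry per part of the frame, length header first */
enum {
	FRAME_IOV_VLEN,
	FRAME_IOV_ETH,
	FRAME_IOV_IP,
	FRAME_IOV_PAYLOAD,
	FRAME_IOV_NUM
};

/* Return the entries to send; drop the first one if it isn't needed */
const struct iovec *frame_iov(const struct iovec *iov, int need_vlen,
			      int *cnt)
{
	int first = need_vlen ? FRAME_IOV_VLEN : FRAME_IOV_ETH;

	*cnt = FRAME_IOV_NUM - first;
	return iov + first;
}

/* Total number of bytes described by cnt entries */
size_t iov_total(const struct iovec *iov, int cnt)
{
	size_t len = 0;
	int i;

	for (i = 0; i < cnt; i++)
		len += iov[i].iov_len;

	return len;
}
```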

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:09 +02:00
Stefano Brivio
27f1c762b1 conf: Don't fail if the template interface doesn't have a MAC address
...simply resort to using a locally-administered address (LAA) as
host-side source, instead.

Pick 02:00:00:00:00:00, to make it clear that we don't actually care
about that address, and also to match the 00 (Administratively
Assigned Identifier) quadrant of SLAP (RFC 8948).

Otherwise, pasta refuses to start if the template is a tun or
Wireguard interface.

Link: https://bugs.passt.top/show_bug.cgi?id=49
Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:00 +02:00
Stefano Brivio
eca8baa028 conf: We're interested in the MAC address, not in the MAC itself
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-19 11:15:36 +02:00
Stefano Brivio
ee338a256e pasta, util: Align stack area for clones to maximum natural alignment
Given that we use this stack pointer as a location to store arbitrary
data types from the cloned process, we need to guarantee that its
alignment matches any of those possible data types.

runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on
aarch64, where the alignment requirement for stack pointers is
16 bytes (same as the size of a long double), and similar requirements
actually apply to most architectures we run on.
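
One way to get that guarantee in C11 is sketched below.  The names and the
stack size are assumptions for the example, not the definitions used in
the tree: the area is declared with the alignment of max_align_t, and the
pointer handed out is at an offset that preserves it.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define NS_STACK_SIZE	(64 * 1024)

/* Aligned for any standard type, long double included */
static alignas(max_align_t) char ns_stack[NS_STACK_SIZE];

/* Middle of the area: usable whichever way the stack grows */
void *ns_stack_start(void)
{
	return ns_stack + sizeof(ns_stack) / 2;
}
```

The half-size offset is a multiple of the alignment here, so the returned
pointer stays as aligned as the start of the array.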

Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=85
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-19 11:15:27 +02:00
Stefano Brivio
5d5208b67d treewide: Compilers' name for armv6l and armv7l is "arm"
When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the
architecture name for, among others, seccomp.sh, I didn't realise
that "armv6l" and "armv7l" are just Linux kernel names -- compilers
just call that "arm".

Fix the "syscalls" annotation we use to define seccomp profiles
accordingly, otherwise pasta will be terminated on sigreturn() on
armv6l and armv7l.

Fixes: 213c397492 ("passt, pasta: Run-time selection of AVX2 build")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-11 17:34:04 +02:00
David Gibson
954589b64b test: Verify that podman tests are using the pasta binary we expect
Paul Holzinger pointed out that when we invoke the podman tests inside the
passt testsuite, the way we point podman at the newly built pasta binary
is kind of indirect.  It's therefore prudent to check that podman is
actually using the binary we expect it to - in particular that it is using
the binary built in this tree, not some system installed pasta binary.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:24 +02:00
David Gibson
489b28e216 test: catatonit may not be in $PATH
The pasta_podman/bats test script looks for 'catatonit' amongst other tools
to be available on the host.  However, while the podman tests do require
catatonit, it doesn't necessarily need to be in the regular path.  For
example Fedora and RHEL place catatonit in /usr/libexec and podman finds it
there fine.

Therefore, remove it as an htools dependency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:21 +02:00
David Gibson
f9fe3ae5dd test: Build and download podman as a test asset
The pasta_podman/bats test script downloads and builds podman, then runs its
pasta specific tests.  Downloading from within a test case has some
drawbacks:
 * It can be very tedious if you have poor connectivity to the server
 * It makes a test that's ostensibly for pasta itself dependent on the
   state of the github server
 * It precludes running the tests in an isolated network environment

The same concerns largely apply to building podman too, because it's pretty
common for Go builds to download dependencies themselves.  Therefore move
the download and build of podman from the test itself, to the Makefile
where we prepare other test assets.

To avoid cryptic failures if something went wrong with the build, make
running the test dependent on having the built podman binary.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:16 +02:00
David Gibson
e8b78217bb test: Make sure to update mbuto repository
We download and use mbuto to build trivial boot images for our VM tests.
However, if mbuto is already cloned, we won't update it to the current
version.  Add some make logic to ensure that we do this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:13 +02:00
David Gibson
ef2cb13b49 cppcheck: Explicitly give files to check
Currently "make cppcheck" invokes cppcheck on ".", so it will check all the
.c and .h files it can find in the source tree.  This isn't ideal, because
it can find files that aren't actually part of the real build, or even
stale files which aren't in git.

More practically, some upcoming changes are looking at downloading other
source trees for some tests.  Static errors in there are Not Our Problem,
so checking them is both slow and pointless.

So, change the Makefile to invoke cppcheck only on the specific source
files that are part of the build.  For some reason in this format the
badBitmaskCheck warnings in seccomp.h which were suppressed by 5beb3472e
("cppcheck: Avoid errors due to zeroes in bitwise ORs") no longer trigger.
That means we get unmatchedSuppression warnings instead.  We add an
unmatchedSuppression suppression instead of simply removing the original
suppressions, just in case this odd behaviour isn't the same for all
cppcheck versions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:11 +02:00
David Gibson
97e8b33f87 netlink: Ignore routes to link-local addresses for selecting interface
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), and since 639fdf06ed ("netlink: Fix selection of template
interface") less buggily, we haven't required a default route on the host
in order to operate.  Instead, if we lack a default route we'll pick an
interface with any route, as long as there's only one such interface.  If
there's more than one, we don't have a good criterion to pick, so we give
up with an informational message.

Paul Holzinger pointed out that this code considers it ambiguous even if
all but one of the interfaces has only routes to link-local addresses
(fe80::/10).  A route to link-local addresses isn't really useful from
pasta's point of view, so ignore them instead.  This removes a misleading
message in many cases, and a spurious failure in some cases.
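
The test itself is small.  A sketch, assuming a hypothetical helper name
and using the standard IN6_IS_ADDR_LINKLOCAL() macro rather than whatever
check the netlink code actually performs:

```c
#include <netinet/in.h>

/* Should a route to dst count when looking for a template interface? */
int route_dst_counts(const struct in6_addr *dst)
{
	/* fe80::/10 routes exist on nearly every interface: ignore them */
	return !IN6_IS_ADDR_LINKLOCAL(dst);
}
```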

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:08 +02:00
David Gibson
67a6258918 util: Add helper to return name of address family
We have a few places where we want to include the name of the internet
protocol version (IPv4 or IPv6) in a message, which we handle with an
open-coded ?: expression.

This seems like something that might be more widely useful, so make a
trivial helper to return the correct string based on the address family.
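
Such a helper could be as simple as the sketch below.  The function name
and the fallback string are assumptions for the example.

```c
#include <sys/socket.h>

/* Human-readable name for an address family, for use in messages */
const char *af_name(sa_family_t af)
{
	switch (af) {
	case AF_INET:
		return "IPv4";
	case AF_INET6:
		return "IPv6";
	default:
		return "<unknown address family>";
	}
}
```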

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:05 +02:00
Stefano Brivio
f4e38b5cd2 netlink: Adjust interface index inside copied nexthop objects too
As pasta duplicates host routes into the target namespaces, interface
indices might not match, so we go through RTA_OIF attributes and fix
them up to match the identifier in the namespace.

But RTA_OIF is not the only attribute specifying interfaces for routes:
multipath routes use RTA_MULTIPATH attributes with nexthop objects,
which contain in turn interface indices. Fix them up as well.

If we don't, and we have at least two host interfaces, and the host
interface we use as template isn't the first one (hence the
mismatching indices), we'll fail to insert multipath routes with
nexthop objects, and ultimately refuse to start as the kernel
unexpectedly gives us ENODEV.

Link: https://github.com/containers/podman/issues/22192
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-05 16:58:52 +02:00
Danish Prakash
88c2f08eba apparmor: Fix access to procfs namespace entries in pasta's abstraction
From an original patch by Danish Prakash.

With commit ff22a78d7b ("pasta: Don't try to watch namespaces in
procfs with inotify, use timer instead"), if a filesystem-bound
target namespace is passed on the command line, we'll grab a handle
on its parent directory. That commit, however, didn't introduce a
matching AppArmor rule. Add it here.

To access a network namespace procfs entry, we also need a 'ptrace'
rule. See commit 594dce66d3 ("isolation: keep CAP_SYS_PTRACE when
required") for details as to when we need this -- essentially, it's
about operation with Buildah.

Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Fixes: ff22a78d7b ("pasta: Don't try to watch namespaces in procfs with inotify, use timer instead")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
100919ce74 apparmor: Expand scope of @{run}/user access, allow writing PID files too
With Podman's custom networks, pasta will typically need to open the
target network namespace at /run/user/<UID>/containers/networks:
grant access to anything under /run/user/<UID> instead of limiting it
to some subpath.

Note that in this case, Podman will need pasta to write out a PID
file, so we need write access, for similar locations, too.

Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
dc7b7f28b7 apparmor: Add mount rule with explicit, empty source in passt abstraction
For the policy to work as expected across either AppArmor commit
9d3f8c6cc05d ("parser: fix parsing of source as mount point for
propagation type flags") and commit 300889c3a4b7 ("parser: fix option
flag processing for single conditional rules"), we need one mount
rule with matching mount options as "source" (that is, without
source), and one without mount options and an explicit, empty source.

Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
bbea2752f6 README.md: Alpine, Guix and OpenSUSE now have packages for passt
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-05 12:12:23 +02:00
David Gibson
4988e2b406 tcp: Unconditionally force ACK for all !SYN, !RST packets
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window.  This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct.  RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.

The reason for the existing logic, rather than always forcing an ACK seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit.  However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.

[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1
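
The resulting rule can be written down in a few lines.  This is only a
sketch of the flag logic with made-up names, not the tcp_send_flag() code;
the bit values are the standard TCP header flag bits.

```c
#include <stdint.h>

enum {
	FLAG_FIN = 0x01,
	FLAG_SYN = 0x02,
	FLAG_RST = 0x04,
	FLAG_ACK = 0x10
};

/* Every packet that is neither SYN nor RST carries ACK */
uint8_t flags_force_ack(uint8_t flags)
{
	if (!(flags & (FLAG_SYN | FLAG_RST)))
		flags |= FLAG_ACK;

	return flags;
}
```

SYN and RST packets pass through unchanged, so they carry ACK only when
the caller asked for it.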

Link: https://github.com/containers/podman/issues/22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:52:04 +01:00
David Gibson
5894a245b9 tcp: Never automatically add the ACK flag to RST packets
tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets.
This doesn't make sense for RST packets, where plain RST and RST+ACK have
somewhat different meanings.  AIUI, RST+ACK indicates an abrupt end to
a connection, but acknowledges data already sent.  Plain RST indicates an
abort, when one end receives a packet that doesn't seem to make sense in
the context of what it knows about the connection.  All of the cases where
we send RSTs are the second, so we don't want an ACK flag, but we currently
could add one anyway.

Change that, so we won't add an ACK to an RST unless the caller explicitly
requests it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:58 +01:00
David Gibson
16c2d8da0d tcp: Rearrange logic for setting ACK flag in tcp_send_flag()
We have different paths for controlling the ACK flag for the SYN and !SYN
paths.  This amounts to sometimes forcing on the ACK flag in the !SYN path
regardless of options.  We can rearrange things to explicitly be that which
will make things neater for some future changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:55 +01:00
David Gibson
99355e25b9 tcp: Split handling of DUP_ACK from ACK
The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the
setting of the ACK flag in the packet, even if we otherwise wouldn't.
Secondly, it causes a duplicate of the flags packet to be sent immediately
after the first.

Setting the ACK flag to tcp_send_flag() also has the first effect, so
instead of having DUP_ACK also do that, pass both flags when we need both
operations.  This slightly simplifies the logic of tcp_send_flag() in a way
that makes some future changes easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:41 +01:00
Laurent Vivier
71dd405460 util: fix confusion between offset in the iovec array and in the entry
In write_remainder() 'skip' is the offset to start the operation from
in the iovec array.

In iov_skip_bytes(), 'skip' is also the offset in the iovec array but
'offset' is the first unskipped byte in the iovec entry.

As write_remainder() uses 'skip' for both, 'skip' is reset to the
first unskipped byte in the iovec entry rather than staying the first
unskipped byte in the iovec array.

Fix the problem by introducing a new variable not to overwrite 'skip'
on each loop.
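
The distinction is easier to see in code.  The sketch below is an
assumption-laden stand-in, not the write_remainder() implementation:
iov_skip_bytes() is reimplemented from its description, and drain_iov()
copies into a buffer in limited chunks to imitate short writes.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Index of the entry holding byte 'skip'; *offset is the position in it */
size_t iov_skip_bytes(const struct iovec *iov, size_t n, size_t skip,
		      size_t *offset)
{
	size_t off = skip, i;

	for (i = 0; i < n; i++) {
		if (off < iov[i].iov_len)
			break;
		off -= iov[i].iov_len;
	}
	*offset = off;

	return i;
}

/* Copy everything after the first 'skip' bytes, at most 'chunk' at a time */
size_t drain_iov(const struct iovec *iov, size_t n, size_t skip,
		 size_t chunk, char *out)
{
	size_t total = 0;

	for (;;) {
		size_t offset;	/* position inside entry i, kept separate */
		size_t i = iov_skip_bytes(iov, n, skip, &offset);
		size_t len;

		if (i == n)
			break;

		len = iov[i].iov_len - offset;
		if (len > chunk)
			len = chunk;	/* pretend the write came up short */

		memcpy(out + total, (char *)iov[i].iov_base + offset, len);
		total += len;
		skip += len;	/* still an offset into the whole array */
	}

	return total;
}
```

Assigning 'offset' back to 'skip' inside the loop would restart the search
from the wrong place as soon as the second entry is reached.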

Fixes: 8bdb0883b4 ("util: Add write_remainder() helper")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-20 10:06:32 +01:00
David Gibson
639fdf06ed netlink: Fix selection of template interface
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), if there is only one host interface with routes, we will pick that
as the template interface, even if there are no default routes for an IP
version.  Unfortunately this selection had a serious flaw: in some cases
it would 'return' in the middle of an nl_foreach() loop, meaning we
wouldn't consume all the netlink responses for our query.  This could cause
later netlink operations to fail as we read leftover responses from the
aborted query.

Rewrite the interface detection to avoid this problem.  While we're there:
  * Perform detection of both default and non-default routes in a single
    pass, avoiding an ugly goto
  * Give more detail on error and working but unusual paths about the
    situation (no suitable interface, multiple possible candidates, etc.).

Fixes: f919dc7a4b ("conf, netlink: Don't require a default route to start")
Link: https://bugs.passt.top/show_bug.cgi?id=83
Link: https://github.com/containers/podman/issues/22052
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2270257
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Use info(), not warn() for somewhat expected cases where one
 IP version has no default routes, or no routes at all]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-20 09:34:08 +01:00
David Gibson
d35bcbee90 netlink: Fix handling of NLMSG_DONE in nl_route_dup()
A recent kernel change 87d381973e49 ("genetlink: fit NLMSG_DONE into
same read() as families") changed netlink behaviour so that the
NLMSG_DONE terminating a bunch of responses can go in the same
datagram as those responses, rather than in a separate one.

Our netlink code is supposed to handle that behaviour, and indeed does
so for most cases, using the nl_foreach() macro.  However, there was a
subtle error in nl_route_dup() which doesn't work with this change.
f00b1534 ("netlink: Don't try to get further datagrams in
nl_route_dup() on NLMSG_DONE") attempted to fix this, but has its own
subtle error.

The problem arises because nl_route_dup(), unlike other cases, doesn't
just make a single pass through all the responses to a netlink
request.  It needs to get all the routes, then make multiple passes
through them.  We don't really have anywhere to buffer multiple
datagrams, so we only support the case where all the routes fit in a
single datagram - but we need to fail gracefully when that's not the
case.

After receiving the first datagram of responses (with nl_next()) we
have a first loop scanning them.  It needs to exit when either we run
out of messages in the datagram (!NLMSG_OK()) or when we get a message
indicating the last response (nl_status() <= 0).

What we do after the loop depends on which exit case we had.  If we
saw the last response, we're done, but otherwise we need to receive
more datagrams to discard the rest of the responses.

We attempt to check for that second case by re-checking NLMSG_OK(nh,
status).  However in the got-last-response case, we've altered status
from the number of remaining bytes to the error code (usually 0). That
means NLMSG_OK() now returns false even if it didn't during the loop
check.  To fix this we need separate variables for the number of bytes
left and the final status code.

We also checked status after the loop, but this was redundant: we can
only exit the loop with NLMSG_OK() == true if status <= 0.
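
A minimal sketch of keeping the two values apart follows.  The function is
hypothetical and far simpler than nl_route_dup(): 'len' only ever counts
the bytes left in the datagram, while 'done' records separately whether a
terminating message was seen.

```c
#include <stddef.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Count responses in one datagram; *done is set if it held the last one */
int scan_datagram(const struct nlmsghdr *nh, int len, int *done)
{
	int n = 0;

	*done = 0;
	for (; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_type == NLMSG_DONE ||
		    nh->nlmsg_type == NLMSG_ERROR) {
			*done = 1;	/* status lives in its own variable */
			break;
		}
		n++;
	}

	return n;
}
```

A caller can then decide from 'done' alone whether more datagrams need to
be received and discarded, without reusing the byte count as a status.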

Reported-by: Martin Pitt <mpitt@redhat.com>
Fixes: f00b153414 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE")
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Link: https://github.com/containers/podman/issues/22052
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-19 10:23:34 +01:00
Dan Čermák
615d370ca2 fedora: Switch license identifier to SPDX
The spec file patch by Dan Čermák was originally contributed at:
  https://src.fedoraproject.org/rpms/passt/pull-request/1

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:47 +01:00
Stefano Brivio
d989eae308 udp: Translate source address of resolver only for DNS remapped queries
Paul reports that if pasta is configured with --dns-forward, and the
container queries a resolver which is configured on the host directly,
without using the address given for --dns-forward, we'll translate
the source address of the response pretending it's coming from the
address passed as --dns-forward, and the client will discard the
reply.

That is,

  $ cat /etc/resolv.conf
  198.51.100.1
  $ pasta --config-net --dns-forward 192.0.2.1 nslookup passt.top

will not work, because we change the source address of the reply from
198.51.100.1 to 192.0.2.1. But the client contacted 198.51.100.1, and
it's from that address that it expects an answer.

Add a PORT_DNS_FWD flag for tap-facing ports, which is triggered by
activity in the opposite direction as the other flags. If the
tap-facing port was seen sending a DNS query that was remapped, we'll
remap the source address of the response, otherwise we'll leave it
unaffected.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:40 +01:00
Stefano Brivio
f919dc7a4b conf, netlink: Don't require a default route to start
There might be isolated testing environments where default routes and
global connectivity are not needed, a single interface has all
non-loopback addresses and routes, and still passt and pasta are
expected to work.

In this case, it's pretty obvious what our upstream interface should
be, so go ahead and select the only interface with at least one
route, disabling DHCP and implying --no-map-gw as the documentation
already states.

If there are multiple interfaces with routes, though, refuse to start,
because at that point it's really not clear what we should do.

Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/21896
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:21 +01:00
Stefano Brivio
f00b153414 netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
Martin reports that, with Fedora Linux kernel version
kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
read() as families"), pasta doesn't exit once the network namespace
is gone.

Actually, pasta is completely non-functional, at least with default
options, because nl_route_dup(), which duplicates routes from the
parent namespace into the target namespace at start-up, is stuck on
a second receive operation for RTM_GETROUTE.

However, with that commit, the kernel is now able to fit the whole
response, including the NLMSG_DONE message, into a single datagram,
so no further messages will be received.

It turns out that commit 4d6e9d0816 ("netlink: Always process all
responses to a netlink request") accidentally relied on the fact that
we would always get at least two datagrams as a response to
RTM_GETROUTE.

That is, the test to check if we expect another datagram, is based
on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
but we'll also expect another datagram if NLMSG_OK on the last
message is false. But NLMSG_OK with a zero length is always false.

The problem is that we don't distinguish if status is zero because
we got a NLMSG_DONE message, or because we processed all the
available datagram bytes.

Introduce an explicit check on NLMSG_DONE. We should probably
refactor this slightly, for example by introducing a special return
code from nl_status(), but this is probably the least invasive fix
for the issue at hand.
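
A minimal sketch of the distinction described above, assuming a plain dump
request and a suitably aligned receive buffer. This is not the actual
nl_route_dup() code, and nl_datagram_done() is a hypothetical name: it walks
one datagram with the standard NLMSG_OK() and NLMSG_NEXT() macros and reports
NLMSG_DONE explicitly, instead of inferring it from having run out of bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/netlink.h>

/**
 * nl_datagram_done() - Check if a datagram carries NLMSG_DONE (sketch)
 * @buf:	Received datagram, suitably aligned for struct nlmsghdr
 * @n:		Number of bytes received
 *
 * Return: true if NLMSG_DONE was seen, so no further datagram is expected
 */
static bool nl_datagram_done(void *buf, size_t n)
{
	struct nlmsghdr *nh = buf;
	int len = (int)n;

	assert(buf || !n);

	for (; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_type == NLMSG_DONE)
			return true;
	}

	/* Ran out of bytes without NLMSG_DONE: the dump continues */
	return false;
}
```

With a check like this, a receive loop only calls recv() again while the
function returns false, which works whether the kernel sends NLMSG_DONE in a
datagram of its own or together with the last route.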

Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/22052
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:56:32 +01:00
David Gibson
d3eb0d7b59 tap: Rename tap_iov_{base,len}
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec.  They don't have to be used
for that, though.  Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:43 +01:00
David Gibson
4db947d17c tap: Implement tap_send() "slow path" in terms of fast path
Most times we send frames to the guest it goes via tap_send_frames().
However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go
via tap_send().

As well as being a semantic duplication, tap_send() contains at least one
serious problem: it doesn't properly handle short sends, which can be fatal
on the qemu socket connection, since frame boundaries will get out of sync.

Rewrite tap_send() to call tap_send_frames().  While we're there, rename it
tap_send_single() for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:37 +01:00
David Gibson
1ebe787fe4 tap: Simplify some casts in the tap "slow path" functions
We can both remove some variables which differ from others only in type,
and slightly improve type safety.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:33 +01:00
David Gibson
2d0e0084b6 tap: Extend tap_send_frames() to allow multi-buffer frames
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer.  We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.
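
To illustrate the layout this implies (a sketch under assumptions, not the
tap_send_frames() implementation): if every frame uses the same number of
buffers, frame number i occupies a fixed slice of the vector. The frame_len()
helper below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/**
 * frame_len() - Length of one frame spread over several buffers (sketch)
 * @iov:		Vector of buffers, @bufs_per_frame entries per frame
 * @bufs_per_frame:	Number of buffers making up each frame
 * @i:			Index of the frame, not of the buffer
 *
 * Return: total length of frame @i in bytes
 */
static size_t frame_len(const struct iovec *iov, size_t bufs_per_frame,
			size_t i)
{
	const struct iovec *first = iov + i * bufs_per_frame;
	size_t len = 0, j;

	assert(iov && bufs_per_frame);

	for (j = 0; j < bufs_per_frame; j++)
		len += first[j].iov_len;

	return len;
}
```

With bufs_per_frame set to 1 this degenerates to the previous behaviour of
exactly one frame per buffer.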

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:28 +01:00
Stefano Brivio
f67238aa86 passt, log: Call __openlog() earlier, log to stderr until we detach
Paul reports that, with commit 15001b39ef ("conf: set the log level
much earlier"), early messages aren't reported to standard error
anymore.

The reason is that, once the log mask is changed from LOG_EARLY, we
don't force logging to stderr, and this mechanism was abused to have
early errors on stderr. Now that we drop LOG_EARLY earlier on, this
doesn't work anymore.

Call __openlog() as soon as we know the mode we're running as, using
LOG_PERROR. Then, once we detach, if we're not running from an
interactive terminal and logging to standard error is not forced,
drop LOG_PERROR from the options.

While at it, check if the standard error descriptor refers to a
terminal, instead of checking standard output: if the user redirects
standard output to /dev/null, they might still want to see messages
from standard error.

Further, make sure we don't print messages to standard error reporting
that we couldn't log to the system logger, if we didn't open a
connection yet. That's expected.
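
The decision described above could be sketched as follows. This is an
illustration only: log_options() and its parameters are hypothetical, and the
real code passes the result to passt's own __openlog() rather than returning
it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* For LOG_PERROR with glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <syslog.h>
#include <unistd.h>

/**
 * log_options() - Choose logger options before and after detaching (sketch)
 * @detached:		We already detached from the parent
 * @stderr_fd:		Descriptor for standard error, usually STDERR_FILENO
 * @force_stderr:	User explicitly asked for logging to standard error
 *
 * Return: LOG_PERROR if messages should also go to standard error, 0 otherwise
 */
static int log_options(bool detached, int stderr_fd, bool force_stderr)
{
	assert(stderr_fd >= 0);

	/* Check standard error, not standard output: it's where messages go */
	if (!detached || force_stderr || isatty(stderr_fd))
		return LOG_PERROR;

	return 0;
}
```

Before detaching, the result is always LOG_PERROR, so early errors reach the
terminal or whatever captures standard error.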

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 15001b39ef ("conf: set the log level much earlier")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:19:36 +01:00
Stefano Brivio
3fe9878db7 pcap: Use clock_gettime() instead of gettimeofday()
POSIX.1-2008 declared gettimeofday() as obsolete, but I'm a dinosaur.

Usually, C libraries translate that to the clock_gettime() system
call anyway, but this doesn't happen in Jon's environment, and,
there, seccomp happily kills pasta(1) when started with --pcap,
because we didn't add gettimeofday() to our seccomp profiles.

Use clock_gettime() instead.
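
For illustration, a sketch of taking a capture timestamp this way. The
struct pkt_ts type and pkt_ts_now() name are hypothetical, not the ones used
in pcap.c; the only assumption about the file format is that classic pcap
record headers store seconds and microseconds as 32-bit values.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* For clock_gettime() */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-packet timestamp: seconds and microseconds */
struct pkt_ts {
	uint32_t sec;
	uint32_t usec;
};

/**
 * pkt_ts_now() - Fill timestamp from CLOCK_REALTIME (sketch)
 * @ts:		Timestamp to fill
 *
 * Return: 0 on success, -1 if clock_gettime() failed
 */
static int pkt_ts_now(struct pkt_ts *ts)
{
	struct timespec now;

	assert(ts);

	if (clock_gettime(CLOCK_REALTIME, &now))
		return -1;

	ts->sec = (uint32_t)now.tv_sec;
	ts->usec = (uint32_t)(now.tv_nsec / 1000);	/* ns to us */

	return 0;
}
```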

Reported-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:18:36 +01:00
Stefano Brivio
0761f29a14 passt.1: --{no-,}dhcp-dns and --{no-,}dhcp-search don't take addresses
...they are simple enable/disable options.

Fixes: 89678c5157 ("conf, udp: Introduce basic DNS forwarding")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:18:09 +01:00
Stefano Brivio
4d05ba2c58 conf: Warn if we can't advertise any nameserver via DHCP, NDP, or DHCPv6
We might have read from resolv.conf, or from the command line, a
resolver that's reachable via loopback address, but that doesn't mean
we can offer that via DHCP, NDP or DHCPv6: warn if there are no
resolvers we can offer for a given IP version.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:17:37 +01:00
Stefano Brivio
43881636c2 conf: Handle addresses passed via --dns just like the ones from resolv.conf
...that is, call add_dns4() and add_dns6() instead of simply adding
those to the list of servers we advertise.

Most importantly, this will set the 'dns_host' field for the matching
IP version, so that, as mentioned in the man page, servers passed via
--dns are used for DNS mapping as well, if used in combination with
--dns-forward.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=82
Fixes: 89678c5157 ("conf, udp: Introduce basic DNS forwarding")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:16:04 +01:00
Laurent Vivier
b299942bbd tap: Capture only packets that are actually sent
In tap_send_frames(), if we failed to send all the frames, we must
only log the frames that have been sent, not all the frames we wanted
to send.
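
A sketch of the accounting this needs, assuming one buffer per frame and a
byte count returned by the send operation. The frames_sent() helper is
hypothetical, and how a partially sent frame is completed is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/**
 * frames_sent() - Count frames completely covered by a byte count (sketch)
 * @iov:	Vector of frames, one frame per entry
 * @n:		Number of frames we wanted to send
 * @bytes:	Number of bytes actually sent
 *
 * Return: number of leading frames that were sent in full
 */
static size_t frames_sent(const struct iovec *iov, size_t n, size_t bytes)
{
	size_t i;

	assert(iov || !n);

	for (i = 0; i < n && bytes >= iov[i].iov_len; i++)
		bytes -= iov[i].iov_len;

	return i;
}
```

Only the first frames_sent() entries would then be handed to the capture
function, instead of all n.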

Fixes: dda7945ca9 ("pcap: Handle short writes in pcap_frame()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:27 +01:00
David Gibson
413c15988e udp: Use existing helper for UDP checksum on inbound IPv6 packets
Currently we open code the calculation of the UDP checksum in
udp_update_hdr6().  We call a helper to handle the IPv6 pseudo-header,
and preset the checksum field to 0 so an uninitialised value doesn't get
folded in.  We already have a helper to do this: csum_udp6() which we use
in some slow paths.  Use it here as well.
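
For reference, this is roughly what such a helper computes. The sketch below
is a plain, unoptimised version written for this note, not csum_udp6()
itself; sum16() and udp6_csum() are hypothetical names. It shows why the
checksum field has to be zero beforehand: the field is part of the data being
summed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_PROTO	17	/* IPv6 next header value for UDP */

/* Add @len bytes at @data to @sum as big-endian 16-bit words */
static uint32_t sum16(const void *data, size_t len, uint32_t sum)
{
	const uint8_t *p = data;

	for (; len > 1; p += 2, len -= 2)
		sum += (uint32_t)p[0] << 8 | p[1];

	if (len)
		sum += (uint32_t)p[0] << 8;	/* Pad odd last byte with 0 */

	return sum;
}

/**
 * udp6_csum() - UDP checksum over IPv6 pseudo-header and datagram (sketch)
 * @src:	IPv6 source address, 16 bytes
 * @dst:	IPv6 destination address, 16 bytes
 * @udp:	UDP header and payload, checksum field (bytes 6-7) set to zero
 * @l4len:	Length of UDP header plus payload
 *
 * Return: checksum in host order, to be stored in network order
 */
static uint16_t udp6_csum(const uint8_t src[16], const uint8_t dst[16],
			  const uint8_t *udp, size_t l4len)
{
	uint32_t sum = 0;

	assert(udp && l4len >= 8 && l4len <= 0xffff);

	sum = sum16(src, 16, sum);
	sum = sum16(dst, 16, sum);
	sum += (uint32_t)l4len + UDP_PROTO;	/* Length and next header */
	sum = sum16(udp, l4len, sum);

	while (sum >> 16)			/* Fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	sum = ~sum & 0xffff;

	return sum ? (uint16_t)sum : 0xffff;	/* 0 means "no checksum" */
}
```

A receiver can verify a datagram by running the same sum with the checksum
field filled in: for a valid datagram, this sketch then returns 0xffff.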

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:25 +01:00
David Gibson
ae69838db0 udp: Avoid unnecessary pointer in udp_update_hdr4()
We carry around the source address as a pointer to a constant struct
in_addr.  But it's silly to carry around a 4 or 8 byte pointer to a 4 byte
IPv4 address.  Just copy the IPv4 address around by value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:21 +01:00
David Gibson
b0419d150a udp: Re-order udp_update_hdr[46] for clarity and brevity
The order of things in these functions is a bit odd for historical reasons.
We initialise some IP header fields early, then more later after making
some tests.  Likewise we declare some variables without initialisation,
but then unconditionally set them to values we could calculate at the
start of the function.

Previous cleanups have removed the reasons for some of these choices, so
reorder for clarity, and where possible move the first assignment into an
initialiser.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:19 +01:00
David Gibson
8a842e03cd udp: Pass data length explicitly to udp_update_hdr[46]
These functions take an index to the L2 buffer whose header information to
update.  They use that for two things: to locate the buffer pointer itself,
and to retrieve the length of the received message from the parallel
udp[46]_l2_mh_sock array.  The latter is arguably a failure to separate
concerns.  Change these functions to explicitly take a buffer pointer and
payload length as parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:17 +01:00
David Gibson
76571ae869 udp: Consistent port variable names in udp_update_hdr[46]
In these functions we have 'dstport' for the destination port, but
'src_port' for the source port.  Change the latter to 'srcport' for
consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:15 +01:00
David Gibson
205b140dec udp: Refactor udp_sock[46]_iov_init()
Each of these functions has 3 essentially identical loops in a row.
Merge the loops into a single common udp_sock_iov_init() function, calling
udp_sock[46]_iov_init_one() helpers to initialize each "slot" in the
various parallel arrays.  This is slightly neater now, and more naturally
allows changes we want to make where more initialization will become common
between IPv4 and IPv6.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:36:59 +01:00
Stefano Brivio
860d2764dd conf: Don't warn if nameservers were found, but won't be advertised
Starting from commit 3a2afde87d ("conf, udp: Drop mostly duplicated
dns_send arrays, rename related fields"), we won't add to c->ip4.dns
and c->ip6.dns nameservers that can't be used by the guest or
container, and we won't advertise them.

However, the fact that we don't advertise any nameserver doesn't mean
that we didn't find any, and we should warn only if we couldn't find
any.

This is particularly relevant in case both --dns-forward and
--no-map-gw are passed, and a single loopback address is listed in
/etc/resolv.conf: we'll forward queries directed to the address
specified by --dns-forward to the loopback address we found, but we
won't advertise that address, so we shouldn't warn: this is a
perfectly legitimate usage.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/19213
Fixes: 3a2afde87d ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-03-12 01:50:48 +01:00
David Gibson
4779dfe12f icmp: Use 'flowside' epoll references for ping sockets
Currently ping sockets use a custom epoll reference type which includes
the ICMP id.  However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id.  Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.

Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:49:05 +01:00
David Gibson
02cbdb0b86 icmp: Flow based error reporting
Use flow_dbg() and flow_err() helpers to generate flow-linked error
messages in most places.  Make a few small improvements to the messages
while we're at it.  This allows us to avoid the awkward 'pname' variables
since whether we're dealing with ICMP or ICMPv6 is already built into the
flow type which these helpers include.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style fix in icmp_tap_handler()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:36:04 +01:00
David Gibson
3af5e9fdba icmp: Store ping socket information in flow table
Currently icmp_id_map[][] stores information about ping sockets in a
bespoke structure.  Move the same information into new types of flow
in the flow table.  To match that change, replace the existing ICMP
timer with a flow-based timer for expiring ping sockets.  This has the
advantage that we only need to scan the active flows, not all possible
ids.

We convert icmp_id_map[][] to point to the flow table entries, rather
than containing its own information.  We do still use that array for
locating the right ping flows, rather than using a "flow native" form
of lookup for the time being.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update id_sock description in comment to icmp_ping_new()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:34:45 +01:00
Stefano Brivio
383a6f67e5 ip: Use regular htons() for non-constant protocol number in L2_BUF_IP4_PSUM
instead of htons_constant(), which is for... constants.

Fixes: 5bf200ae8a ("tcp, udp: Don't include destination address in partially precomputed csums")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-08 10:31:14 +01:00
120 changed files with 8200 additions and 7442 deletions

126
.clang-format Normal file

@ -0,0 +1,126 @@
# SPDX-License-Identifier: GPL-2.0
#
# clang-format configuration file. Intended for clang-format >= 11.
#
# For more information, see:
#
# Documentation/dev-tools/clang-format.rst
# https://clang.llvm.org/docs/ClangFormat.html
# https://clang.llvm.org/docs/ClangFormatStyleOptions.html
#
---
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
AlignEscapedNewlines: Left
AlignOperands: true
AlignTrailingComments: false
AllowAllParametersOfDeclarationOnNextLine: false
AllowShortBlocksOnASingleLine: false
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: None
AllowShortIfStatementsOnASingleLine: false
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: false
BinPackArguments: true
BinPackParameters: true
BraceWrapping:
AfterClass: false
AfterControlStatement: false
AfterEnum: false
AfterFunction: true
AfterNamespace: true
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
AfterExternBlock: false
BeforeCatch: false
BeforeElse: false
IndentBraces: false
SplitEmptyFunction: true
SplitEmptyRecord: true
SplitEmptyNamespace: true
BreakBeforeBinaryOperators: None
BreakBeforeBraces: Custom
BreakBeforeInheritanceComma: false
BreakBeforeTernaryOperators: false
BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeComma
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: false
ColumnLimit: 80
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: false
ConstructorInitializerIndentWidth: 8
ContinuationIndentWidth: 8
Cpp11BracedListStyle: false
DerivePointerAlignment: false
DisableFormat: false
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: false
# Taken from:
# git grep -h '^#define [^[:space:]]*for_each[^[:space:]]*(' include/ tools/ \
# | sed "s,^#define \([^[:space:]]*for_each[^[:space:]]*\)(.*$, - '\1'," \
# | LC_ALL=C sort -u
ForEachMacros:
- 'for_each_nst'
IncludeBlocks: Preserve
IncludeCategories:
- Regex: '.*'
Priority: 1
IncludeIsMainRegex: '(Test)?$'
IndentCaseLabels: false
IndentGotoLabels: false
IndentPPDirectives: None
IndentWidth: 8
IndentWrappedFunctionNames: false
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Auto
ObjCBlockIndentWidth: 8
ObjCSpaceAfterProperty: true
ObjCSpaceBeforeProtocolList: true
# Taken from git's rules
PenaltyBreakAssignment: 10
PenaltyBreakBeforeFirstCallParameter: 30
PenaltyBreakComment: 10
PenaltyBreakFirstLessLess: 0
PenaltyBreakString: 10
PenaltyExcessCharacter: 100
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Right
ReflowComments: false
SortIncludes: false
SortUsingDeclarations: false
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatementsExceptForEachMacros
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: false
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp03
TabWidth: 8
UseTab: Always
...

93
.clang-tidy Normal file

@ -0,0 +1,93 @@
---
Checks:
- "clang-diagnostic-*,clang-analyzer-*,*,-modernize-*"
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
- "-clang-analyzer-valist.Uninitialized"
# Dubious value, would kill readability
- "-cppcoreguidelines-init-variables"
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
- "-bugprone-assignment-in-if-condition"
# Debatable whether these improve readability, right now it would look
# like a mess
- "-google-readability-braces-around-statements"
- "-hicpp-braces-around-statements"
- "-readability-braces-around-statements"
# TODO: in most cases they are justified, but probably not everywhere
#
- "-readability-magic-numbers"
- "-cppcoreguidelines-avoid-magic-numbers"
# TODO: this is Linux-only for the moment, nice to fix eventually
- "-llvmlibc-restrict-system-libc-headers"
# Those are needed for syscalls, epoll_wait flags, etc.
- "-hicpp-signed-bitwise"
# Probably not doable to impement this without plain memcpy(), memset()
- "-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling"
# TODO: not really important, but nice to fix eventually
- "-llvm-include-order"
# Dubious value, would kill readability
- "-readability-isolate-declaration"
# TODO: nice to fix eventually
- "-bugprone-narrowing-conversions"
- "-cppcoreguidelines-narrowing-conversions"
# TODO: check, fix, and more in general constify wherever possible
- "-cppcoreguidelines-avoid-non-const-global-variables"
# TODO: check paths where it might make sense to improve performance
- "-altera-unroll-loops"
- "-altera-id-dependent-backward-branch"
# Not much can be done about them other than being careful
- "-bugprone-easily-swappable-parameters"
# TODO: split reported functions
- "-readability-function-cognitive-complexity"
# "Poor" alignment needed for structs reflecting message formats/headers
- "-altera-struct-pack-align"
# TODO: check again if multithreading is implemented
- "-concurrency-mt-unsafe"
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
- "-readability-identifier-length"
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
- "-misc-include-cleaner"
# Want to replace all #defines of integers with enums. Kind of
# makes sense when those defines form an enum-like set, but
# weird for cases like standalone constants, and causes other
# awkwardness for a bunch of cases we use
- "-cppcoreguidelines-macro-to-enum"
# It's been a couple of centuries since multiplication has been granted
# precedence over addition in modern mathematical notation. Adding
# parentheses to reinforce that certainly won't improve readability.
- "-readability-math-missing-parentheses"
WarningsAsErrors: "*"
HeaderFileExtensions:
- h
ImplementationFileExtensions:
- c
HeaderFilterRegex: ""
FormatStyle: none
CheckOptions:
bugprone-suspicious-string-compare.WarnOnImplicitComparison: "false"
SystemHeaders: false

3
.clangd Normal file

@ -0,0 +1,3 @@
CompileFlags:
# Don't try to interpret our headers as C++'
Add: [-xc, -Wall]

177
Makefile

@ -15,66 +15,41 @@ VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ versio
# the IPv6 socket API? (Linux does)
DUAL_STACK_SOCKETS := 1
RLIMIT_STACK_VAL := $(shell /bin/sh -c 'ulimit -s')
ifeq ($(RLIMIT_STACK_VAL),unlimited)
RLIMIT_STACK_VAL := 1024
endif
TARGET ?= $(shell $(CC) -dumpmachine)
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(shell echo $(TARGET) | cut -f1 -d- | tr [A-Z] [a-z])
TARGET_ARCH := $(shell echo $(TARGET_ARCH) | sed 's/powerpc/ppc/')
AUDIT_ARCH := $(shell echo $(TARGET_ARCH) | tr [a-z] [A-Z] | sed 's/^ARM.*/ARM/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/I[456]86/I386/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPC64/PPC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPCLE/PPC64LE/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/MIPS64EL/MIPSEL64/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/HPPA/PARISC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/SH4/SH/')
# On some systems enabling optimization also enables source fortification,
# automagically. Do not override it.
FORTIFY_FLAG :=
ifeq ($(shell $(CC) -O2 -dM -E - < /dev/null 2>&1 | grep ' _FORTIFY_SOURCE ' > /dev/null; echo $$?),1)
FORTIFY_FLAG := -D_FORTIFY_SOURCE=2
endif
FLAGS := -Wall -Wextra -Wno-format-zero-length
FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
FLAGS += -D_FORTIFY_SOURCE=2 -O2 -pie -fPIE
FLAGS += $(FORTIFY_FLAG) -O2 -pie -fPIE
FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
FLAGS += -DNETNS_RUN_DIR=\"/run/netns\"
FLAGS += -DPASST_AUDIT_ARCH=AUDIT_ARCH_$(AUDIT_ARCH)
FLAGS += -DRLIMIT_STACK_VAL=$(RLIMIT_STACK_VAL)
FLAGS += -DARCH=\"$(TARGET_ARCH)\"
FLAGS += -DVERSION=\"$(VERSION)\"
FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_vu.c util.c vhost_user.c virtio.c
tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
flow_table.h icmp.h inany.h iov.h ip.h isolation.h lineread.h log.h \
ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h siphash.h tap.h \
tcp.h tcp_buf.h tcp_conn.h tcp_splice.h tcp_vu.h udp.h udp_internal.h \
udp_vu.h util.h vhost_user.h virtio.h
flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
udp.h udp_flow.h util.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_SND_WND
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_bytes_acked = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_BYTES_ACKED
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_min_rtt = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_MIN_RTT
endif
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_GETRANDOM
@ -84,11 +59,6 @@ ifeq ($(shell :|$(CC) -fstack-protector-strong -S -xc - -o - >/dev/null 2>&1; ec
FLAGS += -fstack-protector-strong
endif
C := \#define _GNU_SOURCE\n\#include <fcntl.h>\nint x = FALLOC_FL_COLLAPSE_RANGE;
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
EXTRA_SYSCALLS += fallocate
endif
prefix ?= /usr/local
exec_prefix ?= $(prefix)
bindir ?= $(exec_prefix)/bin
@ -125,11 +95,11 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
ln -sf $< $@
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(QRAP_SRCS) -o qrap $(LDFLAGS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
getpid gettid kill clock_gettime mmap \
munmap open unlink gettimeofday futex
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex
valgrind: FLAGS += -g -DVALGRIND
valgrind: all
@ -189,111 +159,11 @@ docs: README.md
done < README.md; \
) > README.plain.md
# Checkers currently disabled for clang-tidy:
# - llvmlibc-restrict-system-libc-headers
# TODO: this is Linux-only for the moment, nice to fix eventually
#
# - bugprone-macro-parentheses
# - google-readability-braces-around-statements
# - hicpp-braces-around-statements
# - readability-braces-around-statements
# Debatable whether that improves readability, right now it would look
# like a mess
#
# - readability-magic-numbers
# - cppcoreguidelines-avoid-magic-numbers
# TODO: in most cases they are justified, but probably not everywhere
#
# - clang-analyzer-valist.Uninitialized
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
#
# - clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling
# Probably not doable to impement this without plain memcpy(), memset()
#
# - cppcoreguidelines-init-variables
# Dubious value, would kill readability
#
# - hicpp-signed-bitwise
# Those are needed for syscalls, epoll_wait flags, etc.
#
# - llvm-include-order
# TODO: not really important, but nice to fix eventually
#
# - readability-isolate-declaration
# Dubious value, would kill readability
#
# - bugprone-narrowing-conversions
# - cppcoreguidelines-narrowing-conversions
# TODO: nice to fix eventually
#
# - cppcoreguidelines-avoid-non-const-global-variables
# TODO: check, fix, and more in general constify wherever possible
#
# - altera-unroll-loops
# - altera-id-dependent-backward-branch
# TODO: check paths where it might make sense to improve performance
#
# - bugprone-easily-swappable-parameters
# Not much can be done about them other than being careful
#
# - readability-function-cognitive-complexity
# TODO: split reported functions
#
# - altera-struct-pack-align
# "Poor" alignment needed for structs reflecting message formats/headers
#
# - concurrency-mt-unsafe
# TODO: check again if multithreading is implemented
#
# - readability-identifier-length
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
#
# - bugprone-assignment-in-if-condition
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
#
# - misc-include-cleaner
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
clang-tidy: $(PASST_SRCS) $(HEADERS)
clang-tidy $(PASST_SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
-DCLANG_TIDY_58992
clang-tidy: $(SRCS) $(HEADERS)
clang-tidy -checks=*,-modernize-*,\
-clang-analyzer-valist.Uninitialized,\
-cppcoreguidelines-init-variables,\
-bugprone-assignment-in-if-condition,\
-bugprone-macro-parentheses,\
-google-readability-braces-around-statements,\
-hicpp-braces-around-statements,\
-readability-braces-around-statements,\
-readability-magic-numbers,\
-llvmlibc-restrict-system-libc-headers,\
-hicpp-signed-bitwise,\
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,\
-llvm-include-order,\
-cppcoreguidelines-avoid-magic-numbers,\
-readability-isolate-declaration,\
-bugprone-narrowing-conversions,\
-cppcoreguidelines-narrowing-conversions,\
-cppcoreguidelines-avoid-non-const-global-variables,\
-altera-unroll-loops,-altera-id-dependent-backward-branch,\
-bugprone-easily-swappable-parameters,\
-readability-function-cognitive-complexity,\
-altera-struct-pack-align,\
-concurrency-mt-unsafe,\
-readability-identifier-length,\
-misc-include-cleaner \
-config='{CheckOptions: [{key: bugprone-suspicious-string-compare.WarnOnImplicitComparison, value: "false"}]}' \
--warnings-as-errors=* $(SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -DCLANG_TIDY_58992
SYSTEM_INCLUDES := /usr/include $(wildcard /usr/include/$(TARGET))
ifeq ($(shell $(CC) -v 2>&1 | grep -c "gcc version"),1)
VER := $(shell $(CC) -dumpversion)
SYSTEM_INCLUDES += /usr/lib/gcc/$(TARGET)/$(VER)/include
endif
cppcheck: $(SRCS) $(HEADERS)
cppcheck: $(PASST_SRCS) $(HEADERS)
if cppcheck --check-level=exhaustive /dev/null > /dev/null 2>&1; then \
CPPCHECK_EXHAUSTIVE="--check-level=exhaustive"; \
else \
@ -302,11 +172,8 @@ cppcheck: $(SRCS) $(HEADERS)
cppcheck --std=c11 --error-exitcode=1 --enable=all --force \
--inconclusive --library=posix --quiet \
$${CPPCHECK_EXHAUSTIVE} \
$(SYSTEM_INCLUDES:%=-I%) \
$(SYSTEM_INCLUDES:%=--config-exclude=%) \
$(SYSTEM_INCLUDES:%=--suppress=*:%/*) \
$(SYSTEM_INCLUDES:%=--suppress=unmatchedSuppression:%/*) \
--inline-suppr \
--suppress=missingIncludeSystem \
--suppress=unusedStructMember \
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
.
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -D CPPCHECK_6936 \
$(PASST_SRCS) $(HEADERS)


@ -338,20 +338,24 @@ speeding up local connections, and usually requiring NAT. _pasta_:
[_slirp4netns_ replacement](/passt/tree/slirp4netns.sh)
* ✅ out-of-tree patch for
[Kata Containers](/passt/tree/contrib/kata-containers) available
* ⌚ drop-in replacement for VPNKit (rootless Docker)
* ✅ rootless Docker
[network back-end](https://docs.docker.com/engine/security/rootless/#networking-errors)
via moby/rootlesskit
### Availability
* official packages for:
* ✅ [Alpine Linux](https://pkgs.alpinelinux.org/packages?name=passt)
* ✅ [Arch Linux](https://archlinux.org/packages/extra/x86_64/passt/) ([aarch64](https://archlinuxarm.org/packages/aarch64/passt), [i486](https://www.archlinux32.org/packages/?q=passt))
* ✅ [CentOS Stream](https://gitlab.com/redhat/centos-stream/rpms/passt)
* ✅ [Debian](https://tracker.debian.org/pkg/passt)
* ✅ [Fedora](https://src.fedoraproject.org/rpms/passt)
* ✅ [Gentoo](https://packages.gentoo.org/packages/net-misc/passt)
* ✅ [GNU Guix](https://packages.guix.gnu.org/packages/passt/)
* ✅ [OpenSUSE](https://build.opensuse.org/package/requests/Virtualization:containers/passt)
* ✅ [Ubuntu](https://launchpad.net/ubuntu/+source/passt)
* ✅ [Void Linux](https://voidlinux.org/packages/?q=passt)
* unofficial packages for:
* ✅ [EPEL, Mageia](https://copr.fedorainfracloud.org/coprs/sbrivio/passt/)
* 🛠 [openSUSE](https://build.opensuse.org/package/show/Virtualization:containers/passt)
* ✅ unofficial [packages](https://passt.top/builds/latest/x86_64/) from x86_64
static builds for other RPM-based distributions
* ✅ unofficial [packages](https://passt.top/builds/latest/x86_64/) from x86_64
@ -396,7 +400,7 @@ services:
and nameserver using SLAAC
* [DHCPv6 server](/passt/tree/dhcpv6.c): a simple
implementation handing out one single IPv6 address to the guest or namespace,
namely, the the same address as the first one configured for the upstream host
namely, the same address as the first one configured for the upstream host
interface, and passing the nameservers configured on the host
## Addresses

18
arch.c

@ -18,6 +18,9 @@
#include <string.h>
#include <unistd.h>
#include "log.h"
#include "util.h"
/**
* arch_avx2_exec() - Switch to AVX2 build if supported
* @argv: Arguments from command line
@ -28,10 +31,8 @@ void arch_avx2_exec(char **argv)
char exe[PATH_MAX] = { 0 };
const char *p;
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0) {
perror("readlink /proc/self/exe");
exit(EXIT_FAILURE);
}
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0)
die_perror("Failed to read own /proc/self/exe link");
p = strstr(exe, ".avx2");
if (p && strlen(p) == strlen(".avx2"))
@ -40,9 +41,12 @@ void arch_avx2_exec(char **argv)
if (__builtin_cpu_supports("avx2")) {
char new_path[PATH_MAX + sizeof(".avx2")];
snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
execve(new_path, argv, environ);
perror("Can't run AVX2 build, using non-AVX2 version");
if (snprintf_check(new_path, PATH_MAX + sizeof(".avx2"),
"%s.avx2", exe))
die_perror("Can't build AVX2 executable path");
execv(new_path, argv);
warn_perror("Can't run AVX2 build, using non-AVX2 version");
}
}
#else

20
arp.c

@ -43,8 +43,7 @@ int arp(const struct ctx *c, const struct pool *p)
struct ethhdr *eh;
struct arphdr *ah;
struct arpmsg *am;
size_t len;
int ret;
size_t l2len;
eh = packet_get(p, 0, 0, sizeof(*eh), NULL);
ah = packet_get(p, 0, sizeof(*eh), sizeof(*ah), NULL);
@ -60,31 +59,28 @@ int arp(const struct ctx *c, const struct pool *p)
ah->ar_op != htons(ARPOP_REQUEST))
return 1;
/* Discard announcements (but not 0.0.0.0 "probes"): we might have the
* same IP address, hide that.
*/
if (memcmp(am->sip, (unsigned char[4]){ 0 }, sizeof(am->tip)) &&
/* Discard announcements, but not 0.0.0.0 "probes" */
if (memcmp(am->sip, &in4addr_any, sizeof(am->sip)) &&
!memcmp(am->sip, am->tip, sizeof(am->sip)))
return 1;
/* Don't resolve our own address, either. */
/* Don't resolve the guest's assigned address, either. */
if (!memcmp(am->tip, &c->ip4.addr, sizeof(am->tip)))
return 1;
ah->ar_op = htons(ARPOP_REPLY);
memcpy(am->tha, am->sha, sizeof(am->tha));
memcpy(am->sha, c->mac, sizeof(am->sha));
memcpy(am->sha, c->our_tap_mac, sizeof(am->sha));
memcpy(swap, am->tip, sizeof(am->tip));
memcpy(am->tip, am->sip, sizeof(am->tip));
memcpy(am->sip, swap, sizeof(am->sip));
len = sizeof(*eh) + sizeof(*ah) + sizeof(*am);
l2len = sizeof(*eh) + sizeof(*ah) + sizeof(*am);
memcpy(eh->h_dest, eh->h_source, sizeof(eh->h_dest));
memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
if ((ret = tap_send(c, eh, len)) < 0)
warn("ARP: send: %s", strerror(ret));
tap_send_single(c, eh, l2len);
return 1;
}
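
The filtering rules in the arp() hunk above can be read as a small predicate. The following is a minimal sketch under stated assumptions, not the code from arp.c: the function name and the plain 4-byte array parameters are hypothetical, chosen only to show the two checks (drop announcements but keep 0.0.0.0 probes, and never answer for the guest's assigned address).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of passt: decide whether an ARP request
 * should be answered, following the two checks shown in the hunk above.
 *
 * @sip:	Sender IPv4 address from the request
 * @tip:	Target IPv4 address from the request
 * @guest_addr:	Address assigned to the guest (c->ip4.addr in the diff)
 */
bool sketch_arp_should_reply(const uint8_t sip[4], const uint8_t tip[4],
			     const uint8_t guest_addr[4])
{
	static const uint8_t any[4] = { 0 };

	/* An announcement has a non-zero sender address equal to the target
	 * address. A probe has sender 0.0.0.0 and is not discarded here.
	 */
	if (memcmp(sip, any, sizeof(any)) && !memcmp(sip, tip, sizeof(any)))
		return false;

	/* Don't resolve the guest's own assigned address */
	if (!memcmp(tip, guest_addr, sizeof(any)))
		return false;

	return true;
}
```

With a guest address of 192.0.2.2, a probe from 0.0.0.0 for 192.0.2.1 would be answered, while an announcement with sender and target both set to 192.0.2.7 would be dropped.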


@ -59,6 +59,7 @@
#include "util.h"
#include "ip.h"
#include "checksum.h"
#include "iov.h"
/* Checksums are optional for UDP over IPv4, so we usually just set
* them to 0. Change this to 1 to calculate real UDP over IPv4
@ -116,19 +117,19 @@ uint16_t csum_fold(uint32_t sum)
/**
* csum_ip4_header() - Calculate IPv4 header checksum
* @tot_len: IPv4 payload length (data + IP header, network order)
* @protocol: Protocol number (network order)
* @saddr: IPv4 source address (network order)
* @daddr: IPv4 destination address (network order)
* @l3len: IPv4 packet length (host order)
* @protocol: Protocol number
* @saddr: IPv4 source address
* @daddr: IPv4 destination address
*
* Return: 16-bit folded sum of the IPv4 header
*/
uint16_t csum_ip4_header(uint16_t tot_len, uint8_t protocol,
uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr)
{
uint32_t sum = L2_BUF_IP4_PSUM(protocol);
sum += tot_len;
sum += htons(l3len);
sum += (saddr.s_addr >> 16) & 0xffff;
sum += saddr.s_addr & 0xffff;
sum += (daddr.s_addr >> 16) & 0xffff;
@ -140,13 +141,13 @@ uint16_t csum_ip4_header(uint16_t tot_len, uint8_t protocol,
/**
* proto_ipv4_header_psum() - Calculates the partial checksum of an
* IPv4 header for UDP or TCP
* @tot_len: IPv4 Payload length (host order)
* @proto: Protocol number (host order)
* @saddr: Source address (network order)
* @daddr: Destination address (network order)
* @l4len: IPv4 Payload length (host order)
* @proto: Protocol number
* @saddr: Source address
* @daddr: Destination address
* Returns: Partial checksum of the IPv4 header
*/
uint32_t proto_ipv4_header_psum(uint16_t tot_len, uint8_t protocol,
uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr)
{
uint32_t psum = htons(protocol);
@ -155,7 +156,7 @@ uint32_t proto_ipv4_header_psum(uint16_t tot_len, uint8_t protocol,
psum += saddr.s_addr & 0xffff;
psum += (daddr.s_addr >> 16) & 0xffff;
psum += daddr.s_addr & 0xffff;
psum += htons(tot_len);
psum += htons(l4len);
return psum;
}
@ -165,22 +166,24 @@ uint32_t proto_ipv4_header_psum(uint16_t tot_len, uint8_t protocol,
* @udp4hr: UDP header, initialised apart from checksum
* @saddr: IPv4 source address
* @daddr: IPv4 destination address
* @payload: ICMPv4 packet payload
* @len: Length of @payload (not including UDP)
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
*/
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t len)
const struct iovec *iov, int iov_cnt, size_t offset)
{
/* UDP checksums are optional, so don't bother */
udp4hr->check = 0;
if (UDP4_REAL_CHECKSUMS) {
uint16_t tot_len = len + sizeof(struct udphdr);
uint32_t psum = proto_ipv4_header_psum(tot_len, IPPROTO_UDP,
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint32_t psum = proto_ipv4_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
psum = csum_unfolded(udp4hr, sizeof(struct udphdr), psum);
udp4hr->check = csum(payload, len, psum);
udp4hr->check = csum_iov(iov, iov_cnt, offset, psum);
}
}
@ -188,9 +191,9 @@ void csum_udp4(struct udphdr *udp4hr,
* csum_icmp4() - Calculate and set checksum for an ICMP packet
* @icmp4hr: ICMP header, initialised apart from checksum
* @payload: ICMP packet payload
* @len: Length of @payload (not including ICMP header)
* @dlen: Length of @payload (not including ICMP header)
*/
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t len)
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen)
{
uint32_t psum;
@ -199,16 +202,16 @@ void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t len)
/* Partial checksum for ICMP header alone */
psum = sum_16b(icmp4hr, sizeof(*icmp4hr));
icmp4hr->checksum = csum(payload, len, psum);
icmp4hr->checksum = csum(payload, dlen, psum);
}
/**
* proto_ipv6_header_psum() - Calculates the partial checksum of an
* IPv6 header for UDP or TCP
* @payload_len: IPv6 payload length (host order)
* @proto: Protocol number (host order)
* @saddr: Source address (network order)
* @daddr: Destination address (network order)
* @proto: Protocol number
* @saddr: Source address
* @daddr: Destination address
* Returns: Partial checksum of the IPv6 header
*/
uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
@ -226,19 +229,24 @@ uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
/**
* csum_udp6() - Calculate and set checksum for a UDP over IPv6 packet
* @udp6hr: UDP header, initialised apart from checksum
* @payload: UDP packet payload
* @len: Length of @payload (not including UDP header)
* @saddr: Source address
* @daddr: Destination address
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
*/
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t len)
const struct iovec *iov, int iov_cnt, size_t offset)
{
uint32_t psum = proto_ipv6_header_psum(len + sizeof(struct udphdr),
IPPROTO_UDP, saddr, daddr);
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint32_t psum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
udp6hr->check = 0;
psum = csum_unfolded(udp6hr, sizeof(struct udphdr), psum);
udp6hr->check = csum(payload, len, psum);
udp6hr->check = csum_iov(iov, iov_cnt, offset, psum);
}
/**
@ -247,21 +255,19 @@ void csum_udp6(struct udphdr *udp6hr,
* @saddr: IPv6 source address
* @daddr: IPv6 destination address
* @payload: ICMP packet payload
* @len: Length of @payload (not including ICMPv6 header)
* @dlen: Length of @payload (not including ICMPv6 header)
*/
void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t len)
const void *payload, size_t dlen)
{
/* Partial checksum for the pseudo-IPv6 header */
uint32_t psum = sum_16b(saddr, sizeof(*saddr)) +
sum_16b(daddr, sizeof(*daddr)) +
htons(len + sizeof(*icmp6hr)) + htons(IPPROTO_ICMPV6);
uint32_t psum = proto_ipv6_header_psum(dlen + sizeof(*icmp6hr),
IPPROTO_ICMPV6, saddr, daddr);
icmp6hr->icmp6_cksum = 0;
/* Add in partial checksum for the ICMPv6 header alone */
psum += sum_16b(icmp6hr, sizeof(*icmp6hr));
icmp6hr->icmp6_cksum = csum(payload, len, psum);
icmp6hr->icmp6_cksum = csum(payload, dlen, psum);
}
#ifdef __AVX2__
@ -499,16 +505,26 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
*
* @iov Pointer to the array of IO vectors
* @n Length of the array
* @offset: Offset of the data to checksum within the full data length
* @init Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* cppcheck-suppress unusedFunction */
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init)
{
unsigned int i;
size_t first;
for (i = 0; i < n; i++)
i = iov_skip_bytes(iov, n, offset, &first);
if (i >= n)
return (uint16_t)~csum_fold(init);
init = csum_unfolded((char *)iov[i].iov_base + first,
iov[i].iov_len - first, init);
i++;
for (; i < n; i++)
init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
return (uint16_t)~csum_fold(init);
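
To illustrate what the new @offset parameter of csum_iov() means, here is a minimal sketch of an Internet checksum (RFC 1071) computed over an iovec array while skipping the first bytes. It is an assumption-laden illustration, not the passt implementation: the name sketch_csum_iov is hypothetical, it sums byte by byte instead of calling csum_unfolded() or the AVX2 path, and it works on the numeric big-endian word value, so a caller would apply htons() before storing the result in a header.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper, not part of passt: checksum the data described by
 * an iovec array, ignoring the first @offset bytes.
 *
 * @iov:	Pointer to the array of IO vectors
 * @n:		Length of the array
 * @offset:	Number of leading bytes to skip
 * @init:	Initial partial sum, 0 if there is none
 *
 * Return: 16-bit folded, complemented checksum (numeric value)
 */
uint16_t sketch_csum_iov(const struct iovec *iov, size_t n, size_t offset,
			 uint32_t init)
{
	uint64_t sum = init;
	size_t pos = 0;	/* Position of the next byte in the summed data */
	size_t i, j;

	for (i = 0; i < n; i++) {
		const uint8_t *p = iov[i].iov_base;
		size_t len = iov[i].iov_len;
		size_t skip = offset < len ? offset : len;

		offset -= skip;

		/* Tracking the byte position keeps 16-bit words aligned even
		 * if an element has an odd length.
		 */
		for (j = skip; j < len; j++, pos++)
			sum += (pos & 1) ? p[j] : (uint32_t)p[j] << 8;
	}

	/* Fold carries back into the low 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

If @offset is larger than the total data length, nothing is summed and the result is the folded complement of @init alone, which mirrors the early return in the hunk above.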


@ -13,25 +13,26 @@ struct icmp6hdr;
uint32_t sum_16b(const void *buf, size_t len);
uint16_t csum_fold(uint32_t sum);
uint16_t csum_unaligned(const void *buf, size_t len, uint32_t init);
uint16_t csum_ip4_header(uint16_t tot_len, uint8_t protocol,
uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
uint32_t proto_ipv4_header_psum(uint16_t tot_len, uint8_t protocol,
uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t len);
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t len);
const struct iovec *iov, int iov_cnt, size_t offset);
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen);
uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
const struct in6_addr *saddr,
const struct in6_addr *daddr);
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t len);
const struct iovec *iov, int iov_cnt, size_t offset);
void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t len);
const void *payload, size_t dlen);
uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
uint16_t csum(const void *buf, size_t len, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init);
#endif /* CHECKSUM_H */

1132
conf.c

File diff suppressed because it is too large


@ -26,13 +26,16 @@
capability sys_ptrace,
/ r, # isolate_prefork(), isolation.c
mount options=(rw, runbindable) /,
mount options=(rw, runbindable) -> /,
mount "" -> "/",
mount "" -> "/tmp/",
pivot_root "/tmp/" -> "/tmp/",
umount "/",
owner @{PROC}/@{pid}/uid_map r, # conf_ugid()
@{PROC}/sys/net/ipv4/ip_local_port_range r, # fwd_probe_ephemeral()
network netlink raw, # nl_sock_init_do(), netlink.c
network inet stream, # tcp.c


@ -27,8 +27,9 @@
@{PROC}/@{pid}/net/udp r,
@{PROC}/@{pid}/net/udp6 r,
@{run}/user/@{uid}/netns/* r, # pasta_open_ns(), pasta.c
@{run}/user/@{uid}/** rw, # pasta_open_ns()
@{PROC}/[0-9]*/ns/ r, # pasta_netns_quit_init(),
@{PROC}/[0-9]*/ns/net r, # pasta_wait_for_ns(),
@{PROC}/[0-9]*/ns/user r, # conf_pasta_ns()
@ -42,3 +43,5 @@
/{usr/,}bin/** Ux,
/usr/bin/pasta.avx2 ix, # arch_avx2_exec(), arch.c
ptrace r, # pasta_open_ns()


@ -19,9 +19,12 @@ profile passt /usr/bin/passt{,.avx2} {
include <abstractions/passt>
# Alternatively: include <abstractions/user-tmp>
owner /tmp/** w, # tap_sock_unix_init(), pcap(),
# write_pidfile(),
owner /tmp/** w, # tap_sock_unix_open(),
# tap_sock_unix_init(), pcap(),
# pidfile_open(),
# pidfile_write(),
# logfile_init()
owner @{HOME}/** w, # pcap(), write_pidfile()
owner @{HOME}/** w, # pcap(), pidfile_open(),
# pidfile_write()
}


@ -19,9 +19,13 @@ profile pasta /usr/bin/pasta{,.avx2} flags=(attach_disconnected) {
include <abstractions/pasta>
# Alternatively: include <abstractions/user-tmp>
owner /tmp/** w, # tap_sock_unix_init(), pcap(),
# write_pidfile(),
# logfile_init()
/tmp/** rw, # tap_sock_unix_open(),
# tap_sock_unix_init(), pcap(),
# pidfile_open(),
# pidfile_write(),
# logfile_init(),
# pasta_open_ns()
owner @{HOME}/** w, # pcap(), write_pidfile()
owner @{HOME}/** w, # pcap(), pidfile_open(),
# pidfile_write()
}


@ -14,7 +14,7 @@ Name: passt
Version: {{{ git_version }}}
Release: 1%{?dist}
Summary: User-mode networking daemons for virtual machines and namespaces
License: GPLv2+ and BSD
License: GPL-2.0-or-later AND BSD-3-Clause
Group: System Environment/Daemons
URL: https://passt.top/
Source: https://passt.top/passt/snapshot/passt-%{git_hash}.tar.xz


@ -29,7 +29,11 @@ function passt_git_changelog_entry {
[ -z "${__from}" ] && __from="$(git rev-list --max-parents=0 HEAD)"
__date="$(git log --pretty="format:%cI" "${__to}" -1)"
__author="$(git log -1 --pretty="format:%an <%ae>" ${__to} -- contrib/fedora)"
__author="Stefano Brivio <sbrivio@redhat.com>"
# Use:
# __author="$(git log -1 --pretty="format:%an <%ae>" ${__to} -- contrib/fedora)"
# if you want the author of changelog entries to match the latest
# author for contrib/fedora
printf "* %s %s - %s\n" "$(date "+%a %b %e %Y" -d "${__date}")" "${__author}" "$(git_version "${__to}")-1"


@ -50,6 +50,7 @@ require {
type passwd_file_t;
class netlink_route_socket { bind create nlmsg_read };
type sysctl_net_t;
class capability { sys_tty_config setuid setgid };
class cap_userns { setpcap sys_admin sys_ptrace };
@ -104,6 +105,8 @@ allow passt_t net_conf_t:lnk_file read;
allow passt_t tmp_t:sock_file { create unlink write };
allow passt_t self:netlink_route_socket { bind create nlmsg_read read write setopt };
kernel_search_network_sysctl(passt_t)
allow passt_t sysctl_net_t:dir search;
allow passt_t sysctl_net_t:file { open read };
corenet_tcp_bind_all_nodes(passt_t)
corenet_udp_bind_all_nodes(passt_t)


@ -196,7 +196,7 @@ allow pasta_t ifconfig_var_run_t:dir { read search watch };
allow pasta_t self:tun_socket create;
allow pasta_t tun_tap_device_t:chr_file { ioctl open read write };
allow pasta_t sysctl_net_t:dir search;
allow pasta_t sysctl_net_t:file { open write };
allow pasta_t sysctl_net_t:file { open read write };
allow pasta_t kernel_t:system module_request;
allow pasta_t nsfs_t:file read;
@ -211,3 +211,4 @@ allow pasta_t ifconfig_t:process { noatsecure rlimitinh siginh };
allow pasta_t netutils_t:process { noatsecure rlimitinh siginh };
allow pasta_t ping_t:process { noatsecure rlimitinh siginh };
allow pasta_t user_tty_device_t:chr_file { append read write };
allow pasta_t user_devpts_t:chr_file { append read write };

25
dhcp.c

@ -275,7 +275,8 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len)
*/
int dhcp(const struct ctx *c, const struct pool *p)
{
size_t mlen, len, offset = 0, opt_len, opt_off = 0;
size_t mlen, dlen, offset = 0, opt_len, opt_off = 0;
char macstr[ETH_ADDRSTRLEN];
const struct ethhdr *eh;
const struct iphdr *iph;
const struct udphdr *uh;
@ -340,26 +341,26 @@ int dhcp(const struct ctx *c, const struct pool *p)
return -1;
}
info(" from %02x:%02x:%02x:%02x:%02x:%02x",
m->chaddr[0], m->chaddr[1], m->chaddr[2],
m->chaddr[3], m->chaddr[4], m->chaddr[5]);
info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));
m->yiaddr = c->ip4.addr;
mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
memcpy(opts[1].s, &mask, sizeof(mask));
memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw));
memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw));
memcpy(opts[1].s, &mask, sizeof(mask));
memcpy(opts[3].s, &c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr));
/* If the gateway is not on the assigned subnet, send an option 121
* (Classless Static Routing) adding a dummy route to it.
*/
if ((c->ip4.addr.s_addr & mask.s_addr)
!= (c->ip4.gw.s_addr & mask.s_addr)) {
!= (c->ip4.guest_gw.s_addr & mask.s_addr)) {
/* a.b.c.d/32:0.0.0.0, 0:a.b.c.d */
opts[121].slen = 14;
opts[121].s[0] = 32;
memcpy(opts[121].s + 1, &c->ip4.gw, sizeof(c->ip4.gw));
memcpy(opts[121].s + 10, &c->ip4.gw, sizeof(c->ip4.gw));
memcpy(opts[121].s + 1,
&c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
memcpy(opts[121].s + 10,
&c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
}
if (c->mtu != -1) {
@ -377,8 +378,8 @@ int dhcp(const struct ctx *c, const struct pool *p)
if (!c->no_dhcp_dns_search)
opt_set_dns_search(c, sizeof(m->o));
len = offsetof(struct msg, o) + fill(m);
tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, len);
dlen = offsetof(struct msg, o) + fill(m);
tap_udp4_send(c, c->ip4.our_tap_addr, 67, c->ip4.addr, 68, m, dlen);
return 1;
}
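
The option 121 payload built in the dhcp() hunk above follows the Classless Static Route encoding (RFC 3442): a prefix-length octet, the significant destination octets, then the router. The sketch below shows the 14-byte layout described by the source comment "a.b.c.d/32:0.0.0.0, 0:a.b.c.d". The helper name and the byte-array parameter are hypothetical, and unlike the code in the diff, which copies into a presumably zero-initialised opts[121].s buffer, this sketch writes the zero bytes explicitly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of passt: encode two classless static
 * routes for a gateway outside the assigned subnet.
 *
 * @buf:	Output buffer, at least 14 bytes
 * @gw:		Gateway address, 4 bytes in network order
 *
 * Return: number of bytes written
 */
size_t sketch_opt121_fill(uint8_t *buf, const uint8_t gw[4])
{
	size_t n = 0;

	/* Route 1: gw/32 via 0.0.0.0, that is, the gateway is on-link */
	buf[n++] = 32;
	memcpy(buf + n, gw, 4);
	n += 4;
	memset(buf + n, 0, 4);
	n += 4;

	/* Route 2: default route (prefix length 0, so no destination
	 * octets follow) via the gateway
	 */
	buf[n++] = 0;
	memcpy(buf + n, gw, 4);
	n += 4;

	return n;
}
```

The gateway therefore lands at offsets 1 and 10, matching the two memcpy() calls in the hunk, and the returned length equals the slen of 14 set there.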


@ -296,45 +296,42 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p,
struct in6_addr *la)
{
int ia_types[2] = { OPT_IA_NA, OPT_IA_TA }, *ia_type;
const struct opt_ia_addr *opt_addr;
char buf[INET6_ADDRSTRLEN];
struct in6_addr req_addr;
struct opt_hdr *ia, *h;
const struct opt_hdr *h;
struct opt_hdr *ia;
size_t offset;
int ia_type;
ia_type = OPT_IA_NA;
ia_ta:
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
offset += sizeof(struct opt_ia_na);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h;
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
foreach(ia_type, ia_types) {
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, *ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
memcpy(&req_addr, &opt_addr->addr, sizeof(req_addr));
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) {
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr,
buf, sizeof(buf)));
return ia;
}
offset += sizeof(struct opt_ia_na);
offset += sizeof(struct opt_ia_addr);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
return NULL;
opt_addr = (const struct opt_ia_addr *)h;
req_addr = opt_addr->addr;
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr))
goto err;
offset += sizeof(struct opt_ia_addr);
}
}
}
if (ia_type == OPT_IA_NA) {
ia_type = OPT_IA_TA;
goto ia_ta;
}
return NULL;
err:
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr, buf, sizeof(buf)));
return ia;
}
/**
@ -363,7 +360,7 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
srv->hdr.l = 0;
}
memcpy(&srv->addr[i], &c->ip6.dns[i], sizeof(srv->addr[i]));
srv->addr[i] = c->ip6.dns[i];
srv->hdr.l += sizeof(srv->addr[i]);
offset += sizeof(srv->addr[i]);
}
@ -426,11 +423,11 @@ search:
int dhcpv6(struct ctx *c, const struct pool *p,
const struct in6_addr *saddr, const struct in6_addr *daddr)
{
struct opt_hdr *ia, *bad_ia, *client_id;
const struct opt_hdr *server_id;
const struct opt_hdr *client_id, *server_id, *ia;
const struct in6_addr *src;
const struct msg_hdr *mh;
const struct udphdr *uh;
struct opt_hdr *bad_ia;
size_t mlen, n;
uh = packet_get(p, 0, 0, sizeof(*uh), &mlen);
@ -451,10 +448,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
c->ip6.addr_ll_seen = *saddr;
if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
src = &c->ip6.gw;
else
src = &c->ip6.addr_ll;
src = &c->ip6.our_tap_ll;
mh = packet_get(p, 0, sizeof(*uh), sizeof(*mh), NULL);
if (!mh)
@ -574,8 +568,10 @@ void dhcpv6_init(const struct ctx *c)
resp.server_id.duid_time = duid_time;
resp_not_on_link.server_id.duid_time = duid_time;
memcpy(resp.server_id.duid_lladdr, c->mac, sizeof(c->mac));
memcpy(resp_not_on_link.server_id.duid_lladdr, c->mac, sizeof(c->mac));
memcpy(resp.server_id.duid_lladdr,
c->our_tap_mac, sizeof(c->our_tap_mac));
memcpy(resp_not_on_link.server_id.duid_lladdr,
c->our_tap_mac, sizeof(c->our_tap_mac));
resp.ia_addr.addr = c->ip6.addr;
}

3
doc/platform-requirements/.gitignore vendored Normal file

@ -0,0 +1,3 @@
/reuseaddr-priority
/recv-zero
/udp-close-dup


@ -0,0 +1,45 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# Copyright Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
TARGETS = reuseaddr-priority recv-zero udp-close-dup
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c
CFLAGS = -Wall
all: cppcheck clang-tidy $(TARGETS:%=check-%)
$(TARGETS): %: %.c common.c common.h
check-%: %
./$<
cppcheck:
cppcheck --std=c11 --error-exitcode=1 --enable=all --force \
--check-level=exhaustive --inline-suppr \
--inconclusive --library=posix --quiet \
--suppress=missingIncludeSystem \
$(SRCS)
clang-tidy:
clang-tidy --checks=*,\
-altera-id-dependent-backward-branch,\
-altera-unroll-loops,\
-bugprone-easily-swappable-parameters,\
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,\
-concurrency-mt-unsafe,\
-cppcoreguidelines-avoid-non-const-global-variables,\
-cppcoreguidelines-init-variables,\
-cppcoreguidelines-macro-to-enum,\
-google-readability-braces-around-statements,\
-hicpp-braces-around-statements,\
-llvmlibc-restrict-system-libc-headers,\
-misc-include-cleaner,\
-modernize-macro-to-enum,\
-readability-braces-around-statements,\
-readability-identifier-length,\
-readability-isolate-declaration \
$(SRCS)
clean:
rm -f $(TARGETS) *.o *~


@ -0,0 +1,18 @@
Platform Requirements
=====================
TODO: document the various Linux-specific features we currently require
Test Programs
-------------
In some places we rely on quite specific behaviour of sockets.
Although Linux, at least, seems to behave as required, It's not always
clear from the available documentation if this is required by POSIX or
some other specification.
To specifically document those expectations this directory has some
test programs which explicitly check for the behaviour we need.
When/if we attempt a port to a new platform, running these to check
behaviour would be a good place to start.


@ -0,0 +1,66 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* common.c
*
* Common helper functions for testing SO_REUSEADDR behaviour
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include "common.h"
int sock_reuseaddr(void)
{
int y = 1;
int s;
s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
if (s < 0)
die("socket(): %s\n", strerror(errno));
if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)) < 0)
die("SO_REUSEADDR: %s\n", strerror(errno));
return s;
}
/* Send a token via the given connected socket */
void send_token(int s, long token)
{
ssize_t rc;
rc = send(s, &token, sizeof(token), 0);
if (rc < 0)
die("send(): %s\n", strerror(errno));
if (rc < sizeof(token))
die("short send()\n");
}
/* Attempt to receive a token via the given socket.
*
* Returns true if we received the token, false if we got an EAGAIN, dies in any
* other case */
bool recv_token(int s, long token)
{
ssize_t rc;
long buf;
rc = recv(s, &buf, sizeof(buf), MSG_DONTWAIT);
if (rc < 0) {
if (errno == EWOULDBLOCK)
return false;
die("recv(): %s\n", strerror(errno));
}
if (rc < sizeof(buf))
die("short recv()\n");
if (buf != token)
die("data mismatch\n");
return true;
}


@ -0,0 +1,47 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* common.h
*
* Useful shared functions
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#ifndef REUSEADDR_COMMON_H
#define REUSEADDR_COMMON_H
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
static inline void die(const char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
(void)vfprintf(stderr, fmt, ap);
va_end(ap);
exit(EXIT_FAILURE);
}
#if __BYTE_ORDER == __BIG_ENDIAN
#define htons_constant(x) (x)
#define htonl_constant(x) (x)
#else
#define htons_constant(x) (__bswap_constant_16(x))
#define htonl_constant(x) (__bswap_constant_32(x))
#endif
#define SOCKADDR_INIT(addr, port) \
{ \
.sin_family = AF_INET, \
.sin_addr = { .s_addr = htonl_constant(addr) }, \
.sin_port = htons_constant(port), \
}
int sock_reuseaddr(void);
void send_token(int s, long token);
bool recv_token(int s, long token);
#endif /* REUSEADDR_COMMON_H */


@ -0,0 +1,118 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* recv-zero.c
*
* Verify that we're able to discard datagrams by recv()ing into a zero-length
* buffer.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define DSTPORT 13257U
enum discard_method {
DISCARD_NULL_BUF,
DISCARD_ZERO_IOV,
DISCARD_NULL_IOV,
NUM_METHODS,
};
/* 127.0.0.1:DSTPORT */
static const struct sockaddr_in lo_dst = SOCKADDR_INIT(INADDR_LOOPBACK, DSTPORT);
static void test_discard(enum discard_method method)
{
struct iovec zero_iov = { .iov_base = NULL, .iov_len = 0, };
struct msghdr mh_zero = {
.msg_iov = &zero_iov,
.msg_iovlen = 1,
};
struct msghdr mh_null = {
.msg_iov = NULL,
.msg_iovlen = 0,
};
long token1, token2;
int recv_s, send_s;
ssize_t rc;
token1 = random();
token2 = random();
recv_s = sock_reuseaddr();
if (bind(recv_s, (struct sockaddr *)&lo_dst, sizeof(lo_dst)) < 0)
die("bind(): %s\n", strerror(errno));
send_s = sock_reuseaddr();
if (connect(send_s, (struct sockaddr *)&lo_dst, sizeof(lo_dst)) < 0)
die("connect(): %s\n", strerror(errno));
send_token(send_s, token1);
send_token(send_s, token2);
switch (method) {
case DISCARD_NULL_BUF:
/* cppcheck-suppress nullPointer */
rc = recv(recv_s, NULL, 0, MSG_DONTWAIT);
if (rc < 0)
die("discarding recv(): %s\n", strerror(errno));
break;
case DISCARD_ZERO_IOV:
rc = recvmsg(recv_s, &mh_zero, MSG_DONTWAIT);
if (rc < 0)
die("recvmsg() with zero-length buffer: %s\n",
strerror(errno));
if (!((unsigned)mh_zero.msg_flags & MSG_TRUNC))
die("Missing MSG_TRUNC flag\n");
break;
case DISCARD_NULL_IOV:
rc = recvmsg(recv_s, &mh_null, MSG_DONTWAIT);
if (rc < 0)
die("recvmsg() with zero-length iov: %s\n",
strerror(errno));
if (!((unsigned)mh_null.msg_flags & MSG_TRUNC))
die("Missing MSG_TRUNC flag\n");
break;
default:
die("Bad method\n");
}
recv_token(recv_s, token2);
/* cppcheck-suppress nullPointer */
rc = recv(recv_s, NULL, 0, MSG_DONTWAIT);
if (rc < 0 && errno != EAGAIN)
die("redundant discarding recv(): %s\n", strerror(errno));
if (rc >= 0)
die("Unexpected receive: rc=%zd\n", rc);
}
int main(int argc, char *argv[])
{
enum discard_method method;
(void)argc;
(void)argv;
for (method = 0; method < NUM_METHODS; method++)
test_discard(method);
printf("Discarding datagrams with 0-length receives seems to work\n");
exit(0);
}


@ -0,0 +1,240 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* reuseaddr-priority.c
*
* Verify which SO_REUSEADDR UDP sockets get priority to receive
* =============================================================
*
* SO_REUSEADDR allows multiple sockets to bind to overlapping addresses, so
* there can be multiple sockets eligible to receive the same packet. The exact
* semantics of which socket will receive in this circumstance isn't very well
* documented.
*
* This program verifies that things behave the way we expect. Specifically we
* expect:
*
* - If both a connected and an unconnected socket could receive a datagram, the
* connected one will receive it in preference to the unconnected one.
*
* - If an unconnected socket bound to a specific address and an unconnected
* socket bound to the "any" address (0.0.0.0 or ::) could receive a datagram,
* then the one with a specific address will receive it in preference to the
* other.
*
* These should be true regardless of the order the sockets are created in, or
* the order they're polled in.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define SRCPORT 13246U
#define DSTPORT 13247U
/* Different cases for receiving socket configuration */
enum sock_type {
/* Socket is bound to 0.0.0.0:DSTPORT and not connected */
SOCK_BOUND_ANY = 0,
/* Socket is bound to 127.0.0.1:DSTPORT and not connected */
SOCK_BOUND_LO = 1,
/* Socket is bound to 0.0.0.0:DSTPORT and connected to 127.0.0.1:SRCPORT */
SOCK_CONNECTED = 2,
NUM_SOCK_TYPES,
};
typedef enum sock_type order_t[NUM_SOCK_TYPES];
static order_t orders[] = {
{0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0},
};
/* 127.0.0.2 */
#define INADDR_LOOPBACK2 ((in_addr_t)(0x7f000002))
/* 0.0.0.0:DSTPORT */
static const struct sockaddr_in any_dst = SOCKADDR_INIT(INADDR_ANY, DSTPORT);
/* 127.0.0.1:DSTPORT */
static const struct sockaddr_in lo_dst = SOCKADDR_INIT(INADDR_LOOPBACK, DSTPORT);
/* 127.0.0.2:DSTPORT */
static const struct sockaddr_in lo2_dst = SOCKADDR_INIT(INADDR_LOOPBACK2, DSTPORT);
/* 127.0.0.1:SRCPORT */
static const struct sockaddr_in lo_src = SOCKADDR_INIT(INADDR_LOOPBACK, SRCPORT);
/* Random token to send in datagram */
static long token;
/* Get a socket of the specified type for receiving */
static int sock_recv(enum sock_type type)
{
const struct sockaddr *connect_sa = NULL;
const struct sockaddr *bind_sa = NULL;
int s;
s = sock_reuseaddr();
switch (type) {
case SOCK_CONNECTED:
connect_sa = (struct sockaddr *)&lo_src;
/* fallthrough */
case SOCK_BOUND_ANY:
bind_sa = (struct sockaddr *)&any_dst;
break;
case SOCK_BOUND_LO:
bind_sa = (struct sockaddr *)&lo_dst;
break;
default:
die("bug");
}
if (bind_sa)
if (bind(s, bind_sa, sizeof(struct sockaddr_in)) < 0)
die("bind(): %s\n", strerror(errno));
if (connect_sa)
if (connect(s, connect_sa, sizeof(struct sockaddr_in)) < 0)
die("connect(): %s\n", strerror(errno));
return s;
}
/* Get a socket suitable for sending to the given type of receiving socket */
static int sock_send(enum sock_type type)
{
const struct sockaddr *connect_sa = NULL;
const struct sockaddr *bind_sa = NULL;
int s;
s = sock_reuseaddr();
switch (type) {
case SOCK_BOUND_ANY:
connect_sa = (struct sockaddr *)&lo2_dst;
break;
case SOCK_CONNECTED:
bind_sa = (struct sockaddr *)&lo_src;
/* fallthrough */
case SOCK_BOUND_LO:
connect_sa = (struct sockaddr *)&lo_dst;
break;
default:
die("bug");
}
if (bind_sa)
if (bind(s, bind_sa, sizeof(struct sockaddr_in)) < 0)
die("bind(): %s\n", strerror(errno));
if (connect_sa)
if (connect(s, connect_sa, sizeof(struct sockaddr_in)) < 0)
die("connect(): %s\n", strerror(errno));
return s;
}
/* Check for expected behaviour with one specific ordering for various operations:
*
* @recv_create_order: Order to create receiving sockets in
* @send_create_order: Order to create sending sockets in
* @test_order: Order to test the behaviour of different types
* @recv_order: Order to check the receiving sockets
*/
static void check_one_order(const order_t recv_create_order,
const order_t send_create_order,
const order_t test_order,
const order_t recv_order)
{
int rs[NUM_SOCK_TYPES];
int ss[NUM_SOCK_TYPES];
int nfds = 0;
int i, j;
for (i = 0; i < NUM_SOCK_TYPES; i++) {
enum sock_type t = recv_create_order[i];
int s;
s = sock_recv(t);
if (s >= nfds)
nfds = s + 1;
rs[t] = s;
}
for (i = 0; i < NUM_SOCK_TYPES; i++) {
enum sock_type t = send_create_order[i];
ss[t] = sock_send(t);
}
for (i = 0; i < NUM_SOCK_TYPES; i++) {
enum sock_type ti = test_order[i];
int recv_via = -1;
send_token(ss[ti], token);
for (j = 0; j < NUM_SOCK_TYPES; j++) {
enum sock_type tj = recv_order[j];
if (recv_token(rs[tj], token)) {
if (recv_via != -1)
die("Received token more than once\n");
recv_via = tj;
}
}
if (recv_via == -1)
die("Didn't receive token at all\n");
if (recv_via != ti)
die("Received token via unexpected socket\n");
}
for (i = 0; i < NUM_SOCK_TYPES; i++) {
close(rs[i]);
close(ss[i]);
}
}
static void check_all_orders(void)
{
int norders = sizeof(orders) / sizeof(orders[0]);
int i, j, k, l;
for (i = 0; i < norders; i++)
for (j = 0; j < norders; j++)
for (k = 0; k < norders; k++)
for (l = 0; l < norders; l++)
check_one_order(orders[i], orders[j],
orders[k], orders[l]);
}
int main(int argc, char *argv[])
{
(void)argc;
(void)argv;
token = random();
check_all_orders();
printf("SO_REUSEADDR receive priorities seem to work as expected\n");
exit(0);
}


@ -0,0 +1,105 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* udp-close-dup.c
*
* Verify that closing one dup() of a UDP socket won't stop other dups from
* receiving packets.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define DSTPORT 13257U
/* 127.0.0.1:DSTPORT */
static const struct sockaddr_in lo_dst = SOCKADDR_INIT(INADDR_LOOPBACK, DSTPORT);
enum dup_method {
DUP_DUP,
DUP_FCNTL,
NUM_METHODS,
};
static void test_close_dup(enum dup_method method)
{
long token;
int s1, s2, send_s;
ssize_t rc;
s1 = sock_reuseaddr();
if (bind(s1, (struct sockaddr *)&lo_dst, sizeof(lo_dst)) < 0)
die("bind(): %s\n", strerror(errno));
send_s = sock_reuseaddr();
if (connect(send_s, (struct sockaddr *)&lo_dst, sizeof(lo_dst)) < 0)
die("connect(): %s\n", strerror(errno));
/* Receive before duplicating */
token = random();
send_token(send_s, token);
recv_token(s1, token);
switch (method) {
case DUP_DUP:
/* NOLINTNEXTLINE(android-cloexec-dup) */
s2 = dup(s1);
if (s2 < 0)
die("dup(): %s\n", strerror(errno));
break;
case DUP_FCNTL:
s2 = fcntl(s1, F_DUPFD_CLOEXEC, 0);
if (s2 < 0)
die("F_DUPFD_CLOEXEC: %s\n", strerror(errno));
break;
default:
die("Bad method\n");
}
/* Receive via original handle */
token = random();
send_token(send_s, token);
recv_token(s1, token);
/* Receive via duplicated handle */
token = random();
send_token(send_s, token);
recv_token(s2, token);
/* Close duplicate */
rc = close(s2);
if (rc < 0)
die("close() dup: %s\n", strerror(errno));
/* Receive after closing duplicate */
token = random();
send_token(send_s, token);
recv_token(s1, token);
}
int main(int argc, char *argv[])
{
enum dup_method method;
(void)argc;
(void)argv;
for (method = 0; method < NUM_METHODS; method++)
test_close_dup(method);
printf("Closing dup()ed UDP sockets seems to work as expected\n");
exit(0);
}
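
The property this test checks comes from dup() semantics: both descriptors refer to one open file description, and close() on one of them only drops a reference. A minimal sketch of the same idea follows. It assumes an AF_UNIX datagram socketpair stands in for the loopback UDP socket, so it needs no port; the helper name dup_close_keeps_original() is hypothetical and not part of the test suite.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of the passt tests.
 * Returns 1 if the original descriptor still receives after its duplicate
 * has been closed, 0 if it does not, -1 if setup failed.
 */
static int dup_close_keeps_original(void)
{
	char buf[8] = { 0 };
	int sv[2], d, ok = -1;
	ssize_t n;

	/* Assumption: a datagram socketpair replaces the UDP socket;
	 * sv[0] receives, sv[1] sends
	 */
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	d = dup(sv[0]);
	if (d < 0)
		goto out;

	/* Closing the duplicate drops one reference only */
	if (close(d) < 0)
		goto out;

	if (send(sv[1], "ping", 4, 0) != 4)
		goto out;

	/* The original descriptor must still see the datagram */
	n = recv(sv[0], buf, sizeof(buf), 0);
	ok = (n == 4 && memcmp(buf, "ping", 4) == 0);
out:
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

A caller could check it with assert(dup_close_keeps_original() == 1). The real test above is still needed, since it exercises the kernel's UDP receive path rather than a socketpair.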

43
epoll_type.h Normal file

@ -0,0 +1,43 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#ifndef EPOLL_TYPE_H
#define EPOLL_TYPE_H
/**
* enum epoll_type - Different types of fds we poll over
*/
enum epoll_type {
/* Special value to indicate an invalid type */
EPOLL_TYPE_NONE = 0,
/* Connected TCP sockets */
EPOLL_TYPE_TCP,
/* Connected TCP sockets (spliced) */
EPOLL_TYPE_TCP_SPLICE,
/* Listening TCP sockets */
EPOLL_TYPE_TCP_LISTEN,
/* timerfds used for TCP timers */
EPOLL_TYPE_TCP_TIMER,
/* UDP "listening" sockets */
EPOLL_TYPE_UDP_LISTEN,
/* UDP socket for replies on a specific flow */
EPOLL_TYPE_UDP_REPLY,
/* ICMP/ICMPv6 ping sockets */
EPOLL_TYPE_PING,
/* inotify fd watching for end of netns (pasta) */
EPOLL_TYPE_NSQUIT_INOTIFY,
/* timer fd watching for end of netns, fallback for inotify (pasta) */
EPOLL_TYPE_NSQUIT_TIMER,
/* tuntap character device */
EPOLL_TYPE_TAP_PASTA,
/* socket connected to qemu */
EPOLL_TYPE_TAP_PASST,
/* socket listening for qemu socket connections */
EPOLL_TYPE_TAP_LISTEN,
EPOLL_NUM_TYPES,
};
#endif /* EPOLL_TYPE_H */

697
flow.c

@ -5,9 +5,11 @@
* Tracking for logical "flows" of packets.
*/
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <string.h>
#include "util.h"
@ -18,10 +20,24 @@
#include "flow.h"
#include "flow_table.h"
const char *flow_state_str[] = {
[FLOW_STATE_FREE] = "FREE",
[FLOW_STATE_NEW] = "NEW",
[FLOW_STATE_INI] = "INI",
[FLOW_STATE_TGT] = "TGT",
[FLOW_STATE_TYPED] = "TYPED",
[FLOW_STATE_ACTIVE] = "ACTIVE",
};
static_assert(ARRAY_SIZE(flow_state_str) == FLOW_NUM_STATES,
"flow_state_str[] doesn't match enum flow_state");
const char *flow_type_str[] = {
[FLOW_TYPE_NONE] = "<none>",
[FLOW_TCP] = "TCP connection",
[FLOW_TCP_SPLICE] = "TCP connection (spliced)",
[FLOW_PING4] = "ICMP ping sequence",
[FLOW_PING6] = "ICMPv6 ping sequence",
[FLOW_UDP] = "UDP flow",
};
static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
"flow_type_str[] doesn't match enum flow_type");
@ -29,52 +45,15 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
const uint8_t flow_proto[] = {
[FLOW_TCP] = IPPROTO_TCP,
[FLOW_TCP_SPLICE] = IPPROTO_TCP,
[FLOW_PING4] = IPPROTO_ICMP,
[FLOW_PING6] = IPPROTO_ICMPV6,
[FLOW_UDP] = IPPROTO_UDP,
};
static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
"flow_proto[] doesn't match enum flow_type");
/* Global Flow Table */
/**
* DOC: Theory of Operation - flow entry life cycle
*
* An individual flow table entry moves through these logical states, usually in
* this order.
*
* FREE - Part of the general pool of free flow table entries
* Operations:
* - flow_alloc() finds an entry and moves it to ALLOC state
*
* ALLOC - A tentatively allocated entry
* Operations:
* - flow_alloc_cancel() returns the entry to FREE state
* - FLOW_START() set the entry's type and moves to START state
* Caveats:
* - It's not safe to write fields in the flow entry
* - It's not safe to allocate further entries with flow_alloc()
* - It's not safe to return to the main epoll loop (use FLOW_START()
* to move to START state before doing so)
* - It's not safe to use flow_*() logging functions
*
* START - An entry being prepared by flow type specific code
* Operations:
* - Flow type specific fields may be accessed
* - flow_*() logging functions
* - flow_alloc_cancel() returns the entry to FREE state
* Caveats:
* - Returning to the main epoll loop or allocating another entry
* with flow_alloc() implicitly moves the entry to ACTIVE state.
*
* ACTIVE - An active flow entry managed by flow type specific code
* Operations:
* - Flow type specific fields may be accessed
* - flow_*() logging functions
* - Flow may be expired by returning 'true' from flow type specific
* deferred or timer handler. This will return it to FREE state.
* Caveats:
* - It's not safe to call flow_alloc_cancel()
*/
/**
* DOC: Theory of Operation - allocating and freeing flow entries
*
@ -128,10 +107,156 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
unsigned flow_first_free;
union flow flowtab[FLOW_MAX];
static const union flow *flow_new_entry; /* = NULL */
/* Hash table to index it */
#define FLOW_HASH_LOAD 70 /* % */
#define FLOW_HASH_SIZE ((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD))
/* Table for lookup from flowside information */
static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE];
static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX,
"Safe linear probing requires hash table with more entries than the number of sides in the flow table");
/* Last time the flow timers ran */
static struct timespec flow_timer_run;
/** flowside_from_af() - Initialise flowside from addresses
* @side: flowside to initialise
* @af: Address family (AF_INET or AF_INET6)
* @eaddr: Endpoint address (pointer to in_addr or in6_addr)
* @eport: Endpoint port
* @oaddr: Our address (pointer to in_addr or in6_addr)
* @oport: Our port
*/
static void flowside_from_af(struct flowside *side, sa_family_t af,
const void *eaddr, in_port_t eport,
const void *oaddr, in_port_t oport)
{
if (oaddr)
inany_from_af(&side->oaddr, af, oaddr);
else
side->oaddr = inany_any6;
side->oport = oport;
if (eaddr)
inany_from_af(&side->eaddr, af, eaddr);
else
side->eaddr = inany_any6;
side->eport = eport;
}
/**
* struct flowside_sock_args - Parameters for flowside_sock_splice()
* @c: Execution context
* @fd: Filled in with new socket fd
* @err: Filled in with errno if something failed
* @type: Socket epoll type
* @sa: Socket address
* @sl: Length of @sa
* @data: epoll reference data
*/
struct flowside_sock_args {
const struct ctx *c;
int fd;
int err;
enum epoll_type type;
const struct sockaddr *sa;
socklen_t sl;
const char *path;
uint32_t data;
};
/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside
* @arg: Argument as a struct flowside_sock_args
*
* Return: 0
*/
static int flowside_sock_splice(void *arg)
{
struct flowside_sock_args *a = arg;
ns_enter(a->c);
a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
a->sa->sa_family == AF_INET6, a->data);
a->err = errno;
return 0;
}
/** flowside_sock_l4() - Create and bind socket based on flowside
* @c: Execution context
* @type: Socket epoll type
* @pif: Interface for this socket
* @tgt: Target flowside
* @data: epoll reference portion for protocol handlers
*
* Return: socket fd of type @type bound to our address and port from @tgt
* (if specified).
*/
int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const struct flowside *tgt, uint32_t data)
{
const char *ifname = NULL;
union sockaddr_inany sa;
socklen_t sl;
ASSERT(pif_is_socket(pif));
pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);
switch (pif) {
case PIF_HOST:
if (inany_is_loopback(&tgt->oaddr))
ifname = NULL;
else if (sa.sa_family == AF_INET)
ifname = c->ip4.ifname_out;
else if (sa.sa_family == AF_INET6)
ifname = c->ip6.ifname_out;
return sock_l4_sa(c, type, &sa, sl, ifname,
sa.sa_family == AF_INET6, data);
case PIF_SPLICE: {
struct flowside_sock_args args = {
.c = c, .type = type,
.sa = &sa.sa, .sl = sl, .data = data,
};
NS_CALL(flowside_sock_splice, &args);
errno = args.err;
return args.fd;
}
default:
/* If we add new socket pifs, they'll need to be implemented
* here
*/
ASSERT(0);
}
}
/** flowside_connect() - Connect a socket based on flowside
* @c: Execution context
* @s: Socket to connect
* @pif: Target pif
* @tgt: Target flowside
*
* Connect @s to the endpoint address and port from @tgt.
*
* Return: 0 on success, negative on error
*/
int flowside_connect(const struct ctx *c, int s,
uint8_t pif, const struct flowside *tgt)
{
union sockaddr_inany sa;
socklen_t sl;
pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
return connect(s, &sa.sa, sl);
}
/** flow_log_ - Log flow-related message
* @f: flow the message is related to
* @pri: Log priority
@ -140,6 +265,7 @@ static struct timespec flow_timer_run;
*/
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
{
const char *type_or_state;
char msg[BUFSIZ];
va_list args;
@ -147,40 +273,221 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
(void)vsnprintf(msg, sizeof(msg), fmt, args);
va_end(args);
logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
/* Show type if it's set, otherwise the state */
if (f->state < FLOW_STATE_TYPED)
type_or_state = FLOW_STATE(f);
else
type_or_state = FLOW_TYPE(f);
logmsg(true, false, pri,
"Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
}
/** flow_log_details_() - Log the details of a flow
* @f: flow to log
* @pri: Log priority
* @state: State to log details according to
*
* Logs the details of the flow: endpoints, interfaces, type etc.
*/
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state)
{
char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
const struct flowside *ini = &f->side[INISIDE];
const struct flowside *tgt = &f->side[TGTSIDE];
if (state >= FLOW_STATE_TGT)
flow_log_(f, pri,
"%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
ini->oport,
pif_name(f->pif[TGTSIDE]),
inany_ntop(&tgt->oaddr, fstr1, sizeof(fstr1)),
tgt->oport,
inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
tgt->eport);
else if (state >= FLOW_STATE_INI)
flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
ini->oport);
}
/**
* flow_start() - Set flow type for new flow and log
* @flow: Flow to set type for
* @type: Type for new flow
* @iniside: Which side initiated the new flow
*
* Return: @flow
*
* Should be called before setting any flow type specific fields in the flow
* table entry.
* flow_set_state() - Change flow's state
* @f: Flow changing state
* @state: New state
*/
union flow *flow_start(union flow *flow, enum flow_type type,
unsigned iniside)
static void flow_set_state(struct flow_common *f, enum flow_state state)
{
(void)iniside;
flow->f.type = type;
flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES);
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
}
/**
* flow_initiate_() - Move flow to INI, setting pif[INISIDE]
* @flow: Flow to change state
* @pif: pif of the initiating side
*/
static void flow_initiate_(union flow *flow, uint8_t pif)
{
struct flow_common *f = &flow->f;
ASSERT(pif != PIF_NONE);
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
ASSERT(f->type == FLOW_TYPE_NONE);
ASSERT(f->pif[INISIDE] == PIF_NONE && f->pif[TGTSIDE] == PIF_NONE);
f->pif[INISIDE] = pif;
flow_set_state(f, FLOW_STATE_INI);
}
/**
* flow_initiate_af() - Move flow to INI, setting INISIDE details
* @flow: Flow to change state
* @pif: pif of the initiating side
* @af: Address family of @saddr and @daddr
* @saddr: Source address (pointer to in_addr or in6_addr)
* @sport: Source port
* @daddr: Destination address (pointer to in_addr or in6_addr)
* @dport: Destination port
*
* Return: pointer to the initiating flowside information
*/
const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport)
{
struct flowside *ini = &flow->f.side[INISIDE];
flowside_from_af(ini, af, saddr, sport, daddr, dport);
flow_initiate_(flow, pif);
return ini;
}
/**
* flow_initiate_sa() - Move flow to INI, setting INISIDE details
* @flow: Flow to change state
* @pif: pif of the initiating side
* @ssa: Source socket address
* @dport: Destination port
*
* Return: pointer to the initiating flowside information
*/
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport)
{
struct flowside *ini = &flow->f.side[INISIDE];
inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
if (inany_v4(&ini->eaddr))
ini->oaddr = inany_any4;
else
ini->oaddr = inany_any6;
ini->oport = dport;
flow_initiate_(flow, pif);
return ini;
}
/**
* flow_target() - Determine where flow should forward to, and move to TGT
* @c: Execution context
* @flow: Flow to forward
* @proto: Protocol
*
* Return: pointer to the target flowside information
*/
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
{
char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
struct flow_common *f = &flow->f;
const struct flowside *ini = &f->side[INISIDE];
struct flowside *tgt = &f->side[TGTSIDE];
uint8_t tgtpif = PIF_NONE;
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI);
ASSERT(f->type == FLOW_TYPE_NONE);
ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE);
ASSERT(flow->f.state == FLOW_STATE_INI);
switch (f->pif[INISIDE]) {
case PIF_TAP:
tgtpif = fwd_nat_from_tap(c, proto, ini, tgt);
break;
case PIF_SPLICE:
tgtpif = fwd_nat_from_splice(c, proto, ini, tgt);
break;
case PIF_HOST:
tgtpif = fwd_nat_from_host(c, proto, ini, tgt);
break;
default:
flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr, sizeof(estr)),
ini->eport,
inany_ntop(&ini->oaddr, fstr, sizeof(fstr)),
ini->oport);
}
if (tgtpif == PIF_NONE)
return NULL;
f->pif[TGTSIDE] = tgtpif;
flow_set_state(f, FLOW_STATE_TGT);
return tgt;
}
/**
* flow_set_type() - Set type and move to TYPED
* @flow: Flow to change state
* @type: Type for new flow
*
* Return: @flow
*/
union flow *flow_set_type(union flow *flow, enum flow_type type)
{
struct flow_common *f = &flow->f;
ASSERT(type != FLOW_TYPE_NONE);
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_TGT);
ASSERT(f->type == FLOW_TYPE_NONE);
ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] != PIF_NONE);
f->type = type;
flow_set_state(f, FLOW_STATE_TYPED);
return flow;
}
/**
* flow_end() - Clear flow type for finished flow and log
* @flow: Flow to clear
* flow_activate() - Move flow to ACTIVE
* @f: Flow to change state
*/
static void flow_end(union flow *flow)
void flow_activate(struct flow_common *f)
{
if (flow->f.type == FLOW_TYPE_NONE)
return; /* Nothing to do */
ASSERT(&flow_new_entry->f == f && f->state == FLOW_STATE_TYPED);
ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] != PIF_NONE);
flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
flow->f.type = FLOW_TYPE_NONE;
flow_set_state(f, FLOW_STATE_ACTIVE);
flow_new_entry = NULL;
}
/**
@ -192,9 +499,12 @@ union flow *flow_alloc(void)
{
union flow *flow = &flowtab[flow_first_free];
ASSERT(!flow_new_entry);
if (flow_first_free >= FLOW_MAX)
return NULL;
ASSERT(flow->f.state == FLOW_STATE_FREE);
ASSERT(flow->f.type == FLOW_TYPE_NONE);
ASSERT(flow->free.n >= 1);
ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);
@ -217,7 +527,10 @@ union flow *flow_alloc(void)
flow_first_free = flow->free.next;
}
flow_new_entry = flow;
memset(flow, 0, sizeof(*flow));
flow_set_state(&flow->f, FLOW_STATE_NEW);
return flow;
}
@ -229,15 +542,228 @@ union flow *flow_alloc(void)
*/
void flow_alloc_cancel(union flow *flow)
{
ASSERT(flow_new_entry == flow);
ASSERT(flow->f.state == FLOW_STATE_NEW ||
flow->f.state == FLOW_STATE_INI ||
flow->f.state == FLOW_STATE_TGT ||
flow->f.state == FLOW_STATE_TYPED);
ASSERT(flow_first_free > FLOW_IDX(flow));
flow_end(flow);
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
/* Put it back in a length 1 free cluster, don't attempt to fully
* reverse flow_alloc()'s steps. This will get folded together the next
* time flow_defer_handler() runs anyway */
flow->free.n = 1;
flow->free.next = flow_first_free;
flow_first_free = FLOW_IDX(flow);
flow_new_entry = NULL;
}
/**
* flow_hash() - Calculate hash value for one side of a flow
* @c: Execution context
* @proto: Protocol of this flow (IP L4 protocol number)
* @pif: pif of the side to hash
* @side: Flowside (must not have unspecified parts)
*
* Return: hash value
*/
static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
const struct flowside *side)
{
struct siphash_state state = SIPHASH_INIT(c->hash_secret);
inany_siphash_feed(&state, &side->oaddr);
inany_siphash_feed(&state, &side->eaddr);
return siphash_final(&state, 38, (uint64_t)proto << 40 |
(uint64_t)pif << 32 |
(uint64_t)side->oport << 16 |
(uint64_t)side->eport);
}
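
The last argument to siphash_final() above packs the four small fields into one 64-bit word with no overlap: eport in bits 0-15, oport in bits 16-31, pif in bits 32-39 and proto in bits 40-47. A minimal sketch of just that packing follows; pack_tail() is a hypothetical name used for illustration. The length argument 38 would match two 16-byte addresses plus these 6 bytes, but that reading assumes a 16-byte union inany_addr, which this diff does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the expression passed to siphash_final():
 * bits 40-47 proto, 32-39 pif, 16-31 oport, 0-15 eport.
 */
static uint64_t pack_tail(uint8_t proto, uint8_t pif,
			  uint16_t oport, uint16_t eport)
{
	/* Widen each field before shifting so no bits are lost */
	return (uint64_t)proto << 40 |
	       (uint64_t)pif << 32 |
	       (uint64_t)oport << 16 |
	       (uint64_t)eport;
}
```

For example, pack_tail(6, 1, 0x1234, 0x5678) yields 0x0000060112345678, so two sides that differ in any one of the four fields feed a different final word to the hash.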
/**
* flow_sidx_hash() - Calculate hash value for given side of a given flow
* @c: Execution context
* @sidx: Flow & side index to get hash for
*
* Return: hash value, of the flow & side represented by @sidx
*/
static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
{
const struct flow_common *f = &flow_at_sidx(sidx)->f;
const struct flowside *side = &f->side[sidx.sidei];
uint8_t pif = f->pif[sidx.sidei];
/* For the hash table to work, entries must have complete endpoint
* information, and at least a forwarding port.
*/
ASSERT(pif != PIF_NONE && !inany_is_unspecified(&side->eaddr) &&
side->eport != 0 && side->oport != 0);
return flow_hash(c, FLOW_PROTO(f), pif, side);
}
/**
* flow_hash_probe_() - Find hash bucket for a flow, given hash
* @hash: Raw hash value for flow & side
* @sidx: Flow and side to find bucket for
*
* Return: If @sidx is in the hash table, its current bucket, otherwise a
* suitable free bucket for it.
*/
static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx)
{
unsigned b = hash % FLOW_HASH_SIZE;
/* Linear probing */
while (flow_sidx_valid(flow_hashtab[b]) &&
!flow_sidx_eq(flow_hashtab[b], sidx))
b = mod_sub(b, 1, FLOW_HASH_SIZE);
return b;
}
/**
* flow_hash_probe() - Find hash bucket for a flow
* @c: Execution context
* @sidx: Flow and side to find bucket for
*
* Return: If @sidx is in the hash table, its current bucket, otherwise a
* suitable free bucket for it.
*/
static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
{
return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx);
}
/**
* flow_hash_insert() - Insert side of a flow into the hash table
* @c: Execution context
* @sidx: Flow & side index
*
* Return: raw (un-modded) hash value of side of flow
*/
uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
{
uint64_t hash = flow_sidx_hash(c, sidx);
unsigned b = flow_hash_probe_(hash, sidx);
flow_hashtab[b] = sidx;
flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u",
sidx.sidei, b);
return hash;
}
/**
* flow_hash_remove() - Drop side of a flow from the hash table
* @c: Execution context
* @sidx: Side of flow to remove
*/
void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx)
{
unsigned b = flow_hash_probe(c, sidx), s;
if (!flow_sidx_valid(flow_hashtab[b]))
return; /* Redundant remove */
flow_dbg(flow_at_sidx(sidx), "Side %u hash table remove: bucket: %u",
sidx.sidei, b);
/* Scan the remainder of the cluster */
for (s = mod_sub(b, 1, FLOW_HASH_SIZE);
flow_sidx_valid(flow_hashtab[s]);
s = mod_sub(s, 1, FLOW_HASH_SIZE)) {
unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE;
if (!mod_between(h, s, b, FLOW_HASH_SIZE)) {
/* flow_hashtab[s] can live in flow_hashtab[b]'s slot */
debug("hash table remove: shuffle %u -> %u", s, b);
flow_hashtab[b] = flow_hashtab[s];
b = s;
}
}
flow_hashtab[b] = FLOW_SIDX_NONE;
}
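
flow_hash_probe_() and flow_hash_remove() implement open addressing with downward linear probing, and removal shuffles later entries of the cluster back instead of leaving tombstones. The sketch below shows the same technique on a toy table of non-negative int keys. Everything in it is an assumption made for illustration: the demo_* names, the 8-slot table, the key-modulo hash, and the local definitions of demo_mod_sub() and demo_mod_between(), which stand in for mod_sub() and mod_between() whose definitions are not part of this diff. The table must never be completely full, or the probe loop would not terminate.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_SIZE	8u
#define DEMO_EMPTY	(-1)

static int demo_tab[DEMO_SIZE];

/* (x - i) modulo m, for x, i < m (assumed semantics of mod_sub()) */
static unsigned demo_mod_sub(unsigned x, unsigned i, unsigned m)
{
	return (x + m - i) % m;
}

/* True if x lies in [i, j) modulo m (assumed semantics of mod_between()) */
static bool demo_mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
{
	return demo_mod_sub(x, i, m) < demo_mod_sub(j, i, m);
}

static void demo_init(void)
{
	unsigned b;

	for (b = 0; b < DEMO_SIZE; b++)
		demo_tab[b] = DEMO_EMPTY;
}

/* Bucket holding @key, or the free bucket where it would be inserted */
static unsigned demo_probe(int key)
{
	unsigned b = (unsigned)key % DEMO_SIZE;

	while (demo_tab[b] != DEMO_EMPTY && demo_tab[b] != key)
		b = demo_mod_sub(b, 1, DEMO_SIZE);	/* probe downwards */
	return b;
}

static void demo_insert(int key)
{
	demo_tab[demo_probe(key)] = key;
}

static bool demo_contains(int key)
{
	return demo_tab[demo_probe(key)] == key;
}

static void demo_remove(int key)
{
	unsigned b = demo_probe(key), s;

	if (demo_tab[b] == DEMO_EMPTY)
		return;	/* not present */

	/* Scan the rest of the cluster, pulling back any entry whose home
	 * bucket is not in [s, b): its probe sequence passes through b
	 */
	for (s = demo_mod_sub(b, 1, DEMO_SIZE);
	     demo_tab[s] != DEMO_EMPTY;
	     s = demo_mod_sub(s, 1, DEMO_SIZE)) {
		unsigned h = (unsigned)demo_tab[s] % DEMO_SIZE;

		if (!demo_mod_between(h, s, b, DEMO_SIZE)) {
			demo_tab[b] = demo_tab[s];
			b = s;
		}
	}
	demo_tab[b] = DEMO_EMPTY;
}
```

With keys 3, 11 and 19 (all hashing to bucket 3) inserted in that order, removing 3 moves 11 into bucket 3 and 19 into bucket 2, so both stay reachable from their home bucket without a tombstone.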
/**
* flowside_lookup() - Look for a matching flowside in the flow table
* @c: Execution context
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: pif to look for in the table
* @side: Flowside to look for in the table
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
uint8_t pif, const struct flowside *side)
{
flow_sidx_t sidx;
union flow *flow;
unsigned b;
b = flow_hash(c, proto, pif, side) % FLOW_HASH_SIZE;
while ((sidx = flow_hashtab[b], flow = flow_at_sidx(sidx)) &&
!(FLOW_PROTO(&flow->f) == proto &&
flow->f.pif[sidx.sidei] == pif &&
flowside_eq(&flow->f.side[sidx.sidei], side)))
b = mod_sub(b, 1, FLOW_HASH_SIZE);
return flow_hashtab[b];
}
/**
* flow_lookup_af() - Look up a flow given addressing information
* @c: Execution context
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: Interface of the flow
* @af: Address family, AF_INET or AF_INET6
* @eaddr: Guest side endpoint address (guest local address)
* @oaddr: Our guest side address (guest remote address)
* @eport: Guest side endpoint port (guest local port)
* @oport: Our guest side port (guest remote port)
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
flow_sidx_t flow_lookup_af(const struct ctx *c,
uint8_t proto, uint8_t pif, sa_family_t af,
const void *eaddr, const void *oaddr,
in_port_t eport, in_port_t oport)
{
struct flowside side;
flowside_from_af(&side, af, eaddr, eport, oaddr, oport);
return flowside_lookup(c, proto, pif, &side);
}
/**
* flow_lookup_sa() - Look up a flow given an endpoint socket address
* @c: Execution context
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: Interface of the flow
* @esa: Socket address of the endpoint
* @oport: Our port number
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport)
{
struct flowside side = {
.oport = oport,
};
inany_from_sockaddr(&side.eaddr, &side.eport, esa);
if (inany_v4(&side.eaddr))
side.oaddr = inany_any4;
else
side.oaddr = inany_any6;
return flowside_lookup(c, proto, pif, &side);
}
/**
@ -257,11 +783,14 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
flow_timer_run = *now;
}
ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
for (idx = 0; idx < FLOW_MAX; idx++) {
union flow *flow = &flowtab[idx];
bool closed = false;
if (flow->f.type == FLOW_TYPE_NONE) {
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
@ -283,17 +812,43 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
/* Nothing to do */
break;
default:
ASSERT(false);
}
switch (flow->f.type) {
case FLOW_TYPE_NONE:
ASSERT(false);
break;
case FLOW_TCP:
closed = tcp_flow_defer(flow);
closed = tcp_flow_defer(&flow->tcp);
break;
case FLOW_TCP_SPLICE:
closed = tcp_splice_flow_defer(flow);
closed = tcp_splice_flow_defer(&flow->tcp_splice);
if (!closed && timer)
tcp_splice_timer(c, flow);
tcp_splice_timer(c, &flow->tcp_splice);
break;
case FLOW_PING4:
case FLOW_PING6:
if (timer)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
closed = udp_flow_defer(&flow->udp);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
default:
/* Assume other flow types don't need any handling */
@ -301,7 +856,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
}
if (closed) {
flow_end(flow);
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
if (free_head) {
/* Add slot to current free cluster */
@ -328,7 +884,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
*/
void flow_init(void)
{
unsigned b;
/* Initial state is a single free cluster containing the whole table */
flowtab[0].free.n = FLOW_MAX;
flowtab[0].free.next = FLOW_MAX;
for (b = 0; b < FLOW_HASH_SIZE; b++)
flow_hashtab[b] = FLOW_SIDX_NONE;
}

199
flow.h

@ -9,6 +9,98 @@
#define FLOW_TIMER_INTERVAL 1000 /* ms */
/**
* enum flow_state - States of a flow table entry
*
* An individual flow table entry moves through these states, usually in this
* order.
* General rules:
* - Code outside flow.c should never write common fields of union flow.
* - The state field may always be read.
*
* FREE - Part of the general pool of free flow table entries
* Operations:
* - flow_alloc() finds an entry and moves it to NEW
*
* NEW - Freshly allocated, uninitialised entry
* Operations:
* - flow_alloc_cancel() returns the entry to FREE
* - flow_initiate_af() or flow_initiate_sa() sets the entry's INISIDE
* details and moves to INI
* - FLOW_SET_TYPE() sets the entry's type and moves to TYPED
* Caveats:
* - No fields other than state may be accessed
* - At most one entry may be NEW, INI, TGT or TYPED at a time, so
* it's unsafe to use flow_alloc() again until this entry moves to
* ACTIVE or FREE
* - You may not return to the main epoll loop while any flow is NEW
*
* INI - An entry with INISIDE common information completed
* Operations:
* - Common fields related to INISIDE may be read
* - flow_alloc_cancel() returns the entry to FREE
* - flow_target() sets the entry's TGTSIDE details and moves to TGT
* Caveats:
* - Other common fields may not be read
* - Type specific fields may not be read or written
* - At most one entry may be NEW, INI, TGT or TYPED at a time, so
* it's unsafe to use flow_alloc() again until this entry moves to
* ACTIVE or FREE
* - You may not return to the main epoll loop while any flow is INI
*
* TGT - An entry with only INISIDE and TGTSIDE common information completed
* Operations:
* - Common fields related to INISIDE & TGTSIDE may be read
* - flow_alloc_cancel() returns the entry to FREE
* - FLOW_SET_TYPE() sets the entry's type and moves to TYPED
* Caveats:
* - Other common fields may not be read
* - Type specific fields may not be read or written
* - At most one entry may be NEW, INI, TGT or TYPED at a time, so
* it's unsafe to use flow_alloc() again until this entry moves to
* ACTIVE or FREE
* - You may not return to the main epoll loop while any flow is TGT
*
* TYPED - Generic info initialised, type specific initialisation underway
* Operations:
* - All common fields may be read
* - Type specific fields may be read and written
* - flow_alloc_cancel() returns the entry to FREE
* - FLOW_ACTIVATE() moves the entry to ACTIVE
* Caveats:
* - At most one entry may be NEW, INI, TGT or TYPED at a time, so
* it's unsafe to use flow_alloc() again until this entry moves to
* ACTIVE or FREE
* - You may not return to the main epoll loop while any flow is
* TYPED
*
* ACTIVE - An active, fully-initialised flow entry
* Operations:
* - All common fields may be read
* - Type specific fields may be read and written
* - Flow returns to FREE when it expires, signalled by returning
* 'true' from flow type specific deferred or timer handler
* Caveats:
* - flow_alloc_cancel() may not be called on it
*/
enum flow_state {
FLOW_STATE_FREE,
FLOW_STATE_NEW,
FLOW_STATE_INI,
FLOW_STATE_TGT,
FLOW_STATE_TYPED,
FLOW_STATE_ACTIVE,
FLOW_NUM_STATES,
};
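
The life cycle described above can be condensed into a transition check. The sketch below is an assumption-laden illustration, not code from this series: the demo_* names are hypothetical, and the allowed transitions are taken from the ASSERT()s in flow.c (flow_alloc(), flow_initiate_(), flow_target(), flow_set_type(), flow_activate(), flow_alloc_cancel() and the deferred handler). Note that the comment lists FLOW_SET_TYPE() as an operation on NEW entries, while flow_set_type() asserts FLOW_STATE_TGT; the sketch follows the assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of enum flow_state, for illustration only */
enum demo_state {
	DEMO_FREE,
	DEMO_NEW,
	DEMO_INI,
	DEMO_TGT,
	DEMO_TYPED,
	DEMO_ACTIVE,
	DEMO_NUM_STATES,
};

/* True if moving from @from to @to matches the assertions in flow.c */
static bool demo_transition_ok(enum demo_state from, enum demo_state to)
{
	switch (to) {
	case DEMO_NEW:		/* flow_alloc() */
		return from == DEMO_FREE;
	case DEMO_INI:		/* flow_initiate_af(), flow_initiate_sa() */
		return from == DEMO_NEW;
	case DEMO_TGT:		/* flow_target() */
		return from == DEMO_INI;
	case DEMO_TYPED:	/* flow_set_type() */
		return from == DEMO_TGT;
	case DEMO_ACTIVE:	/* flow_activate() */
		return from == DEMO_TYPED;
	case DEMO_FREE:		/* flow_alloc_cancel(), or expiry if ACTIVE */
		return from != DEMO_FREE && from < DEMO_NUM_STATES;
	default:
		return false;
	}
}
```

For example, demo_transition_ok(DEMO_TYPED, DEMO_ACTIVE) is true, while demo_transition_ok(DEMO_ACTIVE, DEMO_NEW) is false because an active entry must pass through FREE before it can be reallocated.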
#define FLOW_STATE_BITS 8
static_assert(FLOW_NUM_STATES <= (1 << FLOW_STATE_BITS),
"Too many flow states for FLOW_STATE_BITS");
extern const char *flow_state_str[];
#define FLOW_STATE(f) \
((f)->state < FLOW_NUM_STATES ? flow_state_str[(f)->state] : "?")
/**
* enum flow_type - Different types of packet flows we track
*/
@ -19,9 +111,18 @@ enum flow_type {
FLOW_TCP,
/* A TCP connection between a host socket and ns socket */
FLOW_TCP_SPLICE,
/* ICMP echo requests from guest to host and matching replies back */
FLOW_PING4,
/* ICMPv6 echo requests from guest to host and matching replies back */
FLOW_PING6,
/* UDP pseudo-connection */
FLOW_UDP,
FLOW_NUM_TYPES,
};
#define FLOW_TYPE_BITS 8
static_assert(FLOW_NUM_TYPES <= (1 << FLOW_TYPE_BITS),
"Too many flow types for FLOW_TYPE_BITS");
extern const char *flow_type_str[];
#define FLOW_TYPE(f) \
@ -31,12 +132,66 @@ extern const uint8_t flow_proto[];
#define FLOW_PROTO(f) \
((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
#define SIDES 2
#define INISIDE 0 /* Initiating side index */
#define TGTSIDE 1 /* Target side index */
/**
* struct flowside - Address information for one side of a flow
* @eaddr: Endpoint address (remote address from passt's PoV)
* @oaddr: Our address (local address from passt's PoV)
* @eport: Endpoint port
* @oport: Our port
*/
struct flowside {
union inany_addr oaddr;
union inany_addr eaddr;
in_port_t oport;
in_port_t eport;
};
/**
* flowside_eq() - Check if two flowsides are equal
* @left, @right: Flowsides to compare
*
* Return: true if equal, false otherwise
*/
static inline bool flowside_eq(const struct flowside *left,
const struct flowside *right)
{
return inany_equals(&left->eaddr, &right->eaddr) &&
left->eport == right->eport &&
inany_equals(&left->oaddr, &right->oaddr) &&
left->oport == right->oport;
}
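As a rough, self-contained sketch of the field-by-field comparison done by flowside_eq(): the version below replaces union inany_addr with a plain struct in6_addr and uses memcmp() where the patch uses inany_equals(). Both substitutions, and the demo_* names, are assumptions made only to keep the example standalone.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* Simplified stand-in for struct flowside (in6_addr instead of inany_addr) */
struct demo_side {
	struct in6_addr oaddr;	/* Our address */
	struct in6_addr eaddr;	/* Endpoint address */
	in_port_t oport;	/* Our port */
	in_port_t eport;	/* Endpoint port */
};

/* Compare each member explicitly, in the same order as flowside_eq() */
static bool demo_side_eq(const struct demo_side *left,
			 const struct demo_side *right)
{
	return !memcmp(&left->eaddr, &right->eaddr, sizeof(left->eaddr)) &&
	       left->eport == right->eport &&
	       !memcmp(&left->oaddr, &right->oaddr, sizeof(left->oaddr)) &&
	       left->oport == right->oport;
}
```

Comparing member by member, rather than with a single memcmp() over the whole structure, avoids depending on the contents of any padding bytes the compiler might add.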
int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const struct flowside *tgt, uint32_t data);
int flowside_connect(const struct ctx *c, int s,
uint8_t pif, const struct flowside *tgt);
/**
* struct flow_common - Common fields for packet flows
* @state: State of the flow table entry
* @type: Type of packet flow
* @pif[]: Interface for each side of the flow
* @side[]: Information for each side of the flow
*/
struct flow_common {
#ifdef __GNUC__
enum flow_state state:FLOW_STATE_BITS;
enum flow_type type:FLOW_TYPE_BITS;
#else
uint8_t state;
static_assert(sizeof(uint8_t) * 8 >= FLOW_STATE_BITS,
"Not enough bits for state field");
uint8_t type;
static_assert(sizeof(uint8_t) * 8 >= FLOW_TYPE_BITS,
"Not enough bits for type field");
#endif
uint8_t pif[SIDES];
struct flowside side[SIDES];
};
#define FLOW_INDEX_BITS 17 /* 128k - 1 */
@ -45,24 +200,30 @@ struct flow_common {
#define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */
#define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
union flow *flow_start(union flow *flow, enum flow_type type,
unsigned iniside);
#define FLOW_START(flow_, t_, var_, i_) \
(&flow_start((flow_), (t_), (i_))->var_)
/**
* struct flow_sidx - ID for one side of a specific flow
* @side: Side referenced (0 or 1)
* @flow: Index of flow referenced
* @sidei: Index of side referenced (0 or 1)
* @flowi: Index of flow referenced
*/
typedef struct flow_sidx {
unsigned side :1;
unsigned flow :FLOW_INDEX_BITS;
unsigned sidei :1;
unsigned flowi :FLOW_INDEX_BITS;
} flow_sidx_t;
static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
"flow_sidx_t must fit within 32 bits");
#define FLOW_SIDX_NONE ((flow_sidx_t){ .flow = FLOW_MAX })
#define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX })
/**
* flow_sidx_valid() - Test if a sidx is valid
* @sidx: sidx value
*
* Return: true if @sidx refers to a valid flow & side
*/
static inline bool flow_sidx_valid(flow_sidx_t sidx)
{
return sidx.flowi < FLOW_MAX;
}
/**
* flow_sidx_eq() - Test if two sidx values are equal
@ -72,9 +233,18 @@ static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
*/
static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
{
return (a.flow == b.flow) && (a.side == b.side);
return (a.flowi == b.flowi) && (a.sidei == b.sidei);
}
uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
flow_sidx_t flow_lookup_af(const struct ctx *c,
uint8_t proto, uint8_t pif, sa_family_t af,
const void *eaddr, const void *oaddr,
in_port_t eport, in_port_t oport);
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport);
union flow;
void flow_init(void);
@ -94,4 +264,11 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
flow_dbg((f), __VA_ARGS__); \
} while (0)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state);
#define flow_log_details(f_, pri) \
flow_log_details_(&((f_)->f), (pri), (f_)->f.state)
#define flow_dbg_details(f_) flow_log_details((f_), LOG_DEBUG)
#define flow_err_details(f_) flow_log_details((f_), LOG_ERR)
#endif /* FLOW_H */


@ -8,6 +8,8 @@
#define FLOW_TABLE_H
#include "tcp_conn.h"
#include "icmp_flow.h"
#include "udp_flow.h"
/**
* struct flow_free_cluster - Information about a cluster of free entries
@ -33,14 +35,22 @@ union flow {
struct flow_free_cluster free;
struct tcp_tap_conn tcp;
struct tcp_splice_conn tcp_splice;
struct icmp_ping_flow ping;
struct udp_flow udp;
};
/* Global Flow Table */
extern unsigned flow_first_free;
extern union flow flowtab[];
/**
* flow_foreach_sidei() - 'for' type macro to step through each side of flow
* @sidei_: Takes value INISIDE, then TGTSIDE
*/
#define flow_foreach_sidei(sidei_) \
for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)
/** flow_idx - Index of flow from common structure
/** flow_idx() - Index of flow from common structure
* @f: Common flow fields pointer
*
* Return: index of @f in the flow table
@ -50,59 +60,122 @@ static inline unsigned flow_idx(const struct flow_common *f)
return (union flow *)f - flowtab;
}
/** FLOW_IDX - Find the index of a flow
/** FLOW_IDX() - Find the index of a flow
* @f_: Flow pointer, either union flow * or protocol specific
*
* Return: index of @f in the flow table
*/
#define FLOW_IDX(f_) (flow_idx(&(f_)->f))
/** FLOW - Flow entry at a given index
/** FLOW() - Flow entry at a given index
* @idx: Flow index
*
* Return: pointer to entry @idx in the flow table
*/
#define FLOW(idx) (&flowtab[(idx)])
/** flow_at_sidx - Flow entry for a given sidx
/** flow_at_sidx() - Flow entry for a given sidx
* @sidx: Flow & side index
*
* Return: pointer to the corresponding flow entry, or NULL
*/
static inline union flow *flow_at_sidx(flow_sidx_t sidx)
{
if (sidx.flow >= FLOW_MAX)
if (!flow_sidx_valid(sidx))
return NULL;
return FLOW(sidx.flow);
return FLOW(sidx.flowi);
}
/** flow_sidx_t - Index of one side of a flow from common structure
/** pif_at_sidx() - Interface for a given flow and side
* @sidx: Flow & side index
*
* Return: pif for the flow & side given by @sidx
*/
static inline uint8_t pif_at_sidx(flow_sidx_t sidx)
{
const union flow *flow = flow_at_sidx(sidx);
if (!flow)
return PIF_NONE;
return flow->f.pif[sidx.sidei];
}
/** flowside_at_sidx() - Retrieve a specific flowside
* @sidx: Flow & side index
*
* Return: Flowside for the flow & side given by @sidx
*/
static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx)
{
const union flow *flow = flow_at_sidx(sidx);
if (!flow)
return NULL;
return &flow->f.side[sidx.sidei];
}
/** flow_sidx_opposite() - Get the other side of the same flow
* @sidx: Flow & side index
*
* Return: sidx for the other side of the same flow as @sidx
*/
static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx)
{
if (!flow_sidx_valid(sidx))
return FLOW_SIDX_NONE;
return (flow_sidx_t){.flowi = sidx.flowi, .sidei = !sidx.sidei};
}
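The sidx helpers above (validity test, equality, opposite side) can be exercised on their own with a sketch like the one below. The demo_* names are hypothetical, and the value of DEMO_FLOW_MAX is an assumption based on the "128k - 1" comment next to FLOW_INDEX_BITS; the real FLOW_MAX definition is not visible in this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INDEX_BITS	17
/* Assumption: the largest index value doubles as the "no flow" marker */
#define DEMO_FLOW_MAX	((1U << DEMO_INDEX_BITS) - 1)

/* One side bit plus a flow index, packed into a single word */
typedef struct demo_sidx {
	unsigned sidei :1;
	unsigned flowi :DEMO_INDEX_BITS;
} demo_sidx_t;

static_assert(sizeof(demo_sidx_t) <= sizeof(uint32_t),
	      "demo_sidx_t must fit within 32 bits");

#define DEMO_SIDX_NONE ((demo_sidx_t){ .flowi = DEMO_FLOW_MAX })

static bool demo_sidx_valid(demo_sidx_t sidx)
{
	return sidx.flowi < DEMO_FLOW_MAX;
}

static bool demo_sidx_eq(demo_sidx_t a, demo_sidx_t b)
{
	return (a.flowi == b.flowi) && (a.sidei == b.sidei);
}

/* Same flow, other side; an invalid sidx stays invalid */
static demo_sidx_t demo_sidx_opposite(demo_sidx_t sidx)
{
	if (!demo_sidx_valid(sidx))
		return DEMO_SIDX_NONE;

	return (demo_sidx_t){ .flowi = sidx.flowi, .sidei = !sidx.sidei };
}
```

Keeping the whole identifier within 32 bits is what allows it to be stored in an epoll reference, as the ICMP code later in this diff does with ref.flowside.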
/** flow_sidx() - Index of one side of a flow from common structure
* @f: Common flow fields pointer
* @side: Which side to refer to (0 or 1)
* @sidei: Which side to refer to (0 or 1)
*
* Return: index of @f and @sidei in the flow table
*/
static inline flow_sidx_t flow_sidx(const struct flow_common *f,
int side)
unsigned sidei)
{
/* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */
ASSERT(side == !!side);
ASSERT(sidei == !!sidei);
return (flow_sidx_t){
.side = side,
.flow = flow_idx(f),
.sidei = sidei,
.flowi = flow_idx(f),
};
}
/** FLOW_SIDX - Find the index of one side of a flow
/** FLOW_SIDX() - Find the index of one side of a flow
* @f_: Flow pointer, either union flow * or protocol specific
* @side: Which side to index (0 or 1)
* @sidei: Which side to index (0 or 1)
*
* Return: index of @f_ and @sidei in the flow table
*/
#define FLOW_SIDX(f_, side) (flow_sidx(&(f_)->f, (side)))
#define FLOW_SIDX(f_, sidei) (flow_sidx(&(f_)->f, (sidei)))
union flow *flow_alloc(void);
void flow_alloc_cancel(union flow *flow);
const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport);
const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type);
#define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_)
void flow_activate(struct flow_common *f);
#define FLOW_ACTIVATE(flow_) \
(flow_activate(&(flow_)->f))
#endif /* FLOW_TABLE_H */

387
fwd.c

@ -25,6 +25,81 @@
#include "fwd.h"
#include "passt.h"
#include "lineread.h"
#include "flow_table.h"
/* Ephemeral port range: values from RFC 6335 */
static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
#define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
/** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
*
* Work out what ports the host thinks are ephemeral and record it for later
* use by fwd_port_is_ephemeral(). If we're unable to probe, assume the range
* recommended by RFC 6335.
*/
void fwd_probe_ephemeral(void)
{
char *line, *tab, *end;
struct lineread lr;
long min, max;
ssize_t len;
int fd;
fd = open(PORT_RANGE_SYSCTL, O_RDONLY | O_CLOEXEC);
if (fd < 0) {
warn_perror("Unable to open %s", PORT_RANGE_SYSCTL);
return;
}
lineread_init(&lr, fd);
len = lineread_get(&lr, &line);
close(fd);
if (len < 0)
goto parse_err;
tab = strchr(line, '\t');
if (!tab)
goto parse_err;
*tab = '\0';
errno = 0;
min = strtol(line, &end, 10);
if (*end || errno)
goto parse_err;
errno = 0;
max = strtol(tab + 1, &end, 10);
if (*end || errno)
goto parse_err;
if (min < 0 || min >= (long)NUM_PORTS ||
max < 0 || max >= (long)NUM_PORTS)
goto parse_err;
fwd_ephemeral_min = min;
fwd_ephemeral_max = max;
return;
parse_err:
warn("Unable to parse %s", PORT_RANGE_SYSCTL);
}
/**
* fwd_port_is_ephemeral() - Is port number ephemeral?
* @port: Port number
*
* Return: true if @port is ephemeral, that is may be allocated by the kernel as
* a local port for outgoing connections or datagrams, but should not be
* used for binding services to.
*/
bool fwd_port_is_ephemeral(in_port_t port)
{
return (port >= fwd_ephemeral_min) && (port <= fwd_ephemeral_max);
}
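The parsing step of fwd_probe_ephemeral() can be tried out without procfs access using a sketch along these lines. It takes the line as an in-memory string instead of reading it through lineread_get(); the demo_* function names are hypothetical and the logic simply mirrors the checks in the function above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NUM_PORTS (1L << 16)

/* Parse "<min>\t<max>" in place (the tab is overwritten with a NUL).
 * Returns 0 on success, -1 on any parse or range error.
 */
static int demo_parse_port_range(char *line, long *min, long *max)
{
	char *tab, *end;
	long lo, hi;

	tab = strchr(line, '\t');
	if (!tab)
		return -1;
	*tab = '\0';

	errno = 0;
	lo = strtol(line, &end, 10);
	if (*end || errno)
		return -1;

	errno = 0;
	hi = strtol(tab + 1, &end, 10);
	if (*end || errno)
		return -1;

	if (lo < 0 || lo >= DEMO_NUM_PORTS || hi < 0 || hi >= DEMO_NUM_PORTS)
		return -1;

	*min = lo;
	*max = hi;
	return 0;
}

/* Inclusive range test, as in fwd_port_is_ephemeral() */
static bool demo_port_in_range(long port, long min, long max)
{
	return (port >= min) && (port <= max);
}
```

Resetting errno before each strtol() call and checking that the end pointer reached the terminator is what distinguishes a clean number from overflow or trailing garbage.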
/* See enum in kernel's include/net/tcp_states.h */
#define UDP_LISTEN 0x07
@ -38,7 +113,7 @@
* @exclude: Bitmap of ports to exclude from setting (and clear)
*
* #syscalls:pasta lseek
* #syscalls:pasta ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
* #syscalls:pasta ppc64le:_llseek ppc64:_llseek arm:_llseek
*/
static void procfs_scan_listen(int fd, unsigned int lstate,
uint8_t *map, const uint8_t *exclude)
@ -52,7 +127,7 @@ static void procfs_scan_listen(int fd, unsigned int lstate,
return;
if (lseek(fd, 0, SEEK_SET)) {
warn("lseek() failed on /proc/net file: %s", strerror(errno));
warn_perror("lseek() failed on /proc/net file");
return;
}
@ -128,18 +203,18 @@ void fwd_scan_ports_init(struct ctx *c)
c->tcp.fwd_in.scan4 = c->tcp.fwd_in.scan6 = -1;
c->tcp.fwd_out.scan4 = c->tcp.fwd_out.scan6 = -1;
c->udp.fwd_in.f.scan4 = c->udp.fwd_in.f.scan6 = -1;
c->udp.fwd_out.f.scan4 = c->udp.fwd_out.f.scan6 = -1;
c->udp.fwd_in.scan4 = c->udp.fwd_in.scan6 = -1;
c->udp.fwd_out.scan4 = c->udp.fwd_out.scan6 = -1;
if (c->tcp.fwd_in.mode == FWD_AUTO) {
c->tcp.fwd_in.scan4 = open_in_ns(c, "/proc/net/tcp", flags);
c->tcp.fwd_in.scan6 = open_in_ns(c, "/proc/net/tcp6", flags);
fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
}
if (c->udp.fwd_in.f.mode == FWD_AUTO) {
c->udp.fwd_in.f.scan4 = open_in_ns(c, "/proc/net/udp", flags);
c->udp.fwd_in.f.scan6 = open_in_ns(c, "/proc/net/udp6", flags);
fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
if (c->udp.fwd_in.mode == FWD_AUTO) {
c->udp.fwd_in.scan4 = open_in_ns(c, "/proc/net/udp", flags);
c->udp.fwd_in.scan6 = open_in_ns(c, "/proc/net/udp6", flags);
fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out,
&c->tcp.fwd_in, &c->tcp.fwd_out);
}
if (c->tcp.fwd_out.mode == FWD_AUTO) {
@ -147,10 +222,298 @@ void fwd_scan_ports_init(struct ctx *c)
c->tcp.fwd_out.scan6 = open("/proc/net/tcp6", flags);
fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
}
if (c->udp.fwd_out.f.mode == FWD_AUTO) {
c->udp.fwd_out.f.scan4 = open("/proc/net/udp", flags);
c->udp.fwd_out.f.scan6 = open("/proc/net/udp6", flags);
fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
if (c->udp.fwd_out.mode == FWD_AUTO) {
c->udp.fwd_out.scan4 = open("/proc/net/udp", flags);
c->udp.fwd_out.scan6 = open("/proc/net/udp6", flags);
fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in,
&c->tcp.fwd_out, &c->tcp.fwd_in);
}
}
/**
* is_dns_flow() - Determine if flow appears to be a DNS request
* @proto: Protocol (IP L4 protocol number)
* @ini: Flow address information of the initiating side
*
* Return: true if the flow appears to be directed at a DNS server, that is a
* TCP or UDP flow to port 53 (domain) or port 853 (domain-s)
*/
static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
{
return ((proto == IPPROTO_UDP) || (proto == IPPROTO_TCP)) &&
((ini->oport == 53) || (ini->oport == 853));
}
/**
* fwd_guest_accessible4() - Is IPv4 address guest-accessible
* @c: Execution context
* @addr: Host visible IPv4 address
*
* Return: true if @addr on the host is accessible to the guest without
* translation, false otherwise
*/
static bool fwd_guest_accessible4(const struct ctx *c,
const struct in_addr *addr)
{
if (IN4_IS_ADDR_LOOPBACK(addr))
return false;
/* In socket interfaces 0.0.0.0 generally means "any" or unspecified,
* however on the wire it can mean "this host on this network". Since
* that has a different meaning for host and guest, we can't let it
* through untranslated.
*/
if (IN4_IS_ADDR_UNSPECIFIED(addr))
return false;
/* For IPv4, addr_seen is initialised to addr, so is always a valid
* address
*/
if (IN4_ARE_ADDR_EQUAL(addr, &c->ip4.addr) ||
IN4_ARE_ADDR_EQUAL(addr, &c->ip4.addr_seen))
return false;
return true;
}
/**
* fwd_guest_accessible6() - Is IPv6 address guest-accessible
* @c: Execution context
* @addr: Host visible IPv6 address
*
* Return: true if @addr on the host is accessible to the guest without
* translation, false otherwise
*/
static bool fwd_guest_accessible6(const struct ctx *c,
const struct in6_addr *addr)
{
if (IN6_IS_ADDR_LOOPBACK(addr))
return false;
if (IN6_ARE_ADDR_EQUAL(addr, &c->ip6.addr))
return false;
/* For IPv6, addr_seen starts unspecified, because we don't know what LL
* address the guest will take until we see it. Only check against it
* if it has been set to a real address.
*/
if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_seen) &&
IN6_ARE_ADDR_EQUAL(addr, &c->ip6.addr_seen))
return false;
return true;
}
/**
* fwd_guest_accessible() - Is IPv[46] address guest-accessible
* @c: Execution context
* @addr: Host visible IPv[46] address
*
* Return: true if @addr on the host is accessible to the guest without
* translation, false otherwise
*/
static bool fwd_guest_accessible(const struct ctx *c,
const union inany_addr *addr)
{
const struct in_addr *a4 = inany_v4(addr);
if (a4)
return fwd_guest_accessible4(c, a4);
return fwd_guest_accessible6(c, &addr->a6);
}
/**
* fwd_nat_from_tap() - Determine how to forward a flow from the tap interface
* @c: Execution context
* @proto: Protocol (IP L4 protocol number)
* @ini: Flow address information of the initiating side
* @tgt: Flow address information on the target side (updated)
*
* Return: pif of the target interface to forward the flow to, PIF_NONE if the
* flow cannot or should not be forwarded at all.
*/
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt)
{
if (is_dns_flow(proto, ini) &&
inany_equals4(&ini->oaddr, &c->ip4.dns_match))
tgt->eaddr = inany_from_v4(c->ip4.dns_host);
else if (is_dns_flow(proto, ini) &&
inany_equals6(&ini->oaddr, &c->ip6.dns_match))
tgt->eaddr.a6 = c->ip6.dns_host;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_host_loopback))
tgt->eaddr = inany_loopback4;
else if (inany_equals6(&ini->oaddr, &c->ip6.map_host_loopback))
tgt->eaddr = inany_loopback6;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_guest_addr))
tgt->eaddr = inany_from_v4(c->ip4.addr);
else if (inany_equals6(&ini->oaddr, &c->ip6.map_guest_addr))
tgt->eaddr.a6 = c->ip6.addr;
else
tgt->eaddr = ini->oaddr;
tgt->eport = ini->oport;
/* The relevant addr_out controls the host side source address. This
* may be unspecified, which allows the kernel to pick an address.
*/
if (inany_v4(&tgt->eaddr))
tgt->oaddr = inany_from_v4(c->ip4.addr_out);
else
tgt->oaddr.a6 = c->ip6.addr_out;
/* Let the kernel pick a host side source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP) {
/* But for UDP we preserve the source port */
tgt->oport = ini->eport;
}
return PIF_HOST;
}
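The order in which fwd_nat_from_tap() tries its destination translations (DNS match first, then host loopback mapping, then guest address mapping, then pass-through) is sketched below for IPv4 only. Everything here is a simplification for illustration: the demo_cfg structure, the use of host-order uint32_t addresses, and the omission of the protocol check in is_dns_flow() are all assumptions, not the patch's data model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IPv4-only configuration, addresses in host byte order */
struct demo_cfg {
	uint32_t dns_match;		/* Address the guest uses for DNS */
	uint32_t dns_host;		/* Resolver actually used on the host */
	uint32_t map_host_loopback;	/* Guest-visible alias for host loopback */
	uint32_t map_guest_addr;	/* Guest-visible alias for the guest address */
	uint32_t guest_addr;		/* Address assigned to the guest */
};

#define DEMO_LOOPBACK4 0x7f000001U	/* 127.0.0.1 */

static bool demo_is_dns(uint16_t dport)
{
	return dport == 53 || dport == 853;
}

/* Return the host-side destination for a guest-originated flow */
static uint32_t demo_nat_dst(const struct demo_cfg *cfg,
			     uint32_t dst, uint16_t dport)
{
	if (demo_is_dns(dport) && dst == cfg->dns_match)
		return cfg->dns_host;
	if (dst == cfg->map_host_loopback)
		return DEMO_LOOPBACK4;
	if (dst == cfg->map_guest_addr)
		return cfg->guest_addr;

	return dst;	/* No translation needed */
}
```

Because the checks form a single if/else chain, the DNS rule takes priority when the same address is configured for more than one mapping.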
/**
* fwd_nat_from_splice() - Determine how to forward a flow from the splice interface
* @c: Execution context
* @proto: Protocol (IP L4 protocol number)
* @ini: Flow address information of the initiating side
* @tgt: Flow address information on the target side (updated)
*
* Return: pif of the target interface to forward the flow to, PIF_NONE if the
* flow cannot or should not be forwarded at all.
*/
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt)
{
if (!inany_is_loopback(&ini->eaddr) ||
(!inany_is_loopback(&ini->oaddr) && !inany_is_unspecified(&ini->oaddr))) {
char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
pif_name(PIF_SPLICE),
inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport,
inany_ntop(&ini->oaddr, fstr, sizeof(fstr)), ini->oport);
return PIF_NONE;
}
if (inany_v4(&ini->eaddr))
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_loopback6;
/* Preserve the specific loopback address used, but let the kernel pick
* a source port on the target side
*/
tgt->oaddr = ini->eaddr;
tgt->oport = 0;
tgt->eport = ini->oport;
if (proto == IPPROTO_TCP)
tgt->eport += c->tcp.fwd_out.delta[tgt->eport];
else if (proto == IPPROTO_UDP)
tgt->eport += c->udp.fwd_out.delta[tgt->eport];
/* Let the kernel pick a host side source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP)
/* But for UDP preserve the source port */
tgt->oport = ini->eport;
return PIF_HOST;
}
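The "tgt->eport += delta[tgt->eport]" idiom relies on a per-port offset table. A minimal sketch of that remapping follows; it assumes the delta entries are 16-bit like in_port_t (the table's declaration is not part of this diff), and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_PORTS (1U << 16)

/* Per-port offsets; zero-initialised, so ports map to themselves by default */
static uint16_t demo_delta[DEMO_NUM_PORTS];

/* Apply the offset for @port. The cast back to uint16_t wraps the sum
 * modulo 65536, so a "negative" offset can be stored as a large delta.
 */
static uint16_t demo_map_port(uint16_t port)
{
	return (uint16_t)(port + demo_delta[port]);
}
```

Storing offsets rather than target ports means an all-zero table is the identity mapping, so no explicit initialisation is needed for ports that are forwarded unchanged.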
/**
* fwd_nat_from_host() - Determine how to forward a flow from the host interface
* @c: Execution context
* @proto: Protocol (IP L4 protocol number)
* @ini: Flow address information of the initiating side
* @tgt: Flow address information on the target side (updated)
*
* Return: pif of the target interface to forward the flow to, PIF_NONE if the
* flow cannot or should not be forwarded at all.
*/
uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt)
{
/* Common for spliced and non-spliced cases */
tgt->eport = ini->oport;
if (proto == IPPROTO_TCP)
tgt->eport += c->tcp.fwd_in.delta[tgt->eport];
else if (proto == IPPROTO_UDP)
tgt->eport += c->udp.fwd_in.delta[tgt->eport];
if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) &&
(proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
/* spliceable */
/* The traffic will go over the guest's 'lo' interface, but by
* default use its external address, so we don't inadvertently
* expose services that listen only on the guest's loopback
* address. That can be overridden by --host-lo-to-ns-lo which
* will instead forward to the loopback address in the guest.
*
* In either case, let the kernel pick the source address to
* match.
*/
if (inany_v4(&ini->eaddr)) {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
tgt->oaddr = inany_any4;
} else {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback6;
else
tgt->eaddr.a6 = c->ip6.addr_seen;
tgt->oaddr = inany_any6;
}
/* Let the kernel pick source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP)
/* But for UDP preserve the source port */
tgt->oport = ini->eport;
return PIF_SPLICE;
}
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(&ini->eaddr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
tgt->oaddr = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(&ini->eaddr, &in6addr_loopback)) {
tgt->oaddr.a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(&ini->eaddr, &c->ip4.addr)) {
tgt->oaddr = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(&ini->eaddr, &c->ip6.addr)) {
tgt->oaddr.a6 = c->ip6.map_guest_addr;
} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
if (inany_v4(&ini->eaddr)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
/* No source address we can use */
return PIF_NONE;
tgt->oaddr = inany_from_v4(c->ip4.our_tap_addr);
} else {
tgt->oaddr.a6 = c->ip6.our_tap_ll;
}
} else {
tgt->oaddr = ini->eaddr;
}
tgt->oport = ini->eport;
if (inany_v4(&tgt->oaddr)) {
tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
} else {
if (inany_is_linklocal6(&tgt->oaddr))
tgt->eaddr.a6 = c->ip6.addr_ll_seen;
else
tgt->eaddr.a6 = c->ip6.addr_seen;
}
return PIF_TAP;
}

13
fwd.h

@ -7,10 +7,16 @@
#ifndef FWD_H
#define FWD_H
struct flowside;
/* Number of ports for both TCP and UDP */
#define NUM_PORTS (1U << 16)
void fwd_probe_ephemeral(void);
bool fwd_port_is_ephemeral(in_port_t port);
enum fwd_ports_mode {
FWD_UNSET = 0,
FWD_SPEC = 1,
FWD_NONE,
FWD_AUTO,
@ -41,4 +47,11 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
const struct fwd_ports *tcp_rev);
void fwd_scan_ports_init(struct ctx *c);
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
#endif /* FWD_H */

262
icmp.c

@ -40,36 +40,38 @@
#include "siphash.h"
#include "inany.h"
#include "icmp.h"
#include "flow_table.h"
#define ICMP_ECHO_TIMEOUT 60 /* s, timeout for ICMP socket activity */
#define ICMP_NUM_IDS (1U << 16)
/**
* struct icmp_id_sock - Tracking information for single ICMP echo identifier
* @sock: Bound socket for identifier
* @seq: Last sequence number sent to tap, host order, -1: not sent yet
* @ts: Last associated activity from tap, seconds
* ping_at_sidx() - Get ping specific flow at given sidx
* @sidx: Flow and side to retrieve
*
* Return: ping specific flow at @sidx, or NULL if @sidx is invalid. Asserts if
* the flow at @sidx is not FLOW_PING4 or FLOW_PING6
*/
struct icmp_id_sock {
int sock;
int seq;
time_t ts;
};
static struct icmp_ping_flow *ping_at_sidx(flow_sidx_t sidx)
{
union flow *flow = flow_at_sidx(sidx);
/* Indexed by ICMP echo identifier */
static struct icmp_id_sock icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
if (!flow)
return NULL;
ASSERT(flow->f.type == FLOW_PING4 || flow->f.type == FLOW_PING6);
return &flow->ping;
}
/**
* icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket
* @c: Execution context
* @af: Address family (AF_INET or AF_INET6)
* @ref: epoll reference
*/
void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
{
struct icmp_id_sock *const id_sock = af == AF_INET
? &icmp_id_map[V4][ref.icmp.id] : &icmp_id_map[V6][ref.icmp.id];
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
struct icmp_ping_flow *pingf = ping_at_sidx(ref.flowside);
const struct flowside *ini = &pingf->f.side[INISIDE];
union sockaddr_inany sr;
socklen_t sl = sizeof(sr);
char buf[USHRT_MAX];
@ -79,33 +81,33 @@ void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
if (c->no_icmp)
return;
ASSERT(pingf);
n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl);
if (n < 0) {
warn("%s: recvfrom() error on ping socket: %s",
pname, strerror(errno));
flow_err(pingf, "recvfrom() error: %s", strerror(errno));
return;
}
if (sr.sa_family != af)
goto unexpected;
if (af == AF_INET) {
if (pingf->f.type == FLOW_PING4) {
struct icmphdr *ih4 = (struct icmphdr *)buf;
if ((size_t)n < sizeof(*ih4) || ih4->type != ICMP_ECHOREPLY)
if (sr.sa_family != AF_INET || (size_t)n < sizeof(*ih4) ||
ih4->type != ICMP_ECHOREPLY)
goto unexpected;
/* Adjust packet back to guest-side ID */
ih4->un.echo.id = htons(ref.icmp.id);
ih4->un.echo.id = htons(ini->eport);
seq = ntohs(ih4->un.echo.sequence);
} else if (af == AF_INET6) {
} else if (pingf->f.type == FLOW_PING6) {
struct icmp6hdr *ih6 = (struct icmp6hdr *)buf;
if ((size_t)n < sizeof(*ih6) ||
if (sr.sa_family != AF_INET6 || (size_t)n < sizeof(*ih6) ||
ih6->icmp6_type != ICMPV6_ECHO_REPLY)
goto unexpected;
/* Adjust packet back to guest-side ID */
ih6->icmp6_identifier = htons(ref.icmp.id);
ih6->icmp6_identifier = htons(ini->eport);
seq = ntohs(ih6->icmp6_sequence);
} else {
ASSERT(0);
@ -113,87 +115,111 @@ void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
/* In PASTA mode, we'll get any reply we send, discard them. */
if (c->mode == MODE_PASTA) {
if (id_sock->seq == seq)
if (pingf->seq == seq)
return;
id_sock->seq = seq;
pingf->seq = seq;
}
debug("%s: echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, pname,
ref.icmp.id, seq);
if (af == AF_INET)
tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n);
else if (af == AF_INET6)
tap_icmp6_send(c, &sr.sa6.sin6_addr,
tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n);
flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
ini->eport, seq);
if (pingf->f.type == FLOW_PING4) {
const struct in_addr *saddr = inany_v4(&ini->oaddr);
const struct in_addr *daddr = inany_v4(&ini->eaddr);
ASSERT(saddr && daddr); /* Must have IPv4 addresses */
tap_icmp4_send(c, *saddr, *daddr, buf, n);
} else if (pingf->f.type == FLOW_PING6) {
const struct in6_addr *saddr = &ini->oaddr.a6;
const struct in6_addr *daddr = &ini->eaddr.a6;
tap_icmp6_send(c, saddr, daddr, buf, n);
}
return;
unexpected:
warn("%s: Unexpected packet on ping socket", pname);
flow_err(pingf, "Unexpected packet on ping socket");
}
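The PASTA-mode check in icmp_sock_handler(), which drops a packet whose sequence number equals the last one recorded, can be isolated as in the sketch below. The structure and function names are hypothetical; only the seq field and its -1 "nothing seen yet" initial value are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-flow state: last sequence number seen, -1 if none */
struct demo_ping_seq {
	int seq;
};

/* Return true and record @seq if it differs from the last one seen,
 * false if it repeats the previous sequence number.
 */
static bool demo_seq_is_new(struct demo_ping_seq *p, uint16_t seq)
{
	if (p->seq == seq)
		return false;

	p->seq = seq;
	return true;
}
```

Using an int initialised to -1 guarantees that the first packet never matches, since a 16-bit sequence number can never equal -1.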
/**
* icmp_ping_close() - Close and clean up a ping socket
* icmp_ping_close() - Close and clean up a ping flow
* @c: Execution context
* @id_sock: Socket number and other info
* @pingf: ping flow entry to close
*/
static void icmp_ping_close(const struct ctx *c, struct icmp_id_sock *id_sock)
static void icmp_ping_close(const struct ctx *c,
const struct icmp_ping_flow *pingf)
{
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, id_sock->sock, NULL);
close(id_sock->sock);
id_sock->sock = -1;
id_sock->seq = -1;
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
close(pingf->sock);
flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
}
/**
* icmp_ping_new() - Prepare a new ping socket for a new id
* @c: Execution context
* @id_sock: Socket fd and other information
* @af: Address family, AF_INET or AF_INET6
* @id: ICMP id for the new socket
* @saddr: Source address
* @daddr: Destination address
*
* Return: Newly opened ping socket fd, or -1 on failure
* Return: Newly opened ping flow, or NULL on failure
*/
static int icmp_ping_new(const struct ctx *c, struct icmp_id_sock *id_sock,
sa_family_t af, uint16_t id)
static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
sa_family_t af, uint16_t id,
const void *saddr, const void *daddr)
{
uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6;
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
union icmp_epoll_ref iref = { .id = id };
const void *bind_addr;
const char *bind_if;
int s;
uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
union epoll_ref ref = { .type = EPOLL_TYPE_PING };
union flow *flow = flow_alloc();
struct icmp_ping_flow *pingf;
const struct flowside *tgt;
if (af == AF_INET) {
bind_addr = &c->ip4.addr_out;
bind_if = c->ip4.ifname_out;
} else {
bind_addr = &c->ip6.addr_out;
bind_if = c->ip6.ifname_out;
if (!flow)
return NULL;
flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
if (!(tgt = flow_target(c, flow, proto)))
goto cancel;
if (flow->f.pif[TGTSIDE] != PIF_HOST) {
flow_err(flow, "No support for forwarding %s from %s to %s",
proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6",
pif_name(flow->f.pif[INISIDE]),
pif_name(flow->f.pif[TGTSIDE]));
goto cancel;
}
s = sock_l4(c, af, proto, bind_addr, bind_if, 0, iref.u32);
pingf = FLOW_SET_TYPE(flow, flowtype, ping);
if (s < 0) {
pingf->seq = -1;
ref.flowside = FLOW_SIDX(flow, TGTSIDE);
pingf->sock = flowside_sock_l4(c, EPOLL_TYPE_PING, PIF_HOST,
tgt, ref.data);
if (pingf->sock < 0) {
warn("Cannot open \"ping\" socket. You might need to:");
warn(" sysctl -w net.ipv4.ping_group_range=\"0 2147483647\"");
warn("...echo requests/replies will fail.");
goto cancel;
}
if (s > FD_REF_MAX)
if (pingf->sock > FD_REF_MAX)
goto cancel;
id_sock->sock = s;
flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
debug("%s: new socket %i for echo ID %"PRIu16, pname, s, id);
flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE));
return s;
FLOW_ACTIVATE(pingf);
return pingf;
cancel:
if (s >= 0)
close(s);
return -1;
flow_alloc_cancel(flow);
return NULL;
}
/**
@ -212,111 +238,93 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, const struct timespec *now)
{
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
union sockaddr_inany sa = { .sa_family = af };
const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6);
struct icmp_id_sock *id_sock;
struct icmp_ping_flow *pingf;
const struct flowside *tgt;
union sockaddr_inany sa;
size_t dlen, l4len;
uint16_t id, seq;
size_t plen;
union flow *flow;
uint8_t proto;
socklen_t sl;
void *pkt;
int s;
(void)saddr;
(void)pif;
ASSERT(pif == PIF_TAP);
if (af == AF_INET) {
const struct icmphdr *ih;
if (!(pkt = packet_get(p, 0, 0, sizeof(*ih), &plen)))
if (!(pkt = packet_get(p, 0, 0, sizeof(*ih), &dlen)))
return 1;
ih = (struct icmphdr *)pkt;
plen += sizeof(*ih);
l4len = dlen + sizeof(*ih);
if (ih->type != ICMP_ECHO)
return 1;
proto = IPPROTO_ICMP;
id = ntohs(ih->un.echo.id);
id_sock = &icmp_id_map[V4][id];
seq = ntohs(ih->un.echo.sequence);
sa.sa4.sin_addr = *(struct in_addr *)daddr;
} else if (af == AF_INET6) {
const struct icmp6hdr *ih;
if (!(pkt = packet_get(p, 0, 0, sizeof(*ih), &plen)))
if (!(pkt = packet_get(p, 0, 0, sizeof(*ih), &dlen)))
return 1;
ih = (struct icmp6hdr *)pkt;
plen += sizeof(*ih);
l4len = dlen + sizeof(*ih);
if (ih->icmp6_type != ICMPV6_ECHO_REQUEST)
return 1;
proto = IPPROTO_ICMPV6;
id = ntohs(ih->icmp6_identifier);
id_sock = &icmp_id_map[V6][id];
seq = ntohs(ih->icmp6_sequence);
sa.sa6.sin6_addr = *(struct in6_addr *)daddr;
sa.sa6.sin6_scope_id = c->ifi6;
} else {
ASSERT(0);
}
if ((s = id_sock->sock) < 0)
if ((s = icmp_ping_new(c, id_sock, af, id)) < 0)
return 1;
flow = flow_at_sidx(flow_lookup_af(c, proto, PIF_TAP,
af, saddr, daddr, id, id));
id_sock->ts = now->tv_sec;
if (flow)
pingf = &flow->ping;
else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr)))
return 1;
if (sendto(s, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
debug("%s: failed to relay request to socket: %s",
pname, strerror(errno));
tgt = &pingf->f.side[TGTSIDE];
ASSERT(flow_proto[pingf->f.type] == proto);
pingf->ts = now->tv_sec;
pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0);
if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
flow_dbg(pingf, "failed to relay request to socket: %s",
strerror(errno));
} else {
debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16,
pname, id, seq);
flow_dbg(pingf,
"echo request to socket, ID: %"PRIu16", seq: %"PRIu16,
id, seq);
}
return 1;
}
/**
* icmp_timer_one() - Handler for timed events related to a given identifier
* icmp_ping_timer() - Handler for timed events related to a given flow
* @c: Execution context
* @id_sock: Socket fd and activity timestamp
* @pingf: Ping flow to check for timeout
* @now: Current timestamp
*
* Return: true if the flow is ready to free, false otherwise
*/
static void icmp_timer_one(const struct ctx *c, struct icmp_id_sock *id_sock,
const struct timespec *now)
bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
const struct timespec *now)
{
if (id_sock->sock < 0 || now->tv_sec - id_sock->ts <= ICMP_ECHO_TIMEOUT)
return;
if (now->tv_sec - pingf->ts <= ICMP_ECHO_TIMEOUT)
return false;
icmp_ping_close(c, id_sock);
}
/**
* icmp_timer() - Scan activity bitmap for identifiers with timed events
* @c: Execution context
* @now: Current timestamp
*/
void icmp_timer(const struct ctx *c, const struct timespec *now)
{
unsigned int i;
for (i = 0; i < ICMP_NUM_IDS; i++) {
icmp_timer_one(c, &icmp_id_map[V4][i], now);
icmp_timer_one(c, &icmp_id_map[V6][i], now);
}
}
/**
* icmp_init() - Initialise sequences in ID map to -1 (no sequence sent yet)
*/
void icmp_init(void)
{
unsigned i;
for (i = 0; i < ICMP_NUM_IDS; i++) {
icmp_id_map[V4][i].seq = icmp_id_map[V6][i].seq = -1;
icmp_id_map[V4][i].sock = icmp_id_map[V6][i].sock = -1;
}
icmp_ping_close(c, pingf);
return true;
}

15
icmp.h

@ -9,25 +9,14 @@
#define ICMP_TIMER_INTERVAL 10000 /* ms */
struct ctx;
struct icmp_ping_flow;
void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref);
void icmp_sock_handler(const struct ctx *c, union epoll_ref ref);
int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, const struct timespec *now);
void icmp_timer(const struct ctx *c, const struct timespec *now);
void icmp_init(void);
/**
* union icmp_epoll_ref - epoll reference portion for ICMP tracking
* @v6: Set for IPv6 sockets or connections
* @u32: Opaque u32 value of reference
* @id: Associated echo identifier, needed if bind() fails
*/
union icmp_epoll_ref {
uint16_t id;
uint32_t u32;
};
/**
* struct icmp_ctx - Execution context for ICMP routines
* @timer_run: Timestamp of most recent timer run

29
icmp_flow.h Normal file

@ -0,0 +1,29 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*
* ICMP flow tracking data structures
*/
#ifndef ICMP_FLOW_H
#define ICMP_FLOW_H
/**
* struct icmp_ping_flow - Descriptor for a flow of ping requests/replies
* @f: Generic flow information
* @seq: Last sequence number sent to tap, host order, -1: not sent yet
* @sock: "ping" socket
* @ts: Last associated activity from tap, seconds
*/
struct icmp_ping_flow {
/* Must be first element */
struct flow_common f;
int seq;
int sock;
time_t ts;
};
bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
const struct timespec *now);
#endif /* ICMP_FLOW_H */

37
inany.c

@ -17,21 +17,8 @@
#include "siphash.h"
#include "inany.h"
const union inany_addr inany_loopback4 = {
.v4mapped = {
.zero = { 0 },
.one = { 0xff, 0xff, },
.a4 = IN4ADDR_LOOPBACK_INIT,
},
};
const union inany_addr inany_any4 = {
.v4mapped = {
.zero = { 0 },
.one = { 0xff, 0xff, },
.a4 = IN4ADDR_ANY_INIT,
},
};
const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT);
const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
/** inany_ntop - Convert an IPv[46] address to text format
* @src: IPv[46] address
@ -49,3 +36,23 @@ const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size)
return inet_ntop(AF_INET6, &src->a6, dst, size);
}
/** inany_pton - Parse an IPv[46] address from text format
* @src: IPv[46] address
* @dst: output buffer, filled with parsed address
*
* Return: On success, 1, if no parseable address is found, 0
*/
int inany_pton(const char *src, union inany_addr *dst)
{
if (inet_pton(AF_INET, src, &dst->v4mapped.a4)) {
memset(&dst->v4mapped.zero, 0, sizeof(dst->v4mapped.zero));
memset(&dst->v4mapped.one, 0xff, sizeof(dst->v4mapped.one));
return 1;
}
if (inet_pton(AF_INET6, src, &dst->a6))
return 1;
return 0;
}
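
The parsing order in inany_pton() (IPv4 first, stored as an IPv4-mapped IPv6 address, then plain IPv6) can be sketched in isolation as follows. This is a minimal sketch, not the project's code: `union addr_any` and `addr_any_pton()` are hypothetical stand-ins for `union inany_addr` and `inany_pton()`, and the layout is assumed from the `.zero`/`.one`/`.a4` initialisers shown in the diff.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for union inany_addr: an IPv6 address that can
 * also be viewed as an IPv4-mapped address (::ffff:a.b.c.d)
 */
union addr_any {
	struct in6_addr a6;
	struct {
		uint8_t zero[10];	/* ten zero bytes */
		uint8_t one[2];		/* 0xff, 0xff */
		struct in_addr a4;	/* embedded IPv4 address */
	} v4mapped;
};

/* Return 1 if @src parses as IPv4 or IPv6, 0 otherwise */
static int addr_any_pton(const char *src, union addr_any *dst)
{
	/* Try IPv4 first and fill in the IPv4-mapped prefix around it */
	if (inet_pton(AF_INET, src, &dst->v4mapped.a4) == 1) {
		memset(dst->v4mapped.zero, 0, sizeof(dst->v4mapped.zero));
		memset(dst->v4mapped.one, 0xff, sizeof(dst->v4mapped.one));
		return 1;
	}

	/* Otherwise fall back to a plain IPv6 parse */
	if (inet_pton(AF_INET6, src, &dst->a6) == 1)
		return 1;

	return 0;
}
```

With this layout, a parsed IPv4 address is indistinguishable from the same address written as `::ffff:a.b.c.d`, which is what allows a single address type to be used for both families.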

98
inany.h

@ -43,6 +43,17 @@ extern const union inany_addr inany_any4;
#define in4addr_loopback (inany_loopback4.v4mapped.a4)
#define in4addr_any (inany_any4.v4mapped.a4)
#define INANY_INIT4(a4init) { \
.v4mapped = { \
.zero = { 0 }, \
.one = { 0xff, 0xff }, \
.a4 = a4init, \
}, \
}
#define inany_from_v4(a4) \
((union inany_addr)INANY_INIT4((a4)))
/** union sockaddr_inany - Either a sockaddr_in or a sockaddr_in6
* @sa_family: Address family, AF_INET or AF_INET6
* @sa: Plain struct sockaddr (useful to avoid casts)
@ -79,16 +90,84 @@ static inline bool inany_equals(const union inany_addr *a,
return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
}
/** inany_equals4 - Compare an IPv[46] address to an IPv4 address
* @a: IPv[46] address
* @b: IPv4 address
*
* Return: true if @a and @b are the same address
*/
static inline bool inany_equals4(const union inany_addr *a,
const struct in_addr *b)
{
const struct in_addr *a4 = inany_v4(a);
return a4 && IN4_ARE_ADDR_EQUAL(a4, b);
}
/** inany_equals6 - Compare an IPv[46] address to an IPv6 address
* @a: IPv[46] address
* @b: IPv6 address
*
* Return: true if @a and @b are the same address
*/
static inline bool inany_equals6(const union inany_addr *a,
const struct in6_addr *b)
{
return IN6_ARE_ADDR_EQUAL(&a->a6, b);
}
/** inany_is_loopback4() - Check if address is IPv4 loopback
* @a: IPv[46] address
*
* Return: true if @a is in 127.0.0.0/8
*/
static inline bool inany_is_loopback4(const union inany_addr *a)
{
const struct in_addr *v4 = inany_v4(a);
return v4 && IN4_IS_ADDR_LOOPBACK(v4);
}
/** inany_is_loopback6() - Check if address is IPv6 loopback
* @a: IPv[46] address
*
* Return: true if @a is in ::1
*/
static inline bool inany_is_loopback6(const union inany_addr *a)
{
return IN6_IS_ADDR_LOOPBACK(&a->a6);
}
/** inany_is_loopback() - Check if address is loopback
* @a: IPv[46] address
*
* Return: true if @a is either ::1 or in 127.0.0.0/8
*/
static inline bool inany_is_loopback(const union inany_addr *a)
{
return inany_is_loopback4(a) || inany_is_loopback6(a);
}
/** inany_is_unspecified4() - Check if address is unspecified IPv4
* @a: IPv[46] address
*
* Return: true if @a is 0.0.0.0
*/
static inline bool inany_is_unspecified4(const union inany_addr *a)
{
const struct in_addr *v4 = inany_v4(a);
return IN6_IS_ADDR_LOOPBACK(&a->a6) || (v4 && IN4_IS_ADDR_LOOPBACK(v4));
return v4 && IN4_IS_ADDR_UNSPECIFIED(v4);
}
/** inany_is_unspecified6() - Check if address is unspecified IPv6
* @a: IPv[46] address
*
* Return: true if @a is ::
*/
static inline bool inany_is_unspecified6(const union inany_addr *a)
{
return IN6_IS_ADDR_UNSPECIFIED(&a->a6);
}
/** inany_is_unspecified() - Check if address is unspecified
@ -98,10 +177,19 @@ static inline bool inany_is_loopback(const union inany_addr *a)
*/
static inline bool inany_is_unspecified(const union inany_addr *a)
{
const struct in_addr *v4 = inany_v4(a);
return inany_is_unspecified4(a) || inany_is_unspecified6(a);
}
return IN6_IS_ADDR_UNSPECIFIED(&a->a6) ||
(v4 && IN4_IS_ADDR_UNSPECIFIED(v4));
/* FIXME: consider handling of IPv4 link-local addresses */
/** inany_is_linklocal6() - Check if address is link-local IPv6
* @a: IPv[46] address
*
* Return: true if @a is in fe80::/10 (IPv6 link local unicast)
*/
static inline bool inany_is_linklocal6(const union inany_addr *a)
{
return IN6_IS_ADDR_LINKLOCAL(&a->a6);
}
/** inany_is_multicast() - Check if address is multicast or broadcast
@ -123,7 +211,6 @@ static inline bool inany_is_multicast(const union inany_addr *a)
*
* Return: true if @a is specified and a unicast address
*/
/* cppcheck-suppress unusedFunction */
static inline bool inany_is_unicast(const union inany_addr *a)
{
return !inany_is_unspecified(a) && !inany_is_multicast(a);
@ -183,5 +270,6 @@ static inline void inany_siphash_feed(struct siphash_state *state,
#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
int inany_pton(const char *src, union inany_addr *dst);
#endif /* INANY_H */
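
The split into per-family predicates (a "4" variant, a "6" variant, and a combined one) can be illustrated on a bare `struct in6_addr`. This is a minimal sketch under stated assumptions: the `is_loopback*()` names are hypothetical, and the IPv4 test is written out by hand rather than using the project's `inany_v4()` and `IN4_IS_ADDR_LOOPBACK()` helpers.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True only for IPv4-mapped addresses inside 127.0.0.0/8 */
static bool is_loopback4(const struct in6_addr *a)
{
	uint32_t v4;

	if (!IN6_IS_ADDR_V4MAPPED(a))
		return false;

	/* The embedded IPv4 address is the last four bytes */
	memcpy(&v4, &a->s6_addr[12], sizeof(v4));
	return (ntohl(v4) >> 24) == 127;
}

/* True only for the IPv6 loopback address, ::1 */
static bool is_loopback6(const struct in6_addr *a)
{
	return IN6_IS_ADDR_LOOPBACK(a);
}

/* True for either family's loopback */
static bool is_loopback(const struct in6_addr *a)
{
	return is_loopback4(a) || is_loopback6(a);
}
```

Keeping the family-specific checks separate lets callers that already know the family avoid matching the other one by accident, while the combined helper keeps the previous behaviour.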

3
iov.h

@ -18,6 +18,9 @@
#include <unistd.h>
#include <string.h>
#define IOV_OF_LVALUE(lval) \
(struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) }
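
A possible use of IOV_OF_LVALUE() is building an I/O vector from existing objects without spelling out each base pointer and length. The sketch below repeats the macro so that it stands alone; `struct demo_hdr` and `demo_total_len()` are hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define IOV_OF_LVALUE(lval) \
	(struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) }

/* Hypothetical fixed-size header, for illustration only */
struct demo_hdr {
	uint16_t type;
	uint16_t len;
};

/* Build a two-element vector from local objects and return its total size */
static size_t demo_total_len(void)
{
	struct demo_hdr hdr = { .type = 1, .len = 4 };
	char payload[4] = "abc";
	struct iovec iov[] = {
		IOV_OF_LVALUE(hdr),	/* base = &hdr, len = sizeof(hdr) */
		IOV_OF_LVALUE(payload),	/* base = &payload, len = 4 */
	};

	return iov[0].iov_len + iov[1].iov_len;
}
```

Because the macro takes `sizeof(lval)`, it must be given the object itself (a struct or an array), not a pointer to it; passing a pointer would describe the pointer's own storage.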
size_t iov_skip_bytes(const struct iovec *iov, size_t n,
size_t skip, size_t *offset);
size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,

11
ip.h

@ -24,6 +24,11 @@
#define IN4ADDR_ANY_INIT \
{ .s_addr = htonl_constant(INADDR_ANY) }
#define IN4_IS_ADDR_LINKLOCAL(a) \
((ntohl(((struct in_addr *)(a))->s_addr) >> 16) == 0xa9fe)
#define IN4_IS_PREFIX_LINKLOCAL(a, len) \
((len) >= 16 && IN4_IS_ADDR_LINKLOCAL(a))
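
The two link-local macros above test for 169.254.0.0/16 (0xa9fe in the top sixteen bits), and the prefix variant also requires the prefix to be at least as long as the link-local one. Written as functions, a minimal sketch could look like this; the function names are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* True if @a is in 169.254.0.0/16: top 16 bits are 0xa9fe (169.254) */
static bool in4_is_linklocal(const struct in_addr *a)
{
	return (ntohl(a->s_addr) >> 16) == 0xa9fe;
}

/* True if the prefix @a/@len lies entirely inside 169.254.0.0/16: a shorter
 * prefix would also cover addresses outside the link-local range
 */
static bool in4_prefix_is_linklocal(const struct in_addr *a, int len)
{
	return len >= 16 && in4_is_linklocal(a);
}
```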
#define L2_BUF_IP4_INIT(proto) \
{ \
.version = 4, \
@ -38,7 +43,11 @@
.daddr = 0, \
}
#define L2_BUF_IP4_PSUM(proto) ((uint32_t)htons_constant(0x4500) + \
(uint32_t)htons_constant(0xff00 | (proto)))
(uint32_t)htons(0xff00 | (proto)))
#define IN6_IS_PREFIX_LINKLOCAL(a, len) \
((len) >= 10 && IN6_IS_ADDR_LINKLOCAL(a))
#define L2_BUF_IP6_INIT(proto) \
{ \

View file

@ -29,7 +29,8 @@
*
* Executed immediately after startup, drops capabilities we don't
* need at any point during execution (or which we gain back when we
* need by joining other namespaces).
* need by joining other namespaces), and closes any leaked file we
* might have inherited from the parent process.
*
* 2. isolate_user()
* =================
@ -105,7 +106,7 @@ static void drop_caps_ep_except(uint64_t keep)
int i;
if (syscall(SYS_capget, &hdr, data))
die("Couldn't get current capabilities: %s", strerror(errno));
die_perror("Couldn't get current capabilities");
for (i = 0; i < CAP_WORDS; i++) {
uint32_t mask = keep >> (32 * i);
@ -115,7 +116,7 @@ static void drop_caps_ep_except(uint64_t keep)
}
if (syscall(SYS_capset, &hdr, data))
die("Couldn't drop capabilities: %s", strerror(errno));
die_perror("Couldn't drop capabilities");
}
/**
@ -152,30 +153,31 @@ static void clamp_caps(void)
*/
if (prctl(PR_CAPBSET_DROP, i, 0, 0, 0) &&
errno != EINVAL && errno != EPERM)
die("Couldn't drop cap %i from bounding set: %s",
i, strerror(errno));
die_perror("Couldn't drop cap %i from bounding set", i);
}
if (syscall(SYS_capget, &hdr, data))
die("Couldn't get current capabilities: %s", strerror(errno));
die_perror("Couldn't get current capabilities");
for (i = 0; i < CAP_WORDS; i++)
data[i].inheritable = 0;
if (syscall(SYS_capset, &hdr, data))
die("Couldn't drop inheritable capabilities: %s",
strerror(errno));
die_perror("Couldn't drop inheritable capabilities");
}
/**
* isolate_initial() - Early, config independent self isolation
* isolate_initial() - Early, mostly config independent self isolation
* @argc: Argument count
* @argv: Command line options: only --fd (if present) is relevant here
*
* Should:
* - drop unneeded capabilities
* - close all open files except for standard streams and the one from --fd
* Mustn't:
* - remove filesystem access (we need to access files during setup)
*/
void isolate_initial(void)
void isolate_initial(int argc, char **argv)
{
uint64_t keep;
@ -209,6 +211,8 @@ void isolate_initial(void)
keep |= BIT(CAP_SETFCAP) | BIT(CAP_SYS_PTRACE);
drop_caps_ep_except(keep);
close_open_files(argc, argv);
}
/**
@ -234,34 +238,30 @@ void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns,
if (setgroups(0, NULL)) {
/* If we don't have CAP_SETGID, this will EPERM */
if (errno != EPERM)
die("Can't drop supplementary groups: %s",
strerror(errno));
die_perror("Can't drop supplementary groups");
}
if (setgid(gid) != 0)
die("Can't set GID to %u: %s", gid, strerror(errno));
die_perror("Can't set GID to %u", gid);
if (setuid(uid) != 0)
die("Can't set UID to %u: %s", uid, strerror(errno));
die_perror("Can't set UID to %u", uid);
if (*userns) { /* If given a userns, join it */
int ufd;
ufd = open(userns, O_RDONLY | O_CLOEXEC);
if (ufd < 0)
die("Couldn't open user namespace %s: %s",
userns, strerror(errno));
die_perror("Couldn't open user namespace %s", userns);
if (setns(ufd, CLONE_NEWUSER) != 0)
die("Couldn't enter user namespace %s: %s",
userns, strerror(errno));
die_perror("Couldn't enter user namespace %s", userns);
close(ufd);
} else if (use_userns) { /* Create and join a new userns */
if (unshare(CLONE_NEWUSER) != 0)
die("Couldn't create user namespace: %s",
strerror(errno));
die_perror("Couldn't create user namespace");
}
/* Joining a new userns gives us full capabilities; drop the
@ -316,34 +316,34 @@ int isolate_prefork(const struct ctx *c)
flags |= CLONE_NEWPID;
if (unshare(flags)) {
perror("unshare");
err_perror("Failed to detach isolating namespaces");
return -errno;
}
if (mount("", "/", "", MS_UNBINDABLE | MS_REC, NULL)) {
perror("mount /");
err_perror("Failed to remount /");
return -errno;
}
if (mount("", TMPDIR, "tmpfs",
MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RDONLY,
"nr_inodes=2,nr_blocks=0")) {
perror("mount tmpfs");
err_perror("Failed to mount empty tmpfs for pivot_root()");
return -errno;
}
if (chdir(TMPDIR)) {
perror("chdir");
err_perror("Failed to change directory into empty tmpfs");
return -errno;
}
if (syscall(SYS_pivot_root, ".", ".")) {
perror("pivot_root");
err_perror("Failed to pivot_root() into empty tmpfs");
return -errno;
}
if (umount2(".", MNT_DETACH | UMOUNT_NOFOLLOW)) {
perror("umount2");
err_perror("Failed to unmount original root filesystem");
return -errno;
}
@ -388,8 +388,6 @@ void isolate_postfork(const struct ctx *c)
}
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
perror("prctl");
exit(EXIT_FAILURE);
}
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
die_perror("Failed to apply seccomp filter");
}

View file

@ -7,7 +7,7 @@
#ifndef ISOLATION_H
#define ISOLATION_H
void isolate_initial(void);
void isolate_initial(int argc, char **argv);
void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns,
enum passt_modes mode);
int isolate_prefork(const struct ctx *c);

View file

@ -39,13 +39,11 @@ void lineread_init(struct lineread *lr, int fd)
*
* Return: length of line in bytes, -1 if no line was found
*/
static int peek_line(struct lineread *lr, bool eof)
static ssize_t peek_line(struct lineread *lr, bool eof)
{
char *nl;
/* Sanity checks (which also document invariants) */
ASSERT(lr->count >= 0);
ASSERT(lr->next_line >= 0);
ASSERT(lr->next_line + lr->count >= lr->next_line);
ASSERT(lr->next_line + lr->count <= LINEREAD_BUFFER_SIZE);
@ -74,13 +72,13 @@ static int peek_line(struct lineread *lr, bool eof)
*
* Return: Length of line read on success, 0 on EOF, negative on error
*/
int lineread_get(struct lineread *lr, char **line)
ssize_t lineread_get(struct lineread *lr, char **line)
{
bool eof = false;
int line_len;
ssize_t line_len;
while ((line_len = peek_line(lr, eof)) < 0) {
int rc;
ssize_t rc;
if ((lr->next_line + lr->count) == LINEREAD_BUFFER_SIZE) {
/* No space at end */

View file

@ -18,14 +18,15 @@
* @buf: Buffer storing data read from file.
*/
struct lineread {
int fd; int next_line;
int count;
int fd;
ssize_t next_line;
ssize_t count;
/* One extra byte for possible trailing \0 */
char buf[LINEREAD_BUFFER_SIZE+1];
};
void lineread_init(struct lineread *lr, int fd);
int lineread_get(struct lineread *lr, char **line);
ssize_t lineread_get(struct lineread *lr, char **line);
#endif /* _LINEREAD_H */

144
linux_dep.h Normal file

@ -0,0 +1,144 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright Red Hat
*
* Declarations for Linux specific dependencies
*/
#ifndef LINUX_DEP_H
#define LINUX_DEP_H
/* struct tcp_info_linux - Information from Linux TCP_INFO getsockopt()
*
* Largely derived from include/linux/tcp.h in the Linux kernel
*
* Some fields returned by TCP_INFO have been there for ages and are shared with
* BSD. struct tcp_info from netinet/tcp.h has only those fields. There are
* also many Linux specific extensions to the structure, which are only found
* in the linux/tcp.h version of struct tcp_info.
*
* We want to use some of those extension fields, when available. We can test
* for availability in the runtime kernel using the length returned from
* getsockopt(). However, we won't necessarily be compiled against the same
* kernel headers as we'll run with, so compiling directly against linux/tcp.h
* means wrapping every field access in an #ifdef whose #else does the same
* thing as when the field is missing at runtime. This rapidly gets messy.
*
* Instead we define here struct tcp_info_linux which includes all the Linux
* extensions that we want to use. This is taken from v6.11 of the kernel.
*/
struct tcp_info_linux {
uint8_t tcpi_state;
uint8_t tcpi_ca_state;
uint8_t tcpi_retransmits;
uint8_t tcpi_probes;
uint8_t tcpi_backoff;
uint8_t tcpi_options;
uint8_t tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
uint8_t tcpi_delivery_rate_app_limited:1, tcpi_fastopen_client_fail:2;
uint32_t tcpi_rto;
uint32_t tcpi_ato;
uint32_t tcpi_snd_mss;
uint32_t tcpi_rcv_mss;
uint32_t tcpi_unacked;
uint32_t tcpi_sacked;
uint32_t tcpi_lost;
uint32_t tcpi_retrans;
uint32_t tcpi_fackets;
/* Times. */
uint32_t tcpi_last_data_sent;
uint32_t tcpi_last_ack_sent;
uint32_t tcpi_last_data_recv;
uint32_t tcpi_last_ack_recv;
/* Metrics. */
uint32_t tcpi_pmtu;
uint32_t tcpi_rcv_ssthresh;
uint32_t tcpi_rtt;
uint32_t tcpi_rttvar;
uint32_t tcpi_snd_ssthresh;
uint32_t tcpi_snd_cwnd;
uint32_t tcpi_advmss;
uint32_t tcpi_reordering;
uint32_t tcpi_rcv_rtt;
uint32_t tcpi_rcv_space;
uint32_t tcpi_total_retrans;
/* Linux extensions */
uint64_t tcpi_pacing_rate;
uint64_t tcpi_max_pacing_rate;
uint64_t tcpi_bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked */
uint64_t tcpi_bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived */
uint32_t tcpi_segs_out; /* RFC4898 tcpEStatsPerfSegsOut */
uint32_t tcpi_segs_in; /* RFC4898 tcpEStatsPerfSegsIn */
uint32_t tcpi_notsent_bytes;
uint32_t tcpi_min_rtt;
uint32_t tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */
uint32_t tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
uint64_t tcpi_delivery_rate;
uint64_t tcpi_busy_time; /* Time (usec) busy sending data */
uint64_t tcpi_rwnd_limited; /* Time (usec) limited by receive window */
uint64_t tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
uint32_t tcpi_delivered;
uint32_t tcpi_delivered_ce;
uint64_t tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
uint64_t tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans */
uint32_t tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
uint32_t tcpi_reord_seen; /* reordering events seen */
uint32_t tcpi_rcv_ooopack; /* Out-of-order packets received */
uint32_t tcpi_snd_wnd; /* peer's advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rcv_wnd; /* local advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rehash; /* PLB or timeout triggered rehash attempts */
uint16_t tcpi_total_rto; /* Total number of RTO timeouts, including
* SYN/SYN-ACK and recurring timeouts.
*/
uint16_t tcpi_total_rto_recoveries; /* Total number of RTO
* recoveries, including any
* unfinished recovery.
*/
uint32_t tcpi_total_rto_time; /* Total time spent in RTO recoveries
* in milliseconds, including any
* unfinished recovery.
*/
};
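
The comment above describes testing for extension fields at runtime by using the length returned from getsockopt(). One way to express that check is to compare the returned length against the end offset of the field. This is a minimal sketch: `struct demo_tcp_info`, `INFO_HAS()` and `min_rtt_or()` are hypothetical, and the struct is a cut-down stand-in, not the real layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct tcp_info_linux */
struct demo_tcp_info {
	uint32_t tcpi_rtt;		/* long-standing field */
	uint32_t tcpi_min_rtt;		/* treated here as an extension */
	uint64_t tcpi_delivery_rate;	/* treated here as a later extension */
};

/* True if a buffer of @len bytes, as reported by getsockopt(), is long
 * enough to contain @field completely
 */
#define INFO_HAS(len, field)						\
	((size_t)(len) >= offsetof(struct demo_tcp_info, field) +	\
			  sizeof(((struct demo_tcp_info *)0)->field))

/* Read an extension field only if the running kernel filled it in */
static uint32_t min_rtt_or(const struct demo_tcp_info *ti, size_t len,
			   uint32_t fallback)
{
	return INFO_HAS(len, tcpi_min_rtt) ? ti->tcpi_min_rtt : fallback;
}
```

With a private copy of the structure, the same check works regardless of which kernel headers were present at build time, which avoids wrapping each field access in an #ifdef.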
#include <linux/falloc.h>
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#endif
#include <linux/close_range.h>
/* glibc < 2.34 and musl as of 1.2.5 need these */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE /* Linux kernel < 5.9 */
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif
__attribute__ ((weak))
/* cppcheck-suppress funcArgNamesDifferent */
int close_range(unsigned int first, unsigned int last, int flags) {
return syscall(SYS_close_range, first, last, flags);
}
#endif /* LINUX_DEP_H */
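
With the fallback definitions above, a caller can always issue the close_range() system call and decide at runtime what to do if the kernel doesn't support it. The sketch below shows one possible shape of such a caller: `close_fds_from()` is a hypothetical helper, not the project's close_open_files(), and the choice of ENOSYS and EINVAL as "not supported" errors is an assumption made for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same fallback values as in the header above */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif

/* Close every file descriptor >= @first.
 *
 * Return: 0 on success, 1 if the kernel lacks support (warn and carry on),
 *	   -1 on any other error, with errno set
 */
static int close_fds_from(unsigned int first)
{
	if (!syscall(SYS_close_range, first, ~0U, CLOSE_RANGE_UNSHARE))
		return 0;

	/* Assumption: treat these two errors as "not available here" */
	if (errno == ENOSYS || errno == EINVAL) {
		fprintf(stderr,
			"Can't use close_range(), leaked files stay open\n");
		return 1;
	}

	return -1;
}
```

Handling the missing system call at runtime, rather than compiling in a no-op, keeps the behaviour the same whether the build used old or new kernel headers.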

421
log.c

@ -26,17 +26,14 @@
#include <stdarg.h>
#include <sys/socket.h>
#include "linux_dep.h"
#include "log.h"
#include "util.h"
#include "passt.h"
/* LOG_EARLY means we don't know yet: log everything. LOG_EMERG is unused */
#define LOG_EARLY LOG_MASK(LOG_EMERG)
static int log_sock = -1; /* Optional socket to system logger */
static char log_ident[BUFSIZ]; /* Identifier string for openlog() */
static int log_mask = LOG_EARLY; /* Current log priority mask */
static int log_opt; /* Options for openlog() */
static int log_mask; /* Current log priority mask */
static int log_file = -1; /* Optional log file descriptor */
static size_t log_size; /* Maximum log file size in bytes */
@ -44,50 +41,46 @@ static size_t log_written; /* Currently used bytes in log file */
static size_t log_cut_size; /* Bytes to cut at start on rotation */
static char log_header[BUFSIZ]; /* File header, written back on cuts */
static time_t log_start; /* Start timestamp */
struct timespec log_start; /* Start timestamp */
int log_trace; /* --trace mode enabled */
int log_to_stdout; /* Print to stdout instead of stderr */
bool log_conf_parsed; /* Logging options already parsed */
bool log_stderr = true; /* Not daemonised, no shell spawned */
void vlogmsg(int pri, const char *format, va_list ap)
#define LL_STRLEN (sizeof("-9223372036854775808"))
#define LOGTIME_STRLEN (LL_STRLEN + 5)
/**
* logtime() - Get the current time for logging purposes
* @ts: Buffer into which to store the timestamp
*
* Return: pointer to @ts, or NULL if there was an error retrieving the time
*/
const struct timespec *logtime(struct timespec *ts)
{
bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;
bool early_print = LOG_PRI(log_mask) == LOG_EARLY;
FILE *out = log_to_stdout ? stdout : stderr;
struct timespec tp;
if (debug_print) {
clock_gettime(CLOCK_REALTIME, &tp);
fprintf(out, "%lli.%04lli: ",
(long long int)tp.tv_sec - log_start,
(long long int)tp.tv_nsec / (100L * 1000));
}
if ((log_mask & LOG_MASK(LOG_PRI(pri))) || early_print) {
va_list ap2;
va_copy(ap2, ap); /* Don't clobber ap, we need it again */
if (log_file != -1)
logfile_write(pri, format, ap2);
else if (!(log_mask & LOG_MASK(LOG_DEBUG)))
passt_vsyslog(pri, format, ap2);
va_end(ap2);
}
if (debug_print || (early_print && !(log_opt & LOG_PERROR))) {
(void)vfprintf(out, format, ap);
if (format[strlen(format)] != '\n')
fprintf(out, "\n");
}
if (clock_gettime(CLOCK_MONOTONIC, ts))
return NULL;
return ts;
}
void logmsg(int pri, const char *format, ...)
/**
* logtime_fmt() - Format timestamp into a string for the log
* @buf: Buffer into which to format the time
* @size: Size of @buf
* @ts: Time to format (or NULL on error)
*
* Return: number of characters written to @buf (excluding \0)
*/
static int logtime_fmt(char *buf, size_t size, const struct timespec *ts)
{
va_list ap;
if (ts) {
int64_t delta = timespec_diff_us(ts, &log_start);
va_start(ap, format);
vlogmsg(pri, format, ap);
va_end(ap);
return snprintf(buf, size, "%lli.%04lli", delta / 1000000LL,
(delta / 100LL) % 10000);
}
return snprintf(buf, size, "<error>");
}
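
The arithmetic in logtime_fmt() prints the elapsed time as whole seconds followed by four decimal digits, that is, units of 100 microseconds. A standalone sketch of the same computation follows; `ts_diff_us()` is a hypothetical stand-in for the project's timespec_diff_us(), whose exact definition isn't shown here.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for timespec_diff_us(): @a - @b in microseconds */
static long long ts_diff_us(const struct timespec *a,
			    const struct timespec *b)
{
	return (long long)(a->tv_sec - b->tv_sec) * 1000000LL +
	       (a->tv_nsec - b->tv_nsec) / 1000;
}

/* Format the time elapsed since @start as "<seconds>.<4 digits>" */
static int fmt_elapsed(char *buf, size_t size, const struct timespec *now,
		       const struct timespec *start)
{
	long long delta;

	if (!now)	/* The clock couldn't be read */
		return snprintf(buf, size, "<error>");

	delta = ts_diff_us(now, start);

	/* delta / 100 counts 100 us units; % 10000 keeps the fraction */
	return snprintf(buf, size, "%lli.%04lli",
			delta / 1000000LL, (delta / 100LL) % 10000LL);
}
```

Sharing one formatter between the stderr path, the log file and the rotation headers keeps the three timestamp formats from drifting apart, which the removed code had to guarantee by repeating the same expression.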
/* Prefixes for log file messages, indexed by priority */
@ -100,127 +93,12 @@ const char *logfile_prefix[] = {
" ", /* LOG_DEBUG */
};
/**
* trace_init() - Set log_trace depending on trace (debug) mode
* @enable: Tracing debug mode enabled if non-zero
*/
void trace_init(int enable)
{
log_trace = enable;
}
/**
* __openlog() - Non-optional openlog() implementation, for custom vsyslog()
* @ident: openlog() identity (program name)
* @option: openlog() options
* @facility: openlog() facility (LOG_DAEMON)
*/
void __openlog(const char *ident, int option, int facility)
{
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
log_start = tp.tv_sec;
if (log_sock < 0) {
struct sockaddr_un a = { .sun_family = AF_UNIX, };
log_sock = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
if (log_sock < 0)
return;
strncpy(a.sun_path, _PATH_LOG, sizeof(a.sun_path));
if (connect(log_sock, (const struct sockaddr *)&a, sizeof(a))) {
close(log_sock);
log_sock = -1;
return;
}
}
log_mask |= facility;
strncpy(log_ident, ident, sizeof(log_ident) - 1);
log_opt = option;
}
/**
* __setlogmask() - setlogmask() wrapper, to allow custom vsyslog()
* @mask: Same as setlogmask() mask
*/
void __setlogmask(int mask)
{
log_mask = mask;
setlogmask(mask);
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void passt_vsyslog(int pri, const char *format, va_list ap)
{
int prefix_len, n;
char buf[BUFSIZ];
/* Send without timestamp, the system logger should add it */
n = prefix_len = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_opt & LOG_PERROR)
fprintf(stderr, "%s", buf + prefix_len);
if (send(log_sock, buf, n, 0) != n)
fprintf(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* logfile_init() - Open log file and write header with PID, version, path
* @name: Identifier for header: passt or pasta
* @path: Path to log file
* @size: Maximum size of log file: log_cut_size is calculated here
*/
void logfile_init(const char *name, const char *path, size_t size)
{
char nl = '\n', exe[PATH_MAX] = { 0 };
int n;
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0) {
perror("readlink /proc/self/exe");
exit(EXIT_FAILURE);
}
log_file = open(path, O_CREAT | O_TRUNC | O_APPEND | O_RDWR | O_CLOEXEC,
S_IRUSR | S_IWUSR);
if (log_file == -1)
die("Couldn't open log file %s: %s", path, strerror(errno));
log_size = size ? size : LOGFILE_SIZE_DEFAULT;
n = snprintf(log_header, sizeof(log_header), "%s " VERSION ": %s (%i)",
name, exe, getpid());
if (write(log_file, log_header, n) <= 0 ||
write(log_file, &nl, 1) <= 0) {
perror("Couldn't write to log file\n");
exit(EXIT_FAILURE);
}
/* For FALLOC_FL_COLLAPSE_RANGE: VFS block size can be up to one page */
log_cut_size = ROUND_UP(log_size * LOGFILE_CUT_RATIO / 100, PAGE_SIZE);
}
#ifdef FALLOC_FL_COLLAPSE_RANGE
/**
* logfile_rotate_fallocate() - Write header, set log_written after fallocate()
* @fd: Log file descriptor
* @now: Current timestamp
*
* #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
* #syscalls lseek ppc64le:_llseek ppc64:_llseek arm:_llseek i686:_llseek
*/
static void logfile_rotate_fallocate(int fd, const struct timespec *now)
{
@ -233,10 +111,8 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *now)
if (read(fd, buf, BUFSIZ) == -1)
return;
n = snprintf(buf, BUFSIZ,
"%s - log truncated at %lli.%04lli", log_header,
(long long int)(now->tv_sec - log_start),
(long long int)(now->tv_nsec / (100L * 1000)));
n = snprintf(buf, BUFSIZ, "%s - log truncated at ", log_header);
n += logtime_fmt(buf + n, BUFSIZ - n, now);
/* Avoid partial lines by padding the header with spaces */
nl = memchr(buf + n + 1, '\n', BUFSIZ - n - 1);
@ -250,14 +126,13 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *now)
log_written -= log_cut_size;
}
#endif /* FALLOC_FL_COLLAPSE_RANGE */
/**
* logfile_rotate_move() - Fallback: move recent entries toward start, then cut
* @fd: Log file descriptor
* @now: Current timestamp
*
* #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
* #syscalls lseek ppc64le:_llseek ppc64:_llseek arm:_llseek
* #syscalls ftruncate
*/
static void logfile_rotate_move(int fd, const struct timespec *now)
@ -266,10 +141,10 @@ static void logfile_rotate_move(int fd, const struct timespec *now)
char buf[BUFSIZ];
const char *nl;
header_len = snprintf(buf, BUFSIZ,
"%s - log truncated at %lli.%04lli\n", log_header,
(long long int)(now->tv_sec - log_start),
(long long int)(now->tv_nsec / (100L * 1000)));
header_len = snprintf(buf, BUFSIZ, "%s - log truncated at ",
log_header);
header_len += logtime_fmt(buf + header_len, BUFSIZ - header_len, now);
if (lseek(fd, 0, SEEK_SET) == -1)
return;
if (write(fd, buf, header_len) == -1)
@ -322,21 +197,17 @@ out:
*
* Return: 0 on success, negative error code on failure
*
* #syscalls fcntl
*
* fallocate() passed as EXTRA_SYSCALL only if FALLOC_FL_COLLAPSE_RANGE is there
* #syscalls fcntl fallocate
*/
static int logfile_rotate(int fd, const struct timespec *now)
{
if (fcntl(fd, F_SETFL, O_RDWR /* Drop O_APPEND: explicit lseek() */))
return -errno;
#ifdef FALLOC_FL_COLLAPSE_RANGE
/* Only for Linux >= 3.15, extent-based ext4 or XFS, glibc >= 2.18 */
if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, log_cut_size))
logfile_rotate_fallocate(fd, now);
else
#endif
logfile_rotate_move(fd, now);
if (fcntl(fd, F_SETFL, O_RDWR | O_APPEND))
@ -347,32 +218,212 @@ static int logfile_rotate(int fd, const struct timespec *now)
/**
* logfile_write() - Write entry to log file, trigger rotation if full
* @newline: Append newline at the end of the message, if missing
* @cont: Continuation of a previous message, on the same line
* @pri: Facility and level map, same as priority for vsyslog()
* @now: Timestamp
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void logfile_write(int pri, const char *format, va_list ap)
static void logfile_write(bool newline, bool cont, int pri,
const struct timespec *now,
const char *format, va_list ap)
{
struct timespec now;
char buf[BUFSIZ];
int n;
int n = 0;
if (clock_gettime(CLOCK_REALTIME, &now))
return;
n = snprintf(buf, BUFSIZ, "%lli.%04lli: %s",
(long long int)(now.tv_sec - log_start),
(long long int)(now.tv_nsec / (100L * 1000)),
logfile_prefix[pri]);
if (!cont) {
n += logtime_fmt(buf, BUFSIZ, now);
n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]);
}
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (format[strlen(format)] != '\n')
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if ((log_written + n >= log_size) && logfile_rotate(log_file, &now))
if ((log_written + n >= log_size) && logfile_rotate(log_file, now))
return;
if ((n = write(log_file, buf, n)) >= 0)
log_written += n;
}
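
A note on the newline check used in these hunks: `format[strlen(format)]` is always the terminating NUL byte, so comparing it with '\n' is always true, and a newline is appended even when the message already ends with one. If the intent is to append only when the newline is missing, the last character has to be inspected instead. A minimal sketch, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if @s is non-empty and its last character is a newline */
static bool ends_with_newline(const char *s)
{
	size_t len = strlen(s);

	return len > 0 && s[len - 1] == '\n';
}

/* Append "\n" to the @n bytes already in @buf unless @format ends with one.
 *
 * Return: new length of the message, or @n unchanged if nothing was appended
 */
static int append_newline_if_missing(char *buf, size_t size, int n,
				     const char *format)
{
	if (n < 0 || (size_t)n >= size || ends_with_newline(format))
		return n;

	return n + snprintf(buf + n, size - n, "\n");
}
```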
/**
* vlogmsg() - Print or send messages to log or output files as configured
* @newline: Append newline at the end of the message, if missing
* @cont: Continuation of a previous message, on the same line
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Message
* @ap: Variable argument list
*/
void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
{
bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;
const struct timespec *now;
struct timespec ts;
now = logtime(&ts);
if (debug_print && !cont) {
char timestr[LOGTIME_STRLEN];
logtime_fmt(timestr, sizeof(timestr), now);
FPRINTF(stderr, "%s: ", timestr);
}
if ((log_mask & LOG_MASK(LOG_PRI(pri))) || !log_conf_parsed) {
va_list ap2;
va_copy(ap2, ap); /* Don't clobber ap, we need it again */
if (log_file != -1)
logfile_write(newline, cont, pri, now, format, ap2);
else if (!(log_mask & LOG_MASK(LOG_DEBUG)))
passt_vsyslog(newline, pri, format, ap2);
va_end(ap2);
}
if (debug_print || !log_conf_parsed ||
(log_stderr && (log_mask & LOG_MASK(LOG_PRI(pri))))) {
(void)vfprintf(stderr, format, ap);
if (newline && format[strlen(format)] != '\n')
FPRINTF(stderr, "\n");
}
}
/**
* logmsg() - vlogmsg() wrapper for variable argument lists
* @newline: Append newline at the end of the message, if missing
* @cont: Continuation of a previous message, on the same line
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Message
*/
void logmsg(bool newline, bool cont, int pri, const char *format, ...)
{
va_list ap;
va_start(ap, format);
vlogmsg(newline, cont, pri, format, ap);
va_end(ap);
}
/**
* logmsg_perror() - vlogmsg() wrapper with perror()-like functionality
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Message
*/
void logmsg_perror(int pri, const char *format, ...)
{
int errno_copy = errno;
va_list ap;
va_start(ap, format);
vlogmsg(false, false, pri, format, ap);
va_end(ap);
logmsg(true, true, pri, ": %s", strerror(errno_copy));
}
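
logmsg_perror() copies errno before doing anything else, because the formatting and output calls that follow may change it. The same pattern, reduced to formatting into a buffer, might look like this; `fmt_perror()` is a hypothetical helper written for illustration.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Format a message followed by ": <error description>\n", like perror() */
static int fmt_perror(char *buf, size_t size, const char *format, ...)
{
	int errno_copy = errno;	/* Save it: calls below may clobber errno */
	va_list ap;
	int n;

	va_start(ap, format);
	n = vsnprintf(buf, size, format, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= size)	/* Error or truncated message */
		return n;

	return n + snprintf(buf + n, size - n, ": %s\n",
			    strerror(errno_copy));
}
```

This is what allows call sites to shrink from `die("...: %s", ..., strerror(errno))` to a single `die_perror("...")` style call without losing the error description.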
/**
* trace_init() - Set log_trace depending on trace (debug) mode
* @enable: Tracing debug mode enabled if non-zero
*/
void trace_init(int enable)
{
log_trace = enable;
}
/**
* __openlog() - Non-optional openlog() implementation, for custom vsyslog()
* @ident: openlog() identity (program name)
* @option: openlog() options, unused
* @facility: openlog() facility (LOG_DAEMON)
*/
void __openlog(const char *ident, int option, int facility)
{
(void)option;
if (log_sock < 0) {
struct sockaddr_un a = { .sun_family = AF_UNIX, };
log_sock = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
if (log_sock < 0)
return;
strncpy(a.sun_path, _PATH_LOG, sizeof(a.sun_path));
if (connect(log_sock, (const struct sockaddr *)&a, sizeof(a))) {
close(log_sock);
log_sock = -1;
return;
}
}
log_mask |= facility;
strncpy(log_ident, ident, sizeof(log_ident) - 1);
}
/**
* __setlogmask() - setlogmask() wrapper, to allow custom vsyslog()
* @mask: Same as setlogmask() mask
*/
void __setlogmask(int mask)
{
log_mask = mask;
setlogmask(mask);
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* logfile_init() - Open log file and write header with PID, version, path
* @name: Identifier for header: passt or pasta
* @path: Path to log file
* @size: Maximum size of log file: log_cut_size is calculated here
*/
void logfile_init(const char *name, const char *path, size_t size)
{
char nl = '\n', exe[PATH_MAX] = { 0 };
int n;
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0)
die_perror("Failed to read own /proc/self/exe link");
log_file = output_file_open(path, O_APPEND | O_RDWR);
if (log_file == -1)
die_perror("Couldn't open log file %s", path);
log_size = size ? size : LOGFILE_SIZE_DEFAULT;
n = snprintf(log_header, sizeof(log_header), "%s " VERSION ": %s (%i)",
name, exe, getpid());
if (write(log_file, log_header, n) <= 0 ||
write(log_file, &nl, 1) <= 0)
die_perror("Couldn't write to log file");
/* For FALLOC_FL_COLLAPSE_RANGE: VFS block size can be up to one page */
log_cut_size = ROUND_UP(log_size * LOGFILE_CUT_RATIO / 100, PAGE_SIZE);
}
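
The `newline` and `cont` arguments added to logmsg() above let a single log line be emitted in several calls, which is how logmsg_perror() appends the strerror() text. A minimal sketch of that composition, assuming a hypothetical in-memory buffer (`sketch_out`) and a fixed `[ts] ` prefix standing in for the real timestamp and outputs; none of these names exist in passt:

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical capture buffer, stands in for stderr / the log file */
static char sketch_out[256];

/* Hypothetical stand-in for logmsg(): emit the prefix only when not
 * continuing a previous message, append a newline only if requested
 * and not already present.
 */
static void sketch_logmsg(bool newline, bool cont, const char *format, ...)
{
	size_t used = strlen(sketch_out);
	va_list ap;

	if (!cont) {
		snprintf(sketch_out + used, sizeof(sketch_out) - used, "[ts] ");
		used = strlen(sketch_out);
	}

	va_start(ap, format);
	vsnprintf(sketch_out + used, sizeof(sketch_out) - used, format, ap);
	va_end(ap);

	used = strlen(sketch_out);
	if (newline && used + 1 < sizeof(sketch_out) &&
	    (!used || sketch_out[used - 1] != '\n')) {
		sketch_out[used] = '\n';
		sketch_out[used + 1] = '\0';
	}
}
```

Calling `sketch_logmsg(false, false, ...)` followed by `sketch_logmsg(true, true, ": %s", ...)` yields one prefixed line, mirroring the two-step pattern in logmsg_perror().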

34
log.h

@ -6,20 +6,28 @@
#ifndef LOG_H
#define LOG_H
#include <stdbool.h>
#include <syslog.h>
#define LOGFILE_SIZE_DEFAULT (1024 * 1024UL)
#define LOGFILE_CUT_RATIO 30 /* When full, cut ~30% size */
#define LOGFILE_SIZE_MIN (5UL * MAX(BUFSIZ, PAGE_SIZE))
void vlogmsg(int pri, const char *format, va_list ap);
void logmsg(int pri, const char *format, ...)
void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap);
void logmsg(bool newline, bool cont, int pri, const char *format, ...)
__attribute__((format(printf, 4, 5)));
void logmsg_perror(int pri, const char *format, ...)
__attribute__((format(printf, 2, 3)));
#define err(...) logmsg(LOG_ERR, __VA_ARGS__)
#define warn(...) logmsg(LOG_WARNING, __VA_ARGS__)
#define info(...) logmsg(LOG_INFO, __VA_ARGS__)
#define debug(...) logmsg(LOG_DEBUG, __VA_ARGS__)
#define err(...) logmsg(true, false, LOG_ERR, __VA_ARGS__)
#define warn(...) logmsg(true, false, LOG_WARNING, __VA_ARGS__)
#define info(...) logmsg(true, false, LOG_INFO, __VA_ARGS__)
#define debug(...) logmsg(true, false, LOG_DEBUG, __VA_ARGS__)
#define err_perror(...) logmsg_perror( LOG_ERR, __VA_ARGS__)
#define warn_perror(...) logmsg_perror( LOG_WARNING, __VA_ARGS__)
#define info_perror(...) logmsg_perror( LOG_INFO, __VA_ARGS__)
#define debug_perror(...) logmsg_perror( LOG_DEBUG, __VA_ARGS__)
#define die(...) \
do { \
@ -27,8 +35,17 @@ void logmsg(int pri, const char *format, ...)
exit(EXIT_FAILURE); \
} while (0)
#define die_perror(...) \
do { \
err_perror(__VA_ARGS__); \
exit(EXIT_FAILURE); \
} while (0)
extern int log_trace;
extern int log_to_stdout;
extern bool log_conf_parsed;
extern bool log_stderr;
extern struct timespec log_start;
void trace_init(int enable);
#define trace(...) \
do { \
@ -38,8 +55,7 @@ void trace_init(int enable);
void __openlog(const char *ident, int option, int facility);
void logfile_init(const char *name, const char *path, size_t size);
void passt_vsyslog(int pri, const char *format, va_list ap);
void logfile_write(int pri, const char *format, va_list ap);
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap);
void __setlogmask(int mask);
#endif /* LOG_H */

320
ndp.c

@ -38,23 +38,194 @@
#define NS 135
#define NA 136
enum ndp_option_types {
OPT_SRC_L2_ADDR = 1,
OPT_TARGET_L2_ADDR = 2,
OPT_PREFIX_INFO = 3,
OPT_MTU = 5,
OPT_RDNSS_TYPE = 25,
OPT_DNSSL_TYPE = 31,
};
/**
* struct opt_header - Option header
* @type: Option type
* @len: Option length, in units of 8 bytes
*/
struct opt_header {
uint8_t type;
uint8_t len;
} __attribute__((packed));
/**
* struct opt_l2_addr - Link-layer address
* @header: Option header
* @mac: MAC address
*/
struct opt_l2_addr {
struct opt_header header;
unsigned char mac[ETH_ALEN];
} __attribute__((packed));
/**
* struct ndp_na - NDP Neighbor Advertisement (NA) message
* @ih: ICMPv6 header
* @target_addr: Target IPv6 address
* @target_l2_addr: Target link-layer address
*/
struct ndp_na {
struct icmp6hdr ih;
struct in6_addr target_addr;
struct opt_l2_addr target_l2_addr;
} __attribute__((packed));
/**
* struct opt_prefix_info - Prefix Information option
* @header: Option header
* @prefix_len: The number of leading bits in the Prefix that are valid
* @prefix_flags: Flags associated with the prefix
* @valid_lifetime: Valid lifetime (s)
* @pref_lifetime: Preferred lifetime (s)
* @reserved: Unused
*/
struct opt_prefix_info {
struct opt_header header;
uint8_t prefix_len;
uint8_t prefix_flags;
uint32_t valid_lifetime;
uint32_t pref_lifetime;
uint32_t reserved;
} __attribute__((packed));
/**
* struct opt_mtu - Maximum transmission unit (MTU) option
* @header: Option header
* @reserved: Unused
* @value: MTU value, network order
*/
struct opt_mtu {
struct opt_header header;
uint16_t reserved;
uint32_t value;
} __attribute__((packed));
/**
* struct opt_rdnss - Recursive DNS Server (RDNSS) option
* @header: Option header
* @reserved: Unused
* @lifetime: Validity time (s)
* @dns: List of DNS server addresses
*/
struct opt_rdnss {
struct opt_header header;
uint16_t reserved;
uint32_t lifetime;
struct in6_addr dns[MAXNS + 1];
} __attribute__((packed));
/**
* struct opt_dnssl - DNS Search List (DNSSL) option
* @header: Option header
* @reserved: Unused
* @lifetime: Validity time (s)
* @domains: List of NUL-separated search domains
*/
struct opt_dnssl {
struct opt_header header;
uint16_t reserved;
uint32_t lifetime;
unsigned char domains[MAXDNSRCH * NS_MAXDNAME];
} __attribute__((packed));
/**
* struct ndp_ra - NDP Router Advertisement (RA) message
* @ih: ICMPv6 header
* @reachable: Reachability time, after confirmation (ms)
* @retrans: Time between retransmitted NS messages (ms)
* @prefix_info: Prefix Information option
* @prefix: IPv6 prefix
* @mtu: MTU option
* @source_ll: Source link-layer address
* @var: Variable fields
*/
struct ndp_ra {
struct icmp6hdr ih;
uint32_t reachable;
uint32_t retrans;
struct opt_prefix_info prefix_info;
struct in6_addr prefix;
struct opt_l2_addr source_ll;
unsigned char var[sizeof(struct opt_mtu) + sizeof(struct opt_rdnss) +
sizeof(struct opt_dnssl)];
} __attribute__((packed));
/**
* struct ndp_ns - NDP Neighbor Solicitation (NS) message
* @ih: ICMPv6 header
* @target_addr: Target IPv6 address
*/
struct ndp_ns {
struct icmp6hdr ih;
struct in6_addr target_addr;
} __attribute__((packed));
/**
* ndp() - Check for NDP solicitations, reply as needed
* @c: Execution context
* @ih: ICMPv6 header
* @saddr Source IPv6 address
* @saddr: Source IPv6 address
* @p: Packet pool
*
* Return: 0 if not handled here, 1 if handled, -1 on failure
*/
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr)
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
const struct pool *p)
{
struct ndp_na na = {
.ih = {
.icmp6_type = NA,
.icmp6_code = 0,
.icmp6_router = 1,
.icmp6_solicited = 1,
.icmp6_override = 1,
},
.target_l2_addr = {
.header = {
.type = OPT_TARGET_L2_ADDR,
.len = 1,
},
}
};
struct ndp_ra ra = {
.ih = {
.icmp6_type = RA,
.icmp6_code = 0,
.icmp6_hop_limit = 255,
/* RFC 8319 */
.icmp6_rt_lifetime = htons_constant(65535),
.icmp6_addrconf_managed = 1,
},
.prefix_info = {
.header = {
.type = OPT_PREFIX_INFO,
.len = 4,
},
.prefix_len = 64,
.prefix_flags = 0xc0, /* prefix flags: L, A */
.valid_lifetime = ~0U,
.pref_lifetime = ~0U,
},
.source_ll = {
.header = {
.type = OPT_SRC_L2_ADDR,
.len = 1,
},
},
};
const struct in6_addr *rsaddr; /* src addr for reply */
char buf[BUFSIZ] = { 0 };
struct ipv6hdr *ip6hr;
struct icmp6hdr *ihr;
struct ethhdr *ehr;
unsigned char *p;
size_t len;
unsigned char *ptr = NULL;
size_t dlen;
if (ih->icmp6_type < RS || ih->icmp6_type > NA)
return 0;
@ -62,28 +233,22 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr)
if (c->no_ndp)
return 1;
ehr = (struct ethhdr *)buf;
ip6hr = (struct ipv6hdr *)(ehr + 1);
ihr = (struct icmp6hdr *)(ip6hr + 1);
if (ih->icmp6_type == NS) {
const struct ndp_ns *ns =
packet_get(p, 0, 0, sizeof(struct ndp_ns), NULL);
if (!ns)
return -1;
if (IN6_IS_ADDR_UNSPECIFIED(saddr))
return 1;
info("NDP: received NS, sending NA");
ihr->icmp6_type = NA;
ihr->icmp6_code = 0;
ihr->icmp6_router = 1;
ihr->icmp6_solicited = 1;
ihr->icmp6_override = 1;
p = (unsigned char *)(ihr + 1);
memcpy(p, ih + 1, sizeof(struct in6_addr)); /* target address */
p += 16;
*p++ = 2; /* target ll */
*p++ = 1; /* length */
memcpy(p, c->mac, ETH_ALEN);
p += 6;
memcpy(&na.target_addr, &ns->target_addr,
sizeof(na.target_addr));
memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
} else if (ih->icmp6_type == RS) {
size_t dns_s_len = 0;
int i, n;
@ -92,31 +257,20 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr)
return 1;
info("NDP: received RS, sending RA");
ihr->icmp6_type = RA;
ihr->icmp6_code = 0;
ihr->icmp6_hop_limit = 255;
ihr->icmp6_rt_lifetime = htons(65535); /* RFC 8319 */
ihr->icmp6_addrconf_managed = 1;
memcpy(&ra.prefix, &c->ip6.addr, sizeof(ra.prefix));
p = (unsigned char *)(ihr + 1);
p += 8; /* reachable, retrans time */
*p++ = 3; /* prefix */
*p++ = 4; /* length */
*p++ = 64; /* prefix length */
*p++ = 0xc0; /* prefix flags: L, A */
*(uint32_t *)p = (uint32_t)~0U; /* lifetime */
p += 4;
*(uint32_t *)p = (uint32_t)~0U; /* preferred lifetime */
p += 8;
memcpy(p, &c->ip6.addr, 8); /* prefix */
p += 16;
ptr = &ra.var[0];
if (c->mtu != -1) {
*p++ = 5; /* type */
*p++ = 1; /* length */
p += 2; /* reserved */
*(uint32_t *)p = htonl(c->mtu); /* MTU */
p += 4;
struct opt_mtu *mtu = (struct opt_mtu *)ptr;
*mtu = (struct opt_mtu) {
.header = {
.type = OPT_MTU,
.len = 1,
},
.value = htonl(c->mtu),
};
ptr += sizeof(struct opt_mtu);
}
if (c->no_dhcp_dns)
@ -124,70 +278,78 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr)
for (n = 0; !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns[n]); n++);
if (n) {
*p++ = 25; /* RDNSS */
*p++ = 1 + 2 * n; /* length */
p += 2; /* reserved */
*(uint32_t *)p = (uint32_t)~0U; /* lifetime */
p += 4;
struct opt_rdnss *rdnss = (struct opt_rdnss *)ptr;
*rdnss = (struct opt_rdnss) {
.header = {
.type = OPT_RDNSS_TYPE,
.len = 1 + 2 * n,
},
.lifetime = ~0U,
};
for (i = 0; i < n; i++) {
memcpy(p, &c->ip6.dns[i], 16); /* address */
p += 16;
memcpy(&rdnss->dns[i], &c->ip6.dns[i],
sizeof(rdnss->dns[i]));
}
ptr += offsetof(struct opt_rdnss, dns) +
i * sizeof(rdnss->dns[0]);
for (n = 0; *c->dns_search[n].n; n++)
dns_s_len += strlen(c->dns_search[n].n) + 2;
}
if (!c->no_dhcp_dns_search && dns_s_len) {
*p++ = 31; /* DNSSL */
*p++ = (dns_s_len + 8 - 1) / 8 + 1; /* length */
p += 2; /* reserved */
*(uint32_t *)p = (uint32_t)~0U; /* lifetime */
p += 4;
struct opt_dnssl *dnssl = (struct opt_dnssl *)ptr;
*dnssl = (struct opt_dnssl) {
.header = {
.type = OPT_DNSSL_TYPE,
.len = DIV_ROUND_UP(dns_s_len, 8) + 1,
},
.lifetime = ~0U,
};
ptr = dnssl->domains;
for (i = 0; i < n; i++) {
size_t len;
char *dot;
*(p++) = '.';
*(ptr++) = '.';
strncpy((char *)p, c->dns_search[i].n,
sizeof(buf) -
((intptr_t)p - (intptr_t)buf));
for (dot = (char *)p - 1; *dot; dot++) {
len = sizeof(dnssl->domains) -
(ptr - dnssl->domains);
strncpy((char *)ptr, c->dns_search[i].n, len);
for (dot = (char *)ptr - 1; *dot; dot++) {
if (*dot == '.')
*dot = strcspn(dot + 1, ".");
}
p += strlen(c->dns_search[i].n);
*(p++) = 0;
ptr += strlen(c->dns_search[i].n);
*(ptr++) = 0;
}
memset(p, 0, 8 - dns_s_len % 8); /* padding */
p += 8 - dns_s_len % 8;
memset(ptr, 0, 8 - dns_s_len % 8); /* padding */
ptr += 8 - dns_s_len % 8;
}
dns_done:
*p++ = 1; /* source ll */
*p++ = 1; /* length */
memcpy(p, c->mac, ETH_ALEN);
p += 6;
memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
} else {
return 1;
}
len = (uintptr_t)p - (uintptr_t)ihr - sizeof(*ihr);
if (IN6_IS_ADDR_LINKLOCAL(saddr))
c->ip6.addr_ll_seen = *saddr;
else
c->ip6.addr_seen = *saddr;
if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
rsaddr = &c->ip6.gw;
else
rsaddr = &c->ip6.addr_ll;
rsaddr = &c->ip6.our_tap_ll;
tap_icmp6_send(c, rsaddr, saddr, ihr, len + sizeof(*ihr));
if (ih->icmp6_type == NS) {
dlen = sizeof(struct ndp_na);
tap_icmp6_send(c, rsaddr, saddr, &na, dlen);
} else if (ih->icmp6_type == RS) {
dlen = ptr - (unsigned char *)&ra;
tap_icmp6_send(c, rsaddr, saddr, &ra, dlen);
}
return 1;
}
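
The DNSSL branch above rewrites each search domain into DNS wire format (every label prefixed by its length, terminated by a zero byte) and sizes the option in units of 8 bytes. A minimal sketch of that encoding, assuming names without a trailing dot and labels of at most 63 bytes; `sketch_encode_name()` and `sketch_dnssl_units()` are hypothetical helpers, not part of ndp.c:

```c
#include <stddef.h>
#include <string.h>

/* Encode a dotted name as length-prefixed labels plus a terminating
 * zero byte. Returns the number of bytes written (strlen(name) + 2),
 * or 0 if the output buffer is too small.
 */
static size_t sketch_encode_name(unsigned char *out, size_t size,
				 const char *name)
{
	size_t len = strlen(name), i;
	unsigned char *count;

	if (len + 2 > size)
		return 0;

	count = out;			/* Length byte of the current label */
	*count = 0;
	for (i = 0; i < len; i++) {
		if (name[i] == '.') {	/* A dot starts the next label */
			count = &out[i + 1];
			*count = 0;
		} else {
			out[i + 1] = name[i];
			(*count)++;
		}
	}
	out[len + 1] = 0;		/* Root label ends the name */

	return len + 2;
}

/* Option length in 8-byte units: one unit for type, length, reserved
 * and lifetime fields, plus the encoded names rounded up to 8 bytes.
 */
static unsigned sketch_dnssl_units(size_t names_len)
{
	return (unsigned)((names_len + 7) / 8 + 1);
}
```

The `+ 2` per name matches the `dns_s_len += strlen(...) + 2` accounting in the hunk above: one byte for the first label's length and one for the terminator, while every later length byte takes the place of a dot.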

3
ndp.h

@ -6,6 +6,7 @@
#ifndef NDP_H
#define NDP_H
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr);
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
const struct pool *p);
#endif /* NDP_H */

326
netlink.c

@ -33,8 +33,13 @@
#include "util.h"
#include "passt.h"
#include "log.h"
#include "ip.h"
#include "netlink.h"
/* Same as RTA_NEXT() but for nexthops: RTNH_NEXT() doesn't take 'attrlen' */
#define RTNH_NEXT_AND_DEC(rtnh, attrlen) \
((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
/* Netlink expects a buffer of at least 8kiB or the system page size,
* whichever is larger. 32kiB is recommended for more efficient operation.
* Since the largest page size on any remotely common Linux setup is
@ -128,7 +133,7 @@ static uint32_t nl_send(int s, void *req, uint16_t type,
n = send(s, req, len, 0);
if (n < 0)
die("netlink: Failed to send(): %s", strerror(errno));
die_perror("netlink: Failed to send()");
else if (n < len)
die("netlink: Short send (%zd of %zd bytes)", n, len);
@ -184,7 +189,7 @@ static struct nlmsghdr *nl_next(int s, char *buf, struct nlmsghdr *nh, ssize_t *
*n = recv(s, buf, NLBUFSIZ, 0);
if (*n < 0)
die("netlink: Failed to recv(): %s", strerror(errno));
die_perror("netlink: Failed to recv()");
nh = (struct nlmsghdr *)buf;
if (!NLMSG_OK(nh, *n))
@ -254,7 +259,8 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
.rtm.rtm_type = RTN_UNICAST,
.rtm.rtm_family = af,
};
unsigned int ifi = 0;
unsigned defifi = 0, anyifi = 0;
unsigned ndef = 0, nany = 0;
struct nlmsghdr *nh;
struct rtattr *rta;
char buf[NLBUFSIZ];
@ -262,30 +268,80 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
uint32_t seq;
size_t na;
/* Look for an interface with a default route first, failing that, look
* for any interface with a route, and pick the first one, if any.
*/
seq = nl_send(s, &req, RTM_GETROUTE, NLM_F_DUMP, sizeof(req));
nl_foreach_oftype(nh, status, s, buf, seq, RTM_NEWROUTE) {
struct rtmsg *rtm = (struct rtmsg *)NLMSG_DATA(nh);
const void *dst = NULL;
unsigned thisifi = 0;
if (ifi || rtm->rtm_dst_len || rtm->rtm_family != af)
if (rtm->rtm_family != af)
continue;
for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
if (rta->rta_type == RTA_OIF) {
ifi = *(unsigned int *)RTA_DATA(rta);
thisifi = *(unsigned int *)RTA_DATA(rta);
} else if (rta->rta_type == RTA_MULTIPATH) {
const struct rtnexthop *rtnh;
rtnh = (struct rtnexthop *)RTA_DATA(rta);
ifi = rtnh->rtnh_ifindex;
thisifi = rtnh->rtnh_ifindex;
} else if (rta->rta_type == RTA_DST) {
dst = RTA_DATA(rta);
}
}
if (!thisifi)
continue; /* No interface for this route */
/* Skip routes to link-local addresses */
if (af == AF_INET && dst &&
IN4_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))
continue;
if (af == AF_INET6 && dst &&
IN6_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))
continue;
if (rtm->rtm_dst_len == 0) {
/* Default route */
ndef++;
if (!defifi)
defifi = thisifi;
} else {
/* Non-default route */
nany++;
if (!anyifi)
anyifi = thisifi;
}
}
if (status < 0)
warn("netlink: RTM_GETROUTE failed: %s", strerror(-status));
return ifi;
if (defifi) {
if (ndef > 1) {
info("Multiple default %s routes, picked first",
af_name(af));
}
return defifi;
}
if (anyifi) {
if (nany > 1) {
info("Multiple interfaces with %s routes, picked first",
af_name(af));
}
return anyifi;
}
if (!nany)
info("No interfaces with usable %s routes", af_name(af));
return 0;
}
/**
@ -297,12 +353,13 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
*/
bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
{
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
bool found = false;
int hops = -1;
for (rtnh = (struct rtnexthop *)RTA_DATA(rta);
RTNH_OK(rtnh, RTA_PAYLOAD(rta)); rtnh = RTNH_NEXT(rtnh)) {
RTNH_OK(rtnh, nh_len); rtnh = RTNH_NEXT_AND_DEC(rtnh, nh_len)) {
size_t len = rtnh->rtnh_len - sizeof(*rtnh);
struct rtattr *rta_inner;
@ -332,7 +389,7 @@ bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
* @af: Address family
* @gw: Default gateway to fill on NL_GET
*
* Return: 0 on success, negative error code on failure
* Return: error on netlink failure, or 0 (gw unset if default route not found)
*/
int nl_route_get_def(int s, unsigned int ifi, sa_family_t af, void *gw)
{
@ -479,7 +536,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
.rta.rta_len = RTA_LENGTH(sizeof(unsigned int)),
.ifi = ifi_src,
};
ssize_t nlmsgs_size, status;
ssize_t nlmsgs_size, left, status;
unsigned dup_routes = 0;
struct nlmsghdr *nh;
char buf[NLBUFSIZ];
@ -493,39 +550,83 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
* routes in the buffer at once.
*/
nh = nl_next(s_src, buf, NULL, &nlmsgs_size);
for (status = nlmsgs_size;
NLMSG_OK(nh, status) && (status = nl_status(nh, status, seq)) > 0;
nh = NLMSG_NEXT(nh, status)) {
for (left = nlmsgs_size;
NLMSG_OK(nh, left) && (status = nl_status(nh, left, seq)) > 0;
nh = NLMSG_NEXT(nh, left)) {
struct rtmsg *rtm = (struct rtmsg *)NLMSG_DATA(nh);
bool discard = false;
struct rtattr *rta;
size_t na;
if (nh->nlmsg_type != RTM_NEWROUTE)
continue;
dup_routes++;
for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
/* RTA_OIF and RTA_MULTIPATH attributes carry the
* identifier of a host interface. If they match the
* host interface we're copying from, change them to
* match the corresponding identifier in the target
* namespace.
*
* If RTA_OIF doesn't match (NETLINK_GET_STRICT_CHK not
* available), or if any interface index in nexthop
* objects differ from the host interface, discard the
* route altogether.
*/
if (rta->rta_type == RTA_OIF) {
/* The host obviously lists the host interface
* id here, we need to change it to the
* namespace's interface id
*/
if (*(unsigned int *)RTA_DATA(rta) != ifi_src) {
discard = true;
break;
}
*(unsigned int *)RTA_DATA(rta) = ifi_dst;
} else if (rta->rta_type == RTA_PREFSRC) {
/* Host routes might include a preferred source
* address, which must be one of the host's
* addresses. However, with -a pasta will use a
* different namespace address, making such a
* route invalid in the namespace. Strip off
* RTA_PREFSRC attributes to avoid that. */
} else if (rta->rta_type == RTA_MULTIPATH) {
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
for (rtnh = (struct rtnexthop *)RTA_DATA(rta);
RTNH_OK(rtnh, nh_len);
rtnh = RTNH_NEXT_AND_DEC(rtnh, nh_len)) {
int src = (int)ifi_src;
if (rtnh->rtnh_ifindex != src) {
discard = true;
break;
}
rtnh->rtnh_ifindex = ifi_dst;
}
if (discard)
break;
} else if (rta->rta_type == RTA_PREFSRC ||
rta->rta_type == RTA_NH_ID) {
/* Strip RTA_PREFSRC attributes: host routes
* might include a preferred source address,
* which must be one of the host's addresses.
* However, with -a, pasta will use a different
* namespace address, making such a route
* invalid in the namespace.
*
* Strip RTA_NH_ID attributes: host routes set
* up via routing protocols (e.g. OSPF) might
* contain a nexthop ID (and not nexthop
* objects, which are taken care of in the
* RTA_MULTIPATH case above) that's not valid
* in the target namespace.
*/
rta->rta_type = RTA_UNSPEC;
}
}
if (discard)
nh->nlmsg_type = NLMSG_NOOP;
else
dup_routes++;
}
if (!NLMSG_OK(nh, status) || status > 0) {
if (!NLMSG_OK(nh, left)) {
/* Process any remaining datagrams in a different
* buffer so we don't overwrite the first one.
*/
@ -551,9 +652,9 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
* to calculate dependencies: let the kernel do that.
*/
for (i = 0; i < dup_routes; i++) {
for (nh = (struct nlmsghdr *)buf, status = nlmsgs_size;
NLMSG_OK(nh, status);
nh = NLMSG_NEXT(nh, status)) {
for (nh = (struct nlmsghdr *)buf, left = nlmsgs_size;
NLMSG_OK(nh, left);
nh = NLMSG_NEXT(nh, left)) {
uint16_t flags = nh->nlmsg_flags;
int rc;
@ -563,7 +664,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
rc = nl_do(s_dst, nh, RTM_NEWROUTE,
(flags & ~NLM_F_DUMP_FILTERED) | NLM_F_CREATE,
nh->nlmsg_len);
if (rc < 0 && rc != -ENETUNREACH && rc != -EEXIST)
if (rc < 0 && rc != -EEXIST &&
rc != -ENETUNREACH && rc != -EHOSTUNREACH)
return rc;
}
}
@ -571,6 +673,63 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
return 0;
}
/**
* nl_addr_set_ll_nodad() - Set IFA_F_NODAD on IPv6 link-local addresses
* @s: Netlink socket
* @ifi: Interface index in target namespace
*
* Return: 0 on success, negative error code on failure
*/
int nl_addr_set_ll_nodad(int s, unsigned int ifi)
{
struct req_t {
struct nlmsghdr nlh;
struct ifaddrmsg ifa;
} req = {
.ifa.ifa_family = AF_INET6,
.ifa.ifa_index = ifi,
};
uint32_t seq, last_seq = 0;
ssize_t status, ret = 0;
struct nlmsghdr *nh;
char buf[NLBUFSIZ];
seq = nl_send(s, &req, RTM_GETADDR, NLM_F_DUMP, sizeof(req));
nl_foreach_oftype(nh, status, s, buf, seq, RTM_NEWADDR) {
struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nh);
struct rtattr *rta;
size_t na;
if (ifa->ifa_index != ifi || ifa->ifa_scope != RT_SCOPE_LINK)
continue;
ifa->ifa_flags |= IFA_F_NODAD;
for (rta = IFA_RTA(ifa), na = IFA_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
/* If 32-bit flags are used, add IFA_F_NODAD there */
if (rta->rta_type == IFA_FLAGS)
*(uint32_t *)RTA_DATA(rta) |= IFA_F_NODAD;
}
last_seq = nl_send(s, nh, RTM_NEWADDR, NLM_F_REPLACE,
nh->nlmsg_len);
}
if (status < 0)
ret = status;
for (seq = seq + 1; seq <= last_seq; seq++) {
nl_foreach(nh, status, s, buf, seq)
warn("netlink: Unexpected response message");
if (!ret && status < 0)
ret = status;
}
return ret;
}
/**
* nl_addr_get() - Get most specific global address, given interface and family
* @s: Netlink socket
@ -580,7 +739,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
* @prefix_len: Mask or prefix length, to fill (for IPv4)
* @addr_l: Link-scoped address to fill (for IPv6)
*
* Return: 9 on success, negative error code on failure
* Return: 0 on success, negative error code on failure
*/
int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
void *addr, int *prefix_len, void *addr_l)
@ -604,12 +763,13 @@ int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
struct rtattr *rta;
size_t na;
if (ifa->ifa_index != ifi)
if (ifa->ifa_index != ifi || ifa->ifa_flags & IFA_F_DEPRECATED)
continue;
for (rta = IFA_RTA(ifa), na = IFA_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
if (rta->rta_type != IFA_ADDRESS)
if ((af == AF_INET && rta->rta_type != IFA_LOCAL) ||
(af == AF_INET6 && rta->rta_type != IFA_ADDRESS))
continue;
if (af == AF_INET && ifa->ifa_prefixlen > prefix_max) {
@ -637,7 +797,54 @@ int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
}
/**
* nl_add_set() - Set IP addresses for given interface and address family
* nl_addr_get_ll() - Get first IPv6 link-local address for a given interface
* @s: Netlink socket
* @ifi: Interface index in outer network namespace
* @addr: Link-local address to fill
*
* Return: 0 on success, negative error code on failure
*/
int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr)
{
struct req_t {
struct nlmsghdr nlh;
struct ifaddrmsg ifa;
} req = {
.ifa.ifa_family = AF_INET6,
.ifa.ifa_index = ifi,
};
struct nlmsghdr *nh;
bool found = false;
char buf[NLBUFSIZ];
ssize_t status;
uint32_t seq;
seq = nl_send(s, &req, RTM_GETADDR, NLM_F_DUMP, sizeof(req));
nl_foreach_oftype(nh, status, s, buf, seq, RTM_NEWADDR) {
struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nh);
struct rtattr *rta;
size_t na;
if (ifa->ifa_index != ifi || ifa->ifa_scope != RT_SCOPE_LINK ||
found)
continue;
for (rta = IFA_RTA(ifa), na = IFA_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
if (rta->rta_type != IFA_ADDRESS)
continue;
if (!found) {
memcpy(addr, RTA_DATA(rta), RTA_PAYLOAD(rta));
found = true;
}
}
}
return status;
}
/**
* nl_addr_set() - Set IP addresses for given interface and address family
* @s: Netlink socket
* @ifi: Interface index
* @af: Address family
@ -740,10 +947,13 @@ int nl_addr_dup(int s_src, unsigned int ifi_src,
ifa = (struct ifaddrmsg *)NLMSG_DATA(nh);
if (rc < 0 || ifa->ifa_scope == RT_SCOPE_LINK ||
ifa->ifa_index != ifi_src)
ifa->ifa_index != ifi_src ||
ifa->ifa_flags & IFA_F_DEPRECATED)
continue;
ifa->ifa_index = ifi_dst;
/* Same as nl_addr_set(), but here it's more than a default */
ifa->ifa_flags |= IFA_F_NODAD;
for (rta = IFA_RTA(ifa), na = IFA_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
@ -751,6 +961,10 @@ int nl_addr_dup(int s_src, unsigned int ifi_src,
if (rta->rta_type == IFA_LABEL ||
rta->rta_type == IFA_CACHEINFO)
rta->rta_type = IFA_UNSPEC;
/* If 32-bit flags are used, add IFA_F_NODAD there */
if (rta->rta_type == IFA_FLAGS)
*(uint32_t *)RTA_DATA(rta) |= IFA_F_NODAD;
}
rc = nl_do(s_dst, nh, RTM_NEWADDR,
@ -832,14 +1046,14 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac)
}
/**
* nl_link_up() - Bring link up
* nl_link_set_mtu() - Set link MTU
* @s: Netlink socket
* @ifi: Interface index
* @mtu: If non-zero, set interface MTU
* @mtu: Interface MTU
*
* Return: 0 on success, negative error code on failure
*/
int nl_link_up(int s, unsigned int ifi, int mtu)
int nl_link_set_mtu(int s, unsigned int ifi, int mtu)
{
struct req_t {
struct nlmsghdr nlh;
@ -849,17 +1063,35 @@ int nl_link_up(int s, unsigned int ifi, int mtu)
} req = {
.ifm.ifi_family = AF_UNSPEC,
.ifm.ifi_index = ifi,
.ifm.ifi_flags = IFF_UP,
.ifm.ifi_change = IFF_UP,
.rta.rta_type = IFLA_MTU,
.rta.rta_len = RTA_LENGTH(sizeof(unsigned int)),
.mtu = mtu,
};
ssize_t len = sizeof(req);
if (!mtu)
/* Shorten request to drop MTU attribute */
len = offsetof(struct req_t, rta);
return nl_do(s, &req, RTM_NEWLINK, 0, len);
return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
}
/**
* nl_link_set_flags() - Set link flags
* @s: Netlink socket
* @ifi: Interface index
* @set: Device flags to set
* @change: Mask of device flag changes
*
* Return: 0 on success, negative error code on failure
*/
int nl_link_set_flags(int s, unsigned int ifi,
unsigned int set, unsigned int change)
{
struct req_t {
struct nlmsghdr nlh;
struct ifinfomsg ifm;
} req = {
.ifm.ifi_family = AF_UNSPEC,
.ifm.ifi_index = ifi,
.ifm.ifi_flags = set,
.ifm.ifi_change = change,
};
return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
}
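
The reworked nl_get_ext_if() in this file prefers the first interface carrying a default route and only falls back to the first interface with any other route. A minimal sketch of that selection, assuming routes were already parsed into a hypothetical `struct sketch_route` array (the real function walks netlink attributes and also skips link-local destinations, which is left out here):

```c
#include <stddef.h>

/* Hypothetical, already-parsed view of one route */
struct sketch_route {
	unsigned ifi;		/* Output interface index, 0 if none */
	unsigned dst_len;	/* Destination prefix length, 0 for default */
};

/* Return the interface of the first default route, failing that the
 * interface of the first non-default route, or 0 if there is none.
 */
static unsigned sketch_pick_ifi(const struct sketch_route *r, size_t n)
{
	unsigned defifi = 0, anyifi = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!r[i].ifi)
			continue;	/* No interface for this route */

		if (r[i].dst_len == 0) {
			if (!defifi)
				defifi = r[i].ifi;
		} else if (!anyifi) {
			anyifi = r[i].ifi;
		}
	}

	return defifi ? defifi : anyifi;
}
```

Keeping two candidates while scanning means the order of routes in the dump does not matter: a default route found late still wins over a non-default route found early.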


@ -19,10 +19,14 @@ int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
void *addr, int *prefix_len, void *addr_l);
int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
const void *addr, int prefix_len);
int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
int nl_addr_set_ll_nodad(int s, unsigned int ifi);
int nl_addr_dup(int s_src, unsigned int ifi_src,
int s_dst, unsigned int ifi_dst, sa_family_t af);
int nl_link_get_mac(int s, unsigned int ifi, void *mac);
int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
int nl_link_up(int s, unsigned int ifi, int mtu);
int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
int nl_link_set_flags(int s, unsigned int ifi,
unsigned int set, unsigned int change);
#endif /* NETLINK_H */


@ -22,42 +22,6 @@
#include "util.h"
#include "log.h"
static int packet_check_range(const struct pool *p, size_t offset, size_t len,
const char *start, const char *func, int line)
{
ASSERT(p->buf);
if (p->buf_size == 0)
return vu_packet_check_range((void *)p->buf, offset, len, start,
func, line);
if (start < p->buf) {
if (func) {
trace("add packet start %p before buffer start %p, "
"%s:%i", (void *)start, (void *)p->buf, func, line);
}
return -1;
}
if (start + len + offset > p->buf + p->buf_size) {
if (func) {
trace("packet offset plus length %lu from size %lu, "
"%s:%i", start - p->buf + len + offset,
p->buf_size, func, line);
}
return -1;
}
#if UINTPTR_MAX == UINT64_MAX
if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
trace("add packet start %p, buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
return -1;
}
#endif
return 0;
}
/**
* packet_add_do() - Add data as packet descriptor to given pool
* @p: Existing pool
@ -77,16 +41,34 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
return;
}
if (packet_check_range(p, 0, len, start, func, line))
if (start < p->buf) {
trace("add packet start %p before buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
return;
}
if (start + len > p->buf + p->buf_size) {
trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
(void *)start, len, (void *)(p->buf + p->buf_size),
func, line);
return;
}
if (len > UINT16_MAX) {
trace("add packet length %zu, %s:%i", len, func, line);
return;
}
p->pkt[idx].iov_base = (void *)start;
p->pkt[idx].iov_len = len;
#if UINTPTR_MAX == UINT64_MAX
if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
trace("add packet start %p, buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
return;
}
#endif
p->pkt[idx].offset = start - p->buf;
p->pkt[idx].len = len;
p->count++;
}
@ -122,23 +104,28 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
return NULL;
}
if (len + offset > p->pkt[idx].iov_len) {
if (p->pkt[idx].offset + len + offset > p->buf_size) {
if (func) {
trace("data length %zu, offset %zu from length %zu, "
"%s:%i", len, offset, p->pkt[idx].iov_len,
trace("packet offset plus length %zu from size %zu, "
"%s:%i", p->pkt[idx].offset + len + offset,
p->buf_size, func, line);
}
return NULL;
}
if (len + offset > p->pkt[idx].len) {
if (func) {
trace("data length %zu, offset %zu from length %u, "
"%s:%i", len, offset, p->pkt[idx].len,
func, line);
}
return NULL;
}
if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
func, line))
return NULL;
if (left)
*left = p->pkt[idx].iov_len - offset - len;
*left = p->pkt[idx].len - offset - len;
return (char *)p->pkt[idx].iov_base + offset;
return p->buf + p->pkt[idx].offset + offset;
}
/**


@ -6,6 +6,16 @@
#ifndef PACKET_H
#define PACKET_H
/**
* struct desc - Generic offset-based descriptor within buffer
* @offset: Offset of descriptor relative to buffer start, 32-bit limit
* @len: Length of descriptor, host order, 16-bit limit
*/
struct desc {
uint32_t offset;
uint16_t len;
};
/**
* struct pool - Generic pool of packets stored in a buffer
* @buf: Buffer storing packet descriptors
@ -19,11 +29,9 @@ struct pool {
size_t buf_size;
size_t size;
size_t count;
struct iovec pkt[1];
struct desc pkt[1];
};
int vu_packet_check_range(void *buf, size_t offset, size_t len,
const char *start, const char *func, int line);
void packet_add_do(struct pool *p, size_t len, const char *start,
const char *func, int line);
void *packet_get_do(const struct pool *p, const size_t idx,
@ -46,7 +54,7 @@ struct _name ## _t { \
size_t buf_size; \
size_t size; \
size_t count; \
struct iovec pkt[_size]; \
struct desc pkt[_size]; \
}
#define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size) \
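
The packet.c and packet.h hunks above store each packet as a 32-bit offset and a 16-bit length relative to the pool buffer instead of a `struct iovec`. A minimal sketch of the bounds checks this implies, assuming a single flat buffer; `sketch_desc_set()` and `sketch_desc_get()` are hypothetical simplifications of packet_add_do() and packet_get_do(), without the pool bookkeeping or trace messages:

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct desc in the hunk above */
struct sketch_desc {
	uint32_t offset;
	uint16_t len;
};

/* Record [start, start + len) as a descriptor relative to buf.
 * Returns 0 on success, -1 if the range does not fit the buffer or
 * the descriptor fields.
 */
static int sketch_desc_set(struct sketch_desc *d, const char *buf,
			   size_t buf_size, const char *start, size_t len)
{
	size_t off;

	if (start < buf)
		return -1;

	off = (size_t)(start - buf);
	if (off > buf_size || len > buf_size - off)
		return -1;		/* Range ends past the buffer */

	if (len > UINT16_MAX || off > UINT32_MAX)
		return -1;		/* Doesn't fit descriptor fields */

	d->offset = (uint32_t)off;
	d->len = (uint16_t)len;
	return 0;
}

/* Return a pointer to len bytes at the given offset inside the
 * packet, or NULL if that range exceeds the recorded packet length.
 */
static const char *sketch_desc_get(const struct sketch_desc *d,
				   const char *buf, size_t offset, size_t len)
{
	if (offset > d->len || len > d->len - offset)
		return NULL;

	return buf + d->offset + offset;
}
```

The subtraction-based comparisons (`len > buf_size - off`) are a choice made for this sketch to avoid overflow in the additions; the hunk itself compares the sums directly.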

216
passt.1

@ -73,6 +73,9 @@ for performance reasons.
.SH OPTIONS
Unless otherwise noted below, \fBif conflicting or multiple options are given,
the last one takes effect.\fR
.TP
.BR \-d ", " \-\-debug
Be verbose, don't log to the system logger.
@ -92,14 +95,18 @@ detached PID namespace after starting, because the PID itself cannot change.
Default is to fork into background.
.TP
.BR \-e ", " \-\-stderr
Log to standard error too.
Default is to log to the system logger only, if started from an interactive
terminal, and to both system logger and standard error otherwise.
.BR \-e ", " \-\-stderr " " (DEPRECATED)
This option has no effect, and is maintained for compatibility purposes only.
Note that this configuration option is \fBdeprecated\fR and will be removed in a
future version.
.TP
.BR \-l ", " \-\-log-file " " \fIPATH\fR
Log to file \fIPATH\fR, not to standard error, and not to the system logger.
Log to file \fIPATH\fR, and not to the system logger.
Specifying this option multiple times does \fInot\fR lead to multiple log files:
the last given option takes effect.
.TP
.BR \-\-log-size " " \fISIZE\fR
@ -128,6 +135,9 @@ Show version and exit.
Capture tap-facing (that is, guest-side or namespace-side) network packets to
\fIfile\fR in \fBpcap\fR format.
Specifying this option multiple times does \fInot\fR lead to multiple capture
files: the last given option takes effect.
.TP
.BR \-P ", " \-\-pid " " \fIfile
Write own PID to \fIfile\fR once initialisation is done, before forking to
@ -148,7 +158,9 @@ for an IPv6 \fIaddr\fR.
This option can be specified zero (for defaults) to two times (once for IPv4,
once for IPv6).
By default, assigned IPv4 and IPv6 addresses are taken from the host interfaces
with the first default route for the corresponding IP version.
with the first default route, if any, for the corresponding IP version. If no
default routes are available and there is any interface with any route for a
given IP version, the first of these interfaces will be chosen instead.
.TP
.BR \-n ", " \-\-netmask " " \fImask
@ -172,9 +184,11 @@ Assign IPv4 \fIaddr\fR as default gateway via DHCP (option 3), or IPv6
This option can be specified zero (for defaults) to two times (once for IPv4,
once for IPv6).
By default, IPv4 and IPv6 gateways are taken from the host interface with the
first default route for the corresponding IP version. If the default route is a
multipath one, the gateway is the first nexthop router returned by the kernel
which has the highest weight in the set of paths.
first default route, if any, for the corresponding IP version. If the default
route is a multipath one, the gateway is the first nexthop router returned by
the kernel which has the highest weight in the set of paths. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
Note: these addresses are also used as source address for packets directed to
the guest or to the target namespace having a loopback or local source address,
@ -185,9 +199,11 @@ to allow mapping of local traffic to guest and target namespace. See the
.BR \-i ", " \-\-interface " " \fIname
Use host interface \fIname\fR to derive addresses and routes.
Default is to use the interfaces specified by \fB--outbound-if4\fR and
\fB--outbound-if6\fR, for IPv4 and IPv6 addresses and routes, respectively. If
no interfaces are given, the interface with the first default routes for each IP
version is selected.
\fB--outbound-if6\fR, for IPv4 and IPv6 addresses and routes, respectively.
If no interfaces are given, the interface with the first default routes for each
IP version is selected. If no default routes are available and there is just one
interface with any route, that interface will be chosen instead.
.TP
.BR \-o ", " \-\-outbound " " \fIaddr
@ -203,30 +219,49 @@ By default, the source address is selected by the routing tables.
Bind IPv4 outbound sockets to host interface \fIname\fR, and, unless another
interface is specified via \fB-i\fR, \fB--interface\fR, use this interface to
derive IPv4 addresses and routes.
By default, the interface given by the default route is selected.
By default, the interface given by the default route is selected. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
.TP
.BR \-\-outbound-if6 " " \fIname
Bind IPv6 outbound sockets to host interface \fIname\fR, and, unless another
interface is specified via \fB-i\fR, \fB--interface\fR, use this interface to
derive IPv6 addresses and routes.
By default, the interface given by the default route is selected.
By default, the interface given by the default route is selected. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
.TP
.BR \-D ", " \-\-dns " " \fIaddr
Use \fIaddr\fR (IPv4 or IPv6) for DHCP, DHCPv6, NDP or DNS forwarding, as
configured (see options \fB--no-dhcp-dns\fR, \fB--dhcp-dns\fR,
\fB--dns-forward\fR) instead of reading addresses from \fI/etc/resolv.conf\fR.
This option can be specified multiple times. Specifying \fB-D none\fR disables
usage of DNS addresses altogether.
Instruct the guest (via DHCP, DHCPv6 or NDP) to use \fIaddr\fR (IPv4
or IPv6) as a nameserver, as configured (see options
\fB--no-dhcp-dns\fR, \fB--dhcp-dns\fR) instead of reading addresses
from \fI/etc/resolv.conf\fR. This option can be specified multiple
times. Specifying \fB-D none\fR disables usage of DNS addresses
altogether. Unlike addresses from \fI/etc/resolv.conf\fR, \fIaddr\fR
is given to the guest without remapping. For example \fB--dns
127.0.0.1\fR will instruct the guest to use itself as nameserver, not
the host.
.TP
.BR \-\-dns-forward " " \fIaddr
Map \fIaddr\fR (IPv4 or IPv6) as seen from guest or namespace to the first
configured DNS resolver (with corresponding IP version). Mapping is limited to
UDP traffic directed to port 53, and DNS answers are translated back with a
reverse mapping.
This option can be specified zero to two times (once for IPv4, once for IPv6).
Map \fIaddr\fR (IPv4 or IPv6) as seen from guest or namespace to the
nameserver (with corresponding IP version) specified by the
\fB\-\-dns-host\fR option. Maps only UDP and TCP traffic to port 53 or
port 853. Replies are translated back with a reverse mapping. This
option can be specified zero to two times (once for IPv4, once for
IPv6).
.TP
.BR \-\-dns-host " " \fIaddr
Configure the host nameserver which guest or namespace queries to the
\fB\-\-dns-forward\fR address will be redirected to. This option can
be specified zero to two times (once for IPv4, once for IPv6).
By default, the first nameserver from the host's
\fI/etc/resolv.conf\fR.
.TP
.BR \-S ", " \-\-search " " \fIlist
@ -237,28 +272,28 @@ list altogether (if you need to search a domain called "none" you can use
\fB--search none.\fR).
.TP
.BR \-\-no-dhcp-dns " " \fIaddr
.BR \-\-no-dhcp-dns
In \fIpasst\fR mode, do not assign IPv4 addresses via DHCP (option 6) or IPv6
addresses via NDP Router Advertisement (option type 25) and DHCPv6 (option 23)
as DNS resolvers.
By default, all the configured addresses are passed.
.TP
.BR \-\-dhcp-dns " " \fIaddr
.BR \-\-dhcp-dns
In \fIpasta\fR mode, assign IPv4 addresses via DHCP (option 6) or IPv6
addresses via NDP Router Advertisement (option type 25) and DHCPv6 (option 23)
as DNS resolvers.
By default, configured addresses, if any, are not passed.
.TP
.BR \-\-no-dhcp-search " " \fIaddr
.BR \-\-no-dhcp-search
In \fIpasst\fR mode, do not send the DNS domain search list addresses via DHCP
(option 119), via NDP Router Advertisement (option type 31) and DHCPv6 (option
24).
By default, the DNS domain search list resulting from configuration is passed.
.TP
.BR \-\-dhcp-search " " \fIaddr
.BR \-\-dhcp-search
In \fIpasta\fR mode, send the DNS domain search list addresses via DHCP (option
119), via NDP Router Advertisement (option type 31) and DHCPv6 (option 24).
By default, the DNS domain search list resulting from configuration is not
@ -301,23 +336,63 @@ namespace will be silently dropped.
Disable Router Advertisements. Router Solicitations coming from guest or target
namespace will be ignored.
.TP
.BR \-\-freebind
Allow any binding address to be specified for \fB-t\fR and \fB-u\fR
options. Usually binding addresses must be addresses currently
configured on the host. With \fB\-\-freebind\fR, the
\fBIP_FREEBIND\fR or \fBIPV6_FREEBIND\fR socket option is enabled
allowing any address to be used. This is typically used to bind
addresses which might be configured on the host in future, at which
point the forwarding will immediately start operating.
.TP
.BR \-\-map-host-loopback " " \fIaddr
Translate \fIaddr\fR to refer to the host. Packets from the guest to
\fIaddr\fR will be redirected to the host. On the host such packets
will appear to have both source and destination of 127.0.0.1 or ::1.
If \fIaddr\fR is 'none', no address is mapped (this implies
\fB--no-map-gw\fR). Only one IPv4 and one IPv6 address can be
translated; if the option is specified multiple times, the last one
takes effect.
Default is to translate the guest's default gateway address, unless
\fB--no-map-gw\fR is given, in which case no address is mapped.
.TP
.BR \-\-no-map-gw
Don't remap TCP connections and untracked UDP traffic, with the gateway address
as destination, to the host. Implied if there is no gateway on the selected
default route for any of the enabled address families.
default route, or if there is no default route, for any of the enabled address
families.
.TP
.BR \-\-map-guest-addr " " \fIaddr
Translate \fIaddr\fR in the guest to be equal to the guest's assigned
address on the host. That is, packets from the guest to \fIaddr\fR
will be redirected to the address assigned to the guest with \fB-a\fR,
or by default the host's global address. This allows the guest to
access services available on the host's global address, even though its
own address shadows that of the host.
If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one
IPv6 address can be translated, and if the option is specified
multiple times, the last one for each address type takes effect.
Default is no mapping.
.TP
.BR \-4 ", " \-\-ipv4-only
Enable IPv4-only operation. IPv6 traffic will be ignored.
By default, IPv6 operation is enabled as long as at least an IPv6 default route
and an interface address are configured on a given host interface.
By default, IPv6 operation is enabled as long as at least an IPv6 route and an
interface address are configured on a given host interface.
.TP
.BR \-6 ", " \-\-ipv6-only
Enable IPv6-only operation. IPv4 traffic will be ignored.
By default, IPv4 operation is enabled as long as at least an IPv4 default route
and an interface address are configured on a given host interface.
By default, IPv4 operation is enabled as long as at least an IPv4 route and an
interface address are configured on a given host interface.
.SS \fBpasst\fR-only options
@ -530,6 +605,13 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
.TP
.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
the host's loopback address will appear on the loopback address in the
guest as well. Without this option such forwarded packets will appear
to come from the guest's public address.
.TP
.BR \-\-userns " " \fIspec
Target user namespace to join, as a path. If PID is given, without this option,
@ -566,7 +648,7 @@ or sourced from the host, and bring up the tap interface.
.BR \-\-no-copy-routes " " (DEPRECATED)
With \-\-config-net, do not copy all the routes associated to the interface we
derive addresses and routes from: set up only the default gateway. Implied by
-g, \-\-gateway.
-g, \-\-gateway, for the corresponding IP version only.
Default is to copy all the routing entries from the interface in the outer
namespace to the target namespace, translating the output interface attribute to
@ -581,7 +663,7 @@ below.
.BR \-\-no-copy-addrs " " (DEPRECATED)
With \-\-config-net, do not copy all the addresses associated to the interface
we derive addresses and routes from: set up a single one. Implied by \-a,
\-\-address.
\-\-address, for the corresponding IP version only.
Default is to copy all the addresses, except for link-local ones, from the
interface from the outer namespace to the target namespace.
@ -807,38 +889,41 @@ root@localhost's password:
.SH NOTES
.SS Handling of traffic with local destination and source addresses
.SS Handling of traffic with loopback destination and source addresses
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address,
depending on the configuration. Local destination or source addresses need to be
changed before packets are delivered to the guest or target namespace: most
operating systems would drop packets received from non-loopback interfaces with
local addresses, and it would also be impossible for guest or target namespace
to route answers back.
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback
address (127.0.0.0/8 or ::1), depending on the configuration. Loopback
destination or source addresses need to be changed before packets are
delivered to the guest or target namespace: most operating systems
would drop packets received with loopback addresses on non-loopback
interfaces, and it would also be impossible for guest or target
namespace to route answers back.
For convenience, and somewhat arbitrarily, the source address on these packets
is translated to the address of the default IPv4 or IPv6 gateway -- this is
known to be an existing, valid address on the same subnet.
For convenience, the source address on these packets is translated to
the address specified by the \fB\-\-map-host-loopback\fR option (with
some exceptions in pasta mode, see next section below). If not
specified this defaults, somewhat arbitrarily, to the address of
the default IPv4 or IPv6 gateway (if any) -- this is known to be an
existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
\fB\-\-map-host-loopback none\fR are specified this translation is
disabled and packets with loopback addresses are simply dropped.
Loopback destination addresses are instead translated to the observed external
address of the guest or target namespace. For IPv6 packets, if usage of a
link-local address by guest or namespace has ever been observed, and the
original destination address is also a link-local address, the observed
link-local address is used. Otherwise, the observed global address is used. For
both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses
will be used instead.
Loopback destination addresses are translated to the observed external
address of the guest or target namespace. For IPv6, the observed
link-local address is used if the translated source address is
link-local, otherwise the observed global address is used. For both
IPv4 and IPv6, if no addresses have been seen yet, the configured
addresses will be used instead.
For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1,
with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
Similarly, for traffic coming from guest or namespace, packets with destination
address corresponding to the default gateway will have their destination address
translated to a loopback address, if and only if a packet, in the opposite
direction, with a loopback destination or source address, port-wise matching for
UDP, or connection-wise for TCP, has been recently forwarded to guest or
namespace. This behaviour can be disabled with \-\-no\-map\-gw.
Similarly, for traffic coming from guest or namespace, packets with
destination address corresponding to the \fB\-\-map-host-loopback\fR
address will have their destination address translated to a loopback
address.
.SS Handling of local traffic in pasta
@ -854,8 +939,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair
of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet
transfers.
This bypass only applies to local connections and traffic, because it's not
possible to bind sockets to foreign addresses.
Because it's not possible to bind sockets to foreign addresses, this
bypass only applies to local connections and traffic. It also means
that the address translation differs slightly from passt mode.
Connections from loopback to loopback on the host will appear to come
from the target namespace's public address within the guest, unless
\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will
appear to come from loopback in the namespace as well. The latter
behaviour used to be the default, but is usually undesirable, since it
can unintentionally expose namespace local services to the host.
.SS Binding to low numbered ports (well-known or system ports, up to 1023)
@ -964,8 +1056,8 @@ https://passt.top/passt/lists.
Copyright (c) 2020-2022 Red Hat GmbH.
\fBpasst\fR and \fBpasta\fR are free software: you can redistribute them and/or
modify them under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the License, or
modify them under the terms of the GNU General Public License as
published by the Free Software Foundation, either version 2 of the License, or
(at your option) any later version.
.SH SEE ALSO

passt.c

@ -35,6 +35,7 @@
#include <syslog.h>
#include <sys/prctl.h>
#include <netinet/if_ether.h>
#include <libgen.h>
#ifdef HAS_GETRANDOM
#include <sys/random.h>
#endif
@ -65,16 +66,14 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_TCP_SPLICE] = "connected spliced TCP socket",
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
[EPOLL_TYPE_UDP] = "UDP socket",
[EPOLL_TYPE_ICMP] = "ICMP socket",
[EPOLL_TYPE_ICMPV6] = "ICMPv6 socket",
[EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket",
[EPOLL_TYPE_UDP_REPLY] = "UDP reply socket",
[EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket",
[EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch",
[EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch",
[EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device",
[EPOLL_TYPE_TAP_PASST] = "connected qemu socket",
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
[EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket",
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@ -86,7 +85,7 @@ static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
*/
static void post_handler(struct ctx *c, const struct timespec *now)
{
#define CALL_PROTO_HANDLER(c, now, lc, uc) \
#define CALL_PROTO_HANDLER(lc, uc) \
do { \
extern void \
lc ## _defer_handler (struct ctx *c) \
@ -105,11 +104,9 @@ static void post_handler(struct ctx *c, const struct timespec *now)
} while (0)
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
CALL_PROTO_HANDLER(c, now, tcp, TCP);
CALL_PROTO_HANDLER(tcp, TCP);
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
CALL_PROTO_HANDLER(c, now, udp, UDP);
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
CALL_PROTO_HANDLER(c, now, icmp, ICMP);
CALL_PROTO_HANDLER(udp, UDP);
flow_defer_handler(c, now);
#undef CALL_PROTO_HANDLER
@ -140,14 +137,13 @@ static void secret_init(struct ctx *c)
}
if (dev_random >= 0)
close(dev_random);
if (random_read < sizeof(c->hash_secret)) {
if (random_read < sizeof(c->hash_secret))
#else
if (getrandom(&c->hash_secret, sizeof(c->hash_secret),
GRND_RANDOM) < 0) {
GRND_RANDOM) < 0)
#endif /* !HAS_GETRANDOM */
perror("TCP initial sequence getrandom");
exit(EXIT_FAILURE);
}
die_perror("Failed to get random bytes for hash table and TCP");
}
/**
@ -167,7 +163,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
*/
void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
{
tcp_buf_update_l2(eth_d, eth_s);
tcp_update_l2_buf(eth_d, eth_s);
udp_update_l2_buf(eth_d, eth_s);
}
@ -195,28 +191,30 @@ void exit_handler(int signal)
* Return: non-zero on failure
*
* #syscalls read write writev
* #syscalls socket bind connect getsockopt setsockopt s390x:socketcall close
* #syscalls recvfrom sendto shutdown
* #syscalls armv6l:recv armv7l:recv ppc64le:recv
* #syscalls armv6l:send armv7l:send ppc64le:send
* #syscalls socket getsockopt setsockopt s390x:socketcall i686:socketcall close
* #syscalls bind connect recvfrom sendto shutdown
* #syscalls arm:recv ppc64le:recv arm:send ppc64le:send
* #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
* #syscalls clock_gettime armv6l:clock_gettime64 armv7l:clock_gettime64
* #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64
*/
int main(int argc, char **argv)
{
int nfds, i, devnull_fd = -1, pidfile_fd = -1;
struct epoll_event events[EPOLL_EVENTS];
char *log_name, argv0[PATH_MAX], *name;
int nfds, i, devnull_fd = -1;
char argv0[PATH_MAX], *name;
struct ctx c = { 0 };
struct rlimit limit;
struct timespec now;
struct sigaction sa;
if (clock_gettime(CLOCK_MONOTONIC, &log_start))
die_perror("Failed to get CLOCK_MONOTONIC time");
arch_avx2_exec(argv);
isolate_initial();
isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.fd_tap_listen = -1;
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
@ -231,70 +229,52 @@ int main(int argc, char **argv)
name = basename(argv0);
if (strstr(name, "pasta")) {
sa.sa_handler = pasta_child_handler;
if (sigaction(SIGCHLD, &sa, NULL)) {
die("Couldn't install signal handlers: %s",
strerror(errno));
}
if (sigaction(SIGCHLD, &sa, NULL))
die_perror("Couldn't install signal handlers");
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
die("Couldn't set disposition for SIGPIPE: %s",
strerror(errno));
}
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
c.mode = MODE_PASTA;
log_name = "pasta";
} else if (strstr(name, "passt")) {
c.mode = MODE_PASST;
log_name = "passt";
} else {
exit(EXIT_FAILURE);
}
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
__openlog(log_name, 0, LOG_DAEMON);
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
if (c.epollfd == -1) {
perror("epoll_create1");
exit(EXIT_FAILURE);
}
if (c.epollfd == -1)
die_perror("Failed to create epoll file descriptor");
if (getrlimit(RLIMIT_NOFILE, &limit))
die_perror("Failed to get maximum value of open files limit");
if (getrlimit(RLIMIT_NOFILE, &limit)) {
perror("getrlimit");
exit(EXIT_FAILURE);
}
c.nofile = limit.rlim_cur = limit.rlim_max;
if (setrlimit(RLIMIT_NOFILE, &limit)) {
perror("setrlimit");
exit(EXIT_FAILURE);
}
if (setrlimit(RLIMIT_NOFILE, &limit))
die_perror("Failed to set current limit for open files");
sock_probe_mem(&c);
conf(&c, argc, argv);
trace_init(c.trace);
if (c.force_stderr || isatty(fileno(stdout)))
__openlog(log_name, LOG_PERROR, LOG_DAEMON);
pasta_netns_quit_init(&c);
tap_sock_init(&c);
vu_init(&c);
secret_init(&c);
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
die_perror("Failed to get CLOCK_MONOTONIC time");
flow_init();
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
exit(EXIT_FAILURE);
if (!c.no_icmp)
icmp_init();
proto_update_l2_buf(c.mac_guest, c.mac);
proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
if (c.ifi4 && !c.no_dhcp)
dhcp_init();
@ -305,46 +285,39 @@ int main(int argc, char **argv)
pcap_init(&c);
if (!c.foreground) {
if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0) {
perror("/dev/null open");
exit(EXIT_FAILURE);
}
}
if (*c.pid_file) {
if ((pidfile_fd = open(c.pid_file,
O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC,
S_IRUSR | S_IWUSR)) < 0) {
perror("PID file open");
exit(EXIT_FAILURE);
}
if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
die_perror("Failed to open /dev/null");
}
if (isolate_prefork(&c))
die("Failed to sandbox process, exiting");
if (!c.foreground)
__daemon(pidfile_fd, devnull_fd);
else
write_pidfile(pidfile_fd, getpid());
if (!c.foreground) {
__daemon(c.pidfile_fd, devnull_fd);
log_stderr = false;
} else {
pidfile_write(c.pidfile_fd, getpid());
}
if (pasta_child_pid)
if (pasta_child_pid) {
kill(pasta_child_pid, SIGUSR1);
log_stderr = false;
}
isolate_postfork(&c);
timer_init(&c, &now);
loop:
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
/* NOLINTBEGIN(bugprone-branch-clone): intervals can be the same */
/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
if (nfds == -1 && errno != EINTR) {
perror("epoll_wait");
exit(EXIT_FAILURE);
}
/* NOLINTEND(bugprone-branch-clone) */
if (nfds == -1 && errno != EINTR)
die_perror("epoll_wait() failed in main loop");
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
err_perror("Failed to get CLOCK_MONOTONIC time");
for (i = 0; i < nfds; i++) {
union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
@ -382,23 +355,14 @@ loop:
case EPOLL_TYPE_TCP_TIMER:
tcp_timer_handler(&c, ref);
break;
case EPOLL_TYPE_UDP:
if (c.mode == MODE_VU)
udp_vu_sock_handler(&c, ref, eventmask, &now);
else
udp_buf_sock_handler(&c, ref, eventmask, &now);
case EPOLL_TYPE_UDP_LISTEN:
udp_listen_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_ICMP:
icmp_sock_handler(&c, AF_INET, ref);
case EPOLL_TYPE_UDP_REPLY:
udp_reply_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_ICMPV6:
icmp_sock_handler(&c, AF_INET6, ref);
break;
case EPOLL_TYPE_VHOST_CMD:
tap_handler_vu(&c, eventmask);
break;
case EPOLL_TYPE_VHOST_KICK:
vu_kick_cb(&c, ref);
case EPOLL_TYPE_PING:
icmp_sock_handler(&c, ref);
break;
default:
/* Can't happen */
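
The dispatch switch above and the `epoll_type_str[]` table earlier in this file both depend on the type enum and its name table staying in sync. Below is a minimal sketch of that pattern, assuming a shortened, hypothetical enum: all `sketch_`/`SKETCH_` names are invented for illustration and are not passt identifiers.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical, shortened enum of file descriptor types */
enum sketch_type {
	SKETCH_TYPE_NONE = 0,
	SKETCH_TYPE_UDP_LISTEN,
	SKETCH_TYPE_UDP_REPLY,
	SKETCH_TYPE_PING,
	SKETCH_NUM_TYPES,
};

/* Names indexed by enum value, using designated initialisers */
static const char *sketch_type_str[] = {
	[SKETCH_TYPE_NONE]		= "<none>",
	[SKETCH_TYPE_UDP_LISTEN]	= "listening UDP socket",
	[SKETCH_TYPE_UDP_REPLY]		= "UDP reply socket",
	[SKETCH_TYPE_PING]		= "ICMP/ICMPv6 ping socket",
};

/* Build fails if the table's size and the enum's count differ, which
 * happens for example when a type is appended to the enum without a name
 */
static_assert(SKETCH_ARRAY_SIZE(sketch_type_str) == SKETCH_NUM_TYPES,
	      "sketch_type_str[] doesn't match enum sketch_type");

/* Look up a name, tolerating out-of-range values and unnamed entries */
static const char *sketch_type_name(unsigned int type)
{
	if (type >= SKETCH_NUM_TYPES || !sketch_type_str[type])
		return "<unknown>";

	return sketch_type_str[type];
}
```

The size check only catches a missing entry at the end of the table. An entry left out in the middle is initialised to NULL, which is why the lookup helper checks for it as well.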

passt.h

@ -9,26 +9,6 @@
#define UNIX_SOCK_MAX 100
#define UNIX_SOCK_PATH "/tmp/passt_%i.socket"
/**
* struct tap_msg - Generic message descriptor for arrays of messages
* @pkt_buf_offset: Offset from @pkt_buf
* @len: Message length, with L2 headers
*/
struct tap_msg {
uint32_t pkt_buf_offset;
uint16_t len;
};
/**
* struct tap_l4_msg - Layer-4 message descriptor for protocol handlers
* @pkt_buf_offset: Offset of message from @pkt_buf
* @l4_len: Length of Layer-4 payload, host order
*/
struct tap_l4_msg {
uint32_t pkt_buf_offset;
uint16_t l4_len;
};
union epoll_ref;
#include <stdbool.h>
@ -37,51 +17,21 @@ union epoll_ref;
#include "pif.h"
#include "packet.h"
#include "siphash.h"
#include "ip.h"
#include "inany.h"
#include "flow.h"
#include "icmp.h"
#include "fwd.h"
#include "tcp.h"
#include "udp.h"
#include "udp_vu.h"
#include "vhost_user.h"
/**
* enum epoll_type - Different types of fds we poll over
/* Default address for our end on the tap interface. Bit 0 of byte 0 must be 0
* (unicast) and bit 1 of byte 0 must be 1 (locally administered). Otherwise
* it's arbitrary.
*/
enum epoll_type {
/* Special value to indicate an invalid type */
EPOLL_TYPE_NONE = 0,
/* Connected TCP sockets */
EPOLL_TYPE_TCP,
/* Connected TCP sockets (spliced) */
EPOLL_TYPE_TCP_SPLICE,
/* Listening TCP sockets */
EPOLL_TYPE_TCP_LISTEN,
/* timerfds used for TCP timers */
EPOLL_TYPE_TCP_TIMER,
/* UDP sockets */
EPOLL_TYPE_UDP,
/* IPv4 ICMP sockets */
EPOLL_TYPE_ICMP,
/* ICMPv6 sockets */
EPOLL_TYPE_ICMPV6,
/* inotify fd watching for end of netns (pasta) */
EPOLL_TYPE_NSQUIT_INOTIFY,
/* timer fd watching for end of netns, fallback for inotify (pasta) */
EPOLL_TYPE_NSQUIT_TIMER,
/* tuntap character device */
EPOLL_TYPE_TAP_PASTA,
/* socket connected to qemu */
EPOLL_TYPE_TAP_PASST,
/* socket listening for qemu socket connections */
EPOLL_TYPE_TAP_LISTEN,
/* vhost-user command socket */
EPOLL_TYPE_VHOST_CMD,
/* vhost-user kick event socket */
EPOLL_TYPE_VHOST_KICK,
EPOLL_NUM_TYPES,
};
#define MAC_OUR_LAA \
((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55})
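
For Ethernet addresses, both the unicast/multicast flag and the locally administered flag are carried in the first octet (bit 0 and bit 1, respectively). A minimal sketch of checking them against the value used above follows; the helper names are hypothetical and not part of passt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the address is unicast: I/G bit (bit 0 of first octet) is clear */
static bool sketch_mac_is_unicast(const uint8_t *mac)
{
	assert(mac);
	return !(mac[0] & 0x01);
}

/* True if locally administered: U/L bit (bit 1 of first octet) is set */
static bool sketch_mac_is_laa(const uint8_t *mac)
{
	assert(mac);
	return (mac[0] & 0x02) != 0;
}
```

The first octet of the default address, 0x9a, is 1001 1010 in binary, so bit 0 is clear and bit 1 is set.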
/**
* union epoll_ref - Breakdown of reference for epoll fd bookkeeping
@ -105,8 +55,7 @@ union epoll_ref {
uint32_t flow;
flow_sidx_t flowside;
union tcp_listen_epoll_ref tcp_listen;
union udp_epoll_ref udp;
union icmp_epoll_ref icmp;
union udp_listen_epoll_ref udp;
uint32_t data;
int nsdir_fd;
};
@ -118,7 +67,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
#define TAP_BUF_BYTES \
ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
#define TAP_BUF_FILL (TAP_BUF_BYTES - ETH_MAX_MTU - sizeof(uint32_t))
#define TAP_MSGS \
DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
@ -146,59 +94,88 @@ struct fqdn {
enum passt_modes {
MODE_PASST,
MODE_PASTA,
MODE_VU,
};
/**
* struct ip4_ctx - IPv4 execution context
* @addr: IPv4 address for external, routable interface
* @addr: IPv4 address assigned to guest
* @addr_seen: Latest IPv4 address seen as source from tap
* @prefixlen: IPv4 prefix length (netmask)
* @gw: Default IPv4 gateway, network order
* @dns: DNS addresses for DHCP, zero-terminated, network order
* @dns_match: Forward DNS query if sent to this address, network order
* @dns_host: Use this DNS on the host for forwarding, network order
* @guest_gw: IPv4 gateway as seen by the guest
* @map_host_loopback: Outbound connections to this address are NATted to the
* host's 127.0.0.1
* @map_guest_addr: Outbound connections to this address are NATted to the
* guest's assigned address
* @dns: DNS addresses for DHCP, zero-terminated
* @dns_match: Forward DNS query if sent to this address
* @our_tap_addr: IPv4 address for passt's use on tap
* @dns_host: Use this DNS on the host for forwarding
* @addr_out: Optional source address for outbound traffic
* @ifname_out: Optional interface name to bind outbound sockets to
* @no_copy_routes: Don't copy all routes when configuring target namespace
* @no_copy_addrs: Don't copy all addresses when configuring namespace
*/
struct ip4_ctx {
/* PIF_TAP addresses */
struct in_addr addr;
struct in_addr addr_seen;
int prefix_len;
struct in_addr gw;
struct in_addr guest_gw;
struct in_addr map_host_loopback;
struct in_addr map_guest_addr;
struct in_addr dns[MAXNS + 1];
struct in_addr dns_match;
struct in_addr dns_host;
struct in_addr our_tap_addr;
/* PIF_HOST addresses */
struct in_addr dns_host;
struct in_addr addr_out;
char ifname_out[IFNAMSIZ];
bool no_copy_routes;
bool no_copy_addrs;
};
/**
* struct ip6_ctx - IPv6 execution context
* @addr: IPv6 address for external, routable interface
* @addr_ll: Link-local IPv6 address on external, routable interface
* @addr: IPv6 address assigned to guest
* @addr_seen: Latest IPv6 global/site address seen as source from tap
* @addr_ll_seen: Latest IPv6 link-local address seen as source from tap
* @gw: Default IPv6 gateway
* @guest_gw: IPv6 gateway as seen by the guest
* @map_host_loopback: Outbound connections to this address are NATted to the
* host's [::1]
* @map_guest_addr: Outbound connections to this address are NATted to the
* guest's assigned address
* @dns: DNS addresses for DHCPv6 and NDP, zero-terminated
* @dns_match: Forward DNS query if sent to this address
* @our_tap_ll: Link-local IPv6 address for passt's use on tap
* @dns_host: Use this DNS on the host for forwarding
* @addr_out: Optional source address for outbound traffic
* @ifname_out: Optional interface name to bind outbound sockets to
* @no_copy_routes: Don't copy all routes when configuring target namespace
* @no_copy_addrs: Don't copy all addresses when configuring namespace
*/
struct ip6_ctx {
/* PIF_TAP addresses */
struct in6_addr addr;
struct in6_addr addr_ll;
struct in6_addr addr_seen;
struct in6_addr addr_ll_seen;
struct in6_addr gw;
struct in6_addr guest_gw;
struct in6_addr map_host_loopback;
struct in6_addr map_guest_addr;
struct in6_addr dns[MAXNS + 1];
struct in6_addr dns_match;
struct in6_addr dns_host;
struct in6_addr our_tap_ll;
/* PIF_HOST addresses */
struct in6_addr dns_host;
struct in6_addr addr_out;
char ifname_out[IFNAMSIZ];
bool no_copy_routes;
bool no_copy_addrs;
};
#include <netinet/if_ether.h>
@ -210,11 +187,11 @@ struct ip6_ctx {
* @trace: Enable tracing (extra debug) mode
* @quiet: Don't print informational messages
* @foreground: Run in foreground, don't log to stderr by default
* @force_stderr: Force logging to stderr
* @nofile: Maximum number of open files (ulimit -n)
* @sock_path: Path for UNIX domain socket
* @pcap: Path for packet capture file
* @pid_file: Path to PID file, empty string if not configured
* @pidfile: Path to PID file, empty string if not configured
* @pidfile_fd: File descriptor for PID file, -1 if none
* @pasta_netns_fd: File descriptor for network namespace in pasta mode
* @no_netns_quit: In pasta mode, don't exit if fs-bound namespace is gone
* @netns_base: Base name for fs-bound namespace, if any, in pasta mode
@ -222,8 +199,8 @@ struct ip6_ctx {
* @epollfd: File descriptor for epoll instance
* @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any
* @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket
* @mac: Host MAC address
* @mac_guest: MAC address of guest or namespace, seen or configured
* @our_tap_mac: Pasta/passt's MAC on the tap link
* @guest_mac: MAC address of guest or namespace, seen or configured
* @hash_secret: 128-bit secret for siphash functions
* @ifi4: Index of template interface for IPv4, 0 if IPv4 disabled
* @ip: IPv4 configuration
@ -233,8 +210,6 @@ struct ip6_ctx {
* @pasta_ifn: Name of namespace interface for pasta
* @pasta_ifi: Index of namespace interface for pasta
* @pasta_conf_ns: Configure namespace after creating it
* @no_copy_routes: Don't copy all routes when configuring target namespace
* @no_copy_addrs: Don't copy all addresses when configuring namespace
* @no_tcp: Disable TCP operation
* @tcp: Context for TCP protocol handler
* @no_udp: Disable UDP operation
@ -250,7 +225,8 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
* @no_map_gw: Don't map connections, untracked UDP to gateway to host
* @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @freebind: Allow binding of non-local addresses for forwarding
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
*/
@ -260,11 +236,13 @@ struct ctx {
int trace;
int quiet;
int foreground;
int force_stderr;
int nofile;
char sock_path[UNIX_PATH_MAX];
char pcap[PATH_MAX];
char pid_file[PATH_MAX];
char pidfile[PATH_MAX];
int pidfile_fd;
int one_off;
int pasta_netns_fd;
@ -276,8 +254,8 @@ struct ctx {
int epollfd;
int fd_tap_listen;
int fd_tap;
unsigned char mac[ETH_ALEN];
unsigned char mac_guest[ETH_ALEN];
unsigned char our_tap_mac[ETH_ALEN];
unsigned char guest_mac[ETH_ALEN];
uint64_t hash_secret[2];
unsigned int ifi4;
@ -291,8 +269,6 @@ struct ctx {
char pasta_ifn[IF_NAMESIZE];
unsigned int pasta_ifi;
int pasta_conf_ns;
int no_copy_routes;
int no_copy_addrs;
int no_tcp;
struct tcp_ctx tcp;
@ -310,13 +286,11 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
int no_map_gw;
int host_lo_to_ns_lo;
int freebind;
int low_wmem;
int low_rmem;
/* vhost-user */
struct VuDev vdev;
};
void proto_update_l2_buf(const unsigned char *eth_d,

114
pasta.c

@ -12,8 +12,8 @@
* Author: Stefano Brivio <sbrivio@redhat.com>
*
* #syscalls:pasta clone waitid exit exit_group rt_sigprocmask
* #syscalls:pasta rt_sigreturn|sigreturn armv6l:sigreturn armv7l:sigreturn
* #syscalls:pasta ppc64:sigreturn s390x:sigreturn
* #syscalls:pasta rt_sigreturn|sigreturn
* #syscalls:pasta arm:sigreturn ppc64:sigreturn s390x:sigreturn i686:sigreturn
*/
#include <sched.h>
@ -50,6 +50,8 @@
#include "netlink.h"
#include "log.h"
#define HOSTNAME_PREFIX "pasta-"
/* PID of child, in case we created a namespace */
int pasta_child_pid;
@ -59,6 +61,7 @@ int pasta_child_pid;
*/
void pasta_child_handler(int signal)
{
int errno_save = errno;
siginfo_t infop;
(void)signal;
@ -83,6 +86,8 @@ void pasta_child_handler(int signal)
waitid(P_ALL, 0, NULL, WEXITED | WNOHANG);
waitid(P_ALL, 0, NULL, WEXITED | WNOHANG);
errno = errno_save;
}
/**
@ -97,7 +102,9 @@ static int pasta_wait_for_ns(void *arg)
int flags = O_RDONLY | O_CLOEXEC;
char ns[PATH_MAX];
snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
if (snprintf_check(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid))
die_perror("Can't build netns path");
do {
while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
if (errno != ENOENT)
@ -138,17 +145,15 @@ void pasta_open_ns(struct ctx *c, const char *netns)
int nfd = -1;
nfd = open(netns, O_RDONLY | O_CLOEXEC);
if (nfd < 0) {
die("Couldn't open network namespace %s: %s",
netns, strerror(errno));
}
if (nfd < 0)
die_perror("Couldn't open network namespace %s", netns);
c->pasta_netns_fd = nfd;
NS_CALL(ns_check, c);
if (c->pasta_netns_fd < 0)
die("Couldn't switch to pasta namespaces: %s", strerror(errno));
die_perror("Couldn't switch to pasta namespaces");
if (!c->no_netns_quit) {
char buf[PATH_MAX] = { 0 };
@ -176,18 +181,28 @@ struct pasta_spawn_cmd_arg {
*
* Return: this function never returns
*/
/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int pasta_spawn_cmd(void *arg)
{
char hostname[HOST_NAME_MAX + 1] = HOSTNAME_PREFIX;
const struct pasta_spawn_cmd_arg *a;
sigset_t set;
/* We run in a detached PID and mount namespace: mount /proc over */
if (mount("", "/proc", "proc", 0, NULL))
warn("Couldn't mount /proc: %s", strerror(errno));
warn_perror("Couldn't mount /proc");
if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
warn("Cannot set ping_group_range, ICMP requests might fail");
if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
hostname[HOST_NAME_MAX] = '\0';
if (sethostname(hostname, strlen(hostname)))
warn("Unable to set pasta-prefixed hostname");
}
/* Wait for the parent to be ready: see main() */
sigemptyset(&set);
sigaddset(&set, SIGUSR1);
@ -196,8 +211,7 @@ static int pasta_spawn_cmd(void *arg)
a = (const struct pasta_spawn_cmd_arg *)arg;
execvp(a->exe, a->argv);
perror("execvp");
exit(EXIT_FAILURE);
die_perror("Failed to start command or shell");
}
/**
@ -211,12 +225,13 @@ static int pasta_spawn_cmd(void *arg)
void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
int argc, char *argv[])
{
char ns_fn_stack[NS_FN_STACK_SIZE]
__attribute__ ((aligned(__alignof__(max_align_t))));
struct pasta_spawn_cmd_arg arg = {
.exe = argv[0],
.argv = argv,
};
char uidmap[BUFSIZ], gidmap[BUFSIZ];
char ns_fn_stack[NS_FN_STACK_SIZE];
char *sh_argv[] = { NULL, NULL };
char sh_arg0[PATH_MAX + 1];
sigset_t set;
@ -226,8 +241,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
c->quiet = 1;
/* Configure user and group mappings */
snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
if (snprintf_check(uidmap, BUFSIZ, "0 %u 1", uid))
die_perror("Can't build uidmap");
if (snprintf_check(gidmap, BUFSIZ, "0 %u 1", gid))
die_perror("Can't build gidmap");
if (write_file("/proc/self/uid_map", uidmap) ||
write_file("/proc/self/setgroups", "deny") ||
@ -259,14 +277,12 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD,
(void *)&arg);
if (pasta_child_pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (pasta_child_pid == -1)
die_perror("Failed to clone process with detached namespaces");
NS_CALL(pasta_wait_for_ns, c);
if (c->pasta_netns_fd < 0)
die("Failed to join network namespace: %s", strerror(errno));
die_perror("Failed to join network namespace");
}
/**
@ -277,25 +293,33 @@ void pasta_ns_conf(struct ctx *c)
{
int rc = 0;
rc = nl_link_up(nl_sock_ns, 1 /* lo */, 0);
rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
if (rc < 0)
die("Couldn't bring up loopback interface in namespace: %s",
strerror(-rc));
/* Get or set MAC in target namespace */
if (MAC_IS_ZERO(c->mac_guest))
nl_link_get_mac(nl_sock_ns, c->pasta_ifi, c->mac_guest);
if (MAC_IS_ZERO(c->guest_mac))
nl_link_get_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac);
else
rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->mac_guest);
rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac);
if (rc < 0)
die("Couldn't set MAC address in namespace: %s",
strerror(-rc));
if (c->pasta_conf_ns) {
nl_link_up(nl_sock_ns, c->pasta_ifi, c->mtu);
unsigned int flags = IFF_UP;
if (c->mtu != -1)
nl_link_set_mtu(nl_sock_ns, c->pasta_ifi, c->mtu);
if (c->ifi6) /* Avoid duplicate address detection on link up */
flags |= IFF_NOARP;
nl_link_set_flags(nl_sock_ns, c->pasta_ifi, flags, flags);
if (c->ifi4) {
if (c->no_copy_addrs) {
if (c->ip4.no_copy_addrs) {
rc = nl_addr_set(nl_sock_ns, c->pasta_ifi,
AF_INET,
&c->ip4.addr,
@ -311,9 +335,10 @@ void pasta_ns_conf(struct ctx *c)
strerror(-rc));
}
if (c->no_copy_routes) {
if (c->ip4.no_copy_routes) {
rc = nl_route_set_def(nl_sock_ns, c->pasta_ifi,
AF_INET, &c->ip4.gw);
AF_INET,
&c->ip4.guest_gw);
} else {
rc = nl_route_dup(nl_sock, c->ifi4, nl_sock_ns,
c->pasta_ifi, AF_INET);
@ -326,7 +351,24 @@ void pasta_ns_conf(struct ctx *c)
}
if (c->ifi6) {
if (c->no_copy_addrs) {
rc = nl_addr_get_ll(nl_sock_ns, c->pasta_ifi,
&c->ip6.addr_ll_seen);
if (rc < 0) {
warn("Can't get LL address from namespace: %s",
strerror(-rc));
}
rc = nl_addr_set_ll_nodad(nl_sock_ns, c->pasta_ifi);
if (rc < 0) {
warn("Can't set nodad for LL in namespace: %s",
strerror(-rc));
}
/* We dodged DAD: re-enable neighbour solicitations */
nl_link_set_flags(nl_sock_ns, c->pasta_ifi,
0, IFF_NOARP);
if (c->ip6.no_copy_addrs) {
rc = nl_addr_set(nl_sock_ns, c->pasta_ifi,
AF_INET6, &c->ip6.addr, 64);
} else {
@ -340,9 +382,10 @@ void pasta_ns_conf(struct ctx *c)
strerror(-rc));
}
if (c->no_copy_routes) {
if (c->ip6.no_copy_routes) {
rc = nl_route_set_def(nl_sock_ns, c->pasta_ifi,
AF_INET6, &c->ip6.gw);
AF_INET6,
&c->ip6.guest_gw);
} else {
rc = nl_route_dup(nl_sock, c->ifi6,
nl_sock_ns, c->pasta_ifi,
@ -356,7 +399,7 @@ void pasta_ns_conf(struct ctx *c)
}
}
proto_update_l2_buf(c->mac_guest, NULL);
proto_update_l2_buf(c->guest_mac, NULL);
}
/**
@ -370,12 +413,12 @@ static int pasta_netns_quit_timer(void)
struct itimerspec it = { { 1, 0 }, { 1, 0 } }; /* one-second interval */
if (fd == -1) {
err("timerfd_create(): %s", strerror(errno));
err_perror("Failed to create timerfd for quit timer");
return -errno;
}
if (timerfd_settime(fd, 0, &it, NULL) < 0) {
err("timerfd_settime(): %s", strerror(errno));
err_perror("Failed to set interval for quit timer");
close(fd);
return -errno;
}
@ -389,12 +432,12 @@ static int pasta_netns_quit_timer(void)
*/
void pasta_netns_quit_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
struct epoll_event ev = { .events = EPOLLIN };
int flags = O_NONBLOCK | O_CLOEXEC;
struct statfs s = { 0 };
bool try_inotify = true;
int fd = -1, dir_fd;
union epoll_ref ref;
if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base)
return;
@ -425,6 +468,7 @@ void pasta_netns_quit_init(const struct ctx *c)
ref.type = EPOLL_TYPE_NSQUIT_TIMER;
} else {
close(dir_fd);
ref.type = EPOLL_TYPE_NSQUIT_INOTIFY;
}
if (fd > FD_REF_MAX)
@ -468,7 +512,7 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
n = read(ref.fd, &expirations, sizeof(expirations));
if (n < 0)
die("Namespace watch timer read() error: %s", strerror(errno));
die_perror("Namespace watch timer read() error");
if ((size_t)n < sizeof(expirations))
warn("Namespace watch timer: short read(): %zi", n);
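
Note on the snprintf_check() calls introduced in pasta.c: its definition is not part of this section, so the following is only a sketch of what a truncation-checking wrapper can look like, under the assumption that it returns non-zero on failure (as the call sites suggest). The name checked_snprintf and the chosen errno values are assumptions for illustration.

```c
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if the whole string fit, -1 otherwise */
int checked_snprintf(char *str, size_t size, const char *format, ...)
{
	va_list ap;
	int rc;

	va_start(ap, format);
	rc = vsnprintf(str, size, format, ap);
	va_end(ap);

	if (rc < 0) {			/* output or encoding error */
		errno = EIO;
		return -1;
	}

	if ((size_t)rc >= size) {	/* output was truncated */
		errno = ENOBUFS;
		return -1;
	}

	return 0;
}
```

Setting errno in the wrapper is what makes a following perror-style report meaningful, since vsnprintf() itself does not set errno on truncation.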

64
pcap.c

@ -72,44 +72,43 @@ struct pcap_pkthdr {
* @iov: IO vector containing frame (with L2 headers and tap headers)
* @iovcnt: Number of buffers (@iov entries) in frame
* @offset: Byte offset of the L2 headers within @iov
* @tv: Timestamp
* @now: Timestamp
*
* Returns: 0 on success, -errno on error writing to the file
*/
static void pcap_frame(const struct iovec *iov, size_t iovcnt,
size_t offset, const struct timeval *tv)
size_t offset, const struct timespec *now)
{
size_t len = iov_size(iov, iovcnt) - offset;
size_t l2len = iov_size(iov, iovcnt) - offset;
struct pcap_pkthdr h = {
.tv_sec = tv->tv_sec,
.tv_usec = tv->tv_usec,
.caplen = len,
.len = len
.tv_sec = now->tv_sec,
.tv_usec = DIV_ROUND_CLOSEST(now->tv_nsec, 1000),
.caplen = l2len,
.len = l2len
};
struct iovec hiov = { &h, sizeof(h) };
if (write_remainder(pcap_fd, &hiov, 1, 0) < 0 ||
write_remainder(pcap_fd, iov, iovcnt, offset) < 0) {
debug("Cannot log packet, length %zu: %s",
len, strerror(errno));
}
if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0 ||
write_remainder(pcap_fd, iov, iovcnt, offset) < 0)
debug_perror("Cannot log packet, length %zu", l2len);
}
/**
* pcap() - Capture a single frame to pcap file
* @pkt: Pointer to data buffer, including L2 headers
* @len: L2 packet length
* @l2len: L2 frame length
*/
void pcap(const char *pkt, size_t len)
void pcap(const char *pkt, size_t l2len)
{
struct iovec iov = { (char *)pkt, len };
struct timeval tv;
struct iovec iov = { (char *)pkt, l2len };
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
gettimeofday(&tv, NULL);
pcap_frame(&iov, 1, 0, &tv);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(&iov, 1, 0, &now);
}
/**
@ -122,16 +121,17 @@ void pcap(const char *pkt, size_t len)
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset)
{
struct timeval tv;
struct timespec now = { 0 };
unsigned int i;
if (pcap_fd == -1)
return;
gettimeofday(&tv, NULL);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
for (i = 0; i < n; i++)
pcap_frame(iov + i * frame_parts, frame_parts, offset, &tv);
pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
}
/*
@ -141,17 +141,20 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
* @iov: Pointer to the array of struct iovec describing the I/O vector
* containing packet data to write, including L2 header
* @iovcnt: Number of buffers (@iov entries)
* @offset: Offset of the L2 frame within the full data length
*/
/* cppcheck-suppress unusedFunction */
void pcap_iov(const struct iovec *iov, size_t iovcnt)
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
{
struct timeval tv;
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
gettimeofday(&tv, NULL);
pcap_frame(iov, iovcnt, 0, &tv);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(iov, iovcnt, offset, &now);
}
/**
@ -160,23 +163,20 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt)
*/
void pcap_init(struct ctx *c)
{
int flags = O_WRONLY | O_CREAT | O_TRUNC;
if (pcap_fd != -1)
return;
if (!*c->pcap)
return;
flags |= c->foreground ? O_CLOEXEC : 0;
pcap_fd = open(c->pcap, flags, S_IRUSR | S_IWUSR);
pcap_fd = output_file_open(c->pcap, O_WRONLY);
if (pcap_fd == -1) {
perror("open");
err_perror("Couldn't open pcap file %s", c->pcap);
return;
}
info("Saving packet capture to %s", c->pcap);
if (write(pcap_fd, &pcap_hdr, sizeof(pcap_hdr)) < 0)
warn("Cannot write PCAP header: %s", strerror(errno));
warn_perror("Cannot write PCAP header");
}
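
Note on the switch from struct timeval to struct timespec in pcap.c: the capture header still stores microseconds, so the nanosecond field has to be scaled down. The sketch below shows that conversion, assuming DIV_ROUND_CLOSEST rounds to the nearest integer as its name suggests. The struct pkt_ts type and the ts_to_pcap() name are hypothetical, and the carry into the seconds field is an addition of this sketch that the diff above does not make.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the timestamp fields of a pcap record header */
struct pkt_ts {
	uint32_t tv_sec;
	uint32_t tv_usec;
};

/* Convert a nanosecond-resolution timestamp to seconds and microseconds */
struct pkt_ts ts_to_pcap(const struct timespec *now)
{
	struct pkt_ts h;

	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);

	h.tv_sec = (uint32_t)now->tv_sec;
	/* Round to the nearest microsecond */
	h.tv_usec = (uint32_t)((now->tv_nsec + 500) / 1000);

	/* Not in the diff: rounding up from 999999500 ns or more carries over */
	if (h.tv_usec == 1000000) {
		h.tv_sec++;
		h.tv_usec = 0;
	}

	return h;
}
```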

4
pcap.h

@ -6,10 +6,10 @@
#ifndef PCAP_H
#define PCAP_H
void pcap(const char *pkt, size_t len);
void pcap(const char *pkt, size_t l2len);
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset);
void pcap_iov(const struct iovec *iov, size_t iovcnt);
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset);
void pcap_init(struct ctx *c);
#endif /* PCAP_H */

82
pif.c

@ -7,9 +7,14 @@
#include <stdint.h>
#include <assert.h>
#include <netinet/in.h>
#include "util.h"
#include "pif.h"
#include "siphash.h"
#include "ip.h"
#include "inany.h"
#include "passt.h"
const char *pif_type_str[] = {
[PIF_NONE] = "<none>",
@ -19,3 +24,80 @@ const char *pif_type_str[] = {
};
static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES,
"pif_type_str[] doesn't match enum pif_type");
/** pif_sockaddr() - Construct a socket address suitable for an interface
* @c: Execution context
* @sa: Pointer to sockaddr to fill in
* @sl: Updated to relevant length of initialised @sa
* @pif: Interface to create the socket address
* @addr: IPv[46] address
* @port: Port (host byte order)
*/
void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
uint8_t pif, const union inany_addr *addr, in_port_t port)
{
const struct in_addr *v4 = inany_v4(addr);
ASSERT(pif_is_socket(pif));
if (v4) {
sa->sa_family = AF_INET;
sa->sa4.sin_addr = *v4;
sa->sa4.sin_port = htons(port);
memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero));
*sl = sizeof(sa->sa4);
} else {
sa->sa_family = AF_INET6;
sa->sa6.sin6_addr = addr->a6;
sa->sa6.sin6_port = htons(port);
if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6))
sa->sa6.sin6_scope_id = c->ifi6;
else
sa->sa6.sin6_scope_id = 0;
sa->sa6.sin6_flowinfo = 0;
*sl = sizeof(sa->sa6);
}
}
/** pif_sock_l4() - Open a socket bound to an address on a specified interface
* @c: Execution context
* @type: Socket epoll type
* @pif: Interface for this socket
* @addr: Address to bind to, or NULL for dual-stack any
* @ifname: Interface for binding, NULL for any
* @port: Port number to bind to (host byte order)
* @data: epoll reference portion for protocol handlers
*
* NOTE: For namespace pifs, this must be called having already entered the
* relevant namespace.
*
* Return: newly created socket, negative error code on failure
*/
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data)
{
union sockaddr_inany sa = {
.sa6.sin6_family = AF_INET6,
.sa6.sin6_addr = in6addr_any,
.sa6.sin6_port = htons(port),
};
socklen_t sl;
ASSERT(pif_is_socket(pif));
if (pif == PIF_SPLICE) {
/* Sanity checks */
ASSERT(!ifname);
ASSERT(addr && inany_is_loopback(addr));
}
if (!addr)
return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
ifname, false, data);
pif_sockaddr(c, &sa, &sl, pif, addr, port);
return sock_l4_sa(c, type, &sa, sl,
ifname, sa.sa_family == AF_INET6, data);
}
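
Note on pif_sockaddr() above: it picks the socket address family from the address itself and sets a scope ID only for link-local IPv6 addresses. The sketch below shows the same idea with standard types only. It assumes the address is held as a struct in6_addr, with IPv4 addresses stored in IPv4-mapped form; the sockaddr_fill() name and its parameters are hypothetical, and the PIF_HOST condition from the diff is left out.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill @ss from @a, return the relevant length */
socklen_t sockaddr_fill(struct sockaddr_storage *ss, const struct in6_addr *a,
			in_port_t port, unsigned int scope_ifindex)
{
	memset(ss, 0, sizeof(*ss));

	if (IN6_IS_ADDR_V4MAPPED(a)) {
		struct sockaddr_in *sa4 = (struct sockaddr_in *)ss;

		sa4->sin_family = AF_INET;
		sa4->sin_port = htons(port);
		/* The last four bytes of a mapped address are the IPv4 one */
		memcpy(&sa4->sin_addr, &a->s6_addr[12], sizeof(sa4->sin_addr));
		return sizeof(*sa4);
	} else {
		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)ss;

		sa6->sin6_family = AF_INET6;
		sa6->sin6_port = htons(port);
		sa6->sin6_addr = *a;
		/* Link-local addresses are ambiguous without an interface */
		if (IN6_IS_ADDR_LINKLOCAL(a))
			sa6->sin6_scope_id = scope_ifindex;
		return sizeof(*sa6);
	}
}
```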

21
pif.h

@ -7,6 +7,9 @@
#ifndef PIF_H
#define PIF_H
union inany_addr;
union sockaddr_inany;
/**
* enum pif_type - Type of passt/pasta interface ("pif")
*
@ -38,10 +41,26 @@ static inline const char *pif_type(enum pif_type pt)
return "?";
}
/* cppcheck-suppress unusedFunction */
static inline const char *pif_name(uint8_t pif)
{
return pif_type(pif);
}
/**
* pif_is_socket() - Is interface implemented via L4 sockets?
* @pif: pif to check
*
* Return: true if @pif is an L4 socket based interface, otherwise false
*/
static inline bool pif_is_socket(uint8_t pif)
{
return pif == PIF_HOST || pif == PIF_SPLICE;
}
void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
uint8_t pif, const union inany_addr *addr, in_port_t port);
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data);
#endif /* PIF_H */

4
qrap.1

@ -66,8 +66,8 @@ issues to Stefano Brivio <sbrivio@redhat.com>.
Copyright (c) 2020-2021 Red Hat GmbH.
\fBqrap\fR is free software: you can redistribute it and/or modify it under the
terms of the GNU Affero General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
terms of the GNU General Public License as published by the Free Software
Foundation, either version 2 of the License, or (at your option) any later
version.
.SH SEE ALSO


@ -20,6 +20,15 @@ OUT="$(mktemp)"
[ -z "${ARCH}" ] && ARCH="$(uname -m)"
[ -z "${CC}" ] && CC="cc"
AUDIT_ARCH="AUDIT_ARCH_$(echo ${ARCH} | tr [a-z] [A-Z] \
| sed 's/^ARM.*/ARM/' \
| sed 's/I[456]86/I386/' \
| sed 's/PPC64/PPC/' \
| sed 's/PPCLE/PPC64LE/' \
| sed 's/MIPS64EL/MIPSEL64/' \
| sed 's/HPPA/PARISC/' \
| sed 's/SH4/SH/')"
HEADER="/* This file was automatically generated by $(basename ${0}) */
#ifndef AUDIT_ARCH_PPC64LE
@ -29,11 +38,11 @@ HEADER="/* This file was automatically generated by $(basename ${0}) */
# Prefix for each profile: check that 'arch' in seccomp_data is matching
PRE='
struct sock_filter filter_@PROFILE@[] = {
/* cppcheck-suppress badBitmaskCheck */
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, arch))),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PASST_AUDIT_ARCH, 0, @KILL@),
/* cppcheck-suppress badBitmaskCheck */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, @AUDIT_ARCH@, 0, @KILL@),
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, nr))),
@ -233,7 +242,8 @@ gen_profile() {
sub ${__i} CALL "NR:${__nr}" "NAME:${__name}" "ALLOW:${__allow}"
done
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))"
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))" \
"AUDIT_ARCH:${AUDIT_ARCH}"
}
printf '%s\n' "${HEADER}" > "${OUT}"
@ -242,7 +252,10 @@ for __p in ${__profiles}; do
__calls="$(sed -n 's/[\t ]*\*[\t ]*#syscalls\(:'"${__p}"'\|\)[\t ]\{1,\}\(.*\)/\2/p' ${IN})"
__calls="${__calls} ${EXTRA_SYSCALLS:-}"
__calls="$(filter ${__calls})"
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t
cols="$(stty -a | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
case $cols in [0-9]*) col_args="-w ${cols}";; *) col_args="";; esac
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t ${col_args}
# Pad here to keep gen_profile() "simple"
__count=0


@ -115,10 +115,4 @@ static inline uint64_t siphash_final(struct siphash_state *state,
return state->v[0] ^ state->v[1] ^ state->v[2] ^ state->v[3];
}
uint64_t siphash_8b(const uint8_t *in, const uint64_t *k);
uint64_t siphash_12b(const uint8_t *in, const uint64_t *k);
uint64_t siphash_20b(const uint8_t *in, const uint64_t *k);
uint64_t siphash_32b(const uint8_t *in, const uint64_t *k);
uint64_t siphash_36b(const uint8_t *in, const uint64_t *k);
#endif /* SIPHASH_H */

803
tap.c

File diff suppressed because it is too large

101
tap.h

@ -6,90 +6,60 @@
#ifndef TAP_H
#define TAP_H
/*
* TCP frame iovec array:
* TCP_IOV_VNET vnet length
* TCP_IOV_ETH ethernet header
* TCP_IOV_IP IP (v4/v6) header
* TCP_IOV_PAYLOAD IP payload (TCP header + data)
* TCP_IOV_NUM is the number of entries in the iovec array
*/
#define TCP_IOV_VNET 0
#define TCP_IOV_ETH 1
#define TCP_IOV_IP 2
#define TCP_IOV_PAYLOAD 3
#define TCP_IOV_NUM 4
#define ETH_HDR_INIT(proto) { .h_proto = htons_constant(proto) }
/**
* struct tap_hdr - L2 and tap specific headers
* struct tap_hdr - tap backend specific headers
* @vnet_len: Frame length (for qemu socket transport)
* @eh: Ethernet header
*/
struct tap_hdr {
uint32_t vnet_len;
struct ethhdr eh;
} __attribute__((packed));
#define TAP_HDR_INIT(proto) { .eh.h_proto = htons_constant(proto) }
static inline size_t tap_hdr_len_(const struct ctx *c)
/**
* tap_hdr_iov() - struct iovec for a tap header
* @c: Execution context
* @thdr: Pointer to tap specific header buffer
*
* Returns: A struct iovec covering the correct portion of @taph to use as the
* tap specific header in the current configuration.
*/
static inline struct iovec tap_hdr_iov(const struct ctx *c,
struct tap_hdr *thdr)
{
if (c->mode == MODE_PASST)
return sizeof(struct tap_hdr);
else
return sizeof(struct ethhdr);
return (struct iovec){
.iov_base = thdr,
.iov_len = c->mode == MODE_PASST ? sizeof(*thdr) : 0,
};
}
/**
* tap_iov_base() - Find start of tap frame
* @c: Execution context
* @taph: Pointer to L2 header buffer
*
* Returns: pointer to the start of tap frame - suitable for an
* iov_base to be passed to tap_send_frames())
* tap_hdr_update() - Update the tap specific header for a frame
* @thdr: Tap specific header buffer to update
* @l2len: Frame length (including L2 headers)
*/
static inline void *tap_iov_base(const struct ctx *c, struct tap_hdr *taph)
static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
{
return (char *)(taph + 1) - tap_hdr_len_(c);
thdr->vnet_len = htonl(l2len);
}
/**
* tap_iov_len() - Finalize tap frame and return total length
* @c: Execution context
* @taph: Tap header to finalize
* @plen: L2 payload length (excludes L2 and tap specific headers)
*
* Returns: length of the tap frame including L2 and tap specific
* headers - suitable for an iov_len to be passed to
* tap_send_frames()
*/
static inline size_t tap_iov_len(const struct ctx *c, struct tap_hdr *taph,
size_t plen)
{
if (c->mode == MODE_PASST)
taph->vnet_len = htonl(plen + sizeof(taph->eh));
return plen + tap_hdr_len_(c);
}
struct in_addr tap_ip4_daddr(const struct ctx *c);
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t len);
const void *in, size_t dlen);
void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
const void *in, size_t len);
const void *in, size_t l4len);
const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
const struct in6_addr *src);
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
uint32_t flow, const void *in, size_t len);
uint32_t flow, void *in, size_t dlen);
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t len);
int tap_send(const struct ctx *c, const void *data, size_t len);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n);
size_t tap_send_iov(const struct ctx *c, struct iovec iov[][TCP_IOV_NUM],
size_t n);
const void *in, size_t l4len);
void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
size_t bufs_per_frame, size_t nframes);
void eth_update_mac(struct ethhdr *eh,
const unsigned char *eth_d, const unsigned char *eth_s);
void tap_listen_handler(struct ctx *c, uint32_t events);
@ -97,17 +67,10 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
const struct timespec *now);
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now);
void tap_sock_reset(struct ctx *c);
void tap_sock_update_buf(void *base, size_t size);
int tap_sock_unix_open(char *sock_path);
void tap_sock_init(struct ctx *c);
void pool_flush_all(void);
void tap_handler_all(struct ctx *c, const struct timespec *now);
void packet_add_do(struct pool *p, size_t len, const char *start,
const char *func, int line);
void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
const char *func, int line);
#define packet_add_all(p, len, start) \
packet_add_all_do(p, len, start, __func__, __LINE__)
void tap_flush_pools(void);
void tap_handler(struct ctx *c, const struct timespec *now);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
#endif /* TAP_H */
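
Note on the new tap_hdr_iov() and tap_hdr_update() helpers in tap.h: the backend-specific header becomes its own iovec entry, and its length drops to zero when the backend needs no length prefix, so callers can build the same iovec array in every mode. A minimal sketch of that pattern follows; enum backend, struct frame_hdr and both function names are hypothetical stand-ins, not the project's types.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical modes: a stream socket needs a length prefix, a tun fd doesn't */
enum backend { BACKEND_STREAM, BACKEND_TUN };

/* Hypothetical per-frame prefix carrying the frame length */
struct frame_hdr {
	uint32_t vnet_len;
} __attribute__((packed));

/* iovec entry covering the prefix, or nothing if the backend has none */
struct iovec frame_hdr_iov(enum backend b, struct frame_hdr *h)
{
	return (struct iovec){
		.iov_base = h,
		.iov_len = b == BACKEND_STREAM ? sizeof(*h) : 0,
	};
}

/* Store the L2 frame length in network byte order */
void frame_hdr_update(struct frame_hdr *h, size_t l2len)
{
	h->vnet_len = htonl((uint32_t)l2len);
}
```

A zero-length entry is skipped by writev()-style calls, which is what lets the same array layout serve both modes.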

1239
tcp.c

File diff suppressed because it is too large

20
tcp.h

@ -10,20 +10,24 @@
struct ctx;
void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
const struct timespec *now);
void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events);
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port);
int tcp_init(struct ctx *c);
void tcp_timer(struct ctx *c, const struct timespec *now);
void tcp_defer_handler(struct ctx *c);
void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s);
void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
int tcp_set_peek_offset(int s, int offset);
extern bool peek_offset_cap;
/**
* union tcp_epoll_ref - epoll reference portion for TCP connections
@ -55,16 +59,12 @@ union tcp_listen_epoll_ref {
* @fwd_in: Port forwarding configuration for inbound packets
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035)
* @pipe_size: Size of pipes for spliced connections
*/
struct tcp_ctx {
struct fwd_ports fwd_in;
struct fwd_ports fwd_out;
struct timespec timer_run;
#ifdef HAS_SND_WND
int kernel_snd_wnd;
#endif
size_t pipe_size;
};

439
tcp_buf.c

@ -6,9 +6,9 @@
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* tcp_buf.c - TCP L2-L4 translation state machine
* tcp_buf.c - TCP L2 buffer management functions
*
* Copyright (c) 2020-2022 Red Hat GmbH
* Copyright Red Hat
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
@ -20,10 +20,11 @@
#include <netinet/ip.h>
#include <linux/tcp.h>
#include <netinet/tcp.h>
#include "util.h"
#include "ip.h"
#include "iov.h"
#include "passt.h"
#include "tap.h"
#include "siphash.h"
@ -33,283 +34,169 @@
#include "tcp_buf.h"
#define TCP_FRAMES_MEM 128
#define TCP_FRAMES \
#define TCP_FRAMES \
(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
/**
* tcp_buf_seq_update - Sequences to update with length of frames once sent
* @seq: Pointer to sequence number sent to tap-side, to be updated
* @len: TCP payload length
*/
struct tcp_buf_seq_update {
uint32_t *seq;
uint16_t len;
};
/* Static buffers */
/**
* tcp_l2_flags_t - TCP header and data to send option flags
* @th: TCP header
* @opts TCP option flags
*/
struct tcp_l2_flags_t {
struct tcphdr th;
char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
};
/**
* tcp_l2_payload_t - TCP header and data to send data
* 32 bytes aligned to be able to use AVX2 checksum
* @th: TCP header
* @data: TCP data
*/
struct tcp_l2_payload_t {
struct tcphdr th; /* 20 bytes */
uint8_t data[MSS]; /* 65516 bytes */
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32)));
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/* Ethernet header for IPv4 frames */
/* Ethernet header for IPv4 and IPv6 frames */
static struct ethhdr tcp4_eth_src;
/* IPv4 headers */
static struct iphdr tcp4_l2_ip[TCP_FRAMES_MEM];
/* TCP headers and data for IPv4 frames */
static struct tcp_l2_payload_t tcp4_l2_payload[TCP_FRAMES_MEM];
static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
static unsigned int tcp4_l2_buf_used;
/* IPv4 headers for TCP option flags frames */
static struct iphdr tcp4_l2_flags_ip[TCP_FRAMES_MEM];
/* TCP headers and option flags for IPv4 frames */
static struct tcp_l2_flags_t tcp4_l2_flags[TCP_FRAMES_MEM];
static unsigned int tcp4_l2_flags_buf_used;
/* Ethernet header for IPv6 frames */
static struct ethhdr tcp6_eth_src;
/* IPv6 headers */
static struct ipv6hdr tcp6_l2_ip[TCP_FRAMES_MEM];
/* TCP headers and data for IPv6 frames */
static struct tcp_l2_payload_t tcp6_l2_payload[TCP_FRAMES_MEM];
static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
static unsigned int tcp6_l2_buf_used;
/* IP headers for IPv4 and IPv6 */
struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* IPv6 headers for TCP option flags frames */
static struct ipv6hdr tcp6_l2_flags_ip[TCP_FRAMES_MEM];
/* TCP headers and option flags for IPv6 frames */
static struct tcp_l2_flags_t tcp6_l2_flags[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 and IPv6 frames */
static struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
static unsigned int tcp6_l2_flags_buf_used;
static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp_payload_used;
/* recvmsg()/sendmsg() data for tap */
static struct iovec iov_sock [TCP_FRAMES_MEM + 1];
static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM][TCP_IOV_NUM];
static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_IOV_NUM];
static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_IOV_NUM];
static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_IOV_NUM];
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
* tcp_buf_update_l2() - Update L2 buffers with Ethernet and IPv4 addresses
* tcp_update_l2_buf() - Update Ethernet header buffers with addresses
* @eth_d: Ethernet destination address, NULL if unchanged
* @eth_s: Ethernet source address, NULL if unchanged
*/
void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
{
eth_update_mac(&tcp4_eth_src, eth_d, eth_s);
eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
}
/**
* tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 and IPv6 sockets
* @c: Execution context
*/
void tcp_buf_sock4_iov_init(const struct ctx *c)
void tcp_sock_iov_init(const struct ctx *c)
{
struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
int i;
(void)c;
tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) {
tcp6_payload_ip[i] = ip6;
tcp4_payload_ip[i] = iph;
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
struct iovec *iov;
struct iovec *iov = tcp_l2_iov[i];
/* headers */
tcp4_l2_ip[i] = iph;
tcp4_l2_payload[i].th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
tcp4_l2_flags_ip[i] = iph;
tcp4_l2_flags[i].th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
/* iovecs */
iov = tcp4_l2_iov[i];
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_IP].iov_base = &tcp4_l2_ip[i];
iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_payload[i];
iov = tcp4_l2_flags_iov[i];
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_IP].iov_base = &tcp4_l2_flags_ip[i];
iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_flags[i];
iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
}
}
/**
* tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
* @c: Execution context
* tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
* @ctx: Execution context
* @conns: Array of connection pointers corresponding to queued frames
* @frames: Two-dimensional array containing queued frames with sub-iovs
* @num_frames: Number of entries in the two arrays to be compared
*/
void tcp_buf_sock6_iov_init(const struct ctx *c)
static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
{
struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
int i;
(void)c;
for (i = 0; i < num_frames; i++) {
const struct tcphdr *th = frames[i][TCP_IOV_PAYLOAD].iov_base;
struct tcp_tap_conn *conn = conns[i];
uint32_t seq = ntohl(th->seq);
uint32_t peek_offset;
tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
for (i = 0; i < TCP_FRAMES_MEM; i++) {
struct iovec *iov;
if (SEQ_LE(conn->seq_to_tap, seq))
continue;
/* headers */
tcp6_l2_ip[i] = ip6;
tcp6_l2_payload[i].th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
tcp6_l2_flags_ip[i] = ip6;
tcp6_l2_flags[i].th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
/* iovecs */
iov = tcp6_l2_iov[i];
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_IP].iov_base = &tcp6_l2_ip[i];
iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_payload[i];
iov = tcp6_l2_flags_iov[i];
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_IP].iov_base = &tcp6_l2_flags_ip[i];
iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_flags[i];
conn->seq_to_tap = seq;
peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
if (tcp_set_peek_offset(conn->sock, peek_offset))
tcp_rst(c, conn);
}
}
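The rewind rule in tcp_revert_seq() can be tried in isolation. The sketch below is a simplified illustration, not the patch's code: `struct conn_sketch`, `struct frame_sketch` and `revert_seq()` are hypothetical stand-ins, the sequence number is stored next to the connection pointer instead of being read from the TCP header in the iovec, and the peek offset update and reset on failure are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WINDOW	(1 << 24)
#define SEQ_LE(a, b)	((b) - (a) < MAX_WINDOW)

/* Hypothetical reduced connection: only the field that gets rewound */
struct conn_sketch {
	uint32_t seq_to_tap;
};

/* Hypothetical queued frame: owner connection and first sequence number */
struct frame_sketch {
	struct conn_sketch *conn;
	uint32_t seq;
};

/**
 * revert_seq() - Rewind seq_to_tap for frames that were not sent
 * @frames:	Unsent frames, in queueing order
 * @num_frames:	Number of entries in @frames
 */
static inline void revert_seq(struct frame_sketch *frames, int num_frames)
{
	int i;

	assert(num_frames >= 0);

	for (i = 0; i < num_frames; i++) {
		struct conn_sketch *conn = frames[i].conn;

		/* Already rewound to this frame or an earlier one */
		if (SEQ_LE(conn->seq_to_tap, frames[i].seq))
			continue;

		conn->seq_to_tap = frames[i].seq;
	}
}
```

Because frames are visited in queueing order, the first unsent frame of each connection sets the rewind point and later frames of the same connection are skipped.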
/**
* tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
* tcp_payload_flush() - Send out buffers for segments with data or flags
* @c: Execution context
*/
void tcp_buf_l2_flags_flush(const struct ctx *c)
void tcp_payload_flush(const struct ctx *c)
{
tap_send_iov(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
tcp6_l2_flags_buf_used = 0;
tap_send_iov(c, tcp4_l2_flags_iov, tcp4_l2_flags_buf_used);
tcp4_l2_flags_buf_used = 0;
}
/**
* tcp_buf_l2_data_flush() - Send out buffers for segments with data
* @c: Execution context
*/
void tcp_buf_l2_data_flush(const struct ctx *c)
{
unsigned i;
size_t m;
m = tap_send_iov(c, tcp6_l2_iov, tcp6_l2_buf_used);
for (i = 0; i < m; i++)
*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
tcp6_l2_buf_used = 0;
m = tap_send_iov(c, tcp4_l2_iov, tcp4_l2_buf_used);
for (i = 0; i < m; i++)
*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
tcp4_l2_buf_used = 0;
m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
tcp_payload_used);
if (m != tcp_payload_used) {
tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
tcp_payload_used - m);
}
tcp_payload_used = 0;
}
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
/**
* tcp_buf_send_flag() - Send segment with flags to tap (no payload)
* @c: Execution context
* @conn: Connection pointer
* @flags: TCP flags: if not set, send segment only if ACK is due
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
struct tcp_l2_flags_t *payload;
struct iovec *dup_iov;
struct tcp_payload_t *payload;
struct iovec *iov;
struct tcphdr *th;
size_t optlen = 0;
size_t ip_len;
char *data;
size_t optlen;
size_t l4len;
uint32_t seq;
int ret;
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
iov = tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++];
dup_iov = tcp4_l2_flags_iov[tcp4_l2_flags_buf_used];
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else {
iov = tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++];
dup_iov = tcp6_l2_flags_iov[tcp6_l2_flags_buf_used];
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
th = &payload->th;
data = payload->opts;
ret = tcp_fill_flag_header(c, conn, flags, th, data, &optlen);
payload = iov[TCP_IOV_PAYLOAD].iov_base;
seq = conn->seq_to_tap;
ret = tcp_prepare_flags(c, conn, flags, &payload->th,
(struct tcp_syn_opts *)&payload->data, &optlen);
if (ret <= 0)
return ret;
if (CONN_V4(conn)) {
struct iphdr *iph = iov[TCP_IOV_IP].iov_base;
ip_len = tcp_fill_headers4(c, conn, iph, th, optlen, NULL,
conn->seq_to_tap);
} else {
struct ipv6hdr *ip6h = iov[TCP_IOV_IP].iov_base;
ip_len = tcp_fill_headers6(c, conn, ip6h, th, optlen,
conn->seq_to_tap);
}
iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
tcp_payload_used++;
l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq, false);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (flags & DUP_ACK) {
int i;
for (i = 0; i < TCP_IOV_NUM; i++) {
memcpy(dup_iov[i].iov_base, iov[i].iov_base,
iov[i].iov_len);
dup_iov[i].iov_len = iov[i].iov_len;
}
struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_TAP].iov_len);
dup_iov[TCP_IOV_ETH].iov_base = iov[TCP_IOV_ETH].iov_base;
dup_iov[TCP_IOV_IP] = iov[TCP_IOV_IP];
memcpy(dup_iov[TCP_IOV_PAYLOAD].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base, l4len);
dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
}
if (CONN_V4(conn)) {
if (flags & DUP_ACK)
tcp4_l2_flags_buf_used++;
if (tcp4_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
tcp_buf_l2_flags_flush(c);
} else {
if (flags & DUP_ACK)
tcp6_l2_flags_buf_used++;
if (tcp6_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
tcp_buf_l2_flags_flush(c);
}
if (tcp_payload_used > TCP_FRAMES_MEM - 2)
tcp_payload_flush(c);
return 0;
}
@@ -318,49 +205,43 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
* tcp_data_to_tap() - Finalise (queue) highest-numbered scatter-gather buffer
* @c: Execution context
* @conn: Connection pointer
* @plen: Payload length at L4
* @dlen: TCP payload length
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
*/
static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ssize_t plen, int no_csum, uint32_t seq)
ssize_t dlen, int no_csum, uint32_t seq)
{
uint32_t *seq_update = &conn->seq_to_tap;
struct tcp_payload_t *payload;
const uint16_t *check = NULL;
struct iovec *iov;
size_t l4len;
conn->seq_to_tap = seq + dlen;
tcp_frame_conns[tcp_payload_used] = conn;
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
struct iovec *iov_prev = tcp4_l2_iov[tcp4_l2_buf_used - 1];
const uint16_t *check = NULL;
if (no_csum) {
struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
check = &iph->check;
}
tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
iov = tcp4_l2_iov[tcp4_l2_buf_used++];
iov[TCP_IOV_PAYLOAD].iov_len = tcp_fill_headers4(c, conn,
iov[TCP_IOV_IP].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base,
plen, check, seq);
if (tcp4_l2_buf_used > TCP_FRAMES_MEM - 1)
tcp_buf_l2_data_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
iov = tcp6_l2_iov[tcp6_l2_buf_used++];
iov[TCP_IOV_PAYLOAD].iov_len = tcp_fill_headers6(c, conn,
iov[TCP_IOV_IP].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base,
plen, seq);
if (tcp6_l2_buf_used > TCP_FRAMES_MEM - 1)
tcp_buf_l2_data_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
payload->th.th_off = sizeof(struct tcphdr) / 4;
payload->th.th_x2 = 0;
payload->th.th_flags = 0;
payload->th.ack = 1;
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq, false);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
}
/**
@@ -372,17 +253,17 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
*
* #syscalls recvmsg
*/
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
int sendlen, len, plen, v4 = CONN_V4(conn);
int s = conn->sock, i, ret = 0;
int len, dlen, i, s = conn->sock;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
uint32_t already_sent, seq;
struct iovec *iov;
/* How much have we read/sent since last received ack ? */
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
if (SEQ_LT(already_sent, 0)) {
@@ -391,6 +272,10 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(s, 0)) {
tcp_rst(c, conn);
return -1;
}
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
@@ -408,25 +293,26 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
iov_rem = (wnd_scaled - already_sent) % mss;
}
mh_sock.msg_iov = iov_sock;
mh_sock.msg_iovlen = fill_bufs + 1;
/* Prepare iov according to kernel capability */
if (!peek_offset_cap) {
mh_sock.msg_iov = iov_sock;
iov_sock[0].iov_base = tcp_buf_discard;
iov_sock[0].iov_len = already_sent;
mh_sock.msg_iovlen = fill_bufs + 1;
} else {
mh_sock.msg_iov = &iov_sock[1];
mh_sock.msg_iovlen = fill_bufs;
}
iov_sock[0].iov_base = tcp_buf_discard;
iov_sock[0].iov_len = already_sent;
if (( v4 && tcp4_l2_buf_used + fill_bufs > TCP_FRAMES_MEM) ||
(!v4 && tcp6_l2_buf_used + fill_bufs > TCP_FRAMES_MEM)) {
tcp_buf_l2_data_flush(c);
if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
tcp_payload_flush(c);
/* Silence Coverity CWE-125 false positive */
tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
tcp_payload_used = 0;
}
for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
if (v4)
iov->iov_base = &tcp4_l2_payload[tcp4_l2_buf_used + i].data;
else
iov->iov_base = &tcp6_l2_payload[tcp6_l2_buf_used + i].data;
iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
iov->iov_len = mss;
}
if (iov_rem)
@@ -437,12 +323,19 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
len = recvmsg(s, &mh_sock, MSG_PEEK);
while (len < 0 && errno == EINTR);
if (len < 0)
goto err;
if (len < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
tcp_rst(c, conn);
return -errno;
}
return 0;
}
if (!len) {
if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
return ret;
}
@@ -453,42 +346,36 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
return 0;
}
sendlen = len - already_sent;
if (sendlen <= 0) {
if (!peek_offset_cap)
len -= already_sent;
if (len <= 0) {
conn_flag(c, conn, STALLED);
return 0;
}
conn_flag(c, conn, ~STALLED);
send_bufs = DIV_ROUND_UP(sendlen, mss);
last_len = sendlen - (send_bufs - 1) * mss;
send_bufs = DIV_ROUND_UP(len, mss);
last_len = len - (send_bufs - 1) * mss;
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, 0, NULL);
tcp_update_seqack_wnd(c, conn, false, NULL);
/* Finally, queue to tap */
plen = mss;
dlen = mss;
seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
if (i == send_bufs - 1)
plen = last_len;
dlen = last_len;
tcp_data_to_tap(c, conn, plen, no_csum, seq);
seq += plen;
tcp_data_to_tap(c, conn, dlen, no_csum, seq);
seq += dlen;
}
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
err:
if (errno != EAGAIN && errno != EWOULDBLOCK) {
ret = -errno;
tcp_rst(c, conn);
}
return ret;
}
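The queueing loop at the end of tcp_buf_data_from_sock() splits the peeked bytes into MSS-sized segments, with a final segment that may be shorter. A minimal sketch of that arithmetic follows. `split_segments()` is a hypothetical helper that is not part of the patch, and `DIV_ROUND_UP()` is redefined locally on the assumption that the project's macro rounds up in the usual way.

```c
#include <assert.h>

/* Local stand-in for the project's DIV_ROUND_UP(), assumed to round up */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/**
 * split_segments() - Hypothetical helper: split @len bytes into segments
 * @len:	Number of bytes to queue, must be positive
 * @mss:	Maximum segment size, must be positive
 * @last_len:	Set to the length of the final segment
 *
 * Return: number of segments needed
 */
static inline int split_segments(int len, int mss, int *last_len)
{
	int send_bufs;

	assert(len > 0 && mss > 0);

	send_bufs = DIV_ROUND_UP(len, mss);
	*last_len = len - (send_bufs - 1) * mss;

	return send_bufs;
}
```

With 3000 bytes and an MSS of 1460, this gives three segments, the last one carrying 80 bytes. Every segment except the last is exactly one MSS long, which is why the loop only overrides dlen on its final iteration.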

@@ -6,12 +6,9 @@
#ifndef TCP_BUF_H
#define TCP_BUF_H
void tcp_buf_sock4_iov_init(const struct ctx *c);
void tcp_buf_sock6_iov_init(const struct ctx *c);
void tcp_buf_l2_flags_flush(const struct ctx *c);
void tcp_buf_l2_data_flush(const struct ctx *c);
uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn);
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
void tcp_sock_iov_init(const struct ctx *c);
void tcp_payload_flush(const struct ctx *c);
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
#endif /*TCP_BUF_H */

@@ -13,19 +13,16 @@
* struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
* @f: Generic flow information
* @in_epoll: Is the connection in the epoll set?
* @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sock: Socket descriptor number
* @events: Connection events, implying connection states
* @timer: timerfd descriptor for timeout events
* @flags: Connection flags representing internal attributes
* @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
* @seq_dup_ack_approx: Last duplicate ACK number sent to tap
* @faddr: Guest side forwarding address (guest's remote address)
* @eport: Guest side endpoint port (guest's local port)
* @fport: Guest side forwarding port (guest's remote port)
* @wnd_from_tap: Last window size from tap, unscaled (as received)
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
@@ -49,6 +46,10 @@ struct tcp_tap_conn {
unsigned int ws_from_tap :TCP_WS_BITS;
unsigned int ws_to_tap :TCP_WS_BITS;
#define TCP_MSS_BITS 14
unsigned int tap_mss :TCP_MSS_BITS;
#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
int sock :FD_REF_BITS;
@@ -77,13 +78,6 @@ struct tcp_tap_conn {
#define ACK_TO_TAP_DUE BIT(3)
#define ACK_FROM_TAP_DUE BIT(4)
#define TCP_MSS_BITS 14
unsigned int tap_mss :TCP_MSS_BITS;
#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
#define SNDBUF_BITS 24
unsigned int sndbuf :SNDBUF_BITS;
#define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
@@ -91,11 +85,6 @@ struct tcp_tap_conn {
uint8_t seq_dup_ack_approx;
union inany_addr faddr;
in_port_t eport;
in_port_t fport;
uint16_t wnd_from_tap;
uint16_t wnd_to_tap;
@@ -106,47 +95,41 @@ struct tcp_tap_conn {
uint32_t seq_init_from_tap;
};
#define SIDES 2
/**
* struct tcp_splice_conn - Descriptor for a spliced TCP connection
* @f: Generic flow information
* @in_epoll: Is the connection in the epoll set?
* @s: File descriptor for sockets
* @pipe: File descriptors for pipes
* @events: Events observed/actions performed on connection
* @flags: Connection flags (attributes, not events)
* @read: Bytes read (not fully written to other side in one shot)
* @written: Bytes written (not fully written from one other side read)
*/
* @events: Events observed/actions performed on connection
* @flags: Connection flags (attributes, not events)
* @in_epoll: Is the connection in the epoll set?
*/
struct tcp_splice_conn {
/* Must be first element */
struct flow_common f;
bool in_epoll :1;
int s[SIDES];
int pipe[SIDES][2];
uint32_t read[SIDES];
uint32_t written[SIDES];
uint8_t events;
#define SPLICE_CLOSED 0
#define SPLICE_CONNECT BIT(0)
#define SPLICE_ESTABLISHED BIT(1)
#define OUT_WAIT_0 BIT(2)
#define OUT_WAIT_1 BIT(3)
#define FIN_RCVD_0 BIT(4)
#define FIN_RCVD_1 BIT(5)
#define FIN_SENT_0 BIT(6)
#define FIN_SENT_1 BIT(7)
#define OUT_WAIT(sidei_) ((sidei_) ? BIT(3) : BIT(2))
#define FIN_RCVD(sidei_) ((sidei_) ? BIT(5) : BIT(4))
#define FIN_SENT(sidei_) ((sidei_) ? BIT(7) : BIT(6))
uint8_t flags;
#define SPLICE_V6 BIT(0)
#define RCVLOWAT_SET_0 BIT(1)
#define RCVLOWAT_SET_1 BIT(2)
#define RCVLOWAT_ACT_0 BIT(3)
#define RCVLOWAT_ACT_1 BIT(4)
#define CLOSING BIT(5)
#define RCVLOWAT_SET(sidei_) ((sidei_) ? BIT(1) : BIT(0))
#define RCVLOWAT_ACT(sidei_) ((sidei_) ? BIT(3) : BIT(2))
#define CLOSING BIT(4)
uint32_t read[SIDES];
uint32_t written[SIDES];
bool in_epoll :1;
};
/* Socket pools */
@@ -155,9 +138,9 @@ struct tcp_splice_conn {
extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
bool tcp_flow_defer(union flow *flow);
bool tcp_splice_flow_defer(union flow *flow);
void tcp_splice_timer(const struct ctx *c, union flow *flow);
bool tcp_flow_defer(const struct tcp_tap_conn *conn);
bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
int tcp_conn_pool_sock(int pool[]);
int tcp_conn_sock(const struct ctx *c, sa_family_t af);
int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
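The MSS_SET() and MSS_GET() macros moved by this change store a 16-bit MSS in a 14-bit field by dropping the two low-order bits. The sketch below shows the round trip. `struct conn_sketch` and `mss_roundtrip()` are hypothetical names used only for illustration, and the macro arguments are parenthesised here, which the originals do not do.

```c
#include <assert.h>

#define TCP_MSS_BITS	14

/* Hypothetical reduced stand-in for struct tcp_tap_conn */
struct conn_sketch {
	unsigned int tap_mss	:TCP_MSS_BITS;
};

/* Same shifts as in the header, with arguments parenthesised */
#define MSS_SET(conn, mss) \
	((conn)->tap_mss = ((mss) >> (16 - TCP_MSS_BITS)))
#define MSS_GET(conn)	((conn)->tap_mss << (16 - TCP_MSS_BITS))

/**
 * mss_roundtrip() - Store a 16-bit MSS in the bit-field and read it back
 * @mss:	MSS value, at most 65535
 *
 * Return: MSS as recovered from the 14-bit field
 */
static inline unsigned int mss_roundtrip(unsigned int mss)
{
	struct conn_sketch conn = { 0 };

	assert(mss <= 0xffff);

	MSS_SET(&conn, mss);
	return MSS_GET(&conn);
}
```

A value of 1461 comes back as 1460, so the stored MSS is always rounded down to a multiple of four and never exceeds what the peer advertised.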

@@ -8,7 +8,15 @@
#define MAX_WS 8
#define MAX_WINDOW (1 << (16 + (MAX_WS)))
#define MSS (USHRT_MAX - sizeof(struct tcphdr))
#define MSS4 ROUND_DOWN(IP_MAX_MTU - \
sizeof(struct tcphdr) - \
sizeof(struct iphdr), \
sizeof(uint32_t))
#define MSS6 ROUND_DOWN(IP_MAX_MTU - \
sizeof(struct tcphdr) - \
sizeof(struct ipv6hdr), \
sizeof(uint32_t))
#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW)
#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW)
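SEQ_LE() and SEQ_LT() compare sequence numbers modulo 2^32: one number counts as preceding another if the unsigned difference is smaller than MAX_WINDOW. The sketch below wraps the macros in functions so that the operands are guaranteed to be 32-bit unsigned values. The wrappers `seq_le()` and `seq_lt()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WS		8
#define MAX_WINDOW	(1 << (16 + (MAX_WS)))

#define SEQ_LE(a, b)	((b) - (a) < MAX_WINDOW)
#define SEQ_LT(a, b)	((b) - (a) - 1 < MAX_WINDOW)

static_assert(MAX_WINDOW == (1 << 24), "window limit is 16 MiB");

/* Hypothetical wrappers forcing unsigned 32-bit arithmetic */
static inline int seq_le(uint32_t a, uint32_t b)
{
	return SEQ_LE(a, b);
}

static inline int seq_lt(uint32_t a, uint32_t b)
{
	return SEQ_LT(a, b);
}
```

With these definitions, 0xfffffff0 is treated as preceding 0x10 because the difference wraps around to 0x20, while the reverse comparison fails. Subtracting one in SEQ_LT() makes equal values wrap to the largest 32-bit value, so they compare as not less than each other.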
@@ -25,17 +33,108 @@
#define OPT_EOL 0
#define OPT_NOP 1
#define OPT_MSS 2
#define OPT_MSS_LEN 4
#define OPT_WS 3
#define OPT_WS_LEN 3
#define OPT_SACKP 4
#define OPT_SACK 5
#define OPT_TS 8
#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr))
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->oaddr))
#define CONN_V6(conn) (!CONN_V4(conn))
extern char tcp_buf_discard[MAX_WINDOW];
/*
* enum tcp_iov_parts - I/O vector parts for one TCP frame
* @TCP_IOV_TAP tap backend specific header
* @TCP_IOV_ETH Ethernet header
* @TCP_IOV_IP IP (v4/v6) header
* @TCP_IOV_PAYLOAD IP payload (TCP header + data)
* @TCP_NUM_IOVS the number of entries in the iovec array
*/
enum tcp_iov_parts {
TCP_IOV_TAP = 0,
TCP_IOV_ETH = 1,
TCP_IOV_IP = 2,
TCP_IOV_PAYLOAD = 3,
TCP_NUM_IOVS
};
/**
* struct tcp_payload_t - TCP header and data to send segments with payload
* @th: TCP header
* @data: TCP data
*/
struct tcp_payload_t {
struct tcphdr th;
uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32))); /* For AVX2 checksum routines */
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/** struct tcp_opt_nop - TCP NOP option
* @kind: Option kind (OPT_NOP = 1)
*/
struct tcp_opt_nop {
uint8_t kind;
} __attribute__ ((packed));
#define TCP_OPT_NOP ((struct tcp_opt_nop){ .kind = OPT_NOP, })
/** struct tcp_opt_mss - TCP MSS option
* @kind: Option kind (OPT_MSS == 2)
* @len: Option length (4)
* @mss: Maximum Segment Size
*/
struct tcp_opt_mss {
uint8_t kind;
uint8_t len;
uint16_t mss;
} __attribute__ ((packed));
#define TCP_OPT_MSS(mss_) \
((struct tcp_opt_mss) { \
.kind = OPT_MSS, \
.len = sizeof(struct tcp_opt_mss), \
.mss = htons(mss_), \
})
/** struct tcp_opt_ws - TCP Window Scaling option
* @kind: Option kind (OPT_WS == 3)
* @len: Option length (3)
* @shift: Window scaling shift
*/
struct tcp_opt_ws {
uint8_t kind;
uint8_t len;
uint8_t shift;
} __attribute__ ((packed));
#define TCP_OPT_WS(shift_) \
((struct tcp_opt_ws) { \
.kind = OPT_WS, \
.len = sizeof(struct tcp_opt_ws), \
.shift = (shift_), \
})
/** struct tcp_syn_opts - TCP options we apply to SYN packets
* @mss: Maximum Segment Size (MSS) option
* @nop: NOP opt (for alignment)
* @ws: Window Scaling (WS) option
*/
struct tcp_syn_opts {
struct tcp_opt_mss mss;
struct tcp_opt_nop nop;
struct tcp_opt_ws ws;
} __attribute__ ((packed));
#define TCP_SYN_OPTS(mss_, ws_) \
((struct tcp_syn_opts){ \
.mss = TCP_OPT_MSS(mss_), \
.nop = TCP_OPT_NOP, \
.ws = TCP_OPT_WS(ws_), \
})
extern char tcp_buf_discard [MAX_WINDOW];
void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
unsigned long flag);
@@ -54,28 +153,23 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
conn_event_do(c, conn, event); \
} while (0)
void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
#define tcp_rst(c, conn) \
do { \
flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
tcp_rst_do(c, conn); \
} while (0)
struct tcp_info_linux;
size_t tcp_fill_headers4(const struct ctx *c,
const struct tcp_tap_conn *conn,
struct iphdr *iph, struct tcphdr *th,
size_t plen, const uint16_t *check,
uint32_t seq);
size_t tcp_fill_headers6(const struct ctx *c,
const struct tcp_tap_conn *conn,
struct ipv6hdr *ip6h, struct tcphdr *th,
size_t plen, uint32_t seq);
size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, size_t dlen,
const uint16_t *check, uint32_t seq,
bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
int force_seq, struct tcp_info *tinfo);
int tcp_fill_flag_header(struct ctx *c, struct tcp_tap_conn *conn, int flags,
struct tcphdr *th, char *opts, size_t *optlen);
bool force_seq, struct tcp_info_linux *tinfo);
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen);
#endif /* TCP_INTERNAL_H */
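The packed option structures added to this header describe the SYN options as they appear on the wire: MSS (4 bytes), NOP (1 byte) and window scaling (3 bytes), eight bytes in total. The sketch below copies those definitions and serialises one set of options. `syn_opts_bytes()` is a hypothetical helper, and the code assumes a compiler that supports `__attribute__ ((packed))`, such as GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define OPT_NOP		1
#define OPT_MSS		2
#define OPT_WS		3

struct tcp_opt_nop {
	uint8_t kind;
} __attribute__ ((packed));

struct tcp_opt_mss {
	uint8_t kind;
	uint8_t len;
	uint16_t mss;
} __attribute__ ((packed));

struct tcp_opt_ws {
	uint8_t kind;
	uint8_t len;
	uint8_t shift;
} __attribute__ ((packed));

struct tcp_syn_opts {
	struct tcp_opt_mss mss;
	struct tcp_opt_nop nop;
	struct tcp_opt_ws ws;
} __attribute__ ((packed));

/* The NOP pads the options to a whole number of 32-bit words */
static_assert(sizeof(struct tcp_syn_opts) == 8, "unexpected padding");

/**
 * syn_opts_bytes() - Hypothetical helper: write SYN options to a buffer
 * @buf:	Destination, at least sizeof(struct tcp_syn_opts) bytes
 * @mss:	MSS to advertise, host byte order
 * @ws:		Window scaling shift
 *
 * Return: number of bytes written
 */
static inline size_t syn_opts_bytes(uint8_t *buf, uint16_t mss, uint8_t ws)
{
	struct tcp_syn_opts opts = {
		.mss = { .kind = OPT_MSS, .len = sizeof(opts.mss),
			 .mss = htons(mss) },
		.nop = { .kind = OPT_NOP },
		.ws  = { .kind = OPT_WS, .len = sizeof(opts.ws),
			 .shift = ws },
	};

	memcpy(buf, &opts, sizeof(opts));
	return sizeof(opts);
}
```

For an MSS of 1460 and a shift of 7, the buffer holds the bytes 02 04 05 b4 01 03 03 07. Taking each length field from sizeof() keeps it consistent with the structure layout, as the TCP_OPT_MSS() and TCP_OPT_WS() macros do.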

@@ -28,7 +28,7 @@
* - FIN_SENT_0: FIN (write shutdown) sent to accepted socket
* - FIN_SENT_1: FIN (write shutdown) sent to target socket
*
* #syscalls:pasta pipe2|pipe fcntl armv6l:fcntl64 armv7l:fcntl64 ppc64:fcntl64
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
*/
#include <sched.h>
@@ -73,10 +73,7 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE];
/* Pool of pre-opened pipes */
static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2];
#define CONN_V6(x) (x->flags & SPLICE_V6)
#define CONN_V4(x) (!CONN_V6(x))
#define CONN_HAS(conn, set) ((conn->events & (set)) == (set))
#define CONN(idx) (&FLOW(idx)->tcp_splice)
#define CONN_HAS(conn, set) (((conn)->events & (set)) == (set))
/* Display strings for connection events */
static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -94,6 +91,24 @@ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
static int tcp_sock_refill_ns(void *arg);
static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
/**
* conn_at_sidx() - Get spliced TCP connection specific flow at given sidx
* @sidx: Flow and side to retrieve
*
* Return: Spliced TCP connection at @sidx, or NULL if @sidx is invalid.
* Asserts if the flow at @sidx is not FLOW_TCP_SPLICE.
*/
static struct tcp_splice_conn *conn_at_sidx(flow_sidx_t sidx)
{
union flow *flow = flow_at_sidx(sidx);
if (!flow)
return NULL;
ASSERT(flow->f.type == FLOW_TCP_SPLICE);
return &flow->tcp_splice;
}
/**
* tcp_splice_conn_epoll_events() - epoll events masks for given state
* @events: Connection event flags
@@ -102,19 +117,22 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
static void tcp_splice_conn_epoll_events(uint16_t events,
struct epoll_event ev[])
{
ev[0].events = ev[1].events = 0;
unsigned sidei;
flow_foreach_sidei(sidei)
ev[sidei].events = 0;
if (events & SPLICE_ESTABLISHED) {
if (!(events & FIN_SENT_1))
ev[0].events = EPOLLIN | EPOLLRDHUP;
if (!(events & FIN_SENT_0))
ev[1].events = EPOLLIN | EPOLLRDHUP;
flow_foreach_sidei(sidei) {
if (!(events & FIN_SENT(!sidei)))
ev[sidei].events = EPOLLIN | EPOLLRDHUP;
}
} else if (events & SPLICE_CONNECT) {
ev[1].events = EPOLLOUT;
}
ev[0].events |= (events & OUT_WAIT_0) ? EPOLLOUT : 0;
ev[1].events |= (events & OUT_WAIT_1) ? EPOLLOUT : 0;
flow_foreach_sidei(sidei)
ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
}
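The side-indexed macros let one loop replace the pairs of per-side statements. The sketch below reproduces only the established-connection rule from tcp_splice_conn_epoll_events(): a side is polled for input unless a FIN was already sent to the opposite side. `wants_input()` is a hypothetical helper, and `BIT()` is defined locally on the assumption that it matches the project's definition.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in, assumed to match the project's BIT() */
#define BIT(n)			(1UL << (n))

#define OUT_WAIT(sidei_)	((sidei_) ? BIT(3) : BIT(2))
#define FIN_RCVD(sidei_)	((sidei_) ? BIT(5) : BIT(4))
#define FIN_SENT(sidei_)	((sidei_) ? BIT(7) : BIT(6))

/**
 * wants_input() - Hypothetical helper: should side @sidei be polled for input?
 * @events:	Connection event flags
 * @sidei:	Side index, 0 or 1
 *
 * Return: 1 if no FIN was sent to the opposite side yet, 0 otherwise
 */
static inline int wants_input(uint8_t events, unsigned sidei)
{
	assert(sidei < 2);

	return !(events & FIN_SENT(!sidei));
}
```

Once FIN_SENT(1) is set, side 0 is no longer polled for input while side 1 still is, which matches the two explicit checks that the loop replaces.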
/**
@@ -235,32 +253,31 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
/**
* tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
* @flow: Flow table entry for this connection
* @conn: Connection entry to handle
*
* Return: true if the flow is ready to free, false otherwise
*/
bool tcp_splice_flow_defer(union flow *flow)
bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
{
struct tcp_splice_conn *conn = &flow->tcp_splice;
unsigned side;
unsigned sidei;
if (!(flow->tcp_splice.flags & CLOSING))
if (!(conn->flags & CLOSING))
return false;
for (side = 0; side < SIDES; side++) {
flow_foreach_sidei(sidei) {
/* Flushing might need to block: don't recycle them. */
if (conn->pipe[side][0] >= 0) {
close(conn->pipe[side][0]);
close(conn->pipe[side][1]);
conn->pipe[side][0] = conn->pipe[side][1] = -1;
if (conn->pipe[sidei][0] >= 0) {
close(conn->pipe[sidei][0]);
close(conn->pipe[sidei][1]);
conn->pipe[sidei][0] = conn->pipe[sidei][1] = -1;
}
if (conn->s[side] >= 0) {
close(conn->s[side]);
conn->s[side] = -1;
if (conn->s[sidei] >= 0) {
close(conn->s[sidei]);
conn->s[sidei] = -1;
}
conn->read[side] = conn->written[side] = 0;
conn->read[sidei] = conn->written[sidei] = 0;
}
conn->events = SPLICE_CLOSED;
@@ -280,33 +297,33 @@ bool tcp_splice_flow_defer(union flow *flow)
static int tcp_splice_connect_finish(const struct ctx *c,
struct tcp_splice_conn *conn)
{
unsigned side;
unsigned sidei;
int i = 0;
for (side = 0; side < SIDES; side++) {
flow_foreach_sidei(sidei) {
for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) {
if (splice_pipe_pool[i][0] >= 0) {
SWAP(conn->pipe[side][0],
SWAP(conn->pipe[sidei][0],
splice_pipe_pool[i][0]);
SWAP(conn->pipe[side][1],
SWAP(conn->pipe[sidei][1],
splice_pipe_pool[i][1]);
break;
}
}
if (conn->pipe[side][0] < 0) {
if (pipe2(conn->pipe[side], O_NONBLOCK | O_CLOEXEC)) {
if (conn->pipe[sidei][0] < 0) {
if (pipe2(conn->pipe[sidei], O_NONBLOCK | O_CLOEXEC)) {
flow_err(conn, "cannot create %d->%d pipe: %s",
side, !side, strerror(errno));
sidei, !sidei, strerror(errno));
conn_flag(c, conn, CLOSING);
return -EIO;
}
if (fcntl(conn->pipe[side][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
if (fcntl(conn->pipe[sidei][0], F_SETPIPE_SZ,
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
flow_trace(conn,
"cannot set %d->%d pipe size to %zu",
side, !side, c->tcp.pipe_size);
sidei, !sidei, c->tcp.pipe_size);
}
}
}
@@ -321,31 +338,20 @@ static int tcp_splice_connect_finish(const struct ctx *c,
* tcp_splice_connect() - Create and connect socket for new spliced connection
* @c: Execution context
* @conn: Connection pointer
* @af: Address family
* @pif: pif on which to create socket
* @port: Destination port, host order
*
* Return: 0 for connect() succeeded or in progress, negative value on error
*/
static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
sa_family_t af, uint8_t pif, in_port_t port)
static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
{
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
.sin6_port = htons(port),
.sin6_addr = IN6ADDR_LOOPBACK_INIT,
};
struct sockaddr_in addr4 = {
.sin_family = AF_INET,
.sin_port = htons(port),
.sin_addr = IN4ADDR_LOOPBACK_INIT,
};
const struct sockaddr *sa;
const struct flowside *tgt = &conn->f.side[TGTSIDE];
sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6;
uint8_t tgtpif = conn->f.pif[TGTSIDE];
union sockaddr_inany sa;
socklen_t sl;
if (pif == PIF_HOST)
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
else if (pif == PIF_SPLICE)
else if (tgtpif == PIF_SPLICE)
conn->s[1] = tcp_conn_sock_ns(c, af);
else
ASSERT(0);
@@ -359,15 +365,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
conn->s[1]);
}
if (CONN_V6(conn)) {
sa = (struct sockaddr *)&addr6;
sl = sizeof(addr6);
} else {
sa = (struct sockaddr *)&addr4;
sl = sizeof(addr4);
}
pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
if (connect(conn->s[1], sa, sl)) {
if (connect(conn->s[1], &sa.sa, sl)) {
if (errno != EINPROGRESS) {
flow_trace(conn, "Couldn't connect socket for splice: %s",
strerror(errno));
@@ -414,67 +414,19 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
/**
* tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
* @c: Execution context
* @pif0: pif id of side 0
* @dstport: Side 0 destination port of connection
* @flow: flow to initialise
* @s0: Accepted (side 0) socket
* @sa: Peer address of connection
*
* Return: true if able to create a spliced connection, false otherwise
* #syscalls:pasta setsockopt
*/
bool tcp_splice_conn_from_sock(const struct ctx *c,
uint8_t pif0, in_port_t dstport,
union flow *flow, int s0,
const union sockaddr_inany *sa)
void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
{
struct tcp_splice_conn *conn;
union inany_addr src;
in_port_t srcport;
sa_family_t af;
uint8_t pif1;
struct tcp_splice_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE,
tcp_splice);
if (c->mode != MODE_PASTA)
return false;
ASSERT(c->mode == MODE_PASTA);
inany_from_sockaddr(&src, &srcport, sa);
af = inany_v4(&src) ? AF_INET : AF_INET6;
switch (pif0) {
case PIF_SPLICE:
if (!inany_is_loopback(&src)) {
char str[INANY_ADDRSTRLEN];
/* We can't use flow_err() etc. because we haven't set
* the flow type yet
*/
warn("Bad source address %s for splice, closing",
inany_ntop(&src, str, sizeof(str)));
/* We *don't* want to fall back to tap */
flow_alloc_cancel(flow);
return true;
}
pif1 = PIF_HOST;
dstport += c->tcp.fwd_out.delta[dstport];
break;
case PIF_HOST:
if (!inany_is_loopback(&src))
return false;
pif1 = PIF_SPLICE;
dstport += c->tcp.fwd_in.delta[dstport];
break;
default:
return false;
}
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
conn->flags = af == AF_INET ? 0 : SPLICE_V6;
conn->s[0] = s0;
conn->s[1] = -1;
conn->pipe[0][0] = conn->pipe[0][1] = -1;
@ -483,10 +435,10 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
if (tcp_splice_connect(c, conn, af, pif1, dstport))
if (tcp_splice_connect(c, conn))
conn_flag(c, conn, CLOSING);
return true;
FLOW_ACTIVATE(conn);
}
/**
@ -500,8 +452,8 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events)
{
struct tcp_splice_conn *conn = CONN(ref.flowside.flow);
unsigned side = ref.flowside.side, fromside;
struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
unsigned evsidei = ref.flowside.sidei, fromsidei;
uint8_t lowat_set_flag, lowat_act_flag;
int eof, never_read;
@ -533,30 +485,31 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
}
if (events & EPOLLOUT) {
fromside = !side;
conn_event(c, conn, side == 0 ? ~OUT_WAIT_0 : ~OUT_WAIT_1);
fromsidei = !evsidei;
conn_event(c, conn, ~OUT_WAIT(evsidei));
} else {
fromside = side;
fromsidei = evsidei;
}
if (events & EPOLLRDHUP)
/* For side 0 this is fake, but implied */
conn_event(c, conn, side == 0 ? FIN_RCVD_0 : FIN_RCVD_1);
conn_event(c, conn, FIN_RCVD(evsidei));
swap:
eof = 0;
never_read = 1;
lowat_set_flag = fromside == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1;
lowat_act_flag = fromside == 0 ? RCVLOWAT_ACT_0 : RCVLOWAT_ACT_1;
lowat_set_flag = RCVLOWAT_SET(fromsidei);
lowat_act_flag = RCVLOWAT_ACT(fromsidei);
while (1) {
ssize_t readlen, to_write = 0, written;
ssize_t readlen, written, pending;
int more = 0;
retry:
readlen = splice(conn->s[fromside], NULL,
conn->pipe[fromside][1], NULL, c->tcp.pipe_size,
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
flow_trace(conn, "%zi from read-side call", readlen);
if (readlen < 0) {
@ -565,14 +518,11 @@ retry:
if (errno != EAGAIN)
goto close;
to_write = c->tcp.pipe_size;
} else if (!readlen) {
eof = 1;
to_write = c->tcp.pipe_size;
} else {
never_read = 0;
to_write += readlen;
if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
more = SPLICE_F_MORE;
@ -581,11 +531,11 @@ retry:
}
eintr:
written = splice(conn->pipe[fromside][0], NULL,
conn->s[!fromside], NULL, to_write,
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL, c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, to_write);
written, c->tcp.pipe_size);
/* Most common case: skip updating counters. */
if (readlen > 0 && readlen == written) {
@ -596,18 +546,23 @@ eintr:
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
setsockopt(conn->s[fromside], SOL_SOCKET,
SO_RCVLOWAT, &lowat, sizeof(lowat));
conn_flag(c, conn, lowat_set_flag);
conn_flag(c, conn, lowat_act_flag);
if (setsockopt(conn->s[fromsidei], SOL_SOCKET,
SO_RCVLOWAT,
&lowat, sizeof(lowat))) {
flow_trace(conn,
"Setting SO_RCVLOWAT %i: %s",
lowat, strerror(errno));
} else {
conn_flag(c, conn, lowat_set_flag);
conn_flag(c, conn, lowat_act_flag);
}
}
break;
}
conn->read[fromside] += readlen > 0 ? readlen : 0;
conn->written[fromside] += written > 0 ? written : 0;
conn->read[fromsidei] += readlen > 0 ? readlen : 0;
conn->written[fromsidei] += written > 0 ? written : 0;
if (written < 0) {
if (errno == EINTR)
@ -616,47 +571,43 @@ eintr:
if (errno != EAGAIN)
goto close;
if (never_read)
if (conn->read[fromsidei] == conn->written[fromsidei])
break;
conn_event(c, conn,
fromside == 0 ? OUT_WAIT_1 : OUT_WAIT_0);
conn_event(c, conn, OUT_WAIT(!fromsidei));
break;
}
if (never_read && written == (long)(c->tcp.pipe_size))
goto retry;
if (!never_read && written < to_write) {
to_write -= written;
pending = conn->read[fromsidei] - conn->written[fromsidei];
if (!never_read && written > 0 && written < pending)
goto retry;
}
if (eof)
break;
}
if ((conn->events & FIN_RCVD_0) && !(conn->events & FIN_SENT_1)) {
if (conn->read[fromside] == conn->written[fromside] && eof) {
shutdown(conn->s[1], SHUT_WR);
conn_event(c, conn, FIN_SENT_1);
if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
unsigned sidei;
flow_foreach_sidei(sidei) {
if ((conn->events & FIN_RCVD(sidei)) &&
!(conn->events & FIN_SENT(!sidei))) {
shutdown(conn->s[!sidei], SHUT_WR);
conn_event(c, conn, FIN_SENT(!sidei));
}
}
}
if ((conn->events & FIN_RCVD_1) && !(conn->events & FIN_SENT_0)) {
if (conn->read[fromside] == conn->written[fromside] && eof) {
shutdown(conn->s[0], SHUT_WR);
conn_event(c, conn, FIN_SENT_0);
}
}
if (CONN_HAS(conn, FIN_SENT_0 | FIN_SENT_1))
if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1)))
goto close;
if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) {
events = EPOLLIN;
fromside = !fromside;
fromsidei = !fromsidei;
goto swap;
}
@ -721,7 +672,7 @@ static void tcp_splice_pipe_refill(const struct ctx *c)
continue;
if (fcntl(splice_pipe_pool[i][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
trace("TCP (spliced): cannot set pool pipe size to %zu",
c->tcp.pipe_size);
}
@ -734,6 +685,7 @@ static void tcp_splice_pipe_refill(const struct ctx *c)
*
* Return: 0
*/
/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int tcp_sock_refill_ns(void *arg)
{
const struct ctx *c = (const struct ctx *)arg;
@ -786,29 +738,26 @@ void tcp_splice_init(struct ctx *c)
/**
* tcp_splice_timer() - Timer for spliced connections
* @c: Execution context
* @flow: Flow table entry
* @conn: Connection to handle
*/
void tcp_splice_timer(const struct ctx *c, union flow *flow)
void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn)
{
struct tcp_splice_conn *conn = &flow->tcp_splice;
int side;
unsigned sidei;
ASSERT(!(conn->flags & CLOSING));
for (side = 0; side < SIDES; side++) {
uint8_t set = side == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1;
uint8_t act = side == 0 ? RCVLOWAT_ACT_0 : RCVLOWAT_ACT_1;
if ((conn->flags & set) && !(conn->flags & act)) {
if (setsockopt(conn->s[side], SOL_SOCKET, SO_RCVLOWAT,
flow_foreach_sidei(sidei) {
if ((conn->flags & RCVLOWAT_SET(sidei)) &&
!(conn->flags & RCVLOWAT_ACT(sidei))) {
if (setsockopt(conn->s[sidei], SOL_SOCKET, SO_RCVLOWAT,
&((int){ 1 }), sizeof(int))) {
flow_trace(conn, "can't set SO_RCVLOWAT on %d",
conn->s[side]);
conn->s[sidei]);
}
conn_flag(c, conn, ~set);
conn_flag(c, conn, ~RCVLOWAT_SET(sidei));
}
}
conn_flag(c, conn, ~RCVLOWAT_ACT_0);
conn_flag(c, conn, ~RCVLOWAT_ACT_1);
flow_foreach_sidei(sidei)
conn_flag(c, conn, ~RCVLOWAT_ACT(sidei));
}


@ -11,10 +11,7 @@ union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
uint8_t pif0, in_port_t dstport,
union flow *flow, int s0,
const union sockaddr_inany *sa);
void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */

tcp_vu.c

@ -1,460 +0,0 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/ip.h>
#include <sys/socket.h>
#include <linux/tcp.h>
#include <linux/virtio_net.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "siphash.h"
#include "inany.h"
#include "vhost_user.h"
#include "tcp.h"
#include "pcap.h"
#include "flow.h"
#include "tcp_conn.h"
#include "flow_table.h"
#include "tcp_vu.h"
#include "tcp_internal.h"
#include "checksum.h"
#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr))
#define CONN_V6(conn) (!CONN_V4(conn))
/* vhost-user */
static const struct virtio_net_hdr vu_header = {
.flags = VIRTIO_NET_HDR_F_DATA_VALID,
.gso_type = VIRTIO_NET_HDR_GSO_NONE,
};
static unsigned char buffer[65536];
static struct iovec iov_vu [VIRTQUEUE_MAX_SIZE];
static unsigned int indexes [VIRTQUEUE_MAX_SIZE];
uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn)
{
(void)conn;
return USHRT_MAX;
}
int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
VuDev *vdev = (VuDev *)&c->vdev;
VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
size_t tlen, vnet_hdrlen, ip_len, optlen = 0;
struct virtio_net_hdr_mrg_rxbuf *vh;
VuVirtqElement *elem;
struct ethhdr *eh;
int nb_ack;
int ret;
elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
if (!elem)
return 0;
if (elem->in_num < 1) {
err("virtio-net receive queue contains no in buffers");
vu_queue_rewind(vdev, vq, 1);
return 0;
}
vh = elem->in_sg[0].iov_base;
vh->hdr = vu_header;
if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
vh->num_buffers = htole16(1);
} else {
vnet_hdrlen = sizeof(struct virtio_net_hdr);
}
eh = (struct ethhdr *)((char *)elem->in_sg[0].iov_base + vnet_hdrlen);
memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
if (CONN_V4(conn)) {
struct iphdr *iph = (struct iphdr *)(eh + 1);
struct tcphdr *th = (struct tcphdr *)(iph + 1);
char *data = (char *)(th + 1);
eh->h_proto = htons(ETH_P_IP);
*th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
ret = tcp_fill_flag_header(c, conn, flags, th, data, &optlen);
if (ret <= 0) {
vu_queue_rewind(vdev, vq, 1);
return ret;
}
ip_len = tcp_fill_headers4(c, conn, iph,
(struct tcphdr *)(iph + 1), optlen,
NULL, conn->seq_to_tap);
tlen = ip_len + sizeof(struct ethhdr);
if (*c->pcap) {
uint32_t sum = proto_ipv4_header_psum(iph->tot_len,
IPPROTO_TCP,
(struct in_addr){ .s_addr = iph->saddr },
(struct in_addr){ .s_addr = iph->daddr });
th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
}
} else {
struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
char *data = (char *)(th + 1);
eh->h_proto = htons(ETH_P_IPV6);
*th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
ret = tcp_fill_flag_header(c, conn, flags, th, data, &optlen);
if (ret <= 0) {
vu_queue_rewind(vdev, vq, 1);
return ret;
}
ip_len = tcp_fill_headers6(c, conn, ip6h,
(struct tcphdr *)(ip6h + 1),
optlen, conn->seq_to_tap);
tlen = ip_len + sizeof(struct ethhdr);
if (*c->pcap) {
uint32_t sum = proto_ipv6_header_psum(ip6h->payload_len,
IPPROTO_TCP,
&ip6h->saddr,
&ip6h->daddr);
th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
}
}
pcap((void *)eh, tlen);
tlen += vnet_hdrlen;
vu_queue_fill(vdev, vq, elem, tlen, 0);
nb_ack = 1;
if (flags & DUP_ACK) {
elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
if (elem) {
if (elem->in_num < 1 || elem->in_sg[0].iov_len < tlen) {
vu_queue_rewind(vdev, vq, 1);
} else {
memcpy(elem->in_sg[0].iov_base, vh, tlen);
nb_ack++;
}
}
}
vu_queue_flush(vdev, vq, nb_ack);
vu_queue_notify(vdev, vq);
return 0;
}
int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
uint32_t already_sent;
VuDev *vdev = (VuDev *)&c->vdev;
VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
int s = conn->sock, v4 = CONN_V4(conn);
int i, ret = 0, iov_count, iov_used;
struct msghdr mh_sock = { 0 };
size_t l2_hdrlen, vnet_hdrlen, fillsize;
ssize_t len;
uint16_t *check;
uint16_t mss = MSS_GET(conn);
int num_buffers;
int segment_size;
struct iovec *first;
bool has_mrg_rxbuf;
if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
err("Got packet, but no available descriptors on RX virtq.");
return 0;
}
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
if (SEQ_LT(already_sent, 0)) {
/* RFC 761, section 2.1. */
flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
/* Set up buffer descriptors we'll fill completely and partially. */
fillsize = wnd_scaled;
iov_vu[0].iov_base = tcp_buf_discard;
iov_vu[0].iov_len = already_sent;
fillsize -= already_sent;
has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
if (has_mrg_rxbuf) {
vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
} else {
vnet_hdrlen = sizeof(struct virtio_net_hdr);
}
l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct tcphdr);
if (v4) {
l2_hdrlen += sizeof(struct iphdr);
} else {
l2_hdrlen += sizeof(struct ipv6hdr);
}
iov_count = 0;
segment_size = 0;
while (fillsize > 0 && iov_count < VIRTQUEUE_MAX_SIZE - 1) {
VuVirtqElement *elem;
elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
if (!elem)
break;
if (elem->in_num < 1) {
err("virtio-net receive queue contains no in buffers");
goto err;
}
ASSERT(elem->in_num == 1);
ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
indexes[iov_count] = elem->index;
if (segment_size == 0) {
iov_vu[iov_count + 1].iov_base =
(char *)elem->in_sg[0].iov_base + l2_hdrlen;
iov_vu[iov_count + 1].iov_len =
elem->in_sg[0].iov_len - l2_hdrlen;
} else {
iov_vu[iov_count + 1].iov_base = elem->in_sg[0].iov_base;
iov_vu[iov_count + 1].iov_len = elem->in_sg[0].iov_len;
}
if (iov_vu[iov_count + 1].iov_len > fillsize)
iov_vu[iov_count + 1].iov_len = fillsize;
segment_size += iov_vu[iov_count + 1].iov_len;
if (!has_mrg_rxbuf) {
segment_size = 0;
} else if (segment_size >= mss) {
iov_vu[iov_count + 1].iov_len -= segment_size - mss;
segment_size = 0;
}
fillsize -= iov_vu[iov_count + 1].iov_len;
iov_count++;
}
if (iov_count == 0)
return 0;
mh_sock.msg_iov = iov_vu;
mh_sock.msg_iovlen = iov_count + 1;
do
len = recvmsg(s, &mh_sock, MSG_PEEK);
while (len < 0 && errno == EINTR);
if (len < 0)
goto err;
if (!len) {
vu_queue_rewind(vdev, vq, iov_count);
if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
if ((ret = tcp_vu_send_flag(c, conn, FIN | ACK))) {
tcp_rst(c, conn);
return ret;
}
conn_event(c, conn, TAP_FIN_SENT);
}
return 0;
}
len -= already_sent;
if (len <= 0) {
conn_flag(c, conn, STALLED);
vu_queue_rewind(vdev, vq, iov_count);
return 0;
}
conn_flag(c, conn, ~STALLED);
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, 0, NULL);
/* initialize headers */
iov_used = 0;
num_buffers = 0;
check = NULL;
segment_size = 0;
for (i = 0; i < iov_count && len; i++) {
if (segment_size == 0)
first = &iov_vu[i + 1];
if (iov_vu[i + 1].iov_len > (size_t)len)
iov_vu[i + 1].iov_len = len;
len -= iov_vu[i + 1].iov_len;
iov_used++;
segment_size += iov_vu[i + 1].iov_len;
num_buffers++;
if (segment_size >= mss || len == 0 ||
i + 1 == iov_count || !has_mrg_rxbuf) {
struct ethhdr *eh;
struct virtio_net_hdr_mrg_rxbuf *vh;
char *base = (char *)first->iov_base - l2_hdrlen;
size_t size = first->iov_len + l2_hdrlen;
vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
vh->hdr = vu_header;
if (has_mrg_rxbuf)
vh->num_buffers = htole16(num_buffers);
eh = (struct ethhdr *)((char *)base + vnet_hdrlen);
memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
/* initialize header */
if (v4) {
struct iphdr *iph = (struct iphdr *)(eh + 1);
struct tcphdr *th = (struct tcphdr *)(iph + 1);
eh->h_proto = htons(ETH_P_IP);
*th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
tcp_fill_headers4(c, conn, iph,
(struct tcphdr *)(iph + 1),
segment_size, len ? check : NULL,
conn->seq_to_tap);
if (*c->pcap) {
uint32_t sum = proto_ipv4_header_psum(iph->tot_len,
IPPROTO_TCP,
(struct in_addr){ .s_addr = iph->saddr },
(struct in_addr){ .s_addr = iph->daddr });
first->iov_base = th;
first->iov_len = size - l2_hdrlen + sizeof(*th);
th->check = csum_iov(first, num_buffers, sum);
}
check = &iph->check;
} else {
struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
eh->h_proto = htons(ETH_P_IPV6);
*th = (struct tcphdr){
.doff = sizeof(struct tcphdr) / 4,
.ack = 1
};
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
tcp_fill_headers6(c, conn, ip6h,
(struct tcphdr *)(ip6h + 1),
segment_size, conn->seq_to_tap);
if (*c->pcap) {
uint32_t sum = proto_ipv6_header_psum(ip6h->payload_len,
IPPROTO_TCP,
&ip6h->saddr,
&ip6h->daddr);
first->iov_base = th;
first->iov_len = size - l2_hdrlen + sizeof(*th);
th->check = csum_iov(first, num_buffers, sum);
}
}
/* set iov for pcap logging */
first->iov_base = eh;
first->iov_len = size - vnet_hdrlen;
pcap_iov(first, num_buffers);
/* set iov_len for vu_queue_fill_by_index(); */
first->iov_base = base;
first->iov_len = size;
conn->seq_to_tap += segment_size;
segment_size = 0;
num_buffers = 0;
}
}
/* release unused buffers */
vu_queue_rewind(vdev, vq, iov_count - iov_used);
/* send packets */
for (i = 0; i < iov_used; i++) {
vu_queue_fill_by_index(vdev, vq, indexes[i],
iov_vu[i + 1].iov_len, i);
}
vu_queue_flush(vdev, vq, iov_used);
vu_queue_notify(vdev, vq);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
err:
vu_queue_rewind(vdev, vq, iov_count);
if (errno != EAGAIN && errno != EWOULDBLOCK) {
ret = -errno;
tcp_rst(c, conn);
}
return ret;
}


@ -1,9 +0,0 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#ifndef TCP_VU_H
#define TCP_VU_H
int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
#endif /*TCP_VU_H */

test/.gitignore

@ -1,5 +1,6 @@
test_logs/
mbuto/
podman/
*.img
QEMU_EFI.fd
*.qcow2


@ -8,7 +8,6 @@
WGET = wget -c
DEBIAN_IMGS = debian-8.11.0-openstack-amd64.qcow2 \
debian-9-nocloud-amd64-daily-20200210-166.qcow2 \
debian-10-nocloud-amd64.qcow2 \
debian-10-generic-arm64.qcow2 \
debian-10-generic-ppc64el-20220911-1135.qcow2 \
@ -42,8 +41,7 @@ OPENSUSE_IMGS = openSUSE-Leap-15.1-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.2-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.3-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz \
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz \
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
UBUNTU_OLD_IMGS = trusty-server-cloudimg-amd64-disk1.img \
trusty-server-cloudimg-i386-disk1.img \
@ -52,10 +50,10 @@ UBUNTU_NEW_IMGS = xenial-server-cloudimg-powerpc-disk1.img \
jammy-server-cloudimg-s390x.img
UBUNTU_IMGS = $(UBUNTU_OLD_IMGS) $(UBUNTU_NEW_IMGS)
DOWNLOAD_ASSETS = mbuto \
DOWNLOAD_ASSETS = mbuto podman \
$(DEBIAN_IMGS) $(FEDORA_IMGS) $(OPENSUSE_IMGS) $(UBUNTU_IMGS)
TESTDATA_ASSETS = small.bin big.bin medium.bin
LOCAL_ASSETS = mbuto.img mbuto.mem.img QEMU_EFI.fd \
LOCAL_ASSETS = mbuto.img mbuto.mem.img podman/bin/podman QEMU_EFI.fd \
$(DEBIAN_IMGS:%=prepared-%) $(FEDORA_IMGS:%=prepared-%) \
$(UBUNTU_NEW_IMGS:%=prepared-%) \
nstool guest-key guest-key.pub \
@ -67,13 +65,27 @@ CFLAGS = -Wall -Werror -Wextra -pedantic -std=c99
assets: $(ASSETS)
.PHONY: pull-%
pull-%: %
git -C $* pull
mbuto:
git clone git://mbuto.sh/mbuto
mbuto/mbuto: pull-mbuto
podman:
git clone https://github.com/containers/podman.git
# To successfully build podman, you will need gpgme and systemd
# development packages
podman/bin/podman: pull-podman
$(MAKE) -C podman
guest-key guest-key.pub:
ssh-keygen -f guest-key -N ''
mbuto.img: passt.mbuto mbuto guest-key.pub $(TESTDATA_ASSETS)
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub $(TESTDATA_ASSETS)
./mbuto/mbuto -p ./$< -c lz4 -f $@
mbuto.mem.img: passt.mem.mbuto mbuto ../passt.avx2
@ -121,9 +133,6 @@ realclean: clean
debian-8.11.0-openstack-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/OpenStack/archive/8.11.0/debian-8.11.0-openstack-$*.qcow2
debian-9-nocloud-%-daily-20200210-166.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/stretch/daily/20200210-166/debian-9-nocloud-$*-daily-20200210-166.qcow2
debian-10-nocloud-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/buster/latest/debian-10-nocloud-$*.qcow2
@ -189,9 +198,6 @@ openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz:
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz:
$(WGET) -O $@ http://download.opensuse.org/ports/armv7hl/tumbleweed/appliances/openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2:
$(WGET) -O $@ https://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
# Ubuntu downloads
trusty-server-cloudimg-%-disk1.img:
$(WGET) -O $@ https://cloud-images.ubuntu.com/trusty/current/trusty-server-cloudimg-$*-disk1.img


@ -28,10 +28,11 @@ on a system, i.e. common utilities such as a shell are not included here.
Example for Debian, and possibly most Debian-based distributions:
build-essential git jq strace iperf3 qemu-system-x86 tmux sipcalc bats bc
catatonit clang-tidy cppcheck go isc-dhcp-common psmisc linux-cpupower socat
netcat-openbsd fakeroot lz4 lm-sensors qemu-system-arm qemu-system-ppc
qemu-system-misc qemu-system-x86 valgrind
bats bc build-essential catatonit clang-tidy conmon cppcheck crun fakeroot
git go iperf3 isc-dhcp-common jq libgpgme-dev libseccomp-dev linux-cpupower
lm-sensors lz4 netavark netcat-openbsd psmisc qemu-efi-aarch64
qemu-system-arm qemu-system-misc qemu-system-ppc qemu-system-x86
sipcalc socat strace tmux uidmap valgrind
NOTE: the tests need a qemu version >= 7.2, or one that contains commit
13c6be96618c ("net: stream: add unix socket"): this change introduces support


@ -15,7 +15,7 @@
# layout_pasta() - Panes for host, pasta, and separate one for namespace
layout_pasta() {
sleep 3
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
@ -46,7 +46,7 @@ layout_pasta() {
# layout_passt() - Panes for host, passt, and guest
layout_passt() {
sleep 3
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
@ -77,7 +77,7 @@ layout_passt() {
# layout_passt_in_pasta() - Host, passt within pasta, namespace and guest
layout_passt_in_pasta() {
sleep 3
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
@ -113,7 +113,7 @@ layout_passt_in_pasta() {
# layout_two_guests() - Two guest panes, two passt panes, plus host and log
layout_two_guests() {
sleep 3
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
@ -152,7 +152,7 @@ layout_two_guests() {
# layout_demo_pasta() - Four panes for pasta demo
layout_demo_pasta() {
sleep 3
sleep 1
cmd_write 0 cd ${BASEPATH}
cmd_write 0 clear
@ -188,7 +188,7 @@ layout_demo_pasta() {
# layout_demo_passt() - Four panes for passt demo
layout_demo_passt() {
sleep 3
sleep 1
cmd_write 0 cd ${BASEPATH}
cmd_write 0 clear
@ -224,7 +224,7 @@ layout_demo_passt() {
# layout_demo_podman() - Four panes for pasta demo with Podman
layout_demo_podman() {
sleep 3
sleep 1
cmd_write 0 cd ${BASEPATH}
cmd_write 0 clear


@ -18,7 +18,7 @@ PERF_LINK_COUNT=0
PERF_JS="${LOGDIR}/web/perf.js"
PERF_TEMPLATE_HTML="document.write('"'
Throughput in Gbps, latency in µs. Threads are <span style="font-family: monospace;">iperf3</span> processes, <i>passt</i> and <i>pasta</i> are currently single-threaded.<br/>
Throughput in Gbps, latency in µs. Threads are <span style="font-family: monospace;">iperf3</span> threads, <i>passt</i> and <i>pasta</i> are currently single-threaded.<br/>
Click on numbers to show test execution. Measured at head, commit <span style="font-family: monospace;">__commit__</span>.
<style type="text/CSS">
@ -56,7 +56,7 @@ table.pasta_local th { text-align: center; font-weight: bold; }
table.pasta_local tr:not(:first-of-type) td:not(:first-of-type) { font-family: monospace; font-weight: bolder; }
table.pasta_local tr:nth-child(3n+0) { background-color: #112315; }
table.pasta_local tr:not(:nth-child(3n+0)) td { background-color: #101010; }
table.pasta_local td:nth-child(3n+2) { background-color: #603302; }
table.pasta_local td:nth-child(4n+2) { background-color: #603302; }
table.pasta_local tr:nth-child(1) { background-color: #363e61; }
table.pasta td { border: 0px solid; padding: 6px; line-height: 1; }
table.pasta td { text-align: right; }


@ -17,6 +17,8 @@ INITRAMFS="${BASEPATH}/mbuto.img"
VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
__mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
VMEM="$((${__mem_kib} / 1024 / 4))"
QEMU_ARCH="$(uname -m)"
[ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
# setup_build() - Set up pane layout for build tests
setup_build() {
@ -53,10 +55,10 @@ setup_passt() {
wait_for [ -f "${STATESETUP}/passt.pid" ]
GUEST_CID=94557
context_run_bg qemu 'qemu-system-$(uname -m)' \
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -124,7 +126,12 @@ setup_passt_in_ns() {
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold"
__map_host4=192.0.2.1
__map_host6=2001:db8:9a55::1
__map_ns4=192.0.2.2
__map_ns6=2001:db8:9a55::2
context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --map-host-loopback ${__map_host4} --map-host-loopback ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold"
wait_for [ -f "${STATESETUP}/pasta.pid" ]
context_setup_nstool qemu ${STATESETUP}/ns.hold
@ -139,20 +146,20 @@ setup_passt_in_ns() {
if [ ${VALGRIND} -eq 1 ]; then
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
else
context_run passt "make clean"
context_run passt "make"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
fi
wait_for [ -f "${STATESETUP}/passt.pid" ]
GUEST_CID=94557
context_run_bg qemu 'qemu-system-$(uname -m)' \
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -220,10 +227,10 @@ setup_two_guests() {
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-$(uname -m)' \
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -233,10 +240,10 @@ setup_two_guests() {
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-$(uname -m)' \
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
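The hunks above replace `qemu-system-$(uname -m)` with a precomputed `QEMU_ARCH`, mapping `i686` to `i386` so that the name matches the `qemu-system-i386` binary. A minimal sketch of the same mapping written as a function follows; `qemu_arch_for` is a hypothetical helper name used only for illustration, it is not part of the test suite, and only the `i686` case is taken from the diff.

```shell
#!/bin/sh
# qemu_arch_for() - Map a machine name to a qemu-system-* binary suffix
# (hypothetical helper, shown for illustration only)
# $1: Machine name as printed by 'uname -m', defaults to the local one
qemu_arch_for() {
	__machine="${1:-$(uname -m)}"

	# Only the i686 case appears in the diff; other names pass through
	if [ "${__machine}" = "i686" ]; then
		__machine=i386
	fi

	echo "${__machine}"
}
```

A caller could then build the binary name with `qemu-system-$(qemu_arch_for)`, which is what the `'qemu-system-'"${QEMU_ARCH}"` arguments in the diff amount to.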


@ -31,8 +31,8 @@ PR_DELAY_INIT=100 # ms
# $@: Message to print
info() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> $STATEBASE/log_pipe
printf "${@}\n" >> "${LOGFILE}"
}
# info_n() - Highlight, print message to pane and to log file without newline
@ -47,13 +47,13 @@ info_n() {
# $@: Message to print
info_nolog() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
printf "${@}\n" >> $STATEBASE/log_pipe
}
# log() - Print message to log file
# $@: Message to print
log() {
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> "${LOGFILE}"
}
# info_nolog_n() - Send message to pane without highlighting it, without newline
@ -97,7 +97,6 @@ display_delay() {
switch_pane() {
tmux select-pane -t ${1}
PR_DELAY=${PR_DELAY_INIT}
display_delay "0.2"
}
# cmd_write() - Write a command to a pane, letter by letter, and execute it
@ -199,7 +198,7 @@ pane_run() {
# $1: Pane name
pane_wait() {
__lc="$(echo "${1}" | tr [A-Z] [a-z])"
sleep 0.1 || sleep 1
sleep 0.01 || sleep 1
__done=0
while
@ -207,7 +206,7 @@ pane_wait() {
case ${__l} in
*"$ " | *"# ") return ;;
esac
do sleep 0.1 || sleep 1; done
do sleep 0.01 || sleep 1; done
}
# pane_parse() - Print last line, @EMPTY@ if command had no output
@ -231,7 +230,7 @@ pane_status() {
__status="$(pane_parse "${1}")"
while ! [ "${__status}" -eq "${__status}" ] 2>/dev/null; do
sleep 1
sleep 0.01 || sleep 1
pane_run "${1}" 'echo $?'
pane_wait "${1}"
__status="$(pane_parse "${1}")"
@ -383,6 +382,16 @@ info_check_failed() {
printf " < failed.\n" >> "${LOGFILE}"
}
# status_bar_blink() - Make status bar blink
status_bar_blink() {
for i in `seq 1 3`; do
tmux set status-right-style 'bg=colour1 fg=colour196 bold'
sleep 0.1 || sleep 1
tmux set status-right-style 'bg=colour1 fg=colour233 bold'
sleep 0.1 || sleep 1
done
}
# info_passed() - Display, log, and make status bar blink when a test passes
info_passed() {
switch_pane ${PANE_INFO}
@ -391,12 +400,7 @@ info_passed() {
log "...passed."
log
for i in `seq 1 3`; do
tmux set status-right-style 'bg=colour1 fg=colour2 bold'
sleep "0.1"
tmux set status-right-style 'bg=colour1 fg=colour233 bold'
sleep "0.1"
done
[ ${FAST} -eq 1 ] || status_bar_blink
}
# info_failed() - Display, log, and make status bar blink when a test fails
@ -407,12 +411,7 @@ info_failed() {
log "...failed."
log
for i in `seq 1 3`; do
tmux set status-right-style 'bg=colour1 fg=colour196 bold'
sleep "0.1"
tmux set status-right-style 'bg=colour1 fg=colour233 bold'
sleep "0.1"
done
[ ${FAST} -eq 1 ] || status_bar_blink
pause_continue \
"Press any key to pause test session" \
@ -665,7 +664,7 @@ pause_continue() {
# run_term() - Start tmux session, running entry point, with recording if needed
run_term() {
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG"
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG -eTRACE=$TRACE -eKERNEL=$KERNEL"
if [ ${CI} -eq 1 ]; then
printf '\e[8;50;240t'
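
Note on the hunks above: the `sleep 0.01 || sleep 1` fallback and the integer test in pane_status() can be sketched in isolation as follows. This is only an illustration, not part of the change. The helper names short_sleep and is_integer are hypothetical, and the sketch assumes that some sleep implementations reject fractional arguments, in which case the whole-second fallback runs instead.

```shell
#!/bin/sh
# Hypothetical helper (not in the tree): try a sub-second sleep, and fall
# back to a one-second sleep if this sleep rejects fractional arguments.
short_sleep() {
	sleep "${1}" 2>/dev/null || sleep 1
}

# Hypothetical helper (not in the tree): succeeds only if $1 is an integer.
# Same idea as the [ "${__status}" -eq "${__status}" ] test in pane_status():
# the test utility reports an error for anything that is not a number.
is_integer() {
	[ "${1}" -eq "${1}" ] 2>/dev/null
}
```

With these, a loop such as `while ! is_integer "${__status}"; do short_sleep 0.01; ...; done` would have the same shape as the loop in pane_status().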


@ -15,18 +15,13 @@
# test_iperf3s() - Start iperf3 server
# $1: Destination/server context
# $2: Port number, ${i} is translated to process index
# $3: Number of processes to run in parallel
# $2: Port number
test_iperf3s() {
__sctx="${1}"
__port="${2}"
__procs="$((${3} - 1))"
pane_or_context_run_bg "${__sctx}" \
'for i in $(seq 0 '${__procs}'); do' \
' iperf3 -s -p'${__port}' &' \
' echo $! > s${i}.pid; ' \
'done' \
'iperf3 -s -p'${__port}' & echo $! > s.pid' \
sleep 1 # Wait for server to be ready
}
@ -36,9 +31,9 @@ test_iperf3s() {
test_iperf3k() {
__sctx="${1}"
pane_or_context_run "${__sctx}" 'kill -INT $(cat s*.pid); rm s*.pid'
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid'
sleep 3 # Wait for kernel to free up ports
sleep 1 # Wait for kernel to free up ports
}
# test_iperf3() - Ugly helper for iperf3 directive
@ -46,37 +41,29 @@ test_iperf3k() {
# $2: Source/client context
# $3: Destination name or address for client
# $4: Port number, ${i} is translated to process index
# $5: Number of processes to run in parallel
# $6: Run time, in seconds
# $5: Run time, in seconds
# $@: Client options
test_iperf3() {
__var="${1}"; shift
__cctx="${1}"; shift
__dest="${1}"; shift
__port="${1}"; shift
__procs="$((${1} - 1))"; shift
__time="${1}"; shift
pane_or_context_run "${__cctx}" 'rm -f c*.json'
pane_or_context_run "${__cctx}" 'rm -f c.json'
# A 1s wait for connection on what's basically a local link
# indicates something is pretty wrong
__timeout=1000
pane_or_context_run "${__cctx}" \
'(' \
' for i in $(seq 0 '${__procs}'); do' \
' iperf3 -J -c '${__dest}' -p '${__port} \
' --connect-timeout '${__timeout} \
' -t'${__time}' -i0 -T c${i} '"${@}" \
' > c${i}.json &' \
' done;' \
' wait' \
')'
'iperf3 -J -c '${__dest}' -p '${__port} \
' --connect-timeout '${__timeout} \
' -t'${__time}' -i0 '"${@}"' > c.json' \
__jval=".end.sum_received.bits_per_second"
__bw=$(pane_or_context_output "${__cctx}" \
'cat c*.json | jq -rMs "map('${__jval}') | add"')
'cat c.json | jq -rMs "map('${__jval}') | add"')
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}


@ -44,7 +44,7 @@ endef
def start_stop_diff
guest sed /proc/slabinfo -ne 's/^\([^ ]* *[^ ]* *[^ ]* *[^ ]*\).*/\\\1/p' > /tmp/slabinfo.before
guest cat /proc/meminfo > /tmp/meminfo.before
guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ --netns-only
guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__
sleep 2
guest cat /proc/meminfo > /tmp/meminfo.after
guest sed /proc/slabinfo -ne 's/^\([^ ]* *[^ ]* *[^ ]* *[^ ]*\).*/\\\1/p' > /tmp/slabinfo.after
@ -78,9 +78,16 @@ guest mount -o bind /proc /test/proc
guest mount -o bind /dev /test/dev
guest cp -Lr /bin /lib /lib64 /usr /sbin /test/
guest exec switch_root /test /bin/sh
guest ulimit -Hn 300000
guest unshare -rUm -R /test
guest chroot .
guest unshare -rUn
guest ip link add eth0 type dummy
guest ip link set eth0 up
guest ip address add 192.0.2.2/24 dev eth0
guest ip address add 2001:db8::2/64 dev eth0
guest ip route add default via 192.0.2.1
guest ip -6 route add default via 2001:db8::1 dev eth0
guest meminfo_size() { grep "^$2:" $1 | tr -s ' ' | cut -f2 -d ' '; }
guest meminfo_diff() { echo $(( $(meminfo_size $2 $3) - $(meminfo_size $1 $3) )); }
@ -103,27 +110,17 @@ info
th symbol MiB
set WHAT tcp_buf_discard
nm_row
set WHAT tcp6_l2_buf
set WHAT flowtab
nm_row
set WHAT tcp4_l2_buf
set WHAT tcp6_payload
nm_row
set WHAT tc
set WHAT tcp4_payload
nm_row
set WHAT pkt_buf
nm_row
set WHAT udp_splice_map
set WHAT udp_payload
nm_row
set WHAT udp6_l2_buf
nm_row
set WHAT udp4_l2_buf
nm_row
set WHAT udp_tap_map
nm_row
set WHAT icmp_id_map
nm_row
set WHAT udp_splice_buf
nm_row
set WHAT tc_hash
set WHAT flow_hashtab
nm_row
set WHAT pool_tap6_storage
nm_row
@ -142,8 +139,6 @@ set WHAT pid
slab_row
set WHAT dentry
slab_row
set WHAT Acpi-Parse
slab_row
set WHAT kmalloc-64
slab_row
set WHAT kmalloc-32


@ -31,10 +31,15 @@
#define ARRAY_SIZE(a) ((int)(sizeof(a) / sizeof((a)[0])))
#define die(...) \
do { \
fprintf(stderr, __VA_ARGS__); \
exit(1); \
#define die(...) \
do { \
fprintf(stderr, "nstool: " __VA_ARGS__); \
exit(1); \
} while (0)
#define err(...) \
do { \
fprintf(stderr, "nstool: " __VA_ARGS__); \
} while (0)
struct ns_type {
@ -156,6 +161,9 @@ static int connect_ctl(const char *sockpath, bool wait,
static void cmd_hold(int argc, char *argv[])
{
struct sigaction sa = {
.sa_handler = SIG_IGN,
};
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, PF_UNIX);
struct sockaddr_un addr;
const char *sockpath = argv[1];
@ -185,6 +193,10 @@ static void cmd_hold(int argc, char *argv[])
if (!getcwd(info.cwd, sizeof(info.cwd)))
die("getcwd(): %s\n", strerror(errno));
rc = sigaction(SIGPIPE, &sa, NULL);
if (rc)
die("sigaction(SIGPIPE): %s\n", strerror(errno));
do {
int afd = accept(fd, NULL, NULL);
char buf;
@ -193,17 +205,21 @@ static void cmd_hold(int argc, char *argv[])
die("accept(): %s\n", strerror(errno));
rc = write(afd, &info, sizeof(info));
if (rc < 0)
die("write(): %s\n", strerror(errno));
if (rc < 0) {
err("holder write() to control socket: %s\n",
strerror(errno));
}
if ((size_t)rc < sizeof(info))
die("short write() on control socket\n");
err("holder short write() on control socket\n");
rc = read(afd, &buf, sizeof(buf));
if (rc < 0)
die("read(): %s\n", strerror(errno));
if (rc < 0) {
err("holder read() on control socket: %s\n",
strerror(errno));
}
close(afd);
} while (rc == 0);
} while (rc <= 0);
unlink(sockpath);
}
@ -345,21 +361,43 @@ static int openns(const char *fmt, ...)
return fd;
}
static pid_t sig_pid;
static void sig_propagate(int signum)
{
int err;
err = kill(sig_pid, signum);
if (err)
die("Propagating %s: %s\n", strsignal(signum), strerror(errno));
}
static void wait_for_child(pid_t pid)
{
int status;
struct sigaction sa = {
.sa_handler = sig_propagate,
.sa_flags = SA_RESETHAND,
};
int status, err;
sig_pid = pid;
err = sigaction(SIGTERM, &sa, NULL);
if (err)
die("sigaction(SIGTERM): %s\n", strerror(errno));
/* Match the child's exit status, if possible */
for (;;) {
pid_t rc;
rc = waitpid(pid, &status, WUNTRACED);
if (rc < 0)
if (rc < 0) {
if (errno == EINTR)
continue;
die("waitpid() on %d: %s\n", pid, strerror(errno));
}
if (rc != pid)
die("waitpid() on %d returned %d", pid, rc);
if (WIFSTOPPED(status)) {
/* Stop the parent to patch */
/* Stop the parent to match */
kill(getpid(), SIGSTOP);
/* We must have resumed, resume the child */
kill(pid, SIGCONT);
@ -508,7 +546,7 @@ static void cmd_exec(int argc, char *argv[])
/* CHILD */
if (argc > optind + 1) {
exe = argv[optind + 1];
xargs = (const char * const*)(argv + optind + 1);
xargs = (const char *const *)(argv + optind + 1);
} else {
exe = getenv("SHELL");
if (!exe)


@ -15,6 +15,14 @@ PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp}"
# OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
# sshd-session the per-session program. We need the latter as well, and the path
# depends on the distribution. It doesn't exist on older versions.
for bin in /usr/lib/openssh/sshd-session /usr/lib/ssh/sshd-session \
/usr/libexec/openssh/sshd-session; do
command -v "${bin}" >/dev/null && PROGS="${PROGS} ${bin}"
done
KMODS="${KMODS:- virtio_net virtio_pci vmw_vsock_virtio_transport}"
LINKS="${LINKS:-
@ -54,7 +62,7 @@ EOF
ln -s /run /var/run
:> /etc/fstab
# sshd(dropbear) via vsock
# sshd via vsock
cat > /etc/passwd << EOF
root:x:0:0:root:/root:/bin/sh
sshd:x:100:100:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
@ -64,7 +72,9 @@ root:::0:99999:7:::
EOF
chmod 000 /etc/shadow
:> /etc/ssh/sshd_config
cat > /etc/ssh/sshd_config << EOF
Subsystem sftp internal-sftp
EOF
ssh-keygen -A
chmod 700 /root/.ssh
chmod 700 /run/sshd
@ -76,7 +86,7 @@ EOF
EOF
chmod 600 /root/.ssh/authorized_keys
chmod 700 /root
socat VSOCK-LISTEN:22,fork EXEC:"sshd -i -e" 2> /var/log/vsock-ssh.log &
socat VSOCK-LISTEN:22,fork EXEC:"/sbin/sshd -i -e" 2> /var/log/vsock-ssh.log &
sh +m
'
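
Note on the sshd-session lookup above: the loop appends whichever candidate paths exist on the build host, because the location differs between distributions and the binary is absent before OpenSSH 9.8. A minimal sketch of the same pattern follows. The function name add_existing is hypothetical and the paths in the usage line are placeholders, not the ones from the change.

```shell
#!/bin/sh
# Hypothetical helper (not in the tree): print the list given as $1, extended
# with every remaining argument that resolves to an executable.
add_existing() {
	__list="${1}"; shift
	for __bin in "${@}"; do
		# command -v succeeds for an absolute path only if it is executable
		command -v "${__bin}" >/dev/null 2>&1 && __list="${__list} ${__bin}"
	done
	echo "${__list}"
}
```

For example, `PROGS="$(add_existing "${PROGS}" /bin/sh /nonexistent/tool)"` would append only /bin/sh on a typical system.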


@ -12,7 +12,7 @@
PROGS="${PROGS:-ash,dash,bash chmod ip mount insmod mkdir ln cat chmod modprobe
grep mknod sed chown sleep bc ls ps mount unshare chroot cp kill diff
head tail sort tr tee cut nm which}"
head tail sort tr tee cut nm which switch_root}"
KMODS="${KMODS:- dummy}"
@ -29,13 +29,6 @@ COPIES="${COPIES} ../passt.avx2,/bin/passt.avx2"
FIXUP="${FIXUP}"'
ln -s /bin /usr/bin
chmod 777 /tmp
ip link add eth0 type dummy
ip link set eth0 up
ip address add 192.0.2.2/24 dev eth0
ip address add 2001:db8::2/64 dev eth0
ip route add default via 192.0.2.1
ip -6 route add default via 2001:db8::1 dev eth0
sleep 2
sh +m
'


@ -38,7 +38,7 @@ check [ __MTU__ = 65520 ]
test DHCP: DNS
gout DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | head -n3 | tr '\n' ',' | sed 's/,$//;s/$/\n/'
check [ "__DNS__" = "__HOST_DNS__" ] || [ "__DNS__" = "__HOST_GW__" -a "__HOST_DNS__" = "127.0.0.1" ]
check [ "__DNS__" = "__HOST_DNS__" ] || ( [ "__DNS__" = "__HOST_GW__" ] && expr "__HOST_DNS__" : "127[.]" )
# FQDNs should be terminated by dots, but the guest DHCP client might omit them:
# strip them first
@ -49,8 +49,10 @@ check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCPv6: address
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global").local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR6__" = "__HOST_ADDR6__" ]
test DHCPv6: route


@ -16,14 +16,16 @@ htools ip jq sipcalc grep cut
test Interface name
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
guest ip link set dev __IFNAME__ up && sleep 2
guest ip link set dev __IFNAME__ up
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME__" ]
test SLAAC: prefix
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local] | .[0]'
gout PREFIX6 sipcalc __ADDR6__/64 | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global").local] | .[0]'
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .protocol == "kernel_ra") | .local + "/" + (.prefixlen | tostring)] | .[0]'
gout PREFIX6 sipcalc __ADDR6__ | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
hout HOST_PREFIX6 sipcalc __HOST_ADDR6__/64 | grep prefix | cut -d' ' -f4
check [ "__PREFIX6__" = "__HOST_PREFIX6__" ]
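
Note on the "Wait for DAD to complete" loops above: they replace a fixed `sleep 2` with polling, repeating while `jq -e` still finds tentative addresses. The loops in the change are unbounded. A bounded variant of the same idea could look like the sketch below, where the helper name wait_while and the retry limit are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not in the tree): re-run a command while it keeps
# succeeding, at most $1 times. Returns 0 once the command fails (the
# condition cleared) and 1 if the tries run out first.
wait_while() {
	__tries="${1}"; shift
	while "${@}"; do
		__tries=$((__tries - 1))
		[ "${__tries}" -gt 0 ] || return 1
		sleep 0.1 2>/dev/null || sleep 1
	done
}
```

A caller would wrap the condition in a function of its own, for example one that runs the `ip -j -6 addr show tentative | jq -e ...` pipeline, and pass that function name to wait_while.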

test/passt_in_ns/dhcp (new file)

@ -0,0 +1,75 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/passt/dhcp - Check DHCP and DHCPv6 functionality in passt mode
#
# Copyright (c) 2021 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
gtools ip jq dhclient sed tr
htools ip jq sed tr head
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME__" ]
test DHCP: address
guest /sbin/dhclient -4 __IFNAME__
gout ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR__" = "__HOST_ADDR__" ]
test DHCP: route
gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hout HOST_GW ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
check [ "__GW__" = "__HOST_GW__" ]
test DHCP: MTU
gout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu'
check [ __MTU__ = 65520 ]
test DHCP: DNS
gout DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | head -n3 | tr '\n' ',' | sed 's/,$//;s/$/\n/'
check [ "__DNS__" = "__HOST_DNS__" ] || ( [ "__DNS__" = "__MAP_NS4__" ] && expr "__HOST_DNS__" : "127[.]" )
# FQDNs should be terminated by dots, but the guest DHCP client might omit them:
# strip them first
test DHCP: search list
gout SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCPv6: address
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR6__" = "__HOST_ADDR6__" ]
test DHCPv6: route
gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
hout HOST_GW6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
check [ "__GW6__" = "__HOST_GW6__" ]
# Strip interface specifier: interface names might differ between host and guest
test DHCPv6: DNS
gout DNS6 sed -n 's/^nameserver \([^:]*:\)\([^%]*\).*/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_DNS6 sed -n 's/^nameserver \([^:]*:\)\([^%]*\).*/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
check [ "__DNS6__" = "__HOST_DNS6__" ] || [ "__DNS6__" = "__MAP_NS6__" -a "__HOST_DNS6__" = "::1" ]
test DHCPv6: search list
gout SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH6__" = "__HOST_SEARCH6__" ]


@ -15,6 +15,11 @@ gtools socat ip jq
htools socat ip jq
nstools socat ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set TEMP_BIG __STATEDIR__/test_big.bin
set TEMP_SMALL __STATEDIR__/test_small.bin
set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin
@ -27,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_big.bin /root/big.bin
test TCP/IPv4: host to ns: big transfer
test TCP/IPv4: host to ns (spliced): big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
@ -36,16 +41,15 @@ check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin
test TCP/IPv4: guest to host: big transfer
hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc
gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
sleep 1
guest socat -u OPEN:/root/big.bin TCP4:__GW__:10003
guest socat -u OPEN:/root/big.bin TCP4:__MAP_HOST4__:10003
hostw
check cmp __TEMP_BIG__ __BASEPATH__/big.bin
test TCP/IPv4: guest to ns: big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
guest socat -u OPEN:/root/big.bin TCP4:__GW__:10002
guest socat -u OPEN:/root/big.bin TCP4:__MAP_NS4__:10002
nsw
check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin
@ -59,7 +63,7 @@ check cmp __TEMP_BIG__ __BASEPATH__/big.bin
test TCP/IPv4: ns to host (via tap): big transfer
hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc
sleep 1
ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003
ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__MAP_HOST4__:10003
hostw
check cmp __TEMP_BIG__ __BASEPATH__/big.bin
@ -86,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_small.bin /root/small.bin
test TCP/IPv4: host to ns: small transfer
test TCP/IPv4: host to ns (spliced): small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
@ -95,16 +99,15 @@ check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin
test TCP/IPv4: guest to host: small transfer
hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc
gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
sleep 1
guest socat -u OPEN:/root/small.bin TCP4:__GW__:10003
guest socat -u OPEN:/root/small.bin TCP4:__MAP_HOST4__:10003
hostw
check cmp __TEMP_SMALL__ __BASEPATH__/small.bin
test TCP/IPv4: guest to ns: small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
guest socat -u OPEN:/root/small.bin TCP4:__GW__:10002
guest socat -u OPEN:/root/small.bin TCP4:__MAP_NS4__:10002
nsw
check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin
@ -118,7 +121,7 @@ check cmp __TEMP_SMALL__ __BASEPATH__/small.bin
test TCP/IPv4: ns to host (via tap): small transfer
hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc
sleep 1
ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003
ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__MAP_HOST4__:10003
hostw
check cmp __TEMP_SMALL__ __BASEPATH__/small.bin
@ -143,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001
guestw
guest cmp test_big.bin /root/big.bin
test TCP/IPv6: host to ns: big transfer
test TCP/IPv6: host to ns (spliced): big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
@ -152,17 +155,15 @@ check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin
test TCP/IPv6: guest to host: big transfer
hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc
gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
guest socat -u OPEN:/root/big.bin TCP6:[__GW6__%__IFNAME__]:10003
guest socat -u OPEN:/root/big.bin TCP6:[__MAP_HOST6__]:10003
hostw
check cmp __TEMP_BIG__ __BASEPATH__/big.bin
test TCP/IPv6: guest to ns: big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
guest socat -u OPEN:/root/big.bin TCP6:[__GW6__%__IFNAME__]:10002
guest socat -u OPEN:/root/big.bin TCP6:[__MAP_NS6__]:10002
nsw
check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin
@ -175,9 +176,8 @@ check cmp __TEMP_BIG__ __BASEPATH__/big.bin
test TCP/IPv6: ns to host (via tap): big transfer
hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003
ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__MAP_HOST6__]:10003
hostw
check cmp __TEMP_BIG__ __BASEPATH__/big.bin
@ -190,6 +190,7 @@ guest cmp test_big.bin /root/big.bin
test TCP/IPv6: ns to guest (using namespace address): big transfer
guestb socat -u TCP6-LISTEN:10001 OPEN:test_big.bin,create,trunc
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local'
sleep 1
ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__ADDR6__]:10001
@ -203,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001
guestw
guest cmp test_small.bin /root/small.bin
test TCP/IPv6: host to ns: small transfer
test TCP/IPv6: host to ns (spliced): small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002
@ -212,17 +213,15 @@ check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin
test TCP/IPv6: guest to host: small transfer
hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc
gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
guest socat -u OPEN:/root/small.bin TCP6:[__GW6__%__IFNAME__]:10003
guest socat -u OPEN:/root/small.bin TCP6:[__MAP_HOST6__]:10003
hostw
check cmp __TEMP_SMALL__ __BASEPATH__/small.bin
test TCP/IPv6: guest to ns: small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__
sleep 1
guest socat -u OPEN:/root/small.bin TCP6:[__GW6__%__IFNAME__]:10002
guest socat -u OPEN:/root/small.bin TCP6:[__MAP_NS6__]:10002
nsw
check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin
@ -235,9 +234,8 @@ check cmp __TEMP_SMALL__ __BASEPATH__/small.bin
test TCP/IPv6: ns to host (via tap): small transfer
hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
ns socat -u OPEN:__BASEPATH__/small.bin TCP6:[__GW6__%__IFNAME__]:10003
ns socat -u OPEN:__BASEPATH__/small.bin TCP6:[__MAP_HOST6__]:10003
hostw
check cmp __TEMP_SMALL__ __BASEPATH__/small.bin


@ -15,6 +15,11 @@ gtools socat ip jq
nstools socat ip jq
htools socat ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set TEMP __STATEDIR__/test.bin
set TEMP_NS __STATEDIR__/test_ns.bin
@ -25,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
test UDP/IPv4: host to ns
test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
@ -34,16 +39,15 @@ check cmp __TEMP_NS__ __BASEPATH__/medium.bin
test UDP/IPv4: guest to host
hostb socat -u UDP4-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc
gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
sleep 1
guest socat -u OPEN:/root/medium.bin UDP4:__GW__:10003,shut-null
guest socat -u OPEN:/root/medium.bin UDP4:__MAP_HOST4__:10003,shut-null
hostw
check cmp __TEMP__ __BASEPATH__/medium.bin
test UDP/IPv4: guest to ns
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
guest socat -u OPEN:/root/medium.bin UDP4:__GW__:10002,shut-null
guest socat -u OPEN:/root/medium.bin UDP4:__MAP_NS4__:10002,shut-null
nsw
check cmp __TEMP_NS__ __BASEPATH__/medium.bin
@ -57,7 +61,7 @@ check cmp __TEMP__ __BASEPATH__/medium.bin
test UDP/IPv4: ns to host (via tap)
hostb socat -u UDP4-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc
sleep 1
ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null
ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__MAP_HOST4__:10003,shut-null
hostw
check cmp __TEMP__ __BASEPATH__/medium.bin
@ -84,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
test UDP/IPv6: host to ns
test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
@ -93,17 +97,15 @@ check cmp __TEMP_NS__ __BASEPATH__/medium.bin
test UDP/IPv6: guest to host
hostb socat -u UDP6-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc
gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
guest socat -u OPEN:/root/medium.bin UDP6:[__GW6__%__IFNAME__]:10003,shut-null
guest socat -u OPEN:/root/medium.bin UDP6:[__MAP_HOST6__]:10003,shut-null
hostw
check cmp __TEMP__ __BASEPATH__/medium.bin
test UDP/IPv6: guest to ns
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
guest socat -u OPEN:/root/medium.bin UDP6:[__GW6__%__IFNAME__]:10002,shut-null
guest socat -u OPEN:/root/medium.bin UDP6:[__MAP_NS6__]:10002,shut-null
nsw
check cmp __TEMP_NS__ __BASEPATH__/medium.bin
@ -116,9 +118,8 @@ check cmp __TEMP__ __BASEPATH__/medium.bin
test UDP/IPv6: ns to host (via tap)
hostb socat -u UDP6-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
sleep 1
ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__GW6__%__IFNAME__]:10003,shut-null
ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__MAP_HOST6__]:10003,shut-null
hostw
check cmp __TEMP__ __BASEPATH__/medium.bin
@ -131,6 +132,7 @@ guest cmp test.bin /root/medium.bin
test UDP/IPv6: ns to guest (using namespace address)
guestb socat -u UDP6-LISTEN:10001,null-eof OPEN:test.bin,create,trunc
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local'
sleep 1
ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__ADDR6__]:10001,shut-null


@ -35,9 +35,11 @@ check [ __MTU__ = 65520 ]
test DHCPv6: address
ns /sbin/dhclient -6 --no-pid __IFNAME__
# Wait for DAD to complete
ns while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global").local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ __ADDR6__ = __HOST_ADDR6__ ]
test DHCPv6: route


@ -18,12 +18,13 @@ test Interface name
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
check [ -n "__IFNAME__" ]
ns ip link set dev __IFNAME__ up
sleep 2
# Wait for DAD to complete
ns while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
test SLAAC: prefix
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local] | .[0]'
nsout PREFIX6 sipcalc __ADDR6__/64 | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM ['.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global").local] | .[0]'
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .protocol == "kernel_ra") | .local + "/" + (.prefixlen | tostring)] | .[0]'
nsout PREFIX6 sipcalc __ADDR6__ | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM ['.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
hout HOST_PREFIX6 sipcalc __HOST_ADDR6__/64 | grep prefix | cut -d' ' -f4
check [ "__PREFIX6__" = "__HOST_PREFIX6__" ]


@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin
set TEMP_SMALL __STATEDIR__/test_small.bin
set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin
test TCP/IPv4: host to ns: big transfer
nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc
test TCP/IPv4: host to ns (spliced): big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
test TCP/IPv4: host to ns: small transfer
nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc
test TCP/IPv4: host to ns (spliced): small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__
@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/small.bin __TEMP_SMALL__
test TCP/IPv6: host to ns: big transfer
nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc
test TCP/IPv6: host to ns (spliced): big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
test TCP/IPv6: host to ns: small transfer
nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc
test TCP/IPv6: host to ns (spliced): small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__


@ -17,8 +17,8 @@ htools dd socat ip jq
set TEMP __STATEDIR__/test.bin
set TEMP_NS __STATEDIR__/test_ns.bin
test UDP/IPv4: host to ns
nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc
test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null
hostw
check cmp __BASEPATH__/medium.bin __TEMP__
test UDP/IPv6: host to ns
nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc
test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__


@ -19,7 +19,7 @@ sleep 1
endef
def flood_log_client
host tcp_crr --nolog -P 10001 -C 10002 -6 -c -H ::1
host tcp_crr --nolog -l1 -P 10001 -C 10002 -6 -c -H ::1
endef
def check_log_size_mountns
@ -33,19 +33,16 @@ test Log creation
set PORTS -t 10001,10002 -u 10001,10002
set LOG_FILE __STATEDIR__/pasta.log
passt ./pasta -l __LOG_FILE__
passtb exit
sleep 1
passt ./pasta -l __LOG_FILE__ -- /bin/true
check [ -s __LOG_FILE__ ]
test Log truncated on creation
passt ./pasta -l __LOG_FILE__
passtb exit
sleep 1
check [ $(cat __LOG_FILE__ | wc -l) -eq 1 ]
passt ./pasta -l __LOG_FILE__ -- /bin/true & wait
pout PID2 echo $!
check head -1 __LOG_FILE__ | grep '^pasta .* [(]__PID2__[)]$'
test Maximum log size
passtb ./pasta --config-net -d -f -l __LOG_FILE__ --log-size $((100 * 1024)) -- sh -c 'while true; do tcp_crr --nolog -P 10001 -C 10002 -6; done'
passtb ./pasta --config-net -d -f -l __LOG_FILE__ --log-size $((100 * 1024)) -- sh -c 'while true; do tcp_crr --nolog -l1 -P 10001 -C 10002 -6; done'
sleep 1
flood_log_client


@ -11,11 +11,16 @@
# Copyright (c) 2022 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
htools git make go bats catatonit ip jq socat
htools git make go bats ip jq socat ./test/podman/bin/podman
set PODMAN test/podman/bin/podman
hout WD pwd
test Podman pasta path
hout PASTA_BIN CONTAINERS_HELPER_BINARY_DIR="__WD__" __PODMAN__ info --format "{{.Host.Pasta.Executable}}"
check [ "__PASTA_BIN__" = "__WD__/pasta" ]
test Podman system test with bats
host git -C __STATEDIR__ clone https://github.com/containers/podman.git
host make -C __STATEDIR__/podman
hout WD pwd
host PODMAN="__STATEDIR__/podman/bin/podman" CONTAINERS_HELPER_BINARY_DIR="__WD__" bats __STATEDIR__/podman/test/system/505-networking-pasta.bats
host PODMAN="__PODMAN__" CONTAINERS_HELPER_BINARY_DIR="__WD__" bats test/podman/test/system/505-networking-pasta.bats

Some files were not shown because too many files have changed in this diff