Compare commits

...

101 commits

Author SHA1 Message Date
David Gibson
0588163b1f cppcheck: Don't check the system headers
We pass -I options to cppcheck so that it will find the system headers.
Then we need to pass a bunch more options to suppress the zillions of
cppcheck errors found in those headers.

It turns out, however, that it's not recommended to give the system headers
to cppcheck anyway.  Instead it has built-in knowledge of the ANSI libc and
uses that as the basis of its checks.  We do need to suppress
missingIncludeSystem warnings instead though.

Not bothering with the system headers makes the cppcheck runtime go from
~37s to ~14s on my machine, which is a pretty nice win.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:21 +01:00
David Gibson
14dd70e2b3 linux_dep: Fix CLOSE_RANGE_UNSHARE availability handling
If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of
close_range() which is a (successful) no-op.  This is broken in several
ways:
 * It doesn't actually fix compile if using old kernel headers, because
   the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE
   unprotected by ifdefs
 * Even if it did fix the compile, it means inconsistent behaviour between
   a compile time failure to find the value (we silently don't close files)
   and a runtime failure (we die with an error from close_range())
 * Silently not closing the files we intend to close for security reasons
   is probably not a good idea in any case

We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't
available, because that would require running on kernel >= 5.9.  On the
other hand there's not really any other way to flush all possible fds
leaked by the parent (close() in a loop takes over a minute).  So in this
case print a warning and carry on.

As bonus this fixes a cppcheck error I see with some different options I'm
looking to apply in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:17 +01:00
David Gibson
d64f257243 linux_dep: Move close_range() conditional handling to linux_dep.h
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions.  Move this to linux_dep.h which handles several
other similar cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:15 +01:00
David Gibson
b84cd05098 log: Only check for FALLOC_FL_COLLAPSE_RANGE availability at runtime
log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt
to use it if not defined.  But even if the value is defined at compile
time, it might not be available in the runtime kernel, so we need to check
for errors from a fallocate() call and fall back to other methods.

Simplify this to only need the runtime check by using linux_dep.h to define
FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:25:58 +01:00
Stefano Brivio
58fa5508bd tap, tcp, util: Add some missing SOCK_CLOEXEC flags
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:

/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1139 |         int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      |                                             ^
      |                                              | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1158 |                 ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
      |                                                                 ^
      |                                                                  | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1413 |         s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
      |                                                   ^
      |                                                    | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
  188 |         if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
      |                                             ^
      |                                              | SOCK_CLOEXEC

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:58 +01:00
Stefano Brivio
71869e2912 passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:

/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
  314 |         nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
      |                                                            ^

We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:52 +01:00
Stefano Brivio
d4f09c9b96 util: Define small and big thresholds for socket buffers as unsigned long long
On 32-bit architectures, clang-tidy reports:

/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~

because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:49 +01:00
Stefano Brivio
87940f9aa7 tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.

This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:

/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
 1087 |         for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
      |                     ~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:45 +01:00
Stefano Brivio
1feb90fe62 dhcpv6: Turn some option headers pointers to const
cppcheck 2.14.2 on Alpine reports:

dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer]
 struct opt_hdr *ia, *bad_ia, *client_id;
                               ^

It's not only 'client_id': we can declare 'ia' as const pointer too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:41 +01:00
Stefano Brivio
5f5e814cfc dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning
cppcheck 2.16.0 reports:

dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
 if (ia_type == OPT_IA_NA) {
             ^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
 ia_type = OPT_IA_NA;
           ^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
 if (ia_type == OPT_IA_NA) {
             ^

this is not really the case as we set ia_type to OPT_IA_TA and then
jump back.

Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:11 +01:00
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions sinmultaneously, with patch:
========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions sinmultaneously, without patch:
============================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
David Gibson
9a0e544f05 test: Improve test for NDP assigned prefix
In the NDP tests we search explicitly for a guest address with prefix
length 64.  AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means.  We can do
that more explicitly by checking for .protocol == "kernel_ra". however.

The SLAAC prefixes we assigned *will* always be 64-bit, that's hard-coded
into our NDP implementation.  RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64-bits.

Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:37 +01:00
David Gibson
910f4f9103 test: Don't require 64-bit prefixes in perf tests
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length.  There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:34 +01:00
David Gibson
1699083f29 test: Make nstool hold robust against interruptions to control clients
Currently nstool die()s on essentially any error.  In most cases that's
fine for our purposes.  However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket.  This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill of the namespace we're holding.

Adjust these to print an error, close() the control client socket and
carry on.  In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:30 +01:00
David Gibson
b456ee1b53 test: Rename propagating signal handler
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes.  The signal handler which
accomplishes this is called simply sig_handler().  However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:27 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
6f913b3af0 udp: Don't dereference uflow before NULL check in udp_reply_sock_handler()
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler().  However, we dereference uflow before that in
an initializer, rather defeating the point.  Rearrange to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:22 +01:00
David Gibson
d8e05a3fe0 ndp: Use const pointer for ndp_ns packet
We don't modify this structure at all.  For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:19 +01:00
David Gibson
0d7b8201ed linux_dep: Generalise tcp_info.h to handling Linux extension compatibility
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers.  There are several other cases where we can remove similar
compile time dependencies on kernel version.  Prepare for that by renaming
tcp_info.h to linux_dep.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:16 +01:00
David Gibson
c5f4e4d146 fwd: Squash different-signedness comparison warning
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral().  This is because
NUM_PORTS evaluates to an unsigned integer.  It's a fixed value, though
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:14 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except.. we don't have any users for it any more, so remove it.

make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
1d7cff3779 clang: Add rudimentary clangd configuration
clangd's default configuration seems to try to treat .h files as C++ not
C.  There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:07 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
David Gibson
13fc6d511e Makefile: Use -DARCH for qrap only
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile.  However, this is only used in qrap.c, not anywhere else in
passt or pasta.  Only supply this -D when compiling qrap specifically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:59 +01:00
David Gibson
7917159005 seccomp: Simplify handling of AUDIT_ARCH
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D.  The only place that uses it, though is the
BPF filter generated by seccomp.sh.  seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define.  Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:55 +01:00
David Gibson
93bce404c1 Makefile: Move NETNS_RUN_DIR definition to C code
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D.  But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile.  Just move it to a plain #define in conf.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:52 +01:00
David Gibson
c938d8a93e netlink: RTA_PAYLOAD() returns int, not size_t
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t.  However, it doesn't - it explicitly casts
its result to an int.  RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter compares it to an int, so using size_t can
result in comparison of different-signed integer warnings from clang.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:48 +01:00
David Gibson
f6b546c6e4 flow: Correct type of flowside_at_sidx()
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case.  PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it.  This removes a clang warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:44 +01:00
David Gibson
30b4f88167 arch: Avoid explicit access to 'environ'
We pass 'environ' to execve() in arch_avc2_exec(), so that we retain the
environment in the current process.  But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration.  But, we can avoid needing to reference it
explicitly by using execv() instead of execve().  This removes a clang
warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:29 +01:00
David Gibson
b78e72da0b clang: Move clang-tidy configuration from Makefile to .clang-tidy
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable).  Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well.  As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:19 +01:00
David Gibson
8346216c9a Makefile: Simplify exclusion of qrap from static checks
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing.  So, we currently exclude it using $(filter-out).  However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.

Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:07 +01:00
David Gibson
8f1b6a0ca6 clang: Add .clang-format file
I've been experimenting with clangd, but its default format style is
horrid.  Since our style is basically that of the Linux kernel, copy the
.clang-format from the kernel, minus reference to a bunch of kernel
specific macros.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:45:16 +01:00
David Gibson
5e93bcd8bf test: Adjust misplaced sleeps in two_guests code
Most of our transfer tests using socat use 'sleep' waaiting for the server
side to be ready before starting the client.  However in two_guests/basic
the sleep is in the wrong place: rather than being between starting the
server and starting the client, it's after waiting for the server to
complete.  This causes occasional hangs when the client runs before the
server is ready - in that case the receiving guest sends an RST, which we
don't (currently) propagate back to the sender.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-05 23:46:38 +01:00
Stefano Brivio
9afce0b45c tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:

tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
 1291 |         rc = ioctl(fd, TUNSETIFF, &ifr);
      |                        ^~~~~~~~~

We don't care about that overflow, so explicitly cast TUNSETIFF to
int.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:33 +01:00
Stefano Brivio
d165d36a0c tcp: Fix build against musl, __sum16 comes from linux/types.h
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).

Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:24 +01:00
Stefano Brivio
ee7d0b62a7 util: Don't use errno after a successful call in __daemon()
I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:

/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
  446 |         if (pid == -1) {
      |             ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
  446 |         if (pid == -1) {
      |         ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
  451 |         if (pid) {
      |             ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
  451 |         if (pid) {
      |         ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
  463 |         close(devnull_fd);
      |         ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~

And the LLVM documentation for the unix.Errno checker, 1.1.8.3
unix.Errno (C), mentions, at:

  https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno

that:

  The C and POSIX standards often do not define if a standard library
  function may change value of errno if the call does not fail.
  Therefore, errno should only be used if it is known from the return
  value of a function that the call has failed.

which is, somewhat surprisingly, the case for close().

Instead of using errno, check the actual return values of the calls
we issue here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
b1a607fba1 udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx
/home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors]
  171 | enum udp_iov_idx {
      | ^
  172 |         UDP_IOV_TAP     = 0,
  173 |         UDP_IOV_ETH     = 1,
  174 |         UDP_IOV_IP      = 2,
  175 |         UDP_IOV_PAYLOAD = 3,
  176 |         UDP_NUM_IOVS
      |
      |                      = 4

Don't initialise any value, so that it's obvious that constants map to
unique values.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
099ace64ce treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.

As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.

For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
59fe34ee36 treewide: Suppress clang-tidy warning if we already use O_CLOEXEC
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.

In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).

The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.

Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
134b4d58b4 Makefile: Disable readability-math-missing-parentheses clang-tidy check
With clang-tidy and LLVM 19:

/home/sbrivio/passt/conf.c:1218:29: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
 1218 |                 const char *octet = str + 3 * i;
      |                                           ^~~~~~
      |                                           (    )
/home/sbrivio/passt/ndp.c:285:18: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  285 |                                         .len            = 1 + 2 * n,
      |                                                               ^~~~~~
      |                                                               (    )
/home/sbrivio/passt/ndp.c:329:23: error: '%' has higher precedence than '-'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  329 |                         memset(ptr, 0, 8 - dns_s_len % 8);      /* padding */
      |                                            ^~~~~~~~~~~~~~
      |                                            (            )
/home/sbrivio/passt/pcap.c:131:20: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  131 |                 pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
      |                                  ^~~~~~~~~~~~~~~~
      |                                  (              )
/home/sbrivio/passt/util.c:216:10: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  216 |                 return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                            )
/home/sbrivio/passt/util.c:217:10: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  217 |                        (a->tv_sec - b->tv_sec - 1) * 1000000;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                    )
/home/sbrivio/passt/util.c:220:9: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  220 |         return (a->tv_nsec - b->tv_nsec) / 1000 +
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                               )
/home/sbrivio/passt/util.c:221:9: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  221 |                (a->tv_sec - b->tv_sec) * 1000000;
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                                )
/home/sbrivio/passt/util.c:545:32: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  545 |         return clone(fn, stack_area + stack_size / 2, flags, arg);
      |                                       ^~~~~~~~~~~~~~~
      |                                       (             )

Just... no.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
744247856d treewide: Silence cert-err33-c clang-tidy warnings for fprintf()
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.

Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
98efe7c2fd treewide: Comply with CERT C rule ERR33-C for snprintf()
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):

/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
   43 |                 snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  577 |                         snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  579 |                                 snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  580 |                                          pidval);
      |                                          ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  105 |         snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  242 |         snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  243 |         snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
 1155 |                         snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning

Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:25 +01:00
Stefano Brivio
988a4d75f8 Makefile: Exclude qrap.c from clang-tidy checks
We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as
of LLVM versions 16 and later would need a bunch of changes there to
be addressed, mostly around CERT C rule ERR33-C and checking return
code from snprintf().

It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 08:21:19 +01:00
Jon Maloy
ba38e67cf4 tcp: unify l2 TCPv4 and TCPv6 queues and structures
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.

Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.

Tests with iperf3 shows no measurable change in performance after this
change.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:44:08 +01:00
Jon Maloy
2053c36dec tcp: set ip and eth headers in l2 tap queues on the fly
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through its whole life time. The only
fields we need to update per message are things like payload size
and checksums.

If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.

The same needs to be done for the ethernet header.

We do these changes here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:43:24 +01:00
Laurent Vivier
5563d5f668 test: remove obsolete images
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:30:06 +02:00
Laurent Vivier
f43f7d5e89 tcp: cleanup tcp_buf_data_from_sock()
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can user directly len.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:51 +02:00
David Gibson
e7fcd0c348 tcp: Use runtime tests for TCP_INFO fields
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in structure returned by the runtime kernel.  We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.

However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel.  We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd.  Switch from the compile time flags to a runtime
test.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:46 +02:00
David Gibson
81143813a6 tcp: Generalise probing for tcpi_snd_wnd field
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).

In fact we should check that for for every tcp_info field we want to use,
beyond the very old ones shared with BSD.  Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure.  We can then use that recorded
value to check for the presence of any field we need.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:27:17 +02:00
David Gibson
13f0291ede tcp: Remove compile-time dependency on struct tcp_info version
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info.  These defines are used for two
purposes, neither of which they accomplish well:

1) Determining if the tcp_info fields are available at runtime.  For this
   purpose the defines are Just Plain Wrong, since the runtime kernel may
   not be the same as the compile time kernel. We corrected this for
   tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt

2) Allowing the source to compile against older kernel headers which don't
   have the fields in question.  This works in theory, but it does mean
   we won't be able to use the fields, even if later run against a
   newer kernel.  Furthermore, it's quite fragile: without much more
   thorough tests of builds in different environments that we're currently
   set up for, it's very easy to miss cases where we're accessing a field
   without protection from an #ifdef.  For example we currently access
   tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().

Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems).  Don't compile against linux/tcp.h, using
netinet/tcp.h instead.  Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in.  That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.

This allows us to remove the HAS_SND_WND define entirely.  We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:26:48 +02:00
Stefano Brivio
9e4615b40b tcp_splice: fcntl(2) returns the size of the pipe, if F_SETPIPE_SZ succeeds
Don't report bogus failures (with --trace) just because the return
value is not zero.

Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:30 +02:00
Stefano Brivio
149f457b23 tcp_splice: splice() all we have to the writing side, not what we just read
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97547 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.

Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:18 +02:00
David Gibson
9e5df350d6 tcp: Use structures to construct initial TCP options
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte.  However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).

Rework this to use a structure and initialisers which make it a bit
clearer what's going on.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by; Stefano Brivio <sbrivio@redhat.com>
2024-10-21 18:51:04 +02:00
David Gibson
b4dace8f46 fwd: Direct inbound spliced forwards to the guest's external address
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing.  This gives us a very large performance improvement when it's
possible.

Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address.  However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
forwarding exposes those services to the host, which isn't generally what
we want.

Correct this by instead forwarding inbound "splice" flows to the guest's
external address.

Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:03 +02:00
David Gibson
58e6d68599 test: Clarify test for spliced inbound transfers
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1.  This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace.  That's not an inherent property for "spliced" connections
and arguably an undesirable one.  Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.

Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.

Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:00 +02:00
David Gibson
1fa421192c passt.1: Clarify and update "Handling of local addresses" section
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour.  Update it accordingly.

It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host.  Use "loopback
address" to be more precise about this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:57 +02:00
David Gibson
ef8a5161d0 passt.1: Mark --stderr as deprecated more prominently
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label.  Add
one to make it easier to spot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:54 +02:00
David Gibson
53176ca91d test: Wait for DAD on DHCPv6 addresses
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable.  That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one), however it's not true with the Debian
dhclient-script.  The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes.  That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.

We usually get away with this, because by the time we do anything where the
address matters, DAD has completed.  However, it's not robust, so we should
explicitly wait for DAD to complete when we get an DHCPv6 address.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:51 +02:00
David Gibson
75b9c0feb0 test: Explicitly wait for DAD to complete on SLAAC addresses
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready.  In several places we have an explicit 'sleep 2' to wait for that
to complete.

Fixed length delays are never a great idea, although this one is pretty
solid.  Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.

Replace the fixed sleeps with a loop waiting for DAD to complete.  We do
this by looping waiting for all tentative addresses to disappear.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:47 +02:00
David Gibson
f9d677bff6 arp: Fix a handful of small warts
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
 * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
   example.
 * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
   sizeof(am->tip) (same value, but makes more logical sense)
 * Described the guest's assigned address as such, rather than as "our
   address" - that's not usually what we mean by "our address" these days
 * Remove "we might have the same IP address" comment which I can't make
   sense of in context (possibly it's relating to the statement below,
   which already has its own comment?)

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:04 +02:00
Stefano Brivio
2d7f734c45 tcp: Send "empty" handshake ACK before first data segment
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.

We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.

However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.

This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude it's
actually causing problems.

Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-15 20:34:26 +02:00
Stefano Brivio
7612cb80fe test: Pass TRACE from run_term() into ./run from_term
Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests
with TRACE=1 will not actually enable tracing output.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:19 +02:00
Stefano Brivio
b40880c157 test/lib/term: Always use printf for messages with escape sequences
...instead of echo: otherwise, bash won't handle escape sequences we
use to colour messages (and 'echo -e' is not specified by POSIX).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:00 +02:00
David Gibson
ff63ac922a conf: Add --dns-host option to configure host side nameserver
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to.  This is
controlled by the c->ip[46].dns_host variables.  This is set to the first
first nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.

Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.

Add a new --dns-host option to restore this functionality in a more
sensible way.

Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
David Gibson
9d66df9a9a conf: Add command line switch to enable IP_FREEBIND socket option
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future.  That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place.  We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.

Add a --freebind command line argument to enable this socket option on
all listening sockets.

Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
Laurent Vivier
151dbe0d3d udp: Update UDP checksum using an iovec array
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:13 +02:00
Laurent Vivier
3d484aa370 tcp: Update TCP checksum using an iovec array
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.

In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We provide also an offset to be able to provide an array that contains
the TCP frame embedded in an lower level frame, and this offset points
to the TCP header inside the iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:10 +02:00
Laurent Vivier
e6548c6437 checksum: Add an offset argument in csum_iov()
The offset allows any headers that are not part of the data
to checksum to be skipped.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:08 +02:00
Laurent Vivier
fd8334b25d pcap: Add an offset argument in pcap_iov()
The offset is passed directly to pcap_frame() and allows
any headers that are not part of the frame to
capture to be skipped.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:02 +02:00
Laurent Vivier
72e7d3024b tcp: Use tcp_payload_t rather than tcphdr
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).

Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:50:46 +02:00
Stefano Brivio
def8acdcd8 test: Kernel binary can now be passed via the KERNEL environmental variable
This is quite useful at least for myself as I'm usually running tests
using a guest kernel that's not the same as the one on the host.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-02 14:50:34 +02:00
David Gibson
b55013b1a7 inany: Add inany_pton() helper
We already have an inany_ntop() function to format inany addresses into
text.  Add inany_pton() to parse them from text, and use it in
conf_ports().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:17 +02:00
David Gibson
cbde4192ee tcp, udp: Make {tcp,udp}_sock_init() take an inany address
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair.  Use an inany instead.  Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:16 +02:00
David Gibson
b8d4fac6a2 util, pif: Replace sock_l4() with pif_sock_l4()
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.

Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress.  This case doesn't yet arise, but
it might cause us trouble in future.

Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic.  This only makes sense if there is no specific
address given.  We verify this at runtime, but it would be nicer if we
could enforce it structurally.

For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems.  Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.

The new function takes the address as an inany *, with NULL indicating the
dual stack case.  This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
David Gibson
204e77cd11 udp: Don't attempt to get dual-stack sockets in nonsensical cases
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible.   However udp_sock_init()
attempts to do this in some cases that can't work.  Specifically we can
only do this when listening on any address.  That's never true for the
ns (splicing) case, because we always listen on loopback.  For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.

This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path.  But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
Laurent Vivier
8f8c4d27eb tcp: Allow checksum to be disabled
We can need not to set TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:28 +02:00
Laurent Vivier
4fe5f4e813 udp: Allow checksum to be disabled
We can need not to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:20 +02:00
David Gibson
d836d9e345 util: Remove possible quadratic behaviour from write_remainder()
write_remainder() steps through the buffers in an IO vector writing out
everything past a certain byte offset.  However, on each iteration it
rescans the buffer from the beginning to find out where we're up to.  With
an unfortunate set of write sizes this could lead to quadratic behaviour.

In an even less likely set of circumstances (total vector length > maximum
size_t) the 'skip' variable could overflow.  This is one factor in a
longstanding Coverity error we've seen (although I still can't figure out
the remainder of its complaint).

Rework write_remainder() to always work out our new position in the vector
relative to our old/current position, rather than starting from the
beginning each time.  As a bonus this seems to fix the Coverity error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:03 +02:00
David Gibson
bfc294b90d util: Add helper to write() all of a buffer
write(2) might not write all the data it is given.  Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.

Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:59 +02:00
David Gibson
bb41901c71 tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:55 +02:00
David Gibson
265b2099c7 tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
This function has a block conditional on !snd_wnd_cap shortly before an
snd_wnd_cap is statically false).

Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:50 +02:00
David Gibson
4aff6f9392 tcp: Clean up tcpi_snd_wnd probing
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is perfectly possible window the peer could report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retreived.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:47 +02:00
David Gibson
7d8804beb8 tcp: Make some extra functions private
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header.  Make them static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:33 +02:00
David Gibson
5ff5d55291 tcp: Avoid overlapping memcpy() in DUP_ACK handling
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame.  However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy().  This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:59 +02:00
David Gibson
1f414ed8f0 tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:46 +02:00
Stefano Brivio
6b38f07239 apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
...for both passt and pasta: use passt's abstraction for this.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
Stefano Brivio
116bc8266d selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
Since commit eedc81b6ef ("fwd, conf: Probe host's ephemeral ports"),
we might need to read from /proc/sys/net/ipv4/ip_local_port_range in
both passt and pasta.

While pasta was already allowed to open and write /proc/sys/net
entries, read access was missing in SELinux's type enforcement: add
that.

In passt, instead, this is the first time we need to access an entry
there: add everything we need.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
David Gibson
a33ecafbd9 tap: Don't risk truncating frames on full buffer in tap_pasta_input()
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full.  However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer.  If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:46 +02:00
David Gibson
d2a1dc744b tap: Restructure in tap_pasta_input()
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around.  This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass.  This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.

Along the way handle a couple of extra edge cases:
 - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
   ports where those might not have the same value
 - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:43 +02:00
David Gibson
11e29054fe tap: Improve handling of EINTR in tap_passt_input()
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
three cases it returns from the function.  That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again.  For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:41 +02:00
David Gibson
49fc4e0414 tap: Split out handling of EPOLLIN events
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:37 +02:00
Stefano Brivio
63513e54f3 util: Fix order of operands and carry of one second in timespec_diff_us()
If the nanoseconds of the minuend timestamp are less than the
nanoseconds of the subtrahend timestamp, we need to carry one second
in the subtraction.

I subtracted this second from the minuend, but didn't actually carry
it in the subtraction of nanoseconds, and logged timestamps would jump
back whenever we switched to the first branch of timespec_diff_us()
from the second one.

Most likely, the reason why I didn't carry the second is that I
instinctively thought that swapping the operands would have the same
effect. But it doesn't, in general: that only happens with arithmetic
in modulo powers of 2. Undo the swap as well.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 13:01:34 +02:00
David Gibson
748ef4cd6e cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.

This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.

That said, in the two places this shows up, it's pretty easy to work
around.  The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.

So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:20 +02:00
Stefano Brivio
afedc2412e tcp: Use EPOLLET for any state of not established connections
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).

While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.

On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.

At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.

Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.

In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the conection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.

Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:16 +02:00
David Gibson
aff5a49b0e udp: Handle more error conditions in udp_sock_errs()
udp_sock_errs() reads out everything in the socket error queue.  However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.

One possibility is that the error is reported instead by the SO_ERROR
sockopt.  Check for that case and report it as best we can.  If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.

[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010

Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:38 +02:00
David Gibson
bd99f02a64 udp: Treat errors getting errors as unrecoverable
We can get network errors, usually transient, reported via the socket error
queue.  However, at least theoretically, we could get errors trying to
read the queue itself.  Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:35 +02:00
David Gibson
bd092ca421 udp: Split socket error handling out from udp_sock_recv()
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding.  That made sense initially, since
both listening and reply sockets need to do this.  However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.

So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors".  We don't
have any of these so far, but some cases we're working on might require it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:33 +02:00
David Gibson
88bfa3801e flow: Helpers to log details of a flow
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging.  We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions.  Add some
helper functions and macros to make it easy to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:30 +02:00
David Gibson
1166401c2f udp: Allow UDP flows to be prematurely closed
Unlike TCP, UDP has no in-band signalling for the end of a flow.  So the
only way we remove flows is on a timer if they have no activity for 180s.

However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow.  We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash).  However it will still wait 3 minutes
to clear away the stale entry.

Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up.  We also publish udp_flow_close() so it
can be called from other places to abort UDP flows().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:24 +02:00
David Gibson
7ad9f9bd2b flow: Fix incorrect hash probe in flowside_lookup()
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions.  Usually
that's implemented by flow_hash_probe().  However, due to some details we
need a second implementation in flowside_lookup().  An embarrassing
oversight in rebasing from earlier versions has mean that version is
incorrect, trying to step forward through clusters rather than backward.

In situations with the right sorts of has near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state.  If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.

Fixes: acca4235c4 ("flow, tcp: Generalise TCP hash table to general flow hash table")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:52:31 +02:00
Stefano Brivio
0ea60e5a77 log: Don't prefix log file messages with time and severity if they're continuations
In fecb1b65b1 ("log: Don't prefix message with timestamp on --debug
if it's a continuation"), I fixed this for --debug on standard error,
but not for log files: if messages are continuations, they shouldn't
be prefixed by timestamp and severity.

Otherwise, we'll print stuff like this:

  0.0028: ERROR:   Receive error on guest connection, reset0.0028:  ERROR:   : Bad file descriptor

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 12:52:21 +02:00
64 changed files with 1767 additions and 1123 deletions

126
.clang-format Normal file
View file

@ -0,0 +1,126 @@
# SPDX-License-Identifier: GPL-2.0
#
# clang-format configuration file. Intended for clang-format >= 11.
#
# For more information, see:
#
# Documentation/dev-tools/clang-format.rst
# https://clang.llvm.org/docs/ClangFormat.html
# https://clang.llvm.org/docs/ClangFormatStyleOptions.html
#
---
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
AlignEscapedNewlines: Left
AlignOperands: true
AlignTrailingComments: false
AllowAllParametersOfDeclarationOnNextLine: false
AllowShortBlocksOnASingleLine: false
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: None
AllowShortIfStatementsOnASingleLine: false
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: false
BinPackArguments: true
BinPackParameters: true
BraceWrapping:
AfterClass: false
AfterControlStatement: false
AfterEnum: false
AfterFunction: true
AfterNamespace: true
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
AfterExternBlock: false
BeforeCatch: false
BeforeElse: false
IndentBraces: false
SplitEmptyFunction: true
SplitEmptyRecord: true
SplitEmptyNamespace: true
BreakBeforeBinaryOperators: None
BreakBeforeBraces: Custom
BreakBeforeInheritanceComma: false
BreakBeforeTernaryOperators: false
BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeComma
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: false
ColumnLimit: 80
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: false
ConstructorInitializerIndentWidth: 8
ContinuationIndentWidth: 8
Cpp11BracedListStyle: false
DerivePointerAlignment: false
DisableFormat: false
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: false
# Taken from:
# git grep -h '^#define [^[:space:]]*for_each[^[:space:]]*(' include/ tools/ \
# | sed "s,^#define \([^[:space:]]*for_each[^[:space:]]*\)(.*$, - '\1'," \
# | LC_ALL=C sort -u
ForEachMacros:
- 'for_each_nst'
IncludeBlocks: Preserve
IncludeCategories:
- Regex: '.*'
Priority: 1
IncludeIsMainRegex: '(Test)?$'
IndentCaseLabels: false
IndentGotoLabels: false
IndentPPDirectives: None
IndentWidth: 8
IndentWrappedFunctionNames: false
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Auto
ObjCBlockIndentWidth: 8
ObjCSpaceAfterProperty: true
ObjCSpaceBeforeProtocolList: true
# Taken from git's rules
PenaltyBreakAssignment: 10
PenaltyBreakBeforeFirstCallParameter: 30
PenaltyBreakComment: 10
PenaltyBreakFirstLessLess: 0
PenaltyBreakString: 10
PenaltyExcessCharacter: 100
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Right
ReflowComments: false
SortIncludes: false
SortUsingDeclarations: false
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatementsExceptForEachMacros
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: false
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp03
TabWidth: 8
UseTab: Always
...

93
.clang-tidy Normal file
View file

@ -0,0 +1,93 @@
---
Checks:
- "clang-diagnostic-*,clang-analyzer-*,*,-modernize-*"
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
- "-clang-analyzer-valist.Uninitialized"
# Dubious value, would kill readability
- "-cppcoreguidelines-init-variables"
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
- "-bugprone-assignment-in-if-condition"
# Debatable whether these improve readability, right now it would look
# like a mess
- "-google-readability-braces-around-statements"
- "-hicpp-braces-around-statements"
- "-readability-braces-around-statements"
# TODO: in most cases they are justified, but probably not everywhere
#
- "-readability-magic-numbers"
- "-cppcoreguidelines-avoid-magic-numbers"
# TODO: this is Linux-only for the moment, nice to fix eventually
- "-llvmlibc-restrict-system-libc-headers"
# Those are needed for syscalls, epoll_wait flags, etc.
- "-hicpp-signed-bitwise"
# Probably not doable to impement this without plain memcpy(), memset()
- "-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling"
# TODO: not really important, but nice to fix eventually
- "-llvm-include-order"
# Dubious value, would kill readability
- "-readability-isolate-declaration"
# TODO: nice to fix eventually
- "-bugprone-narrowing-conversions"
- "-cppcoreguidelines-narrowing-conversions"
# TODO: check, fix, and more in general constify wherever possible
- "-cppcoreguidelines-avoid-non-const-global-variables"
# TODO: check paths where it might make sense to improve performance
- "-altera-unroll-loops"
- "-altera-id-dependent-backward-branch"
# Not much can be done about them other than being careful
- "-bugprone-easily-swappable-parameters"
# TODO: split reported functions
- "-readability-function-cognitive-complexity"
# "Poor" alignment needed for structs reflecting message formats/headers
- "-altera-struct-pack-align"
# TODO: check again if multithreading is implemented
- "-concurrency-mt-unsafe"
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
- "-readability-identifier-length"
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
- "-misc-include-cleaner"
# Want to replace all #defines of integers with enums. Kind of
# makes sense when those defines form an enum-like set, but
# weird for cases like standalone constants, and causes other
# awkwardness for a bunch of cases we use
- "-cppcoreguidelines-macro-to-enum"
# It's been a couple of centuries since multiplication has been granted
# precedence over addition in modern mathematical notation. Adding
# parentheses to reinforce that certainly won't improve readability.
- "-readability-math-missing-parentheses"
WarningsAsErrors: "*"
HeaderFileExtensions:
- h
ImplementationFileExtensions:
- c
HeaderFilterRegex: ""
FormatStyle: none
CheckOptions:
bugprone-suspicious-string-compare.WarnOnImplicitComparison: "false"
SystemHeaders: false

3
.clangd Normal file
View file

@ -0,0 +1,3 @@
CompileFlags:
# Don't try to interpret our headers as C++'
Add: [-xc, -Wall]

161
Makefile
View file

@ -15,24 +15,11 @@ VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ versio
# the IPv6 socket API? (Linux does)
DUAL_STACK_SOCKETS := 1
RLIMIT_STACK_VAL := $(shell /bin/sh -c 'ulimit -s')
ifeq ($(RLIMIT_STACK_VAL),unlimited)
RLIMIT_STACK_VAL := 1024
endif
TARGET ?= $(shell $(CC) -dumpmachine)
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(shell echo $(TARGET) | cut -f1 -d- | tr [A-Z] [a-z])
TARGET_ARCH := $(shell echo $(TARGET_ARCH) | sed 's/powerpc/ppc/')
AUDIT_ARCH := $(shell echo $(TARGET_ARCH) | tr [a-z] [A-Z] | sed 's/^ARM.*/ARM/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/I[456]86/I386/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPC64/PPC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPCLE/PPC64LE/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/MIPS64EL/MIPSEL64/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/HPPA/PARISC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/SH4/SH/')
# On some systems enabling optimization also enables source fortification,
# automagically. Do not override it.
FORTIFY_FLAG :=
@ -44,10 +31,6 @@ FLAGS := -Wall -Wextra -Wno-format-zero-length
FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
FLAGS += $(FORTIFY_FLAG) -O2 -pie -fPIE
FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
FLAGS += -DNETNS_RUN_DIR=\"/run/netns\"
FLAGS += -DPASST_AUDIT_ARCH=AUDIT_ARCH_$(AUDIT_ARCH)
FLAGS += -DRLIMIT_STACK_VAL=$(RLIMIT_STACK_VAL)
FLAGS += -DARCH=\"$(TARGET_ARCH)\"
FLAGS += -DVERSION=\"$(VERSION)\"
FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
@ -67,21 +50,6 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
udp.h udp_flow.h util.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_SND_WND
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_bytes_acked = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_BYTES_ACKED
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_min_rtt = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_MIN_RTT
endif
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_GETRANDOM
@ -91,11 +59,6 @@ ifeq ($(shell :|$(CC) -fstack-protector-strong -S -xc - -o - >/dev/null 2>&1; ec
FLAGS += -fstack-protector-strong
endif
C := \#define _GNU_SOURCE\n\#include <fcntl.h>\nint x = FALLOC_FL_COLLAPSE_RANGE;
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
EXTRA_SYSCALLS += fallocate
endif
prefix ?= /usr/local
exec_prefix ?= $(prefix)
bindir ?= $(exec_prefix)/bin
@ -132,7 +95,7 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
ln -sf $< $@
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(QRAP_SRCS) -o qrap $(LDFLAGS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
@ -196,116 +159,11 @@ docs: README.md
done < README.md; \
) > README.plain.md
# Checkers currently disabled for clang-tidy:
# - llvmlibc-restrict-system-libc-headers
# TODO: this is Linux-only for the moment, nice to fix eventually
#
# - google-readability-braces-around-statements
# - hicpp-braces-around-statements
# - readability-braces-around-statements
# Debatable whether that improves readability, right now it would look
# like a mess
#
# - readability-magic-numbers
# - cppcoreguidelines-avoid-magic-numbers
# TODO: in most cases they are justified, but probably not everywhere
#
# - clang-analyzer-valist.Uninitialized
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
#
# - clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling
# Probably not doable to impement this without plain memcpy(), memset()
#
# - cppcoreguidelines-init-variables
# Dubious value, would kill readability
#
# - hicpp-signed-bitwise
# Those are needed for syscalls, epoll_wait flags, etc.
#
# - llvm-include-order
# TODO: not really important, but nice to fix eventually
#
# - readability-isolate-declaration
# Dubious value, would kill readability
#
# - bugprone-narrowing-conversions
# - cppcoreguidelines-narrowing-conversions
# TODO: nice to fix eventually
#
# - cppcoreguidelines-avoid-non-const-global-variables
# TODO: check, fix, and more in general constify wherever possible
#
# - altera-unroll-loops
# - altera-id-dependent-backward-branch
# TODO: check paths where it might make sense to improve performance
#
# - bugprone-easily-swappable-parameters
# Not much can be done about them other than being careful
#
# - readability-function-cognitive-complexity
# TODO: split reported functions
#
# - altera-struct-pack-align
# "Poor" alignment needed for structs reflecting message formats/headers
#
# - concurrency-mt-unsafe
# TODO: check again if multithreading is implemented
#
# - readability-identifier-length
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
#
# - bugprone-assignment-in-if-condition
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
#
# - misc-include-cleaner
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
#
# - cppcoreguidelines-macro-to-enum
# Want to replace all #defines of integers with enums. Kind of
# makes sense when those defines form an enum-like set, but
# weird for cases like standalone constants, and causes other
# awkwardness for a bunch of cases we use
clang-tidy: $(PASST_SRCS) $(HEADERS)
clang-tidy $(PASST_SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
-DCLANG_TIDY_58992
clang-tidy: $(SRCS) $(HEADERS)
clang-tidy -checks=*,-modernize-*,\
-clang-analyzer-valist.Uninitialized,\
-cppcoreguidelines-init-variables,\
-bugprone-assignment-in-if-condition,\
-google-readability-braces-around-statements,\
-hicpp-braces-around-statements,\
-readability-braces-around-statements,\
-readability-magic-numbers,\
-llvmlibc-restrict-system-libc-headers,\
-hicpp-signed-bitwise,\
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,\
-llvm-include-order,\
-cppcoreguidelines-avoid-magic-numbers,\
-readability-isolate-declaration,\
-bugprone-narrowing-conversions,\
-cppcoreguidelines-narrowing-conversions,\
-cppcoreguidelines-avoid-non-const-global-variables,\
-altera-unroll-loops,-altera-id-dependent-backward-branch,\
-bugprone-easily-swappable-parameters,\
-readability-function-cognitive-complexity,\
-altera-struct-pack-align,\
-concurrency-mt-unsafe,\
-readability-identifier-length,\
-misc-include-cleaner,\
-cppcoreguidelines-macro-to-enum \
-config='{CheckOptions: [{key: bugprone-suspicious-string-compare.WarnOnImplicitComparison, value: "false"}]}' \
--warnings-as-errors=* $(SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -DCLANG_TIDY_58992
SYSTEM_INCLUDES := /usr/include $(wildcard /usr/include/$(TARGET))
ifeq ($(shell $(CC) -v 2>&1 | grep -c "gcc version"),1)
VER := $(shell $(CC) -dumpversion)
SYSTEM_INCLUDES += /usr/lib/gcc/$(TARGET)/$(VER)/include
endif
cppcheck: $(SRCS) $(HEADERS)
cppcheck: $(PASST_SRCS) $(HEADERS)
if cppcheck --check-level=exhaustive /dev/null > /dev/null 2>&1; then \
CPPCHECK_EXHAUSTIVE="--check-level=exhaustive"; \
else \
@ -314,11 +172,8 @@ cppcheck: $(SRCS) $(HEADERS)
cppcheck --std=c11 --error-exitcode=1 --enable=all --force \
--inconclusive --library=posix --quiet \
$${CPPCHECK_EXHAUSTIVE} \
$(SYSTEM_INCLUDES:%=-I%) \
$(SYSTEM_INCLUDES:%=--config-exclude=%) \
$(SYSTEM_INCLUDES:%=--suppress=*:%/*) \
$(SYSTEM_INCLUDES:%=--suppress=unmatchedSuppression:%/*) \
--inline-suppr \
--suppress=missingIncludeSystem \
--suppress=unusedStructMember \
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
$(SRCS) $(HEADERS)
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -D CPPCHECK_6936 \
$(PASST_SRCS) $(HEADERS)

8
arch.c
View file

@ -19,6 +19,7 @@
#include <unistd.h>
#include "log.h"
#include "util.h"
/**
* arch_avx2_exec() - Switch to AVX2 build if supported
@ -40,8 +41,11 @@ void arch_avx2_exec(char **argv)
if (__builtin_cpu_supports("avx2")) {
char new_path[PATH_MAX + sizeof(".avx2")];
snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
execve(new_path, argv, environ);
if (snprintf_check(new_path, PATH_MAX + sizeof(".avx2"),
"%s.avx2", exe))
die_perror("Can't build AVX2 executable path");
execv(new_path, argv);
warn_perror("Can't run AVX2 build, using non-AVX2 version");
}
}

8
arp.c
View file

@ -59,14 +59,12 @@ int arp(const struct ctx *c, const struct pool *p)
ah->ar_op != htons(ARPOP_REQUEST))
return 1;
/* Discard announcements (but not 0.0.0.0 "probes"): we might have the
* same IP address, hide that.
*/
if (memcmp(am->sip, (unsigned char[4]){ 0 }, sizeof(am->tip)) &&
/* Discard announcements, but not 0.0.0.0 "probes" */
if (memcmp(am->sip, &in4addr_any, sizeof(am->sip)) &&
!memcmp(am->sip, am->tip, sizeof(am->sip)))
return 1;
/* Don't resolve our own address, either. */
/* Don't resolve the guest's assigned address, either. */
if (!memcmp(am->tip, &c->ip4.addr, sizeof(am->tip)))
return 1;

View file

@ -59,6 +59,7 @@
#include "util.h"
#include "ip.h"
#include "checksum.h"
#include "iov.h"
/* Checksums are optional for UDP over IPv4, so we usually just set
* them to 0. Change this to 1 to calculate real UDP over IPv4
@ -165,22 +166,24 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
* @udp4hr: UDP header, initialised apart from checksum
* @saddr: IPv4 source address
* @daddr: IPv4 destination address
* @payload: UDP packet payload
* @dlen: Length of @payload (not including UDP header)
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
*/
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t dlen)
const struct iovec *iov, int iov_cnt, size_t offset)
{
/* UDP checksums are optional, so don't bother */
udp4hr->check = 0;
if (UDP4_REAL_CHECKSUMS) {
uint16_t l4len = dlen + sizeof(struct udphdr);
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint32_t psum = proto_ipv4_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
psum = csum_unfolded(udp4hr, sizeof(struct udphdr), psum);
udp4hr->check = csum(payload, dlen, psum);
udp4hr->check = csum_iov(iov, iov_cnt, offset, psum);
}
}
@ -226,19 +229,24 @@ uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
/**
* csum_udp6() - Calculate and set checksum for a UDP over IPv6 packet
* @udp6hr: UDP header, initialised apart from checksum
* @payload: UDP packet payload
* @dlen: Length of @payload (not including UDP header)
* @saddr: Source address
* @daddr: Destination address
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
*/
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen)
const struct iovec *iov, int iov_cnt, size_t offset)
{
uint32_t psum = proto_ipv6_header_psum(dlen + sizeof(struct udphdr),
IPPROTO_UDP, saddr, daddr);
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint32_t psum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
udp6hr->check = 0;
psum = csum_unfolded(udp6hr, sizeof(struct udphdr), psum);
udp6hr->check = csum(payload, dlen, psum);
udp6hr->check = csum_iov(iov, iov_cnt, offset, psum);
}
/**
@ -497,16 +505,26 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
*
* @iov Pointer to the array of IO vectors
* @n Length of the array
* @offset: Offset of the data to checksum within the full data length
* @init Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* cppcheck-suppress unusedFunction */
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init)
{
unsigned int i;
size_t first;
for (i = 0; i < n; i++)
i = iov_skip_bytes(iov, n, offset, &first);
if (i >= n)
return (uint16_t)~csum_fold(init);
init = csum_unfolded((char *)iov[i].iov_base + first,
iov[i].iov_len - first, init);
i++;
for (; i < n; i++)
init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
return (uint16_t)~csum_fold(init);

View file

@ -19,19 +19,20 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t dlen);
const struct iovec *iov, int iov_cnt, size_t offset);
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen);
uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
const struct in6_addr *saddr,
const struct in6_addr *daddr);
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
const struct iovec *iov, int iov_cnt, size_t offset);
void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
uint16_t csum(const void *buf, size_t len, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init);
#endif /* CHECKSUM_H */

117
conf.c
View file

@ -46,6 +46,8 @@
#include "isolation.h"
#include "log.h"
#define NETNS_RUN_DIR "/run/netns"
/**
* next_chunk - Return the next piece of a string delimited by a character
* @s: String to search
@ -116,11 +118,10 @@ static int parse_port_range(const char *s, char **endptr,
static void conf_ports(const struct ctx *c, char optname, const char *optarg,
struct fwd_ports *fwd)
{
char addr_buf[sizeof(struct in6_addr)] = { 0 }, *addr = addr_buf;
union inany_addr addr_buf = inany_any6, *addr = &addr_buf;
char buf[BUFSIZ], *spec, *ifname = NULL, *p;
bool exclude_only = true, bound_one = false;
uint8_t exclude[PORT_BITMAP_SIZE] = { 0 };
sa_family_t af = AF_UNSPEC;
unsigned i;
int ret;
@ -166,15 +167,13 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, AF_UNSPEC, NULL, NULL,
i);
ret = tcp_sock_init(c, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL,
i);
ret = udp_sock_init(c, 0, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
@ -226,11 +225,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
p++;
}
if (inet_pton(AF_INET, p, addr))
af = AF_INET;
else if (inet_pton(AF_INET6, p, addr))
af = AF_INET6;
else
if (!inany_pton(p, addr))
goto bad;
}
} else {
@ -276,13 +271,13 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, af, addr, ifname, i);
ret = tcp_sock_init(c, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, af, addr, ifname, i);
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
@ -338,9 +333,9 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
ret = 0;
if (optname == 't')
ret = tcp_sock_init(c, af, addr, ifname, i);
ret = tcp_sock_init(c, addr, ifname, i);
else if (optname == 'u')
ret = udp_sock_init(c, 0, af, addr, ifname, i);
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret)
goto bind_fail;
}
@ -581,10 +576,15 @@ static void conf_pasta_ns(int *netns_only, char *userns, char *netns,
if (pidval < 0 || pidval > INT_MAX)
die("Invalid PID %s", argv[optind]);
snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
if (!*userns)
snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
pidval);
if (snprintf_check(netns, PATH_MAX,
"/proc/%ld/ns/net", pidval))
die_perror("Can't build netns path");
if (!*userns) {
if (snprintf_check(userns, PATH_MAX,
"/proc/%ld/ns/user", pidval))
die_perror("Can't build userns path");
}
}
}
@ -735,19 +735,19 @@ static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6)
static void usage(const char *name, FILE *f, int status)
{
if (strstr(name, "pasta")) {
fprintf(f, "Usage: %s [OPTION]... [COMMAND] [ARGS]...\n", name);
fprintf(f, " %s [OPTION]... PID\n", name);
fprintf(f, " %s [OPTION]... --netns [PATH|NAME]\n", name);
fprintf(f,
FPRINTF(f, "Usage: %s [OPTION]... [COMMAND] [ARGS]...\n", name);
FPRINTF(f, " %s [OPTION]... PID\n", name);
FPRINTF(f, " %s [OPTION]... --netns [PATH|NAME]\n", name);
FPRINTF(f,
"\n"
"Without PID or --netns, run the given command or a\n"
"default shell in a new network and user namespace, and\n"
"connect it via pasta.\n");
} else {
fprintf(f, "Usage: %s [OPTION]...\n", name);
FPRINTF(f, "Usage: %s [OPTION]...\n", name);
}
fprintf(f,
FPRINTF(f,
"\n"
" -d, --debug Be verbose\n"
" --trace Be extra verbose, implies --debug\n"
@ -764,17 +764,17 @@ static void usage(const char *name, FILE *f, int status)
" --version Show version and exit\n");
if (strstr(name, "pasta")) {
fprintf(f,
FPRINTF(f,
" -I, --ns-ifname NAME namespace interface name\n"
" default: same interface name as external one\n");
} else {
fprintf(f,
FPRINTF(f,
" -s, --socket PATH UNIX domain socket path\n"
" default: probe free path starting from "
UNIX_SOCK_PATH "\n", 1);
}
fprintf(f,
FPRINTF(f,
" -F, --fd FD Use FD as pre-opened connected socket\n"
" -p, --pcap FILE Log tap-facing traffic to pcap file\n"
" -P, --pid FILE Write own PID to the given file\n"
@ -805,28 +805,28 @@ static void usage(const char *name, FILE *f, int status)
" can be specified multiple times\n"
" a single, empty option disables DNS information\n");
if (strstr(name, "pasta"))
fprintf(f, " default: don't use any addresses\n");
FPRINTF(f, " default: don't use any addresses\n");
else
fprintf(f, " default: use addresses from /etc/resolv.conf\n");
fprintf(f,
FPRINTF(f, " default: use addresses from /etc/resolv.conf\n");
FPRINTF(f,
" -S, --search LIST Space-separated list, search domains\n"
" a single, empty option disables the DNS search list\n");
if (strstr(name, "pasta"))
fprintf(f, " default: don't use any search list\n");
FPRINTF(f, " default: don't use any search list\n");
else
fprintf(f, " default: use search list from /etc/resolv.conf\n");
FPRINTF(f, " default: use search list from /etc/resolv.conf\n");
if (strstr(name, "pasta"))
fprintf(f, " --dhcp-dns \tPass DNS list via DHCP/DHCPv6/NDP\n");
FPRINTF(f, " --dhcp-dns \tPass DNS list via DHCP/DHCPv6/NDP\n");
else
fprintf(f, " --no-dhcp-dns No DNS list in DHCP/DHCPv6/NDP\n");
FPRINTF(f, " --no-dhcp-dns No DNS list in DHCP/DHCPv6/NDP\n");
if (strstr(name, "pasta"))
fprintf(f, " --dhcp-search Pass list via DHCP/DHCPv6/NDP\n");
FPRINTF(f, " --dhcp-search Pass list via DHCP/DHCPv6/NDP\n");
else
fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");
FPRINTF(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");
fprintf(f,
FPRINTF(f,
" --map-host-loopback ADDR Translate ADDR to refer to host\n"
" can be specified zero to two times (for IPv4 and IPv6)\n"
" default: gateway address\n"
@ -836,6 +836,9 @@ static void usage(const char *name, FILE *f, int status)
" --dns-forward ADDR Forward DNS queries sent to ADDR\n"
" can be specified zero to two times (for IPv4 and IPv6)\n"
" default: don't forward DNS queries\n"
" --dns-host ADDR Host nameserver to direct queries to\n"
" can be specified zero to two times (for IPv4 and IPv6)\n"
" default: first nameserver from host's /etc/resolv.conf\n"
" --no-tcp Disable TCP protocol handler\n"
" --no-udp Disable UDP protocol handler\n"
" --no-icmp Disable ICMP/ICMPv6 protocol handler\n"
@ -843,6 +846,7 @@ static void usage(const char *name, FILE *f, int status)
" --no-ndp Disable NDP responses\n"
" --no-dhcpv6 Disable DHCPv6 server\n"
" --no-ra Disable router advertisements\n"
" --freebind Bind to any address for forwarding\n"
" --no-map-gw Don't map gateway address to host\n"
" -4, --ipv4-only Enable IPv4 operation only\n"
" -6, --ipv6-only Enable IPv6 operation only\n");
@ -850,7 +854,7 @@ static void usage(const char *name, FILE *f, int status)
if (strstr(name, "pasta"))
goto pasta_opts;
fprintf(f,
FPRINTF(f,
" -1, --one-off Quit after handling one single client\n"
" -t, --tcp-ports SPEC TCP port forwarding to guest\n"
" can be specified multiple times\n"
@ -881,7 +885,7 @@ static void usage(const char *name, FILE *f, int status)
pasta_opts:
fprintf(f,
FPRINTF(f,
" -t, --tcp-ports SPEC TCP port forwarding to namespace\n"
" can be specified multiple times\n"
" SPEC can be:\n"
@ -915,6 +919,9 @@ pasta_opts:
" -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
" SPEC is as described above\n"
" default: auto\n"
" --host-lo-to-ns-lo DEPRECATED:\n"
" Translate host-loopback forwards to\n"
" namespace loopback\n"
" --userns NSPATH Target user namespace to join\n"
" --netns PATH|NAME Target network namespace to join\n"
" --netns-only Don't join existing user namespace\n"
@ -1189,7 +1196,11 @@ static void conf_open_files(struct ctx *c)
if (c->mode != MODE_PASTA && c->fd_tap == -1)
c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
c->pidfile_fd = pidfile_open(c->pidfile);
if (*c->pidfile) {
c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
if (c->pidfile_fd < 0)
die_perror("Couldn't open PID file %s", c->pidfile);
}
}
/**
@ -1262,6 +1273,7 @@ void conf(struct ctx *c, int argc, char **argv)
{"no-dhcpv6", no_argument, &c->no_dhcpv6, 1 },
{"no-ndp", no_argument, &c->no_ndp, 1 },
{"no-ra", no_argument, &c->no_ra, 1 },
{"freebind", no_argument, &c->freebind, 1 },
{"no-map-gw", no_argument, &no_map_gw, 1 },
{"ipv4-only", no_argument, NULL, '4' },
{"ipv6-only", no_argument, NULL, '6' },
@ -1291,6 +1303,8 @@ void conf(struct ctx *c, int argc, char **argv)
{"netns-only", no_argument, NULL, 20 },
{"map-host-loopback", required_argument, NULL, 21 },
{"map-guest-addr", required_argument, NULL, 22 },
{"host-lo-to-ns-lo", no_argument, NULL, 23 },
{"dns-host", required_argument, NULL, 24 },
{ 0 },
};
const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@ -1413,9 +1427,9 @@ void conf(struct ctx *c, int argc, char **argv)
break;
case 14:
fprintf(stdout,
FPRINTF(stdout,
c->mode == MODE_PASTA ? "pasta " : "passt ");
fprintf(stdout, VERSION_BLOB);
FPRINTF(stdout, VERSION_BLOB);
exit(EXIT_SUCCESS);
case 15:
ret = snprintf(c->ip4.ifname_out,
@ -1468,6 +1482,23 @@ void conf(struct ctx *c, int argc, char **argv)
conf_nat(optarg, &c->ip4.map_guest_addr,
&c->ip6.map_guest_addr, NULL);
break;
case 23:
if (c->mode != MODE_PASTA)
die("--host-lo-to-ns-lo is for pasta mode only");
c->host_lo_to_ns_lo = 1;
break;
case 24:
if (inet_pton(AF_INET6, optarg, &c->ip6.dns_host) &&
!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
break;
if (inet_pton(AF_INET, optarg, &c->ip4.dns_host) &&
!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host) &&
!IN4_IS_ADDR_BROADCAST(&c->ip4.dns_host))
break;
die("Invalid host nameserver address: %s", optarg);
break;
case 'd':
c->debug = 1;
c->quiet = 0;

View file

@ -34,6 +34,8 @@
owner @{PROC}/@{pid}/uid_map r, # conf_ugid()
@{PROC}/sys/net/ipv4/ip_local_port_range r, # fwd_probe_ephemeral()
network netlink raw, # nl_sock_init_do(), netlink.c
network inet stream, # tcp.c

View file

@ -50,6 +50,7 @@ require {
type passwd_file_t;
class netlink_route_socket { bind create nlmsg_read };
type sysctl_net_t;
class capability { sys_tty_config setuid setgid };
class cap_userns { setpcap sys_admin sys_ptrace };
@ -104,6 +105,8 @@ allow passt_t net_conf_t:lnk_file read;
allow passt_t tmp_t:sock_file { create unlink write };
allow passt_t self:netlink_route_socket { bind create nlmsg_read read write setopt };
kernel_search_network_sysctl(passt_t)
allow passt_t sysctl_net_t:dir search;
allow passt_t sysctl_net_t:file { open read };
corenet_tcp_bind_all_nodes(passt_t)
corenet_udp_bind_all_nodes(passt_t)

View file

@ -196,7 +196,7 @@ allow pasta_t ifconfig_var_run_t:dir { read search watch };
allow pasta_t self:tun_socket create;
allow pasta_t tun_tap_device_t:chr_file { ioctl open read write };
allow pasta_t sysctl_net_t:dir search;
allow pasta_t sysctl_net_t:file { open write };
allow pasta_t sysctl_net_t:file { open read write };
allow pasta_t kernel_t:system module_request;
allow pasta_t nsfs_t:file read;

View file

@ -296,47 +296,42 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p,
struct in6_addr *la)
{
int ia_types[2] = { OPT_IA_NA, OPT_IA_TA }, *ia_type;
const struct opt_ia_addr *opt_addr;
char buf[INET6_ADDRSTRLEN];
struct in6_addr req_addr;
const struct opt_hdr *h;
struct opt_hdr *ia;
size_t offset;
int ia_type;
ia_type = OPT_IA_NA;
ia_ta:
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
offset += sizeof(struct opt_ia_na);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
const struct opt_ia_addr *opt_addr;
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
foreach(ia_type, ia_types) {
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, *ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
opt_addr = (const struct opt_ia_addr *)h;
req_addr = opt_addr->addr;
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) {
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr,
buf, sizeof(buf)));
return ia;
}
offset += sizeof(struct opt_ia_na);
offset += sizeof(struct opt_ia_addr);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
return NULL;
opt_addr = (const struct opt_ia_addr *)h;
req_addr = opt_addr->addr;
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr))
goto err;
offset += sizeof(struct opt_ia_addr);
}
}
}
if (ia_type == OPT_IA_NA) {
ia_type = OPT_IA_TA;
goto ia_ta;
}
return NULL;
err:
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr, buf, sizeof(buf)));
return ia;
}
/**
@ -428,11 +423,11 @@ search:
int dhcpv6(struct ctx *c, const struct pool *p,
const struct in6_addr *saddr, const struct in6_addr *daddr)
{
struct opt_hdr *ia, *bad_ia, *client_id;
const struct opt_hdr *server_id;
const struct opt_hdr *client_id, *server_id, *ia;
const struct in6_addr *src;
const struct msg_hdr *mh;
const struct udphdr *uh;
struct opt_hdr *bad_ia;
size_t mlen, n;
uh = packet_get(p, 0, 0, sizeof(*uh), &mlen);

53
flow.c
View file

@ -283,28 +283,23 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
"Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
}
/**
* flow_set_state() - Change flow's state
* @f: Flow changing state
* @state: New state
/** flow_log_details_() - Log the details of a flow
* @f: flow to log
* @pri: Log priority
* @state: State to log details according to
*
* Logs the details of the flow: endpoints, interfaces, type etc.
*/
static void flow_set_state(struct flow_common *f, enum flow_state state)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state)
{
char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
const struct flowside *ini = &f->side[INISIDE];
const struct flowside *tgt = &f->side[TGTSIDE];
uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES);
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT)
flow_log_(f, LOG_DEBUG,
if (state >= FLOW_STATE_TGT)
flow_log_(f, pri,
"%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@ -316,8 +311,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
tgt->oport,
inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
tgt->eport);
else if (MAX(state, oldstate) >= FLOW_STATE_INI)
flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
else if (state >= FLOW_STATE_INI)
flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
@ -325,6 +320,25 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
ini->oport);
}
/**
* flow_set_state() - Change flow's state
* @f: Flow changing state
* @state: New state
*/
static void flow_set_state(struct flow_common *f, enum flow_state state)
{
uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES);
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
}
/**
* flow_initiate_() - Move flow to INI, setting pif[INISIDE]
* @flow: Flow to change state
@ -697,7 +711,7 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
!(FLOW_PROTO(&flow->f) == proto &&
flow->f.pif[sidx.sidei] == pif &&
flowside_eq(&flow->f.side[sidx.sidei], side)))
b = (b + 1) % FLOW_HASH_SIZE;
b = mod_sub(b, 1, FLOW_HASH_SIZE);
return flow_hashtab[b];
}
@ -832,7 +846,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
if (timer)
closed = udp_flow_defer(&flow->udp);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
default:

7
flow.h
View file

@ -264,4 +264,11 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
flow_dbg((f), __VA_ARGS__); \
} while (0)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state);
#define flow_log_details(f_, pri) \
flow_log_details_(&((f_)->f), (pri), (f_)->f.state)
#define flow_dbg_details(f_) flow_log_details((f_), LOG_DEBUG)
#define flow_err_details(f_) flow_log_details((f_), LOG_ERR)
#endif /* FLOW_H */

View file

@ -110,7 +110,7 @@ static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx)
const union flow *flow = flow_at_sidx(sidx);
if (!flow)
return PIF_NONE;
return NULL;
return &flow->f.side[sidx.sidei];
}

35
fwd.c
View file

@ -75,8 +75,8 @@ void fwd_probe_ephemeral(void)
if (*end || errno)
goto parse_err;
if (min < 0 || min >= NUM_PORTS ||
max < 0 || max >= NUM_PORTS)
if (min < 0 || min >= (long)NUM_PORTS ||
max < 0 || max >= (long)NUM_PORTS)
goto parse_err;
fwd_ephemeral_min = min;
@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
(proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
/* spliceable */
/* Preserve the specific loopback adddress used, but let the
* kernel pick a source port on the target side
/* The traffic will go over the guest's 'lo' interface, but by
* default use its external address, so we don't inadvertently
* expose services that listen only on the guest's loopback
* address. That can be overridden by --host-lo-to-ns-lo which
* will instead forward to the loopback address in the guest.
*
* In either case, let the kernel pick the source address to
* match.
*/
tgt->oaddr = ini->eaddr;
if (inany_v4(&ini->eaddr)) {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
tgt->oaddr = inany_any4;
} else {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback6;
else
tgt->eaddr.a6 = c->ip6.addr_seen;
tgt->oaddr = inany_any6;
}
/* Let the kernel pick source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP)
/* But for UDP preserve the source port */
tgt->oport = ini->eport;
if (inany_v4(&ini->eaddr))
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_loopback6;
return PIF_SPLICE;
}

20
inany.c
View file

@ -36,3 +36,23 @@ const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size)
return inet_ntop(AF_INET6, &src->a6, dst, size);
}
/** inany_pton - Parse an IPv[46] address from text format
* @src: IPv[46] address
* @dst: output buffer, filled with parsed address
*
* Return: On success, 1, if no parseable address is found, 0
*/
int inany_pton(const char *src, union inany_addr *dst)
{
if (inet_pton(AF_INET, src, &dst->v4mapped.a4)) {
memset(&dst->v4mapped.zero, 0, sizeof(dst->v4mapped.zero));
memset(&dst->v4mapped.one, 0xff, sizeof(dst->v4mapped.one));
return 1;
}
if (inet_pton(AF_INET6, src, &dst->a6))
return 1;
return 0;
}

View file

@ -270,5 +270,6 @@ static inline void inany_siphash_feed(struct siphash_state *state,
#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
int inany_pton(const char *src, union inany_addr *dst);
#endif /* INANY_H */

144
linux_dep.h Normal file
View file

@ -0,0 +1,144 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright Red Hat
*
* Declarations for Linux specific dependencies
*/
#ifndef LINUX_DEP_H
#define LINUX_DEP_H
/* struct tcp_info_linux - Information from Linux TCP_INFO getsockopt()
*
* Largely derived from include/linux/tcp.h in the Linux kernel
*
* Some fields returned by TCP_INFO have been there for ages and are shared with
* BSD. struct tcp_info from netinet/tcp.h has only those fields. There are
* also a many Linux specific extensions to the structure, which are only found
* in the linux/tcp.h version of struct tcp_info.
*
* We want to use some of those extension fields, when available. We can test
* for availability in the runtime kernel using the length returned from
* getsockopt(). However, we won't necessarily be compiled against the same
* kernel headers as we'll run with, so compiling directly against linux/tcp.h
* means wrapping every field access in an #ifdef whose #else does the same
* thing as when the field is missing at runtime. This rapidly gets messy.
*
* Instead we define here struct tcp_info_linux which includes all the Linux
* extensions that we want to use. This is taken from v6.11 of the kernel.
*/
struct tcp_info_linux {
uint8_t tcpi_state;
uint8_t tcpi_ca_state;
uint8_t tcpi_retransmits;
uint8_t tcpi_probes;
uint8_t tcpi_backoff;
uint8_t tcpi_options;
uint8_t tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
uint8_t tcpi_delivery_rate_app_limited:1, tcpi_fastopen_client_fail:2;
uint32_t tcpi_rto;
uint32_t tcpi_ato;
uint32_t tcpi_snd_mss;
uint32_t tcpi_rcv_mss;
uint32_t tcpi_unacked;
uint32_t tcpi_sacked;
uint32_t tcpi_lost;
uint32_t tcpi_retrans;
uint32_t tcpi_fackets;
/* Times. */
uint32_t tcpi_last_data_sent;
uint32_t tcpi_last_ack_sent;
uint32_t tcpi_last_data_recv;
uint32_t tcpi_last_ack_recv;
/* Metrics. */
uint32_t tcpi_pmtu;
uint32_t tcpi_rcv_ssthresh;
uint32_t tcpi_rtt;
uint32_t tcpi_rttvar;
uint32_t tcpi_snd_ssthresh;
uint32_t tcpi_snd_cwnd;
uint32_t tcpi_advmss;
uint32_t tcpi_reordering;
uint32_t tcpi_rcv_rtt;
uint32_t tcpi_rcv_space;
uint32_t tcpi_total_retrans;
/* Linux extensions */
uint64_t tcpi_pacing_rate;
uint64_t tcpi_max_pacing_rate;
uint64_t tcpi_bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked */
uint64_t tcpi_bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived */
uint32_t tcpi_segs_out; /* RFC4898 tcpEStatsPerfSegsOut */
uint32_t tcpi_segs_in; /* RFC4898 tcpEStatsPerfSegsIn */
uint32_t tcpi_notsent_bytes;
uint32_t tcpi_min_rtt;
uint32_t tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */
uint32_t tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
uint64_t tcpi_delivery_rate;
uint64_t tcpi_busy_time; /* Time (usec) busy sending data */
uint64_t tcpi_rwnd_limited; /* Time (usec) limited by receive window */
uint64_t tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
uint32_t tcpi_delivered;
uint32_t tcpi_delivered_ce;
uint64_t tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
uint64_t tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans */
uint32_t tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
uint32_t tcpi_reord_seen; /* reordering events seen */
uint32_t tcpi_rcv_ooopack; /* Out-of-order packets received */
uint32_t tcpi_snd_wnd; /* peer's advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rcv_wnd; /* local advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rehash; /* PLB or timeout triggered rehash attempts */
uint16_t tcpi_total_rto; /* Total number of RTO timeouts, including
* SYN/SYN-ACK and recurring timeouts.
*/
uint16_t tcpi_total_rto_recoveries; /* Total number of RTO
* recoveries, including any
* unfinished recovery.
*/
uint32_t tcpi_total_rto_time; /* Total time spent in RTO recoveries
* in milliseconds, including any
* unfinished recovery.
*/
};
#include <linux/falloc.h>
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#endif
#include <linux/close_range.h>
/* glibc < 2.34 and musl as of 1.2.5 need these */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE /* Linux kernel < 5.9 */
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif
__attribute__ ((weak))
/* cppcheck-suppress funcArgNamesDifferent */
int close_range(unsigned int first, unsigned int last, int flags) {
return syscall(SYS_close_range, first, last, flags);
}
#endif /* LINUX_DEP_H */

33
log.c
View file

@ -26,6 +26,7 @@
#include <stdarg.h>
#include <sys/socket.h>
#include "linux_dep.h"
#include "log.h"
#include "util.h"
#include "passt.h"
@ -92,7 +93,6 @@ const char *logfile_prefix[] = {
" ", /* LOG_DEBUG */
};
#ifdef FALLOC_FL_COLLAPSE_RANGE
/**
* logfile_rotate_fallocate() - Write header, set log_written after fallocate()
* @fd: Log file descriptor
@ -126,7 +126,6 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *now)
log_written -= log_cut_size;
}
#endif /* FALLOC_FL_COLLAPSE_RANGE */
/**
* logfile_rotate_move() - Fallback: move recent entries toward start, then cut
@ -198,21 +197,17 @@ out:
*
* Return: 0 on success, negative error code on failure
*
* #syscalls fcntl
*
* fallocate() passed as EXTRA_SYSCALL only if FALLOC_FL_COLLAPSE_RANGE is there
* #syscalls fcntl fallocate
*/
static int logfile_rotate(int fd, const struct timespec *now)
{
if (fcntl(fd, F_SETFL, O_RDWR /* Drop O_APPEND: explicit lseek() */))
return -errno;
#ifdef FALLOC_FL_COLLAPSE_RANGE
/* Only for Linux >= 3.15, extent-based ext4 or XFS, glibc >= 2.18 */
if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, log_cut_size))
logfile_rotate_fallocate(fd, now);
else
#endif
logfile_rotate_move(fd, now);
if (fcntl(fd, F_SETFL, O_RDWR | O_APPEND))
@ -224,19 +219,23 @@ static int logfile_rotate(int fd, const struct timespec *now)
/**
* logfile_write() - Write entry to log file, trigger rotation if full
* @newline: Append newline at the end of the message, if missing
* @cont: Continuation of a previous message, on the same line
* @pri: Facility and level map, same as priority for vsyslog()
* @now: Timestamp
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
static void logfile_write(bool newline, int pri, const struct timespec *now,
static void logfile_write(bool newline, bool cont, int pri,
const struct timespec *now,
const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
int n = 0;
n = logtime_fmt(buf, BUFSIZ, now);
n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]);
if (!cont) {
n += logtime_fmt(buf, BUFSIZ, now);
n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]);
}
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
@ -270,7 +269,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
char timestr[LOGTIME_STRLEN];
logtime_fmt(timestr, sizeof(timestr), now);
fprintf(stderr, "%s: ", timestr);
FPRINTF(stderr, "%s: ", timestr);
}
if ((log_mask & LOG_MASK(LOG_PRI(pri))) || !log_conf_parsed) {
@ -278,7 +277,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
va_copy(ap2, ap); /* Don't clobber ap, we need it again */
if (log_file != -1)
logfile_write(newline, pri, now, format, ap2);
logfile_write(newline, cont, pri, now, format, ap2);
else if (!(log_mask & LOG_MASK(LOG_DEBUG)))
passt_vsyslog(newline, pri, format, ap2);
@ -289,7 +288,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
(log_stderr && (log_mask & LOG_MASK(LOG_PRI(pri))))) {
(void)vfprintf(stderr, format, ap);
if (newline && format[strlen(format)] != '\n')
fprintf(stderr, "\n");
FPRINTF(stderr, "\n");
}
}
@ -395,7 +394,7 @@ void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
fprintf(stderr, "Failed to send %i bytes to syslog\n", n);
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
@ -412,8 +411,7 @@ void logfile_init(const char *name, const char *path, size_t size)
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0)
die_perror("Failed to read own /proc/self/exe link");
log_file = open(path, O_CREAT | O_TRUNC | O_APPEND | O_RDWR | O_CLOEXEC,
S_IRUSR | S_IWUSR);
log_file = output_file_open(path, O_APPEND | O_RDWR);
if (log_file == -1)
die_perror("Couldn't open log file %s", path);
@ -429,4 +427,3 @@ void logfile_init(const char *name, const char *path, size_t size)
/* For FALLOC_FL_COLLAPSE_RANGE: VFS block size can be up to one page */
log_cut_size = ROUND_UP(log_size * LOGFILE_CUT_RATIO / 100, PAGE_SIZE);
}

4
ndp.c
View file

@ -234,8 +234,8 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
return 1;
if (ih->icmp6_type == NS) {
struct ndp_ns *ns = packet_get(p, 0, 0, sizeof(struct ndp_ns),
NULL);
const struct ndp_ns *ns =
packet_get(p, 0, 0, sizeof(struct ndp_ns), NULL);
if (!ns)
return -1;

View file

@ -353,7 +353,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
*/
bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
{
size_t nh_len = RTA_PAYLOAD(rta);
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
bool found = false;
int hops = -1;
@ -582,7 +582,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
*(unsigned int *)RTA_DATA(rta) = ifi_dst;
} else if (rta->rta_type == RTA_MULTIPATH) {
size_t nh_len = RTA_PAYLOAD(rta);
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
for (rtnh = (struct rtnexthop *)RTA_DATA(rta);

96
passt.1
View file

@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change.
Default is to fork into background.
.TP
.BR \-e ", " \-\-stderr
.BR \-e ", " \-\-stderr " " (DEPRECATED)
This option has no effect, and is maintained for compatibility purposes only.
Note that this configuration option is \fBdeprecated\fR and will be removed in a
@ -249,10 +249,19 @@ the host.
.TP
.BR \-\-dns-forward " " \fIaddr
Map \fIaddr\fR (IPv4 or IPv6) as seen from guest or namespace to the
first configured DNS resolver (with corresponding IP version). Maps
only UDP and TCP traffic to port 53 or port 853. Replies are
translated back with a reverse mapping. This option can be specified
zero to two times (once for IPv4, once for IPv6).
nameserver (with corresponding IP version) specified by the
\fB\-\-dns-host\fR option. Maps only UDP and TCP traffic to port 53 or
port 853. Replies are translated back with a reverse mapping. This
option can be specified zero to two times (once for IPv4, once for
IPv6).
.TP
.BR \-\-dns-host " " \fIaddr
Configure the host nameserver which guest or namespace queries to the
\fB\-\-dns-forward\fR address will be redirected to. This option can
be specified zero to two times (once for IPv4, once for IPv6).
By default, the first nameserver from the host's
\fI/etc/resolv.conf\fR.
.TP
.BR \-S ", " \-\-search " " \fIlist
@ -327,6 +336,16 @@ namespace will be silently dropped.
Disable Router Advertisements. Router Solicitations coming from guest or target
namespace will be ignored.
.TP
.BR \-\-freebind
Allow any binding address to be specified for \fB-t\fR and \fB-u\fR
options. Usually binding addresses must be addresses currently
configured on the host. With \fB\-\-freebind\fR, the
\fBIP_FREEBIND\fR or \fBIPV6_FREEBIND\fR socket option is enabled
allowing any address to be used. This is typically used to bind
addresses which might be configured on the host in future, at which
point the forwarding will immediately start operating.
.TP
.BR \-\-map-host-loopback " " \fIaddr
Translate \fIaddr\fR to refer to the host. Packets from the guest to
@ -586,6 +605,13 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
.TP
.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
the host's loopback address will appear on the loopback address in the
guest as well. Without this option such forwarded packets will appear
to come from the guest's public address.
.TP
.BR \-\-userns " " \fIspec
Target user namespace to join, as a path. If PID is given, without this option,
@ -863,38 +889,41 @@ root@localhost's password:
.SH NOTES
.SS Handling of traffic with local destination and source addresses
.SS Handling of traffic with loopback destination and source addresses
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address,
depending on the configuration. Local destination or source addresses need to be
changed before packets are delivered to the guest or target namespace: most
operating systems would drop packets received from non-loopback interfaces with
local addresses, and it would also be impossible for guest or target namespace
to route answers back.
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback
address (127.0.0.0/8 or ::1), depending on the configuration. Loopback
destination or source addresses need to be changed before packets are
delivered to the guest or target namespace: most operating systems
would drop packets received with loopback addresses on non-loopback
interfaces, and it would also be impossible for guest or target
namespace to route answers back.
For convenience, and somewhat arbitrarily, the source address on these packets
is translated to the address of the default IPv4 or IPv6 gateway (if any) --
this is known to be an existing, valid address on the same subnet.
For convenience, the source address on these packets is translated to
the address specified by the \fB\-\-map-host-loopback\fR option (with
some exceptions in pasta mode, see next section below). If not
specified this defaults, somewhat arbitrarily, to the address of
default IPv4 or IPv6 gateway (if any) -- this is known to be an
existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
\fB\-\-map-host-loopback none\fR are specified this translation is
disabled and packets with loopback addresses are simply dropped.
Loopback destination addresses are instead translated to the observed external
address of the guest or target namespace. For IPv6 packets, if usage of a
link-local address by guest or namespace has ever been observed, and the
original destination address is also a link-local address, the observed
link-local address is used. Otherwise, the observed global address is used. For
both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses
will be used instead.
Loopback destination addresses are translated to the observed external
address of the guest or target namespace. For IPv6, the observed
link-local address is used if the translated source address is
link-local, otherwise the observed global address is used. For both
IPv4 and IPv6, if no addresses have been seen yet, the configured
addresses will be used instead.
For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1,
with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
Similarly, for traffic coming from guest or namespace, packets with destination
address corresponding to the default gateway will have their destination address
translated to a loopback address, if and only if a packet, in the opposite
direction, with a loopback destination or source address, port-wise matching for
UDP, or connection-wise for TCP, has been recently forwarded to guest or
namespace. This behaviour can be disabled with \-\-no\-map\-gw.
Similarly, for traffic coming from guest or namespace, packets with
destination address corresponding to the \fB\-\-map-host-loopback\fR
address will have their destination address translated to a loopback
address.
.SS Handling of local traffic in pasta
@ -910,8 +939,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair
of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet
transfers.
This bypass only applies to local connections and traffic, because it's not
possible to bind sockets to foreign addresses.
Because it's not possible to bind sockets to foreign addresses, this
bypass only applies to local connections and traffic. It also means
that the address translation differs slightly from passt mode.
Connections from loopback to loopback on the host will appear to come
from the target namespace's public address within the guest, unless
\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will
appear to come from loopback in the namespace as well. The latter
behaviour used to be the default, but is usually undesirable, since it
can unintentionally expose namespace local services to the host.
.SS Binding to low numbered ports (well-known or system ports, up to 1023)

12
passt.c
View file

@ -207,7 +207,8 @@ int main(int argc, char **argv)
struct timespec now;
struct sigaction sa;
clock_gettime(CLOCK_MONOTONIC, &log_start);
if (clock_gettime(CLOCK_MONOTONIC, &log_start))
die_perror("Failed to get CLOCK_MONOTONIC time");
arch_avx2_exec(argv);
@ -265,7 +266,8 @@ int main(int argc, char **argv)
secret_init(&c);
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
die_perror("Failed to get CLOCK_MONOTONIC time");
flow_init();
@ -307,13 +309,15 @@ int main(int argc, char **argv)
timer_init(&c, &now);
loop:
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
/* NOLINTBEGIN(bugprone-branch-clone): intervals can be the same */
/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
/* NOLINTEND(bugprone-branch-clone) */
if (nfds == -1 && errno != EINTR)
die_perror("epoll_wait() failed in main loop");
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
err_perror("Failed to get CLOCK_MONOTONIC time");
for (i = 0; i < nfds; i++) {
union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);

View file

@ -225,6 +225,8 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
* @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @freebind: Allow binding of non-local addresses for forwarding
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
*/
@ -284,6 +286,8 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
int host_lo_to_ns_lo;
int freebind;
int low_wmem;
int low_rmem;

14
pasta.c
View file

@ -102,7 +102,9 @@ static int pasta_wait_for_ns(void *arg)
int flags = O_RDONLY | O_CLOEXEC;
char ns[PATH_MAX];
snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
if (snprintf_check(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid))
die_perror("Can't build netns path");
do {
while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
if (errno != ENOENT)
@ -239,8 +241,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
c->quiet = 1;
/* Configure user and group mappings */
snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
if (snprintf_check(uidmap, BUFSIZ, "0 %u 1", uid))
die_perror("Can't build uidmap");
if (snprintf_check(gidmap, BUFSIZ, "0 %u 1", gid))
die_perror("Can't build gidmap");
if (write_file("/proc/self/uid_map", uidmap) ||
write_file("/proc/self/setgroups", "deny") ||
@ -427,12 +432,12 @@ static int pasta_netns_quit_timer(void)
*/
void pasta_netns_quit_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
struct epoll_event ev = { .events = EPOLLIN };
int flags = O_NONBLOCK | O_CLOEXEC;
struct statfs s = { 0 };
bool try_inotify = true;
int fd = -1, dir_fd;
union epoll_ref ref;
if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base)
return;
@ -463,6 +468,7 @@ void pasta_netns_quit_init(const struct ctx *c)
ref.type = EPOLL_TYPE_NSQUIT_TIMER;
} else {
close(dir_fd);
ref.type = EPOLL_TYPE_NSQUIT_INOTIFY;
}
if (fd > FD_REF_MAX)

32
pcap.c
View file

@ -86,9 +86,8 @@ static void pcap_frame(const struct iovec *iov, size_t iovcnt,
.caplen = l2len,
.len = l2len
};
struct iovec hiov = { &h, sizeof(h) };
if (write_remainder(pcap_fd, &hiov, 1, 0) < 0 ||
if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0 ||
write_remainder(pcap_fd, iov, iovcnt, offset) < 0)
debug_perror("Cannot log packet, length %zu", l2len);
}
@ -101,12 +100,14 @@ static void pcap_frame(const struct iovec *iov, size_t iovcnt,
void pcap(const char *pkt, size_t l2len)
{
struct iovec iov = { (char *)pkt, l2len };
struct timespec now;
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(&iov, 1, 0, &now);
}
@ -120,13 +121,14 @@ void pcap(const char *pkt, size_t l2len)
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset)
{
struct timespec now;
struct timespec now = { 0 };
unsigned int i;
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
for (i = 0; i < n; i++)
pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
@ -139,17 +141,20 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
* @iov: Pointer to the array of struct iovec describing the I/O vector
* containing packet data to write, including L2 header
* @iovcnt: Number of buffers (@iov entries)
* @offset: Offset of the L2 frame within the full data length
*/
/* cppcheck-suppress unusedFunction */
void pcap_iov(const struct iovec *iov, size_t iovcnt)
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
{
struct timespec now;
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
pcap_frame(iov, iovcnt, 0, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(iov, iovcnt, offset, &now);
}
/**
@ -158,18 +163,15 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt)
*/
void pcap_init(struct ctx *c)
{
int flags = O_WRONLY | O_CREAT | O_TRUNC;
if (pcap_fd != -1)
return;
if (!*c->pcap)
return;
flags |= c->foreground ? O_CLOEXEC : 0;
pcap_fd = open(c->pcap, flags, S_IRUSR | S_IWUSR);
pcap_fd = output_file_open(c->pcap, O_WRONLY);
if (pcap_fd == -1) {
perror("open");
err_perror("Couldn't open pcap file %s", c->pcap);
return;
}

2
pcap.h
View file

@ -9,7 +9,7 @@
void pcap(const char *pkt, size_t l2len);
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset);
void pcap_iov(const struct iovec *iov, size_t iovcnt);
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset);
void pcap_init(struct ctx *c);
#endif /* PCAP_H */

42
pif.c
View file

@ -59,3 +59,45 @@ void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
*sl = sizeof(sa->sa6);
}
}
/** pif_sock_l4() - Open a socket bound to an address on a specified interface
* @c: Execution context
* @type: Socket epoll type
* @pif: Interface for this socket
* @addr: Address to bind to, or NULL for dual-stack any
* @ifname: Interface for binding, NULL for any
* @port: Port number to bind to (host byte order)
* @data: epoll reference portion for protocol handlers
*
* NOTE: For namespace pifs, this must be called having already entered the
* relevant namespace.
*
* Return: newly created socket, negative error code on failure
*/
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data)
{
union sockaddr_inany sa = {
.sa6.sin6_family = AF_INET6,
.sa6.sin6_addr = in6addr_any,
.sa6.sin6_port = htons(port),
};
socklen_t sl;
ASSERT(pif_is_socket(pif));
if (pif == PIF_SPLICE) {
/* Sanity checks */
ASSERT(!ifname);
ASSERT(addr && inany_is_loopback(addr));
}
if (!addr)
return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
ifname, false, data);
pif_sockaddr(c, &sa, &sl, pif, addr, port);
return sock_l4_sa(c, type, &sa, sl,
ifname, sa.sa_family == AF_INET6, data);
}

3
pif.h
View file

@ -59,5 +59,8 @@ static inline bool pif_is_socket(uint8_t pif)
void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
uint8_t pif, const union inany_addr *addr, in_port_t port);
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data);
#endif /* PIF_H */

View file

@ -20,6 +20,15 @@ OUT="$(mktemp)"
[ -z "${ARCH}" ] && ARCH="$(uname -m)"
[ -z "${CC}" ] && CC="cc"
AUDIT_ARCH="AUDIT_ARCH_$(echo ${ARCH} | tr [a-z] [A-Z] \
| sed 's/^ARM.*/ARM/' \
| sed 's/I[456]86/I386/' \
| sed 's/PPC64/PPC/' \
| sed 's/PPCLE/PPC64LE/' \
| sed 's/MIPS64EL/MIPSEL64/' \
| sed 's/HPPA/PARISC/' \
| sed 's/SH4/SH/')"
HEADER="/* This file was automatically generated by $(basename ${0}) */
#ifndef AUDIT_ARCH_PPC64LE
@ -32,7 +41,7 @@ struct sock_filter filter_@PROFILE@[] = {
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, arch))),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PASST_AUDIT_ARCH, 0, @KILL@),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, @AUDIT_ARCH@, 0, @KILL@),
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, nr))),
@ -233,7 +242,8 @@ gen_profile() {
sub ${__i} CALL "NR:${__nr}" "NAME:${__name}" "ALLOW:${__allow}"
done
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))"
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))" \
"AUDIT_ARCH:${AUDIT_ARCH}"
}
printf '%s\n' "${HEADER}" > "${OUT}"

142
tap.c
View file

@ -172,11 +172,15 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = (void *)in,
.iov_len = dlen
};
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, in, dlen);
csum_udp4(uh, src, dst, &iov, 1, 0);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
@ -247,7 +251,7 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
uint32_t flow, const void *in, size_t dlen)
uint32_t flow, void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
@ -255,11 +259,15 @@ void tap_udp6_send(const struct ctx *c,
struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
l4len, IPPROTO_UDP, flow);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, in, dlen);
csum_udp6(uh, src, dst, &iov, 1, 0);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
@ -982,24 +990,17 @@ static void tap_sock_reset(struct ctx *c)
}
/**
* tap_handler_passt() - Packet handler for AF_UNIX file descriptor
* tap_passt_input() - Handler for new data on the socket to qemu
* @c: Execution context
* @events: epoll events
* @now: Current timestamp
*/
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now)
static void tap_passt_input(struct ctx *c, const struct timespec *now)
{
static const char *partial_frame;
static ssize_t partial_len = 0;
ssize_t n;
char *p;
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
tap_sock_reset(c);
return;
}
tap_flush_pools();
if (partial_len) {
@ -1010,10 +1011,13 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
memmove(pkt_buf, partial_frame, partial_len);
}
n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
MSG_DONTWAIT);
do {
n = recv(c->fd_tap, pkt_buf + partial_len,
TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
} while ((n < 0) && errno == EINTR);
if (n < 0) {
if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
err_perror("Receive error on guest connection, reset");
tap_sock_reset(c);
}
@ -1051,6 +1055,63 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
tap_handler(c, now);
}
/**
* tap_handler_passt() - Event handler for AF_UNIX file descriptor
* @c: Execution context
* @events: epoll events
* @now: Current timestamp
*/
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now)
{
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
tap_sock_reset(c);
return;
}
if (events & EPOLLIN)
tap_passt_input(c, now);
}
/**
* tap_pasta_input() - Handler for new data on the socket to hypervisor
* @c: Execution context
* @now: Current timestamp
*/
static void tap_pasta_input(struct ctx *c, const struct timespec *now)
{
ssize_t n, len;
tap_flush_pools();
for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len) {
len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
if (len == 0) {
die("EOF on tap device, exiting");
} else if (len < 0) {
if (errno == EINTR) {
len = 0;
continue;
}
if (errno == EAGAIN && errno == EWOULDBLOCK)
break; /* all done for now */
die("Error on tap device, exiting");
}
/* Ignore frames of bad length */
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)ETH_MAX_MTU)
continue;
tap_add_packet(c, len, pkt_buf + n);
}
tap_handler(c, now);
}
/**
* tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
* @c: Execution context
@ -1060,46 +1121,11 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
void tap_handler_pasta(struct ctx *c, uint32_t events,
const struct timespec *now)
{
ssize_t n, len;
int ret;
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
die("Disconnect event on /dev/net/tun device, exiting");
redo:
n = 0;
tap_flush_pools();
restart:
while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)ETH_MAX_MTU) {
n += len;
continue;
}
tap_add_packet(c, len, pkt_buf + n);
if ((n += len) == TAP_BUF_BYTES)
break;
}
if (len < 0 && errno == EINTR)
goto restart;
ret = errno;
tap_handler(c, now);
if (len > 0 || ret == EAGAIN)
return;
if (n == TAP_BUF_BYTES)
goto redo;
die("Error on tap device, exiting");
if (events & EPOLLIN)
tap_pasta_input(c, now);
}
/**
@ -1110,7 +1136,7 @@ restart:
*/
int tap_sock_unix_open(char *sock_path)
{
int fd = socket(AF_UNIX, SOCK_STREAM, 0);
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
};
@ -1125,10 +1151,12 @@ int tap_sock_unix_open(char *sock_path)
if (*sock_path)
memcpy(path, sock_path, UNIX_PATH_MAX);
else
snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
else if (snprintf_check(path, UNIX_PATH_MAX - 1,
UNIX_SOCK_PATH, i))
die_perror("Can't build UNIX domain socket path");
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
0);
if (ex < 0)
die_perror("Failed to check for UNIX domain conflicts");
@ -1261,7 +1289,7 @@ static int tap_ns_tun(void *arg)
if (fd < 0)
die_perror("Failed to open() /dev/net/tun");
rc = ioctl(fd, TUNSETIFF, &ifr);
rc = ioctl(fd, (int)TUNSETIFF, &ifr);
if (rc < 0)
die_perror("TUNSETIFF ioctl on /dev/net/tun failed");

2
tap.h
View file

@ -53,7 +53,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
uint32_t flow, const void *in, size_t dlen);
uint32_t flow, void *in, size_t dlen);
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t l4len);

433
tcp.c
View file

@ -274,6 +274,7 @@
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
@ -286,8 +287,6 @@
#include <time.h>
#include <arpa/inet.h>
#include <linux/tcp.h> /* For struct tcp_info */
#include "checksum.h"
#include "util.h"
#include "iov.h"
@ -300,6 +299,7 @@
#include "log.h"
#include "inany.h"
#include "flow.h"
#include "linux_dep.h"
#include "flow_table.h"
#include "tcp_internal.h"
@ -308,11 +308,6 @@
/* MSS rounding: see SET_MSS() */
#define MSS_DEFAULT 536
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
#ifdef HAS_SND_WND
# define KERNEL_REPORTS_SND_WND(c) ((c)->tcp.kernel_snd_wnd)
#else
# define KERNEL_REPORTS_SND_WND(c) (0 && (c))
#endif
#define ACK_INTERVAL 10 /* ms */
#define SYN_TIMEOUT 10 /* s */
@ -323,11 +318,6 @@
#define LOW_RTT_TABLE_SIZE 8
#define LOW_RTT_THRESHOLD 10 /* us */
/* We need to include <linux/tcp.h> for tcpi_bytes_acked, instead of
* <netinet/tcp.h>, but that doesn't include a definition for SOL_TCP
*/
#define SOL_TCP IPPROTO_TCP
#define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
#define CONN_IS_CLOSING(conn) \
@ -371,6 +361,20 @@ char tcp_buf_discard [MAX_WINDOW];
/* Does the kernel support TCP_PEEK_OFF? */
bool peek_offset_cap;
/* Size of data returned by TCP_INFO getsockopt() */
socklen_t tcp_info_size;
#define tcp_info_cap(f_) \
((offsetof(struct tcp_info_linux, tcpi_##f_) + \
sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
/* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
#define snd_wnd_cap tcp_info_cap(snd_wnd)
/* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
#define bytes_acked_cap tcp_info_cap(bytes_acked)
/* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
#define min_rtt_cap tcp_info_cap(min_rtt)
/* sendmsg() to socket */
static struct iovec tcp_iov [UIO_MAXIOV];
@ -440,7 +444,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
if (events == TAP_SYN_RCVD)
return EPOLLOUT | EPOLLET | EPOLLRDHUP;
return EPOLLRDHUP;
return EPOLLET | EPOLLRDHUP;
}
/**
@ -545,7 +549,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
(unsigned long long)it.it_value.tv_sec,
(unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
timerfd_settime(conn->timer, 0, &it, NULL);
if (timerfd_settime(conn->timer, 0, &it, NULL))
flow_err(conn, "failed to set timer: %s", strerror(errno));
}
/**
@ -675,13 +680,12 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
* @tinfo: Pointer to struct tcp_info for socket
*/
static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
const struct tcp_info *tinfo)
const struct tcp_info_linux *tinfo)
{
#ifdef HAS_MIN_RTT
const struct flowside *tapside = TAPFLOW(conn);
int i, hole = -1;
if (!tinfo->tcpi_min_rtt ||
if (!min_rtt_cap ||
(int)tinfo->tcpi_min_rtt > LOW_RTT_THRESHOLD)
return;
@ -702,10 +706,6 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
if (hole == LOW_RTT_TABLE_SIZE)
hole = 0;
inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
#else
(void)conn;
(void)tinfo;
#endif /* HAS_MIN_RTT */
}
/**
@ -752,34 +752,106 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
}
/**
* tcp_update_check_tcp4() - Update TCP checksum from stored one
* tcp_update_check_tcp4() - Calculate TCP checksum for IPv4
* @iph: IPv4 header
* @th: TCP header followed by TCP payload
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @l4offset: IPv4 payload offset in the iovec array
*/
static void tcp_update_check_tcp4(const struct iphdr *iph, struct tcphdr *th)
static void tcp_update_check_tcp4(const struct iphdr *iph,
const struct iovec *iov, int iov_cnt,
size_t l4offset)
{
uint16_t l4len = ntohs(iph->tot_len) - sizeof(struct iphdr);
struct in_addr saddr = { .s_addr = iph->saddr };
struct in_addr daddr = { .s_addr = iph->daddr };
uint32_t sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP, saddr, daddr);
size_t check_ofs;
uint16_t *check;
int check_idx;
uint32_t sum;
char *ptr;
th->check = 0;
th->check = csum(th, l4len, sum);
sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP, saddr, daddr);
check_idx = iov_skip_bytes(iov, iov_cnt,
l4offset + offsetof(struct tcphdr, check),
&check_ofs);
if (check_idx >= iov_cnt) {
err("TCP4 buffer is too small, iov size %zd, check offset %zd",
iov_size(iov, iov_cnt),
l4offset + offsetof(struct tcphdr, check));
return;
}
if (check_ofs + sizeof(*check) > iov[check_idx].iov_len) {
err("TCP4 checksum field memory is not contiguous "
"check_ofs %zd check_idx %d iov_len %zd",
check_ofs, check_idx, iov[check_idx].iov_len);
return;
}
ptr = (char *)iov[check_idx].iov_base + check_ofs;
if ((uintptr_t)ptr & (__alignof__(*check) - 1)) {
err("TCP4 checksum field is not correctly aligned in memory");
return;
}
check = (uint16_t *)ptr;
*check = 0;
*check = csum_iov(iov, iov_cnt, l4offset, sum);
}
/**
* tcp_update_check_tcp6() - Calculate TCP checksum for IPv6
* @ip6h: IPv6 header
* @th: TCP header followed by TCP payload
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @l4offset: IPv6 payload offset in the iovec array
*/
static void tcp_update_check_tcp6(struct ipv6hdr *ip6h, struct tcphdr *th)
static void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
const struct iovec *iov, int iov_cnt,
size_t l4offset)
{
uint16_t l4len = ntohs(ip6h->payload_len);
uint32_t sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
&ip6h->saddr, &ip6h->daddr);
size_t check_ofs;
uint16_t *check;
int check_idx;
uint32_t sum;
char *ptr;
th->check = 0;
th->check = csum(th, l4len, sum);
sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP, &ip6h->saddr,
&ip6h->daddr);
check_idx = iov_skip_bytes(iov, iov_cnt,
l4offset + offsetof(struct tcphdr, check),
&check_ofs);
if (check_idx >= iov_cnt) {
err("TCP6 buffer is too small, iov size %zd, check offset %zd",
iov_size(iov, iov_cnt),
l4offset + offsetof(struct tcphdr, check));
return;
}
if (check_ofs + sizeof(*check) > iov[check_idx].iov_len) {
err("TCP6 checksum field memory is not contiguous "
"check_ofs %zd check_idx %d iov_len %zd",
check_ofs, check_idx, iov[check_idx].iov_len);
return;
}
ptr = (char *)iov[check_idx].iov_base + check_ofs;
if ((uintptr_t)ptr & (__alignof__(*check) - 1)) {
err("TCP6 checksum field is not correctly aligned in memory");
return;
}
check = (uint16_t *)ptr;
*check = 0;
*check = csum_iov(iov, iov_cnt, l4offset, sum);
}
/**
@ -865,7 +937,6 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn)
/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
void tcp_defer_handler(struct ctx *c)
{
tcp_flags_flush(c);
tcp_payload_flush(c);
}
@ -896,26 +967,27 @@ static void tcp_fill_header(struct tcphdr *th,
/**
* tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
* @conn: Connection pointer
* @taph: tap backend specific header
* @iph: Pointer to IPv4 header
* @th: Pointer to TCP header
* @dlen: TCP payload length
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @conn: Connection pointer
* @taph: tap backend specific header
* @iph: Pointer to IPv4 header
* @bp: Pointer to TCP header followed by TCP payload
* @dlen: TCP payload length
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*
* Return: The IPv4 payload length, host order
*/
static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
struct tap_hdr *taph,
struct iphdr *iph, struct tcphdr *th,
struct iphdr *iph, struct tcp_payload_t *bp,
size_t dlen, const uint16_t *check,
uint32_t seq)
uint32_t seq, bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
const struct in_addr *src4 = inany_v4(&tapside->oaddr);
const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
size_t l4len = dlen + sizeof(*th);
size_t l4len = dlen + sizeof(bp->th);
size_t l3len = l4len + sizeof(*iph);
ASSERT(src4 && dst4);
@ -927,9 +999,18 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
iph->check = check ? *check :
csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4);
tcp_fill_header(th, conn, seq);
tcp_fill_header(&bp->th, conn, seq);
tcp_update_check_tcp4(iph, th);
if (no_tcp_csum) {
bp->th.check = 0;
} else {
const struct iovec iov = {
.iov_base = bp,
.iov_len = ntohs(iph->tot_len) - sizeof(struct iphdr),
};
tcp_update_check_tcp4(iph, &iov, 1, 0);
}
tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
@ -938,23 +1019,24 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
/**
* tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
* @conn: Connection pointer
* @taph: tap backend specific header
* @ip6h: Pointer to IPv6 header
* @th: Pointer to TCP header
* @dlen: TCP payload length
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @conn: Connection pointer
* @taph: tap backend specific header
* @ip6h: Pointer to IPv6 header
* @bp: Pointer to TCP header followed by TCP payload
* @dlen: TCP payload length
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*
* Return: The IPv6 payload length, host order
*/
static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
struct tap_hdr *taph,
struct ipv6hdr *ip6h, struct tcphdr *th,
size_t dlen, uint32_t seq)
struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
size_t dlen, uint32_t seq, bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
size_t l4len = dlen + sizeof(*th);
size_t l4len = dlen + sizeof(bp->th);
ip6h->payload_len = htons(l4len);
ip6h->saddr = tapside->oaddr.a6;
@ -968,9 +1050,18 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
tcp_fill_header(th, conn, seq);
tcp_fill_header(&bp->th, conn, seq);
tcp_update_check_tcp6(ip6h, th);
if (no_tcp_csum) {
bp->th.check = 0;
} else {
const struct iovec iov = {
.iov_base = bp,
.iov_len = ntohs(ip6h->payload_len)
};
tcp_update_check_tcp6(ip6h, &iov, 1, 0);
}
tap_hdr_update(taph, l4len + sizeof(*ip6h) + sizeof(struct ethhdr));
@ -984,12 +1075,14 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
* @dlen: TCP payload length
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*
* Return: IP payload length, host order
*/
size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, size_t dlen,
const uint16_t *check, uint32_t seq)
const uint16_t *check, uint32_t seq,
bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
const struct in_addr *a4 = inany_v4(&tapside->oaddr);
@ -998,13 +1091,13 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_IP].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base, dlen,
check, seq);
check, seq, no_tcp_csum);
}
return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_IP].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base, dlen,
seq);
seq, no_tcp_csum);
}
/**
@ -1017,42 +1110,41 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
* Return: 1 if sequence or window were updated, 0 otherwise
*/
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
int force_seq, struct tcp_info *tinfo)
bool force_seq, struct tcp_info_linux *tinfo)
{
uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap;
uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
/* cppcheck-suppress [ctunullpointer, unmatchedSuppression] */
socklen_t sl = sizeof(*tinfo);
struct tcp_info tinfo_new;
struct tcp_info_linux tinfo_new;
uint32_t new_wnd_to_tap = prev_wnd_to_tap;
int s = conn->sock;
#ifndef HAS_BYTES_ACKED
(void)force_seq;
conn->seq_ack_to_tap = conn->seq_from_tap;
if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
conn->seq_ack_to_tap = prev_ack_to_tap;
#else
if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL || tcp_rtt_dst_low(conn)
|| CONN_IS_CLOSING(conn) || (conn->flags & LOCAL) || force_seq) {
if (!bytes_acked_cap) {
conn->seq_ack_to_tap = conn->seq_from_tap;
} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
if (!tinfo) {
tinfo = &tinfo_new;
if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
return 0;
}
conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
conn->seq_init_from_tap;
if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
conn->seq_ack_to_tap = prev_ack_to_tap;
}
#endif /* !HAS_BYTES_ACKED */
} else {
if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
(conn->flags & LOCAL) || force_seq) {
conn->seq_ack_to_tap = conn->seq_from_tap;
} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
if (!tinfo) {
tinfo = &tinfo_new;
if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
return 0;
}
if (!KERNEL_REPORTS_SND_WND(c)) {
conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
conn->seq_init_from_tap;
if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
conn->seq_ack_to_tap = prev_ack_to_tap;
}
}
if (!snd_wnd_cap) {
tcp_get_sndbuf(conn);
new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
@ -1063,14 +1155,13 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
if (!tinfo) {
if (prev_wnd_to_tap > WINDOW_DEFAULT) {
goto out;
}
}
tinfo = &tinfo_new;
if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
goto out;
}
}
}
#ifdef HAS_SND_WND
if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
new_wnd_to_tap = tinfo->tcpi_snd_wnd;
} else {
@ -1078,7 +1169,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
SNDBUF_GET(conn));
}
#endif
new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
if (!(conn->events & ESTABLISHED))
@ -1136,11 +1226,11 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
* 0 if there is no flag to send
* 1 otherwise
*/
int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, char *data,
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen)
{
struct tcp_info tinfo = { 0 };
struct tcp_info_linux tinfo = { 0 };
socklen_t sl = sizeof(tinfo);
int s = conn->sock;
@ -1153,27 +1243,16 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
return -ECONNRESET;
}
#ifdef HAS_SND_WND
if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd)
c->tcp.kernel_snd_wnd = 1;
#endif
if (!(conn->flags & LOCAL))
tcp_rtt_dst_check(conn, &tinfo);
if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
if (!tcp_update_seqack_wnd(c, conn, !!flags, &tinfo) && !flags)
return 0;
*optlen = 0;
if (flags & SYN) {
int mss;
/* Options: MSS, NOP and window scale (8 bytes) */
*optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
*data++ = OPT_MSS;
*data++ = OPT_MSS_LEN;
if (c->mtu == -1) {
mss = tinfo.tcpi_snd_mss;
} else {
@ -1189,16 +1268,11 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
else if (mss > PAGE_SIZE)
mss = ROUND_DOWN(mss, PAGE_SIZE);
}
*(uint16_t *)data = htons(MIN(USHRT_MAX, mss));
data += OPT_MSS_LEN - 2;
conn->ws_to_tap = MIN(MAX_WS, tinfo.tcpi_snd_wscale);
*data++ = OPT_NOP;
*data++ = OPT_WS;
*data++ = OPT_WS_LEN;
*data++ = conn->ws_to_tap;
*opts = TCP_SYN_OPTS(mss, conn->ws_to_tap);
*optlen = sizeof(*opts);
} else if (!(flags & RST)) {
flags |= ACK;
}
@ -1235,7 +1309,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
int flags)
{
return tcp_buf_send_flag(c, conn, flags);
}
@ -1245,7 +1320,7 @@ int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
* @c: Execution context
* @conn: Connection pointer
*/
void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
{
if (conn->events == CLOSED)
return;
@ -1335,7 +1410,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
{
int s;
s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
s = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, IPPROTO_TCP);
if (s > FD_REF_MAX) {
close(s);
@ -1463,7 +1538,7 @@ static void tcp_bind_outbound(const struct ctx *c,
* @optlen: Bytes in options: caller MUST ensure available length
* @now: Current timestamp
*/
static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
const void *saddr, const void *daddr,
const struct tcphdr *th, const char *opts,
size_t optlen, const struct timespec *now)
@ -1628,7 +1703,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
*
* #syscalls recvmsg
*/
static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
return tcp_buf_data_from_sock(c, conn);
}
@ -1644,8 +1719,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
*
* Return: count of consumed packets
*/
static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
const struct pool *p, int idx)
static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
const struct pool *p, int idx)
{
int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
uint16_t max_ack_seq_wnd = conn->wnd_from_tap;
@ -1842,7 +1917,8 @@ out:
* @opts: Pointer to start of options
* @optlen: Bytes in options: caller MUST ensure available length
*/
static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
static void tcp_conn_from_sock_finish(const struct ctx *c,
struct tcp_tap_conn *conn,
const struct tcphdr *th,
const char *opts, size_t optlen)
{
@ -1865,11 +1941,12 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
return;
}
tcp_send_flag(c, conn, ACK);
/* The client might have sent data already, which we didn't
* dequeue waiting for SYN,ACK from tap -- check now.
*/
tcp_data_from_sock(c, conn);
tcp_send_flag(c, conn, ACK);
}
/**
@ -1885,7 +1962,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
*
* Return: count of consumed packets
*/
int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now)
{
@ -2023,7 +2100,7 @@ reset:
* @c: Execution context
* @conn: Connection pointer
*/
static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn)
{
socklen_t sl;
int so;
@ -2049,8 +2126,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
* @sa: Peer socket address (from accept())
* @now: Current timestamp
*/
static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
const struct timespec *now)
static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
int s, const struct timespec *now)
{
struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
uint64_t hash;
@ -2081,7 +2158,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
* @ref: epoll reference of listening socket
* @now: Current timestamp
*/
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
const struct timespec *now)
{
const struct flowside *ini;
@ -2146,7 +2223,7 @@ cancel:
*
* #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64
*/
void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
{
struct itimerspec check_armed = { { 0 }, { 0 } };
struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp;
@ -2158,7 +2235,9 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
* timer is currently armed, this event came from a previous setting,
* and we just set the timer to a new point in the future: discard it.
*/
timerfd_gettime(conn->timer, &check_armed);
if (timerfd_gettime(conn->timer, &check_armed))
flow_err(conn, "failed to read timer: %s", strerror(errno));
if (check_armed.it_value.tv_sec || check_armed.it_value.tv_nsec)
return;
@ -2196,7 +2275,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
* case. This avoids having to preemptively reset the timer on
* ~ACK_TO_TAP_DUE or ~ACK_FROM_TAP_DUE.
*/
timerfd_settime(conn->timer, 0, &new, &old);
if (timerfd_settime(conn->timer, 0, &new, &old))
flow_err(conn, "failed to set timer: %s",
strerror(errno));
if (old.it_value.tv_sec == ACT_TIMEOUT) {
flow_dbg(conn, "activity timeout");
tcp_rst(c, conn);
@ -2210,7 +2292,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
* @ref: epoll reference
* @events: epoll events bitmap
*/
void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events)
{
struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside);
@ -2241,7 +2324,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
tcp_data_from_sock(c, conn);
if (events & EPOLLOUT)
tcp_update_seqack_wnd(c, conn, 0, NULL);
tcp_update_seqack_wnd(c, conn, false, NULL);
return;
}
@ -2264,17 +2347,16 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
}
/**
* tcp_sock_init_af() - Initialise listening socket for a given af and port
* tcp_sock_init_one() - Initialise listening socket for address and port
* @c: Execution context
* @af: Address family to listen on
* @port: Port, host order
* @addr: Pointer to address for binding, NULL if not configured
* @addr: Pointer to address for binding, NULL for dual stack any
* @ifname: Name of interface to bind to, NULL if not configured
* @port: Port, host order
*
* Return: fd for the new listening socket, negative error code on failure
*/
static int tcp_sock_init_af(const struct ctx *c, sa_family_t af, in_port_t port,
const void *addr, const char *ifname)
static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port)
{
union tcp_listen_epoll_ref tref = {
.port = port,
@ -2282,12 +2364,13 @@ static int tcp_sock_init_af(const struct ctx *c, sa_family_t af, in_port_t port,
};
int s;
s = sock_l4(c, af, EPOLL_TYPE_TCP_LISTEN, addr, ifname, port, tref.u32);
s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_HOST, addr,
ifname, port, tref.u32);
if (c->tcp.fwd_in.mode == FWD_AUTO) {
if (af == AF_INET || af == AF_UNSPEC)
if (!addr || inany_v4(addr))
tcp_sock_init_ext[port][V4] = s < 0 ? -1 : s;
if (af == AF_INET6 || af == AF_UNSPEC)
if (!addr || !inany_v4(addr))
tcp_sock_init_ext[port][V6] = s < 0 ? -1 : s;
}
@ -2301,31 +2384,32 @@ static int tcp_sock_init_af(const struct ctx *c, sa_family_t af, in_port_t port,
/**
* tcp_sock_init() - Create listening sockets for a given host ("inbound") port
* @c: Execution context
* @af: Address family to select a specific IP version, or AF_UNSPEC
* @addr: Pointer to address for binding, NULL if not configured
* @ifname: Name of interface to bind to, NULL if not configured
* @port: Port, host order
*
* Return: 0 on (partial) success, negative error code on (complete) failure
*/
int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port)
{
int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
ASSERT(!c->no_tcp);
if (af == AF_UNSPEC && c->ifi4 && c->ifi6)
if (!addr && c->ifi4 && c->ifi6)
/* Attempt to get a dual stack socket */
if (tcp_sock_init_af(c, AF_UNSPEC, port, addr, ifname) >= 0)
if (tcp_sock_init_one(c, NULL, ifname, port) >= 0)
return 0;
/* Otherwise create a socket per IP version */
if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4)
r4 = tcp_sock_init_af(c, AF_INET, port, addr, ifname);
if ((!addr || inany_v4(addr)) && c->ifi4)
r4 = tcp_sock_init_one(c, addr ? addr : &inany_any4,
ifname, port);
if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
r6 = tcp_sock_init_af(c, AF_INET6, port, addr, ifname);
if ((!addr || !inany_v4(addr)) && c->ifi6)
r6 = tcp_sock_init_one(c, addr ? addr : &inany_any6,
ifname, port);
if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
return 0;
@ -2348,8 +2432,8 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
ASSERT(c->mode == MODE_PASTA);
s = sock_l4(c, AF_INET, EPOLL_TYPE_TCP_LISTEN, &in4addr_loopback,
NULL, port, tref.u32);
s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback4,
NULL, port, tref.u32);
if (s >= 0)
tcp_sock_set_bufsize(c, s);
else
@ -2374,8 +2458,8 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
ASSERT(c->mode == MODE_PASTA);
s = sock_l4(c, AF_INET6, EPOLL_TYPE_TCP_LISTEN, &in6addr_loopback,
NULL, port, tref.u32);
s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback6,
NULL, port, tref.u32);
if (s >= 0)
tcp_sock_set_bufsize(c, s);
else
@ -2477,7 +2561,7 @@ static void tcp_sock_refill_init(const struct ctx *c)
*
* Return: true if supported, false otherwise
*/
bool tcp_probe_peek_offset_cap(sa_family_t af)
static bool tcp_probe_peek_offset_cap(sa_family_t af)
{
bool ret = false;
int s, optv = 0;
@ -2494,6 +2578,34 @@ bool tcp_probe_peek_offset_cap(sa_family_t af)
return ret;
}
/**
* tcp_probe_tcp_info() - Check what data TCP_INFO reports
*
* Return: Number of bytes returned by TCP_INFO getsockopt()
*/
static socklen_t tcp_probe_tcp_info(void)
{
struct tcp_info_linux tinfo;
socklen_t sl = sizeof(tinfo);
int s;
s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
if (s < 0) {
warn_perror("Temporary TCP socket creation failed");
return false;
}
if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
warn_perror("Failed to get TCP_INFO on temporary socket");
close(s);
return false;
}
close(s);
return sl;
}
/**
* tcp_init() - Get initial sequence, hash secret, initialise per-socket data
* @c: Execution context
@ -2504,11 +2616,7 @@ int tcp_init(struct ctx *c)
{
ASSERT(!c->no_tcp);
if (c->ifi4)
tcp_sock4_iov_init(c);
if (c->ifi6)
tcp_sock6_iov_init(c);
tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
memset(init_sock_pool6, 0xff, sizeof(init_sock_pool6));
@ -2527,6 +2635,15 @@ int tcp_init(struct ctx *c)
(!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6));
debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
tcp_info_size = tcp_probe_tcp_info();
#define dbg_tcpi(f_) debug("TCP_INFO tcpi_%s field%s supported", \
STRINGIFY(f_), tcp_info_cap(f_) ? " " : " not ")
dbg_tcpi(snd_wnd);
dbg_tcpi(bytes_acked);
dbg_tcpi(min_rtt);
#undef dbg_tcpi
return 0;
}
@ -2568,7 +2685,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
if (outbound)
tcp_ns_sock_init(c, port);
else
tcp_sock_init(c, AF_UNSPEC, NULL, NULL, port);
tcp_sock_init(c, NULL, NULL, port);
}
}
}

15
tcp.h
View file

@ -10,14 +10,15 @@
struct ctx;
void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
const struct timespec *now);
void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events);
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port);
int tcp_init(struct ctx *c);
void tcp_timer(struct ctx *c, const struct timespec *now);
@ -58,16 +59,12 @@ union tcp_listen_epoll_ref {
* @fwd_in: Port forwarding configuration for inbound packets
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035)
* @pipe_size: Size of pipes for spliced connections
*/
struct tcp_ctx {
struct fwd_ports fwd_in;
struct fwd_ports fwd_out;
struct timespec timer_run;
#ifdef HAS_SND_WND
int kernel_snd_wnd;
#endif
size_t pipe_size;
};

356
tcp_buf.c
View file

@ -20,7 +20,7 @@
#include <netinet/ip.h>
#include <linux/tcp.h>
#include <netinet/tcp.h>
#include "util.h"
#include "ip.h"
@ -38,88 +38,32 @@
(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
/* Static buffers */
/**
* struct tcp_payload_t - TCP header and data to send segments with payload
* @th: TCP header
* @data: TCP data
*/
struct tcp_payload_t {
struct tcphdr th;
uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32))); /* For AVX2 checksum routines */
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/**
* struct tcp_flags_t - TCP header and data to send zero-length
* segments (flags)
* @th: TCP header
* @opts TCP options
*/
struct tcp_flags_t {
struct tcphdr th;
char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32)));
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/* Ethernet header for IPv4 frames */
/* Ethernet header for IPv4 and IPv6 frames */
static struct ethhdr tcp4_eth_src;
static struct tap_hdr tcp4_payload_tap_hdr[TCP_FRAMES_MEM];
/* IPv4 headers */
static struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 frames */
static struct tcp_payload_t tcp4_payload[TCP_FRAMES_MEM];
static_assert(MSS4 <= sizeof(tcp4_payload[0].data), "MSS4 is greater than 65516");
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp4_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp4_payload_used;
static struct tap_hdr tcp4_flags_tap_hdr[TCP_FRAMES_MEM];
/* IPv4 headers for TCP segment without payload */
static struct iphdr tcp4_flags_ip[TCP_FRAMES_MEM];
/* TCP segments without payload for IPv4 frames */
static struct tcp_flags_t tcp4_flags[TCP_FRAMES_MEM];
static unsigned int tcp4_flags_used;
/* Ethernet header for IPv6 frames */
static struct ethhdr tcp6_eth_src;
static struct tap_hdr tcp6_payload_tap_hdr[TCP_FRAMES_MEM];
/* IPv6 headers */
static struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* TCP headers and data for IPv6 frames */
static struct tcp_payload_t tcp6_payload[TCP_FRAMES_MEM];
static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
static_assert(MSS6 <= sizeof(tcp6_payload[0].data), "MSS6 is greater than 65516");
/* IP headers for IPv4 and IPv6 */
struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 and IPv6 frames */
static struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp6_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp6_payload_used;
static struct tap_hdr tcp6_flags_tap_hdr[TCP_FRAMES_MEM];
/* IPv6 headers for TCP segment without payload */
static struct ipv6hdr tcp6_flags_ip[TCP_FRAMES_MEM];
/* TCP segment without payload for IPv6 frames */
static struct tcp_flags_t tcp6_flags[TCP_FRAMES_MEM];
static unsigned int tcp6_flags_used;
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp_payload_used;
/* recvmsg()/sendmsg() data for tap */
static struct iovec iov_sock [TCP_FRAMES_MEM + 1];
static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
* tcp_update_l2_buf() - Update Ethernet header buffers with addresses
* @eth_d: Ethernet destination address, NULL if unchanged
@ -132,105 +76,30 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
}
/**
* tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* @c: Execution context
*/
void tcp_sock4_iov_init(const struct ctx *c)
{
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
struct iovec *iov;
int i;
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
for (i = 0; i < ARRAY_SIZE(tcp4_payload); i++) {
tcp4_payload_ip[i] = iph;
tcp4_payload[i].th.doff = sizeof(struct tcphdr) / 4;
tcp4_payload[i].th.ack = 1;
}
for (i = 0; i < ARRAY_SIZE(tcp4_flags); i++) {
tcp4_flags_ip[i] = iph;
tcp4_flags[i].th.doff = sizeof(struct tcphdr) / 4;
tcp4_flags[i].th.ack = 1;
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp4_l2_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp4_payload_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp4_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_payload[i];
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp4_l2_flags_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp4_flags_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp4_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_flags_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_flags[i];
}
}
/**
* tcp_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
* @c: Execution context
*/
void tcp_sock6_iov_init(const struct ctx *c)
void tcp_sock_iov_init(const struct ctx *c)
{
struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
struct iovec *iov;
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
int i;
tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
for (i = 0; i < ARRAY_SIZE(tcp6_payload); i++) {
for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) {
tcp6_payload_ip[i] = ip6;
tcp6_payload[i].th.doff = sizeof(struct tcphdr) / 4;
tcp6_payload[i].th.ack = 1;
}
for (i = 0; i < ARRAY_SIZE(tcp6_flags); i++) {
tcp6_flags_ip[i] = ip6;
tcp6_flags[i].th.doff = sizeof(struct tcphdr) / 4;
tcp6_flags[i].th .ack = 1;
tcp4_payload_ip[i] = iph;
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp6_l2_iov[i];
struct iovec *iov = tcp_l2_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp6_payload_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp6_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_payload[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp6_l2_flags_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp6_flags_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp6_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_flags_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_flags[i];
}
}
/**
* tcp_flags_flush() - Send out buffers for segments with no data (flags)
* @c: Execution context
*/
void tcp_flags_flush(const struct ctx *c)
{
tap_send_frames(c, &tcp6_l2_flags_iov[0][0], TCP_NUM_IOVS,
tcp6_flags_used);
tcp6_flags_used = 0;
tap_send_frames(c, &tcp4_l2_flags_iov[0][0], TCP_NUM_IOVS,
tcp4_flags_used);
tcp4_flags_used = 0;
}
/**
@ -240,7 +109,7 @@ void tcp_flags_flush(const struct ctx *c)
* @frames: Two-dimensional array containing queued frames with sub-iovs
* @num_frames: Number of entries in the two arrays to be compared
*/
static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
{
int i;
@ -262,28 +131,20 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
}
/**
* tcp_payload_flush() - Send out buffers for segments with data
* tcp_payload_flush() - Send out buffers for segments with data or flags
* @c: Execution context
*/
void tcp_payload_flush(struct ctx *c)
void tcp_payload_flush(const struct ctx *c)
{
size_t m;
m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
tcp6_payload_used);
if (m != tcp6_payload_used) {
tcp_revert_seq(c, &tcp6_frame_conns[m], &tcp6_l2_iov[m],
tcp6_payload_used - m);
m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
tcp_payload_used);
if (m != tcp_payload_used) {
tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
tcp_payload_used - m);
}
tcp6_payload_used = 0;
m = tap_send_frames(c, &tcp4_l2_iov[0][0], TCP_NUM_IOVS,
tcp4_payload_used);
if (m != tcp4_payload_used) {
tcp_revert_seq(c, &tcp4_frame_conns[m], &tcp4_l2_iov[m],
tcp4_payload_used - m);
}
tcp4_payload_used = 0;
tcp_payload_used = 0;
}
/**
@ -294,58 +155,48 @@ void tcp_payload_flush(struct ctx *c)
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
struct tcp_flags_t *payload;
struct tcp_payload_t *payload;
struct iovec *iov;
size_t optlen;
size_t l4len;
uint32_t seq;
int ret;
if (CONN_V4(conn))
iov = tcp4_l2_flags_iov[tcp4_flags_used++];
else
iov = tcp6_l2_flags_iov[tcp6_flags_used++];
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
seq = conn->seq_to_tap;
ret = tcp_prepare_flags(c, conn, flags, &payload->th,
payload->opts, &optlen);
if (ret <= 0) {
if (CONN_V4(conn))
tcp4_flags_used--;
else
tcp6_flags_used--;
(struct tcp_syn_opts *)&payload->data, &optlen);
if (ret <= 0)
return ret;
}
l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
tcp_payload_used++;
l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq, false);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (flags & DUP_ACK) {
struct iovec *dup_iov;
int i;
struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
if (CONN_V4(conn))
dup_iov = tcp4_l2_flags_iov[tcp4_flags_used++];
else
dup_iov = tcp6_l2_flags_iov[tcp6_flags_used++];
for (i = 0; i < TCP_NUM_IOVS; i++)
memcpy(dup_iov[i].iov_base, iov[i].iov_base,
iov[i].iov_len);
dup_iov[TCP_IOV_PAYLOAD].iov_len = iov[TCP_IOV_PAYLOAD].iov_len;
memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_TAP].iov_len);
dup_iov[TCP_IOV_ETH].iov_base = iov[TCP_IOV_ETH].iov_base;
dup_iov[TCP_IOV_IP] = iov[TCP_IOV_IP];
memcpy(dup_iov[TCP_IOV_PAYLOAD].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base, l4len);
dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
}
if (CONN_V4(conn)) {
if (tcp4_flags_used > TCP_FRAMES_MEM - 2)
tcp_flags_flush(c);
} else {
if (tcp6_flags_used > TCP_FRAMES_MEM - 2)
tcp_flags_flush(c);
}
if (tcp_payload_used > TCP_FRAMES_MEM - 2)
tcp_payload_flush(c);
return 0;
}
@ -358,39 +209,39 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
*/
static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ssize_t dlen, int no_csum, uint32_t seq)
{
struct tcp_payload_t *payload;
const uint16_t *check = NULL;
struct iovec *iov;
size_t l4len;
conn->seq_to_tap = seq + dlen;
tcp_frame_conns[tcp_payload_used] = conn;
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
struct iovec *iov_prev = tcp4_l2_iov[tcp4_payload_used - 1];
const uint16_t *check = NULL;
if (no_csum) {
struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
check = &iph->check;
}
tcp4_frame_conns[tcp4_payload_used] = conn;
iov = tcp4_l2_iov[tcp4_payload_used++];
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
tcp6_frame_conns[tcp6_payload_used] = conn;
iov = tcp6_l2_iov[tcp6_payload_used++];
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
payload->th.th_off = sizeof(struct tcphdr) / 4;
payload->th.th_x2 = 0;
payload->th.th_flags = 0;
payload->th.ack = 1;
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq, false);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
}
/**
@ -402,12 +253,11 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
*
* #syscalls recvmsg
*/
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
int sendlen, len, dlen, v4 = CONN_V4(conn);
int s = conn->sock, i, ret = 0;
int len, dlen, i, s = conn->sock;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
uint32_t already_sent, seq;
@ -454,19 +304,15 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
mh_sock.msg_iovlen = fill_bufs;
}
if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
(!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
tcp_payload_flush(c);
/* Silence Coverity CWE-125 false positive */
tcp4_payload_used = tcp6_payload_used = 0;
tcp_payload_used = 0;
}
for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
if (v4)
iov->iov_base = &tcp4_payload[tcp4_payload_used + i].data;
else
iov->iov_base = &tcp6_payload[tcp6_payload_used + i].data;
iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
iov->iov_len = mss;
}
if (iov_rem)
@ -477,12 +323,19 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
len = recvmsg(s, &mh_sock, MSG_PEEK);
while (len < 0 && errno == EINTR);
if (len < 0)
goto err;
if (len < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
tcp_rst(c, conn);
return -errno;
}
return 0;
}
if (!len) {
if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
return ret;
}
@ -493,28 +346,27 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
return 0;
}
sendlen = len;
if (!peek_offset_cap)
sendlen -= already_sent;
len -= already_sent;
if (sendlen <= 0) {
if (len <= 0) {
conn_flag(c, conn, STALLED);
return 0;
}
conn_flag(c, conn, ~STALLED);
send_bufs = DIV_ROUND_UP(sendlen, mss);
last_len = sendlen - (send_bufs - 1) * mss;
send_bufs = DIV_ROUND_UP(len, mss);
last_len = len - (send_bufs - 1) * mss;
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, 0, NULL);
tcp_update_seqack_wnd(c, conn, false, NULL);
/* Finally, queue to tap */
dlen = mss;
seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp4_payload_used;
int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
if (i == send_bufs - 1)
dlen = last_len;
@ -526,12 +378,4 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
err:
if (errno != EAGAIN && errno != EWOULDBLOCK) {
ret = -errno;
tcp_rst(c, conn);
}
return ret;
}

View file

@ -6,11 +6,9 @@
#ifndef TCP_BUF_H
#define TCP_BUF_H
void tcp_sock4_iov_init(const struct ctx *c);
void tcp_sock6_iov_init(const struct ctx *c);
void tcp_flags_flush(const struct ctx *c);
void tcp_payload_flush(struct ctx *c);
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
void tcp_sock_iov_init(const struct ctx *c);
void tcp_payload_flush(const struct ctx *c);
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
#endif /*TCP_BUF_H */

View file

@ -33,9 +33,7 @@
#define OPT_EOL 0
#define OPT_NOP 1
#define OPT_MSS 2
#define OPT_MSS_LEN 4
#define OPT_WS 3
#define OPT_WS_LEN 3
#define OPT_SACKP 4
#define OPT_SACK 5
#define OPT_TS 8
@ -63,6 +61,79 @@ enum tcp_iov_parts {
TCP_NUM_IOVS
};
/**
* struct tcp_payload_t - TCP header and data to send segments with payload
* @th: TCP header
* @data: TCP data
*/
struct tcp_payload_t {
struct tcphdr th;
uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32))); /* For AVX2 checksum routines */
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/** struct tcp_opt_nop - TCP NOP option
* @kind: Option kind (OPT_NOP = 1)
*/
struct tcp_opt_nop {
uint8_t kind;
} __attribute__ ((packed));
#define TCP_OPT_NOP ((struct tcp_opt_nop){ .kind = OPT_NOP, })
/** struct tcp_opt_mss - TCP MSS option
* @kind: Option kind (OPT_MSS == 2)
* @len: Option length (4)
* @mss: Maximum Segment Size
*/
struct tcp_opt_mss {
uint8_t kind;
uint8_t len;
uint16_t mss;
} __attribute__ ((packed));
#define TCP_OPT_MSS(mss_) \
((struct tcp_opt_mss) { \
.kind = OPT_MSS, \
.len = sizeof(struct tcp_opt_mss), \
.mss = htons(mss_), \
})
/** struct tcp_opt_ws - TCP Window Scaling option
* @kind: Option kind (OPT_WS == 3)
* @len: Option length (3)
* @shift: Window scaling shift
*/
struct tcp_opt_ws {
uint8_t kind;
uint8_t len;
uint8_t shift;
} __attribute__ ((packed));
#define TCP_OPT_WS(shift_) \
((struct tcp_opt_ws) { \
.kind = OPT_WS, \
.len = sizeof(struct tcp_opt_ws), \
.shift = (shift_), \
})
/** struct tcp_syn_opts - TCP options we apply to SYN packets
* @mss: Maximum Segment Size (MSS) option
* @nop: NOP opt (for alignment)
* @ws: Window Scaling (WS) option
*/
struct tcp_syn_opts {
struct tcp_opt_mss mss;
struct tcp_opt_nop nop;
struct tcp_opt_ws ws;
} __attribute__ ((packed));
#define TCP_SYN_OPTS(mss_, ws_) \
((struct tcp_syn_opts){ \
.mss = TCP_OPT_MSS(mss_), \
.nop = TCP_OPT_NOP, \
.ws = TCP_OPT_WS(ws_), \
})
extern char tcp_buf_discard [MAX_WINDOW];
void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
@ -82,19 +153,23 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
conn_event_do(c, conn, event); \
} while (0)
void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
#define tcp_rst(c, conn) \
do { \
flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
tcp_rst_do(c, conn); \
} while (0)
struct tcp_info_linux;
size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, size_t dlen,
const uint16_t *check, uint32_t seq);
const uint16_t *check, uint32_t seq,
bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
int force_seq, struct tcp_info *tinfo);
int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
struct tcphdr *th, char *data, size_t *optlen);
bool force_seq, struct tcp_info_linux *tinfo);
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen);
#endif /* TCP_INTERNAL_H */

View file

@ -320,7 +320,7 @@ static int tcp_splice_connect_finish(const struct ctx *c,
}
if (fcntl(conn->pipe[sidei][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
flow_trace(conn,
"cannot set %d->%d pipe size to %zu",
sidei, !sidei, c->tcp.pipe_size);
@ -503,7 +503,7 @@ swap:
lowat_act_flag = RCVLOWAT_ACT(fromsidei);
while (1) {
ssize_t readlen, to_write = 0, written;
ssize_t readlen, written, pending;
int more = 0;
retry:
@ -518,14 +518,11 @@ retry:
if (errno != EAGAIN)
goto close;
to_write = c->tcp.pipe_size;
} else if (!readlen) {
eof = 1;
to_write = c->tcp.pipe_size;
} else {
never_read = 0;
to_write += readlen;
if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
more = SPLICE_F_MORE;
@ -535,10 +532,10 @@ retry:
eintr:
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL, to_write,
conn->s[!fromsidei], NULL, c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, to_write);
written, c->tcp.pipe_size);
/* Most common case: skip updating counters. */
if (readlen > 0 && readlen == written) {
@ -584,10 +581,9 @@ eintr:
if (never_read && written == (long)(c->tcp.pipe_size))
goto retry;
if (!never_read && written < to_write) {
to_write -= written;
pending = conn->read[fromsidei] - conn->written[fromsidei];
if (!never_read && written > 0 && written < pending)
goto retry;
}
if (eof)
break;
@ -676,7 +672,7 @@ static void tcp_splice_pipe_refill(const struct ctx *c)
continue;
if (fcntl(splice_pipe_pool[i][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
trace("TCP (spliced): cannot set pool pipe size to %zu",
c->tcp.pipe_size);
}

View file

@ -8,7 +8,6 @@
WGET = wget -c
DEBIAN_IMGS = debian-8.11.0-openstack-amd64.qcow2 \
debian-9-nocloud-amd64-daily-20200210-166.qcow2 \
debian-10-nocloud-amd64.qcow2 \
debian-10-generic-arm64.qcow2 \
debian-10-generic-ppc64el-20220911-1135.qcow2 \
@ -42,8 +41,7 @@ OPENSUSE_IMGS = openSUSE-Leap-15.1-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.2-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.3-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz \
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz \
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
UBUNTU_OLD_IMGS = trusty-server-cloudimg-amd64-disk1.img \
trusty-server-cloudimg-i386-disk1.img \
@ -135,9 +133,6 @@ realclean: clean
debian-8.11.0-openstack-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/OpenStack/archive/8.11.0/debian-8.11.0-openstack-$*.qcow2
debian-9-nocloud-%-daily-20200210-166.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/stretch/daily/20200210-166/debian-9-nocloud-$*-daily-20200210-166.qcow2
debian-10-nocloud-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/buster/latest/debian-10-nocloud-$*.qcow2
@ -203,9 +198,6 @@ openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz:
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz:
$(WGET) -O $@ http://download.opensuse.org/ports/armv7hl/tumbleweed/appliances/openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2:
$(WGET) -O $@ https://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
# Ubuntu downloads
trusty-server-cloudimg-%-disk1.img:
$(WGET) -O $@ https://cloud-images.ubuntu.com/trusty/current/trusty-server-cloudimg-$*-disk1.img

View file

@ -58,7 +58,7 @@ setup_passt() {
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -159,7 +159,7 @@ setup_passt_in_ns() {
' -machine accel=kvm' \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -230,7 +230,7 @@ setup_two_guests() {
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
@ -243,7 +243,7 @@ setup_two_guests() {
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \

View file

@ -31,8 +31,8 @@ PR_DELAY_INIT=100 # ms
# $@: Message to print
info() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> $STATEBASE/log_pipe
printf "${@}\n" >> "${LOGFILE}"
}
# info_n() - Highlight, print message to pane and to log file without newline
@ -47,13 +47,13 @@ info_n() {
# $@: Message to print
info_nolog() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
printf "${@}\n" >> $STATEBASE/log_pipe
}
# info_nolog() - Print message to log file
# $@: Message to print
log() {
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> "${LOGFILE}"
}
# info_nolog_n() - Send message to pane without highlighting it, without newline
@ -664,7 +664,7 @@ pause_continue() {
# run_term() - Start tmux session, running entry point, with recording if needed
run_term() {
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG"
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG -eTRACE=$TRACE -eKERNEL=$KERNEL"
if [ ${CI} -eq 1 ]; then
printf '\e[8;50;240t'

View file

@ -31,10 +31,15 @@
#define ARRAY_SIZE(a) ((int)(sizeof(a) / sizeof((a)[0])))
#define die(...) \
do { \
fprintf(stderr, __VA_ARGS__); \
exit(1); \
#define die(...) \
do { \
fprintf(stderr, "nstool: " __VA_ARGS__); \
exit(1); \
} while (0)
#define err(...) \
do { \
fprintf(stderr, "nstool: " __VA_ARGS__); \
} while (0)
struct ns_type {
@ -156,6 +161,9 @@ static int connect_ctl(const char *sockpath, bool wait,
static void cmd_hold(int argc, char *argv[])
{
struct sigaction sa = {
.sa_handler = SIG_IGN,
};
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, PF_UNIX);
struct sockaddr_un addr;
const char *sockpath = argv[1];
@ -185,6 +193,10 @@ static void cmd_hold(int argc, char *argv[])
if (!getcwd(info.cwd, sizeof(info.cwd)))
die("getcwd(): %s\n", strerror(errno));
rc = sigaction(SIGPIPE, &sa, NULL);
if (rc)
die("sigaction(SIGPIPE): %s\n", strerror(errno));
do {
int afd = accept(fd, NULL, NULL);
char buf;
@ -193,17 +205,21 @@ static void cmd_hold(int argc, char *argv[])
die("accept(): %s\n", strerror(errno));
rc = write(afd, &info, sizeof(info));
if (rc < 0)
die("write(): %s\n", strerror(errno));
if (rc < 0) {
err("holder write() to control socket: %s\n",
strerror(errno));
}
if ((size_t)rc < sizeof(info))
die("short write() on control socket\n");
err("holder short write() on control socket\n");
rc = read(afd, &buf, sizeof(buf));
if (rc < 0)
die("read(): %s\n", strerror(errno));
if (rc < 0) {
err("holder read() on control socket: %s\n",
strerror(errno));
}
close(afd);
} while (rc == 0);
} while (rc <= 0);
unlink(sockpath);
}
@ -346,7 +362,7 @@ static int openns(const char *fmt, ...)
}
static pid_t sig_pid;
static void sig_handler(int signum)
static void sig_propagate(int signum)
{
int err;
@ -358,7 +374,7 @@ static void sig_handler(int signum)
static void wait_for_child(pid_t pid)
{
struct sigaction sa = {
.sa_handler = sig_handler,
.sa_handler = sig_propagate,
.sa_flags = SA_RESETHAND,
};
int status, err;

View file

@ -49,6 +49,8 @@ check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCPv6: address
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR6__" = "__HOST_ADDR6__" ]

View file

@ -16,13 +16,15 @@ htools ip jq sipcalc grep cut
test Interface name
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
guest ip link set dev __IFNAME__ up && sleep 2
guest ip link set dev __IFNAME__ up
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME__" ]
test SLAAC: prefix
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local] | .[0]'
gout PREFIX6 sipcalc __ADDR6__/64 | grep prefix | cut -d' ' -f4
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .protocol == "kernel_ra") | .local + "/" + (.prefixlen | tostring)] | .[0]'
gout PREFIX6 sipcalc __ADDR6__ | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
hout HOST_PREFIX6 sipcalc __HOST_ADDR6__/64 | grep prefix | cut -d' ' -f4
check [ "__PREFIX6__" = "__HOST_PREFIX6__" ]

View file

@ -52,6 +52,8 @@ check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCPv6: address
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR6__" = "__HOST_ADDR6__" ]

View file

@ -32,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_big.bin /root/big.bin
test TCP/IPv4: host to ns: big transfer
test TCP/IPv4: host to ns (spliced): big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
@ -90,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_small.bin /root/small.bin
test TCP/IPv4: host to ns: small transfer
test TCP/IPv4: host to ns (spliced): small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
@ -146,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001
guestw
guest cmp test_big.bin /root/big.bin
test TCP/IPv6: host to ns: big transfer
test TCP/IPv6: host to ns (spliced): big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
@ -204,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001
guestw
guest cmp test_small.bin /root/small.bin
test TCP/IPv6: host to ns: small transfer
test TCP/IPv6: host to ns (spliced): small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002

View file

@ -30,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
test UDP/IPv4: host to ns
test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
@ -88,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
test UDP/IPv6: host to ns
test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null

View file

@ -35,6 +35,8 @@ check [ __MTU__ = 65520 ]
test DHCPv6: address
ns /sbin/dhclient -6 --no-pid __IFNAME__
# Wait for DAD to complete
ns while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'

View file

@ -18,11 +18,12 @@ test Interface name
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
check [ -n "__IFNAME__" ]
ns ip link set dev __IFNAME__ up
sleep 2
# Wait for DAD to complete
ns while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
test SLAAC: prefix
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local] | .[0]'
nsout PREFIX6 sipcalc __ADDR6__/64 | grep prefix | cut -d' ' -f4
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .protocol == "kernel_ra") | .local + "/" + (.prefixlen | tostring)] | .[0]'
nsout PREFIX6 sipcalc __ADDR6__ | grep prefix | cut -d' ' -f4
hout HOST_ADDR6 ip -j -6 addr show|jq -rM ['.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
hout HOST_PREFIX6 sipcalc __HOST_ADDR6__/64 | grep prefix | cut -d' ' -f4
check [ "__PREFIX6__" = "__HOST_PREFIX6__" ]

View file

@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin
set TEMP_SMALL __STATEDIR__/test_small.bin
set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin
test TCP/IPv4: host to ns: big transfer
nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc
test TCP/IPv4: host to ns (spliced): big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
test TCP/IPv4: host to ns: small transfer
nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc
test TCP/IPv4: host to ns (spliced): small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__
@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/small.bin __TEMP_SMALL__
test TCP/IPv6: host to ns: big transfer
nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc
test TCP/IPv6: host to ns (spliced): big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
test TCP/IPv6: host to ns: small transfer
nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc
test TCP/IPv6: host to ns (spliced): small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__

View file

@ -17,8 +17,8 @@ htools dd socat ip jq
set TEMP __STATEDIR__/test.bin
set TEMP_NS __STATEDIR__/test_ns.bin
test UDP/IPv4: host to ns
nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc
test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null
hostw
check cmp __BASEPATH__/medium.bin __TEMP__
test UDP/IPv6: host to ns
nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc
test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__

View file

@ -116,6 +116,8 @@ iperf3k ns
# Reducing MTU below 1280 deconfigures IPv6, get our address back
guest dhclient -6 -x
guest dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
tl TCP RR latency over IPv4: guest to host
lat -

View file

@ -211,7 +211,7 @@ tr TCP throughput over IPv6: host to ns
iperf3s ns 10002
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global").local'
bw -
bw -
bw -

View file

@ -196,7 +196,7 @@ tr UDP throughput over IPv6: host to ns
iperf3s ns 10002
nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global").local'
iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 8G -l 1472
bw __BW__ 0.3 0.5
iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 12G -l 3972

View file

@ -38,6 +38,9 @@ TRACE=${TRACE:-0}
# If set, tell passt and pasta to take packet captures
PCAP=${PCAP:-0}
# Custom kernel to boot guests with, if given
KERNEL=${KERNEL:-"/boot/vmlinuz-$(uname -r)"}
COMMIT="$(git log --oneline --no-decorate -1)"
. lib/util

View file

@ -36,9 +36,13 @@ check [ "__ADDR2__" = "__HOST_ADDR__" ]
test DHCPv6: addresses
# Link is up now, wait for DAD to complete
sleep 2
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest2 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
guest2 /sbin/dhclient -6 __IFNAME2__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest2 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
g2out ADDR2_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME2__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
@ -48,33 +52,33 @@ check [ "__ADDR2_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest 1 > guest 2
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
guest2b socat -u TCP4-LISTEN:10004 OPEN:msg,create,trunc
sleep 1
guest1 echo "Hello_from_guest_1" | socat -u STDIN TCP4:__GW1__:10004
guest2w
sleep 1
g2out MSG2 cat msg
check [ "__MSG2__" = "Hello_from_guest_1" ]
test TCP/IPv6: guest 2 > guest 1
g2out GW2_6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
guest1b socat -u TCP6-LISTEN:10001 OPEN:msg,create,trunc
sleep 1
guest2 echo "Hello_from_guest_2" | socat -u STDIN TCP6:[__GW2_6__%__IFNAME2__]:10001
guest1w
sleep 1
g1out MSG1 cat msg
check [ "__MSG1__" = "Hello_from_guest_2" ]
test UDP/IPv4: guest 1 > guest 2
guest2b socat -u TCP4-LISTEN:10004 OPEN:msg,create,trunc
sleep 1
guest1 echo "Hello_from_guest_1" | socat -u STDIN TCP4:__GW1__:10004
guest2w
sleep 1
g2out MSG2 cat msg
check [ "__MSG2__" = "Hello_from_guest_1" ]
test UDP/IPv6: guest 2 > guest 1
guest1b socat -u TCP6-LISTEN:10001 OPEN:msg,create,trunc
sleep 1
guest2 echo "Hello_from_guest_2" | socat -u STDIN TCP6:[__GW2_6__%__IFNAME2__]:10001
guest1w
sleep 1
g1out MSG1 cat msg
check [ "__MSG1__" = "Hello_from_guest_2" ]

230
udp.c
View file

@ -169,11 +169,11 @@ udp_meta[UDP_MAX_FRAMES];
* @UDP_NUM_IOVS the number of entries in the iovec array
*/
enum udp_iov_idx {
UDP_IOV_TAP = 0,
UDP_IOV_ETH = 1,
UDP_IOV_IP = 2,
UDP_IOV_PAYLOAD = 3,
UDP_NUM_IOVS
UDP_IOV_TAP,
UDP_IOV_ETH,
UDP_IOV_IP,
UDP_IOV_PAYLOAD,
UDP_NUM_IOVS,
};
/* IOVs and msghdr arrays for receiving datagrams from sockets */
@ -294,15 +294,17 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
/**
* udp_update_hdr4() - Update headers for one IPv4 datagram
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
* @bp: Pointer to udp_payload_t to update
* @toside: Flowside for destination side
* @dlen: Length of UDP payload
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
* @bp: Pointer to udp_payload_t to update
* @toside: Flowside for destination side
* @dlen: Length of UDP payload
* @no_udp_csum: Do not set UDP checksum
*
* Return: size of IPv4 payload (UDP header + data)
*/
static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen)
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
{
const struct in_addr *src = inany_v4(&toside->oaddr);
const struct in_addr *dst = inany_v4(&toside->eaddr);
@ -319,22 +321,33 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
bp->uh.source = htons(toside->oport);
bp->uh.dest = htons(toside->eport);
bp->uh.len = htons(l4len);
csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
if (no_udp_csum) {
bp->uh.check = 0;
} else {
const struct iovec iov = {
.iov_base = bp->data,
.iov_len = dlen
};
csum_udp4(&bp->uh, *src, *dst, &iov, 1, 0);
}
return l4len;
}
/**
* udp_update_hdr6() - Update headers for one IPv6 datagram
* @ip6h: Pre-filled IPv6 header (except for payload_len and addresses)
* @bp: Pointer to udp_payload_t to update
* @toside: Flowside for destination side
* @dlen: Length of UDP payload
* @ip6h: Pre-filled IPv6 header (except for payload_len and
* addresses)
* @bp: Pointer to udp_payload_t to update
* @toside: Flowside for destination side
* @dlen: Length of UDP payload
* @no_udp_csum: Do not set UDP checksum
*
* Return: size of IPv6 payload (UDP header + data)
*/
static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen)
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
{
uint16_t l4len = dlen + sizeof(bp->uh);
@ -348,7 +361,20 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
bp->uh.source = htons(toside->oport);
bp->uh.dest = htons(toside->eport);
bp->uh.len = ip6h->payload_len;
csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, bp->data, dlen);
if (no_udp_csum) {
/* 0 is an invalid checksum for UDP IPv6 and dropped by
* the kernel stack, even if the checksum is disabled by virtio
* flags. We need to put any non-zero value here.
*/
bp->uh.check = 0xffff;
} else {
const struct iovec iov = {
.iov_base = bp->data,
.iov_len = dlen
};
csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
&iov, 1, 0);
}
return l4len;
}
@ -358,9 +384,11 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
* @mmh: Receiving mmsghdr array
* @idx: Index of the datagram to prepare
* @toside: Flowside for destination side
* @no_udp_csum: Do not set UDP checksum
*/
static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
const struct flowside *toside)
static void udp_tap_prepare(const struct mmsghdr *mmh,
unsigned idx, const struct flowside *toside,
bool no_udp_csum)
{
struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
struct udp_payload_t *bp = &udp_payload[idx];
@ -368,13 +396,15 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
size_t l4len;
if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
mmh[idx].msg_len, no_udp_csum);
tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
sizeof(udp6_eth_hdr));
(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
} else {
l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
mmh[idx].msg_len, no_udp_csum);
tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
sizeof(udp4_eth_hdr));
(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
@ -387,11 +417,12 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
* udp_sock_recverr() - Receive and clear an error from a socket
* @s: Socket to receive from
*
* Return: true if errors received and processed, false if no more errors
* Return: 1 if error received and processed, 0 if no more errors in queue, < 0
* if there was an error reading the queue
*
* #syscalls recvmsg
*/
static bool udp_sock_recverr(int s)
static int udp_sock_recverr(int s)
{
const struct sock_extended_err *ee;
const struct cmsghdr *hdr;
@ -408,14 +439,16 @@ static bool udp_sock_recverr(int s)
rc = recvmsg(s, &mh, MSG_ERRQUEUE);
if (rc < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK)
err_perror("Failed to read error queue");
return false;
if (errno == EAGAIN || errno == EWOULDBLOCK)
return 0;
err_perror("UDP: Failed to read error queue");
return -1;
}
if (!(mh.msg_flags & MSG_ERRQUEUE)) {
err("Missing MSG_ERRQUEUE flag reading error queue");
return false;
return -1;
}
hdr = CMSG_FIRSTHDR(&mh);
@ -424,7 +457,7 @@ static bool udp_sock_recverr(int s)
(hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_RECVERR))) {
err("Unexpected cmsg reading error queue");
return false;
return -1;
}
ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
@ -433,7 +466,54 @@ static bool udp_sock_recverr(int s)
debug("%s error on UDP socket %i: %s",
str_ee_origin(ee), s, strerror(ee->ee_errno));
return true;
return 1;
}
/**
* udp_sock_errs() - Process errors on a socket
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
*
* Return: Number of errors handled, or < 0 if we have an unrecoverable error
*/
static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
{
unsigned n_err = 0;
socklen_t errlen;
int rc, err;
ASSERT(!c->no_udp);
if (!(events & EPOLLERR))
return 0; /* Nothing to do */
/* Empty the error queue */
while ((rc = udp_sock_recverr(s)) > 0)
n_err += rc;
if (rc < 0)
return -1; /* error reading error, unrecoverable */
errlen = sizeof(err);
if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 ||
errlen != sizeof(err)) {
err_perror("Error reading SO_ERROR");
return -1; /* error reading error, unrecoverable */
}
if (err) {
debug("Unqueued error on UDP socket %i: %s", s, strerror(err));
n_err++;
}
if (!n_err) {
/* EPOLLERR, but no errors to clear !? */
err("EPOLLERR event without reported errors on socket %i", s);
return -1; /* no way to clear, unrecoverable */
}
return n_err;
}
/**
@ -443,6 +523,8 @@ static bool udp_sock_recverr(int s)
* @events: epoll events bitmap
* @mmh mmsghdr array to receive into
*
* Return: Number of datagrams received
*
* #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
*/
static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
@ -459,12 +541,6 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
ASSERT(!c->no_udp);
/* Clear any errors first */
if (events & EPOLLERR) {
while (udp_sock_recverr(s))
;
}
if (!(events & EPOLLIN))
return 0;
@ -492,6 +568,13 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
const socklen_t sasize = sizeof(udp_meta[0].s_in);
int n, i;
if (udp_sock_errs(c, ref.fd, events) < 0) {
err("UDP: Unrecoverable error on listening socket:"
" (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
/* FIXME: what now? close/re-open socket? */
return;
}
if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
return;
@ -512,7 +595,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
udp_splice_prepare(udp_mh_recv, i);
} else if (batchpif == PIF_TAP) {
udp_tap_prepare(udp_mh_recv, i,
flowside_at_sidx(batchsidx));
flowside_at_sidx(batchsidx),
false);
}
if (++i >= n)
@ -560,12 +644,20 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
const struct flowside *toside = flowside_at_sidx(tosidx);
struct udp_flow *uflow = udp_at_sidx(ref.flowside);
int from_s = uflow->s[ref.flowside.sidei];
uint8_t topif = pif_at_sidx(tosidx);
int n, i;
int n, i, from_s;
ASSERT(!c->no_udp && uflow);
from_s = uflow->s[ref.flowside.sidei];
if (udp_sock_errs(c, from_s, events) < 0) {
flow_err(uflow, "Unrecoverable error on reply socket");
flow_err_details(uflow);
udp_flow_close(c, uflow);
return;
}
if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
return;
@ -576,7 +668,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
if (pif_is_socket(topif))
udp_splice_prepare(udp_mh_recv, i);
else if (topif == PIF_TAP)
udp_tap_prepare(udp_mh_recv, i, toside);
udp_tap_prepare(udp_mh_recv, i, toside, false);
/* Restore sockaddr length clobbered by recvmsg() */
udp_mh_recv[i].msg_hdr.msg_namelen = sizeof(udp_meta[i].s_in);
}
@ -703,69 +795,61 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
* udp_sock_init() - Initialise listening sockets for a given port
* @c: Execution context
* @ns: In pasta mode, if set, bind with loopback address in namespace
* @af: Address family to select a specific IP version, or AF_UNSPEC
* @addr: Pointer to address for binding, NULL if not configured
* @ifname: Name of interface to bind to, NULL if not configured
* @port: Port, host order
*
* Return: 0 on (partial) success, negative error code on (complete) failure
*/
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
const void *addr, const char *ifname, in_port_t port)
int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
const char *ifname, in_port_t port)
{
union udp_listen_epoll_ref uref = { .port = port };
union udp_listen_epoll_ref uref = {
.pif = ns ? PIF_SPLICE : PIF_HOST,
.port = port,
};
int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
ASSERT(!c->no_udp);
if (ns)
uref.pif = PIF_SPLICE;
else
uref.pif = PIF_HOST;
if (af == AF_UNSPEC && c->ifi4 && c->ifi6) {
if (!addr && c->ifi4 && c->ifi6 && !ns) {
int s;
/* Attempt to get a dual stack socket */
if (!ns) {
s = sock_l4(c, AF_UNSPEC, EPOLL_TYPE_UDP_LISTEN,
addr, ifname, port, uref.u32);
udp_splice_init[V4][port] = s < 0 ? -1 : s;
udp_splice_init[V6][port] = s < 0 ? -1 : s;
} else {
s = sock_l4(c, AF_UNSPEC, EPOLL_TYPE_UDP_LISTEN,
&in4addr_loopback, ifname, port, uref.u32);
udp_splice_ns[V4][port] = s < 0 ? -1 : s;
udp_splice_ns[V6][port] = s < 0 ? -1 : s;
}
s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
NULL, ifname, port, uref.u32);
udp_splice_init[V4][port] = s < 0 ? -1 : s;
udp_splice_init[V6][port] = s < 0 ? -1 : s;
if (IN_INTERVAL(0, FD_REF_MAX, s))
return 0;
}
if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) {
if ((!addr || inany_v4(addr)) && c->ifi4) {
if (!ns) {
r4 = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
addr, ifname, port, uref.u32);
r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
addr ? addr : &inany_any4, ifname,
port, uref.u32);
udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
} else {
r4 = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
&in4addr_loopback,
ifname, port, uref.u32);
r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
&inany_loopback4, ifname,
port, uref.u32);
udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
}
}
if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6) {
if ((!addr || !inany_v4(addr)) && c->ifi6) {
if (!ns) {
r6 = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
addr, ifname, port, uref.u32);
r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
addr ? addr : &inany_any6, ifname,
port, uref.u32);
udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
} else {
r6 = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
&in6addr_loopback,
ifname, port, uref.u32);
r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
&inany_loopback6, ifname,
port, uref.u32);
udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
}
}
@ -833,7 +917,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
if ((c->ifi4 && socks[V4][port] == -1) ||
(c->ifi6 && socks[V6][port] == -1))
udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port);
udp_sock_init(c, outbound, NULL, NULL, port);
}
}

4
udp.h
View file

@ -16,8 +16,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
int udp_tap_handler(const struct ctx *c, uint8_t pif,
sa_family_t af, const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
const void *addr, const char *ifname, in_port_t port);
int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
const char *ifname, in_port_t port);
int udp_init(struct ctx *c);
void udp_timer(struct ctx *c, const struct timespec *now);
void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

View file

@ -39,8 +39,11 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
* @c: Execution context
* @uflow: UDP flow
*/
static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
{
if (uflow->closed)
return; /* Nothing to do */
if (uflow->s[INISIDE] >= 0) {
/* The listening socket needs to stay in epoll */
close(uflow->s[INISIDE]);
@ -56,6 +59,8 @@ static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
if (!pif_is_socket(uflow->f.pif[TGTSIDE]))
flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE));
uflow->closed = true;
}
/**
@ -256,6 +261,17 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
return udp_flow_new(c, flow, -1, now);
}
/**
* udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
* @uflow: Flow to handle
*
* Return: true if the connection is ready to free, false otherwise
*/
bool udp_flow_defer(const struct udp_flow *uflow)
{
return uflow->closed;
}
/**
* udp_flow_timer() - Handler for timed events related to a given flow
* @c: Execution context

View file

@ -10,6 +10,7 @@
/**
* struct udp - Descriptor for a flow of UDP packets
* @f: Generic flow information
* @closed: Flow is already closed
* @ts: Activity timestamp
* @s: Socket fd (or -1) for each side of the flow
*/
@ -17,6 +18,7 @@ struct udp_flow {
/* Must be first element */
struct flow_common f;
bool closed :1;
time_t ts;
int s[SIDES];
};
@ -30,6 +32,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
const void *saddr, const void *daddr,
in_port_t srcport, in_port_t dstport,
const struct timespec *now);
void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
bool udp_flow_defer(const struct udp_flow *uflow);
bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
const struct timespec *now);

212
util.c
View file

@ -28,6 +28,7 @@
#include <linux/errqueue.h>
#include <getopt.h>
#include "linux_dep.h"
#include "util.h"
#include "iov.h"
#include "passt.h"
@ -52,6 +53,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
{
sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
union epoll_ref ref = { .type = type, .data = data };
bool freebind = false;
struct epoll_event ev;
int fd, y = 1, ret;
uint8_t proto;
@ -61,8 +63,11 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
case EPOLL_TYPE_TCP_LISTEN:
proto = IPPROTO_TCP;
socktype = SOCK_STREAM | SOCK_NONBLOCK;
freebind = c->freebind;
break;
case EPOLL_TYPE_UDP_LISTEN:
freebind = c->freebind;
/* fallthrough */
case EPOLL_TYPE_UDP_REPLY:
proto = IPPROTO_UDP;
socktype = SOCK_DGRAM | SOCK_NONBLOCK;
@ -127,6 +132,18 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
}
}
if (freebind) {
int level = af == AF_INET ? IPPROTO_IP : IPPROTO_IPV6;
int opt = af == AF_INET ? IP_FREEBIND : IPV6_FREEBIND;
if (setsockopt(fd, level, opt, &y, sizeof(y))) {
err_perror("Failed to set %s on socket %i",
af == AF_INET ? "IP_FREEBIND"
: "IPV6_FREEBIND",
fd);
}
}
if (bind(fd, sa, sl) < 0) {
/* We'll fail to bind to low ports if we don't have enough
* capabilities, and we'll fail to bind on already bound ports,
@ -157,58 +174,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
return fd;
}
/**
* sock_l4() - Create and bind socket for given L4, add to epoll list
* @c: Execution context
* @af: Address family, AF_INET or AF_INET6
* @type: epoll type
* @bind_addr: Address for binding, NULL for any
* @ifname: Interface for binding, NULL for any
* @port: Port, host order
* @data: epoll reference portion for protocol handlers
*
* Return: newly created socket, negative error code on failure
*/
int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type,
const void *bind_addr, const char *ifname, uint16_t port,
uint32_t data)
{
switch (af) {
case AF_INET: {
struct sockaddr_in addr4 = {
.sin_family = AF_INET,
.sin_port = htons(port),
{ 0 }, { 0 },
};
if (bind_addr)
addr4.sin_addr = *(struct in_addr *)bind_addr;
return sock_l4_sa(c, type, &addr4, sizeof(addr4), ifname,
false, data);
}
case AF_UNSPEC:
if (!DUAL_STACK_SOCKETS || bind_addr)
return -EINVAL;
/* fallthrough */
case AF_INET6: {
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
.sin6_port = htons(port),
0, IN6ADDR_ANY_INIT, 0,
};
if (bind_addr) {
addr6.sin6_addr = *(struct in6_addr *)bind_addr;
if (IN6_IS_ADDR_LINKLOCAL(bind_addr))
addr6.sin6_scope_id = c->ifi6;
}
return sock_l4_sa(c, type, &addr6, sizeof(addr6), ifname,
af == AF_INET6, data);
}
default:
return -EINVAL;
}
}
/**
* sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
@ -219,7 +184,8 @@ void sock_probe_mem(struct ctx *c)
int v = INT_MAX / 2, s;
socklen_t sl;
if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
if (s < 0) {
c->low_wmem = c->low_rmem = 1;
return;
}
@ -249,7 +215,7 @@ void sock_probe_mem(struct ctx *c)
int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b)
{
if (a->tv_nsec < b->tv_nsec) {
return (b->tv_nsec - a->tv_nsec) / 1000 +
return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
(a->tv_sec - b->tv_sec - 1) * 1000000;
}
@ -443,25 +409,20 @@ void pidfile_write(int fd, pid_t pid)
}
/**
* pidfile_open() - Open PID file if needed
* @path: Path for PID file, empty string if no PID file is requested
* output_file_open() - Open file for output, if needed
* @path: Path for output file
* @flags: Flags for open() other than O_CREAT, O_TRUNC, O_CLOEXEC
*
* Return: descriptor for PID file, -1 if path is NULL, won't return on failure
* Return: file descriptor on success, -1 on failure with errno set by open()
*/
int pidfile_open(const char *path)
int output_file_open(const char *path, int flags)
{
int fd;
if (!*path)
return -1;
if ((fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC,
S_IRUSR | S_IWUSR)) < 0) {
perror("PID file open");
exit(EXIT_FAILURE);
}
return fd;
/* We use O_CLOEXEC here, but clang-tidy as of LLVM 16 to 19 looks for
* it in the 'mode' argument if we have one
*/
return open(path, O_CREAT | O_TRUNC | O_CLOEXEC | flags,
/* NOLINTNEXTLINE(android-cloexec-open) */
S_IRUSR | S_IWUSR);
}
/**
@ -485,16 +446,11 @@ int __daemon(int pidfile_fd, int devnull_fd)
exit(EXIT_SUCCESS);
}
errno = 0;
setsid();
dup2(devnull_fd, STDIN_FILENO);
dup2(devnull_fd, STDOUT_FILENO);
dup2(devnull_fd, STDERR_FILENO);
close(devnull_fd);
if (errno)
if (setsid() < 0 ||
dup2(devnull_fd, STDIN_FILENO) < 0 ||
dup2(devnull_fd, STDOUT_FILENO) < 0 ||
dup2(devnull_fd, STDERR_FILENO) < 0 ||
close(devnull_fd))
exit(EXIT_FAILURE);
return 0;
@ -582,6 +538,36 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
#endif
}
/* write_all_buf() - write all of a buffer to an fd
* @fd: File descriptor
* @buf: Pointer to base of buffer
* @len: Length of buffer
*
* Return: 0 on success, -1 on error (with errno set)
*
* #syscalls write
*/
int write_all_buf(int fd, const void *buf, size_t len)
{
const char *p = buf;
size_t left = len;
while (left) {
ssize_t rc;
do
rc = write(fd, p, left);
while ((rc < 0) && errno == EINTR);
if (rc < 0)
return -1;
p += rc;
left -= rc;
}
return 0;
}
/* write_remainder() - write the tail of an IO vector to an fd
* @fd: File descriptor
* @iov: IO vector
@ -590,28 +576,30 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
*
* Return: 0 on success, -1 on error (with errno set)
*
* #syscalls write writev
* #syscalls writev
*/
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
{
size_t offset, i;
size_t i = 0, offset;
while ((i = iov_skip_bytes(iov, iovcnt, skip, &offset)) < iovcnt) {
while ((i += iov_skip_bytes(iov + i, iovcnt - i, skip, &offset)) < iovcnt) {
ssize_t rc;
if (offset) {
rc = write(fd, (char *)iov[i].iov_base + offset,
iov[i].iov_len - offset);
} else {
rc = writev(fd, &iov[i], iovcnt - i);
/* Write the remainder of the partially written buffer */
if (write_all_buf(fd, (char *)iov[i].iov_base + offset,
iov[i].iov_len - offset) < 0)
return -1;
i++;
}
/* Write as much of the remaining whole buffers as we can */
rc = writev(fd, &iov[i], iovcnt - i);
if (rc < 0)
return -1;
skip += rc;
skip = rc;
}
return 0;
}
@ -750,6 +738,48 @@ void close_open_files(int argc, char **argv)
rc = close_range(fd + 1, ~0U, CLOSE_RANGE_UNSHARE);
}
if (rc)
die_perror("Failed to close files leaked by parent");
if (rc) {
if (errno == ENOSYS || errno == EINVAL) {
/* This probably means close_range() or the
* CLOSE_RANGE_UNSHARE flag is not supported by the
* kernel. Not much we can do here except carry on and
* hope for the best.
*/
warn(
"Can't use close_range() to ensure no files leaked by parent");
} else {
die_perror("Failed to close files leaked by parent");
}
}
}
/**
* snprintf_check() - snprintf() wrapper, checking for truncation and errors
* @str: Output buffer
* @size: Maximum size to write to @str
* @format: Message
*
* Return: false on success, true on truncation or error, sets errno on failure
*/
bool snprintf_check(char *str, size_t size, const char *format, ...)
{
va_list ap;
int rc;
va_start(ap, format);
rc = vsnprintf(str, size, format, ap);
va_end(ap);
if (rc < 0) {
errno = EIO;
return true;
}
if ((size_t)rc >= size) {
errno = ENOBUFS;
return true;
}
return false;
}

54
util.h
View file

@ -11,12 +11,12 @@
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/close_range.h>
#include "log.h"
@ -67,6 +67,15 @@
#define STRINGIFY(x) #x
#define STR(x) STRINGIFY(x)
#ifdef CPPCHECK_6936
/* Some cppcheck versions get confused by aborts inside a loop, causing
* it to give false positive uninitialised variable warnings later in
* the function, because it doesn't realise the non-initialising path
* already exited. See https://trac.cppcheck.net/ticket/13227
*/
#define ASSERT(expr) \
((expr) ? (void)0 : abort())
#else
#define ASSERT(expr) \
do { \
if (!(expr)) { \
@ -78,6 +87,7 @@
abort(); \
} \
} while (0)
#endif
#ifdef P_tmpdir
#define TMPDIR P_tmpdir
@ -91,6 +101,9 @@
#define ARRAY_SIZE(a) ((int)(sizeof(a) / sizeof((a)[0])))
#define foreach(item, array) \
for ((item) = (array); (item) - (array) < ARRAY_SIZE(array); (item)++)
#define IN_INTERVAL(a, b, x) ((x) >= (a) && (x) <= (b))
#define FD_PROTO(x, proto) \
(IN_INTERVAL(c->proto.fd_min, c->proto.fd_max, (x)))
@ -131,7 +144,7 @@ static inline uint32_t ntohl_unaligned(const void *p)
return ntohl(val);
}
#define NS_FN_STACK_SIZE (RLIMIT_STACK_VAL * 1024 / 8)
#define NS_FN_STACK_SIZE (1024 * 1024) /* 1MiB */
int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
void *arg);
#define NS_CALL(fn, arg) \
@ -144,9 +157,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
(void *)(arg)); \
} while (0)
#define RCVBUF_BIG (2UL * 1024 * 1024)
#define SNDBUF_BIG (4UL * 1024 * 1024)
#define SNDBUF_SMALL (128UL * 1024)
#define RCVBUF_BIG (2ULL * 1024 * 1024)
#define SNDBUF_BIG (4ULL * 1024 * 1024)
#define SNDBUF_SMALL (128ULL * 1024)
#include <net/if.h>
#include <limits.h>
@ -157,33 +170,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
struct ctx;
/* cppcheck-suppress funcArgNamesDifferent */
__attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
#ifdef CLOSE_RANGE_UNSHARE /* Linux kernel >= 5.9 */
/* glibc < 2.34 and musl as of 1.2.5 need these */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
__attribute__ ((weak))
/* cppcheck-suppress funcArgNamesDifferent */
int close_range(unsigned int first, unsigned int last, int flags) {
return syscall(SYS_close_range, first, last, flags);
}
#else
/* No reasonable fallback option */
/* cppcheck-suppress funcArgNamesDifferent */
int close_range(unsigned int first, unsigned int last, int flags) {
return 0;
}
#endif
int sock_l4_sa(const struct ctx *c, enum epoll_type type,
const void *sa, socklen_t sl,
const char *ifname, bool v6only, uint32_t data);
int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type,
const void *bind_addr, const char *ifname, uint16_t port,
uint32_t data);
void sock_probe_mem(struct ctx *c);
long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);
@ -195,13 +184,15 @@ char *line_read(char *buf, size_t len, int fd);
void ns_enter(const struct ctx *c);
bool ns_is_init(void);
int open_in_ns(const struct ctx *c, const char *path, int flags);
int pidfile_open(const char *path);
int output_file_open(const char *path, int flags);
void pidfile_write(int fd, pid_t pid);
int __daemon(int pidfile_fd, int devnull_fd);
int fls(unsigned long x);
int write_file(const char *path, const char *buf);
int write_all_buf(int fd, const void *buf, size_t len);
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
void close_open_files(int argc, char **argv);
bool snprintf_check(char *str, size_t size, const char *format, ...);
/**
* af_name() - Return name of an address family
@ -269,6 +260,9 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
return mod_sub(x, i, m) < mod_sub(j, i, m);
}
/* FPRINTF() intentionally silences cert-err33-c clang-tidy warnings */
#define FPRINTF(f, ...) (void)fprintf(f, __VA_ARGS__)
/*
* Workarounds for https://github.com/llvm/llvm-project/issues/58992
*