Commit graph

123 commits

Author SHA1 Message Date
David Gibson
eda4f1997e epoll: Split listening Unix domain socket into its own type
tap_handler() actually handles events on three different types of object:
the /dev/tap character device (pasta), a connected Unix domain socket
(passt) or a listening Unix domain socket (passt).

The last, in particular, really has no handling in common with the others,
so split it into its own epoll type and directly dispatch to the relevant
handler from the top level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:17 +02:00
David Gibson
485b5fb8f9 epoll: Split handling of listening TCP sockets into their own handler
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common.  Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level.  Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning.  So,
switch listening sockets to their own reference type which we can lay out
as we please.  That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:15 +02:00
David Gibson
e6f81e5578 epoll: Split handling of TCP timerfds into its own handler function
tcp_sock_handler() actually handles several different types of fd events.
This includes timerfds that aren't sockets at all.  The handling of these
has essentially nothing in common with the other cases.  So, give the
TCP timers there own epoll_type value and dispatch directly to their
handler.  This also means we can remove the timer field from tcp_epoll_ref,
the information it encoded is now implicit in the epoll_type value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:13 +02:00
David Gibson
8271a2ed57 epoll: Tiny cleanup to udp_sock_handler()
Move the test for c->no_udp into the function itself, rather than in the
dispatching switch statement to better localize the UDP specific logic, and
make for greated consistency with other handler functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:10 +02:00
David Gibson
05f606ab0b epoll: Split handling of ICMP and ICMPv6 sockets
We have different epoll type values for ICMP and ICMPv6 sockets, but they
both call the same handler function, icmp_sock_handler().  However that
function does essentially nothing in common for the two cases.  So, split
it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them
separately from the top level.

While we're there remove some parameters that the function was never using
anyway.  Also move the test for c->no_icmp into the functions, so that all
the logic specific to ICMP is within the handler, rather than in the top
level dispatch code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:07 +02:00
David Gibson
d850caab66 epoll: Fold sock_handler into general switch on epoll event fd
When we process events from epoll_wait(), we check for a number of special
cases before calling sock_handler() which then dispatches based on the
protocol type of the socket in the event.

Fold these cases together into a single switch on the fd type recorded in
the epoll data field.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:56 +02:00
David Gibson
6a6735ece4 epoll: Always use epoll_ref for the epoll data variable
epoll_ref contains a variety of information useful when handling epoll
events on our sockets, and we place it in the epoll_event data field
returned by epoll.  However, for a few other things we use the 'fd' field
in the standard union of types for that data field.

This actually introduces a bug which is vanishingly unlikely to hit in
practice, but very nasty if it ever did: theoretically if we had a very
large file descriptor number for fd_tap or fd_tap_listen it could overflow
into bits that overlap with the 'proto' field in epoll_ref.  With some
very bad luck this could mean that we mistakenly think an event on a
regular socket is an event on fd_tap or fd_tap_listen.

More practically, using different (but overlapping) fields of the
epoll_data means we can't unify dispatch for the various different objects
in the epoll.  Therefore use the same epoll_ref as the data for the tap
fds and the netns quit fd, adding new fd type values to describe them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:53 +02:00
David Gibson
3401644453 epoll: Generalize epoll_ref to cover things other than sockets
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd.  However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future.  Rename these fields to
an abstract "fd type" and file descriptor for more generality.

Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space.  For now these just correspond to the
supported protocols, but we'll expand on that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:51 +02:00
David Gibson
8218d99013 Use C11 anonymous members to make poll refs less verbose to use
union epoll_ref has a deeply nested set of structs and unions to let us
subdivide it into the various different fields we want.  This means that
referencing elements can involve an awkward long string of intermediate
fields.

Using C11 anonymous structs and unions lets us do this less clumsily.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:17:57 +02:00
Stefano Brivio
940bd3eff9 passt: Fix error check for signal(), improve error messages
Valtteri reports that if SIGPIPE already has a disposition set by the
parent process, such as systemd with the default setting of
IgnoreSIGPIPE=yes, signal() will return the previous value, not zero,
and this is not an error: check for SIG_ERR instead.

While at it, split messages for failures of sigaction() and signal(),
and report the actual error.

Reported-by: Valtteri Vuorikoski <vuori@notcom.org>
Fixes: 8534be076c ("Catch failures when installing signal handlers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-13 19:32:13 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
Chris Kuhn
89e38f5540 treewide: Fix header includes to build with musl
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.

Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.

Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Chris Kuhn
5c58feab7b conf, passt: Rename stderr to force_stderr
While building against musl, gcc informs us that 'stderr' is a
protected keyword. This probably comes from a #define stderr (stderr)
in musl's stdio.h, to avoid a clash with extern FILE *const stderr,
but I didn't really track it down. Just rename it to force_stderr, it
makes more sense.

[sbrivio: Added commit message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Laine Stump
c9af6f92db convert all remaining err() followed by exit() to die()
This actually leaves us with 0 uses of err(), but someone could want
to use it in the future, so we may as well leave it around.

Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 17:32:27 +01:00
Laine Stump
193385bd2f log to stderr until process is daemonized, even if a log file is set
Once a log file (specified on the commandline) is opened, the logging
functions will stop sending error logs to stderr, which is annoying if
passt has been started by another process that wants to collect error
messages from stderr so it can report why passt failed to start. It
would be much nicer if passt continued sending all log messages to
stderr until it daemonizes itself (at which point the process that
started passt can assume that it was started successfully).

The system log mask is set to LOG_EMERG when the process starts, and
we're already using that to do "special" logging during the period
from process start until the log level requested on the commandline is
processed (setting the log mask to something else). This period
*almost* matches with "the time before the process is daemonized"; if
we just delay setting the log mask a tiny bit, then it will match
exactly, and we can use it to determine if we need to send log
messages to stderr even when a log file has been specified and opened.

This patch delays the setting of the log mask until immediately before
the call to __daemon(). It also modifies logfn() slightly, so that it
will log to stderr any time log mask is LOG_EMERG, even if a log file
has been opened.

Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 17:31:42 +01:00
Stefano Brivio
d8921dafe5 pasta: Wait for tap to be set up before spawning command
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.

Otherwise, something like:
  pasta --config-net nslookup passt.top

usually fails as the nslookup command is issued before the network
interface is ready.

We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself as the kernel sees the child as init process and blocks
the delivery of the signal.

We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.

Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca0 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 12:22:59 +01:00
Richard W.M. Jones
6b4e68383c passt, tap: Add --fd option
This passes a fully connected stream socket to passt.

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
[sbrivio: reuse fd_tap instead of adding a new descriptor,
 imply --one-off on --fd, add to optstring and usage()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:40:47 +01:00
Stefano Brivio
6533a4a07b passt: Move __setlogmask() calls before output unrelated to configuration
...so that we avoid printing some lines twice because log-level is
still set to LOG_EMERG, as if logging configuration didn't happen
yet.

While at it, note that logging to stderr doesn't really depend on
whether debug mode is enabled or not.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
David Gibson
7c7b68dbe0 Use typing to reduce chances of IPv4 endianness errors
We recently corrected some errors handling the endianness of IPv4
addresses.  These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.

To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses.  This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.

We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values.  This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:24 +01:00
Stefano Brivio
f212044940 icmp: Don't discard first reply sequence for a given echo ID
In pasta mode, ICMP and ICMPv6 echo sockets relay back to us any
reply we send: we're on the same host as the target, after all. We
discard them by comparing the last sequence we sent with the sequence
we receive.

However, on the first reply for a given identifier, the sequence
might be zero, depending on the implementation of ping(8): we need
another value to indicate we haven't sent any sequence number, yet.

Use -1 as initialiser in the echo identifier map.

This is visible with Busybox's ping, and was reported by Paul on the
integration at https://github.com/containers/podman/pull/16141, with:

  $ podman run --net=pasta alpine ping -c 2 192.168.188.1

...where only the second reply would be routed back.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 33482d5bf2 ("passt: Add PASTA mode, major rework")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:18:21 +02:00
David Gibson
096e48669b isolation: Clarify various self-isolation steps
We have a number of steps of self-isolation scattered across our code.
Improve function names and add comments to make it clearer what the self
isolation model is, what the steps do, and why they happen at the points
they happen.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
Stefano Brivio
01efc71ddd log, conf: Add support for logging to file
In some environments, such as KubeVirt pods, we might not have a
system logger available. We could choose to run in foreground, but
this takes away the convenient synchronisation mechanism derived from
forking to background when interfaces are ready.

Add optional logging to file with -l/--log-file and --log-size.

Unfortunately, this means we need to duplicate features that are more
appropriately implemented by a system logger, such as rotation. Keep
that reasonably simple, by using fallocate() with range collapsing
where supported (Linux kernel >= 3.15, extent-based ext4 and XFS) and
falling back to an unsophisticated block-by-block moving of entries
toward the beginning of the file once we reach the (mandatory) size
limit.

While at it, clarify the role of LOG_EMERG in passt.c.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-14 17:38:28 +02:00
Stefano Brivio
da152331cf Move logging functions to a new file, log.c
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-14 17:38:25 +02:00
David Gibson
8f6be016ae cppcheck: Suppress same-value-in-ternary branches warning
TIMER_INTERVAL is the minimum of two separately defined intervals which
happen to have the same value at present.  This results in an expression
which has the same value in both branches of a ternary operator, which
cppcheck warngs about.  This is logically sound in this case, so suppress
the error (we appear to already have a similar suppression for clang-tidy).

Also add an unmatchedSuppression suppression, since only some cppcheck
versions complain about this instance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-09-29 12:22:30 +02:00
David Gibson
8534be076c Catch failures when installing signal handlers
Stop ignoring the return codes from sigaction() and signal().  Unlikely to
happen in practice, but if it ever did it could lead to really hard to
debug problems.  So, take clang-tidy's advice and check for errors here.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-09-29 12:21:53 +02:00
David Gibson
eed17a47fe Handle userns isolation and dropping root at the same time
passt/pasta can interact with user namespaces in a number of ways:
   1) With --netns-only we'll remain in our original user namespace
   2) With --userns or a PID option to pasta we'll join either the given
      user namespace or that of the PID
   3) When pasta spawns a shell or command we'll start a new user namespace
      for the command and then join it
   4) With passt we'll create a new user namespace when we sandbox()
      ourself

However (3) and (4) turn out to have essentially the same effect.  In both
cases we create one new user namespace.  The spawned command starts there,
and passt/pasta itself will live there from sandbox() onwards.

Because of this, we can simplify user namespace handling by moving the
userns handling earlier, to the same point we drop root in the original
namespace.  Extend the drop_user() function to isolate_user() which does
both.

After switching UID and GID in the original userns, isolate_user() will
either join or create the userns we require.  When we spawn a command with
pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns,
because we're already made one.  sandbox() likewise no longer needs to
create (or join) an userns because we're already in the one we need.

We no longer need c->pasta_userns_fd, since the fd is only used locally
in isolate_user().  Likewise we can replace c->netns_only with a local
in conf(), since it's not used outside there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
d72a1e7bb9 Move self-isolation code into a separate file
passt/pasta contains a number of routines designed to isolate passt from
the rest of the system for security.  These are spread through util.c and
passt.c.  Move them together into a new isolation.c file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
60ffc5b6cb Don't unnecessarily avoid CLOEXEC flags
There are several places in the passt code where we have lint overrides
because we're not adding CLOEXEC flags to open or other operations.
Comments suggest this is because it's before we fork() into the background
but we'll need those file descriptors after we're in the background.

However, as the name suggests CLOEXEC closes on exec(), not on fork().  The
only place we exec() is either super early invoke the avx2 version of the
binary, or when we start a shell in pasta mode, which certainly *doesn't*
require the fds in question.

Add the CLOEXEC flag in those places, and remove the lint overrides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-08-24 18:01:48 +02:00
David Gibson
16f5586bb8 Make substructures for IPv4 and IPv6 specific context information
The context structure contains a batch of fields specific to IPv4 and to
IPv6 connectivity.  Split those out into a sub-structure.

This allows the conf_ip4() and conf_ip6() functions, which take the
entire context but touch very little of it, to be given more specific
parameters, making it clearer what it affects without stepping through the
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-07-30 22:14:07 +02:00
David Gibson
5e12d23acb Separate IPv4 and IPv6 configuration
After recent changes, conf_ip() now has essentially entirely disjoint paths
for IPv4 and IPv6 configuration.  So, it's cleaner to split them out into
different functions conf_ip4() and conf_ip6().

Splitting these out also lets us make the interface a bit nicer, having
them return success or failure directly, rather than manipulating c->v4
and c->v6 to indicate success/failure of the two versions.

Since these functions may also initialize the interface index for each
protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing
tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since
a 0 interface index is never valid).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Whitespace fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-07-30 22:12:50 +02:00
Stefano Brivio
3ec02c0975 passt: Truncate PID file on open()
Otherwise, if the current PID has fewer digits than a previously
written one, the content will be wrong.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-07-22 19:42:18 +02:00
Stefano Brivio
1d223e4b4c passt: Allow exit_group() system call in seccomp profiles
We handle SIGQUIT and SIGTERM calling exit(), which is usually
implemented with the exit_group() system call.

If we don't allow exit_group(), we'll get a SIGSYS while handling
SIGQUIT and SIGTERM, which means a misleading non-zero exit code.

Reported-by: Wenli Quan <wquan@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2101990
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-07-14 01:36:05 +02:00
Stefano Brivio
17689cc9bf arch, passt: Use executable link to form AVX2 binary path
...instead of argv[0], which might or might not contain a valid path
to the executable itself. Instead of mangling argv[0], use the same
link to find out if we're already running the AVX2 build where
supported.

Alternatively, we could use execvpe(), but that might result in
running a different installed version, in case e.g. the set of
binaries is present in both /usr/bin and /usr/local/bin, with both
being in $PATH.

Reported-by: Wenli Quan <wquan@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2101310
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-07-14 01:36:05 +02:00
Stefano Brivio
a951e0b9ef conf: Add --runas option, changing to given UID and GID if started as root
On some systems, user and group "nobody" might not be available. The
new --runas option allows to override the default "nobody" choice if
started as root.

Now that we allow this, drop the initgroups() call that was used to
add any additional groups for the given user, as that might now
grant unnecessarily broad permissions. For instance, several
distributions have a "kvm" group to allow regular user access to
/dev/kvm, and we don't need that in passt or pasta.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-05-19 16:27:20 +02:00
Stefano Brivio
3c6ae62510 conf, tcp, udp: Allow address specification for forwarded ports
This feature is available in slirp4netns but was missing in passt and
pasta.

Given that we don't do dynamic memory allocation, we need to bind
sockets while parsing port configuration. This means we need to
process all other options first, as they might affect addressing and
IP version support. It also implies a minor rework of how TCP and UDP
implementations bind sockets.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-05-01 07:19:05 +02:00
Stefano Brivio
bb76470090 passt: Improper use of negative value (CWE-394)
Reported by Coverity.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 11:44:35 +02:00
Stefano Brivio
975ee8eb2b passt: Ignoring number of bytes read, CWE-252
Harmless, assuming sane kernel behaviour. Reported by Coverity.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-05 18:47:07 +02:00
Stefano Brivio
052424d7f5 passt: Accurate error reporting for sandbox()
It's actually quite easy to make it fail depending on the
environment, accurately report errors here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
62c3edd957 treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
48582bf47f treewide: Mark constant references as const
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
92074c16a8 tcp_splice: Close sockets right away on high number of open files
We can't take for granted that the hard limit for open files is
big enough as to allow to delay closing sockets to a timer.

Store the value of RTLIMIT_NOFILE we set at start, and use it to
understand if we're approaching the limit with pending, spliced
TCP connections. If that's the case, close sockets right away as
soon as they're not needed, instead of deferring this task to a
timer.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
be5bbb9b06 tcp: Rework timers to use timerfd instead of periodic bitmap scan
With a lot of concurrent connections, the bitmap scan approach is
not really sustainable.

Switch to per-connection timerfd timers, set based on events and on
two new flags, ACK_FROM_TAP_DUE and ACK_TO_TAP_DUE. Timers are added
to the common epoll list, and implement the existing timeouts.

While at it, drop the CONN_ prefix from flag names, otherwise they
get quite long, and fix the logic to decide if a connection has a
local, possibly unreachable endpoint: we shouldn't go through the
rest of tcp_conn_from_tap() if we reset the connection due to a
successful bind(2), and we'll get EACCES if the port number is low.

Suggested by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
e5eefe7743 tcp: Refactor to use events instead of states, split out spliced implementation
Using events and flags instead of states makes the implementation
much more straightforward: actions are mostly centered on events
that occurred on the connection rather than states.

An example is given by the ESTABLISHED_SOCK_FIN_SENT and
FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about
which side started closing the connection to handle closing of
connection halves.

Split out the spliced implementation, as it has very little in
common with the "regular" TCP path.

Refactor things here and there to improve clarity. Add helpers
to trace where resets and flag settings come from.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-28 17:11:40 +02:00
Stefano Brivio
d2e40bb8d9 conf, util, tap: Implement --trace option for extra verbose logging
--debug can be a bit too noisy, especially as single packets or
socket messages are logged: implement a new option, --trace,
implying --debug, that enables all debug messages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-25 13:21:13 +01:00
Stefano Brivio
213c397492 passt, pasta: Run-time selection of AVX2 build
Build-time selection of AVX2 flags and routines is not practical for
distributions, but limiting AVX2 usage to checksum routines with
specific run-time detection doesn't allow for easy performance gains
from auto-vectorisation of batched packet handling routines.

For x86_64, build non-AVX2 and AVX2 binaries, and implement a simple
wrapper replacing the current executable with the AVX2 build if it's
available, and if AVX2 is supported by the current CPU.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-28 16:46:28 +01:00
Stefano Brivio
6d661dc5b2 seccomp: Adjust list of allowed syscalls for armv6l, armv7l
It looks like glibc commonly implements clock_gettime(2) with
clock_gettime64(), and uses recv() instead of recvfrom(), send()
instead of sendto(), and sigreturn() instead of rt_sigreturn() on
armv6l and armv7l.

Adjust the list of system calls for armv6l and armv7l accordingly.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-26 23:39:19 +01:00
Stefano Brivio
a095fbc457 passt: Don't warn on failed madvise()
A kernel might not be configured with CONFIG_TRANSPARENT_HUGEPAGE,
especially on embedded systems. Ignore the error, it doesn't affect
functionality.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-26 23:37:05 +01:00
Stefano Brivio
9b61bd0b39 passt: Explicitly check return value of chdir()
...it doesn't actually matter as we're checking errno at the very
end, but, depending on build flags, chdir() might be declared with
warn_unused_result and the compiler issues a warning.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-25 22:42:36 +01:00
Stefano Brivio
5e0c75d609 passt: Drop PASST_LEGACY_NO_OPTIONS sections
...nobody uses those builds anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-22 18:42:51 +01:00
Stefano Brivio
745a9ba428 pasta: By default, quit if filesystem-bound net namespace goes away
This should be convenient for users managing filesystem-bound network
namespaces: monitor the base directory of the namespace and exit if
the namespace given as PATH or NAME target is deleted. We can't add
an inotify watch directly on the namespace directory, that won't work
with nsfs.

Add an option to disable this behaviour, --no-netns-quit.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-21 13:41:13 +01:00