The IPv4 specific dhcp() manually constructs L2 and IP headers to send its
DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the
tap_udp6_send() helper. Now that we've broaded the parameters to
tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_ip4_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation. By using a suitable
helper we can split it into tap_udp4_send() and tap_icmp4_send() functions
without greatly increasing the code size, this removing that layering
violation.
We make some small changes to the interface while there. In both cases
we make the destination IPv4 address a parameter, which will be useful
later. For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header. For the ICMP variant we pass in the ICMP
header as before. The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
ndp() takes a parameter giving the ethernet source address of the packet
it is to respond to, which it uses to determine the destination address to
send the reply packet to.
This is not necessary, because the address will always be the guest's
MAC address. Even if the guest has just changed MAC address, then either
tap_handler_passt() or tap_handler_pasta() - which are the only call paths
leading to ndp() will have updated c->mac_guest with the new value.
So, remove the parameter, and just use c->mac_guest, making it more
consistent with other paths where we construct packets to send inwards.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_ip6_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation. By using a suitable
helper we can split it into tap_udp6_send() and tap_icmp6_send() functions
without greatly increasing the code size, this removing that layering
violation.
We make some small changes to the interface while there. In both cases
we make the destination IPv6 address a parameter, which will be useful
later. For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header. For the ICMP variant we pass in the ICMP
header as before. The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and
it turns out that every caller (statically) knows if it is using IPv4 or
IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions.
Use a new tap_l2_hdr() function for the very small common part.
While we're there, make some minor cleanups:
- We were double writing some fields in the IPv6 header, so that it
temporary matched the pseudo-header for checksum calculation. With
recent checksum reworks, this isn't neccessary any more.
- We don't use any IPv4 header options, so use some sizeof() constructs
instead of some open coded values for header length.
- The comment used to say that the flow label was for TCP over IPv6, but
in fact the only thing we used it for was DHCPv6 over UDP traffic
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Callers of tap_send() can optionally use a small optimization by adding
extra space for the 4 byte length header used on the qemu socket interface.
tap_ip_send() is currently the only user of this, but this is used only
for "slow path" ICMP and DHCP packets, so there's not a lot of value to
the optimization.
Worse, having the two paths here complicates the interface and makes future
cleanups difficult, so just remove it. I have some plans to bring back the
optimization in a more general way in future, but for now it's just in the
way.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_ip_send() is never used for TCP packets, we're unlikely to use it for
that in future, and the handling of TCP packets makes other cleanups
unnecessarily awkward. Remove it.
This is the only user of csum_tcp4(), so we can remove that as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_ip_send() doesn't take a destination address, because it's specifically
for inbound packets, and the IP addresses of the guest/namespace are
already known to us. Rather than open-coding this destination address
logic, make helper functions for it which will enable some later cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We calculate IPv4 header checksums in at least two places, in dhcp() and
in tap_ip_send. Add a helper to handle this calculation in both places.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At least two places in passt fill in UDP over IPv4 checksums, although
since UDP checksums are optional with IPv4 that just amounts to storing
a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in
dhcp()). For consistency, add a helper for this "calculation".
Just for the heck of it, add the option (compile time disabled for now) to
calculate real UDP checksums.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add a helper for calculating UDP checksums when used over IPv6
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed. It also allows the UDP header and payload to
be in separate buffers, although we don't use this yet.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Although tap_ip_send() is currently the only place calculating ICMP
checksums, create a helper function for symmetry with ICMPv6. For
future flexibility it allows the ICMPv6 header and payload to be in
separate buffers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At least two places in passt calculate ICMPv6 checksums, ndp() and
tap_ip_send(). Add a helper to handle this calculation in both places.
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed. It also allows the ICMPv6 header and payload to
be in separate buffers, although we don't use this yet.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in
the target user namespace as we isolate the process, which means
we're unable to bind to low ports at that point.
Bind inbound ports, and only those, before isolate_user(). Keep the
handling of outbound ports (for pasta mode only) after the setup of
the namespace, because that's where we'll bind them.
To this end, initialise the netlink socket for the init namespace
before isolate_user() as well, as we actually need to know the
addresses of the upstream interface before binding ports, in case
they're not explicitly passed by the user.
As we now call nl_sock_init() twice, checking its return code from
conf() twice looks a bit heavy: make it exit(), instead, as we
can't do much if we don't have netlink sockets.
While at it:
- move the v4_only && v6_only options check just after the first
option processing loop, as this is more strictly related to
option parsing proper
- update the man page, explaining that CAP_NET_BIND_SERVICE is
*not* the preferred way to bind ports, because passt and pasta
can be abused to allow other processes to make effective usage
of it. Add a note about the recommended sysctl instead
- simplify nl_sock_init_do() now that it's called once for each
case
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is a minor optimisation possibility I spotted while trying to
debug a hang in tap4_handler(): if we run out of space for packet
sequences, it's fine to add packets to an existing per-sequence pool.
We should check the count of packet sequences only once we realise
that we actually need a new packet sequence.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is practical to avoid explicit lifecycle management in users,
e.g. libvirtd, and is trivial to implement.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Minor style improvement suggested by cppcheck.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reported by Coverity (CWE-119):
Negative value used as argument to a function expecting a
positive value (for example, size of buffer or allocation)
and harmless, because getsockopt() would return -EBADF if the
socket is -1, so we wouldn't print anything.
Check if accept4() returns a valid socket before calling getsockopt()
on it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The context structure contains a batch of fields specific to IPv4 and to
IPv6 connectivity. Split those out into a sub-structure.
This allows the conf_ip4() and conf_ip6() functions, which take the
entire context but touch very little of it, to be given more specific
parameters, making it clearer what it affects without stepping through the
code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
After recent changes, conf_ip() now has essentially entirely disjoint paths
for IPv4 and IPv6 configuration. So, it's cleaner to split them out into
different functions conf_ip4() and conf_ip6().
Splitting these out also lets us make the interface a bit nicer, having
them return success or failure directly, rather than manipulating c->v4
and c->v6 to indicate success/failure of the two versions.
Since these functions may also initialize the interface index for each
protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing
tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since
a 0 interface index is never valid).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Whitespace fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, the guest's MAC address is set up in pasta_ns_cobf() called
from tap_sock_tun_init(). If we have a guest MAC configured with
--ns-mac-addr, this will set the given MAC on the kernel tuntap device, or
if we haven't configured one it will update our record of the guest MAC to
the kernel assigned one from the device.
For passt, we don't initially know the guest's MAC until we receive packets
from it, so we have to initially use a broadcast address. This is - oddly
- set up in an entirely different place, in conf_ip() conditional on the
mode.
Move it to the logically matching place for passt - tap_sock_unix_init().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
...namely, as connections are discarded or accepted. This was quite
useful to debug an issue with libvirtd failing to start qemu (because
passt refused the new connection) as a previous qemu instance was
still active.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
All instances were harmless, but it might be useful to have some
debug messages here and there. Reported by Coverity.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
read() will return zero if we pass a zero length, which makes no
sense: instead, track explicitly that we exhausted the buffer, flush
packets to handlers and redo.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the tun interface disappears, we'll call tap_ns_tun() after the
seccomp profile is applied: add ioctl() and openat() to it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The existing sizes provide no measurable differences in throughput
and packet rates at this point. They were probably needed as batched
implementations were not complete, but they can be decreased quite a
bit now.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.
Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.
This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With a lot of concurrent connections, the bitmap scan approach is
not really sustainable.
Switch to per-connection timerfd timers, set based on events and on
two new flags, ACK_FROM_TAP_DUE and ACK_TO_TAP_DUE. Timers are added
to the common epoll list, and implement the existing timeouts.
While at it, drop the CONN_ prefix from flag names, otherwise they
get quite long, and fix the logic to decide if a connection has a
local, possibly unreachable endpoint: we shouldn't go through the
rest of tcp_conn_from_tap() if we reset the connection due to a
successful bind(2), and we'll get EACCES if the port number is low.
Suggested by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
--debug can be a bit too noisy, especially as single packets or
socket messages are logged: implement a new option, --trace,
implying --debug, that enables all debug messages.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clang-tidy from LLVM 13.0.1 reports some new warnings from these
checkers:
- altera-unroll-loops, altera-id-dependent-backward-branch: ignore
for the moment being, add a TODO item
- bugprone-easily-swappable-parameters: ignore, nothing to do about
those
- readability-function-cognitive-complexity: ignore for the moment
being, add a TODO item
- altera-struct-pack-align: ignore, alignment is forced in protocol
headers
- concurrency-mt-unsafe: ignore for the moment being, add a TODO
item
Fix bugprone-implicit-widening-of-multiplication-result warnings,
though, that's doable and they seem to make sense.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The existing behaviour is not really practical: an automated agent in
charge of starting both qemu and passt would need to fork itself to
start passt, because passt won't fork to background until qemu
connects, and the agent needs to unblock to start qemu.
Instead of waiting for a connection to daemonise, do it right away as
soon as a socket is available: that can be considered an initialised
state already.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Depending on the C library, but not necessarily in all the
functions we use, statx() might be used instead of stat(),
getdents() instead of getdents64(), readlinkat() instead of
readlink(), openat() instead of open().
On aarch64, it's clone() and not fork(), and dup3() instead of
dup2() -- just allow the existing alternative instead of dealing
with per-arch selections.
Since glibc commit 9a7565403758 ("posix: Consolidate fork
implementation"), we need to allow set_robust_list() for
fork()/clone(), even in a single-threaded context.
On some architectures, epoll_pwait() is provided instead of
epoll_wait(), but never both. Same with newfstat() and
fstat(), sigreturn() and rt_sigreturn(), getdents64() and
getdents(), readlink() and readlinkat(), unlink() and
unlinkat(), whereas pipe() might not be available, but
pipe2() always is, exclusively or not.
Seen on Fedora 34: newfstatat() is used on top of fstat().
syslog() is an actual system call on some glibc/arch combinations,
instead of a connect()/send() implementation.
On ppc64 and ppc64le, _llseek(), recv(), send() and getuid()
are used. For ppc64 only: ugetrlimit() for the getrlimit()
implementation, plus sigreturn() and fcntl64().
On s390x, additionally, we need to allow socketcall() (on top
of socket()), and sigreturn() also for passt (not just for
pasta).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is the only remaining Linux-specific include -- drop it to avoid
clang-tidy warnings and to make code more portable.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...mostly false positives, but a number of very relevant ones too,
in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Unions and structs, you all have names now.
Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.
Provide a ffsl() weak declaration using gcc built-in.
Start reordering includes, but that's not enough for the
llvm-include-order checker yet.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
List of allowed syscalls comes from comments in the form:
#syscalls <list>
for syscalls needed both in passt and pasta mode, and:
#syscalls:pasta <list>
#syscalls:passt <list>
for syscalls specifically needed in pasta or passt mode only.
seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.
While at it, clean up a bit the Makefile using wildcards.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.
Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Based on a patch from Giuseppe Scrivano, this adds the ability to:
- specify paths and names of target namespaces to join, instead of
a PID, also for user namespaces, with --userns
- request to join or create a network namespace only, without
entering or creating a user namespace, with --netns-only
- specify the base directory for netns mountpoints, with --nsrun-dir
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
they're not created, implemented --netns-only and --nsrun-dir,
updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>