Nobody currently calls this as passt4netns, that was the name I used
before 'pasta', drop any reference before it's too late.
While at it, explicitly check that argc is bigger than or equal to
one, just as a defensive measure: argv[0] being NULL is not an issue
anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clang-tidy from LLVM 13.0.1 reports some new warnings from these
checkers:
- altera-unroll-loops, altera-id-dependent-backward-branch: ignore
for the moment being, add a TODO item
- bugprone-easily-swappable-parameters: ignore, nothing to do about
those
- readability-function-cognitive-complexity: ignore for the moment
being, add a TODO item
- altera-struct-pack-align: ignore, alignment is forced in protocol
headers
- concurrency-mt-unsafe: ignore for the moment being, add a TODO
item
Fix bugprone-implicit-widening-of-multiplication-result warnings,
though, that's doable and they seem to make sense.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcpi_bytes_acked and tcpi_min_rtt are only available on recent
kernel versions: provide fall-back paths (incurring some grade of
performance penalty).
Support for getrandom() was introduced in Linux 3.17 and glibc 2.25:
provide an alternate mechanism for that as well, reading from
/dev/random.
Also check if NETLINK_GET_STRICT_CHK is defined before using it:
it's not strictly needed, we'll filter out irrelevant results from
netlink anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On some distributions, on ppc64, ulimit -s returns 'unlimited': add a
reasonable default, and also make sure ulimit is invoked using the
default shell, which should ensure ulimit is actually implemented.
Also note that AUDIT_ARCH doesn't follow closely the naming reported
by 'uname -m': convert for i386 and ppc as needed.
While at it, move inclusion of seccomp.h after util.h, the former is
less generic (cosmetic/clang-tidy only).
Older kernel headers might lack a definition for AUDIT_ARCH_PPC64LE:
define that explicitly if it's not available.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Some of those warnings don't trigger even on systems with very
similar toolchains, suppress unmatchedSuppression warnings, they're
basically useless.
While at it, pass CFLAGS to cppcheck.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...mostly false positives, but a number of very relevant ones too,
in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Unions and structs, you all have names now.
Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.
Provide a ffsl() weak declaration using gcc built-in.
Start reordering includes, but that's not enough for the
llvm-include-order checker yet.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Detect missing tcpi_snd_wnd in struct tcp_info at build time,
otherwise build fails with a pre-5.3 linux/tcp.h header.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A mix of unchecked return values, a missing permission mask for
open(2) with O_CREAT, and some false positives from
-Wstringop-overflow and -Wmaybe-uninitialized.
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
SPDX tags don't replace license files. Some notices were missing and
some tags were not according to the SPDX specification, too.
Now reuse --lint from the REUSE tool (https://reuse.software/) passes.
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Martin reports that DESTDIR is ignored in install/uninstall targets,
see also:
https://www.gnu.org/prep/standards/html_node/DESTDIR.html
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
List of allowed syscalls comes from comments in the form:
#syscalls <list>
for syscalls needed both in passt and pasta mode, and:
#syscalls:pasta <list>
#syscalls:passt <list>
for syscalls specifically needed in pasta or passt mode only.
seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.
While at it, clean up a bit the Makefile using wildcards.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.
Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Based on a patch from Giuseppe Scrivano, this adds the ability to:
- specify paths and names of target namespaces to join, instead of
a PID, also for user namespaces, with --userns
- request to join or create a network namespace only, without
entering or creating a user namespace, with --netns-only
- specify the base directory for netns mountpoints, with --nsrun-dir
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
they're not created, implemented --netns-only and --nsrun-dir,
updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If transparent huge pages are available, madvise() will do the trick.
While at it, decrease EPOLL_EVENTS for the main loop from 10 to 8,
for slightly better socket fairness.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Provide an AVX2-based function using compiler intrinsics for
TCP/IP-style checksums. The load/unpack/add idea and implementation
is largely based on code from BESS (the Berkeley Extensible Software
Switch) licensed as 3-Clause BSD, with a number of modifications to
further decrease pipeline stalls and to minimise cache pollution.
This speeds up considerably data paths from sockets to tap
interfaces, decreasing overhead for checksum computation, with
16-64KiB packet buffers, from approximately 11% to 7%. The rest is
just syscalls at this point.
While at it, provide convenience targets in the Makefile for avx2,
avx2_debug, and debug targets -- these simply add target-specific
CFLAGS to the build.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With -DDEBUG, passt now saves guest-side traffic captures in
pcap format at /tmp/passt_<ISO8601 timestamp>.pcap. The timestamp
refers to time and date of start-up.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It might be impractical to pass options to qrap when using libvirt,
because the <emulator/> tag expects a path to an executable, without
further arguments.
If the first argument is not a plausible socket number, and the
second argument is not a valid executable, look up a qemu command
from a list of possible names, then start it patching the command line
to include the -netdev fd= parameter corresponding to the AF_UNIX
domain socket we just opened.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This implementation, similarly to the IPv4 DHCP one, hands out a
single address, which is the same as the upstream address for the
host.
This avoids the need for address translation as long as the client
runs a DHCPv6 client. The NDP "Managed" flag is now set in Router
Advertisements.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It's nice to be able to confirm connectivity using ICMP or ICMPv6
echo requests, and "ping" sockets on Linux (IPPROTO_ICMP datagram)
allow us to do that without any special capability.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Implement siphash routines for initial TCP sequence numbers (12 bytes
input for IPv4, 36 bytes input for IPv6), and while at it, also
functions we'll use later on for hash table indices and TCP timestamp
offsets (with 8, 20, 32 bytes of input).
Use these to set the initial sequence number, according to RFC 6528,
for connections originating either from the tap device or from
sockets.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With this, merd provides a fully functional IPv4 environment to
guests, requiring a single capability, CAP_NET_RAW.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can bypass a full-fledged network interface between qemu and merd by
connecting the qemu tap file descriptor to a provided UNIX domain
socket: this could be implemented in qemu eventually, qrap covers this
meanwhile.
This also avoids the need for the AF_PACKET socket towards the guest.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>