passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-05 20:11:44 +02:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
|
|
|
|
2020-07-20 16:41:49 +02:00
|
|
|
/* PASST - Plug A Simple Socket Transport
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
* for qemu/UNIX domain socket mode
|
|
|
|
*
|
|
|
|
* PASTA - Pack A Subtle Tap Abstraction
|
|
|
|
* for network namespace/tap device mode
|
2020-07-18 01:02:39 +02:00
|
|
|
*
|
2020-07-20 16:41:49 +02:00
|
|
|
* passt.c - Daemon implementation
|
2020-07-13 22:55:46 +02:00
|
|
|
*
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
|
|
|
* Copyright (c) 2020-2021 Red Hat GmbH
|
2020-07-13 22:55:46 +02:00
|
|
|
* Author: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
*
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
* Grab Ethernet frames from AF_UNIX socket (in "passt" mode) or tap device (in
|
|
|
|
* "pasta" mode), build SOCK_DGRAM/SOCK_STREAM sockets for each 5-tuple from
|
|
|
|
* TCP, UDP packets, perform connection tracking and forward them. Forward
|
|
|
|
* packets received on sockets back to the UNIX domain socket (typically, a
|
|
|
|
* socket virtio_net file descriptor from qemu) or to the tap device (typically,
|
|
|
|
* created in a separate network namespace).
|
2020-07-13 22:55:46 +02:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/epoll.h>
|
2021-08-12 15:42:43 +02:00
|
|
|
#include <fcntl.h>
|
2021-09-26 23:19:40 +02:00
|
|
|
#include <sys/mman.h>
|
2021-03-18 11:36:55 +01:00
|
|
|
#include <sys/resource.h>
|
2022-09-12 14:24:07 +02:00
|
|
|
#include <stdbool.h>
|
2020-07-13 22:55:46 +02:00
|
|
|
#include <stdlib.h>
|
|
|
|
#include <unistd.h>
|
|
|
|
#include <netdb.h>
|
2023-03-08 04:00:22 +01:00
|
|
|
#include <signal.h>
|
|
|
|
#include <stdio.h>
|
2020-07-13 22:55:46 +02:00
|
|
|
#include <string.h>
|
|
|
|
#include <errno.h>
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
|
|
|
#include <time.h>
|
2021-03-18 07:49:08 +01:00
|
|
|
#include <syslog.h>
|
2021-10-13 22:25:03 +02:00
|
|
|
#include <sys/prctl.h>
|
2021-10-21 04:26:08 +02:00
|
|
|
#include <netinet/if_ether.h>
|
2024-05-11 17:45:42 +02:00
|
|
|
#include <libgen.h>
|
2023-11-30 03:02:21 +01:00
|
|
|
#ifdef HAS_GETRANDOM
|
|
|
|
#include <sys/random.h>
|
|
|
|
#endif
|
2021-10-21 04:26:08 +02:00
|
|
|
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
#include "util.h"
|
2020-07-20 16:41:49 +02:00
|
|
|
#include "passt.h"
|
2021-10-05 21:15:01 +02:00
|
|
|
#include "dhcp.h"
|
2021-04-13 21:59:47 +02:00
|
|
|
#include "dhcpv6.h"
|
2022-09-12 14:24:03 +02:00
|
|
|
#include "isolation.h"
|
2021-05-21 11:14:48 +02:00
|
|
|
#include "pcap.h"
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
#include "tap.h"
|
2021-08-12 15:42:43 +02:00
|
|
|
#include "conf.h"
|
2021-10-11 12:01:31 +02:00
|
|
|
#include "pasta.h"
|
2022-02-28 16:18:44 +01:00
|
|
|
#include "arch.h"
|
2022-09-24 09:53:15 +02:00
|
|
|
#include "log.h"
|
2024-01-16 01:50:38 +01:00
|
|
|
#include "tcp_splice.h"
|
2020-07-18 01:02:39 +02:00
|
|
|
|
2021-09-26 23:19:40 +02:00
|
|
|
#define EPOLL_EVENTS 8
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
|
|
|
|
2024-01-16 01:50:36 +01:00
|
|
|
#define TIMER_INTERVAL__ MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL)
|
|
|
|
#define TIMER_INTERVAL_ MIN(TIMER_INTERVAL__, ICMP_TIMER_INTERVAL)
|
|
|
|
#define TIMER_INTERVAL MIN(TIMER_INTERVAL_, FLOW_TIMER_INTERVAL)
|
2021-04-30 14:52:18 +02:00
|
|
|
|
2021-09-26 23:19:40 +02:00
|
|
|
char pkt_buf[PKT_BUF_BYTES] __attribute__ ((aligned(PAGE_SIZE)));
|
2020-07-13 22:55:46 +02:00
|
|
|
|
2024-01-16 01:50:37 +01:00
|
|
|
char *epoll_type_str[] = {
|
2024-02-15 23:24:32 +01:00
|
|
|
[EPOLL_TYPE_TCP] = "connected TCP socket",
|
|
|
|
[EPOLL_TYPE_TCP_SPLICE] = "connected spliced TCP socket",
|
|
|
|
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
|
|
|
|
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
|
2024-07-18 07:26:53 +02:00
|
|
|
[EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket",
|
2024-07-18 07:26:47 +02:00
|
|
|
[EPOLL_TYPE_UDP_REPLY] = "UDP reply socket",
|
2024-02-29 05:15:34 +01:00
|
|
|
[EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket",
|
2024-02-15 23:24:32 +01:00
|
|
|
[EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch",
|
|
|
|
[EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch",
|
|
|
|
[EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device",
|
|
|
|
[EPOLL_TYPE_TAP_PASST] = "connected qemu socket",
|
|
|
|
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
|
2021-05-10 07:50:35 +02:00
|
|
|
};
|
2024-01-16 01:50:37 +01:00
|
|
|
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
|
|
|
|
"epoll_type_str[] doesn't match enum epoll_type");
|
2020-07-21 10:48:24 +02:00
|
|
|
|
|
|
|
/**
|
2021-10-05 19:20:26 +02:00
|
|
|
* post_handler() - Run periodic and deferred tasks for L4 protocol handlers
|
2020-07-21 10:48:24 +02:00
|
|
|
* @c: Execution context
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
* @now: Current timestamp
|
2020-07-21 10:48:24 +02:00
|
|
|
*/
|
2022-03-26 07:23:21 +01:00
|
|
|
static void post_handler(struct ctx *c, const struct timespec *now)
|
2020-07-21 10:48:24 +02:00
|
|
|
{
|
2024-06-06 12:09:44 +02:00
|
|
|
#define CALL_PROTO_HANDLER(lc, uc) \
|
2021-10-05 19:20:26 +02:00
|
|
|
do { \
|
|
|
|
extern void \
|
2022-03-18 12:18:19 +01:00
|
|
|
lc ## _defer_handler (struct ctx *c) \
|
2021-10-05 19:20:26 +02:00
|
|
|
__attribute__ ((weak)); \
|
|
|
|
\
|
|
|
|
if (!c->no_ ## lc) { \
|
|
|
|
if (lc ## _defer_handler) \
|
2022-03-18 12:18:19 +01:00
|
|
|
lc ## _defer_handler(c); \
|
2021-10-05 19:20:26 +02:00
|
|
|
\
|
|
|
|
if (timespec_diff_ms((now), &c->lc.timer_run) \
|
|
|
|
>= uc ## _TIMER_INTERVAL) { \
|
|
|
|
lc ## _timer(c, now); \
|
|
|
|
c->lc.timer_run = *now; \
|
|
|
|
} \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
2022-03-18 12:18:19 +01:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2024-06-06 12:09:44 +02:00
|
|
|
CALL_PROTO_HANDLER(tcp, TCP);
|
2022-03-18 12:18:19 +01:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2024-06-06 12:09:44 +02:00
|
|
|
CALL_PROTO_HANDLER(udp, UDP);
|
2021-10-05 19:20:26 +02:00
|
|
|
|
2024-01-16 01:50:36 +01:00
|
|
|
flow_defer_handler(c, now);
|
2021-10-05 19:20:26 +02:00
|
|
|
#undef CALL_PROTO_HANDLER
|
2020-07-21 10:48:24 +02:00
|
|
|
}
|
|
|
|
|
2023-11-30 03:02:21 +01:00
|
|
|
/**
|
|
|
|
* secret_init() - Create secret value for SipHash calculations
|
|
|
|
* @c: Execution context
|
|
|
|
*/
|
|
|
|
static void secret_init(struct ctx *c)
|
|
|
|
{
|
|
|
|
#ifndef HAS_GETRANDOM
|
|
|
|
int dev_random = open("/dev/random", O_RDONLY);
|
|
|
|
unsigned int random_read = 0;
|
|
|
|
|
|
|
|
while (dev_random && random_read < sizeof(c->hash_secret)) {
|
|
|
|
int ret = read(dev_random,
|
|
|
|
(uint8_t *)&c->hash_secret + random_read,
|
|
|
|
sizeof(c->hash_secret) - random_read);
|
|
|
|
|
|
|
|
if (ret == -1 && errno == EINTR)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (ret <= 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
random_read += ret;
|
|
|
|
}
|
|
|
|
if (dev_random >= 0)
|
|
|
|
close(dev_random);
|
2024-06-15 00:37:11 +02:00
|
|
|
|
|
|
|
if (random_read < sizeof(c->hash_secret))
|
2023-11-30 03:02:21 +01:00
|
|
|
#else
|
|
|
|
if (getrandom(&c->hash_secret, sizeof(c->hash_secret),
|
2024-06-15 00:37:11 +02:00
|
|
|
GRND_RANDOM) < 0)
|
2023-11-30 03:02:21 +01:00
|
|
|
#endif /* !HAS_GETRANDOM */
|
2024-06-15 00:37:11 +02:00
|
|
|
die_perror("Failed to get random bytes for hash table and TCP");
|
2023-11-30 03:02:21 +01:00
|
|
|
}
|
|
|
|
|
2021-09-27 00:16:51 +02:00
|
|
|
/**
|
|
|
|
* timer_init() - Set initial timestamp for timer runs to current time
|
|
|
|
* @c: Execution context
|
|
|
|
* @now: Current timestamp
|
|
|
|
*/
|
2022-03-26 07:23:21 +01:00
|
|
|
static void timer_init(struct ctx *c, const struct timespec *now)
|
2021-09-27 00:16:51 +02:00
|
|
|
{
|
|
|
|
c->tcp.timer_run = c->udp.timer_run = c->icmp.timer_run = *now;
|
|
|
|
}
|
|
|
|
|
2021-07-21 12:01:04 +02:00
|
|
|
/**
|
|
|
|
* proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers
|
|
|
|
* @eth_d: Ethernet destination address, NULL if unchanged
|
|
|
|
* @eth_s: Ethernet source address, NULL if unchanged
|
|
|
|
*/
|
2023-08-22 07:29:57 +02:00
|
|
|
void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
|
2021-07-21 12:01:04 +02:00
|
|
|
{
|
2023-08-22 07:29:57 +02:00
|
|
|
tcp_update_l2_buf(eth_d, eth_s);
|
|
|
|
udp_update_l2_buf(eth_d, eth_s);
|
2021-07-21 12:01:04 +02:00
|
|
|
}
|
|
|
|
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
/**
|
|
|
|
* exit_handler() - Signal handler for SIGQUIT and SIGTERM
|
|
|
|
* @unused: Unused, handler deals with SIGQUIT and SIGTERM only
|
|
|
|
*
|
|
|
|
* TODO: After unsharing the PID namespace and forking, SIG_DFL for SIGTERM and
|
|
|
|
* SIGQUIT unexpectedly doesn't cause the process to terminate, figure out why.
|
2022-07-13 03:36:09 +02:00
|
|
|
*
|
|
|
|
* #syscalls exit_group
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
*/
|
|
|
|
void exit_handler(int signal)
|
|
|
|
{
|
|
|
|
(void)signal;
|
|
|
|
|
|
|
|
exit(EXIT_SUCCESS);
|
2021-10-14 12:17:47 +02:00
|
|
|
}
|
|
|
|
|
2020-07-13 22:55:46 +02:00
|
|
|
/**
|
|
|
|
* main() - Entry point and main loop
|
|
|
|
* @argc: Argument count
|
2021-08-12 15:42:43 +02:00
|
|
|
* @argv: Options, plus optional target PID for pasta mode
|
2020-07-13 22:55:46 +02:00
|
|
|
*
|
2021-10-21 09:41:13 +02:00
|
|
|
* Return: non-zero on failure
|
2021-10-13 22:25:03 +02:00
|
|
|
*
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
* #syscalls read write writev
|
treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-19 23:42:30 +02:00
|
|
|
* #syscalls socket getsockopt setsockopt s390x:socketcall i686:socketcall close
|
|
|
|
* #syscalls bind connect recvfrom sendto shutdown
|
2024-04-11 17:34:04 +02:00
|
|
|
* #syscalls arm:recv ppc64le:recv arm:send ppc64le:send
|
2022-02-26 23:39:19 +01:00
|
|
|
* #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
|
treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-19 23:42:30 +02:00
|
|
|
* #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64
|
2020-07-13 22:55:46 +02:00
|
|
|
*/
|
|
|
|
int main(int argc, char **argv)
|
|
|
|
{
|
2020-07-21 10:48:24 +02:00
|
|
|
struct epoll_event events[EPOLL_EVENTS];
|
2024-05-22 20:18:19 +02:00
|
|
|
int nfds, i, devnull_fd = -1;
|
log, passt: Always print to stderr before initialisation is complete
After commit 15001b39ef1d ("conf: set the log level much earlier"), we
had a phase during initialisation when messages wouldn't be printed to
standard error anymore.
Commit f67238aa864d ("passt, log: Call __openlog() earlier, log to
stderr until we detach") fixed that, but only for the case where no
log files are given.
If a log file is configured, vlogmsg() will not call passt_vsyslog(),
but during initialisation, LOG_PERROR is set, so to avoid duplicated
prints (which would result from passt_vsyslog() printing to stderr),
we don't call fprintf() from vlogmsg() either.
This is getting a bit too complicated. Instead of abusing LOG_PERROR,
define an internal logging flag that clearly represents that we're not
done with the initialisation phase yet.
If this flag is not set, make sure we always print to stderr, if the
log mask matches.
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-14 22:47:51 +02:00
|
|
|
char argv0[PATH_MAX], *name;
|
2020-07-13 22:55:46 +02:00
|
|
|
struct ctx c = { 0 };
|
2021-03-18 11:36:55 +01:00
|
|
|
struct rlimit limit;
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
struct timespec now;
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
struct sigaction sa;
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
|
2024-07-25 17:58:44 +02:00
|
|
|
clock_gettime(CLOCK_MONOTONIC, &log_start);
|
2024-07-24 17:39:55 +02:00
|
|
|
|
2022-02-28 16:18:44 +01:00
|
|
|
arch_avx2_exec(argv);
|
|
|
|
|
passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams
and the one from --fd.
Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.
As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
take), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptors numbers that match standard streams,
while at it.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-08-06 20:32:11 +02:00
|
|
|
isolate_initial(argc, argv);
|
2021-10-14 02:47:03 +02:00
|
|
|
|
2024-05-22 20:18:19 +02:00
|
|
|
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
|
|
|
|
sigemptyset(&sa.sa_mask);
|
|
|
|
sa.sa_flags = 0;
|
|
|
|
sa.sa_handler = exit_handler;
|
|
|
|
sigaction(SIGTERM, &sa, NULL);
|
|
|
|
sigaction(SIGQUIT, &sa, NULL);
|
2021-09-15 00:28:40 +02:00
|
|
|
|
2022-02-17 01:41:28 +01:00
|
|
|
if (argc < 1)
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
|
2022-07-13 03:20:45 +02:00
|
|
|
strncpy(argv0, argv[0], PATH_MAX - 1);
|
|
|
|
name = basename(argv0);
|
|
|
|
if (strstr(name, "pasta")) {
|
2021-09-15 00:28:40 +02:00
|
|
|
sa.sa_handler = pasta_child_handler;
|
2024-06-17 11:55:04 +02:00
|
|
|
if (sigaction(SIGCHLD, &sa, NULL))
|
|
|
|
die_perror("Couldn't install signal handlers");
|
|
|
|
|
|
|
|
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
|
|
|
|
die_perror("Couldn't set disposition for SIGPIPE");
|
2021-09-15 00:28:40 +02:00
|
|
|
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
c.mode = MODE_PASTA;
|
2022-07-13 03:20:45 +02:00
|
|
|
} else if (strstr(name, "passt")) {
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
c.mode = MODE_PASST;
|
2022-02-17 01:41:28 +01:00
|
|
|
} else {
|
|
|
|
exit(EXIT_FAILURE);
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
}
|
2020-07-13 22:55:46 +02:00
|
|
|
|
2022-02-26 23:37:05 +01:00
|
|
|
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
|
2021-09-26 23:19:40 +02:00
|
|
|
|
2022-08-23 08:31:51 +02:00
|
|
|
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
|
2024-06-15 00:37:11 +02:00
|
|
|
if (c.epollfd == -1)
|
|
|
|
die_perror("Failed to create epoll file descriptor");
|
|
|
|
|
|
|
|
if (getrlimit(RLIMIT_NOFILE, &limit))
|
|
|
|
die_perror("Failed to get maximum value of open files limit");
|
2020-07-13 22:55:46 +02:00
|
|
|
|
2022-03-19 00:33:46 +01:00
|
|
|
c.nofile = limit.rlim_cur = limit.rlim_max;
|
2024-06-15 00:37:11 +02:00
|
|
|
if (setrlimit(RLIMIT_NOFILE, &limit))
|
|
|
|
die_perror("Failed to set current limit for open files");
|
|
|
|
|
2021-10-05 19:27:04 +02:00
|
|
|
sock_probe_mem(&c);
|
2021-03-18 11:36:55 +01:00
|
|
|
|
2022-05-01 06:36:34 +02:00
|
|
|
conf(&c, argc, argv);
|
|
|
|
trace_init(c.trace);
|
|
|
|
|
2024-02-15 23:24:32 +01:00
|
|
|
pasta_netns_quit_init(&c);
|
2022-05-01 06:36:34 +02:00
|
|
|
|
2021-08-12 15:42:43 +02:00
|
|
|
tap_sock_init(&c);
|
|
|
|
|
2023-11-30 03:02:21 +01:00
|
|
|
secret_init(&c);
|
|
|
|
|
2021-09-27 00:16:51 +02:00
|
|
|
clock_gettime(CLOCK_MONOTONIC, &now);
|
|
|
|
|
2024-01-16 01:50:43 +01:00
|
|
|
flow_init();
|
|
|
|
|
2022-05-01 06:36:34 +02:00
|
|
|
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
|
2021-03-20 21:12:19 +01:00
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
|
2024-08-21 06:19:59 +02:00
|
|
|
proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
|
2021-10-05 21:15:01 +02:00
|
|
|
|
2022-07-22 07:31:17 +02:00
|
|
|
if (c.ifi4 && !c.no_dhcp)
|
2021-10-05 21:15:01 +02:00
|
|
|
dhcp_init();
|
|
|
|
|
2022-07-22 07:31:17 +02:00
|
|
|
if (c.ifi6 && !c.no_dhcpv6)
|
2021-04-13 21:59:47 +02:00
|
|
|
dhcpv6_init(&c);
|
|
|
|
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
pcap_init(&c);
|
|
|
|
|
2022-04-05 08:15:29 +02:00
|
|
|
if (!c.foreground) {
|
2024-06-15 00:37:11 +02:00
|
|
|
if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
|
|
|
|
die_perror("Failed to open /dev/null");
|
2022-04-05 08:15:29 +02:00
|
|
|
}
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
|
2023-02-15 09:24:37 +01:00
|
|
|
if (isolate_prefork(&c))
|
|
|
|
die("Failed to sandbox process, exiting");
|
2021-09-27 00:16:51 +02:00
|
|
|
|
log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 14:07:37 +02:00
|
|
|
if (!c.foreground) {
|
2024-05-22 20:18:19 +02:00
|
|
|
__daemon(c.pidfile_fd, devnull_fd);
|
log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 14:07:37 +02:00
|
|
|
log_stderr = false;
|
|
|
|
} else {
|
2024-05-22 20:18:19 +02:00
|
|
|
pidfile_write(c.pidfile_fd, getpid());
|
log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 14:07:37 +02:00
|
|
|
}
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 21:11:37 +01:00
|
|
|
|
log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 14:07:37 +02:00
|
|
|
if (pasta_child_pid) {
|
pasta: Wait for tap to be set up before spawning command
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.
Otherwise, something like:
pasta --config-net nslookup passt.top
usually fails as the nslookup command is issued before the network
interface is ready.
We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself as the kernel sees the child as init process and blocks
the delivery of the signal.
We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.
Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 12:22:59 +01:00
|
|
|
kill(pasta_child_pid, SIGUSR1);
|
log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 14:07:37 +02:00
|
|
|
log_stderr = false;
|
|
|
|
}
|
pasta: Wait for tap to be set up before spawning command
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.
Otherwise, something like:
pasta --config-net nslookup passt.top
usually fails as the nslookup command is issued before the network
interface is ready.
We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself as the kernel sees the child as init process and blocks
the delivery of the signal.
We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.
Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 12:22:59 +01:00
|
|
|
|
2022-10-14 06:25:31 +02:00
|
|
|
isolate_postfork(&c);
|
2021-10-14 12:17:47 +02:00
|
|
|
|
2021-09-27 00:16:51 +02:00
|
|
|
timer_init(&c, &now);
|
2022-02-18 16:12:11 +01:00
|
|
|
|
2020-07-13 22:55:46 +02:00
|
|
|
loop:
|
2022-03-18 12:18:19 +01:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2022-09-28 06:33:31 +02:00
|
|
|
/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
|
passt: Assorted fixes from "fresh eyes" review
A bunch of fixes not worth single commits at this stage, notably:
- make buffer, length parameter ordering consistent in ARP, DHCP,
NDP handlers
- strict checking of buffer, message and option length in DHCP
handler (a malicious client could have easily crashed it)
- set up forwarding for IPv4 and IPv6, and masquerading with nft for
IPv4, from demo script
- get rid of separate slow and fast timers, we don't save any
overhead that way
- stricter checking of buffer lengths as passed to tap handlers
- proper dequeuing from qemu socket back-end: I accidentally trashed
messages that were bundled up together in a single tap read
operation -- the length header tells us what's the size of the next
frame, but there's no apparent limit to the number of messages we
get with one single receive
- rework some bits of the TCP state machine, now passive and active
connection closes appear to be robust -- introduce a new
FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag
from socket
- streamline TCP option parsing routine
- track TCP state changes to stderr (this is temporary, proper
debugging and syslogging support pending)
- observe that multiplying a number by four might very well change
its value, and this happens to be the case for the data offset
from the TCP header as we check if it's the same as the total
length to find out if it's a duplicated ACK segment
- recent estimates suggest that the duration of a millisecond is
closer to a million nanoseconds than a thousand of them, this
trend is now reflected into the timespec_diff_ms() convenience
routine
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-21 11:33:38 +01:00
|
|
|
nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
|
2024-06-15 00:37:11 +02:00
|
|
|
if (nfds == -1 && errno != EINTR)
|
|
|
|
die_perror("epoll_wait() failed in main loop");
|
2020-07-13 22:55:46 +02:00
|
|
|
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
clock_gettime(CLOCK_MONOTONIC, &now);
|
|
|
|
|
2020-07-13 22:55:46 +02:00
|
|
|
for (i = 0; i < nfds; i++) {
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
|
2023-08-11 07:12:23 +02:00
|
|
|
uint32_t eventmask = events[i].events;
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
|
2023-08-11 07:12:23 +02:00
|
|
|
trace("%s: epoll event on %s %i (events: 0x%08x)",
|
2024-06-13 14:36:53 +02:00
|
|
|
c.mode == MODE_PASTA ? "pasta" : "passt",
|
2023-08-11 07:12:23 +02:00
|
|
|
EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
|
|
|
|
|
|
|
|
switch (ref.type) {
|
2023-08-11 07:12:29 +02:00
|
|
|
case EPOLL_TYPE_TAP_PASTA:
|
|
|
|
tap_handler_pasta(&c, eventmask, &now);
|
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TAP_PASST:
|
|
|
|
tap_handler_passt(&c, eventmask, &now);
|
2023-08-11 07:12:28 +02:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TAP_LISTEN:
|
|
|
|
tap_listen_handler(&c, eventmask);
|
2023-08-11 07:12:23 +02:00
|
|
|
break;
|
2024-02-15 23:24:32 +01:00
|
|
|
case EPOLL_TYPE_NSQUIT_INOTIFY:
|
|
|
|
pasta_netns_quit_inotify_handler(&c, ref.fd);
|
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_NSQUIT_TIMER:
|
|
|
|
pasta_netns_quit_timer_handler(&c, ref);
|
2023-08-11 07:12:23 +02:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP:
|
2024-01-16 01:50:38 +01:00
|
|
|
tcp_sock_handler(&c, ref, eventmask);
|
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP_SPLICE:
|
|
|
|
tcp_splice_sock_handler(&c, ref, eventmask);
|
2023-08-11 07:12:27 +02:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP_LISTEN:
|
|
|
|
tcp_listen_handler(&c, ref, &now);
|
2023-08-11 07:12:23 +02:00
|
|
|
break;
|
2023-08-11 07:12:26 +02:00
|
|
|
case EPOLL_TYPE_TCP_TIMER:
|
|
|
|
tcp_timer_handler(&c, ref);
|
|
|
|
break;
|
2024-07-18 07:26:53 +02:00
|
|
|
case EPOLL_TYPE_UDP_LISTEN:
|
|
|
|
udp_listen_sock_handler(&c, ref, eventmask, &now);
|
2023-08-11 07:12:23 +02:00
|
|
|
break;
|
2024-07-18 07:26:47 +02:00
|
|
|
case EPOLL_TYPE_UDP_REPLY:
|
|
|
|
udp_reply_sock_handler(&c, ref, eventmask, &now);
|
|
|
|
break;
|
2024-02-29 05:15:34 +01:00
|
|
|
case EPOLL_TYPE_PING:
|
|
|
|
icmp_sock_handler(&c, ref);
|
2023-08-11 07:12:23 +02:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Can't happen */
|
|
|
|
ASSERT(0);
|
|
|
|
}
|
2020-07-13 22:55:46 +02:00
|
|
|
}
|
|
|
|
|
2021-10-05 19:20:26 +02:00
|
|
|
post_handler(&c, &now);
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
|
|
|
|
2020-07-13 22:55:46 +02:00
|
|
|
goto loop;
|
|
|
|
}
|