passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting distribution to version 3 of the GPL wouldn't
provide any practical advantage either, as far as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-05 20:11:44 +02:00
// SPDX-License-Identifier: GPL-2.0-or-later
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 07:25:09 +01:00
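The "additional header reporting frame length" mentioned above can be pictured as a length prefix written in front of each Ethernet frame on the stream socket. The following is a minimal sketch only: the 32-bit, network-byte-order layout is an assumption not spelled out in the commit message, and frame_hdr_put() and frame_hdr_get() are hypothetical helper names, not functions from passt or qrap.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: every frame on the stream socket is preceded by its length,
 * stored as a 32-bit integer in network byte order. Hypothetical helpers.
 */

/* Write the length prefix for a frame of @len bytes into @buf (4 bytes) */
static size_t frame_hdr_put(uint8_t *buf, uint32_t len)
{
	uint32_t be = htonl(len);

	assert(buf);
	memcpy(buf, &be, sizeof(be));

	return sizeof(be);
}

/* Read the frame length back from a prefix stored at @buf */
static uint32_t frame_hdr_get(const uint8_t *buf)
{
	uint32_t be;

	assert(buf);
	memcpy(&be, buf, sizeof(be));

	return ntohl(be);
}
```

A sender would call frame_hdr_put() on a small buffer and write it right before the frame itself; a receiver would read four bytes first, call frame_hdr_get(), then read that many bytes of frame data.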

/* PASST - Plug A Simple Socket Transport
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-4 to Layer-4 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
 *  for qemu/UNIX domain socket mode
 *
 * PASTA - Pack A Subtle Tap Abstraction
 *  for network namespace/tap device mode
 *
 * udp.c - UDP L2-L4 translation routines
 *
 * Copyright (c) 2020-2021 Red Hat GmbH
 * Author: Stefano Brivio <sbrivio@redhat.com>
 */

/**
 * DOC: Theory of Operation
 *
 * UDP Flows
 * =========
 *
 * UDP doesn't have true connections, but many protocols use a connection-like
 * format. The flow is initiated by a client sending a datagram from a port of
 * its choosing (usually ephemeral) to a specific port (usually well known) on a
 * server. Both client and server address must be unicast. The server sends
 * replies using the same addresses & ports with src/dest swapped.
 *
 * We track pseudo-connections of this type as flow table entries of type
 * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts,
 * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds.
 *
 * NOTE: This won't handle multicast protocols, or some protocols with different
 * port usage. We'll need specific logic if we want to handle those.
 *
 * "Listening" sockets
 * ===================
 *
 * UDP doesn't use listen(), but we consider long term sockets which are allowed
 * to create new flows "listening" by analogy with TCP. This listening socket
 * could receive packets from multiple flows, so we use a hash table match to
 * find the specific flow for a datagram.
 *
 * When a UDP flow is initiated from a listening socket we take a duplicate of
 * the socket and store it in uflow->s[INISIDE]. This will last for the
 * lifetime of the flow, even if the original listening socket is closed due to
 * port auto-probing. The duplicate is used to deliver replies back to the
 * originating side.
 *
 * Reply sockets
 * =============
 *
 * When a UDP flow targets a socket, we create a "reply" socket in
 * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
 * replies on the target side. This socket is both bound and connected and has
 * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams
 * associated with this flow, so the epoll reference directly points to the flow
 * and we don't need a hash lookup.
 *
 * NOTE: it's possible that the reply socket could have a bound address
 * overlapping with an unrelated listening socket. We assume datagrams for the
 * flow will come to the reply socket in preference to a listening socket. The
 * sample program doc/platform-requirements/reuseaddr-priority.c documents and
 * tests that assumption.
 *
 * "Spliced" flows
 * ===============
 *
 * In PASTA mode, L2-L4 translation is skipped for connections to ports bound
 * between namespaces using the loopback interface; messages are directly
 * transferred between L4 sockets instead. These are called spliced connections
 * in analogy with the TCP implementation. The splice() syscall isn't
 * actually used; it doesn't make sense for datagrams and instead a pair of
 * recvmmsg() and sendmmsg() is used to forward the datagrams.
 *
 * Note that a spliced flow will have *both* a duplicated listening socket and a
 * reply socket (see above).
*/
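The recvmmsg()/sendmmsg() pairing described under "Spliced" flows can be illustrated with a small, self-contained sketch. This is not the implementation in udp.c: forward_batch(), BATCH_SKETCH and DGRAM_MAX_SKETCH are hypothetical names, and both sockets are assumed to be connected datagram sockets, so no address handling is shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_SKETCH		32	/* datagrams per batch, illustrative */
#define DGRAM_MAX_SKETCH	2048	/* buffer size per datagram, illustrative */

/* Receive whatever datagrams are queued on @from (up to one batch) and send
 * them unchanged on @to. Returns the number of datagrams sent, or a negative
 * value if nothing could be received or sent.
 */
static int forward_batch(int from, int to)
{
	static char bufs[BATCH_SKETCH][DGRAM_MAX_SKETCH];
	struct mmsghdr mmh[BATCH_SKETCH];
	struct iovec iov[BATCH_SKETCH];
	int i, n;

	assert(from >= 0 && to >= 0);

	/* One buffer and one single-element I/O vector per datagram */
	memset(mmh, 0, sizeof(mmh));
	for (i = 0; i < BATCH_SKETCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = sizeof(bufs[i]);
		mmh[i].msg_hdr.msg_iov = &iov[i];
		mmh[i].msg_hdr.msg_iovlen = 1;
	}

	n = recvmmsg(from, mmh, BATCH_SKETCH, MSG_DONTWAIT, NULL);
	if (n <= 0)
		return n;

	/* Trim each vector to the length actually received before sending */
	for (i = 0; i < n; i++)
		iov[i].iov_len = mmh[i].msg_len;

	return sendmmsg(to, mmh, n, MSG_NOSIGNAL);
}
```

The same message headers are reused for both calls, so forwarding a batch costs two system calls regardless of how many datagrams are queued, which is the point of using the multi-message variants for datagrams.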
#include <sched.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <errno.h>
#include <limits.h>
#include <assert.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>
#include <arpa/inet.h>
#include <linux/errqueue.h>
#include "checksum.h"
#include "util.h"
#include "iov.h"
#include "ip.h"
#include "siphash.h"
#include "inany.h"
#include "passt.h"
#include "tap.h"
#include "pcap.h"
#include "log.h"
#include "flow_table.h"
#define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
udp: Don't explicitly track originating socket for spliced "connections"
When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice()
we're finding the socket on which the originating packet for the
"connection" was received. However, we don't specifically need this
socket to be the originating one - we just need one that's bound to the
source port of this reply packet in the init namespace. We can look
this up in udp_splice_to_init[v6][src].target_sock, whose defining
characteristic is exactly that. The same applies with init and ns swapped.
In practice, of course, the port we locate this way will always be the
originating port, since we couldn't have started this "connection" if it
wasn't.
Change this, and we no longer need the @orig_sock field at all. That
leaves just @target_sock which we rename to simply @sock. The whole
udp_splice_flow structure now more represents a single bound port than
a "flow" per se, so rename and recomment it accordingly. Likewise the
udp_splice_to_{ns,init} names are now misleading, since the ports in
those maps are used in both directions. Rename them to
udp_splice_{ns,init} indicating the location where the described
socket is bound.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-30 05:13:09 +01:00
|
|
|
/* "Spliced" sockets indexed by bound port (host order) */
|
2024-07-18 07:26:48 +02:00
|
|
|
static int udp_splice_ns [IP_VERSIONS][NUM_PORTS];
|
|
|
|
static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
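The epoll(7) reference layout described in the commit message above can be illustrated with explicit shifts and masks. This is only a sketch: the field order, the 8-bit protocol field, and the names `ref_pack()`, `ref_proto()`, `ref_fd()` and `ref_data()` are assumptions made for illustration, not the definition of `union epoll_ref` used by this code.

```c
#include <assert.h>
#include <stdint.h>

#define REF_FD_BITS	24	/* bits for the socket number, as in the text */
#define REF_FD_MAX	((1U << REF_FD_BITS) - 1)

/* Pack protocol, socket number and opaque handler data into 64 bits.
 * Hypothetical layout: bits 0-7 protocol, 8-31 socket, 32-63 opaque data.
 */
static uint64_t ref_pack(uint8_t proto, uint32_t fd, uint32_t data)
{
	assert(fd <= REF_FD_MAX);	/* socket number must fit in 24 bits */

	return (uint64_t)proto | ((uint64_t)fd << 8) | ((uint64_t)data << 32);
}

/* Generic part: enough to pick the protocol handler and the socket */
static uint8_t ref_proto(uint64_t ref) { return ref & 0xff; }
static uint32_t ref_fd(uint64_t ref) { return (ref >> 8) & REF_FD_MAX; }

/* Opaque part: only meaningful to the selected protocol handler */
static uint32_t ref_data(uint64_t ref) { return (uint32_t)(ref >> 32); }
```

A value packed this way fits the `u64` member of `epoll_data`, so the dispatcher can recover the protocol and socket without any lookup, and hand the opaque part to the handler unchanged.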
|
|
|
|
2021-07-21 12:01:04 +02:00
|
|
|
/* Static buffers */
|
|
|
|
|
2024-05-01 10:31:08 +02:00
|
|
|
/**
|
|
|
|
* struct udp_payload_t - UDP header and data for inbound messages
|
|
|
|
* @uh: UDP header
|
|
|
|
* @data: UDP data
|
|
|
|
*/
|
|
|
|
static struct udp_payload_t {
|
|
|
|
struct udphdr uh;
|
|
|
|
char data[USHRT_MAX - sizeof(struct udphdr)];
|
|
|
|
#ifdef __AVX2__
|
|
|
|
} __attribute__ ((packed, aligned(32)))
|
|
|
|
#else
|
|
|
|
} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
|
|
|
|
#endif
|
|
|
|
udp_payload[UDP_MAX_FRAMES];
|
|
|
|
|
2024-05-01 10:31:09 +02:00
|
|
|
/* Ethernet header for IPv4 frames */
|
|
|
|
static struct ethhdr udp4_eth_hdr;
|
|
|
|
|
|
|
|
/* Ethernet header for IPv6 frames */
|
|
|
|
static struct ethhdr udp6_eth_hdr;
|
|
|
|
|
2021-07-21 12:01:04 +02:00
|
|
|
/**
|
2024-05-01 10:31:10 +02:00
|
|
|
* struct udp_meta_t - Pre-cooked headers and metadata for UDP packets
|
|
|
|
* @ip6h: Pre-filled IPv6 header (except for payload_len and addresses)
|
|
|
|
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
|
2024-05-01 08:53:45 +02:00
|
|
|
* @taph: Tap backend specific header
|
2024-05-01 10:31:10 +02:00
|
|
|
* @s_in: Source socket address, filled in by recvmmsg()
|
2024-07-18 07:26:46 +02:00
|
|
|
* @tosidx: sidx for the destination side of this datagram's flow
|
2021-07-21 12:01:04 +02:00
|
|
|
*/
|
2024-05-01 10:31:10 +02:00
|
|
|
static struct udp_meta_t {
|
2021-07-21 12:01:04 +02:00
|
|
|
struct ipv6hdr ip6h;
|
2024-05-01 10:31:10 +02:00
|
|
|
struct iphdr ip4h;
|
|
|
|
struct tap_hdr taph;
|
|
|
|
|
|
|
|
union sockaddr_inany s_in;
|
2024-07-18 07:26:46 +02:00
|
|
|
flow_sidx_t tosidx;
|
2024-05-01 10:31:10 +02:00
|
|
|
}
|
2021-07-26 07:18:50 +02:00
|
|
|
#ifdef __AVX2__
|
2024-05-01 10:31:10 +02:00
|
|
|
__attribute__ ((aligned(32)))
|
2021-07-26 07:18:50 +02:00
|
|
|
#endif
|
2024-05-01 10:31:10 +02:00
|
|
|
udp_meta[UDP_MAX_FRAMES];
|
2021-07-21 12:01:04 +02:00
|
|
|
|
2024-05-01 10:31:05 +02:00
|
|
|
/**
|
|
|
|
* enum udp_iov_idx - Indices for the buffers making up a single UDP frame
|
|
|
|
* @UDP_IOV_TAP tap specific header
|
|
|
|
* @UDP_IOV_ETH Ethernet header
|
|
|
|
* @UDP_IOV_IP IP (v4/v6) header
|
|
|
|
* @UDP_IOV_PAYLOAD IP payload (UDP header + data)
|
|
|
|
* @UDP_NUM_IOVS the number of entries in the iovec array
|
|
|
|
*/
|
|
|
|
enum udp_iov_idx {
|
|
|
|
UDP_IOV_TAP = 0,
|
|
|
|
UDP_IOV_ETH = 1,
|
|
|
|
UDP_IOV_IP = 2,
|
|
|
|
UDP_IOV_PAYLOAD = 3,
|
|
|
|
UDP_NUM_IOVS
|
|
|
|
};
|
|
|
|
|
2024-07-05 12:44:02 +02:00
|
|
|
/* IOVs and msghdr arrays for receiving datagrams from sockets */
|
|
|
|
static struct iovec udp_iov_recv [UDP_MAX_FRAMES];
|
|
|
|
static struct mmsghdr udp4_mh_recv [UDP_MAX_FRAMES];
|
|
|
|
static struct mmsghdr udp6_mh_recv [UDP_MAX_FRAMES];
|
2021-07-21 12:01:04 +02:00
|
|
|
|
2024-07-05 12:44:02 +02:00
|
|
|
/* IOVs and msghdr arrays for sending "spliced" datagrams to sockets */
|
2024-07-05 12:44:03 +02:00
|
|
|
static union sockaddr_inany udp_splice_to;
|
2023-01-05 05:26:22 +01:00
|
|
|
|
2024-07-05 12:44:02 +02:00
|
|
|
static struct iovec udp_iov_splice [UDP_MAX_FRAMES];
|
2024-07-05 12:44:03 +02:00
|
|
|
static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES];
|
2021-04-29 16:59:20 +02:00
|
|
|
|
2024-07-05 12:44:02 +02:00
|
|
|
/* IOVs for L2 frames */
|
2024-07-05 12:44:04 +02:00
|
|
|
static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS];
|
2024-07-05 12:44:02 +02:00
|
|
|
|
2023-11-06 03:17:08 +01:00
|
|
|
/**
|
|
|
|
* udp_portmap_clear() - Clear UDP port map before configuration
|
|
|
|
*/
|
|
|
|
void udp_portmap_clear(void)
|
|
|
|
{
|
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
for (i = 0; i < NUM_PORTS; i++) {
|
2024-07-18 07:26:48 +02:00
|
|
|
udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1;
|
|
|
|
udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1;
|
2023-11-06 03:17:08 +01:00
|
|
|
}
|
|
|
|
}
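The port-indexed maps cleared above can be pictured in isolation as follows. This is a minimal sketch, assuming descriptors are stored per IP version and per port in host byte order, with -1 meaning that no socket is bound; `splice_map`, `map_clear()`, `map_lookup()` and the `V4`/`V6` values are hypothetical stand-ins, not the definitions used in this file.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS	(1U << 16)
enum { V4 = 0, V6 = 1, IP_VERSIONS };

static int splice_map[IP_VERSIONS][NUM_PORTS];

/* Mark every port as unbound. -1 can never be a valid descriptor,
 * whereas the zero-initialised default (0) could be one.
 */
static void map_clear(void)
{
	unsigned i;

	for (i = 0; i < NUM_PORTS; i++)
		splice_map[V4][i] = splice_map[V6][i] = -1;
}

/* Return the socket bound to @port for IP version @v, or -1 if none */
static int map_lookup(int v, uint16_t port)
{
	assert(v == V4 || v == V6);

	return splice_map[v][port];
}
```

Indexing by the full 16-bit port makes the lookup a single array access, at the cost of two 65536-entry tables per map.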
|
|
|
|
|
2021-07-21 12:01:04 +02:00
|
|
|
/**
|
|
|
|
* udp_update_l2_buf() - Update L2 buffers with Ethernet addresses
|
|
|
|
* @eth_d: Ethernet destination address, NULL if unchanged
|
|
|
|
* @eth_s: Ethernet source address, NULL if unchanged
|
|
|
|
*/
|
2023-08-22 07:29:57 +02:00
|
|
|
void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
|
2021-07-21 12:01:04 +02:00
|
|
|
{
|
2024-05-01 10:31:09 +02:00
|
|
|
eth_update_mac(&udp4_eth_hdr, eth_d, eth_s);
|
|
|
|
eth_update_mac(&udp6_eth_hdr, eth_d, eth_s);
|
2021-07-21 12:01:04 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2024-05-01 10:31:06 +02:00
|
|
|
* udp_iov_init_one() - Initialise scatter-gather lists for one buffer
|
2022-11-24 09:54:20 +01:00
|
|
|
* @c: Execution context
|
2024-03-06 06:34:23 +01:00
|
|
|
* @i:		Index of buffer to initialise
|
2021-07-21 12:01:04 +02:00
|
|
|
*/
|
2024-05-01 10:31:06 +02:00
|
|
|
static void udp_iov_init_one(const struct ctx *c, size_t i)
|
2021-07-21 12:01:04 +02:00
|
|
|
{
|
2024-05-01 10:31:08 +02:00
|
|
|
struct udp_payload_t *payload = &udp_payload[i];
|
2024-05-01 10:31:10 +02:00
|
|
|
struct udp_meta_t *meta = &udp_meta[i];
|
2024-07-05 12:44:04 +02:00
|
|
|
struct iovec *siov = &udp_iov_recv[i];
|
|
|
|
struct iovec *tiov = udp_l2_iov[i];
|
2024-05-01 10:31:10 +02:00
|
|
|
|
|
|
|
*meta = (struct udp_meta_t) {
|
|
|
|
.ip4h = L2_BUF_IP4_INIT(IPPROTO_UDP),
|
|
|
|
.ip6h = L2_BUF_IP6_INIT(IPPROTO_UDP),
|
|
|
|
};
|
2024-05-01 10:31:08 +02:00
|
|
|
|
|
|
|
*siov = IOV_OF_LVALUE(payload->data);
|
|
|
|
|
2024-07-05 12:44:04 +02:00
|
|
|
tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
|
|
|
|
tiov[UDP_IOV_PAYLOAD].iov_base = payload;
|
|
|
|
|
|
|
|
/* It's useful to have separate msghdr arrays for receiving. Otherwise,
|
|
|
|
* an IPv4 recv() will alter msg_namelen, so we'd have to reset it every
|
|
|
|
* time or risk truncating the address on future IPv6 recv()s.
|
|
|
|
*/
|
2024-05-01 10:31:06 +02:00
|
|
|
if (c->ifi4) {
|
2024-07-05 12:44:02 +02:00
|
|
|
struct msghdr *mh = &udp4_mh_recv[i].msg_hdr;
|
2024-05-01 10:31:06 +02:00
|
|
|
|
2024-05-01 10:31:10 +02:00
|
|
|
mh->msg_name = &meta->s_in;
|
|
|
|
mh->msg_namelen = sizeof(struct sockaddr_in);
|
2024-05-01 10:31:06 +02:00
|
|
|
mh->msg_iov = siov;
|
|
|
|
mh->msg_iovlen = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (c->ifi6) {
|
2024-07-05 12:44:02 +02:00
|
|
|
struct msghdr *mh = &udp6_mh_recv[i].msg_hdr;
|
2024-05-01 10:31:06 +02:00
|
|
|
|
2024-05-01 10:31:10 +02:00
|
|
|
mh->msg_name = &meta->s_in;
|
|
|
|
mh->msg_namelen = sizeof(struct sockaddr_in6);
|
2024-05-01 10:31:06 +02:00
|
|
|
mh->msg_iov = siov;
|
|
|
|
mh->msg_iovlen = 1;
|
|
|
|
}
|
2024-03-06 06:34:23 +01:00
|
|
|
}
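The scatter-gather setup above can be illustrated on its own. The following sketch uses hypothetical names (`frame_iov_idx`, `IOV_OF()`, `frame_len()`): it builds a four-part list indexed by an enum and sums its length. The macro is an assumed equivalent of `IOV_OF_LVALUE()`, whose definition is not shown in this file.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* One slot per part of a frame, in transmission order */
enum frame_iov_idx { IOV_TAP, IOV_ETH, IOV_IP, IOV_PAYLOAD, NUM_IOVS };

/* iovec covering a whole object; the argument must be an lvalue */
#define IOV_OF(lval) \
	((struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) })

/* Total number of bytes described by the first @n entries of @iov */
static size_t frame_len(const struct iovec *iov, size_t n)
{
	size_t i, len = 0;

	assert(iov);
	for (i = 0; i < n; i++)
		len += iov[i].iov_len;

	return len;
}
```

With this arrangement the header slots can be swapped per datagram (for example between an IPv4 and an IPv6 header) while the payload slot keeps a fixed base and only its `iov_len` changes. An array like this can be handed to writev(2) or used as `msg_iov`.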
|
2021-07-21 12:01:04 +02:00
|
|
|
|
2024-03-06 06:34:23 +01:00
|
|
|
/**
|
2024-05-01 10:31:06 +02:00
|
|
|
* udp_iov_init() - Initialise scatter-gather L2 buffers
|
2024-03-06 06:34:23 +01:00
|
|
|
* @c: Execution context
|
|
|
|
*/
|
2024-05-01 10:31:06 +02:00
|
|
|
static void udp_iov_init(const struct ctx *c)
|
2024-03-06 06:34:23 +01:00
|
|
|
{
|
|
|
|
size_t i;
|
2022-11-24 09:54:20 +01:00
|
|
|
|
2024-07-05 12:44:05 +02:00
|
|
|
udp4_eth_hdr.h_proto = htons_constant(ETH_P_IP);
|
|
|
|
udp6_eth_hdr.h_proto = htons_constant(ETH_P_IPV6);
|
|
|
|
|
2024-05-01 10:31:06 +02:00
|
|
|
for (i = 0; i < UDP_MAX_FRAMES; i++)
|
|
|
|
udp_iov_init_one(c, i);
|
2021-07-21 12:01:04 +02:00
|
|
|
}
|
|
|
|
|
2024-07-05 12:44:06 +02:00
|
|
|
/**
|
|
|
|
* udp_splice_prepare() - Prepare one datagram for splicing
|
|
|
|
* @mmh: Receiving mmsghdr array
|
|
|
|
* @idx: Index of the datagram to prepare
|
|
|
|
*/
|
|
|
|
static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx)
|
|
|
|
{
|
|
|
|
udp_mh_splice[idx].msg_hdr.msg_iov->iov_len = mmh[idx].msg_len;
|
|
|
|
}
|
|
|
|
|
2021-04-29 16:59:20 +02:00
|
|
|
/**
|
2024-07-05 12:44:07 +02:00
|
|
|
* udp_splice_send() - Send a batch of datagrams from socket to socket
|
2021-04-29 16:59:20 +02:00
|
|
|
* @c: Execution context
|
2024-07-05 12:44:07 +02:00
|
|
|
* @start: Index of batch's first datagram in udp[46]_l2_buf
|
|
|
|
* @n: Number of datagrams in batch
|
|
|
|
* @tosidx:	sidx for the destination side of this flow
|
2021-04-29 16:59:20 +02:00
|
|
|
*/
|
2024-07-05 12:44:07 +02:00
|
|
|
static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
|
2024-07-18 07:26:47 +02:00
|
|
|
flow_sidx_t tosidx)
|
2021-04-29 16:59:20 +02:00
|
|
|
{
|
2024-07-18 07:26:47 +02:00
|
|
|
const struct flowside *toside = flowside_at_sidx(tosidx);
|
|
|
|
const struct udp_flow *uflow = udp_at_sidx(tosidx);
|
|
|
|
uint8_t topif = pif_at_sidx(tosidx);
|
|
|
|
int s = uflow->s[tosidx.sidei];
|
|
|
|
socklen_t sl;
|
2022-11-30 05:13:10 +01:00
|
|
|
|
2024-07-18 07:26:47 +02:00
|
|
|
pif_sockaddr(c, &udp_splice_to, &sl, topif,
|
|
|
|
&toside->eaddr, toside->eport);
|
2021-07-17 08:34:53 +02:00
|
|
|
|
2024-07-05 12:44:07 +02:00
|
|
|
sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL);
|
2022-11-30 05:13:15 +01:00
|
|
|
}
|
|
|
|
|
2022-03-15 17:51:14 +01:00
|
|
|
/**
|
2022-11-24 09:54:21 +01:00
|
|
|
* udp_update_hdr4() - Update headers for one IPv4 datagram
|
2024-06-13 14:36:51 +02:00
|
|
|
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
|
2024-05-01 10:31:08 +02:00
|
|
|
* @bp: Pointer to udp_payload_t to update
|
2024-07-18 07:26:50 +02:00
|
|
|
* @toside: Flowside for destination side
|
2024-05-01 08:53:49 +02:00
|
|
|
* @dlen: Length of UDP payload
|
2022-11-24 09:54:21 +01:00
|
|
|
*
|
2024-05-01 10:31:05 +02:00
|
|
|
* Return: size of IPv4 payload (UDP header + data)
|
2022-03-15 17:51:14 +01:00
|
|
|
*/
|
2024-07-18 07:26:50 +02:00
|
|
|
static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
|
|
|
|
const struct flowside *toside, size_t dlen)
|
2022-03-15 17:51:14 +01:00
|
|
|
{
|
2024-08-21 06:19:57 +02:00
|
|
|
const struct in_addr *src = inany_v4(&toside->oaddr);
|
2024-07-18 07:26:50 +02:00
|
|
|
const struct in_addr *dst = inany_v4(&toside->eaddr);
|
2024-05-01 10:31:08 +02:00
|
|
|
size_t l4len = dlen + sizeof(bp->uh);
|
2024-06-13 14:36:51 +02:00
|
|
|
size_t l3len = l4len + sizeof(*ip4h);
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-07-18 07:26:50 +02:00
|
|
|
ASSERT(src && dst);
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-06-13 14:36:51 +02:00
|
|
|
ip4h->tot_len = htons(l3len);
|
2024-07-18 07:26:50 +02:00
|
|
|
ip4h->daddr = dst->s_addr;
|
|
|
|
ip4h->saddr = src->s_addr;
|
|
|
|
ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, *src, *dst);
|
2024-03-06 06:34:26 +01:00
|
|
|
|
2024-08-21 06:19:57 +02:00
|
|
|
bp->uh.source = htons(toside->oport);
|
2024-07-18 07:26:50 +02:00
|
|
|
bp->uh.dest = htons(toside->eport);
|
2024-05-01 10:31:08 +02:00
|
|
|
bp->uh.len = htons(l4len);
|
2024-07-18 07:26:50 +02:00
|
|
|
csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-05-01 10:31:05 +02:00
|
|
|
return l4len;
|
2022-03-15 17:51:14 +01:00
|
|
|
}
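The length arithmetic and the IPv4 header checksum used above can be checked with a small standalone sketch. `inet_csum()` below is a generic RFC 1071 routine written for illustration; it is not the `csum_ip4_header()` helper called above, whose implementation is not shown here, and the names `l4_len()` and `l3_len()` are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_HDR_LEN	8	/* fixed UDP header size */
#define IP4_HDR_LEN	20	/* IPv4 header without options */

/* L4 length = data + UDP header, L3 length = L4 length + IPv4 header */
static size_t l4_len(size_t dlen) { return dlen + UDP_HDR_LEN; }
static size_t l3_len(size_t dlen) { return l4_len(dlen) + IP4_HDR_LEN; }

/* RFC 1071 Internet checksum over @len bytes (@len must be even),
 * returned in host byte order. The checksum field inside @buf must be
 * zero on input.
 */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

	while (sum >> 16)	/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

For an 87-byte UDP payload this gives an L4 length of 95 and a total IPv4 length of 115 (0x73), which is the value that would be stored, in network byte order, in `tot_len`.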
|
|
|
|
|
|
|
|
/**
|
2022-11-24 09:54:21 +01:00
|
|
|
* udp_update_hdr6() - Update headers for one IPv6 datagram
|
2024-06-13 14:36:51 +02:00
|
|
|
* @ip6h: Pre-filled IPv6 header (except for payload_len and addresses)
|
2024-05-01 10:31:08 +02:00
|
|
|
* @bp: Pointer to udp_payload_t to update
|
2024-07-18 07:26:50 +02:00
|
|
|
* @toside: Flowside for destination side
|
2024-05-01 08:53:49 +02:00
|
|
|
* @dlen: Length of UDP payload
|
2022-11-24 09:54:21 +01:00
|
|
|
*
|
2024-05-01 10:31:05 +02:00
|
|
|
* Return: size of IPv6 payload (UDP header + data)
|
2022-03-15 17:51:14 +01:00
|
|
|
*/
|
2024-07-18 07:26:50 +02:00
|
|
|
static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
|
|
|
|
const struct flowside *toside, size_t dlen)
|
2022-03-15 17:51:14 +01:00
|
|
|
{
|
2024-05-01 10:31:08 +02:00
|
|
|
uint16_t l4len = dlen + sizeof(bp->uh);
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-06-13 14:36:51 +02:00
|
|
|
ip6h->payload_len = htons(l4len);
|
2024-07-18 07:26:50 +02:00
|
|
|
ip6h->daddr = toside->eaddr.a6;
|
2024-08-21 06:19:57 +02:00
|
|
|
ip6h->saddr = toside->oaddr.a6;
|
2024-06-13 14:36:51 +02:00
|
|
|
ip6h->version = 6;
|
|
|
|
ip6h->nexthdr = IPPROTO_UDP;
|
|
|
|
ip6h->hop_limit = 255;
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-08-21 06:19:57 +02:00
|
|
|
bp->uh.source = htons(toside->oport);
|
2024-07-18 07:26:50 +02:00
|
|
|
bp->uh.dest = htons(toside->eport);
|
2024-06-13 14:36:51 +02:00
|
|
|
bp->uh.len = ip6h->payload_len;
|
2024-08-21 06:19:57 +02:00
|
|
|
csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, bp->data, dlen);
|
2022-03-15 17:51:14 +01:00
|
|
|
|
2024-05-01 10:31:05 +02:00
|
|
|
return l4len;
|
2022-03-15 17:51:14 +01:00
|
|
|
}
|
|
|
|
|
2024-07-05 12:44:06 +02:00
|
|
|
/**
|
|
|
|
* udp_tap_prepare() - Convert one datagram into a tap frame
|
|
|
|
* @mmh: Receiving mmsghdr array
|
|
|
|
* @idx: Index of the datagram to prepare
|
2024-07-18 07:26:50 +02:00
|
|
|
* @toside: Flowside for destination side
|
2024-07-05 12:44:06 +02:00
|
|
|
*/
|
2024-07-18 07:26:50 +02:00
|
|
|
static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
|
|
|
|
const struct flowside *toside)
|
2024-07-05 12:44:06 +02:00
|
|
|
{
|
|
|
|
struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
|
|
|
|
struct udp_payload_t *bp = &udp_payload[idx];
|
|
|
|
struct udp_meta_t *bm = &udp_meta[idx];
|
|
|
|
size_t l4len;
|
|
|
|
|
2024-08-21 06:19:57 +02:00
|
|
|
if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
|
2024-07-18 07:26:50 +02:00
|
|
|
l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
|
2024-07-05 12:44:06 +02:00
|
|
|
tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
|
|
|
|
sizeof(udp6_eth_hdr));
|
|
|
|
(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
|
|
|
|
(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
|
|
|
|
} else {
|
2024-07-18 07:26:50 +02:00
|
|
|
l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
|
2024-07-05 12:44:06 +02:00
|
|
|
tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
|
|
|
|
sizeof(udp4_eth_hdr));
|
|
|
|
(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
|
|
|
|
(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
|
|
|
|
}
|
|
|
|
(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
|
|
|
|
}
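The choice between the IPv4 and IPv6 branch above depends on whether both flow addresses can be represented as IPv4. A minimal sketch of that test, assuming addresses are held as `struct in6_addr` with IPv4 stored in IPv4-mapped form (`::ffff:a.b.c.d`); `both_v4mapped()` is a hypothetical helper built on the standard `IN6_IS_ADDR_V4MAPPED()` macro, not the `inany_v4()` function used in this file.

```c
#include <assert.h>
#include <stdbool.h>
#include <netinet/in.h>

/* True if both addresses are IPv4-mapped, i.e. the frame can carry an
 * IPv4 header; if either is a native IPv6 address, IPv6 is needed.
 */
static bool both_v4mapped(const struct in6_addr *a, const struct in6_addr *b)
{
	assert(a && b);

	return IN6_IS_ADDR_V4MAPPED(a) && IN6_IS_ADDR_V4MAPPED(b);
}
```

Requiring both addresses to be IPv4 mirrors the condition above, where a single non-IPv4 address is enough to select the IPv6 path.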
|
|
|
|
|
2024-07-17 02:36:04 +02:00
|
|
|
/**
|
|
|
|
* udp_sock_recverr() - Receive and clear an error from a socket
|
|
|
|
* @s: Socket to receive from
|
|
|
|
*
|
|
|
|
* Return: true if errors received and processed, false if no more errors
|
|
|
|
*
|
|
|
|
* #syscalls recvmsg
|
|
|
|
*/
|
|
|
|
static bool udp_sock_recverr(int s)
|
|
|
|
{
|
|
|
|
const struct sock_extended_err *ee;
|
|
|
|
const struct cmsghdr *hdr;
|
|
|
|
char buf[CMSG_SPACE(sizeof(*ee))];
|
|
|
|
struct msghdr mh = {
|
|
|
|
.msg_name = NULL,
|
|
|
|
.msg_namelen = 0,
|
|
|
|
.msg_iov = NULL,
|
|
|
|
.msg_iovlen = 0,
|
|
|
|
.msg_control = buf,
|
|
|
|
.msg_controllen = sizeof(buf),
|
|
|
|
};
|
|
|
|
ssize_t rc;
|
|
|
|
|
|
|
|
rc = recvmsg(s, &mh, MSG_ERRQUEUE);
|
|
|
|
if (rc < 0) {
|
|
|
|
if (errno != EAGAIN && errno != EWOULDBLOCK)
|
|
|
|
err_perror("Failed to read error queue");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(mh.msg_flags & MSG_ERRQUEUE)) {
|
|
|
|
err("Missing MSG_ERRQUEUE flag reading error queue");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
hdr = CMSG_FIRSTHDR(&mh);
|
|
|
|
if (!((hdr->cmsg_level == IPPROTO_IP &&
|
|
|
|
hdr->cmsg_type == IP_RECVERR) ||
|
|
|
|
(hdr->cmsg_level == IPPROTO_IPV6 &&
|
|
|
|
hdr->cmsg_type == IPV6_RECVERR))) {
|
|
|
|
err("Unexpected cmsg reading error queue");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
|
|
|
|
|
|
|
|
/* TODO: When possible propagate and otherwise handle errors */
|
|
|
|
debug("%s error on UDP socket %i: %s",
|
|
|
|
str_ee_origin(ee), s, strerror(ee->ee_errno));
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2024-07-05 12:44:06 +02:00
|
|
|
/**
|
|
|
|
* udp_sock_recv() - Receive datagrams from a socket
|
|
|
|
* @c: Execution context
|
|
|
|
* @s: Socket to receive from
|
|
|
|
* @events: epoll events bitmap
|
|
|
|
* @mmh:		mmsghdr array to receive into
|
|
|
|
*
|
2024-08-20 00:46:06 +02:00
|
|
|
* #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
|
2024-07-05 12:44:06 +02:00
|
|
|
*/
|
2024-07-17 02:36:01 +02:00
|
|
|
static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
|
|
|
|
struct mmsghdr *mmh)
|
2024-07-05 12:44:06 +02:00
|
|
|
{
|
|
|
|
/* For not entirely clear reasons (data locality?) pasta gets better
|
|
|
|
* throughput if we receive tap datagrams one at a time. For small
|
|
|
|
* splice datagrams throughput is slightly better if we do batch, but
|
|
|
|
* it's slightly worse for large splice datagrams. Since we don't know
|
|
|
|
* before we receive whether we'll use tap or splice, always go one at a
|
|
|
|
* time for pasta mode.
|
|
|
|
*/
|
|
|
|
int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
|
|
|
|
|
2024-07-17 02:36:02 +02:00
|
|
|
ASSERT(!c->no_udp);
|
|
|
|
|
2024-07-17 02:36:04 +02:00
|
|
|
/* Clear any errors first */
|
|
|
|
if (events & EPOLLERR) {
|
|
|
|
while (udp_sock_recverr(s))
|
|
|
|
;
|
|
|
|
}
|
|
|
|
|
2024-07-17 02:36:02 +02:00
|
|
|
if (!(events & EPOLLIN))
|
2024-07-05 12:44:06 +02:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
n = recvmmsg(s, mmh, n, 0, NULL);
|
|
|
|
if (n < 0) {
|
|
|
|
err_perror("Error receiving datagrams");
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return n;
|
|
|
|
}

/**
 * udp_listen_sock_handler() - Handle new data from socket
 * @c:		Execution context
 * @ref:	epoll reference
 * @events:	epoll events bitmap
 * @now:	Current timestamp
 *
 * #syscalls recvmmsg
 */
void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
			     uint32_t events, const struct timespec *now)
{
	struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv;
	int n, i;

	if ((n = udp_sock_recv(c, ref.fd, events, mmh_recv)) <= 0)
		return;

	/* We divide datagrams into batches based on how we need to send them,
	 * determined by udp_meta[i].tosidx. To avoid either two passes through
	 * the array, or recalculating tosidx for a single entry, we have to
	 * populate it one entry *ahead* of the loop counter.
	 */
	udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0].s_in, now);
	for (i = 0; i < n; ) {
		flow_sidx_t batchsidx = udp_meta[i].tosidx;
		uint8_t batchpif = pif_at_sidx(batchsidx);
		int batchstart = i;

		do {
			if (pif_is_socket(batchpif)) {
				udp_splice_prepare(mmh_recv, i);
			} else if (batchpif == PIF_TAP) {
				udp_tap_prepare(mmh_recv, i,
						flowside_at_sidx(batchsidx));
			}

			if (++i >= n)
				break;

			udp_meta[i].tosidx = udp_flow_from_sock(c, ref,
								&udp_meta[i].s_in,
								now);
		} while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx));

		if (pif_is_socket(batchpif)) {
			udp_splice_send(c, batchstart, i - batchstart,
					batchsidx);
		} else if (batchpif == PIF_TAP) {
			tap_send_frames(c, &udp_l2_iov[batchstart][0],
					UDP_NUM_IOVS, i - batchstart);
		} else if (flow_sidx_valid(batchsidx)) {
			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
			struct udp_flow *uflow = udp_at_sidx(batchsidx);

			flow_err(uflow,
				 "No support for forwarding UDP from %s to %s",
				 pif_name(pif_at_sidx(fromsidx)),
				 pif_name(batchpif));
		} else {
			debug("Discarding %d datagrams without flow",
			      i - batchstart);
		}
	}
}
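
The batching loop in udp_listen_sock_handler() computes the key for entry i + 1 while still working on the batch that contains entry i, so each key is looked up exactly once and the array is walked in a single pass. The sketch below reproduces only that control flow on plain integers. It is an illustration under stated assumptions: lookup_key() is a hypothetical stand-in for udp_flow_from_sock(), and split_batches() and MAX_ITEMS do not exist in passt.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS	32

/* Hypothetical key lookup, standing in for udp_flow_from_sock() */
static int lookup_key(int item)
{
	return item / 10;
}

/**
 * split_batches() - Group consecutive items sharing the same key
 * @items:	Input items
 * @n:		Number of items, at most MAX_ITEMS
 * @start:	Filled with the index of the first item of each batch
 * @len:	Filled with the number of items in each batch
 *
 * Return: number of batches found
 */
static size_t split_batches(const int *items, size_t n,
			    size_t *start, size_t *len)
{
	int keys[MAX_ITEMS];
	size_t i, nbatch = 0;

	assert(n <= MAX_ITEMS);
	if (!n)
		return 0;

	/* The key is always populated one entry ahead of the loop counter */
	keys[0] = lookup_key(items[0]);
	for (i = 0; i < n; ) {
		int batchkey = keys[i];
		size_t batchstart = i;

		do {
			/* Per-item preparation for entry i would go here */

			if (++i >= n)
				break;

			keys[i] = lookup_key(items[i]);
		} while (keys[i] == batchkey);

		/* Entries batchstart..i-1 share batchkey: "send" them */
		start[nbatch] = batchstart;
		len[nbatch] = i - batchstart;
		nbatch++;
	}

	return nbatch;
}
```

For the items 11, 12, 25, 26, 27, 13 the keys are 1, 1, 2, 2, 2, 1, so the function reports three batches: two items starting at index 0, three items starting at index 2, and one item at index 5. Only consecutive entries are grouped, which is why the last item forms its own batch even though its key matches the first one.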

/**
 * udp_reply_sock_handler() - Handle new data from flow specific socket
 * @c:		Execution context
 * @ref:	epoll reference
 * @events:	epoll events bitmap
 * @now:	Current timestamp
 *
 * #syscalls recvmmsg
 */
void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
			    uint32_t events, const struct timespec *now)
{
	const struct flowside *fromside = flowside_at_sidx(ref.flowside);
	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
	const struct flowside *toside = flowside_at_sidx(tosidx);
	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
	int from_s = uflow->s[ref.flowside.sidei];
	bool v6 = !inany_v4(&fromside->eaddr);
	struct mmsghdr *mmh_recv = v6 ? udp6_mh_recv : udp4_mh_recv;
	uint8_t topif = pif_at_sidx(tosidx);
	int n, i;

	ASSERT(!c->no_udp && uflow);

	if ((n = udp_sock_recv(c, from_s, events, mmh_recv)) <= 0)
		return;

	flow_trace(uflow, "Received %d datagrams on reply socket", n);
	uflow->ts = now->tv_sec;

	for (i = 0; i < n; i++) {
		if (pif_is_socket(topif))
			udp_splice_prepare(mmh_recv, i);
		else if (topif == PIF_TAP)
			udp_tap_prepare(mmh_recv, i, toside);
	}

	if (pif_is_socket(topif)) {
		udp_splice_send(c, 0, n, tosidx);
	} else if (topif == PIF_TAP) {
		tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
	} else {
		uint8_t frompif = pif_at_sidx(ref.flowside);

		flow_err(uflow, "No support for forwarding UDP from %s to %s",
			 pif_name(frompif), pif_name(topif));
	}
}

/**
 * udp_tap_handler() - Handle packets from tap
 * @c:		Execution context
 * @pif:	pif on which the packet is arriving
 * @af:		Address family, AF_INET or AF_INET6
 * @saddr:	Source address
 * @daddr:	Destination address
treewide: Packet abstraction with mandatory boundary checks
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.
Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.
This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-25 13:02:47 +01:00
 * @p:		Pool of UDP packets, with UDP headers
 * @idx:	Index of first packet to process
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
 * @now:	Current timestamp
 *
 * Return: count of consumed packets
 *
 * #syscalls sendmmsg
 */
int udp_tap_handler(const struct ctx *c, uint8_t pif,
		    sa_family_t af, const void *saddr, const void *daddr,
		    const struct pool *p, int idx, const struct timespec *now)
{
	const struct flowside *toside;
	struct mmsghdr mm[UIO_MAXIOV];
	union sockaddr_inany to_sa;
	struct iovec m[UIO_MAXIOV];
	const struct udphdr *uh;
	struct udp_flow *uflow;
	int i, s, count = 0;
	flow_sidx_t tosidx;
	in_port_t src, dst;
	uint8_t topif;
	socklen_t sl;

	ASSERT(!c->no_udp);

	uh = packet_get(p, idx, 0, sizeof(*uh), NULL);
	if (!uh)
		return 1;

	/* The caller already checks that all the messages have the same source
	 * and destination, so we can just take those from the first message.
	 */
	src = ntohs(uh->source);
	dst = ntohs(uh->dest);

	tosidx = udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst, now);
	if (!(uflow = udp_at_sidx(tosidx))) {
		char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
conf, icmp, tcp, udp: Add options to bind to outbound address and interface
I didn't notice earlier: libslirp (and slirp4netns) supports binding
outbound sockets to specific IPv4 and IPv6 addresses, to force the
source address selection. If we want to claim feature parity, we
should implement that as well.
Further, Podman supports specifying outbound interfaces as well, but
this is simply done by resolving the primary address for an interface
when the network back-end is started. However, since kernel version
5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for
non-root users"), we can actually bind to a specific interface name,
which doesn't need to be validated in advance.
Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses,
and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets
to given interfaces.
Given that it probably makes little sense to select addresses and
routes from interfaces different than the ones given for outbound
sockets, also assign those as "template" interfaces, by default,
unless explicitly overridden by '-i'.
For ICMP and UDP, we call sock_l4() to open outbound sockets, as we
already needed to bind to given ports or echo identifiers, and we
can bind() a socket only once: there, pass address (if any) and
interface (if any) for the existing bind() and setsockopt() calls.
For TCP, in general, we wouldn't otherwise bind sockets. Add a
specific helper to do that.
For UDP outbound sockets, we need to know if the final destination
of the socket is a loopback address, before we decide whether it
makes sense to bind the socket at all: move the block mangling the
address destination before the creation of the socket in the IPv4
path. This was already the case for the IPv6 path.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-08 03:29:51 +01:00
|
|
|
|
2024-07-18 07:26:49 +02:00
|
|
|
debug("Dropping datagram with no flow %s %s:%hu -> %s:%hu",
|
|
|
|
pif_name(pif),
|
|
|
|
inet_ntop(af, saddr, sstr, sizeof(sstr)), src,
|
|
|
|
inet_ntop(af, daddr, dstr, sizeof(dstr)), dst);
|
|
|
|
return 1;
|
|
|
|
}
|
conf, icmp, tcp, udp: Add options to bind to outbound address and interface
I didn't notice earlier: libslirp (and slirp4netns) supports binding
outbound sockets to specific IPv4 and IPv6 addresses, to force the
source addresse selection. If we want to claim feature parity, we
should implement that as well.
Further, Podman supports specifying outbound interfaces as well, but
this is simply done by resolving the primary address for an interface
when the network back-end is started. However, since kernel version
5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for
non-root users"), we can actually bind to a specific interface name,
which doesn't need to be validated in advance.
Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses,
and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets
to given interfaces.
Given that it probably makes little sense to select addresses and
routes from interfaces different than the ones given for outbound
sockets, also assign those as "template" interfaces, by default,
unless explicitly overridden by '-i'.
For ICMP and UDP, we call sock_l4() to open outbound sockets, as we
already needed to bind to given ports or echo identifiers, and we
can bind() a socket only once: there, pass address (if any) and
interface (if any) for the existing bind() and setsockopt() calls.
For TCP, in general, we wouldn't otherwise bind sockets. Add a
specific helper to do that.
For UDP outbound sockets, we need to know if the final destination
of the socket is a loopback address, before we decide whether it
makes sense to bind the socket at all: move the block mangling the
address destination before the creation of the socket in the IPv4
path. This was already the case for the IPv6 path.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-08 03:29:51 +01:00

2024-07-18 07:26:49 +02:00
	topif = pif_at_sidx(tosidx);
	if (topif != PIF_HOST) {
		flow_sidx_t fromsidx = flow_sidx_opposite(tosidx);
		uint8_t frompif = pif_at_sidx(fromsidx);
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-4 to Layer-4 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
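The epoll(7) reference layout described in this message can be pictured roughly as follows. This is only a sketch: the type `union demo_epoll_ref`, its field names and the 8-bit protocol field are assumptions made for illustration, and only the 24-bit socket number and the opaque per-protocol part come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical layout: a generic part (socket number, protocol) plus 32
 * opaque bits owned by the protocol handler, all within the 64 bits that
 * epoll_data_t offers.
 */
union demo_epoll_ref {
	struct {
		int32_t s:24;		/* socket number, 24 bits */
		uint32_t proto:8;	/* selects the protocol handler */
		uint32_t data;		/* opaque, per-protocol */
	};
	uint64_t u64;
};

static_assert(sizeof(union demo_epoll_ref) == sizeof(uint64_t),
	      "reference must fit in epoll_data_t");

/* Build the event to register for socket @s */
static struct epoll_event demo_ref_event(int s, uint8_t proto, uint32_t data)
{
	struct epoll_event ev = { .events = EPOLLIN };
	union demo_epoll_ref ref = { .u64 = 0 };

	/* A signed 24-bit field holds socket numbers up to 2^23 - 1 */
	assert(s >= 0 && s < (1 << 23));

	ref.s = s;
	ref.proto = proto;
	ref.data = data;
	ev.data.u64 = ref.u64;

	return ev;
}
```

On wake-up, a handler would copy `ev.data.u64` back into the union and dispatch on the generic part, leaving the opaque part to the protocol code.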

2024-07-18 07:26:49 +02:00
		flow_err(uflow, "No support for forwarding UDP from %s to %s",
			 pif_name(frompif), pif_name(topif));
		return 1;
	}
	toside = flowside_at_sidx(tosidx);
2021-07-17 08:34:53 +02:00

2024-07-18 07:26:49 +02:00
	s = udp_at_sidx(tosidx)->s[tosidx.sidei];
	ASSERT(s >= 0);
2021-07-17 08:34:53 +02:00

2024-07-18 07:26:49 +02:00
	pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport);
2021-04-22 13:39:36 +02:00

2023-09-08 03:49:47 +02:00
	for (i = 0; i < (int)p->count - idx; i++) {
2021-10-21 09:41:13 +02:00
		struct udphdr *uh_send;
treewide: Packet abstraction with mandatory boundary checks
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.
Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.
This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-25 13:02:47 +01:00
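The pool and descriptor idea from this message might look like the simplified sketch below. All names (`demo_pool`, `demo_packet_add()`, `demo_packet_get()`) are hypothetical and the semantics of the trailing length argument are an assumption based on how the surrounding code uses the value; the real passt definitions differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE	8

/* Hypothetical, simplified types: not the actual passt definitions */
struct demo_desc {
	size_t offset, len;
};

struct demo_pool {
	const char *buf;	/* backing buffer, owned by the caller */
	size_t buf_size;
	size_t count;		/* number of queued descriptors */
	struct demo_desc pkt[DEMO_POOL_SIZE];
};

/* Queue a packet: refuse anything not entirely inside the buffer */
static int demo_packet_add(struct demo_pool *p, const char *start, size_t len)
{
	if (p->count >= DEMO_POOL_SIZE)
		return -1;
	if (start < p->buf || start > p->buf + p->buf_size)
		return -1;
	if (len > (size_t)(p->buf + p->buf_size - start))
		return -1;

	p->pkt[p->count].offset = start - p->buf;
	p->pkt[p->count].len = len;
	p->count++;

	return 0;
}

/* Fetch @len bytes at @offset in packet @idx, NULL if out of bounds.
 * If @left is given, it receives the byte count following that range.
 */
static const void *demo_packet_get(const struct demo_pool *p, size_t idx,
				   size_t offset, size_t len, size_t *left)
{
	assert(p->count <= DEMO_POOL_SIZE);

	if (idx >= p->count)
		return NULL;
	if (offset > p->pkt[idx].len || len > p->pkt[idx].len - offset)
		return NULL;
	if (left)
		*left = p->pkt[idx].len - offset - len;

	return p->buf + p->pkt[idx].offset + offset;
}
```

Callers never touch the buffer directly: a handler asks for a header-sized range and treats a NULL result as a malformed packet.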
		size_t len;

2023-09-08 03:49:47 +02:00
		uh_send = packet_get(p, idx + i, 0, sizeof(*uh), &len);
2022-03-25 13:02:47 +01:00
		if (!uh_send)
2023-09-08 03:49:47 +02:00
			return p->count - idx;
2021-02-16 07:25:09 +01:00

2024-07-18 07:26:49 +02:00
		mm[i].msg_hdr.msg_name = &to_sa;
2021-04-22 13:39:36 +02:00
		mm[i].msg_hdr.msg_namelen = sl;

2022-09-13 08:37:44 +02:00
		if (len) {
			m[i].iov_base = (char *)(uh_send + 1);
			m[i].iov_len = len;

			mm[i].msg_hdr.msg_iov = m + i;
			mm[i].msg_hdr.msg_iovlen = 1;
		} else {
			mm[i].msg_hdr.msg_iov = NULL;
			mm[i].msg_hdr.msg_iovlen = 0;
		}
2022-03-25 13:02:47 +01:00

2022-09-13 08:37:43 +02:00
		mm[i].msg_hdr.msg_control = NULL;
		mm[i].msg_hdr.msg_controllen = 0;
		mm[i].msg_hdr.msg_flags = 0;

2022-03-25 13:02:47 +01:00
		count++;
2021-04-22 13:39:36 +02:00
	}

2021-07-17 08:34:53 +02:00
	count = sendmmsg(s, mm, count, MSG_NOSIGNAL);
	if (count < 0)
		return 1;
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
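To make the timestamp-based tracking in the list above more concrete, here is a minimal sketch of per-port activity timestamps that a periodic check can expire. The table `demo_udp_ts`, both function names and the 180-second timeout are assumptions for illustration only, not the data structures this commit introduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define DEMO_CONN_TIMEOUT	180	/* s, assumed value */

/* Hypothetical per-port tracking: last activity in seconds, 0 if unused */
static time_t demo_udp_ts[UINT16_MAX + 1];

/* Record activity, using the timestamp taken once after epoll_wait() */
static void demo_udp_seen(uint16_t port, const struct timespec *now)
{
	demo_udp_ts[port] = now->tv_sec;
}

/* Periodic check: true if the binding for @port timed out and was dropped */
static bool demo_udp_expired(uint16_t port, const struct timespec *now)
{
	/* Timestamps all come from the same clock, so never from the future */
	assert(now->tv_sec >= demo_udp_ts[port]);

	if (!demo_udp_ts[port])
		return false;
	if (now->tv_sec - demo_udp_ts[port] <= DEMO_CONN_TIMEOUT)
		return false;

	demo_udp_ts[port] = 0;
	return true;
}
```

Taking the timestamp once per wake-up and passing it down means handlers such as these need no clock_gettime() call of their own.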
2021-07-17 08:34:53 +02:00
	return count;
}

2022-05-01 06:36:34 +02:00
/**
 * udp_sock_init() - Initialise listening sockets for a given port
 * @c:		Execution context
 * @ns:		In pasta mode, if set, bind with loopback address in namespace
 * @af:		Address family to select a specific IP version, or AF_UNSPEC
 * @addr:	Pointer to address for binding, NULL if not configured
2022-10-07 04:53:40 +02:00
 * @ifname:	Name of interface to bind to, NULL if not configured
2022-05-01 06:36:34 +02:00
 * @port:	Port, host order
2023-02-16 01:29:55 +01:00
 *
2023-03-08 12:14:29 +01:00
 * Return: 0 on (partial) success, negative error code on (complete) failure
2022-05-01 06:36:34 +02:00
 */
2023-02-16 01:29:55 +01:00
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
		  const void *addr, const char *ifname, in_port_t port)
2022-05-01 06:36:34 +02:00
{
2024-07-18 07:26:53 +02:00
	union udp_listen_epoll_ref uref = { .port = port };
2023-08-11 07:12:21 +02:00
	int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
2022-05-01 06:36:34 +02:00

2024-07-17 02:36:02 +02:00
	ASSERT(!c->no_udp);

2024-02-28 12:25:06 +01:00
	if (ns)
2023-11-07 02:40:15 +01:00
		uref.pif = PIF_SPLICE;
2024-02-28 12:25:06 +01:00
	else
2023-11-07 02:40:15 +01:00
		uref.pif = PIF_HOST;
2022-05-01 06:36:34 +02:00

tcp, udp: Don't initialise IPv6/IPv4 sockets if IPv4/IPv6 are not enabled
If we disable a given IP version automatically (no corresponding
default route on host) or administratively (--ipv4-only or
--ipv6-only options), we don't initialise related buffers and
services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap"
handlers will also ignore packets with a disabled IP version.
However, in commit 3c6ae625101a ("conf, tcp, udp: Allow address
specification for forwarded ports") I happily changed socket
initialisation functions to take AF_UNSPEC meaning "any enabled
IP version", but I forgot to add checks back for the "enabled"
part.
Reported by Paul: on a host without default IPv6 route, but IPv6
enabled, connect, using IPv6, to a port handled by pasta, which
tries to send data to a tap device without initialised buffers
for that IP version and exits because the resulting write() fails.
Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or
pasta -4 and inbound IPv6 connection.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-09 18:21:44 +01:00
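The check this message restores can be reduced to a small predicate, sketched below under assumptions: `struct demo_ctx` and `demo_want_family()` are hypothetical stand-ins, with two booleans playing the role of whatever the real context uses to record that an IP version is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the relevant part of the execution context */
struct demo_ctx {
	bool v4_enabled, v6_enabled;
};

/* AF_UNSPEC means "any enabled IP version": a socket for @candidate is
 * wanted only if the caller asked for it (or for any) AND it is enabled.
 */
static bool demo_want_family(const struct demo_ctx *c, sa_family_t af,
			     sa_family_t candidate)
{
	assert(candidate == AF_INET || candidate == AF_INET6);

	if (af != AF_UNSPEC && af != candidate)
		return false;

	return candidate == AF_INET ? c->v4_enabled : c->v6_enabled;
}
```

With IPv4 disabled, a request for AF_UNSPEC then opens an IPv6 socket only, which is the behaviour the reported failure was missing.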
	if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) {
2023-08-01 05:36:46 +02:00
		uref.v6 = 0;
2022-05-01 06:36:34 +02:00

		if (!ns) {
2024-07-18 07:26:53 +02:00
			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
					 addr, ifname, port, uref.u32);
2022-05-01 06:36:34 +02:00

2024-07-18 07:26:48 +02:00
			udp_splice_init[V4][port] = s < 0 ? -1 : s;
2022-11-30 05:13:07 +01:00
		} else {
2024-07-18 07:26:53 +02:00
			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
2024-02-28 12:25:03 +01:00
					 &in4addr_loopback,
2023-03-08 12:38:39 +01:00
					 ifname, port, uref.u32);
2024-07-18 07:26:48 +02:00
			udp_splice_ns[V4][port] = s < 0 ? -1 : s;
2022-05-01 06:36:34 +02:00
		}
	}

2022-11-09 18:21:44 +01:00
	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6) {
2023-08-01 05:36:46 +02:00
		uref.v6 = 1;
2022-05-01 06:36:34 +02:00

		if (!ns) {
2024-07-18 07:26:53 +02:00
			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
					 addr, ifname, port, uref.u32);
2022-05-01 06:36:34 +02:00

2024-07-18 07:26:48 +02:00
			udp_splice_init[V6][port] = s < 0 ? -1 : s;
2022-11-30 05:13:07 +01:00
		} else {
2024-07-18 07:26:53 +02:00
			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
2023-03-08 12:38:39 +01:00
					 &in6addr_loopback,
					 ifname, port, uref.u32);
2024-07-18 07:26:48 +02:00
			udp_splice_ns[V6][port] = s < 0 ? -1 : s;
2022-05-01 06:36:34 +02:00
		}
	}
2023-02-16 01:29:55 +01:00

2023-08-11 07:12:21 +02:00
	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
2023-03-08 12:38:39 +01:00
		return 0;

	return r4 < 0 ? r4 : r6;
2022-05-01 06:36:34 +02:00
}

2021-07-17 08:34:53 +02:00
/**
 * udp_splice_iov_init() - Set up buffers and descriptors for recvmmsg/sendmmsg
 */
static void udp_splice_iov_init(void)
{
	int i;
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
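For readers unfamiliar with sendmmsg(), the batching this message relies on can be sketched as below. `demo_send_batch()` and `DEMO_BATCH` are hypothetical; unlike the handler described above, this sketch simply reports a failure instead of discarding the first message, and it assumes a connected datagram socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define DEMO_BATCH	4	/* kept far below UIO_MAXIOV (1024) */

/* Send up to DEMO_BATCH string payloads on connected socket @s with a
 * single system call. Returns the number of messages sent, -1 on failure.
 */
static int demo_send_batch(int s, const char *const *payload, unsigned int n)
{
	struct mmsghdr mm[DEMO_BATCH];
	struct iovec iov[DEMO_BATCH];
	unsigned int i;

	assert(n <= DEMO_BATCH);

	memset(mm, 0, sizeof(mm));
	for (i = 0; i < n; i++) {
		/* One iovec per datagram, no control data */
		iov[i].iov_base = (void *)payload[i];
		iov[i].iov_len = strlen(payload[i]);

		mm[i].msg_hdr.msg_iov = &iov[i];
		mm[i].msg_hdr.msg_iovlen = 1;
	}

	return sendmmsg(s, mm, n, MSG_NOSIGNAL);
}
```

The return value counts messages, not bytes, so a short count tells the caller how many datagrams were actually handed to the kernel.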

2022-11-30 05:13:14 +01:00
	for (i = 0; i < UDP_MAX_FRAMES; i++) {
2024-07-05 12:44:03 +02:00
		struct msghdr *mh = &udp_mh_splice[i].msg_hdr;
2023-01-05 05:26:22 +01:00

2024-07-05 12:44:03 +02:00
		mh->msg_name = &udp_splice_to;
		mh->msg_namelen = sizeof(udp_splice_to);
2021-07-17 08:34:53 +02:00
|
|
|
|
2024-05-01 10:31:08 +02:00
|
|
|
udp_iov_splice[i].iov_base = udp_payload[i].data;
|
2021-07-17 08:34:53 +02:00
|
|
|
|
2024-07-05 12:44:03 +02:00
|
|
|
mh->msg_iov = &udp_iov_splice[i];
|
|
|
|
mh->msg_iovlen = 1;
|
2021-07-17 08:34:53 +02:00
|
|
|
}
|
2021-02-16 07:25:09 +01:00
|
|
|
}
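The initialisation above prepares one msghdr and one iovec per frame so that a batch of datagrams can be moved with a recvmmsg()/sendmmsg() pair. A self-contained sketch of the same pattern follows, under simplifying assumptions: it uses connected sockets, so msg_name is left unset (unlike udp_splice_iov_init(), which points it at a destination address), and the buffer sizes and all demo_* names are hypothetical.

```c
#define _GNU_SOURCE		/* recvmmsg(), sendmmsg(), struct mmsghdr */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define DEMO_MAX_FRAMES	8	/* assumed batch size */
#define DEMO_FRAME_SIZE	256	/* assumed per-datagram buffer size */

static char demo_buf[DEMO_MAX_FRAMES][DEMO_FRAME_SIZE];
static struct iovec demo_iov[DEMO_MAX_FRAMES];
static struct mmsghdr demo_mmh[DEMO_MAX_FRAMES];

/* Point each message header at its own single-entry I/O vector */
static void demo_iov_init(void)
{
	int i;

	memset(demo_mmh, 0, sizeof(demo_mmh));

	for (i = 0; i < DEMO_MAX_FRAMES; i++) {
		struct msghdr *mh = &demo_mmh[i].msg_hdr;

		demo_iov[i].iov_base = demo_buf[i];
		demo_iov[i].iov_len = DEMO_FRAME_SIZE;

		mh->msg_iov = &demo_iov[i];
		mh->msg_iovlen = 1;
	}
}

/* Receive a batch from @from and forward it to @to.
 * Return: number of datagrams sent, or a negative value on error
 */
static int demo_forward(int from, int to)
{
	int i, n;

	demo_iov_init();

	n = recvmmsg(from, demo_mmh, DEMO_MAX_FRAMES, MSG_DONTWAIT, NULL);
	if (n <= 0)
		return n;

	/* Send back out only as many bytes as each datagram carried */
	for (i = 0; i < n; i++)
		demo_iov[i].iov_len = demo_mmh[i].msg_len;

	return sendmmsg(to, demo_mmh, n, 0);
}
```

With this layout, one system call per direction handles up to DEMO_MAX_FRAMES datagrams, and the payload is never copied between buffers in user space.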
|
|
|
|
|
2023-11-15 06:25:34 +01:00
|
|
|
/**
|
|
|
|
* udp_port_rebind() - Rebind ports to match forward maps
|
|
|
|
* @c: Execution context
|
|
|
|
* @outbound: True to remap outbound forwards, otherwise inbound
|
|
|
|
*
|
|
|
|
* Must be called in namespace context if @outbound is true.
|
|
|
|
*/
|
|
|
|
static void udp_port_rebind(struct ctx *c, bool outbound)
|
|
|
|
{
|
2024-07-18 07:26:48 +02:00
|
|
|
int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;
|
2023-11-15 06:25:34 +01:00
|
|
|
const uint8_t *fmap
|
2024-07-18 07:26:52 +02:00
|
|
|
= outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map;
|
2023-11-15 06:25:34 +01:00
|
|
|
const uint8_t *rmap
|
2024-07-18 07:26:52 +02:00
|
|
|
= outbound ? c->udp.fwd_in.map : c->udp.fwd_out.map;
|
2023-11-15 06:25:34 +01:00
|
|
|
unsigned port;
|
|
|
|
|
|
|
|
for (port = 0; port < NUM_PORTS; port++) {
|
|
|
|
if (!bitmap_isset(fmap, port)) {
|
2024-07-18 07:26:48 +02:00
|
|
|
if (socks[V4][port] >= 0) {
|
|
|
|
close(socks[V4][port]);
|
|
|
|
socks[V4][port] = -1;
|
2023-11-15 06:25:34 +01:00
|
|
|
}
|
|
|
|
|
2024-07-18 07:26:48 +02:00
|
|
|
if (socks[V6][port] >= 0) {
|
|
|
|
close(socks[V6][port]);
|
|
|
|
socks[V6][port] = -1;
|
2023-11-15 06:25:34 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Don't loop back our own ports */
|
|
|
|
if (bitmap_isset(rmap, port))
|
|
|
|
continue;
|
|
|
|
|
2024-07-18 07:26:48 +02:00
|
|
|
if ((c->ifi4 && socks[V4][port] == -1) ||
|
|
|
|
(c->ifi6 && socks[V6][port] == -1))
|
2023-11-15 06:25:34 +01:00
|
|
|
udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port);
|
|
|
|
}
|
|
|
|
}
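The per-port decision made by udp_port_rebind() can be summarised as a small pure function: close sockets for ports that are no longer in the forward map, leave alone ports that also appear in the reverse map (to avoid looping back our own ports), and bind the remaining ports that have no socket yet. The sketch below is a simplification under stated assumptions: the byte-based bitmap helper, the single "bound" flag (the original tracks IPv4 and IPv6 sockets separately) and all demo_* names are hypothetical and do not reproduce passt's own helpers.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_action {
	DEMO_CLOSE,	/* port not forwarded: close any open socket */
	DEMO_SKIP,	/* port forwarded in both directions: don't touch */
	DEMO_KEEP,	/* port forwarded and already bound */
	DEMO_BIND	/* port forwarded but not bound yet */
};

/* Hypothetical bitmap test, one bit per port, eight ports per byte */
static bool demo_bit_isset(const uint8_t *map, unsigned bit)
{
	return map[bit / 8] & (1U << (bit % 8));
}

/* Decide what to do with @port given forward map @fmap, reverse map @rmap,
 * and whether a socket is currently bound for it
 */
static enum demo_action demo_rebind_action(const uint8_t *fmap,
					   const uint8_t *rmap,
					   unsigned port, bool bound)
{
	if (!demo_bit_isset(fmap, port))
		return DEMO_CLOSE;

	if (demo_bit_isset(rmap, port))
		return DEMO_SKIP;

	return bound ? DEMO_KEEP : DEMO_BIND;
}
```

Keeping the decision separate from the close() and socket-creation calls makes the three cases easy to check in isolation.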
|
|
|
|
|
|
|
|
/**
|
|
|
|
* udp_port_rebind_outbound() - Rebind ports in namespace
|
|
|
|
* @arg: Execution context
|
|
|
|
*
|
|
|
|
* Called with NS_CALL()
|
|
|
|
*
|
|
|
|
* Return: 0
|
|
|
|
*/
|
|
|
|
static int udp_port_rebind_outbound(void *arg)
|
|
|
|
{
|
|
|
|
struct ctx *c = (struct ctx *)arg;
|
|
|
|
|
|
|
|
ns_enter(c);
|
|
|
|
udp_port_rebind(c, true);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
/**
|
2021-07-17 08:34:53 +02:00
|
|
|
* udp_timer() - Scan activity bitmaps for ports with associated timed events
|
2021-04-29 16:59:20 +02:00
|
|
|
* @c: Execution context
|
2024-01-16 01:50:32 +01:00
|
|
|
* @now: Current timestamp
|
2021-04-29 16:59:20 +02:00
|
|
|
*/
|
2024-01-16 01:50:32 +01:00
|
|
|
void udp_timer(struct ctx *c, const struct timespec *now)
|
2021-04-29 16:59:20 +02:00
|
|
|
{
|
2024-07-18 07:26:51 +02:00
|
|
|
(void)now;
|
2021-07-17 08:34:53 +02:00
|
|
|
|
2024-07-17 02:36:02 +02:00
|
|
|
ASSERT(!c->no_udp);
|
|
|
|
|
2023-11-15 06:25:34 +01:00
|
|
|
if (c->mode == MODE_PASTA) {
|
2024-07-18 07:26:52 +02:00
|
|
|
if (c->udp.fwd_out.mode == FWD_AUTO) {
|
|
|
|
fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in,
|
2024-02-28 12:25:20 +01:00
|
|
|
&c->tcp.fwd_out, &c->tcp.fwd_in);
|
2023-11-15 06:25:34 +01:00
|
|
|
NS_CALL(udp_port_rebind_outbound, c);
|
|
|
|
}
|
|
|
|
|
2024-07-18 07:26:52 +02:00
|
|
|
if (c->udp.fwd_in.mode == FWD_AUTO) {
|
|
|
|
fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out,
|
2024-02-28 12:25:20 +01:00
|
|
|
&c->tcp.fwd_in, &c->tcp.fwd_out);
|
2023-11-15 06:25:34 +01:00
|
|
|
udp_port_rebind(c, false);
|
|
|
|
}
|
|
|
|
}
|
2021-04-29 16:59:20 +02:00
|
|
|
}
|
2024-02-12 07:06:58 +01:00
|
|
|
|
|
|
|
/**
|
|
|
|
* udp_init() - Initialise per-socket data, and sockets in namespace
|
|
|
|
* @c: Execution context
|
|
|
|
*
|
|
|
|
* Return: 0
|
|
|
|
*/
|
|
|
|
int udp_init(struct ctx *c)
|
|
|
|
{
|
2024-07-17 02:36:02 +02:00
|
|
|
ASSERT(!c->no_udp);
|
|
|
|
|
2024-05-01 10:31:06 +02:00
|
|
|
udp_iov_init(c);
|
2024-02-12 07:06:58 +01:00
|
|
|
|
|
|
|
if (c->mode == MODE_PASTA) {
|
|
|
|
udp_splice_iov_init();
|
|
|
|
NS_CALL(udp_port_rebind_outbound, c);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|