passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-05 20:11:44 +02:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0-or-later
|
2021-10-19 12:43:28 +02:00
|
|
|
* Copyright (c) 2021 Red Hat GmbH
|
|
|
|
* Author: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
*/
|
|
|
|
|
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
|
|
|
#ifndef UDP_H
|
|
|
|
#define UDP_H
|
|
|
|
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
#define UDP_TIMER_INTERVAL 1000 /* ms */
|
|
|
|
|
udp: Use tap_send_frames()
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching. First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg(). We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends. Use
that to considerably simplify things.
This does make a couple of behavioural changes:
* We used to split messages to keep them under 32kiB (except when a
single frame was longer than that). The comments claim this is
needed to stop qemu from closing the connection, but we don't have any
equivalent logic for TCP. I wasn't able to reproduce the problem with
this series, although it was apparently easy to reproduce earlier.
My suspicion is that there was never an inherent need to keep messages
small, however with larger messages (and default kernel buffer sizes)
the chances of needing more than one resend for partial send()s is
greatly increased. We used not to correctly handle that case of
multiple resends, but now we do.
* Previously when we got a partial send on UDP, we would resend the
remainder of the entire "message", including multiple frames. The
common code now only resends the remainder of a single frame, simply
dropping any frames which weren't even partially sent. This is what
TCP always did and is probably a better idea for UDP too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-06 01:43:22 +01:00
|
|
|
void udp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
|
2022-03-26 07:23:21 +01:00
|
|
|
const struct timespec *now);
|
2023-08-22 07:29:53 +02:00
|
|
|
int udp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
|
2022-03-26 07:23:21 +01:00
|
|
|
const struct pool *p, const struct timespec *now);
|
2023-02-16 01:29:55 +01:00
|
|
|
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
|
|
|
|
const void *addr, const char *ifname, in_port_t port);
|
2022-09-24 11:08:18 +02:00
|
|
|
int udp_init(struct ctx *c);
|
2022-03-26 07:23:21 +01:00
|
|
|
void udp_timer(struct ctx *c, const struct timespec *ts);
|
2023-08-22 07:29:57 +02:00
|
|
|
void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
|
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
|
|
|
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
/**
|
|
|
|
* union udp_epoll_ref - epoll reference portion for TCP connections
|
|
|
|
* @bound: Set if this file descriptor is a bound socket
|
2023-01-05 05:26:24 +01:00
|
|
|
* @splice: Set if descriptor packets to be "spliced"
|
2022-11-30 05:13:06 +01:00
|
|
|
* @orig: Set if a spliced socket which can originate "connections"
|
|
|
|
* @ns: Set if this is a socket in the pasta network namespace
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
* @v6: Set for IPv6 sockets or connections
|
|
|
|
* @port: Source port for connected sockets, bound port otherwise
|
|
|
|
* @u32: Opaque u32 value of reference
|
|
|
|
*/
|
|
|
|
union udp_epoll_ref {
|
|
|
|
struct {
|
2022-11-30 05:13:06 +01:00
|
|
|
bool splice:1,
|
|
|
|
orig:1,
|
|
|
|
ns:1,
|
|
|
|
v6:1;
|
|
|
|
uint32_t port:16;
|
2023-08-01 05:36:46 +02:00
|
|
|
};
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 08:34:53 +02:00
|
|
|
uint32_t u32;
|
|
|
|
};
|
|
|
|
|
2022-09-24 11:08:17 +02:00
|
|
|
|
|
|
|
/**
|
|
|
|
* udp_port_fwd - UDP specific port forwarding configuration
|
|
|
|
* @f: Generic forwarding configuration
|
|
|
|
* @rdelta: Reversed delta map to translate source ports on return packets
|
|
|
|
*/
|
|
|
|
struct udp_port_fwd {
|
|
|
|
struct port_fwd f;
|
2022-09-24 11:08:22 +02:00
|
|
|
in_port_t rdelta[NUM_PORTS];
|
2022-09-24 11:08:17 +02:00
|
|
|
};
|
|
|
|
|
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
|
|
|
/**
|
|
|
|
* struct udp_ctx - Execution context for UDP
|
2022-09-24 11:08:17 +02:00
|
|
|
* @fwd_in: Port forwarding configuration for inbound packets
|
|
|
|
* @fwd_out: Port forwarding configuration for outbound packets
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
* @timer_run: Timestamp of most recent timer run
|
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
|
|
|
*/
|
|
|
|
struct udp_ctx {
|
2022-09-24 11:08:17 +02:00
|
|
|
struct udp_port_fwd fwd_in;
|
|
|
|
struct udp_port_fwd fwd_out;
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 16:59:20 +02:00
|
|
|
struct timespec timer_run;
|
passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:
- storing minimum and maximum file descriptor numbers for each
protocol, fall back to SO_PROTOCOL query only on overlaps
- allocating a larger receive buffer -- this can result in more
coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
so make sure we don't exceed that within a single call to protocol
tap handlers
- nesting the handling loop in tap_handler() in the receive loop,
so that we have better chances of filling our receive buffer in
fewer calls
- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
nothing to be done in that case
and while at it:
- restore the 20ms timer interval for periodic (TCP) events, I
accidentally changed that to 100ms in an earlier commit
- attempt using SO_ZEROCOPY for UDP -- if it's not available,
sendmmsg() will succeed anyway
- fix the handling of the status code from sendmmsg(), if it fails,
we'll try to discard the first message, hence return 1 from the
UDP handler
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
|
|
|
};
|
|
|
|
|
|
|
|
#endif /* UDP_H */
|