tap_ip6_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation. By using a suitable
helper we can split it into tap_udp6_send() and tap_icmp6_send() functions
without greatly increasing the code size, this removing that layering
violation.
We make some small changes to the interface while there. In both cases
we make the destination IPv6 address a parameter, which will be useful
later. For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header. For the ICMP variant we pass in the ICMP
header as before. The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and
it turns out that every caller (statically) knows if it is using IPv4 or
IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions.
Use a new tap_l2_hdr() function for the very small common part.
While we're there, make some minor cleanups:
- We were double writing some fields in the IPv6 header, so that it
temporary matched the pseudo-header for checksum calculation. With
recent checksum reworks, this isn't neccessary any more.
- We don't use any IPv4 header options, so use some sizeof() constructs
instead of some open coded values for header length.
- The comment used to say that the flow label was for TCP over IPv6, but
in fact the only thing we used it for was DHCPv6 over UDP traffic
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Callers of tap_send() can optionally use a small optimization by adding
extra space for the 4 byte length header used on the qemu socket interface.
tap_ip_send() is currently the only user of this, but this is used only
for "slow path" ICMP and DHCP packets, so there's not a lot of value to
the optimization.
Worse, having the two paths here complicates the interface and makes future
cleanups difficult, so just remove it. I have some plans to bring back the
optimization in a more general way in future, but for now it's just in the
way.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_ip_send() doesn't take a destination address, because it's specifically
for inbound packets, and the IP addresses of the guest/namespace are
already known to us. Rather than open-coding this destination address
logic, make helper functions for it which will enable some later cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The existing behaviour is not really practical: an automated agent in
charge of starting both qemu and passt would need to fork itself to
start passt, because passt won't fork to background until qemu
connects, and the agent needs to unblock to start qemu.
Instead of waiting for a connection to daemonise, do it right away as
soon as a socket is available: that can be considered an initialised
state already.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
SPDX tags don't replace license files. Some notices were missing and
some tags were not according to the SPDX specification, too.
Now reuse --lint from the REUSE tool (https://reuse.software/) passes.
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This isn't optional: TCP streams must carry a unique, hard-to-guess,
non-zero label for each direction. Linux, probably among others,
will otherwise refuse to associate packets in a given stream to the
same connection.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>