We recently corrected some errors handling the endianness of IPv4
addresses. These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.
To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses. This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.
We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values. This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are several minor problems with our parsing of IPv4 netmasks (-n).
First, we don't reject nonsensical netmasks like 0.255.0.255. Address this
structurally by using prefix length instead of netmask as the primary
variable, only converting (and validating) when we need to. This has the
added benefit of making some things more uniform with the IPv6 path.
Second, when the user specifies a prefix length, we truncate the output
from strtol() to an integer, which means we would treat -n 4294967320 as
valid (equivalent to 24). Fix types to check for this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The IPv4 specific dhcp() manually constructs L2 and IP headers to send its
DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the
tap_udp6_send() helper. Now that we've broaded the parameters to
tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Callers of tap_send() can optionally use a small optimization by adding
extra space for the 4 byte length header used on the qemu socket interface.
tap_ip_send() is currently the only user of this, but this is used only
for "slow path" ICMP and DHCP packets, so there's not a lot of value to
the optimization.
Worse, having the two paths here complicates the interface and makes future
cleanups difficult, so just remove it. I have some plans to bring back the
optimization in a more general way in future, but for now it's just in the
way.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We calculate IPv4 header checksums in at least two places, in dhcp() and
in tap_ip_send. Add a helper to handle this calculation in both places.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At least two places in passt fill in UDP over IPv4 checksums, although
since UDP checksums are optional with IPv4 that just amounts to storing
a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in
dhcp()). For consistency, add a helper for this "calculation".
Just for the heck of it, add the option (compile time disabled for now) to
calculate real UDP checksums.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The context structure contains a batch of fields specific to IPv4 and to
IPv6 connectivity. Split those out into a sub-structure.
This allows the conf_ip4() and conf_ip6() functions, which take the
entire context but touch very little of it, to be given more specific
parameters, making it clearer what it affects without stepping through the
code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.
Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.
This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For compatibility with libslirp/slirp4netns users: introduce a
mechanism to map, in the UDP routines, an address facing guest or
namespace to the first IPv4 or IPv6 address resulting from
configuration as resolver. This can be enabled with the new
--dns-forward option.
This implies that sourcing and using DNS addresses and search lists,
passed via command line or read from /etc/resolv.conf, is not bound
anymore to DHCP/DHCPv6/NDP usage: for example, pasta users might just
want to use addresses from /etc/resolv.conf as mapping target, while
not passing DNS options via DHCP.
Reflect this in all the involved code paths by differentiating
DHCP/DHCPv6/NDP usage from DNS configuration per se, and in the new
options --dhcp-dns, --dhcp-search for pasta, and --no-dhcp-dns,
--no-dhcp-search for passt.
This should be the last bit to enable substantial compatibility
between slirp4netns.sh and slirp4netns(1): pass the --dns-forward
option from the script too.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clang-tidy from LLVM 13.0.1 reports some new warnings from these
checkers:
- altera-unroll-loops, altera-id-dependent-backward-branch: ignore
for the moment being, add a TODO item
- bugprone-easily-swappable-parameters: ignore, nothing to do about
those
- readability-function-cognitive-complexity: ignore for the moment
being, add a TODO item
- altera-struct-pack-align: ignore, alignment is forced in protocol
headers
- concurrency-mt-unsafe: ignore for the moment being, add a TODO
item
Fix bugprone-implicit-widening-of-multiplication-result warnings,
though, that's doable and they seem to make sense.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is the only remaining Linux-specific include -- drop it to avoid
clang-tidy warnings and to make code more portable.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...mostly false positives, but a number of very relevant ones too,
in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Unions and structs, you all have names now.
Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.
Provide a ffsl() weak declaration using gcc built-in.
Start reordering includes, but that's not enough for the
llvm-include-order checker yet.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This enables CirrOS with udhcpc to set up a route to a gateway
that's not on the assigned subnet, together with:
https://bugs.launchpad.net/cirros/+bug/1190372
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Provide an AVX2-based function using compiler intrinsics for
TCP/IP-style checksums. The load/unpack/add idea and implementation
is largely based on code from BESS (the Berkeley Extensible Software
Switch) licensed as 3-Clause BSD, with a number of modifications to
further decrease pipeline stalls and to minimise cache pollution.
This speeds up considerably data paths from sockets to tap
interfaces, decreasing overhead for checksum computation, with
16-64KiB packet buffers, from approximately 11% to 7%. The rest is
just syscalls at this point.
While at it, provide convenience targets in the Makefile for avx2,
avx2_debug, and debug targets -- these simply add target-specific
CFLAGS to the build.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This value should work for all tap-like interfaces and is rather
convenient for performance testing. It will be configurable later
on.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add support for a variable amount of DNS servers, including zero,
from /etc/resolv.conf, in DHCP, NDP and DHCPv6 implementations.
Introduce support for domain search list for DHCP (RFC 3397),
NDP (RFC 8106), and DHCPv6 (RFC 3646), also sourced from
/etc/resolv.conf.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For simplicity, we just send all available options, so there's no
distinction between forced and requested option anymore.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A bunch of fixes not worth single commits at this stage, notably:
- make buffer, length parameter ordering consistent in ARP, DHCP,
NDP handlers
- strict checking of buffer, message and option length in DHCP
handler (a malicious client could have easily crashed it)
- set up forwarding for IPv4 and IPv6, and masquerading with nft for
IPv4, from demo script
- get rid of separate slow and fast timers, we don't save any
overhead that way
- stricter checking of buffer lengths as passed to tap handlers
- proper dequeuing from qemu socket back-end: I accidentally trashed
messages that were bundled up together in a single tap read
operation -- the length header tells us what's the size of the next
frame, but there's no apparent limit to the number of messages we
get with one single receive
- rework some bits of the TCP state machine, now passive and active
connection closes appear to be robust -- introduce a new
FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag
from socket
- streamline TCP option parsing routine
- track TCP state changes to stderr (this is temporary, proper
debugging and syslogging support pending)
- observe that multiplying a number by four might very well change
its value, and this happens to be the case for the data offset
from the TCP header as we check if it's the same as the total
length to find out if it's a duplicated ACK segment
- recent estimates suggest that the duration of a millisecond is
closer to a million nanoseconds than a thousand of them, this
trend is now reflected into the timespec_diff_ms() convenience
routine
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With this, merd provides a fully functional IPv4 environment to
guests, requiring a single capability, CAP_NET_RAW.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>