Commit graph

1441 commits

Author SHA1 Message Date
Stefano Brivio
f919dc7a4b conf, netlink: Don't require a default route to start
There might be isolated testing environments where default routes and
global connectivity are not needed, a single interface has all
non-loopback addresses and routes, and still passt and pasta are
expected to work.

In this case, it's pretty obvious what our upstream interface should
be, so go ahead and select the only interface with at least one
route, disabling DHCP and implying --no-map-gw as the documentation
already states.

If there are multiple interfaces with routes, though, refuse to start,
because at that point it's really not clear what we should do.
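
The selection rule described above could be sketched as follows. This is
an illustration only, not the code from this commit: pick_upstream_ifi()
and its array-of-interface-indices input are hypothetical names chosen
for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given the interface index of each route found,
 * return the single interface they all share, 0 if there are no routes,
 * or -1 if routes exist on more than one interface (ambiguous case).
 */
static int pick_upstream_ifi(const unsigned int *route_ifi, size_t n)
{
	unsigned int found = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!route_ifi[i])
			continue;	/* entry without an interface */
		if (found && route_ifi[i] != found)
			return -1;	/* multiple interfaces: refuse */
		found = route_ifi[i];
	}

	return (int)found;
}
```

A caller would treat -1 as the "refuse to start" case and a positive
value as the upstream interface to use.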

Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/21896
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:21 +01:00
Stefano Brivio
f00b153414 netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
Martin reports that, with Fedora Linux kernel version
kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
read() as families"), pasta doesn't exit once the network namespace
is gone.

Actually, pasta is completely non-functional, at least with default
options, because nl_route_dup(), which duplicates routes from the
parent namespace into the target namespace at start-up, is stuck on
a second receive operation for RTM_GETROUTE.

However, with that commit, the kernel is now able to fit the whole
response, including the NLMSG_DONE message, into a single datagram,
so no further messages will be received.

It turns out that commit 4d6e9d0816 ("netlink: Always process all
responses to a netlink request") accidentally relied on the fact that
we would always get at least two datagrams as a response to
RTM_GETROUTE.

That is, the test to check if we expect another datagram is based
on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
but we'll also expect another datagram if NLMSG_OK on the last
message is false. But NLMSG_OK with a zero length is always false.

The problem is that we don't distinguish if status is zero because
we got a NLMSG_DONE message, or because we processed all the
available datagram bytes.

Introduce an explicit check on NLMSG_DONE. We should probably
refactor this slightly, for example by introducing a special return
code from nl_status(), but this is probably the least invasive fix
for the issue at hand.
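
A minimal sketch of such an explicit check, assuming the standard
NLMSG_OK() and NLMSG_NEXT() macros from <linux/netlink.h>; the helper
name nl_saw_done() is hypothetical and this is not the actual fix:

```c
#include <linux/netlink.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: walk the messages in one received datagram and
 * report whether NLMSG_DONE was among them.  NLMSG_OK() turning false
 * only means the buffer is exhausted, so it can't be used on its own to
 * decide whether another datagram is expected.
 */
static bool nl_saw_done(void *buf, size_t len)
{
	struct nlmsghdr *nh = buf;
	int left = (int)len;

	for (; NLMSG_OK(nh, left); nh = NLMSG_NEXT(nh, left)) {
		if (nh->nlmsg_type == NLMSG_DONE)
			return true;	/* explicit end of the response */
	}

	return false;	/* buffer exhausted, no NLMSG_DONE seen */
}
```

With a flag like this, the receive loop can stop as soon as NLMSG_DONE
was parsed, regardless of how many datagrams the kernel used.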

Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/22052
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:56:32 +01:00
David Gibson
d3eb0d7b59 tap: Rename tap_iov_{base,len}
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec.  They don't have to be used
for that, though.  Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:43 +01:00
David Gibson
4db947d17c tap: Implement tap_send() "slow path" in terms of fast path
Most times we send frames to the guest it goes via tap_send_frames().
However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go
via tap_send().

As well as being a semantic duplication, tap_send() contains at least one
serious problem: it doesn't properly handle short sends, which can be fatal
on the qemu socket connection, since frame boundaries will get out of sync.
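
For illustration, handling a short write on a stream descriptor usually
means resuming from the first unsent byte, so that the peer never sees a
truncated frame.  The write_all() helper below is a hypothetical sketch
of that idea, not the code in tap_send_frames():

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly len bytes to fd, resuming after
 * short writes.  Returns 0 on success, -1 on error (errno is set).
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}

		p += n;			/* skip what was already sent */
		len -= (size_t)n;
	}

	return 0;
}
```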

Rewrite tap_send() to call tap_send_frames().  While we're there, rename it
tap_send_single() for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:37 +01:00
David Gibson
1ebe787fe4 tap: Simplify some casts in the tap "slow path" functions
We can both remove some variables which differ from others only in type,
and slightly improve type safety.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:33 +01:00
David Gibson
2d0e0084b6 tap: Extend tap_send_frames() to allow multi-buffer frames
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer.  We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:28 +01:00
Stefano Brivio
f67238aa86 passt, log: Call __openlog() earlier, log to stderr until we detach
Paul reports that, with commit 15001b39ef ("conf: set the log level
much earlier"), early messages aren't reported to standard error
anymore.

The reason is that, once the log mask is changed from LOG_EARLY, we
don't force logging to stderr, and this mechanism was abused to have
early errors on stderr. Now that we drop LOG_EARLY earlier on, this
doesn't work anymore.

Call __openlog() as soon as we know the mode we're running as, using
LOG_PERROR. Then, once we detach, if we're not running from an
interactive terminal and logging to standard error is not forced,
drop LOG_PERROR from the options.

While at it, check if the standard error descriptor refers to a
terminal, instead of checking standard output: if the user redirects
standard output to /dev/null, they might still want to see messages
from standard error.

Further, make sure we don't print messages to standard error reporting
that we couldn't log to the system logger, if we didn't open a
connection yet. That's expected.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 15001b39ef ("conf: set the log level much earlier")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:19:36 +01:00
Stefano Brivio
3fe9878db7 pcap: Use clock_gettime() instead of gettimeofday()
POSIX.1-2008 declared gettimeofday() as obsolete, but I'm a dinosaur.

Usually, C libraries translate that to the clock_gettime() system
call anyway, but this doesn't happen in Jon's environment, and,
there, seccomp happily kills pasta(1) when started with --pcap,
because we didn't add gettimeofday() to our seccomp profiles.

Use clock_gettime() instead.
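
As a sketch of the replacement, a seconds/microseconds timestamp like
the one gettimeofday() fills in can be derived from clock_gettime() with
CLOCK_REALTIME.  The structure and function names here are hypothetical,
chosen for the example only:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/* Hypothetical timestamp layout: seconds and microseconds, 32 bits each */
struct pcap_ts_sketch {
	uint32_t sec;
	uint32_t usec;
};

/* Fill ts from the realtime clock.  Returns 0 on success, -1 on error. */
static int ts_now(struct pcap_ts_sketch *ts)
{
	struct timespec now;

	if (clock_gettime(CLOCK_REALTIME, &now))
		return -1;

	ts->sec = (uint32_t)now.tv_sec;
	ts->usec = (uint32_t)(now.tv_nsec / 1000);	/* ns to us */

	return 0;
}
```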

Reported-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:18:36 +01:00
Stefano Brivio
0761f29a14 passt.1: --{no-,}dhcp-dns and --{no-,}dhcp-search don't take addresses
...they are simple enable/disable options.

Fixes: 89678c5157 ("conf, udp: Introduce basic DNS forwarding")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:18:09 +01:00
Stefano Brivio
4d05ba2c58 conf: Warn if we can't advertise any nameserver via DHCP, NDP, or DHCPv6
We might have read from resolv.conf, or from the command line, a
resolver that's reachable via loopback address, but that doesn't mean
we can offer that via DHCP, NDP or DHCPv6: warn if there are no
resolvers we can offer for a given IP version.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:17:37 +01:00
Stefano Brivio
43881636c2 conf: Handle addresses passed via --dns just like the ones from resolv.conf
...that is, call add_dns4() and add_dns6() instead of simply adding
those to the list of servers we advertise.

Most importantly, this will set the 'dns_host' field for the matching
IP version, so that, as mentioned in the man page, servers passed via
--dns are used for DNS mapping as well, if used in combination with
--dns-forward.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=82
Fixes: 89678c5157 ("conf, udp: Introduce basic DNS forwarding")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:16:04 +01:00
Laurent Vivier
b299942bbd tap: Capture only packets that are actually sent
In tap_send_frames(), if we failed to send all the frames, we must
only log the frames that have been sent, not all the frames we wanted
to send.

Fixes: dda7945ca9 ("pcap: Handle short writes in pcap_frame()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:27 +01:00
David Gibson
413c15988e udp: Use existing helper for UDP checksum on inbound IPv6 packets
Currently we open code the calculation of the UDP checksum in
udp_update_hdr6().  We call a helper to handle the IPv6 pseudo-header,
and preset the checksum field to 0 so an uninitialised value doesn't get
folded in.  We already have a helper to do this: csum_udp6() which we use
in some slow paths.  Use it here as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:25 +01:00
David Gibson
ae69838db0 udp: Avoid unnecessary pointer in udp_update_hdr4()
We carry around the source address as a pointer to a constant struct
in_addr.  But it's silly to carry around a 4 or 8 byte pointer to a 4 byte
IPv4 address.  Just copy the IPv4 address around by value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:21 +01:00
David Gibson
b0419d150a udp: Re-order udp_update_hdr[46] for clarity and brevity
The order of things in these functions is a bit odd for historical reasons.
We initialise some IP header fields early, then more later, after making
some tests.  Likewise we declare some variables without initialisation,
but then unconditionally set them to values we could calculate at the
start of the function.

Previous cleanups have removed the reasons for some of these choices, so
reorder for clarity, and where possible move the first assignment into an
initialiser.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:19 +01:00
David Gibson
8a842e03cd udp: Pass data length explicitly to udp_update_hdr[46]
These functions take an index to the L2 buffer whose header information to
update.  They use that for two things: to locate the buffer pointer itself,
and to retrieve the length of the received message from the parallel
udp[46]_l2_mh_sock array.  The latter is arguably a failure to separate
concerns.  Change these functions to explicitly take a buffer pointer and
payload length as parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:17 +01:00
David Gibson
76571ae869 udp: Consistent port variable names in udp_update_hdr[46]
In these functions we have 'dstport' for the destination port, but
'src_port' for the source port.  Change the latter to 'srcport' for
consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:37:15 +01:00
David Gibson
205b140dec udp: Refactor udp_sock[46]_iov_init()
Each of these functions has 3 essentially identical loops in a row.
Merge the loops into a single common udp_sock_iov_init() function, calling
udp_sock[46]_iov_init_one() helpers to initialize each "slot" in the
various parallel arrays.  This is slightly neater now, and more naturally
allows changes we want to make where more initialization will become common
between IPv4 and IPv6.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-13 14:36:59 +01:00
Stefano Brivio
860d2764dd conf: Don't warn if nameservers were found, but won't be advertised
Starting from commit 3a2afde87d ("conf, udp: Drop mostly duplicated
dns_send arrays, rename related fields"), we won't add to c->ip4.dns
and c->ip6.dns nameservers that can't be used by the guest or
container, and we won't advertise them.

However, the fact that we don't advertise any nameserver doesn't mean
that we didn't find any, and we should warn only if we couldn't find
any.

This is particularly relevant in case both --dns-forward and
--no-map-gw are passed, and a single loopback address is listed in
/etc/resolv.conf: we'll forward queries directed to the address
specified by --dns-forward to the loopback address we found, we
won't advertise that address, so we shouldn't warn: this is a
perfectly legitimate usage.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/19213
Fixes: 3a2afde87d ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-03-12 01:50:48 +01:00
David Gibson
4779dfe12f icmp: Use 'flowside' epoll references for ping sockets
Currently ping sockets use a custom epoll reference type which includes
the ICMP id.  However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id.  Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.

Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:49:05 +01:00
David Gibson
02cbdb0b86 icmp: Flow based error reporting
Use flow_dbg() and flow_err() helpers to generate flow-linked error
messages in most places.  Make a few small improvements to the messages
while we're at it.  This allows us to avoid the awkward 'pname' variables
since whether we're dealing with ICMP or ICMPv6 is already built into the
flow type which these helpers include.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style fix in icmp_tap_handler()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:36:04 +01:00
David Gibson
3af5e9fdba icmp: Store ping socket information in flow table
Currently icmp_id_map[][] stores information about ping sockets in a
bespoke structure.  Move the same information into new types of flow
in the flow table.  To match that change, replace the existing ICMP
timer with a flow-based timer for expiring ping sockets.  This has the
advantage that we only need to scan the active flows, not all possible
ids.

We convert icmp_id_map[][] to point to the flow table entries, rather
than containing its own information.  We do still use that array for
locating the right ping flows, rather than using a "flow native" form
of lookup for the time being.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update id_sock description in comment to icmp_ping_new()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:34:45 +01:00
Stefano Brivio
383a6f67e5 ip: Use regular htons() for non-constant protocol number in L2_BUF_IP4_PSUM
instead of htons_constant(), which is for... constants.

Fixes: 5bf200ae8a ("tcp, udp: Don't include destination address in partially precomputed csums")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-08 10:31:14 +01:00
David Gibson
137ce01789 iov: Improve documentation of iov_skip_bytes()
As pointed out in review, the documentation comments for iov_skip_bytes()
are more confusing than they should be.  Reword them, including updating
parameter names, to make it clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-07 23:09:15 +01:00
Laurent Vivier
bb11d15495 tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6()
Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function
tcp_fill_header().

Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to
tcp_fill_headers4() and tcp_fill_headers6().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-10-lvivier@redhat.com>
[dwg: Correct commit message with new function names]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:52 +01:00
Laurent Vivier
6b22e10a26 tap: make tap_update_mac() generic
Use ethhdr rather than tap_hdr.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-9-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:49 +01:00
Laurent Vivier
7df624e79a checksum: introduce functions to compute the header part checksum for TCP/UDP
The TCP and UDP checksums are computed using the data in the TCP/UDP
payload but also some information in the IP header (protocol,
length, source and destination addresses).

We add two functions, proto_ipv4_header_psum() and
proto_ipv6_header_psum(), to compute the checksum of the IP
header part.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-8-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:47 +01:00
Laurent Vivier
feb4900c25 checksum: use csum_ip4_header() in udp.c and tcp.c
We can find the same function to compute the IPv4 header
checksum in tcp.c, udp.c and tap.c

Use the function defined for tap.c, csum_ip4_header(), but
with the code used in tcp.c and udp.c as it doesn't need a fully
initialized IPv4 header, only protocol, tot_len, saddr and daddr.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-7-lvivier@redhat.com>
[dwg: Fix weird cppcheck regression; it appears to be a problem
 in pre-existing code, but somehow this patch is exposing it]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:44 +01:00
Laurent Vivier
e82b4fe5fc udp: little cleanup in udp_update_hdrX() to prepare future changes
in udp_update_hdr4():

    Assign the source address to src, either b->s_in.sin_addr,
    c->ip4.dns_match or c->ip4.gw and then set b->iph.saddr to src->s_addr.

in udp_update_hdr6():

   Assign the source address to src, either b->s_in6.sin6_addr,
   c->ip6.dns_match, c->ip6.gw or c->ip6.addr_ll.
   Assign the destination to dst, either c->ip6.addr_seen or
   &c->ip6.addr_ll_seen.
   Then set b->ip6h.daddr to dst and b->ip6h.saddr to src.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-6-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:41 +01:00
Laurent Vivier
324bd46782 util: move IP stuff from util.[ch] to ip.[ch]
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures.  Modify various files to include the new header ip.h when
it's needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:38 +01:00
Laurent Vivier
e289d287c6 checksum: add csum_iov()
Introduce the function csum_unfolded() that computes the unfolded
32-bit checksum of a data buffer, and call it from csum() that returns
the folded value.

Introduce csum_iov() that computes the checksum using csum_unfolded() on
all vectors of the iovec array and returns the folded result.
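
The unfolded/folded split can be illustrated with a plain, byte-wise
Internet checksum.  This is a simplified sketch with hypothetical names
(sum_unfolded(), fold(), csum_iov_sketch()), not the optimised code from
checksum.c.  It assumes every buffer except the last has an even length
and that the total stays well below 128 KiB, so the 32-bit accumulator
can't overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Add the big-endian 16-bit words of buf to init, without folding */
static uint32_t sum_unfolded(const void *buf, size_t len, uint32_t init)
{
	const uint8_t *p = buf;
	uint32_t sum = init;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];

	if (i < len)			/* odd trailing byte, zero-padded */
		sum += (uint32_t)p[i] << 8;

	return sum;
}

/* Fold the carries of a 32-bit sum back into 16 bits */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Checksum over all buffers: accumulate unfolded, fold once at the end */
static uint16_t csum_iov_sketch(const struct iovec *iov, size_t n,
				uint32_t init)
{
	uint32_t sum = init;
	size_t i;

	for (i = 0; i < n; i++)
		sum = sum_unfolded(iov[i].iov_base, iov[i].iov_len, sum);

	return (uint16_t)~fold(sum);
}
```

Keeping the running sum unfolded between buffers means the fold and the
final complement happen only once, whatever the number of vectors.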

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-4-lvivier@redhat.com>
[dwg: Fixed trivial cppcheck & clang-tidy regressions]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:36 +01:00
Laurent Vivier
907621eaae checksum: align buffers
If the buffer is not aligned, use sum_16b() only on the unaligned
part, and then use csum_avx2() on the remaining part.

Remove the now-unneeded function csum_unaligned().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-3-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:32 +01:00
Laurent Vivier
94502fa15e pcap: add pcap_iov()
Introduce a new function pcap_iov() to capture a packet described by an IO
vector.

Update pcap_frame() to manage iovcnt > 1.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-2-lvivier@redhat.com>
[dwg: Fixed trivial cppcheck regressions]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:30 +01:00
David Gibson
3b9098aa49 fwd: Rename port_fwd.[ch] and their contents
Currently port_fwd.[ch] contains helpers related to port forwarding,
particularly automatic port forwarding.  We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table.  This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:27 +01:00
David Gibson
10376e7a2f port_fwd: Fix copypasta error in port_fwd_scan_udp() comments
port_fwd_scan_udp() handles UDP, as the name suggests, but its
function comment has the wrong function name and references TCP, due
to a bad copy-paste from port_fwd_scan_tcp().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:24 +01:00
David Gibson
f15be719b3 tap: Disallow loopback addresses on tap interface
The "tap" interface, whether it's actually a tuntap device or a qemu
socket, presents a virtual external link between different network hosts.
Hence, loopback addresses make no sense there.  However, nothing prevents
the guest from putting bogus packets with loopback addresses onto the
interface and it's not entirely clear what effect that will have on passt.

Explicitly test for such packets and drop them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:21 +01:00
David Gibson
3b59b9748a tcp: Validate TCP endpoint addresses
TCP connections should typically not have wildcard addresses (0.0.0.0
or ::) nor a zero port number for either endpoint.  It's not entirely
clear (at least to me) if it's strictly against the RFCs to do so, but
at any rate the socket interfaces often treat those values
specially[1], so it's not really possible to manipulate such
connections.  Likewise they should not have broadcast or multicast
addresses for either endpoint.

However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be.
To ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level,
because that would allow a guest to spam the logs).

We never expect such an address on an accept()ed socket either, but
just in case, check for it as well.

[1] Depending on context as "unknown", "match any" or "kernel, pick
    something for me"
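
For IPv4, the checks described above might look like the sketch below.
The function name tcp_ep4_valid() is hypothetical and the actual
validation in this commit may be structured differently; INADDR_ANY,
INADDR_BROADCAST and IN_MULTICAST() come from <netinet/in.h>.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: check that an IPv4 address and port (both in
 * network byte order) are usable as a TCP endpoint.
 */
static bool tcp_ep4_valid(struct in_addr addr, in_port_t port)
{
	uint32_t a = ntohl(addr.s_addr);

	if (!port)
		return false;		/* zero port */

	if (a == INADDR_ANY || a == INADDR_BROADCAST)
		return false;		/* wildcard or broadcast */

	if (IN_MULTICAST(a))
		return false;		/* 224.0.0.0/4 */

	return true;
}
```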

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:17 +01:00
David Gibson
dc9a5d71e9 tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()
tcp_listen_handler() uses the epoll reference for the listening socket
it handles, and also passes on one variant of it to
tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock().  The latter
two functions only need a couple of specific fields from the
reference.

Pass those specific values instead of the whole reference, which
localises the handling of the listening (as opposed to accepted)
socket and its reference entirely within tcp_listen_handler().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:15 +01:00
David Gibson
ee677e0a42 tcp_splice: Improve logic deciding when to splice
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.

 * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
   check for pasta mode within it, better localising the checks.
 * Previously if we got a connection from a non-loopback address we'd
   always fall back to the "tap" path, even if the connection was on a
   socket in the namespace.  If we did get a non-loopback address on a
   namespace socket, something has gone wrong and the "tap" path certainly
   won't be able to handle it.  Report the error and close, rather than
   passing it along to tap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:13 +01:00
David Gibson
4c2d923b12 tcp_splice: Improve error reporting on connect path
This makes a number of changes to improve error reporting while
connecting a new spliced socket:
 * We use flow_err() and similar functions so all messages include info
   on which specific flow was affected
 * We use strerror() to interpret raw error values
 * We now report errors on connection (at "trace" level, since this would
   allow spamming the logs)
 * We also look up and report some details on EPOLLERR events, which can
   include connection errors, since we use a non-blocking connect().  Again
   we use "trace" level since this can spam the logs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:11 +01:00
David Gibson
f0e2a6b8c9 tcp_splice: Make tcp_splice_connect() create its own sockets
Currently creating the connected socket for a splice is split between
tcp_splice_conn_from_sock(), which opens the socket, and
tcp_splice_connect() which connects it.  Alter tcp_splice_connect() to
open its own socket based on an address family and pif we pass it.

This does require a second conditional on pif, but makes for a more
logical split of functionality: tcp_splice_conn_from_sock() picks the
target, tcp_splice_connect() creates the connection.  While we're
there, improve reporting of errors.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:09 +01:00
David Gibson
f4e5d73684 tcp_splice: Merge tcp_splice_new() into its caller
The only caller of tcp_splice_new() is tcp_splice_conn_from_sock().
Both are quite short, and the division of responsibilities between the
two isn't particularly obvious.  Simplify by merging the former into
the latter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:06 +01:00
David Gibson
04d3d02603 tcp_splice: More specific variable names in new splice path
In tcp_splice_conn_from_sock(), the 'port' variable stores the source
port of the connection on the originating side.  In tcp_splice_new(),
called directly from it, the 'port' parameter gives the _destination_
port of the originating connection and is then updated to the
destination port of the connection on the other side.

Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the
accepted socket (on side 0), whereas in tcp_splice_new(), 's' is the
fd of the connecting socket (side 1).

I, for one, find having the same variable name with different meanings
in such close proximity in the flow of control pretty confusing.
Alter the names for greater specificity and clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:03 +01:00
David Gibson
0f938c3b9a flow: Clarify flow entry life cycle, introduce uniform logging
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.

In addition, make a FLOW_START() macro to mark one of the important
transitions.  This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it.  It
also logs that a new flow has been created, which is useful for
debugging.

We also add logging when a flow's lifecycle ends.  This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:01 +01:00
David Gibson
d0550f97cd tcp_splice: Don't use flow_trace() before setting flow type
In tcp_splice_conn_from_sock() we can call flow_trace() if there's an
error setting TCP_QUICKACK.  However, we do so before we've set the
flow type in the flow entry.  That means that flow_trace() will print
nonsense when it tries to print the flow type.

There's no reason the setsockopt() has to happen before initialising
the flow entry, so just move it after.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:58 +01:00
David Gibson
80f9b61b50 tcp_splice: Simplify clean up logic
Currently tcp_splice_flow_defer() contains specific logic to determine
if we're far enough initialised that we need to close pipes and/or
sockets.  This is potentially fragile if we change something about the
order in which we do things.  We can simplify this by initialising the
pipe and socket fields to -1 very early, then close()ing them if and
only if they're non-negative.
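
The pattern can be sketched as below.  The structure layout and the
function names are hypothetical, for illustration only, and don't match
the actual connection structure:

```c
#include <unistd.h>

/* Hypothetical set of descriptors owned by one spliced connection */
struct splice_fds {
	int sock[2];
	int pipe[2][2];
};

/* Mark every descriptor as "not open" before anything else happens */
static void splice_fds_init(struct splice_fds *f)
{
	int i;

	for (i = 0; i < 2; i++) {
		f->sock[i] = -1;
		f->pipe[i][0] = f->pipe[i][1] = -1;
	}
}

/* Close *fd if and only if it's non-negative.  Returns 1 if closed. */
static int close_if_open(int *fd)
{
	if (*fd < 0)
		return 0;

	close(*fd);
	*fd = -1;
	return 1;
}

/* Single cleanup path, valid at any point after splice_fds_init() */
static int splice_fds_cleanup(struct splice_fds *f)
{
	int i, closed = 0;

	for (i = 0; i < 2; i++) {
		closed += close_if_open(&f->sock[i]);
		closed += close_if_open(&f->pipe[i][0]);
		closed += close_if_open(&f->pipe[i][1]);
	}

	return closed;
}
```

Because the cleanup doesn't depend on how far initialisation got,
reordering the set-up steps can't introduce leaks or double closes.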

This lets us remove a special case cleanup if our connect() fails.
This will already trigger a CLOSING event, and the socket fd in
question is populated in the connection structure.  Thus we can let
the new cleanup logic handle it rather than requiring an explicit
close().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:48 +01:00
David Gibson
76c7e1dca3 flow: Add helper to determine a flow's protocol
Each flow already has a type field.  This implies the protocol the
flow represents, but also has more information: we have two ways to
represent TCP flows, "tap" and "spliced".  In order to generalise some
of the flow mechanics, we'll need to determine a flow's protocol in
terms of the IP (L4) protocol number.

Introduce a constant table and helper macro to derive this from the flow
type.
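
A constant table indexed by flow type, plus a lookup macro, could look
like this sketch.  The enum values, the table and the macro are named
for the example only and are assumptions, not copied from flow.h:

```c
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical flow types, for illustration */
enum flow_type_sketch {
	FLOW_SK_NONE,
	FLOW_SK_TCP,		/* TCP via "tap" */
	FLOW_SK_TCP_SPLICE,	/* spliced TCP */
	FLOW_SK_PING4,
	FLOW_SK_PING6,
	FLOW_SK_NUM_TYPES
};

/* L4 protocol number for each flow type; 0 for types without one */
static const uint8_t flow_proto_sketch[FLOW_SK_NUM_TYPES] = {
	[FLOW_SK_TCP]		= IPPROTO_TCP,
	[FLOW_SK_TCP_SPLICE]	= IPPROTO_TCP,
	[FLOW_SK_PING4]		= IPPROTO_ICMP,
	[FLOW_SK_PING6]		= IPPROTO_ICMPV6,
};

#define FLOW_PROTO_SKETCH(type)	(flow_proto_sketch[(type)])
```

Both TCP representations map to the same protocol number, which is the
extra information the type carries beyond the protocol.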

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:45 +01:00
David Gibson
bb9bf0bb8f tcp, udp: Don't precompute port remappings in epoll references
The epoll references for both TCP listening sockets and UDP sockets
includes a port number.  This gives the destination port that traffic
to that socket will be sent to on the other side.  That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.

As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address).  So,
it will no longer make sense to have a fixed value for a listening
socket.

Change to simpler semantics where this field in the reference gives
the bound port of the socket.  We apply the translations to the
correct destination port later on, when we're actually forwarding.
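
As an illustration of applying the translation at forwarding time, the
sketch below keeps a per-port delta table and adds it to the bound port
only when the destination is needed.  The structure and function names
are hypothetical, and the delta table is an assumption about how such a
mapping might be stored:

```c
#include <stdint.h>

#define NUM_PORTS_SKETCH	(1U << 16)

/* Hypothetical mapping: offset to add to each original port */
struct fwd_map_sketch {
	uint16_t delta[NUM_PORTS_SKETCH];
};

/* Record that traffic for port orig should be forwarded to target */
static void fwd_map_set(struct fwd_map_sketch *m, uint16_t orig,
			uint16_t target)
{
	m->delta[orig] = (uint16_t)(target - orig);	/* wraps mod 2^16 */
}

/* Destination port for a socket bound to port bound */
static uint16_t fwd_map_dst(const struct fwd_map_sketch *m, uint16_t bound)
{
	return (uint16_t)(bound + m->delta[bound]);
}
```

The epoll reference then only needs the bound port, and the lookup can
later take more inputs, such as addresses, without changing what is
stored per socket.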

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:40 +01:00
David Gibson
e196eada6f util: Allow IN4_IS_* macros to operate on untyped addresses
The IN4_IS_*() macros expect a pointer to a struct in_addr.  That
makes sense, but sometimes we have an IPv4 address as a void * pointer
or union type which makes these less convenient.  Additionally, this
doesn't match the behaviour of the standard library's IN6_IS_*()
macros on which they're modelled, nor our own IN4_ARE_ADDR_EQUAL().
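
One way to accept untyped addresses is to cast inside the macro, as in
this sketch.  The macro name is hypothetical and only a loopback test is
shown; the actual IN4_IS_*() definitions may differ:

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Hypothetical macro: true if the IPv4 address at a is in 127.0.0.0/8.
 * The cast lets callers pass a struct in_addr *, a void *, or a pointer
 * to a union that starts with an IPv4 address.
 */
#define IN4_IS_LOOPBACK_SKETCH(a)					\
	((ntohl(((const struct in_addr *)(a))->s_addr) >> 24) == 127)
```

The cost of the cast is that the compiler no longer rejects pointers to
unrelated types, so callers must make sure the pointer really refers to
an IPv4 address.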

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:35 +01:00
David Gibson
f6e6e8ad40 inany: Introduce union sockaddr_inany
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6.  In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address.  Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.

Introduce union sockaddr_inany to explicitly handle this case: it has
variants for sockaddr_in and sockaddr_in6.  Use it in a number of
places where it's easy to do so.
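
A sketch of what such a union and one user of it might look like.  The
names carry a _sketch suffix because they are illustrative assumptions,
not the definitions added by this commit:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Either an IPv4 or an IPv6 socket address, and nothing else */
union sockaddr_inany_sketch {
	sa_family_t		sa_family;
	struct sockaddr_in	sa4;
	struct sockaddr_in6	sa6;
};

/*
 * Hypothetical helper: store the port, in host order, in *port.
 * Returns 0 on success, -1 if the family is neither AF_INET nor AF_INET6.
 */
static int inany_port(const union sockaddr_inany_sketch *sa, uint16_t *port)
{
	switch (sa->sa_family) {
	case AF_INET:
		*port = ntohs(sa->sa4.sin_port);
		return 0;
	case AF_INET6:
		*port = ntohs(sa->sa6.sin6_port);
		return 0;
	default:
		return -1;
	}
}
```

Compared to struct sockaddr_storage, the union is only as large as
struct sockaddr_in6, and the parameter type itself documents which
address families are expected.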

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:31 +01:00