As TCP and UDP use now directly vhost-user we don't need this function
anymore. Other protocols (ICMP, ARP, DHCP, ...) use tap_send()/vu_send()
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
To separate these functions from the ones specific to TCP management,
we are going to move it to a new file, but before that update their names
to reflect their role.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.
Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed. And as the payload buffer contains
only the TCP header and the TCP data we can increase the size of the
TCP data to USHRT_MAX - sizeof(struct tcphdr).
As a side effect, these changes improve performance by a factor of
x1.5.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
As pointed out in review, the documentation comments for iov_skip_bytes()
are more confusing than they should be. Reword them, including updating
parameter names, to make it clearer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function
tcp_fill_header().
Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to
tcp_fill_headers4() and tcp_fill_headers6()
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-10-lvivier@redhat.com>
[dwg: Correct commit message with new function names]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use ethhdr rather than tap_hdr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-9-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The TCP and UDP checksums are computed using the data in the TCP/UDP
payload but also some informations in the IP header (protocol,
length, source and destination addresses).
We add two functions, proto_ipv4_header_psum() and
proto_ipv6_header_psum(), to compute the checksum of the IP
header part.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-8-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can find the same function to compute the IPv4 header
checksum in tcp.c, udp.c and tap.c
Use the function defined for tap.c, csum_ip4_header(), but
with the code used in tcp.c and udp.c as it doesn't need a fully
initialiazed IPv4 header, only protocol, tot_len, saddr and daddr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-7-lvivier@redhat.com>
[dwg: Fix weird cppcheck regression; it appears to be a problem
in pre-existing code, but somehow this patch is exposing it]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
in udp_update_hdr4():
Assign the source address to src, either b->s_in.sin_addr,
c->ip4.dns_match or c->ip4.gw and then set b->iph.saddr to src->s_addr.
in udp_update_hdr6():
Assign the source address to src, either b->s_in6.sin6_addr,
c->ip6.dns_match, c->ip6.gw or c->ip6.addr_ll.
Assign the destination to dst, either c->ip6.addr_seen or
&c->ip6.addr_ll_seen.
Then set dst to b->ip6h.daddr and src to b->ip6h.saddr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-6-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures. Modify various files to include the new header ip.h when
it's needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce the function csum_unfolded() that computes the unfolded
32-bit checksum of a data buffer, and call it from csum() that returns
the folded value.
Introduce csum_iov() that computes the checksum using csum_folded() on
all vectors of the iovec array and returns the folded result.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-4-lvivier@redhat.com>
[dwg: Fixed trivial cppcheck & clang-tidy regressions]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If buffer is not aligned use sum_16b() only on the not aligned
part, and then use csum_avx2() on the remaining part
Remove unneeded now function csum_unaligned().
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-3-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce a new function pcap_iov() to capture packet desribed by an IO
vector.
Update pcap_frame() to manage iovcnt > 1.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-2-lvivier@redhat.com>
[dwg: Fixed trivial cppcheck regressions]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding. We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table. This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
port_fwd_scan_udp() handles UDP, as the name suggests, but its
function comment has the wrong function name and references TCP, due
to a bad copy-paste from port_fwd_scan_tcp().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The "tap" interface, whether it's actually a tuntap device or a qemu
socket, presents a virtual external link between different network hosts.
Hence, loopback addresses make no sense there. However, nothing prevents
the guest from putting bogus packets with loopback addresses onto the
interface and it's not entirely clear what effect that will have on passt.
Explicitly test for such packets and drop them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP connections should typically not have wildcard addresses (0.0.0.0
or ::) nor a zero port number for either endpoint. It's not entirely
clear (at least to me) if it's strictly against the RFCs to do so, but
at any rate the socket interfaces often treat those values
specially[1], so it's not really possible to manipulate such
connections. Likewise they should not have broadcast or multicast
addresses for either endpoint.
However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be.
To ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level,
because that would allow a guest to spam the logs).
We never expect such an address on an accept()ed socket either, but
just in case, check for it as well.
[1] Depending on context as "unknown", "match any" or "kernel, pick
something for me"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_listen_handler() uses the epoll reference for the listening socket
it handles, and also passes on one variant of it to
tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter
two functions only need a couple of specific fields from the
reference.
Pass those specific values instead of the whole reference, which
localises the handling of the listening (as opposed to accepted)
socket and its reference entirely within tcp_listen_handler().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.
* Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
check for pasta mode within it, better localising the checks.
* Previously if we got a connection from a non-loopback address we'd
always fall back to the "tap" path, even if the connection was on a
socket in the namespace. If we did get a non-loopback address on a
namespace socket, something has gone wrong and the "tap" path certainly
won't be able to handle it. Report the error and close, rather than
passing it along to tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This makes a number of changes to improve error reporting while
connecting a new spliced socket:
* We use flow_err() and similar functions so all messages include info
on which specific flow was affected
* We use strerror() to interpret raw error values
* We now report errors on connection (at "trace" level, since this would
allow spamming the logs)
* We also look up and report some details on EPOLLERR events, which can
include connection errors, since we use a non-blocking connect(). Again
we use "trace" level since this can spam the logs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently creating the connected socket for a splice is split between
tcp_splice_conn_from_sock(), which opens the socket, and
tcp_splice_connect() which connects it. Alter tcp_splice_connect() to
open its own socket based on an address family and pif we pass it.
This does require a second conditional on pif, but makes for a more
logical split of functionality: tcp_splice_conn_from_sock() picks the
target, tcp_splice_connect() creates the connection. While we're
there improve reporting of errors
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only caller of tcp_splice_new() is tcp_splice_conn_from_sock().
Both are quite short, and the division of responsibilities between the
two isn't particularly obvious. Simplify by merging the former into
the latter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_splice_conn_from_sock(), the 'port' variable stores the source
port of the connection on the originating side. In tcp_splice_new(),
called directly from it, the 'port' parameter gives the _destination_
port of the originating connection and is then updated to the
destination port of the connection on the other side.
Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the
accetped socket (on side 0), whereas in tcp_splice_new(), 's' is the
fd of the connecting socket (side 1).
I, for one, find having the same variable name with different meanings
in such close proximity in the flow of control pretty confusing.
Alter the names for greater specificity and clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.
In addition, make a FLOW_START() macro to mark one of the important
transitions. This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it. It
also logs that a new flow has been created, which is useful for
debugging.
We also add logging when a flow's lifecycle ends. This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_splice_conn_from_sock() we can call flow_trace() if there's an
error setting TCP_QUICKACK. However, we do so before we've set the
flow type in the flow entry. That means that flow_trace() will print
nonsense when it tries to print the flow type.
There's no reason the setsockopt() has to happen before initialising
the flow entry, so just move it after.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp_splice_flow_defer() contains specific logic to determine
if we're far enough initialised that we need to close pipes and/or
sockets. This is potentially fragile if we change something about the
order in which we do things. We can simplify this by initialising the
pipe and socket fields to -1 very early, then close()ing them if and
only if they're non-negative.
This lets us remove a special case cleanup if our connect() fails.
This will already trigger a CLOSING event, and the socket fd in
question is populated in the connection structure. Thus we can let
the new cleanup logic handle it rather than requiring an explicit
close().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Each flow already has a type field. This implies the protocol the
flow represents, but also has more information: we have two ways to
represent TCP flows, "tap" and "spliced". In order to generalise some
of the flow mechanics, we'll need to determine a flow's protocol in
terms of the IP (L4) protocol number.
Introduce a constant table and helper macro to derive this from the flow
type.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The epoll references for both TCP listening sockets and UDP sockets
includes a port number. This gives the destination port that traffic
to that socket will be sent to on the other side. That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.
As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address). So,
it will no longer make sense to have a fixed value for a listening
socket.
Change to simpler semantics where this field in the reference gives
the bound port of the socket. We apply the translations to the
correct destination port later on, when we're actually forwarding.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The IN4_IS_*() macros expect a pointer to a struct in_addr. That
makes sense, but sometimes we have an IPv4 address as a void * pointer
or union type which makes these less convenient. Additionally, this
doesn't match the behaviour of the standard library's IN6_IS_*()
macros on which they're modelled, nor our own IN4_ARE_ADDR_EQUAL().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6. In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address. Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.
Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6. Use it in a number of
places where it's easy to do so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely. We don't yet
provide constants of this type for special addresses (loopback and
"any"). Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.
To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any. For the IPv4 versions we need to create
new constant globals.
For complicated historical reasons, the standard library doesn't
provide constants for IPv4 loopback and any addresses as struct
in_addr. It just has macros of type in_addr_t == uint32_t, which has
some gotchas w.r.t. endianness. We can use some more macros to
address this lack, using macros to effectively create these IPv4
constants as pieces of the inany constants above.
We use this last to avoid some awkward temporary variables just used
to get an address of an IPv4 loopback address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add this helper to format an inany into either IPv4 or IPv6 text
format as appropriate.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add helpers to determine if an inany is loopback, unspecified or
multicast, regardless of whether it's a "true" IPv6 address or an IPv4
address represented as v4-mapped.
Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we determine we have sent a partial frame in tap_send_frames_passt(),
we call tap_send_remainder() to send the remainder of it. The logic in
that function is very similar to that in the more general write_remainder()
except that it uses send() instead of write()/writev(). But we are dealing
specifically with the qemu socket here, which is a connected stream socket.
In that case write()s do the same thing as send() with the options we were
using, so we can just reuse write_remainder().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently pcap_frame() assumes that if write() doesn't return an error, it
has written everything we want. That's not necessarily true, because it
could return a short write. That's not likely to happen on a regular file,
but there's not a lot of reason not to be robust here; it's conceivable we
might want to direct the pcap fd at a named pipe or similar.
So, make pcap_frame() handle short frames by using the write_remainder()
helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Formatting fix, and avoid gcc warning in pcap_frame()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have several places where we want to write(2) a buffer or buffers and we
handle short write()s by retrying until everything is successfully written.
Add a helper for this in util.c.
This version has some differences from the typical write_all() function.
First, take an IO vector rather than a single buffer, because that will be
useful for some of our cases. Second, allow it to take an parameter to
skip the first n bytes of the given buffers. This will be useful for some
of the cases we want, and also falls out quite naturally from the
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor formatting fixes in write_remainder()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Update the low-level helper pcap_frame() to take a struct iovec and
offset within it, rather than an explicit pointer and length for the
frame. This moves the handling of an offset (to skip vnet_len) from
pcap_multiple() to pcap_frame().
This doesn't accomplish a great deal immediately, but will make
subsequent changes easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>