Commit graph

367 commits

Author SHA1 Message Date
Stefano Brivio
1a66806c18 tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.

Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.

Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:17 +02:00
Stefano Brivio
2aea1da143 treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.

Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:

  https://wiki.debian.org/ReleaseGoals/64bit-time

that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.

Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).

I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.

Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:00:43 +02:00
David Gibson
e6feb5a892 treewide: Use "our address" instead of "forwarding address"
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing.  As discussed on a
recent call, let's try "our" instead.

(While we're there correct an error in flow_initiate_af()s comments where
we referred to parameters by the wrong name).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:29 +02:00
Stefano Brivio
13295583f8 tcp: Change SO_PEEK_OFF support message to debug()
This:

  $ ./pasta
  SO_PEEK_OFF not supported
  #

is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.

Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.

Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-25 12:25:26 +02:00
Jon Maloy
9cb6b50815 tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6
Based on an original patch by Jon Maloy:

--
The recently added socket option SO_PEEK_OFF is not supported for
TCP/IPv6 sockets. Until we get that support into the kernel we need to
test for support in both protocols to set the global 'peek_offset_cap´
to true.
--

Compared to the original patch:
- only check for SO_PEEK_OFF support for enabled IP versions
- use sa_family_t instead of int to pass the address family around

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-23 16:42:27 +02:00
David Gibson
060f24e310 flow, tcp: Flow based NAT and port forwarding for TCP
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side.  This performs any NAT or port forwarding needed.

We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:29 +02:00
David Gibson
508adde342 tcp: Re-use flow hash for initial sequence number generation
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp.  Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.

With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one.  It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:01 +02:00
David Gibson
acca4235c4 flow, tcp: Generalise TCP hash table to general flow hash table
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows.  This is
largely code motion and straightforward renames.  There are two semantic
changes:

 * flow_lookup_af() now needs to verify that the entry has a matching
   protocol and interface as well as matching addresses and ports.

 * We double the size of the hash table, because it's now at least
   theoretically possible for both sides of each flow to be hashed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:59 +02:00
David Gibson
163a339214 tcp, flow: Replace TCP specific hash function with general flow hash
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports.  We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well.  We also want a unified hash table
to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow.  This includes all the
information from struct flowside, plus the pif and the L4 protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:56 +02:00
David Gibson
528a6517f8 tcp: Simplify endpoint validation using flowside information
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:50 +02:00
David Gibson
e2ea10e246 tcp: Manage outbound address via flow table
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:47 +02:00
David Gibson
52d45f1737 tcp: Obtain guest address from flowside
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will probably result in an RST from the guest.  Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:44 +02:00
David Gibson
f9fe212b1f tcp, flow: Remove redundant information, repack connection structures
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure.  Access it from there instead, and
remove it from the TCP specific structure.   With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:41 +02:00
David Gibson
4e2d36e83f flow: Common address information for target side
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP.  For now this leaves some information
redundantly recorded in both generic and type specific fields.  We'll
fix that in later patches.

For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:37 +02:00
David Gibson
8012f5ff55 flow: Common address information for initiating side
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow.  Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.

For now we only populate the initiating side, requiring that
information be completed when a flows enter INI.  Later patches will
populate the target side.

For now this leaves some information redundantly recorded in both generic
and type specific fields.  We'll fix that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:32 +02:00
David Gibson
9b125e7776 flow, icmp, tcp: Clean up helpers for getting flow from index
TCP (both regular and spliced) and ICMP both have macros to retrieve the
relevant protcol specific flow structure from a flow index.  In most cases
what we actually want is to get the specific flow from a sidx.  Replace
those simple macros with a more precise inline, which also asserts that
the flow is of the type we expect.

While we're they're also add a pif_at_sidx() helper to get the interface of
a specific flow & side, which is useful in some places.

Finally, fix some minor style issues in the comments on some of the
existing sidx related helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:27:27 +02:00
David Gibson
4e1f850f61 udp, tcp: Tweak handling of no_udp and no_tcp flags
We abort the UDP socket handler if the no_udp flag is set.  But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place.  If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop.  The same is the case for the TCP socket and timer handlers and
the no_tcp flag.

Change these checks on the flag to ASSERT()s.  Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:15 +02:00
Jon Maloy
a740e16fd1 tcp: handle shrunk window advertisements from guest
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.

However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.

RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.

It turns out that this solution is extremely simple to implememt in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.

When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
  retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
  advertised window. In fact, in all observed cases we have seen that
  it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
  unless they already have been, meaning that only the next timer
  expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
  abated, and the data flow can resume.

It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.

Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
 to workaround
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 18:05:08 +02:00
Jon Maloy
e63d281871 tcp: leverage support of SO_PEEK_OFF socket option when available
>From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes is possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->name space direction.

In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.

Measurements with iperf3 shows that throughput increases with 15-20
percent in the host->namespace direction when this feature is used.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:57:03 +02:00
David Gibson
8f8eb73482 flow: Add flow_sidx_valid() helper
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t.  We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.

That's not always done consistently though.  In flow_at_sidx() we open code
a check on the flow index.  In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.

Clean this up a bit, by adding an explicit flow_sidx_valid() test function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:25 +02:00
David Gibson
74c1c5efcf util: sock_l4() determine protocol from epoll type rather than the reverse
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state.  Currently it determines the correct tag for the epoll
data based on the protocol.  However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol.  So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:09 +02:00
Stefano Brivio
dba7f0f5ce treewide: Replace strerror() calls
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.

While at it, convert a few error messages from a scant perror style
to proper failure descriptions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:44 +02:00
Stefano Brivio
54a9d3801b tcp: Don't rely on bind() to fail to decide that connection target is valid
Commit e1a2e2780c ("tcp: Check if connection is local or low RTT
was seen before using large MSS") added a call to bind() before we
issue a connect() to the target for an outbound connection.

If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we
can conclude that the target address is a local (host) address, and we
can use an unlimited MSS.

While at it, according to the reasoning of that commit, if bind()
succeeds, we would know right away that nobody is listening at that
(local) address and port, and we don't even need to call connect(): we
can just fail early and reset the connection attempt.

But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or
net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will
actually succeed, so we can't rely on it to fail in general.

The visible issue with the existing behaviour is that we would reset
any outbound connection to non-local addresses, if non-local binds are
enabled.

Keep the significant optimisation for local addresses along with the
bind() call, but if it succeeds, don't draw any conclusion: close the
socket, grab another one, and proceed normally.

This will incur a small latency penalty if non-local binds are
enabled (we'll likely fetch an existing socket from the pool but
additionally call close()), or if the target is local but not bound:
we'll need to call connect() and get a failure before relaying that
failure back.

Link: https://github.com/containers/podman/issues/23003
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-19 15:00:55 +02:00
Laurent Vivier
fba2b544b6 tcp: move buffers management functions to their own file
Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c

With this change we can use existing TCP functions with a
different kind of memory storage as for instance the shared
memory provided by the guest via vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:05 +02:00
Laurent Vivier
ec26fa013a tcp: extract buffer management from tcp_send_flag()
This commit isolates the internal data structure management used for storing
data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[],
tcp4_flags[], ...) from the tcp_send_flag() function. The extracted
functionality is relocated to a new function named tcp_fill_flag_header().

tcp_fill_flag_header() is now a generic function that accepts parameters such
as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to
pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[].

This separation sets the stage for utilizing tcp_prepare_flags() to
set the memory provided by the guest via vhost-user in future developments.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:43:35 +02:00
David Gibson
d949667436 cppcheck: Suppress constParameterCallback errors
We have several functions which are used as callbacks for NS_CALL() which
only read their void * parameter, they don't write it.  The
constParameterCallback warning in cppcheck 2.14.1 complains that this
parameter could be const void *, also pointing out that that would require
casting the function pointer when used as a callback.

Casting the function pointers seems substantially uglier than using a
non-const void * as the parameter, especially since in each case we cast
the void * to a const pointer of specific type immediately.  So, suppress
these errors.

I think it would make logical sense to suppress this globally, but that
would cause unmatchedSuppression errors on earlier cppcheck versions.  So,
instead individually suppress it, along with unmatchedSuppression in the
relevant places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-08 13:06:20 +02:00
David Gibson
ec416fdcc4 tcp, flow: Fix some error paths which didn't clean up flows properly
Flow table entries need to be fully initialised before returning to the
main epoll loop.  Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().

Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.

I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.

Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
0e36fe1a43 clang-tidy: Enable the bugprone-macro-parentheses check
We globally disabled this, with a justification lumped together with
several checks about braces.  They don't really go together, the others
are essentially a stylistic choice which doesn't match our style.  Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.

We've only gotten away with the macros which trigger the warning, because
of other conventions its been unlikely to invoke them with anything other
than a simple identifier.  Fix the macros, and enable the warning for the
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
d2afb4b625 tcp: Make pointer const in tcp_revert_seq
The th pointer could be const, which causes a cppcheck warning on at least
some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40).

Fixes: e84a01e94c ("tcp: move seq_to_tap update to when frame is queued")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
Jon Maloy
e84a01e94c tcp: move seq_to_tap update to when frame is queued
commit a469fc393f ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
immediately can make a new attempt to transmit a frame after a failed
trasnmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.

This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.

We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had do be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-05 21:13:16 +02:00
David Gibson
cc801fb38f tcp: Remove interim 'tapside' field from connection
We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one.  Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:06 +02:00
David Gibson
8a2accb847 flow: Record the pifs for each side of each flow
Currently we have no generic information flows apart from the type and
state, everything else is specific to the flow type.  Start introducing
generic flow information by recording the pifs which the flow connects.

To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and TGT for
when both sides information is complete, but we haven't chosen the
flow type yet.  For now, these states don't do an awful lot, but
they'll become more important as we add more generic information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:03 +02:00
David Gibson
43571852e6 flow: Make side 0 always be the initiating side
Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code:  TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the target side.  ICMP also uses 0 for the host/"sock" side and 1 for the
guest/"tap" side, but in its case the latter is always also the initiating
side.

Make this generically consistent by always using side 0 for the initiating
side and 1 for the target side.  This doesn't simplify a lot for now, and
arguably makes TCP slightly more complex, since we add an extra field to
the connection structure to record which is the guest facing side. This is
an interim change, which we'll be able to remove later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:21:01 +02:00
David Gibson
0060acd11b flow: Clarify and enforce flow state transitions
Flows move over several different states in their lifetime.  The rules for
these are documented in comments, but they're pretty complex and a number
of the transitions are implicit, which makes this pretty fragile and
error prone.

Change the code to explicitly track the states in a field.  Make all
transitions explicit and logged.  To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.

While we're at it, tweak the docs to clarify the restrictions on each state
a bit.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:58 +02:00
David Gibson
a63199832a inany: Better helpers for using inany and specific family addrs together
This adds some extra inany helpers for comparing an inany address to
addresses of a specific family (including special addresses), and building
an inany from an IPv4 address (either statically or at runtime).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:55 +02:00
David Gibson
7a832a8a0e flow: Properly type callbacks to protocol specific handlers
The flow dispatches deferred and timer handling for flows centrally, but
needs to call into protocol specific code for the handling of individual
flows.  Currently this passes a general union flow *.  It makes more sense
to pass the specific relevant flow type structure.  That brings the check
on the flow type adjacent to casting to the union variant which it tags.

Arguably, this is a slight abstraction violation since it involves the
generic flow code using protocol specific types.  It's already calling into
protocol specific functions, so I don't think this really makes any
difference.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:52 +02:00
David Gibson
1a20370b36 util, tcp: Add helper to display socket addresses
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler.  For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:37 +02:00
David Gibson
eea5d3ef2d tcp: Update tap specific header too in tcp_fill_headers[46]()
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
  * Removes a little redundancy between the tcp_send_flag() and
    tcp_data_to_tap() path
  * Makes calculation of the correct length a little easier
  * Removes the now misleadingly named 'vnet_len' variable in
    tcp_send_flag()

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:34 +02:00
David Gibson
3559899586 iov: Helper macro to construct iovs covering existing variables or fields
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code.  In many of those cases, and few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields.  We can make initializing such iovs shorter and
clearer with a macro for the purpose.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:31 +02:00
David Gibson
40f8b2976a tap, tcp: (Re-)abstract TAP specific header handling
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.

The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:29 +02:00
David Gibson
68d1b0a152 tcp: Simplify packet length calculation when preparing headers
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length.  We can just use the L4 length throughout.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:26 +02:00
David Gibson
5566386f5f treewide: Standardise variable names for various packet lengths
At various points we need to track the lengths of a packet including or
excluding various different sets of headers.  We don't always use the same
variable names for doing so.  Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.

To improve clarity, standardise on these names:
   dlen:		L4 protocol payload length ("data length")
   l4len:		plen + length of L4 protocol header
   l3len:		l4len + length of IPv4/IPv6 header
   l2len:		l3len + length of L2 (ethernet) header

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:23 +02:00
David Gibson
9e22c53aa9 checksum: Make csum_ip4_header() take a host endian length
csum_ip4_header() takes the packet length as a network endian value.  In
general it's very error-prone to pass non-native-endian values as a raw
integer.  It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.

It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:21 +02:00
Laurent Vivier
95601237ef tcp: Replace TCP buffer structure by an iovec array
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.

Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:09 +02:00
David Gibson
4988e2b406 tcp: Unconditionally force ACK for all !SYN, !RST packets
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window.  This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct.  RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.

The reason for the existing logic, rather than always forcing an ACK seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit.  However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.

[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1

Link: https://github.com/containers/podman/issues/22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:52:04 +01:00
David Gibson
5894a245b9 tcp: Never automatically add the ACK flag to RST packets
tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets.
This doesn't make sense for RST packets, where plain RST and RST+ACK have
somewhat different meanings.  AIUI, RST+ACK indicates an abrupt end to
a connection, but acknowledges data already sent.  Plain RST indicates an
abort, when one end receives a packet that doesn't seem to make sense in
the context of what it knows about the connection.  All of the cases where
we send RSTs are the second, so we don't want an ACK flag, but we currently
could add one anyway.

Change that, so we won't add an ACK to an RST unless the caller explicitly
requests it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:58 +01:00
David Gibson
16c2d8da0d tcp: Rearrange logic for setting ACK flag in tcp_send_flag()
We have different paths for controlling the ACK flag for the SYN and !SYN
paths.  This amounts to sometimes forcing on the ACK flag in the !SYN path
regardless of options.  We can rearrange things to explicitly be that which
will make things neater for some future changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:55 +01:00
David Gibson
99355e25b9 tcp: Split handling of DUP_ACK from ACK
The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the
setting of the ACK flag in the packet, even if we otherwise wouldn't.
Secondly, it causes a duplicate of the flags packet to be sent immediately
after the first.

Setting the ACK flag to tcp_send_flag() also has the first effect, so
instead of having DUP_ACK also do that, pass both flags when we need both
operations.  This slightly simplifies the logic of tcp_send_flag() in a way
that makes some future changes easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:41 +01:00
David Gibson
d3eb0d7b59 tap: Rename tap_iov_{base,len}
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec.  They don't have to be used
for that, though.  Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:43 +01:00
David Gibson
2d0e0084b6 tap: Extend tap_send_frames() to allow multi-buffer frames
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer.  We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:28 +01:00