Commit graph

1764 commits

Author SHA1 Message Date
David Gibson
d29fa0856e udp: Remove rdelta port forwarding maps
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map.  This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port.  This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.

So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:57 +02:00
David Gibson
d89b3aa097 udp: Remove obsolete socket tracking
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[ or udp_act[] arrays.  Remove them and connected
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:55 +02:00
David Gibson
898f797174 udp: Direct datagrams from host to guest via flow table
This replaces the last piece of existing UDP port tracking with the
common flow table.  Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface.  Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:51 +02:00
David Gibson
b7ad19347f udp: Find or create flows for datagrams from tap interface
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams.  We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them.  Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.

As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.

For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter.  Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:48 +02:00
David Gibson
8126f7a660 udp: Remove obsolete splice tracking
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used.  With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[].  The latter are still
used for tracking of automatic port forwarding.

Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:45 +02:00
David Gibson
e0647ad80c udp: Handle "spliced" datagrams with per-flow sockets
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it.  Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.

For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow.  We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else.  We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c

When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:42 +02:00
David Gibson
a45a7e9798 udp: Create flows for datagrams from originating sockets
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps.  Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).

When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc.  This may result in smaller batches, but makes things
easier down the line.  We can re-optimise this later if necessary.  For now
we don't do anything else with the flow, not even match reply packets to
the same flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:39 +02:00
David Gibson
8abd06e9fa fwd: Update flow forwarding logic for UDP
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets.  The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases.  Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:35 +02:00
David Gibson
c000f2aba6 flow, icmp: Use general flow forwarding rules for ICMP
Current ICMP hard codes its forwarding rules, and never applies any
translations.  Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).

This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine.  This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.

This removes the last user of flow_target_af(), so that's removed as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:33 +02:00
David Gibson
060f24e310 flow, tcp: Flow based NAT and port forwarding for TCP
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side.  This performs any NAT or port forwarding needed.

We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:29 +02:00
David Gibson
4cd753e65c icmp: Manage outbound socket address via flow table
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.

To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses.  It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address.  This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out.  We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:25 +02:00
David Gibson
781164e25b flow: Helper to create sockets based on flowside
We have upcoming use cases where it's useful to create new bound socket
based on information from the flow table.  Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:23 +02:00
David Gibson
2faf6fcd8b icmp: Eliminate icmp_id_map
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything.  Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:20 +02:00
David Gibson
2f40a01944 icmp: Look up ping flows using flow hash
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an anciliary data structure, the
icmp_id_map[] tables.  However, we can look this up using the flow hash
table - that's what it's for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:16 +02:00
David Gibson
6d76278c21 icmp: Obtain destination addresses from the flowsides
icmp_sock_handler() obtains the guest address from it's most recently
observed IP.  However, this can now be obtained from the common flowside
information.

icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.

Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:13 +02:00
David Gibson
5cffb1bf64 icmp: Remove redundant id field from flow table entry
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:06 +02:00
David Gibson
508adde342 tcp: Re-use flow hash for initial sequence number generation
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp.  Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.

With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one.  It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:01 +02:00
David Gibson
acca4235c4 flow, tcp: Generalise TCP hash table to general flow hash table
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows.  This is
largely code motion and straightforward renames.  There are two semantic
changes:

 * flow_lookup_af() now needs to verify that the entry has a matching
   protocol and interface as well as matching addresses and ports.

 * We double the size of the hash table, because it's now at least
   theoretically possible for both sides of each flow to be hashed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:59 +02:00
David Gibson
163a339214 tcp, flow: Replace TCP specific hash function with general flow hash
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports.  We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well.  We also want a unified hash table
to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow.  This includes all the
information from struct flowside, plus the pif and the L4 protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:56 +02:00
David Gibson
f19a8f71f9 tcp_splice: Eliminate SPLICE_V6 flag
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection.  Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:53 +02:00
David Gibson
528a6517f8 tcp: Simplify endpoint validation using flowside information
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:50 +02:00
David Gibson
e2ea10e246 tcp: Manage outbound address via flow table
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel.  When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.

Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there.  This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:47 +02:00
David Gibson
52d45f1737 tcp: Obtain guest address from flowside
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will probably result in an RST from the guest.  Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:44 +02:00
David Gibson
f9fe212b1f tcp, flow: Remove redundant information, repack connection structures
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure.  Access it from there instead, and
remove it from the TCP specific structure.   With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:41 +02:00
David Gibson
4e2d36e83f flow: Common address information for target side
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP.  For now this leaves some information
redundantly recorded in both generic and type specific fields.  We'll
fix that in later patches.

For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:37 +02:00
David Gibson
8012f5ff55 flow: Common address information for initiating side
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow.  Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.

For now we only populate the initiating side, requiring that
information be completed when a flows enter INI.  Later patches will
populate the target side.

For now this leaves some information redundantly recorded in both generic
and type specific fields.  We'll fix that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:32 +02:00
David Gibson
ba74b1fea1 doc: Extend zero-recv test with methods using msghdr
This test program verifies that we can receive and discard datagrams by
using recv() with a NULL buffer and zero-length.  Extend it to verify it
also works using recvmsg() and either an iov with a zero-length NULL
buffer or an iov that itself is NULL and zero-length.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed printf() message in main of recv-zero.c]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:31:02 +02:00
David Gibson
01e5611ec3 doc: Test behaviour of closing duplicate UDP sockets
To simplify lifetime management of "listening" UDP sockets, UDP flow
support needs to duplicate existing bound sockets.  Those duplicates will
be close()d when their corresponding flow expires, but we expect the
original to still receive datagrams as always.  That is, we expect the
close() on the duplicate to remove the duplicated fd, but not to close the
underlying UDP socket.

Add a test program to doc/platform-requirements to verify this requirement.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:14 +02:00
David Gibson
66a02c9f7c tcp_splice: Use parameterised macros for per-side event/flag bits
Both the events and flags fields in tcp_splice_conn have several bits
which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1.
This necessitates some rather awkward ternary expressions when we need
to get the relevant bit for a particular side.

Simplify this by using a parameterised macro for the bit values.  This
needs a ternary expression inside the macros, but makes the places we use
it substantially clearer.

That simplification in turn allows us to use a loop across each side to
implement several things which are currently open coded to do equivalent
things for each side in turn.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:11 +02:00
David Gibson
5235c47c79 flow: Introduce flow_foreach_sidei() macro
We have a handful of places where we use a loop to step through each side
of a flow or flows, and we're probably going to have mroe in future.
Introduce a macro to implement this loop for convenience.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:30:07 +02:00
David Gibson
71d7985188 flow, tcp_splice: Prefer 'sidei' for variables referring to side index
In various places we have variables named 'side' or similar which always
have the value 0 or 1 (INISIDE or TGTSIDE).  Given a flow, this refers to
a specific side of it.  Upcoming flow table work will make it more useful
for "side" to refer to a specific side of a specific flow.  To make things
less confusing then, prefer the name term "side index" and name 'sidei' for
variables with just the 0 or 1 value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fixed minor detail in comment to struct flow_common]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:29:47 +02:00
David Gibson
9b125e7776 flow, icmp, tcp: Clean up helpers for getting flow from index
TCP (both regular and spliced) and ICMP both have macros to retrieve the
relevant protcol specific flow structure from a flow index.  In most cases
what we actually want is to get the specific flow from a sidx.  Replace
those simple macros with a more precise inline, which also asserts that
the flow is of the type we expect.

While we're they're also add a pif_at_sidx() helper to get the interface of
a specific flow & side, which is useful in some places.

Finally, fix some minor style issues in the comments on some of the
existing sidx related helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 15:27:27 +02:00
David Gibson
2fa91ee391 udp: Handle errors on UDP sockets
Currently we ignore all events other than EPOLLIN on UDP sockets.  This
means that if we ever receive an EPOLLERR event, we'll enter an infinite
loop on epoll, because we'll never do anything to clear the error.

Luckily that doesn't seem to have happened in practice, but it's certainly
fragile.  Furthermore changes in how we handle UDP sockets with the flow
table mean we will start receiving error events.

Add handling of EPOLLERR events.  For now we just read the error from the
error queue (thereby clearing the error state) and print a debug message.
We can add more substantial handling of specific events in future if we
want to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:21 +02:00
David Gibson
6bd8283bf9 util: Add AF_UNSPEC support to sockaddr_ntop()
Allow sockaddr_ntop() to format AF_UNSPEC socket addresses.  There do exist
a few cases where we might legitimately have either an AF_UNSPEC or a real
address, such as the origin address from MSG_ERRQUEUE.  Even in cases where
we shouldn't get an AF_UNSPEC address, formatting it is likely to make
things easier to debug if we ever somehow do.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:18 +02:00
David Gibson
4e1f850f61 udp, tcp: Tweak handling of no_udp and no_tcp flags
We abort the UDP socket handler if the no_udp flag is set.  But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place.  If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop.  The same is the case for the TCP socket and timer handlers and
the no_tcp flag.

Change these checks on the flag to ASSERT()s.  Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:15 +02:00
David Gibson
272d1d033c udp: Make udp_sock_recv static
Through an oversight this was previously declared as a public function
although it's only used in udp.c and there is no prototype in any header.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:13 +02:00
David Gibson
f79c42317f conf: Don't configure port forwarding for a disabled protocol
UDP and/or TCP can be disabled with the --no-udp and --no-tcp options.
However, when this is specified, it's still possible to configure forwarded
ports for the disabled protocol.  In some cases this will open sockets and
perform other actions, which might not be safe since the entire protocol
won't be initialised.

Check for this case, and explicitly forbid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:04:55 +02:00
Jon Maloy
a740e16fd1 tcp: handle shrunk window advertisements from guest
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.

However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.

RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.

It turns out that this solution is extremely simple to implememt in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.

When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
  retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
  advertised window. In fact, in all observed cases we have seen that
  it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
  unless they already have been, meaning that only the next timer
  expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
  abated, and the data flow can resume.

It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.

Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
 to workaround
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 18:05:08 +02:00
Jon Maloy
e63d281871 tcp: leverage support of SO_PEEK_OFF socket option when available
>From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes is possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->name space direction.

In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.

Measurements with iperf3 shows that throughput increases with 15-20
percent in the host->namespace direction when this feature is used.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:57:03 +02:00
David Gibson
8bd57bf25b doc: Trivial fix for reuseaddr-priority
This test program checks for particular behaviour regardless of order of
operations.  So, we step through the test with all possible orders for
a number of different of parts.  Or at least, we're supposed to, a copy
pasta error led to using the same order for two things which should be
independent.

Fixes: 299c407501 ("doc: Add program to document and test assumptions about SO_REUSEADDR")
Reported-by: David Taylor <davidt@yadt.co.uk>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:55:52 +02:00
David Gibson
ec2691a12e doc: Test behaviour of zero length datagram recv()s
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:48 +02:00
David Gibson
299c407501 doc: Add program to document and test assumptions about SO_REUSEADDR
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram.  Which one will actually receive it is important to us.

Add a test program which verifies things work the way we expect, which
documents what those expectations are in the process.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:43 +02:00
David Gibson
be0214cca6 udp: Consolidate datagram batching
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap).  The logic to do this, is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().

Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch.  udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().

This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:41 +02:00
David Gibson
69e5393c37 udp: Move some more of sock_handler tasks into sub-functions
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four
things between them:
  1. Receive some datagrams from a socket
  2. Split those datagrams into batches depending on how they need to be
     sent (via tap or via a specific splice socket)
  3. Prepare buffers for each datagram to send it onwards
  4. Actually send it onwards

Split (1) and (3) into specific helper functions.  This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:39 +02:00
David Gibson
c6c61a9e1a udp: Don't repeatedly initialise udp[46]_eth_hdr
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame.  This makes sense since our ethernet header is alwaus the same.

However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop.  Pull that outside the loop so we just initialise them once.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:37 +02:00
David Gibson
55aff45bc1 udp: Unify udp[46]_l2_iov
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame
header and udp6_l2_iov points to the IPv6 versions.

We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len.  It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:35 +02:00
David Gibson
9f9b15f949 udp: Unify udp[46]_mh_splice
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.

Unify these by having the common array point at a sockaddr_inany as the
address.  This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port.  However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.

Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname  to udp_splicename.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:33 +02:00
David Gibson
fbd78b6f3e udp: Rename IOV and mmsghdr arrays
Make the salient points about these various arrays clearer with renames:

* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
  with L2.  They are, however, specific to receiving not sending.  Rename
  to udp_iov_recv and udp[46]_mh_recv.

* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
  Rename to udp[46]_l2_iov

* udp[46]_localname are (for now) pre-populated with the local address but
  the more salient point is that these are the destination address for the
  splice arrays.  Rename to udp[46]_splice_to

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:30 +02:00
David Gibson
f62c33d85f udp: Pass full epoll reference through more of sock handler path
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions.  Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions.  To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:28 +02:00
David Gibson
8f8eb73482 flow: Add flow_sidx_valid() helper
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t.  We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.

That's not always done consistently though.  In flow_at_sidx() we open code
a check on the flow index.  In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.

Clean this up a bit, by adding an explicit flow_sidx_valid() test function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:25 +02:00