Commit graph

81 commits

Author SHA1 Message Date
David Gibson
034fa8a58d tcp: Remove v6 flag from tcp_epoll_ref
This bit in the TCP specific epoll reference indicates whether the
connection is IPv6 or IPv4.  However the sites which refer to it are
already calling accept() which (optionally) returns an address for the
remote end of the connection.  We can use the sa_family field in that
address to determine the connection type independent of the epoll
reference.

This does have a cost: for the spliced case, it means we now need to get
that address from accept() which introduces an extran copy_to_user().
However, in future we want to allow handling IPv4 connectons through IPv6
sockets, which means we won't be able to determine the IP version at the
time we create the listening socket and epoll reference.  So, at some point
we'll have to pay this cost anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:48 +01:00
David Gibson
ca69c3f196 inany: Helper functions for handling addresses which could be IPv4 or IPv6
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union.  We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.

This approach is useful wider than the specific place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h.  Along with
that create a number of helper functions to make working with these "inany"
addresses easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:32 +01:00
David Gibson
233b95e90f tcp: Remove splice from tcp_epoll_ref
Currently the epoll reference for tcp sockets includes a bit indicating
whether the socket maps to a spliced connection.  However, the reference
also has the index of the connection structure which also indicates whether
it is spliced.  We can therefore avoid the splice bit in the epoll_ref by
unifying the first part of the non-spliced and spliced handlers where we
look up the connection state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:25 +01:00
David Gibson
d909fda1e8 tcp: Use the same sockets to listen for spliced and non-spliced connections
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address).  This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::).  This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.

However, it turns out we don't need to do this.  We can splice a connection
if and only if it originates from the loopback address.  Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback.  Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:22 +01:00
David Gibson
356c6e0677 tcp: Unify part of spliced and non-spliced conn_from_sock path
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else.  However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.

Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:19 +01:00
David Gibson
433604a581 tcp: Unify the IN_EPOLL flag
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures.  However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.

Simplify things slightly by moving this bit into the common header of the
two structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:00 +01:00
David Gibson
34476511f7 tcp: Partially unify tcp_timer() and tcp_splice_timer()
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them.  Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:58 +01:00
David Gibson
0eef48c4be tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections.  To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:54 +01:00
David Gibson
ee8f8e9564 tcp: Unify spliced and non-spliced connection tables
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections.  This is a bit counter-intuitive.

More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases.  So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.

For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall.  If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.

For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them.  We'll improve that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:51 +01:00
David Gibson
181ce83d9b tcp: Improved helpers to update connections after moving
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another.  That requires some updates
in the entries themselves.  Add helpers to make all the necessary updates
for the spliced and non-spliced cases.  This will simplify later cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:48 +01:00
David Gibson
ff27fd63cd tcp: Add connection union type
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays.  We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection.  For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.

This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline.  This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:46 +01:00
David Gibson
3cf027bd59 tcp: Move connection state structures into a shared header
Currently spliced and non-spliced connections use completely independent
tracking structures.  We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.

This requires renaming some #defines with the same name but different
meanings between the two cases.  In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:43 +01:00
David Gibson
c6b822428a tcp_splice: Helpers for converting from index to/from tcp_splice_conn
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures.  Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules).  This
will make subsequent changes a bit safer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:40 +01:00
David Gibson
9ffa0184e3 tcp_splice: #include tcp_splice.h in tcp_splice.c
This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file.  This shows up
an old declaration for a function that is now static, and a duplicate

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:30 +01:00
Stefano Brivio
d0dd0242a6 tcp, tcp_splice: Fix port remapping for inbound, spliced connections
In pasta mode, when we receive a new inbound connection, we need to
select a socket that was created in the namespace to proceed and
connect() it to its final destination.

The existing condition might pick a wrong socket, though, if the
destination port is remapped, because we'll check the bitmap of
inbound ports using the remapped port (stored in the epoll reference)
as index, and not the original port.

Instead of using the port bitmap for this purpose, store this
information in the epoll reference itself, by adding a new 'outbound'
bit, that's set if the listening socket was created the namespace,
and unset otherwise.

Then, use this bit to pick a socket on the right side.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 33482d5bf2 ("passt: Add PASTA mode, major rework")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
eab9d8d5d6 tcp, tcp_splice: Adjust comments to current meaning of inbound and outbound
For tcp_sock_init_ns(), "inbound" connections used to be the ones
being established toward any listening socket we create, as opposed
to sockets we connect().

Similarly, tcp_splice_new() used to handle "inbound" connections in
the sense that they originated from listening sockets, and they would
in turn cause a connect() on an "outbound" socket.

Since commit 1128fa03fe ("Improve types and names for port
forwarding configuration"), though, inbound connections are more
broadly defined as the ones directed to guest or namepsace, and
outbound the ones originating from there.

Update comments for those two functions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
da152331cf Move logging functions to a new file, log.c
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-14 17:38:25 +02:00
David Gibson
163dc5f188 Consolidate port forwarding configuration into a common structure
The configuration for how to forward ports in and out of the guest/ns is
divided between several different variables.  For each connect direction
and protocol we have a mode in the udp/tcp context structure, a bitmap
of which ports to forward also in the context structure and an array of
deltas to apply if the outward facing and inward facing port numbers are
different.  This last is a separate global variable, rather than being in
the context structure, for no particular reason.  UDP also requires an
additional array which has the reverse mapping used for return packets.

Consolidate these into a re-used substructure in the context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
Stefano Brivio
df69be379e tcp_splice: Allow up to 8 MiB as pipe size
It actually improves throughput a bit, if allowed by user limits.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 14:28:27 +02:00
Stefano Brivio
2a3b8dad33 tcp, tcp_splice: False "Negative array index read" positives, CWE-129
A flag or event bit is always set by callers. Reported by Coverity.

Signed-by-off: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 11:44:35 +02:00
Stefano Brivio
264d68edcf tcp_splice: Logically dead code, CWE-561
Reported by Coverity.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 11:44:35 +02:00
Stefano Brivio
22ed4467a4 treewide: Unchecked return value from library, CWE-252
All instances were harmless, but it might be useful to have some
debug messages here and there. Reported by Coverity.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 11:44:35 +02:00
Stefano Brivio
dbd0a7035c treewide: Invalid type in argument to printf format specifier, CWE-686
Harmless except for two bad debugging prints.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-05 18:47:04 +02:00
Stefano Brivio
37c228ada8 tap, tcp, udp, icmp: Cut down on some oversized buffers
The existing sizes provide no measurable differences in throughput
and packet rates at this point. They were probably needed as batched
implementations were not complete, but they can be decreased quite a
bit now.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
62c3edd957 treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
48582bf47f treewide: Mark constant references as const
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
bb70811183 treewide: Packet abstraction with mandatory boundary checks
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.

Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.

This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
415ccf6116 tcp, tcp_splice: Use less awkward syntax to swap in/out sockets from pools
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
92074c16a8 tcp_splice: Close sockets right away on high number of open files
We can't take for granted that the hard limit for open files is
big enough as to allow to delay closing sockets to a timer.

Store the value of RTLIMIT_NOFILE we set at start, and use it to
understand if we're approaching the limit with pending, spliced
TCP connections. If that's the case, close sockets right away as
soon as they're not needed, instead of deferring this task to a
timer.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
3eb19cfd8a tcp, udp, util: Enforce 24-bit limit on socket numbers
This should never happen, but there are no formal guarantees: ensure
socket numbers are below SOCKET_MAX.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
e5eefe7743 tcp: Refactor to use events instead of states, split out spliced implementation
Using events and flags instead of states makes the implementation
much more straightforward: actions are mostly centered on events
that occurred on the connection rather than states.

An example is given by the ESTABLISHED_SOCK_FIN_SENT and
FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about
which side started closing the connection to handle closing of
connection halves.

Split out the spliced implementation, as it has very little in
common with the "regular" TCP path.

Refactor things here and there to improve clarity. Add helpers
to trace where resets and flag settings come from.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-28 17:11:40 +02:00