Currently if tcp_sock_refill_pool() is unable to fill all the slots in the
pool, it will silently exit. This might lead to a later attempt to get
fds from the pool to fail at which point it will be harder to tell what
originally went wrong.
Instead add warnings if we're unable to refill any of the socket pools when
requested. We have tcp_sock_refill_pool() return an error and report it
in the callers, because those callers have more context allowing for a
more useful message.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table. Doing this
sometimes requires moving an entry when one is freed. That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.
Implement a new way of managing the flow table that doesn't ever move
entries. It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers. Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers. If true, this indicates that the flow
code itself should free the flow.
This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'. The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.
Move flow_count to be a regular global to match. For now it needs to be
public, rather than static, but we expect to be able to change that in
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection. When we add other protocols to
the flow table, they're likely to want some similar scanning. It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.
To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing. Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections. tcp_splice_timer()
expires spliced connections and then does some other processing.
However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections. So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.
That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing. Here if a hash bucket is occupied
we just move onto the next one until we find a feww one. This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link. It does require some
additional complexity for hash removal.
This approach can perform poorly with hash table load is high. However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load. It's relatively cheap to decrease
that bound if we find we need to.
I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm
R in Section 6.4. Note that there is an error in Algorithm R as printed,
see errata at [0].
[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed. The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow. Therefore, move it to flow.c rather than being
purportedly TCP specific.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table. We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.
In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense. tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match. That in
turn means we can remove a check for negative values from it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We want to generalise "connection" tracking to things other than true TCP
connections. Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c. The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure. We want to generalise the TCP
connection table to other types of flows in other protocols. Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising. In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.
However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets. After that commit, it does affect
established sockets but doesn't behave the way we need:
* It appears to be designed only to shrink the window, not to allow it to
re-expand.
* More importantly, that commit has a serious bug where if the
setsockopt() is made when the existing kernel advertised window for the
socket happens to be zero, it will now become locked at zero, stopping
any further data from being received on the socket.
Since this has never worked as intended, simply remove it. It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.
This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug. We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.
Fixes: d3192f67c4 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Each spliced connection has two mostly, although not entirely, symmetric
sides. We currently call those "a" and "b" and have different fields in
the connection structure for each one.
We can better exploit that symmetry if we use two element arrays rather
thatn separately named fields. Do that in the places we can, and for the
others change the "a"/"b" terminology to 0/1 to match.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections. It seems like it
belongs there, because both tap and spliced connections use it, and it has
roughly the same meaning.
Roughly, however, isn't exactly: which fds this flag says are in the epoll
varies between the two connection types, and are in type specific fields.
So, it's only possible to meaningfully use this value locally in type
specific code anyway.
This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate fields
in the tap and splice specific structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In a number of places the comments and variable names we use to describe
addresses and ports are ambiguous. It's not sufficient to describe a port
as "tap-facing" or "socket-facing", because on both the tap side and the
socket side there are two ports for the two ends of the connection.
Similarly, "local" and "remote" aren't particularly helpful, because it's
not necessarily clear whether we're talking from the point of view of the
guest/namespace, the host, or passt itself.
This patch makes a number of changes to be more precise about this. It
introduces two new terms in aid of this:
A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself. That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.
The "endpoint" address (or port) is the reverse: a remote address
from passt's point of view, the destination address for traffic sent
by passt.
Between them the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describes which address or
port we're talking about.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd. However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future. Rename these fields to
an abstract "fd type" and file descriptor for more generality.
Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space. For now these just correspond to the
supported protocols, but we'll expand on that in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...and, given that I keep getting this wrong, add a convenience
macro, MAX_FROM_BITS().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When creating a new spliced connection, we need to get a socket in the
other ns from the originating one. To avoid excessive ns switches we
usually get these from a pool refilled on a timer. However, if the pool
runs out we need a fallback. Currently that's done by passing -1 as the
socket to tcp_splice_connnect() and running it in the target ns.
This means that tcp_splice_connect() itself needs to have different cases
depending on whether it's given an existing socket or not, which is
a separate concern from what it's mostly doing. We change it to require
a suitable open socket to be passed in, and ensuring in the caller that we
have one.
This requires adding the fallback paths to the caller, tcp_splice_new().
We use slightly different approaches for a socket in the init ns versus the
guest ns.
This also means that we no longer need to run tcp_splice_connect() itself
in the guest ns, which allows us to remove a bunch of boilerplate code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if
that's empty creates a new socket in the init namespace. Both parts of
this are duplicated in other places: the pool lookup logic is duplicated in
tcp_splice_new(), and the socket opening logic is duplicated in
tcp_sock_refill_pool().
Split the function into separate parts so we can remove both these
duplications.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_splice.c has some explicit extern declarations to access the
socket pools. This is pretty dangerous - if we changed the type of
these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the
same memory in different ways with no compiler error. So, move the
extern declarations to tcp_conn.h so they're visible to both tcp.c and
tcp_splice.c, but not the rest of pasta.
In fact the pools for the guest namespace are necessarily only used by
tcp_splice.c - we have no sockets on the guest side if we're not
splicing. So move those declarations and the functions that deal
exclusively with them to tcp_splice.c
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The pointers are actually the same, but we later pass the container
union to tcp_table_compact(), which might zero the size of the whole
union, and this confuses Coverity Scan.
Given that we have pointers to the container union to start with,
just pass those instead, all the way down to tcp_table_compact().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union. We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.
This approach is useful wider than the specific place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h. Along with
that create a number of helper functions to make working with these "inany"
addresses easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently when we insert a connection into the hash table, we store its
bucket number so we can find it when removing entries. However, we can
recompute the hash value from other contents of the structure so we don't
need to store it. This brings the size of tcp_tap_conn down to 64 bytes
again, which means it will fit in a single cacheline on common machines.
This change also removes a non-obvious constraint that the hash table have
less than twice TCP_MAX_CONNS buckets, because of the way
TCP_HASH_BUCKET_BITS was constructed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures. However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.
Simplify things slightly by moving this bit into the common header of the
two structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>