Commit 89e38f55 "treewide: Fix header includes to build with musl" added
extra #includes to work with musl. Unfortunately with the cppcheck version
I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false
positives: specifically cppcheck seems to hit a #error in <bits/unistd.h>
complaining about including it directly instead of via <unistd.h> (which is
not something we're doing).
I have no idea why that would be happening; but I'm guessing it has to be
a bug in the cpp implementation in that cppcheck version. In any case,
it's possible to work around this by moving the include of <unistd.h>
before the include of <signal.h>. So, do that.
Fixes: 89e38f5540 ("treewide: Fix header includes to build with musl")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.
Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.
Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We use the return value of fls() as array index for debug strings.
While fls() can return -1 (if no bit is set), Coverity Scan doesn't
see that we're first checking the return value of another fls() call
with the same bitmask, before using it.
Call fls() once, store its return value, check it, and use the stored
value as array index.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When creating a new spliced connection, we need to get a socket in the
other ns from the originating one. To avoid excessive ns switches we
usually get these from a pool refilled on a timer. However, if the pool
runs out we need a fallback. Currently that's done by passing -1 as the
socket to tcp_splice_connnect() and running it in the target ns.
This means that tcp_splice_connect() itself needs to have different cases
depending on whether it's given an existing socket or not, which is
a separate concern from what it's mostly doing. We change it to require
a suitable open socket to be passed in, and ensuring in the caller that we
have one.
This requires adding the fallback paths to the caller, tcp_splice_new().
We use slightly different approaches for a socket in the init ns versus the
guest ns.
This also means that we no longer need to run tcp_splice_connect() itself
in the guest ns, which allows us to remove a bunch of boilerplate code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if
that's empty creates a new socket in the init namespace. Both parts of
this are duplicated in other places: the pool lookup logic is duplicated in
tcp_splice_new(), and the socket opening logic is duplicated in
tcp_sock_refill_pool().
Split the function into separate parts so we can remove both these
duplications.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_splice.c has some explicit extern declarations to access the
socket pools. This is pretty dangerous - if we changed the type of
these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the
same memory in different ways with no compiler error. So, move the
extern declarations to tcp_conn.h so they're visible to both tcp.c and
tcp_splice.c, but not the rest of pasta.
In fact the pools for the guest namespace are necessarily only used by
tcp_splice.c - we have no sockets on the guest side if we're not
splicing. So move those declarations and the functions that deal
exclusively with them to tcp_splice.c
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are some places in passt/pasta which #include <assert.h> and make
various assertions. If we hit these something has already gone wrong, but
they're there so that we a useful message instead of cryptic misbehaviour
if assumptions we thought were correct turn out not to be.
Except.. the glibc implementation of assert() uses syscalls that aren't in
our seccomp filter, so we'll get a SIGSYS before it actually prints the
message. Work around this by adding our own ASSERT() implementation using
our existing err() function to log the message, and an abort(). The
abort() probably also won't work exactly right with seccomp, but once we've
printed the message, dying with a SIGSYS works just as well as dying with
a SIGABRT.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The pointers are actually the same, but we later pass the container
union to tcp_table_compact(), which might zero the size of the whole
union, and this confuses Coverity Scan.
Given that we have pointers to the container union to start with,
just pass those instead, all the way down to tcp_table_compact().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
same as the corresponding IPv4 addresses. However currently we won't
splice a connection from ::ffff:127.0.0.1 the way we would one from
127.0.0.1. Correct this so that we can splice connections from IPv4
localhost that have been received on an IPv6 dual stack socket.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This bit in the TCP specific epoll reference indicates whether the
connection is IPv6 or IPv4. However the sites which refer to it are
already calling accept() which (optionally) returns an address for the
remote end of the connection. We can use the sa_family field in that
address to determine the connection type independent of the epoll
reference.
This does have a cost: for the spliced case, it means we now need to get
that address from accept() which introduces an extran copy_to_user().
However, in future we want to allow handling IPv4 connectons through IPv6
sockets, which means we won't be able to determine the IP version at the
time we create the listening socket and epoll reference. So, at some point
we'll have to pay this cost anyway.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union. We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.
This approach is useful wider than the specific place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h. Along with
that create a number of helper functions to make working with these "inany"
addresses easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently the epoll reference for tcp sockets includes a bit indicating
whether the socket maps to a spliced connection. However, the reference
also has the index of the connection structure which also indicates whether
it is spliced. We can therefore avoid the splice bit in the epoll_ref by
unifying the first part of the non-spliced and spliced handlers where we
look up the connection state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address). This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::). This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.
However, it turns out we don't need to do this. We can splice a connection
if and only if it originates from the loopback address. Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback. Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else. However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.
Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures. However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.
Simplify things slightly by moving this bit into the common header of the
two structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures. Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules). This
will make subsequent changes a bit safer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file. This shows up
an old declaration for a function that is now static, and a duplicate
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, when we receive a new inbound connection, we need to
select a socket that was created in the namespace to proceed and
connect() it to its final destination.
The existing condition might pick a wrong socket, though, if the
destination port is remapped, because we'll check the bitmap of
inbound ports using the remapped port (stored in the epoll reference)
as index, and not the original port.
Instead of using the port bitmap for this purpose, store this
information in the epoll reference itself, by adding a new 'outbound'
bit, that's set if the listening socket was created the namespace,
and unset otherwise.
Then, use this bit to pick a socket on the right side.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 33482d5bf2 ("passt: Add PASTA mode, major rework")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For tcp_sock_init_ns(), "inbound" connections used to be the ones
being established toward any listening socket we create, as opposed
to sockets we connect().
Similarly, tcp_splice_new() used to handle "inbound" connections in
the sense that they originated from listening sockets, and they would
in turn cause a connect() on an "outbound" socket.
Since commit 1128fa03fe ("Improve types and names for port
forwarding configuration"), though, inbound connections are more
broadly defined as the ones directed to guest or namepsace, and
outbound the ones originating from there.
Update comments for those two functions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The configuration for how to forward ports in and out of the guest/ns is
divided between several different variables. For each connect direction
and protocol we have a mode in the udp/tcp context structure, a bitmap
of which ports to forward also in the context structure and an array of
deltas to apply if the outward facing and inward facing port numbers are
different. This last is a separate global variable, rather than being in
the context structure, for no particular reason. UDP also requires an
additional array which has the reverse mapping used for return packets.
Consolidate these into a re-used substructure in the context structure.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
All instances were harmless, but it might be useful to have some
debug messages here and there. Reported by Coverity.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The existing sizes provide no measurable differences in throughput
and packet rates at this point. They were probably needed as batched
implementations were not complete, but they can be decreased quite a
bit now.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.
Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.
This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can't take for granted that the hard limit for open files is
big enough as to allow to delay closing sockets to a timer.
Store the value of RTLIMIT_NOFILE we set at start, and use it to
understand if we're approaching the limit with pending, spliced
TCP connections. If that's the case, close sockets right away as
soon as they're not needed, instead of deferring this task to a
timer.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This should never happen, but there are no formal guarantees: ensure
socket numbers are below SOCKET_MAX.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Using events and flags instead of states makes the implementation
much more straightforward: actions are mostly centered on events
that occurred on the connection rather than states.
An example is given by the ESTABLISHED_SOCK_FIN_SENT and
FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about
which side started closing the connection to handle closing of
connection halves.
Split out the spliced implementation, as it has very little in
common with the "regular" TCP path.
Refactor things here and there to improve clarity. Add helpers
to trace where resets and flag settings come from.
No functional changes intended.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>