Commit graph

1499 commits

Author SHA1 Message Date
David Gibson
15be1bfd81 icmp: Simplify socket expiry scanning
Currently we use icmp_act[] to scan for ICMP ids which might have an open
socket which could time out.  However icmp_act[] contains no information
that's not already in icmp_id_map[] - it's just an "index" which allows
scanning for relevant entries with less cache footprint.

We only scan for ICMP socket expiry every 1s, though, so it's not clear
that cache footprint really matters.  Furthermore, there's no strong reason
we need to scan even that often - the timeout is fairly arbitrary and
approximate.

So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and
compensate for the cache impact by reducing the scan frequency to once
every 10s.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:46 +01:00
David Gibson
24badd0acf icmp: Use -1 to represent "missing" sockets
icmp_id_map[] contains, amongst other things, fds for "ping" sockets
associated with various ICMP echo ids.  However, we only lazily open()
those sockets, so many will be missing.  We currently represent that with
a 0, which isn't great, since that's technically a valid fd.  Use -1
instead.  This does require initializing the fields in icmp_id_map[] but
we already have an obvious place to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:43 +01:00
David Gibson
43713af50e icmp: Don't attempt to match host IDs to guest IDs
When forwarding pings from tap, currently we create a ping socket with
a socket address whose port is set to the ID of the ping received from the
guest.  This causes the socket to send pings with the same ID on the host.
Although this seems look a good idea for maximum transparency, it's
probably unwise.

First, it's fallible - the bind() could fail, and we already have fallback
logic which will overwrite the packets with the expected guest id if the
id we get on replies doesn't already match.  We might as well do that
unconditionally.

But more importantly, we don't know what else on the host might be using
ping sockets, so we could end up with an ID that's the same as an existing
socket.  You'd expect that to fail the bind() with EADDRINUSE, which would
be fine: we'd fall back to rewriting the reply ids.  However it appears the
kernel (v6.6.3 at least), does *not* fail the bind() and instead it's
"last socket wins" in terms of who gets the replies.  So we could
accidentally intercept ping replies for something else on the host.

So, instead of using bind() to set the id, just let the kernel pick one
and expect to translate the replies back.  Although theoretically this
makes the passt/pasta link a bit less "transparent", essentially nothing
cares about specific ping IDs, much like TCP source ports, which we also
don't preserve.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:41 +01:00
David Gibson
8534cdbfd1 icmp: Don't attempt to handle "wrong direction" ping socket traffic
Linux ICMP "ping" sockets are very specific in what they do.  They let
userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive
matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY).  They don't let you
intercept or handle incoming ping requests.

In the case of passt/pasta that means we can process echo requests from tap
and forward them to a ping socket, then take the replies from the ping
socket and forward them to tap.  We can't do the reverse: take echo
requests from the host and somehow forward them to the guest. There's
really no way for something outside to initiate a ping to a passt/pasta
connected guest and if there was we'd need an entirely different mechanism
to handle it.

However, we have some logic to deal with packets going in that reverse
direction.  Remove it, since it can't ever be used that way.  While we're
there use defines for the ICMPv6 types, instead of open coded type values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:38 +01:00
David Gibson
2cb2fe6f89 icmp: Remove redundant initialisation of sendto() address
We initialise the address portion of the sockaddr for sendto() to the
unspecified address, but then always overwrite it with the actual
destination address before we call the sendto().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:35 +01:00
David Gibson
5dffb99892 icmp: Don't set "port" on destination sockaddr for ping sockets
We set the port to the ICMP id on the sendto() address when using ICMP
ping sockets.  However, this has no effect: the ICMP id the kernel
uses is determined only by the "port" on the socket's *bound* address
(which is constructed inside sock_l4(), using the id we also pass to
it).

For unclear reasons this change triggers cppcheck 2.13.0 to give new
"variable could be const pointer" warnings, so make *ih const as well to
fix that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:32 +01:00
David Gibson
8981a720aa flow: Avoid moving flow entries to compact table
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table.  Doing this
sometimes requires moving an entry when one is freed.  That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
 Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:37 +01:00
David Gibson
9c0881d4f6 flow: Enforce that freeing of closed flows must happen in deferred handlers
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers.  Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers.  If true, this indicates that the flow
code itself should free the flow.

This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:33 +01:00
David Gibson
4a849e9526 flow: Abstract allocation of new flows with helper function
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections.  This duplication isn't ideal and will get worse as we
add more protocols to the flow table.  It also makes it harder to
experiment with different ways of handling flow table allocation.

Instead, introduce a function to allocate a new flow: flow_alloc().  In
some cases we currently check if we're able to allocate, but delay the
actual allocation.  We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation.  We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:31 +01:00
David Gibson
fb7c00169d flow: Move flow_count from context structure to a global
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'.  The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.

Move flow_count to be a regular global to match.  For now it needs to be
public, rather than static, but we expect to be able to change that in
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:29 +01:00
David Gibson
7f37bf4cd5 flow: Move flow_log_() to near top of flow.c
flow_log_() is a very basic widely used function that many other functions
in flow.c will end up needing.  At present it's below flow_table_compact()
which happens not to need it, but that's likely to change.  Move it to
near the top of flow.c to avoid forward declarations.

Code motion only, no changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:27 +01:00
David Gibson
02e092b4fe tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection.  This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function.  We can avoid this by adding a new
epoll type and dispatching directly to the right thing.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:25 +01:00
David Gibson
70121ca1ec epoll: Better handling of number of epoll types
As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of
EPOLL_TYPE_MAX, which is a little bit safer and clearer.  Add a static
assert on the size of the matching names array.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:22 +01:00
David Gibson
36dfa8b8fb flow, tcp: Add handling for per-flow timers
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection.  More generally, other flow types might want to
run similar timers in future.

We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler().  We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.

So, extend flow_defer_handler() to do this.  For now we use the same timer
interval for all flow types (1s).  We can make that more flexible in future
if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:19 +01:00
David Gibson
b43e4483ed flow, tcp: Add flow-centric dispatch for deferred flow handling
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection.  When we add other protocols to
the flow table, they're likely to want some similar scanning.  It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.

To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:17 +01:00
David Gibson
c97bb527d6 tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing.  Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:15 +01:00
David Gibson
eebca1115f tcp, tcp_splice: Remove redundant handling from tcp_timer()
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections.  tcp_splice_timer()
expires spliced connections and then does some other processing.

However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections.  So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:13 +01:00
David Gibson
8563e7c870 treewide: Standardise on 'now' for current timestamp variables
In a number of places we pass around a struct timespec representing the
(more or less) current time.  Sometimes we call it 'now', and sometimes we
call it 'ts'.  Standardise on the more informative 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:10 +01:00
David Gibson
17bbab1c97 flow: Make flow_table.h #include the protocol specific headers it needs
flow_table.h, the lower level flow header relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h.  It doesn't include it itself, so tcp_conn.h must be included
before flow_table.h.

That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them, which means the
protocol specific .c files would need to include tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h every time,
which is weird.

So, although we *mostly* lean towards the include style where .c files need
to handle the include dependencies, in this case it makes more sense to
have flow_table.h include all the protocol specific headers it needs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:34:55 +01:00
David Gibson
00c6eb6b68 pif: Remove unused pif_name() function
pif_name() has no current callers, although we expect some as we expand the
flow table support.  I'm not sure why this didn't get caught by one of
our static checkers earlier, but it's now causing cppcheck failures for me.

Add a cppcheck suppression.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-16 21:49:30 +01:00
David Gibson
a179ca6707 treewide: Make a bunch of pointer variables pointers to const
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-16 21:49:27 +01:00
David Gibson
f60c85194b test: Fix passt.mbuto for cases where /usr/sbin doesn't exist
f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make
mbuto more robust by standardizing on always putting things in /usr/sbin
with /sbin a symlink to it.  This matters because different distros have
different conventions about how the two are used.

However, the logic there requires that /usr/sbin at least exists to start
with.  This isn't always the case with Fedora derived mbuto images.
Ironically the DIRS variable ensures that /sbin exists, although we then
remove it, but doesn't require /usr/sbin to exist.  Fix that up so that
the new logic will work with Fedora.

Fixes: f0ccca741f ("test: make passt.mbuto script more robust")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-16 21:48:31 +01:00
Stefano Brivio
f091893c1f netlink: Fetch most specific (longest prefix) address in nl_addr_get()
This happened in most cases implicitly before commit eff3bcb245
("netlink: Split nl_addr() into separate operation functions"): while
going through results from netlink, we would only copy an address
into the provided return buffer if no address had been picked yet.

Because of the insertion logic in the kernel (ipv6_link_dev_addr()),
the first returned address would also be the one added last, and, in
case of a Linux guest using a DHCPv6 client as well as SLAAC, that
would be the address assigned via DHCPv6, because SLAAC happens
before the DHCPv6 exchange.

The effect of, instead, picking the last returned address (first
assigned) is visible when passt or pasta runs nested, given that, by
default, they advertise a prefix for SLAAC usage, plus an address via
DHCPv6.

The first level (L1 guest) would get a /64 address by means of SLAAC,
and a /128 address via DHCPv6, the latter matching the address on the
host.

The second level (L2 guest) would also get two addresses: a /64 via
SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the
the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That
is, none of the L2 addresses would match the address on the host. The
whole point of having a DHCPv6 server is to avoid (implicit) NAT when
possible, though.

Fix this in a more explicit way than the behaviour we initially had:
pick the first address among the set of most specific ones, by
comparing prefix lengths. Do this for IPv4 and for link-local
addresses, too, to match in any case the implementation of the
default source address selection.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Fixes: eff3bcb245 ("netlink: Split nl_addr() into separate operation functions")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-30 11:45:27 +01:00
Stefano Brivio
62b94c3ec8 README: Default SLAAC prefix comes from address (not prefix) on host
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-30 11:45:27 +01:00
Stefano Brivio
e197c4e490 README: Fix broken link to CentOS Stream package
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-30 11:45:27 +01:00
Jon Paul Maloy
f0ccca741f test: make passt.mbuto script more robust
Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin
exists and is non-empty. This is the case on Ubuntu-23.04.

We fix this by removing /sbin before creating the link.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:33:31 +01:00
Laurent Vivier
6a348cb435 tcp: make tcp_sock_set_bufsize() static (again)
e5eefe7743 ("tcp: Refactor to use events instead of states, split out
spliced implementation") has exported tcp_sock_set_bufsize() to
be able to use it in tcp_splice.c, but 6ccab72d9b has removed its use
in tcp_splice.c, so we can set it static again.

Fixes: 6ccab72d9b ("tcp: Improve handling of fallback if socket pool is empty on new splice")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:31:25 +01:00
David Gibson
57de44a4bc util: Make sock_l4() treat empty string ifname like NULL
sock_l4() takes NULL for ifname if you don't want to bind the socket to a
particular interface.  However, for a number of the callers, it's more
natural to use an empty string for that case.  Change sock_l4() to accept
either NULL or an empty string equivalently, and simplify some callers
using that change.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5cada56186 treewide: Avoid in_addr_t
IPv4 addresses can be stored in an in_addr_t or a struct in_addr.  The
former is just a type alias to a 32-bit integer, so doesn't really give us
any type checking.  Therefore we generally prefer the structure, since we
mostly want to treat IP address as opaque objects.  Fix a few places where
we still use in_addr_t, but can just as easily use struct in_addr.

Note there are still some uses of in_addr_t in conf.c, but those are
justified: since they're doing prefix calculations, they actually need to
look at the internals of the address as an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
24d1f6570b icmp: Avoid unnecessary handling of unspecified bind address
We go to some trouble, if the configured output address is unspecified, to
pass NULL to sock_l4().  But while passing NULL is one way to get sock_l4()
not to specify a bind address, passing the "any" address explicitly works
too.  Use this to simplify icmp_tap_handler().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
b9f4314ef9 util: Drop explicit setting to INADDR_ANY/in6addr_any in sock_l4()
The original commit message says:

---
Currently we initialise the address field of the sockaddrs we construct
to the any/unspecified address, but not in a very clear way: we use
explicit 0 values, which is only interpretable if you know the order of
fields in the sockaddr structures.  Use explicit field names, and explicit
initialiser macros for the address.

Because we initialise to this default value, we don't need to explicitly
set the any/unspecified address later on if the caller didn't pass an
overriding bind address.
---

and the original patch modified the initialisation of addr4 and
addr6:

- instead of { 0 }, { 0 } for sin_addr and sin_zero,
  .sin_addr = IN4ADDR_ANY_INIT

- instead of 0, IN6ADDR_ANY_INIT, 0:
  .sin6_addr = IN6ADDR_ANY_INIT

but I dropped those hunks: they break gcc versions 7 to 9 as reported
in eed6933e6c ("udp: Explicitly initialise sin6_scope_id and
sin_zero in sockaddr_in{,6}").

I applied the rest of the changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Dropped first two hunks]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
eae4304000 util: Use htonl_constant() in more places
We might as well when we're passing a known constant value, giving the
compiler the best chance to optimise things away.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
073f530bfe treewide: Add IN4ADDR_ANY_INIT macro
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address, make a similar one for the unspecified / any address.
This avoids messying things with the internal structure of struct in_addr
where we don't care about it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
546332786c treewide: Use IN4ADDR_LOOPBACK_INIT more widely
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address without delving into its internals.  However there are
some places we don't use it, and explicitly look at the internal structure
of struct in_addr, which we generally want to avoid.  Use the define more
widely to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
1f2aab8aaa tcp: Fix address type for tcp_sock_init_af()
This takes a struct in_addr * (i.e. an IPv4 address), although it's
explicitly supposed to handle IPv6 as well.  Both its caller and sock_l4()
which it calls use a void * for the address, which can be either an in_addr
or an in6_addr.

We get away with this, because we don't do anything with the pointer other
than transfer it from the caller to sock_l4(), but it's misleading.  And
quite possibly technically UB, because C is like that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
4681ea09bc checksum: Don't use linux/icmp.h when netinet/ip_icmp.h will do
In most places where we need to get ICMP definitions, we get them from
<netinet/ip_icmp.h>.  However in checksum.c we instead include
<linux/icmp.h>.  Change it to use <netinet/ip_icmp.h> for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5d5bb8c150 tcp: Don't account for hash table size in tcp_hash()
Currently tcp_hash() returns the hash bucket for a value, that is the hash
modulo the size of the hash table.  Usually it's a bit more flexible to
have hash functions return a "raw" hash value and perform the modulus in
the callers.  That allows the same hash function to be used for multiple
tables of different sizes, or to re-use the hash for other purposes.

We don't do anything like that with tcp_hash() at present, but we have some
plans to do so.  Prepare for that by making tcp_hash() and tcp_conn_hash()
return raw hash values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
64e5459ba6 tcp: Implement hash table with indices rather than pointers
We implement our hash table with pointers to the entry for each bucket (or
NULL).  However, the entries are always allocated within the flow table,
meaning that a flow index will suffice, halving the size of the hash table.

For TCP, just a flow index would be enough, but future uses will want to
expand the hash table to cover indexing either side of a flow, so use a
flow_sidx_t as the type for each hash bucket.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5913f26415 tcp: Switch hash table to linear probing instead of chaining
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.

That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing.  Here if a hash bucket is occupied
we just move onto the next one until we find a feww one.  This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link.  It does require some
additional complexity for hash removal.

This approach can perform poorly with hash table load is high.  However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load.  It's relatively cheap to decrease
that bound if we find we need to.

I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition.  Specifically Algorithm L and Algorithm
R in Section 6.4.  Note that there is an error in Algorithm R as printed,
see errata at [0].

[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
89fcb563fc tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()
tcp_hash_lookup() expects the port numbers in host order, but the TCP
header, of course, has them in network order, so we need to switch them.
However we call htons() (host to network) instead of ntohs() (network to
host).  This works because those do the same thing in practice (they only
wouldn't on very strange theoretical platforms which are neither big nor
little endian).

But, having this the "wrong" way around is misleading, so switch it around.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
a672705e4d README: Update "Availability" section
It's been a while -- there are now official packages for Arch Linux,
Gentoo, Void Linux.

Suggested-by: Rahil Bhimjiani <me@rahil.website>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
fd29d62a9d tcp: Cast timeval fields to unsigned long long for printing
On x32, glibc defines time_t and suseconds_t (the latter, also known as
__syscall_slong_t) as unsigned long long, whereas "everywhere else",
including x86_64 and i686, those are unsigned long.

See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for
all the gory details.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
60925b8b4e flow: Add missing include, stdio.h
Reported-by: lemmi <lemmi@nerd2nerd.org>
Link: https://github.com/void-linux/void-packages/actions/runs/7097192513/job/19316903568
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
baf4e2c028 test: Select first reported IPv6 address for guest/host comparison
If we run passt nested (a guest connected via passt to a guest
connected via passt to the host), the first guest (L1) typically has
two IPv6 addresses on the same interface: one formed from the prefix
assigned via SLAAC, and another one assigned via DHCPv6 (to match the
address on the host).

When we select addresses for comparison, in this case, we have
multiple global unicast addresses -- again, on the same interface.
Selecting the first reported one on both host and guest is not
entirely correct (in theory, the order might differ), but works
reasonably well.

Use the trick from 5beef08597 ("test: Only select a single
interface or gateway in tests") to ask jq(1) for the first address
returned by the query.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:28:35 +01:00
Stefano Brivio
00358b7828 ndp: Extend lifetime of prefix, router, RDNSS and search list
Currently, we have no mechanism to dynamically update IPv6
addressing, routing or DNS information (which should eventually be
implemented via netlink monitor), so it makes no sense to limit
lifetimes of NDP information to any particular value.

If we do, with common configurations of systemd-networkd in a guest,
we can end up in a situation where we have a /128 address assigned
via DHCPv6, the NDP-assigned prefix expires, and the default route
also expires. However, as there's a valid address, the prefix is
not renewed. As a result, the default route becomes invalid and we
lose it altogether, which implies that the guest loses IPv6
connectivity except for link-local communication.

Set the router lifetime to the maximum allowed by RFC 8319, that is,
65535 seconds (about 18 hours). RFC 4861 limited this value to 9000
seconds, but RFC 8319 later updated this limit.

Set prefix and DNS information lifetime to infinity. This is allowed
by RFC 4861 and RFC 8319.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-27 19:22:29 +01:00
David Gibson
d491a4099b test: Make handling of shell prompts with escapes a little more reliable
When using the old-style "pane" methods of executing commands during the
tests, we need to scan the shell output for prompts in order to tell when
commands have finished.  This is inherently unreliable because commands
could output things that look like prompts, and prompts might not look like
we expect them to.  The only way to really fix this is to use a better way
of dispatching commands, like the newer "context" system.

However, it's awkward to convert everything to "context" right at the
moment, so we're still relying on some tests that do work most of the time.
It is, however, particularly sensitive to fancy coloured prompts using
escape sequences.  Currently we try to handle this by stripping actual
ESC characters with tr, then looking for some common variants.

We can do a bit better: instead strip all escape sequences using sed before
looking for our prompt.  Or, at least, any one using [a-zA-Z] as the
terminating character. Strictly speaking ANSI escapes can be terminated by
any character in 0x40..0x7e, which isn't easily expressed in a regexp.
This should capture all common ones, though.

With this transformation we can simplify the list of patterns we then look
for as a prompt, removing some redundant variants.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-07 07:24:48 +01:00
David Gibson
b86afe3559 tcp: Don't defer hash table removal
When a TCP connection is closed, we mark it by setting events to CLOSED,
then some time later we do final cleanups: closing sockets, removing from
the hash table and so forth.

This does mean that when making a hash lookup we need to exclude any
apparent matches that are CLOSED, since they represent a stale connection.
This can happen in practice if one connection closes and a new one with the
same endpoints is started shortly afterward.

Checking for CLOSED is quite specific to TCP however, and won't work when
we extend the hash table to more general flows.  So, alter the code to
immediately remove the connection from the hash table when CLOSED, although
we still defer closing sockets and other cleanup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:33 +01:00
David Gibson
e21b6d69b1 tcp: "TCP" hash secret doesn't need to be TCP specific
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.

We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret.  So move the hash_secret
to the general context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:32 +01:00
David Gibson
cf83988e96 pif: Add helpers to get the name of a pif
Future debugging will want to identify a specific passt interface.  We make
a distinction in these helpers between the name of the *type* of pif, and
name of the pif itself.  For the time being these are always the same
thing, since we have at most instance of each type of pif.  However, that
might change in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:29 +01:00
David Gibson
460064d262 test: Avoid hitting guestfish command length limits
In test/prepare-distro-img.sh we use guestfish to tweak our distro guest
images to be suitable.  Part of this is using a 'copy-in' directive to copy
in the source files for passt itself.  Currently we copy in all the files
with a single 'copy-in', since it allows listing multiple files.  However
it turns out that the number of arguments it can accept is fairly limited
and our current list of files is already very close to that limit.

Instead, expand our list of files to one copy-in per file, avoiding that
limitation.  This isn't much slower, because all the commands still run in
a single invocation of guestfish itself.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:26 +01:00