Commit graph

1577 commits

Author SHA1 Message Date
Stas Sergeev
f851084c96 dhcp: put option 53 at the beginning
... unless it is listed in 55.
Many clients expect option 53 at the beginning.
mTCP has this code:
  if ( resp->options[0] != 53 ) {
    TRACE_WARN(( "Dhcp: first option was not a Dhcp msg type\n" ));
    return;
  }

wattcp32 has this:
static int DHCP_is_ack (void)
{
  const BYTE *opt = (const BYTE*) &dhcp_in.dh_opt[4];

  return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_ACK);
}
static int DHCP_is_nack (void)
{
  const BYTE *opt = (const BYTE*) &dhcp_in.dh_opt[4];

  return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_NAK);
}

Link: https://bugs.passt.top/show_bug.cgi?id=77
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
[sbrivio: s/options 53/option 53/ and s/other/others/ in comment]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:39:58 +02:00
Stefano Brivio
a469fc393f tcp, tap: Don't increase tap-side sequence counter for dropped frames
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).

It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.

But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.

Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:39:58 +02:00
Stefano Brivio
d3192f67c4 tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
It looks like we need it as workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:

- an iperf3 client on the host sends about 160 KiB, typically
  segmented into five frames by passt. We read this data using
  MSG_PEEK

- the iperf3 server on the guest starts receiving

- meanwhile, the host kernel advertised a zero-sized window to the
  sender, as expected

- eventually, the guest acknowledges all the data sent so far, and
  we drop it from the buffer, courtesy of tcp_sock_consume(), using
  recv() with MSG_TRUNC

- the client, however, doesn't get an updated window value, and
  even keepalive packets are answered with zero-window segments,
  until the connection is closed

It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).

Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the sender.

I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).

Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:27:15 +02:00
Stefano Brivio
feaeb4986c tcp: Fix comment to tcp_sock_consume()
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.

Fixes: cc6d8286d1 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:24:08 +02:00
David Gibson
117b474f85 cppcheck: Work around bug in cppcheck 2.12.0
cppcheck 2.12.0 (and maybe some other versions) things this if condition
is always true, which is demonstrably not true.  Work around the bug for
now.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:24:05 +02:00
David Gibson
8640d62af7 cppcheck: Use "exhaustive" level checking when available
Recent enough cppcheck versions (at least as of cppcheck 2.12) give this
error processing conf.c:

conf.c:1179:1: information: ValueFlow analysis is limited in conf. Use --check-level=exhaustive if full analysis is wanted. [checkLevelNormal]

Adding --check-level=exhaustive doesn't seem to significantly increase the
cppcheck run time for us, so enable it when possible, suppressing that
warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:24:00 +02:00
David Gibson
c0efa4e97f conf: Remove overly cryptic selection of forward table
In a couple of places in conf(), we use a local 'fwd' variable to reference
one of our forwarding tables.  The value depends on which command line
option we're currently looking at, and is initialized rather cryptically
from an assignment side-effect within the if condition checking that
option.

Newer versions of cppcheck complain about this assignment being an always
true condition, but in any case it's both clearer and slightly shorter to
use separate if branches for the two cases and set the forwarding parameter
more directly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:56 +02:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
fc8f0f8c48 siphash: Use incremental rather than all-at-once siphash functions
We have a bunch of variants of the siphash functions for different data
sizes.  The callers, in tcp.c, need to pack the various values they want to
hash into a temporary structure, then call the appropriate version.  We can
avoid the copy into the temporary by directly using the incremental
siphash functions.

The length specific hash functions also have an undocumented constraint
that the data pointer they take must, in fact, be aligned to avoid
unaligned accesses, which may cause crashes on some architectures.

So, prefer the incremental approach and remove the length-specific
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:53 +02:00
David Gibson
04b10a8d90 siphash, checksum: Move TBAA explanation to checksum.c
A number of checksum and hash functions require workarounds for the odd
behaviour of Type-Baased Alias Analysis.  We have a detailed comment about
this on siphash_8b() and other functions reference that.

Move the main comment to csume_16b() instead, because we're going to
reorganise things in siphash.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:51 +02:00
David Gibson
ceae8422c1 siphash: Make internal helpers public
Move a bunch of code from siphash.c to siphash.h, making it available to
other modules.  This will allow places which need hashes of more complex
objects to construct them incrementally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:50 +02:00
David Gibson
cbc84df47f siphash: Use specific structure for internal state
To improve type safety, encapsulate the internal state of the SipHash
algorithm into a dedicated structure type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:48 +02:00
David Gibson
fcec3f6f9d siphash: Use more hygienic state initialiser
The PREAMBLE macro sets up the SipHash initial internal state.  It also
defines that state as a variable, which isn't macro hygeinic.  With
previous changes simplifying this premable, it's now possible to replace it
with a macro which simply expands to a C initialisedrfor that state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:46 +02:00
David Gibson
5cc843521d siphash: Fix bug in state initialisation
The SipHash algorithm starts with initializing the 32 bytes of internal
state with some magic numbers XORed with the hash key.  However, our
implementation has a bug - rather than XORing the hash key, it *sets* the
initial state to copies of the key.

I don't know if that affects any of the cryptographic properties of SipHash
but it's not what we should be doing.  Fix it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:43 +02:00
David Gibson
831067f483 siphash: Clean up hash finalisation with posthash_final() function
The POSTAMBLE macro implements the finalisation steps of SipHash.  It
relies on some variables in the environment, including returning the final
hash value that way.  That isn't great hygeine.

In addition the PREAMBLE macro takes a length parameter which is used only
to initialize the 'b' value that's not used until the finalisation and is
also sometimes modified in a non-obvious way by the callers.

The 'b' value is always composed from the total length of the hash input
plus up to 7 bytes of "tail" data - that is the remainder of the hash input
after a multiple of 8 bytes has been consumed.

Simplify all this by replacing the POSTAMBLE macro with a siphash_final()
function which takes the length and tail data as parameters and returns the
final hash value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:40 +02:00
David Gibson
7a3153cbfb siphash: Add siphash_feed() helper
We have macros or inlines for a number of common operations in the siphash
functions.  However, in a number of places we still open code feeding
another 64-bits of data into the hash function: an xor, followed by 2
rounds of shuffling, followed by another xor.

Implement an inline function for this, which results in somewhat shortened
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:37 +02:00
David Gibson
f7b2be2d21 siphash: Make sip round calculations an inline function rather than macro
The SIPROUND(n) macro implements n rounds of SipHash shuffling.  It relies
on 'v' and '__i' variables being available in the context it's used in
which isn't great hygeine.  Replace it with an inline function instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:34 +02:00
David Gibson
ca6e94702c siphash: Make siphash functions consistently return 64-bit results
Some of the siphas_*b() functions return 64-bit results, others 32-bit
results, with no obvious pattern.  siphash_32b() also appears to do this
incorrectly - taking the 64-bit hash value and simply returning it
truncated, rather than folding the two halves together.

Since SipHash proper is defined to give a 64-bit hash, make all of them
return 64-bit results.  In the one caller which needs a 32-bit value,
tcp_seq_init() do the fold down to 32-bits ourselves.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:30 +02:00
David Gibson
c1d2a070f2 util: Consolidate and improve workarounds for clang-tidy issue 58992
We have several workarounds for a clang-tidy bug where the checker doesn't
recognize that a number of system calls write to - and therefore initialise
- a socket address.  We can't neatly use a suppression, because the bogus
warning shows up some time after the actual system call, when we access
a field of the socket address which clang-tidy erroneously thinks is
uninitialised.

Consolidate these workarounds into one place by using macros to implement
wrappers around affected system calls which add a memset() of the sockaddr
to silence clang-tidy.  This removes the need for the individual memset()
workarounds at the callers - and the somewhat longwinded explanatory
comments.

We can then use a #define to not include the hack in "real" builds, but
only consider it for clang-tidy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:26:06 +02:00
David Gibson
5b6c68c2e4 Avoid shadowing index(3)
A classic gotcha of the standard C library is that its unwise to call any
variable 'index' because it will shadow the standard string library
function index(3).  This can cause warnings from cppcheck amongst others,
and it also means that if the variable is removed you tend to get confusing
type errors (or sometimes nothing at all) instead of a nice simple "name is
not defined" error.

Strictly speaking this only occurs if <string.h> is included, but that
is so common that as a rule it's best to just avoid it always.  We
have a number of places which hit this trap, so rename variables and
parameters to avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:25:51 +02:00
Stefano Brivio
9178a9e346 tcp: Always send an ACK segment once the handshake is completed
The reporter is running a SMTP server behind pasta, and the client
waits for the server's banner before sending any data. In turn, the
server waits for our ACK after sending SYN,ACK, which never comes.

If we use the ACK_IF_NEEDED indication to tcp_send_flag(), given that
there's no pending data, we delay sending the ACK segment at the end
of the three-way handshake until we have some data to send to the
server.

This was actually intended, as I thought we would lower the latency
for new connections, but we can't assume that the client will start
sending data first (SMTP is the typical example where this doesn't
happen).

And, trying out this patch with SSH (where the client starts sending
data first), the reporter actually noticed we have a lower latency
by forcing an ACK right away. Comparing a capture before the patch:

13:07:14.007704 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [S], seq 1797034836, win 65535, options [mss 4096,nop,wscale 7], length 0
13:07:14.007769 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [S.], seq 2297052481, ack 1797034837, win 65480, options [mss 65480,nop,wscale 7], length 0
13:07:14.008462 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 65535, length 21
13:07:14.008496 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [.], ack 22, win 512, length 0
13:07:14.011799 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [P.], seq 1:515, ack 22, win 512, length 514

and after:

13:10:26.165364 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [S], seq 4165939595, win 65535, options [mss 4096,nop,wscale 7], length 0
13:10:26.165391 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [S.], seq 985607380, ack 4165939596, win 65480, options [mss 65480,nop,wscale 7], length 0
13:10:26.165418 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], ack 1, win 512, length 0
13:10:26.165683 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 512, length 21
13:10:26.165698 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [.], ack 22, win 512, length 0
13:10:26.167107 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [P.], seq 1:515, ack 22, win 512, length 514

the latency between the initial SYN segment and the first data
transmission actually decreases from 792µs to 334µs. This is not
statistically relevant as we have a single measurement, but it can't
be that bad, either.

Reported-by: cr3bs (from IRC)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-27 17:25:30 +02:00
Stefano Brivio
c09069211a dhcp: Actually note down the length of options received by the client
It turns out we never used 'clen' until commit 1f24d3efb4 ("dhcp:
support BOOTP clients"), and we always ignored option 55 (Parameter
Request List), while, according to RFC 2132, we MUST try to insert
the requested options in the order requested by the client.

The commit mentioned above made this visible because now every client
is reported as sending a DHCPREQUEST as an old BOOTP client, based on
the lack of option 53 (that is, zero length).

Fixes: b439984641 ("merd: ARP and DHCP handlers, connection tracking fixes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-27 17:25:22 +02:00
Stefano Brivio
8b8537d301 dhcpv6: Properly separate domain names in search list
To prepare the DHCPv6 domain search list option, we go over the
flattened list of domains, and replace both dots and zero bytes with a
counter of bytes in the next label, implementing the encoding
specified by section 3.1 of RFC 1035.

If there are multiple domains in the list, however, zero bytes serve
as markers for the end of a domain name, and we'll replace them with
the length of the first label of the next domain, plus one. This is
wrong. We should only convert the dots before the labels.

To distinguish between label separators and domain names separators,
for simplicity, introduce a dot before the first label of every
domain we copy to form the list. All dots are then replaced by label
lengths, and separators (zero bytes) remain as they are.

As we do this, we need to make sure we don't replace the trailing
dot, if present: that's already a separator. Skip copying it, and
just add separators as needed.

Now that we don't copy those, though, we might end up with
zero-length domains: skip them, as they're meaningless anyway.

And as we might skip domains, we can't use the index 'i' to check if
we're at the beginning of the option -- use 'srch' instead.

This is very similar to how we prepare the list for NDP option 31,
except that we don't need padding (RFC 8106, 5.2) here, and we should
refactor this into common functions, but it probably makes sense to
rework the NDP responder (https://bugs.passt.top/show_bug.cgi?id=21)
first.

Reported-by: Sebastian Mitterle <smitterl@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=75
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:21:00 +02:00
Stefano Brivio
05627dc512 util: Fix licensing information display in --version
The regular expression I used when relicensing to GPLv2+ missed this.

Fixes: ca2749e1bd ("passt: Relicense to GPL 2.0, or any later version")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 17:34:27 +02:00
David Gibson
46f915ddee tcp: Correct handling of FIN,ACK followed by SYN
When the guest tries to establish a connection, it could give up on it by
sending a FIN,ACK instead of a plain ACK to our SYN,ACK.  It could then
make a new attempt to establish a connection with the same addresses and
ports with a new SYN.

Although it's unlikely, it could send the 2nd SYN very shortly after the
FIN,ACK resulting in both being received in the same batch of packets from
the tap interface.

Currently, we don't handle that correctly, when we receive a FIN,ACK on a
not fully established connection we discard the remaining packets in the
batch, and so will never process the 2nd SYN.  Correct this by returning
1 from tcp_tap_handler() in this case, so we'll just consume the FIN,ACK
and continue to process the rest of the batch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:22 +02:00
David Gibson
b3f2210b05 tcp: Consolidate paths where we initiate reset on tap interface
There are a number of conditions where we will issue a TCP RST in response
to something unexpected we received from the tap interface.  These occur in
both tcp_data_from_tap() and tcp_tap_handler().  In tcp_tap_handler() use
a 'goto out of line' technique to consolidate all these paths into one
place.  For the tcp_data_from_tap() cases use a negative return code and
direct that to the same path in tcp_tap_handler(), its caller.

In this case we want to discard all remaining packets in the batch we have
received: even if they're otherwise good, they'll be invalidated when the
guest receives the RST we're sending.  This is subtly different from the
case where we *receive* an RST, where we could in theory get a new SYN
immediately afterwards.  Clarify that with a common on the now common
reset path.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:19 +02:00
David Gibson
f984003fdf tcp: Correctly handle RST followed rapidly by SYN
Although it's unlikely in practice, the guest could theoretically
reset one TCP connection then immediately start a new one with the
same addressses and ports, such that we get an RST then a SYN in the
same batch of received packets in tcp_tap_handler().

We don't correctly handle that unlikely case, because when we receive
the RST, we discard any remaining packets in the batch so we'd never
see the SYN.  This could happen in either tcp_tap_handler() or
tcp_data_from_tap().  Correct that by returning 1, so that the caller
will continue calling tcp_tap_handler() on subsequent packets allowing
us to process any subsequent SYN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:17 +02:00
David Gibson
60d3915ea3 tcp: Return consumed packet count from tcp_data_from_tap()
Currently tcp_data_from_tap() is assumed to consume all packets remaining
in the packet pool it is given.  However there are some edge cases where
that's not correct.  In preparation for fixing those, change it to return
a count of packets consumed and use that in its caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:13 +02:00
David Gibson
5fb376de6e tcp: Never hash match closed connections
>From a practical point of view, when a TCP connection ends, whether by
FIN or by RST, we set the CLOSED event, then some time later we remove the
connection from the hash table and clean it up.  However, from a protocol
point of view, once it's closed, it's gone, and any new packets with
matching addresses and ports are either forming a new connection, or are
invalid packets to discard.

Enforce these semantics in the TCP hash logic by never hash matching closed
connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:10 +02:00
David Gibson
805dd109a4 tcp: Remove some redundant packet_get() operations
Both tcp_data_from_tap() and tcp_tap_handler() call packet_get() to get
the entire L4 packet length, then immediately call it again to check that
the packet is long enough to include a TCP header.  The features of
packet_get() let us easily combine these together, we just need to adjust
the length slightly, because we want the value to include the TCP header
length.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:07 +02:00
David Gibson
7b56117dae udp, tap: Correctly advance through packets in udp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each DUP
sequence calling udp_tap_handler().  Or so it appears.

In fact, udp_tap_handler() doesn't take an index and always starts with
packet 0 of the sequence, even if called repeatedly.  It appears to be
written with the idea that the struct pool is a queue, from which it
consumes packets as it processes them, but that's not how the pool data
structure works.

Correct this by adding an index parameter to udp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:04 +02:00
David Gibson
043a70b885 tcp, tap: Correctly advance through packets in tcp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each TCP
sequence calling tcp_tap_handler().  Or so it appears.

In fact, tcp_tap_handler() doesn't take an index and always looks at packet
0 of the sequence, except when it calls tcp_data_from_tap() to process
data packets.  It appears to be written with the idea that the struct pool
is a queue, from which it consumes packets as it processes them, but that's
not how the pool data structure works - they are more like an array of
packets.

We only get away with this, because setup packets for TCP tend to come in
separate batches (because we need to reply in between) and so we only get
a bunch of packets for the same connection together when they're data
packets (tcp_data_from_tap() has its own loop through packets).

Correct this by adding an index parameter to tcp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Link: https://bugs.passt.top/show_bug.cgi?id=68
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:15:46 +02:00
Stefano Brivio
ee58f37db0 test: Add Podman system test with bats for pasta
Ugly as hell, but we keep breaking things otherwise, and I keep
forgetting to run this manually (as long as it's based on my local
Podman setup, that's the only alternative).

We need to clone the Podman repository as distribution packages don't
contain test scripts, typically. While at it, build the latest
version which is what really matters.

As we're planning anyway to revamp the test framework, I'd be
inclined to just add this without too many thoughts, and have it as
a nice-to-have requirement reminder for the new framework.

Link: https://github.com/containers/podman/pull/19699
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-07 11:25:41 +02:00
Stas Sergeev
1f24d3efb4 dhcp: support BOOTP clients
BOOTP clients do not use tagged messages for requests.
As such, any message without the DHCP option 53, should be
considered a BOOTP request.

Link: https://bugs.passt.top/show_bug.cgi?id=72
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 11:24:46 +02:00
Stas Sergeev
d8c4f23ecd tap: fix uses of l3_len in tap4_handler()
l3_len was calculated from the ethernet frame size, and it
was assumed to be equal to the length stored in an IP packet.
But if the ethernet frame is padded, then l3_len calculated
that way can only be used as a bound check to validate the
length stored in an IP header. It should not be used for
calculating the l4_len.

This patch makes sure the small padded ethernet frames are
properly processed, by trusting the length stored in an IP
header.

Link: https://bugs.passt.top/show_bug.cgi?id=73
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 11:24:14 +02:00
Stefano Brivio
a405d0c026 fedora: Replace pasta hard links by separate builds
The hard link trick didn't actually fix the issue with SELinux file
contexts properly: as opposed to symbolic links, SELinux now
correctly associates types to the labels that are set -- except that
those labels are now shared, so we can end up (depending on how
rpm(8) extracts the archives) with /usr/bin/passt having a
pasta_exec_t context.

This got rather confusing as running restorecon(8) seemed to fix up
labels -- but that's simply toggling between passt_exec_t and
pasta_exec_t for both links, because each invocation will just "fix"
the file with the mismatching context.

Replace the hard links with two separate builds of the binary, as
suggested by David. The build is reproducible, so we pass "-pasta" in
the VERSION for pasta's build. This is wasteful but better than the
alternative.

Just copying the binary over would otherwise cause issues with
debuginfo packages due to duplicate Build-IDs -- and rpmbuild(8) also
warns about them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 01:57:00 +02:00
Stefano Brivio
63a8302961 apparmor: Add pasta's own profile
If pasta and pasta.avx2 are hard links to passt and passt.avx2,
AppArmor will attach their own profiles on execution, and we can
restrict passt's profile to what it actually needs. Note that pasta
needs to access all the resources that passt needs, so the pasta
abstraction still includes passt's one.

I plan to push the adaptation required for the Debian package in
commit 5bb812e79143 ("debian/rules: Override pasta symbolic links
with hard links"), on Salsa. If other distributions need to support
AppArmor profiles they can follow a similar approach.

The profile itself will be installed, there, via dh_apparmor, in a
separate commit, b52557fedcb1 ("debian/rules: Install new pasta
profile using dh_apparmor").

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 00:31:35 +02:00
Stefano Brivio
abf5ef6c22 apparmor: Allow pasta to remount /proc, access entries under its own copy
Since commit b0e450aa85 ("pasta: Detach mount namespace, (re)mount
procfs before spawning command"), we need to explicitly permit mount
of /proc, and access to entries under /proc/PID/net (after remount,
that's what AppArmor sees as path).

Fixes: b0e450aa85 ("pasta: Detach mount namespace, (re)mount procfs before spawning command")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 00:31:35 +02:00
Stefano Brivio
e2ad420fa2 apparmor: Allow read-only access to uid_map
Starting with commit 770d1a4502 ("isolation: Initially Keep
CAP_SETFCAP if running as UID 0 in non-init"), the lack of this rule
became more apparent as pasta needs to access uid_map in procfs even
as non-root.

However, both passt and pasta needs this, in case they are started as
root, so add this directly to passt's abstraction (which is sourced
by pasta's profile too).

Fixes: 770d1a4502 ("isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 00:31:35 +02:00
Stefano Brivio
b686afa23e apparmor: Explicitly pass options we use while remounting root filesystem
As a result of AppArmor commit d4b0fef10a4a ("parser: fix rule flag
generation change_mount type rules"), we can't expect anymore to
get permission to mount() / read-write, with MS_REC | MS_UNBINDABLE
("runbindable", in AppArmor terms), if we don't explicitly pass those
flags as options. It used to work by mistake.

Now, the reasonable expectation would be that we could just change the
existing rule into:

  mount options=(rw, runbindable) "" -> /,

...but this now fails to load too, I think as a result of AppArmor
commit 9d3f8c6cc05d ("parser: fix parsing of source as mount point
for propagation type flags"). It works with 'rw' alone, but
'runbindable' is indeed a propagation type flag.

Skip the source specification, it doesn't add anything meaningful to
the rule anyway.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/19751
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 00:31:22 +02:00
Stefano Brivio
6d3e426d2f apparmor: Use abstractions/nameservice to deal with symlinked resolv.conf
While abstractions/nameservice appeared too broad and overkill for
our simple need (read-only resolv.conf access), it properly deals
with symlinked resolv.conf files generated by systemd-resolved,
NetworkManager or suchlike.

If we just grant read-only access to /etc/resolv.conf, we'll fail to
read nameserver information in rather common configurations, because
AppArmor won't follow the symlink.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-06 20:39:12 +02:00
David Gibson
a7e4bfb857 pasta: Strip RTA_PREFSRC when copying routes to the namespace
Host routes can include a preferred source address (RTA_PREFSRC), which
must be one of the host's addresses.  However when using pasta with -a the
namespace might be given a different address, not on the host.  This seems
to occur pretty routinely depending on the network configuration systems
in place on the host.

With --config-net we will try to copy host routes to the namespace.  If
one of those includes an RTA_PREFSRC, but the namespace doesn't have the
host address, this will fail with -EINVAL, causing pasta to fail.

Fix this by stripping off RTA_PREFSRC attributes from routes as we copy
them to the namespace.  This is by no means infallible, bit it should at
least handle common cases for the time being.

Link: https://bugs.passt.top/show_bug.cgi?id=71
Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-23 15:52:31 +02:00
Stefano Brivio
5e4f7b92b0 netlink: Set IFA_ADDRESS, not just IFA_LOCAL, while adding IPv4 addresses
Otherwise, we actually configure the address, but it's not usable
because no local route is added by the kernel.

Link: https://github.com/containers/podman/pull/19699
Fixes: cfe7509e5c ("netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-23 09:34:44 +02:00
David Gibson
69303cafbe tcp: Remove broken pressure calculations for tcp_defer_handler()
tcp_defer_handler() performs a potentially expensive linear scan of the
connection table.  So, to mitigate the cost of that we skip if if we're not
under at least moderate pressure: either 30% of available connections or
30% (estimated) of available fds used.

But, the calculation for this has been broken since it was introduced: we
calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS,
meaning we only exit early if conn_count is less than 30% of itself, i.e.
never.

If that calculation is "corrected" to be based on TCP_MAX_CONNS, it
completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my
laptop.  My guess is that this is because in the case of many short lived
connections, we're letting the table become much fuller before compacting
it.  That means that other places which perform a table scan now have to
do much, much more.

For the time being, simply remove the tests, since they're not doing
anything useful.  We can reintroduce them more carefully if we see a need
for them.

This also removes the only user of c->tcp.splice_conn_count, so that can
be removed as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:41 +02:00
David Gibson
eb8fbdbfd0 inany: Add missing double include guard to inany.h
This was overlooked when the file was created.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:38 +02:00
David Gibson
b60fa33eea tcp: Move in_epoll flag out of common connection structure
The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections.  It seems like it
belongs there, because both tap and spliced connections use it, and it has
roughly the same meaning.

Roughly, however, isn't exactly: which fds this flag says are in the epoll
varies between the two connection types, and are in type specific fields.
So, it's only possible to meaningfully use this value locally in type
specific code anyway.

This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate fields
in the tap and splice specific structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:36 +02:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
5bf200ae8a tcp, udp: Don't include destination address in partially precomputed csums
We partially prepopulate IP and TCP header structures including, amongst
other things the destination address, which for IPv4 is always the known
address of the guest/namespace.  We partially precompute both the IPv4
header checksum and the TCP checksum based on this.

In future we're going to want more flexibility with controlling the
destination for IPv4 (as we already do for IPv6), so this precomputed value
gets in the way.  Therefore remove the IPv4 destination from the
precomputed checksum and fold it into the checksum update when we actually
send a packet.

Doing this means we no longer need to recompute those partial sums when
the destination address changes ({tcp,udp}_update_l2_buf()) and instead
the computation can be moved to compile time.  This means while we perform
slightly more computations on each packet, we slightly reduce the amount of
memory we need to access.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:30 +02:00
David Gibson
8aa32009ed tcp: Consistent usage of ports in tcp_seq_init()
In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since
it's used for connections in both directions.  However, these values are
just feeding a hash, so as long as we're consistent and include all the
information we want, it doesn't really matter.

Oddly, for the "src" side we supply the (tap side) forwarding address but
the (tap side) endpoint port.  This again doesn't really matter, but it's
confusing.  So swap this with dstport, so "src" is always forwarding
and "dst" is always endpoint.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:27 +02:00
David Gibson
bccfbff193 tcp: More precise terms for addresses and ports
In a number of places the comments and variable names we use to describe
addresses and ports are ambiguous.  It's not sufficient to describe a port
as "tap-facing" or "socket-facing", because on both the tap side and the
socket side there are two ports for the two ends of the connection.
Similarly, "local" and "remote" aren't particularly helpful, because it's
not necessarily clear whether we're talking from the point of view of the
guest/namespace, the host, or passt itself.

This patch makes a number of changes to be more precise about this.  It
introduces two new terms in aid of this:
    A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself.  That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.
    The "endpoint" address (or port) is the reverse: a remote address
from passt's point of view, the destination address for traffic sent
by passt.

Between them the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describes which address or
port we're talking about.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:24 +02:00