Commit graph

117 commits

Author SHA1 Message Date
Stefano Brivio
a469fc393f tcp, tap: Don't increase tap-side sequence counter for dropped frames
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).

It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.

But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.

Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:39:58 +02:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
7b56117dae udp, tap: Correctly advance through packets in udp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each DUP
sequence calling udp_tap_handler().  Or so it appears.

In fact, udp_tap_handler() doesn't take an index and always starts with
packet 0 of the sequence, even if called repeatedly.  It appears to be
written with the idea that the struct pool is a queue, from which it
consumes packets as it processes them, but that's not how the pool data
structure works.

Correct this by adding an index parameter to udp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:04 +02:00
David Gibson
043a70b885 tcp, tap: Correctly advance through packets in tcp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each TCP
sequence calling tcp_tap_handler().  Or so it appears.

In fact, tcp_tap_handler() doesn't take an index and always looks at packet
0 of the sequence, except when it calls tcp_data_from_tap() to process
data packets.  It appears to be written with the idea that the struct pool
is a queue, from which it consumes packets as it processes them, but that's
not how the pool data structure works - they are more like an array of
packets.

We only get away with this, because setup packets for TCP tend to come in
separate batches (because we need to reply in between) and so we only get
a bunch of packets for the same connection together when they're data
packets (tcp_data_from_tap() has its own loop through packets).

Correct this by adding an index parameter to tcp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Link: https://bugs.passt.top/show_bug.cgi?id=68
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:15:46 +02:00
Stas Sergeev
d8c4f23ecd tap: fix uses of l3_len in tap4_handler()
l3_len was calculated from the ethernet frame size, and it
was assumed to be equal to the length stored in an IP packet.
But if the ethernet frame is padded, then l3_len calculated
that way can only be used as a bound check to validate the
length stored in an IP header. It should not be used for
calculating the l4_len.

This patch makes sure the small padded ethernet frames are
properly processed, by trusting the length stored in an IP
header.

Link: https://bugs.passt.top/show_bug.cgi?id=73
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-07 11:24:14 +02:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
cee4a2da48 tap: Pass source address to protocol handler functions
The tap code passes the IPv4 or IPv6 destination address of packets it
receives to the protocol specific code.  Currently that protocol code
doesn't use the source address, but we want it to in future.  So, in
preparation, pass the IPv4/IPv6 source address of tap packets to those
functions as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:21 +02:00
David Gibson
673bde1f21 tap: Don't clobber source address in tap6_handler()
In tap6_handler() saddr is initialized to the IPv6 source address from the
incoming packet.  However part way through, but before organizing the
packet into a "sequence" we set it unconditionally to the guest's assigned
address.  We don't do anything equivalent for IPv4.

This doesn't make a lot of sense: if the guest is using a different source
address it makes sense to consider these different sequences of packets and
we shouldn't try to combine them together.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:17 +02:00
Stefano Brivio
5f1fcfffe4 tap: Fix format specifier in tap4_is_fragment() warning
Spotted by Coverity, relatively harmless.

Fixes: e01759e2fa ("tap: Explicitly drop IPv4 fragments, and give a warning")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-08-16 10:46:07 +02:00
David Gibson
ae5f6c8e1b epoll: Use different epoll types for passt and pasta tap fds
Currently we have a single epoll event type for the "tap" fd, which could
be either a handle on a /dev/net/tun device (pasta) or a connected Unix
socket (passt).  However for the two modes we call different handler
functions.  Simplify this a little by using different epoll types and
dispatching directly to the correct handler function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:20 +02:00
David Gibson
eda4f1997e epoll: Split listening Unix domain socket into its own type
tap_handler() actually handles events on three different types of object:
the /dev/tap character device (pasta), a connected Unix domain socket
(passt) or a listening Unix domain socket (passt).

The last, in particular, really has no handling in common with the others,
so split it into its own epoll type and directly dispatch to the relevant
handler from the top level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:17 +02:00
David Gibson
6a6735ece4 epoll: Always use epoll_ref for the epoll data variable
epoll_ref contains a variety of information useful when handling epoll
events on our sockets, and we place it in the epoll_event data field
returned by epoll.  However, for a few other things we use the 'fd' field
in the standard union of types for that data field.

This actually introduces a bug which is vanishingly unlikely to hit in
practice, but very nasty if it ever did: theoretically if we had a very
large file descriptor number for fd_tap or fd_tap_listen it could overflow
into bits that overlap with the 'proto' field in epoll_ref.  With some
very bad luck this could mean that we mistakenly think an event on a
regular socket is an event on fd_tap or fd_tap_listen.

More practically, using different (but overlapping) fields of the
epoll_data means we can't unify dispatch for the various different objects
in the epoll.  Therefore use the same epoll_ref as the data for the tap
fds and the netns quit fd, adding new fd type values to describe them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:53 +02:00
David Gibson
e26282b67d tap: Fold reset handling into tap_handler_passt()
We call tap_sock_reset() if tap_handler_passt() fails, or if we get an
error event on the socket.  Fold that logic into tap_handler() passt itself
which simplifies the caller.  It also makes it clearer that we had a
redundant EPOLL_CTL_DEL and close() in one of the reset paths, so fix that
too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:49 +02:00
David Gibson
0d870c5da6 tap: Fold reset handling into tap_handler_pasta()
If tap_handler_pasta() fails, we reset the connection.  But in the case of
pasta the "reset" is just a fatal error.  Fold the die() calls directly
into tap_handler_pasta() for simplicity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:46 +02:00
David Gibson
548e05f76a tap: Clean up behaviour for errors on listening Unix socket
We call tap_sock_unix_new() to handle a new connection to the qemu socket
if we get an EPOLLIN event on c->fd_tap_listen.  If we get any other event
on the fd, we'll fall through to the "tap reset" path.  But that won't do
anything relevant to the listening socket, it will just close the already
connected socket.  Furthermore, the only other event we're subscribed to
for the listening socket is EPOLLRDHUP, which doesn't apply to a non
connected socket.

Remove EPOLLRDHUP from the subscribed events.  We don't need to explicitly
add EPOLLERR, because errors are always reported.  There's no obvious case
that would cause an error on a listening socket anyway, and it's not
obvious how we'd recover, treat it as a fatal error if it ever does happen.

Finally, fold all this handling into the tap_sock_unix_new() function,
there's no real reason to split it between there and tap_handler().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:44 +02:00
David Gibson
28877b0fcd tap: Clean up tap reset path
In tap_handler() if we get an error on the tap device or socket, we use
tap_sock_init() to re-initialise it.  However, what we actually need for
this reset case has remarkably little in common with the case where we're
initialising for the first time:
    * Re-initialising the packet pools is unnecessary
    * The case of a passed in fd (--fd) isn't relevant
    * We don't even call this for pasta mode
    * We will never re-call tap_sock_unix_init() because we never clear
      fd_tap_listen

In fact the only thing we do in tap_sock_init() relevant to the reset case
is to remove the fd from the epoll and close it... which isn't used in the
first initialisation case.

So make a new tap_sock_reset() function just for this case, and simplify
tap_sock_init() slightly as being used only for the first time case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:40 +02:00
Laurent Vivier
b2bea0047d tap: fix seq->p.count limit
The number of items in pool_l4_t is defined to UIO_MAXIOV,
not TAP_SEQS. TAP_SEQS is the number of the sequences.

Fix the value used to compare seq->p.count with.

Fixes: 37c228ada8 ("tap, tcp, udp, icmp: Cut down on some oversized buffers")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: s/messages/sequences/ in commit message, extend
 initialisation of packets in pool to UIO_MAXIOV items]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:24:56 +02:00
David Gibson
0cf7bf31f6 tap: Remove unnecessary global tun_ns_fd
tap_ns_tun(), which runs in an ephemeral thread puts the fd it opens into
the global variable tun_ns_fd to communicate it back to the main thread
in tap_sock_tun_init().

However, the only thing tap_sock_tun_init() does with it is copies it to
c->fd_tap and everything else uses it from there.  tap_ns_tun() already
has access to the context structure, so we might as well store the value
directly in there rather than having a global as an intermediate.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:18:11 +02:00
David Gibson
7bc9b66fc2 tap: More detailed error reporting in tap_ns_tun()
There are several possible failure points in tap_ns_tun(), but if anything
goes wrong, we just set tun_ns_fd to -1 resulting in the same error
message.

Add more detailed error reporting to the various failure points.  At the
same time, we know this is only called from tap_sock_tun_init() which will
terminate pasta if we fail, so we can simplify things a little because we
don't need to close() the fd on the failure paths.

Link: https://bugs.passt.top/show_bug.cgi?id=69
Link: https://github.com/containers/podman/issues/19428
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:18:06 +02:00
David Gibson
6920adda0d util: Make ns_enter() a void function and report setns() errors
ns_enter() returns an integer... but it's always zero.  If we actually fail
the function doesn't return.  Therefore it makes more sense for this to be
a function returning void, and we can remove the cases where we pointlessly
checked its return value.

In addition ns_enter() is usually called from an ephemeral thread created
by NS_CALL().  That means that the exit(EXIT_FAILURE) there usually won't
be reported (since NS_CALL() doesn't wait() for the thread).  So, use die()
instead to print out some information in the unlikely event that our
setns() here does fail.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:18:02 +02:00
David Gibson
e01759e2fa tap: Explicitly drop IPv4 fragments, and give a warning
We don't handle defragmentation of IP packets coming from the tap side,
and we're unlikely to any time soon (with our large MTU, it's not useful
for practical use cases).  Currently, however, we simply ignore the
fragmentation flags and treat fragments as though they were whole IP
packets.  This isn't ideal and can lead to rather cryptic behaviour if we
do receive IP fragments.

Change the code to explicitly drop fragmented packets, and print a rate
limited warning if we do encounter them.

Link: https://bugs.passt.top/show_bug.cgi?id=62
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-07-07 19:07:12 +02:00
Stefano Brivio
d072ac2434 tap: With pasta, don't reset on tap errors, handle write failures
Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), it makes no sense to close and
reopen the tap device on error: we don't have access to /dev/net/tun
after the initial setup phase.

If we hit ENOBUFS while writing (as reported: in one case because
the kernel actually ran out of memory, with another case under
investigation), or ENOSPC, we're supposed to drop whatever data we
were trying to send: there's no room for it.

Handle EINTR just like we handled EAGAIN/EWOULDBLOCK: there's no
particular reason why sending the same data should fail again.

Anything else I can think of would be an unrecoverable error: exit
with failure then.

While at it, drop a useless cast on the write() call: it takes a
const void * anyway.

Reported-by: Gianluca Stivan <me@yawnt.com>
Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu>
Fixes: 0515adceaa ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-06-23 10:15:10 +02:00
David Gibson
25f1d1a84f tap: Don't update ip6.addr_seen to ::
When we receive packets from the tap side, we update the addr_seen fields
to reflect the last known address of the guest or ns.  For ip4.addr_seen
we, sensibly, only update if the address we've just seen isn't 0 (0.0.0.0).
This case can occur during early DHCP transactions.

We have no equivalent case for IPv6.  We're less likely to hit this,
because DHCPv6 uses link-local addresses, however we can see an source
address of :: with certain multicast operations.  This can bite us if we
try to make an incoming connection very early after starting pasta with
--config-net: we may have only seen some of those multicast packets,
updated addr_seen to :: and not had any "real" packets to update it to a
global address.  I've seen this with some of the avocado test conversions.

In any case, it can never make sense to update addr_seen to ::, so
explicitly exclude that case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-05-17 18:50:34 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
David Gibson
34ade90957 Work around weird false positives with cppcheck-2.9.1
Commit 89e38f55 "treewide: Fix header includes to build with musl" added
extra #includes to work with musl.  Unfortunately with the cppcheck version
I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false
positives: specifically cppcheck seems to hit a #error in <bits/unistd.h>
complaining about including it directly instead of via <unistd.h> (which is
not something we're doing).

I have no idea why that would be happening; but I'm guessing it has to be
a bug in the cpp implementation in that cppcheck version.  In any case,
it's possible to work around this by moving the include of <unistd.h>
before the include of <signal.h>.  So, do that.

Fixes: 89e38f5540 ("treewide: Fix header includes to build with musl")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:38:06 +01:00
Chris Kuhn
89e38f5540 treewide: Fix header includes to build with musl
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.

Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.

Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
d2df763232 log, conf, tap: Define die() as err() plus exit(), drop cppcheck workarounds
If we define die() as a variadic macro, passing __VA_ARGS__ to err(),
and calling exit() outside err() itself, we can drop the workarounds
introduced in commit 36f0199f6e ("conf, tap: Silence two false
positive invalidFunctionArg from cppcheck").

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-27 18:55:57 +01:00
Stefano Brivio
36f0199f6e conf, tap: Silence two false positive invalidFunctionArg from cppcheck
The newly introduced die() calls exit(), but cppcheck doesn't see it
and warns about possibly invalid arguments used after the check which
triggers die(). Add return statements to silence the warnings.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 19:19:23 +01:00
David Gibson
42bfd212b1 tap: Eliminate goto from tap_handler()
The goto here really doesn't improve clarity or brevity at all.  Use a
clearer construct.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 18:56:31 +01:00
David Gibson
b62ed9ca0e tap: Don't pcap frames that didn't get sent
In tap_send_frames() we send a number of frames to the tap device, then
also write them to the pcap capture file (if configured).  However the tap
send can partially fail (short write()s or similar), meaning that some
of the requested frames weren't actually sent, but we still write those
frames to the capture file.

We do give a debug message in this case, but it's misleading to add frames
that we know weren't sent to the capture file.  Rework to avoid this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 18:56:20 +01:00
Stefano Brivio
7564b58a7f tap: Use single counter for iov elements in tap_send_frames_pasta()
David points out that using multiple counters to go over the iov
array, namely 'i' and 'iov', makes mistakes easier. We can't just use
'iov', unless we reserve an element with zero iov_len at the end,
which isn't really justified.

Simply use 'i' to iterate over the array.

Link: https://archives.passt.top/passt-dev/Y+mfenvLn3VJ7Dg5@yekko/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-16 17:33:54 +01:00
Laine Stump
c9af6f92db convert all remaining err() followed by exit() to die()
This actually leaves us with 0 uses of err(), but someone could want
to use it in the future, so we may as well leave it around.

Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-16 17:32:27 +01:00
Stefano Brivio
ac153595c0 tap: Send frames after the first one in tap_send_frames_pasta()
...instead of repeatedly sending out the first one in iov.

Fixes: e21ee41ac3 ("tcp: Combine two parts of pasta tap send path together")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 14:34:38 +01:00
David Gibson
2d553b587a tap: Improve handling of partial frame sends
In passt mode, when writing frames to the qemu socket, we might get a short
send.  If we ignored this and carried on, the qemu socket would get out of
sync, because the bytes we actually sent wouldn't correspond  to the length
header we already sent.  tap_send_frames_passt() handles that by doing a
a blocking send to complete the message, but it has a few flaws:
 * We only attempt to resend once: although it's unlikely in practice,
   nothing prevents the blocking send() from also being short
 * We print a debug error if send() returns non-zero.. but send() returns
   the number of bytes sent, so we actually want it to return the length
   of the remaining data.

Correct those flaws and also be a bit more thorough about reporting
problems here.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:55:01 +01:00
David Gibson
0fb7b2b908 tap: Use different io vector bases depending on tap type
Currently tap_send_frames() expects the frames it is given to include the
vnet_len field, even in pasta mode which doesn't use it (although it need
not be initialized in that case).  To match, tap_iov_base() and
tap_iov_len() construct the frame in that way.

This will inconvenience future changes, so alter things to set the buffers
to include just the frame needed by the tap backend type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:57 +01:00
David Gibson
4b3d38a069 tap: Add "tap headers" abstraction
Currently both the TCP and UDP code need to deal in various places with the
details of the L2 headers, and also the tap-specific "vnet_len" header.
This makes abstracting the tap interface to new backends (e.g. vhost-user
or tun) more difficult.

To improve this abstraction, create a new 'tap_hdr' structure which
represents both L2 (always Ethernet at the moment, but might be vary in
future) and any additional tap specific headers (such as the qemu socket's
vnet_len field).  Provide helper functions and macros to initialize, update
and use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:52 +01:00
David Gibson
6d011c1faa tap, tcp: Move tap send path to tap.c
The functions which do the final steps of sending TCP packets on through
the tap interface - tcp_l2_buf_flush*() - no longer have anything that's
actually specific to TCP in them, other than comments and names.  Move them
all to tap.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:40 +01:00
Richard W.M. Jones
190169c544 passt, tap: Process data on the socket before HUP/ERR events
In the case where the client writes a packet and then closes the
socket, because we receive EPOLLIN|EPOLLRDHUP together we have a
choice of whether to close the socket immediately, or read the packet
and then close the socket.  Choose the latter.

This should improve fuzzing coverage and arguably is a better choice
even for regular use since dropping packets on close is bad.

See-also: https://archives.passt.top/passt-dev/20221117171805.3746f53a@elisabeth/
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:40:57 +01:00
Richard W.M. Jones
6b4e68383c passt, tap: Add --fd option
This passes a fully connected stream socket to passt.

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
[sbrivio: reuse fd_tap instead of adding a new descriptor,
 imply --one-off on --fd, add to optstring and usage()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:40:47 +01:00
David Gibson
698e4fd761 style: Minor corrections to function comments
Some style issues and a typo.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:26 +01:00
Stefano Brivio
25dab96205 tap: Revert recently added checks in tap_handler_passt()
This reverts commit 198f87835d ("tap: Return -EIO from
tap_handler_passt() on inconsistent packet stream") and commit
510dace86c ("tap: Keep stream consistent if qemu length descriptor
spans two recv() calls").

I can hit occasional failures in perf/passt_tcp tests where we seem
to be getting excess data at the end of a recv(), and for some reason
I couldn't figure out yet, if we just ignore it, subsequent recv()
calls from qemu return correct data. If we close the connection, qemu
can't talk to us anymore, of course.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 15:11:22 +01:00
Stefano Brivio
b27d6d121c arp, tap, util: Don't use perror() after seccomp filter is installed
If stderr is closed, after we fork to background, glibc's
implementation of perror() will try to re-open it by calling dup(),
upon which the seccomp filter causes the process to terminate,
because dup() is not included in the list of allowed syscalls.

Replace perror() calls that might happen after isolation_postfork().
We could probably replace all of them, but early ones need a bit more
attention as we have to check whether log.c functions work in early
stages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:11:13 +01:00
Stefano Brivio
198f87835d tap: Return -EIO from tap_handler_passt() on inconsistent packet stream
While it's important to fail in that case, it makes little sense to
fail quietly: it's better to tell qemu explicitly that something went
wrong and that we won't recover, by closing the socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
Stefano Brivio
510dace86c tap: Keep stream consistent if qemu length descriptor spans two recv() calls
I got all paranoid after triggering a divide-by-zero general
protection fault in passt with a qemu version without the virtio_net
TX hang fix, while flooding UDP. I start thinking this was actually
coming from some random changes I was playing with, but before
reaching this conclusion I reviewed once more the relatively short
path in tap_handler_passt() before we start using packet_*()
functions, and found this.

Never observed in practice, but artificially reproduced with changes
in qemu's socket interface: if we don't receive from qemu a complete
length descriptor in one recv() call, or if we receive a partial one
at the end of one call, we currently disregard the rest, which would
make the stream inconsistent.

Nothing really bad happens, except that from that point on we would
disregard all the packets we get until, if ever, we get the stream
back in sync by chance.

Force reading a complete packet length descriptor with a blocking
recv(), if needed -- not just a complete packet later.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
Stefano Brivio
11efaefa1e passt, qrap, README: Update notes and documentation for AF_UNIX support in qemu
We can't get rid of qrap quite yet, but at least we should start
telling users it's not going to be needed anymore starting from qemu
7.2.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
David Gibson
f7653a1446 Use endian-safer typing in struct tap4_l4_t
We recently converted to using struct in_addr rather than bare in_addr_t
or uint32_t to represent IPv4 addresses in network order.  This makes it
harder forget to apply the correct endian conversions.

We omitted the IPv4 addresses stored in struct tap4_l4_t, however.  Convert
those as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:26 +01:00
David Gibson
7c7b68dbe0 Use typing to reduce chances of IPv4 endianness errors
We recently corrected some errors handling the endianness of IPv4
addresses.  These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.

To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses.  This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.

We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values.  This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:24 +01:00
Stefano Brivio
2d4468ebb7 tap: Support for detection of existing sockets on ramfs
On ramfs, connecting to a non-existent UNIX domain socket yields
EACCESS, instead of ENOENT. This is visible if we use passt directly
on rootfs (a ramfs instance) from an initramfs image.

It's probably wrong for ramfs to return EACCES, but given the
simplicity of the filesystem, I doubt we should try to fix it there
at the possible cost of added complexity.

Also, this whole beauty should go away once qrap-less usage is
established, so just accept EACCES as indication that a conflicting
socket does not, in fact, exist.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-04 12:01:09 +01:00
Stefano Brivio
947d756747 tap: Trace received (outbound) ICMP packets in debug mode, too
This only worked for ICMPv6: ICMP packets have no TCP-style header,
so they are handled as a special case before packet sequences are
formed, and the call to tap_packet_debug() was missing.

Fixes: bb70811183 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-27 00:18:16 +02:00
David Gibson
c6845f60a0 dhcp: Use tap_udp4_send() helper in dhcp()
The IPv4 specific dhcp() manually constructs L2 and IP headers to send its
DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the
tap_udp6_send() helper.  Now that we've broaded the parameters to
tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:35:00 +02:00