The rationale behind the calculation of TAP_MSGS isn't necessarily obvious.
It's supposed to be the maximum number of packets that can fit in pkt_buf.
However, the calculation is wrong in several ways:
* It's based on ETH_ZLEN which isn't meaningful for virtual devices
* It always includes the qemu socket header which isn't used for pasta
* The size of pkt_buf isn't relevant for vhost-user
We've already made sure this is just a tuning parameter, not a hard limit.
Clarify what we're calculating here and why.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS). However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size. But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices. It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.
Further more, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends. In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.
If we exceed do TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad. The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.
To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary. This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The pcap header includes a value indicating how much of each frame is
captured. We always capture the entire frame, so we want to set this to
the maximum possible frame size. Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.
Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen. While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame. This define comes from the kernel, but it's badly named and
used confusingly.
First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths. It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).
Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers. In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers. In tap.c we're mostly
using it in the second way.
Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.
Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.
Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing. They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense. So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.
In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.
Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.
We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.
Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.
We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it. However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.
This can arise in practice with migration, in cases where some error means
we have to discard a connection. We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately. This change
ensures an RST will be sent if the guest tries to use the connection again.
A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of. It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers. Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs. We ignore the IPv6 flow label.
We have some future cases where we want to consider the the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels. Add the flow label to tap6_l4_t and group by it as well
as the other criteria. In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.
The guest should use the same flow label for all packets in a low, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The flow label is a 20-bit field in the IPv6 header. The length and
alignment make it awkward to pass around as is. Obviously, it can be
packed into a 32-bit integer though, and we do this in two places. We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Following discussions
in the Internet on this subject indicates that modern routers never
fragment packets, and that it isn't even supported in many cases.
Adding to this that incoming messages forwarded on the tap interface
never even pass through a router it seems safe to always set this flag.
This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.
When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.
To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.
Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000
Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
sleep 100 && \
sleep 10 && \
podman rm -fa -t0
done
And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
#0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
#1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
#2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#3 0x000055f31b78c50b n/a (n/a + 0x0)
#4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#5 0x000055f31b78d5a2 n/a (n/a + 0x0)
Pasta got killed in exit(), it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.
Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.
Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set. There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We need to initialize vhost-user structures with --fd too.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Merge code from tap_backend_init(), tap_sock_tun_init() and
tap_listen_handler() to set epoll_ref entry and to add it
to epollfd.
No functionality change
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.
To this end, if we don't find any usable host interface, instead of
exiting:
- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
as default gateway
- for IPv6, don't assign any address (forcibly disable DHCPv6), and
use the *first* link-local address we observe to represent the
guest/container. Advertise fe80::1 as default gateway
- use 'tap0' as default interface name for pasta
Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.
Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).
Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:
/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1139 | int fd = socket(AF_UNIX, SOCK_STREAM, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1158 | ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1413 | s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
188 | if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
| ^
| | SOCK_CLOEXEC
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.
This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:
/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
1087 | for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
| ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:
tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
1291 | rc = ioctl(fd, TUNSETIFF, &ifr);
| ^~~~~~~~~
We don't care about that overflow, so explicitly cast TUNSETIFF to
int.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):
/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
43 | snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
577 | snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
579 | snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 | pidval);
| ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
105 | snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
242 | snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
243 | snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
1155 | snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning
Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full. However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer. If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around. This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass. This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.
Along the way handle a couple of extra edge cases:
- Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
ports where those might not have the same value
- Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all
three cases it returns from the function. That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again. For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN. We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- Add structs for NA, RA, NS, MTU, prefix info, option header,
link-layer address, RDNSS, DNSSL and link-layer for RA message.
- Turn NA message from purely imperative, going byte by byte,
to declarative by filling it's struct.
- Turn part of RA message into declarative.
- Move packet_add() to be before the call of ndp() in tap6_handler()
if the protocol of the packet is ICMPv6.
- Add a pool of packets as an additional parameter to ndp().
- Check the size of NS packet with packet_get() before sending an NA
packet.
- Add documentation for the structs.
- Add an enum for NDP option types.
Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If a the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible on first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elswhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we realy should report
something for debugability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we start pasta with some ports forwarded, but no --config-net, say:
$ ./pasta -u 10001
and then use a local, non-loopback address to send traffic to that
port, say:
$ socat -u FILE:test UDP4:192.0.2.1:10001
pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.
By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
icmp_sock_handler() obtains the guest address from it's most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.
While at it, convert a few error messages from a scant perror style
to proper failure descriptions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
buf_size is set to sizeof(pkt_buf) by default. And it seems more correct
to provide the actual size of the buffer.
Later a buf_size of 0 will allow vhost-user mode to detect
guest memory buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(),
and tap4_handler() and tap6_handler() into tap_handler().
Create a generic tap_add_packet() to consolidate packet
addition logic and reduce code duplication.
The purpose is to ease the export of these functions to use
them with the vhost-user backend.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We globally disabled this, with a justification lumped together with
several checks about braces. They don't really go together, the others
are essentially a stylistic choice which doesn't match our style. Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.
We've only gotten away with the macros which trigger the warning, because
of other conventions its been unlikely to invoke them with anything other
than a simple identifier. Fix the macros, and enable the warning for the
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.
Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
We'll need to open and bind the socket a while before listening to it,
so split that into two different functions. No functional changes
intended.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.
Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.
Drop the -1 initalisation and the corresponding check.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It has nothing to do with tap_sock_unix_init(). It used to be there as
that function could be called multiple times per passt instance, but
it's not the case anymore.
This also takes care of the fact that, with --fd, we wouldn't set the
initial MAC address, so we would need to wait for the guest to send us
an ARP packet before we could exchange data.
Fixes: 6b4e68383c ("passt, tap: Add --fd option")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Richard W.M. Jones <rjones@redhat.com>