We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.
Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host. However, we don't translate this address for cases where we NAT
between host and guest. This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address. The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.
Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry. We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.
* Get the "offender" address (source of the ICMP packet) using the
SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
* Parse the offender sockaddr using inany_from_sockaddr()
* Check explicitly that the source and destination pifs are what we
expect. Previously we checked something that was probably equivalent
in practice, but isn't strictly speaking what we require for the rest
of the code.
* Verify that for an ICMPv4 error we also have an IPv4 source/offender
and destination/endpoint address
* Verify that for an ICMPv6 error we have an IPv6 endpoint
* Improve debug reporting of any failures
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface. To do this it
currently requires the flow with which the error is associated as a
parameter. If that's missing it will clear the error condition, but not
propagate it.
That means that we largely ignore errors on "listening" sockets. It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created. In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket. If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.
This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.
Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow. So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions. We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.
Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr(). We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.
While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().
In fact, we do deal with this correctly, although the path is somewhat
non-obvious. The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().
Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events. This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow. This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.
At the moment, udp_sock_errs() might fail to flush all datagrams if errors
occur. In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).
We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as the occur, rather than trying to deal with all errors
in advance.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.
We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference. Therefore, make these functions take the socket
and flow from explicit parameters.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message. However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.
This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs(). However, it's possible an error
could be flagged after udp_sock_errs() but before we receive. So it's
better to handle this error "silently" (trace level only). We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.
In particular there's one situation that can trigger this case much more
easily. If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error. So, we'll get that error on the very first receive after the
connect(). That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg(). This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.
Fixes: 84ab1305fa ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c37 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets. This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.
However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams. udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.
Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.
Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b06 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs. That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address. Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.
This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.
We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams. We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.
In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works). So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets. Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table. In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.
This adds getsockname() to our seccomp profile for all variants.
Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address. Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.
We need this information to properly direct replies to flows which come in
to a non-default local address. So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr(). For now
we log a trace messages but don't do anything more with the information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow. However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow. Currently we handle this by simply discarding any
queued datagrams after the connect. UDP protocols should be able to handle
such packet loss, but it's not ideal.
We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow. We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_splice() prepare and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock(). Fold them both into
that caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now). We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference. So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket. It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.
We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Recent changes mean that this define is no longer used anywhere except in
udp.c. Move it back into udp.c from udp_internal.h.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure. The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams. This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.
This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap. It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.
Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.
This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is. We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A "listening" UDP socket can receive datagrams from multiple flows. So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.
However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path. Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data,
This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address. Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it. We have upcoming uses for it elsewhere so rename and move to udp.c.
While we're there, polish its error reporting a litle.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode. However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller. So
make the maximum number of frames a parameter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have an asymmetry in how we handle UDP sockets. For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow. For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it). This has some disadvantages:
* We need a hash lookup for every datagram on the listening socket in
order to work out what flow it belongs to
* The dup() keeps the socket alive even if automatic forwarding removes
the listening socket. However, the epoll data remains the same
including containing the now stale original fd. This causes bug 103.
* We can't (easily) set flow-specific options on an initiating side
socket, because that could affect other flows as well
Alter the code to use a connect()ed socket on the initiating side as well
as the target side. There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket. We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow. For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.
Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.
We fix that in this commit.
Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments
stating they use recvmmsg(). That's not correct, they only do so via
udp_sock_recv() which lists recvmmsg() itself.
In contrast udp_splice_send() and udp_tap_handler() both directly use
sendmmsg(), but only the latter lists it. Add it to the former as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since UDP has no built in knowledge of connections, the only way we
know when we're done with a UDP flow is a timeout with no activity.
To keep track of this struct udp_flow includes a timestamp to record
the last time we saw traffic on the flow.
For data from listening sockets and from tap, this is done implicitly via
udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly.
However, that logic is duplicated between the vhost-user and "buf" paths.
Make it common in udp_reply_sock_handler() instead.
Technically this is a behavioural change: previously if we got an EPOLLIN
event, but there wasn't actually any data we wouldn't update the timestamp,
now we will. This should be harmless: if there's an EPOLLIN we expect
there to be data, and even if there isn't the worst we can do is mildly
delay the cleanup of a stale flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We've already have a pointer to the UDP flow in variable uflow, we can just
re-use it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_send_conn_fail_icmp[46]() aren't actually specific to connections
failing: they can propagate a variety of ICMP errors, which might or might
not break a "connection". They are, however, specific to sending ICMP
errors to the tap connection, not splice or host. Rename them to better
reflect that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Recently we added support for detecting ICMP triggered errors on UDP
sockets and forwarding them to the tap interface. However, in
udp_sock_recverr() where this is handled we don't know for certain that
the tap interface is the other side of the UDP flow. It could be a spliced
connection with another socket on the other side.
To forward errors in that case, we'd need to force the other side's socket
to trigger issue an ICMP error. I'm not sure if there's a way to do that;
probably not for an arbitrary ICMP but it might be possible for certain
error conditions.
Nonetheless what we do now - synthesise an ICMP on the tap interface - is
certainly wrong. It's probably harmless; for a spliced connection it will
have loopback addresses meaning we can expect the guest to discard it.
But, correct this for now, by not attempting to propagate errors when the
other side of the flow is a socket.
Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
While developing traceroute forwarding tap-to-sock we found that
struct msghdr.msg_name for the ICMPs in the opposite direction always
contains the destination address of the original UDP message, and not,
as one might expect, the one of the host which created the error message.
Study of the kernel code reveals that this address instead is appended
as extra data after the received struct sock_extended_err area.
We now change the ICMP receive code accordingly.
Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In udp_reply_sock_handler() if we're unable to forward the datagrams we
just print an error. Generally this means we have an unsupported pair of
pifs in the flow table, though, and that hasn't change. So, next time we
get a matching packet we'll just get the same failure. In vhost-user mode
we don't even dequeue the incoming packets which triggered this so we're
likely to get the same failure immediately.
Instead, close the flow, in the same we we do for an unrecoverable error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Share some additional miscellaneous logic between the vhost-user and "buf"
paths for data on udp reply sockets. The biggest piece is error handling
of cases where we can't forward between the two pifs of the flow. We also
make common some more simple logic locating the correct flow and its
parameters.
This adds some lines of code due to extra comment lines, but nonetheless
reduces logic duplication.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN
events. However, unlike most epoll event handlers we don't check the
event bits right there. EPOLLERR is checked within udp_sock_errs() which
we call unconditionally. Checking EPOLLIN is still more buried: it is
checked within both udp_sock_recv() and udp_vu_sock_recv().
We can simplify the logic and pass less extraneous parameters around by
moving the checking of the event bits to the top level event handlers.
This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general
event handlers, but specific to EPOLLIN events, meaning new data. So,
rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better
reflect their function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The vhost-user and non-vhost-user paths for both udp_listen_sock_handler()
and udp_reply_sock_handler() are more or less completely separate. Both,
however, start with essentially the same invocation of udp_sock_errs(), so
that can be made common.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.
Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.
We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.
Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.
We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler(). However, we dereference uflow before that in
an initializer, rather defeating the point. Rearrange to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.
Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress. This case doesn't yet arise, but
it might cause us trouble in future.
Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic. This only makes sense if there is no specific
address given. We verify this at runtime, but it would be nicer if we
could enforce it structurally.
For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems. Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.
The new function takes the address as an inany *, with NULL indicating the
dual stack case. This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible. However udp_sock_init()
attempts to do this in some cases that can't work. Specifically we can
only do this when listening on any address. That's never true for the
ns (splicing) case, because we always listen on loopback. For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.
This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path. But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We can need not to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.
This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.
That said, in the two places this shows up, it's pretty easy to work
around. The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.
So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_errs() reads out everything in the socket error queue. However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.
One possibility is that the error is reported instead by the SO_ERROR
sockopt. Check for that case and report it as best we can. If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.
[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010
Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>