So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs. That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address. Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.
This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.
We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams. We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.
In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works). So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets. Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table. In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.
This adds getsockname() to our seccomp profile for all variants.
Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow. However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow. Currently we handle this by simply discarding any
queued datagrams after the connect. UDP protocols should be able to handle
such packet loss, but it's not ideal.
We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow. We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket. It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.
We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have an asymmetry in how we handle UDP sockets. For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow. For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it). This has some disadvantages:
* We need a hash lookup for every datagram on the listening socket in
order to work out what flow it belongs to
* The dup() keeps the socket alive even if automatic forwarding removes
the listening socket. However, the epoll data remains the same
including containing the now stale original fd. This causes bug 103.
* We can't (easily) set flow-specific options on an initiating side
socket, because that could affect other flows as well
Alter the code to use a connect()ed socket on the initiating side as well
as the target side. There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket. We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow. For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.
Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.
We fix that in this commit.
Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_flow_new() open codes creating and connecting a socket to use
for reply messages. We have in mind some more places to use this logic,
plus it just makes for a rather large function. Split this handling out
into a new udp_flow_sock() function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For UDP packets from the tap interface (like TCP) we use a hash table to
look up which flow they belong to. Unlike TCP, we sometimes also create a
hash table entry for the socket side of UDP flows. We need that when we
receive a UDP packet from a "listening" socket which isn't specific to a
single flow.
At present we only do this for the initiating side of flows, which re-use
the listening socket. For the target side we use a connected "reply"
socket specific to the single flow.
We have in mind changes that maye introduce some edge cases were we could
receive UDP packets on a non flow specific socket more often. To allow for
those changes - and slightly simplifying things in the meantime - always
put both sides of a UDP flow - tap or socket - in the hash table. It's
not that costly, and means we always have the option of falling back to a
hash lookup.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno. We didn't
have those for our flow specific logging helpers, though. Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
anyway.
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set. There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):
$ mkdir /tmp/minidlna
$ cat conf
media_dir=/tmp/minidlna
db_dir=/tmp/minidlna
$ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
1.0327: Flow 0 (NEW): FREE -> NEW
1.0327: Flow 0 (INI): NEW -> INI
1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
1.0327: Flow 0 (FREE): INI -> FREE
1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900
This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.
And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.
Let multicast and broadcast packets through, refusing only to
establish flows with unspecified endpoint, as those would actually
cause havoc in the flow table.
IP-wise, SSDP notifications look like this (after this patch), inside
and outside:
$ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
$ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
2 0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
# tshark -i ens3 -a packets:1 multicast 2>/dev/null
1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
In udp_flow_new() we reject a flow if the endpoint isn't unicast, or it has
a zero endpoint port. Those conditions aren't strictly illegal, but we
can't safely handle them at present:
* Multicast UDP endpoints are certainly possible, but our current flow
tracking only makes sense for simple unicast flows - we'll need
different handling if we want to handle multicast flows in future
* It's not entirely clear if port 0 is RFC-ishly correct, but for socket
interfaces port 0 sometimes has a special meaning such as "pick the port
for me, kernel". That makes flows on port 0 unsafe to forward in the
usual way.
For the same reason we also can't safely handle port 0 as our port. In
principle that's also true for our address, however in the case of flows
initiated from a socket, we may not know our address since the socket
could be bound to 0.0.0.0 or ::, so we can only verify that our address
is unicast for flows initiated from the tap side.
Refine the current check in udp_flow_new() to slightly more detailed checks
in udp_flow_from_sock() and udp_flow_from_tap() to make what is and isn't
handled clearer. This makes this checking more similar to what we do for
TCP connections.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Unlike TCP, UDP has no in-band signalling for the end of a flow. So the
only way we remove flows is on a timer if they have no activity for 180s.
However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow. We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash). However it will still wait 3 minutes
to clear away the stale entry.
Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up. We also publish udp_flow_close() so it
can be called from other places to abort UDP flows().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For some reason, this is reported only with musl, and older glibc
versions (2.31, at least).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
No code change.
They need to be exported to be available by the vhost-user version of
passt.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>