1
0
Fork 0
mirror of https://passt.top/passt synced 2025-05-30 04:45:34 +02:00
Commit graph

2036 commits

Author SHA1 Message Date
Laurent Vivier
3262c9b088 iov: Standardize function comment headers
Update function comment headers in iov.c to a consistent and
standardized format.

This change ensures:
- Comment blocks for functions consistently start with /**.
- Function names in the comment summary line include parentheses ().

This improves overall comment clarity and uniformity within the file.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:13 +02:00
Laurent Vivier
b915375a42 virtio: Correct and align comment headers
Standardize and fix issues in `virtio.c` and `virtio.h` comment headers.

Improvements include:
- Added `()` to function names in comment summaries.
- Added colons after parameter and enum member tags.
- Changed `/*` to `/**` for `virtq_avail_event()` comment.
- Fixed typos (e.g., "file"->"fill", "virqueue"->"virtqueue").
- Added missing `Return:` tag for `vu_queue_rewind()`.
- Corrected parameter names in `virtio.h` comments to match code.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:11 +02:00
Laurent Vivier
2fd0944f21 vhost_user: Correct and align function comment headers
This commit cleans up function comment headers in vhost_user.c to ensure
accuracy and consistency with the code. Changes include correcting
parameter names in comments and signatures (e.g., standardizing on vmsg
for vhost messages, fixing dev to vdev), updating function names in
comment descriptions, and removing/rectifying erroneous parameter
documentation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:08 +02:00
Laurent Vivier
2046976866 codespell: Correct typos in comments and error message
This commit addresses several spelling errors identified by the `codespell`
tool. The corrections apply to:
- Code comments in `fwd.c`, `ip.h`, `isolation.c`, and `log.c`.
- An error message string in `vhost_user.c`.

Specifically, the following misspellings were corrected:
- "adddress" to "address"
- "capabilites" to "capabilities"
- "Musn't" to "Mustn't"
- "calculatd" to "calculated"
- "Invalide" to "Invalid"

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:30 +02:00
Laurent Vivier
4234ace84c test: Display count of skipped tests in status and summary
This commit enhances test reporting by tracking and displaying the
number of skipped tests.

The skipped test count is now visible in the tmux status bar during
execution and included in the final test summary log. This provides
a more complete overview of test suite results.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:19 +02:00
Laurent Vivier
2d3d69c5c3 flow: Fix clang error (clang-analyzer-security.PointerSub)
Fixes the following clang-analyzer warning:

flow_table.h:96:25: note: Subtraction of two pointers that do not point into the same array is undefined behavior
   96 |         return (union flow *)f - flowtab;

The `flow_idx()` function is called via `FLOW_IDX()` from
`flow_foreach_slot()`, where `f` is set to `&flowtab[idx].f`.
Therefore, `f` and `flowtab` do point to the same array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:37 +02:00
Laurent Vivier
0f7bf10b0a ndp: Fix Clang analyzer warning (clang-analyzer-security.PointerSub)
Addresses Clang warning: "Subtraction of two pointers that do not
point into the same array is undefined behavior" for the line:
  `ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);`

Here, `ptr` is `&ra.var[0]`. The subtraction calculates the offset
of `var[0]` within the `struct ra_options ra`. Since `ptr` points
inside `ra`, this pointer arithmetic is well-defined for
calculating the size of the data to send, even if `ptr` and `&ra`
are not strictly considered part of the same "array" by the analyzer.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:35 +02:00
Laurent Vivier
a6b9832e49 virtio: Fix Clang warning (bugprone-sizeof-expression, cert-arr39-c)
In `virtqueue_read_indirect_desc()`, the pointer arithmetic involving
`desc` is intentional. We add the length in bytes (`read_len`)
divided by the size of `struct vring_desc` to `desc`, which is
an array of `struct vring_desc`. This correctly calculates the
offset in terms of the number of `struct vring_desc` elements.

Clang issues the following warning due to this explicit scaling:

virtio.c:238:8: error: suspicious usage of 'sizeof(...)' in pointer
arithmetic; this scaled value will be scaled again by the '+='
operator [bugprone-sizeof-expression,cert-arr39-c,-Werror]
  238 |         desc += read_len / sizeof(struct vring_desc);
      |               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~
virtio.c:238:8: note: '+=' in pointer arithmetic internally scales
with 'sizeof(struct vring_desc)' == 16

This behavior is intended, so the warning can be considered a
false positive in this context. The code correctly advances the
pointer by the desired number of descriptor entries.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:32 +02:00
Laurent Vivier
570e7b4454 dhcpv6: fix GCC error (unterminated-string-initialization)
The string STR_NOTONLINK is intentionally not NUL-terminated.
Ignore the GCC error using __attribute__((nonstring)).

This error is reported by GCC 15.1.1 on Fedora 42. However,
Clang 20.1.3 does not support __attribute__((nonstring)).
Therefore, NOLINTNEXTLINE(clang-diagnostic-unknown-attributes)
is also added to suppress Clang's unknown attribute warning.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:20 +02:00
Laurent Vivier
8ec134109e flow: close socket fd on error
In eea8a76caf ("flow: fix podman issue "), we unregister
the fd from epoll_ctl() in case of error, but we also need to close it.

As flowside_sock_l4() also calls sock_l4_sa() via flowside_sock_splice()
we can do it unconditionally.

Fixes: eea8a76caf ("flow: fix podman issue ")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-12 21:04:57 +02:00
Laurent Vivier
92d5d68013 flow: fix wrong macro name in comments
The maximum number of flow macro name is FLOW_MAX, not MAX_FLOW.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-08 13:24:14 +02:00
Laurent Vivier
eea8a76caf flow: fix podman issue
While running pasta, we trigger the following assert:

  ASSERTION FAILED in udp_at_sidx (udp_flow.c:35): flow->f.type == FLOW_UDP

in udp_at_sidx() in the following path:

 902 void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 903                       uint32_t events, const struct timespec *now)
 904 {
 905         struct udp_flow *uflow = udp_at_sidx(ref.flowside);

The invalid sidx is comming from the epoll_ref provided by epoll_wait().

This assert follows the following error:

  Couldn't connect flow socket: Permission denied

It appears that an error happens in udp_flow_sock() and the recently
created fd is not removed from the epoll_ctl() pool:

 71 static int udp_flow_sock(const struct ctx *c,
 72                          struct udp_flow *uflow, unsigned sidei)
 73 {
...
 82         s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 83         if (s < 0) {
 84                 flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 85                 return s;
 86         }
 87
 88         if (flowside_connect(c, s, pif, side) < 0) {
 89                 int rc = -errno;
 90                 flow_dbg_perror(uflow, "Couldn't connect flow socket");
 91                 return rc;
 92         }
...

flowside_sock_l4() calls sock_l4_sa() that adds 's' to the epoll_ctl()
pool.

So to cleanly manage the error of flowside_connect() we need to remove
's' from the epoll_ctl() pool using epoll_del().

Link: https://github.com/containers/podman/issues/26073
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-07 14:42:48 +02:00
Stefano Brivio
587980ca1e udp: Actually discard datagrams we can't forward
Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint
addresses for datagrams, if we can't forward one of these datagrams,
we need to make sure we actually discard it. Otherwise, with MSG_PEEK,
we won't dequeue and loop on it forever.

For example, if we fail to create a socket for a new flow, because,
say, the destination of an inbound packet is multicast, and we can't
bind() to a multicast address, the loop will look like this:

18.0563: Flow 0 (NEW): FREE -> NEW
18.0563: Flow 0 (INI): NEW -> INI
18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0563: Flow 0 (TGT): INI -> TGT
18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0563: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow
18.0564: Flow 0 (NEW): FREE -> NEW
18.0564: Flow 0 (INI): NEW -> INI
18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0564: Flow 0 (TGT): INI -> TGT
18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow

and seen from strace:

epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-03 10:21:20 +02:00
Emanuel Valasiadis
f0021f9e1d fwd: fix doc typo
Signed-off-by: Emanuel Valasiadis <emanuel@valasiadis.space>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-03 03:42:51 +02:00
Janne Grunau
93394f4ef0 selinux: Add getattr to class udp_socket
Commit 59cc89f ("udp, udp_flow: Track our specific address on socket
interfaces") added a getsockname() call in udp_flow_new(). This requires
getattr. Fixes "Flow 0 (UDP flow): Unable to determine local address:
Permission denied" errors in muvm/passt on Fedora Linux 42 with SELinux.

The SELinux audit message is

| type=AVC msg=audit(1746083799.606:235): avc:  denied  { getattr } for
|   pid=2961 comm="passt" laddr=127.0.0.1 lport=49221
|   faddr=127.0.0.53 fport=53
|   scontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tcontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tclass=udp_socket permissive=0

Fixes: 59cc89f4cc ("udp, udp_flow: Track our specific address on socket interfaces")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2363238
Signed-off-by: Janne Grunau <janne-psst@jannau.net>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 12:00:51 +02:00
Laurent Vivier
11be695f5c flow: fix podman issue
While running piHole using podman, traffic can trigger the following
assert:

ASSSERTION FAILED in flow_alloc (flow.c:521): flow->f.state == FLOW_STATE_FREE

Backtrace shows that this happens in flow_defer_handler():

      0x00005610d6f5b481 flow_alloc (passt + 0xb481)
      0x00005610d6f74f86 udp_flow_from_sock (passt + 0x24f86)
      0x00005610d6f737c3 udp_sock_fwd (passt + 0x237c3)
      0x00005610d6f74c07 udp_flush_flow (passt + 0x24c07)
      0x00005610d6f752c2 udp_flow_defer (passt + 0x252c2)
      0x00005610d6f5bce1 flow_defer_handler (passt + 0xbce1)

We are trying to allocate a new flow inside the loop freeing them.

Inside the loop free_head points to the first free flow entry in the
current cluster. But if we allocate a new entry during the loop,
free_head is not updated and can point now to the entry we have just
allocated.

We can fix the problem by spliting the loop in two parts:
- first part where we can close some of them and allocate some new
  flow entries,
- second part where we free the entries closed in the previous loop
  and we aggregate the free entries to merge consecutive the clusters.

Reported-by: Martin Rijntjes <bugs@air-global.nl>
Link: https://github.com/containers/podman/issues/25959
Fixes: 9725e79888 ("udp_flow: Don't discard packets that arrive between bind() and connect()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 11:58:25 +02:00
Stefano Brivio
6a96cd97a5 util: Fix typo, ASSSERTION -> ASSERTION
Fixes: 9153aca15b ("util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-02 11:58:10 +02:00
Stefano Brivio
ea0a1240df passt-repair: Hide bogus gcc warning from -Og
When building with gcc 13 and -Og, we get:

passt-repair.c: In function ‘main’:
passt-repair.c:161:23: warning: ‘ev’ may be used uninitialized [-Wmaybe-uninitialized]
  161 |                 if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
      |                     ~~^~~~~

but that can't actually happen, because we only exit the preceding
while loop if 'found' is true, and that only happens, in turn, as we
assign 'ev'.

Get rid of the warning by (redundantly) initialising ev to NULL.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-30 16:58:58 +02:00
Alyssa Ross
aa1cc89228 conf: allow --fd 0
inetd-style socket passing traditionally starts a service with a
connected socket on file descriptors 0 and 1.  passt disallowing
obtaining its socket from either of these descriptors made it
difficult to use with super-servers providing this interface — in my
case I wanted to use passt with s6-ipcserver[1].  Since (as far as I
can tell) passt does not use standard input for anything else (unlike
standard output), it should be safe to relax the restrictions on --fd
to allow setting it to 0, enabling this use case.

Link: https://skarnet.org/software/s6/s6-ipcserver.html [1]
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-28 14:01:17 +02:00
David Gibson
436afc3044 udp: Translate offender addresses for ICMP messages
We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.

Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host.  However, we don't translate this address for cases where we NAT
between host and guest.  This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address.  The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.

Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry.  We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:05 +02:00
David Gibson
08e617ec2b udp: Rework offender address handling in udp_sock_recverr()
Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.

 * Get the "offender" address (source of the ICMP packet) using the
   SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
 * Parse the offender sockaddr using inany_from_sockaddr()
 * Check explicitly that the source and destination pifs are what we
   expect.  Previously we checked something that was probably equivalent
   in practice, but isn't strictly speaking what we require for the rest
   of the code.
 * Verify that for an ICMPv4 error we also have an IPv4 source/offender
   and destination/endpoint address
 * Verify that for an ICMPv6 error we have an IPv6 endpoint
 * Improve debug reporting of any failures

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:03 +02:00
David Gibson
4668e91378 treewide: Improve robustness against sockaddrs of unexpected family
inany_from_sockaddr() expects a socket address of family AF_INET or
AF_INET6 and ASSERT()s if it gets anything else.  In many of the callers we
can handle an unexpected family more gracefully, though, e.g. by failing
a single flow rather than killing passt.

Change inany_from_sockaddr() to return an error instead of ASSERT()ing,
and handle those errors in the callers.  Improve the reporting of any such
errors while we're at it.

With this greater robustness, allow inany_from_sockaddr() to take a void *
rather than specifically a union sockaddr_inany *.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:00 +02:00
David Gibson
9128f6e8f4 fwd: Split out helpers for port-independent NAT
Currently the functions fwd_nat_from_*() make some address translations
based on both the IP address and protocol port numbers, and others based
only on the address.  We have some upcoming cases where it's useful to use
the IP-address-only translations separately, so split them out into helper
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:41:47 +02:00
David Gibson
2340bbf867 udp: Propagate errors on listening and brand new sockets
udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface.  To do this it
currently requires the flow with which the error is associated as a
parameter.  If that's missing it will clear the error condition, but not
propagate it.

That means that we largely ignore errors on "listening" sockets.  It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created.  In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket.  If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.

This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.

Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow.  So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:56:16 +02:00
David Gibson
cfc0ee145a udp: Minor re-organisation of udp_sock_recverr()
Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions.  We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.

Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:49:06 +02:00
David Gibson
f107a86cc0 udp: Add udp_pktinfo() helper
Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr().  We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.

While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:48:35 +02:00
David Gibson
04984578b0 udp: Deal with errors as we go in udp_sock_fwd()
When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().

In fact, we do deal with this correctly, although the path is somewhat
non-obvious.  The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().

Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events.  This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow.  This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.

At the moment, udp_sock_errs() might fail to flush all datagrams if errors
occur.  In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).

We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as the occur, rather than trying to deal with all errors
in advance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:19 +02:00
David Gibson
3f995586b3 udp: Pass socket & flow information direction to error handling functions
udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.

We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference.  Therefore, make these functions take the socket
and flow from explicit parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:09 +02:00
David Gibson
1bb8145c22 udp: Be quieter about errors on UDP receive
If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message.  However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.

This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs().  However, it's possible an error
could be flagged after udp_sock_errs() but before we receive.  So it's
better to handle this error "silently" (trace level only).  We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.

In particular there's one situation that can trigger this case much more
easily.  If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error.  So, we'll get that error on the very first receive after the
connect().  That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg().  This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.

Fixes: 84ab1305fa ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c37 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:56 +02:00
David Gibson
baf049f8e0 udp: Fix breakage of UDP error handling by PKTINFO support
We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets.  This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.

However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams.  udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.

Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b06 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:00 +02:00
Stefano Brivio
50249086a9 conf: Honour --dns-forward for local resolver even with --no-map-gw
If the first resolver listed in the host's /etc/resolv.conf is a
loopback address, and --no-map-gw is given, we automatically conclude
that the resolver is not reachable, discard it, and, if it's the only
nameserver listed in /etc/resolv.conf, we'll warn that we:

  Couldn't get any nameserver address

However, this isn't true in a general case: the user might have passed
--dns-forward, and in that case, while we won't map the address of the
default gateway to the host, we're still supposed to map that
particular address. Otherwise, in this common Podman usage:

  pasta --config-net --dns-forward 169.254.1.1 -t none -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-c02a8d8f-6ee3-902e-33c5-317e0f24e0af --map-guest-addr 169.254.1.2

and with a loopback address in /etc/resolv.conf, we'll unexpectedly
refuse to forward DNS queries:

  # nslookup passt.top 169.254.1.1
  ;; connection timed out; no servers could be reached

To fix this, make an exception for --dns-forward: if &c->ip4.dns_match
or &c->ip6.dns_match are set in add_dns_resolv4() / add_dns_resolv6(),
use that address as guest-facing resolver.

We already set 'dns_host' to the address we found in /etc/resolv.conf,
that's correct in this case and it makes us forward queries as
expected.

I'm not changing the man page as the current description of
--dns-forward is already consistent with the new behaviour: there's no
described way in which --no-map-gw should affect it.

Reported-by: Andrew Sayers <andrew-bugs.passt.top@pileofstuff.org>
Link: https://bugs.passt.top/show_bug.cgi?id=111
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:59 +02:00
Stefano Brivio
bbff3653d6 conf: Split add_dns_resolv() into separate IPv4 and IPv6 versions
Not really valuable by itself, but dropping one level of nested blocks
makes the next change more convenient.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:57 +02:00
David Gibson
59cc89f4cc udp, udp_flow: Track our specific address on socket interfaces
So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs.  That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address.  Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.

This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.

We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams.  We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.

In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works).  So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets.  Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table.  In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.

This adds getsockname() to our seccomp profile for all variants.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:16 +02:00
David Gibson
695c62396e inany: Improve ASSERT message for bad socket family
inany_from_sockaddr() can only handle sockaddrs of family AF_INET or
AF_INET6 and asserts if given something else.  I hit this assertion while
debugging something else, and wanted to see what the bad sockaddr family
was.  Now that we have ASSERT_WITH_MSG() its easy to add this information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:13 +02:00
David Gibson
f4b0dd8b06 udp: Use PKTINFO cmsgs to get destination address for received datagrams
Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address.  Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.

We need this information to properly direct replies to flows which come in
to a non-default local address.  So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr().  For now
we log a trace messages but don't do anything more with the information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:45:59 +02:00
David Gibson
6693fa1158 tcp_splice: Don't clobber errno before checking for EAGAIN
Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:27 +02:00
David Gibson
d3f33f3b8e tcp_splice: Don't double count bytes read on EINTR
In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:16 +02:00
Stefano Brivio
ffbef85e97 conf: Add missing return in conf_nat(), fix --map-guest-addr none
As reported by somebody on IRC:

  $ pasta --map-guest-addr none
  Invalid address to remap to host: none

that's because once we parsed "none", we try to parse it as an address
as well. But we already handled it, so stop once we're done.

Fixes: e813a4df7d ("conf: Allow address remapped to host to be configured")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:49 +02:00
Stefano Brivio
06ef64cdb7 udp_flow: Save 8 bytes in struct udp_flow on 64-bit architectures
Shuffle the fields just added by commits a7775e9550 ("udp: support
traceroute in direction tap-socket") and 9725e79888 ("udp_flow:
Don't discard packets that arrive between bind() and connect()").

On x86_64, as reported by pahole(1), before:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        _Bool                      closed:1;             /*    76: 0  1 */

        /* XXX 7 bits hole, try to pack */

        _Bool                      flush0;               /*    77     1 */
        _Bool                      flush1:1;             /*    78: 0  1 */

        /* XXX 7 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */
        uint8_t                    ttl[2];               /*    96     2 */

        /* size: 104, cachelines: 2, members: 7 */
        /* sum members: 95, holes: 1, sum holes: 1 */
        /* sum bitfield members: 2 bits, bit holes: 2, sum bit holes: 14 bits */
        /* padding: 6 */
        /* last cacheline: 40 bytes */
};

and after:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        uint8_t                    ttl[2];               /*    76     2 */
        _Bool                      closed:1;             /*    78: 0  1 */
        _Bool                      flush0:1;             /*    78: 1  1 */
        _Bool                      flush1:1;             /*    78: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */

        /* size: 96, cachelines: 2, members: 7 */
        /* sum members: 94, holes: 1, sum holes: 1 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 32 bytes */
};

It doesn't matter much because anyway the typical storage for struct
udp_flow is given by union flow:

union flow {
        struct flow_common         f;                  /*     0    76 */
        struct flow_free_cluster   free;               /*     0    84 */
        struct tcp_tap_conn        tcp;                /*     0   120 */
        struct tcp_splice_conn     tcp_splice;         /*     0   120 */
        struct icmp_ping_flow      ping;               /*     0    96 */
        struct udp_flow            udp;                /*     0    96 */
};

but it still improves data locality somewhat, so let me fix this up
now that commits are fresh.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:32 +02:00
David Gibson
9725e79888 udp_flow: Don't discard packets that arrive between bind() and connect()
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow.  However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow.  Currently we handle this by simply discarding any
queued datagrams after the connect.  UDP protocols should be able to handle
such packet loss, but it's not ideal.

We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow.  We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
9eb5406260 udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock
udp_splice() prepare and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock().  Fold them both into
that caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
bd6a41ee76 udp: Rework udp_listen_sock_data() into udp_sock_fwd()
udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now).  We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference.  So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
159beefa36 udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock()
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket.  It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.

We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fd844a90bc udp: Move UDP_MAX_FRAMES to udp.c
Recent changes mean that this define is no longer used anywhere except in
udp.c.  Move it back into udp.c from udp_internal.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fc6ee68ad3 udp: Merge vhost-user and "buf" listening socket paths
udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure.  The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams.  This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.

This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:52 +02:00
David Gibson
0304dd9c34 udp: Split spliced forwarding path from udp_buf_reply_sock_data()
udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap.  It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.

Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.

This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
 udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:41:32 +02:00
David Gibson
5221e177e1 udp: Parameterize number of datagrams handled by udp_*_reply_sock_data()
Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is.  We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:31:54 +02:00
David Gibson
3a0881dfd0 udp: Don't bother to batch datagrams from "listening" socket
A "listening" UDP socket can receive datagrams from multiple flows.  So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.

However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path.  Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data,

This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:30:17 +02:00
David Gibson
84ab1305fa udp: Polish udp_vu_sock_info() and remove from vu specific code
udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address.  Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it.  We have upcoming uses for it elsewhere so rename and move to udp.c.

While we're there, polish its error reporting a litle.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:29:23 +02:00
David Gibson
1d7bbb101a udp: Make udp_sock_recv() take max number of frames as a parameter
Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode.  However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller.  So
make the maximum number of frames a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:25:33 +02:00