mirror of https://passt.top/passt synced 2025-05-27 11:55:59 +02:00

Compare commits


194 commits

Laurent Vivier
3262c9b088 iov: Standardize function comment headers
Update function comment headers in iov.c to a consistent and
standardized format.

This change ensures:
- Comment blocks for functions consistently start with /**.
- Function names in the comment summary line include parentheses ().

This improves overall comment clarity and uniformity within the file.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:13 +02:00
Laurent Vivier
b915375a42 virtio: Correct and align comment headers
Standardize and fix issues in `virtio.c` and `virtio.h` comment headers.

Improvements include:
- Added `()` to function names in comment summaries.
- Added colons after parameter and enum member tags.
- Changed `/*` to `/**` for `virtq_avail_event()` comment.
- Fixed typos (e.g., "file"->"fill", "virqueue"->"virtqueue").
- Added missing `Return:` tag for `vu_queue_rewind()`.
- Corrected parameter names in `virtio.h` comments to match code.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:11 +02:00
Laurent Vivier
2fd0944f21 vhost_user: Correct and align function comment headers
This commit cleans up function comment headers in vhost_user.c to ensure
accuracy and consistency with the code. Changes include correcting
parameter names in comments and signatures (e.g., standardizing on vmsg
for vhost messages, fixing dev to vdev), updating function names in
comment descriptions, and removing/rectifying erroneous parameter
documentation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:08 +02:00
Laurent Vivier
2046976866 codespell: Correct typos in comments and error message
This commit addresses several spelling errors identified by the `codespell`
tool. The corrections apply to:
- Code comments in `fwd.c`, `ip.h`, `isolation.c`, and `log.c`.
- An error message string in `vhost_user.c`.

Specifically, the following misspellings were corrected:
- "adddress" to "address"
- "capabilites" to "capabilities"
- "Musn't" to "Mustn't"
- "calculatd" to "calculated"
- "Invalide" to "Invalid"

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:30 +02:00
Laurent Vivier
4234ace84c test: Display count of skipped tests in status and summary
This commit enhances test reporting by tracking and displaying the
number of skipped tests.

The skipped test count is now visible in the tmux status bar during
execution and included in the final test summary log. This provides
a more complete overview of test suite results.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:19 +02:00
Laurent Vivier
2d3d69c5c3 flow: Fix clang error (clang-analyzer-security.PointerSub)
Fixes the following clang-analyzer warning:

flow_table.h:96:25: note: Subtraction of two pointers that do not point into the same array is undefined behavior
   96 |         return (union flow *)f - flowtab;

The `flow_idx()` function is called via `FLOW_IDX()` from
`flow_foreach_slot()`, where `f` is set to `&flowtab[idx].f`.
Therefore, `f` and `flowtab` do point to the same array.
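
For reference, a simplified sketch of the function in question (the full
definition lives in flow_table.h; this is illustrative rather than the
exact code):

  static inline unsigned flow_idx(const struct flow_common *f)
  {
          return (union flow *)f - flowtab;
  }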

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:37 +02:00
Laurent Vivier
0f7bf10b0a ndp: Fix Clang analyzer warning (clang-analyzer-security.PointerSub)
Addresses Clang warning: "Subtraction of two pointers that do not
point into the same array is undefined behavior" for the line:
  `ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);`

Here, `ptr` is `&ra.var[0]`. The subtraction calculates the offset
of `var[0]` within the `struct ra_options ra`. Since `ptr` points
inside `ra`, this pointer arithmetic is well-defined for
calculating the size of the data to send, even if `ptr` and `&ra`
are not strictly considered part of the same "array" by the analyzer.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:35 +02:00
Laurent Vivier
a6b9832e49 virtio: Fix Clang warning (bugprone-sizeof-expression, cert-arr39-c)
In `virtqueue_read_indirect_desc()`, the pointer arithmetic involving
`desc` is intentional. We add the length in bytes (`read_len`)
divided by the size of `struct vring_desc` to `desc`, which is
an array of `struct vring_desc`. This correctly calculates the
offset in terms of the number of `struct vring_desc` elements.

Clang issues the following warning due to this explicit scaling:

virtio.c:238:8: error: suspicious usage of 'sizeof(...)' in pointer
arithmetic; this scaled value will be scaled again by the '+='
operator [bugprone-sizeof-expression,cert-arr39-c,-Werror]
  238 |         desc += read_len / sizeof(struct vring_desc);
      |               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~
virtio.c:238:8: note: '+=' in pointer arithmetic internally scales
with 'sizeof(struct vring_desc)' == 16

This behavior is intended, so the warning can be considered a
false positive in this context. The code correctly advances the
pointer by the desired number of descriptor entries.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:32 +02:00
Laurent Vivier
570e7b4454 dhcpv6: fix GCC error (unterminated-string-initialization)
The string STR_NOTONLINK is intentionally not NUL-terminated.
Ignore the GCC error using __attribute__((nonstring)).

This error is reported by GCC 15.1.1 on Fedora 42. However,
Clang 20.1.3 does not support __attribute__((nonstring)).
Therefore, NOLINTNEXTLINE(clang-diagnostic-unknown-attributes)
is also added to suppress Clang's unknown attribute warning.
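
A minimal sketch of the resulting pattern; the identifier and initializer
below are illustrative, not the literal dhcpv6.c definition:

  /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
  static const char str_notonlink[9]
          __attribute__((nonstring)) = "NOTONLINK"; /* no NUL, on purpose */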

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:20 +02:00
Laurent Vivier
8ec134109e flow: close socket fd on error
In eea8a76caf ("flow: fix podman issue"), we unregister
the fd from epoll_ctl() in case of error, but we also need to close it.

As flowside_sock_l4() also calls sock_l4_sa() via flowside_sock_splice()
we can do it unconditionally.
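
A hedged sketch of the resulting error path (helper names as they appear
elsewhere in this log; exact signatures are assumptions):

  if (flowside_connect(c, s, pif, side) < 0) {
          int rc = -errno;

          flow_dbg_perror(uflow, "Couldn't connect flow socket");
          epoll_del(c, s);        /* added by eea8a76caf */
          close(s);               /* added by this commit */
          return rc;
  }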

Fixes: eea8a76caf ("flow: fix podman issue")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-12 21:04:57 +02:00
Laurent Vivier
92d5d68013 flow: fix wrong macro name in comments
The macro name for the maximum number of flows is FLOW_MAX, not MAX_FLOW.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-08 13:24:14 +02:00
Laurent Vivier
eea8a76caf flow: fix podman issue
While running pasta, we trigger the following assert:

  ASSERTION FAILED in udp_at_sidx (udp_flow.c:35): flow->f.type == FLOW_UDP

in udp_at_sidx() in the following path:

 902 void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 903                       uint32_t events, const struct timespec *now)
 904 {
 905         struct udp_flow *uflow = udp_at_sidx(ref.flowside);

The invalid sidx is coming from the epoll_ref provided by epoll_wait().

This assert follows the following error:

  Couldn't connect flow socket: Permission denied

It appears that an error happens in udp_flow_sock() and the recently
created fd is not removed from the epoll_ctl() pool:

 71 static int udp_flow_sock(const struct ctx *c,
 72                          struct udp_flow *uflow, unsigned sidei)
 73 {
...
 82         s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 83         if (s < 0) {
 84                 flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 85                 return s;
 86         }
 87
 88         if (flowside_connect(c, s, pif, side) < 0) {
 89                 int rc = -errno;
 90                 flow_dbg_perror(uflow, "Couldn't connect flow socket");
 91                 return rc;
 92         }
...

flowside_sock_l4() calls sock_l4_sa() that adds 's' to the epoll_ctl()
pool.

So to cleanly manage the error of flowside_connect() we need to remove
's' from the epoll_ctl() pool using epoll_del().

Link: https://github.com/containers/podman/issues/26073
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-07 14:42:48 +02:00
Stefano Brivio
587980ca1e udp: Actually discard datagrams we can't forward
Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint
addresses for datagrams, if we can't forward one of these datagrams,
we need to make sure we actually discard it. Otherwise, with MSG_PEEK,
we won't dequeue it and will loop on it forever.
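
A hedged sketch of the discard step (not the literal udp.c code): on
Linux, a zero-length recv() dequeues exactly one pending datagram, which
is enough to get past the one we can't forward:

  /* Drop the datagram we just peeked at but cannot forward */
  recv(s, NULL, 0, 0);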

For example, if we fail to create a socket for a new flow, because,
say, the destination of an inbound packet is multicast, and we can't
bind() to a multicast address, the loop will look like this:

18.0563: Flow 0 (NEW): FREE -> NEW
18.0563: Flow 0 (INI): NEW -> INI
18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0563: Flow 0 (TGT): INI -> TGT
18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0563: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow
18.0564: Flow 0 (NEW): FREE -> NEW
18.0564: Flow 0 (INI): NEW -> INI
18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0564: Flow 0 (TGT): INI -> TGT
18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow

and seen from strace:

epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-03 10:21:20 +02:00
Emanuel Valasiadis
f0021f9e1d fwd: fix doc typo
Signed-off-by: Emanuel Valasiadis <emanuel@valasiadis.space>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-03 03:42:51 +02:00
Janne Grunau
93394f4ef0 selinux: Add getattr to class udp_socket
Commit 59cc89f ("udp, udp_flow: Track our specific address on socket
interfaces") added a getsockname() call in udp_flow_new(). This requires
getattr. Fixes "Flow 0 (UDP flow): Unable to determine local address:
Permission denied" errors in muvm/passt on Fedora Linux 42 with SELinux.

The SELinux audit message is

| type=AVC msg=audit(1746083799.606:235): avc:  denied  { getattr } for
|   pid=2961 comm="passt" laddr=127.0.0.1 lport=49221
|   faddr=127.0.0.53 fport=53
|   scontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tcontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tclass=udp_socket permissive=0

Fixes: 59cc89f4cc ("udp, udp_flow: Track our specific address on socket interfaces")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2363238
Signed-off-by: Janne Grunau <janne-psst@jannau.net>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 12:00:51 +02:00
Laurent Vivier
11be695f5c flow: fix podman issue
While running piHole using podman, traffic can trigger the following
assert:

ASSSERTION FAILED in flow_alloc (flow.c:521): flow->f.state == FLOW_STATE_FREE

Backtrace shows that this happens in flow_defer_handler():

      0x00005610d6f5b481 flow_alloc (passt + 0xb481)
      0x00005610d6f74f86 udp_flow_from_sock (passt + 0x24f86)
      0x00005610d6f737c3 udp_sock_fwd (passt + 0x237c3)
      0x00005610d6f74c07 udp_flush_flow (passt + 0x24c07)
      0x00005610d6f752c2 udp_flow_defer (passt + 0x252c2)
      0x00005610d6f5bce1 flow_defer_handler (passt + 0xbce1)

We are trying to allocate a new flow inside the loop that frees them.

Inside the loop free_head points to the first free flow entry in the
current cluster. But if we allocate a new entry during the loop,
free_head is not updated and can point now to the entry we have just
allocated.

We can fix the problem by splitting the loop into two parts (see the
sketch below):
- a first part where we can close some of the flows and allocate some
  new flow entries,
- a second part where we free the entries closed in the first part and
  aggregate the free entries to merge consecutive clusters.
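
A rough sketch of that structure, using hypothetical helper names (the
real loop is flow_defer_handler() in flow.c):

  /* Hypothetical helpers, for illustration only */
  flow_foreach(flow) {            /* pass 1: run per-flow deferred handlers;
                                   * these may close entries and may also
                                   * allocate new ones                     */
          flow_type_defer_handler(c, flow);
  }

  flow_foreach(flow) {            /* pass 2: free what was closed and merge
                                   * consecutive free clusters; free_head
                                   * is only updated here                  */
          if (flow_is_closed(flow))
                  flow_free_and_merge(flow);
  }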

Reported-by: Martin Rijntjes <bugs@air-global.nl>
Link: https://github.com/containers/podman/issues/25959
Fixes: 9725e79888 ("udp_flow: Don't discard packets that arrive between bind() and connect()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 11:58:25 +02:00
Stefano Brivio
6a96cd97a5 util: Fix typo, ASSSERTION -> ASSERTION
Fixes: 9153aca15b ("util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-02 11:58:10 +02:00
Stefano Brivio
ea0a1240df passt-repair: Hide bogus gcc warning from -Og
When building with gcc 13 and -Og, we get:

passt-repair.c: In function ‘main’:
passt-repair.c:161:23: warning: ‘ev’ may be used uninitialized [-Wmaybe-uninitialized]
  161 |                 if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
      |                     ~~^~~~~

but that can't actually happen, because we only exit the preceding
while loop if 'found' is true, and that only happens, in turn, as we
assign 'ev'.

Get rid of the warning by (redundantly) initialising ev to NULL.
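
A sketch of the change (the declaration is illustrative, with the type
suggested by the warning above):

  struct inotify_event *ev = NULL; /* redundant: always assigned before use,
                                    * but silences -Wmaybe-uninitialized
                                    * with gcc 13 and -Og                  */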

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-30 16:58:58 +02:00
Alyssa Ross
aa1cc89228 conf: allow --fd 0
inetd-style socket passing traditionally starts a service with a
connected socket on file descriptors 0 and 1.  passt disallowing
obtaining its socket from either of these descriptors made it
difficult to use with super-servers providing this interface — in my
case I wanted to use passt with s6-ipcserver[1].  Since (as far as I
can tell) passt does not use standard input for anything else (unlike
standard output), it should be safe to relax the restrictions on --fd
to allow setting it to 0, enabling this use case.

Link: https://skarnet.org/software/s6/s6-ipcserver.html [1]
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-28 14:01:17 +02:00
David Gibson
436afc3044 udp: Translate offender addresses for ICMP messages
We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.

Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host.  However, we don't translate this address for cases where we NAT
between host and guest.  This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address.  The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.

Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry.  We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:05 +02:00
David Gibson
08e617ec2b udp: Rework offender address handling in udp_sock_recverr()
Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.

 * Get the "offender" address (source of the ICMP packet) using the
   SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
 * Parse the offender sockaddr using inany_from_sockaddr()
 * Check explicitly that the source and destination pifs are what we
   expect.  Previously we checked something that was probably equivalent
   in practice, but isn't strictly speaking what we require for the rest
   of the code.
 * Verify that for an ICMPv4 error we also have an IPv4 source/offender
   and destination/endpoint address
 * Verify that for an ICMPv6 error we have an IPv6 endpoint
 * Improve debug reporting of any failures

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:03 +02:00
David Gibson
4668e91378 treewide: Improve robustness against sockaddrs of unexpected family
inany_from_sockaddr() expects a socket address of family AF_INET or
AF_INET6 and ASSERT()s if it gets anything else.  In many of the callers we
can handle an unexpected family more gracefully, though, e.g. by failing
a single flow rather than killing passt.

Change inany_from_sockaddr() to return an error instead of ASSERT()ing,
and handle those errors in the callers.  Improve the reporting of any such
errors while we're at it.

With this greater robustness, allow inany_from_sockaddr() to take a void *
rather than specifically a union sockaddr_inany *.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:00 +02:00
David Gibson
9128f6e8f4 fwd: Split out helpers for port-independent NAT
Currently the functions fwd_nat_from_*() make some address translations
based on both the IP address and protocol port numbers, and others based
only on the address.  We have some upcoming cases where it's useful to use
the IP-address-only translations separately, so split them out into helper
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:41:47 +02:00
David Gibson
2340bbf867 udp: Propagate errors on listening and brand new sockets
udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface.  To do this it
currently requires the flow with which the error is associated as a
parameter.  If that's missing it will clear the error condition, but not
propagate it.

That means that we largely ignore errors on "listening" sockets.  It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created.  In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket.  If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.

This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.

Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow.  So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:56:16 +02:00
David Gibson
cfc0ee145a udp: Minor re-organisation of udp_sock_recverr()
Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions.  We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.

Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:49:06 +02:00
David Gibson
f107a86cc0 udp: Add udp_pktinfo() helper
Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr().  We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.

While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.
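
A hedged sketch of such a scan for the IPv4 case, using the standard cmsg
API (mh is the struct msghdr passed to recvmsg(); other names are
illustrative):

  struct cmsghdr *hdr;

  for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
          if (hdr->cmsg_level == IPPROTO_IP &&
              hdr->cmsg_type == IP_PKTINFO) {
                  const struct in_pktinfo *info = (void *)CMSG_DATA(hdr);

                  /* info->ipi_addr is the local (destination) address */
                  break;
          }
  }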

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:48:35 +02:00
David Gibson
04984578b0 udp: Deal with errors as we go in udp_sock_fwd()
When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().

In fact, we do deal with this correctly, although the path is somewhat
non-obvious.  The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().

Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events.  This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow.  This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.

At the moment, udp_sock_fwd() might fail to flush all datagrams if errors
occur.  In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).

We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as they occur, rather than trying to deal with all errors
in advance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:19 +02:00
David Gibson
3f995586b3 udp: Pass socket & flow information direction to error handling functions
udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.

We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference.  Therefore, make these functions take the socket
and flow from explicit parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:09 +02:00
David Gibson
1bb8145c22 udp: Be quieter about errors on UDP receive
If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message.  However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.

This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs().  However, it's possible an error
could be flagged after udp_sock_errs() but before we receive.  So it's
better to handle this error "silently" (trace level only).  We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.

In particular there's one situation that can trigger this case much more
easily.  If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error.  So, we'll get that error on the very first receive after the
connect().  That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg().  This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.

Fixes: 84ab1305fa ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c37 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:56 +02:00
David Gibson
baf049f8e0 udp: Fix breakage of UDP error handling by PKTINFO support
We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets.  This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.

However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams.  udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.

Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.
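
A hedged sketch of the control buffer sizing, with room for both the
extended error (plus the appended offender sockaddr) and the PKTINFO
data; the exact declaration in udp.c may differ:

  char buf[CMSG_SPACE(sizeof(struct sock_extended_err) +
                      sizeof(struct sockaddr_in6)) +
           CMSG_SPACE(sizeof(struct in6_pktinfo))];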

Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b06 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:00 +02:00
Stefano Brivio
50249086a9 conf: Honour --dns-forward for local resolver even with --no-map-gw
If the first resolver listed in the host's /etc/resolv.conf is a
loopback address, and --no-map-gw is given, we automatically conclude
that the resolver is not reachable, discard it, and, if it's the only
nameserver listed in /etc/resolv.conf, we'll warn that we:

  Couldn't get any nameserver address

However, this isn't true in a general case: the user might have passed
--dns-forward, and in that case, while we won't map the address of the
default gateway to the host, we're still supposed to map that
particular address. Otherwise, in this common Podman usage:

  pasta --config-net --dns-forward 169.254.1.1 -t none -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-c02a8d8f-6ee3-902e-33c5-317e0f24e0af --map-guest-addr 169.254.1.2

and with a loopback address in /etc/resolv.conf, we'll unexpectedly
refuse to forward DNS queries:

  # nslookup passt.top 169.254.1.1
  ;; connection timed out; no servers could be reached

To fix this, make an exception for --dns-forward: if &c->ip4.dns_match
or &c->ip6.dns_match are set in add_dns_resolv4() / add_dns_resolv6(),
use that address as the guest-facing resolver.

We already set 'dns_host' to the address we found in /etc/resolv.conf,
that's correct in this case and it makes us forward queries as
expected.

I'm not changing the man page as the current description of
--dns-forward is already consistent with the new behaviour: there's no
described way in which --no-map-gw should affect it.

Reported-by: Andrew Sayers <andrew-bugs.passt.top@pileofstuff.org>
Link: https://bugs.passt.top/show_bug.cgi?id=111
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:59 +02:00
Stefano Brivio
bbff3653d6 conf: Split add_dns_resolv() into separate IPv4 and IPv6 versions
Not really valuable by itself, but dropping one level of nested blocks
makes the next change more convenient.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:57 +02:00
David Gibson
59cc89f4cc udp, udp_flow: Track our specific address on socket interfaces
So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs.  That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address.  Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.

This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.

We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams.  We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.

In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works).  So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets.  Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table.  In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.
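
A hedged sketch of that step (types and the exact error message are
assumptions based on this log):

  union sockaddr_inany sa;
  socklen_t sl = sizeof(sa);

  if (getsockname(s, &sa.sa, &sl) < 0)
          flow_dbg_perror(uflow, "Unable to determine local address");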

This adds getsockname() to our seccomp profile for all variants.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:16 +02:00
David Gibson
695c62396e inany: Improve ASSERT message for bad socket family
inany_from_sockaddr() can only handle sockaddrs of family AF_INET or
AF_INET6 and asserts if given something else.  I hit this assertion while
debugging something else, and wanted to see what the bad sockaddr family
was.  Now that we have ASSERT_WITH_MSG() its easy to add this information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:13 +02:00
David Gibson
f4b0dd8b06 udp: Use PKTINFO cmsgs to get destination address for received datagrams
Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address.  Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.

We need this information to properly direct replies to flows which come in
to a non-default local address.  So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr().  For now
we log a trace message but don't do anything more with the information.
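
A sketch of enabling those control messages (standard socket options,
not the literal passt code):

  int on = 1;

  setsockopt(s, IPPROTO_IP,   IP_PKTINFO,       &on, sizeof(on));
  setsockopt(s, IPPROTO_IPV6, IPV6_RECVPKTINFO, &on, sizeof(on));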

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:45:59 +02:00
David Gibson
6693fa1158 tcp_splice: Don't clobber errno before checking for EAGAIN
Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.
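
A hedged sketch of the pattern after the fix (illustrative, not the
literal tcp_splice.c code): errno is checked right at the splice() call,
before flow_trace() or anything else can change it:

  n = splice(from, NULL, pipe_w, NULL, len,
             SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
  if (n < 0) {
          if (errno == EAGAIN)
                  goto out;               /* nothing more to read for now */
          goto close;
  }
  flow_trace(conn, "%zi spliced from socket", n); /* may clobber errno */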

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:27 +02:00
David Gibson
d3f33f3b8e tcp_splice: Don't double count bytes read on EINTR
In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.
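
A hedged sketch of the reworked pattern (simplified, with illustrative
names): EINTR is retried immediately at the call site, and the byte
counters are only updated afterwards:

  do
          n = splice(pipe_r, NULL, to, NULL, len, SPLICE_F_MOVE);
  while (n < 0 && errno == EINTR);

  if (n < 0)
          goto handle_error;

  conn->written[side] += n;       /* counters advance only on success */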

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:16 +02:00
Stefano Brivio
ffbef85e97 conf: Add missing return in conf_nat(), fix --map-guest-addr none
As reported by somebody on IRC:

  $ pasta --map-guest-addr none
  Invalid address to remap to host: none

that's because once we parsed "none", we try to parse it as an address
as well. But we already handled it, so stop once we're done.

Fixes: e813a4df7d ("conf: Allow address remapped to host to be configured")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:49 +02:00
Stefano Brivio
06ef64cdb7 udp_flow: Save 8 bytes in struct udp_flow on 64-bit architectures
Shuffle the fields just added by commits a7775e9550 ("udp: support
traceroute in direction tap-socket") and 9725e79888 ("udp_flow:
Don't discard packets that arrive between bind() and connect()").

On x86_64, as reported by pahole(1), before:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        _Bool                      closed:1;             /*    76: 0  1 */

        /* XXX 7 bits hole, try to pack */

        _Bool                      flush0;               /*    77     1 */
        _Bool                      flush1:1;             /*    78: 0  1 */

        /* XXX 7 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */
        uint8_t                    ttl[2];               /*    96     2 */

        /* size: 104, cachelines: 2, members: 7 */
        /* sum members: 95, holes: 1, sum holes: 1 */
        /* sum bitfield members: 2 bits, bit holes: 2, sum bit holes: 14 bits */
        /* padding: 6 */
        /* last cacheline: 40 bytes */
};

and after:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        uint8_t                    ttl[2];               /*    76     2 */
        _Bool                      closed:1;             /*    78: 0  1 */
        _Bool                      flush0:1;             /*    78: 1  1 */
        _Bool                      flush1:1;             /*    78: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */

        /* size: 96, cachelines: 2, members: 7 */
        /* sum members: 94, holes: 1, sum holes: 1 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 32 bytes */
};

It doesn't matter much because anyway the typical storage for struct
udp_flow is given by union flow:

union flow {
        struct flow_common         f;                  /*     0    76 */
        struct flow_free_cluster   free;               /*     0    84 */
        struct tcp_tap_conn        tcp;                /*     0   120 */
        struct tcp_splice_conn     tcp_splice;         /*     0   120 */
        struct icmp_ping_flow      ping;               /*     0    96 */
        struct udp_flow            udp;                /*     0    96 */
};

but it still improves data locality somewhat, so let me fix this up
now that commits are fresh.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:32 +02:00
David Gibson
9725e79888 udp_flow: Don't discard packets that arrive between bind() and connect()
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow.  However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow.  Currently we handle this by simply discarding any
queued datagrams after the connect.  UDP protocols should be able to handle
such packet loss, but it's not ideal.

We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow.  We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
9eb5406260 udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock
udp_splice() prepare and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock().  Fold them both into
that caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
bd6a41ee76 udp: Rework udp_listen_sock_data() into udp_sock_fwd()
udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now).  We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference.  So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
159beefa36 udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock()
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket.  It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.

We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fd844a90bc udp: Move UDP_MAX_FRAMES to udp.c
Recent changes mean that this define is no longer used anywhere except in
udp.c.  Move it back into udp.c from udp_internal.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fc6ee68ad3 udp: Merge vhost-user and "buf" listening socket paths
udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure.  The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams.  This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.

This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:52 +02:00
David Gibson
0304dd9c34 udp: Split spliced forwarding path from udp_buf_reply_sock_data()
udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap.  It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.

Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.

This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
 udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:41:32 +02:00
David Gibson
5221e177e1 udp: Parameterize number of datagrams handled by udp_*_reply_sock_data()
Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is.  We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:31:54 +02:00
David Gibson
3a0881dfd0 udp: Don't bother to batch datagrams from "listening" socket
A "listening" UDP socket can receive datagrams from multiple flows.  So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.

However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path.  Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data,

This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:30:17 +02:00
David Gibson
84ab1305fa udp: Polish udp_vu_sock_info() and remove from vu specific code
udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address.  Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it.  We have upcoming uses for it elsewhere so rename and move to udp.c.

While we're there, polish its error reporting a little.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:29:23 +02:00
David Gibson
1d7bbb101a udp: Make udp_sock_recv() take max number of frames as a parameter
Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode.  However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller.  So
make the maximum number of frames a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:25:33 +02:00
David Gibson
d74b5a7c10 udp: Use connect()ed sockets for initiating side
Currently we have an asymmetry in how we handle UDP sockets.  For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow.  For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it).  This has some disadvantages:

 * We need a hash lookup for every datagram on the listening socket in
   order to work out what flow it belongs to
 * The dup() keeps the socket alive even if automatic forwarding removes
   the listening socket.  However, the epoll data remains the same
   including containing the now stale original fd.  This causes bug 103.
 * We can't (easily) set flow-specific options on an initiating side
   socket, because that could affect other flows as well

Alter the code to use a connect()ed socket on the initiating side as well
as the target side.  There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket.  We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow.  For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.

Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:24:36 +02:00
Jon Maloy
a7775e9550 udp: support traceroute in direction tap-socket
Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.

We fix that in this commit.

Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:23:35 +02:00
Stefano Brivio
06784d7fc6 passt-repair: Ensure that read buffer is NULL-terminated
After 3d41e4d838 ("passt-repair: Correct off-by-one error verifying
name"), Coverity Scan isn't convinced anymore about the fact that the
ev->name used in the snprintf() is NULL-terminated.

It comes from a read() call, and read() of course doesn't terminate
it, but we already check that the byte at ev->len - 1 is a NULL
terminator, so this is actually a false positive.

In any case, the logic ensuring that ev->name is NULL-terminated isn't
necessarily obvious, and additionally checking that the last byte in
the buffer we read is a NULL terminator is harmless, so do that
explicitly, even if it's redundant.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-05 11:28:23 +02:00
David Gibson
684870a766 udp: Correct some seccomp filter annotations
Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments
stating they use recvmmsg().  That's not correct, they only do so via
udp_sock_recv() which lists recvmmsg() itself.

In contrast udp_splice_send() and udp_tap_handler() both directly use
sendmmsg(), but only the latter lists it.  Add it to the former as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:29 +02:00
David Gibson
76e554d9ec udp: Simplify updates to UDP flow timestamp
Since UDP has no built in knowledge of connections, the only way we
know when we're done with a UDP flow is a timeout with no activity.
To keep track of this struct udp_flow includes a timestamp to record
the last time we saw traffic on the flow.

For data from listening sockets and from tap, this is done implicitly via
udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly.
However, that logic is duplicated between the vhost-user and "buf" paths.
Make it common in udp_reply_sock_handler() instead.

Technically this is a behavioural change: previously if we got an EPOLLIN
event but there wasn't actually any data, we wouldn't update the timestamp;
now we will.  This should be harmless: if there's an EPOLLIN we expect
there to be data, and even if there isn't the worst we can do is mildly
delay the cleanup of a stale flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:26 +02:00
David Gibson
8aa2d90c8d udp: Remove redundant udp_at_sidx() call in udp_tap_handler()
We already have a pointer to the UDP flow in the variable uflow, so we can
just re-use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:14 +02:00
David Gibson
3d41e4d838 passt-repair: Correct off-by-one error verifying name
passt-repair will generate an error if the name it gets from the kernel is
too long or not NUL terminated.  Downstream testing has reported
occasionally seeing this error in practice.

It turns out there is a trivial off-by-one error in the check: ev->len is
the length of the name, including terminating \0 characters, so to check
for a \0 at the end of the buffer we need to check ev->name[len - 1] not
ev->name[len].
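
The corrected check, as it also appears in the gcc warning quoted in
commit ea0a1240df above (ev->len includes the terminating \0):

  if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
          /* reject: name too long or not NUL-terminated */
  }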

Fixes: 42a854a52b ("pasta, passt-repair: Support multiple events per read() in inotify handlers")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:42 +02:00
David Gibson
dec3d73e1e migrate, tcp: bind() migrated sockets in repair mode
Currently on a migration target, we create then immediately bind() new
sockets for the TCP connections we're reconstructing.  Mostly, this works,
since a socket() that is bound but hasn't had listen() or connect() called
is essentially passive.  However, this bind() is subject to the usual
address conflict checking.  In particular that means if we already have
a listening socket on that port, we'll get an EADDRINUSE.  This will happen
for every connection we try to migrate that was initiated from outside to
the guest, since we necessarily created a listening socket for that case.

We set SO_REUSEADDR on the socket in an attempt to avoid this, but that's
not sufficient; even with SO_REUSEADDR address conflicts are still
prohibited for listening sockets.  Of course once these incoming sockets
are fully repaired and connect()ed they'll no longer conflict, but that
doesn't help us if we fail at the bind().

We can avoid this by not calling bind() until we're already in repair mode
which suppresses this transient conflict.  Because of the batching of
setting repair mode, to do that we need to move the bind to a step in
tcp_flow_migrate_target_ext().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:01 +02:00
David Gibson
6bfc60b095 platform requirements: Add test for address conflicts with TCP_REPAIR
Simple test program to check the behaviour we need for bind() address
conflicts between listening sockets and repair mode sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:59 +02:00
David Gibson
8e32881ef1 platform requirements: Add attributes to die() function
Add both format string and ((noreturn)) attributes to the version of die()
used in the test programs in doc/platform-requirements.  As well as
potentially catching problems in format strings, this means that the
compiler and static checkers can properly reason about the fact that it
will exit, preventing bogus warnings.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:57 +02:00
David Gibson
2ed2d59def platform requirements: Fix clang-tidy warning
Recent clang-tidy versions complain about enums defined with some but not
all entries given explicit values.  I'm not entirely convinced about
whether that's a useful warning, but in any case we really don't need the
explicit values in doc/platform-requirements/reuseaddr-priority.c, so
remove them to make clang happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:05 +02:00
David Gibson
3de5af6e41 udp: Improve name of UDP related ICMP sending functions
udp_send_conn_fail_icmp[46]() aren't actually specific to connections
failing: they can propagate a variety of ICMP errors, which might or might
not break a "connection".  They are, however, specific to sending ICMP
errors to the tap connection, not splice or host.  Rename them to better
reflect that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:26:11 +01:00
David Gibson
025a3c2686 udp: Don't attempt to forward ICMP socket errors to other sockets
Recently we added support for detecting ICMP triggered errors on UDP
sockets and forwarding them to the tap interface.  However, in
udp_sock_recverr() where this is handled we don't know for certain that
the tap interface is the other side of the UDP flow.  It could be a spliced
connection with another socket on the other side.

To forward errors in that case, we'd need to force the other side's socket
to issue an ICMP error.  I'm not sure if there's a way to do that;
probably not for an arbitrary ICMP but it might be possible for certain
error conditions.

Nonetheless what we do now - synthesise an ICMP on the tap interface - is
certainly wrong.  It's probably harmless; for a spliced connection it will
have loopback addresses meaning we can expect the guest to discard it.
But, correct this for now, by not attempting to propagate errors when the
other side of the flow is a socket.

Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:25:51 +01:00
Stefano Brivio
42a854a52b pasta, passt-repair: Support multiple events per read() in inotify handlers
The current code assumes that we'll get one event per read() on
inotify descriptors, but that's not the case, not from documentation,
and not from reports.

Add loops in the two inotify handlers we have, in pasta-specific code
and passt-repair, to go through all the events we receive.

Link: https://bugs.passt.top/show_bug.cgi?id=119
[dwg: Remove unnecessary buffer expansion, use strnlen instead of strlen
 to make Coverity happier]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Add additional check on ev->name and ev->len in passt-repair]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:24:44 +01:00
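A minimal sketch of the loop pattern described above, assuming a handler
woken by epoll on the inotify descriptor (names are illustrative, not the
actual pasta or passt-repair code): one read() may return several events,
so the buffer has to be walked event by event.

  #include <stdio.h>
  #include <string.h>
  #include <sys/inotify.h>
  #include <unistd.h>

  static void inotify_handler(int fd)
  {
          char buf[4096]
               __attribute__ ((aligned(__alignof__(struct inotify_event))));
          const struct inotify_event *ev;
          ssize_t n;
          char *p;

          n = read(fd, buf, sizeof(buf)); /* may contain multiple events */
          if (n <= 0)
                  return;

          for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
                  ev = (const struct inotify_event *)p;

                  if (ev->len && strnlen(ev->name, ev->len))
                          printf("event 0x%x on %s\n",
                                 (unsigned)ev->mask, ev->name);
          }
  }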
Jon Maloy
65cca54be8 udp: correct source address for ICMP messages
While developing traceroute forwarding tap-to-sock we found that
struct msghdr.msg_name for the ICMPs in the opposite direction always
contains the destination address of the original UDP message, and not,
as one might expect, the one of the host which created the error message.

Study of the kernel code reveals that this address instead is appended
as extra data after the received struct sock_extended_err area.

We now change the ICMP receive code accordingly.

Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-27 05:05:20 +01:00
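For reference, the spot the commit above describes is what
<linux/errqueue.h> exposes as SO_EE_OFFENDER(): the address of the node
that generated the error is placed right after struct sock_extended_err in
the control message. A hedged sketch for the IPv4 case, with illustrative
names:

  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>
  #include <stddef.h>

  /* Return the address of the host that originated an ICMP error, if any */
  static const struct sockaddr *icmp_origin(struct msghdr *mh)
  {
          struct cmsghdr *hdr;

          for (hdr = CMSG_FIRSTHDR(mh); hdr; hdr = CMSG_NXTHDR(mh, hdr)) {
                  struct sock_extended_err *ee;

                  if (hdr->cmsg_level != IPPROTO_IP ||
                      hdr->cmsg_type != IP_RECVERR)
                          continue;

                  ee = (struct sock_extended_err *)CMSG_DATA(hdr);
                  if (ee->ee_origin == SO_EE_ORIGIN_ICMP)
                          return SO_EE_OFFENDER(ee);  /* appended after ee */
          }

          return NULL;
  }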
Julian Wundrak
664c588be7 build: normalize arm targets
Linux distributions use different dumpmachine outputs for the ARM
architecture: arm, armv6l, armv7l.
For the syscall annotation, these variants are standardized to “arm”.

Link: https://bugs.passt.top/show_bug.cgi?id=117
Signed-off-by: Julian Wundrak <julian@wundrak.net>
[sbrivio: Fix typo: assign from TARGET_ARCH, not from TARGET]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:43:22 +01:00
David Gibson
77883fbdd1 udp: Add helper function for creating connected UDP socket
Currently udp_flow_new() open codes creating and connecting a socket to use
for reply messages.  We have in mind some more places to use this logic,
plus it just makes for a rather large function.  Split this handling out
into a new udp_flow_sock() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:34 +01:00
David Gibson
37d78c9ef3 udp: Always hash socket facing flowsides
For UDP packets from the tap interface (like TCP) we use a hash table to
look up which flow they belong to.  Unlike TCP, we sometimes also create a
hash table entry for the socket side of UDP flows.  We need that when we
receive a UDP packet from a "listening" socket which isn't specific to a
single flow.

At present we only do this for the initiating side of flows, which re-use
the listening socket.  For the target side we use a connected "reply"
socket specific to the single flow.

We have in mind changes that may introduce some edge cases where we could
receive UDP packets on a non flow specific socket more often.  To allow for
those changes - and slightly simplifying things in the meantime - always
put both sides of a UDP flow - tap or socket - in the hash table.  It's
not that costly, and means we always have the option of falling back to a
hash lookup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:32 +01:00
David Gibson
f67c488b81 udp: Better handling of failure to forward from reply socket
In udp_reply_sock_handler() if we're unable to forward the datagrams we
just print an error.  Generally this means we have an unsupported pair of
pifs in the flow table, though, and that hasn't changed.  So, next time we
get a matching packet we'll just get the same failure.  In vhost-user mode
we don't even dequeue the incoming packets which triggered this so we're
likely to get the same failure immediately.

Instead, close the flow, in the same way we do for an unrecoverable error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:30 +01:00
David Gibson
269cf6a12a udp: Share more logic between vu and non-vu reply socket paths
Share some additional miscellaneous logic between the vhost-user and "buf"
paths for data on udp reply sockets.  The biggest piece is error handling
of cases where we can't forward between the two pifs of the flow.  We also
make common some more simple logic locating the correct flow and its
parameters.

This adds some lines of code due to extra comment lines, but nonetheless
reduces logic duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:28 +01:00
David Gibson
d924b7dfc4 udp_vu: Factor things out of udp_vu_reply_sock_data() loop
At the start of every cycle of the loop in udp_vu_reply_sock_data() we:
 - ASSERT that uflow is not NULL
 - Check if the target pif is PIF_TAP
 - Initialize the v6 boolean

However, all of these depend only on the flow, which doesn't change across
the loop.  This is probably a duplication from udp_vu_listen_sock_data(),
where the flow can be different for each packet.  For the reply socket
case, however, factor that logic out of the loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:26 +01:00
David Gibson
5a977c2f4e udp: Simplify checking of epoll event bits
udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN
events.  However, unlike most epoll event handlers we don't check the
event bits right there.  EPOLLERR is checked within udp_sock_errs() which
we call unconditionally.  Checking EPOLLIN is still more buried: it is
checked within both udp_sock_recv() and udp_vu_sock_recv().

We can simplify the logic and pass fewer extraneous parameters around by
moving the checking of the event bits to the top level event handlers.

This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general
event handlers, but specific to EPOLLIN events, meaning new data.  So,
rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better
reflect their function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:23 +01:00
David Gibson
89b203b851 udp: Common invocation of udp_sock_errs() for vhost-user and "buf" paths
The vhost-user and non-vhost-user paths for both udp_listen_sock_handler()
and udp_reply_sock_handler() are more or less completely separate.  Both,
however, start with essentially the same invocation of udp_sock_errs(), so
that can be made common.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:11 +01:00
David Gibson
cf4d3f05c9 packet: Upgrade severity of most packet errors
All errors from packet_range_check(), packet_add() and packet_get() are
trace level.  However, these are for the most part actual error conditions.
They're states that should not happen, in many cases indicating a bug
in the caller or elsewhere.

We don't promote these to err() or ASSERT() level, for fear of a localised
bug on very specific input crashing the entire program, or flooding the
logs, but we can at least upgrade them to debug level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:30 +01:00
David Gibson
0857515c94 packet: ASSERT on signs of pool corruption
If packet_check_range() fails in packet_get_try_do() we just return NULL.
But this check only takes place after we've already validated the given
range against the packet it's in.  That means that if packet_check_range()
fails, the packet pool is already in a corrupted state (we should have
made strictly stronger checks when the packet was added).  Simply returning
NULL and logging a trace() level message isn't really adequate for that
situation; ASSERT instead.

Similarly we check the given idx against both p->count and p->size.  The
latter should be redundant, because count should always be <= size.  If
that's not the case then, again, the pool is already in a corrupted state
and we may have overwritten unknown memory.  Assert for this case too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:27 +01:00
David Gibson
9153aca15b util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
We already have the ASSERT() macro which will abort() passt based on a
condition.  It always has a fixed error message based on its location and
the asserted expression.  We have some upcoming cases where we want to
customise the message when hitting an assert.

Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:24 +01:00
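A rough sketch of the shape of these helpers (illustrative, not necessarily
the exact implementation added by the commit above): a printf-style,
noreturn function plus a macro that forwards the caller's message on
failure.

  #include <stdarg.h>
  #include <stdio.h>
  #include <stdlib.h>

  __attribute__((format(printf, 1, 2), noreturn))
  static void abort_with_msg(const char *fmt, ...)
  {
          va_list ap;

          va_start(ap, fmt);
          vfprintf(stderr, fmt, ap);
          va_end(ap);
          fputc('\n', stderr);
          abort();
  }

  #define ASSERT_WITH_MSG(expr, ...)                              \
          do {                                                    \
                  if (!(expr))                                    \
                          abort_with_msg(__VA_ARGS__);            \
          } while (0)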
David Gibson
38bcce9977 packet: Rework packet_get() versus packet_get_try()
Most failures of packet_get() indicate a serious problem, and log messages
accordingly.  However, a few callers expect failures here, because they're
probing for a certain range which might or might not be in a packet.  They
use packet_get_try() which passes a NULL func to packet_get_do() to
suppress the logging which is unwanted in this case.

However, this doesn't just suppress the log when packet_get_do() finds the
requested region isn't in the packet.  It suppresses logging for all other
errors too, which do indicate serious problems, even for the callers of
packet_get_try().  Worse it will pass the NULL func on to
packet_check_range() which doesn't expect it, meaning we'll get unhelpful
messages from there if there is a failure.

Fix this by making packet_get_try_do() the primary function which doesn't
log for the case of a range outside the packet.  packet_get_do() becomes a
trivial wrapper around that which logs a message if packet_get_try_do()
returns NULL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:22 +01:00
David Gibson
961aa6a0eb packet: Move checks against PACKET_MAX_LEN to packet_check_range()
Both the callers of packet_check_range() separately verify that the given
length does not exceed PACKET_MAX_LEN.  Fold that check into
packet_check_range() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:20 +01:00
David Gibson
37d9f374d9 packet: Avoid integer overflows in packet_get_do()
In packet_get_do() both offset and len are essentially untrusted.  We do
some validation of len (check it's < PACKET_MAX_LEN), but that's not enough
to ensure that (len + offset) doesn't overflow.  Rearrange our calculation
to make sure it's safe regardless of the given offset & len values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:18 +01:00
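The essence of the rearrangement, as a hedged sketch: both offset and len
come from untrusted input, so never form (offset + len), which can wrap;
compare each term against what remains instead.

  #include <stdbool.h>
  #include <stddef.h>

  /* true if [offset, offset + len) lies within a packet of length pkt_len,
   * without ever computing offset + len (which could overflow) */
  static bool range_in_packet(size_t pkt_len, size_t offset, size_t len)
  {
          return offset <= pkt_len && len <= pkt_len - offset;
  }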
David Gibson
c48331ca51 packet: Correct type of PACKET_MAX_LEN
PACKET_MAX_LEN is usually involved in calculations on size_t values - the
type of the iov_len field in struct iovec.  However, being defined bare as
UINT16_MAX, the compiler is likely to assign it a shorter type.  This can
lead to unexpected promotions (or lack thereof).  Add a cast to force the
type to be what we expect.

Fixes: c43972ad6 ("packet: Give explicit name to maximum packet size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:15 +01:00
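The fix boils down to something along these lines (illustrative):

  #include <stddef.h>
  #include <stdint.h>

  /* Force size_t so comparisons and arithmetic against iov_len values
   * don't go through surprising integer promotions */
  #define PACKET_MAX_LEN ((size_t)UINT16_MAX)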
David Gibson
9866d146e6 tap: Clarify calculation of TAP_MSGS
The rationale behind the calculation of TAP_MSGS isn't necessarily obvious.
It's supposed to be the maximum number of packets that can fit in pkt_buf.
However, the calculation is wrong in several ways:
 * It's based on ETH_ZLEN which isn't meaningful for virtual devices
 * It always includes the qemu socket header which isn't used for pasta
 * The size of pkt_buf isn't relevant for vhost-user

We've already made sure this is just a tuning parameter, not a hard limit.
Clarify what we're calculating here and why.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:12 +01:00
David Gibson
a41d6d125e tap: Make size of pool_tap[46] purely a tuning parameter
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Furthermore, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we do exceed TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:09 +01:00
David Gibson
e43e00719d packet: More cautious checks to avoid pointer arithmetic UB
packet_check_range() and vu_packet_check_range() verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to.  However, in doing so they don't avoid the possibility of
an integer overflow while performing pointer arithmetic, which is UB.  In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.

To do this safely, we can't calculate the end of a memory region with
pointer addition when the length is untrusted.  Instead we must work
out the offset of one memory region within another using pointer
subtraction, then do integer checks against the length of the outer region.
We then need to be careful about the order of checks so that those integer
checks can't themselves overflow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:06 +01:00
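A hedged sketch of the approach described in the last two paragraphs:
derive an offset with pointer subtraction (valid because the caller's
contract is that start points into the pool buffer), then perform the
remaining checks as integer comparisons ordered so they cannot overflow.
Names are illustrative:

  #include <stdbool.h>
  #include <stddef.h>

  static bool range_in_buf(const char *buf, size_t buf_size,
                           const char *start, size_t len)
  {
          size_t offset;

          if (start < buf)                /* before the buffer entirely */
                  return false;

          offset = (size_t)(start - buf); /* no end pointer is ever computed */
          if (offset > buf_size)
                  return false;

          return len <= buf_size - offset; /* can't overflow: offset <= buf_size */
  }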
David Gibson
4592719a74 vu_common: Tighten vu_packet_check_range()
This function verifies that the given packet is within the mmap()ed memory
region of the vhost-user device.  We can do better, however.  The packet
should be not only within the mmap()ed range, but specifically in the
subsection of that range set aside for shared buffers, which starts at
dev_region->mmap_offset within there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:32:50 +01:00
Stefano Brivio
32f6212551 Makefile: Enable -Wformat-security
It looks like an easy win to prevent a number of possible security
flaws.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:53 +01:00
Stefano Brivio
07c2d584b3 conf: Include libgen.h for basename(), fix build against musl
Fixes: 4b17d042c7 ("conf: Move mode detection into helper function")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:49 +01:00
Stefano Brivio
ebdd46367c tcp: Flush socket before checking for more data in active close state
Otherwise, if all the pending data is acknowledged:

- tcp_update_seqack_from_tap() updates the current tap-side ACK
  sequence (conn->seq_ack_from_tap)

- next, we compare the sequence we sent (conn->seq_to_tap) to the
  ACK sequence (conn->seq_ack_from_tap) in tcp_data_from_sock() to
  understand if there's more data we can send.

  If they match, we conclude that we haven't sent any of that data,
  and keep re-sending it.

We need, instead, to flush the socket (drop acknowledged data) before
calling tcp_update_seqack_from_tap(), so that once we update
conn->seq_ack_from_tap, we can be sure that all data up to that point is
gone from the socket.

Link: https://bugs.passt.top/show_bug.cgi?id=114
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 30f1e082c3 ("tcp: Keep updating window and checking for socket data after FIN from guest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:43 +01:00
David Gibson
c250ffc5c1 migrate: Bump migration version number
v1 of the migration stream format had some flaws: it didn't properly
handle endianness of the MSS field, and it didn't transfer the RFC7323
timestamp.  We've now fixed those bugs, but it requires incompatible
changes to the stream format.

Because of the timestamps in particular, v1 is not really usable, so there
is little point maintaining compatible support for it.  However, v1 is in
released packages, both upstream and downstream (RHEL at least).  Just
updating the stream format without bumping the version would lead to very
cryptic errors if anyone did attempt to migrate between an old and new
passt.

So, bump the migration version to v2, so we'll get a clear error message if
anyone attempts this.  We don't attempt to maintain backwards compatibility
with v1, however: we'll simply fail if given a v1 stream.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 17:17:18 +01:00
David Gibson
cfb3740568 migrate, tcp: Migrate RFC 7323 timestamp
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:27:27 +01:00
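For context, the kernel exposes the timestamp value of a socket in repair
mode through the TCP_TIMESTAMP socket option (the same mechanism CRIU
uses). A hedged sketch of a dump/restore pair, with illustrative names:

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>
  #include <stdint.h>

  static int tcp_dump_timestamp(int s, uint32_t *tsval)
  {
          socklen_t sl = sizeof(*tsval);

          return getsockopt(s, IPPROTO_TCP, TCP_TIMESTAMP, tsval, &sl);
  }

  static int tcp_restore_timestamp(int s, uint32_t tsval)
  {
          /* the socket must already be in TCP_REPAIR mode for this to work */
          return setsockopt(s, IPPROTO_TCP, TCP_TIMESTAMP,
                            &tsval, sizeof(tsval));
  }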
David Gibson
28772ee91a migrate, tcp: More careful marshalling of mss parameter during migration
During migration we extract the limit on segment size using TCP_MAXSEG,
and set it on the other side with TCP_REPAIR_OPTIONS.  However, unlike most
32-bit values we transfer we transfer it in native endian, not network
endian.  This is not correct; add it to the list of endian fixups we make.

In addition, while MAXSEG will be 32-bits in practice, and is given as such
to TCP_REPAIR_OPTIONS, the TCP_MAXSEG sockopt treats it as an 'int'.  It's
not strictly safe to pass a uint32_t to a getsockopt() expecting an int,
although we'll get away with it on most (maybe all) platforms.  Correct
this as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor coding style fix]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:25:12 +01:00
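A hedged sketch of the two points above (names are illustrative): read
TCP_MAXSEG into an int, since that's the type the sockopt works with, and
convert to network order before putting the value on the migration stream.

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>
  #include <stdint.h>

  static int tcp_get_mss_net(int s, uint32_t *mss_net)
  {
          int mss;                        /* TCP_MAXSEG expects an int */
          socklen_t sl = sizeof(mss);

          if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &sl))
                  return -1;

          *mss_net = htonl((uint32_t)mss); /* network order on the wire */
          return 0;
  }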
Stefano Brivio
51f3c071a7 passt-repair: Fix build with -Werror=format-security
Fixes: 0470170247 ("passt-repair: Add directory watch")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-18 17:18:47 +01:00
David Gibson
cb5b593563 tcp, flow: Better use flow specific logging helpers
A number of places in the TCP code use general logging functions, instead
of the flow specific ones.  That includes a few older ones as well as many
places in the new migration code.  Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.

Convert many of these to use the existing flow specific helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:40 +01:00
David Gibson
96fe5548cb conf: Unify several paths in conf_ports()
In conf_ports() we have three different paths which actually do the setup
of an individual forwarded port: one for the "all" case, one for the
exclusions only case and one for the range of ports with possible
exclusions case.

We can unify those cases using a new helper which handles a single range
of ports, with a bitmap of exclusions.  Although this is slightly longer
(largely due to the new helper's function comment), it reduces duplicated
logic.  It will also make future improvements to the tracking of port
forwards easier.

The new conf_ports_range_except() function has a pretty prodigious
parameter list, but I still think it's an overall improvement in conceptual
complexity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:23 +01:00
David Gibson
78f1f0fdfc test/perf: Simplify iperf3 server lifetime management
After we start the iperf3 server in the background, we have a sleep to
make sure it's ready to receive connections.  We can simplify this slightly
by using the -D option to have iperf3 background itself rather than
backgrounding it manually.  That won't return until the server is ready to
use.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
26df8a3608 conf: Limit maximum MTU based on backend frame size
The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers.  We currently limit it to ETH_MAX_MTU
which sounds like it makes sense.  But ETH_MAX_MTU is confusing: it's not
consistently used as to whether it means the maximum L3 datagram size or
the maximum L2 frame size.  Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.

Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header.  The latter can vary
depending on our tap backend, although it doesn't right now.

Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
9d1a6b3eba pcap: Correctly set snaplen based on tap backend type
The pcap header includes a value indicating how much of each frame is
captured.  We always capture the entire frame, so we want to set this to
the maximum possible frame size.  Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.

Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen.  While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
b6945e0553 Simplify sizing of pkt_buf
We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c4bfa3339c tap: Use explicit defines for maximum length of L2 frame
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame.  This define comes from the kernel, but it's badly named and
used confusingly.

First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths.  It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).

Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers.  In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers.  In tap.c we're mostly
using it in the second way.

Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.

Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
1eda8de438 packet: Remove redundant TAP_BUF_BYTES define
Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c43972ad67 packet: Give explicit name to maximum packet size
We verify that every packet we store in a pool (and every partial packet
we retrieve from it) has a length no longer than UINT16_MAX.  This
originated in the older packet pool implementation which stored packet
lengths in a uint16_t.  Now that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.

We may have reasons to (slightly) increase this limit in future, so in
preparation, give this quantity an explicit name - PACKET_MAX_LEN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
74cd82adc8 conf: Detect vhost-user mode earlier
We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option.  That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.

However, there are many other places in the main option parsing loop which
also rely on mode.  We get away with those, because they happen to be able
to treat passt and vhost-user modes identically.  This is potentially
confusing, though.  So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.

To match, we move the parsing of --repair-path back into the main option
parsing loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
4b17d042c7 conf: Move mode detection into helper function
One of the first things we need to do is determine if we're in passt mode
or pasta mode.  Currently this is open-coded in main(), by examining
argv[0].  We want to complexify this a bit in future to cover vhost-user
mode as well.  Prepare for this, by moving the mode detection into a new
conf_mode() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
bb00a0499f conf: Use the same optstring for passt and pasta modes
Currently we rely on detecting our mode first and use different sets of
(single character) options for each.  This means that if you use an option
valid in only one mode in another you'll get the generic usage() message.

We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
Stefano Brivio
c8b520c062 flow, repair: Wait for a short while for passt-repair to connect
...and time out after that. This will be needed because of an upcoming
change to passt-repair enabling it to start before passt is started,
on both source and target, by means of an inotify watch.

Once the inotify watch triggers, passt-repair will connect right away,
but we have no guarantees that the connection completes before we
start the migration process, so wait for it (for a reasonable amount
of time).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 23:08:33 +01:00
Stefano Brivio
0470170247 passt-repair: Add directory watch
It might not be feasible for users to start passt-repair after passt
is started, on a migration target, but before the migration process
starts.

For instance, with libvirt, the guest domain (and, hence, passt) is
started on the target as part of the migration process. At least for
the moment being, there's no hook a libvirt user (including KubeVirt)
can use to start passt-repair before the migration starts.

Add a directory watch using inotify: if PATH is a directory, instead
of connecting to it, we'll watch for a .repair socket file to appear
in it, and then attempt to connect to that socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 21:34:36 +01:00
David Gibson
2b58b22845 cppcheck: Add suppressions for "logically" exported functions
We have some functions in our headers which are definitely there on
purpose.  However, they're not yet used outside the files in which they're
defined.  That causes sufficiently recent cppcheck versions (2.17) to
complain they should be static.

Suppress the errors for these "logically" exported functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
a83c806d17 vhost_user: Don't export several functions
vhost-user added several functions which are exposed in headers, but not
used outside the file where they're defined.  I can't tell if these are
really internal functions, or of they're logically supposed to be exported,
but we don't happen to have anything using them yet.

For the time being, just remove the exports.  We can add them back if we
need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
27395e67c2 tcp: Don't export tcp_update_csum()
tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneeded prototype and make it static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
12d5b36b2f checksum: Don't export various functions
Several of the exposed functions in checksum.h are no longer directly used.
Remove them from the header, and make static.  In particular sum_16b()
should not be used outside: generally csum_unfolded() should be used which
will automatically use either the AVX2 optimized version or sum_16b() as
necessary.

csum_fold() and csum() could have external uses, but they're not used right
now.  We can expose them again if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
e36c35c952 log: Don't export passt_vsyslog()
passt_vsyslog() is an exposed function in log.h.  However it shouldn't
be called from outside log.c: it writes specifically to the system log,
and most code should call passt's logging helpers which might go to the
syslog or to a log file.

Make passt_vsyslog() local to log.c.  This requires a code motion to avoid
a forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
57d2db370b treewide: Mark assorted functions static
This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
68b04182e0 udp: create and send ICMPv6 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
87e6a46442 tap: break out building of udp header from tap_udp6_send function
We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
55431f0077 udp: create and send ICMPv4 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:19 +01:00
Jon Maloy
82a839be98 tap: break out building of udp header from tap_udp4_send function
We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:36 +01:00
David Gibson
1924e25f07 conf: Be more precise about minimum MTUs
Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68).  That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4.  For IPv6, it's not sufficient: that requires an MTU of at least
1280.

Newer kernels have better named constants IPV4_MIN_MTU and IPV6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:23 +01:00
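For reference, the values behind those constants, as defined by the
respective RFCs:

  /* Minimum MTU required to carry IPv4 (RFC 791) and IPv6 (RFC 8200) */
  #define IPV4_MIN_MTU    68
  #define IPV6_MIN_MTU    1280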
David Gibson
672d786de1 tcp: Send RST in response to guest packets that match no connection
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:32 +01:00
David Gibson
1f236817ea tap: Consider IPv6 flow label when building packet sequences
To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers.  Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs.  We ignore the IPv6 flow label.

We have some future cases where we want to consider the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels.  Add the flow label to tap6_l4_t and group by it as well
as the other criteria.  In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.

The guest should use the same flow label for all packets in a flow, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:29 +01:00
David Gibson
008175636c ip: Helpers to access IPv6 flow label
The flow label is a 20-bit field in the IPv6 header.  The length and
alignment make it awkward to pass around as is.  Obviously, it can be
packed into a 32-bit integer though, and we do this in two places.  We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:17 +01:00
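A hedged sketch of what such helpers can look like (names are illustrative,
not necessarily the ones added here): the flow label is the low 20 bits of
the first 32-bit word of the IPv6 header.

  #include <netinet/in.h>
  #include <netinet/ip6.h>
  #include <stdint.h>

  static uint32_t ip6_get_flow_lbl(const struct ip6_hdr *ip6h)
  {
          return ntohl(ip6h->ip6_flow) & 0xfffff;
  }

  static void ip6_set_flow_lbl(struct ip6_hdr *ip6h, uint32_t label)
  {
          ip6h->ip6_flow = htonl((ntohl(ip6h->ip6_flow) & ~0xfffffU) |
                                 (label & 0xfffff));
  }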
David Gibson
52419a64f2 migrate, tcp: Don't flow_alloc_cancel() during incoming migration
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer a one-to-one match with the flows which the source is
sending data for.  This means that data for later flows will be mismatched
to a different flow.  Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:38 +01:00
David Gibson
b2708218a6 tcp: Unconditionally move to CLOSED state on tcp_rst()
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state.  However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong.  So we should still mark
the flow as CLOSED.  That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:36 +01:00
David Gibson
56ce03ed0a tcp: Correct error code handling from tcp_flow_repair_socket()
There are two small bugs in error returns from tcp_flow_repair_socket(),
which is supposed to return a negative errno code:

1) On bind() failures, we directly pass on the return code from bind(),
   which is just 0 or -1, instead of an error code.

2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.

Correct both of these.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:35 +01:00
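The convention being fixed, as a hedged sketch with illustrative names: a
helper reporting failure as a negative errno should return -errno (not
bind()'s raw -1), and the caller flips the sign before handing the code to
strerror().

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Returns 0 on success, negative errno on failure */
  static int sock_bind(int s, const struct sockaddr *sa, socklen_t sl)
  {
          if (bind(s, sa, sl))
                  return -errno;          /* not the raw -1 from bind() */
          return 0;
  }

  static void report_bind_error(int rc)
  {
          if (rc < 0)
                  fprintf(stderr, "Failed to bind socket: %s\n",
                          strerror(-rc)); /* strerror() wants a positive code */
  }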
David Gibson
39f85bce1a migrate, flow: Don't attempt to migrate TCP flows without passt-repair
Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:33 +01:00
David Gibson
7b92f2e852 migrate, flow: Trivially succeed if migrating with no flows
We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:21 +01:00
Stefano Brivio
87471731e6 selinux: Fixes/workarounds for passt and passt-repair, mostly for libvirt usage
Here are a bunch of workarounds and a couple of fixes for libvirt
usage which are rather hard to split into single logical patches
as there appear to be some obscure dependencies between some of them:

- passt-repair needs to have an exec_type typeattribute (otherwise
  the policy for lsmd(1) causes a violation on getattr on its
  executable file), and that typeattribute just happened to be there
  for passt as a result of init_daemon_domain(), but passt-repair
  isn't a daemon, so we need an explicit corecmd_executable_file()

- passt-repair needs a workaround, which I'll revisit once
  https://github.com/fedora-selinux/selinux-policy/issues/2579 is
  solved, for usage with libvirt: allow it to use qemu_var_run_t
  and virt_var_run_t sockets

- add 'bpf' and 'dac_read_search' capabilities for passt-repair:
  they are needed (for whatever reason I didn't investigate) to
  actually receive socket files via SCM_RIGHTS

- passt needs further workarounds in the sense of
  https://github.com/fedora-selinux/selinux-policy/issues/2579:
  allow it to use map and use svirt_tmpfs_t (not just svirt_image_t):
  it depends on where the libvirt guest image is

- ...it also needs to map /dev/null if <access mode='shared'/> is
  enabled in libvirt's XML for the memoryBacking object, for
  vhost-user operation

- and 'ioctl' on the TCP socket appears to be actually needed, on top
  of 'getattr', to dump some socket parameters

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:14:01 +01:00
Michal Privoznik
be86232f72 seccomp.sh: Silence stty errors
When printing list of allowed syscalls the width of terminal is
obtained for nicer output (see commit below). The width is
obtained by running 'stty'. While this works when building from a
console, it doesn't work during rpmbuild/emerge/.. as stdout is
usually not a console but a logfile and stdin is usually
/dev/null or something. This results in stty reporting errors
like this:

  stty: 'standard input': Inappropriate ioctl for device

Redirect stty's stderr to /dev/null to silence it.

Fixes: 712ca32353 ("seccomp.sh: Try to account for terminal width while formatting list of system calls")
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-24 18:46:28 +01:00
Jon Maloy
ea69ca6a20 tap: always set the no_frag flag in IPv4 headers
When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Discussions on the
Internet about this subject indicate that modern routers never
fragment packets, and that it isn't even supported in many cases.

Adding to this that incoming messages forwarded on the tap interface
never even pass through a router, it seems safe to always set this flag.

This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-20 12:43:00 +01:00
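For reference, the flag in question is the "don't fragment" bit in the IPv4
fragment offset field; for headers built from scratch it can be set along
these lines (a sketch using the BSD-style struct ip, while the passt code
may use a different header type):

  #include <netinet/in.h>
  #include <netinet/ip.h>

  static void iph_set_no_frag(struct ip *iph)
  {
          /* DF set, fragment offset zero: fine for headers we build ourselves */
          iph->ip_off = htons(IP_DF);
  }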
Stefano Brivio
4dac2351fa contrib/fedora: Actually install passt-repair SELinux policy file
Otherwise we build it, but we don't install it. Not an issue that
warrants a a release right away as it's anyway usable.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:53 +01:00
Stefano Brivio
16553c8280 dhcp: Add option code byte in calculation for OPT_MAX boundary check
Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as
intended:

  $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this"
  $ hostname="__eighteen_bytes__"
  $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4
  Saving packet capture to dhcp.pcap
  $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length"
      Total Length: 577

This was hidden by the issue fixed by commit bcc4908c2b ("dhcp:
Remove option 255 length byte") until now.

Fixes: 31e8109a86 ("dhcp, dhcpv6: Add hostname and client fqdn ops")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:19 +01:00
Stefano Brivio
183bedf478 Makefile: Use mmap2() as alternative for mmap() in valgrind extra syscalls
...instead of unconditionally trying to enable both: mmap2() is the
32-bit ARM variant for mmap() (and perhaps for other architectures),
but if mmap() is available, valgrind will use that one.

This avoids seccomp.sh warning us about missing mmap2() if mmap() is
present, and is consistent with what we do in vhost-user code.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 16:36:47 +01:00
David Gibson
1cc5d4c9fe conf: Use 0 instead of -1 as "unassigned" mtu value
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:41 +01:00
David Gibson
3dc7da68a2 conf: More thorough error checking when parsing --mtu option
We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
  * Passing a non-number like '-m xyz' will not give an error and act like
    -m 0
  * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
    giving an error
  * We parse the MTU as a long, then immediately assign to an int, so on
    some platforms certain ludicrously out of bounds values will be
    silently truncated, rather than giving an error

Be a bit more thorough with the error checking to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:38 +01:00
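A hedged sketch of stricter parsing along those lines (illustrative, not
the exact conf() code): reject non-numbers, trailing junk and out-of-range
values instead of silently accepting or truncating them.

  #include <errno.h>
  #include <stdlib.h>

  /* Return the parsed MTU in 0..65535, or -1 on invalid input */
  static long parse_mtu(const char *arg)
  {
          char *end;
          long mtu;

          errno = 0;
          mtu = strtol(arg, &end, 0);
          if (errno || end == arg || *end != '\0' || mtu < 0 || mtu > 65535)
                  return -1;

          return mtu;
  }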
David Gibson
65e317a8fc flow: Clean up and generalise flow traversal macros
The migration code introduced a number of 'foreach' macros to traverse the
flow table.  These aren't inherently tied to migration, so polish up their
naming, move them to flow_table.h and also use in flow_defer_handler()
which is the other place we need to traverse the whole table.

For now we keep foreach_established_tcp_flow() as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:36 +01:00
David Gibson
b79a22d360 flow: Remove unneeded bound parameter from flow traversal macros
The foreach macros used to step through flows each take a 'bound' parameter
to only scan part of the flow table.  Only one place actually passes a
bound different from FLOW_MAX.  So we can simplify every other invocation
by having that one case manually handle the bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:34 +01:00
David Gibson
7ffca35fdd flow: Remove unneeded index from foreach_* macros
The foreach macros are odd in that they take two loop counters: an integer
index, and a pointer to the flow.  We nearly always want the latter, not
the former, and we can get the index from the pointer trivially when we
need it.  So, rearrange the macros not to need the integer index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:20 +01:00
David Gibson
adb46c11d0 flow: Add flow_perror() helper
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:12 +01:00
David Gibson
ba0823f8a0 tcp: Don't pass both flow pointer and flow index
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:10 +01:00
David Gibson
854bc7b1a3 tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:08 +01:00
David Gibson
e56c8038fc tcp: More type safety for tcp_flow_migrate_target_ext()
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:32:52 +01:00
David Gibson
5a07eb3ccc tcp_vu: head_cnt need not be global
head_cnt is a global variable which tracks how many entries in head[] are
currently used.  The fact that it's global obscures the fact that the
lifetime over which it has a meaningful value is quite short: a single
call to of tcp_vu_data_from_sock().

Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer.
We keep the head[] array global for now - although technically it has the
same valid lifetime - because it's large enough we might not want to put
it on the stack.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 11:28:37 +01:00
David Gibson
6b4065153c tap: Remove unused ETH_HDR_INIT() macro
The uses of this macro were removed in d4598e1d18 ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:18 +01:00
David Gibson
354bc0bab1 packet: Don't pass start and offset separately to packet_check_range()
Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:12 +01:00
David Gibson
0a51060f7a packet: Use flexible array member in struct pool
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.
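
For reference, the C11 idiom looks roughly like this (field names are
illustrative, not the actual struct pool layout):

  #include <stdlib.h>
  #include <sys/uio.h>

  /* Illustrative only: a trailing flexible array member replaces the old
   * pkt[1] dummy array, and the allocation sizes it explicitly.
   */
  struct pool_sketch {
          size_t size;            /* number of slots */
          size_t count;           /* slots currently in use */
          struct iovec pkt[];     /* C11 flexible array member */
  };

  static struct pool_sketch *pool_alloc(size_t n)
  {
          return malloc(sizeof(struct pool_sketch) + n * sizeof(struct iovec));
  }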

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:04 +01:00
Enrique Llorente
bcc4908c2b dhcp: Remove option 255 length byte
Option 255 (end of options) doesn't need the length byte; this change
removes it, allowing one extra byte for other dynamic options.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:42:35 +01:00
Stefano Brivio
a1e48a02ff test: Add migration tests
PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:36 +01:00
Stefano Brivio
89ecf2fd40 migrate: Migrate TCP flows
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.
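
For context, switching a socket in and out of repair mode is a single
setsockopt() call, which is what the privileged helper performs on our
behalf; a minimal sketch (TCP_REPAIR requires CAP_NET_ADMIN):

  #include <netinet/in.h>
  #include <netinet/tcp.h>      /* TCP_REPAIR, with recent glibc headers */
  #include <sys/socket.h>

  /* Minimal sketch: toggle repair mode on TCP socket @s (@on is 1 or 0) */
  static int set_tcp_repair(int s, int on)
  {
          return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
  }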

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:03 +01:00
Stefano Brivio
3e903bbb1f repair, passt-repair: Build and warning fixes for musl
Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-17 08:28:48 +01:00
Stefano Brivio
01b6a164d9 tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
In commit e5eefe7743 ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:28:45 +01:00
Stefano Brivio
667caa09c6 tcp_splice: Don't wake up on input data if we can't write it anywhere
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long as we're waiting for
output readiness on the other side.
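
Conceptually, the fix boils down to an epoll_ctl() modification of this
kind (a sketch, not the actual passt code):

  #include <sys/epoll.h>

  /* Sketch: while the other side waits for EPOLLOUT, stop watching for
   * input on our receiving socket, so we don't spin on data we can't move.
   */
  static int quiet_receiver(int epollfd, int sock, void *ref)
  {
          struct epoll_event ev = { .events = EPOLLRDHUP };

          ev.data.ptr = ref;
          return epoll_ctl(epollfd, EPOLL_CTL_MOD, sock, &ev);
  }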

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:27:30 +01:00
David Gibson
7c33b12086 vhost_user: Clear ring address on GET_VRING_BASE
GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-15 05:34:21 +01:00
Stefano Brivio
71249ef3f9 tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values
I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 12:02:55 +01:00
Stefano Brivio
30f1e082c3 tcp: Keep updating window and checking for socket data after FIN from guest
Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

While at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
98d474c895 contrib/selinux: Enable mapping guest memory for libvirt guests
This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
9a84df4c3f selinux: Add rules needed to run tests
...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-13 00:42:52 +01:00
David Gibson
a301158456 rampstream: Add utility to test for corruption of data streams
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:17 +01:00
Stefano Brivio
6f122f0171 tcp: Get bound address for connected inbound sockets too
So that we can bind inbound sockets to specific addresses, like we
already do for outbound sockets.

While at it, change the error message in tcp_conn_from_tap() to match
this one.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:00 +01:00
Stefano Brivio
f3fe795ff5 vhost_user: Make source quit after reporting migration state
This will close all the sockets we currently have open in repair mode,
and complete our migration tasks as source. If the hypervisor wants
to have us back at this point, somebody needs to restart us.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:51 +01:00
Stefano Brivio
b899141ad5 Add interfaces and configuration bits for passt-repair
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, which sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.
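
For background, handing a socket to passt-repair uses standard SCM_RIGHTS
ancillary data over the UNIX domain connection; a minimal single-descriptor
sketch (the actual code batches several descriptors per message):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Sketch: send one file descriptor plus a one-byte command over a
   * connected UNIX domain socket using SCM_RIGHTS.
   */
  static int send_fd(int conn, int fd, char cmd)
  {
          char buf[CMSG_SPACE(sizeof(int))] = { 0 };
          struct iovec iov = { .iov_base = &cmd, .iov_len = 1 };
          struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                .msg_control = buf,
                                .msg_controllen = sizeof(buf) };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

          return sendmsg(conn, &msg, 0) == 1 ? 0 : -1;
  }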

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:28 +01:00
David Gibson
155cd0c41e migrate: Migrate guest observed addresses
Most of the information in struct ctx doesn't need to be migrated.
Either it's strictly back end information which is allowed to differ
between the two ends, or it must already be configured identically on
the two ends.

There are a few exceptions though.  In particular passt learns several
addresses of the guest by observing what it sends out.  If we lose
this information across migration we might get away with it, but if
there are active flows we might misdirect some packets before
re-learning the guest address.

Avoid this by migrating the guest's observed addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style stuff, comments, etc.]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:17 +01:00
Stefano Brivio
5911e08c0f migrate: Skeleton of live migration logic
Introduce facilities for guest migration on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version that
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards-compatible protocol extensions.
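
A rough sketch of what such a header can look like (field names and widths
are illustrative, not the exact on-the-wire layout):

  #include <stdint.h>

  /* Illustrative only: versioned migration header */
  struct migrate_header_sketch {
          uint32_t magic;             /* identifies a passt migration stream */
          uint32_t version;           /* version of this encoding */
          uint32_t compat_version;    /* oldest version able to read it */
  };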

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:07 +01:00
Stefano Brivio
836fe215e0 passt-repair: Fix off-by-one in check for number of file descriptors
Actually, 254 is too many, but 253 isn't.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:46:46 +01:00
Laurent Vivier
def7de4690 tcp_vu: Fix off-by one in header count array adjustment
head_cnt represents the number of frames we're going to forward to the
guest in tcp_vu_sock_recv(), each of which could require multiple
buffers ("elements").  We initialise it with as many frames as we can
find space for in vu buffers, and we then need to adjust it down to
the number of frames we actually (partially) filled.

We adjust it down based on the number of individual buffers used by the
data from recvmsg().  At this point 'i' is *one greater than* that
number of buffers, so we need to discard all (unused) frames with a
buffer index >= i, instead of > i.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
[david: Contributed actual commit message]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:44:25 +01:00
Stefano Brivio
90f91fe726 tcp: Implement conservative zero-window probe on ACK timeout
This probably doesn't cover all the cases where we should send a
zero-window probe, but it's rather unobtrusive and obvious, so start
from here, also because I just observed this case (without the fix
from the previous patch, to take into account window information from
keep-alive segments).

If we hit the ACK timeout, and try re-sending data from the socket,
if the window is zero, we'll just fail again, go back to the timer,
and so on, until we hit the maximum number of re-transmissions and
reset the connection.

Don't do that: forcibly try to send something by implementing the
equivalent of a zero-window probe in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:43:55 +01:00
Stefano Brivio
472e2e930f tcp: Don't discard window information on keep-alive segments
It looks like a detail, but it's critical if we're dealing with
somebody, such as near-future self, using TCP_REPAIR to migrate TCP
connections in the guest or container.

The last packet sent from the 'source' process/guest/container
typically reports a small window, or zero, because the guest/container
hadn't been draining it for a while.

The next packet, appearing as the target sets TCP_REPAIR_OFF on the
migrated socket, is a keep-alive (also called "window probe" in CRIU
or TCP_REPAIR-related code), and it comes with an updated window
value, reflecting the pre-migration "regular" value.

If we ignore it, it might take a while/forever before we realise we
can actually restart sending.

Fixes: 238c69f9af ("tcp: Acknowledge keep-alive segments, ignore them for the rest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:34:15 +01:00
Enrique Llorente
31e8109a86 dhcp, dhcpv6: Add hostname and client fqdn ops
Both DHCPv4 and DHCPv6 can pass the hostname to clients: DHCPv4 uses
option 12 (hostname), while DHCPv6 uses option 39 (client FQDN). For some
virt deployments, such as kubevirt, the VirtualMachine name is expected to
be the guest hostname.

This change adds the following arguments:
 - -H --hostname NAME to configure the hostname DHCPv4 option (12)
 - --fqdn NAME to configure the client FQDN option for both DHCPv4 (81)
   and DHCPv6 (39)
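
For background, DHCPv4 options are plain type/length/value records, so the
hostname ends up encoded as option code 12, a length byte, and the name
itself; a sketch (the helper name is made up):

  #include <stddef.h>
  #include <string.h>

  /* Sketch: append DHCPv4 option 12 (hostname) as a TLV record.
   * Returns the number of bytes written, or 0 if the option doesn't fit.
   */
  static size_t put_hostname_opt(unsigned char *opts, size_t space,
                                 const char *name)
  {
          size_t len = strlen(name);

          if (!len || len > 255 || space < len + 2)
                  return 0;

          opts[0] = 12;                   /* option code: hostname */
          opts[1] = (unsigned char)len;   /* option length */
          memcpy(&opts[2], name, len);

          return len + 2;
  }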

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-10 18:30:24 +01:00
Stefano Brivio
a3d142a6f6 conf: Don't map DNS traffic to host, if host gateway is a resolver
This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.

By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:

  0.0029:     NAT to host 127.0.0.1: 192.168.100.1

But if the host gateway is also a DNS resolver:

  0.0029: DNS:
  0.0029:     192.168.100.1

then we'll send DNS queries directed to it to the host instead:

  0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
  0.0372: Flow 0 (TGT): INI -> TGT
  0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): TGT -> TYPED
  0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
  0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
  0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53

which doesn't quite work, of course:

  0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
  0.0374: ICMP error on UDP socket 95: Connection refused

unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.

Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-09 08:17:06 +01:00
Stefano Brivio
864be475d9 passt-repair: Send one confirmation *per command*, not *per socket*
It looks like me, myself and I couldn't agree on the "simple" protocol
between passt and passt-repair. The man page and passt say it's one
confirmation per command, but the passt-repair implementation had one
confirmation per socket instead.

This caused all sorts of mysterious issues with repair mode
pseudo-randomly enabled, and leading to hours of fun (mostly not
mine). Oops.

Switch to one confirmation per command (of course).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-09 08:16:41 +01:00
Enrique Llorente
fe8b6a7c42 dhcp: Don't re-use request message for reply
The logic composing the DHCP reply message reuses the request
message to compose it; future long options like FQDN may
exceed the request message limit, making it go beyond the lower
bound.

This change creates a new reply message with a fixed options size of 308
and fills it in with the proper fields from the request, adding the
generated options on top. This way the reply's lower bound does not depend
on the request.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:36:10 +01:00
Stefano Brivio
b7b70ba243 passt-repair: Dodge "structurally unreachable code" warning from Coverity
While main() conventionally returns int, and we need a return at the
end of the function to avoid compiler warnings, turning that return
into _exit() to avoid exit handlers triggers a Coverity warning. It's
unreachable code anyway, so switch that single occurrence back to a
plain return.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:35:46 +01:00
Stefano Brivio
0f009ea598 passt-repair: Fix calculation of payload length from cmsg_len
There's no inverse function for CMSG_LEN(), so we need to loop over
SCM_MAX_FD (253) possible input values. The previous calculation is
clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg
data.
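
A sketch of the kind of reverse lookup this implies (SCM_MAX_FD is the
kernel's per-message limit of 253 descriptors; the helper is hypothetical):

  #include <sys/socket.h>

  #ifndef SCM_MAX_FD
  #define SCM_MAX_FD 253
  #endif

  /* Sketch: there is no inverse of CMSG_LEN(), so find how many ints fit
   * a given cmsg_len by trying each possible count.
   */
  static int fds_from_cmsg_len(size_t cmsg_len)
  {
          int n;

          for (n = 0; n <= SCM_MAX_FD; n++) {
                  if (CMSG_LEN(n * sizeof(int)) == cmsg_len)
                          return n;
          }

          return -1;      /* not a valid SCM_RIGHTS length */
  }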

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:35:17 +01:00
Stefano Brivio
a0b7f56b3a passt-repair: Don't use perror(), accept ECONNRESET as termination
If we use glibc's perror(), we need to allow dup() and fcntl() in our
seccomp profiles, which are a bit too much for this simple helper. On
top of that, we would probably need a wrapper to avoid allocation for
translated messages.

While at it: ECONNRESET is just a close() from passt, treat it like
EOF.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:34:31 +01:00
Stefano Brivio
a5cca995de conf, passt.1: Un-deprecate --host-lo-to-ns-lo
It was established behaviour, and it's now the third report about it:
users ask how to achieve the same functionality, and we don't have a
better answer yet.

The idea behind declaring it deprecated to start with, I guess, was
that we would eventually replace it by more flexible and generic
configuration options, which is still planned. But there's nothing
preventing us from aliasing this in the future to a particular
configuration.

So, stop scaring users off, and un-deprecate this.

Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/
Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705
Link: https://github.com/moby/moby/issues/48838
Link: https://github.com/containers/podman/discussions/25243
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 11:14:30 +01:00
David Gibson
0da87b393b debug: Add tcpdump to mbuto.img
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
f66769c2de apparmor: Workaround for unconfined libvirtd when triggered by unprivileged user
If libvirtd is triggered by an unprivileged user, the virt-aa-helper
mechanism doesn't work, because per-VM profiles can't be instantiated,
and as a result libvirtd runs unconfined.

This means passt can't start, because the passt subprofile from
libvirt's profile is not loaded either.

Example:

  $ virsh start alpine
  error: Failed to start domain 'alpine'
  error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11

Add an annoying workaround for the time being. Much better than
encouraging users to start guests as root, or to disable AppArmor
altogether.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
593be32774 passt-repair.1: Fix indication of TCP_REPAIR constants
...perhaps I should adopt the healthy habit of actually reading
headers instead of using my mental copy.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:43:00 +01:00
Stefano Brivio
9215f68a0c passt-repair: Build fixes for musl
When building against musl headers:

- sizeof() needs stddef.h, as it should be;

- we can't initialise a struct msghdr by simply listing fields in
  order, as they contain explicit padding fields. Use field names
  instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:40:54 +01:00
Paul Holzinger
a9d63f91a5 passt-repair: use _exit() over return
Returning from main() does the same as calling exit(), which is not
good, as glibc might try to call futex(), which will be blocked by
seccomp. See the previous commit "treewide: use _exit() over exit()" for
a more detailed explanation.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:19 +01:00
Paul Holzinger
d0006fa784 treewide: use _exit() over exit()
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000

Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
  do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
	sleep 100 && \
  sleep 10 && \
  podman rm -fa -t0
done

And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
    0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
    0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78c50b n/a (n/a + 0x0)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78d5a2 n/a (n/a + 0x0)

Pasta got killed in exit(); it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.

Given that no exit handlers are needed, we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit(),
which should be fine for the use here.

Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:02 +01:00
David Gibson
745c163e60 tcp: Simplify handling of getsockname()
For migration we need to get the specific local address and port for
connected sockets with getsockname().  We currently open code marshalling
the results into the flow entry.

However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it.  Also report failures, which may make debugging
problems easier.
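
For reference, the underlying call is just getsockname() into a
sockaddr_storage, which inany_from_sockaddr() can then unpack; a hedged
sketch of the pattern:

  #include <errno.h>
  #include <sys/socket.h>

  /* Sketch: fetch the local address and port of a connected socket,
   * returning -errno so the caller can report the failure.
   */
  static int local_endpoint(int s, struct sockaddr_storage *sa)
  {
          socklen_t sl = sizeof(*sa);

          if (getsockname(s, (struct sockaddr *)sa, &sl))
                  return -errno;

          return 0;
  }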

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 09:02:54 +01:00
David Gibson
b4a7b5d4a6 migrate: Fix several errors with passt-repair
The passt-repair helper is now merged, but alas it contains several small
bugs:
 * close() is not in the seccomp profile, meaning it will immediately
   SIGSYS when you make a request of it
 * The generated header, seccomp_repair.h isn't listed in .gitignore or
   removed by "make clean"

Fixes: 8c24301462 ("Introduce passt-repair")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 08:52:27 +01:00
Stefano Brivio
dcf014be88 doc: Add mock of migration source and target
These test programs show the migration of a TCP connection using the
passt-repair helper.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
52e57f9c9a tcp: Get socket port and address using getsockname() when connecting from guest
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.

Similarly for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to an interface equivalent to the one it used on the source.

Use getsockname() to fetch them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
8c24301462 Introduce passt-repair
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e894d9ae82 vhost_user: Turn some vhost-user message reports to trace()
Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.

Change per-packet debug() messages in vu_kick_cb() and
vu_send_single() to trace().

[dgibson: switch different messages to trace()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e25a93032f util: Add read_remainder() and read_all_buf()
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.

I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.
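
For orientation, a read_all_buf()-style loop typically looks like this (a
sketch, not the actual implementation):

  #include <errno.h>
  #include <unistd.h>

  /* Sketch: keep read()ing until @len bytes are in @buf; fail on error or
   * unexpected EOF, retry on EINTR.
   */
  static int read_all_sketch(int fd, void *buf, size_t len)
  {
          char *p = buf;

          while (len) {
                  ssize_t n = read(fd, p, len);

                  if (n < 0) {
                          if (errno == EINTR)
                                  continue;
                          return -1;
                  }
                  if (!n)
                          return -1;      /* unexpected EOF */

                  p += n;
                  len -= (size_t)n;
          }

          return 0;
  }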

[dgibson: make read_remainder() take const pointer to iovec]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
71fa736277 tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version
I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
nonetheless.

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:43:35 +01:00
Stefano Brivio
b75ad159e8 vhost_user: On 32-bit ARM, mmap() is not available, mmap2() is used instead
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0
Fixes: 31117b27c6 ("vhost-user: introduce vhost-user API")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:28 +01:00
Stefano Brivio
722d347c19 tcp: Don't reset outbound connection on SYN retries
Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.

In that case, we should go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.

Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:13 +01:00
7ppKb5bW
bf2860819d pasta.te: fix demo.sh and remove one duplicate rule
On Fedora 41, without "allow pasta_t unconfined_t:dir read"
/usr/bin/pasta can't open /proc/[pid]/ns, which is required by
pasta_netns_quit_init().

This patch also removes one duplicate rule, "allow pasta_t nsfs_t:file
read;": "allow pasta_t nsfs_t:file { open read };" at line 123 is
enough.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:33:14 +01:00
Stefano Brivio
dcd6d8191a tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 07:32:53 +01:00
David Gibson
0349cf637f util: Rename and make global vu_remove_watch()
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set.  There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:51 +01:00
David Gibson
10c4a9e1b3 tcp: Always pass NULL event with EPOLL_CTL_DEL
In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored.  It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places.  Use
NULL instead for simplicity and consistency.
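
In other words, the deletion becomes simply (the kernel has ignored the
event argument for EPOLL_CTL_DEL since 2.6.9):

  #include <sys/epoll.h>

  /* Sketch: EPOLL_CTL_DEL ignores the event pointer on current kernels */
  static int epoll_del(int epollfd, int fd)
  {
          return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
  }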

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:37 +01:00
Laurent Vivier
dd6a6854c7 vhost-user: Implement an empty VHOST_USER_SEND_RARP command
Passt cannot manage and doesn't need to manage the broadcast of a fake RARP,
but QEMU will report an error message if Passt doesn't implement it.

Implement an empty SEND_RARP command to silence the QEMU error message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-24 21:40:05 +01:00
Stefano Brivio
d477a1fb03 netlink: Skip loopback interface while looking for a template
There might be reasons to have routes on the loopback interface, for
example Any-IP/AnyIP routes as implemented by Linux kernel commit
ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6.").

If we use the loopback interface as a template, though, we'll pick
'lo' (typically) as interface name for our tap interface, but we'll
already have an interface called 'lo' in the target namespace, and as
we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap
interface.

Skip the loopback interface while looking for a template interface or,
more accurately, skip the interface with index 1.

Strictly speaking, we should fetch interface flags via RTM_GETLINK
instead, and check for IFF_LOOPBACK, but interleaving that request
while we're iterating over routes is unnecessarily complicated.

Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-24 21:39:52 +01:00
102 changed files with 6772 additions and 1566 deletions

2
.gitignore vendored

@ -3,8 +3,10 @@
/passt.avx2
/pasta
/pasta.avx2
/passt-repair
/qrap
/pasta.1
/seccomp.h
/seccomp_repair.h
/c*.json
README.plain.md

Makefile

@ -20,6 +20,7 @@ $(if $(TARGET),,$(error Failed to get target architecture))
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(firstword $(subst -, ,$(TARGET)))
TARGET_ARCH := $(patsubst [:upper:],[:lower:],$(TARGET_ARCH))
TARGET_ARCH := $(patsubst arm%,arm,$(TARGET_ARCH))
TARGET_ARCH := $(subst powerpc,ppc,$(TARGET_ARCH))
# On some systems enabling optimization also enables source fortification,
@ -29,7 +30,7 @@ ifeq ($(shell $(CC) -O2 -dM -E - < /dev/null 2>&1 | grep ' _FORTIFY_SOURCE ' > /
FORTIFY_FLAG := -D_FORTIFY_SOURCE=2
endif
FLAGS := -Wall -Wextra -Wno-format-zero-length
FLAGS := -Wall -Wextra -Wno-format-zero-length -Wformat-security
FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
FLAGS += $(FORTIFY_FLAG) -O2 -pie -fPIE
FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
@ -38,20 +39,21 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
vhost_user.c virtio.c vu_common.c
ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
udp_vu.c util.c vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
PASST_REPAIR_SRCS = passt-repair.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
virtio.h vu_common.h
lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
udp_vu.h util.h vhost_user.h virtio.h vu_common.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
@ -72,9 +74,9 @@ mandir ?= $(datarootdir)/man
man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64)
BIN := passt passt.avx2 pasta pasta.avx2 qrap
BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
else
BIN := passt pasta qrap
BIN := passt pasta qrap passt-repair
endif
all: $(BIN) $(MANPAGES) docs
@ -83,7 +85,10 @@ static: FLAGS += -static -DGLIBC_NO_STATIC_NSS
static: clean all
seccomp.h: seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp.h $(PASST_SRCS) $(PASST_HEADERS)
seccomp_repair.h: seccomp.sh $(PASST_REPAIR_SRCS)
@ ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp_repair.h $(PASST_REPAIR_SRCS)
passt: $(PASST_SRCS) $(HEADERS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_SRCS) -o passt $(LDFLAGS)
@ -101,16 +106,19 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
passt-repair: $(PASST_REPAIR_SRCS) seccomp_repair.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex statx \
readlink
rt_sigreturn getpid gettid kill clock_gettime \
mmap|mmap2 munmap open unlink gettimeofday futex \
statx readlink
valgrind: FLAGS += -g -DVALGRIND
valgrind: all
.PHONY: clean
clean:
$(RM) $(BIN) *~ *.o seccomp.h pasta.1 \
$(RM) $(BIN) *~ *.o seccomp.h seccomp_repair.h pasta.1 \
passt.tar passt.tar.gz *.deb *.rpm \
passt.pid README.plain.md

checksum.c

@ -85,7 +85,7 @@
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing")))
uint32_t sum_16b(const void *buf, size_t len)
static uint32_t sum_16b(const void *buf, size_t len)
{
const uint16_t *p = buf;
uint32_t sum = 0;
@ -107,7 +107,7 @@ uint32_t sum_16b(const void *buf, size_t len)
*
* Return: 16-bit folded sum
*/
uint16_t csum_fold(uint32_t sum)
static uint16_t csum_fold(uint32_t sum)
{
while (sum >> 16)
sum = (sum & 0xffff) + (sum >> 16);
@ -161,6 +161,21 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
return psum;
}
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
static uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_udp4() - Calculate and set checksum for a UDP over IPv4 packet
* @udp4hr: UDP header, initialised apart from checksum
@ -482,21 +497,6 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
}
#endif /* !__AVX2__ */
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_iov_tail() - Calculate unfolded checksum for the tail of an IO vector
* @tail: IO vector tail to checksum

checksum.h

@ -11,8 +11,6 @@ struct icmphdr;
struct icmp6hdr;
struct iov_tail;
uint32_t sum_16b(const void *buf, size_t len);
uint16_t csum_fold(uint32_t sum);
uint16_t csum_unaligned(const void *buf, size_t len, uint32_t init);
uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
@ -32,7 +30,6 @@ void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
uint16_t csum(const void *buf, size_t len, uint32_t init);
uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init);
#endif /* CHECKSUM_H */

462
conf.c

@ -16,6 +16,7 @@
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <libgen.h>
#include <string.h>
#include <sched.h>
#include <sys/types.h>
@ -123,6 +124,75 @@ static int parse_port_range(const char *s, char **endptr,
return 0;
}
/**
* conf_ports_range_except() - Set up forwarding for a range of ports minus a
* bitmap of exclusions
* @c: Execution context
* @optname: Short option name, t, T, u, or U
* @optarg: Option argument (port specification)
* @fwd: Pointer to @fwd_ports to be updated
* @addr: Listening address
* @ifname: Listening interface
* @first: First port to forward
* @last: Last port to forward
* @exclude: Bitmap of ports to exclude
* @to: Port to translate @first to when forwarding
* @weak: Ignore errors, as long as at least one port is mapped
*/
static void conf_ports_range_except(const struct ctx *c, char optname,
const char *optarg, struct fwd_ports *fwd,
const union inany_addr *addr,
const char *ifname,
uint16_t first, uint16_t last,
const uint8_t *exclude, uint16_t to,
bool weak)
{
bool bound_one = false;
unsigned i;
int ret;
if (first == 0) {
die("Can't forward port 0 for option '-%c %s'",
optname, optarg);
}
for (i = first; i <= last; i++) {
if (bitmap_isset(exclude, i))
continue;
if (bitmap_isset(fwd->map, i)) {
warn(
"Altering mapping of already mapped port number: %s", optarg);
}
bitmap_set(fwd->map, i);
fwd->delta[i] = to - first;
if (optname == 't')
ret = tcp_sock_init(c, addr, ifname, i);
else if (optname == 'u')
ret = udp_sock_init(c, 0, addr, ifname, i);
else
/* No way to check in advance for -T and -U */
ret = 0;
if (ret == -ENFILE || ret == -EMFILE) {
die("Can't open enough sockets for port specifier: %s",
optarg);
}
if (!ret) {
bound_one = true;
} else if (!weak) {
die("Failed to bind port %u (%s) for option '-%c %s'",
i, strerror_(-ret), optname, optarg);
}
}
if (!bound_one)
die("Failed to bind any port for '-%c %s'", optname, optarg);
}
/**
* conf_ports() - Parse port configuration options, initialise UDP/TCP sockets
* @c: Execution context
@ -135,10 +205,9 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
{
union inany_addr addr_buf = inany_any6, *addr = &addr_buf;
char buf[BUFSIZ], *spec, *ifname = NULL, *p;
bool exclude_only = true, bound_one = false;
uint8_t exclude[PORT_BITMAP_SIZE] = { 0 };
bool exclude_only = true;
unsigned i;
int ret;
if (!strcmp(optarg, "none")) {
if (fwd->mode)
@ -173,32 +242,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
fwd->mode = FWD_ALL;
/* Skip port 0. It has special meaning for many socket APIs, so
* trying to bind it is not really safe.
*/
for (i = 1; i < NUM_PORTS; i++) {
/* Exclude ephemeral ports */
for (i = 0; i < NUM_PORTS; i++)
if (fwd_port_is_ephemeral(i))
continue;
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
}
}
if (!bound_one)
goto bind_all_fail;
bitmap_set(exclude, i);
conf_ports_range_except(c, optname, optarg, fwd,
NULL, NULL,
1, NUM_PORTS - 1, exclude,
1, true);
return;
}
@ -275,37 +327,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
} while ((p = next_chunk(p, ',')));
if (exclude_only) {
/* Skip port 0. It has special meaning for many socket APIs, so
* trying to bind it is not really safe.
*/
for (i = 1; i < NUM_PORTS; i++) {
if (fwd_port_is_ephemeral(i) ||
bitmap_isset(exclude, i))
continue;
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else {
/* No way to check in advance for -T and -U */
bound_one = true;
}
}
if (!bound_one)
goto bind_all_fail;
/* Exclude ephemeral ports */
for (i = 0; i < NUM_PORTS; i++)
if (fwd_port_is_ephemeral(i))
bitmap_set(exclude, i);
conf_ports_range_except(c, optname, optarg, fwd,
addr, ifname,
1, NUM_PORTS - 1, exclude,
1, true);
return;
}
@ -334,40 +364,18 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
if ((*p != '\0') && (*p != ',')) /* Garbage after the ranges */
goto bad;
for (i = orig_range.first; i <= orig_range.last; i++) {
if (bitmap_isset(fwd->map, i))
warn(
"Altering mapping of already mapped port number: %s", optarg);
if (bitmap_isset(exclude, i))
continue;
bitmap_set(fwd->map, i);
fwd->delta[i] = mapped_range.first - orig_range.first;
ret = 0;
if (optname == 't')
ret = tcp_sock_init(c, addr, ifname, i);
else if (optname == 'u')
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret)
goto bind_fail;
}
conf_ports_range_except(c, optname, optarg, fwd,
addr, ifname,
orig_range.first, orig_range.last,
exclude,
mapped_range.first, false);
} while ((p = next_chunk(p, ',')));
return;
enfile:
die("Can't open enough sockets for port specifier: %s", optarg);
bad:
die("Invalid port specifier %s", optarg);
mode_conflict:
die("Port forwarding mode '%s' conflicts with previous mode", optarg);
bind_fail:
die("Failed to bind port %u (%s) for option '-%c %s', exiting",
i, strerror_(-ret), optname, optarg);
bind_all_fail:
die("Failed to bind any port for '-%c %s', exiting", optname, optarg);
}
/**
@ -406,6 +414,76 @@ static unsigned add_dns6(struct ctx *c, const struct in6_addr *addr,
return 1;
}
/**
* add_dns_resolv4() - Possibly add one IPv4 nameserver from host's resolv.conf
* @c: Execution context
* @ns: Nameserver address
* @idx: Pointer to index of current IPv4 resolver entry, set on return
*/
static void add_dns_resolv4(struct ctx *c, struct in_addr *ns, unsigned *idx)
{
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
c->ip4.dns_host = *ns;
/* Special handling if guest or container can only access local
* addresses via redirect, or if the host gateway is also a resolver and
* we shadow its address
*/
if (IN4_IS_ADDR_LOOPBACK(ns) ||
IN4_ARE_ADDR_EQUAL(ns, &c->ip4.map_host_loopback)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
return; /* Address unreachable */
*ns = c->ip4.map_host_loopback;
c->ip4.dns_match = c->ip4.map_host_loopback;
} else {
/* No general host mapping, but requested for DNS
* (--dns-forward and --no-map-gw): advertise resolver
* address from --dns-forward, and map that to loopback
*/
*ns = c->ip4.dns_match;
}
}
*idx += add_dns4(c, ns, *idx);
}
/**
* add_dns_resolv6() - Possibly add one IPv6 nameserver from host's resolv.conf
* @c: Execution context
* @ns: Nameserver address
* @idx: Pointer to index of current IPv6 resolver entry, set on return
*/
static void add_dns_resolv6(struct ctx *c, struct in6_addr *ns, unsigned *idx)
{
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
c->ip6.dns_host = *ns;
/* Special handling if guest or container can only access local
* addresses via redirect, or if the host gateway is also a resolver and
* we shadow its address
*/
if (IN6_IS_ADDR_LOOPBACK(ns) ||
IN6_ARE_ADDR_EQUAL(ns, &c->ip6.map_host_loopback)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
return; /* Address unreachable */
*ns = c->ip6.map_host_loopback;
c->ip6.dns_match = c->ip6.map_host_loopback;
} else {
/* No general host mapping, but requested for DNS
* (--dns-forward and --no-map-gw): advertise resolver
* address from --dns-forward, and map that to loopback
*/
*ns = c->ip6.dns_match;
}
}
*idx += add_dns6(c, ns, *idx);
}
/**
* add_dns_resolv() - Possibly add ns from host resolv.conf to configuration
* @c: Execution context
@ -422,44 +500,11 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
struct in6_addr ns6;
struct in_addr ns4;
if (idx4 && inet_pton(AF_INET, nameserver, &ns4)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
c->ip4.dns_host = ns4;
if (idx4 && inet_pton(AF_INET, nameserver, &ns4))
add_dns_resolv4(c, &ns4, idx4);
/* Guest or container can only access local addresses via
* redirect
*/
if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
return;
ns4 = c->ip4.map_host_loopback;
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
c->ip4.dns_match = c->ip4.map_host_loopback;
}
*idx4 += add_dns4(c, &ns4, *idx4);
}
if (idx6 && inet_pton(AF_INET6, nameserver, &ns6)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
c->ip6.dns_host = ns6;
/* Guest or container can only access local addresses via
* redirect
*/
if (IN6_IS_ADDR_LOOPBACK(&ns6)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
return;
ns6 = c->ip6.map_host_loopback;
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
c->ip6.dns_match = c->ip6.map_host_loopback;
}
*idx6 += add_dns6(c, &ns6, *idx6);
}
if (idx6 && inet_pton(AF_INET6, nameserver, &ns6))
add_dns_resolv6(c, &ns6, idx6);
}
/**
@ -769,7 +814,7 @@ static void conf_ip6_local(struct ip6_ctx *ip6)
* usage() - Print usage, exit with given status code
* @name: Executable name
* @f: Stream to print usage info to
* @status: Status code for exit()
* @status: Status code for _exit()
*/
static void usage(const char *name, FILE *f, int status)
{
@ -816,6 +861,9 @@ static void usage(const char *name, FILE *f, int status)
" UNIX domain socket is provided by -s option\n"
" --print-capabilities print back-end capabilities in JSON format,\n"
" only meaningful for vhost-user mode\n");
FPRINTF(f,
" --repair-path PATH path for passt-repair(1)\n"
" default: append '.repair' to UNIX domain path\n");
}
FPRINTF(f,
@ -854,7 +902,9 @@ static void usage(const char *name, FILE *f, int status)
FPRINTF(f, " default: use addresses from /etc/resolv.conf\n");
FPRINTF(f,
" -S, --search LIST Space-separated list, search domains\n"
" a single, empty option disables the DNS search list\n");
" a single, empty option disables the DNS search list\n"
" -H, --hostname NAME Hostname to configure client with\n"
" --fqdn NAME FQDN to configure client with\n");
if (strstr(name, "pasta"))
FPRINTF(f, " default: don't use any search list\n");
else
@ -925,7 +975,7 @@ static void usage(const char *name, FILE *f, int status)
" SPEC is as described for TCP above\n"
" default: none\n");
exit(status);
_exit(status);
pasta_opts:
@ -963,8 +1013,7 @@ pasta_opts:
" -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
" SPEC is as described above\n"
" default: auto\n"
" --host-lo-to-ns-lo DEPRECATED:\n"
" Translate host-loopback forwards to\n"
" --host-lo-to-ns-lo Translate host-loopback forwards to\n"
" namespace loopback\n"
" --userns NSPATH Target user namespace to join\n"
" --netns PATH|NAME Target network namespace to join\n"
@ -980,7 +1029,46 @@ pasta_opts:
" --ns-mac-addr ADDR Set MAC address on tap interface\n"
" --no-splice Disable inbound socket splicing\n");
exit(status);
_exit(status);
}
/**
* conf_mode() - Determine passt/pasta's operating mode from command line
* @argc: Argument count
* @argv: Command line arguments
*
* Return: mode to operate in, PASTA or PASST
*/
enum passt_modes conf_mode(int argc, char *argv[])
{
int vhost_user = 0;
const struct option optvu[] = {
{"vhost-user", no_argument, &vhost_user, 1 },
{ 0 },
};
char argv0[PATH_MAX], *basearg0;
int name;
optind = 0;
do {
name = getopt_long(argc, argv, "-:", optvu, NULL);
} while (name != -1);
if (vhost_user)
return MODE_VU;
if (argc < 1)
die("Cannot determine argv[0]");
strncpy(argv0, argv[0], PATH_MAX - 1);
basearg0 = basename(argv0);
if (strstr(basearg0, "pasta"))
return MODE_PASTA;
if (strstr(basearg0, "passt"))
return MODE_PASST;
die("Cannot determine mode, invoke as \"passt\" or \"pasta\"");
}
/**
@ -1217,6 +1305,8 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
*addr6 = in6addr_any;
if (no_map_gw)
*no_map_gw = 1;
return;
}
if (inet_pton(AF_INET6, arg, addr6) &&
@ -1240,8 +1330,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
*/
static void conf_open_files(struct ctx *c)
{
if (c->mode != MODE_PASTA && c->fd_tap == -1)
c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
if (c->mode != MODE_PASTA && c->fd_tap == -1) {
c->fd_tap_listen = sock_unix(c->sock_path);
if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
if (!*c->repair_path &&
snprintf_check(c->repair_path,
sizeof(c->repair_path), "%s.repair",
c->sock_path)) {
warn("passt-repair path %s not usable",
c->repair_path);
c->fd_repair_listen = -1;
} else {
c->fd_repair_listen = sock_unix(c->repair_path);
}
} else {
c->fd_repair_listen = -1;
}
c->fd_repair = -1;
}
if (*c->pidfile) {
c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
@ -1313,6 +1420,7 @@ void conf(struct ctx *c, int argc, char **argv)
{"outbound", required_argument, NULL, 'o' },
{"dns", required_argument, NULL, 'D' },
{"search", required_argument, NULL, 'S' },
{"hostname", required_argument, NULL, 'H' },
{"no-tcp", no_argument, &c->no_tcp, 1 },
{"no-udp", no_argument, &c->no_udp, 1 },
{"no-icmp", no_argument, &c->no_icmp, 1 },
@ -1354,21 +1462,25 @@ void conf(struct ctx *c, int argc, char **argv)
{"host-lo-to-ns-lo", no_argument, NULL, 23 },
{"dns-host", required_argument, NULL, 24 },
{"vhost-user", no_argument, NULL, 25 },
/* vhost-user backend program convention */
{"print-capabilities", no_argument, NULL, 26 },
{"socket-path", required_argument, NULL, 's' },
{"fqdn", required_argument, NULL, 27 },
{"repair-path", required_argument, NULL, 28 },
{ 0 },
};
const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
bool copy_addrs_opt = false, copy_routes_opt = false;
enum fwd_ports_mode fwd_default = FWD_NONE;
bool v4_only = false, v6_only = false;
unsigned dns4_idx = 0, dns6_idx = 0;
unsigned long max_mtu = IP_MAX_MTU;
struct fqdn *dnss = c->dns_search;
unsigned int ifi4 = 0, ifi6 = 0;
const char *logfile = NULL;
const char *optstring;
size_t logsize = 0;
char *runas = NULL;
long fd_tap_opt;
@ -1379,11 +1491,11 @@ void conf(struct ctx *c, int argc, char **argv)
if (c->mode == MODE_PASTA) {
c->no_dhcp_dns = c->no_dhcp_dns_search = 1;
fwd_default = FWD_AUTO;
optstring = "+dqfel:hF:I:p:P:m:a:n:M:g:i:o:D:S:46t:u:T:U:";
} else {
optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:461t:u:";
}
if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
max_mtu = tap_l2_max_len(c) - ETH_HLEN;
c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
memcpy(c->our_tap_mac, MAC_OUR_LAA, ETH_ALEN);
@ -1482,7 +1594,7 @@ void conf(struct ctx *c, int argc, char **argv)
FPRINTF(stdout,
c->mode == MODE_PASTA ? "pasta " : "passt ");
FPRINTF(stdout, VERSION_BLOB);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
case 15:
ret = snprintf(c->ip4.ifname_out,
sizeof(c->ip4.ifname_out), "%s", optarg);
@ -1551,12 +1663,26 @@ void conf(struct ctx *c, int argc, char **argv)
die("Invalid host nameserver address: %s", optarg);
case 25:
if (c->mode == MODE_PASTA)
die("--vhost-user is for passt mode only");
c->mode = MODE_VU;
/* Already handled in conf_mode() */
ASSERT(c->mode == MODE_VU);
break;
case 26:
vu_print_capabilities();
break;
case 27:
if (snprintf_check(c->fqdn, PASST_MAXDNAME,
"%s", optarg))
die("Invalid FQDN: %s", optarg);
break;
case 28:
if (c->mode != MODE_VU && strcmp(optarg, "none"))
die("--repair-path is for vhost-user mode only");
if (snprintf_check(c->repair_path,
sizeof(c->repair_path), "%s",
optarg))
die("Invalid passt-repair path: %s", optarg);
break;
case 'd':
c->debug = 1;
@ -1576,6 +1702,9 @@ void conf(struct ctx *c, int argc, char **argv)
c->foreground = 1;
break;
case 's':
if (c->mode == MODE_PASTA)
die("-s is for passt / vhost-user mode only");
ret = snprintf(c->sock_path, sizeof(c->sock_path), "%s",
optarg);
if (ret <= 0 || ret >= (int)sizeof(c->sock_path))
@ -1588,7 +1717,8 @@ void conf(struct ctx *c, int argc, char **argv)
fd_tap_opt = strtol(optarg, NULL, 0);
if (errno ||
fd_tap_opt <= STDERR_FILENO || fd_tap_opt > INT_MAX)
(fd_tap_opt != STDIN_FILENO && fd_tap_opt <= STDERR_FILENO) ||
fd_tap_opt > INT_MAX)
die("Invalid --fd: %s", optarg);
c->fd_tap = fd_tap_opt;
@ -1596,6 +1726,9 @@ void conf(struct ctx *c, int argc, char **argv)
*c->sock_path = 0;
break;
case 'I':
if (c->mode != MODE_PASTA)
die("-I is for pasta mode only");
ret = snprintf(c->pasta_ifn, IFNAMSIZ, "%s",
optarg);
if (ret <= 0 || ret >= IFNAMSIZ)
@ -1615,20 +1748,24 @@ void conf(struct ctx *c, int argc, char **argv)
die("Invalid PID file: %s", optarg);
break;
case 'm':
case 'm': {
unsigned long mtu;
char *e;
errno = 0;
c->mtu = strtol(optarg, NULL, 0);
mtu = strtoul(optarg, &e, 0);
if (!c->mtu) {
c->mtu = -1;
break;
}
if (c->mtu < ETH_MIN_MTU || c->mtu > (int)ETH_MAX_MTU ||
errno)
if (errno || *e)
die("Invalid MTU: %s", optarg);
if (mtu > max_mtu) {
die("MTU %lu too large (max %lu)",
mtu, max_mtu);
}
c->mtu = mtu;
break;
}
case 'a':
if (inet_pton(AF_INET6, optarg, &c->ip6.addr) &&
!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr) &&
@ -1727,6 +1864,11 @@ void conf(struct ctx *c, int argc, char **argv)
die("Cannot use DNS search domain %s", optarg);
break;
case 'H':
if (snprintf_check(c->hostname, PASST_MAXDNAME,
"%s", optarg))
die("Invalid hostname: %s", optarg);
break;
case '4':
v4_only = true;
v6_only = false;
@ -1743,11 +1885,16 @@ void conf(struct ctx *c, int argc, char **argv)
break;
case 't':
case 'u':
case 'T':
case 'U':
case 'D':
/* Handle these later, once addresses are configured */
break;
case 'T':
case 'U':
if (c->mode != MODE_PASTA)
die("-%c is for pasta mode only", name);
/* Handle properly later, once addresses are configured */
break;
case 'h':
usage(argv[0], stdout, EXIT_SUCCESS);
break;
@ -1795,9 +1942,21 @@ void conf(struct ctx *c, int argc, char **argv)
c->ifi4 = conf_ip4(ifi4, &c->ip4);
if (!v4_only)
c->ifi6 = conf_ip6(ifi6, &c->ip6);
if (c->ifi4 && c->mtu < IPV4_MIN_MTU) {
warn("MTU %"PRIu16" is too small for IPv4 (minimum %u)",
c->mtu, IPV4_MIN_MTU);
}
if (c->ifi6 && c->mtu < IPV6_MIN_MTU) {
warn("MTU %"PRIu16" is too small for IPv6 (minimum %u)",
c->mtu, IPV6_MIN_MTU);
}
if ((*c->ip4.ifname_out && !c->ifi4) ||
(*c->ip6.ifname_out && !c->ifi6))
die("External interface not usable");
if (!c->ifi4 && !c->ifi6) {
info("No external interface as template, switch to local mode");
@ -1824,8 +1983,8 @@ void conf(struct ctx *c, int argc, char **argv)
if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
c->no_dhcp = 1;
/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
* settings)
/* Inbound port options and DNS can be parsed now, after IPv4/IPv6
* settings
*/
fwd_probe_ephemeral();
udp_portmap_clear();
@ -1919,9 +2078,6 @@ void conf(struct ctx *c, int argc, char **argv)
c->no_dhcpv6 = 1;
}
if (!c->mtu)
c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
get_dns(c);
if (!*c->pasta_ifn) {

conf.h

@ -6,6 +6,7 @@
#ifndef CONF_H
#define CONF_H
enum passt_modes conf_mode(int argc, char *argv[]);
void conf(struct ctx *c, int argc, char **argv);
#endif /* CONF_H */


@ -27,4 +27,25 @@ profile passt /usr/bin/passt{,.avx2} {
owner @{HOME}/** w, # pcap(), pidfile_open(),
# pidfile_write()
# Workaround: libvirt's profile comes with a passt subprofile which includes,
# in turn, <abstractions/passt>, and adds libvirt-specific rules on top, to
# allow passt (when started by libvirtd) to write socket and PID files in the
# location requested by libvirtd itself, and to execute passt itself.
#
# However, when libvirt runs as unprivileged user, the mechanism based on
# virt-aa-helper, designed to build per-VM profiles as guests are started,
# doesn't work. The helper needs to create and load profiles on the fly, which
# can't be done by unprivileged users, of course.
#
# As a result, libvirtd runs unconfined if guests are started by unprivileged
# users, starting passt unconfined as well, which means that passt runs under
# its own stand-alone profile (this one), which implies in turn that execve()
# of /usr/bin/passt is not allowed, and socket and PID files can't be written.
#
# Duplicate libvirt-specific rules here as long as this is not solved in
# libvirt's profile itself.
/usr/bin/passt r,
owner @{run}/user/[0-9]*/libvirt/qemu/run/passt/* rw,
owner @{run}/libvirt/qemu/passt/* rw,
}


@ -0,0 +1,29 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# contrib/apparmor/usr.bin.passt-repair - AppArmor profile for passt-repair(1)
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
abi <abi/3.0>,
#include <tunables/global>
profile passt-repair /usr/bin/passt-repair {
#include <abstractions/base>
/** rw, # passt's ".repair" socket might be anywhere
unix (connect, receive, send) type=stream,
capability dac_override, # connect to passt's socket as root
capability net_admin, # currently needed for TCP_REPAIR socket option
capability net_raw, # what TCP_REPAIR should require instead
network unix stream, # connect and use UNIX domain socket
network inet stream, # use TCP sockets
}


@ -44,7 +44,7 @@ Requires(preun): %{name}
Requires(preun): policycoreutils
%description selinux
This package adds SELinux enforcement to passt(1) and pasta(1).
This package adds SELinux enforcement to passt(1), pasta(1), passt-repair(1).
%prep
%setup -q -n passt-%{git_hash}
@ -82,6 +82,7 @@ make -f %{_datadir}/selinux/devel/Makefile
install -p -m 644 -D passt.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
install -p -m 644 -D passt.if %{buildroot}%{_datadir}/selinux/devel/include/distributed/passt.if
install -p -m 644 -D pasta.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
install -p -m 644 -D passt-repair.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
popd
%pre selinux
@ -90,11 +91,13 @@ popd
%post selinux
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%postun selinux
if [ $1 -eq 0 ]; then
%selinux_modules_uninstall -s %{selinuxtype} passt
%selinux_modules_uninstall -s %{selinuxtype} pasta
%selinux_modules_uninstall -s %{selinuxtype} passt-repair
fi
%posttrans selinux
@ -108,9 +111,11 @@ fi
%{_bindir}/passt
%{_bindir}/pasta
%{_bindir}/qrap
%{_bindir}/passt-repair
%{_mandir}/man1/passt.1*
%{_mandir}/man1/pasta.1*
%{_mandir}/man1/qrap.1*
%{_mandir}/man1/passt-repair.1*
%ifarch x86_64
%{_bindir}/passt.avx2
%{_mandir}/man1/passt.avx2.1*
@ -122,6 +127,7 @@ fi
%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%{_datadir}/selinux/devel/include/distributed/passt.if
%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%changelog
{{{ passt_git_changelog }}}


@ -0,0 +1,11 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.fc - SELinux: File Context for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
/usr/bin/passt-repair system_u:object_r:passt_repair_exec_t:s0


@ -0,0 +1,87 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.te - SELinux: Type Enforcement for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
policy_module(passt-repair, 0.1)
require {
type unconfined_t;
type passt_t;
role unconfined_r;
class process transition;
class file { read execute execute_no_trans entrypoint open map };
class capability { dac_override net_admin net_raw };
class chr_file { append open getattr read write ioctl };
class unix_stream_socket { create connect sendto };
class sock_file { read write };
class tcp_socket { read setopt write };
type console_device_t;
type user_devpts_t;
type user_tmp_t;
	# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
type qemu_var_run_t;
type virt_var_run_t;
}
type passt_repair_t;
domain_type(passt_repair_t);
type passt_repair_exec_t;
corecmd_executable_file(passt_repair_exec_t);
role unconfined_r types passt_repair_t;
allow passt_repair_t passt_repair_exec_t:file { read execute execute_no_trans entrypoint open map };
type_transition unconfined_t passt_repair_exec_t:process passt_repair_t;
allow unconfined_t passt_repair_t:process transition;
allow passt_repair_t self:capability { dac_override dac_read_search net_admin net_raw };
allow passt_repair_t self:capability2 bpf;
allow passt_repair_t console_device_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t user_devpts_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:dir { getattr read search watch };
allow passt_repair_t unconfined_t:sock_file { getattr read write };
allow passt_repair_t passt_t:sock_file { getattr read write };
allow passt_repair_t user_tmp_t:sock_file { getattr read write };
allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
allow passt_repair_t passt_t:tcp_socket { read setopt write };
# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
allow passt_repair_t qemu_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t virt_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t qemu_var_run_t:dir { getattr read search watch };
allow passt_repair_t virt_var_run_t:dir { getattr read search watch };
allow passt_repair_t qemu_var_run_t:sock_file { getattr read write };
allow passt_repair_t virt_var_run_t:sock_file { getattr read write };


@ -20,9 +20,19 @@ require {
type fs_t;
type tmp_t;
type user_tmp_t;
type user_home_t;
type tmpfs_t;
type root_t;
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
type svirt_image_t;
type svirt_tmpfs_t;
type svirt_t;
type null_device_t;
class file { ioctl getattr setattr create read write unlink open relabelto execute execute_no_trans map };
class dir { search write add_name remove_name mounton };
class chr_file { append read write open getattr ioctl };
@ -38,8 +48,8 @@ require {
type net_conf_t;
type proc_net_t;
type node_t;
class tcp_socket { create accept listen name_bind name_connect };
class udp_socket { create accept listen };
class tcp_socket { create accept listen name_bind name_connect getattr ioctl };
class udp_socket { create accept listen getattr };
class icmp_socket { bind create name_bind node_bind setopt read write };
class sock_file { create unlink write };
@ -80,6 +90,9 @@ allow passt_t root_t:dir mounton;
allow passt_t tmp_t:dir { add_name mounton remove_name write };
allow passt_t tmpfs_t:filesystem mount;
allow passt_t fs_t:filesystem unmount;
allow passt_t user_home_t:dir search;
allow passt_t user_tmp_t:fifo_file append;
allow passt_t user_tmp_t:file map;
manage_files_pattern(passt_t, user_tmp_t, user_tmp_t)
files_pid_filetrans(passt_t, user_tmp_t, file)
@ -119,11 +132,19 @@ corenet_udp_sendrecv_all_ports(passt_t)
allow passt_t node_t:icmp_socket { name_bind node_bind };
allow passt_t port_t:icmp_socket name_bind;
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write };
allow passt_t self:udp_socket { create getopt setopt connect bind read write };
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr ioctl };
allow passt_t self:udp_socket { create getopt setopt connect bind read write getattr };
allow passt_t self:icmp_socket { bind create setopt read write };
allow passt_t user_tmp_t:dir { add_name write };
allow passt_t user_tmp_t:file { create open };
allow passt_t user_tmp_t:sock_file { create read write unlink };
allow passt_t unconfined_t:unix_stream_socket { read write };
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
allow passt_t svirt_image_t:file { read write map };
allow passt_t svirt_tmpfs_t:file { read write map };
allow passt_t null_device_t:chr_file map;


@ -18,6 +18,7 @@ require {
type bin_t;
type user_home_t;
type user_home_dir_t;
type user_tmp_t;
type fs_t;
type tmp_t;
type tmpfs_t;
@ -56,8 +57,10 @@ require {
attribute port_type;
type port_t;
type http_port_t;
type http_cache_port_t;
type ssh_port_t;
type reserved_port_t;
type unreserved_port_t;
type dns_port_t;
type dhcpc_port_t;
type chronyd_port_t;
@ -122,8 +125,8 @@ domain_auto_trans(pasta_t, ping_exec_t, ping_t);
allow pasta_t nsfs_t:file { open read };
allow pasta_t user_home_t:dir getattr;
allow pasta_t user_home_t:file { open read getattr setattr };
allow pasta_t user_home_t:dir { getattr search };
allow pasta_t user_home_t:file { open read getattr setattr execute execute_no_trans map};
allow pasta_t user_home_dir_t:dir { search getattr open add_name read write };
allow pasta_t user_home_dir_t:file { create open read write };
allow pasta_t tmp_t:dir { add_name mounton remove_name write };
@ -133,6 +136,11 @@ allow pasta_t root_t:dir mounton;
manage_files_pattern(pasta_t, pasta_pid_t, pasta_pid_t)
files_pid_filetrans(pasta_t, pasta_pid_t, file)
allow pasta_t user_tmp_t:dir { add_name remove_name search write };
allow pasta_t user_tmp_t:fifo_file append;
allow pasta_t user_tmp_t:file { create open write };
allow pasta_t user_tmp_t:sock_file { create unlink };
allow pasta_t console_device_t:chr_file { open write getattr ioctl };
allow pasta_t user_devpts_t:chr_file { getattr read write ioctl };
logging_send_syslog_msg(pasta_t)
@ -160,6 +168,8 @@ allow pasta_t self:udp_socket create_stream_socket_perms;
allow pasta_t reserved_port_t:udp_socket name_bind;
allow pasta_t llmnr_port_t:tcp_socket name_bind;
allow pasta_t llmnr_port_t:udp_socket name_bind;
allow pasta_t http_cache_port_t:tcp_socket { name_bind name_connect };
allow pasta_t unreserved_port_t:udp_socket name_bind;
corenet_udp_sendrecv_generic_node(pasta_t)
corenet_udp_bind_generic_node(pasta_t)
allow pasta_t node_t:icmp_socket { name_bind node_bind };
@ -171,7 +181,7 @@ allow pasta_t init_t:lnk_file read;
allow pasta_t init_t:unix_stream_socket connectto;
allow pasta_t init_t:dbus send_msg;
allow pasta_t init_t:system status;
allow pasta_t unconfined_t:dir search;
allow pasta_t unconfined_t:dir { read search };
allow pasta_t unconfined_t:file read;
allow pasta_t unconfined_t:lnk_file read;
allow pasta_t self:process { setpgid setcap };
@ -192,8 +202,6 @@ allow pasta_t sysctl_net_t:dir search;
allow pasta_t sysctl_net_t:file { open read write };
allow pasta_t kernel_t:system module_request;
allow pasta_t nsfs_t:file read;
allow pasta_t proc_t:dir mounton;
allow pasta_t proc_t:filesystem mount;
allow pasta_t net_conf_t:lnk_file read;

dhcp.c

@ -63,6 +63,11 @@ static struct opt opts[255];
#define OPT_MIN 60 /* RFC 951 */
/* Total option size (excluding end option) is 576 (RFC 2131), minus
* offset of options (268), minus end option (1).
*/
#define OPT_MAX 307
/**
* dhcp_init() - Initialise DHCP options
*/
@ -122,7 +127,7 @@ struct msg {
uint8_t sname[64];
uint8_t file[128];
uint32_t magic;
uint8_t o[308];
uint8_t o[OPT_MAX + 1 /* End option */ ];
} __attribute__((__packed__));
/**
@ -130,15 +135,28 @@ struct msg {
* @m: Message to fill
* @o: Option number
* @offset: Current offset within options field, updated on insertion
*
* Return: false if m has space to write the option, true otherwise
*/
static void fill_one(struct msg *m, int o, int *offset)
static bool fill_one(struct msg *m, int o, int *offset)
{
size_t slen = opts[o].slen;
/* If we don't have space to write the option, then just skip */
if (*offset + 2 /* code and length of option */ + slen > OPT_MAX)
return true;
m->o[*offset] = o;
m->o[*offset + 1] = opts[o].slen;
memcpy(&m->o[*offset + 2], opts[o].s, opts[o].slen);
m->o[*offset + 1] = slen;
/* Move to option */
*offset += 2;
memcpy(&m->o[*offset], opts[o].s, slen);
opts[o].sent = 1;
*offset += 2 + opts[o].slen;
*offset += slen;
return false;
}
/**
@ -151,9 +169,6 @@ static int fill(struct msg *m)
{
int i, o, offset = 0;
m->op = BOOTREPLY;
m->secs = 0;
for (o = 0; o < 255; o++)
opts[o].sent = 0;
@ -162,21 +177,23 @@ static int fill(struct msg *m)
* Put it there explicitly, unless requested via option 55.
*/
if (opts[55].clen > 0 && !memchr(opts[55].c, 53, opts[55].clen))
fill_one(m, 53, &offset);
if (fill_one(m, 53, &offset))
debug("DHCP: skipping option 53");
for (i = 0; i < opts[55].clen; i++) {
o = opts[55].c[i];
if (opts[o].slen != -1)
fill_one(m, o, &offset);
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
for (o = 0; o < 255; o++) {
if (opts[o].slen != -1 && !opts[o].sent)
fill_one(m, o, &offset);
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
m->o[offset++] = 255;
m->o[offset++] = 0;
if (offset < OPT_MIN) {
memset(&m->o[offset], 0, OPT_MIN - offset);
@ -291,8 +308,9 @@ int dhcp(const struct ctx *c, const struct pool *p)
const struct ethhdr *eh;
const struct iphdr *iph;
const struct udphdr *uh;
struct msg const *m;
struct msg reply;
unsigned int i;
struct msg *m;
eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
offset += sizeof(*eh);
@ -321,6 +339,22 @@ int dhcp(const struct ctx *c, const struct pool *p)
m->op != BOOTREQUEST)
return -1;
reply.op = BOOTREPLY;
reply.htype = m->htype;
reply.hlen = m->hlen;
reply.hops = 0;
reply.xid = m->xid;
reply.secs = 0;
reply.flags = m->flags;
reply.ciaddr = m->ciaddr;
reply.yiaddr = c->ip4.addr;
reply.siaddr = 0;
reply.giaddr = m->giaddr;
memcpy(&reply.chaddr, m->chaddr, sizeof(reply.chaddr));
memset(&reply.sname, 0, sizeof(reply.sname));
memset(&reply.file, 0, sizeof(reply.file));
reply.magic = m->magic;
offset += offsetof(struct msg, o);
for (i = 0; i < ARRAY_SIZE(opts); i++)
@ -364,7 +398,6 @@ int dhcp(const struct ctx *c, const struct pool *p)
info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));
m->yiaddr = c->ip4.addr;
mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
memcpy(opts[1].s, &mask, sizeof(mask));
memcpy(opts[3].s, &c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
@ -384,7 +417,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
&c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
}
if (c->mtu != -1) {
if (c->mtu) {
opts[26].slen = 2;
opts[26].s[0] = c->mtu / 256;
opts[26].s[1] = c->mtu % 256;
@ -398,17 +431,41 @@ int dhcp(const struct ctx *c, const struct pool *p)
if (!opts[6].slen)
opts[6].slen = -1;
opt_len = strlen(c->hostname);
if (opt_len > 0) {
opts[12].slen = opt_len;
memcpy(opts[12].s, &c->hostname, opt_len);
}
opt_len = strlen(c->fqdn);
if (opt_len > 0) {
opt_len += 3 /* flags */
+ 2; /* Length byte for first label, and terminator */
if (sizeof(opts[81].s) >= opt_len) {
opts[81].s[0] = 0x4; /* flags (E) */
opts[81].s[1] = 0xff; /* RCODE1 */
opts[81].s[2] = 0xff; /* RCODE2 */
encode_domain_name((char *)opts[81].s + 3, c->fqdn);
opts[81].slen = opt_len;
} else {
debug("DHCP: client FQDN option doesn't fit, skipping");
}
}
if (!c->no_dhcp_dns_search)
opt_set_dns_search(c, sizeof(m->o));
dlen = offsetof(struct msg, o) + fill(m);
dlen = offsetof(struct msg, o) + fill(&reply);
if (m->flags & FLAG_BROADCAST)
dst = in4addr_broadcast;
else
dst = c->ip4.addr;
tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, m, dlen);
tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, &reply, dlen);
return 1;
}

dhcpv6.c

@ -48,6 +48,7 @@ struct opt_hdr {
# define STATUS_NOTONLINK htons_constant(4)
# define OPT_DNS_SERVERS htons_constant(23)
# define OPT_DNS_SEARCH htons_constant(24)
# define OPT_CLIENT_FQDN htons_constant(39)
#define STR_NOTONLINK "Prefix not appropriate for link."
uint16_t l;
@ -58,6 +59,9 @@ struct opt_hdr {
sizeof(struct opt_hdr))
#define OPT_VSIZE(x) (sizeof(struct opt_##x) - \
sizeof(struct opt_hdr))
#define OPT_MAX_SIZE IPV6_MIN_MTU - (sizeof(struct ipv6hdr) + \
sizeof(struct udphdr) + \
sizeof(struct msg_hdr))
/**
* struct opt_client_id - DHCPv6 Client Identifier option
@ -140,7 +144,9 @@ struct opt_ia_addr {
struct opt_status_code {
struct opt_hdr hdr;
uint16_t code;
char status_msg[sizeof(STR_NOTONLINK) - 1];
/* "nonstring" is only supported since clang 23 */
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((nonstring)) char status_msg[sizeof(STR_NOTONLINK) - 1];
} __attribute__((packed));
/**
@ -163,6 +169,18 @@ struct opt_dns_search {
char list[MAXDNSRCH * NS_MAXDNAME];
} __attribute__((packed));
/**
* struct opt_client_fqdn - Client FQDN option (RFC 4704)
* @hdr: Option header
* @flags: Flags described by RFC 4704
* @domain_name: Client FQDN
*/
struct opt_client_fqdn {
struct opt_hdr hdr;
uint8_t flags;
char domain_name[PASST_MAXDNAME];
} __attribute__((packed));
/**
* struct msg_hdr - DHCPv6 client/server message header
* @type: DHCP message type
@ -193,6 +211,7 @@ struct msg_hdr {
* @client_id: Client Identifier, variable length
* @dns_servers: DNS Recursive Name Server, here just for storage size
* @dns_search: Domain Search List, here just for storage size
* @client_fqdn: Client FQDN, variable length
*/
static struct resp_t {
struct msg_hdr hdr;
@ -203,6 +222,7 @@ static struct resp_t {
struct opt_client_id client_id;
struct opt_dns_servers dns_servers;
struct opt_dns_search dns_search;
struct opt_client_fqdn client_fqdn;
} __attribute__((__packed__)) resp = {
{ 0 },
SERVER_ID,
@ -228,6 +248,10 @@ static struct resp_t {
{ { OPT_DNS_SEARCH, 0, },
{ 0 },
},
{ { OPT_CLIENT_FQDN, 0, },
0, { 0 },
},
};
static const struct opt_status_code sc_not_on_link = {
@ -346,7 +370,6 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
{
struct opt_dns_servers *srv = NULL;
struct opt_dns_search *srch = NULL;
char *p = NULL;
int i;
if (c->no_dhcp_dns)
@ -383,34 +406,81 @@ search:
if (!name_len)
continue;
name_len += 2; /* Length byte for first label, and terminator */
if (name_len >
NS_MAXDNAME + 1 /* Length byte for first label */ ||
name_len > 255) {
debug("DHCP: DNS search name '%s' too long, skipping",
c->dns_search[i].n);
continue;
}
if (!srch) {
srch = (struct opt_dns_search *)(buf + offset);
offset += sizeof(struct opt_hdr);
srch->hdr.t = OPT_DNS_SEARCH;
srch->hdr.l = 0;
p = srch->list;
}
*p = '.';
p = stpncpy(p + 1, c->dns_search[i].n, name_len);
p++;
srch->hdr.l += name_len + 2;
offset += name_len + 2;
encode_domain_name(buf + offset, c->dns_search[i].n);
srch->hdr.l += name_len;
offset += name_len;
}
if (srch) {
for (i = 0; i < srch->hdr.l; i++) {
if (srch->list[i] == '.') {
srch->list[i] = strcspn(srch->list + i + 1,
".");
}
}
if (srch)
srch->hdr.l = htons(srch->hdr.l);
}
return offset;
}
/**
* dhcpv6_client_fqdn_fill() - Fill in client FQDN option
* @p:	Packet pool, containing the client request
* @c:	Execution context
* @buf: Response message buffer where options will be appended
* @offset: Offset in message buffer for new options
*
* Return: updated length of response message buffer.
*/
static size_t dhcpv6_client_fqdn_fill(const struct pool *p, const struct ctx *c,
char *buf, int offset)
{
struct opt_client_fqdn const *req_opt;
struct opt_client_fqdn *o;
size_t opt_len;
opt_len = strlen(c->fqdn);
if (opt_len == 0) {
return offset;
}
opt_len += 2; /* Length byte for first label, and terminator */
if (opt_len > OPT_MAX_SIZE - (offset +
sizeof(struct opt_hdr) +
1 /* flags */ )) {
debug("DHCPv6: client FQDN option doesn't fit, skipping");
return offset;
}
o = (struct opt_client_fqdn *)(buf + offset);
encode_domain_name(o->domain_name, c->fqdn);
req_opt = (struct opt_client_fqdn *)dhcpv6_opt(p, &(size_t){ 0 },
OPT_CLIENT_FQDN);
if (req_opt && req_opt->flags & 0x01 /* S flag */)
o->flags = 0x02 /* O flag */;
else
o->flags = 0x00;
opt_len++;
o->hdr.t = OPT_CLIENT_FQDN;
o->hdr.l = htons(opt_len);
return offset + sizeof(struct opt_hdr) + opt_len;
}
/**
* dhcpv6() - Check if this is a DHCPv6 message, reply as needed
* @c: Execution context
@ -544,6 +614,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
n = offsetof(struct resp_t, client_id) +
sizeof(struct opt_hdr) + ntohs(client_id->l);
n = dhcpv6_dns_fill(c, (char *)&resp, n);
n = dhcpv6_client_fqdn_fill(p, c, (char *)&resp, n);
resp.hdr.xid = mh->xid;

doc/migration/.gitignore vendored Normal file

@ -0,0 +1,2 @@
/source
/target

doc/migration/Makefile Normal file

@ -0,0 +1,20 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
TARGETS = source target
CFLAGS = -Wall -Wextra -pedantic
all: $(TARGETS)
$(TARGETS): %: %.c
clean:
rm -f $(TARGETS)

doc/migration/README Normal file

@ -0,0 +1,51 @@
<!---
SPDX-License-Identifier: GPL-2.0-or-later
Copyright (c) 2025 Red Hat GmbH
Author: Stefano Brivio <sbrivio@redhat.com>
-->
Migration
=========
These test programs show a migration of a TCP connection from one process to
another using the TCP_REPAIR socket option.
The two processes are a mock of the matching implementation in passt(1), and run
unprivileged, so they rely on the passt-repair helper to connect to them and set
or clear TCP_REPAIR on the connection socket, transferred to the helper using
SCM_RIGHTS.
The passt-repair helper needs to have the CAP_NET_ADMIN capability, or run as
root.
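
For reference, the exchange the helper performs on each request looks roughly
like the sketch below. This is a simplified illustration, not the actual
passt-repair source: the helper receives one command byte together with the
connection socket via SCM_RIGHTS, applies the requested TCP_REPAIR setting,
and acknowledges with one byte. The function name `handle_one_command` and the
omitted connect()/error handling are assumptions made for the example.

```c
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>

/* Handle one command from a mock instance: receive a command byte and a
 * TCP socket via SCM_RIGHTS, set TCP_REPAIR as requested, then acknowledge.
 */
static int handle_one_command(int conn)
{
	char buf[CMSG_SPACE(sizeof(int))];
	int8_t cmd;
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
	struct cmsghdr *cmsg;
	int s, op;

	if (recvmsg(conn, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&s, CMSG_DATA(cmsg), sizeof(s));

	op = cmd;	/* TCP_REPAIR_ON, TCP_REPAIR_OFF, TCP_REPAIR_OFF_NO_WP */
	if (setsockopt(s, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
		return -1;

	/* Reply with the same byte to signal completion */
	return send(conn, &cmd, sizeof(cmd), 0) == sizeof(cmd) ? 0 : -1;
}
```

The mock programs below drive exactly this exchange: the source requests
TCP_REPAIR_ON once, while the target requests it on and later off over the
same connection.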
Example of usage
----------------
* Start the test server
$ nc -l 9999
* Start the source side of the TCP client (mock of the source instance of passt)
$ ./source 127.0.0.1 9999 9998 /tmp/repair.sock
* The client sends a test string, and waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The socket is now in repair mode, and `source` dumps sequences, then exits
sending sequence: 3244673313
receiving sequence: 2250449386
* Continue the connection on the target side, restarting from those sequences
$ ./target 127.0.0.1 9999 9998 /tmp/repair.sock 3244673313 2250449386
* The target side now waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The target side asks passt-repair to switch the socket to repair mode, sets up
the TCP sequences, then asks passt-repair to clear repair mode, and sends a
test string to the server

doc/migration/source.c Normal file

@ -0,0 +1,92 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/source.c - Mock of TCP migration source, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int seq, s, s_helper;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
socklen_t seqlen = sizeof(int);
struct addrinfo *r;
(void)argc;
if (argc != 5) {
fprintf(stderr, "%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH\n",
argv[0]);
return -1;
}
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
/* Connect socket to server and send some data */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
connect(s, r->ai_addr, r->ai_addrlen);
send(s, "before migration\n", sizeof("before migration\n"), 0);
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Get sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u ", seq);
/* Get receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u\n", seq);
}

doc/migration/target.c Normal file

@ -0,0 +1,102 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/target.c - Mock of TCP migration target, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int s, s_helper, seq;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
struct addrinfo *r;
(void)argc;
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
if (argc != 7) {
fprintf(stderr,
"%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH SSEQ RSEQ\n",
argv[0]);
return -1;
}
/* Prepare socket, bind to source port */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Set sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[5]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Set receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[6]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Connect setting kernel state only, without actual SYN / handshake */
connect(s, r->ai_addr, r->ai_addrlen);
/* Send command to helper: turn repair mode off, wait for reply */
cmd = TCP_REPAIR_OFF;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Send some more data */
send(s, "after migration\n", sizeof("after migration\n"), 0);
}


@ -1,3 +1,4 @@
/listen-vs-repair
/reuseaddr-priority
/recv-zero
/udp-close-dup


@ -3,8 +3,8 @@
# Copyright Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
TARGETS = reuseaddr-priority recv-zero udp-close-dup
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c
TARGETS = reuseaddr-priority recv-zero udp-close-dup listen-vs-repair
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c listen-vs-repair.c
CFLAGS = -Wall
all: cppcheck clang-tidy $(TARGETS:%=check-%)


@ -15,6 +15,7 @@
#include <stdio.h>
#include <stdlib.h>
__attribute__((format(printf, 1, 2), noreturn))
static inline void die(const char *fmt, ...)
{
va_list ap;


@ -0,0 +1,128 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* listen-vs-repair.c
*
* Do listening sockets have address conflicts with sockets under repair
* ====================================================================
*
* When we accept() an incoming connection the accept()ed socket will have the
* same local address as the listening socket. This can be a complication on
* migration. On the migration target we've already set up listening sockets
* according to the command line. However to restore connections that we're
* migrating in we need to bind the new sockets to the same address, which would
* be an address conflict on the face of it. This test program verifies that
* enabling repair mode before bind() correctly suppresses that conflict.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
/* NOLINTNEXTLINE(bugprone-reserved-identifier,cert-dcl37-c,cert-dcl51-cpp) */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define PORT 13256U
#define CPORT 13257U
/* 127.0.0.1:PORT */
static const struct sockaddr_in addr = SOCKADDR_INIT(INADDR_LOOPBACK, PORT);
/* 127.0.0.1:CPORT */
static const struct sockaddr_in caddr = SOCKADDR_INIT(INADDR_LOOPBACK, CPORT);
/* Put ourselves into a network sandbox */
static void net_sandbox(void)
{
/* NOLINTNEXTLINE(altera-struct-pack-align) */
const struct req_t {
struct nlmsghdr nlh;
struct ifinfomsg ifm;
} __attribute__((packed)) req = {
.nlh.nlmsg_type = RTM_NEWLINK,
.nlh.nlmsg_flags = NLM_F_REQUEST,
.nlh.nlmsg_len = sizeof(req),
.nlh.nlmsg_seq = 1,
.ifm.ifi_family = AF_UNSPEC,
.ifm.ifi_index = 1,
.ifm.ifi_flags = IFF_UP,
.ifm.ifi_change = IFF_UP,
};
int nl;
if (unshare(CLONE_NEWUSER | CLONE_NEWNET))
die("unshare(): %s\n", strerror(errno));
/* Bring up lo in the new netns */
nl = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
if (nl < 0)
die("Can't create netlink socket: %s\n", strerror(errno));
if (send(nl, &req, sizeof(req), 0) < 0)
die("Netlink send(): %s\n", strerror(errno));
close(nl);
}
static void check(void)
{
int s1, s2, op;
s1 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s1 < 0)
die("socket() 1: %s\n", strerror(errno));
if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 1: %s\n", strerror(errno));
if (listen(s1, 0))
die("listen(): %s\n", strerror(errno));
s2 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s2 < 0)
die("socket() 2: %s\n", strerror(errno));
op = TCP_REPAIR_ON;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
if (bind(s2, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 2: %s\n", strerror(errno));
if (connect(s2, (struct sockaddr *)&caddr, sizeof(caddr)))
die("connect(): %s\n", strerror(errno));
op = TCP_REPAIR_OFF_NO_WP;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
close(s1);
close(s2);
}
int main(int argc, char *argv[])
{
(void)argc;
(void)argv;
net_sandbox();
check();
printf("Repair mode appears to properly suppress conflicts with listening sockets\n");
exit(0);
}


@ -46,13 +46,13 @@
/* Different cases for receiving socket configuration */
enum sock_type {
/* Socket is bound to 0.0.0.0:DSTPORT and not connected */
SOCK_BOUND_ANY = 0,
SOCK_BOUND_ANY,
/* Socket is bound to 127.0.0.1:DSTPORT and not connected */
SOCK_BOUND_LO = 1,
SOCK_BOUND_LO,
/* Socket is bound to 0.0.0.0:DSTPORT and connected to 127.0.0.1:SRCPORT */
SOCK_CONNECTED = 2,
SOCK_CONNECTED,
NUM_SOCK_TYPES,
};


@ -22,8 +22,8 @@ enum epoll_type {
EPOLL_TYPE_TCP_TIMER,
/* UDP "listening" sockets */
EPOLL_TYPE_UDP_LISTEN,
/* UDP socket for replies on a specific flow */
EPOLL_TYPE_UDP_REPLY,
/* UDP socket for a specific flow */
EPOLL_TYPE_UDP,
/* ICMP/ICMPv6 ping sockets */
EPOLL_TYPE_PING,
/* inotify fd watching for end of netns (pasta) */
@ -40,8 +40,10 @@ enum epoll_type {
EPOLL_TYPE_VHOST_CMD,
/* vhost-user kick event socket */
EPOLL_TYPE_VHOST_KICK,
/* vhost-user migration socket */
EPOLL_TYPE_VHOST_MIGRATION,
/* TCP_REPAIR helper listening socket */
EPOLL_TYPE_REPAIR_LISTEN,
/* TCP_REPAIR helper socket */
EPOLL_TYPE_REPAIR,
EPOLL_NUM_TYPES,
};

flow.c

@ -19,6 +19,7 @@
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
const char *flow_state_str[] = {
[FLOW_STATE_FREE] = "FREE",
@ -52,6 +53,13 @@ const uint8_t flow_proto[] = {
static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
"flow_proto[] doesn't match enum flow_type");
#define foreach_established_tcp_flow(flow) \
flow_foreach_of_type((flow), FLOW_TCP) \
if (!tcp_flow_is_established(&(flow)->tcp)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
/* Global Flow Table */
/**
@ -73,7 +81,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
*
* Free cluster list
* flow_first_free gives the index of the first (lowest index) free cluster.
* Each free cluster has the index of the next free cluster, or MAX_FLOW if
* Each free cluster has the index of the next free cluster, or FLOW_MAX if
* it is the last free cluster. Together these form a linked list of free
* clusters, in strictly increasing order of index.
*
@ -259,11 +267,13 @@ int flowside_connect(const struct ctx *c, int s,
/** flow_log_ - Log flow-related message
* @f: flow the message is related to
* @newline: Append newline at the end of the message, if missing
* @pri: Log priority
* @fmt: Format string
* @...: printf-arguments
*/
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
{
const char *type_or_state;
char msg[BUFSIZ];
@ -279,7 +289,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
else
type_or_state = FLOW_TYPE(f);
logmsg(true, false, pri,
logmsg(newline, false, pri,
"Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
}
@ -299,7 +309,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
const struct flowside *tgt = &f->side[TGTSIDE];
if (state >= FLOW_STATE_TGT)
flow_log_(f, pri,
flow_log_(f, true, pri,
"%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@ -312,7 +322,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
tgt->eport);
else if (state >= FLOW_STATE_INI)
flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
flow_log_(f, true, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
@ -333,7 +343,7 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
flow_log_(f, true, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
@ -386,18 +396,27 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
* @flow: Flow to change state
* @pif: pif of the initiating side
* @ssa: Source socket address
* @daddr: Destination address (may be NULL)
* @dport: Destination port
*
* Return: pointer to the initiating flowside information
*/
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport)
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport)
{
struct flowside *ini = &flow->f.side[INISIDE];
inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
if (inany_v4(&ini->eaddr))
if (inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa) < 0) {
char str[SOCKADDR_STRLEN];
ASSERT_WITH_MSG(0, "Bad socket address %s",
sockaddr_ntop(ssa, str, sizeof(str)));
}
if (daddr)
ini->oaddr = *daddr;
else if (inany_v4(&ini->eaddr))
ini->oaddr = inany_any4;
else
ini->oaddr = inany_any6;
@ -414,8 +433,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
*
* Return: pointer to the target flowside information
*/
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
{
char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
struct flow_common *f = &flow->f;
@ -741,19 +760,30 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: Interface of the flow
* @esa: Socket address of the endpoint
* @oaddr: Our address (may be NULL)
* @oport: Our port number
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport)
const void *esa,
const union inany_addr *oaddr, in_port_t oport)
{
struct flowside side = {
.oport = oport,
};
inany_from_sockaddr(&side.eaddr, &side.eport, esa);
if (inany_v4(&side.eaddr))
if (inany_from_sockaddr(&side.eaddr, &side.eport, esa) < 0) {
char str[SOCKADDR_STRLEN];
warn("Flow lookup on bad socket address %s",
sockaddr_ntop(esa, str, sizeof(str)));
return FLOW_SIDX_NONE;
}
if (oaddr)
side.oaddr = *oaddr;
else if (inany_v4(&side.eaddr))
side.oaddr = inany_any4;
else
side.oaddr = inany_any6;
@ -770,8 +800,9 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
{
struct flow_free_cluster *free_head = NULL;
unsigned *last_next = &flow_first_free;
bool to_free[FLOW_MAX] = { 0 };
bool timer = false;
unsigned idx;
union flow *flow;
if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
timer = true;
@ -780,49 +811,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
for (idx = 0; idx < FLOW_MAX; idx++) {
union flow *flow = &flowtab[idx];
/* Check which flows we might need to close first, but don't free them
* yet as it's not safe to do that in the middle of flow_foreach().
*/
flow_foreach(flow) {
bool closed = false;
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* New free cluster, add to chain */
free_head = &flow->free;
*last_next = idx;
last_next = &free_head->next;
}
/* Skip remaining empty entries */
idx += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
/* Nothing to do */
break;
default:
ASSERT(false);
}
switch (flow->f.type) {
case FLOW_TYPE_NONE:
ASSERT(false);
@ -841,7 +835,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
closed = udp_flow_defer(&flow->udp);
closed = udp_flow_defer(c, &flow->udp, now);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
@ -850,30 +844,321 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
;
}
if (closed) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
to_free[FLOW_IDX(flow)] = closed;
}
/* Second step: actually free the flows */
flow_foreach_slot(flow) {
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Add slot to current free cluster */
ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
free_head->n++;
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
/* New free cluster, add to chain */
free_head = &flow->free;
free_head->n = 1;
*last_next = idx;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
/* Skip remaining empty entries */
flow += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
if (to_free[FLOW_IDX(flow)]) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
if (free_head) {
/* Add slot to current free cluster */
ASSERT(FLOW_IDX(flow) ==
FLOW_IDX(free_head) + free_head->n);
free_head->n++;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
free_head = &flow->free;
free_head->n = 1;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
}
break;
default:
ASSERT(false);
}
}
*last_next = FLOW_MAX;
}
/**
* flow_migrate_source_rollback() - Disable repair mode, return failure
* @c: Execution context
* @bound: No need to roll back flow indices >= @bound
* @ret: Negative error code
*
* Return: @ret
*/
static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
{
union flow *flow;
debug("...roll back migration");
foreach_established_tcp_flow(flow) {
if (FLOW_IDX(flow) >= bound)
break;
if (tcp_flow_repair_off(c, &flow->tcp))
die("Failed to roll back TCP_REPAIR mode");
}
if (repair_flush(c))
die("Failed to roll back TCP_REPAIR mode");
return ret;
}
/**
* flow_migrate_need_repair() - Do we need to set repair mode for any flow?
*
* Return: true if repair mode is needed, false otherwise
*/
static bool flow_migrate_need_repair(void)
{
union flow *flow;
foreach_established_tcp_flow(flow)
return true;
return false;
}
/**
* flow_migrate_repair_all() - Turn repair mode on or off for all flows
* @c: Execution context
* @enable: Switch repair mode on if set, off otherwise
*
* Return: 0 on success, negative error code on failure
*/
static int flow_migrate_repair_all(struct ctx *c, bool enable)
{
union flow *flow;
int rc;
/* If we don't have a repair helper, there's nothing we can do */
if (c->fd_repair < 0)
return 0;
foreach_established_tcp_flow(flow) {
if (enable)
rc = tcp_flow_repair_on(c, &flow->tcp);
else
rc = tcp_flow_repair_off(c, &flow->tcp);
if (rc) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow),
rc);
}
}
if ((rc = repair_flush(c))) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow), rc);
}
return 0;
}
/**
* flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor (unused)
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
int rc;
(void)stage;
(void)fd;
if (flow_migrate_need_repair())
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
return 0;
}
/**
* flow_migrate_source() - Dump all the remaining information and send data
* @c: Execution context (unused)
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count = 0;
bool first = true;
union flow *flow;
int rc;
(void)c;
(void)stage;
/* If we don't have a repair helper, we can't migrate TCP flows */
if (c->fd_repair >= 0) {
foreach_established_tcp_flow(flow)
count++;
}
count = htonl(count);
if (write_all_buf(fd, &count, sizeof(count))) {
rc = errno;
err_perror("Can't send flow count (%u)", ntohl(count));
return flow_migrate_source_rollback(c, FLOW_MAX, rc);
}
debug("Sending %u flows", ntohl(count));
if (!count)
return 0;
/* Dump and send information that can be stored in the flow table.
*
* Limited rollback options here: if we fail to transfer any data (that
* is, on the first flow), undo everything and resume. Otherwise, the
* stream might now be inconsistent, and we might have closed listening
* TCP sockets, so just terminate.
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send data: %s",
strerror_(-rc));
if (!first)
die("Inconsistent migration state, exiting");
return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
}
first = false;
}
/* And then "extended" data (including window data we saved previously):
* the target needs to set repair mode on sockets before it can set
* this stuff, but it needs sockets (and flows) for that.
*
* This also closes sockets so that the target can start connecting
* theirs: you can't sendmsg() to queues (using the socket) if the
* socket is not connected (EPIPE), not even in repair mode. And the
* target needs to restore queues now because we're sending the data.
*
* So, no rollback here, just try as hard as we can. Tolerate per-flow
* failures but not if the stream might be inconsistent (reported here
* as EIO).
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send extended data: %s",
strerror_(-rc));
if (rc == -EIO)
die("Inconsistent migration state, exiting");
}
}
return 0;
}
/**
* flow_migrate_target() - Receive flows and insert in flow table
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count;
unsigned i;
int rc;
(void)stage;
if (read_all_buf(fd, &count, sizeof(count)))
return errno;
count = ntohl(count);
debug("Receiving %u flows", count);
if (!count)
return 0;
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
repair_flush(c);
/* TODO: flow header with type, instead? */
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target(c, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
repair_flush(c);
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target_ext(c, &flowtab[i].tcp, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
return 0;
}
/**
* flow_init() - Initialise flow related data structures
*/

flow.h

@ -243,18 +243,27 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
const void *eaddr, const void *oaddr,
in_port_t eport, in_port_t oport);
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport);
const void *esa,
const union inany_addr *oaddr, in_port_t oport);
union flow;
void flow_init(void);
void flow_defer_handler(const struct ctx *c, const struct timespec *now);
int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd);
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
__attribute__((format(printf, 3, 4)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, (pri), __VA_ARGS__)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
__attribute__((format(printf, 4, 5)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, true, (pri), __VA_ARGS__)
#define flow_dbg(f, ...) flow_log((f), LOG_DEBUG, __VA_ARGS__)
#define flow_err(f, ...) flow_log((f), LOG_ERR, __VA_ARGS__)
@ -264,6 +273,16 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
flow_dbg((f), __VA_ARGS__); \
} while (0)
#define flow_log_perror_(f, pri, ...) \
do { \
int errno_ = errno; \
flow_log_((f), false, (pri), __VA_ARGS__); \
logmsg(true, true, (pri), ": %s", strerror_(errno_)); \
} while (0)
#define flow_dbg_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_DEBUG, __VA_ARGS__)
#define flow_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_ERR, __VA_ARGS__)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state);
#define flow_log_details(f_, pri) \


@ -50,6 +50,42 @@ extern union flow flowtab[];
#define flow_foreach_sidei(sidei_) \
for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)
/**
* flow_foreach_slot() - Step through each flow table entry
* @flow: Takes values of pointer to each flow table entry
*
* Includes FREE slots.
*/
#define flow_foreach_slot(flow) \
for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)
/**
* flow_foreach() - Step through each active flow
* @flow: Takes values of pointer to each active flow
*/
#define flow_foreach(flow) \
flow_foreach_slot((flow)) \
if ((flow)->f.state == FLOW_STATE_FREE) \
(flow) += (flow)->free.n - 1; \
else if ((flow)->f.state != FLOW_STATE_ACTIVE) { \
flow_err((flow), "Bad flow state during traversal"); \
continue; \
} else
/**
* flow_foreach_of_type() - Step through each active flow of given type
* @flow: Takes values of pointer to each flow
* @type_: Type of flow to traverse
*/
#define flow_foreach_of_type(flow, type_) \
flow_foreach((flow)) \
if ((flow)->f.type != (type_)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
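As a usage sketch only (FLOW_TCP stands for the TCP entry of enum flow_type defined elsewhere in this header; the body is illustrative), a caller walking all active TCP flows with the new macro could look like:

	union flow *flow;

	flow_foreach_of_type(flow, FLOW_TCP) {
		/* Only FLOW_STATE_ACTIVE entries of type FLOW_TCP reach this
		 * body; runs of FREE slots are skipped in a single step via
		 * flow->free.n in flow_foreach().
		 */
		flow_dbg(flow, "visiting TCP flow");
	}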
/** flow_idx() - Index of flow from common structure
* @f: Common flow fields pointer
*
@ -57,6 +93,7 @@ extern union flow flowtab[];
*/
static inline unsigned flow_idx(const struct flow_common *f)
{
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
return (union flow *)f - flowtab;
}
@ -161,15 +198,16 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport);
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport);
const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type);
#define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_)

fwd.c

@ -323,6 +323,30 @@ static bool fwd_guest_accessible(const struct ctx *c,
return fwd_guest_accessible6(c, &addr->a6);
}
/**
* nat_outbound() - Apply address translation for outbound (TAP to HOST)
* @c: Execution context
* @addr: Input address (as seen on TAP interface)
* @translated: Output address (as seen on HOST interface)
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (inany_equals4(addr, &c->ip4.map_host_loopback))
*translated = inany_loopback4;
else if (inany_equals6(addr, &c->ip6.map_host_loopback))
*translated = inany_loopback6;
else if (inany_equals4(addr, &c->ip4.map_guest_addr))
*translated = inany_from_v4(c->ip4.addr);
else if (inany_equals6(addr, &c->ip6.map_guest_addr))
translated->a6 = c->ip6.addr;
else
*translated = *addr;
}
/**
* fwd_nat_from_tap() - Determine to forward a flow from the tap interface
* @c: Execution context
@ -342,16 +366,8 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
else if (is_dns_flow(proto, ini) &&
inany_equals6(&ini->oaddr, &c->ip6.dns_match))
tgt->eaddr.a6 = c->ip6.dns_host;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_host_loopback))
tgt->eaddr = inany_loopback4;
else if (inany_equals6(&ini->oaddr, &c->ip6.map_host_loopback))
tgt->eaddr = inany_loopback6;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_guest_addr))
tgt->eaddr = inany_from_v4(c->ip4.addr);
else if (inany_equals6(&ini->oaddr, &c->ip6.map_guest_addr))
tgt->eaddr.a6 = c->ip6.addr;
else
tgt->eaddr = ini->oaddr;
nat_outbound(c, &ini->oaddr, &tgt->eaddr);
tgt->eport = ini->oport;
@ -402,7 +418,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
else
tgt->eaddr = inany_loopback6;
/* Preserve the specific loopback adddress used, but let the kernel pick
/* Preserve the specific loopback address used, but let the kernel pick
* a source port on the target side
*/
tgt->oaddr = ini->eaddr;
@ -423,6 +439,42 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
return PIF_HOST;
}
/**
* nat_inbound() - Apply address translation for inbound (HOST to TAP)
* @c: Execution context
* @addr: Input address (as seen on HOST interface)
* @translated: Output address (as seen on TAP interface)
*
* Return: true on success, false if it couldn't translate the address
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(addr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
*translated = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(addr, &in6addr_loopback)) {
translated->a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(addr, &c->ip4.addr)) {
*translated = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(addr, &c->ip6.addr)) {
translated->a6 = c->ip6.map_guest_addr;
} else if (fwd_guest_accessible(c, addr)) {
*translated = *addr;
} else {
return false;
}
return true;
}
/**
* fwd_nat_from_host() - Determine to forward a flow from the host interface
* @c: Execution context
@ -479,20 +531,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
return PIF_SPLICE;
}
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(&ini->eaddr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
tgt->oaddr = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(&ini->eaddr, &in6addr_loopback)) {
tgt->oaddr.a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(&ini->eaddr, &c->ip4.addr)) {
tgt->oaddr = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(&ini->eaddr, &c->ip6.addr)) {
tgt->oaddr.a6 = c->ip6.map_guest_addr;
} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
if (!nat_inbound(c, &ini->eaddr, &tgt->oaddr)) {
if (inany_v4(&ini->eaddr)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
/* No source address we can use */
@ -501,8 +540,6 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
} else {
tgt->oaddr.a6 = c->ip6.our_tap_ll;
}
} else {
tgt->oaddr = ini->eaddr;
}
tgt->oport = ini->eport;

fwd.h

@ -7,6 +7,7 @@
#ifndef FWD_H
#define FWD_H
union inany_addr;
struct flowside;
/* Number of ports for both TCP and UDP */
@ -47,6 +48,8 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
const struct fwd_ports *tcp_rev);
void fwd_scan_ports_init(struct ctx *c);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated);
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,


@ -56,6 +56,7 @@ cd ..
make pkgs
scp passt passt.avx2 passt.1 qrap qrap.1 "${USER_HOST}:${BIN}"
scp pasta pasta.avx2 pasta.1 "${USER_HOST}:${BIN}"
scp passt-repair passt-repair.1 "${USER_HOST}:${BIN}"
ssh "${USER_HOST}" "rm -f ${BIN}/*.deb"
ssh "${USER_HOST}" "rm -f ${BIN}/*.rpm"

icmp.c

@ -85,7 +85,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl);
if (n < 0) {
flow_err(pingf, "recvfrom() error: %s", strerror_(errno));
flow_perror(pingf, "recvfrom() error");
return;
}
@ -150,7 +150,7 @@ unexpected:
static void icmp_ping_close(const struct ctx *c,
const struct icmp_ping_flow *pingf)
{
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
epoll_del(c, pingf->sock);
close(pingf->sock);
flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
}
@ -300,8 +300,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0);
if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
flow_dbg(pingf, "failed to relay request to socket: %s",
strerror_(errno));
flow_dbg_perror(pingf, "failed to relay request to socket");
} else {
flow_dbg(pingf,
"echo request to socket, ID: %"PRIu16", seq: %"PRIu16,

inany.h

@ -237,23 +237,30 @@ static inline void inany_from_af(union inany_addr *aa,
}
/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
* @aa: Pointer to store IPv[46] address
* @dst: Pointer to store IPv[46] address (output)
* @port: Pointer to store port number, host order
* @addr: AF_INET or AF_INET6 socket address
* @addr: Socket address
*
* Return: 0 on success, -1 on error (bad address family)
*/
static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
const union sockaddr_inany *sa)
static inline int inany_from_sockaddr(union inany_addr *dst, in_port_t *port,
const void *addr)
{
const union sockaddr_inany *sa = (const union sockaddr_inany *)addr;
if (sa->sa_family == AF_INET6) {
inany_from_af(aa, AF_INET6, &sa->sa6.sin6_addr);
inany_from_af(dst, AF_INET6, &sa->sa6.sin6_addr);
*port = ntohs(sa->sa6.sin6_port);
} else if (sa->sa_family == AF_INET) {
inany_from_af(aa, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
} else {
/* Not valid to call with other address families */
ASSERT(0);
return 0;
}
if (sa->sa_family == AF_INET) {
inany_from_af(dst, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
return 0;
}
return -1;
}
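A minimal caller sketch for the new int-returning signature (sa stands for a union sockaddr_inany filled by recvfrom() or accept(); debug() is the existing logging helper):

	union inany_addr addr;
	in_port_t port;

	if (inany_from_sockaddr(&addr, &port, &sa) < 0) {
		debug("Unsupported address family: %i", sa.sa_family);
		return;
	}

Unlike the previous version, which asserted on unknown address families, callers are now expected to check the return value.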
/** inany_siphash_feed- Fold IPv[46] address into an in-progress siphash

iov.c

@ -26,7 +26,8 @@
#include "iov.h"
/* iov_skip_bytes() - Skip leading bytes of an IO vector
/**
* iov_skip_bytes() - Skip leading bytes of an IO vector
* @iov: IO vector
* @n: Number of entries in @iov
* @skip: Number of leading bytes of @iov to skip
@ -56,8 +57,8 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
}
/**
* iov_from_buf - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
* iov_from_buf() - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -96,8 +97,8 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_to_buf - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
* iov_to_buf() - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
*
* @iov: Pointer to the array of struct iovec describing the scatter/gather
* I/O vector.
@ -136,8 +137,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_size - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
* iov_size() - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -203,6 +204,7 @@ size_t iov_tail_size(struct iov_tail *tail)
* overruns the IO vector, is not contiguous or doesn't have the
* requested alignment.
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align)
{
char *p;

ip.h

@ -36,13 +36,14 @@
.tos = 0, \
.tot_len = 0, \
.id = 0, \
.frag_off = 0, \
.frag_off = htons(IP_DF), \
.ttl = 0xff, \
.protocol = (proto), \
.saddr = 0, \
.daddr = 0, \
}
#define L2_BUF_IP4_PSUM(proto) ((uint32_t)htons_constant(0x4500) + \
(uint32_t)htons_constant(IP_DF) + \
(uint32_t)htons(0xff00 | (proto)))
@ -90,10 +91,34 @@ struct ipv6_opt_hdr {
*/
} __attribute__((packed)); /* required for some archs */
/**
* ip6_set_flow_lbl() - Set flow label in an IPv6 header
* @ip6h: Pointer to IPv6 header, updated
* @flow: Set @ip6h flow label to the low 20 bits of this integer
*/
static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow)
{
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
}
/** ip6_get_flow_lbl() - Get flow label from an IPv6 header
* @ip6h: Pointer to IPv6 header
*
* Return: flow label from @ip6h as an integer (<= 20 bits)
*/
static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
{
return (ip6h->flow_lbl[0] & 0xf) << 16 |
ip6h->flow_lbl[1] << 8 |
ip6h->flow_lbl[2];
}
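As a concrete example of the layout these helpers assume: for flow = 0x12345, ip6_set_flow_lbl() stores flow_lbl[] = { 0x01, 0x23, 0x45 } (4 bits in the first byte, 8 bits in each of the other two), and ip6_get_flow_lbl() reassembles 0x12345 from those same three bytes.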
char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
size_t *dlen);
/* IPv6 link-local all-nodes multicast adddress, ff02::1 */
/* IPv6 link-local all-nodes multicast address, ff02::1 */
static const struct in6_addr in6addr_ll_all_nodes = {
.s6_addr = {
0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@ -104,4 +129,11 @@ static const struct in6_addr in6addr_ll_all_nodes = {
/* IPv4 Limited Broadcast (RFC 919, Section 7), 255.255.255.255 */
static const struct in_addr in4addr_broadcast = { 0xffffffff };
#ifndef IPV4_MIN_MTU
#define IPV4_MIN_MTU 68
#endif
#ifndef IPV6_MIN_MTU
#define IPV6_MIN_MTU 1280
#endif
#endif /* IP_H */

isolation.c

@ -129,7 +129,7 @@ static void drop_caps_ep_except(uint64_t keep)
* additional layer of protection. Executing this requires
* CAP_SETPCAP, which we will have within our userns.
*
* Note that dropping capabilites from the bounding set limits
* Note that dropping capabilities from the bounding set limits
* exec()ed processes, but does not remove them from the effective or
* permitted sets, so it doesn't reduce our own capabilities.
*/
@ -174,8 +174,8 @@ static void clamp_caps(void)
* Should:
* - drop unneeded capabilities
* - close all open files except for standard streams and the one from --fd
* Musn't:
* - remove filesytem access (we need to access files during setup)
* Mustn't:
* - remove filesystem access (we need to access files during setup)
*/
void isolate_initial(int argc, char **argv)
{
@ -194,7 +194,7 @@ void isolate_initial(int argc, char **argv)
*
* It's debatable whether it's useful to drop caps when we
* retain SETUID and SYS_ADMIN, but we might as well. We drop
* further capabilites in isolate_user() and
* further capabilities in isolate_user() and
* isolate_prefork().
*/
keep = BIT(CAP_NET_BIND_SERVICE) | BIT(CAP_SETUID) | BIT(CAP_SETGID) |

log.c

@ -56,7 +56,7 @@ bool log_stderr = true; /* Not daemonised, no shell spawned */
*
* Return: pointer to @now, or NULL if there was an error retrieving the time
*/
const struct timespec *logtime(struct timespec *ts)
static const struct timespec *logtime(struct timespec *ts)
{
if (clock_gettime(CLOCK_MONOTONIC, ts))
return NULL;
@ -249,6 +249,30 @@ static void logfile_write(bool newline, bool cont, int pri,
log_written += n;
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
static void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* vlogmsg() - Print or send messages to log or output files as configured
* @newline: Append newline at the end of the message, if missing
@ -257,6 +281,7 @@ static void logfile_write(bool newline, bool cont, int pri,
* @format: Message
* @ap: Variable argument list
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
{
bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;
@ -373,35 +398,11 @@ void __setlogmask(int mask)
setlogmask(mask);
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* logfile_init() - Open log file and write header with PID, version, path
* @name: Identifier for header: passt or pasta
* @path: Path to log file
* @size: Maximum size of log file: log_cut_size is calculatd here
* @size: Maximum size of log file: log_cut_size is calculated here
*/
void logfile_init(const char *name, const char *path, size_t size)
{

log.h

@ -32,13 +32,13 @@ void logmsg_perror(int pri, const char *format, ...)
#define die(...) \
do { \
err(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
#define die_perror(...) \
do { \
err_perror(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
extern int log_trace;
@ -55,7 +55,6 @@ void trace_init(int enable);
void __openlog(const char *ident, int option, int facility);
void logfile_init(const char *name, const char *path, size_t size);
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap);
void __setlogmask(int mask);
#endif /* LOG_H */

migrate.c (new file)

@ -0,0 +1,304 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* migrate.c - Migration sections, layout, and routines
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "migrate.h"
#include "repair.h"
/* Magic identifier for migration data */
#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
/**
* struct migrate_seen_addrs_v1 - Migratable guest addresses for v1 state stream
* @addr6: Observed guest IPv6 address
* @addr6_ll: Observed guest IPv6 link-local address
* @addr4: Observed guest IPv4 address
* @mac: Observed guest MAC address
*/
struct migrate_seen_addrs_v1 {
struct in6_addr addr6;
struct in6_addr addr6_ll;
struct in_addr addr4;
unsigned char mac[ETH_ALEN];
} __attribute__((packed));
/**
* seen_addrs_source_v1() - Copy and send guest observed addresses from source
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int seen_addrs_source_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs = {
.addr6 = c->ip6.addr_seen,
.addr6_ll = c->ip6.addr_ll_seen,
.addr4 = c->ip4.addr_seen,
};
(void)stage;
memcpy(addrs.mac, c->guest_mac, sizeof(addrs.mac));
if (write_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
return 0;
}
/**
* seen_addrs_target_v1() - Receive and use guest observed addresses on target
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int seen_addrs_target_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs;
(void)stage;
if (read_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
c->ip6.addr_seen = addrs.addr6;
c->ip6.addr_ll_seen = addrs.addr6_ll;
c->ip4.addr_seen = addrs.addr4;
memcpy(c->guest_mac, addrs.mac, sizeof(c->guest_mac));
return 0;
}
/* Stages for version 2 */
static const struct migrate_stage stages_v2[] = {
{
.name = "observed addresses",
.source = seen_addrs_source_v1,
.target = seen_addrs_target_v1,
},
{
.name = "prepare flows",
.source = flow_migrate_source_pre,
.target = NULL,
},
{
.name = "transfer flows",
.source = flow_migrate_source,
.target = flow_migrate_target,
},
{ 0 },
};
/* Supported encoding versions, from latest (most preferred) to oldest */
static const struct migrate_version versions[] = {
{ 2, stages_v2, },
/* v1 was released, but not widely used. It had bad endianness for the
* MSS and omitted timestamps, which meant it usually wouldn't work.
* Therefore we don't attempt to support compatibility with it.
*/
{ 0 },
};
/* Current encoding version */
#define CURRENT_VERSION (&versions[0])
/**
* migrate_source() - Migration as source, send state to hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_source(struct ctx *c, int fd)
{
const struct migrate_version *v = CURRENT_VERSION;
const struct migrate_header header = {
.magic = htonll_constant(MIGRATE_MAGIC),
.version = htonl(v->id),
.compat_version = htonl(v->id),
};
const struct migrate_stage *s;
int ret;
if (write_all_buf(fd, &header, sizeof(header))) {
ret = errno;
err("Can't send migration header: %s, abort", strerror_(ret));
return ret;
}
for (s = v->s; s->name; s++) {
if (!s->source)
continue;
debug("Source side migration stage: %s", s->name);
if ((ret = s->source(c, s, fd))) {
err("Source migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_target_read_header() - Read header in target
* @fd: Descriptor for state transfer
*
* Return: version structure on success, NULL on failure with errno set
*/
static const struct migrate_version *migrate_target_read_header(int fd)
{
const struct migrate_version *v;
struct migrate_header h;
uint32_t id, compat_id;
if (read_all_buf(fd, &h, sizeof(h)))
return NULL;
id = ntohl(h.version);
compat_id = ntohl(h.compat_version);
debug("Source magic: 0x%016" PRIx64 ", version: %u, compat: %u",
ntohll(h.magic), id, compat_id);
if (ntohll(h.magic) != MIGRATE_MAGIC || !id || !compat_id) {
err("Invalid incoming device state");
errno = EINVAL;
return NULL;
}
for (v = versions; v->id; v++)
if (v->id <= id && v->id >= compat_id)
return v;
errno = ENOTSUP;
err("Unsupported device state version: %u", id);
return NULL;
}
/**
* migrate_target() - Migration as target, receive state from hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_target(struct ctx *c, int fd)
{
const struct migrate_version *v;
const struct migrate_stage *s;
int ret;
if (!(v = migrate_target_read_header(fd)))
return errno;
for (s = v->s; s->name; s++) {
if (!s->target)
continue;
debug("Target side migration stage: %s", s->name);
if ((ret = s->target(c, s, fd))) {
err("Target migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_init() - Set up things necessary for migration
* @c: Execution context
*/
void migrate_init(struct ctx *c)
{
c->device_state_result = -1;
}
/**
* migrate_close() - Close migration channel and connection to passt-repair
* @c: Execution context
*/
void migrate_close(struct ctx *c)
{
if (c->device_state_fd != -1) {
debug("Closing migration channel, fd: %d", c->device_state_fd);
close(c->device_state_fd);
c->device_state_fd = -1;
c->device_state_result = -1;
}
repair_close(c);
}
/**
* migrate_request() - Request a migration of device state
* @c: Execution context
* @fd: fd to transfer state
* @target: Are we the target of the migration?
*/
void migrate_request(struct ctx *c, int fd, bool target)
{
debug("Migration requested, fd: %d (was %d)", fd, c->device_state_fd);
if (c->device_state_fd != -1)
migrate_close(c);
c->device_state_fd = fd;
c->migrate_target = target;
}
/**
* migrate_handler() - Send/receive passt internal state to/from hypervisor
* @c: Execution context
*/
void migrate_handler(struct ctx *c)
{
int rc;
if (c->device_state_fd < 0)
return;
debug("Handling migration request from fd: %d, target: %d",
c->device_state_fd, c->migrate_target);
if (c->migrate_target)
rc = migrate_target(c, c->device_state_fd);
else
rc = migrate_source(c, c->device_state_fd);
migrate_close(c);
c->device_state_result = rc;
}

migrate.h (new file)

@ -0,0 +1,51 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef MIGRATE_H
#define MIGRATE_H
/**
* struct migrate_header - Migration header from source
* @magic: 0xB1BB1D1B0BB1D1B0, network order
* @version: Highest known, target aborts if too old, network order
* @compat_version: Lowest version compatible with @version, target aborts
* if too new, network order
*/
struct migrate_header {
uint64_t magic;
uint32_t version;
uint32_t compat_version;
} __attribute__((packed));
/**
* struct migrate_stage - Callbacks and parameters for one stage of migration
* @name: Stage name (for debugging)
* @source: Callback to implement this stage on the source
* @target: Callback to implement this stage on the target
*/
struct migrate_stage {
const char *name;
int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
/* Add here separate rollback callbacks if needed */
};
/**
* struct migrate_version - Stages for a particular protocol version
* @id: Version number, host order
* @s: Ordered array of stages, NULL-terminated
*/
struct migrate_version {
uint32_t id;
const struct migrate_stage *s;
};
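Following the pattern in migrate.c above, a hypothetical future protocol bump would add a new stage array and list it first (stages_v3 is purely illustrative):

	static const struct migrate_version versions[] = {
		{ 3, stages_v3, },
		{ 2, stages_v2, },
		{ 0 },
	};

CURRENT_VERSION always points at the first entry, while the target walks the array and picks the first (highest) version whose id falls within the source's [compat_version, version] range.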
void migrate_init(struct ctx *c);
void migrate_close(struct ctx *c);
void migrate_request(struct ctx *c, int fd, bool target);
void migrate_handler(struct ctx *c);
#endif /* MIGRATE_H */

ndp.c

@ -256,7 +256,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
ptr = &ra.var[0];
if (c->mtu != -1) {
if (c->mtu) {
struct opt_mtu *mtu = (struct opt_mtu *)ptr;
*mtu = (struct opt_mtu) {
.header = {
@ -328,6 +328,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);
}

netlink.c

@ -297,6 +297,10 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
if (!thisifi)
continue; /* No interface for this route */
/* Skip 'lo': we should test IFF_LOOPBACK, but keep it simple */
if (thisifi == 1)
continue;
/* Skip routes to link-local addresses */
if (af == AF_INET && dst &&
IN4_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))
@ -351,7 +355,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
*
* Return: true if a gateway was found, false otherwise
*/
bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
static bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
{
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;

packet.c

@ -23,51 +23,73 @@
#include "log.h"
/**
* packet_check_range() - Check if a packet memory range is valid
* packet_check_range() - Check if a memory range is valid for a pool
* @p: Packet pool
* @offset: Offset of data range in packet descriptor
* @ptr: Start of desired data range
* @len: Length of desired data range
* @start: Start of the packet descriptor
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: 0 if the range is valid, -1 otherwise
*/
static int packet_check_range(const struct pool *p, size_t offset, size_t len,
const char *start, const char *func, int line)
static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
const char *func, int line)
{
if (len > PACKET_MAX_LEN) {
debug("packet range length %zu (max %zu), %s:%i",
len, PACKET_MAX_LEN, func, line);
return -1;
}
if (p->buf_size == 0) {
int ret;
ret = vu_packet_check_range((void *)p->buf, offset, len, start);
ret = vu_packet_check_range((void *)p->buf, ptr, len);
if (ret == -1)
trace("cannot find region, %s:%i", func, line);
debug("cannot find region, %s:%i", func, line);
return ret;
}
if (start < p->buf) {
trace("packet start %p before buffer start %p, "
"%s:%i", (void *)start, (void *)p->buf, func, line);
if (ptr < p->buf) {
debug("packet range start %p before buffer start %p, %s:%i",
(void *)ptr, (void *)p->buf, func, line);
return -1;
}
if (start + len + offset > p->buf + p->buf_size) {
trace("packet offset plus length %zu from size %zu, "
"%s:%i", start - p->buf + len + offset,
p->buf_size, func, line);
if (len > p->buf_size) {
debug("packet range length %zu larger than buffer %zu, %s:%i",
len, p->buf_size, func, line);
return -1;
}
if ((size_t)(ptr - p->buf) > p->buf_size - len) {
debug("packet range %p, len %zu after buffer end %p, %s:%i",
(void *)ptr, len, (void *)(p->buf + p->buf_size),
func, line);
return -1;
}
return 0;
}
/**
* pool_full() - Is a packet pool full?
* @p: Pointer to packet pool
*
* Return: true if the pool is full, false if more packets can be added
*/
bool pool_full(const struct pool *p)
{
return p->count >= p->size;
}
/**
* packet_add_do() - Add data as packet descriptor to given pool
* @p: Existing pool
* @len: Length of new descriptor
* @start: Start of data
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*/
void packet_add_do(struct pool *p, size_t len, const char *start,
@ -75,26 +97,63 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
{
size_t idx = p->count;
if (idx >= p->size) {
trace("add packet index %zu to pool with size %zu, %s:%i",
if (pool_full(p)) {
debug("add packet index %zu to pool with size %zu, %s:%i",
idx, p->size, func, line);
return;
}
if (packet_check_range(p, 0, len, start, func, line))
if (packet_check_range(p, start, len, func, line))
return;
if (len > UINT16_MAX) {
trace("add packet length %zu, %s:%i", len, func, line);
return;
}
p->pkt[idx].iov_base = (void *)start;
p->pkt[idx].iov_len = len;
p->count++;
}
/**
* packet_get_try_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
* @idx: Index of packet descriptor in pool
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
*/
void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
{
char *ptr;
ASSERT_WITH_MSG(p->count <= p->size,
"Corrupt pool count: %zu, size: %zu, %s:%i",
p->count, p->size, func, line);
if (idx >= p->count) {
debug("packet %zu from pool count: %zu, %s:%i",
idx, p->count, func, line);
return NULL;
}
if (offset > p->pkt[idx].iov_len ||
len > (p->pkt[idx].iov_len - offset))
return NULL;
ptr = (char *)p->pkt[idx].iov_base + offset;
ASSERT_WITH_MSG(!packet_check_range(p, ptr, len, func, line),
"Corrupt packet pool, %s:%i", func, line);
if (left)
*left = p->pkt[idx].iov_len - offset - len;
return ptr;
}
/**
* packet_get_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
@ -102,47 +161,24 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
* Return: as packet_get_try_do() but log a trace message when returning NULL
*/
void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line)
{
if (idx >= p->size || idx >= p->count) {
if (func) {
trace("packet %zu from pool size: %zu, count: %zu, "
"%s:%i", idx, p->size, p->count, func, line);
}
return NULL;
void *r = packet_get_try_do(p, idx, offset, len, left, func, line);
if (!r) {
trace("missing packet data length %zu, offset %zu from "
"length %zu, %s:%i",
len, offset, p->pkt[idx].iov_len, func, line);
}
if (len > UINT16_MAX) {
if (func) {
trace("packet data length %zu, %s:%i",
len, func, line);
}
return NULL;
}
if (len + offset > p->pkt[idx].iov_len) {
if (func) {
trace("data length %zu, offset %zu from length %zu, "
"%s:%i", len, offset, p->pkt[idx].iov_len,
func, line);
}
return NULL;
}
if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
func, line))
return NULL;
if (left)
*left = p->pkt[idx].iov_len - offset - len;
return (char *)p->pkt[idx].iov_base + offset;
return r;
}
/**

packet.h

@ -6,6 +6,11 @@
#ifndef PACKET_H
#define PACKET_H
#include <stdbool.h>
/* Maximum size of a single packet stored in pool, including headers */
#define PACKET_MAX_LEN ((size_t)UINT16_MAX)
/**
* struct pool - Generic pool of packets stored in a buffer
* @buf: Buffer storing packet descriptors,
@ -21,27 +26,29 @@ struct pool {
size_t buf_size;
size_t size;
size_t count;
struct iovec pkt[1];
struct iovec pkt[];
};
int vu_packet_check_range(void *buf, size_t offset, size_t len,
const char *start);
int vu_packet_check_range(void *buf, const char *ptr, size_t len);
void packet_add_do(struct pool *p, size_t len, const char *start,
const char *func, int line);
void *packet_get_try_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
bool pool_full(const struct pool *p);
void pool_flush(struct pool *p);
#define packet_add(p, len, start) \
packet_add_do(p, len, start, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_try_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, NULL, 0)
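A short usage sketch of the packet_get() interface kept by this change (the pool p, the index and the UDP header are illustrative; packets are added to pools by the tap path elsewhere in the tree):

	struct udphdr *uh;
	size_t left;

	uh = packet_get(p, 0, 0, sizeof(*uh), &left);
	if (!uh)
		return;	/* descriptor 0 is missing or too short */
	/* 'left' now holds the length of data following the UDP header */

packet_get_try() behaves the same, but doesn't emit the extra trace message when the requested range isn't available.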
#define PACKET_POOL_DECL(_name, _size, _buf) \
struct _name ## _t { \
char *buf; \

passt-repair.1 (new file)

@ -0,0 +1,74 @@
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\" Copyright (c) 2025 Red Hat GmbH
.\" Author: Stefano Brivio <sbrivio@redhat.com>
.TH passt-repair 1
.SH NAME
.B passt-repair
\- Helper setting TCP_REPAIR socket options for \fBpasst\fR(1)
.SH SYNOPSIS
.B passt-repair
\fIPATH\fR
.SH DESCRIPTION
.B passt-repair
is a privileged helper setting and clearing repair mode on TCP sockets on behalf
of \fBpasst\fR(1), as instructed via single-byte commands over a UNIX domain
socket.
It can be used to migrate TCP connections between guests without granting
additional capabilities to \fBpasst\fR(1) itself: to migrate TCP connections,
\fBpasst\fR(1) leverages repair mode, which needs the \fBCAP_NET_ADMIN\fR
capability (see \fBcapabilities\fR(7)) to be set or cleared.
If \fIPATH\fR represents a UNIX domain socket, \fBpasst-repair\fR(1) attempts to
connect to it. If it is a directory, \fBpasst-repair\fR(1) waits until a file
ending with \fI.repair\fR appears in it, and then attempts to connect to it.
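For example, assuming \fBpasst\fR(1) was started with its hypervisor socket at
\fI/tmp/passt_1.socket\fR and the default repair path (paths here are purely
illustrative), either \fBpasst-repair /tmp/passt_1.socket.repair\fR or
\fBpasst-repair /tmp/\fR would work, the latter waiting for a socket ending in
\fI.repair\fR to be created in that directory.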
.SH PROTOCOL
\fBpasst-repair\fR(1) connects to \fBpasst\fR(1) using the socket specified via
\fI--repair-path\fR option in \fBpasst\fR(1) itself. By default, the name is the
same as the UNIX domain socket used for guest communication, suffixed by
\fI.repair\fR.
The messages consist of one 8-bit signed integer that can be \fITCP_REPAIR_ON\fR
(1), \fITCP_REPAIR_OFF\fR (0), or \fITCP_REPAIR_OFF_NO_WP\fR (-1), as defined by
the Linux kernel user API, and one to SCM_MAX_FD (253) sockets as SCM_RIGHTS
(see \fBunix\fR(7)) ancillary message, sent by the server, \fBpasst\fR(1).
The client, \fBpasst-repair\fR(1), replies with the same byte (and no ancillary
message) to indicate success, and closes the connection on failure.
The server closes the connection on error or completion.
.SH NOTES
\fBpasst-repair\fR(1) can be granted the \fBCAP_NET_ADMIN\fR capability
(preferred, as it limits privileges to the strictly necessary ones), or it can
be run as root.
.SH AUTHOR
Stefano Brivio <sbrivio@redhat.com>.
.SH REPORTING BUGS
Please report issues on the bug tracker at https://bugs.passt.top/, or
send a message to the passt-user@passt.top mailing list, see
https://lists.passt.top/.
.SH COPYRIGHT
Copyright (c) 2025 Red Hat GmbH.
\fBpasst-repair\fR is free software: you can redistribute them and/or modify
them under the terms of the GNU General Public License as published by the Free
Software Foundation, either version 2 of the License, or (at your option) any
later version.
.SH SEE ALSO
\fBpasst\fR(1), \fBqemu\fR(1), \fBcapabilities\fR(7), \fBunix\fR(7).

passt-repair.c (new file)

@ -0,0 +1,266 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*
* Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
* with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
* off. Reply by echoing the command. Exit on EOF.
*/
#include <sys/inotify.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
#include <linux/audit.h>
#include <linux/capability.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include "seccomp_repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
#define REPAIR_EXT ".repair"
#define REPAIR_EXT_LEN strlen(REPAIR_EXT)
/**
* main() - Entry point and whole program with loop
* @argc: Argument count, must be 2
* @argv: Argument: path of UNIX domain socket to connect to
*
* Return: 0 on success (EOF), 1 on error, 2 on usage error
*
* #syscalls:repair connect setsockopt write close exit_group
* #syscalls:repair socket s390x:socketcall i686:socketcall
* #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
* #syscalls:repair sendto sendmsg arm:send ppc64le:send
* #syscalls:repair stat|statx stat64|statx statx
* #syscalls:repair fstat|fstat64 newfstatat|fstatat64
* #syscalls:repair inotify_init1 inotify_add_watch
*/
int main(int argc, char **argv)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr))));
struct sockaddr_un a = { AF_UNIX, "" };
int fds[SCM_MAX_FD], s, ret, i, n = 0;
bool inotify_dir = false;
struct sock_fprog prog;
int8_t cmd = INT8_MAX;
struct cmsghdr *cmsg;
struct msghdr msg;
struct iovec iov;
size_t cmsg_len;
struct stat sb;
int op;
prctl(PR_SET_DUMPABLE, 0);
prog.len = (unsigned short)sizeof(filter_repair) /
sizeof(filter_repair[0]);
prog.filter = filter_repair;
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
fprintf(stderr, "Failed to apply seccomp filter");
_exit(1);
}
iov = (struct iovec){ &cmd, sizeof(cmd) };
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = sizeof(buf),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
if (argc != 2) {
fprintf(stderr, "Usage: %s PATH\n", argv[0]);
_exit(2);
}
if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
_exit(1);
}
if ((stat(argv[1], &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", argv[1], errno);
_exit(1);
}
if ((sb.st_mode & S_IFMT) == S_IFDIR) {
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev = NULL;
char path[PATH_MAX + 1];
bool found = false;
ssize_t n;
int fd;
if ((fd = inotify_init1(IN_CLOEXEC)) < 0) {
fprintf(stderr, "inotify_init1: %i\n", errno);
_exit(1);
}
if (inotify_add_watch(fd, argv[1], IN_CREATE) < 0) {
fprintf(stderr, "inotify_add_watch: %i\n", errno);
_exit(1);
}
do {
char *p;
n = read(fd, buf, sizeof(buf));
if (n < 0) {
fprintf(stderr, "inotify read: %i", errno);
_exit(1);
}
buf[n - 1] = '\0';
if (n < (ssize_t)sizeof(*ev)) {
fprintf(stderr, "Short inotify read: %zi", n);
continue;
}
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
if (ev->len >= REPAIR_EXT_LEN &&
!memcmp(ev->name +
strnlen(ev->name, ev->len) -
REPAIR_EXT_LEN,
REPAIR_EXT, REPAIR_EXT_LEN)) {
found = true;
break;
}
}
} while (!found);
if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
fprintf(stderr, "Invalid filename from inotify\n");
_exit(1);
}
snprintf(path, sizeof(path), "%s/%s", argv[1], ev->name);
if ((stat(path, &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", path, errno);
_exit(1);
}
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", path);
inotify_dir = true;
} else {
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
}
if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
fprintf(stderr, "Invalid socket path");
_exit(2);
}
if ((sb.st_mode & S_IFMT) != S_IFSOCK) {
fprintf(stderr, "%s is not a socket\n", a.sun_path);
_exit(2);
}
while (connect(s, (struct sockaddr *)&a, sizeof(a))) {
if (inotify_dir && errno == ECONNREFUSED)
continue;
fprintf(stderr, "Failed to connect to %s: %s\n", a.sun_path,
strerror(errno));
_exit(1);
}
loop:
ret = recvmsg(s, &msg, 0);
if (ret < 0) {
if (errno == ECONNRESET) {
ret = 0;
} else {
fprintf(stderr, "Failed to read message: %i\n", errno);
_exit(1);
}
}
if (!ret) /* Done */
_exit(0);
if (!cmsg ||
cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
cmsg->cmsg_type != SCM_RIGHTS) {
fprintf(stderr, "No/bad ancillary data from peer\n");
_exit(1);
}
/* No inverse formula for CMSG_LEN(x), and building one with CMSG_LEN(0)
* works but there's no guarantee it does. Search the whole domain.
*/
for (i = 1; i <= SCM_MAX_FD; i++) {
if (CMSG_LEN(sizeof(int) * i) == cmsg->cmsg_len) {
n = i;
break;
}
}
if (!n) {
cmsg_len = cmsg->cmsg_len; /* socklen_t is 'unsigned' on musl */
fprintf(stderr, "Invalid ancillary data length %zu from peer\n",
cmsg_len);
_exit(1);
}
memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
cmd != TCP_REPAIR_OFF_NO_WP) {
fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
_exit(1);
}
op = cmd;
for (i = 0; i < n; i++) {
if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &op, sizeof(op))) {
fprintf(stderr,
"Setting TCP_REPAIR to %i on socket %i: %s", op,
fds[i], strerror(errno));
_exit(1);
}
/* Close _our_ copy */
close(fds[i]);
}
/* Confirm setting by echoing the command back */
if (send(s, &cmd, sizeof(cmd), 0) < 0) {
fprintf(stderr, "Reply to %i: %s\n", op, strerror(errno));
_exit(1);
}
goto loop;
return 0;
}

passt.1

@ -401,6 +401,16 @@ Enable IPv6-only operation. IPv4 traffic will be ignored.
By default, IPv4 operation is enabled as long as at least an IPv4 route and an
interface address are configured on a given host interface.
.TP
.BR \-H ", " \-\-hostname " " \fIname
Hostname to configure the client with.
Send \fIname\fR as DHCP option 12 (hostname).
.TP
.BR \-\-fqdn " " \fIname
FQDN to configure the client with.
Send \fIname\fR as Client FQDN: DHCP option 81 and DHCPv6 option 39.
.SS \fBpasst\fR-only options
.TP
@ -418,6 +428,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
.BR \-\-print-capabilities
Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
.TP
.BR \-\-repair-path " " \fIpath
Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
migration. \fB--repair-path none\fR disables this interface (if you need to
specify a socket path called "none" you can prefix the path by \fI./\fR).
Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
chosen for the hypervisor UNIX domain socket. No socket is created if not in
\-\-vhost-user mode.
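As an illustrative example only (socket paths are arbitrary), a migration-ready
instance could be started as:

	passt --vhost-user --socket /tmp/passt_1.socket --repair-path /tmp/passt_1.socket.repair

which matches the default repair path, so the option could equally be omitted;
\fBpasst-repair\fR(1) is then pointed at the same \fI.repair\fR path.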
.TP
.BR \-F ", " \-\-fd " " \fIFD
Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
@ -622,7 +643,7 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
.TP
.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
.BR \-\-host-lo-to-ns-lo
If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
the host's loopback address will appear on the loopback address in the
guest as well. Without this option such forwarded packets will appear
@ -941,10 +962,16 @@ with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
Similarly, for traffic coming from guest or namespace, packets with
destination address corresponding to the \fB\-\-map-host-loopback\fR
address will have their destination address translated to a loopback
address.
Similarly, for traffic coming from guest or namespace, packets with destination
address corresponding to the \fB\-\-map-host-loopback\fR address will have their
destination address translated to a loopback address.
As an exception, traffic identified as DNS, originally directed to the
\fB\-\-map-host-loopback\fR address, if this address matches a resolver address
on the host, is \fBnot\fR translated to loopback, but rather handled in the same
way as if specified as \-\-dns-forward address, if no such option was given.
In the common case where the host gateway also acts as a resolver, this avoids
having the host mapping shadow the gateway/resolver itself.
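For instance (addresses purely illustrative), if \fB\-\-map\-host\-loopback\fR is
192.0.2.1 and the host's \fI/etc/resolv.conf\fR also lists 192.0.2.1 as a
nameserver, DNS queries from the guest to 192.0.2.1 are handled as if 192.0.2.1
had been given as \fB\-\-dns\-forward\fR address, instead of being translated to
a loopback address.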
.SS Handling of local traffic in pasta

passt.c

@ -51,6 +51,8 @@
#include "tcp_splice.h"
#include "ndp.h"
#include "vu_common.h"
#include "migrate.h"
#include "repair.h"
#define EPOLL_EVENTS 8
@ -66,7 +68,7 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
[EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket",
[EPOLL_TYPE_UDP_REPLY] = "UDP reply socket",
[EPOLL_TYPE_UDP] = "UDP flow socket",
[EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket",
[EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch",
[EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch",
@ -75,7 +77,8 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
[EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket",
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
[EPOLL_TYPE_VHOST_MIGRATION] = "vhost-user migration socket",
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@ -163,11 +166,11 @@ void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
*
* #syscalls exit_group
*/
void exit_handler(int signal)
static void exit_handler(int signal)
{
(void)signal;
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
/**
@ -188,7 +191,6 @@ int main(int argc, char **argv)
{
struct epoll_event events[EPOLL_EVENTS];
int nfds, i, devnull_fd = -1;
char argv0[PATH_MAX], *name;
struct ctx c = { 0 };
struct rlimit limit;
struct timespec now;
@ -202,6 +204,7 @@ int main(int argc, char **argv)
isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
c.device_state_fd = -1;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
@ -209,27 +212,18 @@ int main(int argc, char **argv)
sigaction(SIGTERM, &sa, NULL);
sigaction(SIGQUIT, &sa, NULL);
if (argc < 1)
exit(EXIT_FAILURE);
c.mode = conf_mode(argc, argv);
strncpy(argv0, argv[0], PATH_MAX - 1);
name = basename(argv0);
if (strstr(name, "pasta")) {
if (c.mode == MODE_PASTA) {
sa.sa_handler = pasta_child_handler;
if (sigaction(SIGCHLD, &sa, NULL))
die_perror("Couldn't install signal handlers");
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
c.mode = MODE_PASTA;
} else if (strstr(name, "passt")) {
c.mode = MODE_PASST;
} else {
exit(EXIT_FAILURE);
}
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
madvise(pkt_buf, sizeof(pkt_buf), MADV_HUGEPAGE);
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
if (c.epollfd == -1)
@ -259,7 +253,7 @@ int main(int argc, char **argv)
flow_init();
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
@ -345,8 +339,8 @@ loop:
case EPOLL_TYPE_UDP_LISTEN:
udp_listen_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_UDP_REPLY:
udp_reply_sock_handler(&c, ref, eventmask, &now);
case EPOLL_TYPE_UDP:
udp_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_PING:
icmp_sock_handler(&c, ref);
@ -357,8 +351,11 @@ loop:
case EPOLL_TYPE_VHOST_KICK:
vu_kick_cb(c.vdev, ref, &now);
break;
case EPOLL_TYPE_VHOST_MIGRATION:
vu_migrate(c.vdev, eventmask);
case EPOLL_TYPE_REPAIR_LISTEN:
repair_listen_handler(&c, eventmask);
break;
case EPOLL_TYPE_REPAIR:
repair_handler(&c, eventmask);
break;
default:
/* Can't happen */
@ -368,5 +365,7 @@ loop:
post_handler(&c, &now);
migrate_handler(&c);
goto loop;
}

passt.h

@ -20,6 +20,7 @@ union epoll_ref;
#include "siphash.h"
#include "ip.h"
#include "inany.h"
#include "migrate.h"
#include "flow.h"
#include "icmp.h"
#include "fwd.h"
@ -68,12 +69,9 @@ union epoll_ref {
static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
"epoll_ref must have same size as epoll_data");
#define TAP_BUF_BYTES \
ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
#define TAP_MSGS \
DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
/* Large enough for ~128 maximum size frames */
#define PKT_BUF_BYTES (8UL << 20)
#define PKT_BUF_BYTES MAX(TAP_BUF_BYTES, 0)
extern char pkt_buf [PKT_BUF_BYTES];
extern char *epoll_type_str[];
@ -193,6 +191,7 @@ struct ip6_ctx {
* @foreground: Run in foreground, don't log to stderr by default
* @nofile: Maximum number of open files (ulimit -n)
* @sock_path: Path for UNIX domain socket
* @repair_path: TCP_REPAIR helper path, can be "none", empty for default
* @pcap: Path for packet capture file
* @pidfile: Path to PID file, empty string if not configured
* @pidfile_fd: File descriptor for PID file, -1 if none
@ -203,12 +202,16 @@ struct ip6_ctx {
* @epollfd: File descriptor for epoll instance
* @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any
* @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket
* @fd_repair_listen: File descriptor for listening TCP_REPAIR socket, if any
* @fd_repair: Connected AF_UNIX socket for TCP_REPAIR helper
* @our_tap_mac: Pasta/passt's MAC on the tap link
* @guest_mac: MAC address of guest or namespace, seen or configured
* @hash_secret: 128-bit secret for siphash functions
* @ifi4: Template interface for IPv4, -1: none, 0: IPv4 disabled
* @ip: IPv4 configuration
* @dns_search: DNS search list
* @hostname: Guest hostname
* @fqdn: Guest FQDN
* @ifi6: Template interface for IPv6, -1: none, 0: IPv6 disabled
* @ip6: IPv6 configuration
* @pasta_ifn: Name of namespace interface for pasta
@ -235,6 +238,9 @@ struct ip6_ctx {
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
* @vdev: vhost-user device
* @device_state_fd: Device state migration channel
* @device_state_result: Device state migration result
* @migrate_target: Are we the target, on the next migration request?
*/
struct ctx {
enum passt_modes mode;
@ -244,6 +250,7 @@ struct ctx {
int foreground;
int nofile;
char sock_path[UNIX_PATH_MAX];
char repair_path[UNIX_PATH_MAX];
char pcap[PATH_MAX];
char pidfile[PATH_MAX];
@ -260,8 +267,12 @@ struct ctx {
int epollfd;
int fd_tap_listen;
int fd_tap;
int fd_repair_listen;
int fd_repair;
unsigned char our_tap_mac[ETH_ALEN];
unsigned char guest_mac[ETH_ALEN];
uint16_t mtu;
uint64_t hash_secret[2];
int ifi4;
@ -269,6 +280,9 @@ struct ctx {
struct fqdn dns_search[MAXDNSRCH];
char hostname[PASST_MAXDNAME];
char fqdn[PASST_MAXDNAME];
int ifi6;
struct ip6_ctx ip6;
@ -283,7 +297,6 @@ struct ctx {
int no_icmp;
struct icmp_ctx icmp;
int mtu;
int no_dns;
int no_dns_search;
int no_dhcp_dns;
@ -300,6 +313,11 @@ struct ctx {
int low_rmem;
struct vu_dev *vdev;
/* Migration */
int device_state_fd;
int device_state_result;
bool migrate_target;
};
void proto_update_l2_buf(const unsigned char *eth_d,

pasta.c

@ -73,12 +73,12 @@ void pasta_child_handler(int signal)
!waitid(P_PID, pasta_child_pid, &infop, WEXITED | WNOHANG)) {
if (infop.si_pid == pasta_child_pid) {
if (infop.si_code == CLD_EXITED)
exit(infop.si_status);
_exit(infop.si_status);
/* If killed by a signal, si_status is the number.
* Follow common shell convention of returning it + 128.
*/
exit(infop.si_status + 128);
_exit(infop.si_status + 128);
/* Nothing to do, detached PID namespace going away */
}
@ -169,10 +169,12 @@ void pasta_open_ns(struct ctx *c, const char *netns)
* struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
* @exe: Executable to run
* @argv: Command and arguments to run
* @ctx: Context to read config from
*/
struct pasta_spawn_cmd_arg {
const char *exe;
char *const *argv;
struct ctx *c;
};
/**
@ -186,6 +188,7 @@ static int pasta_spawn_cmd(void *arg)
{
char hostname[HOST_NAME_MAX + 1] = HOSTNAME_PREFIX;
const struct pasta_spawn_cmd_arg *a;
size_t conf_hostname_len;
sigset_t set;
/* We run in a detached PID and mount namespace: mount /proc over */
@ -195,9 +198,15 @@ static int pasta_spawn_cmd(void *arg)
if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
warn("Cannot set ping_group_range, ICMP requests might fail");
if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
a = (const struct pasta_spawn_cmd_arg *)arg;
conf_hostname_len = strlen(a->c->hostname);
if (conf_hostname_len > 0) {
if (sethostname(a->c->hostname, conf_hostname_len))
warn("Unable to set configured hostname");
} else if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
hostname[HOST_NAME_MAX] = '\0';
if (sethostname(hostname, strlen(hostname)))
warn("Unable to set pasta-prefixed hostname");
@ -208,7 +217,6 @@ static int pasta_spawn_cmd(void *arg)
sigaddset(&set, SIGUSR1);
sigwaitinfo(&set, NULL);
a = (const struct pasta_spawn_cmd_arg *)arg;
execvp(a->exe, a->argv);
die_perror("Failed to start command or shell");
@ -230,6 +238,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
struct pasta_spawn_cmd_arg arg = {
.exe = argv[0],
.argv = argv,
.c = c,
};
char uidmap[BUFSIZ], gidmap[BUFSIZ];
char *sh_argv[] = { NULL, NULL };
@ -310,7 +319,7 @@ void pasta_ns_conf(struct ctx *c)
if (c->pasta_conf_ns) {
unsigned int flags = IFF_UP;
if (c->mtu != -1)
if (c->mtu)
nl_link_set_mtu(nl_sock_ns, c->pasta_ifi, c->mtu);
if (c->ifi6) /* Avoid duplicate address detection on link up */
@ -489,17 +498,23 @@ void pasta_netns_quit_init(const struct ctx *c)
*/
void pasta_netns_quit_inotify_handler(struct ctx *c, int inotify_fd)
{
char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
const struct inotify_event *in_ev = (struct inotify_event *)buf;
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev;
ssize_t n;
char *p;
if (read(inotify_fd, buf, sizeof(buf)) < (ssize_t)sizeof(*in_ev))
if ((n = read(inotify_fd, buf, sizeof(buf))) < (ssize_t)sizeof(*ev))
return;
if (strncmp(in_ev->name, c->netns_base, sizeof(c->netns_base)))
return;
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
if (!strncmp(ev->name, c->netns_base, sizeof(c->netns_base))) {
info("Namespace %s is gone, exiting", c->netns_base);
_exit(EXIT_SUCCESS);
}
}
}
/**
@ -525,7 +540,7 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
return;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
close(fd);

46
pcap.c
View file

@ -33,33 +33,12 @@
#include "log.h"
#include "pcap.h"
#include "iov.h"
#include "tap.h"
#define PCAP_VERSION_MINOR 4
static int pcap_fd = -1;
/* See pcap.h from libpcap, or pcap-savefile(5) */
static const struct {
uint32_t magic;
#define PCAP_MAGIC 0xa1b2c3d4
uint16_t major;
#define PCAP_VERSION_MAJOR 2
uint16_t minor;
#define PCAP_VERSION_MINOR 4
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
#define PCAP_LINKTYPE_ETHERNET 1
} pcap_hdr = {
PCAP_MAGIC, PCAP_VERSION_MAJOR, PCAP_VERSION_MINOR, 0, 0, ETH_MAX_MTU,
PCAP_LINKTYPE_ETHERNET
};
struct pcap_pkthdr {
uint32_t tv_sec;
uint32_t tv_usec;
@ -162,6 +141,29 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
*/
void pcap_init(struct ctx *c)
{
/* See pcap.h from libpcap, or pcap-savefile(5) */
#define PCAP_MAGIC 0xa1b2c3d4
#define PCAP_VERSION_MAJOR 2
#define PCAP_VERSION_MINOR 4
#define PCAP_LINKTYPE_ETHERNET 1
const struct {
uint32_t magic;
uint16_t major;
uint16_t minor;
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
} pcap_hdr = {
.magic = PCAP_MAGIC,
.major = PCAP_VERSION_MAJOR,
.minor = PCAP_VERSION_MINOR,
.snaplen = tap_l2_max_len(c),
.linktype = PCAP_LINKTYPE_ETHERNET
};
if (pcap_fd != -1)
return;

255
repair.c Normal file
View file

@ -0,0 +1,255 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
/* Wait for a while for TCP_REPAIR helper to connect if it's not there yet */
#define REPAIR_ACCEPT_TIMEOUT_MS 10
#define REPAIR_ACCEPT_TIMEOUT_US (REPAIR_ACCEPT_TIMEOUT_MS * 1000)
/* Pending file descriptors for next repair_flush() call, or command change */
static int repair_fds[SCM_MAX_FD];
/* Pending command: flush pending file descriptors if it changes */
static int8_t repair_cmd;
/* Number of pending file descriptors set in @repair_fds */
static int repair_nfds;
/**
* repair_sock_init() - Start listening for connections on helper socket
* @c: Execution context
*/
void repair_sock_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
struct epoll_event ev = { 0 };
if (c->fd_repair_listen == -1)
return;
if (listen(c->fd_repair_listen, 0)) {
err_perror("listen() on repair helper socket, won't migrate");
return;
}
ref.fd = c->fd_repair_listen;
ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
err_perror("repair helper socket epoll_ctl(), won't migrate");
}
/**
* repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
* @c: Execution context
* @events: epoll events
*/
void repair_listen_handler(struct ctx *c, uint32_t events)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
struct epoll_event ev = { 0 };
struct ucred ucred;
socklen_t len;
if (events != EPOLLIN) {
debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
events);
return;
}
len = sizeof(ucred);
/* Another client is already connected: accept and close right away. */
if (c->fd_repair != -1) {
int discard = accept4(c->fd_repair_listen, NULL, NULL,
SOCK_NONBLOCK);
if (discard == -1)
return;
if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
close(discard);
return;
}
if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
debug_perror("accept4() on TCP_REPAIR helper listening socket");
return;
}
if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
ref.fd = c->fd_repair;
ev.events = EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
close(c->fd_repair);
c->fd_repair = -1;
}
}
/**
* repair_close() - Close connection to TCP_REPAIR helper
* @c: Execution context
*/
void repair_close(struct ctx *c)
{
debug("Closing TCP_REPAIR helper socket");
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
close(c->fd_repair);
c->fd_repair = -1;
}
/**
* repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
* @c: Execution context
* @events: epoll events
*/
void repair_handler(struct ctx *c, uint32_t events)
{
(void)events;
repair_close(c);
}
/**
* repair_wait() - Wait (with timeout) for TCP_REPAIR helper to connect
* @c: Execution context
*/
void repair_wait(struct ctx *c)
{
struct timeval tv = { .tv_sec = 0,
.tv_usec = (long)(REPAIR_ACCEPT_TIMEOUT_US) };
static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000,
".tv_usec is greater than 1000 * 1000");
if (c->fd_repair >= 0 || c->fd_repair_listen == -1)
return;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv))) {
err_perror("Set timeout on TCP_REPAIR listening socket");
return;
}
repair_listen_handler(c, EPOLLIN);
tv.tv_usec = 0;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv)))
err_perror("Clear timeout on TCP_REPAIR listening socket");
}
/**
* repair_flush() - Flush current set of sockets to helper, with current command
* @c: Execution context
*
* Return: 0 on success, negative error code on failure
*/
int repair_flush(struct ctx *c)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr)))) = { 0 };
struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
struct cmsghdr *cmsg;
struct msghdr msg;
int8_t reply;
if (!repair_nfds)
return 0;
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = CMSG_SPACE(sizeof(int) *
repair_nfds),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
repair_nfds = 0;
if (sendmsg(c->fd_repair, &msg, 0) < 0) {
int ret = -errno;
err_perror("Failed to send sockets to TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (recv(c->fd_repair, &reply, sizeof(reply), 0) < 0) {
int ret = -errno;
err_perror("Failed to receive reply from TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (reply != repair_cmd) {
err("Unexpected reply from TCP_REPAIR helper: %d", reply);
repair_close(c);
return -ENXIO;
}
return 0;
}
/**
* repair_set() - Add socket to TCP_REPAIR set with given command
* @c: Execution context
* @s: Socket to add
* @cmd: TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
*
* Return: 0 on success, negative error code on failure
*/
int repair_set(struct ctx *c, int s, int cmd)
{
int rc;
if (repair_nfds && repair_cmd != cmd) {
if ((rc = repair_flush(c)))
return rc;
}
repair_cmd = cmd;
repair_fds[repair_nfds++] = s;
if (repair_nfds >= SCM_MAX_FD) {
if ((rc = repair_flush(c)))
return rc;
}
return 0;
}
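For orientation, repair_flush() above defines a simple request/response protocol: one command byte sent together with a batch of socket descriptors as SCM_RIGHTS ancillary data, acknowledged by the helper echoing the same byte back. Below is a minimal sketch of what the peer side (a passt-repair-like helper) could look like for a single batch. It is illustrative only: the helper name, constants and failure handling are assumptions, not the actual passt-repair implementation.

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19			/* From Linux UAPI, linux/tcp.h */
#endif

#define HELPER_MAX_FD 253		/* Mirrors SCM_MAX_FD above */

/* Receive one command byte plus a batch of sockets, set TCP_REPAIR on each of
 * them as requested, then echo the command back so that repair_flush() sees a
 * matching reply.
 */
static int helper_handle_batch(int conn)
{
	char cbuf[CMSG_SPACE(sizeof(int) * HELPER_MAX_FD)]
	     __attribute__ ((aligned(__alignof__(struct cmsghdr)))) = { 0 };
	int8_t cmd;
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = cbuf,
			      .msg_controllen = sizeof(cbuf) };
	struct cmsghdr *cmsg;
	size_t i, n;
	int err = 0;
	int *fds;

	if (recvmsg(conn, &msg, 0) <= 0)
		return -errno;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -EINVAL;

	fds = (int *)CMSG_DATA(cmsg);
	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

	for (i = 0; i < n; i++) {
		int v = cmd;	/* TCP_REPAIR_ON, _OFF or _OFF_NO_WP */

		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &v, sizeof(v)) && !err)
			err = -errno;
		close(fds[i]);
	}

	if (err)		/* No echo sent: repair_flush() will see this
				 * as a failed or mismatched reply
				 */
		return err;

	if (send(conn, &cmd, sizeof(cmd), 0) != sizeof(cmd))
		return -errno;

	return 0;
}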

17
repair.h Normal file
View file

@ -0,0 +1,17 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef REPAIR_H
#define REPAIR_H
void repair_sock_init(const struct ctx *c);
void repair_listen_handler(struct ctx *c, uint32_t events);
void repair_handler(struct ctx *c, uint32_t events);
void repair_close(struct ctx *c);
void repair_wait(struct ctx *c);
int repair_flush(struct ctx *c);
int repair_set(struct ctx *c, int s, int cmd);
#endif /* REPAIR_H */

View file

@ -14,8 +14,10 @@
# Author: Stefano Brivio <sbrivio@redhat.com>
TMP="$(mktemp)"
IN="$@"
OUT="$(mktemp)"
OUT_FINAL="${1}"
shift
IN="$@"
[ -z "${ARCH}" ] && ARCH="$(uname -m)"
[ -z "${CC}" ] && CC="cc"
@ -253,7 +255,7 @@ for __p in ${__profiles}; do
__calls="${__calls} ${EXTRA_SYSCALLS:-}"
__calls="$(filter ${__calls})"
cols="$(stty -a | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
cols="$(stty -a 2>/dev/null | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
case $cols in [0-9]*) col_args="-w ${cols}";; *) col_args="";; esac
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t ${col_args}
@ -268,4 +270,4 @@ for __p in ${__profiles}; do
gen_profile "${__p}" ${__calls}
done
mv "${OUT}" seccomp.h
mv "${OUT}" "${OUT_FINAL}"

284
tap.c
View file

@ -56,18 +56,70 @@
#include "netlink.h"
#include "pasta.h"
#include "packet.h"
#include "repair.h"
#include "tap.h"
#include "log.h"
#include "vhost_user.h"
#include "vu_common.h"
/* Maximum allowed frame lengths (including L2 header) */
/* Verify that an L2 frame length limit is large enough to contain the header,
* but small enough to fit in the packet pool
*/
#define CHECK_FRAME_LEN(len) \
static_assert((len) >= ETH_HLEN && (len) <= PACKET_MAX_LEN, \
#len " has bad value")
CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
CHECK_FRAME_LEN(L2_MAX_LEN_VU);
/* We try to size the packet pools so that we can use a single batch for the entire
* packet buffer. This might be exceeded for vhost-user, though, which uses its
* own buffers rather than pkt_buf.
*
* This is just a tuning parameter, the code will work with slightly more
* overhead if it's incorrect. So, we estimate based on the minimum practical
* frame size - an empty UDP datagram - rather than the minimum theoretical
* frame size.
*
* FIXME: Profile to work out how big this actually needs to be to amortise
* per-batch syscall overheads
*/
#define TAP_MSGS_IP4 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr))
#define TAP_MSGS_IP6 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct udphdr))
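/* A purely illustrative worked example of the sizing above (the size of
 * pkt_buf here is an assumption, not taken from this diff): with ETH_HLEN = 14,
 * sizeof(struct iphdr) = 20 and sizeof(struct udphdr) = 8, the IPv4 divisor is
 * 42 bytes, and 14 + 40 + 8 = 62 bytes for IPv6. For a hypothetical 128 KiB
 * pkt_buf this would give TAP_MSGS_IP4 = DIV_ROUND_UP(131072, 42) = 3121 and
 * TAP_MSGS_IP6 = DIV_ROUND_UP(131072, 62) = 2115 pool slots.
 */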
/* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS_IP4, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
#define TAP_SEQS 128 /* Different L4 tuples in one batch */
#define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */
/**
* tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
* @c: Execution context
*/
unsigned long tap_l2_max_len(const struct ctx *c)
{
/* NOLINTBEGIN(bugprone-branch-clone): values can be the same */
switch (c->mode) {
case MODE_PASST:
return L2_MAX_LEN_PASST;
case MODE_PASTA:
return L2_MAX_LEN_PASTA;
case MODE_VU:
return L2_MAX_LEN_VU;
}
/* NOLINTEND(bugprone-branch-clone) */
ASSERT(0);
}
/**
* tap_send_single() - Send a single frame
* @c: Execution context
@ -121,7 +173,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
{
struct ethhdr *eh = (struct ethhdr *)buf;
@ -142,8 +194,8 @@ static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
{
uint16_t l3len = l4len + sizeof(*ip4h);
@ -152,17 +204,17 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
ip4h->tos = 0;
ip4h->tot_len = htons(l3len);
ip4h->id = 0;
ip4h->frag_off = 0;
ip4h->frag_off = htons(IP_DF);
ip4h->ttl = 255;
ip4h->protocol = proto;
ip4h->saddr = src.s_addr;
ip4h->daddr = dst.s_addr;
ip4h->check = csum_ip4_header(l3len, proto, src, dst);
return ip4h + 1;
return (char *)ip4h + sizeof(*ip4h);
}
/**
* tap_udp4_send() - Send UDP over IPv4 packet
* tap_push_uh4() - Build UDPv4 header with checksum
* @c: Execution context
* @src: IPv4 source address
* @sport: UDP source port
@ -170,16 +222,14 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = (void *)in,
.iov_len = dlen
@ -190,8 +240,30 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, &payload);
memcpy(data, in, dlen);
return (char *)uh + sizeof(*uh);
}
/**
* tap_udp4_send() - Send UDP over IPv4 packet
* @c: Execution context
* @src: IPv4 source address
* @sport: UDP source port
* @dst: IPv4 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -228,10 +300,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
{
ip6h->payload_len = htons(l4len);
ip6h->priority = 0;
@ -240,10 +311,40 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
ip6h->hop_limit = 255;
ip6h->saddr = *src;
ip6h->daddr = *dst;
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
return ip6h + 1;
ip6_set_flow_lbl(ip6h, flow);
return (char *)ip6h + sizeof(*ip6h);
}
/**
* tap_push_uh6() - Build UDPv6 header with checksum
* @c: Execution context
* @src: IPv6 source address
* @sport: UDP source port
* @dst: IPv6 destination address
* @dport: UDP destination port
* @flow: Flow label
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, &payload);
return (char *)uh + sizeof(*uh);
}
/**
@ -254,7 +355,7 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
* @dst: IPv6 destination address
* @dport: UDP destination port
* @flow: Flow label
* @in: UDP payload contents (not including UDP header)
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp6_send(const struct ctx *c,
@ -267,19 +368,9 @@ void tap_udp6_send(const struct ctx *c,
struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
l4len, IPPROTO_UDP, flow);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, &payload);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -468,6 +559,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
* struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @ttl: Time to live
* @source: Source port
* @dest: Destination port
* @saddr: Source address
@ -476,6 +568,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
*/
static struct tap4_l4_t {
uint8_t protocol;
uint8_t ttl;
uint16_t source;
uint16_t dest;
@ -490,14 +583,17 @@ static struct tap4_l4_t {
* struct l4_seq6_t - Message sequence for one protocol handler call, IPv6
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @flow_lbl: IPv6 flow label
* @source: Source port
* @dest: Destination port
* @saddr: Source address
* @daddr: Destination address
* @hop_limit: Hop limit
* @msg: Array of messages that can be handled in a single call
*/
static struct tap6_l4_t {
uint8_t protocol;
uint32_t flow_lbl :20;
uint16_t source;
uint16_t dest;
@ -505,6 +601,8 @@ static struct tap6_l4_t {
struct in6_addr saddr;
struct in6_addr daddr;
uint8_t hop_limit;
struct pool_l4_t p;
} tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
@ -693,7 +791,8 @@ resume:
#define L4_MATCH(iph, uh, seq) \
((seq)->protocol == (iph)->protocol && \
(seq)->source == (uh)->source && (seq)->dest == (uh)->dest && \
(seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
(seq)->saddr.s_addr == (iph)->saddr && \
(seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
#define L4_SET(iph, uh, seq) \
do { \
@ -702,6 +801,7 @@ resume:
(seq)->dest = (uh)->dest; \
(seq)->saddr.s_addr = (iph)->saddr; \
(seq)->daddr.s_addr = (iph)->daddr; \
(seq)->ttl = (iph)->ttl; \
} while (0)
if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
@ -743,14 +843,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
0, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
seq->ttl, p, k, now);
}
}
@ -871,16 +971,20 @@ resume:
((seq)->protocol == (proto) && \
(seq)->source == (uh)->source && \
(seq)->dest == (uh)->dest && \
(seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \
IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr) && \
(seq)->hop_limit == (ip6h)->hop_limit)
#define L4_SET(ip6h, proto, uh, seq) \
do { \
(seq)->protocol = (proto); \
(seq)->source = (uh)->source; \
(seq)->dest = (uh)->dest; \
(seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \
(seq)->saddr = *saddr; \
(seq)->daddr = *daddr; \
(seq)->hop_limit = (ip6h)->hop_limit; \
} while (0)
if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
@ -924,14 +1028,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->flow_lbl, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->hop_limit, p, k, now);
}
}
@ -966,8 +1070,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
* @c: Execution context
* @l2len: Total L2 packet length
* @p: Packet buffer
* @now: Current timestamp
*/
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now)
{
const struct ethhdr *eh;
@ -983,9 +1089,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
switch (ntohs(eh->h_proto)) {
case ETH_P_ARP:
case ETH_P_IP:
if (pool_full(pool_tap4)) {
tap4_handler(c, pool_tap4, now);
pool_flush(pool_tap4);
}
packet_add(pool_tap4, l2len, p);
break;
case ETH_P_IPV6:
if (pool_full(pool_tap6)) {
tap6_handler(c, pool_tap6, now);
pool_flush(pool_tap6);
}
packet_add(pool_tap6, l2len, p);
break;
default:
@ -1002,10 +1116,10 @@ void tap_sock_reset(struct ctx *c)
info("Client connection closed%s", c->one_off ? ", exiting" : "");
if (c->one_off)
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
/* Close the connected socket, wait for a new connection */
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
epoll_del(c, c->fd_tap);
close(c->fd_tap);
c->fd_tap = -1;
if (c->mode == MODE_VU)
@ -1036,7 +1150,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
do {
n = recv(c->fd_tap, pkt_buf + partial_len,
TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
sizeof(pkt_buf) - partial_len, MSG_DONTWAIT);
} while ((n < 0) && errno == EINTR);
if (n < 0) {
@ -1053,7 +1167,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
while (n >= (ssize_t)sizeof(uint32_t)) {
uint32_t l2len = ntohl_unaligned(p);
if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
if (l2len < sizeof(struct ethhdr) || l2len > L2_MAX_LEN_PASST) {
err("Bad frame size from guest, resetting connection");
tap_sock_reset(c);
return;
@ -1066,7 +1180,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
p += sizeof(uint32_t);
n -= sizeof(uint32_t);
tap_add_packet(c, l2len, p);
tap_add_packet(c, l2len, p, now);
p += l2len;
n -= l2len;
@ -1107,8 +1221,10 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
tap_flush_pools();
for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len) {
len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
for (n = 0;
n <= (ssize_t)(sizeof(pkt_buf) - L2_MAX_LEN_PASTA);
n += len) {
len = read(c->fd_tap, pkt_buf + n, L2_MAX_LEN_PASTA);
if (len == 0) {
die("EOF on tap device, exiting");
@ -1126,10 +1242,10 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
/* Ignore frames of bad length */
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)ETH_MAX_MTU)
len > (ssize_t)L2_MAX_LEN_PASTA)
continue;
tap_add_packet(c, len, pkt_buf + n);
tap_add_packet(c, len, pkt_buf + n, now);
}
tap_handler(c, now);
@ -1151,68 +1267,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
tap_pasta_input(c, now);
}
/**
* tap_sock_unix_open() - Create and bind AF_UNIX socket
* @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
*
* Return: socket descriptor on success, won't return on failure
*/
int tap_sock_unix_open(char *sock_path)
{
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
};
int i;
if (fd < 0)
die_perror("Failed to open UNIX domain socket");
for (i = 1; i < UNIX_SOCK_MAX; i++) {
char *path = addr.sun_path;
int ex, ret;
if (*sock_path)
memcpy(path, sock_path, UNIX_PATH_MAX);
else if (snprintf_check(path, UNIX_PATH_MAX - 1,
UNIX_SOCK_PATH, i))
die_perror("Can't build UNIX domain socket path");
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
0);
if (ex < 0)
die_perror("Failed to check for UNIX domain conflicts");
ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
errno != EACCES)) {
if (*sock_path)
die("Socket path %s already in use", path);
close(ex);
continue;
}
close(ex);
unlink(path);
ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
if (*sock_path && ret)
die_perror("Failed to bind UNIX domain socket");
if (!ret)
break;
}
if (i == UNIX_SOCK_MAX)
die_perror("Failed to bind UNIX domain socket");
info("UNIX domain socket bound at %s", addr.sun_path);
if (!*sock_path)
memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
return fd;
}
/**
* tap_backend_show_hints() - Give help information to start QEMU
* @c: Execution context
@ -1389,8 +1443,8 @@ void tap_sock_update_pool(void *base, size_t size)
{
int i;
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size);
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS_IP4, base, size);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS_IP6, base, size);
for (i = 0; i < TAP_SEQS; i++) {
tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
@ -1423,6 +1477,8 @@ void tap_backend_init(struct ctx *c)
tap_sock_tun_init(c);
break;
case MODE_VU:
repair_sock_init(c);
/* fall through */
case MODE_PASST:
tap_sock_unix_init(c);

50
tap.h
View file

@ -6,7 +6,32 @@
#ifndef TAP_H
#define TAP_H
#define ETH_HDR_INIT(proto) { .h_proto = htons_constant(proto) }
/** L2_MAX_LEN_PASTA - Maximum frame length for pasta mode (with L2 header)
*
* The kernel tuntap device imposes a maximum frame size of 65535 including
* 'hard_header_len' (14 bytes for L2 Ethernet in the case of "tap" mode).
*/
#define L2_MAX_LEN_PASTA USHRT_MAX
/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
*
* The only structural limit the QEMU socket protocol imposes on frames is
* (2^32-1) bytes, but that would be ludicrously long in practice. For now,
* limit it somewhat arbitrarily to 65535 bytes. FIXME: Work out an appropriate
* limit with more precision.
*/
#define L2_MAX_LEN_PASST USHRT_MAX
/** L2_MAX_LEN_VU - Maximum frame length for vhost-user mode (with L2 header)
*
* vhost-user allows multiple buffers per frame, each of which can be quite
* large, so the inherent frame size limit is rather large. Much larger than is
* actually useful for IP. For now limit arbitrarily to 65535 bytes. FIXME:
* Work out an appropriate limit with more precision.
*/
#define L2_MAX_LEN_VU USHRT_MAX
struct udphdr;
/**
* struct tap_hdr - tap backend specific headers
@ -44,6 +69,23 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
thdr->vnet_len = htonl(l2len);
}
unsigned long tap_l2_max_len(const struct ctx *c);
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
@ -51,6 +93,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
const void *in, size_t l4len);
const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
const struct in6_addr *src);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
@ -74,6 +119,7 @@ void tap_sock_update_pool(void *base, size_t size);
void tap_backend_init(struct ctx *c);
void tap_flush_pools(void);
void tap_handler(struct ctx *c, const struct timespec *now);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now);
#endif /* TAP_H */

1259
tcp.c

File diff suppressed because it is too large

3
tcp.h
View file

@ -16,7 +16,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events);
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const void *saddr, const void *daddr, uint32_t flow_lbl,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port);
@ -25,7 +25,6 @@ void tcp_timer(struct ctx *c, const struct timespec *now);
void tcp_defer_handler(struct ctx *c);
void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
int tcp_set_peek_offset(int s, int offset);
extern bool peek_offset_cap;

View file

@ -125,7 +125,7 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
conn->seq_to_tap = seq;
peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
if (tcp_set_peek_offset(conn->sock, peek_offset))
if (tcp_set_peek_offset(conn, peek_offset))
tcp_rst(c, conn);
}
}
@ -304,7 +304,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(s, 0)) {
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}

View file

@ -19,6 +19,7 @@
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sock: Socket descriptor number
* @events: Connection events, implying connection states
* @listening_sock: Listening socket this socket was accept()ed from, or -1
* @timer: timerfd descriptor for timeout events
* @flags: Connection flags representing internal attributes
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
@ -68,6 +69,7 @@ struct tcp_tap_conn {
#define CONN_STATE_BITS /* Setting these clears other flags */ \
(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
int listening_sock;
int timer :FD_REF_BITS;
@ -96,6 +98,95 @@ struct tcp_tap_conn {
uint32_t seq_init_from_tap;
};
/**
* struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
* @pif: Interfaces for each side of the flow
* @side: Addresses and ports for each side of the flow
* @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @events: Connection events, implying connection states
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
* @flags: Connection flags representing internal attributes
* @seq_dup_ack_approx: Last duplicate ACK number sent to tap
* @wnd_from_tap: Last window size from tap, unscaled (as received)
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
* @seq_ack_from_tap: Last ACK number received from tap
* @seq_from_tap: Next sequence for packets from tap (not actually sent)
* @seq_ack_to_tap: Last ACK number sent to tap
* @seq_init_from_tap: Initial sequence number from tap
*/
struct tcp_tap_transfer {
uint8_t pif[SIDES];
struct flowside side[SIDES];
uint8_t retrans;
uint8_t ws_from_tap;
uint8_t ws_to_tap;
uint8_t events;
uint32_t tap_mss;
uint32_t sndbuf;
uint8_t flags;
uint8_t seq_dup_ack_approx;
uint16_t wnd_from_tap;
uint16_t wnd_to_tap;
uint32_t seq_to_tap;
uint32_t seq_ack_from_tap;
uint32_t seq_from_tap;
uint32_t seq_ack_to_tap;
uint32_t seq_init_from_tap;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
/**
* struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
* @seq_snd: Socket-side send sequence
* @seq_rcv: Socket-side receive sequence
* @sndq: Length of pending send queue (unacknowledged / not sent)
* @notsent: Part of pending send queue that wasn't sent out yet
* @rcvq: Length of pending receive queue
* @mss: Socket-side MSS clamp
* @timestamp: RFC 7323 timestamp
* @snd_wl1: Next sequence used in window probe (next sequence - 1)
* @snd_wnd: Socket-side sending window
* @max_window: Window clamp
* @rcv_wnd: Socket-side receive window
* @rcv_wup: rcv_nxt on last window update sent
* @snd_ws: Window scaling factor, send
* @rcv_ws: Window scaling factor, receive
* @tcpi_state: Connection state in TCP_INFO style (enum, tcp_states.h)
* @tcpi_options: TCPI_OPT_* constants (timestamps, selective ACK)
*/
struct tcp_tap_transfer_ext {
uint32_t seq_snd;
uint32_t seq_rcv;
uint32_t sndq;
uint32_t notsent;
uint32_t rcvq;
uint32_t mss;
uint32_t timestamp;
/* We can't just use struct tcp_repair_window: we need network order */
uint32_t snd_wl1;
uint32_t snd_wnd;
uint32_t max_window;
uint32_t rcv_wnd;
uint32_t rcv_wup;
uint8_t snd_ws;
uint8_t rcv_ws;
uint8_t tcpi_state;
uint8_t tcpi_options;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
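/* The fields above correspond to state the kernel exposes while a socket is
 * in repair mode; the actual serialisation is part of the tcp.c diff,
 * suppressed above as too large. Purely as an illustrative sketch (the helper
 * name is hypothetical, error handling is omitted, TCP_REPAIR is assumed to be
 * already enabled on 's', and struct tcp_tap_transfer_ext is the definition
 * above), a few of these values could be gathered along these lines:
 */
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <arpa/inet.h>		/* htonl() */
#include <linux/tcp.h>		/* TCP_QUEUE_SEQ, struct tcp_repair_window */

static void sketch_fill_ext(int s, struct tcp_tap_transfer_ext *t)
{
	struct tcp_repair_window wnd;
	uint32_t seq;
	socklen_t sl;
	int q;

	/* Socket-side send sequence: select the send queue, then query it */
	q = TCP_SEND_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	sl = sizeof(seq);
	getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, &sl);
	t->seq_snd = htonl(seq);

	/* Socket-side receive sequence: same pattern on the receive queue */
	q = TCP_RECV_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	sl = sizeof(seq);
	getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, &sl);
	t->seq_rcv = htonl(seq);

	/* Window data, stored field by field because of the byte order */
	sl = sizeof(wnd);
	getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl);
	t->snd_wl1	= htonl(wnd.snd_wl1);
	t->snd_wnd	= htonl(wnd.snd_wnd);
	t->max_window	= htonl(wnd.max_window);
	t->rcv_wnd	= htonl(wnd.rcv_wnd);
	t->rcv_wup	= htonl(wnd.rcv_wup);
}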
/**
* struct tcp_splice_conn - Descriptor for a spliced TCP connection
* @f: Generic flow information
@ -140,11 +231,23 @@ extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
bool tcp_flow_defer(const struct tcp_tap_conn *conn);
int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_target(struct ctx *c, int fd);
int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
int tcp_conn_pool_sock(int pool[]);
int tcp_conn_sock(const struct ctx *c, sa_family_t af);
int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
int tcp_conn_sock(sa_family_t af);
int tcp_sock_refill_pool(int pool[], sa_family_t af);
void tcp_splice_refill(const struct ctx *c);
#endif /* TCP_CONN_H */

View file

@ -38,9 +38,13 @@
#define OPT_SACK 5
#define OPT_TS 8
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define HOSTSIDE(conn_) ((conn_)->f.pif[1] == PIF_HOST)
#define HOSTFLOW(conn_) (&((conn_)->f.side[HOSTSIDE(conn_)]))
#define HOST_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->oaddr))
#define CONN_V6(conn) (!CONN_V4(conn))
@ -162,8 +166,6 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
struct tcp_info_linux;
void tcp_update_csum(uint32_t psum, struct tcphdr *th,
struct iov_tail *payload);
void tcp_fill_headers(const struct tcp_tap_conn *conn,
struct tap_hdr *taph,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
@ -175,5 +177,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen);
int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset);
#endif /* TCP_INTERNAL_H */

View file

@ -28,7 +28,7 @@
* - FIN_SENT_0: FIN (write shutdown) sent to accepted socket
* - FIN_SENT_1: FIN (write shutdown) sent to target socket
*
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
*/
#include <sched.h>
@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
ev[1].events = EPOLLOUT;
}
flow_foreach_sidei(sidei)
ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
flow_foreach_sidei(sidei) {
if (events & OUT_WAIT(sidei)) {
ev[sidei].events |= EPOLLOUT;
ev[!sidei].events &= ~EPOLLIN;
}
}
}
/**
@ -160,7 +164,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
int ret = -errno;
flow_err(conn, "ERROR on epoll_ctl(): %s", strerror_(errno));
flow_perror(conn, "ERROR on epoll_ctl()");
return ret;
}
@ -200,8 +204,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (flag == CLOSING) {
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[0], NULL);
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[1], NULL);
epoll_del(c, conn->s[0]);
epoll_del(c, conn->s[1]);
}
}
@ -313,8 +317,8 @@ static int tcp_splice_connect_finish(const struct ctx *c,
if (conn->pipe[sidei][0] < 0) {
if (pipe2(conn->pipe[sidei], O_NONBLOCK | O_CLOEXEC)) {
flow_err(conn, "cannot create %d->%d pipe: %s",
sidei, !sidei, strerror_(errno));
flow_perror(conn, "cannot create %d->%d pipe",
sidei, !sidei);
conn_flag(c, conn, CLOSING);
return -EIO;
}
@ -351,7 +355,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
int one = 1;
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
conn->s[1] = tcp_conn_sock(af);
else if (tgtpif == PIF_SPLICE)
conn->s[1] = tcp_conn_sock_ns(c, af);
else
@ -478,8 +482,7 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
if (rc)
flow_err(conn, "Error retrieving SO_ERROR: %s",
strerror_(errno));
flow_perror(conn, "Error retrieving SO_ERROR");
else
flow_trace(conn, "Error event on socket: %s",
strerror_(err));
@ -517,20 +520,21 @@ swap:
int more = 0;
retry:
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
flow_trace(conn, "%zi from read-side call", readlen);
if (readlen < 0) {
if (errno == EINTR)
goto retry;
do
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
while (readlen < 0 && errno == EINTR);
if (errno != EAGAIN)
goto close;
} else if (!readlen) {
if (readlen < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from read-side call", readlen);
if (!readlen) {
eof = 1;
} else {
} else if (readlen > 0) {
never_read = 0;
if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
@ -540,10 +544,16 @@ retry:
conn_flag(c, conn, lowat_act_flag);
}
eintr:
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL, c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
do
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
while (written < 0 && errno == EINTR);
if (written < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, c->tcp.pipe_size);
@ -552,7 +562,7 @@ eintr:
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
if (conn->flags & lowat_set_flag &&
if (!(conn->flags & lowat_set_flag) &&
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
@ -575,12 +585,6 @@ eintr:
conn->written[fromsidei] += written > 0 ? written : 0;
if (written < 0) {
if (errno == EINTR)
goto eintr;
if (errno != EAGAIN)
goto close;
if (conn->read[fromsidei] == conn->written[fromsidei])
break;
@ -703,13 +707,13 @@ static int tcp_sock_refill_ns(void *arg)
ns_enter(c);
if (c->ifi4) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool4, AF_INET);
int rc = tcp_sock_refill_pool(ns_sock_pool4, AF_INET);
if (rc < 0)
warn("TCP: Error refilling IPv4 ns socket pool: %s",
strerror_(-rc));
}
if (c->ifi6) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool6, AF_INET6);
int rc = tcp_sock_refill_pool(ns_sock_pool6, AF_INET6);
if (rc < 0)
warn("TCP: Error refilling IPv6 ns socket pool: %s",
strerror_(-rc));

View file

@ -38,7 +38,6 @@
static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
static int head[VIRTQUEUE_MAX_SIZE + 1];
static int head_cnt;
/**
* tcp_vu_hdrlen() - return the size of the header in level 2 frame (TCP)
@ -183,7 +182,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
static ssize_t tcp_vu_sock_recv(const struct ctx *c,
const struct tcp_tap_conn *conn, bool v6,
uint32_t already_sent, size_t fillsize,
int *iov_cnt)
int *iov_cnt, int *head_cnt)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@ -202,7 +201,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
elem_cnt = 0;
head_cnt = 0;
*head_cnt = 0;
while (fillsize > 0 && elem_cnt < VIRTQUEUE_MAX_SIZE) {
struct iovec *iov;
size_t frame_size, dlen;
@ -221,7 +220,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
ASSERT(iov->iov_len >= hdrlen);
iov->iov_base = (char *)iov->iov_base + hdrlen;
iov->iov_len -= hdrlen;
head[head_cnt++] = elem_cnt;
head[(*head_cnt)++] = elem_cnt;
fillsize -= dlen;
elem_cnt += cnt;
@ -261,17 +260,18 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
len -= iov->iov_len;
}
/* adjust head count */
while (head_cnt > 0 && head[head_cnt - 1] > i)
head_cnt--;
while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
(*head_cnt)--;
/* mark end of array */
head[head_cnt] = i;
head[*head_cnt] = i;
*iov_cnt = i;
/* release unused buffers */
vu_queue_rewind(vq, elem_cnt - i);
/* restore space for headers in iov */
for (i = 0; i < head_cnt; i++) {
for (i = 0; i < *head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
iov->iov_base = (char *)iov->iov_base - hdrlen;
@ -357,11 +357,11 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
ssize_t len, previous_dlen;
int i, iov_cnt, head_cnt;
size_t hdrlen, fillsize;
int v6 = CONN_V6(conn);
uint32_t already_sent;
const uint16_t *check;
int i, iov_cnt;
if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
debug("Got packet, but RX virtqueue not usable yet");
@ -376,7 +376,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(conn->sock, 0)) {
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}
@ -396,7 +396,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
/* collect the buffers from vhost-user and fill them with the
* data from the socket
*/
len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize, &iov_cnt);
len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize,
&iov_cnt, &head_cnt);
if (len < 0) {
if (len != -EAGAIN && len != -EWOULDBLOCK) {
tcp_rst(c, conn);

1
test/.gitignore vendored
View file

@ -8,5 +8,6 @@ QEMU_EFI.fd
*.raw.xz
*.bin
nstool
rampstream
guest-key
guest-key.pub

View file

@ -52,7 +52,8 @@ UBUNTU_IMGS = $(UBUNTU_OLD_IMGS) $(UBUNTU_NEW_IMGS)
DOWNLOAD_ASSETS = mbuto podman \
$(DEBIAN_IMGS) $(FEDORA_IMGS) $(OPENSUSE_IMGS) $(UBUNTU_IMGS)
TESTDATA_ASSETS = small.bin big.bin medium.bin
TESTDATA_ASSETS = small.bin big.bin medium.bin \
rampstream
LOCAL_ASSETS = mbuto.img mbuto.mem.img podman/bin/podman QEMU_EFI.fd \
$(DEBIAN_IMGS:%=prepared-%) $(FEDORA_IMGS:%=prepared-%) \
$(UBUNTU_NEW_IMGS:%=prepared-%) \
@ -85,7 +86,7 @@ podman/bin/podman: pull-podman
guest-key guest-key.pub:
ssh-keygen -f guest-key -N ''
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub $(TESTDATA_ASSETS)
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub rampstream-check.sh $(TESTDATA_ASSETS)
./mbuto/mbuto -p ./$< -c lz4 -f $@
mbuto.mem.img: passt.mem.mbuto mbuto ../passt.avx2

View file

@ -135,7 +135,7 @@ layout_two_guests() {
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #1" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
@ -143,13 +143,66 @@ layout_two_guests() {
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #1" pasta_1 passt_2
info_layout "two guests, two passt instances, in namespaces"
sleep 1
}
# layout_migrate() - Two guest panes, two passt panes, two passt-repair panes,
# plus host and log
layout_migrate() {
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
tmux split-window -v -t passt_test
tmux split-window -h -l '33%'
tmux split-window -h -t passt_test:1.1
tmux split-window -h -l '35%' -t passt_test:1.0
tmux split-window -v -t passt_test:1.0
tmux split-window -v -t passt_test:1.4
tmux split-window -v -t passt_test:1.6
tmux split-window -v -t passt_test:1.3
PANE_GUEST_1=0
PANE_GUEST_2=1
PANE_INFO=2
PANE_MON=3
PANE_HOST=4
PANE_PASST_REPAIR_1=5
PANE_PASST_1=6
PANE_PASST_REPAIR_2=7
PANE_PASST_2=8
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
tmux select-pane -t ${PANE_INFO} -T "test log"
pane_watch_contexts ${PANE_MON} "QEMU monitor" mon mon
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_REPAIR_1} "passt-repair #1 in namespace #1" repair_1 passt_repair_1
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_REPAIR_2} "passt-repair #2 in namespace #2" repair_2 passt_repair_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
info_layout "two guests, two passt + passt-repair instances, in namespaces"
sleep 1
}
# layout_demo_pasta() - Four panes for pasta demo
layout_demo_pasta() {
sleep 1

View file

@ -49,7 +49,7 @@ setup_passt() {
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -P ${STATESETUP}/passt.pid"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -H hostname1 --fqdn fqdn1.passt.test -P ${STATESETUP}/passt.pid"
# pidfile isn't created until passt is listening
wait_for [ -f "${STATESETUP}/passt.pid" ]
@ -160,11 +160,11 @@ setup_passt_in_ns() {
if [ ${VALGRIND} -eq 1 ]; then
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
else
context_run passt "make clean"
context_run passt "make"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
fi
wait_for [ -f "${STATESETUP}/passt.pid" ]
@ -243,7 +243,7 @@ setup_two_guests() {
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} --fqdn fqdn1.passt.test -H hostname1 -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
__opts=
@ -252,7 +252,7 @@ setup_two_guests() {
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} --hostname hostname2 --fqdn fqdn2 -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
@ -305,6 +305,117 @@ setup_two_guests() {
context_setup_guest guest_2 ${GUEST_2_CID}
}
# setup_migrate() - Set up two namespaces, run qemu, passt/passt-repair in both
setup_migrate() {
context_setup_host host
context_setup_host mon
context_setup_host pasta_1
context_setup_host pasta_2
layout_migrate
# Ports:
#
# guest #1 | guest #2 | ns #1 | host
# --------- |-----------|-----------|------------
# 10001 as server | | to guest | to ns #1
# 10002 | | as server | to ns #1
# 10003 | | to init | as server
# 10004 | as server | to guest | to ns #1
__opts=
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/pasta_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
__map_host4=192.0.2.1
__map_host6=2001:db8:9a55::1
__map_ns4=192.0.2.2
__map_ns6=2001:db8:9a55::2
# Option 1: send stuff via spliced path in pasta
# context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002 -T 10003 -u 10001,10002 -U 10003 --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
# Option 2: send stuff via tap (--map-guest-addr) instead (useful to see capture of full migration)
context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr ${__map_host4} --map-guest-addr ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
context_setup_nstool passt_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_2 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_2 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_1 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_2 ${STATESETUP}/ns1.hold
__ifname="$(context_run qemu_1 "ip -j link show | jq -rM '.[] | select(.link_type == \"ether\").ifname'")"
sleep 1
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
context_run_bg passt_repair_2 "./passt-repair ${STATESETUP}/passt_2.socket.repair"
__vmem="512M" # Keep migration fast
__qemu_netdev1=" \
-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
__qemu_netdev2=" \
-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev1}" \
" -pidfile ${STATESETUP}/qemu_1.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID" \
" -monitor unix:${STATESETUP}/qemu_1_mon.sock,server,nowait"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev2}" \
" -pidfile ${STATESETUP}/qemu_2.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID" \
" -monitor unix:${STATESETUP}/qemu_2_mon.sock,server,nowait" \
" -incoming tcp:0:20005"
context_setup_guest guest_1 ${GUEST_1_CID}
# Only available after migration:
( context_setup_guest guest_2 ${GUEST_2_CID} & )
}
# teardown_context_watch() - Remove contexts and stop panes watching them
# $1: Pane number watching
# $@: Context names
@ -375,7 +486,8 @@ teardown_two_guests() {
context_wait pasta_1
context_wait pasta_2
rm -f "${STATESETUP}/passt__[12].pid" "${STATESETUP}/pasta_[12].pid"
rm "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
@ -384,6 +496,30 @@ teardown_two_guests() {
teardown_context_watch ${PANE_PASST_2} pasta_2 passt_2
}
# teardown_migrate() - Exit namespaces, kill qemu processes, passt and pasta
teardown_migrate() {
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_1.pid")
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_2.pid")
context_wait qemu_1
context_wait qemu_2
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/passt_2.pid")
context_wait passt_1
context_wait passt_2
${NSTOOL} stop "${STATESETUP}/ns1.hold"
context_wait pasta_1
rm -f "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm -f "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
teardown_context_watch ${PANE_GUEST_2} qemu_2 guest_2
teardown_context_watch ${PANE_PASST_1} pasta_1 passt_1
teardown_context_watch ${PANE_PASST_2} pasta_1 passt_2
}
# teardown_demo_passt() - Exit namespace, kill qemu, passt and pasta
teardown_demo_passt() {
tmux send-keys -t ${PANE_GUEST} "C-c"

View file

@ -19,6 +19,7 @@ STATUS_FILE_INDEX=0
STATUS_COLS=
STATUS_PASS=0
STATUS_FAIL=0
STATUS_SKIPPED=0
PR_RED='\033[1;31m'
PR_GREEN='\033[1;32m'
@ -439,19 +440,21 @@ info_layout() {
# status_test_ok() - Update counter of passed tests, log and display message
status_test_ok() {
STATUS_PASS=$((STATUS_PASS + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_passed
}
# status_test_fail() - Update counter of failed tests, log and display message
status_test_fail() {
STATUS_FAIL=$((STATUS_FAIL + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_failed
}
# status_test_skip() - Update counter of skipped tests, log and display message
status_test_skip() {
STATUS_SKIPPED=$((STATUS_SKIPPED + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_skipped
}

View file

@ -20,10 +20,7 @@ test_iperf3s() {
__sctx="${1}"
__port="${2}"
pane_or_context_run_bg "${__sctx}" \
'iperf3 -s -p'${__port}' & echo $! > s.pid' \
sleep 1 # Wait for server to be ready
pane_or_context_run "${__sctx}" 'iperf3 -s -p'${__port}' -D -I s.pid'
}
# test_iperf3k() - Kill iperf3 server
@ -31,7 +28,7 @@ test_iperf3s() {
test_iperf3k() {
__sctx="${1}"
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid'
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid)'
sleep 1 # Wait for kernel to free up ports
}
@ -68,6 +65,45 @@ test_iperf3() {
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
# test_iperf3m() - Ugly helper for iperf3 directive, guest migration variant
# $1: Variable name to put the measured bandwidth into
# $2: Initial source/client context
# $3: Second source/client context the guest is moving to
# $4: Destination name or address for client
# $5: Port number, ${i} is translated to process index
# $6: Run time, in seconds
# $7: Client options
test_iperf3m() {
__var="${1}"; shift
__cctx="${1}"; shift
__cctx2="${1}"; shift
__dest="${1}"; shift
__port="${1}"; shift
__time="${1}"; shift
pane_or_context_run "${__cctx}" 'rm -f c.json'
# A 1s wait for connection on what's basically a local link
# indicates something is pretty wrong
__timeout=1000
pane_or_context_run_bg "${__cctx}" \
'iperf3 -J -c '${__dest}' -p '${__port} \
' --connect-timeout '${__timeout} \
' -t'${__time}' -i0 '"${@}"' > c.json' \
__jval=".end.sum_received.bits_per_second"
sleep $((${__time} + 3))
pane_or_context_output "${__cctx2}" \
'cat c.json'
__bw=$(pane_or_context_output "${__cctx2}" \
'cat c.json | jq -rMs "map('${__jval}') | add"')
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
test_one_line() {
__line="${1}"
@ -177,6 +213,12 @@ test_one_line() {
"guest2w")
pane_or_context_wait guest_2 || TEST_ONE_nok=1
;;
"mon")
pane_or_context_run mon "${__arg}" || TEST_ONE_nok=1
;;
"monb")
pane_or_context_run_bg mon "${__arg}"
;;
"ns")
pane_or_context_run ns "${__arg}" || TEST_ONE_nok=1
;;
@ -292,6 +334,9 @@ test_one_line() {
"iperf3")
test_iperf3 ${__arg}
;;
"iperf3m")
test_iperf3m ${__arg}
;;
"set")
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__arg%% *}__" "${__arg#* }")"
;;

59
test/migrate/basic Normal file
View file

@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic - Check basic migration functionality
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1/guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
sleep 1
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]

62
test/migrate/basic_fin Normal file
View file

@ -0,0 +1,62 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic_fin - Outbound traffic across migration, half-closed socket
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1, half-close, guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
#hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
#sleep 20
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]

View file

@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional - Check migration with messages in both directions
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 > host, host > guest1/guest2
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
guest1b socat -u TCP4-LISTEN:10001 OPEN:msg,create,trunc
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
# use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]

View file

@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional_fin - Both directions, half-closed sockets
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 <- (half closed) -> host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
guest1b echo FIN | socat TCP4-LISTEN:10001,shut-down STDIO,ignoreeof > msg
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
# use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]

View file

@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_bidir6 - Migration behaviour with many bidirectional flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 128
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N --bidir
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host <-> guest flood, many flows, during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

50
test/migrate/iperf3_in4 Normal file
View file

@ -0,0 +1,50 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in4 - Migration behaviour under inbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
guest1 /sbin/sysctl -w net.core.rmem_max=33554432
guest1 /sbin/sysctl -w net.core.wmem_max=33554432
set THREADS 1
set TIME 4
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

58
test/migrate/iperf3_in6 Normal file
View file

@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in6 - Migration behaviour under inbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 4
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

View file

@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_many_out6 - Migration behaviour with many outbound flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 16
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host flood, many flows, during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

47
test/migrate/iperf3_out4 Normal file
View file

@ -0,0 +1,47 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

58
test/migrate/iperf3_out6 Normal file
View file

@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

View file

@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_in - Check sequence correctness with inbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, inbound
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
guest1b socat -u TCP4-LISTEN:10001 EXEC:"rampstream-check.sh __RAMPS__"
sleep 1
hostb socat -u EXEC:"test/rampstream send __RAMPS__" TCP4:__ADDR1__:10001
sleep 1
monb echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
guest2 cat rampstream.err
guest2 [ $(cat rampstream.status) -eq 0 ]

View file

@ -0,0 +1,55 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_out - Check sequence correctness with outbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, outbound
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 EXEC:"test/rampstream check __RAMPS__"
sleep 1
guest1b socat -u EXEC:"rampstream send __RAMPS__" TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw

View file

@ -13,7 +13,8 @@
PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
modprobe find grep mknod mv rm umount jq iperf3 dhclient hostname
sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp}"
nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump
env}"
# OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
# sshd-session the per-session program. We need the latter as well, and the path
@ -31,7 +32,7 @@ LINKS="${LINKS:-
DIRS="${DIRS} /tmp /usr/sbin /usr/share /var/log /var/lib /etc/ssh /run/sshd /root/.ssh"
COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin"
COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin rampstream,/bin/rampstream rampstream-check.sh,/bin/rampstream-check.sh"
FIXUP="${FIXUP}"'
mv /sbin/* /usr/sbin || :
@ -41,6 +42,7 @@ FIXUP="${FIXUP}"'
#!/bin/sh
LOG=/var/log/dhclient-script.log
echo \${reason} \${interface} >> \$LOG
env >> \$LOG
set >> \$LOG
[ -n "\${new_interface_mtu}" ] && ip link set dev \${interface} mtu \${new_interface_mtu}
@ -54,7 +56,8 @@ set >> \$LOG
[ -n "\${new_ip6_address}" ] && ip addr add \${new_ip6_address}/\${new_ip6_prefixlen} dev \${interface}
[ -n "\${new_dhcp6_name_servers}" ] && for d in \${new_dhcp6_name_servers}; do echo "nameserver \${d}%\${interface}" >> /etc/resolv.conf; done
[ -n "\${new_dhcp6_domain_search}" ] && (printf "search"; for d in \${new_dhcp6_domain_search}; do printf " %s" "\${d}"; done; printf "\n") >> /etc/resolv.conf
[ -n "\${new_host_name}" ] && hostname "\${new_host_name}"
[ -n "\${new_host_name}" ] && echo "\${new_host_name}" > /tmp/new_host_name
[ -n "\${new_fqdn_fqdn}" ] && echo "\${new_fqdn_fqdn}" > /tmp/new_fqdn_fqdn
exit 0
EOF
chmod 755 /sbin/dhclient-script
@ -65,6 +68,7 @@ EOF
# sshd via vsock
cat > /etc/passwd << EOF
root:x:0:0:root:/root:/bin/sh
tcpdump:x:72:72:tcpdump:/:/sbin/nologin
sshd:x:100:100:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
EOF
cat > /etc/shadow << EOF

View file

@ -11,7 +11,7 @@
# Copyright (c) 2021 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
gtools ip jq dhclient sed tr
gtools ip jq dhclient sed tr hostname
htools ip jq sed tr head
test Interface name
@ -47,7 +47,16 @@ gout SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^searc
hout HOST_SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCP: Hostname
gout NEW_HOST_NAME cat /tmp/new_host_name
check [ "__NEW_HOST_NAME__" = "hostname1" ]
test DHCP: Client FQDN
gout NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
check [ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]
test DHCPv6: address
guest rm /tmp/new_fqdn_fqdn
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
@ -70,3 +79,7 @@ test DHCPv6: search list
gout SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH6__" = "__HOST_SEARCH6__" ]
test DHCPv6: Hostname
gout NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
check [ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]

3
test/rampstream-check.sh Executable file
View file

@ -0,0 +1,3 @@
#! /bin/sh
(rampstream check "$@" 2>&1; echo $? > rampstream.status) | tee rampstream.err

143
test/rampstream.c Normal file
View file

@ -0,0 +1,143 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* rampstream - Generate and check a stream of bytes in a ramp pattern
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
/* Length of the repeating ramp. This is deliberately not a "round" number so
* that we're very likely to misalign with likely block or chunk sizes of the
* transport. That means we'll detect gaps in the stream, even if they occur
* neatly on block boundaries. Specifically this is the largest 8-bit prime. */
#define RAMPLEN 251
#define INTERVAL 10000
#define ARRAY_SIZE(a) ((int)(sizeof(a) / sizeof((a)[0])))
#define die(...) \
do { \
fprintf(stderr, "rampstream: " __VA_ARGS__); \
exit(1); \
} while (0)
static void usage(void)
{
die("Usage:\n"
" rampstream send <number>\n"
" Generate a ramp pattern of bytes on stdout, repeated <number>\n"
" times\n"
" rampstream check <number>\n"
" Check a ramp pattern of bytes on stdin, repeater <number>\n"
" times\n");
}
static void ramp_send(unsigned long long num, const uint8_t *ramp)
{
unsigned long long i;
for (i = 0; i < num; i++) {
int off = 0;
ssize_t rc;
if (i % INTERVAL == 0)
fprintf(stderr, "%llu...\r", i);
while (off < RAMPLEN) {
rc = write(1, ramp + off, RAMPLEN - off);
if (rc < 0) {
if (errno == EINTR ||
errno == EAGAIN ||
errno == EWOULDBLOCK)
continue;
die("Error writing ramp: %s\n",
strerror(errno));
}
if (rc == 0)
die("Zero length write\n");
off += rc;
}
}
}
static void ramp_check(unsigned long long num, const uint8_t *ramp)
{
unsigned long long i;
for (i = 0; i < num; i++) {
uint8_t buf[RAMPLEN];
int off = 0;
ssize_t rc;
if (i % INTERVAL == 0)
fprintf(stderr, "%llu...\r", i);
while (off < RAMPLEN) {
rc = read(0, buf + off, RAMPLEN - off);
if (rc < 0) {
if (errno == EINTR ||
errno == EAGAIN ||
errno == EWOULDBLOCK)
continue;
die("Error reading ramp: %s\n",
strerror(errno));
}
if (rc == 0)
die("Unexpected EOF, ramp %llu, byte %d\n",
i, off);
off += rc;
}
if (memcmp(buf, ramp, sizeof(buf)) != 0) {
int j, k;
for (j = 0; j < RAMPLEN; j++)
if (buf[j] != ramp[j])
break;
for (k = j; k < RAMPLEN && k < j + 16; k++)
fprintf(stderr,
"Byte %d: expected 0x%02x, got 0x%02x\n",
k, ramp[k], buf[k]);
die("Data mismatch, ramp %llu, byte %d\n", i, j);
}
}
}
int main(int argc, char *argv[])
{
const char *subcmd = argv[1];
unsigned long long num;
uint8_t ramp[RAMPLEN];
char *e;
int i;
if (argc < 3)
usage();
errno = 0;
num = strtoull(argv[2], &e, 0);
if (*e || errno)
usage();
/* Initialize the ramp block */
for (i = 0; i < RAMPLEN; i++)
ramp[i] = i;
if (strcmp(subcmd, "send") == 0)
ramp_send(num, ramp);
else if (strcmp(subcmd, "check") == 0)
ramp_check(num, ramp);
else
usage();
exit(0);
}

View file

@ -130,6 +130,43 @@ run() {
test two_guests_vu/basic
teardown two_guests
setup migrate
test migrate/basic
teardown migrate
setup migrate
test migrate/basic_fin
teardown migrate
setup migrate
test migrate/bidirectional
teardown migrate
setup migrate
test migrate/bidirectional_fin
teardown migrate
setup migrate
test migrate/iperf3_out4
teardown migrate
setup migrate
test migrate/iperf3_out6
teardown migrate
setup migrate
test migrate/iperf3_in4
teardown migrate
setup migrate
test migrate/iperf3_in6
teardown migrate
setup migrate
test migrate/iperf3_bidir6
teardown migrate
setup migrate
test migrate/iperf3_many_out6
teardown migrate
setup migrate
test migrate/rampstream_in
teardown migrate
setup migrate
test migrate/rampstream_out
teardown migrate
VALGRIND=0
VHOST_USER=0
setup passt_in_ns
@ -165,7 +202,7 @@ skip_distro() {
perf_finish
[ ${CI} -eq 1 ] && video_stop
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
pause_continue \
"Press any key to keep test session open" \
@ -186,7 +223,10 @@ run_selected() {
__setup=
for __test; do
if [ "${__test%%/*}" != "${__setup}" ]; then
# HACK: the migrate tests need the setup repeated for
# each test
if [ "${__test%%/*}" != "${__setup}" -o \
"${__test%%/*}" = "migrate" ]; then
[ -n "${__setup}" ] && teardown "${__setup}"
__setup="${__test%%/*}"
setup "${__setup}"
@ -196,7 +236,7 @@ run_selected() {
done
teardown "${__setup}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
pause_continue \
"Press any key to keep test session open" \
@ -267,4 +307,4 @@ fi
tail -n1 ${LOGFILE}
echo "Log at ${LOGFILE}"
exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\)$/\1/p')
exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\),.*$/\1/p')

696
udp.c
View file

@ -39,27 +39,30 @@
* could receive packets from multiple flows, so we use a hash table match to
* find the specific flow for a datagram.
*
* When a UDP flow is initiated from a listening socket we take a duplicate of
* the socket and store it in uflow->s[INISIDE]. This will last for the
* Flow sockets
* ============
*
* When a UDP flow targets a socket, we create a "flow" socket in
* uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
* replies on the target side. This socket is both bound and connected and has
* EPOLL_TYPE_UDP. The connect() means it will only receive datagrams
* associated with this flow, so the epoll reference directly points to the flow
* and we don't need a hash lookup.
*
* When a flow is initiated from a listening socket, we create a "flow" socket
* with the same bound address as the listening socket, but also connect()ed to
* the flow's peer. This is stored in uflow->s[INISIDE] and will last for the
* lifetime of the flow, even if the original listening socket is closed due to
* port auto-probing. The duplicate is used to deliver replies back to the
* originating side.
*
* Reply sockets
* =============
*
* When a UDP flow targets a socket, we create a "reply" socket in
* uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
* replies on the target side. This socket is both bound and connected and has
* EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams
* associated with this flow, so the epoll reference directly points to the flow
* and we don't need a hash lookup.
*
* NOTE: it's possible that the reply socket could have a bound address
* overlapping with an unrelated listening socket. We assume datagrams for the
* flow will come to the reply socket in preference to a listening socket. The
* sample program doc/platform-requirements/reuseaddr-priority.c documents and
* tests that assumption.
* NOTE: A flow socket can have a bound address overlapping with a listening
* socket. That will happen naturally for flows initiated from a socket, but is
* also possible (though unlikely) for tap initiated flows, depending on the
* source port. We assume datagrams for the flow will come to a connect()ed
* socket in preference to a listening socket. The sample program
* doc/platform-requirements/reuseaddr-priority.c documents and tests that
* assumption.
*
* "Spliced" flows
* ===============
@ -71,8 +74,7 @@
* actually used; it doesn't make sense for datagrams and instead a pair of
* recvmmsg() and sendmmsg() is used to forward the datagrams.
*
* Note that a spliced flow will have *both* a duplicated listening socket and a
* reply socket (see above).
* Note that a spliced flow will have two flow sockets (see above).
*/
#include <sched.h>
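For readers new to the flow-socket scheme described in the comment above, here is a minimal standalone sketch of the bound-plus-connected idea (illustration only, not passt's actual setup code; the addresses, ports and function name below are made up): a UDP socket bound with SO_REUSEADDR to the same local address as a listening socket, then connect()ed to the flow's peer, only receives datagrams from that peer, which is what lets the epoll reference point straight at the flow. The assumption that the kernel prefers such a connected socket over an overlapping listener is the one documented and tested by doc/platform-requirements/reuseaddr-priority.c.

/* Sketch only: the bound + connected "flow" socket idea from the comment
 * above. Addresses, ports and error handling are illustrative, not taken
 * from passt.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int flow_socket_sketch(void)
{
	struct sockaddr_in local = { .sin_family = AF_INET,
				     .sin_port = htons(5000) };  /* same as listener */
	struct sockaddr_in peer  = { .sin_family = AF_INET,
				     .sin_port = htons(40000) }; /* flow's endpoint */
	int one = 1, s;

	inet_pton(AF_INET, "192.0.2.1", &local.sin_addr);
	inet_pton(AF_INET, "198.51.100.7", &peer.sin_addr);

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;

	/* Allow the bound address to overlap with the listening socket */
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0 ||
	    connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		close(s);
		return -1;
	}

	/* From here on, recv() on s only returns datagrams sent by "peer",
	 * so an epoll entry for s can reference the flow directly.
	 */
	return s;
}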
@ -87,6 +89,8 @@
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <netinet/ip_icmp.h>
#include <netinet/icmp6.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
@ -112,6 +116,14 @@
#include "udp_internal.h"
#include "udp_vu.h"
#define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */
/* Maximum UDP data to be returned in ICMP messages */
#define ICMP4_MAX_DLEN 8
#define ICMP6_MAX_DLEN (IPV6_MIN_MTU \
- sizeof(struct udphdr) \
- sizeof(struct ipv6hdr))
/* "Spliced" sockets indexed by bound port (host order) */
static int udp_splice_ns [IP_VERSIONS][NUM_PORTS];
static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
@ -128,26 +140,31 @@ static struct ethhdr udp4_eth_hdr;
static struct ethhdr udp6_eth_hdr;
/**
* struct udp_meta_t - Pre-cooked headers and metadata for UDP packets
* struct udp_meta_t - Pre-cooked headers for UDP packets
* @ip6h: Pre-filled IPv6 header (except for payload_len and addresses)
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
* @taph: Tap backend specific header
* @s_in: Source socket address, filled in by recvmmsg()
* @tosidx: sidx for the destination side of this datagram's flow
*/
static struct udp_meta_t {
struct ipv6hdr ip6h;
struct iphdr ip4h;
struct tap_hdr taph;
union sockaddr_inany s_in;
flow_sidx_t tosidx;
}
#ifdef __AVX2__
__attribute__ ((aligned(32)))
#endif
udp_meta[UDP_MAX_FRAMES];
#define PKTINFO_SPACE \
MAX(CMSG_SPACE(sizeof(struct in_pktinfo)), \
CMSG_SPACE(sizeof(struct in6_pktinfo)))
#define RECVERR_SPACE \
MAX(CMSG_SPACE(sizeof(struct sock_extended_err) + \
sizeof(struct sockaddr_in)), \
CMSG_SPACE(sizeof(struct sock_extended_err) + \
sizeof(struct sockaddr_in6)))
/**
* enum udp_iov_idx - Indices for the buffers making up a single UDP frame
* @UDP_IOV_TAP: tap specific header
@ -224,8 +241,6 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
tiov[UDP_IOV_PAYLOAD].iov_base = payload;
mh->msg_name = &meta->s_in;
mh->msg_namelen = sizeof(meta->s_in);
mh->msg_iov = siov;
mh->msg_iovlen = 1;
}
@ -245,41 +260,6 @@ static void udp_iov_init(const struct ctx *c)
udp_iov_init_one(c, i);
}
/**
* udp_splice_prepare() - Prepare one datagram for splicing
* @mmh: Receiving mmsghdr array
* @idx: Index of the datagram to prepare
*/
static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx)
{
udp_mh_splice[idx].msg_hdr.msg_iov->iov_len = mmh[idx].msg_len;
}
/**
* udp_splice_send() - Send a batch of datagrams from socket to socket
* @c: Execution context
* @start: Index of batch's first datagram in udp[46]_l2_buf
* @n: Number of datagrams in batch
* @src: Source port for datagram (target side)
* @dst: Destination port for datagrams (target side)
* @ref: epoll reference for origin socket
* @now: Timestamp
*/
static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
const struct udp_flow *uflow = udp_at_sidx(tosidx);
uint8_t topif = pif_at_sidx(tosidx);
int s = uflow->s[tosidx.sidei];
socklen_t sl;
pif_sockaddr(c, &udp_splice_to, &sl, topif,
&toside->eaddr, toside->eport);
sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL);
}
/**
* udp_update_hdr4() - Update headers for one IPv4 datagram
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
@ -402,28 +382,172 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
}
/**
* udp_send_tap_icmp4() - Construct and send ICMPv4 to local peer
* @c: Execution context
* @ee: Extended error descriptor
* @toside: Destination side of flow
* @saddr: Address of ICMP generating node
* @in: First bytes (max 8) of original UDP message body
* @dlen: Length of the read part of original UDP message body
*/
static void udp_send_tap_icmp4(const struct ctx *c,
const struct sock_extended_err *ee,
const struct flowside *toside,
struct in_addr saddr,
const void *in, size_t dlen)
{
struct in_addr oaddr = toside->oaddr.v4mapped.a4;
struct in_addr eaddr = toside->eaddr.v4mapped.a4;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
struct {
struct icmphdr icmp4h;
struct iphdr ip4h;
struct udphdr uh;
char data[ICMP4_MAX_DLEN];
} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
size_t l4len = dlen + sizeof(struct udphdr);
ASSERT(dlen <= ICMP4_MAX_DLEN);
memset(&msg, 0, sizeof(msg));
msg.icmp4h.type = ee->ee_type;
msg.icmp4h.code = ee->ee_code;
if (ee->ee_type == ICMP_DEST_UNREACH && ee->ee_code == ICMP_FRAG_NEEDED)
msg.icmp4h.un.frag.mtu = htons((uint16_t) ee->ee_info);
/* Reconstruct the original headers as returned in the ICMP message */
tap_push_ip4h(&msg.ip4h, eaddr, oaddr, l4len, IPPROTO_UDP);
tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
}
/**
* udp_send_tap_icmp6() - Construct and send ICMPv6 to local peer
* @c: Execution context
* @ee: Extended error descriptor
* @toside: Destination side of flow
* @saddr: Address of ICMP generating node
* @in: First bytes (max 1232) of original UDP message body
* @dlen: Length of the read part of original UDP message body
* @flow: IPv6 flow identifier
*/
static void udp_send_tap_icmp6(const struct ctx *c,
const struct sock_extended_err *ee,
const struct flowside *toside,
const struct in6_addr *saddr,
void *in, size_t dlen, uint32_t flow)
{
const struct in6_addr *oaddr = &toside->oaddr.a6;
const struct in6_addr *eaddr = &toside->eaddr.a6;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
struct {
struct icmp6_hdr icmp6h;
struct ipv6hdr ip6h;
struct udphdr uh;
char data[ICMP6_MAX_DLEN];
} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
size_t l4len = dlen + sizeof(struct udphdr);
ASSERT(dlen <= ICMP6_MAX_DLEN);
memset(&msg, 0, sizeof(msg));
msg.icmp6h.icmp6_type = ee->ee_type;
msg.icmp6h.icmp6_code = ee->ee_code;
if (ee->ee_type == ICMP6_PACKET_TOO_BIG)
msg.icmp6h.icmp6_dataun.icmp6_un_data32[0] = htonl(ee->ee_info);
/* Reconstruct the original headers as returned in the ICMP message */
tap_push_ip6h(&msg.ip6h, eaddr, oaddr, l4len, IPPROTO_UDP, flow);
tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
}
/**
* udp_pktinfo() - Retrieve packet destination address from cmsg
* @msg: msghdr into which message has been received
* @dst: (Local) destination address of message in @mh (output)
*
* Return: 0 on success, -1 if the information was missing (@dst is set to
* inany_any6).
*/
static int udp_pktinfo(struct msghdr *msg, union inany_addr *dst)
{
struct cmsghdr *hdr;
for (hdr = CMSG_FIRSTHDR(msg); hdr; hdr = CMSG_NXTHDR(msg, hdr)) {
if (hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_PKTINFO) {
const struct in_pktinfo *i4 = (void *)CMSG_DATA(hdr);
*dst = inany_from_v4(i4->ipi_addr);
return 0;
}
if (hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_PKTINFO) {
const struct in6_pktinfo *i6 = (void *)CMSG_DATA(hdr);
dst->a6 = i6->ipi6_addr;
return 0;
}
}
debug("Missing PKTINFO cmsg on datagram");
*dst = inany_any6;
return -1;
}
/**
* udp_sock_recverr() - Receive and clear an error from a socket
* @s: Socket to receive from
* @c: Execution context
* @s: Socket to receive errors from
* @sidx: Flow and side of @s, or FLOW_SIDX_NONE if unknown
* @pif: Interface on which the error occurred
* (only used if @sidx == FLOW_SIDX_NONE)
* @port: Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
*
* Return: 1 if error received and processed, 0 if no more errors in queue, < 0
* if there was an error reading the queue
*
* #syscalls recvmsg
*/
static int udp_sock_recverr(int s)
static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
uint8_t pif, in_port_t port)
{
char buf[PKTINFO_SPACE + RECVERR_SPACE];
const struct sock_extended_err *ee;
const struct cmsghdr *hdr;
char buf[CMSG_SPACE(sizeof(*ee))];
char data[ICMP6_MAX_DLEN];
struct cmsghdr *hdr;
struct iovec iov = {
.iov_base = data,
.iov_len = sizeof(data)
};
union sockaddr_inany src;
struct msghdr mh = {
.msg_name = NULL,
.msg_namelen = 0,
.msg_iov = NULL,
.msg_iovlen = 0,
.msg_name = &src,
.msg_namelen = sizeof(src),
.msg_iov = &iov,
.msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = sizeof(buf),
};
const struct flowside *fromside, *toside;
union inany_addr offender, otap;
char astr[INANY_ADDRSTRLEN];
char sastr[SOCKADDR_STRLEN];
const struct in_addr *o4;
in_port_t offender_port;
struct udp_flow *uflow;
uint8_t topif;
size_t dlen;
ssize_t rc;
rc = recvmsg(s, &mh, MSG_ERRQUEUE);
@ -440,33 +564,102 @@ static int udp_sock_recverr(int s)
return -1;
}
hdr = CMSG_FIRSTHDR(&mh);
if (!((hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_RECVERR) ||
(hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_RECVERR))) {
err("Unexpected cmsg reading error queue");
for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
if ((hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_RECVERR) ||
(hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_RECVERR))
break;
}
if (!hdr) {
err("Missing RECVERR cmsg in error queue");
return -1;
}
ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
/* TODO: When possible propagate and otherwise handle errors */
debug("%s error on UDP socket %i: %s",
str_ee_origin(ee), s, strerror_(ee->ee_errno));
if (!flow_sidx_valid(sidx)) {
/* No hint from the socket, determine flow from addresses */
union inany_addr dst;
if (udp_pktinfo(&mh, &dst) < 0) {
debug("Missing PKTINFO on UDP error");
return 1;
}
sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, &src, &dst, port);
if (!flow_sidx_valid(sidx)) {
debug("Ignoring UDP error without flow");
return 1;
}
} else {
pif = pif_at_sidx(sidx);
}
uflow = udp_at_sidx(sidx);
ASSERT(uflow);
fromside = &uflow->f.side[sidx.sidei];
toside = &uflow->f.side[!sidx.sidei];
topif = uflow->f.pif[!sidx.sidei];
dlen = rc;
if (inany_from_sockaddr(&offender, &offender_port,
SO_EE_OFFENDER(ee)) < 0)
goto fail;
if (pif != PIF_HOST || topif != PIF_TAP)
/* XXX Can we support any other cases? */
goto fail;
/* If the offender *is* the endpoint, make sure our translation is
* consistent with the flow's translation. This matters if the flow
* endpoint has a port specific translation (like --dns-match).
*/
if (inany_equals(&offender, &fromside->eaddr))
otap = toside->oaddr;
else if (!nat_inbound(c, &offender, &otap))
goto fail;
if (hdr->cmsg_level == IPPROTO_IP &&
(o4 = inany_v4(&otap)) && inany_v4(&toside->eaddr)) {
dlen = MIN(dlen, ICMP4_MAX_DLEN);
udp_send_tap_icmp4(c, ee, toside, *o4, data, dlen);
return 1;
}
if (hdr->cmsg_level == IPPROTO_IPV6 && !inany_v4(&toside->eaddr)) {
udp_send_tap_icmp6(c, ee, toside, &otap.a6, data, dlen,
FLOW_IDX(uflow));
return 1;
}
fail:
flow_dbg(uflow, "Can't propagate %s error from %s %s to %s %s",
str_ee_origin(ee),
pif_name(pif),
sockaddr_ntop(SO_EE_OFFENDER(ee), sastr, sizeof(sastr)),
pif_name(topif),
inany_ntop(&toside->eaddr, astr, sizeof(astr)));
return 1;
}
/**
* udp_sock_errs() - Process errors on a socket
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
* @s: Socket to receive errors from
* @sidx: Flow and side of @s, or FLOW_SIDX_NONE if unknown
* @pif: Interface on which the error occurred
* (only used if @sidx == FLOW_SIDX_NONE)
* @port: Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
*
* Return: Number of errors handled, or < 0 if we have an unrecoverable error
*/
int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx,
uint8_t pif, in_port_t port)
{
unsigned n_err = 0;
socklen_t errlen;
@ -474,11 +667,8 @@ int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
ASSERT(!c->no_udp);
if (!(events & EPOLLERR))
return 0; /* Nothing to do */
/* Empty the error queue */
while ((rc = udp_sock_recverr(s)) > 0)
while ((rc = udp_sock_recverr(c, s, sidx, pif, port)) > 0)
n_err += rc;
if (rc < 0)
@ -505,37 +695,62 @@ int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
return n_err;
}
/**
* udp_peek_addr() - Get source address for next packet
* @s: Socket to get information from
* @src: Socket address (output)
* @dst: (Local) destination address (output)
*
* Return: 0 if no more packets, 1 on success, -ve error code on error
*/
static int udp_peek_addr(int s, union sockaddr_inany *src,
union inany_addr *dst)
{
char sastr[SOCKADDR_STRLEN], dstr[INANY_ADDRSTRLEN];
char cmsg[PKTINFO_SPACE];
struct msghdr msg = {
.msg_name = src,
.msg_namelen = sizeof(*src),
.msg_control = cmsg,
.msg_controllen = sizeof(cmsg),
};
int rc;
rc = recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
if (rc < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK)
return 0;
return -errno;
}
udp_pktinfo(&msg, dst);
trace("Peeked UDP datagram: %s -> %s",
sockaddr_ntop(src, sastr, sizeof(sastr)),
inany_ntop(dst, dstr, sizeof(dstr)));
return 1;
}
/**
* udp_sock_recv() - Receive datagrams from a socket
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
* @mmh: mmsghdr array to receive into
* @n: Maximum number of datagrams to receive
*
* Return: Number of datagrams received
*
* #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
*/
static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
struct mmsghdr *mmh)
static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
{
/* For not entirely clear reasons (data locality?) pasta gets better
* throughput if we receive tap datagrams one at a time. For small
* splice datagrams throughput is slightly better if we do batch, but
* it's slightly worse for large splice datagrams. Since we don't know
* before we receive whether we'll use tap or splice, always go one at a
* time for pasta mode.
*/
int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
ASSERT(!c->no_udp);
if (!(events & EPOLLIN))
return 0;
n = recvmmsg(s, mmh, n, 0, NULL);
if (n < 0) {
err_perror("Error receiving datagrams");
trace("Error receiving datagrams: %s", strerror_(errno));
/* Bail out and let the EPOLLERR handler deal with it */
return 0;
}
@ -543,78 +758,121 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
}
/**
* udp_buf_listen_sock_handler() - Handle new data from socket
* udp_sock_to_sock() - Forward datagrams from socket to socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
* @from_s: Socket to receive datagrams from
* @n: Maximum number of datagrams to forward
* @tosidx: Flow & side to forward datagrams to
*
* #syscalls recvmmsg
* #syscalls sendmmsg
*/
static void udp_buf_listen_sock_handler(const struct ctx *c,
union epoll_ref ref, uint32_t events,
const struct timespec *now)
static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
flow_sidx_t tosidx)
{
const socklen_t sasize = sizeof(udp_meta[0].s_in);
int n, i;
const struct flowside *toside = flowside_at_sidx(tosidx);
const struct udp_flow *uflow = udp_at_sidx(tosidx);
uint8_t topif = pif_at_sidx(tosidx);
int to_s = uflow->s[tosidx.sidei];
socklen_t sl;
int i;
if (udp_sock_errs(c, ref.fd, events) < 0) {
err("UDP: Unrecoverable error on listening socket:"
" (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
/* FIXME: what now? close/re-open socket? */
if ((n = udp_sock_recv(c, from_s, udp_mh_recv, n)) <= 0)
return;
for (i = 0; i < n; i++) {
udp_mh_splice[i].msg_hdr.msg_iov->iov_len
= udp_mh_recv[i].msg_len;
}
if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
pif_sockaddr(c, &udp_splice_to, &sl, topif,
&toside->eaddr, toside->eport);
sendmmsg(to_s, udp_mh_splice, n, MSG_NOSIGNAL);
}
/**
* udp_buf_sock_to_tap() - Forward datagrams from socket to tap
* @c: Execution context
* @s: Socket to read data from
* @n: Maximum number of datagrams to forward
* @tosidx: Flow & side to forward data from @s to
*/
static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
int i;
if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
return;
/* We divide datagrams into batches based on how we need to send them,
* determined by udp_meta[i].tosidx. To avoid either two passes through
* the array, or recalculating tosidx for a single entry, we have to
* populate it one entry *ahead* of the loop counter.
*/
udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0].s_in, now);
udp_mh_recv[0].msg_hdr.msg_namelen = sasize;
for (i = 0; i < n; ) {
flow_sidx_t batchsidx = udp_meta[i].tosidx;
uint8_t batchpif = pif_at_sidx(batchsidx);
int batchstart = i;
for (i = 0; i < n; i++)
udp_tap_prepare(udp_mh_recv, i, toside, false);
do {
if (pif_is_socket(batchpif)) {
udp_splice_prepare(udp_mh_recv, i);
} else if (batchpif == PIF_TAP) {
udp_tap_prepare(udp_mh_recv, i,
flowside_at_sidx(batchsidx),
false);
tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
}
/**
* udp_sock_fwd() - Forward datagrams from a possibly unconnected socket
* @c: Execution context
* @s: Socket to forward from
* @frompif: Interface to which @s belongs
* @port: Our (local) port number of @s
* @now: Current timestamp
*/
void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
in_port_t port, const struct timespec *now)
{
union sockaddr_inany src;
union inany_addr dst;
int rc;
while ((rc = udp_peek_addr(s, &src, &dst)) != 0) {
bool discard = false;
flow_sidx_t tosidx;
uint8_t topif;
if (rc < 0) {
trace("Error peeking at socket address: %s",
strerror_(-rc));
/* Clear errors & carry on */
if (udp_sock_errs(c, s, FLOW_SIDX_NONE,
frompif, port) < 0) {
err("UDP: Unrecoverable error on listening socket:"
    " (%s port %hu)", pif_name(frompif), port);
/* FIXME: what now? close/re-open socket? */
}
continue;
}
if (++i >= n)
break;
tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, now);
topif = pif_at_sidx(tosidx);
udp_meta[i].tosidx = udp_flow_from_sock(c, ref,
&udp_meta[i].s_in,
now);
udp_mh_recv[i].msg_hdr.msg_namelen = sasize;
} while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx));
if (pif_is_socket(batchpif)) {
udp_splice_send(c, batchstart, i - batchstart,
batchsidx);
} else if (batchpif == PIF_TAP) {
tap_send_frames(c, &udp_l2_iov[batchstart][0],
UDP_NUM_IOVS, i - batchstart);
} else if (flow_sidx_valid(batchsidx)) {
flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
struct udp_flow *uflow = udp_at_sidx(batchsidx);
if (pif_is_socket(topif)) {
udp_sock_to_sock(c, s, 1, tosidx);
} else if (topif == PIF_TAP) {
if (c->mode == MODE_VU)
udp_vu_sock_to_tap(c, s, 1, tosidx);
else
udp_buf_sock_to_tap(c, s, 1, tosidx);
} else if (flow_sidx_valid(tosidx)) {
struct udp_flow *uflow = udp_at_sidx(tosidx);
flow_err(uflow,
"No support for forwarding UDP from %s to %s",
pif_name(pif_at_sidx(fromsidx)),
pif_name(batchpif));
pif_name(frompif), pif_name(topif));
discard = true;
} else {
debug("Discarding %d datagrams without flow",
i - batchstart);
debug("Discarding datagram without flow");
discard = true;
}
if (discard) {
struct msghdr msg = { 0 };
if (recvmsg(s, &msg, MSG_DONTWAIT) < 0)
debug_perror("Failed to discard datagram");
}
}
}
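The loop above hinges on peeking at the source address of the next queued datagram and, when nothing can take it, discarding it. A minimal sketch of those two primitives, not passt code: it uses a plain struct sockaddr_storage instead of passt's union sockaddr_inany, and the sketch_* names are illustrative.

#include <sys/socket.h>

static int sketch_peek_src(int s, struct sockaddr_storage *src)
{
	struct msghdr msg = {
		.msg_name = src,
		.msg_namelen = sizeof(*src),
	};

	/* MSG_PEEK leaves the datagram queued for the real receive later */
	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
}

static void sketch_discard_one(int s)
{
	struct msghdr msg = { 0 };

	/* No buffers at all: dequeue the datagram and drop its payload */
	(void)recvmsg(s, &msg, MSG_DONTWAIT);
}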
@ -630,87 +888,69 @@ void udp_listen_sock_handler(const struct ctx *c,
union epoll_ref ref, uint32_t events,
const struct timespec *now)
{
if (c->mode == MODE_VU) {
udp_vu_listen_sock_handler(c, ref, events, now);
return;
}
udp_buf_listen_sock_handler(c, ref, events, now);
if (events & (EPOLLERR | EPOLLIN))
udp_sock_fwd(c, ref.fd, ref.udp.pif, ref.udp.port, now);
}
/**
* udp_buf_reply_sock_handler() - Handle new data from flow specific socket
* udp_sock_handler() - Handle new data from flow specific socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
*
* #syscalls recvmmsg
*/
static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events,
const struct timespec *now)
void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now)
{
flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
const struct flowside *toside = flowside_at_sidx(tosidx);
struct udp_flow *uflow = udp_at_sidx(ref.flowside);
uint8_t topif = pif_at_sidx(tosidx);
int n, i, from_s;
ASSERT(!c->no_udp && uflow);
from_s = uflow->s[ref.flowside.sidei];
if (udp_sock_errs(c, from_s, events) < 0) {
flow_err(uflow, "Unrecoverable error on reply socket");
flow_err_details(uflow);
udp_flow_close(c, uflow);
return;
if (events & EPOLLERR) {
if (udp_sock_errs(c, ref.fd, ref.flowside, PIF_NONE, 0) < 0) {
flow_err(uflow, "Unrecoverable error on flow socket");
goto fail;
}
}
if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
return;
if (events & EPOLLIN) {
/* For not entirely clear reasons (data locality?) pasta gets
* better throughput if we receive tap datagrams one at a
* time. For small splice datagrams throughput is slightly
* better if we do batch, but it's slightly worse for large
* splice datagrams. Since we don't know the size before we
* receive, always go one at a time for pasta mode.
*/
size_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
uint8_t topif = pif_at_sidx(tosidx);
int s = ref.fd;
flow_trace(uflow, "Received %d datagrams on reply socket", n);
uflow->ts = now->tv_sec;
flow_trace(uflow, "Received data on reply socket");
uflow->ts = now->tv_sec;
for (i = 0; i < n; i++) {
if (pif_is_socket(topif))
udp_splice_prepare(udp_mh_recv, i);
else if (topif == PIF_TAP)
udp_tap_prepare(udp_mh_recv, i, toside, false);
/* Restore sockaddr length clobbered by recvmsg() */
udp_mh_recv[i].msg_hdr.msg_namelen = sizeof(udp_meta[i].s_in);
if (pif_is_socket(topif)) {
udp_sock_to_sock(c, ref.fd, n, tosidx);
} else if (topif == PIF_TAP) {
if (c->mode == MODE_VU) {
udp_vu_sock_to_tap(c, s, UDP_MAX_FRAMES,
tosidx);
} else {
udp_buf_sock_to_tap(c, s, n, tosidx);
}
} else {
flow_err(uflow,
"No support for forwarding UDP from %s to %s",
pif_name(pif_at_sidx(ref.flowside)),
pif_name(topif));
goto fail;
}
}
return;
if (pif_is_socket(topif)) {
udp_splice_send(c, 0, n, tosidx);
} else if (topif == PIF_TAP) {
tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
} else {
uint8_t frompif = pif_at_sidx(ref.flowside);
flow_err(uflow, "No support for forwarding UDP from %s to %s",
pif_name(frompif), pif_name(topif));
}
}
/**
* udp_reply_sock_handler() - Handle new data from flow specific socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
*/
void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now)
{
if (c->mode == MODE_VU) {
udp_vu_reply_sock_handler(c, ref, events, now);
return;
}
udp_buf_reply_sock_handler(c, ref, events, now);
fail:
flow_err_details(uflow);
udp_flow_close(c, uflow);
}
/**
@ -720,6 +960,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
* @af: Address family, AF_INET or AF_INET6
* @saddr: Source address
* @daddr: Destination address
* @ttl: TTL or hop limit for packets to be sent in this call
* @p: Pool of UDP packets, with UDP headers
* @idx: Index of first packet to process
* @now: Current timestamp
@ -730,7 +971,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
*/
int udp_tap_handler(const struct ctx *c, uint8_t pif,
sa_family_t af, const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now)
uint8_t ttl, const struct pool *p, int idx,
const struct timespec *now)
{
const struct flowside *toside;
struct mmsghdr mm[UIO_MAXIOV];
@ -778,7 +1020,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
}
toside = flowside_at_sidx(tosidx);
s = udp_at_sidx(tosidx)->s[tosidx.sidei];
s = uflow->s[tosidx.sidei];
ASSERT(s >= 0);
pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport);
@ -809,6 +1051,24 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
mm[i].msg_hdr.msg_controllen = 0;
mm[i].msg_hdr.msg_flags = 0;
if (ttl != uflow->ttl[tosidx.sidei]) {
uflow->ttl[tosidx.sidei] = ttl;
if (af == AF_INET) {
if (setsockopt(s, IPPROTO_IP, IP_TTL,
&ttl, sizeof(ttl)) < 0)
flow_perror(uflow,
"setsockopt IP_TTL");
} else {
/* IPv6 hop_limit cannot be only 1 byte */
int hop_limit = ttl;
if (setsockopt(s, SOL_IPV6, IPV6_UNICAST_HOPS,
&hop_limit, sizeof(hop_limit)) < 0)
flow_perror(uflow,
"setsockopt IPV6_UNICAST_HOPS");
}
}
count++;
}

udp.h

@ -11,11 +11,12 @@
void udp_portmap_clear(void);
void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
int udp_tap_handler(const struct ctx *c, uint8_t pif,
sa_family_t af, const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
uint8_t ttl, const struct pool *p, int idx,
const struct timespec *now);
int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
const char *ifname, in_port_t port);
int udp_init(struct ctx *c);

udp_flow.c

@ -9,10 +9,12 @@
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <netinet/udp.h>
#include "util.h"
#include "passt.h"
#include "flow_table.h"
#include "udp_internal.h"
#define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */
@ -41,121 +43,145 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
*/
void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
{
unsigned sidei;
if (uflow->closed)
return; /* Nothing to do */
if (uflow->s[INISIDE] >= 0) {
/* The listening socket needs to stay in epoll */
close(uflow->s[INISIDE]);
uflow->s[INISIDE] = -1;
flow_foreach_sidei(sidei) {
flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
if (uflow->s[sidei] >= 0) {
epoll_del(c, uflow->s[sidei]);
close(uflow->s[sidei]);
uflow->s[sidei] = -1;
}
}
if (uflow->s[TGTSIDE] >= 0) {
/* But the flow specific one needs to be removed */
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL);
close(uflow->s[TGTSIDE]);
uflow->s[TGTSIDE] = -1;
}
flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
if (!pif_is_socket(uflow->f.pif[TGTSIDE]))
flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE));
uflow->closed = true;
}
/**
* udp_flow_sock() - Create, bind and connect a flow specific UDP socket
* @c: Execution context
* @uflow: UDP flow to open socket for
* @sidei: Side of @uflow to open socket for
*
* Return: fd of new socket on success, -ve error code on failure
*/
static int udp_flow_sock(const struct ctx *c,
struct udp_flow *uflow, unsigned sidei)
{
const struct flowside *side = &uflow->f.side[sidei];
uint8_t pif = uflow->f.pif[sidei];
union {
flow_sidx_t sidx;
uint32_t data;
} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
int s;
s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
if (s < 0) {
flow_dbg_perror(uflow, "Couldn't open flow specific socket");
return s;
}
if (flowside_connect(c, s, pif, side) < 0) {
int rc = -errno;
epoll_del(c, s);
close(s);
flow_dbg_perror(uflow, "Couldn't connect flow socket");
return rc;
}
/* It's possible, if unlikely, that we could receive some packets in
* between the bind() and connect() which may or may not be for this
* flow. Being UDP we could just discard them, but it's not ideal.
*
* There's also a tricky case if a bunch of datagrams for a new flow
* arrive in rapid succession, the first going to the original listening
* socket and later ones going to this new socket. If we forwarded the
* datagrams from the new socket immediately here they would go before
* the datagram which established the flow. Again, not strictly wrong
* for UDP, but not ideal.
*
* So, we flag that the new socket is in a transient state where it
* might have datagrams for a different flow queued. Before the next
* epoll cycle, udp_flow_defer() will flush out any such datagrams, and
* thereafter everything on the new socket should be strictly for this
* flow.
*/
if (sidei)
uflow->flush1 = true;
else
uflow->flush0 = true;
return s;
}
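A condensed sketch of the "transient socket" idea from the comment above, using illustrative names rather than the passt data structures: the flag is raised when the flow socket becomes ready, and a deferred pass drains whatever was queued in the meantime through a generic forwarding callback.

struct sketch_flow {
	int s;			/* flow specific socket */
	int flush_due;		/* set right after bind() and connect() */
};

static void sketch_flow_sock_ready(struct sketch_flow *f, int s)
{
	f->s = s;
	f->flush_due = 1;	/* queued datagrams may belong to other flows */
}

static void sketch_flow_defer(struct sketch_flow *f, void (*fwd)(int s))
{
	if (!f->flush_due)
		return;

	fwd(f->s);		/* re-dispatch anything already queued */
	f->flush_due = 0;	/* from here on, traffic is for this flow only */
}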
/**
* udp_flow_new() - Common setup for a new UDP flow
* @c: Execution context
* @flow: Initiated flow
* @s_ini: Initiating socket (or -1)
* @now: Timestamp
*
* Return: UDP specific flow, if successful, NULL on failure
*
* #syscalls getsockname
*/
static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
int s_ini, const struct timespec *now)
const struct timespec *now)
{
struct udp_flow *uflow = NULL;
const struct flowside *tgt;
uint8_t tgtpif;
unsigned sidei;
if (!(tgt = flow_target(c, flow, IPPROTO_UDP)))
goto cancel;
tgtpif = flow->f.pif[TGTSIDE];
uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
uflow->ts = now->tv_sec;
uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = 0;
if (s_ini >= 0) {
/* When using auto port-scanning the listening port could go
* away, so we need to duplicate the socket
*/
uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
if (uflow->s[INISIDE] < 0) {
flow_err(uflow,
"Couldn't duplicate listening socket: %s",
strerror_(errno));
flow_foreach_sidei(sidei) {
if (pif_is_socket(uflow->f.pif[sidei]))
if ((uflow->s[sidei] = udp_flow_sock(c, uflow, sidei)) < 0)
goto cancel;
}
if (uflow->s[TGTSIDE] >= 0 && inany_is_unspecified(&tgt->oaddr)) {
/* When we target a socket, we connect() it, but might not
* always bind(), leaving the kernel to pick our address. In
* that case connect() will implicitly bind() the socket, but we
* need to determine its local address so that we can match
* reply packets back to the correct flow. Update the flow with
* the information from getsockname() */
union sockaddr_inany sa;
socklen_t sl = sizeof(sa);
in_port_t port;
if (getsockname(uflow->s[TGTSIDE], &sa.sa, &sl) < 0 ||
inany_from_sockaddr(&uflow->f.side[TGTSIDE].oaddr,
&port, &sa) < 0) {
flow_perror(uflow, "Unable to determine local address");
goto cancel;
}
if (port != tgt->oport) {
flow_err(uflow, "Unexpected local port");
goto cancel;
}
}
if (pif_is_socket(tgtpif)) {
struct mmsghdr discard[UIO_MAXIOV] = { 0 };
union {
flow_sidx_t sidx;
uint32_t data;
} fref = {
.sidx = FLOW_SIDX(flow, TGTSIDE),
};
int rc;
uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY,
tgtpif, tgt, fref.data);
if (uflow->s[TGTSIDE] < 0) {
flow_dbg(uflow,
"Couldn't open socket for spliced flow: %s",
strerror_(errno));
goto cancel;
}
if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) {
flow_dbg(uflow,
"Couldn't connect flow socket: %s",
strerror_(errno));
goto cancel;
}
/* It's possible, if unlikely, that we could receive some
* unrelated packets in between the bind() and connect() of this
* socket. For now we just discard these. We could consider
* trying to redirect these to an appropriate handler, if we
* need to.
*/
rc = recvmmsg(uflow->s[TGTSIDE], discard, ARRAY_SIZE(discard),
MSG_DONTWAIT, NULL);
if (rc >= ARRAY_SIZE(discard)) {
flow_dbg(uflow,
"Too many (%d) spurious reply datagrams", rc);
goto cancel;
} else if (rc > 0) {
flow_trace(uflow,
"Discarded %d spurious reply datagrams", rc);
} else if (errno != EAGAIN) {
flow_err(uflow,
"Unexpected error discarding datagrams: %s",
strerror_(errno));
}
}
flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE));
/* If the target side is a socket, it will be a reply socket that knows
* its own flowside. But if it's tap, then we need to look it up by
* hash.
/* Tap sides always need to be looked up by hash. Socket sides don't
* always, but sometimes do (receiving packets on a socket not specific
* to one flow). Unconditionally hash both sides so all our bases are
* covered
*/
if (!pif_is_socket(tgtpif))
flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE));
flow_foreach_sidei(sidei)
flow_hash_insert(c, FLOW_SIDX(uflow, sidei));
FLOW_ACTIVATE(uflow);
return FLOW_SIDX(uflow, TGTSIDE);
@ -168,18 +194,21 @@ cancel:
}
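As a side note on the getsockname() step above: connect() on an unbound UDP socket triggers the implicit bind(), and getsockname() is how the kernel-chosen local address and port are recovered afterwards. A minimal IPv4-only sketch with an illustrative name:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static int sketch_local_addr(int s)
{
	struct sockaddr_in local;
	socklen_t sl = sizeof(local);

	if (getsockname(s, (struct sockaddr *)&local, &sl) < 0)
		return -1;

	/* These are the values the kernel picked during the implicit bind() */
	printf("local address %s, port %hu\n", inet_ntoa(local.sin_addr),
	       ntohs(local.sin_port));
	return 0;
}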
/**
* udp_flow_from_sock() - Find or create UDP flow for "listening" socket
* udp_flow_from_sock() - Find or create UDP flow for incoming datagram
* @c: Execution context
* @ref: epoll reference of the receiving socket
* @pif: Interface the datagram is arriving from
* @dst: Our (local) address to which the datagram is arriving
* @port: Our (local) port number to which the datagram is arriving
* @s_in: Source socket address, filled in by recvmmsg()
* @now: Timestamp
*
* #syscalls fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
* #syscalls fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
*
* Return: sidx for the destination side of the flow for this packet, or
* FLOW_SIDX_NONE if we couldn't find or create a flow.
*/
flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif,
const union inany_addr *dst, in_port_t port,
const union sockaddr_inany *s_in,
const struct timespec *now)
{
@ -188,9 +217,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
union flow *flow;
flow_sidx_t sidx;
ASSERT(ref.type == EPOLL_TYPE_UDP_LISTEN);
sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, s_in, ref.udp.port);
sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, s_in, dst, port);
if ((uflow = udp_at_sidx(sidx))) {
uflow->ts = now->tv_sec;
return flow_sidx_opposite(sidx);
@ -200,12 +227,11 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
char sastr[SOCKADDR_STRLEN];
debug("Couldn't allocate flow for UDP datagram from %s %s",
pif_name(ref.udp.pif),
sockaddr_ntop(s_in, sastr, sizeof(sastr)));
pif_name(pif), sockaddr_ntop(s_in, sastr, sizeof(sastr)));
return FLOW_SIDX_NONE;
}
ini = flow_initiate_sa(flow, ref.udp.pif, s_in, ref.udp.port);
ini = flow_initiate_sa(flow, pif, s_in, dst, port);
if (!inany_is_unicast(&ini->eaddr) ||
ini->eport == 0 || ini->oport == 0) {
@ -218,7 +244,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
return FLOW_SIDX_NONE;
}
return udp_flow_new(c, flow, ref.fd, now);
return udp_flow_new(c, flow, now);
}
/**
@ -274,17 +300,45 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
return FLOW_SIDX_NONE;
}
return udp_flow_new(c, flow, -1, now);
return udp_flow_new(c, flow, now);
}
/**
* udp_flush_flow() - Flush datagrams that might not be for this flow
* @c: Execution context
* @uflow: Flow to handle
* @sidei: Side of the flow to flush
* @now: Current timestamp
*/
static void udp_flush_flow(const struct ctx *c,
const struct udp_flow *uflow, unsigned sidei,
const struct timespec *now)
{
/* We don't know exactly where the datagrams will come from, but we know
* they'll have an interface and oport matching this flow */
udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei],
uflow->f.side[sidei].oport, now);
}
/**
* udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
* @c: Execution context
* @uflow: Flow to handle
* @now: Current timestamp
*
* Return: true if the connection is ready to free, false otherwise
*/
bool udp_flow_defer(const struct udp_flow *uflow)
bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
const struct timespec *now)
{
if (uflow->flush0) {
udp_flush_flow(c, uflow, INISIDE, now);
uflow->flush0 = false;
}
if (uflow->flush1) {
udp_flush_flow(c, uflow, TGTSIDE, now);
uflow->flush1 = false;
}
return uflow->closed;
}

udp_flow.h

@ -8,9 +8,12 @@
#define UDP_FLOW_H
/**
* struct udp - Descriptor for a flow of UDP packets
* struct udp_flow - Descriptor for a flow of UDP packets
* @f: Generic flow information
* @ttl: TTL or hop_limit for both sides
* @closed: Flow is already closed
* @flush0: @s[0] may have datagrams queued for other flows
* @flush1: @s[1] may have datagrams queued for other flows
* @ts: Activity timestamp
* @s: Socket fd (or -1) for each side of the flow
*/
@ -18,13 +21,19 @@ struct udp_flow {
/* Must be first element */
struct flow_common f;
bool closed :1;
uint8_t ttl[SIDES];
bool closed :1,
flush0 :1,
flush1 :1;
time_t ts;
int s[SIDES];
};
struct udp_flow *udp_at_sidx(flow_sidx_t sidx);
flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif,
const union inany_addr *dst, in_port_t port,
const union sockaddr_inany *s_in,
const struct timespec *now);
flow_sidx_t udp_flow_from_tap(const struct ctx *c,
@ -33,7 +42,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
in_port_t srcport, in_port_t dstport,
const struct timespec *now);
void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
bool udp_flow_defer(const struct udp_flow *uflow);
bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
const struct timespec *now);
bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
const struct timespec *now);

udp_internal.h

@ -8,8 +8,6 @@
#include "tap.h" /* needed by udp_meta_t */
#define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */
/**
* struct udp_payload_t - UDP header and data for inbound messages
* @uh: UDP header
@ -30,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum);
int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
in_port_t port, const struct timespec *now);
#endif /* UDP_INTERNAL_H */

udp_vu.c

@ -57,35 +57,16 @@ static size_t udp_vu_hdrlen(bool v6)
return hdrlen;
}
/**
* udp_vu_sock_info() - get socket information
* @s: Socket to get information from
* @s_in: Socket address (output)
*
* Return: 0 if socket address can be read, -1 otherwise
*/
static int udp_vu_sock_info(int s, union sockaddr_inany *s_in)
{
struct msghdr msg = {
.msg_name = s_in,
.msg_namelen = sizeof(union sockaddr_inany),
};
return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
}
/**
* udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
* @v6: Set for IPv6 connections
* @dlen: Size of received data (output)
*
* Return: Number of iov entries used to store the datagram
*/
static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
bool v6, ssize_t *dlen)
static int udp_vu_sock_recv(const struct ctx *c, int s, bool v6, ssize_t *dlen)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@ -95,9 +76,6 @@ static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
ASSERT(!c->no_udp);
if (!(events & EPOLLIN))
return 0;
/* compute L2 header length */
hdrlen = udp_vu_hdrlen(v6);
@ -214,125 +192,27 @@ static void udp_vu_csum(const struct flowside *toside, int iov_used)
}
/**
* udp_vu_listen_sock_handler() - Handle new data from socket
* udp_vu_sock_to_tap() - Forward datagrams from socket to tap
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
* @s: Socket to read data from
* @n: Maximum number of datagrams to forward
* @tosidx: Flow & side to forward data from @s to
*/
void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now)
void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
int i;
if (udp_sock_errs(c, ref.fd, events) < 0) {
err("UDP: Unrecoverable error on listening socket:"
" (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
return;
}
for (i = 0; i < UDP_MAX_FRAMES; i++) {
const struct flowside *toside;
union sockaddr_inany s_in;
flow_sidx_t sidx;
uint8_t pif;
ssize_t dlen;
int iov_used;
bool v6;
if (udp_vu_sock_info(ref.fd, &s_in) < 0)
break;
sidx = udp_flow_from_sock(c, ref, &s_in, now);
pif = pif_at_sidx(sidx);
if (pif != PIF_TAP) {
if (flow_sidx_valid(sidx)) {
flow_sidx_t fromsidx = flow_sidx_opposite(sidx);
struct udp_flow *uflow = udp_at_sidx(sidx);
flow_err(uflow,
"No support for forwarding UDP from %s to %s",
pif_name(pif_at_sidx(fromsidx)),
pif_name(pif));
} else {
debug("Discarding 1 datagram without flow");
}
continue;
}
toside = flowside_at_sidx(sidx);
v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &dlen);
if (iov_used <= 0)
break;
udp_vu_prepare(c, toside, dlen);
if (*c->pcap) {
udp_vu_csum(toside, iov_used);
pcap_iov(iov_vu, iov_used,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
vu_flush(vdev, vq, elem, iov_used);
}
}
/**
* udp_vu_reply_sock_handler() - Handle new data from flow specific socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
*/
void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now)
{
flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
const struct flowside *toside = flowside_at_sidx(tosidx);
struct udp_flow *uflow = udp_at_sidx(ref.flowside);
int from_s = uflow->s[ref.flowside.sidei];
bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
int i;
ASSERT(!c->no_udp);
if (udp_sock_errs(c, from_s, events) < 0) {
flow_err(uflow, "Unrecoverable error on reply socket");
flow_err_details(uflow);
udp_flow_close(c, uflow);
return;
}
for (i = 0; i < UDP_MAX_FRAMES; i++) {
uint8_t topif = pif_at_sidx(tosidx);
for (i = 0; i < n; i++) {
ssize_t dlen;
int iov_used;
bool v6;
ASSERT(uflow);
if (topif != PIF_TAP) {
uint8_t frompif = pif_at_sidx(ref.flowside);
flow_err(uflow,
"No support for forwarding UDP from %s to %s",
pif_name(frompif), pif_name(topif));
continue;
}
v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
iov_used = udp_vu_sock_recv(c, from_s, events, v6, &dlen);
iov_used = udp_vu_sock_recv(c, s, v6, &dlen);
if (iov_used <= 0)
break;
flow_trace(uflow, "Received 1 datagram on reply socket");
uflow->ts = now->tv_sec;
udp_vu_prepare(c, toside, dlen);
if (*c->pcap) {

udp_vu.h

@ -6,8 +6,8 @@
#ifndef UDP_VU_H
#define UDP_VU_H
void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
const struct timespec *now);
void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx);
#endif /* UDP_VU_H */

util.c

@ -71,7 +71,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
case EPOLL_TYPE_UDP_LISTEN:
freebind = c->freebind;
/* fallthrough */
case EPOLL_TYPE_UDP_REPLY:
case EPOLL_TYPE_UDP:
proto = IPPROTO_UDP;
socktype = SOCK_DGRAM | SOCK_NONBLOCK;
break;
@ -109,11 +109,15 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
debug("Failed to set SO_REUSEADDR on socket %i", fd);
if (proto == IPPROTO_UDP) {
int pktinfo = af == AF_INET ? IP_PKTINFO : IPV6_RECVPKTINFO;
int recverr = af == AF_INET ? IP_RECVERR : IPV6_RECVERR;
int level = af == AF_INET ? IPPROTO_IP : IPPROTO_IPV6;
int opt = af == AF_INET ? IP_RECVERR : IPV6_RECVERR;
if (setsockopt(fd, level, opt, &y, sizeof(y)))
if (setsockopt(fd, level, recverr, &y, sizeof(y)))
die_perror("Failed to set RECVERR on socket %i", fd);
if (setsockopt(fd, level, pktinfo, &y, sizeof(y)))
die_perror("Failed to set PKTINFO on socket %i", fd);
}
if (ifname && *ifname) {
@ -178,6 +182,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
return fd;
}
/**
* sock_unix() - Create and bind AF_UNIX socket
* @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
*
* Return: socket descriptor on success, won't return on failure
*/
int sock_unix(char *sock_path)
{
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
};
int i;
if (fd < 0)
die_perror("Failed to open UNIX domain socket");
for (i = 1; i < UNIX_SOCK_MAX; i++) {
char *path = addr.sun_path;
int ex, ret;
if (*sock_path)
memcpy(path, sock_path, UNIX_PATH_MAX);
else if (snprintf_check(path, UNIX_PATH_MAX - 1,
UNIX_SOCK_PATH, i))
die_perror("Can't build UNIX domain socket path");
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
0);
if (ex < 0)
die_perror("Failed to check for UNIX domain conflicts");
ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
errno != EACCES)) {
if (*sock_path)
die("Socket path %s already in use", path);
close(ex);
continue;
}
close(ex);
unlink(path);
ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
if (*sock_path && ret)
die_perror("Failed to bind UNIX domain socket");
if (!ret)
break;
}
if (i == UNIX_SOCK_MAX)
die_perror("Failed to bind UNIX domain socket");
info("UNIX domain socket bound at %s", addr.sun_path);
if (!*sock_path)
memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
return fd;
}
/**
* sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
* @c: Execution context
@ -405,7 +471,7 @@ void pidfile_write(int fd, pid_t pid)
if (write(fd, pid_buf, n) < 0) {
perror("PID file write");
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
}
close(fd);
@ -441,12 +507,12 @@ int __daemon(int pidfile_fd, int devnull_fd)
if (pid == -1) {
perror("fork");
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
}
if (pid) {
pidfile_write(pidfile_fd, pid);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
if (setsid() < 0 ||
@ -454,7 +520,7 @@ int __daemon(int pidfile_fd, int devnull_fd)
dup2(devnull_fd, STDOUT_FILENO) < 0 ||
dup2(devnull_fd, STDERR_FILENO) < 0 ||
close(devnull_fd))
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
return 0;
}
@ -606,6 +672,90 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
return 0;
}
/**
* read_all_buf() - Fill a whole buffer from a file descriptor
* @fd: File descriptor
* @buf: Pointer to base of buffer
* @len: Length of buffer
*
* Return: 0 on success, -1 on error (with errno set)
*
* #syscalls read
*/
int read_all_buf(int fd, void *buf, size_t len)
{
size_t left = len;
char *p = buf;
while (left) {
ssize_t rc;
ASSERT(left <= len);
do
rc = read(fd, p, left);
while ((rc < 0) && errno == EINTR);
if (rc < 0)
return -1;
if (rc == 0) {
errno = ENODATA;
return -1;
}
p += rc;
left -= rc;
}
return 0;
}
/**
* read_remainder() - Read the tail of an IO vector from a file descriptor
* @fd: File descriptor
* @iov: IO vector
* @cnt: Number of entries in @iov
* @skip: Number of bytes of the vector to skip reading
*
* Return: 0 on success, -1 on error (with errno set)
*
* Note: mode-specific seccomp profiles need to enable readv() to use this.
*/
/* cppcheck-suppress unusedFunction */
int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip)
{
size_t i = 0, offset;
while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
ssize_t rc;
if (offset) {
ASSERT(offset < iov[i].iov_len);
/* Read the remainder of the partially read buffer */
if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
iov[i].iov_len - offset) < 0)
return -1;
i++;
}
if (cnt == i)
break;
/* Fill as many of the remaining buffers as we can */
rc = readv(fd, &iov[i], cnt - i);
if (rc < 0)
return -1;
if (rc == 0) {
errno = ENODATA;
return -1;
}
skip = rc;
}
return 0;
}
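A hypothetical usage sketch for the two helpers above: read a fixed-size header with read_all_buf(), then let read_remainder() skip those already-consumed bytes and fill the rest of the vector. The sketch_read_record() name and the two-buffer layout are assumptions for illustration only.

#include <stddef.h>
#include <sys/uio.h>

int read_all_buf(int fd, void *buf, size_t len);
int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);

static int sketch_read_record(int fd, void *hdr, size_t hdrlen,
			      void *body, size_t bodylen)
{
	const struct iovec iov[2] = {
		{ .iov_base = hdr,  .iov_len = hdrlen  },
		{ .iov_base = body, .iov_len = bodylen },
	};

	/* Fill the fixed-size header first */
	if (read_all_buf(fd, hdr, hdrlen) < 0)
		return -1;

	/* Skip the header bytes already consumed, read the remaining buffer */
	return read_remainder(fd, iov, 2, hdrlen);
}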
/** sockaddr_ntop() - Convert a socket address to text format
* @sa: Socket address
* @dst: output buffer, minimum SOCKADDR_STRLEN bytes
@ -725,7 +875,9 @@ void close_open_files(int argc, char **argv)
errno = 0;
fd = strtol(optarg, NULL, 0);
if (errno || fd <= STDERR_FILENO || fd > INT_MAX)
if (errno ||
(fd != STDIN_FILENO && fd <= STDERR_FILENO) ||
fd > INT_MAX)
die("Invalid --fd: %s", optarg);
}
} while (name != -1);
@ -837,3 +989,56 @@ void raw_random(void *buf, size_t buflen)
if (random_read < buflen)
die("Unexpected EOF on random data source");
}
/**
* epoll_del() - Remove a file descriptor from our passt epoll
* @c: Execution context
* @fd: File descriptor to remove
*/
void epoll_del(const struct ctx *c, int fd)
{
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
}
/**
* encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
* @buf: Buffer to fill in with encoded domain name
* @domain_name: Input domain name string with terminator
*
* The buffer's 'buf' size has to be >= strlen(domain_name) + 2
*/
void encode_domain_name(char *buf, const char *domain_name)
{
size_t i;
char *p;
buf[0] = strcspn(domain_name, ".");
p = buf + 1;
for (i = 0; domain_name[i]; i++) {
if (domain_name[i] == '.')
p[i] = strcspn(domain_name + i + 1, ".");
else
p[i] = domain_name[i];
}
p[i] = 0L;
}
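A small, hypothetical usage example of the encoding above: for "example.com" the buffer ends up holding a 7-byte label, a 3-byte label and a terminating zero, which is why strlen(domain_name) + 2 bytes are enough.

#include <stdio.h>

void encode_domain_name(char *buf, const char *domain_name);

int main(void)
{
	char buf[sizeof("example.com") + 1];	/* strlen() + 2 bytes */

	encode_domain_name(buf, "example.com");

	/* buf now holds \x07 "example" \x03 "com" \x00 */
	for (int i = 0; buf[i]; i += buf[i] + 1)
		printf("label of %d bytes\n", buf[i]);

	return 0;
}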
/**
* abort_with_msg() - Print error message and abort
* @fmt: Format string
* @...: Format parameters
*/
void abort_with_msg(const char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
vlogmsg(true, false, LOG_CRIT, fmt, ap);
va_end(ap);
/* This may actually cause a SIGSYS instead of SIGABRT, due to seccomp,
* but that will still get the job done.
*/
abort();
}

util.h

@ -31,12 +31,6 @@
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS SECCOMP_RET_KILL
#endif
#ifndef ETH_MAX_MTU
#define ETH_MAX_MTU USHRT_MAX
#endif
#ifndef ETH_MIN_MTU
#define ETH_MIN_MTU 68
#endif
#ifndef IP_MAX_MTU
#define IP_MAX_MTU USHRT_MAX
#endif
@ -67,27 +61,22 @@
#define STRINGIFY(x) #x
#define STR(x) STRINGIFY(x)
#ifdef CPPCHECK_6936
void abort_with_msg(const char *fmt, ...)
__attribute__((format(printf, 1, 2), noreturn));
/* Some cppcheck versions get confused by aborts inside a loop, causing
* it to give false positive uninitialised variable warnings later in
* the function, because it doesn't realise the non-initialising path
* already exited. See https://trac.cppcheck.net/ticket/13227
*
* Therefore, avoid using the usual do while wrapper we use to force the macro
* to act like a single statement requiring a ';'.
*/
#define ASSERT(expr) \
((expr) ? (void)0 : abort())
#else
#define ASSERT_WITH_MSG(expr, ...) \
((expr) ? (void)0 : abort_with_msg(__VA_ARGS__))
#define ASSERT(expr) \
do { \
if (!(expr)) { \
err("ASSERTION FAILED in %s (%s:%d): %s", \
__func__, __FILE__, __LINE__, STRINGIFY(expr)); \
/* This may actually SIGSYS, due to seccomp, \
* but that will still get the job done \
*/ \
abort(); \
} \
} while (0)
#endif
ASSERT_WITH_MSG((expr), "ASSERTION FAILED in %s (%s:%d): %s", \
__func__, __FILE__, __LINE__, STRINGIFY(expr))
#ifdef P_tmpdir
#define TMPDIR P_tmpdir
@ -122,14 +111,43 @@
(((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24))
#endif
#ifndef __bswap_constant_32
#define __bswap_constant_32(x) \
((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | \
(((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24))
#endif
#ifndef __bswap_constant_64
#define __bswap_constant_64(x) \
((((x) & 0xff00000000000000ULL) >> 56) | \
(((x) & 0x00ff000000000000ULL) >> 40) | \
(((x) & 0x0000ff0000000000ULL) >> 24) | \
(((x) & 0x000000ff00000000ULL) >> 8) | \
(((x) & 0x00000000ff000000ULL) << 8) | \
(((x) & 0x0000000000ff0000ULL) << 24) | \
(((x) & 0x000000000000ff00ULL) << 40) | \
(((x) & 0x00000000000000ffULL) << 56))
#endif
#if __BYTE_ORDER == __BIG_ENDIAN
#define htons_constant(x) (x)
#define htonl_constant(x) (x)
#define htonll_constant(x) (x)
#define ntohs_constant(x) (x)
#define ntohl_constant(x) (x)
#define ntohll_constant(x) (x)
#else
#define htons_constant(x) (__bswap_constant_16(x))
#define htonl_constant(x) (__bswap_constant_32(x))
#define htonll_constant(x) (__bswap_constant_64(x))
#define ntohs_constant(x) (__bswap_constant_16(x))
#define ntohl_constant(x) (__bswap_constant_32(x))
#define ntohll_constant(x) (__bswap_constant_64(x))
#endif
#define ntohll(x) (be64toh((x)))
#define htonll(x) (htobe64((x)))
/**
* ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
* @p: Pointer to the BE value in memory
@ -185,6 +203,7 @@ struct ctx;
int sock_l4_sa(const struct ctx *c, enum epoll_type type,
const void *sa, socklen_t sl,
const char *ifname, bool v6only, uint32_t data);
int sock_unix(char *sock_path);
void sock_probe_mem(struct ctx *c);
long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);
@ -203,6 +222,8 @@ int fls(unsigned long x);
int write_file(const char *path, const char *buf);
int write_all_buf(int fd, const void *buf, size_t len);
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
int read_all_buf(int fd, void *buf, size_t len);
int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
void close_open_files(int argc, char **argv);
bool snprintf_check(char *str, size_t size, const char *format, ...);
@ -276,6 +297,7 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
#define FPRINTF(f, ...) (void)fprintf(f, __VA_ARGS__)
void raw_random(void *buf, size_t buflen);
void epoll_del(const struct ctx *c, int fd);
/*
* Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
@ -349,4 +371,17 @@ static inline int wrap_accept4(int sockfd, struct sockaddr *addr,
#define accept4(s, addr, addrlen, flags) \
wrap_accept4((s), (addr), (addrlen), (flags))
static inline int wrap_getsockname(int sockfd, struct sockaddr *addr,
/* cppcheck-suppress constParameterPointer */
socklen_t *addrlen)
{
sa_init(addr, addrlen);
return getsockname(sockfd, addr, addrlen);
}
#define getsockname(s, addr, addrlen) \
wrap_getsockname((s), (addr), (addrlen))
#define PASST_MAXDNAME 254 /* 253 (RFC 1035) + 1 (the terminator) */
void encode_domain_name(char *buf, const char *domain_name);
#endif /* UTIL_H */

vhost_user.c

@ -44,6 +44,7 @@
#include "tap.h"
#include "vhost_user.h"
#include "pcap.h"
#include "migrate.h"
/* vhost-user version we are compatible with */
#define VHOST_USER_VERSION 1
@ -60,7 +61,7 @@ void vu_print_capabilities(void)
info("{");
info(" \"type\": \"net\"");
info("}");
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
/**
@ -162,17 +163,6 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
close(vmsg->fds[i]);
}
/**
* vu_remove_watch() - Remove a file descriptor from our passt epoll
* file descriptor
* @vdev: vhost-user device
* @fd: file descriptor to remove
*/
static void vu_remove_watch(const struct vu_dev *vdev, int fd)
{
epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
}
/**
* vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
* and fd_num
@ -312,13 +302,13 @@ static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
* @conn_fd: vhost-user command socket
* @vmsg: vhost-user message
*/
static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
static void vu_send_reply(int conn_fd, struct vhost_user_msg *vmsg)
{
msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
msg->hdr.flags |= VHOST_USER_VERSION;
msg->hdr.flags |= VHOST_USER_REPLY_MASK;
vmsg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
vmsg->hdr.flags |= VHOST_USER_VERSION;
vmsg->hdr.flags |= VHOST_USER_REPLY_MASK;
vu_message_write(conn_fd, msg);
vu_message_write(conn_fd, vmsg);
}
/**
@ -329,7 +319,7 @@ static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
* Return: True as a reply is requested
*/
static bool vu_get_features_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
uint64_t features =
1ULL << VIRTIO_F_VERSION_1 |
@ -339,9 +329,9 @@ static bool vu_get_features_exec(struct vu_dev *vdev,
(void)vdev;
vmsg_set_reply_u64(msg, features);
vmsg_set_reply_u64(vmsg, features);
debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
debug("Sending back to guest u64: 0x%016"PRIx64, vmsg->payload.u64);
return true;
}
@ -367,11 +357,11 @@ static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
* Return: False as no reply is requested
*/
static bool vu_set_features_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
debug("u64: 0x%016"PRIx64, msg->payload.u64);
debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
vdev->features = msg->payload.u64;
vdev->features = vmsg->payload.u64;
/* We only support devices conforming to VIRTIO 1.0 or
* later
*/
@ -392,10 +382,10 @@ static bool vu_set_features_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_owner_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
(void)vdev;
(void)msg;
(void)vmsg;
return false;
}
@ -430,12 +420,12 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
*
* Return: False as no reply is requested
*
* #syscalls:vu mmap munmap
* #syscalls:vu mmap|mmap2 munmap
*/
static bool vu_set_mem_table_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
struct vhost_user_memory m = msg->payload.memory, *memory = &m;
struct vhost_user_memory m = vmsg->payload.memory, *memory = &m;
unsigned int i;
for (i = 0; i < vdev->nregions; i++) {
@ -475,7 +465,7 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
*/
mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_NORESERVE, msg->fds[i], 0);
MAP_NORESERVE, vmsg->fds[i], 0);
if (mmap_addr == MAP_FAILED)
die_perror("vhost-user region mmap error");
@ -484,7 +474,7 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
debug(" mmap_addr: 0x%016"PRIx64,
dev_region->mmap_addr);
close(msg->fds[i]);
close(vmsg->fds[i]);
}
for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
@ -527,7 +517,7 @@ static void vu_close_log(struct vu_dev *vdev)
* vu_log_kick() - Inform the front-end that the log has been modified
* @vdev: vhost-user device
*/
void vu_log_kick(const struct vu_dev *vdev)
static void vu_log_kick(const struct vu_dev *vdev)
{
if (vdev->log_call_fd != -1) {
int rc;
@ -551,7 +541,7 @@ static void vu_log_page(uint8_t *log_table, uint64_t page)
/**
* vu_log_write() - Log memory write
* @dev: vhost-user device
* @vdev: vhost-user device
* @address: Memory address
* @length: Memory size
*/
@ -576,23 +566,23 @@ void vu_log_write(const struct vu_dev *vdev, uint64_t address, uint64_t length)
* @vdev: vhost-user device
* @vmsg: vhost-user message
*
* Return: False as no reply is requested
* Return: True as a reply is requested
*
* #syscalls:vu mmap|mmap2 munmap
*/
static bool vu_set_log_base_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
uint64_t log_mmap_size, log_mmap_offset;
void *base;
int fd;
if (msg->fd_num != 1 || msg->hdr.size != sizeof(msg->payload.log))
if (vmsg->fd_num != 1 || vmsg->hdr.size != sizeof(vmsg->payload.log))
die("vhost-user: Invalid log_base message");
fd = msg->fds[0];
log_mmap_offset = msg->payload.log.mmap_offset;
log_mmap_size = msg->payload.log.mmap_size;
fd = vmsg->fds[0];
log_mmap_offset = vmsg->payload.log.mmap_offset;
log_mmap_size = vmsg->payload.log.mmap_size;
debug("vhost-user log mmap_offset: %"PRId64, log_mmap_offset);
debug("vhost-user log mmap_size: %"PRId64, log_mmap_size);
@ -609,8 +599,8 @@ static bool vu_set_log_base_exec(struct vu_dev *vdev,
vdev->log_table = base;
vdev->log_size = log_mmap_size;
msg->hdr.size = sizeof(msg->payload.u64);
msg->fd_num = 0;
vmsg->hdr.size = sizeof(vmsg->payload.u64);
vmsg->fd_num = 0;
return true;
}
@ -623,15 +613,15 @@ static bool vu_set_log_base_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_log_fd_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
if (msg->fd_num != 1)
if (vmsg->fd_num != 1)
die("Invalid log_fd message");
if (vdev->log_call_fd != -1)
close(vdev->log_call_fd);
vdev->log_call_fd = msg->fds[0];
vdev->log_call_fd = vmsg->fds[0];
debug("Got log_call_fd: %d", vdev->log_call_fd);
@ -646,13 +636,13 @@ static bool vu_set_log_fd_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_num_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
unsigned int idx = msg->payload.state.index;
unsigned int num = msg->payload.state.num;
unsigned int idx = vmsg->payload.state.index;
unsigned int num = vmsg->payload.state.num;
debug("State.index: %u", idx);
debug("State.num: %u", num);
trace("State.index: %u", idx);
trace("State.num: %u", num);
vdev->vq[idx].vring.num = num;
return false;
@ -666,13 +656,13 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
/* We need to copy the payload to vhost_vring_addr structure
* to access index because address of msg->payload.addr
* to access index because address of vmsg->payload.addr
* can be unaligned as it is packed.
*/
struct vhost_vring_addr addr = msg->payload.addr;
struct vhost_vring_addr addr = vmsg->payload.addr;
struct vu_virtq *vq = &vdev->vq[addr.index];
debug("vhost_vring_addr:");
@ -687,7 +677,7 @@ static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
debug(" log_guest_addr: 0x%016" PRIx64,
(uint64_t)addr.log_guest_addr);
vq->vra = msg->payload.addr;
vq->vra = vmsg->payload.addr;
vq->vring.flags = addr.flags;
vq->vring.log_guest_addr = addr.log_guest_addr;
@ -712,10 +702,10 @@ static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_base_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
unsigned int idx = msg->payload.state.index;
unsigned int num = msg->payload.state.num;
unsigned int idx = vmsg->payload.state.index;
unsigned int num = vmsg->payload.state.num;
debug("State.index: %u", idx);
debug("State.num: %u", num);
@ -733,22 +723,23 @@ static bool vu_set_vring_base_exec(struct vu_dev *vdev,
* Return: True as a reply is requested
*/
static bool vu_get_vring_base_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
unsigned int idx = msg->payload.state.index;
unsigned int idx = vmsg->payload.state.index;
debug("State.index: %u", idx);
msg->payload.state.num = vdev->vq[idx].last_avail_idx;
msg->hdr.size = sizeof(msg->payload.state);
vmsg->payload.state.num = vdev->vq[idx].last_avail_idx;
vmsg->hdr.size = sizeof(vmsg->payload.state);
vdev->vq[idx].started = false;
vdev->vq[idx].vring.avail = 0;
if (vdev->vq[idx].call_fd != -1) {
close(vdev->vq[idx].call_fd);
vdev->vq[idx].call_fd = -1;
}
if (vdev->vq[idx].kick_fd != -1) {
vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
epoll_del(vdev->context, vdev->vq[idx].kick_fd);
close(vdev->vq[idx].kick_fd);
vdev->vq[idx].kick_fd = -1;
}
@ -780,21 +771,21 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
* close fds if NOFD bit is set
* @vmsg: vhost-user message
*/
static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
static void vu_check_queue_msg_file(struct vhost_user_msg *vmsg)
{
bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
if (idx >= VHOST_USER_MAX_QUEUES)
die("Invalid vhost-user queue index: %u", idx);
if (nofd) {
vmsg_close_fds(msg);
vmsg_close_fds(vmsg);
return;
}
if (msg->fd_num != 1)
die("Invalid fds in vhost-user request: %d", msg->hdr.request);
if (vmsg->fd_num != 1)
die("Invalid fds in vhost-user request: %d", vmsg->hdr.request);
}
/**
@ -806,23 +797,23 @@ static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
* Return: False as no reply is requested
*/
static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
debug("u64: 0x%016"PRIx64, msg->payload.u64);
debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
vu_check_queue_msg_file(msg);
vu_check_queue_msg_file(vmsg);
if (vdev->vq[idx].kick_fd != -1) {
vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
epoll_del(vdev->context, vdev->vq[idx].kick_fd);
close(vdev->vq[idx].kick_fd);
vdev->vq[idx].kick_fd = -1;
}
if (!nofd)
vdev->vq[idx].kick_fd = msg->fds[0];
vdev->vq[idx].kick_fd = vmsg->fds[0];
debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
@ -846,14 +837,14 @@ static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_call_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
debug("u64: 0x%016"PRIx64, msg->payload.u64);
debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
vu_check_queue_msg_file(msg);
vu_check_queue_msg_file(vmsg);
if (vdev->vq[idx].call_fd != -1) {
close(vdev->vq[idx].call_fd);
@ -861,11 +852,11 @@ static bool vu_set_vring_call_exec(struct vu_dev *vdev,
}
if (!nofd)
vdev->vq[idx].call_fd = msg->fds[0];
vdev->vq[idx].call_fd = vmsg->fds[0];
/* in case of I/O hang after reconnecting */
if (vdev->vq[idx].call_fd != -1)
eventfd_write(msg->fds[0], 1);
eventfd_write(vmsg->fds[0], 1);
debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
@ -881,14 +872,14 @@ static bool vu_set_vring_call_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_err_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
debug("u64: 0x%016"PRIx64, msg->payload.u64);
debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
vu_check_queue_msg_file(msg);
vu_check_queue_msg_file(vmsg);
if (vdev->vq[idx].err_fd != -1) {
close(vdev->vq[idx].err_fd);
@ -896,7 +887,7 @@ static bool vu_set_vring_err_exec(struct vu_dev *vdev,
}
if (!nofd)
vdev->vq[idx].err_fd = msg->fds[0];
vdev->vq[idx].err_fd = vmsg->fds[0];
return false;
}
@ -910,14 +901,15 @@ static bool vu_set_vring_err_exec(struct vu_dev *vdev,
* Return: True as a reply is requested
*/
static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
1ULL << VHOST_USER_PROTOCOL_F_DEVICE_STATE;
1ULL << VHOST_USER_PROTOCOL_F_DEVICE_STATE |
1ULL << VHOST_USER_PROTOCOL_F_RARP;
(void)vdev;
vmsg_set_reply_u64(msg, features);
vmsg_set_reply_u64(vmsg, features);
return true;
}
@ -930,13 +922,13 @@ static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
uint64_t features = msg->payload.u64;
uint64_t features = vmsg->payload.u64;
debug("u64: 0x%016"PRIx64, features);
vdev->protocol_features = msg->payload.u64;
vdev->protocol_features = vmsg->payload.u64;
return false;
}
@ -949,11 +941,11 @@ static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
* Return: True as a reply is requested
*/
static bool vu_get_queue_num_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
(void)vdev;
vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
vmsg_set_reply_u64(vmsg, VHOST_USER_MAX_QUEUES);
return true;
}
@ -966,10 +958,10 @@ static bool vu_get_queue_num_exec(struct vu_dev *vdev,
* Return: False as no reply is requested
*/
static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
unsigned int enable = msg->payload.state.num;
unsigned int idx = msg->payload.state.index;
unsigned int enable = vmsg->payload.state.num;
unsigned int idx = vmsg->payload.state.index;
debug("State.index: %u", idx);
debug("State.enable: %u", enable);
@ -982,33 +974,29 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
}
/**
* vu_set_migration_watch() - Add the migration file descriptor to epoll
* vu_send_rarp_exec() - vhost-user specification says: "Broadcast a fake
* RARP to notify the migration is terminated",
* but passt doesn't need to update any ARP table,
* so do nothing to silence QEMU bogus error message
* @vdev: vhost-user device
* @fd: File descriptor to add
* @direction: Direction of the migration (save or load backend state)
* @vmsg: vhost-user message
*
* Return: False as no reply is requested
*/
static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
uint32_t direction)
static bool vu_send_rarp_exec(struct vu_dev *vdev,
struct vhost_user_msg *vmsg)
{
union epoll_ref ref = {
.type = EPOLL_TYPE_VHOST_MIGRATION,
.fd = fd,
};
struct epoll_event ev = { 0 };
char macstr[ETH_ADDRSTRLEN];
ev.data.u64 = ref.u64;
switch (direction) {
case VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE:
ev.events = EPOLLOUT;
break;
case VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD:
ev.events = EPOLLIN;
break;
default:
ASSERT(0);
}
(void)vdev;
epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
/* ignore the command */
debug("Ignore command VHOST_USER_SEND_RARP for %s",
eth_ntop((unsigned char *)&vmsg->payload.u64, macstr,
sizeof(macstr)));
return false;
}
/**
@ -1020,12 +1008,12 @@ static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
* and set bit 8 as we don't provide our own fd.
*/
static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
unsigned int direction = msg->payload.transfer_state.direction;
unsigned int phase = msg->payload.transfer_state.phase;
unsigned int direction = vmsg->payload.transfer_state.direction;
unsigned int phase = vmsg->payload.transfer_state.phase;
if (msg->fd_num != 1)
if (vmsg->fd_num != 1)
die("Invalid device_state_fd message");
if (phase != VHOST_USER_TRANSFER_STATE_PHASE_STOPPED)
@ -1033,21 +1021,13 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
if (direction != VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE &&
direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
die("Invalide device_state_fd direction: %d", direction);
die("Invalid device_state_fd direction: %d", direction);
if (vdev->device_state_fd != -1) {
vu_remove_watch(vdev, vdev->device_state_fd);
close(vdev->device_state_fd);
}
vdev->device_state_fd = msg->fds[0];
vdev->device_state_result = -1;
vu_set_migration_watch(vdev, vdev->device_state_fd, direction);
debug("Got device_state_fd: %d", vdev->device_state_fd);
migrate_request(vdev->context, vmsg->fds[0],
direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
/* We don't provide a new fd for the data transfer */
vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
vmsg_set_reply_u64(vmsg, VHOST_USER_VRING_NOFD_MASK);
return true;
}
@ -1059,12 +1039,11 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
*
* Return: True as the reply contains the migration result
*/
/* cppcheck-suppress constParameterCallback */
static bool vu_check_device_state_exec(struct vu_dev *vdev,
struct vhost_user_msg *msg)
struct vhost_user_msg *vmsg)
{
(void)vdev;
vmsg_set_reply_u64(msg, vdev->device_state_result);
vmsg_set_reply_u64(vmsg, vdev->context->device_state_result);
return true;
}
@ -1072,7 +1051,6 @@ static bool vu_check_device_state_exec(struct vu_dev *vdev,
/**
* vu_init() - Initialize vhost-user device structure
* @c: execution context
* @vdev: vhost-user device
*/
void vu_init(struct ctx *c)
{
@ -1090,8 +1068,8 @@ void vu_init(struct ctx *c)
}
c->vdev->log_table = NULL;
c->vdev->log_call_fd = -1;
c->vdev->device_state_fd = -1;
c->vdev->device_state_result = -1;
migrate_init(c);
}
@ -1118,7 +1096,7 @@ void vu_cleanup(struct vu_dev *vdev)
vq->err_fd = -1;
}
if (vq->kick_fd != -1) {
vu_remove_watch(vdev, vq->kick_fd);
epoll_del(vdev->context, vq->kick_fd);
close(vq->kick_fd);
vq->kick_fd = -1;
}
@ -1141,12 +1119,8 @@ void vu_cleanup(struct vu_dev *vdev)
vu_close_log(vdev);
if (vdev->device_state_fd != -1) {
vu_remove_watch(vdev, vdev->device_state_fd);
close(vdev->device_state_fd);
vdev->device_state_fd = -1;
vdev->device_state_result = -1;
}
/* If we lose the VU dev, we also lose our migration channel */
migrate_close(vdev->context);
}
/**
@ -1159,7 +1133,7 @@ static void vu_sock_reset(struct vu_dev *vdev)
}
static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
struct vhost_user_msg *msg) = {
struct vhost_user_msg *vmsg) = {
[VHOST_USER_GET_FEATURES] = vu_get_features_exec,
[VHOST_USER_SET_FEATURES] = vu_set_features_exec,
[VHOST_USER_GET_PROTOCOL_FEATURES] = vu_get_protocol_features_exec,
@ -1177,6 +1151,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
[VHOST_USER_SET_VRING_CALL] = vu_set_vring_call_exec,
[VHOST_USER_SET_VRING_ERR] = vu_set_vring_err_exec,
[VHOST_USER_SET_VRING_ENABLE] = vu_set_vring_enable_exec,
[VHOST_USER_SEND_RARP] = vu_send_rarp_exec,
[VHOST_USER_SET_DEVICE_STATE_FD] = vu_set_device_state_fd_exec,
[VHOST_USER_CHECK_DEVICE_STATE] = vu_check_device_state_exec,
};
@ -1189,7 +1164,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
*/
void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
{
struct vhost_user_msg msg = { 0 };
struct vhost_user_msg vmsg = { 0 };
bool need_reply, reply_requested;
int ret;
@ -1198,34 +1173,41 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
return;
}
ret = vu_message_read_default(fd, &msg);
ret = vu_message_read_default(fd, &vmsg);
if (ret == 0) {
vu_sock_reset(vdev);
return;
}
debug("================ Vhost user message ================");
debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
msg.hdr.request);
debug("Flags: 0x%x", msg.hdr.flags);
debug("Size: %u", msg.hdr.size);
debug("Request: %s (%d)", vu_request_to_string(vmsg.hdr.request),
vmsg.hdr.request);
debug("Flags: 0x%x", vmsg.hdr.flags);
debug("Size: %u", vmsg.hdr.size);
need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
need_reply = vmsg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
if (msg.hdr.request >= 0 && msg.hdr.request < VHOST_USER_MAX &&
vu_handle[msg.hdr.request])
reply_requested = vu_handle[msg.hdr.request](vdev, &msg);
if (vmsg.hdr.request >= 0 && vmsg.hdr.request < VHOST_USER_MAX &&
vu_handle[vmsg.hdr.request])
reply_requested = vu_handle[vmsg.hdr.request](vdev, &vmsg);
else
die("Unhandled request: %d", msg.hdr.request);
die("Unhandled request: %d", vmsg.hdr.request);
/* cppcheck-suppress legacyUninitvar */
if (!reply_requested && need_reply) {
msg.payload.u64 = 0;
msg.hdr.flags = 0;
msg.hdr.size = sizeof(msg.payload.u64);
msg.fd_num = 0;
vmsg.payload.u64 = 0;
vmsg.hdr.flags = 0;
vmsg.hdr.size = sizeof(vmsg.payload.u64);
vmsg.fd_num = 0;
reply_requested = true;
}
if (reply_requested)
vu_send_reply(fd, &msg);
vu_send_reply(fd, &vmsg);
if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
vdev->context->device_state_result == 0 &&
!vdev->context->migrate_target) {
info("Migration complete, exiting");
_exit(EXIT_SUCCESS);
}
}
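Several of the handlers above acknowledge a request with a single 64-bit value through vmsg_set_reply_u64(); its body falls outside the hunks shown here. A minimal sketch of what such a helper is assumed to do, mirroring the equivalent helper in QEMU's libvhost-user from which this code descends:

    static void vmsg_set_reply_u64(struct vhost_user_msg *vmsg, uint64_t val)
    {
            /* The reply payload is a single 64-bit word */
            vmsg->payload.u64 = val;
            vmsg->hdr.size = sizeof(vmsg->payload.u64);
            /* No file descriptors travel back with this kind of reply */
            vmsg->fd_num = 0;
    }

The actual write back to the control socket is then performed by vu_send_reply() in the dispatch loop above.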


@@ -184,7 +184,7 @@ union vhost_user_payload {
};
/**
* struct vhost_user_msg - vhost-use message
* struct vhost_user_msg - vhost-user message
* @hdr: Message header
* @payload: Message payload
* @fds: File descriptors associated with the message
@@ -241,7 +241,6 @@ static inline bool vu_queue_started(const struct vu_virtq *vq)
void vu_print_capabilities(void);
void vu_init(struct ctx *c);
void vu_cleanup(struct vu_dev *vdev);
void vu_log_kick(const struct vu_dev *vdev);
void vu_log_write(const struct vu_dev *vdev, uint64_t address,
uint64_t length);
void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);


@@ -156,9 +156,9 @@ static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
}
/**
* virtq_used_event - Get location of used event indices
* virtq_used_event() - Get location of used event indices
* (only with VIRTIO_F_EVENT_IDX)
* @vq Virtqueue
* @vq: Virtqueue
*
* Return: return the location of the used event index
*/
@@ -170,7 +170,7 @@ static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
/**
* vring_get_used_event() - Get the used event from the available ring
* @vq Virtqueue
* @vq: Virtqueue
*
* Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
* used_event is a performant alternative where the driver
@@ -235,6 +235,7 @@ static int virtqueue_read_indirect_desc(const struct vu_dev *dev,
memcpy(desc, orig_desc, read_len);
len -= read_len;
addr += read_len;
/* NOLINTNEXTLINE(bugprone-sizeof-expression,cert-arr39-c) */
desc += read_len / sizeof(struct vring_desc);
}
@@ -243,9 +244,9 @@ static int virtqueue_read_indirect_desc(const struct vu_dev *dev,
/**
* enum virtqueue_read_desc_state - State in the descriptor chain
* @VIRTQUEUE_READ_DESC_ERROR Found an invalid descriptor
* @VIRTQUEUE_READ_DESC_DONE No more descriptors in the chain
* @VIRTQUEUE_READ_DESC_MORE there are more descriptors in the chain
* @VIRTQUEUE_READ_DESC_ERROR: Found an invalid descriptor
* @VIRTQUEUE_READ_DESC_DONE: No more descriptors in the chain
* @VIRTQUEUE_READ_DESC_MORE: there are more descriptors in the chain
*/
enum virtqueue_read_desc_state {
VIRTQUEUE_READ_DESC_ERROR = -1,
@@ -286,7 +287,7 @@ static int virtqueue_read_next_desc(const struct vring_desc *desc,
*
* Return: true if the virtqueue is empty, false otherwise
*/
bool vu_queue_empty(struct vu_virtq *vq)
static bool vu_queue_empty(struct vu_virtq *vq)
{
if (!vq->vring.avail)
return true;
@@ -346,8 +347,9 @@ void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
die_perror("Error writing vhost-user queue eventfd");
}
/* virtq_avail_event() - Get location of available event indices
* (only with VIRTIO_F_EVENT_IDX)
/**
* virtq_avail_event() - Get location of available event indices
* (only with VIRTIO_F_EVENT_IDX)
* @vq: Virtqueue
*
* Return: return the location of the available event index
@@ -420,8 +422,8 @@ static bool virtqueue_map_desc(const struct vu_dev *dev,
}
/**
* vu_queue_map_desc - Map the virtqueue descriptor ring into our virtual
* address space
* vu_queue_map_desc() - Map the virtqueue descriptor ring into our virtual
* address space
* @dev: Vhost-user device
* @vq: Virtqueue
* @idx: First descriptor ring entry to map
@@ -504,7 +506,7 @@ static int vu_queue_map_desc(const struct vu_dev *dev,
* vu_queue_pop() - Pop an entry from the virtqueue
* @dev: Vhost-user device
* @vq: Virtqueue
* @elem: Virtqueue element to file with the entry information
* @elem: Virtqueue element to fill with the entry information
*
* Return: -1 if there is an error, 0 otherwise
*/
@@ -544,7 +546,7 @@ int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
}
/**
* vu_queue_detach_element() - Detach an element from the virqueue
* vu_queue_detach_element() - Detach an element from the virtqueue
* @vq: Virtqueue
*/
void vu_queue_detach_element(struct vu_virtq *vq)
@@ -554,7 +556,7 @@ void vu_queue_detach_element(struct vu_virtq *vq)
}
/**
* vu_queue_unpop() - Push back the previously popped element from the virqueue
* vu_queue_unpop() - Push back the previously popped element from the virtqueue
* @vq: Virtqueue
*/
/* cppcheck-suppress unusedFunction */
@@ -568,6 +570,8 @@ void vu_queue_unpop(struct vu_virtq *vq)
* vu_queue_rewind() - Push back a given number of popped elements
* @vq: Virtqueue
* @num: Number of element to unpop
*
* Return: True on success, false if not
*/
bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
{
@@ -671,9 +675,10 @@ static void vu_log_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
* @len: Size of the element
* @idx: Used ring entry index
*/
void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
unsigned int index, unsigned int len,
unsigned int idx)
static void vu_queue_fill_by_index(const struct vu_dev *vdev,
struct vu_virtq *vq,
unsigned int index, unsigned int len,
unsigned int idx)
{
struct vring_used_elem uelem;
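For context on the event-index comments corrected above: when VIRTIO_F_EVENT_IDX is negotiated, each event index sits in the spare uint16_t slot at the tail of the opposite ring. The accessor bodies are not part of this diff; a sketch following the virtio specification layout (and libvhost-user, from which virtio.c derives) would be:

    /* used_event: written by the driver to delay used-buffer notifications;
     * it occupies the slot right after the available ring entries */
    static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
    {
            return &vq->vring.avail->ring[vq->vring.num];
    }

    /* avail_event: written by the device to delay driver kicks;
     * it occupies the slot right after the used ring entries */
    static inline uint16_t *virtq_avail_event(const struct vu_virtq *vq)
    {
            return (uint16_t *)&vq->vring.used->ring[vq->vring.num];
    }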


@@ -106,8 +106,6 @@ struct vu_dev_region {
* @log_call_fd: Eventfd to report logging update
* @log_size: Size of the logging memory region
* @log_table: Base of the logging memory region
* @device_state_fd: Device state migration channel
* @device_state_result: Device state migration result
*/
struct vu_dev {
struct ctx *context;
@@ -119,8 +117,6 @@ struct vu_dev {
int log_call_fd;
uint64_t log_size;
uint8_t *log_table;
int device_state_fd;
int device_state_result;
};
/**
@@ -154,7 +150,7 @@ static inline bool has_feature(uint64_t features, unsigned int fbit)
/**
* vu_has_feature() - Check if a virtio-net feature is available
* @vdev: Vhost-user device
* @bit: Feature to check
* @fbit: Feature to check
*
* Return: True if the feature is available
*/
@@ -167,7 +163,7 @@ static inline bool vu_has_feature(const struct vu_dev *vdev,
/**
* vu_has_protocol_feature() - Check if a vhost-user feature is available
* @vdev: Vhost-user device
* @bit: Feature to check
* @fbit: Feature to check
*
* Return: True if the feature is available
*/
@@ -178,16 +174,12 @@ static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
return has_feature(vdev->protocol_features, fbit);
}
bool vu_queue_empty(struct vu_virtq *vq);
void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
struct vu_virtq_element *elem);
void vu_queue_detach_element(struct vu_virtq *vq);
void vu_queue_unpop(struct vu_virtq *vq);
bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
unsigned int index, unsigned int len,
unsigned int idx);
void vu_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
const struct vu_virtq_element *elem, unsigned int len,
unsigned int idx);
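The feature checks above all funnel through has_feature(), whose one-line body falls outside the hunks shown; the conventional bit test, assumed to match the libvhost-user original, is:

    static inline bool has_feature(uint64_t features, unsigned int fbit)
    {
            /* Each virtio/vhost-user feature is one bit in a 64-bit mask */
            return !!(features & (1ULL << fbit));
    }

vu_has_protocol_feature() applies it to vdev->protocol_features, as the hunk above shows, and vu_has_feature() presumably to vdev->features.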
