mirror of https://passt.top/passt synced 2025-05-23 17:55:34 +02:00

Compare commits

250 commits

Author SHA1 Message Date
Laurent Vivier
3262c9b088 iov: Standardize function comment headers
Update function comment headers in iov.c to a consistent and
standardized format.

This change ensures:
- Comment blocks for functions consistently start with /**.
- Function names in the comment summary line include parentheses ().

This improves overall comment clarity and uniformity within the file.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:13 +02:00
Laurent Vivier
b915375a42 virtio: Correct and align comment headers
Standardize and fix issues in `virtio.c` and `virtio.h` comment headers.

Improvements include:
- Added `()` to function names in comment summaries.
- Added colons after parameter and enum member tags.
- Changed `/*` to `/**` for `virtq_avail_event()` comment.
- Fixed typos (e.g., "file"->"fill", "virqueue"->"virtqueue").
- Added missing `Return:` tag for `vu_queue_rewind()`.
- Corrected parameter names in `virtio.h` comments to match code.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:11 +02:00
Laurent Vivier
2fd0944f21 vhost_user: Correct and align function comment headers
This commit cleans up function comment headers in vhost_user.c to ensure
accuracy and consistency with the code. Changes include correcting
parameter names in comments and signatures (e.g., standardizing on vmsg
for vhost messages, fixing dev to vdev), updating function names in
comment descriptions, and removing/rectifying erroneous parameter
documentation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:08 +02:00
Laurent Vivier
2046976866 codespell: Correct typos in comments and error message
This commit addresses several spelling errors identified by the `codespell`
tool. The corrections apply to:
- Code comments in `fwd.c`, `ip.h`, `isolation.c`, and `log.c`.
- An error message string in `vhost_user.c`.

Specifically, the following misspellings were corrected:
- "adddress" to "address"
- "capabilites" to "capabilities"
- "Musn't" to "Mustn't"
- "calculatd" to "calculated"
- "Invalide" to "Invalid"

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:30 +02:00
Laurent Vivier
4234ace84c test: Display count of skipped tests in status and summary
This commit enhances test reporting by tracking and displaying the
number of skipped tests.

The skipped test count is now visible in the tmux status bar during
execution and included in the final test summary log. This provides
a more complete overview of test suite results.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:19 +02:00
Laurent Vivier
2d3d69c5c3 flow: Fix clang error (clang-analyzer-security.PointerSub)
Fixes the following clang-analyzer warning:

flow_table.h:96:25: note: Subtraction of two pointers that do not point into the same array is undefined behavior
   96 |         return (union flow *)f - flowtab;

The `flow_idx()` function is called via `FLOW_IDX()` from
`flow_foreach_slot()`, where `f` is set to `&flowtab[idx].f`.
Therefore, `f` and `flowtab` do point to the same array.
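
For reference, a minimal sketch of the relationship, with the surrounding
declarations reduced to the essentials (not the literal flow_table.h
contents):

  union flow flowtab[FLOW_MAX];

  static unsigned flow_idx(const struct flow_common *f)
  {
          /* f is always &flowtab[idx].f, so both operands point into
           * the same array and the subtraction is well defined.
           */
          return (union flow *)f - flowtab;
  }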

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:37 +02:00
Laurent Vivier
0f7bf10b0a ndp: Fix Clang analyzer warning (clang-analyzer-security.PointerSub)
Addresses Clang warning: "Subtraction of two pointers that do not
point into the same array is undefined behavior" for the line:
  `ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);`

Here, `ptr` is `&ra.var[0]`. The subtraction calculates the offset
of `var[0]` within the `struct ra_options ra`. Since `ptr` points
inside `ra`, this pointer arithmetic is well-defined for
calculating the size of the data to send, even if `ptr` and `&ra`
are not strictly considered part of the same "array" by the analyzer.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:35 +02:00
Laurent Vivier
a6b9832e49 virtio: Fix Clang warning (bugprone-sizeof-expression, cert-arr39-c)
In `virtqueue_read_indirect_desc()`, the pointer arithmetic involving
`desc` is intentional. We add the length in bytes (`read_len`)
divided by the size of `struct vring_desc` to `desc`, which is
an array of `struct vring_desc`. This correctly calculates the
offset in terms of the number of `struct vring_desc` elements.

Clang issues the following warning due to this explicit scaling:

virtio.c:238:8: error: suspicious usage of 'sizeof(...)' in pointer
arithmetic; this scaled value will be scaled again by the '+='
operator [bugprone-sizeof-expression,cert-arr39-c,-Werror]
  238 |         desc += read_len / sizeof(struct vring_desc);
      |               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~
virtio.c:238:8: note: '+=' in pointer arithmetic internally scales
with 'sizeof(struct vring_desc)' == 16

This behavior is intended, so the warning can be considered a
false positive in this context. The code correctly advances the
pointer by the desired number of descriptor entries.
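
Reduced to a sketch (context simplified, names as in virtio.c):

  struct vring_desc *desc;  /* cursor into an array of descriptors */
  size_t read_len;          /* bytes just read, multiple of sizeof(*desc) */

  /* '+=' on a struct vring_desc * already scales by
   * sizeof(struct vring_desc) == 16, so dividing read_len by that same
   * size first advances the cursor by whole descriptor entries rather
   * than by read_len descriptors.
   */
  desc += read_len / sizeof(struct vring_desc);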

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:32 +02:00
Laurent Vivier
570e7b4454 dhcpv6: fix GCC error (unterminated-string-initialization)
The string STR_NOTONLINK is intentionally not NUL-terminated.
Ignore the GCC error using __attribute__((nonstring)).

This error is reported by GCC 15.1.1 on Fedora 42. However,
Clang 20.1.3 does not support __attribute__((nonstring)).
Therefore, NOLINTNEXTLINE(clang-diagnostic-unknown-attributes)
is also added to suppress Clang's unknown attribute warning.
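
The resulting declaration looks roughly like this (a sketch; the array
size and string literal here are illustrative, only the attribute usage
matters):

  /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
  static const char STR_NOTONLINK[14] __attribute__((nonstring)) =
          "Not on link!!!";     /* exactly fills the array, no NUL */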

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:20 +02:00
Laurent Vivier
8ec134109e flow: close socket fd on error
In eea8a76caf ("flow: fix podman issue"), we unregister
the fd from epoll_ctl() in case of error, but we also need to close it.

As flowside_sock_l4() also calls sock_l4_sa() via flowside_sock_splice(),
we can do it unconditionally.
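
With both fixes applied, the error path in udp_flow_sock() looks roughly
like this (sketch based on the excerpt quoted in eea8a76caf, not the
literal diff):

  if (flowside_connect(c, s, pif, side) < 0) {
          int rc = -errno;

          epoll_del(c, s);  /* undo the registration from sock_l4_sa() */
          close(s);         /* ...and don't leak the descriptor either */
          flow_dbg_perror(uflow, "Couldn't connect flow socket");
          return rc;
  }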

Fixes: eea8a76caf ("flow: fix podman issue")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-12 21:04:57 +02:00
Laurent Vivier
92d5d68013 flow: fix wrong macro name in comments
The name of the macro for the maximum number of flows is FLOW_MAX, not
MAX_FLOW.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-08 13:24:14 +02:00
Laurent Vivier
eea8a76caf flow: fix podman issue
While running pasta, we trigger the following assert:

  ASSERTION FAILED in udp_at_sidx (udp_flow.c:35): flow->f.type == FLOW_UDP

in udp_at_sidx() in the following path:

 902 void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 903                       uint32_t events, const struct timespec *now)
 904 {
 905         struct udp_flow *uflow = udp_at_sidx(ref.flowside);

The invalid sidx is coming from the epoll_ref provided by epoll_wait().

This assert follows the following error:

  Couldn't connect flow socket: Permission denied

It appears that an error happens in udp_flow_sock() and the recently
created fd is not removed from the epoll_ctl() pool:

 71 static int udp_flow_sock(const struct ctx *c,
 72                          struct udp_flow *uflow, unsigned sidei)
 73 {
...
 82         s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 83         if (s < 0) {
 84                 flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 85                 return s;
 86         }
 87
 88         if (flowside_connect(c, s, pif, side) < 0) {
 89                 int rc = -errno;
 90                 flow_dbg_perror(uflow, "Couldn't connect flow socket");
 91                 return rc;
 92         }
...

flowside_sock_l4() calls sock_l4_sa() that adds 's' to the epoll_ctl()
pool.

So to cleanly manage the error of flowside_connect() we need to remove
's' from the epoll_ctl() pool using epoll_del().

Link: https://github.com/containers/podman/issues/26073
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-07 14:42:48 +02:00
Stefano Brivio
587980ca1e udp: Actually discard datagrams we can't forward
Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint
addresses for datagrams, if we can't forward one of these datagrams,
we need to make sure we actually discard it. Otherwise, with MSG_PEEK,
we won't dequeue and loop on it forever.

For example, if we fail to create a socket for a new flow, because,
say, the destination of an inbound packet is multicast, and we can't
bind() to a multicast address, the loop will look like this:

18.0563: Flow 0 (NEW): FREE -> NEW
18.0563: Flow 0 (INI): NEW -> INI
18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0563: Flow 0 (TGT): INI -> TGT
18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0563: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow
18.0564: Flow 0 (NEW): FREE -> NEW
18.0564: Flow 0 (INI): NEW -> INI
18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0564: Flow 0 (TGT): INI -> TGT
18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow

and seen from strace:

epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-03 10:21:20 +02:00
Emanuel Valasiadis
f0021f9e1d fwd: fix doc typo
Signed-off-by: Emanuel Valasiadis <emanuel@valasiadis.space>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-03 03:42:51 +02:00
Janne Grunau
93394f4ef0 selinux: Add getattr to class udp_socket
Commit 59cc89f ("udp, udp_flow: Track our specific address on socket
interfaces") added a getsockname() call in udp_flow_new(). This requires
getattr. Fixes "Flow 0 (UDP flow): Unable to determine local address:
Permission denied" errors in muvm/passt on Fedora Linux 42 with SELinux.

The SELinux audit message is

| type=AVC msg=audit(1746083799.606:235): avc:  denied  { getattr } for
|   pid=2961 comm="passt" laddr=127.0.0.1 lport=49221
|   faddr=127.0.0.53 fport=53
|   scontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tcontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tclass=udp_socket permissive=0

Fixes: 59cc89f4cc ("udp, udp_flow: Track our specific address on socket interfaces")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2363238
Signed-off-by: Janne Grunau <janne-psst@jannau.net>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 12:00:51 +02:00
Laurent Vivier
11be695f5c flow: fix podman issue
While running piHole using podman, traffic can trigger the following
assert:

ASSSERTION FAILED in flow_alloc (flow.c:521): flow->f.state == FLOW_STATE_FREE

Backtrace shows that this happens in flow_defer_handler():

      0x00005610d6f5b481 flow_alloc (passt + 0xb481)
      0x00005610d6f74f86 udp_flow_from_sock (passt + 0x24f86)
      0x00005610d6f737c3 udp_sock_fwd (passt + 0x237c3)
      0x00005610d6f74c07 udp_flush_flow (passt + 0x24c07)
      0x00005610d6f752c2 udp_flow_defer (passt + 0x252c2)
      0x00005610d6f5bce1 flow_defer_handler (passt + 0xbce1)

We are trying to allocate a new flow inside the loop that frees them.

Inside the loop free_head points to the first free flow entry in the
current cluster. But if we allocate a new entry during the loop,
free_head is not updated and can point now to the entry we have just
allocated.

We can fix the problem by splitting the loop in two parts:
- a first part where we can close some of them and allocate some new
  flow entries,
- a second part where we free the entries closed in the previous loop
  and aggregate the free entries to merge consecutive clusters.

Reported-by: Martin Rijntjes <bugs@air-global.nl>
Link: https://github.com/containers/podman/issues/25959
Fixes: 9725e79888 ("udp_flow: Don't discard packets that arrive between bind() and connect()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 11:58:25 +02:00
Stefano Brivio
6a96cd97a5 util: Fix typo, ASSSERTION -> ASSERTION
Fixes: 9153aca15b ("util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-02 11:58:10 +02:00
Stefano Brivio
ea0a1240df passt-repair: Hide bogus gcc warning from -Og
When building with gcc 13 and -Og, we get:

passt-repair.c: In function ‘main’:
passt-repair.c:161:23: warning: ‘ev’ may be used uninitialized [-Wmaybe-uninitialized]
  161 |                 if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
      |                     ~~^~~~~

but that can't actually happen, because we only exit the preceding
while loop if 'found' is true, and that only happens, in turn, as we
assign 'ev'.

Get rid of the warning by (redundantly) initialising ev to NULL.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-30 16:58:58 +02:00
Alyssa Ross
aa1cc89228 conf: allow --fd 0
inetd-style socket passing traditionally starts a service with a
connected socket on file descriptors 0 and 1.  passt disallowing
obtaining its socket from either of these descriptors made it
difficult to use with super-servers providing this interface — in my
case I wanted to use passt with s6-ipcserver[1].  Since (as far as I
can tell) passt does not use standard input for anything else (unlike
standard output), it should be safe to relax the restrictions on --fd
to allow setting it to 0, enabling this use case.

Link: https://skarnet.org/software/s6/s6-ipcserver.html [1]
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-28 14:01:17 +02:00
David Gibson
436afc3044 udp: Translate offender addresses for ICMP messages
We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.

Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host.  However, we don't translate this address for cases where we NAT
between host and guest.  This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address.  The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.

Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry.  We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:05 +02:00
David Gibson
08e617ec2b udp: Rework offender address handling in udp_sock_recverr()
Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.

 * Get the "offender" address (source of the ICMP packet) using the
   SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
 * Parse the offender sockaddr using inany_from_sockaddr()
 * Check explicitly that the source and destination pifs are what we
   expect.  Previously we checked something that was probably equivalent
   in practice, but isn't strictly speaking what we require for the rest
   of the code.
 * Verify that for an ICMPv4 error we also have an IPv4 source/offender
   and destination/endpoint address
 * Verify that for an ICMPv6 error we have an IPv6 endpoint
 * Improve debug reporting of any failures

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:03 +02:00
David Gibson
4668e91378 treewide: Improve robustness against sockaddrs of unexpected family
inany_from_sockaddr() expects a socket address of family AF_INET or
AF_INET6 and ASSERT()s if it gets anything else.  In many of the callers we
can handle an unexpected family more gracefully, though, e.g. by failing
a single flow rather than killing passt.

Change inany_from_sockaddr() to return an error instead of ASSERT()ing,
and handle those errors in the callers.  Improve the reporting of any such
errors while we're at it.

With this greater robustness, allow inany_from_sockaddr() to take a void *
rather than specifically a union sockaddr_inany *.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:00 +02:00
David Gibson
9128f6e8f4 fwd: Split out helpers for port-independent NAT
Currently the functions fwd_nat_from_*() make some address translations
based on both the IP address and protocol port numbers, and others based
only on the address.  We have some upcoming cases where it's useful to use
the IP-address-only translations separately, so split them out into helper
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:41:47 +02:00
David Gibson
2340bbf867 udp: Propagate errors on listening and brand new sockets
udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface.  To do this it
currently requires the flow with which the error is associated as a
parameter.  If that's missing it will clear the error condition, but not
propagate it.

That means that we largely ignore errors on "listening" sockets.  It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created.  In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket.  If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.

This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.

Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow.  So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:56:16 +02:00
David Gibson
cfc0ee145a udp: Minor re-organisation of udp_sock_recverr()
Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions.  We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.

Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:49:06 +02:00
David Gibson
f107a86cc0 udp: Add udp_pktinfo() helper
Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr().  We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.

While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.
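
The scan itself is plain cmsg iteration; a minimal sketch of the IPv4
case, assuming mh is the struct msghdr filled in by recvmsg():

  struct cmsghdr *hdr;
  struct in_addr dst_addr = { 0 };

  for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
          if (hdr->cmsg_level == IPPROTO_IP &&
              hdr->cmsg_type == IP_PKTINFO) {
                  struct in_pktinfo pi;

                  memcpy(&pi, CMSG_DATA(hdr), sizeof(pi));
                  dst_addr = pi.ipi_addr;  /* local destination address */
                  break;
          }
  }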

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:48:35 +02:00
David Gibson
04984578b0 udp: Deal with errors as we go in udp_sock_fwd()
When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().

In fact, we do deal with this correctly, although the path is somewhat
non-obvious.  The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().

Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events.  This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow.  This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.

At the moment, udp_sock_errs() might fail to flush all datagrams if errors
occur.  In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).

We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as they occur, rather than trying to deal with all errors
in advance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:19 +02:00
David Gibson
3f995586b3 udp: Pass socket & flow information direction to error handling functions
udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.

We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference.  Therefore, make these functions take the socket
and flow from explicit parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:09 +02:00
David Gibson
1bb8145c22 udp: Be quieter about errors on UDP receive
If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message.  However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.

This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs().  However, it's possible an error
could be flagged after udp_sock_errs() but before we receive.  So it's
better to handle this error "silently" (trace level only).  We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.

In particular there's one situation that can trigger this case much more
easily.  If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error.  So, we'll get that error on the very first receive after the
connect().  That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg().  This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.

Fixes: 84ab1305fa ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c37 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:56 +02:00
David Gibson
baf049f8e0 udp: Fix breakage of UDP error handling by PKTINFO support
We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets.  This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.

However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams.  udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.

Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b06 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:00 +02:00
Stefano Brivio
50249086a9 conf: Honour --dns-forward for local resolver even with --no-map-gw
If the first resolver listed in the host's /etc/resolv.conf is a
loopback address, and --no-map-gw is given, we automatically conclude
that the resolver is not reachable, discard it, and, if it's the only
nameserver listed in /etc/resolv.conf, we'll warn that we:

  Couldn't get any nameserver address

However, this isn't true in a general case: the user might have passed
--dns-forward, and in that case, while we won't map the address of the
default gateway to the host, we're still supposed to map that
particular address. Otherwise, in this common Podman usage:

  pasta --config-net --dns-forward 169.254.1.1 -t none -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-c02a8d8f-6ee3-902e-33c5-317e0f24e0af --map-guest-addr 169.254.1.2

and with a loopback address in /etc/resolv.conf, we'll unexpectedly
refuse to forward DNS queries:

  # nslookup passt.top 169.254.1.1
  ;; connection timed out; no servers could be reached

To fix this, make an exception for --dns-forward: if &c->ip4.dns_match
or &c->ip6.dns_match are set in add_dns_resolv4() / add_dns_resolv6(),
use that address as guest-facing resolver.

We already set 'dns_host' to the address we found in /etc/resolv.conf,
that's correct in this case and it makes us forward queries as
expected.

I'm not changing the man page as the current description of
--dns-forward is already consistent with the new behaviour: there's no
described way in which --no-map-gw should affect it.

Reported-by: Andrew Sayers <andrew-bugs.passt.top@pileofstuff.org>
Link: https://bugs.passt.top/show_bug.cgi?id=111
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:59 +02:00
Stefano Brivio
bbff3653d6 conf: Split add_dns_resolv() into separate IPv4 and IPv6 versions
Not really valuable by itself, but dropping one level of nested blocks
makes the next change more convenient.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:57 +02:00
David Gibson
59cc89f4cc udp, udp_flow: Track our specific address on socket interfaces
So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs.  That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address.  Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.

This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.

We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams.  We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.

In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works).  So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets.  Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table.  In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.

This adds getsockname() to our seccomp profile for all variants.
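
The target-side case boils down to a getsockname() call right after
connect(); a minimal sketch with generic types, assuming s is the
freshly connect()ed socket:

  struct sockaddr_storage local;
  socklen_t sl = sizeof(local);

  if (getsockname(s, (struct sockaddr *)&local, &sl) < 0)
          return -errno;

  /* 'local' now holds the source address the kernel picked for this
   * socket; record it as oaddr in the flow table entry.
   */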

Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:16 +02:00
David Gibson
695c62396e inany: Improve ASSERT message for bad socket family
inany_from_sockaddr() can only handle sockaddrs of family AF_INET or
AF_INET6 and asserts if given something else.  I hit this assertion while
debugging something else, and wanted to see what the bad sockaddr family
was.  Now that we have ASSERT_WITH_MSG() it's easy to add this information.
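
For illustration, the check can now read something like this (the exact
macro arguments are an assumption, not the literal inany.c change):

  ASSERT_WITH_MSG(sa->sa_family == AF_INET || sa->sa_family == AF_INET6,
                  "Unsupported socket address family: %u", sa->sa_family);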

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:13 +02:00
David Gibson
f4b0dd8b06 udp: Use PKTINFO cmsgs to get destination address for received datagrams
Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address.  Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.

We need this information to properly direct replies to flows which come in
to a non-default local address.  So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr().  For now
we log a trace message but don't do anything more with the information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:45:59 +02:00
David Gibson
6693fa1158 tcp_splice: Don't clobber errno before checking for EAGAIN
Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:27 +02:00
David Gibson
d3f33f3b8e tcp_splice: Don't double count bytes read on EINTR
In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.
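
The pattern is the usual immediate retry, before any byte counters are
touched; a minimal sketch with placeholder fd and length names:

  ssize_t written;

  do
          written = splice(pipe_fd, NULL, sock_fd, NULL, to_write,
                           SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
  while (written < 0 && errno == EINTR);

  /* Only now update conn->read[] and conn->written[] */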

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:16 +02:00
Stefano Brivio
ffbef85e97 conf: Add missing return in conf_nat(), fix --map-guest-addr none
As reported by somebody on IRC:

  $ pasta --map-guest-addr none
  Invalid address to remap to host: none

that's because once we parsed "none", we try to parse it as an address
as well. But we already handled it, so stop once we're done.

Fixes: e813a4df7d ("conf: Allow address remapped to host to be configured")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:49 +02:00
Stefano Brivio
06ef64cdb7 udp_flow: Save 8 bytes in struct udp_flow on 64-bit architectures
Shuffle the fields just added by commits a7775e9550 ("udp: support
traceroute in direction tap-socket") and 9725e79888 ("udp_flow:
Don't discard packets that arrive between bind() and connect()").

On x86_64, as reported by pahole(1), before:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        _Bool                      closed:1;             /*    76: 0  1 */

        /* XXX 7 bits hole, try to pack */

        _Bool                      flush0;               /*    77     1 */
        _Bool                      flush1:1;             /*    78: 0  1 */

        /* XXX 7 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */
        uint8_t                    ttl[2];               /*    96     2 */

        /* size: 104, cachelines: 2, members: 7 */
        /* sum members: 95, holes: 1, sum holes: 1 */
        /* sum bitfield members: 2 bits, bit holes: 2, sum bit holes: 14 bits */
        /* padding: 6 */
        /* last cacheline: 40 bytes */
};

and after:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        uint8_t                    ttl[2];               /*    76     2 */
        _Bool                      closed:1;             /*    78: 0  1 */
        _Bool                      flush0:1;             /*    78: 1  1 */
        _Bool                      flush1:1;             /*    78: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */

        /* size: 96, cachelines: 2, members: 7 */
        /* sum members: 94, holes: 1, sum holes: 1 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 32 bytes */
};

It doesn't matter much because anyway the typical storage for struct
udp_flow is given by union flow:

union flow {
        struct flow_common         f;                  /*     0    76 */
        struct flow_free_cluster   free;               /*     0    84 */
        struct tcp_tap_conn        tcp;                /*     0   120 */
        struct tcp_splice_conn     tcp_splice;         /*     0   120 */
        struct icmp_ping_flow      ping;               /*     0    96 */
        struct udp_flow            udp;                /*     0    96 */
};

but it still improves data locality somewhat, so let me fix this up
now that commits are fresh.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:32 +02:00
David Gibson
9725e79888 udp_flow: Don't discard packets that arrive between bind() and connect()
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow.  However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow.  Currently we handle this by simply discarding any
queued datagrams after the connect.  UDP protocols should be able to handle
such packet loss, but it's not ideal.

We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow.  We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
9eb5406260 udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock
udp_splice_prepare() and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock().  Fold them both into
that caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
bd6a41ee76 udp: Rework udp_listen_sock_data() into udp_sock_fwd()
udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now).  We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference.  So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
159beefa36 udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock()
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket.  It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.

We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fd844a90bc udp: Move UDP_MAX_FRAMES to udp.c
Recent changes mean that this define is no longer used anywhere except in
udp.c.  Move it back into udp.c from udp_internal.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fc6ee68ad3 udp: Merge vhost-user and "buf" listening socket paths
udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure.  The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams.  This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.

This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:52 +02:00
David Gibson
0304dd9c34 udp: Split spliced forwarding path from udp_buf_reply_sock_data()
udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap.  It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.

Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.

This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
 udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:41:32 +02:00
David Gibson
5221e177e1 udp: Parameterize number of datagrams handled by udp_*_reply_sock_data()
Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is.  We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:31:54 +02:00
David Gibson
3a0881dfd0 udp: Don't bother to batch datagrams from "listening" socket
A "listening" UDP socket can receive datagrams from multiple flows.  So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.

However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path.  Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data,

This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.
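
The "peek" itself is a recvmsg() with MSG_PEEK and no data buffer; a
minimal sketch with generic types:

  struct sockaddr_storage src;
  struct msghdr mh = {
          .msg_name    = &src,
          .msg_namelen = sizeof(src),
  };

  /* Fills in the sender's address but leaves the datagram queued */
  if (recvmsg(s, &mh, MSG_PEEK | MSG_DONTWAIT) < 0)
          return -errno;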

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:30:17 +02:00
David Gibson
84ab1305fa udp: Polish udp_vu_sock_info() and remove from vu specific code
udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address.  Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it.  We have upcoming uses for it elsewhere so rename and move to udp.c.

While we're there, polish its error reporting a little.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:29:23 +02:00
David Gibson
1d7bbb101a udp: Make udp_sock_recv() take max number of frames as a parameter
Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode.  However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller.  So
make the maximum number of frames a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:25:33 +02:00
David Gibson
d74b5a7c10 udp: Use connect()ed sockets for initiating side
Currently we have an asymmetry in how we handle UDP sockets.  For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow.  For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it).  This has some disadvantages:

 * We need a hash lookup for every datagram on the listening socket in
   order to work out what flow it belongs to
 * The dup() keeps the socket alive even if automatic forwarding removes
   the listening socket.  However, the epoll data remains the same
   including containing the now stale original fd.  This causes bug 103.
 * We can't (easily) set flow-specific options on an initiating side
   socket, because that could affect other flows as well

Alter the code to use a connect()ed socket on the initiating side as well
as the target side.  There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket.  We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow.  For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.
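
In outline, setting up the new initiating-side socket becomes (generic
sockaddr types, error handling omitted):

  int s, one = 1;

  s = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, IPPROTO_UDP);
  setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

  /* Same local address/port as the listening socket: allowed thanks to
   * SO_REUSEADDR, but datagrams for other flows can still land here
   * until...
   */
  bind(s, (const struct sockaddr *)&local, sizeof(local));

  /* ...connect() narrows the socket down to this flow only */
  connect(s, (const struct sockaddr *)&peer, sizeof(peer));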

Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:24:36 +02:00
Jon Maloy
a7775e9550 udp: support traceroute in direction tap-socket
Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.

We fix that in this commit.

Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:23:35 +02:00
Stefano Brivio
06784d7fc6 passt-repair: Ensure that read buffer is NULL-terminated
After 3d41e4d838 ("passt-repair: Correct off-by-one error verifying
name"), Coverity Scan isn't convinced anymore about the fact that the
ev->name used in the snprintf() is NULL-terminated.

It comes from a read() call, and read() of course doesn't terminate
it, but we already check that the byte at ev->len - 1 is a NULL
terminator, so this is actually a false positive.

In any case, the logic ensuring that ev->name is NULL-terminated isn't
necessarily obvious, and additionally checking that the last byte in
the buffer we read is a NULL terminator is harmless, so do that
explicitly, even if it's redundant.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-05 11:28:23 +02:00
David Gibson
684870a766 udp: Correct some seccomp filter annotations
Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments
stating they use recvmmsg().  That's not correct: they only do so via
udp_sock_recv(), which lists recvmmsg() itself.

In contrast udp_splice_send() and udp_tap_handler() both directly use
sendmmsg(), but only the latter lists it.  Add it to the former as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:29 +02:00
David Gibson
76e554d9ec udp: Simplify updates to UDP flow timestamp
Since UDP has no built in knowledge of connections, the only way we
know when we're done with a UDP flow is a timeout with no activity.
To keep track of this struct udp_flow includes a timestamp to record
the last time we saw traffic on the flow.

For data from listening sockets and from tap, this is done implicitly via
udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly.
However, that logic is duplicated between the vhost-user and "buf" paths.
Make it common in udp_reply_sock_handler() instead.

Technically this is a behavioural change: previously if we got an EPOLLIN
event, but there wasn't actually any data we wouldn't update the timestamp,
now we will.  This should be harmless: if there's an EPOLLIN we expect
there to be data, and even if there isn't the worst we can do is mildly
delay the cleanup of a stale flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:26 +02:00
David Gibson
8aa2d90c8d udp: Remove redundant udp_at_sidx() call in udp_tap_handler()
We already have a pointer to the UDP flow in the variable uflow, so we can
just re-use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:14 +02:00
David Gibson
3d41e4d838 passt-repair: Correct off-by-one error verifying name
passt-repair will generate an error if the name it gets from the kernel is
too long or not NUL terminated.  Downstream testing has reported
occasionally seeing this error in practice.

It turns out there is a trivial off-by-one error in the check: ev->len is
the length of the name, including terminating \0 characters, so to check
for a \0 at the end of the buffer we need to check ev->name[len - 1] not
ev->name[len].

Fixes: 42a854a52b ("pasta, passt-repair: Support multiple events per read() in inotify handlers")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:42 +02:00
David Gibson
dec3d73e1e migrate, tcp: bind() migrated sockets in repair mode
Currently on a migration target, we create then immediately bind() new
sockets for the TCP connections we're reconstructing.  Mostly, this works,
since a socket() that is bound but hasn't had listen() or connect() called
is essentially passive.  However, this bind() is subject to the usual
address conflict checking.  In particular that means if we already have
a listening socket on that port, we'll get an EADDRINUSE.  This will happen
for every connection we try to migrate that was initiated from outside to
the guest, since we necessarily created a listening socket for that case.

We set SO_REUSEADDR on the socket in an attempt to avoid this, but that's
not sufficient; even with SO_REUSEADDR address conflicts are still
prohibited for listening sockets.  Of course once these incoming sockets
are fully repaired and connect()ed they'll no longer conflict, but that
doesn't help us if we fail at the bind().

We can avoid this by not calling bind() until we're already in repair mode
which suppresses this transient conflict.  Because of the batching of
setting repair mode, to do that we need to move the bind to a step in
tcp_flow_migrate_target_ext().
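
In outline, the ordering on the target becomes (sketch, error handling
omitted; TCP_REPAIR comes from <linux/tcp.h>):

  int one = 1;

  setsockopt(s, SOL_TCP, TCP_REPAIR, &one, sizeof(one));

  /* Address conflict checks are relaxed in repair mode, so this no
   * longer fails with EADDRINUSE against our own listening socket.
   */
  bind(s, (const struct sockaddr *)&a, sizeof(a));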

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:01 +02:00
David Gibson
6bfc60b095 platform requirements: Add test for address conflicts with TCP_REPAIR
Simple test program to check the behaviour we need for bind() address
conflicts between listening sockets and repair mode sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:59 +02:00
David Gibson
8e32881ef1 platform requirements: Add attributes to die() function
Add both format string and ((noreturn)) attributes to the version of die()
used in the test programs in doc/platform-requirements.  As well as
potentially catching problems in format strings, this means that the
compiler and static checkers can properly reason about the fact that it
will exit, preventing bogus warnings.
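
For instance, an illustrative prototype:

  __attribute__((format(printf, 1, 2)))
  __attribute__((noreturn))
  static void die(const char *fmt, ...);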

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:57 +02:00
David Gibson
2ed2d59def platform requirements: Fix clang-tidy warning
Recent clang-tidy versions complain about enums defined with some but not
all entries given explicit values.  I'm not entirely convinced about
whether that's a useful warning, but in any case we really don't need the
explicit values in doc/platform-requirements/reuseaddr-priority.c, so
remove them to make clang happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:05 +02:00
David Gibson
3de5af6e41 udp: Improve name of UDP related ICMP sending functions
udp_send_conn_fail_icmp[46]() aren't actually specific to connections
failing: they can propagate a variety of ICMP errors, which might or might
not break a "connection".  They are, however, specific to sending ICMP
errors to the tap connection, not splice or host.  Rename them to better
reflect that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:26:11 +01:00
David Gibson
025a3c2686 udp: Don't attempt to forward ICMP socket errors to other sockets
Recently we added support for detecting ICMP triggered errors on UDP
sockets and forwarding them to the tap interface.  However, in
udp_sock_recverr() where this is handled we don't know for certain that
the tap interface is the other side of the UDP flow.  It could be a spliced
connection with another socket on the other side.

To forward errors in that case, we'd need to force the other side's socket
to issue an ICMP error. I'm not sure if there's a way to do that;
probably not for an arbitrary ICMP but it might be possible for certain
error conditions.

Nonetheless what we do now - synthesise an ICMP on the tap interface - is
certainly wrong.  It's probably harmless; for a spliced connection it will
have loopback addresses meaning we can expect the guest to discard it.
But, correct this for now, by not attempting to propagate errors when the
other side of the flow is a socket.

Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:25:51 +01:00
Stefano Brivio
42a854a52b pasta, passt-repair: Support multiple events per read() in inotify handlers
The current code assumes that we'll get one event per read() on
inotify descriptors, but that's not the case, not from documentation,
and not from reports.

Add loops in the two inotify handlers we have, in pasta-specific code
and passt-repair, to go through all the events we receive.
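
A single read() can return several packed struct inotify_event records;
a minimal sketch of the loop, with buf and fd as placeholders:

  char buf[4096]
          __attribute__((aligned(__alignof__(struct inotify_event))));
  const struct inotify_event *ev;
  ssize_t n;
  char *p;

  n = read(fd, buf, sizeof(buf));

  p = buf;
  while (n > 0 && p < buf + n) {
          ev = (const struct inotify_event *)p;

          /* handle ev->mask / ev->name here */

          p += sizeof(*ev) + ev->len;
  }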

Link: https://bugs.passt.top/show_bug.cgi?id=119
[dwg: Remove unnecessary buffer expansion, use strnlen instead of strlen
 to make Coverity happier]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Add additional check on ev->name and ev->len in passt-repair]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:24:44 +01:00
Jon Maloy
65cca54be8 udp: correct source address for ICMP messages
While developing traceroute forwarding tap-to-sock we found that
struct msghdr.msg_name for the ICMPs in the opposite direction always
contains the destination address of the original UDP message, and not,
as one might expect, the one of the host which created the error message.

Study of the kernel code reveals that this address instead is appended
as extra data after the received struct sock_extended_err area.

We now change the ICMP receive code accordingly.
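
For reference, the kernel exposes that appended address via the
SO_EE_OFFENDER() helper; roughly (a sketch, not the actual passt code):

  #include <linux/errqueue.h>
  #include <sys/socket.h>

  /* Given a control message of type IP_RECVERR/IPV6_RECVERR received with
   * MSG_ERRQUEUE, return the address of the host that originated the ICMP
   * error: it's appended right after struct sock_extended_err.
   */
  static const struct sockaddr *icmp_origin(struct cmsghdr *cmsg)
  {
          struct sock_extended_err *ee;

          ee = (struct sock_extended_err *)CMSG_DATA(cmsg);
          return SO_EE_OFFENDER(ee);
  }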

Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-27 05:05:20 +01:00
Julian Wundrak
664c588be7 build: normalize arm targets
Linux distributions use different dumpmachine outputs for the ARM
architecture: arm, armv6l, armv7l.
For the syscall annotation, these variants are standardized to “arm”.

Link: https://bugs.passt.top/show_bug.cgi?id=117
Signed-off-by: Julian Wundrak <julian@wundrak.net>
[sbrivio: Fix typo: assign from TARGET_ARCH, not from TARGET]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:43:22 +01:00
David Gibson
77883fbdd1 udp: Add helper function for creating connected UDP socket
Currently udp_flow_new() open codes creating and connecting a socket to use
for reply messages.  We have in mind some more places to use this logic,
plus it just makes for a rather large function.  Split this handling out
into a new udp_flow_sock() function.
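
In essence, the split-out helper does something along these lines (a
sketch with hypothetical parameters, not the actual signature):

  #include <sys/socket.h>
  #include <unistd.h>

  /* Create a non-blocking, connected UDP socket dedicated to one flow:
   * bound to our side of the flow and connected to the remote peer, so
   * replies (and ICMP errors) for this flow only land on this socket.
   */
  static int udp_flow_sock(sa_family_t af, const struct sockaddr *local,
                           const struct sockaddr *remote, socklen_t sl)
  {
          int s = socket(af, SOCK_DGRAM | SOCK_NONBLOCK, 0);

          if (s < 0)
                  return -1;

          if (bind(s, local, sl) || connect(s, remote, sl)) {
                  close(s);
                  return -1;
          }

          return s;
  }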

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:34 +01:00
David Gibson
37d78c9ef3 udp: Always hash socket facing flowsides
For UDP packets from the tap interface (like TCP) we use a hash table to
look up which flow they belong to.  Unlike TCP, we sometimes also create a
hash table entry for the socket side of UDP flows.  We need that when we
receive a UDP packet from a "listening" socket which isn't specific to a
single flow.

At present we only do this for the initiating side of flows, which re-use
the listening socket.  For the target side we use a connected "reply"
socket specific to the single flow.

We have in mind changes that may introduce some edge cases where we could
receive UDP packets on a non-flow-specific socket more often.  To allow for
those changes - and slightly simplifying things in the meantime - always
put both sides of a UDP flow - tap or socket - in the hash table.  It's
not that costly, and means we always have the option of falling back to a
hash lookup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:32 +01:00
David Gibson
f67c488b81 udp: Better handling of failure to forward from reply socket
In udp_reply_sock_handler() if we're unable to forward the datagrams we
just print an error.  Generally this means we have an unsupported pair of
pifs in the flow table, though, and that won't have changed.  So, next time we
get a matching packet we'll just get the same failure.  In vhost-user mode
we don't even dequeue the incoming packets which triggered this so we're
likely to get the same failure immediately.

Instead, close the flow, in the same way we do for an unrecoverable error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:30 +01:00
David Gibson
269cf6a12a udp: Share more logic between vu and non-vu reply socket paths
Share some additional miscellaneous logic between the vhost-user and "buf"
paths for data on udp reply sockets.  The biggest piece is error handling
of cases where we can't forward between the two pifs of the flow.  We also
make common some more simple logic locating the correct flow and its
parameters.

This adds some lines of code due to extra comment lines, but nonetheless
reduces logic duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:28 +01:00
David Gibson
d924b7dfc4 udp_vu: Factor things out of udp_vu_reply_sock_data() loop
At the start of every cycle of the loop in udp_vu_reply_sock_data() we:
 - ASSERT that uflow is not NULL
 - Check if the target pif is PIF_TAP
 - Initialize the v6 boolean

However, all of these depend only on the flow, which doesn't change across
the loop.  This is probably a duplication from udp_vu_listen_sock_data(),
where the flow can be different for each packet.  For the reply socket
case, however, factor that logic out of the loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:26 +01:00
David Gibson
5a977c2f4e udp: Simplify checking of epoll event bits
udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN
events.  However, unlike most epoll event handlers we don't check the
event bits right there.  EPOLLERR is checked within udp_sock_errs() which
we call unconditionally.  Checking EPOLLIN is still more buried: it is
checked within both udp_sock_recv() and udp_vu_sock_recv().

We can simplify the logic and pass fewer extraneous parameters around by
moving the checking of the event bits to the top level event handlers.

This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general
event handlers, but specific to EPOLLIN events, meaning new data.  So,
rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better
reflect their function.
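
The resulting shape of the top-level handlers is roughly this (names and
signatures hypothetical):

  #include <stdint.h>
  #include <sys/epoll.h>

  static void udp_sock_errs(int s) { (void)s; /* drain MSG_ERRQUEUE */ }
  static void udp_sock_data(int s) { (void)s; /* read and forward datagrams */ }

  /* Decide what to do from the epoll event bits right here, instead of
   * burying the EPOLLERR/EPOLLIN checks inside the helpers.
   */
  static void udp_sock_handler(int s, uint32_t events)
  {
          if (events & EPOLLERR)
                  udp_sock_errs(s);

          if (events & EPOLLIN)
                  udp_sock_data(s);
  }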

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:23 +01:00
David Gibson
89b203b851 udp: Common invocation of udp_sock_errs() for vhost-user and "buf" paths
The vhost-user and non-vhost-user paths for both udp_listen_sock_handler()
and udp_reply_sock_handler() are more or less completely separate.  Both,
however, start with essentially the same invocation of udp_sock_errs(), so
that can be made common.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:11 +01:00
David Gibson
cf4d3f05c9 packet: Upgrade severity of most packet errors
All errors from packet_check_range(), packet_add() and packet_get() are
trace level.  However, these are for the most part actual error conditions.
They're states that should not happen, in many cases indicating a bug
in the caller or elsewhere.

We don't promote these to err() or ASSERT() level, for fear of a localised
bug on very specific input crashing the entire program, or flooding the
logs, but we can at least upgrade them to debug level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:30 +01:00
David Gibson
0857515c94 packet: ASSERT on signs of pool corruption
If packet_check_range() fails in packet_get_try_do() we just return NULL.
But this check only takes place after we've already validated the given
range against the packet it's in.  That means that if packet_check_range()
fails, the packet pool is already in a corrupted state (we should have
made strictly stronger checks when the packet was added).  Simply returning
NULL and logging a trace() level message isn't really adequate for that
situation; ASSERT instead.

Similarly we check the given idx against both p->count and p->size.  The
latter should be redundant, because count should always be <= size.  If
that's not the case then, again, the pool is already in a corrupted state
and we may have overwritten unknown memory.  Assert for this case too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:27 +01:00
David Gibson
9153aca15b util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
We already have the ASSERT() macro which will abort() passt based on a
condition.  It always has a fixed error message based on its location and
the asserted expression.  We have some upcoming cases where we want to
customise the message when hitting an assert.

Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.
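
A sketch of the general shape such helpers can take (hypothetical, not the
exact passt definitions):

  #include <stdarg.h>
  #include <stdio.h>
  #include <stdlib.h>

  __attribute__((format(printf, 1, 2), noreturn))
  static void abort_with_msg(const char *fmt, ...)
  {
          va_list ap;

          va_start(ap, fmt);
          vfprintf(stderr, fmt, ap);
          va_end(ap);
          fputc('\n', stderr);
          abort();
  }

  /* Like ASSERT(), but with a caller-supplied message, e.g.
   * ASSERT_WITH_MSG(idx < count, "index %zu out of range", idx);
   */
  #define ASSERT_WITH_MSG(expr, ...)                              \
          do {                                                    \
                  if (!(expr))                                    \
                          abort_with_msg(__VA_ARGS__);            \
          } while (0)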

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:24 +01:00
David Gibson
38bcce9977 packet: Rework packet_get() versus packet_get_try()
Most failures of packet_get() indicate a serious problem, and log messages
accordingly.  However, a few callers expect failures here, because they're
probing for a certain range which might or might not be in a packet.  They
use packet_get_try() which passes a NULL func to packet_get_do() to
suppress the logging which is unwanted in this case.

However, this doesn't just suppress the log when packet_get_do() finds the
requested region isn't in the packet.  It suppresses logging for all other
errors too, which do indicate serious problems, even for the callers of
packet_get_try().  Worse it will pass the NULL func on to
packet_check_range() which doesn't expect it, meaning we'll get unhelpful
messages from there if there is a failure.

Fix this by making packet_get_try_do() the primary function which doesn't
log for the case of a range outside the packet.  packet_get_do() becomes a
trivial wrapper around that which logs a message if packet_get_try_do()
returns NULL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:22 +01:00
David Gibson
961aa6a0eb packet: Move checks against PACKET_MAX_LEN to packet_check_range()
Both the callers of packet_check_range() separately verify that the given
length does not exceed PACKET_MAX_LEN.  Fold that check into
packet_check_range() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:20 +01:00
David Gibson
37d9f374d9 packet: Avoid integer overflows in packet_get_do()
In packet_get_do() both offset and len are essentially untrusted.  We do
some validation of len (check it's < PACKET_MAX_LEN), but that's not enough
to ensure that (len + offset) doesn't overflow.  Rearrange our calculation
to make sure it's safe regardless of the given offset & len values.
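
The usual way to do that is to compare against the remaining space instead
of ever computing offset + len; a sketch with hypothetical names:

  #include <stdbool.h>
  #include <stddef.h>

  /* True if [offset, offset + len) lies within a packet of pkt_len bytes,
   * without forming offset + len, which could wrap around.
   */
  static bool range_fits(size_t pkt_len, size_t offset, size_t len)
  {
          if (offset > pkt_len)
                  return false;

          return len <= pkt_len - offset; /* no underflow: offset <= pkt_len */
  }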

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:18 +01:00
David Gibson
c48331ca51 packet: Correct type of PACKET_MAX_LEN
PACKET_MAX_LEN is usually involved in calculations on size_t values - the
type of the iov_len field in struct iovec.  However, being defined bare as
UINT16_MAX, the compiler is likely to assign it a shorter type.  This can
lead to unexpected promotions (or lack thereof).  Add a cast to force the
type to be what we expect.
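
That is, something along these lines:

  #include <stdint.h>

  /* Force size_t so arithmetic and comparisons against iov_len values
   * don't go through surprising integer promotions:
   */
  #define PACKET_MAX_LEN ((size_t)UINT16_MAX)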

Fixes: c43972ad6 ("packet: Give explicit name to maximum packet size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:15 +01:00
David Gibson
9866d146e6 tap: Clarify calculation of TAP_MSGS
The rationale behind the calculation of TAP_MSGS isn't necessarily obvious.
It's supposed to be the maximum number of packets that can fit in pkt_buf.
However, the calculation is wrong in several ways:
 * It's based on ETH_ZLEN which isn't meaningful for virtual devices
 * It always includes the qemu socket header which isn't used for pasta
 * The size of pkt_buf isn't relevant for vhost-user

We've already made sure this is just a tuning parameter, not a hard limit.
Clarify what we're calculating here and why.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:12 +01:00
David Gibson
a41d6d125e tap: Make size of pool_tap[46] purely a tuning parameter
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Furthermore, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we do exceed TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:09 +01:00
David Gibson
e43e00719d packet: More cautious checks to avoid pointer arithmetic UB
packet_check_range() and vu_packet_check_range() verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to.  However, in doing so they don't avoid the possibility of
an integer overflow while performing pointer arithmetic, which is UB.  In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.

To do this safely, we can't calculate the end of a memory region with
pointer addition when the length is untrusted.  Instead we must work
out the offset of one memory region within another using pointer
subtraction, then do integer checks against the length of the outer region.
We then need to be careful about the order of checks so that those integer
checks can't themselves overflow.
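
In outline, the safe pattern described above looks like this (a sketch with
hypothetical names; p is assumed to have been derived from buf, so the
pointer subtraction is well defined):

  #include <stdbool.h>
  #include <stddef.h>

  /* Check that [p, p + len) lies within [buf, buf + buf_size) without
   * computing an end pointer from the untrusted length: derive an offset
   * by pointer subtraction, then do integer checks ordered so they can't
   * overflow.
   */
  static bool range_in_buf(const char *buf, size_t buf_size,
                           const char *p, size_t len)
  {
          size_t offset;

          if (p < buf)
                  return false;

          offset = (size_t)(p - buf);
          if (offset > buf_size)
                  return false;

          return len <= buf_size - offset;
  }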

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:06 +01:00
David Gibson
4592719a74 vu_common: Tighten vu_packet_check_range()
This function verifies that the given packet is within the mmap()ed memory
region of the vhost-user device.  We can do better, however.  The packet
should be not only within the mmap()ed range, but specifically in the
subsection of that range set aside for shared buffers, which starts at
dev_region->mmap_offset within there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:32:50 +01:00
Stefano Brivio
32f6212551 Makefile: Enable -Wformat-security
It looks like an easy win to prevent a number of possible security
flaws.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:53 +01:00
Stefano Brivio
07c2d584b3 conf: Include libgen.h for basename(), fix build against musl
Fixes: 4b17d042c7 ("conf: Move mode detection into helper function")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:49 +01:00
Stefano Brivio
ebdd46367c tcp: Flush socket before checking for more data in active close state
Otherwise, if all the pending data is acknowledged:

- tcp_update_seqack_from_tap() updates the current tap-side ACK
  sequence (conn->seq_ack_from_tap)

- next, we compare the sequence we sent (conn->seq_to_tap) to the
  ACK sequence (conn->seq_ack_from_tap) in tcp_data_from_sock() to
  understand if there's more data we can send.

  If they match, we conclude that we haven't sent any of that data,
  and keep re-sending it.

We need, instead, to flush the socket (drop acknowledged data) before
calling tcp_update_seqack_from_tap(), so that once we update
conn->seq_ack_from_tap, we can be sure that all data up to that point is
gone from the socket.
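
For reference, one way to drop the acknowledged bytes on Linux (a sketch;
assumption: MSG_TRUNC with a NULL buffer discards data on a TCP socket,
which is Linux-specific behaviour):

  #include <stddef.h>
  #include <sys/socket.h>

  /* Discard len already-acknowledged bytes from the socket receive queue
   * without copying them anywhere.
   */
  static void sock_flush(int s, size_t len)
  {
          recv(s, NULL, len, MSG_DONTWAIT | MSG_TRUNC);
  }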

Link: https://bugs.passt.top/show_bug.cgi?id=114
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 30f1e082c3 ("tcp: Keep updating window and checking for socket data after FIN from guest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:43 +01:00
David Gibson
c250ffc5c1 migrate: Bump migration version number
v1 of the migration stream format had some flaws: it didn't properly
handle endianness of the MSS field, and it didn't transfer the RFC7323
timestamp.  We've now fixed those bugs, but it requires incompatible
changes to the stream format.

Because of the timestamps in particular, v1 is not really usable, so there
is little point maintaining compatible support for it.  However, v1 is in
released packages, both upstream and downstream (RHEL at least).  Just
updating the stream format without bumping the version would lead to very
cryptic errors if anyone did attempt to migrate between an old and new
passt.

So, bump the migration version to v2, so we'll get a clear error message if
anyone attempts this.  We don't attempt to maintain backwards compatibility
with v1, however: we'll simply fail if given a v1 stream.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 17:17:18 +01:00
David Gibson
cfb3740568 migrate, tcp: Migrate RFC 7323 timestamp
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.
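
On Linux, the current timestamp value can be read with the TCP_TIMESTAMP
socket option and, while the socket is in repair mode, written back the
same way; roughly (a sketch, not the actual passt code):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdint.h>
  #include <sys/socket.h>

  /* Dump on the source... */
  static int tcp_timestamp_dump(int s, uint32_t *ts)
  {
          socklen_t sl = sizeof(*ts);

          return getsockopt(s, IPPROTO_TCP, TCP_TIMESTAMP, ts, &sl);
  }

  /* ...and restore on the target (socket must be in TCP_REPAIR mode) */
  static int tcp_timestamp_restore(int s, uint32_t ts)
  {
          return setsockopt(s, IPPROTO_TCP, TCP_TIMESTAMP, &ts, sizeof(ts));
  }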

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:27:27 +01:00
David Gibson
28772ee91a migrate, tcp: More careful marshalling of mss parameter during migration
During migration we extract the limit on segment size using TCP_MAXSEG,
and set it on the other side with TCP_REPAIR_OPTIONS.  However, unlike most
32-bit values we transfer we transfer it in native endian, not network
endian.  This is not correct; add it to the list of endian fixups we make.

In addition, while MAXSEG will be 32-bits in practice, and is given as such
to TCP_REPAIR_OPTIONS, the TCP_MAXSEG sockopt treats it as an 'int'.  It's
not strictly safe to pass a uint32_t to a getsockopt() expecting an int,
although we'll get away with it on most (maybe all) platforms.  Correct
this as well.
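
Roughly, the marshalling side of that looks like this (a sketch with a
hypothetical helper):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdint.h>
  #include <sys/socket.h>

  /* Query the MSS as the int that TCP_MAXSEG actually deals in, then
   * convert it to a fixed (network) endianness for the migration stream.
   */
  static int tcp_mss_to_wire(int s, uint32_t *mss_be)
  {
          int mss;
          socklen_t sl = sizeof(mss);

          if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &sl))
                  return -1;

          *mss_be = htonl((uint32_t)mss);
          return 0;
  }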

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor coding style fix]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:25:12 +01:00
Stefano Brivio
51f3c071a7 passt-repair: Fix build with -Werror=format-security
Fixes: 0470170247 ("passt-repair: Add directory watch")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-18 17:18:47 +01:00
David Gibson
cb5b593563 tcp, flow: Better use flow specific logging helpers
A number of places in the TCP code use general logging functions, instead
of the flow specific ones.  That includes a few older ones as well as many
places in the new migration code.  Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.

Convert many of these to use the existing flow specific helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:40 +01:00
David Gibson
96fe5548cb conf: Unify several paths in conf_ports()
In conf_ports() we have three different paths which actually do the setup
of an individual forwarded port: one for the "all" case, one for the
exclusions only case and one for the range of ports with possible
exclusions case.

We can unify those cases using a new helper which handles a single range
of ports, with a bitmap of exclusions.  Although this is slightly longer
(largely due to the new helpers function comment), it reduces duplicated
logic.  It will also make future improvements to the tracking of port
forwards easier.

The new conf_ports_range_except() function has a pretty prodigious
parameter list, but I still think it's an overall improvement in conceptual
complexity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:23 +01:00
David Gibson
78f1f0fdfc test/perf: Simplify iperf3 server lifetime management
After we start the iperf3 server in the background, we have a sleep to
make sure it's ready to receive connections.  We can simplify this slightly
by using the -D option to have iperf3 background itself rather than
backgrounding it manually.  That won't return until the server is ready to
use.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
26df8a3608 conf: Limit maximum MTU based on backend frame size
The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers.  We currently limit it to ETH_MAX_MTU
which sounds like it makes sense.  But ETH_MAX_MTU is confusing: it's not
used consistently, sometimes meaning the maximum L3 datagram size and
sometimes the maximum L2 frame size.  Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.

Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header.  The latter can vary
depending on our tap backend, although it doesn't right now.
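
Conceptually (a sketch, assuming a per-backend tap_l2_max_len()-style
value is available):

  #include <linux/if_ether.h>

  #define IP_MAX_MTU 65535        /* maximum L3 datagram size IP allows */

  /* Largest MTU we can accept: limited by IP itself, and by what still
   * fits in an L2 frame for the current tap backend once the Ethernet
   * header is accounted for.
   */
  static unsigned int max_mtu(unsigned int l2_max_len)
  {
          unsigned int fit = l2_max_len - ETH_HLEN;

          return fit < IP_MAX_MTU ? fit : IP_MAX_MTU;
  }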

Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
9d1a6b3eba pcap: Correctly set snaplen based on tap backend type
The pcap header includes a value indicating how much of each frame is
captured.  We always capture the entire frame, so we want to set this to
the maximum possible frame size.  Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.

Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen.  While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
b6945e0553 Simplify sizing of pkt_buf
We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c4bfa3339c tap: Use explicit defines for maximum length of L2 frame
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame.  This define comes from the kernel, but it's badly named and
used confusingly.

First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths.  It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).

Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers.  In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers.  In tap.c we're mostly
using it in the second way.

Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.

Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
1eda8de438 packet: Remove redundant TAP_BUF_BYTES define
Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c43972ad67 packet: Give explicit name to maximum packet size
We verify that every packet we store in a pool (and every partial packet
we retreive from it) has a length no longer than UINT16_MAX.  This
originated in the older packet pool implementation which stored packet
lengths in a uint16_t.  Now, that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.

We may have reasons to (slightly) increase this limit in future, so in
preparation, give this quantity an explicit name - PACKET_MAX_LEN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
74cd82adc8 conf: Detect vhost-user mode earlier
We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option.  That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.

However, there are many other places in the main option parsing loop which
also rely on mode.  We get away with those, because they happen to be able
to treat passt and vhost-user modes identically.  This is potentially
confusing, though.  So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.

To match, we move the parsing of --repair-path back into the main option
parsing loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
4b17d042c7 conf: Move mode detection into helper function
One of the first things we need to do is determine if we're in passt mode
or pasta mode.  Currently this is open-coded in main(), by examining
argv[0].  We want to complexify this a bit in future to cover vhost-user
mode as well.  Prepare for this, by moving the mode detection into a new
conf_mode() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
bb00a0499f conf: Use the same optstring for passt and pasta modes
Currently we rely on detecting our mode first and use different sets of
(single character) options for each.  This means that if you use an option
valid in only one mode in another you'll get the generic usage() message.

We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
Stefano Brivio
c8b520c062 flow, repair: Wait for a short while for passt-repair to connect
...and time out after that. This will be needed because of an upcoming
change to passt-repair enabling it to start before passt is started,
on both source and target, by means of an inotify watch.

Once the inotify watch triggers, passt-repair will connect right away,
but we have no guarantees that the connection completes before we
start the migration process, so wait for it (for a reasonable amount
of time).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 23:08:33 +01:00
Stefano Brivio
0470170247 passt-repair: Add directory watch
It might not be feasible for users to start passt-repair after passt
is started, on a migration target, but before the migration process
starts.

For instance, with libvirt, the guest domain (and, hence, passt) is
started on the target as part of the migration process. At least for
the moment being, there's no hook a libvirt user (including KubeVirt)
can use to start passt-repair before the migration starts.

Add a directory watch using inotify: if PATH is a directory, instead
of connecting to it, we'll watch for a .repair socket file to appear
in it, and then attempt to connect to that socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 21:34:36 +01:00
David Gibson
2b58b22845 cppcheck: Add suppressions for "logically" exported functions
We have some functions in our headers which are definitely there on
purpose.  However, they're not yet used outside the files in which they're
defined.  That causes sufficiently recent cppcheck versions (2.17) to
complain they should be static.

Suppress the errors for these "logically" exported functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
a83c806d17 vhost_user: Don't export several functions
vhost-user added several functions which are exposed in headers, but not
used outside the file where they're defined.  I can't tell if these are
really internal functions, or of they're logically supposed to be exported,
but we don't happen to have anything using them yet.

For the time being, just remove the exports.  We can add them back if we
need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
27395e67c2 tcp: Don't export tcp_update_csum()
tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneeded prototype and make it static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
12d5b36b2f checksum: Don't export various functions
Several of the exposed functions in checksum.h are no longer directly used.
Remove them from the header, and make static.  In particular sum_16b()
should not be used outside: generally csum_unfolded() should be used which
will automatically use either the AVX2 optimized version or sum_16b() as
necessary.

csum_fold() and csum() could have external uses, but they're not used right
now.  We can expose them again if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
e36c35c952 log: Don't export passt_vsyslog()
passt_vsyslog() is an exposed function in log.h.  However it shouldn't
be called from outside log.c: it writes specifically to the system log,
and most code should call passt's logging helpers which might go to the
syslog or to a log file.

Make passt_vsyslog() local to log.c.  This requires a code motion to avoid
a forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
57d2db370b treewide: Mark assorted functions static
This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
68b04182e0 udp: create and send ICMPv6 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
87e6a46442 tap: break out building of udp header from tap_udp6_send function
We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
55431f0077 udp: create and send ICMPv4 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:19 +01:00
Jon Maloy
82a839be98 tap: break out building of udp header from tap_udp4_send function
We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:36 +01:00
David Gibson
1924e25f07 conf: Be more precise about minimum MTUs
Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68).  That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4.  For IPv6, it's not sufficient: that requires an MTU of at least
1280.

Newer kernels have better-named constants IPV4_MIN_MTU and IPV6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:23 +01:00
David Gibson
672d786de1 tcp: Send RST in response to guest packets that match no connection
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:32 +01:00
David Gibson
1f236817ea tap: Consider IPv6 flow label when building packet sequences
To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers.  Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs.  We ignore the IPv6 flow label.

We have some future cases where we want to consider the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels.  Add the flow label to tap6_l4_t and group by it as well
as the other criteria.  In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.

The guest should use the same flow label for all packets in a flow, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:29 +01:00
David Gibson
008175636c ip: Helpers to access IPv6 flow label
The flow label is a 20-bit field in the IPv6 header.  The length and
alignment make it awkward to pass around as is.  Obviously, it can be
packed into a 32-bit integer though, and we do this in two places.  We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.
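
In spirit, something like this (a sketch using the standard struct ip6_hdr;
the flow label is the low 20 bits of the first 32-bit word of the IPv6
header, after the 4-bit version and 8-bit traffic class):

  #include <arpa/inet.h>
  #include <netinet/ip6.h>
  #include <stdint.h>

  static uint32_t ip6_get_flow_lbl(const struct ip6_hdr *ip6h)
  {
          return ntohl(ip6h->ip6_flow) & 0xfffff;
  }

  static void ip6_set_flow_lbl(struct ip6_hdr *ip6h, uint32_t label)
  {
          uint32_t first = ntohl(ip6h->ip6_flow) & ~0xfffffU;

          ip6h->ip6_flow = htonl(first | (label & 0xfffff));
  }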

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:17 +01:00
David Gibson
52419a64f2 migrate, tcp: Don't flow_alloc_cancel() during incoming migration
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer one to one match to the flows which the source is
sending data for.  This means that data for later flows will be mismatched
to a different flow.  Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:38 +01:00
David Gibson
b2708218a6 tcp: Unconditionally move to CLOSED state on tcp_rst()
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state.  However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong.  So we should still mark
the flow as CLOSED.  That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:36 +01:00
David Gibson
56ce03ed0a tcp: Correct error code handling from tcp_flow_repair_socket()
There are two small bugs in error returns from tcp_flow_repair_socket(),
which is supposed to return a negative errno code:

1) On bind() failures, we directly pass on the return code from bind(),
   which is just 0 or -1, instead of an error code.

2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.

Correct both of these.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:35 +01:00
David Gibson
39f85bce1a migrate, flow: Don't attempt to migrate TCP flows without passt-repair
Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:33 +01:00
David Gibson
7b92f2e852 migrate, flow: Trivially succeed if migrating with no flows
We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:21 +01:00
Stefano Brivio
87471731e6 selinux: Fixes/workarounds for passt and passt-repair, mostly for libvirt usage
Here are a bunch of workarounds and a couple of fixes for libvirt
usage which are rather hard to split into single logical patches
as there appear to be some obscure dependencies between some of them:

- passt-repair needs to have an exec_type typeattribute (otherwise
  the policy for lsmd(1) causes a violation on getattr on its
  executable) file, and that typeattribute just happened to be there
  for passt as a result of init_daemon_domain(), but passt-repair
  isn't a daemon, so we need an explicit corecmd_executable_file()

- passt-repair needs a workaround, which I'll revisit once
  https://github.com/fedora-selinux/selinux-policy/issues/2579 is
  solved, for usage with libvirt: allow it to use qemu_var_run_t
  and virt_var_run_t sockets

- add 'bpf' and 'dac_read_search' capabilities for passt-repair:
  they are needed (for whatever reason I didn't investigate) to
  actually receive socket files via SCM_RIGHTS

- passt needs further workarounds in the sense of
  https://github.com/fedora-selinux/selinux-policy/issues/2579:
  allow it to use map and use svirt_tmpfs_t (not just svirt_image_t):
  it depends on where the libvirt guest image is

- ...it also needs to map /dev/null if <access mode='shared'/> is
  enabled in libvirt's XML for the memoryBacking object, for
  vhost-user operation

- and 'ioctl' on the TCP socket appears to be actually needed, on top
  of 'getattr', to dump some socket parameters

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:14:01 +01:00
Michal Privoznik
be86232f72 seccomp.sh: Silence stty errors
When printing list of allowed syscalls the width of terminal is
obtained for nicer output (see commit below). The width is
obtained by running 'stty'. While this works when building from a
console, it doesn't work during rpmbuild/emerge/.. as stdout is
usually not a console but a logfile and stdin is usually
/dev/null or something. This results in stty reporting errors
like this:

  stty: 'standard input': Inappropriate ioctl for device

Redirect stty's stderr to /dev/null to silence it.

Fixes: 712ca32353 ("seccomp.sh: Try to account for terminal width while formatting list of system calls")
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-24 18:46:28 +01:00
Jon Maloy
ea69ca6a20 tap: always set the no_frag flag in IPv4 headers
When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Discussions on the
Internet on this subject indicate that modern routers never
fragment packets, and that it isn't even supported in many cases.

Adding to this that incoming messages forwarded on the tap interface
never even pass through a router it seems safe to always set this flag.

This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.
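
That is, when building the IPv4 header for a frame going to the tap
interface, set the DF bit in frag_off; a minimal sketch:

  #include <arpa/inet.h>
  #include <linux/ip.h>

  /* Mark a synthesised IPv4 header as "don't fragment" (DF) */
  static void ip4_set_no_frag(struct iphdr *iph)
  {
          iph->frag_off = htons(0x4000);  /* IP_DF */
  }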

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-20 12:43:00 +01:00
Stefano Brivio
4dac2351fa contrib/fedora: Actually install passt-repair SELinux policy file
Otherwise we build it, but we don't install it. Not an issue that
warrants a release right away as it's anyway usable.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:53 +01:00
Stefano Brivio
16553c8280 dhcp: Add option code byte in calculation for OPT_MAX boundary check
Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as
intended:

  $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this"
  $ hostname="__eighteen_bytes__"
  $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4
  Saving packet capture to dhcp.pcap
  $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length"
      Total Length: 577

This was hidden by the issue fixed by commit bcc4908c2b ("dhcp:
Remove option 255 length byte") until now.

Fixes: 31e8109a86 ("dhcp, dhcpv6: Add hostname and client fqdn ops")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:19 +01:00
Stefano Brivio
183bedf478 Makefile: Use mmap2() as alternative for mmap() in valgrind extra syscalls
...instead of unconditionally trying to enable both: mmap2() is the
32-bit ARM variant for mmap() (and perhaps for other architectures),
but if mmap() is available, valgrind will use that one.

This avoids seccomp.sh warning us about missing mmap2() if mmap() is
present, and is consistent with what we do in vhost-user code.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 16:36:47 +01:00
David Gibson
1cc5d4c9fe conf: Use 0 instead of -1 as "unassigned" mtu value
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default).  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:41 +01:00
David Gibson
3dc7da68a2 conf: More thorough error checking when parsing --mtu option
We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
  * Passing a non-number like '-m xyz' will not give an error and act like
    -m 0
  * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
    giving an error
  * We parse the MTU as a long, then immediately assign to an int, so on
    some platforms certain ludicrously out of bounds values will be
    silently truncated, rather than giving an error

Be a bit more thorough with the error checking to avoid that.
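
A typical strict-parsing pattern for this kind of option (sketch):

  #include <errno.h>
  #include <stdlib.h>

  /* Parse an MTU argument: reject non-numbers, trailing junk and
   * out-of-range values instead of silently accepting them.
   */
  static long parse_mtu(const char *arg)
  {
          char *end;
          long mtu;

          errno = 0;
          mtu = strtol(arg, &end, 0);
          if (errno || end == arg || *end || mtu < 0 || mtu > 65535)
                  return -1;

          return mtu;
  }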

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:38 +01:00
David Gibson
65e317a8fc flow: Clean up and generalise flow traversal macros
The migration code introduced a number of 'foreach' macros to traverse the
flow table.  These aren't inherently tied to migration, so polish up their
naming, move them to flow_table.h and also use in flow_defer_handler()
which is the other place we need to traverse the whole table.

For now we keep foreach_established_tcp_flow() as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:36 +01:00
David Gibson
b79a22d360 flow: Remove unneeded bound parameter from flow traversal macros
The foreach macros used to step through flows each take a 'bound' parameter
to only scan part of the flow table.  Only one place actually passes a
bound different from FLOW_MAX.  So we can simplify every other invocation
by having that one case manually handle the bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:34 +01:00
David Gibson
7ffca35fdd flow: Remove unneeded index from foreach_* macros
The foreach macros are odd in that they take two loop counters: an integer
index, and a pointer to the flow.  We nearly always want the latter, not
the former, and we can get the index from the pointer trivially when we
need it.  So, rearrange the macros not to need the integer index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:20 +01:00
David Gibson
adb46c11d0 flow: Add flow_perror() helper
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:12 +01:00
David Gibson
ba0823f8a0 tcp: Don't pass both flow pointer and flow index
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:10 +01:00
David Gibson
854bc7b1a3 tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:08 +01:00
David Gibson
e56c8038fc tcp: More type safety for tcp_flow_migrate_target_ext()
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:32:52 +01:00
David Gibson
5a07eb3ccc tcp_vu: head_cnt need not be global
head_cnt is a global variable which tracks how many entries in head[] are
currently used.  The fact that it's global obscures the fact that the
lifetime over which it has a meaningful value is quite short: a single
call to tcp_vu_data_from_sock().

Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer.
We keep the head[] array global for now - although technically it has the
same valid lifetime - because it's large enough we might not want to put
it on the stack.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 11:28:37 +01:00
David Gibson
6b4065153c tap: Remove unused ETH_HDR_INIT() macro
The uses of this macro were removed in d4598e1d18 ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:18 +01:00
David Gibson
354bc0bab1 packet: Don't pass start and offset separately to packet_check_range()
Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:12 +01:00
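
As a rough sketch of the simplified check (hypothetical names, not passt's
actual packet_check_range(), which also carries context for error
reporting): callers now pass the start of the range they care about
directly, instead of a packet start plus an offset that were only ever
added together.

  #include <stdbool.h>
  #include <stddef.h>

  /* Illustrative only: is [start, start + len) inside the pool's buffer? */
  static bool range_in_buf(const char *buf, size_t buf_size,
  			 const char *start, size_t len)
  {
  	if (start < buf || len > buf_size)
  		return false;

  	return (size_t)(start - buf) <= buf_size - len;
  }
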
David Gibson
0a51060f7a packet: Use flexible array member in struct pool
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:04 +01:00
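
For illustration, a minimal sketch of the C11 flexible array member this
refers to; struct and field names here are made up, not passt's actual
struct pool:

  #include <stddef.h>
  #include <stdlib.h>
  #include <sys/uio.h>

  struct pool_sketch {
  	size_t size;		/* number of usable slots */
  	size_t count;		/* slots currently in use */
  	struct iovec pkt[];	/* C11 flexible array member, replaces pkt[1] */
  };

  /* Allocate a pool with room for n packet descriptors in a single block */
  static struct pool_sketch *pool_alloc(size_t n)
  {
  	struct pool_sketch *p = calloc(1, sizeof(*p) + n * sizeof(struct iovec));

  	if (p)
  		p->size = n;

  	return p;
  }

Unlike the pkt[1] trick, sizeof(struct pool_sketch) doesn't count any
array element, so the allocation size above is exact.
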
Enrique Llorente
bcc4908c2b dhcp: Remove option 255 length byte
Option 255 (end of options) does not need the length byte; removing it
allows one extra byte for other dynamic options.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:42:35 +01:00
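
As a small illustration (hypothetical helpers, not passt's dhcp.c): a
regular option is encoded as tag, length, data, while the END option is
the bare tag, which is where the spare byte comes from.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Append a regular DHCP option: tag, length, then 'len' bytes of data */
  static size_t opt_put(uint8_t *o, size_t off, uint8_t tag,
  		      uint8_t len, const void *data)
  {
  	o[off++] = tag;
  	o[off++] = len;
  	memcpy(o + off, data, len);

  	return off + len;
  }

  /* Append the end-of-options marker: option 255 is a bare tag, no length */
  static size_t opt_end(uint8_t *o, size_t off)
  {
  	o[off++] = 255;

  	return off;
  }
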
Stefano Brivio
a1e48a02ff test: Add migration tests
PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:36 +01:00
Stefano Brivio
89ecf2fd40 migrate: Migrate TCP flows
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:03 +01:00
Stefano Brivio
3e903bbb1f repair, passt-repair: Build and warning fixes for musl
Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-17 08:28:48 +01:00
Stefano Brivio
01b6a164d9 tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
In commit e5eefe7743 ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:28:45 +01:00
Stefano Brivio
667caa09c6 tcp_splice: Don't wake up on input data if we can't write it anywhere
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long as we're waiting for
output readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:27:30 +01:00
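
A minimal sketch of the epoll adjustment, with illustrative names only:
drop EPOLLIN on the reading side while the writing side is blocked on
EPOLLOUT, and restore it once the pipe drains.

  #include <sys/epoll.h>

  /* Illustrative only: update the reading side's interest set depending on
   * whether the other side is blocked waiting for output readiness. */
  static int splice_update_read_events(int epollfd, int read_fd,
  				     int peer_blocked_on_out)
  {
  	struct epoll_event ev = {
  		.events = peer_blocked_on_out ? 0 : EPOLLIN,
  		.data.fd = read_fd,
  	};

  	return epoll_ctl(epollfd, EPOLL_CTL_MOD, read_fd, &ev);
  }
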
David Gibson
7c33b12086 vhost_user: Clear ring address on GET_VRING_BASE
GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-15 05:34:21 +01:00
Stefano Brivio
71249ef3f9 tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values
I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 12:02:55 +01:00
Stefano Brivio
30f1e082c3 tcp: Keep updating window and checking for socket data after FIN from guest
Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

While at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
98d474c895 contrib/selinux: Enable mapping guest memory for libvirt guests
This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
9a84df4c3f selinux: Add rules needed to run tests
...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-13 00:42:52 +01:00
David Gibson
a301158456 rampstream: Add utility to test for corruption of data streams
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:17 +01:00
Stefano Brivio
6f122f0171 tcp: Get bound address for connected inbound sockets too
So that we can bind inbound sockets to specific addresses, like we
already do for outbound sockets.

While at it, change the error message in tcp_conn_from_tap() to match
this one.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:00 +01:00
Stefano Brivio
f3fe795ff5 vhost_user: Make source quit after reporting migration state
This will close all the sockets we currently have open in repair mode,
completing our migration tasks as source.
to have us back at this point, somebody needs to restart us.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:51 +01:00
Stefano Brivio
b899141ad5 Add interfaces and configuration bits for passt-repair
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:28 +01:00
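
A much-simplified sketch of the batching idea (hypothetical helper, not
passt's repair.c): collect up to SCM_MAX_FD sockets and send them in a
single SCM_RIGHTS message together with the requested command byte.

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  #define MAX_FDS	253	/* SCM_MAX_FD: kernel limit per SCM_RIGHTS message */

  /* Send up to MAX_FDS socket fds plus a one-byte command in one message */
  static int repair_send_sketch(int us, const int *fds, int n, char cmd)
  {
  	union {
  		char buf[CMSG_SPACE(MAX_FDS * sizeof(int))];
  		struct cmsghdr align;
  	} u;
  	struct iovec iov = { .iov_base = &cmd, .iov_len = 1 };
  	struct msghdr mh = {
  		.msg_iov = &iov,
  		.msg_iovlen = 1,
  		.msg_control = u.buf,
  		.msg_controllen = CMSG_SPACE(n * sizeof(int)),
  	};
  	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&mh);

  	cmsg->cmsg_level = SOL_SOCKET;
  	cmsg->cmsg_type = SCM_RIGHTS;
  	cmsg->cmsg_len = CMSG_LEN(n * sizeof(int));
  	memcpy(CMSG_DATA(cmsg), fds, n * sizeof(int));

  	return sendmsg(us, &mh, 0) == 1 ? 0 : -1;
  }

passt-repair then applies the requested TCP_REPAIR setting to every
descriptor it received and sends back a confirmation, which
repair_flush() waits for.
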
David Gibson
155cd0c41e migrate: Migrate guest observed addresses
Most of the information in struct ctx doesn't need to be migrated.
Either it's strictly back end information which is allowed to differ
between the two ends, or it must already be configured identically on
the two ends.

There are a few exceptions though.  In particular passt learns several
addresses of the guest by observing what it sends out.  If we lose
this information across migration we might get away with it, but if
there are active flows we might misdirect some packets before
re-learning the guest address.

Avoid this by migrating the guest's observed addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style stuff, comments, etc.]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:17 +01:00
Stefano Brivio
5911e08c0f migrate: Skeleton of live migration logic
Introduce facilities for guest migration on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:07 +01:00
Stefano Brivio
836fe215e0 passt-repair: Fix off-by-one in check for number of file descriptors
Actually, 254 is too many, but 253 isn't.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:46:46 +01:00
Laurent Vivier
def7de4690 tcp_vu: Fix off-by-one in header count array adjustment
head_cnt represents the number of frames we're going to forward to the
guest in tcp_vu_sock_recv(), each of which could require multiple
buffers ("elements").  We initialise it with as many frames as we can
find space for in vu buffers, and we then need to adjust it down to
the number of frames we actually (partially) filled.

We adjust it down based on the number of individual buffers used by the
data from recvmsg().  At this point 'i' is *one greater than* that
number of buffers, so we need to discard all (unused) frames with a
buffer index >= i, instead of > i.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
[david: Contributed actual commit message]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:44:25 +01:00
Stefano Brivio
90f91fe726 tcp: Implement conservative zero-window probe on ACK timeout
This probably doesn't cover all the cases where we should send a
zero-window probe, but it's rather unobtrusive and obvious, so start
from here, also because I just observed this case (without the fix
from the previous patch, to take into account window information from
keep-alive segments).

If we hit the ACK timeout, and try re-sending data from the socket,
if the window is zero, we'll just fail again, go back to the timer,
and so on, until we hit the maximum number of re-transmissions and
reset the connection.

Don't do that: forcibly try to send something by implementing the
equivalent of a zero-window probe in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:43:55 +01:00
Stefano Brivio
472e2e930f tcp: Don't discard window information on keep-alive segments
It looks like a detail, but it's critical if we're dealing with
somebody, such as near-future self, using TCP_REPAIR to migrate TCP
connections in the guest or container.

The last packet sent from the 'source' process/guest/container
typically reports a small window, or zero, because the guest/container
hadn't been draining it for a while.

The next packet, appearing as the target sets TCP_REPAIR_OFF on the
migrated socket, is a keep-alive (also called "window probe" in CRIU
or TCP_REPAIR-related code), and it comes with an updated window
value, reflecting the pre-migration "regular" value.

If we ignore it, it might take a while/forever before we realise we
can actually restart sending.

Fixes: 238c69f9af ("tcp: Acknowledge keep-alive segments, ignore them for the rest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:34:15 +01:00
Enrique Llorente
31e8109a86 dhcp, dhcpv6: Add hostname and client fqdn ops
Both DHCPv4 and DHCPv6 have the capability to pass the hostname to
clients: DHCPv4 uses option 12 (hostname) while DHCPv6 uses option 39
(client FQDN). For some virt deployments, like KubeVirt, the
VirtualMachine name is expected to be the guest hostname.

This change adds the following arguments:
 - -H --hostname NAME to configure the hostname DHCPv4 option (12)
 - --fqdn NAME to configure the client FQDN option for both DHCPv4 (81)
   and DHCPv6 (39)

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-10 18:30:24 +01:00
Stefano Brivio
a3d142a6f6 conf: Don't map DNS traffic to host, if host gateway is a resolver
This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.

By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:

  0.0029:     NAT to host 127.0.0.1: 192.168.100.1

But if the host gateway is also a DNS resolver:

  0.0029: DNS:
  0.0029:     192.168.100.1

then we'll send DNS queries directed to it to the host instead:

  0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
  0.0372: Flow 0 (TGT): INI -> TGT
  0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): TGT -> TYPED
  0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
  0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
  0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53

which doesn't quite work, of course:

  0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
  0.0374: ICMP error on UDP socket 95: Connection refused

unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.

Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-09 08:17:06 +01:00
Stefano Brivio
864be475d9 passt-repair: Send one confirmation *per command*, not *per socket*
It looks like me, myself and I couldn't agree on the "simple" protocol
between passt and passt-repair. The man page and passt say it's one
confirmation per command, but the passt-repair implementation had one
confirmation per socket instead.

This caused all sort of mysterious issues with repair mode
pseudo-randomly enabled, and leading to hours of fun (mostly not
mine). Oops.

Switch to one confirmation per command (of course).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-09 08:16:41 +01:00
Enrique Llorente
fe8b6a7c42 dhcp: Don't re-use request message for reply
The logic composing the DHCP reply message reuses the request message
as its buffer; future long options like FQDN may exceed the request
message size, pushing the reply beyond its lower bound.

This change creates a new reply message with a fixed options size of 308
bytes and fills it in with the proper fields from the request, adding
the generated options on top. This way, the reply's lower bound does not
depend on the request.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:36:10 +01:00
Stefano Brivio
b7b70ba243 passt-repair: Dodge "structurally unreachable code" warning from Coverity
While main() conventionally returns int, and we need a return at the
end of the function to avoid compiler warnings, turning that return
into _exit() to avoid exit handlers triggers a Coverity warning. It's
unreachable code anyway, so switch that single occurrence back to a
plain return.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:35:46 +01:00
Stefano Brivio
0f009ea598 passt-repair: Fix calculation of payload length from cmsg_len
There's no inverse function for CMSG_LEN(), so we need to loop over
SCM_MAX_FD (253) possible input values. The previous calculation is
clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg
data.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:35:17 +01:00
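
Since CMSG_LEN() has no inverse, the count has to be recovered by trying
each possible value, along these lines (hypothetical helper):

  #include <stddef.h>
  #include <sys/socket.h>

  #define MAX_FDS	253	/* SCM_MAX_FD */

  /* Return how many file descriptors a cmsg_len corresponds to, or -1 */
  static int fds_in_cmsg(size_t cmsg_len)
  {
  	int n;

  	for (n = 1; n <= MAX_FDS; n++) {
  		if (CMSG_LEN(n * sizeof(int)) == cmsg_len)
  			return n;
  	}

  	return -1;
  }
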
Stefano Brivio
a0b7f56b3a passt-repair: Don't use perror(), accept ECONNRESET as termination
If we use glibc's perror(), we need to allow dup() and fcntl() in our
seccomp profiles, which are a bit too much for this simple helper. On
top of that, we would probably need a wrapper to avoid allocation for
translated messages.

While at it: ECONNRESET is just a close() from passt, treat it like
EOF.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:34:31 +01:00
Stefano Brivio
a5cca995de conf, passt.1: Un-deprecate --host-lo-to-ns-lo
It was established behaviour, and it's now the third report about it:
users ask how to achieve the same functionality, and we don't have a
better answer yet.

The idea behind declaring it deprecated to start with, I guess, was
that we would eventually replace it by more flexible and generic
configuration options, which is still planned. But there's nothing
preventing us to alias this in the future to a particular
configuration.

So, stop scaring users off, and un-deprecate this.

Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/
Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705
Link: https://github.com/moby/moby/issues/48838
Link: https://github.com/containers/podman/discussions/25243
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 11:14:30 +01:00
David Gibson
0da87b393b debug: Add tcpdump to mbuto.img
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
f66769c2de apparmor: Workaround for unconfined libvirtd when triggered by unprivileged user
If libvirtd is triggered by an unprivileged user, the virt-aa-helper
mechanism doesn't work, because per-VM profiles can't be instantiated,
and as a result libvirtd runs unconfined.

This means passt can't start, because the passt subprofile from
libvirt's profile is not loaded either.

Example:

  $ virsh start alpine
  error: Failed to start domain 'alpine'
  error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11

Add an annoying workaround for the moment being. Much better than
encouraging users to start guests as root, or to disable AppArmor
altogether.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
593be32774 passt-repair.1: Fix indication of TCP_REPAIR constants
...perhaps I should adopt the healthy habit of actually reading
headers instead of using my mental copy.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:43:00 +01:00
Stefano Brivio
9215f68a0c passt-repair: Build fixes for musl
When building against musl headers:

- sizeof() needs stddef.h, as it should be;

- we can't initialise a struct msghdr by simply listing fields in
  order, as they contain explicit padding fields. Use field names
  instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:40:54 +01:00
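
The second point, as a minimal example: designated initialisers name the
fields, so musl's explicit padding members in struct msghdr don't shift
the values around.

  #include <sys/socket.h>
  #include <sys/uio.h>

  static struct iovec iov;

  /* Portable across glibc and musl: name the fields instead of relying on
   * their position, since musl's struct msghdr has explicit padding members. */
  static struct msghdr mh = {
  	.msg_name	= NULL,
  	.msg_namelen	= 0,
  	.msg_iov	= &iov,
  	.msg_iovlen	= 1,
  	.msg_control	= NULL,
  	.msg_controllen	= 0,
  	.msg_flags	= 0,
  };
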
Paul Holzinger
a9d63f91a5 passt-repair: use _exit() over return
Returning from main() does the same as calling exit(), which is not
good as glibc might try to call futex(), which will be blocked by
seccomp. See the previous commit "treewide: use _exit() over exit()" for
a more detailed explanation.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:19 +01:00
Paul Holzinger
d0006fa784 treewide: use _exit() over exit()
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000

Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
  do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
	sleep 100 && \
  sleep 10 && \
  podman rm -fa -t0
done

And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
    0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
    0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78c50b n/a (n/a + 0x0)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78d5a2 n/a (n/a + 0x0)

Pasta got killed in exit(): it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.

Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.

Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:02 +01:00
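
In code, the change boils down to the sketch below: _exit() terminates
the process without running exit handlers or flushing stdio, so glibc
never reaches the locking path that ends up calling futex().

  #include <unistd.h>

  int main(void)
  {
  	/* ... set up seccomp filter, do the actual work ... */

  	_exit(0);	/* no atexit() handlers, no stdio flush, no futex() */
  }
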
David Gibson
745c163e60 tcp: Simplify handling of getsockname()
For migration we need to get the specific local address and port for
connected sockets with getsockname().  We currently open code marshalling
the results into the flow entry.

However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it.  Also report failures, which may make debugging
problems easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 09:02:54 +01:00
David Gibson
b4a7b5d4a6 migrate: Fix several errors with passt-repair
The passt-repair helper is now merged, but alas it contains several small
bugs:
 * close() is not in the seccomp profile, meaning it will immediately
   SIGSYS when you make a request of it
 * The generated header, seccomp_repair.h isn't listed in .gitignore or
   removed by "make clean"

Fixes: 8c24301462 ("Introduce passt-repair")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 08:52:27 +01:00
Stefano Brivio
dcf014be88 doc: Add mock of migration source and target
These test programs show the migration of a TCP connection using the
passt-repair helper.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
52e57f9c9a tcp: Get socket port and address using getsockname() when connecting from guest
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.

Similar for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to the equivalent interface as it was on the source.

Use getsockname() to fetch them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
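
A simplified sketch of the getsockname() usage (hypothetical helper, not
the actual tcp.c code): after connect(), ask the kernel which local
address and port it picked, so the same values can be bound on the
migration target.

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  /* Fetch the local (socket-side) address and port of a connected socket */
  static int local_endpoint(int s, struct sockaddr_storage *ss, in_port_t *port)
  {
  	socklen_t sl = sizeof(*ss);

  	if (getsockname(s, (struct sockaddr *)ss, &sl))
  		return -1;

  	if (ss->ss_family == AF_INET)
  		*port = ntohs(((struct sockaddr_in *)ss)->sin_port);
  	else
  		*port = ntohs(((struct sockaddr_in6 *)ss)->sin6_port);

  	return 0;
  }
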
Stefano Brivio
8c24301462 Introduce passt-repair
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e894d9ae82 vhost_user: Turn some vhost-user message reports to trace()
Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.

Change per-packet debug() messages in vu_kick_cb() and
vu_send_single() to trace().

[dgibson: switch different messages to trace()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e25a93032f util: Add read_remainder() and read_all_buf()
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.

I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.

[dgibson: make read_remainder() take const pointer to iovec]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
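
A sketch of the read_all_buf() shape, mirroring write_all_buf();
simplified, not the exact implementation: loop until the requested
length has been read, retrying on EINTR and treating EOF as an error.

  #include <errno.h>
  #include <stddef.h>
  #include <unistd.h>

  /* Keep calling read() until 'len' bytes have landed in 'buf' */
  static int read_all_buf_sketch(int fd, void *buf, size_t len)
  {
  	char *p = buf;
  	size_t left = len;

  	while (left) {
  		ssize_t n = read(fd, p, left);

  		if (n < 0) {
  			if (errno == EINTR)
  				continue;
  			return -1;
  		}
  		if (n == 0) {		/* unexpected EOF */
  			errno = ENODATA;
  			return -1;
  		}
  		p += n;
  		left -= n;
  	}

  	return 0;
  }
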
Stefano Brivio
71fa736277 tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version
I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
anyway.

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:43:35 +01:00
Stefano Brivio
b75ad159e8 vhost_user: On 32-bit ARM, mmap() is not available, mmap2() is used instead
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0
Fixes: 31117b27c6 ("vhost-user: introduce vhost-user API")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:28 +01:00
Stefano Brivio
722d347c19 tcp: Don't reset outbound connection on SYN retries
Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.

In that case, we shouldn't go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.

Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:13 +01:00
7ppKb5bW
bf2860819d pasta.te: fix demo.sh and remove one duplicate rule
On Fedora 41, without "allow pasta_t unconfined_t:dir read"
/usr/bin/pasta can't open /proc/[pid]/ns, which is required by
pasta_netns_quit_init().

This patch also removes one duplicate rule, "allow pasta_t nsfs_t:file
read;": "allow pasta_t nsfs_t:file { open read };" at line 123 is
enough.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:33:14 +01:00
Stefano Brivio
dcd6d8191a tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 07:32:53 +01:00
David Gibson
0349cf637f util: Rename and make global vu_remove_watch()
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set.  There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:51 +01:00
David Gibson
10c4a9e1b3 tcp: Always pass NULL event with EPOLL_CTL_DEL
In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored.  It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places.  Use
NULL instead for simplicity and consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:37 +01:00
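
In other words, removal boils down to the form below (illustrative
helper, not the actual tcp_epoll_ctl() code):

  #include <sys/epoll.h>

  /* Remove a descriptor from the epoll set: no dummy event needed, the
   * kernel ignores the event argument for EPOLL_CTL_DEL since 2.6.9. */
  static int epoll_del(int epollfd, int fd)
  {
  	return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
  }
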
Laurent Vivier
dd6a6854c7 vhost-user: Implement an empty VHOST_USER_SEND_RARP command
Passt cannot manage and doesn't need to manage the broadcast of a fake RARP,
but QEMU will report an error message if Passt doesn't implement it.

Implement an empty SEND_RARP command to silence the QEMU error message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-24 21:40:05 +01:00
Stefano Brivio
d477a1fb03 netlink: Skip loopback interface while looking for a template
There might be reasons to have routes on the loopback interface, for
example Any-IP/AnyIP routes as implemented by Linux kernel commit
ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6.").

If we use the loopback interface as a template, though, we'll pick
'lo' (typically) as interface name for our tap interface, but we'll
already have an interface called 'lo' in the target namespace, and as
we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap
interface.

Skip the loopback interface while looking for a template interface or,
more accurately, skip the interface with index 1.

Strictly speaking, we should fetch interface flags via RTM_GETLINK
instead, and check for IFF_LOOPBACK, but interleaving that request
while we're iterating over routes is unnecessarily complicated.

Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-24 21:39:52 +01:00
Laurent Vivier
4f2c8e7913 vhost_user: Drop packet with unsupported iovec array
If the iovec array cannot be managed, drop it rather than
passing the second entry to tap_add_packet().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:30:42 +01:00
Stefano Brivio
ec5c4d936d tcp: Set PSH flag for last incoming packets in a batch
So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.

However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.

This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:

  $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
  33170   0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
  33171   0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
  33172   0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
  33173   1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370

in this example (packet capture taken by passt), frame  is a
mouse event sent by the RDP client, and frame  is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame .

If we set PSH, instead:

  $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
  314   0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
  315   0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
  316   0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
  317   0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53

here, in frame , our mouse event is delivered without a delay and
receives a response in approximately 12 ms.

Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.

Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:28:44 +01:00
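
Conceptually, the change amounts to flagging the last TCP header of each
dequeued batch, roughly as in this illustrative sketch (not the actual
tcp_buf.c code):

  #include <netinet/tcp.h>
  #include <stddef.h>

  /* Mark the last frame of a batch with PSH so the receiver delivers the
   * data to the application right away instead of waiting for more. */
  static void batch_set_psh(struct tcphdr *th[], size_t n)
  {
  	if (n)
  		th[n - 1]->psh = 1;
  }
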
Stefano Brivio
db2c91ae86 tcp: Set ACK flag on *all* RST segments, even for client in SYN-SENT state
Somewhat curiously, RFC 9293, section 3.10.7.3, states:

   If the state is SYN-SENT, then
   [...]

      Second, check the RST bit:
      -  If the RST bit is set,
      [...]

         o  If the ACK was acceptable, then signal to the user "error:
            connection reset", drop the segment, enter CLOSED state,
            delete TCB, and return.  Otherwise (no ACK), drop the
            segment and return.

which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:

	/* No ACK in the segment */

	if (th->rst) {
		/* rfc793:
		 * "If the RST bit is set
		 *
		 *      Otherwise (no ACK) drop the segment and return."
		 */

		goto discard_and_undo;
	}

meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.

The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.

This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.

Note that non-SYN, non-RST segments were already covered by commit
4988e2b406 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").

Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl>
Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl>
Reported-by: Robert Floor <Robert.Floor@schiphol.nl>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:28:44 +01:00
Stefano Brivio
54bb972cfb tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
Following up on 725acd111b ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.

TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.

Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.

Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.

After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):

  https://news.ycombinator.com/item?id=34180239

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-21 14:28:37 +01:00
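
Per socket, this is the usual one-liner; a sketch:

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  /* Disable Nagle's algorithm so small writes aren't held back waiting to
   * be coalesced with later data. */
  static int tcp_set_nodelay(int s)
  {
  	int on = 1;

  	return setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
  }
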
Stefano Brivio
8757834d14 tcp: Buffer sizes are *not* inherited on accept()/accept4()
...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.

Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.

As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).

Fixes: 904b86ade7 ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-21 14:28:14 +01:00
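
The resulting pattern, sketched with an arbitrary example size: apply
the buffer sizes to the socket returned by accept4(), not to the
listener.

  #include <sys/socket.h>

  /* Buffer sizes set on the listening socket don't carry over to accepted
   * sockets, so apply them per connection ('sz' is an arbitrary example). */
  static void accepted_set_bufsize(int s, int sz)
  {
  	setsockopt(s, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
  	setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
  }
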
Laurent Vivier
c96a88d550 vhost_user: remove ASSERT() on iovec number
Replace the ASSERT() on the number of iovecs in the element and on
the first entry's length with a debug() message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix typo in failure message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
412ed4f09f vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE
Report to front-end that we support device state commands:
VHOST_USER_CHECK_DEVICE_STATE
VHOST_USER_SET_LOG_BASE

This feature is needed to transfer the backend state using the
frontend channel.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
31d70024be vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command
Set the file descriptor to use to transfer the
backend device state during migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
878e163454 vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command
After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.

The value returned indicates success or error;
0 is success, any non-zero value is an error.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
78c73e9395 vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD
This feature allows QEMU to be migrated. We also need to report
VHOST_F_LOG_ALL.

This protocol feature reports that we can log page updates and that we
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
3c1d91b816 vhost-user: add VHOST_USER_SET_LOG_BASE command
Sets logging shared memory space.

When the back-end has the VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
feature, the log memory fd is provided in the ancillary data of the
VHOST_USER_SET_LOG_BASE message, with the size and offset of the shared
memory area provided in the message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix coding style in a bunch of places]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
538312af19 vhost-user: Pass vu_dev to more virtio functions
vu_dev will be needed to log page updates.

Add the parameter to:

  vring_used_write()
  vu_queue_fill_by_index()
  vu_queue_fill()
  vring_used_idx_set()
  vu_queue_flush()

The new parameter is unused for now.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
b04195c60f vhost-user: add VHOST_USER_SET_LOG_FD command
VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data; it may be used to inform the front-end that the log has
been modified.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to vu_set_log_fd_exec()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
6016e04a3a vhost-user: update protocol features and commands list
The vhost-user protocol specification has been updated with the
feature flags and commands we will need to implement migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to union vhost_user_payload]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Stefano Brivio
a8f4fc481c tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest
There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.

If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:

  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1

also known as:

  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s

which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.

That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.

In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.

If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.

This is definitely not elegant, but it's a minimal fix.

We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).

I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).

Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
b8f573cdc2 tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
Before SO_PEEK_OFF support was introduced by commit e63d281871
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).

If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().

With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.

As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.

This is the original report, mentioning the originally proposed fix:

--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:

epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--

we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).

Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.

And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.

In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.

While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().

And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.

Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
22cf08ba00 tcp: Don't subscribe to EPOLLOUT events on STALLED
I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.

Fixes: bb70811183 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
707f77b0a9 tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up
In the next patches, I'm extending the usage of STALLED to a few more
cases.

Doing so revealed this issue: if we set STALLED and, consequently,
EPOLLOUT (which is wrong, fixed later) right after we set a connection
to ESTABLISHED (which also happened by mistake while I was preparing
another change), with the guest sending data together with the final
ACK in the handshake, say:

  41.3661: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3662: Flow 2 (NEW): FREE -> NEW
  41.3663: Flow 2 (INI): NEW -> INI
  41.3663: Flow 2 (INI): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => ?
  41.3665: Flow 2 (TGT): INI -> TGT
  41.3666: Flow 2 (TGT): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3667: Flow 2 (TCP connection): TGT -> TYPED
  41.3667: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3669: Flow 2 (TCP connection): TAP_SYN_RCVD: CLOSED -> SYN_SENT
  41.3670: Flow 2 (TCP connection): Side 0 hash table insert: bucket: 339814
  41.3672: Flow 2 (TCP connection): TYPED -> ACTIVE
  41.3673: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3674: Flow 2 (TCP connection): TAP_SYN_ACK_SENT: SYN_SENT -> SYN_RCVD
  41.3675: Flow 2 (TCP connection): ACK_FROM_TAP_DUE
  41.3675: Flow 2 (TCP connection): timer expires in 10.000s
  41.3675: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3676: Flow 2 (TCP connection): ACK_FROM_TAP_DUE dropped
  41.3676: Flow 2 (TCP connection): ESTABLISHED: SYN_RCVD -> ESTABLISHED
  41.3678: Flow 2 (TCP connection): STALLED
  41.3678: vhost-user: got kick_data: 0000000000000002 idx: 1
  41.3679: Flow 2 (TCP connection): ACK_TO_TAP_DUE
  41.3680: Flow 2 (TCP connection): timer expires in 0.010s
  41.3680: Flow 2 (TCP connection): STALLED dropped

we'll immediately get an EPOLLOUT event, call tcp_update_seqack_wnd(),
but ignore window and ACK sequence update. At this point, we think we
acknowledged all the data to the guest (but we didn't) and we'll
happily proceed to clear the ACK_TO_TAP_DUE flag:

  41.3780: Flow 2 (TCP connection): ACK_TO_TAP_DUE dropped
  41.3780: Flow 2 (TCP connection): timer expires in 7200.000s
  41.5754: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.9956: vhost-user: got kick_data: 0000000000000001 idx: 1
  42.8275: vhost-user: got kick_data: 0000000000000001 idx: 1

while the guest starts retransmitting that data desperately, without
ever getting an ACK segment from us:

   1433  38.746353 2a01:4f8:222:904::2 → 2001:db8:9a55::1 94 TCP 54312 → 10003 [SYN] Seq=0 Win=65460 Len=0 MSS=65460 SACK_PERM TSval=1089126192 TSecr=0 WS=128
   1434  38.747357 2001:db8:9a55::1 → 2a01:4f8:222:904::2 82 TCP 10003 → 54312 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256
   1435  38.747500 2a01:4f8:222:904::2 → 2001:db8:9a55::1 74 TCP 54312 → 10003 [ACK] Seq=1 Ack=1 Win=65536 Len=0
   1436  38.747769 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1437  38.747798 2a01:4f8:222:904::2 → 2001:db8:9a55::1 32841 TCP 54312 → 10003 [ACK] Seq=8193 Ack=1 Win=65536 Len=32767
   1438  38.748049 2001:db8:9a55::1 → 2a01:4f8:222:904::2 74 TCP [TCP Window Update] 10003 → 54312 [ACK] Seq=1 Ack=1 Win=65280 Len=0
   1439  38.954044 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1440  39.370096 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1441  40.202135 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192

because seq_ack_to_tap is already set to the sequence after frame
number 1437 in the example.

For some reason, I could only reproduce this with vhost-user, IPv6,
and passt running under valgrind while taking captures. Even under
these conditions, it happens quite rarely.

Forcibly send an ACK segment if we update the ACK sequence (or the
advertised window).

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Laurent Vivier
1b95bd6fa1 vhost_user: fix multibuffer from linux
Under some conditions, Linux can provide several buffers
in the same element (multiple entries in the iovec array).

I didn't identify what changed between the guest kernel that
provides one buffer and the one that provides several (it doesn't
seem to be a kernel change or a configuration change).

Fix the following assert:

ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg

What I can see is that the buffer can be split into two iovecs:
  - vnet header
  - packet data

This change handles this special case, but the real fix will be to
allow tap_add_packet() to take an iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-15 23:05:31 +01:00
Stefano Brivio
f04b483d15 test/pasta_podman: Run Podman tests on a single CPU thread
Increasingly often, I'm getting occasional failures of the same type
as https://github.com/containers/podman/issues/24147. I guess it
mostly depends on the system load.

It will be a while until I actually run tests on a kernel
including my fix for it, kernel commit a502ea6fa94b ("udp: Deal with
race between UDP socket address change and rehash"), so add a horrible
workaround using taskset(1) for the moment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-15 22:16:32 +01:00
Laurent Vivier
2c174f1fe8 checksum: fix checksum with odd base address
csum_unfolded() must call csum_avx2() with a 32-byte aligned base address.

To do that when the buffer is not correctly aligned, it splits the
buffer into two parts: the second part is 32-byte aligned and can be
handled by csum_avx2(), while the first part, the unaligned remainder,
is summed with sum_16b().

A problem appears if the length of the first part is odd, because
the checksum is computed over 16-bit words.

If the length is odd, when the second part is computed, all words are
shifted by one byte, meaning the weights of the upper and lower bytes
are swapped.

For instance a 13 bytes buffer:

bytes:

aa AA bb BB cc CC dd DD ee EE ff FF gg

16bit words:

AAaa BBbb CCcc DDdd EEee FFff 00gg

If we don't split the sequence, the checksum is:

AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg

If we split the sequence with an even length for the first part:

(AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)

But if the first part has an odd length:

(AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)

To avoid the problem, don't call csum_avx2() if the first part can't
have an even length, and compute the checksum of the whole buffer
using sum_16b() instead.

This is slower, but it can only happen if the buffer base address is
odd, which in turn can only happen if the binary is built using '-Os',
meaning we have chosen to prioritize size over speed.
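
The effect is easy to reproduce with a few lines of C. This is just an
illustration of the arithmetic above, not the actual checksum.c code:
sum16() and fold() below are simplified stand-ins for sum_16b() and
csum_fold().

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Sum 16-bit words in native order (little-endian on x86), adding a
   * possible trailing byte with low weight, like sum_16b() does
   */
  static uint32_t sum16(const uint8_t *buf, size_t len, uint32_t sum)
  {
          size_t i;

          for (i = 0; i + 1 < len; i += 2) {
                  uint16_t word;

                  memcpy(&word, buf + i, 2);
                  sum += word;
          }

          if (len & 1)
                  sum += buf[len - 1];

          return sum;
  }

  static uint16_t fold(uint32_t sum)
  {
          while (sum >> 16)
                  sum = (sum & 0xffff) + (sum >> 16);

          return sum;
  }

  int main(void)
  {
          const uint8_t buf[13] = { 1, 2,  3,  4,  5,  6, 7, 8,
                                    9, 10, 11, 12, 13 };

          /* Splitting at an even offset matches the unsplit sum; an odd
           * split doesn't, as byte weights swap in the second part
           */
          printf("whole:      %04x\n", fold(sum16(buf, 13, 0)));
          printf("even split: %04x\n",
                 fold(sum16(buf + 4, 9, sum16(buf, 4, 0))));
          printf("odd split:  %04x\n",
                 fold(sum16(buf + 3, 10, sum16(buf, 3, 0))));

          return 0;
  }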

Reported-by: Mike Jones <mike@mjones.io>
Link: https://bugs.passt.top/show_bug.cgi?id=108
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Added comment explaining why we check for pad & 1]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-10 22:20:23 +01:00
Stefano Brivio
725acd111b tcp_splice: Set (again) TCP_NODELAY on both sides
In commit 7ecf693297 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).

It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands each getting their own packet, 5 to 49 bytes of TCP
payload each), we usually read them in two batches, and send them
in matching batches, for example:

  9165.2467:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from read-side call
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
  9165.2469:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from read-side call
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
  9165.2944:          pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)

and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.

In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.

To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.

We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).
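
As a minimal sketch of the change (not the actual tcp_splice.c code),
disabling Nagle's algorithm amounts to a TCP_NODELAY setsockopt() on
each side of the spliced pair:

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  /* Disable Nagle's algorithm on both sockets of a spliced connection.
   * Failures aren't fatal: data still flows, just with extra delay
   */
  static void splice_set_nodelay(int sock_in, int sock_out)
  {
          int on = 1;

          setsockopt(sock_in,  IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
          setsockopt(sock_out, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
  }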

This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.

Fixes: 7ecf693297 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-06 10:10:29 +01:00
Stefano Brivio
3876fc780d seccomp: Unconditionally allow accept(2) even if accept4(2) is present
On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.

Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().

However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().

Add accept() to the list of allowed system calls regardless of the
presence of accept4().
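
For reference, a sketch of the call pattern in question (not the
actual tap.c code): an accept4() call without flags, which musl may
service through the accept(2) system call, so the seccomp profile has
to allow both.

  #include <sys/socket.h>

  /* Accept a new connection on the UNIX domain listening socket; with
   * musl, this may end up issuing accept(2) rather than accept4(2)
   */
  static int tap_accept(int listen_fd)
  {
          return accept4(listen_fd, NULL, NULL, 0);
  }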

Reported-by: NN708 <nn708@outlook.com>
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-05 23:49:11 +01:00
Laurent Vivier
898e853635 virtio: Use const pointer for vu_dev
We don't modify the structure in some virtio functions.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-05 23:48:59 +01:00
Stefano Brivio
324233bd9b udp_flow: Don't block multicast and broadcast messages
It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):

  $ mkdir /tmp/minidlna
  $ cat conf
  media_dir=/tmp/minidlna
  db_dir=/tmp/minidlna
  $ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  1.0327: Flow 0 (NEW): FREE -> NEW
  1.0327: Flow 0 (INI): NEW -> INI
  1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
  1.0327: Flow 0 (FREE): INI -> FREE
  1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900

This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.

And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.

Let multicast and broadcast packets through, refusing only to
establish flows with an unspecified endpoint, as those would actually
cause havoc in the flow table.
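
As an illustration of the new rule (not the actual udp_flow.c code),
the IPv4 check now only refuses the unspecified address, instead of
requiring a unicast one, which also rejected multicast destinations
such as 239.255.255.250:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdbool.h>

  /* Refuse flows with an unspecified endpoint only: multicast and
   * broadcast endpoints are now allowed through
   */
  static bool udp_endpoint_ok4(struct in_addr a)
  {
          return a.s_addr != htonl(INADDR_ANY);
  }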

IP-wise, SSDP notifications look like this (after this patch), inside
and outside:

  $ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  $ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
      2   0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

  # tshark -i ens3 -a packets:1 multicast 2>/dev/null
      1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-05 23:48:59 +01:00
Stefano Brivio
2385b69a66 Makefile: Report error and stop if we can't set TARGET
I don't think it's necessarily productive to check all the possible
error conditions in the Makefile, but this one is annoying: issue
'make' without a C compiler, then install one, and build again.

Then run passt and it will mysteriously terminate on epoll_wait(),
because seccomp.h is good enough to build against, but the resulting
seccomp filter doesn't allow any system call. Not really fun to debug.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-05 23:48:37 +01:00
Stefano Brivio
e5ba8adef7 README: Mark vhost-user as supported
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-12 10:50:48 +01:00
Stefano Brivio
09478d55fe treewide: Dodge dynamic memory allocation in strerror() from glibc > 2.40
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.

This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).

Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
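
A sketch of such a wrapper, assuming a recent glibc where
strerrordesc_np() is available (the actual helper in the tree,
strerror_(), might differ in details):

  #define _GNU_SOURCE
  #include <string.h>

  /* Static, untranslated error description: no dynamic allocation */
  static const char *strerror_(int errnum)
  {
  #ifdef __GLIBC__
          const char *s = strerrordesc_np(errnum);

          if (s)
                  return s;
  #endif
          /* Fallback, and the normal path on musl */
          return strerror(errnum);
  }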

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-12-11 12:21:23 +01:00
Jon Maloy
e24f026222 pasta: make it possible to disable socket splicing
During testing, it is sometimes useful to force traffic which would
normally be forwarded by socket splicing through the tap interface.

In this commit, we add a command-line switch enabling such functionality
for inbound local traffic.

For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-11 01:47:37 +01:00
Laurent Vivier
947f5cdb93 tap: Call vu_init() with --fd
We need to initialize vhost-user structures with --fd too.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-10 12:26:56 +01:00
Laurent Vivier
2139ad33fc tap: Use a common function to start a new connection
Merge code from tap_backend_init(), tap_sock_tun_init() and
tap_listen_handler() to set epoll_ref entry and to add it
to epollfd.

No functionality change

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-10 12:26:34 +01:00
Laurent Vivier
8996d183c5 udp_vu: update segment size
In udp_vu_sock_recv(), collect a segment with a size defined as
IP_MAX_MTU + ETH_HLEN + sizeof(struct virtio_net_hdr_mrg_rxbuf).

The original version double counted the IP header: IP_MAX_MTU includes
the IP header, and so did hdrlen.
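
Sketching the calculation (IP_MAX_MTU already accounts for the IP
header, so it must not be added again; IP_MAX_MTU below is a stand-in
for the definition in ip.h):

  #include <linux/if_ether.h>
  #include <linux/virtio_net.h>
  #include <stddef.h>

  #define IP_MAX_MTU 65535

  /* Largest segment we may receive: L3 datagram plus Ethernet header
   * plus the merged-rxbuf virtio-net header
   */
  static size_t udp_vu_segment_size(void)
  {
          return IP_MAX_MTU + ETH_HLEN +
                 sizeof(struct virtio_net_hdr_mrg_rxbuf);
  }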

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
David Gibson
190829705e flow: Remove over-zealous sanity checks in flow_sidx_hash()
In flow_sidx_hash() we verify that the flow we're hashing doesn't have an
unspecified endpoint address, or zero for either port.  The hash table only
works if we're looking for exact matches of address and port, and this is
attempting to catch any cases where we might have left address or port
unpopulated or filled with a wildcard.

This doesn't really work though, because there are cases where unspecified
addresses or zero ports are correct:
 * We already use unspecified addresses for our address in cases where we
   don't know the specific local address for that side, and exclude the
   obvious extra check on side->oaddr for that reason.
 * Zero port numbers aren't strictly forbidden over the wire.  We forbid
   them for TCP & UDP because they can't safely be handled on the socket
   side.  However for ICMP a zero id, which goes in the port field is
   valid.
 * Possible future flow types (for example, for multicast protocols) might
   legitimately have an unspecified address.

Although it makes them easier to miss, these sorts of sanity checks really
have to be done at the protocol / flow type layer, and we already do so.
Remove the checks in flow_sidx_hash() other than checking that the pif
is specified.

Reported-by: Stefan <steffhip@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=105
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
David Gibson
1db4f773e8 udp: Improve detail of UDP endpoint sanity checking
In udp_flow_new() we reject a flow if the endpoint isn't unicast, or it has
a zero endpoint port.  Those conditions aren't strictly illegal, but we
can't safely handle them at present:
 * Multicast UDP endpoints are certainly possible, but our current flow
   tracking only makes sense for simple unicast flows - we'll need
   different handling if we want to handle multicast flows in future
 * It's not entirely clear if port 0 is RFC-ishly correct, but for socket
   interfaces port 0 sometimes has a special meaning such as "pick the port
   for me, kernel".  That makes flows on port 0 unsafe to forward in the
   usual way.

For the same reason we also can't safely handle port 0 as our port.  In
principle that's also true for our address, however in the case of flows
initiated from a socket, we may not know our address since the socket
could be bound to 0.0.0.0 or ::, so we can only verify that our address
is unicast for flows initiated from the tap side.
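
To illustrate why a zero port can't be forwarded in the usual way on
the socket side (standard socket API behaviour, not passt code):
binding to port 0 asks the kernel to pick a port, so the resulting
socket wouldn't actually use port 0 at all.

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <sys/socket.h>

  int main(void)
  {
          struct sockaddr_in sa = { .sin_family = AF_INET,
                                    .sin_port   = htons(0) };
          socklen_t sl = sizeof(sa);
          int s = socket(AF_INET, SOCK_DGRAM, 0);

          bind(s, (struct sockaddr *)&sa, sizeof(sa));
          getsockname(s, (struct sockaddr *)&sa, &sl);

          /* Prints an ephemeral port picked by the kernel, not 0 */
          printf("bound to port %u\n", ntohs(sa.sin_port));

          return 0;
  }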

Refine the current check in udp_flow_new() to slightly more detailed checks
in udp_flow_from_sock() and udp_flow_from_tap() to make what is and isn't
handled clearer.  This makes this checking more similar to what we do for
TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
Stefano Brivio
966fdc8749 perf/passt_vu_tcp: Make it shine
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 15:06:44 +01:00
Laurent Vivier
020c8b7127 tcp_vu: Compute IPv4 header checksum if dlen changes
In tcp_vu_data_from_sock() we compute the IPv4 header checksum only
for the first and the last packets, and re-use the first packet checksum
for all the other packets, as the content of the header doesn't change.

It's more accurate to check the dlen value to know if the checksum
should change, as dlen is the only information that can change in the
loop.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
Laurent Vivier
d9c0f8eefb Makefile: Use make internal string functions
TARGET_ARCH is computed from '$(CC) -dumpmachine' using external
bash commands like echo, cut, tr and sed. This can be done using
make internal string functions.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
b6e79efa0b tcp_vu: Remove unnecessary tcp_vu_update_check() function
Because the vhost-user <-> virtio-net path ignores checksums, we usually
don't calculate them when sending packets to the guest.  So, we always
pass no_tcp_csum=true to tcp_fill_headers().  We do want accurate
checksums when capturing packets though, so the captures don't show bogus
values.

Currently we handle this by updating the checksum field immediately before
writing the packet to the capture file, using tcp_vu_update_check().  This
is unnecessary, though: in each case tcp_fill_headers() is called not very
long before, so we can alter its no_tcp_csum parameter based on whether
we're generating captures or not.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
a6348cad51 tcp: Merge tcp_fill_headers[46]() with each other
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers.  Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed.  This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
2abf5ab7f3 tcp: Merge tcp_update_check_tcp[46]()
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases.  However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.

So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
08ea3cc581 tcp: Pass TCP header and payload separately to tcp_fill_headers[46]()
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.

More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous.  Furthermore, the
payload itself might not be contiguous.

So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
2ee07697c4 tcp: Pass TCP header and payload separately to tcp_update_check_tcp[46]()
Currently these expect both the TCP header and payload in a single IOV,
and go to some trouble to locate the checksum field within it.  In the
current caller we already know where the TCP header is, so we might as
well just pass it in.  This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
67151090bc iov, checksum: Replace csum_iov() with csum_iov_tail()
We usually want to checksum only the tail part of a frame, excluding at
least some headers.  csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector.  We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.

We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters.  This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
f931103171 iov: iov tail helpers
In the vhost-user code we have a number of places where we need to locate
a particular header within the guest-supplied IO vector.  We need to work
out which buffer the header is in, and verify that it's contiguous and
aligned as we need.  At the moment this is open-coded, but introduce a
helper to make this more straightforward.

We add a new datatype 'struct iov_tail' representing an IO vector from
which we've logically consumed some number of headers.  The IOV_PULL_HEADER
macro consumes a new header from the vector, returning a pointer and
updating the iov_tail.
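
A rough sketch of the data type, inferred from its users later in this
series (see the checksum.c changes below); the actual definition in
iov.h may differ in details:

  #include <stddef.h>
  #include <sys/uio.h>

  /* An IO vector from which some leading bytes (headers already
   * consumed) are logically skipped
   */
  struct iov_tail {
          const struct iovec *iov;        /* underlying vector */
          size_t cnt;                     /* number of entries */
          size_t off;                     /* offset of the tail */
  };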

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
Stefano Brivio
804a7ce94a tcp_vu: Change 'dlen' to ssize_t in tcp_vu_data_from_sock()
...to quickly suppress a false positive from Coverity, which assumes
that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen
being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that
iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
2024-11-27 16:49:21 +01:00
Laurent Vivier
00cc2303fd Fix build on 32bit target
Fix the following errors when built with CFLAGS="-m32 -U__AVX2__":

packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
   57 |                 trace("packet offset plus length %lu from size %lu, "
   58 |                       "%s:%i", start - p->buf + len + offset,
      |                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                     |
      |                                                     size_t {aka unsigned int}

packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
   57 |                 trace("packet offset plus length %lu from size %lu, "
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   58 |                       "%s:%i", start - p->buf + len + offset,
   59 |                       p->buf_size, func, line);
      |                       ~~~~~~~~~~~
      |                        |
      |                        size_t {aka unsigned int}

vhost_user.c:139:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  139 |                         return (void *)(qemu_addr - r->qva + r->mmap_addr +
      |                                ^

vhost_user.c:439:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  439 |                         munmap((void *)r->mmap_addr, r->size + r->mmap_offset);
      |                                ^

vhost_user.c:900:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  900 |                         munmap((void *)r->mmap_addr, r->size + r->mmap_offset);
      |                                ^

virtio.c:111:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  111 |                         return (void *)(guest_addr - r->gpa + r->mmap_addr +
      |                                ^

vu_common.c:37:27: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
   37 |                 char *m = (char *)dev_region->mmap_addr;
      |                           ^
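
The usual portable fixes for warnings of this kind (illustrative, not
necessarily the exact change applied here) are using %zu for size_t
values and going through uintptr_t when converting a 64-bit integer
field to a pointer:

  #include <stdint.h>
  #include <stdio.h>

  void example(size_t len, uint64_t mmap_addr)
  {
          char *m = (char *)(uintptr_t)mmap_addr;

          printf("length %zu at %p\n", len, (void *)m);
  }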

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
Laurent Vivier
6fae899cbb virtio: check if avail ring is configured
If the connection to the vhost-user front end is closed during transfers,
virtio rings are deconfigured and not available anymore, but we may still
try to access them to process queued data. This can trigger a SIGSEGV as
we try to access unavailable memory.
To fix that, check that vq->vring.avail is sane before accessing the vring.
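
The guard boils down to checking the available ring pointer before
walking the ring; a minimal self-contained sketch, with stand-in types
(the actual virtqueue definitions live in virtio.h):

  #include <stdbool.h>
  #include <stddef.h>

  struct ring  { void *avail; };
  struct queue { struct ring vring; };

  /* The available ring is only set while the front end has the rings
   * configured: skip processing otherwise
   */
  static bool vq_usable(const struct queue *vq)
  {
          return vq->vring.avail != NULL;
  }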

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
David Gibson
7e131e920c tcp: Move tcp_l2_buf_fill_headers() to tcp_buf.c
This function only has callers in tcp_buf.c.  More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.

Therefore, move it to tcp_buf.c.

Slightly reformat the prototypes while we're at it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
Stefano Brivio
676bf5488e test: Add tests for passt in vhost-user mode
Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.

For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.

For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-27 16:49:21 +01:00
Laurent Vivier
28997fcb29 vhost-user: add vhost-user
add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
 before including <linux/if_ether.h> as C libraries such as musl
 __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
 a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:47:32 +01:00
Laurent Vivier
b2e62f7e85 passt: rename tap_sock_init() to tap_backend_init()
Extract pool storage initialization loop to tap_sock_update_pool(),
extract QEMU hints to tap_backend_show_hints().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:27 +01:00
Laurent Vivier
b7c292b758 tcp: Export headers functions
Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().

They'll be needed by vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:24 +01:00
Laurent Vivier
5a8b33c667 udp: Prepare udp.c to be shared with vhost-user
Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
udp_sock_errs().

Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
udp_reply_sock_handler to udp_buf_reply_sock_handler().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:21 +01:00
Laurent Vivier
31117b27c6 vhost-user: introduce vhost-user API
Add vhost_user.c and vhost_user.h that define the functions needed
to implement vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:19 +01:00
Laurent Vivier
7d1cd4dbf5 vhost-user: introduce virtio API
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:11:36 +01:00
Laurent Vivier
dd143e3890 packet: replace struct desc by struct iovec
To be able to manage buffers inside shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32-bit offset to address them.

We need a 64-bit address, so replace struct desc by struct iovec
and update range checking.
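
For reference, struct iovec is the standard type from <sys/uio.h>: a
full pointer plus a length, enough to describe a buffer anywhere in the
guest-shared mappings, as opposed to a 32-bit offset into a fixed area.
A trivial usage sketch:

  #include <stdio.h>
  #include <sys/uio.h>

  int main(void)
  {
          char buf[64];
          struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

          printf("%p, %zu bytes\n", iov.iov_base, iov.iov_len);

          return 0;
  }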

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:11:18 +01:00
113 changed files with 11277 additions and 1568 deletions

.gitignore

@ -3,8 +3,10 @@
/passt.avx2
/pasta
/pasta.avx2
/passt-repair
/qrap
/pasta.1
/seccomp.h
/seccomp_repair.h
/c*.json
README.plain.md

Makefile

@ -16,9 +16,12 @@ VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ versio
DUAL_STACK_SOCKETS := 1
TARGET ?= $(shell $(CC) -dumpmachine)
$(if $(TARGET),,$(error Failed to get target architecture))
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(shell echo $(TARGET) | cut -f1 -d- | tr [A-Z] [a-z])
TARGET_ARCH := $(shell echo $(TARGET_ARCH) | sed 's/powerpc/ppc/')
TARGET_ARCH := $(firstword $(subst -, ,$(TARGET)))
TARGET_ARCH := $(patsubst [:upper:],[:lower:],$(TARGET_ARCH))
TARGET_ARCH := $(patsubst arm%,arm,$(TARGET_ARCH))
TARGET_ARCH := $(subst powerpc,ppc,$(TARGET_ARCH))
# On some systems enabling optimization also enables source fortification,
# automagically. Do not override it.
@ -27,7 +30,7 @@ ifeq ($(shell $(CC) -O2 -dM -E - < /dev/null 2>&1 | grep ' _FORTIFY_SOURCE ' > /
FORTIFY_FLAG := -D_FORTIFY_SOURCE=2
endif
FLAGS := -Wall -Wextra -Wno-format-zero-length
FLAGS := -Wall -Wextra -Wno-format-zero-length -Wformat-security
FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
FLAGS += $(FORTIFY_FLAG) -O2 -pie -fPIE
FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
@ -36,18 +39,21 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
udp_vu.c util.c vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
PASST_REPAIR_SRCS = passt-repair.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
udp.h udp_flow.h util.h
lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
udp_vu.h util.h vhost_user.h virtio.h vu_common.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
@ -68,9 +74,9 @@ mandir ?= $(datarootdir)/man
man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64)
BIN := passt passt.avx2 pasta pasta.avx2 qrap
BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
else
BIN := passt pasta qrap
BIN := passt pasta qrap passt-repair
endif
all: $(BIN) $(MANPAGES) docs
@ -79,7 +85,10 @@ static: FLAGS += -static -DGLIBC_NO_STATIC_NSS
static: clean all
seccomp.h: seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp.h $(PASST_SRCS) $(PASST_HEADERS)
seccomp_repair.h: seccomp.sh $(PASST_REPAIR_SRCS)
@ ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp_repair.h $(PASST_REPAIR_SRCS)
passt: $(PASST_SRCS) $(HEADERS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_SRCS) -o passt $(LDFLAGS)
@ -97,15 +106,19 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
passt-repair: $(PASST_REPAIR_SRCS) seccomp_repair.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex
rt_sigreturn getpid gettid kill clock_gettime \
mmap|mmap2 munmap open unlink gettimeofday futex \
statx readlink
valgrind: FLAGS += -g -DVALGRIND
valgrind: all
.PHONY: clean
clean:
$(RM) $(BIN) *~ *.o seccomp.h pasta.1 \
$(RM) $(BIN) *~ *.o seccomp.h seccomp_repair.h pasta.1 \
passt.tar passt.tar.gz *.deb *.rpm \
passt.pid README.plain.md

README.md

@ -321,7 +321,7 @@ speeding up local connections, and usually requiring NAT. _pasta_:
protocol
* ✅ 4 to 50 times IPv4 TCP throughput of existing, conceptually similar
solutions depending on MTU (UDP and IPv6 hard to compare)
* 🛠 [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
* [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
maximum one copy on every data path and lower request-response latency
* ⌚ [multithreading](https://bugs.passt.top/show_bug.cgi?id=13)
* ⌚ [raw IP socket support](https://bugs.passt.top/show_bug.cgi?id=14) if

checksum.c

@ -85,7 +85,7 @@
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing")))
uint32_t sum_16b(const void *buf, size_t len)
static uint32_t sum_16b(const void *buf, size_t len)
{
const uint16_t *p = buf;
uint32_t sum = 0;
@ -107,7 +107,7 @@ uint32_t sum_16b(const void *buf, size_t len)
*
* Return: 16-bit folded sum
*/
uint16_t csum_fold(uint32_t sum)
static uint16_t csum_fold(uint32_t sum)
{
while (sum >> 16)
sum = (sum & 0xffff) + (sum >> 16);
@ -161,29 +161,42 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
return psum;
}
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
static uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_udp4() - Calculate and set checksum for a UDP over IPv4 packet
* @udp4hr: UDP header, initialised apart from checksum
* @saddr: IPv4 source address
* @daddr: IPv4 destination address
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
* @data: UDP payload (as IO vector tail)
*/
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const struct iovec *iov, int iov_cnt, size_t offset)
struct iov_tail *data)
{
/* UDP checksums are optional, so don't bother */
udp4hr->check = 0;
if (UDP4_REAL_CHECKSUMS) {
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
uint32_t psum = proto_ipv4_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
psum = csum_unfolded(udp4hr, sizeof(struct udphdr), psum);
udp4hr->check = csum_iov(iov, iov_cnt, offset, psum);
udp4hr->check = csum_iov_tail(data, psum);
}
}
@ -231,22 +244,20 @@ uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
* @udp6hr: UDP header, initialised apart from checksum
* @saddr: Source address
* @daddr: Destination address
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Length of the array
* @offset: UDP payload offset in the iovec array
* @data: UDP payload (as IO vector tail)
*/
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const struct iovec *iov, int iov_cnt, size_t offset)
struct iov_tail *data)
{
uint16_t l4len = iov_size(iov, iov_cnt) - offset +
sizeof(struct udphdr);
uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
uint32_t psum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
udp6hr->check = 0;
psum = csum_unfolded(udp6hr, sizeof(struct udphdr), psum);
udp6hr->check = csum_iov(iov, iov_cnt, offset, psum);
udp6hr->check = csum_iov_tail(data, psum);
}
/**
@ -456,7 +467,8 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
intptr_t align = ROUND_UP((intptr_t)buf, sizeof(__m256i));
unsigned int pad = align - (intptr_t)buf;
if (len < pad)
/* Don't mix sum_16b() and csum_avx2() with odd padding lengths */
if (pad & 1 || len < pad)
pad = len;
if (pad)
@ -486,46 +498,23 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
#endif /* !__AVX2__ */
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_iov() - Calculates the unfolded checksum over an array of IO vectors
*
* @iov Pointer to the array of IO vectors
* @n Length of the array
* @offset: Offset of the data to checksum within the full data length
* csum_iov_tail() - Calculate unfolded checksum for the tail of an IO vector
* @tail: IO vector tail to checksum
* @init Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init)
uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init)
{
unsigned int i;
size_t first;
i = iov_skip_bytes(iov, n, offset, &first);
if (i >= n)
return (uint16_t)~csum_fold(init);
init = csum_unfolded((char *)iov[i].iov_base + first,
iov[i].iov_len - first, init);
i++;
for (; i < n; i++)
init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
if (iov_tail_prune(tail)) {
size_t i;
init = csum_unfolded((char *)tail->iov[0].iov_base + tail->off,
tail->iov[0].iov_len - tail->off, init);
for (i = 1; i < tail->cnt; i++) {
const struct iovec *iov = &tail->iov[i];
init = csum_unfolded(iov->iov_base, iov->iov_len, init);
}
}
return (uint16_t)~csum_fold(init);
}

checksum.h

@ -9,9 +9,8 @@
struct udphdr;
struct icmphdr;
struct icmp6hdr;
struct iov_tail;
uint32_t sum_16b(const void *buf, size_t len);
uint16_t csum_fold(uint32_t sum);
uint16_t csum_unaligned(const void *buf, size_t len, uint32_t init);
uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
@ -19,20 +18,18 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const struct iovec *iov, int iov_cnt, size_t offset);
struct iov_tail *data);
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen);
uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
const struct in6_addr *saddr,
const struct in6_addr *daddr);
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const struct iovec *iov, int iov_cnt, size_t offset);
struct iov_tail *data);
void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
uint16_t csum(const void *buf, size_t len, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, size_t offset,
uint32_t init);
uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init);
#endif /* CHECKSUM_H */

conf.c

@ -16,6 +16,7 @@
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <libgen.h>
#include <string.h>
#include <sched.h>
#include <sys/types.h>
@ -45,6 +46,7 @@
#include "lineread.h"
#include "isolation.h"
#include "log.h"
#include "vhost_user.h"
#define NETNS_RUN_DIR "/run/netns"
@ -122,6 +124,75 @@ static int parse_port_range(const char *s, char **endptr,
return 0;
}
/**
* conf_ports_range_except() - Set up forwarding for a range of ports minus a
* bitmap of exclusions
* @c: Execution context
* @optname: Short option name, t, T, u, or U
* @optarg: Option argument (port specification)
* @fwd: Pointer to @fwd_ports to be updated
* @addr: Listening address
* @ifname: Listening interface
* @first: First port to forward
* @last: Last port to forward
* @exclude: Bitmap of ports to exclude
* @to: Port to translate @first to when forwarding
* @weak: Ignore errors, as long as at least one port is mapped
*/
static void conf_ports_range_except(const struct ctx *c, char optname,
const char *optarg, struct fwd_ports *fwd,
const union inany_addr *addr,
const char *ifname,
uint16_t first, uint16_t last,
const uint8_t *exclude, uint16_t to,
bool weak)
{
bool bound_one = false;
unsigned i;
int ret;
if (first == 0) {
die("Can't forward port 0 for option '-%c %s'",
optname, optarg);
}
for (i = first; i <= last; i++) {
if (bitmap_isset(exclude, i))
continue;
if (bitmap_isset(fwd->map, i)) {
warn(
"Altering mapping of already mapped port number: %s", optarg);
}
bitmap_set(fwd->map, i);
fwd->delta[i] = to - first;
if (optname == 't')
ret = tcp_sock_init(c, addr, ifname, i);
else if (optname == 'u')
ret = udp_sock_init(c, 0, addr, ifname, i);
else
/* No way to check in advance for -T and -U */
ret = 0;
if (ret == -ENFILE || ret == -EMFILE) {
die("Can't open enough sockets for port specifier: %s",
optarg);
}
if (!ret) {
bound_one = true;
} else if (!weak) {
die("Failed to bind port %u (%s) for option '-%c %s'",
i, strerror_(-ret), optname, optarg);
}
}
if (!bound_one)
die("Failed to bind any port for '-%c %s'", optname, optarg);
}
/**
* conf_ports() - Parse port configuration options, initialise UDP/TCP sockets
* @c: Execution context
@ -134,10 +205,9 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
{
union inany_addr addr_buf = inany_any6, *addr = &addr_buf;
char buf[BUFSIZ], *spec, *ifname = NULL, *p;
bool exclude_only = true, bound_one = false;
uint8_t exclude[PORT_BITMAP_SIZE] = { 0 };
bool exclude_only = true;
unsigned i;
int ret;
if (!strcmp(optarg, "none")) {
if (fwd->mode)
@ -172,32 +242,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
fwd->mode = FWD_ALL;
/* Skip port 0. It has special meaning for many socket APIs, so
* trying to bind it is not really safe.
*/
for (i = 1; i < NUM_PORTS; i++) {
/* Exclude ephemeral ports */
for (i = 0; i < NUM_PORTS; i++)
if (fwd_port_is_ephemeral(i))
continue;
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, NULL, NULL, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
}
}
if (!bound_one)
goto bind_all_fail;
bitmap_set(exclude, i);
conf_ports_range_except(c, optname, optarg, fwd,
NULL, NULL,
1, NUM_PORTS - 1, exclude,
1, true);
return;
}
@ -274,37 +327,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
} while ((p = next_chunk(p, ',')));
if (exclude_only) {
/* Skip port 0. It has special meaning for many socket APIs, so
* trying to bind it is not really safe.
*/
for (i = 1; i < NUM_PORTS; i++) {
if (fwd_port_is_ephemeral(i) ||
bitmap_isset(exclude, i))
continue;
bitmap_set(fwd->map, i);
if (optname == 't') {
ret = tcp_sock_init(c, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else if (optname == 'u') {
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret == -ENFILE || ret == -EMFILE)
goto enfile;
if (!ret)
bound_one = true;
} else {
/* No way to check in advance for -T and -U */
bound_one = true;
}
}
if (!bound_one)
goto bind_all_fail;
/* Exclude ephemeral ports */
for (i = 0; i < NUM_PORTS; i++)
if (fwd_port_is_ephemeral(i))
bitmap_set(exclude, i);
conf_ports_range_except(c, optname, optarg, fwd,
addr, ifname,
1, NUM_PORTS - 1, exclude,
1, true);
return;
}
@ -333,40 +364,18 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
if ((*p != '\0') && (*p != ',')) /* Garbage after the ranges */
goto bad;
for (i = orig_range.first; i <= orig_range.last; i++) {
if (bitmap_isset(fwd->map, i))
warn(
"Altering mapping of already mapped port number: %s", optarg);
if (bitmap_isset(exclude, i))
continue;
bitmap_set(fwd->map, i);
fwd->delta[i] = mapped_range.first - orig_range.first;
ret = 0;
if (optname == 't')
ret = tcp_sock_init(c, addr, ifname, i);
else if (optname == 'u')
ret = udp_sock_init(c, 0, addr, ifname, i);
if (ret)
goto bind_fail;
}
conf_ports_range_except(c, optname, optarg, fwd,
addr, ifname,
orig_range.first, orig_range.last,
exclude,
mapped_range.first, false);
} while ((p = next_chunk(p, ',')));
return;
enfile:
die("Can't open enough sockets for port specifier: %s", optarg);
bad:
die("Invalid port specifier %s", optarg);
mode_conflict:
die("Port forwarding mode '%s' conflicts with previous mode", optarg);
bind_fail:
die("Failed to bind port %u (%s) for option '-%c %s', exiting",
i, strerror(-ret), optname, optarg);
bind_all_fail:
die("Failed to bind any port for '-%c %s', exiting", optname, optarg);
}
/**
@ -405,6 +414,76 @@ static unsigned add_dns6(struct ctx *c, const struct in6_addr *addr,
return 1;
}
/**
* add_dns_resolv4() - Possibly add one IPv4 nameserver from host's resolv.conf
* @c: Execution context
* @ns: Nameserver address
* @idx: Pointer to index of current IPv4 resolver entry, set on return
*/
static void add_dns_resolv4(struct ctx *c, struct in_addr *ns, unsigned *idx)
{
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
c->ip4.dns_host = *ns;
/* Special handling if guest or container can only access local
* addresses via redirect, or if the host gateway is also a resolver and
* we shadow its address
*/
if (IN4_IS_ADDR_LOOPBACK(ns) ||
IN4_ARE_ADDR_EQUAL(ns, &c->ip4.map_host_loopback)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
return; /* Address unreachable */
*ns = c->ip4.map_host_loopback;
c->ip4.dns_match = c->ip4.map_host_loopback;
} else {
/* No general host mapping, but requested for DNS
* (--dns-forward and --no-map-gw): advertise resolver
* address from --dns-forward, and map that to loopback
*/
*ns = c->ip4.dns_match;
}
}
*idx += add_dns4(c, ns, *idx);
}
/**
* add_dns_resolv6() - Possibly add one IPv6 nameserver from host's resolv.conf
* @c: Execution context
* @ns: Nameserver address
* @idx: Pointer to index of current IPv6 resolver entry, set on return
*/
static void add_dns_resolv6(struct ctx *c, struct in6_addr *ns, unsigned *idx)
{
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
c->ip6.dns_host = *ns;
/* Special handling if guest or container can only access local
* addresses via redirect, or if the host gateway is also a resolver and
* we shadow its address
*/
if (IN6_IS_ADDR_LOOPBACK(ns) ||
IN6_ARE_ADDR_EQUAL(ns, &c->ip6.map_host_loopback)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
return; /* Address unreachable */
*ns = c->ip6.map_host_loopback;
c->ip6.dns_match = c->ip6.map_host_loopback;
} else {
/* No general host mapping, but requested for DNS
* (--dns-forward and --no-map-gw): advertise resolver
* address from --dns-forward, and map that to loopback
*/
*ns = c->ip6.dns_match;
}
}
*idx += add_dns6(c, ns, *idx);
}
/**
* add_dns_resolv() - Possibly add ns from host resolv.conf to configuration
* @c: Execution context
@ -421,44 +500,11 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
struct in6_addr ns6;
struct in_addr ns4;
if (idx4 && inet_pton(AF_INET, nameserver, &ns4)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
c->ip4.dns_host = ns4;
if (idx4 && inet_pton(AF_INET, nameserver, &ns4))
add_dns_resolv4(c, &ns4, idx4);
/* Guest or container can only access local addresses via
* redirect
*/
if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
return;
ns4 = c->ip4.map_host_loopback;
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
c->ip4.dns_match = c->ip4.map_host_loopback;
}
*idx4 += add_dns4(c, &ns4, *idx4);
}
if (idx6 && inet_pton(AF_INET6, nameserver, &ns6)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
c->ip6.dns_host = ns6;
/* Guest or container can only access local addresses via
* redirect
*/
if (IN6_IS_ADDR_LOOPBACK(&ns6)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
return;
ns6 = c->ip6.map_host_loopback;
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
c->ip6.dns_match = c->ip6.map_host_loopback;
}
*idx6 += add_dns6(c, &ns6, *idx6);
}
if (idx6 && inet_pton(AF_INET6, nameserver, &ns6))
add_dns_resolv6(c, &ns6, idx6);
}
/**
@ -654,7 +700,7 @@ static unsigned int conf_ip4(unsigned int ifi, struct ip4_ctx *ip4)
&ip4->guest_gw);
if (rc < 0) {
debug("Couldn't discover IPv4 gateway address: %s",
strerror(-rc));
strerror_(-rc));
return 0;
}
}
@ -664,7 +710,7 @@ static unsigned int conf_ip4(unsigned int ifi, struct ip4_ctx *ip4)
&ip4->addr, &ip4->prefix_len, NULL);
if (rc < 0) {
debug("Couldn't discover IPv4 address: %s",
strerror(-rc));
strerror_(-rc));
return 0;
}
}
@ -728,7 +774,7 @@ static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6)
rc = nl_route_get_def(nl_sock, ifi, AF_INET6, &ip6->guest_gw);
if (rc < 0) {
debug("Couldn't discover IPv6 gateway address: %s",
strerror(-rc));
strerror_(-rc));
return 0;
}
}
@ -737,7 +783,7 @@ static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6)
IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ? &ip6->addr : NULL,
&prefix_len, &ip6->our_tap_ll);
if (rc < 0) {
debug("Couldn't discover IPv6 address: %s", strerror(-rc));
debug("Couldn't discover IPv6 address: %s", strerror_(-rc));
return 0;
}
@ -768,7 +814,7 @@ static void conf_ip6_local(struct ip6_ctx *ip6)
* usage() - Print usage, exit with given status code
* @name: Executable name
* @f: Stream to print usage info to
* @status: Status code for exit()
* @status: Status code for _exit()
*/
static void usage(const char *name, FILE *f, int status)
{
@ -807,9 +853,17 @@ static void usage(const char *name, FILE *f, int status)
" default: same interface name as external one\n");
} else {
FPRINTF(f,
" -s, --socket PATH UNIX domain socket path\n"
" -s, --socket, --socket-path PATH UNIX domain socket path\n"
" default: probe free path starting from "
UNIX_SOCK_PATH "\n", 1);
FPRINTF(f,
" --vhost-user Enable vhost-user mode\n"
" UNIX domain socket is provided by -s option\n"
" --print-capabilities print back-end capabilities in JSON format,\n"
" only meaningful for vhost-user mode\n");
FPRINTF(f,
" --repair-path PATH path for passt-repair(1)\n"
" default: append '.repair' to UNIX domain path\n");
}
FPRINTF(f,
@ -848,7 +902,9 @@ static void usage(const char *name, FILE *f, int status)
FPRINTF(f, " default: use addresses from /etc/resolv.conf\n");
FPRINTF(f,
" -S, --search LIST Space-separated list, search domains\n"
" a single, empty option disables the DNS search list\n");
" a single, empty option disables the DNS search list\n"
" -H, --hostname NAME Hostname to configure client with\n"
" --fqdn NAME FQDN to configure client with\n");
if (strstr(name, "pasta"))
FPRINTF(f, " default: don't use any search list\n");
else
@ -919,7 +975,7 @@ static void usage(const char *name, FILE *f, int status)
" SPEC is as described for TCP above\n"
" default: none\n");
exit(status);
_exit(status);
pasta_opts:
@ -957,8 +1013,7 @@ pasta_opts:
" -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
" SPEC is as described above\n"
" default: auto\n"
" --host-lo-to-ns-lo DEPRECATED:\n"
" Translate host-loopback forwards to\n"
" --host-lo-to-ns-lo Translate host-loopback forwards to\n"
" namespace loopback\n"
" --userns NSPATH Target user namespace to join\n"
" --netns PATH|NAME Target network namespace to join\n"
@ -971,9 +1026,49 @@ pasta_opts:
" Don't copy all routes to namespace\n"
" --no-copy-addrs DEPRECATED:\n"
" Don't copy all addresses to namespace\n"
" --ns-mac-addr ADDR Set MAC address on tap interface\n");
" --ns-mac-addr ADDR Set MAC address on tap interface\n"
" --no-splice Disable inbound socket splicing\n");
exit(status);
_exit(status);
}
/**
* conf_mode() - Determine passt/pasta's operating mode from command line
* @argc: Argument count
* @argv: Command line arguments
*
* Return: mode to operate in, PASTA or PASST
*/
enum passt_modes conf_mode(int argc, char *argv[])
{
int vhost_user = 0;
const struct option optvu[] = {
{"vhost-user", no_argument, &vhost_user, 1 },
{ 0 },
};
char argv0[PATH_MAX], *basearg0;
int name;
optind = 0;
do {
name = getopt_long(argc, argv, "-:", optvu, NULL);
} while (name != -1);
if (vhost_user)
return MODE_VU;
if (argc < 1)
die("Cannot determine argv[0]");
strncpy(argv0, argv[0], PATH_MAX - 1);
basearg0 = basename(argv0);
if (strstr(basearg0, "pasta"))
return MODE_PASTA;
if (strstr(basearg0, "passt"))
return MODE_PASST;
die("Cannot determine mode, invoke as \"passt\" or \"pasta\"");
}
/**
@ -1210,6 +1305,8 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
*addr6 = in6addr_any;
if (no_map_gw)
*no_map_gw = 1;
return;
}
if (inet_pton(AF_INET6, arg, addr6) &&
@ -1233,8 +1330,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
*/
static void conf_open_files(struct ctx *c)
{
if (c->mode != MODE_PASTA && c->fd_tap == -1)
c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
if (c->mode != MODE_PASTA && c->fd_tap == -1) {
c->fd_tap_listen = sock_unix(c->sock_path);
if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
if (!*c->repair_path &&
snprintf_check(c->repair_path,
sizeof(c->repair_path), "%s.repair",
c->sock_path)) {
warn("passt-repair path %s not usable",
c->repair_path);
c->fd_repair_listen = -1;
} else {
c->fd_repair_listen = sock_unix(c->repair_path);
}
} else {
c->fd_repair_listen = -1;
}
c->fd_repair = -1;
}
if (*c->pidfile) {
c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
@ -1306,6 +1420,7 @@ void conf(struct ctx *c, int argc, char **argv)
{"outbound", required_argument, NULL, 'o' },
{"dns", required_argument, NULL, 'D' },
{"search", required_argument, NULL, 'S' },
{"hostname", required_argument, NULL, 'H' },
{"no-tcp", no_argument, &c->no_tcp, 1 },
{"no-udp", no_argument, &c->no_udp, 1 },
{"no-icmp", no_argument, &c->no_icmp, 1 },
@ -1313,6 +1428,7 @@ void conf(struct ctx *c, int argc, char **argv)
{"no-dhcpv6", no_argument, &c->no_dhcpv6, 1 },
{"no-ndp", no_argument, &c->no_ndp, 1 },
{"no-ra", no_argument, &c->no_ra, 1 },
{"no-splice", no_argument, &c->no_splice, 1 },
{"freebind", no_argument, &c->freebind, 1 },
{"no-map-gw", no_argument, &no_map_gw, 1 },
{"ipv4-only", no_argument, NULL, '4' },
@ -1345,18 +1461,26 @@ void conf(struct ctx *c, int argc, char **argv)
{"map-guest-addr", required_argument, NULL, 22 },
{"host-lo-to-ns-lo", no_argument, NULL, 23 },
{"dns-host", required_argument, NULL, 24 },
{"vhost-user", no_argument, NULL, 25 },
/* vhost-user backend program convention */
{"print-capabilities", no_argument, NULL, 26 },
{"socket-path", required_argument, NULL, 's' },
{"fqdn", required_argument, NULL, 27 },
{"repair-path", required_argument, NULL, 28 },
{ 0 },
};
const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
bool copy_addrs_opt = false, copy_routes_opt = false;
enum fwd_ports_mode fwd_default = FWD_NONE;
bool v4_only = false, v6_only = false;
unsigned dns4_idx = 0, dns6_idx = 0;
unsigned long max_mtu = IP_MAX_MTU;
struct fqdn *dnss = c->dns_search;
unsigned int ifi4 = 0, ifi6 = 0;
const char *logfile = NULL;
const char *optstring;
size_t logsize = 0;
char *runas = NULL;
long fd_tap_opt;
@ -1367,11 +1491,11 @@ void conf(struct ctx *c, int argc, char **argv)
if (c->mode == MODE_PASTA) {
c->no_dhcp_dns = c->no_dhcp_dns_search = 1;
fwd_default = FWD_AUTO;
optstring = "+dqfel:hF:I:p:P:m:a:n:M:g:i:o:D:S:46t:u:T:U:";
} else {
optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:461t:u:";
}
if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
max_mtu = tap_l2_max_len(c) - ETH_HLEN;
c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
memcpy(c->our_tap_mac, MAC_OUR_LAA, ETH_ALEN);
@ -1470,7 +1594,7 @@ void conf(struct ctx *c, int argc, char **argv)
FPRINTF(stdout,
c->mode == MODE_PASTA ? "pasta " : "passt ");
FPRINTF(stdout, VERSION_BLOB);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
case 15:
ret = snprintf(c->ip4.ifname_out,
sizeof(c->ip4.ifname_out), "%s", optarg);
@ -1538,6 +1662,27 @@ void conf(struct ctx *c, int argc, char **argv)
break;
die("Invalid host nameserver address: %s", optarg);
case 25:
/* Already handled in conf_mode() */
ASSERT(c->mode == MODE_VU);
break;
case 26:
vu_print_capabilities();
break;
case 27:
if (snprintf_check(c->fqdn, PASST_MAXDNAME,
"%s", optarg))
die("Invalid FQDN: %s", optarg);
break;
case 28:
if (c->mode != MODE_VU && strcmp(optarg, "none"))
die("--repair-path is for vhost-user mode only");
if (snprintf_check(c->repair_path,
sizeof(c->repair_path), "%s",
optarg))
die("Invalid passt-repair path: %s", optarg);
break;
case 'd':
c->debug = 1;
@ -1557,6 +1702,9 @@ void conf(struct ctx *c, int argc, char **argv)
c->foreground = 1;
break;
case 's':
if (c->mode == MODE_PASTA)
die("-s is for passt / vhost-user mode only");
ret = snprintf(c->sock_path, sizeof(c->sock_path), "%s",
optarg);
if (ret <= 0 || ret >= (int)sizeof(c->sock_path))
@ -1569,7 +1717,8 @@ void conf(struct ctx *c, int argc, char **argv)
fd_tap_opt = strtol(optarg, NULL, 0);
if (errno ||
fd_tap_opt <= STDERR_FILENO || fd_tap_opt > INT_MAX)
(fd_tap_opt != STDIN_FILENO && fd_tap_opt <= STDERR_FILENO) ||
fd_tap_opt > INT_MAX)
die("Invalid --fd: %s", optarg);
c->fd_tap = fd_tap_opt;
@ -1577,6 +1726,9 @@ void conf(struct ctx *c, int argc, char **argv)
*c->sock_path = 0;
break;
case 'I':
if (c->mode != MODE_PASTA)
die("-I is for pasta mode only");
ret = snprintf(c->pasta_ifn, IFNAMSIZ, "%s",
optarg);
if (ret <= 0 || ret >= IFNAMSIZ)
@ -1596,20 +1748,24 @@ void conf(struct ctx *c, int argc, char **argv)
die("Invalid PID file: %s", optarg);
break;
case 'm':
case 'm': {
unsigned long mtu;
char *e;
errno = 0;
c->mtu = strtol(optarg, NULL, 0);
mtu = strtoul(optarg, &e, 0);
if (!c->mtu) {
c->mtu = -1;
break;
}
if (c->mtu < ETH_MIN_MTU || c->mtu > (int)ETH_MAX_MTU ||
errno)
if (errno || *e)
die("Invalid MTU: %s", optarg);
if (mtu > max_mtu) {
die("MTU %lu too large (max %lu)",
mtu, max_mtu);
}
c->mtu = mtu;
break;
}
case 'a':
if (inet_pton(AF_INET6, optarg, &c->ip6.addr) &&
!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr) &&
@ -1708,6 +1864,11 @@ void conf(struct ctx *c, int argc, char **argv)
die("Cannot use DNS search domain %s", optarg);
break;
case 'H':
if (snprintf_check(c->hostname, PASST_MAXDNAME,
"%s", optarg))
die("Invalid hostname: %s", optarg);
break;
case '4':
v4_only = true;
v6_only = false;
@ -1724,11 +1885,16 @@ void conf(struct ctx *c, int argc, char **argv)
break;
case 't':
case 'u':
case 'T':
case 'U':
case 'D':
/* Handle these later, once addresses are configured */
break;
case 'T':
case 'U':
if (c->mode != MODE_PASTA)
die("-%c is for pasta mode only", name);
/* Handle properly later, once addresses are configured */
break;
case 'h':
usage(argv[0], stdout, EXIT_SUCCESS);
break;
@ -1739,6 +1905,9 @@ void conf(struct ctx *c, int argc, char **argv)
}
} while (name != -1);
if (c->mode != MODE_PASTA)
c->no_splice = 1;
if (c->mode == MODE_PASTA && !c->pasta_conf_ns) {
if (copy_routes_opt)
die("--no-copy-routes needs --config-net");
@ -1773,9 +1942,21 @@ void conf(struct ctx *c, int argc, char **argv)
c->ifi4 = conf_ip4(ifi4, &c->ip4);
if (!v4_only)
c->ifi6 = conf_ip6(ifi6, &c->ip6);
if (c->ifi4 && c->mtu < IPV4_MIN_MTU) {
warn("MTU %"PRIu16" is too small for IPv4 (minimum %u)",
c->mtu, IPV4_MIN_MTU);
}
if (c->ifi6 && c->mtu < IPV6_MIN_MTU) {
warn("MTU %"PRIu16" is too small for IPv6 (minimum %u)",
c->mtu, IPV6_MIN_MTU);
}
if ((*c->ip4.ifname_out && !c->ifi4) ||
(*c->ip6.ifname_out && !c->ifi6))
die("External interface not usable");
if (!c->ifi4 && !c->ifi6) {
info("No external interface as template, switch to local mode");
@ -1802,8 +1983,8 @@ void conf(struct ctx *c, int argc, char **argv)
if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
c->no_dhcp = 1;
/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
* settings)
/* Inbound port options and DNS can be parsed now, after IPv4/IPv6
* settings
*/
fwd_probe_ephemeral();
udp_portmap_clear();
@ -1897,9 +2078,6 @@ void conf(struct ctx *c, int argc, char **argv)
c->no_dhcpv6 = 1;
}
if (!c->mtu)
c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
get_dns(c);
if (!*c->pasta_ifn) {

conf.h

@ -6,6 +6,7 @@
#ifndef CONF_H
#define CONF_H
enum passt_modes conf_mode(int argc, char *argv[]);
void conf(struct ctx *c, int argc, char **argv);
#endif /* CONF_H */
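For context, a rough sketch of how the two entry points declared here fit together (hypothetical excerpt, not part of this diff; the header providing struct ctx is assumed to be passt.h): conf_mode() is expected to run first and its result stored in the context, since conf() above already checks c->mode in the --vhost-user, -s, -I and -T/-U cases.
	#include "passt.h"	/* struct ctx, enum passt_modes (assumed header) */
	#include "conf.h"
	int main(int argc, char **argv)
	{
		static struct ctx c;		/* zero-initialised context */
		c.mode = conf_mode(argc, argv);	/* MODE_PASST, MODE_PASTA or MODE_VU */
		conf(&c, argc, argv);		/* then parse everything else */
		return 0;
	}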


@ -27,4 +27,25 @@ profile passt /usr/bin/passt{,.avx2} {
owner @{HOME}/** w, # pcap(), pidfile_open(),
# pidfile_write()
# Workaround: libvirt's profile comes with a passt subprofile which includes,
# in turn, <abstractions/passt>, and adds libvirt-specific rules on top, to
# allow passt (when started by libvirtd) to write socket and PID files in the
# location requested by libvirtd itself, and to execute passt itself.
#
# However, when libvirt runs as unprivileged user, the mechanism based on
# virt-aa-helper, designed to build per-VM profiles as guests are started,
# doesn't work. The helper needs to create and load profiles on the fly, which
# can't be done by unprivileged users, of course.
#
# As a result, libvirtd runs unconfined if guests are started by unprivileged
# users, starting passt unconfined as well, which means that passt runs under
# its own stand-alone profile (this one), which implies in turn that execve()
# of /usr/bin/passt is not allowed, and socket and PID files can't be written.
#
# Duplicate libvirt-specific rules here as long as this is not solved in
# libvirt's profile itself.
/usr/bin/passt r,
owner @{run}/user/[0-9]*/libvirt/qemu/run/passt/* rw,
owner @{run}/libvirt/qemu/passt/* rw,
}


@ -0,0 +1,29 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# contrib/apparmor/usr.bin.passt-repair - AppArmor profile for passt-repair(1)
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
abi <abi/3.0>,
#include <tunables/global>
profile passt-repair /usr/bin/passt-repair {
#include <abstractions/base>
/** rw, # passt's ".repair" socket might be anywhere
unix (connect, receive, send) type=stream,
capability dac_override, # connect to passt's socket as root
capability net_admin, # currently needed for TCP_REPAIR socket option
capability net_raw, # what TCP_REPAIR should require instead
network unix stream, # connect and use UNIX domain socket
network inet stream, # use TCP sockets
}


@ -44,7 +44,7 @@ Requires(preun): %{name}
Requires(preun): policycoreutils
%description selinux
This package adds SELinux enforcement to passt(1) and pasta(1).
This package adds SELinux enforcement to passt(1), pasta(1), and passt-repair(1).
%prep
%setup -q -n passt-%{git_hash}
@ -82,6 +82,7 @@ make -f %{_datadir}/selinux/devel/Makefile
install -p -m 644 -D passt.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
install -p -m 644 -D passt.if %{buildroot}%{_datadir}/selinux/devel/include/distributed/passt.if
install -p -m 644 -D pasta.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
install -p -m 644 -D passt-repair.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
popd
%pre selinux
@ -90,11 +91,13 @@ popd
%post selinux
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%postun selinux
if [ $1 -eq 0 ]; then
%selinux_modules_uninstall -s %{selinuxtype} passt
%selinux_modules_uninstall -s %{selinuxtype} pasta
%selinux_modules_uninstall -s %{selinuxtype} passt-repair
fi
%posttrans selinux
@ -108,9 +111,11 @@ fi
%{_bindir}/passt
%{_bindir}/pasta
%{_bindir}/qrap
%{_bindir}/passt-repair
%{_mandir}/man1/passt.1*
%{_mandir}/man1/pasta.1*
%{_mandir}/man1/qrap.1*
%{_mandir}/man1/passt-repair.1*
%ifarch x86_64
%{_bindir}/passt.avx2
%{_mandir}/man1/passt.avx2.1*
@ -122,6 +127,7 @@ fi
%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%{_datadir}/selinux/devel/include/distributed/passt.if
%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%changelog
{{{ passt_git_changelog }}}


@ -0,0 +1,11 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.fc - SELinux: File Context for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
/usr/bin/passt-repair system_u:object_r:passt_repair_exec_t:s0


@ -0,0 +1,87 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.te - SELinux: Type Enforcement for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
policy_module(passt-repair, 0.1)
require {
type unconfined_t;
type passt_t;
role unconfined_r;
class process transition;
class file { read execute execute_no_trans entrypoint open map };
class capability { dac_override net_admin net_raw };
class chr_file { append open getattr read write ioctl };
class unix_stream_socket { create connect sendto };
class sock_file { read write };
class tcp_socket { read setopt write };
type console_device_t;
type user_devpts_t;
type user_tmp_t;
# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
type qemu_var_run_t;
type virt_var_run_t;
}
type passt_repair_t;
domain_type(passt_repair_t);
type passt_repair_exec_t;
corecmd_executable_file(passt_repair_exec_t);
role unconfined_r types passt_repair_t;
allow passt_repair_t passt_repair_exec_t:file { read execute execute_no_trans entrypoint open map };
type_transition unconfined_t passt_repair_exec_t:process passt_repair_t;
allow unconfined_t passt_repair_t:process transition;
allow passt_repair_t self:capability { dac_override dac_read_search net_admin net_raw };
allow passt_repair_t self:capability2 bpf;
allow passt_repair_t console_device_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t user_devpts_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:dir { getattr read search watch };
allow passt_repair_t unconfined_t:sock_file { getattr read write };
allow passt_repair_t passt_t:sock_file { getattr read write };
allow passt_repair_t user_tmp_t:sock_file { getattr read write };
allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
allow passt_repair_t passt_t:tcp_socket { read setopt write };
# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
allow passt_repair_t qemu_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t virt_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t qemu_var_run_t:dir { getattr read search watch };
allow passt_repair_t virt_var_run_t:dir { getattr read search watch };
allow passt_repair_t qemu_var_run_t:sock_file { getattr read write };
allow passt_repair_t virt_var_run_t:sock_file { getattr read write };


@ -20,9 +20,19 @@ require {
type fs_t;
type tmp_t;
type user_tmp_t;
type user_home_t;
type tmpfs_t;
type root_t;
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
type svirt_image_t;
type svirt_tmpfs_t;
type svirt_t;
type null_device_t;
class file { ioctl getattr setattr create read write unlink open relabelto execute execute_no_trans map };
class dir { search write add_name remove_name mounton };
class chr_file { append read write open getattr ioctl };
@ -38,8 +48,8 @@ require {
type net_conf_t;
type proc_net_t;
type node_t;
class tcp_socket { create accept listen name_bind name_connect };
class udp_socket { create accept listen };
class tcp_socket { create accept listen name_bind name_connect getattr ioctl };
class udp_socket { create accept listen getattr };
class icmp_socket { bind create name_bind node_bind setopt read write };
class sock_file { create unlink write };
@ -80,6 +90,9 @@ allow passt_t root_t:dir mounton;
allow passt_t tmp_t:dir { add_name mounton remove_name write };
allow passt_t tmpfs_t:filesystem mount;
allow passt_t fs_t:filesystem unmount;
allow passt_t user_home_t:dir search;
allow passt_t user_tmp_t:fifo_file append;
allow passt_t user_tmp_t:file map;
manage_files_pattern(passt_t, user_tmp_t, user_tmp_t)
files_pid_filetrans(passt_t, user_tmp_t, file)
@ -119,11 +132,19 @@ corenet_udp_sendrecv_all_ports(passt_t)
allow passt_t node_t:icmp_socket { name_bind node_bind };
allow passt_t port_t:icmp_socket name_bind;
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write };
allow passt_t self:udp_socket { create getopt setopt connect bind read write };
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr ioctl };
allow passt_t self:udp_socket { create getopt setopt connect bind read write getattr };
allow passt_t self:icmp_socket { bind create setopt read write };
allow passt_t user_tmp_t:dir { add_name write };
allow passt_t user_tmp_t:file { create open };
allow passt_t user_tmp_t:sock_file { create read write unlink };
allow passt_t unconfined_t:unix_stream_socket { read write };
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
allow passt_t svirt_image_t:file { read write map };
allow passt_t svirt_tmpfs_t:file { read write map };
allow passt_t null_device_t:chr_file map;


@ -18,6 +18,7 @@ require {
type bin_t;
type user_home_t;
type user_home_dir_t;
type user_tmp_t;
type fs_t;
type tmp_t;
type tmpfs_t;
@ -56,8 +57,10 @@ require {
attribute port_type;
type port_t;
type http_port_t;
type http_cache_port_t;
type ssh_port_t;
type reserved_port_t;
type unreserved_port_t;
type dns_port_t;
type dhcpc_port_t;
type chronyd_port_t;
@ -122,8 +125,8 @@ domain_auto_trans(pasta_t, ping_exec_t, ping_t);
allow pasta_t nsfs_t:file { open read };
allow pasta_t user_home_t:dir getattr;
allow pasta_t user_home_t:file { open read getattr setattr };
allow pasta_t user_home_t:dir { getattr search };
allow pasta_t user_home_t:file { open read getattr setattr execute execute_no_trans map};
allow pasta_t user_home_dir_t:dir { search getattr open add_name read write };
allow pasta_t user_home_dir_t:file { create open read write };
allow pasta_t tmp_t:dir { add_name mounton remove_name write };
@ -133,6 +136,11 @@ allow pasta_t root_t:dir mounton;
manage_files_pattern(pasta_t, pasta_pid_t, pasta_pid_t)
files_pid_filetrans(pasta_t, pasta_pid_t, file)
allow pasta_t user_tmp_t:dir { add_name remove_name search write };
allow pasta_t user_tmp_t:fifo_file append;
allow pasta_t user_tmp_t:file { create open write };
allow pasta_t user_tmp_t:sock_file { create unlink };
allow pasta_t console_device_t:chr_file { open write getattr ioctl };
allow pasta_t user_devpts_t:chr_file { getattr read write ioctl };
logging_send_syslog_msg(pasta_t)
@ -160,6 +168,8 @@ allow pasta_t self:udp_socket create_stream_socket_perms;
allow pasta_t reserved_port_t:udp_socket name_bind;
allow pasta_t llmnr_port_t:tcp_socket name_bind;
allow pasta_t llmnr_port_t:udp_socket name_bind;
allow pasta_t http_cache_port_t:tcp_socket { name_bind name_connect };
allow pasta_t unreserved_port_t:udp_socket name_bind;
corenet_udp_sendrecv_generic_node(pasta_t)
corenet_udp_bind_generic_node(pasta_t)
allow pasta_t node_t:icmp_socket { name_bind node_bind };
@ -171,7 +181,7 @@ allow pasta_t init_t:lnk_file read;
allow pasta_t init_t:unix_stream_socket connectto;
allow pasta_t init_t:dbus send_msg;
allow pasta_t init_t:system status;
allow pasta_t unconfined_t:dir search;
allow pasta_t unconfined_t:dir { read search };
allow pasta_t unconfined_t:file read;
allow pasta_t unconfined_t:lnk_file read;
allow pasta_t self:process { setpgid setcap };
@ -192,8 +202,6 @@ allow pasta_t sysctl_net_t:dir search;
allow pasta_t sysctl_net_t:file { open read write };
allow pasta_t kernel_t:system module_request;
allow pasta_t nsfs_t:file read;
allow pasta_t proc_t:dir mounton;
allow pasta_t proc_t:filesystem mount;
allow pasta_t net_conf_t:lnk_file read;

dhcp.c

@ -63,6 +63,11 @@ static struct opt opts[255];
#define OPT_MIN 60 /* RFC 951 */
/* Total option size (excluding end option) is 576 (RFC 2131), minus
* offset of options (268), minus end option (1).
*/
#define OPT_MAX 307
/**
* dhcp_init() - Initialise DHCP options
*/
@ -122,7 +127,7 @@ struct msg {
uint8_t sname[64];
uint8_t file[128];
uint32_t magic;
uint8_t o[308];
uint8_t o[OPT_MAX + 1 /* End option */ ];
} __attribute__((__packed__));
/**
@ -130,15 +135,28 @@ struct msg {
* @m: Message to fill
* @o: Option number
* @offset: Current offset within options field, updated on insertion
*
* Return: false if m has space to write the option, true otherwise
*/
static void fill_one(struct msg *m, int o, int *offset)
static bool fill_one(struct msg *m, int o, int *offset)
{
size_t slen = opts[o].slen;
/* If we don't have space to write the option, then just skip */
if (*offset + 2 /* code and length of option */ + slen > OPT_MAX)
return true;
m->o[*offset] = o;
m->o[*offset + 1] = opts[o].slen;
memcpy(&m->o[*offset + 2], opts[o].s, opts[o].slen);
m->o[*offset + 1] = slen;
/* Move to option */
*offset += 2;
memcpy(&m->o[*offset], opts[o].s, slen);
opts[o].sent = 1;
*offset += 2 + opts[o].slen;
*offset += slen;
return false;
}
/**
@ -151,9 +169,6 @@ static int fill(struct msg *m)
{
int i, o, offset = 0;
m->op = BOOTREPLY;
m->secs = 0;
for (o = 0; o < 255; o++)
opts[o].sent = 0;
@ -162,21 +177,23 @@ static int fill(struct msg *m)
* Put it there explicitly, unless requested via option 55.
*/
if (opts[55].clen > 0 && !memchr(opts[55].c, 53, opts[55].clen))
fill_one(m, 53, &offset);
if (fill_one(m, 53, &offset))
debug("DHCP: skipping option 53");
for (i = 0; i < opts[55].clen; i++) {
o = opts[55].c[i];
if (opts[o].slen != -1)
fill_one(m, o, &offset);
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
for (o = 0; o < 255; o++) {
if (opts[o].slen != -1 && !opts[o].sent)
fill_one(m, o, &offset);
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
m->o[offset++] = 255;
m->o[offset++] = 0;
if (offset < OPT_MIN) {
memset(&m->o[offset], 0, OPT_MIN - offset);
@ -291,8 +308,9 @@ int dhcp(const struct ctx *c, const struct pool *p)
const struct ethhdr *eh;
const struct iphdr *iph;
const struct udphdr *uh;
struct msg const *m;
struct msg reply;
unsigned int i;
struct msg *m;
eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
offset += sizeof(*eh);
@ -321,6 +339,22 @@ int dhcp(const struct ctx *c, const struct pool *p)
m->op != BOOTREQUEST)
return -1;
reply.op = BOOTREPLY;
reply.htype = m->htype;
reply.hlen = m->hlen;
reply.hops = 0;
reply.xid = m->xid;
reply.secs = 0;
reply.flags = m->flags;
reply.ciaddr = m->ciaddr;
reply.yiaddr = c->ip4.addr;
reply.siaddr = 0;
reply.giaddr = m->giaddr;
memcpy(&reply.chaddr, m->chaddr, sizeof(reply.chaddr));
memset(&reply.sname, 0, sizeof(reply.sname));
memset(&reply.file, 0, sizeof(reply.file));
reply.magic = m->magic;
offset += offsetof(struct msg, o);
for (i = 0; i < ARRAY_SIZE(opts); i++)
@ -364,7 +398,6 @@ int dhcp(const struct ctx *c, const struct pool *p)
info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));
m->yiaddr = c->ip4.addr;
mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
memcpy(opts[1].s, &mask, sizeof(mask));
memcpy(opts[3].s, &c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
@ -384,7 +417,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
&c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
}
if (c->mtu != -1) {
if (c->mtu) {
opts[26].slen = 2;
opts[26].s[0] = c->mtu / 256;
opts[26].s[1] = c->mtu % 256;
@ -398,17 +431,41 @@ int dhcp(const struct ctx *c, const struct pool *p)
if (!opts[6].slen)
opts[6].slen = -1;
opt_len = strlen(c->hostname);
if (opt_len > 0) {
opts[12].slen = opt_len;
memcpy(opts[12].s, &c->hostname, opt_len);
}
opt_len = strlen(c->fqdn);
if (opt_len > 0) {
opt_len += 3 /* flags */
+ 2; /* Length byte for first label, and terminator */
if (sizeof(opts[81].s) >= opt_len) {
opts[81].s[0] = 0x4; /* flags (E) */
opts[81].s[1] = 0xff; /* RCODE1 */
opts[81].s[2] = 0xff; /* RCODE2 */
encode_domain_name((char *)opts[81].s + 3, c->fqdn);
opts[81].slen = opt_len;
} else {
debug("DHCP: client FQDN option doesn't fit, skipping");
}
}
if (!c->no_dhcp_dns_search)
opt_set_dns_search(c, sizeof(m->o));
dlen = offsetof(struct msg, o) + fill(m);
dlen = offsetof(struct msg, o) + fill(&reply);
if (m->flags & FLAG_BROADCAST)
dst = in4addr_broadcast;
else
dst = c->ip4.addr;
tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, m, dlen);
tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, &reply, dlen);
return 1;
}

dhcpv6.c

@ -48,6 +48,7 @@ struct opt_hdr {
# define STATUS_NOTONLINK htons_constant(4)
# define OPT_DNS_SERVERS htons_constant(23)
# define OPT_DNS_SEARCH htons_constant(24)
# define OPT_CLIENT_FQDN htons_constant(39)
#define STR_NOTONLINK "Prefix not appropriate for link."
uint16_t l;
@ -58,6 +59,9 @@ struct opt_hdr {
sizeof(struct opt_hdr))
#define OPT_VSIZE(x) (sizeof(struct opt_##x) - \
sizeof(struct opt_hdr))
#define OPT_MAX_SIZE IPV6_MIN_MTU - (sizeof(struct ipv6hdr) + \
sizeof(struct udphdr) + \
sizeof(struct msg_hdr))
/**
* struct opt_client_id - DHCPv6 Client Identifier option
@ -140,7 +144,9 @@ struct opt_ia_addr {
struct opt_status_code {
struct opt_hdr hdr;
uint16_t code;
char status_msg[sizeof(STR_NOTONLINK) - 1];
/* "nonstring" is only supported since clang 23 */
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((nonstring)) char status_msg[sizeof(STR_NOTONLINK) - 1];
} __attribute__((packed));
/**
@ -163,6 +169,18 @@ struct opt_dns_search {
char list[MAXDNSRCH * NS_MAXDNAME];
} __attribute__((packed));
/**
* struct opt_client_fqdn - Client FQDN option (RFC 4704)
* @hdr: Option header
* @flags: Flags described by RFC 4704
* @domain_name: Client FQDN
*/
struct opt_client_fqdn {
struct opt_hdr hdr;
uint8_t flags;
char domain_name[PASST_MAXDNAME];
} __attribute__((packed));
/**
* struct msg_hdr - DHCPv6 client/server message header
* @type: DHCP message type
@ -193,6 +211,7 @@ struct msg_hdr {
* @client_id: Client Identifier, variable length
* @dns_servers: DNS Recursive Name Server, here just for storage size
* @dns_search: Domain Search List, here just for storage size
* @client_fqdn: Client FQDN, variable length
*/
static struct resp_t {
struct msg_hdr hdr;
@ -203,6 +222,7 @@ static struct resp_t {
struct opt_client_id client_id;
struct opt_dns_servers dns_servers;
struct opt_dns_search dns_search;
struct opt_client_fqdn client_fqdn;
} __attribute__((__packed__)) resp = {
{ 0 },
SERVER_ID,
@ -228,6 +248,10 @@ static struct resp_t {
{ { OPT_DNS_SEARCH, 0, },
{ 0 },
},
{ { OPT_CLIENT_FQDN, 0, },
0, { 0 },
},
};
static const struct opt_status_code sc_not_on_link = {
@ -346,7 +370,6 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
{
struct opt_dns_servers *srv = NULL;
struct opt_dns_search *srch = NULL;
char *p = NULL;
int i;
if (c->no_dhcp_dns)
@ -383,34 +406,81 @@ search:
if (!name_len)
continue;
name_len += 2; /* Length byte for first label, and terminator */
if (name_len >
NS_MAXDNAME + 1 /* Length byte for first label */ ||
name_len > 255) {
debug("DHCP: DNS search name '%s' too long, skipping",
c->dns_search[i].n);
continue;
}
if (!srch) {
srch = (struct opt_dns_search *)(buf + offset);
offset += sizeof(struct opt_hdr);
srch->hdr.t = OPT_DNS_SEARCH;
srch->hdr.l = 0;
p = srch->list;
}
*p = '.';
p = stpncpy(p + 1, c->dns_search[i].n, name_len);
p++;
srch->hdr.l += name_len + 2;
offset += name_len + 2;
encode_domain_name(buf + offset, c->dns_search[i].n);
srch->hdr.l += name_len;
offset += name_len;
}
if (srch) {
for (i = 0; i < srch->hdr.l; i++) {
if (srch->list[i] == '.') {
srch->list[i] = strcspn(srch->list + i + 1,
".");
}
}
if (srch)
srch->hdr.l = htons(srch->hdr.l);
}
return offset;
}
/**
* dhcpv6_client_fqdn_fill() - Fill in client FQDN option
* @c: Execution context
* @buf: Response message buffer where options will be appended
* @offset: Offset in message buffer for new options
*
* Return: updated length of response message buffer.
*/
static size_t dhcpv6_client_fqdn_fill(const struct pool *p, const struct ctx *c,
char *buf, int offset)
{
struct opt_client_fqdn const *req_opt;
struct opt_client_fqdn *o;
size_t opt_len;
opt_len = strlen(c->fqdn);
if (opt_len == 0) {
return offset;
}
opt_len += 2; /* Length byte for first label, and terminator */
if (opt_len > OPT_MAX_SIZE - (offset +
sizeof(struct opt_hdr) +
1 /* flags */ )) {
debug("DHCPv6: client FQDN option doesn't fit, skipping");
return offset;
}
o = (struct opt_client_fqdn *)(buf + offset);
encode_domain_name(o->domain_name, c->fqdn);
req_opt = (struct opt_client_fqdn *)dhcpv6_opt(p, &(size_t){ 0 },
OPT_CLIENT_FQDN);
if (req_opt && req_opt->flags & 0x01 /* S flag */)
o->flags = 0x02 /* O flag */;
else
o->flags = 0x00;
opt_len++;
o->hdr.t = OPT_CLIENT_FQDN;
o->hdr.l = htons(opt_len);
return offset + sizeof(struct opt_hdr) + opt_len;
}
/**
* dhcpv6() - Check if this is a DHCPv6 message, reply as needed
* @c: Execution context
@ -544,6 +614,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
n = offsetof(struct resp_t, client_id) +
sizeof(struct opt_hdr) + ntohs(client_id->l);
n = dhcpv6_dns_fill(c, (char *)&resp, n);
n = dhcpv6_client_fqdn_fill(p, c, (char *)&resp, n);
resp.hdr.xid = mh->xid;
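Both the DHCP client FQDN option and the DHCPv6 search list/FQDN options above rely on encode_domain_name() to produce DNS wire-format labels. That helper isn't part of this diff; as a rough sketch of what the encoding amounts to (illustration only, hypothetical function name, not the project's implementation):
	#include <stddef.h>
	#include <string.h>
	/* "mail.example.com" becomes \x04mail\x07example\x03com\x00: one
	 * length byte per label, then a zero length as terminator.
	 */
	static void encode_labels(char *dst, const char *name)
	{
		const char *p = name;
		while (*p) {
			size_t len = strcspn(p, ".");	/* label length */
			*dst++ = (char)len;
			memcpy(dst, p, len);
			dst += len;
			p += len;
			if (*p == '.')
				p++;
		}
		*dst = 0;			/* root label, ends the name */
	}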

doc/migration/.gitignore

@ -0,0 +1,2 @@
/source
/target

doc/migration/Makefile

@ -0,0 +1,20 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
TARGETS = source target
CFLAGS = -Wall -Wextra -pedantic
all: $(TARGETS)
$(TARGETS): %: %.c
clean:
rm -f $(TARGETS)

doc/migration/README

@ -0,0 +1,51 @@
<!---
SPDX-License-Identifier: GPL-2.0-or-later
Copyright (c) 2025 Red Hat GmbH
Author: Stefano Brivio <sbrivio@redhat.com>
-->
Migration
=========
These test programs show a migration of a TCP connection from one process to
another using the TCP_REPAIR socket option.
The two processes are a mock of the matching implementation in passt(1), and run
unprivileged, so they rely on the passt-repair helper to connect to them and set
or clear TCP_REPAIR on the connection socket, transferred to the helper using
SCM_RIGHTS.
The passt-repair helper needs to have the CAP_NET_ADMIN capability, or run as
root.
Example of usage
----------------
* Start the test server
$ nc -l 9999
* Start the source side of the TCP client (mock of the source instance of passt)
$ ./source 127.0.0.1 9999 9998 /tmp/repair.sock
* The client sends a test string, and waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The socket is now in repair mode, and `source` dumps sequences, then exits
sending sequence: 3244673313
receiving sequence: 2250449386
* Continue the connection on the target side, restarting from those sequences
$ ./target 127.0.0.1 9999 9998 /tmp/repair.sock 3244673313 2250449386
* The target side now waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The target side asks passt-repair to switch the socket to repair mode, sets up
the TCP sequences, then asks passt-repair to clear repair mode, and sends a
test string to the server
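The helper side of this protocol is not shown by these mocks. As a rough sketch (illustration only, not the actual passt-repair(1) source; the function name is made up), the helper accepts the connection on the UNIX domain socket, reads one command byte with the TCP socket attached via SCM_RIGHTS, applies that byte as the TCP_REPAIR value, and acknowledges with a single byte:
	#include <sys/socket.h>
	#include <netinet/tcp.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	static int repair_handle_one(int conn)
	{
		char buf[CMSG_SPACE(sizeof(int))];
		int8_t cmd;
		struct iovec iov = { &cmd, sizeof(cmd) };
		struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
		struct cmsghdr *cmsg;
		int s, op;
		if (recvmsg(conn, &msg, 0) <= 0)
			return -1;
		cmsg = CMSG_FIRSTHDR(&msg);
		if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
			return -1;
		memcpy(&s, CMSG_DATA(cmsg), sizeof(s));	/* socket to act on */
		op = cmd;	/* TCP_REPAIR_ON, TCP_REPAIR_OFF, TCP_REPAIR_OFF_NO_WP */
		if (setsockopt(s, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
			return -1;
		close(s);	/* drop our duplicate of the descriptor */
		return send(conn, &cmd, sizeof(cmd), 0) == 1 ? 0 : -1;
	}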

doc/migration/source.c

@ -0,0 +1,92 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/source.c - Mock of TCP migration source, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int seq, s, s_helper;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
socklen_t seqlen = sizeof(int);
struct addrinfo *r;
(void)argc;
if (argc != 5) {
fprintf(stderr, "%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH\n",
argv[0]);
return -1;
}
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
/* Connect socket to server and send some data */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
connect(s, r->ai_addr, r->ai_addrlen);
send(s, "before migration\n", sizeof("before migration\n"), 0);
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Get sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u ", seq);
/* Get receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u\n", seq);
}

doc/migration/target.c

@ -0,0 +1,102 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/target.c - Mock of TCP migration target, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int s, s_helper, seq;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
struct addrinfo *r;
(void)argc;
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
if (argc != 7) {
fprintf(stderr,
"%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH SSEQ RSEQ\n",
argv[0]);
return -1;
}
/* Prepare socket, bind to source port */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Set sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[5]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Set receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[6]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Connect setting kernel state only, without actual SYN / handshake */
connect(s, r->ai_addr, r->ai_addrlen);
/* Send command to helper: turn repair mode off, wait for reply */
cmd = TCP_REPAIR_OFF;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Send some more data */
send(s, "after migration\n", sizeof("after migration\n"), 0);
}


@ -1,3 +1,4 @@
/listen-vs-repair
/reuseaddr-priority
/recv-zero
/udp-close-dup


@ -3,8 +3,8 @@
# Copyright Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
TARGETS = reuseaddr-priority recv-zero udp-close-dup
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c
TARGETS = reuseaddr-priority recv-zero udp-close-dup listen-vs-repair
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c listen-vs-repair.c
CFLAGS = -Wall
all: cppcheck clang-tidy $(TARGETS:%=check-%)


@ -15,6 +15,7 @@
#include <stdio.h>
#include <stdlib.h>
__attribute__((format(printf, 1, 2), noreturn))
static inline void die(const char *fmt, ...)
{
va_list ap;


@ -0,0 +1,128 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* listen-vs-repair.c
*
* Do listening sockets have address conflicts with sockets under repair
* ====================================================================
*
* When we accept() an incoming connection the accept()ed socket will have the
* same local address as the listening socket. This can be a complication on
* migration. On the migration target we've already set up listening sockets
* according to the command line. However, to restore connections that we're
* migrating in we need to bind the new sockets to the same address, which would
* be an address conflict on the face of it. This test program verifies that
* enabling repair mode before bind() correctly suppresses that conflict.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
/* NOLINTNEXTLINE(bugprone-reserved-identifier,cert-dcl37-c,cert-dcl51-cpp) */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define PORT 13256U
#define CPORT 13257U
/* 127.0.0.1:PORT */
static const struct sockaddr_in addr = SOCKADDR_INIT(INADDR_LOOPBACK, PORT);
/* 127.0.0.1:CPORT */
static const struct sockaddr_in caddr = SOCKADDR_INIT(INADDR_LOOPBACK, CPORT);
/* Put ourselves into a network sandbox */
static void net_sandbox(void)
{
/* NOLINTNEXTLINE(altera-struct-pack-align) */
const struct req_t {
struct nlmsghdr nlh;
struct ifinfomsg ifm;
} __attribute__((packed)) req = {
.nlh.nlmsg_type = RTM_NEWLINK,
.nlh.nlmsg_flags = NLM_F_REQUEST,
.nlh.nlmsg_len = sizeof(req),
.nlh.nlmsg_seq = 1,
.ifm.ifi_family = AF_UNSPEC,
.ifm.ifi_index = 1,
.ifm.ifi_flags = IFF_UP,
.ifm.ifi_change = IFF_UP,
};
int nl;
if (unshare(CLONE_NEWUSER | CLONE_NEWNET))
die("unshare(): %s\n", strerror(errno));
/* Bring up lo in the new netns */
nl = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
if (nl < 0)
die("Can't create netlink socket: %s\n", strerror(errno));
if (send(nl, &req, sizeof(req), 0) < 0)
die("Netlink send(): %s\n", strerror(errno));
close(nl);
}
static void check(void)
{
int s1, s2, op;
s1 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s1 < 0)
die("socket() 1: %s\n", strerror(errno));
if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 1: %s\n", strerror(errno));
if (listen(s1, 0))
die("listen(): %s\n", strerror(errno));
s2 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s2 < 0)
die("socket() 2: %s\n", strerror(errno));
op = TCP_REPAIR_ON;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
if (bind(s2, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 2: %s\n", strerror(errno));
if (connect(s2, (struct sockaddr *)&caddr, sizeof(caddr)))
die("connect(): %s\n", strerror(errno));
op = TCP_REPAIR_OFF_NO_WP;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
close(s1);
close(s2);
}
int main(int argc, char *argv[])
{
(void)argc;
(void)argv;
net_sandbox();
check();
printf("Repair mode appears to properly suppress conflicts with listening sockets\n");
exit(0);
}


@ -46,13 +46,13 @@
/* Different cases for receiving socket configuration */
enum sock_type {
/* Socket is bound to 0.0.0.0:DSTPORT and not connected */
SOCK_BOUND_ANY = 0,
SOCK_BOUND_ANY,
/* Socket is bound to 127.0.0.1:DSTPORT and not connected */
SOCK_BOUND_LO = 1,
SOCK_BOUND_LO,
/* Socket is bound to 0.0.0.0:DSTPORT and connected to 127.0.0.1:SRCPORT */
SOCK_CONNECTED = 2,
SOCK_CONNECTED,
NUM_SOCK_TYPES,
};


@ -22,8 +22,8 @@ enum epoll_type {
EPOLL_TYPE_TCP_TIMER,
/* UDP "listening" sockets */
EPOLL_TYPE_UDP_LISTEN,
/* UDP socket for replies on a specific flow */
EPOLL_TYPE_UDP_REPLY,
/* UDP socket for a specific flow */
EPOLL_TYPE_UDP,
/* ICMP/ICMPv6 ping sockets */
EPOLL_TYPE_PING,
/* inotify fd watching for end of netns (pasta) */
@ -36,6 +36,14 @@ enum epoll_type {
EPOLL_TYPE_TAP_PASST,
/* socket listening for qemu socket connections */
EPOLL_TYPE_TAP_LISTEN,
/* vhost-user command socket */
EPOLL_TYPE_VHOST_CMD,
/* vhost-user kick event socket */
EPOLL_TYPE_VHOST_KICK,
/* TCP_REPAIR helper listening socket */
EPOLL_TYPE_REPAIR_LISTEN,
/* TCP_REPAIR helper socket */
EPOLL_TYPE_REPAIR,
EPOLL_NUM_TYPES,
};

flow.c

@ -19,6 +19,7 @@
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
const char *flow_state_str[] = {
[FLOW_STATE_FREE] = "FREE",
@ -52,6 +53,13 @@ const uint8_t flow_proto[] = {
static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
"flow_proto[] doesn't match enum flow_type");
#define foreach_established_tcp_flow(flow) \
flow_foreach_of_type((flow), FLOW_TCP) \
if (!tcp_flow_is_established(&(flow)->tcp)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
/* Global Flow Table */
/**
@ -73,7 +81,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
*
* Free cluster list
* flow_first_free gives the index of the first (lowest index) free cluster.
* Each free cluster has the index of the next free cluster, or MAX_FLOW if
* Each free cluster has the index of the next free cluster, or FLOW_MAX if
* it is the last free cluster. Together these form a linked list of free
* clusters, in strictly increasing order of index.
*
@ -259,11 +267,13 @@ int flowside_connect(const struct ctx *c, int s,
/** flow_log_ - Log flow-related message
* @f: flow the message is related to
* @newline: Append newline at the end of the message, if missing
* @pri: Log priority
* @fmt: Format string
* @...: printf-arguments
*/
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
{
const char *type_or_state;
char msg[BUFSIZ];
@ -279,7 +289,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
else
type_or_state = FLOW_TYPE(f);
logmsg(true, false, pri,
logmsg(newline, false, pri,
"Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
}
@ -299,7 +309,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
const struct flowside *tgt = &f->side[TGTSIDE];
if (state >= FLOW_STATE_TGT)
flow_log_(f, pri,
flow_log_(f, true, pri,
"%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@ -312,7 +322,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
tgt->eport);
else if (state >= FLOW_STATE_INI)
flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
flow_log_(f, true, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
@ -333,7 +343,7 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
flow_log_(f, true, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
@ -386,18 +396,27 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
* @flow: Flow to change state
* @pif: pif of the initiating side
* @ssa: Source socket address
* @daddr: Destination address (may be NULL)
* @dport: Destination port
*
* Return: pointer to the initiating flowside information
*/
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport)
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport)
{
struct flowside *ini = &flow->f.side[INISIDE];
inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
if (inany_v4(&ini->eaddr))
if (inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa) < 0) {
char str[SOCKADDR_STRLEN];
ASSERT_WITH_MSG(0, "Bad socket address %s",
sockaddr_ntop(ssa, str, sizeof(str)));
}
if (daddr)
ini->oaddr = *daddr;
else if (inany_v4(&ini->eaddr))
ini->oaddr = inany_any4;
else
ini->oaddr = inany_any6;
@ -414,8 +433,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
*
* Return: pointer to the target flowside information
*/
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
{
char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
struct flow_common *f = &flow->f;
@ -597,12 +616,7 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
const struct flowside *side = &f->side[sidx.sidei];
uint8_t pif = f->pif[sidx.sidei];
/* For the hash table to work, entries must have complete endpoint
* information, and at least a forwarding port.
*/
ASSERT(pif != PIF_NONE && !inany_is_unspecified(&side->eaddr) &&
side->eport != 0 && side->oport != 0);
ASSERT(pif != PIF_NONE);
return flow_hash(c, FLOW_PROTO(f), pif, side);
}
@ -746,19 +760,30 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: Interface of the flow
* @esa: Socket address of the endpoint
* @oaddr: Our address (may be NULL)
* @oport: Our port number
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport)
const void *esa,
const union inany_addr *oaddr, in_port_t oport)
{
struct flowside side = {
.oport = oport,
};
inany_from_sockaddr(&side.eaddr, &side.eport, esa);
if (inany_v4(&side.eaddr))
if (inany_from_sockaddr(&side.eaddr, &side.eport, esa) < 0) {
char str[SOCKADDR_STRLEN];
warn("Flow lookup on bad socket address %s",
sockaddr_ntop(esa, str, sizeof(str)));
return FLOW_SIDX_NONE;
}
if (oaddr)
side.oaddr = *oaddr;
else if (inany_v4(&side.eaddr))
side.oaddr = inany_any4;
else
side.oaddr = inany_any6;
@ -775,8 +800,9 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
{
struct flow_free_cluster *free_head = NULL;
unsigned *last_next = &flow_first_free;
bool to_free[FLOW_MAX] = { 0 };
bool timer = false;
unsigned idx;
union flow *flow;
if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
timer = true;
@ -785,49 +811,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
for (idx = 0; idx < FLOW_MAX; idx++) {
union flow *flow = &flowtab[idx];
/* Check which flows we might need to close first, but don't free them
* yet as it's not safe to do that in the middle of flow_foreach().
*/
flow_foreach(flow) {
bool closed = false;
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* New free cluster, add to chain */
free_head = &flow->free;
*last_next = idx;
last_next = &free_head->next;
}
/* Skip remaining empty entries */
idx += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
/* Nothing to do */
break;
default:
ASSERT(false);
}
switch (flow->f.type) {
case FLOW_TYPE_NONE:
ASSERT(false);
@ -846,7 +835,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
closed = udp_flow_defer(&flow->udp);
closed = udp_flow_defer(c, &flow->udp, now);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
@ -855,30 +844,321 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
;
}
if (closed) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
to_free[FLOW_IDX(flow)] = closed;
}
/* Second step: actually free the flows */
flow_foreach_slot(flow) {
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Add slot to current free cluster */
ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
free_head->n++;
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
/* New free cluster, add to chain */
free_head = &flow->free;
free_head->n = 1;
*last_next = idx;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
/* Skip remaining empty entries */
flow += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
if (to_free[FLOW_IDX(flow)]) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
if (free_head) {
/* Add slot to current free cluster */
ASSERT(FLOW_IDX(flow) ==
FLOW_IDX(free_head) + free_head->n);
free_head->n++;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
free_head = &flow->free;
free_head->n = 1;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
}
break;
default:
ASSERT(false);
}
}
*last_next = FLOW_MAX;
}
/**
* flow_migrate_source_rollback() - Disable repair mode, return failure
* @c: Execution context
* @bound: No need to roll back flow indices >= @bound
* @ret: Negative error code
*
* Return: @ret
*/
static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
{
union flow *flow;
debug("...roll back migration");
foreach_established_tcp_flow(flow) {
if (FLOW_IDX(flow) >= bound)
break;
if (tcp_flow_repair_off(c, &flow->tcp))
die("Failed to roll back TCP_REPAIR mode");
}
if (repair_flush(c))
die("Failed to roll back TCP_REPAIR mode");
return ret;
}
/**
* flow_migrate_need_repair() - Do we need to set repair mode for any flow?
*
* Return: true if repair mode is needed, false otherwise
*/
static bool flow_migrate_need_repair(void)
{
union flow *flow;
foreach_established_tcp_flow(flow)
return true;
return false;
}
/**
* flow_migrate_repair_all() - Turn repair mode on or off for all flows
* @c: Execution context
* @enable: Switch repair mode on if set, off otherwise
*
* Return: 0 on success, negative error code on failure
*/
static int flow_migrate_repair_all(struct ctx *c, bool enable)
{
union flow *flow;
int rc;
/* If we don't have a repair helper, there's nothing we can do */
if (c->fd_repair < 0)
return 0;
foreach_established_tcp_flow(flow) {
if (enable)
rc = tcp_flow_repair_on(c, &flow->tcp);
else
rc = tcp_flow_repair_off(c, &flow->tcp);
if (rc) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow),
rc);
}
}
if ((rc = repair_flush(c))) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow), rc);
}
return 0;
}
/**
* flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor (unused)
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
int rc;
(void)stage;
(void)fd;
if (flow_migrate_need_repair())
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
return 0;
}
/**
* flow_migrate_source() - Dump all the remaining information and send data
* @c: Execution context (unused)
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count = 0;
bool first = true;
union flow *flow;
int rc;
(void)c;
(void)stage;
/* If we don't have a repair helper, we can't migrate TCP flows */
if (c->fd_repair >= 0) {
foreach_established_tcp_flow(flow)
count++;
}
count = htonl(count);
if (write_all_buf(fd, &count, sizeof(count))) {
rc = errno;
err_perror("Can't send flow count (%u)", ntohl(count));
return flow_migrate_source_rollback(c, FLOW_MAX, rc);
}
debug("Sending %u flows", ntohl(count));
if (!count)
return 0;
/* Dump and send information that can be stored in the flow table.
*
* Limited rollback options here: if we fail to transfer any data (that
* is, on the first flow), undo everything and resume. Otherwise, the
* stream might now be inconsistent, and we might have closed listening
* TCP sockets, so just terminate.
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send data: %s",
strerror_(-rc));
if (!first)
die("Inconsistent migration state, exiting");
return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
}
first = false;
}
/* And then "extended" data (including window data we saved previously):
* the target needs to set repair mode on sockets before it can set
* this stuff, but it needs sockets (and flows) for that.
*
* This also closes sockets so that the target can start connecting
* theirs: you can't sendmsg() to queues (using the socket) if the
* socket is not connected (EPIPE), not even in repair mode. And the
* target needs to restore queues now because we're sending the data.
*
* So, no rollback here, just try as hard as we can. Tolerate per-flow
* failures but not if the stream might be inconsistent (reported here
* as EIO).
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send extended data: %s",
strerror_(-rc));
if (rc == -EIO)
die("Inconsistent migration state, exiting");
}
}
return 0;
}
/**
* flow_migrate_target() - Receive flows and insert in flow table
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count;
unsigned i;
int rc;
(void)stage;
if (read_all_buf(fd, &count, sizeof(count)))
return errno;
count = ntohl(count);
debug("Receiving %u flows", count);
if (!count)
return 0;
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
repair_flush(c);
/* TODO: flow header with type, instead? */
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target(c, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
repair_flush(c);
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target_ext(c, &flowtab[i].tcp, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
return 0;
}
/**
* flow_init() - Initialise flow related data structures
*/

flow.h

@ -243,18 +243,27 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
const void *eaddr, const void *oaddr,
in_port_t eport, in_port_t oport);
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport);
const void *esa,
const union inany_addr *oaddr, in_port_t oport);
union flow;
void flow_init(void);
void flow_defer_handler(const struct ctx *c, const struct timespec *now);
int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd);
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
__attribute__((format(printf, 3, 4)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, (pri), __VA_ARGS__)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
__attribute__((format(printf, 4, 5)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, true, (pri), __VA_ARGS__)
#define flow_dbg(f, ...) flow_log((f), LOG_DEBUG, __VA_ARGS__)
#define flow_err(f, ...) flow_log((f), LOG_ERR, __VA_ARGS__)
@ -264,6 +273,16 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
flow_dbg((f), __VA_ARGS__); \
} while (0)
#define flow_log_perror_(f, pri, ...) \
do { \
int errno_ = errno; \
flow_log_((f), false, (pri), __VA_ARGS__); \
logmsg(true, true, (pri), ": %s", strerror_(errno_)); \
} while (0)
#define flow_dbg_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_DEBUG, __VA_ARGS__)
#define flow_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_ERR, __VA_ARGS__)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state);
#define flow_log_details(f_, pri) \

View file

@ -50,6 +50,42 @@ extern union flow flowtab[];
#define flow_foreach_sidei(sidei_) \
for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)
/**
* flow_foreach_slot() - Step through each flow table entry
* @flow: Takes values of pointer to each flow table entry
*
* Includes FREE slots.
*/
#define flow_foreach_slot(flow) \
for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)
/**
* flow_foreach() - Step through each active flow
* @flow: Takes values of pointer to each active flow
*/
#define flow_foreach(flow) \
flow_foreach_slot((flow)) \
if ((flow)->f.state == FLOW_STATE_FREE) \
(flow) += (flow)->free.n - 1; \
else if ((flow)->f.state != FLOW_STATE_ACTIVE) { \
flow_err((flow), "Bad flow state during traversal"); \
continue; \
} else
/**
* flow_foreach_of_type() - Step through each active flow of given type
* @flow: Takes values of pointer to each flow
* @type_: Type of flow to traverse
*/
#define flow_foreach_of_type(flow, type_) \
flow_foreach((flow)) \
if ((flow)->f.type != (type_)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
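As a usage sketch (illustrative only, not part of this change), the traversal macros read like ordinary loops; FLOW_TCP is used here as an example type:

	union flow *flow;

	flow_foreach(flow) {
		/* Body only runs for ACTIVE entries; FREE runs are skipped
		 * in one step via the stored free-cluster length
		 */
		flow_dbg(flow, "active flow");
	}

	flow_foreach_of_type(flow, FLOW_TCP) {
		/* Body only runs for active TCP flows */
		flow_dbg(flow, "TCP flow");
	}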
/** flow_idx() - Index of flow from common structure
* @f: Common flow fields pointer
*
@ -57,6 +93,7 @@ extern union flow flowtab[];
*/
static inline unsigned flow_idx(const struct flow_common *f)
{
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
return (union flow *)f - flowtab;
}
@ -161,15 +198,16 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport);
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport);
const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type);
#define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_)

fwd.c

@ -323,6 +323,30 @@ static bool fwd_guest_accessible(const struct ctx *c,
return fwd_guest_accessible6(c, &addr->a6);
}
/**
* nat_outbound() - Apply address translation for outbound (TAP to HOST)
* @c: Execution context
* @addr: Input address (as seen on TAP interface)
* @translated: Output address (as seen on HOST interface)
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (inany_equals4(addr, &c->ip4.map_host_loopback))
*translated = inany_loopback4;
else if (inany_equals6(addr, &c->ip6.map_host_loopback))
*translated = inany_loopback6;
else if (inany_equals4(addr, &c->ip4.map_guest_addr))
*translated = inany_from_v4(c->ip4.addr);
else if (inany_equals6(addr, &c->ip6.map_guest_addr))
translated->a6 = c->ip6.addr;
else
*translated = *addr;
}
/**
* fwd_nat_from_tap() - Determine where to forward a flow from the tap interface
* @c: Execution context
@ -342,16 +366,8 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
else if (is_dns_flow(proto, ini) &&
inany_equals6(&ini->oaddr, &c->ip6.dns_match))
tgt->eaddr.a6 = c->ip6.dns_host;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_host_loopback))
tgt->eaddr = inany_loopback4;
else if (inany_equals6(&ini->oaddr, &c->ip6.map_host_loopback))
tgt->eaddr = inany_loopback6;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_guest_addr))
tgt->eaddr = inany_from_v4(c->ip4.addr);
else if (inany_equals6(&ini->oaddr, &c->ip6.map_guest_addr))
tgt->eaddr.a6 = c->ip6.addr;
else
tgt->eaddr = ini->oaddr;
nat_outbound(c, &ini->oaddr, &tgt->eaddr);
tgt->eport = ini->oport;
@ -402,7 +418,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
else
tgt->eaddr = inany_loopback6;
/* Preserve the specific loopback adddress used, but let the kernel pick
/* Preserve the specific loopback address used, but let the kernel pick
* a source port on the target side
*/
tgt->oaddr = ini->eaddr;
@ -423,6 +439,42 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
return PIF_HOST;
}
/**
* nat_inbound() - Apply address translation for inbound (HOST to TAP)
* @c: Execution context
* @addr: Input address (as seen on HOST interface)
* @translated: Output address (as seen on TAP interface)
*
* Return: true on success, false if it couldn't translate the address
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(addr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
*translated = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(addr, &in6addr_loopback)) {
translated->a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(addr, &c->ip4.addr)) {
*translated = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(addr, &c->ip6.addr)) {
translated->a6 = c->ip6.map_guest_addr;
} else if (fwd_guest_accessible(c, addr)) {
*translated = *addr;
} else {
return false;
}
return true;
}
/**
* fwd_nat_from_host() - Determine where to forward a flow from the host interface
* @c: Execution context
@ -443,7 +495,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
else if (proto == IPPROTO_UDP)
tgt->eport += c->udp.fwd_in.delta[tgt->eport];
if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) &&
if (!c->no_splice && inany_is_loopback(&ini->eaddr) &&
(proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
/* spliceable */
@ -479,20 +531,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
return PIF_SPLICE;
}
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(&ini->eaddr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
tgt->oaddr = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(&ini->eaddr, &in6addr_loopback)) {
tgt->oaddr.a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(&ini->eaddr, &c->ip4.addr)) {
tgt->oaddr = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(&ini->eaddr, &c->ip6.addr)) {
tgt->oaddr.a6 = c->ip6.map_guest_addr;
} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
if (!nat_inbound(c, &ini->eaddr, &tgt->oaddr)) {
if (inany_v4(&ini->eaddr)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
/* No source address we can use */
@ -501,8 +540,6 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
} else {
tgt->oaddr.a6 = c->ip6.our_tap_ll;
}
} else {
tgt->oaddr = ini->eaddr;
}
tgt->oport = ini->eport;

fwd.h

@ -7,6 +7,7 @@
#ifndef FWD_H
#define FWD_H
union inany_addr;
struct flowside;
/* Number of ports for both TCP and UDP */
@ -47,6 +48,8 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
const struct fwd_ports *tcp_rev);
void fwd_scan_ports_init(struct ctx *c);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated);
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,


@ -56,6 +56,7 @@ cd ..
make pkgs
scp passt passt.avx2 passt.1 qrap qrap.1 "${USER_HOST}:${BIN}"
scp pasta pasta.avx2 pasta.1 "${USER_HOST}:${BIN}"
scp passt-repair passt-repair.1 "${USER_HOST}:${BIN}"
ssh "${USER_HOST}" "rm -f ${BIN}/*.deb"
ssh "${USER_HOST}" "rm -f ${BIN}/*.rpm"

icmp.c

@ -85,7 +85,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl);
if (n < 0) {
flow_err(pingf, "recvfrom() error: %s", strerror(errno));
flow_perror(pingf, "recvfrom() error");
return;
}
@ -150,7 +150,7 @@ unexpected:
static void icmp_ping_close(const struct ctx *c,
const struct icmp_ping_flow *pingf)
{
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
epoll_del(c, pingf->sock);
close(pingf->sock);
flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
}
@ -300,8 +300,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0);
if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
flow_dbg(pingf, "failed to relay request to socket: %s",
strerror(errno));
flow_dbg_perror(pingf, "failed to relay request to socket");
} else {
flow_dbg(pingf,
"echo request to socket, ID: %"PRIu16", seq: %"PRIu16,

inany.h

@ -237,23 +237,30 @@ static inline void inany_from_af(union inany_addr *aa,
}
/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
* @aa: Pointer to store IPv[46] address
* @dst: Pointer to store IPv[46] address (output)
* @port: Pointer to store port number, host order
* @addr: AF_INET or AF_INET6 socket address
* @addr: Socket address
*
* Return: 0 on success, -1 on error (bad address family)
*/
static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
const union sockaddr_inany *sa)
static inline int inany_from_sockaddr(union inany_addr *dst, in_port_t *port,
const void *addr)
{
const union sockaddr_inany *sa = (const union sockaddr_inany *)addr;
if (sa->sa_family == AF_INET6) {
inany_from_af(aa, AF_INET6, &sa->sa6.sin6_addr);
inany_from_af(dst, AF_INET6, &sa->sa6.sin6_addr);
*port = ntohs(sa->sa6.sin6_port);
} else if (sa->sa_family == AF_INET) {
inany_from_af(aa, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
} else {
/* Not valid to call with other address families */
ASSERT(0);
return 0;
}
if (sa->sa_family == AF_INET) {
inany_from_af(dst, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
return 0;
}
return -1;
}
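A brief caller sketch (hypothetical names, assuming a connected socket sock) showing how the new return value replaces the old ASSERT() on unknown address families:

	union sockaddr_inany sa;
	socklen_t sl = sizeof(sa);
	union inany_addr addr;
	in_port_t port;

	if (getpeername(sock, &sa.sa, &sl) < 0)
		return -errno;

	if (inany_from_sockaddr(&addr, &port, &sa) < 0)
		return -EAFNOSUPPORT;	/* neither AF_INET nor AF_INET6 */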
/** inany_siphash_feed- Fold IPv[46] address into an in-progress siphash

iov.c

@ -26,7 +26,8 @@
#include "iov.h"
/* iov_skip_bytes() - Skip leading bytes of an IO vector
/**
* iov_skip_bytes() - Skip leading bytes of an IO vector
* @iov: IO vector
* @n: Number of entries in @iov
* @skip: Number of leading bytes of @iov to skip
@ -56,8 +57,8 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
}
/**
* iov_from_buf - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
* iov_from_buf() - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -68,7 +69,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
*
* Returns: The number of bytes successfully copied.
*/
/* cppcheck-suppress unusedFunction */
size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
size_t offset, const void *buf, size_t bytes)
{
@ -97,8 +97,8 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_to_buf - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
* iov_to_buf() - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
*
* @iov: Pointer to the array of struct iovec describing the scatter/gather
* I/O vector.
@ -137,8 +137,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_size - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
* iov_size() - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -156,3 +156,95 @@ size_t iov_size(const struct iovec *iov, size_t iov_cnt)
return len;
}
/**
* iov_tail_prune() - Remove any unneeded buffers from an IOV tail
* @tail: IO vector tail (modified)
*
* If an IOV tail's offset is large enough, it may not include any bytes from
* the first (or first several) buffers in the underlying IO vector. Modify the
* tail's representation so it contains the same logical bytes, but only
* includes buffers that are actually needed. This will avoid stepping through
* unnecessary elements of the underlying IO vector on future operations.
*
* Return: true if the tail still contains any bytes, otherwise false
*/
bool iov_tail_prune(struct iov_tail *tail)
{
size_t i;
i = iov_skip_bytes(tail->iov, tail->cnt, tail->off, &tail->off);
tail->iov += i;
tail->cnt -= i;
return !!tail->cnt;
}
/**
* iov_tail_size - Calculate the total size of an IO vector tail
* @tail: IO vector tail
*
* Returns: The total size in bytes.
*/
size_t iov_tail_size(struct iov_tail *tail)
{
iov_tail_prune(tail);
return iov_size(tail->iov, tail->cnt) - tail->off;
}
/**
* iov_peek_header_() - Get pointer to a header from an IOV tail
* @tail: IOV tail to get header from
* @len: Length of header to get, in bytes
* @align: Required alignment of header, in bytes
*
* @tail may be pruned, but will represent the same bytes as before.
*
* Returns: Pointer to the first @len logical bytes of the tail, NULL if that
* overruns the IO vector, is not contiguous or doesn't have the
* requested alignment.
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align)
{
char *p;
if (!iov_tail_prune(tail))
return NULL; /* Nothing left */
if (tail->off + len < tail->off)
return NULL; /* Overflow */
if (tail->off + len > tail->iov[0].iov_len)
return NULL; /* Not contiguous */
p = (char *)tail->iov[0].iov_base + tail->off;
if ((uintptr_t)p % align)
return NULL; /* not aligned */
return p;
}
/**
* iov_remove_header_() - Remove a header from an IOV tail
* @tail: IOV tail to remove header from (modified)
* @len: Length of header to remove, in bytes
* @align: Required alignment of header, in bytes
*
* On success, @tail is updated so that it no longer includes the bytes of the
* returned header.
*
* Returns: Pointer to the first @len logical bytes of the tail, NULL if that
* overruns the IO vector, is not contiguous or doesn't have the
* requested alignment.
*/
void *iov_remove_header_(struct iov_tail *tail, size_t len, size_t align)
{
char *p = iov_peek_header_(tail, len, align);
if (!p)
return NULL;
tail->off = tail->off + len;
return p;
}

iov.h

@ -28,4 +28,80 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
size_t offset, void *buf, size_t bytes);
size_t iov_size(const struct iovec *iov, size_t iov_cnt);
/*
* DOC: Theory of Operation, struct iov_tail
*
* Sometimes a single logical network frame is split across multiple buffers,
* represented by an IO vector (struct iovec[]). We often want to process this
* one header / network layer at a time. So, it's useful to maintain a "tail"
* of the vector representing the parts we haven't yet extracted.
*
* The headers we extract need not line up with buffer boundaries (though we do
* assume they're contiguous within a single buffer for now). So, we could
* represent that tail as another struct iovec[], but that would mean copying
* the whole array of struct iovecs, just so we can adjust the offset and length
* on the first one.
*
* So, instead represent the tail as pointer into an existing struct iovec[],
* with an explicit offset for where the "tail" starts within it. If we extract
* enough headers that some buffers of the original vector no longer contain
* part of the tail, we (lazily) advance our struct iovec * to the first buffer
* we still need, and adjust the vector length and offset to match.
*/
/**
* struct iov_tail - An IO vector which may have some headers logically removed
* @iov: IO vector
* @cnt: Number of entries in @iov
* @off: Current offset in @iov
*/
struct iov_tail {
const struct iovec *iov;
size_t cnt, off;
};
/**
* IOV_TAIL() - Create a new IOV tail
* @iov_: IO vector to create tail from
* @cnt_: Length of the IO vector at @iov_
* @off_: Byte offset in the IO vector where the tail begins
*/
#define IOV_TAIL(iov_, cnt_, off_) \
(struct iov_tail){ .iov = (iov_), .cnt = (cnt_), .off = (off_) }
bool iov_tail_prune(struct iov_tail *tail);
size_t iov_tail_size(struct iov_tail *tail);
void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align);
void *iov_remove_header_(struct iov_tail *tail, size_t len, size_t align);
/**
* IOV_PEEK_HEADER() - Get typed pointer to a header from an IOV tail
* @tail_: IOV tail to get header from
* @type_: Data type of the header
*
* @tail_ may be pruned, but will represent the same bytes as before.
*
* Returns: Pointer of type (@type_ *) located at the start of @tail_, NULL if
* we can't get a contiguous and aligned pointer.
*/
#define IOV_PEEK_HEADER(tail_, type_) \
((type_ *)(iov_peek_header_((tail_), \
sizeof(type_), __alignof__(type_))))
/**
* IOV_REMOVE_HEADER() - Remove and return typed header from an IOV tail
* @tail_: IOV tail to remove header from (modified)
* @type_: Data type of the header to remove
*
* On success, @tail_ is updated so that it no longer includes the bytes of the
* returned header.
*
* Returns: Pointer of type (@type_ *) located at the old start of @tail_, NULL
* if we can't get a contiguous and aligned pointer.
*/
#define IOV_REMOVE_HEADER(tail_, type_) \
((type_ *)(iov_remove_header_((tail_), \
sizeof(type_), __alignof__(type_))))
#endif /* IOVEC_H */
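To make the iov_tail helpers concrete, a short sketch (buffer names and lengths are made up) that strips an Ethernet header and then peeks at the IPv4 header behind it:

	struct iovec iov[2] = {
		{ .iov_base = hdr_buf,  .iov_len = hdr_len  },
		{ .iov_base = data_buf, .iov_len = data_len },
	};
	struct iov_tail tail = IOV_TAIL(iov, 2, 0);
	struct ethhdr *eh;
	struct iphdr *iph;

	eh = IOV_REMOVE_HEADER(&tail, struct ethhdr);	/* advances the tail */
	iph = IOV_PEEK_HEADER(&tail, struct iphdr);	/* leaves the tail as is */
	if (!eh || !iph)
		return;	/* too short, not contiguous, or misaligned */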

ip.h

@ -36,13 +36,14 @@
.tos = 0, \
.tot_len = 0, \
.id = 0, \
.frag_off = 0, \
.frag_off = htons(IP_DF), \
.ttl = 0xff, \
.protocol = (proto), \
.saddr = 0, \
.daddr = 0, \
}
#define L2_BUF_IP4_PSUM(proto) ((uint32_t)htons_constant(0x4500) + \
(uint32_t)htons_constant(IP_DF) + \
(uint32_t)htons(0xff00 | (proto)))
@ -90,10 +91,34 @@ struct ipv6_opt_hdr {
*/
} __attribute__((packed)); /* required for some archs */
/**
* ip6_set_flow_lbl() - Set flow label in an IPv6 header
* @ip6h: Pointer to IPv6 header, updated
* @flow: Set @ip6h flow label to the low 20 bits of this integer
*/
static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow)
{
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
}
/** ip6_get_flow_lbl() - Get flow label from an IPv6 header
* @ip6h: Pointer to IPv6 header
*
* Return: flow label from @ip6h as an integer (<= 20 bits)
*/
static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
{
return (ip6h->flow_lbl[0] & 0xf) << 16 |
ip6h->flow_lbl[1] << 8 |
ip6h->flow_lbl[2];
}
char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
size_t *dlen);
/* IPv6 link-local all-nodes multicast adddress, ff02::1 */
/* IPv6 link-local all-nodes multicast address, ff02::1 */
static const struct in6_addr in6addr_ll_all_nodes = {
.s6_addr = {
0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@ -104,4 +129,11 @@ static const struct in6_addr in6addr_ll_all_nodes = {
/* IPv4 Limited Broadcast (RFC 919, Section 7), 255.255.255.255 */
static const struct in_addr in4addr_broadcast = { 0xffffffff };
#ifndef IPV4_MIN_MTU
#define IPV4_MIN_MTU 68
#endif
#ifndef IPV6_MIN_MTU
#define IPV6_MIN_MTU 1280
#endif
#endif /* IP_H */
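A tiny round-trip sketch for the flow label helpers above, showing that only the low 20 bits survive:

	struct ipv6hdr ip6h = { .version = 6 };
	uint32_t label = 0xabcde;	/* arbitrary 20-bit value */

	ip6_set_flow_lbl(&ip6h, label);
	ASSERT(ip6_get_flow_lbl(&ip6h) == (label & 0xfffff));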

isolation.c

@ -129,7 +129,7 @@ static void drop_caps_ep_except(uint64_t keep)
* additional layer of protection. Executing this requires
* CAP_SETPCAP, which we will have within our userns.
*
* Note that dropping capabilites from the bounding set limits
* Note that dropping capabilities from the bounding set limits
* exec()ed processes, but does not remove them from the effective or
* permitted sets, so it doesn't reduce our own capabilities.
*/
@ -174,8 +174,8 @@ static void clamp_caps(void)
* Should:
* - drop unneeded capabilities
* - close all open files except for standard streams and the one from --fd
* Musn't:
* - remove filesytem access (we need to access files during setup)
* Mustn't:
* - remove filesystem access (we need to access files during setup)
*/
void isolate_initial(int argc, char **argv)
{
@ -194,7 +194,7 @@ void isolate_initial(int argc, char **argv)
*
* It's debatable whether it's useful to drop caps when we
* retain SETUID and SYS_ADMIN, but we might as well. We drop
* further capabilites in isolate_user() and
* further capabilities in isolate_user() and
* isolate_prefork().
*/
keep = BIT(CAP_NET_BIND_SERVICE) | BIT(CAP_SETUID) | BIT(CAP_SETGID) |
@ -379,12 +379,21 @@ void isolate_postfork(const struct ctx *c)
prctl(PR_SET_DUMPABLE, 0);
if (c->mode == MODE_PASTA) {
prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
prog.filter = filter_pasta;
} else {
switch (c->mode) {
case MODE_PASST:
prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
prog.filter = filter_passt;
break;
case MODE_PASTA:
prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
prog.filter = filter_pasta;
break;
case MODE_VU:
prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
prog.filter = filter_vu;
break;
default:
ASSERT(0);
}
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||

log.c

@ -56,7 +56,7 @@ bool log_stderr = true; /* Not daemonised, no shell spawned */
*
* Return: pointer to @now, or NULL if there was an error retrieving the time
*/
const struct timespec *logtime(struct timespec *ts)
static const struct timespec *logtime(struct timespec *ts)
{
if (clock_gettime(CLOCK_MONOTONIC, ts))
return NULL;
@ -249,6 +249,30 @@ static void logfile_write(bool newline, bool cont, int pri,
log_written += n;
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
static void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* vlogmsg() - Print or send messages to log or output files as configured
* @newline: Append newline at the end of the message, if missing
@ -257,6 +281,7 @@ static void logfile_write(bool newline, bool cont, int pri,
* @format: Message
* @ap: Variable argument list
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
{
bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;
@ -322,7 +347,7 @@ void logmsg_perror(int pri, const char *format, ...)
vlogmsg(false, false, pri, format, ap);
va_end(ap);
logmsg(true, true, pri, ": %s", strerror(errno_copy));
logmsg(true, true, pri, ": %s", strerror_(errno_copy));
}
/**
@ -373,35 +398,11 @@ void __setlogmask(int mask)
setlogmask(mask);
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* logfile_init() - Open log file and write header with PID, version, path
* @name: Identifier for header: passt or pasta
* @path: Path to log file
* @size: Maximum size of log file: log_cut_size is calculatd here
* @size: Maximum size of log file: log_cut_size is calculated here
*/
void logfile_init(const char *name, const char *path, size_t size)
{

log.h

@ -32,13 +32,13 @@ void logmsg_perror(int pri, const char *format, ...)
#define die(...) \
do { \
err(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
#define die_perror(...) \
do { \
err_perror(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
extern int log_trace;
@ -55,7 +55,6 @@ void trace_init(int enable);
void __openlog(const char *ident, int option, int facility);
void logfile_init(const char *name, const char *path, size_t size);
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap);
void __setlogmask(int mask);
#endif /* LOG_H */

migrate.c (new file)

@ -0,0 +1,304 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* migrate.c - Migration sections, layout, and routines
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "migrate.h"
#include "repair.h"
/* Magic identifier for migration data */
#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
/**
* struct migrate_seen_addrs_v1 - Migratable guest addresses for v1 state stream
* @addr6: Observed guest IPv6 address
* @addr6_ll: Observed guest IPv6 link-local address
* @addr4: Observed guest IPv4 address
* @mac: Observed guest MAC address
*/
struct migrate_seen_addrs_v1 {
struct in6_addr addr6;
struct in6_addr addr6_ll;
struct in_addr addr4;
unsigned char mac[ETH_ALEN];
} __attribute__((packed));
/**
* seen_addrs_source_v1() - Copy and send guest observed addresses from source
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int seen_addrs_source_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs = {
.addr6 = c->ip6.addr_seen,
.addr6_ll = c->ip6.addr_ll_seen,
.addr4 = c->ip4.addr_seen,
};
(void)stage;
memcpy(addrs.mac, c->guest_mac, sizeof(addrs.mac));
if (write_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
return 0;
}
/**
* seen_addrs_target_v1() - Receive and use guest observed addresses on target
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int seen_addrs_target_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs;
(void)stage;
if (read_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
c->ip6.addr_seen = addrs.addr6;
c->ip6.addr_ll_seen = addrs.addr6_ll;
c->ip4.addr_seen = addrs.addr4;
memcpy(c->guest_mac, addrs.mac, sizeof(c->guest_mac));
return 0;
}
/* Stages for version 2 */
static const struct migrate_stage stages_v2[] = {
{
.name = "observed addresses",
.source = seen_addrs_source_v1,
.target = seen_addrs_target_v1,
},
{
.name = "prepare flows",
.source = flow_migrate_source_pre,
.target = NULL,
},
{
.name = "transfer flows",
.source = flow_migrate_source,
.target = flow_migrate_target,
},
{ 0 },
};
/* Supported encoding versions, from latest (most preferred) to oldest */
static const struct migrate_version versions[] = {
{ 2, stages_v2, },
/* v1 was released, but not widely used. It had bad endianness for the
* MSS and omitted timestamps, which meant it usually wouldn't work.
* Therefore we don't attempt to support compatibility with it.
*/
{ 0 },
};
/* Current encoding version */
#define CURRENT_VERSION (&versions[0])
/**
* migrate_source() - Migration as source, send state to hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_source(struct ctx *c, int fd)
{
const struct migrate_version *v = CURRENT_VERSION;
const struct migrate_header header = {
.magic = htonll_constant(MIGRATE_MAGIC),
.version = htonl(v->id),
.compat_version = htonl(v->id),
};
const struct migrate_stage *s;
int ret;
if (write_all_buf(fd, &header, sizeof(header))) {
ret = errno;
err("Can't send migration header: %s, abort", strerror_(ret));
return ret;
}
for (s = v->s; s->name; s++) {
if (!s->source)
continue;
debug("Source side migration stage: %s", s->name);
if ((ret = s->source(c, s, fd))) {
err("Source migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_target_read_header() - Read header in target
* @fd: Descriptor for state transfer
*
* Return: version structure on success, NULL on failure with errno set
*/
static const struct migrate_version *migrate_target_read_header(int fd)
{
const struct migrate_version *v;
struct migrate_header h;
uint32_t id, compat_id;
if (read_all_buf(fd, &h, sizeof(h)))
return NULL;
id = ntohl(h.version);
compat_id = ntohl(h.compat_version);
debug("Source magic: 0x%016" PRIx64 ", version: %u, compat: %u",
ntohll(h.magic), id, compat_id);
if (ntohll(h.magic) != MIGRATE_MAGIC || !id || !compat_id) {
err("Invalid incoming device state");
errno = EINVAL;
return NULL;
}
for (v = versions; v->id; v++)
if (v->id <= id && v->id >= compat_id)
return v;
errno = ENOTSUP;
err("Unsupported device state version: %u", id);
return NULL;
}
/**
* migrate_target() - Migration as target, receive state from hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_target(struct ctx *c, int fd)
{
const struct migrate_version *v;
const struct migrate_stage *s;
int ret;
if (!(v = migrate_target_read_header(fd)))
return errno;
for (s = v->s; s->name; s++) {
if (!s->target)
continue;
debug("Target side migration stage: %s", s->name);
if ((ret = s->target(c, s, fd))) {
err("Target migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_init() - Set up things necessary for migration
* @c: Execution context
*/
void migrate_init(struct ctx *c)
{
c->device_state_result = -1;
}
/**
* migrate_close() - Close migration channel and connection to passt-repair
* @c: Execution context
*/
void migrate_close(struct ctx *c)
{
if (c->device_state_fd != -1) {
debug("Closing migration channel, fd: %d", c->device_state_fd);
close(c->device_state_fd);
c->device_state_fd = -1;
c->device_state_result = -1;
}
repair_close(c);
}
/**
* migrate_request() - Request a migration of device state
* @c: Execution context
* @fd: fd to transfer state
* @target: Are we the target of the migration?
*/
void migrate_request(struct ctx *c, int fd, bool target)
{
debug("Migration requested, fd: %d (was %d)", fd, c->device_state_fd);
if (c->device_state_fd != -1)
migrate_close(c);
c->device_state_fd = fd;
c->migrate_target = target;
}
/**
* migrate_handler() - Send/receive passt internal state to/from hypervisor
* @c: Execution context
*/
void migrate_handler(struct ctx *c)
{
int rc;
if (c->device_state_fd < 0)
return;
debug("Handling migration request from fd: %d, target: %d",
c->device_state_fd, c->migrate_target);
if (c->migrate_target)
rc = migrate_target(c, c->device_state_fd);
else
rc = migrate_source(c, c->device_state_fd);
migrate_close(c);
c->device_state_result = rc;
}

migrate.h (new file)

@ -0,0 +1,51 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef MIGRATE_H
#define MIGRATE_H
/**
* struct migrate_header - Migration header from source
* @magic: 0xB1BB1D1B0BB1D1B0, network order
* @version: Highest known, target aborts if too old, network order
* @compat_version: Lowest version compatible with @version, target aborts
* if too new, network order
*/
struct migrate_header {
uint64_t magic;
uint32_t version;
uint32_t compat_version;
} __attribute__((packed));
/**
* struct migrate_stage - Callbacks and parameters for one stage of migration
* @name: Stage name (for debugging)
* @source: Callback to implement this stage on the source
* @target: Callback to implement this stage on the target
*/
struct migrate_stage {
const char *name;
int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
/* Add here separate rollback callbacks if needed */
};
/**
* struct migrate_version - Stages for a particular protocol version
* @id: Version number, host order
* @s: Ordered array of stages, NULL-terminated
*/
struct migrate_version {
uint32_t id;
const struct migrate_stage *s;
};
void migrate_init(struct ctx *c);
void migrate_close(struct ctx *c);
void migrate_request(struct ctx *c, int fd, bool target);
void migrate_handler(struct ctx *c);
#endif /* MIGRATE_H */

ndp.c

@ -256,7 +256,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
ptr = &ra.var[0];
if (c->mtu != -1) {
if (c->mtu) {
struct opt_mtu *mtu = (struct opt_mtu *)ptr;
*mtu = (struct opt_mtu) {
.header = {
@ -328,6 +328,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);
}

netlink.c

@ -297,6 +297,10 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
if (!thisifi)
continue; /* No interface for this route */
/* Skip 'lo': we should test IFF_LOOPBACK, but keep it simple */
if (thisifi == 1)
continue;
/* Skip routes to link-local addresses */
if (af == AF_INET && dst &&
IN4_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))
@ -320,7 +324,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
}
if (status < 0)
warn("netlink: RTM_GETROUTE failed: %s", strerror(-status));
warn("netlink: RTM_GETROUTE failed: %s", strerror_(-status));
if (defifi) {
if (ndef > 1) {
@ -351,7 +355,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
*
* Return: true if a gateway was found, false otherwise
*/
bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
static bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
{
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;

packet.c

@ -22,12 +22,74 @@
#include "util.h"
#include "log.h"
/**
* packet_check_range() - Check if a memory range is valid for a pool
* @p: Packet pool
* @ptr: Start of desired data range
* @len: Length of desired data range
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: 0 if the range is valid, -1 otherwise
*/
static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
const char *func, int line)
{
if (len > PACKET_MAX_LEN) {
debug("packet range length %zu (max %zu), %s:%i",
len, PACKET_MAX_LEN, func, line);
return -1;
}
if (p->buf_size == 0) {
int ret;
ret = vu_packet_check_range((void *)p->buf, ptr, len);
if (ret == -1)
debug("cannot find region, %s:%i", func, line);
return ret;
}
if (ptr < p->buf) {
debug("packet range start %p before buffer start %p, %s:%i",
(void *)ptr, (void *)p->buf, func, line);
return -1;
}
if (len > p->buf_size) {
debug("packet range length %zu larger than buffer %zu, %s:%i",
len, p->buf_size, func, line);
return -1;
}
if ((size_t)(ptr - p->buf) > p->buf_size - len) {
debug("packet range %p, len %zu after buffer end %p, %s:%i",
(void *)ptr, len, (void *)(p->buf + p->buf_size),
func, line);
return -1;
}
return 0;
}
/**
* pool_full() - Is a packet pool full?
* @p: Pointer to packet pool
*
* Return: true if the pool is full, false if more packets can be added
*/
bool pool_full(const struct pool *p)
{
return p->count >= p->size;
}
/**
* packet_add_do() - Add data as packet descriptor to given pool
* @p: Existing pool
* @len: Length of new descriptor
* @start: Start of data
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*/
void packet_add_do(struct pool *p, size_t len, const char *start,
@ -35,44 +97,63 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
{
size_t idx = p->count;
if (idx >= p->size) {
trace("add packet index %zu to pool with size %zu, %s:%i",
if (pool_full(p)) {
debug("add packet index %zu to pool with size %zu, %s:%i",
idx, p->size, func, line);
return;
}
if (start < p->buf) {
trace("add packet start %p before buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
if (packet_check_range(p, start, len, func, line))
return;
}
if (start + len > p->buf + p->buf_size) {
trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
(void *)start, len, (void *)(p->buf + p->buf_size),
func, line);
return;
}
if (len > UINT16_MAX) {
trace("add packet length %zu, %s:%i", len, func, line);
return;
}
#if UINTPTR_MAX == UINT64_MAX
if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
trace("add packet start %p, buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
return;
}
#endif
p->pkt[idx].offset = start - p->buf;
p->pkt[idx].len = len;
p->pkt[idx].iov_base = (void *)start;
p->pkt[idx].iov_len = len;
p->count++;
}
/**
* packet_get_try_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
* @idx: Index of packet descriptor in pool
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
*/
void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
{
char *ptr;
ASSERT_WITH_MSG(p->count <= p->size,
"Corrupt pool count: %zu, size: %zu, %s:%i",
p->count, p->size, func, line);
if (idx >= p->count) {
debug("packet %zu from pool count: %zu, %s:%i",
idx, p->count, func, line);
return NULL;
}
if (offset > p->pkt[idx].iov_len ||
len > (p->pkt[idx].iov_len - offset))
return NULL;
ptr = (char *)p->pkt[idx].iov_base + offset;
ASSERT_WITH_MSG(!packet_check_range(p, ptr, len, func, line),
"Corrupt packet pool, %s:%i", func, line);
if (left)
*left = p->pkt[idx].iov_len - offset - len;
return ptr;
}
/**
* packet_get_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
@ -80,52 +161,24 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
* Return: as packet_get_try_do() but log a trace message when returning NULL
*/
void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line)
{
if (idx >= p->size || idx >= p->count) {
if (func) {
trace("packet %zu from pool size: %zu, count: %zu, "
"%s:%i", idx, p->size, p->count, func, line);
}
return NULL;
void *r = packet_get_try_do(p, idx, offset, len, left, func, line);
if (!r) {
trace("missing packet data length %zu, offset %zu from "
"length %zu, %s:%i",
len, offset, p->pkt[idx].iov_len, func, line);
}
if (len > UINT16_MAX || len + offset > UINT32_MAX) {
if (func) {
trace("packet data length %zu, offset %zu, %s:%i",
len, offset, func, line);
}
return NULL;
}
if (p->pkt[idx].offset + len + offset > p->buf_size) {
if (func) {
trace("packet offset plus length %zu from size %zu, "
"%s:%i", p->pkt[idx].offset + len + offset,
p->buf_size, func, line);
}
return NULL;
}
if (len + offset > p->pkt[idx].len) {
if (func) {
trace("data length %zu, offset %zu from length %u, "
"%s:%i", len, offset, p->pkt[idx].len,
func, line);
}
return NULL;
}
if (left)
*left = p->pkt[idx].len - offset - len;
return p->buf + p->pkt[idx].offset + offset;
return r;
}
/**

packet.h

@ -6,20 +6,17 @@
#ifndef PACKET_H
#define PACKET_H
/**
* struct desc - Generic offset-based descriptor within buffer
* @offset: Offset of descriptor relative to buffer start, 32-bit limit
* @len: Length of descriptor, host order, 16-bit limit
*/
struct desc {
uint32_t offset;
uint16_t len;
};
#include <stdbool.h>
/* Maximum size of a single packet stored in pool, including headers */
#define PACKET_MAX_LEN ((size_t)UINT16_MAX)
/**
* struct pool - Generic pool of packets stored in a buffer
* @buf: Buffer storing packet descriptors
* @buf_size: Total size of buffer
* @buf: Buffer storing packet descriptors,
* a struct vu_dev_region array for passt vhost-user mode
* @buf_size: Total size of buffer,
* 0 for passt vhost-user mode
* @size: Number of usable descriptors for the pool
* @count: Number of used descriptors for the pool
* @pkt: Descriptors: see macros below
@ -29,32 +26,36 @@ struct pool {
size_t buf_size;
size_t size;
size_t count;
struct desc pkt[1];
struct iovec pkt[];
};
int vu_packet_check_range(void *buf, const char *ptr, size_t len);
void packet_add_do(struct pool *p, size_t len, const char *start,
const char *func, int line);
void *packet_get_try_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
bool pool_full(const struct pool *p);
void pool_flush(struct pool *p);
#define packet_add(p, len, start) \
packet_add_do(p, len, start, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_try_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, NULL, 0)
#define PACKET_POOL_DECL(_name, _size, _buf) \
struct _name ## _t { \
char *buf; \
size_t buf_size; \
size_t size; \
size_t count; \
struct desc pkt[_size]; \
struct iovec pkt[_size]; \
}
#define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size) \

passt-repair.1 (new file)

@ -0,0 +1,74 @@
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\" Copyright (c) 2025 Red Hat GmbH
.\" Author: Stefano Brivio <sbrivio@redhat.com>
.TH passt-repair 1
.SH NAME
.B passt-repair
\- Helper setting TCP_REPAIR socket options for \fBpasst\fR(1)
.SH SYNOPSIS
.B passt-repair
\fIPATH\fR
.SH DESCRIPTION
.B passt-repair
is a privileged helper setting and clearing repair mode on TCP sockets on behalf
of \fBpasst\fR(1), as instructed via single-byte commands over a UNIX domain
socket.
It can be used to migrate TCP connections between guests without granting
additional capabilities to \fBpasst\fR(1) itself: to migrate TCP connections,
\fBpasst\fR(1) leverages repair mode, which needs the \fBCAP_NET_ADMIN\fR
capability (see \fBcapabilities\fR(7)) to be set or cleared.
If \fIPATH\fR represents a UNIX domain socket, \fBpasst-repair\fR(1) attempts to
connect to it. If it is a directory, \fBpasst-repair\fR(1) waits until a file
ending with \fI.repair\fR appears in it, and then attempts to connect to it.
.SH PROTOCOL
\fBpasst-repair\fR(1) connects to \fBpasst\fR(1) using the socket specified via
\fI--repair-path\fR option in \fBpasst\fR(1) itself. By default, the name is the
same as the UNIX domain socket used for guest communication, suffixed by
\fI.repair\fR.
The messages consist of one 8-bit signed integer that can be \fITCP_REPAIR_ON\fR
(1), \fITCP_REPAIR_OFF\fR (0), or \fITCP_REPAIR_OFF_NO_WP\fR (-1), as defined by
the Linux kernel user API, and from one to SCM_MAX_FD (253) sockets passed as an
SCM_RIGHTS ancillary message (see \fBunix\fR(7)), sent by the server, \fBpasst\fR(1).
The client, \fBpasst-repair\fR(1), replies with the same byte (and no ancillary
message) to indicate success, and closes the connection on failure.
The server closes the connection on error or completion.
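As a sketch of the message layout described above (assuming a connected UNIX domain socket s and a single TCP socket fd to switch; not taken from the sources):

	int8_t cmd = TCP_REPAIR_ON;
	char buf[CMSG_SPACE(sizeof(int))]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = buf,
			      .msg_controllen = sizeof(buf) };
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	if (sendmsg(s, &msg, 0) < 0 ||
	    recv(s, &cmd, sizeof(cmd), 0) <= 0)
		;	/* failure: the helper closed the connection */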
.SH NOTES
\fBpasst-repair\fR(1) can be granted the \fBCAP_NET_ADMIN\fR capability
(preferred, as it limits privileges to the strictly necessary ones), or it can
be run as root.
.SH AUTHOR
Stefano Brivio <sbrivio@redhat.com>.
.SH REPORTING BUGS
Please report issues on the bug tracker at https://bugs.passt.top/, or
send a message to the passt-user@passt.top mailing list, see
https://lists.passt.top/.
.SH COPYRIGHT
Copyright (c) 2025 Red Hat GmbH.
\fBpasst-repair\fR is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation, either version 2 of the License, or (at your option) any
later version.
.SH SEE ALSO
\fBpasst\fR(1), \fBqemu\fR(1), \fBcapabilities\fR(7), \fBunix\fR(7).

passt-repair.c (new file)

@ -0,0 +1,266 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*
* Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
* with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
* off. Reply by echoing the command. Exit on EOF.
*/
#include <sys/inotify.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
#include <linux/audit.h>
#include <linux/capability.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include "seccomp_repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
#define REPAIR_EXT ".repair"
#define REPAIR_EXT_LEN strlen(REPAIR_EXT)
/**
* main() - Entry point and whole program with loop
* @argc: Argument count, must be 2
* @argv: Argument: path of UNIX domain socket to connect to
*
* Return: 0 on success (EOF), 1 on error, 2 on usage error
*
* #syscalls:repair connect setsockopt write close exit_group
* #syscalls:repair socket s390x:socketcall i686:socketcall
* #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
* #syscalls:repair sendto sendmsg arm:send ppc64le:send
* #syscalls:repair stat|statx stat64|statx statx
* #syscalls:repair fstat|fstat64 newfstatat|fstatat64
* #syscalls:repair inotify_init1 inotify_add_watch
*/
int main(int argc, char **argv)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr))));
struct sockaddr_un a = { AF_UNIX, "" };
int fds[SCM_MAX_FD], s, ret, i, n = 0;
bool inotify_dir = false;
struct sock_fprog prog;
int8_t cmd = INT8_MAX;
struct cmsghdr *cmsg;
struct msghdr msg;
struct iovec iov;
size_t cmsg_len;
struct stat sb;
int op;
prctl(PR_SET_DUMPABLE, 0);
prog.len = (unsigned short)sizeof(filter_repair) /
sizeof(filter_repair[0]);
prog.filter = filter_repair;
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
fprintf(stderr, "Failed to apply seccomp filter");
_exit(1);
}
iov = (struct iovec){ &cmd, sizeof(cmd) };
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = sizeof(buf),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
if (argc != 2) {
fprintf(stderr, "Usage: %s PATH\n", argv[0]);
_exit(2);
}
if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
_exit(1);
}
if ((stat(argv[1], &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", argv[1], errno);
_exit(1);
}
if ((sb.st_mode & S_IFMT) == S_IFDIR) {
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev = NULL;
char path[PATH_MAX + 1];
bool found = false;
ssize_t n;
int fd;
if ((fd = inotify_init1(IN_CLOEXEC)) < 0) {
fprintf(stderr, "inotify_init1: %i\n", errno);
_exit(1);
}
if (inotify_add_watch(fd, argv[1], IN_CREATE) < 0) {
fprintf(stderr, "inotify_add_watch: %i\n", errno);
_exit(1);
}
do {
char *p;
n = read(fd, buf, sizeof(buf));
if (n < 0) {
fprintf(stderr, "inotify read: %i", errno);
_exit(1);
}
buf[n - 1] = '\0';
if (n < (ssize_t)sizeof(*ev)) {
fprintf(stderr, "Short inotify read: %zi", n);
continue;
}
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
if (ev->len >= REPAIR_EXT_LEN &&
!memcmp(ev->name +
strnlen(ev->name, ev->len) -
REPAIR_EXT_LEN,
REPAIR_EXT, REPAIR_EXT_LEN)) {
found = true;
break;
}
}
} while (!found);
if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
fprintf(stderr, "Invalid filename from inotify\n");
_exit(1);
}
snprintf(path, sizeof(path), "%s/%s", argv[1], ev->name);
if ((stat(path, &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", path, errno);
_exit(1);
}
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", path);
inotify_dir = true;
} else {
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
}
if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
fprintf(stderr, "Invalid socket path");
_exit(2);
}
if ((sb.st_mode & S_IFMT) != S_IFSOCK) {
fprintf(stderr, "%s is not a socket\n", a.sun_path);
_exit(2);
}
while (connect(s, (struct sockaddr *)&a, sizeof(a))) {
if (inotify_dir && errno == ECONNREFUSED)
continue;
fprintf(stderr, "Failed to connect to %s: %s\n", a.sun_path,
strerror(errno));
_exit(1);
}
loop:
ret = recvmsg(s, &msg, 0);
if (ret < 0) {
if (errno == ECONNRESET) {
ret = 0;
} else {
fprintf(stderr, "Failed to read message: %i\n", errno);
_exit(1);
}
}
if (!ret) /* Done */
_exit(0);
if (!cmsg ||
cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
cmsg->cmsg_type != SCM_RIGHTS) {
fprintf(stderr, "No/bad ancillary data from peer\n");
_exit(1);
}
/* No inverse formula for CMSG_LEN(x), and building one with CMSG_LEN(0)
* works but there's no guarantee it does. Search the whole domain.
*/
for (i = 1; i <= SCM_MAX_FD; i++) {
if (CMSG_LEN(sizeof(int) * i) == cmsg->cmsg_len) {
n = i;
break;
}
}
if (!n) {
cmsg_len = cmsg->cmsg_len; /* socklen_t is 'unsigned' on musl */
fprintf(stderr, "Invalid ancillary data length %zu from peer\n",
cmsg_len);
_exit(1);
}
memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
cmd != TCP_REPAIR_OFF_NO_WP) {
fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
_exit(1);
}
op = cmd;
for (i = 0; i < n; i++) {
if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &op, sizeof(op))) {
fprintf(stderr,
"Setting TCP_REPAIR to %i on socket %i: %s", op,
fds[i], strerror(errno));
_exit(1);
}
/* Close _our_ copy */
close(fds[i]);
}
/* Confirm setting by echoing the command back */
if (send(s, &cmd, sizeof(cmd), 0) < 0) {
fprintf(stderr, "Reply to %i: %s\n", op, strerror(errno));
_exit(1);
}
goto loop;
return 0;
}

passt.1

@ -401,15 +401,44 @@ Enable IPv6-only operation. IPv4 traffic will be ignored.
By default, IPv4 operation is enabled as long as at least an IPv4 route and an
interface address are configured on a given host interface.
.TP
.BR \-H ", " \-\-hostname " " \fIname
Hostname to configure the client with.
Send \fIname\fR as DHCP option 12 (hostname).
.TP
.BR \-\-fqdn " " \fIname
FQDN to configure the client with.
Send \fIname\fR as Client FQDN: DHCP option 81 and DHCPv6 option 39.
.SS \fBpasst\fR-only options
.TP
.BR \-s ", " \-\-socket " " \fIpath
.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
\fBpasst\fR.
Default is to probe a free socket, not accepting connections, starting from
\fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
.TP
.BR \-\-vhost-user
Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
.TP
.BR \-\-print-capabilities
Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
.TP
.BR \-\-repair-path " " \fIpath
Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
migration. \fB--repair-path none\fR disables this interface (if you need to
specify a socket path called "none" you can prefix the path by \fI./\fR).
Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
chosen for the hypervisor UNIX domain socket. No socket is created if not in
\-\-vhost-user mode.
.TP
.BR \-F ", " \-\-fd " " \fIFD
Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
@ -614,7 +643,7 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
.TP
.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
.BR \-\-host-lo-to-ns-lo
If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
the host's loopback address will appear on the loopback address in the
guest as well. Without this option such forwarded packets will appear
@ -687,6 +716,11 @@ Configure MAC address \fIaddr\fR on the tap interface in the namespace.
Default is to let the tap driver build a pseudorandom hardware address.
.TP
.BR \-\-no-splice
Disable the bypass path for inbound, local traffic. See the section \fBHandling
of local traffic in pasta\fR in the \fBNOTES\fR for more details.
.SH EXAMPLES
.SS \fBpasta
@ -928,10 +962,16 @@ with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
Similarly, for traffic coming from guest or namespace, packets with
destination address corresponding to the \fB\-\-map-host-loopback\fR
address will have their destination address translated to a loopback
address.
Similarly, for traffic coming from guest or namespace, packets with destination
address corresponding to the \fB\-\-map-host-loopback\fR address will have their
destination address translated to a loopback address.
As an exception, traffic identified as DNS, originally directed to the
\fB\-\-map-host-loopback\fR address, if this address matches a resolver address
on the host, is \fBnot\fR translated to loopback, but rather handled in the same
way as if specified as \-\-dns-forward address, if no such option was given.
In the common case where the host gateway also acts as a resolver, this prevents
the host mapping from shadowing the gateway/resolver itself.
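
A minimal C sketch of the translation rule described above, for illustration
only (the helper and the names map_host_loopback and dns_host are assumptions,
not passt's actual forwarding code):

#include <netinet/in.h>
#include <arpa/inet.h>

/* Sketch: rewrite a guest-side IPv4 destination per --map-host-loopback */
static void map_host_loopback_sketch(struct in_addr *dst, in_port_t dport,
				     struct in_addr map_host_loopback,
				     struct in_addr dns_host)
{
	if (dst->s_addr != map_host_loopback.s_addr)
		return;				/* Not the mapped address */

	/* DNS to the mapped address, matching a host resolver: leave it to
	 * the DNS forwarding path instead of remapping to loopback
	 */
	if (dport == htons(53) && dst->s_addr == dns_host.s_addr)
		return;

	dst->s_addr = htonl(INADDR_LOOPBACK);	/* Remap to loopback */
}
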
.SS Handling of local traffic in pasta

passt.c (60 lines changed)

@ -50,6 +50,9 @@
#include "log.h"
#include "tcp_splice.h"
#include "ndp.h"
#include "vu_common.h"
#include "migrate.h"
#include "repair.h"
#define EPOLL_EVENTS 8
@ -65,13 +68,17 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
[EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket",
[EPOLL_TYPE_UDP_REPLY] = "UDP reply socket",
[EPOLL_TYPE_UDP] = "UDP flow socket",
[EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket",
[EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch",
[EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch",
[EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device",
[EPOLL_TYPE_TAP_PASST] = "connected qemu socket",
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
[EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket",
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@ -159,11 +166,11 @@ void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
*
* #syscalls exit_group
*/
void exit_handler(int signal)
static void exit_handler(int signal)
{
(void)signal;
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
/**
@ -177,14 +184,13 @@ void exit_handler(int signal)
* #syscalls socket getsockopt setsockopt s390x:socketcall i686:socketcall close
* #syscalls bind connect recvfrom sendto shutdown
* #syscalls arm:recv ppc64le:recv arm:send ppc64le:send
* #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
* #syscalls accept4 accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
* #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64
*/
int main(int argc, char **argv)
{
struct epoll_event events[EPOLL_EVENTS];
int nfds, i, devnull_fd = -1;
char argv0[PATH_MAX], *name;
struct ctx c = { 0 };
struct rlimit limit;
struct timespec now;
@ -198,6 +204,7 @@ int main(int argc, char **argv)
isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
c.device_state_fd = -1;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
@ -205,27 +212,18 @@ int main(int argc, char **argv)
sigaction(SIGTERM, &sa, NULL);
sigaction(SIGQUIT, &sa, NULL);
if (argc < 1)
exit(EXIT_FAILURE);
c.mode = conf_mode(argc, argv);
strncpy(argv0, argv[0], PATH_MAX - 1);
name = basename(argv0);
if (strstr(name, "pasta")) {
if (c.mode == MODE_PASTA) {
sa.sa_handler = pasta_child_handler;
if (sigaction(SIGCHLD, &sa, NULL))
die_perror("Couldn't install signal handlers");
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
c.mode = MODE_PASTA;
} else if (strstr(name, "passt")) {
c.mode = MODE_PASST;
} else {
exit(EXIT_FAILURE);
}
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
madvise(pkt_buf, sizeof(pkt_buf), MADV_HUGEPAGE);
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
if (c.epollfd == -1)
@ -245,7 +243,7 @@ int main(int argc, char **argv)
pasta_netns_quit_init(&c);
tap_sock_init(&c);
tap_backend_init(&c);
random_init(&c);
@ -255,7 +253,7 @@ int main(int argc, char **argv)
flow_init();
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
@ -341,12 +339,24 @@ loop:
case EPOLL_TYPE_UDP_LISTEN:
udp_listen_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_UDP_REPLY:
udp_reply_sock_handler(&c, ref, eventmask, &now);
case EPOLL_TYPE_UDP:
udp_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_PING:
icmp_sock_handler(&c, ref);
break;
case EPOLL_TYPE_VHOST_CMD:
vu_control_handler(c.vdev, c.fd_tap, eventmask);
break;
case EPOLL_TYPE_VHOST_KICK:
vu_kick_cb(c.vdev, ref, &now);
break;
case EPOLL_TYPE_REPAIR_LISTEN:
repair_listen_handler(&c, eventmask);
break;
case EPOLL_TYPE_REPAIR:
repair_handler(&c, eventmask);
break;
default:
/* Can't happen */
ASSERT(0);
@ -355,5 +365,7 @@ loop:
post_handler(&c, &now);
migrate_handler(&c);
goto loop;
}

passt.h (39 lines changed)

@ -20,11 +20,13 @@ union epoll_ref;
#include "siphash.h"
#include "ip.h"
#include "inany.h"
#include "migrate.h"
#include "flow.h"
#include "icmp.h"
#include "fwd.h"
#include "tcp.h"
#include "udp.h"
#include "vhost_user.h"
/* Default address for our end on the tap interface. Bit 0 of byte 0 must be 0
* (unicast) and bit 1 of byte 1 must be 1 (locally administered). Otherwise
@ -43,6 +45,7 @@ union epoll_ref;
* @icmp: ICMP-specific reference part
* @data: Data handled by protocol handlers
* @nsdir_fd: netns dirfd for fallback timer checking if namespace is gone
* @queue: vhost-user queue index for this fd
* @u64: Opaque reference for epoll_ctl() and epoll_wait()
*/
union epoll_ref {
@ -58,6 +61,7 @@ union epoll_ref {
union udp_listen_epoll_ref udp;
uint32_t data;
int nsdir_fd;
int queue;
};
};
uint64_t u64;
@ -65,12 +69,9 @@ union epoll_ref {
static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
"epoll_ref must have same size as epoll_data");
#define TAP_BUF_BYTES \
ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
#define TAP_MSGS \
DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
/* Large enough for ~128 maximum size frames */
#define PKT_BUF_BYTES (8UL << 20)
#define PKT_BUF_BYTES MAX(TAP_BUF_BYTES, 0)
extern char pkt_buf [PKT_BUF_BYTES];
extern char *epoll_type_str[];
@ -94,6 +95,7 @@ struct fqdn {
enum passt_modes {
MODE_PASST,
MODE_PASTA,
MODE_VU,
};
/**
@ -189,6 +191,7 @@ struct ip6_ctx {
* @foreground: Run in foreground, don't log to stderr by default
* @nofile: Maximum number of open files (ulimit -n)
* @sock_path: Path for UNIX domain socket
* @repair_path: TCP_REPAIR helper path, can be "none", empty for default
* @pcap: Path for packet capture file
* @pidfile: Path to PID file, empty string if not configured
* @pidfile_fd: File descriptor for PID file, -1 if none
@ -199,12 +202,16 @@ struct ip6_ctx {
* @epollfd: File descriptor for epoll instance
* @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any
* @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket
* @fd_repair_listen: File descriptor for listening TCP_REPAIR socket, if any
* @fd_repair: Connected AF_UNIX socket for TCP_REPAIR helper
* @our_tap_mac: Pasta/passt's MAC on the tap link
* @guest_mac: MAC address of guest or namespace, seen or configured
* @hash_secret: 128-bit secret for siphash functions
* @ifi4: Template interface for IPv4, -1: none, 0: IPv4 disabled
* @ip: IPv4 configuration
* @dns_search: DNS search list
* @hostname: Guest hostname
* @fqdn: Guest FQDN
* @ifi6: Template interface for IPv6, -1: none, 0: IPv6 disabled
* @ip6: IPv6 configuration
* @pasta_ifn: Name of namespace interface for pasta
@ -225,10 +232,15 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
* @no_splice: Disable socket splicing for inbound traffic
* @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @freebind: Allow binding of non-local addresses for forwarding
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
* @vdev: vhost-user device
* @device_state_fd: Device state migration channel
* @device_state_result: Device state migration result
* @migrate_target: Are we the target, on the next migration request?
*/
struct ctx {
enum passt_modes mode;
@ -238,6 +250,7 @@ struct ctx {
int foreground;
int nofile;
char sock_path[UNIX_PATH_MAX];
char repair_path[UNIX_PATH_MAX];
char pcap[PATH_MAX];
char pidfile[PATH_MAX];
@ -254,8 +267,12 @@ struct ctx {
int epollfd;
int fd_tap_listen;
int fd_tap;
int fd_repair_listen;
int fd_repair;
unsigned char our_tap_mac[ETH_ALEN];
unsigned char guest_mac[ETH_ALEN];
uint16_t mtu;
uint64_t hash_secret[2];
int ifi4;
@ -263,6 +280,9 @@ struct ctx {
struct fqdn dns_search[MAXDNSRCH];
char hostname[PASST_MAXDNAME];
char fqdn[PASST_MAXDNAME];
int ifi6;
struct ip6_ctx ip6;
@ -277,7 +297,6 @@ struct ctx {
int no_icmp;
struct icmp_ctx icmp;
int mtu;
int no_dns;
int no_dns_search;
int no_dhcp_dns;
@ -286,11 +305,19 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
int no_splice;
int host_lo_to_ns_lo;
int freebind;
int low_wmem;
int low_rmem;
struct vu_dev *vdev;
/* Migration */
int device_state_fd;
int device_state_result;
bool migrate_target;
};
void proto_update_l2_buf(const unsigned char *eth_d,

pasta.c (67 lines changed)

@ -73,12 +73,12 @@ void pasta_child_handler(int signal)
!waitid(P_PID, pasta_child_pid, &infop, WEXITED | WNOHANG)) {
if (infop.si_pid == pasta_child_pid) {
if (infop.si_code == CLD_EXITED)
exit(infop.si_status);
_exit(infop.si_status);
/* If killed by a signal, si_status is the number.
* Follow common shell convention of returning it + 128.
*/
exit(infop.si_status + 128);
_exit(infop.si_status + 128);
/* Nothing to do, detached PID namespace going away */
}
@ -169,10 +169,12 @@ void pasta_open_ns(struct ctx *c, const char *netns)
* struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
* @exe: Executable to run
* @argv: Command and arguments to run
* @c: Context to read config from
*/
struct pasta_spawn_cmd_arg {
const char *exe;
char *const *argv;
struct ctx *c;
};
/**
@ -186,6 +188,7 @@ static int pasta_spawn_cmd(void *arg)
{
char hostname[HOST_NAME_MAX + 1] = HOSTNAME_PREFIX;
const struct pasta_spawn_cmd_arg *a;
size_t conf_hostname_len;
sigset_t set;
/* We run in a detached PID and mount namespace: mount /proc over */
@ -195,9 +198,15 @@ static int pasta_spawn_cmd(void *arg)
if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
warn("Cannot set ping_group_range, ICMP requests might fail");
if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
a = (const struct pasta_spawn_cmd_arg *)arg;
conf_hostname_len = strlen(a->c->hostname);
if (conf_hostname_len > 0) {
if (sethostname(a->c->hostname, conf_hostname_len))
warn("Unable to set configured hostname");
} else if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
hostname[HOST_NAME_MAX] = '\0';
if (sethostname(hostname, strlen(hostname)))
warn("Unable to set pasta-prefixed hostname");
@ -208,7 +217,6 @@ static int pasta_spawn_cmd(void *arg)
sigaddset(&set, SIGUSR1);
sigwaitinfo(&set, NULL);
a = (const struct pasta_spawn_cmd_arg *)arg;
execvp(a->exe, a->argv);
die_perror("Failed to start command or shell");
@ -230,6 +238,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
struct pasta_spawn_cmd_arg arg = {
.exe = argv[0],
.argv = argv,
.c = c,
};
char uidmap[BUFSIZ], gidmap[BUFSIZ];
char *sh_argv[] = { NULL, NULL };
@ -296,7 +305,7 @@ void pasta_ns_conf(struct ctx *c)
rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
if (rc < 0)
die("Couldn't bring up loopback interface in namespace: %s",
strerror(-rc));
strerror_(-rc));
/* Get or set MAC in target namespace */
if (MAC_IS_ZERO(c->guest_mac))
@ -305,12 +314,12 @@ void pasta_ns_conf(struct ctx *c)
rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac);
if (rc < 0)
die("Couldn't set MAC address in namespace: %s",
strerror(-rc));
strerror_(-rc));
if (c->pasta_conf_ns) {
unsigned int flags = IFF_UP;
if (c->mtu != -1)
if (c->mtu)
nl_link_set_mtu(nl_sock_ns, c->pasta_ifi, c->mtu);
if (c->ifi6) /* Avoid duplicate address detection on link up */
@ -332,7 +341,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv4 address(es) in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ip4.no_copy_routes) {
@ -346,7 +355,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv4 route(s) in guest: %s",
strerror(-rc));
strerror_(-rc));
}
}
@ -355,13 +364,13 @@ void pasta_ns_conf(struct ctx *c)
&c->ip6.addr_ll_seen);
if (rc < 0) {
warn("Can't get LL address from namespace: %s",
strerror(-rc));
strerror_(-rc));
}
rc = nl_addr_set_ll_nodad(nl_sock_ns, c->pasta_ifi);
if (rc < 0) {
warn("Can't set nodad for LL in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
/* We dodged DAD: re-enable neighbour solicitations */
@ -382,7 +391,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv6 address(es) in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ip6.no_copy_routes) {
@ -397,7 +406,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv6 route(s) in guest: %s",
strerror(-rc));
strerror_(-rc));
}
}
}
@ -446,18 +455,18 @@ void pasta_netns_quit_init(const struct ctx *c)
return;
if ((dir_fd = open(c->netns_dir, O_CLOEXEC | O_RDONLY)) < 0)
die("netns dir open: %s, exiting", strerror(errno));
die("netns dir open: %s, exiting", strerror_(errno));
if (fstatfs(dir_fd, &s) || s.f_type == DEVPTS_SUPER_MAGIC ||
s.f_type == PROC_SUPER_MAGIC || s.f_type == SYSFS_MAGIC)
try_inotify = false;
if (try_inotify && (fd = inotify_init1(flags)) < 0)
warn("inotify_init1(): %s, use a timer", strerror(errno));
warn("inotify_init1(): %s, use a timer", strerror_(errno));
if (fd >= 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) {
warn("inotify_add_watch(): %s, use a timer",
strerror(errno));
strerror_(errno));
close(fd);
fd = -1;
}
@ -489,17 +498,23 @@ void pasta_netns_quit_init(const struct ctx *c)
*/
void pasta_netns_quit_inotify_handler(struct ctx *c, int inotify_fd)
{
char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
const struct inotify_event *in_ev = (struct inotify_event *)buf;
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev;
ssize_t n;
char *p;
if (read(inotify_fd, buf, sizeof(buf)) < (ssize_t)sizeof(*in_ev))
if ((n = read(inotify_fd, buf, sizeof(buf))) < (ssize_t)sizeof(*ev))
return;
if (strncmp(in_ev->name, c->netns_base, sizeof(c->netns_base)))
return;
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
if (!strncmp(ev->name, c->netns_base, sizeof(c->netns_base))) {
info("Namespace %s is gone, exiting", c->netns_base);
_exit(EXIT_SUCCESS);
}
}
}
/**
@ -525,7 +540,7 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
return;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
close(fd);

pcap.c (47 lines changed)

@ -33,33 +33,12 @@
#include "log.h"
#include "pcap.h"
#include "iov.h"
#include "tap.h"
#define PCAP_VERSION_MINOR 4
static int pcap_fd = -1;
/* See pcap.h from libpcap, or pcap-savefile(5) */
static const struct {
uint32_t magic;
#define PCAP_MAGIC 0xa1b2c3d4
uint16_t major;
#define PCAP_VERSION_MAJOR 2
uint16_t minor;
#define PCAP_VERSION_MINOR 4
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
#define PCAP_LINKTYPE_ETHERNET 1
} pcap_hdr = {
PCAP_MAGIC, PCAP_VERSION_MAJOR, PCAP_VERSION_MINOR, 0, 0, ETH_MAX_MTU,
PCAP_LINKTYPE_ETHERNET
};
struct pcap_pkthdr {
uint32_t tv_sec;
uint32_t tv_usec;
@ -143,7 +122,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
* @iovcnt: Number of buffers (@iov entries)
* @offset: Offset of the L2 frame within the full data length
*/
/* cppcheck-suppress unusedFunction */
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
{
struct timespec now = { 0 };
@ -163,6 +141,29 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
*/
void pcap_init(struct ctx *c)
{
/* See pcap.h from libpcap, or pcap-savefile(5) */
#define PCAP_MAGIC 0xa1b2c3d4
#define PCAP_VERSION_MAJOR 2
#define PCAP_VERSION_MINOR 4
#define PCAP_LINKTYPE_ETHERNET 1
const struct {
uint32_t magic;
uint16_t major;
uint16_t minor;
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
} pcap_hdr = {
.magic = PCAP_MAGIC,
.major = PCAP_VERSION_MAJOR,
.minor = PCAP_VERSION_MINOR,
.snaplen = tap_l2_max_len(c),
.linktype = PCAP_LINKTYPE_ETHERNET
};
if (pcap_fd != -1)
return;

repair.c (new file, 255 lines)

@ -0,0 +1,255 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
/* Wait for a while for TCP_REPAIR helper to connect if it's not there yet */
#define REPAIR_ACCEPT_TIMEOUT_MS 10
#define REPAIR_ACCEPT_TIMEOUT_US (REPAIR_ACCEPT_TIMEOUT_MS * 1000)
/* Pending file descriptors for next repair_flush() call, or command change */
static int repair_fds[SCM_MAX_FD];
/* Pending command: flush pending file descriptors if it changes */
static int8_t repair_cmd;
/* Number of pending file descriptors set in @repair_fds */
static int repair_nfds;
/**
* repair_sock_init() - Start listening for connections on helper socket
* @c: Execution context
*/
void repair_sock_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
struct epoll_event ev = { 0 };
if (c->fd_repair_listen == -1)
return;
if (listen(c->fd_repair_listen, 0)) {
err_perror("listen() on repair helper socket, won't migrate");
return;
}
ref.fd = c->fd_repair_listen;
ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
err_perror("repair helper socket epoll_ctl(), won't migrate");
}
/**
* repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
* @c: Execution context
* @events: epoll events
*/
void repair_listen_handler(struct ctx *c, uint32_t events)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
struct epoll_event ev = { 0 };
struct ucred ucred;
socklen_t len;
if (events != EPOLLIN) {
debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
events);
return;
}
len = sizeof(ucred);
/* Another client is already connected: accept and close right away. */
if (c->fd_repair != -1) {
int discard = accept4(c->fd_repair_listen, NULL, NULL,
SOCK_NONBLOCK);
if (discard == -1)
return;
if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
close(discard);
return;
}
if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
debug_perror("accept4() on TCP_REPAIR helper listening socket");
return;
}
if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
ref.fd = c->fd_repair;
ev.events = EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
close(c->fd_repair);
c->fd_repair = -1;
}
}
/**
* repair_close() - Close connection to TCP_REPAIR helper
* @c: Execution context
*/
void repair_close(struct ctx *c)
{
debug("Closing TCP_REPAIR helper socket");
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
close(c->fd_repair);
c->fd_repair = -1;
}
/**
* repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
* @c: Execution context
* @events: epoll events
*/
void repair_handler(struct ctx *c, uint32_t events)
{
(void)events;
repair_close(c);
}
/**
* repair_wait() - Wait (with timeout) for TCP_REPAIR helper to connect
* @c: Execution context
*/
void repair_wait(struct ctx *c)
{
struct timeval tv = { .tv_sec = 0,
.tv_usec = (long)(REPAIR_ACCEPT_TIMEOUT_US) };
static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000,
".tv_usec is greater than 1000 * 1000");
if (c->fd_repair >= 0 || c->fd_repair_listen == -1)
return;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv))) {
err_perror("Set timeout on TCP_REPAIR listening socket");
return;
}
repair_listen_handler(c, EPOLLIN);
tv.tv_usec = 0;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv)))
err_perror("Clear timeout on TCP_REPAIR listening socket");
}
/**
* repair_flush() - Flush current set of sockets to helper, with current command
* @c: Execution context
*
* Return: 0 on success, negative error code on failure
*/
int repair_flush(struct ctx *c)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr)))) = { 0 };
struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
struct cmsghdr *cmsg;
struct msghdr msg;
int8_t reply;
if (!repair_nfds)
return 0;
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = CMSG_SPACE(sizeof(int) *
repair_nfds),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
repair_nfds = 0;
if (sendmsg(c->fd_repair, &msg, 0) < 0) {
int ret = -errno;
err_perror("Failed to send sockets to TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (recv(c->fd_repair, &reply, sizeof(reply), 0) < 0) {
int ret = -errno;
err_perror("Failed to receive reply from TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (reply != repair_cmd) {
err("Unexpected reply from TCP_REPAIR helper: %d", reply);
repair_close(c);
return -ENXIO;
}
return 0;
}
/**
* repair_set() - Add socket to TCP_REPAIR set with given command
* @c: Execution context
* @s: Socket to add
* @cmd: TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
*
* Return: 0 on success, negative error code on failure
*/
int repair_set(struct ctx *c, int s, int cmd)
{
int rc;
if (repair_nfds && repair_cmd != cmd) {
if ((rc = repair_flush(c)))
return rc;
}
repair_cmd = cmd;
repair_fds[repair_nfds++] = s;
if (repair_nfds >= SCM_MAX_FD) {
if ((rc = repair_flush(c)))
return rc;
}
return 0;
}
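
The caller of this interface is the TCP migration code, whose diff is suppressed
further below. As a hedged sketch only (the body is illustrative, not the actual
tcp.c implementation), switching a connection's socket into repair mode through
this API might look like:

/* Sketch only: queue a connection's socket for TCP_REPAIR_ON and push the
 * pending batch to the passt-repair helper. TCP_REPAIR_ON is the same
 * constant handled by the helper loop shown at the top of this diff.
 */
static int flow_repair_on_sketch(struct ctx *c, const struct tcp_tap_conn *conn)
{
	int rc;

	/* repair_set() flushes by itself if the command changes or the
	 * SCM_RIGHTS batch fills up; flush explicitly for anything left
	 */
	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
		return rc;

	return repair_flush(c);
}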

repair.h (new file, 17 lines)

@ -0,0 +1,17 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef REPAIR_H
#define REPAIR_H
void repair_sock_init(const struct ctx *c);
void repair_listen_handler(struct ctx *c, uint32_t events);
void repair_handler(struct ctx *c, uint32_t events);
void repair_close(struct ctx *c);
void repair_wait(struct ctx *c);
int repair_flush(struct ctx *c);
int repair_set(struct ctx *c, int s, int cmd);
#endif /* REPAIR_H */


@ -14,8 +14,10 @@
# Author: Stefano Brivio <sbrivio@redhat.com>
TMP="$(mktemp)"
IN="$@"
OUT="$(mktemp)"
OUT_FINAL="${1}"
shift
IN="$@"
[ -z "${ARCH}" ] && ARCH="$(uname -m)"
[ -z "${CC}" ] && CC="cc"
@ -253,7 +255,7 @@ for __p in ${__profiles}; do
__calls="${__calls} ${EXTRA_SYSCALLS:-}"
__calls="$(filter ${__calls})"
cols="$(stty -a | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
cols="$(stty -a 2>/dev/null | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
case $cols in [0-9]*) col_args="-w ${cols}";; *) col_args="";; esac
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t ${col_args}
@ -268,4 +270,4 @@ for __p in ${__profiles}; do
gen_profile "${__p}" ${__calls}
done
mv "${OUT}" seccomp.h
mv "${OUT}" "${OUT_FINAL}"

tap.c (441 lines changed)

@ -56,16 +56,70 @@
#include "netlink.h"
#include "pasta.h"
#include "packet.h"
#include "repair.h"
#include "tap.h"
#include "log.h"
#include "vhost_user.h"
#include "vu_common.h"
/* Maximum allowed frame lengths (including L2 header) */
/* Verify that an L2 frame length limit is large enough to contain the header,
* but small enough to fit in the packet pool
*/
#define CHECK_FRAME_LEN(len) \
static_assert((len) >= ETH_HLEN && (len) <= PACKET_MAX_LEN, \
#len " has bad value")
CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
CHECK_FRAME_LEN(L2_MAX_LEN_VU);
/* We try to size the packet pools so that we can use a single batch for the entire
* packet buffer. This might be exceeded for vhost-user, though, which uses its
* own buffers rather than pkt_buf.
*
* This is just a tuning parameter, the code will work with slightly more
* overhead if it's incorrect. So, we estimate based on the minimum practical
* frame size - an empty UDP datagram - rather than the minimum theoretical
* frame size.
*
* FIXME: Profile to work out how big this actually needs to be to amortise
* per-batch syscall overheads
*/
#define TAP_MSGS_IP4 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr))
#define TAP_MSGS_IP6 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct udphdr))
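/* As a rough worked example, with the 8 MiB pkt_buf defined in passt.h
 * (PKT_BUF_BYTES == 8UL << 20), these evaluate to:
 * TAP_MSGS_IP4 = DIV_ROUND_UP(8388608, 14 + 20 + 8) = 199729 slots
 * TAP_MSGS_IP6 = DIV_ROUND_UP(8388608, 14 + 40 + 8) = 135301 slots
 */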
/* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS_IP4, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
#define TAP_SEQS 128 /* Different L4 tuples in one batch */
#define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */
/**
* tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
* @c: Execution context
*
* Return: maximum L2 frame length (including L2 header) for the current mode
*/
unsigned long tap_l2_max_len(const struct ctx *c)
{
/* NOLINTBEGIN(bugprone-branch-clone): values can be the same */
switch (c->mode) {
case MODE_PASST:
return L2_MAX_LEN_PASST;
case MODE_PASTA:
return L2_MAX_LEN_PASTA;
case MODE_VU:
return L2_MAX_LEN_VU;
}
/* NOLINTEND(bugprone-branch-clone) */
ASSERT(0);
}
/**
* tap_send_single() - Send a single frame
* @c: Execution context
@ -78,16 +132,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
struct iovec iov[2];
size_t iovcnt = 0;
if (c->mode == MODE_PASST) {
switch (c->mode) {
case MODE_PASST:
iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
iovcnt++;
/* fall through */
case MODE_PASTA:
iov[iovcnt].iov_base = (void *)data;
iov[iovcnt].iov_len = l2len;
iovcnt++;
tap_send_frames(c, iov, iovcnt, 1);
break;
case MODE_VU:
vu_send_single(c, data, l2len);
break;
}
iov[iovcnt].iov_base = (void *)data;
iov[iovcnt].iov_len = l2len;
iovcnt++;
tap_send_frames(c, iov, iovcnt, 1);
}
/**
@ -113,7 +173,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
{
struct ethhdr *eh = (struct ethhdr *)buf;
@ -134,8 +194,8 @@ static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
{
uint16_t l3len = l4len + sizeof(*ip4h);
@ -144,13 +204,43 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
ip4h->tos = 0;
ip4h->tot_len = htons(l3len);
ip4h->id = 0;
ip4h->frag_off = 0;
ip4h->frag_off = htons(IP_DF);
ip4h->ttl = 255;
ip4h->protocol = proto;
ip4h->saddr = src.s_addr;
ip4h->daddr = dst.s_addr;
ip4h->check = csum_ip4_header(l3len, proto, src, dst);
return ip4h + 1;
return (char *)ip4h + sizeof(*ip4h);
}
/**
* tap_push_uh4() - Build UDPv4 header with checksum
* @uh: UDP header to fill
* @src: IPv4 source address
* @sport: UDP source port
* @dst: IPv4 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
const struct iovec iov = {
.iov_base = (void *)in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, &payload);
return (char *)uh + sizeof(*uh);
}
/**
@ -160,7 +250,7 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
* @sport: UDP source port
* @dst: IPv4 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
@ -171,18 +261,9 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
char buf[USHRT_MAX];
struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = (void *)in,
.iov_len = dlen
};
char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, &iov, 1, 0);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -219,10 +300,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
{
ip6h->payload_len = htons(l4len);
ip6h->priority = 0;
@ -231,10 +311,40 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
ip6h->hop_limit = 255;
ip6h->saddr = *src;
ip6h->daddr = *dst;
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
return ip6h + 1;
ip6_set_flow_lbl(ip6h, flow);
return (char *)ip6h + sizeof(*ip6h);
}
/**
* tap_push_uh6() - Build UDPv6 header with checksum
* @uh: UDP header to fill
* @src: IPv6 source address
* @sport: UDP source port
* @dst: IPv6 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, &payload);
return (char *)uh + sizeof(*uh);
}
/**
@ -245,7 +355,7 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
* @dst: IPv6 destination address
* @dport: UDP destination port
* @flow: Flow label
* @in: UDP payload contents (not including UDP header)
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp6_send(const struct ctx *c,
@ -258,18 +368,9 @@ void tap_udp6_send(const struct ctx *c,
struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
l4len, IPPROTO_UDP, flow);
char *data = (char *)(uh + 1);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, &iov, 1, 0);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -414,10 +515,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
if (!nframes)
return 0;
if (c->mode == MODE_PASTA)
switch (c->mode) {
case MODE_PASTA:
m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
else
break;
case MODE_PASST:
m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
break;
case MODE_VU:
/* fall through */
default:
ASSERT(0);
}
if (m < nframes)
debug("tap: failed to send %zu frames of %zu",
@ -450,6 +559,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
* struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @ttl: Time to live
* @source: Source port
* @dest: Destination port
* @saddr: Source address
@ -458,6 +568,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
*/
static struct tap4_l4_t {
uint8_t protocol;
uint8_t ttl;
uint16_t source;
uint16_t dest;
@ -472,14 +583,17 @@ static struct tap4_l4_t {
* struct l4_seq6_t - Message sequence for one protocol handler call, IPv6
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @flow_lbl: IPv6 flow label
* @source: Source port
* @dest: Destination port
* @saddr: Source address
* @daddr: Destination address
* @hop_limit: Hop limit
* @msg: Array of messages that can be handled in a single call
*/
static struct tap6_l4_t {
uint8_t protocol;
uint32_t flow_lbl :20;
uint16_t source;
uint16_t dest;
@ -487,6 +601,8 @@ static struct tap6_l4_t {
struct in6_addr saddr;
struct in6_addr daddr;
uint8_t hop_limit;
struct pool_l4_t p;
} tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
@ -675,7 +791,8 @@ resume:
#define L4_MATCH(iph, uh, seq) \
((seq)->protocol == (iph)->protocol && \
(seq)->source == (uh)->source && (seq)->dest == (uh)->dest && \
(seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
(seq)->saddr.s_addr == (iph)->saddr && \
(seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
#define L4_SET(iph, uh, seq) \
do { \
@ -684,6 +801,7 @@ resume:
(seq)->dest = (uh)->dest; \
(seq)->saddr.s_addr = (iph)->saddr; \
(seq)->daddr.s_addr = (iph)->daddr; \
(seq)->ttl = (iph)->ttl; \
} while (0)
if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
@ -725,14 +843,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
0, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
seq->ttl, p, k, now);
}
}
@ -853,16 +971,20 @@ resume:
((seq)->protocol == (proto) && \
(seq)->source == (uh)->source && \
(seq)->dest == (uh)->dest && \
(seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \
IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr) && \
(seq)->hop_limit == (ip6h)->hop_limit)
#define L4_SET(ip6h, proto, uh, seq) \
do { \
(seq)->protocol = (proto); \
(seq)->source = (uh)->source; \
(seq)->dest = (uh)->dest; \
(seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \
(seq)->saddr = *saddr; \
(seq)->daddr = *daddr; \
(seq)->hop_limit = (ip6h)->hop_limit; \
} while (0)
if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
@ -906,14 +1028,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->flow_lbl, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->hop_limit, p, k, now);
}
}
@ -948,8 +1070,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
* @c: Execution context
* @l2len: Total L2 packet length
* @p: Packet buffer
* @now: Current timestamp
*/
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now)
{
const struct ethhdr *eh;
@ -965,9 +1089,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
switch (ntohs(eh->h_proto)) {
case ETH_P_ARP:
case ETH_P_IP:
if (pool_full(pool_tap4)) {
tap4_handler(c, pool_tap4, now);
pool_flush(pool_tap4);
}
packet_add(pool_tap4, l2len, p);
break;
case ETH_P_IPV6:
if (pool_full(pool_tap6)) {
tap6_handler(c, pool_tap6, now);
pool_flush(pool_tap6);
}
packet_add(pool_tap6, l2len, p);
break;
default:
@ -979,17 +1111,19 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
* tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
* @c: Execution context
*/
static void tap_sock_reset(struct ctx *c)
void tap_sock_reset(struct ctx *c)
{
info("Client connection closed%s", c->one_off ? ", exiting" : "");
if (c->one_off)
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
/* Close the connected socket, wait for a new connection */
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
epoll_del(c, c->fd_tap);
close(c->fd_tap);
c->fd_tap = -1;
if (c->mode == MODE_VU)
vu_cleanup(c->vdev);
}
/**
@ -1016,7 +1150,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
do {
n = recv(c->fd_tap, pkt_buf + partial_len,
TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
sizeof(pkt_buf) - partial_len, MSG_DONTWAIT);
} while ((n < 0) && errno == EINTR);
if (n < 0) {
@ -1033,7 +1167,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
while (n >= (ssize_t)sizeof(uint32_t)) {
uint32_t l2len = ntohl_unaligned(p);
if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
if (l2len < sizeof(struct ethhdr) || l2len > L2_MAX_LEN_PASST) {
err("Bad frame size from guest, resetting connection");
tap_sock_reset(c);
return;
@ -1046,7 +1180,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
p += sizeof(uint32_t);
n -= sizeof(uint32_t);
tap_add_packet(c, l2len, p);
tap_add_packet(c, l2len, p, now);
p += l2len;
n -= l2len;
@ -1087,8 +1221,10 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
tap_flush_pools();
for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len) {
len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
for (n = 0;
n <= (ssize_t)(sizeof(pkt_buf) - L2_MAX_LEN_PASTA);
n += len) {
len = read(c->fd_tap, pkt_buf + n, L2_MAX_LEN_PASTA);
if (len == 0) {
die("EOF on tap device, exiting");
@ -1106,10 +1242,10 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
/* Ignore frames of bad length */
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)ETH_MAX_MTU)
len > (ssize_t)L2_MAX_LEN_PASTA)
continue;
tap_add_packet(c, len, pkt_buf + n);
tap_add_packet(c, len, pkt_buf + n, now);
}
tap_handler(c, now);
@ -1132,72 +1268,35 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
}
/**
* tap_sock_unix_open() - Create and bind AF_UNIX socket
* @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
*
* Return: socket descriptor on success, won't return on failure
* tap_backend_show_hints() - Give help information to start QEMU
* @c: Execution context
*/
int tap_sock_unix_open(char *sock_path)
static void tap_backend_show_hints(struct ctx *c)
{
int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
};
int i;
if (fd < 0)
die_perror("Failed to open UNIX domain socket");
for (i = 1; i < UNIX_SOCK_MAX; i++) {
char *path = addr.sun_path;
int ex, ret;
if (*sock_path)
memcpy(path, sock_path, UNIX_PATH_MAX);
else if (snprintf_check(path, UNIX_PATH_MAX - 1,
UNIX_SOCK_PATH, i))
die_perror("Can't build UNIX domain socket path");
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
0);
if (ex < 0)
die_perror("Failed to check for UNIX domain conflicts");
ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
errno != EACCES)) {
if (*sock_path)
die("Socket path %s already in use", path);
close(ex);
continue;
}
close(ex);
unlink(path);
ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
if (*sock_path && ret)
die_perror("Failed to bind UNIX domain socket");
if (!ret)
break;
switch (c->mode) {
case MODE_PASTA:
/* No hints */
break;
case MODE_PASST:
info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
c->sock_path);
info("or qrap, for earlier qemu versions:");
info(" ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
break;
case MODE_VU:
info("You can start qemu with:");
info(" kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
c->sock_path);
break;
}
if (i == UNIX_SOCK_MAX)
die_perror("Failed to bind UNIX domain socket");
info("UNIX domain socket bound at %s", addr.sun_path);
if (!*sock_path)
memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
return fd;
}
/**
* tap_sock_unix_init() - Start listening for connections on AF_UNIX socket
* @c: Execution context
*/
static void tap_sock_unix_init(struct ctx *c)
static void tap_sock_unix_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
struct epoll_event ev = { 0 };
@ -1208,12 +1307,33 @@ static void tap_sock_unix_init(struct ctx *c)
ev.events = EPOLLIN | EPOLLET;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
}
info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
c->sock_path);
info("or qrap, for earlier qemu versions:");
info(" ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
/**
* tap_start_connection() - Start a new connection
* @c: Execution context
*/
static void tap_start_connection(const struct ctx *c)
{
struct epoll_event ev = { 0 };
union epoll_ref ref = { 0 };
ref.fd = c->fd_tap;
switch (c->mode) {
case MODE_PASST:
ref.type = EPOLL_TYPE_TAP_PASST;
break;
case MODE_PASTA:
ref.type = EPOLL_TYPE_TAP_PASTA;
break;
case MODE_VU:
ref.type = EPOLL_TYPE_VHOST_CMD;
break;
}
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
}
/**
@ -1223,8 +1343,6 @@ static void tap_sock_unix_init(struct ctx *c)
*/
void tap_listen_handler(struct ctx *c, uint32_t events)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
struct epoll_event ev = { 0 };
int v = INT_MAX / 2;
struct ucred ucred;
socklen_t len;
@ -1263,10 +1381,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
setsockopt(c->fd_tap, SOL_SOCKET, SO_SNDBUF, &v, sizeof(v)))
trace("tap: failed to set SO_SNDBUF to %i", v);
ref.fd = c->fd_tap;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
}
/**
@ -1310,58 +1425,61 @@ static int tap_ns_tun(void *arg)
*/
static void tap_sock_tun_init(struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASTA };
struct epoll_event ev = { 0 };
NS_CALL(tap_ns_tun, c);
if (c->fd_tap == -1)
die("Failed to set up tap device in namespace");
pasta_ns_conf(c);
ref.fd = c->fd_tap;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
}
/**
* tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
* @c: Execution context
* tap_sock_update_pool() - Set the buffer base and size for the pool of packets
* @base: Buffer base
* @size: Buffer size
*/
void tap_sock_init(struct ctx *c)
void tap_sock_update_pool(void *base, size_t size)
{
size_t sz = sizeof(pkt_buf);
int i;
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS_IP4, base, size);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS_IP6, base, size);
for (i = 0; i < TAP_SEQS; i++) {
tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
}
}
/**
* tap_backend_init() - Create and set up AF_UNIX socket or
* tuntap file descriptor
* @c: Execution context
*/
void tap_backend_init(struct ctx *c)
{
if (c->mode == MODE_VU) {
tap_sock_update_pool(NULL, 0);
vu_init(c);
} else {
tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
}
if (c->fd_tap != -1) { /* Passed as --fd */
struct epoll_event ev = { 0 };
union epoll_ref ref;
ASSERT(c->one_off);
ref.fd = c->fd_tap;
if (c->mode == MODE_PASST)
ref.type = EPOLL_TYPE_TAP_PASST;
else
ref.type = EPOLL_TYPE_TAP_PASTA;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
return;
}
if (c->mode == MODE_PASTA) {
switch (c->mode) {
case MODE_PASTA:
tap_sock_tun_init(c);
} else {
break;
case MODE_VU:
repair_sock_init(c);
/* fall through */
case MODE_PASST:
tap_sock_unix_init(c);
/* In passt mode, we don't know the guest's MAC address until it
@ -1369,5 +1487,8 @@ void tap_sock_init(struct ctx *c)
* first packets will reach it.
*/
memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
break;
}
tap_backend_show_hints(c);
}

tap.h (57 lines changed)

@ -6,7 +6,32 @@
#ifndef TAP_H
#define TAP_H
#define ETH_HDR_INIT(proto) { .h_proto = htons_constant(proto) }
/** L2_MAX_LEN_PASTA - Maximum frame length for pasta mode (with L2 header)
*
* The kernel tuntap device imposes a maximum frame size of 65535 including
* 'hard_header_len' (14 bytes for L2 Ethernet in the case of "tap" mode).
*/
#define L2_MAX_LEN_PASTA USHRT_MAX
/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
*
* The only structural limit the QEMU socket protocol imposes on frames is
* (2^32-1) bytes, but that would be ludicrously long in practice. For now,
* limit it somewhat arbitrarily to 65535 bytes. FIXME: Work out an appropriate
* limit with more precision.
*/
#define L2_MAX_LEN_PASST USHRT_MAX
/** L2_MAX_LEN_VU - Maximum frame length for vhost-user mode (with L2 header)
*
* vhost-user allows multiple buffers per frame, each of which can be quite
* large, so the inherent frame size limit is rather large. Much larger than is
* actually useful for IP. For now limit arbitrarily to 65535 bytes. FIXME:
* Work out an appropriate limit with more precision.
*/
#define L2_MAX_LEN_VU USHRT_MAX
struct udphdr;
/**
* struct tap_hdr - tap backend specific headers
@ -40,9 +65,27 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
*/
static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
{
thdr->vnet_len = htonl(l2len);
if (thdr)
thdr->vnet_len = htonl(l2len);
}
unsigned long tap_l2_max_len(const struct ctx *c);
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
@ -50,6 +93,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
const void *in, size_t l4len);
const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
const struct in6_addr *src);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
@ -68,9 +114,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now);
int tap_sock_unix_open(char *sock_path);
void tap_sock_init(struct ctx *c);
void tap_sock_reset(struct ctx *c);
void tap_sock_update_pool(void *base, size_t size);
void tap_backend_init(struct ctx *c);
void tap_flush_pools(void);
void tap_handler(struct ctx *c, const struct timespec *now);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now);
#endif /* TAP_H */

tcp.c (1570 lines changed)

File diff suppressed because it is too large.

tcp.h (3 lines changed)

@ -16,7 +16,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events);
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const void *saddr, const void *daddr, uint32_t flow_lbl,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port);
@ -25,7 +25,6 @@ void tcp_timer(struct ctx *c, const struct timespec *now);
void tcp_defer_handler(struct ctx *c);
void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
int tcp_set_peek_offset(int s, int offset);
extern bool peek_offset_cap;


@ -125,7 +125,7 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
conn->seq_to_tap = seq;
peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
if (tcp_set_peek_offset(conn->sock, peek_offset))
if (tcp_set_peek_offset(conn, peek_offset))
tcp_rst(c, conn);
}
}
@ -147,6 +147,35 @@ void tcp_payload_flush(const struct ctx *c)
tcp_payload_used = 0;
}
/**
* tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
* @conn: Connection pointer
* @iov: Pointer to an array of iovec of TCP pre-cooked buffers
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*/
static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, const uint16_t *check,
uint32_t seq, bool no_tcp_csum)
{
struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
struct tcphdr *th = IOV_REMOVE_HEADER(&tail, struct tcphdr);
struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
const struct flowside *tapside = TAPFLOW(conn);
const struct in_addr *a4 = inany_v4(&tapside->oaddr);
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
if (a4)
ip4h = iov[TCP_IOV_IP].iov_base;
else
ip6h = iov[TCP_IOV_IP].iov_base;
tcp_fill_headers(conn, taph, ip4h, ip6h, th, &tail,
check, seq, no_tcp_csum);
}
/**
* tcp_buf_send_flag() - Send segment with flags to tap (no payload)
* @c: Execution context
@ -181,8 +210,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
return ret;
tcp_payload_used++;
l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq, false);
l4len = optlen + sizeof(struct tcphdr);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
if (flags & DUP_ACK) {
struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
@ -208,14 +239,14 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
* @dlen: TCP payload length
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
* @push: Set PSH flag, last segment in a batch
*/
static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ssize_t dlen, int no_csum, uint32_t seq)
ssize_t dlen, int no_csum, uint32_t seq, bool push)
{
struct tcp_payload_t *payload;
const uint16_t *check = NULL;
struct iovec *iov;
size_t l4len;
conn->seq_to_tap = seq + dlen;
tcp_frame_conns[tcp_payload_used] = conn;
@ -238,8 +269,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
payload->th.th_x2 = 0;
payload->th.th_flags = 0;
payload->th.ack = 1;
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq, false);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
payload->th.psh = push;
iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
}
@ -272,13 +304,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(s, 0)) {
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
@ -329,6 +362,9 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return -errno;
}
if (already_sent) /* No new data and EAGAIN: set EPOLLET */
conn_flag(c, conn, STALLED);
return 0;
}
@ -354,6 +390,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return 0;
}
conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, ~STALLED);
send_bufs = DIV_ROUND_UP(len, mss);
@ -367,11 +404,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
bool push = false;
if (i == send_bufs - 1)
if (i == send_bufs - 1) {
dlen = last_len;
push = true;
}
tcp_data_to_tap(c, conn, dlen, no_csum, seq);
tcp_data_to_tap(c, conn, dlen, no_csum, seq, push);
seq += dlen;
}


@ -19,6 +19,7 @@
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sock: Socket descriptor number
* @events: Connection events, implying connection states
* @listening_sock: Listening socket this socket was accept()ed from, or -1
* @timer: timerfd descriptor for timeout events
* @flags: Connection flags representing internal attributes
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
@ -68,6 +69,7 @@ struct tcp_tap_conn {
#define CONN_STATE_BITS /* Setting these clears other flags */ \
(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
int listening_sock;
int timer :FD_REF_BITS;
@ -77,6 +79,7 @@ struct tcp_tap_conn {
#define ACTIVE_CLOSE BIT(2)
#define ACK_TO_TAP_DUE BIT(3)
#define ACK_FROM_TAP_DUE BIT(4)
#define ACK_FROM_TAP_BLOCKS BIT(5)
#define SNDBUF_BITS 24
unsigned int sndbuf :SNDBUF_BITS;
@ -95,6 +98,95 @@ struct tcp_tap_conn {
uint32_t seq_init_from_tap;
};
/**
* struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
* @pif: Interfaces for each side of the flow
* @side: Addresses and ports for each side of the flow
* @retrans: Number of retransmissions that occurred due to ACK_TIMEOUT
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @events: Connection events, implying connection states
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
* @flags: Connection flags representing internal attributes
* @seq_dup_ack_approx: Last duplicate ACK number sent to tap
* @wnd_from_tap: Last window size from tap, unscaled (as received)
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
* @seq_ack_from_tap: Last ACK number received from tap
* @seq_from_tap: Next sequence for packets from tap (not actually sent)
* @seq_ack_to_tap: Last ACK number sent to tap
* @seq_init_from_tap: Initial sequence number from tap
*/
struct tcp_tap_transfer {
uint8_t pif[SIDES];
struct flowside side[SIDES];
uint8_t retrans;
uint8_t ws_from_tap;
uint8_t ws_to_tap;
uint8_t events;
uint32_t tap_mss;
uint32_t sndbuf;
uint8_t flags;
uint8_t seq_dup_ack_approx;
uint16_t wnd_from_tap;
uint16_t wnd_to_tap;
uint32_t seq_to_tap;
uint32_t seq_ack_from_tap;
uint32_t seq_from_tap;
uint32_t seq_ack_to_tap;
uint32_t seq_init_from_tap;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
/**
* struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
* @seq_snd: Socket-side send sequence
* @seq_rcv: Socket-side receive sequence
* @sndq: Length of pending send queue (unacknowledged / not sent)
* @notsent: Part of pending send queue that wasn't sent out yet
* @rcvq: Length of pending receive queue
* @mss: Socket-side MSS clamp
* @timestamp: RFC 7323 timestamp
* @snd_wl1: Next sequence used in window probe (next sequence - 1)
* @snd_wnd: Socket-side sending window
* @max_window: Window clamp
* @rcv_wnd: Socket-side receive window
* @rcv_wup: rcv_nxt on last window update sent
* @snd_ws: Window scaling factor, send
* @rcv_ws: Window scaling factor, receive
* @tcpi_state: Connection state in TCP_INFO style (enum, tcp_states.h)
* @tcpi_options: TCPI_OPT_* constants (timestamps, selective ACK)
*/
struct tcp_tap_transfer_ext {
uint32_t seq_snd;
uint32_t seq_rcv;
uint32_t sndq;
uint32_t notsent;
uint32_t rcvq;
uint32_t mss;
uint32_t timestamp;
/* We can't just use struct tcp_repair_window: we need network order */
uint32_t snd_wl1;
uint32_t snd_wnd;
uint32_t max_window;
uint32_t rcv_wnd;
uint32_t rcv_wup;
uint8_t snd_ws;
uint8_t rcv_ws;
uint8_t tcpi_state;
uint8_t tcpi_options;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
/**
* struct tcp_splice_conn - Descriptor for a spliced TCP connection
* @f: Generic flow information
@ -139,11 +231,23 @@ extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
bool tcp_flow_defer(const struct tcp_tap_conn *conn);
int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_target(struct ctx *c, int fd);
int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
int tcp_conn_pool_sock(int pool[]);
int tcp_conn_sock(const struct ctx *c, sa_family_t af);
int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
int tcp_conn_sock(sa_family_t af);
int tcp_sock_refill_pool(int pool[], sa_family_t af);
void tcp_splice_refill(const struct ctx *c);
#endif /* TCP_CONN_H */
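
A minimal standalone sketch (an illustration, not part of the patch): the migration structures above are packed and carried in network order, so the sending side converts each field before writing it to the migration channel, roughly along these lines for a single 32-bit field such as @seq_snd:

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Hypothetical helper, for illustration only: convert one sequence number to
 * network order and write it to the migration file descriptor @fd.
 */
static int send_seq_field(int fd, uint32_t seq_host_order)
{
	uint32_t seq_net_order = htonl(seq_host_order);

	if (write(fd, &seq_net_order, sizeof(seq_net_order)) !=
	    (ssize_t)sizeof(seq_net_order))
		return -1;

	return 0;
}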


@ -38,9 +38,13 @@
#define OPT_SACK 5
#define OPT_TS 8
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define HOSTSIDE(conn_) ((conn_)->f.pif[1] == PIF_HOST)
#define HOSTFLOW(conn_) (&((conn_)->f.side[HOSTSIDE(conn_)]))
#define HOST_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->oaddr))
#define CONN_V6(conn) (!CONN_V4(conn))
@ -162,14 +166,17 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
struct tcp_info_linux;
size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, size_t dlen,
const uint16_t *check, uint32_t seq,
bool no_tcp_csum);
void tcp_fill_headers(const struct tcp_tap_conn *conn,
struct tap_hdr *taph,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
bool force_seq, struct tcp_info_linux *tinfo);
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen);
int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset);
#endif /* TCP_INTERNAL_H */


@ -28,7 +28,7 @@
* - FIN_SENT_0: FIN (write shutdown) sent to accepted socket
* - FIN_SENT_1: FIN (write shutdown) sent to target socket
*
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
*/
#include <sched.h>
@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
ev[1].events = EPOLLOUT;
}
flow_foreach_sidei(sidei)
ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
flow_foreach_sidei(sidei) {
if (events & OUT_WAIT(sidei)) {
ev[sidei].events |= EPOLLOUT;
ev[!sidei].events &= ~EPOLLIN;
}
}
}
/**
@ -160,7 +164,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
int ret = -errno;
flow_err(conn, "ERROR on epoll_ctl(): %s", strerror(errno));
flow_perror(conn, "ERROR on epoll_ctl()");
return ret;
}
@ -200,8 +204,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (flag == CLOSING) {
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[0], NULL);
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[1], NULL);
epoll_del(c, conn->s[0]);
epoll_del(c, conn->s[1]);
}
}
@ -313,8 +317,8 @@ static int tcp_splice_connect_finish(const struct ctx *c,
if (conn->pipe[sidei][0] < 0) {
if (pipe2(conn->pipe[sidei], O_NONBLOCK | O_CLOEXEC)) {
flow_err(conn, "cannot create %d->%d pipe: %s",
sidei, !sidei, strerror(errno));
flow_perror(conn, "cannot create %d->%d pipe",
sidei, !sidei);
conn_flag(c, conn, CLOSING);
return -EIO;
}
@ -348,9 +352,10 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
uint8_t tgtpif = conn->f.pif[TGTSIDE];
union sockaddr_inany sa;
socklen_t sl;
int one = 1;
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
conn->s[1] = tcp_conn_sock(af);
else if (tgtpif == PIF_SPLICE)
conn->s[1] = tcp_conn_sock_ns(c, af);
else
@ -359,18 +364,27 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
if (conn->s[1] < 0)
return -1;
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
&((int){ 1 }), sizeof(int))) {
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
conn->s[1]);
}
if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
conn->s[0]);
}
if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
conn->s[1]);
}
pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
if (connect(conn->s[1], &sa.sa, sl)) {
if (errno != EINPROGRESS) {
flow_trace(conn, "Couldn't connect socket for splice: %s",
strerror(errno));
strerror_(errno));
return -errno;
}
@ -468,11 +482,10 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
if (rc)
flow_err(conn, "Error retrieving SO_ERROR: %s",
strerror(errno));
flow_perror(conn, "Error retrieving SO_ERROR");
else
flow_trace(conn, "Error event on socket: %s",
strerror(err));
strerror_(err));
goto close;
}
@ -507,20 +520,21 @@ swap:
int more = 0;
retry:
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
flow_trace(conn, "%zi from read-side call", readlen);
if (readlen < 0) {
if (errno == EINTR)
goto retry;
do
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
while (readlen < 0 && errno == EINTR);
if (errno != EAGAIN)
goto close;
} else if (!readlen) {
if (readlen < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from read-side call", readlen);
if (!readlen) {
eof = 1;
} else {
} else if (readlen > 0) {
never_read = 0;
if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
@ -530,10 +544,16 @@ retry:
conn_flag(c, conn, lowat_act_flag);
}
eintr:
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL, c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
do
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
while (written < 0 && errno == EINTR);
if (written < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, c->tcp.pipe_size);
@ -542,7 +562,7 @@ eintr:
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
if (conn->flags & lowat_set_flag &&
if (!(conn->flags & lowat_set_flag) &&
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
@ -551,7 +571,7 @@ eintr:
&lowat, sizeof(lowat))) {
flow_trace(conn,
"Setting SO_RCVLOWAT %i: %s",
lowat, strerror(errno));
lowat, strerror_(errno));
} else {
conn_flag(c, conn, lowat_set_flag);
conn_flag(c, conn, lowat_act_flag);
@ -565,12 +585,6 @@ eintr:
conn->written[fromsidei] += written > 0 ? written : 0;
if (written < 0) {
if (errno == EINTR)
goto eintr;
if (errno != EAGAIN)
goto close;
if (conn->read[fromsidei] == conn->written[fromsidei])
break;
@ -693,16 +707,16 @@ static int tcp_sock_refill_ns(void *arg)
ns_enter(c);
if (c->ifi4) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool4, AF_INET);
int rc = tcp_sock_refill_pool(ns_sock_pool4, AF_INET);
if (rc < 0)
warn("TCP: Error refilling IPv4 ns socket pool: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ifi6) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool6, AF_INET6);
int rc = tcp_sock_refill_pool(ns_sock_pool6, AF_INET6);
if (rc < 0)
warn("TCP: Error refilling IPv6 ns socket pool: %s",
strerror(-rc));
strerror_(-rc));
}
return 0;

tcp_vu.c (new file)

@ -0,0 +1,474 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* tcp_vu.c - TCP L2 vhost-user management functions
*
* Copyright Red Hat
* Author: Laurent Vivier <lvivier@redhat.com>
*/
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <netinet/if_ether.h>
#include <linux/virtio_net.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "siphash.h"
#include "inany.h"
#include "vhost_user.h"
#include "tcp.h"
#include "pcap.h"
#include "flow.h"
#include "tcp_conn.h"
#include "flow_table.h"
#include "tcp_vu.h"
#include "tap.h"
#include "tcp_internal.h"
#include "checksum.h"
#include "vu_common.h"
#include <time.h>
static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
static int head[VIRTQUEUE_MAX_SIZE + 1];
/**
* tcp_vu_hdrlen() - return the size of the header in level 2 frame (TCP)
* @v6: Set for IPv6 packet
*
* Return: Return the size of the header
*/
static size_t tcp_vu_hdrlen(bool v6)
{
size_t hdrlen;
hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf) +
sizeof(struct ethhdr) + sizeof(struct tcphdr);
if (v6)
hdrlen += sizeof(struct ipv6hdr);
else
hdrlen += sizeof(struct iphdr);
return hdrlen;
}
/**
* tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
* @c: Execution context
* @conn: Connection pointer
* @flags: TCP flags: if not set, send segment only if ACK is due
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
size_t optlen, hdrlen;
struct vu_virtq_element flags_elem[2];
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
struct iovec flags_iov[2];
struct tcp_syn_opts *opts;
struct iov_tail payload;
struct tcphdr *th;
struct ethhdr *eh;
uint32_t seq;
int elem_cnt;
int nb_ack;
int ret;
hdrlen = tcp_vu_hdrlen(CONN_V6(conn));
vu_set_element(&flags_elem[0], NULL, &flags_iov[0]);
elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
hdrlen + sizeof(struct tcp_syn_opts), NULL);
if (elem_cnt != 1)
return -1;
ASSERT(flags_elem[0].in_sg[0].iov_len >=
hdrlen + sizeof(struct tcp_syn_opts));
vu_set_vnethdr(vdev, flags_elem[0].in_sg[0].iov_base, 1);
eh = vu_eth(flags_elem[0].in_sg[0].iov_base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
if (CONN_V4(conn)) {
eh->h_proto = htons(ETH_P_IP);
ip4h = vu_ip(flags_elem[0].in_sg[0].iov_base);
*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
th = vu_payloadv4(flags_elem[0].in_sg[0].iov_base);
} else {
eh->h_proto = htons(ETH_P_IPV6);
ip6h = vu_ip(flags_elem[0].in_sg[0].iov_base);
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
th = vu_payloadv6(flags_elem[0].in_sg[0].iov_base);
}
memset(th, 0, sizeof(*th));
th->doff = sizeof(*th) / 4;
th->ack = 1;
seq = conn->seq_to_tap;
opts = (struct tcp_syn_opts *)(th + 1);
ret = tcp_prepare_flags(c, conn, flags, th, opts, &optlen);
if (ret <= 0) {
vu_queue_rewind(vq, 1);
return ret;
}
flags_elem[0].in_sg[0].iov_len = hdrlen + optlen;
payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
NULL, seq, !*c->pcap);
if (*c->pcap) {
pcap_iov(&flags_elem[0].in_sg[0], 1,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
nb_ack = 1;
if (flags & DUP_ACK) {
vu_set_element(&flags_elem[1], NULL, &flags_iov[1]);
elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
flags_elem[0].in_sg[0].iov_len, NULL);
if (elem_cnt == 1 &&
flags_elem[1].in_sg[0].iov_len >=
flags_elem[0].in_sg[0].iov_len) {
memcpy(flags_elem[1].in_sg[0].iov_base,
flags_elem[0].in_sg[0].iov_base,
flags_elem[0].in_sg[0].iov_len);
nb_ack++;
if (*c->pcap) {
pcap_iov(&flags_elem[1].in_sg[0], 1,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
}
}
vu_flush(vdev, vq, flags_elem, nb_ack);
return 0;
}
/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
* @c: Execution context
* @conn: Connection pointer
* @v6: Set for IPv6 connections
* @already_sent: Number of bytes already sent
* @fillsize: Maximum bytes to fill in guest-side receiving window
* @iov_cnt: number of iov (output)
*
* Return: Number of iov entries used to store the data or negative error code
*/
static ssize_t tcp_vu_sock_recv(const struct ctx *c,
const struct tcp_tap_conn *conn, bool v6,
uint32_t already_sent, size_t fillsize,
int *iov_cnt, int *head_cnt)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
int s = conn->sock;
ssize_t ret, len;
size_t hdrlen;
int elem_cnt;
int i;
*iov_cnt = 0;
hdrlen = tcp_vu_hdrlen(v6);
vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
elem_cnt = 0;
*head_cnt = 0;
while (fillsize > 0 && elem_cnt < VIRTQUEUE_MAX_SIZE) {
struct iovec *iov;
size_t frame_size, dlen;
int cnt;
cnt = vu_collect(vdev, vq, &elem[elem_cnt],
VIRTQUEUE_MAX_SIZE - elem_cnt,
MIN(mss, fillsize) + hdrlen, &frame_size);
if (cnt == 0)
break;
dlen = frame_size - hdrlen;
/* reserve space for headers in iov */
iov = &elem[elem_cnt].in_sg[0];
ASSERT(iov->iov_len >= hdrlen);
iov->iov_base = (char *)iov->iov_base + hdrlen;
iov->iov_len -= hdrlen;
head[(*head_cnt)++] = elem_cnt;
fillsize -= dlen;
elem_cnt += cnt;
}
if (peek_offset_cap) {
mh_sock.msg_iov = iov_vu + 1;
mh_sock.msg_iovlen = elem_cnt;
} else {
iov_vu[0].iov_base = tcp_buf_discard;
iov_vu[0].iov_len = already_sent;
mh_sock.msg_iov = iov_vu;
mh_sock.msg_iovlen = elem_cnt + 1;
}
do
ret = recvmsg(s, &mh_sock, MSG_PEEK);
while (ret < 0 && errno == EINTR);
if (ret < 0) {
vu_queue_rewind(vq, elem_cnt);
return -errno;
}
if (!peek_offset_cap)
ret -= already_sent;
/* adjust iov number and length of the last iov */
len = ret;
for (i = 0; len && i < elem_cnt; i++) {
struct iovec *iov = &elem[i].in_sg[0];
if (iov->iov_len > (size_t)len)
iov->iov_len = len;
len -= iov->iov_len;
}
/* adjust head count */
while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
(*head_cnt)--;
/* mark end of array */
head[*head_cnt] = i;
*iov_cnt = i;
/* release unused buffers */
vu_queue_rewind(vq, elem_cnt - i);
/* restore space for headers in iov */
for (i = 0; i < *head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
iov->iov_base = (char *)iov->iov_base - hdrlen;
iov->iov_len += hdrlen;
}
return ret;
}
/**
* tcp_vu_prepare() - Prepare the frame header
* @c: Execution context
* @conn: Connection pointer
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Number of entries in @iov
* @check: Checksum, if already known
* @no_tcp_csum: Do not set TCP checksum
* @push: Set PSH flag, last segment in a batch
*/
static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
struct iovec *iov, size_t iov_cnt,
const uint16_t **check, bool no_tcp_csum, bool push)
{
const struct flowside *toside = TAPFLOW(conn);
bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
size_t hdrlen = tcp_vu_hdrlen(v6);
struct iov_tail payload = IOV_TAIL(iov, iov_cnt, hdrlen);
char *base = iov[0].iov_base;
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
struct tcphdr *th;
struct ethhdr *eh;
/* we guess the first iovec provided by the guest can embed
* all the headers needed by L2 frame
*/
ASSERT(iov[0].iov_len >= hdrlen);
eh = vu_eth(base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
/* initialize header */
if (!v6) {
eh->h_proto = htons(ETH_P_IP);
ip4h = vu_ip(base);
*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
th = vu_payloadv4(base);
} else {
eh->h_proto = htons(ETH_P_IPV6);
ip6h = vu_ip(base);
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
th = vu_payloadv6(base);
}
memset(th, 0, sizeof(*th));
th->doff = sizeof(*th) / 4;
th->ack = 1;
th->psh = push;
tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
*check, conn->seq_to_tap, no_tcp_csum);
if (ip4h)
*check = &ip4h->check;
}
/**
* tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
* in window
* @c: Execution context
* @conn: Connection pointer
*
* Return: Negative on connection reset, 0 otherwise
*/
int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
ssize_t len, previous_dlen;
int i, iov_cnt, head_cnt;
size_t hdrlen, fillsize;
int v6 = CONN_V6(conn);
uint32_t already_sent;
const uint16_t *check;
if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
debug("Got packet, but RX virtqueue not usable yet");
return 0;
}
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
if (SEQ_LT(already_sent, 0)) {
/* RFC 761, section 2.1. */
flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
/* Set up buffer descriptors we'll fill completely and partially. */
fillsize = wnd_scaled - already_sent;
/* collect the buffers from vhost-user and fill them with the
* data from the socket
*/
len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize,
&iov_cnt, &head_cnt);
if (len < 0) {
if (len != -EAGAIN && len != -EWOULDBLOCK) {
tcp_rst(c, conn);
return len;
}
if (already_sent) /* No new data and EAGAIN: set EPOLLET */
conn_flag(c, conn, STALLED);
return 0;
}
if (!len) {
if (already_sent) {
conn_flag(c, conn, STALLED);
} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
SOCK_FIN_RCVD) {
int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
return ret;
}
conn_event(c, conn, TAP_FIN_SENT);
}
return 0;
}
conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, ~STALLED);
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, false, NULL);
/* initialize headers */
/* iov_vu is an array of buffers and the buffer size can be
* smaller than the frame size we want to use but with
* num_buffer we can merge several virtio iov buffers in one packet
* we need only to set the packet headers in the first iov and
* num_buffer to the number of iov entries
*/
hdrlen = tcp_vu_hdrlen(v6);
for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
int buf_cnt = head[i + 1] - head[i];
ssize_t dlen = iov_size(iov, buf_cnt) - hdrlen;
bool push = i == head_cnt - 1;
vu_set_vnethdr(vdev, iov->iov_base, buf_cnt);
/* The IPv4 header checksum varies only with dlen */
if (previous_dlen != dlen)
check = NULL;
previous_dlen = dlen;
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
if (*c->pcap) {
pcap_iov(iov, buf_cnt,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
conn->seq_to_tap += dlen;
}
/* send packets */
vu_flush(vdev, vq, elem, iov_cnt);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
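
As an aside, a minimal standalone sketch (plain struct iovec array, detached from the vhost-user element layout used above) of the trimming step performed after recvmsg() with MSG_PEEK: shorten the iovec array to the number of bytes actually read, and report how many entries are used:

#include <stddef.h>
#include <sys/uio.h>

/* Illustration only: truncate @iov (up to @cnt entries) to @len bytes,
 * clamping the length of the last entry used, and return the entry count.
 */
static int iov_trim(struct iovec *iov, int cnt, size_t len)
{
	int i;

	for (i = 0; len && i < cnt; i++) {
		if (iov[i].iov_len > len)
			iov[i].iov_len = len;
		len -= iov[i].iov_len;
	}

	return i;
}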

tcp_vu.h (new file)

@ -0,0 +1,12 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* Copyright Red Hat
* Author: Laurent Vivier <lvivier@redhat.com>
*/
#ifndef TCP_VU_H
#define TCP_VU_H
int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
#endif /*TCP_VU_H */

test/.gitignore

@ -8,5 +8,6 @@ QEMU_EFI.fd
*.raw.xz
*.bin
nstool
rampstream
guest-key
guest-key.pub


@ -52,7 +52,8 @@ UBUNTU_IMGS = $(UBUNTU_OLD_IMGS) $(UBUNTU_NEW_IMGS)
DOWNLOAD_ASSETS = mbuto podman \
$(DEBIAN_IMGS) $(FEDORA_IMGS) $(OPENSUSE_IMGS) $(UBUNTU_IMGS)
TESTDATA_ASSETS = small.bin big.bin medium.bin
TESTDATA_ASSETS = small.bin big.bin medium.bin \
rampstream
LOCAL_ASSETS = mbuto.img mbuto.mem.img podman/bin/podman QEMU_EFI.fd \
$(DEBIAN_IMGS:%=prepared-%) $(FEDORA_IMGS:%=prepared-%) \
$(UBUNTU_NEW_IMGS:%=prepared-%) \
@ -85,7 +86,7 @@ podman/bin/podman: pull-podman
guest-key guest-key.pub:
ssh-keygen -f guest-key -N ''
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub $(TESTDATA_ASSETS)
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub rampstream-check.sh $(TESTDATA_ASSETS)
./mbuto/mbuto -p ./$< -c lz4 -f $@
mbuto.mem.img: passt.mem.mbuto mbuto ../passt.avx2


@ -135,7 +135,7 @@ layout_two_guests() {
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #1" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
@ -143,13 +143,66 @@ layout_two_guests() {
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #1" pasta_1 passt_2
info_layout "two guests, two passt instances, in namespaces"
sleep 1
}
# layout_migrate() - Two guest panes, two passt panes, two passt-repair panes,
# plus host and log
layout_migrate() {
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
tmux split-window -v -t passt_test
tmux split-window -h -l '33%'
tmux split-window -h -t passt_test:1.1
tmux split-window -h -l '35%' -t passt_test:1.0
tmux split-window -v -t passt_test:1.0
tmux split-window -v -t passt_test:1.4
tmux split-window -v -t passt_test:1.6
tmux split-window -v -t passt_test:1.3
PANE_GUEST_1=0
PANE_GUEST_2=1
PANE_INFO=2
PANE_MON=3
PANE_HOST=4
PANE_PASST_REPAIR_1=5
PANE_PASST_1=6
PANE_PASST_REPAIR_2=7
PANE_PASST_2=8
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
tmux select-pane -t ${PANE_INFO} -T "test log"
pane_watch_contexts ${PANE_MON} "QEMU monitor" mon mon
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_REPAIR_1} "passt-repair #1 in namespace #1" repair_1 passt_repair_1
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_REPAIR_2} "passt-repair #2 in namespace #2" repair_2 passt_repair_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
info_layout "two guests, two passt + passt-repair instances, in namespaces"
sleep 1
}
# layout_demo_pasta() - Four panes for pasta demo
layout_demo_pasta() {
sleep 1


@ -49,6 +49,21 @@ td:empty { visibility: hidden; }
__passt_tcp_LINE__ __passt_udp_LINE__
</table>
</li><li><p>passt with vhost-user support</p>
<table class="passt" width="70%">
<tr>
<th/>
<th id="perf_passt_vu_tcp" colspan="__passt_vu_tcp_cols__">TCP, __passt_vu_tcp_threads__ at __passt_vu_tcp_freq__ GHz</th>
<th id="perf_passt_vu_udp" colspan="__passt_vu_udp_cols__">UDP, __passt_vu_udp_threads__ at __passt_vu_udp_freq__ GHz</th>
</tr>
<tr>
<td align="right">MTU:</td>
__passt_vu_tcp_header__
__passt_vu_udp_header__
</tr>
__passt_vu_tcp_LINE__ __passt_vu_udp_LINE__
</table>
<style type="text/CSS">
table.pasta_local td { border: 0px solid; padding: 6px; line-height: 1; }
table.pasta_local td { text-align: right; }


@ -15,8 +15,7 @@
INITRAMFS="${BASEPATH}/mbuto.img"
VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
__mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
VMEM="$((${__mem_kib} / 1024 / 4))"
MEM_KIB="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
QEMU_ARCH="$(uname -m)"
[ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
@ -46,24 +45,38 @@ setup_passt() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -P ${STATESETUP}/passt.pid"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -H hostname1 --fqdn fqdn1.passt.test -P ${STATESETUP}/passt.pid"
# pidfile isn't created until passt is listening
wait_for [ -f "${STATESETUP}/passt.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev=" \
-chardev socket,id=c,path=${STATESETUP}/passt.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
fi
GUEST_CID=94557
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
" ${__qemu_netdev}" \
" -pidfile ${STATESETUP}/qemu.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
@ -142,29 +155,43 @@ setup_passt_in_ns() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_in_pasta.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
if [ ${VALGRIND} -eq 1 ]; then
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
else
context_run passt "make clean"
context_run passt "make"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
fi
wait_for [ -f "${STATESETUP}/passt.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev=" \
-chardev socket,id=c,path=${STATESETUP}/passt.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
fi
GUEST_CID=94557
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
" ${__qemu_netdev}" \
" -pidfile ${STATESETUP}/qemu.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
@ -214,11 +241,126 @@ setup_two_guests() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} --fqdn fqdn1.passt.test -H hostname1 -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
__opts=
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} --hostname hostname2 --fqdn fqdn2 -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev1=" \
-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
__qemu_netdev2=" \
-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev1="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket"
__qemu_netdev2="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket"
fi
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev1}" \
" -pidfile ${STATESETUP}/qemu_1.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev2}" \
" -pidfile ${STATESETUP}/qemu_2.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
context_setup_guest guest_1 ${GUEST_1_CID}
context_setup_guest guest_2 ${GUEST_2_CID}
}
# setup_migrate() - Set up two namespace, run qemu, passt/passt-repair in both
setup_migrate() {
context_setup_host host
context_setup_host mon
context_setup_host pasta_1
context_setup_host pasta_2
layout_migrate
# Ports:
#
# guest #1 | guest #2 | ns #1 | host
# --------- |-----------|-----------|------------
# 10001 as server | | to guest | to ns #1
# 10002 | | as server | to ns #1
# 10003 | | to init | as server
# 10004 | as server | to guest | to ns #1
__opts=
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/pasta_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
__map_host4=192.0.2.1
__map_host6=2001:db8:9a55::1
__map_ns4=192.0.2.2
__map_ns6=2001:db8:9a55::2
# Option 1: send stuff via spliced path in pasta
# context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002 -T 10003 -u 10001,10002 -U 10003 --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
# Option 2: send stuff via tap (--map-guest-addr) instead (useful to see capture of full migration)
context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr ${__map_host4} --map-guest-addr ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
context_setup_nstool passt_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_2 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_2 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_1 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_2 ${STATESETUP}/ns1.hold
__ifname="$(context_run qemu_1 "ip -j link show | jq -rM '.[] | select(.link_type == \"ether\").ifname'")"
sleep 1
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
__opts=
context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
@ -226,34 +368,52 @@ setup_two_guests() {
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
context_run_bg passt_repair_2 "./passt-repair ${STATESETUP}/passt_2.socket.repair"
__vmem="512M" # Keep migration fast
__qemu_netdev1=" \
-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
__qemu_netdev2=" \
-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket " \
" ${__qemu_netdev1}" \
" -pidfile ${STATESETUP}/qemu_1.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID" \
" -monitor unix:${STATESETUP}/qemu_1_mon.sock,server,nowait"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket " \
" ${__qemu_netdev2}" \
" -pidfile ${STATESETUP}/qemu_2.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID" \
" -monitor unix:${STATESETUP}/qemu_2_mon.sock,server,nowait" \
" -incoming tcp:0:20005"
context_setup_guest guest_1 ${GUEST_1_CID}
context_setup_guest guest_2 ${GUEST_2_CID}
# Only available after migration:
( context_setup_guest guest_2 ${GUEST_2_CID} & )
}
# teardown_context_watch() - Remove contexts and stop panes watching them
@ -326,7 +486,8 @@ teardown_two_guests() {
context_wait pasta_1
context_wait pasta_2
rm -f "${STATESETUP}/passt__[12].pid" "${STATESETUP}/pasta_[12].pid"
rm "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
@ -335,6 +496,30 @@ teardown_two_guests() {
teardown_context_watch ${PANE_PASST_2} pasta_2 passt_2
}
# teardown_migrate() - Exit namespaces, kill qemu processes, passt and pasta
teardown_migrate() {
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_1.pid")
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_2.pid")
context_wait qemu_1
context_wait qemu_2
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/passt_2.pid")
context_wait passt_1
context_wait passt_2
${NSTOOL} stop "${STATESETUP}/ns1.hold"
context_wait pasta_1
rm -f "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm -f "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
teardown_context_watch ${PANE_GUEST_2} qemu_2 guest_2
teardown_context_watch ${PANE_PASST_1} pasta_1 passt_1
teardown_context_watch ${PANE_PASST_2} pasta_1 passt_2
}
# teardown_demo_passt() - Exit namespace, kill qemu, passt and pasta
teardown_demo_passt() {
tmux send-keys -t ${PANE_GUEST} "C-c"


@ -33,7 +33,7 @@ setup_memory() {
pane_or_context_run guest 'qemu-system-$(uname -m)' \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '$((${MEM_KIB} / 1024 / 4))' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -initrd '${INITRAMFS_MEM}' -nographic -serial stdio' \
' -nodefaults' \


@ -19,6 +19,7 @@ STATUS_FILE_INDEX=0
STATUS_COLS=
STATUS_PASS=0
STATUS_FAIL=0
STATUS_SKIPPED=0
PR_RED='\033[1;31m'
PR_GREEN='\033[1;32m'
@ -439,19 +440,21 @@ info_layout() {
# status_test_ok() - Update counter of passed tests, log and display message
status_test_ok() {
STATUS_PASS=$((STATUS_PASS + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_passed
}
# status_test_fail() - Update counter of failed tests, log and display message
status_test_fail() {
STATUS_FAIL=$((STATUS_FAIL + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_failed
}
# status_test_fail() - Update counter of failed tests, log and display message
status_test_skip() {
STATUS_SKIPPED=$((STATUS_SKIPPED + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_skipped
}


@ -20,10 +20,7 @@ test_iperf3s() {
__sctx="${1}"
__port="${2}"
pane_or_context_run_bg "${__sctx}" \
'iperf3 -s -p'${__port}' & echo $! > s.pid' \
sleep 1 # Wait for server to be ready
pane_or_context_run "${__sctx}" 'iperf3 -s -p'${__port}' -D -I s.pid'
}
# test_iperf3k() - Kill iperf3 server
@ -31,7 +28,7 @@ test_iperf3s() {
test_iperf3k() {
__sctx="${1}"
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid'
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid)'
sleep 1 # Wait for kernel to free up ports
}
@ -68,6 +65,45 @@ test_iperf3() {
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
# test_iperf3m() - Ugly helper for iperf3 directive, guest migration variant
# $1: Variable name: to put the measure bandwidth into
# $2: Initial source/client context
# $3: Second source/client context the guest is moving to
# $4: Destination name or address for client
# $5: Port number, ${i} is translated to process index
# $6: Run time, in seconds
# $7: Client options
test_iperf3m() {
__var="${1}"; shift
__cctx="${1}"; shift
__cctx2="${1}"; shift
__dest="${1}"; shift
__port="${1}"; shift
__time="${1}"; shift
pane_or_context_run "${__cctx}" 'rm -f c.json'
# A 1s wait for connection on what's basically a local link
# indicates something is pretty wrong
__timeout=1000
pane_or_context_run_bg "${__cctx}" \
'iperf3 -J -c '${__dest}' -p '${__port} \
' --connect-timeout '${__timeout} \
' -t'${__time}' -i0 '"${@}"' > c.json' \
__jval=".end.sum_received.bits_per_second"
sleep $((${__time} + 3))
pane_or_context_output "${__cctx2}" \
'cat c.json'
__bw=$(pane_or_context_output "${__cctx2}" \
'cat c.json | jq -rMs "map('${__jval}') | add"')
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
test_one_line() {
__line="${1}"
@ -177,6 +213,12 @@ test_one_line() {
"guest2w")
pane_or_context_wait guest_2 || TEST_ONE_nok=1
;;
"mon")
pane_or_context_run mon "${__arg}" || TEST_ONE_nok=1
;;
"monb")
pane_or_context_run_bg mon "${__arg}"
;;
"ns")
pane_or_context_run ns "${__arg}" || TEST_ONE_nok=1
;;
@ -292,6 +334,9 @@ test_one_line() {
"iperf3")
test_iperf3 ${__arg}
;;
"iperf3m")
test_iperf3m ${__arg}
;;
"set")
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__arg%% *}__" "${__arg#* }")"
;;

test/migrate/basic (new file)

@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic - Check basic migration functionality
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1/guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
sleep 1
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]

test/migrate/basic_fin (new file)

@ -0,0 +1,62 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic_fin - Outbound traffic across migration, half-closed socket
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1, half-close, guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
#hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
#sleep 20
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]


@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional - Check migration with messages in both directions
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 > host, host > guest1/guest2
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
guest1b socat -u TCP4-LISTEN:10001 OPEN:msg,create,trunc
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
# use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]


@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional_fin - Both directions, half-closed sockets
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 <- (half closed) -> host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
guest1b echo FIN | socat TCP4-LISTEN:10001,shut-down STDIO,ignoreeof > msg
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
# use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]


@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_bidir6 - Migration behaviour with many bidirectional flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 128
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N --bidir
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host <-> guest flood, many flows, during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_in4 (new file)

@ -0,0 +1,50 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in4 - Migration behaviour under inbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
guest1 /sbin/sysctl -w net.core.rmem_max=33554432
guest1 /sbin/sysctl -w net.core.wmem_max=33554432
set THREADS 1
set TIME 4
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_in6 Normal file

@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in6 - Migration behaviour under inbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 4
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host


@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_many_out6 - Migration behaviour with many outbound flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 16
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host flood, many flows, during migration
test TCP/IPv6 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_out4 Normal file

@ -0,0 +1,47 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_out6 Normal file

@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host


@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_in - Check sequence correctness with inbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, inbound
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
guest1b socat -u TCP4-LISTEN:10001 EXEC:"rampstream-check.sh __RAMPS__"
sleep 1
hostb socat -u EXEC:"test/rampstream send __RAMPS__" TCP4:__ADDR1__:10001
sleep 1
monb echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
guest2 cat rampstream.err
guest2 [ $(cat rampstream.status) -eq 0 ]


@ -0,0 +1,55 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_out - Check sequence correctness with outbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, outbound
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 EXEC:"test/rampstream check __RAMPS__"
sleep 1
guest1b socat -u EXEC:"rampstream send __RAMPS__" TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw


@ -13,7 +13,8 @@
PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
modprobe find grep mknod mv rm umount jq iperf3 dhclient hostname
sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp}"
nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump
env}"
# OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
# sshd-session the per-session program. We need the latter as well, and the path
@ -31,7 +32,7 @@ LINKS="${LINKS:-
DIRS="${DIRS} /tmp /usr/sbin /usr/share /var/log /var/lib /etc/ssh /run/sshd /root/.ssh"
COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin"
COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin rampstream,/bin/rampstream rampstream-check.sh,/bin/rampstream-check.sh"
FIXUP="${FIXUP}"'
mv /sbin/* /usr/sbin || :
@ -41,6 +42,7 @@ FIXUP="${FIXUP}"'
#!/bin/sh
LOG=/var/log/dhclient-script.log
echo \${reason} \${interface} >> \$LOG
env >> \$LOG
set >> \$LOG
[ -n "\${new_interface_mtu}" ] && ip link set dev \${interface} mtu \${new_interface_mtu}
@ -54,7 +56,8 @@ set >> \$LOG
[ -n "\${new_ip6_address}" ] && ip addr add \${new_ip6_address}/\${new_ip6_prefixlen} dev \${interface}
[ -n "\${new_dhcp6_name_servers}" ] && for d in \${new_dhcp6_name_servers}; do echo "nameserver \${d}%\${interface}" >> /etc/resolv.conf; done
[ -n "\${new_dhcp6_domain_search}" ] && (printf "search"; for d in \${new_dhcp6_domain_search}; do printf " %s" "\${d}"; done; printf "\n") >> /etc/resolv.conf
[ -n "\${new_host_name}" ] && hostname "\${new_host_name}"
[ -n "\${new_host_name}" ] && echo "\${new_host_name}" > /tmp/new_host_name
[ -n "\${new_fqdn_fqdn}" ] && echo "\${new_fqdn_fqdn}" > /tmp/new_fqdn_fqdn
exit 0
EOF
chmod 755 /sbin/dhclient-script
@ -65,6 +68,7 @@ EOF
# sshd via vsock
cat > /etc/passwd << EOF
root:x:0:0:root:/root:/bin/sh
tcpdump:x:72:72:tcpdump:/:/sbin/nologin
sshd:x:100:100:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
EOF
cat > /etc/shadow << EOF


@ -11,7 +11,7 @@
# Copyright (c) 2021 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
gtools ip jq dhclient sed tr
gtools ip jq dhclient sed tr hostname
htools ip jq sed tr head
test Interface name
@ -47,7 +47,16 @@ gout SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^searc
hout HOST_SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH__" = "__HOST_SEARCH__" ]
test DHCP: Hostname
gout NEW_HOST_NAME cat /tmp/new_host_name
check [ "__NEW_HOST_NAME__" = "hostname1" ]
test DHCP: Client FQDN
gout NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
check [ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]
test DHCPv6: address
guest rm /tmp/new_fqdn_fqdn
guest /sbin/dhclient -6 __IFNAME__
# Wait for DAD to complete
guest while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
@ -70,3 +79,7 @@ test DHCPv6: search list
gout SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
hout HOST_SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
check [ "__SEARCH6__" = "__HOST_SEARCH6__" ]
test DHCPv6: Hostname
gout NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
check [ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]

test/passt_vu Symbolic link

@ -0,0 +1 @@
passt

test/passt_vu_in_ns Symbolic link

@ -0,0 +1 @@
passt_in_ns


@ -23,4 +23,4 @@ check [ "__PASTA_BIN__" = "__WD__/pasta" ]
test Podman system test with bats
host PODMAN="__PODMAN__" CONTAINERS_HELPER_BINARY_DIR="__WD__" bats test/podman/test/system/505-networking-pasta.bats
host PODMAN="__PODMAN__" CONTAINERS_HELPER_BINARY_DIR="__WD__" taskset -c 1 bats test/podman/test/system/505-networking-pasta.bats

test/perf/passt_vu_tcp Normal file

@ -0,0 +1,211 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/perf/passt_vu_tcp - Check TCP performance in passt vhost-user mode
#
# Copyright (c) 2021 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
gtools /sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr # From neper
nstools /sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr
htools bc head sed seq
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test passt: throughput and latency
guest /sbin/sysctl -w net.core.rmem_max=536870912
guest /sbin/sysctl -w net.core.wmem_max=536870912
guest /sbin/sysctl -w net.core.rmem_default=33554432
guest /sbin/sysctl -w net.core.wmem_default=33554432
guest /sbin/sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
guest /sbin/sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"
guest /sbin/sysctl -w net.ipv4.tcp_timestamps=0
ns /sbin/sysctl -w net.ipv4.tcp_rmem="4096 524288 134217728"
ns /sbin/sysctl -w net.ipv4.tcp_wmem="4096 524288 134217728"
ns /sbin/sysctl -w net.ipv4.tcp_timestamps=0
gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N
info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
report passt_vu tcp __THREADS__ __FREQ__
th MTU 256B 576B 1280B 1500B 9000B 65520B
tr TCP throughput over IPv6: guest to host
iperf3s ns 10002
bw -
bw -
guest ip link set dev __IFNAME__ mtu 1280
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M -l 1M
bw __BW__ 1.2 1.5
guest ip link set dev __IFNAME__ mtu 1500
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M -l 1M
bw __BW__ 1.6 1.8
guest ip link set dev __IFNAME__ mtu 9000
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M -l 1M
bw __BW__ 4.0 5.0
guest ip link set dev __IFNAME__ mtu 65520
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M -l 1M
bw __BW__ 7.0 8.0
iperf3k ns
tl TCP RR latency over IPv6: guest to host
lat -
lat -
lat -
lat -
lat -
nsb tcp_rr --nolog -6
gout LAT tcp_rr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tl TCP CRR latency over IPv6: guest to host
lat -
lat -
lat -
lat -
lat -
nsb tcp_crr --nolog -6
gout LAT tcp_crr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 500 400
tr TCP throughput over IPv4: guest to host
iperf3s ns 10002
guest ip link set dev __IFNAME__ mtu 256
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 2M -l 1M
bw __BW__ 0.2 0.3
guest ip link set dev __IFNAME__ mtu 576
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 4M -l 1M
bw __BW__ 0.5 0.8
guest ip link set dev __IFNAME__ mtu 1280
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M -l 1M
bw __BW__ 1.2 1.5
guest ip link set dev __IFNAME__ mtu 1500
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M -l 1M
bw __BW__ 1.6 1.8
guest ip link set dev __IFNAME__ mtu 9000
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M -l 1M
bw __BW__ 4.0 5.0
guest ip link set dev __IFNAME__ mtu 65520
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M -l 1M
bw __BW__ 7.0 8.0
iperf3k ns
# Reducing MTU below 1280 deconfigures IPv6, get our address back
guest dhclient -6 -x
guest dhclient -6 __IFNAME__
tl TCP RR latency over IPv4: guest to host
lat -
lat -
lat -
lat -
lat -
nsb tcp_rr --nolog -4
gout LAT tcp_rr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tl TCP CRR latency over IPv4: guest to host
lat -
lat -
lat -
lat -
lat -
nsb tcp_crr --nolog -4
gout LAT tcp_crr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 500 400
tr TCP throughput over IPv6: host to guest
iperf3s guest 10001
bw -
bw -
bw -
bw -
bw -
iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -w 256M -l 16k
bw __BW__ 6.0 6.8
iperf3k guest
tl TCP RR latency over IPv6: host to guest
lat -
lat -
lat -
lat -
lat -
guestb tcp_rr --nolog -P 10001 -C 10011 -6
sleep 1
nsout LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tl TCP CRR latency over IPv6: host to guest
lat -
lat -
lat -
lat -
lat -
guestb tcp_crr --nolog -P 10001 -C 10011 -6
sleep 1
nsout LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 500 350
tr TCP throughput over IPv4: host to guest
iperf3s guest 10001
bw -
bw -
bw -
bw -
bw -
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 256M -l 16k
bw __BW__ 6.0 6.8
iperf3k guest
tl TCP RR latency over IPv4: host to guest
lat -
lat -
lat -
lat -
lat -
guestb tcp_rr --nolog -P 10001 -C 10011 -4
sleep 1
nsout LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tl TCP CRR latency over IPv4: host to guest
lat -
lat -
lat -
lat -
lat -
guestb tcp_crr --nolog -P 10001 -C 10011 -4
sleep 1
nsout LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 500 300
te

test/perf/passt_vu_udp Normal file

@ -0,0 +1,159 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/perf/passt_vu_udp - Check UDP performance in passt vhost-user mode
#
# Copyright (c) 2021 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
gtools /sbin/sysctl ip jq nproc sleep iperf3 udp_rr # From neper
nstools ip jq sleep iperf3 udp_rr
htools bc head sed
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test passt: throughput and latency
guest /sbin/sysctl -w net.core.rmem_max=16777216
guest /sbin/sysctl -w net.core.wmem_max=16777216
guest /sbin/sysctl -w net.core.rmem_default=16777216
guest /sbin/sysctl -w net.core.wmem_default=16777216
hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
set THREADS 2
set TIME 1
set OPTS -u -P __THREADS__ --pacing-timer 1000
info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
report passt_vu udp __THREADS__ __FREQ__
th pktlen 256B 576B 1280B 1500B 9000B 65520B
tr UDP throughput over IPv6: guest to host
iperf3s ns 10002
# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
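# e.g. the 1280B column uses -l 1232: 1232 + 8 (UDP) + 40 (IPv6) = 1280 bytes on the wire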
bw -
bw -
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 3G -l 1232
bw __BW__ 0.8 1.2
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 4G -l 1452
bw __BW__ 1.0 1.5
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 10G -l 8952
bw __BW__ 4.0 5.0
iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 20G -l 64372
bw __BW__ 4.0 5.0
iperf3k ns
tl UDP RR latency over IPv6: guest to host
lat -
lat -
lat -
lat -
lat -
nsb udp_rr --nolog -6
gout LAT udp_rr --nolog -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tr UDP throughput over IPv4: guest to host
iperf3s ns 10002
# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 1G -l 228
bw __BW__ 0.0 0.0
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 2G -l 548
bw __BW__ 0.4 0.6
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 3G -l 1252
bw __BW__ 0.8 1.2
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 4G -l 1472
bw __BW__ 1.0 1.5
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 10G -l 8972
bw __BW__ 4.0 5.0
iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 20G -l 65492
bw __BW__ 4.0 5.0
iperf3k ns
tl UDP RR latency over IPv4: guest to host
lat -
lat -
lat -
lat -
lat -
nsb udp_rr --nolog -4
gout LAT udp_rr --nolog -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tr UDP throughput over IPv6: host to guest
iperf3s guest 10001
# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
bw -
bw -
iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 3G -l 1232
bw __BW__ 0.8 1.2
iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 4G -l 1452
bw __BW__ 1.0 1.5
iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 10G -l 8952
bw __BW__ 3.0 4.0
iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 20G -l 64372
bw __BW__ 3.0 4.0
iperf3k guest
tl UDP RR latency over IPv6: host to guest
lat -
lat -
lat -
lat -
lat -
guestb udp_rr --nolog -P 10001 -C 10011 -6
sleep 1
nsout LAT udp_rr --nolog -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
tr UDP throughput over IPv4: host to guest
iperf3s guest 10001
# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 1G -l 228
bw __BW__ 0.0 0.0
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 2G -l 548
bw __BW__ 0.4 0.6
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 3G -l 1252
bw __BW__ 0.8 1.2
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 4G -l 1472
bw __BW__ 1.0 1.5
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 10G -l 8972
bw __BW__ 3.0 4.0
iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 20G -l 65492
bw __BW__ 3.0 4.0
iperf3k guest
tl UDP RR latency over IPv4: host to guest
lat -
lat -
lat -
lat -
lat -
guestb udp_rr --nolog -P 10001 -C 10011 -4
sleep 1
nsout LAT udp_rr --nolog -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
lat __LAT__ 200 150
te

test/rampstream-check.sh Executable file

@ -0,0 +1,3 @@
#! /bin/sh
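# Run "rampstream check", keeping a copy of its output in rampstream.err and
# recording its exit code in rampstream.status, so that the test can inspect
# both once the socat EXEC: wrapper that ran this script has exited.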
(rampstream check "$@" 2>&1; echo $? > rampstream.status) | tee rampstream.err

test/rampstream.c Normal file

@ -0,0 +1,143 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* rampstream - Generate and check a stream of bytes in a ramp pattern
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
/* Length of the repeating ramp. This is deliberately not a "round" number so
* that we're very likely to misalign with likely block or chunk sizes of the
* transport. That means we'll detect gaps in the stream, even if they occur
* neatly on block boundaries. Specifically this is the largest 8-bit prime. */
#define RAMPLEN 251
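/* Illustration: if the transport silently dropped a single 4096-byte block,
 * the ramp phase would shift by 4096 % 251 = 80 bytes, so the next full-ramp
 * comparison in ramp_check() fails instead of lining up again.
 */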
#define INTERVAL 10000
#define ARRAY_SIZE(a) ((int)(sizeof(a) / sizeof((a)[0])))
#define die(...) \
do { \
fprintf(stderr, "rampstream: " __VA_ARGS__); \
exit(1); \
} while (0)
static void usage(void)
{
die("Usage:\n"
" rampstream send <number>\n"
" Generate a ramp pattern of bytes on stdout, repeated <number>\n"
" times\n"
" rampstream check <number>\n"
" Check a ramp pattern of bytes on stdin, repeater <number>\n"
" times\n");
}
static void ramp_send(unsigned long long num, const uint8_t *ramp)
{
unsigned long long i;
for (i = 0; i < num; i++) {
int off = 0;
ssize_t rc;
if (i % INTERVAL == 0)
fprintf(stderr, "%llu...\r", i);
while (off < RAMPLEN) {
rc = write(1, ramp + off, RAMPLEN - off);
if (rc < 0) {
if (errno == EINTR ||
errno == EAGAIN ||
errno == EWOULDBLOCK)
continue;
die("Error writing ramp: %s\n",
strerror(errno));
}
if (rc == 0)
die("Zero length write\n");
off += rc;
}
}
}
static void ramp_check(unsigned long long num, const uint8_t *ramp)
{
unsigned long long i;
for (i = 0; i < num; i++) {
uint8_t buf[RAMPLEN];
int off = 0;
ssize_t rc;
if (i % INTERVAL == 0)
fprintf(stderr, "%llu...\r", i);
while (off < RAMPLEN) {
rc = read(0, buf + off, RAMPLEN - off);
if (rc < 0) {
if (errno == EINTR ||
errno == EAGAIN ||
errno == EWOULDBLOCK)
continue;
die("Error reading ramp: %s\n",
strerror(errno));
}
if (rc == 0)
die("Unexpected EOF, ramp %llu, byte %d\n",
i, off);
off += rc;
}
if (memcmp(buf, ramp, sizeof(buf)) != 0) {
int j, k;
for (j = 0; j < RAMPLEN; j++)
if (buf[j] != ramp[j])
break;
for (k = j; k < RAMPLEN && k < j + 16; k++)
fprintf(stderr,
"Byte %d: expected 0x%02x, got 0x%02x\n",
k, ramp[k], buf[k]);
die("Data mismatch, ramp %llu, byte %d\n", i, j);
}
}
}
int main(int argc, char *argv[])
{
const char *subcmd = argv[1];
unsigned long long num;
uint8_t ramp[RAMPLEN];
char *e;
int i;
if (argc < 2)
usage();
errno = 0;
num = strtoull(argv[2], &e, 0);
if (*e || errno)
usage();
/* Initialize the ramp block */
for (i = 0; i < RAMPLEN; i++)
ramp[i] = i;
if (strcmp(subcmd, "send") == 0)
ramp_send(num, ramp);
else if (strcmp(subcmd, "check") == 0)
ramp_check(num, ramp);
else
usage();
exit(0);
}


@ -93,6 +93,7 @@ run() {
test memory/passt
teardown memory
VHOST_USER=0
setup passt
test passt/ndp
test passt/dhcp
@ -115,7 +116,59 @@ run() {
test two_guests/basic
teardown two_guests
VHOST_USER=1
setup passt_in_ns
test passt_vu/ndp
test passt_vu_in_ns/dhcp
test passt_vu_in_ns/icmp
test passt_vu_in_ns/tcp
test passt_vu_in_ns/udp
test passt_vu_in_ns/shutdown
teardown passt_in_ns
setup two_guests
test two_guests_vu/basic
teardown two_guests
setup migrate
test migrate/basic
teardown migrate
setup migrate
test migrate/basic_fin
teardown migrate
setup migrate
test migrate/bidirectional
teardown migrate
setup migrate
test migrate/bidirectional_fin
teardown migrate
setup migrate
test migrate/iperf3_out4
teardown migrate
setup migrate
test migrate/iperf3_out6
teardown migrate
setup migrate
test migrate/iperf3_in4
teardown migrate
setup migrate
test migrate/iperf3_in6
teardown migrate
setup migrate
test migrate/iperf3_bidir6
teardown migrate
setup migrate
test migrate/iperf3_many_out6
teardown migrate
setup migrate
test migrate/rampstream_in
teardown migrate
setup migrate
test migrate/rampstream_out
teardown migrate
VALGRIND=0
VHOST_USER=0
setup passt_in_ns
test passt/ndp
test passt_in_ns/dhcp
@ -126,6 +179,15 @@ run() {
test passt_in_ns/shutdown
teardown passt_in_ns
VHOST_USER=1
setup passt_in_ns
test passt_vu/ndp
test passt_vu_in_ns/dhcp
test perf/passt_vu_tcp
test perf/passt_vu_udp
test passt_vu_in_ns/shutdown
teardown passt_in_ns
# TODO: Make those faster by at least pre-installing gcc and make on
# non-x86 images, then re-enable.
skip_distro() {
@ -140,7 +202,7 @@ skip_distro() {
perf_finish
[ ${CI} -eq 1 ] && video_stop
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
pause_continue \
"Press any key to keep test session open" \
@ -161,7 +223,10 @@ run_selected() {
__setup=
for __test; do
if [ "${__test%%/*}" != "${__setup}" ]; then
# HACK: the migrate tests need the setup repeated for
# each test
if [ "${__test%%/*}" != "${__setup}" -o \
"${__test%%/*}" = "migrate" ]; then
[ -n "${__setup}" ] && teardown "${__setup}"
__setup="${__test%%/*}"
setup "${__setup}"
@ -171,7 +236,7 @@ run_selected() {
done
teardown "${__setup}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
pause_continue \
"Press any key to keep test session open" \
@ -242,4 +307,4 @@ fi
tail -n1 ${LOGFILE}
echo "Log at ${LOGFILE}"
exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\)$/\1/p')
exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\),.*$/\1/p')

test/two_guests_vu Symbolic link

@ -0,0 +1 @@
two_guests

udp.c

@ -39,27 +39,30 @@
* could receive packets from multiple flows, so we use a hash table match to
* find the specific flow for a datagram.
*
* When a UDP flow is initiated from a listening socket we take a duplicate of
* the socket and store it in uflow->s[INISIDE]. This will last for the
* Flow sockets
* ============
*
* When a UDP flow targets a socket, we create a "flow" socket in
* uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
* replies on the target side. This socket is both bound and connected and has
* EPOLL_TYPE_UDP. The connect() means it will only receive datagrams
* associated with this flow, so the epoll reference directly points to the flow
* and we don't need a hash lookup.
*
* When a flow is initiated from a listening socket, we create a "flow" socket
* with the same bound address as the listening socket, but also connect()ed to
* the flow's peer. This is stored in uflow->s[INISIDE] and will last for the
* lifetime of the flow, even if the original listening socket is closed due to
* port auto-probing. The duplicate is used to deliver replies back to the
* originating side.
*
* Reply sockets
* =============
*
* When a UDP flow targets a socket, we create a "reply" socket in
* uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
* replies on the target side. This socket is both bound and connected and has
* EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams
* associated with this flow, so the epoll reference directly points to the flow
* and we don't need a hash lookup.
*
* NOTE: it's possible that the reply socket could have a bound address
* overlapping with an unrelated listening socket. We assume datagrams for the
* flow will come to the reply socket in preference to a listening socket. The
* sample program doc/platform-requirements/reuseaddr-priority.c documents and
* tests that assumption.
* NOTE: A flow socket can have a bound address overlapping with a listening
* socket. That will happen naturally for flows initiated from a socket, but is
* also possible (though unlikely) for tap initiated flows, depending on the
* source port. We assume datagrams for the flow will come to a connect()ed
* socket in preference to a listening socket. The sample program
* doc/platform-requirements/reuseaddr-priority.c documents and tests that
* assumption.
*
* "Spliced" flows
* ===============
@ -71,8 +74,7 @@
* actually used; it doesn't make sense for datagrams and instead a pair of
* recvmmsg() and sendmmsg() is used to forward the datagrams.
*
* Note that a spliced flow will have *both* a duplicated listening socket and a
* reply socket (see above).
* Note that a spliced flow will have two flow sockets (see above).
*/
#include <sched.h>
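(Aside: the connect()ed "flow" sockets described above boil down to the pattern sketched here: a UDP socket bound to the same local address a listening socket may already use, then connect()ed to a single peer, so the kernel delivers only that flow's datagrams to it. This is an illustrative sketch only, IPv4-only and with minimal error handling; the helper name flow_socket4() is invented for the example and is not part of passt.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bound like a listening socket, but connect()ed to one peer only */
static int flow_socket4(in_port_t lport, const char *peer, in_port_t pport)
{
	struct sockaddr_in local = { .sin_family = AF_INET,
				     .sin_port = htons(lport) };
	struct sockaddr_in remote = { .sin_family = AF_INET,
				      .sin_port = htons(pport) };
	int one = 1, s;

	if (inet_pton(AF_INET, peer, &remote.sin_addr) != 1)
		return -1;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		return -1;

	/* Allow the bound address to overlap with a listening socket */
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0 ||
	    connect(s, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
		close(s);
		return -1;
	}

	return s;	/* recv() now only returns datagrams from peer:pport */
}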
@ -87,6 +89,8 @@
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <netinet/ip_icmp.h>
#include <netinet/icmp6.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
@ -109,29 +113,25 @@
#include "pcap.h"
#include "log.h"
#include "flow_table.h"
#include "udp_internal.h"
#include "udp_vu.h"
#define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */
/* Maximum UDP data to be returned in ICMP messages */
#define ICMP4_MAX_DLEN 8
#define ICMP6_MAX_DLEN (IPV6_MIN_MTU \
- sizeof(struct udphdr) \
- sizeof(struct ipv6hdr))
/* "Spliced" sockets indexed by bound port (host order) */
static int udp_splice_ns [IP_VERSIONS][NUM_PORTS];
static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
/* Static buffers */
/**
* struct udp_payload_t - UDP header and data for inbound messages
* @uh: UDP header
* @data: UDP data
*/
static struct udp_payload_t {
struct udphdr uh;
char data[USHRT_MAX - sizeof(struct udphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32)))
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
#endif
udp_payload[UDP_MAX_FRAMES];
/* UDP header and data for inbound messages */
static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
/* Ethernet header for IPv4 frames */
static struct ethhdr udp4_eth_hdr;
@ -140,26 +140,31 @@ static struct ethhdr udp4_eth_hdr;
static struct ethhdr udp6_eth_hdr;
/**
* struct udp_meta_t - Pre-cooked headers and metadata for UDP packets
* struct udp_meta_t - Pre-cooked headers for UDP packets
* @ip6h: Pre-filled IPv6 header (except for payload_len and addresses)
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
* @taph: Tap backend specific header
* @s_in: Source socket address, filled in by recvmmsg()
* @tosidx: sidx for the destination side of this datagram's flow
*/
static struct udp_meta_t {
struct ipv6hdr ip6h;
struct iphdr ip4h;
struct tap_hdr taph;
union sockaddr_inany s_in;
flow_sidx_t tosidx;
}
#ifdef __AVX2__
__attribute__ ((aligned(32)))
#endif
udp_meta[UDP_MAX_FRAMES];
#define PKTINFO_SPACE \
MAX(CMSG_SPACE(sizeof(struct in_pktinfo)), \
CMSG_SPACE(sizeof(struct in6_pktinfo)))
#define RECVERR_SPACE \
MAX(CMSG_SPACE(sizeof(struct sock_extended_err) + \
sizeof(struct sockaddr_in)), \
CMSG_SPACE(sizeof(struct sock_extended_err) + \
sizeof(struct sockaddr_in6)))
/**
* enum udp_iov_idx - Indices for the buffers making up a single UDP frame
* @UDP_IOV_TAP: tap specific header
@ -236,8 +241,6 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
tiov[UDP_IOV_PAYLOAD].iov_base = payload;
mh->msg_name = &meta->s_in;
mh->msg_namelen = sizeof(meta->s_in);
mh->msg_iov = siov;
mh->msg_iovlen = 1;
}
@ -257,41 +260,6 @@ static void udp_iov_init(const struct ctx *c)
udp_iov_init_one(c, i);
}
/**
* udp_splice_prepare() - Prepare one datagram for splicing
* @mmh: Receiving mmsghdr array
* @idx: Index of the datagram to prepare
*/
static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx)
{
udp_mh_splice[idx].msg_hdr.msg_iov->iov_len = mmh[idx].msg_len;
}
/**
* udp_splice_send() - Send a batch of datagrams from socket to socket
* @c: Execution context
* @start: Index of batch's first datagram in udp[46]_l2_buf
* @n: Number of datagrams in batch
* @src: Source port for datagram (target side)
* @dst: Destination port for datagrams (target side)
* @ref: epoll reference for origin socket
* @now: Timestamp
*/
static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
const struct udp_flow *uflow = udp_at_sidx(tosidx);
uint8_t topif = pif_at_sidx(tosidx);
int s = uflow->s[tosidx.sidei];
socklen_t sl;
pif_sockaddr(c, &udp_splice_to, &sl, topif,
&toside->eaddr, toside->eport);
sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL);
}
/**
* udp_update_hdr4() - Update headers for one IPv4 datagram
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
@ -302,9 +270,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
*
* Return: size of IPv4 payload (UDP header + data)
*/
static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
{
const struct in_addr *src = inany_v4(&toside->oaddr);
const struct in_addr *dst = inany_v4(&toside->eaddr);
@ -328,7 +296,8 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
.iov_base = bp->data,
.iov_len = dlen
};
csum_udp4(&bp->uh, *src, *dst, &iov, 1, 0);
struct iov_tail data = IOV_TAIL(&iov, 1, 0);
csum_udp4(&bp->uh, *src, *dst, &data);
}
return l4len;
@ -345,9 +314,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
*
* Return: size of IPv6 payload (UDP header + data)
*/
static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum)
{
uint16_t l4len = dlen + sizeof(bp->uh);
@ -372,8 +341,8 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
.iov_base = bp->data,
.iov_len = dlen
};
csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
&iov, 1, 0);
struct iov_tail data = IOV_TAIL(&iov, 1, 0);
csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data);
}
return l4len;
@ -413,28 +382,172 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
}
/**
* udp_send_tap_icmp4() - Construct and send ICMPv4 to local peer
* @c: Execution context
* @ee: Extended error descriptor
* @toside: Destination side of flow
* @saddr: Address of ICMP generating node
* @in: First bytes (max 8) of original UDP message body
* @dlen: Length of the read part of original UDP message body
*/
static void udp_send_tap_icmp4(const struct ctx *c,
const struct sock_extended_err *ee,
const struct flowside *toside,
struct in_addr saddr,
const void *in, size_t dlen)
{
struct in_addr oaddr = toside->oaddr.v4mapped.a4;
struct in_addr eaddr = toside->eaddr.v4mapped.a4;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
struct {
struct icmphdr icmp4h;
struct iphdr ip4h;
struct udphdr uh;
char data[ICMP4_MAX_DLEN];
} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
size_t l4len = dlen + sizeof(struct udphdr);
ASSERT(dlen <= ICMP4_MAX_DLEN);
memset(&msg, 0, sizeof(msg));
msg.icmp4h.type = ee->ee_type;
msg.icmp4h.code = ee->ee_code;
if (ee->ee_type == ICMP_DEST_UNREACH && ee->ee_code == ICMP_FRAG_NEEDED)
msg.icmp4h.un.frag.mtu = htons((uint16_t) ee->ee_info);
/* Reconstruct the original headers as returned in the ICMP message */
tap_push_ip4h(&msg.ip4h, eaddr, oaddr, l4len, IPPROTO_UDP);
tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
}
/**
* udp_send_tap_icmp6() - Construct and send ICMPv6 to local peer
* @c: Execution context
* @ee: Extended error descriptor
* @toside: Destination side of flow
* @saddr: Address of ICMP generating node
* @in: First bytes (max 1232) of original UDP message body
* @dlen: Length of the read part of original UDP message body
* @flow: IPv6 flow identifier
*/
static void udp_send_tap_icmp6(const struct ctx *c,
const struct sock_extended_err *ee,
const struct flowside *toside,
const struct in6_addr *saddr,
void *in, size_t dlen, uint32_t flow)
{
const struct in6_addr *oaddr = &toside->oaddr.a6;
const struct in6_addr *eaddr = &toside->eaddr.a6;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
struct {
struct icmp6_hdr icmp6h;
struct ipv6hdr ip6h;
struct udphdr uh;
char data[ICMP6_MAX_DLEN];
} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
size_t l4len = dlen + sizeof(struct udphdr);
ASSERT(dlen <= ICMP6_MAX_DLEN);
memset(&msg, 0, sizeof(msg));
msg.icmp6h.icmp6_type = ee->ee_type;
msg.icmp6h.icmp6_code = ee->ee_code;
if (ee->ee_type == ICMP6_PACKET_TOO_BIG)
msg.icmp6h.icmp6_dataun.icmp6_un_data32[0] = htonl(ee->ee_info);
/* Reconstruct the original headers as returned in the ICMP message */
tap_push_ip6h(&msg.ip6h, eaddr, oaddr, l4len, IPPROTO_UDP, flow);
tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
}
/**
* udp_pktinfo() - Retrieve packet destination address from cmsg
* @msg: msghdr into which message has been received
* @dst: (Local) destination address of message in @msg (output)
*
* Return: 0 on success, -1 if the information was missing (@dst is set to
* inany_any6).
*/
static int udp_pktinfo(struct msghdr *msg, union inany_addr *dst)
{
struct cmsghdr *hdr;
for (hdr = CMSG_FIRSTHDR(msg); hdr; hdr = CMSG_NXTHDR(msg, hdr)) {
if (hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_PKTINFO) {
const struct in_pktinfo *i4 = (void *)CMSG_DATA(hdr);
*dst = inany_from_v4(i4->ipi_addr);
return 0;
}
if (hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_PKTINFO) {
const struct in6_pktinfo *i6 = (void *)CMSG_DATA(hdr);
dst->a6 = i6->ipi6_addr;
return 0;
}
}
debug("Missing PKTINFO cmsg on datagram");
*dst = inany_any6;
return -1;
}
/**
* udp_sock_recverr() - Receive and clear an error from a socket
* @s: Socket to receive from
* @c: Execution context
* @s: Socket to receive errors from
* @sidx: Flow and side of @s, or FLOW_SIDX_NONE if unknown
* @pif: Interface on which the error occurred
* (only used if @sidx == FLOW_SIDX_NONE)
* @port: Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
*
* Return: 1 if error received and processed, 0 if no more errors in queue, < 0
* if there was an error reading the queue
*
* #syscalls recvmsg
*/
static int udp_sock_recverr(int s)
static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
uint8_t pif, in_port_t port)
{
char buf[PKTINFO_SPACE + RECVERR_SPACE];
const struct sock_extended_err *ee;
const struct cmsghdr *hdr;
char buf[CMSG_SPACE(sizeof(*ee))];
char data[ICMP6_MAX_DLEN];
struct cmsghdr *hdr;
struct iovec iov = {
.iov_base = data,
.iov_len = sizeof(data)
};
union sockaddr_inany src;
struct msghdr mh = {
.msg_name = NULL,
.msg_namelen = 0,
.msg_iov = NULL,
.msg_iovlen = 0,
.msg_name = &src,
.msg_namelen = sizeof(src),
.msg_iov = &iov,
.msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = sizeof(buf),
};
const struct flowside *fromside, *toside;
union inany_addr offender, otap;
char astr[INANY_ADDRSTRLEN];
char sastr[SOCKADDR_STRLEN];
const struct in_addr *o4;
in_port_t offender_port;
struct udp_flow *uflow;
uint8_t topif;
size_t dlen;
ssize_t rc;
rc = recvmsg(s, &mh, MSG_ERRQUEUE);
@ -451,33 +564,102 @@ static int udp_sock_recverr(int s)
return -1;
}
hdr = CMSG_FIRSTHDR(&mh);
if (!((hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_RECVERR) ||
(hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_RECVERR))) {
err("Unexpected cmsg reading error queue");
for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
if ((hdr->cmsg_level == IPPROTO_IP &&
hdr->cmsg_type == IP_RECVERR) ||
(hdr->cmsg_level == IPPROTO_IPV6 &&
hdr->cmsg_type == IPV6_RECVERR))
break;
}
if (!hdr) {
err("Missing RECVERR cmsg in error queue");
return -1;
}
ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
/* TODO: When possible propagate and otherwise handle errors */
debug("%s error on UDP socket %i: %s",
str_ee_origin(ee), s, strerror(ee->ee_errno));
str_ee_origin(ee), s, strerror_(ee->ee_errno));
if (!flow_sidx_valid(sidx)) {
/* No hint from the socket, determine flow from addresses */
union inany_addr dst;
if (udp_pktinfo(&mh, &dst) < 0) {
debug("Missing PKTINFO on UDP error");
return 1;
}
sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, &src, &dst, port);
if (!flow_sidx_valid(sidx)) {
debug("Ignoring UDP error without flow");
return 1;
}
} else {
pif = pif_at_sidx(sidx);
}
uflow = udp_at_sidx(sidx);
ASSERT(uflow);
fromside = &uflow->f.side[sidx.sidei];
toside = &uflow->f.side[!sidx.sidei];
topif = uflow->f.pif[!sidx.sidei];
dlen = rc;
if (inany_from_sockaddr(&offender, &offender_port,
SO_EE_OFFENDER(ee)) < 0)
goto fail;
if (pif != PIF_HOST || topif != PIF_TAP)
/* XXX Can we support any other cases? */
goto fail;
/* If the offender *is* the endpoint, make sure our translation is
* consistent with the flow's translation. This matters if the flow
* endpoint has a port specific translation (like --dns-match).
*/
if (inany_equals(&offender, &fromside->eaddr))
otap = toside->oaddr;
else if (!nat_inbound(c, &offender, &otap))
goto fail;
if (hdr->cmsg_level == IPPROTO_IP &&
(o4 = inany_v4(&otap)) && inany_v4(&toside->eaddr)) {
dlen = MIN(dlen, ICMP4_MAX_DLEN);
udp_send_tap_icmp4(c, ee, toside, *o4, data, dlen);
return 1;
}
if (hdr->cmsg_level == IPPROTO_IPV6 && !inany_v4(&toside->eaddr)) {
udp_send_tap_icmp6(c, ee, toside, &otap.a6, data, dlen,
FLOW_IDX(uflow));
return 1;
}
fail:
flow_dbg(uflow, "Can't propagate %s error from %s %s to %s %s",
str_ee_origin(ee),
pif_name(pif),
sockaddr_ntop(SO_EE_OFFENDER(ee), sastr, sizeof(sastr)),
pif_name(topif),
inany_ntop(&toside->eaddr, astr, sizeof(astr)));
return 1;
}
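(For reference, the error-queue machinery driven by this function can be exercised on its own along the lines sketched below. The sketch assumes IP_RECVERR / IPV6_RECVERR were enabled on the socket with setsockopt() beforehand; the helper name drain_one_err() is invented for the example and is not part of passt.)

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Pull one entry off a UDP socket's error queue and describe it */
static int drain_one_err(int s)
{
	char data[1280];	/* start of the datagram that caused the error */
	char cbuf[512];		/* control buffer holding the RECVERR cmsg */
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct sockaddr_storage src;
	struct msghdr mh = {
		.msg_name = &src,	.msg_namelen = sizeof(src),
		.msg_iov = &iov,	.msg_iovlen = 1,
		.msg_control = cbuf,	.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *hdr;
	ssize_t rc;

	rc = recvmsg(s, &mh, MSG_ERRQUEUE);
	if (rc < 0)
		return 0;	/* nothing queued (EAGAIN), or failure */

	for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
		if ((hdr->cmsg_level == IPPROTO_IP &&
		     hdr->cmsg_type == IP_RECVERR) ||
		    (hdr->cmsg_level == IPPROTO_IPV6 &&
		     hdr->cmsg_type == IPV6_RECVERR)) {
			const struct sock_extended_err *ee;

			ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
			fprintf(stderr,
				"%s (type %d, code %d), %zd bytes of original datagram\n",
				strerror(ee->ee_errno), ee->ee_type,
				ee->ee_code, rc);
		}
	}

	return 1;
}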
/**
* udp_sock_errs() - Process errors on a socket
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
* @s: Socket to receive errors from
* @sidx: Flow and side of @s, or FLOW_SIDX_NONE if unknown
* @pif: Interface on which the error occurred
* (only used if @sidx == FLOW_SIDX_NONE)
* @port: Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
*
* Return: Number of errors handled, or < 0 if we have an unrecoverable error
*/
static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx,
uint8_t pif, in_port_t port)
{
unsigned n_err = 0;
socklen_t errlen;
@ -485,11 +667,8 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
ASSERT(!c->no_udp);
if (!(events & EPOLLERR))
return 0; /* Nothing to do */
/* Empty the error queue */
while ((rc = udp_sock_recverr(s)) > 0)
while ((rc = udp_sock_recverr(c, s, sidx, pif, port)) > 0)
n_err += rc;
if (rc < 0)
@ -503,7 +682,7 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
}
if (err) {
debug("Unqueued error on UDP socket %i: %s", s, strerror(err));
debug("Unqueued error on UDP socket %i: %s", s, strerror_(err));
n_err++;
}
@ -516,173 +695,262 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
return n_err;
}
/**
* udp_peek_addr() - Get source address for next packet
* @s: Socket to get information from
* @src: Socket address (output)
* @dst: (Local) destination address (output)
*
* Return: 0 if no more packets, 1 on success, -ve error code on error
*/
static int udp_peek_addr(int s, union sockaddr_inany *src,
union inany_addr *dst)
{
char sastr[SOCKADDR_STRLEN], dstr[INANY_ADDRSTRLEN];
char cmsg[PKTINFO_SPACE];
struct msghdr msg = {
.msg_name = src,
.msg_namelen = sizeof(*src),
.msg_control = cmsg,
.msg_controllen = sizeof(cmsg),
};
int rc;
rc = recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
if (rc < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK)
return 0;
return -errno;
}
udp_pktinfo(&msg, dst);
trace("Peeked UDP datagram: %s -> %s",
sockaddr_ntop(src, sastr, sizeof(sastr)),
inany_ntop(dst, dstr, sizeof(dstr)));
return 1;
}
/**
* udp_sock_recv() - Receive datagrams from a socket
* @c: Execution context
* @s: Socket to receive from
* @events: epoll events bitmap
* @mmh: mmsghdr array to receive into
* @n: Maximum number of datagrams to receive
*
* Return: Number of datagrams received
*
* #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
*/
static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
struct mmsghdr *mmh)
static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
{
/* For not entirely clear reasons (data locality?) pasta gets better
* throughput if we receive tap datagrams one at a time. For small
* splice datagrams throughput is slightly better if we do batch, but
* it's slightly worse for large splice datagrams. Since we don't know
* before we receive whether we'll use tap or splice, always go one at a
* time for pasta mode.
*/
int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
ASSERT(!c->no_udp);
if (!(events & EPOLLIN))
return 0;
n = recvmmsg(s, mmh, n, 0, NULL);
if (n < 0) {
err_perror("Error receiving datagrams");
trace("Error receiving datagrams: %s", strerror_(errno));
/* Bail out and let the EPOLLERR handler deal with it */
return 0;
}
return n;
}
/**
* udp_sock_to_sock() - Forward datagrams from socket to socket
* @c: Execution context
* @from_s: Socket to receive datagrams from
* @n: Maximum number of datagrams to forward
* @tosidx: Flow & side to forward datagrams to
*
* #syscalls sendmmsg
*/
static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
const struct udp_flow *uflow = udp_at_sidx(tosidx);
uint8_t topif = pif_at_sidx(tosidx);
int to_s = uflow->s[tosidx.sidei];
socklen_t sl;
int i;
if ((n = udp_sock_recv(c, from_s, udp_mh_recv, n)) <= 0)
return;
for (i = 0; i < n; i++) {
udp_mh_splice[i].msg_hdr.msg_iov->iov_len
= udp_mh_recv[i].msg_len;
}
pif_sockaddr(c, &udp_splice_to, &sl, topif,
&toside->eaddr, toside->eport);
sendmmsg(to_s, udp_mh_splice, n, MSG_NOSIGNAL);
}
/**
* udp_buf_sock_to_tap() - Forward datagrams from socket to tap
* @c: Execution context
* @s: Socket to read data from
* @n: Maximum number of datagrams to forward
* @tosidx: Flow & side to forward data from @s to
*/
static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
int i;
if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
return;
for (i = 0; i < n; i++)
udp_tap_prepare(udp_mh_recv, i, toside, false);
tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
}
/**
* udp_sock_fwd() - Forward datagrams from a possibly unconnected socket
* @c: Execution context
* @s: Socket to forward from
* @frompif: Interface to which @s belongs
* @port: Our (local) port number of @s
* @now: Current timestamp
*/
void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
in_port_t port, const struct timespec *now)
{
union sockaddr_inany src;
union inany_addr dst;
int rc;
while ((rc = udp_peek_addr(s, &src, &dst)) != 0) {
bool discard = false;
flow_sidx_t tosidx;
uint8_t topif;
if (rc < 0) {
trace("Error peeking at socket address: %s",
strerror_(-rc));
/* Clear errors & carry on */
if (udp_sock_errs(c, s, FLOW_SIDX_NONE,
frompif, port) < 0) {
err(
"UDP: Unrecoverable error on listening socket: (%s port %hu)",
pif_name(frompif), port);
/* FIXME: what now? close/re-open socket? */
}
continue;
}
tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, now);
topif = pif_at_sidx(tosidx);
if (pif_is_socket(topif)) {
udp_sock_to_sock(c, s, 1, tosidx);
} else if (topif == PIF_TAP) {
if (c->mode == MODE_VU)
udp_vu_sock_to_tap(c, s, 1, tosidx);
else
udp_buf_sock_to_tap(c, s, 1, tosidx);
} else if (flow_sidx_valid(tosidx)) {
struct udp_flow *uflow = udp_at_sidx(tosidx);
flow_err(uflow,
"No support for forwarding UDP from %s to %s",
pif_name(frompif), pif_name(topif));
discard = true;
} else {
debug("Discarding datagram without flow");
discard = true;
}
if (discard) {
struct msghdr msg = { 0 };
if (recvmsg(s, &msg, MSG_DONTWAIT) < 0)
debug_perror("Failed to discard datagram");
}
}
}
/**
* udp_listen_sock_handler() - Handle new data from socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
*
* #syscalls recvmmsg
 */
void udp_listen_sock_handler(const struct ctx *c,
			     union epoll_ref ref, uint32_t events,
			     const struct timespec *now)
{
if (events & (EPOLLERR | EPOLLIN))
udp_sock_fwd(c, ref.fd, ref.udp.pif, ref.udp.port, now);
}
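The listening-socket handler is now a thin dispatcher: whenever epoll reports readable data or a socket error, everything funnels through udp_sock_fwd(), which peeks, resolves the flow and forwards one datagram at a time. For orientation, a generic sketch of the kind of epoll loop that drives handlers like this one; it is purely illustrative, not passt's actual event loop, and handle_udp() here is a trivial stub:

/* Register a UDP socket for EPOLLIN and EPOLLERR and hand every wakeup
 * to a per-socket handler, the way udp_listen_sock_handler() is driven.
 */
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <time.h>

static void handle_udp(int fd, uint32_t events, const struct timespec *now)
{
	char buf[1500];

	(void)now;
	if (events & (EPOLLERR | EPOLLIN))	/* mirror the real handler */
		while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
			;
}

static int run_loop(int udp_fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLERR,
				  .data.fd = udp_fd };
	int epfd = epoll_create1(EPOLL_CLOEXEC);

	if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, udp_fd, &ev) < 0)
		return -1;

	for (;;) {
		struct epoll_event ret[8];
		struct timespec now;
		int i, n = epoll_wait(epfd, ret, 8, -1);

		clock_gettime(CLOCK_MONOTONIC, &now);
		for (i = 0; i < n; i++)
			handle_udp(ret[i].data.fd, ret[i].events, &now);
	}
}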
/**
 * udp_sock_handler() - Handle new data from flow specific socket
* @c: Execution context
* @ref: epoll reference
* @events: epoll events bitmap
* @now: Current timestamp
*
* #syscalls recvmmsg
*/
void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
		      uint32_t events, const struct timespec *now)
{
	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
	ASSERT(!c->no_udp && uflow);
	if (events & EPOLLERR) {
		if (udp_sock_errs(c, ref.fd, ref.flowside, PIF_NONE, 0) < 0) {
			flow_err(uflow, "Unrecoverable error on flow socket");
			goto fail;
		}
	}
	if (events & EPOLLIN) {
		/* For not entirely clear reasons (data locality?) pasta gets
		 * better throughput if we receive tap datagrams one at a
		 * time. For small splice datagrams throughput is slightly
		 * better if we do batch, but it's slightly worse for large
		 * splice datagrams. Since we don't know the size before we
		 * receive, always go one at a time for pasta mode.
		 */
		size_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
		flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
		uint8_t topif = pif_at_sidx(tosidx);
		int s = ref.fd;
		flow_trace(uflow, "Received data on reply socket");
		uflow->ts = now->tv_sec;
		if (pif_is_socket(topif)) {
			udp_sock_to_sock(c, ref.fd, n, tosidx);
		} else if (topif == PIF_TAP) {
			if (c->mode == MODE_VU) {
				udp_vu_sock_to_tap(c, s, UDP_MAX_FRAMES,
						   tosidx);
			} else {
				udp_buf_sock_to_tap(c, s, n, tosidx);
			}
		} else {
			flow_err(uflow,
				 "No support for forwarding UDP from %s to %s",
				 pif_name(pif_at_sidx(ref.flowside)),
				 pif_name(topif));
			goto fail;
		}
	}
	return;
fail:
	flow_err_details(uflow);
	udp_flow_close(c, uflow);
}
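Flow-specific sockets like the one handled above are connected to their UDP peer, so the handler never has to inspect per-datagram addresses: it only decides whether the other side of the flow is another socket or the tap device, and how many datagrams to pull per wakeup. As a generic, stand-alone illustration of such a connected UDP socket (not the flow-table code elsewhere in this tree that actually creates them; connected_udp_socket() is a made-up name):

/* Sketch: a UDP socket connect(2)ed to a single peer delivers only
 * datagrams from that peer and allows plain send()/recv(), which is
 * what keeps per-flow handling this simple.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int connected_udp_socket(const char *peer_ip, unsigned short peer_port)
{
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(peer_port) };
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1 ||
	    connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		close(s);
		return -1;
	}
	return s;	/* recv()/send() now apply to this peer only */
}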
@@ -692,6 +960,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
/**
* @af: Address family, AF_INET or AF_INET6
* @saddr: Source address
* @daddr: Destination address
* @ttl: TTL or hop limit for packets to be sent in this call
* @p: Pool of UDP packets, with UDP headers
* @idx: Index of first packet to process
* @now: Current timestamp
@@ -702,7 +971,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
*/
int udp_tap_handler(const struct ctx *c, uint8_t pif,
sa_family_t af, const void *saddr, const void *daddr,
		    uint8_t ttl, const struct pool *p, int idx,
		    const struct timespec *now)
{
const struct flowside *toside;
struct mmsghdr mm[UIO_MAXIOV];
@@ -750,7 +1020,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
}
toside = flowside_at_sidx(tosidx);
	s = uflow->s[tosidx.sidei];
ASSERT(s >= 0);
pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport);
@@ -781,6 +1051,24 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
mm[i].msg_hdr.msg_controllen = 0;
mm[i].msg_hdr.msg_flags = 0;
if (ttl != uflow->ttl[tosidx.sidei]) {
uflow->ttl[tosidx.sidei] = ttl;
if (af == AF_INET) {
if (setsockopt(s, IPPROTO_IP, IP_TTL,
&ttl, sizeof(ttl)) < 0)
flow_perror(uflow,
"setsockopt IP_TTL");
} else {
/* IPv6 hop_limit cannot be only 1 byte */
int hop_limit = ttl;
if (setsockopt(s, SOL_IPV6, IPV6_UNICAST_HOPS,
&hop_limit, sizeof(hop_limit)) < 0)
flow_perror(uflow,
"setsockopt IPV6_UNICAST_HOPS");
}
}
count++;
}
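The added block only touches the kernel when the guest-supplied TTL actually changes, caching the last value per flow side in uflow->ttl[]. Note the asymmetry it works around: Linux accepts a one-byte option value for IP_TTL, but IPV6_UNICAST_HOPS must be passed as a full int. A hedged stand-alone sketch of the same technique, with a static variable standing in for the per-flow cache and set_ttl() as an invented name:

/* Sketch: propagate a packet's TTL / hop limit to an outgoing socket,
 * updating the kernel only when the value changes.
 */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static void set_ttl(int s, int ipv6, unsigned char ttl)
{
	static int cached = -1;	/* stand-in for per-flow state */

	if (ttl == cached)	/* avoid a syscall per datagram */
		return;
	cached = ttl;

	if (!ipv6) {
		int v = ttl;	/* an int-sized value works for IPv4 too */

		if (setsockopt(s, IPPROTO_IP, IP_TTL, &v, sizeof(v)) < 0)
			perror("setsockopt IP_TTL");
	} else {
		int hop_limit = ttl;	/* IPv6 requires an int-sized optval */

		if (setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
			       &hop_limit, sizeof(hop_limit)) < 0)
			perror("setsockopt IPV6_UNICAST_HOPS");
	}
}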

udp.h
@@ -11,11 +11,12 @@
void udp_portmap_clear(void);
void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events, const struct timespec *now);
void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
		      uint32_t events, const struct timespec *now);
int udp_tap_handler(const struct ctx *c, uint8_t pif,
sa_family_t af, const void *saddr, const void *daddr,
		    uint8_t ttl, const struct pool *p, int idx,
		    const struct timespec *now);
int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
const char *ifname, in_port_t port);
int udp_init(struct ctx *c);
