mirror of https://passt.top/passt synced 2025-05-20 00:15:34 +02:00

Compare commits


382 commits

Author SHA1 Message Date
Laurent Vivier
3262c9b088 iov: Standardize function comment headers
Update function comment headers in iov.c to a consistent and
standardized format.

This change ensures:
- Comment blocks for functions consistently start with /**.
- Function names in the comment summary line include parentheses ().

This improves overall comment clarity and uniformity within the file.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:13 +02:00
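
For reference, the format being standardized here follows the kernel-doc-like convention already used across the tree; a minimal illustrative header (the function and parameter names below are made up, not taken from iov.c) looks like this:

/**
 * iov_example_size() - Sum the length of all buffers in an IO vector
 * @iov:        Pointer to the array of struct iovec
 * @iov_cnt:    Number of elements in the array
 *
 * Return: total length in bytes across all the buffers
 */
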
Laurent Vivier
b915375a42 virtio: Correct and align comment headers
Standardize and fix issues in `virtio.c` and `virtio.h` comment headers.

Improvements include:
- Added `()` to function names in comment summaries.
- Added colons after parameter and enum member tags.
- Changed `/*` to `/**` for `virtq_avail_event()` comment.
- Fixed typos (e.g., "file"->"fill", "virqueue"->"virtqueue").
- Added missing `Return:` tag for `vu_queue_rewind()`.
- Corrected parameter names in `virtio.h` comments to match code.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:11 +02:00
Laurent Vivier
2fd0944f21 vhost_user: Correct and align function comment headers
This commit cleans up function comment headers in vhost_user.c to ensure
accuracy and consistency with the code. Changes include correcting
parameter names in comments and signatures (e.g., standardizing on vmsg
for vhost messages, fixing dev to vdev), updating function names in
comment descriptions, and removing/rectifying erroneous parameter
documentation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-16 18:27:08 +02:00
Laurent Vivier
2046976866 codespell: Correct typos in comments and error message
This commit addresses several spelling errors identified by the `codespell`
tool. The corrections apply to:
- Code comments in `fwd.c`, `ip.h`, `isolation.c`, and `log.c`.
- An error message string in `vhost_user.c`.

Specifically, the following misspellings were corrected:
- "adddress" to "address"
- "capabilites" to "capabilities"
- "Musn't" to "Mustn't"
- "calculatd" to "calculated"
- "Invalide" to "Invalid"

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:30 +02:00
Laurent Vivier
4234ace84c test: Display count of skipped tests in status and summary
This commit enhances test reporting by tracking and displaying the
number of skipped tests.

The skipped test count is now visible in the tmux status bar during
execution and included in the final test summary log. This provides
a more complete overview of test suite results.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-15 18:06:19 +02:00
Laurent Vivier
2d3d69c5c3 flow: Fix clang error (clang-analyzer-security.PointerSub)
Fixes the following clang-analyzer warning:

flow_table.h:96:25: note: Subtraction of two pointers that do not point into the same array is undefined behavior
   96 |         return (union flow *)f - flowtab;

The `flow_idx()` function is called via `FLOW_IDX()` from
`flow_foreach_slot()`, where `f` is set to `&flowtab[idx].f`.
Therefore, `f` and `flowtab` do point to the same array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:37 +02:00
Laurent Vivier
0f7bf10b0a ndp: Fix Clang analyzer warning (clang-analyzer-security.PointerSub)
Addresses Clang warning: "Subtraction of two pointers that do not
point into the same array is undefined behavior" for the line:
  `ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);`

Here, `ptr` is `&ra.var[0]`. The subtraction calculates the offset
of `var[0]` within the `struct ra_options ra`. Since `ptr` points
inside `ra`, this pointer arithmetic is well-defined for
calculating the size of the data to send, even if `ptr` and `&ra`
are not strictly considered part of the same "array" by the analyzer.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:35 +02:00
Laurent Vivier
a6b9832e49 virtio: Fix Clang warning (bugprone-sizeof-expression, cert-arr39-c)
In `virtqueue_read_indirect_desc()`, the pointer arithmetic involving
`desc` is intentional. We add the length in bytes (`read_len`)
divided by the size of `struct vring_desc` to `desc`, which is
an array of `struct vring_desc`. This correctly calculates the
offset in terms of the number of `struct vring_desc` elements.

Clang issues the following warning due to this explicit scaling:

virtio.c:238:8: error: suspicious usage of 'sizeof(...)' in pointer
arithmetic; this scaled value will be scaled again by the '+='
operator [bugprone-sizeof-expression,cert-arr39-c,-Werror]
  238 |         desc += read_len / sizeof(struct vring_desc);
      |               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~
virtio.c:238:8: note: '+=' in pointer arithmetic internally scales
with 'sizeof(struct vring_desc)' == 16

This behavior is intended, so the warning can be considered a
false positive in this context. The code correctly advances the
pointer by the desired number of descriptor entries.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:32 +02:00
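
A self-contained sketch of the arithmetic in question, with made-up values, showing why the explicit sizeof() scaling is correct: dividing a byte count by the element size gives an element count, which pointer arithmetic then scales back up.

#include <stdio.h>
#include <stdint.h>

/* 16-byte descriptor layout, as in the virtio ring */
struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags, next; };

int main(void)
{
        struct vring_desc table[8];
        struct vring_desc *desc = table;
        size_t read_len = 4 * sizeof(struct vring_desc); /* bytes read into the table */

        /* Convert the byte count into an element count; '+=' on a typed
         * pointer then scales by sizeof(struct vring_desc) again, which is
         * exactly what we want here. */
        desc += read_len / sizeof(struct vring_desc);

        printf("advanced by %td descriptors\n", desc - table); /* prints 4 */
        return 0;
}
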
Laurent Vivier
570e7b4454 dhcpv6: fix GCC error (unterminated-string-initialization)
The string STR_NOTONLINK is intentionally not NUL-terminated.
Ignore the GCC error using __attribute__((nonstring)).

This error is reported by GCC 15.1.1 on Fedora 42. However,
Clang 20.1.3 does not support __attribute__((nonstring)).
Therefore, NOLINTNEXTLINE(clang-diagnostic-unknown-attributes)
is also added to suppress Clang's unknown attribute warning.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-14 17:51:20 +02:00
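
A minimal sketch of the construct described above, with a made-up array name: the attribute tells GCC the missing NUL terminator is intentional, and the NOLINT comment keeps Clang from complaining about an attribute it doesn't know.

/* Deliberately not NUL-terminated: exactly 8 bytes go on the wire. */
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
static const char status_msg[8] __attribute__((nonstring)) = "NOTONLNK";
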
Laurent Vivier
8ec134109e flow: close socket fd on error
In eea8a76caf ("flow: fix podman issue "), we unregister
the fd from epoll_ctl() in case of error, but we also need to close it.

As flowside_sock_l4() also calls sock_l4_sa() via flowside_sock_splice()
we can do it unconditionally.

Fixes: eea8a76caf ("flow: fix podman issue ")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-12 21:04:57 +02:00
Laurent Vivier
92d5d68013 flow: fix wrong macro name in comments
The macro name for the maximum number of flows is FLOW_MAX, not MAX_FLOW.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-08 13:24:14 +02:00
Laurent Vivier
eea8a76caf flow: fix podman issue
While running pasta, we trigger the following assert:

  ASSERTION FAILED in udp_at_sidx (udp_flow.c:35): flow->f.type == FLOW_UDP

in udp_at_sidx() in the following path:

 902 void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 903                       uint32_t events, const struct timespec *now)
 904 {
 905         struct udp_flow *uflow = udp_at_sidx(ref.flowside);

The invalid sidx comes from the epoll_ref provided by epoll_wait().

The assert is preceded by the following error:

  Couldn't connect flow socket: Permission denied

It appears that an error happens in udp_flow_sock() and the recently
created fd is not removed from the epoll_ctl() pool:

 71 static int udp_flow_sock(const struct ctx *c,
 72                          struct udp_flow *uflow, unsigned sidei)
 73 {
...
 82         s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 83         if (s < 0) {
 84                 flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 85                 return s;
 86         }
 87
 88         if (flowside_connect(c, s, pif, side) < 0) {
 89                 int rc = -errno;
 90                 flow_dbg_perror(uflow, "Couldn't connect flow socket");
 91                 return rc;
 92         }
...

flowside_sock_l4() calls sock_l4_sa() that adds 's' to the epoll_ctl()
pool.

So to cleanly manage the error of flowside_connect() we need to remove
's' from the epoll_ctl() pool using epoll_del().

Link: https://github.com/containers/podman/issues/26073
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-07 14:42:48 +02:00
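
A minimal sketch of the resulting error path (the helper name and surrounding structure are illustrative, not the passt code): once the fd is registered with epoll, a later failure has to both remove it from the epoll set, as the epoll_del() step mentioned above does, and close it.

#include <errno.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int flow_sock_open(int epfd, const struct sockaddr *sa, socklen_t salen)
{
        struct epoll_event ev = { .events = EPOLLIN }; /* real code also sets ev.data */
        int s = socket(sa->sa_family, SOCK_DGRAM | SOCK_NONBLOCK, IPPROTO_UDP);

        if (s < 0)
                return -errno;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, s, &ev) < 0) {
                int rc = -errno;

                close(s);
                return rc;
        }

        if (connect(s, sa, salen) < 0) {
                int rc = -errno;

                epoll_ctl(epfd, EPOLL_CTL_DEL, s, NULL); /* undo the registration... */
                close(s);                                /* ...and release the fd */
                return rc;
        }

        return s;
}
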
Stefano Brivio
587980ca1e udp: Actually discard datagrams we can't forward
Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint
addresses for datagrams, if we can't forward one of these datagrams,
we need to make sure we actually discard it. Otherwise, with MSG_PEEK,
we won't dequeue it and will loop on it forever.

For example, if we fail to create a socket for a new flow, because,
say, the destination of an inbound packet is multicast, and we can't
bind() to a multicast address, the loop will look like this:

18.0563: Flow 0 (NEW): FREE -> NEW
18.0563: Flow 0 (INI): NEW -> INI
18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0563: Flow 0 (TGT): INI -> TGT
18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0563: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow
18.0564: Flow 0 (NEW): FREE -> NEW
18.0564: Flow 0 (INI): NEW -> INI
18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0564: Flow 0 (TGT): INI -> TGT
18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow

and seen from strace:

epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-03 10:21:20 +02:00
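
One way to discard a datagram that was only examined with MSG_PEEK, shown as a hedged sketch rather than the actual passt change: on a datagram socket, a zero-length recv() still dequeues exactly one datagram.

#include <stddef.h>
#include <sys/socket.h>

/* Drop the next queued datagram without copying any of its payload */
static void udp_discard_one(int s)
{
        (void)recv(s, NULL, 0, MSG_DONTWAIT);
}
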
Emanuel Valasiadis
f0021f9e1d fwd: fix doc typo
Signed-off-by: Emanuel Valasiadis <emanuel@valasiadis.space>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-03 03:42:51 +02:00
Janne Grunau
93394f4ef0 selinux: Add getattr to class udp_socket
Commit 59cc89f ("udp, udp_flow: Track our specific address on socket
interfaces") added a getsockname() call in udp_flow_new(). This requires
getattr. Fixes "Flow 0 (UDP flow): Unable to determine local address:
Permission denied" errors in muvm/passt on Fedora Linux 42 with SELinux.

The SELinux audit message is

| type=AVC msg=audit(1746083799.606:235): avc:  denied  { getattr } for
|   pid=2961 comm="passt" laddr=127.0.0.1 lport=49221
|   faddr=127.0.0.53 fport=53
|   scontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tcontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tclass=udp_socket permissive=0

Fixes: 59cc89f4cc ("udp, udp_flow: Track our specific address on socket interfaces")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2363238
Signed-off-by: Janne Grunau <janne-psst@jannau.net>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 12:00:51 +02:00
Laurent Vivier
11be695f5c flow: fix podman issue
While running piHole using podman, traffic can trigger the following
assert:

ASSSERTION FAILED in flow_alloc (flow.c:521): flow->f.state == FLOW_STATE_FREE

Backtrace shows that this happens in flow_defer_handler():

      0x00005610d6f5b481 flow_alloc (passt + 0xb481)
      0x00005610d6f74f86 udp_flow_from_sock (passt + 0x24f86)
      0x00005610d6f737c3 udp_sock_fwd (passt + 0x237c3)
      0x00005610d6f74c07 udp_flush_flow (passt + 0x24c07)
      0x00005610d6f752c2 udp_flow_defer (passt + 0x252c2)
      0x00005610d6f5bce1 flow_defer_handler (passt + 0xbce1)

We are trying to allocate a new flow from inside the loop that frees them.

Inside the loop free_head points to the first free flow entry in the
current cluster. But if we allocate a new entry during the loop,
free_head is not updated and can point now to the entry we have just
allocated.

We can fix the problem by splitting the loop in two parts:
- a first part where we can close some entries and allocate some new
  flow entries,
- a second part where we free the entries closed in the previous loop
  and aggregate the free entries, merging consecutive clusters.

Reported-by: Martin Rijntjes <bugs@air-global.nl>
Link: https://github.com/containers/podman/issues/25959
Fixes: 9725e79888 ("udp_flow: Don't discard packets that arrive between bind() and connect()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-05-02 11:58:25 +02:00
Stefano Brivio
6a96cd97a5 util: Fix typo, ASSSERTION -> ASSERTION
Fixes: 9153aca15b ("util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-05-02 11:58:10 +02:00
Stefano Brivio
ea0a1240df passt-repair: Hide bogus gcc warning from -Og
When building with gcc 13 and -Og, we get:

passt-repair.c: In function ‘main’:
passt-repair.c:161:23: warning: ‘ev’ may be used uninitialized [-Wmaybe-uninitialized]
  161 |                 if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
      |                     ~~^~~~~

but that can't actually happen, because we only exit the preceding
while loop if 'found' is true, and that only happens, in turn, as we
assign 'ev'.

Get rid of the warning by (redundantly) initialising ev to NULL.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-30 16:58:58 +02:00
Alyssa Ross
aa1cc89228 conf: allow --fd 0
inetd-style socket passing traditionally starts a service with a
connected socket on file descriptors 0 and 1.  passt disallowing
obtaining its socket from either of these descriptors made it
difficult to use with super-servers providing this interface — in my
case I wanted to use passt with s6-ipcserver[1].  Since (as far as I
can tell) passt does not use standard input for anything else (unlike
standard output), it should be safe to relax the restrictions on --fd
to allow setting it to 0, enabling this use case.

Link: https://skarnet.org/software/s6/s6-ipcserver.html [1]
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-28 14:01:17 +02:00
David Gibson
436afc3044 udp: Translate offender addresses for ICMP messages
We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.

Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host.  However, we don't translate this address for cases where we NAT
between host and guest.  This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address.  The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.

Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry.  We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:05 +02:00
David Gibson
08e617ec2b udp: Rework offender address handling in udp_sock_recverr()
Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.

 * Get the "offender" address (source of the ICMP packet) using the
   SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
 * Parse the offender sockaddr using inany_from_sockaddr()
 * Check explicitly that the source and destination pifs are what we
   expect.  Previously we checked something that was probably equivalent
   in practice, but isn't strictly speaking what we require for the rest
   of the code.
 * Verify that for an ICMPv4 error we also have an IPv4 source/offender
   and destination/endpoint address
 * Verify that for an ICMPv6 error we have an IPv6 endpoint
 * Improve debug reporting of any failures

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:03 +02:00
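
A minimal sketch of fetching the offender address via SO_EE_OFFENDER(), as referenced above; buffer sizes and the function name are illustrative, and real code would also check ee_origin and the address family against the flow.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Read one extended error and return the offender (ICMP source) address.
 * Returns 0 on success, -1 if nothing usable was queued. */
static int recv_err_offender(int s, struct sockaddr_storage *offender)
{
        char buf[1024], cbuf[512]; /* cbuf also has room for PKTINFO cmsgs */
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr mh = {
                .msg_iov = &iov,        .msg_iovlen = 1,
                .msg_control = cbuf,    .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg;

        if (recvmsg(s, &mh, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
                return -1;

        for (cmsg = CMSG_FIRSTHDR(&mh); cmsg; cmsg = CMSG_NXTHDR(&mh, cmsg)) {
                struct sock_extended_err *ee;
                struct sockaddr *sa;

                if (!(cmsg->cmsg_level == IPPROTO_IP &&
                      cmsg->cmsg_type == IP_RECVERR) &&
                    !(cmsg->cmsg_level == IPPROTO_IPV6 &&
                      cmsg->cmsg_type == IPV6_RECVERR))
                        continue;

                ee = (struct sock_extended_err *)CMSG_DATA(cmsg);
                sa = SO_EE_OFFENDER(ee); /* address follows the extended error */
                if (sa->sa_family == AF_UNSPEC)
                        return -1; /* no offender information attached */

                memcpy(offender, sa, sa->sa_family == AF_INET6 ?
                       sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in));
                return 0;
        }

        return -1;
}
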
David Gibson
4668e91378 treewide: Improve robustness against sockaddrs of unexpected family
inany_from_sockaddr() expects a socket address of family AF_INET or
AF_INET6 and ASSERT()s if it gets anything else.  In many of the callers we
can handle an unexpected family more gracefully, though, e.g. by failing
a single flow rather than killing passt.

Change inany_from_sockaddr() to return an error instead of ASSERT()ing,
and handle those errors in the callers.  Improve the reporting of any such
errors while we're at it.

With this greater robustness, allow inany_from_sockaddr() to take a void *
rather than specifically a union sockaddr_inany *.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:42:00 +02:00
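
A self-contained sketch of the error-returning shape described here; the union inany_addr type and the real prototype live in the passt tree, so this uses a stand-in structure and name.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Stand-in for passt's union inany_addr: IPv4 stored as a mapped address */
struct addr_port {
        struct in6_addr addr;
        in_port_t port;
};

/* Return 0 on success, -1 for an unexpected family, instead of asserting:
 * the caller can then fail a single flow rather than the whole process. */
static int addr_from_sockaddr(struct addr_port *dst, const void *sa)
{
        sa_family_t family = ((const struct sockaddr *)sa)->sa_family;

        if (family == AF_INET6) {
                const struct sockaddr_in6 *sa6 = sa;

                dst->addr = sa6->sin6_addr;
                dst->port = ntohs(sa6->sin6_port);
                return 0;
        }

        if (family == AF_INET) {
                const struct sockaddr_in *sa4 = sa;

                memset(&dst->addr, 0, sizeof(dst->addr));
                dst->addr.s6_addr[10] = dst->addr.s6_addr[11] = 0xff;
                memcpy(&dst->addr.s6_addr[12], &sa4->sin_addr,
                       sizeof(sa4->sin_addr)); /* ::ffff:a.b.c.d */
                dst->port = ntohs(sa4->sin_port);
                return 0;
        }

        return -1; /* report the bad family, don't abort */
}
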
David Gibson
9128f6e8f4 fwd: Split out helpers for port-independent NAT
Currently the functions fwd_nat_from_*() make some address translations
based on both the IP address and protocol port numbers, and others based
only on the address.  We have some upcoming cases where it's useful to use
the IP-address-only translations separately, so split them out into helper
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-22 12:41:47 +02:00
David Gibson
2340bbf867 udp: Propagate errors on listening and brand new sockets
udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface.  To do this it
currently requires the flow with which the error is associated as a
parameter.  If that's missing it will clear the error condition, but not
propagate it.

That means that we largely ignore errors on "listening" sockets.  It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created.  In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket.  If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.

This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.

Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow.  So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:56:16 +02:00
David Gibson
cfc0ee145a udp: Minor re-organisation of udp_sock_recverr()
Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions.  We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.

Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:49:06 +02:00
David Gibson
f107a86cc0 udp: Add udp_pktinfo() helper
Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr().  We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.

While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:48:35 +02:00
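
A minimal sketch of the helper shape described above (the function name is illustrative, not the passt API): walk every control message instead of assuming the first one is the PKTINFO.

#define _GNU_SOURCE /* for struct in6_pktinfo with glibc */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Find the local destination address of a received datagram, if any.
 * Returns 0 and fills *dst4 or *dst6, -1 if no PKTINFO cmsg was found. */
static int find_pktinfo(struct msghdr *mh, struct in_addr *dst4,
                        struct in6_addr *dst6)
{
        struct cmsghdr *cmsg;

        for (cmsg = CMSG_FIRSTHDR(mh); cmsg; cmsg = CMSG_NXTHDR(mh, cmsg)) {
                if (cmsg->cmsg_level == IPPROTO_IP &&
                    cmsg->cmsg_type == IP_PKTINFO) {
                        struct in_pktinfo pi;

                        memcpy(&pi, CMSG_DATA(cmsg), sizeof(pi));
                        *dst4 = pi.ipi_addr;
                        return 0;
                }
                if (cmsg->cmsg_level == IPPROTO_IPV6 &&
                    cmsg->cmsg_type == IPV6_PKTINFO) {
                        struct in6_pktinfo pi;

                        memcpy(&pi, CMSG_DATA(cmsg), sizeof(pi));
                        *dst6 = pi.ipi6_addr;
                        return 0;
                }
        }

        return -1;
}
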
David Gibson
04984578b0 udp: Deal with errors as we go in udp_sock_fwd()
When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().

In fact, we do deal with this correctly, although the path is somewhat
non-obvious.  The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().

Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events.  This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow.  This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.

At the moment, udp_sock_errs() might fail to flush all datagrams if errors
occur.  In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).

We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as they occur, rather than trying to deal with all errors
in advance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:19 +02:00
David Gibson
3f995586b3 udp: Pass socket & flow information direction to error handling functions
udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.

We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference.  Therefore, make these functions take the socket
and flow from explicit parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:45:09 +02:00
David Gibson
1bb8145c22 udp: Be quieter about errors on UDP receive
If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message.  However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.

This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs().  However, it's possible an error
could be flagged after udp_sock_errs() but before we receive.  So it's
better to handle this error "silently" (trace level only).  We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.

In particular there's one situation that can trigger this case much more
easily.  If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error.  So, we'll get that error on the very first receive after the
connect().  That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg().  This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.

Fixes: 84ab1305fa ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c37 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:56 +02:00
David Gibson
baf049f8e0 udp: Fix breakage of UDP error handling by PKTINFO support
We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets.  This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.

However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams.  udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.

Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b06 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-15 19:43:00 +02:00
Stefano Brivio
50249086a9 conf: Honour --dns-forward for local resolver even with --no-map-gw
If the first resolver listed in the host's /etc/resolv.conf is a
loopback address, and --no-map-gw is given, we automatically conclude
that the resolver is not reachable, discard it, and, if it's the only
nameserver listed in /etc/resolv.conf, we'll warn that we:

  Couldn't get any nameserver address

However, this isn't true in the general case: the user might have passed
--dns-forward, and in that case, while we won't map the address of the
default gateway to the host, we're still supposed to map that
particular address. Otherwise, in this common Podman usage:

  pasta --config-net --dns-forward 169.254.1.1 -t none -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-c02a8d8f-6ee3-902e-33c5-317e0f24e0af --map-guest-addr 169.254.1.2

and with a loopback address in /etc/resolv.conf, we'll unexpectedly
refuse to forward DNS queries:

  # nslookup passt.top 169.254.1.1
  ;; connection timed out; no servers could be reached

To fix this, make an exception for --dns-forward: if &c->ip4.dns_match
or &c->ip6.dns_match are set in add_dns_resolv4() / add_dns_resolv6(),
use that address as guest-facing resolver.

We already set 'dns_host' to the address we found in /etc/resolv.conf,
that's correct in this case and it makes us forward queries as
expected.

I'm not changing the man page as the current description of
--dns-forward is already consistent with the new behaviour: there's no
described way in which --no-map-gw should affect it.

Reported-by: Andrew Sayers <andrew-bugs.passt.top@pileofstuff.org>
Link: https://bugs.passt.top/show_bug.cgi?id=111
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:59 +02:00
Stefano Brivio
bbff3653d6 conf: Split add_dns_resolv() into separate IPv4 and IPv6 versions
Not really valuable by itself, but dropping one level of nested blocks
makes the next change more convenient.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-15 19:42:57 +02:00
David Gibson
59cc89f4cc udp, udp_flow: Track our specific address on socket interfaces
So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs.  That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address.  Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.

This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.

We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams.  We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.

In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works).  So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets.  Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table.  In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.

This adds getsockname() to our seccomp profile for all variants.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:16 +02:00
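
A minimal sketch of the getsockname() step described above (the helper name is illustrative): after connect(), ask the kernel which local address it picked, so the flow table can record a specific oaddr.

#include <string.h>
#include <sys/socket.h>

/* Learn the local address the kernel chose for a connected socket */
static int local_addr_of(int s, struct sockaddr_storage *local)
{
        socklen_t len = sizeof(*local);

        memset(local, 0, sizeof(*local));
        return getsockname(s, (struct sockaddr *)local, &len);
}
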
David Gibson
695c62396e inany: Improve ASSERT message for bad socket family
inany_from_sockaddr() can only handle sockaddrs of family AF_INET or
AF_INET6 and asserts if given something else.  I hit this assertion while
debugging something else, and wanted to see what the bad sockaddr family
was.  Now that we have ASSERT_WITH_MSG() it's easy to add this information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:46:13 +02:00
David Gibson
f4b0dd8b06 udp: Use PKTINFO cmsgs to get destination address for received datagrams
Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address.  Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.

We need this information to properly direct replies to flows which come in
to a non-default local address.  So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr().  For now
we log a trace message but don't do anything more with the information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-10 19:45:59 +02:00
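
A minimal sketch of enabling the options named above on an IPv4 and an IPv6 socket (helper name illustrative, error handling elided):

#include <sys/socket.h>
#include <netinet/in.h>

static void enable_pktinfo(int s4, int s6)
{
        int on = 1;

        /* Ask for the local destination address of each received datagram */
        setsockopt(s4, IPPROTO_IP,   IP_PKTINFO,       &on, sizeof(on));
        setsockopt(s6, IPPROTO_IPV6, IPV6_RECVPKTINFO, &on, sizeof(on));
}
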
David Gibson
6693fa1158 tcp_splice: Don't clobber errno before checking for EAGAIN
Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:27 +02:00
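
A minimal sketch of the general pattern the fix applies (the tracing call here is a stand-in for flow_trace()): capture the errno-dependent result right after the system call, before anything that might clobber errno.

#define _GNU_SOURCE /* for splice() */
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>

static ssize_t splice_step(int from, int to, size_t len, bool *again)
{
        ssize_t n = splice(from, NULL, to, NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

        /* Record the EAGAIN condition immediately: tracing below may make
         * system calls of its own and overwrite errno. */
        *again = (n < 0 && errno == EAGAIN);

        fprintf(stderr, "spliced %zd bytes\n", n);

        return n;
}
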
David Gibson
d3f33f3b8e tcp_splice: Don't double count bytes read on EINTR
In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-09 22:57:16 +02:00
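
A minimal sketch of handling EINTR right at the call site, as described above, so byte counters are only updated once the splice() has actually succeeded (helper name illustrative):

#define _GNU_SOURCE /* for splice() */
#include <errno.h>
#include <fcntl.h>

static ssize_t splice_retry(int from, int to, size_t len)
{
        ssize_t n;

        do {
                n = splice(from, NULL, to, NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        } while (n < 0 && errno == EINTR); /* retry before touching counters */

        return n;
}
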
Stefano Brivio
ffbef85e97 conf: Add missing return in conf_nat(), fix --map-guest-addr none
As reported by somebody on IRC:

  $ pasta --map-guest-addr none
  Invalid address to remap to host: none

that's because once we parsed "none", we try to parse it as an address
as well. But we already handled it, so stop once we're done.

Fixes: e813a4df7d ("conf: Allow address remapped to host to be configured")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:49 +02:00
Stefano Brivio
06ef64cdb7 udp_flow: Save 8 bytes in struct udp_flow on 64-bit architectures
Shuffle the fields just added by commits a7775e9550 ("udp: support
traceroute in direction tap-socket") and 9725e79888 ("udp_flow:
Don't discard packets that arrive between bind() and connect()").

On x86_64, as reported by pahole(1), before:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        _Bool                      closed:1;             /*    76: 0  1 */

        /* XXX 7 bits hole, try to pack */

        _Bool                      flush0;               /*    77     1 */
        _Bool                      flush1:1;             /*    78: 0  1 */

        /* XXX 7 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */
        uint8_t                    ttl[2];               /*    96     2 */

        /* size: 104, cachelines: 2, members: 7 */
        /* sum members: 95, holes: 1, sum holes: 1 */
        /* sum bitfield members: 2 bits, bit holes: 2, sum bit holes: 14 bits */
        /* padding: 6 */
        /* last cacheline: 40 bytes */
};

and after:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        uint8_t                    ttl[2];               /*    76     2 */
        _Bool                      closed:1;             /*    78: 0  1 */
        _Bool                      flush0:1;             /*    78: 1  1 */
        _Bool                      flush1:1;             /*    78: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */

        /* size: 96, cachelines: 2, members: 7 */
        /* sum members: 94, holes: 1, sum holes: 1 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 32 bytes */
};

It doesn't matter much because anyway the typical storage for struct
udp_flow is given by union flow:

union flow {
        struct flow_common         f;                  /*     0    76 */
        struct flow_free_cluster   free;               /*     0    84 */
        struct tcp_tap_conn        tcp;                /*     0   120 */
        struct tcp_splice_conn     tcp_splice;         /*     0   120 */
        struct icmp_ping_flow      ping;               /*     0    96 */
        struct udp_flow            udp;                /*     0    96 */
};

but it still improves data locality somewhat, so let me fix this up
now that commits are fresh.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-09 22:52:32 +02:00
David Gibson
9725e79888 udp_flow: Don't discard packets that arrive between bind() and connect()
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow.  However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow.  Currently we handle this by simply discarding any
queued datagrams after the connect.  UDP protocols should be able to handle
such packet loss, but it's not ideal.

We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow.  We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
9eb5406260 udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock
udp_splice() prepare and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock().  Fold them both into
that caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:44:31 +02:00
David Gibson
bd6a41ee76 udp: Rework udp_listen_sock_data() into udp_sock_fwd()
udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now).  We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference.  So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
159beefa36 udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock()
Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket.  It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.

We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fd844a90bc udp: Move UDP_MAX_FRAMES to udp.c
Recent changes mean that this define is no longer used anywhere except in
udp.c.  Move it back into udp.c from udp_internal.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:53 +02:00
David Gibson
fc6ee68ad3 udp: Merge vhost-user and "buf" listening socket paths
udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure.  The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams.  This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.

This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:43:52 +02:00
David Gibson
0304dd9c34 udp: Split spliced forwarding path from udp_buf_reply_sock_data()
udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap.  It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.

Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.

This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
 udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:41:32 +02:00
David Gibson
5221e177e1 udp: Parameterize number of datagrams handled by udp_*_reply_sock_data()
Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is.  We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:31:54 +02:00
David Gibson
3a0881dfd0 udp: Don't bother to batch datagrams from "listening" socket
A "listening" UDP socket can receive datagrams from multiple flows.  So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.

However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path.  Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data.

This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:30:17 +02:00
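
A minimal sketch of the peek step described above (helper name illustrative), matching the recvmsg(..., MSG_PEEK | MSG_DONTWAIT) calls visible in the strace excerpt earlier: the sender's address is returned without dequeuing the datagram, so the flow can be looked up before the data is received.

#include <sys/socket.h>

/* Get the source address of the next queued datagram without consuming it.
 * Returns 0 on success, -1 if nothing is queued or on error. */
static int peek_source(int s, struct sockaddr_storage *src)
{
        struct msghdr mh = {
                .msg_name = src,
                .msg_namelen = sizeof(*src),
        };

        if (recvmsg(s, &mh, MSG_PEEK | MSG_DONTWAIT) < 0)
                return -1;

        return 0;
}
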
David Gibson
84ab1305fa udp: Polish udp_vu_sock_info() and remove from vu specific code
udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address.  Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it.  We have upcoming uses for it elsewhere so rename and move to udp.c.

While we're there, polish its error reporting a little.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:29:23 +02:00
David Gibson
1d7bbb101a udp: Make udp_sock_recv() take max number of frames as a parameter
Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode.  However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller.  So
make the maximum number of frames a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:25:33 +02:00
David Gibson
d74b5a7c10 udp: Use connect()ed sockets for initiating side
Currently we have an asymmetry in how we handle UDP sockets.  For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow.  For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it).  This has some disadvantages:

 * We need a hash lookup for every datagram on the listening socket in
   order to work out what flow it belongs to
 * The dup() keeps the socket alive even if automatic forwarding removes
   the listening socket.  However, the epoll data remains the same,
   including the now stale original fd.  This causes bug 103.
 * We can't (easily) set flow-specific options on an initiating side
   socket, because that could affect other flows as well

Alter the code to use a connect()ed socket on the initiating side as well
as the target side.  There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket.  We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow.  For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.

Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:24:36 +02:00
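
A minimal sketch of the socket setup sequence described above, with illustrative names and IPv4 only: bind() to the same local address the listening socket uses, which SO_REUSEADDR permits, then connect() so that only this flow's datagrams arrive here.

#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int flow_socket(const struct sockaddr_in *local,
                       const struct sockaddr_in *peer)
{
        int s = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, IPPROTO_UDP);
        int on = 1;

        if (s < 0)
                return -errno;

        /* Share the local address/port with the existing listening socket */
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        if (bind(s, (const struct sockaddr *)local, sizeof(*local)) < 0 ||
            connect(s, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
                int rc = -errno;

                close(s);
                return rc;
        }

        /* Datagrams queued between bind() and connect() may belong to other
         * flows: the caller still has to flush or redirect them. */
        return s;
}
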
Jon Maloy
a7775e9550 udp: support traceroute in direction tap-socket
Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.

We fix that in this commit.

Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-07 21:23:35 +02:00
Stefano Brivio
06784d7fc6 passt-repair: Ensure that read buffer is NULL-terminated
After 3d41e4d838 ("passt-repair: Correct off-by-one error verifying
name"), Coverity Scan isn't convinced anymore about the fact that the
ev->name used in the snprintf() is NULL-terminated.

It comes from a read() call, and read() of course doesn't terminate
it, but we already check that the byte at ev->len - 1 is a NULL
terminator, so this is actually a false positive.

In any case, the logic ensuring that ev->name is NULL-terminated isn't
necessarily obvious, and additionally checking that the last byte in
the buffer we read is a NULL terminator is harmless, so do that
explicitly, even if it's redundant.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-04-05 11:28:23 +02:00
David Gibson
684870a766 udp: Correct some seccomp filter annotations
Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments
stating they use recvmmsg().  That's not correct: they only do so via
udp_sock_recv() which lists recvmmsg() itself.

In contrast udp_splice_send() and udp_tap_handler() both directly use
sendmmsg(), but only the latter lists it.  Add it to the former as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:29 +02:00
David Gibson
76e554d9ec udp: Simplify updates to UDP flow timestamp
Since UDP has no built in knowledge of connections, the only way we
know when we're done with a UDP flow is a timeout with no activity.
To keep track of this struct udp_flow includes a timestamp to record
the last time we saw traffic on the flow.

For data from listening sockets and from tap, this is done implicitly via
udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly.
However, that logic is duplicated between the vhost-user and "buf" paths.
Make it common in udp_reply_sock_handler() instead.

Technically this is a behavioural change: previously if we got an EPOLLIN
event, but there wasn't actually any data we wouldn't update the timestamp,
now we will.  This should be harmless: if there's an EPOLLIN we expect
there to be data, and even if there isn't the worst we can do is mildly
delay the cleanup of a stale flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:26 +02:00
David Gibson
8aa2d90c8d udp: Remove redundant udp_at_sidx() call in udp_tap_handler()
We already have a pointer to the UDP flow in the variable uflow, so we can
just re-use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 11:30:14 +02:00
David Gibson
3d41e4d838 passt-repair: Correct off-by-one error verifying name
passt-repair will generate an error if the name it gets from the kernel is
too long or not NUL terminated.  Downstream testing has reported
occasionally seeing this error in practice.

It turns out there is a trivial off-by-one error in the check: ev->len is
the length of the name, including terminating \0 characters, so to check
for a \0 at the end of the buffer we need to check ev->name[len - 1] not
ev->name[len].

Fixes: 42a854a52b ("pasta, passt-repair: Support multiple events per read() in inotify handlers")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:42 +02:00
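
A minimal sketch of the corrected check, mirroring the condition quoted earlier for passt-repair: ev->len counts the name bytes including the terminating NUL, so the byte to verify is at index ev->len - 1.

#include <limits.h>
#include <sys/inotify.h>

static int name_ok(const struct inotify_event *ev)
{
        /* ev->len includes the terminating '\0', so the last valid index
         * is ev->len - 1, not ev->len */
        return ev->len > 0 && ev->len <= NAME_MAX + 1 &&
               ev->name[ev->len - 1] == '\0';
}
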
David Gibson
dec3d73e1e migrate, tcp: bind() migrated sockets in repair mode
Currently on a migration target, we create then immediately bind() new
sockets for the TCP connections we're reconstructing.  Mostly, this works,
since a socket() that is bound but hasn't had listen() or connect() called
is essentially passive.  However, this bind() is subject to the usual
address conflict checking.  In particular that means if we already have
a listening socket on that port, we'll get an EADDRINUSE.  This will happen
for every connection we try to migrate that was initiated from outside to
the guest, since we necessarily created a listening socket for that case.

We set SO_REUSEADDR on the socket in an attempt to avoid this, but that's
not sufficient; even with SO_REUSEADDR address conflicts are still
prohibited for listening sockets.  Of course once these incoming sockets
are fully repaired and connect()ed they'll no longer conflict, but that
doesn't help us if we fail at the bind().

We can avoid this by not calling bind() until we're already in repair mode
which suppresses this transient conflict.  Because of the batching of
setting repair mode, to do that we need to move the bind to a step in
tcp_flow_migrate_target_ext().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:29:01 +02:00
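
A minimal sketch of the ordering described above (helper name illustrative; TCP_REPAIR needs CAP_NET_ADMIN): put the socket into repair mode first, then bind(), so the transient conflict with the listening socket is suppressed.

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int repaired_bind(int s, const struct sockaddr *addr, socklen_t len)
{
        int on = 1;

        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        /* Repair mode first: a bind() on a socket in repair mode isn't
         * subject to the usual address conflict with listening sockets */
        if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
                return -1;

        return bind(s, addr, len);
}
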
David Gibson
6bfc60b095 platform requirements: Add test for address conflicts with TCP_REPAIR
Simple test program to check the behaviour we need for bind() address
conflicts between listening sockets and repair mode sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:59 +02:00
David Gibson
8e32881ef1 platform requirements: Add attributes to die() function
Add both format string and ((noreturn)) attributes to the version of die()
used in the test programs in doc/platform-requirements.  As well as
potentially catching problems in format strings, this means that the
compiler and static checkers can properly reason about the fact that it
will exit, preventing bogus warnings.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:57 +02:00
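
A minimal sketch of a die() carrying both attributes (the body is illustrative): format(printf, 1, 2) lets the compiler check format strings, and noreturn tells it and static checkers that callers never continue past the call.

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

__attribute__((format(printf, 1, 2), noreturn))
static void die(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap); /* format checked via format(printf, 1, 2) */
        va_end(ap);

        exit(EXIT_FAILURE); /* noreturn: no fall-through for callers */
}
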
David Gibson
2ed2d59def platform requirements: Fix clang-tidy warning
Recent clang-tidy versions complain about enums defined with some but not
all entries given explicit values.  I'm not entirely convinced about
whether that's a useful warning, but in any case we really don't need the
explicit values in doc/platform-requirements/reuseaddr-priority.c, so
remove them to make clang happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-04-02 08:28:05 +02:00
David Gibson
3de5af6e41 udp: Improve name of UDP related ICMP sending functions
udp_send_conn_fail_icmp[46]() aren't actually specific to connections
failing: they can propagate a variety of ICMP errors, which might or might
not break a "connection".  They are, however, specific to sending ICMP
errors to the tap connection, not splice or host.  Rename them to better
reflect that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:26:11 +01:00
David Gibson
025a3c2686 udp: Don't attempt to forward ICMP socket errors to other sockets
Recently we added support for detecting ICMP triggered errors on UDP
sockets and forwarding them to the tap interface.  However, in
udp_sock_recverr() where this is handled we don't know for certain that
the tap interface is the other side of the UDP flow.  It could be a spliced
connection with another socket on the other side.

To forward errors in that case, we'd need to force the other side's socket
to trigger an ICMP error.  I'm not sure if there's a way to do that;
probably not for an arbitrary ICMP but it might be possible for certain
error conditions.

Nonetheless what we do now - synthesise an ICMP on the tap interface - is
certainly wrong.  It's probably harmless; for a spliced connection it will
have loopback addresses meaning we can expect the guest to discard it.
But, correct this for now, by not attempting to propagate errors when the
other side of the flow is a socket.

Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:25:51 +01:00
Stefano Brivio
42a854a52b pasta, passt-repair: Support multiple events per read() in inotify handlers
The current code assumes that we'll get one event per read() on
inotify descriptors, but that's not the case, not from documentation,
and not from reports.

Add loops in the two inotify handlers we have, in pasta-specific code
and passt-repair, to go through all the events we receive.

Link: https://bugs.passt.top/show_bug.cgi?id=119
[dwg: Remove unnecessary buffer expansion, use strnlen instead of strlen
 to make Coverity happier]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Add additional check on ev->name and ev->len in passt-repair]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-28 13:24:44 +01:00
Jon Maloy
65cca54be8 udp: correct source address for ICMP messages
While developing traceroute forwarding tap-to-sock we found that
struct msghdr.msg_name for the ICMPs in the opposite direction always
contains the destination address of the original UDP message, and not,
as one might expect, the one of the host which created the error message.

Study of the kernel code reveals that this address instead is appended
as extra data after the received struct sock_extended_err area.

We now change the ICMP receive code accordingly.

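In other words, the originator's address has to be fetched from the data the
kernel appends right after struct sock_extended_err, which is what the
SO_EE_OFFENDER() macro returns. A hedged sketch of reading it from the error
queue (illustrative only, not the passt code):

  #include <linux/errqueue.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Read one queued error on UDP socket s and find the address of the
   * host that generated the ICMP message, if any.
   */
  static void udp_read_err(int s)
  {
          char data[1024], cbuf[512];
          struct iovec iov = { data, sizeof(data) };
          struct msghdr mh = { .msg_iov = &iov, .msg_iovlen = 1,
                               .msg_control = cbuf,
                               .msg_controllen = sizeof(cbuf) };
          struct cmsghdr *cm;

          if (recvmsg(s, &mh, MSG_ERRQUEUE) < 0)
                  return;

          for (cm = CMSG_FIRSTHDR(&mh); cm; cm = CMSG_NXTHDR(&mh, cm)) {
                  struct sock_extended_err *ee;
                  struct sockaddr *from;

                  if (!(cm->cmsg_level == SOL_IP &&
                        cm->cmsg_type == IP_RECVERR) &&
                      !(cm->cmsg_level == SOL_IPV6 &&
                        cm->cmsg_type == IPV6_RECVERR))
                          continue;

                  ee = (struct sock_extended_err *)CMSG_DATA(cm);
                  from = SO_EE_OFFENDER(ee);      /* appended after *ee */

                  if (from->sa_family == AF_UNSPEC)
                          continue;               /* no address appended */

                  /* 'from' is the host that sent the ICMP error; ee->ee_type
                   * and ee->ee_code carry the ICMP type and code.
                   */
          }
  }
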
Fixes: 55431f0077 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e0 ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-27 05:05:20 +01:00
Julian Wundrak
664c588be7 build: normalize arm targets
Linux distributions use different dumpmachine outputs for the ARM
architecture: arm, armv6l, armv7l.
For the syscall annotation, these variants are standardized to “arm”.

Link: https://bugs.passt.top/show_bug.cgi?id=117
Signed-off-by: Julian Wundrak <julian@wundrak.net>
[sbrivio: Fix typo: assign from TARGET_ARCH, not from TARGET]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:43:22 +01:00
David Gibson
77883fbdd1 udp: Add helper function for creating connected UDP socket
Currently udp_flow_new() open codes creating and connecting a socket to use
for reply messages.  We have in mind some more places to use this logic,
plus it just makes for a rather large function.  Split this handling out
into a new udp_flow_sock() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:34 +01:00
David Gibson
37d78c9ef3 udp: Always hash socket facing flowsides
For UDP packets from the tap interface (like TCP) we use a hash table to
look up which flow they belong to.  Unlike TCP, we sometimes also create a
hash table entry for the socket side of UDP flows.  We need that when we
receive a UDP packet from a "listening" socket which isn't specific to a
single flow.

At present we only do this for the initiating side of flows, which re-use
the listening socket.  For the target side we use a connected "reply"
socket specific to the single flow.

We have in mind changes that may introduce some edge cases where we could
receive UDP packets on a non-flow-specific socket more often.  To allow for
those changes - and slightly simplifying things in the meantime - always
put both sides of a UDP flow - tap or socket - in the hash table.  It's
not that costly, and means we always have the option of falling back to a
hash lookup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:32 +01:00
David Gibson
f67c488b81 udp: Better handling of failure to forward from reply socket
In udp_reply_sock_handler() if we're unable to forward the datagrams we
just print an error.  Generally this means we have an unsupported pair of
pifs in the flow table, though, and that hasn't changed.  So, next time we
get a matching packet we'll just get the same failure.  In vhost-user mode
we don't even dequeue the incoming packets which triggered this so we're
likely to get the same failure immediately.

Instead, close the flow, in the same way we do for an unrecoverable error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:30 +01:00
David Gibson
269cf6a12a udp: Share more logic between vu and non-vu reply socket paths
Share some additional miscellaneous logic between the vhost-user and "buf"
paths for data on udp reply sockets.  The biggest piece is error handling
of cases where we can't forward between the two pifs of the flow.  We also
make common some more simple logic locating the correct flow and its
parameters.

This adds some lines of code due to extra comment lines, but nonetheless
reduces logic duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:28 +01:00
David Gibson
d924b7dfc4 udp_vu: Factor things out of udp_vu_reply_sock_data() loop
At the start of every cycle of the loop in udp_vu_reply_sock_data() we:
 - ASSERT that uflow is not NULL
 - Check if the target pif is PIF_TAP
 - Initialize the v6 boolean

However, all of these depend only on the flow, which doesn't change across
the loop.  This is probably a duplication from udp_vu_listen_sock_data(),
where the flow can be different for each packet.  For the reply socket
case, however, factor that logic out of the loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:26 +01:00
David Gibson
5a977c2f4e udp: Simplify checking of epoll event bits
udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN
events.  However, unlike most epoll event handlers we don't check the
event bits right there.  EPOLLERR is checked within udp_sock_errs() which
we call unconditionally.  Checking EPOLLIN is still more buried: it is
checked within both udp_sock_recv() and udp_vu_sock_recv().

We can simplify the logic and pass less extraneous parameters around by
moving the checking of the event bits to the top level event handlers.

This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general
event handlers, but specific to EPOLLIN events, meaning new data.  So,
rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better
reflect their function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:23 +01:00
David Gibson
89b203b851 udp: Common invocation of udp_sock_errs() for vhost-user and "buf" paths
The vhost-user and non-vhost-user paths for both udp_listen_sock_handler()
and udp_reply_sock_handler() are more or less completely separate.  Both,
however, start with essentially the same invocation of udp_sock_errs(), so
that can be made common.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-26 21:34:11 +01:00
David Gibson
cf4d3f05c9 packet: Upgrade severity of most packet errors
All errors from packet_range_check(), packet_add() and packet_get() are
trace level.  However, these are for the most part actual error conditions.
They're states that should not happen, in many cases indicating a bug
in the caller or elsewhere.

We don't promote these to err() or ASSERT() level, for fear of a localised
bug on very specific input crashing the entire program, or flooding the
logs, but we can at least upgrade them to debug level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:30 +01:00
David Gibson
0857515c94 packet: ASSERT on signs of pool corruption
If packet_check_range() fails in packet_get_try_do() we just return NULL.
But this check only takes places after we've already validated the given
range against the packet it's in.  That means that if packet_check_range()
fails, the packet pool is already in a corrupted state (we should have
made strictly stronger checks when the packet was added).  Simply returning
NULL and logging a trace() level message isn't really adequate for that
situation; ASSERT instead.

Similarly we check the given idx against both p->count and p->size.  The
latter should be redundant, because count should always be <= size.  If
that's not the case then, again, the pool is already in a corrupted state
and we may have overwritten unknown memory.  Assert for this case too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:27 +01:00
David Gibson
9153aca15b util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
We already have the ASSERT() macro which will abort() passt based on a
condition.  It always has a fixed error message based on its location and
the asserted expression.  We have some upcoming cases where we want to
customise the message when hitting an assert.

Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.

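The rough shape of such helpers (a sketch, not copied from util.h):

  #include <stdarg.h>
  #include <stdio.h>
  #include <stdlib.h>

  __attribute__((format(printf, 1, 2), noreturn))
  static void abort_with_msg(const char *fmt, ...)
  {
          va_list ap;

          va_start(ap, fmt);
          vfprintf(stderr, fmt, ap);
          va_end(ap);
          fputc('\n', stderr);

          abort();
  }

  /* Like ASSERT(), but with a caller-supplied printf-style message */
  #define ASSERT_WITH_MSG(expr, ...)                              \
          do {                                                    \
                  if (!(expr))                                    \
                          abort_with_msg(__VA_ARGS__);            \
          } while (0)
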
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:24 +01:00
David Gibson
38bcce9977 packet: Rework packet_get() versus packet_get_try()
Most failures of packet_get() indicate a serious problem, and log messages
accordingly.  However, a few callers expect failures here, because they're
probing for a certain range which might or might not be in a packet.  They
use packet_get_try() which passes a NULL func to packet_get_do() to
suppress the logging which is unwanted in this case.

However, this doesn't just suppress the log when packet_get_do() finds the
requested region isn't in the packet.  It suppresses logging for all other
errors too, which do indicate serious problems, even for the callers of
packet_get_try().  Worse it will pass the NULL func on to
packet_check_range() which doesn't expect it, meaning we'll get unhelpful
messages from there if there is a failure.

Fix this by making packet_get_try_do() the primary function which doesn't
log for the case of a range outside the packet.  packet_get_do() becomes a
trivial wrapper around that which logs a message if packet_get_try_do()
returns NULL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:22 +01:00
David Gibson
961aa6a0eb packet: Move checks against PACKET_MAX_LEN to packet_check_range()
Both the callers of packet_check_range() separately verify that the given
length does not exceed PACKET_MAX_LEN.  Fold that check into
packet_check_range() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:20 +01:00
David Gibson
37d9f374d9 packet: Avoid integer overflows in packet_get_do()
In packet_get_do() both offset and len are essentially untrusted.  We do
some validation of len (check it's < PACKET_MAX_LEN), but that's not enough
to ensure that (len + offset) doesn't overflow.  Rearrange our calculation
to make sure it's safe regardless of the given offset & len values.

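A minimal sketch of the idea (names are illustrative): rather than testing
offset + len against the packet length, which can wrap, check the offset
first and then compare len against the space that remains:

  #include <stdbool.h>
  #include <stddef.h>

  /* True if [offset, offset + len) lies within a packet of pkt_len bytes,
   * without ever computing offset + len.
   */
  static bool range_fits(size_t pkt_len, size_t offset, size_t len)
  {
          if (offset > pkt_len)
                  return false;

          return len <= pkt_len - offset; /* offset <= pkt_len: no wrap */
  }
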
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:18 +01:00
David Gibson
c48331ca51 packet: Correct type of PACKET_MAX_LEN
PACKET_MAX_LEN is usually involved in calculations on size_t values - the
type of the iov_len field in struct iovec.  However, being defined bare as
UINT16_MAX, the compiler is likely to assign it a shorter type.  This can
lead to unexpected promotions (or lack thereof).  Add a cast to force the
type to be what we expect.

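In practice this amounts to a cast in the definition itself, along these
lines (the exact comment and placement in packet.h may differ):

  #include <stddef.h>
  #include <stdint.h>

  /* Maximum length of a single packet stored in a pool, as a size_t so
   * that arithmetic against struct iovec's iov_len doesn't go through
   * unexpected integer promotions.
   */
  #define PACKET_MAX_LEN ((size_t)UINT16_MAX)
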
Fixes: c43972ad6 ("packet: Give explicit name to maximum packet size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:15 +01:00
David Gibson
9866d146e6 tap: Clarify calculation of TAP_MSGS
The rationale behind the calculation of TAP_MSGS isn't necessarily obvious.
It's supposed to be the maximum number of packets that can fit in pkt_buf.
However, the calculation is wrong in several ways:
 * It's based on ETH_ZLEN which isn't meaningful for virtual devices
 * It always includes the qemu socket header which isn't used for pasta
 * The size of pkt_buf isn't relevant for vhost-user

We've already made sure this is just a tuning parameter, not a hard limit.
Clarify what we're calculating here and why.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:12 +01:00
David Gibson
a41d6d125e tap: Make size of pool_tap[46] purely a tuning parameter
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Furthermore, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we do exceed TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:09 +01:00
David Gibson
e43e00719d packet: More cautious checks to avoid pointer arithmetic UB
packet_check_range() and vu_packet_check_range() verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to.  However, in doing so it doesn't avoid the possibility of
an integer overflow while performing pointer arithmetic, which is UB.  In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.

To do this safely, we can't calculate the end of a memory region with
pointer addition when the length is untrusted.  Instead we must work
out the offset of one memory region within another using pointer
subtraction, then do integer checks against the length of the outer region.
We then need to be careful about the order of checks so that those integer
checks can't themselves overflow.

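In outline, and as a sketch of the approach rather than the exact
packet_check_range() code, the checks end up looking like this:

  #include <stdbool.h>
  #include <stddef.h>

  /* Check that [p, p + len) lies within the pool buffer
   * [buf, buf + buf_size), working on integer offsets so that neither
   * pointer arithmetic nor the length checks can overflow.
   */
  static bool range_in_pool(const char *buf, size_t buf_size,
                            const char *p, size_t len)
  {
          size_t offset;

          if (p < buf)
                  return false;

          offset = (size_t)(p - buf);     /* offset of p within the buffer */

          if (offset > buf_size)
                  return false;

          return len <= buf_size - offset;        /* offset <= buf_size */
  }
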
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:33:06 +01:00
David Gibson
4592719a74 vu_common: Tighten vu_packet_check_range()
This function verifies that the given packet is within the mmap()ed memory
region of the vhost-user device.  We can do better, however.  The packet
should be not only within the mmap()ed range, but specifically in the
subsection of that range set aside for shared buffers, which starts at
dev_region->mmap_offset within there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-20 20:32:50 +01:00
Stefano Brivio
32f6212551 Makefile: Enable -Wformat-security
It looks like an easy win to prevent a number of possible security
flaws.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:53 +01:00
Stefano Brivio
07c2d584b3 conf: Include libgen.h for basename(), fix build against musl
Fixes: 4b17d042c7 ("conf: Move mode detection into helper function")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:49 +01:00
Stefano Brivio
ebdd46367c tcp: Flush socket before checking for more data in active close state
Otherwise, if all the pending data is acknowledged:

- tcp_update_seqack_from_tap() updates the current tap-side ACK
  sequence (conn->seq_ack_from_tap)

- next, we compare the sequence we sent (conn->seq_to_tap) to the
  ACK sequence (conn->seq_ack_from_tap) in tcp_data_from_sock() to
  understand if there's more data we can send.

  If they match, we conclude that we haven't sent any of that data,
  and keep re-sending it.

We need, instead, to flush the socket (drop acknowledged data) before
calling tcp_update_seqack_from_tap(), so that once we update
conn->seq_ack_from_tap, we can be sure that all the data up to that point
is gone from the socket.

Link: https://bugs.passt.top/show_bug.cgi?id=114
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 30f1e082c3 ("tcp: Keep updating window and checking for socket data after FIN from guest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-20 05:50:43 +01:00
David Gibson
c250ffc5c1 migrate: Bump migration version number
v1 of the migration stream format had some flaws: it didn't properly
handle endianness of the MSS field, and it didn't transfer the RFC7323
timestamp.  We've now fixed those bugs, but it requires incompatible
changes to the stream format.

Because of the timestamps in particular, v1 is not really usable, so there
is little point maintaining compatible support for it.  However, v1 is in
released packages, both upstream and downstream (RHEL at least).  Just
updating the stream format without bumping the version would lead to very
cryptic errors if anyone did attempt to migrate between an old and new
passt.

So, bump the migration version to v2, so we'll get a clear error message if
anyone attempts this.  We don't attempt to maintain backwards compatibility
with v1, however: we'll simply fail if given a v1 stream.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 17:17:18 +01:00
David Gibson
cfb3740568 migrate, tcp: Migrate RFC 7323 timestamp
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:27:27 +01:00
David Gibson
28772ee91a migrate, tcp: More careful marshalling of mss parameter during migration
During migration we extract the limit on segment size using TCP_MAXSEG,
and set it on the other side with TCP_REPAIR_OPTIONS.  However, unlike most
32-bit values we transfer, we transfer it in native endian, not network
endian.  This is not correct; add it to the list of endian fixups we make.

In addition, while MAXSEG will be 32-bits in practice, and is given as such
to TCP_REPAIR_OPTIONS, the TCP_MAXSEG sockopt treats it as an 'int'.  It's
not strictly safe to pass a uint32_t to a getsockopt() expecting an int,
although we'll get away with it on most (maybe all) platforms.  Correct
this as well.

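A sketch of both points together (an int for the sockopt, a fixed-width
network-endian field for the stream); not the literal migration code:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdint.h>
  #include <sys/socket.h>

  /* Read the MSS of connected TCP socket s as the int the sockopt deals
   * in, then store it as a 32-bit, network-endian value for the stream.
   */
  static int mss_for_stream(int s, uint32_t *mss_net)
  {
          int mss;
          socklen_t sl = sizeof(mss);

          if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &sl))
                  return -1;

          *mss_net = htonl((uint32_t)mss);
          return 0;
  }
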
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor coding style fix]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-19 15:25:12 +01:00
Stefano Brivio
51f3c071a7 passt-repair: Fix build with -Werror=format-security
Fixes: 0470170247 ("passt-repair: Add directory watch")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-18 17:18:47 +01:00
David Gibson
cb5b593563 tcp, flow: Better use flow specific logging helpers
A number of places in the TCP code use general logging functions, instead
of the flow specific ones.  That includes a few older ones as well as many
places in the new migration code.  Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.

Convert many of these to use the existing flow specific helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:40 +01:00
David Gibson
96fe5548cb conf: Unify several paths in conf_ports()
In conf_ports() we have three different paths which actually do the setup
of an individual forwarded port: one for the "all" case, one for the
exclusions only case and one for the range of ports with possible
exclusions case.

We can unify those cases using a new helper which handles a single range
of ports, with a bitmap of exclusions.  Although this is slightly longer
(largely due to the new helper's function comment), it reduces duplicated
logic.  It will also make future improvements to the tracking of port
forwards easier.

The new conf_ports_range_except() function has a pretty prodigious
parameter list, but I still think it's an overall improvement in conceptual
complexity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-14 23:40:23 +01:00
David Gibson
78f1f0fdfc test/perf: Simplify iperf3 server lifetime management
After we start the iperf3 server in the background, we have a sleep to
make sure it's ready to receive connections.  We can simplify this slightly
by using the -D option to have iperf3 background itself rather than
backgrounding it manually.  That won't return until the server is ready to
use.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
26df8a3608 conf: Limit maximum MTU based on backend frame size
The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers.  We currently limit it to ETH_MAX_MTU
which sounds like it makes sense.  But ETH_MAX_MTU is confusing: it's not
used consistently, sometimes meaning the maximum L3 datagram size and
sometimes the maximum L2 frame size.  Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.

Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header.  The latter can vary
depending on our tap backend, although it doesn't right now.

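Schematically, and hedging that the helper below only mirrors the
description above rather than the actual conf() code, the new bound is:

  #include <linux/if_ether.h>     /* ETH_HLEN */

  #define IP_MAX_MTU      65535   /* largest possible IP datagram */

  /* Upper bound for --mtu given the backend's maximum L2 frame length */
  static unsigned long mtu_max(unsigned long l2_max_len)
  {
          unsigned long max = l2_max_len - ETH_HLEN;

          return max < IP_MAX_MTU ? max : IP_MAX_MTU;
  }
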
Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
9d1a6b3eba pcap: Correctly set snaplen based on tap backend type
The pcap header includes a value indicating how much of each frame is
captured.  We always capture the entire frame, so we want to set this to
the maximum possible frame size.  Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.

Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen.  While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
b6945e0553 Simplify sizing of pkt_buf
We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c4bfa3339c tap: Use explicit defines for maximum length of L2 frame
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame.  This define comes from the kernel, but it's badly named and
used confusingly.

First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths.  It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).

Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers.  In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers.  In tap.c we're mostly
using it in the second way.

Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.

Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
1eda8de438 packet: Remove redundant TAP_BUF_BYTES define
Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
c43972ad67 packet: Give explicit name to maximum packet size
We verify that every packet we store in a pool (and every partial packet
we retrieve from it) has a length no longer than UINT16_MAX.  This
originated in the older packet pool implementation which stored packet
lengths in a uint16_t.  Now that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.

We have many reasons to (slightly) increase this limit in future, so in
preparation, give this quantity an explicit name - PACKET_MAX_LEN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
74cd82adc8 conf: Detect vhost-user mode earlier
We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option.  That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.

However, there are many other places in the main option parsing loop which
also rely on mode.  We get away with those, because they happen to be able
to treat passt and vhost-user modes identically.  This is potentially
confusing, though.  So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.

To match, we move the parsing of --repair-path back into the main option
parsing loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
4b17d042c7 conf: Move mode detection into helper function
One of the first things we need to do is determine if we're in passt mode
or pasta mode.  Currently this is open-coded in main(), by examining
argv[0].  We want to complexify this a bit in future to cover vhost-user
mode as well.  Prepare for this, by moving the mode detection into a new
conf_mode() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
David Gibson
bb00a0499f conf: Use the same optstring for passt and pasta modes
Currently we rely on detecting our mode first and use different sets of
(single character) options for each.  This means that if you use an option
valid in only one mode in another you'll get the generic usage() message.

We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-12 23:08:33 +01:00
Stefano Brivio
c8b520c062 flow, repair: Wait for a short while for passt-repair to connect
...and time out after that. This will be needed because of an upcoming
change to passt-repair enabling it to start before passt is started,
on both source and target, by means of an inotify watch.

Once the inotify watch triggers, passt-repair will connect right away,
but we have no guarantees that the connection completes before we
start the migration process, so wait for it (for a reasonable amount
of time).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 23:08:33 +01:00
Stefano Brivio
0470170247 passt-repair: Add directory watch
It might not be feasible for users to start passt-repair after passt
is started, on a migration target, but before the migration process
starts.

For instance, with libvirt, the guest domain (and, hence, passt) is
started on the target as part of the migration process. At least for
the moment being, there's no hook a libvirt user (including KubeVirt)
can use to start passt-repair before the migration starts.

Add a directory watch using inotify: if PATH is a directory, instead
of connecting to it, we'll watch for a .repair socket file to appear
in it, and then attempt to connect to that socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-03-12 21:34:36 +01:00
David Gibson
2b58b22845 cppcheck: Add suppressions for "logically" exported functions
We have some functions in our headers which are definitely there on
purpose.  However, they're not yet used outside the files in which they're
defined.  That causes sufficiently recent cppcheck versions (2.17) to
complain they should be static.

Suppress the errors for these "logically" exported functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
a83c806d17 vhost_user: Don't export several functions
vhost-user added several functions which are exposed in headers, but not
used outside the file where they're defined.  I can't tell if these are
really internal functions, or if they're logically supposed to be exported,
but we don't happen to have anything using them yet.

For the time being, just remove the exports.  We can add them back if we
need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
27395e67c2 tcp: Don't export tcp_update_csum()
tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneeded prototype and make it static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
12d5b36b2f checksum: Don't export various functions
Several of the exposed functions in checksum.h are no longer directly used.
Remove them from the header, and make static.  In particular sum_16b()
should not be used outside: generally csum_unfolded() should be used which
will automatically use either the AVX2 optimized version or sum_16b() as
necessary.

csum_fold() and csum() could have external uses, but they're not used right
now.  We can expose them again if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
e36c35c952 log: Don't export passt_vsyslog()
passt_vsyslog() is an exposed function in log.h.  However it shouldn't
be called from outside log.c: it writes specifically to the system log,
and most code should call passt's logging helpers which might go to the
syslog or to a log file.

Make passt_vsyslog() local to log.c.  This requires a code motion to avoid
a forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
David Gibson
57d2db370b treewide: Mark assorted functions static
This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
68b04182e0 udp: create and send ICMPv6 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
87e6a46442 tap: break out building of udp header from tap_udp6_send function
We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:24 +01:00
Jon Maloy
55431f0077 udp: create and send ICMPv4 to local peer when applicable
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-07 02:21:19 +01:00
Jon Maloy
82a839be98 tap: break out building of udp header from tap_udp4_send function
We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:36 +01:00
David Gibson
1924e25f07 conf: Be more precise about minimum MTUs
Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68).  That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4.  For IPv6, it's not sufficient: that requires an MTU of at least
1280.

Newer kernels have better named constants IPV4_MIN_MTU and IPV6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.

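For reference, the values involved, with the check itself sketched rather
than copied from conf():

  #include <stdbool.h>

  #define IPV4_MIN_MTU    68      /* minimum MTU able to carry IPv4 */
  #define IPV6_MIN_MTU    1280    /* minimum MTU able to carry IPv6 */

  /* v6 says whether IPv6 is enabled for this instance */
  static bool mtu_big_enough(unsigned int mtu, bool v6)
  {
          if (mtu < IPV4_MIN_MTU)
                  return false;

          return !v6 || mtu >= IPV6_MIN_MTU;
  }
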
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-06 20:17:23 +01:00
David Gibson
672d786de1 tcp: Send RST in response to guest packets that match no connection
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:32 +01:00
David Gibson
1f236817ea tap: Consider IPv6 flow label when building packet sequences
To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers.  Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs.  We ignore the IPv6 flow label.

We have some future cases where we want to consider the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels.  Add the flow label to tap6_l4_t and group by it as well
as the other criteria.  In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.

The guest should use the same flow label for all packets in a flow, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:29 +01:00
David Gibson
008175636c ip: Helpers to access IPv6 flow label
The flow label is a 20-bit field in the IPv6 header.  The length and
alignment make it awkward to pass around as is.  Obviously, it can be
packed into a 32-bit integer though, and we do this in two places.  We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.

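A sketch of what such helpers can look like, using struct ip6_hdr from
<netinet/ip6.h> (the in-tree versions may differ in name and detail): the
label is the low 20 bits of the version/traffic-class/flow word, kept in
network order in the header:

  #include <arpa/inet.h>
  #include <netinet/ip6.h>
  #include <stdint.h>

  /* Extract the 20-bit flow label as a host-order integer */
  static uint32_t ip6_flow_lbl(const struct ip6_hdr *ip6h)
  {
          return ntohl(ip6h->ip6_flow) & 0x000fffffU;
  }

  /* Set the flow label, preserving the version and traffic class bits */
  static void ip6_set_flow_lbl(struct ip6_hdr *ip6h, uint32_t label)
  {
          uint32_t flow = ntohl(ip6h->ip6_flow) & 0xfff00000U;

          ip6h->ip6_flow = htonl(flow | (label & 0x000fffffU));
  }
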
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-03-05 21:46:17 +01:00
David Gibson
52419a64f2 migrate, tcp: Don't flow_alloc_cancel() during incoming migration
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer one to one match to the flows which the source is
sending data for.  This means that data for later flows will be mismatched
to a different flow.  Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:38 +01:00
David Gibson
b2708218a6 tcp: Unconditionally move to CLOSED state on tcp_rst()
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state.  However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong.  So we should still mark
the flow as CLOSED.  That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:36 +01:00
David Gibson
56ce03ed0a tcp: Correct error code handling from tcp_flow_repair_socket()
There are two small bugs in error returns from tcp_flow_repair_socket(),
which is supposed to return a negative errno code:

1) On bind() failures, we directly pass on the return code from bind(),
   which is just 0 or -1, instead of an error code.

2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.

Correct both of these.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:35 +01:00
David Gibson
39f85bce1a migrate, flow: Don't attempt to migrate TCP flows without passt-repair
Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair, carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:33 +01:00
David Gibson
7b92f2e852 migrate, flow: Trivially succeed if migrating with no flows
We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:32:21 +01:00
Stefano Brivio
87471731e6 selinux: Fixes/workarounds for passt and passt-repair, mostly for libvirt usage
Here are a bunch of workarounds and a couple of fixes for libvirt
usage which are rather hard to split into single logical patches
as there appear to be some obscure dependencies between some of them:

- passt-repair needs to have an exec_type typeattribute (otherwise
  the policy for lsmd(1) causes a violation on getattr on its
  executable) file, and that typeattribute just happened to be there
  for passt as a result of init_daemon_domain(), but passt-repair
  isn't a daemon, so we need an explicit corecmd_executable_file()

- passt-repair needs a workaround, which I'll revisit once
  https://github.com/fedora-selinux/selinux-policy/issues/2579 is
  solved, for usage with libvirt: allow it to use qemu_var_run_t
  and virt_var_run_t sockets

- add 'bpf' and 'dac_read_search' capabilities for passt-repair:
  they are needed (for whatever reason I didn't investigate) to
  actually receive socket files via SCM_RIGHTS

- passt needs further workarounds in the sense of
  https://github.com/fedora-selinux/selinux-policy/issues/2579:
  allow it to map and use svirt_tmpfs_t (not just svirt_image_t):
  it depends on where the libvirt guest image is

- ...it also needs to map /dev/null if <access mode='shared'/> is
  enabled in libvirt's XML for the memoryBacking object, for
  vhost-user operation

- and 'ioctl' on the TCP socket appears to be actually needed, on top
  of 'getattr', to dump some socket parameters

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-28 01:14:01 +01:00
Michal Privoznik
be86232f72 seccomp.sh: Silence stty errors
When printing the list of allowed syscalls, the width of the terminal
is obtained for nicer output (see commit below). The width is
obtained by running 'stty'. While this works when building from a
console, it doesn't work during rpmbuild/emerge/.. as stdout is
usually not a console but a logfile and stdin is usually
/dev/null or something. This results in stty reporting errors
like this:

  stty: 'standard input': Inappropriate ioctl for device

Redirect stty's stderr to /dev/null to silence it.

Fixes: 712ca32353 ("seccomp.sh: Try to account for terminal width while formatting list of system calls")
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-24 18:46:28 +01:00
Jon Maloy
ea69ca6a20 tap: always set the no_frag flag in IPv4 headers
When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Discussions on the
Internet on this subject indicate that modern routers never fragment
packets, and that fragmentation isn't even supported in many cases.

Adding to this that incoming messages forwarded on the tap interface
never even pass through a router, it seems safe to always set this flag.

This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-20 12:43:00 +01:00
Stefano Brivio
4dac2351fa contrib/fedora: Actually install passt-repair SELinux policy file
Otherwise we build it, but we don't install it. Not an issue that
warrants a release right away, as it's usable anyway.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:53 +01:00
Stefano Brivio
16553c8280 dhcp: Add option code byte in calculation for OPT_MAX boundary check
Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as
intended:

  $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this"
  $ hostname="__eighteen_bytes__"
  $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4
  Saving packet capture to dhcp.pcap
  $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length"
      Total Length: 577

This was hidden by the issue fixed by commit bcc4908c2b ("dhcp:
Remove option 255 length byte") until now.

Fixes: 31e8109a86 ("dhcp, dhcpv6: Add hostname and client fqdn ops")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 23:33:19 +01:00
Stefano Brivio
183bedf478 Makefile: Use mmap2() as alternative for mmap() in valgrind extra syscalls
...instead of unconditionally trying to enable both: mmap2() is the
32-bit ARM variant for mmap() (and perhaps for other architectures),
but if mmap() is available, valgrind will use that one.

This avoids seccomp.sh warning us about missing mmap2() if mmap() is
present, and is consistent with what we do in vhost-user code.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 16:36:47 +01:00
David Gibson
1cc5d4c9fe conf: Use 0 instead of -1 as "unassigned" mtu value
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:41 +01:00
David Gibson
3dc7da68a2 conf: More thorough error checking when parsing --mtu option
We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
  * Passing a non-number like '-m xyz' will not give an error and act like
    -m 0
  * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
    giving an error
  * We parse the MTU as a long, then immediately assign to an int, so on
    some platforms certain ludicrously out of bounds values will be
    silently truncated, rather than giving an error

Be a bit more thorough with the error checking to avoid that.

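A sketch of the stricter parsing (not the literal conf() code): check errno,
reject empty input and trailing junk, and bound the value before it's ever
narrowed:

  #include <errno.h>
  #include <stdlib.h>

  /* Parse an MTU given on the command line: 0..65535 only, no junk */
  static int parse_mtu(const char *arg, unsigned int *mtu)
  {
          char *end;
          long val;

          errno = 0;
          val = strtol(arg, &end, 0);

          if (errno || end == arg || *end || val < 0 || val > 65535)
                  return -1;

          *mtu = (unsigned int)val;
          return 0;
  }
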
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:38 +01:00
David Gibson
65e317a8fc flow: Clean up and generalise flow traversal macros
The migration code introduced a number of 'foreach' macros to traverse the
flow table.  These aren't inherently tied to migration, so polish up their
naming, move them to flow_table.h and also use in flow_defer_handler()
which is the other place we need to traverse the whole table.

For now we keep foreach_established_tcp_flow() as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:36 +01:00
David Gibson
b79a22d360 flow: Remove unneeded bound parameter from flow traversal macros
The foreach macros used to step through flows each take a 'bound' parameter
to only scan part of the flow table.  Only one place actually passes a
bound different from FLOW_MAX.  So we can simplify every other invocation
by having that one case manually handle the bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:34 +01:00
David Gibson
7ffca35fdd flow: Remove unneeded index from foreach_* macros
The foreach macros are odd in that they take two loop counters: an integer
index, and a pointer to the flow.  We nearly always want the latter, not
the former, and we can get the index from the pointer trivially when we
need it.  So, rearrange the macros not to need the integer index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-19 06:35:20 +01:00
David Gibson
adb46c11d0 flow: Add flow_perror() helper
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:12 +01:00
David Gibson
ba0823f8a0 tcp: Don't pass both flow pointer and flow index
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:10 +01:00
David Gibson
854bc7b1a3 tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:33:08 +01:00
David Gibson
e56c8038fc tcp: More type safety for tcp_flow_migrate_target_ext()
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 13:32:52 +01:00
David Gibson
5a07eb3ccc tcp_vu: head_cnt need not be global
head_cnt is a global variable which tracks how many entries in head[] are
currently used.  The fact that it's global obscures the fact that the
lifetime over which it has a meaningful value is quite short: a single
call to tcp_vu_data_from_sock().

Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer.
We keep the head[] array global for now - although technically it has the
same valid lifetime - because it's large enough we might not want to put
it on the stack.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 11:28:37 +01:00
David Gibson
6b4065153c tap: Remove unused ETH_HDR_INIT() macro
The uses of this macro were removed in d4598e1d18 ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:18 +01:00
David Gibson
354bc0bab1 packet: Don't pass start and offset separately to packet_check_range()
Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:12 +01:00
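
As a rough sketch of the interface change described above (function and
parameter names are illustrative, not the actual packet.h prototypes):

  #include <stddef.h>

  struct pool;    /* packet pool, as defined in packet.h */

  /* Before: start of the packet plus an offset of the range within it */
  int packet_check_range_old(const struct pool *p, const char *start,
                             size_t offset, size_t len);

  /* After: directly the start of the range of interest, (start + offset) */
  int packet_check_range_new(const struct pool *p, const char *ptr,
                             size_t len);
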
David Gibson
0a51060f7a packet: Use flexible array member in struct pool
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:43:04 +01:00
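
For reference, a minimal sketch of the difference (member names are
illustrative, not the actual struct pool definition):

  #include <stddef.h>
  #include <sys/uio.h>

  /* Before: dummy one-element array, aliased to larger sizes via macros */
  struct pool_v1 {
          size_t buf_size, count;
          struct iovec pkt[1];
  };

  /* After: C11 flexible array member, sized by the actual allocation */
  struct pool_v2 {
          size_t buf_size, count;
          struct iovec pkt[];
  };
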
Enrique Llorente
bcc4908c2b dhcp: Remove option 255 length byte
Option 255 (end of options) doesn't need the length byte; this change
removes it, leaving one extra byte available for other dynamic options.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-18 08:42:35 +01:00
Stefano Brivio
a1e48a02ff test: Add migration tests
PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:36 +01:00
Stefano Brivio
89ecf2fd40 migrate: Migrate TCP flows
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:29:03 +01:00
Stefano Brivio
3e903bbb1f repair, passt-repair: Build and warning fixes for musl
Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-17 08:28:48 +01:00
Stefano Brivio
01b6a164d9 tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
In commit e5eefe7743 ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:28:45 +01:00
Stefano Brivio
667caa09c6 tcp_splice: Don't wake up on input data if we can't write it anywhere
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long as we're waiting for
output readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-17 08:27:30 +01:00
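
A simplified sketch of the idea, assuming an epoll file descriptor and
a reference value for the receiving socket (not the actual tcp_splice.c
epoll bookkeeping):

  #include <stdint.h>
  #include <sys/epoll.h>

  /* While waiting for EPOLLOUT on the other side, stop asking for
   * EPOLLIN on our receiving socket: there's nowhere to put the data.
   * Keep EPOLLRDHUP so we still notice the peer closing. */
  static int splice_pause_input(int epollfd, int rcv_sock, uint64_t ref)
  {
          struct epoll_event ev = { .events = EPOLLRDHUP, .data.u64 = ref };

          return epoll_ctl(epollfd, EPOLL_CTL_MOD, rcv_sock, &ev);
  }
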
David Gibson
7c33b12086 vhost_user: Clear ring address on GET_VRING_BASE
GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-15 05:34:21 +01:00
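
A minimal sketch of the fix, with simplified stand-in types (the real
fields live in the vhost-user virtqueue structures):

  #include <stddef.h>

  struct vring_sketch { void *avail; };
  struct virtq_sketch { int kick_fd, call_fd; struct vring_sketch vring; };

  /* GET_VRING_BASE: stop the queue and also clear the ring pointer, so
   * a later notification sees the queue as not ready */
  static void queue_stop(struct virtq_sketch *vq)
  {
          vq->kick_fd = -1;
          vq->call_fd = -1;
          vq->vring.avail = NULL;
  }

  /* ...and the notify path can then skip queues that aren't ready
   * instead of writing to a dead call fd */
  static void queue_notify(struct virtq_sketch *vq)
  {
          if (!vq->vring.avail)
                  return;
          /* eventfd_write(vq->call_fd, 1) would go here */
  }
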
Stefano Brivio
71249ef3f9 tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values
I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 12:02:55 +01:00
Stefano Brivio
30f1e082c3 tcp: Keep updating window and checking for socket data after FIN from guest
Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

While at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
98d474c895 contrib/selinux: Enable mapping guest memory for libvirt guests
This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-14 10:04:39 +01:00
Stefano Brivio
9a84df4c3f selinux: Add rules needed to run tests
...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-13 00:42:52 +01:00
David Gibson
a301158456 rampstream: Add utility to test for corruption of data streams
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:17 +01:00
Stefano Brivio
6f122f0171 tcp: Get bound address for connected inbound sockets too
So that we can bind inbound sockets to specific addresses, like we
already do for outbound sockets.

While at it, change the error message in tcp_conn_from_tap() to match
this one.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:48:00 +01:00
Stefano Brivio
f3fe795ff5 vhost_user: Make source quit after reporting migration state
This will close all the sockets we currently have open in repair mode,
and completes our migration tasks as source. If the hypervisor wants
to have us back at this point, somebody needs to restart us.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:51 +01:00
Stefano Brivio
b899141ad5 Add interfaces and configuration bits for passt-repair
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:28 +01:00
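
Conceptually, what the helper does for each socket it receives over
SCM_RIGHTS boils down to a single setsockopt() call (simplified sketch,
without the batching and confirmation described above; the function
name is made up):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  /* cmd: 1 to enter repair mode, 0 to leave it */
  static int repair_set(int s, int cmd)
  {
          return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &cmd, sizeof(cmd));
  }
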
David Gibson
155cd0c41e migrate: Migrate guest observed addresses
Most of the information in struct ctx doesn't need to be migrated.
Either it's strictly back end information which is allowed to differ
between the two ends, or it must already be configured identically on
the two ends.

There are a few exceptions though.  In particular passt learns several
addresses of the guest by observing what it sends out.  If we lose
this information across migration we might get away with it, but if
there are active flows we might misdirect some packets before
re-learning the guest address.

Avoid this by migrating the guest's observed addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style stuff, comments, etc.]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:17 +01:00
Stefano Brivio
5911e08c0f migrate: Skeleton of live migration logic
Introduce facilities for guest migration on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:47:07 +01:00
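
The header described above roughly corresponds to (field widths are
illustrative, not the actual migrate.h definition):

  #include <stdint.h>

  struct migrate_header_sketch {
          uint32_t magic;          /* identifies a passt migration stream */
          uint32_t version;        /* version of this encoding */
          uint32_t compat_version; /* oldest version able to read it */
  };
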
Stefano Brivio
836fe215e0 passt-repair: Fix off-by-one in check for number of file descriptors
Actually, 254 is too many, but 253 isn't.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:46:46 +01:00
Laurent Vivier
def7de4690 tcp_vu: Fix off-by-one in header count array adjustment
head_cnt represents the number of frames we're going to forward to the
guest in tcp_vu_sock_recv(), each of which could require multiple
buffers ("elements").  We initialise it with as many frames as we can
find space for in vu buffers, and we then need to adjust it down to
the number of frames we actually (partially) filled.

We adjust it down based on number of individual buffers used by the
data from recvmsg().  At this point 'i' is *one greater than* that
number of buffers, so we need to discard all (unused) frames with a
buffer index >= i, instead of > i.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
[david: Contributed actual commit message]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-12 19:44:25 +01:00
Stefano Brivio
90f91fe726 tcp: Implement conservative zero-window probe on ACK timeout
This probably doesn't cover all the cases where we should send a
zero-window probe, but it's rather unobtrusive and obvious, so start
from here, also because I just observed this case (without the fix
from the previous patch, to take into account window information from
keep-alive segments).

If we hit the ACK timeout, and try re-sending data from the socket,
if the window is zero, we'll just fail again, go back to the timer,
and so on, until we hit the maximum number of re-transmissions and
reset the connection.

Don't do that: forcibly try to send something by implementing the
equivalent of a zero-window probe in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:43:55 +01:00
Stefano Brivio
472e2e930f tcp: Don't discard window information on keep-alive segments
It looks like a detail, but it's critical if we're dealing with
somebody, such as near-future self, using TCP_REPAIR to migrate TCP
connections in the guest or container.

The last packet sent from the 'source' process/guest/container
typically reports a small window, or zero, because the guest/container
hadn't been draining it for a while.

The next packet, appearing as the target sets TCP_REPAIR_OFF on the
migrated socket, is a keep-alive (also called "window probe" in CRIU
or TCP_REPAIR-related code), and it comes with an updated window
value, reflecting the pre-migration "regular" value.

If we ignore it, it might take a while/forever before we realise we
can actually restart sending.

Fixes: 238c69f9af ("tcp: Acknowledge keep-alive segments, ignore them for the rest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-12 19:34:15 +01:00
Enrique Llorente
31e8109a86 dhcp, dhcpv6: Add hostname and client fqdn ops
Both DHCPv4 and DHCPv6 have the capability to pass the hostname to
clients: DHCPv4 uses option 12 (hostname) while DHCPv6 uses option 39
(client FQDN). Some virt deployments, such as KubeVirt, expect the
VirtualMachine name to be the guest hostname.

This change adds the following arguments:
 - -H, --hostname NAME to configure the DHCPv4 hostname option (12)
 - --fqdn NAME to configure the client FQDN option for both DHCPv4 (81)
   and DHCPv6 (39)

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-10 18:30:24 +01:00
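
As a usage example (hostname and FQDN here are made up), the new
options can be passed as:

  $ pasta --config-net -H myvm --fqdn myvm.example.com -- /bin/sh
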
Stefano Brivio
a3d142a6f6 conf: Don't map DNS traffic to host, if host gateway is a resolver
This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.

By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:

  0.0029:     NAT to host 127.0.0.1: 192.168.100.1

But if the host gateway is also a DNS resolver:

  0.0029: DNS:
  0.0029:     192.168.100.1

then we'll send DNS queries directed to it to the host instead:

  0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
  0.0372: Flow 0 (TGT): INI -> TGT
  0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): TGT -> TYPED
  0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
  0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
  0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53

which doesn't quite work, of course:

  0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
  0.0374: ICMP error on UDP socket 95: Connection refused

unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.

Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-09 08:17:06 +01:00
Stefano Brivio
864be475d9 passt-repair: Send one confirmation *per command*, not *per socket*
It looks like me, myself and I couldn't agree on the "simple" protocol
between passt and passt-repair. The man page and passt say it's one
confirmation per command, but the passt-repair implementation had one
confirmation per socket instead.

This caused all sort of mysterious issues with repair mode
pseudo-randomly enabled, and leading to hours of fun (mostly not
mine). Oops.

Switch to one confirmation per command (of course).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-09 08:16:41 +01:00
Enrique Llorente
fe8b6a7c42 dhcp: Don't re-use request message for reply
The logic composing the DHCP reply message reuses the request message
as its starting point; future long options such as FQDN may exceed the
request message size, pushing the reply beyond its lower bound.

This change creates a new reply message with a fixed options size of
308 and fills it in with the proper fields from the request, adding the
generated options on top; this way the reply's lower bound no longer
depends on the request.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:36:10 +01:00
Stefano Brivio
b7b70ba243 passt-repair: Dodge "structurally unreachable code" warning from Coverity
While main() conventionally returns int, and we need a return at the
end of the function to avoid compiler warnings, turning that return
into _exit() to avoid exit handlers triggers a Coverity warning. It's
unreachable code anyway, so switch that single occurrence back to a
plain return.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:35:46 +01:00
Stefano Brivio
0f009ea598 passt-repair: Fix calculation of payload length from cmsg_len
There's no inverse function for CMSG_LEN(), so we need to loop over
SCM_MAX_FD (253) possible input values. The previous calculation is
clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg
data.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-07 10:35:17 +01:00
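
A sketch of the approach described above (helper name is made up;
SCM_MAX_FD is the kernel-side limit, not exported by userspace
headers):

  #include <stddef.h>
  #include <sys/socket.h>

  #define SCM_MAX_FD 253

  /* CMSG_LEN() has no inverse: recover the fd count by trying every
   * possible value. Returns the count, or -1 if nothing matches. */
  static int fd_count(size_t cmsg_len)
  {
          int n;

          for (n = 1; n <= SCM_MAX_FD; n++) {
                  if (CMSG_LEN(n * sizeof(int)) == cmsg_len)
                          return n;
          }
          return -1;
  }
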
Stefano Brivio
a0b7f56b3a passt-repair: Don't use perror(), accept ECONNRESET as termination
If we use glibc's perror(), we need to allow dup() and fcntl() in our
seccomp profiles, which are a bit too much for this simple helper. On
top of that, we would probably need a wrapper to avoid allocation for
translated messages.

While at it: ECONNRESET is just a close() from passt, treat it like
EOF.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-07 10:34:31 +01:00
Stefano Brivio
a5cca995de conf, passt.1: Un-deprecate --host-lo-to-ns-lo
It was established behaviour, and it's now the third report about it:
users ask how to achieve the same functionality, and we don't have a
better answer yet.

The idea behind declaring it deprecated to start with, I guess, was
that we would eventually replace it by more flexible and generic
configuration options, which is still planned. But there's nothing
preventing us to alias this in the future to a particular
configuration.

So, stop scaring users off, and un-deprecate this.

Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/
Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705
Link: https://github.com/moby/moby/issues/48838
Link: https://github.com/containers/podman/discussions/25243
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 11:14:30 +01:00
David Gibson
0da87b393b debug: Add tcpdump to mbuto.img
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
f66769c2de apparmor: Workaround for unconfined libvirtd when triggered by unprivileged user
If libvirtd is triggered by an unprivileged user, the virt-aa-helper
mechanism doesn't work, because per-VM profiles can't be instantiated,
and as a result libvirtd runs unconfined.

This means passt can't start, because the passt subprofile from
libvirt's profile is not loaded either.

Example:

  $ virsh start alpine
  error: Failed to start domain 'alpine'
  error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11

Add an annoying workaround for the time being. Much better than
encouraging users to start guests as root, or to disable AppArmor
altogether.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-06 09:43:09 +01:00
Stefano Brivio
593be32774 passt-repair.1: Fix indication of TCP_REPAIR constants
...perhaps I should adopt the healthy habit of actually reading
headers instead of using my mental copy.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:43:00 +01:00
Stefano Brivio
9215f68a0c passt-repair: Build fixes for musl
When building against musl headers:

- sizeof() needs stddef.h, as it should be;

- we can't initialise a struct msghdr by simply listing fields in
  order, as they contain explicit padding fields. Use field names
  instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-06 09:40:54 +01:00
Paul Holzinger
a9d63f91a5 passt-repair: use _exit() over return
Returning from main() does the same as calling exit(), which is not
good, as glibc might try to call futex(), which will be blocked by
seccomp. See the previous commit "treewide: use _exit() over exit()" for
a more detailed explanation.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:19 +01:00
Paul Holzinger
d0006fa784 treewide: use _exit() over exit()
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000

Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
  do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
	sleep 100 && \
  sleep 10 && \
  podman rm -fa -t0
done

And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
    0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
    0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78c50b n/a (n/a + 0x0)
    0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
    0x000055f31b78d5a2 n/a (n/a + 0x0)

Pasta got killed in exit(), it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.

Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.

Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-05 15:19:02 +01:00
David Gibson
745c163e60 tcp: Simplify handling of getsockname()
For migration we need to get the specific local address and port for
connected sockets with getsockname().  We currently open code marshalling
the results into the flow entry.

However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it.  Also report failures, which may make debugging
problems easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 09:02:54 +01:00
David Gibson
b4a7b5d4a6 migrate: Fix several errors with passt-repair
The passt-repair helper is now merged, but alas it contains several small
bugs:
 * close() is not in the seccomp profile, meaning it will immediately
   SIGSYS when you make a request of it
 * The generated header, seccomp_repair.h, isn't listed in .gitignore or
   removed by "make clean"

Fixes: 8c24301462 ("Introduce passt-repair")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 08:52:27 +01:00
Stefano Brivio
dcf014be88 doc: Add mock of migration source and target
These test programs show the migration of a TCP connection using the
passt-repair helper.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
52e57f9c9a tcp: Get socket port and address using getsockname() when connecting from guest
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.

Similar for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to the equivalent interface as it was on the source.

Use getsockname() to fetch them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
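
A minimal sketch of the getsockname() usage (IPv4 only, helper name
made up; the real code stores the result in the flow table):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  static int local_endpoint(int s, struct in_addr *oaddr, in_port_t *oport)
  {
          struct sockaddr_in sa;
          socklen_t sl = sizeof(sa);

          if (getsockname(s, (struct sockaddr *)&sa, &sl))
                  return -1;

          *oaddr = sa.sin_addr;
          *oport = ntohs(sa.sin_port);
          return 0;
  }
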
Stefano Brivio
8c24301462 Introduce passt-repair
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e894d9ae82 vhost_user: Turn some vhost-user message reports to trace()
Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.

Change per-packet debug() messages in vu_kick_cb() and
vu_send_single() to trace().

[dgibson: switch different messages to trace()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
Stefano Brivio
e25a93032f util: Add read_remainder() and read_all_buf()
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.

I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.

[dgibson: make read_remainder() take const pointer to iovec]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-04 01:28:04 +01:00
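
For reference, the general shape of such a helper (a sketch assuming
semantics symmetric to write_all_buf(), not the actual util.c code):

  #include <errno.h>
  #include <stddef.h>
  #include <unistd.h>

  /* Keep reading until 'len' bytes landed in 'buf', retrying on EINTR;
   * return 0 on success, -1 with errno set on error or early EOF. */
  static int read_all_buf_sketch(int fd, void *buf, size_t len)
  {
          size_t got = 0;

          while (got < len) {
                  ssize_t n = read(fd, (char *)buf + got, len - got);

                  if (n < 0) {
                          if (errno == EINTR)
                                  continue;
                          return -1;
                  }
                  if (n == 0) {           /* EOF before the full buffer */
                          errno = ENODATA;
                          return -1;
                  }
                  got += n;
          }
          return 0;
  }
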
Stefano Brivio
71fa736277 tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version
I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
nevertheless.

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:43:35 +01:00
Stefano Brivio
b75ad159e8 vhost_user: On 32-bit ARM, mmap() is not available, mmap2() is used instead
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0
Fixes: 31117b27c6 ("vhost-user: introduce vhost-user API")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:28 +01:00
Stefano Brivio
722d347c19 tcp: Don't reset outbound connection on SYN retries
Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.

In that case, we should go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.

Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 22:42:13 +01:00
7ppKb5bW
bf2860819d pasta.te: fix demo.sh and remove one duplicate rule
On Fedora 41, without "allow pasta_t unconfined_t:dir read"
/usr/bin/pasta can't open /proc/[pid]/ns, which is required by
pasta_netns_quit_init().

This patch also removes one duplicate rule, "allow pasta_t nsfs_t:file
read;": "allow pasta_t nsfs_t:file { open read };" at line 123 is
enough.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:33:14 +01:00
Stefano Brivio
dcd6d8191a tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-02-03 07:32:53 +01:00
David Gibson
0349cf637f util: Rename and make global vu_remove_watch()
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set.  There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:51 +01:00
David Gibson
10c4a9e1b3 tcp: Always pass NULL event with EPOLL_CTL_DEL
In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored.  It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places.  Use
NULL instead for simplicity and consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-02-03 07:32:37 +01:00
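
That is, removal now looks like this everywhere (sketch):

  #include <sys/epoll.h>

  /* Since Linux 2.6.9, the event argument is ignored for EPOLL_CTL_DEL */
  static int epoll_del(int epollfd, int fd)
  {
          return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
  }
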
Laurent Vivier
dd6a6854c7 vhost-user: Implement an empty VHOST_USER_SEND_RARP command
Passt cannot manage and doesn't need to manage the broadcast of a fake RARP,
but QEMU will report an error message if Passt doesn't implement it.

Implement an empty SEND_RARP command to silence the QEMU error message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-24 21:40:05 +01:00
Stefano Brivio
d477a1fb03 netlink: Skip loopback interface while looking for a template
There might be reasons to have routes on the loopback interface, for
example Any-IP/AnyIP routes as implemented by Linux kernel commit
ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6.").

If we use the loopback interface as a template, though, we'll pick
'lo' (typically) as interface name for our tap interface, but we'll
already have an interface called 'lo' in the target namespace, and as
we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap
interface.

Skip the loopback interface while looking for a template interface or,
more accurately, skip the interface with index 1.

Strictly speaking, we should fetch interface flags via RTM_GETLINK
instead, and check for IFF_LOOPBACK, but interleaving that request
while we're iterating over routes is unnecessarily complicated.

Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-24 21:39:52 +01:00
Laurent Vivier
4f2c8e7913 vhost_user: Drop packet with unsupported iovec array
If the iovec array cannot be managed, drop it rather than
passing the second entry to tap_add_packet().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:30:42 +01:00
Stefano Brivio
ec5c4d936d tcp: Set PSH flag for last incoming packets in a batch
So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.

However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.

This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:

  $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
  33170   0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
  33171   0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
  33172   0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
  33173   1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370

in this example (packet capture taken by passt), frame  is a
mouse event sent by the RDP client, and frame  is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame .

If we set PSH, instead:

  $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
  314   0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
  315   0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
  316   0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
  317   0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53

here, in frame , our mouse event is delivered without a delay and
receives a response in approximately 12 ms.

Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.

Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:28:44 +01:00
Stefano Brivio
db2c91ae86 tcp: Set ACK flag on *all* RST segments, even for client in SYN-SENT state
Somewhat curiously, RFC 9293, section 3.10.7.3, states:

   If the state is SYN-SENT, then
   [...]

      Second, check the RST bit:
      -  If the RST bit is set,
      [...]

         o  If the ACK was acceptable, then signal to the user "error:
            connection reset", drop the segment, enter CLOSED state,
            delete TCB, and return.  Otherwise (no ACK), drop the
            segment and return.

which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:

	/* No ACK in the segment */

	if (th->rst) {
		/* rfc793:
		 * "If the RST bit is set
		 *
		 *      Otherwise (no ACK) drop the segment and return."
		 */

		goto discard_and_undo;
	}

meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.

The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.

This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.

Note that non-SYN, non-RST segments were already covered by commit
4988e2b406 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").

Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl>
Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl>
Reported-by: Robert Floor <Robert.Floor@schiphol.nl>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-21 14:28:44 +01:00
Stefano Brivio
54bb972cfb tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
Following up on 725acd111b ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.

TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.

Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.

Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.

After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):

  https://news.ycombinator.com/item?id=34180239

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-21 14:28:37 +01:00
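
What this boils down to, for any non-spliced TCP socket we create or
accept (sketch, helper name made up):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static int set_nodelay(int s)
  {
          int on = 1;

          return setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
  }
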
Stefano Brivio
8757834d14 tcp: Buffer sizes are *not* inherited on accept()/accept4()
...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.

Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.

As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).

Fixes: 904b86ade7 ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-21 14:28:14 +01:00
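
A sketch of the resulting pattern (helper name made up; passt only sets
the buffer sizes when the allowed maximum is large enough):

  #define _GNU_SOURCE
  #include <sys/socket.h>

  static int accept_and_size(int lsock, int bufsize)
  {
          int s = accept4(lsock, NULL, NULL, SOCK_NONBLOCK);

          if (s < 0)
                  return -1;

          /* Not inherited from the listening socket, so set them here */
          setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
          setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
          return s;
  }
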
Laurent Vivier
c96a88d550 vhost_user: remove ASSERT() on iovec number
Replace ASSERT() on the number of iovec in the element and on
the first entry length by a debug() message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix typo in failure message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
412ed4f09f vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE
Report to front-end that we support device state commands:
VHOST_USER_CHECK_DEVICE_STATE
VHOST_USER_SET_LOG_BASE

These features are needed to transfer the backend state using the
frontend channel.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
31d70024be vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command
Set the file descriptor to use to transfer the
backend device state during migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
878e163454 vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command
After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.

The value returned indicates success or error;
0 is success, any non-zero value is an error.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
78c73e9395 vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD
This feature allows QEMU to be migrated. We also need to report
VHOST_F_LOG_ALL.

This protocol feature reports that we can log page updates and
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
3c1d91b816 vhost-user: add VHOST_USER_SET_LOG_BASE command
Sets logging shared memory space.

When the back-end has the VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
feature, the log memory fd is provided in the ancillary data of the
VHOST_USER_SET_LOG_BASE message, while the size and offset of the
shared memory area are provided in the message itself.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix coding style in a bunch of places]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
538312af19 vhost-user: Pass vu_dev to more virtio functions
vu_dev will be needed to log page updates.

Add the parameter to:

  vring_used_write()
  vu_queue_fill_by_index()
  vu_queue_fill()
  vring_used_idx_set()
  vu_queue_flush()

The new parameter is unused for now.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
b04195c60f vhost-user: add VHOST_USER_SET_LOG_FD command
VHOST_USER_SET_LOG_FD is an optional message with an eventfd in its
ancillary data; it may be used to inform the front-end that the log
has been modified.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to vu_set_log_fd_exec()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Laurent Vivier
6016e04a3a vhost-user: update protocol features and commands list
vhost-user protocol specification has been updated with
feature flags and commands we will need to implement migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to union vhost_user_payload]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-20 19:51:24 +01:00
Stefano Brivio
a8f4fc481c tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest
There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.

If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:

  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1

also known as:

  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s

which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.

That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.

In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.

If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.

This is definitely not elegant, but it's a minimal fix.

We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).

I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).

Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
b8f573cdc2 tcp: Set EPOLLET when reading from a socket fails with EAGAIN
Before SO_PEEK_OFF support was introduced by commit e63d281871
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).

If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().

With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.

As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.

This is the original report, mentioning the originally proposed fix:

--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:

epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--

we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).

Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.

And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.

In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.

While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().

And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.

Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
22cf08ba00 tcp: Don't subscribe to EPOLLOUT events on STALLED
I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.

Fixes: bb70811183 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Stefano Brivio
707f77b0a9 tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up
In the next patches, I'm extending the usage of STALLED to a few more
cases.

Doing so revealed this issue: if we set STALLED and, consequently,
EPOLLOUT (which is wrong, fixed later) right after we set a connection
to ESTABLISHED (which also happened by mistake while I was preparing
another change), with the guest sending data together with the final
ACK in the handshake, say:

  41.3661: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3662: Flow 2 (NEW): FREE -> NEW
  41.3663: Flow 2 (INI): NEW -> INI
  41.3663: Flow 2 (INI): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => ?
  41.3665: Flow 2 (TGT): INI -> TGT
  41.3666: Flow 2 (TGT): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3667: Flow 2 (TCP connection): TGT -> TYPED
  41.3667: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3669: Flow 2 (TCP connection): TAP_SYN_RCVD: CLOSED -> SYN_SENT
  41.3670: Flow 2 (TCP connection): Side 0 hash table insert: bucket: 339814
  41.3672: Flow 2 (TCP connection): TYPED -> ACTIVE
  41.3673: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3674: Flow 2 (TCP connection): TAP_SYN_ACK_SENT: SYN_SENT -> SYN_RCVD
  41.3675: Flow 2 (TCP connection): ACK_FROM_TAP_DUE
  41.3675: Flow 2 (TCP connection): timer expires in 10.000s
  41.3675: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3676: Flow 2 (TCP connection): ACK_FROM_TAP_DUE dropped
  41.3676: Flow 2 (TCP connection): ESTABLISHED: SYN_RCVD -> ESTABLISHED
  41.3678: Flow 2 (TCP connection): STALLED
  41.3678: vhost-user: got kick_data: 0000000000000002 idx: 1
  41.3679: Flow 2 (TCP connection): ACK_TO_TAP_DUE
  41.3680: Flow 2 (TCP connection): timer expires in 0.010s
  41.3680: Flow 2 (TCP connection): STALLED dropped

we'll immediately get an EPOLLOUT event, call tcp_update_seqack_wnd(),
but ignore window and ACK sequence update. At this point, we think we
acknowledged all the data to the guest (but we didn't) and we'll
happily proceed to clear the ACK_TO_TAP_DUE flag:

  41.3780: Flow 2 (TCP connection): ACK_TO_TAP_DUE dropped
  41.3780: Flow 2 (TCP connection): timer expires in 7200.000s
  41.5754: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.9956: vhost-user: got kick_data: 0000000000000001 idx: 1
  42.8275: vhost-user: got kick_data: 0000000000000001 idx: 1

while the guest starts retransmitting that data desperately, without
ever getting an ACK segment from us:

   1433  38.746353 2a01:4f8:222:904::2 → 2001:db8:9a55::1 94 TCP 54312 → 10003 [SYN] Seq=0 Win=65460 Len=0 MSS=65460 SACK_PERM TSval=1089126192 TSecr=0 WS=128
   1434  38.747357 2001:db8:9a55::1 → 2a01:4f8:222:904::2 82 TCP 10003 → 54312 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256
   1435  38.747500 2a01:4f8:222:904::2 → 2001:db8:9a55::1 74 TCP 54312 → 10003 [ACK] Seq=1 Ack=1 Win=65536 Len=0
   1436  38.747769 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1437  38.747798 2a01:4f8:222:904::2 → 2001:db8:9a55::1 32841 TCP 54312 → 10003 [ACK] Seq=8193 Ack=1 Win=65536 Len=32767
   1438  38.748049 2001:db8:9a55::1 → 2a01:4f8:222:904::2 74 TCP [TCP Window Update] 10003 → 54312 [ACK] Seq=1 Ack=1 Win=65280 Len=0
   1439  38.954044 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1440  39.370096 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1441  40.202135 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192

because seq_ack_to_tap is already set to the sequence after frame
number 1437 in the example.

For some reason, I could only reproduce this with vhost-user, IPv6,
and passt running under valgrind while taking captures. Even under
these conditions, it happens quite rarely.

Forcibly send an ACK segment if we update the ACK sequence (or the
advertised window).

Fixes: e5eefe7743 ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-16 21:15:33 +01:00
Laurent Vivier
1b95bd6fa1 vhost_user: fix multibuffer from linux
Under some conditions, Linux can provide several buffers
in the same element (multiple entries in the iovec array).

I didn't identify what changed between the guest kernel that
provides one buffer and the one that provides several
(it doesn't seem to be a kernel change or a configuration change).

Fix the following assert:

ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg

What I can see is that the buffer can be split into two iovecs:
  - vnet header
  - packet data

This change handles this special case, but the real fix will be to allow
tap_add_packet() to take an iovec array.
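
A rough sketch of the interim handling described above (not the actual
hunk; the helper name and signature are made up for illustration):

  #include <stddef.h>
  #include <sys/uio.h>

  /* If the element is split into two buffers, with the first one holding
   * only the vnet header, the packet data is the second buffer; otherwise
   * it follows the header within the single buffer.
   */
  static void element_data(const struct iovec *iov, size_t cnt,
  			 size_t hdr_len, void **data, size_t *len)
  {
  	if (cnt >= 2 && iov[0].iov_len == hdr_len) {
  		*data = iov[1].iov_base;
  		*len = iov[1].iov_len;
  	} else {
  		*data = (char *)iov[0].iov_base + hdr_len;
  		*len = iov[0].iov_len - hdr_len;
  	}
  }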

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-15 23:05:31 +01:00
Stefano Brivio
f04b483d15 test/pasta_podman: Run Podman tests on a single CPU thread
Increasingly often, I'm getting occasional failures of the same type
as https://github.com/containers/podman/issues/24147. I guess it
mostly depends on the system load.

It will be a while until I actually run tests on a kernel
including my fix for it, kernel commit a502ea6fa94b ("udp: Deal with
race between UDP socket address change and rehash"), so add a horrible
workaround using taskset(1), for the moment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-15 22:16:32 +01:00
Laurent Vivier
2c174f1fe8 checksum: fix checksum with odd base address
csum_unfolded() must call csum_avx2() with a 32-byte aligned base address.

To be able to do that when the buffer is not correctly aligned,
it splits the buffer in two parts: the second part is 32-byte aligned and
can be passed to csum_avx2(), while the first part, which is not 32-byte
aligned, is summed with sum_16b().

A problem appears if the length of the first part is odd, because
the checksum is computed over 16-bit words.

If the length is odd, when the second part is computed, all words are
shifted by one byte, meaning the weights of the upper and lower bytes are
swapped.

For instance a 13 bytes buffer:

bytes:

aa AA bb BB cc CC dd DD ee EE ff FF gg

16bit words:

AAaa BBbb CCcc DDdd EEee FFff 00gg

If we don't split the sequence, the checksum is:

AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg

If we split the sequence with an even length for the first part:

(AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)

But if the first part has an odd length:

(AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)

To avoid the problem, do not call csum_avx2() if the first part cannot
have an even length, and compute the checksum of the whole buffer using
sum_16b().

This is slower, but it can only happen if the buffer base address is odd,
which in turn can only happen if the binary is built using '-Os', and that
means we have chosen to prioritize size over speed.
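
The effect can be reproduced with a small, standalone program (this just
mimics a 16-bit accumulator such as sum_16b(); it's not the passt
implementation):

  #include <stdint.h>
  #include <stdio.h>

  /* Plain 16-bit accumulator, low byte first, as in the example above */
  static uint32_t sum_16b(const uint8_t *p, size_t len)
  {
  	uint32_t sum = 0;

  	for (; len > 1; p += 2, len -= 2)
  		sum += (uint32_t)p[0] | ((uint32_t)p[1] << 8);
  	if (len)
  		sum += p[0];	/* trailing odd byte, zero-padded */

  	return sum;
  }

  int main(void)
  {
  	/* 13 bytes, as in the aa AA bb BB ... example above */
  	const uint8_t buf[13] = { 0xaa, 0x11, 0xbb, 0x22, 0xcc, 0x33, 0xdd,
  				  0x44, 0xee, 0x55, 0xff, 0x66, 0x77 };

  	printf("whole:      %x\n", sum_16b(buf, 13));
  	printf("even split: %x\n", sum_16b(buf, 4) + sum_16b(buf + 4, 9));
  	printf("odd split:  %x\n", sum_16b(buf, 3) + sum_16b(buf + 3, 10));
  	/* The first two match, the odd split doesn't: bytes after the cut
  	 * land in the wrong half of each 16-bit word.
  	 */
  	return 0;
  }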

Reported-by: Mike Jones <mike@mjones.io>
Link: https://bugs.passt.top/show_bug.cgi?id=108
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Added comment explaining why we check for pad & 1]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-10 22:20:23 +01:00
Stefano Brivio
725acd111b tcp_splice: Set (again) TCP_NODELAY on both sides
In commit 7ecf693297 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).

It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:

  9165.2467:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from read-side call
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
  9165.2469:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from read-side call
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
  9165.2944:          pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)

and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.

In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.

To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.
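
The gist of the change, as a sketch (the real code in tcp_splice.c has
its own error reporting, this is just the idea):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdio.h>
  #include <sys/socket.h>

  static void set_nodelay(int s)
  {
  	int on = 1;

  	/* A failure here costs latency, not correctness */
  	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)))
  		perror("setsockopt(TCP_NODELAY)");
  }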

We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).

This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.

Fixes: 7ecf693297 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-06 10:10:29 +01:00
Stefano Brivio
3876fc780d seccomp: Unconditionally allow accept(2) even if accept4(2) is present
On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.

Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().

However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().

Add accept() to the list of allowed system calls regardless of the
presence of accept4().

Reported-by: NN708 <nn708@outlook.com>
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-05 23:49:11 +01:00
Laurent Vivier
898e853635 virtio: Use const pointer for vu_dev
We don't modify the structure in some virtio functions.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-05 23:48:59 +01:00
Stefano Brivio
324233bd9b udp_flow: Don't block multicast and broadcast messages
It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):

  $ mkdir /tmp/minidlna
  $ cat conf
  media_dir=/tmp/minidlna
  db_dir=/tmp/minidlna
  $ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  1.0327: Flow 0 (NEW): FREE -> NEW
  1.0327: Flow 0 (INI): NEW -> INI
  1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
  1.0327: Flow 0 (FREE): INI -> FREE
  1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900

This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.

And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.

Let multicast and broadcast packets through, refusing only to
establish flows with unspecified endpoint, as those would actually
cause havoc in the flow table.
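
As a sketch of the relaxed check (function names are made up; the real
check in udp_flow.c also covers our own addresses and ports):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdbool.h>

  /* Refuse to create a flow only if the endpoint address is unspecified:
   * multicast and broadcast endpoints are let through.
   */
  static bool endpoint_ok4(const struct in_addr *a)
  {
  	return a->s_addr != htonl(INADDR_ANY);
  }

  static bool endpoint_ok6(const struct in6_addr *a)
  {
  	return !IN6_IS_ADDR_UNSPECIFIED(a);
  }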

IP-wise, SSDP notifications look like this (after this patch), inside
and outside:

  $ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  $ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
      2   0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

  # tshark -i ens3 -a packets:1 multicast 2>/dev/null
      1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e8 ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2025-01-05 23:48:59 +01:00
Stefano Brivio
2385b69a66 Makefile: Report error and stop if we can't set TARGET
I don't think it's necessarily productive to check all the possible
error conditions in the Makefile, but this one is annoying: issue
'make' without a C compiler, then install one, and build again.

Then run passt and it will mysteriously terminate on epoll_wait(),
because seccomp.h is good enough to build against, but the resulting
seccomp filter doesn't allow any system call. Not really fun to debug.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2025-01-05 23:48:37 +01:00
Stefano Brivio
e5ba8adef7 README: Mark vhost-user as supported
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-12 10:50:48 +01:00
Stefano Brivio
09478d55fe treewide: Dodge dynamic memory allocation in strerror() from glibc > 2.40
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.

This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).

Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
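
The shape of the resulting helper, as a sketch (the actual wrapper in
passt differs; this assumes glibc >= 2.32 for strerrordesc_np()):

  #define _GNU_SOURCE
  #include <string.h>

  static const char *err_desc(int errnum)
  {
  #ifdef __GLIBC__
  	const char *s = strerrordesc_np(errnum);	/* static, untranslated */

  	if (s)
  		return s;
  #endif
  	return strerror(errnum);	/* musl: simple enough to stay safe */
  }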

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-12-11 12:21:23 +01:00
Jon Maloy
e24f026222 pasta: make it possible to disable socket splicing
During testing it is sometimes useful to force traffic which would
normally be forwarded by socket splicing through the tap interface.

In this commit, we add a command-line switch enabling such functionality
for inbound local traffic.

For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-11 01:47:37 +01:00
Laurent Vivier
947f5cdb93 tap: Call vu_init() with --fd
We need to initialize vhost-user structures with --fd too.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-10 12:26:56 +01:00
Laurent Vivier
2139ad33fc tap: Use a common function to start a new connection
Merge code from tap_backend_init(), tap_sock_tun_init() and
tap_listen_handler() to set epoll_ref entry and to add it
to epollfd.

No functionality change

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-10 12:26:34 +01:00
Laurent Vivier
8996d183c5 udp_vu: update segment size
In udp_vu_sock_recv(), collect a segment with a size defined as
IP_MAX_MTU + ETH_HLEN + sizeof(struct virtio_net_hdr_mrg_rxbuf).

The original version double-counted the IP header: IP_MAX_MTU includes
the IP header, and so did hdrlen.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
David Gibson
190829705e flow: Remove over-zealous sanity checks in flow_sidx_hash()
In flow_sidx_hash() we verify that the flow we're hashing doesn't have an
unspecified endpoint address, or zero for either port.  The hash table only
works if we're looking for exact matches of address and port, and this is
attempting to catch any cases where we might have left address or port
unpopulated or filled with a wildcard.

This doesn't really work though, because there are cases where unspecified
addresses or zero ports are correct:
 * We already use unspecified addresses for our address in cases where we
   don't know the specific local address for that side, and exclude the
   obvious extra check on side->oaddr for that reason.
 * Zero port numbers aren't strictly forbidden over the wire.  We forbid
   them for TCP & UDP because they can't safely be handled on the socket
   side.  However for ICMP a zero id, which goes in the port field is
   valid.
 * Possible future flow types (for example, for multicast protocols) might
   legitimately have an unspecified address.

Although it makes them easier to miss, these sorts of sanity checks really
have to be done at the protocol / flow type layer, and we already do so.
Remove the checks in flow_sidx_hash() other than checking that the pif
is specified.

Reported-by: Stefan <steffhip@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=105
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
David Gibson
1db4f773e8 udp: Improve detail of UDP endpoint sanity checking
In udp_flow_new() we reject a flow if the endpoint isn't unicast, or it has
a zero endpoint port.  Those conditions aren't strictly illegal, but we
can't safely handle them at present:
 * Multicast UDP endpoints are certainly possible, but our current flow
   tracking only makes sense for simple unicast flows - we'll need
   different handling if we want to handle multicast flows in future
 * It's not entirely clear if port 0 is RFC-ishly correct, but for socket
   interfaces port 0 sometimes has a special meaning such as "pick the port
   for me, kernel".  That makes flows on port 0 unsafe to forward in the
   usual way.

For the same reason we also can't safely handle port 0 as our port.  In
principle that's also true for our address, however in the case of flows
initiated from a socket, we may not know our address since the socket
could be bound to 0.0.0.0 or ::, so we can only verify that our address
is unicast for flows initiated from the tap side.

Refine the current check in udp_flow_new() to slightly more detailed checks
in udp_flow_from_sock() and udp_flow_from_tap() to make what is and isn't
handled clearer.  This makes this checking more similar to what we do for
TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-12-05 21:08:58 +01:00
Stefano Brivio
966fdc8749 perf/passt_vu_tcp: Make it shine
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 15:06:44 +01:00
Laurent Vivier
020c8b7127 tcp_vu: Compute IPv4 header checksum if dlen changes
In tcp_vu_data_from_sock() we compute the IPv4 header checksum only
for the first and the last packets, and re-use the first packet checksum
for all the other packets as the content of the header doesn't change.

It's more accurate to check the dlen value to know if the checksum
should change as dlen is the only information that can change in the
loop.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
Laurent Vivier
d9c0f8eefb Makefile: Use make internal string functions
TARGET_ARCH is computed from '$(CC) -dumpmachine' using external
bash commands like echo, cut, tr and sed. This can be done using
make internal string functions.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
b6e79efa0b tcp_vu: Remove unnecessary tcp_vu_update_check() function
Because the vhost-user <-> virtio-net path ignores checksums, we usually
don't calculate them when sending packets to the guest.  So, we always
pass no_tcp_csum=true to tcp_fill_headers().  We do want accurate
checksums when capturing packets though, so the captures don't show bogus
values.

Currently we handle this by updating the checksum field immediately before
writing the packet to the capture file, using tcp_vu_update_check().  This
is unnecessary, though: in each case tcp_fill_headers() is called not very
long before, so we can alter its no_tcp_csum parameter based on whether
we're generating captures or not.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
a6348cad51 tcp: Merge tcp_fill_headers[46]() with each other
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers.  Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed.  This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
2abf5ab7f3 tcp: Merge tcp_update_check_tcp[46]()
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases.  However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.

So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
08ea3cc581 tcp: Pass TCP header and payload separately to tcp_fill_headers[46]()
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.

More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous.  Furthermore, the
payload itself might not be contiguous.

So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
2ee07697c4 tcp: Pass TCP header and payload separately to tcp_update_check_tcp[46]()
Currently these expect both the TCP header and payload in a single IOV,
and go to some trouble to locate the checksum field within it.  In the
current caller we already know where the TCP header is, so we might as
well just pass it in.  This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
67151090bc iov, checksum: Replace csum_iov() with csum_iov_tail()
We usually want to checksum only the tail part of a frame, excluding at
least some headers.  csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector.  We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.

We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters.  This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
David Gibson
f931103171 iov: iov tail helpers
In the vhost-user code we have a number of places where we need to locate
a particular header within the guest-supplied IO vector.  We need to work
out which buffer the header is in, and verify that it's contiguous and
aligned as we need.  At the moment this is open-coded, so introduce a
helper to make this more straightforward.

We add a new datatype 'struct iov_tail' representing an IO vector from
which we've logically consumed some number of headers.  The IOV_PULL_HEADER
macro consumes a new header from the vector, returning a pointer and
updating the iov_tail.
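
For reference, the rough shape of the new datatype (the authoritative
definition is in iov.h; field names here only illustrate the description
above):

  #include <stddef.h>
  #include <sys/uio.h>

  struct iov_tail {
  	const struct iovec *iov;	/* underlying vector */
  	size_t cnt;			/* number of entries in it */
  	size_t off;			/* bytes logically consumed so far */
  };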

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-28 14:03:16 +01:00
Stefano Brivio
804a7ce94a tcp_vu: Change 'dlen' to ssize_t in tcp_vu_data_from_sock()
...to quickly suppress a false positive from Coverity, which assumes
that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen
being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that
iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
2024-11-27 16:49:21 +01:00
Laurent Vivier
00cc2303fd Fix build on 32bit target
Fix the following errors when built with CFLAGS="-m32 -U__AVX2__":

packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
   57 |                 trace("packet offset plus length %lu from size %lu, "
   58 |                       "%s:%i", start - p->buf + len + offset,
      |                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                     |
      |                                                     size_t {aka unsigned int}

packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
   57 |                 trace("packet offset plus length %lu from size %lu, "
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   58 |                       "%s:%i", start - p->buf + len + offset,
   59 |                       p->buf_size, func, line);
      |                       ~~~~~~~~~~~
      |                        |
      |                        size_t {aka unsigned int}

vhost_user.c:139:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  139 |                         return (void *)(qemu_addr - r->qva + r->mmap_addr +
      |                                ^

vhost_user.c:439:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  439 |                         munmap((void *)r->mmap_addr, r->size + r->mmap_offset);
      |                                ^

vhost_user.c:900:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  900 |                         munmap((void *)r->mmap_addr, r->size + r->mmap_offset);
      |                                ^

virtio.c:111:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  111 |                         return (void *)(guest_addr - r->gpa + r->mmap_addr +
      |                                ^

vu_common.c:37:27: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
   37 |                 char *m = (char *)dev_region->mmap_addr;
      |                           ^
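
The usual fixes for these two classes of warnings look like this (a
sketch, not necessarily the exact hunks of this patch):

  #include <stdint.h>
  #include <stdio.h>

  void example(size_t len, uint64_t mmap_addr)
  {
  	printf("length %zu\n", len);	/* %zu, not %lu, for size_t */

  	/* convert 64-bit addresses through uintptr_t, not straight to a pointer */
  	void *p = (void *)(uintptr_t)mmap_addr;
  	(void)p;
  }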

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
Laurent Vivier
6fae899cbb virtio: check if avail ring is configured
If the connection to the vhost-user front end is closed during transfers,
virtio rings are deconfigured and no longer available, but we may still
try to access them to process queued data. This can trigger a SIGSEGV as
we try to access unavailable memory.
To fix that, check that vq->vring.avail is sane before accessing the vring.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
David Gibson
7e131e920c tcp: Move tcp_l2_buf_fill_headers() to tcp_buf.c
This function only has callers in tcp_buf.c.  More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.

Therefore, move it to tcp_buf.c.

Slightly reformat the prototypes while we're at it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:49:21 +01:00
Stefano Brivio
676bf5488e test: Add tests for passt in vhost-user mode
Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.

For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.

For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-27 16:49:21 +01:00
Laurent Vivier
28997fcb29 vhost-user: add vhost-user
add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
 before including <linux/if_ether.h>, as C libraries such as musl
 define __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
 a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:47:32 +01:00
Laurent Vivier
b2e62f7e85 passt: rename tap_sock_init() to tap_backend_init()
Extract pool storage initialization loop to tap_sock_update_pool(),
extract QEMU hints to tap_backend_show_hints().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:27 +01:00
Laurent Vivier
b7c292b758 tcp: Export headers functions
Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().

They'll be needed by vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:24 +01:00
Laurent Vivier
5a8b33c667 udp: Prepare udp.c to be shared with vhost-user
Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
udp_sock_errs().

Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
udp_reply_sock_handler() to udp_buf_reply_sock_handler().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:21 +01:00
Laurent Vivier
31117b27c6 vhost-user: introduce vhost-user API
Add vhost_user.c and vhost_user.h that define the functions needed
to implement vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:12:19 +01:00
Laurent Vivier
7d1cd4dbf5 vhost-user: introduce virtio API
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:11:36 +01:00
Laurent Vivier
dd143e3890 packet: replace struct desc by struct iovec
To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32-bit offset to address them.

We need a 64-bit address, so replace struct desc by struct iovec
and update range checking.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:11:18 +01:00
Stefano Brivio
c0fbc7ef2a dhcp: Honour broadcast flag (RFC 2131, 4.1)
It's widely considered a legacy option nowadays, and I haven't seen
clients setting it since Windows 95, but it's convenient for a minimal
DHCP client not using raw IP sockets such as what I'm playing with for
muvm.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-27 05:37:28 +01:00
Stefano Brivio
9da2038485 dhcp: Introduce support for Rapid Commit (option 80, RFC 4039)
I'm trying to speed up and simplify IP address acquisition in muvm.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-27 05:37:28 +01:00
Stefano Brivio
d6e9e2486f dhcp: Use -1 as "missing option" length instead of 0
We want to add support for option 80 (Rapid Commit, RFC 4039), whose
length is 0.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-27 05:37:28 +01:00
Stefano Brivio
14b84a7f07 treewide: Introduce 'local mode' for disconnected setups
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.

To this end, if we don't find any usable host interface, instead of
exiting:

- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
  as default gateway

- for IPv6, don't assign any address (forcibly disable DHCPv6), and
  use the *first* link-local address we observe to represent the
  guest/container. Advertise fe80::1 as default gateway

- use 'tap0' as default interface name for pasta

Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.

Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).

Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 05:16:38 +01:00
David Gibson
c6e6106413 test: Improve logic for waiting for SLAAC & DAD to complete in NDP tests
Since 9a0e544f05 the NDP tests attempt to explicitly wait for DAD to
complete, rather than just having a hard coded sleep.  However, the
conditions we use are a bit sloppy and allow for a number of possible cases
where it might not work correctly.  Stefano seems to be hitting one of
these (though I'm not sure which) with some later patches.

 - We wait for *lack* of a tentative address, so if the first check occurs
   before we have even a tentative address it will bypass the delay
 - It's not entirely clear if the permanent address will always appear
   as soon as the tentative address disappears
 - We weren't filtering on interface
 - We were doing the filtering with ip-address options rather than in jq.
   However, at least in some circumstances this seems to result in an
   empty .addr_info field, rather than omitting it entirely, which could
   cause us to get the wrong result

So, instead, explicitly wait for the address we need to be present: an
RA provided address on the external interface.  While we're here we remove
the requirement that it have global scope: the "kernel_ra" check is already
sufficient to make sure this address comes from an NDP RA, not something
else.  If it's not the global scope address we expect, better to check it
and fail, rather than keep waiting.

Fixes: 9a0e544f05 ("test: Improve test for NDP assigned prefix")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-26 08:30:18 +01:00
Stefano Brivio
cda7f160f0 ndp: Don't send first periodic router advertisement right after guest connects
This is very visible with muvm, but it also happens with QEMU: we're
sending the first unsolicited router advertisement milliseconds after
the guest connects.

That's usually pointless because, when the hypervisor connects, the
guest is typically not ready yet to process anything of that sort:
it's still booting. And if we happen to send it late enough (still
milliseconds), with muvm, while the message is discarded, it
sometimes (slightly) delays the response to the first solicited
router advertisement, which is the one we need to have coming fast.

Skip sending the unsolicited advertisement on the first timer run,
just calculate the next delay. Keep it simple by observing that we're
probably not trying to reach the 1970s with IPv6.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-26 08:30:18 +01:00
Stefano Brivio
2bf8ffcf07 test/perf: Select a single IPv6 namespace address in pasta tests
By dropping the filter on prefix length, commit 910f4f9103
("test: Don't require 64-bit prefixes in perf tests") broke tests on
setups where two global unicast IPv6 addresses are available, which
is the typical case when the "host" is a VM running under passt with
addresses from SLAAC and DHCPv6, because two addresses will be
returned.

Pick the first one instead. We don't really care about the prefix
length, any of these addresses will work.

Fixes: 910f4f9103 ("test: Don't require 64-bit prefixes in perf tests")
Link: https://archives.passt.top/passt-dev/20241119214344.6b4a5b3a@elisabeth/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-26 08:30:18 +01:00
Stefano Brivio
6819b2e102 conf, passt.1: Update --mac-addr default in usage() and man page
Fixes: 90e83d50a9 ("Don't take "our" MAC address from the host")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-26 08:30:18 +01:00
Stefano Brivio
b61be8468a passt.1: Fix "default" note about --map-guest-addr
It's not true that there's no mapping by default: there's no mapping
in the --map-guest-addr sense, by default, but in that case
the default --map-host-loopback behaviour prevails.

While at it, fix a typo.

Fixes: 57b7bd2a48 ("fwd, conf: Allow NAT of the guest's assigned address")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-26 08:30:18 +01:00
Stefano Brivio
238c69f9af tcp: Acknowledge keep-alive segments, ignore them for the rest
RFC 9293, 3.8.4 says:

   Implementers MAY include "keep-alives" in their TCP implementations
   (MAY-5), although this practice is not universally accepted.  Some
   TCP implementations, however, have included a keep-alive mechanism.
   To confirm that an idle connection is still active, these
   implementations send a probe segment designed to elicit a response
   from the TCP peer.  Such a segment generally contains SEG.SEQ =
   SND.NXT-1 and may or may not contain one garbage octet of data.  If
   keep-alives are included, the application MUST be able to turn them
   on or off for each TCP connection (MUST-24), and they MUST default to
   off (MUST-25).

but currently, tcp_data_from_tap() is not aware of this and will
schedule a fast re-transmit on the second keep-alive (because it's
also a duplicate ACK), ignoring the fact that the sequence number was
rewinded to SND.NXT-1.

ACK these keep-alive segments, reset the activity timeout, and ignore
them for the rest.
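
The detection itself boils down to something like this (a sketch, not
the actual tcp.c code; seq_next stands for the next sequence number we
expect from the guest):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  static bool is_keepalive(uint32_t seq, uint32_t seq_next, size_t len)
  {
  	return len <= 1 && seq == seq_next - 1;	/* SEG.SEQ = SND.NXT - 1 */
  }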

At some point, we could think of implementing an approximation of
keep-alive segments on outbound sockets, for example by setting
TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single
keep-alive segment at approximately the same time, and never reset the
connection. That's beyond the scope of this fix, though.

Reported-by: Tim Besard <tim.besard@gmail.com>
Link: https://github.com/containers/podman/discussions/24572
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-21 06:52:36 +01:00
Stefano Brivio
af464c4ffb tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore
We enter the timer handler with the ACK_TO_TAP_DUE flag, call
tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we
acknowledged everything meanwhile, so we return early, but we also
need to reset that flag to avoid unnecessarily scheduling the timer
over and over again until more pending data appears.

I'm not sure if this fixes any real issue, but I've spotted this
in several logs reported by users, including one where we have some
unexpected bursts of high CPU load during TCP transfers at low rates,
from https://github.com/containers/podman/issues/23686.

Link: https://github.com/containers/podman/discussions/24572
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-21 06:51:25 +01:00
David Gibson
5ae21841ac ndp: Don't send unsolicited RAs if NDP is disabled
We recently added support for sending unsolicited NDP Router Advertisement
packets.  While we (correctly) disable this if the --no-ra option is given
we incorrectly still send them if --no-ndp is set.  Fix the oversight.

Fixes: 6e1e44293e ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-19 21:10:42 +01:00
Stefano Brivio
bf9492747d ndp: Don't send unsolicited router advertisement if we can't, yet
ndp_timer() is called right away on the first epoll_wait() cycle,
when the communication channel to the guest isn't ready yet:

  1.0038: NDP: sending unsolicited RA, next in 264s
  1.0038: tap: failed to send 1 frames of 1

check that it's up before sending it. This effectively delays the
first gratuitous router advertisement, which is probably a good idea
given that we expect the guest to send a router solicitation right
away.

Fixes: 6e1e44293e ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-19 21:10:14 +01:00
Stefano Brivio
5e24466677 selinux: Use auth_read_passwd() interface for all our getpwnam() needs
If passt or pasta are started as root, we need to read the passwd file
(be it /etc/passwd or whatever sssd provides) to find out UID and GID
of 'nobody' so that we can switch to it.

Instead of a bunch of allow rules for passwd_file_t and sssd macros,
use the more convenient auth_read_passwd() interface which should
cover our usage of getpwnam().

The existing rules weren't actually enough:

  # strace -e openat passt -f
  [...]
  Started as root, will change to nobody.
  openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/lib64/libnss_sss.so.2", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 4

with corresponding SELinux warnings logged in audit.log.

Reported-by: Minxi Hou <mhou@redhat.com>
Analysed-by: Miloš Malik <mmalik@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-19 21:10:14 +01:00
David Gibson
6e1e44293e ndp: Send unsolicited Router Advertisements
Currently, our NDP implementation only sends Router Advertisements (RA)
when it receives a Router Solicitation (RS) from the guest.  However,
RFC 4861 requires that we periodically send unsolicited RAs.

Linux as a guest also requires this: it will send an RS when a link first
comes up, but the route it gets from this will have a finite lifetime (we
set this to 65535s, the maximum allowed, around 18 hours).  When that
expires the guest will not send a new RS, but instead expects the route to
have been renewed (if still valid) by an unsolicited RA.

Implement sending unsolicited RAs on a partially randomised timer, as
required by RFC 4861.  The RFC also specifies that solicited RAs should
also be delayed, or even omitted, if the next unsolicited RA is soon
enough.  For now we don't do that, always sending an immediate RA in
response to an RS.  We can get away with this because in our use cases
we expect to just have passt itself and the guest on the link, rather than
a large broadcast domain.
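
The interval selection follows the usual RFC 4861, 6.2.4 scheme, roughly
as sketched here (the constants shown are the RFC defaults, not
necessarily the values passt uses):

  #include <stdlib.h>

  #define MAX_RTR_ADV_INTERVAL	600	/* seconds */
  #define MIN_RTR_ADV_INTERVAL	(MAX_RTR_ADV_INTERVAL / 3)

  /* Uniformly distributed delay between min and max; modulo bias and the
   * quality of random() don't matter here.
   */
  static int next_ra_delay(void)
  {
  	return MIN_RTR_ADV_INTERVAL +
  	       random() % (MAX_RTR_ADV_INTERVAL - MIN_RTR_ADV_INTERVAL + 1);
  }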

Link: https://github.com/kubevirt/kubevirt/issues/13191
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:40 +01:00
David Gibson
b39760cc7d passt: Seed libc's pseudo random number generator
We have an upcoming case where we need pseudo-random numbers to scatter
timings, but we don't need cryptographically strong random numbers.  libc's
built in random() is fine for this purpose, but we should seed it.  Extend
secret_init() - the only current user of random numbers - to do this as
well as generating the SipHash secret.  Using /dev/random for a PRNG seed
is probably overkill, but it's simple and we only do it once, so we might
as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:38 +01:00
David Gibson
71d5deed5e util: Add general low-level random bytes helper
Currently secret_init() open codes getting good quality random bytes from
the OS, either via getrandom(2) or reading /dev/random.  We're going to
add at least one more place that needs random data in future, so make a
general helper for getting random bytes.  While we're there, fix a number
of minor bugs:
 - getrandom() can theoretically return a "short read", so handle that case
 - getrandom() as well as read can return a transient EINTR
 - We would attempt to read data from /dev/random if we failed to open it
   (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok)
 - More specific error reporting
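
A minimal sketch of such a helper (not the actual util.c code; the
fallback to /dev/random and the error reporting are omitted here):

  #include <errno.h>
  #include <sys/random.h>
  #include <sys/types.h>

  static int random_bytes(void *buf, size_t len)
  {
  	unsigned char *p = buf;

  	while (len) {
  		ssize_t n = getrandom(p, len, 0);

  		if (n < 0) {
  			if (errno == EINTR)
  				continue;	/* transient, try again */
  			return -1;		/* caller falls back to /dev/random */
  		}
  		p += n;				/* short read: keep going */
  		len -= n;
  	}

  	return 0;
  }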

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:36 +01:00
David Gibson
a60703e899 ndp: Make route lifetime a #define
Currently we open-code the lifetime of the route we advertise via NDP to be
65535s (the maximum).  Change it to a #define.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:34 +01:00
David Gibson
36c070e6e3 ndp: Use struct assignment in preference to memcpy() for IPv6 addresses
There are a number of places we can simply assign IPv6 addresses about,
rather than the current mildly ugly memcpy().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:31 +01:00
David Gibson
cbc83e14df ndp: Split out helpers for sending specific NDP message types
Currently the large ndp() function responds to all NDP messages we handle,
both parsing the message as necessary and sending the response.  Split out
the code to construct and send specific message types into ndp_na() (to
send NA messages) and ndp_ra() (to send RA messages).

As well as breaking up an excessively large function, this is a first step
to being able to send unsolicited NDP messages.

While we're there, remove a slightly ugly goto.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:29 +01:00
David Gibson
4e47167035 ndp: Add ndp_send() helper
ndp() has a conditional on message type generating the reply message, then
a tiny amount of common code, then another conditional to send the reply
with slightly different parameters.  We can make this a bit neater by
making a helper function for sending the reply, and call it from each of
the different message type paths.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:28 +01:00
David Gibson
71f228d04b ndp: Remove redundant update to addr_seen
ndp() updates addr_seen or addr_ll_seen based on the source address of the
received packet.  This is redundant since tap6_handler() has already
updated addr_seen for any type of packet, not just NDP.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:13 +01:00
David Gibson
0588163b1f cppcheck: Don't check the system headers
We pass -I options to cppcheck so that it will find the system headers.
Then we need to pass a bunch more options to suppress the zillions of
cppcheck errors found in those headers.

It turns out, however, that it's not recommended to give the system headers
to cppcheck anyway.  Instead it has built-in knowledge of the ANSI libc and
uses that as the basis of its checks.  We do need to suppress
missingIncludeSystem warnings instead though.

Not bothering with the system headers makes the cppcheck runtime go from
~37s to ~14s on my machine, which is a pretty nice win.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:21 +01:00
David Gibson
14dd70e2b3 linux_dep: Fix CLOSE_RANGE_UNSHARE availability handling
If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of
close_range() which is a (successful) no-op.  This is broken in several
ways:
 * It doesn't actually fix compile if using old kernel headers, because
   the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE
   unprotected by ifdefs
 * Even if it did fix the compile, it means inconsistent behaviour between
   a compile time failure to find the value (we silently don't close files)
   and a runtime failure (we die with an error from close_range())
 * Silently not closing the files we intend to close for security reasons
   is probably not a good idea in any case

We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't
available, because that would require running on kernel >= 5.9.  On the
other hand there's not really any other way to flush all possible fds
leaked by the parent (close() in a loop takes over a minute).  So in this
case print a warning and carry on.

As bonus this fixes a cppcheck error I see with some different options I'm
looking to apply in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:17 +01:00
David Gibson
d64f257243 linux_dep: Move close_range() conditional handling to linux_dep.h
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions.  Move this to linux_dep.h which handles several
other similar cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:15 +01:00
David Gibson
b84cd05098 log: Only check for FALLOC_FL_COLLAPSE_RANGE availability at runtime
log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt
to use it if not defined.  But even if the value is defined at compile
time, it might not be available in the runtime kernel, so we need to check
for errors from a fallocate() call and fall back to other methods.

Simplify this to only need the runtime check by using linux_dep.h to define
FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers.
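
The pattern, roughly (a sketch of the idea, not the exact linux_dep.h
and log.c hunks):

  #include <linux/falloc.h>

  #ifndef FALLOC_FL_COLLAPSE_RANGE
  #define FALLOC_FL_COLLAPSE_RANGE	0x08
  #endif

  /* ...and, at runtime, treat a failing fallocate() as "not supported":
   *
   *	if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len))
   *		fall back to another way of trimming the file
   */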

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:25:58 +01:00
Stefano Brivio
58fa5508bd tap, tcp, util: Add some missing SOCK_CLOEXEC flags
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:

/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1139 |         int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      |                                             ^
      |                                              | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1158 |                 ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
      |                                                                 ^
      |                                                                  | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1413 |         s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
      |                                                   ^
      |                                                    | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
  188 |         if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
      |                                             ^
      |                                              | SOCK_CLOEXEC

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:58 +01:00
Stefano Brivio
71869e2912 passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:

/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
  314 |         nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
      |                                                            ^

We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:52 +01:00
Stefano Brivio
d4f09c9b96 util: Define small and big thresholds for socket buffers as unsigned long long
On 32-bit architectures, clang-tidy reports:

/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~

because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.
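
The definitions then become, matching the macros quoted above:

  #define SNDBUF_BIG		(4ULL * 1024 * 1024)
  #define SNDBUF_SMALL		(128ULL * 1024)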

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:49 +01:00
Stefano Brivio
87940f9aa7 tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.

This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:

/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
 1087 |         for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
      |                     ~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.
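
That is, the loop condition becomes:

  for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len) {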

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:45 +01:00
Stefano Brivio
1feb90fe62 dhcpv6: Turn some option headers pointers to const
cppcheck 2.14.2 on Alpine reports:

dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer]
 struct opt_hdr *ia, *bad_ia, *client_id;
                               ^

It's not only 'client_id': we can declare 'ia' as const pointer too.
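
Presumably ending up as something like:

  const struct opt_hdr *ia, *client_id;
  struct opt_hdr *bad_ia;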

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:41 +01:00
Stefano Brivio
5f5e814cfc dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning
cppcheck 2.16.0 reports:

dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
 if (ia_type == OPT_IA_NA) {
             ^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
 ia_type = OPT_IA_NA;
           ^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
 if (ia_type == OPT_IA_NA) {
             ^

This is not really the case as we set ia_type to OPT_IA_TA and then
jump back.

Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.
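
A minimal sketch of such a macro (ARRAY_SIZE() as the usual
element-count helper; names and details here are illustrative):

  #define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

  #define foreach(item, array)						\
  	for ((item) = (array);						\
  	     (item) - (array) < (int)ARRAY_SIZE(array); (item)++)

used, for instance, as:

  	const int ia_types[] = { OPT_IA_NA, OPT_IA_TA };
  	const int *ia_type;

  	foreach(ia_type, ia_types) {
  		/* handle one IA type per iteration, no goto needed */
  	}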

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:11 +01:00
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions simultaneously, with patch:
========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions simultaneously, without patch:
============================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
David Gibson
9a0e544f05 test: Improve test for NDP assigned prefix
In the NDP tests we search explicitly for a guest address with prefix
length 64.  AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means.  We can do
that more explicitly by checking for .protocol == "kernel_ra", however.

The SLAAC prefixes we assigned *will* always be 64-bit, that's hard-coded
into our NDP implementation.  RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64-bits.

Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:37 +01:00
David Gibson
910f4f9103 test: Don't require 64-bit prefixes in perf tests
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length.  There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:34 +01:00
David Gibson
1699083f29 test: Make nstool hold robust against interruptions to control clients
Currently nstool die()s on essentially any error.  In most cases that's
fine for our purposes.  However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket.  This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill of the namespace we're holding.

Adjust these to print an error, close() the control client socket and
carry on.  In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.
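
A rough sketch of the resulting hold loop ('sock' is the listening
control socket and 'info' whatever we report back to clients; not the
exact nstool code):

  signal(SIGPIPE, SIG_IGN);	/* a vanished client mustn't kill us */

  for (;;) {
  	int cfd = accept(sock, NULL, NULL);

  	if (cfd < 0) {
  		perror("accept() on control socket");
  		continue;
  	}

  	if (write(cfd, &info, sizeof(info)) < 0)
  		perror("write() to control client");	/* carry on */

  	close(cfd);
  }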

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:30 +01:00
David Gibson
b456ee1b53 test: Rename propagating signal handler
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes.  The signal handler which
accomplishes this is called simply sig_handler().  However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:27 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
6f913b3af0 udp: Don't dereference uflow before NULL check in udp_reply_sock_handler()
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler().  However, we dereference uflow before that in
an initializer, rather defeating the point.  Rearrange to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:22 +01:00
David Gibson
d8e05a3fe0 ndp: Use const pointer for ndp_ns packet
We don't modify this structure at all.  For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:19 +01:00
David Gibson
0d7b8201ed linux_dep: Generalise tcp_info.h to handling Linux extension compatibility
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers.  There are several other cases where we can remove similar
compile time dependencies on kernel version.  Prepare for that by renaming
tcp_info.h to linux_dep.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:16 +01:00
David Gibson
c5f4e4d146 fwd: Squash different-signedness comparison warning
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral().  This is because
NUM_PORTS evaluates to an unsigned integer.  It's a fixed value, though
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:14 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except.. we don't have any users for it any more, so remove it.

'make cppcheck' doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
1d7cff3779 clang: Add rudimentary clangd configuration
clangd's default configuration seems to try to treat .h files as C++ not
C.  There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:07 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
David Gibson
13fc6d511e Makefile: Use -DARCH for qrap only
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile.  However, this is only used in qrap.c, not anywhere else in
passt or pasta.  Only supply this -D when compiling qrap specifically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:59 +01:00
David Gibson
7917159005 seccomp: Simplify handling of AUDIT_ARCH
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D.  The only place that uses it, though is the
BPF filter generated by seccomp.sh.  seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define.  Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:55 +01:00
David Gibson
93bce404c1 Makefile: Move NETNS_RUN_DIR definition to C code
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D.  But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile.  Just move it to a plain #define in conf.c.
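
That is, something like (assuming the conventional iproute2 location):

  #define NETNS_RUN_DIR	"/run/netns"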

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:52 +01:00
David Gibson
c938d8a93e netlink: RTA_PAYLOAD() returns int, not size_t
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t.  However, it doesn't - it explicitly casts
its result to an int.  RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter compares it to an int, so using size_t can
result in comparison of different-signed integer warnings from clang.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:48 +01:00
David Gibson
f6b546c6e4 flow: Correct type of flowside_at_sidx()
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case.  PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it.  This removes a clang warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:44 +01:00
David Gibson
30b4f88167 arch: Avoid explicit access to 'environ'
We pass 'environ' to execve() in arch_avx2_exec(), so that we retain the
environment in the current process.  But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration.  But, we can avoid needing to reference it
explicitly by using execv() instead of execve().  This removes a clang
warning.
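
For illustration (argument names assumed), instead of:

  extern char **environ;	/* needs an explicit declaration */

  execve(path, argv, environ);

we can simply use:

  execv(path, argv);		/* keeps the caller's environment */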

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:29 +01:00
David Gibson
b78e72da0b clang: Move clang-tidy configuration from Makefile to .clang-tidy
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable).  Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well.  As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:19 +01:00
David Gibson
8346216c9a Makefile: Simplify exclusion of qrap from static checks
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing.  So, we currently exclude it using $(filter-out).  However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.

Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:07 +01:00
David Gibson
8f1b6a0ca6 clang: Add .clang-format file
I've been experimenting with clangd, but its default format style is
horrid.  Since our style is basically that of the Linux kernel, copy the
.clang-format from the kernel, minus reference to a bunch of kernel
specific macros.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:45:16 +01:00
David Gibson
5e93bcd8bf test: Adjust misplaced sleeps in two_guests code
Most of our transfer tests using socat use 'sleep' to wait for the server
side to be ready before starting the client.  However in two_guests/basic
the sleep is in the wrong place: rather than being between starting the
server and starting the client, it's after waiting for the server to
complete.  This causes occasional hangs when the client runs before the
server is ready - in that case the receiving guest sends an RST, which we
don't (currently) propagate back to the sender.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-05 23:46:38 +01:00
Stefano Brivio
9afce0b45c tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:

tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
 1291 |         rc = ioctl(fd, TUNSETIFF, &ifr);
      |                        ^~~~~~~~~

We don't care about that overflow, so explicitly cast TUNSETIFF to
int.
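
That is:

  rc = ioctl(fd, (int)TUNSETIFF, &ifr);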

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:33 +01:00
Stefano Brivio
d165d36a0c tcp: Fix build against musl, __sum16 comes from linux/types.h
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).

Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-05 23:46:24 +01:00
Stefano Brivio
ee7d0b62a7 util: Don't use errno after a successful call in __daemon()
I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:

/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
  446 |         if (pid == -1) {
      |             ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
  446 |         if (pid == -1) {
      |         ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
  451 |         if (pid) {
      |             ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
  451 |         if (pid) {
      |         ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
  463 |         close(devnull_fd);
      |         ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
  465 |         if (errno)
      |             ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
   38 | # define errno (*__errno_location ())
      |                ^~~~~~~~~~~~~~~~~~~~~~

And the LLVM documentation for the unix.Errno checker, 1.1.8.3
unix.Errno (C), mentions, at:

  https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno

that:

  The C and POSIX standards often do not define if a standard library
  function may change value of errno if the call does not fail.
  Therefore, errno should only be used if it is known from the return
  value of a function that the call has failed.

which is, somewhat surprisingly, the case for close().

Instead of using errno, check the actual return values of the calls
we issue here.
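
A sketch of the pattern (not necessarily the exact __daemon() code):

  /* before: infer failure from errno, which close() may leave undefined */
  errno = 0;
  dup2(devnull_fd, STDIN_FILENO);
  dup2(devnull_fd, STDOUT_FILENO);
  dup2(devnull_fd, STDERR_FILENO);
  close(devnull_fd);
  if (errno)
  	exit(EXIT_FAILURE);

  /* after: check each return value directly */
  if (dup2(devnull_fd, STDIN_FILENO)  < 0 ||
      dup2(devnull_fd, STDOUT_FILENO) < 0 ||
      dup2(devnull_fd, STDERR_FILENO) < 0 ||
      close(devnull_fd))
  	exit(EXIT_FAILURE);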

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
b1a607fba1 udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx
/home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors]
  171 | enum udp_iov_idx {
      | ^
  172 |         UDP_IOV_TAP     = 0,
  173 |         UDP_IOV_ETH     = 1,
  174 |         UDP_IOV_IP      = 2,
  175 |         UDP_IOV_PAYLOAD = 3,
  176 |         UDP_NUM_IOVS
      |
      |                      = 4

Don't initialise any value, so that it's obvious that constants map to
unique values.
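
That is:

  enum udp_iov_idx {
  	UDP_IOV_TAP,
  	UDP_IOV_ETH,
  	UDP_IOV_IP,
  	UDP_IOV_PAYLOAD,
  	UDP_NUM_IOVS,
  };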

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
099ace64ce treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.

As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.

For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
59fe34ee36 treewide: Suppress clang-tidy warning if we already use O_CLOEXEC
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.

In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).

The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.

Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().
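
A sketch of what such a helper boils down to (exact flags and mode bits
are whatever the three call sites need):

  int output_file_open(const char *path, int flags)
  {
  	/* O_CLOEXEC is deliberately in flags, not in the mode argument */
  	/* NOLINTNEXTLINE(android-cloexec-open) */
  	return open(path, flags | O_CLOEXEC, S_IRUSR | S_IWUSR);
  }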

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
134b4d58b4 Makefile: Disable readability-math-missing-parentheses clang-tidy check
With clang-tidy and LLVM 19:

/home/sbrivio/passt/conf.c:1218:29: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
 1218 |                 const char *octet = str + 3 * i;
      |                                           ^~~~~~
      |                                           (    )
/home/sbrivio/passt/ndp.c:285:18: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  285 |                                         .len            = 1 + 2 * n,
      |                                                               ^~~~~~
      |                                                               (    )
/home/sbrivio/passt/ndp.c:329:23: error: '%' has higher precedence than '-'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  329 |                         memset(ptr, 0, 8 - dns_s_len % 8);      /* padding */
      |                                            ^~~~~~~~~~~~~~
      |                                            (            )
/home/sbrivio/passt/pcap.c:131:20: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  131 |                 pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
      |                                  ^~~~~~~~~~~~~~~~
      |                                  (              )
/home/sbrivio/passt/util.c:216:10: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  216 |                 return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                            )
/home/sbrivio/passt/util.c:217:10: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  217 |                        (a->tv_sec - b->tv_sec - 1) * 1000000;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                        (                                    )
/home/sbrivio/passt/util.c:220:9: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  220 |         return (a->tv_nsec - b->tv_nsec) / 1000 +
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                               )
/home/sbrivio/passt/util.c:221:9: error: '*' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  221 |                (a->tv_sec - b->tv_sec) * 1000000;
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                (                                )
/home/sbrivio/passt/util.c:545:32: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors]
  545 |         return clone(fn, stack_area + stack_size / 2, flags, arg);
      |                                       ^~~~~~~~~~~~~~~
      |                                       (             )

Just... no.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
744247856d treewide: Silence cert-err33-c clang-tidy warnings for fprintf()
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.

Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.
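
Something along these lines:

  #define FPRINTF(f, ...)	(void)fprintf((f), __VA_ARGS__)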

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
98efe7c2fd treewide: Comply with CERT C rule ERR33-C for snprintf()
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):

/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
   43 |                 snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  577 |                         snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  579 |                                 snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  580 |                                          pidval);
      |                                          ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  105 |         snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  242 |         snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  243 |         snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
 1155 |                         snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning

Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
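
A minimal sketch of such a helper (the real signature and error
handling may differ; needs <stdarg.h> and <stdio.h>):

  int snprintf_check(char *str, size_t size, const char *format, ...)
  {
  	va_list ap;
  	int rc;

  	va_start(ap, format);
  	rc = vsnprintf(str, size, format, ap);
  	va_end(ap);

  	if (rc < 0 || (size_t)rc >= size)
  		return -1;	/* output error or truncation */

  	return 0;
  }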

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:25 +01:00
Stefano Brivio
988a4d75f8 Makefile: Exclude qrap.c from clang-tidy checks
We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as
of LLVM versions 16 and later would need a bunch of changes there to
be addressed, mostly around CERT C rule ERR33-C and checking return
code from snprintf().

It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 08:21:19 +01:00
Jon Maloy
ba38e67cf4 tcp: unify l2 TCPv4 and TCPv6 queues and structures
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.

Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.

Tests with iperf3 shows no measurable change in performance after this
change.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:44:08 +01:00
Jon Maloy
2053c36dec tcp: set ip and eth headers in l2 tap queues on the fly
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through its whole life time. The only
fields we need to update per message are things like payload size
and checksums.

If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.

The same needs to be done for the ethernet header.

We do these changes here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:43:24 +01:00
Laurent Vivier
5563d5f668 test: remove obsolete images
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:30:06 +02:00
Laurent Vivier
f43f7d5e89 tcp: cleanup tcp_buf_data_from_sock()
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can use len directly.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:51 +02:00
David Gibson
e7fcd0c348 tcp: Use runtime tests for TCP_INFO fields
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in structure returned by the runtime kernel.  We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.

However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel.  We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd.  Switch from the compile time flags to a runtime
test.
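
The runtime check amounts to comparing the length the kernel actually
filled in against the offset of the field we need, along these lines
('s', 'min_rtt' and the use of the field are illustrative; offsetof()
needs <stddef.h>):

  struct tcp_info_linux tinfo;
  socklen_t sl = sizeof(tinfo);

  if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl))
  	return;

  if (sl >= offsetof(struct tcp_info_linux, tcpi_min_rtt) +
  	  sizeof(tinfo.tcpi_min_rtt))
  	min_rtt = tinfo.tcpi_min_rtt;	/* field known to this kernel */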

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:46 +02:00
David Gibson
81143813a6 tcp: Generalise probing for tcpi_snd_wnd field
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).

In fact we should check that for every tcp_info field we want to use,
beyond the very old ones shared with BSD.  Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure.  We can then use that recorded
value to check for the presence of any field we need.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:27:17 +02:00
David Gibson
13f0291ede tcp: Remove compile-time dependency on struct tcp_info version
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info.  These defines are used for two
purposes, neither of which they accomplish well:

1) Determining if the tcp_info fields are available at runtime.  For this
   purpose the defines are Just Plain Wrong, since the runtime kernel may
   not be the same as the compile time kernel. We corrected this for
   tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt

2) Allowing the source to compile against older kernel headers which don't
   have the fields in question.  This works in theory, but it does mean
   we won't be able to use the fields, even if later run against a
   newer kernel.  Furthermore, it's quite fragile: without much more
   thorough tests of builds in different environments that we're currently
   set up for, it's very easy to miss cases where we're accessing a field
   without protection from an #ifdef.  For example we currently access
   tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().

Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems).  Don't compile against linux/tcp.h, using
netinet/tcp.h instead.  Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in.  That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.

This allows us to remove the HAS_SND_WND define entirely.  We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:26:48 +02:00
Stefano Brivio
9e4615b40b tcp_splice: fcntl(2) returns the size of the pipe, if F_SETPIPE_SZ succeeds
Don't report bogus failures (with --trace) just because the return
value is not zero.

Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:30 +02:00
Stefano Brivio
149f457b23 tcp_splice: splice() all we have to the writing side, not what we just read
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97457 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.
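
In code, the write-side call then looks roughly like this (illustrative
names, not the exact call site):

  written = splice(pipe_rd, NULL, sock_out, NULL, pipe_size,
  		   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);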

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.

Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-25 09:36:18 +02:00
David Gibson
9e5df350d6 tcp: Use structures to construct initial TCP options
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte.  However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).

Rework this to use a structure and initialisers which make it a bit
clearer what's going on.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-21 18:51:04 +02:00
David Gibson
b4dace8f46 fwd: Direct inbound spliced forwards to the guest's external address
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing.  This gives us a very large performance improvement when it's
possible.

Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address.  However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
forwarding exposes those services to the host, which isn't generally what
we want.

Correct this by instead forwarding inbound "splice" flows to the guest's
external address.

Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:03 +02:00
David Gibson
58e6d68599 test: Clarify test for spliced inbound transfers
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1.  This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace.  That's not an inherent property for "spliced" connections
and arguably an undesirable one.  Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.

Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.

Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:00 +02:00
David Gibson
1fa421192c passt.1: Clarify and update "Handling of local addresses" section
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour.  Update it accordingly.

It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host.  Use "loopback
address" to be more precise about this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:57 +02:00
David Gibson
ef8a5161d0 passt.1: Mark --stderr as deprecated more prominently
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label.  Add
one to make it easier to spot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:54 +02:00
David Gibson
53176ca91d test: Wait for DAD on DHCPv6 addresses
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable.  That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one), however it's not true with the Debian
dhclient-script.  The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes.  That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.

We usually get away with this, because by the time we do anything where the
address matters, DAD has completed.  However, it's not robust, so we should
explicitly wait for DAD to complete when we get an DHCPv6 address.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:51 +02:00
David Gibson
75b9c0feb0 test: Explicitly wait for DAD to complete on SLAAC addresses
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready.  In several places we have an explicit 'sleep 2' to wait for that
to complete.

Fixed length delays are never a great idea, although this one is pretty
solid.  Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.

Replace the fixed sleeps with a loop waiting for DAD to complete.  We do
this by looping waiting for all tentative addresses to disappear.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:47 +02:00
David Gibson
f9d677bff6 arp: Fix a handful of small warts
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
 * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
   example.
 * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
   sizeof(am->tip) (same value, but makes more logical sense)
 * Describe the guest's assigned address as such, rather than as "our
   address" - that's not usually what we mean by "our address" these days
 * Remove "we might have the same IP address" comment which I can't make
   sense of in context (possibly it's relating to the statement below,
   which already has its own comment?)

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:27:04 +02:00
Stefano Brivio
2d7f734c45 tcp: Send "empty" handshake ACK before first data segment
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.

We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.

However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.

This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude it's
actually causing problems.

Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-15 20:34:26 +02:00
Stefano Brivio
7612cb80fe test: Pass TRACE from run_term() into ./run from_term
Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests
with TRACE=1 will not actually enable tracing output.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:19 +02:00
Stefano Brivio
b40880c157 test/lib/term: Always use printf for messages with escape sequences
...instead of echo: otherwise, bash won't handle escape sequences we
use to colour messages (and 'echo -e' is not specified by POSIX).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-10 05:25:00 +02:00
David Gibson
ff63ac922a conf: Add --dns-host option to configure host side nameserver
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to.  This is
controlled by the c->ip[46].dns_host variables.  This is set to the first
first nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.

Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.

Add a new --dns-host option to restore this functionality in a more
sensible way.

Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
David Gibson
9d66df9a9a conf: Add command line switch to enable IP_FREEBIND socket option
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future.  That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place.  We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.

Add a --freebind command line argument to enable this socket option on
all listening sockets.
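
For reference, the per-socket option this enables boils down to
(a sketch; IPV6_FREEBIND comes from linux/in6.h):

  int on = 1;

  if (af == AF_INET)
  	setsockopt(s, IPPROTO_IP, IP_FREEBIND, &on, sizeof(on));
  else
  	setsockopt(s, IPPROTO_IPV6, IPV6_FREEBIND, &on, sizeof(on));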

Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
Laurent Vivier
151dbe0d3d udp: Update UDP checksum using an iovec array
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:13 +02:00
Laurent Vivier
3d484aa370 tcp: Update TCP checksum using an iovec array
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.

In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We provide also an offset to be able to provide an array that contains
the TCP frame embedded in an lower level frame, and this offset points
to the TCP header inside the iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:10 +02:00
Laurent Vivier
e6548c6437 checksum: Add an offset argument in csum_iov()
The offset allows any headers that are not part of the data
to checksum to be skipped.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:08 +02:00
Laurent Vivier
fd8334b25d pcap: Add an offset argument in pcap_iov()
The offset is passed directly to pcap_frame() and allows
any headers that are not part of the frame to
capture to be skipped.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:51:02 +02:00
Laurent Vivier
72e7d3024b tcp: Use tcp_payload_t rather than tcphdr
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).

Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:50:46 +02:00
Stefano Brivio
def8acdcd8 test: Kernel binary can now be passed via the KERNEL environment variable
This is quite useful at least for myself as I'm usually running tests
using a guest kernel that's not the same as the one on the host.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-02 14:50:34 +02:00
David Gibson
b55013b1a7 inany: Add inany_pton() helper
We already have an inany_ntop() function to format inany addresses into
text.  Add inany_pton() to parse them from text, and use it in
conf_ports().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:17 +02:00
David Gibson
cbde4192ee tcp, udp: Make {tcp,udp}_sock_init() take an inany address
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair.  Use an inany instead.  Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:16 +02:00
David Gibson
b8d4fac6a2 util, pif: Replace sock_l4() with pif_sock_l4()
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.

Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress.  This case doesn't yet arise, but
it might cause us trouble in future.

Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic.  This only makes sense if there is no specific
address given.  We verify this at runtime, but it would be nicer if we
could enforce it structurally.

For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems.  Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.

The new function takes the address as an inany *, with NULL indicating the
dual stack case.  This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
David Gibson
204e77cd11 udp: Don't attempt to get dual-stack sockets in nonsensical cases
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible.   However udp_sock_init()
attempts to do this in some cases that can't work.  Specifically we can
only do this when listening on any address.  That's never true for the
ns (splicing) case, because we always listen on loopback.  For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.

This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path.  But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
Laurent Vivier
8f8c4d27eb tcp: Allow checksum to be disabled
In some cases we don't need to set the TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:28 +02:00
Laurent Vivier
4fe5f4e813 udp: Allow checksum to be disabled
In some cases we don't need to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:20 +02:00
David Gibson
d836d9e345 util: Remove possible quadratic behaviour from write_remainder()
write_remainder() steps through the buffers in an IO vector writing out
everything past a certain byte offset.  However, on each iteration it
rescans the buffer from the beginning to find out where we're up to.  With
an unfortunate set of write sizes this could lead to quadratic behaviour.

In an even less likely set of circumstances (total vector length > maximum
size_t) the 'skip' variable could overflow.  This is one factor in a
longstanding Coverity error we've seen (although I still can't figure out
the remainder of its complaint).

Rework write_remainder() to always work out our new position in the vector
relative to our old/current position, rather than starting from the
beginning each time.  As a bonus this seems to fix the Coverity error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:03 +02:00
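
A simplified sketch of the approach described above (this is not passt's write_remainder(), and unlike it this version modifies the caller's iovec array): advance from the current position after each partial write instead of rescanning the vector from the start.

#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

static int writev_all(int fd, struct iovec *iov, size_t cnt)
{
	size_t i = 0;

	while (i < cnt) {
		ssize_t n = writev(fd, iov + i, (int)(cnt - i));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Step forward relative to the current position only */
		while (i < cnt && (size_t)n >= iov[i].iov_len) {
			n -= (ssize_t)iov[i].iov_len;
			i++;
		}
		if (i < cnt) {
			iov[i].iov_base = (char *)iov[i].iov_base + n;
			iov[i].iov_len -= (size_t)n;
		}
	}

	return 0;
}
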
David Gibson
bfc294b90d util: Add helper to write() all of a buffer
write(2) might not write all the data it is given.  Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.

Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:59 +02:00
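
A minimal sketch of the single-buffer helper the commit above describes (illustrative, not the actual write_all_buf()): keep writing until the whole buffer is out, retrying on EINTR.

#include <errno.h>
#include <unistd.h>

static int write_all(int fd, const char *buf, size_t len)
{
	while (len) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;		/* genuine error */
		}
		buf += n;
		len -= (size_t)n;
	}

	return 0;
}
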
David Gibson
bb41901c71 tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:55 +02:00
David Gibson
265b2099c7 tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
This function has a block conditional on !snd_wnd_cap shortly before an
#ifdef block in which snd_wnd_cap is statically false.

Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:50 +02:00
David Gibson
4aff6f9392 tcp: Clean up tcpi_snd_wnd probing
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is a perfectly possible window for the peer to report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retrieved.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:47 +02:00
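
A sketch of the probing idea described above, assuming the build headers declare tcpi_snd_wnd (which is what the HAS_SND_WND check is about); this is not the actual passt probe. getsockopt() reports how many bytes of struct tcp_info the kernel filled in, so the field's availability can be derived from that length once, on any connected TCP socket.

#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/tcp.h>

static bool probe_tcpi_snd_wnd(int s)
{
	struct tcp_info ti;
	socklen_t sl = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return false;

	/* Available iff the kernel filled the structure at least up to
	 * and including tcpi_snd_wnd */
	return sl >= offsetof(struct tcp_info, tcpi_snd_wnd) +
		     sizeof(ti.tcpi_snd_wnd);
}
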
David Gibson
7d8804beb8 tcp: Make some extra functions private
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header.  Make them static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:33 +02:00
David Gibson
5ff5d55291 tcp: Avoid overlapping memcpy() in DUP_ACK handling
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame.  However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy().  This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:59 +02:00
David Gibson
1f414ed8f0 tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:46 +02:00
Stefano Brivio
6b38f07239 apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
...for both passt and pasta: use passt's abstraction for this.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
Stefano Brivio
116bc8266d selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
Since commit eedc81b6ef ("fwd, conf: Probe host's ephemeral ports"),
we might need to read from /proc/sys/net/ipv4/ip_local_port_range in
both passt and pasta.

While pasta was already allowed to open and write /proc/sys/net
entries, read access was missing in SELinux's type enforcement: add
that.

In passt, instead, this is the first time we need to access an entry
there: add everything we need.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
David Gibson
a33ecafbd9 tap: Don't risk truncating frames on full buffer in tap_pasta_input()
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full.  However, this has an ugly edge case: when we get close
to buffer full, we will provide just the remaining space as a read()
buffer.  If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:46 +02:00
David Gibson
d2a1dc744b tap: Restructure in tap_pasta_input()
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around.  This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass.  This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.

Along the way handle a couple of extra edge cases:
 - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
   ports where those might not have the same value
 - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:43 +02:00
David Gibson
11e29054fe tap: Improve handling of EINTR in tap_passt_input()
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
three cases it returns from the function.  That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again.  For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewed EPOLLIN event immediately in
that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:41 +02:00
David Gibson
49fc4e0414 tap: Split out handling of EPOLLIN events
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:37 +02:00
Stefano Brivio
63513e54f3 util: Fix order of operands and carry of one second in timespec_diff_us()
If the nanoseconds of the minuend timestamp are less than the
nanoseconds of the subtrahend timestamp, we need to carry one second
in the subtraction.

I subtracted this second from the minuend, but didn't actually carry
it in the subtraction of nanoseconds, and logged timestamps would jump
back whenever we switched to the first branch of timespec_diff_us()
from the second one.

Most likely, the reason why I didn't carry the second is that I
instinctively thought that swapping the operands would have the same
effect. But it doesn't, in general: that only happens with arithmetic
modulo powers of 2. Undo the swap as well.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 13:01:34 +02:00
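
The corrected arithmetic, as a standalone sketch (passt's timespec_diff_us() may differ in details): when the minuend's nanoseconds are smaller, borrow one second and add it back into the nanosecond difference.

#include <stdint.h>
#include <time.h>

static int64_t ts_diff_us(const struct timespec *a, const struct timespec *b)
{
	int64_t secs  = (int64_t)a->tv_sec  - b->tv_sec;
	int64_t nsecs = (int64_t)a->tv_nsec - b->tv_nsec;

	if (nsecs < 0) {
		secs--;				/* borrow one second... */
		nsecs += 1000000000LL;		/* ...into nanoseconds */
	}

	return secs * 1000000LL + nsecs / 1000;
}
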
David Gibson
748ef4cd6e cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.

This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.

That said, in the two places this shows up, it's pretty easy to work
around.  The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.

So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:20 +02:00
Stefano Brivio
afedc2412e tcp: Use EPOLLET for any state of not established connections
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).

While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.

On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.

At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.

Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.

In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the conection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.

Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:16 +02:00
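
For illustration, the event mask change boils down to something like this (generic epoll code, not the passt plumbing): when EPOLLRDHUP is the only event of interest, make it edge-triggered so it is reported once rather than repeatedly.

#include <stdint.h>
#include <sys/epoll.h>

static int watch_rdhup_only(int epollfd, int sock, uint64_t ref)
{
	struct epoll_event ev = {
		.events = EPOLLRDHUP | EPOLLET,	/* edge-triggered */
		.data.u64 = ref,
	};

	return epoll_ctl(epollfd, EPOLL_CTL_MOD, sock, &ev);
}
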
David Gibson
aff5a49b0e udp: Handle more error conditions in udp_sock_errs()
udp_sock_errs() reads out everything in the socket error queue.  However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.

One possibility is that the error is reported instead by the SO_ERROR
sockopt.  Check for that case and report it as best we can.  If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.

[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010

Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:38 +02:00
David Gibson
bd99f02a64 udp: Treat errors getting errors as unrecoverable
We can get network errors, usually transient, reported via the socket error
queue.  However, at least theoretically, we could get errors trying to
read the queue itself.  Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:35 +02:00
David Gibson
bd092ca421 udp: Split socket error handling out from udp_sock_recv()
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding.  That made sense initially, since
both listening and reply sockets need to do this.  However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.

So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors".  We don't
have any of these so far, but some cases we're working on might require it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:33 +02:00
David Gibson
88bfa3801e flow: Helpers to log details of a flow
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging.  We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions.  Add some
helper functions and macros to make it easy to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:30 +02:00
David Gibson
1166401c2f udp: Allow UDP flows to be prematurely closed
Unlike TCP, UDP has no in-band signalling for the end of a flow.  So the
only way we remove flows is on a timer if they have no activity for 180s.

However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow.  We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash).  However it will still wait 3 minutes
to clear away the stale entry.

Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up.  We also publish udp_flow_close() so it
can be called from other places to abort UDP flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:24 +02:00
David Gibson
7ad9f9bd2b flow: Fix incorrect hash probe in flowside_lookup()
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions.  Usually
that's implemented by flow_hash_probe().  However, due to some details we
need a second implementation in flowside_lookup().  An embarrassing
oversight in rebasing from earlier versions has mean that version is
incorrect, trying to step forward through clusters rather than backward.

In situations with the right sorts of hash near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state.  If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.

Fixes: acca4235c4 ("flow, tcp: Generalise TCP hash table to general flow hash table")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:52:31 +02:00
Stefano Brivio
0ea60e5a77 log: Don't prefix log file messages with time and severity if they're continuations
In fecb1b65b1 ("log: Don't prefix message with timestamp on --debug
if it's a continuation"), I fixed this for --debug on standard error,
but not for log files: if messages are continuations, they shouldn't
be prefixed by timestamp and severity.

Otherwise, we'll print stuff like this:

  0.0028: ERROR:   Receive error on guest connection, reset0.0028:  ERROR:   : Bad file descriptor

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 12:52:21 +02:00
Michal Privoznik
38363964fc Makefile: Enable _FORTIFY_SOURCE iff needed
On some systems source fortification is enabled whenever code
optimization is enabled (e.g. with -O2). Since code fortification
is explicitly enabled too (with a possibly different value than the
system wants; there are three levels [1]), distros are required
to patch our Makefile, e.g. [2].

Detect whether fortification is not already enabled and enable it
explicitly only if really needed.

1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html
2: edfeb8763a

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:21 +02:00
David Gibson
eedc81b6ef fwd, conf: Probe host's ephemeral ports
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouldn't leave local
ports to use for outgoing connections.  Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.

Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535.  However, that's not the range used by default on
Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range
sysctl.

We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway.  So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.

Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails.  This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage.  Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:08 +02:00
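
A rough sketch of the probing described above (not the actual fwd_probe_ephemeral()): parse /proc/sys/net/ipv4/ip_local_port_range and fall back to the RFC 6335 range if that fails.

#include <stdio.h>

static void ephemeral_range(unsigned *lo, unsigned *hi)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");

	*lo = 49152;			/* RFC 6335 fallback */
	*hi = 65535;

	if (!f)
		return;

	if (fscanf(f, "%u %u", lo, hi) != 2) {
		*lo = 49152;		/* parse failure: keep fallback */
		*hi = 65535;
	}

	fclose(f);
}
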
David Gibson
4a41dc58d6 conf, fwd: Don't attempt to forward port 0
When using -t all, -u all or exclude-only ranges, we'll attempt to forward
all non-ephemeral port numbers, including port 0.  However, this won't work
as intended: bind() treats a zero port not as literal port 0, but as
"pick a port for me".  Because of the special meaning of port 0, we mostly
outright exclude it in our handling.

Do the same for setting up forwards, not attempting to forward for port 0.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:05 +02:00
David Gibson
1daf6f4615 conf, fwd: Make ephemeral port logic more flexible
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams.  Because of that,
they're generally not good choices for listening servers to bind to.

Therefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports.  Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space.  We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.

Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).

The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start off, I don't think it really
matters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:25:51 +02:00
Stefano Brivio
712ca32353 seccomp.sh: Try to account for terminal width while formatting list of system calls
Avoid excess lines on wide terminals, but make sure we don't fail if
we can't fetch the number of columns for any reason, as it's not a
fundamental feature and we don't want to break anything with it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 14:30:17 +02:00
David Gibson
e0be6bc2f4 udp: Use dual stack sockets for port forwarding when possible
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections.  By doing this we halve the number of
listening sockets we need (assuming passt/pasta is listening on the same
ports for IPv4 and IPv6).  When forwarding many ports (e.g. -u all) this
can significantly reduce the amount of kernel memory that passt consumes.

We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual
stack sockets for port forwarding when possible".  Add similar support for
UDP "listening" sockets.  Since UDP sockets don't use as much kernel memory
as TCP sockets this isn't as big a saving, but it's still significant.
When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all),
this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version
6.10.6 on Fedora 40, x86_64).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:41 +02:00
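
For reference, a "dual stack" UDP socket of the kind described above can be sketched like this (generic sockets code with error handling trimmed; not the passt implementation): an IPv6 socket bound to :: with IPV6_V6ONLY cleared also receives IPv4 datagrams, as IPv4-mapped addresses.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int udp_dual_stack(in_port_t port)
{
	struct sockaddr_in6 a = { .sin6_family = AF_INET6,
				  .sin6_port = htons(port) };
	int s = socket(AF_INET6, SOCK_DGRAM, 0), no = 0;

	if (s < 0)
		return -1;

	/* Accept both IPv6 and (mapped) IPv4 traffic on this socket */
	setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &no, sizeof(no));

	if (bind(s, (struct sockaddr *)&a, sizeof(a)) < 0) {
		close(s);
		return -1;
	}

	return s;
}
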
David Gibson
c78b194001 udp: Remove unnecessary local from udp_sock_init()
The 's' variable is always redundant with either 'r4' or 'r6', so remove
it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:38 +02:00
David Gibson
620e19a1b4 udp: Merge udp[46]_mh_recv arrays
We've already gotten rid of most of the IPv4/IPv6 specific data structures
in udp.c by merging them with each other.  One significant one remains:
udp[46]_mh_recv.  This was a bit awkward to remove because of a subtle
interaction.  We initialise the msg_namelen fields to represent the total
size we have for a socket address, but when we receive into the arrays
those are modified to the actual length of the sockaddr we received.

That meant that if we naively merged the arrays and received IPv4
datagrams, then IPv6 datagrams, the addresses for the latter would be
truncated.  In this patch, address that by resetting the received
msg_namelen as soon as we've found a flow for the datagram.  Finding the
flow is the only thing that might use the actual sockaddr length, although
we in fact don't need it for the time being.

This also removes the last use of the 'v6' field from udp_listen_epoll_ref,
so remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:25 +02:00
Stefano Brivio
418feb37ec test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile
Some distributions already have OpenSSH 9.8, which introduces split
sshd/sshd-session binaries, and there we need to copy the binary from
the host, which can be /usr/libexec/openssh/sshd-session (Fedora
Rawhide), /usr/lib/ssh/sshd-session (Arch Linux),
/usr/lib/openssh/sshd-session (Debian), and possibly other paths.

Add at least those three, and, if we don't find sshd-session, assume
we don't need it: it could very well be an older version of OpenSSH,
as reported by David for Fedora 40, or perhaps another daemon (would
Dropbear even work? I'm not sure).

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: d6817b3930 ("test/passt.mbuto: Install sshd-session OpenSSH's split process")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-27 09:03:47 +02:00
139 changed files with 13293 additions and 2592 deletions

126
.clang-format Normal file

@ -0,0 +1,126 @@
# SPDX-License-Identifier: GPL-2.0
#
# clang-format configuration file. Intended for clang-format >= 11.
#
# For more information, see:
#
# Documentation/dev-tools/clang-format.rst
# https://clang.llvm.org/docs/ClangFormat.html
# https://clang.llvm.org/docs/ClangFormatStyleOptions.html
#
---
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
AlignEscapedNewlines: Left
AlignOperands: true
AlignTrailingComments: false
AllowAllParametersOfDeclarationOnNextLine: false
AllowShortBlocksOnASingleLine: false
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: None
AllowShortIfStatementsOnASingleLine: false
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: false
BinPackArguments: true
BinPackParameters: true
BraceWrapping:
AfterClass: false
AfterControlStatement: false
AfterEnum: false
AfterFunction: true
AfterNamespace: true
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
AfterExternBlock: false
BeforeCatch: false
BeforeElse: false
IndentBraces: false
SplitEmptyFunction: true
SplitEmptyRecord: true
SplitEmptyNamespace: true
BreakBeforeBinaryOperators: None
BreakBeforeBraces: Custom
BreakBeforeInheritanceComma: false
BreakBeforeTernaryOperators: false
BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeComma
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: false
ColumnLimit: 80
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: false
ConstructorInitializerIndentWidth: 8
ContinuationIndentWidth: 8
Cpp11BracedListStyle: false
DerivePointerAlignment: false
DisableFormat: false
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: false
# Taken from:
# git grep -h '^#define [^[:space:]]*for_each[^[:space:]]*(' include/ tools/ \
# | sed "s,^#define \([^[:space:]]*for_each[^[:space:]]*\)(.*$, - '\1'," \
# | LC_ALL=C sort -u
ForEachMacros:
- 'for_each_nst'
IncludeBlocks: Preserve
IncludeCategories:
- Regex: '.*'
Priority: 1
IncludeIsMainRegex: '(Test)?$'
IndentCaseLabels: false
IndentGotoLabels: false
IndentPPDirectives: None
IndentWidth: 8
IndentWrappedFunctionNames: false
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Auto
ObjCBlockIndentWidth: 8
ObjCSpaceAfterProperty: true
ObjCSpaceBeforeProtocolList: true
# Taken from git's rules
PenaltyBreakAssignment: 10
PenaltyBreakBeforeFirstCallParameter: 30
PenaltyBreakComment: 10
PenaltyBreakFirstLessLess: 0
PenaltyBreakString: 10
PenaltyExcessCharacter: 100
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Right
ReflowComments: false
SortIncludes: false
SortUsingDeclarations: false
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatementsExceptForEachMacros
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: false
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp03
TabWidth: 8
UseTab: Always
...

93
.clang-tidy Normal file

@ -0,0 +1,93 @@
---
Checks:
- "clang-diagnostic-*,clang-analyzer-*,*,-modernize-*"
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
- "-clang-analyzer-valist.Uninitialized"
# Dubious value, would kill readability
- "-cppcoreguidelines-init-variables"
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
- "-bugprone-assignment-in-if-condition"
# Debatable whether these improve readability, right now it would look
# like a mess
- "-google-readability-braces-around-statements"
- "-hicpp-braces-around-statements"
- "-readability-braces-around-statements"
# TODO: in most cases they are justified, but probably not everywhere
#
- "-readability-magic-numbers"
- "-cppcoreguidelines-avoid-magic-numbers"
# TODO: this is Linux-only for the moment, nice to fix eventually
- "-llvmlibc-restrict-system-libc-headers"
# Those are needed for syscalls, epoll_wait flags, etc.
- "-hicpp-signed-bitwise"
# Probably not doable to impement this without plain memcpy(), memset()
- "-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling"
# TODO: not really important, but nice to fix eventually
- "-llvm-include-order"
# Dubious value, would kill readability
- "-readability-isolate-declaration"
# TODO: nice to fix eventually
- "-bugprone-narrowing-conversions"
- "-cppcoreguidelines-narrowing-conversions"
# TODO: check, fix, and more in general constify wherever possible
- "-cppcoreguidelines-avoid-non-const-global-variables"
# TODO: check paths where it might make sense to improve performance
- "-altera-unroll-loops"
- "-altera-id-dependent-backward-branch"
# Not much can be done about them other than being careful
- "-bugprone-easily-swappable-parameters"
# TODO: split reported functions
- "-readability-function-cognitive-complexity"
# "Poor" alignment needed for structs reflecting message formats/headers
- "-altera-struct-pack-align"
# TODO: check again if multithreading is implemented
- "-concurrency-mt-unsafe"
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
- "-readability-identifier-length"
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
- "-misc-include-cleaner"
# Want to replace all #defines of integers with enums. Kind of
# makes sense when those defines form an enum-like set, but
# weird for cases like standalone constants, and causes other
# awkwardness for a bunch of cases we use
- "-cppcoreguidelines-macro-to-enum"
# It's been a couple of centuries since multiplication has been granted
# precedence over addition in modern mathematical notation. Adding
# parentheses to reinforce that certainly won't improve readability.
- "-readability-math-missing-parentheses"
WarningsAsErrors: "*"
HeaderFileExtensions:
- h
ImplementationFileExtensions:
- c
HeaderFilterRegex: ""
FormatStyle: none
CheckOptions:
bugprone-suspicious-string-compare.WarnOnImplicitComparison: "false"
SystemHeaders: false

3
.clangd Normal file

@ -0,0 +1,3 @@
CompileFlags:
# Don't try to interpret our headers as C++'
Add: [-xc, -Wall]

2
.gitignore vendored

@ -3,8 +3,10 @@
/passt.avx2
/pasta
/pasta.avx2
/passt-repair
/qrap
/pasta.1
/seccomp.h
/seccomp_repair.h
/c*.json
README.plain.md

217
Makefile

@ -15,66 +15,47 @@ VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ versio
# the IPv6 socket API? (Linux does)
DUAL_STACK_SOCKETS := 1
RLIMIT_STACK_VAL := $(shell /bin/sh -c 'ulimit -s')
ifeq ($(RLIMIT_STACK_VAL),unlimited)
RLIMIT_STACK_VAL := 1024
TARGET ?= $(shell $(CC) -dumpmachine)
$(if $(TARGET),,$(error Failed to get target architecture))
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(firstword $(subst -, ,$(TARGET)))
TARGET_ARCH := $(patsubst [:upper:],[:lower:],$(TARGET_ARCH))
TARGET_ARCH := $(patsubst arm%,arm,$(TARGET_ARCH))
TARGET_ARCH := $(subst powerpc,ppc,$(TARGET_ARCH))
# On some systems enabling optimization also enables source fortification,
# automagically. Do not override it.
FORTIFY_FLAG :=
ifeq ($(shell $(CC) -O2 -dM -E - < /dev/null 2>&1 | grep ' _FORTIFY_SOURCE ' > /dev/null; echo $$?),1)
FORTIFY_FLAG := -D_FORTIFY_SOURCE=2
endif
TARGET ?= $(shell $(CC) -dumpmachine)
# Get 'uname -m'-like architecture description for target
TARGET_ARCH := $(shell echo $(TARGET) | cut -f1 -d- | tr [A-Z] [a-z])
TARGET_ARCH := $(shell echo $(TARGET_ARCH) | sed 's/powerpc/ppc/')
AUDIT_ARCH := $(shell echo $(TARGET_ARCH) | tr [a-z] [A-Z] | sed 's/^ARM.*/ARM/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/I[456]86/I386/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPC64/PPC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/PPCLE/PPC64LE/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/MIPS64EL/MIPSEL64/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/HPPA/PARISC/')
AUDIT_ARCH := $(shell echo $(AUDIT_ARCH) | sed 's/SH4/SH/')
FLAGS := -Wall -Wextra -Wno-format-zero-length
FLAGS := -Wall -Wextra -Wno-format-zero-length -Wformat-security
FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
FLAGS += -D_FORTIFY_SOURCE=2 -O2 -pie -fPIE
FLAGS += $(FORTIFY_FLAG) -O2 -pie -fPIE
FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
FLAGS += -DNETNS_RUN_DIR=\"/run/netns\"
FLAGS += -DPASST_AUDIT_ARCH=AUDIT_ARCH_$(AUDIT_ARCH)
FLAGS += -DRLIMIT_STACK_VAL=$(RLIMIT_STACK_VAL)
FLAGS += -DARCH=\"$(TARGET_ARCH)\"
FLAGS += -DVERSION=\"$(VERSION)\"
FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
udp_vu.c util.c vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
PASST_REPAIR_SRCS = passt-repair.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
udp.h udp_flow.h util.h
lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
udp_vu.h util.h vhost_user.h virtio.h vu_common.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_SND_WND
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_bytes_acked = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_BYTES_ACKED
endif
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_min_rtt = 0 };
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_MIN_RTT
endif
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
FLAGS += -DHAS_GETRANDOM
@ -84,11 +65,6 @@ ifeq ($(shell :|$(CC) -fstack-protector-strong -S -xc - -o - >/dev/null 2>&1; ec
FLAGS += -fstack-protector-strong
endif
C := \#define _GNU_SOURCE\n\#include <fcntl.h>\nint x = FALLOC_FL_COLLAPSE_RANGE;
ifeq ($(shell printf "$(C)" | $(CC) -S -xc - -o - >/dev/null 2>&1; echo $$?),0)
EXTRA_SYSCALLS += fallocate
endif
prefix ?= /usr/local
exec_prefix ?= $(prefix)
bindir ?= $(exec_prefix)/bin
@ -98,9 +74,9 @@ mandir ?= $(datarootdir)/man
man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64)
BIN := passt passt.avx2 pasta pasta.avx2 qrap
BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
else
BIN := passt pasta qrap
BIN := passt pasta qrap passt-repair
endif
all: $(BIN) $(MANPAGES) docs
@ -109,7 +85,10 @@ static: FLAGS += -static -DGLIBC_NO_STATIC_NSS
static: clean all
seccomp.h: seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp.h $(PASST_SRCS) $(PASST_HEADERS)
seccomp_repair.h: seccomp.sh $(PASST_REPAIR_SRCS)
@ ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp_repair.h $(PASST_REPAIR_SRCS)
passt: $(PASST_SRCS) $(HEADERS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_SRCS) -o passt $(LDFLAGS)
@ -125,17 +104,21 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
ln -sf $< $@
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(QRAP_SRCS) -o qrap $(LDFLAGS)
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
passt-repair: $(PASST_REPAIR_SRCS) seccomp_repair.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex
rt_sigreturn getpid gettid kill clock_gettime \
mmap|mmap2 munmap open unlink gettimeofday futex \
statx readlink
valgrind: FLAGS += -g -DVALGRIND
valgrind: all
.PHONY: clean
clean:
$(RM) $(BIN) *~ *.o seccomp.h pasta.1 \
$(RM) $(BIN) *~ *.o seccomp.h seccomp_repair.h pasta.1 \
passt.tar passt.tar.gz *.deb *.rpm \
passt.pid README.plain.md
@ -189,116 +172,11 @@ docs: README.md
done < README.md; \
) > README.plain.md
# Checkers currently disabled for clang-tidy:
# - llvmlibc-restrict-system-libc-headers
# TODO: this is Linux-only for the moment, nice to fix eventually
#
# - google-readability-braces-around-statements
# - hicpp-braces-around-statements
# - readability-braces-around-statements
# Debatable whether that improves readability, right now it would look
# like a mess
#
# - readability-magic-numbers
# - cppcoreguidelines-avoid-magic-numbers
# TODO: in most cases they are justified, but probably not everywhere
#
# - clang-analyzer-valist.Uninitialized
# TODO: enable once https://bugs.llvm.org/show_bug.cgi?id=41311 is fixed
#
# - clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling
# Probably not doable to impement this without plain memcpy(), memset()
#
# - cppcoreguidelines-init-variables
# Dubious value, would kill readability
#
# - hicpp-signed-bitwise
# Those are needed for syscalls, epoll_wait flags, etc.
#
# - llvm-include-order
# TODO: not really important, but nice to fix eventually
#
# - readability-isolate-declaration
# Dubious value, would kill readability
#
# - bugprone-narrowing-conversions
# - cppcoreguidelines-narrowing-conversions
# TODO: nice to fix eventually
#
# - cppcoreguidelines-avoid-non-const-global-variables
# TODO: check, fix, and more in general constify wherever possible
#
# - altera-unroll-loops
# - altera-id-dependent-backward-branch
# TODO: check paths where it might make sense to improve performance
#
# - bugprone-easily-swappable-parameters
# Not much can be done about them other than being careful
#
# - readability-function-cognitive-complexity
# TODO: split reported functions
#
# - altera-struct-pack-align
# "Poor" alignment needed for structs reflecting message formats/headers
#
# - concurrency-mt-unsafe
# TODO: check again if multithreading is implemented
#
# - readability-identifier-length
# Complains about any identifier <3 characters, reasonable for
# globals, pointlessly verbose for locals and parameters.
#
# - bugprone-assignment-in-if-condition
# Dubious value over the compiler's built-in warning. Would
# increase verbosity.
#
# - misc-include-cleaner
# Wants to include headers which *directly* provide the things
# we use. That sounds nice, but means it will often want a OS
# specific header instead of a mostly standard one, such as
# <linux/limits.h> instead of <limits.h>.
#
# - cppcoreguidelines-macro-to-enum
# Want to replace all #defines of integers with enums. Kind of
# makes sense when those defines form an enum-like set, but
# weird for cases like standalone constants, and causes other
# awkwardness for a bunch of cases we use
clang-tidy: $(PASST_SRCS) $(HEADERS)
clang-tidy $(PASST_SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
-DCLANG_TIDY_58992
clang-tidy: $(SRCS) $(HEADERS)
clang-tidy -checks=*,-modernize-*,\
-clang-analyzer-valist.Uninitialized,\
-cppcoreguidelines-init-variables,\
-bugprone-assignment-in-if-condition,\
-google-readability-braces-around-statements,\
-hicpp-braces-around-statements,\
-readability-braces-around-statements,\
-readability-magic-numbers,\
-llvmlibc-restrict-system-libc-headers,\
-hicpp-signed-bitwise,\
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,\
-llvm-include-order,\
-cppcoreguidelines-avoid-magic-numbers,\
-readability-isolate-declaration,\
-bugprone-narrowing-conversions,\
-cppcoreguidelines-narrowing-conversions,\
-cppcoreguidelines-avoid-non-const-global-variables,\
-altera-unroll-loops,-altera-id-dependent-backward-branch,\
-bugprone-easily-swappable-parameters,\
-readability-function-cognitive-complexity,\
-altera-struct-pack-align,\
-concurrency-mt-unsafe,\
-readability-identifier-length,\
-misc-include-cleaner,\
-cppcoreguidelines-macro-to-enum \
-config='{CheckOptions: [{key: bugprone-suspicious-string-compare.WarnOnImplicitComparison, value: "false"}]}' \
--warnings-as-errors=* $(SRCS) -- $(filter-out -pie,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -DCLANG_TIDY_58992
SYSTEM_INCLUDES := /usr/include $(wildcard /usr/include/$(TARGET))
ifeq ($(shell $(CC) -v 2>&1 | grep -c "gcc version"),1)
VER := $(shell $(CC) -dumpversion)
SYSTEM_INCLUDES += /usr/lib/gcc/$(TARGET)/$(VER)/include
endif
cppcheck: $(SRCS) $(HEADERS)
cppcheck: $(PASST_SRCS) $(HEADERS)
if cppcheck --check-level=exhaustive /dev/null > /dev/null 2>&1; then \
CPPCHECK_EXHAUSTIVE="--check-level=exhaustive"; \
else \
@ -307,11 +185,8 @@ cppcheck: $(SRCS) $(HEADERS)
cppcheck --std=c11 --error-exitcode=1 --enable=all --force \
--inconclusive --library=posix --quiet \
$${CPPCHECK_EXHAUSTIVE} \
$(SYSTEM_INCLUDES:%=-I%) \
$(SYSTEM_INCLUDES:%=--config-exclude=%) \
$(SYSTEM_INCLUDES:%=--suppress=*:%/*) \
$(SYSTEM_INCLUDES:%=--suppress=unmatchedSuppression:%/*) \
--inline-suppr \
--suppress=missingIncludeSystem \
--suppress=unusedStructMember \
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) \
$(SRCS) $(HEADERS)
$(filter -D%,$(FLAGS) $(CFLAGS) $(CPPFLAGS)) -D CPPCHECK_6936 \
$(PASST_SRCS) $(HEADERS)


@ -321,7 +321,7 @@ speeding up local connections, and usually requiring NAT. _pasta_:
protocol
* ✅ 4 to 50 times IPv4 TCP throughput of existing, conceptually similar
solutions depending on MTU (UDP and IPv6 hard to compare)
* 🛠 [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
* [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
maximum one copy on every data path and lower request-response latency
* ⌚ [multithreading](https://bugs.passt.top/show_bug.cgi?id=13)
* ⌚ [raw IP socket support](https://bugs.passt.top/show_bug.cgi?id=14) if

8
arch.c

@ -19,6 +19,7 @@
#include <unistd.h>
#include "log.h"
#include "util.h"
/**
* arch_avx2_exec() - Switch to AVX2 build if supported
@ -40,8 +41,11 @@ void arch_avx2_exec(char **argv)
if (__builtin_cpu_supports("avx2")) {
char new_path[PATH_MAX + sizeof(".avx2")];
snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
execve(new_path, argv, environ);
if (snprintf_check(new_path, PATH_MAX + sizeof(".avx2"),
"%s.avx2", exe))
die_perror("Can't build AVX2 executable path");
execv(new_path, argv);
warn_perror("Can't run AVX2 build, using non-AVX2 version");
}
}

8
arp.c

@ -59,14 +59,12 @@ int arp(const struct ctx *c, const struct pool *p)
ah->ar_op != htons(ARPOP_REQUEST))
return 1;
/* Discard announcements (but not 0.0.0.0 "probes"): we might have the
* same IP address, hide that.
*/
if (memcmp(am->sip, (unsigned char[4]){ 0 }, sizeof(am->tip)) &&
/* Discard announcements, but not 0.0.0.0 "probes" */
if (memcmp(am->sip, &in4addr_any, sizeof(am->sip)) &&
!memcmp(am->sip, am->tip, sizeof(am->sip)))
return 1;
/* Don't resolve our own address, either. */
/* Don't resolve the guest's assigned address, either. */
if (!memcmp(am->tip, &c->ip4.addr, sizeof(am->tip)))
return 1;


@ -59,6 +59,7 @@
#include "util.h"
#include "ip.h"
#include "checksum.h"
#include "iov.h"
/* Checksums are optional for UDP over IPv4, so we usually just set
* them to 0. Change this to 1 to calculate real UDP over IPv4
@ -84,7 +85,7 @@
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing")))
uint32_t sum_16b(const void *buf, size_t len)
static uint32_t sum_16b(const void *buf, size_t len)
{
const uint16_t *p = buf;
uint32_t sum = 0;
@ -106,7 +107,7 @@ uint32_t sum_16b(const void *buf, size_t len)
*
* Return: 16-bit folded sum
*/
uint16_t csum_fold(uint32_t sum)
static uint16_t csum_fold(uint32_t sum)
{
while (sum >> 16)
sum = (sum & 0xffff) + (sum >> 16);
@ -160,27 +161,42 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
return psum;
}
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
static uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_udp4() - Calculate and set checksum for a UDP over IPv4 packet
* @udp4hr: UDP header, initialised apart from checksum
* @saddr: IPv4 source address
* @daddr: IPv4 destination address
* @payload: UDP packet payload
* @dlen: Length of @payload (not including UDP header)
* @data: UDP payload (as IO vector tail)
*/
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t dlen)
struct iov_tail *data)
{
/* UDP checksums are optional, so don't bother */
udp4hr->check = 0;
if (UDP4_REAL_CHECKSUMS) {
uint16_t l4len = dlen + sizeof(struct udphdr);
uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
uint32_t psum = proto_ipv4_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
psum = csum_unfolded(udp4hr, sizeof(struct udphdr), psum);
udp4hr->check = csum(payload, dlen, psum);
udp4hr->check = csum_iov_tail(data, psum);
}
}
@ -226,19 +242,22 @@ uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
/**
* csum_udp6() - Calculate and set checksum for a UDP over IPv6 packet
* @udp6hr: UDP header, initialised apart from checksum
* @payload: UDP packet payload
* @dlen: Length of @payload (not including UDP header)
* @saddr: Source address
* @daddr: Destination address
* @data: UDP payload (as IO vector tail)
*/
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen)
struct iov_tail *data)
{
uint32_t psum = proto_ipv6_header_psum(dlen + sizeof(struct udphdr),
IPPROTO_UDP, saddr, daddr);
uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
uint32_t psum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
saddr, daddr);
udp6hr->check = 0;
psum = csum_unfolded(udp6hr, sizeof(struct udphdr), psum);
udp6hr->check = csum(payload, dlen, psum);
udp6hr->check = csum_iov_tail(data, psum);
}
/**
@ -448,7 +467,8 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
intptr_t align = ROUND_UP((intptr_t)buf, sizeof(__m256i));
unsigned int pad = align - (intptr_t)buf;
if (len < pad)
/* Don't mix sum_16b() and csum_avx2() with odd padding lengths */
if (pad & 1 || len < pad)
pad = len;
if (pad)
@ -478,36 +498,23 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
#endif /* !__AVX2__ */
/**
* csum() - Compute TCP/IP-style checksum
* @buf: Input buffer
* @len: Input length
* @init: Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((optimize("-fno-strict-aliasing"))) /* See csum_16b() */
uint16_t csum(const void *buf, size_t len, uint32_t init)
{
return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
}
/**
* csum_iov() - Calculates the unfolded checksum over an array of IO vectors
*
* @iov Pointer to the array of IO vectors
* @n Length of the array
* csum_iov_tail() - Calculate unfolded checksum for the tail of an IO vector
* @tail: IO vector tail to checksum
* @init Initial 32-bit checksum, 0 for no pre-computed checksum
*
* Return: 16-bit folded, complemented checksum
*/
/* cppcheck-suppress unusedFunction */
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init)
{
unsigned int i;
for (i = 0; i < n; i++)
init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
if (iov_tail_prune(tail)) {
size_t i;
init = csum_unfolded((char *)tail->iov[0].iov_base + tail->off,
tail->iov[0].iov_len - tail->off, init);
for (i = 1; i < tail->cnt; i++) {
const struct iovec *iov = &tail->iov[i];
init = csum_unfolded(iov->iov_base, iov->iov_len, init);
}
}
return (uint16_t)~csum_fold(init);
}


@ -9,9 +9,8 @@
struct udphdr;
struct icmphdr;
struct icmp6hdr;
struct iov_tail;
uint32_t sum_16b(const void *buf, size_t len);
uint16_t csum_fold(uint32_t sum);
uint16_t csum_unaligned(const void *buf, size_t len, uint32_t init);
uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
@ -19,19 +18,18 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
struct in_addr saddr, struct in_addr daddr);
void csum_udp4(struct udphdr *udp4hr,
struct in_addr saddr, struct in_addr daddr,
const void *payload, size_t dlen);
struct iov_tail *data);
void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen);
uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
const struct in6_addr *saddr,
const struct in6_addr *daddr);
void csum_udp6(struct udphdr *udp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
struct iov_tail *data);
void csum_icmp6(struct icmp6hdr *icmp6hr,
const struct in6_addr *saddr, const struct in6_addr *daddr,
const void *payload, size_t dlen);
uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
uint16_t csum(const void *buf, size_t len, uint32_t init);
uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init);
uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init);
#endif /* CHECKSUM_H */

675
conf.c

File diff suppressed because it is too large

1
conf.h

@ -6,6 +6,7 @@
#ifndef CONF_H
#define CONF_H
enum passt_modes conf_mode(int argc, char *argv[]);
void conf(struct ctx *c, int argc, char **argv);
#endif /* CONF_H */


@ -34,6 +34,8 @@
owner @{PROC}/@{pid}/uid_map r, # conf_ugid()
@{PROC}/sys/net/ipv4/ip_local_port_range r, # fwd_probe_ephemeral()
network netlink raw, # nl_sock_init_do(), netlink.c
network inet stream, # tcp.c


@ -27,4 +27,25 @@ profile passt /usr/bin/passt{,.avx2} {
owner @{HOME}/** w, # pcap(), pidfile_open(),
# pidfile_write()
# Workaround: libvirt's profile comes with a passt subprofile which includes,
# in turn, <abstractions/passt>, and adds libvirt-specific rules on top, to
# allow passt (when started by libvirtd) to write socket and PID files in the
# location requested by libvirtd itself, and to execute passt itself.
#
# However, when libvirt runs as unprivileged user, the mechanism based on
# virt-aa-helper, designed to build per-VM profiles as guests are started,
# doesn't work. The helper needs to create and load profiles on the fly, which
# can't be done by unprivileged users, of course.
#
# As a result, libvirtd runs unconfined if guests are started by unprivileged
# users, starting passt unconfined as well, which means that passt runs under
# its own stand-alone profile (this one), which implies in turn that execve()
# of /usr/bin/passt is not allowed, and socket and PID files can't be written.
#
# Duplicate libvirt-specific rules here as long as this is not solved in
# libvirt's profile itself.
/usr/bin/passt r,
owner @{run}/user/[0-9]*/libvirt/qemu/run/passt/* rw,
owner @{run}/libvirt/qemu/passt/* rw,
}


@ -0,0 +1,29 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# contrib/apparmor/usr.bin.passt-repair - AppArmor profile for passt-repair(1)
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
abi <abi/3.0>,
#include <tunables/global>
profile passt-repair /usr/bin/passt-repair {
#include <abstractions/base>
/** rw, # passt's ".repair" socket might be anywhere
unix (connect, receive, send) type=stream,
capability dac_override, # connect to passt's socket as root
capability net_admin, # currently needed for TCP_REPAIR socket option
capability net_raw, # what TCP_REPAIR should require instead
network unix stream, # connect and use UNIX domain socket
network inet stream, # use TCP sockets
}


@ -44,7 +44,7 @@ Requires(preun): %{name}
Requires(preun): policycoreutils
%description selinux
This package adds SELinux enforcement to passt(1) and pasta(1).
This package adds SELinux enforcement to passt(1), pasta(1), passt-repair(1).
%prep
%setup -q -n passt-%{git_hash}
@ -82,6 +82,7 @@ make -f %{_datadir}/selinux/devel/Makefile
install -p -m 644 -D passt.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
install -p -m 644 -D passt.if %{buildroot}%{_datadir}/selinux/devel/include/distributed/passt.if
install -p -m 644 -D pasta.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
install -p -m 644 -D passt-repair.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
popd
%pre selinux
@ -90,11 +91,13 @@ popd
%post selinux
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%postun selinux
if [ $1 -eq 0 ]; then
%selinux_modules_uninstall -s %{selinuxtype} passt
%selinux_modules_uninstall -s %{selinuxtype} pasta
%selinux_modules_uninstall -s %{selinuxtype} passt-repair
fi
%posttrans selinux
@ -108,9 +111,11 @@ fi
%{_bindir}/passt
%{_bindir}/pasta
%{_bindir}/qrap
%{_bindir}/passt-repair
%{_mandir}/man1/passt.1*
%{_mandir}/man1/pasta.1*
%{_mandir}/man1/qrap.1*
%{_mandir}/man1/passt-repair.1*
%ifarch x86_64
%{_bindir}/passt.avx2
%{_mandir}/man1/passt.avx2.1*
@ -122,6 +127,7 @@ fi
%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
%{_datadir}/selinux/devel/include/distributed/passt.if
%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
%changelog
{{{ passt_git_changelog }}}


@ -0,0 +1,11 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.fc - SELinux: File Context for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
/usr/bin/passt-repair system_u:object_r:passt_repair_exec_t:s0


@ -0,0 +1,87 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# contrib/selinux/passt-repair.te - SELinux: Type Enforcement for passt-repair
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
policy_module(passt-repair, 0.1)
require {
type unconfined_t;
type passt_t;
role unconfined_r;
class process transition;
class file { read execute execute_no_trans entrypoint open map };
class capability { dac_override net_admin net_raw };
class chr_file { append open getattr read write ioctl };
class unix_stream_socket { create connect sendto };
class sock_file { read write };
class tcp_socket { read setopt write };
type console_device_t;
type user_devpts_t;
type user_tmp_t;
# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
type qemu_var_run_t;
type virt_var_run_t;
}
type passt_repair_t;
domain_type(passt_repair_t);
type passt_repair_exec_t;
corecmd_executable_file(passt_repair_exec_t);
role unconfined_r types passt_repair_t;
allow passt_repair_t passt_repair_exec_t:file { read execute execute_no_trans entrypoint open map };
type_transition unconfined_t passt_repair_exec_t:process passt_repair_t;
allow unconfined_t passt_repair_t:process transition;
allow passt_repair_t self:capability { dac_override dac_read_search net_admin net_raw };
allow passt_repair_t self:capability2 bpf;
allow passt_repair_t console_device_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t user_devpts_t:chr_file { append open getattr read write ioctl };
allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
allow passt_repair_t user_tmp_t:dir { getattr read search watch };
allow passt_repair_t unconfined_t:sock_file { getattr read write };
allow passt_repair_t passt_t:sock_file { getattr read write };
allow passt_repair_t user_tmp_t:sock_file { getattr read write };
allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
allow passt_repair_t passt_t:tcp_socket { read setopt write };
# Workaround: passt-repair needs to access socket files
# that passt, started by libvirt, might create under different
# labels, depending on whether passt is started as root or not.
#
# However, libvirt doesn't maintain its own policy, which makes
# updates particularly complicated. To avoid breakage in the short
# term, deal with that in passt's own policy.
allow passt_repair_t qemu_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t virt_var_run_t:unix_stream_socket { connectto read write };
allow passt_repair_t qemu_var_run_t:dir { getattr read search watch };
allow passt_repair_t virt_var_run_t:dir { getattr read search watch };
allow passt_repair_t qemu_var_run_t:sock_file { getattr read write };
allow passt_repair_t virt_var_run_t:sock_file { getattr read write };


@ -20,9 +20,19 @@ require {
type fs_t;
type tmp_t;
type user_tmp_t;
type user_home_t;
type tmpfs_t;
type root_t;
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
type svirt_image_t;
type svirt_tmpfs_t;
type svirt_t;
type null_device_t;
class file { ioctl getattr setattr create read write unlink open relabelto execute execute_no_trans map };
class dir { search write add_name remove_name mounton };
class chr_file { append read write open getattr ioctl };
@ -38,8 +48,8 @@ require {
type net_conf_t;
type proc_net_t;
type node_t;
class tcp_socket { create accept listen name_bind name_connect };
class udp_socket { create accept listen };
class tcp_socket { create accept listen name_bind name_connect getattr ioctl };
class udp_socket { create accept listen getattr };
class icmp_socket { bind create name_bind node_bind setopt read write };
class sock_file { create unlink write };
@ -47,9 +57,8 @@ require {
type port_t;
type http_port_t;
type passwd_file_t;
class netlink_route_socket { bind create nlmsg_read };
type sysctl_net_t;
class capability { sys_tty_config setuid setgid };
class cap_userns { setpcap sys_admin sys_ptrace };
@ -81,6 +90,9 @@ allow passt_t root_t:dir mounton;
allow passt_t tmp_t:dir { add_name mounton remove_name write };
allow passt_t tmpfs_t:filesystem mount;
allow passt_t fs_t:filesystem unmount;
allow passt_t user_home_t:dir search;
allow passt_t user_tmp_t:fifo_file append;
allow passt_t user_tmp_t:file map;
manage_files_pattern(passt_t, user_tmp_t, user_tmp_t)
files_pid_filetrans(passt_t, user_tmp_t, file)
@ -95,8 +107,7 @@ allow passt_t self:capability { sys_tty_config setpcap net_bind_service setuid s
allow passt_t self:cap_userns { setpcap sys_admin sys_ptrace };
allow passt_t self:user_namespace create;
allow passt_t passwd_file_t:file read_file_perms;
sssd_search_lib(passt_t)
auth_read_passwd(passt_t)
allow passt_t proc_net_t:file read;
allow passt_t net_conf_t:file { open read };
@ -104,6 +115,8 @@ allow passt_t net_conf_t:lnk_file read;
allow passt_t tmp_t:sock_file { create unlink write };
allow passt_t self:netlink_route_socket { bind create nlmsg_read read write setopt };
kernel_search_network_sysctl(passt_t)
allow passt_t sysctl_net_t:dir search;
allow passt_t sysctl_net_t:file { open read };
corenet_tcp_bind_all_nodes(passt_t)
corenet_udp_bind_all_nodes(passt_t)
@ -119,11 +132,19 @@ corenet_udp_sendrecv_all_ports(passt_t)
allow passt_t node_t:icmp_socket { name_bind node_bind };
allow passt_t port_t:icmp_socket name_bind;
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write };
allow passt_t self:udp_socket { create getopt setopt connect bind read write };
allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr ioctl };
allow passt_t self:udp_socket { create getopt setopt connect bind read write getattr };
allow passt_t self:icmp_socket { bind create setopt read write };
allow passt_t user_tmp_t:dir { add_name write };
allow passt_t user_tmp_t:file { create open };
allow passt_t user_tmp_t:sock_file { create read write unlink };
allow passt_t unconfined_t:unix_stream_socket { read write };
# Workaround: passt --vhost-user needs to map guest memory, but
# libvirt doesn't maintain its own policy, which makes updates
# particularly complicated. To avoid breakage in the short term,
# deal with it in passt's own policy.
allow passt_t svirt_image_t:file { read write map };
allow passt_t svirt_tmpfs_t:file { read write map };
allow passt_t null_device_t:chr_file map;


@ -18,6 +18,7 @@ require {
type bin_t;
type user_home_t;
type user_home_dir_t;
type user_tmp_t;
type fs_t;
type tmp_t;
type tmpfs_t;
@ -56,8 +57,10 @@ require {
attribute port_type;
type port_t;
type http_port_t;
type http_cache_port_t;
type ssh_port_t;
type reserved_port_t;
type unreserved_port_t;
type dns_port_t;
type dhcpc_port_t;
type chronyd_port_t;
@ -68,9 +71,6 @@ require {
type system_dbusd_t;
type systemd_hostnamed_t;
type systemd_systemctl_exec_t;
type passwd_file_t;
type sssd_public_t;
type sssd_var_lib_t;
class dbus send_msg;
class system module_request;
class system status;
@ -115,8 +115,7 @@ allow pasta_t self:capability { setpcap net_bind_service sys_tty_config dac_read
allow pasta_t self:cap_userns { setpcap sys_admin sys_ptrace net_admin net_bind_service };
allow pasta_t self:user_namespace create;
allow pasta_t passwd_file_t:file read_file_perms;
sssd_search_lib(pasta_t)
auth_read_passwd(pasta_t)
domain_auto_trans(pasta_t, bin_t, unconfined_t);
domain_auto_trans(pasta_t, shell_exec_t, unconfined_t);
@ -126,8 +125,8 @@ domain_auto_trans(pasta_t, ping_exec_t, ping_t);
allow pasta_t nsfs_t:file { open read };
allow pasta_t user_home_t:dir getattr;
allow pasta_t user_home_t:file { open read getattr setattr };
allow pasta_t user_home_t:dir { getattr search };
allow pasta_t user_home_t:file { open read getattr setattr execute execute_no_trans map};
allow pasta_t user_home_dir_t:dir { search getattr open add_name read write };
allow pasta_t user_home_dir_t:file { create open read write };
allow pasta_t tmp_t:dir { add_name mounton remove_name write };
@ -137,6 +136,11 @@ allow pasta_t root_t:dir mounton;
manage_files_pattern(pasta_t, pasta_pid_t, pasta_pid_t)
files_pid_filetrans(pasta_t, pasta_pid_t, file)
allow pasta_t user_tmp_t:dir { add_name remove_name search write };
allow pasta_t user_tmp_t:fifo_file append;
allow pasta_t user_tmp_t:file { create open write };
allow pasta_t user_tmp_t:sock_file { create unlink };
allow pasta_t console_device_t:chr_file { open write getattr ioctl };
allow pasta_t user_devpts_t:chr_file { getattr read write ioctl };
logging_send_syslog_msg(pasta_t)
@ -164,6 +168,8 @@ allow pasta_t self:udp_socket create_stream_socket_perms;
allow pasta_t reserved_port_t:udp_socket name_bind;
allow pasta_t llmnr_port_t:tcp_socket name_bind;
allow pasta_t llmnr_port_t:udp_socket name_bind;
allow pasta_t http_cache_port_t:tcp_socket { name_bind name_connect };
allow pasta_t unreserved_port_t:udp_socket name_bind;
corenet_udp_sendrecv_generic_node(pasta_t)
corenet_udp_bind_generic_node(pasta_t)
allow pasta_t node_t:icmp_socket { name_bind node_bind };
@ -175,15 +181,12 @@ allow pasta_t init_t:lnk_file read;
allow pasta_t init_t:unix_stream_socket connectto;
allow pasta_t init_t:dbus send_msg;
allow pasta_t init_t:system status;
allow pasta_t unconfined_t:dir search;
allow pasta_t unconfined_t:dir { read search };
allow pasta_t unconfined_t:file read;
allow pasta_t unconfined_t:lnk_file read;
allow pasta_t passwd_file_t:file { getattr open read };
allow pasta_t self:process { setpgid setcap };
allow pasta_t shell_exec_t:file { execute execute_no_trans map };
allow pasta_t sssd_var_lib_t:dir search;
allow pasta_t sssd_public_t:dir search;
allow pasta_t hostname_exec_t:file { execute execute_no_trans getattr open read map };
allow pasta_t system_dbusd_t:unix_stream_socket connectto;
allow pasta_t system_dbusd_t:dbus send_msg;
@ -196,11 +199,9 @@ allow pasta_t ifconfig_var_run_t:dir { read search watch };
allow pasta_t self:tun_socket create;
allow pasta_t tun_tap_device_t:chr_file { ioctl open read write };
allow pasta_t sysctl_net_t:dir search;
allow pasta_t sysctl_net_t:file { open write };
allow pasta_t sysctl_net_t:file { open read write };
allow pasta_t kernel_t:system module_request;
allow pasta_t nsfs_t:file read;
allow pasta_t proc_t:dir mounton;
allow pasta_t proc_t:filesystem mount;
allow pasta_t net_conf_t:lnk_file read;

142 dhcp.c

@ -36,9 +36,9 @@
/**
* struct opt - DHCP option
* @sent: Convenience flag, set while filling replies
* @slen: Length of option defined for server
* @slen: Length of option defined for server, -1 if not going to be sent
* @s: Option payload from server
* @clen: Length of option received from client
* @clen: Length of option received from client, -1 if not received
* @c: Option payload from client
*/
struct opt {
@ -63,11 +63,21 @@ static struct opt opts[255];
#define OPT_MIN 60 /* RFC 951 */
/* Total option size (excluding end option) is 576 (RFC 2131), minus
* offset of options (268), minus end option (1).
*/
#define OPT_MAX 307
/**
* dhcp_init() - Initialise DHCP options
*/
void dhcp_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(opts); i++)
opts[i].slen = -1;
opts[1] = (struct opt) { 0, 4, { 0 }, 0, { 0 }, }; /* Mask */
opts[3] = (struct opt) { 0, 4, { 0 }, 0, { 0 }, }; /* Router */
opts[51] = (struct opt) { 0, 4, { 0xff,
@ -107,6 +117,8 @@ struct msg {
uint32_t xid;
uint16_t secs;
uint16_t flags;
#define FLAG_BROADCAST htons_constant(0x8000)
uint32_t ciaddr;
struct in_addr yiaddr;
uint32_t siaddr;
@ -115,7 +127,7 @@ struct msg {
uint8_t sname[64];
uint8_t file[128];
uint32_t magic;
uint8_t o[308];
uint8_t o[OPT_MAX + 1 /* End option */ ];
} __attribute__((__packed__));
/**
@ -123,15 +135,28 @@ struct msg {
* @m: Message to fill
* @o: Option number
* @offset: Current offset within options field, updated on insertion
*
* Return: false if m has space to write the option, true otherwise
*/
static void fill_one(struct msg *m, int o, int *offset)
static bool fill_one(struct msg *m, int o, int *offset)
{
size_t slen = opts[o].slen;
/* If we don't have space to write the option, then just skip */
if (*offset + 2 /* code and length of option */ + slen > OPT_MAX)
return true;
m->o[*offset] = o;
m->o[*offset + 1] = opts[o].slen;
memcpy(&m->o[*offset + 2], opts[o].s, opts[o].slen);
m->o[*offset + 1] = slen;
/* Move to option */
*offset += 2;
memcpy(&m->o[*offset], opts[o].s, slen);
opts[o].sent = 1;
*offset += 2 + opts[o].slen;
*offset += slen;
return false;
}
/**
@ -144,9 +169,6 @@ static int fill(struct msg *m)
{
int i, o, offset = 0;
m->op = BOOTREPLY;
m->secs = 0;
for (o = 0; o < 255; o++)
opts[o].sent = 0;
@ -154,22 +176,24 @@ static int fill(struct msg *m)
* option 53 at the beginning of the list.
* Put it there explicitly, unless requested via option 55.
*/
if (!memchr(opts[55].c, 53, opts[55].clen))
fill_one(m, 53, &offset);
if (opts[55].clen > 0 && !memchr(opts[55].c, 53, opts[55].clen))
if (fill_one(m, 53, &offset))
debug("DHCP: skipping option 53");
for (i = 0; i < opts[55].clen; i++) {
o = opts[55].c[i];
if (opts[o].slen)
fill_one(m, o, &offset);
if (opts[o].slen != -1)
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
for (o = 0; o < 255; o++) {
if (opts[o].slen && !opts[o].sent)
fill_one(m, o, &offset);
if (opts[o].slen != -1 && !opts[o].sent)
if (fill_one(m, o, &offset))
debug("DHCP: skipping option %i", o);
}
m->o[offset++] = 255;
m->o[offset++] = 0;
if (offset < OPT_MIN) {
memset(&m->o[offset], 0, OPT_MIN - offset);
@ -264,6 +288,9 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len)
".\xc0");
}
}
if (!opts[119].slen)
opts[119].slen = -1;
}
/**
@ -277,12 +304,13 @@ int dhcp(const struct ctx *c, const struct pool *p)
{
size_t mlen, dlen, offset = 0, opt_len, opt_off = 0;
char macstr[ETH_ADDRSTRLEN];
struct in_addr mask, dst;
const struct ethhdr *eh;
const struct iphdr *iph;
const struct udphdr *uh;
struct in_addr mask;
struct msg const *m;
struct msg reply;
unsigned int i;
struct msg *m;
eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
offset += sizeof(*eh);
@ -311,8 +339,27 @@ int dhcp(const struct ctx *c, const struct pool *p)
m->op != BOOTREQUEST)
return -1;
reply.op = BOOTREPLY;
reply.htype = m->htype;
reply.hlen = m->hlen;
reply.hops = 0;
reply.xid = m->xid;
reply.secs = 0;
reply.flags = m->flags;
reply.ciaddr = m->ciaddr;
reply.yiaddr = c->ip4.addr;
reply.siaddr = 0;
reply.giaddr = m->giaddr;
memcpy(&reply.chaddr, m->chaddr, sizeof(reply.chaddr));
memset(&reply.sname, 0, sizeof(reply.sname));
memset(&reply.file, 0, sizeof(reply.file));
reply.magic = m->magic;
offset += offsetof(struct msg, o);
for (i = 0; i < ARRAY_SIZE(opts); i++)
opts[i].clen = -1;
while (opt_off + 2 < opt_len) {
const uint8_t *olen, *val;
uint8_t *type;
@ -331,11 +378,19 @@ int dhcp(const struct ctx *c, const struct pool *p)
opt_off += *olen + 2;
}
if (opts[53].c[0] == DHCPDISCOVER) {
info("DHCP: offer to discover");
opts[53].s[0] = DHCPOFFER;
} else if (opts[53].c[0] == DHCPREQUEST || !opts[53].clen) {
info("%s: ack to request", opts[53].clen ? "DHCP" : "BOOTP");
opts[80].slen = -1;
if (opts[53].clen > 0 && opts[53].c[0] == DHCPDISCOVER) {
if (opts[80].clen == -1) {
info("DHCP: offer to discover");
opts[53].s[0] = DHCPOFFER;
} else {
info("DHCP: ack to discover (Rapid Commit)");
opts[53].s[0] = DHCPACK;
opts[80].slen = 0;
}
} else if (opts[53].clen <= 0 || opts[53].c[0] == DHCPREQUEST) {
info("%s: ack to request", /* DHCP needs a valid message type */
(opts[53].clen <= 0) ? "BOOTP" : "DHCP");
opts[53].s[0] = DHCPACK;
} else {
return -1;
@ -343,7 +398,6 @@ int dhcp(const struct ctx *c, const struct pool *p)
info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));
m->yiaddr = c->ip4.addr;
mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
memcpy(opts[1].s, &mask, sizeof(mask));
memcpy(opts[3].s, &c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
@ -363,7 +417,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
&c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
}
if (c->mtu != -1) {
if (c->mtu) {
opts[26].slen = 2;
opts[26].s[0] = c->mtu / 256;
opts[26].s[1] = c->mtu % 256;
@ -374,12 +428,44 @@ int dhcp(const struct ctx *c, const struct pool *p)
((struct in_addr *)opts[6].s)[i] = c->ip4.dns[i];
opts[6].slen += sizeof(uint32_t);
}
if (!opts[6].slen)
opts[6].slen = -1;
opt_len = strlen(c->hostname);
if (opt_len > 0) {
opts[12].slen = opt_len;
memcpy(opts[12].s, &c->hostname, opt_len);
}
opt_len = strlen(c->fqdn);
if (opt_len > 0) {
opt_len += 3 /* flags */
+ 2; /* Length byte for first label, and terminator */
if (sizeof(opts[81].s) >= opt_len) {
opts[81].s[0] = 0x4; /* flags (E) */
opts[81].s[1] = 0xff; /* RCODE1 */
opts[81].s[2] = 0xff; /* RCODE2 */
encode_domain_name((char *)opts[81].s + 3, c->fqdn);
opts[81].slen = opt_len;
} else {
debug("DHCP: client FQDN option doesn't fit, skipping");
}
}
if (!c->no_dhcp_dns_search)
opt_set_dns_search(c, sizeof(m->o));
dlen = offsetof(struct msg, o) + fill(m);
tap_udp4_send(c, c->ip4.our_tap_addr, 67, c->ip4.addr, 68, m, dlen);
dlen = offsetof(struct msg, o) + fill(&reply);
if (m->flags & FLAG_BROADCAST)
dst = in4addr_broadcast;
else
dst = c->ip4.addr;
tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, &reply, dlen);
return 1;
}
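
For reference, both the client FQDN option (81) filled above and the DHCPv6
options below carry names in DNS wire format: each label is prefixed by its
length, and a zero byte terminates the name, hence the recurring "+ 2"
accounting (length byte for the first label, plus terminator).
encode_domain_name() is passt's own helper for this; a rough, hypothetical
equivalent, assuming a well-formed name without a trailing dot, could look
like:

#include <stdio.h>

/* Sketch only: encode "mail.example.org" as "\x04mail\x07example\x03org\x00",
 * i.e. strlen(name) + 2 bytes.
 */
static void encode_name_sketch(char *buf, const char *name)
{
	char *len_byte = buf++;		/* reserve first label's length byte */

	for (; *name; name++) {
		if (*name == '.') {
			*len_byte = buf - len_byte - 1;
			len_byte = buf++;
		} else {
			*buf++ = *name;
		}
	}
	*len_byte = buf - len_byte - 1;
	*buf = 0;			/* root label terminates the name */
}

int main(void)
{
	char buf[64];
	const char *p;

	encode_name_sketch(buf, "mail.example.org");
	for (p = buf; ; p++) {
		printf("%02x ", (unsigned char)*p);
		if (!*p)
			break;
	}
	printf("\n");	/* 04 6d 61 69 6c 07 65 78 61 6d 70 6c 65 03 6f 72 67 00 */
	return 0;
}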

158 dhcpv6.c

@ -48,6 +48,7 @@ struct opt_hdr {
# define STATUS_NOTONLINK htons_constant(4)
# define OPT_DNS_SERVERS htons_constant(23)
# define OPT_DNS_SEARCH htons_constant(24)
# define OPT_CLIENT_FQDN htons_constant(39)
#define STR_NOTONLINK "Prefix not appropriate for link."
uint16_t l;
@ -58,6 +59,9 @@ struct opt_hdr {
sizeof(struct opt_hdr))
#define OPT_VSIZE(x) (sizeof(struct opt_##x) - \
sizeof(struct opt_hdr))
#define OPT_MAX_SIZE IPV6_MIN_MTU - (sizeof(struct ipv6hdr) + \
sizeof(struct udphdr) + \
sizeof(struct msg_hdr))
/**
* struct opt_client_id - DHCPv6 Client Identifier option
@ -140,7 +144,9 @@ struct opt_ia_addr {
struct opt_status_code {
struct opt_hdr hdr;
uint16_t code;
char status_msg[sizeof(STR_NOTONLINK) - 1];
/* "nonstring" is only supported since clang 23 */
/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
__attribute__((nonstring)) char status_msg[sizeof(STR_NOTONLINK) - 1];
} __attribute__((packed));
/**
@ -163,6 +169,18 @@ struct opt_dns_search {
char list[MAXDNSRCH * NS_MAXDNAME];
} __attribute__((packed));
/**
* struct opt_client_fqdn - Client FQDN option (RFC 4704)
* @hdr: Option header
* @flags: Flags described by RFC 4704
* @domain_name: Client FQDN
*/
struct opt_client_fqdn {
struct opt_hdr hdr;
uint8_t flags;
char domain_name[PASST_MAXDNAME];
} __attribute__((packed));
/**
* struct msg_hdr - DHCPv6 client/server message header
* @type: DHCP message type
@ -193,6 +211,7 @@ struct msg_hdr {
* @client_id: Client Identifier, variable length
* @dns_servers: DNS Recursive Name Server, here just for storage size
* @dns_search: Domain Search List, here just for storage size
* @client_fqdn: Client FQDN, variable length
*/
static struct resp_t {
struct msg_hdr hdr;
@ -203,6 +222,7 @@ static struct resp_t {
struct opt_client_id client_id;
struct opt_dns_servers dns_servers;
struct opt_dns_search dns_search;
struct opt_client_fqdn client_fqdn;
} __attribute__((__packed__)) resp = {
{ 0 },
SERVER_ID,
@ -228,6 +248,10 @@ static struct resp_t {
{ { OPT_DNS_SEARCH, 0, },
{ 0 },
},
{ { OPT_CLIENT_FQDN, 0, },
0, { 0 },
},
};
static const struct opt_status_code sc_not_on_link = {
@ -296,47 +320,42 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p,
struct in6_addr *la)
{
int ia_types[2] = { OPT_IA_NA, OPT_IA_TA }, *ia_type;
const struct opt_ia_addr *opt_addr;
char buf[INET6_ADDRSTRLEN];
struct in6_addr req_addr;
const struct opt_hdr *h;
struct opt_hdr *ia;
size_t offset;
int ia_type;
ia_type = OPT_IA_NA;
ia_ta:
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
offset += sizeof(struct opt_ia_na);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
const struct opt_ia_addr *opt_addr;
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
foreach(ia_type, ia_types) {
offset = 0;
while ((ia = dhcpv6_opt(p, &offset, *ia_type))) {
if (ntohs(ia->l) < OPT_VSIZE(ia_na))
return NULL;
opt_addr = (const struct opt_ia_addr *)h;
req_addr = opt_addr->addr;
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) {
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr,
buf, sizeof(buf)));
return ia;
}
offset += sizeof(struct opt_ia_na);
offset += sizeof(struct opt_ia_addr);
while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
if (ntohs(h->l) != OPT_VSIZE(ia_addr))
return NULL;
opt_addr = (const struct opt_ia_addr *)h;
req_addr = opt_addr->addr;
if (!IN6_ARE_ADDR_EQUAL(la, &req_addr))
goto err;
offset += sizeof(struct opt_ia_addr);
}
}
}
if (ia_type == OPT_IA_NA) {
ia_type = OPT_IA_TA;
goto ia_ta;
}
return NULL;
err:
info("DHCPv6: requested address %s not on link",
inet_ntop(AF_INET6, &req_addr, buf, sizeof(buf)));
return ia;
}
/**
@ -351,7 +370,6 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
{
struct opt_dns_servers *srv = NULL;
struct opt_dns_search *srch = NULL;
char *p = NULL;
int i;
if (c->no_dhcp_dns)
@ -388,34 +406,81 @@ search:
if (!name_len)
continue;
name_len += 2; /* Length byte for first label, and terminator */
if (name_len >
NS_MAXDNAME + 1 /* Length byte for first label */ ||
name_len > 255) {
debug("DHCP: DNS search name '%s' too long, skipping",
c->dns_search[i].n);
continue;
}
if (!srch) {
srch = (struct opt_dns_search *)(buf + offset);
offset += sizeof(struct opt_hdr);
srch->hdr.t = OPT_DNS_SEARCH;
srch->hdr.l = 0;
p = srch->list;
}
*p = '.';
p = stpncpy(p + 1, c->dns_search[i].n, name_len);
p++;
srch->hdr.l += name_len + 2;
offset += name_len + 2;
encode_domain_name(buf + offset, c->dns_search[i].n);
srch->hdr.l += name_len;
offset += name_len;
}
if (srch) {
for (i = 0; i < srch->hdr.l; i++) {
if (srch->list[i] == '.') {
srch->list[i] = strcspn(srch->list + i + 1,
".");
}
}
if (srch)
srch->hdr.l = htons(srch->hdr.l);
}
return offset;
}
/**
* dhcpv6_client_fqdn_fill() - Fill in client FQDN option
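* @p: Packet pool, used to look up the client's FQDN option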
* @c: Execution context
* @buf: Response message buffer where options will be appended
* @offset: Offset in message buffer for new options
*
* Return: updated length of response message buffer.
*/
static size_t dhcpv6_client_fqdn_fill(const struct pool *p, const struct ctx *c,
char *buf, int offset)
{
struct opt_client_fqdn const *req_opt;
struct opt_client_fqdn *o;
size_t opt_len;
opt_len = strlen(c->fqdn);
if (opt_len == 0) {
return offset;
}
opt_len += 2; /* Length byte for first label, and terminator */
if (opt_len > OPT_MAX_SIZE - (offset +
sizeof(struct opt_hdr) +
1 /* flags */ )) {
debug("DHCPv6: client FQDN option doesn't fit, skipping");
return offset;
}
o = (struct opt_client_fqdn *)(buf + offset);
encode_domain_name(o->domain_name, c->fqdn);
req_opt = (struct opt_client_fqdn *)dhcpv6_opt(p, &(size_t){ 0 },
OPT_CLIENT_FQDN);
if (req_opt && req_opt->flags & 0x01 /* S flag */)
o->flags = 0x02 /* O flag */;
else
o->flags = 0x00;
opt_len++;
o->hdr.t = OPT_CLIENT_FQDN;
o->hdr.l = htons(opt_len);
return offset + sizeof(struct opt_hdr) + opt_len;
}
/**
* dhcpv6() - Check if this is a DHCPv6 message, reply as needed
* @c: Execution context
@ -428,11 +493,11 @@ search:
int dhcpv6(struct ctx *c, const struct pool *p,
const struct in6_addr *saddr, const struct in6_addr *daddr)
{
struct opt_hdr *ia, *bad_ia, *client_id;
const struct opt_hdr *server_id;
const struct opt_hdr *client_id, *server_id, *ia;
const struct in6_addr *src;
const struct msg_hdr *mh;
const struct udphdr *uh;
struct opt_hdr *bad_ia;
size_t mlen, n;
uh = packet_get(p, 0, 0, sizeof(*uh), &mlen);
@ -549,6 +614,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
n = offsetof(struct resp_t, client_id) +
sizeof(struct opt_hdr) + ntohs(client_id->l);
n = dhcpv6_dns_fill(c, (char *)&resp, n);
n = dhcpv6_client_fqdn_fill(p, c, (char *)&resp, n);
resp.hdr.xid = mh->xid;

2 doc/migration/.gitignore (new file)

@ -0,0 +1,2 @@
/source
/target

20 doc/migration/Makefile (new file)

@ -0,0 +1,20 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
TARGETS = source target
CFLAGS = -Wall -Wextra -pedantic
all: $(TARGETS)
$(TARGETS): %: %.c
clean:
rm -f $(TARGETS)

51 doc/migration/README (new file)

@ -0,0 +1,51 @@
<!---
SPDX-License-Identifier: GPL-2.0-or-later
Copyright (c) 2025 Red Hat GmbH
Author: Stefano Brivio <sbrivio@redhat.com>
-->
Migration
=========
These test programs show a migration of a TCP connection from one process to
another using the TCP_REPAIR socket option.
The two processes are a mock of the matching implementation in passt(1), and run
unprivileged, so they rely on the passt-repair helper to connect to them and set
or clear TCP_REPAIR on the connection socket, transferred to the helper using
SCM_RIGHTS.
The passt-repair helper needs to have the CAP_NET_ADMIN capability, or run as
root.
Example of usage
----------------
* Start the test server
$ nc -l 9999
* Start the source side of the TCP client (mock of the source instance of passt)
$ ./source 127.0.0.1 9999 9998 /tmp/repair.sock
* The client sends a test string, and waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The socket is now in repair mode, and `source` dumps sequences, then exits
sending sequence: 3244673313
receiving sequence: 2250449386
* Continue the connection on the target side, restarting from those sequences
$ ./target 127.0.0.1 9999 9998 /tmp/repair.sock 3244673313 2250449386
* The target side now waits for a connection from passt-repair
# passt-repair /tmp/repair.sock
* The target side asks passt-repair to switch the socket to repair mode, sets up
the TCP sequences, then asks passt-repair to clear repair mode, and sends a
test string to the server
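
For reference, a minimal sketch of the helper side of this exchange (not the
actual passt-repair(1) source: error handling, validation and the loop over
multiple commands are omitted). It connects to the socket path given as
argv[1], receives one command byte plus a TCP socket via SCM_RIGHTS, applies
TCP_REPAIR accordingly, and echoes the command back as acknowledgement. It
needs CAP_NET_ADMIN or root for the setsockopt() call.

/* Editor's sketch of a passt-repair-like helper, for illustration only */

#include <sys/socket.h>
#include <sys/un.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sockaddr_un a = { AF_UNIX, { 0 } };
	char buf[CMSG_SPACE(sizeof(int))];
	int8_t cmd;
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
	struct cmsghdr *cmsg;
	int s, fd, op;

	if (argc != 2)
		return -1;

	strcpy(a.sun_path, argv[1]);
	s = socket(AF_UNIX, SOCK_STREAM, 0);
	connect(s, (struct sockaddr *)&a, sizeof(a));	/* mock is listening */

	recvmsg(s, &msg, 0);			/* command byte + socket fd */
	cmsg = CMSG_FIRSTHDR(&msg);
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));

	op = cmd;				/* TCP_REPAIR_ON / _OFF */
	setsockopt(fd, SOL_TCP, TCP_REPAIR, &op, sizeof(op));

	send(s, &cmd, sizeof(cmd), 0);		/* acknowledge */
	close(fd);
	close(s);
	return 0;
}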

92 doc/migration/source.c (new file)

@ -0,0 +1,92 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/source.c - Mock of TCP migration source, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int seq, s, s_helper;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
socklen_t seqlen = sizeof(int);
struct addrinfo *r;
(void)argc;
if (argc != 5) {
fprintf(stderr, "%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH\n",
argv[0]);
return -1;
}
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
/* Connect socket to server and send some data */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
connect(s, r->ai_addr, r->ai_addrlen);
send(s, "before migration\n", sizeof("before migration\n"), 0);
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Get sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u ", seq);
/* Get receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "%u\n", seq);
}

102 doc/migration/target.c (new file)

@ -0,0 +1,102 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* doc/migration/target.c - Mock of TCP migration target, use with passt-repair
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int s, s_helper, seq;
int8_t cmd;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
struct addrinfo *r;
(void)argc;
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
if (argc != 7) {
fprintf(stderr,
"%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH SSEQ RSEQ\n",
argv[0]);
return -1;
}
/* Prepare socket, bind to source port */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Set sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[5]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Set receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[6]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Connect setting kernel state only, without actual SYN / handshake */
connect(s, r->ai_addr, r->ai_addrlen);
/* Send command to helper: turn repair mode off, wait for reply */
cmd = TCP_REPAIR_OFF;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int8_t){ 0 }), 1, 0);
/* Terminate helper */
close(s_helper);
/* Send some more data */
send(s, "after migration\n", sizeof("after migration\n"), 0);
}


@ -1,3 +1,4 @@
/listen-vs-repair
/reuseaddr-priority
/recv-zero
/udp-close-dup


@ -3,8 +3,8 @@
# Copyright Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
TARGETS = reuseaddr-priority recv-zero udp-close-dup
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c
TARGETS = reuseaddr-priority recv-zero udp-close-dup listen-vs-repair
SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c listen-vs-repair.c
CFLAGS = -Wall
all: cppcheck clang-tidy $(TARGETS:%=check-%)


@ -15,6 +15,7 @@
#include <stdio.h>
#include <stdlib.h>
__attribute__((format(printf, 1, 2), noreturn))
static inline void die(const char *fmt, ...)
{
va_list ap;


@ -0,0 +1,128 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* listen-vs-repair.c
*
* Do listening sockets have address conflicts with sockets under repair
* ====================================================================
*
* When we accept() an incoming connection the accept()ed socket will have the
* same local address as the listening socket. This can be a complication on
* migration. On the migration target we've already set up listening sockets
* according to the command line. However to restore connections that we're
* migrating in we need to bind the new sockets to the same address, which would
* be an address conflict on the face of it. This test program verifies that
* enabling repair mode before bind() correctly suppresses that conflict.
*
* Copyright Red Hat
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
/* NOLINTNEXTLINE(bugprone-reserved-identifier,cert-dcl37-c,cert-dcl51-cpp) */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "common.h"
#define PORT 13256U
#define CPORT 13257U
/* 127.0.0.1:PORT */
static const struct sockaddr_in addr = SOCKADDR_INIT(INADDR_LOOPBACK, PORT);
/* 127.0.0.1:CPORT */
static const struct sockaddr_in caddr = SOCKADDR_INIT(INADDR_LOOPBACK, CPORT);
/* Put ourselves into a network sandbox */
static void net_sandbox(void)
{
/* NOLINTNEXTLINE(altera-struct-pack-align) */
const struct req_t {
struct nlmsghdr nlh;
struct ifinfomsg ifm;
} __attribute__((packed)) req = {
.nlh.nlmsg_type = RTM_NEWLINK,
.nlh.nlmsg_flags = NLM_F_REQUEST,
.nlh.nlmsg_len = sizeof(req),
.nlh.nlmsg_seq = 1,
.ifm.ifi_family = AF_UNSPEC,
.ifm.ifi_index = 1,
.ifm.ifi_flags = IFF_UP,
.ifm.ifi_change = IFF_UP,
};
int nl;
if (unshare(CLONE_NEWUSER | CLONE_NEWNET))
die("unshare(): %s\n", strerror(errno));
/* Bring up lo in the new netns */
nl = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
if (nl < 0)
die("Can't create netlink socket: %s\n", strerror(errno));
if (send(nl, &req, sizeof(req), 0) < 0)
die("Netlink send(): %s\n", strerror(errno));
close(nl);
}
static void check(void)
{
int s1, s2, op;
s1 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s1 < 0)
die("socket() 1: %s\n", strerror(errno));
if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 1: %s\n", strerror(errno));
if (listen(s1, 0))
die("listen(): %s\n", strerror(errno));
s2 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s2 < 0)
die("socket() 2: %s\n", strerror(errno));
op = TCP_REPAIR_ON;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
if (bind(s2, (struct sockaddr *)&addr, sizeof(addr)))
die("bind() 2: %s\n", strerror(errno));
if (connect(s2, (struct sockaddr *)&caddr, sizeof(caddr)))
die("connect(): %s\n", strerror(errno));
op = TCP_REPAIR_OFF_NO_WP;
if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
die("TCP_REPAIR: %s\n", strerror(errno));
close(s1);
close(s2);
}
int main(int argc, char *argv[])
{
(void)argc;
(void)argv;
net_sandbox();
check();
printf("Repair mode appears to properly suppress conflicts with listening sockets\n");
exit(0);
}


@ -46,13 +46,13 @@
/* Different cases for receiving socket configuration */
enum sock_type {
/* Socket is bound to 0.0.0.0:DSTPORT and not connected */
SOCK_BOUND_ANY = 0,
SOCK_BOUND_ANY,
/* Socket is bound to 127.0.0.1:DSTPORT and not connected */
SOCK_BOUND_LO = 1,
SOCK_BOUND_LO,
/* Socket is bound to 0.0.0.0:DSTPORT and connected to 127.0.0.1:SRCPORT */
SOCK_CONNECTED = 2,
SOCK_CONNECTED,
NUM_SOCK_TYPES,
};


@ -22,8 +22,8 @@ enum epoll_type {
EPOLL_TYPE_TCP_TIMER,
/* UDP "listening" sockets */
EPOLL_TYPE_UDP_LISTEN,
/* UDP socket for replies on a specific flow */
EPOLL_TYPE_UDP_REPLY,
/* UDP socket for a specific flow */
EPOLL_TYPE_UDP,
/* ICMP/ICMPv6 ping sockets */
EPOLL_TYPE_PING,
/* inotify fd watching for end of netns (pasta) */
@ -36,6 +36,14 @@ enum epoll_type {
EPOLL_TYPE_TAP_PASST,
/* socket listening for qemu socket connections */
EPOLL_TYPE_TAP_LISTEN,
/* vhost-user command socket */
EPOLL_TYPE_VHOST_CMD,
/* vhost-user kick event socket */
EPOLL_TYPE_VHOST_KICK,
/* TCP_REPAIR helper listening socket */
EPOLL_TYPE_REPAIR_LISTEN,
/* TCP_REPAIR helper socket */
EPOLL_TYPE_REPAIR,
EPOLL_NUM_TYPES,
};

477 flow.c

@ -19,6 +19,7 @@
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
const char *flow_state_str[] = {
[FLOW_STATE_FREE] = "FREE",
@ -52,6 +53,13 @@ const uint8_t flow_proto[] = {
static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
"flow_proto[] doesn't match enum flow_type");
#define foreach_established_tcp_flow(flow) \
flow_foreach_of_type((flow), FLOW_TCP) \
if (!tcp_flow_is_established(&(flow)->tcp)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
/* Global Flow Table */
/**
@ -73,7 +81,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
*
* Free cluster list
* flow_first_free gives the index of the first (lowest index) free cluster.
* Each free cluster has the index of the next free cluster, or MAX_FLOW if
* Each free cluster has the index of the next free cluster, or FLOW_MAX if
* it is the last free cluster. Together these form a linked list of free
* clusters, in strictly increasing order of index.
*
@ -259,11 +267,13 @@ int flowside_connect(const struct ctx *c, int s,
/** flow_log_ - Log flow-related message
* @f: flow the message is related to
* @newline: Append newline at the end of the message, if missing
* @pri: Log priority
* @fmt: Format string
* @...: printf-arguments
*/
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
{
const char *type_or_state;
char msg[BUFSIZ];
@ -279,32 +289,27 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
else
type_or_state = FLOW_TYPE(f);
logmsg(true, false, pri,
logmsg(newline, false, pri,
"Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
}
/**
* flow_set_state() - Change flow's state
* @f: Flow changing state
* @state: New state
/** flow_log_details_() - Log the details of a flow
* @f: flow to log
* @pri: Log priority
* @state: State to log details according to
*
* Logs the details of the flow: endpoints, interfaces, type etc.
*/
static void flow_set_state(struct flow_common *f, enum flow_state state)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state)
{
char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
const struct flowside *ini = &f->side[INISIDE];
const struct flowside *tgt = &f->side[TGTSIDE];
uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES);
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT)
flow_log_(f, LOG_DEBUG,
if (state >= FLOW_STATE_TGT)
flow_log_(f, true, pri,
"%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@ -316,8 +321,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
tgt->oport,
inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
tgt->eport);
else if (MAX(state, oldstate) >= FLOW_STATE_INI)
flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
else if (state >= FLOW_STATE_INI)
flow_log_(f, true, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
pif_name(f->pif[INISIDE]),
inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
ini->eport,
@ -325,6 +330,25 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
ini->oport);
}
/**
* flow_set_state() - Change flow's state
* @f: Flow changing state
* @state: New state
*/
static void flow_set_state(struct flow_common *f, enum flow_state state)
{
uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES);
ASSERT(oldstate < FLOW_NUM_STATES);
f->state = state;
flow_log_(f, true, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
FLOW_STATE(f));
flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
}
/**
* flow_initiate_() - Move flow to INI, setting pif[INISIDE]
* @flow: Flow to change state
@ -372,18 +396,27 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
* @flow: Flow to change state
* @pif: pif of the initiating side
* @ssa: Source socket address
* @daddr: Destination address (may be NULL)
* @dport: Destination port
*
* Return: pointer to the initiating flowside information
*/
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport)
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport)
{
struct flowside *ini = &flow->f.side[INISIDE];
inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
if (inany_v4(&ini->eaddr))
if (inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa) < 0) {
char str[SOCKADDR_STRLEN];
ASSERT_WITH_MSG(0, "Bad socket address %s",
sockaddr_ntop(ssa, str, sizeof(str)));
}
if (daddr)
ini->oaddr = *daddr;
else if (inany_v4(&ini->eaddr))
ini->oaddr = inany_any4;
else
ini->oaddr = inany_any6;
@ -400,8 +433,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
*
* Return: pointer to the target flowside information
*/
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto)
{
char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
struct flow_common *f = &flow->f;
@ -583,12 +616,7 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
const struct flowside *side = &f->side[sidx.sidei];
uint8_t pif = f->pif[sidx.sidei];
/* For the hash table to work, entries must have complete endpoint
* information, and at least a forwarding port.
*/
ASSERT(pif != PIF_NONE && !inany_is_unspecified(&side->eaddr) &&
side->eport != 0 && side->oport != 0);
ASSERT(pif != PIF_NONE);
return flow_hash(c, FLOW_PROTO(f), pif, side);
}
@ -697,7 +725,7 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
!(FLOW_PROTO(&flow->f) == proto &&
flow->f.pif[sidx.sidei] == pif &&
flowside_eq(&flow->f.side[sidx.sidei], side)))
b = (b + 1) % FLOW_HASH_SIZE;
b = mod_sub(b, 1, FLOW_HASH_SIZE);
return flow_hashtab[b];
}
@ -732,19 +760,30 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
* @proto: Protocol of the flow (IP L4 protocol number)
* @pif: Interface of the flow
* @esa: Socket address of the endpoint
* @oaddr: Our address (may be NULL)
* @oport: Our port number
*
* Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
*/
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport)
const void *esa,
const union inany_addr *oaddr, in_port_t oport)
{
struct flowside side = {
.oport = oport,
};
inany_from_sockaddr(&side.eaddr, &side.eport, esa);
if (inany_v4(&side.eaddr))
if (inany_from_sockaddr(&side.eaddr, &side.eport, esa) < 0) {
char str[SOCKADDR_STRLEN];
warn("Flow lookup on bad socket address %s",
sockaddr_ntop(esa, str, sizeof(str)));
return FLOW_SIDX_NONE;
}
if (oaddr)
side.oaddr = *oaddr;
else if (inany_v4(&side.eaddr))
side.oaddr = inany_any4;
else
side.oaddr = inany_any6;
@ -761,8 +800,9 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
{
struct flow_free_cluster *free_head = NULL;
unsigned *last_next = &flow_first_free;
bool to_free[FLOW_MAX] = { 0 };
bool timer = false;
unsigned idx;
union flow *flow;
if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
timer = true;
@ -771,49 +811,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
for (idx = 0; idx < FLOW_MAX; idx++) {
union flow *flow = &flowtab[idx];
/* Check which flows we might need to close first, but don't free them
* yet as it's not safe to do that in the middle of flow_foreach().
*/
flow_foreach(flow) {
bool closed = false;
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* New free cluster, add to chain */
free_head = &flow->free;
*last_next = idx;
last_next = &free_head->next;
}
/* Skip remaining empty entries */
idx += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
/* Nothing to do */
break;
default:
ASSERT(false);
}
switch (flow->f.type) {
case FLOW_TYPE_NONE:
ASSERT(false);
@ -832,7 +835,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
if (timer)
closed = udp_flow_defer(c, &flow->udp, now);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
default:
@ -840,30 +844,321 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
;
}
if (closed) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
to_free[FLOW_IDX(flow)] = closed;
}
/* Second step: actually free the flows */
flow_foreach_slot(flow) {
switch (flow->f.state) {
case FLOW_STATE_FREE: {
unsigned skip = flow->free.n;
/* First entry of a free cluster must have n >= 1 */
ASSERT(skip);
if (free_head) {
/* Add slot to current free cluster */
ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
free_head->n++;
/* Merge into preceding free cluster */
free_head->n += flow->free.n;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
/* New free cluster, add to chain */
free_head = &flow->free;
free_head->n = 1;
*last_next = idx;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
/* Skip remaining empty entries */
flow += skip - 1;
continue;
}
case FLOW_STATE_NEW:
case FLOW_STATE_INI:
case FLOW_STATE_TGT:
case FLOW_STATE_TYPED:
/* Incomplete flow at end of cycle */
ASSERT(false);
break;
case FLOW_STATE_ACTIVE:
if (to_free[FLOW_IDX(flow)]) {
flow_set_state(&flow->f, FLOW_STATE_FREE);
memset(flow, 0, sizeof(*flow));
if (free_head) {
/* Add slot to current free cluster */
ASSERT(FLOW_IDX(flow) ==
FLOW_IDX(free_head) + free_head->n);
free_head->n++;
flow->free.n = flow->free.next = 0;
} else {
/* Create new free cluster */
free_head = &flow->free;
free_head->n = 1;
*last_next = FLOW_IDX(flow);
last_next = &free_head->next;
}
} else {
free_head = NULL;
}
break;
default:
ASSERT(false);
}
}
*last_next = FLOW_MAX;
}
/**
* flow_migrate_source_rollback() - Disable repair mode, return failure
* @c: Execution context
* @bound: No need to roll back flow indices >= @bound
* @ret: Negative error code
*
* Return: @ret
*/
static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
{
union flow *flow;
debug("...roll back migration");
foreach_established_tcp_flow(flow) {
if (FLOW_IDX(flow) >= bound)
break;
if (tcp_flow_repair_off(c, &flow->tcp))
die("Failed to roll back TCP_REPAIR mode");
}
if (repair_flush(c))
die("Failed to roll back TCP_REPAIR mode");
return ret;
}
/**
* flow_migrate_need_repair() - Do we need to set repair mode for any flow?
*
* Return: true if repair mode is needed, false otherwise
*/
static bool flow_migrate_need_repair(void)
{
union flow *flow;
foreach_established_tcp_flow(flow)
return true;
return false;
}
/**
* flow_migrate_repair_all() - Turn repair mode on or off for all flows
* @c: Execution context
* @enable: Switch repair mode on if set, off otherwise
*
* Return: 0 on success, negative error code on failure
*/
static int flow_migrate_repair_all(struct ctx *c, bool enable)
{
union flow *flow;
int rc;
/* If we don't have a repair helper, there's nothing we can do */
if (c->fd_repair < 0)
return 0;
foreach_established_tcp_flow(flow) {
if (enable)
rc = tcp_flow_repair_on(c, &flow->tcp);
else
rc = tcp_flow_repair_off(c, &flow->tcp);
if (rc) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow),
rc);
}
}
if ((rc = repair_flush(c))) {
debug("Can't %s repair mode: %s",
enable ? "enable" : "disable", strerror_(-rc));
return flow_migrate_source_rollback(c, FLOW_IDX(flow), rc);
}
return 0;
}
/**
* flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor (unused)
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
int rc;
(void)stage;
(void)fd;
if (flow_migrate_need_repair())
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
return 0;
}
/**
* flow_migrate_source() - Dump all the remaining information and send data
* @c: Execution context (unused)
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count = 0;
bool first = true;
union flow *flow;
int rc;
(void)c;
(void)stage;
/* If we don't have a repair helper, we can't migrate TCP flows */
if (c->fd_repair >= 0) {
foreach_established_tcp_flow(flow)
count++;
}
count = htonl(count);
if (write_all_buf(fd, &count, sizeof(count))) {
rc = errno;
err_perror("Can't send flow count (%u)", ntohl(count));
return flow_migrate_source_rollback(c, FLOW_MAX, rc);
}
debug("Sending %u flows", ntohl(count));
if (!count)
return 0;
/* Dump and send information that can be stored in the flow table.
*
* Limited rollback options here: if we fail to transfer any data (that
* is, on the first flow), undo everything and resume. Otherwise, the
* stream might now be inconsistent, and we might have closed listening
* TCP sockets, so just terminate.
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send data: %s",
strerror_(-rc));
if (!first)
die("Inconsistent migration state, exiting");
return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
}
first = false;
}
/* And then "extended" data (including window data we saved previously):
* the target needs to set repair mode on sockets before it can set
* this stuff, but it needs sockets (and flows) for that.
*
* This also closes sockets so that the target can start connecting
* theirs: you can't sendmsg() to queues (using the socket) if the
* socket is not connected (EPIPE), not even in repair mode. And the
* target needs to restore queues now because we're sending the data.
*
* So, no rollback here, just try as hard as we can. Tolerate per-flow
* failures but not if the stream might be inconsistent (reported here
* as EIO).
*/
foreach_established_tcp_flow(flow) {
rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
if (rc) {
flow_err(flow, "Can't send extended data: %s",
strerror_(-rc));
if (rc == -EIO)
die("Inconsistent migration state, exiting");
}
}
return 0;
}
/**
* flow_migrate_target() - Receive flows and insert in flow table
* @c: Execution context
* @stage: Migration stage information (unused)
* @fd: Migration file descriptor
*
* Return: 0 on success, positive error code on failure
*/
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd)
{
uint32_t count;
unsigned i;
int rc;
(void)stage;
if (read_all_buf(fd, &count, sizeof(count)))
return errno;
count = ntohl(count);
debug("Receiving %u flows", count);
if (!count)
return 0;
repair_wait(c);
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
repair_flush(c);
/* TODO: flow header with type, instead? */
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target(c, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
repair_flush(c);
for (i = 0; i < count; i++) {
rc = tcp_flow_migrate_target_ext(c, &flowtab[i].tcp, fd);
if (rc) {
flow_dbg(FLOW(i), "Migration data failure, abort: %s",
strerror_(-rc));
return -rc;
}
}
return 0;
}
/**
* flow_init() - Initialise flow related data structures
*/

36 flow.h

@ -243,18 +243,27 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
const void *eaddr, const void *oaddr,
in_port_t eport, in_port_t oport);
flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
const void *esa, in_port_t oport);
const void *esa,
const union inany_addr *oaddr, in_port_t oport);
union flow;
void flow_init(void);
void flow_defer_handler(const struct ctx *c, const struct timespec *now);
int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
int fd);
int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
int fd);
void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
__attribute__((format(printf, 3, 4)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, (pri), __VA_ARGS__)
void flow_log_(const struct flow_common *f, bool newline, int pri,
const char *fmt, ...)
__attribute__((format(printf, 4, 5)));
#define flow_log(f_, pri, ...) flow_log_(&(f_)->f, true, (pri), __VA_ARGS__)
#define flow_dbg(f, ...) flow_log((f), LOG_DEBUG, __VA_ARGS__)
#define flow_err(f, ...) flow_log((f), LOG_ERR, __VA_ARGS__)
@ -264,4 +273,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
flow_dbg((f), __VA_ARGS__); \
} while (0)
#define flow_log_perror_(f, pri, ...) \
do { \
int errno_ = errno; \
flow_log_((f), false, (pri), __VA_ARGS__); \
logmsg(true, true, (pri), ": %s", strerror_(errno_)); \
} while (0)
#define flow_dbg_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_DEBUG, __VA_ARGS__)
#define flow_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_ERR, __VA_ARGS__)
void flow_log_details_(const struct flow_common *f, int pri,
enum flow_state state);
#define flow_log_details(f_, pri) \
flow_log_details_(&((f_)->f), (pri), (f_)->f.state)
#define flow_dbg_details(f_) flow_log_details((f_), LOG_DEBUG)
#define flow_err_details(f_) flow_log_details((f_), LOG_ERR)
#endif /* FLOW_H */

flow_table.h

@ -50,6 +50,42 @@ extern union flow flowtab[];
#define flow_foreach_sidei(sidei_) \
for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)
/**
* flow_foreach_slot() - Step through each flow table entry
* @flow: Takes values of pointer to each flow table entry
*
* Includes FREE slots.
*/
#define flow_foreach_slot(flow) \
for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)
/**
* flow_foreach() - Step through each active flow
* @flow: Takes values of pointer to each active flow
*/
#define flow_foreach(flow) \
flow_foreach_slot((flow)) \
if ((flow)->f.state == FLOW_STATE_FREE) \
(flow) += (flow)->free.n - 1; \
else if ((flow)->f.state != FLOW_STATE_ACTIVE) { \
flow_err((flow), "Bad flow state during traversal"); \
continue; \
} else
/**
* flow_foreach_of_type() - Step through each active flow of given type
* @flow: Takes values of pointer to each flow
* @type_: Type of flow to traverse
*/
#define flow_foreach_of_type(flow, type_) \
flow_foreach((flow)) \
if ((flow)->f.type != (type_)) \
/* NOLINTNEXTLINE(bugprone-branch-clone) */ \
continue; \
else
/** flow_idx() - Index of flow from common structure
* @f: Common flow fields pointer
*
@ -57,6 +93,7 @@ extern union flow flowtab[];
*/
static inline unsigned flow_idx(const struct flow_common *f)
{
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
return (union flow *)f - flowtab;
}
@ -110,7 +147,7 @@ static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx)
const union flow *flow = flow_at_sidx(sidx);
if (!flow)
return PIF_NONE;
return NULL;
return &flow->f.side[sidx.sidei];
}
@ -161,15 +198,16 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
in_port_t dport);
struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
const union sockaddr_inany *ssa,
const union inany_addr *daddr,
in_port_t dport);
const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
sa_family_t af,
const void *saddr, in_port_t sport,
const void *daddr, in_port_t dport);
const struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
struct flowside *flow_target(const struct ctx *c, union flow *flow,
uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type);
#define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_)
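As a usage illustration (not part of this comparison), a caller with flow_table.h included can walk all active TCP flows with the new traversal macros roughly as below; handle_tcp_conn() is an invented per-flow hook, and FLOW_TCP and the union flow layout are assumed to follow the usual passt definitions:

#include "flow_table.h"

void handle_tcp_conn(struct tcp_tap_conn *conn);	/* hypothetical hook */

static void walk_tcp_flows(void)
{
	union flow *flow;

	/* flow_foreach() skips FREE clusters in one step via flow->free.n */
	flow_foreach_of_type(flow, FLOW_TCP)
		handle_tcp_conn(&flow->tcp);
}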

fwd.c | 196

@ -27,6 +27,80 @@
#include "lineread.h"
#include "flow_table.h"
/* Ephemeral port range: values from RFC 6335 */
static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
#define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
/** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
*
* Work out what ports the host thinks are ephemeral and record it for later
* use by fwd_port_is_ephemeral(). If we're unable to probe, assume the range
* recommended by RFC 6335.
*/
void fwd_probe_ephemeral(void)
{
char *line, *tab, *end;
struct lineread lr;
long min, max;
ssize_t len;
int fd;
fd = open(PORT_RANGE_SYSCTL, O_RDONLY | O_CLOEXEC);
if (fd < 0) {
warn_perror("Unable to open %s", PORT_RANGE_SYSCTL);
return;
}
lineread_init(&lr, fd);
len = lineread_get(&lr, &line);
close(fd);
if (len < 0)
goto parse_err;
tab = strchr(line, '\t');
if (!tab)
goto parse_err;
*tab = '\0';
errno = 0;
min = strtol(line, &end, 10);
if (*end || errno)
goto parse_err;
errno = 0;
max = strtol(tab + 1, &end, 10);
if (*end || errno)
goto parse_err;
if (min < 0 || min >= (long)NUM_PORTS ||
max < 0 || max >= (long)NUM_PORTS)
goto parse_err;
fwd_ephemeral_min = min;
fwd_ephemeral_max = max;
return;
parse_err:
warn("Unable to parse %s", PORT_RANGE_SYSCTL);
}
/**
* fwd_port_is_ephemeral() - Is port number ephemeral?
* @port: Port number
*
* Return: true if @port is ephemeral, that is, may be allocated by the kernel as
* a local port for outgoing connections or datagrams, but should not be
* used for binding services to.
*/
bool fwd_port_is_ephemeral(in_port_t port)
{
return (port >= fwd_ephemeral_min) && (port <= fwd_ephemeral_max);
}
/* See enum in kernel's include/net/tcp_states.h */
#define UDP_LISTEN 0x07
#define TCP_LISTEN 0x0a
@ -249,6 +323,30 @@ static bool fwd_guest_accessible(const struct ctx *c,
return fwd_guest_accessible6(c, &addr->a6);
}
/**
* nat_outbound() - Apply address translation for outbound (TAP to HOST)
* @c: Execution context
* @addr: Input address (as seen on TAP interface)
* @translated: Output address (as seen on HOST interface)
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (inany_equals4(addr, &c->ip4.map_host_loopback))
*translated = inany_loopback4;
else if (inany_equals6(addr, &c->ip6.map_host_loopback))
*translated = inany_loopback6;
else if (inany_equals4(addr, &c->ip4.map_guest_addr))
*translated = inany_from_v4(c->ip4.addr);
else if (inany_equals6(addr, &c->ip6.map_guest_addr))
translated->a6 = c->ip6.addr;
else
*translated = *addr;
}
/**
* fwd_nat_from_tap() - Determine to forward a flow from the tap interface
* @c: Execution context
@ -268,16 +366,8 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
else if (is_dns_flow(proto, ini) &&
inany_equals6(&ini->oaddr, &c->ip6.dns_match))
tgt->eaddr.a6 = c->ip6.dns_host;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_host_loopback))
tgt->eaddr = inany_loopback4;
else if (inany_equals6(&ini->oaddr, &c->ip6.map_host_loopback))
tgt->eaddr = inany_loopback6;
else if (inany_equals4(&ini->oaddr, &c->ip4.map_guest_addr))
tgt->eaddr = inany_from_v4(c->ip4.addr);
else if (inany_equals6(&ini->oaddr, &c->ip6.map_guest_addr))
tgt->eaddr.a6 = c->ip6.addr;
else
tgt->eaddr = ini->oaddr;
nat_outbound(c, &ini->oaddr, &tgt->eaddr);
tgt->eport = ini->oport;
@ -328,7 +418,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
else
tgt->eaddr = inany_loopback6;
/* Preserve the specific loopback adddress used, but let the kernel pick
/* Preserve the specific loopback address used, but let the kernel pick
* a source port on the target side
*/
tgt->oaddr = ini->eaddr;
@ -349,6 +439,42 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
return PIF_HOST;
}
/**
* nat_inbound() - Apply address translation for inbound (HOST to TAP)
* @c: Execution context
* @addr: Input address (as seen on HOST interface)
* @translated: Output address (as seen on TAP interface)
*
* Return: true on success, false if it couldn't translate the address
*
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated)
{
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(addr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
*translated = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(addr, &in6addr_loopback)) {
translated->a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(addr, &c->ip4.addr)) {
*translated = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(addr, &c->ip6.addr)) {
translated->a6 = c->ip6.map_guest_addr;
} else if (fwd_guest_accessible(c, addr)) {
*translated = *addr;
} else {
return false;
}
return true;
}
/**
* fwd_nat_from_host() - Determine to forward a flow from the host interface
* @c: Execution context
@ -369,41 +495,43 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
else if (proto == IPPROTO_UDP)
tgt->eport += c->udp.fwd_in.delta[tgt->eport];
if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) &&
if (!c->no_splice && inany_is_loopback(&ini->eaddr) &&
(proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
/* spliceable */
/* Preserve the specific loopback adddress used, but let the
* kernel pick a source port on the target side
/* The traffic will go over the guest's 'lo' interface, but by
* default use its external address, so we don't inadvertently
* expose services that listen only on the guest's loopback
* address. That can be overridden by --host-lo-to-ns-lo which
* will instead forward to the loopback address in the guest.
*
* In either case, let the kernel pick the source address to
* match.
*/
tgt->oaddr = ini->eaddr;
if (inany_v4(&ini->eaddr)) {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
tgt->oaddr = inany_any4;
} else {
if (c->host_lo_to_ns_lo)
tgt->eaddr = inany_loopback6;
else
tgt->eaddr.a6 = c->ip6.addr_seen;
tgt->oaddr = inany_any6;
}
/* Let the kernel pick source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP)
/* But for UDP preserve the source port */
tgt->oport = ini->eport;
if (inany_v4(&ini->eaddr))
tgt->eaddr = inany_loopback4;
else
tgt->eaddr = inany_loopback6;
return PIF_SPLICE;
}
if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
inany_equals4(&ini->eaddr, &in4addr_loopback)) {
/* Specifically 127.0.0.1, not 127.0.0.0/8 */
tgt->oaddr = inany_from_v4(c->ip4.map_host_loopback);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
inany_equals6(&ini->eaddr, &in6addr_loopback)) {
tgt->oaddr.a6 = c->ip6.map_host_loopback;
} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
inany_equals4(&ini->eaddr, &c->ip4.addr)) {
tgt->oaddr = inany_from_v4(c->ip4.map_guest_addr);
} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
inany_equals6(&ini->eaddr, &c->ip6.addr)) {
tgt->oaddr.a6 = c->ip6.map_guest_addr;
} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
if (!nat_inbound(c, &ini->eaddr, &tgt->oaddr)) {
if (inany_v4(&ini->eaddr)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
/* No source address we can use */
@ -412,8 +540,6 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
} else {
tgt->oaddr.a6 = c->ip6.our_tap_ll;
}
} else {
tgt->oaddr = ini->eaddr;
}
tgt->oport = ini->eport;
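For illustration, the newly exported nat_inbound() can be used along the lines of the fwd_nat_from_host() hunk above. This is only a sketch: guest_visible_src() is a hypothetical helper, and the fallback addresses are the ones the hunk itself uses for sources that aren't guest-accessible:

#include "passt.h"
#include "inany.h"
#include "fwd.h"

static union inany_addr guest_visible_src(const struct ctx *c,
					  const union inany_addr *host_src)
{
	union inany_addr tap_src;

	if (!nat_inbound(c, host_src, &tap_src)) {
		/* Not guest-accessible: fall back to our own tap address */
		if (inany_v4(host_src))
			tap_src = inany_from_v4(c->ip4.our_tap_addr);
		else
			tap_src.a6 = c->ip6.our_tap_ll;
	}

	return tap_src;
}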

fwd.h | 6

@ -7,11 +7,15 @@
#ifndef FWD_H
#define FWD_H
union inany_addr;
struct flowside;
/* Number of ports for both TCP and UDP */
#define NUM_PORTS (1U << 16)
void fwd_probe_ephemeral(void);
bool fwd_port_is_ephemeral(in_port_t port);
enum fwd_ports_mode {
FWD_UNSET = 0,
FWD_SPEC = 1,
@ -44,6 +48,8 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
const struct fwd_ports *tcp_rev);
void fwd_scan_ports_init(struct ctx *c);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated);
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,


@ -56,6 +56,7 @@ cd ..
make pkgs
scp passt passt.avx2 passt.1 qrap qrap.1 "${USER_HOST}:${BIN}"
scp pasta pasta.avx2 pasta.1 "${USER_HOST}:${BIN}"
scp passt-repair passt-repair.1 "${USER_HOST}:${BIN}"
ssh "${USER_HOST}" "rm -f ${BIN}/*.deb"
ssh "${USER_HOST}" "rm -f ${BIN}/*.rpm"

icmp.c | 7

@ -85,7 +85,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl);
if (n < 0) {
flow_err(pingf, "recvfrom() error: %s", strerror(errno));
flow_perror(pingf, "recvfrom() error");
return;
}
@ -150,7 +150,7 @@ unexpected:
static void icmp_ping_close(const struct ctx *c,
const struct icmp_ping_flow *pingf)
{
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
epoll_del(c, pingf->sock);
close(pingf->sock);
flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
}
@ -300,8 +300,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0);
if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
flow_dbg(pingf, "failed to relay request to socket: %s",
strerror(errno));
flow_dbg_perror(pingf, "failed to relay request to socket");
} else {
flow_dbg(pingf,
"echo request to socket, ID: %"PRIu16", seq: %"PRIu16,

inany.c | 20

@ -36,3 +36,23 @@ const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size)
return inet_ntop(AF_INET6, &src->a6, dst, size);
}
/** inany_pton - Parse an IPv[46] address from text format
* @src: IPv[46] address
* @dst: output buffer, filled with parsed address
*
* Return: On success, 1, if no parseable address is found, 0
*/
int inany_pton(const char *src, union inany_addr *dst)
{
if (inet_pton(AF_INET, src, &dst->v4mapped.a4)) {
memset(&dst->v4mapped.zero, 0, sizeof(dst->v4mapped.zero));
memset(&dst->v4mapped.one, 0xff, sizeof(dst->v4mapped.one));
return 1;
}
if (inet_pton(AF_INET6, src, &dst->a6))
return 1;
return 0;
}
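A possible caller of the new inany_pton(), sketched here as a hypothetical option parser; err() from log.h and the negative-errno return convention are assumptions of the sketch, not part of the patch:

#include <errno.h>
#include "inany.h"
#include "log.h"

static int parse_addr_arg(const char *arg, union inany_addr *addr)
{
	if (!inany_pton(arg, addr)) {
		err("Invalid address: %s", arg);
		return -EINVAL;
	}

	/* IPv4 input is stored internally as an IPv4-mapped IPv6 address */
	return 0;
}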

inany.h | 30

@ -237,23 +237,30 @@ static inline void inany_from_af(union inany_addr *aa,
}
/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
* @aa: Pointer to store IPv[46] address
* @dst: Pointer to store IPv[46] address (output)
* @port: Pointer to store port number, host order
* @addr: AF_INET or AF_INET6 socket address
* @addr: Socket address
*
* Return: 0 on success, -1 on error (bad address family)
*/
static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
const union sockaddr_inany *sa)
static inline int inany_from_sockaddr(union inany_addr *dst, in_port_t *port,
const void *addr)
{
const union sockaddr_inany *sa = (const union sockaddr_inany *)addr;
if (sa->sa_family == AF_INET6) {
inany_from_af(aa, AF_INET6, &sa->sa6.sin6_addr);
inany_from_af(dst, AF_INET6, &sa->sa6.sin6_addr);
*port = ntohs(sa->sa6.sin6_port);
} else if (sa->sa_family == AF_INET) {
inany_from_af(aa, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
} else {
/* Not valid to call with other address families */
ASSERT(0);
return 0;
}
if (sa->sa_family == AF_INET) {
inany_from_af(dst, AF_INET, &sa->sa4.sin_addr);
*port = ntohs(sa->sa4.sin_port);
return 0;
}
return -1;
}
/** inany_siphash_feed- Fold IPv[46] address into an in-progress siphash
@ -270,5 +277,6 @@ static inline void inany_siphash_feed(struct siphash_state *state,
#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
int inany_pton(const char *src, union inany_addr *dst);
#endif /* INANY_H */
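With inany_from_sockaddr() now returning an error instead of asserting, callers are expected to check the result. A minimal sketch, with peer_inany() as a hypothetical helper and the usual union sockaddr_inany layout assumed:

#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include "inany.h"

static int peer_inany(int s, union inany_addr *addr, in_port_t *port)
{
	union sockaddr_inany sa;
	socklen_t sl = sizeof(sa);

	if (getpeername(s, &sa.sa, &sl))
		return -errno;

	if (inany_from_sockaddr(addr, port, &sa))
		return -EAFNOSUPPORT;	/* neither AF_INET nor AF_INET6 */

	return 0;
}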

iov.c | 108

@ -26,7 +26,8 @@
#include "iov.h"
/* iov_skip_bytes() - Skip leading bytes of an IO vector
/**
* iov_skip_bytes() - Skip leading bytes of an IO vector
* @iov: IO vector
* @n: Number of entries in @iov
* @skip: Number of leading bytes of @iov to skip
@ -56,8 +57,8 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
}
/**
* iov_from_buf - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
* iov_from_buf() - Copy data from a buffer to an I/O vector (struct iovec)
* efficiently.
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -68,7 +69,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
*
* Returns: The number of bytes successfully copied.
*/
/* cppcheck-suppress unusedFunction */
size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
size_t offset, const void *buf, size_t bytes)
{
@ -97,8 +97,8 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_to_buf - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
* iov_to_buf() - Copy data from a scatter/gather I/O vector (struct iovec) to
* a buffer efficiently.
*
* @iov: Pointer to the array of struct iovec describing the scatter/gather
* I/O vector.
@ -137,8 +137,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
}
/**
* iov_size - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
* iov_size() - Calculate the total size of a scatter/gather I/O vector
* (struct iovec).
*
* @iov: Pointer to the array of struct iovec describing the
* scatter/gather I/O vector.
@ -156,3 +156,95 @@ size_t iov_size(const struct iovec *iov, size_t iov_cnt)
return len;
}
/**
* iov_tail_prune() - Remove any unneeded buffers from an IOV tail
* @tail: IO vector tail (modified)
*
* If an IOV tail's offset is large enough, it may not include any bytes from
* the first (or first several) buffers in the underlying IO vector. Modify the
* tail's representation so it contains the same logical bytes, but only
* includes buffers that are actually needed. This will avoid stepping through
* unnecessary elements of the underlying IO vector on future operations.
*
* Return: true if the tail still contains any bytes, otherwise false
*/
bool iov_tail_prune(struct iov_tail *tail)
{
size_t i;
i = iov_skip_bytes(tail->iov, tail->cnt, tail->off, &tail->off);
tail->iov += i;
tail->cnt -= i;
return !!tail->cnt;
}
/**
* iov_tail_size - Calculate the total size of an IO vector tail
* @tail: IO vector tail
*
* Returns: The total size in bytes.
*/
size_t iov_tail_size(struct iov_tail *tail)
{
iov_tail_prune(tail);
return iov_size(tail->iov, tail->cnt) - tail->off;
}
/**
* iov_peek_header_() - Get pointer to a header from an IOV tail
* @tail: IOV tail to get header from
* @len: Length of header to get, in bytes
* @align: Required alignment of header, in bytes
*
* @tail may be pruned, but will represent the same bytes as before.
*
* Returns: Pointer to the first @len logical bytes of the tail, NULL if that
* overruns the IO vector, is not contiguous or doesn't have the
* requested alignment.
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align)
{
char *p;
if (!iov_tail_prune(tail))
return NULL; /* Nothing left */
if (tail->off + len < tail->off)
return NULL; /* Overflow */
if (tail->off + len > tail->iov[0].iov_len)
return NULL; /* Not contiguous */
p = (char *)tail->iov[0].iov_base + tail->off;
if ((uintptr_t)p % align)
return NULL; /* not aligned */
return p;
}
/**
* iov_remove_header_() - Remove a header from an IOV tail
* @tail: IOV tail to remove header from (modified)
* @len: Length of header to remove, in bytes
* @align: Required alignment of header, in bytes
*
* On success, @tail is updated so that it no longer includes the bytes of the
* returned header.
*
* Returns: Pointer to the first @len logical bytes of the tail, NULL if that
* overruns the IO vector, is not contiguous or doesn't have the
* requested alignment.
*/
void *iov_remove_header_(struct iov_tail *tail, size_t len, size_t align)
{
char *p = iov_peek_header_(tail, len, align);
if (!p)
return NULL;
tail->off = tail->off + len;
return p;
}

iov.h | 76

@ -28,4 +28,80 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
size_t offset, void *buf, size_t bytes);
size_t iov_size(const struct iovec *iov, size_t iov_cnt);
/*
* DOC: Theory of Operation, struct iov_tail
*
* Sometimes a single logical network frame is split across multiple buffers,
* represented by an IO vector (struct iovec[]). We often want to process this
* one header / network layer at a time. So, it's useful to maintain a "tail"
* of the vector representing the parts we haven't yet extracted.
*
* The headers we extract need not line up with buffer boundaries (though we do
* assume they're contiguous within a single buffer for now). So, we could
* represent that tail as another struct iovec[], but that would mean copying
* the whole array of struct iovecs, just so we can adjust the offset and length
* on the first one.
*
* So, instead represent the tail as pointer into an existing struct iovec[],
* with an explicit offset for where the "tail" starts within it. If we extract
* enough headers that some buffers of the original vector no longer contain
* part of the tail, we (lazily) advance our struct iovec * to the first buffer
* we still need, and adjust the vector length and offset to match.
*/
/**
* struct iov_tail - An IO vector which may have some headers logically removed
* @iov: IO vector
* @cnt: Number of entries in @iov
* @off: Current offset in @iov
*/
struct iov_tail {
const struct iovec *iov;
size_t cnt, off;
};
/**
* IOV_TAIL() - Create a new IOV tail
* @iov_: IO vector to create tail from
* @cnt_: Length of the IO vector at @iov_
* @off_: Byte offset in the IO vector where the tail begins
*/
#define IOV_TAIL(iov_, cnt_, off_) \
(struct iov_tail){ .iov = (iov_), .cnt = (cnt_), .off = (off_) }
bool iov_tail_prune(struct iov_tail *tail);
size_t iov_tail_size(struct iov_tail *tail);
void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align);
void *iov_remove_header_(struct iov_tail *tail, size_t len, size_t align);
/**
* IOV_PEEK_HEADER() - Get typed pointer to a header from an IOV tail
* @tail_: IOV tail to get header from
* @type_: Data type of the header
*
* @tail_ may be pruned, but will represent the same bytes as before.
*
* Returns: Pointer of type (@type_ *) located at the start of @tail_, NULL if
* we can't get a contiguous and aligned pointer.
*/
#define IOV_PEEK_HEADER(tail_, type_) \
((type_ *)(iov_peek_header_((tail_), \
sizeof(type_), __alignof__(type_))))
/**
* IOV_REMOVE_HEADER() - Remove and return typed header from an IOV tail
* @tail_: IOV tail to remove header from (modified)
* @type_: Data type of the header to remove
*
* On success, @tail_ is updated so that it no longer includes the bytes of the
* returned header.
*
* Returns: Pointer of type (@type_ *) located at the old start of @tail_, NULL
* if we can't get a contiguous and aligned pointer.
*/
#define IOV_REMOVE_HEADER(tail_, type_) \
((type_ *)(iov_remove_header_((tail_), \
sizeof(type_), __alignof__(type_))))
#endif /* IOVEC_H */
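A usage sketch for the iov_tail helpers (hypothetical, not part of the patch): peel a UDP header off an L4 payload that may be scattered across buffers, leaving the tail at the payload. The NULL cases match the behaviour documented for iov_remove_header_():

#include <sys/uio.h>
#include <netinet/udp.h>
#include "iov.h"

static struct udphdr *l4_udp_header(const struct iovec *iov, size_t cnt,
				    size_t l4_off, struct iov_tail *tail)
{
	*tail = IOV_TAIL(iov, cnt, l4_off);

	/* NULL if the header spans two buffers or isn't suitably aligned */
	return IOV_REMOVE_HEADER(tail, struct udphdr);
}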

ip.h | 46

@ -36,13 +36,14 @@
.tos = 0, \
.tot_len = 0, \
.id = 0, \
.frag_off = 0, \
.frag_off = htons(IP_DF), \
.ttl = 0xff, \
.protocol = (proto), \
.saddr = 0, \
.daddr = 0, \
}
#define L2_BUF_IP4_PSUM(proto) ((uint32_t)htons_constant(0x4500) + \
(uint32_t)htons_constant(IP_DF) + \
(uint32_t)htons(0xff00 | (proto)))
@ -90,6 +91,49 @@ struct ipv6_opt_hdr {
*/
} __attribute__((packed)); /* required for some archs */
/**
* ip6_set_flow_lbl() - Set flow label in an IPv6 header
* @ip6h: Pointer to IPv6 header, updated
* @flow: Set @ip6h flow label to the low 20 bits of this integer
*/
static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow)
{
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
}
/** ip6_get_flow_lbl() - Get flow label from an IPv6 header
* @ip6h: Pointer to IPv6 header
*
* Return: flow label from @ip6h as an integer (<= 20 bits)
*/
static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
{
return (ip6h->flow_lbl[0] & 0xf) << 16 |
ip6h->flow_lbl[1] << 8 |
ip6h->flow_lbl[2];
}
char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
size_t *dlen);
/* IPv6 link-local all-nodes multicast address, ff02::1 */
static const struct in6_addr in6addr_ll_all_nodes = {
.s6_addr = {
0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01,
},
};
/* IPv4 Limited Broadcast (RFC 919, Section 7), 255.255.255.255 */
static const struct in_addr in4addr_broadcast = { 0xffffffff };
#ifndef IPV4_MIN_MTU
#define IPV4_MIN_MTU 68
#endif
#ifndef IPV6_MIN_MTU
#define IPV6_MIN_MTU 1280
#endif
#endif /* IP_H */
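A quick round-trip of the new flow label helpers, as an illustration only (struct ipv6hdr comes from the kernel headers already used by ip.h). Only the low 20 bits survive, since the setter masks the rest:

#include <assert.h>
#include <linux/ipv6.h>
#include "ip.h"

static void flow_lbl_roundtrip(void)
{
	struct ipv6hdr ip6h = { .version = 6 };

	ip6_set_flow_lbl(&ip6h, 0xabcdef);		/* 24-bit input */
	assert(ip6_get_flow_lbl(&ip6h) == 0xbcdef);	/* low 20 bits kept */
}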

isolation.c

@ -129,7 +129,7 @@ static void drop_caps_ep_except(uint64_t keep)
* additional layer of protection. Executing this requires
* CAP_SETPCAP, which we will have within our userns.
*
* Note that dropping capabilites from the bounding set limits
* Note that dropping capabilities from the bounding set limits
* exec()ed processes, but does not remove them from the effective or
* permitted sets, so it doesn't reduce our own capabilities.
*/
@ -174,8 +174,8 @@ static void clamp_caps(void)
* Should:
* - drop unneeded capabilities
* - close all open files except for standard streams and the one from --fd
* Musn't:
* - remove filesytem access (we need to access files during setup)
* Mustn't:
* - remove filesystem access (we need to access files during setup)
*/
void isolate_initial(int argc, char **argv)
{
@ -194,7 +194,7 @@ void isolate_initial(int argc, char **argv)
*
* It's debatable whether it's useful to drop caps when we
* retain SETUID and SYS_ADMIN, but we might as well. We drop
* further capabilites in isolate_user() and
* further capabilities in isolate_user() and
* isolate_prefork().
*/
keep = BIT(CAP_NET_BIND_SERVICE) | BIT(CAP_SETUID) | BIT(CAP_SETGID) |
@ -379,12 +379,21 @@ void isolate_postfork(const struct ctx *c)
prctl(PR_SET_DUMPABLE, 0);
if (c->mode == MODE_PASTA) {
prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
prog.filter = filter_pasta;
} else {
switch (c->mode) {
case MODE_PASST:
prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
prog.filter = filter_passt;
break;
case MODE_PASTA:
prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
prog.filter = filter_pasta;
break;
case MODE_VU:
prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
prog.filter = filter_vu;
break;
default:
ASSERT(0);
}
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||

linux_dep.h (new file) | 144

@ -0,0 +1,144 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright Red Hat
*
* Declarations for Linux specific dependencies
*/
#ifndef LINUX_DEP_H
#define LINUX_DEP_H
/* struct tcp_info_linux - Information from Linux TCP_INFO getsockopt()
*
* Largely derived from include/linux/tcp.h in the Linux kernel
*
* Some fields returned by TCP_INFO have been there for ages and are shared with
* BSD. struct tcp_info from netinet/tcp.h has only those fields. There are
* also a many Linux specific extensions to the structure, which are only found
* in the linux/tcp.h version of struct tcp_info.
*
* We want to use some of those extension fields, when available. We can test
* for availability in the runtime kernel using the length returned from
* getsockopt(). However, we won't necessarily be compiled against the same
* kernel headers as we'll run with, so compiling directly against linux/tcp.h
* means wrapping every field access in an #ifdef whose #else does the same
* thing as when the field is missing at runtime. This rapidly gets messy.
*
* Instead we define here struct tcp_info_linux which includes all the Linux
* extensions that we want to use. This is taken from v6.11 of the kernel.
*/
struct tcp_info_linux {
uint8_t tcpi_state;
uint8_t tcpi_ca_state;
uint8_t tcpi_retransmits;
uint8_t tcpi_probes;
uint8_t tcpi_backoff;
uint8_t tcpi_options;
uint8_t tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
uint8_t tcpi_delivery_rate_app_limited:1, tcpi_fastopen_client_fail:2;
uint32_t tcpi_rto;
uint32_t tcpi_ato;
uint32_t tcpi_snd_mss;
uint32_t tcpi_rcv_mss;
uint32_t tcpi_unacked;
uint32_t tcpi_sacked;
uint32_t tcpi_lost;
uint32_t tcpi_retrans;
uint32_t tcpi_fackets;
/* Times. */
uint32_t tcpi_last_data_sent;
uint32_t tcpi_last_ack_sent;
uint32_t tcpi_last_data_recv;
uint32_t tcpi_last_ack_recv;
/* Metrics. */
uint32_t tcpi_pmtu;
uint32_t tcpi_rcv_ssthresh;
uint32_t tcpi_rtt;
uint32_t tcpi_rttvar;
uint32_t tcpi_snd_ssthresh;
uint32_t tcpi_snd_cwnd;
uint32_t tcpi_advmss;
uint32_t tcpi_reordering;
uint32_t tcpi_rcv_rtt;
uint32_t tcpi_rcv_space;
uint32_t tcpi_total_retrans;
/* Linux extensions */
uint64_t tcpi_pacing_rate;
uint64_t tcpi_max_pacing_rate;
uint64_t tcpi_bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked */
uint64_t tcpi_bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived */
uint32_t tcpi_segs_out; /* RFC4898 tcpEStatsPerfSegsOut */
uint32_t tcpi_segs_in; /* RFC4898 tcpEStatsPerfSegsIn */
uint32_t tcpi_notsent_bytes;
uint32_t tcpi_min_rtt;
uint32_t tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */
uint32_t tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
uint64_t tcpi_delivery_rate;
uint64_t tcpi_busy_time; /* Time (usec) busy sending data */
uint64_t tcpi_rwnd_limited; /* Time (usec) limited by receive window */
uint64_t tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
uint32_t tcpi_delivered;
uint32_t tcpi_delivered_ce;
uint64_t tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
uint64_t tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans */
uint32_t tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
uint32_t tcpi_reord_seen; /* reordering events seen */
uint32_t tcpi_rcv_ooopack; /* Out-of-order packets received */
uint32_t tcpi_snd_wnd; /* peer's advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rcv_wnd; /* local advertised receive window after
* scaling (bytes)
*/
uint32_t tcpi_rehash; /* PLB or timeout triggered rehash attempts */
uint16_t tcpi_total_rto; /* Total number of RTO timeouts, including
* SYN/SYN-ACK and recurring timeouts.
*/
uint16_t tcpi_total_rto_recoveries; /* Total number of RTO
* recoveries, including any
* unfinished recovery.
*/
uint32_t tcpi_total_rto_time; /* Total time spent in RTO recoveries
* in milliseconds, including any
* unfinished recovery.
*/
};
#include <linux/falloc.h>
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#endif
#include <linux/close_range.h>
/* glibc < 2.34 and musl as of 1.2.5 need these */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE /* Linux kernel < 5.9 */
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif
__attribute__ ((weak))
/* cppcheck-suppress funcArgNamesDifferent */
int close_range(unsigned int first, unsigned int last, int flags) {
return syscall(SYS_close_range, first, last, flags);
}
#endif /* LINUX_DEP_H */
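For illustration, the weak close_range() definition above lets callers rely on the close_range(2) interface whether or not libc provides a wrapper. A sketch of a typical use, with close_above() as an invented name:

#include <errno.h>
#include <limits.h>
#include <unistd.h>
#include "linux_dep.h"

static int close_above(unsigned int max_keep)
{
	/* Unshare the descriptor table, then close everything above @max_keep */
	if (close_range(max_keep + 1, UINT_MAX, CLOSE_RANGE_UNSHARE))
		return -errno;

	return 0;
}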

log.c | 86

@ -26,6 +26,7 @@
#include <stdarg.h>
#include <sys/socket.h>
#include "linux_dep.h"
#include "log.h"
#include "util.h"
#include "passt.h"
@ -55,7 +56,7 @@ bool log_stderr = true; /* Not daemonised, no shell spawned */
*
* Return: pointer to @now, or NULL if there was an error retrieving the time
*/
const struct timespec *logtime(struct timespec *ts)
static const struct timespec *logtime(struct timespec *ts)
{
if (clock_gettime(CLOCK_MONOTONIC, ts))
return NULL;
@ -92,7 +93,6 @@ const char *logfile_prefix[] = {
" ", /* LOG_DEBUG */
};
#ifdef FALLOC_FL_COLLAPSE_RANGE
/**
* logfile_rotate_fallocate() - Write header, set log_written after fallocate()
* @fd: Log file descriptor
@ -126,7 +126,6 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *now)
log_written -= log_cut_size;
}
#endif /* FALLOC_FL_COLLAPSE_RANGE */
/**
* logfile_rotate_move() - Fallback: move recent entries toward start, then cut
@ -198,21 +197,17 @@ out:
*
* Return: 0 on success, negative error code on failure
*
* #syscalls fcntl
*
* fallocate() passed as EXTRA_SYSCALL only if FALLOC_FL_COLLAPSE_RANGE is there
* #syscalls fcntl fallocate
*/
static int logfile_rotate(int fd, const struct timespec *now)
{
if (fcntl(fd, F_SETFL, O_RDWR /* Drop O_APPEND: explicit lseek() */))
return -errno;
#ifdef FALLOC_FL_COLLAPSE_RANGE
/* Only for Linux >= 3.15, extent-based ext4 or XFS, glibc >= 2.18 */
if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, log_cut_size))
logfile_rotate_fallocate(fd, now);
else
#endif
logfile_rotate_move(fd, now);
if (fcntl(fd, F_SETFL, O_RDWR | O_APPEND))
@ -224,19 +219,23 @@ static int logfile_rotate(int fd, const struct timespec *now)
/**
* logfile_write() - Write entry to log file, trigger rotation if full
* @newline: Append newline at the end of the message, if missing
* @cont: Continuation of a previous message, on the same line
* @pri: Facility and level map, same as priority for vsyslog()
* @now: Timestamp
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
static void logfile_write(bool newline, int pri, const struct timespec *now,
static void logfile_write(bool newline, bool cont, int pri,
const struct timespec *now,
const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
int n = 0;
n = logtime_fmt(buf, BUFSIZ, now);
n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]);
if (!cont) {
n += logtime_fmt(buf, BUFSIZ, now);
n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]);
}
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
@ -250,6 +249,30 @@ static void logfile_write(bool newline, int pri, const struct timespec *now,
log_written += n;
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
static void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* vlogmsg() - Print or send messages to log or output files as configured
* @newline: Append newline at the end of the message, if missing
@ -258,6 +281,7 @@ static void logfile_write(bool newline, int pri, const struct timespec *now,
* @format: Message
* @ap: Variable argument list
*/
/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
{
bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;
@ -270,7 +294,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
char timestr[LOGTIME_STRLEN];
logtime_fmt(timestr, sizeof(timestr), now);
fprintf(stderr, "%s: ", timestr);
FPRINTF(stderr, "%s: ", timestr);
}
if ((log_mask & LOG_MASK(LOG_PRI(pri))) || !log_conf_parsed) {
@ -278,7 +302,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
va_copy(ap2, ap); /* Don't clobber ap, we need it again */
if (log_file != -1)
logfile_write(newline, pri, now, format, ap2);
logfile_write(newline, cont, pri, now, format, ap2);
else if (!(log_mask & LOG_MASK(LOG_DEBUG)))
passt_vsyslog(newline, pri, format, ap2);
@ -289,7 +313,7 @@ void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
(log_stderr && (log_mask & LOG_MASK(LOG_PRI(pri))))) {
(void)vfprintf(stderr, format, ap);
if (newline && format[strlen(format)] != '\n')
fprintf(stderr, "\n");
FPRINTF(stderr, "\n");
}
}
@ -323,7 +347,7 @@ void logmsg_perror(int pri, const char *format, ...)
vlogmsg(false, false, pri, format, ap);
va_end(ap);
logmsg(true, true, pri, ": %s", strerror(errno_copy));
logmsg(true, true, pri, ": %s", strerror_(errno_copy));
}
/**
@ -374,35 +398,11 @@ void __setlogmask(int mask)
setlogmask(mask);
}
/**
* passt_vsyslog() - vsyslog() implementation not using heap memory
* @newline: Append newline at the end of the message, if missing
* @pri: Facility and level map, same as priority for vsyslog()
* @format: Same as vsyslog() format
* @ap: Same as vsyslog() ap
*/
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
{
char buf[BUFSIZ];
int n;
/* Send without timestamp, the system logger should add it */
n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
if (newline && format[strlen(format)] != '\n')
n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
fprintf(stderr, "Failed to send %i bytes to syslog\n", n);
}
/**
* logfile_init() - Open log file and write header with PID, version, path
* @name: Identifier for header: passt or pasta
* @path: Path to log file
* @size: Maximum size of log file: log_cut_size is calculatd here
* @size: Maximum size of log file: log_cut_size is calculated here
*/
void logfile_init(const char *name, const char *path, size_t size)
{
@ -412,8 +412,7 @@ void logfile_init(const char *name, const char *path, size_t size)
if (readlink("/proc/self/exe", exe, PATH_MAX - 1) < 0)
die_perror("Failed to read own /proc/self/exe link");
log_file = open(path, O_CREAT | O_TRUNC | O_APPEND | O_RDWR | O_CLOEXEC,
S_IRUSR | S_IWUSR);
log_file = output_file_open(path, O_APPEND | O_RDWR);
if (log_file == -1)
die_perror("Couldn't open log file %s", path);
@ -429,4 +428,3 @@ void logfile_init(const char *name, const char *path, size_t size)
/* For FALLOC_FL_COLLAPSE_RANGE: VFS block size can be up to one page */
log_cut_size = ROUND_UP(log_size * LOGFILE_CUT_RATIO / 100, PAGE_SIZE);
}

log.h | 5

@ -32,13 +32,13 @@ void logmsg_perror(int pri, const char *format, ...)
#define die(...) \
do { \
err(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
#define die_perror(...) \
do { \
err_perror(__VA_ARGS__); \
exit(EXIT_FAILURE); \
_exit(EXIT_FAILURE); \
} while (0)
extern int log_trace;
@ -55,7 +55,6 @@ void trace_init(int enable);
void __openlog(const char *ident, int option, int facility);
void logfile_init(const char *name, const char *path, size_t size);
void passt_vsyslog(bool newline, int pri, const char *format, va_list ap);
void __setlogmask(int mask);
#endif /* LOG_H */

migrate.c (new file) | 304

@ -0,0 +1,304 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* migrate.c - Migration sections, layout, and routines
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "migrate.h"
#include "repair.h"
/* Magic identifier for migration data */
#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
/**
* struct migrate_seen_addrs_v1 - Migratable guest addresses for v1 state stream
* @addr6: Observed guest IPv6 address
* @addr6_ll: Observed guest IPv6 link-local address
* @addr4: Observed guest IPv4 address
* @mac: Observed guest MAC address
*/
struct migrate_seen_addrs_v1 {
struct in6_addr addr6;
struct in6_addr addr6_ll;
struct in_addr addr4;
unsigned char mac[ETH_ALEN];
} __attribute__((packed));
/**
* seen_addrs_source_v1() - Copy and send guest observed addresses from source
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
static int seen_addrs_source_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs = {
.addr6 = c->ip6.addr_seen,
.addr6_ll = c->ip6.addr_ll_seen,
.addr4 = c->ip4.addr_seen,
};
(void)stage;
memcpy(addrs.mac, c->guest_mac, sizeof(addrs.mac));
if (write_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
return 0;
}
/**
* seen_addrs_target_v1() - Receive and use guest observed addresses on target
* @c: Execution context
* @stage: Migration stage, unused
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int seen_addrs_target_v1(struct ctx *c,
const struct migrate_stage *stage, int fd)
{
struct migrate_seen_addrs_v1 addrs;
(void)stage;
if (read_all_buf(fd, &addrs, sizeof(addrs)))
return errno;
c->ip6.addr_seen = addrs.addr6;
c->ip6.addr_ll_seen = addrs.addr6_ll;
c->ip4.addr_seen = addrs.addr4;
memcpy(c->guest_mac, addrs.mac, sizeof(c->guest_mac));
return 0;
}
/* Stages for version 2 */
static const struct migrate_stage stages_v2[] = {
{
.name = "observed addresses",
.source = seen_addrs_source_v1,
.target = seen_addrs_target_v1,
},
{
.name = "prepare flows",
.source = flow_migrate_source_pre,
.target = NULL,
},
{
.name = "transfer flows",
.source = flow_migrate_source,
.target = flow_migrate_target,
},
{ 0 },
};
/* Supported encoding versions, from latest (most preferred) to oldest */
static const struct migrate_version versions[] = {
{ 2, stages_v2, },
/* v1 was released, but not widely used. It had bad endianness for the
* MSS and omitted timestamps, which meant it usually wouldn't work.
* Therefore we don't attempt to support compatibility with it.
*/
{ 0 },
};
/* Current encoding version */
#define CURRENT_VERSION (&versions[0])
/**
* migrate_source() - Migration as source, send state to hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_source(struct ctx *c, int fd)
{
const struct migrate_version *v = CURRENT_VERSION;
const struct migrate_header header = {
.magic = htonll_constant(MIGRATE_MAGIC),
.version = htonl(v->id),
.compat_version = htonl(v->id),
};
const struct migrate_stage *s;
int ret;
if (write_all_buf(fd, &header, sizeof(header))) {
ret = errno;
err("Can't send migration header: %s, abort", strerror_(ret));
return ret;
}
for (s = v->s; s->name; s++) {
if (!s->source)
continue;
debug("Source side migration stage: %s", s->name);
if ((ret = s->source(c, s, fd))) {
err("Source migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_target_read_header() - Read header in target
* @fd: Descriptor for state transfer
*
* Return: version structure on success, NULL on failure with errno set
*/
static const struct migrate_version *migrate_target_read_header(int fd)
{
const struct migrate_version *v;
struct migrate_header h;
uint32_t id, compat_id;
if (read_all_buf(fd, &h, sizeof(h)))
return NULL;
id = ntohl(h.version);
compat_id = ntohl(h.compat_version);
debug("Source magic: 0x%016" PRIx64 ", version: %u, compat: %u",
ntohll(h.magic), id, compat_id);
if (ntohll(h.magic) != MIGRATE_MAGIC || !id || !compat_id) {
err("Invalid incoming device state");
errno = EINVAL;
return NULL;
}
for (v = versions; v->id; v++)
if (v->id <= id && v->id >= compat_id)
return v;
errno = ENOTSUP;
err("Unsupported device state version: %u", id);
return NULL;
}
/**
* migrate_target() - Migration as target, receive state from hypervisor
* @c: Execution context
* @fd: File descriptor for state transfer
*
* Return: 0 on success, positive error code on failure
*/
static int migrate_target(struct ctx *c, int fd)
{
const struct migrate_version *v;
const struct migrate_stage *s;
int ret;
if (!(v = migrate_target_read_header(fd)))
return errno;
for (s = v->s; s->name; s++) {
if (!s->target)
continue;
debug("Target side migration stage: %s", s->name);
if ((ret = s->target(c, s, fd))) {
err("Target migration stage: %s: %s, abort", s->name,
strerror_(ret));
return ret;
}
}
return 0;
}
/**
* migrate_init() - Set up things necessary for migration
* @c: Execution context
*/
void migrate_init(struct ctx *c)
{
c->device_state_result = -1;
}
/**
* migrate_close() - Close migration channel and connection to passt-repair
* @c: Execution context
*/
void migrate_close(struct ctx *c)
{
if (c->device_state_fd != -1) {
debug("Closing migration channel, fd: %d", c->device_state_fd);
close(c->device_state_fd);
c->device_state_fd = -1;
c->device_state_result = -1;
}
repair_close(c);
}
/**
* migrate_request() - Request a migration of device state
* @c: Execution context
* @fd: fd to transfer state
* @target: Are we the target of the migration?
*/
void migrate_request(struct ctx *c, int fd, bool target)
{
debug("Migration requested, fd: %d (was %d)", fd, c->device_state_fd);
if (c->device_state_fd != -1)
migrate_close(c);
c->device_state_fd = fd;
c->migrate_target = target;
}
/**
* migrate_handler() - Send/receive passt internal state to/from hypervisor
* @c: Execution context
*/
void migrate_handler(struct ctx *c)
{
int rc;
if (c->device_state_fd < 0)
return;
debug("Handling migration request from fd: %d, target: %d",
c->device_state_fd, c->migrate_target);
if (c->migrate_target)
rc = migrate_target(c, c->device_state_fd);
else
rc = migrate_source(c, c->device_state_fd);
migrate_close(c);
c->device_state_result = rc;
}

migrate.h (new file) | 51

@ -0,0 +1,51 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef MIGRATE_H
#define MIGRATE_H
/**
* struct migrate_header - Migration header from source
* @magic: 0xB1BB1D1B0BB1D1B0, network order
* @version: Highest known, target aborts if too old, network order
* @compat_version: Lowest version compatible with @version, target aborts
* if too new, network order
*/
struct migrate_header {
uint64_t magic;
uint32_t version;
uint32_t compat_version;
} __attribute__((packed));
/**
* struct migrate_stage - Callbacks and parameters for one stage of migration
* @name: Stage name (for debugging)
* @source: Callback to implement this stage on the source
* @target: Callback to implement this stage on the target
*/
struct migrate_stage {
const char *name;
int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
/* Add here separate rollback callbacks if needed */
};
/**
* struct migrate_version - Stages for a particular protocol version
* @id: Version number, host order
* @s: Ordered array of stages, NULL-terminated
*/
struct migrate_version {
uint32_t id;
const struct migrate_stage *s;
};
void migrate_init(struct ctx *c);
void migrate_close(struct ctx *c);
void migrate_request(struct ctx *c, int fd, bool target);
void migrate_handler(struct ctx *c);
#endif /* MIGRATE_H */
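To show how the version and stage tables compose, a hypothetical future v3 could be wired up as below, inside migrate.c where the callbacks are visible. Everything named v3 or new_state is invented for the illustration; the real table currently stops at v2, and the stage list here is trimmed for brevity:

static const struct migrate_stage stages_v3[] = {
	{
		.name	= "observed addresses",
		.source	= seen_addrs_source_v1,
		.target	= seen_addrs_target_v1,
	},
	{
		.name	= "transfer flows",
		.source	= flow_migrate_source,
		.target	= flow_migrate_target,
	},
	{
		.name	= "some new state",	/* invented stage */
		.source	= new_state_source,	/* hypothetical callback */
		.target	= new_state_target,	/* hypothetical callback */
	},
	{ 0 },
};

static const struct migrate_version versions[] = {
	{ 3, stages_v3, },	/* most preferred first */
	{ 2, stages_v2, },	/* kept for compatibility */
	{ 0 },
};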

ndp.c | 224

@ -33,6 +33,8 @@
#include "tap.h"
#include "log.h"
#define RT_LIFETIME 65535
#define RS 133
#define RA 134
#define NS 135
@ -158,7 +160,7 @@ struct ndp_ra {
unsigned char var[sizeof(struct opt_mtu) + sizeof(struct opt_rdnss) +
sizeof(struct opt_dnssl)];
} __attribute__((packed));
} __attribute__((packed, aligned(__alignof__(struct in6_addr))));
/**
* struct ndp_ns - NDP Neighbor Solicitation (NS) message
@ -168,19 +170,31 @@ struct ndp_ra {
struct ndp_ns {
struct icmp6hdr ih;
struct in6_addr target_addr;
} __attribute__((packed));
} __attribute__((packed, aligned(__alignof__(struct in6_addr))));
/**
* ndp() - Check for NDP solicitations, reply as needed
* ndp_send() - Send an NDP message
* @c: Execution context
* @ih: ICMPv6 header
* @saddr: Source IPv6 address
* @p: Packet pool
*
* Return: 0 if not handled here, 1 if handled, -1 on failure
* @dst: IPv6 address to send the message to
* @buf: ICMPv6 header + message payload
* @l4len: Length of message, including ICMPv6 header
*/
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
const struct pool *p)
static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
const void *buf, size_t l4len)
{
const struct in6_addr *src = &c->ip6.our_tap_ll;
tap_icmp6_send(c, src, dst, buf, l4len);
}
/**
* ndp_na() - Send an NDP Neighbour Advertisement (NA) message
* @c: Execution context
* @dst: IPv6 address to send the NA to
* @addr: IPv6 address to advertise
*/
static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
const struct in6_addr *addr)
{
struct ndp_na na = {
.ih = {
@ -190,6 +204,7 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
.icmp6_solicited = 1,
.icmp6_override = 1,
},
.target_addr = *addr,
.target_l2_addr = {
.header = {
.type = OPT_TARGET_L2_ADDR,
@ -197,13 +212,26 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
},
}
};
memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
ndp_send(c, dst, &na, sizeof(na));
}
/**
* ndp_ra() - Send an NDP Router Advertisement (RA) message
* @c: Execution context
* @dst: IPv6 address to send the RA to
*/
static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
{
struct ndp_ra ra = {
.ih = {
.icmp6_type = RA,
.icmp6_code = 0,
.icmp6_hop_limit = 255,
/* RFC 8319 */
.icmp6_rt_lifetime = htons_constant(65535),
.icmp6_rt_lifetime = htons_constant(RT_LIFETIME),
.icmp6_addrconf_managed = 1,
},
.prefix_info = {
@ -216,6 +244,7 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
.valid_lifetime = ~0U,
.pref_lifetime = ~0U,
},
.prefix = c->ip6.addr,
.source_ll = {
.header = {
.type = OPT_SRC_L2_ADDR,
@ -223,59 +252,26 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
},
},
};
const struct in6_addr *rsaddr; /* src addr for reply */
unsigned char *ptr = NULL;
size_t dlen;
if (ih->icmp6_type < RS || ih->icmp6_type > NA)
return 0;
ptr = &ra.var[0];
if (c->no_ndp)
return 1;
if (c->mtu) {
struct opt_mtu *mtu = (struct opt_mtu *)ptr;
*mtu = (struct opt_mtu) {
.header = {
.type = OPT_MTU,
.len = 1,
},
.value = htonl(c->mtu),
};
ptr += sizeof(struct opt_mtu);
}
if (ih->icmp6_type == NS) {
struct ndp_ns *ns = packet_get(p, 0, 0, sizeof(struct ndp_ns),
NULL);
if (!ns)
return -1;
if (IN6_IS_ADDR_UNSPECIFIED(saddr))
return 1;
info("NDP: received NS, sending NA");
memcpy(&na.target_addr, &ns->target_addr,
sizeof(na.target_addr));
memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
} else if (ih->icmp6_type == RS) {
if (!c->no_dhcp_dns) {
size_t dns_s_len = 0;
int i, n;
if (c->no_ra)
return 1;
info("NDP: received RS, sending RA");
memcpy(&ra.prefix, &c->ip6.addr, sizeof(ra.prefix));
ptr = &ra.var[0];
if (c->mtu != -1) {
struct opt_mtu *mtu = (struct opt_mtu *)ptr;
*mtu = (struct opt_mtu) {
.header = {
.type = OPT_MTU,
.len = 1,
},
.value = htonl(c->mtu),
};
ptr += sizeof(struct opt_mtu);
}
if (c->no_dhcp_dns)
goto dns_done;
for (n = 0; !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns[n]); n++);
if (n) {
struct opt_rdnss *rdnss = (struct opt_rdnss *)ptr;
@ -287,8 +283,7 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
.lifetime = ~0U,
};
for (i = 0; i < n; i++) {
memcpy(&rdnss->dns[i], &c->ip6.dns[i],
sizeof(rdnss->dns[i]));
rdnss->dns[i] = c->ip6.dns[i];
}
ptr += offsetof(struct opt_rdnss, dns) +
i * sizeof(rdnss->dns[0]);
@ -329,27 +324,110 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
memset(ptr, 0, 8 - dns_s_len % 8); /* padding */
ptr += 8 - dns_s_len % 8;
}
dns_done:
memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
} else {
return 1;
}
if (IN6_IS_ADDR_LINKLOCAL(saddr))
c->ip6.addr_ll_seen = *saddr;
else
c->ip6.addr_seen = *saddr;
memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
rsaddr = &c->ip6.our_tap_ll;
/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);
}
/**
* ndp() - Check for NDP solicitations, reply as needed
* @c: Execution context
* @ih: ICMPv6 header
* @saddr: Source IPv6 address
* @p: Packet pool
*
* Return: 0 if not handled here, 1 if handled, -1 on failure
*/
int ndp(const struct ctx *c, const struct icmp6hdr *ih,
const struct in6_addr *saddr, const struct pool *p)
{
if (ih->icmp6_type < RS || ih->icmp6_type > NA)
return 0;
if (c->no_ndp)
return 1;
if (ih->icmp6_type == NS) {
dlen = sizeof(struct ndp_na);
tap_icmp6_send(c, rsaddr, saddr, &na, dlen);
const struct ndp_ns *ns;
ns = packet_get(p, 0, 0, sizeof(struct ndp_ns), NULL);
if (!ns)
return -1;
if (IN6_IS_ADDR_UNSPECIFIED(saddr))
return 1;
info("NDP: received NS, sending NA");
ndp_na(c, saddr, &ns->target_addr);
} else if (ih->icmp6_type == RS) {
dlen = ptr - (unsigned char *)&ra;
tap_icmp6_send(c, rsaddr, saddr, &ra, dlen);
if (c->no_ra)
return 1;
info("NDP: received RS, sending RA");
ndp_ra(c, saddr);
}
return 1;
}
/* Default interval between unsolicited RAs (seconds) */
#define DEFAULT_MAX_RTR_ADV_INTERVAL 600 /* RFC 4861, 6.2.1 */
/* Minimum required interval between RAs (seconds) */
#define MIN_DELAY_BETWEEN_RAS 3 /* RFC 4861, 10 */
static time_t next_ra;
/**
* ndp_timer() - Send unsolicited NDP messages if necessary
* @c: Execution context
* @now: Current (monotonic) time
*/
void ndp_timer(const struct ctx *c, const struct timespec *now)
{
time_t max_rtr_adv_interval = DEFAULT_MAX_RTR_ADV_INTERVAL;
time_t min_rtr_adv_interval, interval;
if (c->fd_tap < 0 || c->no_ra || now->tv_sec < next_ra)
return;
/* We must advertise before the route's lifetime expires */
max_rtr_adv_interval = MIN(max_rtr_adv_interval, RT_LIFETIME - 1);
/* But we must not go smaller than the minimum delay */
max_rtr_adv_interval = MAX(max_rtr_adv_interval, MIN_DELAY_BETWEEN_RAS);
/* RFC 4861, 6.2.1 */
min_rtr_adv_interval = MAX(max_rtr_adv_interval / 3,
MIN_DELAY_BETWEEN_RAS);
/* As required by RFC 4861, we randomise the interval between
* unsolicited RAs. This is to prevent multiple routers on a link
* getting synchronised (e.g. after booting a bunch of routers at once)
* and causing flurries of RAs at the same time.
*
* This random doesn't need to be cryptographically strong, so random(3)
* is fine. Other routers on the link also want to avoid
* synchronisation, and anything malicious has much easier ways to cause
* trouble.
*
* The modulus also makes this not strictly a uniform distribution, but,
* again, it's close enough for our purposes.
*/
interval = min_rtr_adv_interval +
random() % (max_rtr_adv_interval - min_rtr_adv_interval);
if (!next_ra)
goto first;
info("NDP: sending unsolicited RA, next in %llds", (long long)interval);
ndp_ra(c, &in6addr_ll_all_nodes);
first:
next_ra = now->tv_sec + interval;
}
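With the defaults above, the interval arithmetic works out as follows (a worked example, not extra code):

	max_rtr_adv_interval = MIN(600, 65535 - 1)	/* -> 600 */
	max_rtr_adv_interval = MAX(600, 3)		/* -> 600 */
	min_rtr_adv_interval = MAX(600 / 3, 3)		/* -> 200 */
	interval = 200 + random() % (600 - 200)		/* 200..599 seconds */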

ndp.h | 7

@ -6,7 +6,10 @@
#ifndef NDP_H
#define NDP_H
int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr,
const struct pool *p);
struct icmp6hdr;
int ndp(const struct ctx *c, const struct icmp6hdr *ih,
const struct in6_addr *saddr, const struct pool *p);
void ndp_timer(const struct ctx *c, const struct timespec *now);
#endif /* NDP_H */

netlink.c

@ -297,6 +297,10 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
if (!thisifi)
continue; /* No interface for this route */
/* Skip 'lo': we should test IFF_LOOPBACK, but keep it simple */
if (thisifi == 1)
continue;
/* Skip routes to link-local addresses */
if (af == AF_INET && dst &&
IN4_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))
@ -320,7 +324,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
}
if (status < 0)
warn("netlink: RTM_GETROUTE failed: %s", strerror(-status));
warn("netlink: RTM_GETROUTE failed: %s", strerror_(-status));
if (defifi) {
if (ndef > 1) {
@ -351,9 +355,9 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
*
* Return: true if a gateway was found, false otherwise
*/
bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
static bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
{
size_t nh_len = RTA_PAYLOAD(rta);
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
bool found = false;
int hops = -1;
@ -582,7 +586,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
*(unsigned int *)RTA_DATA(rta) = ifi_dst;
} else if (rta->rta_type == RTA_MULTIPATH) {
size_t nh_len = RTA_PAYLOAD(rta);
int nh_len = RTA_PAYLOAD(rta);
struct rtnexthop *rtnh;
for (rtnh = (struct rtnexthop *)RTA_DATA(rta);

packet.c | 191

@ -22,12 +22,74 @@
#include "util.h"
#include "log.h"
/**
* packet_check_range() - Check if a memory range is valid for a pool
* @p: Packet pool
* @ptr: Start of desired data range
* @len: Length of desired data range
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: 0 if the range is valid, -1 otherwise
*/
static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
const char *func, int line)
{
if (len > PACKET_MAX_LEN) {
debug("packet range length %zu (max %zu), %s:%i",
len, PACKET_MAX_LEN, func, line);
return -1;
}
if (p->buf_size == 0) {
int ret;
ret = vu_packet_check_range((void *)p->buf, ptr, len);
if (ret == -1)
debug("cannot find region, %s:%i", func, line);
return ret;
}
if (ptr < p->buf) {
debug("packet range start %p before buffer start %p, %s:%i",
(void *)ptr, (void *)p->buf, func, line);
return -1;
}
if (len > p->buf_size) {
debug("packet range length %zu larger than buffer %zu, %s:%i",
len, p->buf_size, func, line);
return -1;
}
if ((size_t)(ptr - p->buf) > p->buf_size - len) {
debug("packet range %p, len %zu after buffer end %p, %s:%i",
(void *)ptr, len, (void *)(p->buf + p->buf_size),
func, line);
return -1;
}
return 0;
}
/**
* pool_full() - Is a packet pool full?
* @p: Pointer to packet pool
*
* Return: true if the pool is full, false if more packets can be added
*/
bool pool_full(const struct pool *p)
{
return p->count >= p->size;
}
/**
* packet_add_do() - Add data as packet descriptor to given pool
* @p: Existing pool
* @len: Length of new descriptor
* @start: Start of data
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*/
void packet_add_do(struct pool *p, size_t len, const char *start,
@ -35,44 +97,63 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
{
size_t idx = p->count;
if (idx >= p->size) {
trace("add packet index %zu to pool with size %zu, %s:%i",
if (pool_full(p)) {
debug("add packet index %zu to pool with size %zu, %s:%i",
idx, p->size, func, line);
return;
}
if (start < p->buf) {
trace("add packet start %p before buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
if (packet_check_range(p, start, len, func, line))
return;
}
if (start + len > p->buf + p->buf_size) {
trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
(void *)start, len, (void *)(p->buf + p->buf_size),
func, line);
return;
}
if (len > UINT16_MAX) {
trace("add packet length %zu, %s:%i", len, func, line);
return;
}
#if UINTPTR_MAX == UINT64_MAX
if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
trace("add packet start %p, buffer start %p, %s:%i",
(void *)start, (void *)p->buf, func, line);
return;
}
#endif
p->pkt[idx].offset = start - p->buf;
p->pkt[idx].len = len;
p->pkt[idx].iov_base = (void *)start;
p->pkt[idx].iov_len = len;
p->count++;
}
/**
* packet_get_try_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
* @idx: Index of packet descriptor in pool
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
*/
void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
{
char *ptr;
ASSERT_WITH_MSG(p->count <= p->size,
"Corrupt pool count: %zu, size: %zu, %s:%i",
p->count, p->size, func, line);
if (idx >= p->count) {
debug("packet %zu from pool count: %zu, %s:%i",
idx, p->count, func, line);
return NULL;
}
if (offset > p->pkt[idx].iov_len ||
len > (p->pkt[idx].iov_len - offset))
return NULL;
ptr = (char *)p->pkt[idx].iov_base + offset;
ASSERT_WITH_MSG(!packet_check_range(p, ptr, len, func, line),
"Corrupt packet pool, %s:%i", func, line);
if (left)
*left = p->pkt[idx].iov_len - offset - len;
return ptr;
}
/**
* packet_get_do() - Get data range from packet descriptor from given pool
* @p: Packet pool
@ -80,52 +161,24 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
* @offset: Offset of data range in packet descriptor
* @len: Length of desired data range
* @left: Length of available data after range, set on return, can be NULL
* @func: For tracing: name of calling function, NULL means no trace()
* @func: For tracing: name of calling function
* @line: For tracing: caller line of function call
*
* Return: pointer to start of data range, NULL on invalid range or descriptor
* Return: as packet_get_try_do() but log a trace message when returning NULL
*/
void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
size_t len, size_t *left, const char *func, int line)
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line)
{
if (idx >= p->size || idx >= p->count) {
if (func) {
trace("packet %zu from pool size: %zu, count: %zu, "
"%s:%i", idx, p->size, p->count, func, line);
}
return NULL;
void *r = packet_get_try_do(p, idx, offset, len, left, func, line);
if (!r) {
trace("missing packet data length %zu, offset %zu from "
"length %zu, %s:%i",
len, offset, p->pkt[idx].iov_len, func, line);
}
if (len > UINT16_MAX || len + offset > UINT32_MAX) {
if (func) {
trace("packet data length %zu, offset %zu, %s:%i",
len, offset, func, line);
}
return NULL;
}
if (p->pkt[idx].offset + len + offset > p->buf_size) {
if (func) {
trace("packet offset plus length %zu from size %zu, "
"%s:%i", p->pkt[idx].offset + len + offset,
p->buf_size, func, line);
}
return NULL;
}
if (len + offset > p->pkt[idx].len) {
if (func) {
trace("data length %zu, offset %zu from length %u, "
"%s:%i", len, offset, p->pkt[idx].len,
func, line);
}
return NULL;
}
if (left)
*left = p->pkt[idx].len - offset - len;
return p->buf + p->pkt[idx].offset + offset;
return r;
}
/**

packet.h

@ -6,20 +6,17 @@
#ifndef PACKET_H
#define PACKET_H
/**
* struct desc - Generic offset-based descriptor within buffer
* @offset: Offset of descriptor relative to buffer start, 32-bit limit
* @len: Length of descriptor, host order, 16-bit limit
*/
struct desc {
uint32_t offset;
uint16_t len;
};
#include <stdbool.h>
/* Maximum size of a single packet stored in pool, including headers */
#define PACKET_MAX_LEN ((size_t)UINT16_MAX)
/**
* struct pool - Generic pool of packets stored in a buffer
* @buf: Buffer storing packet descriptors
* @buf_size: Total size of buffer
* @buf: Buffer storing packet descriptors,
* a struct vu_dev_region array for passt vhost-user mode
* @buf_size: Total size of buffer,
* 0 for passt vhost-user mode
* @size: Number of usable descriptors for the pool
* @count: Number of used descriptors for the pool
* @pkt: Descriptors: see macros below
@ -29,32 +26,36 @@ struct pool {
size_t buf_size;
size_t size;
size_t count;
struct desc pkt[1];
struct iovec pkt[];
};
int vu_packet_check_range(void *buf, const char *ptr, size_t len);
void packet_add_do(struct pool *p, size_t len, const char *start,
const char *func, int line);
void *packet_get_try_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
void *packet_get_do(const struct pool *p, const size_t idx,
size_t offset, size_t len, size_t *left,
const char *func, int line);
bool pool_full(const struct pool *p);
void pool_flush(struct pool *p);
#define packet_add(p, len, start) \
packet_add_do(p, len, start, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_try_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
#define packet_get_try(p, idx, offset, len, left) \
packet_get_do(p, idx, offset, len, left, NULL, 0)
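/* A minimal, hypothetical usage sketch of this pool API (handler and frame
 * names are invented; in passt/pasta the protocol handlers do the equivalent):
 * add a frame that already lives in the pool's buffer, then pull a header out
 * of it, with bounds checking done by packet_get().
 */
#include <stddef.h>
#include <sys/uio.h>
#include <linux/if_ether.h>
#include "packet.h"

static void pool_usage_sketch(struct pool *p, char *frame, size_t frame_len)
{
	struct ethhdr *eh;
	size_t left;

	packet_add(p, frame_len, frame);	/* frame must point into p->buf */

	/* On return, left is the amount of data after the Ethernet header */
	eh = packet_get(p, 0, 0, sizeof(*eh), &left);
	if (!eh)
		return;	/* out of range: dropped, with a trace message */
}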
#define PACKET_POOL_DECL(_name, _size, _buf) \
struct _name ## _t { \
char *buf; \
size_t buf_size; \
size_t size; \
size_t count; \
struct desc pkt[_size]; \
struct iovec pkt[_size]; \
}
#define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size) \

passt-repair.1 (new file, 74 lines)

@ -0,0 +1,74 @@
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\" Copyright (c) 2025 Red Hat GmbH
.\" Author: Stefano Brivio <sbrivio@redhat.com>
.TH passt-repair 1
.SH NAME
.B passt-repair
\- Helper setting TCP_REPAIR socket options for \fBpasst\fR(1)
.SH SYNOPSIS
.B passt-repair
\fIPATH\fR
.SH DESCRIPTION
.B passt-repair
is a privileged helper setting and clearing repair mode on TCP sockets on behalf
of \fBpasst\fR(1), as instructed via single-byte commands over a UNIX domain
socket.
It can be used to migrate TCP connections between guests without granting
additional capabilities to \fBpasst\fR(1) itself: to migrate TCP connections,
\fBpasst\fR(1) leverages repair mode, which needs the \fBCAP_NET_ADMIN\fR
capability (see \fBcapabilities\fR(7)) to be set or cleared.
If \fIPATH\fR represents a UNIX domain socket, \fBpasst-repair\fR(1) attempts to
connect to it. If it is a directory, \fBpasst-repair\fR(1) waits until a file
ending with \fI.repair\fR appears in it, and then attempts to connect to it.
.SH PROTOCOL
\fBpasst-repair\fR(1) connects to \fBpasst\fR(1) using the socket specified via
\fI--repair-path\fR option in \fBpasst\fR(1) itself. By default, the name is the
same as the UNIX domain socket used for guest communication, suffixed by
\fI.repair\fR.
The messages consist of one 8-bit signed integer that can be \fITCP_REPAIR_ON\fR
(1), \fITCP_REPAIR_OFF\fR (0), or \fITCP_REPAIR_OFF_NO_WP\fR (-1), as defined by
the Linux kernel user API, and one to SCM_MAX_FD (253) sockets as SCM_RIGHTS
(see \fBunix\fR(7)) ancillary message, sent by the server, \fBpasst\fR(1).
The client, \fBpasst-repair\fR(1), replies with the same byte (and no ancillary
message) to indicate success, and closes the connection on failure.
The server closes the connection on error or completion.
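As a rough illustration of the server side of this exchange (a hypothetical
sketch based on the description above, not the actual passt code), the command
byte and the batch of sockets travel in a single sendmsg(2) call:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one command byte plus nfds sockets as SCM_RIGHTS over socket s.
 * Function name and error handling are invented for the example. */
static int repair_send_sketch(int s, int8_t cmd, const int *fds, int nfds)
{
	char buf[CMSG_SPACE(sizeof(int) * 253)]	/* 253: SCM_MAX_FD */
	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = buf,
			      .msg_controllen = CMSG_SPACE(sizeof(int) * nfds) };
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);

	return sendmsg(s, &msg, 0) == (ssize_t)sizeof(cmd) ? 0 : -1;
}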
.SH NOTES
\fBpasst-repair\fR(1) can be granted the \fBCAP_NET_ADMIN\fR capability
(preferred, as it limits privileges to the strictly necessary ones), or it can
be run as root.
.SH AUTHOR
Stefano Brivio <sbrivio@redhat.com>.
.SH REPORTING BUGS
Please report issues on the bug tracker at https://bugs.passt.top/, or
send a message to the passt-user@passt.top mailing list, see
https://lists.passt.top/.
.SH COPYRIGHT
Copyright (c) 2025 Red Hat GmbH.
\fBpasst-repair\fR is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by the Free
Software Foundation, either version 2 of the License, or (at your option) any
later version.
.SH SEE ALSO
\fBpasst\fR(1), \fBqemu\fR(1), \fBcapabilities\fR(7), \fBunix\fR(7).

passt-repair.c (new file, 266 lines)

@ -0,0 +1,266 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*
* Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
* with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
* off. Reply by echoing the command. Exit on EOF.
*/
#include <sys/inotify.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
#include <linux/audit.h>
#include <linux/capability.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include "seccomp_repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
#define REPAIR_EXT ".repair"
#define REPAIR_EXT_LEN strlen(REPAIR_EXT)
/**
* main() - Entry point and whole program with loop
* @argc: Argument count, must be 2
* @argv: Argument: path of UNIX domain socket to connect to
*
* Return: 0 on success (EOF), 1 on error, 2 on usage error
*
* #syscalls:repair connect setsockopt write close exit_group
* #syscalls:repair socket s390x:socketcall i686:socketcall
* #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
* #syscalls:repair sendto sendmsg arm:send ppc64le:send
* #syscalls:repair stat|statx stat64|statx statx
* #syscalls:repair fstat|fstat64 newfstatat|fstatat64
* #syscalls:repair inotify_init1 inotify_add_watch
*/
int main(int argc, char **argv)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr))));
struct sockaddr_un a = { AF_UNIX, "" };
int fds[SCM_MAX_FD], s, ret, i, n = 0;
bool inotify_dir = false;
struct sock_fprog prog;
int8_t cmd = INT8_MAX;
struct cmsghdr *cmsg;
struct msghdr msg;
struct iovec iov;
size_t cmsg_len;
struct stat sb;
int op;
prctl(PR_SET_DUMPABLE, 0);
prog.len = (unsigned short)sizeof(filter_repair) /
sizeof(filter_repair[0]);
prog.filter = filter_repair;
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
fprintf(stderr, "Failed to apply seccomp filter");
_exit(1);
}
iov = (struct iovec){ &cmd, sizeof(cmd) };
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = sizeof(buf),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
if (argc != 2) {
fprintf(stderr, "Usage: %s PATH\n", argv[0]);
_exit(2);
}
if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
_exit(1);
}
if ((stat(argv[1], &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", argv[1], errno);
_exit(1);
}
if ((sb.st_mode & S_IFMT) == S_IFDIR) {
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev = NULL;
char path[PATH_MAX + 1];
bool found = false;
ssize_t n;
int fd;
if ((fd = inotify_init1(IN_CLOEXEC)) < 0) {
fprintf(stderr, "inotify_init1: %i\n", errno);
_exit(1);
}
if (inotify_add_watch(fd, argv[1], IN_CREATE) < 0) {
fprintf(stderr, "inotify_add_watch: %i\n", errno);
_exit(1);
}
do {
char *p;
n = read(fd, buf, sizeof(buf));
if (n < 0) {
fprintf(stderr, "inotify read: %i", errno);
_exit(1);
}
buf[n - 1] = '\0';
if (n < (ssize_t)sizeof(*ev)) {
fprintf(stderr, "Short inotify read: %zi", n);
continue;
}
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
if (ev->len >= REPAIR_EXT_LEN &&
!memcmp(ev->name +
strnlen(ev->name, ev->len) -
REPAIR_EXT_LEN,
REPAIR_EXT, REPAIR_EXT_LEN)) {
found = true;
break;
}
}
} while (!found);
if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
fprintf(stderr, "Invalid filename from inotify\n");
_exit(1);
}
snprintf(path, sizeof(path), "%s/%s", argv[1], ev->name);
if ((stat(path, &sb))) {
fprintf(stderr, "Can't stat() %s: %i\n", path, errno);
_exit(1);
}
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", path);
inotify_dir = true;
} else {
ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
}
if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
fprintf(stderr, "Invalid socket path");
_exit(2);
}
if ((sb.st_mode & S_IFMT) != S_IFSOCK) {
fprintf(stderr, "%s is not a socket\n", a.sun_path);
_exit(2);
}
while (connect(s, (struct sockaddr *)&a, sizeof(a))) {
if (inotify_dir && errno == ECONNREFUSED)
continue;
fprintf(stderr, "Failed to connect to %s: %s\n", a.sun_path,
strerror(errno));
_exit(1);
}
loop:
ret = recvmsg(s, &msg, 0);
if (ret < 0) {
if (errno == ECONNRESET) {
ret = 0;
} else {
fprintf(stderr, "Failed to read message: %i\n", errno);
_exit(1);
}
}
if (!ret) /* Done */
_exit(0);
if (!cmsg ||
cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
cmsg->cmsg_type != SCM_RIGHTS) {
fprintf(stderr, "No/bad ancillary data from peer\n");
_exit(1);
}
/* There's no inverse formula for CMSG_LEN(x); deriving one from CMSG_LEN(0)
* happens to work, but it's not guaranteed to. Search the whole domain.
*/
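/* Worked example (sizes are typical for 64-bit Linux, an assumption of this
* illustration only): CMSG_LEN(sizeof(int) * i) is 16 + 4 * i there, so a
* cmsg_len of 24 means i = 2 descriptors were attached.
*/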
for (i = 1; i <= SCM_MAX_FD; i++) {
if (CMSG_LEN(sizeof(int) * i) == cmsg->cmsg_len) {
n = i;
break;
}
}
if (!n) {
cmsg_len = cmsg->cmsg_len; /* socklen_t is 'unsigned' on musl */
fprintf(stderr, "Invalid ancillary data length %zu from peer\n",
cmsg_len);
_exit(1);
}
memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
cmd != TCP_REPAIR_OFF_NO_WP) {
fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
_exit(1);
}
op = cmd;
for (i = 0; i < n; i++) {
if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &op, sizeof(op))) {
fprintf(stderr,
"Setting TCP_REPAIR to %i on socket %i: %s", op,
fds[i], strerror(errno));
_exit(1);
}
/* Close _our_ copy */
close(fds[i]);
}
/* Confirm setting by echoing the command back */
if (send(s, &cmd, sizeof(cmd), 0) < 0) {
fprintf(stderr, "Reply to %i: %s\n", op, strerror(errno));
_exit(1);
}
goto loop;
return 0;
}

passt.1 (176 lines changed)

@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change.
Default is to fork into background.
.TP
.BR \-e ", " \-\-stderr
.BR \-e ", " \-\-stderr " " (DEPRECATED)
This option has no effect, and is maintained for compatibility purposes only.
Note that this configuration option is \fBdeprecated\fR and will be removed in a
@ -160,7 +160,9 @@ once for IPv6).
By default, assigned IPv4 and IPv6 addresses are taken from the host interfaces
with the first default route, if any, for the corresponding IP version. If no
default routes are available and there is any interface with any route for a
given IP version, the first of these interfaces will be chosen instead.
given IP version, the first of these interfaces will be chosen instead. If no
such interface exists, the link-local address 169.254.2.1 is assigned for IPv4,
and no additional address will be assigned for IPv6.
.TP
.BR \-n ", " \-\-netmask " " \fImask
@ -174,8 +176,7 @@ according to the CIDR block of the assigned address (RFC 4632).
.BR \-M ", " \-\-mac-addr " " \fIaddr
Use source MAC address \fIaddr\fR when communicating to the guest or to the
target namespace.
Default is to use the MAC address of the interface with the first IPv4 default
route on the host.
Default is the locally administered MAC address 9a:55:9a:55:9a:55.
.TP
.BR \-g ", " \-\-gateway " " \fIaddr
@ -188,7 +189,9 @@ first default route, if any, for the corresponding IP version. If the default
route is a multipath one, the gateway is the first nexthop router returned by
the kernel which has the highest weight in the set of paths. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
interface will be chosen instead. If no such interface exists, the link-local
address 169.254.2.2 is used for IPv4, and the link-local address fe80::1 is used
for IPv6.
Note: these addresses are also used as source address for packets directed to
the guest or to the target namespace having a loopback or local source address,
@ -203,7 +206,9 @@ Default is to use the interfaces specified by \fB--outbound-if4\fR and
If no interfaces are given, the interface with the first default routes for each
IP version is selected. If no default routes are available and there is just one
interface with any route, that interface will be chosen instead.
interface with any route, that interface will be chosen instead. If no such
interface exists, host interfaces will be ignored for the purposes of assigning
addresses and routes, and link-local addresses will be used instead.
.TP
.BR \-o ", " \-\-outbound " " \fIaddr
@ -222,7 +227,8 @@ derive IPv4 addresses and routes.
By default, the interface given by the default route is selected. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
interface will be chosen instead. If no such interface exists, outbound sockets
will not be bound to any specific interface.
.TP
.BR \-\-outbound-if6 " " \fIname
@ -232,7 +238,8 @@ derive IPv6 addresses and routes.
By default, the interface given by the default route is selected. If no default
routes are available and there is just one interface with any route, that
interface will be chosen instead.
interface will be chosen instead. If no such interface exists, outbound sockets
will not be bound to any specific interface.
.TP
.BR \-D ", " \-\-dns " " \fIaddr
@ -249,10 +256,19 @@ the host.
.TP
.BR \-\-dns-forward " " \fIaddr
Map \fIaddr\fR (IPv4 or IPv6) as seen from guest or namespace to the
first configured DNS resolver (with corresponding IP version). Maps
only UDP and TCP traffic to port 53 or port 853. Replies are
translated back with a reverse mapping. This option can be specified
zero to two times (once for IPv4, once for IPv6).
nameserver (with corresponding IP version) specified by the
\fB\-\-dns-host\fR option. Maps only UDP and TCP traffic to port 53 or
port 853. Replies are translated back with a reverse mapping. This
option can be specified zero to two times (once for IPv4, once for
IPv6).
.TP
.BR \-\-dns-host " " \fIaddr
Configure the host nameserver to which guest or namespace queries sent to the
\fB\-\-dns-forward\fR address will be redirected. This option can
be specified zero to two times (once for IPv4, once for IPv6).
By default, the first nameserver from the host's
\fI/etc/resolv.conf\fR.
.TP
.BR \-S ", " \-\-search " " \fIlist
@ -327,6 +343,16 @@ namespace will be silently dropped.
Disable Router Advertisements. Router Solicitations coming from guest or target
namespace will be ignored.
.TP
.BR \-\-freebind
Allow any binding address to be specified for \fB-t\fR and \fB-u\fR
options. Usually binding addresses must be addresses currently
configured on the host. With \fB\-\-freebind\fR, the
\fBIP_FREEBIND\fR or \fBIPV6_FREEBIND\fR socket option is enabled
allowing any address to be used. This is typically used to bind
addresses which might be configured on the host in future, at which
point the forwarding will immediately start operating.
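As a hypothetical illustration of the socket option this enables (address, port,
and function name are invented for the example; this is not passt code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* With IP_FREEBIND set, bind() succeeds for 203.0.113.5 even if that address
 * isn't configured on the host yet; error handling is omitted for brevity. */
static int freebind_listen_sketch(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_port = htons(8080) };
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int one = 1;

	inet_pton(AF_INET, "203.0.113.5", &a.sin_addr);
	setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one));
	return bind(s, (struct sockaddr *)&a, sizeof(a));
}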
.TP
.BR \-\-map-host-loopback " " \fIaddr
Translate \fIaddr\fR to refer to the host. Packets from the guest to
@ -354,14 +380,14 @@ Translate \fIaddr\fR in the guest to be equal to the guest's assigned
address on the host. That is, packets from the guest to \fIaddr\fR
will be redirected to the address assigned to the guest with \fB-a\fR,
or by default the host's global address. This allows the guest to
access services availble on the host's global address, even though its
access services available on the host's global address, even though its
own address shadows that of the host.
If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one
IPv6 address can be translated, and if the option is specified
multiple times, the last one for each address type takes effect.
Default is no mapping.
By default, mapping happens as described for the \-\-map-host-loopback option.
.TP
.BR \-4 ", " \-\-ipv4-only
@ -375,15 +401,44 @@ Enable IPv6-only operation. IPv4 traffic will be ignored.
By default, IPv4 operation is enabled as long as at least an IPv4 route and an
interface address are configured on a given host interface.
.TP
.BR \-H ", " \-\-hostname " " \fIname
Hostname to configure the client with.
Send \fIname\fR as DHCP option 12 (hostname).
.TP
.BR \-\-fqdn " " \fIname
FQDN to configure the client with.
Send \fIname\fR as Client FQDN: DHCP option 81 and DHCPv6 option 39.
.SS \fBpasst\fR-only options
.TP
.BR \-s ", " \-\-socket " " \fIpath
.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
\fBpasst\fR.
Default is to probe a free socket, not accepting connections, starting from
\fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
.TP
.BR \-\-vhost-user
Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
.TP
.BR \-\-print-capabilities
Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
.TP
.BR \-\-repair-path " " \fIpath
Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
migration. \fB--repair-path none\fR disables this interface (if you need to
specify a socket path called "none" you can prefix the path by \fI./\fR).
Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
chosen for the hypervisor UNIX domain socket. No socket is created if not in
\-\-vhost-user mode.
.TP
.BR \-F ", " \-\-fd " " \fIFD
Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
@ -485,6 +540,7 @@ Default is \fBnone\fR.
.BR \-I ", " \-\-ns-ifname " " \fIname
Name of tap interface to be created in target namespace.
By default, the same interface name as the external, routable interface is used.
If no such interface exists, the name \fItap0\fR will be used instead.
.TP
.BR \-t ", " \-\-tcp-ports " " \fIspec
@ -586,6 +642,13 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
.TP
.BR \-\-host-lo-to-ns-lo
If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
the host's loopback address will appear on the loopback address in the
guest as well. Without this option such forwarded packets will appear
to come from the guest's public address.
.TP
.BR \-\-userns " " \fIspec
Target user namespace to join, as a path. If PID is given, without this option,
@ -653,6 +716,11 @@ Configure MAC address \fIaddr\fR on the tap interface in the namespace.
Default is to let the tap driver build a pseudorandom hardware address.
.TP
.BR \-\-no-splice
Disable the bypass path for inbound, local traffic. See the section \fBHandling
of local traffic in pasta\fR in the \fBNOTES\fR for more details.
.SH EXAMPLES
.SS \fBpasta
@ -863,26 +931,31 @@ root@localhost's password:
.SH NOTES
.SS Handling of traffic with local destination and source addresses
.SS Handling of traffic with loopback destination and source addresses
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address,
depending on the configuration. Local destination or source addresses need to be
changed before packets are delivered to the guest or target namespace: most
operating systems would drop packets received from non-loopback interfaces with
local addresses, and it would also be impossible for guest or target namespace
to route answers back.
Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback
address (127.0.0.0/8 or ::1), depending on the configuration. Loopback
destination or source addresses need to be changed before packets are
delivered to the guest or target namespace: most operating systems
would drop packets received with loopback addresses on non-loopback
interfaces, and it would also be impossible for guest or target
namespace to route answers back.
For convenience, and somewhat arbitrarily, the source address on these packets
is translated to the address of the default IPv4 or IPv6 gateway (if any) --
this is known to be an existing, valid address on the same subnet.
For convenience, the source address on these packets is translated to
the address specified by the \fB\-\-map-host-loopback\fR option (with
some exceptions in pasta mode, see next section below). If not
specified this defaults, somewhat arbitrarily, to the address of
default IPv4 or IPv6 gateway (if any) -- this is known to be an
existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
\fB\-\-map-host-loopback none\fR are specified this translation is
disabled and packets with loopback addresses are simply dropped.
Loopback destination addresses are instead translated to the observed external
address of the guest or target namespace. For IPv6 packets, if usage of a
link-local address by guest or namespace has ever been observed, and the
original destination address is also a link-local address, the observed
link-local address is used. Otherwise, the observed global address is used. For
both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses
will be used instead.
Loopback destination addresses are translated to the observed external
address of the guest or target namespace. For IPv6, the observed
link-local address is used if the translated source address is
link-local, otherwise the observed global address is used. For both
IPv4 and IPv6, if no addresses have been seen yet, the configured
addresses will be used instead.
For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1,
with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
@ -890,11 +963,15 @@ the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
Similarly, for traffic coming from guest or namespace, packets with destination
address corresponding to the default gateway will have their destination address
translated to a loopback address, if and only if a packet, in the opposite
direction, with a loopback destination or source address, port-wise matching for
UDP, or connection-wise for TCP, has been recently forwarded to guest or
namespace. This behaviour can be disabled with \-\-no\-map\-gw.
address corresponding to the \fB\-\-map-host-loopback\fR address will have their
destination address translated to a loopback address.
As an exception, traffic identified as DNS, originally directed to the
\fB\-\-map-host-loopback\fR address, if this address matches a resolver address
on the host, is \fBnot\fR translated to loopback, but rather handled in the same
way as if specified as \-\-dns-forward address, if no such option was given.
In the common case where the host gateway also acts as a resolver, this avoids
having the host mapping shadow the gateway/resolver itself.
.SS Handling of local traffic in pasta
@ -910,8 +987,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair
of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet
transfers.
This bypass only applies to local connections and traffic, because it's not
possible to bind sockets to foreign addresses.
Because it's not possible to bind sockets to foreign addresses, this
bypass only applies to local connections and traffic. It also means
that the address translation differs slightly from passt mode.
Connections from loopback to loopback on the host will appear to come
from the target namespace's public address within the guest, unless
\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will
appear to come from loopback in the namespace as well. The latter
behaviour used to be the default, but is usually undesirable, since it
can unintentionally expose namespace local services to the host.
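The general shape of the TCP bypass is sketched below (invented helper name,
no error handling or pipe reuse, not pasta's actual implementation): data moves
from one connected socket to the other through a pipe, without ever being
copied to user space.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to 64 KiB from one connected socket to another in kernel space */
static ssize_t splice_once_sketch(int from_sock, int to_sock)
{
	int pipefd[2];
	ssize_t n;

	if (pipe2(pipefd, O_NONBLOCK))
		return -1;

	n = splice(from_sock, NULL, pipefd[1], NULL, 65536, SPLICE_F_MOVE);
	if (n > 0)
		n = splice(pipefd[0], NULL, to_sock, NULL, n, SPLICE_F_MOVE);

	close(pipefd[0]);
	close(pipefd[1]);
	return n;
}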
.SS Binding to low numbered ports (well-known or system ports, up to 1023)
@ -996,6 +1080,20 @@ If the sending window cannot be queried, it will always be announced as the
current sending buffer size to guest or target namespace. This might affect
throughput of TCP connections.
.SS Local mode for disconnected setups
If \fBpasst\fR and \fBpasta\fR fail to find a host interface with a configured
address, other than loopback addresses, they will, obviously, not attempt to
source addresses or routes from the host.
In this case, unless configured otherwise, they will assign the IPv4 link-local
address 169.254.2.1 to the guest or target namespace, and no IPv6 address. The
notion of the guest or target namespace IPv6 address is derived from the first
link-local address observed.
Default gateways will be assigned as the link-local address 169.254.2.2 for
IPv4, and as the link-local address fe80::1 for IPv6.
.SH LIMITATIONS
Currently, IGMP/MLD proxying (RFC 4605) and support for SCTP (RFC 4960) are not

passt.c (115 lines changed)

@ -36,9 +36,6 @@
#include <sys/prctl.h>
#include <netinet/if_ether.h>
#include <libgen.h>
#ifdef HAS_GETRANDOM
#include <sys/random.h>
#endif
#include "util.h"
#include "passt.h"
@ -52,6 +49,10 @@
#include "arch.h"
#include "log.h"
#include "tcp_splice.h"
#include "ndp.h"
#include "vu_common.h"
#include "migrate.h"
#include "repair.h"
#define EPOLL_EVENTS 8
@ -67,13 +68,17 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
[EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket",
[EPOLL_TYPE_UDP_REPLY] = "UDP reply socket",
[EPOLL_TYPE_UDP] = "UDP flow socket",
[EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket",
[EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch",
[EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch",
[EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device",
[EPOLL_TYPE_TAP_PASST] = "connected qemu socket",
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
[EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket",
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@ -110,40 +115,25 @@ static void post_handler(struct ctx *c, const struct timespec *now)
flow_defer_handler(c, now);
#undef CALL_PROTO_HANDLER
if (!c->no_ndp)
ndp_timer(c, now);
}
/**
* secret_init() - Create secret value for SipHash calculations
* random_init() - Initialise things based on random data
* @c: Execution context
*/
static void secret_init(struct ctx *c)
static void random_init(struct ctx *c)
{
#ifndef HAS_GETRANDOM
int dev_random = open("/dev/random", O_RDONLY);
unsigned int random_read = 0;
unsigned int seed;
while (dev_random && random_read < sizeof(c->hash_secret)) {
int ret = read(dev_random,
(uint8_t *)&c->hash_secret + random_read,
sizeof(c->hash_secret) - random_read);
/* Create secret value for SipHash calculations */
raw_random(&c->hash_secret, sizeof(c->hash_secret));
if (ret == -1 && errno == EINTR)
continue;
if (ret <= 0)
break;
random_read += ret;
}
if (dev_random >= 0)
close(dev_random);
if (random_read < sizeof(c->hash_secret))
#else
if (getrandom(&c->hash_secret, sizeof(c->hash_secret),
GRND_RANDOM) < 0)
#endif /* !HAS_GETRANDOM */
die_perror("Failed to get random bytes for hash table and TCP");
/* Seed pseudo-RNG for things that need non-cryptographic random */
raw_random(&seed, sizeof(seed));
srandom(seed);
}
/**
@ -176,11 +166,11 @@ void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
*
* #syscalls exit_group
*/
void exit_handler(int signal)
static void exit_handler(int signal)
{
(void)signal;
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
/**
@ -194,26 +184,27 @@ void exit_handler(int signal)
* #syscalls socket getsockopt setsockopt s390x:socketcall i686:socketcall close
* #syscalls bind connect recvfrom sendto shutdown
* #syscalls arm:recv ppc64le:recv arm:send ppc64le:send
* #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
* #syscalls accept4 accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
* #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64
*/
int main(int argc, char **argv)
{
struct epoll_event events[EPOLL_EVENTS];
int nfds, i, devnull_fd = -1;
char argv0[PATH_MAX], *name;
struct ctx c = { 0 };
struct rlimit limit;
struct timespec now;
struct sigaction sa;
clock_gettime(CLOCK_MONOTONIC, &log_start);
if (clock_gettime(CLOCK_MONOTONIC, &log_start))
die_perror("Failed to get CLOCK_MONOTONIC time");
arch_avx2_exec(argv);
isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
c.device_state_fd = -1;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
@ -221,27 +212,18 @@ int main(int argc, char **argv)
sigaction(SIGTERM, &sa, NULL);
sigaction(SIGQUIT, &sa, NULL);
if (argc < 1)
exit(EXIT_FAILURE);
c.mode = conf_mode(argc, argv);
strncpy(argv0, argv[0], PATH_MAX - 1);
name = basename(argv0);
if (strstr(name, "pasta")) {
if (c.mode == MODE_PASTA) {
sa.sa_handler = pasta_child_handler;
if (sigaction(SIGCHLD, &sa, NULL))
die_perror("Couldn't install signal handlers");
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
c.mode = MODE_PASTA;
} else if (strstr(name, "passt")) {
c.mode = MODE_PASST;
} else {
exit(EXIT_FAILURE);
}
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
die_perror("Couldn't set disposition for SIGPIPE");
madvise(pkt_buf, sizeof(pkt_buf), MADV_HUGEPAGE);
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
if (c.epollfd == -1)
@ -261,16 +243,17 @@ int main(int argc, char **argv)
pasta_netns_quit_init(&c);
tap_sock_init(&c);
tap_backend_init(&c);
secret_init(&c);
random_init(&c);
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
die_perror("Failed to get CLOCK_MONOTONIC time");
flow_init();
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
exit(EXIT_FAILURE);
_exit(EXIT_FAILURE);
proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
@ -307,13 +290,15 @@ int main(int argc, char **argv)
timer_init(&c, &now);
loop:
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
/* NOLINTBEGIN(bugprone-branch-clone): intervals can be the same */
/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
/* NOLINTEND(bugprone-branch-clone) */
if (nfds == -1 && errno != EINTR)
die_perror("epoll_wait() failed in main loop");
clock_gettime(CLOCK_MONOTONIC, &now);
if (clock_gettime(CLOCK_MONOTONIC, &now))
err_perror("Failed to get CLOCK_MONOTONIC time");
for (i = 0; i < nfds; i++) {
union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
@ -354,12 +339,24 @@ loop:
case EPOLL_TYPE_UDP_LISTEN:
udp_listen_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_UDP_REPLY:
udp_reply_sock_handler(&c, ref, eventmask, &now);
case EPOLL_TYPE_UDP:
udp_sock_handler(&c, ref, eventmask, &now);
break;
case EPOLL_TYPE_PING:
icmp_sock_handler(&c, ref);
break;
case EPOLL_TYPE_VHOST_CMD:
vu_control_handler(c.vdev, c.fd_tap, eventmask);
break;
case EPOLL_TYPE_VHOST_KICK:
vu_kick_cb(c.vdev, ref, &now);
break;
case EPOLL_TYPE_REPAIR_LISTEN:
repair_listen_handler(&c, eventmask);
break;
case EPOLL_TYPE_REPAIR:
repair_handler(&c, eventmask);
break;
default:
/* Can't happen */
ASSERT(0);
@ -368,5 +365,7 @@ loop:
post_handler(&c, &now);
migrate_handler(&c);
goto loop;
}

passt.h (51 lines changed)

@ -20,11 +20,13 @@ union epoll_ref;
#include "siphash.h"
#include "ip.h"
#include "inany.h"
#include "migrate.h"
#include "flow.h"
#include "icmp.h"
#include "fwd.h"
#include "tcp.h"
#include "udp.h"
#include "vhost_user.h"
/* Default address for our end on the tap interface. Bit 0 of byte 0 must be 0
* (unicast) and bit 1 of byte 1 must be 1 (locally administered). Otherwise
@ -43,6 +45,7 @@ union epoll_ref;
* @icmp: ICMP-specific reference part
* @data: Data handled by protocol handlers
* @nsdir_fd: netns dirfd for fallback timer checking if namespace is gone
* @queue: vhost-user queue index for this fd
* @u64: Opaque reference for epoll_ctl() and epoll_wait()
*/
union epoll_ref {
@ -58,6 +61,7 @@ union epoll_ref {
union udp_listen_epoll_ref udp;
uint32_t data;
int nsdir_fd;
int queue;
};
};
uint64_t u64;
@ -65,12 +69,9 @@ union epoll_ref {
static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
"epoll_ref must have same size as epoll_data");
#define TAP_BUF_BYTES \
ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
#define TAP_MSGS \
DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
/* Large enough for ~128 maximum size frames */
#define PKT_BUF_BYTES (8UL << 20)
#define PKT_BUF_BYTES MAX(TAP_BUF_BYTES, 0)
extern char pkt_buf [PKT_BUF_BYTES];
extern char *epoll_type_str[];
@ -94,6 +95,7 @@ struct fqdn {
enum passt_modes {
MODE_PASST,
MODE_PASTA,
MODE_VU,
};
/**
@ -189,6 +191,7 @@ struct ip6_ctx {
* @foreground: Run in foreground, don't log to stderr by default
* @nofile: Maximum number of open files (ulimit -n)
* @sock_path: Path for UNIX domain socket
* @repair_path: TCP_REPAIR helper path, can be "none", empty for default
* @pcap: Path for packet capture file
* @pidfile: Path to PID file, empty string if not configured
* @pidfile_fd: File descriptor for PID file, -1 if none
@ -199,13 +202,17 @@ struct ip6_ctx {
* @epollfd: File descriptor for epoll instance
* @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any
* @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket
* @fd_repair_listen: File descriptor for listening TCP_REPAIR socket, if any
* @fd_repair: Connected AF_UNIX socket for TCP_REPAIR helper
* @our_tap_mac: Pasta/passt's MAC on the tap link
* @guest_mac: MAC address of guest or namespace, seen or configured
* @hash_secret: 128-bit secret for siphash functions
* @ifi4: Index of template interface for IPv4, 0 if IPv4 disabled
* @ifi4: Template interface for IPv4, -1: none, 0: IPv4 disabled
* @ip: IPv4 configuration
* @dns_search: DNS search list
* @ifi6: Index of template interface for IPv6, 0 if IPv6 disabled
* @hostname: Guest hostname
* @fqdn: Guest FQDN
* @ifi6: Template interface for IPv6, -1: none, 0: IPv6 disabled
* @ip6: IPv6 configuration
* @pasta_ifn: Name of namespace interface for pasta
* @pasta_ifi: Index of namespace interface for pasta
@ -225,8 +232,15 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
* @no_splice: Disable socket splicing for inbound traffic
* @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @freebind: Allow binding of non-local addresses for forwarding
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
* @vdev: vhost-user device
* @device_state_fd: Device state migration channel
* @device_state_result: Device state migration result
* @migrate_target: Are we the target, on the next migration request?
*/
struct ctx {
enum passt_modes mode;
@ -236,6 +250,7 @@ struct ctx {
int foreground;
int nofile;
char sock_path[UNIX_PATH_MAX];
char repair_path[UNIX_PATH_MAX];
char pcap[PATH_MAX];
char pidfile[PATH_MAX];
@ -252,16 +267,23 @@ struct ctx {
int epollfd;
int fd_tap_listen;
int fd_tap;
int fd_repair_listen;
int fd_repair;
unsigned char our_tap_mac[ETH_ALEN];
unsigned char guest_mac[ETH_ALEN];
uint16_t mtu;
uint64_t hash_secret[2];
unsigned int ifi4;
int ifi4;
struct ip4_ctx ip4;
struct fqdn dns_search[MAXDNSRCH];
unsigned int ifi6;
char hostname[PASST_MAXDNAME];
char fqdn[PASST_MAXDNAME];
int ifi6;
struct ip6_ctx ip6;
char pasta_ifn[IF_NAMESIZE];
@ -275,7 +297,6 @@ struct ctx {
int no_icmp;
struct icmp_ctx icmp;
int mtu;
int no_dns;
int no_dns_search;
int no_dhcp_dns;
@ -284,9 +305,19 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
int no_splice;
int host_lo_to_ns_lo;
int freebind;
int low_wmem;
int low_rmem;
struct vu_dev *vdev;
/* Migration */
int device_state_fd;
int device_state_result;
bool migrate_target;
};
void proto_update_l2_buf(const unsigned char *eth_d,

pasta.c (88 lines changed)

@ -73,12 +73,12 @@ void pasta_child_handler(int signal)
!waitid(P_PID, pasta_child_pid, &infop, WEXITED | WNOHANG)) {
if (infop.si_pid == pasta_child_pid) {
if (infop.si_code == CLD_EXITED)
exit(infop.si_status);
_exit(infop.si_status);
/* If killed by a signal, si_status is the number.
* Follow common shell convention of returning it + 128.
*/
exit(infop.si_status + 128);
_exit(infop.si_status + 128);
/* Nothing to do, detached PID namespace going away */
}
@ -102,7 +102,9 @@ static int pasta_wait_for_ns(void *arg)
int flags = O_RDONLY | O_CLOEXEC;
char ns[PATH_MAX];
snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
if (snprintf_check(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid))
die_perror("Can't build netns path");
do {
while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
if (errno != ENOENT)
@ -167,10 +169,12 @@ void pasta_open_ns(struct ctx *c, const char *netns)
* struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
* @exe: Executable to run
* @argv: Command and arguments to run
* @ctx: Context to read config from
*/
struct pasta_spawn_cmd_arg {
const char *exe;
char *const *argv;
struct ctx *c;
};
/**
@ -184,6 +188,7 @@ static int pasta_spawn_cmd(void *arg)
{
char hostname[HOST_NAME_MAX + 1] = HOSTNAME_PREFIX;
const struct pasta_spawn_cmd_arg *a;
size_t conf_hostname_len;
sigset_t set;
/* We run in a detached PID and mount namespace: mount /proc over */
@ -193,9 +198,15 @@ static int pasta_spawn_cmd(void *arg)
if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
warn("Cannot set ping_group_range, ICMP requests might fail");
if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
a = (const struct pasta_spawn_cmd_arg *)arg;
conf_hostname_len = strlen(a->c->hostname);
if (conf_hostname_len > 0) {
if (sethostname(a->c->hostname, conf_hostname_len))
warn("Unable to set configured hostname");
} else if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
errno == ENAMETOOLONG) {
hostname[HOST_NAME_MAX] = '\0';
if (sethostname(hostname, strlen(hostname)))
warn("Unable to set pasta-prefixed hostname");
@ -206,7 +217,6 @@ static int pasta_spawn_cmd(void *arg)
sigaddset(&set, SIGUSR1);
sigwaitinfo(&set, NULL);
a = (const struct pasta_spawn_cmd_arg *)arg;
execvp(a->exe, a->argv);
die_perror("Failed to start command or shell");
@ -228,6 +238,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
struct pasta_spawn_cmd_arg arg = {
.exe = argv[0],
.argv = argv,
.c = c,
};
char uidmap[BUFSIZ], gidmap[BUFSIZ];
char *sh_argv[] = { NULL, NULL };
@ -239,8 +250,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
c->quiet = 1;
/* Configure user and group mappings */
snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
if (snprintf_check(uidmap, BUFSIZ, "0 %u 1", uid))
die_perror("Can't build uidmap");
if (snprintf_check(gidmap, BUFSIZ, "0 %u 1", gid))
die_perror("Can't build gidmap");
if (write_file("/proc/self/uid_map", uidmap) ||
write_file("/proc/self/setgroups", "deny") ||
@ -291,7 +305,7 @@ void pasta_ns_conf(struct ctx *c)
rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
if (rc < 0)
die("Couldn't bring up loopback interface in namespace: %s",
strerror(-rc));
strerror_(-rc));
/* Get or set MAC in target namespace */
if (MAC_IS_ZERO(c->guest_mac))
@ -300,12 +314,12 @@ void pasta_ns_conf(struct ctx *c)
rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac);
if (rc < 0)
die("Couldn't set MAC address in namespace: %s",
strerror(-rc));
strerror_(-rc));
if (c->pasta_conf_ns) {
unsigned int flags = IFF_UP;
if (c->mtu != -1)
if (c->mtu)
nl_link_set_mtu(nl_sock_ns, c->pasta_ifi, c->mtu);
if (c->ifi6) /* Avoid duplicate address detection on link up */
@ -327,7 +341,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv4 address(es) in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ip4.no_copy_routes) {
@ -341,7 +355,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv4 route(s) in guest: %s",
strerror(-rc));
strerror_(-rc));
}
}
@ -350,13 +364,13 @@ void pasta_ns_conf(struct ctx *c)
&c->ip6.addr_ll_seen);
if (rc < 0) {
warn("Can't get LL address from namespace: %s",
strerror(-rc));
strerror_(-rc));
}
rc = nl_addr_set_ll_nodad(nl_sock_ns, c->pasta_ifi);
if (rc < 0) {
warn("Can't set nodad for LL in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
/* We dodged DAD: re-enable neighbour solicitations */
@ -364,8 +378,11 @@ void pasta_ns_conf(struct ctx *c)
0, IFF_NOARP);
if (c->ip6.no_copy_addrs) {
rc = nl_addr_set(nl_sock_ns, c->pasta_ifi,
AF_INET6, &c->ip6.addr, 64);
if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr)) {
rc = nl_addr_set(nl_sock_ns,
c->pasta_ifi, AF_INET6,
&c->ip6.addr, 64);
}
} else {
rc = nl_addr_dup(nl_sock, c->ifi6,
nl_sock_ns, c->pasta_ifi,
@ -374,7 +391,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv6 address(es) in namespace: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ip6.no_copy_routes) {
@ -389,7 +406,7 @@ void pasta_ns_conf(struct ctx *c)
if (rc < 0) {
die("Couldn't set IPv6 route(s) in guest: %s",
strerror(-rc));
strerror_(-rc));
}
}
}
@ -427,29 +444,29 @@ static int pasta_netns_quit_timer(void)
*/
void pasta_netns_quit_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
struct epoll_event ev = { .events = EPOLLIN };
int flags = O_NONBLOCK | O_CLOEXEC;
struct statfs s = { 0 };
bool try_inotify = true;
int fd = -1, dir_fd;
union epoll_ref ref;
if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base)
return;
if ((dir_fd = open(c->netns_dir, O_CLOEXEC | O_RDONLY)) < 0)
die("netns dir open: %s, exiting", strerror(errno));
die("netns dir open: %s, exiting", strerror_(errno));
if (fstatfs(dir_fd, &s) || s.f_type == DEVPTS_SUPER_MAGIC ||
s.f_type == PROC_SUPER_MAGIC || s.f_type == SYSFS_MAGIC)
try_inotify = false;
if (try_inotify && (fd = inotify_init1(flags)) < 0)
warn("inotify_init1(): %s, use a timer", strerror(errno));
warn("inotify_init1(): %s, use a timer", strerror_(errno));
if (fd >= 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) {
warn("inotify_add_watch(): %s, use a timer",
strerror(errno));
strerror_(errno));
close(fd);
fd = -1;
}
@ -463,6 +480,7 @@ void pasta_netns_quit_init(const struct ctx *c)
ref.type = EPOLL_TYPE_NSQUIT_TIMER;
} else {
close(dir_fd);
ref.type = EPOLL_TYPE_NSQUIT_INOTIFY;
}
if (fd > FD_REF_MAX)
@ -480,17 +498,23 @@ void pasta_netns_quit_init(const struct ctx *c)
*/
void pasta_netns_quit_inotify_handler(struct ctx *c, int inotify_fd)
{
char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
const struct inotify_event *in_ev = (struct inotify_event *)buf;
char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
__attribute__ ((aligned(__alignof__(struct inotify_event))));
const struct inotify_event *ev;
ssize_t n;
char *p;
if (read(inotify_fd, buf, sizeof(buf)) < (ssize_t)sizeof(*in_ev))
if ((n = read(inotify_fd, buf, sizeof(buf))) < (ssize_t)sizeof(*ev))
return;
if (strncmp(in_ev->name, c->netns_base, sizeof(c->netns_base)))
return;
for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
ev = (const struct inotify_event *)p;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
if (!strncmp(ev->name, c->netns_base, sizeof(c->netns_base))) {
info("Namespace %s is gone, exiting", c->netns_base);
_exit(EXIT_SUCCESS);
}
}
}
/**
@ -516,7 +540,7 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
return;
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
}
close(fd);

pcap.c (77 lines changed)

@ -33,33 +33,12 @@
#include "log.h"
#include "pcap.h"
#include "iov.h"
#include "tap.h"
#define PCAP_VERSION_MINOR 4
static int pcap_fd = -1;
/* See pcap.h from libpcap, or pcap-savefile(5) */
static const struct {
uint32_t magic;
#define PCAP_MAGIC 0xa1b2c3d4
uint16_t major;
#define PCAP_VERSION_MAJOR 2
uint16_t minor;
#define PCAP_VERSION_MINOR 4
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
#define PCAP_LINKTYPE_ETHERNET 1
} pcap_hdr = {
PCAP_MAGIC, PCAP_VERSION_MAJOR, PCAP_VERSION_MINOR, 0, 0, ETH_MAX_MTU,
PCAP_LINKTYPE_ETHERNET
};
struct pcap_pkthdr {
uint32_t tv_sec;
uint32_t tv_usec;
@ -86,9 +65,8 @@ static void pcap_frame(const struct iovec *iov, size_t iovcnt,
.caplen = l2len,
.len = l2len
};
struct iovec hiov = { &h, sizeof(h) };
if (write_remainder(pcap_fd, &hiov, 1, 0) < 0 ||
if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0 ||
write_remainder(pcap_fd, iov, iovcnt, offset) < 0)
debug_perror("Cannot log packet, length %zu", l2len);
}
@ -101,12 +79,14 @@ static void pcap_frame(const struct iovec *iov, size_t iovcnt,
void pcap(const char *pkt, size_t l2len)
{
struct iovec iov = { (char *)pkt, l2len };
struct timespec now;
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(&iov, 1, 0, &now);
}
@ -120,13 +100,14 @@ void pcap(const char *pkt, size_t l2len)
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset)
{
struct timespec now;
struct timespec now = { 0 };
unsigned int i;
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
for (i = 0; i < n; i++)
pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
@ -139,17 +120,19 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
* @iov: Pointer to the array of struct iovec describing the I/O vector
* containing packet data to write, including L2 header
* @iovcnt: Number of buffers (@iov entries)
* @offset: Offset of the L2 frame within the full data length
*/
/* cppcheck-suppress unusedFunction */
void pcap_iov(const struct iovec *iov, size_t iovcnt)
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
{
struct timespec now;
struct timespec now = { 0 };
if (pcap_fd == -1)
return;
clock_gettime(CLOCK_REALTIME, &now);
pcap_frame(iov, iovcnt, 0, &now);
if (clock_gettime(CLOCK_REALTIME, &now))
err_perror("Failed to get CLOCK_REALTIME time");
pcap_frame(iov, iovcnt, offset, &now);
}
/**
@ -158,7 +141,28 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt)
*/
void pcap_init(struct ctx *c)
{
int flags = O_WRONLY | O_CREAT | O_TRUNC;
/* See pcap.h from libpcap, or pcap-savefile(5) */
#define PCAP_MAGIC 0xa1b2c3d4
#define PCAP_VERSION_MAJOR 2
#define PCAP_VERSION_MINOR 4
#define PCAP_LINKTYPE_ETHERNET 1
const struct {
uint32_t magic;
uint16_t major;
uint16_t minor;
int32_t thiszone;
uint32_t sigfigs;
uint32_t snaplen;
uint32_t linktype;
} pcap_hdr = {
.magic = PCAP_MAGIC,
.major = PCAP_VERSION_MAJOR,
.minor = PCAP_VERSION_MINOR,
.snaplen = tap_l2_max_len(c),
.linktype = PCAP_LINKTYPE_ETHERNET
};
if (pcap_fd != -1)
return;
@ -166,10 +170,9 @@ void pcap_init(struct ctx *c)
if (!*c->pcap)
return;
flags |= c->foreground ? O_CLOEXEC : 0;
pcap_fd = open(c->pcap, flags, S_IRUSR | S_IWUSR);
pcap_fd = output_file_open(c->pcap, O_WRONLY);
if (pcap_fd == -1) {
perror("open");
err_perror("Couldn't open pcap file %s", c->pcap);
return;
}

pcap.h (2 lines changed)

@ -9,7 +9,7 @@
void pcap(const char *pkt, size_t l2len);
void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
size_t offset);
void pcap_iov(const struct iovec *iov, size_t iovcnt);
void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset);
void pcap_init(struct ctx *c);
#endif /* PCAP_H */

pif.c (42 lines changed)

@ -59,3 +59,45 @@ void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
*sl = sizeof(sa->sa6);
}
}
/** pif_sock_l4() - Open a socket bound to an address on a specified interface
* @c: Execution context
* @type: Socket epoll type
* @pif: Interface for this socket
* @addr: Address to bind to, or NULL for dual-stack any
* @ifname: Interface for binding, NULL for any
* @port: Port number to bind to (host byte order)
* @data: epoll reference portion for protocol handlers
*
* NOTE: For namespace pifs, this must be called having already entered the
* relevant namespace.
*
* Return: newly created socket, negative error code on failure
*/
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data)
{
union sockaddr_inany sa = {
.sa6.sin6_family = AF_INET6,
.sa6.sin6_addr = in6addr_any,
.sa6.sin6_port = htons(port),
};
socklen_t sl;
ASSERT(pif_is_socket(pif));
if (pif == PIF_SPLICE) {
/* Sanity checks */
ASSERT(!ifname);
ASSERT(addr && inany_is_loopback(addr));
}
if (!addr)
return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
ifname, false, data);
pif_sockaddr(c, &sa, &sl, pif, addr, port);
return sock_l4_sa(c, type, &sa, sl,
ifname, sa.sa_family == AF_INET6, data);
}

pif.h (3 lines changed)

@ -59,5 +59,8 @@ static inline bool pif_is_socket(uint8_t pif)
void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
uint8_t pif, const union inany_addr *addr, in_port_t port);
int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
const union inany_addr *addr, const char *ifname,
in_port_t port, uint32_t data);
#endif /* PIF_H */

repair.c (new file, 255 lines)

@ -0,0 +1,255 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* PASST - Plug A Simple Socket Transport
* for qemu/UNIX domain socket mode
*
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
* repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
*
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "inany.h"
#include "flow.h"
#include "flow_table.h"
#include "repair.h"
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
/* Wait for a while for TCP_REPAIR helper to connect if it's not there yet */
#define REPAIR_ACCEPT_TIMEOUT_MS 10
#define REPAIR_ACCEPT_TIMEOUT_US (REPAIR_ACCEPT_TIMEOUT_MS * 1000)
/* Pending file descriptors for next repair_flush() call, or command change */
static int repair_fds[SCM_MAX_FD];
/* Pending command: flush pending file descriptors if it changes */
static int8_t repair_cmd;
/* Number of pending file descriptors set in @repair_fds */
static int repair_nfds;
/**
* repair_sock_init() - Start listening for connections on helper socket
* @c: Execution context
*/
void repair_sock_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
struct epoll_event ev = { 0 };
if (c->fd_repair_listen == -1)
return;
if (listen(c->fd_repair_listen, 0)) {
err_perror("listen() on repair helper socket, won't migrate");
return;
}
ref.fd = c->fd_repair_listen;
ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
err_perror("repair helper socket epoll_ctl(), won't migrate");
}
/**
* repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
* @c: Execution context
* @events: epoll events
*/
void repair_listen_handler(struct ctx *c, uint32_t events)
{
union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
struct epoll_event ev = { 0 };
struct ucred ucred;
socklen_t len;
if (events != EPOLLIN) {
debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
events);
return;
}
len = sizeof(ucred);
/* Another client is already connected: accept and close right away. */
if (c->fd_repair != -1) {
int discard = accept4(c->fd_repair_listen, NULL, NULL,
SOCK_NONBLOCK);
if (discard == -1)
return;
if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
close(discard);
return;
}
if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
debug_perror("accept4() on TCP_REPAIR helper listening socket");
return;
}
if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
ref.fd = c->fd_repair;
ev.events = EPOLLHUP | EPOLLET;
ev.data.u64 = ref.u64;
if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
close(c->fd_repair);
c->fd_repair = -1;
}
}
/**
* repair_close() - Close connection to TCP_REPAIR helper
* @c: Execution context
*/
void repair_close(struct ctx *c)
{
debug("Closing TCP_REPAIR helper socket");
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
close(c->fd_repair);
c->fd_repair = -1;
}
/**
* repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
* @c: Execution context
* @events: epoll events
*/
void repair_handler(struct ctx *c, uint32_t events)
{
(void)events;
repair_close(c);
}
/**
* repair_wait() - Wait (with timeout) for TCP_REPAIR helper to connect
* @c: Execution context
*/
void repair_wait(struct ctx *c)
{
struct timeval tv = { .tv_sec = 0,
.tv_usec = (long)(REPAIR_ACCEPT_TIMEOUT_US) };
static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000,
".tv_usec is greater than 1000 * 1000");
if (c->fd_repair >= 0 || c->fd_repair_listen == -1)
return;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv))) {
err_perror("Set timeout on TCP_REPAIR listening socket");
return;
}
repair_listen_handler(c, EPOLLIN);
tv.tv_usec = 0;
if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
&tv, sizeof(tv)))
err_perror("Clear timeout on TCP_REPAIR listening socket");
}
/**
* repair_flush() - Flush current set of sockets to helper, with current command
* @c: Execution context
*
* Return: 0 on success, negative error code on failure
*/
int repair_flush(struct ctx *c)
{
char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
__attribute__ ((aligned(__alignof__(struct cmsghdr)))) = { 0 };
struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
struct cmsghdr *cmsg;
struct msghdr msg;
int8_t reply;
if (!repair_nfds)
return 0;
msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
.msg_iov = &iov, .msg_iovlen = 1,
.msg_control = buf,
.msg_controllen = CMSG_SPACE(sizeof(int) *
repair_nfds),
.msg_flags = 0 };
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
repair_nfds = 0;
if (sendmsg(c->fd_repair, &msg, 0) < 0) {
int ret = -errno;
err_perror("Failed to send sockets to TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (recv(c->fd_repair, &reply, sizeof(reply), 0) < 0) {
int ret = -errno;
err_perror("Failed to receive reply from TCP_REPAIR helper");
repair_close(c);
return ret;
}
if (reply != repair_cmd) {
err("Unexpected reply from TCP_REPAIR helper: %d", reply);
repair_close(c);
return -ENXIO;
}
return 0;
}
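The helper at the other end of this exchange mirrors it: read the one-byte command together with the SCM_RIGHTS payload, apply TCP_REPAIR to every descriptor received, and echo the command byte back as the acknowledgement checked above. A rough, hedged sketch of that side (not the actual passt-repair source; error handling is reduced to the minimum):

/* Sketch of the helper's receive path. Needs <sys/socket.h>, <string.h>
 * and <linux/tcp.h> (for TCP_REPAIR); 253 mirrors SCM_MAX_FD above.
 */
static int helper_handle_batch(int conn)
{
	char buf[CMSG_SPACE(sizeof(int) * 253)];
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	struct iovec iov;
	unsigned i, n;
	int8_t cmd;

	iov = (struct iovec){ .iov_base = &cmd, .iov_len = sizeof(cmd) };
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	if (recvmsg(conn, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
	for (i = 0; i < n; i++) {
		int fd, op = cmd;	/* TCP_REPAIR_ON/_OFF/_OFF_NO_WP */

		memcpy(&fd, CMSG_DATA(cmsg) + i * sizeof(int), sizeof(fd));
		setsockopt(fd, SOL_TCP, TCP_REPAIR, &op, sizeof(op));
	}

	/* Echo the command back: repair_flush() treats anything else as error */
	return send(conn, &cmd, sizeof(cmd), 0) == sizeof(cmd) ? 0 : -1;
}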
/**
* repair_set() - Add socket to TCP_REPAIR set with given command
* @c: Execution context
* @s: Socket to add
* @cmd: TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
*
* Return: 0 on success, negative error code on failure
*/
int repair_set(struct ctx *c, int s, int cmd)
{
int rc;
if (repair_nfds && repair_cmd != cmd) {
if ((rc = repair_flush(c)))
return rc;
}
repair_cmd = cmd;
repair_fds[repair_nfds++] = s;
if (repair_nfds >= SCM_MAX_FD) {
if ((rc = repair_flush(c)))
return rc;
}
return 0;
}
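From the caller's perspective the contract is simple: queue sockets with one command, let repair_set() flush automatically on a command change or a full batch, and call repair_flush() once at the end. A hypothetical user of this API (the socket array and loop are made up; TCP_REPAIR_ON comes from <linux/tcp.h>):

/* Hypothetical caller: put every socket of an array into repair mode */
static int example_repair_all(struct ctx *c, const int *socks, size_t n)
{
	size_t i;
	int rc;

	for (i = 0; i < n; i++) {
		if ((rc = repair_set(c, socks[i], TCP_REPAIR_ON)))
			return rc;
	}

	return repair_flush(c);	/* push out the last partial batch */
}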

repair.h (new file)

@ -0,0 +1,17 @@
/* SPDX-License-Identifier: GPL-2.0-or-later
* Copyright (c) 2025 Red Hat GmbH
* Author: Stefano Brivio <sbrivio@redhat.com>
*/
#ifndef REPAIR_H
#define REPAIR_H
void repair_sock_init(const struct ctx *c);
void repair_listen_handler(struct ctx *c, uint32_t events);
void repair_handler(struct ctx *c, uint32_t events);
void repair_close(struct ctx *c);
void repair_wait(struct ctx *c);
int repair_flush(struct ctx *c);
int repair_set(struct ctx *c, int s, int cmd);
#endif /* REPAIR_H */

seccomp.sh

@ -14,12 +14,23 @@
# Author: Stefano Brivio <sbrivio@redhat.com>
TMP="$(mktemp)"
IN="$@"
OUT="$(mktemp)"
OUT_FINAL="${1}"
shift
IN="$@"
[ -z "${ARCH}" ] && ARCH="$(uname -m)"
[ -z "${CC}" ] && CC="cc"
AUDIT_ARCH="AUDIT_ARCH_$(echo ${ARCH} | tr [a-z] [A-Z] \
| sed 's/^ARM.*/ARM/' \
| sed 's/I[456]86/I386/' \
| sed 's/PPC64/PPC/' \
| sed 's/PPCLE/PPC64LE/' \
| sed 's/MIPS64EL/MIPSEL64/' \
| sed 's/HPPA/PARISC/' \
| sed 's/SH4/SH/')"
HEADER="/* This file was automatically generated by $(basename ${0}) */
#ifndef AUDIT_ARCH_PPC64LE
@ -32,7 +43,7 @@ struct sock_filter filter_@PROFILE@[] = {
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, arch))),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PASST_AUDIT_ARCH, 0, @KILL@),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, @AUDIT_ARCH@, 0, @KILL@),
/* cppcheck-suppress [badBitmaskCheck, unmatchedSuppression] */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, nr))),
@ -233,7 +244,8 @@ gen_profile() {
sub ${__i} CALL "NR:${__nr}" "NAME:${__name}" "ALLOW:${__allow}"
done
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))"
finish PRE "PROFILE:${__profile}" "KILL:$(( __statements + 1))" \
"AUDIT_ARCH:${AUDIT_ARCH}"
}
printf '%s\n' "${HEADER}" > "${OUT}"
@ -242,7 +254,10 @@ for __p in ${__profiles}; do
__calls="$(sed -n 's/[\t ]*\*[\t ]*#syscalls\(:'"${__p}"'\|\)[\t ]\{1,\}\(.*\)/\2/p' ${IN})"
__calls="${__calls} ${EXTRA_SYSCALLS:-}"
__calls="$(filter ${__calls})"
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t
cols="$(stty -a 2>/dev/null | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
case $cols in [0-9]*) col_args="-w ${cols}";; *) col_args="";; esac
echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t ${col_args}
# Pad here to keep gen_profile() "simple"
__count=0
@ -255,4 +270,4 @@ for __p in ${__profiles}; do
gen_profile "${__p}" ${__calls}
done
mv "${OUT}" seccomp.h
mv "${OUT}" "${OUT_FINAL}"

tap.c

@ -56,16 +56,70 @@
#include "netlink.h"
#include "pasta.h"
#include "packet.h"
#include "repair.h"
#include "tap.h"
#include "log.h"
#include "vhost_user.h"
#include "vu_common.h"
/* Maximum allowed frame lengths (including L2 header) */
/* Verify that an L2 frame length limit is large enough to contain the header,
* but small enough to fit in the packet pool
*/
#define CHECK_FRAME_LEN(len) \
static_assert((len) >= ETH_HLEN && (len) <= PACKET_MAX_LEN, \
#len " has bad value")
CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
CHECK_FRAME_LEN(L2_MAX_LEN_VU);
/* We try to size the packet pools so that we can use a single batch for the entire
* packet buffer. This might be exceeded for vhost-user, though, which uses its
* own buffers rather than pkt_buf.
*
* This is just a tuning parameter, the code will work with slightly more
* overhead if it's incorrect. So, we estimate based on the minimum practical
* frame size - an empty UDP datagram - rather than the minimum theoretical
* frame size.
*
* FIXME: Profile to work out how big this actually needs to be to amortise
* per-batch syscall overheads
*/
#define TAP_MSGS_IP4 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr))
#define TAP_MSGS_IP6 \
DIV_ROUND_UP(sizeof(pkt_buf), \
ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct udphdr))
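As a rough, purely illustrative check of those divisors (the real pkt_buf size is defined elsewhere; 128 KiB below is only an assumption for the arithmetic): an empty IPv4/UDP frame is 14 + 20 + 8 = 42 bytes and an empty IPv6/UDP frame is 14 + 40 + 8 = 62 bytes, so:

/* Standalone arithmetic sketch, not part of the build */
#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	size_t pkt_buf_sz = 128 * 1024;	/* assumed size, for illustration */

	printf("IPv4 slots: %zu\n", DIV_ROUND_UP(pkt_buf_sz, 14 + 20 + 8));
	printf("IPv6 slots: %zu\n", DIV_ROUND_UP(pkt_buf_sz, 14 + 40 + 8));
	return 0;			/* prints 3121 and 2115 */
}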
/* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS_IP4, pkt_buf);
static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
#define TAP_SEQS 128 /* Different L4 tuples in one batch */
#define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */
/**
* tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
* @c: Execution context
*/
unsigned long tap_l2_max_len(const struct ctx *c)
{
/* NOLINTBEGIN(bugprone-branch-clone): values can be the same */
switch (c->mode) {
case MODE_PASST:
return L2_MAX_LEN_PASST;
case MODE_PASTA:
return L2_MAX_LEN_PASTA;
case MODE_VU:
return L2_MAX_LEN_VU;
}
/* NOLINTEND(bugprone-branch-clone) */
ASSERT(0);
}
/**
* tap_send_single() - Send a single frame
* @c: Execution context
@ -78,16 +132,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
struct iovec iov[2];
size_t iovcnt = 0;
if (c->mode == MODE_PASST) {
switch (c->mode) {
case MODE_PASST:
iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
iovcnt++;
/* fall through */
case MODE_PASTA:
iov[iovcnt].iov_base = (void *)data;
iov[iovcnt].iov_len = l2len;
iovcnt++;
tap_send_frames(c, iov, iovcnt, 1);
break;
case MODE_VU:
vu_send_single(c, data, l2len);
break;
}
iov[iovcnt].iov_base = (void *)data;
iov[iovcnt].iov_len = l2len;
iovcnt++;
tap_send_frames(c, iov, iovcnt, 1);
}
/**
@ -113,7 +173,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
{
struct ethhdr *eh = (struct ethhdr *)buf;
@ -134,8 +194,8 @@ static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto)
{
uint16_t l3len = l4len + sizeof(*ip4h);
@ -144,13 +204,43 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
ip4h->tos = 0;
ip4h->tot_len = htons(l3len);
ip4h->id = 0;
ip4h->frag_off = 0;
ip4h->frag_off = htons(IP_DF);
ip4h->ttl = 255;
ip4h->protocol = proto;
ip4h->saddr = src.s_addr;
ip4h->daddr = dst.s_addr;
ip4h->check = csum_ip4_header(l3len, proto, src, dst);
return ip4h + 1;
return (char *)ip4h + sizeof(*ip4h);
}
/**
* tap_push_uh4() - Build UDPv4 header with checksum
* @uh: UDP header to fill
* @src: IPv4 source address
* @sport: UDP source port
* @dst: IPv4 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
const struct iovec iov = {
.iov_base = (void *)in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, &payload);
return (char *)uh + sizeof(*uh);
}
/**
@ -160,7 +250,7 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
* @sport: UDP source port
* @dst: IPv4 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
@ -171,14 +261,9 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
char buf[USHRT_MAX];
struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = (char *)(uh + 1);
char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp4(uh, src, dst, in, dlen);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -215,10 +300,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
*
* Return: pointer at which to write the packet's payload
*/
static void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow)
{
ip6h->payload_len = htons(l4len);
ip6h->priority = 0;
@ -227,10 +311,40 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
ip6h->hop_limit = 255;
ip6h->saddr = *src;
ip6h->daddr = *dst;
ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
return ip6h + 1;
ip6_set_flow_lbl(ip6h, flow);
return (char *)ip6h + sizeof(*ip6h);
}
/**
* tap_push_uh6() - Build UDPv6 header with checksum
* @uh: UDP header to fill
* @src: IPv6 source address
* @sport: UDP source port
* @dst: IPv6 destination address
* @dport: UDP destination port
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*
* Return: pointer at which to write the packet's payload
*/
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
const struct iovec iov = {
.iov_base = in,
.iov_len = dlen
};
struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, &payload);
return (char *)uh + sizeof(*uh);
}
/**
@ -241,27 +355,22 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
* @dst: IPv6 destination address
* @dport: UDP destination port
* @flow: Flow label
* @in: UDP payload contents (not including UDP header)
* @in: UDP payload contents (not including UDP header)
* @dlen: UDP payload length (not including UDP header)
*/
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
uint32_t flow, const void *in, size_t dlen)
uint32_t flow, void *in, size_t dlen)
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
l4len, IPPROTO_UDP, flow);
char *data = (char *)(uh + 1);
char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
uh->source = htons(sport);
uh->dest = htons(dport);
uh->len = htons(l4len);
csum_udp6(uh, src, dst, in, dlen);
memcpy(data, in, dlen);
tap_send_single(c, buf, dlen + (data - buf));
}
@ -406,10 +515,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
if (!nframes)
return 0;
if (c->mode == MODE_PASTA)
switch (c->mode) {
case MODE_PASTA:
m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
else
break;
case MODE_PASST:
m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
break;
case MODE_VU:
/* fall through */
default:
ASSERT(0);
}
if (m < nframes)
debug("tap: failed to send %zu frames of %zu",
@ -442,6 +559,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
* struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @ttl: Time to live
* @source: Source port
* @dest: Destination port
* @saddr: Source address
@ -450,6 +568,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
*/
static struct tap4_l4_t {
uint8_t protocol;
uint8_t ttl;
uint16_t source;
uint16_t dest;
@ -464,14 +583,17 @@ static struct tap4_l4_t {
* struct l4_seq6_t - Message sequence for one protocol handler call, IPv6
* @msgs: Count of messages in sequence
* @protocol: Protocol number
* @flow_lbl: IPv6 flow label
* @source: Source port
* @dest: Destination port
* @saddr: Source address
* @daddr: Destination address
* @hop_limit: Hop limit
* @msg: Array of messages that can be handled in a single call
*/
static struct tap6_l4_t {
uint8_t protocol;
uint32_t flow_lbl :20;
uint16_t source;
uint16_t dest;
@ -479,6 +601,8 @@ static struct tap6_l4_t {
struct in6_addr saddr;
struct in6_addr daddr;
uint8_t hop_limit;
struct pool_l4_t p;
} tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
@ -667,7 +791,8 @@ resume:
#define L4_MATCH(iph, uh, seq) \
((seq)->protocol == (iph)->protocol && \
(seq)->source == (uh)->source && (seq)->dest == (uh)->dest && \
(seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
(seq)->saddr.s_addr == (iph)->saddr && \
(seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
#define L4_SET(iph, uh, seq) \
do { \
@ -676,6 +801,7 @@ resume:
(seq)->dest = (uh)->dest; \
(seq)->saddr.s_addr = (iph)->saddr; \
(seq)->daddr.s_addr = (iph)->daddr; \
(seq)->ttl = (iph)->ttl; \
} while (0)
if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
@ -717,14 +843,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
0, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET,
&seq->saddr, &seq->daddr,
p, k, now);
seq->ttl, p, k, now);
}
}
@ -795,6 +921,9 @@ resume:
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_seen)) {
c->ip6.addr_seen = *saddr;
}
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr))
c->ip6.addr = *saddr;
} else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
c->ip6.addr_seen = *saddr;
}
@ -842,16 +971,20 @@ resume:
((seq)->protocol == (proto) && \
(seq)->source == (uh)->source && \
(seq)->dest == (uh)->dest && \
(seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \
IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr) && \
(seq)->hop_limit == (ip6h)->hop_limit)
#define L4_SET(ip6h, proto, uh, seq) \
do { \
(seq)->protocol = (proto); \
(seq)->source = (uh)->source; \
(seq)->dest = (uh)->dest; \
(seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \
(seq)->saddr = *saddr; \
(seq)->daddr = *daddr; \
(seq)->hop_limit = (ip6h)->hop_limit; \
} while (0)
if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
@ -895,14 +1028,14 @@ append:
for (k = 0; k < p->count; )
k += tcp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->flow_lbl, p, k, now);
} else if (seq->protocol == IPPROTO_UDP) {
if (c->no_udp)
continue;
for (k = 0; k < p->count; )
k += udp_tap_handler(c, PIF_TAP, AF_INET6,
&seq->saddr, &seq->daddr,
p, k, now);
seq->hop_limit, p, k, now);
}
}
@ -937,8 +1070,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
* @c: Execution context
* @l2len: Total L2 packet length
* @p: Packet buffer
* @now: Current timestamp
*/
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now)
{
const struct ethhdr *eh;
@ -954,9 +1089,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
switch (ntohs(eh->h_proto)) {
case ETH_P_ARP:
case ETH_P_IP:
if (pool_full(pool_tap4)) {
tap4_handler(c, pool_tap4, now);
pool_flush(pool_tap4);
}
packet_add(pool_tap4, l2len, p);
break;
case ETH_P_IPV6:
if (pool_full(pool_tap6)) {
tap6_handler(c, pool_tap6, now);
pool_flush(pool_tap6);
}
packet_add(pool_tap6, l2len, p);
break;
default:
@ -968,38 +1111,33 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
* tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
* @c: Execution context
*/
static void tap_sock_reset(struct ctx *c)
void tap_sock_reset(struct ctx *c)
{
info("Client connection closed%s", c->one_off ? ", exiting" : "");
if (c->one_off)
exit(EXIT_SUCCESS);
_exit(EXIT_SUCCESS);
/* Close the connected socket, wait for a new connection */
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
epoll_del(c, c->fd_tap);
close(c->fd_tap);
c->fd_tap = -1;
if (c->mode == MODE_VU)
vu_cleanup(c->vdev);
}
/**
* tap_handler_passt() - Packet handler for AF_UNIX file descriptor
* tap_passt_input() - Handler for new data on the socket to qemu
* @c: Execution context
* @events: epoll events
* @now: Current timestamp
*/
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now)
static void tap_passt_input(struct ctx *c, const struct timespec *now)
{
static const char *partial_frame;
static ssize_t partial_len = 0;
ssize_t n;
char *p;
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
tap_sock_reset(c);
return;
}
tap_flush_pools();
if (partial_len) {
@ -1010,10 +1148,13 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
memmove(pkt_buf, partial_frame, partial_len);
}
n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
MSG_DONTWAIT);
do {
n = recv(c->fd_tap, pkt_buf + partial_len,
sizeof(pkt_buf) - partial_len, MSG_DONTWAIT);
} while ((n < 0) && errno == EINTR);
if (n < 0) {
if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
err_perror("Receive error on guest connection, reset");
tap_sock_reset(c);
}
@ -1026,7 +1167,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
while (n >= (ssize_t)sizeof(uint32_t)) {
uint32_t l2len = ntohl_unaligned(p);
if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
if (l2len < sizeof(struct ethhdr) || l2len > L2_MAX_LEN_PASST) {
err("Bad frame size from guest, resetting connection");
tap_sock_reset(c);
return;
@ -1039,7 +1180,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
p += sizeof(uint32_t);
n -= sizeof(uint32_t);
tap_add_packet(c, l2len, p);
tap_add_packet(c, l2len, p, now);
p += l2len;
n -= l2len;
@ -1051,6 +1192,65 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
tap_handler(c, now);
}
/**
* tap_handler_passt() - Event handler for AF_UNIX file descriptor
* @c: Execution context
* @events: epoll events
* @now: Current timestamp
*/
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now)
{
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
tap_sock_reset(c);
return;
}
if (events & EPOLLIN)
tap_passt_input(c, now);
}
/**
* tap_pasta_input() - Handler for new data on the socket to hypervisor
* @c: Execution context
* @now: Current timestamp
*/
static void tap_pasta_input(struct ctx *c, const struct timespec *now)
{
ssize_t n, len;
tap_flush_pools();
for (n = 0;
n <= (ssize_t)(sizeof(pkt_buf) - L2_MAX_LEN_PASTA);
n += len) {
len = read(c->fd_tap, pkt_buf + n, L2_MAX_LEN_PASTA);
if (len == 0) {
die("EOF on tap device, exiting");
} else if (len < 0) {
if (errno == EINTR) {
len = 0;
continue;
}
if (errno == EAGAIN || errno == EWOULDBLOCK)
break; /* all done for now */
die("Error on tap device, exiting");
}
/* Ignore frames of bad length */
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)L2_MAX_LEN_PASTA)
continue;
tap_add_packet(c, len, pkt_buf + n, now);
}
tap_handler(c, now);
}
/**
* tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
* @c: Execution context
@ -1060,113 +1260,43 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
void tap_handler_pasta(struct ctx *c, uint32_t events,
const struct timespec *now)
{
ssize_t n, len;
int ret;
if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
die("Disconnect event on /dev/net/tun device, exiting");
redo:
n = 0;
tap_flush_pools();
restart:
while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
if (len < (ssize_t)sizeof(struct ethhdr) ||
len > (ssize_t)ETH_MAX_MTU) {
n += len;
continue;
}
tap_add_packet(c, len, pkt_buf + n);
if ((n += len) == TAP_BUF_BYTES)
break;
}
if (len < 0 && errno == EINTR)
goto restart;
ret = errno;
tap_handler(c, now);
if (len > 0 || ret == EAGAIN)
return;
if (n == TAP_BUF_BYTES)
goto redo;
die("Error on tap device, exiting");
if (events & EPOLLIN)
tap_pasta_input(c, now);
}
/**
* tap_sock_unix_open() - Create and bind AF_UNIX socket
* @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
*
* Return: socket descriptor on success, won't return on failure
* tap_backend_show_hints() - Give help information to start QEMU
* @c: Execution context
*/
int tap_sock_unix_open(char *sock_path)
static void tap_backend_show_hints(struct ctx *c)
{
int fd = socket(AF_UNIX, SOCK_STREAM, 0);
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
};
int i;
if (fd < 0)
die_perror("Failed to open UNIX domain socket");
for (i = 1; i < UNIX_SOCK_MAX; i++) {
char *path = addr.sun_path;
int ex, ret;
if (*sock_path)
memcpy(path, sock_path, UNIX_PATH_MAX);
else
snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
if (ex < 0)
die_perror("Failed to check for UNIX domain conflicts");
ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
errno != EACCES)) {
if (*sock_path)
die("Socket path %s already in use", path);
close(ex);
continue;
}
close(ex);
unlink(path);
ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
if (*sock_path && ret)
die_perror("Failed to bind UNIX domain socket");
if (!ret)
break;
switch (c->mode) {
case MODE_PASTA:
/* No hints */
break;
case MODE_PASST:
info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
c->sock_path);
info("or qrap, for earlier qemu versions:");
info(" ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
break;
case MODE_VU:
info("You can start qemu with:");
info(" kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
c->sock_path);
break;
}
if (i == UNIX_SOCK_MAX)
die_perror("Failed to bind UNIX domain socket");
info("UNIX domain socket bound at %s", addr.sun_path);
if (!*sock_path)
memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
return fd;
}
/**
* tap_sock_unix_init() - Start listening for connections on AF_UNIX socket
* @c: Execution context
*/
static void tap_sock_unix_init(struct ctx *c)
static void tap_sock_unix_init(const struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
struct epoll_event ev = { 0 };
@ -1177,12 +1307,33 @@ static void tap_sock_unix_init(struct ctx *c)
ev.events = EPOLLIN | EPOLLET;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
}
info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
c->sock_path);
info("or qrap, for earlier qemu versions:");
info(" ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
/**
* tap_start_connection() - Start a new connection
* @c: Execution context
*/
static void tap_start_connection(const struct ctx *c)
{
struct epoll_event ev = { 0 };
union epoll_ref ref = { 0 };
ref.fd = c->fd_tap;
switch (c->mode) {
case MODE_PASST:
ref.type = EPOLL_TYPE_TAP_PASST;
break;
case MODE_PASTA:
ref.type = EPOLL_TYPE_TAP_PASTA;
break;
case MODE_VU:
ref.type = EPOLL_TYPE_VHOST_CMD;
break;
}
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
}
/**
@ -1192,8 +1343,6 @@ static void tap_sock_unix_init(struct ctx *c)
*/
void tap_listen_handler(struct ctx *c, uint32_t events)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
struct epoll_event ev = { 0 };
int v = INT_MAX / 2;
struct ucred ucred;
socklen_t len;
@ -1232,10 +1381,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
setsockopt(c->fd_tap, SOL_SOCKET, SO_SNDBUF, &v, sizeof(v)))
trace("tap: failed to set SO_SNDBUF to %i", v);
ref.fd = c->fd_tap;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
}
/**
@ -1261,7 +1407,7 @@ static int tap_ns_tun(void *arg)
if (fd < 0)
die_perror("Failed to open() /dev/net/tun");
rc = ioctl(fd, TUNSETIFF, &ifr);
rc = ioctl(fd, (int)TUNSETIFF, &ifr);
if (rc < 0)
die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
@ -1279,58 +1425,61 @@ static int tap_ns_tun(void *arg)
*/
static void tap_sock_tun_init(struct ctx *c)
{
union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASTA };
struct epoll_event ev = { 0 };
NS_CALL(tap_ns_tun, c);
if (c->fd_tap == -1)
die("Failed to set up tap device in namespace");
pasta_ns_conf(c);
ref.fd = c->fd_tap;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
}
/**
* tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
* @c: Execution context
* tap_sock_update_pool() - Set the buffer base and size for the pool of packets
* @base: Buffer base
* @size: Buffer size
*/
void tap_sock_init(struct ctx *c)
void tap_sock_update_pool(void *base, size_t size)
{
size_t sz = sizeof(pkt_buf);
int i;
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS_IP4, base, size);
pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS_IP6, base, size);
for (i = 0; i < TAP_SEQS; i++) {
tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
}
}
/**
* tap_backend_init() - Create and set up AF_UNIX socket or
* tuntap file descriptor
* @c: Execution context
*/
void tap_backend_init(struct ctx *c)
{
if (c->mode == MODE_VU) {
tap_sock_update_pool(NULL, 0);
vu_init(c);
} else {
tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
}
if (c->fd_tap != -1) { /* Passed as --fd */
struct epoll_event ev = { 0 };
union epoll_ref ref;
ASSERT(c->one_off);
ref.fd = c->fd_tap;
if (c->mode == MODE_PASST)
ref.type = EPOLL_TYPE_TAP_PASST;
else
ref.type = EPOLL_TYPE_TAP_PASTA;
ev.events = EPOLLIN | EPOLLRDHUP;
ev.data.u64 = ref.u64;
epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
tap_start_connection(c);
return;
}
if (c->mode == MODE_PASTA) {
switch (c->mode) {
case MODE_PASTA:
tap_sock_tun_init(c);
} else {
break;
case MODE_VU:
repair_sock_init(c);
/* fall through */
case MODE_PASST:
tap_sock_unix_init(c);
/* In passt mode, we don't know the guest's MAC address until it
@ -1338,5 +1487,8 @@ void tap_sock_init(struct ctx *c)
* first packets will reach it.
*/
memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
break;
}
tap_backend_show_hints(c);
}

tap.h

@ -6,7 +6,32 @@
#ifndef TAP_H
#define TAP_H
#define ETH_HDR_INIT(proto) { .h_proto = htons_constant(proto) }
/** L2_MAX_LEN_PASTA - Maximum frame length for pasta mode (with L2 header)
*
* The kernel tuntap device imposes a maximum frame size of 65535 including
* 'hard_header_len' (14 bytes for L2 Ethernet in the case of "tap" mode).
*/
#define L2_MAX_LEN_PASTA USHRT_MAX
/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
*
* The only structural limit the QEMU socket protocol imposes on frames is
* (2^32-1) bytes, but that would be ludicrously long in practice. For now,
* limit it somewhat arbitrarily to 65535 bytes. FIXME: Work out an appropriate
* limit with more precision.
*/
#define L2_MAX_LEN_PASST USHRT_MAX
/** L2_MAX_LEN_VU - Maximum frame length for vhost-user mode (with L2 header)
*
* vhost-user allows multiple buffers per frame, each of which can be quite
* large, so the inherent frame size limit is rather large. Much larger than is
* actually useful for IP. For now limit arbitrarily to 65535 bytes. FIXME:
* Work out an appropriate limit with more precision.
*/
#define L2_MAX_LEN_VU USHRT_MAX
struct udphdr;
/**
* struct tap_hdr - tap backend specific headers
@ -40,9 +65,27 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
*/
static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
{
thdr->vnet_len = htonl(l2len);
if (thdr)
thdr->vnet_len = htonl(l2len);
}
unsigned long tap_l2_max_len(const struct ctx *c);
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
void *tap_push_uh6(struct udphdr *uh,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
void *in, size_t dlen);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src,
const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
@ -50,10 +93,13 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
const void *in, size_t l4len);
const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
const struct in6_addr *src);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
const struct in6_addr *src, const struct in6_addr *dst,
size_t l4len, uint8_t proto, uint32_t flow);
void tap_udp6_send(const struct ctx *c,
const struct in6_addr *src, in_port_t sport,
const struct in6_addr *dst, in_port_t dport,
uint32_t flow, const void *in, size_t dlen);
uint32_t flow, void *in, size_t dlen);
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t l4len);
@ -68,9 +114,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
void tap_handler_passt(struct ctx *c, uint32_t events,
const struct timespec *now);
int tap_sock_unix_open(char *sock_path);
void tap_sock_init(struct ctx *c);
void tap_sock_reset(struct ctx *c);
void tap_sock_update_pool(void *base, size_t size);
void tap_backend_init(struct ctx *c);
void tap_flush_pools(void);
void tap_handler(struct ctx *c, const struct timespec *now);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
const struct timespec *now);
#endif /* TAP_H */

tcp.c
File diff suppressed because it is too large

tcp.h

@ -10,21 +10,21 @@
struct ctx;
void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
const struct timespec *now);
void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
uint32_t events);
int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr, uint32_t flow_lbl,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
const char *ifname, in_port_t port);
int tcp_init(struct ctx *c);
void tcp_timer(struct ctx *c, const struct timespec *now);
void tcp_defer_handler(struct ctx *c);
void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
int tcp_set_peek_offset(int s, int offset);
extern bool peek_offset_cap;
@ -58,16 +58,12 @@ union tcp_listen_epoll_ref {
* @fwd_in: Port forwarding configuration for inbound packets
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035)
* @pipe_size: Size of pipes for spliced connections
*/
struct tcp_ctx {
struct fwd_ports fwd_in;
struct fwd_ports fwd_out;
struct timespec timer_run;
#ifdef HAS_SND_WND
int kernel_snd_wnd;
#endif
size_t pipe_size;
};

tcp_buf.c

@ -20,7 +20,7 @@
#include <netinet/ip.h>
#include <linux/tcp.h>
#include <netinet/tcp.h>
#include "util.h"
#include "ip.h"
@ -38,88 +38,32 @@
(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
/* Static buffers */
/**
* struct tcp_payload_t - TCP header and data to send segments with payload
* @th: TCP header
* @data: TCP data
*/
struct tcp_payload_t {
struct tcphdr th;
uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32))); /* For AVX2 checksum routines */
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/**
* struct tcp_flags_t - TCP header and data to send zero-length
* segments (flags)
* @th: TCP header
* @opts: TCP options
*/
struct tcp_flags_t {
struct tcphdr th;
char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32)));
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/* Ethernet header for IPv4 frames */
/* Ethernet header for IPv4 and IPv6 frames */
static struct ethhdr tcp4_eth_src;
static struct tap_hdr tcp4_payload_tap_hdr[TCP_FRAMES_MEM];
/* IPv4 headers */
static struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 frames */
static struct tcp_payload_t tcp4_payload[TCP_FRAMES_MEM];
static_assert(MSS4 <= sizeof(tcp4_payload[0].data), "MSS4 is greater than 65516");
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp4_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp4_payload_used;
static struct tap_hdr tcp4_flags_tap_hdr[TCP_FRAMES_MEM];
/* IPv4 headers for TCP segment without payload */
static struct iphdr tcp4_flags_ip[TCP_FRAMES_MEM];
/* TCP segments without payload for IPv4 frames */
static struct tcp_flags_t tcp4_flags[TCP_FRAMES_MEM];
static unsigned int tcp4_flags_used;
/* Ethernet header for IPv6 frames */
static struct ethhdr tcp6_eth_src;
static struct tap_hdr tcp6_payload_tap_hdr[TCP_FRAMES_MEM];
/* IPv6 headers */
static struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* TCP headers and data for IPv6 frames */
static struct tcp_payload_t tcp6_payload[TCP_FRAMES_MEM];
static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
static_assert(MSS6 <= sizeof(tcp6_payload[0].data), "MSS6 is greater than 65516");
/* IP headers for IPv4 and IPv6 */
struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 and IPv6 frames */
static struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp6_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp6_payload_used;
static struct tap_hdr tcp6_flags_tap_hdr[TCP_FRAMES_MEM];
/* IPv6 headers for TCP segment without payload */
static struct ipv6hdr tcp6_flags_ip[TCP_FRAMES_MEM];
/* TCP segment without payload for IPv6 frames */
static struct tcp_flags_t tcp6_flags[TCP_FRAMES_MEM];
static unsigned int tcp6_flags_used;
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp_payload_used;
/* recvmsg()/sendmsg() data for tap */
static struct iovec iov_sock [TCP_FRAMES_MEM + 1];
static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
* tcp_update_l2_buf() - Update Ethernet header buffers with addresses
* @eth_d: Ethernet destination address, NULL if unchanged
@ -132,105 +76,30 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
}
/**
* tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* @c: Execution context
*/
void tcp_sock4_iov_init(const struct ctx *c)
{
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
struct iovec *iov;
int i;
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
for (i = 0; i < ARRAY_SIZE(tcp4_payload); i++) {
tcp4_payload_ip[i] = iph;
tcp4_payload[i].th.doff = sizeof(struct tcphdr) / 4;
tcp4_payload[i].th.ack = 1;
}
for (i = 0; i < ARRAY_SIZE(tcp4_flags); i++) {
tcp4_flags_ip[i] = iph;
tcp4_flags[i].th.doff = sizeof(struct tcphdr) / 4;
tcp4_flags[i].th.ack = 1;
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp4_l2_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp4_payload_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp4_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_payload[i];
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp4_l2_flags_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp4_flags_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp4_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_flags_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_flags[i];
}
}
/**
* tcp_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
* @c: Execution context
*/
void tcp_sock6_iov_init(const struct ctx *c)
void tcp_sock_iov_init(const struct ctx *c)
{
struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
struct iovec *iov;
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
int i;
tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
for (i = 0; i < ARRAY_SIZE(tcp6_payload); i++) {
for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) {
tcp6_payload_ip[i] = ip6;
tcp6_payload[i].th.doff = sizeof(struct tcphdr) / 4;
tcp6_payload[i].th.ack = 1;
}
for (i = 0; i < ARRAY_SIZE(tcp6_flags); i++) {
tcp6_flags_ip[i] = ip6;
tcp6_flags[i].th.doff = sizeof(struct tcphdr) / 4;
tcp6_flags[i].th .ack = 1;
tcp4_payload_ip[i] = iph;
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp6_l2_iov[i];
struct iovec *iov = tcp_l2_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp6_payload_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp6_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_payload[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
}
for (i = 0; i < TCP_FRAMES_MEM; i++) {
iov = tcp6_l2_flags_iov[i];
iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp6_flags_tap_hdr[i]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp6_eth_src);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_flags_ip[i]);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_flags[i];
}
}
/**
* tcp_flags_flush() - Send out buffers for segments with no data (flags)
* @c: Execution context
*/
void tcp_flags_flush(const struct ctx *c)
{
tap_send_frames(c, &tcp6_l2_flags_iov[0][0], TCP_NUM_IOVS,
tcp6_flags_used);
tcp6_flags_used = 0;
tap_send_frames(c, &tcp4_l2_flags_iov[0][0], TCP_NUM_IOVS,
tcp4_flags_used);
tcp4_flags_used = 0;
}
/**
@ -240,7 +109,7 @@ void tcp_flags_flush(const struct ctx *c)
* @frames: Two-dimensional array containing queued frames with sub-iovs
* @num_frames: Number of entries in the two arrays to be compared
*/
static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
{
int i;
@ -256,34 +125,55 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
conn->seq_to_tap = seq;
peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
if (tcp_set_peek_offset(conn->sock, peek_offset))
if (tcp_set_peek_offset(conn, peek_offset))
tcp_rst(c, conn);
}
}
/**
* tcp_payload_flush() - Send out buffers for segments with data
* tcp_payload_flush() - Send out buffers for segments with data or flags
* @c: Execution context
*/
void tcp_payload_flush(struct ctx *c)
void tcp_payload_flush(const struct ctx *c)
{
size_t m;
m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
tcp6_payload_used);
if (m != tcp6_payload_used) {
tcp_revert_seq(c, &tcp6_frame_conns[m], &tcp6_l2_iov[m],
tcp6_payload_used - m);
m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
tcp_payload_used);
if (m != tcp_payload_used) {
tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
tcp_payload_used - m);
}
tcp6_payload_used = 0;
tcp_payload_used = 0;
}
m = tap_send_frames(c, &tcp4_l2_iov[0][0], TCP_NUM_IOVS,
tcp4_payload_used);
if (m != tcp4_payload_used) {
tcp_revert_seq(c, &tcp4_frame_conns[m], &tcp4_l2_iov[m],
tcp4_payload_used - m);
}
tcp4_payload_used = 0;
/**
* tcp_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
* @conn: Connection pointer
* @iov: Pointer to an array of iovec of TCP pre-cooked buffers
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*/
static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, const uint16_t *check,
uint32_t seq, bool no_tcp_csum)
{
struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
struct tcphdr *th = IOV_REMOVE_HEADER(&tail, struct tcphdr);
struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
const struct flowside *tapside = TAPFLOW(conn);
const struct in_addr *a4 = inany_v4(&tapside->oaddr);
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
if (a4)
ip4h = iov[TCP_IOV_IP].iov_base;
else
ip6h = iov[TCP_IOV_IP].iov_base;
tcp_fill_headers(conn, taph, ip4h, ip6h, th, &tail,
check, seq, no_tcp_csum);
}
/**
@ -294,58 +184,50 @@ void tcp_payload_flush(struct ctx *c)
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
struct tcp_flags_t *payload;
struct tcp_payload_t *payload;
struct iovec *iov;
size_t optlen;
size_t l4len;
uint32_t seq;
int ret;
if (CONN_V4(conn))
iov = tcp4_l2_flags_iov[tcp4_flags_used++];
else
iov = tcp6_l2_flags_iov[tcp6_flags_used++];
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
seq = conn->seq_to_tap;
ret = tcp_prepare_flags(c, conn, flags, &payload->th,
payload->opts, &optlen);
if (ret <= 0) {
if (CONN_V4(conn))
tcp4_flags_used--;
else
tcp6_flags_used--;
(struct tcp_syn_opts *)&payload->data, &optlen);
if (ret <= 0)
return ret;
}
l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
tcp_payload_used++;
l4len = optlen + sizeof(struct tcphdr);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
if (flags & DUP_ACK) {
struct iovec *dup_iov;
int i;
struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
if (CONN_V4(conn))
dup_iov = tcp4_l2_flags_iov[tcp4_flags_used++];
else
dup_iov = tcp6_l2_flags_iov[tcp6_flags_used++];
for (i = 0; i < TCP_NUM_IOVS; i++)
memcpy(dup_iov[i].iov_base, iov[i].iov_base,
iov[i].iov_len);
dup_iov[TCP_IOV_PAYLOAD].iov_len = iov[TCP_IOV_PAYLOAD].iov_len;
memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_TAP].iov_len);
dup_iov[TCP_IOV_ETH].iov_base = iov[TCP_IOV_ETH].iov_base;
dup_iov[TCP_IOV_IP] = iov[TCP_IOV_IP];
memcpy(dup_iov[TCP_IOV_PAYLOAD].iov_base,
iov[TCP_IOV_PAYLOAD].iov_base, l4len);
dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
}
if (CONN_V4(conn)) {
if (tcp4_flags_used > TCP_FRAMES_MEM - 2)
tcp_flags_flush(c);
} else {
if (tcp6_flags_used > TCP_FRAMES_MEM - 2)
tcp_flags_flush(c);
}
if (tcp_payload_used > TCP_FRAMES_MEM - 2)
tcp_payload_flush(c);
return 0;
}
@ -357,40 +239,41 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
* @dlen: TCP payload length
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
* @push: Set PSH flag, last segment in a batch
*/
static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
ssize_t dlen, int no_csum, uint32_t seq)
static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ssize_t dlen, int no_csum, uint32_t seq, bool push)
{
struct tcp_payload_t *payload;
const uint16_t *check = NULL;
struct iovec *iov;
size_t l4len;
conn->seq_to_tap = seq + dlen;
tcp_frame_conns[tcp_payload_used] = conn;
iov = tcp_l2_iov[tcp_payload_used];
if (CONN_V4(conn)) {
struct iovec *iov_prev = tcp4_l2_iov[tcp4_payload_used - 1];
const uint16_t *check = NULL;
if (no_csum) {
struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
check = &iph->check;
}
tcp4_frame_conns[tcp4_payload_used] = conn;
iov = tcp4_l2_iov[tcp4_payload_used++];
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
tcp6_frame_conns[tcp6_payload_used] = conn;
iov = tcp6_l2_iov[tcp6_payload_used++];
l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
payload->th.th_off = sizeof(struct tcphdr) / 4;
payload->th.th_x2 = 0;
payload->th.th_flags = 0;
payload->th.ack = 1;
payload->th.psh = push;
iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
}
/**
@ -402,12 +285,11 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
*
* #syscalls recvmsg
*/
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
int sendlen, len, dlen, v4 = CONN_V4(conn);
int s = conn->sock, i, ret = 0;
int len, dlen, i, s = conn->sock;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
uint32_t already_sent, seq;
@ -422,13 +304,14 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(s, 0)) {
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
@ -454,19 +337,15 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
mh_sock.msg_iovlen = fill_bufs;
}
if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
(!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
tcp_payload_flush(c);
/* Silence Coverity CWE-125 false positive */
tcp4_payload_used = tcp6_payload_used = 0;
tcp_payload_used = 0;
}
for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
if (v4)
iov->iov_base = &tcp4_payload[tcp4_payload_used + i].data;
else
iov->iov_base = &tcp6_payload[tcp6_payload_used + i].data;
iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
iov->iov_len = mss;
}
if (iov_rem)
@ -477,12 +356,22 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
len = recvmsg(s, &mh_sock, MSG_PEEK);
while (len < 0 && errno == EINTR);
if (len < 0)
goto err;
if (len < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
tcp_rst(c, conn);
return -errno;
}
if (already_sent) /* No new data and EAGAIN: set EPOLLET */
conn_flag(c, conn, STALLED);
return 0;
}
if (!len) {
if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
return ret;
}
@ -493,45 +382,40 @@ int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
return 0;
}
sendlen = len;
if (!peek_offset_cap)
sendlen -= already_sent;
len -= already_sent;
if (sendlen <= 0) {
if (len <= 0) {
conn_flag(c, conn, STALLED);
return 0;
}
conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, ~STALLED);
send_bufs = DIV_ROUND_UP(sendlen, mss);
last_len = sendlen - (send_bufs - 1) * mss;
send_bufs = DIV_ROUND_UP(len, mss);
last_len = len - (send_bufs - 1) * mss;
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, 0, NULL);
tcp_update_seqack_wnd(c, conn, false, NULL);
/* Finally, queue to tap */
dlen = mss;
seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp4_payload_used;
int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
bool push = false;
if (i == send_bufs - 1)
if (i == send_bufs - 1) {
dlen = last_len;
push = true;
}
tcp_data_to_tap(c, conn, dlen, no_csum, seq);
tcp_data_to_tap(c, conn, dlen, no_csum, seq, push);
seq += dlen;
}
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
err:
if (errno != EAGAIN && errno != EWOULDBLOCK) {
ret = -errno;
tcp_rst(c, conn);
}
return ret;
}

tcp_buf.h

@ -6,11 +6,9 @@
#ifndef TCP_BUF_H
#define TCP_BUF_H
void tcp_sock4_iov_init(const struct ctx *c);
void tcp_sock6_iov_init(const struct ctx *c);
void tcp_flags_flush(const struct ctx *c);
void tcp_payload_flush(struct ctx *c);
int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
void tcp_sock_iov_init(const struct ctx *c);
void tcp_payload_flush(const struct ctx *c);
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
#endif /*TCP_BUF_H */

tcp_conn.h

@ -19,6 +19,7 @@
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sock: Socket descriptor number
* @events: Connection events, implying connection states
* @listening_sock: Listening socket this socket was accept()ed from, or -1
* @timer: timerfd descriptor for timeout events
* @flags: Connection flags representing internal attributes
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
@ -68,6 +69,7 @@ struct tcp_tap_conn {
#define CONN_STATE_BITS /* Setting these clears other flags */ \
(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
int listening_sock;
int timer :FD_REF_BITS;
@ -77,6 +79,7 @@ struct tcp_tap_conn {
#define ACTIVE_CLOSE BIT(2)
#define ACK_TO_TAP_DUE BIT(3)
#define ACK_FROM_TAP_DUE BIT(4)
#define ACK_FROM_TAP_BLOCKS BIT(5)
#define SNDBUF_BITS 24
unsigned int sndbuf :SNDBUF_BITS;
@ -95,6 +98,95 @@ struct tcp_tap_conn {
uint32_t seq_init_from_tap;
};
/**
* struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
* @pif: Interfaces for each side of the flow
* @side: Addresses and ports for each side of the flow
* @retrans: Number of retransmissions that occurred due to ACK_TIMEOUT
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @events: Connection events, implying connection states
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
* @flags: Connection flags representing internal attributes
* @seq_dup_ack_approx: Last duplicate ACK number sent to tap
* @wnd_from_tap: Last window size from tap, unscaled (as received)
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
* @seq_ack_from_tap: Last ACK number received from tap
* @seq_from_tap: Next sequence for packets from tap (not actually sent)
* @seq_ack_to_tap: Last ACK number sent to tap
* @seq_init_from_tap: Initial sequence number from tap
*/
struct tcp_tap_transfer {
uint8_t pif[SIDES];
struct flowside side[SIDES];
uint8_t retrans;
uint8_t ws_from_tap;
uint8_t ws_to_tap;
uint8_t events;
uint32_t tap_mss;
uint32_t sndbuf;
uint8_t flags;
uint8_t seq_dup_ack_approx;
uint16_t wnd_from_tap;
uint16_t wnd_to_tap;
uint32_t seq_to_tap;
uint32_t seq_ack_from_tap;
uint32_t seq_from_tap;
uint32_t seq_ack_to_tap;
uint32_t seq_init_from_tap;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
/**
* struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
* @seq_snd: Socket-side send sequence
* @seq_rcv: Socket-side receive sequence
* @sndq: Length of pending send queue (unacknowledged / not sent)
* @notsent: Part of pending send queue that wasn't sent out yet
* @rcvq: Length of pending receive queue
* @mss: Socket-side MSS clamp
* @timestamp: RFC 7323 timestamp
* @snd_wl1: Next sequence used in window probe (next sequence - 1)
* @snd_wnd: Socket-side sending window
* @max_window: Window clamp
* @rcv_wnd: Socket-side receive window
* @rcv_wup: rcv_nxt on last window update sent
* @snd_ws: Window scaling factor, send
* @rcv_ws: Window scaling factor, receive
* @tcpi_state: Connection state in TCP_INFO style (enum, tcp_states.h)
* @tcpi_options: TCPI_OPT_* constants (timestamps, selective ACK)
*/
struct tcp_tap_transfer_ext {
uint32_t seq_snd;
uint32_t seq_rcv;
uint32_t sndq;
uint32_t notsent;
uint32_t rcvq;
uint32_t mss;
uint32_t timestamp;
/* We can't just use struct tcp_repair_window: we need network order */
uint32_t snd_wl1;
uint32_t snd_wnd;
uint32_t max_window;
uint32_t rcv_wnd;
uint32_t rcv_wup;
uint8_t snd_ws;
uint8_t rcv_ws;
uint8_t tcpi_state;
uint8_t tcpi_options;
} __attribute__((packed, aligned(__alignof__(uint32_t))));
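As the structure comments note, both migration blocks are packed and carried in network order. A made-up stand-alone sketch, not part of the patch, of how individual fields would be converted on the way out (the struct and helper below are illustrative stand-ins, not the actual code):
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

struct example_ext {			/* stand-in for a few fields above */
	uint32_t seq_snd;		/* 32-bit fields: htonl()/ntohl() */
	uint32_t snd_wnd;
	uint8_t  snd_ws;		/* single bytes need no conversion */
} __attribute__((packed, aligned(__alignof__(uint32_t))));

int main(void)
{
	struct example_ext t = {
		.seq_snd = htonl(123456),
		.snd_wnd = htonl(65535),
		.snd_ws  = 7,
	};

	printf("decoded: seq_snd %u, snd_wnd %u, snd_ws %u\n",
	       ntohl(t.seq_snd), ntohl(t.snd_wnd), t.snd_ws);
	return 0;
}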
/**
* struct tcp_splice_conn - Descriptor for a spliced TCP connection
* @f: Generic flow information
@ -139,11 +231,23 @@ extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
bool tcp_flow_defer(const struct tcp_tap_conn *conn);
int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
int tcp_flow_migrate_target(struct ctx *c, int fd);
int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
int tcp_conn_pool_sock(int pool[]);
int tcp_conn_sock(const struct ctx *c, sa_family_t af);
int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
int tcp_conn_sock(sa_family_t af);
int tcp_sock_refill_pool(int pool[], sa_family_t af);
void tcp_splice_refill(const struct ctx *c);
#endif /* TCP_CONN_H */


@ -33,16 +33,18 @@
#define OPT_EOL 0
#define OPT_NOP 1
#define OPT_MSS 2
#define OPT_MSS_LEN 4
#define OPT_WS 3
#define OPT_WS_LEN 3
#define OPT_SACKP 4
#define OPT_SACK 5
#define OPT_TS 8
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP)
#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)]))
#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define HOSTSIDE(conn_) ((conn_)->f.pif[1] == PIF_HOST)
#define HOSTFLOW(conn_) (&((conn_)->f.side[HOSTSIDE(conn_)]))
#define HOST_SIDX(conn_) (FLOW_SIDX((conn_), HOSTSIDE(conn_)))
#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->oaddr))
#define CONN_V6(conn) (!CONN_V4(conn))
@ -63,6 +65,79 @@ enum tcp_iov_parts {
TCP_NUM_IOVS
};
/**
* struct tcp_payload_t - TCP header and data to send segments with payload
* @th: TCP header
* @data: TCP data
*/
struct tcp_payload_t {
struct tcphdr th;
uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
#ifdef __AVX2__
} __attribute__ ((packed, aligned(32))); /* For AVX2 checksum routines */
#else
} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
#endif
/** struct tcp_opt_nop - TCP NOP option
* @kind: Option kind (OPT_NOP = 1)
*/
struct tcp_opt_nop {
uint8_t kind;
} __attribute__ ((packed));
#define TCP_OPT_NOP ((struct tcp_opt_nop){ .kind = OPT_NOP, })
/** struct tcp_opt_mss - TCP MSS option
* @kind: Option kind (OPT_MSS == 2)
* @len: Option length (4)
* @mss: Maximum Segment Size
*/
struct tcp_opt_mss {
uint8_t kind;
uint8_t len;
uint16_t mss;
} __attribute__ ((packed));
#define TCP_OPT_MSS(mss_) \
((struct tcp_opt_mss) { \
.kind = OPT_MSS, \
.len = sizeof(struct tcp_opt_mss), \
.mss = htons(mss_), \
})
/** struct tcp_opt_ws - TCP Window Scaling option
* @kind: Option kind (OPT_WS == 3)
* @len: Option length (3)
* @shift: Window scaling shift
*/
struct tcp_opt_ws {
uint8_t kind;
uint8_t len;
uint8_t shift;
} __attribute__ ((packed));
#define TCP_OPT_WS(shift_) \
((struct tcp_opt_ws) { \
.kind = OPT_WS, \
.len = sizeof(struct tcp_opt_ws), \
.shift = (shift_), \
})
/** struct tcp_syn_opts - TCP options we apply to SYN packets
* @mss: Maximum Segment Size (MSS) option
* @nop: NOP opt (for alignment)
* @ws: Window Scaling (WS) option
*/
struct tcp_syn_opts {
struct tcp_opt_mss mss;
struct tcp_opt_nop nop;
struct tcp_opt_ws ws;
} __attribute__ ((packed));
#define TCP_SYN_OPTS(mss_, ws_) \
((struct tcp_syn_opts){ \
.mss = TCP_OPT_MSS(mss_), \
.nop = TCP_OPT_NOP, \
.ws = TCP_OPT_WS(ws_), \
})
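A small usage sketch for the option helpers above, not part of the patch; it relies on the definitions just shown, and 1460 (MSS) and 7 (window scaling shift) are made-up values:
static void syn_opts_example(struct tcphdr *th)
{
	struct tcp_syn_opts *opts = (struct tcp_syn_opts *)(th + 1);

	/* 8 bytes in total: MSS (kind 2, len 4), NOP (kind 1), WS (kind 3, len 3) */
	*opts = TCP_SYN_OPTS(1460, 7);

	/* TCP data offset counts 32-bit words: (20 + 8) / 4 = 7 */
	th->doff = (sizeof(*th) + sizeof(*opts)) / 4;
}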
extern char tcp_buf_discard [MAX_WINDOW];
void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
@ -82,19 +157,26 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
conn_event_do(c, conn, event); \
} while (0)
void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
#define tcp_rst(c, conn) \
do { \
flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
tcp_rst_do(c, conn); \
} while (0)
size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct iovec *iov, size_t dlen,
const uint16_t *check, uint32_t seq);
struct tcp_info_linux;
void tcp_fill_headers(const struct tcp_tap_conn *conn,
struct tap_hdr *taph,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
int force_seq, struct tcp_info *tinfo);
int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
struct tcphdr *th, char *data, size_t *optlen);
bool force_seq, struct tcp_info_linux *tinfo);
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
size_t *optlen);
int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset);
#endif /* TCP_INTERNAL_H */


@ -28,7 +28,7 @@
* - FIN_SENT_0: FIN (write shutdown) sent to accepted socket
* - FIN_SENT_1: FIN (write shutdown) sent to target socket
*
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
* #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
*/
#include <sched.h>
@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
ev[1].events = EPOLLOUT;
}
flow_foreach_sidei(sidei)
ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
flow_foreach_sidei(sidei) {
if (events & OUT_WAIT(sidei)) {
ev[sidei].events |= EPOLLOUT;
ev[!sidei].events &= ~EPOLLIN;
}
}
}
/**
@ -160,7 +164,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
int ret = -errno;
flow_err(conn, "ERROR on epoll_ctl(): %s", strerror(errno));
flow_perror(conn, "ERROR on epoll_ctl()");
return ret;
}
@ -200,8 +204,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (flag == CLOSING) {
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[0], NULL);
epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[1], NULL);
epoll_del(c, conn->s[0]);
epoll_del(c, conn->s[1]);
}
}
@ -313,14 +317,14 @@ static int tcp_splice_connect_finish(const struct ctx *c,
if (conn->pipe[sidei][0] < 0) {
if (pipe2(conn->pipe[sidei], O_NONBLOCK | O_CLOEXEC)) {
flow_err(conn, "cannot create %d->%d pipe: %s",
sidei, !sidei, strerror(errno));
flow_perror(conn, "cannot create %d->%d pipe",
sidei, !sidei);
conn_flag(c, conn, CLOSING);
return -EIO;
}
if (fcntl(conn->pipe[sidei][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
flow_trace(conn,
"cannot set %d->%d pipe size to %zu",
sidei, !sidei, c->tcp.pipe_size);
@ -348,9 +352,10 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
uint8_t tgtpif = conn->f.pif[TGTSIDE];
union sockaddr_inany sa;
socklen_t sl;
int one = 1;
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
conn->s[1] = tcp_conn_sock(af);
else if (tgtpif == PIF_SPLICE)
conn->s[1] = tcp_conn_sock_ns(c, af);
else
@ -359,18 +364,27 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
if (conn->s[1] < 0)
return -1;
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
&((int){ 1 }), sizeof(int))) {
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
conn->s[1]);
}
if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
conn->s[0]);
}
if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
conn->s[1]);
}
pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
if (connect(conn->s[1], &sa.sa, sl)) {
if (errno != EINPROGRESS) {
flow_trace(conn, "Couldn't connect socket for splice: %s",
strerror(errno));
strerror_(errno));
return -errno;
}
@ -468,11 +482,10 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
if (rc)
flow_err(conn, "Error retrieving SO_ERROR: %s",
strerror(errno));
flow_perror(conn, "Error retrieving SO_ERROR");
else
flow_trace(conn, "Error event on socket: %s",
strerror(err));
strerror_(err));
goto close;
}
@ -503,29 +516,27 @@ swap:
lowat_act_flag = RCVLOWAT_ACT(fromsidei);
while (1) {
ssize_t readlen, to_write = 0, written;
ssize_t readlen, written, pending;
int more = 0;
retry:
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
do
readlen = splice(conn->s[fromsidei], NULL,
conn->pipe[fromsidei][1], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
while (readlen < 0 && errno == EINTR);
if (readlen < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from read-side call", readlen);
if (readlen < 0) {
if (errno == EINTR)
goto retry;
if (errno != EAGAIN)
goto close;
to_write = c->tcp.pipe_size;
} else if (!readlen) {
if (!readlen) {
eof = 1;
to_write = c->tcp.pipe_size;
} else {
} else if (readlen > 0) {
never_read = 0;
to_write += readlen;
if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
more = SPLICE_F_MORE;
@ -533,19 +544,25 @@ retry:
conn_flag(c, conn, lowat_act_flag);
}
eintr:
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL, to_write,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
do
written = splice(conn->pipe[fromsidei][0], NULL,
conn->s[!fromsidei], NULL,
c->tcp.pipe_size,
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
while (written < 0 && errno == EINTR);
if (written < 0 && errno != EAGAIN)
goto close;
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, to_write);
written, c->tcp.pipe_size);
/* Most common case: skip updating counters. */
if (readlen > 0 && readlen == written) {
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
if (conn->flags & lowat_set_flag &&
if (!(conn->flags & lowat_set_flag) &&
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
@ -554,7 +571,7 @@ eintr:
&lowat, sizeof(lowat))) {
flow_trace(conn,
"Setting SO_RCVLOWAT %i: %s",
lowat, strerror(errno));
lowat, strerror_(errno));
} else {
conn_flag(c, conn, lowat_set_flag);
conn_flag(c, conn, lowat_act_flag);
@ -568,12 +585,6 @@ eintr:
conn->written[fromsidei] += written > 0 ? written : 0;
if (written < 0) {
if (errno == EINTR)
goto eintr;
if (errno != EAGAIN)
goto close;
if (conn->read[fromsidei] == conn->written[fromsidei])
break;
@ -584,10 +595,9 @@ eintr:
if (never_read && written == (long)(c->tcp.pipe_size))
goto retry;
if (!never_read && written < to_write) {
to_write -= written;
pending = conn->read[fromsidei] - conn->written[fromsidei];
if (!never_read && written > 0 && written < pending)
goto retry;
}
if (eof)
break;
@ -676,7 +686,7 @@ static void tcp_splice_pipe_refill(const struct ctx *c)
continue;
if (fcntl(splice_pipe_pool[i][0], F_SETPIPE_SZ,
c->tcp.pipe_size)) {
c->tcp.pipe_size) != (int)c->tcp.pipe_size) {
trace("TCP (spliced): cannot set pool pipe size to %zu",
c->tcp.pipe_size);
}
@ -697,16 +707,16 @@ static int tcp_sock_refill_ns(void *arg)
ns_enter(c);
if (c->ifi4) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool4, AF_INET);
int rc = tcp_sock_refill_pool(ns_sock_pool4, AF_INET);
if (rc < 0)
warn("TCP: Error refilling IPv4 ns socket pool: %s",
strerror(-rc));
strerror_(-rc));
}
if (c->ifi6) {
int rc = tcp_sock_refill_pool(c, ns_sock_pool6, AF_INET6);
int rc = tcp_sock_refill_pool(ns_sock_pool6, AF_INET6);
if (rc < 0)
warn("TCP: Error refilling IPv6 ns socket pool: %s",
strerror(-rc));
strerror_(-rc));
}
return 0;

tcp_vu.c (new file, 474 lines)

@ -0,0 +1,474 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* tcp_vu.c - TCP L2 vhost-user management functions
*
* Copyright Red Hat
* Author: Laurent Vivier <lvivier@redhat.com>
*/
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <netinet/if_ether.h>
#include <linux/virtio_net.h>
#include "util.h"
#include "ip.h"
#include "passt.h"
#include "siphash.h"
#include "inany.h"
#include "vhost_user.h"
#include "tcp.h"
#include "pcap.h"
#include "flow.h"
#include "tcp_conn.h"
#include "flow_table.h"
#include "tcp_vu.h"
#include "tap.h"
#include "tcp_internal.h"
#include "checksum.h"
#include "vu_common.h"
#include <time.h>
static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
static int head[VIRTQUEUE_MAX_SIZE + 1];
/**
* tcp_vu_hdrlen() - Return the size of the headers of an L2 frame (TCP)
* @v6: Set for IPv6 packet
*
* Return: Size of the L2 frame headers
*/
static size_t tcp_vu_hdrlen(bool v6)
{
size_t hdrlen;
hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf) +
sizeof(struct ethhdr) + sizeof(struct tcphdr);
if (v6)
hdrlen += sizeof(struct ipv6hdr);
else
hdrlen += sizeof(struct iphdr);
return hdrlen;
}
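With the usual Linux header sizes (12-byte merged vnet header, 14-byte Ethernet header, 20-byte IPv4 or 40-byte IPv6 header, 20-byte TCP header without options), this works out to 66 bytes for IPv4 and 86 bytes for IPv6. A trivial stand-alone check of that arithmetic, not part of the patch:
#include <stdio.h>

int main(void)
{
	int vnet = 12, eth = 14, ip4 = 20, ip6 = 40, tcp = 20;

	printf("IPv4 frame headers: %d bytes\n", vnet + eth + ip4 + tcp);
	printf("IPv6 frame headers: %d bytes\n", vnet + eth + ip6 + tcp);
	return 0;
}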
/**
* tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
* @c: Execution context
* @conn: Connection pointer
* @flags: TCP flags: if not set, send segment only if ACK is due
*
* Return: negative error code on connection reset, 0 otherwise
*/
int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
size_t optlen, hdrlen;
struct vu_virtq_element flags_elem[2];
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
struct iovec flags_iov[2];
struct tcp_syn_opts *opts;
struct iov_tail payload;
struct tcphdr *th;
struct ethhdr *eh;
uint32_t seq;
int elem_cnt;
int nb_ack;
int ret;
hdrlen = tcp_vu_hdrlen(CONN_V6(conn));
vu_set_element(&flags_elem[0], NULL, &flags_iov[0]);
elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
hdrlen + sizeof(struct tcp_syn_opts), NULL);
if (elem_cnt != 1)
return -1;
ASSERT(flags_elem[0].in_sg[0].iov_len >=
hdrlen + sizeof(struct tcp_syn_opts));
vu_set_vnethdr(vdev, flags_elem[0].in_sg[0].iov_base, 1);
eh = vu_eth(flags_elem[0].in_sg[0].iov_base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
if (CONN_V4(conn)) {
eh->h_proto = htons(ETH_P_IP);
ip4h = vu_ip(flags_elem[0].in_sg[0].iov_base);
*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
th = vu_payloadv4(flags_elem[0].in_sg[0].iov_base);
} else {
eh->h_proto = htons(ETH_P_IPV6);
ip6h = vu_ip(flags_elem[0].in_sg[0].iov_base);
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
th = vu_payloadv6(flags_elem[0].in_sg[0].iov_base);
}
memset(th, 0, sizeof(*th));
th->doff = sizeof(*th) / 4;
th->ack = 1;
seq = conn->seq_to_tap;
opts = (struct tcp_syn_opts *)(th + 1);
ret = tcp_prepare_flags(c, conn, flags, th, opts, &optlen);
if (ret <= 0) {
vu_queue_rewind(vq, 1);
return ret;
}
flags_elem[0].in_sg[0].iov_len = hdrlen + optlen;
payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
NULL, seq, !*c->pcap);
if (*c->pcap) {
pcap_iov(&flags_elem[0].in_sg[0], 1,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
nb_ack = 1;
if (flags & DUP_ACK) {
vu_set_element(&flags_elem[1], NULL, &flags_iov[1]);
elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
flags_elem[0].in_sg[0].iov_len, NULL);
if (elem_cnt == 1 &&
flags_elem[1].in_sg[0].iov_len >=
flags_elem[0].in_sg[0].iov_len) {
memcpy(flags_elem[1].in_sg[0].iov_base,
flags_elem[0].in_sg[0].iov_base,
flags_elem[0].in_sg[0].iov_len);
nb_ack++;
if (*c->pcap) {
pcap_iov(&flags_elem[1].in_sg[0], 1,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
}
}
vu_flush(vdev, vq, flags_elem, nb_ack);
return 0;
}
/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
* @c: Execution context
* @conn: Connection pointer
* @v6: Set for IPv6 connections
* @already_sent: Number of bytes already sent
* @fillsize: Maximum bytes to fill in guest-side receiving window
* @iov_cnt: Number of iov entries used to store the data (output)
* @head_cnt: Number of frame heads in the iov array (output)
*
* Return: Number of bytes received from the socket, or a negative error code
*/
static ssize_t tcp_vu_sock_recv(const struct ctx *c,
const struct tcp_tap_conn *conn, bool v6,
uint32_t already_sent, size_t fillsize,
int *iov_cnt, int *head_cnt)
{
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
int s = conn->sock;
ssize_t ret, len;
size_t hdrlen;
int elem_cnt;
int i;
*iov_cnt = 0;
hdrlen = tcp_vu_hdrlen(v6);
vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
elem_cnt = 0;
*head_cnt = 0;
while (fillsize > 0 && elem_cnt < VIRTQUEUE_MAX_SIZE) {
struct iovec *iov;
size_t frame_size, dlen;
int cnt;
cnt = vu_collect(vdev, vq, &elem[elem_cnt],
VIRTQUEUE_MAX_SIZE - elem_cnt,
MIN(mss, fillsize) + hdrlen, &frame_size);
if (cnt == 0)
break;
dlen = frame_size - hdrlen;
/* reserve space for headers in iov */
iov = &elem[elem_cnt].in_sg[0];
ASSERT(iov->iov_len >= hdrlen);
iov->iov_base = (char *)iov->iov_base + hdrlen;
iov->iov_len -= hdrlen;
head[(*head_cnt)++] = elem_cnt;
fillsize -= dlen;
elem_cnt += cnt;
}
if (peek_offset_cap) {
mh_sock.msg_iov = iov_vu + 1;
mh_sock.msg_iovlen = elem_cnt;
} else {
iov_vu[0].iov_base = tcp_buf_discard;
iov_vu[0].iov_len = already_sent;
mh_sock.msg_iov = iov_vu;
mh_sock.msg_iovlen = elem_cnt + 1;
}
do
ret = recvmsg(s, &mh_sock, MSG_PEEK);
while (ret < 0 && errno == EINTR);
if (ret < 0) {
vu_queue_rewind(vq, elem_cnt);
return -errno;
}
if (!peek_offset_cap)
ret -= already_sent;
/* adjust iov number and length of the last iov */
len = ret;
for (i = 0; len && i < elem_cnt; i++) {
struct iovec *iov = &elem[i].in_sg[0];
if (iov->iov_len > (size_t)len)
iov->iov_len = len;
len -= iov->iov_len;
}
/* adjust head count */
while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
(*head_cnt)--;
/* mark end of array */
head[*head_cnt] = i;
*iov_cnt = i;
/* release unused buffers */
vu_queue_rewind(vq, elem_cnt - i);
/* restore space for headers in iov */
for (i = 0; i < *head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
iov->iov_base = (char *)iov->iov_base - hdrlen;
iov->iov_len += hdrlen;
}
return ret;
}
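The trimming at the end of the function above, which adjusts the iovec array to the number of bytes actually peeked, can be illustrated with a stand-alone sketch (function name and values are made up, not from the patch):
#include <stdio.h>
#include <sys/uio.h>

static int trim_iov(struct iovec *iov, int cnt, size_t len)
{
	int i;

	for (i = 0; len && i < cnt; i++) {
		if (iov[i].iov_len > len)
			iov[i].iov_len = len;	/* shorten the last used entry */
		len -= iov[i].iov_len;
	}
	return i;				/* number of entries actually used */
}

int main(void)
{
	char a[1000], b[1000], c[1000];
	struct iovec iov[] = {
		{ a, sizeof(a) }, { b, sizeof(b) }, { c, sizeof(c) },
	};

	/* 1500 bytes received: 2 entries used, the second trimmed to 500 bytes */
	printf("%d entries used, last is %zu bytes\n",
	       trim_iov(iov, 3, 1500), iov[1].iov_len);
	return 0;
}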
/**
* tcp_vu_prepare() - Prepare the frame header
* @c: Execution context
* @conn: Connection pointer
* @iov: Pointer to the array of IO vectors
* @iov_cnt: Number of entries in @iov
* @check: Checksum, if already known
* @no_tcp_csum: Do not set TCP checksum
* @push: Set PSH flag, last segment in a batch
*/
static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
struct iovec *iov, size_t iov_cnt,
const uint16_t **check, bool no_tcp_csum, bool push)
{
const struct flowside *toside = TAPFLOW(conn);
bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
size_t hdrlen = tcp_vu_hdrlen(v6);
struct iov_tail payload = IOV_TAIL(iov, iov_cnt, hdrlen);
char *base = iov[0].iov_base;
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
struct tcphdr *th;
struct ethhdr *eh;
/* We assume the first iovec provided by the guest is large enough to
 * hold all the headers of the L2 frame
 */
ASSERT(iov[0].iov_len >= hdrlen);
eh = vu_eth(base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
/* initialize header */
if (!v6) {
eh->h_proto = htons(ETH_P_IP);
ip4h = vu_ip(base);
*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
th = vu_payloadv4(base);
} else {
eh->h_proto = htons(ETH_P_IPV6);
ip6h = vu_ip(base);
*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
th = vu_payloadv6(base);
}
memset(th, 0, sizeof(*th));
th->doff = sizeof(*th) / 4;
th->ack = 1;
th->psh = push;
tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
*check, conn->seq_to_tap, no_tcp_csum);
if (ip4h)
*check = &ip4h->check;
}
/**
* tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
* in window
* @c: Execution context
* @conn: Connection pointer
*
* Return: Negative on connection reset, 0 otherwise
*/
int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
ssize_t len, previous_dlen;
int i, iov_cnt, head_cnt;
size_t hdrlen, fillsize;
int v6 = CONN_V6(conn);
uint32_t already_sent;
const uint16_t *check;
if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
debug("Got packet, but RX virtqueue not usable yet");
return 0;
}
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
if (SEQ_LT(already_sent, 0)) {
/* RFC 761, section 2.1. */
flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
conn->seq_ack_from_tap, conn->seq_to_tap);
conn->seq_to_tap = conn->seq_ack_from_tap;
already_sent = 0;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
}
}
if (!wnd_scaled || already_sent >= wnd_scaled) {
conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
/* Set up buffer descriptors we'll fill completely and partially. */
fillsize = wnd_scaled - already_sent;
/* collect the buffers from vhost-user and fill them with the
* data from the socket
*/
len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize,
&iov_cnt, &head_cnt);
if (len < 0) {
if (len != -EAGAIN && len != -EWOULDBLOCK) {
tcp_rst(c, conn);
return len;
}
if (already_sent) /* No new data and EAGAIN: set EPOLLET */
conn_flag(c, conn, STALLED);
return 0;
}
if (!len) {
if (already_sent) {
conn_flag(c, conn, STALLED);
} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
SOCK_FIN_RCVD) {
int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
return ret;
}
conn_event(c, conn, TAP_FIN_SENT);
}
return 0;
}
conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
conn_flag(c, conn, ~STALLED);
/* Likely, some new data was acked too. */
tcp_update_seqack_wnd(c, conn, false, NULL);
/* initialize headers */
/* iov_vu is an array of buffers whose size can be smaller than the
 * frame size we want to use. Thanks to the num_buffers field we can
 * merge several virtio buffers into a single packet: we only need to
 * set the packet headers in the first iov of each frame, and
 * num_buffers to the number of iov entries making up the frame.
 */
hdrlen = tcp_vu_hdrlen(v6);
for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
int buf_cnt = head[i + 1] - head[i];
ssize_t dlen = iov_size(iov, buf_cnt) - hdrlen;
bool push = i == head_cnt - 1;
vu_set_vnethdr(vdev, iov->iov_base, buf_cnt);
/* The IPv4 header checksum varies only with dlen */
if (previous_dlen != dlen)
check = NULL;
previous_dlen = dlen;
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
if (*c->pcap) {
pcap_iov(iov, buf_cnt,
sizeof(struct virtio_net_hdr_mrg_rxbuf));
}
conn->seq_to_tap += dlen;
}
/* send packets */
vu_flush(vdev, vq, elem, iov_cnt);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
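A stand-alone sketch of the window arithmetic used above to size the batch (values are made up, not from the patch):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint16_t wnd_from_tap = 1024;	/* made-up: raw window from the guest */
	uint8_t ws_from_tap = 7;	/* made-up: window scaling factor */
	uint32_t seq_to_tap = 150000, seq_ack_from_tap = 100000;

	uint32_t wnd_scaled = (uint32_t)wnd_from_tap << ws_from_tap;
	uint32_t already_sent = seq_to_tap - seq_ack_from_tap;

	/* 131072 - 50000 = 81072 bytes can still be queued to the guest */
	printf("window %u, in flight %u, can fill %u\n",
	       wnd_scaled, already_sent, wnd_scaled - already_sent);
	return 0;
}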

tcp_vu.h (new file, 12 lines)

@ -0,0 +1,12 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/* Copyright Red Hat
* Author: Laurent Vivier <lvivier@redhat.com>
*/
#ifndef TCP_VU_H
#define TCP_VU_H
int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
#endif /*TCP_VU_H */

test/.gitignore (1 line added)

@ -8,5 +8,6 @@ QEMU_EFI.fd
*.raw.xz
*.bin
nstool
rampstream
guest-key
guest-key.pub


@ -8,7 +8,6 @@
WGET = wget -c
DEBIAN_IMGS = debian-8.11.0-openstack-amd64.qcow2 \
debian-9-nocloud-amd64-daily-20200210-166.qcow2 \
debian-10-nocloud-amd64.qcow2 \
debian-10-generic-arm64.qcow2 \
debian-10-generic-ppc64el-20220911-1135.qcow2 \
@ -42,8 +41,7 @@ OPENSUSE_IMGS = openSUSE-Leap-15.1-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.2-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Leap-15.3-JeOS.x86_64-kvm-and-xen.qcow2 \
openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz \
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz \
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
UBUNTU_OLD_IMGS = trusty-server-cloudimg-amd64-disk1.img \
trusty-server-cloudimg-i386-disk1.img \
@ -54,7 +52,8 @@ UBUNTU_IMGS = $(UBUNTU_OLD_IMGS) $(UBUNTU_NEW_IMGS)
DOWNLOAD_ASSETS = mbuto podman \
$(DEBIAN_IMGS) $(FEDORA_IMGS) $(OPENSUSE_IMGS) $(UBUNTU_IMGS)
TESTDATA_ASSETS = small.bin big.bin medium.bin
TESTDATA_ASSETS = small.bin big.bin medium.bin \
rampstream
LOCAL_ASSETS = mbuto.img mbuto.mem.img podman/bin/podman QEMU_EFI.fd \
$(DEBIAN_IMGS:%=prepared-%) $(FEDORA_IMGS:%=prepared-%) \
$(UBUNTU_NEW_IMGS:%=prepared-%) \
@ -87,7 +86,7 @@ podman/bin/podman: pull-podman
guest-key guest-key.pub:
ssh-keygen -f guest-key -N ''
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub $(TESTDATA_ASSETS)
mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub rampstream-check.sh $(TESTDATA_ASSETS)
./mbuto/mbuto -p ./$< -c lz4 -f $@
mbuto.mem.img: passt.mem.mbuto mbuto ../passt.avx2
@ -135,9 +134,6 @@ realclean: clean
debian-8.11.0-openstack-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/OpenStack/archive/8.11.0/debian-8.11.0-openstack-$*.qcow2
debian-9-nocloud-%-daily-20200210-166.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/stretch/daily/20200210-166/debian-9-nocloud-$*-daily-20200210-166.qcow2
debian-10-nocloud-%.qcow2:
$(WGET) -O $@ https://cloud.debian.org/images/cloud/buster/latest/debian-10-nocloud-$*.qcow2
@ -203,9 +199,6 @@ openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64.raw.xz:
openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz:
$(WGET) -O $@ http://download.opensuse.org/ports/armv7hl/tumbleweed/appliances/openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l.raw.xz
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2:
$(WGET) -O $@ https://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2
# Ubuntu downloads
trusty-server-cloudimg-%-disk1.img:
$(WGET) -O $@ https://cloud-images.ubuntu.com/trusty/current/trusty-server-cloudimg-$*-disk1.img


@ -135,7 +135,7 @@ layout_two_guests() {
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #1" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
@ -143,13 +143,66 @@ layout_two_guests() {
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #1" pasta_1 passt_2
info_layout "two guests, two passt instances, in namespaces"
sleep 1
}
# layout_migrate() - Two guest panes, two passt panes, two passt-repair panes,
# plus host and log
layout_migrate() {
sleep 1
tmux kill-pane -a -t 0
cmd_write 0 clear
tmux split-window -v -t passt_test
tmux split-window -h -l '33%'
tmux split-window -h -t passt_test:1.1
tmux split-window -h -l '35%' -t passt_test:1.0
tmux split-window -v -t passt_test:1.0
tmux split-window -v -t passt_test:1.4
tmux split-window -v -t passt_test:1.6
tmux split-window -v -t passt_test:1.3
PANE_GUEST_1=0
PANE_GUEST_2=1
PANE_INFO=2
PANE_MON=3
PANE_HOST=4
PANE_PASST_REPAIR_1=5
PANE_PASST_1=6
PANE_PASST_REPAIR_2=7
PANE_PASST_2=8
get_info_cols
pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
tmux send-keys -t ${PANE_INFO} -N 100 C-m
tmux select-pane -t ${PANE_INFO} -T "test log"
pane_watch_contexts ${PANE_MON} "QEMU monitor" mon mon
pane_watch_contexts ${PANE_HOST} host host
pane_watch_contexts ${PANE_PASST_REPAIR_1} "passt-repair #1 in namespace #1" repair_1 passt_repair_1
pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
pane_watch_contexts ${PANE_PASST_REPAIR_2} "passt-repair #2 in namespace #2" repair_2 passt_repair_2
pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
info_layout "two guests, two passt + passt-repair instances, in namespaces"
sleep 1
}
# layout_demo_pasta() - Four panes for pasta demo
layout_demo_pasta() {
sleep 1


@ -49,6 +49,21 @@ td:empty { visibility: hidden; }
__passt_tcp_LINE__ __passt_udp_LINE__
</table>
</li><li><p>passt with vhost-user support</p>
<table class="passt" width="70%">
<tr>
<th/>
<th id="perf_passt_vu_tcp" colspan="__passt_vu_tcp_cols__">TCP, __passt_vu_tcp_threads__ at __passt_vu_tcp_freq__ GHz</th>
<th id="perf_passt_vu_udp" colspan="__passt_vu_udp_cols__">UDP, __passt_vu_udp_threads__ at __passt_vu_udp_freq__ GHz</th>
</tr>
<tr>
<td align="right">MTU:</td>
__passt_vu_tcp_header__
__passt_vu_udp_header__
</tr>
__passt_vu_tcp_LINE__ __passt_vu_udp_LINE__
</table>
<style type="text/CSS">
table.pasta_local td { border: 0px solid; padding: 6px; line-height: 1; }
table.pasta_local td { text-align: right; }


@ -15,8 +15,7 @@
INITRAMFS="${BASEPATH}/mbuto.img"
VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
__mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
VMEM="$((${__mem_kib} / 1024 / 4))"
MEM_KIB="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
QEMU_ARCH="$(uname -m)"
[ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
@ -46,24 +45,38 @@ setup_passt() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -P ${STATESETUP}/passt.pid"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -H hostname1 --fqdn fqdn1.passt.test -P ${STATESETUP}/passt.pid"
# pidfile isn't created until passt is listening
wait_for [ -f "${STATESETUP}/passt.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev=" \
-chardev socket,id=c,path=${STATESETUP}/passt.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
fi
GUEST_CID=94557
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
" ${__qemu_netdev}" \
" -pidfile ${STATESETUP}/qemu.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
@ -142,29 +155,43 @@ setup_passt_in_ns() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_in_pasta.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
if [ ${VALGRIND} -eq 1 ]; then
context_run passt "make clean"
context_run passt "make valgrind"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
else
context_run passt "make clean"
context_run passt "make"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
fi
wait_for [ -f "${STATESETUP}/passt.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev=" \
-chardev socket,id=c,path=${STATESETUP}/passt.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
fi
GUEST_CID=94557
context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}" \
' -machine accel=kvm' \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
" ${__qemu_netdev}" \
" -pidfile ${STATESETUP}/qemu.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
@ -214,11 +241,126 @@ setup_two_guests() {
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} --fqdn fqdn1.passt.test -H hostname1 -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
__opts=
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} --hostname hostname2 --fqdn fqdn2 -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
__vmem="$((${MEM_KIB} / 1024 / 4))"
if [ ${VHOST_USER} -eq 1 ]; then
__vmem="$(((${__vmem} + 500) / 1000))G"
__qemu_netdev1=" \
-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
__qemu_netdev2=" \
-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
else
__qemu_netdev1="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket"
__qemu_netdev2="-device virtio-net-pci,netdev=s \
-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket"
fi
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev1}" \
" -pidfile ${STATESETUP}/qemu_1.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
" ${__qemu_netdev2}" \
" -pidfile ${STATESETUP}/qemu_2.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
context_setup_guest guest_1 ${GUEST_1_CID}
context_setup_guest guest_2 ${GUEST_2_CID}
}
# setup_migrate() - Set up two namespaces, run qemu, passt/passt-repair in both
setup_migrate() {
context_setup_host host
context_setup_host mon
context_setup_host pasta_1
context_setup_host pasta_2
layout_migrate
# Ports:
#
#           guest #1  | guest #2  | ns #1     | host
#           ----------|-----------|-----------|------------
# 10001     as server |           | to guest  | to ns #1
# 10002               |           | as server | to ns #1
# 10003               |           | to init   | as server
# 10004               | as server | to guest  | to ns #1
__opts=
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/pasta_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
__map_host4=192.0.2.1
__map_host6=2001:db8:9a55::1
__map_ns4=192.0.2.2
__map_ns6=2001:db8:9a55::2
# Option 1: send stuff via spliced path in pasta
# context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002 -T 10003 -u 10001,10002 -U 10003 --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
# Option 2: send stuff via tap (--map-guest-addr) instead (useful to see capture of full migration)
context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr ${__map_host4} --map-guest-addr ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
context_setup_nstool passt_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_1 ${STATESETUP}/ns1.hold
context_setup_nstool passt_2 ${STATESETUP}/ns1.hold
context_setup_nstool passt_repair_2 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_1 ${STATESETUP}/ns1.hold
context_setup_nstool qemu_2 ${STATESETUP}/ns1.hold
__ifname="$(context_run qemu_1 "ip -j link show | jq -rM '.[] | select(.link_type == \"ether\").ifname'")"
sleep 1
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
wait_for [ -f "${STATESETUP}/passt_1.pid" ]
__opts=
context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
__opts="--vhost-user"
[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
@ -226,34 +368,52 @@ setup_two_guests() {
context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
wait_for [ -f "${STATESETUP}/passt_2.pid" ]
context_run_bg passt_repair_2 "./passt-repair ${STATESETUP}/passt_2.socket.repair"
__vmem="512M" # Keep migration fast
__qemu_netdev1=" \
-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
__qemu_netdev2=" \
-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
-netdev vhost-user,id=v,chardev=c \
-device virtio-net,netdev=v \
-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
-numa node,memdev=m"
GUEST_1_CID=94557
context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket " \
" ${__qemu_netdev1}" \
" -pidfile ${STATESETUP}/qemu_1.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID" \
" -monitor unix:${STATESETUP}/qemu_1_mon.sock,server,nowait"
GUEST_2_CID=94558
context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}" \
' -M accel=kvm:tcg' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -m '${__vmem}' -cpu host -smp '${VCPUS} \
' -kernel '"${KERNEL}" \
' -initrd '${INITRAMFS}' -nographic -serial stdio' \
' -nodefaults' \
' -append "console=ttyS0 mitigations=off apparmor=0" ' \
' -device virtio-net-pci,netdev=s0 ' \
" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket " \
" ${__qemu_netdev2}" \
" -pidfile ${STATESETUP}/qemu_2.pid" \
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID" \
" -monitor unix:${STATESETUP}/qemu_2_mon.sock,server,nowait" \
" -incoming tcp:0:20005"
context_setup_guest guest_1 ${GUEST_1_CID}
context_setup_guest guest_2 ${GUEST_2_CID}
# Only available after migration:
( context_setup_guest guest_2 ${GUEST_2_CID} & )
}
# teardown_context_watch() - Remove contexts and stop panes watching them
@ -326,7 +486,8 @@ teardown_two_guests() {
context_wait pasta_1
context_wait pasta_2
rm -f "${STATESETUP}/passt__[12].pid" "${STATESETUP}/pasta_[12].pid"
rm "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
@ -335,6 +496,30 @@ teardown_two_guests() {
teardown_context_watch ${PANE_PASST_2} pasta_2 passt_2
}
# teardown_migrate() - Exit namespaces, kill qemu processes, passt and pasta
teardown_migrate() {
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_1.pid")
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_2.pid")
context_wait qemu_1
context_wait qemu_2
${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/passt_2.pid")
context_wait passt_1
context_wait passt_2
${NSTOOL} stop "${STATESETUP}/ns1.hold"
context_wait pasta_1
rm -f "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
rm -f "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
teardown_context_watch ${PANE_HOST} host
teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
teardown_context_watch ${PANE_GUEST_2} qemu_2 guest_2
teardown_context_watch ${PANE_PASST_1} pasta_1 passt_1
teardown_context_watch ${PANE_PASST_2} pasta_1 passt_2
}
# teardown_demo_passt() - Exit namespace, kill qemu, passt and pasta
teardown_demo_passt() {
tmux send-keys -t ${PANE_GUEST} "C-c"


@ -33,7 +33,7 @@ setup_memory() {
pane_or_context_run guest 'qemu-system-$(uname -m)' \
' -machine accel=kvm' \
' -m '${VMEM}' -cpu host -smp '${VCPUS} \
' -m '$((${MEM_KIB} / 1024 / 4))' -cpu host -smp '${VCPUS} \
' -kernel ' "/boot/vmlinuz-$(uname -r)" \
' -initrd '${INITRAMFS_MEM}' -nographic -serial stdio' \
' -nodefaults' \


@ -19,6 +19,7 @@ STATUS_FILE_INDEX=0
STATUS_COLS=
STATUS_PASS=0
STATUS_FAIL=0
STATUS_SKIPPED=0
PR_RED='\033[1;31m'
PR_GREEN='\033[1;32m'
@ -31,8 +32,8 @@ PR_DELAY_INIT=100 # ms
# $@: Message to print
info() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> $STATEBASE/log_pipe
printf "${@}\n" >> "${LOGFILE}"
}
# info_n() - Highlight, print message to pane and to log file without newline
@ -47,13 +48,13 @@ info_n() {
# $@: Message to print
info_nolog() {
tmux select-pane -t ${PANE_INFO}
echo "${@}" >> $STATEBASE/log_pipe
printf "${@}\n" >> $STATEBASE/log_pipe
}
# log() - Print message to log file
# $@: Message to print
log() {
echo "${@}" >> "${LOGFILE}"
printf "${@}\n" >> "${LOGFILE}"
}
# info_nolog_n() - Send message to pane without highlighting it, without newline
@ -439,19 +440,21 @@ info_layout() {
# status_test_ok() - Update counter of passed tests, log and display message
status_test_ok() {
STATUS_PASS=$((STATUS_PASS + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_passed
}
# status_test_fail() - Update counter of failed tests, log and display message
status_test_fail() {
STATUS_FAIL=$((STATUS_FAIL + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_failed
}
# status_test_skip() - Update counter of skipped tests, log and display message
status_test_skip() {
STATUS_SKIPPED=$((STATUS_SKIPPED + 1))
tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
info_skipped
}
@ -664,7 +667,7 @@ pause_continue() {
# run_term() - Start tmux session, running entry point, with recording if needed
run_term() {
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG"
TMUX="tmux new-session -s passt_test -eSTATEBASE=$STATEBASE -ePCAP=$PCAP -eDEBUG=$DEBUG -eTRACE=$TRACE -eKERNEL=$KERNEL"
if [ ${CI} -eq 1 ]; then
printf '\e[8;50;240t'


@ -20,10 +20,7 @@ test_iperf3s() {
__sctx="${1}"
__port="${2}"
pane_or_context_run_bg "${__sctx}" \
'iperf3 -s -p'${__port}' & echo $! > s.pid' \
sleep 1 # Wait for server to be ready
pane_or_context_run "${__sctx}" 'iperf3 -s -p'${__port}' -D -I s.pid'
}
# test_iperf3k() - Kill iperf3 server
@ -31,7 +28,7 @@ test_iperf3s() {
test_iperf3k() {
__sctx="${1}"
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid'
pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid)'
sleep 1 # Wait for kernel to free up ports
}
@ -68,6 +65,45 @@ test_iperf3() {
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
# test_iperf3m() - Ugly helper for iperf3 directive, guest migration variant
# $1: Variable name: to put the measure bandwidth into
# $2: Initial source/client context
# $3: Second source/client context the guest is moving to
# $4: Destination name or address for client
# $5: Port number, ${i} is translated to process index
# $6: Run time, in seconds
# $7: Client options
test_iperf3m() {
__var="${1}"; shift
__cctx="${1}"; shift
__cctx2="${1}"; shift
__dest="${1}"; shift
__port="${1}"; shift
__time="${1}"; shift
pane_or_context_run "${__cctx}" 'rm -f c.json'
# A 1s wait for connection on what's basically a local link
# indicates something is pretty wrong
__timeout=1000
pane_or_context_run_bg "${__cctx}" \
'iperf3 -J -c '${__dest}' -p '${__port} \
' --connect-timeout '${__timeout} \
' -t'${__time}' -i0 '"${@}"' > c.json' \
__jval=".end.sum_received.bits_per_second"
sleep $((${__time} + 3))
pane_or_context_output "${__cctx2}" \
'cat c.json'
__bw=$(pane_or_context_output "${__cctx2}" \
'cat c.json | jq -rMs "map('${__jval}') | add"')
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
}
test_one_line() {
__line="${1}"
@ -177,6 +213,12 @@ test_one_line() {
"guest2w")
pane_or_context_wait guest_2 || TEST_ONE_nok=1
;;
"mon")
pane_or_context_run mon "${__arg}" || TEST_ONE_nok=1
;;
"monb")
pane_or_context_run_bg mon "${__arg}"
;;
"ns")
pane_or_context_run ns "${__arg}" || TEST_ONE_nok=1
;;
@ -292,6 +334,9 @@ test_one_line() {
"iperf3")
test_iperf3 ${__arg}
;;
"iperf3m")
test_iperf3m ${__arg}
;;
"set")
TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__arg%% *}__" "${__arg#* }")"
;;

test/migrate/basic (new file, 59 lines)

@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic - Check basic migration functionality
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1/guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
sleep 1
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]

test/migrate/basic_fin (new file, 62 lines)

@ -0,0 +1,62 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/basic_fin - Outbound traffic across migration, half-closed socket
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: guest1, half-close, guest2 > host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
#hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
#sleep 20
# Option 1: via spliced path in pasta, namespace to host
# guest1b { printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
guest1b { printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]

test/migrate/bidirectional (new file, 64 lines)

@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional - Check migration with messages in both directions
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 > host, host > guest1/guest2
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
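# Messages flow both ways: each side runs a TCP listener writing into msg plus
# a UNIX domain proxy socket, so that the first half of each message can be
# sent before the migration and the second half after it, from guest 2 and to
# guest 2 respectively.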
hostb socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
guest1b socat -u TCP4-LISTEN:10001 OPEN:msg,create,trunc
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because the shell jobs it would wait for are
# (also) guest #1's; use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]

test/migrate/bidirectional_fin (new file)
@@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/bidirectional_fin - Both directions, half-closed sockets
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4: guest1/guest2 <- (half closed) -> host
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
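# Same proxy pattern as test/migrate/bidirectional, but both listeners send
# "FIN" and half-close (shut-down) straight away, so the migration has to
# preserve two established, half-closed connections with data still pending.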
hostb echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
guest1b echo FIN | socat TCP4-LISTEN:10001,shut-down STDIO,ignoreeof > msg
sleep 1
guest1b socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
hostb socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
sleep 1
guest1 printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
host printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
sleep 2
guest2 printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
host printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
hostw
# FIXME: guest2w doesn't work here because the shell jobs it would wait for are
# (also) guest #1's; use sleep 1 for the moment
sleep 1
hout MSG cat __STATESETUP__/msg
check [ "__MSG__" = "Hello from guest 1 and from guest 2" ]
g2out MSG cat msg
check [ "__MSG__" = "Dear guest 1, you are now guest 2" ]

test/migrate/iperf3_bidir6 (new file)
@@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_bidir6 - Migration behaviour with many bidirectional flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 128
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N --bidir
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host <-> guest flood, many flows, during migration
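# monb backgrounds the monitor command: about one second into the iperf3 run
# guest 1 is asked to migrate, so the many bidirectional flows have to survive
# the switch to guest 2 and the reported bandwidth covers both sides of it.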
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_in4 (new file)
@@ -0,0 +1,50 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in4 - Migration behaviour under inbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
guest1 /sbin/sysctl -w net.core.rmem_max=33554432
guest1 /sbin/sysctl -w net.core.wmem_max=33554432
set THREADS 1
set TIME 4
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_in6 (new file)
@@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_in6 - Migration behaviour under inbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 4
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 host to guest throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_many_out6 (new file)
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_many_out6 - Migration behaviour with many outbound flows
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 16
set TIME 3
set OMIT 0.1
set OPTS -Z -P __THREADS__ -O__OMIT__ -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host flood, many flows, during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_out4 (new file)
@@ -0,0 +1,47 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test TCP/IPv4 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/iperf3_out6 (new file)
@@ -0,0 +1,58 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
#
# Copyright (c) 2025 Red Hat GmbH
# Author: Stefano Brivio <sbrivio@redhat.com>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set THREADS 6
set TIME 2
set OMIT 0.1
set OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv6 guest to host throughput during migration
monb sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
iperf3s host 10006
iperf3m BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
bw __BW__ 1 2
iperf3k host

test/migrate/rampstream_in (new file)
@@ -0,0 +1,59 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_in - Check sequence correctness with inbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, inbound
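# The guest checks the sequence of the incoming stream: the host sends
# __RAMPS__ "ramps" (a known repeating pattern), so bytes lost, duplicated or
# reordered across the migration break the check. rampstream-check.sh is
# assumed to wrap the checker and save its exit status and diagnostics to
# rampstream.status and rampstream.err, which are read back on guest 2 below.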
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
guest1b socat -u TCP4-LISTEN:10001 EXEC:"rampstream-check.sh __RAMPS__"
sleep 1
hostb socat -u EXEC:"test/rampstream send __RAMPS__" TCP4:__ADDR1__:10001
sleep 1
monb echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw
guest2 cat rampstream.err
guest2 [ $(cat rampstream.status) -eq 0 ]

test/migrate/rampstream_out (new file)
@@ -0,0 +1,55 @@
# SPDX-License-Identifier: GPL-2.0-or-later
#
# PASST - Plug A Simple Socket Transport
# for qemu/UNIX domain socket mode
#
# PASTA - Pack A Subtle Tap Abstraction
# for network namespace/tap device mode
#
# test/migrate/rampstream_out - Check sequence correctness with outbound ramp
#
# Copyright (c) 2025 Red Hat
# Author: David Gibson <david@gibson.dropbear.id.au>
g1tools ip jq dhclient socat cat
htools ip jq
set MAP_HOST4 192.0.2.1
set MAP_HOST6 2001:db8:9a55::1
set MAP_NS4 192.0.2.2
set MAP_NS6 2001:db8:9a55::2
set RAMPS 6000000
test Interface name
g1out IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
check [ -n "__IFNAME1__" ]
test DHCP: address
guest1 ip link set dev __IFNAME1__ up
guest1 /sbin/dhclient -4 __IFNAME1__
g1out ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
check [ "__ADDR1__" = "__HOST_ADDR__" ]
test DHCPv6: address
# Link is up now, wait for DAD to complete
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
guest1 /sbin/dhclient -6 __IFNAME1__
# Wait for DAD to complete on the DHCP address
guest1 while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
g1out ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ "__ADDR1_6__" = "__HOST_ADDR6__" ]
test TCP/IPv4: sequence check, ramps, outbound
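# Same check in the opposite direction: guest 1 streams the ramps to a checker
# on the host, the migration is triggered mid-stream, and hostw then waits for
# the host-side checker, presumably failing the test on a non-zero exit.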
g1out GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
hostb socat -u TCP4-LISTEN:10006 EXEC:"test/rampstream check __RAMPS__"
sleep 1
guest1b socat -u EXEC:"rampstream send __RAMPS__" TCP4:__MAP_HOST4__:10006
sleep 1
mon echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
hostw

Some files were not shown because too many files have changed in this diff.