From e5ba8adef71ec53e192373ed1267dc338719dda0 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 12 Dec 2024 10:50:48 +0100
Subject: [PATCH 001/221] README: Mark vhost-user as supported

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 752e59f..54fed07 100644
--- a/README.md
+++ b/README.md
@@ -321,7 +321,7 @@ speeding up local connections, and usually requiring NAT. _pasta_:
   protocol
 * ✅ 4 to 50 times IPv4 TCP throughput of existing, conceptually similar
   solutions depending on MTU (UDP and IPv6 hard to compare)
-* 🛠 [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
+* ✅ [_vhost-user_ support](https://bugs.passt.top/show_bug.cgi?id=25) for
   maximum one copy on every data path and lower request-response latency
 * ⌚ [multithreading](https://bugs.passt.top/show_bug.cgi?id=13)
 * ⌚ [raw IP socket support](https://bugs.passt.top/show_bug.cgi?id=14) if

From 2385b69a66807e32dca5ae17ab64686888e4c682 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 19 Dec 2024 17:27:44 +0100
Subject: [PATCH 002/221] Makefile: Report error and stop if we can't set
 TARGET

I don't think it's necessarily productive to check all the possible
error conditions in the Makefile, but this one is annoying: issue
'make' without a C compiler, then install one, and build again.

Then run passt and it will mysteriously terminate on epoll_wait(),
because seccomp.h is good enough to build against, but the resulting
seccomp filter doesn't allow any system call. Not really fun to debug.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 1fce737..464eef1 100644
--- a/Makefile
+++ b/Makefile
@@ -16,6 +16,7 @@ VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ versio
 DUAL_STACK_SOCKETS := 1
 
 TARGET ?= $(shell $(CC) -dumpmachine)
+$(if $(TARGET),,$(error Failed to get target architecture))
 # Get 'uname -m'-like architecture description for target
 TARGET_ARCH := $(firstword $(subst -, ,$(TARGET)))
 TARGET_ARCH := $(patsubst [:upper:],[:lower:],$(TARGET_ARCH))

From 324233bd9b8baa3ec13a7425ea3ec7145e3ce645 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 20 Dec 2024 12:40:29 +0100
Subject: [PATCH 003/221] udp_flow: Don't block multicast and broadcast
 messages

It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):

  $ mkdir /tmp/minidlna
  $ cat conf
  media_dir=/tmp/minidlna
  db_dir=/tmp/minidlna
  $ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  1.0327: Flow 0 (NEW): FREE -> NEW
  1.0327: Flow 0 (INI): NEW -> INI
  1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
  1.0327: Flow 0 (FREE): INI -> FREE
  1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
  1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900

This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.

And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.

Let multicast and broadcast packets through, refusing only to
establish flows with unspecified endpoint, as those would actually
cause havoc in the flow table.

IP-wise, SSDP notifications look like this (after this patch), inside
and outside:

  $ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'

[...]

  $ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
      2   0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

  # tshark -i ens3 -a packets:1 multicast 2>/dev/null
      1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1

Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_flow.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index 343caae..9fd7d06 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -209,7 +209,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
 
 	if (!inany_is_unicast(&ini->eaddr) ||
 	    ini->eport == 0 || ini->oport == 0) {
-		/* In principle ini->oddr also must be unicast, but when we've
+		/* In principle ini->oddr also must be specified, but when we've
 		 * been initiated from a socket bound to 0.0.0.0 or ::, we don't
 		 * know our address, so we have to leave it unpopulated.
 		 */
@@ -267,8 +267,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 	ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport,
 			       daddr, dstport);
 
-	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 ||
-	    !inany_is_unicast(&ini->oaddr) || ini->oport == 0) {
+	if (inany_is_unspecified(&ini->eaddr) || ini->eport == 0 ||
+	    inany_is_unspecified(&ini->oaddr) || ini->oport == 0) {
 		flow_dbg(flow, "Invalid endpoint on UDP packet");
 		flow_alloc_cancel(flow);
 		return FLOW_SIDX_NONE;

From 898e853635a79e33917bb4646ff1fb5fc3a92997 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:52 +0100
Subject: [PATCH 004/221] virtio: Use const pointer for vu_dev

We don't modify the structure in some virtio functions.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 virtio.c    | 14 +++++++++-----
 virtio.h    |  2 +-
 vu_common.c |  2 +-
 vu_common.h |  2 +-
 4 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/virtio.c b/virtio.c
index a76de5e..625bac3 100644
--- a/virtio.c
+++ b/virtio.c
@@ -92,7 +92,8 @@
  *
  * Return: virtual address in our address space of the guest physical address
  */
-static void *vu_gpa_to_va(struct vu_dev *dev, uint64_t *plen, uint64_t guest_addr)
+static void *vu_gpa_to_va(const struct vu_dev *dev, uint64_t *plen,
+			  uint64_t guest_addr)
 {
 	unsigned int i;
 
@@ -210,7 +211,8 @@ static void virtqueue_get_head(const struct vu_virtq *vq,
  *
  * Return: -1 if there is an error, 0 otherwise
  */
-static int virtqueue_read_indirect_desc(struct vu_dev *dev, struct vring_desc *desc,
+static int virtqueue_read_indirect_desc(const struct vu_dev *dev,
+					struct vring_desc *desc,
 					uint64_t addr, size_t len)
 {
 	uint64_t read_len;
@@ -390,7 +392,7 @@ static inline void vring_set_avail_event(const struct vu_virtq *vq,
  *
  * Return: false on error, true otherwise
  */
-static bool virtqueue_map_desc(struct vu_dev *dev,
+static bool virtqueue_map_desc(const struct vu_dev *dev,
 			       unsigned int *p_num_sg, struct iovec *iov,
 			       unsigned int max_num_sg,
 			       uint64_t pa, size_t sz)
@@ -426,7 +428,8 @@ static bool virtqueue_map_desc(struct vu_dev *dev,
  *
  * Return: -1 if there is an error, 0 otherwise
  */
-static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned int idx,
+static int vu_queue_map_desc(const struct vu_dev *dev,
+			     struct vu_virtq *vq, unsigned int idx,
 			     struct vu_virtq_element *elem)
 {
 	const struct vring_desc *desc = vq->vring.desc;
@@ -504,7 +507,8 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
  *
  * Return: -1 if there is an error, 0 otherwise
  */
-int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
+int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
+		 struct vu_virtq_element *elem)
 {
 	unsigned int head;
 	int ret;
diff --git a/virtio.h b/virtio.h
index 6410d60..0af259d 100644
--- a/virtio.h
+++ b/virtio.h
@@ -170,7 +170,7 @@ static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
 
 bool vu_queue_empty(struct vu_virtq *vq);
 void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
-int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
+int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
 		 struct vu_virtq_element *elem);
 void vu_queue_detach_element(struct vu_virtq *vq);
 void vu_queue_unpop(struct vu_virtq *vq);
diff --git a/vu_common.c b/vu_common.c
index 299b5a3..6d365be 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -73,7 +73,7 @@ void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov, int elem_cnt
  *
  * Return: number of elements used to contain the frame
  */
-int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
+int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	       struct vu_virtq_element *elem, int max_elem,
 	       size_t size, size_t *frame_size)
 {
diff --git a/vu_common.h b/vu_common.h
index 901d972..bd70faf 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -46,7 +46,7 @@ static inline void vu_set_element(struct vu_virtq_element *elem,
 
 void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
 		  int elem_cnt);
-int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
+int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	       struct vu_virtq_element *elem, int max_elem, size_t size,
 	       size_t *frame_size);
 void vu_set_vnethdr(const struct vu_dev *vdev,

From 3876fc780d01870040343cdab7da3f14f53272d5 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 27 Dec 2024 11:40:19 +0100
Subject: [PATCH 005/221] seccomp: Unconditionally allow accept(2) even if
 accept4(2) is present

On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.

Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().

However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().

Add accept() to the list of allowed system calls regardless of the
presence of accept4().

Reported-by: NN708 <nn708@outlook.com>
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt.c b/passt.c
index 957f3d0..1a0c404 100644
--- a/passt.c
+++ b/passt.c
@@ -180,7 +180,7 @@ void exit_handler(int signal)
  * #syscalls socket getsockopt setsockopt s390x:socketcall i686:socketcall close
  * #syscalls bind connect recvfrom sendto shutdown
  * #syscalls arm:recv ppc64le:recv arm:send ppc64le:send
- * #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
+ * #syscalls accept4 accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
  * #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64
  */
 int main(int argc, char **argv)

From 725acd111ba340122f2bb0601e373534eb4b5ed8 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon, 6 Jan 2025 10:10:29 +0100
Subject: [PATCH 006/221] tcp_splice: Set (again) TCP_NODELAY on both sides

In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).

It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:

  9165.2467:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from read-side call
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
  9165.2469:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from read-side call
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
  9165.2944:          pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)

and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.

In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.

To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.

We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).

This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.

Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 3a0f868..3a000ff 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -348,6 +348,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 	uint8_t tgtpif = conn->f.pif[TGTSIDE];
 	union sockaddr_inany sa;
 	socklen_t sl;
+	int one = 1;
 
 	if (tgtpif == PIF_HOST)
 		conn->s[1] = tcp_conn_sock(c, af);
@@ -359,12 +360,21 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 	if (conn->s[1] < 0)
 		return -1;
 
-	if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
-		       &((int){ 1 }), sizeof(int))) {
+	if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
 		flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
 			   conn->s[1]);
 	}
 
+	if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+		flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+			   conn->s[0]);
+	}
+
+	if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+		flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+			   conn->s[1]);
+	}
+
 	pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
 
 	if (connect(conn->s[1], &sa.sa, sl)) {

From 2c174f1fe8a5f1923b14cde703941d4daac39850 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 9 Jan 2025 14:06:48 +0100
Subject: [PATCH 007/221] checksum: fix checksum with odd base address

csum_unfolded() must call csum_avx2() with a 32byte aligned base address.

To be able to do that if the buffer is not correctly aligned,
it splits the buffers in 2 parts, the second part is 32byte aligned and
can be used with csum_avx2(), the first part is the remaining part, that
is not 32byte aligned and we use sum_16b() to compute the checksum.

A problem appears if the length of the first part is odd because
the checksum is using 16bit words to do the checksum.

If the length is odd, when the second part is computed, all words are
shifted by 1 byte, meaning weight of upper and lower byte is swapped.

For instance a 13 bytes buffer:

bytes:

aa AA bb BB cc CC dd DD ee EE ff FF gg

16bit words:

AAaa BBbb CCcc DDdd EEee FFff 00gg

If we don't split the sequence, the checksum is:

AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg

If we split the sequence with an even length for the first part:

(AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)

But if the first part has an odd length:

(AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)

To avoid the problem, do not call csum_avx2() if the first part cannot
have an even length, and compute the checksum of all the buffer using
sum_16b().

This is slower but it can only happen if the buffer base address is odd,
and this can only happen if the binary is built using '-Os', and that
means we have chosen to prioritize size over speed.

Reported-by: Mike Jones <mike@mjones.io>
Link: https://bugs.passt.top/show_bug.cgi?id=108
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Added comment explaining why we check for pad & 1]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 checksum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/checksum.c b/checksum.c
index 1c4354d..b01e0fe 100644
--- a/checksum.c
+++ b/checksum.c
@@ -452,7 +452,8 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
 	intptr_t align = ROUND_UP((intptr_t)buf, sizeof(__m256i));
 	unsigned int pad = align - (intptr_t)buf;
 
-	if (len < pad)
+	/* Don't mix sum_16b() and csum_avx2() with odd padding lengths */
+	if (pad & 1 || len < pad)
 		pad = len;
 
 	if (pad)

From f04b483d1509b852951fe1421ef6f6740c9f9a08 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sat, 11 Jan 2025 00:46:51 +0100
Subject: [PATCH 008/221] test/pasta_podman: Run Podman tests on a single CPU
 thread

Increasingly often, I'm getting occasional failures of the same type
as https://github.com/containers/podman/issues/24147. I guess it
mostly depends on the system load.

It will be a while until I'll actually run tests on a kernel
including my fix for it, kernel commit a502ea6fa94b ("udp: Deal with
race between UDP socket address change and rehash"), so add a horrible
workaround using taskset(1), for the moment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/pasta_podman/bats | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/pasta_podman/bats b/test/pasta_podman/bats
index 6b1c575..2f07be8 100644
--- a/test/pasta_podman/bats
+++ b/test/pasta_podman/bats
@@ -23,4 +23,4 @@ check	[ "__PASTA_BIN__" = "__WD__/pasta" ]
 
 test	Podman system test with bats
 
-host	PODMAN="__PODMAN__" CONTAINERS_HELPER_BINARY_DIR="__WD__" bats test/podman/test/system/505-networking-pasta.bats
+host	PODMAN="__PODMAN__" CONTAINERS_HELPER_BINARY_DIR="__WD__" taskset -c 1 bats test/podman/test/system/505-networking-pasta.bats

From 1b95bd6fa1148f3609bebf7b2bcd6d47376e61a6 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Wed, 15 Jan 2025 17:22:30 +0100
Subject: [PATCH 009/221] vhost_user: fix multibuffer from linux

Under some conditions, linux can provide several buffers
in the same element (multiple entries in the iovec array).

I didn't identify what changed between the kernel guest that
provides one buffer and the one that provides several
(doesn't seem to be a kernel change or a configuration change).

Fix the following assert:

ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg

What I can see is the buffer can be splitted in two iovecs:
  - vnet header
  - packet data

This change manages this special case but the real fix will be to allow
tap_add_packet() to manage iovec array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vu_common.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/vu_common.c b/vu_common.c
index 6d365be..431fba6 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -18,6 +18,8 @@
 #include "pcap.h"
 #include "vu_common.h"
 
+#define VU_MAX_TX_BUFFER_NB	2
+
 /**
  * vu_packet_check_range() - Check if a given memory zone is contained in
  * 			     a mapped guest memory region
@@ -168,10 +170,15 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 
 	count = 0;
 	out_sg_count = 0;
-	while (count < VIRTQUEUE_MAX_SIZE) {
+	while (count < VIRTQUEUE_MAX_SIZE &&
+	       out_sg_count + VU_MAX_TX_BUFFER_NB <= VIRTQUEUE_MAX_SIZE) {
 		int ret;
 
-		vu_set_element(&elem[count], &out_sg[out_sg_count], NULL);
+		elem[count].out_num = VU_MAX_TX_BUFFER_NB;
+		elem[count].out_sg = &out_sg[out_sg_count];
+		elem[count].in_num = 0;
+		elem[count].in_sg = NULL;
+
 		ret = vu_queue_pop(vdev, vq, &elem[count]);
 		if (ret < 0)
 			break;
@@ -181,11 +188,20 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 			warn("virtio-net transmit queue contains no out buffers");
 			break;
 		}
-		ASSERT(elem[count].out_num == 1);
+		if (elem[count].out_num == 1) {
+			tap_add_packet(vdev->context,
+				       elem[count].out_sg[0].iov_len - hdrlen,
+				       (char *)elem[count].out_sg[0].iov_base +
+				        hdrlen);
+		} else {
+			/* vnet header can be in a separate iovec */
+			ASSERT(elem[count].out_num == 2);
+			ASSERT(elem[count].out_sg[0].iov_len == (size_t)hdrlen);
+			tap_add_packet(vdev->context,
+				       elem[count].out_sg[1].iov_len,
+				       (char *)elem[count].out_sg[1].iov_base);
+		}
 
-		tap_add_packet(vdev->context,
-			       elem[count].out_sg[0].iov_len - hdrlen,
-			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
 		count++;
 	}
 	tap_handler(vdev->context, now);

From 707f77b0a93160c8695b3cf5bfd7c24d9992b106 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 16 Jan 2025 20:06:59 +0100
Subject: [PATCH 010/221] tcp: Fix ACK sequence getting out of sync on EPOLLOUT
 wake-up
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In the next patches, I'm extending the usage of STALLED to a few more
cases.

Doing so revealed this issue: if we set STALLED and, consequently,
EPOLLOUT (which is wrong, fixed later) right after we set a connection
to ESTABLISHED (which also happened by mistake while I was preparing
another change), with the guest sending data together with the final
ACK in the handshake, say:

  41.3661: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3662: Flow 2 (NEW): FREE -> NEW
  41.3663: Flow 2 (INI): NEW -> INI
  41.3663: Flow 2 (INI): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => ?
  41.3665: Flow 2 (TGT): INI -> TGT
  41.3666: Flow 2 (TGT): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3667: Flow 2 (TCP connection): TGT -> TYPED
  41.3667: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3669: Flow 2 (TCP connection): TAP_SYN_RCVD: CLOSED -> SYN_SENT
  41.3670: Flow 2 (TCP connection): Side 0 hash table insert: bucket: 339814
  41.3672: Flow 2 (TCP connection): TYPED -> ACTIVE
  41.3673: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3674: Flow 2 (TCP connection): TAP_SYN_ACK_SENT: SYN_SENT -> SYN_RCVD
  41.3675: Flow 2 (TCP connection): ACK_FROM_TAP_DUE
  41.3675: Flow 2 (TCP connection): timer expires in 10.000s
  41.3675: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3676: Flow 2 (TCP connection): ACK_FROM_TAP_DUE dropped
  41.3676: Flow 2 (TCP connection): ESTABLISHED: SYN_RCVD -> ESTABLISHED
  41.3678: Flow 2 (TCP connection): STALLED
  41.3678: vhost-user: got kick_data: 0000000000000002 idx: 1
  41.3679: Flow 2 (TCP connection): ACK_TO_TAP_DUE
  41.3680: Flow 2 (TCP connection): timer expires in 0.010s
  41.3680: Flow 2 (TCP connection): STALLED dropped

we'll immediately get an EPOLLOUT event, call tcp_update_seqack_wnd(),
but ignore window and ACK sequence update. At this point, we think we
acknowledged all the data to the guest (but we didn't) and we'll
happily proceed to clear the ACK_TO_TAP_DUE flag:

  41.3780: Flow 2 (TCP connection): ACK_TO_TAP_DUE dropped
  41.3780: Flow 2 (TCP connection): timer expires in 7200.000s
  41.5754: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.9956: vhost-user: got kick_data: 0000000000000001 idx: 1
  42.8275: vhost-user: got kick_data: 0000000000000001 idx: 1

while the guest starts retransmitting that data desperately, without
ever getting an ACK segment from us:

   1433  38.746353 2a01:4f8:222:904::2 → 2001:db8:9a55::1 94 TCP 54312 → 10003 [SYN] Seq=0 Win=65460 Len=0 MSS=65460 SACK_PERM TSval=1089126192 TSecr=0 WS=128
   1434  38.747357 2001:db8:9a55::1 → 2a01:4f8:222:904::2 82 TCP 10003 → 54312 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256
   1435  38.747500 2a01:4f8:222:904::2 → 2001:db8:9a55::1 74 TCP 54312 → 10003 [ACK] Seq=1 Ack=1 Win=65536 Len=0
   1436  38.747769 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1437  38.747798 2a01:4f8:222:904::2 → 2001:db8:9a55::1 32841 TCP 54312 → 10003 [ACK] Seq=8193 Ack=1 Win=65536 Len=32767
   1438  38.748049 2001:db8:9a55::1 → 2a01:4f8:222:904::2 74 TCP [TCP Window Update] 10003 → 54312 [ACK] Seq=1 Ack=1 Win=65280 Len=0
   1439  38.954044 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1440  39.370096 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1441  40.202135 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192

because seq_ack_to_tap is already set to the sequence after frame
number 1437 in the example.

For some reason, I could only reproduce this with vhost-user, IPv6,
and passt running under valgrind while taking captures. Even under
these conditions, it happens quite rarely.

Forcibly send an ACK segment if we update the ACK sequence (or the
advertised window).

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index ec433f7..72fca63 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2200,8 +2200,10 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		if (events & EPOLLIN)
 			tcp_data_from_sock(c, conn);
 
-		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, false, NULL);
+		if (events & EPOLLOUT) {
+			if (tcp_update_seqack_wnd(c, conn, false, NULL))
+				tcp_send_flag(c, conn, ACK);
+		}
 
 		return;
 	}

From 22cf08ba00890c83922c61f5d65803b7f4c1299a Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 16 Jan 2025 20:31:35 +0100
Subject: [PATCH 011/221] tcp: Don't subscribe to EPOLLOUT events on STALLED

I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.

Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 72fca63..ef33388 100644
--- a/tcp.c
+++ b/tcp.c
@@ -437,7 +437,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 			return EPOLLET;
 
 		if (conn_flags & STALLED)
-			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
+			return EPOLLIN | EPOLLRDHUP | EPOLLET;
 
 		return EPOLLIN | EPOLLRDHUP;
 	}

From b8f573cdc222905c06f39625c0567da265a2e36e Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 14 Jan 2025 23:03:49 +0100
Subject: [PATCH 012/221] tcp: Set EPOLLET when when reading from a socket
 fails with EAGAIN

Before SO_PEEK_OFF support was introduced by commit e63d281871ef
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).

If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().

With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.

As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.

This is the original report, mentioning the originally proposed fix:

--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:

epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--

we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).

Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.

And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.

In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.

While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().

And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.

Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_buf.c | 3 +++
 tcp_vu.c  | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/tcp_buf.c b/tcp_buf.c
index a975a55..8c15101 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -359,6 +359,9 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			return -errno;
 		}
 
+		if (already_sent) /* No new data and EAGAIN: set EPOLLET */
+			conn_flag(c, conn, STALLED);
+
 		return 0;
 	}
 
diff --git a/tcp_vu.c b/tcp_vu.c
index 10e17d3..8256f53 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -399,6 +399,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			tcp_rst(c, conn);
 			return len;
 		}
+
+		if (already_sent) /* No new data and EAGAIN: set EPOLLET */
+			conn_flag(c, conn, STALLED);
+
 		return 0;
 	}
 

From a8f4fc481ce3afbf48522a0af44d222d665b515e Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 16 Jan 2025 20:47:00 +0100
Subject: [PATCH 013/221] tcp: Mask EPOLLIN altogether if we're blocked waiting
 on an ACK from the guest

There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.

If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:

  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1

also known as:

  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s

which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.

That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.

In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.

If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.

This is definitely not elegant, but it's a minimal fix.

We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).

I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).

Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c      | 8 ++++++--
 tcp_buf.c  | 2 ++
 tcp_conn.h | 1 +
 tcp_vu.c   | 2 ++
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index ef33388..3b3193a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -345,7 +345,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 
 static const char *tcp_flag_str[] __attribute((__unused__)) = {
 	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
-	"ACK_FROM_TAP_DUE",
+	"ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
@@ -436,8 +436,12 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 		if (events & TAP_FIN_SENT)
 			return EPOLLET;
 
-		if (conn_flags & STALLED)
+		if (conn_flags & STALLED) {
+			if (conn_flags & ACK_FROM_TAP_BLOCKS)
+				return EPOLLRDHUP | EPOLLET;
+
 			return EPOLLIN | EPOLLRDHUP | EPOLLET;
+		}
 
 		return EPOLLIN | EPOLLRDHUP;
 	}
diff --git a/tcp_buf.c b/tcp_buf.c
index 8c15101..cbefa42 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -309,6 +309,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -387,6 +388,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
+	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
 	conn_flag(c, conn, ~STALLED);
 
 	send_bufs = DIV_ROUND_UP(len, mss);
diff --git a/tcp_conn.h b/tcp_conn.h
index 6ae0511..d342680 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -77,6 +77,7 @@ struct tcp_tap_conn {
 #define ACTIVE_CLOSE		BIT(2)
 #define ACK_TO_TAP_DUE		BIT(3)
 #define ACK_FROM_TAP_DUE	BIT(4)
+#define ACK_FROM_TAP_BLOCKS	BIT(5)
 
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
diff --git a/tcp_vu.c b/tcp_vu.c
index 8256f53..a216bb1 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
+	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
 	conn_flag(c, conn, ~STALLED);
 
 	/* Likely, some new data was acked too. */

From 6016e04a3aae90cdd49fec391088b83a6d2170a6 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:53 +0100
Subject: [PATCH 014/221] vhost-user: update protocol features and commands
 list

vhost-user protocol specification has been updated with
feature flags and commands we will need to implement migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to union vhost_user_payload]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c |  5 +++++
 vhost_user.h | 40 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/vhost_user.c b/vhost_user.c
index 4b8558f..48226a8 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -110,6 +110,11 @@ static const char *vu_request_to_string(unsigned int req)
 			REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
 			REQ(VHOST_USER_ADD_MEM_REG),
 			REQ(VHOST_USER_REM_MEM_REG),
+			REQ(VHOST_USER_SET_STATUS),
+			REQ(VHOST_USER_GET_STATUS),
+			REQ(VHOST_USER_GET_SHARED_OBJECT),
+			REQ(VHOST_USER_SET_DEVICE_STATE_FD),
+			REQ(VHOST_USER_CHECK_DEVICE_STATE),
 		};
 #undef REQ
 		return vu_request_str[req];
diff --git a/vhost_user.h b/vhost_user.h
index 464ba21..c880893 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -37,6 +37,10 @@ enum vhost_user_protocol_feature {
 	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
 	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+	VHOST_USER_PROTOCOL_F_STATUS = 16,
+	/* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+	VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
+	VHOST_USER_PROTOCOL_F_DEVICE_STATE = 19,
 
 	VHOST_USER_PROTOCOL_F_MAX
 };
@@ -83,6 +87,11 @@ enum vhost_user_request {
 	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
 	VHOST_USER_ADD_MEM_REG = 37,
 	VHOST_USER_REM_MEM_REG = 38,
+	VHOST_USER_SET_STATUS = 39,
+	VHOST_USER_GET_STATUS = 40,
+	VHOST_USER_GET_SHARED_OBJECT = 41,
+	VHOST_USER_SET_DEVICE_STATE_FD = 42,
+	VHOST_USER_CHECK_DEVICE_STATE = 43,
 	VHOST_USER_MAX
 };
 
@@ -128,12 +137,39 @@ struct vhost_user_memory {
 	struct vhost_user_memory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
 };
 
+/**
+ * struct vhost_user_log - Address and size of the shared memory region used
+ *			   to log page update
+ * @mmap_size:		Size of the shared memory region
+ * @mmap_offset:	Offset of the shared memory region
+ */
+struct vhost_user_log {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+};
+
+/**
+ * struct vhost_user_transfer_device_state - Set the direction and phase
+ *                                           of the backend device state fd
+ * @direction:		Device state transfer direction (save or load)
+ * @phase:		Migration phase (only stopped is supported)
+ */
+struct vhost_user_transfer_device_state {
+	uint32_t direction;
+#define VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE 0
+#define VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD 1
+	uint32_t phase;
+#define VHOST_USER_TRANSFER_STATE_PHASE_STOPPED 0
+};
+
 /**
  * union vhost_user_payload - vhost-user message payload
  * @u64:		64-bit payload
  * @state:		vring state payload
  * @addr:		vring addresses payload
- * vhost_user_memory:	Memory regions information payload
+ * @memory:		Memory regions information payload
+ * @log:		Memory logging payload
+ * @transfer_state:	Device state payload
  */
 union vhost_user_payload {
 #define VHOST_USER_VRING_IDX_MASK   0xff
@@ -142,6 +178,8 @@ union vhost_user_payload {
 	struct vhost_vring_state state;
 	struct vhost_vring_addr addr;
 	struct vhost_user_memory memory;
+	struct vhost_user_log log;
+	struct vhost_user_transfer_device_state transfer_state;
 };
 
 /**

From b04195c60ff34db89b6bc400ad582d0ff399757b Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:54 +0100
Subject: [PATCH 015/221] vhost-user: add VHOST_USER_SET_LOG_FD command

VHOST_USER_SET_LOG_FD is an optional message with an eventfd
in ancillary data, it may be used to inform the front-end that the
log has been modified.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to vu_set_log_fd_exec()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h |  1 +
 virtio.h     |  2 ++
 3 files changed, 59 insertions(+)

diff --git a/vhost_user.c b/vhost_user.c
index 48226a8..3f34c91 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -504,6 +504,57 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 	return false;
 }
 
+/**
+ * vu_close_log() - Close the logging file descriptor
+ * @vdev:	vhost-user device
+ */
+static void vu_close_log(struct vu_dev *vdev)
+{
+	if (vdev->log_call_fd != -1) {
+		close(vdev->log_call_fd);
+		vdev->log_call_fd = -1;
+	}
+}
+
+/**
+ * vu_log_kick() - Inform the front-end that the log has been modified
+ * @vdev:	vhost-user device
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_log_kick(const struct vu_dev *vdev)
+{
+	if (vdev->log_call_fd != -1) {
+		int rc;
+
+		rc = eventfd_write(vdev->log_call_fd, 1);
+		if (rc == -1)
+			die_perror("vhost-user kick eventfd_write()");
+	}
+}
+
+/**
+ * vu_set_log_fd_exec() - Set the eventfd used to report logging update
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_log_fd_exec(struct vu_dev *vdev,
+			       struct vhost_user_msg *msg)
+{
+	if (msg->fd_num != 1)
+		die("Invalid log_fd message");
+
+	if (vdev->log_call_fd != -1)
+		close(vdev->log_call_fd);
+
+	vdev->log_call_fd = msg->fds[0];
+
+	debug("Got log_call_fd: %d", vdev->log_call_fd);
+
+	return false;
+}
+
 /**
  * vu_set_vring_num_exec() - Set the size of the queue (vring size)
  * @vdev:	vhost-user device
@@ -864,8 +915,10 @@ void vu_init(struct ctx *c)
 			.notification = true,
 		};
 	}
+	c->vdev->log_call_fd = -1;
 }
 
+
 /**
  * vu_cleanup() - Reset vhost-user device
  * @vdev:	vhost-user device
@@ -909,6 +962,8 @@ void vu_cleanup(struct vu_dev *vdev)
 		}
 	}
 	vdev->nregions = 0;
+
+	vu_close_log(vdev);
 }
 
 /**
@@ -929,6 +984,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
 	[VHOST_USER_GET_QUEUE_NUM]	   = vu_get_queue_num_exec,
 	[VHOST_USER_SET_OWNER]		   = vu_set_owner_exec,
 	[VHOST_USER_SET_MEM_TABLE]	   = vu_set_mem_table_exec,
+	[VHOST_USER_SET_LOG_FD]		   = vu_set_log_fd_exec,
 	[VHOST_USER_SET_VRING_NUM]	   = vu_set_vring_num_exec,
 	[VHOST_USER_SET_VRING_ADDR]	   = vu_set_vring_addr_exec,
 	[VHOST_USER_SET_VRING_BASE]	   = vu_set_vring_base_exec,
diff --git a/vhost_user.h b/vhost_user.h
index c880893..bf3eb50 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -240,5 +240,6 @@ static inline bool vu_queue_started(const struct vu_virtq *vq)
 void vu_print_capabilities(void);
 void vu_init(struct ctx *c);
 void vu_cleanup(struct vu_dev *vdev);
+void vu_log_kick(const struct vu_dev *vdev);
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
 #endif /* VHOST_USER_H */
diff --git a/virtio.h b/virtio.h
index 0af259d..3b0df34 100644
--- a/virtio.h
+++ b/virtio.h
@@ -103,6 +103,7 @@ struct vu_dev_region {
  * @regions:		Guest shared memory regions
  * @features:		Vhost-user features
  * @protocol_features:	Vhost-user protocol features
+ * @log_call_fd:	Eventfd to report logging update
  */
 struct vu_dev {
 	struct ctx *context;
@@ -111,6 +112,7 @@ struct vu_dev {
 	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
 	uint64_t features;
 	uint64_t protocol_features;
+	int log_call_fd;
 };
 
 /**

From 538312af196308dea9a4ddb9442bed921c0dc915 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:55 +0100
Subject: [PATCH 016/221] vhost-user: Pass vu_dev to more virtio functions

vu_dev will be needed to log page update.

Add the parameter to:

  vring_used_write()
  vu_queue_fill_by_index()
  vu_queue_fill()
  vring_used_idx_set()
  vu_queue_flush()

The new parameter is unused for now.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 virtio.c    | 32 ++++++++++++++++++++++----------
 virtio.h    | 10 ++++++----
 vu_common.c |  8 ++++----
 3 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/virtio.c b/virtio.c
index 625bac3..52d5a4d 100644
--- a/virtio.c
+++ b/virtio.c
@@ -580,28 +580,34 @@ bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
 
 /**
  * vring_used_write() - Write an entry in the used ring
+ * @dev:	Vhost-user device
  * @vq:		Virtqueue
  * @uelem:	Entry to write
  * @i:		Index of the entry in the used ring
  */
-static inline void vring_used_write(struct vu_virtq *vq,
+static inline void vring_used_write(const struct vu_dev *vdev,
+				    struct vu_virtq *vq,
 				    const struct vring_used_elem *uelem, int i)
 {
 	struct vring_used *used = vq->vring.used;
 
 	used->ring[i] = *uelem;
+	(void)vdev;
 }
 
+
 /**
  * vu_queue_fill_by_index() - Update information of a descriptor ring entry
  *			      in the used ring
+ * @dev:	Vhost-user device
  * @vq:		Virtqueue
  * @index:	Descriptor ring index
  * @len:	Size of the element
  * @idx:	Used ring entry index
  */
-void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
-			    unsigned int len, unsigned int idx)
+void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
+			    unsigned int index, unsigned int len,
+			    unsigned int idx)
 {
 	struct vring_used_elem uelem;
 
@@ -612,7 +618,7 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
 
 	uelem.id = htole32(index);
 	uelem.len = htole32(len);
-	vring_used_write(vq, &uelem, idx);
+	vring_used_write(vdev, vq, &uelem, idx);
 }
 
 /**
@@ -623,30 +629,36 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
  * @len:	Size of the element
  * @idx:	Used ring entry index
  */
-void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
-		   unsigned int len, unsigned int idx)
+void vu_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
+		   const struct vu_virtq_element *elem, unsigned int len,
+		   unsigned int idx)
 {
-	vu_queue_fill_by_index(vq, elem->index, len, idx);
+	vu_queue_fill_by_index(vdev, vq, elem->index, len, idx);
 }
 
 /**
  * vring_used_idx_set() - Set the descriptor ring current index
+ * @dev:	Vhost-user device
  * @vq:		Virtqueue
  * @val:	Value to set in the index
  */
-static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
+static inline void vring_used_idx_set(const struct vu_dev *vdev,
+				      struct vu_virtq *vq, uint16_t val)
 {
 	vq->vring.used->idx = htole16(val);
+	(void)vdev;
 
 	vq->used_idx = val;
 }
 
 /**
  * vu_queue_flush() - Flush the virtqueue
+ * @dev:	Vhost-user device
  * @vq:		Virtqueue
  * @count:	Number of entry to flush
  */
-void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
+void vu_queue_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
+		    unsigned int count)
 {
 	uint16_t old, new;
 
@@ -658,7 +670,7 @@ void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
 
 	old = vq->used_idx;
 	new = old + count;
-	vring_used_idx_set(vq, new);
+	vring_used_idx_set(vdev, vq, new);
 	vq->inuse -= count;
 	if ((uint16_t)(new - vq->signalled_used) < (uint16_t)(new - old))
 		vq->signalled_used_valid = false;
diff --git a/virtio.h b/virtio.h
index 3b0df34..d95bb07 100644
--- a/virtio.h
+++ b/virtio.h
@@ -177,10 +177,12 @@ int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
 void vu_queue_detach_element(struct vu_virtq *vq);
 void vu_queue_unpop(struct vu_virtq *vq);
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
-void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
-			    unsigned int len, unsigned int idx);
-void vu_queue_fill(struct vu_virtq *vq,
+void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
+			    unsigned int index, unsigned int len,
+			    unsigned int idx);
+void vu_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
 		   const struct vu_virtq_element *elem, unsigned int len,
 		   unsigned int idx);
-void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
+void vu_queue_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
+		    unsigned int count);
 #endif /* VIRTIO_H */
diff --git a/vu_common.c b/vu_common.c
index 431fba6..0ba2351 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -142,9 +142,9 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 	int i;
 
 	for (i = 0; i < elem_cnt; i++)
-		vu_queue_fill(vq, &elem[i], elem[i].in_sg[0].iov_len, i);
+		vu_queue_fill(vdev, vq, &elem[i], elem[i].in_sg[0].iov_len, i);
 
-	vu_queue_flush(vq, elem_cnt);
+	vu_queue_flush(vdev, vq, elem_cnt);
 	vu_queue_notify(vdev, vq);
 }
 
@@ -210,8 +210,8 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 		int i;
 
 		for (i = 0; i < count; i++)
-			vu_queue_fill(vq, &elem[i], 0, i);
-		vu_queue_flush(vq, count);
+			vu_queue_fill(vdev, vq, &elem[i], 0, i);
+		vu_queue_flush(vdev, vq, count);
 		vu_queue_notify(vdev, vq);
 	}
 }

From 3c1d91b8162607ec27b05502278a361cd73a54e2 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:56 +0100
Subject: [PATCH 017/221] vhost-user: add VHOST_USER_SET_LOG_BASE command

Sets logging shared memory space.

When the back-end has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature,
the log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message, the size and offset of shared memory
area provided in the message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix coding style in a bunch of places]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 util.h       |  3 ++
 vhost_user.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 vhost_user.h |  3 ++
 virtio.c     | 74 ++++++++++++++++++++++++++++++++++++++++++--
 virtio.h     |  4 +++
 5 files changed, 167 insertions(+), 3 deletions(-)

diff --git a/util.h b/util.h
index 3fa1d12..d02333d 100644
--- a/util.h
+++ b/util.h
@@ -152,6 +152,9 @@ static inline void barrier(void) { __asm__ __volatile__("" ::: "memory"); }
 #define smp_wmb()	smp_mb_release()
 #define smp_rmb()	smp_mb_acquire()
 
+#define qatomic_or(ptr, n) \
+	((void) __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST))
+
 #define NS_FN_STACK_SIZE	(1024 * 1024) /* 1MiB */
 
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
diff --git a/vhost_user.c b/vhost_user.c
index 3f34c91..66ded12 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -510,6 +510,12 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
  */
 static void vu_close_log(struct vu_dev *vdev)
 {
+	if (vdev->log_table) {
+		if (munmap(vdev->log_table, vdev->log_size) != 0)
+			die_perror("close log munmap() error");
+		vdev->log_table = NULL;
+	}
+
 	if (vdev->log_call_fd != -1) {
 		close(vdev->log_call_fd);
 		vdev->log_call_fd = -1;
@@ -520,7 +526,6 @@ static void vu_close_log(struct vu_dev *vdev)
  * vu_log_kick() - Inform the front-end that the log has been modified
  * @vdev:	vhost-user device
  */
-/* cppcheck-suppress unusedFunction */
 void vu_log_kick(const struct vu_dev *vdev)
 {
 	if (vdev->log_call_fd != -1) {
@@ -532,6 +537,83 @@ void vu_log_kick(const struct vu_dev *vdev)
 	}
 }
 
+/**
+ * vu_log_page() - Update logging table
+ * @log_table:	Base address of the logging table
+ * @page:	Page number that has been updated
+ */
+/* NOLINTNEXTLINE(readability-non-const-parameter) */
+static void vu_log_page(uint8_t *log_table, uint64_t page)
+{
+	qatomic_or(&log_table[page / 8], 1 << (page % 8));
+}
+
+/**
+ * vu_log_write() - Log memory write
+ * @dev:	vhost-user device
+ * @address:	Memory address
+ * @length:	Memory size
+ */
+void vu_log_write(const struct vu_dev *vdev, uint64_t address, uint64_t length)
+{
+	uint64_t page;
+
+	if (!vdev->log_table || !length ||
+	    !vu_has_feature(vdev, VHOST_F_LOG_ALL))
+		return;
+
+	page = address / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < address + length) {
+		vu_log_page(vdev->log_table, page);
+		page++;
+	}
+	vu_log_kick(vdev);
+}
+
+/**
+ * vu_set_log_base_exec() - Set the memory log base
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ *
+ * #syscalls:vu mmap|mmap2 munmap
+ */
+static bool vu_set_log_base_exec(struct vu_dev *vdev,
+				 struct vhost_user_msg *msg)
+{
+	uint64_t log_mmap_size, log_mmap_offset;
+	void *base;
+	int fd;
+
+	if (msg->fd_num != 1 || msg->hdr.size != sizeof(msg->payload.log))
+		die("vhost-user: Invalid log_base message");
+
+	fd = msg->fds[0];
+	log_mmap_offset = msg->payload.log.mmap_offset;
+	log_mmap_size = msg->payload.log.mmap_size;
+
+	debug("vhost-user log mmap_offset: %"PRId64, log_mmap_offset);
+	debug("vhost-user log mmap_size:   %"PRId64, log_mmap_size);
+
+	base = mmap(0, log_mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
+		    log_mmap_offset);
+	close(fd);
+	if (base == MAP_FAILED)
+		die("vhost-user log mmap error");
+
+	if (vdev->log_table)
+		munmap(vdev->log_table, vdev->log_size);
+
+	vdev->log_table = base;
+	vdev->log_size = log_mmap_size;
+
+	msg->hdr.size = sizeof(msg->payload.u64);
+	msg->fd_num = 0;
+
+	return true;
+}
+
 /**
  * vu_set_log_fd_exec() - Set the eventfd used to report logging update
  * @vdev:	vhost-user device
@@ -915,6 +997,7 @@ void vu_init(struct ctx *c)
 			.notification = true,
 		};
 	}
+	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
 }
 
@@ -984,6 +1067,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
 	[VHOST_USER_GET_QUEUE_NUM]	   = vu_get_queue_num_exec,
 	[VHOST_USER_SET_OWNER]		   = vu_set_owner_exec,
 	[VHOST_USER_SET_MEM_TABLE]	   = vu_set_mem_table_exec,
+	[VHOST_USER_SET_LOG_BASE]	   = vu_set_log_base_exec,
 	[VHOST_USER_SET_LOG_FD]		   = vu_set_log_fd_exec,
 	[VHOST_USER_SET_VRING_NUM]	   = vu_set_vring_num_exec,
 	[VHOST_USER_SET_VRING_ADDR]	   = vu_set_vring_addr_exec,
diff --git a/vhost_user.h b/vhost_user.h
index bf3eb50..e769cb1 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -15,6 +15,7 @@
 #include "iov.h"
 
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
+#define VHOST_LOG_PAGE 4096
 
 #define VHOST_MEMORY_BASELINE_NREGIONS 8
 
@@ -241,5 +242,7 @@ void vu_print_capabilities(void);
 void vu_init(struct ctx *c);
 void vu_cleanup(struct vu_dev *vdev);
 void vu_log_kick(const struct vu_dev *vdev);
+void vu_log_write(const struct vu_dev *vdev, uint64_t address,
+		  uint64_t length);
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
 #endif /* VHOST_USER_H */
diff --git a/virtio.c b/virtio.c
index 52d5a4d..2b58e4d 100644
--- a/virtio.c
+++ b/virtio.c
@@ -81,6 +81,7 @@
 
 #include "util.h"
 #include "virtio.h"
+#include "vhost_user.h"
 
 #define VIRTQUEUE_MAX_SIZE 1024
 
@@ -592,7 +593,72 @@ static inline void vring_used_write(const struct vu_dev *vdev,
 	struct vring_used *used = vq->vring.used;
 
 	used->ring[i] = *uelem;
-	(void)vdev;
+	vu_log_write(vdev, vq->vring.log_guest_addr +
+		     offsetof(struct vring_used, ring[i]),
+		     sizeof(used->ring[i]));
+}
+
+/**
+ * vu_log_queue_fill() - Log virtqueue memory update
+ * @dev:	vhost-user device
+ * @vq:		Virtqueue
+ * @index:	Descriptor ring index
+ * @len:	Size of the element
+ */
+static void vu_log_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
+			      unsigned int index, unsigned int len)
+{
+	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
+	struct vring_desc *desc = vq->vring.desc;
+	unsigned int max, min;
+	unsigned num_bufs = 0;
+	uint64_t read_len;
+
+	if (!vdev->log_table || !len || !vu_has_feature(vdev, VHOST_F_LOG_ALL))
+		return;
+
+	max = vq->vring.num;
+
+	if (le16toh(desc[index].flags) & VRING_DESC_F_INDIRECT) {
+		unsigned int desc_len;
+		uint64_t desc_addr;
+
+		if (le32toh(desc[index].len) % sizeof(struct vring_desc))
+			die("Invalid size for indirect buffer table");
+
+		/* loop over the indirect descriptor table */
+		desc_addr = le64toh(desc[index].addr);
+		desc_len = le32toh(desc[index].len);
+		max = desc_len / sizeof(struct vring_desc);
+		read_len = desc_len;
+		desc = vu_gpa_to_va(vdev, &read_len, desc_addr);
+		if (desc && read_len != desc_len) {
+			/* Failed to use zero copy */
+			desc = NULL;
+			if (!virtqueue_read_indirect_desc(vdev, desc_buf,
+							  desc_addr,
+							  desc_len))
+				desc = desc_buf;
+		}
+
+		if (!desc)
+			die("Invalid indirect buffer table");
+
+		index = 0;
+	}
+
+	do {
+		if (++num_bufs > max)
+			die("Looped descriptor");
+
+		if (le16toh(desc[index].flags) & VRING_DESC_F_WRITE) {
+			min = MIN(le32toh(desc[index].len), len);
+			vu_log_write(vdev, le64toh(desc[index].addr), min);
+			len -= min;
+		}
+	} while (len > 0 &&
+		 (virtqueue_read_next_desc(desc, index, max, &index) ==
+		  VIRTQUEUE_READ_DESC_MORE));
 }
 
 
@@ -614,6 +680,8 @@ void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
 	if (!vq->vring.avail)
 		return;
 
+	vu_log_queue_fill(vdev, vq, index, len);
+
 	idx = (idx + vq->used_idx) % vq->vring.num;
 
 	uelem.id = htole32(index);
@@ -646,7 +714,9 @@ static inline void vring_used_idx_set(const struct vu_dev *vdev,
 				      struct vu_virtq *vq, uint16_t val)
 {
 	vq->vring.used->idx = htole16(val);
-	(void)vdev;
+	vu_log_write(vdev, vq->vring.log_guest_addr +
+		     offsetof(struct vring_used, idx),
+		     sizeof(vq->vring.used->idx));
 
 	vq->used_idx = val;
 }
diff --git a/virtio.h b/virtio.h
index d95bb07..f572341 100644
--- a/virtio.h
+++ b/virtio.h
@@ -104,6 +104,8 @@ struct vu_dev_region {
  * @features:		Vhost-user features
  * @protocol_features:	Vhost-user protocol features
  * @log_call_fd:	Eventfd to report logging update
+ * @log_size:		Size of the logging memory region
+ * @log_table:		Base of the logging memory region
  */
 struct vu_dev {
 	struct ctx *context;
@@ -113,6 +115,8 @@ struct vu_dev {
 	uint64_t features;
 	uint64_t protocol_features;
 	int log_call_fd;
+	uint64_t log_size;
+	uint8_t *log_table;
 };
 
 /**

From 78c73e9395b13354272010d2f202c819689d48f8 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:57 +0100
Subject: [PATCH 018/221] vhost-user: Report to front-end we support
 VHOST_USER_PROTOCOL_F_LOG_SHMFD

This features allows QEMU to be migrated. We need also to report
VHOST_F_LOG_ALL.

This protocol feature reports we can log the page update and
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/vhost_user.c b/vhost_user.c
index 66ded12..747b7f6 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -334,6 +334,7 @@ static bool vu_get_features_exec(struct vu_dev *vdev,
 	uint64_t features =
 		1ULL << VIRTIO_F_VERSION_1 |
 		1ULL << VIRTIO_NET_F_MRG_RXBUF |
+		1ULL << VHOST_F_LOG_ALL |
 		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
 
 	(void)vdev;
@@ -911,7 +912,8 @@ static bool vu_set_vring_err_exec(struct vu_dev *vdev,
 static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
 					  struct vhost_user_msg *msg)
 {
-	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
+	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
+			    1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD;
 
 	(void)vdev;
 	vmsg_set_reply_u64(msg, features);

From 878e16345461eb2745c761f6929fd6e9da0df447 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:58 +0100
Subject: [PATCH 019/221] vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.

The value returned indicates success or error;
0 is success, any non-zero value is an error.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 21 +++++++++++++++++++++
 virtio.h     | 18 ++++++++++--------
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/vhost_user.c b/vhost_user.c
index 747b7f6..2962709 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -980,6 +980,23 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
 	return false;
 }
 
+/**
+ * vu_check_device_state_exec() -- Return device state migration result
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as the reply contains the migration result
+ */
+static bool vu_check_device_state_exec(struct vu_dev *vdev,
+				       struct vhost_user_msg *msg)
+{
+	(void)vdev;
+
+	vmsg_set_reply_u64(msg, vdev->device_state_result);
+
+	return true;
+}
+
 /**
  * vu_init() - Initialize vhost-user device structure
  * @c:		execution context
@@ -1001,6 +1018,7 @@ void vu_init(struct ctx *c)
 	}
 	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
+	c->vdev->device_state_result = -1;
 }
 
 
@@ -1049,6 +1067,8 @@ void vu_cleanup(struct vu_dev *vdev)
 	vdev->nregions = 0;
 
 	vu_close_log(vdev);
+
+	vdev->device_state_result = -1;
 }
 
 /**
@@ -1079,6 +1099,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
 	[VHOST_USER_SET_VRING_CALL]	   = vu_set_vring_call_exec,
 	[VHOST_USER_SET_VRING_ERR]	   = vu_set_vring_err_exec,
 	[VHOST_USER_SET_VRING_ENABLE]	   = vu_set_vring_enable_exec,
+	[VHOST_USER_CHECK_DEVICE_STATE]    = vu_check_device_state_exec,
 };
 
 /**
diff --git a/virtio.h b/virtio.h
index f572341..512ec1b 100644
--- a/virtio.h
+++ b/virtio.h
@@ -98,14 +98,15 @@ struct vu_dev_region {
 
 /**
  * struct vu_dev - vhost-user device information
- * @context:		Execution context
- * @nregions:		Number of shared memory regions
- * @regions:		Guest shared memory regions
- * @features:		Vhost-user features
- * @protocol_features:	Vhost-user protocol features
- * @log_call_fd:	Eventfd to report logging update
- * @log_size:		Size of the logging memory region
- * @log_table:		Base of the logging memory region
+ * @context:			Execution context
+ * @nregions:			Number of shared memory regions
+ * @regions:			Guest shared memory regions
+ * @features:			Vhost-user features
+ * @protocol_features:		Vhost-user protocol features
+ * @log_call_fd:		Eventfd to report logging update
+ * @log_size:			Size of the logging memory region
+ * @log_table:			Base of the logging memory region
+ * @device_state_result:	Device state migration result
  */
 struct vu_dev {
 	struct ctx *context;
@@ -117,6 +118,7 @@ struct vu_dev {
 	int log_call_fd;
 	uint64_t log_size;
 	uint8_t *log_table;
+	int device_state_result;
 };
 
 /**

From 31d70024beda1e49131d7b68dd7554bee16c79f3 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:13:59 +0100
Subject: [PATCH 020/221] vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD
 command

Set the file descriptor to use to transfer the
backend device state during migration.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 epoll_type.h |  2 ++
 passt.c      |  4 +++
 vhost_user.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 virtio.h     |  2 ++
 vu_common.c  | 49 +++++++++++++++++++++++++++++++
 vu_common.h  |  1 +
 6 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index f3ef415..fd9eac3 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,6 +40,8 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
+	/* vhost-user migration socket */
+	EPOLL_TYPE_VHOST_MIGRATION,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/passt.c b/passt.c
index 1a0c404..b1c8ab6 100644
--- a/passt.c
+++ b/passt.c
@@ -75,6 +75,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
+	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -356,6 +357,9 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
+		case EPOLL_TYPE_VHOST_MIGRATION:
+			vu_migrate(c.vdev, eventmask);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/vhost_user.c b/vhost_user.c
index 2962709..daff9ab 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -981,7 +981,78 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
 }
 
 /**
- * vu_check_device_state_exec() -- Return device state migration result
+ * vu_set_migration_watch() - Add the migration file descriptor to epoll
+ * @vdev:	vhost-user device
+ * @fd:		File descriptor to add
+ * @direction:	Direction of the migration (save or load backend state)
+ */
+static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
+				   uint32_t direction)
+{
+	union epoll_ref ref = {
+		.type = EPOLL_TYPE_VHOST_MIGRATION,
+		.fd = fd,
+	 };
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	switch (direction) {
+	case VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE:
+		ev.events = EPOLLOUT;
+		break;
+	case VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD:
+		ev.events = EPOLLIN;
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+}
+
+/**
+ * vu_set_device_state_fd_exec() - Set the device state migration channel
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as the reply contains 0 to indicate success
+ *         and set bit 8 as we don't provide our own fd.
+ */
+static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
+					struct vhost_user_msg *msg)
+{
+	unsigned int direction = msg->payload.transfer_state.direction;
+	unsigned int phase = msg->payload.transfer_state.phase;
+
+	if (msg->fd_num != 1)
+		die("Invalid device_state_fd message");
+
+	if (phase != VHOST_USER_TRANSFER_STATE_PHASE_STOPPED)
+		die("Invalid device_state_fd phase: %d", phase);
+
+	if (direction != VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE &&
+	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
+		die("Invalide device_state_fd direction: %d", direction);
+
+	if (vdev->device_state_fd != -1) {
+		vu_remove_watch(vdev, vdev->device_state_fd);
+		close(vdev->device_state_fd);
+	}
+
+	vdev->device_state_fd = msg->fds[0];
+	vdev->device_state_result = -1;
+	vu_set_migration_watch(vdev, vdev->device_state_fd, direction);
+
+	debug("Got device_state_fd: %d", vdev->device_state_fd);
+
+	/* We don't provide a new fd for the data transfer */
+	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
+
+	return true;
+}
+
+/**
+ * vu_check_device_state_exec() - Return device state migration result
  * @vdev:	vhost-user device
  * @vmsg:	vhost-user message
  *
@@ -1018,6 +1089,7 @@ void vu_init(struct ctx *c)
 	}
 	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
+	c->vdev->device_state_fd = -1;
 	c->vdev->device_state_result = -1;
 }
 
@@ -1068,7 +1140,12 @@ void vu_cleanup(struct vu_dev *vdev)
 
 	vu_close_log(vdev);
 
-	vdev->device_state_result = -1;
+	if (vdev->device_state_fd != -1) {
+		vu_remove_watch(vdev, vdev->device_state_fd);
+		close(vdev->device_state_fd);
+		vdev->device_state_fd = -1;
+		vdev->device_state_result = -1;
+	}
 }
 
 /**
@@ -1099,6 +1176,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
 	[VHOST_USER_SET_VRING_CALL]	   = vu_set_vring_call_exec,
 	[VHOST_USER_SET_VRING_ERR]	   = vu_set_vring_err_exec,
 	[VHOST_USER_SET_VRING_ENABLE]	   = vu_set_vring_enable_exec,
+	[VHOST_USER_SET_DEVICE_STATE_FD]   = vu_set_device_state_fd_exec,
 	[VHOST_USER_CHECK_DEVICE_STATE]    = vu_check_device_state_exec,
 };
 
diff --git a/virtio.h b/virtio.h
index 512ec1b..7bef2d2 100644
--- a/virtio.h
+++ b/virtio.h
@@ -106,6 +106,7 @@ struct vu_dev_region {
  * @log_call_fd:		Eventfd to report logging update
  * @log_size:			Size of the logging memory region
  * @log_table:			Base of the logging memory region
+ * @device_state_fd:		Device state migration channel
  * @device_state_result:	Device state migration result
  */
 struct vu_dev {
@@ -118,6 +119,7 @@ struct vu_dev {
 	int log_call_fd;
 	uint64_t log_size;
 	uint8_t *log_table;
+	int device_state_fd;
 	int device_state_result;
 };
 
diff --git a/vu_common.c b/vu_common.c
index 0ba2351..87a0d94 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -297,3 +297,52 @@ err:
 
 	return -1;
 }
+
+/**
+ * vu_migrate() - Send/receive passt insternal state to/from QEMU
+ * @vdev:	vhost-user device
+ * @events:	epoll events
+ */
+void vu_migrate(struct vu_dev *vdev, uint32_t events)
+{
+	int ret;
+
+	/* TODO: collect/set passt internal state
+	 * and use vdev->device_state_fd to send/receive it
+	 */
+	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
+	if (events & EPOLLOUT) {
+		debug("Saving backend state");
+
+		/* send some stuff */
+		ret = write(vdev->device_state_fd, "PASST", 6);
+		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
+		vdev->device_state_result = ret == -1 ? -1 : 0;
+		/* Closing the file descriptor signals the end of transfer */
+		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
+			  vdev->device_state_fd, NULL);
+		close(vdev->device_state_fd);
+		vdev->device_state_fd = -1;
+	} else if (events & EPOLLIN) {
+		char buf[6];
+
+		debug("Loading backend state");
+		/* read some stuff */
+		ret = read(vdev->device_state_fd, buf, sizeof(buf));
+		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
+		if (ret != sizeof(buf)) {
+			vdev->device_state_result = -1;
+		} else {
+			ret = strncmp(buf, "PASST", sizeof(buf));
+			vdev->device_state_result = ret == 0 ? 0 : -1;
+		}
+	} else if (events & EPOLLHUP) {
+		debug("Closing migration channel");
+
+		/* The end of file signals the end of the transfer. */
+		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
+			  vdev->device_state_fd, NULL);
+		close(vdev->device_state_fd);
+		vdev->device_state_fd = -1;
+	}
+}
diff --git a/vu_common.h b/vu_common.h
index bd70faf..d56c021 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,4 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
+void vu_migrate(struct vu_dev *vdev, uint32_t events);
 #endif /* VU_COMMON_H */

From 412ed4f09ff2e07545acdc5fe87a55a34aab4f92 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 19 Dec 2024 12:14:00 +0100
Subject: [PATCH 021/221] vhost-user: Report to front-end we support
 VHOST_USER_PROTOCOL_F_DEVICE_STATE

Report to front-end that we support device state commands:
VHOST_USER_CHECK_DEVICE_STATE
VHOST_USER_SET_LOG_BASE

These feature is needed to transfer backend state using frontend
channel.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/vhost_user.c b/vhost_user.c
index daff9ab..f12dec5 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -913,7 +913,8 @@ static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
 					  struct vhost_user_msg *msg)
 {
 	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
-			    1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD;
+			    1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
+			    1ULL << VHOST_USER_PROTOCOL_F_DEVICE_STATE;
 
 	(void)vdev;
 	vmsg_set_reply_u64(msg, features);

From c96a88d550fcda3f1972aee395fcfda19905d0a4 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Mon, 20 Jan 2025 14:15:22 +0100
Subject: [PATCH 022/221] vhost_user: remove ASSERT() on iovec number

Replace ASSERT() on the number of iovec in the element and on
the first entry length by a debug() message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix typo in failure message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vu_common.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/vu_common.c b/vu_common.c
index 87a0d94..aa5ca7b 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -195,8 +195,12 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 				        hdrlen);
 		} else {
 			/* vnet header can be in a separate iovec */
-			ASSERT(elem[count].out_num == 2);
-			ASSERT(elem[count].out_sg[0].iov_len == (size_t)hdrlen);
+			if (elem[count].out_num != 2)
+				debug("virtio-net transmit queue contains more than one buffer ([%d]: %u)",
+				      count, elem[count].out_num);
+			if (elem[count].out_sg[0].iov_len != (size_t)hdrlen)
+				debug("virtio-net transmit queue entry not aligned on hdrlen ([%d]: %d != %zu)",
+				count, hdrlen, elem[count].out_sg[0].iov_len);
 			tap_add_packet(vdev->context,
 				       elem[count].out_sg[1].iov_len,
 				       (char *)elem[count].out_sg[1].iov_base);

From 8757834d145a06b845aa0bb6bdfd4f93971b8d74 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon, 20 Jan 2025 16:49:30 +0100
Subject: [PATCH 023/221] tcp: Buffer sizes are *not* inherited on
 accept()/accept4()

...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.

Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.

As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).

Fixes: 904b86ade7db ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 3b3193a..a012b81 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2057,6 +2057,8 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
+	tcp_sock_set_bufsize(c, s);
+
 	/* FIXME: When listening port has a specific bound address, record that
 	 * as our address
 	 */
@@ -2260,7 +2262,6 @@ static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
 	if (s < 0)
 		return s;
 
-	tcp_sock_set_bufsize(c, s);
 	return s;
 }
 
@@ -2317,9 +2318,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
 
 	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback4,
 			NULL, port, tref.u32);
-	if (s >= 0)
-		tcp_sock_set_bufsize(c, s);
-	else
+	if (s < 0)
 		s = -1;
 
 	if (c->tcp.fwd_out.mode == FWD_AUTO)
@@ -2343,9 +2342,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
 
 	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback6,
 			NULL, port, tref.u32);
-	if (s >= 0)
-		tcp_sock_set_bufsize(c, s);
-	else
+	if (s < 0)
 		s = -1;
 
 	if (c->tcp.fwd_out.mode == FWD_AUTO)

From 54bb972cfb2637f64a9718023a2351f8f259abdb Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 17 Jan 2025 10:10:10 +0100
Subject: [PATCH 024/221] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on
 all sockets

Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.

TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.

Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.

Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.

After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):

  https://news.ycombinator.com/item?id=34180239

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tcp.c b/tcp.c
index a012b81..4d6a6b3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
 		trace("TCP: failed to set SO_SNDBUF to %i", v);
 }
 
+/**
+ * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
+ * @s:		Socket, can be -1 to avoid check in the caller
+ */
+static void tcp_sock_set_nodelay(int s)
+{
+	if (s == -1)
+		return;
+
+	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
+		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
+}
+
 /**
  * tcp_update_csum() - Calculate TCP checksum
  * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
@@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 		return -errno;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	return s;
 }
@@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 		goto cancel;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	/* FIXME: When listening port has a specific bound address, record that
 	 * as our address

From db2c91ae86c7c0d1d068714db2342b9057506148 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon, 20 Jan 2025 18:36:30 +0100
Subject: [PATCH 025/221] tcp: Set ACK flag on *all* RST segments, even for
 client in SYN-SENT state

Somewhat curiously, RFC 9293, section 3.10.7.3, states:

   If the state is SYN-SENT, then
   [...]

      Second, check the RST bit:
      -  If the RST bit is set,
      [...]

         o  If the ACK was acceptable, then signal to the user "error:
            connection reset", drop the segment, enter CLOSED state,
            delete TCB, and return.  Otherwise (no ACK), drop the
            segment and return.

which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:

	/* No ACK in the segment */

	if (th->rst) {
		/* rfc793:
		 * "If the RST bit is set
		 *
		 *      Otherwise (no ACK) drop the segment and return."
		 */

		goto discard_and_undo;
	}

meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.

The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.

This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.

Note that non-SYN, non-RST segments were already covered by commit
4988e2b40631 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").

Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl>
Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl>
Reported-by: Robert Floor <Robert.Floor@schiphol.nl>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 4d6a6b3..c89f323 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1147,7 +1147,7 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 
 		*opts = TCP_SYN_OPTS(mss, conn->ws_to_tap);
 		*optlen = sizeof(*opts);
-	} else if (!(flags & RST)) {
+	} else {
 		flags |= ACK;
 	}
 

From ec5c4d936dafcbc5e07caeb594dfd771050da221 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 21 Jan 2025 00:39:06 +0100
Subject: [PATCH 026/221] tcp: Set PSH flag for last incoming packets in a
 batch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.

However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.

This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:

  $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
  33170   0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
  33171   0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
  33172   0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
  33173   1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370

in this example (packet capture taken by passt), frame #33172 is a
mouse event sent by the RDP client, and frame #33173 is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame #33173.

If we set PSH, instead:

  $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
  314   0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
  315   0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
  316   0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
  317   0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53

here, in frame #316, our mouse event is delivered without a delay and
receives a response in approximately 12 ms.

Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.

Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_buf.c | 11 ++++++++---
 tcp_vu.c  |  7 +++++--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/tcp_buf.c b/tcp_buf.c
index cbefa42..72d99c5 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -239,9 +239,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
  * @dlen:	TCP payload length
  * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
  * @seq:	Sequence number to be sent
+ * @push:	Set PSH flag, last segment in a batch
  */
 static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
-			    ssize_t dlen, int no_csum, uint32_t seq)
+			    ssize_t dlen, int no_csum, uint32_t seq, bool push)
 {
 	struct tcp_payload_t *payload;
 	const uint16_t *check = NULL;
@@ -268,6 +269,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	payload->th.th_x2 = 0;
 	payload->th.th_flags = 0;
 	payload->th.ack = 1;
+	payload->th.psh = push;
 	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
 	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
 	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
@@ -402,11 +404,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	seq = conn->seq_to_tap;
 	for (i = 0; i < send_bufs; i++) {
 		int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
+		bool push = false;
 
-		if (i == send_bufs - 1)
+		if (i == send_bufs - 1) {
 			dlen = last_len;
+			push = true;
+		}
 
-		tcp_data_to_tap(c, conn, dlen, no_csum, seq);
+		tcp_data_to_tap(c, conn, dlen, no_csum, seq, push);
 		seq += dlen;
 	}
 
diff --git a/tcp_vu.c b/tcp_vu.c
index a216bb1..fad7065 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -289,10 +289,11 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
  * @iov_cnt:		Number of entries in @iov
  * @check:		Checksum, if already known
  * @no_tcp_csum:	Do not set TCP checksum
+ * @push:		Set PSH flag, last segment in a batch
  */
 static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 			   struct iovec *iov, size_t iov_cnt,
-			   const uint16_t **check, bool no_tcp_csum)
+			   const uint16_t **check, bool no_tcp_csum, bool push)
 {
 	const struct flowside *toside = TAPFLOW(conn);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
@@ -334,6 +335,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	memset(th, 0, sizeof(*th));
 	th->doff = sizeof(*th) / 4;
 	th->ack = 1;
+	th->psh = push;
 
 	tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
 			 *check, conn->seq_to_tap, no_tcp_csum);
@@ -443,6 +445,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		struct iovec *iov = &elem[head[i]].in_sg[0];
 		int buf_cnt = head[i + 1] - head[i];
 		ssize_t dlen = iov_size(iov, buf_cnt) - hdrlen;
+		bool push = i == head_cnt - 1;
 
 		vu_set_vnethdr(vdev, iov->iov_base, buf_cnt);
 
@@ -451,7 +454,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			check = NULL;
 		previous_dlen = dlen;
 
-		tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap);
+		tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
 
 		if (*c->pcap) {
 			pcap_iov(iov, buf_cnt,

From 4f2c8e79130ef3d6132e34c49746e397745f9d73 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 21 Jan 2025 14:16:02 +0100
Subject: [PATCH 027/221] vhost_user: Drop packet with unsupported iovec array

If the iovec array cannot be managed, drop it rather than
passing the second entry to tap_add_packet().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vu_common.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/vu_common.c b/vu_common.c
index aa5ca7b..f43d8ac 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -195,15 +195,17 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 				        hdrlen);
 		} else {
 			/* vnet header can be in a separate iovec */
-			if (elem[count].out_num != 2)
+			if (elem[count].out_num != 2) {
 				debug("virtio-net transmit queue contains more than one buffer ([%d]: %u)",
 				      count, elem[count].out_num);
-			if (elem[count].out_sg[0].iov_len != (size_t)hdrlen)
+			} else if (elem[count].out_sg[0].iov_len != (size_t)hdrlen) {
 				debug("virtio-net transmit queue entry not aligned on hdrlen ([%d]: %d != %zu)",
-				count, hdrlen, elem[count].out_sg[0].iov_len);
-			tap_add_packet(vdev->context,
-				       elem[count].out_sg[1].iov_len,
-				       (char *)elem[count].out_sg[1].iov_base);
+				      count, hdrlen, elem[count].out_sg[0].iov_len);
+			} else {
+				tap_add_packet(vdev->context,
+					       elem[count].out_sg[1].iov_len,
+					       (char *)elem[count].out_sg[1].iov_base);
+			}
 		}
 
 		count++;

From d477a1fb03c5995d07e481b25dd94fc9e9bc02f2 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 23 Jan 2025 08:55:49 +0100
Subject: [PATCH 028/221] netlink: Skip loopback interface while looking for a
 template

There might be reasons to have routes on the loopback interface, for
example Any-IP/AnyIP routes as implemented by Linux kernel commit
ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6.").

If we use the loopback interface as a template, though, we'll pick
'lo' (typically) as interface name for our tap interface, but we'll
already have an interface called 'lo' in the target namespace, and as
we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap
interface.

Skip the loopback interface while looking for a template interface or,
more accurately, skip the interface with index 1.

Strictly speaking, we should fetch interface flags via RTM_GETLINK
instead, and check for IFF_LOOPBACK, but interleaving that request
while we're iterating over routes is unnecessarily complicated.

Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 netlink.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/netlink.c b/netlink.c
index 0407692..37d8b5b 100644
--- a/netlink.c
+++ b/netlink.c
@@ -297,6 +297,10 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
 		if (!thisifi)
 			continue; /* No interface for this route */
 
+		/* Skip 'lo': we should test IFF_LOOPBACK, but keep it simple */
+		if (thisifi == 1)
+			continue;
+
 		/* Skip routes to link-local addresses */
 		if (af == AF_INET && dst &&
 		    IN4_IS_PREFIX_LINKLOCAL(dst, rtm->rtm_dst_len))

From dd6a6854c73a09c4091c1776ee7f349d1e1f966c Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Fri, 24 Jan 2025 20:07:41 +0100
Subject: [PATCH 029/221] vhost-user: Implement an empty VHOST_USER_SEND_RARP
 command

Passt cannot manage and doesn't need to manage the broadcast of a fake RARP,
but QEMU will report an error message if Passt doesn't implement it.

Implement an empty SEND_RARP command to silence QEMU error message.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/vhost_user.c b/vhost_user.c
index f12dec5..6bf0dda 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -914,7 +914,8 @@ static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
 {
 	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
 			    1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
-			    1ULL << VHOST_USER_PROTOCOL_F_DEVICE_STATE;
+			    1ULL << VHOST_USER_PROTOCOL_F_DEVICE_STATE |
+			    1ULL << VHOST_USER_PROTOCOL_F_RARP;
 
 	(void)vdev;
 	vmsg_set_reply_u64(msg, features);
@@ -981,6 +982,32 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
 	return false;
 }
 
+/**
+ * vu_set_send_rarp_exec() - vhost-user specification says: "Broadcast a fake
+ * 			     RARP to notify the migration is terminated",
+ * 			     but passt doesn't need to update any ARP table,
+ * 			     so do nothing to silence QEMU bogus error message
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_send_rarp_exec(struct vu_dev *vdev,
+			      struct vhost_user_msg *msg)
+{
+	char macstr[ETH_ADDRSTRLEN];
+
+	(void)vdev;
+
+	/* ignore the command */
+
+	debug("Ignore command VHOST_USER_SEND_RARP for %s",
+	      eth_ntop((unsigned char *)&msg->payload.u64, macstr,
+		       sizeof(macstr)));
+
+	return false;
+}
+
 /**
  * vu_set_migration_watch() - Add the migration file descriptor to epoll
  * @vdev:	vhost-user device
@@ -1177,6 +1204,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
 	[VHOST_USER_SET_VRING_CALL]	   = vu_set_vring_call_exec,
 	[VHOST_USER_SET_VRING_ERR]	   = vu_set_vring_err_exec,
 	[VHOST_USER_SET_VRING_ENABLE]	   = vu_set_vring_enable_exec,
+	[VHOST_USER_SEND_RARP]		   = vu_send_rarp_exec,
 	[VHOST_USER_SET_DEVICE_STATE_FD]   = vu_set_device_state_fd_exec,
 	[VHOST_USER_CHECK_DEVICE_STATE]    = vu_check_device_state_exec,
 };

From 10c4a9e1b383becd7366bda986f886675f7c4cb2 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 30 Jan 2025 17:52:10 +1100
Subject: [PATCH 030/221] tcp: Always pass NULL event with EPOLL_CTL_DEL

In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored.  It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places.  Use
NULL instead for simplicity and consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index c89f323..4eed82b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -468,9 +468,9 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 
 	if (conn->events == CLOSED) {
 		if (conn->in_epoll)
-			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, &ev);
+			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, NULL);
 		if (conn->timer != -1)
-			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev);
+			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, NULL);
 		return 0;
 	}
 

From 0349cf637f64a5128846c79d9537849e1ed3e1cc Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 30 Jan 2025 17:52:11 +1100
Subject: [PATCH 031/221] util: Rename and make global vu_remove_watch()

vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set.  There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 icmp.c       |  2 +-
 tap.c        |  2 +-
 tcp.c        |  4 ++--
 tcp_splice.c |  4 ++--
 udp_flow.c   |  2 +-
 util.c       | 10 ++++++++++
 util.h       |  1 +
 vhost_user.c | 21 +++++----------------
 vu_common.c  |  6 ++----
 9 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/icmp.c b/icmp.c
index 143e93b..bcf498d 100644
--- a/icmp.c
+++ b/icmp.c
@@ -150,7 +150,7 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
+	epoll_del(c, pingf->sock);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
 }
diff --git a/tap.c b/tap.c
index cd32a90..772648f 100644
--- a/tap.c
+++ b/tap.c
@@ -1005,7 +1005,7 @@ void tap_sock_reset(struct ctx *c)
 		exit(EXIT_SUCCESS);
 
 	/* Close the connected socket, wait for a new connection */
-	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
+	epoll_del(c, c->fd_tap);
 	close(c->fd_tap);
 	c->fd_tap = -1;
 	if (c->mode == MODE_VU)
diff --git a/tcp.c b/tcp.c
index 4eed82b..7787381 100644
--- a/tcp.c
+++ b/tcp.c
@@ -468,9 +468,9 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 
 	if (conn->events == CLOSED) {
 		if (conn->in_epoll)
-			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, NULL);
+			epoll_del(c, conn->sock);
 		if (conn->timer != -1)
-			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, NULL);
+			epoll_del(c, conn->timer);
 		return 0;
 	}
 
diff --git a/tcp_splice.c b/tcp_splice.c
index 3a000ff..5db1d62 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -200,8 +200,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	}
 
 	if (flag == CLOSING) {
-		epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[0], NULL);
-		epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[1], NULL);
+		epoll_del(c, conn->s[0]);
+		epoll_del(c, conn->s[1]);
 	}
 }
 
diff --git a/udp_flow.c b/udp_flow.c
index 9fd7d06..7fae81d 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -52,7 +52,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 
 	if (uflow->s[TGTSIDE] >= 0) {
 		/* But the flow specific one needs to be removed */
-		epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL);
+		epoll_del(c, uflow->s[TGTSIDE]);
 		close(uflow->s[TGTSIDE]);
 		uflow->s[TGTSIDE] = -1;
 	}
diff --git a/util.c b/util.c
index 11973c4..c7b09f0 100644
--- a/util.c
+++ b/util.c
@@ -837,3 +837,13 @@ void raw_random(void *buf, size_t buflen)
 	if (random_read < buflen)
 		die("Unexpected EOF on random data source");
 }
+
+/**
+ * epoll_del() - Remove a file descriptor from our passt epoll
+ * @c:		Execution context
+ * @fd:		File descriptor to remove
+ */
+void epoll_del(const struct ctx *c, int fd)
+{
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
+}
diff --git a/util.h b/util.h
index d02333d..800a28b 100644
--- a/util.h
+++ b/util.h
@@ -276,6 +276,7 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
 #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
 
 void raw_random(void *buf, size_t buflen);
+void epoll_del(const struct ctx *c, int fd);
 
 /*
  * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
diff --git a/vhost_user.c b/vhost_user.c
index 6bf0dda..bbbf504 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -162,17 +162,6 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
 		close(vmsg->fds[i]);
 }
 
-/**
- * vu_remove_watch() - Remove a file descriptor from our passt epoll
- * 		       file descriptor
- * @vdev:	vhost-user device
- * @fd:		file descriptor to remove
- */
-static void vu_remove_watch(const struct vu_dev *vdev, int fd)
-{
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
-}
-
 /**
  * vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
  * 			  and fd_num
@@ -748,7 +737,7 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
 		vdev->vq[idx].call_fd = -1;
 	}
 	if (vdev->vq[idx].kick_fd != -1) {
-		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		epoll_del(vdev->context, vdev->vq[idx].kick_fd);
 		close(vdev->vq[idx].kick_fd);
 		vdev->vq[idx].kick_fd = -1;
 	}
@@ -816,7 +805,7 @@ static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
 	vu_check_queue_msg_file(msg);
 
 	if (vdev->vq[idx].kick_fd != -1) {
-		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		epoll_del(vdev->context, vdev->vq[idx].kick_fd);
 		close(vdev->vq[idx].kick_fd);
 		vdev->vq[idx].kick_fd = -1;
 	}
@@ -1063,7 +1052,7 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 		die("Invalide device_state_fd direction: %d", direction);
 
 	if (vdev->device_state_fd != -1) {
-		vu_remove_watch(vdev, vdev->device_state_fd);
+		epoll_del(vdev->context, vdev->device_state_fd);
 		close(vdev->device_state_fd);
 	}
 
@@ -1145,7 +1134,7 @@ void vu_cleanup(struct vu_dev *vdev)
 			vq->err_fd = -1;
 		}
 		if (vq->kick_fd != -1) {
-			vu_remove_watch(vdev, vq->kick_fd);
+			epoll_del(vdev->context, vq->kick_fd);
 			close(vq->kick_fd);
 			vq->kick_fd = -1;
 		}
@@ -1169,7 +1158,7 @@ void vu_cleanup(struct vu_dev *vdev)
 	vu_close_log(vdev);
 
 	if (vdev->device_state_fd != -1) {
-		vu_remove_watch(vdev, vdev->device_state_fd);
+		epoll_del(vdev->context, vdev->device_state_fd);
 		close(vdev->device_state_fd);
 		vdev->device_state_fd = -1;
 		vdev->device_state_result = -1;
diff --git a/vu_common.c b/vu_common.c
index f43d8ac..2c12dca 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -325,8 +325,7 @@ void vu_migrate(struct vu_dev *vdev, uint32_t events)
 		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
 		vdev->device_state_result = ret == -1 ? -1 : 0;
 		/* Closing the file descriptor signals the end of transfer */
-		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
-			  vdev->device_state_fd, NULL);
+		epoll_del(vdev->context, vdev->device_state_fd);
 		close(vdev->device_state_fd);
 		vdev->device_state_fd = -1;
 	} else if (events & EPOLLIN) {
@@ -346,8 +345,7 @@ void vu_migrate(struct vu_dev *vdev, uint32_t events)
 		debug("Closing migration channel");
 
 		/* The end of file signals the end of the transfer. */
-		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
-			  vdev->device_state_fd, NULL);
+		epoll_del(vdev->context, vdev->device_state_fd);
 		close(vdev->device_state_fd);
 		vdev->device_state_fd = -1;
 	}

From dcd6d8191aa29f232593ad2819a197e135f8cac8 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 31 Jan 2025 19:13:00 +0100
Subject: [PATCH 032/221] tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros

Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_internal.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tcp_internal.h b/tcp_internal.h
index 94e5780..9cf31f5 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -38,9 +38,13 @@
 #define OPT_SACK	5
 #define OPT_TS		8
 
-#define TAPSIDE(conn_)	((conn_)->f.pif[1] == PIF_TAP)
-#define TAPFLOW(conn_)	(&((conn_)->f.side[TAPSIDE(conn_)]))
-#define TAP_SIDX(conn_)	(FLOW_SIDX((conn_), TAPSIDE(conn_)))
+#define TAPSIDE(conn_)		((conn_)->f.pif[1] == PIF_TAP)
+#define TAPFLOW(conn_)		(&((conn_)->f.side[TAPSIDE(conn_)]))
+#define TAP_SIDX(conn_)		(FLOW_SIDX((conn_), TAPSIDE(conn_)))
+
+#define HOSTSIDE(conn_)		((conn_)->f.pif[1] == PIF_HOST)
+#define HOSTFLOW(conn_)		(&((conn_)->f.side[HOSTSIDE(conn_)]))
+#define HOST_SIDX(conn_)	(FLOW_SIDX((conn_), TAPSIDE(conn_)))
 
 #define CONN_V4(conn)		(!!inany_v4(&TAPFLOW(conn)->oaddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))

From bf2860819d868c7d116923e9b5d798d410d38715 Mon Sep 17 00:00:00 2001
From: 7ppKb5bW <pONy4THS@protonmail.com>
Date: Sun, 2 Feb 2025 19:21:21 +0000
Subject: [PATCH 033/221] pasta.te: fix demo.sh and remove one duplicate rule

On Fedora 41, without "allow pasta_t unconfined_t:dir read"
/usr/bin/pasta can't open /proc/[pid]/ns, which is required by
pasta_netns_quit_init().

This patch also remove one duplicate rule "allow pasta_t nsfs_t:file
read;", "allow pasta_t nsfs_t:file { open read };" at line 123 is
enough.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/pasta.te | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/contrib/selinux/pasta.te b/contrib/selinux/pasta.te
index 69be081..d0ff0cc 100644
--- a/contrib/selinux/pasta.te
+++ b/contrib/selinux/pasta.te
@@ -171,7 +171,7 @@ allow pasta_t init_t:lnk_file read;
 allow pasta_t init_t:unix_stream_socket connectto;
 allow pasta_t init_t:dbus send_msg;
 allow pasta_t init_t:system status;
-allow pasta_t unconfined_t:dir search;
+allow pasta_t unconfined_t:dir { read search };
 allow pasta_t unconfined_t:file read;
 allow pasta_t unconfined_t:lnk_file read;
 allow pasta_t self:process { setpgid setcap };
@@ -192,8 +192,6 @@ allow pasta_t sysctl_net_t:dir search;
 allow pasta_t sysctl_net_t:file { open read write };
 allow pasta_t kernel_t:system module_request;
 
-allow pasta_t nsfs_t:file read;
-
 allow pasta_t proc_t:dir mounton;
 allow pasta_t proc_t:filesystem mount;
 allow pasta_t net_conf_t:lnk_file read;

From 722d347c1932f630a53ba05ea0270a651ed601b2 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon, 3 Feb 2025 08:19:16 +0100
Subject: [PATCH 034/221] tcp: Don't reset outbound connection on SYN retries

Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.

In that case, we should go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.

Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tcp.c b/tcp.c
index 7787381..51ad692 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1920,6 +1920,9 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	/* Establishing connection from tap */
 	if (conn->events & TAP_SYN_RCVD) {
+		if (th->syn && !th->ack && !th->fin)
+			return 1;	/* SYN retry: ignore and keep waiting */
+
 		if (!(conn->events & TAP_SYN_ACK_SENT))
 			goto reset;
 

From b75ad159e8a13a10ce1fb4b86503636420da126d Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 2 Feb 2025 20:49:58 +0100
Subject: [PATCH 035/221] vhost_user: On 32-bit ARM, mmap() is not available,
 mmap2() is used instead

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0
Fixes: 31117b27c6c9 ("vhost-user: introduce vhost-user API")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 vhost_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vhost_user.c b/vhost_user.c
index bbbf504..58baee2 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -419,7 +419,7 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
  *
  * Return: False as no reply is requested
  *
- * #syscalls:vu mmap munmap
+ * #syscalls:vu mmap|mmap2 munmap
  */
 static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 				  struct vhost_user_msg *msg)

From 71fa7362776bfa075d83383b600d2beeab923893 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 2 Feb 2025 20:53:47 +0100
Subject: [PATCH 036/221] tcp_splice, udp_flow: fcntl64() support on PPC64
 depends on glibc version

I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
anyway.

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_splice.c | 2 +-
 udp_flow.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 5db1d62..f048a82 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -28,7 +28,7 @@
  * - FIN_SENT_0:		FIN (write shutdown) sent to accepted socket
  * - FIN_SENT_1:		FIN (write shutdown) sent to target socket
  *
- * #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
+ * #syscalls:pasta pipe2|pipe fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
  */
 
 #include <sched.h>
diff --git a/udp_flow.c b/udp_flow.c
index 7fae81d..83c2568 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -174,7 +174,7 @@ cancel:
  * @s_in:	Source socket address, filled in by recvmmsg()
  * @now:	Timestamp
  *
- * #syscalls fcntl arm:fcntl64 ppc64:fcntl64 i686:fcntl64
+ * #syscalls fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64
  *
  * Return: sidx for the destination side of the flow for this packet, or
  *         FLOW_SIDX_NONE if we couldn't find or create a flow.

From e25a93032f8c09f1e0bfbc32e81431dd995f9605 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 26 Jan 2025 09:05:03 +0100
Subject: [PATCH 037/221] util: Add read_remainder() and read_all_buf()

These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.

I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.

[dgibson: make read_remainder() take const pointer to iovec]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 util.h |  2 ++
 2 files changed, 86 insertions(+)

diff --git a/util.c b/util.c
index c7b09f0..800c6b5 100644
--- a/util.c
+++ b/util.c
@@ -606,6 +606,90 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
 	return 0;
 }
 
+/**
+ * read_all_buf() - Fill a whole buffer from a file descriptor
+ * @fd:		File descriptor
+ * @buf:	Pointer to base of buffer
+ * @len:	Length of buffer
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * #syscalls read
+ */
+int read_all_buf(int fd, void *buf, size_t len)
+{
+	size_t left = len;
+	char *p = buf;
+
+	while (left) {
+		ssize_t rc;
+
+		ASSERT(left <= len);
+
+		do
+			rc = read(fd, p, left);
+		while ((rc < 0) && errno == EINTR);
+
+		if (rc < 0)
+			return -1;
+
+		if (rc == 0) {
+			errno = ENODATA;
+			return -1;
+		}
+
+		p += rc;
+		left -= rc;
+	}
+	return 0;
+}
+
+/**
+ * read_remainder() - Read the tail of an IO vector from a file descriptor
+ * @fd:		File descriptor
+ * @iov:	IO vector
+ * @cnt:	Number of entries in @iov
+ * @skip:	Number of bytes of the vector to skip reading
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * Note: mode-specific seccomp profiles need to enable readv() to use this.
+ */
+/* cppcheck-suppress unusedFunction */
+int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip)
+{
+	size_t i = 0, offset;
+
+	while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
+		ssize_t rc;
+
+		if (offset) {
+			ASSERT(offset < iov[i].iov_len);
+			/* Read the remainder of the partially read buffer */
+			if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
+					 iov[i].iov_len - offset) < 0)
+				return -1;
+			i++;
+		}
+
+		if (cnt == i)
+			break;
+
+		/* Fill as many of the remaining buffers as we can */
+		rc = readv(fd, &iov[i], cnt - i);
+		if (rc < 0)
+			return -1;
+
+		if (rc == 0) {
+			errno = ENODATA;
+			return -1;
+		}
+
+		skip = rc;
+	}
+	return 0;
+}
+
 /** sockaddr_ntop() - Convert a socket address to text format
  * @sa:		Socket address
  * @dst:	output buffer, minimum SOCKADDR_STRLEN bytes
diff --git a/util.h b/util.h
index 800a28b..23b165c 100644
--- a/util.h
+++ b/util.h
@@ -203,6 +203,8 @@ int fls(unsigned long x);
 int write_file(const char *path, const char *buf);
 int write_all_buf(int fd, const void *buf, size_t len);
 int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
+int read_all_buf(int fd, void *buf, size_t len);
+int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
 void close_open_files(int argc, char **argv);
 bool snprintf_check(char *str, size_t size, const char *format, ...);
 

From e894d9ae8212c49dc44e52ad583954ed24e6905b Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 31 Jan 2025 11:41:51 +0100
Subject: [PATCH 038/221] vhost_user: Turn some vhost-user message reports to
 trace()

Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.

Change per-packet debug() messages in vu_kick_cb() and
vu_send_single() to trace()

[dgibson: switch different messages to trace()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 4 ++--
 vu_common.c  | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/vhost_user.c b/vhost_user.c
index 58baee2..9e38cfd 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -640,8 +640,8 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev,
 	unsigned int idx = msg->payload.state.index;
 	unsigned int num = msg->payload.state.num;
 
-	debug("State.index: %u", idx);
-	debug("State.num:   %u", num);
+	trace("State.index: %u", idx);
+	trace("State.num:   %u", num);
 	vdev->vq[idx].vring.num = num;
 
 	return false;
diff --git a/vu_common.c b/vu_common.c
index 2c12dca..ab04d31 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -238,7 +238,7 @@ void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 	if (rc == -1)
 		die_perror("vhost-user kick eventfd_read()");
 
-	debug("vhost-user: got kick_data: %016"PRIx64" idx: %d",
+	trace("vhost-user: got kick_data: %016"PRIx64" idx: %d",
 	      kick_data, ref.queue);
 	if (VHOST_USER_IS_QUEUE_TX(ref.queue))
 		vu_handle_tx(vdev, ref.queue, now);
@@ -262,7 +262,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 	int elem_cnt;
 	int i;
 
-	debug("vu_send_single size %zu", size);
+	trace("vu_send_single size %zu", size);
 
 	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
 		debug("Got packet, but RX virtqueue not usable yet");
@@ -294,7 +294,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 
 	vu_flush(vdev, vq, elem, elem_cnt);
 
-	debug("vhost-user sent %zu", total);
+	trace("vhost-user sent %zu", total);
 
 	return total;
 err:

From 8c24301462c39027e6eb6f1ad56c1f6c83fb0c23 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 28 Jan 2025 00:03:13 +0100
Subject: [PATCH 039/221] Introduce passt-repair

A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 .gitignore                            |   1 +
 Makefile                              |  17 ++-
 contrib/apparmor/usr.bin.passt-repair |  29 +++++
 contrib/fedora/passt.spec             |   2 +
 contrib/selinux/passt-repair.fc       |  11 ++
 contrib/selinux/passt-repair.te       |  58 ++++++++++
 hooks/pre-push                        |   1 +
 passt-repair.1                        |  70 ++++++++++++
 passt-repair.c                        | 154 ++++++++++++++++++++++++++
 seccomp.sh                            |   6 +-
 10 files changed, 342 insertions(+), 7 deletions(-)
 create mode 100644 contrib/apparmor/usr.bin.passt-repair
 create mode 100644 contrib/selinux/passt-repair.fc
 create mode 100644 contrib/selinux/passt-repair.te
 create mode 100644 passt-repair.1
 create mode 100644 passt-repair.c

diff --git a/.gitignore b/.gitignore
index d1c8be9..5824a71 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,6 +3,7 @@
 /passt.avx2
 /pasta
 /pasta.avx2
+/passt-repair
 /qrap
 /pasta.1
 /seccomp.h
diff --git a/Makefile b/Makefile
index 464eef1..6ab8d24 100644
--- a/Makefile
+++ b/Makefile
@@ -42,9 +42,10 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
-SRCS = $(PASST_SRCS) $(QRAP_SRCS)
+PASST_REPAIR_SRCS = passt-repair.c
+SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
 
-MANPAGES = passt.1 pasta.1 qrap.1
+MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
@@ -72,9 +73,9 @@ mandir		?= $(datarootdir)/man
 man1dir		?= $(mandir)/man1
 
 ifeq ($(TARGET_ARCH),x86_64)
-BIN := passt passt.avx2 pasta pasta.avx2 qrap
+BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
 else
-BIN := passt pasta qrap
+BIN := passt pasta qrap passt-repair
 endif
 
 all: $(BIN) $(MANPAGES) docs
@@ -83,7 +84,10 @@ static: FLAGS += -static -DGLIBC_NO_STATIC_NSS
 static: clean all
 
 seccomp.h: seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
-	@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh $(PASST_SRCS) $(PASST_HEADERS)
+	@ EXTRA_SYSCALLS="$(EXTRA_SYSCALLS)" ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp.h $(PASST_SRCS) $(PASST_HEADERS)
+
+seccomp_repair.h: seccomp.sh $(PASST_REPAIR_SRCS)
+	@ ARCH="$(TARGET_ARCH)" CC="$(CC)" ./seccomp.sh seccomp_repair.h $(PASST_REPAIR_SRCS)
 
 passt: $(PASST_SRCS) $(HEADERS)
 	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_SRCS) -o passt $(LDFLAGS)
@@ -101,6 +105,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
 qrap: $(QRAP_SRCS) passt.h
 	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
 
+passt-repair: $(PASST_REPAIR_SRCS) seccomp_repair.h
+	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
+
 valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction	\
 			    rt_sigreturn getpid gettid kill clock_gettime mmap \
 			    mmap2 munmap open unlink gettimeofday futex statx \
diff --git a/contrib/apparmor/usr.bin.passt-repair b/contrib/apparmor/usr.bin.passt-repair
new file mode 100644
index 0000000..901189d
--- /dev/null
+++ b/contrib/apparmor/usr.bin.passt-repair
@@ -0,0 +1,29 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# contrib/apparmor/usr.bin.passt-repair - AppArmor profile for passt-repair(1)
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+abi <abi/3.0>,
+
+#include <tunables/global>
+
+profile passt-repair /usr/bin/passt-repair {
+  #include <abstractions/base>
+  /** rw,			# passt's ".repair" socket might be anywhere
+  unix (connect, receive, send) type=stream,
+
+  capability dac_override,	# connect to passt's socket as root
+  capability net_admin,		# currently needed for TCP_REPAIR socket option
+  capability net_raw,		# what TCP_REPAIR should require instead
+
+  network unix stream,		# connect and use UNIX domain socket
+  network inet stream,		# use TCP sockets
+}
diff --git a/contrib/fedora/passt.spec b/contrib/fedora/passt.spec
index 7950fb9..6a83f8b 100644
--- a/contrib/fedora/passt.spec
+++ b/contrib/fedora/passt.spec
@@ -108,9 +108,11 @@ fi
 %{_bindir}/passt
 %{_bindir}/pasta
 %{_bindir}/qrap
+%{_bindir}/passt-repair
 %{_mandir}/man1/passt.1*
 %{_mandir}/man1/pasta.1*
 %{_mandir}/man1/qrap.1*
+%{_mandir}/man1/passt-repair.1*
 %ifarch x86_64
 %{_bindir}/passt.avx2
 %{_mandir}/man1/passt.avx2.1*
diff --git a/contrib/selinux/passt-repair.fc b/contrib/selinux/passt-repair.fc
new file mode 100644
index 0000000..bcd526e
--- /dev/null
+++ b/contrib/selinux/passt-repair.fc
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# contrib/selinux/passt-repair.fc - SELinux: File Context for passt-repair
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+/usr/bin/passt-repair		system_u:object_r:passt_repair_exec_t:s0
diff --git a/contrib/selinux/passt-repair.te b/contrib/selinux/passt-repair.te
new file mode 100644
index 0000000..e3ffbcd
--- /dev/null
+++ b/contrib/selinux/passt-repair.te
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# contrib/selinux/passt-repair.te - SELinux: Type Enforcement for passt-repair
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+policy_module(passt-repair, 0.1)
+
+require {
+	type unconfined_t;
+	type passt_t;
+	role unconfined_r;
+	class process transition;
+
+	class file { read execute execute_no_trans entrypoint open map };
+	class capability { dac_override net_admin net_raw };
+	class chr_file { append open getattr read write ioctl };
+
+	class unix_stream_socket { create connect sendto };
+	class sock_file { read write };
+
+	class tcp_socket { read setopt write };
+
+	type console_device_t;
+	type user_devpts_t;
+	type user_tmp_t;
+}
+
+type passt_repair_t;
+domain_type(passt_repair_t);
+type passt_repair_exec_t;
+files_type(passt_repair_exec_t);
+
+role unconfined_r types passt_repair_t;
+
+allow passt_repair_t passt_repair_exec_t:file { read execute execute_no_trans entrypoint open map };
+type_transition unconfined_t passt_repair_exec_t:process passt_repair_t;
+allow unconfined_t passt_repair_t:process transition;
+
+allow passt_repair_t self:capability { dac_override net_admin net_raw };
+
+allow passt_repair_t console_device_t:chr_file { append open getattr read write ioctl };
+allow passt_repair_t user_devpts_t:chr_file { append open getattr read write ioctl };
+
+allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
+allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
+allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
+
+allow passt_repair_t unconfined_t:sock_file { read write };
+allow passt_repair_t passt_t:sock_file { read write };
+allow passt_repair_t user_tmp_t:sock_file { read write };
+
+allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
+allow passt_repair_t passt_t:tcp_socket { read setopt write };
diff --git a/hooks/pre-push b/hooks/pre-push
index 33a2052..8dbfa5f 100755
--- a/hooks/pre-push
+++ b/hooks/pre-push
@@ -56,6 +56,7 @@ cd ..
 make pkgs
 scp passt passt.avx2 passt.1 qrap qrap.1	"${USER_HOST}:${BIN}"
 scp pasta pasta.avx2 pasta.1			"${USER_HOST}:${BIN}"
+scp passt-repair passt-repair.1			"${USER_HOST}:${BIN}"
 
 ssh "${USER_HOST}" 				"rm -f ${BIN}/*.deb"
 ssh "${USER_HOST}"				"rm -f ${BIN}/*.rpm"
diff --git a/passt-repair.1 b/passt-repair.1
new file mode 100644
index 0000000..8d07c97
--- /dev/null
+++ b/passt-repair.1
@@ -0,0 +1,70 @@
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\" Copyright (c) 2025 Red Hat GmbH
+.\" Author: Stefano Brivio <sbrivio@redhat.com>
+.TH passt-repair 1
+
+.SH NAME
+.B passt-repair
+\- Helper setting TCP_REPAIR socket options for \fBpasst\fR(1)
+
+.SH SYNOPSIS
+.B passt-repair
+\fIPATH\fR
+
+.SH DESCRIPTION
+
+.B passt-repair
+is a privileged helper setting and clearing repair mode on TCP sockets on behalf
+of \fBpasst\fR(1), as instructed via single-byte commands over a UNIX domain
+socket, specified by \fIPATH\fR.
+
+It can be used to migrate TCP connections between guests without granting
+additional capabilities to \fBpasst\fR(1) itself: to migrate TCP connections,
+\fBpasst\fR(1) leverages repair mode, which needs the \fBCAP_NET_ADMIN\fR
+capability (see \fBcapabilities\fR(7)) to be set or cleared.
+
+.SH PROTOCOL
+
+\fBpasst-repair\fR(1) connects to \fBpasst\fR(1) using the socket specified via
+\fI--repair-path\fR option in \fBpasst\fR(1) itself. By default, the name is the
+same as the UNIX domain socket used for guest communication, suffixed by
+\fI.repair\fR.
+
+The messages consist of one 8-bit signed integer that can be \fITCP_REPAIR_ON\fR
+(1), \fITCP_REPAIR_OFF\fR (2), or \fITCP_REPAIR_OFF_WP\fR (-1), as defined by
+the Linux kernel user API, and one to SCM_MAX_FD (253) sockets as SCM_RIGHTS
+(see \fBunix\fR(7)) ancillary message, sent by the server, \fBpasst\fR(1).
+
+The client, \fBpasst-repair\fR(1), replies with the same byte (and no ancillary
+message) to indicate success, and closes the connection on failure.
+
+The server closes the connection on error or completion.
+
+.SH NOTES
+
+\fBpasst-repair\fR(1) can be granted the \fBCAP_NET_ADMIN\fR capability
+(preferred, as it limits privileges to the strictly necessary ones), or it can
+be run as root.
+
+.SH AUTHOR
+
+Stefano Brivio <sbrivio@redhat.com>.
+
+.SH REPORTING BUGS
+
+Please report issues on the bug tracker at https://bugs.passt.top/, or
+send a message to the passt-user@passt.top mailing list, see
+https://lists.passt.top/.
+
+.SH COPYRIGHT
+
+Copyright (c) 2025 Red Hat GmbH.
+
+\fBpasst-repair\fR is free software: you can redistribute them and/or modify
+them under the terms of the GNU General Public License as published by the Free
+Software Foundation, either version 2 of the License, or (at your option) any
+later version. 
+
+.SH SEE ALSO
+
+\fBpasst\fR(1), \fBqemu\fR(1), \fBcapabilities\fR(7), \fBunix\fR(7).
diff --git a/passt-repair.c b/passt-repair.c
new file mode 100644
index 0000000..767a821
--- /dev/null
+++ b/passt-repair.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ *
+ * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
+ * with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
+ * off. Reply by echoing the command. Exit on EOF.
+ */
+
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/tcp.h>
+
+#include <linux/audit.h>
+#include <linux/capability.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+#include "seccomp_repair.h"
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+/**
+ * main() - Entry point and whole program with loop
+ * @argc:	Argument count, must be 2
+ * @argv:	Argument: path of UNIX domain socket to connect to
+ *
+ * Return: 0 on success (EOF), 1 on error, 2 on usage error
+ *
+ * #syscalls:repair connect setsockopt write exit_group
+ * #syscalls:repair socket s390x:socketcall i686:socketcall
+ * #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
+ * #syscalls:repair sendto sendmsg arm:send ppc64le:send
+ */
+int main(int argc, char **argv)
+{
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct sockaddr_un a = { AF_UNIX, "" };
+	int fds[SCM_MAX_FD], s, ret, i, n;
+	struct sock_fprog prog;
+	int8_t cmd = INT8_MAX;
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+	struct iovec iov;
+
+	prctl(PR_SET_DUMPABLE, 0);
+
+	prog.len = (unsigned short)sizeof(filter_repair) /
+				   sizeof(filter_repair[0]);
+	prog.filter = filter_repair;
+	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
+	    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+		fprintf(stderr, "Failed to apply seccomp filter");
+		return 1;
+	}
+
+	iov = (struct iovec){ &cmd, sizeof(cmd) };
+	msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	if (argc != 2) {
+		fprintf(stderr, "Usage: %s PATH\n", argv[0]);
+		return 2;
+	}
+
+	ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+		fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
+		return 2;
+	}
+
+	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+		perror("Failed to create AF_UNIX socket");
+		return 1;
+	}
+
+	if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+		fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+			strerror(errno));
+		return 1;
+	}
+
+loop:
+	ret = recvmsg(s, &msg, 0);
+	if (ret < 0) {
+		perror("Failed to receive message");
+		return 1;
+	}
+
+	if (!ret)	/* Done */
+		return 0;
+
+	if (!cmsg ||
+	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
+	    cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
+	    cmsg->cmsg_type != SCM_RIGHTS) {
+		fprintf(stderr, "No/bad ancillary data from peer\n");
+		return 1;
+	}
+
+	n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
+
+	if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
+	    cmd != TCP_REPAIR_OFF_NO_WP) {
+		fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
+		return 1;
+	}
+
+	for (i = 0; i < n; i++) {
+		int o = cmd;
+
+		if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) {
+			fprintf(stderr,
+				"Setting TCP_REPAIR to %i on socket %i: %s", o,
+				fds[i], strerror(errno));
+			return 1;
+		}
+
+		/* Close _our_ copy */
+		close(fds[i]);
+
+		/* Confirm setting by echoing the command back */
+		if (send(s, &cmd, sizeof(cmd), 0) < 0) {
+			fprintf(stderr, "Reply to command %i: %s\n",
+				o, strerror(errno));
+			return 1;
+		}
+	}
+
+	goto loop;
+
+	return 0;
+}
diff --git a/seccomp.sh b/seccomp.sh
index 6499c58..4c521ae 100755
--- a/seccomp.sh
+++ b/seccomp.sh
@@ -14,8 +14,10 @@
 # Author: Stefano Brivio <sbrivio@redhat.com>
 
 TMP="$(mktemp)"
-IN="$@"
 OUT="$(mktemp)"
+OUT_FINAL="${1}"
+shift
+IN="$@"
 
 [ -z "${ARCH}" ] && ARCH="$(uname -m)"
 [ -z "${CC}" ] && CC="cc"
@@ -268,4 +270,4 @@ for __p in ${__profiles}; do
 	gen_profile "${__p}" ${__calls}
 done
 
-mv "${OUT}" seccomp.h
+mv "${OUT}" "${OUT_FINAL}"

From 52e57f9c9a6d8ae4153ac592d01d868b31c10171 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 31 Jan 2025 18:27:07 +0100
Subject: [PATCH 040/221] tcp: Get socket port and address using getsockname()
 when connecting from guest

For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.

Similar for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to the equivalent interface as it was on the source.

Use getsockname() to fetch them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c       |  4 ++--
 flow_table.h |  4 ++--
 tcp.c        | 28 +++++++++++++++++++++++++++-
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/flow.c b/flow.c
index ee1221b..a6fe6d1 100644
--- a/flow.c
+++ b/flow.c
@@ -414,8 +414,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
  *
  * Return: pointer to the target flowside information
  */
-const struct flowside *flow_target(const struct ctx *c, union flow *flow,
-				   uint8_t proto)
+struct flowside *flow_target(const struct ctx *c, union flow *flow,
+			     uint8_t proto)
 {
 	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
 	struct flow_common *f = &flow->f;
diff --git a/flow_table.h b/flow_table.h
index f15db53..eeb6f41 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -168,8 +168,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
 				      sa_family_t af,
 				      const void *saddr, in_port_t sport,
 				      const void *daddr, in_port_t dport);
-const struct flowside *flow_target(const struct ctx *c, union flow *flow,
-				   uint8_t proto);
+struct flowside *flow_target(const struct ctx *c, union flow *flow,
+			     uint8_t proto);
 
 union flow *flow_set_type(union flow *flow, enum flow_type type);
 #define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
diff --git a/tcp.c b/tcp.c
index 51ad692..fac322c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1415,6 +1415,8 @@ static void tcp_bind_outbound(const struct ctx *c,
  * @opts:	Pointer to start of options
  * @optlen:	Bytes in options: caller MUST ensure available length
  * @now:	Current timestamp
+ *
+ * #syscalls:vu getsockname
  */
 static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 			      const void *saddr, const void *daddr,
@@ -1423,9 +1425,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 {
 	in_port_t srcport = ntohs(th->source);
 	in_port_t dstport = ntohs(th->dest);
-	const struct flowside *ini, *tgt;
+	const struct flowside *ini;
 	struct tcp_tap_conn *conn;
 	union sockaddr_inany sa;
+	struct flowside *tgt;
 	union flow *flow;
 	int s = -1, mss;
 	uint64_t hash;
@@ -1530,6 +1533,29 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	}
 
 	tcp_epoll_ctl(c, conn);
+
+	if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
+		if (af == AF_INET) {
+			struct sockaddr_in s_in;
+
+			sl = sizeof(s_in);
+			if (!getsockname(s, (struct sockaddr *)&s_in, &sl)) {
+				/* NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) */
+				tgt->oport = ntohs(s_in.sin_port);
+				tgt->oaddr = inany_from_v4(s_in.sin_addr);
+			}
+		} else {
+			struct sockaddr_in6 s_in6;
+
+			sl = sizeof(s_in6);
+			if (!getsockname(s, (struct sockaddr *)&s_in6, &sl)) {
+				/* NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) */
+				tgt->oport = ntohs(s_in6.sin6_port);
+				tgt->oaddr.a6 = s_in6.sin6_addr;
+			}
+		}
+	}
+
 	FLOW_ACTIVATE(conn);
 	return;
 

From dcf014be8876d5417b0eddb8b07152c6b2035485 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 2 Feb 2025 10:38:46 +0100
Subject: [PATCH 041/221] doc: Add mock of migration source and target

These test programs show the migration of a TCP connection using the
passt-repair helper.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 doc/migration/.gitignore |   2 +
 doc/migration/Makefile   |  20 ++++++++
 doc/migration/README     |  51 ++++++++++++++++++++
 doc/migration/source.c   |  92 +++++++++++++++++++++++++++++++++++
 doc/migration/target.c   | 102 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 267 insertions(+)
 create mode 100644 doc/migration/.gitignore
 create mode 100644 doc/migration/Makefile
 create mode 100644 doc/migration/README
 create mode 100644 doc/migration/source.c
 create mode 100644 doc/migration/target.c

diff --git a/doc/migration/.gitignore b/doc/migration/.gitignore
new file mode 100644
index 0000000..59cb765
--- /dev/null
+++ b/doc/migration/.gitignore
@@ -0,0 +1,2 @@
+/source
+/target
diff --git a/doc/migration/Makefile b/doc/migration/Makefile
new file mode 100644
index 0000000..04f6891
--- /dev/null
+++ b/doc/migration/Makefile
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+TARGETS = source target
+CFLAGS = -Wall -Wextra -pedantic
+
+all: $(TARGETS)
+
+$(TARGETS): %: %.c
+
+clean:
+	rm -f $(TARGETS)
diff --git a/doc/migration/README b/doc/migration/README
new file mode 100644
index 0000000..375603b
--- /dev/null
+++ b/doc/migration/README
@@ -0,0 +1,51 @@
+<!---
+SPDX-License-Identifier: GPL-2.0-or-later
+Copyright (c) 2025 Red Hat GmbH
+Author: Stefano Brivio <sbrivio@redhat.com>
+-->
+
+Migration
+=========
+
+These test programs show a migration of a TCP connection from one process to
+another using the TCP_REPAIR socket option.
+
+The two processes are a mock of the matching implementation in passt(1), and run
+unprivileged, so they rely on the passt-repair helper to connect to them and set
+or clear TCP_REPAIR on the connection socket, transferred to the helper using
+SCM_RIGHTS.
+
+The passt-repair helper needs to have the CAP_NET_ADMIN capability, or run as
+root.
+
+Example of usage
+----------------
+
+* Start the test server
+
+        $ nc -l 9999
+
+* Start the source side of the TCP client (mock of the source instance of passt)
+
+        $ ./source 127.0.0.1 9999 9998 /tmp/repair.sock
+
+* The client sends a test string, and waits for a connection from passt-repair
+
+        # passt-repair /tmp/repair.sock
+
+* The socket is now in repair mode, and `source` dumps sequences, then exits
+
+        sending sequence: 3244673313
+        receiving sequence: 2250449386
+
+* Continue the connection on the target side, restarting from those sequences
+
+        $ ./target 127.0.0.1 9999 9998 /tmp/repair.sock 3244673313 2250449386
+
+* The target side now waits for a connection from passt-repair
+
+        # passt-repair /tmp/repair.sock
+
+* The target side asks passt-repair to switch the socket to repair mode, sets up
+  the TCP sequences, then asks passt-repair to clear repair mode, and sends a
+  test string to the server
diff --git a/doc/migration/source.c b/doc/migration/source.c
new file mode 100644
index 0000000..d44ebf1
--- /dev/null
+++ b/doc/migration/source.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * doc/migration/source.c - Mock of TCP migration source, use with passt-repair
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+#include <netinet/tcp.h>
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
+	struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
+				  NULL, NULL, NULL };
+	struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
+	int seq, s, s_helper;
+	int8_t cmd;
+	struct iovec iov = { &cmd, sizeof(cmd) };
+	char buf[CMSG_SPACE(sizeof(int))];
+	struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+	socklen_t seqlen = sizeof(int);
+	struct addrinfo *r;
+
+	(void)argc;
+
+	if (argc != 5) {
+		fprintf(stderr, "%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH\n",
+			argv[0]);
+		return -1;
+	}
+
+	strcpy(a_helper.sun_path, argv[4]);
+	getaddrinfo(argv[1], argv[2], &hints, &r);
+
+	/* Connect socket to server and send some data */
+	s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
+	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
+	bind(s, (struct sockaddr *)&a, sizeof(a));
+	connect(s, r->ai_addr, r->ai_addrlen);
+	send(s, "before migration\n", sizeof("before migration\n"), 0);
+
+	/* Wait for helper */
+	s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
+	unlink(a_helper.sun_path);
+	bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
+	listen(s_helper, 1);
+	s_helper = accept(s_helper, NULL, NULL);
+
+	/* Set up message for helper, with socket */
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+	memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
+
+	/* Send command to helper: turn repair mode on, wait for reply */
+	cmd = TCP_REPAIR_ON;
+	sendmsg(s_helper, &msg, 0);
+	recv(s_helper, &((int8_t){ 0 }), 1, 0);
+
+	/* Terminate helper */
+	close(s_helper);
+
+	/* Get sending sequence */
+	seq = TCP_SEND_QUEUE;
+	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
+	getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
+	fprintf(stdout, "%u ", seq);
+
+	/* Get receiving sequence */
+	seq = TCP_RECV_QUEUE;
+	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
+	getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
+	fprintf(stdout, "%u\n", seq);
+}
diff --git a/doc/migration/target.c b/doc/migration/target.c
new file mode 100644
index 0000000..f7d3108
--- /dev/null
+++ b/doc/migration/target.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * doc/migration/target.c - Mock of TCP migration target, use with passt-repair
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+#include <netinet/tcp.h>
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
+	struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
+				  NULL, NULL, NULL };
+	struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
+	int s, s_helper, seq;
+	int8_t cmd;
+	struct iovec iov = { &cmd, sizeof(cmd) };
+	char buf[CMSG_SPACE(sizeof(int))];
+	struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+	struct addrinfo *r;
+
+	(void)argc;
+
+	strcpy(a_helper.sun_path, argv[4]);
+	getaddrinfo(argv[1], argv[2], &hints, &r);
+
+	if (argc != 7) {
+		fprintf(stderr,
+			"%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH SSEQ RSEQ\n",
+			argv[0]);
+		return -1;
+	}
+
+	/* Prepare socket, bind to source port */
+	s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
+	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
+	bind(s, (struct sockaddr *)&a, sizeof(a));
+
+	/* Wait for helper */
+	s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
+	unlink(a_helper.sun_path);
+	bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
+	listen(s_helper, 1);
+	s_helper = accept(s_helper, NULL, NULL);
+
+	/* Set up message for helper, with socket */
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+	memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
+
+	/* Send command to helper: turn repair mode on, wait for reply */
+	cmd = TCP_REPAIR_ON;
+	sendmsg(s_helper, &msg, 0);
+	recv(s_helper, &((int){ 0 }), 1, 0);
+
+	/* Set sending sequence */
+	seq = TCP_SEND_QUEUE;
+	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
+	seq = atoi(argv[5]);
+	setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
+
+	/* Set receiving sequence */
+	seq = TCP_RECV_QUEUE;
+	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
+	seq = atoi(argv[6]);
+	setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
+
+	/* Connect setting kernel state only, without actual SYN / handshake */
+	connect(s, r->ai_addr, r->ai_addrlen);
+
+	/* Send command to helper: turn repair mode off, wait for reply */
+	cmd = TCP_REPAIR_OFF;
+	sendmsg(s_helper, &msg, 0);
+
+	recv(s_helper, &((int8_t){ 0 }), 1, 0);
+
+	/* Terminate helper */
+	close(s_helper);
+
+	/* Send some more data */
+	send(s, "after migration\n", sizeof("after migration\n"), 0);
+}

From b4a7b5d4a66db5f419cb5de87da3403cfba3847d Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 4 Feb 2025 16:42:13 +1100
Subject: [PATCH 042/221] migrate: Fix several errors with passt-repair

The passt-repair helper is now merged, but alas it contains several small
bugs:
 * close() is not in the seccomp profile, meaning it will immediately
   SIGSYS when you make a request of it
 * The generated header, seccomp_repair.h isn't listed in .gitignore or
   removed by "make clean"

Fixes: 8c24301462c3 ("Introduce passt-repair")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 .gitignore     | 1 +
 Makefile       | 2 +-
 passt-repair.c | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/.gitignore b/.gitignore
index 5824a71..3c16adc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,5 +7,6 @@
 /qrap
 /pasta.1
 /seccomp.h
+/seccomp_repair.h
 /c*.json
 README.plain.md
diff --git a/Makefile b/Makefile
index 6ab8d24..d3d4b78 100644
--- a/Makefile
+++ b/Makefile
@@ -117,7 +117,7 @@ valgrind: all
 
 .PHONY: clean
 clean:
-	$(RM) $(BIN) *~ *.o seccomp.h pasta.1 \
+	$(RM) $(BIN) *~ *.o seccomp.h seccomp_repair.h pasta.1 \
 		passt.tar passt.tar.gz *.deb *.rpm \
 		passt.pid README.plain.md
 
diff --git a/passt-repair.c b/passt-repair.c
index 767a821..dd8578f 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -46,7 +46,7 @@
  *
  * Return: 0 on success (EOF), 1 on error, 2 on usage error
  *
- * #syscalls:repair connect setsockopt write exit_group
+ * #syscalls:repair connect setsockopt write close exit_group
  * #syscalls:repair socket s390x:socketcall i686:socketcall
  * #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
  * #syscalls:repair sendto sendmsg arm:send ppc64le:send

From 745c163e60b0e5da7bf6013645d79b4bdbf3e848 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 4 Feb 2025 16:42:15 +1100
Subject: [PATCH 043/221] tcp: Simplify handling of getsockname()

For migration we need to get the specific local address and port for
connected sockets with getsockname().  We currently open code marshalling
the results into the flow entry.

However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it.  Also report failures, which may make debugging
problems easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/tcp.c b/tcp.c
index fac322c..af6bd95 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1535,24 +1535,12 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	tcp_epoll_ctl(c, conn);
 
 	if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
-		if (af == AF_INET) {
-			struct sockaddr_in s_in;
-
-			sl = sizeof(s_in);
-			if (!getsockname(s, (struct sockaddr *)&s_in, &sl)) {
-				/* NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) */
-				tgt->oport = ntohs(s_in.sin_port);
-				tgt->oaddr = inany_from_v4(s_in.sin_addr);
-			}
+		sl = sizeof(sa);
+		if (!getsockname(s, &sa.sa, &sl)) {
+			inany_from_sockaddr(&tgt->oaddr, &tgt->oport, &sa);
 		} else {
-			struct sockaddr_in6 s_in6;
-
-			sl = sizeof(s_in6);
-			if (!getsockname(s, (struct sockaddr *)&s_in6, &sl)) {
-				/* NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) */
-				tgt->oport = ntohs(s_in6.sin6_port);
-				tgt->oaddr.a6 = s_in6.sin6_addr;
-			}
+			err("Failed to get local address for socket: %s",
+			    strerror_(errno));
 		}
 	}
 

From d0006fa784a7de881db187756770d2492c75df5d Mon Sep 17 00:00:00 2001
From: Paul Holzinger <pholzing@redhat.com>
Date: Wed, 5 Feb 2025 14:00:41 +0100
Subject: [PATCH 044/221] treewide: use _exit() over exit()

In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000

Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
  do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
	sleep 100 && \
  sleep 10 && \
  podman rm -fa -t0
done

And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
  #0  0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
  #1  0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
  #2  0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
  #3  0x000055f31b78c50b n/a (n/a + 0x0)
  #4  0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
  #5  0x000055f31b78d5a2 n/a (n/a + 0x0)

Pasta got killed in exit(), it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.

Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.

Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c       | 8 ++++----
 log.h        | 4 ++--
 passt.c      | 8 ++++----
 pasta.c      | 8 ++++----
 tap.c        | 2 +-
 util.c       | 8 ++++----
 vhost_user.c | 2 +-
 7 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/conf.c b/conf.c
index df2b016..6817377 100644
--- a/conf.c
+++ b/conf.c
@@ -769,7 +769,7 @@ static void conf_ip6_local(struct ip6_ctx *ip6)
  * usage() - Print usage, exit with given status code
  * @name:	Executable name
  * @f:		Stream to print usage info to
- * @status:	Status code for exit()
+ * @status:	Status code for _exit()
  */
 static void usage(const char *name, FILE *f, int status)
 {
@@ -925,7 +925,7 @@ static void usage(const char *name, FILE *f, int status)
 		"    SPEC is as described for TCP above\n"
 		"    default: none\n");
 
-	exit(status);
+	_exit(status);
 
 pasta_opts:
 
@@ -980,7 +980,7 @@ pasta_opts:
 		"  --ns-mac-addr ADDR	Set MAC address on tap interface\n"
 		"  --no-splice		Disable inbound socket splicing\n");
 
-	exit(status);
+	_exit(status);
 }
 
 /**
@@ -1482,7 +1482,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			FPRINTF(stdout,
 				c->mode == MODE_PASTA ? "pasta " : "passt ");
 			FPRINTF(stdout, VERSION_BLOB);
-			exit(EXIT_SUCCESS);
+			_exit(EXIT_SUCCESS);
 		case 15:
 			ret = snprintf(c->ip4.ifname_out,
 				       sizeof(c->ip4.ifname_out), "%s", optarg);
diff --git a/log.h b/log.h
index a30b091..22c7b9a 100644
--- a/log.h
+++ b/log.h
@@ -32,13 +32,13 @@ void logmsg_perror(int pri, const char *format, ...)
 #define die(...)							\
 	do {								\
 		err(__VA_ARGS__);					\
-		exit(EXIT_FAILURE);					\
+		_exit(EXIT_FAILURE);					\
 	} while (0)
 
 #define die_perror(...)							\
 	do {								\
 		err_perror(__VA_ARGS__);				\
-		exit(EXIT_FAILURE);					\
+		_exit(EXIT_FAILURE);					\
 	} while (0)
 
 extern int log_trace;
diff --git a/passt.c b/passt.c
index b1c8ab6..53fdd38 100644
--- a/passt.c
+++ b/passt.c
@@ -167,7 +167,7 @@ void exit_handler(int signal)
 {
 	(void)signal;
 
-	exit(EXIT_SUCCESS);
+	_exit(EXIT_SUCCESS);
 }
 
 /**
@@ -210,7 +210,7 @@ int main(int argc, char **argv)
 	sigaction(SIGQUIT, &sa, NULL);
 
 	if (argc < 1)
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 
 	strncpy(argv0, argv[0], PATH_MAX - 1);
 	name = basename(argv0);
@@ -226,7 +226,7 @@ int main(int argc, char **argv)
 	} else if (strstr(name, "passt")) {
 		c.mode = MODE_PASST;
 	} else {
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 	}
 
 	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
@@ -259,7 +259,7 @@ int main(int argc, char **argv)
 	flow_init();
 
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 
 	proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
 
diff --git a/pasta.c b/pasta.c
index ff41c95..f15084d 100644
--- a/pasta.c
+++ b/pasta.c
@@ -73,12 +73,12 @@ void pasta_child_handler(int signal)
 	    !waitid(P_PID, pasta_child_pid, &infop, WEXITED | WNOHANG)) {
 		if (infop.si_pid == pasta_child_pid) {
 			if (infop.si_code == CLD_EXITED)
-				exit(infop.si_status);
+				_exit(infop.si_status);
 
 			/* If killed by a signal, si_status is the number.
 			 * Follow common shell convention of returning it + 128.
 			 */
-			exit(infop.si_status + 128);
+			_exit(infop.si_status + 128);
 
 			/* Nothing to do, detached PID namespace going away */
 		}
@@ -499,7 +499,7 @@ void pasta_netns_quit_inotify_handler(struct ctx *c, int inotify_fd)
 		return;
 
 	info("Namespace %s is gone, exiting", c->netns_base);
-	exit(EXIT_SUCCESS);
+	_exit(EXIT_SUCCESS);
 }
 
 /**
@@ -525,7 +525,7 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
 			return;
 
 		info("Namespace %s is gone, exiting", c->netns_base);
-		exit(EXIT_SUCCESS);
+		_exit(EXIT_SUCCESS);
 	}
 
 	close(fd);
diff --git a/tap.c b/tap.c
index 772648f..8c92d23 100644
--- a/tap.c
+++ b/tap.c
@@ -1002,7 +1002,7 @@ void tap_sock_reset(struct ctx *c)
 	info("Client connection closed%s", c->one_off ? ", exiting" : "");
 
 	if (c->one_off)
-		exit(EXIT_SUCCESS);
+		_exit(EXIT_SUCCESS);
 
 	/* Close the connected socket, wait for a new connection */
 	epoll_del(c, c->fd_tap);
diff --git a/util.c b/util.c
index 800c6b5..4d51e04 100644
--- a/util.c
+++ b/util.c
@@ -405,7 +405,7 @@ void pidfile_write(int fd, pid_t pid)
 
 	if (write(fd, pid_buf, n) < 0) {
 		perror("PID file write");
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 	}
 
 	close(fd);
@@ -441,12 +441,12 @@ int __daemon(int pidfile_fd, int devnull_fd)
 
 	if (pid == -1) {
 		perror("fork");
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 	}
 
 	if (pid) {
 		pidfile_write(pidfile_fd, pid);
-		exit(EXIT_SUCCESS);
+		_exit(EXIT_SUCCESS);
 	}
 
 	if (setsid()				< 0 ||
@@ -454,7 +454,7 @@ int __daemon(int pidfile_fd, int devnull_fd)
 	    dup2(devnull_fd, STDOUT_FILENO)	< 0 ||
 	    dup2(devnull_fd, STDERR_FILENO)	< 0 ||
 	    close(devnull_fd))
-		exit(EXIT_FAILURE);
+		_exit(EXIT_FAILURE);
 
 	return 0;
 }
diff --git a/vhost_user.c b/vhost_user.c
index 9e38cfd..159f0b3 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -60,7 +60,7 @@ void vu_print_capabilities(void)
 	info("{");
 	info("  \"type\": \"net\"");
 	info("}");
-	exit(EXIT_SUCCESS);
+	_exit(EXIT_SUCCESS);
 }
 
 /**

From a9d63f91a59a4c02cd77af41fa70d82e73f17576 Mon Sep 17 00:00:00 2001
From: Paul Holzinger <pholzing@redhat.com>
Date: Wed, 5 Feb 2025 14:00:42 +0100
Subject: [PATCH 045/221] passt-repair: use _exit() over return

When returning from main it does the same as calling exit() which is not
good as glibc might try to call futex() which will be blocked by
seccomp. See the prevoius commit "treewide: use _exit() over exit()" for
a more detailed explanation.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index dd8578f..6f79423 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -71,7 +71,7 @@ int main(int argc, char **argv)
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
 	    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
 		fprintf(stderr, "Failed to apply seccomp filter");
-		return 1;
+		_exit(1);
 	}
 
 	iov = (struct iovec){ &cmd, sizeof(cmd) };
@@ -80,42 +80,42 @@ int main(int argc, char **argv)
 
 	if (argc != 2) {
 		fprintf(stderr, "Usage: %s PATH\n", argv[0]);
-		return 2;
+		_exit(2);
 	}
 
 	ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
 	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
 		fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
-		return 2;
+		_exit(2);
 	}
 
 	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
 		perror("Failed to create AF_UNIX socket");
-		return 1;
+		_exit(1);
 	}
 
 	if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
 		fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
 			strerror(errno));
-		return 1;
+		_exit(1);
 	}
 
 loop:
 	ret = recvmsg(s, &msg, 0);
 	if (ret < 0) {
 		perror("Failed to receive message");
-		return 1;
+		_exit(1);
 	}
 
 	if (!ret)	/* Done */
-		return 0;
+		_exit(0);
 
 	if (!cmsg ||
 	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
 	    cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
 	    cmsg->cmsg_type != SCM_RIGHTS) {
 		fprintf(stderr, "No/bad ancillary data from peer\n");
-		return 1;
+		_exit(1);
 	}
 
 	n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
@@ -124,7 +124,7 @@ loop:
 	if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
 	    cmd != TCP_REPAIR_OFF_NO_WP) {
 		fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
-		return 1;
+		_exit(1);
 	}
 
 	for (i = 0; i < n; i++) {
@@ -134,7 +134,7 @@ loop:
 			fprintf(stderr,
 				"Setting TCP_REPAIR to %i on socket %i: %s", o,
 				fds[i], strerror(errno));
-			return 1;
+			_exit(1);
 		}
 
 		/* Close _our_ copy */
@@ -144,11 +144,11 @@ loop:
 		if (send(s, &cmd, sizeof(cmd), 0) < 0) {
 			fprintf(stderr, "Reply to command %i: %s\n",
 				o, strerror(errno));
-			return 1;
+			_exit(1);
 		}
 	}
 
 	goto loop;
 
-	return 0;
+	_exit(0);
 }

From 9215f68a0c2ad274b73862bc865fbdbb464e182a Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 5 Feb 2025 16:57:55 +0100
Subject: [PATCH 046/221] passt-repair: Build fixes for musl

When building against musl headers:

- sizeof() needs stddef.h, as it should be;

- we can't initialise a struct msghdr by simply listing fields in
  order, as they contain explicit padding fields. Use field names
  instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 6f79423..3c3247b 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -21,6 +21,7 @@
 #include <sys/socket.h>
 #include <sys/un.h>
 #include <errno.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -75,7 +76,11 @@ int main(int argc, char **argv)
 	}
 
 	iov = (struct iovec){ &cmd, sizeof(cmd) };
-	msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
+			       .msg_iov = &iov, .msg_iovlen = 1,
+			       .msg_control = buf,
+			       .msg_controllen = sizeof(buf),
+			       .msg_flags = 0 };
 	cmsg = CMSG_FIRSTHDR(&msg);
 
 	if (argc != 2) {

From 593be3277429f0a2c06f6bebab4f20736c96abc8 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 5 Feb 2025 17:02:27 +0100
Subject: [PATCH 047/221] passt-repair.1: Fix indication of TCP_REPAIR
 constants

...perhaps I should adopt the healthy habit of actually reading
headers instead of using my mental copy.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.1 b/passt-repair.1
index 8d07c97..7c1b140 100644
--- a/passt-repair.1
+++ b/passt-repair.1
@@ -31,7 +31,7 @@ same as the UNIX domain socket used for guest communication, suffixed by
 \fI.repair\fR.
 
 The messages consist of one 8-bit signed integer that can be \fITCP_REPAIR_ON\fR
-(1), \fITCP_REPAIR_OFF\fR (2), or \fITCP_REPAIR_OFF_WP\fR (-1), as defined by
+(1), \fITCP_REPAIR_OFF\fR (0), or \fITCP_REPAIR_OFF_NO_WP\fR (-1), as defined by
 the Linux kernel user API, and one to SCM_MAX_FD (253) sockets as SCM_RIGHTS
 (see \fBunix\fR(7)) ancillary message, sent by the server, \fBpasst\fR(1).
 

From f66769c2de82550ac1ee2548960c09a4b052341f Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 5 Feb 2025 17:21:59 +0100
Subject: [PATCH 048/221] apparmor: Workaround for unconfined libvirtd when
 triggered by unprivileged user

If libvirtd is triggered by an unprivileged user, the virt-aa-helper
mechanism doesn't work, because per-VM profiles can't be instantiated,
and as a result libvirtd runs unconfined.

This means passt can't start, because the passt subprofile from
libvirt's profile is not loaded either.

Example:

  $ virsh start alpine
  error: Failed to start domain 'alpine'
  error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11

Add an annoying workaround for the moment being. Much better than
encouraging users to start guests as root, or to disable AppArmor
altogether.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/apparmor/usr.bin.passt | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/contrib/apparmor/usr.bin.passt b/contrib/apparmor/usr.bin.passt
index 9568189..62a4514 100644
--- a/contrib/apparmor/usr.bin.passt
+++ b/contrib/apparmor/usr.bin.passt
@@ -27,4 +27,25 @@ profile passt /usr/bin/passt{,.avx2} {
 
   owner @{HOME}/**			w,	# pcap(), pidfile_open(),
 						# pidfile_write()
+
+  # Workaround: libvirt's profile comes with a passt subprofile which includes,
+  # in turn, <abstractions/passt>, and adds libvirt-specific rules on top, to
+  # allow passt (when started by libvirtd) to write socket and PID files in the
+  # location requested by libvirtd itself, and to execute passt itself.
+  #
+  # However, when libvirt runs as unprivileged user, the mechanism based on
+  # virt-aa-helper, designed to build per-VM profiles as guests are started,
+  # doesn't work. The helper needs to create and load profiles on the fly, which
+  # can't be done by unprivileged users, of course.
+  #
+  # As a result, libvirtd runs unconfined if guests are started by unprivileged
+  # users, starting passt unconfined as well, which means that passt runs under
+  # its own stand-alone profile (this one), which implies in turn that execve()
+  # of /usr/bin/passt is not allowed, and socket and PID files can't be written.
+  #
+  # Duplicate libvirt-specific rules here as long as this is not solved in
+  # libvirt's profile itself.
+  /usr/bin/passt r,
+  owner @{run}/user/[0-9]*/libvirt/qemu/run/passt/* rw,
+  owner @{run}/libvirt/qemu/passt/* rw,
 }

From 0da87b393b63747526d162c728987f320b41771e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 6 Feb 2025 16:49:42 +1100
Subject: [PATCH 049/221] debug: Add tcpdump to mbuto.img

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/passt.mbuto | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/test/passt.mbuto b/test/passt.mbuto
index 138d365..d4d57cb 100755
--- a/test/passt.mbuto
+++ b/test/passt.mbuto
@@ -13,7 +13,7 @@
 PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
        modprobe find grep mknod mv rm umount jq iperf3 dhclient hostname
        sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
-       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp}"
+       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump}"
 
 # OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
 # sshd-session the per-session program. We need the latter as well, and the path
@@ -65,6 +65,7 @@ EOF
 	# sshd via vsock
 	cat > /etc/passwd << EOF
 root:x:0:0:root:/root:/bin/sh
+tcpdump:x:72:72:tcpdump:/:/sbin/nologin
 sshd:x:100:100:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
 EOF
 	cat > /etc/shadow << EOF

From a5cca995dee9b4196d41c86034a4948d346266ca Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 6 Feb 2025 09:33:05 +0100
Subject: [PATCH 050/221] conf, passt.1: Un-deprecate --host-lo-to-ns-lo

It was established behaviour, and it's now the third report about it:
users ask how to achieve the same functionality, and we don't have a
better answer yet.

The idea behind declaring it deprecated to start with, I guess, was
that we would eventually replace it by more flexible and generic
configuration options, which is still planned. But there's nothing
preventing us to alias this in the future to a particular
configuration.

So, stop scaring users off, and un-deprecate this.

Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/
Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705
Link: https://github.com/moby/moby/issues/48838
Link: https://github.com/containers/podman/discussions/25243
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c  | 3 +--
 passt.1 | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/conf.c b/conf.c
index 6817377..f5d04db 100644
--- a/conf.c
+++ b/conf.c
@@ -963,8 +963,7 @@ pasta_opts:
 		"  -U, --udp-ns SPEC	UDP port forwarding to init namespace\n"
 		"    SPEC is as described above\n"
 		"    default: auto\n"
-		"  --host-lo-to-ns-lo	DEPRECATED:\n"
-		"			Translate host-loopback forwards to\n"
+		"  --host-lo-to-ns-lo	Translate host-loopback forwards to\n"
 		"			namespace loopback\n"
 		"  --userns NSPATH 	Target user namespace to join\n"
 		"  --netns PATH|NAME	Target network namespace to join\n"
diff --git a/passt.1 b/passt.1
index d9cd33e..2928af5 100644
--- a/passt.1
+++ b/passt.1
@@ -622,7 +622,7 @@ Configure UDP port forwarding from target namespace to init namespace.
 Default is \fBauto\fR.
 
 .TP
-.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
+.BR \-\-host-lo-to-ns-lo
 If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
 the host's loopback address will appear on the loopback address in the
 guest as well.  Without this option such forwarded packets will appear

From a0b7f56b3a3c220b3d8065d7cfdd83a6e3919467 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 7 Feb 2025 01:51:38 +0100
Subject: [PATCH 051/221] passt-repair: Don't use perror(), accept ECONNRESET
 as termination

If we use glibc's perror(), we need to allow dup() and fcntl() in our
seccomp profiles, which are a bit too much for this simple helper. On
top of that, we would probably need a wrapper to avoid allocation for
translated messages.

While at it: ECONNRESET is just a close() from passt, treat it like
EOF.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index 3c3247b..d137a18 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -95,7 +95,7 @@ int main(int argc, char **argv)
 	}
 
 	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
-		perror("Failed to create AF_UNIX socket");
+		fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
 		_exit(1);
 	}
 
@@ -108,8 +108,12 @@ int main(int argc, char **argv)
 loop:
 	ret = recvmsg(s, &msg, 0);
 	if (ret < 0) {
-		perror("Failed to receive message");
-		_exit(1);
+		if (errno == ECONNRESET) {
+			ret = 0;
+		} else {
+			fprintf(stderr, "Failed to read message: %i\n", errno);
+			_exit(1);
+		}
 	}
 
 	if (!ret)	/* Done */

From 0f009ea598707c5978846387d716f4a612d07b36 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 7 Feb 2025 01:55:08 +0100
Subject: [PATCH 052/221] passt-repair: Fix calculation of payload length from
 cmsg_len

There's no inverse function for CMSG_LEN(), so we need to loop over
SCM_MAX_FD (253) possible input values. The previous calculation is
clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg
data.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index d137a18..5ad5c9c 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -57,7 +57,7 @@ int main(int argc, char **argv)
 	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
 	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
 	struct sockaddr_un a = { AF_UNIX, "" };
-	int fds[SCM_MAX_FD], s, ret, i, n;
+	int fds[SCM_MAX_FD], s, ret, i, n = 0;
 	struct sock_fprog prog;
 	int8_t cmd = INT8_MAX;
 	struct cmsghdr *cmsg;
@@ -127,7 +127,21 @@ loop:
 		_exit(1);
 	}
 
-	n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+	/* No inverse formula for CMSG_LEN(x), and building one with CMSG_LEN(0)
+	 * works but there's no guarantee it does. Search the whole domain.
+	 */
+	for (i = 1; i < SCM_MAX_FD; i++) {
+		if (CMSG_LEN(sizeof(int) * i) == cmsg->cmsg_len) {
+			n = i;
+			break;
+		}
+	}
+	if (!n) {
+		fprintf(stderr, "Invalid ancillary data length %zu from peer\n",
+			cmsg->cmsg_len);
+		_exit(1);
+	}
+
 	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
 
 	if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&

From b7b70ba24369891d79079d247f246c1e357948d2 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 7 Feb 2025 01:58:00 +0100
Subject: [PATCH 053/221] passt-repair: Dodge "structurally unreachable code"
 warning from Coverity

While main() conventionally returns int, and we need a return at the
end of the function to avoid compiler warnings, turning that return
into _exit() to avoid exit handlers triggers a Coverity warning. It's
unreachable code anyway, so switch that single occurence back to a
plain return.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 5ad5c9c..322066a 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -173,5 +173,5 @@ loop:
 
 	goto loop;
 
-	_exit(0);
+	return 0;
 }

From fe8b6a7c42625ee1fc63186204d32458b1ba31b9 Mon Sep 17 00:00:00 2001
From: Enrique Llorente <ellorent@redhat.com>
Date: Tue, 4 Feb 2025 10:43:37 +0100
Subject: [PATCH 054/221] dhcp: Don't re-use request message for reply

The logic composing the DHCP reply message is reusing the request
message to compose it, future long options like FQDN may
exceed the request message limit making it go beyond the lower
bound.

This change creates a new reply message with a fixed options size of 308
and fills it in with proper fields from requests adding on top the generated
options, this way the reply lower bound does not depend on the request.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 dhcp.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/dhcp.c b/dhcp.c
index d8515aa..2a23ed4 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -151,9 +151,6 @@ static int fill(struct msg *m)
 {
 	int i, o, offset = 0;
 
-	m->op = BOOTREPLY;
-	m->secs = 0;
-
 	for (o = 0; o < 255; o++)
 		opts[o].sent = 0;
 
@@ -291,8 +288,9 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	const struct ethhdr *eh;
 	const struct iphdr *iph;
 	const struct udphdr *uh;
+	struct msg const *m;
+	struct msg reply;
 	unsigned int i;
-	struct msg *m;
 
 	eh  = packet_get(p, 0, offset, sizeof(*eh),  NULL);
 	offset += sizeof(*eh);
@@ -321,6 +319,22 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	    m->op != BOOTREQUEST)
 		return -1;
 
+	reply.op		= BOOTREPLY;
+	reply.htype		= m->htype;
+	reply.hlen		= m->hlen;
+	reply.hops		= 0;
+	reply.xid		= m->xid;
+	reply.secs		= 0;
+	reply.flags		= m->flags;
+	reply.ciaddr		= m->ciaddr;
+	reply.yiaddr		= c->ip4.addr;
+	reply.siaddr		= 0;
+	reply.giaddr		= m->giaddr;
+	memcpy(&reply.chaddr,	m->chaddr,	sizeof(reply.chaddr));
+	memset(&reply.sname,	0,		sizeof(reply.sname));
+	memset(&reply.file,	0,		sizeof(reply.file));
+	reply.magic		= m->magic;
+
 	offset += offsetof(struct msg, o);
 
 	for (i = 0; i < ARRAY_SIZE(opts); i++)
@@ -364,7 +378,6 @@ int dhcp(const struct ctx *c, const struct pool *p)
 
 	info("    from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));
 
-	m->yiaddr = c->ip4.addr;
 	mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
 	memcpy(opts[1].s,  &mask,                sizeof(mask));
 	memcpy(opts[3].s,  &c->ip4.guest_gw,     sizeof(c->ip4.guest_gw));
@@ -401,14 +414,14 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	if (!c->no_dhcp_dns_search)
 		opt_set_dns_search(c, sizeof(m->o));
 
-	dlen = offsetof(struct msg, o) + fill(m);
+	dlen = offsetof(struct msg, o) + fill(&reply);
 
 	if (m->flags & FLAG_BROADCAST)
 		dst = in4addr_broadcast;
 	else
 		dst = c->ip4.addr;
 
-	tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, m, dlen);
+	tap_udp4_send(c, c->ip4.our_tap_addr, 67, dst, 68, &reply, dlen);
 
 	return 1;
 }

From 864be475d9db58c93540eb883ecf656c3eff861f Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 7 Feb 2025 08:59:57 +0100
Subject: [PATCH 055/221] passt-repair: Send one confirmation *per command*,
 not *per socket*

It looks like me, myself and I couldn't agree on the "simple" protocol
between passt and passt-repair. The man page and passt say it's one
confirmation per command, but the passt-repair implementation had one
confirmation per socket instead.

This caused all sort of mysterious issues with repair mode
pseudo-randomly enabled, and leading to hours of fun (mostly not
mine). Oops.

Switch to one confirmation per command (of course).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index 322066a..614cee0 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -63,6 +63,7 @@ int main(int argc, char **argv)
 	struct cmsghdr *cmsg;
 	struct msghdr msg;
 	struct iovec iov;
+	int op;
 
 	prctl(PR_SET_DUMPABLE, 0);
 
@@ -150,25 +151,24 @@ loop:
 		_exit(1);
 	}
 
-	for (i = 0; i < n; i++) {
-		int o = cmd;
+	op = cmd;
 
-		if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) {
+	for (i = 0; i < n; i++) {
+		if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &op, sizeof(op))) {
 			fprintf(stderr,
-				"Setting TCP_REPAIR to %i on socket %i: %s", o,
+				"Setting TCP_REPAIR to %i on socket %i: %s", op,
 				fds[i], strerror(errno));
 			_exit(1);
 		}
 
 		/* Close _our_ copy */
 		close(fds[i]);
+	}
 
-		/* Confirm setting by echoing the command back */
-		if (send(s, &cmd, sizeof(cmd), 0) < 0) {
-			fprintf(stderr, "Reply to command %i: %s\n",
-				o, strerror(errno));
-			_exit(1);
-		}
+	/* Confirm setting by echoing the command back */
+	if (send(s, &cmd, sizeof(cmd), 0) < 0) {
+		fprintf(stderr, "Reply to %i: %s\n", op, strerror(errno));
+		_exit(1);
 	}
 
 	goto loop;

From a3d142a6f64d89fffe26634e158dedd55fa31e7b Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon, 3 Feb 2025 09:22:10 +0100
Subject: [PATCH 056/221] conf: Don't map DNS traffic to host, if host gateway
 is a resolver

This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.

By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:

  0.0029:     NAT to host 127.0.0.1: 192.168.100.1

But if the host gateway is also a DNS resolver:

  0.0029: DNS:
  0.0029:     192.168.100.1

then we'll send DNS queries directed to it to the host instead:

  0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
  0.0372: Flow 0 (TGT): INI -> TGT
  0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): TGT -> TYPED
  0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
  0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
  0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
  0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53

which doesn't quite work, of course:

  0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
  0.0374: ICMP error on UDP socket 95: Connection refused

unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.

Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.

Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 16 ++++++++++------
 passt.1 | 14 ++++++++++----
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/conf.c b/conf.c
index f5d04db..142dc94 100644
--- a/conf.c
+++ b/conf.c
@@ -426,10 +426,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
 			c->ip4.dns_host = ns4;
 
-		/* Guest or container can only access local addresses via
-		 * redirect
+		/* Special handling if guest or container can only access local
+		 * addresses via redirect, or if the host gateway is also a
+		 * resolver and we shadow its address
 		 */
-		if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
+		if (IN4_IS_ADDR_LOOPBACK(&ns4) ||
+		    IN4_ARE_ADDR_EQUAL(&ns4, &c->ip4.map_host_loopback)) {
 			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
 				return;
 
@@ -445,10 +447,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
 			c->ip6.dns_host = ns6;
 
-		/* Guest or container can only access local addresses via
-		 * redirect
+		/* Special handling if guest or container can only access local
+		 * addresses via redirect, or if the host gateway is also a
+		 * resolver and we shadow its address
 		 */
-		if (IN6_IS_ADDR_LOOPBACK(&ns6)) {
+		if (IN6_IS_ADDR_LOOPBACK(&ns6) ||
+		    IN6_ARE_ADDR_EQUAL(&ns6, &c->ip6.map_host_loopback)) {
 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
 				return;
 
diff --git a/passt.1 b/passt.1
index 2928af5..29cc3ed 100644
--- a/passt.1
+++ b/passt.1
@@ -941,10 +941,16 @@ with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
 the last observed source address from guest or namespace is 192.0.2.2, this will
 be translated to a connection from 192.0.2.1 to 192.0.2.2.
 
-Similarly, for traffic coming from guest or namespace, packets with
-destination address corresponding to the \fB\-\-map-host-loopback\fR
-address will have their destination address translated to a loopback
-address.
+Similarly, for traffic coming from guest or namespace, packets with destination
+address corresponding to the \fB\-\-map-host-loopback\fR address will have their
+destination address translated to a loopback address.
+
+As an exception, traffic identified as DNS, originally directed to the
+\fB\-\-map-host-loopback\fR address, if this address matches a resolver address
+on the host, is \fBnot\fR translated to loopback, but rather handled in the same
+way as if specified as \-\-dns-forward address, if no such option was given.
+In the common case where the host gateway also acts a resolver, this avoids that
+the host mapping shadows the gateway/resolver itself.
 
 .SS Handling of local traffic in pasta
 

From 31e8109a86eeebb473ffba8124a3f399cf0aeccf Mon Sep 17 00:00:00 2001
From: Enrique Llorente <ellorent@redhat.com>
Date: Fri, 7 Feb 2025 12:36:55 +0100
Subject: [PATCH 057/221] dhcp, dhcpv6: Add hostname and client fqdn ops

Both DHCPv4 and DHCPv6 has the capability to pass the hostname to
clients, the DHCPv4 uses option 12 (hostname) while the DHCPv6 uses option 39
(client fqdn), for some virt deployments like kubevirt is expected to
have the VirtualMachine name as the guest hostname.

This change add the following arguments:
 - -H --hostname NAME to configure the hostname DHCPv4 option(12)
 - --fqdn NAME to configure client fqdn option for both DHCPv4(81) and
   DHCPv6(39)

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c           | 20 ++++++++--
 dhcp.c           | 61 +++++++++++++++++++++++++----
 dhcpv6.c         | 99 ++++++++++++++++++++++++++++++++++++++++--------
 passt.1          | 10 +++++
 passt.h          |  5 +++
 pasta.c          | 17 +++++++--
 test/lib/setup   | 10 ++---
 test/passt.mbuto |  6 ++-
 test/passt/dhcp  | 15 +++++++-
 util.c           | 24 ++++++++++++
 util.h           |  6 +++
 11 files changed, 235 insertions(+), 38 deletions(-)

diff --git a/conf.c b/conf.c
index 142dc94..d9de07c 100644
--- a/conf.c
+++ b/conf.c
@@ -858,7 +858,9 @@ static void usage(const char *name, FILE *f, int status)
 		FPRINTF(f, "    default: use addresses from /etc/resolv.conf\n");
 	FPRINTF(f,
 		"  -S, --search LIST	Space-separated list, search domains\n"
-		"    a single, empty option disables the DNS search list\n");
+		"    a single, empty option disables the DNS search list\n"
+		"  -H, --hostname NAME 	Hostname to configure client with\n"
+		"  --fqdn NAME		FQDN to configure client with\n");
 	if (strstr(name, "pasta"))
 		FPRINTF(f, "    default: don't use any search list\n");
 	else
@@ -1316,6 +1318,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"outbound",	required_argument,	NULL,		'o' },
 		{"dns",		required_argument,	NULL,		'D' },
 		{"search",	required_argument,	NULL,		'S' },
+		{"hostname",	required_argument,	NULL,		'H' },
 		{"no-tcp",	no_argument,		&c->no_tcp,	1 },
 		{"no-udp",	no_argument,		&c->no_udp,	1 },
 		{"no-icmp",	no_argument,		&c->no_icmp,	1 },
@@ -1360,6 +1363,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		/* vhost-user backend program convention */
 		{"print-capabilities", no_argument,	NULL,		26 },
 		{"socket-path",	required_argument,	NULL,		's' },
+		{"fqdn",	required_argument,	NULL,		27 },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1382,9 +1386,9 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->mode == MODE_PASTA) {
 		c->no_dhcp_dns = c->no_dhcp_dns_search = 1;
 		fwd_default = FWD_AUTO;
-		optstring = "+dqfel:hF:I:p:P:m:a:n:M:g:i:o:D:S:46t:u:T:U:";
+		optstring = "+dqfel:hF:I:p:P:m:a:n:M:g:i:o:D:S:H:46t:u:T:U:";
 	} else {
-		optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:461t:u:";
+		optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:";
 	}
 
 	c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
@@ -1561,6 +1565,11 @@ void conf(struct ctx *c, int argc, char **argv)
 		case 26:
 			vu_print_capabilities();
 			break;
+		case 27:
+			if (snprintf_check(c->fqdn, PASST_MAXDNAME,
+					   "%s", optarg))
+				die("Invalid FQDN: %s", optarg);
+			break;
 		case 'd':
 			c->debug = 1;
 			c->quiet = 0;
@@ -1730,6 +1739,11 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			die("Cannot use DNS search domain %s", optarg);
 			break;
+		case 'H':
+			if (snprintf_check(c->hostname, PASST_MAXDNAME,
+					   "%s", optarg))
+				die("Invalid hostname: %s", optarg);
+			break;
 		case '4':
 			v4_only = true;
 			v6_only = false;
diff --git a/dhcp.c b/dhcp.c
index 2a23ed4..401cb5b 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -63,6 +63,11 @@ static struct opt opts[255];
 
 #define OPT_MIN		60 /* RFC 951 */
 
+/* Total option size (excluding end option) is 576 (RFC 2131), minus
+ * offset of options (268), minus end option and its length (2).
+ */
+#define OPT_MAX		306
+
 /**
  * dhcp_init() - Initialise DHCP options
  */
@@ -122,7 +127,7 @@ struct msg {
 	uint8_t sname[64];
 	uint8_t file[128];
 	uint32_t magic;
-	uint8_t o[308];
+	uint8_t o[OPT_MAX + 2 /* End option and its length */ ];
 } __attribute__((__packed__));
 
 /**
@@ -130,15 +135,28 @@ struct msg {
  * @m:		Message to fill
  * @o:		Option number
  * @offset:	Current offset within options field, updated on insertion
+ *
+ * Return: false if m has space to write the option, true otherwise
  */
-static void fill_one(struct msg *m, int o, int *offset)
+static bool fill_one(struct msg *m, int o, int *offset)
 {
+	size_t slen = opts[o].slen;
+
+	/* If we don't have space to write the option, then just skip */
+	if (*offset + 1 /* length of option */ + slen > OPT_MAX)
+		return true;
+
 	m->o[*offset] = o;
-	m->o[*offset + 1] = opts[o].slen;
-	memcpy(&m->o[*offset + 2], opts[o].s, opts[o].slen);
+	m->o[*offset + 1] = slen;
+
+	/* Move to option */
+	*offset += 2;
+
+	memcpy(&m->o[*offset], opts[o].s, slen);
 
 	opts[o].sent = 1;
-	*offset += 2 + opts[o].slen;
+	*offset += slen;
+	return false;
 }
 
 /**
@@ -159,17 +177,20 @@ static int fill(struct msg *m)
 	 * Put it there explicitly, unless requested via option 55.
 	 */
 	if (opts[55].clen > 0 && !memchr(opts[55].c, 53, opts[55].clen))
-		fill_one(m, 53, &offset);
+		if (fill_one(m, 53, &offset))
+			 debug("DHCP: skipping option 53");
 
 	for (i = 0; i < opts[55].clen; i++) {
 		o = opts[55].c[i];
 		if (opts[o].slen != -1)
-			fill_one(m, o, &offset);
+			if (fill_one(m, o, &offset))
+				debug("DHCP: skipping option %i", o);
 	}
 
 	for (o = 0; o < 255; o++) {
 		if (opts[o].slen != -1 && !opts[o].sent)
-			fill_one(m, o, &offset);
+			if (fill_one(m, o, &offset))
+				debug("DHCP: skipping option %i", o);
 	}
 
 	m->o[offset++] = 255;
@@ -411,6 +432,30 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	if (!opts[6].slen)
 		opts[6].slen = -1;
 
+	opt_len = strlen(c->hostname);
+	if (opt_len > 0) {
+		opts[12].slen = opt_len;
+		memcpy(opts[12].s, &c->hostname, opt_len);
+	}
+
+	opt_len = strlen(c->fqdn);
+	if (opt_len > 0) {
+		opt_len += 3 /* flags */
+			+ 2; /* Length byte for first label, and terminator */
+
+		if (sizeof(opts[81].s) >= opt_len) {
+			opts[81].s[0] = 0x4; /* flags (E) */
+			opts[81].s[1] = 0xff; /* RCODE1 */
+			opts[81].s[2] = 0xff; /* RCODE2 */
+
+			encode_domain_name((char *)opts[81].s + 3, c->fqdn);
+
+			opts[81].slen = opt_len;
+		} else {
+			debug("DHCP: client FQDN option doesn't fit, skipping");
+		}
+	}
+
 	if (!c->no_dhcp_dns_search)
 		opt_set_dns_search(c, sizeof(m->o));
 
diff --git a/dhcpv6.c b/dhcpv6.c
index 0523bba..373a988 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -48,6 +48,7 @@ struct opt_hdr {
 # define  STATUS_NOTONLINK	htons_constant(4)
 # define OPT_DNS_SERVERS	htons_constant(23)
 # define OPT_DNS_SEARCH		htons_constant(24)
+# define OPT_CLIENT_FQDN	htons_constant(39)
 #define   STR_NOTONLINK		"Prefix not appropriate for link."
 
 	uint16_t l;
@@ -58,6 +59,9 @@ struct opt_hdr {
 					      sizeof(struct opt_hdr))
 #define OPT_VSIZE(x)		(sizeof(struct opt_##x) - 		\
 				 sizeof(struct opt_hdr))
+#define OPT_MAX_SIZE		IPV6_MIN_MTU - (sizeof(struct ipv6hdr) + \
+						sizeof(struct udphdr) + \
+						sizeof(struct msg_hdr))
 
 /**
  * struct opt_client_id - DHCPv6 Client Identifier option
@@ -163,6 +167,18 @@ struct opt_dns_search {
 	char list[MAXDNSRCH * NS_MAXDNAME];
 } __attribute__((packed));
 
+/**
+ * struct opt_client_fqdn - Client FQDN option (RFC 4704)
+ * @hdr:		Option header
+ * @flags:		Flags described by RFC 4704
+ * @domain_name:	Client FQDN
+ */
+struct opt_client_fqdn {
+	struct opt_hdr hdr;
+	uint8_t flags;
+	char domain_name[PASST_MAXDNAME];
+} __attribute__((packed));
+
 /**
  * struct msg_hdr - DHCPv6 client/server message header
  * @type:		DHCP message type
@@ -193,6 +209,7 @@ struct msg_hdr {
  * @client_id:		Client Identifier, variable length
  * @dns_servers:	DNS Recursive Name Server, here just for storage size
  * @dns_search:		Domain Search List, here just for storage size
+ * @client_fqdn:	Client FQDN, variable length
  */
 static struct resp_t {
 	struct msg_hdr hdr;
@@ -203,6 +220,7 @@ static struct resp_t {
 	struct opt_client_id client_id;
 	struct opt_dns_servers dns_servers;
 	struct opt_dns_search dns_search;
+	struct opt_client_fqdn client_fqdn;
 } __attribute__((__packed__)) resp = {
 	{ 0 },
 	SERVER_ID,
@@ -228,6 +246,10 @@ static struct resp_t {
 	{ { OPT_DNS_SEARCH,	0, },
 	  { 0 },
 	},
+
+	{ { OPT_CLIENT_FQDN, 0, },
+	  0, { 0 },
+	},
 };
 
 static const struct opt_status_code sc_not_on_link = {
@@ -346,7 +368,6 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
 {
 	struct opt_dns_servers *srv = NULL;
 	struct opt_dns_search *srch = NULL;
-	char *p = NULL;
 	int i;
 
 	if (c->no_dhcp_dns)
@@ -383,34 +404,81 @@ search:
 		if (!name_len)
 			continue;
 
+		name_len += 2; /* Length byte for first label, and terminator */
+		if (name_len >
+		    NS_MAXDNAME + 1 /* Length byte for first label */ ||
+		    name_len > 255) {
+			debug("DHCP: DNS search name '%s' too long, skipping",
+			      c->dns_search[i].n);
+			continue;
+		}
+
 		if (!srch) {
 			srch = (struct opt_dns_search *)(buf + offset);
 			offset += sizeof(struct opt_hdr);
 			srch->hdr.t = OPT_DNS_SEARCH;
 			srch->hdr.l = 0;
-			p = srch->list;
 		}
 
-		*p = '.';
-		p = stpncpy(p + 1, c->dns_search[i].n, name_len);
-		p++;
-		srch->hdr.l += name_len + 2;
-		offset += name_len + 2;
+		encode_domain_name(buf + offset, c->dns_search[i].n);
+
+		srch->hdr.l += name_len;
+		offset += name_len;
+
 	}
 
-	if (srch) {
-		for (i = 0; i < srch->hdr.l; i++) {
-			if (srch->list[i] == '.') {
-				srch->list[i] = strcspn(srch->list + i + 1,
-							".");
-			}
-		}
+	if (srch)
 		srch->hdr.l = htons(srch->hdr.l);
-	}
 
 	return offset;
 }
 
+/**
+ * dhcpv6_client_fqdn_fill() - Fill in client FQDN option
+ * @c:		Execution context
+ * @buf:	Response message buffer where options will be appended
+ * @offset:	Offset in message buffer for new options
+ *
+ * Return: updated length of response message buffer.
+ */
+static size_t dhcpv6_client_fqdn_fill(const struct pool *p, const struct ctx *c,
+				      char *buf, int offset)
+
+{
+	struct opt_client_fqdn const *req_opt;
+	struct opt_client_fqdn *o;
+	size_t opt_len;
+
+	opt_len = strlen(c->fqdn);
+	if (opt_len == 0) {
+		return offset;
+	}
+
+	opt_len += 2; /* Length byte for first label, and terminator */
+	if (opt_len > OPT_MAX_SIZE - (offset +
+				      sizeof(struct opt_hdr) +
+				      1 /* flags */ )) {
+		debug("DHCPv6: client FQDN option doesn't fit, skipping");
+		return offset;
+	}
+
+	o = (struct opt_client_fqdn *)(buf + offset);
+	encode_domain_name(o->domain_name, c->fqdn);
+	req_opt = (struct opt_client_fqdn *)dhcpv6_opt(p, &(size_t){ 0 },
+						       OPT_CLIENT_FQDN);
+	if (req_opt && req_opt->flags & 0x01 /* S flag */)
+		o->flags = 0x02 /* O flag */;
+	else
+		o->flags = 0x00;
+
+	opt_len++;
+
+	o->hdr.t = OPT_CLIENT_FQDN;
+	o->hdr.l = htons(opt_len);
+
+	return offset + sizeof(struct opt_hdr) + opt_len;
+}
+
 /**
  * dhcpv6() - Check if this is a DHCPv6 message, reply as needed
  * @c:		Execution context
@@ -544,6 +612,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
 	n = offsetof(struct resp_t, client_id) +
 	    sizeof(struct opt_hdr) + ntohs(client_id->l);
 	n = dhcpv6_dns_fill(c, (char *)&resp, n);
+	n = dhcpv6_client_fqdn_fill(p, c, (char *)&resp, n);
 
 	resp.hdr.xid = mh->xid;
 
diff --git a/passt.1 b/passt.1
index 29cc3ed..9d347d8 100644
--- a/passt.1
+++ b/passt.1
@@ -401,6 +401,16 @@ Enable IPv6-only operation. IPv4 traffic will be ignored.
 By default, IPv4 operation is enabled as long as at least an IPv4 route and an
 interface address are configured on a given host interface.
 
+.TP
+.BR \-H ", " \-\-hostname " " \fIname
+Hostname to configure the client with.
+Send \fIname\fR as DHCP option 12 (hostname).
+
+.TP
+.BR \-\-fqdn " " \fIname
+FQDN to configure the client with.
+Send \fIname\fR as Client FQDN: DHCP option 81 and DHCPv6 option 39.
+
 .SS \fBpasst\fR-only options
 
 .TP
diff --git a/passt.h b/passt.h
index 0dd4efa..f3151f0 100644
--- a/passt.h
+++ b/passt.h
@@ -209,6 +209,8 @@ struct ip6_ctx {
  * @ifi4:		Template interface for IPv4, -1: none, 0: IPv4 disabled
  * @ip:			IPv4 configuration
  * @dns_search:		DNS search list
+ * @hostname:		Guest hostname
+ * @fqdn:		Guest FQDN
  * @ifi6:		Template interface for IPv6, -1: none, 0: IPv6 disabled
  * @ip6:		IPv6 configuration
  * @pasta_ifn:		Name of namespace interface for pasta
@@ -269,6 +271,9 @@ struct ctx {
 
 	struct fqdn dns_search[MAXDNSRCH];
 
+	char hostname[PASST_MAXDNAME];
+	char fqdn[PASST_MAXDNAME];
+
 	int ifi6;
 	struct ip6_ctx ip6;
 
diff --git a/pasta.c b/pasta.c
index f15084d..585a51c 100644
--- a/pasta.c
+++ b/pasta.c
@@ -169,10 +169,12 @@ void pasta_open_ns(struct ctx *c, const char *netns)
  * struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
  * @exe:	Executable to run
  * @argv:	Command and arguments to run
+ * @ctx:	Context to read config from
  */
 struct pasta_spawn_cmd_arg {
 	const char *exe;
 	char *const *argv;
+	struct ctx *c;
 };
 
 /**
@@ -186,6 +188,7 @@ static int pasta_spawn_cmd(void *arg)
 {
 	char hostname[HOST_NAME_MAX + 1] = HOSTNAME_PREFIX;
 	const struct pasta_spawn_cmd_arg *a;
+	size_t conf_hostname_len;
 	sigset_t set;
 
 	/* We run in a detached PID and mount namespace: mount /proc over */
@@ -195,9 +198,15 @@ static int pasta_spawn_cmd(void *arg)
 	if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
 		warn("Cannot set ping_group_range, ICMP requests might fail");
 
-	if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
-			 HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
-	    errno == ENAMETOOLONG) {
+	a = (const struct pasta_spawn_cmd_arg *)arg;
+
+	conf_hostname_len = strlen(a->c->hostname);
+	if (conf_hostname_len > 0) {
+		if (sethostname(a->c->hostname, conf_hostname_len))
+			warn("Unable to set configured hostname");
+	} else if (!gethostname(hostname + sizeof(HOSTNAME_PREFIX) - 1,
+				HOST_NAME_MAX + 1 - sizeof(HOSTNAME_PREFIX)) ||
+		   errno == ENAMETOOLONG) {
 		hostname[HOST_NAME_MAX] = '\0';
 		if (sethostname(hostname, strlen(hostname)))
 			warn("Unable to set pasta-prefixed hostname");
@@ -208,7 +217,6 @@ static int pasta_spawn_cmd(void *arg)
 	sigaddset(&set, SIGUSR1);
 	sigwaitinfo(&set, NULL);
 
-	a = (const struct pasta_spawn_cmd_arg *)arg;
 	execvp(a->exe, a->argv);
 
 	die_perror("Failed to start command or shell");
@@ -230,6 +238,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 	struct pasta_spawn_cmd_arg arg = {
 		.exe = argv[0],
 		.argv = argv,
+		.c = c,
 	};
 	char uidmap[BUFSIZ], gidmap[BUFSIZ];
 	char *sh_argv[] = { NULL, NULL };
diff --git a/test/lib/setup b/test/lib/setup
index 580825f..ee67152 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -49,7 +49,7 @@ setup_passt() {
 
 	context_run passt "make clean"
 	context_run passt "make valgrind"
-	context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -P ${STATESETUP}/passt.pid"
+	context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt ${__opts} -s ${STATESETUP}/passt.socket -f -t 10001 -u 10001 -H hostname1 --fqdn fqdn1.passt.test -P ${STATESETUP}/passt.pid"
 
 	# pidfile isn't created until passt is listening
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
@@ -160,11 +160,11 @@ setup_passt_in_ns() {
 	if [ ${VALGRIND} -eq 1 ]; then
 		context_run passt "make clean"
 		context_run passt "make valgrind"
-		context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
+		context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
 	else
 		context_run passt "make clean"
 		context_run passt "make"
-		context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
+		context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -H hostname1 --fqdn fqdn1.passt.test -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
 	fi
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
 
@@ -243,7 +243,7 @@ setup_two_guests() {
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
 	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
-	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
+	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} --fqdn fqdn1.passt.test -H hostname1 -t 10001 -u 10001"
 	wait_for [ -f "${STATESETUP}/passt_1.pid" ]
 
 	__opts=
@@ -252,7 +252,7 @@ setup_two_guests() {
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
 	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
-	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
+	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} --hostname hostname2 --fqdn fqdn2 -t 10004 -u 10004"
 	wait_for [ -f "${STATESETUP}/passt_2.pid" ]
 
 	__vmem="$((${MEM_KIB} / 1024 / 4))"
diff --git a/test/passt.mbuto b/test/passt.mbuto
index d4d57cb..e45a284 100755
--- a/test/passt.mbuto
+++ b/test/passt.mbuto
@@ -13,7 +13,7 @@
 PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
        modprobe find grep mknod mv rm umount jq iperf3 dhclient hostname
        sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
-       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump}"
+       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump env}"
 
 # OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
 # sshd-session the per-session program. We need the latter as well, and the path
@@ -41,6 +41,7 @@ FIXUP="${FIXUP}"'
 #!/bin/sh
 LOG=/var/log/dhclient-script.log
 echo \${reason} \${interface} >> \$LOG
+env >> \$LOG
 set >> \$LOG
 
 [ -n "\${new_interface_mtu}" ]       && ip link set dev \${interface} mtu \${new_interface_mtu}
@@ -54,7 +55,8 @@ set >> \$LOG
 [ -n "\${new_ip6_address}" ]         && ip addr add \${new_ip6_address}/\${new_ip6_prefixlen} dev \${interface}
 [ -n "\${new_dhcp6_name_servers}" ]  && for d in \${new_dhcp6_name_servers}; do echo "nameserver \${d}%\${interface}" >> /etc/resolv.conf; done
 [ -n "\${new_dhcp6_domain_search}" ] && (printf "search"; for d in \${new_dhcp6_domain_search}; do printf " %s" "\${d}"; done; printf "\n") >> /etc/resolv.conf
-[ -n "\${new_host_name}" ]           && hostname "\${new_host_name}"
+[ -n "\${new_host_name}" ]           && echo "\${new_host_name}" > /tmp/new_host_name
+[ -n "\${new_fqdn_fqdn}" ]           && echo "\${new_fqdn_fqdn}" > /tmp/new_fqdn_fqdn
 exit 0
 EOF
 	chmod 755 /sbin/dhclient-script
diff --git a/test/passt/dhcp b/test/passt/dhcp
index 9925ab9..145f1ba 100644
--- a/test/passt/dhcp
+++ b/test/passt/dhcp
@@ -11,7 +11,7 @@
 # Copyright (c) 2021 Red Hat GmbH
 # Author: Stefano Brivio <sbrivio@redhat.com>
 
-gtools	ip jq dhclient sed tr
+gtools	ip jq dhclient sed tr hostname
 htools	ip jq sed tr head
 
 test	Interface name
@@ -47,7 +47,16 @@ gout	SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^searc
 hout	HOST_SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
 check	[ "__SEARCH__" = "__HOST_SEARCH__" ]
 
+test	DHCP: Hostname
+gout	NEW_HOST_NAME cat /tmp/new_host_name
+check	[ "__NEW_HOST_NAME__" = "hostname1" ]
+
+test	DHCP: Client FQDN
+gout	NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
+check	[ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]
+
 test	DHCPv6: address
+guest	rm /tmp/new_fqdn_fqdn
 guest	/sbin/dhclient -6 __IFNAME__
 # Wait for DAD to complete
 guest	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
@@ -70,3 +79,7 @@ test	DHCPv6: search list
 gout	SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
 hout	HOST_SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
 check	[ "__SEARCH6__" = "__HOST_SEARCH6__" ]
+
+test	DHCPv6: Hostname
+gout	NEW_FQDN_FQDN cat /tmp/new_fqdn_fqdn
+check	[ "__NEW_FQDN_FQDN__" = "fqdn1.passt.test" ]
diff --git a/util.c b/util.c
index 4d51e04..ba33866 100644
--- a/util.c
+++ b/util.c
@@ -930,4 +930,28 @@ void raw_random(void *buf, size_t buflen)
 void epoll_del(const struct ctx *c, int fd)
 {
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
+
+}
+
+/**
+ * encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
+ * @buf:		Buffer to fill in with encoded domain name
+ * @domain_name:	Input domain name string with terminator
+ *
+ * The buffer's 'buf' size has to be >= strlen(domain_name) + 2
+ */
+void encode_domain_name(char *buf, const char *domain_name)
+{
+	size_t i;
+	char *p;
+
+	buf[0] = strcspn(domain_name, ".");
+	p = buf + 1;
+	for (i = 0; domain_name[i]; i++) {
+		if (domain_name[i] == '.')
+			p[i] = strcspn(domain_name + i + 1, ".");
+		else
+			p[i] = domain_name[i];
+	}
+	p[i] = 0L;
 }
diff --git a/util.h b/util.h
index 23b165c..9c92a37 100644
--- a/util.h
+++ b/util.h
@@ -40,6 +40,9 @@
 #ifndef IP_MAX_MTU
 #define IP_MAX_MTU			USHRT_MAX
 #endif
+#ifndef IPV6_MIN_MTU
+#define IPV6_MIN_MTU			1280
+#endif
 
 #ifndef MIN
 #define MIN(x, y)		(((x) < (y)) ? (x) : (y))
@@ -352,4 +355,7 @@ static inline int wrap_accept4(int sockfd, struct sockaddr *addr,
 #define accept4(s, addr, addrlen, flags) \
 	wrap_accept4((s), (addr), (addrlen), (flags))
 
+#define PASST_MAXDNAME 254 /* 253 (RFC 1035) + 1 (the terminator) */
+void encode_domain_name(char *buf, const char *domain_name);
+
 #endif /* UTIL_H */

From 472e2e930f6e17d9d8664d6cf44c47af1db58bb3 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 11 Feb 2025 20:11:00 +0100
Subject: [PATCH 058/221] tcp: Don't discard window information on keep-alive
 segments

It looks like a detail, but it's critical if we're dealing with
somebody, such as near-future self, using TCP_REPAIR to migrate TCP
connections in the guest or container.

The last packet sent from the 'source' process/guest/container
typically reports a small window, or zero, because the guest/container
hadn't been draining it for a while.

The next packet, appearing as the target sets TCP_REPAIR_OFF on the
migrated socket, is a keep-alive (also called "window probe" in CRIU
or TCP_REPAIR-related code), and it comes with an updated window
value, reflecting the pre-migration "regular" value.

If we ignore it, it might take a while/forever before we realise we
can actually restart sending.

Fixes: 238c69f9af45 ("tcp: Acknowledge keep-alive segments, ignore them for the rest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index af6bd95..2addf4a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1664,8 +1664,10 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			tcp_send_flag(c, conn, ACK);
 			tcp_timer_ctl(c, conn);
 
-			if (p->count == 1)
+			if (p->count == 1) {
+				tcp_tap_window_update(conn, ntohs(th->window));
 				return 1;
+			}
 
 			continue;
 		}

From 90f91fe72673e36c8e071a1750e9c03deb20ab0f Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 11 Feb 2025 20:19:05 +0100
Subject: [PATCH 059/221] tcp: Implement conservative zero-window probe on ACK
 timeout

This probably doesn't cover all the cases where we should send a
zero-window probe, but it's rather unobtrusive and obvious, so start
from here, also because I just observed this case (without the fix
from the previous patch, to take into account window information from
keep-alive segments).

If we hit the ACK timeout, and try re-sending data from the socket,
if the window is zero, we'll just fail again, go back to the timer,
and so on, until we hit the maximum number of re-transmissions and
reset the connection.

Don't do that: forcibly try to send something by implementing the
equivalent of a zero-window probe in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tcp.c b/tcp.c
index 2addf4a..b87478f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2175,6 +2175,8 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "ACK timeout, retry");
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
+			if (!conn->wnd_from_tap)
+				conn->wnd_from_tap = 1; /* Zero-window probe */
 			if (tcp_set_peek_offset(conn->sock, 0)) {
 				tcp_rst(c, conn);
 			} else {

From def7de4690ddb40f7c3b29e6ca81d30e9409fb5d Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 11 Feb 2025 20:43:32 +0100
Subject: [PATCH 060/221] tcp_vu: Fix off-by one in header count array
 adjustment

head_cnt represents the number of frames we're going to forward to the
guest in tcp_vu_sock_recv(), each of which could require multiple
buffers ("elements").  We initialise it with as many frames as we can
find space for in vu buffers, and we then need to adjust it down to
the number of frames we actually (partially) filled.

We adjust it down based on number of individual buffers used by the
data from recvmsg().  At this point 'i' is *one greater than* that
number of buffers, so we need to discard all (unused) frames with a
buffer index >= i, instead of > i.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
[david: Contributed actual commit message]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_vu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index fad7065..0622f17 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -261,7 +261,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
 		len -= iov->iov_len;
 	}
 	/* adjust head count */
-	while (head_cnt > 0 && head[head_cnt - 1] > i)
+	while (head_cnt > 0 && head[head_cnt - 1] >= i)
 		head_cnt--;
 	/* mark end of array */
 	head[head_cnt] = i;

From 836fe215e049ee423750d3315a02742d8224eab2 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 12 Feb 2025 01:07:33 +0100
Subject: [PATCH 061/221] passt-repair: Fix off-by-one in check for number of
 file descriptors

Actually, 254 is too many, but 253 isn't.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 614cee0..1174ae3 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -131,7 +131,7 @@ loop:
 	/* No inverse formula for CMSG_LEN(x), and building one with CMSG_LEN(0)
 	 * works but there's no guarantee it does. Search the whole domain.
 	 */
-	for (i = 1; i < SCM_MAX_FD; i++) {
+	for (i = 1; i <= SCM_MAX_FD; i++) {
 		if (CMSG_LEN(sizeof(int) * i) == cmsg->cmsg_len) {
 			n = i;
 			break;

From 5911e08c0f53e46547e7eeb1dd824c8ab96e512e Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 12 Feb 2025 18:07:13 +1100
Subject: [PATCH 062/221] migrate: Skeleton of live migration logic

Introduce facilities for guest migration on top of vhost-user
infrastructure.  Add migration facilities based on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile     |  12 +--
 epoll_type.h |   2 -
 migrate.c    | 214 +++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h    |  51 ++++++++++++
 passt.c      |   8 +-
 passt.h      |   8 ++
 util.h       |  29 +++++++
 vhost_user.c |  60 +++------------
 virtio.h     |   4 -
 vu_common.c  |  49 +-----------
 vu_common.h  |   2 +-
 11 files changed, 324 insertions(+), 115 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index d3d4b78..be89b07 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
@@ -49,10 +49,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/epoll_type.h b/epoll_type.h
index fd9eac3..f3ef415 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,8 +40,6 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
-	/* vhost-user migration socket */
-	EPOLL_TYPE_VHOST_MIGRATION,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..aeac872
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Magic identifier for migration data */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+
+/* Stages for version 1 */
+static const struct migrate_stage stages_v1[] = {
+	{ 0 },
+};
+
+/* Supported encoding versions, from latest (most preferred) to oldest */
+static const struct migrate_version versions[] = {
+	{ 1,	stages_v1, },
+	{ 0 },
+};
+
+/* Current encoding version */
+#define CURRENT_VERSION		(&versions[0])
+
+/**
+ * migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int migrate_source(struct ctx *c, int fd)
+{
+	const struct migrate_version *v = CURRENT_VERSION;
+	const struct migrate_header header = {
+		.magic		= htonll_constant(MIGRATE_MAGIC),
+		.version	= htonl(v->id),
+		.compat_version	= htonl(v->id),
+	};
+	const struct migrate_stage *s;
+	int ret;
+
+	if (write_all_buf(fd, &header, sizeof(header))) {
+		ret = errno;
+		err("Can't send migration header: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (s = v->s; s->name; s++) {
+		if (!s->source)
+			continue;
+
+		debug("Source side migration stage: %s", s->name);
+
+		if ((ret = s->source(c, s, fd))) {
+			err("Source migration stage: %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target_read_header() - Read header in target
+ * @fd:		Descriptor for state transfer
+ *
+ * Return: version structure on success, NULL on failure with errno set
+ */
+static const struct migrate_version *migrate_target_read_header(int fd)
+{
+	const struct migrate_version *v;
+	struct migrate_header h;
+	uint32_t id, compat_id;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return NULL;
+
+	id = ntohl(h.version);
+	compat_id = ntohl(h.compat_version);
+
+	debug("Source magic: 0x%016" PRIx64 ", version: %u, compat: %u",
+	      ntohll(h.magic), id, compat_id);
+
+	if (ntohll(h.magic) != MIGRATE_MAGIC || !id || !compat_id) {
+		err("Invalid incoming device state");
+		errno = EINVAL;
+		return NULL;
+	}
+
+	for (v = versions; v->id; v++)
+		if (v->id <= id && v->id >= compat_id)
+			return v;
+
+	errno = ENOTSUP;
+	err("Unsupported device state version: %u", id);
+	return NULL;
+}
+
+/**
+ * migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int migrate_target(struct ctx *c, int fd)
+{
+	const struct migrate_version *v;
+	const struct migrate_stage *s;
+	int ret;
+
+	if (!(v = migrate_target_read_header(fd)))
+		return errno;
+
+	for (s = v->s; s->name; s++) {
+		if (!s->target)
+			continue;
+
+		debug("Target side migration stage: %s", s->name);
+
+		if ((ret = s->target(c, s, fd))) {
+			err("Target migration stage: %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_init() - Set up things necessary for migration
+ * @c:		Execution context
+ */
+void migrate_init(struct ctx *c)
+{
+	c->device_state_result = -1;
+}
+
+/**
+ * migrate_close() - Close migration channel
+ * @c:		Execution context
+ */
+void migrate_close(struct ctx *c)
+{
+	if (c->device_state_fd != -1) {
+		debug("Closing migration channel, fd: %d", c->device_state_fd);
+		close(c->device_state_fd);
+		c->device_state_fd = -1;
+		c->device_state_result = -1;
+	}
+}
+
+/**
+ * migrate_request() - Request a migration of device state
+ * @c:		Execution context
+ * @fd:		fd to transfer state
+ * @target:	Are we the target of the migration?
+ */
+void migrate_request(struct ctx *c, int fd, bool target)
+{
+	debug("Migration requested, fd: %d (was %d)", fd, c->device_state_fd);
+
+	if (c->device_state_fd != -1)
+		migrate_close(c);
+
+	c->device_state_fd = fd;
+	c->migrate_target = target;
+}
+
+/**
+ * migrate_handler() - Send/receive passt internal state to/from hypervisor
+ * @c:		Execution context
+ */
+void migrate_handler(struct ctx *c)
+{
+	int rc;
+
+	if (c->device_state_fd < 0)
+		return;
+
+	debug("Handling migration request from fd: %d, target: %d",
+	      c->device_state_fd, c->migrate_target);
+
+	if (c->migrate_target)
+		rc = migrate_target(c, c->device_state_fd);
+	else
+		rc = migrate_source(c, c->device_state_fd);
+
+	migrate_close(c);
+
+	c->device_state_result = rc;
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..2c51cd9
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_header - Migration header from source
+ * @magic:		0xB1BB1D1B0BB1D1B0, network order
+ * @version:		Highest known, target aborts if too old, network order
+ * @compat_version:	Lowest version compatible with @version, target aborts
+ *			if too new, network order
+ */
+struct migrate_header {
+	uint64_t magic;
+	uint32_t version;
+	uint32_t compat_version;
+} __attribute__((packed));
+
+/**
+ * struct migrate_stage - Callbacks and parameters for one stage of migration
+ * @name:	Stage name (for debugging)
+ * @source:	Callback to implement this stage on the source
+ * @target:	Callback to implement this stage on the target
+ */
+struct migrate_stage {
+	const char *name;
+	int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
+	int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
+
+	/* Add here separate rollback callbacks if needed */
+};
+
+/**
+ * struct migrate_version - Stages for a particular protocol version
+ * @id:		Version number, host order
+ * @s:		Ordered array of stages, NULL-terminated
+ */
+struct migrate_version {
+	uint32_t id;
+	const struct migrate_stage *s;
+};
+
+void migrate_init(struct ctx *c);
+void migrate_close(struct ctx *c);
+void migrate_request(struct ctx *c, int fd, bool target);
+void migrate_handler(struct ctx *c);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index 53fdd38..935a69f 100644
--- a/passt.c
+++ b/passt.c
@@ -51,6 +51,7 @@
 #include "tcp_splice.h"
 #include "ndp.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define EPOLL_EVENTS		8
 
@@ -75,7 +76,6 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
-	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -202,6 +202,7 @@ int main(int argc, char **argv)
 	isolate_initial(argc, argv);
 
 	c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
+	c.device_state_fd = -1;
 
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = 0;
@@ -357,9 +358,6 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
-		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(c.vdev, eventmask);
-			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
@@ -368,5 +366,7 @@ loop:
 
 	post_handler(&c, &now);
 
+	migrate_handler(&c);
+
 	goto loop;
 }
diff --git a/passt.h b/passt.h
index f3151f0..5fdea52 100644
--- a/passt.h
+++ b/passt.h
@@ -237,6 +237,9 @@ struct ip6_ctx {
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
  * @vdev:		vhost-user device
+ * @device_state_fd:	Device state migration channel
+ * @device_state_result: Device state migration result
+ * @migrate_target:	Are we the target, on the next migration request?
  */
 struct ctx {
 	enum passt_modes mode;
@@ -305,6 +308,11 @@ struct ctx {
 	int low_rmem;
 
 	struct vu_dev *vdev;
+
+	/* Migration */
+	int device_state_fd;
+	int device_state_result;
+	bool migrate_target;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/util.h b/util.h
index 9c92a37..7df7767 100644
--- a/util.h
+++ b/util.h
@@ -125,14 +125,43 @@
 	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
 #endif
 
+#ifndef __bswap_constant_32
+#define __bswap_constant_32(x)						\
+	((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) |	\
+	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
+#endif
+
+#ifndef __bswap_constant_64
+#define __bswap_constant_64(x) \
+	((((x) & 0xff00000000000000ULL) >> 56) |			\
+	 (((x) & 0x00ff000000000000ULL) >> 40) |			\
+	 (((x) & 0x0000ff0000000000ULL) >> 24) |			\
+	 (((x) & 0x000000ff00000000ULL) >> 8)  |			\
+	 (((x) & 0x00000000ff000000ULL) << 8)  |			\
+	 (((x) & 0x0000000000ff0000ULL) << 24) |			\
+	 (((x) & 0x000000000000ff00ULL) << 40) |			\
+	 (((x) & 0x00000000000000ffULL) << 56))
+#endif
+
 #if __BYTE_ORDER == __BIG_ENDIAN
 #define	htons_constant(x)	(x)
 #define	htonl_constant(x)	(x)
+#define htonll_constant(x)	(x)
+#define	ntohs_constant(x)	(x)
+#define	ntohl_constant(x)	(x)
+#define ntohll_constant(x)	(x)
 #else
 #define	htons_constant(x)	(__bswap_constant_16(x))
 #define	htonl_constant(x)	(__bswap_constant_32(x))
+#define	htonll_constant(x)	(__bswap_constant_64(x))
+#define	ntohs_constant(x)	(__bswap_constant_16(x))
+#define	ntohl_constant(x)	(__bswap_constant_32(x))
+#define	ntohll_constant(x)	(__bswap_constant_64(x))
 #endif
 
+#define ntohll(x)		(be64toh((x)))
+#define htonll(x)		(htobe64((x)))
+
 /**
  * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
  * @p:		Pointer to the BE value in memory
diff --git a/vhost_user.c b/vhost_user.c
index 159f0b3..256c8ab 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -44,6 +44,7 @@
 #include "tap.h"
 #include "vhost_user.h"
 #include "pcap.h"
+#include "migrate.h"
 
 /* vhost-user version we are compatible with */
 #define VHOST_USER_VERSION 1
@@ -997,36 +998,6 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }
 
-/**
- * vu_set_migration_watch() - Add the migration file descriptor to epoll
- * @vdev:	vhost-user device
- * @fd:		File descriptor to add
- * @direction:	Direction of the migration (save or load backend state)
- */
-static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
-				   uint32_t direction)
-{
-	union epoll_ref ref = {
-		.type = EPOLL_TYPE_VHOST_MIGRATION,
-		.fd = fd,
-	 };
-	struct epoll_event ev = { 0 };
-
-	ev.data.u64 = ref.u64;
-	switch (direction) {
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE:
-		ev.events = EPOLLOUT;
-		break;
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD:
-		ev.events = EPOLLIN;
-		break;
-	default:
-		ASSERT(0);
-	}
-
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
-}
-
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1051,16 +1022,8 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
 		die("Invalide device_state_fd direction: %d", direction);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-	}
-
-	vdev->device_state_fd = msg->fds[0];
-	vdev->device_state_result = -1;
-	vu_set_migration_watch(vdev, vdev->device_state_fd, direction);
-
-	debug("Got device_state_fd: %d", vdev->device_state_fd);
+	migrate_request(vdev->context, msg->fds[0],
+			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
@@ -1075,12 +1038,11 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
  *
  * Return: True as the reply contains the migration result
  */
+/* cppcheck-suppress constParameterCallback */
 static bool vu_check_device_state_exec(struct vu_dev *vdev,
 				       struct vhost_user_msg *msg)
 {
-	(void)vdev;
-
-	vmsg_set_reply_u64(msg, vdev->device_state_result);
+	vmsg_set_reply_u64(msg, vdev->context->device_state_result);
 
 	return true;
 }
@@ -1106,8 +1068,8 @@ void vu_init(struct ctx *c)
 	}
 	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
-	c->vdev->device_state_fd = -1;
-	c->vdev->device_state_result = -1;
+
+	migrate_init(c);
 }
 
 
@@ -1157,12 +1119,8 @@ void vu_cleanup(struct vu_dev *vdev)
 
 	vu_close_log(vdev);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-		vdev->device_state_result = -1;
-	}
+	/* If we lose the VU dev, we also lose our migration channel */
+	migrate_close(vdev->context);
 }
 
 /**
diff --git a/virtio.h b/virtio.h
index 7bef2d2..0a59441 100644
--- a/virtio.h
+++ b/virtio.h
@@ -106,8 +106,6 @@ struct vu_dev_region {
  * @log_call_fd:		Eventfd to report logging update
  * @log_size:			Size of the logging memory region
  * @log_table:			Base of the logging memory region
- * @device_state_fd:		Device state migration channel
- * @device_state_result:	Device state migration result
  */
 struct vu_dev {
 	struct ctx *context;
@@ -119,8 +117,6 @@ struct vu_dev {
 	int log_call_fd;
 	uint64_t log_size;
 	uint8_t *log_table;
-	int device_state_fd;
-	int device_state_result;
 };
 
 /**
diff --git a/vu_common.c b/vu_common.c
index ab04d31..48826b1 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include <errno.h>
 #include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -303,50 +305,3 @@ err:
 
 	return -1;
 }
-
-/**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
- * @events:	epoll events
- */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
-{
-	int ret;
-
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
-	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
-
-		/* The end of file signals the end of the transfer. */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
-}
diff --git a/vu_common.h b/vu_common.h
index d56c021..f538f23 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+
 #endif /* VU_COMMON_H */

From 155cd0c41e549cea956b7f8506cda7803cf63419 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Feb 2025 18:07:14 +1100
Subject: [PATCH 063/221] migrate: Migrate guest observed addresses

Most of the information in struct ctx doesn't need to be migrated.
Either it's strictly back end information which is allowed to differ
between the two ends, or it must already be configured identically on
the two ends.

There are a few exceptions though.  In particular passt learns several
addresses of the guest by observing what it sends out.  If we lose
this information across migration we might get away with it, but if
there are active flows we might misdirect some packets before
re-learning the guest address.

Avoid this by migrating the guest's observed addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style stuff, comments, etc.]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/migrate.c b/migrate.c
index aeac872..72a6d40 100644
--- a/migrate.c
+++ b/migrate.c
@@ -27,8 +27,81 @@
 /* Magic identifier for migration data */
 #define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
 
+/**
+ * struct migrate_seen_addrs_v1 - Migratable guest addresses for v1 state stream
+ * @addr6:	Observed guest IPv6 address
+ * @addr6_ll:	Observed guest IPv6 link-local address
+ * @addr4:	Observed guest IPv4 address
+ * @mac:	Observed guest MAC address
+ */
+struct migrate_seen_addrs_v1 {
+	struct in6_addr addr6;
+	struct in6_addr addr6_ll;
+	struct in_addr addr4;
+	unsigned char mac[ETH_ALEN];
+} __attribute__((packed));
+
+/**
+ * seen_addrs_source_v1() - Copy and send guest observed addresses from source
+ * @c:		Execution context
+ * @stage:	Migration stage, unused
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
+static int seen_addrs_source_v1(struct ctx *c,
+				const struct migrate_stage *stage, int fd)
+{
+	struct migrate_seen_addrs_v1 addrs = {
+		.addr6 = c->ip6.addr_seen,
+		.addr6_ll = c->ip6.addr_ll_seen,
+		.addr4 = c->ip4.addr_seen,
+	};
+
+	(void)stage;
+
+	memcpy(addrs.mac, c->guest_mac, sizeof(addrs.mac));
+
+	if (write_all_buf(fd, &addrs, sizeof(addrs)))
+		return errno;
+
+	return 0;
+}
+
+/**
+ * seen_addrs_target_v1() - Receive and use guest observed addresses on target
+ * @c:		Execution context
+ * @stage:	Migration stage, unused
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int seen_addrs_target_v1(struct ctx *c,
+				const struct migrate_stage *stage, int fd)
+{
+	struct migrate_seen_addrs_v1 addrs;
+
+	(void)stage;
+
+	if (read_all_buf(fd, &addrs, sizeof(addrs)))
+		return errno;
+
+	c->ip6.addr_seen = addrs.addr6;
+	c->ip6.addr_ll_seen = addrs.addr6_ll;
+	c->ip4.addr_seen = addrs.addr4;
+	memcpy(c->guest_mac, addrs.mac, sizeof(c->guest_mac));
+
+	return 0;
+}
+
 /* Stages for version 1 */
 static const struct migrate_stage stages_v1[] = {
+	{
+		.name = "observed addresses",
+		.source = seen_addrs_source_v1,
+		.target = seen_addrs_target_v1,
+	},
 	{ 0 },
 };
 

From b899141ad52fb417fe608d9c8cfe66f9572207c7 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 12 Feb 2025 18:07:15 +1100
Subject: [PATCH 064/221] Add interfaces and configuration bits for
 passt-repair

In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile     |  12 +--
 conf.c       |  43 +++++++++-
 epoll_type.h |   4 +
 migrate.c    |   5 +-
 passt.1      |  11 +++
 passt.c      |   9 +++
 passt.h      |   7 ++
 repair.c     | 219 +++++++++++++++++++++++++++++++++++++++++++++++++++
 repair.h     |  16 ++++
 tap.c        |  65 +--------------
 util.c       |  62 +++++++++++++++
 util.h       |   1 +
 12 files changed, 381 insertions(+), 73 deletions(-)
 create mode 100644 repair.c
 create mode 100644 repair.h

diff --git a/Makefile b/Makefile
index be89b07..d4e1096 100644
--- a/Makefile
+++ b/Makefile
@@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
-	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
-	vhost_user.c virtio.c vu_common.c
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
+	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
+	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
@@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
-	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
-	vhost_user.h virtio.h vu_common.h
+	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
+	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
+	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/conf.c b/conf.c
index d9de07c..18017f5 100644
--- a/conf.c
+++ b/conf.c
@@ -820,6 +820,9 @@ static void usage(const char *name, FILE *f, int status)
 			"    UNIX domain socket is provided by -s option\n"
 			"  --print-capabilities	print back-end capabilities in JSON format,\n"
 			"    only meaningful for vhost-user mode\n");
+		FPRINTF(f,
+			"  --repair-path PATH	path for passt-repair(1)\n"
+			"    default: append '.repair' to UNIX domain path\n");
 	}
 
 	FPRINTF(f,
@@ -1245,8 +1248,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
  */
 static void conf_open_files(struct ctx *c)
 {
-	if (c->mode != MODE_PASTA && c->fd_tap == -1)
-		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
+	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
+		c->fd_tap_listen = sock_unix(c->sock_path);
+
+		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
+			if (!*c->repair_path &&
+			    snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s.repair",
+					   c->sock_path)) {
+				warn("passt-repair path %s not usable",
+				     c->repair_path);
+				c->fd_repair_listen = -1;
+			} else {
+				c->fd_repair_listen = sock_unix(c->repair_path);
+			}
+		} else {
+			c->fd_repair_listen = -1;
+		}
+		c->fd_repair = -1;
+	}
 
 	if (*c->pidfile) {
 		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
@@ -1360,10 +1380,12 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
 		{"dns-host",	required_argument,	NULL,		24 },
 		{"vhost-user",	no_argument,		NULL,		25 },
+
 		/* vhost-user backend program convention */
 		{"print-capabilities", no_argument,	NULL,		26 },
 		{"socket-path",	required_argument,	NULL,		's' },
 		{"fqdn",	required_argument,	NULL,		27 },
+		{"repair-path",	required_argument,	NULL,		28 },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1570,6 +1592,9 @@ void conf(struct ctx *c, int argc, char **argv)
 					   "%s", optarg))
 				die("Invalid FQDN: %s", optarg);
 			break;
+		case 28:
+			/* Handle this once we checked --vhost-user */
+			break;
 		case 'd':
 			c->debug = 1;
 			c->quiet = 0;
@@ -1841,8 +1866,8 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
 		c->no_dhcp = 1;
 
-	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
-	 * settings)
+	/* Inbound port options, DNS, and --repair-path can be parsed now, after
+	 * IPv4/IPv6 settings and --vhost-user.
 	 */
 	fwd_probe_ephemeral();
 	udp_portmap_clear();
@@ -1888,6 +1913,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			}
 
 			die("Cannot use DNS address %s", optarg);
+		} else if (name == 28) {
+			if (c->mode != MODE_VU && strcmp(optarg, "none"))
+				die("--repair-path is for vhost-user mode only");
+
+			if (snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s",
+					   optarg))
+				die("Invalid passt-repair path: %s", optarg);
+
+			break;
 		}
 	} while (name != -1);
 
diff --git a/epoll_type.h b/epoll_type.h
index f3ef415..7f2a121 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,6 +40,10 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
+	/* TCP_REPAIR helper listening socket */
+	EPOLL_TYPE_REPAIR_LISTEN,
+	/* TCP_REPAIR helper socket */
+	EPOLL_TYPE_REPAIR,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/migrate.c b/migrate.c
index 72a6d40..1c59016 100644
--- a/migrate.c
+++ b/migrate.c
@@ -23,6 +23,7 @@
 #include "flow_table.h"
 
 #include "migrate.h"
+#include "repair.h"
 
 /* Magic identifier for migration data */
 #define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
@@ -232,7 +233,7 @@ void migrate_init(struct ctx *c)
 }
 
 /**
- * migrate_close() - Close migration channel
+ * migrate_close() - Close migration channel and connection to passt-repair
  * @c:		Execution context
  */
 void migrate_close(struct ctx *c)
@@ -243,6 +244,8 @@ void migrate_close(struct ctx *c)
 		c->device_state_fd = -1;
 		c->device_state_result = -1;
 	}
+
+	repair_close(c);
 }
 
 /**
diff --git a/passt.1 b/passt.1
index 9d347d8..60066c2 100644
--- a/passt.1
+++ b/passt.1
@@ -428,6 +428,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
 .BR \-\-print-capabilities
 Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
 
+.TP
+.BR \-\-repair-path " " \fIpath
+Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
+to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
+migration. \fB--repair-path none\fR disables this interface (if you need to
+specify a socket path called "none" you can prefix the path by \fI./\fR).
+
+Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
+chosen for the hypervisor UNIX domain socket. No socket is created if not in
+\-\-vhost-user mode.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
diff --git a/passt.c b/passt.c
index 935a69f..6f9fb4d 100644
--- a/passt.c
+++ b/passt.c
@@ -52,6 +52,7 @@
 #include "ndp.h"
 #include "vu_common.h"
 #include "migrate.h"
+#include "repair.h"
 
 #define EPOLL_EVENTS		8
 
@@ -76,6 +77,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
+	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
+	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -358,6 +361,12 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
+		case EPOLL_TYPE_REPAIR_LISTEN:
+			repair_listen_handler(&c, eventmask);
+			break;
+		case EPOLL_TYPE_REPAIR:
+			repair_handler(&c, eventmask);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 5fdea52..1f0dab5 100644
--- a/passt.h
+++ b/passt.h
@@ -20,6 +20,7 @@ union epoll_ref;
 #include "siphash.h"
 #include "ip.h"
 #include "inany.h"
+#include "migrate.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
@@ -193,6 +194,7 @@ struct ip6_ctx {
  * @foreground:		Run in foreground, don't log to stderr by default
  * @nofile:		Maximum number of open files (ulimit -n)
  * @sock_path:		Path for UNIX domain socket
+ * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
  * @pcap:		Path for packet capture file
  * @pidfile:		Path to PID file, empty string if not configured
  * @pidfile_fd:		File descriptor for PID file, -1 if none
@@ -203,6 +205,8 @@ struct ip6_ctx {
  * @epollfd:		File descriptor for epoll instance
  * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
  * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
+ * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
+ * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
  * @our_tap_mac:	Pasta/passt's MAC on the tap link
  * @guest_mac:		MAC address of guest or namespace, seen or configured
  * @hash_secret:	128-bit secret for siphash functions
@@ -249,6 +253,7 @@ struct ctx {
 	int foreground;
 	int nofile;
 	char sock_path[UNIX_PATH_MAX];
+	char repair_path[UNIX_PATH_MAX];
 	char pcap[PATH_MAX];
 
 	char pidfile[PATH_MAX];
@@ -265,6 +270,8 @@ struct ctx {
 	int epollfd;
 	int fd_tap_listen;
 	int fd_tap;
+	int fd_repair_listen;
+	int fd_repair;
 	unsigned char our_tap_mac[ETH_ALEN];
 	unsigned char guest_mac[ETH_ALEN];
 	uint64_t hash_secret[2];
diff --git a/repair.c b/repair.c
new file mode 100644
index 0000000..d288617
--- /dev/null
+++ b/repair.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "repair.h"
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+/* Pending file descriptors for next repair_flush() call, or command change */
+static int repair_fds[SCM_MAX_FD];
+
+/* Pending command: flush pending file descriptors if it changes */
+static int8_t repair_cmd;
+
+/* Number of pending file descriptors set in @repair_fds */
+static int repair_nfds;
+
+/**
+ * repair_sock_init() - Start listening for connections on helper socket
+ * @c:		Execution context
+ */
+void repair_sock_init(const struct ctx *c)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
+	struct epoll_event ev = { 0 };
+
+	if (c->fd_repair_listen == -1)
+		return;
+
+	if (listen(c->fd_repair_listen, 0)) {
+		err_perror("listen() on repair helper socket, won't migrate");
+		return;
+	}
+
+	ref.fd = c->fd_repair_listen;
+	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
+		err_perror("repair helper socket epoll_ctl(), won't migrate");
+}
+
+/**
+ * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_listen_handler(struct ctx *c, uint32_t events)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
+	struct epoll_event ev = { 0 };
+	struct ucred ucred;
+	socklen_t len;
+
+	if (events != EPOLLIN) {
+		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
+		      events);
+		return;
+	}
+
+	len = sizeof(ucred);
+
+	/* Another client is already connected: accept and close right away. */
+	if (c->fd_repair != -1) {
+		int discard = accept4(c->fd_repair_listen, NULL, NULL,
+				      SOCK_NONBLOCK);
+
+		if (discard == -1)
+			return;
+
+		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
+
+		close(discard);
+		return;
+	}
+
+	if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
+		debug_perror("accept4() on TCP_REPAIR helper listening socket");
+		return;
+	}
+
+	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
+
+	ref.fd = c->fd_repair;
+	ev.events = EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
+		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
+		close(c->fd_repair);
+		c->fd_repair = -1;
+	}
+}
+
+/**
+ * repair_close() - Close connection to TCP_REPAIR helper
+ * @c:		Execution context
+ */
+void repair_close(struct ctx *c)
+{
+	debug("Closing TCP_REPAIR helper socket");
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
+	close(c->fd_repair);
+	c->fd_repair = -1;
+}
+
+/**
+ * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_handler(struct ctx *c, uint32_t events)
+{
+	(void)events;
+
+	repair_close(c);
+}
+
+/**
+ * repair_flush() - Flush current set of sockets to helper, with current command
+ * @c:		Execution context
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int repair_flush(struct ctx *c)
+{
+	struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+	int8_t reply;
+
+	if (!repair_nfds)
+		return 0;
+
+	msg = (struct msghdr){ NULL, 0, &iov, 1,
+			       buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
+	memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
+
+	repair_nfds = 0;
+
+	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
+		int ret = -errno;
+		err_perror("Failed to send sockets to TCP_REPAIR helper");
+		repair_close(c);
+		return ret;
+	}
+
+	if (recv(c->fd_repair, &reply, sizeof(reply), 0) < 0) {
+		int ret = -errno;
+		err_perror("Failed to receive reply from TCP_REPAIR helper");
+		repair_close(c);
+		return ret;
+	}
+
+	if (reply != repair_cmd) {
+		err("Unexpected reply from TCP_REPAIR helper: %d", reply);
+		repair_close(c);
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+/**
+ * repair_set() - Add socket to TCP_REPAIR set with given command
+ * @c:		Execution context
+ * @s:		Socket to add
+ * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+/* cppcheck-suppress unusedFunction */
+int repair_set(struct ctx *c, int s, int cmd)
+{
+	int rc;
+
+	if (repair_nfds && repair_cmd != cmd) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	repair_cmd = cmd;
+	repair_fds[repair_nfds++] = s;
+
+	if (repair_nfds >= SCM_MAX_FD) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	return 0;
+}
diff --git a/repair.h b/repair.h
new file mode 100644
index 0000000..de279d6
--- /dev/null
+++ b/repair.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef REPAIR_H
+#define REPAIR_H
+
+void repair_sock_init(const struct ctx *c);
+void repair_listen_handler(struct ctx *c, uint32_t events);
+void repair_handler(struct ctx *c, uint32_t events);
+void repair_close(struct ctx *c);
+int repair_flush(struct ctx *c);
+int repair_set(struct ctx *c, int s, int cmd);
+
+#endif /* REPAIR_H */
diff --git a/tap.c b/tap.c
index 8c92d23..d0673e5 100644
--- a/tap.c
+++ b/tap.c
@@ -56,6 +56,7 @@
 #include "netlink.h"
 #include "pasta.h"
 #include "packet.h"
+#include "repair.h"
 #include "tap.h"
 #include "log.h"
 #include "vhost_user.h"
@@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		tap_pasta_input(c, now);
 }
 
-/**
- * tap_sock_unix_open() - Create and bind AF_UNIX socket
- * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
- *
- * Return: socket descriptor on success, won't return on failure
- */
-int tap_sock_unix_open(char *sock_path)
-{
-	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
-	struct sockaddr_un addr = {
-		.sun_family = AF_UNIX,
-	};
-	int i;
-
-	if (fd < 0)
-		die_perror("Failed to open UNIX domain socket");
-
-	for (i = 1; i < UNIX_SOCK_MAX; i++) {
-		char *path = addr.sun_path;
-		int ex, ret;
-
-		if (*sock_path)
-			memcpy(path, sock_path, UNIX_PATH_MAX);
-		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
-					UNIX_SOCK_PATH, i))
-			die_perror("Can't build UNIX domain socket path");
-
-		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
-			    0);
-		if (ex < 0)
-			die_perror("Failed to check for UNIX domain conflicts");
-
-		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
-		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
-			     errno != EACCES)) {
-			if (*sock_path)
-				die("Socket path %s already in use", path);
-
-			close(ex);
-			continue;
-		}
-		close(ex);
-
-		unlink(path);
-		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
-		if (*sock_path && ret)
-			die_perror("Failed to bind UNIX domain socket");
-
-		if (!ret)
-			break;
-	}
-
-	if (i == UNIX_SOCK_MAX)
-		die_perror("Failed to bind UNIX domain socket");
-
-	info("UNIX domain socket bound at %s", addr.sun_path);
-	if (!*sock_path)
-		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
-
-	return fd;
-}
-
 /**
  * tap_backend_show_hints() - Give help information to start QEMU
  * @c:		Execution context
@@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
 		tap_sock_tun_init(c);
 		break;
 	case MODE_VU:
+		repair_sock_init(c);
+		/* fall through */
 	case MODE_PASST:
 		tap_sock_unix_init(c);
 
diff --git a/util.c b/util.c
index ba33866..656e86a 100644
--- a/util.c
+++ b/util.c
@@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	return fd;
 }
 
+/**
+ * sock_unix() - Create and bind AF_UNIX socket
+ * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
+ *
+ * Return: socket descriptor on success, won't return on failure
+ */
+int sock_unix(char *sock_path)
+{
+	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+	struct sockaddr_un addr = {
+		.sun_family = AF_UNIX,
+	};
+	int i;
+
+	if (fd < 0)
+		die_perror("Failed to open UNIX domain socket");
+
+	for (i = 1; i < UNIX_SOCK_MAX; i++) {
+		char *path = addr.sun_path;
+		int ex, ret;
+
+		if (*sock_path)
+			memcpy(path, sock_path, UNIX_PATH_MAX);
+		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
+					UNIX_SOCK_PATH, i))
+			die_perror("Can't build UNIX domain socket path");
+
+		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
+			    0);
+		if (ex < 0)
+			die_perror("Failed to check for UNIX domain conflicts");
+
+		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
+		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
+			     errno != EACCES)) {
+			if (*sock_path)
+				die("Socket path %s already in use", path);
+
+			close(ex);
+			continue;
+		}
+		close(ex);
+
+		unlink(path);
+		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
+		if (*sock_path && ret)
+			die_perror("Failed to bind UNIX domain socket");
+
+		if (!ret)
+			break;
+	}
+
+	if (i == UNIX_SOCK_MAX)
+		die_perror("Failed to bind UNIX domain socket");
+
+	info("UNIX domain socket bound at %s", addr.sun_path);
+	if (!*sock_path)
+		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
+
+	return fd;
+}
+
 /**
  * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
  * @c:		Execution context
diff --git a/util.h b/util.h
index 7df7767..50e96d3 100644
--- a/util.h
+++ b/util.h
@@ -217,6 +217,7 @@ struct ctx;
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
 	       const char *ifname, bool v6only, uint32_t data);
+int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
 int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);

From f3fe795ff58656c39a39dbfac47fe6769f5ce293 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 12 Feb 2025 18:07:16 +1100
Subject: [PATCH 065/221] vhost_user: Make source quit after reporting
 migration state

This will close all the sockets we currently have open in repair mode,
and completes our migration tasks as source. If the hypervisor wants
to have us back at this point, somebody needs to restart us.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/vhost_user.c b/vhost_user.c
index 256c8ab..7ab1377 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -1203,4 +1203,11 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 
 	if (reply_requested)
 		vu_send_reply(fd, &msg);
+
+	if (msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
+	    vdev->context->device_state_result == 0 &&
+	    !vdev->context->migrate_target) {
+		info("Migration complete, exiting");
+		_exit(EXIT_SUCCESS);
+	}
 }

From 6f122f0171fe4bc235d572945e0bf963e81139ea Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 12 Feb 2025 18:07:17 +1100
Subject: [PATCH 066/221] tcp: Get bound address for connected inbound sockets
 too

So that we can bind inbound sockets to specific addresses, like we
already do for outbound sockets.

While at it, change the error message in tcp_conn_from_tap() to match
this one.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c       |  6 +++---
 flow_table.h |  6 +++---
 tcp.c        | 22 ++++++++++++++--------
 3 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/flow.c b/flow.c
index a6fe6d1..3ac551b 100644
--- a/flow.c
+++ b/flow.c
@@ -390,9 +390,9 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
  *
  * Return: pointer to the initiating flowside information
  */
-const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
-					const union sockaddr_inany *ssa,
-					in_port_t dport)
+struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
+				  const union sockaddr_inany *ssa,
+				  in_port_t dport)
 {
 	struct flowside *ini = &flow->f.side[INISIDE];
 
diff --git a/flow_table.h b/flow_table.h
index eeb6f41..9a2ff24 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -161,9 +161,9 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
 					sa_family_t af,
 					const void *saddr, in_port_t sport,
 					const void *daddr, in_port_t dport);
-const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
-					const union sockaddr_inany *ssa,
-					in_port_t dport);
+struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
+				  const union sockaddr_inany *ssa,
+				  in_port_t dport);
 const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
 				      sa_family_t af,
 				      const void *saddr, in_port_t sport,
diff --git a/tcp.c b/tcp.c
index b87478f..a1d6c53 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1536,12 +1536,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 
 	if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
 		sl = sizeof(sa);
-		if (!getsockname(s, &sa.sa, &sl)) {
+		if (!getsockname(s, &sa.sa, &sl))
 			inany_from_sockaddr(&tgt->oaddr, &tgt->oport, &sa);
-		} else {
-			err("Failed to get local address for socket: %s",
-			    strerror_(errno));
-		}
+		else
+			err_perror("Can't get local address for socket %i", s);
 	}
 
 	FLOW_ACTIVATE(conn);
@@ -2075,9 +2073,9 @@ static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
 void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
-	const struct flowside *ini;
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
+	struct flowside *ini;
 	union flow *flow;
 	int s;
 
@@ -2093,12 +2091,20 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 	tcp_sock_set_bufsize(c, s);
 	tcp_sock_set_nodelay(s);
 
-	/* FIXME: When listening port has a specific bound address, record that
-	 * as our address
+	/* FIXME: If useful: when the listening port has a specific bound
+	 * address, record that as our address, as implemented for vhost-user
+	 * mode only, below.
 	 */
 	ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa,
 			       ref.tcp_listen.port);
 
+	if (c->mode == MODE_VU) { /* Rebind to same address after migration */
+		if (!getsockname(s, &sa.sa, &sl))
+			inany_from_sockaddr(&ini->oaddr, &ini->oport, &sa);
+		else
+			err_perror("Can't get local address for socket %i", s);
+	}
+
 	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
 		char sastr[SOCKADDR_STRLEN];
 

From a3011584563bb7d6cf46416e8e84873c2615ad63 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Feb 2025 18:07:19 +1100
Subject: [PATCH 067/221] rampstream: Add utility to test for corruption of
 data streams

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/.gitignore             |   1 +
 test/Makefile               |   5 +-
 test/migrate/rampstream_in  |  59 +++++++++++++++
 test/migrate/rampstream_out |  55 ++++++++++++++
 test/passt.mbuto            |   5 +-
 test/rampstream-check.sh    |   3 +
 test/rampstream.c           | 143 ++++++++++++++++++++++++++++++++++++
 7 files changed, 267 insertions(+), 4 deletions(-)
 create mode 100644 test/migrate/rampstream_in
 create mode 100644 test/migrate/rampstream_out
 create mode 100755 test/rampstream-check.sh
 create mode 100644 test/rampstream.c

diff --git a/test/.gitignore b/test/.gitignore
index 6dd4790..3573444 100644
--- a/test/.gitignore
+++ b/test/.gitignore
@@ -8,5 +8,6 @@ QEMU_EFI.fd
 *.raw.xz
 *.bin
 nstool
+rampstream
 guest-key
 guest-key.pub
diff --git a/test/Makefile b/test/Makefile
index 5e49047..bf63db8 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -52,7 +52,8 @@ UBUNTU_IMGS = $(UBUNTU_OLD_IMGS) $(UBUNTU_NEW_IMGS)
 
 DOWNLOAD_ASSETS = mbuto podman \
 	$(DEBIAN_IMGS) $(FEDORA_IMGS) $(OPENSUSE_IMGS) $(UBUNTU_IMGS)
-TESTDATA_ASSETS = small.bin big.bin medium.bin
+TESTDATA_ASSETS = small.bin big.bin medium.bin \
+	rampstream
 LOCAL_ASSETS = mbuto.img mbuto.mem.img podman/bin/podman QEMU_EFI.fd \
 	$(DEBIAN_IMGS:%=prepared-%) $(FEDORA_IMGS:%=prepared-%) \
 	$(UBUNTU_NEW_IMGS:%=prepared-%) \
@@ -85,7 +86,7 @@ podman/bin/podman: pull-podman
 guest-key guest-key.pub:
 	ssh-keygen -f guest-key -N ''
 
-mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub $(TESTDATA_ASSETS)
+mbuto.img: passt.mbuto mbuto/mbuto guest-key.pub rampstream-check.sh $(TESTDATA_ASSETS)
 	./mbuto/mbuto -p ./$< -c lz4 -f $@
 
 mbuto.mem.img: passt.mem.mbuto mbuto ../passt.avx2
diff --git a/test/migrate/rampstream_in b/test/migrate/rampstream_in
new file mode 100644
index 0000000..46f4143
--- /dev/null
+++ b/test/migrate/rampstream_in
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/basic - Check basic migration functionality
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+set	RAMPS 6000000
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv4: host > guest
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+guest1b	socat -u TCP4-LISTEN:10001 EXEC:"rampstream-check.sh __RAMPS__"
+sleep	1
+hostb	socat -u EXEC:"test/rampstream send __RAMPS__" TCP4:__ADDR1__:10001
+
+sleep 1
+
+#mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+hostw
+
+guest2	cat rampstream.err
+guest2	[ $(cat rampstream.status) -eq 0 ]
diff --git a/test/migrate/rampstream_out b/test/migrate/rampstream_out
new file mode 100644
index 0000000..91b9c63
--- /dev/null
+++ b/test/migrate/rampstream_out
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/basic - Check basic migration functionality
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+set	RAMPS 6000000
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv4: guest > host
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+hostb	socat -u TCP4-LISTEN:10006 EXEC:"test/rampstream check __RAMPS__"
+sleep	1
+guest1b	socat -u EXEC:"rampstream send __RAMPS__" TCP4:__MAP_HOST4__:10006
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+hostw
diff --git a/test/passt.mbuto b/test/passt.mbuto
index e45a284..5e00132 100755
--- a/test/passt.mbuto
+++ b/test/passt.mbuto
@@ -13,7 +13,8 @@
 PROGS="${PROGS:-ash,dash,bash ip mount ls insmod mkdir ln cat chmod lsmod
        modprobe find grep mknod mv rm umount jq iperf3 dhclient hostname
        sed tr chown sipcalc cut socat dd strace ping tail killall sleep sysctl
-       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump env}"
+       nproc tcp_rr tcp_crr udp_rr which tee seq bc sshd ssh-keygen cmp tcpdump
+       env}"
 
 # OpenSSH 9.8 introduced split binaries, with sshd being the daemon, and
 # sshd-session the per-session program. We need the latter as well, and the path
@@ -31,7 +32,7 @@ LINKS="${LINKS:-
 
 DIRS="${DIRS} /tmp /usr/sbin /usr/share /var/log /var/lib /etc/ssh /run/sshd /root/.ssh"
 
-COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin"
+COPIES="${COPIES} small.bin,/root/small.bin medium.bin,/root/medium.bin big.bin,/root/big.bin rampstream,/bin/rampstream rampstream-check.sh,/bin/rampstream-check.sh"
 
 FIXUP="${FIXUP}"'
 	mv /sbin/* /usr/sbin || :
diff --git a/test/rampstream-check.sh b/test/rampstream-check.sh
new file mode 100755
index 0000000..c27acdb
--- /dev/null
+++ b/test/rampstream-check.sh
@@ -0,0 +1,3 @@
+#! /bin/sh
+
+(rampstream check "$@" 2>&1; echo $? > rampstream.status) | tee rampstream.err
diff --git a/test/rampstream.c b/test/rampstream.c
new file mode 100644
index 0000000..8d81296
--- /dev/null
+++ b/test/rampstream.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* rampstream - Generate a check and stream of bytes in a ramp pattern
+ *
+ * Copyright Red Hat
+ * Author: David Gibson <david@gibson.dropbear.id.au>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+
+/* Length of the repeating ramp.  This is a deliberately not a "round" number so
+ * that we're very likely to misalign with likely block or chunk sizes of the
+ * transport.  That means we'll detect gaps in the stream, even if they occur
+ * neatly on block boundaries.  Specifically this is the largest 8-bit prime. */
+#define RAMPLEN		251
+
+#define INTERVAL	10000
+
+#define	ARRAY_SIZE(a)	((int)(sizeof(a) / sizeof((a)[0])))
+
+#define die(...)						\
+	do {							\
+		fprintf(stderr, "rampstream: " __VA_ARGS__);	\
+		exit(1);					\
+	} while (0)
+
+static void usage(void)
+{
+	die("Usage:\n"
+	    "  rampstream send <number>\n"
+	    "    Generate a ramp pattern of bytes on stdout, repeated <number>\n"
+	    "    times\n"
+	    "  rampstream check <number>\n"
+	    "    Check a ramp pattern of bytes on stdin, repeater <number>\n"
+	    "    times\n");
+}
+
+static void ramp_send(unsigned long long num, const uint8_t *ramp)
+{
+	unsigned long long i;
+
+	for (i = 0; i < num; i++) {
+		int off = 0;
+		ssize_t rc;
+
+		if (i % INTERVAL == 0)
+			fprintf(stderr, "%llu...\r", i);
+
+		while (off < RAMPLEN) {
+			rc = write(1, ramp + off, RAMPLEN - off);
+			if (rc < 0) {
+				if (errno == EINTR ||
+				    errno == EAGAIN ||
+				    errno == EWOULDBLOCK)
+					continue;
+				die("Error writing ramp: %s\n",
+				    strerror(errno));
+			}
+			if (rc == 0)
+				die("Zero length write\n");
+			off += rc;
+		}
+	}
+}
+
+static void ramp_check(unsigned long long num, const uint8_t *ramp)
+{
+	unsigned long long i;
+
+	for (i = 0; i < num; i++) {
+		uint8_t buf[RAMPLEN];
+		int off = 0;
+		ssize_t rc;
+
+		if (i % INTERVAL == 0)
+			fprintf(stderr, "%llu...\r", i);
+		
+		while (off < RAMPLEN) {
+			rc = read(0, buf + off, RAMPLEN - off);
+			if (rc < 0) {
+				if (errno == EINTR ||
+				    errno == EAGAIN ||
+				    errno == EWOULDBLOCK)
+					continue;
+				die("Error reading ramp: %s\n",
+				    strerror(errno));
+			}
+			if (rc == 0)
+				die("Unexpected EOF, ramp %llu, byte %d\n",
+				    i, off);
+			off += rc;
+		}
+
+		if (memcmp(buf, ramp, sizeof(buf)) != 0) {
+			int j, k;
+
+			for (j = 0; j < RAMPLEN; j++)
+				if (buf[j] != ramp[j])
+					break;
+			for (k = j; k < RAMPLEN && k < j + 16; k++)
+				fprintf(stderr,
+					"Byte %d: expected 0x%02x, got 0x%02x\n",
+					k, ramp[k], buf[k]);
+			die("Data mismatch, ramp %llu, byte %d\n", i, j);
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	const char *subcmd = argv[1];
+	unsigned long long num;
+	uint8_t ramp[RAMPLEN];
+	char *e;
+	int i;
+
+	if (argc < 2)
+		usage();
+
+	errno = 0;
+	num = strtoull(argv[2], &e, 0);
+	if (*e || errno)
+		usage();
+
+	/* Initialize the ramp block */
+	for (i = 0; i < RAMPLEN; i++)
+		ramp[i] = i;
+
+	if (strcmp(subcmd, "send") == 0)
+		ramp_send(num, ramp);
+	else if (strcmp(subcmd, "check") == 0)
+		ramp_check(num, ramp);
+	else
+		usage();
+
+	exit(0);
+}

From 9a84df4c3f9608c5e814f24ee3306a6c64a73edd Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 00:42:52 +0100
Subject: [PATCH 068/221] selinux: Add rules needed to run tests

...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/passt.te |  4 ++++
 contrib/selinux/pasta.te | 14 ++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/contrib/selinux/passt.te b/contrib/selinux/passt.te
index c6cea34..6e7a4cb 100644
--- a/contrib/selinux/passt.te
+++ b/contrib/selinux/passt.te
@@ -20,6 +20,7 @@ require {
 	type fs_t;
 	type tmp_t;
 	type user_tmp_t;
+	type user_home_t;
 	type tmpfs_t;
 	type root_t;
 
@@ -80,6 +81,9 @@ allow passt_t root_t:dir mounton;
 allow passt_t tmp_t:dir { add_name mounton remove_name write };
 allow passt_t tmpfs_t:filesystem mount;
 allow passt_t fs_t:filesystem unmount;
+allow passt_t user_home_t:dir search;
+allow passt_t user_tmp_t:fifo_file append;
+allow passt_t user_tmp_t:file map;
 
 manage_files_pattern(passt_t, user_tmp_t, user_tmp_t)
 files_pid_filetrans(passt_t, user_tmp_t, file)
diff --git a/contrib/selinux/pasta.te b/contrib/selinux/pasta.te
index d0ff0cc..89c8043 100644
--- a/contrib/selinux/pasta.te
+++ b/contrib/selinux/pasta.te
@@ -18,6 +18,7 @@ require {
 	type bin_t;
 	type user_home_t;
 	type user_home_dir_t;
+	type user_tmp_t;
 	type fs_t;
 	type tmp_t;
 	type tmpfs_t;
@@ -56,8 +57,10 @@ require {
 	attribute port_type;
 	type port_t;
 	type http_port_t;
+	type http_cache_port_t;
 	type ssh_port_t;
 	type reserved_port_t;
+	type unreserved_port_t;
 	type dns_port_t;
 	type dhcpc_port_t;
 	type chronyd_port_t;
@@ -122,8 +125,8 @@ domain_auto_trans(pasta_t, ping_exec_t, ping_t);
 
 allow pasta_t nsfs_t:file { open read };
 
-allow pasta_t user_home_t:dir getattr;
-allow pasta_t user_home_t:file { open read getattr setattr };
+allow pasta_t user_home_t:dir { getattr search };
+allow pasta_t user_home_t:file { open read getattr setattr execute execute_no_trans map};
 allow pasta_t user_home_dir_t:dir { search getattr open add_name read write };
 allow pasta_t user_home_dir_t:file { create open read write };
 allow pasta_t tmp_t:dir { add_name mounton remove_name write };
@@ -133,6 +136,11 @@ allow pasta_t root_t:dir mounton;
 manage_files_pattern(pasta_t, pasta_pid_t, pasta_pid_t)
 files_pid_filetrans(pasta_t, pasta_pid_t, file)
 
+allow pasta_t user_tmp_t:dir { add_name remove_name search write };
+allow pasta_t user_tmp_t:fifo_file append;
+allow pasta_t user_tmp_t:file { create open write };
+allow pasta_t user_tmp_t:sock_file { create unlink };
+
 allow pasta_t console_device_t:chr_file { open write getattr ioctl };
 allow pasta_t user_devpts_t:chr_file { getattr read write ioctl };
 logging_send_syslog_msg(pasta_t)
@@ -160,6 +168,8 @@ allow pasta_t self:udp_socket create_stream_socket_perms;
 allow pasta_t reserved_port_t:udp_socket name_bind;
 allow pasta_t llmnr_port_t:tcp_socket name_bind;
 allow pasta_t llmnr_port_t:udp_socket name_bind;
+allow pasta_t http_cache_port_t:tcp_socket { name_bind name_connect };
+allow pasta_t unreserved_port_t:udp_socket name_bind;
 corenet_udp_sendrecv_generic_node(pasta_t)
 corenet_udp_bind_generic_node(pasta_t)
 allow pasta_t node_t:icmp_socket { name_bind node_bind };

From 98d474c8950e9cc5715d5686614fb0f504377303 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 22:00:57 +0100
Subject: [PATCH 069/221] contrib/selinux: Enable mapping guest memory for
 libvirt guests

This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/passt.te | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/contrib/selinux/passt.te b/contrib/selinux/passt.te
index 6e7a4cb..fc1320d 100644
--- a/contrib/selinux/passt.te
+++ b/contrib/selinux/passt.te
@@ -24,6 +24,12 @@ require {
 	type tmpfs_t;
 	type root_t;
 
+	# Workaround: passt --vhost-user needs to map guest memory, but
+	# libvirt doesn't maintain its own policy, which makes updates
+	# particularly complicated. To avoid breakage in the short term,
+	# deal with it in passt's own policy.
+	type svirt_image_t;
+
 	class file { ioctl getattr setattr create read write unlink open relabelto execute execute_no_trans map };
 	class dir { search write add_name remove_name mounton };
 	class chr_file { append read write open getattr ioctl };
@@ -131,3 +137,9 @@ allow passt_t user_tmp_t:dir { add_name write };
 allow passt_t user_tmp_t:file { create open };
 allow passt_t user_tmp_t:sock_file { create read write unlink };
 allow passt_t unconfined_t:unix_stream_socket { read write };
+
+# Workaround: passt --vhost-user needs to map guest memory, but
+# libvirt doesn't maintain its own policy, which makes updates
+# particularly complicated. To avoid breakage in the short term,
+# deal with it in passt's own policy.
+allow passt_t svirt_image_t:file { read write map };

From 30f1e082c3c0cee0a985b3c32e2b05280c596343 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 16:24:55 +0100
Subject: [PATCH 070/221] tcp: Keep updating window and checking for socket
 data after FIN from guest

Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

while at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index a1d6c53..16d01f6 100644
--- a/tcp.c
+++ b/tcp.c
@@ -338,7 +338,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 	"SYN_RCVD",	/* approximately maps to TAP_SYN_ACK_SENT */
 
 	/* Passive close: */
-	"CLOSE_WAIT", "CLOSE_WAIT", "LAST_ACK", "LAST_ACK", "LAST_ACK",
+	"CLOSE_WAIT", "CLOSE_WAIT", "CLOSE_WAIT", "LAST_ACK", "LAST_ACK",
 	/* Active close (+5): */
 	"CLOSING", "FIN_WAIT_1", "FIN_WAIT_1", "FIN_WAIT_2", "TIME_WAIT",
 };
@@ -1968,6 +1968,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	/* Established connections not accepting data from tap */
 	if (conn->events & TAP_FIN_RCVD) {
 		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
+		tcp_tap_window_update(conn, ntohs(th->window));
+		tcp_data_from_sock(c, conn);
 
 		if (conn->events & SOCK_FIN_RCVD &&
 		    conn->seq_ack_from_tap == conn->seq_to_tap)

From 71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 20:54:04 +0100
Subject: [PATCH 071/221] tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to
 maximum values

I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c        | 41 +++++++++--------------------------------
 tcp_conn.h   |  4 ++--
 tcp_splice.c |  6 +++---
 3 files changed, 14 insertions(+), 37 deletions(-)

diff --git a/tcp.c b/tcp.c
index 16d01f6..b978b30 100644
--- a/tcp.c
+++ b/tcp.c
@@ -738,24 +738,6 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
 	SNDBUF_SET(conn, MIN(INT_MAX, v));
 }
 
-/**
- * tcp_sock_set_bufsize() - Set SO_RCVBUF and SO_SNDBUF to maximum values
- * @s:		Socket, can be -1 to avoid check in the caller
- */
-static void tcp_sock_set_bufsize(const struct ctx *c, int s)
-{
-	int v = INT_MAX / 2; /* Kernel clamps and rounds, no need to check */
-
-	if (s == -1)
-		return;
-
-	if (!c->low_rmem && setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v)))
-		trace("TCP: failed to set SO_RCVBUF to %i", v);
-
-	if (!c->low_wmem && setsockopt(s, SOL_SOCKET, SO_SNDBUF, &v, sizeof(v)))
-		trace("TCP: failed to set SO_SNDBUF to %i", v);
-}
-
 /**
  * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
  * @s:		Socket, can be -1 to avoid check in the caller
@@ -1278,12 +1260,11 @@ int tcp_conn_pool_sock(int pool[])
 
 /**
  * tcp_conn_new_sock() - Open and prepare new socket for connection
- * @c:		Execution context
  * @af:		Address family
  *
  * Return: socket number on success, negative code if socket creation failed
  */
-static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
+static int tcp_conn_new_sock(sa_family_t af)
 {
 	int s;
 
@@ -1297,7 +1278,6 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 	if (s < 0)
 		return -errno;
 
-	tcp_sock_set_bufsize(c, s);
 	tcp_sock_set_nodelay(s);
 
 	return s;
@@ -1305,12 +1285,11 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 
 /**
  * tcp_conn_sock() - Obtain a connectable socket in the host/init namespace
- * @c:		Execution context
  * @af:		Address family (AF_INET or AF_INET6)
  *
  * Return: Socket fd on success, -errno on failure
  */
-int tcp_conn_sock(const struct ctx *c, sa_family_t af)
+int tcp_conn_sock(sa_family_t af)
 {
 	int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
 	int s;
@@ -1321,7 +1300,7 @@ int tcp_conn_sock(const struct ctx *c, sa_family_t af)
 	/* If the pool is empty we just open a new one without refilling the
 	 * pool to keep latency down.
 	 */
-	if ((s = tcp_conn_new_sock(c, af)) >= 0)
+	if ((s = tcp_conn_new_sock(af)) >= 0)
 		return s;
 
 	err("TCP: Unable to open socket for new connection: %s",
@@ -1462,7 +1441,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 		goto cancel;
 	}
 
-	if ((s = tcp_conn_sock(c, af)) < 0)
+	if ((s = tcp_conn_sock(af)) < 0)
 		goto cancel;
 
 	pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, tgt->eport);
@@ -1483,7 +1462,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	} else {
 		/* Not a local, bound destination, inconclusive test */
 		close(s);
-		if ((s = tcp_conn_sock(c, af)) < 0)
+		if ((s = tcp_conn_sock(af)) < 0)
 			goto cancel;
 	}
 
@@ -2090,7 +2069,6 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
-	tcp_sock_set_bufsize(c, s);
 	tcp_sock_set_nodelay(s);
 
 	/* FIXME: If useful: when the listening port has a specific bound
@@ -2434,13 +2412,12 @@ static int tcp_ns_socks_init(void *arg)
 
 /**
  * tcp_sock_refill_pool() - Refill one pool of pre-opened sockets
- * @c:		Execution context
  * @pool:	Pool of sockets to refill
  * @af:		Address family to use
  *
  * Return: 0 on success, negative error code if there was at least one error
  */
-int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af)
+int tcp_sock_refill_pool(int pool[], sa_family_t af)
 {
 	int i;
 
@@ -2450,7 +2427,7 @@ int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af)
 		if (pool[i] >= 0)
 			continue;
 
-		if ((fd = tcp_conn_new_sock(c, af)) < 0)
+		if ((fd = tcp_conn_new_sock(af)) < 0)
 			return fd;
 
 		pool[i] = fd;
@@ -2466,13 +2443,13 @@ int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af)
 static void tcp_sock_refill_init(const struct ctx *c)
 {
 	if (c->ifi4) {
-		int rc = tcp_sock_refill_pool(c, init_sock_pool4, AF_INET);
+		int rc = tcp_sock_refill_pool(init_sock_pool4, AF_INET);
 		if (rc < 0)
 			warn("TCP: Error refilling IPv4 host socket pool: %s",
 			     strerror_(-rc));
 	}
 	if (c->ifi6) {
-		int rc = tcp_sock_refill_pool(c, init_sock_pool6, AF_INET6);
+		int rc = tcp_sock_refill_pool(init_sock_pool6, AF_INET6);
 		if (rc < 0)
 			warn("TCP: Error refilling IPv6 host socket pool: %s",
 			     strerror_(-rc));
diff --git a/tcp_conn.h b/tcp_conn.h
index d342680..8c20805 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -143,8 +143,8 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn);
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
 void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);
-int tcp_conn_sock(const struct ctx *c, sa_family_t af);
-int tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
+int tcp_conn_sock(sa_family_t af);
+int tcp_sock_refill_pool(int pool[], sa_family_t af);
 void tcp_splice_refill(const struct ctx *c);
 
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index f048a82..f1a9223 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -351,7 +351,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 	int one = 1;
 
 	if (tgtpif == PIF_HOST)
-		conn->s[1] = tcp_conn_sock(c, af);
+		conn->s[1] = tcp_conn_sock(af);
 	else if (tgtpif == PIF_SPLICE)
 		conn->s[1] = tcp_conn_sock_ns(c, af);
 	else
@@ -703,13 +703,13 @@ static int tcp_sock_refill_ns(void *arg)
 	ns_enter(c);
 
 	if (c->ifi4) {
-		int rc = tcp_sock_refill_pool(c, ns_sock_pool4, AF_INET);
+		int rc = tcp_sock_refill_pool(ns_sock_pool4, AF_INET);
 		if (rc < 0)
 			warn("TCP: Error refilling IPv4 ns socket pool: %s",
 			     strerror_(-rc));
 	}
 	if (c->ifi6) {
-		int rc = tcp_sock_refill_pool(c, ns_sock_pool6, AF_INET6);
+		int rc = tcp_sock_refill_pool(ns_sock_pool6, AF_INET6);
 		if (rc < 0)
 			warn("TCP: Error refilling IPv6 ns socket pool: %s",
 			     strerror_(-rc));

From 7c33b1208632a9581d0ee7aabd1e0584a5d1fb20 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Sat, 15 Feb 2025 00:08:41 +1100
Subject: [PATCH 072/221] vhost_user: Clear ring address on GET_VRING_BASE

GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/vhost_user.c b/vhost_user.c
index 7ab1377..be1aa94 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -732,6 +732,7 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
 	msg->hdr.size = sizeof(msg->payload.state);
 
 	vdev->vq[idx].started = false;
+	vdev->vq[idx].vring.avail = 0;
 
 	if (vdev->vq[idx].call_fd != -1) {
 		close(vdev->vq[idx].call_fd);

From 667caa09c6d46d937b3076254176eded262b3eca Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 16 Feb 2025 08:16:33 +0100
Subject: [PATCH 073/221] tcp_splice: Don't wake up on input data if we can't
 write it anywhere

If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long when we're waiting for
output readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index f1a9223..8a39a6f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 		ev[1].events = EPOLLOUT;
 	}
 
-	flow_foreach_sidei(sidei)
-		ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
+	flow_foreach_sidei(sidei) {
+		if (events & OUT_WAIT(sidei)) {
+			ev[sidei].events |= EPOLLOUT;
+			ev[!sidei].events &= ~EPOLLIN;
+		}
+	}
 }
 
 /**

From 01b6a164d94f26be7ad500f71210bdb888f416aa Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sun, 16 Feb 2025 08:31:13 +0100
Subject: [PATCH 074/221] tcp_splice: A typo three years ago and SO_RCVLOWAT is
 gone

In commit e5eefe77435a ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 8a39a6f..5d845c9 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -556,7 +556,7 @@ eintr:
 			if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
 				continue;
 
-			if (conn->flags & lowat_set_flag &&
+			if (!(conn->flags & lowat_set_flag) &&
 			    readlen > (long)c->tcp.pipe_size / 10) {
 				int lowat = c->tcp.pipe_size / 4;
 

From 3e903bbb1f386ebb892b1196d339d2d705bce8a2 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Sat, 15 Feb 2025 06:13:13 +0100
Subject: [PATCH 075/221] repair, passt-repair: Build and warning fixes for
 musl

Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c |  4 +++-
 repair.c       | 13 +++++++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index 1174ae3..e0c366e 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -63,6 +63,7 @@ int main(int argc, char **argv)
 	struct cmsghdr *cmsg;
 	struct msghdr msg;
 	struct iovec iov;
+	size_t cmsg_len;
 	int op;
 
 	prctl(PR_SET_DUMPABLE, 0);
@@ -138,8 +139,9 @@ loop:
 		}
 	}
 	if (!n) {
+		cmsg_len = cmsg->cmsg_len; /* socklen_t is 'unsigned' on musl */
 		fprintf(stderr, "Invalid ancillary data length %zu from peer\n",
-			cmsg->cmsg_len);
+			cmsg_len);
 		_exit(1);
 	}
 
diff --git a/repair.c b/repair.c
index d288617..dac28a6 100644
--- a/repair.c
+++ b/repair.c
@@ -13,6 +13,7 @@
  */
 
 #include <errno.h>
+#include <sys/socket.h>
 #include <sys/uio.h>
 
 #include "util.h"
@@ -145,9 +146,9 @@ void repair_handler(struct ctx *c, uint32_t events)
  */
 int repair_flush(struct ctx *c)
 {
-	struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
 	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
-	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr)))) = { 0 };
+	struct iovec iov = { &repair_cmd, sizeof(repair_cmd) };
 	struct cmsghdr *cmsg;
 	struct msghdr msg;
 	int8_t reply;
@@ -155,8 +156,12 @@ int repair_flush(struct ctx *c)
 	if (!repair_nfds)
 		return 0;
 
-	msg = (struct msghdr){ NULL, 0, &iov, 1,
-			       buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 };
+	msg = (struct msghdr){ .msg_name = NULL, .msg_namelen = 0,
+			       .msg_iov = &iov, .msg_iovlen = 1,
+			       .msg_control = buf,
+			       .msg_controllen = CMSG_SPACE(sizeof(int) *
+							    repair_nfds),
+			       .msg_flags = 0 };
 	cmsg = CMSG_FIRSTHDR(&msg);
 
 	cmsg->cmsg_level = SOL_SOCKET;

From 89ecf2fd40adab549bdf25cdb68996f56d67b13e Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 23:14:13 +1100
Subject: [PATCH 076/221] migrate: Migrate TCP flows

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/passt.te |   4 +-
 flow.c                   | 243 +++++++++++
 flow.h                   |   8 +
 migrate.c                |  10 +
 passt.c                  |   6 +-
 repair.c                 |   1 -
 tcp.c                    | 919 +++++++++++++++++++++++++++++++++++++++
 tcp_conn.h               | 103 +++++
 8 files changed, 1288 insertions(+), 6 deletions(-)

diff --git a/contrib/selinux/passt.te b/contrib/selinux/passt.te
index fc1320d..f595079 100644
--- a/contrib/selinux/passt.te
+++ b/contrib/selinux/passt.te
@@ -45,7 +45,7 @@ require {
 	type net_conf_t;
 	type proc_net_t;
 	type node_t;
-	class tcp_socket { create accept listen name_bind name_connect };
+	class tcp_socket { create accept listen name_bind name_connect getattr };
 	class udp_socket { create accept listen };
 	class icmp_socket { bind create name_bind node_bind setopt read write };
 	class sock_file { create unlink write };
@@ -129,7 +129,7 @@ corenet_udp_sendrecv_all_ports(passt_t)
 allow passt_t node_t:icmp_socket { name_bind node_bind };
 allow passt_t port_t:icmp_socket name_bind;
 
-allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write };
+allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr };
 allow passt_t self:udp_socket { create getopt setopt connect bind read write };
 allow passt_t self:icmp_socket { bind create setopt read write };
 
diff --git a/flow.c b/flow.c
index 3ac551b..cc881e8 100644
--- a/flow.c
+++ b/flow.c
@@ -19,6 +19,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "repair.h"
 
 const char *flow_state_str[] = {
 	[FLOW_STATE_FREE]	= "FREE",
@@ -52,6 +53,35 @@ const uint8_t flow_proto[] = {
 static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 	      "flow_proto[] doesn't match enum flow_type");
 
+#define foreach_flow(i, flow, bound)					\
+	for ((i) = 0, (flow) = &flowtab[(i)];				\
+	     (i) < (bound);						\
+	     (i)++, (flow) = &flowtab[(i)])				\
+		if ((flow)->f.state == FLOW_STATE_FREE)			\
+			(i) += (flow)->free.n - 1;			\
+		else
+
+#define foreach_active_flow(i, flow, bound)				\
+	foreach_flow((i), (flow), (bound))				\
+		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
+			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
+			continue;					\
+		else
+
+#define foreach_tcp_flow(i, flow, bound)				\
+	foreach_active_flow((i), (flow), (bound))			\
+		if ((flow)->f.type != FLOW_TCP)				\
+			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
+			continue;					\
+		else
+
+#define foreach_established_tcp_flow(i, flow, bound)			\
+	foreach_tcp_flow((i), (flow), (bound))				\
+		if (!tcp_flow_is_established(&(flow)->tcp))		\
+			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
+			continue;					\
+		else
+
 /* Global Flow Table */
 
 /**
@@ -874,6 +904,219 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	*last_next = FLOW_MAX;
 }
 
+/**
+ * flow_migrate_source_rollback() - Disable repair mode, return failure
+ * @c:		Execution context
+ * @max_flow:	Maximum index of affected flows
+ * @ret:	Negative error code
+ *
+ * Return: @ret
+ */
+static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
+					int ret)
+{
+	union flow *flow;
+	unsigned i;
+
+	debug("...roll back migration");
+
+	foreach_established_tcp_flow(i, flow, max_flow)
+		if (tcp_flow_repair_off(c, &flow->tcp))
+			die("Failed to roll back TCP_REPAIR mode");
+
+	if (repair_flush(c))
+		die("Failed to roll back TCP_REPAIR mode");
+
+	return ret;
+}
+
+/**
+ * flow_migrate_repair_all() - Turn repair mode on or off for all flows
+ * @c:		Execution context
+ * @enable:	Switch repair mode on if set, off otherwise
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int flow_migrate_repair_all(struct ctx *c, bool enable)
+{
+	union flow *flow;
+	unsigned i;
+	int rc;
+
+	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+		if (enable)
+			rc = tcp_flow_repair_on(c, &flow->tcp);
+		else
+			rc = tcp_flow_repair_off(c, &flow->tcp);
+
+		if (rc) {
+			debug("Can't %s repair mode: %s",
+			      enable ? "enable" : "disable", strerror_(-rc));
+			return flow_migrate_source_rollback(c, i, rc);
+		}
+	}
+
+	if ((rc = repair_flush(c))) {
+		debug("Can't %s repair mode: %s",
+		      enable ? "enable" : "disable", strerror_(-rc));
+		return flow_migrate_source_rollback(c, i, rc);
+	}
+
+	return 0;
+}
+
+/**
+ * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
+ * @c:		Execution context
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor (unused)
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
+			    int fd)
+{
+	int rc;
+
+	(void)stage;
+	(void)fd;
+
+	if ((rc = flow_migrate_repair_all(c, true)))
+		return -rc;
+
+	return 0;
+}
+
+/**
+ * flow_migrate_source() - Dump all the remaining information and send data
+ * @c:		Execution context (unused)
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
+			int fd)
+{
+	uint32_t count = 0;
+	bool first = true;
+	union flow *flow;
+	unsigned i;
+	int rc;
+
+	(void)c;
+	(void)stage;
+
+	foreach_established_tcp_flow(i, flow, FLOW_MAX)
+		count++;
+
+	count = htonl(count);
+	if (write_all_buf(fd, &count, sizeof(count))) {
+		rc = errno;
+		err_perror("Can't send flow count (%u)", ntohl(count));
+		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
+	}
+
+	debug("Sending %u flows", ntohl(count));
+
+	/* Dump and send information that can be stored in the flow table.
+	 *
+	 * Limited rollback options here: if we fail to transfer any data (that
+	 * is, on the first flow), undo everything and resume. Otherwise, the
+	 * stream might now be inconsistent, and we might have closed listening
+	 * TCP sockets, so just terminate.
+	 */
+	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+		rc = tcp_flow_migrate_source(fd, &flow->tcp);
+		if (rc) {
+			err("Can't send data, flow %u: %s", i, strerror_(-rc));
+			if (!first)
+				die("Inconsistent migration state, exiting");
+
+			return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
+		}
+
+		first = false;
+	}
+
+	/* And then "extended" data (including window data we saved previously):
+	 * the target needs to set repair mode on sockets before it can set
+	 * this stuff, but it needs sockets (and flows) for that.
+	 *
+	 * This also closes sockets so that the target can start connecting
+	 * theirs: you can't sendmsg() to queues (using the socket) if the
+	 * socket is not connected (EPIPE), not even in repair mode. And the
+	 * target needs to restore queues now because we're sending the data.
+	 *
+	 * So, no rollback here, just try as hard as we can. Tolerate per-flow
+	 * failures but not if the stream might be inconsistent (reported here
+	 * as EIO).
+	 */
+	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+		rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp);
+		if (rc) {
+			err("Extended data for flow %u: %s", i, strerror_(-rc));
+
+			if (rc == -EIO)
+				die("Inconsistent migration state, exiting");
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * flow_migrate_target() - Receive flows and insert in flow table
+ * @c:		Execution context
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
+			int fd)
+{
+	uint32_t count;
+	unsigned i;
+	int rc;
+
+	(void)stage;
+
+	if (read_all_buf(fd, &count, sizeof(count)))
+		return errno;
+
+	count = ntohl(count);
+	debug("Receiving %u flows", count);
+
+	if ((rc = flow_migrate_repair_all(c, true)))
+		return -rc;
+
+	repair_flush(c);
+
+	/* TODO: flow header with type, instead? */
+	for (i = 0; i < count; i++) {
+		rc = tcp_flow_migrate_target(c, fd);
+		if (rc) {
+			debug("Migration data failure at flow %u: %s, abort",
+			      i, strerror_(-rc));
+			return -rc;
+		}
+	}
+
+	repair_flush(c);
+
+	for (i = 0; i < count; i++) {
+		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
+		if (rc) {
+			debug("Migration data failure at flow %u: %s, abort",
+			      i, strerror_(-rc));
+			return -rc;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * flow_init() - Initialise flow related data structures
  */
diff --git a/flow.h b/flow.h
index 24ba3ef..675726e 100644
--- a/flow.h
+++ b/flow.h
@@ -249,6 +249,14 @@ union flow;
 
 void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
+int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
+			      int fd);
+int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
+			    int fd);
+int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
+			int fd);
+int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
+			int fd);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/migrate.c b/migrate.c
index 1c59016..0fca77b 100644
--- a/migrate.c
+++ b/migrate.c
@@ -103,6 +103,16 @@ static const struct migrate_stage stages_v1[] = {
 		.source = seen_addrs_source_v1,
 		.target = seen_addrs_target_v1,
 	},
+	{
+		.name = "prepare flows",
+		.source = flow_migrate_source_pre,
+		.target = NULL,
+	},
+	{
+		.name = "transfer flows",
+		.source = flow_migrate_source,
+		.target = flow_migrate_target,
+	},
 	{ 0 },
 };
 
diff --git a/passt.c b/passt.c
index 6f9fb4d..68d1a28 100644
--- a/passt.c
+++ b/passt.c
@@ -223,9 +223,6 @@ int main(int argc, char **argv)
 		if (sigaction(SIGCHLD, &sa, NULL))
 			die_perror("Couldn't install signal handlers");
 
-		if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
-			die_perror("Couldn't set disposition for SIGPIPE");
-
 		c.mode = MODE_PASTA;
 	} else if (strstr(name, "passt")) {
 		c.mode = MODE_PASST;
@@ -233,6 +230,9 @@ int main(int argc, char **argv)
 		_exit(EXIT_FAILURE);
 	}
 
+	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
+		die_perror("Couldn't set disposition for SIGPIPE");
+
 	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
 
 	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
diff --git a/repair.c b/repair.c
index dac28a6..3ee089f 100644
--- a/repair.c
+++ b/repair.c
@@ -202,7 +202,6 @@ int repair_flush(struct ctx *c)
  *
  * Return: 0 on success, negative error code on failure
  */
-/* cppcheck-suppress unusedFunction */
 int repair_set(struct ctx *c, int s, int cmd)
 {
 	int rc;
diff --git a/tcp.c b/tcp.c
index b978b30..98e1c6a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -280,6 +280,7 @@
 #include <stddef.h>
 #include <string.h>
 #include <sys/epoll.h>
+#include <sys/ioctl.h>
 #include <sys/socket.h>
 #include <sys/timerfd.h>
 #include <sys/types.h>
@@ -287,6 +288,8 @@
 #include <time.h>
 #include <arpa/inet.h>
 
+#include <linux/sockios.h>
+
 #include "checksum.h"
 #include "util.h"
 #include "iov.h"
@@ -299,6 +302,7 @@
 #include "log.h"
 #include "inany.h"
 #include "flow.h"
+#include "repair.h"
 #include "linux_dep.h"
 
 #include "flow_table.h"
@@ -306,6 +310,21 @@
 #include "tcp_buf.h"
 #include "tcp_vu.h"
 
+#ifndef __USE_MISC
+/* From Linux UAPI, missing in netinet/tcp.h provided by musl */
+struct tcp_repair_opt {
+	__u32	opt_code;
+	__u32	opt_val;
+};
+
+enum {
+	TCP_NO_QUEUE,
+	TCP_RECV_QUEUE,
+	TCP_SEND_QUEUE,
+	TCP_QUEUES_NR,
+};
+#endif
+
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
@@ -326,6 +345,19 @@
 	 ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
 #define CONN_HAS(conn, set)	(((conn)->events & (set)) == (set))
 
+/* Buffers to migrate pending data from send and receive queues. No, they don't
+ * use memory if we don't use them. And we're going away after this, so splurge.
+ */
+#define TCP_MIGRATE_SND_QUEUE_MAX	(64 << 20)
+#define TCP_MIGRATE_RCV_QUEUE_MAX	(64 << 20)
+uint8_t tcp_migrate_snd_queue		[TCP_MIGRATE_SND_QUEUE_MAX];
+uint8_t tcp_migrate_rcv_queue		[TCP_MIGRATE_RCV_QUEUE_MAX];
+
+#define TCP_MIGRATE_RESTORE_CHUNK_MIN	1024 /* Try smaller when above this */
+
+/* "Extended" data (not stored in the flow table) for TCP flow migration */
+static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
+
 static const char *tcp_event_str[] __attribute((__unused__)) = {
 	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
 
@@ -1468,6 +1500,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 
 	conn->sock = s;
 	conn->timer = -1;
+	conn->listening_sock = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
 
 	conn->wnd_to_tap = WINDOW_DEFAULT;
@@ -1968,10 +2001,27 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		ack_due = 1;
 
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
+		socklen_t sl;
+		struct tcp_info tinfo;
+
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
 		tcp_send_flag(c, conn, ACK);
 		ack_due = 0;
+
+		/* If we received a FIN, but the socket is in TCP_ESTABLISHED
+		 * state, it must be a migrated socket. The kernel saw the FIN
+		 * on the source socket, but not on the target socket.
+		 *
+		 * Approximate the effect of that FIN: as we're sending a FIN
+		 * out ourselves, the socket is now in a state equivalent to
+		 * LAST_ACK. Now that we sent the FIN out, close it with a RST.
+		 */
+		sl = sizeof(tinfo);
+		getsockopt(conn->sock, SOL_TCP, TCP_INFO, &tinfo, &sl);
+		if (tinfo.tcpi_state == TCP_ESTABLISHED &&
+		    conn->events & SOCK_FIN_RCVD)
+			goto reset;
 	}
 
 	if (ack_due)
@@ -2054,6 +2104,7 @@ static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
 void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
+	struct tcp_tap_conn *conn;
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
 	struct flowside *ini;
@@ -2069,6 +2120,9 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
+	conn = (struct tcp_tap_conn *)flow;
+	conn->listening_sock = ref.fd;
+
 	tcp_sock_set_nodelay(s);
 
 	/* FIXME: If useful: when the listening port has a specific bound
@@ -2634,3 +2688,868 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
 	if (c->mode == MODE_PASTA)
 		tcp_splice_refill(c);
 }
+
+/**
+ * tcp_flow_is_established() - Was the connection established? Includes closing
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: true if the connection was established, false otherwise
+ */
+bool tcp_flow_is_established(const struct tcp_tap_conn *conn)
+{
+	return conn->events & ESTABLISHED;
+}
+
+/**
+ * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
+{
+	int rc = 0;
+
+	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
+		err("Failed to set TCP_REPAIR");
+
+	return rc;
+}
+
+/**
+ * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
+{
+	int rc = 0;
+
+	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
+		err("Failed to clear TCP_REPAIR");
+
+	return rc;
+}
+
+/**
+ * tcp_flow_dump_tinfo() - Dump window scale, tcpi_state, tcpi_options
+ * @c:		Execution context
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_dump_tinfo(int s, struct tcp_tap_transfer_ext *t)
+{
+	struct tcp_info tinfo;
+	socklen_t sl;
+
+	sl = sizeof(tinfo);
+	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
+		int rc = -errno;
+		err_perror("Querying TCP_INFO, socket %i", s);
+		return rc;
+	}
+
+	t->snd_ws		= tinfo.tcpi_snd_wscale;
+	t->rcv_ws		= tinfo.tcpi_rcv_wscale;
+	t->tcpi_state		= tinfo.tcpi_state;
+	t->tcpi_options		= tinfo.tcpi_options;
+
+	return 0;
+}
+
+/**
+ * tcp_flow_dump_mss() - Dump MSS clamp (not current MSS) via TCP_MAXSEG
+ * @c:		Execution context
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_dump_mss(int s, struct tcp_tap_transfer_ext *t)
+{
+	socklen_t sl = sizeof(t->mss);
+
+	if (getsockopt(s, SOL_TCP, TCP_MAXSEG, &t->mss, &sl)) {
+		int rc = -errno;
+		err_perror("Getting MSS, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_dump_wnd() - Dump current tcp_repair_window parameters
+ * @c:		Execution context
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_dump_wnd(int s, struct tcp_tap_transfer_ext *t)
+{
+	struct tcp_repair_window wnd;
+	socklen_t sl = sizeof(wnd);
+
+	if (getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl)) {
+		int rc = -errno;
+		err_perror("Getting window repair data, socket %i", s);
+		return rc;
+	}
+
+	t->snd_wl1	= wnd.snd_wl1;
+	t->snd_wnd	= wnd.snd_wnd;
+	t->max_window	= wnd.max_window;
+	t->rcv_wnd	= wnd.rcv_wnd;
+	t->rcv_wup	= wnd.rcv_wup;
+
+	/* If we received a FIN, we also need to adjust window parameters.
+	 *
+	 * This must be called after tcp_flow_dump_tinfo(), for t->tcpi_state.
+	 */
+	if (t->tcpi_state == TCP_CLOSE_WAIT || t->tcpi_state == TCP_LAST_ACK) {
+		t->rcv_wup--;
+		t->rcv_wnd++;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_wnd() - Restore window parameters from extended data
+ * @c:		Execution context
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_wnd(int s, const struct tcp_tap_transfer_ext *t)
+{
+	struct tcp_repair_window wnd;
+
+	wnd.snd_wl1	= t->snd_wl1;
+	wnd.snd_wnd	= t->snd_wnd;
+	wnd.max_window	= t->max_window;
+	wnd.rcv_wnd	= t->rcv_wnd;
+	wnd.rcv_wup	= t->rcv_wup;
+
+	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, sizeof(wnd))) {
+		int rc = -errno;
+		err_perror("Setting window data, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_select_queue() - Select queue (receive or send) for next operation
+ * @s:		Socket
+ * @queue:	TCP_RECV_QUEUE or TCP_SEND_QUEUE
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_select_queue(int s, int queue)
+{
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue))) {
+		int rc = -errno;
+		err_perror("Selecting TCP_SEND_QUEUE, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_dump_sndqueue() - Dump send queue, length of sent and not sent data
+ * @s:		Socket
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ *
+ * #syscalls:vu ioctl
+ */
+static int tcp_flow_dump_sndqueue(int s, struct tcp_tap_transfer_ext *t)
+{
+	ssize_t rc;
+
+	if (ioctl(s, SIOCOUTQ, &t->sndq) < 0) {
+		rc = -errno;
+		err_perror("Getting send queue size, socket %i", s);
+		return rc;
+	}
+
+	if (ioctl(s, SIOCOUTQNSD, &t->notsent) < 0) {
+		rc = -errno;
+		err_perror("Getting not sent count, socket %i", s);
+		return rc;
+	}
+
+	/* If we sent a FIN, SIOCOUTQ and SIOCOUTQNSD are one greater than the
+	 * actual pending queue length, because they are based on the sequence
+	 * numbers, not directly on the buffer contents.
+	 *
+	 * This must be called after tcp_flow_dump_tinfo(), for t->tcpi_state.
+	 */
+	if (t->tcpi_state == TCP_FIN_WAIT1 || t->tcpi_state == TCP_FIN_WAIT2 ||
+	    t->tcpi_state == TCP_LAST_ACK  || t->tcpi_state == TCP_CLOSING) {
+		if (t->sndq)
+			t->sndq--;
+		if (t->notsent)
+			t->notsent--;
+	}
+
+	if (t->notsent > t->sndq) {
+		err("Invalid notsent count socket %i, send: %u, not sent: %u",
+		    s, t->sndq, t->notsent);
+		return -EINVAL;
+	}
+
+	if (t->sndq > TCP_MIGRATE_SND_QUEUE_MAX) {
+		err("Send queue too large to migrate socket %i: %u bytes",
+		    s, t->sndq);
+		return -ENOBUFS;
+	}
+
+	rc = recv(s, tcp_migrate_snd_queue,
+		  MIN(t->sndq, TCP_MIGRATE_SND_QUEUE_MAX), MSG_PEEK);
+	if (rc < 0) {
+		if (errno == EAGAIN)  { /* EAGAIN means empty */
+			rc = 0;
+		} else {
+			rc = -errno;
+			err_perror("Can't read send queue, socket %i", s);
+			return rc;
+		}
+	}
+
+	if ((uint32_t)rc < t->sndq) {
+		err("Short read migrating send queue");
+		return -ENXIO;
+	}
+
+	t->notsent = MIN(t->notsent, t->sndq);
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_queue() - Restore contents of a given (pre-selected) queue
+ * @s:		Socket
+ * @len:	Length of data to be restored
+ * @buf:	Buffer with content of pending data queue
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_queue(int s, size_t len, uint8_t *buf)
+{
+	size_t chunk = len;
+	uint8_t *p = buf;
+
+	while (len > 0) {
+		ssize_t rc = send(s, p, MIN(len, chunk), 0);
+
+		if (rc < 0) {
+			if ((errno == ENOBUFS || errno == ENOMEM) &&
+			    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
+				chunk /= 2;
+				continue;
+			}
+
+			rc = -errno;
+			err_perror("Can't write queue, socket %i", s);
+			return rc;
+		}
+
+		len -= rc;
+		p += rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_dump_seq() - Dump current sequence of pre-selected queue
+ * @s:		Socket
+ * @v:		Sequence value, set on return
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_dump_seq(int s, uint32_t *v)
+{
+	socklen_t sl = sizeof(*v);
+
+	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, v, &sl)) {
+		int rc = -errno;
+		err_perror("Dumping sequence, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_seq() - Restore sequence for pre-selected queue
+ * @s:		Socket
+ * @v:		Sequence value to be set
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_seq(int s, const uint32_t *v)
+{
+	if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, v, sizeof(*v))) {
+		int rc = -errno;
+		err_perror("Setting sequence, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_dump_rcvqueue() - Dump receive queue and its length, seal/block it
+ * @s:		Socket
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ *
+ * #syscalls:vu ioctl
+ */
+static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
+{
+	ssize_t rc;
+
+	if (ioctl(s, SIOCINQ, &t->rcvq) < 0) {
+		rc = -errno;
+		err_perror("Get receive queue size, socket %i", s);
+		return rc;
+	}
+
+	/* If we received a FIN, SIOCINQ is one greater than the actual number
+	 * of bytes on the queue, because it's based on the sequence number
+	 * rather than directly on the buffer contents.
+	 *
+	 * This must be called after tcp_flow_dump_tinfo(), for t->tcpi_state.
+	 */
+	if (t->rcvq &&
+	    (t->tcpi_state == TCP_CLOSE_WAIT || t->tcpi_state == TCP_LAST_ACK))
+		t->rcvq--;
+
+	if (t->rcvq > TCP_MIGRATE_RCV_QUEUE_MAX) {
+		err("Receive queue too large to migrate socket %i: %u bytes",
+		    s, t->rcvq);
+		return -ENOBUFS;
+	}
+
+	rc = recv(s, tcp_migrate_rcv_queue, t->rcvq, MSG_PEEK);
+	if (rc < 0) {
+		if (errno == EAGAIN)  { /* EAGAIN means empty */
+			rc = 0;
+		} else {
+			rc = -errno;
+			err_perror("Can't read receive queue for socket %i", s);
+			return rc;
+		}
+	}
+
+	if ((uint32_t)rc < t->rcvq) {
+		err("Short read migrating receive queue");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_opt() - Set repair "options" (MSS, scale, SACK, timestamps)
+ * @s:		Socket
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_opt(int s, const struct tcp_tap_transfer_ext *t)
+{
+	const struct tcp_repair_opt opts[] = {
+		{ TCPOPT_WINDOW,		t->snd_ws + (t->rcv_ws << 16) },
+		{ TCPOPT_MAXSEG,		t->mss },
+		{ TCPOPT_SACK_PERMITTED,	0 },
+		{ TCPOPT_TIMESTAMP,		0 },
+	};
+	socklen_t sl;
+
+	sl = sizeof(opts[0]) * (2 +
+				!!(t->tcpi_options & TCPI_OPT_SACK) +
+				!!(t->tcpi_options & TCPI_OPT_TIMESTAMPS));
+
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl)) {
+		int rc = -errno;
+		err_perror("Setting repair options, socket %i", s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_source() - Send data (flow table) for flow, close listening
+ * @fd:		Descriptor for state migration
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
+{
+	struct tcp_tap_transfer t = {
+		.retrans		= conn->retrans,
+		.ws_from_tap		= conn->ws_from_tap,
+		.ws_to_tap		= conn->ws_to_tap,
+		.events			= conn->events,
+
+		.tap_mss		= htonl(MSS_GET(conn)),
+
+		.sndbuf			= htonl(conn->sndbuf),
+
+		.flags			= conn->flags,
+		.seq_dup_ack_approx	= conn->seq_dup_ack_approx,
+
+		.wnd_from_tap		= htons(conn->wnd_from_tap),
+		.wnd_to_tap		= htons(conn->wnd_to_tap),
+
+		.seq_to_tap		= htonl(conn->seq_to_tap),
+		.seq_ack_from_tap	= htonl(conn->seq_ack_from_tap),
+		.seq_from_tap		= htonl(conn->seq_from_tap),
+		.seq_ack_to_tap		= htonl(conn->seq_ack_to_tap),
+		.seq_init_from_tap	= htonl(conn->seq_init_from_tap),
+	};
+
+	memcpy(&t.pif, conn->f.pif, sizeof(t.pif));
+	memcpy(&t.side, conn->f.side, sizeof(t.side));
+
+	if (write_all_buf(fd, &t, sizeof(t))) {
+		int rc = -errno;
+		err_perror("Can't write migration data, socket %i", conn->sock);
+		return rc;
+	}
+
+	if (conn->listening_sock != -1 && !fcntl(conn->listening_sock, F_GETFD))
+		close(conn->listening_sock);
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
+ * @fd:		Descriptor for state migration
+ * @fidx:	Flow index
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
+ */
+int tcp_flow_migrate_source_ext(int fd, int fidx,
+				const struct tcp_tap_conn *conn)
+{
+	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
+	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
+	int s = conn->sock;
+	int rc;
+
+	/* Disable SO_PEEK_OFF, it will make accessing the queues in repair mode
+	 * weird.
+	 */
+	if (tcp_set_peek_offset(s, -1)) {
+		rc = -errno;
+		goto fail;
+	}
+
+	if ((rc = tcp_flow_dump_tinfo(s, t)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_mss(s, t)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_wnd(s, t)))
+		goto fail;
+
+	if ((rc = tcp_flow_select_queue(s, TCP_SEND_QUEUE)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_sndqueue(s, t)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_seq(s, &t->seq_snd)))
+		goto fail;
+
+	if ((rc = tcp_flow_select_queue(s, TCP_RECV_QUEUE)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_rcvqueue(s, t)))
+		goto fail;
+
+	if ((rc = tcp_flow_dump_seq(s, &t->seq_rcv)))
+		goto fail;
+
+	close(s);
+
+	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
+	 * based on the end of the queues.
+	 */
+	t->seq_rcv	-= t->rcvq;
+	t->seq_snd	-= t->sndq;
+
+	debug("Extended migration data, socket %i sequences send %u receive %u",
+	      s, t->seq_snd, t->seq_rcv);
+	debug("  pending queues: send %u not sent %u receive %u",
+	      t->sndq, t->notsent, t->rcvq);
+	debug("  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
+	      t->snd_wl1, t->snd_wnd, t->max_window, t->rcv_wnd, t->rcv_wup);
+	debug("  SO_PEEK_OFF %s  offset=%"PRIu32,
+	      peek_offset_cap ? "enabled" : "disabled", peek_offset);
+
+	/* Endianness fix-ups */
+	t->seq_snd	= htonl(t->seq_snd);
+	t->seq_rcv 	= htonl(t->seq_rcv);
+	t->sndq		= htonl(t->sndq);
+	t->notsent	= htonl(t->notsent);
+	t->rcvq		= htonl(t->rcvq);
+
+	t->snd_wl1	= htonl(t->snd_wl1);
+	t->snd_wnd	= htonl(t->snd_wnd);
+	t->max_window	= htonl(t->max_window);
+	t->rcv_wnd	= htonl(t->rcv_wnd);
+	t->rcv_wup	= htonl(t->rcv_wup);
+
+	if (write_all_buf(fd, t, sizeof(*t))) {
+		err_perror("Failed to write extended data, socket %i", s);
+		return -EIO;
+	}
+
+	if (write_all_buf(fd, tcp_migrate_snd_queue, ntohl(t->sndq))) {
+		err_perror("Failed to write send queue data, socket %i", s);
+		return -EIO;
+	}
+
+	if (write_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t->rcvq))) {
+		err_perror("Failed to write receive queue data, socket %i", s);
+		return -EIO;
+	}
+
+	return 0;
+
+fail:
+	/* For any type of failure dumping data, write an invalid extended data
+	 * descriptor that allows us to keep the stream in sync, but tells the
+	 * target to skip the flow. If we fail to transfer data, that's fatal:
+	 * return -EIO in that case (and only in that case).
+	 */
+	t->tcpi_state = 0; /* Not defined: tell the target to skip this flow */
+
+	if (write_all_buf(fd, t, sizeof(*t))) {
+		err_perror("Failed to write extended data, socket %i", s);
+		return -EIO;
+	}
+
+	if (rc == -EIO) /* but not a migration data transfer failure */
+		return -ENODATA;
+
+	return rc;
+}
+
+/**
+ * tcp_flow_repair_socket() - Open and bind socket, request repair mode
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6;
+	const struct flowside *sockside = HOSTFLOW(conn);
+	union sockaddr_inany a;
+	socklen_t sl;
+	int s, rc;
+
+	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
+
+	if ((conn->sock = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
+				 IPPROTO_TCP)) < 0) {
+		rc = -errno;
+		err_perror("Failed to create socket for migrated flow");
+		return rc;
+	}
+	s = conn->sock;
+
+	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int)))
+		debug_perror("Setting SO_REUSEADDR on socket %i", s);
+
+	tcp_sock_set_nodelay(s);
+
+	if ((rc = bind(s, &a.sa, sizeof(a)))) {
+		err_perror("Failed to bind socket for migrated flow");
+		goto err;
+	}
+
+	if ((rc = tcp_flow_repair_on(c, conn)))
+		goto err;
+
+	return 0;
+
+err:
+	close(s);
+	conn->sock = -1;
+	return rc;
+}
+
+/**
+ * tcp_flow_repair_connect() - Connect socket in repair mode, then turn it off
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_connect(const struct ctx *c,
+				   struct tcp_tap_conn *conn)
+{
+	const struct flowside *tgt = HOSTFLOW(conn);
+	int rc;
+
+	rc = flowside_connect(c, conn->sock, PIF_HOST, tgt);
+	if (rc) {
+		rc = -errno;
+		err_perror("Failed to connect migrated socket %i", conn->sock);
+		return rc;
+	}
+
+	conn->in_epoll = 0;
+	conn->timer = -1;
+	conn->listening_sock = -1;
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_target() - Receive data (flow table part) for flow, insert
+ * @c:		Execution context
+ * @fd:		Descriptor for state migration
+ *
+ * Return: 0 on success, negative on fatal failure, but 0 on single flow failure
+ */
+int tcp_flow_migrate_target(struct ctx *c, int fd)
+{
+	struct tcp_tap_transfer t;
+	struct tcp_tap_conn *conn;
+	union flow *flow;
+	int rc;
+
+	if (!(flow = flow_alloc())) {
+		err("Flow table full on migration target");
+		return 0;
+	}
+
+	if (read_all_buf(fd, &t, sizeof(t))) {
+		flow_alloc_cancel(flow);
+		err_perror("Failed to receive migration data");
+		return -errno;
+	}
+
+	flow->f.state = FLOW_STATE_TGT;
+	memcpy(&flow->f.pif, &t.pif, sizeof(flow->f.pif));
+	memcpy(&flow->f.side, &t.side, sizeof(flow->f.side));
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
+
+	conn->retrans			= t.retrans;
+	conn->ws_from_tap		= t.ws_from_tap;
+	conn->ws_to_tap			= t.ws_to_tap;
+	conn->events			= t.events;
+
+	conn->sndbuf			= htonl(t.sndbuf);
+
+	conn->flags			= t.flags;
+	conn->seq_dup_ack_approx	= t.seq_dup_ack_approx;
+
+	MSS_SET(conn,			  ntohl(t.tap_mss));
+
+	conn->wnd_from_tap		= ntohs(t.wnd_from_tap);
+	conn->wnd_to_tap		= ntohs(t.wnd_to_tap);
+
+	conn->seq_to_tap		= ntohl(t.seq_to_tap);
+	conn->seq_ack_from_tap		= ntohl(t.seq_ack_from_tap);
+	conn->seq_from_tap		= ntohl(t.seq_from_tap);
+	conn->seq_ack_to_tap		= ntohl(t.seq_ack_to_tap);
+	conn->seq_init_from_tap		= ntohl(t.seq_init_from_tap);
+
+	if ((rc = tcp_flow_repair_socket(c, conn))) {
+		flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
+		flow_alloc_cancel(flow);
+		return 0;
+	}
+
+	flow_hash_insert(c, TAP_SIDX(conn));
+	FLOW_ACTIVATE(conn);
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_target_ext() - Receive extended data for flow, set, connect
+ * @c:		Execution context
+ * @flow:	Existing flow for this connection data
+ * @fd:		Descriptor for state migration
+ *
+ * Return: 0 on success, negative on fatal failure, but 0 on single flow failure
+ */
+int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd)
+{
+	struct tcp_tap_conn *conn = &flow->tcp;
+	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
+	struct tcp_tap_transfer_ext t;
+	int s = conn->sock, rc;
+
+	if (read_all_buf(fd, &t, sizeof(t))) {
+		rc = -errno;
+		err_perror("Failed to read extended data for socket %i", s);
+		return rc;
+	}
+
+	if (!t.tcpi_state) { /* Source wants us to skip this flow */
+		flow_err(flow, "Dropping as requested by source");
+		goto fail;
+	}
+
+	/* Endianness fix-ups */
+	t.seq_snd	= ntohl(t.seq_snd);
+	t.seq_rcv 	= ntohl(t.seq_rcv);
+	t.sndq		= ntohl(t.sndq);
+	t.notsent	= ntohl(t.notsent);
+	t.rcvq		= ntohl(t.rcvq);
+
+	t.snd_wl1	= ntohl(t.snd_wl1);
+	t.snd_wnd	= ntohl(t.snd_wnd);
+	t.max_window	= ntohl(t.max_window);
+	t.rcv_wnd	= ntohl(t.rcv_wnd);
+	t.rcv_wup	= ntohl(t.rcv_wup);
+
+	debug("Extended migration data, socket %i sequences send %u receive %u",
+	      s, t.seq_snd, t.seq_rcv);
+	debug("  pending queues: send %u not sent %u receive %u",
+	      t.sndq, t.notsent, t.rcvq);
+	debug("  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
+	      t.snd_wl1, t.snd_wnd, t.max_window, t.rcv_wnd, t.rcv_wup);
+	debug("  SO_PEEK_OFF %s  offset=%"PRIu32,
+	      peek_offset_cap ? "enabled" : "disabled", peek_offset);
+
+	if (t.sndq > TCP_MIGRATE_SND_QUEUE_MAX || t.notsent > t.sndq ||
+	    t.rcvq > TCP_MIGRATE_RCV_QUEUE_MAX) {
+		err("Bad queues socket %i, send: %u, not sent: %u, receive: %u",
+		    s, t.sndq, t.notsent, t.rcvq);
+		return -EINVAL;
+	}
+
+	if (read_all_buf(fd, tcp_migrate_snd_queue, t.sndq)) {
+		rc = -errno;
+		err_perror("Failed to read send queue data, socket %i", s);
+		return rc;
+	}
+
+	if (read_all_buf(fd, tcp_migrate_rcv_queue, t.rcvq)) {
+		rc = -errno;
+		err_perror("Failed to read receive queue data, socket %i", s);
+		return rc;
+	}
+
+	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
+		goto fail;
+
+	if (tcp_flow_repair_seq(s, &t.seq_snd))
+		goto fail;
+
+	if (tcp_flow_select_queue(s, TCP_RECV_QUEUE))
+		goto fail;
+
+	if (tcp_flow_repair_seq(s, &t.seq_rcv))
+		goto fail;
+
+	if (tcp_flow_repair_connect(c, conn))
+		goto fail;
+
+	if (tcp_flow_repair_queue(s, t.rcvq, tcp_migrate_rcv_queue))
+		goto fail;
+
+	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
+		goto fail;
+
+	if (tcp_flow_repair_queue(s, t.sndq - t.notsent,
+				  tcp_migrate_snd_queue))
+		goto fail;
+
+	if (tcp_flow_repair_opt(s, &t))
+		goto fail;
+
+	/* If we sent a FIN sent and it was acknowledged (TCP_FIN_WAIT2), don't
+	 * send it out, because we already sent it for sure.
+	 *
+	 * Call shutdown(x, SHUT_WR) in repair mode, so that we move to
+	 * FIN_WAIT_1 (tcp_shutdown()) without sending anything
+	 * (goto in tcp_write_xmit()).
+	 */
+	if (t.tcpi_state == TCP_FIN_WAIT2) {
+		int v;
+
+		v = TCP_SEND_QUEUE;
+		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v)))
+			debug_perror("Selecting repair queue, socket %i", s);
+		else
+			shutdown(s, SHUT_WR);
+	}
+
+	if (tcp_flow_repair_wnd(s, &t))
+		goto fail;
+
+	tcp_flow_repair_off(c, conn);
+	repair_flush(c);
+
+	if (t.notsent) {
+		if (tcp_flow_repair_queue(s, t.notsent,
+					  tcp_migrate_snd_queue +
+					  (t.sndq - t.notsent))) {
+			/* This sometimes seems to fail for unclear reasons.
+			 * Don't fail the whole migration, just reset the flow
+			 * and carry on to the next one.
+			 */
+			goto fail;
+		}
+	}
+
+	/* If we sent a FIN but it wasn't acknowledged yet (TCP_FIN_WAIT1), send
+	 * it out, because we don't know if we already sent it.
+	 *
+	 * Call shutdown(x, SHUT_WR) *not* in repair mode, which moves us to
+	 * TCP_FIN_WAIT1.
+	 */
+	if (t.tcpi_state == TCP_FIN_WAIT1)
+		shutdown(s, SHUT_WR);
+
+	if (tcp_set_peek_offset(conn->sock, peek_offset))
+		goto fail;
+
+	tcp_send_flag(c, conn, ACK);
+	tcp_data_from_sock(c, conn);
+
+	if ((rc = tcp_epoll_ctl(c, conn))) {
+		debug("Failed to subscribe to epoll for migrated socket %i: %s",
+		      conn->sock, strerror_(-rc));
+		goto fail;
+	}
+
+	return 0;
+
+fail:
+	tcp_flow_repair_off(c, conn);
+	repair_flush(c);
+
+	conn->flags = 0; /* Not waiting for ACK, don't schedule timer */
+	tcp_rst(c, conn);
+
+	return 0;
+}
diff --git a/tcp_conn.h b/tcp_conn.h
index 8c20805..42dff48 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -19,6 +19,7 @@
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
  * @events:		Connection events, implying connection states
+ * @listening_sock:	Listening socket this socket was accept()ed from, or -1
  * @timer:		timerfd descriptor for timeout events
  * @flags:		Connection flags representing internal attributes
  * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
@@ -68,6 +69,7 @@ struct tcp_tap_conn {
 #define	CONN_STATE_BITS		/* Setting these clears other flags */	\
 	(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
 
+	int		listening_sock;
 
 	int		timer		:FD_REF_BITS;
 
@@ -96,6 +98,93 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
+/**
+ * struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
+ * @pif:		Interfaces for each side of the flow
+ * @side:		Addresses and ports for each side of the flow
+ * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
+ * @ws_from_tap:	Window scaling factor advertised from tap/guest
+ * @ws_to_tap:		Window scaling factor advertised to tap/guest
+ * @events:		Connection events, implying connection states
+ * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
+ * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
+ * @flags:		Connection flags representing internal attributes
+ * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
+ * @wnd_from_tap:	Last window size from tap, unscaled (as received)
+ * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
+ * @seq_to_tap:		Next sequence for packets to tap
+ * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
+ * @seq_ack_to_tap:	Last ACK number sent to tap
+ * @seq_init_from_tap:	Initial sequence number from tap
+*/
+struct tcp_tap_transfer {
+	uint8_t		pif[SIDES];
+	struct flowside	side[SIDES];
+
+	uint8_t		retrans;
+	uint8_t		ws_from_tap;
+	uint8_t		ws_to_tap;
+	uint8_t		events;
+
+	uint32_t	tap_mss;
+
+	uint32_t	sndbuf;
+
+	uint8_t		flags;
+	uint8_t		seq_dup_ack_approx;
+
+	uint16_t	wnd_from_tap;
+	uint16_t	wnd_to_tap;
+
+	uint32_t	seq_to_tap;
+	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_from_tap;
+	uint32_t	seq_ack_to_tap;
+	uint32_t	seq_init_from_tap;
+} __attribute__((packed, aligned(__alignof__(uint32_t))));
+
+/**
+ * struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
+ * @seq_snd:		Socket-side send sequence
+ * @seq_rcv:		Socket-side receive sequence
+ * @sndq:		Length of pending send queue (unacknowledged / not sent)
+ * @notsent:		Part of pending send queue that wasn't sent out yet
+ * @rcvq:		Length of pending receive queue
+ * @mss:		Socket-side MSS clamp
+ * @snd_wl1:		Next sequence used in window probe (next sequence - 1)
+ * @snd_wnd:		Socket-side sending window
+ * @max_window:		Window clamp
+ * @rcv_wnd:		Socket-side receive window
+ * @rcv_wup:		rcv_nxt on last window update sent
+ * @snd_ws:		Window scaling factor, send
+ * @rcv_ws:		Window scaling factor, receive
+ * @tcpi_state:		Connection state in TCP_INFO style (enum, tcp_states.h)
+ * @tcpi_options:	TCPI_OPT_* constants (timestamps, selective ACK)
+ */
+struct tcp_tap_transfer_ext {
+	uint32_t	seq_snd;
+	uint32_t	seq_rcv;
+
+	uint32_t	sndq;
+	uint32_t	notsent;
+	uint32_t	rcvq;
+
+	uint32_t	mss;
+
+	/* We can't just use struct tcp_repair_window: we need network order */
+	uint32_t	snd_wl1;
+	uint32_t	snd_wnd;
+	uint32_t	max_window;
+	uint32_t	rcv_wnd;
+	uint32_t	rcv_wup;
+
+	uint8_t		snd_ws;
+	uint8_t		rcv_ws;
+	uint8_t		tcpi_state;
+	uint8_t		tcpi_options;
+} __attribute__((packed, aligned(__alignof__(uint32_t))));
+
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
@@ -140,6 +229,20 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
 bool tcp_flow_defer(const struct tcp_tap_conn *conn);
+
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
+int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
+
+int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source_ext(int fd, int fidx,
+				const struct tcp_tap_conn *conn);
+
+int tcp_flow_migrate_target(struct ctx *c, int fd);
+int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd);
+
+bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
+
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
 void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);

From a1e48a02ff3550eb7875a7df6726086e9b3a1213 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 13 Feb 2025 23:14:14 +1100
Subject: [PATCH 077/221] test: Add migration tests

PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/lib/layout                |  57 +++++++++++++-
 test/lib/setup                 | 138 ++++++++++++++++++++++++++++++++-
 test/lib/test                  |  48 ++++++++++++
 test/migrate/basic             |  59 ++++++++++++++
 test/migrate/basic_fin         |  62 +++++++++++++++
 test/migrate/bidirectional     |  64 +++++++++++++++
 test/migrate/bidirectional_fin |  64 +++++++++++++++
 test/migrate/iperf3_bidir6     |  58 ++++++++++++++
 test/migrate/iperf3_in4        |  50 ++++++++++++
 test/migrate/iperf3_in6        |  58 ++++++++++++++
 test/migrate/iperf3_many_out6  |  60 ++++++++++++++
 test/migrate/iperf3_out4       |  47 +++++++++++
 test/migrate/iperf3_out6       |  58 ++++++++++++++
 test/migrate/rampstream_in     |  12 +--
 test/migrate/rampstream_out    |   8 +-
 test/run                       |  42 +++++++++-
 16 files changed, 871 insertions(+), 14 deletions(-)
 create mode 100644 test/migrate/basic
 create mode 100644 test/migrate/basic_fin
 create mode 100644 test/migrate/bidirectional
 create mode 100644 test/migrate/bidirectional_fin
 create mode 100644 test/migrate/iperf3_bidir6
 create mode 100644 test/migrate/iperf3_in4
 create mode 100644 test/migrate/iperf3_in6
 create mode 100644 test/migrate/iperf3_many_out6
 create mode 100644 test/migrate/iperf3_out4
 create mode 100644 test/migrate/iperf3_out6

diff --git a/test/lib/layout b/test/lib/layout
index 4d03572..fddcdc4 100644
--- a/test/lib/layout
+++ b/test/lib/layout
@@ -135,7 +135,7 @@ layout_two_guests() {
 	get_info_cols
 
 	pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
-	pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
+	pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #1" qemu_2 guest_2
 
 	tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
 	tmux send-keys -t ${PANE_INFO} -N 100 C-m
@@ -143,13 +143,66 @@ layout_two_guests() {
 
 	pane_watch_contexts ${PANE_HOST} host host
 	pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
-	pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
+	pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #1" pasta_1 passt_2
 
 	info_layout "two guests, two passt instances, in namespaces"
 
 	sleep 1
 }
 
+# layout_migrate() - Two guest panes, two passt panes, two passt-repair panes,
+#		     plus host and log
+layout_migrate() {
+	sleep 1
+
+	tmux kill-pane -a -t 0
+	cmd_write 0 clear
+
+	tmux split-window -v -t passt_test
+	tmux split-window -h -l '33%'
+	tmux split-window -h -t passt_test:1.1
+
+	tmux split-window -h -l '35%' -t passt_test:1.0
+	tmux split-window -v -t passt_test:1.0
+
+	tmux split-window -v -t passt_test:1.4
+	tmux split-window -v -t passt_test:1.6
+
+	tmux split-window -v -t passt_test:1.3
+
+	PANE_GUEST_1=0
+	PANE_GUEST_2=1
+	PANE_INFO=2
+	PANE_MON=3
+	PANE_HOST=4
+	PANE_PASST_REPAIR_1=5
+	PANE_PASST_1=6
+	PANE_PASST_REPAIR_2=7
+	PANE_PASST_2=8
+
+	get_info_cols
+
+	pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
+	pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
+
+	tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
+	tmux send-keys -t ${PANE_INFO} -N 100 C-m
+	tmux select-pane -t ${PANE_INFO} -T "test log"
+
+	pane_watch_contexts ${PANE_MON} "QEMU monitor" mon mon
+
+	pane_watch_contexts ${PANE_HOST} host host
+	pane_watch_contexts ${PANE_PASST_REPAIR_1} "passt-repair #1 in namespace #1" repair_1 passt_repair_1
+	pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
+
+	pane_watch_contexts ${PANE_PASST_REPAIR_2} "passt-repair #2 in namespace #2" repair_2 passt_repair_2
+	pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
+
+	info_layout "two guests, two passt + passt-repair instances, in namespaces"
+
+	sleep 1
+}
+
 # layout_demo_pasta() - Four panes for pasta demo
 layout_demo_pasta() {
 	sleep 1
diff --git a/test/lib/setup b/test/lib/setup
index ee67152..575bc21 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -305,6 +305,117 @@ setup_two_guests() {
 	context_setup_guest guest_2 ${GUEST_2_CID}
 }
 
+# setup_migrate() - Set up two namespace, run qemu, passt/passt-repair in both
+setup_migrate() {
+	context_setup_host host
+	context_setup_host mon
+	context_setup_host pasta_1
+	context_setup_host pasta_2
+
+	layout_migrate
+
+	# Ports:
+	#
+	#         guest #1  |  guest #2 |   ns #1   |    host
+	#         --------- |-----------|-----------|------------
+	#  10001  as server |           | to guest  |  to ns #1
+	#  10002            |           | as server |  to ns #1
+	#  10003            |           |  to init  |  as server
+	#  10004            | as server | to guest  |  to ns #1
+
+	__opts=
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/pasta_1.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	__map_host4=192.0.2.1
+	__map_host6=2001:db8:9a55::1
+	__map_ns4=192.0.2.2
+	__map_ns6=2001:db8:9a55::2
+
+	# Option 1: send stuff via spliced path in pasta
+	# context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002 -T 10003 -u 10001,10002 -U 10003 --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
+	# Option 2: send stuff via tap (--map-guest-addr) instead (useful to see capture of full migration)
+	context_run_bg pasta_1 "./pasta ${__opts} -P ${STATESETUP}/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr ${__map_host4} --map-guest-addr ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
+	context_setup_nstool passt_1 ${STATESETUP}/ns1.hold
+	context_setup_nstool passt_repair_1 ${STATESETUP}/ns1.hold
+
+	context_setup_nstool passt_2 ${STATESETUP}/ns1.hold
+	context_setup_nstool passt_repair_2 ${STATESETUP}/ns1.hold
+
+	context_setup_nstool qemu_1 ${STATESETUP}/ns1.hold
+	context_setup_nstool qemu_2 ${STATESETUP}/ns1.hold
+
+	__ifname="$(context_run qemu_1 "ip -j link show | jq -rM '.[] | select(.link_type == \"ether\").ifname'")"
+
+	sleep 1
+
+	__opts="--vhost-user"
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
+	wait_for [ -f "${STATESETUP}/passt_1.pid" ]
+
+	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
+
+	__opts="--vhost-user"
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
+	wait_for [ -f "${STATESETUP}/passt_2.pid" ]
+
+	context_run_bg passt_repair_2 "./passt-repair ${STATESETUP}/passt_2.socket.repair"
+
+	__vmem="512M"	# Keep migration fast
+	__qemu_netdev1="					       \
+		-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
+		-netdev vhost-user,id=v,chardev=c		       \
+		-device virtio-net,netdev=v			       \
+		-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+		-numa node,memdev=m"
+	__qemu_netdev2="					       \
+		-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
+		-netdev vhost-user,id=v,chardev=c		       \
+		-device virtio-net,netdev=v			       \
+		-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+		-numa node,memdev=m"
+
+	GUEST_1_CID=94557
+	context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}"		     \
+		' -M accel=kvm:tcg'                                          \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
+		' -kernel '"${KERNEL}"					     \
+		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
+		' -nodefaults'						     \
+		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
+		" ${__qemu_netdev1}"					     \
+		" -pidfile ${STATESETUP}/qemu_1.pid"			     \
+		" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"	     \
+		" -monitor unix:${STATESETUP}/qemu_1_mon.sock,server,nowait"
+
+	GUEST_2_CID=94558
+	context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}"		     \
+		' -M accel=kvm:tcg'                                          \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
+		' -kernel '"${KERNEL}"					     \
+		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
+		' -nodefaults'						     \
+		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
+		" ${__qemu_netdev2}"					     \
+		" -pidfile ${STATESETUP}/qemu_2.pid"			     \
+		" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"	     \
+		" -monitor unix:${STATESETUP}/qemu_2_mon.sock,server,nowait" \
+		" -incoming tcp:0:20005"
+
+	context_setup_guest guest_1 ${GUEST_1_CID}
+	# Only available after migration:
+	( context_setup_guest guest_2 ${GUEST_2_CID} & )
+}
+
 # teardown_context_watch() - Remove contexts and stop panes watching them
 # $1:	Pane number watching
 # $@:	Context names
@@ -375,7 +486,8 @@ teardown_two_guests() {
 	context_wait pasta_1
 	context_wait pasta_2
 
-	rm -f "${STATESETUP}/passt__[12].pid" "${STATESETUP}/pasta_[12].pid"
+	rm "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
+	rm "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
 
 	teardown_context_watch ${PANE_HOST} host
 	teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
@@ -384,6 +496,30 @@ teardown_two_guests() {
 	teardown_context_watch ${PANE_PASST_2} pasta_2 passt_2
 }
 
+# teardown_migrate() - Exit namespaces, kill qemu processes, passt and pasta
+teardown_migrate() {
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_1.pid")
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_2.pid")
+	context_wait qemu_1
+	context_wait qemu_2
+
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/passt_2.pid")
+	context_wait passt_1
+	context_wait passt_2
+	${NSTOOL} stop "${STATESETUP}/ns1.hold"
+	context_wait pasta_1
+
+	rm -f "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
+	rm -f "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"
+
+	teardown_context_watch ${PANE_HOST} host
+
+	teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
+	teardown_context_watch ${PANE_GUEST_2} qemu_2 guest_2
+	teardown_context_watch ${PANE_PASST_1} pasta_1 passt_1
+	teardown_context_watch ${PANE_PASST_2} pasta_1 passt_2
+}
+
 # teardown_demo_passt() - Exit namespace, kill qemu, passt and pasta
 teardown_demo_passt() {
 	tmux send-keys -t ${PANE_GUEST} "C-c"
diff --git a/test/lib/test b/test/lib/test
index e6726be..758250a 100755
--- a/test/lib/test
+++ b/test/lib/test
@@ -68,6 +68,45 @@ test_iperf3() {
 	TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
 }
 
+# test_iperf3m() - Ugly helper for iperf3 directive, guest migration variant
+# $1:	Variable name: to put the measure bandwidth into
+# $2:	Initial source/client context
+# $3:	Second source/client context the guest is moving to
+# $4:	Destination name or address for client
+# $5:	Port number, ${i} is translated to process index
+# $6:	Run time, in seconds
+# $7:	Client options
+test_iperf3m() {
+	__var="${1}"; shift
+	__cctx="${1}"; shift
+	__cctx2="${1}"; shift
+	__dest="${1}"; shift
+	__port="${1}"; shift
+	__time="${1}"; shift
+
+	pane_or_context_run "${__cctx}" 'rm -f c.json'
+
+        # A 1s wait for connection on what's basically a local link
+        # indicates something is pretty wrong
+        __timeout=1000
+	pane_or_context_run_bg "${__cctx}" 				\
+		 'iperf3 -J -c '${__dest}' -p '${__port}		\
+		 '	 --connect-timeout '${__timeout}		\
+		 '	 -t'${__time}' -i0 '"${@}"' > c.json'		\
+
+	__jval=".end.sum_received.bits_per_second"
+
+	sleep $((${__time} + 3))
+
+	pane_or_context_output "${__cctx2}"				\
+		 'cat c.json'
+
+	__bw=$(pane_or_context_output "${__cctx2}"			\
+		 'cat c.json | jq -rMs "map('${__jval}') | add"')
+
+	TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
+}
+
 test_one_line() {
 	__line="${1}"
 
@@ -177,6 +216,12 @@ test_one_line() {
 	"guest2w")
 		pane_or_context_wait guest_2 || TEST_ONE_nok=1
 		;;
+	"mon")
+		pane_or_context_run mon "${__arg}" || TEST_ONE_nok=1
+		;;
+	"monb")
+		pane_or_context_run_bg mon "${__arg}"
+		;;
 	"ns")
 		pane_or_context_run ns "${__arg}" || TEST_ONE_nok=1
 		;;
@@ -292,6 +337,9 @@ test_one_line() {
 	"iperf3")
 		test_iperf3 ${__arg}
 		;;
+	"iperf3m")
+		test_iperf3m ${__arg}
+		;;
 	"set")
 		TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__arg%% *}__" "${__arg#* }")"
 		;;
diff --git a/test/migrate/basic b/test/migrate/basic
new file mode 100644
index 0000000..3f11f7d
--- /dev/null
+++ b/test/migrate/basic
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/basic - Check basic migration functionality
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv4: guest1/guest2 > host
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+hostb	socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
+sleep	1
+# Option 1: via spliced path in pasta, namespace to host
+# guest1b	{ printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
+# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
+guest1b	{ printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+hostw
+hout	MSG cat __STATESETUP__/msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
diff --git a/test/migrate/basic_fin b/test/migrate/basic_fin
new file mode 100644
index 0000000..aa61ec5
--- /dev/null
+++ b/test/migrate/basic_fin
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/basic_fin - Outbound traffic across migration, half-closed socket
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv4: guest1, half-close, guest2 > host
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+
+hostb	echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
+#hostb	socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
+
+#sleep	20
+# Option 1: via spliced path in pasta, namespace to host
+# guest1b	{ printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
+# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
+guest1b	{ printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+hostw
+hout	MSG cat __STATESETUP__/msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
diff --git a/test/migrate/bidirectional b/test/migrate/bidirectional
new file mode 100644
index 0000000..4c04081
--- /dev/null
+++ b/test/migrate/bidirectional
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/bidirectional - Check migration with messages in both directions
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4: guest1/guest2 > host, host > guest1/guest2
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+
+hostb	socat -u TCP4-LISTEN:10006 OPEN:__STATESETUP__/msg,create,trunc
+guest1b	socat -u TCP4-LISTEN:10001 OPEN:msg,create,trunc
+sleep	1
+
+guest1b	socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
+hostb	socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
+sleep	1
+guest1	printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
+host	printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+sleep	2
+guest2	printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
+host	printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
+
+hostw
+# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
+# use sleep 1 for the moment
+sleep	1
+
+hout	MSG cat __STATESETUP__/msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
+
+g2out	MSG cat msg
+check	[ "__MSG__" = "Dear guest 1, you are now guest 2" ]
diff --git a/test/migrate/bidirectional_fin b/test/migrate/bidirectional_fin
new file mode 100644
index 0000000..1c13527
--- /dev/null
+++ b/test/migrate/bidirectional_fin
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/bidirectional_fin - Both directions, half-closed sockets
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4: guest1/guest2 <- (half closed) -> host
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+
+hostb	echo FIN | socat TCP4-LISTEN:10006,shut-down STDIO,ignoreeof > __STATESETUP__/msg
+guest1b	echo FIN | socat TCP4-LISTEN:10001,shut-down STDIO,ignoreeof > msg
+sleep	1
+
+guest1b	socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
+hostb	socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
+sleep	1
+guest1	printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
+host	printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+sleep	2
+guest2	printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
+host	printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
+
+hostw
+# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
+# use sleep 1 for the moment
+sleep	1
+
+hout	MSG cat __STATESETUP__/msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
+
+g2out	MSG cat msg
+check	[ "__MSG__" = "Dear guest 1, you are now guest 2" ]
diff --git a/test/migrate/iperf3_bidir6 b/test/migrate/iperf3_bidir6
new file mode 100644
index 0000000..4bfefb5
--- /dev/null
+++ b/test/migrate/iperf3_bidir6
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_bidir6 - Migration behaviour with many bidirectional flows
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 128
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N --bidir
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 host <-> guest flood, many flows, during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/iperf3_in4 b/test/migrate/iperf3_in4
new file mode 100644
index 0000000..c5f3916
--- /dev/null
+++ b/test/migrate/iperf3_in4
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_in4 - Migration behaviour under inbound IPv4 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+guest1	/sbin/sysctl -w net.core.rmem_max=33554432
+guest1	/sbin/sysctl -w net.core.wmem_max=33554432
+
+set	THREADS 1
+set	TIME 4
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4 host to guest throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/iperf3_in6 b/test/migrate/iperf3_in6
new file mode 100644
index 0000000..16cf504
--- /dev/null
+++ b/test/migrate/iperf3_in6
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_in6 - Migration behaviour under inbound IPv6 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 4
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 host to guest throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/iperf3_many_out6 b/test/migrate/iperf3_many_out6
new file mode 100644
index 0000000..88133f2
--- /dev/null
+++ b/test/migrate/iperf3_many_out6
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_many_out6 - Migration behaviour with many outbound flows
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 16
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N -l 1M
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 guest to host flood, many flows, during migration
+
+test	TCP/IPv6 host to guest throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/iperf3_out4 b/test/migrate/iperf3_out4
new file mode 100644
index 0000000..968057b
--- /dev/null
+++ b/test/migrate/iperf3_out4
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 6
+set	TIME 2
+set	OMIT 0.1
+set	OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/iperf3_out6 b/test/migrate/iperf3_out6
new file mode 100644
index 0000000..21fbfcd
--- /dev/null
+++ b/test/migrate/iperf3_out6
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 6
+set	TIME 2
+set	OMIT 0.1
+set	OPTS -P __THREADS__ -O__OMIT__ -Z -N -l 1M
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__
+bw	__BW__ 1 2
+
+iperf3k	host
diff --git a/test/migrate/rampstream_in b/test/migrate/rampstream_in
index 46f4143..df333ba 100644
--- a/test/migrate/rampstream_in
+++ b/test/migrate/rampstream_in
@@ -6,10 +6,10 @@
 # PASTA - Pack A Subtle Tap Abstraction
 #  for network namespace/tap device mode
 #
-# test/migrate/basic - Check basic migration functionality
+# test/migrate/rampstream_in - Check sequence correctness with inbound ramp
 #
-# Copyright (c) 2025 Red Hat GmbH
-# Author: Stefano Brivio <sbrivio@redhat.com>
+# Copyright (c) 2025 Red Hat
+# Author: David Gibson <david@gibson.dropbear.id.au>
 
 g1tools	ip jq dhclient socat cat
 htools	ip jq
@@ -43,15 +43,15 @@ g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__")
 hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
 check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
 
-test	TCP/IPv4: host > guest
+test	TCP/IPv4: sequence check, ramps, inbound
 g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
 guest1b	socat -u TCP4-LISTEN:10001 EXEC:"rampstream-check.sh __RAMPS__"
 sleep	1
 hostb	socat -u EXEC:"test/rampstream send __RAMPS__" TCP4:__ADDR1__:10001
 
-sleep 1
+sleep	1
 
-#mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+monb	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
 
 hostw
 
diff --git a/test/migrate/rampstream_out b/test/migrate/rampstream_out
index 91b9c63..8ed3229 100644
--- a/test/migrate/rampstream_out
+++ b/test/migrate/rampstream_out
@@ -6,10 +6,10 @@
 # PASTA - Pack A Subtle Tap Abstraction
 #  for network namespace/tap device mode
 #
-# test/migrate/basic - Check basic migration functionality
+# test/migrate/rampstream_out - Check sequence correctness with outbound ramp
 #
-# Copyright (c) 2025 Red Hat GmbH
-# Author: Stefano Brivio <sbrivio@redhat.com>
+# Copyright (c) 2025 Red Hat
+# Author: David Gibson <david@gibson.dropbear.id.au>
 
 g1tools	ip jq dhclient socat cat
 htools	ip jq
@@ -43,7 +43,7 @@ g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__")
 hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
 check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
 
-test	TCP/IPv4: guest > host
+test	TCP/IPv4: sequence check, ramps, outbound
 g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
 hostb	socat -u TCP4-LISTEN:10006 EXEC:"test/rampstream check __RAMPS__"
 sleep	1
diff --git a/test/run b/test/run
index f188d8e..4e86f30 100755
--- a/test/run
+++ b/test/run
@@ -130,6 +130,43 @@ run() {
 	test two_guests_vu/basic
 	teardown two_guests
 
+	setup migrate
+	test migrate/basic
+	teardown migrate
+	setup migrate
+	test migrate/basic_fin
+	teardown migrate
+	setup migrate
+	test migrate/bidirectional
+	teardown migrate
+	setup migrate
+	test migrate/bidirectional_fin
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_out4
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_out6
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_in4
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_in6
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_bidir6
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_many_out6
+	teardown migrate
+	setup migrate
+	test migrate/rampstream_in
+	teardown migrate
+	setup migrate
+	test migrate/rampstream_out
+	teardown migrate
+
 	VALGRIND=0
 	VHOST_USER=0
 	setup passt_in_ns
@@ -186,7 +223,10 @@ run_selected() {
 
 	__setup=
 	for __test; do
-		if [ "${__test%%/*}" != "${__setup}" ]; then
+		# HACK: the migrate tests need the setup repeated for
+		#       each test
+		if [ "${__test%%/*}" != "${__setup}" -o		\
+		     "${__test%%/*}" = "migrate" ]; then
 			[ -n "${__setup}" ] && teardown "${__setup}"
 			__setup="${__test%%/*}"
 			setup "${__setup}"

From bcc4908c2b4a20c581f2b03fed40da97b804106f Mon Sep 17 00:00:00 2001
From: Enrique Llorente <ellorent@redhat.com>
Date: Mon, 17 Feb 2025 10:28:14 +0100
Subject: [PATCH 078/221] dhcp: Remove option 255 length byte

The option 255 (end of options) do not need the length byte, this change
remove that allowing to have one extra byte at other dynamic options.

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 dhcp.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/dhcp.c b/dhcp.c
index 401cb5b..4a209f1 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -64,9 +64,9 @@ static struct opt opts[255];
 #define OPT_MIN		60 /* RFC 951 */
 
 /* Total option size (excluding end option) is 576 (RFC 2131), minus
- * offset of options (268), minus end option and its length (2).
+ * offset of options (268), minus end option (1).
  */
-#define OPT_MAX		306
+#define OPT_MAX		307
 
 /**
  * dhcp_init() - Initialise DHCP options
@@ -127,7 +127,7 @@ struct msg {
 	uint8_t sname[64];
 	uint8_t file[128];
 	uint32_t magic;
-	uint8_t o[OPT_MAX + 2 /* End option and its length */ ];
+	uint8_t o[OPT_MAX + 1 /* End option */ ];
 } __attribute__((__packed__));
 
 /**
@@ -194,7 +194,6 @@ static int fill(struct msg *m)
 	}
 
 	m->o[offset++] = 255;
-	m->o[offset++] = 0;
 
 	if (offset < OPT_MIN) {
 		memset(&m->o[offset], 0, OPT_MIN - offset);

From 0a51060f7ac3e1e1a9d87ffdb037b9c367a2a4d9 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 13:07:17 +1100
Subject: [PATCH 079/221] packet: Use flexible array member in struct pool

Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packet.h b/packet.h
index 3f70e94..85ee550 100644
--- a/packet.h
+++ b/packet.h
@@ -21,7 +21,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct iovec pkt[1];
+	struct iovec pkt[];
 };
 
 int vu_packet_check_range(void *buf, size_t offset, size_t len,

From 354bc0bab1cb6095592288674d375511443427fd Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 13:07:18 +1100
Subject: [PATCH 080/221] packet: Don't pass start and offset separately to
 packet_check_range()

Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c    | 36 +++++++++++++++++++-----------------
 packet.h    |  3 +--
 vu_common.c | 11 ++++-------
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/packet.c b/packet.c
index 03a11e6..0330b54 100644
--- a/packet.c
+++ b/packet.c
@@ -23,23 +23,22 @@
 #include "log.h"
 
 /**
- * packet_check_range() - Check if a packet memory range is valid
+ * packet_check_range() - Check if a memory range is valid for a pool
  * @p:		Packet pool
- * @offset:	Offset of data range in packet descriptor
+ * @ptr:	Start of desired data range
  * @len:	Length of desired data range
- * @start:	Start of the packet descriptor
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
  * Return: 0 if the range is valid, -1 otherwise
  */
-static int packet_check_range(const struct pool *p, size_t offset, size_t len,
-			      const char *start, const char *func, int line)
+static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
+			      const char *func, int line)
 {
 	if (p->buf_size == 0) {
 		int ret;
 
-		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
+		ret = vu_packet_check_range((void *)p->buf, ptr, len);
 
 		if (ret == -1)
 			trace("cannot find region, %s:%i", func, line);
@@ -47,16 +46,16 @@ static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 		return ret;
 	}
 
-	if (start < p->buf) {
-		trace("packet start %p before buffer start %p, "
-		      "%s:%i", (void *)start, (void *)p->buf, func, line);
+	if (ptr < p->buf) {
+		trace("packet range start %p before buffer start %p, %s:%i",
+		      (void *)ptr, (void *)p->buf, func, line);
 		return -1;
 	}
 
-	if (start + len + offset > p->buf + p->buf_size) {
-		trace("packet offset plus length %zu from size %zu, "
-		      "%s:%i", start - p->buf + len + offset,
-		      p->buf_size, func, line);
+	if (ptr + len > p->buf + p->buf_size) {
+		trace("packet range end %p after buffer end %p, %s:%i",
+		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
+		      func, line);
 		return -1;
 	}
 
@@ -81,7 +80,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (packet_check_range(p, 0, len, start, func, line))
+	if (packet_check_range(p, start, len, func, line))
 		return;
 
 	if (len > UINT16_MAX) {
@@ -110,6 +109,8 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		    size_t len, size_t *left, const char *func, int line)
 {
+	char *ptr;
+
 	if (idx >= p->size || idx >= p->count) {
 		if (func) {
 			trace("packet %zu from pool size: %zu, count: %zu, "
@@ -135,14 +136,15 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
-			       func, line))
+	ptr = (char *)p->pkt[idx].iov_base + offset;
+
+	if (packet_check_range(p, ptr, len, func, line))
 		return NULL;
 
 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
 
-	return (char *)p->pkt[idx].iov_base + offset;
+	return ptr;
 }
 
 /**
diff --git a/packet.h b/packet.h
index 85ee550..bdc07fe 100644
--- a/packet.h
+++ b/packet.h
@@ -24,8 +24,7 @@ struct pool {
 	struct iovec pkt[];
 };
 
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start);
+int vu_packet_check_range(void *buf, const char *ptr, size_t len);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 48826b1..686a09b 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -26,14 +26,12 @@
  * vu_packet_check_range() - Check if a given memory zone is contained in
  * 			     a mapped guest memory region
  * @buf:	Array of the available memory regions
- * @offset:	Offset of data range in packet descriptor
+ * @ptr:	Start of desired data range
  * @size:	Length of desired data range
- * @start:	Start of the packet descriptor
  *
  * Return: 0 if the zone is in a mapped memory region, -1 otherwise
  */
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start)
+int vu_packet_check_range(void *buf, const char *ptr, size_t len)
 {
 	struct vu_dev_region *dev_region;
 
@@ -41,9 +39,8 @@ int vu_packet_check_range(void *buf, size_t offset, size_t len,
 		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
 		char *m = (char *)(uintptr_t)dev_region->mmap_addr;
 
-		if (m <= start &&
-		    start + offset + len <= m + dev_region->mmap_offset +
-					       dev_region->size)
+		if (m <= ptr &&
+		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
 			return 0;
 	}
 

From 6b4065153c67e7578d448927e49f244deea70e4d Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 13:07:19 +1100
Subject: [PATCH 081/221] tap: Remove unused ETH_HDR_INIT() macro

The uses of this macro were removed in d4598e1d18ac ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tap.h b/tap.h
index dfbd8b9..a476a12 100644
--- a/tap.h
+++ b/tap.h
@@ -6,8 +6,6 @@
 #ifndef TAP_H
 #define TAP_H
 
-#define ETH_HDR_INIT(proto) { .h_proto = htons_constant(proto) }
-
 /**
  * struct tap_hdr - tap backend specific headers
  * @vnet_len:	Frame length (for qemu socket transport)

From 5a07eb3cccf1abf0a44d6ab01819f8f605c87ef4 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 13:50:13 +1100
Subject: [PATCH 082/221] tcp_vu: head_cnt need not be global

head_cnt is a global variable which tracks how many entries in head[] are
currently used.  The fact that it's global obscures the fact that the
lifetime over which it has a meaningful value is quite short: a single
call to of tcp_vu_data_from_sock().

Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer.
We keep the head[] array global for now - although technically it has the
same valid lifetime - because it's large enough we might not want to put
it on the stack.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_vu.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index 0622f17..6891ed1 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -38,7 +38,6 @@
 static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
 static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
 static int head[VIRTQUEUE_MAX_SIZE + 1];
-static int head_cnt;
 
 /**
  * tcp_vu_hdrlen() - return the size of the header in level 2 frame (TCP)
@@ -183,7 +182,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 static ssize_t tcp_vu_sock_recv(const struct ctx *c,
 				const struct tcp_tap_conn *conn, bool v6,
 				uint32_t already_sent, size_t fillsize,
-				int *iov_cnt)
+				int *iov_cnt, int *head_cnt)
 {
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@@ -202,7 +201,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
 	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
 
 	elem_cnt = 0;
-	head_cnt = 0;
+	*head_cnt = 0;
 	while (fillsize > 0 && elem_cnt < VIRTQUEUE_MAX_SIZE) {
 		struct iovec *iov;
 		size_t frame_size, dlen;
@@ -221,7 +220,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
 		ASSERT(iov->iov_len >= hdrlen);
 		iov->iov_base = (char *)iov->iov_base + hdrlen;
 		iov->iov_len -= hdrlen;
-		head[head_cnt++] = elem_cnt;
+		head[(*head_cnt)++] = elem_cnt;
 
 		fillsize -= dlen;
 		elem_cnt += cnt;
@@ -261,17 +260,18 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c,
 		len -= iov->iov_len;
 	}
 	/* adjust head count */
-	while (head_cnt > 0 && head[head_cnt - 1] >= i)
-		head_cnt--;
+	while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
+		(*head_cnt)--;
+
 	/* mark end of array */
-	head[head_cnt] = i;
+	head[*head_cnt] = i;
 	*iov_cnt = i;
 
 	/* release unused buffers */
 	vu_queue_rewind(vq, elem_cnt - i);
 
 	/* restore space for headers in iov */
-	for (i = 0; i < head_cnt; i++) {
+	for (i = 0; i < *head_cnt; i++) {
 		struct iovec *iov = &elem[head[i]].in_sg[0];
 
 		iov->iov_base = (char *)iov->iov_base - hdrlen;
@@ -357,11 +357,11 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	ssize_t len, previous_dlen;
+	int i, iov_cnt, head_cnt;
 	size_t hdrlen, fillsize;
 	int v6 = CONN_V6(conn);
 	uint32_t already_sent;
 	const uint16_t *check;
-	int i, iov_cnt;
 
 	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
 		debug("Got packet, but RX virtqueue not usable yet");
@@ -396,7 +396,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	/* collect the buffers from vhost-user and fill them with the
 	 * data from the socket
 	 */
-	len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize, &iov_cnt);
+	len = tcp_vu_sock_recv(c, conn, v6, already_sent, fillsize,
+			       &iov_cnt, &head_cnt);
 	if (len < 0) {
 		if (len != -EAGAIN && len != -EWOULDBLOCK) {
 			tcp_rst(c, conn);

From e56c8038fc23a349ff4a457c6b447f927ac1a56e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 19:59:21 +1100
Subject: [PATCH 083/221] tcp: More type safety for
 tcp_flow_migrate_target_ext()

tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c     | 2 +-
 tcp.c      | 7 +++----
 tcp_conn.h | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/flow.c b/flow.c
index cc881e8..abe95b2 100644
--- a/flow.c
+++ b/flow.c
@@ -1106,7 +1106,7 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	repair_flush(c);
 
 	for (i = 0; i < count; i++) {
-		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
+		rc = tcp_flow_migrate_target_ext(c, &flowtab[i].tcp, fd);
 		if (rc) {
 			debug("Migration data failure at flow %u: %s, abort",
 			      i, strerror_(-rc));
diff --git a/tcp.c b/tcp.c
index 98e1c6a..272e4cd 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3394,14 +3394,13 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 /**
  * tcp_flow_migrate_target_ext() - Receive extended data for flow, set, connect
  * @c:		Execution context
- * @flow:	Existing flow for this connection data
+ * @conn:	Connection entry to complete with extra data
  * @fd:		Descriptor for state migration
  *
  * Return: 0 on success, negative on fatal failure, but 0 on single flow failure
  */
-int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd)
+int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd)
 {
-	struct tcp_tap_conn *conn = &flow->tcp;
 	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
 	struct tcp_tap_transfer_ext t;
 	int s = conn->sock, rc;
@@ -3413,7 +3412,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd)
 	}
 
 	if (!t.tcpi_state) { /* Source wants us to skip this flow */
-		flow_err(flow, "Dropping as requested by source");
+		flow_err(conn, "Dropping as requested by source");
 		goto fail;
 	}
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 42dff48..53887c0 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -239,7 +239,7 @@ int tcp_flow_migrate_source_ext(int fd, int fidx,
 				const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_target(struct ctx *c, int fd);
-int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd);
+int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
 
 bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
 

From 854bc7b1a3b4e5443ea071e49b3a68198dbb88b3 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 19:59:22 +1100
Subject: [PATCH 084/221] tcp: Remove spurious prototype for
 tcp_flow_migrate_shrink_window

This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_conn.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tcp_conn.h b/tcp_conn.h
index 53887c0..8a15b08 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -233,7 +233,6 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn);
 int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
 int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
 
-int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn);
 int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
 int tcp_flow_migrate_source_ext(int fd, int fidx,
 				const struct tcp_tap_conn *conn);

From ba0823f8a0e60d4fc0cb21179aaf64940509156a Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 19:59:23 +1100
Subject: [PATCH 085/221] tcp: Don't pass both flow pointer and flow index

tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c     | 2 +-
 tcp.c      | 6 ++----
 tcp_conn.h | 3 +--
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/flow.c b/flow.c
index abe95b2..cc393e0 100644
--- a/flow.c
+++ b/flow.c
@@ -1053,7 +1053,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * as EIO).
 	 */
 	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
-		rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp);
+		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
 		if (rc) {
 			err("Extended data for flow %u: %s", i, strerror_(-rc));
 
diff --git a/tcp.c b/tcp.c
index 272e4cd..21b6c6c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3141,16 +3141,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
 /**
  * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
  * @fd:		Descriptor for state migration
- * @fidx:	Flow index
  * @conn:	Pointer to the TCP connection structure
  *
  * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
  */
-int tcp_flow_migrate_source_ext(int fd, int fidx,
-				const struct tcp_tap_conn *conn)
+int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 {
 	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
-	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
+	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
 	int s = conn->sock;
 	int rc;
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 8a15b08..9126a36 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -234,8 +234,7 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
 int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
-int tcp_flow_migrate_source_ext(int fd, int fidx,
-				const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_target(struct ctx *c, int fd);
 int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);

From adb46c11d0ea67824cf8c4ef2113ec0b2c563c0e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 18 Feb 2025 19:59:24 +1100
Subject: [PATCH 086/221] flow: Add flow_perror() helper

Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c       | 12 +++++++-----
 flow.h       | 18 ++++++++++++++----
 icmp.c       |  5 ++---
 tcp.c        | 33 +++++++++++++++------------------
 tcp_splice.c |  9 ++++-----
 udp_flow.c   | 19 +++++++------------
 6 files changed, 49 insertions(+), 47 deletions(-)

diff --git a/flow.c b/flow.c
index cc393e0..c68f6bb 100644
--- a/flow.c
+++ b/flow.c
@@ -289,11 +289,13 @@ int flowside_connect(const struct ctx *c, int s,
 
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
+ * @newline:	Append newline at the end of the message, if missing
  * @pri:	Log priority
  * @fmt:	Format string
  * @...:	printf-arguments
  */
-void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
+void flow_log_(const struct flow_common *f, bool newline, int pri,
+	       const char *fmt, ...)
 {
 	const char *type_or_state;
 	char msg[BUFSIZ];
@@ -309,7 +311,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	else
 		type_or_state = FLOW_TYPE(f);
 
-	logmsg(true, false, pri,
+	logmsg(newline, false, pri,
 	       "Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
 }
 
@@ -329,7 +331,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
 	const struct flowside *tgt = &f->side[TGTSIDE];
 
 	if (state >= FLOW_STATE_TGT)
-		flow_log_(f, pri,
+		flow_log_(f, true, pri,
 			  "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
 			  pif_name(f->pif[INISIDE]),
 			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@@ -342,7 +344,7 @@ void flow_log_details_(const struct flow_common *f, int pri,
 			  inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
 			  tgt->eport);
 	else if (state >= FLOW_STATE_INI)
-		flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
+		flow_log_(f, true, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
 			  pif_name(f->pif[INISIDE]),
 			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
@@ -363,7 +365,7 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 	ASSERT(oldstate < FLOW_NUM_STATES);
 
 	f->state = state;
-	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
+	flow_log_(f, true, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
 		  FLOW_STATE(f));
 
 	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
diff --git a/flow.h b/flow.h
index 675726e..dcf7645 100644
--- a/flow.h
+++ b/flow.h
@@ -258,11 +258,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 			int fd);
 
-void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
-	__attribute__((format(printf, 3, 4)));
-
-#define flow_log(f_, pri, ...)	flow_log_(&(f_)->f, (pri), __VA_ARGS__)
+void flow_log_(const struct flow_common *f, bool newline, int pri,
+	       const char *fmt, ...)
+	__attribute__((format(printf, 4, 5)));
 
+#define flow_log(f_, pri, ...)	flow_log_(&(f_)->f, true, (pri), __VA_ARGS__)
 #define flow_dbg(f, ...)	flow_log((f), LOG_DEBUG, __VA_ARGS__)
 #define flow_err(f, ...)	flow_log((f), LOG_ERR, __VA_ARGS__)
 
@@ -272,6 +272,16 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 			flow_dbg((f), __VA_ARGS__);			\
 	} while (0)
 
+#define flow_log_perror_(f, pri, ...)					\
+	do {								\
+		int errno_ = errno;					\
+		flow_log_((f), false, (pri), __VA_ARGS__);		\
+		logmsg(true, true, (pri), ": %s", strerror_(errno_));	\
+	} while (0)
+
+#define flow_dbg_perror(f_, ...) flow_log_perror_(&(f_)->f, LOG_DEBUG, __VA_ARGS__)
+#define flow_perror(f_, ...)	flow_log_perror_(&(f_)->f, LOG_ERR, __VA_ARGS__)
+
 void flow_log_details_(const struct flow_common *f, int pri,
 		       enum flow_state state);
 #define flow_log_details(f_, pri) \
diff --git a/icmp.c b/icmp.c
index bcf498d..7e2b342 100644
--- a/icmp.c
+++ b/icmp.c
@@ -85,7 +85,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 
 	n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl);
 	if (n < 0) {
-		flow_err(pingf, "recvfrom() error: %s", strerror_(errno));
+		flow_perror(pingf, "recvfrom() error");
 		return;
 	}
 
@@ -300,8 +300,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0);
 	if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
-		flow_dbg(pingf, "failed to relay request to socket: %s",
-			 strerror_(errno));
+		flow_dbg_perror(pingf, "failed to relay request to socket");
 	} else {
 		flow_dbg(pingf,
 			 "echo request to socket, ID: %"PRIu16", seq: %"PRIu16,
diff --git a/tcp.c b/tcp.c
index 21b6c6c..f498f5b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -551,8 +551,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 
 		fd = timerfd_create(CLOCK_MONOTONIC, 0);
 		if (fd == -1 || fd > FD_REF_MAX) {
-			flow_dbg(conn, "failed to get timer: %s",
-				 strerror_(errno));
+			flow_dbg_perror(conn, "failed to get timer");
 			if (fd > -1)
 				close(fd);
 			conn->timer = -1;
@@ -561,8 +560,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		conn->timer = fd;
 
 		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
-			flow_dbg(conn, "failed to add timer: %s",
-				 strerror_(errno));
+			flow_dbg_perror(conn, "failed to add timer");
 			close(conn->timer);
 			conn->timer = -1;
 			return;
@@ -587,7 +585,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
 
 	if (timerfd_settime(conn->timer, 0, &it, NULL))
-		flow_err(conn, "failed to set timer: %s", strerror_(errno));
+		flow_perror(conn, "failed to set timer");
 }
 
 /**
@@ -1386,10 +1384,10 @@ static void tcp_bind_outbound(const struct ctx *c,
 		if (bind(s, &bind_sa.sa, sl)) {
 			char sstr[INANY_ADDRSTRLEN];
 
-			flow_dbg(conn,
-				 "Can't bind TCP outbound socket to %s:%hu: %s",
-				 inany_ntop(&tgt->oaddr, sstr, sizeof(sstr)),
-				 tgt->oport, strerror_(errno));
+			flow_dbg_perror(conn,
+					"Can't bind TCP outbound socket to %s:%hu",
+					inany_ntop(&tgt->oaddr, sstr, sizeof(sstr)),
+					tgt->oport);
 		}
 	}
 
@@ -1398,9 +1396,9 @@ static void tcp_bind_outbound(const struct ctx *c,
 			if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE,
 				       c->ip4.ifname_out,
 				       strlen(c->ip4.ifname_out))) {
-				flow_dbg(conn, "Can't bind IPv4 TCP socket to"
-					 " interface %s: %s", c->ip4.ifname_out,
-					 strerror_(errno));
+				flow_dbg_perror(conn,
+						"Can't bind IPv4 TCP socket to interface %s",
+						c->ip4.ifname_out);
 			}
 		}
 	} else if (bind_sa.sa_family == AF_INET6) {
@@ -1408,9 +1406,9 @@ static void tcp_bind_outbound(const struct ctx *c,
 			if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE,
 				       c->ip6.ifname_out,
 				       strlen(c->ip6.ifname_out))) {
-				flow_dbg(conn, "Can't bind IPv6 TCP socket to"
-					 " interface %s: %s", c->ip6.ifname_out,
-					 strerror_(errno));
+				flow_dbg_perror(conn,
+						"Can't bind IPv6 TCP socket to interface %s",
+						c->ip6.ifname_out);
 			}
 		}
 	}
@@ -2193,7 +2191,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 	 * and we just set the timer to a new point in the future: discard it.
 	 */
 	if (timerfd_gettime(conn->timer, &check_armed))
-		flow_err(conn, "failed to read timer: %s", strerror_(errno));
+		flow_perror(conn, "failed to read timer");
 
 	if (check_armed.it_value.tv_sec || check_armed.it_value.tv_nsec)
 		return;
@@ -2235,8 +2233,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 		 * ~ACK_TO_TAP_DUE or ~ACK_FROM_TAP_DUE.
 		 */
 		if (timerfd_settime(conn->timer, 0, &new, &old))
-			flow_err(conn, "failed to set timer: %s",
-				 strerror_(errno));
+			flow_perror(conn, "failed to set timer");
 
 		if (old.it_value.tv_sec == ACT_TIMEOUT) {
 			flow_dbg(conn, "activity timeout");
diff --git a/tcp_splice.c b/tcp_splice.c
index 5d845c9..0d10e3d 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -164,7 +164,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
 	    epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
 		int ret = -errno;
-		flow_err(conn, "ERROR on epoll_ctl(): %s", strerror_(errno));
+		flow_perror(conn, "ERROR on epoll_ctl()");
 		return ret;
 	}
 
@@ -317,8 +317,8 @@ static int tcp_splice_connect_finish(const struct ctx *c,
 
 		if (conn->pipe[sidei][0] < 0) {
 			if (pipe2(conn->pipe[sidei], O_NONBLOCK | O_CLOEXEC)) {
-				flow_err(conn, "cannot create %d->%d pipe: %s",
-					 sidei, !sidei, strerror_(errno));
+				flow_perror(conn, "cannot create %d->%d pipe",
+					    sidei, !sidei);
 				conn_flag(c, conn, CLOSING);
 				return -EIO;
 			}
@@ -482,8 +482,7 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 
 		rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
 		if (rc)
-			flow_err(conn, "Error retrieving SO_ERROR: %s",
-				 strerror_(errno));
+			flow_perror(conn, "Error retrieving SO_ERROR");
 		else
 			flow_trace(conn, "Error event on socket: %s",
 				   strerror_(err));
diff --git a/udp_flow.c b/udp_flow.c
index 83c2568..c6b8630 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -93,9 +93,8 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 		 */
 		uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
 		if (uflow->s[INISIDE] < 0) {
-			flow_err(uflow,
-				 "Couldn't duplicate listening socket: %s",
-				 strerror_(errno));
+			flow_perror(uflow,
+				    "Couldn't duplicate listening socket");
 			goto cancel;
 		}
 	}
@@ -113,16 +112,13 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 		uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY,
 						     tgtpif, tgt, fref.data);
 		if (uflow->s[TGTSIDE] < 0) {
-			flow_dbg(uflow,
-				 "Couldn't open socket for spliced flow: %s",
-				 strerror_(errno));
+			flow_dbg_perror(uflow,
+					"Couldn't open socket for spliced flow");
 			goto cancel;
 		}
 
 		if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) {
-			flow_dbg(uflow,
-				 "Couldn't connect flow socket: %s",
-				 strerror_(errno));
+			flow_dbg_perror(uflow, "Couldn't connect flow socket");
 			goto cancel;
 		}
 
@@ -142,9 +138,8 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 			flow_trace(uflow,
 				   "Discarded %d spurious reply datagrams", rc);
 		} else if (errno != EAGAIN) {
-			flow_err(uflow,
-				 "Unexpected error discarding datagrams: %s",
-				 strerror_(errno));
+			flow_perror(uflow,
+				    "Unexpected error discarding datagrams");
 		}
 	}
 

From 7ffca35fddf1568698199c931ba1877c1908b443 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Feb 2025 13:28:34 +1100
Subject: [PATCH 087/221] flow: Remove unneeded index from foreach_* macros

The foreach macros are odd in that they take two loop counters: an integer
index, and a pointer to the flow.  We nearly always want the latter, not
the former, and we can get the index from the pointer trivially when we
need it.  So, rearrange the macros not to need the integer index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 44 +++++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/flow.c b/flow.c
index c68f6bb..3fcdd9f 100644
--- a/flow.c
+++ b/flow.c
@@ -53,30 +53,28 @@ const uint8_t flow_proto[] = {
 static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 	      "flow_proto[] doesn't match enum flow_type");
 
-#define foreach_flow(i, flow, bound)					\
-	for ((i) = 0, (flow) = &flowtab[(i)];				\
-	     (i) < (bound);						\
-	     (i)++, (flow) = &flowtab[(i)])				\
+#define foreach_flow(flow, bound)					\
+	for ((flow) = flowtab; FLOW_IDX(flow) < (bound); (flow)++)	\
 		if ((flow)->f.state == FLOW_STATE_FREE)			\
-			(i) += (flow)->free.n - 1;			\
+			(flow) += (flow)->free.n - 1;			\
 		else
 
-#define foreach_active_flow(i, flow, bound)				\
-	foreach_flow((i), (flow), (bound))				\
+#define foreach_active_flow(flow, bound)				\
+	foreach_flow((flow), (bound))					\
 		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
 		else
 
-#define foreach_tcp_flow(i, flow, bound)				\
-	foreach_active_flow((i), (flow), (bound))			\
+#define foreach_tcp_flow(flow, bound)					\
+	foreach_active_flow((flow), (bound))				\
 		if ((flow)->f.type != FLOW_TCP)				\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
 		else
 
-#define foreach_established_tcp_flow(i, flow, bound)			\
-	foreach_tcp_flow((i), (flow), (bound))				\
+#define foreach_established_tcp_flow(flow, bound)			\
+	foreach_tcp_flow((flow), (bound))				\
 		if (!tcp_flow_is_established(&(flow)->tcp))		\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
@@ -918,11 +916,10 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
 					int ret)
 {
 	union flow *flow;
-	unsigned i;
 
 	debug("...roll back migration");
 
-	foreach_established_tcp_flow(i, flow, max_flow)
+	foreach_established_tcp_flow(flow, max_flow)
 		if (tcp_flow_repair_off(c, &flow->tcp))
 			die("Failed to roll back TCP_REPAIR mode");
 
@@ -942,10 +939,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
 static int flow_migrate_repair_all(struct ctx *c, bool enable)
 {
 	union flow *flow;
-	unsigned i;
 	int rc;
 
-	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow, FLOW_MAX) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
 		else
@@ -954,14 +950,15 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 		if (rc) {
 			debug("Can't %s repair mode: %s",
 			      enable ? "enable" : "disable", strerror_(-rc));
-			return flow_migrate_source_rollback(c, i, rc);
+			return flow_migrate_source_rollback(c, FLOW_IDX(flow),
+							    rc);
 		}
 	}
 
 	if ((rc = repair_flush(c))) {
 		debug("Can't %s repair mode: %s",
 		      enable ? "enable" : "disable", strerror_(-rc));
-		return flow_migrate_source_rollback(c, i, rc);
+		return flow_migrate_source_rollback(c, FLOW_IDX(flow), rc);
 	}
 
 	return 0;
@@ -1003,13 +1000,12 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	uint32_t count = 0;
 	bool first = true;
 	union flow *flow;
-	unsigned i;
 	int rc;
 
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(i, flow, FLOW_MAX)
+	foreach_established_tcp_flow(flow, FLOW_MAX)
 		count++;
 
 	count = htonl(count);
@@ -1028,10 +1024,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * stream might now be inconsistent, and we might have closed listening
 	 * TCP sockets, so just terminate.
 	 */
-	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow, FLOW_MAX) {
 		rc = tcp_flow_migrate_source(fd, &flow->tcp);
 		if (rc) {
-			err("Can't send data, flow %u: %s", i, strerror_(-rc));
+			err("Can't send data, flow %u: %s", FLOW_IDX(flow),
+			    strerror_(-rc));
 			if (!first)
 				die("Inconsistent migration state, exiting");
 
@@ -1054,10 +1051,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * failures but not if the stream might be inconsistent (reported here
 	 * as EIO).
 	 */
-	foreach_established_tcp_flow(i, flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow, FLOW_MAX) {
 		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
 		if (rc) {
-			err("Extended data for flow %u: %s", i, strerror_(-rc));
+			err("Extended data for flow %u: %s", FLOW_IDX(flow),
+			    strerror_(-rc));
 
 			if (rc == -EIO)
 				die("Inconsistent migration state, exiting");

From b79a22d3601b69cf58b1803c5ead7f4667c46827 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Feb 2025 13:28:35 +1100
Subject: [PATCH 088/221] flow: Remove unneeded bound parameter from flow
 traversal macros

The foreach macros used to step through flows each take a 'bound' parameter
to only scan part of the flow table.  Only one place actually passes a
bound different from FLOW_MAX.  So we can simplify every other invocation
by having that one case manually handle the bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/flow.c b/flow.c
index 3fcdd9f..602fea7 100644
--- a/flow.c
+++ b/flow.c
@@ -53,28 +53,28 @@ const uint8_t flow_proto[] = {
 static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 	      "flow_proto[] doesn't match enum flow_type");
 
-#define foreach_flow(flow, bound)					\
-	for ((flow) = flowtab; FLOW_IDX(flow) < (bound); (flow)++)	\
+#define foreach_flow(flow)						\
+	for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)	\
 		if ((flow)->f.state == FLOW_STATE_FREE)			\
 			(flow) += (flow)->free.n - 1;			\
 		else
 
-#define foreach_active_flow(flow, bound)				\
-	foreach_flow((flow), (bound))					\
+#define foreach_active_flow(flow)					\
+	foreach_flow((flow))						\
 		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
 		else
 
-#define foreach_tcp_flow(flow, bound)					\
-	foreach_active_flow((flow), (bound))				\
+#define foreach_tcp_flow(flow)						\
+	foreach_active_flow((flow))					\
 		if ((flow)->f.type != FLOW_TCP)				\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
 		else
 
-#define foreach_established_tcp_flow(flow, bound)			\
-	foreach_tcp_flow((flow), (bound))				\
+#define foreach_established_tcp_flow(flow)				\
+	foreach_tcp_flow((flow))					\
 		if (!tcp_flow_is_established(&(flow)->tcp))		\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
@@ -907,21 +907,23 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 /**
  * flow_migrate_source_rollback() - Disable repair mode, return failure
  * @c:		Execution context
- * @max_flow:	Maximum index of affected flows
+ * @bound:	No need to roll back flow indices >= @bound
  * @ret:	Negative error code
  *
  * Return: @ret
  */
-static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
-					int ret)
+static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
 {
 	union flow *flow;
 
 	debug("...roll back migration");
 
-	foreach_established_tcp_flow(flow, max_flow)
+	foreach_established_tcp_flow(flow) {
+		if (FLOW_IDX(flow) >= bound)
+			break;
 		if (tcp_flow_repair_off(c, &flow->tcp))
 			die("Failed to roll back TCP_REPAIR mode");
+	}
 
 	if (repair_flush(c))
 		die("Failed to roll back TCP_REPAIR mode");
@@ -941,7 +943,7 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
-	foreach_established_tcp_flow(flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
 		else
@@ -1005,7 +1007,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow, FLOW_MAX)
+	foreach_established_tcp_flow(flow)
 		count++;
 
 	count = htonl(count);
@@ -1024,7 +1026,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * stream might now be inconsistent, and we might have closed listening
 	 * TCP sockets, so just terminate.
 	 */
-	foreach_established_tcp_flow(flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow) {
 		rc = tcp_flow_migrate_source(fd, &flow->tcp);
 		if (rc) {
 			err("Can't send data, flow %u: %s", FLOW_IDX(flow),
@@ -1051,7 +1053,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * failures but not if the stream might be inconsistent (reported here
 	 * as EIO).
 	 */
-	foreach_established_tcp_flow(flow, FLOW_MAX) {
+	foreach_established_tcp_flow(flow) {
 		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
 		if (rc) {
 			err("Extended data for flow %u: %s", FLOW_IDX(flow),

From 65e317a8fca4eaf9efbfe642cc7e4322c56aa1f7 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Feb 2025 13:28:36 +1100
Subject: [PATCH 089/221] flow: Clean up and generalise flow traversal macros

The migration code introduced a number of 'foreach' macros to traverse the
flow table.  These aren't inherently tied to migration, so polish up their
naming, move them to flow_table.h and also use in flow_defer_handler()
which is the other place we need to traverse the whole table.

For now we keep foreach_established_tcp_flow() as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c       | 36 ++++++++----------------------------
 flow_table.h | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/flow.c b/flow.c
index 602fea7..bb5dcc3 100644
--- a/flow.c
+++ b/flow.c
@@ -53,28 +53,8 @@ const uint8_t flow_proto[] = {
 static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 	      "flow_proto[] doesn't match enum flow_type");
 
-#define foreach_flow(flow)						\
-	for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)	\
-		if ((flow)->f.state == FLOW_STATE_FREE)			\
-			(flow) += (flow)->free.n - 1;			\
-		else
-
-#define foreach_active_flow(flow)					\
-	foreach_flow((flow))						\
-		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
-			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
-			continue;					\
-		else
-
-#define foreach_tcp_flow(flow)						\
-	foreach_active_flow((flow))					\
-		if ((flow)->f.type != FLOW_TCP)				\
-			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
-			continue;					\
-		else
-
 #define foreach_established_tcp_flow(flow)				\
-	foreach_tcp_flow((flow))					\
+	flow_foreach_of_type((flow), FLOW_TCP)				\
 		if (!tcp_flow_is_established(&(flow)->tcp))		\
 			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
 			continue;					\
@@ -801,7 +781,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	struct flow_free_cluster *free_head = NULL;
 	unsigned *last_next = &flow_first_free;
 	bool timer = false;
-	unsigned idx;
+	union flow *flow;
 
 	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
 		timer = true;
@@ -810,8 +790,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 
 	ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
 
-	for (idx = 0; idx < FLOW_MAX; idx++) {
-		union flow *flow = &flowtab[idx];
+	flow_foreach_slot(flow) {
 		bool closed = false;
 
 		switch (flow->f.state) {
@@ -828,12 +807,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			} else {
 				/* New free cluster, add to chain */
 				free_head = &flow->free;
-				*last_next = idx;
+				*last_next = FLOW_IDX(flow);
 				last_next = &free_head->next;
 			}
 
 			/* Skip remaining empty entries */
-			idx += skip - 1;
+			flow += skip - 1;
 			continue;
 		}
 
@@ -886,14 +865,15 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 
 			if (free_head) {
 				/* Add slot to current free cluster */
-				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
+				ASSERT(FLOW_IDX(flow) ==
+				       FLOW_IDX(free_head) + free_head->n);
 				free_head->n++;
 				flow->free.n = flow->free.next = 0;
 			} else {
 				/* Create new free cluster */
 				free_head = &flow->free;
 				free_head->n = 1;
-				*last_next = idx;
+				*last_next = FLOW_IDX(flow);
 				last_next = &free_head->next;
 			}
 		} else {
diff --git a/flow_table.h b/flow_table.h
index 9a2ff24..fd2c57b 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -50,6 +50,42 @@ extern union flow flowtab[];
 #define flow_foreach_sidei(sidei_) \
 	for ((sidei_) = INISIDE; (sidei_) < SIDES; (sidei_)++)
 
+
+/**
+ * flow_foreach_slot() - Step through each flow table entry
+ * @flow:	Takes values of pointer to each flow table entry
+ *
+ * Includes FREE slots.
+ */
+#define flow_foreach_slot(flow)						\
+	for ((flow) = flowtab; FLOW_IDX(flow) < FLOW_MAX; (flow)++)
+
+/**
+ * flow_foreach() - Step through each active flow
+ * @flow:	Takes values of pointer to each active flow
+ */
+#define flow_foreach(flow)						\
+	flow_foreach_slot((flow))					\
+		if ((flow)->f.state == FLOW_STATE_FREE)			\
+			(flow) += (flow)->free.n - 1;			\
+		else if ((flow)->f.state != FLOW_STATE_ACTIVE) {	\
+			flow_err((flow), "Bad flow state during traversal"); \
+			continue;					\
+		} else
+
+/**
+ * flow_foreach_of_type() - Step through each active flow of given type
+ * @flow:	Takes values of pointer to each flow
+ * @type_:	Type of flow to traverse
+ */
+#define flow_foreach_of_type(flow, type_)				\
+	flow_foreach((flow))						\
+	if ((flow)->f.type != (type_))					\
+			/* NOLINTNEXTLINE(bugprone-branch-clone) */	\
+			continue;					\
+		else
+
+
 /** flow_idx() - Index of flow from common structure
  * @f:	Common flow fields pointer
  *

From 3dc7da68a2731f661d7251a5fc759daffe24ca70 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Feb 2025 14:14:27 +1100
Subject: [PATCH 090/221] conf: More thorough error checking when parsing --mtu
 option

We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
  * Passing a non-number like '-m xyz' will not give an error and act like
    -m 0
  * Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
    giving an error
  * We parse the MTU as a long, then immediately assign to an int, so on
    some platforms certain ludicrously out of bounds values will be
    silently truncated, rather than giving an error

Be a bit more thorough with the error checking to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c
index 18017f5..335f37c 100644
--- a/conf.c
+++ b/conf.c
@@ -1652,20 +1652,29 @@ void conf(struct ctx *c, int argc, char **argv)
 				die("Invalid PID file: %s", optarg);
 
 			break;
-		case 'm':
-			errno = 0;
-			c->mtu = strtol(optarg, NULL, 0);
+		case 'm': {
+			unsigned long mtu;
+			char *e;
 
-			if (!c->mtu) {
+			errno = 0;
+			mtu = strtoul(optarg, &e, 0);
+
+			if (errno || *e)
+				die("Invalid MTU: %s", optarg);
+
+			if (!mtu) {
 				c->mtu = -1;
 				break;
 			}
 
-			if (c->mtu < ETH_MIN_MTU || c->mtu > (int)ETH_MAX_MTU ||
-			    errno)
-				die("Invalid MTU: %s", optarg);
+			if (mtu < ETH_MIN_MTU || mtu > ETH_MAX_MTU) {
+				die("MTU %lu out of range (%u..%u)", mtu,
+				    ETH_MIN_MTU, ETH_MAX_MTU);
+			}
 
+			c->mtu = mtu;
 			break;
+		}
 		case 'a':
 			if (inet_pton(AF_INET6, optarg, &c->ip6.addr)	&&
 			    !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr)	&&

From 1cc5d4c9fe0a84d3d39fc07358996989ca1b5875 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Feb 2025 14:14:28 +1100
Subject: [PATCH 091/221] conf: Use 0 instead of -1 as "unassigned" mtu value

On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 11 ++---------
 dhcp.c  |  2 +-
 ndp.c   |  2 +-
 passt.h |  3 ++-
 pasta.c |  2 +-
 tcp.c   |  2 +-
 6 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/conf.c b/conf.c
index 335f37c..c5ee07b 100644
--- a/conf.c
+++ b/conf.c
@@ -1413,6 +1413,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:";
 	}
 
+	c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
 	c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
 	c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
 	memcpy(c->our_tap_mac, MAC_OUR_LAA, ETH_ALEN);
@@ -1662,12 +1663,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (errno || *e)
 				die("Invalid MTU: %s", optarg);
 
-			if (!mtu) {
-				c->mtu = -1;
-				break;
-			}
-
-			if (mtu < ETH_MIN_MTU || mtu > ETH_MAX_MTU) {
+			if (mtu && (mtu < ETH_MIN_MTU || mtu > ETH_MAX_MTU)) {
 				die("MTU %lu out of range (%u..%u)", mtu,
 				    ETH_MIN_MTU, ETH_MAX_MTU);
 			}
@@ -1980,9 +1976,6 @@ void conf(struct ctx *c, int argc, char **argv)
 		c->no_dhcpv6 = 1;
 	}
 
-	if (!c->mtu)
-		c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
-
 	get_dns(c);
 
 	if (!*c->pasta_ifn) {
diff --git a/dhcp.c b/dhcp.c
index 4a209f1..66a716e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -417,7 +417,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 		       &c->ip4.guest_gw, sizeof(c->ip4.guest_gw));
 	}
 
-	if (c->mtu != -1) {
+	if (c->mtu) {
 		opts[26].slen = 2;
 		opts[26].s[0] = c->mtu / 256;
 		opts[26].s[1] = c->mtu % 256;
diff --git a/ndp.c b/ndp.c
index 37bf7a3..ded2081 100644
--- a/ndp.c
+++ b/ndp.c
@@ -256,7 +256,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
 
 	ptr = &ra.var[0];
 
-	if (c->mtu != -1) {
+	if (c->mtu) {
 		struct opt_mtu *mtu = (struct opt_mtu *)ptr;
 		*mtu = (struct opt_mtu) {
 			.header = {
diff --git a/passt.h b/passt.h
index 1f0dab5..28d1389 100644
--- a/passt.h
+++ b/passt.h
@@ -274,6 +274,8 @@ struct ctx {
 	int fd_repair;
 	unsigned char our_tap_mac[ETH_ALEN];
 	unsigned char guest_mac[ETH_ALEN];
+	uint16_t mtu;
+
 	uint64_t hash_secret[2];
 
 	int ifi4;
@@ -298,7 +300,6 @@ struct ctx {
 	int no_icmp;
 	struct icmp_ctx icmp;
 
-	int mtu;
 	int no_dns;
 	int no_dns_search;
 	int no_dhcp_dns;
diff --git a/pasta.c b/pasta.c
index 585a51c..fa3e7de 100644
--- a/pasta.c
+++ b/pasta.c
@@ -319,7 +319,7 @@ void pasta_ns_conf(struct ctx *c)
 	if (c->pasta_conf_ns) {
 		unsigned int flags = IFF_UP;
 
-		if (c->mtu != -1)
+		if (c->mtu)
 			nl_link_set_mtu(nl_sock_ns, c->pasta_ifi, c->mtu);
 
 		if (c->ifi6) /* Avoid duplicate address detection on link up */
diff --git a/tcp.c b/tcp.c
index f498f5b..e3c0a53 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1139,7 +1139,7 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (flags & SYN) {
 		int mss;
 
-		if (c->mtu == -1) {
+		if (!c->mtu) {
 			mss = tinfo.tcpi_snd_mss;
 		} else {
 			mss = c->mtu - sizeof(struct tcphdr);

From 183bedf478e34079244fe4cfbb2c1a0f02a5a037 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 18 Feb 2025 09:34:26 +0100
Subject: [PATCH 092/221] Makefile: Use mmap2() as alternative for mmap() in
 valgrind extra syscalls

...instead of unconditionally trying to enable both: mmap2() is the
32-bit ARM variant for mmap() (and perhaps for other architectures),
bot if mmap() is available, valgrind will use that one.

This avoids seccomp.sh warning us about missing mmap2() if mmap() is
present, and is consistent with what we do in vhost-user code.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index d4e1096..f2ac8e5 100644
--- a/Makefile
+++ b/Makefile
@@ -109,9 +109,9 @@ passt-repair: $(PASST_REPAIR_SRCS) seccomp_repair.h
 	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
 
 valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction	\
-			    rt_sigreturn getpid gettid kill clock_gettime mmap \
-			    mmap2 munmap open unlink gettimeofday futex statx \
-			    readlink
+			    rt_sigreturn getpid gettid kill clock_gettime \
+			    mmap|mmap2 munmap open unlink gettimeofday futex \
+			    statx readlink
 valgrind: FLAGS += -g -DVALGRIND
 valgrind: all
 

From 16553c82806e0a55508baf553cb79e902638c10f Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 18 Feb 2025 09:42:28 +0100
Subject: [PATCH 093/221] dhcp: Add option code byte in calculation for OPT_MAX
 boundary check

Otherwise we'll limit messages to 577 bytes, instead of 576 bytes as
intended:

  $ fqdn="thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.thirtytwocharactersforeachlabel.then_make_it_251_with_this"
  $ hostname="__eighteen_bytes__"
  $ ./pasta --fqdn ${fqdn} -H ${hostname} -p dhcp.pcap -- /sbin/dhclient -4
  Saving packet capture to dhcp.pcap
  $ tshark -r dhcp.pcap -V -Y 'dhcp.option.value == 5' | grep "Total Length"
      Total Length: 577

This was hidden by the issue fixed by commit bcc4908c2b4a ("dhcp
Remove option 255 length byte") until now.

Fixes: 31e8109a86ee ("dhcp, dhcpv6: Add hostname and client fqdn ops")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 dhcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dhcp.c b/dhcp.c
index 66a716e..b0de04b 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -143,7 +143,7 @@ static bool fill_one(struct msg *m, int o, int *offset)
 	size_t slen = opts[o].slen;
 
 	/* If we don't have space to write the option, then just skip */
-	if (*offset + 1 /* length of option */ + slen > OPT_MAX)
+	if (*offset + 2 /* code and length of option */ + slen > OPT_MAX)
 		return true;
 
 	m->o[*offset] = o;

From 4dac2351fae5534c01e144273f849ce9ece0dca7 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 18 Feb 2025 09:49:40 +0100
Subject: [PATCH 094/221] contrib/fedora: Actually install passt-repair SELinux
 policy file

Otherwise we build it, but we don't install it. Not an issue that
warrants a a release right away as it's anyway usable.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/fedora/passt.spec | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/contrib/fedora/passt.spec b/contrib/fedora/passt.spec
index 6a83f8b..745cf01 100644
--- a/contrib/fedora/passt.spec
+++ b/contrib/fedora/passt.spec
@@ -44,7 +44,7 @@ Requires(preun): %{name}
 Requires(preun): policycoreutils
 
 %description selinux
-This package adds SELinux enforcement to passt(1) and pasta(1).
+This package adds SELinux enforcement to passt(1), pasta(1), passt-repair(1).
 
 %prep
 %setup -q -n passt-%{git_hash}
@@ -82,6 +82,7 @@ make -f %{_datadir}/selinux/devel/Makefile
 install -p -m 644 -D passt.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
 install -p -m 644 -D passt.if %{buildroot}%{_datadir}/selinux/devel/include/distributed/passt.if
 install -p -m 644 -D pasta.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
+install -p -m 644 -D passt-repair.pp %{buildroot}%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
 popd
 
 %pre selinux
@@ -90,11 +91,13 @@ popd
 %post selinux
 %selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
 %selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
+%selinux_modules_install -s %{selinuxtype} %{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
 
 %postun selinux
 if [ $1 -eq 0 ]; then
 	%selinux_modules_uninstall -s %{selinuxtype} passt
 	%selinux_modules_uninstall -s %{selinuxtype} pasta
+	%selinux_modules_uninstall -s %{selinuxtype} passt-repair
 fi
 
 %posttrans selinux
@@ -124,6 +127,7 @@ fi
 %{_datadir}/selinux/packages/%{selinuxtype}/passt.pp
 %{_datadir}/selinux/devel/include/distributed/passt.if
 %{_datadir}/selinux/packages/%{selinuxtype}/pasta.pp
+%{_datadir}/selinux/packages/%{selinuxtype}/passt-repair.pp
 
 %changelog
 {{{ passt_git_changelog }}}

From ea69ca6a20ac7408a913fd5de383a5383d679678 Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Wed, 19 Feb 2025 10:20:41 -0500
Subject: [PATCH 095/221] tap: always set the no_frag flag in IPv4 headers

When studying the Linux source code and Wireshark dumps it seems like
the no_frag flag in the IPv4 header is always set. Following discussions
in the Internet on this subject indicates that modern routers never
fragment packets, and that it isn't even supported in many cases.

Adding to this that incoming messages forwarded on the tap interface
never even pass through a router it seems safe to always set this flag.

This makes the IPv4 headers of forwarded messages identical to those
sent by the external sockets, something we must consider desirable.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 ip.h  | 3 ++-
 tap.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/ip.h b/ip.h
index 1544dbf..858cc89 100644
--- a/ip.h
+++ b/ip.h
@@ -36,13 +36,14 @@
 		.tos		= 0,					\
 		.tot_len	= 0,					\
 		.id		= 0,					\
-		.frag_off	= 0,					\
+		.frag_off	= htons(IP_DF), 			\
 		.ttl		= 0xff,					\
 		.protocol	= (proto),				\
 		.saddr		= 0,					\
 		.daddr		= 0,					\
 	}
 #define L2_BUF_IP4_PSUM(proto)	((uint32_t)htons_constant(0x4500) +	\
+				 (uint32_t)htons_constant(IP_DF) +	\
 				 (uint32_t)htons(0xff00 | (proto)))
 
 
diff --git a/tap.c b/tap.c
index d0673e5..44b0fc0 100644
--- a/tap.c
+++ b/tap.c
@@ -153,7 +153,7 @@ static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 	ip4h->tos = 0;
 	ip4h->tot_len = htons(l3len);
 	ip4h->id = 0;
-	ip4h->frag_off = 0;
+	ip4h->frag_off = htons(IP_DF);
 	ip4h->ttl = 255;
 	ip4h->protocol = proto;
 	ip4h->saddr = src.s_addr;

From be86232f72dcfbd51a889206e80d587fbcaa1c5b Mon Sep 17 00:00:00 2001
From: Michal Privoznik <mprivozn@redhat.com>
Date: Fri, 21 Feb 2025 12:53:13 +0100
Subject: [PATCH 096/221] seccomp.sh: Silence stty errors

When printing list of allowed syscalls the width of terminal is
obtained for nicer output (see commit below). The width is
obtained by running 'stty'. While this works when building from a
console, it doesn't work during rpmbuild/emerge/.. as stdout is
usually not a console but a logfile and stdin is usually
/dev/null or something. This results in stty reporting errors
like this:

  stty: 'standard input': Inappropriate ioctl for device

Redirect stty's stderr to /dev/null to silence it.

Fixes: 712ca3235329 ("seccomp.sh: Try to account for terminal width while formatting list of system calls")
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 seccomp.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/seccomp.sh b/seccomp.sh
index 4c521ae..a7bc417 100755
--- a/seccomp.sh
+++ b/seccomp.sh
@@ -255,7 +255,7 @@ for __p in ${__profiles}; do
 	__calls="${__calls} ${EXTRA_SYSCALLS:-}"
 	__calls="$(filter ${__calls})"
 
-	cols="$(stty -a | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
+	cols="$(stty -a 2>/dev/null | sed -n 's/.*columns \([0-9]*\).*/\1/p' || :)" 2>/dev/null
 	case $cols in [0-9]*) col_args="-w ${cols}";; *) col_args="";; esac
 	echo "seccomp profile ${__p} allows: ${__calls}" | tr '\n' ' ' | fmt -t ${col_args}
 

From 87471731e6bb0b5df3a50277527caf3381b45ee4 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 28 Feb 2025 01:14:01 +0100
Subject: [PATCH 097/221] selinux: Fixes/workarounds for passt and
 passt-repair, mostly for libvirt usage

Here are a bunch of workarounds and a couple of fixes for libvirt
usage which are rather hard to split into single logical patches
as there appear to be some obscure dependencies between some of them:

- passt-repair needs to have an exec_type typeattribute (otherwise
  the policy for lsmd(1) causes a violation on getattr on its
  executable) file, and that typeattribute just happened to be there
  for passt as a result of init_daemon_domain(), but passt-repair
  isn't a daemon, so we need an explicit corecmd_executable_file()

- passt-repair needs a workaround, which I'll revisit once
  https://github.com/fedora-selinux/selinux-policy/issues/2579 is
  solved, for usage with libvirt: allow it to use qemu_var_run_t
  and virt_var_run_t sockets

- add 'bpf' and 'dac_read_search' capabilities for passt-repair:
  they are needed (for whatever reason I didn't investigate) to
  actually receive socket files via SCM_RIGHTS

- passt needs further workarounds in the sense of
  https://github.com/fedora-selinux/selinux-policy/issues/2579:
  allow it to use map and use svirt_tmpfs_t (not just svirt_image_t):
  it depends on where the libvirt guest image is

- ...it also needs to map /dev/null if <access mode='shared'/> is
  enabled in libvirt's XML for the memoryBacking object, for
  vhost-user operation

- and 'ioctl' on the TCP socket appears to be actually needed, on top
  of 'getattr', to dump some socket parameters

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/passt-repair.te | 33 +++++++++++++++++++++++++++++++--
 contrib/selinux/passt.te        |  9 +++++++--
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/contrib/selinux/passt-repair.te b/contrib/selinux/passt-repair.te
index e3ffbcd..f171be6 100644
--- a/contrib/selinux/passt-repair.te
+++ b/contrib/selinux/passt-repair.te
@@ -28,12 +28,22 @@ require {
 	type console_device_t;
 	type user_devpts_t;
 	type user_tmp_t;
+
+	# Workaround: passt-repair needs to needs to access socket files
+	# that passt, started by libvirt, might create under different
+	# labels, depending on whether passt is started as root or not.
+	#
+	# However, libvirt doesn't maintain its own policy, which makes
+	# updates particularly complicated. To avoid breakage in the short
+	# term, deal with that in passt's own policy.
+	type qemu_var_run_t;
+	type virt_var_run_t;
 }
 
 type passt_repair_t;
 domain_type(passt_repair_t);
 type passt_repair_exec_t;
-files_type(passt_repair_exec_t);
+corecmd_executable_file(passt_repair_exec_t);
 
 role unconfined_r types passt_repair_t;
 
@@ -41,7 +51,8 @@ allow passt_repair_t passt_repair_exec_t:file { read execute execute_no_trans en
 type_transition unconfined_t passt_repair_exec_t:process passt_repair_t;
 allow unconfined_t passt_repair_t:process transition;
 
-allow passt_repair_t self:capability { dac_override net_admin net_raw };
+allow passt_repair_t self:capability { dac_override dac_read_search net_admin net_raw };
+allow passt_repair_t self:capability2 bpf;
 
 allow passt_repair_t console_device_t:chr_file { append open getattr read write ioctl };
 allow passt_repair_t user_devpts_t:chr_file { append open getattr read write ioctl };
@@ -50,9 +61,27 @@ allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
 allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
 allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
 
+allow passt_repair_t user_tmp_t:dir search;
+
 allow passt_repair_t unconfined_t:sock_file { read write };
 allow passt_repair_t passt_t:sock_file { read write };
 allow passt_repair_t user_tmp_t:sock_file { read write };
 
 allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
 allow passt_repair_t passt_t:tcp_socket { read setopt write };
+
+# Workaround: passt-repair needs to needs to access socket files
+# that passt, started by libvirt, might create under different
+# labels, depending on whether passt is started as root or not.
+#
+# However, libvirt doesn't maintain its own policy, which makes
+# updates particularly complicated. To avoid breakage in the short
+# term, deal with that in passt's own policy.
+allow passt_repair_t qemu_var_run_t:unix_stream_socket { connectto read write };
+allow passt_repair_t virt_var_run_t:unix_stream_socket { connectto read write };
+
+allow passt_repair_t qemu_var_run_t:dir search;
+allow passt_repair_t virt_var_run_t:dir search;
+
+allow passt_repair_t qemu_var_run_t:sock_file { read write };
+allow passt_repair_t virt_var_run_t:sock_file { read write };
diff --git a/contrib/selinux/passt.te b/contrib/selinux/passt.te
index f595079..f8ea672 100644
--- a/contrib/selinux/passt.te
+++ b/contrib/selinux/passt.te
@@ -29,6 +29,9 @@ require {
 	# particularly complicated. To avoid breakage in the short term,
 	# deal with it in passt's own policy.
 	type svirt_image_t;
+	type svirt_tmpfs_t;
+	type svirt_t;
+	type null_device_t;
 
 	class file { ioctl getattr setattr create read write unlink open relabelto execute execute_no_trans map };
 	class dir { search write add_name remove_name mounton };
@@ -45,7 +48,7 @@ require {
 	type net_conf_t;
 	type proc_net_t;
 	type node_t;
-	class tcp_socket { create accept listen name_bind name_connect getattr };
+	class tcp_socket { create accept listen name_bind name_connect getattr ioctl };
 	class udp_socket { create accept listen };
 	class icmp_socket { bind create name_bind node_bind setopt read write };
 	class sock_file { create unlink write };
@@ -129,7 +132,7 @@ corenet_udp_sendrecv_all_ports(passt_t)
 allow passt_t node_t:icmp_socket { name_bind node_bind };
 allow passt_t port_t:icmp_socket name_bind;
 
-allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr };
+allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr ioctl };
 allow passt_t self:udp_socket { create getopt setopt connect bind read write };
 allow passt_t self:icmp_socket { bind create setopt read write };
 
@@ -143,3 +146,5 @@ allow passt_t unconfined_t:unix_stream_socket { read write };
 # particularly complicated. To avoid breakage in the short term,
 # deal with it in passt's own policy.
 allow passt_t svirt_image_t:file { read write map };
+allow passt_t svirt_tmpfs_t:file { read write map };
+allow passt_t null_device_t:chr_file map;

From 7b92f2e8525a94fb6f80d5e0bedba7eacc378714 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 27 Feb 2025 16:55:13 +1100
Subject: [PATCH 098/221] migrate, flow: Trivially succeed if migrating with no
 flows

We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/flow.c b/flow.c
index bb5dcc3..6cf96c2 100644
--- a/flow.c
+++ b/flow.c
@@ -999,6 +999,9 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 
 	debug("Sending %u flows", ntohl(count));
 
+	if (!count)
+		return 0;
+
 	/* Dump and send information that can be stored in the flow table.
 	 *
 	 * Limited rollback options here: if we fail to transfer any data (that
@@ -1070,6 +1073,9 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	count = ntohl(count);
 	debug("Receiving %u flows", count);
 
+	if (!count)
+		return 0;
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 

From 39f85bce1a3b9da3bd11458c521e589f674e587a Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 27 Feb 2025 16:55:14 +1100
Subject: [PATCH 099/221] migrate, flow: Don't attempt to migrate TCP flows
 without passt-repair

Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index 6cf96c2..749c498 100644
--- a/flow.c
+++ b/flow.c
@@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
+	/* If we don't have a repair helper, there's nothing we can do */
+	if (c->fd_repair < 0)
+		return 0;
+
 	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
@@ -987,8 +991,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow)
-		count++;
+	/* If we don't have a repair helper, we can't migrate TCP flows */
+	if (c->fd_repair >= 0) {
+		foreach_established_tcp_flow(flow)
+			count++;
+	}
 
 	count = htonl(count);
 	if (write_all_buf(fd, &count, sizeof(count))) {

From 56ce03ed0acf2a41c67d44e353c00a018604ccb7 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 27 Feb 2025 16:55:15 +1100
Subject: [PATCH 100/221] tcp: Correct error code handling from
 tcp_flow_repair_socket()

There are two small bugs in error returns from tcp_low_repair_socket(),
which is supposed to return a negative errno code:

1) On bind() failures, wedirectly pass on the return code from bind(),
   which is just 0 or -1, instead of an error code.

2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.

Correct both of these.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index e3c0a53..8528ee3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3280,7 +3280,8 @@ int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
 
 	tcp_sock_set_nodelay(s);
 
-	if ((rc = bind(s, &a.sa, sizeof(a)))) {
+	if (bind(s, &a.sa, sizeof(a))) {
+		rc = -errno;
 		err_perror("Failed to bind socket for migrated flow");
 		goto err;
 	}
@@ -3375,7 +3376,7 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 	conn->seq_init_from_tap		= ntohl(t.seq_init_from_tap);
 
 	if ((rc = tcp_flow_repair_socket(c, conn))) {
-		flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
+		flow_err(flow, "Can't set up socket: %s, drop", strerror_(-rc));
 		flow_alloc_cancel(flow);
 		return 0;
 	}

From b2708218a6eec82fad98da52d7569d13cf35e05c Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 27 Feb 2025 16:55:16 +1100
Subject: [PATCH 101/221] tcp: Unconditionally move to CLOSED state on
 tcp_rst()

tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state.  However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong.  So we should still mark
the flow as CLOSED.  That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index 8528ee3..d23b6d9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1214,8 +1214,8 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
 	if (conn->events == CLOSED)
 		return;
 
-	if (!tcp_send_flag(c, conn, RST))
-		conn_event(c, conn, CLOSED);
+	tcp_send_flag(c, conn, RST);
+	conn_event(c, conn, CLOSED);
 }
 
 /**

From 52419a64f2dfa31707b31148e6a311bb57be6e5f Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 27 Feb 2025 16:55:17 +1100
Subject: [PATCH 102/221] migrate, tcp: Don't flow_alloc_cancel() during
 incoming migration

In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer one to one match to the flows which the source is
sending data for.  This means that data for later flows will be mismatched
to a different flow.  Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/tcp.c b/tcp.c
index d23b6d9..b3aa9a2 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2708,6 +2708,9 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
 {
 	int rc = 0;
 
+	if (conn->sock < 0)
+		return 0;
+
 	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
 		err("Failed to set TCP_REPAIR");
 
@@ -2725,6 +2728,9 @@ int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
 {
 	int rc = 0;
 
+	if (conn->sock < 0)
+		return 0;
+
 	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
 		err("Failed to clear TCP_REPAIR");
 
@@ -3377,7 +3383,8 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 
 	if ((rc = tcp_flow_repair_socket(c, conn))) {
 		flow_err(flow, "Can't set up socket: %s, drop", strerror_(-rc));
-		flow_alloc_cancel(flow);
+		/* Can't leave the flow in an incomplete state */
+		FLOW_ACTIVATE(conn);
 		return 0;
 	}
 
@@ -3453,6 +3460,10 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		return rc;
 	}
 
+	if (conn->sock < 0)
+		/* We weren't able to create the socket, discard flow */
+		goto fail;
+
 	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
 		goto fail;
 
@@ -3540,8 +3551,10 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 	return 0;
 
 fail:
-	tcp_flow_repair_off(c, conn);
-	repair_flush(c);
+	if (conn->sock >= 0) {
+		tcp_flow_repair_off(c, conn);
+		repair_flush(c);
+	}
 
 	conn->flags = 0; /* Not waiting for ACK, don't schedule timer */
 	tcp_rst(c, conn);

From 008175636c789d36ef585a94eee4d62536cac7d6 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 15:32:28 +1100
Subject: [PATCH 103/221] ip: Helpers to access IPv6 flow label

The flow label is a 20-bit field in the IPv6 header.  The length and
alignment make it awkward to pass around as is.  Obviously, it can be
packed into a 32-bit integer though, and we do this in two places.  We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 ip.h  | 25 +++++++++++++++++++++++++
 tap.c |  4 +---
 tcp.c |  4 +---
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/ip.h b/ip.h
index 858cc89..5edb7e7 100644
--- a/ip.h
+++ b/ip.h
@@ -91,6 +91,31 @@ struct ipv6_opt_hdr {
 	 */
 } __attribute__((packed));	/* required for some archs */
 
+/**
+ * ip6_set_flow_lbl() - Set flow label in an IPv6 header
+ * @ip6h:	Pointer to IPv6 header, updated
+ * @flow:	Set @ip6h flow label to the low 20 bits of this integer
+ */
+static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow)
+{
+	ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
+	ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
+	ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
+}
+
+/** ip6_get_flow_lbl() - Get flow label from an IPv6 header
+ * @ip6h:	Pointer to IPv6 header
+ *
+ * Return: flow label from @ip6h as an integer (<= 20 bits)
+ */
+/* cppcheck-suppress unusedFunction */
+static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
+{
+	return (ip6h->flow_lbl[0] & 0xf) << 16 |
+		ip6h->flow_lbl[1] << 8 |
+		ip6h->flow_lbl[2];
+}
+
 char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
 		 size_t *dlen);
 
diff --git a/tap.c b/tap.c
index 44b0fc0..3908262 100644
--- a/tap.c
+++ b/tap.c
@@ -241,9 +241,7 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h,
 	ip6h->hop_limit = 255;
 	ip6h->saddr = *src;
 	ip6h->daddr = *dst;
-	ip6h->flow_lbl[0] = (flow >> 16) & 0xf;
-	ip6h->flow_lbl[1] = (flow >> 8) & 0xff;
-	ip6h->flow_lbl[2] = (flow >> 0) & 0xff;
+	ip6_set_flow_lbl(ip6h, flow);
 	return ip6h + 1;
 }
 
diff --git a/tcp.c b/tcp.c
index b3aa9a2..7459803 100644
--- a/tcp.c
+++ b/tcp.c
@@ -963,9 +963,7 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn,
 		ip6h->version = 6;
 		ip6h->nexthdr = IPPROTO_TCP;
 
-		ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf;
-		ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
-		ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
+		ip6_set_flow_lbl(ip6h, conn->sock);
 
 		if (!no_tcp_csum) {
 			psum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,

From 1f236817ea715e9215e0fe4ecb0938d0a9809ce1 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 15:32:29 +1100
Subject: [PATCH 104/221] tap: Consider IPv6 flow label when building packet
 sequences

To allow more batching, we group together related packets into "seqs" in
the tap layer, before passing them to the L4 protocol layers.  Currently
we consider the IP protocol, both IP addresses and also the L4 ports when
grouping things into seqs.  We ignore the IPv6 flow label.

We have some future cases where we want to consider the the flow label in
the L4 code, which is awkward if we could be given a single batch with
multiple labels.  Add the flow label to tap6_l4_t and group by it as well
as the other criteria.  In future we could possibly use the flow label
_instead_ of peeking into the L4 header for the ports, but we don't do so
for now.

The guest should use the same flow label for all packets in a low, but if
it doesn't this change won't break anything, it just means we'll batch
things a bit sub-optimally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 ip.h  | 1 -
 tap.c | 4 ++++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/ip.h b/ip.h
index 5edb7e7..c82431e 100644
--- a/ip.h
+++ b/ip.h
@@ -108,7 +108,6 @@ static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow)
  *
  * Return: flow label from @ip6h as an integer (<= 20 bits)
  */
-/* cppcheck-suppress unusedFunction */
 static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
 {
 	return (ip6h->flow_lbl[0] & 0xf) << 16 |
diff --git a/tap.c b/tap.c
index 3908262..202abae 100644
--- a/tap.c
+++ b/tap.c
@@ -489,6 +489,7 @@ static struct tap4_l4_t {
  * struct l4_seq6_t - Message sequence for one protocol handler call, IPv6
  * @msgs:	Count of messages in sequence
  * @protocol:	Protocol number
+ * @flow_lbl:	IPv6 flow label
  * @source:	Source port
  * @dest:	Destination port
  * @saddr:	Source address
@@ -497,6 +498,7 @@ static struct tap4_l4_t {
  */
 static struct tap6_l4_t {
 	uint8_t protocol;
+	uint32_t flow_lbl :20;
 
 	uint16_t source;
 	uint16_t dest;
@@ -870,6 +872,7 @@ resume:
 		((seq)->protocol == (proto)                &&		\
 		 (seq)->source   == (uh)->source           &&		\
 		 (seq)->dest == (uh)->dest                 &&		\
+		 (seq)->flow_lbl == ip6_get_flow_lbl(ip6h) &&		\
 		 IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr)  &&		\
 		 IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
 
@@ -878,6 +881,7 @@ resume:
 		(seq)->protocol	= (proto);				\
 		(seq)->source	= (uh)->source;				\
 		(seq)->dest	= (uh)->dest;				\
+		(seq)->flow_lbl	= ip6_get_flow_lbl(ip6h);		\
 		(seq)->saddr	= *saddr;				\
 		(seq)->daddr	= *daddr;				\
 	} while (0)

From 672d786de1c1f2aca32caedbcf440f710c4aecb5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 15:32:30 +1100
Subject: [PATCH 105/221] tcp: Send RST in response to guest packets that match
 no connection

Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 17 +++++++-------
 tap.h |  6 +++++
 tcp.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 tcp.h |  2 +-
 4 files changed, 88 insertions(+), 11 deletions(-)

diff --git a/tap.c b/tap.c
index 202abae..86d051e 100644
--- a/tap.c
+++ b/tap.c
@@ -122,7 +122,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
  *
  * Return: pointer at which to write the packet's payload
  */
-static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
+void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
 {
 	struct ethhdr *eh = (struct ethhdr *)buf;
 
@@ -143,8 +143,8 @@ static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
  *
  * Return: pointer at which to write the packet's payload
  */
-static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
-			   struct in_addr dst, size_t l4len, uint8_t proto)
+void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
+		    struct in_addr dst, size_t l4len, uint8_t proto)
 {
 	uint16_t l3len = l4len + sizeof(*ip4h);
 
@@ -229,10 +229,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
  *
  * Return: pointer at which to write the packet's payload
  */
-static void *tap_push_ip6h(struct ipv6hdr *ip6h,
-			   const struct in6_addr *src,
-			   const struct in6_addr *dst,
-			   size_t l4len, uint8_t proto, uint32_t flow)
+void *tap_push_ip6h(struct ipv6hdr *ip6h,
+		    const struct in6_addr *src, const struct in6_addr *dst,
+		    size_t l4len, uint8_t proto, uint32_t flow)
 {
 	ip6h->payload_len = htons(l4len);
 	ip6h->priority = 0;
@@ -744,7 +743,7 @@ append:
 			for (k = 0; k < p->count; )
 				k += tcp_tap_handler(c, PIF_TAP, AF_INET,
 						     &seq->saddr, &seq->daddr,
-						     p, k, now);
+						     0, p, k, now);
 		} else if (seq->protocol == IPPROTO_UDP) {
 			if (c->no_udp)
 				continue;
@@ -927,7 +926,7 @@ append:
 			for (k = 0; k < p->count; )
 				k += tcp_tap_handler(c, PIF_TAP, AF_INET6,
 						     &seq->saddr, &seq->daddr,
-						     p, k, now);
+						     seq->flow_lbl, p, k, now);
 		} else if (seq->protocol == IPPROTO_UDP) {
 			if (c->no_udp)
 				continue;
diff --git a/tap.h b/tap.h
index a476a12..390ac12 100644
--- a/tap.h
+++ b/tap.h
@@ -42,6 +42,9 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 		thdr->vnet_len = htonl(l2len);
 }
 
+void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
+void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
+		     struct in_addr dst, size_t l4len, uint8_t proto);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
@@ -49,6 +52,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
 		    const void *in, size_t l4len);
 const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
 				     const struct in6_addr *src);
+void *tap_push_ip6h(struct ipv6hdr *ip6h,
+		    const struct in6_addr *src, const struct in6_addr *dst,
+		    size_t l4len, uint8_t proto, uint32_t flow);
 void tap_udp6_send(const struct ctx *c,
 		   const struct in6_addr *src, in_port_t sport,
 		   const struct in6_addr *dst, in_port_t dport,
diff --git a/tcp.c b/tcp.c
index 7459803..fb04e2e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1866,6 +1866,75 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
 	tcp_data_from_sock(c, conn);
 }
 
+/**
+ * tcp_rst_no_conn() - Send RST in response to a packet with no connection
+ * @c:		Execution context
+ * @af:		Address family, AF_INET or AF_INET6
+ * @saddr:	Source address of the packet we're responding to
+ * @daddr:	Destination address of the packet we're responding to
+ * @flow_lbl:	IPv6 flow label (ignored for IPv4)
+ * @th:		TCP header of the packet we're responding to
+ * @l4len:	Packet length, including TCP header
+ */
+static void tcp_rst_no_conn(const struct ctx *c, int af,
+			    const void *saddr, const void *daddr,
+			    uint32_t flow_lbl,
+			    const struct tcphdr *th, size_t l4len)
+{
+	struct iov_tail payload = IOV_TAIL(NULL, 0, 0);
+	struct tcphdr *rsth;
+	char buf[USHRT_MAX];
+	uint32_t psum = 0;
+	size_t rst_l2len;
+
+	/* Don't respond to RSTs without a connection */
+	if (th->rst)
+		return;
+
+	if (af == AF_INET) {
+		struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+		const struct in_addr *rst_src = daddr;
+		const struct in_addr *rst_dst = saddr;
+
+		rsth = tap_push_ip4h(ip4h, *rst_src, *rst_dst,
+				     sizeof(*rsth), IPPROTO_TCP);
+		psum = proto_ipv4_header_psum(sizeof(*rsth), IPPROTO_TCP,
+					      *rst_src, *rst_dst);
+
+	} else {
+		struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+		const struct in6_addr *rst_src = daddr;
+		const struct in6_addr *rst_dst = saddr;
+
+		rsth = tap_push_ip6h(ip6h, rst_src, rst_dst,
+				     sizeof(*rsth), IPPROTO_TCP, flow_lbl);
+		psum = proto_ipv6_header_psum(sizeof(*rsth), IPPROTO_TCP,
+					      rst_src, rst_dst);
+	}
+
+	memset(rsth, 0, sizeof(*rsth));
+
+	rsth->source = th->dest;
+	rsth->dest = th->source;
+	rsth->rst = 1;
+	rsth->doff = sizeof(*rsth) / 4UL;
+
+	/* Sequence matching logic from RFC 9293 section 3.10.7.1 */
+	if (th->ack) {
+		rsth->seq = th->ack_seq;
+	} else {
+		size_t dlen = l4len - th->doff * 4UL;
+		uint32_t ack = ntohl(th->seq) + dlen;
+
+		rsth->ack_seq = htonl(ack);
+		rsth->ack = 1;
+	}
+
+	tcp_update_csum(psum, rsth, &payload);
+	rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
+	tap_send_single(c, buf, rst_l2len);
+}
+
 /**
  * tcp_tap_handler() - Handle packets from tap and state transitions
  * @c:		Execution context
@@ -1873,6 +1942,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
  * @af:		Address family, AF_INET or AF_INET6
  * @saddr:	Source address
  * @daddr:	Destination address
+ * @flow_lbl:	IPv6 flow label (ignored for IPv4)
  * @p:		Pool of TCP packets, with TCP headers
  * @idx:	Index of first packet in pool to process
  * @now:	Current timestamp
@@ -1880,7 +1950,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
  * Return: count of consumed packets
  */
 int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
-		    const void *saddr, const void *daddr,
+		    const void *saddr, const void *daddr, uint32_t flow_lbl,
 		    const struct pool *p, int idx, const struct timespec *now)
 {
 	struct tcp_tap_conn *conn;
@@ -1913,6 +1983,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (opts && th->syn && !th->ack)
 			tcp_conn_from_tap(c, af, saddr, daddr, th,
 					  opts, optlen, now);
+		else
+			tcp_rst_no_conn(c, af, saddr, daddr, flow_lbl, th, len);
 		return 1;
 	}
 
diff --git a/tcp.h b/tcp.h
index cf30744..9142eca 100644
--- a/tcp.h
+++ b/tcp.h
@@ -16,7 +16,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		      uint32_t events);
 int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
-		    const void *saddr, const void *daddr,
+		    const void *saddr, const void *daddr, uint32_t flow_lbl,
 		    const struct pool *p, int idx, const struct timespec *now);
 int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
 		  const char *ifname, in_port_t port);

From 1924e25f0723c0a86c1e33812f8e1d8aa045a146 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:20:03 +1100
Subject: [PATCH 106/221] conf: Be more precise about minimum MTUs

Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68).  That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4.  For IPv6, it's not sufficient: that requires an MTU of at least
1280.

Newer kernels have better named constants IPV4_MIN_MTU and IPv6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 18 +++++++++++++++---
 ip.h   |  7 +++++++
 util.h |  6 ------
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/conf.c b/conf.c
index c5ee07b..065e720 100644
--- a/conf.c
+++ b/conf.c
@@ -1663,9 +1663,9 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (errno || *e)
 				die("Invalid MTU: %s", optarg);
 
-			if (mtu && (mtu < ETH_MIN_MTU || mtu > ETH_MAX_MTU)) {
-				die("MTU %lu out of range (%u..%u)", mtu,
-				    ETH_MIN_MTU, ETH_MAX_MTU);
+			if (mtu > ETH_MAX_MTU) {
+				die("MTU %lu too large (max %u)",
+				    mtu, ETH_MAX_MTU);
 			}
 
 			c->mtu = mtu;
@@ -1842,9 +1842,21 @@ void conf(struct ctx *c, int argc, char **argv)
 		c->ifi4 = conf_ip4(ifi4, &c->ip4);
 	if (!v4_only)
 		c->ifi6 = conf_ip6(ifi6, &c->ip6);
+
+	if (c->ifi4 && c->mtu < IPV4_MIN_MTU) {
+		warn("MTU %"PRIu16" is too small for IPv4 (minimum %u)",
+		     c->mtu, IPV4_MIN_MTU);
+	}
+	if (c->ifi6 && c->mtu < IPV6_MIN_MTU) {
+		warn("MTU %"PRIu16" is too small for IPv6 (minimum %u)",
+			     c->mtu, IPV6_MIN_MTU);
+	}
+
 	if ((*c->ip4.ifname_out && !c->ifi4) ||
 	    (*c->ip6.ifname_out && !c->ifi6))
 		die("External interface not usable");
+
+
 	if (!c->ifi4 && !c->ifi6) {
 		info("No external interface as template, switch to local mode");
 
diff --git a/ip.h b/ip.h
index c82431e..471c57e 100644
--- a/ip.h
+++ b/ip.h
@@ -129,4 +129,11 @@ static const struct in6_addr in6addr_ll_all_nodes = {
 /* IPv4 Limited Broadcast (RFC 919, Section 7), 255.255.255.255 */
 static const struct in_addr in4addr_broadcast = { 0xffffffff };
 
+#ifndef IPV4_MIN_MTU
+#define IPV4_MIN_MTU		68
+#endif
+#ifndef IPV6_MIN_MTU
+#define IPV6_MIN_MTU		1280
+#endif
+
 #endif /* IP_H */
diff --git a/util.h b/util.h
index 50e96d3..0f70f4d 100644
--- a/util.h
+++ b/util.h
@@ -34,15 +34,9 @@
 #ifndef ETH_MAX_MTU
 #define ETH_MAX_MTU			USHRT_MAX
 #endif
-#ifndef ETH_MIN_MTU
-#define ETH_MIN_MTU			68
-#endif
 #ifndef IP_MAX_MTU
 #define IP_MAX_MTU			USHRT_MAX
 #endif
-#ifndef IPV6_MIN_MTU
-#define IPV6_MIN_MTU			1280
-#endif
 
 #ifndef MIN
 #define MIN(x, y)		(((x) < (y)) ? (x) : (y))

From 82a839be988ecfdb013b5823afc93211200a9f55 Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Thu, 6 Mar 2025 13:00:03 -0500
Subject: [PATCH 107/221] tap: break out building of udp header from
 tap_udp4_send function

We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 34 +++++++++++++++++++++++++++-------
 tap.h |  5 +++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/tap.c b/tap.c
index 86d051e..6f7063e 100644
--- a/tap.c
+++ b/tap.c
@@ -163,7 +163,7 @@ void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 }
 
 /**
- * tap_udp4_send() - Send UDP over IPv4 packet
+ * tap_push_uh4() - Build UDPv4 header with checksum
  * @c:		Execution context
  * @src:	IPv4 source address
  * @sport:	UDP source port
@@ -171,16 +171,14 @@ void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
  * @dport:	UDP destination port
  * @in:		UDP payload contents (not including UDP header)
  * @dlen:	UDP payload length (not including UDP header)
+ *
+ * Return: pointer at which to write the packet's payload
  */
-void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
+void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen)
 {
 	size_t l4len = dlen + sizeof(struct udphdr);
-	char buf[USHRT_MAX];
-	struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
-	struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
-	char *data = (char *)(uh + 1);
 	const struct iovec iov = {
 		.iov_base = (void *)in,
 		.iov_len = dlen
@@ -191,8 +189,30 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 	uh->dest = htons(dport);
 	uh->len = htons(l4len);
 	csum_udp4(uh, src, dst, &payload);
-	memcpy(data, in, dlen);
+	return (char *)uh + sizeof(*uh);
+}
 
+/**
+ * tap_udp4_send() - Send UDP over IPv4 packet
+ * @c:		Execution context
+ * @src:	IPv4 source address
+ * @sport:	UDP source port
+ * @dst:	IPv4 destination address
+ * @dport:	UDP destination port
+ * @in:	UDP payload contents (not including UDP header)
+ * @dlen:	UDP payload length (not including UDP header)
+ */
+void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
+		   struct in_addr dst, in_port_t dport,
+		   const void *in, size_t dlen)
+{
+	size_t l4len = dlen + sizeof(struct udphdr);
+	char buf[USHRT_MAX];
+	struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+	struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
+	char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
+
+	memcpy(data, in, dlen);
 	tap_send_single(c, buf, dlen + (data - buf));
 }
 
diff --git a/tap.h b/tap.h
index 390ac12..a2cf9bc 100644
--- a/tap.h
+++ b/tap.h
@@ -6,6 +6,8 @@
 #ifndef TAP_H
 #define TAP_H
 
+struct udphdr;
+
 /**
  * struct tap_hdr - tap backend specific headers
  * @vnet_len:	Frame length (for qemu socket transport)
@@ -45,6 +47,9 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 		     struct in_addr dst, size_t l4len, uint8_t proto);
+void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
+		   struct in_addr dst, in_port_t dport,
+		   const void *in, size_t dlen);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);

From 55431f0077b6a25c264bd2492680d7f99815cc5f Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Thu, 6 Mar 2025 13:00:04 -0500
Subject: [PATCH 108/221] udp: create and send ICMPv4 to local peer when
 applicable

When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c          |  2 +-
 tap.h          |  2 ++
 udp.c          | 87 +++++++++++++++++++++++++++++++++++++++++++-------
 udp_internal.h |  2 +-
 udp_vu.c       |  4 +--
 5 files changed, 81 insertions(+), 16 deletions(-)

diff --git a/tap.c b/tap.c
index 6f7063e..57d0795 100644
--- a/tap.c
+++ b/tap.c
@@ -159,7 +159,7 @@ void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 	ip4h->saddr = src.s_addr;
 	ip4h->daddr = dst.s_addr;
 	ip4h->check = csum_ip4_header(l3len, proto, src, dst);
-	return ip4h + 1;
+	return (char *)ip4h + sizeof(*ip4h);
 }
 
 /**
diff --git a/tap.h b/tap.h
index a2cf9bc..9ac17ce 100644
--- a/tap.h
+++ b/tap.h
@@ -50,6 +50,8 @@ void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
+void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
+		    struct in_addr dst, size_t l4len, uint8_t proto);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
diff --git a/udp.c b/udp.c
index 923cc38..b72c3ce 100644
--- a/udp.c
+++ b/udp.c
@@ -87,6 +87,7 @@
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/udp.h>
+#include <netinet/ip_icmp.h>
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
@@ -112,6 +113,9 @@
 #include "udp_internal.h"
 #include "udp_vu.h"
 
+/* Maximum UDP data to be returned in ICMP messages */
+#define ICMP4_MAX_DLEN 8
+
 /* "Spliced" sockets indexed by bound port (host order) */
 static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
 static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
@@ -402,25 +406,76 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
 	(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
 }
 
+/**
+ * udp_send_conn_fail_icmp4() - Construct and send ICMPv4 to local peer
+ * @c:		Execution context
+ * @ee:	Extended error descriptor
+ * @toside:	Destination side of flow
+ * @saddr:	Address of ICMP generating node
+ * @in:	First bytes (max 8) of original UDP message body
+ * @dlen:	Length of the read part of original UDP message body
+ */
+static void udp_send_conn_fail_icmp4(const struct ctx *c,
+				     const struct sock_extended_err *ee,
+				     const struct flowside *toside,
+				     struct in_addr saddr,
+				     const void *in, size_t dlen)
+{
+	struct in_addr oaddr = toside->oaddr.v4mapped.a4;
+	struct in_addr eaddr = toside->eaddr.v4mapped.a4;
+	in_port_t eport = toside->eport;
+	in_port_t oport = toside->oport;
+	struct {
+		struct icmphdr icmp4h;
+		struct iphdr ip4h;
+		struct udphdr uh;
+		char data[ICMP4_MAX_DLEN];
+	} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
+	size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
+	size_t l4len = dlen + sizeof(struct udphdr);
+
+	ASSERT(dlen <= ICMP4_MAX_DLEN);
+	memset(&msg, 0, sizeof(msg));
+	msg.icmp4h.type = ee->ee_type;
+	msg.icmp4h.code = ee->ee_code;
+	if (ee->ee_type == ICMP_DEST_UNREACH && ee->ee_code == ICMP_FRAG_NEEDED)
+		msg.icmp4h.un.frag.mtu = htons((uint16_t) ee->ee_info);
+
+	/* Reconstruct the original headers as returned in the ICMP message */
+	tap_push_ip4h(&msg.ip4h, eaddr, oaddr, l4len, IPPROTO_UDP);
+	tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
+	memcpy(&msg.data, in, dlen);
+
+	tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
+}
+
 /**
  * udp_sock_recverr() - Receive and clear an error from a socket
- * @s:		Socket to receive from
+ * @c:		Execution context
+ * @ref:	epoll reference
  *
  * Return: 1 if error received and processed, 0 if no more errors in queue, < 0
  *         if there was an error reading the queue
  *
  * #syscalls recvmsg
  */
-static int udp_sock_recverr(int s)
+static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 {
 	const struct sock_extended_err *ee;
 	const struct cmsghdr *hdr;
+	union sockaddr_inany saddr;
 	char buf[CMSG_SPACE(sizeof(*ee))];
+	char data[ICMP4_MAX_DLEN];
+	int s = ref.fd;
+	struct iovec iov = {
+		.iov_base = data,
+		.iov_len = sizeof(data)
+	};
 	struct msghdr mh = {
-		.msg_name = NULL,
-		.msg_namelen = 0,
-		.msg_iov = NULL,
-		.msg_iovlen = 0,
+		.msg_name = &saddr,
+		.msg_namelen = sizeof(saddr),
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
 		.msg_control = buf,
 		.msg_controllen = sizeof(buf),
 	};
@@ -450,8 +505,15 @@ static int udp_sock_recverr(int s)
 	}
 
 	ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
+	if (ref.type == EPOLL_TYPE_UDP_REPLY) {
+		flow_sidx_t sidx = flow_sidx_opposite(ref.flowside);
+		const struct flowside *toside = flowside_at_sidx(sidx);
 
-	/* TODO: When possible propagate and otherwise handle errors */
+		udp_send_conn_fail_icmp4(c, ee, toside, saddr.sa4.sin_addr,
+					 data, rc);
+	} else {
+		trace("Ignoring received IP_RECVERR cmsg on listener socket");
+	}
 	debug("%s error on UDP socket %i: %s",
 	      str_ee_origin(ee), s, strerror_(ee->ee_errno));
 
@@ -461,15 +523,16 @@ static int udp_sock_recverr(int s)
 /**
  * udp_sock_errs() - Process errors on a socket
  * @c:		Execution context
- * @s:		Socket to receive from
+ * @ref:	epoll reference
  * @events:	epoll events bitmap
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
+int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t events)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
+	int s = ref.fd;
 	int rc, err;
 
 	ASSERT(!c->no_udp);
@@ -478,7 +541,7 @@ int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 		return 0; /* Nothing to do */
 
 	/* Empty the error queue */
-	while ((rc = udp_sock_recverr(s)) > 0)
+	while ((rc = udp_sock_recverr(c, ref)) > 0)
 		n_err += rc;
 
 	if (rc < 0)
@@ -558,7 +621,7 @@ static void udp_buf_listen_sock_handler(const struct ctx *c,
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
 
-	if (udp_sock_errs(c, ref.fd, events) < 0) {
+	if (udp_sock_errs(c, ref, events) < 0) {
 		err("UDP: Unrecoverable error on listening socket:"
 		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
 		/* FIXME: what now?  close/re-open socket? */
@@ -661,7 +724,7 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	from_s = uflow->s[ref.flowside.sidei];
 
-	if (udp_sock_errs(c, from_s, events) < 0) {
+	if (udp_sock_errs(c, ref, events) < 0) {
 		flow_err(uflow, "Unrecoverable error on reply socket");
 		flow_err_details(uflow);
 		udp_flow_close(c, uflow);
diff --git a/udp_internal.h b/udp_internal.h
index cc80e30..3b081f5 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -30,5 +30,5 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
                        const struct flowside *toside, size_t dlen,
 		       bool no_udp_csum);
-int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
+int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t events);
 #endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
index 4123510..c26a223 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -227,7 +227,7 @@ void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
 
-	if (udp_sock_errs(c, ref.fd, events) < 0) {
+	if (udp_sock_errs(c, ref, events) < 0) {
 		err("UDP: Unrecoverable error on listening socket:"
 		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
 		return;
@@ -302,7 +302,7 @@ void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	ASSERT(!c->no_udp);
 
-	if (udp_sock_errs(c, from_s, events) < 0) {
+	if (udp_sock_errs(c, ref, events) < 0) {
 		flow_err(uflow, "Unrecoverable error on reply socket");
 		flow_err_details(uflow);
 		udp_flow_close(c, uflow);

From 87e6a464429372dfaa7212b61e5062dad87179dc Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Thu, 6 Mar 2025 13:00:05 -0500
Subject: [PATCH 109/221] tap: break out building of udp header from
 tap_udp6_send function

We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 46 ++++++++++++++++++++++++++++++++++------------
 tap.h |  4 ++++
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/tap.c b/tap.c
index 57d0795..7082620 100644
--- a/tap.c
+++ b/tap.c
@@ -265,7 +265,7 @@ void *tap_push_ip6h(struct ipv6hdr *ip6h,
 }
 
 /**
- * tap_udp6_send() - Send UDP over IPv6 packet
+ * tap_push_uh6() - Build UDPv6 header with checksum
  * @c:		Execution context
  * @src:	IPv6 source address
  * @sport:	UDP source port
@@ -274,6 +274,38 @@ void *tap_push_ip6h(struct ipv6hdr *ip6h,
  * @flow:	Flow label
  * @in:		UDP payload contents (not including UDP header)
  * @dlen:	UDP payload length (not including UDP header)
+ *
+ * Return: pointer at which to write the packet's payload
+ */
+void *tap_push_uh6(struct udphdr *uh,
+		   const struct in6_addr *src, in_port_t sport,
+		   const struct in6_addr *dst, in_port_t dport,
+		   void *in, size_t dlen)
+{
+	size_t l4len = dlen + sizeof(struct udphdr);
+	const struct iovec iov = {
+		.iov_base = in,
+		.iov_len = dlen
+	};
+	struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
+
+	uh->source = htons(sport);
+	uh->dest = htons(dport);
+	uh->len = htons(l4len);
+	csum_udp6(uh, src, dst, &payload);
+	return (char *)uh + sizeof(*uh);
+}
+
+/**
+ * tap_udp6_send() - Send UDP over IPv6 packet
+ * @c:		Execution context
+ * @src:	IPv6 source address
+ * @sport:	UDP source port
+ * @dst:	IPv6 destination address
+ * @dport:	UDP destination port
+ * @flow:	Flow label
+ * @in:	UDP payload contents (not including UDP header)
+ * @dlen:	UDP payload length (not including UDP header)
  */
 void tap_udp6_send(const struct ctx *c,
 		   const struct in6_addr *src, in_port_t sport,
@@ -285,19 +317,9 @@ void tap_udp6_send(const struct ctx *c,
 	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
 	struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
 					  l4len, IPPROTO_UDP, flow);
-	char *data = (char *)(uh + 1);
-	const struct iovec iov = {
-		.iov_base = in,
-		.iov_len = dlen
-	};
-	struct iov_tail payload = IOV_TAIL(&iov, 1, 0);
+	char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
 
-	uh->source = htons(sport);
-	uh->dest = htons(dport);
-	uh->len = htons(l4len);
-	csum_udp6(uh, src, dst, &payload);
 	memcpy(data, in, dlen);
-
 	tap_send_single(c, buf, dlen + (data - buf));
 }
 
diff --git a/tap.h b/tap.h
index 9ac17ce..b53a5b8 100644
--- a/tap.h
+++ b/tap.h
@@ -50,6 +50,10 @@ void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
+void *tap_push_uh6(struct udphdr *uh,
+		   const struct in6_addr *src, in_port_t sport,
+		   const struct in6_addr *dst, in_port_t dport,
+		   void *in, size_t dlen);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 		    struct in_addr dst, size_t l4len, uint8_t proto);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,

From 68b04182e07da6a437479cb191e5468db382bc56 Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Thu, 6 Mar 2025 13:00:06 -0500
Subject: [PATCH 110/221] udp: create and send ICMPv6 to local peer when
 applicable

When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, but we don't forward them back to the local sender because
we cannot read the ICMP message directly to user space. Because of
this, the local peer will hang and wait for a response that never
arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c |  2 +-
 tap.h |  4 ++++
 udp.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 7082620..4541f51 100644
--- a/tap.c
+++ b/tap.c
@@ -261,7 +261,7 @@ void *tap_push_ip6h(struct ipv6hdr *ip6h,
 	ip6h->saddr = *src;
 	ip6h->daddr = *dst;
 	ip6_set_flow_lbl(ip6h, flow);
-	return ip6h + 1;
+	return (char *)ip6h + sizeof(*ip6h);
 }
 
 /**
diff --git a/tap.h b/tap.h
index b53a5b8..a2c3b87 100644
--- a/tap.h
+++ b/tap.h
@@ -56,6 +56,10 @@ void *tap_push_uh6(struct udphdr *uh,
 		   void *in, size_t dlen);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 		    struct in_addr dst, size_t l4len, uint8_t proto);
+void *tap_push_ip6h(struct ipv6hdr *ip6h,
+		    const struct in6_addr *src,
+		    const struct in6_addr *dst,
+		    size_t l4len, uint8_t proto, uint32_t flow);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
diff --git a/udp.c b/udp.c
index b72c3ce..80520cb 100644
--- a/udp.c
+++ b/udp.c
@@ -88,6 +88,7 @@
 #include <netinet/ip.h>
 #include <netinet/udp.h>
 #include <netinet/ip_icmp.h>
+#include <netinet/icmp6.h>
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
@@ -115,6 +116,9 @@
 
 /* Maximum UDP data to be returned in ICMP messages */
 #define ICMP4_MAX_DLEN 8
+#define ICMP6_MAX_DLEN (IPV6_MIN_MTU			\
+			- sizeof(struct udphdr)	\
+			- sizeof(struct ipv6hdr))
 
 /* "Spliced" sockets indexed by bound port (host order) */
 static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
@@ -449,6 +453,51 @@ static void udp_send_conn_fail_icmp4(const struct ctx *c,
 	tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
 }
 
+
+/**
+ * udp_send_conn_fail_icmp6() - Construct and send ICMPv6 to local peer
+ * @c:		Execution context
+ * @ee:	Extended error descriptor
+ * @toside:	Destination side of flow
+ * @saddr:	Address of ICMP generating node
+ * @in:	First bytes (max 1232) of original UDP message body
+ * @dlen:	Length of the read part of original UDP message body
+ * @flow:	IPv6 flow identifier
+ */
+static void udp_send_conn_fail_icmp6(const struct ctx *c,
+				     const struct sock_extended_err *ee,
+				     const struct flowside *toside,
+				     const struct in6_addr *saddr,
+				     void *in, size_t dlen, uint32_t flow)
+{
+	const struct in6_addr *oaddr = &toside->oaddr.a6;
+	const struct in6_addr *eaddr = &toside->eaddr.a6;
+	in_port_t eport = toside->eport;
+	in_port_t oport = toside->oport;
+	struct {
+		struct icmp6_hdr icmp6h;
+		struct ipv6hdr ip6h;
+		struct udphdr uh;
+		char data[ICMP6_MAX_DLEN];
+	} __attribute__((packed, aligned(__alignof__(max_align_t)))) msg;
+	size_t msglen = sizeof(msg) - sizeof(msg.data) + dlen;
+	size_t l4len = dlen + sizeof(struct udphdr);
+
+	ASSERT(dlen <= ICMP6_MAX_DLEN);
+	memset(&msg, 0, sizeof(msg));
+	msg.icmp6h.icmp6_type = ee->ee_type;
+	msg.icmp6h.icmp6_code = ee->ee_code;
+	if (ee->ee_type == ICMP6_PACKET_TOO_BIG)
+		msg.icmp6h.icmp6_dataun.icmp6_un_data32[0] = htonl(ee->ee_info);
+
+	/* Reconstruct the original headers as returned in the ICMP message */
+	tap_push_ip6h(&msg.ip6h, eaddr, oaddr, l4len, IPPROTO_UDP, flow);
+	tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
+	memcpy(&msg.data, in, dlen);
+
+	tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
+}
+
 /**
  * udp_sock_recverr() - Receive and clear an error from a socket
  * @c:		Execution context
@@ -465,7 +514,7 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 	const struct cmsghdr *hdr;
 	union sockaddr_inany saddr;
 	char buf[CMSG_SPACE(sizeof(*ee))];
-	char data[ICMP4_MAX_DLEN];
+	char data[ICMP6_MAX_DLEN];
 	int s = ref.fd;
 	struct iovec iov = {
 		.iov_base = data,
@@ -508,9 +557,17 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 	if (ref.type == EPOLL_TYPE_UDP_REPLY) {
 		flow_sidx_t sidx = flow_sidx_opposite(ref.flowside);
 		const struct flowside *toside = flowside_at_sidx(sidx);
+		size_t dlen = rc;
 
-		udp_send_conn_fail_icmp4(c, ee, toside, saddr.sa4.sin_addr,
-					 data, rc);
+		if (hdr->cmsg_level == IPPROTO_IP) {
+			dlen = MIN(dlen, ICMP4_MAX_DLEN);
+			udp_send_conn_fail_icmp4(c, ee, toside, saddr.sa4.sin_addr,
+						 data, dlen);
+		} else if (hdr->cmsg_level == IPPROTO_IPV6) {
+			udp_send_conn_fail_icmp6(c, ee, toside,
+						 &saddr.sa6.sin6_addr,
+						 data, dlen, sidx.flowi);
+		}
 	} else {
 		trace("Ignoring received IP_RECVERR cmsg on listener socket");
 	}

From 57d2db370b9c12aca84901d968c2c31db89ca462 Mon Sep 17 00:00:00 2001
From: David Gibson <dgibson@redhat.com>
Date: Wed, 5 Mar 2025 17:15:03 +1100
Subject: [PATCH 111/221] treewide: Mark assorted functions static

This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 log.c     | 2 +-
 netlink.c | 2 +-
 passt.c   | 2 +-
 tcp.c     | 6 +++---
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/log.c b/log.c
index 95e4576..b6bce21 100644
--- a/log.c
+++ b/log.c
@@ -56,7 +56,7 @@ bool		log_stderr = true;	/* Not daemonised, no shell spawned */
  *
  * Return: pointer to @now, or NULL if there was an error retrieving the time
  */
-const struct timespec *logtime(struct timespec *ts)
+static const struct timespec *logtime(struct timespec *ts)
 {
 	if (clock_gettime(CLOCK_MONOTONIC, ts))
 		return NULL;
diff --git a/netlink.c b/netlink.c
index 37d8b5b..a052504 100644
--- a/netlink.c
+++ b/netlink.c
@@ -355,7 +355,7 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
  *
  * Return: true if a gateway was found, false otherwise
  */
-bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
+static bool nl_route_get_def_multipath(struct rtattr *rta, void *gw)
 {
 	int nh_len = RTA_PAYLOAD(rta);
 	struct rtnexthop *rtnh;
diff --git a/passt.c b/passt.c
index 68d1a28..868842b 100644
--- a/passt.c
+++ b/passt.c
@@ -166,7 +166,7 @@ void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
  *
  * #syscalls exit_group
  */
-void exit_handler(int signal)
+static void exit_handler(int signal)
 {
 	(void)signal;
 
diff --git a/tcp.c b/tcp.c
index fb04e2e..4c24367 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2497,7 +2497,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
  * @c:		Execution context
  * @port:	Port, host order
  */
-void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
+static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
 {
 	ASSERT(!c->no_tcp);
 
@@ -3141,7 +3141,7 @@ static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
  *
  * Return: 0 on success, negative error code on failure
  */
-int tcp_flow_repair_opt(int s, const struct tcp_tap_transfer_ext *t)
+static int tcp_flow_repair_opt(int s, const struct tcp_tap_transfer_ext *t)
 {
 	const struct tcp_repair_opt opts[] = {
 		{ TCPOPT_WINDOW,		t->snd_ws + (t->rcv_ws << 16) },
@@ -3333,7 +3333,7 @@ fail:
  *
  * Return: 0 on success, negative error code on failure
  */
-int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
+static int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6;
 	const struct flowside *sockside = HOSTFLOW(conn);

From e36c35c952ef0848383cba8ef71e13cf25dab2da Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:15:04 +1100
Subject: [PATCH 112/221] log: Don't export passt_vsyslog()

passt_vsyslog() is an exposed function in log.h.  However it shouldn't
be called from outside log.c: it writes specifically to the system log,
and most code should call passt's logging helpers which might go to the
syslog or to a log file.

Make passt_vsyslog() local to log.c.  This requires a code motion to avoid
a forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 log.c | 48 ++++++++++++++++++++++++------------------------
 log.h |  1 -
 2 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/log.c b/log.c
index b6bce21..6eda4c4 100644
--- a/log.c
+++ b/log.c
@@ -249,6 +249,30 @@ static void logfile_write(bool newline, bool cont, int pri,
 		log_written += n;
 }
 
+/**
+ * passt_vsyslog() - vsyslog() implementation not using heap memory
+ * @newline:	Append newline at the end of the message, if missing
+ * @pri:	Facility and level map, same as priority for vsyslog()
+ * @format:	Same as vsyslog() format
+ * @ap:		Same as vsyslog() ap
+ */
+static void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
+{
+	char buf[BUFSIZ];
+	int n;
+
+	/* Send without timestamp, the system logger should add it */
+	n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
+
+	n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
+
+	if (newline && format[strlen(format)] != '\n')
+		n += snprintf(buf + n, BUFSIZ - n, "\n");
+
+	if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
+		FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
+}
+
 /**
  * vlogmsg() - Print or send messages to log or output files as configured
  * @newline:	Append newline at the end of the message, if missing
@@ -373,30 +397,6 @@ void __setlogmask(int mask)
 	setlogmask(mask);
 }
 
-/**
- * passt_vsyslog() - vsyslog() implementation not using heap memory
- * @newline:	Append newline at the end of the message, if missing
- * @pri:	Facility and level map, same as priority for vsyslog()
- * @format:	Same as vsyslog() format
- * @ap:		Same as vsyslog() ap
- */
-void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
-{
-	char buf[BUFSIZ];
-	int n;
-
-	/* Send without timestamp, the system logger should add it */
-	n = snprintf(buf, BUFSIZ, "<%i> %s: ", pri, log_ident);
-
-	n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
-
-	if (newline && format[strlen(format)] != '\n')
-		n += snprintf(buf + n, BUFSIZ - n, "\n");
-
-	if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && log_stderr)
-		FPRINTF(stderr, "Failed to send %i bytes to syslog\n", n);
-}
-
 /**
  * logfile_init() - Open log file and write header with PID, version, path
  * @name:	Identifier for header: passt or pasta
diff --git a/log.h b/log.h
index 22c7b9a..08aa88c 100644
--- a/log.h
+++ b/log.h
@@ -55,7 +55,6 @@ void trace_init(int enable);
 
 void __openlog(const char *ident, int option, int facility);
 void logfile_init(const char *name, const char *path, size_t size);
-void passt_vsyslog(bool newline, int pri, const char *format, va_list ap);
 void __setlogmask(int mask);
 
 #endif /* LOG_H */

From 12d5b36b2f17a1ddc9447b925dbec161b4da346a Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:15:05 +1100
Subject: [PATCH 113/221] checksum: Don't export various functions

Several of the exposed functions in checksum.h are no longer directly used.
Remove them from the header, and make static.  In particular sum_16b()
should not be used outside: generally csum_unfolded() should be used which
will automatically use either the AVX2 optimized version or sum_16b() as
necessary.

csum_fold() and csum() could have external uses, but they're not used right
now.  We can expose them again if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 checksum.c | 34 +++++++++++++++++-----------------
 checksum.h |  3 ---
 2 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/checksum.c b/checksum.c
index b01e0fe..0894eca 100644
--- a/checksum.c
+++ b/checksum.c
@@ -85,7 +85,7 @@
  */
 /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
 __attribute__((optimize("-fno-strict-aliasing")))
-uint32_t sum_16b(const void *buf, size_t len)
+static uint32_t sum_16b(const void *buf, size_t len)
 {
 	const uint16_t *p = buf;
 	uint32_t sum = 0;
@@ -107,7 +107,7 @@ uint32_t sum_16b(const void *buf, size_t len)
  *
  * Return: 16-bit folded sum
  */
-uint16_t csum_fold(uint32_t sum)
+static uint16_t csum_fold(uint32_t sum)
 {
 	while (sum >> 16)
 		sum = (sum & 0xffff) + (sum >> 16);
@@ -161,6 +161,21 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
 	return psum;
 }
 
+/**
+ * csum() - Compute TCP/IP-style checksum
+ * @buf:	Input buffer
+ * @len:	Input length
+ * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
+ *
+ * Return: 16-bit folded, complemented checksum
+ */
+/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
+__attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
+static uint16_t csum(const void *buf, size_t len, uint32_t init)
+{
+	return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
+}
+
 /**
  * csum_udp4() - Calculate and set checksum for a UDP over IPv4 packet
  * @udp4hr:	UDP header, initialised apart from checksum
@@ -482,21 +497,6 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
 }
 #endif /* !__AVX2__ */
 
-/**
- * csum() - Compute TCP/IP-style checksum
- * @buf:	Input buffer
- * @len:	Input length
- * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
- *
- * Return: 16-bit folded, complemented checksum
- */
-/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
-__attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
-uint16_t csum(const void *buf, size_t len, uint32_t init)
-{
-	return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
-}
-
 /**
  * csum_iov_tail() - Calculate unfolded checksum for the tail of an IO vector
  * @tail:	IO vector tail to checksum
diff --git a/checksum.h b/checksum.h
index e243c97..683a09b 100644
--- a/checksum.h
+++ b/checksum.h
@@ -11,8 +11,6 @@ struct icmphdr;
 struct icmp6hdr;
 struct iov_tail;
 
-uint32_t sum_16b(const void *buf, size_t len);
-uint16_t csum_fold(uint32_t sum);
 uint16_t csum_unaligned(const void *buf, size_t len, uint32_t init);
 uint16_t csum_ip4_header(uint16_t l3len, uint8_t protocol,
 			 struct in_addr saddr, struct in_addr daddr);
@@ -32,7 +30,6 @@ void csum_icmp6(struct icmp6hdr *icmp6hr,
 		const struct in6_addr *saddr, const struct in6_addr *daddr,
 		const void *payload, size_t dlen);
 uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
-uint16_t csum(const void *buf, size_t len, uint32_t init);
 uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init);
 
 #endif /* CHECKSUM_H */

From 27395e67c26a73e2e035360195b5928a07996dd5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:15:06 +1100
Subject: [PATCH 114/221] tcp: Don't export tcp_update_csum()

tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneded prototype and make it static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c          | 3 ++-
 tcp_internal.h | 2 --
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/tcp.c b/tcp.c
index 4c24367..32a08bd 100644
--- a/tcp.c
+++ b/tcp.c
@@ -787,7 +787,8 @@ static void tcp_sock_set_nodelay(int s)
  * @th:		TCP header (updated)
  * @payload:	TCP payload
  */
-void tcp_update_csum(uint32_t psum, struct tcphdr *th, struct iov_tail *payload)
+static void tcp_update_csum(uint32_t psum, struct tcphdr *th,
+			    struct iov_tail *payload)
 {
 	th->check = 0;
 	psum = csum_unfolded(th, sizeof(*th), psum);
diff --git a/tcp_internal.h b/tcp_internal.h
index 9cf31f5..6f5e054 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -166,8 +166,6 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 
 struct tcp_info_linux;
 
-void tcp_update_csum(uint32_t psum, struct tcphdr *th,
-		     struct iov_tail *payload);
 void tcp_fill_headers(const struct tcp_tap_conn *conn,
 		      struct tap_hdr *taph,
 		      struct iphdr *ip4h, struct ipv6hdr *ip6h,

From a83c806d1786fbe19bc6a3014f248e928e00651b Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:15:07 +1100
Subject: [PATCH 115/221] vhost_user: Don't export several functions

vhost-user added several functions which are exposed in headers, but not
used outside the file where they're defined.  I can't tell if these are
really internal functions, or of they're logically supposed to be exported,
but we don't happen to have anything using them yet.

For the time being, just remove the exports.  We can add them back if we
need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 2 +-
 vhost_user.h | 1 -
 virtio.c     | 9 +++++----
 virtio.h     | 4 ----
 4 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/vhost_user.c b/vhost_user.c
index be1aa94..105f77a 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -517,7 +517,7 @@ static void vu_close_log(struct vu_dev *vdev)
  * vu_log_kick() - Inform the front-end that the log has been modified
  * @vdev:	vhost-user device
  */
-void vu_log_kick(const struct vu_dev *vdev)
+static void vu_log_kick(const struct vu_dev *vdev)
 {
 	if (vdev->log_call_fd != -1) {
 		int rc;
diff --git a/vhost_user.h b/vhost_user.h
index e769cb1..1daacd1 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -241,7 +241,6 @@ static inline bool vu_queue_started(const struct vu_virtq *vq)
 void vu_print_capabilities(void);
 void vu_init(struct ctx *c);
 void vu_cleanup(struct vu_dev *vdev);
-void vu_log_kick(const struct vu_dev *vdev);
 void vu_log_write(const struct vu_dev *vdev, uint64_t address,
 		  uint64_t length);
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
diff --git a/virtio.c b/virtio.c
index 2b58e4d..bc2b89a 100644
--- a/virtio.c
+++ b/virtio.c
@@ -286,7 +286,7 @@ static int virtqueue_read_next_desc(const struct vring_desc *desc,
  *
  * Return: true if the virtqueue is empty, false otherwise
  */
-bool vu_queue_empty(struct vu_virtq *vq)
+static bool vu_queue_empty(struct vu_virtq *vq)
 {
 	if (!vq->vring.avail)
 		return true;
@@ -671,9 +671,10 @@ static void vu_log_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
  * @len:	Size of the element
  * @idx:	Used ring entry index
  */
-void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
-			    unsigned int index, unsigned int len,
-			    unsigned int idx)
+static void vu_queue_fill_by_index(const struct vu_dev *vdev,
+				   struct vu_virtq *vq,
+				   unsigned int index, unsigned int len,
+				   unsigned int idx)
 {
 	struct vring_used_elem uelem;
 
diff --git a/virtio.h b/virtio.h
index 0a59441..7a370bd 100644
--- a/virtio.h
+++ b/virtio.h
@@ -174,16 +174,12 @@ static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
 	return has_feature(vdev->protocol_features, fbit);
 }
 
-bool vu_queue_empty(struct vu_virtq *vq);
 void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
 int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
 		 struct vu_virtq_element *elem);
 void vu_queue_detach_element(struct vu_virtq *vq);
 void vu_queue_unpop(struct vu_virtq *vq);
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
-void vu_queue_fill_by_index(const struct vu_dev *vdev, struct vu_virtq *vq,
-			    unsigned int index, unsigned int len,
-			    unsigned int idx);
 void vu_queue_fill(const struct vu_dev *vdev, struct vu_virtq *vq,
 		   const struct vu_virtq_element *elem, unsigned int len,
 		   unsigned int idx);

From 2b58b22845a76baf24141155eb4d4a882f509e97 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 5 Mar 2025 17:15:08 +1100
Subject: [PATCH 116/221] cppcheck: Add suppressions for "logically" exported
 functions

We have some functions in our headers which are definitely there on
purpose.  However, they're not yet used outside the files in which they're
defined.  That causes sufficiently recent cppcheck versions (2.17) to
complain they should be static.

Suppress the errors for these "logically" exported functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 iov.c | 1 +
 log.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/iov.c b/iov.c
index 3b12272..8c63b7e 100644
--- a/iov.c
+++ b/iov.c
@@ -203,6 +203,7 @@ size_t iov_tail_size(struct iov_tail *tail)
  *	    overruns the IO vector, is not contiguous or doesn't have the
  *	    requested alignment.
  */
+/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
 void *iov_peek_header_(struct iov_tail *tail, size_t len, size_t align)
 {
 	char *p;
diff --git a/log.c b/log.c
index 6eda4c4..d40d7ae 100644
--- a/log.c
+++ b/log.c
@@ -281,6 +281,7 @@ static void passt_vsyslog(bool newline, int pri, const char *format, va_list ap)
  * @format:	Message
  * @ap:		Variable argument list
  */
+/* cppcheck-suppress [staticFunction,unmatchedSuppression] */
 void vlogmsg(bool newline, bool cont, int pri, const char *format, va_list ap)
 {
 	bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1;

From 04701702471ececee362669cc6b49ed9e20a1b6d Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 7 Mar 2025 23:27:03 +0100
Subject: [PATCH 117/221] passt-repair: Add directory watch

It might not be feasible for users to start passt-repair after passt
is started, on a migration target, but before the migration process
starts.

For instance, with libvirt, the guest domain (and, hence, passt) is
started on the target as part of the migration process. At least for
the moment being, there's no hook a libvirt user (including KubeVirt)
can use to start passt-repair before the migration starts.

Add a directory watch using inotify: if PATH is a directory, instead
of connecting to it, we'll watch for a .repair socket file to appear
in it, and then attempt to connect to that socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 contrib/selinux/passt-repair.te | 16 +++----
 passt-repair.1                  |  6 ++-
 passt-repair.c                  | 84 +++++++++++++++++++++++++++++----
 3 files changed, 89 insertions(+), 17 deletions(-)

diff --git a/contrib/selinux/passt-repair.te b/contrib/selinux/passt-repair.te
index f171be6..7157dfb 100644
--- a/contrib/selinux/passt-repair.te
+++ b/contrib/selinux/passt-repair.te
@@ -61,11 +61,11 @@ allow passt_repair_t unconfined_t:unix_stream_socket { connectto read write };
 allow passt_repair_t passt_t:unix_stream_socket { connectto read write };
 allow passt_repair_t user_tmp_t:unix_stream_socket { connectto read write };
 
-allow passt_repair_t user_tmp_t:dir search;
+allow passt_repair_t user_tmp_t:dir { getattr read search watch };
 
-allow passt_repair_t unconfined_t:sock_file { read write };
-allow passt_repair_t passt_t:sock_file { read write };
-allow passt_repair_t user_tmp_t:sock_file { read write };
+allow passt_repair_t unconfined_t:sock_file { getattr read write };
+allow passt_repair_t passt_t:sock_file { getattr read write };
+allow passt_repair_t user_tmp_t:sock_file { getattr read write };
 
 allow passt_repair_t unconfined_t:tcp_socket { read setopt write };
 allow passt_repair_t passt_t:tcp_socket { read setopt write };
@@ -80,8 +80,8 @@ allow passt_repair_t passt_t:tcp_socket { read setopt write };
 allow passt_repair_t qemu_var_run_t:unix_stream_socket { connectto read write };
 allow passt_repair_t virt_var_run_t:unix_stream_socket { connectto read write };
 
-allow passt_repair_t qemu_var_run_t:dir search;
-allow passt_repair_t virt_var_run_t:dir search;
+allow passt_repair_t qemu_var_run_t:dir { getattr read search watch };
+allow passt_repair_t virt_var_run_t:dir { getattr read search watch };
 
-allow passt_repair_t qemu_var_run_t:sock_file { read write };
-allow passt_repair_t virt_var_run_t:sock_file { read write };
+allow passt_repair_t qemu_var_run_t:sock_file { getattr read write };
+allow passt_repair_t virt_var_run_t:sock_file { getattr read write };
diff --git a/passt-repair.1 b/passt-repair.1
index 7c1b140..e65aadd 100644
--- a/passt-repair.1
+++ b/passt-repair.1
@@ -16,13 +16,17 @@
 .B passt-repair
 is a privileged helper setting and clearing repair mode on TCP sockets on behalf
 of \fBpasst\fR(1), as instructed via single-byte commands over a UNIX domain
-socket, specified by \fIPATH\fR.
+socket.
 
 It can be used to migrate TCP connections between guests without granting
 additional capabilities to \fBpasst\fR(1) itself: to migrate TCP connections,
 \fBpasst\fR(1) leverages repair mode, which needs the \fBCAP_NET_ADMIN\fR
 capability (see \fBcapabilities\fR(7)) to be set or cleared.
 
+If \fIPATH\fR represents a UNIX domain socket, \fBpasst-repair\fR(1) attempts to
+connect to it. If it is a directory, \fBpasst-repair\fR(1) waits until a file
+ending with \fI.repair\fR appears in it, and then attempts to connect to it.
+
 .SH PROTOCOL
 
 \fBpasst-repair\fR(1) connects to \fBpasst\fR(1) using the socket specified via
diff --git a/passt-repair.c b/passt-repair.c
index e0c366e..8bb3f00 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -16,11 +16,14 @@
  * off. Reply by echoing the command. Exit on EOF.
  */
 
+#include <sys/inotify.h>
 #include <sys/prctl.h>
 #include <sys/types.h>
 #include <sys/socket.h>
+#include <sys/stat.h>
 #include <sys/un.h>
 #include <errno.h>
+#include <stdbool.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -39,6 +42,8 @@
 #include "seccomp_repair.h"
 
 #define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+#define REPAIR_EXT		".repair"
+#define REPAIR_EXT_LEN		strlen(REPAIR_EXT)
 
 /**
  * main() - Entry point and whole program with loop
@@ -51,6 +56,9 @@
  * #syscalls:repair socket s390x:socketcall i686:socketcall
  * #syscalls:repair recvfrom recvmsg arm:recv ppc64le:recv
  * #syscalls:repair sendto sendmsg arm:send ppc64le:send
+ * #syscalls:repair stat|statx stat64|statx statx
+ * #syscalls:repair fstat|fstat64 newfstatat|fstatat64
+ * #syscalls:repair inotify_init1 inotify_add_watch
  */
 int main(int argc, char **argv)
 {
@@ -58,12 +66,14 @@ int main(int argc, char **argv)
 	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
 	struct sockaddr_un a = { AF_UNIX, "" };
 	int fds[SCM_MAX_FD], s, ret, i, n = 0;
+	bool inotify_dir = false;
 	struct sock_fprog prog;
 	int8_t cmd = INT8_MAX;
 	struct cmsghdr *cmsg;
 	struct msghdr msg;
 	struct iovec iov;
 	size_t cmsg_len;
+	struct stat sb;
 	int op;
 
 	prctl(PR_SET_DUMPABLE, 0);
@@ -90,19 +100,77 @@ int main(int argc, char **argv)
 		_exit(2);
 	}
 
-	ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
-	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
-		fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
-		_exit(2);
-	}
-
 	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
 		fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
 		_exit(1);
 	}
 
-	if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
-		fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+	if ((stat(argv[1], &sb))) {
+		fprintf(stderr, "Can't stat() %s: %i\n", argv[1], errno);
+		_exit(1);
+	}
+
+	if ((sb.st_mode & S_IFMT) == S_IFDIR) {
+		char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
+		const struct inotify_event *ev;
+		char path[PATH_MAX + 1];
+		ssize_t n;
+		int fd;
+
+		ev = (struct inotify_event *)buf;
+
+		if ((fd = inotify_init1(IN_CLOEXEC)) < 0) {
+			fprintf(stderr, "inotify_init1: %i\n", errno);
+			_exit(1);
+		}
+
+		if (inotify_add_watch(fd, argv[1], IN_CREATE) < 0) {
+			fprintf(stderr, "inotify_add_watch: %i\n", errno);
+			_exit(1);
+		}
+
+		do {
+			n = read(fd, buf, sizeof(buf));
+			if (n < 0) {
+				fprintf(stderr, "inotify read: %i", errno);
+				_exit(1);
+			}
+
+			if (n < (ssize_t)sizeof(*ev)) {
+				fprintf(stderr, "Short inotify read: %zi", n);
+				_exit(1);
+			}
+		} while (ev->len < REPAIR_EXT_LEN ||
+			 memcmp(ev->name + strlen(ev->name) - REPAIR_EXT_LEN,
+				REPAIR_EXT, REPAIR_EXT_LEN));
+
+		snprintf(path, sizeof(path), "%s/%s", argv[1], ev->name);
+		if ((stat(path, &sb))) {
+			fprintf(stderr, "Can't stat() %s: %i\n", path, errno);
+			_exit(1);
+		}
+
+		ret = snprintf(a.sun_path, sizeof(a.sun_path), path);
+		inotify_dir = true;
+	} else {
+		ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+	}
+
+	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+		fprintf(stderr, "Invalid socket path");
+		_exit(2);
+	}
+
+	if ((sb.st_mode & S_IFMT) != S_IFSOCK) {
+		fprintf(stderr, "%s is not a socket\n", a.sun_path);
+		_exit(2);
+	}
+
+	while (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+		if (inotify_dir && errno == ECONNREFUSED)
+			continue;
+
+		fprintf(stderr, "Failed to connect to %s: %s\n", a.sun_path,
 			strerror(errno));
 		_exit(1);
 	}

From c8b520c0625b440d0dcd588af085d35cf46aae2c Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 6 Mar 2025 20:00:51 +0100
Subject: [PATCH 118/221] flow, repair: Wait for a short while for passt-repair
 to connect

...and time out after that. This will be needed because of an upcoming
change to passt-repair enabling it to start before passt is started,
on both source and target, by means of an inotify watch.

Once the inotify watch triggers, passt-repair will connect right away,
but we have no guarantees that the connection completes before we
start the migration process, so wait for it (for a reasonable amount
of time).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c   | 20 ++++++++++++++++++++
 repair.c | 32 ++++++++++++++++++++++++++++++++
 repair.h |  1 +
 3 files changed, 53 insertions(+)

diff --git a/flow.c b/flow.c
index 749c498..5e64b79 100644
--- a/flow.c
+++ b/flow.c
@@ -911,6 +911,21 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
 	return ret;
 }
 
+/**
+ * flow_migrate_need_repair() - Do we need to set repair mode for any flow?
+ *
+ * Return: true if repair mode is needed, false otherwise
+ */
+static bool flow_migrate_need_repair(void)
+{
+	union flow *flow;
+
+	foreach_established_tcp_flow(flow)
+		return true;
+
+	return false;
+}
+
 /**
  * flow_migrate_repair_all() - Turn repair mode on or off for all flows
  * @c:		Execution context
@@ -966,6 +981,9 @@ int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
 	(void)stage;
 	(void)fd;
 
+	if (flow_migrate_need_repair())
+		repair_wait(c);
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
@@ -1083,6 +1101,8 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	if (!count)
 		return 0;
 
+	repair_wait(c);
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
diff --git a/repair.c b/repair.c
index 3ee089f..149fe51 100644
--- a/repair.c
+++ b/repair.c
@@ -27,6 +27,10 @@
 
 #define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
 
+/* Wait for a while for TCP_REPAIR helper to connect if it's not there yet */
+#define REPAIR_ACCEPT_TIMEOUT_MS	10
+#define REPAIR_ACCEPT_TIMEOUT_US	(REPAIR_ACCEPT_TIMEOUT_MS * 1000)
+
 /* Pending file descriptors for next repair_flush() call, or command change */
 static int repair_fds[SCM_MAX_FD];
 
@@ -138,6 +142,34 @@ void repair_handler(struct ctx *c, uint32_t events)
 	repair_close(c);
 }
 
+/**
+ * repair_wait() - Wait (with timeout) for TCP_REPAIR helper to connect
+ * @c:		Execution context
+ */
+void repair_wait(struct ctx *c)
+{
+	struct timeval tv = { .tv_sec = 0,
+			      .tv_usec = (long)(REPAIR_ACCEPT_TIMEOUT_US) };
+	static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000,
+		      ".tv_usec is greater than 1000 * 1000");
+
+	if (c->fd_repair >= 0 || c->fd_repair_listen == -1)
+		return;
+
+	if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
+		       &tv, sizeof(tv))) {
+		err_perror("Set timeout on TCP_REPAIR listening socket");
+		return;
+	}
+
+	repair_listen_handler(c, EPOLLIN);
+
+	tv.tv_usec = 0;
+	if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
+		       &tv, sizeof(tv)))
+		err_perror("Clear timeout on TCP_REPAIR listening socket");
+}
+
 /**
  * repair_flush() - Flush current set of sockets to helper, with current command
  * @c:		Execution context
diff --git a/repair.h b/repair.h
index de279d6..1d37922 100644
--- a/repair.h
+++ b/repair.h
@@ -10,6 +10,7 @@ void repair_sock_init(const struct ctx *c);
 void repair_listen_handler(struct ctx *c, uint32_t events);
 void repair_handler(struct ctx *c, uint32_t events);
 void repair_close(struct ctx *c);
+void repair_wait(struct ctx *c);
 int repair_flush(struct ctx *c);
 int repair_set(struct ctx *c, int s, int cmd);
 

From bb00a0499fc9130e4b00a88928958b8b094ee2c9 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:31 +1100
Subject: [PATCH 119/221] conf: Use the same optstring for passt and pasta
 modes

Currently we rely on detecting our mode first and use different sets of
(single character) options for each.  This means that if you use an option
valid in only one mode in another you'll get the generic usage() message.

We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c
index 065e720..7f20bc8 100644
--- a/conf.c
+++ b/conf.c
@@ -1388,6 +1388,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"repair-path",	required_argument,	NULL,		28 },
 		{ 0 },
 	};
+	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
 	char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
 	bool copy_addrs_opt = false, copy_routes_opt = false;
@@ -1397,7 +1398,6 @@ void conf(struct ctx *c, int argc, char **argv)
 	struct fqdn *dnss = c->dns_search;
 	unsigned int ifi4 = 0, ifi6 = 0;
 	const char *logfile = NULL;
-	const char *optstring;
 	size_t logsize = 0;
 	char *runas = NULL;
 	long fd_tap_opt;
@@ -1408,9 +1408,6 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->mode == MODE_PASTA) {
 		c->no_dhcp_dns = c->no_dhcp_dns_search = 1;
 		fwd_default = FWD_AUTO;
-		optstring = "+dqfel:hF:I:p:P:m:a:n:M:g:i:o:D:S:H:46t:u:T:U:";
-	} else {
-		optstring = "+dqfel:hs:F:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:";
 	}
 
 	c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
@@ -1614,6 +1611,9 @@ void conf(struct ctx *c, int argc, char **argv)
 			c->foreground = 1;
 			break;
 		case 's':
+			if (c->mode == MODE_PASTA)
+				die("-s is for passt / vhost-user mode only");
+
 			ret = snprintf(c->sock_path, sizeof(c->sock_path), "%s",
 				       optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->sock_path))
@@ -1634,6 +1634,9 @@ void conf(struct ctx *c, int argc, char **argv)
 			*c->sock_path = 0;
 			break;
 		case 'I':
+			if (c->mode != MODE_PASTA)
+				die("-I is for pasta mode only");
+
 			ret = snprintf(c->pasta_ifn, IFNAMSIZ, "%s",
 				       optarg);
 			if (ret <= 0 || ret >= IFNAMSIZ)
@@ -1790,11 +1793,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			break;
 		case 't':
 		case 'u':
-		case 'T':
-		case 'U':
 		case 'D':
 			/* Handle these later, once addresses are configured */
 			break;
+		case 'T':
+		case 'U':
+			if (c->mode != MODE_PASTA)
+				die("-%c is for pasta mode only", name);
+
+			/* Handle properly later, once addresses are configured */
+			break;
 		case 'h':
 			usage(argv[0], stdout, EXIT_SUCCESS);
 			break;

From 4b17d042c7e4f6e5b5a770181e2ebd53ec8e73d4 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:32 +1100
Subject: [PATCH 120/221] conf: Move mode detection into helper function

One of the first things we need to do is determine if we're in passt mode
or pasta mode.  Currently this is open-coded in main(), by examining
argv[0].  We want to complexify this a bit in future to cover vhost-user
mode as well.  Prepare for this, by moving the mode detection into a new
conf_mode() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 26 ++++++++++++++++++++++++++
 conf.h  |  1 +
 passt.c | 14 ++------------
 3 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/conf.c b/conf.c
index 7f20bc8..2022ea1 100644
--- a/conf.c
+++ b/conf.c
@@ -991,6 +991,32 @@ pasta_opts:
 	_exit(status);
 }
 
+/**
+ * conf_mode() - Determine passt/pasta's operating mode from command line
+ * @argc:	Argument count
+ * @argv:	Command line arguments
+ *
+ * Return: mode to operate in, PASTA or PASST
+ */
+/* cppcheck-suppress constParameter */
+enum passt_modes conf_mode(int argc, char *argv[])
+{
+	char argv0[PATH_MAX], *basearg0;
+
+	if (argc < 1)
+		die("Cannot determine argv[0]");
+
+	strncpy(argv0, argv[0], PATH_MAX - 1);
+	basearg0 = basename(argv0);
+	if (strstr(basearg0, "pasta"))
+		return MODE_PASTA;
+
+	if (strstr(basearg0, "passt"))
+		return MODE_PASST;
+
+	die("Cannot determine mode, invoke as \"passt\" or \"pasta\"");
+}
+
 /**
  * conf_print() - Print fundamental configuration parameters
  * @c:		Execution context
diff --git a/conf.h b/conf.h
index 9d2143d..b45ad74 100644
--- a/conf.h
+++ b/conf.h
@@ -6,6 +6,7 @@
 #ifndef CONF_H
 #define CONF_H
 
+enum passt_modes conf_mode(int argc, char *argv[]);
 void conf(struct ctx *c, int argc, char **argv);
 
 #endif /* CONF_H */
diff --git a/passt.c b/passt.c
index 868842b..0bd2a29 100644
--- a/passt.c
+++ b/passt.c
@@ -191,7 +191,6 @@ int main(int argc, char **argv)
 {
 	struct epoll_event events[EPOLL_EVENTS];
 	int nfds, i, devnull_fd = -1;
-	char argv0[PATH_MAX], *name;
 	struct ctx c = { 0 };
 	struct rlimit limit;
 	struct timespec now;
@@ -213,21 +212,12 @@ int main(int argc, char **argv)
 	sigaction(SIGTERM, &sa, NULL);
 	sigaction(SIGQUIT, &sa, NULL);
 
-	if (argc < 1)
-		_exit(EXIT_FAILURE);
+	c.mode = conf_mode(argc, argv);
 
-	strncpy(argv0, argv[0], PATH_MAX - 1);
-	name = basename(argv0);
-	if (strstr(name, "pasta")) {
+	if (c.mode == MODE_PASTA) {
 		sa.sa_handler = pasta_child_handler;
 		if (sigaction(SIGCHLD, &sa, NULL))
 			die_perror("Couldn't install signal handlers");
-
-		c.mode = MODE_PASTA;
-	} else if (strstr(name, "passt")) {
-		c.mode = MODE_PASST;
-	} else {
-		_exit(EXIT_FAILURE);
 	}
 
 	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)

From 74cd82adc87552c7ef6d255069a974b4ebeab4a1 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:33 +1100
Subject: [PATCH 121/221] conf: Detect vhost-user mode earlier

We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option.  That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.

However, there are many other places in the main option parsing loop which
also rely on mode.  We get away with those, because they happen to be able
to treat passt and vhost-user modes identically.  This is potentially
confusing, though.  So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.

To match, we move the parsing of --repair-path back into the main option
parsing loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c
index 2022ea1..b58e2a6 100644
--- a/conf.c
+++ b/conf.c
@@ -998,10 +998,23 @@ pasta_opts:
  *
  * Return: mode to operate in, PASTA or PASST
  */
-/* cppcheck-suppress constParameter */
 enum passt_modes conf_mode(int argc, char *argv[])
 {
+	int vhost_user = 0;
+	const struct option optvu[] = {
+		{"vhost-user",	no_argument,		&vhost_user,	1 },
+		{ 0 },
+	};
 	char argv0[PATH_MAX], *basearg0;
+	int name;
+
+	optind = 0;
+	do {
+		name = getopt_long(argc, argv, "-:", optvu, NULL);
+	} while (name != -1);
+
+	if (vhost_user)
+		return MODE_VU;
 
 	if (argc < 1)
 		die("Cannot determine argv[0]");
@@ -1604,9 +1617,8 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			die("Invalid host nameserver address: %s", optarg);
 		case 25:
-			if (c->mode == MODE_PASTA)
-				die("--vhost-user is for passt mode only");
-			c->mode = MODE_VU;
+			/* Already handled in conf_mode() */
+			ASSERT(c->mode == MODE_VU);
 			break;
 		case 26:
 			vu_print_capabilities();
@@ -1617,7 +1629,14 @@ void conf(struct ctx *c, int argc, char **argv)
 				die("Invalid FQDN: %s", optarg);
 			break;
 		case 28:
-			/* Handle this once we checked --vhost-user */
+			if (c->mode != MODE_VU && strcmp(optarg, "none"))
+				die("--repair-path is for vhost-user mode only");
+
+			if (snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s",
+					   optarg))
+				die("Invalid passt-repair path: %s", optarg);
+
 			break;
 		case 'd':
 			c->debug = 1;
@@ -1917,8 +1936,8 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
 		c->no_dhcp = 1;
 
-	/* Inbound port options, DNS, and --repair-path can be parsed now, after
-	 * IPv4/IPv6 settings and --vhost-user.
+	/* Inbound port options and DNS can be parsed now, after IPv4/IPv6
+	 * settings
 	 */
 	fwd_probe_ephemeral();
 	udp_portmap_clear();
@@ -1964,16 +1983,6 @@ void conf(struct ctx *c, int argc, char **argv)
 			}
 
 			die("Cannot use DNS address %s", optarg);
-		} else if (name == 28) {
-			if (c->mode != MODE_VU && strcmp(optarg, "none"))
-				die("--repair-path is for vhost-user mode only");
-
-			if (snprintf_check(c->repair_path,
-					   sizeof(c->repair_path), "%s",
-					   optarg))
-				die("Invalid passt-repair path: %s", optarg);
-
-			break;
 		}
 	} while (name != -1);
 

From c43972ad67806fb403cdbc05179441917f2a776b Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:34 +1100
Subject: [PATCH 122/221] packet: Give explicit name to maximum packet size

We verify that every packet we store in a pool (and every partial packet
we retreive from it) has a length no longer than UINT16_MAX.  This
originated in the older packet pool implementation which stored packet
lengths in a uint16_t.  Now, that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.

We have may reasons to (slightly) increase this limit in future, so in
preparation, give this quantity an explicit name - PACKET_MAX_LEN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 4 ++--
 packet.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/packet.c b/packet.c
index 0330b54..bcac037 100644
--- a/packet.c
+++ b/packet.c
@@ -83,7 +83,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	if (packet_check_range(p, start, len, func, line))
 		return;
 
-	if (len > UINT16_MAX) {
+	if (len > PACKET_MAX_LEN) {
 		trace("add packet length %zu, %s:%i", len, func, line);
 		return;
 	}
@@ -119,7 +119,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > UINT16_MAX) {
+	if (len > PACKET_MAX_LEN) {
 		if (func) {
 			trace("packet data length %zu, %s:%i",
 			      len, func, line);
diff --git a/packet.h b/packet.h
index bdc07fe..d099f02 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,9 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+/* Maximum size of a single packet stored in pool, including headers */
+#define PACKET_MAX_LEN	UINT16_MAX
+
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors,

From 1eda8de4384a93778a781257781c5b0967c8abfe Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:35 +1100
Subject: [PATCH 123/221] packet: Remove redundant TAP_BUF_BYTES define

Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt.c | 2 +-
 passt.h | 5 ++---
 tap.c   | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/passt.c b/passt.c
index 0bd2a29..cd06772 100644
--- a/passt.c
+++ b/passt.c
@@ -223,7 +223,7 @@ int main(int argc, char **argv)
 	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
 		die_perror("Couldn't set disposition for SIGPIPE");
 
-	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
+	madvise(pkt_buf, sizeof(pkt_buf), MADV_HUGEPAGE);
 
 	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
 	if (c.epollfd == -1)
diff --git a/passt.h b/passt.h
index 28d1389..6b24805 100644
--- a/passt.h
+++ b/passt.h
@@ -69,12 +69,11 @@ union epoll_ref {
 static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 	      "epoll_ref must have same size as epoll_data");
 
-#define TAP_BUF_BYTES							\
+#define PKT_BUF_BYTES							\
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
 #define TAP_MSGS							\
-	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
+	DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
-#define PKT_BUF_BYTES		MAX(TAP_BUF_BYTES, 0)
 extern char pkt_buf		[PKT_BUF_BYTES];
 
 extern char *epoll_type_str[];
diff --git a/tap.c b/tap.c
index 4541f51..fb306e7 100644
--- a/tap.c
+++ b/tap.c
@@ -1080,7 +1080,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 
 	do {
 		n = recv(c->fd_tap, pkt_buf + partial_len,
-			 TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
+			 sizeof(pkt_buf) - partial_len, MSG_DONTWAIT);
 	} while ((n < 0) && errno == EINTR);
 
 	if (n < 0) {
@@ -1151,7 +1151,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 	tap_flush_pools();
 
-	for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len) {
+	for (n = 0; n <= (ssize_t)(sizeof(pkt_buf) - ETH_MAX_MTU); n += len) {
 		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
 
 		if (len == 0) {

From c4bfa3339cea586172d4b0fcd613b5638498651e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:36 +1100
Subject: [PATCH 124/221] tap: Use explicit defines for maximum length of L2
 frame

Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame.  This define comes from the kernel, but it's badly named and
used confusingly.

First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths.  It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).

Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers.  In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers.  In tap.c we're mostly
using it in the second way.

Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.

Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 23 +++++++++++++++++++----
 tap.h | 25 +++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index fb306e7..ede547c 100644
--- a/tap.c
+++ b/tap.c
@@ -62,6 +62,19 @@
 #include "vhost_user.h"
 #include "vu_common.h"
 
+/* Maximum allowed frame lengths (including L2 header) */
+
+/* Verify that an L2 frame length limit is large enough to contain the header,
+ * but small enough to fit in the packet pool
+ */
+#define CHECK_FRAME_LEN(len) \
+	static_assert((len) >= ETH_HLEN && (len) <= PACKET_MAX_LEN,	\
+		      #len " has bad value")
+
+CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
+CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
+CHECK_FRAME_LEN(L2_MAX_LEN_VU);
+
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
 static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
@@ -1097,7 +1110,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	while (n >= (ssize_t)sizeof(uint32_t)) {
 		uint32_t l2len = ntohl_unaligned(p);
 
-		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+		if (l2len < sizeof(struct ethhdr) || l2len > L2_MAX_LEN_PASST) {
 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
 			return;
@@ -1151,8 +1164,10 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 	tap_flush_pools();
 
-	for (n = 0; n <= (ssize_t)(sizeof(pkt_buf) - ETH_MAX_MTU); n += len) {
-		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
+	for (n = 0;
+	     n <= (ssize_t)(sizeof(pkt_buf) - L2_MAX_LEN_PASTA);
+	     n += len) {
+		len = read(c->fd_tap, pkt_buf + n, L2_MAX_LEN_PASTA);
 
 		if (len == 0) {
 			die("EOF on tap device, exiting");
@@ -1170,7 +1185,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 		/* Ignore frames of bad length */
 		if (len < (ssize_t)sizeof(struct ethhdr) ||
-		    len > (ssize_t)ETH_MAX_MTU)
+		    len > (ssize_t)L2_MAX_LEN_PASTA)
 			continue;
 
 		tap_add_packet(c, len, pkt_buf + n);
diff --git a/tap.h b/tap.h
index a2c3b87..84e9fdb 100644
--- a/tap.h
+++ b/tap.h
@@ -6,6 +6,31 @@
 #ifndef TAP_H
 #define TAP_H
 
+/** L2_MAX_LEN_PASTA - Maximum frame length for pasta mode (with L2 header)
+ *
+ * The kernel tuntap device imposes a maximum frame size of 65535 including
+ * 'hard_header_len' (14 bytes for L2 Ethernet in the case of "tap" mode).
+ */
+#define L2_MAX_LEN_PASTA	USHRT_MAX
+
+/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
+ *
+ * The only structural limit the QEMU socket protocol imposes on frames is
+ * (2^32-1) bytes, but that would be ludicrously long in practice.  For now,
+ * limit it somewhat arbitrarily to 65535 bytes.  FIXME: Work out an appropriate
+ * limit with more precision.
+ */
+#define L2_MAX_LEN_PASST	USHRT_MAX
+
+/** L2_MAX_LEN_VU - Maximum frame length for vhost-user mode (with L2 header)
+ *
+ * vhost-user allows multiple buffers per frame, each of which can be quite
+ * large, so the inherent frame size limit is rather large.  Much larger than is
+ * actually useful for IP.  For now limit arbitrarily to 65535 bytes. FIXME:
+ * Work out an appropriate limit with more precision.
+ */
+#define L2_MAX_LEN_VU		USHRT_MAX
+
 struct udphdr;
 
 /**

From b6945e055376be944867479dcd8deb77e47b1fa4 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:37 +1100
Subject: [PATCH 125/221] Simplify sizing of pkt_buf

We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/passt.h b/passt.h
index 6b24805..8f45091 100644
--- a/passt.h
+++ b/passt.h
@@ -69,8 +69,8 @@ union epoll_ref {
 static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 	      "epoll_ref must have same size as epoll_data");
 
-#define PKT_BUF_BYTES							\
-	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
+/* Large enough for ~128 maximum size frames */
+#define PKT_BUF_BYTES		(8UL << 20)
 #define TAP_MSGS							\
 	DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 

From 9d1a6b3eba9e6e5c4db4bfa0e395edc45ca6c39d Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:38 +1100
Subject: [PATCH 126/221] pcap: Correctly set snaplen based on tap backend type

The pcap header includes a value indicating how much of each frame is
captured.  We always capture the entire frame, so we want to set this to
the maximum possible frame size.  Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.

Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen.  While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 pcap.c | 46 ++++++++++++++++++++++++----------------------
 tap.c  | 19 +++++++++++++++++++
 tap.h  |  1 +
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/pcap.c b/pcap.c
index 3d623cf..e95aa6f 100644
--- a/pcap.c
+++ b/pcap.c
@@ -33,33 +33,12 @@
 #include "log.h"
 #include "pcap.h"
 #include "iov.h"
+#include "tap.h"
 
 #define PCAP_VERSION_MINOR 4
 
 static int pcap_fd = -1;
 
-/* See pcap.h from libpcap, or pcap-savefile(5) */
-static const struct {
-	uint32_t magic;
-#define PCAP_MAGIC		0xa1b2c3d4
-
-	uint16_t major;
-#define PCAP_VERSION_MAJOR	2
-
-	uint16_t minor;
-#define PCAP_VERSION_MINOR	4
-
-	int32_t thiszone;
-	uint32_t sigfigs;
-	uint32_t snaplen;
-
-	uint32_t linktype;
-#define PCAP_LINKTYPE_ETHERNET	1
-} pcap_hdr = {
-	PCAP_MAGIC, PCAP_VERSION_MAJOR, PCAP_VERSION_MINOR, 0, 0, ETH_MAX_MTU,
-	PCAP_LINKTYPE_ETHERNET
-};
-
 struct pcap_pkthdr {
 	uint32_t tv_sec;
 	uint32_t tv_usec;
@@ -162,6 +141,29 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
  */
 void pcap_init(struct ctx *c)
 {
+	/* See pcap.h from libpcap, or pcap-savefile(5) */
+#define PCAP_MAGIC		0xa1b2c3d4
+#define PCAP_VERSION_MAJOR	2
+#define PCAP_VERSION_MINOR	4
+#define PCAP_LINKTYPE_ETHERNET	1
+	const struct {
+		uint32_t magic;
+		uint16_t major;
+		uint16_t minor;
+
+		int32_t thiszone;
+		uint32_t sigfigs;
+		uint32_t snaplen;
+
+		uint32_t linktype;
+	} pcap_hdr = {
+		.magic = PCAP_MAGIC,
+		.major = PCAP_VERSION_MAJOR,
+		.minor = PCAP_VERSION_MINOR,
+		.snaplen = tap_l2_max_len(c),
+		.linktype = PCAP_LINKTYPE_ETHERNET
+	};
+
 	if (pcap_fd != -1)
 		return;
 
diff --git a/tap.c b/tap.c
index ede547c..182a115 100644
--- a/tap.c
+++ b/tap.c
@@ -82,6 +82,25 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
 #define TAP_SEQS		128 /* Different L4 tuples in one batch */
 #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
 
+/**
+ * tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
+ * @c:		Execution context
+ */
+unsigned long tap_l2_max_len(const struct ctx *c)
+{
+	/* NOLINTBEGIN(bugprone-branch-clone): values can be the same */
+	switch (c->mode) {
+	case MODE_PASST:
+		return L2_MAX_LEN_PASST;
+	case MODE_PASTA:
+		return L2_MAX_LEN_PASTA;
+	case MODE_VU:
+		return L2_MAX_LEN_VU;
+	}
+	/* NOLINTEND(bugprone-branch-clone) */
+	ASSERT(0);
+}
+
 /**
  * tap_send_single() - Send a single frame
  * @c:		Execution context
diff --git a/tap.h b/tap.h
index 84e9fdb..dd39fd8 100644
--- a/tap.h
+++ b/tap.h
@@ -69,6 +69,7 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 		thdr->vnet_len = htonl(l2len);
 }
 
+unsigned long tap_l2_max_len(const struct ctx *c);
 void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 		     struct in_addr dst, size_t l4len, uint8_t proto);

From 26df8a3608e7b006c00f44a9029bcadb6d5e4153 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 13:18:39 +1100
Subject: [PATCH 127/221] conf: Limit maximum MTU based on backend frame size

The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers.  We currently limit it to ETH_MAX_MTU
which sounds like it makes sense.  But ETH_MAX_MTU is confusing: it's not
consistently used as to whether it means the maximum L3 datagram size or
the maximum L2 frame size.  Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.

Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header.  The latter can vary
depending on our tap backend, although it doesn't right now.

Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 11 +++++++----
 util.h |  3 ---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c
index b58e2a6..c760f79 100644
--- a/conf.c
+++ b/conf.c
@@ -1434,6 +1434,7 @@ void conf(struct ctx *c, int argc, char **argv)
 	enum fwd_ports_mode fwd_default = FWD_NONE;
 	bool v4_only = false, v6_only = false;
 	unsigned dns4_idx = 0, dns6_idx = 0;
+	unsigned long max_mtu = IP_MAX_MTU;
 	struct fqdn *dnss = c->dns_search;
 	unsigned int ifi4 = 0, ifi6 = 0;
 	const char *logfile = NULL;
@@ -1449,7 +1450,9 @@ void conf(struct ctx *c, int argc, char **argv)
 		fwd_default = FWD_AUTO;
 	}
 
-	c->mtu = ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN, sizeof(uint32_t));
+	if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
+		max_mtu = tap_l2_max_len(c) - ETH_HLEN;
+	c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
 	c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
 	c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
 	memcpy(c->our_tap_mac, MAC_OUR_LAA, ETH_ALEN);
@@ -1711,9 +1714,9 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (errno || *e)
 				die("Invalid MTU: %s", optarg);
 
-			if (mtu > ETH_MAX_MTU) {
-				die("MTU %lu too large (max %u)",
-				    mtu, ETH_MAX_MTU);
+			if (mtu > max_mtu) {
+				die("MTU %lu too large (max %lu)",
+				    mtu, max_mtu);
 			}
 
 			c->mtu = mtu;
diff --git a/util.h b/util.h
index 0f70f4d..4d512fa 100644
--- a/util.h
+++ b/util.h
@@ -31,9 +31,6 @@
 #ifndef SECCOMP_RET_KILL_PROCESS
 #define SECCOMP_RET_KILL_PROCESS	SECCOMP_RET_KILL
 #endif
-#ifndef ETH_MAX_MTU
-#define ETH_MAX_MTU			USHRT_MAX
-#endif
 #ifndef IP_MAX_MTU
 #define IP_MAX_MTU			USHRT_MAX
 #endif

From 78f1f0fdfc1831f2ca3a65c2cee98c44ff3c30ab Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 16:26:57 +1100
Subject: [PATCH 128/221] test/perf: Simplify iperf3 server lifetime management

After we start the iperf3 server in the background, we have a sleep to
make sure it's ready to receive connections.  We can simplify this slightly
by using the -D option to have iperf3 background itself rather than
backgrounding it manually.  That won't return until the server is ready to
use.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/lib/test | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/test/lib/test b/test/lib/test
index 758250a..7349674 100755
--- a/test/lib/test
+++ b/test/lib/test
@@ -20,10 +20,7 @@ test_iperf3s() {
 	__sctx="${1}"
 	__port="${2}"
 
-	pane_or_context_run_bg "${__sctx}" 				\
-		 'iperf3 -s -p'${__port}' & echo $! > s.pid'		\
-
-	sleep 1		# Wait for server to be ready
+	pane_or_context_run "${__sctx}" 'iperf3 -s -p'${__port}' -D -I s.pid'
 }
 
 # test_iperf3k() - Kill iperf3 server
@@ -31,7 +28,7 @@ test_iperf3s() {
 test_iperf3k() {
 	__sctx="${1}"
 
-	pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid'
+	pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid)'
 
 	sleep 1		# Wait for kernel to free up ports
 }

From 96fe5548cb16fe2664ad121c2976048ccad6a1ab Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 12 Mar 2025 14:43:59 +1100
Subject: [PATCH 129/221] conf: Unify several paths in conf_ports()

In conf_ports() we have three different paths which actually do the setup
of an individual forwarded port: one for the "all" case, one for the
exclusions only case and one for the range of ports with possible
exclusions case.

We can unify those cases using a new helper which handles a single range
of ports, with a bitmap of exclusions.  Although this is slightly longer
(largely due to the new helpers function comment), it reduces duplicated
logic.  It will also make future improvements to the tracking of port
forwards easier.

The new conf_ports_range_except() function has a pretty prodigious
parameter list, but I still think it's an overall improvement in conceptual
complexity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 173 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 90 insertions(+), 83 deletions(-)

diff --git a/conf.c b/conf.c
index c760f79..0e2e8dc 100644
--- a/conf.c
+++ b/conf.c
@@ -123,6 +123,75 @@ static int parse_port_range(const char *s, char **endptr,
 	return 0;
 }
 
+/**
+ * conf_ports_range_except() - Set up forwarding for a range of ports minus a
+ *                             bitmap of exclusions
+ * @c:		Execution context
+ * @optname:	Short option name, t, T, u, or U
+ * @optarg:	Option argument (port specification)
+ * @fwd:	Pointer to @fwd_ports to be updated
+ * @addr:	Listening address
+ * @ifname:	Listening interface
+ * @first:	First port to forward
+ * @last:	Last port to forward
+ * @exclude:	Bitmap of ports to exclude
+ * @to:		Port to translate @first to when forwarding
+ * @weak:	Ignore errors, as long as at least one port is mapped
+ */
+static void conf_ports_range_except(const struct ctx *c, char optname,
+				    const char *optarg, struct fwd_ports *fwd,
+				    const union inany_addr *addr,
+				    const char *ifname,
+				    uint16_t first, uint16_t last,
+				    const uint8_t *exclude, uint16_t to,
+				    bool weak)
+{
+	bool bound_one = false;
+	unsigned i;
+	int ret;
+
+	if (first == 0) {
+		die("Can't forward port 0 for option '-%c %s'",
+		    optname, optarg);
+	}
+
+	for (i = first; i <= last; i++) {
+		if (bitmap_isset(exclude, i))
+			continue;
+
+		if (bitmap_isset(fwd->map, i)) {
+			warn(
+"Altering mapping of already mapped port number: %s", optarg);
+		}
+
+		bitmap_set(fwd->map, i);
+		fwd->delta[i] = to - first;
+
+		if (optname == 't')
+			ret = tcp_sock_init(c, addr, ifname, i);
+		else if (optname == 'u')
+			ret = udp_sock_init(c, 0, addr, ifname, i);
+		else
+			/* No way to check in advance for -T and -U */
+			ret = 0;
+
+		if (ret == -ENFILE || ret == -EMFILE) {
+			die("Can't open enough sockets for port specifier: %s",
+			    optarg);
+		}
+
+		if (!ret) {
+			bound_one = true;
+		} else if (!weak) {
+			die("Failed to bind port %u (%s) for option '-%c %s'",
+			    i, strerror_(-ret), optname, optarg);
+		}
+	}
+
+	if (!bound_one)
+		die("Failed to bind any port for '-%c %s'", optname, optarg);
+}
+
 /**
  * conf_ports() - Parse port configuration options, initialise UDP/TCP sockets
  * @c:		Execution context
@@ -135,10 +204,9 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
 {
 	union inany_addr addr_buf = inany_any6, *addr = &addr_buf;
 	char buf[BUFSIZ], *spec, *ifname = NULL, *p;
-	bool exclude_only = true, bound_one = false;
 	uint8_t exclude[PORT_BITMAP_SIZE] = { 0 };
+	bool exclude_only = true;
 	unsigned i;
-	int ret;
 
 	if (!strcmp(optarg, "none")) {
 		if (fwd->mode)
@@ -173,32 +241,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
 
 		fwd->mode = FWD_ALL;
 
-		/* Skip port 0.  It has special meaning for many socket APIs, so
-		 * trying to bind it is not really safe.
-		 */
-		for (i = 1; i < NUM_PORTS; i++) {
+		/* Exclude ephemeral ports */
+		for (i = 0; i < NUM_PORTS; i++)
 			if (fwd_port_is_ephemeral(i))
-				continue;
-
-			bitmap_set(fwd->map, i);
-			if (optname == 't') {
-				ret = tcp_sock_init(c, NULL, NULL, i);
-				if (ret == -ENFILE || ret == -EMFILE)
-					goto enfile;
-				if (!ret)
-					bound_one = true;
-			} else if (optname == 'u') {
-				ret = udp_sock_init(c, 0, NULL, NULL, i);
-				if (ret == -ENFILE || ret == -EMFILE)
-					goto enfile;
-				if (!ret)
-					bound_one = true;
-			}
-		}
-
-		if (!bound_one)
-			goto bind_all_fail;
+				bitmap_set(exclude, i);
 
+		conf_ports_range_except(c, optname, optarg, fwd,
+					NULL, NULL,
+					1, NUM_PORTS - 1, exclude,
+					1, true);
 		return;
 	}
 
@@ -275,37 +326,15 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
 	} while ((p = next_chunk(p, ',')));
 
 	if (exclude_only) {
-		/* Skip port 0.  It has special meaning for many socket APIs, so
-		 * trying to bind it is not really safe.
-		 */
-		for (i = 1; i < NUM_PORTS; i++) {
-			if (fwd_port_is_ephemeral(i) ||
-			    bitmap_isset(exclude, i))
-				continue;
-
-			bitmap_set(fwd->map, i);
-
-			if (optname == 't') {
-				ret = tcp_sock_init(c, addr, ifname, i);
-				if (ret == -ENFILE || ret == -EMFILE)
-					goto enfile;
-				if (!ret)
-					bound_one = true;
-			} else if (optname == 'u') {
-				ret = udp_sock_init(c, 0, addr, ifname, i);
-				if (ret == -ENFILE || ret == -EMFILE)
-					goto enfile;
-				if (!ret)
-					bound_one = true;
-			} else {
-				/* No way to check in advance for -T and -U */
-				bound_one = true;
-			}
-		}
-
-		if (!bound_one)
-			goto bind_all_fail;
+		/* Exclude ephemeral ports */
+		for (i = 0; i < NUM_PORTS; i++)
+			if (fwd_port_is_ephemeral(i))
+				bitmap_set(exclude, i);
 
+		conf_ports_range_except(c, optname, optarg, fwd,
+					addr, ifname,
+					1, NUM_PORTS - 1, exclude,
+					1, true);
 		return;
 	}
 
@@ -334,40 +363,18 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
 		if ((*p != '\0')  && (*p != ',')) /* Garbage after the ranges */
 			goto bad;
 
-		for (i = orig_range.first; i <= orig_range.last; i++) {
-			if (bitmap_isset(fwd->map, i))
-				warn(
-"Altering mapping of already mapped port number: %s", optarg);
-
-			if (bitmap_isset(exclude, i))
-				continue;
-
-			bitmap_set(fwd->map, i);
-
-			fwd->delta[i] = mapped_range.first - orig_range.first;
-
-			ret = 0;
-			if (optname == 't')
-				ret = tcp_sock_init(c, addr, ifname, i);
-			else if (optname == 'u')
-				ret = udp_sock_init(c, 0, addr, ifname, i);
-			if (ret)
-				goto bind_fail;
-		}
+		conf_ports_range_except(c, optname, optarg, fwd,
+					addr, ifname,
+					orig_range.first, orig_range.last,
+					exclude,
+					mapped_range.first, false);
 	} while ((p = next_chunk(p, ',')));
 
 	return;
-enfile:
-	die("Can't open enough sockets for port specifier: %s", optarg);
 bad:
 	die("Invalid port specifier %s", optarg);
 mode_conflict:
 	die("Port forwarding mode '%s' conflicts with previous mode", optarg);
-bind_fail:
-	die("Failed to bind port %u (%s) for option '-%c %s', exiting",
-	    i, strerror_(-ret), optname, optarg);
-bind_all_fail:
-	die("Failed to bind any port for '-%c %s', exiting", optname, optarg);
 }
 
 /**

From cb5b593563402680bee850245667f2e71b0d1bda Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 13 Mar 2025 13:56:17 +1100
Subject: [PATCH 130/221] tcp, flow: Better use flow specific logging heleprs

A number of places in the TCP code use general logging functions, instead
of the flow specific ones.  That includes a few older ones as well as many
places in the new migration code.  Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.

Convert many of these to use the existing flow specific helpers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c         |  16 ++--
 tcp.c          | 252 +++++++++++++++++++++++++++----------------------
 tcp.h          |   1 -
 tcp_buf.c      |   4 +-
 tcp_internal.h |   1 +
 tcp_vu.c       |   2 +-
 6 files changed, 149 insertions(+), 127 deletions(-)

diff --git a/flow.c b/flow.c
index 5e64b79..8622242 100644
--- a/flow.c
+++ b/flow.c
@@ -1037,8 +1037,8 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	foreach_established_tcp_flow(flow) {
 		rc = tcp_flow_migrate_source(fd, &flow->tcp);
 		if (rc) {
-			err("Can't send data, flow %u: %s", FLOW_IDX(flow),
-			    strerror_(-rc));
+			flow_err(flow, "Can't send data: %s",
+				 strerror_(-rc));
 			if (!first)
 				die("Inconsistent migration state, exiting");
 
@@ -1064,8 +1064,8 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	foreach_established_tcp_flow(flow) {
 		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
 		if (rc) {
-			err("Extended data for flow %u: %s", FLOW_IDX(flow),
-			    strerror_(-rc));
+			flow_err(flow, "Can't send extended data: %s",
+				 strerror_(-rc));
 
 			if (rc == -EIO)
 				die("Inconsistent migration state, exiting");
@@ -1112,8 +1112,8 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	for (i = 0; i < count; i++) {
 		rc = tcp_flow_migrate_target(c, fd);
 		if (rc) {
-			debug("Migration data failure at flow %u: %s, abort",
-			      i, strerror_(-rc));
+			flow_dbg(FLOW(i), "Migration data failure, abort: %s",
+				 strerror_(-rc));
 			return -rc;
 		}
 	}
@@ -1123,8 +1123,8 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	for (i = 0; i < count; i++) {
 		rc = tcp_flow_migrate_target_ext(c, &flowtab[i].tcp, fd);
 		if (rc) {
-			debug("Migration data failure at flow %u: %s, abort",
-			      i, strerror_(-rc));
+			flow_dbg(FLOW(i), "Migration data failure, abort: %s",
+				 strerror_(-rc));
 			return -rc;
 		}
 	}
diff --git a/tcp.c b/tcp.c
index 32a08bd..a4c840e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -434,19 +434,20 @@ static struct tcp_tap_conn *conn_at_sidx(flow_sidx_t sidx)
 }
 
 /**
- * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
- * @s:          Socket to update
+ * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on connection if supported
+ * @conn:	Pointer to the TCP connection structure
  * @offset:     Offset in bytes
  *
  * Return:      -1 when it fails, 0 otherwise.
  */
-int tcp_set_peek_offset(int s, int offset)
+int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset)
 {
 	if (!peek_offset_cap)
 		return 0;
 
-	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) {
-		err("Failed to set SO_PEEK_OFF to %i in socket %i", offset, s);
+	if (setsockopt(conn->sock, SOL_SOCKET, SO_PEEK_OFF,
+		       &offset, sizeof(offset))) {
+		flow_perror(conn, "Failed to set SO_PEEK_OFF to %i", offset);
 		return -1;
 	}
 	return 0;
@@ -1757,7 +1758,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			   "fast re-transmit, ACK: %u, previous sequence: %u",
 			   max_ack_seq, conn->seq_to_tap);
 		conn->seq_to_tap = max_ack_seq;
-		if (tcp_set_peek_offset(conn->sock, 0)) {
+		if (tcp_set_peek_offset(conn, 0)) {
 			tcp_rst(c, conn);
 			return -1;
 		}
@@ -1854,7 +1855,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
 	conn_event(c, conn, ESTABLISHED);
-	if (tcp_set_peek_offset(conn->sock, 0)) {
+	if (tcp_set_peek_offset(conn, 0)) {
 		tcp_rst(c, conn);
 		return;
 	}
@@ -2022,7 +2023,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 			goto reset;
 
 		conn_event(c, conn, ESTABLISHED);
-		if (tcp_set_peek_offset(conn->sock, 0))
+		if (tcp_set_peek_offset(conn, 0))
 			goto reset;
 
 		if (th->fin) {
@@ -2286,7 +2287,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 			conn->seq_to_tap = conn->seq_ack_from_tap;
 			if (!conn->wnd_from_tap)
 				conn->wnd_from_tap = 1; /* Zero-window probe */
-			if (tcp_set_peek_offset(conn->sock, 0)) {
+			if (tcp_set_peek_offset(conn, 0)) {
 				tcp_rst(c, conn);
 			} else {
 				tcp_data_from_sock(c, conn);
@@ -2810,20 +2811,21 @@ int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
 
 /**
  * tcp_flow_dump_tinfo() - Dump window scale, tcpi_state, tcpi_options
- * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_dump_tinfo(int s, struct tcp_tap_transfer_ext *t)
+static int tcp_flow_dump_tinfo(const struct tcp_tap_conn *conn,
+			       struct tcp_tap_transfer_ext *t)
 {
 	struct tcp_info tinfo;
 	socklen_t sl;
 
 	sl = sizeof(tinfo);
-	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
+	if (getsockopt(conn->sock, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
 		int rc = -errno;
-		err_perror("Querying TCP_INFO, socket %i", s);
+		flow_perror(conn, "Querying TCP_INFO");
 		return rc;
 	}
 
@@ -2837,18 +2839,19 @@ static int tcp_flow_dump_tinfo(int s, struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_dump_mss() - Dump MSS clamp (not current MSS) via TCP_MAXSEG
- * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_dump_mss(int s, struct tcp_tap_transfer_ext *t)
+static int tcp_flow_dump_mss(const struct tcp_tap_conn *conn,
+			     struct tcp_tap_transfer_ext *t)
 {
 	socklen_t sl = sizeof(t->mss);
 
-	if (getsockopt(s, SOL_TCP, TCP_MAXSEG, &t->mss, &sl)) {
+	if (getsockopt(conn->sock, SOL_TCP, TCP_MAXSEG, &t->mss, &sl)) {
 		int rc = -errno;
-		err_perror("Getting MSS, socket %i", s);
+		flow_perror(conn, "Getting MSS");
 		return rc;
 	}
 
@@ -2857,19 +2860,20 @@ static int tcp_flow_dump_mss(int s, struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_dump_wnd() - Dump current tcp_repair_window parameters
- * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_dump_wnd(int s, struct tcp_tap_transfer_ext *t)
+static int tcp_flow_dump_wnd(const struct tcp_tap_conn *conn,
+			     struct tcp_tap_transfer_ext *t)
 {
 	struct tcp_repair_window wnd;
 	socklen_t sl = sizeof(wnd);
 
-	if (getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl)) {
+	if (getsockopt(conn->sock, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl)) {
 		int rc = -errno;
-		err_perror("Getting window repair data, socket %i", s);
+		flow_perror(conn, "Getting window repair data");
 		return rc;
 	}
 
@@ -2893,12 +2897,13 @@ static int tcp_flow_dump_wnd(int s, struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_repair_wnd() - Restore window parameters from extended data
- * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_repair_wnd(int s, const struct tcp_tap_transfer_ext *t)
+static int tcp_flow_repair_wnd(const struct tcp_tap_conn *conn,
+			       const struct tcp_tap_transfer_ext *t)
 {
 	struct tcp_repair_window wnd;
 
@@ -2908,9 +2913,10 @@ static int tcp_flow_repair_wnd(int s, const struct tcp_tap_transfer_ext *t)
 	wnd.rcv_wnd	= t->rcv_wnd;
 	wnd.rcv_wup	= t->rcv_wup;
 
-	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, sizeof(wnd))) {
+	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_REPAIR_WINDOW,
+		       &wnd, sizeof(wnd))) {
 		int rc = -errno;
-		err_perror("Setting window data, socket %i", s);
+		flow_perror(conn, "Setting window data");
 		return rc;
 	}
 
@@ -2919,16 +2925,17 @@ static int tcp_flow_repair_wnd(int s, const struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_select_queue() - Select queue (receive or send) for next operation
- * @s:		Socket
+ * @conn:	Connection to select queue for
  * @queue:	TCP_RECV_QUEUE or TCP_SEND_QUEUE
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_select_queue(int s, int queue)
+static int tcp_flow_select_queue(const struct tcp_tap_conn *conn, int queue)
 {
-	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue))) {
+	if (setsockopt(conn->sock, SOL_TCP, TCP_REPAIR_QUEUE,
+		       &queue, sizeof(queue))) {
 		int rc = -errno;
-		err_perror("Selecting TCP_SEND_QUEUE, socket %i", s);
+		flow_perror(conn, "Selecting TCP_SEND_QUEUE");
 		return rc;
 	}
 
@@ -2937,26 +2944,28 @@ static int tcp_flow_select_queue(int s, int queue)
 
 /**
  * tcp_flow_dump_sndqueue() - Dump send queue, length of sent and not sent data
- * @s:		Socket
+ * @conn:	Connection to dump queue for
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  *
  * #syscalls:vu ioctl
  */
-static int tcp_flow_dump_sndqueue(int s, struct tcp_tap_transfer_ext *t)
+static int tcp_flow_dump_sndqueue(const struct tcp_tap_conn *conn,
+				  struct tcp_tap_transfer_ext *t)
 {
+	int s = conn->sock;
 	ssize_t rc;
 
 	if (ioctl(s, SIOCOUTQ, &t->sndq) < 0) {
 		rc = -errno;
-		err_perror("Getting send queue size, socket %i", s);
+		flow_perror(conn, "Getting send queue size");
 		return rc;
 	}
 
 	if (ioctl(s, SIOCOUTQNSD, &t->notsent) < 0) {
 		rc = -errno;
-		err_perror("Getting not sent count, socket %i", s);
+		flow_perror(conn, "Getting not sent count");
 		return rc;
 	}
 
@@ -2975,14 +2984,16 @@ static int tcp_flow_dump_sndqueue(int s, struct tcp_tap_transfer_ext *t)
 	}
 
 	if (t->notsent > t->sndq) {
-		err("Invalid notsent count socket %i, send: %u, not sent: %u",
-		    s, t->sndq, t->notsent);
+		flow_err(conn,
+			 "Invalid notsent count socket %i, send: %u, not sent: %u",
+			 s, t->sndq, t->notsent);
 		return -EINVAL;
 	}
 
 	if (t->sndq > TCP_MIGRATE_SND_QUEUE_MAX) {
-		err("Send queue too large to migrate socket %i: %u bytes",
-		    s, t->sndq);
+		flow_err(conn,
+			 "Send queue too large to migrate socket %i: %u bytes",
+			 s, t->sndq);
 		return -ENOBUFS;
 	}
 
@@ -2993,13 +3004,13 @@ static int tcp_flow_dump_sndqueue(int s, struct tcp_tap_transfer_ext *t)
 			rc = 0;
 		} else {
 			rc = -errno;
-			err_perror("Can't read send queue, socket %i", s);
+			flow_perror(conn, "Can't read send queue");
 			return rc;
 		}
 	}
 
 	if ((uint32_t)rc < t->sndq) {
-		err("Short read migrating send queue");
+		flow_err(conn, "Short read migrating send queue");
 		return -ENXIO;
 	}
 
@@ -3010,19 +3021,20 @@ static int tcp_flow_dump_sndqueue(int s, struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_repair_queue() - Restore contents of a given (pre-selected) queue
- * @s:		Socket
+ * @conn:	Connection to repair queue for
  * @len:	Length of data to be restored
  * @buf:	Buffer with content of pending data queue
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_repair_queue(int s, size_t len, uint8_t *buf)
+static int tcp_flow_repair_queue(const struct tcp_tap_conn *conn,
+				 size_t len, uint8_t *buf)
 {
 	size_t chunk = len;
 	uint8_t *p = buf;
 
 	while (len > 0) {
-		ssize_t rc = send(s, p, MIN(len, chunk), 0);
+		ssize_t rc = send(conn->sock, p, MIN(len, chunk), 0);
 
 		if (rc < 0) {
 			if ((errno == ENOBUFS || errno == ENOMEM) &&
@@ -3032,7 +3044,7 @@ static int tcp_flow_repair_queue(int s, size_t len, uint8_t *buf)
 			}
 
 			rc = -errno;
-			err_perror("Can't write queue, socket %i", s);
+			flow_perror(conn, "Can't write queue");
 			return rc;
 		}
 
@@ -3045,18 +3057,18 @@ static int tcp_flow_repair_queue(int s, size_t len, uint8_t *buf)
 
 /**
  * tcp_flow_dump_seq() - Dump current sequence of pre-selected queue
- * @s:		Socket
+ * @conn:	Pointer to the TCP connection structure
  * @v:		Sequence value, set on return
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_dump_seq(int s, uint32_t *v)
+static int tcp_flow_dump_seq(const struct tcp_tap_conn *conn, uint32_t *v)
 {
 	socklen_t sl = sizeof(*v);
 
-	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, v, &sl)) {
+	if (getsockopt(conn->sock, SOL_TCP, TCP_QUEUE_SEQ, v, &sl)) {
 		int rc = -errno;
-		err_perror("Dumping sequence, socket %i", s);
+		flow_perror(conn, "Dumping sequence");
 		return rc;
 	}
 
@@ -3065,16 +3077,17 @@ static int tcp_flow_dump_seq(int s, uint32_t *v)
 
 /**
  * tcp_flow_repair_seq() - Restore sequence for pre-selected queue
- * @s:		Socket
+ * @conn:	Connection to repair sequences for
  * @v:		Sequence value to be set
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_repair_seq(int s, const uint32_t *v)
+static int tcp_flow_repair_seq(const struct tcp_tap_conn *conn,
+			       const uint32_t *v)
 {
-	if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, v, sizeof(*v))) {
+	if (setsockopt(conn->sock, SOL_TCP, TCP_QUEUE_SEQ, v, sizeof(*v))) {
 		int rc = -errno;
-		err_perror("Setting sequence, socket %i", s);
+		flow_perror(conn, "Setting sequence");
 		return rc;
 	}
 
@@ -3083,15 +3096,17 @@ static int tcp_flow_repair_seq(int s, const uint32_t *v)
 
 /**
  * tcp_flow_dump_rcvqueue() - Dump receive queue and its length, seal/block it
- * @s:		Socket
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  *
  * #syscalls:vu ioctl
  */
-static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
+static int tcp_flow_dump_rcvqueue(const struct tcp_tap_conn *conn,
+				  struct tcp_tap_transfer_ext *t)
 {
+	int s = conn->sock;
 	ssize_t rc;
 
 	if (ioctl(s, SIOCINQ, &t->rcvq) < 0) {
@@ -3111,8 +3126,9 @@ static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
 		t->rcvq--;
 
 	if (t->rcvq > TCP_MIGRATE_RCV_QUEUE_MAX) {
-		err("Receive queue too large to migrate socket %i: %u bytes",
-		    s, t->rcvq);
+		flow_err(conn,
+			 "Receive queue too large to migrate socket: %u bytes",
+			 t->rcvq);
 		return -ENOBUFS;
 	}
 
@@ -3122,13 +3138,13 @@ static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
 			rc = 0;
 		} else {
 			rc = -errno;
-			err_perror("Can't read receive queue for socket %i", s);
+			flow_perror(conn, "Can't read receive queue");
 			return rc;
 		}
 	}
 
 	if ((uint32_t)rc < t->rcvq) {
-		err("Short read migrating receive queue");
+		flow_err(conn, "Short read migrating receive queue");
 		return -ENXIO;
 	}
 
@@ -3137,12 +3153,13 @@ static int tcp_flow_dump_rcvqueue(int s, struct tcp_tap_transfer_ext *t)
 
 /**
  * tcp_flow_repair_opt() - Set repair "options" (MSS, scale, SACK, timestamps)
- * @s:		Socket
+ * @conn:	Pointer to the TCP connection structure
  * @t:		Extended migration data
  *
  * Return: 0 on success, negative error code on failure
  */
-static int tcp_flow_repair_opt(int s, const struct tcp_tap_transfer_ext *t)
+static int tcp_flow_repair_opt(const struct tcp_tap_conn *conn,
+			       const struct tcp_tap_transfer_ext *t)
 {
 	const struct tcp_repair_opt opts[] = {
 		{ TCPOPT_WINDOW,		t->snd_ws + (t->rcv_ws << 16) },
@@ -3156,9 +3173,9 @@ static int tcp_flow_repair_opt(int s, const struct tcp_tap_transfer_ext *t)
 				!!(t->tcpi_options & TCPI_OPT_SACK) +
 				!!(t->tcpi_options & TCPI_OPT_TIMESTAMPS));
 
-	if (setsockopt(s, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl)) {
+	if (setsockopt(conn->sock, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl)) {
 		int rc = -errno;
-		err_perror("Setting repair options, socket %i", s);
+		flow_perror(conn, "Setting repair options");
 		return rc;
 	}
 
@@ -3229,36 +3246,36 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	/* Disable SO_PEEK_OFF, it will make accessing the queues in repair mode
 	 * weird.
 	 */
-	if (tcp_set_peek_offset(s, -1)) {
+	if (tcp_set_peek_offset(conn, -1)) {
 		rc = -errno;
 		goto fail;
 	}
 
-	if ((rc = tcp_flow_dump_tinfo(s, t)))
+	if ((rc = tcp_flow_dump_tinfo(conn, t)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_mss(s, t)))
+	if ((rc = tcp_flow_dump_mss(conn, t)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_wnd(s, t)))
+	if ((rc = tcp_flow_dump_wnd(conn, t)))
 		goto fail;
 
-	if ((rc = tcp_flow_select_queue(s, TCP_SEND_QUEUE)))
+	if ((rc = tcp_flow_select_queue(conn, TCP_SEND_QUEUE)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_sndqueue(s, t)))
+	if ((rc = tcp_flow_dump_sndqueue(conn, t)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_seq(s, &t->seq_snd)))
+	if ((rc = tcp_flow_dump_seq(conn, &t->seq_snd)))
 		goto fail;
 
-	if ((rc = tcp_flow_select_queue(s, TCP_RECV_QUEUE)))
+	if ((rc = tcp_flow_select_queue(conn, TCP_RECV_QUEUE)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_rcvqueue(s, t)))
+	if ((rc = tcp_flow_dump_rcvqueue(conn, t)))
 		goto fail;
 
-	if ((rc = tcp_flow_dump_seq(s, &t->seq_rcv)))
+	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
 		goto fail;
 
 	close(s);
@@ -3269,14 +3286,14 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	t->seq_rcv	-= t->rcvq;
 	t->seq_snd	-= t->sndq;
 
-	debug("Extended migration data, socket %i sequences send %u receive %u",
-	      s, t->seq_snd, t->seq_rcv);
-	debug("  pending queues: send %u not sent %u receive %u",
-	      t->sndq, t->notsent, t->rcvq);
-	debug("  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
-	      t->snd_wl1, t->snd_wnd, t->max_window, t->rcv_wnd, t->rcv_wup);
-	debug("  SO_PEEK_OFF %s  offset=%"PRIu32,
-	      peek_offset_cap ? "enabled" : "disabled", peek_offset);
+	flow_dbg(conn, "Extended migration data, socket %i sequences send %u receive %u",
+		 s, t->seq_snd, t->seq_rcv);
+	flow_dbg(conn, "  pending queues: send %u not sent %u receive %u",
+		 t->sndq, t->notsent, t->rcvq);
+	flow_dbg(conn, "  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
+		 t->snd_wl1, t->snd_wnd, t->max_window, t->rcv_wnd, t->rcv_wup);
+	flow_dbg(conn, "  SO_PEEK_OFF %s  offset=%"PRIu32,
+		 peek_offset_cap ? "enabled" : "disabled", peek_offset);
 
 	/* Endianness fix-ups */
 	t->seq_snd	= htonl(t->seq_snd);
@@ -3292,17 +3309,17 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	t->rcv_wup	= htonl(t->rcv_wup);
 
 	if (write_all_buf(fd, t, sizeof(*t))) {
-		err_perror("Failed to write extended data, socket %i", s);
+		flow_perror(conn, "Failed to write extended data");
 		return -EIO;
 	}
 
 	if (write_all_buf(fd, tcp_migrate_snd_queue, ntohl(t->sndq))) {
-		err_perror("Failed to write send queue data, socket %i", s);
+		flow_perror(conn, "Failed to write send queue data");
 		return -EIO;
 	}
 
 	if (write_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t->rcvq))) {
-		err_perror("Failed to write receive queue data, socket %i", s);
+		flow_perror(conn, "Failed to write receive queue data");
 		return -EIO;
 	}
 
@@ -3317,7 +3334,7 @@ fail:
 	t->tcpi_state = 0; /* Not defined: tell the target to skip this flow */
 
 	if (write_all_buf(fd, t, sizeof(*t))) {
-		err_perror("Failed to write extended data, socket %i", s);
+		flow_perror(conn, "Failed to write extended data");
 		return -EIO;
 	}
 
@@ -3347,19 +3364,20 @@ static int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
 	if ((conn->sock = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
 				 IPPROTO_TCP)) < 0) {
 		rc = -errno;
-		err_perror("Failed to create socket for migrated flow");
+		flow_perror(conn, "Failed to create socket for migrated flow");
 		return rc;
 	}
 	s = conn->sock;
 
 	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int)))
-		debug_perror("Setting SO_REUSEADDR on socket %i", s);
+		flow_dbg_perror(conn, "Failed to set SO_REUSEADDR on socket %i",
+				s);
 
 	tcp_sock_set_nodelay(s);
 
 	if (bind(s, &a.sa, sizeof(a))) {
 		rc = -errno;
-		err_perror("Failed to bind socket for migrated flow");
+		flow_perror(conn, "Failed to bind socket for migrated flow");
 		goto err;
 	}
 
@@ -3390,7 +3408,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
 	rc = flowside_connect(c, conn->sock, PIF_HOST, tgt);
 	if (rc) {
 		rc = -errno;
-		err_perror("Failed to connect migrated socket %i", conn->sock);
+		flow_perror(conn, "Failed to connect migrated socket");
 		return rc;
 	}
 
@@ -3421,8 +3439,8 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 	}
 
 	if (read_all_buf(fd, &t, sizeof(t))) {
+		flow_perror(flow, "Failed to receive migration data");
 		flow_alloc_cancel(flow);
-		err_perror("Failed to receive migration data");
 		return -errno;
 	}
 
@@ -3481,7 +3499,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 
 	if (read_all_buf(fd, &t, sizeof(t))) {
 		rc = -errno;
-		err_perror("Failed to read extended data for socket %i", s);
+		flow_perror(conn, "Failed to read extended data");
 		return rc;
 	}
 
@@ -3503,31 +3521,34 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 	t.rcv_wnd	= ntohl(t.rcv_wnd);
 	t.rcv_wup	= ntohl(t.rcv_wup);
 
-	debug("Extended migration data, socket %i sequences send %u receive %u",
-	      s, t.seq_snd, t.seq_rcv);
-	debug("  pending queues: send %u not sent %u receive %u",
-	      t.sndq, t.notsent, t.rcvq);
-	debug("  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
-	      t.snd_wl1, t.snd_wnd, t.max_window, t.rcv_wnd, t.rcv_wup);
-	debug("  SO_PEEK_OFF %s  offset=%"PRIu32,
-	      peek_offset_cap ? "enabled" : "disabled", peek_offset);
+	flow_dbg(conn,
+		 "Extended migration data, socket %i sequences send %u receive %u",
+		 s, t.seq_snd, t.seq_rcv);
+	flow_dbg(conn, "  pending queues: send %u not sent %u receive %u",
+		 t.sndq, t.notsent, t.rcvq);
+	flow_dbg(conn,
+		 "  window: snd_wl1 %u snd_wnd %u max %u rcv_wnd %u rcv_wup %u",
+		 t.snd_wl1, t.snd_wnd, t.max_window, t.rcv_wnd, t.rcv_wup);
+	flow_dbg(conn, "  SO_PEEK_OFF %s  offset=%"PRIu32,
+		 peek_offset_cap ? "enabled" : "disabled", peek_offset);
 
 	if (t.sndq > TCP_MIGRATE_SND_QUEUE_MAX || t.notsent > t.sndq ||
 	    t.rcvq > TCP_MIGRATE_RCV_QUEUE_MAX) {
-		err("Bad queues socket %i, send: %u, not sent: %u, receive: %u",
-		    s, t.sndq, t.notsent, t.rcvq);
+		flow_err(conn,
+			 "Bad queues socket %i, send: %u, not sent: %u, receive: %u",
+			 s, t.sndq, t.notsent, t.rcvq);
 		return -EINVAL;
 	}
 
 	if (read_all_buf(fd, tcp_migrate_snd_queue, t.sndq)) {
 		rc = -errno;
-		err_perror("Failed to read send queue data, socket %i", s);
+		flow_perror(conn, "Failed to read send queue data");
 		return rc;
 	}
 
 	if (read_all_buf(fd, tcp_migrate_rcv_queue, t.rcvq)) {
 		rc = -errno;
-		err_perror("Failed to read receive queue data, socket %i", s);
+		flow_perror(conn, "Failed to read receive queue data");
 		return rc;
 	}
 
@@ -3535,32 +3556,32 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		/* We weren't able to create the socket, discard flow */
 		goto fail;
 
-	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
+	if (tcp_flow_select_queue(conn, TCP_SEND_QUEUE))
 		goto fail;
 
-	if (tcp_flow_repair_seq(s, &t.seq_snd))
+	if (tcp_flow_repair_seq(conn, &t.seq_snd))
 		goto fail;
 
-	if (tcp_flow_select_queue(s, TCP_RECV_QUEUE))
+	if (tcp_flow_select_queue(conn, TCP_RECV_QUEUE))
 		goto fail;
 
-	if (tcp_flow_repair_seq(s, &t.seq_rcv))
+	if (tcp_flow_repair_seq(conn, &t.seq_rcv))
 		goto fail;
 
 	if (tcp_flow_repair_connect(c, conn))
 		goto fail;
 
-	if (tcp_flow_repair_queue(s, t.rcvq, tcp_migrate_rcv_queue))
+	if (tcp_flow_repair_queue(conn, t.rcvq, tcp_migrate_rcv_queue))
 		goto fail;
 
-	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
+	if (tcp_flow_select_queue(conn, TCP_SEND_QUEUE))
 		goto fail;
 
-	if (tcp_flow_repair_queue(s, t.sndq - t.notsent,
+	if (tcp_flow_repair_queue(conn, t.sndq - t.notsent,
 				  tcp_migrate_snd_queue))
 		goto fail;
 
-	if (tcp_flow_repair_opt(s, &t))
+	if (tcp_flow_repair_opt(conn, &t))
 		goto fail;
 
 	/* If we sent a FIN sent and it was acknowledged (TCP_FIN_WAIT2), don't
@@ -3575,19 +3596,19 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 
 		v = TCP_SEND_QUEUE;
 		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v)))
-			debug_perror("Selecting repair queue, socket %i", s);
+			flow_perror(conn, "Selecting repair queue");
 		else
 			shutdown(s, SHUT_WR);
 	}
 
-	if (tcp_flow_repair_wnd(s, &t))
+	if (tcp_flow_repair_wnd(conn, &t))
 		goto fail;
 
 	tcp_flow_repair_off(c, conn);
 	repair_flush(c);
 
 	if (t.notsent) {
-		if (tcp_flow_repair_queue(s, t.notsent,
+		if (tcp_flow_repair_queue(conn, t.notsent,
 					  tcp_migrate_snd_queue +
 					  (t.sndq - t.notsent))) {
 			/* This sometimes seems to fail for unclear reasons.
@@ -3607,15 +3628,16 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 	if (t.tcpi_state == TCP_FIN_WAIT1)
 		shutdown(s, SHUT_WR);
 
-	if (tcp_set_peek_offset(conn->sock, peek_offset))
+	if (tcp_set_peek_offset(conn, peek_offset))
 		goto fail;
 
 	tcp_send_flag(c, conn, ACK);
 	tcp_data_from_sock(c, conn);
 
 	if ((rc = tcp_epoll_ctl(c, conn))) {
-		debug("Failed to subscribe to epoll for migrated socket %i: %s",
-		      conn->sock, strerror_(-rc));
+		flow_dbg(conn,
+			 "Failed to subscribe to epoll for migrated socket: %s",
+			 strerror_(-rc));
 		goto fail;
 	}
 
diff --git a/tcp.h b/tcp.h
index 9142eca..234a803 100644
--- a/tcp.h
+++ b/tcp.h
@@ -25,7 +25,6 @@ void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
 
 void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
-int tcp_set_peek_offset(int s, int offset);
 
 extern bool peek_offset_cap;
 
diff --git a/tcp_buf.c b/tcp_buf.c
index 72d99c5..0530563 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -125,7 +125,7 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 
 		conn->seq_to_tap = seq;
 		peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
-		if (tcp_set_peek_offset(conn->sock, peek_offset))
+		if (tcp_set_peek_offset(conn, peek_offset))
 			tcp_rst(c, conn);
 	}
 }
@@ -304,7 +304,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			   conn->seq_ack_from_tap, conn->seq_to_tap);
 		conn->seq_to_tap = conn->seq_ack_from_tap;
 		already_sent = 0;
-		if (tcp_set_peek_offset(s, 0)) {
+		if (tcp_set_peek_offset(conn, 0)) {
 			tcp_rst(c, conn);
 			return -1;
 		}
diff --git a/tcp_internal.h b/tcp_internal.h
index 6f5e054..36c6533 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -177,5 +177,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 		      int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
 		      size_t *optlen);
+int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset);
 
 #endif /* TCP_INTERNAL_H */
diff --git a/tcp_vu.c b/tcp_vu.c
index 6891ed1..57587cc 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -376,7 +376,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			   conn->seq_ack_from_tap, conn->seq_to_tap);
 		conn->seq_to_tap = conn->seq_ack_from_tap;
 		already_sent = 0;
-		if (tcp_set_peek_offset(conn->sock, 0)) {
+		if (tcp_set_peek_offset(conn, 0)) {
 			tcp_rst(c, conn);
 			return -1;
 		}

From 51f3c071a76bd20677e72b49007b822dca71e755 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 18 Mar 2025 17:18:47 +0100
Subject: [PATCH 131/221] passt-repair: Fix build with -Werror=format-security

Fixes: 04701702471e ("passt-repair: Add directory watch")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 8bb3f00..120f7aa 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -150,7 +150,7 @@ int main(int argc, char **argv)
 			_exit(1);
 		}
 
-		ret = snprintf(a.sun_path, sizeof(a.sun_path), path);
+		ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", path);
 		inotify_dir = true;
 	} else {
 		ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);

From 28772ee91a60b34786023496ea17c2c2f4e5f7f5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Mar 2025 16:14:21 +1100
Subject: [PATCH 132/221] migrate, tcp: More careful marshalling of mss
 parameter during migration

During migration we extract the limit on segment size using TCP_MAXSEG,
and set it on the other side with TCP_REPAIR_OPTIONS.  However, unlike most
32-bit values we transfer we transfer it in native endian, not network
endian.  This is not correct; add it to the list of endian fixups we make.

In addition, while MAXSEG will be 32-bits in practice, and is given as such
to TCP_REPAIR_OPTIONS, the TCP_MAXSEG sockopt treats it as an 'int'.  It's
not strictly safe to pass a uint32_t to a getsockopt() expecting an int,
although we'll get away with it on most (maybe all) platforms.  Correct
this as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor coding style fix]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index a4c840e..43ee76b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2848,13 +2848,16 @@ static int tcp_flow_dump_mss(const struct tcp_tap_conn *conn,
 			     struct tcp_tap_transfer_ext *t)
 {
 	socklen_t sl = sizeof(t->mss);
+	int val;
 
-	if (getsockopt(conn->sock, SOL_TCP, TCP_MAXSEG, &t->mss, &sl)) {
+	if (getsockopt(conn->sock, SOL_TCP, TCP_MAXSEG, &val, &sl)) {
 		int rc = -errno;
 		flow_perror(conn, "Getting MSS");
 		return rc;
 	}
 
+	t->mss = (uint32_t)val;
+
 	return 0;
 }
 
@@ -3301,6 +3304,7 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	t->sndq		= htonl(t->sndq);
 	t->notsent	= htonl(t->notsent);
 	t->rcvq		= htonl(t->rcvq);
+	t->mss		= htonl(t->mss);
 
 	t->snd_wl1	= htonl(t->snd_wl1);
 	t->snd_wnd	= htonl(t->snd_wnd);
@@ -3514,6 +3518,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 	t.sndq		= ntohl(t.sndq);
 	t.notsent	= ntohl(t.notsent);
 	t.rcvq		= ntohl(t.rcvq);
+	t.mss		= ntohl(t.mss);
 
 	t.snd_wl1	= ntohl(t.snd_wl1);
 	t.snd_wnd	= ntohl(t.snd_wnd);

From cfb3740568ab291d7be00e457658c45ce9367ed5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Mar 2025 16:14:22 +1100
Subject: [PATCH 133/221] migrate, tcp: Migrate RFC 7323 timestamp

Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_conn.h |  2 ++
 2 files changed, 61 insertions(+)

diff --git a/tcp.c b/tcp.c
index 43ee76b..68af43d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2861,6 +2861,57 @@ static int tcp_flow_dump_mss(const struct tcp_tap_conn *conn,
 	return 0;
 }
 
+
+/**
+ * tcp_flow_dump_timestamp() - Dump RFC 7323 timestamp via TCP_TIMESTAMP
+ * @conn:	Pointer to the TCP connection structure
+ * @t:		Extended migration data (tcpi_options must be populated)
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_dump_timestamp(const struct tcp_tap_conn *conn,
+				   struct tcp_tap_transfer_ext *t)
+{
+	int val = 0;
+
+	if (t->tcpi_options & TCPI_OPT_TIMESTAMPS) {
+		socklen_t sl = sizeof(val);
+
+		if (getsockopt(conn->sock, SOL_TCP, TCP_TIMESTAMP, &val, &sl)) {
+			int rc = -errno;
+			flow_perror(conn, "Getting RFC 7323 timestamp");
+			return rc;
+		}
+	}
+
+	t->timestamp = (uint32_t)val;
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_timestamp() - Restore RFC 7323 timestamp via TCP_TIMESTAMP
+ * @conn:	Pointer to the TCP connection structure
+ * @t:		Extended migration data
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_timestamp(const struct tcp_tap_conn *conn,
+				   const struct tcp_tap_transfer_ext *t)
+{
+	int val = (int)t->timestamp;
+
+	if (t->tcpi_options & TCPI_OPT_TIMESTAMPS) {
+		if (setsockopt(conn->sock, SOL_TCP, TCP_TIMESTAMP,
+			       &val, sizeof(val))) {
+			int rc = -errno;
+			flow_perror(conn, "Setting RFC 7323 timestamp");
+			return rc;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * tcp_flow_dump_wnd() - Dump current tcp_repair_window parameters
  * @conn:	Pointer to the TCP connection structure
@@ -3260,6 +3311,9 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	if ((rc = tcp_flow_dump_mss(conn, t)))
 		goto fail;
 
+	if ((rc = tcp_flow_dump_timestamp(conn, t)))
+		goto fail;
+
 	if ((rc = tcp_flow_dump_wnd(conn, t)))
 		goto fail;
 
@@ -3305,6 +3359,7 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	t->notsent	= htonl(t->notsent);
 	t->rcvq		= htonl(t->rcvq);
 	t->mss		= htonl(t->mss);
+	t->timestamp	= htonl(t->timestamp);
 
 	t->snd_wl1	= htonl(t->snd_wl1);
 	t->snd_wnd	= htonl(t->snd_wnd);
@@ -3519,6 +3574,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 	t.notsent	= ntohl(t.notsent);
 	t.rcvq		= ntohl(t.rcvq);
 	t.mss		= ntohl(t.mss);
+	t.timestamp	= ntohl(t.timestamp);
 
 	t.snd_wl1	= ntohl(t.snd_wl1);
 	t.snd_wnd	= ntohl(t.snd_wnd);
@@ -3561,6 +3617,9 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		/* We weren't able to create the socket, discard flow */
 		goto fail;
 
+	if (tcp_flow_repair_timestamp(conn, &t))
+		goto fail;
+
 	if (tcp_flow_select_queue(conn, TCP_SEND_QUEUE))
 		goto fail;
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 9126a36..35d813d 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -152,6 +152,7 @@ struct tcp_tap_transfer {
  * @notsent:		Part of pending send queue that wasn't sent out yet
  * @rcvq:		Length of pending receive queue
  * @mss:		Socket-side MSS clamp
+ * @timestamp:		RFC 7323 timestamp
  * @snd_wl1:		Next sequence used in window probe (next sequence - 1)
  * @snd_wnd:		Socket-side sending window
  * @max_window:		Window clamp
@@ -171,6 +172,7 @@ struct tcp_tap_transfer_ext {
 	uint32_t	rcvq;
 
 	uint32_t	mss;
+	uint32_t	timestamp;
 
 	/* We can't just use struct tcp_repair_window: we need network order */
 	uint32_t	snd_wl1;

From c250ffc5c11385d9618b3a8165e676d68d5cbfa2 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 19 Mar 2025 16:14:23 +1100
Subject: [PATCH 134/221] migrate: Bump migration version number

v1 of the migration stream format, had some flaws: it didn't properly
handle endianness of the MSS field, and it didn't transfer the RFC7323
timestamp.  We've now fixed those bugs, but it requires incompatible
changes to the stream format.

Because of the timestamps in particular, v1 is not really usable, so there
is little point maintaining compatible support for it.  However, v1 is in
released packages, both upstream and downstream (RHEL at least).  Just
updating the stream format without bumping the version would lead to very
cryptic errors if anyone did attempt to migrate between an old and new
passt.

So, bump the migration version to v2, so we'll get a clear error message if
anyone attempts this.  We don't attempt to maintain backwards compatibility
with v1, however: we'll simply fail if given a v1 stream.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 migrate.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/migrate.c b/migrate.c
index 0fca77b..48d63a0 100644
--- a/migrate.c
+++ b/migrate.c
@@ -96,8 +96,8 @@ static int seen_addrs_target_v1(struct ctx *c,
 	return 0;
 }
 
-/* Stages for version 1 */
-static const struct migrate_stage stages_v1[] = {
+/* Stages for version 2 */
+static const struct migrate_stage stages_v2[] = {
 	{
 		.name = "observed addresses",
 		.source = seen_addrs_source_v1,
@@ -118,7 +118,11 @@ static const struct migrate_stage stages_v1[] = {
 
 /* Supported encoding versions, from latest (most preferred) to oldest */
 static const struct migrate_version versions[] = {
-	{ 1,	stages_v1, },
+	{ 2,	stages_v2, },
+	/* v1 was released, but not widely used.  It had bad endianness for the
+	 * MSS and omitted timestamps, which meant it usually wouldn't work.
+	 * Therefore we don't attempt to support compatibility with it.
+	 */
 	{ 0 },
 };
 

From ebdd46367ce1acba235013d97e362b8677b538d5 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 19 Mar 2025 17:57:45 +0100
Subject: [PATCH 135/221] tcp: Flush socket before checking for more data in
 active close state
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Otherwise, if all the pending data is acknowledged:

- tcp_update_seqack_from_tap() updates the current tap-side ACK
  sequence (conn->seq_ack_from_tap)

- next, we compare the sequence we sent (conn->seq_to_tap) to the
  ACK sequence (conn->seq_ack_from_tap) in tcp_data_from_sock() to
  understand if there's more data we can send.

  If they match, we conclude that we haven't sent any of that data,
  and keep re-sending it.

We need, instead, to flush the socket (drop acknowledged data) before
calling tcp_update_seqack_from_tap(), so that once we update
conn->seq_ack_from_tap, we can be sure that all data until there is
gone from the socket.

Link: https://bugs.passt.top/show_bug.cgi?id=114
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 30f1e082c3c0 ("tcp: Keep updating window and checking for socket data after FIN from guest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tcp.c b/tcp.c
index 68af43d..fa1d885 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2049,6 +2049,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	/* Established connections not accepting data from tap */
 	if (conn->events & TAP_FIN_RCVD) {
+		tcp_sock_consume(conn, ntohl(th->ack_seq));
 		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
 		tcp_tap_window_update(conn, ntohs(th->window));
 		tcp_data_from_sock(c, conn);

From 07c2d584b334b0c405a5702a4f2fad104d03940b Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 19 Mar 2025 20:43:47 +0100
Subject: [PATCH 136/221] conf: Include libgen.h for basename(), fix build
 against musl

Fixes: 4b17d042c7e4 ("conf: Move mode detection into helper function")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/conf.c b/conf.c
index 0e2e8dc..b54c55d 100644
--- a/conf.c
+++ b/conf.c
@@ -16,6 +16,7 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <getopt.h>
+#include <libgen.h>
 #include <string.h>
 #include <sched.h>
 #include <sys/types.h>

From 32f6212551c5db3b7b3548e8483e5d73f07a35ac Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 19 Mar 2025 20:45:12 +0100
Subject: [PATCH 137/221] Makefile: Enable -Wformat-security

It looks like an easy win to prevent a number of possible security
flaws.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index f2ac8e5..31cbac3 100644
--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,7 @@ ifeq ($(shell $(CC) -O2 -dM -E - < /dev/null 2>&1 | grep ' _FORTIFY_SOURCE ' > /
 FORTIFY_FLAG := -D_FORTIFY_SOURCE=2
 endif
 
-FLAGS := -Wall -Wextra -Wno-format-zero-length
+FLAGS := -Wall -Wextra -Wno-format-zero-length -Wformat-security
 FLAGS += -pedantic -std=c11 -D_XOPEN_SOURCE=700 -D_GNU_SOURCE
 FLAGS +=  $(FORTIFY_FLAG) -O2 -pie -fPIE
 FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)

From 4592719a744bcb47db2ff5680be4b8f6362a97ce Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:14 +1100
Subject: [PATCH 138/221] vu_common: Tighten vu_packet_check_range()

This function verifies that the given packet is within the mmap()ed memory
region of the vhost-user device.  We can do better, however.  The packet
should be not only within the mmap()ed range, but specifically in the
subsection of that range set aside for shared buffers, which starts at
dev_region->mmap_offset within there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vu_common.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vu_common.c b/vu_common.c
index 686a09b..9eea4f2 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -37,10 +37,10 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
 
 	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
 		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
-		char *m = (char *)(uintptr_t)dev_region->mmap_addr;
+		char *m = (char *)(uintptr_t)dev_region->mmap_addr +
+			dev_region->mmap_offset;
 
-		if (m <= ptr &&
-		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
+		if (m <= ptr && ptr + len <= m + dev_region->size)
 			return 0;
 	}
 

From e43e00719d7701301e4bc4fb179dc7adff175409 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:15 +1100
Subject: [PATCH 139/221] packet: More cautious checks to avoid pointer
 arithmetic UB

packet_check_range and vu_packet_check_range() verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to.  However, in doing so it doesn't avoid the possibility of
an integer overflow while performing pointer arithmetic, with is UB.  In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.

To do this safely, we can't calculate the end of a memory region with
pointer addition when then the length as untrusted.  Instead we must work
out the offset of one memory region within another using pointer
subtraction, then do integer checks against the length of the outer region.
We then need to be careful about the order of checks so that those integer
checks can't themselves overflow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c    | 12 +++++++++---
 vu_common.c | 10 +++++++---
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/packet.c b/packet.c
index bcac037..d1a51a5 100644
--- a/packet.c
+++ b/packet.c
@@ -52,9 +52,15 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
 		return -1;
 	}
 
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
+	if (len > p->buf_size) {
+		trace("packet range length %zu larger than buffer %zu, %s:%i",
+		      len, p->buf_size, func, line);
+		return -1;
+	}
+
+	if ((size_t)(ptr - p->buf) > p->buf_size - len) {
+		trace("packet range %p, len %zu after buffer end %p, %s:%i",
+		      (void *)ptr, len, (void *)(p->buf + p->buf_size),
 		      func, line);
 		return -1;
 	}
diff --git a/vu_common.c b/vu_common.c
index 9eea4f2..cefe5e2 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -36,11 +36,15 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
 	struct vu_dev_region *dev_region;
 
 	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
-		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
-		char *m = (char *)(uintptr_t)dev_region->mmap_addr +
+		uintptr_t base_addr = dev_region->mmap_addr +
 			dev_region->mmap_offset;
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		const char *base = (const char *)base_addr;
 
-		if (m <= ptr && ptr + len <= m + dev_region->size)
+		ASSERT(base_addr >= dev_region->mmap_addr);
+
+		if (len <= dev_region->size && base <= ptr &&
+		    (size_t)(ptr - base) <= dev_region->size - len)
 			return 0;
 	}
 

From a41d6d125eca5ac8c54bed8157098be141557b03 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:16 +1100
Subject: [PATCH 140/221] tap: Make size of pool_tap[46] purely a tuning
 parameter

Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Further more, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we exceed do TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c    | 13 ++++++++++++-
 packet.h    |  3 +++
 passt.h     |  2 --
 tap.c       | 19 ++++++++++++++++---
 tap.h       |  3 ++-
 vu_common.c |  5 +++--
 6 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/packet.c b/packet.c
index d1a51a5..08076d5 100644
--- a/packet.c
+++ b/packet.c
@@ -67,6 +67,17 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
 
 	return 0;
 }
+/**
+ * pool_full() - Is a packet pool full?
+ * @p:		Pointer to packet pool
+ *
+ * Return: true if the pool is full, false if more packets can be added
+ */
+bool pool_full(const struct pool *p)
+{
+	return p->count >= p->size;
+}
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -80,7 +91,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 {
 	size_t idx = p->count;
 
-	if (idx >= p->size) {
+	if (pool_full(p)) {
 		trace("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
diff --git a/packet.h b/packet.h
index d099f02..dd18461 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,8 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+#include <stdbool.h>
+
 /* Maximum size of a single packet stored in pool, including headers */
 #define PACKET_MAX_LEN	UINT16_MAX
 
@@ -33,6 +35,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, const size_t idx,
 		    size_t offset, size_t len, size_t *left,
 		    const char *func, int line);
+bool pool_full(const struct pool *p);
 void pool_flush(struct pool *p);
 
 #define packet_add(p, len, start)					\
diff --git a/passt.h b/passt.h
index 8f45091..8693794 100644
--- a/passt.h
+++ b/passt.h
@@ -71,8 +71,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 /* Large enough for ~128 maximum size frames */
 #define PKT_BUF_BYTES		(8UL << 20)
-#define TAP_MSGS							\
-	DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
 extern char pkt_buf		[PKT_BUF_BYTES];
 
diff --git a/tap.c b/tap.c
index 182a115..34e6774 100644
--- a/tap.c
+++ b/tap.c
@@ -75,6 +75,9 @@ CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
 CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
 CHECK_FRAME_LEN(L2_MAX_LEN_VU);
 
+#define TAP_MSGS							\
+	DIV_ROUND_UP(sizeof(pkt_buf), ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
+
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
 static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
@@ -1042,8 +1045,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
  * @c:		Execution context
  * @l2len:	Total L2 packet length
  * @p:		Packet buffer
+ * @now:	Current timestamp
  */
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now)
 {
 	const struct ethhdr *eh;
 
@@ -1059,9 +1064,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
 	switch (ntohs(eh->h_proto)) {
 	case ETH_P_ARP:
 	case ETH_P_IP:
+		if (pool_full(pool_tap4)) {
+			tap4_handler(c, pool_tap4, now);
+			pool_flush(pool_tap4);
+		}
 		packet_add(pool_tap4, l2len, p);
 		break;
 	case ETH_P_IPV6:
+		if (pool_full(pool_tap6)) {
+			tap6_handler(c, pool_tap6, now);
+			pool_flush(pool_tap6);
+		}
 		packet_add(pool_tap6, l2len, p);
 		break;
 	default:
@@ -1142,7 +1155,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		tap_add_packet(c, l2len, p);
+		tap_add_packet(c, l2len, p, now);
 
 		p += l2len;
 		n -= l2len;
@@ -1207,7 +1220,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 		    len > (ssize_t)L2_MAX_LEN_PASTA)
 			continue;
 
-		tap_add_packet(c, len, pkt_buf + n);
+		tap_add_packet(c, len, pkt_buf + n, now);
 	}
 
 	tap_handler(c, now);
diff --git a/tap.h b/tap.h
index dd39fd8..6fe3d15 100644
--- a/tap.h
+++ b/tap.h
@@ -119,6 +119,7 @@ void tap_sock_update_pool(void *base, size_t size);
 void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now);
 
 #endif /* TAP_H */
diff --git a/vu_common.c b/vu_common.c
index cefe5e2..5e6fd4a 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -195,7 +195,7 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 			tap_add_packet(vdev->context,
 				       elem[count].out_sg[0].iov_len - hdrlen,
 				       (char *)elem[count].out_sg[0].iov_base +
-				        hdrlen);
+				       hdrlen, now);
 		} else {
 			/* vnet header can be in a separate iovec */
 			if (elem[count].out_num != 2) {
@@ -207,7 +207,8 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 			} else {
 				tap_add_packet(vdev->context,
 					       elem[count].out_sg[1].iov_len,
-					       (char *)elem[count].out_sg[1].iov_base);
+					       (char *)elem[count].out_sg[1].iov_base,
+					       now);
 			}
 		}
 

From 9866d146e654975dd7f5fd3f1294d5fc4628cef3 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:17 +1100
Subject: [PATCH 141/221] tap: Clarify calculation of TAP_MSGS

The rationale behind the calculation of TAP_MSGS isn't necessarily obvious.
It's supposed to be the maximum number of packets that can fit in pkt_buf.
However, the calculation is wrong in several ways:
 * It's based on ETH_ZLEN which isn't meaningful for virtual devices
 * It always includes the qemu socket header which isn't used for pasta
 * The size of pkt_buf isn't relevant for vhost-user

We've already made sure this is just a tuning parameter, not a hard limit.
Clarify what we're calculating here and why.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/tap.c b/tap.c
index 34e6774..3a6fcbe 100644
--- a/tap.c
+++ b/tap.c
@@ -75,12 +75,28 @@ CHECK_FRAME_LEN(L2_MAX_LEN_PASTA);
 CHECK_FRAME_LEN(L2_MAX_LEN_PASST);
 CHECK_FRAME_LEN(L2_MAX_LEN_VU);
 
-#define TAP_MSGS							\
-	DIV_ROUND_UP(sizeof(pkt_buf), ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
+/* We try size the packet pools so that we can use a single batch for the entire
+ * packet buffer.  This might be exceeded for vhost-user, though, which uses its
+ * own buffers rather than pkt_buf.
+ *
+ * This is just a tuning parameter, the code will work with slightly more
+ * overhead if it's incorrect.  So, we estimate based on the minimum practical
+ * frame size - an empty UDP datagram - rather than the minimum theoretical
+ * frame size.
+ *
+ * FIXME: Profile to work out how big this actually needs to be to amortise
+ *        per-batch syscall overheads
+ */
+#define TAP_MSGS_IP4							\
+	DIV_ROUND_UP(sizeof(pkt_buf),					\
+		     ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr))
+#define TAP_MSGS_IP6							\
+	DIV_ROUND_UP(sizeof(pkt_buf),					\
+		     ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct udphdr))
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
-static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
-static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
+static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS_IP4, pkt_buf);
+static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
 
 #define TAP_SEQS		128 /* Different L4 tuples in one batch */
 #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
@@ -1418,8 +1434,8 @@ void tap_sock_update_pool(void *base, size_t size)
 {
 	int i;
 
-	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size);
-	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size);
+	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS_IP4, base, size);
+	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS_IP6, base, size);
 
 	for (i = 0; i < TAP_SEQS; i++) {
 		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);

From c48331ca51399fe1779529511be395b576aaf0af Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:18 +1100
Subject: [PATCH 142/221] packet: Correct type of PACKET_MAX_LEN

PACKET_MAX_LEN is usually involved in calculations on size_t values - the
type of the iov_len field in struct iovec.  However, being defined bare as
UINT16_MAX, the compiled is likely to assign it a shorter type.  This can
lead to unexpected promotions (or lack thereof).  Add a cast to force the
type to be what we expect.

Fixes: c43972ad6 ("packet: Give explicit name to maximum packet size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packet.h b/packet.h
index dd18461..9061dad 100644
--- a/packet.h
+++ b/packet.h
@@ -9,7 +9,7 @@
 #include <stdbool.h>
 
 /* Maximum size of a single packet stored in pool, including headers */
-#define PACKET_MAX_LEN	UINT16_MAX
+#define PACKET_MAX_LEN	((size_t)UINT16_MAX)
 
 /**
  * struct pool - Generic pool of packets stored in a buffer

From 37d9f374d9f0c47c092f80a5d85d4505ae4a9af7 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:19 +1100
Subject: [PATCH 143/221] packet: Avoid integer overflows in packet_get_do()

In packet_get_do() both offset and len are essentially untrusted.  We do
some validation of len (check it's < PACKET_MAX_LEN), but that's not enough
to ensure that (len + offset) doesn't overflow.  Rearrange our calculation
to make sure it's safe regardless of the given offset & len values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/packet.c b/packet.c
index 08076d5..fdc4be7 100644
--- a/packet.c
+++ b/packet.c
@@ -144,7 +144,8 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len + offset > p->pkt[idx].iov_len) {
+	if (offset > p->pkt[idx].iov_len ||
+	    len > (p->pkt[idx].iov_len - offset)) {
 		if (func) {
 			trace("data length %zu, offset %zu from length %zu, "
 			      "%s:%i", len, offset, p->pkt[idx].iov_len,

From 961aa6a0eb7fce956a34f8ccd883bfe12392d3d3 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:20 +1100
Subject: [PATCH 144/221] packet: Move checks against PACKET_MAX_LEN to
 packet_check_range()

Both the callers of packet_check_range() separately verify that the given
length does not exceed PACKET_MAX_LEN.  Fold that check into
packet_check_range() instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/packet.c b/packet.c
index fdc4be7..7cbe95d 100644
--- a/packet.c
+++ b/packet.c
@@ -35,6 +35,12 @@
 static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
 			      const char *func, int line)
 {
+	if (len > PACKET_MAX_LEN) {
+		trace("packet range length %zu (max %zu), %s:%i",
+		      len, PACKET_MAX_LEN, func, line);
+		return -1;
+	}
+
 	if (p->buf_size == 0) {
 		int ret;
 
@@ -100,11 +106,6 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	if (packet_check_range(p, start, len, func, line))
 		return;
 
-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
-
 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
 
@@ -136,14 +137,6 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > PACKET_MAX_LEN) {
-		if (func) {
-			trace("packet data length %zu, %s:%i",
-			      len, func, line);
-		}
-		return NULL;
-	}
-
 	if (offset > p->pkt[idx].iov_len ||
 	    len > (p->pkt[idx].iov_len - offset)) {
 		if (func) {

From 38bcce997763f2e0c4bb6c0a3926674317796544 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:21 +1100
Subject: [PATCH 145/221] packet: Rework packet_get() versus packet_get_try()

Most failures of packet_get() indicate a serious problem, and log messages
accordingly.  However, a few callers expect failures here, because they're
probing for a certain range which might or might not be in a packet.  They
use packet_get_try() which passes a NULL func to packet_get_do() to
suppress the logging which is unwanted in this case.

However, this doesn't just suppress the log when packet_get_do() finds the
requested region isn't in the packet.  It suppresses logging for all other
errors too, which do indicate serious problems, even for the callers of
packet_get_try().  Worse it will pass the NULL func on to
packet_check_range() which doesn't expect it, meaning we'll get unhelpful
messages from there if there is a failure.

Fix this by making packet_get_try_do() the primary function which doesn't
log for the case of a range outside the packet.  packet_get_do() becomes a
trivial wrapper around that which logs a message if packet_get_try_do()
returns NULL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 51 +++++++++++++++++++++++++++++++++++----------------
 packet.h |  8 +++++---
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/packet.c b/packet.c
index 7cbe95d..b3e8c79 100644
--- a/packet.c
+++ b/packet.c
@@ -89,7 +89,7 @@ bool pool_full(const struct pool *p)
  * @p:		Existing pool
  * @len:	Length of new descriptor
  * @start:	Start of data
- * @func:	For tracing: name of calling function, NULL means no trace()
+ * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  */
 void packet_add_do(struct pool *p, size_t len, const char *start,
@@ -113,39 +113,31 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 }
 
 /**
- * packet_get_do() - Get data range from packet descriptor from given pool
+ * packet_get_try_do() - Get data range from packet descriptor from given pool
  * @p:		Packet pool
  * @idx:	Index of packet descriptor in pool
  * @offset:	Offset of data range in packet descriptor
  * @len:	Length of desired data range
  * @left:	Length of available data after range, set on return, can be NULL
- * @func:	For tracing: name of calling function, NULL means no trace()
+ * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
  * Return: pointer to start of data range, NULL on invalid range or descriptor
  */
-void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
-		    size_t len, size_t *left, const char *func, int line)
+void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
+			size_t len, size_t *left, const char *func, int line)
 {
 	char *ptr;
 
 	if (idx >= p->size || idx >= p->count) {
-		if (func) {
-			trace("packet %zu from pool size: %zu, count: %zu, "
-			      "%s:%i", idx, p->size, p->count, func, line);
-		}
+		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
+		      idx, p->size, p->count, func, line);
 		return NULL;
 	}
 
 	if (offset > p->pkt[idx].iov_len ||
-	    len > (p->pkt[idx].iov_len - offset)) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %zu, "
-			      "%s:%i", len, offset, p->pkt[idx].iov_len,
-			      func, line);
-		}
+	    len > (p->pkt[idx].iov_len - offset))
 		return NULL;
-	}
 
 	ptr = (char *)p->pkt[idx].iov_base + offset;
 
@@ -158,6 +150,33 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 	return ptr;
 }
 
+/**
+ * packet_get_do() - Get data range from packet descriptor from given pool
+ * @p:		Packet pool
+ * @idx:	Index of packet descriptor in pool
+ * @offset:	Offset of data range in packet descriptor
+ * @len:	Length of desired data range
+ * @left:	Length of available data after range, set on return, can be NULL
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
+ *
+ * Return: as packet_get_try_do() but log a trace message when returning NULL
+ */
+void *packet_get_do(const struct pool *p, const size_t idx,
+		    size_t offset, size_t len, size_t *left,
+		    const char *func, int line)
+{
+	void *r = packet_get_try_do(p, idx, offset, len, left, func, line);
+
+	if (!r) {
+		trace("missing packet data length %zu, offset %zu from "
+		      "length %zu, %s:%i",
+		      len, offset, p->pkt[idx].iov_len, func, line);
+	}
+
+	return r;
+}
+
 /**
  * pool_flush() - Flush a packet pool
  * @p:		Pointer to packet pool
diff --git a/packet.h b/packet.h
index 9061dad..c94780a 100644
--- a/packet.h
+++ b/packet.h
@@ -32,6 +32,9 @@ struct pool {
 int vu_packet_check_range(void *buf, const char *ptr, size_t len);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
+void *packet_get_try_do(const struct pool *p, const size_t idx,
+			size_t offset, size_t len, size_t *left,
+			const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
 		    size_t offset, size_t len, size_t *left,
 		    const char *func, int line);
@@ -41,12 +44,11 @@ void pool_flush(struct pool *p);
 #define packet_add(p, len, start)					\
 	packet_add_do(p, len, start, __func__, __LINE__)
 
+#define packet_get_try(p, idx, offset, len, left)			\
+	packet_get_try_do(p, idx, offset, len, left, __func__, __LINE__)
 #define packet_get(p, idx, offset, len, left)				\
 	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
 
-#define packet_get_try(p, idx, offset, len, left)			\
-	packet_get_do(p, idx, offset, len, left, NULL, 0)
-
 #define PACKET_POOL_DECL(_name, _size, _buf)				\
 struct _name ## _t {							\
 	char *buf;							\

From 9153aca15bc1150e450dd56e79bc035cc2dbf27c Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:22 +1100
Subject: [PATCH 146/221] util: Add abort_with_msg() and ASSERT_WITH_MSG()
 helpers

We already have the ASSERT() macro which will abort() passt based on a
condition.  It always has a fixed error message based on its location and
the asserted expression.  We have some upcoming cases where we want to
customise the message when hitting an assert.

Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 util.c | 19 +++++++++++++++++++
 util.h | 25 ++++++++++---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/util.c b/util.c
index 656e86a..b9a3d43 100644
--- a/util.c
+++ b/util.c
@@ -1017,3 +1017,22 @@ void encode_domain_name(char *buf, const char *domain_name)
 	}
 	p[i] = 0L;
 }
+
+/**
+ * abort_with_msg() - Print error message and abort
+ * @fmt:	Format string
+ * @...:	Format parameters
+ */
+void abort_with_msg(const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vlogmsg(true, false, LOG_CRIT, fmt, ap);
+	va_end(ap);
+
+	/* This may actually cause a SIGSYS instead of SIGABRT, due to seccomp,
+	 * but that will still get the job done.
+	 */
+	abort();
+}
diff --git a/util.h b/util.h
index 4d512fa..b1e7e79 100644
--- a/util.h
+++ b/util.h
@@ -61,27 +61,22 @@
 #define STRINGIFY(x)	#x
 #define STR(x)		STRINGIFY(x)
 
-#ifdef CPPCHECK_6936
+void abort_with_msg(const char *fmt, ...)
+	__attribute__((format(printf, 1, 2), noreturn));
+
 /* Some cppcheck versions get confused by aborts inside a loop, causing
  * it to give false positive uninitialised variable warnings later in
  * the function, because it doesn't realise the non-initialising path
  * already exited.  See https://trac.cppcheck.net/ticket/13227
+ *
+ * Therefore, avoid using the usual do while wrapper we use to force the macro
+ * to act like a single statement requiring a ';'.
  */
-#define ASSERT(expr)		\
-	((expr) ? (void)0 : abort())
-#else
+#define ASSERT_WITH_MSG(expr, ...)					\
+	((expr) ? (void)0 : abort_with_msg(__VA_ARGS__))
 #define ASSERT(expr)							\
-	do {								\
-		if (!(expr)) {						\
-			err("ASSERTION FAILED in %s (%s:%d): %s",	\
-			    __func__, __FILE__, __LINE__, STRINGIFY(expr)); \
-			/* This may actually SIGSYS, due to seccomp,	\
-			 * but that will still get the job done		\
-			 */						\
-			abort();					\
-		}							\
-	} while (0)
-#endif
+	ASSERT_WITH_MSG((expr), "ASSSERTION FAILED in %s (%s:%d): %s",	\
+			__func__, __FILE__, __LINE__, STRINGIFY(expr))
 
 #ifdef P_tmpdir
 #define TMPDIR		P_tmpdir

From 0857515c943d439eade80710c16f15f146dfa9e8 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:23 +1100
Subject: [PATCH 147/221] packet: ASSERT on signs of pool corruption

If packet_check_range() fails in packet_get_try_do() we just return NULL.
But this check only takes places after we've already validated the given
range against the packet it's in.  That means that if packet_check_range()
fails, the packet pool is already in a corrupted state (we should have
made strictly stronger checks when the packet was added).  Simply returning
NULL and logging a trace() level message isn't really adequate for that
situation; ASSERT instead.

Similarly we check the given idx against both p->count and p->size.  The
latter should be redundant, because count should always be <= size.  If
that's not the case then, again, the pool is already in a corrupted state
and we may have overwritten unknown memory.  Assert for this case too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/packet.c b/packet.c
index b3e8c79..be28f27 100644
--- a/packet.c
+++ b/packet.c
@@ -129,9 +129,13 @@ void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;
 
-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(p->count <= p->size,
+			"Corrupt pool count: %zu, size: %zu, %s:%i",
+			p->count, p->size, func, line);
+
+	if (idx >= p->count) {
+		trace("packet %zu from pool count: %zu, %s:%i",
+		      idx, p->count, func, line);
 		return NULL;
 	}
 
@@ -141,8 +145,8 @@ void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
 
 	ptr = (char *)p->pkt[idx].iov_base + offset;
 
-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	ASSERT_WITH_MSG(!packet_check_range(p, ptr, len, func, line),
+			"Corrupt packet pool, %s:%i", func, line);
 
 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;

From cf4d3f05c9263d1b0a88dbbcf9e48d34cac6708e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Mon, 17 Mar 2025 20:24:24 +1100
Subject: [PATCH 148/221] packet: Upgrade severity of most packet errors

All errors from packet_range_check(), packet_add() and packet_get() are
trace level.  However, these are for the most part actual error conditions.
They're states that should not happen, in many cases indicating a bug
in the caller or elswhere.

We don't promote these to err() or ASSERT() level, for fear of a localised
bug on very specific input crashing the entire program, or flooding the
logs, but we can at least upgrade them to debug level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 packet.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/packet.c b/packet.c
index be28f27..72c6158 100644
--- a/packet.c
+++ b/packet.c
@@ -36,7 +36,7 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
 			      const char *func, int line)
 {
 	if (len > PACKET_MAX_LEN) {
-		trace("packet range length %zu (max %zu), %s:%i",
+		debug("packet range length %zu (max %zu), %s:%i",
 		      len, PACKET_MAX_LEN, func, line);
 		return -1;
 	}
@@ -47,25 +47,25 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
 		ret = vu_packet_check_range((void *)p->buf, ptr, len);
 
 		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
+			debug("cannot find region, %s:%i", func, line);
 
 		return ret;
 	}
 
 	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
+		debug("packet range start %p before buffer start %p, %s:%i",
 		      (void *)ptr, (void *)p->buf, func, line);
 		return -1;
 	}
 
 	if (len > p->buf_size) {
-		trace("packet range length %zu larger than buffer %zu, %s:%i",
+		debug("packet range length %zu larger than buffer %zu, %s:%i",
 		      len, p->buf_size, func, line);
 		return -1;
 	}
 
 	if ((size_t)(ptr - p->buf) > p->buf_size - len) {
-		trace("packet range %p, len %zu after buffer end %p, %s:%i",
+		debug("packet range %p, len %zu after buffer end %p, %s:%i",
 		      (void *)ptr, len, (void *)(p->buf + p->buf_size),
 		      func, line);
 		return -1;
@@ -98,7 +98,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;
 
 	if (pool_full(p)) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}
@@ -134,7 +134,7 @@ void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset,
 			p->count, p->size, func, line);
 
 	if (idx >= p->count) {
-		trace("packet %zu from pool count: %zu, %s:%i",
+		debug("packet %zu from pool count: %zu, %s:%i",
 		      idx, p->count, func, line);
 		return NULL;
 	}

From 89b203b851f32a532cc0406cf26a1d24950a207c Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:01 +1100
Subject: [PATCH 149/221] udp: Common invocation of udp_sock_errs() for
 vhost-user and "buf" paths

The vhost-user and non-vhost-user paths for both udp_listen_sock_handler()
and udp_reply_sock_handler() are more or less completely separate.  Both,
however, start with essentially the same invocation of udp_sock_errs(), so
that can be made common.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c          | 37 ++++++++++++++++++++-----------------
 udp_internal.h |  2 +-
 udp_vu.c       | 15 ---------------
 3 files changed, 21 insertions(+), 33 deletions(-)

diff --git a/udp.c b/udp.c
index 80520cb..4a06b16 100644
--- a/udp.c
+++ b/udp.c
@@ -585,7 +585,8 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t events)
+static int udp_sock_errs(const struct ctx *c, union epoll_ref ref,
+			 uint32_t events)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
@@ -678,13 +679,6 @@ static void udp_buf_listen_sock_handler(const struct ctx *c,
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
 
-	if (udp_sock_errs(c, ref, events) < 0) {
-		err("UDP: Unrecoverable error on listening socket:"
-		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
-		/* FIXME: what now?  close/re-open socket? */
-		return;
-	}
-
 	if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
 		return;
 
@@ -750,6 +744,13 @@ void udp_listen_sock_handler(const struct ctx *c,
 			     union epoll_ref ref, uint32_t events,
 			     const struct timespec *now)
 {
+	if (udp_sock_errs(c, ref, events) < 0) {
+		err("UDP: Unrecoverable error on listening socket:"
+		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
+		/* FIXME: what now?  close/re-open socket? */
+		return;
+	}
+
 	if (c->mode == MODE_VU) {
 		udp_vu_listen_sock_handler(c, ref, events, now);
 		return;
@@ -777,17 +778,8 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 	uint8_t topif = pif_at_sidx(tosidx);
 	int n, i, from_s;
 
-	ASSERT(!c->no_udp && uflow);
-
 	from_s = uflow->s[ref.flowside.sidei];
 
-	if (udp_sock_errs(c, ref, events) < 0) {
-		flow_err(uflow, "Unrecoverable error on reply socket");
-		flow_err_details(uflow);
-		udp_flow_close(c, uflow);
-		return;
-	}
-
 	if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
 		return;
 
@@ -825,6 +817,17 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 			    uint32_t events, const struct timespec *now)
 {
+	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
+
+	ASSERT(!c->no_udp && uflow);
+
+	if (udp_sock_errs(c, ref, events) < 0) {
+		flow_err(uflow, "Unrecoverable error on reply socket");
+		flow_err_details(uflow);
+		udp_flow_close(c, uflow);
+		return;
+	}
+
 	if (c->mode == MODE_VU) {
 		udp_vu_reply_sock_handler(c, ref, events, now);
 		return;
diff --git a/udp_internal.h b/udp_internal.h
index 3b081f5..02724e5 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -30,5 +30,5 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
                        const struct flowside *toside, size_t dlen,
 		       bool no_udp_csum);
-int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t events);
+
 #endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
index c26a223..84f52af 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -227,12 +227,6 @@ void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
 
-	if (udp_sock_errs(c, ref, events) < 0) {
-		err("UDP: Unrecoverable error on listening socket:"
-		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
-		return;
-	}
-
 	for (i = 0; i < UDP_MAX_FRAMES; i++) {
 		const struct flowside *toside;
 		union sockaddr_inany s_in;
@@ -300,15 +294,6 @@ void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
 
-	ASSERT(!c->no_udp);
-
-	if (udp_sock_errs(c, ref, events) < 0) {
-		flow_err(uflow, "Unrecoverable error on reply socket");
-		flow_err_details(uflow);
-		udp_flow_close(c, uflow);
-		return;
-	}
-
 	for (i = 0; i < UDP_MAX_FRAMES; i++) {
 		uint8_t topif = pif_at_sidx(tosidx);
 		ssize_t dlen;

From 5a977c2f4ee8926673554b2b456e7791962b2ce2 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:02 +1100
Subject: [PATCH 150/221] udp: Simplify checking of epoll event bits

udp_{listen,reply}_sock_handler() can accept both EPOLLERR and EPOLLIN
events.  However, unlike most epoll event handlers we don't check the
event bits right there.  EPOLLERR is checked within udp_sock_errs() which
we call unconditionally.  Checking EPOLLIN is still more buried: it is
checked within both udp_sock_recv() and udp_vu_sock_recv().

We can simplify the logic and pass less extraneous parameters around by
moving the checking of the event bits to the top level event handlers.

This makes udp_{buf,vu}_{listen,reply}_sock_handler() no longer general
event handlers, but specific to EPOLLIN events, meaning new data.  So,
rename those functions to udp_{buf,vu}_{listen,reply}_sock_data() to better
reflect their function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c    | 78 ++++++++++++++++++++++++--------------------------------
 udp_vu.c | 25 +++++++-----------
 udp_vu.h |  8 +++---
 3 files changed, 47 insertions(+), 64 deletions(-)

diff --git a/udp.c b/udp.c
index 4a06b16..26a91c9 100644
--- a/udp.c
+++ b/udp.c
@@ -581,12 +581,10 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
  * udp_sock_errs() - Process errors on a socket
  * @c:		Execution context
  * @ref:	epoll reference
- * @events:	epoll events bitmap
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-static int udp_sock_errs(const struct ctx *c, union epoll_ref ref,
-			 uint32_t events)
+static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
@@ -595,9 +593,6 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref,
 
 	ASSERT(!c->no_udp);
 
-	if (!(events & EPOLLERR))
-		return 0; /* Nothing to do */
-
 	/* Empty the error queue */
 	while ((rc = udp_sock_recverr(c, ref)) > 0)
 		n_err += rc;
@@ -630,15 +625,13 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref,
  * udp_sock_recv() - Receive datagrams from a socket
  * @c:		Execution context
  * @s:		Socket to receive from
- * @events:	epoll events bitmap
  * @mmh		mmsghdr array to receive into
  *
  * Return: Number of datagrams received
  *
  * #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
  */
-static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
-			 struct mmsghdr *mmh)
+static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh)
 {
 	/* For not entirely clear reasons (data locality?) pasta gets better
 	 * throughput if we receive tap datagrams one at a atime.  For small
@@ -651,9 +644,6 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 
 	ASSERT(!c->no_udp);
 
-	if (!(events & EPOLLIN))
-		return 0;
-
 	n = recvmmsg(s, mmh, n, 0, NULL);
 	if (n < 0) {
 		err_perror("Error receiving datagrams");
@@ -664,22 +654,20 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 }
 
 /**
- * udp_buf_listen_sock_handler() - Handle new data from socket
+ * udp_buf_listen_sock_data() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
- * @events:	epoll events bitmap
  * @now:	Current timestamp
  *
  * #syscalls recvmmsg
  */
-static void udp_buf_listen_sock_handler(const struct ctx *c,
-					union epoll_ref ref, uint32_t events,
-					const struct timespec *now)
+static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
+				     const struct timespec *now)
 {
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
 
-	if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
+	if ((n = udp_sock_recv(c, ref.fd, udp_mh_recv)) <= 0)
 		return;
 
 	/* We divide datagrams into batches based on how we need to send them,
@@ -744,33 +732,33 @@ void udp_listen_sock_handler(const struct ctx *c,
 			     union epoll_ref ref, uint32_t events,
 			     const struct timespec *now)
 {
-	if (udp_sock_errs(c, ref, events) < 0) {
-		err("UDP: Unrecoverable error on listening socket:"
-		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
-		/* FIXME: what now?  close/re-open socket? */
-		return;
+	if (events & EPOLLERR) {
+		if (udp_sock_errs(c, ref) < 0) {
+			err("UDP: Unrecoverable error on listening socket:"
+			    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
+			/* FIXME: what now?  close/re-open socket? */
+			return;
+		}
 	}
 
-	if (c->mode == MODE_VU) {
-		udp_vu_listen_sock_handler(c, ref, events, now);
-		return;
+	if (events & EPOLLIN) {
+		if (c->mode == MODE_VU)
+			udp_vu_listen_sock_data(c, ref, now);
+		else
+			udp_buf_listen_sock_data(c, ref, now);
 	}
-
-	udp_buf_listen_sock_handler(c, ref, events, now);
 }
 
 /**
- * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_buf_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
- * @events:	epoll events bitmap
  * @now:	Current timestamp
  *
  * #syscalls recvmmsg
  */
-static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-				       uint32_t events,
-				       const struct timespec *now)
+static void udp_buf_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+				    const struct timespec *now)
 {
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
@@ -780,7 +768,7 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	from_s = uflow->s[ref.flowside.sidei];
 
-	if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
+	if ((n = udp_sock_recv(c, from_s, udp_mh_recv)) <= 0)
 		return;
 
 	flow_trace(uflow, "Received %d datagrams on reply socket", n);
@@ -821,19 +809,21 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	ASSERT(!c->no_udp && uflow);
 
-	if (udp_sock_errs(c, ref, events) < 0) {
-		flow_err(uflow, "Unrecoverable error on reply socket");
-		flow_err_details(uflow);
-		udp_flow_close(c, uflow);
-		return;
+	if (events & EPOLLERR) {
+		if (udp_sock_errs(c, ref) < 0) {
+			flow_err(uflow, "Unrecoverable error on reply socket");
+			flow_err_details(uflow);
+			udp_flow_close(c, uflow);
+			return;
+		}
 	}
 
-	if (c->mode == MODE_VU) {
-		udp_vu_reply_sock_handler(c, ref, events, now);
-		return;
+	if (events & EPOLLIN) {
+		if (c->mode == MODE_VU)
+			udp_vu_reply_sock_data(c, ref, now);
+		else
+			udp_buf_reply_sock_data(c, ref, now);
 	}
-
-	udp_buf_reply_sock_handler(c, ref, events, now);
 }
 
 /**
diff --git a/udp_vu.c b/udp_vu.c
index 84f52af..698667f 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -78,14 +78,12 @@ static int udp_vu_sock_info(int s, union sockaddr_inany *s_in)
  * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
  * @c:		Execution context
  * @s:		Socket to receive from
- * @events:	epoll events bitmap
  * @v6:		Set for IPv6 connections
  * @dlen:	Size of received data (output)
  *
  * Return: Number of iov entries used to store the datagram
  */
-static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
-			    bool v6, ssize_t *dlen)
+static int udp_vu_sock_recv(const struct ctx *c, int s, bool v6, ssize_t *dlen)
 {
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@@ -95,9 +93,6 @@ static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
 
 	ASSERT(!c->no_udp);
 
-	if (!(events & EPOLLIN))
-		return 0;
-
 	/* compute L2 header length */
 	hdrlen = udp_vu_hdrlen(v6);
 
@@ -214,14 +209,13 @@ static void udp_vu_csum(const struct flowside *toside, int iov_used)
 }
 
 /**
- * udp_vu_listen_sock_handler() - Handle new data from socket
+ * udp_vu_listen_sock_data() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
- * @events:	epoll events bitmap
  * @now:	Current timestamp
  */
-void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-				uint32_t events, const struct timespec *now)
+void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
+			     const struct timespec *now)
 {
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@@ -262,7 +256,7 @@ void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
 
-		iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &dlen);
+		iov_used = udp_vu_sock_recv(c, ref.fd, v6, &dlen);
 		if (iov_used <= 0)
 			break;
 
@@ -277,14 +271,13 @@ void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_vu_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
- * @events:	epoll events bitmap
  * @now:	Current timestamp
  */
-void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			        uint32_t events, const struct timespec *now)
+void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+			    const struct timespec *now)
 {
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
@@ -313,7 +306,7 @@ void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
 
-		iov_used = udp_vu_sock_recv(c, from_s, events, v6, &dlen);
+		iov_used = udp_vu_sock_recv(c, from_s, v6, &dlen);
 		if (iov_used <= 0)
 			break;
 		flow_trace(uflow, "Received 1 datagram on reply socket");
diff --git a/udp_vu.h b/udp_vu.h
index ba7018d..4f2262d 100644
--- a/udp_vu.h
+++ b/udp_vu.h
@@ -6,8 +6,8 @@
 #ifndef UDP_VU_H
 #define UDP_VU_H
 
-void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-				uint32_t events, const struct timespec *now);
-void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			       uint32_t events, const struct timespec *now);
+void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
+			     const struct timespec *now);
+void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+			    const struct timespec *now);
 #endif /* UDP_VU_H */

From d924b7dfc40cfaf9ebc64fe052efd8b0c45c6478 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:03 +1100
Subject: [PATCH 151/221] udp_vu: Factor things out of udp_vu_reply_sock_data()
 loop

At the start of every cycle of the loop in udp_vu_reply_sock_data() we:
 - ASSERT that uflow is not NULL
 - Check if the target pif is PIF_TAP
 - Initialize the v6 boolean

However, all of these depend only on the flow, which doesn't change across
the loop.  This is probably a duplication from udp_vu_listen_sock_data(),
where the flow can be different for each packet.  For the reply socket
case, however, factor that logic out of the loop.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_vu.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/udp_vu.c b/udp_vu.c
index 698667f..6e1823a 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -281,30 +281,28 @@ void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
 {
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
+	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
 	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
 	int from_s = uflow->s[ref.flowside.sidei];
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	uint8_t topif = pif_at_sidx(tosidx);
 	int i;
 
+	ASSERT(uflow);
+
+	if (topif != PIF_TAP) {
+		uint8_t frompif = pif_at_sidx(ref.flowside);
+
+		flow_err(uflow,
+			 "No support for forwarding UDP from %s to %s",
+			 pif_name(frompif), pif_name(topif));
+		return;
+	}
+
 	for (i = 0; i < UDP_MAX_FRAMES; i++) {
-		uint8_t topif = pif_at_sidx(tosidx);
 		ssize_t dlen;
 		int iov_used;
-		bool v6;
-
-		ASSERT(uflow);
-
-		if (topif != PIF_TAP) {
-			uint8_t frompif = pif_at_sidx(ref.flowside);
-
-			flow_err(uflow,
-				 "No support for forwarding UDP from %s to %s",
-				 pif_name(frompif), pif_name(topif));
-			continue;
-		}
-
-		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
 
 		iov_used = udp_vu_sock_recv(c, from_s, v6, &dlen);
 		if (iov_used <= 0)

From 269cf6a12a5f89683daa8da9232cc2524d7a4ae2 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:04 +1100
Subject: [PATCH 152/221] udp: Share more logic between vu and non-vu reply
 socket paths

Share some additional miscellaneous logic between the vhost-user and "buf"
paths for data on udp reply sockets.  The biggest piece is error handling
of cases where we can't forward between the two pifs of the flow.  We also
make common some more simple logic locating the correct flow and its
parameters.

This adds some lines of code due to extra comment lines, but nonetheless
reduces logic duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c    | 41 ++++++++++++++++++++++++++---------------
 udp_vu.c | 26 +++++++++++---------------
 udp_vu.h |  3 ++-
 3 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/udp.c b/udp.c
index 26a91c9..f417cea 100644
--- a/udp.c
+++ b/udp.c
@@ -752,24 +752,25 @@ void udp_listen_sock_handler(const struct ctx *c,
 /**
  * udp_buf_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @s:		Socket to read data from
+ * @tosidx:	Flow & side to forward data from @s to
  * @now:	Current timestamp
  *
+ * Return: true on success, false if can't forward from socket to flow's pif
+ *
  * #syscalls recvmmsg
  */
-static void udp_buf_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+static bool udp_buf_reply_sock_data(const struct ctx *c,
+				    int s, flow_sidx_t tosidx,
 				    const struct timespec *now)
 {
-	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
-	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
+	struct udp_flow *uflow = udp_at_sidx(tosidx);
 	uint8_t topif = pif_at_sidx(tosidx);
-	int n, i, from_s;
+	int n, i;
 
-	from_s = uflow->s[ref.flowside.sidei];
-
-	if ((n = udp_sock_recv(c, from_s, udp_mh_recv)) <= 0)
-		return;
+	if ((n = udp_sock_recv(c, s, udp_mh_recv)) <= 0)
+		return true;
 
 	flow_trace(uflow, "Received %d datagrams on reply socket", n);
 	uflow->ts = now->tv_sec;
@@ -788,11 +789,10 @@ static void udp_buf_reply_sock_data(const struct ctx *c, union epoll_ref ref,
 	} else if (topif == PIF_TAP) {
 		tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
 	} else {
-		uint8_t frompif = pif_at_sidx(ref.flowside);
-
-		flow_err(uflow, "No support for forwarding UDP from %s to %s",
-			 pif_name(frompif), pif_name(topif));
+		return false;
 	}
+
+	return true;
 }
 
 /**
@@ -819,10 +819,21 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 	}
 
 	if (events & EPOLLIN) {
+		flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
+		int s = ref.fd;
+		bool ret;
+
 		if (c->mode == MODE_VU)
-			udp_vu_reply_sock_data(c, ref, now);
+			ret = udp_vu_reply_sock_data(c, s, tosidx, now);
 		else
-			udp_buf_reply_sock_data(c, ref, now);
+			ret = udp_buf_reply_sock_data(c, s, tosidx, now);
+
+		if (!ret) {
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(pif_at_sidx(ref.flowside)),
+				 pif_name(pif_at_sidx(tosidx)));
+		}
 	}
 }
 
diff --git a/udp_vu.c b/udp_vu.c
index 6e1823a..06bdeae 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -273,38 +273,32 @@ void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 /**
  * udp_vu_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @s:		Socket to read data from
+ * @tosidx:	Flow & side to forward data from @s to
  * @now:	Current timestamp
+ *
+ * Return: true on success, false if can't forward from socket to flow's pif
  */
-void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx,
 			    const struct timespec *now)
 {
-	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
-	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
-	int from_s = uflow->s[ref.flowside.sidei];
+	struct udp_flow *uflow = udp_at_sidx(tosidx);
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
-	uint8_t topif = pif_at_sidx(tosidx);
 	int i;
 
 	ASSERT(uflow);
 
-	if (topif != PIF_TAP) {
-		uint8_t frompif = pif_at_sidx(ref.flowside);
-
-		flow_err(uflow,
-			 "No support for forwarding UDP from %s to %s",
-			 pif_name(frompif), pif_name(topif));
-		return;
-	}
+	if (pif_at_sidx(tosidx) != PIF_TAP)
+		return false;
 
 	for (i = 0; i < UDP_MAX_FRAMES; i++) {
 		ssize_t dlen;
 		int iov_used;
 
-		iov_used = udp_vu_sock_recv(c, from_s, v6, &dlen);
+		iov_used = udp_vu_sock_recv(c, s, v6, &dlen);
 		if (iov_used <= 0)
 			break;
 		flow_trace(uflow, "Received 1 datagram on reply socket");
@@ -318,4 +312,6 @@ void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
 		}
 		vu_flush(vdev, vq, elem, iov_used);
 	}
+
+	return true;
 }
diff --git a/udp_vu.h b/udp_vu.h
index 4f2262d..2299b51 100644
--- a/udp_vu.h
+++ b/udp_vu.h
@@ -8,6 +8,7 @@
 
 void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 			     const struct timespec *now);
-void udp_vu_reply_sock_data(const struct ctx *c, union epoll_ref ref,
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx,
 			    const struct timespec *now);
+
 #endif /* UDP_VU_H */

From f67c488b81ca2a4d9f819b625fceab10b71fc3a5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:05 +1100
Subject: [PATCH 153/221] udp: Better handling of failure to forward from reply
 socket

In udp_reply_sock_handler() if we're unable to forward the datagrams we
just print an error.  Generally this means we have an unsupported pair of
pifs in the flow table, though, and that hasn't change.  So, next time we
get a matching packet we'll just get the same failure.  In vhost-user mode
we don't even dequeue the incoming packets which triggered this so we're
likely to get the same failure immediately.

Instead, close the flow, in the same we we do for an unrecoverable error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/udp.c b/udp.c
index f417cea..96e48dd 100644
--- a/udp.c
+++ b/udp.c
@@ -812,9 +812,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 	if (events & EPOLLERR) {
 		if (udp_sock_errs(c, ref) < 0) {
 			flow_err(uflow, "Unrecoverable error on reply socket");
-			flow_err_details(uflow);
-			udp_flow_close(c, uflow);
-			return;
+			goto fail;
 		}
 	}
 
@@ -829,12 +827,15 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 			ret = udp_buf_reply_sock_data(c, s, tosidx, now);
 
 		if (!ret) {
-			flow_err(uflow,
-				 "No support for forwarding UDP from %s to %s",
-				 pif_name(pif_at_sidx(ref.flowside)),
-				 pif_name(pif_at_sidx(tosidx)));
+			flow_err(uflow, "Unable to forward UDP");
+			goto fail;
 		}
 	}
+	return;
+
+fail:
+	flow_err_details(uflow);
+	udp_flow_close(c, uflow);
 }
 
 /**

From 37d78c9ef3944c1b060e3e8259b82fea3f8ec6bf Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:06 +1100
Subject: [PATCH 154/221] udp: Always hash socket facing flowsides

For UDP packets from the tap interface (like TCP) we use a hash table to
look up which flow they belong to.  Unlike TCP, we sometimes also create a
hash table entry for the socket side of UDP flows.  We need that when we
receive a UDP packet from a "listening" socket which isn't specific to a
single flow.

At present we only do this for the initiating side of flows, which re-use
the listening socket.  For the target side we use a connected "reply"
socket specific to the single flow.

We have in mind changes that maye introduce some edge cases were we could
receive UDP packets on a non flow specific socket more often.  To allow for
those changes - and slightly simplifying things in the meantime - always
put both sides of a UDP flow - tap or socket - in the hash table.  It's
not that costly, and means we always have the option of falling back to a
hash lookup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_flow.c | 41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index c6b8630..7e80924 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -41,25 +41,23 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
  */
 void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 {
+	unsigned sidei;
+
 	if (uflow->closed)
 		return; /* Nothing to do */
 
-	if (uflow->s[INISIDE] >= 0) {
-		/* The listening socket needs to stay in epoll */
-		close(uflow->s[INISIDE]);
-		uflow->s[INISIDE] = -1;
+	flow_foreach_sidei(sidei) {
+		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
+		if (uflow->s[sidei] >= 0) {
+			/* The listening socket needs to stay in epoll, but the
+			 * flow specific one needs to be removed */
+			if (sidei == TGTSIDE)
+				epoll_del(c, uflow->s[sidei]);
+			close(uflow->s[sidei]);
+			uflow->s[sidei] = -1;
+		}
 	}
 
-	if (uflow->s[TGTSIDE] >= 0) {
-		/* But the flow specific one needs to be removed */
-		epoll_del(c, uflow->s[TGTSIDE]);
-		close(uflow->s[TGTSIDE]);
-		uflow->s[TGTSIDE] = -1;
-	}
-	flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
-	if (!pif_is_socket(uflow->f.pif[TGTSIDE]))
-		flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE));
-
 	uflow->closed = true;
 }
 
@@ -77,6 +75,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 {
 	struct udp_flow *uflow = NULL;
 	const struct flowside *tgt;
+	unsigned sidei;
 	uint8_t tgtpif;
 
 	if (!(tgt = flow_target(c, flow, IPPROTO_UDP)))
@@ -143,14 +142,14 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 		}
 	}
 
-	flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE));
-
-	/* If the target side is a socket, it will be a reply socket that knows
-	 * its own flowside.  But if it's tap, then we need to look it up by
-	 * hash.
+	/* Tap sides always need to be looked up by hash.  Socket sides don't
+	 * always, but sometimes do (receiving packets on a socket not specific
+	 * to one flow).  Unconditionally hash both sides so all our bases are
+	 * covered
 	 */
-	if (!pif_is_socket(tgtpif))
-		flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE));
+	flow_foreach_sidei(sidei)
+		flow_hash_insert(c, FLOW_SIDX(uflow, sidei));
+
 	FLOW_ACTIVATE(uflow);
 
 	return FLOW_SIDX(uflow, TGTSIDE);

From 77883fbdd17e836247f746d888dcad3f611a6a59 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 26 Mar 2025 14:44:07 +1100
Subject: [PATCH 155/221] udp: Add helper function for creating connected UDP
 socket

Currently udp_flow_new() open codes creating and connecting a socket to use
for reply messages.  We have in mind some more places to use this logic,
plus it just makes for a rather large function.  Split this handling out
into a new udp_flow_sock() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_flow.c | 104 +++++++++++++++++++++++++++++------------------------
 1 file changed, 58 insertions(+), 46 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index 7e80924..bf4b896 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -61,6 +61,61 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	uflow->closed = true;
 }
 
+/**
+ * udp_flow_sock() - Create, bind and connect a flow specific UDP socket
+ * @c:		Execution context
+ * @uflow:	UDP flow to open socket for
+ * @sidei:	Side of @uflow to open socket for
+ *
+ * Return: fd of new socket on success, -ve error code on failure
+ */
+static int udp_flow_sock(const struct ctx *c,
+			 const struct udp_flow *uflow, unsigned sidei)
+{
+	const struct flowside *side = &uflow->f.side[sidei];
+	struct mmsghdr discard[UIO_MAXIOV] = { 0 };
+	uint8_t pif = uflow->f.pif[sidei];
+	union {
+		flow_sidx_t sidx;
+		uint32_t data;
+	} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
+	int rc, s;
+
+	s = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, pif, side, fref.data);
+	if (s < 0) {
+		flow_dbg_perror(uflow, "Couldn't open flow specific socket");
+		return s;
+	}
+
+	if (flowside_connect(c, s, pif, side) < 0) {
+		rc = -errno;
+		flow_dbg_perror(uflow, "Couldn't connect flow socket");
+		return rc;
+	}
+
+	/* It's possible, if unlikely, that we could receive some unrelated
+	 * packets in between the bind() and connect() of this socket.  For now
+	 * we just discard these.
+	 *
+	 * FIXME: Redirect these to an appropriate handler
+	 */
+	rc = recvmmsg(s, discard, ARRAY_SIZE(discard), MSG_DONTWAIT, NULL);
+	if (rc >= ARRAY_SIZE(discard)) {
+		flow_dbg(uflow, "Too many (%d) spurious reply datagrams", rc);
+		return -E2BIG;
+	}
+
+	if (rc > 0) {
+		flow_trace(uflow, "Discarded %d spurious reply datagrams", rc);
+	} else if (errno != EAGAIN) {
+		rc = -errno;
+		flow_perror(uflow, "Unexpected error discarding datagrams");
+		return rc;
+	}
+
+	return s;
+}
+
 /**
  * udp_flow_new() - Common setup for a new UDP flow
  * @c:		Execution context
@@ -74,13 +129,10 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 				int s_ini, const struct timespec *now)
 {
 	struct udp_flow *uflow = NULL;
-	const struct flowside *tgt;
 	unsigned sidei;
-	uint8_t tgtpif;
 
-	if (!(tgt = flow_target(c, flow, IPPROTO_UDP)))
+	if (!flow_target(c, flow, IPPROTO_UDP))
 		goto cancel;
-	tgtpif = flow->f.pif[TGTSIDE];
 
 	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
 	uflow->ts = now->tv_sec;
@@ -98,49 +150,9 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 		}
 	}
 
-	if (pif_is_socket(tgtpif)) {
-		struct mmsghdr discard[UIO_MAXIOV] = { 0 };
-		union {
-			flow_sidx_t sidx;
-			uint32_t data;
-		} fref = {
-			.sidx = FLOW_SIDX(flow, TGTSIDE),
-		};
-		int rc;
-
-		uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY,
-						     tgtpif, tgt, fref.data);
-		if (uflow->s[TGTSIDE] < 0) {
-			flow_dbg_perror(uflow,
-					"Couldn't open socket for spliced flow");
+	if (pif_is_socket(flow->f.pif[TGTSIDE]))
+		if ((uflow->s[TGTSIDE] = udp_flow_sock(c, uflow, TGTSIDE)) < 0)
 			goto cancel;
-		}
-
-		if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) {
-			flow_dbg_perror(uflow, "Couldn't connect flow socket");
-			goto cancel;
-		}
-
-		/* It's possible, if unlikely, that we could receive some
-		 * unrelated packets in between the bind() and connect() of this
-		 * socket.  For now we just discard these.  We could consider
-		 * trying to redirect these to an appropriate handler, if we
-		 * need to.
-		 */
-		rc = recvmmsg(uflow->s[TGTSIDE], discard, ARRAY_SIZE(discard),
-			      MSG_DONTWAIT, NULL);
-		if (rc >= ARRAY_SIZE(discard)) {
-			flow_dbg(uflow,
-				 "Too many (%d) spurious reply datagrams", rc);
-			goto cancel;
-		} else if (rc > 0) {
-			flow_trace(uflow,
-				   "Discarded %d spurious reply datagrams", rc);
-		} else if (errno != EAGAIN) {
-			flow_perror(uflow,
-				    "Unexpected error discarding datagrams");
-		}
-	}
 
 	/* Tap sides always need to be looked up by hash.  Socket sides don't
 	 * always, but sometimes do (receiving packets on a socket not specific

From 664c588be752bf590adb55bf1f613d4a36f02e7c Mon Sep 17 00:00:00 2001
From: Julian Wundrak <julian@wundrak.net>
Date: Wed, 26 Mar 2025 20:14:31 +0000
Subject: [PATCH 156/221] build: normalize arm targets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Linux distributions use different dumpmachine outputs for the ARM
architecture. arm, armv6l, armv7l.
For the syscall annotation, these variants are standardized to “arm”.

Link: https://bugs.passt.top/show_bug.cgi?id=117
Signed-off-by: Julian Wundrak <julian@wundrak.net>
[sbrivio: Fix typo: assign from TARGET_ARCH, not from TARGET]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 31cbac3..3328f83 100644
--- a/Makefile
+++ b/Makefile
@@ -20,6 +20,7 @@ $(if $(TARGET),,$(error Failed to get target architecture))
 # Get 'uname -m'-like architecture description for target
 TARGET_ARCH := $(firstword $(subst -, ,$(TARGET)))
 TARGET_ARCH := $(patsubst [:upper:],[:lower:],$(TARGET_ARCH))
+TARGET_ARCH := $(patsubst arm%,arm,$(TARGET_ARCH))
 TARGET_ARCH := $(subst powerpc,ppc,$(TARGET_ARCH))
 
 # On some systems enabling optimization also enables source fortification,

From 65cca54be84ffc5d2e18fcb8229dcc9d1f229479 Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Wed, 26 Mar 2025 11:59:02 -0400
Subject: [PATCH 157/221] udp: correct source address for ICMP messages

While developing traceroute forwarding tap-to-sock we found that
struct msghdr.msg_name for the ICMPs in the opposite direction always
contains the destination address of the original UDP message, and not,
as one might expect, the one of the host which created the error message.

Study of the kernel code reveals that this address instead is appended
as extra data after the received struct sock_extended_err area.

We now change the ICMP receive code accordingly.

Fixes: 55431f0077b6 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e07d ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/udp.c b/udp.c
index 96e48dd..0c223b4 100644
--- a/udp.c
+++ b/udp.c
@@ -510,10 +510,13 @@ static void udp_send_conn_fail_icmp6(const struct ctx *c,
  */
 static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 {
-	const struct sock_extended_err *ee;
+	struct errhdr {
+		struct sock_extended_err ee;
+		union sockaddr_inany saddr;
+	};
+	const struct errhdr *eh;
 	const struct cmsghdr *hdr;
-	union sockaddr_inany saddr;
-	char buf[CMSG_SPACE(sizeof(*ee))];
+	char buf[CMSG_SPACE(sizeof(struct errhdr))];
 	char data[ICMP6_MAX_DLEN];
 	int s = ref.fd;
 	struct iovec iov = {
@@ -521,8 +524,6 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 		.iov_len = sizeof(data)
 	};
 	struct msghdr mh = {
-		.msg_name = &saddr,
-		.msg_namelen = sizeof(saddr),
 		.msg_iov = &iov,
 		.msg_iovlen = 1,
 		.msg_control = buf,
@@ -553,7 +554,7 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 		return -1;
 	}
 
-	ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
+	eh = (const struct errhdr *)CMSG_DATA(hdr);
 	if (ref.type == EPOLL_TYPE_UDP_REPLY) {
 		flow_sidx_t sidx = flow_sidx_opposite(ref.flowside);
 		const struct flowside *toside = flowside_at_sidx(sidx);
@@ -561,18 +562,19 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 
 		if (hdr->cmsg_level == IPPROTO_IP) {
 			dlen = MIN(dlen, ICMP4_MAX_DLEN);
-			udp_send_conn_fail_icmp4(c, ee, toside, saddr.sa4.sin_addr,
+			udp_send_conn_fail_icmp4(c, &eh->ee, toside,
+						 eh->saddr.sa4.sin_addr,
 						 data, dlen);
 		} else if (hdr->cmsg_level == IPPROTO_IPV6) {
-			udp_send_conn_fail_icmp6(c, ee, toside,
-						 &saddr.sa6.sin6_addr,
+			udp_send_conn_fail_icmp6(c, &eh->ee, toside,
+						 &eh->saddr.sa6.sin6_addr,
 						 data, dlen, sidx.flowi);
 		}
 	} else {
 		trace("Ignoring received IP_RECVERR cmsg on listener socket");
 	}
 	debug("%s error on UDP socket %i: %s",
-	      str_ee_origin(ee), s, strerror_(ee->ee_errno));
+	      str_ee_origin(&eh->ee), s, strerror_(eh->ee.ee_errno));
 
 	return 1;
 }

From 42a854a52b6fa2bbd70cbc0c7657c8a49a9c3d2d Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 28 Mar 2025 11:39:58 +1100
Subject: [PATCH 158/221] pasta, passt-repair: Support multiple events per
 read() in inotify handlers

The current code assumes that we'll get one event per read() on
inotify descriptors, but that's not the case, not from documentation,
and not from reports.

Add loops in the two inotify handlers we have, in pasta-specific code
and passt-repair, to go through all the events we receive.

Link: https://bugs.passt.top/show_bug.cgi?id=119
[dwg: Remove unnecessary buffer expansion, use strnlen instead of strlen
 to make Coverity happier]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Add additional check on ev->name and ev->len in passt-repair]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 32 +++++++++++++++++++++++++-------
 pasta.c        | 20 +++++++++++++-------
 2 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/passt-repair.c b/passt-repair.c
index 120f7aa..86f0293 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -111,14 +111,14 @@ int main(int argc, char **argv)
 	}
 
 	if ((sb.st_mode & S_IFMT) == S_IFDIR) {
-		char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
+		char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
+		   __attribute__ ((aligned(__alignof__(struct inotify_event))));
 		const struct inotify_event *ev;
 		char path[PATH_MAX + 1];
+		bool found = false;
 		ssize_t n;
 		int fd;
 
-		ev = (struct inotify_event *)buf;
-
 		if ((fd = inotify_init1(IN_CLOEXEC)) < 0) {
 			fprintf(stderr, "inotify_init1: %i\n", errno);
 			_exit(1);
@@ -130,6 +130,8 @@ int main(int argc, char **argv)
 		}
 
 		do {
+			char *p;
+
 			n = read(fd, buf, sizeof(buf));
 			if (n < 0) {
 				fprintf(stderr, "inotify read: %i", errno);
@@ -138,11 +140,27 @@ int main(int argc, char **argv)
 
 			if (n < (ssize_t)sizeof(*ev)) {
 				fprintf(stderr, "Short inotify read: %zi", n);
-				_exit(1);
+				continue;
 			}
-		} while (ev->len < REPAIR_EXT_LEN ||
-			 memcmp(ev->name + strlen(ev->name) - REPAIR_EXT_LEN,
-				REPAIR_EXT, REPAIR_EXT_LEN));
+
+			for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
+				ev = (const struct inotify_event *)p;
+
+				if (ev->len >= REPAIR_EXT_LEN &&
+				    !memcmp(ev->name +
+					    strnlen(ev->name, ev->len) -
+					    REPAIR_EXT_LEN,
+					    REPAIR_EXT, REPAIR_EXT_LEN)) {
+					found = true;
+					break;
+				}
+			}
+		} while (!found);
+
+		if (ev->len > NAME_MAX + 1 || ev->name[ev->len] != '\0') {
+			fprintf(stderr, "Invalid filename from inotify\n");
+			_exit(1);
+		}
 
 		snprintf(path, sizeof(path), "%s/%s", argv[1], ev->name);
 		if ((stat(path, &sb))) {
diff --git a/pasta.c b/pasta.c
index fa3e7de..017fa32 100644
--- a/pasta.c
+++ b/pasta.c
@@ -498,17 +498,23 @@ void pasta_netns_quit_init(const struct ctx *c)
  */
 void pasta_netns_quit_inotify_handler(struct ctx *c, int inotify_fd)
 {
-	char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
-	const struct inotify_event *in_ev = (struct inotify_event *)buf;
+	char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
+		__attribute__ ((aligned(__alignof__(struct inotify_event))));
+	const struct inotify_event *ev;
+	ssize_t n;
+	char *p;
 
-	if (read(inotify_fd, buf, sizeof(buf)) < (ssize_t)sizeof(*in_ev))
+	if ((n = read(inotify_fd, buf, sizeof(buf))) < (ssize_t)sizeof(*ev))
 		return;
 
-	if (strncmp(in_ev->name, c->netns_base, sizeof(c->netns_base)))
-		return;
+	for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
+		ev = (const struct inotify_event *)p;
 
-	info("Namespace %s is gone, exiting", c->netns_base);
-	_exit(EXIT_SUCCESS);
+		if (!strncmp(ev->name, c->netns_base, sizeof(c->netns_base))) {
+			info("Namespace %s is gone, exiting", c->netns_base);
+			_exit(EXIT_SUCCESS);
+		}
+	}
 }
 
 /**

From 025a3c2686b06be3fd09e29b2e3408d2c4ad6239 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 28 Mar 2025 14:34:14 +1100
Subject: [PATCH 159/221] udp: Don't attempt to forward ICMP socket errors to
 other sockets

Recently we added support for detecting ICMP triggered errors on UDP
sockets and forwarding them to the tap interface.  However, in
udp_sock_recverr() where this is handled we don't know for certain that
the tap interface is the other side of the UDP flow.  It could be a spliced
connection with another socket on the other side.

To forward errors in that case, we'd need to force the other side's socket
to trigger issue an ICMP error.  I'm not sure if there's a way to do that;
probably not for an arbitrary ICMP but it might be possible for certain
error conditions.

Nonetheless what we do now - synthesise an ICMP on the tap interface - is
certainly wrong.  It's probably harmless; for a spliced connection it will
have loopback addresses meaning we can expect the guest to discard it.
But, correct this for now, by not attempting to propagate errors when the
other side of the flow is a socket.

Fixes: 55431f0077b6 ("udp: create and send ICMPv4 to local peer when applicable")
Fixes: 68b04182e07d ("udp: create and send ICMPv6 to local peer when applicable")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/udp.c b/udp.c
index 0c223b4..e410f55 100644
--- a/udp.c
+++ b/udp.c
@@ -560,7 +560,10 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 		const struct flowside *toside = flowside_at_sidx(sidx);
 		size_t dlen = rc;
 
-		if (hdr->cmsg_level == IPPROTO_IP) {
+		if (pif_is_socket(pif_at_sidx(sidx))) {
+			/* XXX Is there any way to propagate ICMPs from socket
+			 * to socket? */
+		} else if (hdr->cmsg_level == IPPROTO_IP) {
 			dlen = MIN(dlen, ICMP4_MAX_DLEN);
 			udp_send_conn_fail_icmp4(c, &eh->ee, toside,
 						 eh->saddr.sa4.sin_addr,

From 3de5af6e4145c6971be2597d7fb0386332d44a45 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 28 Mar 2025 14:34:15 +1100
Subject: [PATCH 160/221] udp: Improve name of UDP related ICMP sending
 functions

udp_send_conn_fail_icmp[46]() aren't actually specific to connections
failing: they can propagate a variety of ICMP errors, which might or might
not break a "connection".  They are, however, specific to sending ICMP
errors to the tap connection, not splice or host.  Rename them to better
reflect that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/udp.c b/udp.c
index e410f55..39431d7 100644
--- a/udp.c
+++ b/udp.c
@@ -411,7 +411,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
 }
 
 /**
- * udp_send_conn_fail_icmp4() - Construct and send ICMPv4 to local peer
+ * udp_send_tap_icmp4() - Construct and send ICMPv4 to local peer
  * @c:		Execution context
  * @ee:	Extended error descriptor
  * @toside:	Destination side of flow
@@ -419,11 +419,11 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
  * @in:	First bytes (max 8) of original UDP message body
  * @dlen:	Length of the read part of original UDP message body
  */
-static void udp_send_conn_fail_icmp4(const struct ctx *c,
-				     const struct sock_extended_err *ee,
-				     const struct flowside *toside,
-				     struct in_addr saddr,
-				     const void *in, size_t dlen)
+static void udp_send_tap_icmp4(const struct ctx *c,
+			       const struct sock_extended_err *ee,
+			       const struct flowside *toside,
+			       struct in_addr saddr,
+			       const void *in, size_t dlen)
 {
 	struct in_addr oaddr = toside->oaddr.v4mapped.a4;
 	struct in_addr eaddr = toside->eaddr.v4mapped.a4;
@@ -455,7 +455,7 @@ static void udp_send_conn_fail_icmp4(const struct ctx *c,
 
 
 /**
- * udp_send_conn_fail_icmp6() - Construct and send ICMPv6 to local peer
+ * udp_send_tap_icmp6() - Construct and send ICMPv6 to local peer
  * @c:		Execution context
  * @ee:	Extended error descriptor
  * @toside:	Destination side of flow
@@ -464,11 +464,11 @@ static void udp_send_conn_fail_icmp4(const struct ctx *c,
  * @dlen:	Length of the read part of original UDP message body
  * @flow:	IPv6 flow identifier
  */
-static void udp_send_conn_fail_icmp6(const struct ctx *c,
-				     const struct sock_extended_err *ee,
-				     const struct flowside *toside,
-				     const struct in6_addr *saddr,
-				     void *in, size_t dlen, uint32_t flow)
+static void udp_send_tap_icmp6(const struct ctx *c,
+			       const struct sock_extended_err *ee,
+			       const struct flowside *toside,
+			       const struct in6_addr *saddr,
+			       void *in, size_t dlen, uint32_t flow)
 {
 	const struct in6_addr *oaddr = &toside->oaddr.a6;
 	const struct in6_addr *eaddr = &toside->eaddr.a6;
@@ -565,13 +565,12 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 			 * to socket? */
 		} else if (hdr->cmsg_level == IPPROTO_IP) {
 			dlen = MIN(dlen, ICMP4_MAX_DLEN);
-			udp_send_conn_fail_icmp4(c, &eh->ee, toside,
-						 eh->saddr.sa4.sin_addr,
-						 data, dlen);
+			udp_send_tap_icmp4(c, &eh->ee, toside,
+					   eh->saddr.sa4.sin_addr, data, dlen);
 		} else if (hdr->cmsg_level == IPPROTO_IPV6) {
-			udp_send_conn_fail_icmp6(c, &eh->ee, toside,
-						 &eh->saddr.sa6.sin6_addr,
-						 data, dlen, sidx.flowi);
+			udp_send_tap_icmp6(c, &eh->ee, toside,
+					   &eh->saddr.sa6.sin6_addr, data,
+					   dlen, sidx.flowi);
 		}
 	} else {
 		trace("Ignoring received IP_RECVERR cmsg on listener socket");

From 2ed2d59def758b049f42e7c75bfb48957a73bd39 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 2 Apr 2025 14:13:16 +1100
Subject: [PATCH 161/221] platform requirements: Fix clang-tidy warning

Recent clang-tidy versions complain about enums defined with some but not
all entries given explicit values.  I'm not entirely convinced about
whether that's a useful warning, but in any case we really don't need the
explicit values in doc/platform-requirements/reuseaddr-priority.c, so
remove them to make clang happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 doc/platform-requirements/reuseaddr-priority.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/platform-requirements/reuseaddr-priority.c b/doc/platform-requirements/reuseaddr-priority.c
index 701b6ff..af39a39 100644
--- a/doc/platform-requirements/reuseaddr-priority.c
+++ b/doc/platform-requirements/reuseaddr-priority.c
@@ -46,13 +46,13 @@
 /* Different cases for receiving socket configuration */
 enum sock_type {
 	/* Socket is bound to 0.0.0.0:DSTPORT and not connected */
-	SOCK_BOUND_ANY = 0,
+	SOCK_BOUND_ANY,
 
 	/* Socket is bound to 127.0.0.1:DSTPORT and not connected */
-	SOCK_BOUND_LO = 1,
+	SOCK_BOUND_LO,
 
 	/* Socket is bound to 0.0.0.0:DSTPORT and connected to 127.0.0.1:SRCPORT */
-	SOCK_CONNECTED = 2,
+	SOCK_CONNECTED,
 
 	NUM_SOCK_TYPES,
 };

From 8e32881ef1d6d5867223a164052f8ff39d4ebb4e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 2 Apr 2025 14:13:17 +1100
Subject: [PATCH 162/221] platform requirements: Add attributes to die()
 function

Add both format string and ((noreturn)) attributes to the version of die()
used in the test programs in doc/platform-requirements.  As well as
potentially catching problems in format strings, this means that the
compiler and static checkers can properly reason about the fact that it
will exit, preventing bogus warnings.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 doc/platform-requirements/common.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/platform-requirements/common.h b/doc/platform-requirements/common.h
index 8844b1e..e85fc2b 100644
--- a/doc/platform-requirements/common.h
+++ b/doc/platform-requirements/common.h
@@ -15,6 +15,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 
+__attribute__((format(printf, 1, 2), noreturn))
 static inline void die(const char *fmt, ...)
 {
 	va_list ap;

From 6bfc60b09522bd6f47660b835f0681977a28e1de Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 2 Apr 2025 14:13:18 +1100
Subject: [PATCH 163/221] platform requirements: Add test for address conflicts
 with TCP_REPAIR

Simple test program to check the behaviour we need for bind() address
conflicts between listening sockets and repair mode sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 doc/platform-requirements/.gitignore         |   1 +
 doc/platform-requirements/Makefile           |   4 +-
 doc/platform-requirements/listen-vs-repair.c | 128 +++++++++++++++++++
 3 files changed, 131 insertions(+), 2 deletions(-)
 create mode 100644 doc/platform-requirements/listen-vs-repair.c

diff --git a/doc/platform-requirements/.gitignore b/doc/platform-requirements/.gitignore
index 3b5a10a..f6272cf 100644
--- a/doc/platform-requirements/.gitignore
+++ b/doc/platform-requirements/.gitignore
@@ -1,3 +1,4 @@
+/listen-vs-repair
 /reuseaddr-priority
 /recv-zero
 /udp-close-dup
diff --git a/doc/platform-requirements/Makefile b/doc/platform-requirements/Makefile
index 6a7d374..83930ef 100644
--- a/doc/platform-requirements/Makefile
+++ b/doc/platform-requirements/Makefile
@@ -3,8 +3,8 @@
 # Copyright Red Hat
 # Author: David Gibson <david@gibson.dropbear.id.au>
 
-TARGETS = reuseaddr-priority recv-zero udp-close-dup
-SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c
+TARGETS = reuseaddr-priority recv-zero udp-close-dup listen-vs-repair
+SRCS = reuseaddr-priority.c recv-zero.c udp-close-dup.c listen-vs-repair.c
 CFLAGS = -Wall
 
 all: cppcheck clang-tidy $(TARGETS:%=check-%)
diff --git a/doc/platform-requirements/listen-vs-repair.c b/doc/platform-requirements/listen-vs-repair.c
new file mode 100644
index 0000000..d31fe3f
--- /dev/null
+++ b/doc/platform-requirements/listen-vs-repair.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* liste-vs-repair.c
+ *
+ * Do listening sockets have address conflicts with sockets under repair
+ * ====================================================================
+ *
+ * When we accept() an incoming connection the accept()ed socket will have the
+ * same local address as the listening socket.  This can be a complication on
+ * migration.  On the migration target we've already set up listening sockets
+ * according to the command line.  However to restore connections that we're
+ * migrating in we need to bind the new sockets to the same address, which would
+ * be an address conflict on the face of it.  This test program verifies that
+ * enabling repair mode before bind() correctly suppresses that conflict.
+ *
+ * Copyright Red Hat
+ * Author: David Gibson <david@gibson.dropbear.id.au>
+ */
+
+/* NOLINTNEXTLINE(bugprone-reserved-identifier,cert-dcl37-c,cert-dcl51-cpp) */
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "common.h"
+
+#define PORT	13256U
+#define CPORT	13257U
+
+/* 127.0.0.1:PORT */
+static const struct sockaddr_in addr = SOCKADDR_INIT(INADDR_LOOPBACK, PORT);
+
+/* 127.0.0.1:CPORT */
+static const struct sockaddr_in caddr = SOCKADDR_INIT(INADDR_LOOPBACK, CPORT);
+
+/* Put ourselves into a network sandbox */
+static void net_sandbox(void)
+{
+	/* NOLINTNEXTLINE(altera-struct-pack-align) */
+	const struct req_t {
+		struct nlmsghdr nlh;
+		struct ifinfomsg ifm;
+	} __attribute__((packed)) req = {
+		.nlh.nlmsg_type		= RTM_NEWLINK,
+		.nlh.nlmsg_flags	= NLM_F_REQUEST,
+		.nlh.nlmsg_len		= sizeof(req),
+		.nlh.nlmsg_seq		= 1,
+		.ifm.ifi_family		= AF_UNSPEC,
+                .ifm.ifi_index		= 1,
+                .ifm.ifi_flags		= IFF_UP,
+                .ifm.ifi_change		= IFF_UP,
+	};
+	int nl;
+
+	if (unshare(CLONE_NEWUSER | CLONE_NEWNET))
+		die("unshare(): %s\n", strerror(errno));
+
+	/* Bring up lo in the new netns */
+	nl = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
+	if (nl < 0)
+		die("Can't create netlink socket: %s\n", strerror(errno));
+
+	if (send(nl, &req, sizeof(req), 0) < 0)
+		die("Netlink send(): %s\n", strerror(errno));
+	close(nl);
+}
+
+static void check(void)
+{
+	int s1, s2, op;
+
+	s1 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	if (s1 < 0)
+		die("socket() 1: %s\n", strerror(errno));
+
+	if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)))
+		die("bind() 1: %s\n", strerror(errno));
+
+	if (listen(s1, 0))
+		die("listen(): %s\n", strerror(errno));
+
+	s2 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	if (s2 < 0)
+		die("socket() 2: %s\n", strerror(errno));
+
+	op = TCP_REPAIR_ON;
+	if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
+		die("TCP_REPAIR: %s\n", strerror(errno));
+
+	if (bind(s2, (struct sockaddr *)&addr, sizeof(addr)))
+		die("bind() 2: %s\n", strerror(errno));
+
+	if (connect(s2, (struct sockaddr *)&caddr, sizeof(caddr)))
+		die("connect(): %s\n", strerror(errno));
+
+	op = TCP_REPAIR_OFF_NO_WP;
+	if (setsockopt(s2, SOL_TCP, TCP_REPAIR, &op, sizeof(op)))
+		die("TCP_REPAIR: %s\n", strerror(errno));
+
+	close(s1);
+	close(s2);
+}
+
+int main(int argc, char *argv[])
+{
+	(void)argc;
+	(void)argv;
+
+	net_sandbox();
+
+	check();
+
+	printf("Repair mode appears to properly suppress conflicts with listening sockets\n");
+
+	exit(0);
+}

From dec3d73e1e8e007d05f9dce9a48aca7cb8532992 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 2 Apr 2025 14:13:19 +1100
Subject: [PATCH 164/221] migrate, tcp: bind() migrated sockets in repair mode

Currently on a migration target, we create then immediately bind() new
sockets for the TCP connections we're reconstructing.  Mostly, this works,
since a socket() that is bound but hasn't had listen() or connect() called
is essentially passive.  However, this bind() is subject to the usual
address conflict checking.  In particular that means if we already have
a listening socket on that port, we'll get an EADDRINUSE.  This will happen
for every connection we try to migrate that was initiated from outside to
the guest, since we necessarily created a listening socket for that case.

We set SO_REUSEADDR on the socket in an attempt to avoid this, but that's
not sufficient; even with SO_REUSEADDR address conflicts are still
prohibited for listening sockets.  Of course once these incoming sockets
are fully repaired and connect()ed they'll no longer conflict, but that
doesn't help us if we fail at the bind().

We can avoid this by not calling bind() until we're already in repair mode
which suppresses this transient conflict.  Because of the batching of
setting repair mode, to do that we need to move the bind to a step in
tcp_flow_migrate_target_ext().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/tcp.c b/tcp.c
index fa1d885..35626c9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3414,13 +3414,8 @@ fail:
 static int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6;
-	const struct flowside *sockside = HOSTFLOW(conn);
-	union sockaddr_inany a;
-	socklen_t sl;
 	int s, rc;
 
-	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
-
 	if ((conn->sock = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
 				 IPPROTO_TCP)) < 0) {
 		rc = -errno;
@@ -3435,12 +3430,6 @@ static int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
 
 	tcp_sock_set_nodelay(s);
 
-	if (bind(s, &a.sa, sizeof(a))) {
-		rc = -errno;
-		flow_perror(conn, "Failed to bind socket for migrated flow");
-		goto err;
-	}
-
 	if ((rc = tcp_flow_repair_on(c, conn)))
 		goto err;
 
@@ -3452,6 +3441,30 @@ err:
 	return rc;
 }
 
+/**
+ * tcp_flow_repair_bind() - Bind socket in repair mode
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_bind(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+	const struct flowside *sockside = HOSTFLOW(conn);
+	union sockaddr_inany a;
+	socklen_t sl;
+
+	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
+
+	if (bind(conn->sock, &a.sa, sizeof(a))) {
+		int rc = -errno;
+		flow_perror(conn, "Failed to bind socket for migrated flow");
+		return rc;
+	}
+
+	return 0;
+}
+
 /**
  * tcp_flow_repair_connect() - Connect socket in repair mode, then turn it off
  * @c:		Execution context
@@ -3618,6 +3631,9 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		/* We weren't able to create the socket, discard flow */
 		goto fail;
 
+	if (tcp_flow_repair_bind(c, conn))
+		goto fail;
+
 	if (tcp_flow_repair_timestamp(conn, &t))
 		goto fail;
 

From 3d41e4d8389578e5d5f3cf2e47b9ff9cdd29ffd1 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 2 Apr 2025 15:43:40 +1100
Subject: [PATCH 165/221] passt-repair: Correct off-by-one error verifying name

passt-repair will generate an error if the name it gets from the kernel is
too long or not NUL terminated.  Downstream testing has reported
occasionally seeing this error in practice.

In turns out there is a trivial off-by-one error in the check: ev->len is
the length of the name, including terminating \0 characters, so to check
for a \0 at the end of the buffer we need to check ev->name[len - 1] not
ev->name[len].

Fixes: 42a854a52b6f ("pasta, passt-repair: Support multiple events per read() in inotify handlers")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 86f0293..440c77a 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -157,7 +157,7 @@ int main(int argc, char **argv)
 			}
 		} while (!found);
 
-		if (ev->len > NAME_MAX + 1 || ev->name[ev->len] != '\0') {
+		if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
 			fprintf(stderr, "Invalid filename from inotify\n");
 			_exit(1);
 		}

From 8aa2d90c8d95d0fa1dad7027fdf92b48a1bbf3c6 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 1 Apr 2025 19:57:08 +1100
Subject: [PATCH 166/221] udp: Remove redundant udp_at_sidx() call in
 udp_tap_handler()

We've already have a pointer to the UDP flow in variable uflow, we can just
re-use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/udp.c b/udp.c
index 39431d7..ac168db 100644
--- a/udp.c
+++ b/udp.c
@@ -907,7 +907,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 	}
 	toside = flowside_at_sidx(tosidx);
 
-	s = udp_at_sidx(tosidx)->s[tosidx.sidei];
+	s = uflow->s[tosidx.sidei];
 	ASSERT(s >= 0);
 
 	pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport);

From 76e554d9ec8dc80c1856621e17e45be811d198d0 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 1 Apr 2025 19:57:09 +1100
Subject: [PATCH 167/221] udp: Simplify updates to UDP flow timestamp

Since UDP has no built in knowledge of connections, the only way we
know when we're done with a UDP flow is a timeout with no activity.
To keep track of this struct udp_flow includes a timestamp to record
the last time we saw traffic on the flow.

For data from listening sockets and from tap, this is done implicitly via
udp_flow_from_{sock,tap}() but for reply sockets it's done explicitly.
However, that logic is duplicated between the vhost-user and "buf" paths.
Make it common in udp_reply_sock_handler() instead.

Technically this is a behavioural change: previously if we got an EPOLLIN
event, but there wasn't actually any data we wouldn't update the timestamp,
now we will.  This should be harmless: if there's an EPOLLIN we expect
there to be data, and even if there isn't the worst we can do is mildly
delay the cleanup of a stale flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c    | 15 ++++++---------
 udp_vu.c |  9 +--------
 udp_vu.h |  3 +--
 3 files changed, 8 insertions(+), 19 deletions(-)

diff --git a/udp.c b/udp.c
index ac168db..44b58d1 100644
--- a/udp.c
+++ b/udp.c
@@ -758,27 +758,21 @@ void udp_listen_sock_handler(const struct ctx *c,
  * @c:		Execution context
  * @s:		Socket to read data from
  * @tosidx:	Flow & side to forward data from @s to
- * @now:	Current timestamp
  *
  * Return: true on success, false if can't forward from socket to flow's pif
  *
  * #syscalls recvmmsg
  */
 static bool udp_buf_reply_sock_data(const struct ctx *c,
-				    int s, flow_sidx_t tosidx,
-				    const struct timespec *now)
+				    int s, flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
-	struct udp_flow *uflow = udp_at_sidx(tosidx);
 	uint8_t topif = pif_at_sidx(tosidx);
 	int n, i;
 
 	if ((n = udp_sock_recv(c, s, udp_mh_recv)) <= 0)
 		return true;
 
-	flow_trace(uflow, "Received %d datagrams on reply socket", n);
-	uflow->ts = now->tv_sec;
-
 	for (i = 0; i < n; i++) {
 		if (pif_is_socket(topif))
 			udp_splice_prepare(udp_mh_recv, i);
@@ -825,10 +819,13 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 		int s = ref.fd;
 		bool ret;
 
+		flow_trace(uflow, "Received data on reply socket");
+		uflow->ts = now->tv_sec;
+
 		if (c->mode == MODE_VU)
-			ret = udp_vu_reply_sock_data(c, s, tosidx, now);
+			ret = udp_vu_reply_sock_data(c, s, tosidx);
 		else
-			ret = udp_buf_reply_sock_data(c, s, tosidx, now);
+			ret = udp_buf_reply_sock_data(c, s, tosidx);
 
 		if (!ret) {
 			flow_err(uflow, "Unable to forward UDP");
diff --git a/udp_vu.c b/udp_vu.c
index 06bdeae..4153b6c 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -275,22 +275,17 @@ void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
  * @c:		Execution context
  * @s:		Socket to read data from
  * @tosidx:	Flow & side to forward data from @s to
- * @now:	Current timestamp
  *
  * Return: true on success, false if can't forward from socket to flow's pif
  */
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx,
-			    const struct timespec *now)
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
-	struct udp_flow *uflow = udp_at_sidx(tosidx);
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
 
-	ASSERT(uflow);
-
 	if (pif_at_sidx(tosidx) != PIF_TAP)
 		return false;
 
@@ -301,8 +296,6 @@ bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx,
 		iov_used = udp_vu_sock_recv(c, s, v6, &dlen);
 		if (iov_used <= 0)
 			break;
-		flow_trace(uflow, "Received 1 datagram on reply socket");
-		uflow->ts = now->tv_sec;
 
 		udp_vu_prepare(c, toside, dlen);
 		if (*c->pcap) {
diff --git a/udp_vu.h b/udp_vu.h
index 2299b51..6d541a4 100644
--- a/udp_vu.h
+++ b/udp_vu.h
@@ -8,7 +8,6 @@
 
 void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 			     const struct timespec *now);
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx,
-			    const struct timespec *now);
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx);
 
 #endif /* UDP_VU_H */

From 684870a766e7f024a5720464ad070e666cb4793e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 1 Apr 2025 19:57:10 +1100
Subject: [PATCH 168/221] udp: Correct some seccomp filter annotations

Both udp_buf_listen_sock_data() and udp_buf_reply_sock_data() have comments
stating they use recvmmsg().  That's not correct, they only do so via
udp_sock_recv() which lists recvmmsg() itself.

In contrast udp_splice_send() and udp_tap_handler() both directly use
sendmmsg(), but only the latter lists it.  Add it to the former as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/udp.c b/udp.c
index 44b58d1..ab3e9d2 100644
--- a/udp.c
+++ b/udp.c
@@ -272,6 +272,8 @@ static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx)
  * @dst:	Destination port for datagrams (target side)
  * @ref:	epoll reference for origin socket
  * @now:	Timestamp
+ *
+ * #syscalls sendmmsg
  */
 static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
 			    flow_sidx_t tosidx)
@@ -662,8 +664,6 @@ static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh)
  * @c:		Execution context
  * @ref:	epoll reference
  * @now:	Current timestamp
- *
- * #syscalls recvmmsg
  */
 static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 				     const struct timespec *now)
@@ -760,8 +760,6 @@ void udp_listen_sock_handler(const struct ctx *c,
  * @tosidx:	Flow & side to forward data from @s to
  *
  * Return: true on success, false if can't forward from socket to flow's pif
- *
- * #syscalls recvmmsg
  */
 static bool udp_buf_reply_sock_data(const struct ctx *c,
 				    int s, flow_sidx_t tosidx)

From 06784d7fc6761528d587837b241d27c6d17c0842 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 3 Apr 2025 19:01:02 +0200
Subject: [PATCH 169/221] passt-repair: Ensure that read buffer is
 NULL-terminated

After 3d41e4d83895 ("passt-repair: Correct off-by-one error verifying
name"), Coverity Scan isn't convinced anymore about the fact that the
ev->name used in the snprintf() is NULL-terminated.

It comes from a read() call, and read() of course doesn't terminate
it, but we already check that the byte at ev->len - 1 is a NULL
terminator, so this is actually a false positive.

In any case, the logic ensuring that ev->name is NULL-terminated isn't
necessarily obvious, and additionally checking that the last byte in
the buffer we read is a NULL terminator is harmless, so do that
explicitly, even if it's redundant.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt-repair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/passt-repair.c b/passt-repair.c
index 440c77a..256a8c9 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -137,6 +137,7 @@ int main(int argc, char **argv)
 				fprintf(stderr, "inotify read: %i", errno);
 				_exit(1);
 			}
+			buf[n - 1] = '\0';
 
 			if (n < (ssize_t)sizeof(*ev)) {
 				fprintf(stderr, "Short inotify read: %zi", n);

From a7775e9550fa698e4af1322f6ef63924c24d1fab Mon Sep 17 00:00:00 2001
From: Jon Maloy <jmaloy@redhat.com>
Date: Sat, 5 Apr 2025 15:21:26 -0400
Subject: [PATCH 170/221] udp: support traceroute in direction tap-socket

Now that ICMP pass-through from socket-to-tap is in place, it is
easy to support UDP based traceroute functionality in direction
tap-to-socket.

We fix that in this commit.

Link: https://bugs.passt.top/show_bug.cgi?id=64
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c      | 17 +++++++++++++----
 udp.c      | 22 +++++++++++++++++++++-
 udp.h      |  3 ++-
 udp_flow.c |  1 +
 udp_flow.h |  4 +++-
 5 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/tap.c b/tap.c
index 3a6fcbe..d630f6d 100644
--- a/tap.c
+++ b/tap.c
@@ -559,6 +559,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
  * struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
  * @msgs:	Count of messages in sequence
  * @protocol:	Protocol number
+ * @ttl:	Time to live
  * @source:	Source port
  * @dest:	Destination port
  * @saddr:	Source address
@@ -567,6 +568,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
  */
 static struct tap4_l4_t {
 	uint8_t protocol;
+	uint8_t ttl;
 
 	uint16_t source;
 	uint16_t dest;
@@ -586,6 +588,7 @@ static struct tap4_l4_t {
  * @dest:	Destination port
  * @saddr:	Source address
  * @daddr:	Destination address
+ * @hop_limit:	Hop limit
  * @msg:	Array of messages that can be handled in a single call
  */
 static struct tap6_l4_t {
@@ -598,6 +601,8 @@ static struct tap6_l4_t {
 	struct in6_addr saddr;
 	struct in6_addr daddr;
 
+	uint8_t hop_limit;
+
 	struct pool_l4_t p;
 } tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
 
@@ -786,7 +791,8 @@ resume:
 #define L4_MATCH(iph, uh, seq)							\
 	((seq)->protocol == (iph)->protocol &&					\
 	 (seq)->source   == (uh)->source    && (seq)->dest  == (uh)->dest &&	\
-	 (seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
+	 (seq)->saddr.s_addr == (iph)->saddr &&				\
+	 (seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
 
 #define L4_SET(iph, uh, seq)						\
 	do {								\
@@ -795,6 +801,7 @@ resume:
 		(seq)->dest		= (uh)->dest;			\
 		(seq)->saddr.s_addr	= (iph)->saddr;			\
 		(seq)->daddr.s_addr	= (iph)->daddr;			\
+		(seq)->ttl		= (iph)->ttl;			\
 	} while (0)
 
 		if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
@@ -843,7 +850,7 @@ append:
 			for (k = 0; k < p->count; )
 				k += udp_tap_handler(c, PIF_TAP, AF_INET,
 						     &seq->saddr, &seq->daddr,
-						     p, k, now);
+						     seq->ttl, p, k, now);
 		}
 	}
 
@@ -966,7 +973,8 @@ resume:
 		 (seq)->dest == (uh)->dest                 &&		\
 		 (seq)->flow_lbl == ip6_get_flow_lbl(ip6h) &&		\
 		 IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr)  &&		\
-		 IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
+		 IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr)  &&		\
+		 (seq)->hop_limit == (ip6h)->hop_limit)
 
 #define L4_SET(ip6h, proto, uh, seq)					\
 	do {								\
@@ -976,6 +984,7 @@ resume:
 		(seq)->flow_lbl	= ip6_get_flow_lbl(ip6h);		\
 		(seq)->saddr	= *saddr;				\
 		(seq)->daddr	= *daddr;				\
+		(seq)->hop_limit = (ip6h)->hop_limit;			\
 	} while (0)
 
 		if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
@@ -1026,7 +1035,7 @@ append:
 			for (k = 0; k < p->count; )
 				k += udp_tap_handler(c, PIF_TAP, AF_INET6,
 						     &seq->saddr, &seq->daddr,
-						     p, k, now);
+						     seq->hop_limit, p, k, now);
 		}
 	}
 
diff --git a/udp.c b/udp.c
index ab3e9d2..5a251df 100644
--- a/udp.c
+++ b/udp.c
@@ -844,6 +844,7 @@ fail:
  * @af:		Address family, AF_INET or AF_INET6
  * @saddr:	Source address
  * @daddr:	Destination address
+ * @ttl:	TTL or hop limit for packets to be sent in this call
  * @p:		Pool of UDP packets, with UDP headers
  * @idx:	Index of first packet to process
  * @now:	Current timestamp
@@ -854,7 +855,8 @@ fail:
  */
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
-		    const struct pool *p, int idx, const struct timespec *now)
+		    uint8_t ttl, const struct pool *p, int idx,
+		    const struct timespec *now)
 {
 	const struct flowside *toside;
 	struct mmsghdr mm[UIO_MAXIOV];
@@ -933,6 +935,24 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		mm[i].msg_hdr.msg_controllen = 0;
 		mm[i].msg_hdr.msg_flags = 0;
 
+		if (ttl != uflow->ttl[tosidx.sidei]) {
+			uflow->ttl[tosidx.sidei] = ttl;
+			if (af == AF_INET) {
+				if (setsockopt(s, IPPROTO_IP, IP_TTL,
+					       &ttl, sizeof(ttl)) < 0)
+					flow_perror(uflow,
+						    "setsockopt IP_TTL");
+			} else {
+				/* IPv6 hop_limit cannot be only 1 byte */
+				int hop_limit = ttl;
+
+				if (setsockopt(s, SOL_IPV6, IPV6_UNICAST_HOPS,
+					       &hop_limit, sizeof(hop_limit)) < 0)
+					flow_perror(uflow,
+						    "setsockopt IPV6_UNICAST_HOPS");
+			}
+		}
+
 		count++;
 	}
 
diff --git a/udp.h b/udp.h
index de2df6d..a811475 100644
--- a/udp.h
+++ b/udp.h
@@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 			    uint32_t events, const struct timespec *now);
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
-		    const struct pool *p, int idx, const struct timespec *now);
+		    uint8_t ttl, const struct pool *p, int idx,
+		    const struct timespec *now);
 int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
 		  const char *ifname, in_port_t port);
 int udp_init(struct ctx *c);
diff --git a/udp_flow.c b/udp_flow.c
index bf4b896..99ae490 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
 	uflow->ts = now->tv_sec;
 	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
+	uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = 0;
 
 	if (s_ini >= 0) {
 		/* When using auto port-scanning the listening port could go
diff --git a/udp_flow.h b/udp_flow.h
index 9a1b059..520de62 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -8,11 +8,12 @@
 #define UDP_FLOW_H
 
 /**
- * struct udp - Descriptor for a flow of UDP packets
+ * struct udp_flow - Descriptor for a flow of UDP packets
  * @f:		Generic flow information
  * @closed:	Flow is already closed
  * @ts:		Activity timestamp
  * @s:		Socket fd (or -1) for each side of the flow
+ * @ttl:	TTL or hop_limit for both sides
  */
 struct udp_flow {
 	/* Must be first element */
@@ -21,6 +22,7 @@ struct udp_flow {
 	bool closed :1;
 	time_t ts;
 	int s[SIDES];
+	uint8_t ttl[SIDES];
 };
 
 struct udp_flow *udp_at_sidx(flow_sidx_t sidx);

From d74b5a7c107006b95df6a69e5f1e6b9a373c7f53 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:31 +1100
Subject: [PATCH 171/221] udp: Use connect()ed sockets for initiating side

Currently we have an asymmetry in how we handle UDP sockets.  For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" specifically for that flow used for sending and
receiving datagrams on that flow and only that flow.  For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it).  This has some disadvantages:

 * We need a hash lookup for every datagram on the listening socket in
   order to work out what flow it belongs to
 * The dup() keeps the socket alive even if automatic forwarding removes
   the listening socket.  However, the epoll data remains the same
   including containing the now stale original fd.  This causes bug 103.
 * We can't (easily) set flow-specific options on an initiating side
   socket, because that could affect other flows as well

Alter the code to use a connect()ed socket on the initiating side as well
as the target side.  There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket.  We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow.  For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.

Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 epoll_type.h |  4 ++--
 passt.c      |  6 +++---
 udp.c        | 50 ++++++++++++++++++++++++++------------------------
 udp.h        |  4 ++--
 udp_flow.c   | 32 +++++++++-----------------------
 util.c       |  2 +-
 6 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index 7f2a121..12ac59b 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -22,8 +22,8 @@ enum epoll_type {
 	EPOLL_TYPE_TCP_TIMER,
 	/* UDP "listening" sockets */
 	EPOLL_TYPE_UDP_LISTEN,
-	/* UDP socket for replies on a specific flow */
-	EPOLL_TYPE_UDP_REPLY,
+	/* UDP socket for a specific flow */
+	EPOLL_TYPE_UDP,
 	/* ICMP/ICMPv6 ping sockets */
 	EPOLL_TYPE_PING,
 	/* inotify fd watching for end of netns (pasta) */
diff --git a/passt.c b/passt.c
index cd06772..388d10f 100644
--- a/passt.c
+++ b/passt.c
@@ -68,7 +68,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP_LISTEN]		= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]		= "TCP timer",
 	[EPOLL_TYPE_UDP_LISTEN]		= "listening UDP socket",
-	[EPOLL_TYPE_UDP_REPLY]		= "UDP reply socket",
+	[EPOLL_TYPE_UDP]		= "UDP flow socket",
 	[EPOLL_TYPE_PING]	= "ICMP/ICMPv6 ping socket",
 	[EPOLL_TYPE_NSQUIT_INOTIFY]	= "namespace inotify watch",
 	[EPOLL_TYPE_NSQUIT_TIMER]	= "namespace timer watch",
@@ -339,8 +339,8 @@ loop:
 		case EPOLL_TYPE_UDP_LISTEN:
 			udp_listen_sock_handler(&c, ref, eventmask, &now);
 			break;
-		case EPOLL_TYPE_UDP_REPLY:
-			udp_reply_sock_handler(&c, ref, eventmask, &now);
+		case EPOLL_TYPE_UDP:
+			udp_sock_handler(&c, ref, eventmask, &now);
 			break;
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
diff --git a/udp.c b/udp.c
index 5a251df..1b3fffd 100644
--- a/udp.c
+++ b/udp.c
@@ -39,27 +39,30 @@
  * could receive packets from multiple flows, so we use a hash table match to
  * find the specific flow for a datagram.
  *
- * When a UDP flow is initiated from a listening socket we take a duplicate of
- * the socket and store it in uflow->s[INISIDE].  This will last for the
+ * Flow sockets
+ * ============
+ *
+ * When a UDP flow targets a socket, we create a "flow" socket in
+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
+ * replies on the target side.  This socket is both bound and connected and has
+ * EPOLL_TYPE_UDP.  The connect() means it will only receive datagrams
+ * associated with this flow, so the epoll reference directly points to the flow
+ * and we don't need a hash lookup.
+ *
+ * When a flow is initiated from a listening socket, we create a "flow" socket
+ * with the same bound address as the listening socket, but also connect()ed to
+ * the flow's peer.  This is stored in uflow->s[INISIDE] and will last for the
  * lifetime of the flow, even if the original listening socket is closed due to
  * port auto-probing.  The duplicate is used to deliver replies back to the
  * originating side.
  *
- * Reply sockets
- * =============
- *
- * When a UDP flow targets a socket, we create a "reply" socket in
- * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
- * replies on the target side.  This socket is both bound and connected and has
- * EPOLL_TYPE_UDP_REPLY.  The connect() means it will only receive datagrams
- * associated with this flow, so the epoll reference directly points to the flow
- * and we don't need a hash lookup.
- *
- * NOTE: it's possible that the reply socket could have a bound address
- * overlapping with an unrelated listening socket.  We assume datagrams for the
- * flow will come to the reply socket in preference to a listening socket.  The
- * sample program doc/platform-requirements/reuseaddr-priority.c documents and
- * tests that assumption.
+ * NOTE: A flow socket can have a bound address overlapping with a listening
+ * socket.  That will happen naturally for flows initiated from a socket, but is
+ * also possible (though unlikely) for tap initiated flows, depending on the
+ * source port.  We assume datagrams for the flow will come to a connect()ed
+ * socket in preference to a listening socket.  The sample program
+ * doc/platform-requirements/reuseaddr-priority.c documents and tests that
+ * assumption.
  *
  * "Spliced" flows
  * ===============
@@ -71,8 +74,7 @@
  * actually used; it doesn't make sense for datagrams and instead a pair of
  * recvmmsg() and sendmmsg() is used to forward the datagrams.
  *
- * Note that a spliced flow will have *both* a duplicated listening socket and a
- * reply socket (see above).
+ * Note that a spliced flow will have two flow sockets (see above).
  */
 
 #include <sched.h>
@@ -557,7 +559,7 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 	}
 
 	eh = (const struct errhdr *)CMSG_DATA(hdr);
-	if (ref.type == EPOLL_TYPE_UDP_REPLY) {
+	if (ref.type == EPOLL_TYPE_UDP) {
 		flow_sidx_t sidx = flow_sidx_opposite(ref.flowside);
 		const struct flowside *toside = flowside_at_sidx(sidx);
 		size_t dlen = rc;
@@ -792,14 +794,14 @@ static bool udp_buf_reply_sock_data(const struct ctx *c,
 }
 
 /**
- * udp_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_sock_handler() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
  * @now:	Current timestamp
  */
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now)
+void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
+		      uint32_t events, const struct timespec *now)
 {
 	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
 
@@ -807,7 +809,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	if (events & EPOLLERR) {
 		if (udp_sock_errs(c, ref) < 0) {
-			flow_err(uflow, "Unrecoverable error on reply socket");
+			flow_err(uflow, "Unrecoverable error on flow socket");
 			goto fail;
 		}
 	}
diff --git a/udp.h b/udp.h
index a811475..8f8531a 100644
--- a/udp.h
+++ b/udp.h
@@ -11,8 +11,8 @@
 void udp_portmap_clear(void);
 void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 			     uint32_t events, const struct timespec *now);
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now);
+void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
+		      uint32_t events, const struct timespec *now);
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
 		    uint8_t ttl, const struct pool *p, int idx,
diff --git a/udp_flow.c b/udp_flow.c
index 99ae490..a2d417f 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -49,10 +49,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	flow_foreach_sidei(sidei) {
 		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
 		if (uflow->s[sidei] >= 0) {
-			/* The listening socket needs to stay in epoll, but the
-			 * flow specific one needs to be removed */
-			if (sidei == TGTSIDE)
-				epoll_del(c, uflow->s[sidei]);
+			epoll_del(c, uflow->s[sidei]);
 			close(uflow->s[sidei]);
 			uflow->s[sidei] = -1;
 		}
@@ -81,7 +78,7 @@ static int udp_flow_sock(const struct ctx *c,
 	} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
 	int rc, s;
 
-	s = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, pif, side, fref.data);
+	s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 	if (s < 0) {
 		flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 		return s;
@@ -120,13 +117,12 @@ static int udp_flow_sock(const struct ctx *c,
  * udp_flow_new() - Common setup for a new UDP flow
  * @c:		Execution context
  * @flow:	Initiated flow
- * @s_ini:	Initiating socket (or -1)
  * @now:	Timestamp
  *
  * Return: UDP specific flow, if successful, NULL on failure
  */
 static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
-				int s_ini, const struct timespec *now)
+				const struct timespec *now)
 {
 	struct udp_flow *uflow = NULL;
 	unsigned sidei;
@@ -139,22 +135,12 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
 	uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = 0;
 
-	if (s_ini >= 0) {
-		/* When using auto port-scanning the listening port could go
-		 * away, so we need to duplicate the socket
-		 */
-		uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
-		if (uflow->s[INISIDE] < 0) {
-			flow_perror(uflow,
-				    "Couldn't duplicate listening socket");
-			goto cancel;
-		}
+	flow_foreach_sidei(sidei) {
+		if (pif_is_socket(uflow->f.pif[sidei]))
+			if ((uflow->s[sidei] = udp_flow_sock(c, uflow, sidei)) < 0)
+				goto cancel;
 	}
 
-	if (pif_is_socket(flow->f.pif[TGTSIDE]))
-		if ((uflow->s[TGTSIDE] = udp_flow_sock(c, uflow, TGTSIDE)) < 0)
-			goto cancel;
-
 	/* Tap sides always need to be looked up by hash.  Socket sides don't
 	 * always, but sometimes do (receiving packets on a socket not specific
 	 * to one flow).  Unconditionally hash both sides so all our bases are
@@ -225,7 +211,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
 		return FLOW_SIDX_NONE;
 	}
 
-	return udp_flow_new(c, flow, ref.fd, now);
+	return udp_flow_new(c, flow, now);
 }
 
 /**
@@ -281,7 +267,7 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 		return FLOW_SIDX_NONE;
 	}
 
-	return udp_flow_new(c, flow, -1, now);
+	return udp_flow_new(c, flow, now);
 }
 
 /**
diff --git a/util.c b/util.c
index b9a3d43..0f68cf5 100644
--- a/util.c
+++ b/util.c
@@ -71,7 +71,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	case EPOLL_TYPE_UDP_LISTEN:
 		freebind = c->freebind;
 		/* fallthrough */
-	case EPOLL_TYPE_UDP_REPLY:
+	case EPOLL_TYPE_UDP:
 		proto = IPPROTO_UDP;
 		socktype = SOCK_DGRAM | SOCK_NONBLOCK;
 		break;

From 1d7bbb101a0b1dcbc99c51cd65abb90a0144ac7b Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:32 +1100
Subject: [PATCH 172/221] udp: Make udp_sock_recv() take max number of frames
 as a parameter

Currently udp_sock_recv() decides the maximum number of frames it is
willing to receive based on the mode.  However, we have upcoming use cases
where we will have different criteria for how many frames we want with
information that's not naturally available here but is in the caller.  So
make the maximum number of frames a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typo in comment in udp_buf_reply_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/udp.c b/udp.c
index 1b3fffd..53403bf 100644
--- a/udp.c
+++ b/udp.c
@@ -634,22 +634,14 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
  * @c:		Execution context
  * @s:		Socket to receive from
  * @mmh		mmsghdr array to receive into
+ * @n:		Maximum number of datagrams to receive
  *
  * Return: Number of datagrams received
  *
  * #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
  */
-static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh)
+static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
 {
-	/* For not entirely clear reasons (data locality?) pasta gets better
-	 * throughput if we receive tap datagrams one at a atime.  For small
-	 * splice datagrams throughput is slightly better if we do batch, but
-	 * it's slightly worse for large splice datagrams.  Since we don't know
-	 * before we receive whether we'll use tap or splice, always go one at a
-	 * time for pasta mode.
-	 */
-	int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
-
 	ASSERT(!c->no_udp);
 
 	n = recvmmsg(s, mmh, n, 0, NULL);
@@ -671,9 +663,10 @@ static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 				     const struct timespec *now)
 {
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
-	int n, i;
+	/* See udp_buf_sock_data() comment */
+	int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES), i;
 
-	if ((n = udp_sock_recv(c, ref.fd, udp_mh_recv)) <= 0)
+	if ((n = udp_sock_recv(c, ref.fd, udp_mh_recv, n)) <= 0)
 		return;
 
 	/* We divide datagrams into batches based on how we need to send them,
@@ -768,9 +761,15 @@ static bool udp_buf_reply_sock_data(const struct ctx *c,
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	uint8_t topif = pif_at_sidx(tosidx);
-	int n, i;
+	/* For not entirely clear reasons (data locality?) pasta gets better
+	 * throughput if we receive tap datagrams one at a time.  For small
+	 * splice datagrams throughput is slightly better if we do batch, but
+	 * it's slightly worse for large splice datagrams.  Since we don't know
+	 * the size before we receive, always go one at a time for pasta mode.
+	 */
+	int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES), i;
 
-	if ((n = udp_sock_recv(c, s, udp_mh_recv)) <= 0)
+	if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
 		return true;
 
 	for (i = 0; i < n; i++) {

From 84ab1305fabaf07b5badf433e55a458de5b86918 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:33 +1100
Subject: [PATCH 173/221] udp: Polish udp_vu_sock_info() and remove from vu
 specific code

udp_vu_sock_info() uses MSG_PEEK to look ahead at the next datagram to be
received and gets its source address.  Currently we only use it in the
vhost-user path, but there's nothing inherently vhost-user specific about
it.  We have upcoming uses for it elsewhere so rename and move to udp.c.

While we're there, polish its error reporting a litle.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop excess newline before udp_sock_recv()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c          | 24 ++++++++++++++++++++++++
 udp_internal.h |  1 +
 udp_vu.c       | 19 +------------------
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/udp.c b/udp.c
index 53403bf..1e241c8 100644
--- a/udp.c
+++ b/udp.c
@@ -629,6 +629,30 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
 	return n_err;
 }
 
+/**
+ * udp_peek_addr() - Get source address for next packet
+ * @s:		Socket to get information from
+ * @src:	Socket address (output)
+ *
+ * Return: 0 on success, -1 otherwise
+ */
+int udp_peek_addr(int s, union sockaddr_inany *src)
+{
+	struct msghdr msg = {
+		.msg_name = src,
+		.msg_namelen = sizeof(*src),
+	};
+	int rc;
+
+	rc = recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
+	if (rc < 0) {
+		if (errno != EAGAIN && errno != EWOULDBLOCK)
+			warn_perror("Error peeking at socket address");
+		return rc;
+	}
+	return 0;
+}
+
 /**
  * udp_sock_recv() - Receive datagrams from a socket
  * @c:		Execution context
diff --git a/udp_internal.h b/udp_internal.h
index 02724e5..43a6109 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -30,5 +30,6 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
                        const struct flowside *toside, size_t dlen,
 		       bool no_udp_csum);
+int udp_peek_addr(int s, union sockaddr_inany *src);
 
 #endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
index 4153b6c..5faf1e1 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -57,23 +57,6 @@ static size_t udp_vu_hdrlen(bool v6)
 	return hdrlen;
 }
 
-/**
- * udp_vu_sock_info() - get socket information
- * @s:		Socket to get information from
- * @s_in:	Socket address (output)
- *
- * Return: 0 if socket address can be read, -1 otherwise
- */
-static int udp_vu_sock_info(int s, union sockaddr_inany *s_in)
-{
-	struct msghdr msg = {
-		.msg_name = s_in,
-		.msg_namelen = sizeof(union sockaddr_inany),
-	};
-
-	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
-}
-
 /**
  * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
  * @c:		Execution context
@@ -230,7 +213,7 @@ void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 		int iov_used;
 		bool v6;
 
-		if (udp_vu_sock_info(ref.fd, &s_in) < 0)
+		if (udp_peek_addr(ref.fd, &s_in) < 0)
 			break;
 
 		sidx = udp_flow_from_sock(c, ref, &s_in, now);

From 3a0881dfd02d758b0dc8ca6f5732bcb666b6d21e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:34 +1100
Subject: [PATCH 174/221] udp: Don't bother to batch datagrams from "listening"
 socket

A "listening" UDP socket can receive datagrams from multiple flows.  So,
we currently have some quite subtle and complex code in
udp_buf_listen_sock_data() to group contiguously received packets for the
same flow into batches for forwarding.

However, since we are now always using flow specific connect()ed sockets
once a flow is established, handling of datagrams on listening sockets is
essentially a slow path.  Given that, it's not worth the complexity.
Substantially simplify the code by using an approach more like vhost-user,
and "peeking" at the address of the next datagram, one at a time to
determine the correct flow before we actually receive the data,

This removes all meaningful use of the s_in and tosidx fields in
udp_meta_t, so they too can be removed, along with setting of msg_name and
msg_namelen in the msghdr arrays which referenced them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 75 +++++++++++++++--------------------------------------------
 1 file changed, 19 insertions(+), 56 deletions(-)

diff --git a/udp.c b/udp.c
index 1e241c8..4d32124 100644
--- a/udp.c
+++ b/udp.c
@@ -138,20 +138,15 @@ static struct ethhdr udp4_eth_hdr;
 static struct ethhdr udp6_eth_hdr;
 
 /**
- * struct udp_meta_t - Pre-cooked headers and metadata for UDP packets
+ * struct udp_meta_t - Pre-cooked headers for UDP packets
  * @ip6h:	Pre-filled IPv6 header (except for payload_len and addresses)
  * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
  * @taph:	Tap backend specific header
- * @s_in:	Source socket address, filled in by recvmmsg()
- * @tosidx:	sidx for the destination side of this datagram's flow
  */
 static struct udp_meta_t {
 	struct ipv6hdr ip6h;
 	struct iphdr ip4h;
 	struct tap_hdr taph;
-
-	union sockaddr_inany s_in;
-	flow_sidx_t tosidx;
 }
 #ifdef __AVX2__
 __attribute__ ((aligned(32)))
@@ -234,8 +229,6 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
 	tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
 	tiov[UDP_IOV_PAYLOAD].iov_base = payload;
 
-	mh->msg_name	= &meta->s_in;
-	mh->msg_namelen	= sizeof(meta->s_in);
 	mh->msg_iov	= siov;
 	mh->msg_iovlen	= 1;
 }
@@ -686,60 +679,32 @@ static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
 static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 				     const struct timespec *now)
 {
-	const socklen_t sasize = sizeof(udp_meta[0].s_in);
-	/* See udp_buf_sock_data() comment */
-	int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES), i;
+	union sockaddr_inany src;
 
-	if ((n = udp_sock_recv(c, ref.fd, udp_mh_recv, n)) <= 0)
-		return;
+	while (udp_peek_addr(ref.fd, &src) == 0) {
+		flow_sidx_t tosidx = udp_flow_from_sock(c, ref, &src, now);
+		uint8_t topif = pif_at_sidx(tosidx);
 
-	/* We divide datagrams into batches based on how we need to send them,
-	 * determined by udp_meta[i].tosidx.  To avoid either two passes through
-	 * the array, or recalculating tosidx for a single entry, we have to
-	 * populate it one entry *ahead* of the loop counter.
-	 */
-	udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0].s_in, now);
-	udp_mh_recv[0].msg_hdr.msg_namelen = sasize;
-	for (i = 0; i < n; ) {
-		flow_sidx_t batchsidx = udp_meta[i].tosidx;
-		uint8_t batchpif = pif_at_sidx(batchsidx);
-		int batchstart = i;
+		if (udp_sock_recv(c, ref.fd, udp_mh_recv, 1) <= 0)
+			break;
 
-		do {
-			if (pif_is_socket(batchpif)) {
-				udp_splice_prepare(udp_mh_recv, i);
-			} else if (batchpif == PIF_TAP) {
-				udp_tap_prepare(udp_mh_recv, i,
-						flowside_at_sidx(batchsidx),
-						false);
-			}
-
-			if (++i >= n)
-				break;
-
-			udp_meta[i].tosidx = udp_flow_from_sock(c, ref,
-								&udp_meta[i].s_in,
-								now);
-			udp_mh_recv[i].msg_hdr.msg_namelen = sasize;
-		} while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx));
-
-		if (pif_is_socket(batchpif)) {
-			udp_splice_send(c, batchstart, i - batchstart,
-					batchsidx);
-		} else if (batchpif == PIF_TAP) {
-			tap_send_frames(c, &udp_l2_iov[batchstart][0],
-					UDP_NUM_IOVS, i - batchstart);
-		} else if (flow_sidx_valid(batchsidx)) {
-			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
-			struct udp_flow *uflow = udp_at_sidx(batchsidx);
+		if (pif_is_socket(topif)) {
+			udp_splice_prepare(udp_mh_recv, 0);
+			udp_splice_send(c, 0, 1, tosidx);
+		} else if (topif == PIF_TAP) {
+			udp_tap_prepare(udp_mh_recv, 0, flowside_at_sidx(tosidx),
+					false);
+			tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, 1);
+		} else if (flow_sidx_valid(tosidx)) {
+			flow_sidx_t fromsidx = flow_sidx_opposite(tosidx);
+			struct udp_flow *uflow = udp_at_sidx(tosidx);
 
 			flow_err(uflow,
 				 "No support for forwarding UDP from %s to %s",
 				 pif_name(pif_at_sidx(fromsidx)),
-				 pif_name(batchpif));
+				 pif_name(topif));
 		} else {
-			debug("Discarding %d datagrams without flow",
-			      i - batchstart);
+			debug("Discarding datagram without flow");
 		}
 	}
 }
@@ -801,8 +766,6 @@ static bool udp_buf_reply_sock_data(const struct ctx *c,
 			udp_splice_prepare(udp_mh_recv, i);
 		else if (topif == PIF_TAP)
 			udp_tap_prepare(udp_mh_recv, i, toside, false);
-		/* Restore sockaddr length clobbered by recvmsg() */
-		udp_mh_recv[i].msg_hdr.msg_namelen = sizeof(udp_meta[i].s_in);
 	}
 
 	if (pif_is_socket(topif)) {

From 5221e177e132b8b5001ec97f42975ad1251f7110 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:35 +1100
Subject: [PATCH 175/221] udp: Parameterize number of datagrams handled by
 udp_*_reply_sock_data()

Both udp_buf_reply_sock_data() and udp_vu_reply_sock_data() internally
decide what the maximum number of datagrams they will forward is.  We have
some upcoming reasons to allow the caller to decide that instead, so make
the maximum number of datagrams a parameter for both of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c    | 31 ++++++++++++++++++-------------
 udp_vu.c |  6 ++++--
 udp_vu.h |  3 ++-
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/udp.c b/udp.c
index 4d32124..0f09e67 100644
--- a/udp.c
+++ b/udp.c
@@ -741,22 +741,17 @@ void udp_listen_sock_handler(const struct ctx *c,
  * udp_buf_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
  * @s:		Socket to read data from
+ * @n:		Maximum number of datagrams to forward
  * @tosidx:	Flow & side to forward data from @s to
  *
  * Return: true on success, false if can't forward from socket to flow's pif
  */
-static bool udp_buf_reply_sock_data(const struct ctx *c,
-				    int s, flow_sidx_t tosidx)
+static bool udp_buf_reply_sock_data(const struct ctx *c, int s, int n,
+				    flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	uint8_t topif = pif_at_sidx(tosidx);
-	/* For not entirely clear reasons (data locality?) pasta gets better
-	 * throughput if we receive tap datagrams one at a time.  For small
-	 * splice datagrams throughput is slightly better if we do batch, but
-	 * it's slightly worse for large splice datagrams.  Since we don't know
-	 * the size before we receive, always go one at a time for pasta mode.
-	 */
-	int n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES), i;
+	int i;
 
 	if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
 		return true;
@@ -801,6 +796,14 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 	}
 
 	if (events & EPOLLIN) {
+		/* For not entirely clear reasons (data locality?) pasta gets
+		 * better throughput if we receive tap datagrams one at a
+		 * time.  For small splice datagrams throughput is slightly
+		 * better if we do batch, but it's slightly worse for large
+		 * splice datagrams.  Since we don't know the size before we
+		 * receive, always go one at a time for pasta mode.
+		 */
+		size_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
 		flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 		int s = ref.fd;
 		bool ret;
@@ -808,10 +811,12 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		flow_trace(uflow, "Received data on reply socket");
 		uflow->ts = now->tv_sec;
 
-		if (c->mode == MODE_VU)
-			ret = udp_vu_reply_sock_data(c, s, tosidx);
-		else
-			ret = udp_buf_reply_sock_data(c, s, tosidx);
+		if (c->mode == MODE_VU) {
+			ret = udp_vu_reply_sock_data(c, s, UDP_MAX_FRAMES,
+						     tosidx);
+		} else {
+			ret = udp_buf_reply_sock_data(c, s, n, tosidx);
+		}
 
 		if (!ret) {
 			flow_err(uflow, "Unable to forward UDP");
diff --git a/udp_vu.c b/udp_vu.c
index 5faf1e1..b2618b3 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -257,11 +257,13 @@ void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
  * udp_vu_reply_sock_data() - Handle new data from flow specific socket
  * @c:		Execution context
  * @s:		Socket to read data from
+ * @n:		Maximum number of datagrams to forward
  * @tosidx:	Flow & side to forward data from @s to
  *
  * Return: true on success, false if can't forward from socket to flow's pif
  */
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx)
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
+			    flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
@@ -272,7 +274,7 @@ bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx)
 	if (pif_at_sidx(tosidx) != PIF_TAP)
 		return false;
 
-	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+	for (i = 0; i < n; i++) {
 		ssize_t dlen;
 		int iov_used;
 
diff --git a/udp_vu.h b/udp_vu.h
index 6d541a4..c897c36 100644
--- a/udp_vu.h
+++ b/udp_vu.h
@@ -8,6 +8,7 @@
 
 void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 			     const struct timespec *now);
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, flow_sidx_t tosidx);
+bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
+			    flow_sidx_t tosidx);
 
 #endif /* UDP_VU_H */

From 0304dd9c34a7dd29c3a8a2058626a971d4e71a8e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:36 +1100
Subject: [PATCH 176/221] udp: Split spliced forwarding path from
 udp_buf_reply_sock_data()

udp_buf_reply_sock_data() can handle forwarding data either from socket
to socket ("splicing") or from socket to tap.  It has a test on each
datagram for which case we're in, but that will be the same for everything
in the batch.

Split out the spliced path into a separate udp_sock_to_sock() function.
This leaves udp_{buf,vu}_reply_sock_data() handling only forwards from
socket to tap, so rename and simplify them accordingly.

This makes the code slightly longer for now, but will allow future cleanups
to shrink it back down again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Fix typos in comments to udp_sock_recv() and
 udp_vu_listen_sock_data()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c    | 103 ++++++++++++++++++++++++++++++-------------------------
 udp_vu.c |  12 ++-----
 udp_vu.h |   3 +-
 3 files changed, 60 insertions(+), 58 deletions(-)

diff --git a/udp.c b/udp.c
index 0f09e67..2745e5d 100644
--- a/udp.c
+++ b/udp.c
@@ -670,6 +670,49 @@ static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
 	return n;
 }
 
+/**
+ * udp_sock_to_sock() - Forward datagrams from socket to socket
+ * @c:		Execution context
+ * @from_s:	Socket to receive datagrams from
+ * @n:		Maximum number of datagrams to forward
+ * @tosidx:	Flow & side to forward datagrams to
+ */
+static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
+			     flow_sidx_t tosidx)
+{
+	int i;
+
+	if ((n = udp_sock_recv(c, from_s, udp_mh_recv, n)) <= 0)
+		return;
+
+	for (i = 0; i < n; i++)
+		udp_splice_prepare(udp_mh_recv, i);
+
+	udp_splice_send(c, 0, n, tosidx);
+}
+
+/**
+ * udp_buf_sock_to_tap() - Forward datagrams from socket to tap
+ * @c:		Execution context
+ * @s:		Socket to read data from
+ * @n:		Maximum number of datagrams to forward
+ * @tosidx:	Flow & side to forward data from @s to
+ */
+static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
+				flow_sidx_t tosidx)
+{
+	const struct flowside *toside = flowside_at_sidx(tosidx);
+	int i;
+
+	if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
+		return;
+
+	for (i = 0; i < n; i++)
+		udp_tap_prepare(udp_mh_recv, i, toside, false);
+
+	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
+}
+
 /**
  * udp_buf_listen_sock_data() - Handle new data from socket
  * @c:		Execution context
@@ -737,43 +780,6 @@ void udp_listen_sock_handler(const struct ctx *c,
 	}
 }
 
-/**
- * udp_buf_reply_sock_data() - Handle new data from flow specific socket
- * @c:		Execution context
- * @s:		Socket to read data from
- * @n:		Maximum number of datagrams to forward
- * @tosidx:	Flow & side to forward data from @s to
- *
- * Return: true on success, false if can't forward from socket to flow's pif
- */
-static bool udp_buf_reply_sock_data(const struct ctx *c, int s, int n,
-				    flow_sidx_t tosidx)
-{
-	const struct flowside *toside = flowside_at_sidx(tosidx);
-	uint8_t topif = pif_at_sidx(tosidx);
-	int i;
-
-	if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
-		return true;
-
-	for (i = 0; i < n; i++) {
-		if (pif_is_socket(topif))
-			udp_splice_prepare(udp_mh_recv, i);
-		else if (topif == PIF_TAP)
-			udp_tap_prepare(udp_mh_recv, i, toside, false);
-	}
-
-	if (pif_is_socket(topif)) {
-		udp_splice_send(c, 0, n, tosidx);
-	} else if (topif == PIF_TAP) {
-		tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
-	} else {
-		return false;
-	}
-
-	return true;
-}
-
 /**
  * udp_sock_handler() - Handle new data from flow specific socket
  * @c:		Execution context
@@ -805,21 +811,26 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		 */
 		size_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
 		flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
+		uint8_t topif = pif_at_sidx(tosidx);
 		int s = ref.fd;
-		bool ret;
 
 		flow_trace(uflow, "Received data on reply socket");
 		uflow->ts = now->tv_sec;
 
-		if (c->mode == MODE_VU) {
-			ret = udp_vu_reply_sock_data(c, s, UDP_MAX_FRAMES,
-						     tosidx);
+		if (pif_is_socket(topif)) {
+			udp_sock_to_sock(c, ref.fd, n, tosidx);
+		} else if (topif == PIF_TAP) {
+			if (c->mode == MODE_VU) {
+				udp_vu_sock_to_tap(c, s, UDP_MAX_FRAMES,
+						   tosidx);
+			} else {
+				udp_buf_sock_to_tap(c, s, n, tosidx);
+			}
 		} else {
-			ret = udp_buf_reply_sock_data(c, s, n, tosidx);
-		}
-
-		if (!ret) {
-			flow_err(uflow, "Unable to forward UDP");
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(pif_at_sidx(ref.flowside)),
+				 pif_name(topif));
 			goto fail;
 		}
 	}
diff --git a/udp_vu.c b/udp_vu.c
index b2618b3..8e02093 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -254,16 +254,13 @@ void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * udp_vu_reply_sock_data() - Handle new data from flow specific socket
+ * udp_vu_sock_to_tap() - Forward datagrams from socket to tap
  * @c:		Execution context
  * @s:		Socket to read data from
  * @n:		Maximum number of datagrams to forward
  * @tosidx:	Flow & side to forward data from @s to
- *
- * Return: true on success, false if can't forward from socket to flow's pif
  */
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
-			    flow_sidx_t tosidx)
+void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
@@ -271,9 +268,6 @@ bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
 
-	if (pif_at_sidx(tosidx) != PIF_TAP)
-		return false;
-
 	for (i = 0; i < n; i++) {
 		ssize_t dlen;
 		int iov_used;
@@ -290,6 +284,4 @@ bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
 		}
 		vu_flush(vdev, vq, elem, iov_used);
 	}
-
-	return true;
 }
diff --git a/udp_vu.h b/udp_vu.h
index c897c36..576b0e7 100644
--- a/udp_vu.h
+++ b/udp_vu.h
@@ -8,7 +8,6 @@
 
 void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 			     const struct timespec *now);
-bool udp_vu_reply_sock_data(const struct ctx *c, int s, int n,
-			    flow_sidx_t tosidx);
+void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx);
 
 #endif /* UDP_VU_H */

From fc6ee68ad3a8863cba534dfa4b88767114a6701e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:37 +1100
Subject: [PATCH 177/221] udp: Merge vhost-user and "buf" listening socket
 paths

udp_buf_listen_sock_data() and udp_vu_listen_sock_data() now have
effectively identical structure.  The forwarding functions used for flow
specific sockets (udp_buf_sock_to_tap(), udp_vu_sock_to_tap() and
udp_sock_to_sock()) also now take a number of datagrams.  This means we
can re-use them for the listening socket path, just passing '1' so they
handle a single datagram at a time.

This allows us to merge both the vhost-user and flow specific paths into
a single, simpler udp_listen_sock_data().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c          | 27 ++++++++--------------
 udp_internal.h |  1 -
 udp_vu.c       | 62 --------------------------------------------------
 3 files changed, 10 insertions(+), 80 deletions(-)

diff --git a/udp.c b/udp.c
index 2745e5d..b0a7bf7 100644
--- a/udp.c
+++ b/udp.c
@@ -629,7 +629,7 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
  *
  * Return: 0 on success, -1 otherwise
  */
-int udp_peek_addr(int s, union sockaddr_inany *src)
+static int udp_peek_addr(int s, union sockaddr_inany *src)
 {
 	struct msghdr msg = {
 		.msg_name = src,
@@ -714,12 +714,12 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
 }
 
 /**
- * udp_buf_listen_sock_data() - Handle new data from socket
+ * udp_listen_sock_data() - Handle new data from listening socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @now:	Current timestamp
  */
-static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
+static void udp_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 				     const struct timespec *now)
 {
 	union sockaddr_inany src;
@@ -728,16 +728,13 @@ static void udp_buf_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 		flow_sidx_t tosidx = udp_flow_from_sock(c, ref, &src, now);
 		uint8_t topif = pif_at_sidx(tosidx);
 
-		if (udp_sock_recv(c, ref.fd, udp_mh_recv, 1) <= 0)
-			break;
-
 		if (pif_is_socket(topif)) {
-			udp_splice_prepare(udp_mh_recv, 0);
-			udp_splice_send(c, 0, 1, tosidx);
+			udp_sock_to_sock(c, ref.fd, 1, tosidx);
 		} else if (topif == PIF_TAP) {
-			udp_tap_prepare(udp_mh_recv, 0, flowside_at_sidx(tosidx),
-					false);
-			tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, 1);
+			if (c->mode == MODE_VU)
+				udp_vu_sock_to_tap(c, ref.fd, 1, tosidx);
+			else
+				udp_buf_sock_to_tap(c, ref.fd, 1, tosidx);
 		} else if (flow_sidx_valid(tosidx)) {
 			flow_sidx_t fromsidx = flow_sidx_opposite(tosidx);
 			struct udp_flow *uflow = udp_at_sidx(tosidx);
@@ -772,12 +769,8 @@ void udp_listen_sock_handler(const struct ctx *c,
 		}
 	}
 
-	if (events & EPOLLIN) {
-		if (c->mode == MODE_VU)
-			udp_vu_listen_sock_data(c, ref, now);
-		else
-			udp_buf_listen_sock_data(c, ref, now);
-	}
+	if (events & EPOLLIN)
+		udp_listen_sock_data(c, ref, now);
 }
 
 /**
diff --git a/udp_internal.h b/udp_internal.h
index 43a6109..02724e5 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -30,6 +30,5 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
                        const struct flowside *toside, size_t dlen,
 		       bool no_udp_csum);
-int udp_peek_addr(int s, union sockaddr_inany *src);
 
 #endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
index 8e02093..1f89509 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -191,68 +191,6 @@ static void udp_vu_csum(const struct flowside *toside, int iov_used)
 	}
 }
 
-/**
- * udp_vu_listen_sock_data() - Handle new data from socket
- * @c:		Execution context
- * @ref:	epoll reference
- * @now:	Current timestamp
- */
-void udp_vu_listen_sock_data(const struct ctx *c, union epoll_ref ref,
-			     const struct timespec *now)
-{
-	struct vu_dev *vdev = c->vdev;
-	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
-	int i;
-
-	for (i = 0; i < UDP_MAX_FRAMES; i++) {
-		const struct flowside *toside;
-		union sockaddr_inany s_in;
-		flow_sidx_t sidx;
-		uint8_t pif;
-		ssize_t dlen;
-		int iov_used;
-		bool v6;
-
-		if (udp_peek_addr(ref.fd, &s_in) < 0)
-			break;
-
-		sidx = udp_flow_from_sock(c, ref, &s_in, now);
-		pif = pif_at_sidx(sidx);
-
-		if (pif != PIF_TAP) {
-			if (flow_sidx_valid(sidx)) {
-				flow_sidx_t fromsidx = flow_sidx_opposite(sidx);
-				struct udp_flow *uflow = udp_at_sidx(sidx);
-
-				flow_err(uflow,
-					"No support for forwarding UDP from %s to %s",
-					pif_name(pif_at_sidx(fromsidx)),
-					pif_name(pif));
-			} else {
-				debug("Discarding 1 datagram without flow");
-			}
-
-			continue;
-		}
-
-		toside = flowside_at_sidx(sidx);
-
-		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
-
-		iov_used = udp_vu_sock_recv(c, ref.fd, v6, &dlen);
-		if (iov_used <= 0)
-			break;
-
-		udp_vu_prepare(c, toside, dlen);
-		if (*c->pcap) {
-			udp_vu_csum(toside, iov_used);
-			pcap_iov(iov_vu, iov_used,
-				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
-		}
-		vu_flush(vdev, vq, elem, iov_used);
-	}
-}
-
 /**
  * udp_vu_sock_to_tap() - Forward datagrams from socket to tap
  * @c:		Execution context

From fd844a90bce0274d2488370ed7fadd850b6a0294 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:38 +1100
Subject: [PATCH 178/221] udp: Move UDP_MAX_FRAMES to udp.c

Recent changes mean that this define is no longer used anywhere except in
udp.c.  Move it back into udp.c from udp_internal.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c          | 2 ++
 udp_internal.h | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/udp.c b/udp.c
index b0a7bf7..f74a992 100644
--- a/udp.c
+++ b/udp.c
@@ -116,6 +116,8 @@
 #include "udp_internal.h"
 #include "udp_vu.h"
 
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
 /* Maximum UDP data to be returned in ICMP messages */
 #define ICMP4_MAX_DLEN 8
 #define ICMP6_MAX_DLEN (IPV6_MIN_MTU			\
diff --git a/udp_internal.h b/udp_internal.h
index 02724e5..f7d8426 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -8,8 +8,6 @@
 
 #include "tap.h" /* needed by udp_meta_t */
 
-#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
-
 /**
  * struct udp_payload_t - UDP header and data for inbound messages
  * @uh:		UDP header

From 159beefa36a09fc36cc9669fd536926d84c7c342 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:39 +1100
Subject: [PATCH 179/221] udp_flow: Take pif and port as explicit parameters to
 udp_flow_from_sock()

Currently udp_flow_from_sock() is only used when receiving a datagram
from a "listening" socket.  It takes the listening socket's epoll
reference to get the interface and port on which the datagram arrived.

We have some upcoming cases where we want to use this in different
contexts, so make it take the pif and port as direct parameters instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop @ref from comment to udp_flow_from_sock()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c      |  4 +++-
 udp_flow.c | 16 +++++++---------
 udp_flow.h |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/udp.c b/udp.c
index f74a992..157697e 100644
--- a/udp.c
+++ b/udp.c
@@ -727,7 +727,9 @@ static void udp_listen_sock_data(const struct ctx *c, union epoll_ref ref,
 	union sockaddr_inany src;
 
 	while (udp_peek_addr(ref.fd, &src) == 0) {
-		flow_sidx_t tosidx = udp_flow_from_sock(c, ref, &src, now);
+		flow_sidx_t tosidx = udp_flow_from_sock(c, ref.udp.pif,
+							ref.udp.port, &src,
+							now);
 		uint8_t topif = pif_at_sidx(tosidx);
 
 		if (pif_is_socket(topif)) {
diff --git a/udp_flow.c b/udp_flow.c
index a2d417f..5afe6e5 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -161,9 +161,10 @@ cancel:
 }
 
 /**
- * udp_flow_from_sock() - Find or create UDP flow for "listening" socket
+ * udp_flow_from_sock() - Find or create UDP flow for incoming datagram
  * @c:		Execution context
- * @ref:	epoll reference of the receiving socket
+ * @pif:	Interface the datagram is arriving from
+ * @port:	Our (local) port number to which the datagram is arriving
  * @s_in:	Source socket address, filled in by recvmmsg()
  * @now:	Timestamp
  *
@@ -172,7 +173,7 @@ cancel:
  * Return: sidx for the destination side of the flow for this packet, or
  *         FLOW_SIDX_NONE if we couldn't find or create a flow.
  */
-flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
+flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
 			       const union sockaddr_inany *s_in,
 			       const struct timespec *now)
 {
@@ -181,9 +182,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
 	union flow *flow;
 	flow_sidx_t sidx;
 
-	ASSERT(ref.type == EPOLL_TYPE_UDP_LISTEN);
-
-	sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, s_in, ref.udp.port);
+	sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, s_in, port);
 	if ((uflow = udp_at_sidx(sidx))) {
 		uflow->ts = now->tv_sec;
 		return flow_sidx_opposite(sidx);
@@ -193,12 +192,11 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
 		char sastr[SOCKADDR_STRLEN];
 
 		debug("Couldn't allocate flow for UDP datagram from %s %s",
-		      pif_name(ref.udp.pif),
-		      sockaddr_ntop(s_in, sastr, sizeof(sastr)));
+		      pif_name(pif), sockaddr_ntop(s_in, sastr, sizeof(sastr)));
 		return FLOW_SIDX_NONE;
 	}
 
-	ini = flow_initiate_sa(flow, ref.udp.pif, s_in, ref.udp.port);
+	ini = flow_initiate_sa(flow, pif, s_in, port);
 
 	if (!inany_is_unicast(&ini->eaddr) ||
 	    ini->eport == 0 || ini->oport == 0) {
diff --git a/udp_flow.h b/udp_flow.h
index 520de62..bbdeb2a 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -26,7 +26,7 @@ struct udp_flow {
 };
 
 struct udp_flow *udp_at_sidx(flow_sidx_t sidx);
-flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
+flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
 			       const union sockaddr_inany *s_in,
 			       const struct timespec *now);
 flow_sidx_t udp_flow_from_tap(const struct ctx *c,

From bd6a41ee76bb9a0da2150d76dbabf9a3212d0fca Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:40 +1100
Subject: [PATCH 180/221] udp: Rework udp_listen_sock_data() into
 udp_sock_fwd()

udp_listen_sock_data() forwards datagrams from a "listening" socket until
there are no more (for now).  We have an upcoming use case where we want
to do that for a socket that's not a "listening" socket, and uses a
different epoll reference.  So, adjust the function to take the pieces it
needs from the reference as direct parameters and rename to udp_sock_fwd().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/udp.c b/udp.c
index 157697e..20d8f0c 100644
--- a/udp.c
+++ b/udp.c
@@ -716,37 +716,36 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
 }
 
 /**
- * udp_listen_sock_data() - Handle new data from listening socket
+ * udp_sock_fwd() - Forward datagrams from a possibly unconnected socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @s:		Socket to forward from
+ * @frompif:	Interface to which @s belongs
+ * @port:	Our (local) port number of @s
  * @now:	Current timestamp
  */
-static void udp_listen_sock_data(const struct ctx *c, union epoll_ref ref,
-				     const struct timespec *now)
+static void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+			 in_port_t port, const struct timespec *now)
 {
 	union sockaddr_inany src;
 
-	while (udp_peek_addr(ref.fd, &src) == 0) {
-		flow_sidx_t tosidx = udp_flow_from_sock(c, ref.udp.pif,
-							ref.udp.port, &src,
-							now);
+	while (udp_peek_addr(s, &src) == 0) {
+		flow_sidx_t tosidx = udp_flow_from_sock(c, frompif, port,
+							&src, now);
 		uint8_t topif = pif_at_sidx(tosidx);
 
 		if (pif_is_socket(topif)) {
-			udp_sock_to_sock(c, ref.fd, 1, tosidx);
+			udp_sock_to_sock(c, s, 1, tosidx);
 		} else if (topif == PIF_TAP) {
 			if (c->mode == MODE_VU)
-				udp_vu_sock_to_tap(c, ref.fd, 1, tosidx);
+				udp_vu_sock_to_tap(c, s, 1, tosidx);
 			else
-				udp_buf_sock_to_tap(c, ref.fd, 1, tosidx);
+				udp_buf_sock_to_tap(c, s, 1, tosidx);
 		} else if (flow_sidx_valid(tosidx)) {
-			flow_sidx_t fromsidx = flow_sidx_opposite(tosidx);
 			struct udp_flow *uflow = udp_at_sidx(tosidx);
 
 			flow_err(uflow,
 				 "No support for forwarding UDP from %s to %s",
-				 pif_name(pif_at_sidx(fromsidx)),
-				 pif_name(topif));
+				 pif_name(frompif), pif_name(topif));
 		} else {
 			debug("Discarding datagram without flow");
 		}
@@ -774,7 +773,7 @@ void udp_listen_sock_handler(const struct ctx *c,
 	}
 
 	if (events & EPOLLIN)
-		udp_listen_sock_data(c, ref, now);
+		udp_sock_fwd(c, ref.fd, ref.udp.pif, ref.udp.port, now);
 }
 
 /**

From 9eb540626047bece3f25f38e47ec3b2b0030f9f4 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:41 +1100
Subject: [PATCH 181/221] udp: Fold udp_splice_prepare and udp_splice_send into
 udp_sock_to_sock

udp_splice() prepare and udp_splice_send() are both quite simple functions
that now have only one caller: udp_sock_to_sock().  Fold them both into
that caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 55 +++++++++++++++----------------------------------------
 1 file changed, 15 insertions(+), 40 deletions(-)

diff --git a/udp.c b/udp.c
index 20d8f0c..d9d2183 100644
--- a/udp.c
+++ b/udp.c
@@ -250,43 +250,6 @@ static void udp_iov_init(const struct ctx *c)
 		udp_iov_init_one(c, i);
 }
 
-/**
- * udp_splice_prepare() - Prepare one datagram for splicing
- * @mmh:	Receiving mmsghdr array
- * @idx:	Index of the datagram to prepare
- */
-static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx)
-{
-	udp_mh_splice[idx].msg_hdr.msg_iov->iov_len = mmh[idx].msg_len;
-}
-
-/**
- * udp_splice_send() - Send a batch of datagrams from socket to socket
- * @c:		Execution context
- * @start:	Index of batch's first datagram in udp[46]_l2_buf
- * @n:		Number of datagrams in batch
- * @src:	Source port for datagram (target side)
- * @dst:	Destination port for datagrams (target side)
- * @ref:	epoll reference for origin socket
- * @now:	Timestamp
- *
- * #syscalls sendmmsg
- */
-static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
-			    flow_sidx_t tosidx)
-{
-	const struct flowside *toside = flowside_at_sidx(tosidx);
-	const struct udp_flow *uflow = udp_at_sidx(tosidx);
-	uint8_t topif = pif_at_sidx(tosidx);
-	int s = uflow->s[tosidx.sidei];
-	socklen_t sl;
-
-	pif_sockaddr(c, &udp_splice_to, &sl, topif,
-		     &toside->eaddr, toside->eport);
-
-	sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL);
-}
-
 /**
  * udp_update_hdr4() - Update headers for one IPv4 datagram
  * @ip4h:		Pre-filled IPv4 header (except for tot_len and saddr)
@@ -678,19 +641,31 @@ static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
  * @from_s:	Socket to receive datagrams from
  * @n:		Maximum number of datagrams to forward
  * @tosidx:	Flow & side to forward datagrams to
+ *
+ * #syscalls sendmmsg
  */
 static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
 			     flow_sidx_t tosidx)
 {
+	const struct flowside *toside = flowside_at_sidx(tosidx);
+	const struct udp_flow *uflow = udp_at_sidx(tosidx);
+	uint8_t topif = pif_at_sidx(tosidx);
+	int to_s = uflow->s[tosidx.sidei];
+	socklen_t sl;
 	int i;
 
 	if ((n = udp_sock_recv(c, from_s, udp_mh_recv, n)) <= 0)
 		return;
 
-	for (i = 0; i < n; i++)
-		udp_splice_prepare(udp_mh_recv, i);
+	for (i = 0; i < n; i++) {
+		udp_mh_splice[i].msg_hdr.msg_iov->iov_len
+			= udp_mh_recv[i].msg_len;
+	}
 
-	udp_splice_send(c, 0, n, tosidx);
+	pif_sockaddr(c, &udp_splice_to, &sl, topif,
+		     &toside->eaddr, toside->eport);
+
+	sendmmsg(to_s, udp_mh_splice, n, MSG_NOSIGNAL);
 }
 
 /**

From 9725e79888374a4e4060a2d798f3407c0006cc8a Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Fri, 4 Apr 2025 21:15:42 +1100
Subject: [PATCH 182/221] udp_flow: Don't discard packets that arrive between
 bind() and connect()

When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow.  However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow.  Currently we handle this by simply discarding any
queued datagrams after the connect.  UDP protocols should be able to handle
such packet loss, but it's not ideal.

We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow.  We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update comment to udp_flow_defer()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c         |  2 +-
 udp.c          |  4 +--
 udp_flow.c     | 77 +++++++++++++++++++++++++++++++++++---------------
 udp_flow.h     |  6 +++-
 udp_internal.h |  2 ++
 5 files changed, 64 insertions(+), 27 deletions(-)

diff --git a/flow.c b/flow.c
index 8622242..29a83e1 100644
--- a/flow.c
+++ b/flow.c
@@ -850,7 +850,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 				closed = icmp_ping_timer(c, &flow->ping, now);
 			break;
 		case FLOW_UDP:
-			closed = udp_flow_defer(&flow->udp);
+			closed = udp_flow_defer(c, &flow->udp, now);
 			if (!closed && timer)
 				closed = udp_flow_timer(c, &flow->udp, now);
 			break;
diff --git a/udp.c b/udp.c
index d9d2183..ed6edc1 100644
--- a/udp.c
+++ b/udp.c
@@ -698,8 +698,8 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
  * @port:	Our (local) port number of @s
  * @now:	Current timestamp
  */
-static void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
-			 in_port_t port, const struct timespec *now)
+void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+		  in_port_t port, const struct timespec *now)
 {
 	union sockaddr_inany src;
 
diff --git a/udp_flow.c b/udp_flow.c
index 5afe6e5..75f5a0b 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -9,10 +9,12 @@
 #include <fcntl.h>
 #include <sys/uio.h>
 #include <unistd.h>
+#include <netinet/udp.h>
 
 #include "util.h"
 #include "passt.h"
 #include "flow_table.h"
+#include "udp_internal.h"
 
 #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
 
@@ -67,16 +69,15 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
  * Return: fd of new socket on success, -ve error code on failure
  */
 static int udp_flow_sock(const struct ctx *c,
-			 const struct udp_flow *uflow, unsigned sidei)
+			 struct udp_flow *uflow, unsigned sidei)
 {
 	const struct flowside *side = &uflow->f.side[sidei];
-	struct mmsghdr discard[UIO_MAXIOV] = { 0 };
 	uint8_t pif = uflow->f.pif[sidei];
 	union {
 		flow_sidx_t sidx;
 		uint32_t data;
 	} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
-	int rc, s;
+	int s;
 
 	s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 	if (s < 0) {
@@ -85,30 +86,32 @@ static int udp_flow_sock(const struct ctx *c,
 	}
 
 	if (flowside_connect(c, s, pif, side) < 0) {
-		rc = -errno;
+		int rc = -errno;
 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
 		return rc;
 	}
 
-	/* It's possible, if unlikely, that we could receive some unrelated
-	 * packets in between the bind() and connect() of this socket.  For now
-	 * we just discard these.
+	/* It's possible, if unlikely, that we could receive some packets in
+	 * between the bind() and connect() which may or may not be for this
+	 * flow.  Being UDP we could just discard them, but it's not ideal.
 	 *
-	 * FIXME: Redirect these to an appropriate handler
+	 * There's also a tricky case if a bunch of datagrams for a new flow
+	 * arrive in rapid succession, the first going to the original listening
+	 * socket and later ones going to this new socket.  If we forwarded the
+	 * datagrams from the new socket immediately here they would go before
+	 * the datagram which established the flow.  Again, not strictly wrong
+	 * for UDP, but not ideal.
+	 *
+	 * So, we flag that the new socket is in a transient state where it
+	 * might have datagrams for a different flow queued.  Before the next
+	 * epoll cycle, udp_flow_defer() will flush out any such datagrams, and
+	 * thereafter everything on the new socket should be strictly for this
+	 * flow.
 	 */
-	rc = recvmmsg(s, discard, ARRAY_SIZE(discard), MSG_DONTWAIT, NULL);
-	if (rc >= ARRAY_SIZE(discard)) {
-		flow_dbg(uflow, "Too many (%d) spurious reply datagrams", rc);
-		return -E2BIG;
-	}
-
-	if (rc > 0) {
-		flow_trace(uflow, "Discarded %d spurious reply datagrams", rc);
-	} else if (errno != EAGAIN) {
-		rc = -errno;
-		flow_perror(uflow, "Unexpected error discarding datagrams");
-		return rc;
-	}
+	if (sidei)
+		uflow->flush1 = true;
+	else
+		uflow->flush0 = true;
 
 	return s;
 }
@@ -269,13 +272,41 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 }
 
 /**
- * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
+ * udp_flush_flow() - Flush datagrams that might not be for this flow
+ * @c:		Execution context
  * @uflow:	Flow to handle
+ * @sidei:	Side of the flow to flush
+ * @now:	Current timestamp
+ */
+static void udp_flush_flow(const struct ctx *c,
+			   const struct udp_flow *uflow, unsigned sidei,
+			   const struct timespec *now)
+{
+	/* We don't know exactly where the datagrams will come from, but we know
+	 * they'll have an interface and oport matching this flow */
+	udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei],
+		     uflow->f.side[sidei].oport, now);
+}
+
+/**
+ * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
+ * @c:		Execution context
+ * @uflow:	Flow to handle
+ * @now:	Current timestamp
  *
  * Return: true if the connection is ready to free, false otherwise
  */
-bool udp_flow_defer(const struct udp_flow *uflow)
+bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
+		    const struct timespec *now)
 {
+	if (uflow->flush0) {
+		udp_flush_flow(c, uflow, INISIDE, now);
+		uflow->flush0 = false;
+	}
+	if (uflow->flush1) {
+		udp_flush_flow(c, uflow, TGTSIDE, now);
+		uflow->flush1 = false;
+	}
 	return uflow->closed;
 }
 
diff --git a/udp_flow.h b/udp_flow.h
index bbdeb2a..90d3b29 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -11,6 +11,8 @@
  * struct udp_flow - Descriptor for a flow of UDP packets
  * @f:		Generic flow information
  * @closed:	Flow is already closed
+ * @flush0:	@s[0] may have datagrams queued for other flows
+ * @flush1:	@s[1] may have datagrams queued for other flows
  * @ts:		Activity timestamp
  * @s:		Socket fd (or -1) for each side of the flow
  * @ttl:	TTL or hop_limit for both sides
@@ -20,6 +22,7 @@ struct udp_flow {
 	struct flow_common f;
 
 	bool closed :1;
+	bool flush0, flush1 :1;
 	time_t ts;
 	int s[SIDES];
 	uint8_t ttl[SIDES];
@@ -35,7 +38,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 			      in_port_t srcport, in_port_t dstport,
 			      const struct timespec *now);
 void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
-bool udp_flow_defer(const struct udp_flow *uflow);
+bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
+		    const struct timespec *now);
 bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
 		    const struct timespec *now);
 
diff --git a/udp_internal.h b/udp_internal.h
index f7d8426..96d11cf 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -28,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
                        const struct flowside *toside, size_t dlen,
 		       bool no_udp_csum);
+void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+		  in_port_t port, const struct timespec *now);
 
 #endif /* UDP_INTERNAL_H */

From 06ef64cdb72475fd02c72cdd607a31a86605e734 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 8 Apr 2025 07:49:55 +0200
Subject: [PATCH 183/221] udp_flow: Save 8 bytes in struct udp_flow on 64-bit
 architectures

Shuffle the fields just added by commits a7775e9550fa ("udp: support
traceroute in direction tap-socket") and 9725e7988837 ("udp_flow:
Don't discard packets that arrive between bind() and connect()").

On x86_64, as reported by pahole(1), before:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        _Bool                      closed:1;             /*    76: 0  1 */

        /* XXX 7 bits hole, try to pack */

        _Bool                      flush0;               /*    77     1 */
        _Bool                      flush1:1;             /*    78: 0  1 */

        /* XXX 7 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */
        uint8_t                    ttl[2];               /*    96     2 */

        /* size: 104, cachelines: 2, members: 7 */
        /* sum members: 95, holes: 1, sum holes: 1 */
        /* sum bitfield members: 2 bits, bit holes: 2, sum bit holes: 14 bits */
        /* padding: 6 */
        /* last cacheline: 40 bytes */
};

and after:

struct udp_flow {
        struct flow_common         f;                    /*     0    76 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        uint8_t                    ttl[2];               /*    76     2 */
        _Bool                      closed:1;             /*    78: 0  1 */
        _Bool                      flush0:1;             /*    78: 1  1 */
        _Bool                      flush1:1;             /*    78: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 1 byte hole, try to pack */

        time_t                     ts;                   /*    80     8 */
        int                        s[2];                 /*    88     8 */

        /* size: 96, cachelines: 2, members: 7 */
        /* sum members: 94, holes: 1, sum holes: 1 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 32 bytes */
};

It doesn't matter much because anyway the typical storage for struct
udp_flow is given by union flow:

union flow {
        struct flow_common         f;                  /*     0    76 */
        struct flow_free_cluster   free;               /*     0    84 */
        struct tcp_tap_conn        tcp;                /*     0   120 */
        struct tcp_splice_conn     tcp_splice;         /*     0   120 */
        struct icmp_ping_flow      ping;               /*     0    96 */
        struct udp_flow            udp;                /*     0    96 */
};

but it still improves data locality somewhat, so let me fix this up
now that commits are fresh.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp_flow.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/udp_flow.h b/udp_flow.h
index 90d3b29..e289122 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -10,22 +10,25 @@
 /**
  * struct udp_flow - Descriptor for a flow of UDP packets
  * @f:		Generic flow information
+ * @ttl:	TTL or hop_limit for both sides
  * @closed:	Flow is already closed
  * @flush0:	@s[0] may have datagrams queued for other flows
  * @flush1:	@s[1] may have datagrams queued for other flows
  * @ts:		Activity timestamp
  * @s:		Socket fd (or -1) for each side of the flow
- * @ttl:	TTL or hop_limit for both sides
  */
 struct udp_flow {
 	/* Must be first element */
 	struct flow_common f;
 
-	bool closed :1;
-	bool flush0, flush1 :1;
+	uint8_t ttl[SIDES];
+
+	bool	closed	:1,
+		flush0	:1,
+		flush1	:1;
+
 	time_t ts;
 	int s[SIDES];
-	uint8_t ttl[SIDES];
 };
 
 struct udp_flow *udp_at_sidx(flow_sidx_t sidx);

From ffbef85e975ba117ed1c20f733d989ac08ebf325 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, 8 Apr 2025 07:57:51 +0200
Subject: [PATCH 184/221] conf: Add missing return in conf_nat(), fix
 --map-guest-addr none

As reported by somebody on IRC:

  $ pasta --map-guest-addr none
  Invalid address to remap to host: none

that's because once we parsed "none", we try to parse it as an address
as well. But we already handled it, so stop once we're done.

Fixes: e813a4df7da2 ("conf: Allow address remapped to host to be configured")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/conf.c b/conf.c
index b54c55d..168646f 100644
--- a/conf.c
+++ b/conf.c
@@ -1272,6 +1272,8 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
 		*addr6 = in6addr_any;
 		if (no_map_gw)
 			*no_map_gw = 1;
+
+		return;
 	}
 
 	if (inet_pton(AF_INET6, arg, addr6)	&&

From d3f33f3b8ec4646dae3584b648cba142a73d3208 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 9 Apr 2025 16:35:40 +1000
Subject: [PATCH 185/221] tcp_splice: Don't double count bytes read on EINTR

In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 0d10e3d..7c3b56f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -520,15 +520,14 @@ swap:
 		int more = 0;
 
 retry:
-		readlen = splice(conn->s[fromsidei], NULL,
-				 conn->pipe[fromsidei][1], NULL,
-				 c->tcp.pipe_size,
-				 SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
+		do
+			readlen = splice(conn->s[fromsidei], NULL,
+					 conn->pipe[fromsidei][1], NULL,
+					 c->tcp.pipe_size,
+					 SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
+		while (readlen < 0 && errno == EINTR);
 		flow_trace(conn, "%zi from read-side call", readlen);
 		if (readlen < 0) {
-			if (errno == EINTR)
-				goto retry;
-
 			if (errno != EAGAIN)
 				goto close;
 		} else if (!readlen) {
@@ -543,10 +542,13 @@ retry:
 				conn_flag(c, conn, lowat_act_flag);
 		}
 
-eintr:
-		written = splice(conn->pipe[fromsidei][0], NULL,
-				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
-				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
+		do
+			written = splice(conn->pipe[fromsidei][0], NULL,
+					 conn->s[!fromsidei], NULL,
+					 c->tcp.pipe_size,
+					 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
+		while (written < 0 && errno == EINTR);
+
 		flow_trace(conn, "%zi from write-side call (passed %zi)",
 			   written, c->tcp.pipe_size);
 
@@ -578,9 +580,6 @@ eintr:
 		conn->written[fromsidei] += written > 0 ? written : 0;
 
 		if (written < 0) {
-			if (errno == EINTR)
-				goto eintr;
-
 			if (errno != EAGAIN)
 				goto close;
 

From 6693fa115824d198b7cde46c272514be194500a9 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Wed, 9 Apr 2025 16:35:41 +1000
Subject: [PATCH 186/221] tcp_splice: Don't clobber errno before checking for
 EAGAIN

Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 7c3b56f..60455d6 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -526,13 +526,15 @@ retry:
 					 c->tcp.pipe_size,
 					 SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
 		while (readlen < 0 && errno == EINTR);
+
+		if (readlen < 0 && errno != EAGAIN)
+			goto close;
+
 		flow_trace(conn, "%zi from read-side call", readlen);
-		if (readlen < 0) {
-			if (errno != EAGAIN)
-				goto close;
-		} else if (!readlen) {
+
+		if (!readlen) {
 			eof = 1;
-		} else {
+		} else if (readlen > 0) {
 			never_read = 0;
 
 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
@@ -549,6 +551,9 @@ retry:
 					 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
 		while (written < 0 && errno == EINTR);
 
+		if (written < 0 && errno != EAGAIN)
+			goto close;
+
 		flow_trace(conn, "%zi from write-side call (passed %zi)",
 			   written, c->tcp.pipe_size);
 
@@ -580,9 +585,6 @@ retry:
 		conn->written[fromsidei] += written > 0 ? written : 0;
 
 		if (written < 0) {
-			if (errno != EAGAIN)
-				goto close;
-
 			if (conn->read[fromsidei] == conn->written[fromsidei])
 				break;
 

From f4b0dd8b06850bacb2da57c8576e3377daa88572 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 10 Apr 2025 17:16:38 +1000
Subject: [PATCH 187/221] udp: Use PKTINFO cmsgs to get destination address for
 received datagrams

Currently we get the source address for received datagrams from recvmsg(),
but we don't get the local destination address.  Sometimes we implicitly
know this because the receiving socket is bound to a specific address, but
when listening on 0.0.0.0 or ::, we don't.

We need this information to properly direct replies to flows which come in
to a non-default local address.  So, enable the IP_PKTINFO and IPV6_PKTINFO
control messages to obtain this information in udp_peek_addr().  For now
we log a trace messages but don't do anything more with the information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c  | 37 +++++++++++++++++++++++++++++++++++--
 util.c |  8 ++++++--
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/udp.c b/udp.c
index ed6edc1..a71141a 100644
--- a/udp.c
+++ b/udp.c
@@ -587,18 +587,29 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
 	return n_err;
 }
 
+#define PKTINFO_SPACE					\
+	MAX(CMSG_SPACE(sizeof(struct in_pktinfo)),	\
+	    CMSG_SPACE(sizeof(struct in6_pktinfo)))
+
 /**
  * udp_peek_addr() - Get source address for next packet
  * @s:		Socket to get information from
  * @src:	Socket address (output)
+ * @dst:	(Local) destination address (output)
  *
  * Return: 0 on success, -1 otherwise
  */
-static int udp_peek_addr(int s, union sockaddr_inany *src)
+static int udp_peek_addr(int s, union sockaddr_inany *src,
+			 union inany_addr *dst)
 {
+	char sastr[SOCKADDR_STRLEN], dstr[INANY_ADDRSTRLEN];
+	const struct cmsghdr *hdr;
+	char cmsg[PKTINFO_SPACE];
 	struct msghdr msg = {
 		.msg_name = src,
 		.msg_namelen = sizeof(*src),
+		.msg_control = cmsg,
+		.msg_controllen = sizeof(cmsg),
 	};
 	int rc;
 
@@ -608,6 +619,27 @@ static int udp_peek_addr(int s, union sockaddr_inany *src)
 			warn_perror("Error peeking at socket address");
 		return rc;
 	}
+
+	hdr = CMSG_FIRSTHDR(&msg);
+	if (hdr && hdr->cmsg_level == IPPROTO_IP &&
+	    hdr->cmsg_type == IP_PKTINFO) {
+		const struct in_pktinfo *info4 = (void *)CMSG_DATA(hdr);
+
+		*dst = inany_from_v4(info4->ipi_addr);
+	} else if (hdr && hdr->cmsg_level == IPPROTO_IPV6 &&
+		   hdr->cmsg_type == IPV6_PKTINFO) {
+		const struct in6_pktinfo *info6 = (void *)CMSG_DATA(hdr);
+
+		dst->a6 = info6->ipi6_addr;
+	} else {
+		debug("Unexpected cmsg on UDP datagram");
+		*dst = inany_any6;
+	}
+
+	trace("Peeked UDP datagram: %s -> %s",
+	      sockaddr_ntop(src, sastr, sizeof(sastr)),
+	      inany_ntop(dst, dstr, sizeof(dstr)));
+
 	return 0;
 }
 
@@ -702,8 +734,9 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 		  in_port_t port, const struct timespec *now)
 {
 	union sockaddr_inany src;
+	union inany_addr dst;
 
-	while (udp_peek_addr(s, &src) == 0) {
+	while (udp_peek_addr(s, &src, &dst) == 0) {
 		flow_sidx_t tosidx = udp_flow_from_sock(c, frompif, port,
 							&src, now);
 		uint8_t topif = pif_at_sidx(tosidx);
diff --git a/util.c b/util.c
index 0f68cf5..62a6003 100644
--- a/util.c
+++ b/util.c
@@ -109,11 +109,15 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		debug("Failed to set SO_REUSEADDR on socket %i", fd);
 
 	if (proto == IPPROTO_UDP) {
+		int pktinfo = af == AF_INET ? IP_PKTINFO : IPV6_RECVPKTINFO;
+		int recverr = af == AF_INET ? IP_RECVERR : IPV6_RECVERR;
 		int level = af == AF_INET ? IPPROTO_IP : IPPROTO_IPV6;
-		int opt = af == AF_INET ? IP_RECVERR : IPV6_RECVERR;
 
-		if (setsockopt(fd, level, opt, &y, sizeof(y)))
+		if (setsockopt(fd, level, recverr, &y, sizeof(y)))
 			die_perror("Failed to set RECVERR on socket %i", fd);
+
+		if (setsockopt(fd, level, pktinfo, &y, sizeof(y)))
+			die_perror("Failed to set PKTINFO on socket %i", fd);
 	}
 
 	if (ifname && *ifname) {

From 695c62396eb3f4627c1114ce444394e3ba34373a Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 10 Apr 2025 17:16:39 +1000
Subject: [PATCH 188/221] inany: Improve ASSERT message for bad socket family

inany_from_sockaddr() can only handle sockaddrs of family AF_INET or
AF_INET6 and asserts if given something else.  I hit this assertion while
debugging something else, and wanted to see what the bad sockaddr family
was.  Now that we have ASSERT_WITH_MSG() its easy to add this information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 inany.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/inany.h b/inany.h
index 6a12c29..1c247e1 100644
--- a/inany.h
+++ b/inany.h
@@ -252,7 +252,8 @@ static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
 		*port = ntohs(sa->sa4.sin_port);
 	} else {
 		/* Not valid to call with other address families */
-		ASSERT(0);
+		ASSERT_WITH_MSG(0, "Unexpected sockaddr family: %u",
+				sa->sa_family);
 	}
 }
 

From 59cc89f4cc018988428637d97745cc4c919126cb Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 10 Apr 2025 17:16:40 +1000
Subject: [PATCH 189/221] udp, udp_flow: Track our specific address on socket
 interfaces

So far for UDP flows (like TCP connections) we didn't record our address
(oaddr) in the flow table entry for socket based pifs.  That's because we
didn't have that information when a flow was initiated by a datagram coming
to a "listening" socket with 0.0.0.0 or :: address.  Even when we did have
the information, we didn't record it, to simplify address matching on
lookups.

This meant that in some circumstances we could send replies on a UDP flow
from a different address than the originating request came to, which is
surprising and breaks certain setups.

We now have code in udp_peek_addr() which does determine our address for
incoming UDP datagrams.  We can use that information to properly populate
oaddr in the flow table for flow initiated from a socket.

In order to be able to consistently match datagrams to flows, we must
*always* have a specific oaddr, not an unspecified address (that's how the
flow hash table works).  So, we also need to fill in oaddr correctly for
flows we initiate *to* sockets.  Our forwarding logic doesn't specify
oaddr here, letting the kernel decide based on the routing table.  In this
case we need to call getsockname() after connect()ing the socket to find
which local address the kernel picked.

This adds getsockname() to our seccomp profile for all variants.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c       | 14 +++++++++++---
 flow.h       |  3 ++-
 flow_table.h |  1 +
 tcp.c        |  2 +-
 udp.c        |  4 ++--
 udp_flow.c   | 36 ++++++++++++++++++++++++++++++++----
 udp_flow.h   |  3 ++-
 util.h       | 10 ++++++++++
 8 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/flow.c b/flow.c
index 29a83e1..3c81cb4 100644
--- a/flow.c
+++ b/flow.c
@@ -396,18 +396,22 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
  * @ssa:	Source socket address
+ * @daddr:	Destination address (may be NULL)
  * @dport:	Destination port
  *
  * Return: pointer to the initiating flowside information
  */
 struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 				  const union sockaddr_inany *ssa,
+				  const union inany_addr *daddr,
 				  in_port_t dport)
 {
 	struct flowside *ini = &flow->f.side[INISIDE];
 
 	inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
-	if (inany_v4(&ini->eaddr))
+	if (daddr)
+		ini->oaddr = *daddr;
+	else if (inany_v4(&ini->eaddr))
 		ini->oaddr = inany_any4;
 	else
 		ini->oaddr = inany_any6;
@@ -751,19 +755,23 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
  * @proto:	Protocol of the flow (IP L4 protocol number)
  * @pif:	Interface of the flow
  * @esa:	Socket address of the endpoint
+ * @oaddr:	Our address (may be NULL)
  * @oport:	Our port number
  *
  * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
  */
 flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
-			   const void *esa, in_port_t oport)
+			   const void *esa,
+			   const union inany_addr *oaddr, in_port_t oport)
 {
 	struct flowside side = {
 		.oport = oport,
 	};
 
 	inany_from_sockaddr(&side.eaddr, &side.eport, esa);
-	if (inany_v4(&side.eaddr))
+	if (oaddr)
+		side.oaddr = *oaddr;
+	else if (inany_v4(&side.eaddr))
 		side.oaddr = inany_any4;
 	else
 		side.oaddr = inany_any6;
diff --git a/flow.h b/flow.h
index dcf7645..cac618a 100644
--- a/flow.h
+++ b/flow.h
@@ -243,7 +243,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
 			   const void *eaddr, const void *oaddr,
 			   in_port_t eport, in_port_t oport);
 flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
-			   const void *esa, in_port_t oport);
+			   const void *esa,
+			   const union inany_addr *oaddr, in_port_t oport);
 
 union flow;
 
diff --git a/flow_table.h b/flow_table.h
index fd2c57b..2d5c65c 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -199,6 +199,7 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
 					const void *daddr, in_port_t dport);
 struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 				  const union sockaddr_inany *ssa,
+				  const union inany_addr *daddr,
 				  in_port_t dport);
 const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
 				      sa_family_t af,
diff --git a/tcp.c b/tcp.c
index 35626c9..9c6bc52 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2201,7 +2201,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 	 * mode only, below.
 	 */
 	ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa,
-			       ref.tcp_listen.port);
+			       NULL, ref.tcp_listen.port);
 
 	if (c->mode == MODE_VU) { /* Rebind to same address after migration */
 		if (!getsockname(s, &sa.sa, &sl))
diff --git a/udp.c b/udp.c
index a71141a..40af7df 100644
--- a/udp.c
+++ b/udp.c
@@ -737,8 +737,8 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 	union inany_addr dst;
 
 	while (udp_peek_addr(s, &src, &dst) == 0) {
-		flow_sidx_t tosidx = udp_flow_from_sock(c, frompif, port,
-							&src, now);
+		flow_sidx_t tosidx = udp_flow_from_sock(c, frompif,
+							&dst, port, &src, now);
 		uint8_t topif = pif_at_sidx(tosidx);
 
 		if (pif_is_socket(topif)) {
diff --git a/udp_flow.c b/udp_flow.c
index 75f5a0b..ef2cbb0 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -123,14 +123,17 @@ static int udp_flow_sock(const struct ctx *c,
  * @now:	Timestamp
  *
  * Return: UDP specific flow, if successful, NULL on failure
+ *
+ * #syscalls getsockname
  */
 static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 				const struct timespec *now)
 {
 	struct udp_flow *uflow = NULL;
+	const struct flowside *tgt;
 	unsigned sidei;
 
-	if (!flow_target(c, flow, IPPROTO_UDP))
+	if (!(tgt = flow_target(c, flow, IPPROTO_UDP)))
 		goto cancel;
 
 	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
@@ -144,6 +147,29 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 				goto cancel;
 	}
 
+	if (uflow->s[TGTSIDE] >= 0 && inany_is_unspecified(&tgt->oaddr)) {
+		/* When we target a socket, we connect() it, but might not
+		 * always bind(), leaving the kernel to pick our address.  In
+		 * that case connect() will implicitly bind() the socket, but we
+		 * need to determine its local address so that we can match
+		 * reply packets back to the correct flow.  Update the flow with
+		 * the information from getsockname() */
+		union sockaddr_inany sa;
+		socklen_t sl = sizeof(sa);
+		in_port_t port;
+
+		if (getsockname(uflow->s[TGTSIDE], &sa.sa, &sl) < 0) {
+			flow_perror(uflow, "Unable to determine local address");
+			goto cancel;
+		}
+		inany_from_sockaddr(&uflow->f.side[TGTSIDE].oaddr,
+				    &port, &sa);
+		if (port != tgt->oport) {
+			flow_err(uflow, "Unexpected local port");
+			goto cancel;
+		}
+	}
+
 	/* Tap sides always need to be looked up by hash.  Socket sides don't
 	 * always, but sometimes do (receiving packets on a socket not specific
 	 * to one flow).  Unconditionally hash both sides so all our bases are
@@ -167,6 +193,7 @@ cancel:
  * udp_flow_from_sock() - Find or create UDP flow for incoming datagram
  * @c:		Execution context
  * @pif:	Interface the datagram is arriving from
+ * @dst:	Our (local) address to which the datagram is arriving
  * @port:	Our (local) port number to which the datagram is arriving
  * @s_in:	Source socket address, filled in by recvmmsg()
  * @now:	Timestamp
@@ -176,7 +203,8 @@ cancel:
  * Return: sidx for the destination side of the flow for this packet, or
  *         FLOW_SIDX_NONE if we couldn't find or create a flow.
  */
-flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
+flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif,
+			       const union inany_addr *dst, in_port_t port,
 			       const union sockaddr_inany *s_in,
 			       const struct timespec *now)
 {
@@ -185,7 +213,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
 	union flow *flow;
 	flow_sidx_t sidx;
 
-	sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, s_in, port);
+	sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, s_in, dst, port);
 	if ((uflow = udp_at_sidx(sidx))) {
 		uflow->ts = now->tv_sec;
 		return flow_sidx_opposite(sidx);
@@ -199,7 +227,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
 		return FLOW_SIDX_NONE;
 	}
 
-	ini = flow_initiate_sa(flow, pif, s_in, port);
+	ini = flow_initiate_sa(flow, pif, s_in, dst, port);
 
 	if (!inany_is_unicast(&ini->eaddr) ||
 	    ini->eport == 0 || ini->oport == 0) {
diff --git a/udp_flow.h b/udp_flow.h
index e289122..4c528e9 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -32,7 +32,8 @@ struct udp_flow {
 };
 
 struct udp_flow *udp_at_sidx(flow_sidx_t sidx);
-flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, in_port_t port,
+flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif,
+			       const union inany_addr *dst, in_port_t port,
 			       const union sockaddr_inany *s_in,
 			       const struct timespec *now);
 flow_sidx_t udp_flow_from_tap(const struct ctx *c,
diff --git a/util.h b/util.h
index b1e7e79..cc7d084 100644
--- a/util.h
+++ b/util.h
@@ -371,6 +371,16 @@ static inline int wrap_accept4(int sockfd, struct sockaddr *addr,
 #define accept4(s, addr, addrlen, flags) \
 	wrap_accept4((s), (addr), (addrlen), (flags))
 
+static inline int wrap_getsockname(int sockfd, struct sockaddr *addr,
+/* cppcheck-suppress constParameterPointer */
+				   socklen_t *addrlen)
+{
+	sa_init(addr, addrlen);
+	return getsockname(sockfd, addr, addrlen);
+}
+#define getsockname(s, addr, addrlen) \
+	wrap_getsockname((s), (addr), (addrlen))
+
 #define PASST_MAXDNAME 254 /* 253 (RFC 1035) + 1 (the terminator) */
 void encode_domain_name(char *buf, const char *domain_name);
 

From bbff3653d6412690eee1a079d584a7365d2ed886 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 11 Apr 2025 09:58:31 +0200
Subject: [PATCH 190/221] conf: Split add_dns_resolv() into separate IPv4 and
 IPv6 versions

Not really valuable by itself, but dropping one level of nested blocks
makes the next change more convenient.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c | 101 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 60 insertions(+), 41 deletions(-)

diff --git a/conf.c b/conf.c
index 168646f..18ed11c 100644
--- a/conf.c
+++ b/conf.c
@@ -414,6 +414,62 @@ static unsigned add_dns6(struct ctx *c, const struct in6_addr *addr,
 	return 1;
 }
 
+/**
+ * add_dns_resolv4() - Possibly add one IPv4 nameserver from host's resolv.conf
+ * @c:		Execution context
+ * @ns:		Nameserver address
+ * @idx:	Pointer to index of current IPv4 resolver entry, set on return
+ */
+static void add_dns_resolv4(struct ctx *c, struct in_addr *ns, unsigned *idx)
+{
+	if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
+		c->ip4.dns_host = *ns;
+
+	/* Special handling if guest or container can only access local
+	 * addresses via redirect, or if the host gateway is also a resolver and
+	 * we shadow its address
+	 */
+	if (IN4_IS_ADDR_LOOPBACK(ns) ||
+	    IN4_ARE_ADDR_EQUAL(ns, &c->ip4.map_host_loopback)) {
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
+			return;
+
+		*ns = c->ip4.map_host_loopback;
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
+			c->ip4.dns_match = c->ip4.map_host_loopback;
+	}
+
+	*idx += add_dns4(c, ns, *idx);
+}
+
+/**
+ * add_dns_resolv6() - Possibly add one IPv6 nameserver from host's resolv.conf
+ * @c:		Execution context
+ * @ns:		Nameserver address
+ * @idx:	Pointer to index of current IPv6 resolver entry, set on return
+ */
+static void add_dns_resolv6(struct ctx *c, struct in6_addr *ns, unsigned *idx)
+{
+	if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
+		c->ip6.dns_host = *ns;
+
+	/* Special handling if guest or container can only access local
+	 * addresses via redirect, or if the host gateway is also a resolver and
+	 * we shadow its address
+	 */
+	if (IN6_IS_ADDR_LOOPBACK(ns) ||
+	    IN6_ARE_ADDR_EQUAL(ns, &c->ip6.map_host_loopback)) {
+		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
+			return;
+
+		*ns = c->ip6.map_host_loopback;
+		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
+			c->ip6.dns_match = c->ip6.map_host_loopback;
+	}
+
+	*idx += add_dns6(c, ns, *idx);
+}
+
 /**
  * add_dns_resolv() - Possibly add ns from host resolv.conf to configuration
  * @c:		Execution context
@@ -430,48 +486,11 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 	struct in6_addr ns6;
 	struct in_addr ns4;
 
-	if (idx4 && inet_pton(AF_INET, nameserver, &ns4)) {
-		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
-			c->ip4.dns_host = ns4;
+	if (idx4 && inet_pton(AF_INET, nameserver, &ns4))
+		add_dns_resolv4(c, &ns4, idx4);
 
-		/* Special handling if guest or container can only access local
-		 * addresses via redirect, or if the host gateway is also a
-		 * resolver and we shadow its address
-		 */
-		if (IN4_IS_ADDR_LOOPBACK(&ns4) ||
-		    IN4_ARE_ADDR_EQUAL(&ns4, &c->ip4.map_host_loopback)) {
-			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
-				return;
-
-			ns4 = c->ip4.map_host_loopback;
-			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
-				c->ip4.dns_match = c->ip4.map_host_loopback;
-		}
-
-		*idx4 += add_dns4(c, &ns4, *idx4);
-	}
-
-	if (idx6 && inet_pton(AF_INET6, nameserver, &ns6)) {
-		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
-			c->ip6.dns_host = ns6;
-
-		/* Special handling if guest or container can only access local
-		 * addresses via redirect, or if the host gateway is also a
-		 * resolver and we shadow its address
-		 */
-		if (IN6_IS_ADDR_LOOPBACK(&ns6) ||
-		    IN6_ARE_ADDR_EQUAL(&ns6, &c->ip6.map_host_loopback)) {
-			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
-				return;
-
-			ns6 = c->ip6.map_host_loopback;
-
-			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
-				c->ip6.dns_match = c->ip6.map_host_loopback;
-		}
-
-		*idx6 += add_dns6(c, &ns6, *idx6);
-	}
+	if (idx6 && inet_pton(AF_INET6, nameserver, &ns6))
+		add_dns_resolv6(c, &ns6, idx6);
 }
 
 /**

From 50249086a967c54ff5b2521038cbe1d27303958c Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 11 Apr 2025 10:50:00 +0200
Subject: [PATCH 191/221] conf: Honour --dns-forward for local resolver even
 with --no-map-gw

If the first resolver listed in the host's /etc/resolv.conf is a
loopback address, and --no-map-gw is given, we automatically conclude
that the resolver is not reachable, discard it, and, if it's the only
nameserver listed in /etc/resolv.conf, we'll warn that we:

  Couldn't get any nameserver address

However, this isn't true in a general case: the user might have passed
--dns-forward, and in that case, while we won't map the address of the
default gateway to the host, we're still supposed to map that
particular address. Otherwise, in this common Podman usage:

  pasta --config-net --dns-forward 169.254.1.1 -t none -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-c02a8d8f-6ee3-902e-33c5-317e0f24e0af --map-guest-addr 169.254.1.2

and with a loopback address in /etc/resolv.conf, we'll unexpectedly
refuse to forward DNS queries:

  # nslookup passt.top 169.254.1.1
  ;; connection timed out; no servers could be reached

To fix this, make an exception for --dns-forward: if &c->ip4.dns_match
or &c->ip6.dns_match are set in add_dns_resolv4() / add_dns_resolv6(),
use that address as guest-facing resolver.

We already set 'dns_host' to the address we found in /etc/resolv.conf,
that's correct in this case and it makes us forward queries as
expected.

I'm not changing the man page as the current description of
--dns-forward is already consistent with the new behaviour: there's no
described way in which --no-map-gw should affect it.

Reported-by: Andrew Sayers <andrew-bugs.passt.top@pileofstuff.org>
Link: https://bugs.passt.top/show_bug.cgi?id=111
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/conf.c b/conf.c
index 18ed11c..f942851 100644
--- a/conf.c
+++ b/conf.c
@@ -431,12 +431,19 @@ static void add_dns_resolv4(struct ctx *c, struct in_addr *ns, unsigned *idx)
 	 */
 	if (IN4_IS_ADDR_LOOPBACK(ns) ||
 	    IN4_ARE_ADDR_EQUAL(ns, &c->ip4.map_host_loopback)) {
-		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
-			return;
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match)) {
+			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
+				return;		/* Address unreachable */
 
-		*ns = c->ip4.map_host_loopback;
-		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
+			*ns = c->ip4.map_host_loopback;
 			c->ip4.dns_match = c->ip4.map_host_loopback;
+		} else {
+			/* No general host mapping, but requested for DNS
+			 * (--dns-forward and --no-map-gw): advertise resolver
+			 * address from --dns-forward, and map that to loopback
+			 */
+			*ns = c->ip4.dns_match;
+		}
 	}
 
 	*idx += add_dns4(c, ns, *idx);
@@ -459,12 +466,19 @@ static void add_dns_resolv6(struct ctx *c, struct in6_addr *ns, unsigned *idx)
 	 */
 	if (IN6_IS_ADDR_LOOPBACK(ns) ||
 	    IN6_ARE_ADDR_EQUAL(ns, &c->ip6.map_host_loopback)) {
-		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
-			return;
+		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) {
+			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
+				return;		/* Address unreachable */
 
-		*ns = c->ip6.map_host_loopback;
-		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
+			*ns = c->ip6.map_host_loopback;
 			c->ip6.dns_match = c->ip6.map_host_loopback;
+		} else {
+			/* No general host mapping, but requested for DNS
+			 * (--dns-forward and --no-map-gw): advertise resolver
+			 * address from --dns-forward, and map that to loopback
+			 */
+			*ns = c->ip6.dns_match;
+		}
 	}
 
 	*idx += add_dns6(c, ns, *idx);

From baf049f8e06b7f0a73dfa7913297679a75aad381 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:18 +1000
Subject: [PATCH 192/221] udp: Fix breakage of UDP error handling by PKTINFO
 support

We recently enabled the IP_PKTINFO / IPV6_RECVPKTINFO socket options on our
UDP sockets.  This lets us obtain and properly handle the specific local
address used when we're "listening" with a socket on 0.0.0.0 or ::.

However, the PKTINFO cmsgs this option generates appear on error queue
messages as well as regular datagrams.  udp_sock_recverr() doesn't expect
this and so flags an unrecoverable error when it can't parse the control
message.

Correct this by adding space in udp_sock_recverr()s control buffer for the
additional PKTINFO data, and scan through all cmsgs for the RECVERR, rather
than only looking at the first one.

Link: https://bugs.passt.top/show_bug.cgi?id=99
Fixes: f4b0dd8b0685 ("udp: Use PKTINFO cmsgs to get destination address for received datagrams")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/udp.c b/udp.c
index 40af7df..f5fb98c 100644
--- a/udp.c
+++ b/udp.c
@@ -155,6 +155,10 @@ __attribute__ ((aligned(32)))
 #endif
 udp_meta[UDP_MAX_FRAMES];
 
+#define PKTINFO_SPACE					\
+	MAX(CMSG_SPACE(sizeof(struct in_pktinfo)),	\
+	    CMSG_SPACE(sizeof(struct in6_pktinfo)))
+
 /**
  * enum udp_iov_idx - Indices for the buffers making up a single UDP frame
  * @UDP_IOV_TAP         tap specific header
@@ -476,10 +480,10 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 		struct sock_extended_err ee;
 		union sockaddr_inany saddr;
 	};
-	const struct errhdr *eh;
-	const struct cmsghdr *hdr;
-	char buf[CMSG_SPACE(sizeof(struct errhdr))];
+	char buf[PKTINFO_SPACE + CMSG_SPACE(sizeof(struct errhdr))];
 	char data[ICMP6_MAX_DLEN];
+	const struct errhdr *eh;
+	struct cmsghdr *hdr;
 	int s = ref.fd;
 	struct iovec iov = {
 		.iov_base = data,
@@ -507,12 +511,16 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 		return -1;
 	}
 
-	hdr = CMSG_FIRSTHDR(&mh);
-	if (!((hdr->cmsg_level == IPPROTO_IP &&
-	       hdr->cmsg_type == IP_RECVERR) ||
-	      (hdr->cmsg_level == IPPROTO_IPV6 &&
-	       hdr->cmsg_type == IPV6_RECVERR))) {
-		err("Unexpected cmsg reading error queue");
+	for (hdr = CMSG_FIRSTHDR(&mh); hdr; hdr = CMSG_NXTHDR(&mh, hdr)) {
+		if ((hdr->cmsg_level == IPPROTO_IP &&
+		      hdr->cmsg_type == IP_RECVERR) ||
+		     (hdr->cmsg_level == IPPROTO_IPV6 &&
+		      hdr->cmsg_type == IPV6_RECVERR))
+		    break;
+	}
+
+	if (!hdr) {
+		err("Missing RECVERR cmsg in error queue");
 		return -1;
 	}
 
@@ -587,10 +595,6 @@ static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
 	return n_err;
 }
 
-#define PKTINFO_SPACE					\
-	MAX(CMSG_SPACE(sizeof(struct in_pktinfo)),	\
-	    CMSG_SPACE(sizeof(struct in6_pktinfo)))
-
 /**
  * udp_peek_addr() - Get source address for next packet
  * @s:		Socket to get information from

From 1bb8145c221a9124ca1671e64b27de173ff2d82d Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:19 +1000
Subject: [PATCH 193/221] udp: Be quieter about errors on UDP receive

If we get an error on UDP receive, either in udp_peek_addr() or
udp_sock_recv(), we'll print an error message.  However, this could be
a perfectly routine UDP error triggered by an ICMP, which need not go to
the error log.

This doesn't usually happen, because before receiving we typically clear
the error queue from udp_sock_errs().  However, it's possible an error
could be flagged after udp_sock_errs() but before we receive.  So it's
better to handle this error "silently" (trace level only).  We'll bail out
of the receive, return to the epoll loop, and get an EPOLLERR where we'll
handle and report the error properly.

In particular there's one situation that can trigger this case much more
easily.  If we start a new outbound UDP flow to a local destination with
nothing listening, we'll get a more or less immediate connection refused
error.  So, we'll get that error on the very first receive after the
connect().  That will occur in udp_flow_defer() -> udp_flush_flow() ->
udp_sock_fwd() -> udp_peek_addr() -> recvmsg().  This path doesn't call
udp_sock_errs() first, so isn't (imperfectly) protected the way we are
most of the time.

Fixes: 84ab1305faba ("udp: Polish udp_vu_sock_info() and remove from vu specific code")
Fixes: 69e5393c3722 ("udp: Move some more of sock_handler tasks into sub-functions")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/udp.c b/udp.c
index f5fb98c..154f99b 100644
--- a/udp.c
+++ b/udp.c
@@ -619,8 +619,8 @@ static int udp_peek_addr(int s, union sockaddr_inany *src,
 
 	rc = recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
 	if (rc < 0) {
-		if (errno != EAGAIN && errno != EWOULDBLOCK)
-			warn_perror("Error peeking at socket address");
+		trace("Error peeking at socket address: %s", strerror_(errno));
+		/* Bail out and let the EPOLLERR handler deal with it */
 		return rc;
 	}
 
@@ -664,7 +664,8 @@ static int udp_sock_recv(const struct ctx *c, int s, struct mmsghdr *mmh, int n)
 
 	n = recvmmsg(s, mmh, n, 0, NULL);
 	if (n < 0) {
-		err_perror("Error receiving datagrams");
+		trace("Error receiving datagrams: %s", strerror_(errno));
+		/* Bail out and let the EPOLLERR handler deal with it */
 		return 0;
 	}
 

From 3f995586b35494b08631081fbf609ff932110849 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:20 +1000
Subject: [PATCH 194/221] udp: Pass socket & flow information direction to
 error handling functions

udp_sock_recverr() and udp_sock_errs() take an epoll reference from which
they obtain both the socket fd to receive errors from, and - for flow
specific sockets - the flow and side the socket is associated with.

We have some upcoming cases where we want to clear errors when we're not
directly associated with receiving an epoll event, so it's not natural to
have an epoll reference.  Therefore, make these functions take the socket
and flow from explicit parameters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/udp.c b/udp.c
index 154f99b..c51ac95 100644
--- a/udp.c
+++ b/udp.c
@@ -467,14 +467,15 @@ static void udp_send_tap_icmp6(const struct ctx *c,
 /**
  * udp_sock_recverr() - Receive and clear an error from a socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @s:		Socket to receive errors from
+ * @sidx:	Flow and side of @s, or FLOW_SIDX_NONE if unknown
  *
  * Return: 1 if error received and processed, 0 if no more errors in queue, < 0
  *         if there was an error reading the queue
  *
  * #syscalls recvmsg
  */
-static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
+static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
 {
 	struct errhdr {
 		struct sock_extended_err ee;
@@ -484,7 +485,6 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 	char data[ICMP6_MAX_DLEN];
 	const struct errhdr *eh;
 	struct cmsghdr *hdr;
-	int s = ref.fd;
 	struct iovec iov = {
 		.iov_base = data,
 		.iov_len = sizeof(data)
@@ -525,12 +525,12 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 	}
 
 	eh = (const struct errhdr *)CMSG_DATA(hdr);
-	if (ref.type == EPOLL_TYPE_UDP) {
-		flow_sidx_t sidx = flow_sidx_opposite(ref.flowside);
-		const struct flowside *toside = flowside_at_sidx(sidx);
+	if (flow_sidx_valid(sidx)) {
+		flow_sidx_t tosidx = flow_sidx_opposite(sidx);
+		const struct flowside *toside = flowside_at_sidx(tosidx);
 		size_t dlen = rc;
 
-		if (pif_is_socket(pif_at_sidx(sidx))) {
+		if (pif_is_socket(pif_at_sidx(tosidx))) {
 			/* XXX Is there any way to propagate ICMPs from socket
 			 * to socket? */
 		} else if (hdr->cmsg_level == IPPROTO_IP) {
@@ -554,21 +554,21 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
 /**
  * udp_sock_errs() - Process errors on a socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @s:		Socket to receive errors from
+ * @sidx:	Flow and side of @s, or FLOW_SIDX_NONE if unknown
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-static int udp_sock_errs(const struct ctx *c, union epoll_ref ref)
+static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
-	int s = ref.fd;
 	int rc, err;
 
 	ASSERT(!c->no_udp);
 
 	/* Empty the error queue */
-	while ((rc = udp_sock_recverr(c, ref)) > 0)
+	while ((rc = udp_sock_recverr(c, s, sidx)) > 0)
 		n_err += rc;
 
 	if (rc < 0)
@@ -777,7 +777,7 @@ void udp_listen_sock_handler(const struct ctx *c,
 			     const struct timespec *now)
 {
 	if (events & EPOLLERR) {
-		if (udp_sock_errs(c, ref) < 0) {
+		if (udp_sock_errs(c, ref.fd, FLOW_SIDX_NONE) < 0) {
 			err("UDP: Unrecoverable error on listening socket:"
 			    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
 			/* FIXME: what now?  close/re-open socket? */
@@ -804,7 +804,7 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 	ASSERT(!c->no_udp && uflow);
 
 	if (events & EPOLLERR) {
-		if (udp_sock_errs(c, ref) < 0) {
+		if (udp_sock_errs(c, ref.fd, ref.flowside) < 0) {
 			flow_err(uflow, "Unrecoverable error on flow socket");
 			goto fail;
 		}

From 04984578b00f7507a05544b7a5490b03ab2d5135 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:21 +1000
Subject: [PATCH 195/221] udp: Deal with errors as we go in udp_sock_fwd()

When we get an epoll event on a listening socket, we first deal with any
errors (udp_sock_errs()), then with any received packets (udp_sock_fwd()).
However, it's theoretically possible that new errors could get flagged on
the socket after we call udp_sock_errs(), in which case we could get errors
returned in in udp_sock_fwd() -> udp_peek_addr() -> recvmsg().

In fact, we do deal with this correctly, although the path is somewhat
non-obvious.  The recvmsg() error will cause us to bail out of
udp_sock_fwd(), but the EPOLLERR event will now be flagged, so we'll come
back here next epoll loop and call udp_sock_errs().

Except.. we call udp_sock_fwd() from udp_flush_flow() as well as from
epoll events.  This is to deal with any packets that arrived between bind()
and connect(), and so might not be associated with the socket's intended
flow.  This expects udp_sock_fwd() to flush _all_ queued datagrams, so that
anything received later must be for the correct flow.

At the moment, udp_sock_errs() might fail to flush all datagrams if errors
occur.  In particular this can happen in practice for locally reported
errors which occur immediately after connect() (e.g. connecting to a local
port with nothing listening).

We can deal with the problem case, and also make the flow a little more
natural for the common case by having udp_sock_fwd() call udp_sock_errs()
to handle errors as the occur, rather than trying to deal with all errors
in advance.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/udp.c b/udp.c
index c51ac95..0bec499 100644
--- a/udp.c
+++ b/udp.c
@@ -601,7 +601,7 @@ static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx)
  * @src:	Socket address (output)
  * @dst:	(Local) destination address (output)
  *
- * Return: 0 on success, -1 otherwise
+ * Return: 0 if no more packets, 1 on success, -ve error code on error
  */
 static int udp_peek_addr(int s, union sockaddr_inany *src,
 			 union inany_addr *dst)
@@ -619,9 +619,9 @@ static int udp_peek_addr(int s, union sockaddr_inany *src,
 
 	rc = recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
 	if (rc < 0) {
-		trace("Error peeking at socket address: %s", strerror_(errno));
-		/* Bail out and let the EPOLLERR handler deal with it */
-		return rc;
+		if (errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+		return -errno;
 	}
 
 	hdr = CMSG_FIRSTHDR(&msg);
@@ -644,7 +644,7 @@ static int udp_peek_addr(int s, union sockaddr_inany *src,
 	      sockaddr_ntop(src, sastr, sizeof(sastr)),
 	      inany_ntop(dst, dstr, sizeof(dstr)));
 
-	return 0;
+	return 1;
 }
 
 /**
@@ -740,11 +740,27 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 {
 	union sockaddr_inany src;
 	union inany_addr dst;
+	int rc;
 
-	while (udp_peek_addr(s, &src, &dst) == 0) {
-		flow_sidx_t tosidx = udp_flow_from_sock(c, frompif,
-							&dst, port, &src, now);
-		uint8_t topif = pif_at_sidx(tosidx);
+	while ((rc = udp_peek_addr(s, &src, &dst)) != 0) {
+		flow_sidx_t tosidx;
+		uint8_t topif;
+
+		if (rc < 0) {
+			trace("Error peeking at socket address: %s",
+			      strerror_(-rc));
+			/* Clear errors & carry on */
+			if (udp_sock_errs(c, s, FLOW_SIDX_NONE) < 0) {
+				err(
+"UDP: Unrecoverable error on listening socket: (%s port %hu)",
+				    pif_name(frompif), port);
+				/* FIXME: what now?  close/re-open socket? */
+			}
+			continue;
+		}
+
+		tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, now);
+		topif = pif_at_sidx(tosidx);
 
 		if (pif_is_socket(topif)) {
 			udp_sock_to_sock(c, s, 1, tosidx);
@@ -776,16 +792,7 @@ void udp_listen_sock_handler(const struct ctx *c,
 			     union epoll_ref ref, uint32_t events,
 			     const struct timespec *now)
 {
-	if (events & EPOLLERR) {
-		if (udp_sock_errs(c, ref.fd, FLOW_SIDX_NONE) < 0) {
-			err("UDP: Unrecoverable error on listening socket:"
-			    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
-			/* FIXME: what now?  close/re-open socket? */
-			return;
-		}
-	}
-
-	if (events & EPOLLIN)
+	if (events & (EPOLLERR | EPOLLIN))
 		udp_sock_fwd(c, ref.fd, ref.udp.pif, ref.udp.port, now);
 }
 

From f107a86cc05c83c5755861b00b85cdf0eb5c9534 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:22 +1000
Subject: [PATCH 196/221] udp: Add udp_pktinfo() helper

Currently we open code parsing the control message for IP_PKTINFO in
udp_peek_addr().  We have an upcoming case where we want to parse PKTINFO
in another place, so split this out into a helper function.

While we're there, make the parsing a bit more robust: scan all cmsgs to
look for the one we want, rather than assuming there's only one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: udp_pktinfo(): Fix typo in comment and change err() to debug()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 52 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/udp.c b/udp.c
index 0bec499..97034f6 100644
--- a/udp.c
+++ b/udp.c
@@ -464,6 +464,41 @@ static void udp_send_tap_icmp6(const struct ctx *c,
 	tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
 }
 
+/**
+ * udp_pktinfo() - Retrieve packet destination address from cmsg
+ * @msg:	msghdr into which message has been received
+ * @dst:	(Local) destination address of message in @mh (output)
+ *
+ * Return: 0 on success, -1 if the information was missing (@dst is set to
+ *         inany_any6).
+ */
+static int udp_pktinfo(struct msghdr *msg, union inany_addr *dst)
+{
+	struct cmsghdr *hdr;
+
+	for (hdr = CMSG_FIRSTHDR(msg); hdr; hdr = CMSG_NXTHDR(msg, hdr)) {
+		if (hdr->cmsg_level == IPPROTO_IP &&
+		    hdr->cmsg_type == IP_PKTINFO) {
+			const struct in_pktinfo *i4 = (void *)CMSG_DATA(hdr);
+
+			*dst = inany_from_v4(i4->ipi_addr);
+			return 0;
+		}
+
+		if (hdr->cmsg_level == IPPROTO_IPV6 &&
+			   hdr->cmsg_type == IPV6_PKTINFO) {
+			const struct in6_pktinfo *i6 = (void *)CMSG_DATA(hdr);
+
+			dst->a6 = i6->ipi6_addr;
+			return 0;
+		}
+	}
+
+	debug("Missing PKTINFO cmsg on datagram");
+	*dst = inany_any6;
+	return -1;
+}
+
 /**
  * udp_sock_recverr() - Receive and clear an error from a socket
  * @c:		Execution context
@@ -607,7 +642,6 @@ static int udp_peek_addr(int s, union sockaddr_inany *src,
 			 union inany_addr *dst)
 {
 	char sastr[SOCKADDR_STRLEN], dstr[INANY_ADDRSTRLEN];
-	const struct cmsghdr *hdr;
 	char cmsg[PKTINFO_SPACE];
 	struct msghdr msg = {
 		.msg_name = src,
@@ -624,21 +658,7 @@ static int udp_peek_addr(int s, union sockaddr_inany *src,
 		return -errno;
 	}
 
-	hdr = CMSG_FIRSTHDR(&msg);
-	if (hdr && hdr->cmsg_level == IPPROTO_IP &&
-	    hdr->cmsg_type == IP_PKTINFO) {
-		const struct in_pktinfo *info4 = (void *)CMSG_DATA(hdr);
-
-		*dst = inany_from_v4(info4->ipi_addr);
-	} else if (hdr && hdr->cmsg_level == IPPROTO_IPV6 &&
-		   hdr->cmsg_type == IPV6_PKTINFO) {
-		const struct in6_pktinfo *info6 = (void *)CMSG_DATA(hdr);
-
-		dst->a6 = info6->ipi6_addr;
-	} else {
-		debug("Unexpected cmsg on UDP datagram");
-		*dst = inany_any6;
-	}
+	udp_pktinfo(&msg, dst);
 
 	trace("Peeked UDP datagram: %s -> %s",
 	      sockaddr_ntop(src, sastr, sizeof(sastr)),

From cfc0ee145a5cdd29b6e584171085dac6539b86c0 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:23 +1000
Subject: [PATCH 197/221] udp: Minor re-organisation of udp_sock_recverr()

Usually we work with the "exit early" flow style, where we return early
on "error" conditions in functions.  We don't currently do this in
udp_sock_recverr() for the case where we don't have a flow to associate
the error with.

Reorganise to use the "exit early" style, which will make some subsequent
changes less awkward.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/udp.c b/udp.c
index 97034f6..e8240fe 100644
--- a/udp.c
+++ b/udp.c
@@ -530,6 +530,9 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
 		.msg_control = buf,
 		.msg_controllen = sizeof(buf),
 	};
+	const struct flowside *toside;
+	flow_sidx_t tosidx;
+	size_t dlen;
 	ssize_t rc;
 
 	rc = recvmsg(s, &mh, MSG_ERRQUEUE);
@@ -560,29 +563,32 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
 	}
 
 	eh = (const struct errhdr *)CMSG_DATA(hdr);
-	if (flow_sidx_valid(sidx)) {
-		flow_sidx_t tosidx = flow_sidx_opposite(sidx);
-		const struct flowside *toside = flowside_at_sidx(tosidx);
-		size_t dlen = rc;
 
-		if (pif_is_socket(pif_at_sidx(tosidx))) {
-			/* XXX Is there any way to propagate ICMPs from socket
-			 * to socket? */
-		} else if (hdr->cmsg_level == IPPROTO_IP) {
-			dlen = MIN(dlen, ICMP4_MAX_DLEN);
-			udp_send_tap_icmp4(c, &eh->ee, toside,
-					   eh->saddr.sa4.sin_addr, data, dlen);
-		} else if (hdr->cmsg_level == IPPROTO_IPV6) {
-			udp_send_tap_icmp6(c, &eh->ee, toside,
-					   &eh->saddr.sa6.sin6_addr, data,
-					   dlen, sidx.flowi);
-		}
-	} else {
-		trace("Ignoring received IP_RECVERR cmsg on listener socket");
-	}
 	debug("%s error on UDP socket %i: %s",
 	      str_ee_origin(&eh->ee), s, strerror_(eh->ee.ee_errno));
 
+	if (!flow_sidx_valid(sidx)) {
+		trace("Ignoring received IP_RECVERR cmsg on listener socket");
+		return 1;
+	}
+
+	tosidx = flow_sidx_opposite(sidx);
+	toside = flowside_at_sidx(tosidx);
+	dlen = rc;
+
+	if (pif_is_socket(pif_at_sidx(tosidx))) {
+		/* XXX Is there any way to propagate ICMPs from socket to
+		 * socket? */
+	} else if (hdr->cmsg_level == IPPROTO_IP) {
+		dlen = MIN(dlen, ICMP4_MAX_DLEN);
+		udp_send_tap_icmp4(c, &eh->ee, toside,
+				   eh->saddr.sa4.sin_addr, data, dlen);
+	} else if (hdr->cmsg_level == IPPROTO_IPV6) {
+		udp_send_tap_icmp6(c, &eh->ee, toside,
+				   &eh->saddr.sa6.sin6_addr, data,
+				   dlen, sidx.flowi);
+	}
+
 	return 1;
 }
 

From 2340bbf867e6c3c3b5ac67345b0e841ab49bbaa5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 15 Apr 2025 17:16:24 +1000
Subject: [PATCH 198/221] udp: Propagate errors on listening and brand new
 sockets

udp_sock_recverr() processes errors on UDP sockets and attempts to
propagate them as ICMP packets on the tap interface.  To do this it
currently requires the flow with which the error is associated as a
parameter.  If that's missing it will clear the error condition, but not
propagate it.

That means that we largely ignore errors on "listening" sockets.  It also
means we may discard some errors on flow specific sockets if they occur
very shortly after the socket is created.  In udp_flush_flow() we need to
clear any datagrams received between bind() and connect() which might not
be associated with the "final" flow for the socket.  If we get errors
before that point we'll ignore them in the same way because we don't know
the flow they're associated with in advance.

This can happen in practice if we have errors which occur almost
immediately after connect(), such as ECONNREFUSED when we connect() to a
local address where nothing is listening.

Between the extended error message itself and the PKTINFO information we
do actually have enough information to find the correct flow.  So, rather
than ignoring errors where we don't have a flow "hint", determine the flow
the hard way in udp_sock_recverr().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change warn() to debug() in udp_sock_recverr()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/udp.c b/udp.c
index e8240fe..57769d0 100644
--- a/udp.c
+++ b/udp.c
@@ -504,27 +504,34 @@ static int udp_pktinfo(struct msghdr *msg, union inany_addr *dst)
  * @c:		Execution context
  * @s:		Socket to receive errors from
  * @sidx:	Flow and side of @s, or FLOW_SIDX_NONE if unknown
+ * @pif:	Interface on which the error occurred
+ *              (only used if @sidx == FLOW_SIDX_NONE)
+ * @port:	Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
  *
  * Return: 1 if error received and processed, 0 if no more errors in queue, < 0
  *         if there was an error reading the queue
  *
  * #syscalls recvmsg
  */
-static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
+static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
+			    uint8_t pif, in_port_t port)
 {
 	struct errhdr {
 		struct sock_extended_err ee;
 		union sockaddr_inany saddr;
 	};
 	char buf[PKTINFO_SPACE + CMSG_SPACE(sizeof(struct errhdr))];
+	const struct errhdr *eh = NULL;
 	char data[ICMP6_MAX_DLEN];
-	const struct errhdr *eh;
 	struct cmsghdr *hdr;
 	struct iovec iov = {
 		.iov_base = data,
 		.iov_len = sizeof(data)
 	};
+	union sockaddr_inany src;
 	struct msghdr mh = {
+		.msg_name = &src,
+		.msg_namelen = sizeof(src),
 		.msg_iov = &iov,
 		.msg_iovlen = 1,
 		.msg_control = buf,
@@ -554,7 +561,7 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
 		      hdr->cmsg_type == IP_RECVERR) ||
 		     (hdr->cmsg_level == IPPROTO_IPV6 &&
 		      hdr->cmsg_type == IPV6_RECVERR))
-		    break;
+			break;
 	}
 
 	if (!hdr) {
@@ -568,8 +575,19 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
 	      str_ee_origin(&eh->ee), s, strerror_(eh->ee.ee_errno));
 
 	if (!flow_sidx_valid(sidx)) {
-		trace("Ignoring received IP_RECVERR cmsg on listener socket");
-		return 1;
+		/* No hint from the socket, determine flow from addresses */
+		union inany_addr dst;
+
+		if (udp_pktinfo(&mh, &dst) < 0) {
+			debug("Missing PKTINFO on UDP error");
+			return 1;
+		}
+
+		sidx = flow_lookup_sa(c, IPPROTO_UDP, pif, &src, &dst, port);
+		if (!flow_sidx_valid(sidx)) {
+			debug("Ignoring UDP error without flow");
+			return 1;
+		}
 	}
 
 	tosidx = flow_sidx_opposite(sidx);
@@ -597,10 +615,14 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx)
  * @c:		Execution context
  * @s:		Socket to receive errors from
  * @sidx:	Flow and side of @s, or FLOW_SIDX_NONE if unknown
+ * @pif:	Interface on which the error occurred
+ *              (only used if @sidx == FLOW_SIDX_NONE)
+ * @port:	Local port number of @s (only used if @sidx == FLOW_SIDX_NONE)
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx)
+static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx,
+			 uint8_t pif, in_port_t port)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
@@ -609,7 +631,7 @@ static int udp_sock_errs(const struct ctx *c, int s, flow_sidx_t sidx)
 	ASSERT(!c->no_udp);
 
 	/* Empty the error queue */
-	while ((rc = udp_sock_recverr(c, s, sidx)) > 0)
+	while ((rc = udp_sock_recverr(c, s, sidx, pif, port)) > 0)
 		n_err += rc;
 
 	if (rc < 0)
@@ -776,7 +798,8 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 			trace("Error peeking at socket address: %s",
 			      strerror_(-rc));
 			/* Clear errors & carry on */
-			if (udp_sock_errs(c, s, FLOW_SIDX_NONE) < 0) {
+			if (udp_sock_errs(c, s, FLOW_SIDX_NONE,
+					  frompif, port) < 0) {
 				err(
 "UDP: Unrecoverable error on listening socket: (%s port %hu)",
 				    pif_name(frompif), port);
@@ -837,7 +860,7 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 	ASSERT(!c->no_udp && uflow);
 
 	if (events & EPOLLERR) {
-		if (udp_sock_errs(c, ref.fd, ref.flowside) < 0) {
+		if (udp_sock_errs(c, ref.fd, ref.flowside, PIF_NONE, 0) < 0) {
 			flow_err(uflow, "Unrecoverable error on flow socket");
 			goto fail;
 		}

From 9128f6e8f47d94c761b5fd8c0d0b8308758cbdc5 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 17 Apr 2025 11:55:40 +1000
Subject: [PATCH 199/221] fwd: Split out helpers for port-independent NAT

Currently the functions fwd_nat_from_*() make some address translations
based on both the IP address and protocol port numbers, and others based
only on the address.  We have some upcoming cases where it's useful to use
the IP-address-only translations separately, so split them out into helper
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 fwd.c | 87 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 25 deletions(-)

diff --git a/fwd.c b/fwd.c
index 2829cd2..5c70e83 100644
--- a/fwd.c
+++ b/fwd.c
@@ -323,6 +323,30 @@ static bool fwd_guest_accessible(const struct ctx *c,
 	return fwd_guest_accessible6(c, &addr->a6);
 }
 
+/**
+ * nat_outbound() - Apply address translation for outbound (TAP to HOST)
+ * @c:		Execution context
+ * @addr:	Input address (as seen on TAP interface)
+ * @translated:	Output address (as seen on HOST interface)
+ *
+ * Only handles translations that depend *only* on the address.  Anything
+ * related to specific ports or flows is handled elsewhere.
+ */
+static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
+			 union inany_addr *translated)
+{
+	if (inany_equals4(addr, &c->ip4.map_host_loopback))
+		*translated = inany_loopback4;
+	else if (inany_equals6(addr, &c->ip6.map_host_loopback))
+		*translated = inany_loopback6;
+	else if (inany_equals4(addr, &c->ip4.map_guest_addr))
+		*translated = inany_from_v4(c->ip4.addr);
+	else if (inany_equals6(addr, &c->ip6.map_guest_addr))
+		translated->a6 = c->ip6.addr;
+	else
+		*translated = *addr;
+}
+
 /**
  * fwd_nat_from_tap() - Determine to forward a flow from the tap interface
  * @c:		Execution context
@@ -342,16 +366,8 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 	else if (is_dns_flow(proto, ini) &&
 		   inany_equals6(&ini->oaddr, &c->ip6.dns_match))
 		tgt->eaddr.a6 = c->ip6.dns_host;
-	else if (inany_equals4(&ini->oaddr, &c->ip4.map_host_loopback))
-		tgt->eaddr = inany_loopback4;
-	else if (inany_equals6(&ini->oaddr, &c->ip6.map_host_loopback))
-		tgt->eaddr = inany_loopback6;
-	else if (inany_equals4(&ini->oaddr, &c->ip4.map_guest_addr))
-		tgt->eaddr = inany_from_v4(c->ip4.addr);
-	else if (inany_equals6(&ini->oaddr, &c->ip6.map_guest_addr))
-		tgt->eaddr.a6 = c->ip6.addr;
 	else
-		tgt->eaddr = ini->oaddr;
+		nat_outbound(c, &ini->oaddr, &tgt->eaddr);
 
 	tgt->eport = ini->oport;
 
@@ -423,6 +439,42 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 	return PIF_HOST;
 }
 
+/**
+ * nat_inbound() - Apply address translation for outbound (HOST to TAP)
+ * @c:		Execution context
+ * @addr:	Input address (as seen on HOST interface)
+ * @translated:	Output address (as seen on TAP interface)
+ *
+ * Return: true on success, false if it couldn't translate the address
+ *
+ * Only handles translations that depend *only* on the address.  Anything
+ * related to specific ports or flows is handled elsewhere.
+ */
+static bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
+			 union inany_addr *translated)
+{
+	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
+	    inany_equals4(addr, &in4addr_loopback)) {
+		/* Specifically 127.0.0.1, not 127.0.0.0/8 */
+		*translated = inany_from_v4(c->ip4.map_host_loopback);
+	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
+		   inany_equals6(addr, &in6addr_loopback)) {
+		translated->a6 = c->ip6.map_host_loopback;
+	} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
+		   inany_equals4(addr, &c->ip4.addr)) {
+		*translated = inany_from_v4(c->ip4.map_guest_addr);
+	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
+		   inany_equals6(addr, &c->ip6.addr)) {
+		translated->a6 = c->ip6.map_guest_addr;
+	} else if (fwd_guest_accessible(c, addr)) {
+		*translated = *addr;
+	} else {
+		return false;
+	}
+
+	return true;
+}
+
 /**
  * fwd_nat_from_host() - Determine to forward a flow from the host interface
  * @c:		Execution context
@@ -479,20 +531,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		return PIF_SPLICE;
 	}
 
-	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
-	    inany_equals4(&ini->eaddr, &in4addr_loopback)) {
-		/* Specifically 127.0.0.1, not 127.0.0.0/8 */
-		tgt->oaddr = inany_from_v4(c->ip4.map_host_loopback);
-	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
-		   inany_equals6(&ini->eaddr, &in6addr_loopback)) {
-		tgt->oaddr.a6 = c->ip6.map_host_loopback;
-	} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
-		   inany_equals4(&ini->eaddr, &c->ip4.addr)) {
-		tgt->oaddr = inany_from_v4(c->ip4.map_guest_addr);
-	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
-		   inany_equals6(&ini->eaddr, &c->ip6.addr)) {
-		tgt->oaddr.a6 = c->ip6.map_guest_addr;
-	} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
+	if (!nat_inbound(c, &ini->eaddr, &tgt->oaddr)) {
 		if (inany_v4(&ini->eaddr)) {
 			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
 				/* No source address we can use */
@@ -501,8 +540,6 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		} else {
 			tgt->oaddr.a6 = c->ip6.our_tap_ll;
 		}
-	} else {
-		tgt->oaddr = ini->eaddr;
 	}
 	tgt->oport = ini->eport;
 

From 4668e9137806b551f6ee44609064cc40243c2b6b Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 17 Apr 2025 11:55:41 +1000
Subject: [PATCH 200/221] treewide: Improve robustness against sockaddrs of
 unexpected family

inany_from_sockaddr() expects a socket address of family AF_INET or
AF_INET6 and ASSERT()s if it gets anything else.  In many of the callers we
can handle an unexpected family more gracefully, though, e.g. by failing
a single flow rather than killing passt.

Change inany_from_sockaddr() to return an error instead of ASSERT()ing,
and handle those errors in the callers.  Improve the reporting of any such
errors while we're at it.

With this greater robustness, allow inany_from_sockaddr() to take a void *
rather than specifically a union sockaddr_inany *.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c     | 16 ++++++++++++++--
 inany.h    | 30 ++++++++++++++++++------------
 tcp.c      | 10 ++++------
 udp_flow.c |  6 +++---
 4 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/flow.c b/flow.c
index 3c81cb4..447c021 100644
--- a/flow.c
+++ b/flow.c
@@ -408,7 +408,12 @@ struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 {
 	struct flowside *ini = &flow->f.side[INISIDE];
 
-	inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
+	if (inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa) < 0) {
+		char str[SOCKADDR_STRLEN];
+
+		ASSERT_WITH_MSG(0, "Bad socket address %s",
+				sockaddr_ntop(ssa, str, sizeof(str)));
+	}
 	if (daddr)
 		ini->oaddr = *daddr;
 	else if (inany_v4(&ini->eaddr))
@@ -768,7 +773,14 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
 		.oport = oport,
 	};
 
-	inany_from_sockaddr(&side.eaddr, &side.eport, esa);
+	if (inany_from_sockaddr(&side.eaddr, &side.eport, esa) < 0) {
+		char str[SOCKADDR_STRLEN];
+
+		warn("Flow lookup on bad socket address %s",
+		     sockaddr_ntop(esa, str, sizeof(str)));
+		return FLOW_SIDX_NONE;
+	}
+
 	if (oaddr)
 		side.oaddr = *oaddr;
 	else if (inany_v4(&side.eaddr))
diff --git a/inany.h b/inany.h
index 1c247e1..7ca5cbd 100644
--- a/inany.h
+++ b/inany.h
@@ -237,24 +237,30 @@ static inline void inany_from_af(union inany_addr *aa,
 }
 
 /** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
- * @aa:		Pointer to store IPv[46] address
+ * @dst:	Pointer to store IPv[46] address (output)
  * @port:	Pointer to store port number, host order
- * @addr:	AF_INET or AF_INET6 socket address
+ * @addr:	Socket address
+ *
+ * Return: 0 on success, -1 on error (bad address family)
  */
-static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
-				       const union sockaddr_inany *sa)
+static inline int inany_from_sockaddr(union inany_addr *dst, in_port_t *port,
+				      const void *addr)
 {
+	const union sockaddr_inany *sa = (const union sockaddr_inany *)addr;
+
 	if (sa->sa_family == AF_INET6) {
-		inany_from_af(aa, AF_INET6, &sa->sa6.sin6_addr);
+		inany_from_af(dst, AF_INET6, &sa->sa6.sin6_addr);
 		*port = ntohs(sa->sa6.sin6_port);
-	} else if (sa->sa_family == AF_INET) {
-		inany_from_af(aa, AF_INET, &sa->sa4.sin_addr);
-		*port = ntohs(sa->sa4.sin_port);
-	} else {
-		/* Not valid to call with other address families */
-		ASSERT_WITH_MSG(0, "Unexpected sockaddr family: %u",
-				sa->sa_family);
+		return 0;
 	}
+
+	if (sa->sa_family == AF_INET) {
+		inany_from_af(dst, AF_INET, &sa->sa4.sin_addr);
+		*port = ntohs(sa->sa4.sin_port);
+		return 0;
+	}
+
+	return -1;
 }
 
 /** inany_siphash_feed- Fold IPv[46] address into an in-progress siphash
diff --git a/tcp.c b/tcp.c
index 9c6bc52..0ac298a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1546,9 +1546,8 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 
 	if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
 		sl = sizeof(sa);
-		if (!getsockname(s, &sa.sa, &sl))
-			inany_from_sockaddr(&tgt->oaddr, &tgt->oport, &sa);
-		else
+		if (getsockname(s, &sa.sa, &sl) ||
+		    inany_from_sockaddr(&tgt->oaddr, &tgt->oport, &sa) < 0)
 			err_perror("Can't get local address for socket %i", s);
 	}
 
@@ -2204,9 +2203,8 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 			       NULL, ref.tcp_listen.port);
 
 	if (c->mode == MODE_VU) { /* Rebind to same address after migration */
-		if (!getsockname(s, &sa.sa, &sl))
-			inany_from_sockaddr(&ini->oaddr, &ini->oport, &sa);
-		else
+		if (getsockname(s, &sa.sa, &sl) ||
+		    inany_from_sockaddr(&ini->oaddr, &ini->oport, &sa) < 0)
 			err_perror("Can't get local address for socket %i", s);
 	}
 
diff --git a/udp_flow.c b/udp_flow.c
index ef2cbb0..fea1cf3 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -158,12 +158,12 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
 		socklen_t sl = sizeof(sa);
 		in_port_t port;
 
-		if (getsockname(uflow->s[TGTSIDE], &sa.sa, &sl) < 0) {
+		if (getsockname(uflow->s[TGTSIDE], &sa.sa, &sl) < 0 ||
+		    inany_from_sockaddr(&uflow->f.side[TGTSIDE].oaddr,
+					&port, &sa) < 0) {
 			flow_perror(uflow, "Unable to determine local address");
 			goto cancel;
 		}
-		inany_from_sockaddr(&uflow->f.side[TGTSIDE].oaddr,
-				    &port, &sa);
 		if (port != tgt->oport) {
 			flow_err(uflow, "Unexpected local port");
 			goto cancel;

From 08e617ec2ba916d8250a41d3ac68183124a6ec3e Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 17 Apr 2025 11:55:42 +1000
Subject: [PATCH 201/221] udp: Rework offender address handling in
 udp_sock_recverr()

Make a number of changes to udp_sock_recverr() to improve the robustness
of how we handle addresses.

 * Get the "offender" address (source of the ICMP packet) using the
   SO_EE_OFFENDER() macro, reducing assumptions about structure layout.
 * Parse the offender sockaddr using inany_from_sockaddr()
 * Check explicitly that the source and destination pifs are what we
   expect.  Previously we checked something that was probably equivalent
   in practice, but isn't strictly speaking what we require for the rest
   of the code.
 * Verify that for an ICMPv4 error we also have an IPv4 source/offender
   and destination/endpoint address
 * Verify that for an ICMPv6 error we have an IPv6 endpoint
 * Improve debug reporting of any failures

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp.c | 69 +++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/udp.c b/udp.c
index 57769d0..d09b3eb 100644
--- a/udp.c
+++ b/udp.c
@@ -159,6 +159,12 @@ udp_meta[UDP_MAX_FRAMES];
 	MAX(CMSG_SPACE(sizeof(struct in_pktinfo)),	\
 	    CMSG_SPACE(sizeof(struct in6_pktinfo)))
 
+#define RECVERR_SPACE							\
+	MAX(CMSG_SPACE(sizeof(struct sock_extended_err) +		\
+		       sizeof(struct sockaddr_in)),			\
+	    CMSG_SPACE(sizeof(struct sock_extended_err) +		\
+		       sizeof(struct sockaddr_in6)))
+
 /**
  * enum udp_iov_idx - Indices for the buffers making up a single UDP frame
  * @UDP_IOV_TAP         tap specific header
@@ -516,12 +522,8 @@ static int udp_pktinfo(struct msghdr *msg, union inany_addr *dst)
 static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 			    uint8_t pif, in_port_t port)
 {
-	struct errhdr {
-		struct sock_extended_err ee;
-		union sockaddr_inany saddr;
-	};
-	char buf[PKTINFO_SPACE + CMSG_SPACE(sizeof(struct errhdr))];
-	const struct errhdr *eh = NULL;
+	char buf[PKTINFO_SPACE + RECVERR_SPACE];
+	const struct sock_extended_err *ee;
 	char data[ICMP6_MAX_DLEN];
 	struct cmsghdr *hdr;
 	struct iovec iov = {
@@ -538,7 +540,13 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 		.msg_controllen = sizeof(buf),
 	};
 	const struct flowside *toside;
-	flow_sidx_t tosidx;
+	char astr[INANY_ADDRSTRLEN];
+	char sastr[SOCKADDR_STRLEN];
+	union inany_addr offender;
+	const struct in_addr *o4;
+	in_port_t offender_port;
+	struct udp_flow *uflow;
+	uint8_t topif;
 	size_t dlen;
 	ssize_t rc;
 
@@ -569,10 +577,10 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 		return -1;
 	}
 
-	eh = (const struct errhdr *)CMSG_DATA(hdr);
+	ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
 
 	debug("%s error on UDP socket %i: %s",
-	      str_ee_origin(&eh->ee), s, strerror_(eh->ee.ee_errno));
+	      str_ee_origin(ee), s, strerror_(ee->ee_errno));
 
 	if (!flow_sidx_valid(sidx)) {
 		/* No hint from the socket, determine flow from addresses */
@@ -588,25 +596,44 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 			debug("Ignoring UDP error without flow");
 			return 1;
 		}
+	} else {
+		pif = pif_at_sidx(sidx);
 	}
 
-	tosidx = flow_sidx_opposite(sidx);
-	toside = flowside_at_sidx(tosidx);
+	uflow = udp_at_sidx(sidx);
+	ASSERT(uflow);
+	toside = &uflow->f.side[!sidx.sidei];
+	topif = uflow->f.pif[!sidx.sidei];
 	dlen = rc;
 
-	if (pif_is_socket(pif_at_sidx(tosidx))) {
-		/* XXX Is there any way to propagate ICMPs from socket to
-		 * socket? */
-	} else if (hdr->cmsg_level == IPPROTO_IP) {
+	if (inany_from_sockaddr(&offender, &offender_port,
+				SO_EE_OFFENDER(ee)) < 0)
+		goto fail;
+
+	if (pif != PIF_HOST || topif != PIF_TAP)
+		/* XXX Can we support any other cases? */
+		goto fail;
+
+	if (hdr->cmsg_level == IPPROTO_IP &&
+	    (o4 = inany_v4(&offender)) && inany_v4(&toside->eaddr)) {
 		dlen = MIN(dlen, ICMP4_MAX_DLEN);
-		udp_send_tap_icmp4(c, &eh->ee, toside,
-				   eh->saddr.sa4.sin_addr, data, dlen);
-	} else if (hdr->cmsg_level == IPPROTO_IPV6) {
-		udp_send_tap_icmp6(c, &eh->ee, toside,
-				   &eh->saddr.sa6.sin6_addr, data,
-				   dlen, sidx.flowi);
+		udp_send_tap_icmp4(c, ee, toside, *o4, data, dlen);
+		return 1;
 	}
 
+	if (hdr->cmsg_level == IPPROTO_IPV6 && !inany_v4(&toside->eaddr)) {
+		udp_send_tap_icmp6(c, ee, toside, &offender.a6, data, dlen,
+				   FLOW_IDX(uflow));
+		return 1;
+	}
+
+fail:
+	flow_dbg(uflow, "Can't propagate %s error from %s %s to %s %s",
+		 str_ee_origin(ee),
+		 pif_name(pif),
+		 sockaddr_ntop(SO_EE_OFFENDER(ee), sastr, sizeof(sastr)),
+		 pif_name(topif),
+		 inany_ntop(&toside->eaddr, astr, sizeof(astr)));
 	return 1;
 }
 

From 436afc30447c6f0ce516f2b38c769833114bb5f8 Mon Sep 17 00:00:00 2001
From: David Gibson <david@gibson.dropbear.id.au>
Date: Thu, 17 Apr 2025 11:55:43 +1000
Subject: [PATCH 202/221] udp: Translate offender addresses for ICMP messages

We've recently added support for propagating ICMP errors related to a UDP
flow from the host to the guest, by handling the extended UDP error on the
socket and synthesizing a suitable ICMP on the tap interface.

Currently we create that ICMP with a source address of the "offender" from
the extended error information - the source of the ICMP error received on
the host.  However, we don't translate this address for cases where we NAT
between host and guest.  This means (amongst other things) that we won't
get a "Connection refused" error as expected if send data from the guest to
the --map-host-loopback address.  The error comes from 127.0.0.1 on the
host, which doesn't make sense on the tap interface and will be discarded
by the guest.

Because ICMP errors can be sent by an intermediate host, not just by the
endpoints of the flow, we can't handle this translation purely with the
information in the flow table entry.  We need to explicitly translate this
address by our NAT rules, which we can do with the nat_inbound() helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 fwd.c |  4 ++--
 fwd.h |  3 +++
 udp.c | 18 ++++++++++++++----
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fwd.c b/fwd.c
index 5c70e83..b73c2c8 100644
--- a/fwd.c
+++ b/fwd.c
@@ -450,8 +450,8 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
  * Only handles translations that depend *only* on the address.  Anything
  * related to specific ports or flows is handled elsewhere.
  */
-static bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
-			 union inany_addr *translated)
+bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
+		 union inany_addr *translated)
 {
 	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback) &&
 	    inany_equals4(addr, &in4addr_loopback)) {
diff --git a/fwd.h b/fwd.h
index 3562f3c..0458a3c 100644
--- a/fwd.h
+++ b/fwd.h
@@ -7,6 +7,7 @@
 #ifndef FWD_H
 #define FWD_H
 
+union inany_addr;
 struct flowside;
 
 /* Number of ports for both TCP and UDP */
@@ -47,6 +48,8 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
 			const struct fwd_ports *tcp_rev);
 void fwd_scan_ports_init(struct ctx *c);
 
+bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
+		 union inany_addr *translated);
 uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 			 const struct flowside *ini, struct flowside *tgt);
 uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
diff --git a/udp.c b/udp.c
index d09b3eb..f5a5cd1 100644
--- a/udp.c
+++ b/udp.c
@@ -539,10 +539,10 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 		.msg_control = buf,
 		.msg_controllen = sizeof(buf),
 	};
-	const struct flowside *toside;
+	const struct flowside *fromside, *toside;
+	union inany_addr offender, otap;
 	char astr[INANY_ADDRSTRLEN];
 	char sastr[SOCKADDR_STRLEN];
-	union inany_addr offender;
 	const struct in_addr *o4;
 	in_port_t offender_port;
 	struct udp_flow *uflow;
@@ -602,6 +602,7 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 
 	uflow = udp_at_sidx(sidx);
 	ASSERT(uflow);
+	fromside = &uflow->f.side[sidx.sidei];
 	toside = &uflow->f.side[!sidx.sidei];
 	topif = uflow->f.pif[!sidx.sidei];
 	dlen = rc;
@@ -614,15 +615,24 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx,
 		/* XXX Can we support any other cases? */
 		goto fail;
 
+	/* If the offender *is* the endpoint, make sure our translation is
+	 * consistent with the flow's translation.  This matters if the flow
+	 * endpoint has a port specific translation (like --dns-match).
+	 */
+	if (inany_equals(&offender, &fromside->eaddr))
+		otap = toside->oaddr;
+	else if (!nat_inbound(c, &offender, &otap))
+		goto fail;
+
 	if (hdr->cmsg_level == IPPROTO_IP &&
-	    (o4 = inany_v4(&offender)) && inany_v4(&toside->eaddr)) {
+	    (o4 = inany_v4(&otap)) && inany_v4(&toside->eaddr)) {
 		dlen = MIN(dlen, ICMP4_MAX_DLEN);
 		udp_send_tap_icmp4(c, ee, toside, *o4, data, dlen);
 		return 1;
 	}
 
 	if (hdr->cmsg_level == IPPROTO_IPV6 && !inany_v4(&toside->eaddr)) {
-		udp_send_tap_icmp6(c, ee, toside, &offender.a6, data, dlen,
+		udp_send_tap_icmp6(c, ee, toside, &otap.a6, data, dlen,
 				   FLOW_IDX(uflow));
 		return 1;
 	}

From aa1cc8922867b8f7c17742f8da3b9fcc6291bbeb Mon Sep 17 00:00:00 2001
From: Alyssa Ross <hi@alyssa.is>
Date: Sat, 26 Apr 2025 10:44:25 +0200
Subject: [PATCH 203/221] conf: allow --fd 0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

inetd-style socket passing traditionally starts a service with a
connected socket on file descriptors 0 and 1.  passt disallowing
obtaining its socket from either of these descriptors made it
difficult to use with super-servers providing this interface — in my
case I wanted to use passt with s6-ipcserver[1].  Since (as far as I
can tell) passt does not use standard input for anything else (unlike
standard output), it should be safe to relax the restrictions on --fd
to allow setting it to 0, enabling this use case.

Link: https://skarnet.org/software/s6/s6-ipcserver.html [1]
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 3 ++-
 util.c | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/conf.c b/conf.c
index f942851..a6d7e22 100644
--- a/conf.c
+++ b/conf.c
@@ -1717,7 +1717,8 @@ void conf(struct ctx *c, int argc, char **argv)
 			fd_tap_opt = strtol(optarg, NULL, 0);
 
 			if (errno ||
-			    fd_tap_opt <= STDERR_FILENO || fd_tap_opt > INT_MAX)
+			    (fd_tap_opt != STDIN_FILENO && fd_tap_opt <= STDERR_FILENO) ||
+			    fd_tap_opt > INT_MAX)
 				die("Invalid --fd: %s", optarg);
 
 			c->fd_tap = fd_tap_opt;
diff --git a/util.c b/util.c
index 62a6003..f5497d4 100644
--- a/util.c
+++ b/util.c
@@ -875,7 +875,9 @@ void close_open_files(int argc, char **argv)
 			errno = 0;
 			fd = strtol(optarg, NULL, 0);
 
-			if (errno || fd <= STDERR_FILENO || fd > INT_MAX)
+			if (errno ||
+			    (fd != STDIN_FILENO && fd <= STDERR_FILENO) ||
+			    fd > INT_MAX)
 				die("Invalid --fd: %s", optarg);
 		}
 	} while (name != -1);

From ea0a1240df671de221f469327899564ed74b5edd Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 30 Apr 2025 16:48:34 +0200
Subject: [PATCH 204/221] passt-repair: Hide bogus gcc warning from -Og
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When building with gcc 13 and -Og, we get:

passt-repair.c: In function ‘main’:
passt-repair.c:161:23: warning: ‘ev’ may be used uninitialized [-Wmaybe-uninitialized]
  161 |                 if (ev->len > NAME_MAX + 1 || ev->name[ev->len - 1] != '\0') {
      |                     ~~^~~~~

but that can't actually happen, because we only exit the preceding
while loop if 'found' is true, and that only happens, in turn, as we
assign 'ev'.

Get rid of the warning by (redundantly) initialising ev to NULL.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt-repair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt-repair.c b/passt-repair.c
index 256a8c9..ff1c44f 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -113,7 +113,7 @@ int main(int argc, char **argv)
 	if ((sb.st_mode & S_IFMT) == S_IFDIR) {
 		char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
 		   __attribute__ ((aligned(__alignof__(struct inotify_event))));
-		const struct inotify_event *ev;
+		const struct inotify_event *ev = NULL;
 		char path[PATH_MAX + 1];
 		bool found = false;
 		ssize_t n;

From 6a96cd97a5fda26a8f12531a72f6a969e476ad9e Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Wed, 30 Apr 2025 16:59:13 +0200
Subject: [PATCH 205/221] util: Fix typo, ASSSERTION -> ASSERTION

Fixes: 9153aca15bc1 ("util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 util.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util.h b/util.h
index cc7d084..5947337 100644
--- a/util.h
+++ b/util.h
@@ -75,7 +75,7 @@ void abort_with_msg(const char *fmt, ...)
 #define ASSERT_WITH_MSG(expr, ...)					\
 	((expr) ? (void)0 : abort_with_msg(__VA_ARGS__))
 #define ASSERT(expr)							\
-	ASSERT_WITH_MSG((expr), "ASSSERTION FAILED in %s (%s:%d): %s",	\
+	ASSERT_WITH_MSG((expr), "ASSERTION FAILED in %s (%s:%d): %s",	\
 			__func__, __FILE__, __LINE__, STRINGIFY(expr))
 
 #ifdef P_tmpdir

From 11be695f5c0a6a7d74e9628e9863e665f59d511f Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Wed, 30 Apr 2025 18:05:25 +0200
Subject: [PATCH 206/221] flow: fix podman issue #25959

While running piHole using podman, traffic can trigger the following
assert:

ASSSERTION FAILED in flow_alloc (flow.c:521): flow->f.state == FLOW_STATE_FREE

Backtrace shows that this happens in flow_defer_handler():

    #4  0x00005610d6f5b481 flow_alloc (passt + 0xb481)
    #5  0x00005610d6f74f86 udp_flow_from_sock (passt + 0x24f86)
    #6  0x00005610d6f737c3 udp_sock_fwd (passt + 0x237c3)
    #7  0x00005610d6f74c07 udp_flush_flow (passt + 0x24c07)
    #8  0x00005610d6f752c2 udp_flow_defer (passt + 0x252c2)
    #9  0x00005610d6f5bce1 flow_defer_handler (passt + 0xbce1)

We are trying to allocate a new flow inside the loop freeing them.

Inside the loop free_head points to the first free flow entry in the
current cluster. But if we allocate a new entry during the loop,
free_head is not updated and can point now to the entry we have just
allocated.

We can fix the problem by spliting the loop in two parts:
- first part where we can close some of them and allocate some new
  flow entries,
- second part where we free the entries closed in the previous loop
  and we aggregate the free entries to merge consecutive the clusters.

Reported-by: Martin Rijntjes <bugs@air-global.nl>
Link: https://github.com/containers/podman/issues/25959
Fixes: 9725e7988837 ("udp_flow: Don't discard packets that arrive between bind() and connect()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 109 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 58 insertions(+), 51 deletions(-)

diff --git a/flow.c b/flow.c
index 447c021..c5718e3 100644
--- a/flow.c
+++ b/flow.c
@@ -800,6 +800,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
 	struct flow_free_cluster *free_head = NULL;
 	unsigned *last_next = &flow_first_free;
+	bool to_free[FLOW_MAX] = { 0 };
 	bool timer = false;
 	union flow *flow;
 
@@ -810,9 +811,44 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 
 	ASSERT(!flow_new_entry); /* Incomplete flow at end of cycle */
 
-	flow_foreach_slot(flow) {
+	/* Check which flows we might need to close first, but don't free them
+	 * yet as it's not safe to do that in the middle of flow_foreach().
+	 */
+	flow_foreach(flow) {
 		bool closed = false;
 
+		switch (flow->f.type) {
+		case FLOW_TYPE_NONE:
+			ASSERT(false);
+			break;
+		case FLOW_TCP:
+			closed = tcp_flow_defer(&flow->tcp);
+			break;
+		case FLOW_TCP_SPLICE:
+			closed = tcp_splice_flow_defer(&flow->tcp_splice);
+			if (!closed && timer)
+				tcp_splice_timer(c, &flow->tcp_splice);
+			break;
+		case FLOW_PING4:
+		case FLOW_PING6:
+			if (timer)
+				closed = icmp_ping_timer(c, &flow->ping, now);
+			break;
+		case FLOW_UDP:
+			closed = udp_flow_defer(c, &flow->udp, now);
+			if (!closed && timer)
+				closed = udp_flow_timer(c, &flow->udp, now);
+			break;
+		default:
+			/* Assume other flow types don't need any handling */
+			;
+		}
+
+		to_free[FLOW_IDX(flow)] = closed;
+	}
+
+	/* Second step: actually free the flows */
+	flow_foreach_slot(flow) {
 		switch (flow->f.state) {
 		case FLOW_STATE_FREE: {
 			unsigned skip = flow->free.n;
@@ -845,59 +881,30 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			break;
 
 		case FLOW_STATE_ACTIVE:
-			/* Nothing to do */
-			break;
+			if (to_free[FLOW_IDX(flow)]) {
+				flow_set_state(&flow->f, FLOW_STATE_FREE);
+				memset(flow, 0, sizeof(*flow));
 
-		default:
-			ASSERT(false);
-		}
-
-		switch (flow->f.type) {
-		case FLOW_TYPE_NONE:
-			ASSERT(false);
-			break;
-		case FLOW_TCP:
-			closed = tcp_flow_defer(&flow->tcp);
-			break;
-		case FLOW_TCP_SPLICE:
-			closed = tcp_splice_flow_defer(&flow->tcp_splice);
-			if (!closed && timer)
-				tcp_splice_timer(c, &flow->tcp_splice);
-			break;
-		case FLOW_PING4:
-		case FLOW_PING6:
-			if (timer)
-				closed = icmp_ping_timer(c, &flow->ping, now);
-			break;
-		case FLOW_UDP:
-			closed = udp_flow_defer(c, &flow->udp, now);
-			if (!closed && timer)
-				closed = udp_flow_timer(c, &flow->udp, now);
-			break;
-		default:
-			/* Assume other flow types don't need any handling */
-			;
-		}
-
-		if (closed) {
-			flow_set_state(&flow->f, FLOW_STATE_FREE);
-			memset(flow, 0, sizeof(*flow));
-
-			if (free_head) {
-				/* Add slot to current free cluster */
-				ASSERT(FLOW_IDX(flow) ==
-				       FLOW_IDX(free_head) + free_head->n);
-				free_head->n++;
-				flow->free.n = flow->free.next = 0;
+				if (free_head) {
+					/* Add slot to current free cluster */
+					ASSERT(FLOW_IDX(flow) ==
+					    FLOW_IDX(free_head) + free_head->n);
+					free_head->n++;
+					flow->free.n = flow->free.next = 0;
+				} else {
+					/* Create new free cluster */
+					free_head = &flow->free;
+					free_head->n = 1;
+					*last_next = FLOW_IDX(flow);
+					last_next = &free_head->next;
+				}
 			} else {
-				/* Create new free cluster */
-				free_head = &flow->free;
-				free_head->n = 1;
-				*last_next = FLOW_IDX(flow);
-				last_next = &free_head->next;
+				free_head = NULL;
 			}
-		} else {
-			free_head = NULL;
+			break;
+
+		default:
+			ASSERT(false);
 		}
 	}
 

From 93394f4ef0966602b2ada8f72beaf75352add7b1 Mon Sep 17 00:00:00 2001
From: Janne Grunau <janne-psst@jannau.net>
Date: Thu, 1 May 2025 11:54:07 +0200
Subject: [PATCH 207/221] selinux: Add getattr to class udp_socket

Commit 59cc89f ("udp, udp_flow: Track our specific address on socket
interfaces") added a getsockname() call in udp_flow_new(). This requires
getattr. Fixes "Flow 0 (UDP flow): Unable to determine local address:
Permission denied" errors in muvm/passt on Fedora Linux 42 with SELinux.

The SELinux audit message is

| type=AVC msg=audit(1746083799.606:235): avc:  denied  { getattr } for
|   pid=2961 comm="passt" laddr=127.0.0.1 lport=49221
|   faddr=127.0.0.53 fport=53
|   scontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tcontext=unconfined_u:unconfined_r:passt_t:s0-s0:c0.c1023
|   tclass=udp_socket permissive=0

Fixes: 59cc89f4cc01 ("udp, udp_flow: Track our specific address on socket interfaces")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2363238
Signed-off-by: Janne Grunau <janne-psst@jannau.net>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 contrib/selinux/passt.te | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/contrib/selinux/passt.te b/contrib/selinux/passt.te
index f8ea672..eb9ce72 100644
--- a/contrib/selinux/passt.te
+++ b/contrib/selinux/passt.te
@@ -49,7 +49,7 @@ require {
 	type proc_net_t;
 	type node_t;
 	class tcp_socket { create accept listen name_bind name_connect getattr ioctl };
-	class udp_socket { create accept listen };
+	class udp_socket { create accept listen getattr };
 	class icmp_socket { bind create name_bind node_bind setopt read write };
 	class sock_file { create unlink write };
 
@@ -133,7 +133,7 @@ allow passt_t node_t:icmp_socket { name_bind node_bind };
 allow passt_t port_t:icmp_socket name_bind;
 
 allow passt_t self:tcp_socket { create getopt setopt connect bind listen accept shutdown read write getattr ioctl };
-allow passt_t self:udp_socket { create getopt setopt connect bind read write };
+allow passt_t self:udp_socket { create getopt setopt connect bind read write getattr };
 allow passt_t self:icmp_socket { bind create setopt read write };
 
 allow passt_t user_tmp_t:dir { add_name write };

From f0021f9e1d4f118f4167149b256346f3dfea9d2b Mon Sep 17 00:00:00 2001
From: Emanuel Valasiadis <emanuel@valasiadis.space>
Date: Fri, 2 May 2025 15:31:39 +0200
Subject: [PATCH 208/221] fwd: fix doc typo

Signed-off-by: Emanuel Valasiadis <emanuel@valasiadis.space>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 fwd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fwd.c b/fwd.c
index b73c2c8..49aabc3 100644
--- a/fwd.c
+++ b/fwd.c
@@ -440,7 +440,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 }
 
 /**
- * nat_inbound() - Apply address translation for outbound (HOST to TAP)
+ * nat_inbound() - Apply address translation for inbound (HOST to TAP)
  * @c:		Execution context
  * @addr:	Input address (as seen on HOST interface)
  * @translated:	Output address (as seen on TAP interface)

From 587980ca1e9d5645f6738f67ec3f15cc61a7efa3 Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 2 May 2025 21:56:30 +0200
Subject: [PATCH 209/221] udp: Actually discard datagrams we can't forward

Given that udp_sock_fwd() now loops on udp_peek_addr() to get endpoint
addresses for datagrams, if we can't forward one of these datagrams,
we need to make sure we actually discard it. Otherwise, with MSG_PEEK,
we won't dequeue and loop on it forever.

For example, if we fail to create a socket for a new flow, because,
say, the destination of an inbound packet is multicast, and we can't
bind() to a multicast address, the loop will look like this:

18.0563: Flow 0 (NEW): FREE -> NEW
18.0563: Flow 0 (INI): NEW -> INI
18.0563: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0563: Flow 0 (TGT): INI -> TGT
18.0563: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0563: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow
18.0564: Flow 0 (NEW): FREE -> NEW
18.0564: Flow 0 (INI): NEW -> INI
18.0564: Flow 0 (INI): HOST [127.0.0.1]:42487 -> [127.0.0.1]:9997 => ?
18.0564: Flow 0 (TGT): INI -> TGT
18.0564: Flow 0 (TGT): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): TGT -> TYPED
18.0564: Flow 0 (UDP flow): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Flow 0 (UDP flow): Couldn't open flow specific socket: Invalid argument
18.0564: Flow 0 (FREE): TYPED -> FREE
18.0564: Flow 0 (FREE): HOST [127.0.0.1]:42487 -> [ff02::c]:9997 => SPLICE [0.0.0.0]:42487 -> [88.198.0.164]:9997
18.0564: Discarding datagram without flow

and seen from strace:

epoll_wait(3, [{events=EPOLLIN, data=0x1076c00000705}], 8, 1000) = 1
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0
recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(55899), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "fe80::26e8:53ff:fef3:13b6", &sin6_addr), sin6_scope_id=if_nametoindex("wlp4s0")}, msg_namelen=28, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32, cmsg_data="\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x03\x00\x00\x00"}], msg_controllen=40, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_DONTWAIT) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_UDP) = 12
setsockopt(12, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(12, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
bind(12, {sa_family=AF_INET6, sin6_port=htons(1900), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "ff02::c", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINVAL (Invalid argument)
close(12)                               = 0

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/udp.c b/udp.c
index f5a5cd1..ca28b37 100644
--- a/udp.c
+++ b/udp.c
@@ -828,6 +828,7 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 	int rc;
 
 	while ((rc = udp_peek_addr(s, &src, &dst)) != 0) {
+		bool discard = false;
 		flow_sidx_t tosidx;
 		uint8_t topif;
 
@@ -861,8 +862,17 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
 			flow_err(uflow,
 				 "No support for forwarding UDP from %s to %s",
 				 pif_name(frompif), pif_name(topif));
+			discard = true;
 		} else {
 			debug("Discarding datagram without flow");
+			discard = true;
+		}
+
+		if (discard) {
+			struct msghdr msg = { 0 };
+
+			if (recvmsg(s, &msg, MSG_DONTWAIT) < 0)
+				debug_perror("Failed to discard datagram");
 		}
 	}
 }

From eea8a76caf85f4bae5f92b695d09b9ddea354b57 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Wed, 7 May 2025 14:36:34 +0200
Subject: [PATCH 210/221] flow: fix podman issue #26073

While running pasta, we trigger the following assert:

  ASSERTION FAILED in udp_at_sidx (udp_flow.c:35): flow->f.type == FLOW_UDP

in udp_at_sidx() in the following path:

 902 void udp_sock_handler(const struct ctx *c, union epoll_ref ref,
 903                       uint32_t events, const struct timespec *now)
 904 {
 905         struct udp_flow *uflow = udp_at_sidx(ref.flowside);

The invalid sidx is comming from the epoll_ref provided by epoll_wait().

This assert follows the following error:

  Couldn't connect flow socket: Permission denied

It appears that an error happens in udp_flow_sock() and the recently
created fd is not removed from the epoll_ctl() pool:

 71 static int udp_flow_sock(const struct ctx *c,
 72                          struct udp_flow *uflow, unsigned sidei)
 73 {
...
 82         s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
 83         if (s < 0) {
 84                 flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 85                 return s;
 86         }
 87
 88         if (flowside_connect(c, s, pif, side) < 0) {
 89                 int rc = -errno;
 90                 flow_dbg_perror(uflow, "Couldn't connect flow socket");
 91                 return rc;
 92         }
...

flowside_sock_l4() calls sock_l4_sa() that adds 's' to the epoll_ctl()
pool.

So to cleanly manage the error of flowside_connect() we need to remove
's' from the epoll_ctl() pool using epoll_del().

Link: https://github.com/containers/podman/issues/26073
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_flow.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/udp_flow.c b/udp_flow.c
index fea1cf3..b3a13b7 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -87,6 +87,10 @@ static int udp_flow_sock(const struct ctx *c,
 
 	if (flowside_connect(c, s, pif, side) < 0) {
 		int rc = -errno;
+
+		if (pif == PIF_HOST)
+			epoll_del(c, s);
+
 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
 		return rc;
 	}

From 92d5d680134455f1a5b51fd8a3e9e64c99ac6d13 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 6 May 2025 16:13:25 +0200
Subject: [PATCH 211/221] flow: fix wrong macro name in comments

The maximum number of flow macro name is FLOW_MAX, not MAX_FLOW.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flow.c b/flow.c
index c5718e3..6a5c8aa 100644
--- a/flow.c
+++ b/flow.c
@@ -81,7 +81,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
  *
  * Free cluster list
  *    flow_first_free gives the index of the first (lowest index) free cluster.
- *    Each free cluster has the index of the next free cluster, or MAX_FLOW if
+ *    Each free cluster has the index of the next free cluster, or FLOW_MAX if
  *    it is the last free cluster.  Together these form a linked list of free
  *    clusters, in strictly increasing order of index.
  *

From 8ec134109eb136432a29bdf5a14f8b1fd4e46208 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Mon, 12 May 2025 18:47:00 +0200
Subject: [PATCH 212/221] flow: close socket fd on error

In eea8a76caf85 ("flow: fix podman issue #26073"), we unregister
the fd from epoll_ctl() in case of error, but we also need to close it.

As flowside_sock_l4() also calls sock_l4_sa() via flowside_sock_splice()
we can do it unconditionally.

Fixes: eea8a76caf85 ("flow: fix podman issue #26073")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 udp_flow.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index b3a13b7..4c6b3c2 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -88,8 +88,8 @@ static int udp_flow_sock(const struct ctx *c,
 	if (flowside_connect(c, s, pif, side) < 0) {
 		int rc = -errno;
 
-		if (pif == PIF_HOST)
-			epoll_del(c, s);
+		epoll_del(c, s);
+		close(s);
 
 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
 		return rc;

From 570e7b4454f2f879180ae3ca13dedd759aff5243 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 13 May 2025 11:40:59 +0200
Subject: [PATCH 213/221] dhcpv6: fix GCC error
 (unterminated-string-initialization)

The string STR_NOTONLINK is intentionally not NUL-terminated.
Ignore the GCC error using __attribute__((nonstring)).

This error is reported by GCC 15.1.1 on Fedora 42. However,
Clang 20.1.3 does not support __attribute__((nonstring)).
Therefore, NOLINTNEXTLINE(clang-diagnostic-unknown-attributes)
is also added to suppress Clang's unknown attribute warning.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 dhcpv6.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/dhcpv6.c b/dhcpv6.c
index 373a988..ba16c66 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -144,7 +144,9 @@ struct opt_ia_addr {
 struct opt_status_code {
 	struct opt_hdr hdr;
 	uint16_t code;
-	char status_msg[sizeof(STR_NOTONLINK) - 1];
+	/* "nonstring" is only supported since clang 23 */
+	/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
+	__attribute__((nonstring)) char status_msg[sizeof(STR_NOTONLINK) - 1];
 } __attribute__((packed));
 
 /**

From a6b9832e495be636bcccf25e0aebdeb564addf06 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 13 May 2025 11:41:00 +0200
Subject: [PATCH 214/221] virtio: Fix Clang warning
 (bugprone-sizeof-expression, cert-arr39-c)

In `virtqueue_read_indirect_desc()`, the pointer arithmetic involving
`desc` is intentional. We add the length in bytes (`read_len`)
divided by the size of `struct vring_desc` to `desc`, which is
an array of `struct vring_desc`. This correctly calculates the
offset in terms of the number of `struct vring_desc` elements.

Clang issues the following warning due to this explicit scaling:

virtio.c:238:8: error: suspicious usage of 'sizeof(...)' in pointer
arithmetic; this scaled value will be scaled again by the '+='
operator [bugprone-sizeof-expression,cert-arr39-c,-Werror]
  238 |         desc += read_len / sizeof(struct vring_desc);
      |               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~
virtio.c:238:8: note: '+=' in pointer arithmetic internally scales
with 'sizeof(struct vring_desc)' == 16

This behavior is intended, so the warning can be considered a
false positive in this context. The code correctly advances the
pointer by the desired number of descriptor entries.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 virtio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virtio.c b/virtio.c
index bc2b89a..f7db007 100644
--- a/virtio.c
+++ b/virtio.c
@@ -235,6 +235,7 @@ static int virtqueue_read_indirect_desc(const struct vu_dev *dev,
 		memcpy(desc, orig_desc, read_len);
 		len -= read_len;
 		addr += read_len;
+		/* NOLINTNEXTLINE(bugprone-sizeof-expression,cert-arr39-c) */
 		desc += read_len / sizeof(struct vring_desc);
 	}
 

From 0f7bf10b0a5542690dc6c75e4b56a6030ca8a663 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 13 May 2025 11:41:01 +0200
Subject: [PATCH 215/221] ndp: Fix Clang analyzer warning
 (clang-analyzer-security.PointerSub)

Addresses Clang warning: "Subtraction of two pointers that do not
point into the same array is undefined behavior" for the line:
  `ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);`

Here, `ptr` is `&ra.var[0]`. The subtraction calculates the offset
of `var[0]` within the `struct ra_options ra`. Since `ptr` points
inside `ra`, this pointer arithmetic is well-defined for
calculating the size of the data to send, even if `ptr` and `&ra`
are not strictly considered part of the same "array" by the analyzer.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 ndp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ndp.c b/ndp.c
index ded2081..b664034 100644
--- a/ndp.c
+++ b/ndp.c
@@ -328,6 +328,7 @@ static void ndp_ra(const struct ctx *c, const struct in6_addr *dst)
 
 	memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN);
 
+	/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
 	ndp_send(c, dst, &ra, ptr - (unsigned char *)&ra);
 }
 

From 2d3d69c5c348d18112596bd3fdeed95689c613c8 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Tue, 13 May 2025 11:41:02 +0200
Subject: [PATCH 216/221] flow: Fix clang error
 (clang-analyzer-security.PointerSub)

Fixes the following clang-analyzer warning:

flow_table.h:96:25: note: Subtraction of two pointers that do not point into the same array is undefined behavior
   96 |         return (union flow *)f - flowtab;

The `flow_idx()` function is called via `FLOW_IDX()` from
`flow_foreach_slot()`, where `f` is set to `&flowtab[idx].f`.
Therefore, `f` and `flowtab` do point to the same array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow_table.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/flow_table.h b/flow_table.h
index 2d5c65c..3f3f4b7 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -93,6 +93,7 @@ extern union flow flowtab[];
  */
 static inline unsigned flow_idx(const struct flow_common *f)
 {
+	/* NOLINTNEXTLINE(clang-analyzer-security.PointerSub) */
 	return (union flow *)f - flowtab;
 }
 

From 4234ace84cdf989cbcdb96a8165221dc83a11c85 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Wed, 14 May 2025 15:45:09 +0200
Subject: [PATCH 217/221] test: Display count of skipped tests in status and
 summary

This commit enhances test reporting by tracking and displaying the
number of skipped tests.

The skipped test count is now visible in the tmux status bar during
execution and included in the final test summary log. This provides
a more complete overview of test suite results.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/lib/term | 7 +++++--
 test/run      | 6 +++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/test/lib/term b/test/lib/term
index ed690de..089364c 100755
--- a/test/lib/term
+++ b/test/lib/term
@@ -19,6 +19,7 @@ STATUS_FILE_INDEX=0
 STATUS_COLS=
 STATUS_PASS=0
 STATUS_FAIL=0
+STATUS_SKIPPED=0
 
 PR_RED='\033[1;31m'
 PR_GREEN='\033[1;32m'
@@ -439,19 +440,21 @@ info_layout() {
 # status_test_ok() - Update counter of passed tests, log and display message
 status_test_ok() {
 	STATUS_PASS=$((STATUS_PASS + 1))
-	tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
+	tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
 	info_passed
 }
 
 # status_test_fail() - Update counter of failed tests, log and display message
 status_test_fail() {
 	STATUS_FAIL=$((STATUS_FAIL + 1))
-	tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | #(TZ="UTC" date -Iseconds)"
+	tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
 	info_failed
 }
 
 # status_test_fail() - Update counter of failed tests, log and display message
 status_test_skip() {
+	STATUS_SKIPPED=$((STATUS_SKIPPED + 1))
+	tmux set status-right "PASS: ${STATUS_PASS} | FAIL: ${STATUS_FAIL} | SKIPPED: ${STATUS_SKIPPED} | #(TZ="UTC" date -Iseconds)"
 	info_skipped
 }
 
diff --git a/test/run b/test/run
index 4e86f30..f73c311 100755
--- a/test/run
+++ b/test/run
@@ -202,7 +202,7 @@ skip_distro() {
 	perf_finish
 	[ ${CI} -eq 1 ] && video_stop
 
-	log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
+	log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
 
 	pause_continue \
 		"Press any key to keep test session open"	\
@@ -236,7 +236,7 @@ run_selected() {
 	done
 	teardown "${__setup}"
 
-	log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}"
+	log "PASS: ${STATUS_PASS}, FAIL: ${STATUS_FAIL}, SKIPPED: ${STATUS_SKIPPED}"
 
 	pause_continue \
 		"Press any key to keep test session open"	\
@@ -307,4 +307,4 @@ fi
 
 tail -n1 ${LOGFILE}
 echo "Log at ${LOGFILE}"
-exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\)$/\1/p')
+exit $(tail -n1 ${LOGFILE} | sed -n 's/.*FAIL: \(.*\),.*$/\1/p')

From 2046976866dd1f983cb0417a1d3ee3f64190805d Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 15 May 2025 11:41:51 +0200
Subject: [PATCH 218/221] codespell: Correct typos in comments and error
 message

This commit addresses several spelling errors identified by the `codespell`
tool. The corrections apply to:
- Code comments in `fwd.c`, `ip.h`, `isolation.c`, and `log.c`.
- An error message string in `vhost_user.c`.

Specifically, the following misspellings were corrected:
- "adddress" to "address"
- "capabilites" to "capabilities"
- "Musn't" to "Mustn't"
- "calculatd" to "calculated"
- "Invalide" to "Invalid"

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 fwd.c        | 2 +-
 ip.h         | 2 +-
 isolation.c  | 8 ++++----
 log.c        | 2 +-
 vhost_user.c | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fwd.c b/fwd.c
index 49aabc3..250cf56 100644
--- a/fwd.c
+++ b/fwd.c
@@ -418,7 +418,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 	else
 		tgt->eaddr = inany_loopback6;
 
-	/* Preserve the specific loopback adddress used, but let the kernel pick
+	/* Preserve the specific loopback address used, but let the kernel pick
 	 * a source port on the target side
 	 */
 	tgt->oaddr = ini->eaddr;
diff --git a/ip.h b/ip.h
index 471c57e..24509d9 100644
--- a/ip.h
+++ b/ip.h
@@ -118,7 +118,7 @@ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h)
 char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
 		 size_t *dlen);
 
-/* IPv6 link-local all-nodes multicast adddress, ff02::1 */
+/* IPv6 link-local all-nodes multicast address, ff02::1 */
 static const struct in6_addr in6addr_ll_all_nodes = {
 	.s6_addr = {
 		0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
diff --git a/isolation.c b/isolation.c
index c944fb3..bbcd23b 100644
--- a/isolation.c
+++ b/isolation.c
@@ -129,7 +129,7 @@ static void drop_caps_ep_except(uint64_t keep)
  * additional layer of protection.  Executing this requires
  * CAP_SETPCAP, which we will have within our userns.
  *
- * Note that dropping capabilites from the bounding set limits
+ * Note that dropping capabilities from the bounding set limits
  * exec()ed processes, but does not remove them from the effective or
  * permitted sets, so it doesn't reduce our own capabilities.
  */
@@ -174,8 +174,8 @@ static void clamp_caps(void)
  * Should:
  *  - drop unneeded capabilities
  *  - close all open files except for standard streams and the one from --fd
- * Musn't:
- *  - remove filesytem access (we need to access files during setup)
+ * Mustn't:
+ *  - remove filesystem access (we need to access files during setup)
  */
 void isolate_initial(int argc, char **argv)
 {
@@ -194,7 +194,7 @@ void isolate_initial(int argc, char **argv)
 	 *
 	 * It's debatable whether it's useful to drop caps when we
 	 * retain SETUID and SYS_ADMIN, but we might as well.  We drop
-	 * further capabilites in isolate_user() and
+	 * further capabilities in isolate_user() and
 	 * isolate_prefork().
 	 */
 	keep = BIT(CAP_NET_BIND_SERVICE) | BIT(CAP_SETUID) | BIT(CAP_SETGID) |
diff --git a/log.c b/log.c
index d40d7ae..5d7d76f 100644
--- a/log.c
+++ b/log.c
@@ -402,7 +402,7 @@ void __setlogmask(int mask)
  * logfile_init() - Open log file and write header with PID, version, path
  * @name:	Identifier for header: passt or pasta
  * @path:	Path to log file
- * @size:	Maximum size of log file: log_cut_size is calculatd here
+ * @size:	Maximum size of log file: log_cut_size is calculated here
  */
 void logfile_init(const char *name, const char *path, size_t size)
 {
diff --git a/vhost_user.c b/vhost_user.c
index 105f77a..ca36763 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -1021,7 +1021,7 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 
 	if (direction != VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE &&
 	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
-		die("Invalide device_state_fd direction: %d", direction);
+		die("Invalid device_state_fd direction: %d", direction);
 
 	migrate_request(vdev->context, msg->fds[0],
 			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);

From 2fd0944f21d6b9fce53c328acf1faaeb46b98528 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Fri, 16 May 2025 14:42:26 +0200
Subject: [PATCH 219/221] vhost_user: Correct and align function comment
 headers

This commit cleans up function comment headers in vhost_user.c to ensure
accuracy and consistency with the code. Changes include correcting
parameter names in comments and signatures (e.g., standardizing on vmsg
for vhost messages, fixing dev to vdev), updating function names in
comment descriptions, and removing/rectifying erroneous parameter
documentation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 221 +++++++++++++++++++++++++--------------------------
 vhost_user.h |   2 +-
 2 files changed, 111 insertions(+), 112 deletions(-)

diff --git a/vhost_user.c b/vhost_user.c
index ca36763..e8377bb 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -302,13 +302,13 @@ static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
  * @conn_fd:	vhost-user command socket
  * @vmsg:	vhost-user message
  */
-static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
+static void vu_send_reply(int conn_fd, struct vhost_user_msg *vmsg)
 {
-	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
-	msg->hdr.flags |= VHOST_USER_VERSION;
-	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
+	vmsg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
+	vmsg->hdr.flags |= VHOST_USER_VERSION;
+	vmsg->hdr.flags |= VHOST_USER_REPLY_MASK;
 
-	vu_message_write(conn_fd, msg);
+	vu_message_write(conn_fd, vmsg);
 }
 
 /**
@@ -319,7 +319,7 @@ static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
  * Return: True as a reply is requested
  */
 static bool vu_get_features_exec(struct vu_dev *vdev,
-				 struct vhost_user_msg *msg)
+				 struct vhost_user_msg *vmsg)
 {
 	uint64_t features =
 		1ULL << VIRTIO_F_VERSION_1 |
@@ -329,9 +329,9 @@ static bool vu_get_features_exec(struct vu_dev *vdev,
 
 	(void)vdev;
 
-	vmsg_set_reply_u64(msg, features);
+	vmsg_set_reply_u64(vmsg, features);
 
-	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
+	debug("Sending back to guest u64: 0x%016"PRIx64, vmsg->payload.u64);
 
 	return true;
 }
@@ -357,11 +357,11 @@ static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
  * Return: False as no reply is requested
  */
 static bool vu_set_features_exec(struct vu_dev *vdev,
-				 struct vhost_user_msg *msg)
+				 struct vhost_user_msg *vmsg)
 {
-	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+	debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
 
-	vdev->features = msg->payload.u64;
+	vdev->features = vmsg->payload.u64;
 	/* We only support devices conforming to VIRTIO 1.0 or
 	 * later
 	 */
@@ -382,10 +382,10 @@ static bool vu_set_features_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_owner_exec(struct vu_dev *vdev,
-			      struct vhost_user_msg *msg)
+			      struct vhost_user_msg *vmsg)
 {
 	(void)vdev;
-	(void)msg;
+	(void)vmsg;
 
 	return false;
 }
@@ -423,9 +423,9 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
  * #syscalls:vu mmap|mmap2 munmap
  */
 static bool vu_set_mem_table_exec(struct vu_dev *vdev,
-				  struct vhost_user_msg *msg)
+				  struct vhost_user_msg *vmsg)
 {
-	struct vhost_user_memory m = msg->payload.memory, *memory = &m;
+	struct vhost_user_memory m = vmsg->payload.memory, *memory = &m;
 	unsigned int i;
 
 	for (i = 0; i < vdev->nregions; i++) {
@@ -465,7 +465,7 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 		 */
 		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
 				 PROT_READ | PROT_WRITE, MAP_SHARED |
-				 MAP_NORESERVE, msg->fds[i], 0);
+				 MAP_NORESERVE, vmsg->fds[i], 0);
 
 		if (mmap_addr == MAP_FAILED)
 			die_perror("vhost-user region mmap error");
@@ -474,7 +474,7 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 		debug("    mmap_addr:       0x%016"PRIx64,
 		      dev_region->mmap_addr);
 
-		close(msg->fds[i]);
+		close(vmsg->fds[i]);
 	}
 
 	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
@@ -541,7 +541,7 @@ static void vu_log_page(uint8_t *log_table, uint64_t page)
 
 /**
  * vu_log_write() - Log memory write
- * @dev:	vhost-user device
+ * @vdev:	vhost-user device
  * @address:	Memory address
  * @length:	Memory size
  */
@@ -566,23 +566,23 @@ void vu_log_write(const struct vu_dev *vdev, uint64_t address, uint64_t length)
  * @vdev:	vhost-user device
  * @vmsg:	vhost-user message
  *
- * Return: False as no reply is requested
+ * Return: True as a reply is requested
  *
  * #syscalls:vu mmap|mmap2 munmap
  */
 static bool vu_set_log_base_exec(struct vu_dev *vdev,
-				 struct vhost_user_msg *msg)
+				 struct vhost_user_msg *vmsg)
 {
 	uint64_t log_mmap_size, log_mmap_offset;
 	void *base;
 	int fd;
 
-	if (msg->fd_num != 1 || msg->hdr.size != sizeof(msg->payload.log))
+	if (vmsg->fd_num != 1 || vmsg->hdr.size != sizeof(vmsg->payload.log))
 		die("vhost-user: Invalid log_base message");
 
-	fd = msg->fds[0];
-	log_mmap_offset = msg->payload.log.mmap_offset;
-	log_mmap_size = msg->payload.log.mmap_size;
+	fd = vmsg->fds[0];
+	log_mmap_offset = vmsg->payload.log.mmap_offset;
+	log_mmap_size = vmsg->payload.log.mmap_size;
 
 	debug("vhost-user log mmap_offset: %"PRId64, log_mmap_offset);
 	debug("vhost-user log mmap_size:   %"PRId64, log_mmap_size);
@@ -599,8 +599,8 @@ static bool vu_set_log_base_exec(struct vu_dev *vdev,
 	vdev->log_table = base;
 	vdev->log_size = log_mmap_size;
 
-	msg->hdr.size = sizeof(msg->payload.u64);
-	msg->fd_num = 0;
+	vmsg->hdr.size = sizeof(vmsg->payload.u64);
+	vmsg->fd_num = 0;
 
 	return true;
 }
@@ -613,15 +613,15 @@ static bool vu_set_log_base_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_log_fd_exec(struct vu_dev *vdev,
-			       struct vhost_user_msg *msg)
+			       struct vhost_user_msg *vmsg)
 {
-	if (msg->fd_num != 1)
+	if (vmsg->fd_num != 1)
 		die("Invalid log_fd message");
 
 	if (vdev->log_call_fd != -1)
 		close(vdev->log_call_fd);
 
-	vdev->log_call_fd = msg->fds[0];
+	vdev->log_call_fd = vmsg->fds[0];
 
 	debug("Got log_call_fd: %d", vdev->log_call_fd);
 
@@ -636,10 +636,10 @@ static bool vu_set_log_fd_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_num_exec(struct vu_dev *vdev,
-				  struct vhost_user_msg *msg)
+				  struct vhost_user_msg *vmsg)
 {
-	unsigned int idx = msg->payload.state.index;
-	unsigned int num = msg->payload.state.num;
+	unsigned int idx = vmsg->payload.state.index;
+	unsigned int num = vmsg->payload.state.num;
 
 	trace("State.index: %u", idx);
 	trace("State.num:   %u", num);
@@ -656,13 +656,13 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
-				   struct vhost_user_msg *msg)
+				   struct vhost_user_msg *vmsg)
 {
 	/* We need to copy the payload to vhost_vring_addr structure
-         * to access index because address of msg->payload.addr
+         * to access index because address of vmsg->payload.addr
          * can be unaligned as it is packed.
          */
-	struct vhost_vring_addr addr = msg->payload.addr;
+	struct vhost_vring_addr addr = vmsg->payload.addr;
 	struct vu_virtq *vq = &vdev->vq[addr.index];
 
 	debug("vhost_vring_addr:");
@@ -677,7 +677,7 @@ static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
 	debug("    log_guest_addr:   0x%016" PRIx64,
 	      (uint64_t)addr.log_guest_addr);
 
-	vq->vra = msg->payload.addr;
+	vq->vra = vmsg->payload.addr;
 	vq->vring.flags = addr.flags;
 	vq->vring.log_guest_addr = addr.log_guest_addr;
 
@@ -702,10 +702,10 @@ static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_base_exec(struct vu_dev *vdev,
-				   struct vhost_user_msg *msg)
+				   struct vhost_user_msg *vmsg)
 {
-	unsigned int idx = msg->payload.state.index;
-	unsigned int num = msg->payload.state.num;
+	unsigned int idx = vmsg->payload.state.index;
+	unsigned int num = vmsg->payload.state.num;
 
 	debug("State.index: %u", idx);
 	debug("State.num:   %u", num);
@@ -723,13 +723,13 @@ static bool vu_set_vring_base_exec(struct vu_dev *vdev,
  * Return: True as a reply is requested
  */
 static bool vu_get_vring_base_exec(struct vu_dev *vdev,
-				   struct vhost_user_msg *msg)
+				   struct vhost_user_msg *vmsg)
 {
-	unsigned int idx = msg->payload.state.index;
+	unsigned int idx = vmsg->payload.state.index;
 
 	debug("State.index: %u", idx);
-	msg->payload.state.num = vdev->vq[idx].last_avail_idx;
-	msg->hdr.size = sizeof(msg->payload.state);
+	vmsg->payload.state.num = vdev->vq[idx].last_avail_idx;
+	vmsg->hdr.size = sizeof(vmsg->payload.state);
 
 	vdev->vq[idx].started = false;
 	vdev->vq[idx].vring.avail = 0;
@@ -771,21 +771,21 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
  * 			       close fds if NOFD bit is set
  * @vmsg:	vhost-user message
  */
-static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
+static void vu_check_queue_msg_file(struct vhost_user_msg *vmsg)
 {
-	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
-	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
 
 	if (idx >= VHOST_USER_MAX_QUEUES)
 		die("Invalid vhost-user queue index: %u", idx);
 
 	if (nofd) {
-		vmsg_close_fds(msg);
+		vmsg_close_fds(vmsg);
 		return;
 	}
 
-	if (msg->fd_num != 1)
-		die("Invalid fds in vhost-user request: %d", msg->hdr.request);
+	if (vmsg->fd_num != 1)
+		die("Invalid fds in vhost-user request: %d", vmsg->hdr.request);
 }
 
 /**
@@ -797,14 +797,14 @@ static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
-				   struct vhost_user_msg *msg)
+				   struct vhost_user_msg *vmsg)
 {
-	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
-	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
 
-	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+	debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
 
-	vu_check_queue_msg_file(msg);
+	vu_check_queue_msg_file(vmsg);
 
 	if (vdev->vq[idx].kick_fd != -1) {
 		epoll_del(vdev->context, vdev->vq[idx].kick_fd);
@@ -813,7 +813,7 @@ static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
 	}
 
 	if (!nofd)
-		vdev->vq[idx].kick_fd = msg->fds[0];
+		vdev->vq[idx].kick_fd = vmsg->fds[0];
 
 	debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
 
@@ -837,14 +837,14 @@ static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_call_exec(struct vu_dev *vdev,
-				   struct vhost_user_msg *msg)
+				   struct vhost_user_msg *vmsg)
 {
-	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
-	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
 
-	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+	debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
 
-	vu_check_queue_msg_file(msg);
+	vu_check_queue_msg_file(vmsg);
 
 	if (vdev->vq[idx].call_fd != -1) {
 		close(vdev->vq[idx].call_fd);
@@ -852,11 +852,11 @@ static bool vu_set_vring_call_exec(struct vu_dev *vdev,
 	}
 
 	if (!nofd)
-		vdev->vq[idx].call_fd = msg->fds[0];
+		vdev->vq[idx].call_fd = vmsg->fds[0];
 
 	/* in case of I/O hang after reconnecting */
 	if (vdev->vq[idx].call_fd != -1)
-		eventfd_write(msg->fds[0], 1);
+		eventfd_write(vmsg->fds[0], 1);
 
 	debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
 
@@ -872,14 +872,14 @@ static bool vu_set_vring_call_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_err_exec(struct vu_dev *vdev,
-				  struct vhost_user_msg *msg)
+				  struct vhost_user_msg *vmsg)
 {
-	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
-	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
 
-	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+	debug("u64: 0x%016"PRIx64, vmsg->payload.u64);
 
-	vu_check_queue_msg_file(msg);
+	vu_check_queue_msg_file(vmsg);
 
 	if (vdev->vq[idx].err_fd != -1) {
 		close(vdev->vq[idx].err_fd);
@@ -887,7 +887,7 @@ static bool vu_set_vring_err_exec(struct vu_dev *vdev,
 	}
 
 	if (!nofd)
-		vdev->vq[idx].err_fd = msg->fds[0];
+		vdev->vq[idx].err_fd = vmsg->fds[0];
 
 	return false;
 }
@@ -901,7 +901,7 @@ static bool vu_set_vring_err_exec(struct vu_dev *vdev,
  * Return: True as a reply is requested
  */
 static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
-					  struct vhost_user_msg *msg)
+					  struct vhost_user_msg *vmsg)
 {
 	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
 			    1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
@@ -909,7 +909,7 @@ static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
 			    1ULL << VHOST_USER_PROTOCOL_F_RARP;
 
 	(void)vdev;
-	vmsg_set_reply_u64(msg, features);
+	vmsg_set_reply_u64(vmsg, features);
 
 	return true;
 }
@@ -922,13 +922,13 @@ static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
-					  struct vhost_user_msg *msg)
+					  struct vhost_user_msg *vmsg)
 {
-	uint64_t features = msg->payload.u64;
+	uint64_t features = vmsg->payload.u64;
 
 	debug("u64: 0x%016"PRIx64, features);
 
-	vdev->protocol_features = msg->payload.u64;
+	vdev->protocol_features = vmsg->payload.u64;
 
 	return false;
 }
@@ -941,11 +941,11 @@ static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
  * Return: True as a reply is requested
  */
 static bool vu_get_queue_num_exec(struct vu_dev *vdev,
-				  struct vhost_user_msg *msg)
+				  struct vhost_user_msg *vmsg)
 {
 	(void)vdev;
 
-	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
+	vmsg_set_reply_u64(vmsg, VHOST_USER_MAX_QUEUES);
 
 	return true;
 }
@@ -958,10 +958,10 @@ static bool vu_get_queue_num_exec(struct vu_dev *vdev,
  * Return: False as no reply is requested
  */
 static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
-				     struct vhost_user_msg *msg)
+				     struct vhost_user_msg *vmsg)
 {
-	unsigned int enable = msg->payload.state.num;
-	unsigned int idx = msg->payload.state.index;
+	unsigned int enable = vmsg->payload.state.num;
+	unsigned int idx = vmsg->payload.state.index;
 
 	debug("State.index:  %u", idx);
 	debug("State.enable: %u", enable);
@@ -974,17 +974,17 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
 }
 
 /**
- * vu_set_send_rarp_exec() - vhost-user specification says: "Broadcast a fake
- * 			     RARP to notify the migration is terminated",
- * 			     but passt doesn't need to update any ARP table,
- * 			     so do nothing to silence QEMU bogus error message
+ * vu_send_rarp_exec() - vhost-user specification says: "Broadcast a fake
+ * 			 RARP to notify the migration is terminated",
+ * 			 but passt doesn't need to update any ARP table,
+ * 			 so do nothing to silence QEMU bogus error message
  * @vdev:	vhost-user device
  * @vmsg:	vhost-user message
  *
  * Return: False as no reply is requested
  */
 static bool vu_send_rarp_exec(struct vu_dev *vdev,
-			      struct vhost_user_msg *msg)
+			      struct vhost_user_msg *vmsg)
 {
 	char macstr[ETH_ADDRSTRLEN];
 
@@ -993,7 +993,7 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	/* ignore the command */
 
 	debug("Ignore command VHOST_USER_SEND_RARP for %s",
-	      eth_ntop((unsigned char *)&msg->payload.u64, macstr,
+	      eth_ntop((unsigned char *)&vmsg->payload.u64, macstr,
 		       sizeof(macstr)));
 
 	return false;
@@ -1008,12 +1008,12 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
  *         and set bit 8 as we don't provide our own fd.
  */
 static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
-					struct vhost_user_msg *msg)
+					struct vhost_user_msg *vmsg)
 {
-	unsigned int direction = msg->payload.transfer_state.direction;
-	unsigned int phase = msg->payload.transfer_state.phase;
+	unsigned int direction = vmsg->payload.transfer_state.direction;
+	unsigned int phase = vmsg->payload.transfer_state.phase;
 
-	if (msg->fd_num != 1)
+	if (vmsg->fd_num != 1)
 		die("Invalid device_state_fd message");
 
 	if (phase != VHOST_USER_TRANSFER_STATE_PHASE_STOPPED)
@@ -1023,11 +1023,11 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
 		die("Invalid device_state_fd direction: %d", direction);
 
-	migrate_request(vdev->context, msg->fds[0],
+	migrate_request(vdev->context, vmsg->fds[0],
 			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
 	/* We don't provide a new fd for the data transfer */
-	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
+	vmsg_set_reply_u64(vmsg, VHOST_USER_VRING_NOFD_MASK);
 
 	return true;
 }
@@ -1041,9 +1041,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
  */
 /* cppcheck-suppress constParameterCallback */
 static bool vu_check_device_state_exec(struct vu_dev *vdev,
-				       struct vhost_user_msg *msg)
+				       struct vhost_user_msg *vmsg)
 {
-	vmsg_set_reply_u64(msg, vdev->context->device_state_result);
+	vmsg_set_reply_u64(vmsg, vdev->context->device_state_result);
 
 	return true;
 }
@@ -1051,7 +1051,6 @@ static bool vu_check_device_state_exec(struct vu_dev *vdev,
 /**
  * vu_init() - Initialize vhost-user device structure
  * @c:		execution context
- * @vdev:	vhost-user device
  */
 void vu_init(struct ctx *c)
 {
@@ -1134,7 +1133,7 @@ static void vu_sock_reset(struct vu_dev *vdev)
 }
 
 static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
-					struct vhost_user_msg *msg) = {
+					struct vhost_user_msg *vmsg) = {
 	[VHOST_USER_GET_FEATURES]	   = vu_get_features_exec,
 	[VHOST_USER_SET_FEATURES]	   = vu_set_features_exec,
 	[VHOST_USER_GET_PROTOCOL_FEATURES] = vu_get_protocol_features_exec,
@@ -1165,7 +1164,7 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
  */
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 {
-	struct vhost_user_msg msg = { 0 };
+	struct vhost_user_msg vmsg = { 0 };
 	bool need_reply, reply_requested;
 	int ret;
 
@@ -1174,38 +1173,38 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 		return;
 	}
 
-	ret = vu_message_read_default(fd, &msg);
+	ret = vu_message_read_default(fd, &vmsg);
 	if (ret == 0) {
 		vu_sock_reset(vdev);
 		return;
 	}
 	debug("================ Vhost user message ================");
-	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
-		msg.hdr.request);
-	debug("Flags:   0x%x", msg.hdr.flags);
-	debug("Size:    %u", msg.hdr.size);
+	debug("Request: %s (%d)", vu_request_to_string(vmsg.hdr.request),
+		vmsg.hdr.request);
+	debug("Flags:   0x%x", vmsg.hdr.flags);
+	debug("Size:    %u", vmsg.hdr.size);
 
-	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
+	need_reply = vmsg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
 
-	if (msg.hdr.request >= 0 && msg.hdr.request < VHOST_USER_MAX &&
-	    vu_handle[msg.hdr.request])
-		reply_requested = vu_handle[msg.hdr.request](vdev, &msg);
+	if (vmsg.hdr.request >= 0 && vmsg.hdr.request < VHOST_USER_MAX &&
+	    vu_handle[vmsg.hdr.request])
+		reply_requested = vu_handle[vmsg.hdr.request](vdev, &vmsg);
 	else
-		die("Unhandled request: %d", msg.hdr.request);
+		die("Unhandled request: %d", vmsg.hdr.request);
 
 	/* cppcheck-suppress legacyUninitvar */
 	if (!reply_requested && need_reply) {
-		msg.payload.u64 = 0;
-		msg.hdr.flags = 0;
-		msg.hdr.size = sizeof(msg.payload.u64);
-		msg.fd_num = 0;
+		vmsg.payload.u64 = 0;
+		vmsg.hdr.flags = 0;
+		vmsg.hdr.size = sizeof(vmsg.payload.u64);
+		vmsg.fd_num = 0;
 		reply_requested = true;
 	}
 
 	if (reply_requested)
-		vu_send_reply(fd, &msg);
+		vu_send_reply(fd, &vmsg);
 
-	if (msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
+	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
 	    vdev->context->device_state_result == 0 &&
 	    !vdev->context->migrate_target) {
 		info("Migration complete, exiting");
diff --git a/vhost_user.h b/vhost_user.h
index 1daacd1..f2ae2da 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -184,7 +184,7 @@ union vhost_user_payload {
 };
 
 /**
- * struct vhost_user_msg - vhost-use message
+ * struct vhost_user_msg - vhost-user message
  * @hdr:		Message header
  * @payload:		Message payload
  * @fds:		File descriptors associated with the message

From b915375a421d70065baa90444da49954ceacde38 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Fri, 16 May 2025 14:42:27 +0200
Subject: [PATCH 220/221] virtio: Correct and align comment headers

Standardize and fix issues in `virtio.c` and `virtio.h` comment headers.

Improvements include:
- Added `()` to function names in comment summaries.
- Added colons after parameter and enum member tags.
- Changed `/*` to `/**` for `virtq_avail_event()` comment.
- Fixed typos (e.g., "file"->"fill", "virqueue"->"virtqueue").
- Added missing `Return:` tag for `vu_queue_rewind()`.
- Corrected parameter names in `virtio.h` comments to match code.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 virtio.c | 29 ++++++++++++++++-------------
 virtio.h |  4 ++--
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/virtio.c b/virtio.c
index f7db007..83906aa 100644
--- a/virtio.c
+++ b/virtio.c
@@ -156,9 +156,9 @@ static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
 }
 
 /**
- * virtq_used_event - Get location of used event indices
+ * virtq_used_event() - Get location of used event indices
  *		      (only with VIRTIO_F_EVENT_IDX)
- * @vq		Virtqueue
+ * @vq:		Virtqueue
  *
  * Return: return the location of the used event index
  */
@@ -170,7 +170,7 @@ static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
 
 /**
  * vring_get_used_event() - Get the used event from the available ring
- * @vq		Virtqueue
+ * @vq:		Virtqueue
  *
  * Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
  *         used_event is a performant alternative where the driver
@@ -244,9 +244,9 @@ static int virtqueue_read_indirect_desc(const struct vu_dev *dev,
 
 /**
  * enum virtqueue_read_desc_state - State in the descriptor chain
- * @VIRTQUEUE_READ_DESC_ERROR	Found an invalid descriptor
- * @VIRTQUEUE_READ_DESC_DONE	No more descriptors in the chain
- * @VIRTQUEUE_READ_DESC_MORE	there are more descriptors in the chain
+ * @VIRTQUEUE_READ_DESC_ERROR:	Found an invalid descriptor
+ * @VIRTQUEUE_READ_DESC_DONE:	No more descriptors in the chain
+ * @VIRTQUEUE_READ_DESC_MORE:	there are more descriptors in the chain
  */
 enum virtqueue_read_desc_state {
 	VIRTQUEUE_READ_DESC_ERROR = -1,
@@ -347,8 +347,9 @@ void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
 		die_perror("Error writing vhost-user queue eventfd");
 }
 
-/* virtq_avail_event() -  Get location of available event indices
- *			      (only with VIRTIO_F_EVENT_IDX)
+/**
+ * virtq_avail_event() -  Get location of available event indices
+ *			  (only with VIRTIO_F_EVENT_IDX)
  * @vq:		Virtqueue
  *
  * Return: return the location of the available event index
@@ -421,8 +422,8 @@ static bool virtqueue_map_desc(const struct vu_dev *dev,
 }
 
 /**
- * vu_queue_map_desc - Map the virtqueue descriptor ring into our virtual
- * 		       address space
+ * vu_queue_map_desc() - Map the virtqueue descriptor ring into our virtual
+ * 			 address space
  * @dev:	Vhost-user device
  * @vq:		Virtqueue
  * @idx:	First descriptor ring entry to map
@@ -505,7 +506,7 @@ static int vu_queue_map_desc(const struct vu_dev *dev,
  * vu_queue_pop() - Pop an entry from the virtqueue
  * @dev:	Vhost-user device
  * @vq:		Virtqueue
- * @elem:	Virtqueue element to file with the entry information
+ * @elem:	Virtqueue element to fill with the entry information
  *
  * Return: -1 if there is an error, 0 otherwise
  */
@@ -545,7 +546,7 @@ int vu_queue_pop(const struct vu_dev *dev, struct vu_virtq *vq,
 }
 
 /**
- * vu_queue_detach_element() - Detach an element from the virqueue
+ * vu_queue_detach_element() - Detach an element from the virtqueue
  * @vq:		Virtqueue
  */
 void vu_queue_detach_element(struct vu_virtq *vq)
@@ -555,7 +556,7 @@ void vu_queue_detach_element(struct vu_virtq *vq)
 }
 
 /**
- * vu_queue_unpop() - Push back the previously popped element from the virqueue
+ * vu_queue_unpop() - Push back the previously popped element from the virtqueue
  * @vq:		Virtqueue
  */
 /* cppcheck-suppress unusedFunction */
@@ -569,6 +570,8 @@ void vu_queue_unpop(struct vu_virtq *vq)
  * vu_queue_rewind() - Push back a given number of popped elements
  * @vq:		Virtqueue
  * @num:	Number of element to unpop
+ *
+ * Return: True on success, false if not
  */
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
 {
diff --git a/virtio.h b/virtio.h
index 7a370bd..d8beb88 100644
--- a/virtio.h
+++ b/virtio.h
@@ -150,7 +150,7 @@ static inline bool has_feature(uint64_t features, unsigned int fbit)
 /**
  * vu_has_feature() - Check if a virtio-net feature is available
  * @vdev:	Vhost-user device
- * @bit:	Feature to check
+ * @fbit:	Feature to check
  *
  * Return:	True if the feature is available
  */
@@ -163,7 +163,7 @@ static inline bool vu_has_feature(const struct vu_dev *vdev,
 /**
  * vu_has_protocol_feature() - Check if a vhost-user feature is available
  * @vdev:	Vhost-user device
- * @bit:	Feature to check
+ * @fbit:	Feature to check
  *
  * Return:	True if the feature is available
  */

From 3262c9b088288902f28b5d09f61220fae5376082 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Fri, 16 May 2025 14:42:28 +0200
Subject: [PATCH 221/221] iov: Standardize function comment headers

Update function comment headers in iov.c to a consistent and
standardized format.

This change ensures:
- Comment blocks for functions consistently start with /**.
- Function names in the comment summary line include parentheses ().

This improves overall comment clarity and uniformity within the file.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 iov.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/iov.c b/iov.c
index 8c63b7e..91e87a7 100644
--- a/iov.c
+++ b/iov.c
@@ -26,7 +26,8 @@
 #include "iov.h"
 
 
-/* iov_skip_bytes() - Skip leading bytes of an IO vector
+/**
+ * iov_skip_bytes() - Skip leading bytes of an IO vector
  * @iov:	IO vector
  * @n:		Number of entries in @iov
  * @skip:	Number of leading bytes of @iov to skip
@@ -56,8 +57,8 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
 }
 
 /**
- * iov_from_buf - Copy data from a buffer to an I/O vector (struct iovec)
- *                efficiently.
+ * iov_from_buf() - Copy data from a buffer to an I/O vector (struct iovec)
+ *                  efficiently.
  *
  * @iov:       Pointer to the array of struct iovec describing the
  *             scatter/gather I/O vector.
@@ -96,8 +97,8 @@ size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
 }
 
 /**
- * iov_to_buf - Copy data from a scatter/gather I/O vector (struct iovec) to
- *		a buffer efficiently.
+ * iov_to_buf() - Copy data from a scatter/gather I/O vector (struct iovec) to
+ *		  a buffer efficiently.
  *
  * @iov:       Pointer to the array of struct iovec describing the scatter/gather
  *             I/O vector.
@@ -136,8 +137,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
 }
 
 /**
- * iov_size - Calculate the total size of a scatter/gather I/O vector
- *            (struct iovec).
+ * iov_size() - Calculate the total size of a scatter/gather I/O vector
+ *              (struct iovec).
  *
  * @iov:       Pointer to the array of struct iovec describing the
  *             scatter/gather I/O vector.