Commit graph

1541 commits

Author SHA1 Message Date
Stefano Brivio
7e6a606c32 tcp_splice: Set OUT_WAIT_ flag whenever pipe isn't emptied
In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(),
from pipe to receiving socket, that doesn't necessarily mean that the
pipe is empty: the receiver buffer might be full instead.

Hence, we can't use the 'never_read' flag to decide that there's
nothing to wait for: even if we didn't read anything from the sending
side in a given iteration, we might still have data to send in the
pipe. Use read/written counters, instead.

This fixes an issue where large bulk transfers would occasionally
hang. From a corresponding strace:

     0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1
     0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0
     0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0
     0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1048576
     0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 520415
     0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 epoll_wait(4, [], 8, 1000) = 0

we're reading from socket 211 into to the pipe end numbered 151,
which connects to pipe end 150, and from there we're writing into
socket 115.

We initially drop EPOLLOUT from the set of monitored flags for socket
115, because it already signaled it's ready for output. Then we read
nothing from socket 211 (the sender had nothing to send), and we keep
emptying the pipe into socket 115 (first 1048576 bytes, then 520415
bytes).

This call of tcp_splice_sock_handler() ends with EAGAIN on the writing
side, and we just exit this function without setting the OUT_WAIT_1
flag (and, in turn, EPOLLOUT for socket 115). However, it turns out,
the pipe wasn't actually emptied, and while socket 211 had nothing
more to send, we should have waited on socket 115 to be ready for
output again.

As a further step, we could consider not clearing EPOLLOUT at all,
unless the read/written counters match, but I'm first trying to fix
this ugly issue with a minimal patch.

Link: https://github.com/containers/podman/issues/22575
Link: https://github.com/containers/podman/issues/22593
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-10 16:52:29 +02:00
David Gibson
1ba76c9e8c udp: Single buffer for IPv4, IPv6 headers and metadata
Currently we have separate arrays for IPv4 and IPv6 which contain the
headers for guest-bound packets, and also the originating socket address.
We can combine these into a single array of "metadata" structures with
space for both pre-cooked IPv4 and IPv6 headers, as well as shared space
for the tap specific header and socket address (using sockaddr_inany).

Because we're using IOVs to separately address the pieces of each frame,
these structures don't need to be packed to keep the headers contiguous
so we can more naturally arrange for the alignment we want.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:52 +02:00
David Gibson
d4598e1d18 udp: Use the same buffer for the L2 header for all frames
Currently each tap-bound frame buffer has room for its own ethernet header.
However the ethernet header is always the same for such frames, so now
that we're indirectly referencing the ethernet header via iov, we can use
a single buffer for all of them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:49 +02:00
David Gibson
6170688616 udp: Share payload buffers between IPv4 and IPv6
Currently the IPv4 and IPv6 paths unnecessarily use different buffers for
the UDP payload.  Now that we're handling the various pieces of the UDP
packets with an iov, we can split the payload part of the buffers off into
its own array shared between IPv4 and IPv6.  As well as saving a little
memory, this allows the payload buffers to be neatly page aligned.

With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing
as each other and can also be merged.  Likewise udp[46]_iov_splice can be
merged together.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:46 +02:00
David Gibson
2d16946bac udp: Explicitly set checksum in guest-bound UDP headers
For IPv4, UDP checksums are optional and can just be set to 0.
udp_update_hdr4() ignores the checksum field entirely.  Since these are set
to 0 during startup, this works as intended for now.

However, we'd like to share payload and UDP header buffers betweem IPv4 and
IPv6, which does calculate UDP checksums.  Therefore, for robustness, we
should explicitly set the checksum field to 0 for guest-bound UDP packets.

In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums
to be calculated as a compile time option.  For consistency, use the same
thing in the udp_update_hdr4() path, which will typically initialize to 0,
but calculate a real checksum if configured to do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:44 +02:00
David Gibson
6c4d26a364 udp: Combine initialisation of IPv4 and IPv6 iovs
We're going to introduce more sharing between the IPv4 and IPv6 buffer
structures.  Prepare for this by combinng the initialisation functions.
While we're at it remove the misleading "sock" from the name since these
initialise both tap side and sock side structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:41 +02:00
David Gibson
3f9bd867b5 udp: Split tap-bound UDP packets into multiple buffers using io vector
When sending to the tap device, currently we assemble the headers and
payload into a single contiguous buffer.  Those are described by a single
struct iovec, then a batch of frames is sent to the device with
tap_send_frames().

In order to better integrate the IPv4 and IPv6 paths, we want the IP
header in a different buffer that might not be contiguous with the
payload.  To prepare for that, split the UDP packet into an iovec of
buffers.  We use the same split that Laurent recently introduced for
TCP for convenience.

This removes the last use of tap_hdr_len_(), tap_frame_base() and
tap_frame_len(), so remove those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:38 +02:00
David Gibson
fcd9308856 test: Allow sftp via vsock-ssh in tests
During some debugging recently, I wanted to extact a file from a test
guest and found it was tricky, since the ssh-over-vsock setup we had didn't
allow sftp/scp.  We can fix this by adding a line to the guest side sshd
config from mbuto.  While we're there correct an inaccurate comment.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:36 +02:00
David Gibson
eea5d3ef2d tcp: Update tap specific header too in tcp_fill_headers[46]()
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
  * Removes a little redundancy between the tcp_send_flag() and
    tcp_data_to_tap() path
  * Makes calculation of the correct length a little easier
  * Removes the now misleadingly named 'vnet_len' variable in
    tcp_send_flag()

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:34 +02:00
David Gibson
3559899586 iov: Helper macro to construct iovs covering existing variables or fields
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code.  In many of those cases, and few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields.  We can make initializing such iovs shorter and
clearer with a macro for the purpose.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:31 +02:00
David Gibson
40f8b2976a tap, tcp: (Re-)abstract TAP specific header handling
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.

The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:29 +02:00
David Gibson
68d1b0a152 tcp: Simplify packet length calculation when preparing headers
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length.  We can just use the L4 length throughout.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:26 +02:00
David Gibson
5566386f5f treewide: Standardise variable names for various packet lengths
At various points we need to track the lengths of a packet including or
excluding various different sets of headers.  We don't always use the same
variable names for doing so.  Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.

To improve clarity, standardise on these names:
   dlen:		L4 protocol payload length ("data length")
   l4len:		plen + length of L4 protocol header
   l3len:		l4len + length of IPv4/IPv6 header
   l2len:		l3len + length of L2 (ethernet) header

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:23 +02:00
David Gibson
9e22c53aa9 checksum: Make csum_ip4_header() take a host endian length
csum_ip4_header() takes the packet length as a network endian value.  In
general it's very error-prone to pass non-native-endian values as a raw
integer.  It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.

It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:21 +02:00
David Gibson
1095a7b0c9 treewide: Remove misleading and redundant endianness notes
In general, it's much less error-prone to have the endianness of values
implied by the type, rather than just noting it in comments.  We can't
always easily avoid it, because C, but we can do so when possible.  struct
in_addr and in6_addr are always encoded network endian, so noting it
explicitly isn't useful.  Remove them.

In some cases we also have endianness notes on uint8_t parameters, which
doesn't make sense: for a single byte endianness is irrelevant.  Remove
those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:16 +02:00
David Gibson
5d37dab012 tap: Remove unused structs tap_msg, tap_l4_msg
Use of these structures was removed in bb70811183 ("treewide: Packet
abstraction with mandatory boundary checks").  Remove the stale
declarations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:13 +02:00
David Gibson
34fb381b5a tap: Split tap specific and L2 (ethernet) headers
In some places (well, actually only UDP now) we use struct tap_hdr to
represent both tap backend specific and L2 ethernet headers.  Handling
these together seemed like a good idea at the time, but Laurent's changes
in the TCP code working towards vhost-user support suggest that treating
them separately is more useful, more often.

Alter struct tap_hdr to represent only the TAP backend specific headers.
Updated related helpers and the UDP code to match.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:08 +02:00
David Gibson
c27ca91564 checksum: Use proto_ipv6_header_psum() for ICMPv6 as well
7df624e79 ("checksum: introduce functions to compute the header part
checksum for TCP/UDP") introduced a helper to compute the partial checksum
for the IPv6 pseudo-header used in L4 protocol checksums.  It used it in
csum_udp6() for UDP packets, but not in csum_icmp6() for the identical
calculation in csum_icmp6() for ICMPv6 packets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:03 +02:00
Stefano Brivio
76e32022c4 netlink: Fix iterations over nexthop objects
Somewhat confusingly, RTNH_NEXT(), as defined by <linux/rtnetlink.h>,
doesn't take an attribute length parameter like RTA_NEXT() does, and
I just modelled loops over nexthops after RTA loops, forgetting to
decrease the remaining length we pass to RTNH_OK().

In practice, this didn't cause issue in any of the combinations I
checked, at least without the next patch.

We seem to be the only user of RTNH_OK(): even iproute2 has an
open-coded version of it in print_rta_multipath() (ip/iproute.c).

Introduce RTNH_NEXT_AND_DEC(), similar to RTA_NEXT(), and use it.

Fixes: 6c7623d07b ("netlink: Add support to fetch default gateway from multipath routes")
Fixes: f4e38b5cd2 ("netlink: Adjust interface index inside copied nexthop objects too")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-02 16:12:45 +02:00
Stefano Brivio
d03c4e2020 netlink: Use IFA_F_NODAD also while duplicating addresses from the host
...not just for a single set address (legacy operation with
--no-copy-addrs). I forgot to add this to nl_addr_dup().

Note that we can have two version of flags: the 8-bit ifa_flags in
ifaddrmsg, and the newer 32-bit version as IFA_FLAGS attribute, which
is given priority if present. Make sure IFA_F_NODAD is set in both.

Without this, a Podman user reports, something on the lines of:
  pasta --config-net -- ping -c1 -6 passt.top

would fail as the kernel would start Duplicate Address Detection
once we configure the address, which can't really work (and doesn't
make sense) in this case.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-26 07:46:54 +02:00
Stefano Brivio
bfc83b54c4 netlink: For IPv4, IFA_LOCAL is the interface address, not IFA_ADDRESS
See the comment to the unnamed enum in linux/if_addr.h, which
currently states:

  /*
   * Important comment:
   * IFA_ADDRESS is prefix address, rather than local interface address.
   * It makes no difference for normally configured broadcast interfaces,
   * but for point-to-point IFA_ADDRESS is DESTINATION address,
   * local address is supplied in IFA_LOCAL attribute.
   *
   * [...]
   */

if we fetch IFA_ADDRESS, and we have a point-to-point link with a peer
address configured, we'll source the peer address as "our" address,
and refuse to resolve it in arp().

This was reported with pasta and a tun upstream interface configured
by OpenVPN in "p2p" topology: the target namespace will have similar
addresses and routes as the host, which is fine, and will try to
resolve the point-to-point peer address (because it's the default
gateway).

Given that we configure it as our address (only internally, not
visibly in the namespace), we'll fail to resolve that and traffic
doesn't go anywhere.

Note that this is not the case for IPv6: there, IFA_ADDRESS is the
actual, local address of the interface, and IFA_LOCAL is not
necessarily present, so the comment in linux/if_addr.h doesn't apply
either.

Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-26 07:46:42 +02:00
David Gibson
ff2ff2fbca test: Make log truncation test more robust
test/pasta_options/log_to_file checks that pasta truncates its log file
when started.  It does that by starting pasta with a log file once, then
starting it again and checking that after the second round, the log file
has only one line: the startup banner from the second invocation.

However, this test will break if the second invocation logs any additional
messages at startup.  This can easily happen on a host with multiple
network interfaces due to the "Multiple default route" informational
messages added in 639fdf06e ("netlink: Fix selection of template
interface").  I believe it could also happen on a host without IPv6
connectivity due to the "Couldn't pick external interface" messages, though
I haven't confirmed this.

Make the log file test more robust, by not testing for a single line, but
instead explicitly testing for the PID of the second pasta invocation in
the banner line.

Link: https://bugs.passt.top/show_bug.cgi?id=88
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
David Gibson
2681366966 test: Slight simplification to pasta log tests
test/pasta_options/log_to_file contains a couple of rudimentary tests
where we start pasta with an interactive shell, then immediately exit it.
We can achieve the same thing by using /bin/true as the command to pasta.
This also means that waiting for pasta to start, waiting for the executed
command to complete and for pasta to clean up are all handled by simply
waiting for pasta to complete in the foreground, so there's no need for an
additional sleep.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
David Gibson
0804fdbc28 udp: Correctly look up outbound socket with port remappings
Commit bb9bf0bb ("tcp, udp: Don't precompute port remappings in epoll
references") changed the epoll reference for UDP sockets to include the
bound port as seen by the socket itself, rather than the bound port it
would be translated to on the guest side.  As a side effect, it also means
that udp_tap_map[] is indexed by the bound port on the host side, rather
than on the guest side.  This is consistent and a good idea, however we
forgot to account for it when finding the correct outgoing socket for
packets originating in the guest.  This means that if forwarding UDP
inbound with a port number change, reply packets would be misdirected.

Fix this by applying the reverse mapping before looking up the socket in
udp_tap_handler().  While we're at it, use 'port' directly instead of
'uref.port' in udp_sock_init().  Those now always have the same value -
failing to realise that is the same error as above.

Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=87
Fixes: bb9bf0bb8f ("tcp, udp: Don't precompute port remappings in epoll references")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-25 00:00:34 +02:00
Laurent Vivier
95601237ef tcp: Replace TCP buffer structure by an iovec array
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.

Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:09 +02:00
Stefano Brivio
27f1c762b1 conf: Don't fail if the template interface doesn't have a MAC address
...simply resort to using locally-administered address (LAA) as
host-side source, instead.

Pick 02:00:00:00:00:00, to make it clear that we don't actually care
about that address, and also to match the 00 (Administratively
Assigned Identifier) quadrant of SLAP (RFC 8948).

Otherwise, pasta refuses to start if the template is a tun or
Wireguard interface.

Link: https://bugs.passt.top/show_bug.cgi?id=49
Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:00 +02:00
Stefano Brivio
eca8baa028 conf: We're interested in the MAC address, not in the MAC itself
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-19 11:15:36 +02:00
Stefano Brivio
ee338a256e pasta, util: Align stack area for clones to maximum natural alignment
Given that we use this stack pointer as a location to store arbitrary
data types from the cloned process, we need to guarantee that its
alignment matches any of those possible data types.

runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on
aarch64, where the alignment requirement for stack pointers is a
16 bytes (same as the size of a long double), and similar requirements
actually apply to most architectures we run on.

Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=85
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-19 11:15:27 +02:00
Stefano Brivio
5d5208b67d treewide: Compilers' name for armv6l and armv7l is "arm"
When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the
architecture name for, among others, seccomp.sh, I didn't realise
that "armv6l" and "armv7l" are just Linux kernel names -- compilers
just call that "arm".

Fix the "syscalls" annotation we use to define seccomp profiles
accordingly, otherwise pasta will be terminated on sigreturn() on
armv6l and armv7l.

Fixes: 213c397492 ("passt, pasta: Run-time selection of AVX2 build")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-11 17:34:04 +02:00
David Gibson
954589b64b test: Verify that podman tests are using the pasta binary we expect
Paul Holzinger pointed out that when we invoke the podman tests inside the
passt testsuite, the way we point podman at the newly built pasta binary
is kind of indirect.  It's therefore prudent to check that podman is
actually using the binary we expect it to - in particular that it is using
the binary built in this tree, not some system installed pasta binary.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:24 +02:00
David Gibson
489b28e216 test: catatonit may not be in $PATH
The pasta_podman/bats test script looks for 'catatonit' amongst other tools
to be avaiiliable on the host.  However, while the podman tests do require
catatonit, it doesn't necessarily need to be in the regular path.  For
example Fedora and RHEL place catatonit in /usr/libexec and podman finds it
there fine.

Therefore, remove it as an htools dependency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:21 +02:00
David Gibson
f9fe3ae5dd test: Build and download podman as a test asset
The pasta_podman/bats test scrpt downloads and builds podman, then runs its
pasta specific tests.  Downloading from within a test case has some
drawbacks:
 * It can be very tedious if you have poor connectivity to the server
 * It makes a test that's ostensibly for pasta itself dependent on the
   state of the github server
 * It precludes runnning the tests in an isolated network environment

The same concerns largely apply to building podman too, because it's pretty
common for Go builds to download dependencies themselves.  Therefore move
the download and build of podman from the test itself, to the Makefile
where we prepare other test assets.

To avoid cryptic failures if something went wrong with the build, make
running the test dependent on having the built podman binary.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:16 +02:00
David Gibson
e8b78217bb test: Make sure to update mbuto repository
We download and use mbuto to build trivial boot images for our VM tests.
However, if mbuto is already cloned, we won't update it to the current
version.  Add some make logic to ensure that we do this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:13 +02:00
David Gibson
ef2cb13b49 cppcheck: Explicitly give files to check
Currently "make cppcheck" invokes cppcheck on ".", so it will check all the
.c and .h files it can find in the source tree.  This isn't ideal, because
it can find files that aren't actually part of the real build, or even
stale files which aren't in git.

More practically, some upcoming changes are looking at downloading other
source trees for some tests.  Static errors in there is Not Our Problem,
so checking them is both slow and pointless.

So, change the Makefile to invoke cppcheck only on the specific source
files that are part of the build.  For some reason in this format the
badBitmaskCheck warnings in seccomp.h which were suppressed by 5beb3472e
("cppcheck: Avoid errors due to zeroes in bitwise ORs") no longer trigger.
That means we get unmatchedSuppression warnings instead.  We add an
unmatchedSuppression suppression instead of simply removing the original
suppressions, just in case this odd behaviour isn't the same for all
cppcheck versions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:11 +02:00
David Gibson
97e8b33f87 netlink: Ignore routes to link-local addresses for selecting interface
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), and since 639fdf06ed ("netlink: Fix selection of template
interface") less buggily, we haven't required a default route on the host
in order to operate.  Instead, if we lack a default route we'll pick an
interface with any route, as long as there's only one such interface.  If
there's more than one, we don't have a good criterion to pick, so we give
up with an informational message.

Paul Holzinger pointed out that this code considers it ambiguous even if
all but one of the interfaces has only routes to link-local addresses
(fe80::/10).  A route to link-local addresses isn't really useful from
pasta's point of view, so ignore them instead.  This removes a misleading
message in many cases, and a spurious failure in some cases.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:08 +02:00
David Gibson
67a6258918 util: Add helper to return name of address family
We have a few places where we want to include the name of the internet
protocol version (IPv4 or IPv6) in a message, which we handle with an
open-coded ?: expression.

This seems like something that might be more widely useful, so make a
trivial helper to return the correct string based on the address family.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:05 +02:00
Stefano Brivio
f4e38b5cd2 netlink: Adjust interface index inside copied nexthop objects too
As pasta duplicates host routes into the target namespaces, interface
indices might not match, so we go through RTA_OIF attributes and fix
them up to match the identifier in the namespace.

But RTA_OIF is not the ony attribute specifying interfaces for routes:
multipath routes use RTA_MULTIPATH attributes with nexthop objects,
which contain in turn interface indices. Fix them up as well.

If we don't, and we have at least two host interfaces, and the host
interface we use as template isn't the first one (hence the
mismatching indices), we'll fail to insert multipath routes with
nexthop objects, and ultimately refuse to start as the kernel
unexpectedly gives us ENODEV.

Link: https://github.com/containers/podman/issues/22192
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-05 16:58:52 +02:00
Danish Prakash
88c2f08eba apparmor: Fix access to procfs namespace entries in pasta's abstraction
From an original patch by Danish Prakash.

With commit ff22a78d7b ("pasta: Don't try to watch namespaces in
procfs with inotify, use timer instead"), if a filesystem-bound
target namespace is passed on the command line, we'll grab a handle
on its parent directory. That commit, however, didn't introduce a
matching AppArmor rule. Add it here.

To access a network namespace procfs entry, we also need a 'ptrace'
rule. See commit 594dce66d3 ("isolation: keep CAP_SYS_PTRACE when
required") for details as to when we need this -- essentially, it's
about operation with Buildah.

Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Fixes: ff22a78d7b ("pasta: Don't try to watch namespaces in procfs with inotify, use timer instead")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
100919ce74 apparmor: Expand scope of @{run}/user access, allow writing PID files too
With Podman's custom networks, pasta will typically need to open the
target network namespace at /run/user/<UID>/containers/networks:
grant access to anything under /run/user/<UID> instead of limiting it
to some subpath.

Note that in this case, Podman will need pasta to write out a PID
file, so we need write access, for similar locations, too.

Reported-by: Jörg Sonnenberger <joerg@bec.de>
Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
dc7b7f28b7 apparmor: Add mount rule with explicit, empty source in passt abstraction
For the policy to work as expected across either AppArmor commit
9d3f8c6cc05d ("parser: fix parsing of source as mount point for
propagation type flags") and commit 300889c3a4b7 ("parser: fix option
flag processing for single conditional rules"), we need one mount
rule with matching mount options as "source" (that is, without
source), and one without mount options and an explicit, empty source.

Link: https://github.com/containers/buildah/issues/5440
Link: https://bugzilla.suse.com/show_bug.cgi?id=1221840
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 12:12:26 +02:00
Stefano Brivio
bbea2752f6 README.md: Alpine, Guix and OpenSUSE now have packages for passt
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-05 12:12:23 +02:00
David Gibson
4988e2b406 tcp: Unconditionally force ACK for all !SYN, !RST packets
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window.  This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct.  RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.

The reason for the existing logic, rather than always forcing an ACK seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit.  However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.

[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1

Link: https://github.com/containers/podman/issues/22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:52:04 +01:00
David Gibson
5894a245b9 tcp: Never automatically add the ACK flag to RST packets
tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets.
This doesn't make sense for RST packets, where plain RST and RST+ACK have
somewhat different meanings.  AIUI, RST+ACK indicates an abrupt end to
a connection, but acknowledges data already sent.  Plain RST indicates an
abort, when one end receives a packet that doesn't seem to make sense in
the context of what it knows about the connection.  All of the cases where
we send RSTs are the second, so we don't want an ACK flag, but we currently
could add one anyway.

Change that, so we won't add an ACK to an RST unless the caller explicitly
requests it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:58 +01:00
David Gibson
16c2d8da0d tcp: Rearrange logic for setting ACK flag in tcp_send_flag()
We have different paths for controlling the ACK flag for the SYN and !SYN
paths.  This amounts to sometimes forcing on the ACK flag in the !SYN path
regardless of options.  We can rearrange things to explicitly be that which
will make things neater for some future changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:55 +01:00
David Gibson
99355e25b9 tcp: Split handling of DUP_ACK from ACK
The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the
setting of the ACK flag in the packet, even if we otherwise wouldn't.
Secondly, it causes a duplicate of the flags packet to be sent immediately
after the first.

Setting the ACK flag to tcp_send_flag() also has the first effect, so
instead of having DUP_ACK also do that, pass both flags when we need both
operations.  This slightly simplifies the logic of tcp_send_flag() in a way
that makes some future changes easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-26 09:51:41 +01:00
Laurent Vivier
71dd405460 util: fix confusion between offset in the iovec array and in the entry
In write_remainder() 'skip' is the offset to start the operation from
in the iovec array.

In iov_skip_bytes(), 'skip' is also the offset in the iovec array but
'offset' is the first unskipped byte in the iovec entry.

As write_remainder() uses 'skip' for both, 'skip' is reset to the
first unskipped byte in the iovec entry rather to staying the first
unskipped byte in the iovec array.

Fix the problem by introducing a new variable not to overwrite 'skip'
on each loop.

Fixes: 8bdb0883b4 ("util: Add write_remainder() helper")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-20 10:06:32 +01:00
David Gibson
639fdf06ed netlink: Fix selection of template interface
Since f919dc7a4b ("conf, netlink: Don't require a default route to
start"), if there is only one host interface with routes, we will pick that
as the template interface, even if there are no default routes for an IP
version.  Unfortunately this selection had a serious flaw: in some cases
it would 'return' in the middle of an nl_foreach() loop, meaning we
wouldn't consume all the netlink responses for our query.  This could cause
later netlink operations to fail as we read leftover responses from the
aborted query.

Rewrite the interface detection to avoid this problem.  While we're there:
  * Perform detection of both default and non-default routes in a single
    pass, avoiding an ugly goto
  * Give more detail on error and working but unusual paths about the
    situation (no suitable interface, multiple possible candidates, etc.).

Fixes: f919dc7a4b ("conf, netlink: Don't require a default route to start")
Link: https://bugs.passt.top/show_bug.cgi?id=83
Link: https://github.com/containers/podman/issues/22052
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2270257
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Use info(), not warn() for somewhat expected cases where one
 IP version has no default routes, or no routes at all]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-20 09:34:08 +01:00
David Gibson
d35bcbee90 netlink: Fix handling of NLMSG_DONE in nl_route_dup()
A recent kernel change 87d381973e49 ("genetlink: fit NLMSG_DONE into
same read() as families") changed netlink behaviour so that the
NLMSG_DONE terminating a bunch of responses can go in the same
datagram as those responses, rather than in a separate one.

Our netlink code is supposed to handle that behaviour, and indeed does
so for most cases, using the nl_foreach() macro.  However, there was a
subtle error in nl_route_dup() which doesn't work with this change.
f00b1534 ("netlink: Don't try to get further datagrams in
nl_route_dup() on NLMSG_DONE") attempted to fix this, but has its own
subtle error.

The problem arises because nl_route_dup(), unlike other cases doesn't
just make a single pass through all the responses to a netlink
request.  It needs to get all the routes, then make multiple passes
through them.  We don't really have anywhere to buffer multiple
datagrams, so we only support the case where all the routes fit in a
single datagram - but we need to fail gracefully when that's not the
case.

After receiving the first datagram of responses (with nl_next()) we
have a first loop scanning them.  It needs to exit when either we run
out of messages in the datagram (!NLMSG_OK()) or when we get a message
indicating the last response (nl_status() <= 0).

What we do after the loop depends on which exit case we had.  If we
saw the last response, we're done, but otherwise we need to receive
more datagrams to discard the rest of the responses.

We attempt to check for that second case by re-checking NLMSG_OK(nh,
status).  However in the got-last-response case, we've altered status
from the number of remaining bytes to the error code (usually 0). That
means NLMSG_OK() now returns false even if it didn't during the loop
check.  To fix this we need separate variables for the number of bytes
left and the final status code.

We also checked status after the loop, but this was redundant: we can
only exit the loop with NLMSG_OK() == true if status <= 0.

Reported-by: Martin Pitt <mpitt@redhat.com>
Fixes: f00b153414 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE")
Fixes: 4d6e9d0816 ("netlink: Always process all responses to a netlink request")
Link: https://github.com/containers/podman/issues/22052
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-19 10:23:34 +01:00
Dan Čermák
615d370ca2 fedora: Switch license identifier to SPDX
The spec file patch by Dan Čermák was originally contributed at:
  https://src.fedoraproject.org/rpms/passt/pull-request/1

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:47 +01:00
Stefano Brivio
d989eae308 udp: Translate source address of resolver only for DNS remapped queries
Paul reports that if pasta is configured with --dns-forward, and the
container queries a resolver which is configured on the host directly,
without using the address given for --dns-forward, we'll translate
the source address of the response pretending it's coming from the
address passed as --dns-forward, and the client will discard the
reply.

That is,

  $ cat /etc/resolv.conf
  198.51.100.1
  $ pasta --config-net --dns-forward 192.0.2.1 nslookup passt.top

will not work, because we change the source address of the reply from
198.51.100.1 to 192.0.2.1. But the client contacted 198.51.100.1, and
it's from that address that it expects an answer.

Add a PORT_DNS_FWD flag for tap-facing ports, which is triggered by
activity in the opposite direction as the other flags. If the
tap-facing port was seen sending a DNS query that was remapped, we'll
remap the source address of the response, otherwise we'll leave it
unaffected.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-18 08:57:40 +01:00