passt

mirror of https://passt.top/passt synced 2025-05-30 21:05:34 +02:00

Author	SHA1	Message	Date
David Gibson	1d7cff3779	clang: Add rudimentary clangd configuration clangd's default configuration seems to try to treat .h files as C++ not C. There are many more spurious warnings generated at present, but this removes some of the most egregious ones. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:47:07 +01:00
David Gibson	c560e2f65b	Makefile: Don't attempt to auto-detect stack size We probe the available stack limit in the Makefile using rlimit, then use that to set the size of the stack when we clone() extra threads. But the rlimit at compile time need not be the same as the rlimit at runtime, so that's not particularly sensible. Ideally, we'd set the stack size based on an estimate of the actual maximum stack usage of all our clone()ed functions. We don't have that at the moment, but to keep things simple just set it to 1MiB - that's what the current probe will set things to on my default configuration Fedora 40, so it's likely to be fine in most cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:47:03 +01:00
David Gibson	13fc6d511e	Makefile: Use -DARCH for qrap only We insert -DARCH for all compiles, based on TARGET_ARCH determined in the Makefile. However, this is only used in qrap.c, not anywhere else in passt or pasta. Only supply this -D when compiling qrap specifically. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:59 +01:00
David Gibson	7917159005	seccomp: Simplify handling of AUDIT_ARCH Currently we construct the AUDIT_ARCH variable in the Makefile, then pass it into the C code with -D. The only place that uses it, though is the BPF filter generated by seccomp.sh. seccomp.sh already needs to do things differently depending on the arch, so it might as well just insert the expanded AUDIT_ARCH directly into the generated code, rather than using a #define. Arguably this is better, even, since it ensures more locally that the arch the BPF checks for matches the arch seccomp.sh built the filter for. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:55 +01:00
David Gibson	93bce404c1	Makefile: Move NETNS_RUN_DIR definition to C code NETNS_RUN_DIR is set in the Makefile, then passed into the C code with -D. But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any make probes or variables, so there's really no reason to handle it via the Makefile. Just move it to a plain #define in conf.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:52 +01:00
David Gibson	c938d8a93e	netlink: RTA_PAYLOAD() returns int, not size_t Since it's the size of a chunk of memory it would seem logical that RTA_PAYLOAD() returns size_t. However, it doesn't - it explicitly casts its result to an int. RTNH_OK(), which often takes the result of RTA_PAYLOAD() as a parameter compares it to an int, so using size_t can result in comparison of different-signed integer warnings from clang. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:48 +01:00
David Gibson	f6b546c6e4	flow: Correct type of flowside_at_sidx() Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the failure case. PIF_NONE expands to 0, which turns into NULL, but it's still confusing, so fix it. This removes a clang warning. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:44 +01:00
David Gibson	30b4f88167	arch: Avoid explicit access to 'environ' We pass 'environ' to execve() in arch_avc2_exec(), so that we retain the environment in the current process. But the declaration of 'environ' is a bit weird - it doesn't seem to be in a standard header, requiring a manual explicit declaration. But, we can avoid needing to reference it explicitly by using execv() instead of execve(). This removes a clang warning. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:29 +01:00
David Gibson	b78e72da0b	clang: Move clang-tidy configuration from Makefile to .clang-tidy Currently we configure clang-tidy with a very long command line spelled out in the Makefile (mostly a big list of lints to disable). Move it from here into a .clang-tidy configuration file, so that the config is accessible if clang-tidy is invoked in other ways (e.g. via clangd) as well. As a bonus this also means that we can move the bulky comments about why we're suppressing various tests inline with the relevant config lines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:19 +01:00
David Gibson	8346216c9a	Makefile: Simplify exclusion of qrap from static checks There are things in qrap.c that clang-tidy complains about that aren't worth fixing. So, we currently exclude it using $(filter-out). However, we already have a make variable which has just the passt sources, excluding qrap, so we can use that instead of the awkward filter-out expression. Currently, we still include qrap.c for cppcheck, but there's not much point doing so: it's, well, qrap, so we don't care that much about lints. Exclude it from cppcheck as well, for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:46:07 +01:00
David Gibson	8f1b6a0ca6	clang: Add .clang-format file I've been experimenting with clangd, but its default format style is horrid. Since our style is basically that of the Linux kernel, copy the .clang-format from the kernel, minus reference to a bunch of kernel specific macros. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:45:16 +01:00
David Gibson	5e93bcd8bf	test: Adjust misplaced sleeps in two_guests code Most of our transfer tests using socat use 'sleep' waaiting for the server side to be ready before starting the client. However in two_guests/basic the sleep is in the wrong place: rather than being between starting the server and starting the client, it's after waiting for the server to complete. This causes occasional hangs when the client runs before the server is ready - in that case the receiving guest sends an RST, which we don't (currently) propagate back to the sender. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-05 23:46:38 +01:00
Stefano Brivio	9afce0b45c	tap: Explicitly cast TUNSETIFF to fix build warning with musl on ppc64le On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than INT_MAX (2^31 - 1), and musl declares the second argument of ioctl() as 'int', not 'unsigned long' like glibc does, probably because of how POSIX specifies the equivalent argument, int dcmd, in posix_devctl(), so gcc reports a warning: tap.c: In function 'tap_ns_tun': tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow] 1291 \| rc = ioctl(fd, TUNSETIFF, &ifr); \| ^~~~~~~~~ We don't care about that overflow, so explicitly cast TUNSETIFF to int. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-05 23:46:33 +01:00
Stefano Brivio	d165d36a0c	tcp: Fix build against musl, __sum16 comes from linux/types.h Use a plain uint16_t instead and avoid including one extra header: the 'bitwise' attribute of __sum16 is just used by sparse(1). Reported-by: omni <omni+alpine@hack.org> Fixes: `3d484aa370` ("tcp: Update TCP checksum using an iovec array") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-05 23:46:24 +01:00
Stefano Brivio	ee7d0b62a7	util: Don't use errno after a successful call in __daemon() I thought we could just set errno to 0, do a bunch of stuff, and check that errno didn't change to infer we succeeded. But clang-tidy, starting with LLVM 19, reports: /home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors] 465 \| if (errno) \| ^ /usr/include/errno.h:38:16: note: expanded from macro 'errno' 38 \| # define errno (__errno_location ()) \| ^~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false 446 \| if (pid == -1) { \| ^~~~~~~~~ /home/sbrivio/passt/util.c:446:2: note: Taking false branch 446 \| if (pid == -1) { \| ^ /home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0 451 \| if (pid) { \| ^~~ /home/sbrivio/passt/util.c:451:2: note: Taking false branch 451 \| if (pid) { \| ^ /home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call 463 \| close(devnull_fd); \| ^~~~~~~~~~~~~~~~~ /home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno' 465 \| if (errno) \| ^ /usr/include/errno.h:38:16: note: expanded from macro 'errno' 38 \| # define errno (__errno_location ()) \| ^~~~~~~~~~~~~~~~~~~~~~ And the LLVM documentation for the unix.Errno checker, 1.1.8.3 unix.Errno (C), mentions, at: https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno that: The C and POSIX standards often do not define if a standard library function may change value of errno if the call does not fail. Therefore, errno should only be used if it is known from the return value of a function that the call has failed. which is, somewhat surprisingly, the case for close(). Instead of using errno, check the actual return values of the calls we issue here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:31 +01:00
Stefano Brivio	b1a607fba1	udp: Take care of cert-int09-c clang-tidy warning for enum udp_iov_idx /home/sbrivio/passt/udp.c:171:1: error: inital values in enum 'udp_iov_idx' are not consistent, consider explicit initialization of all, none or only the first enumerator [cert-int09-c,readability-enum-initial-value,-warnings-as-errors] 171 \| enum udp_iov_idx { \| ^ 172 \| UDP_IOV_TAP = 0, 173 \| UDP_IOV_ETH = 1, 174 \| UDP_IOV_IP = 2, 175 \| UDP_IOV_PAYLOAD = 3, 176 \| UDP_NUM_IOVS \| \| = 4 Don't initialise any value, so that it's obvious that constants map to unique values. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:31 +01:00
Stefano Brivio	099ace64ce	treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions For clock_gettime(), we shouldn't ignore errors if they happen at initialisation phase, because something is seriously wrong and it's not helpful if we proceed as if nothing happened. As we're up and running, though, it's probably better to report the error and use a stale value than to terminate altogether. Make sure we use a zero value if we don't have a stale one somewhere. For timerfd_gettime() and timerfd_settime() failures, just report an error, there isn't much else we can do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-30 12:37:31 +01:00
Stefano Brivio	59fe34ee36	treewide: Suppress clang-tidy warning if we already use O_CLOEXEC In pcap_init(), we should always open the packet capture file with O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means close-on-exec, not close-on-fork. In logfile_init() and pidfile_open(), the fact that we pass a third 'mode' argument to open() seems to confuse the android-cloexec-open checker in LLVM versions from 16 to 19 (at least). The checker is suggesting to add O_CLOEXEC to 'mode', and not in 'flags', where we already have it. Add a suppression for clang-tidy and a comment, and avoid repeating those three times by adding a new helper, output_file_open(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-30 12:37:31 +01:00
Stefano Brivio	134b4d58b4	Makefile: Disable readability-math-missing-parentheses clang-tidy check With clang-tidy and LLVM 19: /home/sbrivio/passt/conf.c:1218:29: error: '' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 1218 \| const char octet = str + 3 * i; \| ^~~~~~ \| ( ) /home/sbrivio/passt/ndp.c:285:18: error: '' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 285 \| .len = 1 + 2 n, \| ^~~~~~ \| ( ) /home/sbrivio/passt/ndp.c:329:23: error: '%' has higher precedence than '-'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 329 \| memset(ptr, 0, 8 - dns_s_len % 8); /* padding / \| ^~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/pcap.c:131:20: error: '' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 131 \| pcap_frame(iov + i * frame_parts, frame_parts, offset, &now); \| ^~~~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/util.c:216:10: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 216 \| return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 + \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/util.c:217:10: error: '' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 217 \| (a->tv_sec - b->tv_sec - 1) 1000000; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/util.c:220:9: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 220 \| return (a->tv_nsec - b->tv_nsec) / 1000 + \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/util.c:221:9: error: '' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 221 \| (a->tv_sec - b->tv_sec) 1000000; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| ( ) /home/sbrivio/passt/util.c:545:32: error: '/' has higher precedence than '+'; add parentheses to explicitly specify the order of operations [readability-math-missing-parentheses,-warnings-as-errors] 545 \| return clone(fn, stack_area + stack_size / 2, flags, arg); \| ^~~~~~~~~~~~~~~ \| ( ) Just... no. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:31 +01:00
Stefano Brivio	744247856d	treewide: Silence cert-err33-c clang-tidy warnings for fprintf() We use fprintf() to print to standard output or standard error streams. If something gets truncated or there's an output error, we don't really want to try and report that, and at the same time it's not abnormal behaviour upon which we should terminate, either. Just silence the warning with an ugly FPRINTF() variadic macro casting the fprintf() expressions to void. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:31 +01:00
Stefano Brivio	98efe7c2fd	treewide: Comply with CERT C rule ERR33-C for snprintf() clang-tidy, starting from LLVM version 16, up to at least LLVM version 19, now checks that we detect and handle errors for snprintf() as requested by CERT C rule ERR33-C. These warnings were logged with LLVM version 19.1.2 (at least Debian and Fedora match): /home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 43 \| snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning /home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 577 \| snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning /home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 579 \| snprintf(userns, PATH_MAX, "/proc/%ld/ns/user", \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 580 \| pidval); \| ~~~~~~~ /home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 105 \| snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 242 \| snprintf(uidmap, BUFSIZ, "0 %u 1", uid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 243 \| snprintf(gidmap, BUFSIZ, "0 %u 1", gid); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning /home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors] 1155 \| snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning Don't silence the warnings as they might actually have some merit. Add an snprintf_check() function, instead, checking that we're not truncating messages while printing to buffers, and terminate if the check fails. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 12:37:25 +01:00
Stefano Brivio	988a4d75f8	Makefile: Exclude qrap.c from clang-tidy checks We'll deprecate qrap(1) soon, and warnings reported by clang-tidy as of LLVM versions 16 and later would need a bunch of changes there to be addressed, mostly around CERT C rule ERR33-C and checking return code from snprintf(). It makes no sense to fix warnings in qrap just for the sake of it, so officially declare the bitrotting season open. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-30 08:21:19 +01:00
Jon Maloy	ba38e67cf4	tcp: unify l2 TCPv4 and TCPv6 queues and structures Following the preparations in the previous commit, we can now remove the payload and flag queues dedicated for TCPv6 and TCPv4 and move all traffic into common queues handling both protocol types. Apart from reducing code and memory footprint, this change reduces a potential risk for TCPv4 traffic starving out TCPv6 traffic. Since we always flush out the TCPv4 frame queue before the TCPv6 queue, the latter will never be handled if the former fails to send all its frames. Tests with iperf3 shows no measurable change in performance after this change. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-29 12:44:08 +01:00
Jon Maloy	2053c36dec	tcp: set ip and eth headers in l2 tap queues on the fly l2 tap queue entries are currently initialized at system start, and reused with preset headers through its whole life time. The only fields we need to update per message are things like payload size and checksums. If we want to reuse these entries between ipv4 and ipv6 messages we will need to set the pointer to the right header on the fly per message, since the header type may differ between entries in the same queue. The same needs to be done for the ethernet header. We do these changes here. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-29 12:43:24 +01:00
Laurent Vivier	5563d5f668	test: remove obsolete images Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be downloaded anymore Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-25 14:30:06 +02:00
Laurent Vivier	f43f7d5e89	tcp: cleanup tcp_buf_data_from_sock() Remove the err label as there is only one caller, and move code to the caller position. ret is not needed here anymore as it is always 0. Remove sendlen as we can user directly len. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-25 14:29:51 +02:00
David Gibson	e7fcd0c348	tcp: Use runtime tests for TCP_INFO fields In order to use particular fields from the TCP_INFO getsockopt() we need them to be in structure returned by the runtime kernel. We attempt to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed in the Makefile. However, that's not correct, because the kernel headers we compile against may not be the same as the runtime kernel. We instead should check against the size of structure returned from the TCP_INFO getsockopt() as we already do for tcpi_snd_wnd. Switch from the compile time flags to a runtime test. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-25 14:29:46 +02:00
David Gibson	81143813a6	tcp: Generalise probing for tcpi_snd_wnd field In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we need the field to be supported in the runtime kernel (snd_wnd_cap). In fact we should check that for for every tcp_info field we want to use, beyond the very old ones shared with BSD. Prepare to do that, by generalising the probing from setting a single bool to instead record the size of the returned TCP_INFO structure. We can then use that recorded value to check for the presence of any field we need. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-25 14:27:17 +02:00
David Gibson	13f0291ede	tcp: Remove compile-time dependency on struct tcp_info version In the Makefile we probe to create several defines based on the presence of particular fields in struct tcp_info. These defines are used for two purposes, neither of which they accomplish well: 1) Determining if the tcp_info fields are available at runtime. For this purpose the defines are Just Plain Wrong, since the runtime kernel may not be the same as the compile time kernel. We corrected this for tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt 2) Allowing the source to compile against older kernel headers which don't have the fields in question. This works in theory, but it does mean we won't be able to use the fields, even if later run against a newer kernel. Furthermore, it's quite fragile: without much more thorough tests of builds in different environments that we're currently set up for, it's very easy to miss cases where we're accessing a field without protection from an #ifdef. For example we currently access tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd(). Improve this with a different approach, borrowed from qemu (which has many instances of similar problems). Don't compile against linux/tcp.h, using netinet/tcp.h instead. Then for when we need an extension field, define a struct tcp_info_linux, copied from the kernel, with all the fields we're interested in. That may need updating from future kernel versions, but only when we want to use a new extension, so it shouldn't be frequent. This allows us to remove the HAS_SND_WND define entirely. We keep HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1), we'll fix that in a later patch. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Trivial grammar fixes in comments] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-25 14:26:48 +02:00
Stefano Brivio	9e4615b40b	tcp_splice: fcntl(2) returns the size of the pipe, if F_SETPIPE_SZ succeeds Don't report bogus failures (with --trace) just because the return value is not zero. Link: https://github.com/containers/podman/issues/24219 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-25 09:36:30 +02:00
Stefano Brivio	149f457b23	tcp_splice: splice() all we have to the writing side, not what we just read In tcp_splice_sock_handler(), we try to calculate how much we can move from the pipe to the writing socket: if we just read some bytes, we'll use that amount, but if we haven't, we just try to empty the pipe. However, if we just read something, that doesn't mean that that's all the data we have on the pipe, as it's obvious from this sequence, where: pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) Flow 0 (TCP connection (spliced)): 98304 from read-side call Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304) Flow 0 (TCP connection (spliced)): -1 from read-side call Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 Flow 0 (TCP connection (spliced)): OUT_WAIT_0 we first pile up 98304 - 33615 = 64689 pending bytes, that we read but couldn't write, as the receiver buffer is full, and we set the corresponding OUT_WAIT flag. Then: pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) Flow 0 (TCP connection (spliced)): 32768 from read-side call Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 we splice() 32768 more bytes from our receiving side to the pipe. At some point: pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489 Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0 Flow 0 (TCP connection (spliced)): 1320 from read-side call Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320) the receiver is signalling to us that it's ready for more data (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from our receiving socket into the pipe, and that's what we write to the receiver, forgetting about the pending 97457 bytes we had, which the receiver might never get (not the same 97547 bytes: we'll actually send 1320 of those). This condition is rather hard to reproduce, and it was observed with Podman pulling container images via HTTPS. In the traces above, the client is side 0 (the initiating peer), and the server is sending the whole data. Instead of splicing from pipe to socket the amount of data we just read, we need to splice all the pending data we piled up until that point. We could do that using 'read' and 'written' counters, but there's actually no need, as the kernel also keeps track of how much data is available on the pipe. So, to make this simple and more robust, just give the whole pipe size as length to splice(). The kernel knows what to do with it. Later in the function, we used 'to_write' for an optimisation meant to reduce wakeups which retries right away to splice() in both directions if we couldn't write to the receiver the whole amount of pending data. Calculate a 'pending' value instead, only if we reach that point. Now that we check for the actual amount of pending data in that optimisation, we need to make sure we don't compare a zero or negative 'written' value: if we met that, it means that the receiver signalled end-of-file, an error, or to try again later. In those three cases, the optimisation doesn't make any sense, so skip it. Reported-by: Ed Santiago <santiago@redhat.com> Reported-by: Paul Holzinger <pholzing@redhat.com> Analysed-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/24219 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-25 09:36:18 +02:00
David Gibson	9e5df350d6	tcp: Use structures to construct initial TCP options As a rule, we prefer constructing packets with matching C structures, rather than building them byte by byte. However, one case we still build byte by byte is the TCP options we include in SYN packets (in fact the only time we generate TCP options on the tap interface). Rework this to use a structure and initialisers which make it a bit clearer what's going on. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by; Stefano Brivio <sbrivio@redhat.com>	2024-10-21 18:51:04 +02:00
David Gibson	b4dace8f46	fwd: Direct inbound spliced forwards to the guest's external address In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:28:03 +02:00
David Gibson	58e6d68599	test: Clarify test for spliced inbound transfers The tests in pasta/tcp and pasta/udp for inbound transfers have the server listening within the namespace explicitly bound to 127.0.0.1 or ::1. This only works because of the behaviour of inbound splice connections, which always appear with both source and destination addresses as loopback in the namespace. That's not an inherent property for "spliced" connections and arguably an undesirable one. Also update the test names to make it clearer that these tests are expecting to exercise the "splice" path. Interestingly this was already correct for the equivalent passt_in_ns/*, although we also update the test names for clarity there. Note that there are similar issues in some of the podman tests, addressed in https://github.com/containers/podman/pull/24064 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:28:00 +02:00
David Gibson	1fa421192c	passt.1: Clarify and update "Handling of local addresses" section This section didn't mention the effect of the --map-host-loopback option which now alters this behaviour. Update it accordingly. It used "local addresses" to mean specifically 127.0.0.0/8 and ::1. However, "local" could also refer to link-local addresses or to addresses of any scope which happen to be configured on the host. Use "loopback address" to be more precise about this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:57 +02:00
David Gibson	ef8a5161d0	passt.1: Mark --stderr as deprecated more prominently The description of this option says that it's deprecated, but unlike --no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add one to make it easier to spot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:54 +02:00
David Gibson	53176ca91d	test: Wait for DAD on DHCPv6 addresses After running dhclient -6 we expect the DHCPv6 assigned address to be immediately usable. That's true with the Fedora dhclient-script (and the upstream ISC DHCP one), however it's not true with the Debian dhclient-script. The Debian script can complete with the address still in "tentative" state, and the address won't be usable until Duplicate Address Detection (DAD) completes. That's arguably a bug in Debian (see link below), but for the time being we need to work around it anyway. We usually get away with this, because by the time we do anything where the address matters, DAD has completed. However, it's not robust, so we should explicitly wait for DAD to complete when we get an DHCPv6 address. Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:51 +02:00
David Gibson	75b9c0feb0	test: Explicitly wait for DAD to complete on SLAAC addresses Getting a SLAAC address takes a little while because the kernel must complete Duplicate Address Detection (DAD) before marking the address as ready. In several places we have an explicit 'sleep 2' to wait for that to complete. Fixed length delays are never a great idea, although this one is pretty solid. Still, it would be better to explicitly wait for DAD to complete in case of long delays (which might happen on slow emulated hosts, or with heavy load), and to speed the tests up if DAD completes quicker. Replace the fixed sleeps with a loop waiting for DAD to complete. We do this by looping waiting for all tentative addresses to disappear. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:47 +02:00
David Gibson	f9d677bff6	arp: Fix a handful of small warts This fixes a number of harmless but slightly ugly warts in the ARP resolution code: * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an example. * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of sizeof(am->tip) (same value, but makes more logical sense) * Described the guest's assigned address as such, rather than as "our address" - that's not usually what we mean by "our address" these days * Remove "we might have the same IP address" comment which I can't make sense of in context (possibly it's relating to the statement below, which already has its own comment?) Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:04 +02:00
Stefano Brivio	2d7f734c45	tcp: Send "empty" handshake ACK before first data segment Starting from commit `9178a9e346` ("tcp: Always send an ACK segment once the handshake is completed"), we always send an ACK segment, without any payload, to complete the three-way handshake while establishing a connection started from a socket. We queue that segment after checking if we already have data to send to the tap, which means that its sequence number is higher than any segment with data we're sending in the same iteration, if any data is available on the socket. However, in tcp_defer_handler(), we first flush "flags" buffers, that is, we send out segments without any data first, and then segments with data, which means that our "empty" ACK is sent before the ACK segment with data (if any), which has a lower sequence number. This appears to be harmless as the guest or container will generally reorder segments, but it looks rather weird and we can't exclude it's actually causing problems. Queue the empty ACK first, so that it gets a lower sequence number, before checking for any data from the socket. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-15 20:34:26 +02:00
Stefano Brivio	7612cb80fe	test: Pass TRACE from run_term() into ./run from_term Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests with TRACE=1 will not actually enable tracing output. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-10 05:25:19 +02:00
Stefano Brivio	b40880c157	test/lib/term: Always use printf for messages with escape sequences ...instead of echo: otherwise, bash won't handle escape sequences we use to colour messages (and 'echo -e' is not specified by POSIX). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-10 05:25:00 +02:00
David Gibson	ff63ac922a	conf: Add --dns-host option to configure host side nameserver When redirecting DNS queries with the --dns-forward option, passt/pasta needs a host side nameserver to redirect the queries to. This is controlled by the c->ip[46].dns_host variables. This is set to the first first nameserver listed in the host's /etc/resolv.conf, and there isn't currently a way to override it from the command line. Prior to `0b25cac9` ("conf: Treat --dns addresses as guest visible addresses") it was possible to alter this with the -D/--dns option. However, doing so was confusing and had some nonsensical edge cases because -D generally takes guest side addresses, rather than host side addresses. Add a new --dns-host option to restore this functionality in a more sensible way. Link: https://bugs.passt.top/show_bug.cgi?id=102 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 19:04:29 +02:00
David Gibson	9d66df9a9a	conf: Add command line switch to enable IP_FREEBIND socket option In a couple of recent reports, we've seen that it can be useful for pasta to forward ports from addresses which are not currently configured on the host, but might be in future. That can be done with the sysctl net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in the first place. We can allow the same thing on a per-socket basis with the IP_FREEBIND (or IPV6_FREEBIND) socket option. Add a --freebind command line argument to enable this socket option on all listening sockets. Link: https://bugs.passt.top/show_bug.cgi?id=101 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 19:04:29 +02:00
Laurent Vivier	151dbe0d3d	udp: Update UDP checksum using an iovec array As for tcp_update_check_tcp4()/tcp_update_check_tcp6(), change csum_udp4() and csum_udp6() to use an iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:13 +02:00
Laurent Vivier	3d484aa370	tcp: Update TCP checksum using an iovec array TCP header and payload are supposed to be in the same buffer, and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute the checksum from the base address of the header using the length of the IP payload. In the future (for vhost-user) we need to dispatch the TCP header and the TCP payload through several buffers. To be able to manage that, we provide an iovec array that points to the data of the TCP frame. We provide also an offset to be able to provide an array that contains the TCP frame embedded in an lower level frame, and this offset points to the TCP header inside the iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:10 +02:00
Laurent Vivier	e6548c6437	checksum: Add an offset argument in csum_iov() The offset allows any headers that are not part of the data to checksum to be skipped. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:08 +02:00
Laurent Vivier	fd8334b25d	pcap: Add an offset argument in pcap_iov() The offset is passed directly to pcap_frame() and allows any headers that are not part of the frame to capture to be skipped. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:02 +02:00
Laurent Vivier	72e7d3024b	tcp: Use tcp_payload_t rather than tcphdr As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the checksum using the TCP header and the TCP payload, it is clearer to use a pointer to tcp_payload_t that includes tcphdr and payload rather than a pointer to tcphdr (and guessing TCP header is followed by the payload). Move tcp_payload_t and tcp_flags_t to tcp_internal.h. (They will be used also by vhost-user). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:50:46 +02:00
Stefano Brivio	def8acdcd8	test: Kernel binary can now be passed via the KERNEL environmental variable This is quite useful at least for myself as I'm usually running tests using a guest kernel that's not the same as the one on the host. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-02 14:50:34 +02:00

1 2 3 4 5 ...

1743 commits