passt

Author	SHA1	Message	Date
Stefano Brivio	5b646b9b10	conf: Accept -a and -g without --config-net in pasta mode While --no-copy-addrs and --no-copy-routes only make sense with --config-net, and they are implied on -g and -a, respectively, that doesn't mean we should refuse -a or -g without --config-net: they are still relevant for a number of things (including DHCP/DHCPv6/NDP configuration). Reported-by: Gianluca Stivan <me@yawnt.com> Fixes: `cc9d16758b` ("conf, pasta: With --config-net, copy all addresses by default") Fixes: `da54641f14` ("conf, pasta: With --config-net, copy all routes by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:25 +02:00
Stefano Brivio	d034fb698f	conf: Make -a/--address really imply --no-copy-addrs I wrote it in commit message and man page, but not in conf()... Note that -g/--gateway correctly implies --no-copy-routes already. This fixes Podman's tests: podman networking with pasta(1) - IPv4 address assignment podman networking with pasta(1) - IPv4 default route assignment where we pass -a and -g to assign an address and a default gateway that's compatible with it, but -a doesn't disable the copy of addresses, so we ignore -a, and the default gateway is incompatible with the addresses we copy -- hence no routes in the container. Link: https://github.com/containers/podman/pull/18612 Fixes: `cc9d16758b` ("conf, pasta: With --config-net, copy all addresses by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:25 +02:00
David Gibson	db29fd281a	seccomp: Make seccomp.sh re-entrancy safe seccomp.sh generates seccomp.h piece by piece using >> directives. This means that if two instances of seccomp.sh are run concurrently a corrupted version of seccomp.h will be generated. Amongst other problems this can cause spurious failures on clang-tidy. Alter seccomp.sh to build the output in a temporary file and atomically move it to seccomp.h, so concurrent invocations will still result in valid output. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:11 +02:00
Stefano Brivio	3c6d1b9bb2	conf, log: On -h / --help, print usage to stdout, not stderr Erik suggests that this makes it easier to grep for options, and with --help we're anyway printing usage information as expected, not as part of an error report. While at it: on -h, we should exit with 0. Reported-by: Erik Sjölund <erik.sjolund@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=52 Link: https://bugs.passt.top/show_bug.cgi?id=53 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-06-23 10:15:55 +02:00
Stefano Brivio	d072ac2434	tap: With pasta, don't reset on tap errors, handle write failures Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), it makes no sense to close and reopen the tap device on error: we don't have access to /dev/net/tun after the initial setup phase. If we hit ENOBUFS while writing (as reported: in one case because the kernel actually ran out of memory, with another case under investigation), or ENOSPC, we're supposed to drop whatever data we were trying to send: there's no room for it. Handle EINTR just like we handled EAGAIN/EWOULDBLOCK: there's no particular reason why sending the same data should fail again. Anything else I can think of would be an unrecoverable error: exit with failure then. While at it, drop a useless cast on the write() call: it takes a const void * anyway. Reported-by: Gianluca Stivan <me@yawnt.com> Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Fixes: `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-06-23 10:15:10 +02:00
David Gibson	429e1a7e71	conf: Fix erroneous check of ip6->gw `a7359f0948` ("conf: Don't exit if sourced default route has no gateway") was supposed to allow passt/pasta to run even if given a template interface which has no default gateway. However a mistake in the patch means it still requires the gateway, but doesn't require a global address for the guest which we really do need. This is one part (but not the only part) of the problem seen in https://bugs.passt.top/show_bug.cgi?id=50. Reported-by: Justin Jereza <justinjereza@gmail.com> Fixes: `a7359f0948` ("conf: Don't exit if sourced default route has no gateway") Link: https://bugs.passt.top/show_bug.cgi?id=50 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-03 07:49:17 +02:00
David Gibson	e3b19530e4	test/nstool: Fix fd leak in accept() loop nstool loops on accept(), but failed to close the accepted socket fds before continuing on. So, with repeated commands it would eventually die with an EMFILE. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-23 17:06:32 +02:00
David Gibson	527c822a3b	test/nstool: Provide useful error if given a path that's too long Normal filesystem paths can be very long (PATH_MAX is around 8k), however Unix domain sockets can only use relatively short paths (UNIX_PATH_MAX is 108 on Linux). Currently nstool will simply truncate paths that are too long, leading to difficult to understand failures. Make such failures clearer, with an explicit error message if given a path that's too long. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-23 17:06:29 +02:00
Stefano Brivio	9f61c5b68b	passt.h: Fix description of pasta_ifi in struct ctx Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	cc9d16758b	conf, pasta: With --config-net, copy all addresses by default Use the newly-introduced NL_DUP mode for nl_addr() to copy all the addresses associated to the template interface in the outer namespace, unless --no-copy-addrs (also implied by -a) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. This is done mostly for consistency with routes. It might partially cover the issue at: https://bugs.passt.top/show_bug.cgi?id=47 Support multiple addresses per address family for some use cases, but not the originally intended one: we'll still use a single outbound address (unless the routing table specifies different preferred source addresses depending on the destination), regardless of the address used in the target namespace. Link: https://bugs.passt.top/show_bug.cgi?id=47 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	e89da3cf03	netlink: Add functionality to copy addresses from outer namespace Similarly to what we've just done with routes, support NL_DUP for addresses (currently not exposed): nl_addr() can optionally copy multiple addresses to the target namespace, by fixing up data from the dump with appropriate flags and interface index, and repeating it back to the kernel on the socket opened in the target namespace. Link-local addresses are not copied: the family is set to AF_UNSPEC, which means the kernel will ignore them. Same for addresses from a mismatching address (pre-4.19 kernels without support for NETLINK_GET_STRICT_CHK). Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC, because in general they will report mismatching names, and we don't really need to use labels as we already know the interface index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	a7359f0948	conf: Don't exit if sourced default route has no gateway If we use a template interface without a gateway on the default route, we can still offer almost complete functionality, except that, of course, we can't map the gateway address to the outer namespace or host, and that we have no obvious server address or identifier for use in DHCP's siaddr and option 54 (Server identifier, mandatory). Continue, if we have a default route but no default gateway, and imply --no-map-gw and --no-dhcp in that case. NDP responder and DHCPv6 should be able to work as usual because we require a link-local address to be present, and we'll fall back to that. Together with the previous commits implementing an actual copy of routes from the outer namespace, this should finally fix the operation of 'pasta --config-net' for cases where we have a default route on the host, but no default gateway, as it's the case for tap-style routes, including typical Wireguard endpoints. Reported-by: me@yawnt.com Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	e8fef7525c	Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway" This reverts commit `7656a6f888`: now, by default, we copy all the routes associated to the outbound interface into the routing table of the container, so there's no need for this horrible workaround anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	da54641f14	conf, pasta: With --config-net, copy all routes by default Use the newly-introduced NL_DUP mode for nl_route() to copy all the routes associated to the template interface in the outer namespace, unless --no-copy-routes (also implied by -g) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. Otherwise, we can't use default gateways which are not, address-wise, on the same subnet as the container, as reported by Callum. Reported-by: Callum Parsey <callum@neoninteger.au> Link: https://github.com/containers/podman/issues/18539 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	468f19a852	conf: --config-net option is for pasta mode only Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	2fe0461856	netlink: Add functionality to copy routes from outer namespace Instead of just fetching the default gateway and configuring a single equivalent route in the target namespace, on 'pasta --config-net', it might be desirable in some cases to copy the whole set of routes corresponding to a given output interface. For instance, in: https://github.com/containers/podman/issues/18539 IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes configuring the default gateway won't work without a gateway-less route (specifying the output interface only), because the default gateway is, somewhat dubiously, not on the same subnet as the container. This is a similar case to the one covered by commit `7656a6f888` ("conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"), and I'm not exactly proud of that workaround. We also have: https://bugs.passt.top/show_bug.cgi?id=49 pasta does not work with tap-style interface for which, eventually, we should be able to configure a gateway-less route in the target namespace. Introduce different operation modes for nl_route(), including a new NL_DUP one, not exposed yet, which simply parrots back to the kernel the route dump for a given interface from the outer namespace, fixing up flags and interface indices on the way, and requesting to add the same routes in the target namespace, on the interface we manage. For n routes we want to duplicate, send n identical netlink requests including the full dump: routes might depend on each other and the kernel processes RTM_NEWROUTE messages sequentially, not atomically, and repeating the full dump naturally resolves dependencies without the need to actually calculate them. I'm not kidding, it actually works pretty well. Link: https://github.com/containers/podman/issues/18539 Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	f099afb1f2	pasta: Improve error handling on failure to join network namespace In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're busy-looping until the network namespace appears. But any other failure is not something we're going to recover from: return right away if we don't get either success or ENOENT. Now that pasta_wait_for_ns() can actually fail, handle that in pasta_start_ns() by reporting the issue and exiting. Looping on EPERM, when pasta doesn't actually have the permissions to join a given namespace, isn't exactly a productive thing to do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-23 16:13:28 +02:00
Stefano Brivio	1c3c68970e	netlink: Fix comment about response buffer size for nl_req() Fixes: `fde8004ab0` ("netlink: Use 8 KiB * netlink message header size as response buffer") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	770d1a4502	isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init If pasta spawns a child process while running as UID 0, which is only allowed from a non-init namespace, we need to keep CAP_SETFCAP before pasta_start_ns() is called: otherwise, starting from Linux 5.12, we won't be able to update /proc/self/uid_map with the intended mapping (from 0 to 0). See user_namespaces(7). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	b0e450aa85	pasta: Detach mount namespace, (re)mount procfs before spawning command If we want /proc contents to be consistent after pasta spawns a child process in a new PID namespace (only for operation without a pre-existing namespace), we need to mount /proc after the clone(2) call with CLONE_NEWPID, and we enable the child to do that by passing, in the same call, the CLONE_NEWNS flag, as described by pid_namespaces(7). This is not really a remount: in fact, passing MS_REMOUNT to mount(2) would make the call fail. We're in another mount namespace now, so it's a fresh mount that has the effect of hiding the existing one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	b0881aae6d	util, conf: Add and use ns_is_init() helper We'll need this in isolate_initial(). While at it, don't rely on BUFSIZ: the earlier issue we had with musl reminded me it's not a magic "everything will fit" value. Size the read buffer to what we actually need from uid_map, and check for the final newline too, because uid_map is organised in lines. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-23 16:13:18 +02:00
David Gibson	25f1d1a84f	tap: Don't update ip6.addr_seen to :: When we receive packets from the tap side, we update the addr_seen fields to reflect the last known address of the guest or ns. For ip4.addr_seen we, sensibly, only update if the address we've just seen isn't 0 (0.0.0.0). This case can occur during early DHCP transactions. We have no equivalent case for IPv6. We're less likely to hit this, because DHCPv6 uses link-local addresses, however we can see a source address of :: with certain multicast operations. This can bite us if we try to make an incoming connection very early after starting pasta with --config-net: we may have only seen some of those multicast packets, updated addr_seen to :: and not had any "real" packets to update it to a global address. I've seen this with some of the avocado test conversions. In any case, it can never make sense to update addr_seen to ::, so explicitly exclude that case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-17 18:50:34 +02:00
lemmi	96f8d55c4f	correct -6 option in manpage Signed-off-by: lemmi <lemmi@nerd2nerd.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-09 23:29:32 +02:00
Stefano Brivio	940bd3eff9	passt: Fix error check for signal(), improve error messages Valtteri reports that if SIGPIPE already has a disposition set by the parent process, such as systemd with the default setting of IgnoreSIGPIPE=yes, signal() will return the previous value, not zero, and this is not an error: check for SIG_ERR instead. While at it, split messages for failures of sigaction() and signal(), and report the actual error. Reported-by: Valtteri Vuorikoski <vuori@notcom.org> Fixes: `8534be076c` ("Catch failures when installing signal handlers") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-13 19:32:13 +02:00
David Gibson	1a3ade9037	nstool: Enter holder's cwd when changing mount ns with nstool exec If we enter a mount namespace with nstool exec our working directory will be changed to / in the new mount ns. This is surprising if we haven't actually altered any mounts yet in the new ns. Instead, change the working directory to match that of the holder process in this situation. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:12 +02:00
David Gibson	98031bee73	nstool: Advertise the holder's cwd (in its mountns) across the socket This is possibly useful in nstool info and has further uses for nstool exec. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:10 +02:00
David Gibson	469b69aaa1	test: Use "nstool exec" to slightly simplify tests Using this, rather than using "nstool info" to get the pid then manually connecting with nsenter makes things a little simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:08 +02:00
David Gibson	3372cd0902	test: Initialise ${TRACE} properly Unlike ${DEBUG} we don't initialize ${TRACE} to 0 if not set, which causes failures when testing it later. That failure acts as though it is false, however it emits spurious errors in script.log, which can make it harder to spot real errors. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:05 +02:00
David Gibson	329149d51a	nstool: Add --keep-caps option to nstool exec This allows you to run commands within a user namespace with the privilege that comes from owning that userns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:03 +02:00
David Gibson	0b66944648	nstool: Add nstool exec command to execute commands in an nstool namespace This combines nstool info -pw <sock> with nsenter with various options for a more convenient and less verbose way of entering existing nstool managed namespaces. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:12:01 +02:00
David Gibson	3bcbca5db8	nstool: Helpers to iterate through namespace types Will make things a bit less verbose in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:59 +02:00
David Gibson	f6a9ea3af5	nstool: Add magic number to advertized information So that we'll probably give a better error if you point it at something that's not an nstool hold control socket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:57 +02:00
David Gibson	4311066bdb	nstool: Detect what namespaces target is in Give nstool the ability to detect what namespaces the target process is in, relative to where it's called. That is, those namespace types for which the target is not in the same namespace as the caller. For now, just print this information with "info", which can be useful for debugging. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:55 +02:00
David Gibson	fd4a752e92	nstool: Replace "pid" subcommand with "info" subcommand The new subcommand gives more information about the holder process and its namespace, and may be further extended in future. Add some options which give the old behaviour for existing scripts. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:52 +02:00
David Gibson	a4b017d91c	nstool: Split some command line parsing and socket setup to subcommands This will make it easier to differentiate the options to those commands further in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:50 +02:00
David Gibson	42fb218347	nstool: Move description of its operation modes from comment to usage Easier to see it there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:48 +02:00
David Gibson	2884ccd2e7	nstool: Reverse parameters to nstool Having the "subcommand" first is more conventional and will make it more natural for future extensions I have planned. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:43 +02:00
David Gibson	4914fce77b	nstool: Rename nsholder to nstool In preparation for extending what it does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:41 +02:00
David Gibson	55bbe3dbcb	test: Remove race between commands run in the same context context_run() has a race condition if two commands are run in close proximity (generally involving at least one in the background). Because we always use the same name for the temporary fifo files, if another command is issued while the fifos for the first still exist, mkfifo will fail, typically causing the entire test script to jam. Create unique names for the temporary fifos to avoid this problem. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-08 01:11:36 +02:00
Stefano Brivio	ca2749e1bd	passt: Relicense to GPL 2.0, or any later version In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-06 18:00:33 +02:00
Stefano Brivio	b10b983fbd	fedora: Adjust path for SELinux policy and interface file to latest guidelines Forget about: https://fedoraproject.org/wiki/SELinux_Policy_Modules_Packaging_Draft and: https://fedoraproject.org/wiki/PackagingDrafts/SELinux_Independent_Policy The guidelines to follow are: https://fedoraproject.org/wiki/SELinux/IndependentPolicy Start from fixing the most pressing issue, that is, a path conflict with policy-selinux-devel about passt.if, and, while at it, adjust the installation paths for policy files too. Reported-by: Xose Vazquez Perez <xose.vazquez@gmail.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182476 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 22:11:07 +02:00
Stefano Brivio	387f4aca74	fedora: Don't install useless SELinux interface file for pasta That was meant to be an example, and I just dropped it in the previous commit -- passt.if should be more than enough as a possible example. Reported-by: Carl G. <carlg@fedoraproject.org> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182145 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:48:12 +02:00
Stefano Brivio	dafd92d555	selinux: Drop useless interface file for pasta This was meant to be an example, but I managed to add syntax errors to it. Drop it altogether. Reported-by: Carl G. <carlg@fedoraproject.org> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182145 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:48:12 +02:00
Stefano Brivio	98a9a7d9e5	conf: Allow binding to ports on an interface without a specific address Somebody might want to bind listening sockets to a specific interface, but not a specific address, and there isn't really a reason to prevent that. For example: -t %eth0/2022 Alternatively, we support options such as -t 0.0.0.0%eth0/2022 and -t ::%eth0/2022, but not together, for the same port. Enable this kind of syntax and add examples to the man page. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/14425#issuecomment-1485192195 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:48:12 +02:00
Stefano Brivio	33d88f79d9	tcp: Clear ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Since commit `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we process an ACK segment, but, more correctly, only if we're really not waiting for a further ACK segment, that is, only if the acknowledged sequence matches what we sent. In the new function implementing this, tcp_update_seqack_from_tap(), we also reset the retransmission counter and store the updated ACK sequence. Both should be done iff forward progress is acknowledged, implied by the fact that the new ACK sequence is greater than the one we previously stored. At that point, it looked natural to also include the statements that clear and set the ACK_FROM_TAP_DUE flag inside the same conditional block: if we're not making forward progress, the need for an ACK, or lack thereof, should remain unchanged. There might be cases where this isn't true, though: without the previous commit `4e73e9bd65` ("tcp: Don't special case the handling of the ack of a syn"), this would happen if a tap-side client initiated a connection, and the server didn't send any data. At that point we would never, in the established state of the connection, call tcp_update_seqack_from_tap() with reported forward progress. That issue itself is fixed by the previous commit, now, but clearing ACK_FROM_TAP_DUE only on ACK sequence progress doesn't really follow any logic. Clear the ACK_FROM_TAP_DUE flag regardless of reported forward progress. If we clear it when it's already unset, conn_flag() will do nothing with it. This doesn't fix any known functional issue, rather a conceptual one. Fixes: `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer") Reported-by: David Gibson <david@gibson.dropbear.id.au> Analysed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:47:17 +02:00
David Gibson	4e73e9bd65	tcp: Don't special case the handling of the ack of a syn TCP treats the SYN packets as though they occupied 1 byte in the logical data stream described by the sequence numbers. That is, the very first ACK (or SYN-ACK) each side sends should acknowledge a sequence number one greater than the initial sequence number given in the SYN or SYN-ACK it's responding to. In passt we were tracking that by advancing conn->seq_to_tap by one when we send a SYN or SYN-ACK (in tcp_send_flag()). However, we also initialized conn->seq_ack_from_tap, representing the acks we've already seen from the tap side, to ISN+1, meaning we treated it as having acknowledged the SYN before it actually did. There were apparently reasons for this in earlier versions, but it causes problems now. Because of this when we actually did receive the initial ACK or SYN-ACK, we wouldn't see the acknowledged serial number as advancing, and so wouldn't clear the ACK_FROM_TAP_DUE flag. In most cases we'd get away because subsequent packets would clear the flag. However if one (or both) sides didn't send any data, the other side would (correctly) keep sending ISN+1 as the acknowledged sequence number, meaning we would never clear the ACK_FROM_TAP_DUE flag. That would mean we'd treat the connection as if we needed to retransmit (although we had 0 bytes to retransmit), and eventually (after around 30s) reset the connection due to too many retransmits. Specifically this could cause the iperf3 throughput tests in the testsuite to fail if set for a long enough test period. Correct this by initializing conn->seq_ack_from_tap to the ISN and only advancing it when we actually get the first ACK (or SYN-ACK). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:47:07 +02:00
David Gibson	085672f77c	tcp: Clarify allowed state for tcp_data_from_tap() Comments suggest that this should only be called for an ESTABLISHED connection. However, it's non-trivial to ascertain that from the actual control flow in the caller. Add an ASSERT() to make it very clear that this is only called in ESTABLISHED state. In fact, there were some circumstances where it could be called on a CLOSED connection. In a sense that is "established", but with that assert this does require specific (trivial) handling to avoid a spurious abort(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:46:58 +02:00
Stefano Brivio	1ee2f7cada	tcp: Don't reset ACK_TO_TAP_DUE on any ACK, reschedule timer as needed This is mostly symmetric with commit `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer"): we shouldn't reset the ACK_TO_TAP_DUE flag on any inbound ACK segment, but only once we acknowledge everything we received from the guest or the container. If we don't, a client might unnecessarily hold off further data, especially during slow start, and in general we won't converge to the usable bandwidth. This is very visible especially with traffic tests on links with non-negligible latency, such as in the reported issue. There, a public iperf3 server sometimes aborts the test due to what appears to be a low iperf3's --rcv-timeout (probably less than a second). Even if this doesn't happen, the throughput will converge to a fraction of the usable bandwidth. Clear ACK_TO_TAP_DUE if we acknowledged everything, set it if we didn't, and reschedule the timer in case the flag is still set as the timer expires. While at it, decrease the ACK timer interval to 10ms. A 50ms interval is short enough for any bandwidth-delay product I had in mind (local connections, or non-local connections with limited bandwidth), but here I am, testing 1gbps transfers to a peer with 100ms RTT. Indeed, we could eventually make the timer interval dependent on the current window and estimated bandwidth-delay product, but at least for the moment being, 10ms should be long enough to avoid any measurable syscall overhead, yet usable for any real-world application. Reported-by: Lukas Mrtvy <lukas.mrtvy@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=44 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 23:14:58 +01:00
Stefano Brivio	9ffccf7acc	tcp: When a connection flag is set, don't negate it for debug print Fix a copy and paste typo I added in commit `5474bc5485` ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") and --debug altogether. Fixes: `5474bc5485` ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 19:39:55 +01:00
David Gibson	89d1494974	Fix false positive if cppcheck doesn't give a false positive `da46fdac` "tcp: Suppress knownConditionTrueFalse cppcheck false positive" introduced a suppression to work around a cppcheck bug causing a false positive warning. However, the suppression will itself cause a spurious unmatchedSuppression warning if used with a version of cppcheck from before the bug was introduced. That includes the packaged version of cppcheck in Fedora. Suppress the unmatchedSuppression as well. Fixes: `da46fdac36` ("tcp: Suppress knownConditionTrueFalse cppcheck false positive") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 16:38:44 +01:00
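
Commit `940bd3eff9` in the log above replaces a zero check on the return value of signal() with a check for SIG_ERR, because signal() returns the previous disposition, which is not necessarily zero. A minimal sketch of that kind of check follows. It assumes the handler being installed is SIG_IGN for SIGPIPE; the helper name ignore_sigpipe() is hypothetical and this is not passt's actual code.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only: ignore SIGPIPE.
 *
 * signal() returns the previous disposition on success. That can be
 * SIG_DFL, SIG_IGN (for instance inherited from a parent that already
 * ignores SIGPIPE), or a handler address, so only SIG_ERR means failure.
 *
 * Return: 0 on success, -1 on failure (error reported on stderr)
 */
static int ignore_sigpipe(void)
{
	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
		fprintf(stderr, "Couldn't ignore SIGPIPE: %s\n",
			strerror(errno));
		return -1;
	}

	return 0;
}
```

Calling the helper twice succeeds both times: on the second call signal() returns SIG_IGN as the previous disposition, which is exactly the case a comparison against zero would misreport as an error.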
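
Commit `527c822a3b` in the log above makes nstool report an explicit error instead of silently truncating a socket path that doesn't fit in a Unix domain socket address. One way to sketch such a check is shown below, comparing the path length against the size of sun_path in struct sockaddr_un. The function name unix_addr_from_path() and the error wording are assumptions made for illustration, not taken from nstool.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper, for illustration only: fill a Unix domain socket
 * address from a filesystem path, refusing paths that would be truncated.
 *
 * Return: 0 on success, -1 if the path, including its terminating NUL
 * byte, doesn't fit in sun_path
 */
static int unix_addr_from_path(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path)) {
		fprintf(stderr,
			"Socket path \"%s\" is too long: %zu bytes, max %zu\n",
			path, len, sizeof(addr->sun_path) - 1);
		return -1;
	}

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);	/* Copy the NUL byte too */

	return 0;
}
```

Using sizeof(addr->sun_path) rather than a hard-coded constant keeps the check correct on systems where the limit differs from the 108 bytes mentioned in the commit message.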
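
Commit `d072ac2434` in the log above describes how write failures on the tap device are sorted: EINTR is treated like EAGAIN/EWOULDBLOCK, ENOBUFS and ENOSPC mean the data is dropped, and anything else is unrecoverable. The sketch below expresses that policy as a small classifier plus a write wrapper. The enum, the function names, and the choice to retry immediately in a loop are all assumptions for illustration; the actual code in tap.c may be structured differently.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical names, for illustration only */
enum tap_write_action {
	TAP_WRITE_RETRY,	/* Transient: send the same data again */
	TAP_WRITE_DROP,		/* No room for the data: discard it */
	TAP_WRITE_FATAL,	/* Unrecoverable: caller should exit */
};

/* Map an errno value from a failed write() to the action described in
 * the commit message.
 */
static enum tap_write_action tap_write_classify(int err)
{
	switch (err) {
	case EAGAIN:
#if EWOULDBLOCK != EAGAIN
	case EWOULDBLOCK:
#endif
	case EINTR:
		return TAP_WRITE_RETRY;
	case ENOBUFS:
	case ENOSPC:
		return TAP_WRITE_DROP;
	default:
		return TAP_WRITE_FATAL;
	}
}

/* Write one frame to the tap file descriptor.
 *
 * Return: 0 if the frame was sent or deliberately dropped, -1 on an
 * unrecoverable error
 */
static int tap_write_frame(int fd, const void *buf, size_t len)
{
	for (;;) {
		if (write(fd, buf, len) >= 0)
			return 0;

		switch (tap_write_classify(errno)) {
		case TAP_WRITE_RETRY:
			continue;	/* Assumed policy: retry at once */
		case TAP_WRITE_DROP:
			return 0;
		default:
			return -1;
		}
	}
}
```

Note that buf is passed to write() without a cast, as the commit message points out: write() takes a const void * anyway.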


1068 commits