passt

mirror of https://passt.top/passt synced 2025-05-11 04:48:51 +02:00

Author	SHA1	Message	Date
David Gibson	ae5f6c8e1b	epoll: Use different epoll types for passt and pasta tap fds Currently we have a single epoll event type for the "tap" fd, which could be either a handle on a /dev/net/tun device (pasta) or a connected Unix socket (passt). However for the two modes we call different handler functions. Simplify this a little by using different epoll types and dispatching directly to the correct handler function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:20 +02:00
David Gibson	eda4f1997e	epoll: Split listening Unix domain socket into its own type tap_handler() actually handles events on three different types of object: the /dev/tap character device (pasta), a connected Unix domain socket (passt) or a listening Unix domain socket (passt). The last, in particular, really has no handling in common with the others, so split it into its own epoll type and directly dispatch to the relevant handler from the top level. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:17 +02:00
David Gibson	485b5fb8f9	epoll: Split handling of listening TCP sockets into their own handler tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:15 +02:00
David Gibson	e6f81e5578	epoll: Split handling of TCP timerfds into its own handler function tcp_sock_handler() actually handles several different types of fd events. This includes timerfds that aren't sockets at all. The handling of these has essentially nothing in common with the other cases. So, give the TCP timers there own epoll_type value and dispatch directly to their handler. This also means we can remove the timer field from tcp_epoll_ref, the information it encoded is now implicit in the epoll_type value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:13 +02:00
David Gibson	8271a2ed57	epoll: Tiny cleanup to udp_sock_handler() Move the test for c->no_udp into the function itself, rather than in the dispatching switch statement to better localize the UDP specific logic, and make for greated consistency with other handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:10 +02:00
David Gibson	05f606ab0b	epoll: Split handling of ICMP and ICMPv6 sockets We have different epoll type values for ICMP and ICMPv6 sockets, but they both call the same handler function, icmp_sock_handler(). However that function does essentially nothing in common for the two cases. So, split it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them separately from the top level. While we're there remove some parameters that the function was never using anyway. Also move the test for c->no_icmp into the functions, so that all the logic specific to ICMP is within the handler, rather than in the top level dispatch code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:07 +02:00
David Gibson	d850caab66	epoll: Fold sock_handler into general switch on epoll event fd When we process events from epoll_wait(), we check for a number of special cases before calling sock_handler() which then dispatches based on the protocol type of the socket in the event. Fold these cases together into a single switch on the fd type recorded in the epoll data field. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:56 +02:00
David Gibson	6a6735ece4	epoll: Always use epoll_ref for the epoll data variable epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:53 +02:00
David Gibson	3401644453	epoll: Generalize epoll_ref to cover things other than sockets The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:51 +02:00
David Gibson	e26282b67d	tap: Fold reset handling into tap_handler_passt() We call tap_sock_reset() if tap_handler_passt() fails, or if we get an error event on the socket. Fold that logic into tap_handler() passt itself which simplifies the caller. It also makes it clearer that we had a redundant EPOLL_CTL_DEL and close() in one of the reset paths, so fix that too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:49 +02:00
David Gibson	0d870c5da6	tap: Fold reset handling into tap_handler_pasta() If tap_handler_pasta() fails, we reset the connection. But in the case of pasta the "reset" is just a fatal error. Fold the die() calls directly into tap_handler_pasta() for simplicity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:46 +02:00
David Gibson	548e05f76a	tap: Clean up behaviour for errors on listening Unix socket We call tap_sock_unix_new() to handle a new connection to the qemu socket if we get an EPOLLIN event on c->fd_tap_listen. If we get any other event on the fd, we'll fall through to the "tap reset" path. But that won't do anything relevant to the listening socket, it will just close the already connected socket. Furthermore, the only other event we're subscribed to for the listening socket is EPOLLRDHUP, which doesn't apply to a non connected socket. Remove EPOLLRDHUP from the subscribed events. We don't need to explicitly add EPOLLERR, because errors are always reported. There's no obvious case that would cause an error on a listening socket anyway, and it's not obvious how we'd recover, treat it as a fatal error if it ever does happen. Finally, fold all this handling into the tap_sock_unix_new() function, there's no real reason to split it between there and tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:44 +02:00
David Gibson	28877b0fcd	tap: Clean up tap reset path In tap_handler() if we get an error on the tap device or socket, we use tap_sock_init() to re-initialise it. However, what we actually need for this reset case has remarkably little in common with the case where we're initialising for the first time: * Re-initialising the packet pools is unnecessary * The case of a passed in fd (--fd) isn't relevant * We don't even call this for pasta mode * We will never re-call tap_sock_unix_init() because we never clear fd_tap_listen In fact the only thing we do in tap_sock_init() relevant to the reset case is to remove the fd from the epoll and close it... which isn't used in the first initialisation case. So make a new tap_sock_reset() function just for this case, and simplify tap_sock_init() slightly as being used only for the first time case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:40 +02:00
Laurent Vivier	b2bea0047d	tap: fix seq->p.count limit The number of items in pool_l4_t is defined to UIO_MAXIOV, not TAP_SEQS. TAP_SEQS is the number of the sequences. Fix the value used to compare seq->p.count with. Fixes: `37c228ada8` ("tap, tcp, udp, icmp: Cut down on some oversized buffers") Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: s/messages/sequences/ in commit message, extend initialisation of packets in pool to UIO_MAXIOV items] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:24:56 +02:00
David Gibson	02b30e7871	netlink: Propagate errors for "dup" operations We now detect errors on netlink "set" operations while configuring the pasta namespace with --config-net. However in many cases rather than a simple "set" we use a more complex "dup" function to copy configuration from the host to the namespace. We're not yet properly detecting and reporting netlink errors for that case. Change the "dup" operations to propagate netlink errors to their caller, pasta_ns_conf() and report them there. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:32:32 +02:00
David Gibson	5103811e2d	netlink: Propagate errors for "dump" operations Currently if we receive any netlink errors while discovering network configuration from the host, we'll just ignore it and carry on. This might lead to cryptic error messages later on, or even silent misconfiguration. We now have the mechanisms to detect errors from get/dump netlink operations. Propgate these errors up to the callers and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:30:41 +02:00
David Gibson	4d6e9d0816	netlink: Always process all responses to a netlink request A single netlink request can result in multiple response datagrams. We process multiple response datagrams in some circumstances, but there are cases where we exit early and will leave remaining datagrams in the queue. These will be flushed in nl_send() before we send another request. This is confusing, and not what we need to reliably check for errors from netlink operations. So, instead, make sure we always process all the response datagrams whenever we send a request (excepting fatal errors). In most cases this is just a matter of avoiding early exits from nl_foreach loops. nl_route_dup() is a bit trickier, because we need to retain all the routes we're going to try to copy in a single buffer. Here we instead use a secondary buffer to flush any remaining datagrams, and report an error if there are any additional routes in those datagrams . Link: https://bugs.passt.top/show_bug.cgi?id=67 Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:30:37 +02:00
David Gibson	8de9805224	netlink: Propagate errors for "set" operations Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:30:22 +02:00
David Gibson	a309318275	netlink: Add nl_foreach_oftype to filter response message types In most cases where processing response messages, we expect only one type of message (excepting NLMSG_DONE or NLMSG_ERROR), and so we need a test and continue to skip anything else. Add a helper macro to do this. This also fixes a bug in nl_get_ext_if() where we didn't have such a test and if we got a message other than RTM_NEWROUTE we would have parsed its contents as nonsense. Also add a warning message if we get such an unexpected message type, which could be useful for debugging if we ever hit it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:29 +02:00
David Gibson	99ddd7ce83	netlink: Split nl_req() to allow processing multiple response datagrams Currently nl_req() sends the request, and receives a single response datagram which we then process. However, a single request can result in multiple response datagrams. That happens nearly all the time for DUMP requests, where the 'DONE' message usually comes in a second datagram after the NEW{LINK\|ADDR\|ROUTE} messages. It can also happen if there are just too many objects to dump in a single datagram. Allow our netlink code to process multiple response datagrams by splitting nl_req() into three different helpers: nl_send() just sends a request, without getting a response. nl_status() checks a single message to see if it indicates the end of the reponses for our request. nl_next() moves onto the next response message, whether it's in a datagram we already received or we need to recv() a new one. We also add a 'for'-style macro to use these to step through every response message to a request across multiple datagrams. While we're at it, be more thourough with checking that our sequence numbers are in sync. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:26 +02:00
David Gibson	8ec757d003	netlink: Clearer reasoning about the netlink response buffer size Currently we set NLBUFSIZ large enough for 8192 netlink headers (128kiB in total), and reference netlink(7). However netlink(7) says nothing about reponse buffer sizes, and the documents which do reference 8192 bytes not 8192 headers. Update NLBUFSIZ to 64kiB with a more detailed rationale. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:22 +02:00
David Gibson	9d4ab98d53	netlink: Add nl_do() helper for simple operations with error checking So far we never checked for errors reported on netlink operations via NLMSG_ERROR messages. This has led to several subtle and tricky to debug situations which would have been obvious if we knew that certain netlink operations had failed. Introduce a nl_do() helper that performs netlink "do" operations (that is making a single change without retreiving complex information) with much more thorough error checking. As well as returning an error code if we get an NLMSG_ERROR message, we also check for unexpected behaviour in several places. That way if we've made a mistake in our assumptions about how netlink works it should result in a clear error rather than some subtle misbehaviour. We update those calls to nl_req() that can use the new wrapper to do so. We will extend those to better handle errors in future. We don't touch non-"do" operations for now, those are a bit trickier. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:19 +02:00
David Gibson	282581ba84	netlink: Fill in netlink header fields from nl_req() Currently netlink functions need to fill in a full netlink header, as well as a payload then call nl_req() to submit that to the kernel. It makes things a bit terser if we just give the relevant header fields as parameters to nl_req() and have it complete the header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:16 +02:00
David Gibson	f62600b2df	netlink: Treat send() or recv() errors as fatal Errors on send() or recv() calls on a netlink socket don't indicate errors with the netlink operations we're attempting, but rather that something's gone wrong with the mechanics of netlink itself. We don't really expect this to ever happen, and if it does, it's not clear what we could to to recover. So, treat errors from these calls as fatal, rather than returning the error up the stack. This makes handling failures in the callers of nl_req() simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:14 +02:00
David Gibson	0a568c847d	netlink: Start sequence number from 1 instead of 0 Netlink messages have a sequence number that's used to match requests to responses. It mostly doesn't matter what it is as long as it monotonically increases, so we just use a global counter which we advance with each request. However, we start this counter at 0, so our very first request has sequence number 0, which is usually reserved for asynchronous messages from the kernel which aren't in response to a specific request. Since we don't (for now) use such async messages, this doesn't really matter, but it's not good practce. So start the sequence at 1 instead. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:09 +02:00
David Gibson	dee7594180	netlink: Make nl__dup() use a separate datagram for each request nl_req() is designed to handle a single netlink request message: it only receives a single reply datagram for the request, and only waits for a single NLMSG_DONE or NLMSG_ERROR message at the beginning to clear out things from previous requests. However, in both nl_addr_dup() and nl_route_dup() we can send multiple request messages as a single datagram, with a single nl_req() call. This can easily mean that the replies nl_req() collects get out of sync with requests. We only get away with this because after we call these functions we don't make any netlink calls where we need to parse the replies. This is fragile, so alter nl__dup() to make an nl_req() call for each address it is adding in the target namespace. For nl_route_dup() this fixes an additional minor problem: because routes can have dependencies, some of the route add requests might fail on the first attempt, so we need to repeat the requests a number of times. When we did that, we weren't updating the sequence number on each new attempt. This works, but not updating the sequence number for each new request isn't ideal. Now that we're making the requests one at a time, it's easier to make sure we update the sequence number each time. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:28:00 +02:00
David Gibson	576df71e8b	netlink: Explicitly pass netlink sockets to operations All the netlink operations currently implicitly use one of the two global netlink sockets, sometimes depending on an 'ns' parameter. Change them all to explicitly take the socket to use (or two sockets to use in the case of the *_dup() functions). As well as making these functions strictly more general, it makes the callers easier to follow because we're passing a socket variable with a name rather than an unexplained '0' or '1' for the ns parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:27:42 +02:00
David Gibson	cfe7509e5c	netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t This improves consistency with IPv6 and makes it harder to misuse these as some other sort of value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:25:23 +02:00
David Gibson	257a6b0b7e	netlink: Split nl_route() into separate operation functions nl_route() can perform 3 quite different operations based on the 'op' parameter. Split this into separate functions for each one. This requires more lines of code, but makes the internal logic of each operation much easier to follow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:25:20 +02:00
David Gibson	eff3bcb245	netlink: Split nl_addr() into separate operation functions nl_addr() can perform three quite different operations based on the 'op' parameter, each of which uses a different subset of the parameters. Split them up into a function for each operation. This does use more lines of code, but the overlap wasn't that great, and the separated logic is much easier to follow. It's also clearer in the callers what we expect the netlink operations to do, and what information it uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:24:52 +02:00
David Gibson	e96182e9c2	netlink: Split up functionality of nl_link() nl_link() performs a number of functions: it can bring links up, set MAC address and MTU and also retrieve the existing MAC. This makes for a small number of lines of code, but high conceptual complexity: it's quite hard to follow what's going on both in nl_link() itself and it's also not very obvious which function its callers are intending to use. Clarify this, by splitting nl_link() into nl_link_up(), nl_link_set_mac(), and nl_link_get_mac(). The first brings up a link, optionally setting the MTU, the others get or set the MAC address. This fixes an arguable bug in pasta_ns_conf(): it looks as though that was intended to retrieve the guest MAC whether or not c->pasta_conf_ns is set. However, it only actually does so in the !c->pasta_conf_ns case: the fact that we set up==1 means we would only ever set, never get, the MAC in the nl_link() call in the other path. We get away with this because the MAC will quickly be discovered once we receive packets on the tap interface. Still, it's neater to always get the MAC address here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:14 +02:00
David Gibson	0cf7bf31f6	tap: Remove unnecessary global tun_ns_fd tap_ns_tun(), which runs in an ephemeral thread puts the fd it opens into the global variable tun_ns_fd to communicate it back to the main thread in tap_sock_tun_init(). However, the only thing tap_sock_tun_init() does with it is copies it to c->fd_tap and everything else uses it from there. tap_ns_tun() already has access to the context structure, so we might as well store the value directly in there rather than having a global as an intermediate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:11 +02:00
David Gibson	7bc9b66fc2	tap: More detailed error reporting in tap_ns_tun() There are several possible failure points in tap_ns_tun(), but if anything goes wrong, we just set tun_ns_fd to -1 resulting in the same error message. Add more detailed error reporting to the various failure points. At the same time, we know this is only called from tap_sock_tun_init() which will terminate pasta if we fail, so we can simplify things a little because we don't need to close() the fd on the failure paths. Link: https://bugs.passt.top/show_bug.cgi?id=69 Link: https://github.com/containers/podman/issues/19428 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:06 +02:00
David Gibson	6920adda0d	util: Make ns_enter() a void function and report setns() errors ns_enter() returns an integer... but it's always zero. If we actually fail the function doesn't return. Therefore it makes more sense for this to be a function returning void, and we can remove the cases where we pointlessly checked its return value. In addition ns_enter() is usually called from an ephemeral thread created by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't be reported (since NS_CALL() doesn't wait() for the thread). So, use die() instead to print out some information in the unlikely event that our setns() here does fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:02 +02:00
David Gibson	b15ce5b6ce	Use static assertion to verify that union epoll_ref is the right size union epoll_ref is used to subdivide the 64-bit data field in struct epoll_event. Thus it must fit within that field or we're likely to get very subtle and nasty bugs. C11 introduces the notion of static assertions which we can use to verify this is the case at compile time. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:00 +02:00
David Gibson	8218d99013	Use C11 anonymous members to make poll refs less verbose to use union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:17:57 +02:00
David Gibson	649068a287	Allow C11 code, not just C99 code C11 has some features that will allow us to make some things a bit cleaner. Alter the Makefile to tell the compiler to allow us to use C11 code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:17:41 +02:00
Stefano Brivio	023d684420	Revert "MAKE: Fix parallel builds; .o files; .gitignore; new makedocs" This reverts commit `cc2a6bec3c`: I applied that patch by mistake. Fixes: `cc2a6bec3c` ("MAKE: Fix parallel builds; .o files; .gitignore; new makedocs") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-07-10 06:33:44 +02:00
KuhnChris	cc2a6bec3c	MAKE: Fix parallel builds; .o files; .gitignore; new makedocs	2023-07-07 22:34:37 +02:00
David Gibson	e01759e2fa	tap: Explicitly drop IPv4 fragments, and give a warning We don't handle defragmentation of IP packets coming from the tap side, and we're unlikely to any time soon (with our large MTU, it's not useful for practical use cases). Currently, however, we simply ignore the fragmentation flags and treat fragments as though they were whole IP packets. This isn't ideal and can lead to rather cryptic behaviour if we do receive IP fragments. Change the code to explicitly drop fragmented packets, and print a rate limited warning if we do encounter them. Link: https://bugs.passt.top/show_bug.cgi?id=62 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-07-07 19:07:12 +02:00
David Gibson	4c98d3be80	conf: Correct length checking of interface names in conf_ports() When interface names are specified in forwarding specs, we need to check the length of the given interface name against the limit of IFNAMSIZ - 1 (15) characters. However, we managed to have 3 separate off-by-one errors here meaning we only accepted interface names up to 12 characters. 1. At the point of the check 'ifname' was still on the '%' character, not the first character of the name, meaning we overestimated the length by one 2. At the point of the check 'spec' had been advanced one character past the '/' which terminates the interface name, meaning we overestimated the length by another one 3. We checked if the (miscalculated) length was >= IFNAMSIZ - 1, that is >= 15, whereas lengths equal to 15 should be accepted. Correct all 3 errors. Link: https://bugs.passt.top/show_bug.cgi?id=61 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-28 17:51:25 +02:00
David Gibson	c4017cc4a1	conf: Fix size checking of -I interface name Network interface names must fit in a buffer of IFNAMSIZ bytes, including the terminating \0. IFNAMSIZ is 16 on Linux, so interface names can be up to (and including) 15 characters long. We validate this for the -I option, but we have an off by one error. We pass (IFNAMSIZ - 1) as the buffer size to snprintf(), but that buffer size already includes the terminating \0, so this actually truncates the value to 14 characters. The return value returned from snprintf() however, is the number of characters that would have been printed excluding the terminating \0, so by comparing it >= IFNAMSIZ - 1 we are giving an error on names >= 15 characters rather than strictly > 15 characters. Link: https://bugs.passt.top/show_bug.cgi?id=61 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-28 17:51:11 +02:00
David Gibson	289301b39c	netlink: Use correct interface index in NL_SET mode nl_addr() and nl_route() take an 'op' selector which affects a number of parameters to the netlink call. Unfortunately when we introduced this option a bug was introduced so that we always use the interface index for the host side, rather than the one for the pasta namespace. Really, the entire interface to nl_addr() and nl_route() is pretty bad: it's tightly coupled with the use cases of its callers. This is a minimal fix which doesn't address that, but also doesn't make it significantly worse. Link: https://bugs.passt.top/show_bug.cgi?id=59 Fixes: `2fe0461856` ("netlink: Add functionality to copy routes from outer namespace") Fixes: `e89da3cf03` ("netlink: Add functionality to copy addresses from outer namespace") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-27 17:52:30 +02:00
Paul Holzinger	32660cea04	pasta: include errno in error message When the open() or setns() calls fails pasta exits early and prints an error. However it did not include the errno so it was impossible to know why the syscall failed. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Split print to fit 80 columns in pasta_open_ns()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:50:42 +02:00
Paul Holzinger	594dce66d3	isolation: keep CAP_SYS_PTRACE when required When pasta is started from an existing userns and tries to join the netns from another process it fails to open /proc/$pid/ns/net due the missing CAP_SYS_PTRACE capability in the --netns-only case. A simple reproducer for this. First create a userns: $ unshare -r Then create a new netns inside it and try to join that netns with pasta. $ unshare -n sleep inf & $ pasta --config-net --netns /proc/$!/ns/net Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:25 +02:00
Stefano Brivio	5b646b9b10	conf: Accept -a and -g without --config-net in pasta mode While --no-copy-addrs and --no-copy-routes only make sense with --config-net, and they are implied on -g and -a, respectively, that doesn't mean we should refuse -a or -g without --config-net: they are still relevant for a number of things (including DHCP/DHCPv6/NDP configuration). Reported-by: Gianluca Stivan <me@yawnt.com> Fixes: `cc9d16758b` ("conf, pasta: With --config-net, copy all addresses by default") Fixes: `da54641f14` ("conf, pasta: With --config-net, copy all routes by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:25 +02:00
Stefano Brivio	d034fb698f	conf: Make -a/--address really imply --no-copy-addrs I wrote it in commit message and man page, but not in conf()... Note that -g/--gateway correctly implies --no-copy-routes already. This fixes Podman's tests: podman networking with pasta(1) - IPv4 address assignment podman networking with pasta(1) - IPv4 default route assignment where we pass -a and -g to assign an address and a default gateway that's compatible with it, but -a doesn't disable the copy of addresses, so we ignore -a, and the default gateway is incompatible with the addresses we copy -- hence no routes in the container. Link: https://github.com/containers/podman/pull/18612 Fixes: `cc9d16758b` ("conf, pasta: With --config-net, copy all addresses by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:25 +02:00
David Gibson	db29fd281a	seccomp: Make seccomp.sh re-entrancy safe seccomp.sh generates seccomp.h piece by piece using >> directives. This means that if two instances of seccomp.h are run concurrently a corrupted version of seccomp.h will be generated. Amongst other problems this can cause spurious failures on clang-tidy. Alter seccomp.sh to build the output in a temporary file and atomic move it to seccomp.h, so concurrent invocations will still result in valud output. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-06-25 23:49:11 +02:00
Stefano Brivio	3c6d1b9bb2	conf, log: On -h / --help, print usage to stdout, not stderr Erik suggests that this makes it easier to grep for options, and with --help we're anyway printing usage information as expected, not as part of an error report. While at it: on -h, we should exit with 0. Reported-by: Erik Sjölund <erik.sjolund@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=52 Link: https://bugs.passt.top/show_bug.cgi?id=53 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-06-23 10:15:55 +02:00
Stefano Brivio	d072ac2434	tap: With pasta, don't reset on tap errors, handle write failures Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), it makes no sense to close and reopen the tap device on error: we don't have access to /dev/net/tun after the initial setup phase. If we hit ENOBUFS while writing (as reported: in one case because the kernel actually ran out of memory, with another case under investigation), or ENOSPC, we're supposed to drop whatever data we were trying to send: there's no room for it. Handle EINTR just like we handled EAGAIN/EWOULDBLOCK: there's no particular reason why sending the same data should fail again. Anything else I can think of would be an unrecoverable error: exit with failure then. While at it, drop a useless cast on the write() call: it takes a const void * anyway. Reported-by: Gianluca Stivan <me@yawnt.com> Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Fixes: `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-06-23 10:15:10 +02:00

... 5 6 7 8 9 ...

1413 commits