passt

Author	SHA1	Message	Date
Stefano Brivio	8f3f8e190c	pasta: Add fallback timer mechanism to check if namespace is gone We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-16 08:47:14 +01:00
David Gibson	fb7c00169d	flow: Move flow_count from context structure to a global In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:29 +01:00
David Gibson	02e092b4fe	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:25 +01:00
David Gibson	70121ca1ec	epoll: Better handling of number of epoll types As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:22 +01:00
David Gibson	e21b6d69b1	tcp: "TCP" hash secret doesn't need to be TCP specific The TCP state structure includes a 128-bit hash_secret which we use for SipHash calculations to mitigate attacks on the TCP hash table and initial sequence number. We have plans to use SipHash in places that aren't TCP related, and there's no particular reason they'd need their own secret. So move the hash_secret to the general context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:32 +01:00
David Gibson	705549f834	flow,tcp: Use epoll_ref type including flow and side Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, for while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:24 +01:00
David Gibson	ecea8d36ff	flow,tcp: Generalise TCP epoll_ref to generic flows TCP uses three different epoll object types: one for connected sockets, one for timers and one for listening sockets. Listening sockets really need information that's specific to TCP, so need their own epoll_ref field. Timers and connected sockets, however, only need the connection (flow) they're associated with. As we expand the use of the flow table, we expect that to be true for more epoll fds. So, rename the "TCP" epoll_ref field to be a "flow" epoll_ref field that can be used both for TCP and for other future cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:20 +01:00
David Gibson	9d44aba7e0	util: MAX_FROM_BITS() should be unsigned MAX_FROM_BITS() computes the maximum value representable in a number of bits. The expression for that is an unsigned value, but we explicitly cast it to a signed int. It looks like this is because one of the main users is for FD_REF_MAX, which is used to bound fd values, typically stored as a signed int. The value MAX_FROM_BITS() is calculating is naturally non-negative, though, so it makes more sense for it to be unsigned, and to move the case to the definition of FD_REF_MAX. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:06 +01:00
David Gibson	f08ce92a13	flow, tcp: Move TCP connection table to unified flow table We want to generalise "connection" tracking to things other than true TCP connections. Continue implenenting this by renaming the TCP connection table to the "flow table" and moving it to flow.c. The definitions are split between flow.h and flow_table.h - we need this separation to avoid circular dependencies: the definitions in flow.h will be needed by many headers using the flow mechanism, but flow_table.h needs all those protocol specific headers in order to define the full flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:02 +01:00
David Gibson	732e249376	pif: Record originating pif in listening socket refs For certain socket types, we record in the epoll ref whether they're sockets in the namespace, or on the host. We now have the notion of "pif" to indicate what "place" a socket is associated with, so generalise the simple one-bit 'ns' to a pif id. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-11-07 09:53:41 +01:00
David Gibson	dcf5c0eb1e	port_fwd: Move port scanning /proc fds into struct port_fwd Currently we store /proc/net fds used to implement automatic port forwarding in the proc_net_{tcp,udp} fields of the main context structure. However, in fact each of those is associated with a particular direction of forwarding, and we already have struct port_fwd which collects all other information related to a particular direction of port forwarding. We can simplify things a bit by moving the /proc fds into struct port_fwd. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-11-07 09:53:29 +01:00
David Gibson	955dd3251c	tcp, udp: Don't pre-fill IPv4 destination address in headers Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-22 12:15:33 +02:00
David Gibson	ae5f6c8e1b	epoll: Use different epoll types for passt and pasta tap fds Currently we have a single epoll event type for the "tap" fd, which could be either a handle on a /dev/net/tun device (pasta) or a connected Unix socket (passt). However for the two modes we call different handler functions. Simplify this a little by using different epoll types and dispatching directly to the correct handler function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:20 +02:00
David Gibson	eda4f1997e	epoll: Split listening Unix domain socket into its own type tap_handler() actually handles events on three different types of object: the /dev/tap character device (pasta), a connected Unix domain socket (passt) or a listening Unix domain socket (passt). The last, in particular, really has no handling in common with the others, so split it into its own epoll type and directly dispatch to the relevant handler from the top level. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:17 +02:00
David Gibson	485b5fb8f9	epoll: Split handling of listening TCP sockets into their own handler tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:15 +02:00
David Gibson	e6f81e5578	epoll: Split handling of TCP timerfds into its own handler function tcp_sock_handler() actually handles several different types of fd events. This includes timerfds that aren't sockets at all. The handling of these has essentially nothing in common with the other cases. So, give the TCP timers there own epoll_type value and dispatch directly to their handler. This also means we can remove the timer field from tcp_epoll_ref, the information it encoded is now implicit in the epoll_type value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:13 +02:00
David Gibson	6a6735ece4	epoll: Always use epoll_ref for the epoll data variable epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:53 +02:00
David Gibson	3401644453	epoll: Generalize epoll_ref to cover things other than sockets The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:51 +02:00
David Gibson	b15ce5b6ce	Use static assertion to verify that union epoll_ref is the right size union epoll_ref is used to subdivide the 64-bit data field in struct epoll_event. Thus it must fit within that field or we're likely to get very subtle and nasty bugs. C11 introduces the notion of static assertions which we can use to verify this is the case at compile time. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:00 +02:00
David Gibson	8218d99013	Use C11 anonymous members to make poll refs less verbose to use union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:17:57 +02:00
Stefano Brivio	9f61c5b68b	passt.h: Fix description of pasta_ifi in struct ctx Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	cc9d16758b	conf, pasta: With --config-net, copy all addresses by default Use the newly-introduced NL_DUP mode for nl_addr() to copy all the addresses associated to the template interface in the outer namespace, unless --no-copy-addrs (also implied by -a) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. This is done mostly for consistency with routes. It might partially cover the issue at: https://bugs.passt.top/show_bug.cgi?id=47 Support multiple addresses per address family for some use cases, but not the originally intended one: we'll still use a single outbound address (unless the routing table specifies different preferred source addresses depending on the destination), regardless of the address used in the target namespace. Link: https://bugs.passt.top/show_bug.cgi?id=47 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	da54641f14	conf, pasta: With --config-net, copy all routes by default Use the newly-introduced NL_DUP mode for nl_route() to copy all the routes associated to the template interface in the outer namespace, unless --no-copy-routes (also implied by -g) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. Otherwise, we can't use default gateways which are not, address-wise, on the same subnet as the container, as reported by Callum. Reported-by: Callum Parsey <callum@neoninteger.au> Link: https://github.com/containers/podman/issues/18539 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-05-23 16:13:28 +02:00
Stefano Brivio	ca2749e1bd	passt: Relicense to GPL 2.0, or any later version In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-06 18:00:33 +02:00
Chris Kuhn	5c58feab7b	conf, passt: Rename stderr to force_stderr While building against musl, gcc informs us that 'stderr' is a protected keyword. This probably comes from a #define stderr (stderr) in musl's stdio.h, to avoid a clash with extern FILE *const stderr, but I didn't really track it down. Just rename it to force_stderr, it makes more sense. [sbrivio: Added commit message] Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Stefano Brivio	a9c59dd91b	conf, icmp, tcp, udp: Add options to bind to outbound address and interface I didn't notice earlier: libslirp (and slirp4netns) supports binding outbound sockets to specific IPv4 and IPv6 addresses, to force the source addresse selection. If we want to claim feature parity, we should implement that as well. Further, Podman supports specifying outbound interfaces as well, but this is simply done by resolving the primary address for an interface when the network back-end is started. However, since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can actually bind to a specific interface name, which doesn't need to be validated in advance. Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses, and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets to given interfaces. Given that it probably makes little sense to select addresses and routes from interfaces different than the ones given for outbound sockets, also assign those as "template" interfaces, by default, unless explicitly overridden by '-i'. For ICMP and UDP, we call sock_l4() to open outbound sockets, as we already needed to bind to given ports or echo identifiers, and we can bind() a socket only once: there, pass address (if any) and interface (if any) for the existing bind() and setsockopt() calls. For TCP, in general, we wouldn't otherwise bind sockets. Add a specific helper to do that. For UDP outbound sockets, we need to know if the final destination of the socket is a loopback address, before we decide whether it makes sense to bind the socket at all: move the block mangling the address destination before the creation of the socket in the IPv4 path. This was already the case for the IPv6 path. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:43:59 +01:00
Stefano Brivio	70148ce5be	conf, passt.h: Rename "outbound" interface to "template" interface In preparation for the next patch, make it clear that the first routable interface fetched via netlink, or the one configured via -i/--interface, is simply used as template to copy addresses and routes, not an interface we actually use to derive the source address (which will be _bound to_) for outgoing packets. The man page and usage message appear to be already clear enough. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-09 00:36:08 +01:00
Stefano Brivio	26a0e4d6ee	Fix definitions of SOCKET_MAX, TCP_MAX_CONNS ...and, given that I keep getting this wrong, add a convenience macro, MAX_FROM_BITS(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:55:31 +01:00
David Gibson	d9394eb9b7	udp: Split splice field in udp_epoll_ref into (mostly) independent bits The @splice field in union udp_epoll_ref can have a number of values for different types of "spliced" packet flows. Split it into several single bit fields with more or less independent meanings. The new @splice field is just a boolean indicating whether the socket is associated with a spliced flow, making it identical to the @splice fiend in tcp_epoll_ref. The new bit @orig, indicates whether this is a socket which can originate new udp packet flows (created with -u or -U) or a socket created on the fly to handle reply socket. @ns indicates whether the socket lives in the init namespace or the pasta namespace. Making these bits more orthogonal to each other will simplify some future cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:31 +01:00
Richard W.M. Jones	6b4e68383c	passt, tap: Add --fd option This passes a fully connected stream socket to passt. Signed-off-by: Richard W.M. Jones <rjones@redhat.com> [sbrivio: reuse fd_tap instead of adding a new descriptor, imply --one-off on --fd, add to optstring and usage()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:40:47 +01:00
Stefano Brivio	3a2afde87d	conf, udp: Drop mostly duplicated dns_send arrays, rename related fields Given that we use just the first valid DNS resolver address configured, or read from resolv.conf(5) on the host, to forward DNS queries to, in case --dns-forward is used, we don't need to duplicate dns[] to dns_send[]: - rename dns_send[] back to dns[]: those are the resolvers we advertise to the guest/container - for forwarding purposes, instead of dns[], use a single field (for each protocol version): dns_host - and rename dns_fwd to dns_match, so that it's clear this is the address we are matching DNS queries against, to decide if they need to be forwarded Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-16 15:09:31 +01:00
Stefano Brivio	73f50a76aa	conf: Split the notions of read DNS addresses and offered ones With --dns-forward, if the host has a loopback address configured as DNS server, we should actually use it to forward queries, but, if --no-map-gw is passed, we shouldn't offer the same address via DHCP, NDP and DHCPv6, because it's not going to be reachable. Problematic configuration: * systemd-resolved configuring the usual 127.0.0.53 on the host: we read that from /etc/resolv.conf * --dns-forward specified with an unrelated address, for example 198.51.100.1 We still want to forward queries to 127.0.0.53, if we receive one directed to 198.51.100.1, so we can't drop 127.0.0.53 from our list: we want to use it for forwarding. At the same time, we shouldn't offer 127.0.0.53 to the guest or container either. With this change, I'm only covering the case of automatically configured DNS servers from /etc/resolv.conf. We could extend this to addresses configured with command-line options, but I don't really see a likely use case at this point. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:32 +01:00
David Gibson	7c7b68dbe0	Use typing to reduce chances of IPv4 endianness errors We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:24 +01:00
David Gibson	dd09cceaee	Minor improvements to IPv4 netmask handling There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:19 +01:00
Stefano Brivio	a62ed181db	conf, tap: Add option to quit once the client closes the connection This is practical to avoid explicit lifecycle management in users, e.g. libvirtd, and is trivial to implement. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-15 02:10:36 +02:00
Stefano Brivio	f4e1e88e1d	passt.h: Include netinet/if_ether.h before struct ctx declaration This saves some hassle when including passt.h, as we need ETH_ALEN there. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-14 17:38:28 +02:00
David Gibson	1128fa03fe	Improve types and names for port forwarding configuration enum conf_port_type is local to conf.c and is used to track the port forwarding mode during configuration. We don't keep it around in the context structure, however the 'init_detect_ports' and 'ns_detect_ports' fields in the context are based solely on this. Rather than changing encoding, just include the forwarding mode into the context structure. Move the type definition to a new port_fwd.h, which is kind of trivial at the moment but will have more stuff later. While we're there, "conf_port_type" doesn't really convey that this enum is describing how port forwarding is configured. Rename it to port_fwd_mode. The variables (now fields) of this type also have mildly confusing names since it's not immediately obvious whether 'ns' and 'init' refer to the source or destination of the packets. Use "in" (host to guest / init to ns) and "out" (guest to host / ns to init) instead. This has the added bonus that we no longer have locals 'udp_init' and 'tcp_init' which shadow global functions. In addition, add a typedef 'port_fwd_map' for a bitmap of each port number, which is used in several places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	eed17a47fe	Handle userns isolation and dropping root at the same time passt/pasta can interact with user namespaces in a number of ways: 1) With --netns-only we'll remain in our original user namespace 2) With --userns or a PID option to pasta we'll join either the given user namespace or that of the PID 3) When pasta spawns a shell or command we'll start a new user namespace for the command and then join it 4) With passt we'll create a new user namespace when we sandbox() ourself However (3) and (4) turn out to have essentially the same effect. In both cases we create one new user namespace. The spawned command starts there, and passt/pasta itself will live there from sandbox() onwards. Because of this, we can simplify user namespace handling by moving the userns handling earlier, to the same point we drop root in the original namespace. Extend the drop_user() function to isolate_user() which does both. After switching UID and GID in the original userns, isolate_user() will either join or create the userns we require. When we spawn a command with pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns, because we're already made one. sandbox() likewise no longer needs to create (or join) an userns because we're already in the one we need. We no longer need c->pasta_userns_fd, since the fd is only used locally in isolate_user(). Likewise we can replace c->netns_only with a local in conf(), since it's not used outside there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-13 05:31:51 +02:00
David Gibson	7330ae3abf	Don't store UID & GID persistently in the context structure c->uid and c->gid are first set in conf(), and last used in check_root() itself called from conf(). Therefore these don't need to be fields in the long lived context structure and can instead be locals in conf(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-13 05:31:51 +02:00
David Gibson	16f5586bb8	Make substructures for IPv4 and IPv6 specific context information The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure. This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-07-30 22:14:07 +02:00
David Gibson	5e12d23acb	Separate IPv4 and IPv6 configuration After recent changes, conf_ip() now has essentially entirely disjoint paths for IPv4 and IPv6 configuration. So, it's cleaner to split them out into different functions conf_ip4() and conf_ip6(). Splitting these out also lets us make the interface a bit nicer, having them return success or failure directly, rather than manipulating c->v4 and c->v6 to indicate success/failure of the two versions. Since these functions may also initialize the interface index for each protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since a 0 interface index is never valid). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Whitespace fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-07-30 22:12:50 +02:00
David Gibson	4b2e018d70	Allow different external interfaces for IPv4 and IPv6 connectivity It's quite plausible for a host to have both IPv4 and IPv6 connectivity, but only via different interfaces. For example, this will happen in the case that IPv6 connectivity is via a tunnel (e.g. 6in4 or 6rd). It would also happen in the case that IPv4 access is via a tunnel on an otherwise IPv6 only local network, which is a setup that might become more common in the post IPv4 address exhaustion world. In turns out there's no real need for passt/pasta to get its IPv4 and IPv6 connectivity via the same interface, so we can handle this situation fairly easily. Change the core to allow eparate external interfaces for IPv4 and IPv6. We don't actually set these separately for now. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-07-30 21:50:41 +02:00
Stefano Brivio	a951e0b9ef	conf: Add --runas option, changing to given UID and GID if started as root On some systems, user and group "nobody" might not be available. The new --runas option allows to override the default "nobody" choice if started as root. Now that we allow this, drop the initgroups() call that was used to add any additional groups for the given user, as that might now grant unnecessarily broad permissions. For instance, several distributions have a "kvm" group to allow regular user access to /dev/kvm, and we don't need that in passt or pasta. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-05-19 16:27:20 +02:00
Stefano Brivio	37c228ada8	tap, tcp, udp, icmp: Cut down on some oversized buffers The existing sizes provide no measurable differences in throughput and packet rates at this point. They were probably needed as batched implementations were not complete, but they can be decreased quite a bit now. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	48582bf47f	treewide: Mark constant references as const Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	965f603238	treewide: Add include guards ...at the moment, just for consistency with packet.h, icmp.h, tcp.h and udp.h. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	bb70811183	treewide: Packet abstraction with mandatory boundary checks Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	92074c16a8	tcp_splice: Close sockets right away on high number of open files We can't take for granted that the hard limit for open files is big enough as to allow to delay closing sockets to a timer. Store the value of RTLIMIT_NOFILE we set at start, and use it to understand if we're approaching the limit with pending, spliced TCP connections. If that's the case, close sockets right away as soon as they're not needed, instead of deferring this task to a timer. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	3eb19cfd8a	tcp, udp, util: Enforce 24-bit limit on socket numbers This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	d2e40bb8d9	conf, util, tap: Implement --trace option for extra verbose logging --debug can be a bit too noisy, especially as single packets or socket messages are logged: implement a new option, --trace, implying --debug, that enables all debug messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-25 13:21:13 +01:00

1 2

83 commits