passt

Author	SHA1	Message	Date
David Gibson	97d1ca2ed6	udp: Use abstracted tap header Update the UDP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:59 +01:00
David Gibson	716a926ef4	util: Parameterize ethernet header initializer macro We have separate IPv4 and IPv6 versions of a macro to construct an initializer for ethernet headers. However, now that we have htons_constant it's easy to simply paramterize this with the ethernet protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:46 +01:00
David Gibson	67afaab411	tcp, udp: Use named field initializers in iov_init functions Both the TCP and UDP iov_init functions have some large structure literals defined in "field order" style. These are pretty hard to read since it's not obvious what value corresponds to what field. Use named field style initializers instead to make this clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:44 +01:00
David Gibson	b93d025d50	udp: Don't use separate sockets to listen for spliced packets Currently, when ports are forwarded inbound in pasta mode, we open two sockets for incoming traffic: one listens on the public IP address and will forward packets to the tuntap interface. The other listens on localhost and forwards via "splicing" (resending directly via sockets in the ns). Now that we've improved the logic about whether we "splice" any individual packet, we don't need this. Instead we can have a single socket bound to 0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal with each packet as appropriate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:12 +01:00
David Gibson	8d503e825f	udp: Decide whether to "splice" per datagram rather than per socket Currently we have special sockets for receiving datagrams from locahost which can use the optimized "splice" path rather than going across the tap interface. We want to loosen this so that sockets can receive sockets that will be forwarded by both the spliced and non-spliced paths. To do this, we alter the meaning of the @splice bit in the reference to mean that packets receieved on this socket can be spliced, not that they will be spliced. They'll only actually be spliced if they come from 127.0.0.1 or ::1. We can't (for now) remove the splice bit entirely, unlike with TCP. Our gateway mapping means that if the ns initiates communication to the gw address, we'll translate that to target 127.0.0.1 on the host side. Reply packets will therefore have source address 127.0.0.1 when received on the host, but these need to go via the tap path where that will be translated back to the gateway address. We need the @splice bit to distinguish that case from packets going from localhost to a port mapped explicitly with -u which should be spliced. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:09 +01:00
David Gibson	8a10f23720	udp: Unify udp_sock_handler_splice() with udp_sock_handler() These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one. This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:06 +01:00
David Gibson	f1ed8dbfa7	udp: Pre-populate msg_names with local address udp_splice_namebuf is now used only for spliced sending, and so it is only ever populated with the localhost address, either IPv4 or IPv6. So, replace the awkward initialization in udp_sock_handler_splice() with statically initialized versions for IPv4 and IPv6. We then just need to update the port number in udp_sock_handler_splice(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:03 +01:00
David Gibson	c9e193b5ae	udp: Don't handle tap receive batch size calculation within a #define UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a batch for sizing our buffers and control structures. The subtly different UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to receive at once for tap packets in the current configuration. This depends on the mode, meaning that the macro has a non-obvious dependency on the usual 'c' context variable being available. We only use it in one place, so it makes more sense to open code this. Add an explanatory comment while we're there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:01 +01:00
David Gibson	4eb54fd2e7	udp: Split receive from preparation and send in udp_sock_handler() The receive part of udp_sock_handler() and udp_sock_handler_splice() is now almost identical. In preparation for merging that, split the receive part of udp_sock_handler() from the part preparing and sending the frames for sending on the tap interface. The latter goes into a new udp_tap_send() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:58 +01:00
David Gibson	09c00f1d2a	udp: Split sending to passt tap interface into separate function The last part of udp_sock_handler() does the actual sending of frames to the tap interface. For pasta that's just a call to udp_tap_send_pasta() but for passt, it's moderately complex and open coded. For symmetry, move the passt send path into its own function, udp_tap_send_passt(). This will make it easier to abstract the tap interface in future (e.g. when we want to add vhost-user). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:55 +01:00
David Gibson	a6e919a951	udp: Move sending pasta tap frames to the end of udp_sock_handler() udp_sock_handler() has a surprising difference in flow between pasta and passt mode: For pasta we send each frame to the tap interface as we prepare it. For passt, though, we prepare all the frames, then send them with a single sendmmsg(). Alter the pasta path to also prepare all the frames, then send them at the end. We already have a suitable data structure for the passt case. This will make it easier to abstract out the tap backend difference in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:51 +01:00
David Gibson	310bdbdcf4	udp: Factor out control structure management from udp_sock_fill_data_v[46] The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP and other headers we'll need to forward data onto the tap interface. In addition they update the control structures (iovec and mmsghdr) we'll need to send the messages, and in the case of pasta actually sends it. This leads the control structure management and the send itself awkwardly split between udp_sock_fill_data_v[46]() and their caller udp_sock_handler(). In addition, this tail part of udp_sock_fill_datav[46] is essentially common between the IPv4 and IPv6 versions, apart from which control array we're working on. Clean this up by reducing these functions to just construct the headers and renaming them to udp_update_hdr[46]() accordingly. The control structure updates are now all in the caller, and common for IPv4 and IPv6. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:22 +01:00
David Gibson	4b2d227c86	udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode Currently, we always populate udp[46]_l2_iov_tap[].iov_base with the very start of the header buffers, including space for the qemu vnet_len tag suitable for passt mode. That's ok because we don't actually use these iovecs for pasta mode. However, we do know the mode in udp_sock[46]_iov_init() so adjust these to the beginning of the headers we'll actually need for the mode: including the vnet_len tag for passt, but excluding it for pasta. This allows a slightly nicer way to locate the right buffer to send in the pasta case, and will allow some additional cleanups later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:19 +01:00
David Gibson	d533807027	udp: Better factor IPv4 and IPv6 paths in udp_sock_handler() Apart from which mh array they're operating on the recvmmsg() calls in udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are some of the control structure updates. By using some local variables to refer to the IP version specific control arrays, make some more logic common between the IPv4 and IPv6 paths. As well as slightly reducing the code size, this makes it less likely that we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we did in a recently fixed bug. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:16 +01:00
David Gibson	6af7ee74cf	udp: Fix incorrect use of IPv6 mh buffers in IPv4 path udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path. In fact this is harmless because this assignment is redundant (the 0th entry msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and won't change). There is also an incorrect usage of udp6_l2_mh_tap[] in udp_sock_fill_data_v4. This one can cause real problems, because we'll use stale iov_len values if we send multiple messages to the qemu socket. Most of the time that will be relatively harmless - we're likely to either drop UDP packets, or send duplicates. However, if the stale iov_len we use ends up referencing an uninitialized buffer we could desynchronize the qemu stream socket. Correct both these bugs. The UDP6 path appears to be correct, but it does have some comments that incorrectly reference the IPv4 versions, so fix those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:07 +01:00
David Gibson	34764ea4f3	udp: Correct splice forwarding when receiving from multiple sources udp_sock_handler_splice() reads a whole batch of datagrams at once with recvmmsg(). It then forwards them all via a single socket on the other side, based on the source port. However, it's entirely possible that the datagrams in the set have different source ports, and thus ought to be forwarded via different sockets on the destination side. In fact this situation arises with the iperf -P4 throughput tests in our own test suite. AFAICT we only get away with this because iperf3 is strictly one way and doesn't send reply packets which would be misdirected because of the incorrect source ports. Alter udp_sock_handler_splice() to split the packets it receives into batches with the same source address and send each batch with a separate sendmmsg(). For now we only look for already contiguous batches, which means that if there are multiple active flows interleaved this is likely to degenerate to batches of size 1. For now this is the simplest way to correct the behaviour and we can try to optimize later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:03 +01:00
David Gibson	2dec914209	udp: Split send half of udp_sock_handler_splice() from the receive half Move the part of udp_sock_handler_splice() concerned with sending out the datagrams into a new udp_splice_sendfrom() helper. This will make later cleanups easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:00 +01:00
David Gibson	fc7f91e709	udp: Unify buffers for tap and splice paths We maintain a set of buffers for UDP packets to be forwarded via the tap interface in udp[46]_l2_buf. We then have a separate set of buffers for packets to be "spliced" in udp_splice_buf[]. However, we only use one of these at a time, so we can share the buffer space. For the receiving splice packets we can not only re-use the data buffers but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures. For sending the splice packets we keep the same data buffers, but we need specific control structures. We create udp[46]_iov_splice - we can't reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing spliced packets, but the tap path expects iov_len to remain the same (it only uses it for receive). Likewise we create udp[46]_mh_splice with the mmsghdr structures for sending spliced packets. As well as needing to reference different iovs, these need to all reference udp_splice_namebuf instead of individual msg_name fields for each slot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:56 +01:00
David Gibson	c52ca4aecf	udp: Add helper to extract port from a sockaddr_in or sockaddr_in6 udp_sock_handler_splice() has a somewhat clunky if to extract the port from a socket address which could be either IPv4 or IPv6. Future changes are going to make this even more clunky, so introduce a helper function to do this extraction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:53 +01:00
David Gibson	e7e2a321ba	udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing These two constants have the same value, and there's not a lot of reason they'd ever need to be different. Future changes will further integrate the spliced and "tap" paths so that these need to be the same. So, merge them into UDP_MAX_FRAMES. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:51 +01:00
David Gibson	5b0027f942	udp: Simplify udp_sock_handler_splice Previous cleanups mean that we can now rework some complex ifs in udp_sock_handler_splice() into a simpler set. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:48 +01:00
David Gibson	71d2595a8f	udp: Update UDP "connection" timestamps in both directions A UDP pseudo-connection between port A in the init namespace and port B in the pasta guest namespace involves two sockets: udp_splice_init[v6][B] and udp_splice_ns[v6][A]. The socket which originated this "connection" will be permanent but the other one will be closed on a timeout. When we get a packet from the originating socket, we update the timeout on the other socket, but we don't do the same when we get a reply packet from the other socket. However any activity on the "connection" probably indicates that it's still in use. Without this we could incorrectly time out a "connection" if it's using a protocol which involves a single initiating packet, but which then gets continuing replies from the target. Correct this by updating the timeout on both sockets for a packet in either direction. This also updates the timestamps for the permanent originating sockets which is unnecessary, but harmless. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:45 +01:00
David Gibson	7610034fef	udp: Don't explicitly track originating socket for spliced "connections" When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice() we're finding the socket on which the originating packet for the "connection" was received on. However, we don't specifically need this socket to be the originating one - we just need one that's bound to the the source port of this reply packet in the init namespace. We can look this up in udp_splice_to_init[v6][src].target_sock, whose defining characteristic is exactly that. The same applies with init and ns swapped. In practice, of course, the port we locate this way will always be the originating port, since we couldn't have started this "connection" if it wasn't. Change this, and we no longer need the @orig_sock field at all. That leaves just @target_sock which we rename to simply @sock. The whole udp_splice_flow structure now more represents a single bound port than a "flow" per se, so rename and recomment it accordingly. Likewise the udp_splice_to_{ns,init} names are now misleading, since the ports in those maps are used in both directions. Rename them to udp_splice_{ns,init} indicating the location where the described socket is bound. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:42 +01:00
David Gibson	27bfebb061	udp: Re-use fixed bound sockets for packet forwarding when possible When we look up udp_splice_to_ns[v6][src].target_sock in udp_sock_handler_splice, all we really require of the socket is that it be bound to port src in the pasta guest namespace. Similarly for udp_splice_to_init but bound in the init namespace. Usually these sockets are created temporarily by udp_splice_connect() and cleaned up by udp_timer(). However, depending on the -u and -U options its possible we have a permanent socket bound to the relevant port created by udp_sock_init(). If such a socket exists, we could use it instead of creating a temporary one. In fact we must use it, because we'll fail trying to bind() a temporary one to the same port. So allow this, store permanently bound sockets into udp_splice_to_{ns,init} in udp_sock_init(). These won't get incorrectly removed by the timer because we don't put a corresponding entry in the udp_act[] structure which directs the timer what to clean up. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:40 +01:00
David Gibson	c277c6dd7d	udp: Don't create double sockets for -U port For each IP version udp_socket() has 3 possible calls to sock_l4(). One is for the "non-spliced" bound socket in the init namespace, one for the "spliced" bound socket in the init namespace and one for the "spliced" bound socket in the pasta namespace. However when this is called to create a socket in the pasta namspeace there is a logic error which causes it to take the path for the init side spliced socket as well as the ns socket. This essentially tries to create two identical sockets on the ns side. Unsurprisingly the second bind() call fails according to strace. Correct this to only attempt to open one socket within the ns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:35 +01:00
David Gibson	d9394eb9b7	udp: Split splice field in udp_epoll_ref into (mostly) independent bits The @splice field in union udp_epoll_ref can have a number of values for different types of "spliced" packet flows. Split it into several single bit fields with more or less independent meanings. The new @splice field is just a boolean indicating whether the socket is associated with a spliced flow, making it identical to the @splice fiend in tcp_epoll_ref. The new bit @orig, indicates whether this is a socket which can originate new udp packet flows (created with -u or -U) or a socket created on the fly to handle reply socket. @ns indicates whether the socket lives in the init namespace or the pasta namespace. Making these bits more orthogonal to each other will simplify some future cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:31 +01:00
David Gibson	8517239243	udp: Remove the @bound field from union udp_epoll_ref We set this field, but nothing ever checked it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:28 +01:00
David Gibson	1cd684b09b	udp: Don't connect "forward" sockets for spliced flows Currently we connect() the socket we use to forward spliced UDP flows. However, we now only ever use sendto() rather than send() on this socket so there's not actually any need to connect it. Don't do so. Rename a number of things that referred to "connect" or "conn" since that would now be misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:25 +01:00
David Gibson	9ef31b7619	udp: Always use sendto() rather than send() for forwarding spliced packets udp_sock_handler_splice() has two different ways of sending out packets once it has determined the correct destination socket. For the originating sockets (which are not connected) it uses sendto() to specify a specific address. For the forward socket (which is connected) we use send(). However we know the correct destination address even for the forward socket we do also know the correct destination address. We can use this to use sendto() instead of send(), removing the need for two different paths and some staging data structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:22 +01:00
David Gibson	729edc241d	udp: Separate tracking of inbound and outbound packet flows Each entry udp_splice_map[v6][N] keeps information about two essentially unrelated packet flows. @ns_conn_sock, @ns_conn_ts and @init_bound_sock track a packet flow from port N in the host init namespace to some other port in the pasta namespace (the one @ns_conn_sock is connected to). @init_conn_sock, @init_conn_ts and @ns_bound_sock track packet flow from port N in the pasta namespace to some other port in the host init namespace (the one @init_conn_sock is connected to). Split udp_splice_map[][] into two separate tables for the two directions. Each entry in each table is a 'struct udp_splice_flow' with @orig_sock (previously the bound socket), @target_sock (previously the connected socket) and @ts (the timeout for the target socket). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:18 +01:00
David Gibson	4ebb4905e9	udp: Also bind() connected ports for "splice" forwarding pasta handles "spliced" port forwarding by resending datagrams received on a bound socket in the init namespace to a connected socket in the guest namespace. This means there are actually three ports associated with each "connection". First there's the source and destination ports of the originating datagram. That's also the destination port of the forwarded datagram, but the source port of the forwarded datagram is the kernel allocated bound address of the connected socket. However, by bind()ing as well as connect()ing the forwarding socket we can choose the source port of the forwarded datagrams. By choosing it to match the original source port we remove that surprising third port number and no longer need to store port numbers in struct udp_splice_port. As a bonus this means that the recipient of the packets will see the original source port if they call getpeername(). This rarely matters, but it can't hurt. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:40:56 +01:00
Stefano Brivio	3a2afde87d	conf, udp: Drop mostly duplicated dns_send arrays, rename related fields Given that we use just the first valid DNS resolver address configured, or read from resolv.conf(5) on the host, to forward DNS queries to, in case --dns-forward is used, we don't need to duplicate dns[] to dns_send[]: - rename dns_send[] back to dns[]: those are the resolvers we advertise to the guest/container - for forwarding purposes, instead of dns[], use a single field (for each protocol version): dns_host - and rename dns_fwd to dns_match, so that it's clear this is the address we are matching DNS queries against, to decide if they need to be forwarded Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-16 15:09:31 +01:00
Stefano Brivio	817eedc28a	tcp, udp: Don't initialise IPv6/IPv4 sockets if IPv4/IPv6 are not enabled If we disable a given IP version automatically (no corresponding default route on host) or administratively (--ipv4-only or --ipv6-only options), we don't initialise related buffers and services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap" handlers will also ignore packets with a disabled IP version. However, in commit `3c6ae62510` ("conf, tcp, udp: Allow address specification for forwarded ports") I happily changed socket initialisation functions to take AF_UNSPEC meaning "any enabled IP version", but I forgot to add checks back for the "enabled" part. Reported by Paul: on a host without default IPv6 route, but IPv6 enabled, connect, using IPv6, to a port handled by pasta, which tries to send data to a tap device without initialised buffers for that IP version and exits because the resulting write() fails. Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or pasta -4 and inbound IPv6 connection. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: `3c6ae62510` ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-10 11:17:50 +01:00
Stefano Brivio	db74679f98	udp: Check for answers to forwarded DNS queries before handling local redirects Now that we allow loopback DNS addresses to be used as targets for forwarding, we need to check if DNS answers come from those targets, before deciding to eventually remap traffic for local redirects. Otherwise, the source address won't match the one configured as forwarder, which means that the guest or the container will refuse those responses. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:32 +01:00
David Gibson	7c7b68dbe0	Use typing to reduce chances of IPv4 endianness errors We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:24 +01:00
David Gibson	dd3470d9a9	Use IPV4_IS_LOOPBACK more widely This macro checks if an IPv4 address is in the loopback network (127.0.0.0/8). There are two places where we open code an identical check, use the macro instead. There are also a number of places we specifically exclude the loopback address (127.0.0.1), but we should actually be excluding anything in the loopback network. Change those sites to use the macro as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:21 +01:00
Stefano Brivio	346da48fe6	udp: Fix port and address checks for DNS forwarder First off, as we swap endianness for source ports in udp_fill_data_v{4,6}(), we want host endianness, not network endianness. It doesn't actually matter if we use htons() or ntohs() here, but the current version is confusing. In the IPv4 path, when we remap DNS answers, we already swapped the endianness as needed for the source port: don't swap it again, otherwise we'll not map DNS answers for IPv4. In the IPv6 path, when we remap DNS answers, we want to check that they came from our upstream DNS server, not the one configured via --dns-forward (which doesn't even need to exist for this functionality to work). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-15 02:10:36 +02:00
Stefano Brivio	c1eff9a3c6	conf, tcp, udp: Allow specification of interface to bind to Since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can bind sockets to interfaces, if they haven't been bound yet (as in bind()). Introduce an optional interface specification for forwarded ports, prefixed by %, that can be passed together with an address. Reported use case: running local services that use ports we want to have externally forwarded: https://github.com/containers/podman/issues/14425 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-15 02:10:36 +02:00
Stefano Brivio	da152331cf	Move logging functions to a new file, log.c Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-14 17:38:25 +02:00
Stefano Brivio	5290b6f13e	udp: Replace pragma to ignore bogus stringop-overread warning with workaround Commit `c318ffcb4c` ("udp: Ignore bogus -Wstringop-overread for write() from gcc 12.1") uses a gcc pragma to ignore a bogus warning, which started appearing on gcc 12.1 (aarch64 and x86_64) due to: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103483 ...but gcc 12.2 has the same issue. Not just that: if LTO is enabled, the pragma itself is ignored (this wasn't the case with gcc 12.1), because of: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80922 Drop the pragma, and assign a frame (in the networking sense) pointer from the offset of the Ethernet header in the buffer, then pass it to write() and pcap(), so that gcc doesn't obsess anymore with the fact that an Ethernet header is 14 bytes and we're sending more than that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-09-29 12:23:10 +02:00
David Gibson	d5b80ccc72	Fix widespread off-by-one error dealing with port numbers Port numbers (for both TCP and UDP) are 16-bit, and so fit exactly into a 'short'. USHRT_MAX is therefore the maximum port number and this is widely used in the code. Unfortunately, a lot of those places don't actually want the maximum port number (USHRT_MAX == 65535), they want the total number of ports (65536). This leads to a number of potentially nasty consequences: * We have buffer overruns on the port_fwd::delta array if we try to use port 65535 * We have similar potential overruns for the tcp_sock_* arrays * Interestingly udp_act had the correct size, but we can calculate it in a more direct manner * We have a logical overrun of the ports bitmap as well, although it will just use an unused bit in the last byte so isnt harmful * Many loops don't consider port 65535 (which does mitigate some but not all of the buffer overruns above) * In udp_invert_portmap() we incorrectly compute the reverse port translation for return packets Correct all these by using a new NUM_PORTS defined explicitly for this purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	3ede07aac9	Treat port numbers as unsigned Port numbers are unsigned values, but we're storing them in (signed) int variables in some places. This isn't actually harmful, because int is large enough to hold the entire range of ports. However in places we don't want to use an in_port_t (usually to avoid overflow on the last iteration of a loop) it makes more conceptual sense to use an unsigned int. This will also avoid some problems with later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	f5a31ee94c	Don't use indirect remap functions for conf_ports() Now that we've delayed initialization of the UDP specific "reverse" map until udp_init(), the only difference between the various 'remap' functions used in conf_ports() is which array they target. So, simplify by open coding the logic into conf_ports() with a pointer to the correct mapping array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	1467a35b5a	udp: Delay initialization of UDP reversed port mapping table Because it's connectionless, when mapping UDP ports we need, in addition to the table of deltas for destination ports needed by TCP, we need an inverted table to translate the source ports on return packets. Currently we fill out the inverted table at the same time we construct the main table in udp_remap_to_tap() and udp_remap_to_init(). However, we don't use either table until after we've initialized UDP, so we can delay the construction of the reverse table to udp_init(). This makes the configuration more symmetric between TCP and UDP which will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	163dc5f188	Consolidate port forwarding configuration into a common structure The configuration for how to forward ports in and out of the guest/ns is divided between several different variables. For each connect direction and protocol we have a mode in the udp/tcp context structure, a bitmap of which ports to forward also in the context structure and an array of deltas to apply if the outward facing and inward facing port numbers are different. This last is a separate global variable, rather than being in the context structure, for no particular reason. UDP also requires an additional array which has the reverse mapping used for return packets. Consolidate these into a re-used substructure in the context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-24 14:48:35 +02:00
David Gibson	af9c98af5f	udp: Don't drop zero-length outbound UDP packets udp_tap_handler() currently skips outbound packets if they have a payload length of zero. This is not correct, since in a datagram protocol zero length packets still have meaning. Adjust this to correctly forward the zero-length packets by using a msghdr with msg_iovlen == 0. Bugzilla: https://bugs.passt.top/show_bug.cgi?id=19 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-13 11:14:29 +02:00
David Gibson	484652c632	udp: Don't pre-initialize msghdr array In udp_tap_handler() the array of msghdr structures, mm[], is initialized to zero. Since UIO_MAXIOV is 1024, this can be quite a large zero, which is expensive if we only end up using a few of its entries. It also makes it less obvious how we're setting all the control fields at the point we actually invoke sendmmsg(). Rather than pre-initializing it, just initialize each element as we use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-13 11:14:29 +02:00
David Gibson	16f5586bb8	Make substructures for IPv4 and IPv6 specific context information The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure. This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-07-30 22:14:07 +02:00
David Gibson	5e12d23acb	Separate IPv4 and IPv6 configuration After recent changes, conf_ip() now has essentially entirely disjoint paths for IPv4 and IPv6 configuration. So, it's cleaner to split them out into different functions conf_ip4() and conf_ip6(). Splitting these out also lets us make the interface a bit nicer, having them return success or failure directly, rather than manipulating c->v4 and c->v6 to indicate success/failure of the two versions. Since these functions may also initialize the interface index for each protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since a 0 interface index is never valid). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Whitespace fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-07-30 22:12:50 +02:00
Stefano Brivio	c318ffcb4c	udp: Ignore bogus -Wstringop-overread for write() from gcc 12.1 With current OpenSUSE Tumbleweed on aarch64 (gcc-12-1.3.aarch64) and on x86_64 (gcc-12-1.4.x86_64), but curiously not on armv7hl (gcc-12-1.3.armv7hl), gcc warns about using the _pointer_ to the 802.3 header to write the whole frame to the tap descriptor: reading between 62 and 4294967357 bytes from a region of size 14 which is bogus: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103483 Probably declaring udp_sock_fill_data_v{4,6}() as noinline would "fix" this, but that's on the data path, so I'd rather not. Use a gcc pragma instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-05-19 16:27:20 +02:00
Stefano Brivio	3c6ae62510	conf, tcp, udp: Allow address specification for forwarded ports This feature is available in slirp4netns but was missing in passt and pasta. Given that we don't do dynamic memory allocation, we need to bind sockets while parsing port configuration. This means we need to process all other options first, as they might affect addressing and IP version support. It also implies a minor rework of how TCP and UDP implementations bind sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-05-01 07:19:05 +02:00
Stefano Brivio	2b1fbf4631	udp: Out-of-bounds read, CWE-125 in udp_timer() Not an actual issue due to how it's typically stored, but udp_act can also be used for ports 65528-65535. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	22ed4467a4	treewide: Unchecked return value from library, CWE-252 All instances were harmless, but it might be useful to have some debug messages here and there. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	37c228ada8	tap, tcp, udp, icmp: Cut down on some oversized buffers The existing sizes provide no measurable differences in throughput and packet rates at this point. They were probably needed as batched implementations were not complete, but they can be decreased quite a bit now. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	ad7f57a5b7	udp: Move flags before ts in struct udp_tap_port, avoid end padding Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	48582bf47f	treewide: Mark constant references as const Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	bb70811183	treewide: Packet abstraction with mandatory boundary checks Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	3eb19cfd8a	tcp, udp, util: Enforce 24-bit limit on socket numbers This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	79217b7689	udp: Use flags for local, loopback, and configured unicast binds There's no value in keeping a separate timestamp for activity and for aging of local binds, given that they have the same timeout. Reduce that to a single timestamp, with a flag indicating the local bind. Also use flags instead of separate int fields for loopback and configured unicast address usage as source address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-28 17:11:40 +02:00
Stefano Brivio	5eb7604203	udp: Split buffer queueing/writing parts of udp_sock_handler() ...it became too hard to follow: split it off to udp_sock_fill_data_v{4,6}. While at it, use IN6_ARE_ADDR_EQUAL(a, b), courtesy of netinet/in.h, instead of open-coded memcmp(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-28 17:11:40 +02:00
Stefano Brivio	0296a57242	udp: Drop _splice from recv, send, sendto static buffer names It's already implied by the fact they don't have "l2" in their names, and dropping it improves readability a bit. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-28 17:11:40 +02:00
Stefano Brivio	eed6933e6c	udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6} Not functionally needed, but gcc versions 7 to 9 (at least) will issue a warning otherwise. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-25 22:54:35 +01:00
Stefano Brivio	6c93111864	tcp, udp: Receive batching doesn't pay off when writing single frames to tap In pasta mode, when we get data from sockets and write it as single frames to the tap device, we batch receive operations considerably, and then (conceptually) split the data in many smaller writes. It looked like an obvious choice, but performance is actually better if we receive data in many small frame-sized recvmsg()/recvmmsg(). The syscall overhead with the previous behaviour, observed by perf, comes predominantly from write operations, but receiving data in shorter chunks probably improves cache locality by a considerable amount. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-21 13:41:13 +01:00
Stefano Brivio	9afd87b733	udp: Allow loopback connections from host using configured unicast address Likely for testing purposes only: allow connections from host to guest or namespace using, as connection target, the configured, possibly global unicast address. In this case, we have to map the destination address to a link-local address, and for port-based tracked responses, the source address needs to be again the unicast address: not loopback, not link-local. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-21 13:41:13 +01:00
Stefano Brivio	89678c5157	conf, udp: Introduce basic DNS forwarding For compatibility with libslirp/slirp4netns users: introduce a mechanism to map, in the UDP routines, an address facing guest or namespace to the first IPv4 or IPv6 address resulting from configuration as resolver. This can be enabled with the new --dns-forward option. This implies that sourcing and using DNS addresses and search lists, passed via command line or read from /etc/resolv.conf, is not bound anymore to DHCP/DHCPv6/NDP usage: for example, pasta users might just want to use addresses from /etc/resolv.conf as mapping target, while not passing DNS options via DHCP. Reflect this in all the involved code paths by differentiating DHCP/DHCPv6/NDP usage from DNS configuration per se, and in the new options --dhcp-dns, --dhcp-search for pasta, and --no-dhcp-dns, --no-dhcp-search for passt. This should be the last bit to enable substantial compatibility between slirp4netns.sh and slirp4netns(1): pass the --dns-forward option from the script too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-21 13:41:13 +01:00
Stefano Brivio	0515adceaa	passt, pasta: Namespace-based sandboxing, defer seccomp policy application To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-21 13:41:13 +01:00
Stefano Brivio	caa22aa644	tcp, udp, util: Fixes for bitmap handling on big-endian, casts Bitmap manipulating functions would otherwise refer to inconsistent sets of bits on big-endian architectures. While at it, fix up a couple of casts. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-26 16:30:59 +01:00
Stefano Brivio	b93c2c1713	passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-26 07:57:09 +01:00
Stefano Brivio	627e18fa8a	passt: Add cppcheck target, test, and address resulting warnings ...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 09:41:13 +02:00
Stefano Brivio	27f5999677	udp: Avoid static initialiser for udp{4,6}_l2_buf With the new UDP_TAP_FRAMES value, the binary size grows considerably. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 04:49:25 +02:00
Stefano Brivio	85a80f8f63	udp: Fix maximum payload size calculation for IPv4 buffers, bump UDP_TAP_FRAMES The issue with a higher UDP_TAP_FRAMES was actually coming from a payload size the guest couldn't digest. Fix that, and bump UDP_TAP_FRAMES back to 128. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 04:42:09 +02:00
Stefano Brivio	dd942eaa48	passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers Unions and structs, you all have names now. Take the chance to enable bugprone-reserved-identifier, cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy. Provide a ffsl() weak declaration using gcc built-in. Start reordering includes, but that's not enough for the llvm-include-order checker yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 04:26:08 +02:00
Stefano Brivio	9618d24700	ndp, dhcpv6, tcp, udp: Always use link-local as source if gateway isn't This shouldn't happen on any sane configuration, but I just met an example of that: the default IPv6 gateway on the host is configured with a global unicast address, we use that as source for RA, DHCPv6 replies, and the guest ignores it. Same later on if we talk TCP or UDP and the guest has no idea where that address comes from. Use our link-local address in case the gateway address is global. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-20 11:10:23 +02:00
Stefano Brivio	12cfa6444c	passt: Add clang-tidy Makefile target and test, take care of warnings Most are just about style and form, but a few were actually serious mistakes (NDP-related). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-20 08:34:22 +02:00
Stefano Brivio	b0b77118fe	passt: Address warnings from Clang's scan-build All false positives so far. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-20 08:29:30 +02:00
Stefano Brivio	1a563a0cbd	passt: Address gcc 11 warnings A mix of unchecked return values, a missing permission mask for open(2) with O_CREAT, and some false positives from -Wstringop-overflow and -Wmaybe-uninitialized. Reported-by: Martin Hauke <mardnh@gmx.de> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-20 08:29:30 +02:00
Stefan Hajnoczi	14fe73e766	udp: drop bogus udp_tap_map ts assignment The 'ts' field is a timestamp so assigning the socket file descriptor is incorrect. There is no actual bug because the current time is assigned just a few lines later: udp_tap_map[V4][src].sock = s; udp_tap_map[V4][src].ts = s; ^^^^^^^^^^^ bogus ^^^^^^^^^^ bitmap_set(udp_act[V4][UDP_ACT_TAP], src); } udp_tap_map[V4][src].ts = now->tv_sec; ^^^^^^^^^^^^^^^ correct ^^^^^^^^^^^^^^ Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-15 20:46:17 +02:00
Stefano Brivio	f45891cf26	conf, tcp, udp: Add --no-map-gw to disable mapping gateway address to host Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-14 13:19:52 +02:00
Stefano Brivio	66d5930ec7	passt, pasta: Add seccomp support List of allowed syscalls comes from comments in the form: #syscalls <list> for syscalls needed both in passt and pasta mode, and: #syscalls:pasta <list> #syscalls:passt <list> for syscalls specifically needed in pasta or passt mode only. seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast. While at it, clean up a bit the Makefile using wildcards. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-14 13:15:46 +02:00
Giuseppe Scrivano	9a175cc2ce	pasta: Allow specifying paths and names of namespaces Based on a patch from Giuseppe Scrivano, this adds the ability to: - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only - specify the base directory for netns mountpoints, with --nsrun-dir Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-07 04:05:15 +02:00
Stefano Brivio	dd581730e5	tap: Completely de-serialise input message batches Until now, messages would be passed to protocol handlers in a single batch only if they happened to be dequeued in a row. Packets interleaved between different connections would result in multiple calls to the same protocol handler for a single connection. Instead, keep track of incoming packet descriptors, arrange them in sequences, and call protocol handlers only as we completely sorted input messages in batches. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	ec0bdc10b1	udp: Switch to new socket message after 32KiB instead of 64KiB For some reason, this measurably improves performance with qemu and virtio-net. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	c2d86b7475	udp: Decrease UDP_TAP_FRAMES to 16 Similarly to the decrease in TCP_TAP_FRAMES, this improves fairness, with a very small impact on performance. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	3994fc8f58	udp: Reset iov_base after sending partial message on sendmmsg() failure We set the length while processing messges, but the starting address is pre-initialised. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	ecf1f97564	udp: Fix comparison of seen IPv4 address for local connections c->addr4_seen is stored in network order. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	8e9333616a	udp: Fix retry mechanism on partial sendmmsg() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	647a413794	tcp, udp: Restore usage of gateway for guest to connect to local host This went lost in a recent rework: if the guest wants to connect directly to the host, it can use the address of the default gateway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	77d4efa236	udp: Handle partial failure in sendmmsg() to UNIX domain socket Similarly to the handling introduced by commit "tcp: Proper error handling for sendmmsg() to UNIX domain socket" for TCP, we need to deal with partial sendmmsg() failures for UDP as well. Here, we can lose messages, but we need to make sure that the last message is delivered completely, otherwise qemu will fail to reassemble further packets. For UDP, this is somewhat complicated by the fact that one message might include multiple datagrams, and we need to respect message boundaries: go through headers, and calculate what we need to re-send, if anything. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	1e49d194d0	passt, pasta: Introduce command-line options and port re-mapping Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	8af961b85b	tcp, udp: Map source address to gateway for any traffic from 127.0.0.0/8 ...instead of just 127.0.0.1. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 18:20:01 +02:00
Stefano Brivio	86b273150a	tcp, udp: Allow binding ports in init namespace to both tap and loopback Traffic with loopback source address will be forwarded to the direct loopback connection in the namespace, and the tap interface is used for the rest. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 14:10:29 +02:00
Stefano Brivio	17765f8de0	checksum: Introduce AVX2 implementation, unify helpers Provide an AVX2-based function using compiler intrinsics for TCP/IP-style checksums. The load/unpack/add idea and implementation is largely based on code from BESS (the Berkeley Extensible Software Switch) licensed as 3-Clause BSD, with a number of modifications to further decrease pipeline stalls and to minimise cache pollution. This speeds up considerably data paths from sockets to tap interfaces, decreasing overhead for checksum computation, with 16-64KiB packet buffers, from approximately 11% to 7%. The rest is just syscalls at this point. While at it, provide convenience targets in the Makefile for avx2, avx2_debug, and debug targets -- these simply add target-specific CFLAGS to the build. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 07:18:50 +02:00
Stefano Brivio	49631a38a6	tcp, udp: Split IPv4 and IPv6 bound port sets Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately. Port numbers of TCP ports that are bound in a namespace are also bound for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound if the corresponding IPv6 port is bound (socket might not have the IPV6_V6ONLY option set). This will also be configurable later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-21 17:44:39 +02:00
Stefano Brivio	64a0ba3b27	udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket Packets are received directly onto pre-cooked, static buffers for IPv4 (with partial checksum pre-calculation) and IPv6 frames, with pre-filled Ethernet addresses and, partially, IP headers, and sent out from the same buffers with sendmmsg(), for both passt and pasta (non-local traffic only) modes. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-21 12:01:04 +02:00
Stefano Brivio	33482d5bf2	passt: Add PASTA mode, major rework PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host connectivity to an otherwise disconnected, unprivileged network and user namespace, similarly to slirp4netns. Given that the implementation is largely overlapping with PASST, no separate binary is built: 'pasta' (and 'passt4netns' for clarity) both link to 'passt', and the mode of operation is selected depending on how the binary is invoked. Usage example: $ unshare -rUn # echo $$ 1871759 $ ./pasta 1871759 # From another terminal # udhcpc -i pasta0 2>/dev/null # ping -c1 pasta.pizza PING pasta.pizza (64.190.62.111) 56(84) bytes of data. 64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms --- pasta.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms # ping -c1 spaghetti.pizza PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes 64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms --- spaghetti.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms This entails a major rework, especially with regard to the storage of tracked connections and to the semantics of epoll(7) references. Indexing TCP and UDP bindings merely by socket proved to be inflexible and unsuitable to handle different connection flows: pasta also provides Layer-2 to Layer-2 socket mapping between init and a separate namespace for local connections, using a pair of splice() system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local bindings. For instance, building on the previous example: # ip link set dev lo up # iperf3 -s $ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 \| tail -n4 [SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender [SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver iperf Done. epoll(7) references now include a generic part in order to demultiplex data to the relevant protocol handler, using 24 bits for the socket number, and an opaque portion reserved for usage by the single protocol handlers, in order to track sockets back to corresponding connections and bindings. A number of fixes pertaining to TCP state machine and congestion window handling are also included here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-17 11:04:22 +02:00
Stefano Brivio	e07f539ae0	udp, passt: Introduce socket packet buffer, avoid getsockname() for UDP This is in preparation for scatter-gather IO on the UDP receive path: save a getsockname() syscall by setting a flag if we get the numbering of all bound sockets in a strict sequence (expected, in practice) and repurpose the tap buffer to be also a socket receive buffer, passing it down to protocol handlers. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-30 14:52:18 +02:00
Stefano Brivio	605af213c5	udp: Connection tracking for ephemeral, local ports, and related fixes As we support UDP forwarding for packets that are sent to local ports, we actually need some kind of connection tracking for UDP. While at it, this commit introduces a number of vaguely related fixes for issues observed while trying this out. In detail: - implement an explicit, albeit minimalistic, connection tracking for UDP, to allow usage of ephemeral ports by the guest and by the host at the same time, by binding them dynamically as needed, and to allow mapping address changes for packets with a loopback address as destination - set the guest MAC address whenever we receive a packet from tap instead of waiting for an ARP request, and set it to broadcast on start, otherwise DHCPv6 might not work if all DHCPv6 requests time out before the guest starts talking IPv4 - split context IPv6 address into address we assign, global or site address seen on tap, and link-local address seen on tap, and make sure we use the addresses we've seen as destination (link-local choice depends on source address). Similarly, for IPv4, split into address we assign and address we observe, and use the address we observe as destination - introduce a clock_gettime() syscall right after epoll_wait() wakes up, so that we can remove all the other ones and pass the current timestamp to tap and socket handlers -- this is additionally needed by UDP to time out bindings to ephemeral ports and mappings between loopback address and a local address - rename sock_l4_add() to sock_l4(), no semantic changes intended - include <arpa/inet.h> in passt.c before kernel headers so that we can use <netinet/in.h> macros to check IPv6 address types, and remove a duplicate <linux/ip.h> inclusion Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-29 17:15:26 +02:00
Stefano Brivio	b3b3451ae2	udp: Disable SO_ZEROCOPY again ...on a second thought, this won't really help with veth, and actually causes a significant overhead as we get EPOLLERR whenever another process is tapping on the traffic. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-25 10:41:55 +02:00
Stefano Brivio	38b50dba47	passt: Spare some syscalls, add some optimisations from profiling Avoid a bunch of syscalls on forwarding paths by: - storing minimum and maximum file descriptor numbers for each protocol, fall back to SO_PROTOCOL query only on overlaps - allocating a larger receive buffer -- this can result in more coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024), so make sure we don't exceed that within a single call to protocol tap handlers - nesting the handling loop in tap_handler() in the receive loop, so that we have better chances of filling our receive buffer in fewer calls - skipping the recvfrom() in the UDP handler on EPOLLERR -- there's nothing to be done in that case and while at it: - restore the 20ms timer interval for periodic (TCP) events, I accidentally changed that to 100ms in an earlier commit - attempt using SO_ZEROCOPY for UDP -- if it's not available, sendmmsg() will succeed anyway - fix the handling of the status code from sendmmsg(), if it fails, we'll try to discard the first message, hence return 1 from the UDP handler Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-23 22:22:37 +02:00
Stefano Brivio	6488c3e848	tcp, udp: Replace loopback source address by gateway address This is symmetric with tap operation and addressing model, and allows again to reach the guest behind the tap interface by contacting the local address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-22 17:03:43 +02:00
Stefano Brivio	1f7cf04d34	passt: Introduce packet batching mechanism Receive packets in batches from AF_UNIX, check if they can be sent with a single syscall, and batch them up with sendmmsg() in case. A bit rudimentary, currently only implemented for UDP, but it seems to work. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-22 13:39:36 +02:00
Stefano Brivio	f435e38927	udp: Fix typo in tcp_tap_handler() documentation Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-03-17 10:57:42 +01:00
Stefano Brivio	93977868f9	udp: Use size_t for return value of recvfrom() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-03-17 10:57:42 +01:00
Stefano Brivio	8bca388e8a	passt: Assorted fixes from "fresh eyes" review A bunch of fixes not worth single commits at this stage, notably: - make buffer, length parameter ordering consistent in ARP, DHCP, NDP handlers - strict checking of buffer, message and option length in DHCP handler (a malicious client could have easily crashed it) - set up forwarding for IPv4 and IPv6, and masquerading with nft for IPv4, from demo script - get rid of separate slow and fast timers, we don't save any overhead that way - stricter checking of buffer lengths as passed to tap handlers - proper dequeuing from qemu socket back-end: I accidentally trashed messages that were bundled up together in a single tap read operation -- the length header tells us what's the size of the next frame, but there's no apparent limit to the number of messages we get with one single receive - rework some bits of the TCP state machine, now passive and active connection closes appear to be robust -- introduce a new FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag from socket - streamline TCP option parsing routine - track TCP state changes to stderr (this is temporary, proper debugging and syslogging support pending) - observe that multiplying a number by four might very well change its value, and this happens to be the case for the data offset from the TCP header as we check if it's the same as the total length to find out if it's a duplicated ACK segment - recent estimates suggest that the duration of a millisecond is closer to a million nanoseconds than a thousand of them, this trend is now reflected into the timespec_diff_ms() convenience routine Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-02-21 11:55:49 +01:00
Stefano Brivio	105b916361	passt: New design and implementation with native Layer 4 sockets This is a reimplementation, partially building on the earlier draft, that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW, providing L4-L2 translation functionality without requiring any security capability. Conceptually, this follows the design presented at: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md The most significant novelty here comes from TCP and UDP translation layers. In particular, the TCP state and translation logic follows the intent of being minimalistic, without reimplementing a full TCP stack in either direction, and synchronising as much as possible the TCP dynamic and flows between guest and host kernel. Another important introduction concerns addressing, port translation and forwarding. The Layer 4 implementations now attempt to bind on all unbound ports, in order to forward connections in a transparent way. While at it: - the qemu 'tap' back-end can't be used as-is by qrap anymore, because of explicit checks now introduced in qemu to ensure that the corresponding file descriptor is actually a tap device. For this reason, qrap now operates on a 'socket' back-end type, accounting for and building the additional header reporting frame length - provide a demo script that sets up namespaces, addresses and routes, and starts the daemon. A virtual machine started in the network namespace, wrapped by qrap, will now directly interface with passt and communicate using Layer 4 sockets provided by the host kernel. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-02-16 09:28:55 +01:00

1 2 3 4 5

205 commits