passt

Author	SHA1	Message	Date
David Gibson	dc467d526f	tcp: Split pool lookup from creating new sockets in tcp_conn_new_sock() tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if that's empty creates a new socket in the init namespace. Both parts of this are duplicated in other places: the pool lookup logic is duplicated in tcp_splice_new(), and the socket opening logic is duplicated in tcp_sock_refill_pool(). Split the function into separate parts so we can remove both these duplications. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:25:11 +01:00
David Gibson	912d37cd5b	tcp: Move socket pool declarations around tcp_splice.c has some explicit extern declarations to access the socket pools. This is pretty dangerous - if we changed the type of these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the same memory in different ways with no compiler error. So, move the extern declarations to tcp_conn.h so they're visible to both tcp.c and tcp_splice.c, but not the rest of pasta. In fact the pools for the guest namespace are necessarily only used by tcp_splice.c - we have no sockets on the guest side if we're not splicing. So move those declarations and the functions that deal exclusively with them to tcp_splice.c Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:25:08 +01:00
David Gibson	c8993476d5	tcp: Split init and ns cases for tcp_sock_refill() With the creation of the tcp_sock_refill_pool() helper, the ns==true and ns==false paths for tcp_sock_refill() now have almost nothing in common. Split the two versions into tcp_sock_refill_init() and tcp_sock_refill_ns() functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:24:51 +01:00
David Gibson	e456c1ccdc	tcp: Make a helper to refill each socket pool tcp_sock_refill() contains two near-identical loops to refill the IPv4 and IPv6 socket pools. In addition, if we get an error on the IPv4 pool we exit early and won't attempt to refill the IPv6 pool. At least theoretically, these are independent from each other and there's value to filling up either pool without the other. So, there's no strong reason to give up on one because the other failed. Address both of these with a helper function 'tcp_sock_refill_pool()' to refill a single given pool. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:24:40 +01:00
Florian Weimer	ee8353cd64	Makefile: Explict int type in FALLOC_FL_COLLAPSE_RANGE probe Future compilers will not support implicit ints by default, causing the probe to always fail. Link: https://bugs.passt.top/show_bug.cgi?id=42 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:24:16 +01:00
Stefano Brivio	67920b8820	test/pasta_options: Ignore failures on shell 'exit' On shell 'exit' commands, running shells from pasta, we might get: Cannot set tty process group (No such process) as some TTY devices might be unaccessible. This is harmless, but after commit "pasta: propagate exit code from child command", we'll get test failures there, at least with dash. Ignore those explicitly with a ugly workaround: we can't simply do something like: exit \|\| : because the failure is reported by the shell itself once it exits, regardless of the command evaluation. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-13 00:05:48 +01:00
Paul Holzinger	a234407f5c	pasta: propagate exit code from child command Exits codes are very useful for scripts, when the pasta child execvp() call fails with ENOENT that parent should also exit with > 0. In short the parent should always exit with the code from the child to make it useful in scripts. It is easy to test with: `pasta -- bash -c "exit 3"; echo $?` Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 23:42:50 +01:00
Paul Holzinger	04dfc5b81f	pasta: correctly exit when execvp() fails By default clone() will create a child that does not send SIGCHLD when the child exits. The caller has to specifiy the SIGNAL it should get in the flag bitmask. see clone(2) under "The child termination signal" This fixes the problem where pasta would not exit when the execvp() call failed, i.e. when the command does not exists. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 23:42:43 +01:00
Paul Holzinger	3487b5fc93	pasta: do not leak netlink sock into child When spawning a child command with pasta command... pasta should not leak fds that it opened. Only the fds that were already open should be given to the child. Run `pasta --config-net -- ls -l /proc/self/fd` from a terminal where only stdin/out/err are open. The fd 3 was opend by ls to read the /proc/self/fd dir. But fd 5 is the netlink socket that was opend in pasta. To prevent such a leak we will open the socket with SOCK_CLOEXEC. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 23:42:34 +01:00
David Gibson	7a8ed9459d	Make assertions actually useful There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 23:42:24 +01:00
Stefano Brivio	cc6d8286d1	tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer David reports that TCP transfers might stall, especially with smaller socket buffer sizes, because we reset the ACK_FROM_TAP_DUE flag, in tcp_tap_handler(), whenever we receive an ACK segment, regardless of its sequence number and the fact that we might still be waiting for one. This way, we might fail to re-transmit frames on ACK timeouts. We need, instead, to: - indicate with the @retrans field only re-transmissions for the same data sequences. If we make progress, it should be reset, given that it's used to abort a connection when we exceed a given number of re-transmissions for the same data - unset the ACK_FROM_TAP_DUE flag if and only if the acknowledged sequence is the same as the last one we sent, as suggested by David - keep it set otherwise, if progress was done but not all the data we sent was acknowledged, and update the expiration of the ACK timeout Add a new helper for these purposes, tcp_update_seqack_from_tap(). To extend the ACK timeout, the new helper sets the ACK_FROM_TAP_DUE flag, even if it was already set, and conn_flag_do() triggers a timer update. This part should be revisited at a later time, because, strictly speaking, ACK_FROM_TAP_DUE isn't a flag anymore. One possibility might be to introduce another connection attribute for events affecting timer deadlines. Reported-by: David Gibson <david@gibson.dropbear.id.au> Link: https://bugs.passt.top/show_bug.cgi?id=41 Suggested-by: David Gibson <david@gibson.dropbear.id.au> Fixes: `be5bbb9b06` ("tcp: Rework timers to use timerfd instead of periodic bitmap scan") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 22:26:55 +01:00
Stefano Brivio	ac153595c0	tap: Send frames after the first one in tap_send_frames_pasta() ...instead of repeatedly sending out the first one in iov. Fixes: `e21ee41ac3` ("tcp: Combine two parts of pasta tap send path together") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 14:34:38 +01:00
Stefano Brivio	d8921dafe5	pasta: Wait for tap to be set up before spawning command Adapted from a patch by Paul Holzinger: when pasta spawns a command, operating without a pre-existing user and network namespace, it needs to wait for the tap device to be configured and its handler ready, before the command is actually executed. Otherwise, something like: pasta --config-net nslookup passt.top usually fails as the nslookup command is issued before the network interface is ready. We can't adopt a simpler approach based on SIGSTOP and SIGCONT here: the child runs in a separate PID namespace, so it can't send SIGSTOP to itself as the kernel sees the child as init process and blocks the delivery of the signal. We could send SIGSTOP from the parent, but this wouldn't avoid the possible condition where the child isn't ready to wait for it when the parent sends it, also raised by Paul -- and SIGSTOP can't be blocked, so it can never be pending. Use SIGUSR1 instead: mask it before clone(), so that the child starts with it blocked, and can safely wait for it. Once the parent is ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the child is waiting for it, the kernel will queue it for us, because it's blocked. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: `1392bc5ca0` ("Allow pasta to take a command to execute") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 12:22:59 +01:00
David Gibson	54502cca7f	udp: Use tap_send_frames() To send frames on the tap interface, the UDP uses a fairly complicated two level batching. First multiple frames are gathered into a single "message" for the qemu stream socket, then multiple messages are send with sendmmsg(). We now have tap_send_frames() which already deals with sending a number of frames, including batching and handling partial sends. Use that to considerably simplify things. This does make a couple of behavioural changes: * We used to split messages to keep them under 32kiB (except when a single frame was longer than that). The comments claim this is needed to stop qemu from closing the connection, but we don't have any equivalent logic for TCP. I wasn't able to reproduce the problem with this series, although it was apparently easy to reproduce earlier. My suspicion is that there was never an inherent need to keep messages small, however with larger messages (and default kernel buffer sizes) the chances of needing more than one resend for partial send()s is greatly increased. We used not to correctly handle that case of multiple resends, but now we do. * Previously when we got a partial send on UDP, we would resend the remainder of the entire "message", including multiple frames. The common code now only resends the remainder of a single frame, simply dropping any frames which weren't even partially sent. This is what TCP always did and is probably a better idea for UDP too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:55:04 +01:00
David Gibson	2d553b587a	tap: Improve handling of partial frame sends In passt mode, when writing frames to the qemu socket, we might get a short send. If we ignored this and carried on, the qemu socket would get out of sync, because the bytes we actually sent wouldn't correspond to the length header we already sent. tap_send_frames_passt() handles that by doing a a blocking send to complete the message, but it has a few flaws: * We only attempt to resend once: although it's unlikely in practice, nothing prevents the blocking send() from also being short * We print a debug error if send() returns non-zero.. but send() returns the number of bytes sent, so we actually want it to return the length of the remaining data. Correct those flaws and also be a bit more thorough about reporting problems here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:55:01 +01:00
David Gibson	97d1ca2ed6	udp: Use abstracted tap header Update the UDP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:59 +01:00
David Gibson	0fb7b2b908	tap: Use different io vector bases depending on tap type Currently tap_send_frames() expects the frames it is given to include the vnet_len field, even in pasta mode which doesn't use it (although it need not be initialized in that case). To match, tap_iov_base() and tap_iov_len() construct the frame in that way. This will inconvenience future changes, so alter things to set the buffers to include just the frame needed by the tap backend type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:57 +01:00
David Gibson	2416310a17	tcp: Use abstracted tap header Update the TCP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:54 +01:00
David Gibson	4b3d38a069	tap: Add "tap headers" abstraction Currently both the TCP and UDP code need to deal in various places with the details of the L2 headers, and also the tap-specific "vnet_len" header. This makes abstracting the tap interface to new backends (e.g. vhost-user or tun) more difficult. To improve this abstraction, create a new 'tap_hdr' structure which represents both L2 (always Ethernet at the moment, but might be vary in future) and any additional tap specific headers (such as the qemu socket's vnet_len field). Provide helper functions and macros to initialize, update and use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:52 +01:00
David Gibson	e4443ba9bd	tcp: Consolidate calculation of total frame size tcp_l2_buf_fill_headers() returns the size of the generated frame including the ethernet header. The caller then adds on the size of the vnet_len field to get the total frame size to be passed to the tap device. Outside the tap code, though, we never care about the ethernet header size only the final total size we need to put into an iovec. So, consolidate the total frame size calculation within tcp_l2_buf_fill_headers(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:49 +01:00
David Gibson	dcea26076d	tcp: Remove redundant and incorrect initialization from *_iov_init() tcp_sock[46]_iov_init() initialize the length of each iovec buffer to MSS_DEFAULT. That will always be overwritten before use in tcp_data_to_tap, so it's redundant. It also wasn't correct, because it didn't correctly account for the header lengths in all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:47 +01:00
David Gibson	716a926ef4	util: Parameterize ethernet header initializer macro We have separate IPv4 and IPv6 versions of a macro to construct an initializer for ethernet headers. However, now that we have htons_constant it's easy to simply paramterize this with the ethernet protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:46 +01:00
David Gibson	67afaab411	tcp, udp: Use named field initializers in iov_init functions Both the TCP and UDP iov_init functions have some large structure literals defined in "field order" style. These are pretty hard to read since it's not obvious what value corresponds to what field. Use named field style initializers instead to make this clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:44 +01:00
David Gibson	f5a950115b	util: Introduce hton*_constant() in place of #ifdefs We have several places where we have fairly ugly #ifdefs on __BYTE_ORDER where we need network order values in a constant expression (so we can't use htons() or htonl()). We can do this more cleanly by using a single __BYTE_ORDER ifdef to define htons_constant() and htonl_constant() macros, then using those in all the other places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:42 +01:00
David Gibson	6d011c1faa	tap, tcp: Move tap send path to tap.c The functions which do the final steps of sending TCP packets on through the tap interface - tcp_l2_buf_flush*() - no longer have anything that's actually specific to TCP in them, other than comments and names. Move them all to tap.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:40 +01:00
David Gibson	e21ee41ac3	tcp: Combine two parts of pasta tap send path together tcp_l2_buf_flush() open codes the loop across each frame in a group, but but calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap device. Combine these two pasta-specific operations into tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:39 +01:00
David Gibson	a79e774770	tcp: Improve interface to tcp_l2_buf_flush() Currently this takes a msghdr, but the only thing we actually care about in there is the io vector. Make it take an io vector directly. We also have a weird side effect of zeroing @buf_used. Just pass this by value and zero it in the caller instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:36 +01:00
David Gibson	f1d2d35416	tcp: Don't compute total bytes in a message until we need it tcp[46]_l2_buf_bytes keep track of the total number of bytes we have queued to send to the tap interface. tcp_l2_buf_flush_passt() uses this to determine if sendmsg() has sent all the data we requested, or whether we need to resend a trailing portion. However, the logic for finding where we're up to in the case of a short sendmsg() can equally well tell whether we've had one at all, without knowing the total number in advance. This does require an extra loop after each sendmsg(), but it's doing simple arithmetic on values we've already been accessing, and it leads to overall simpler code. tcp[46]_l2_flags_buf_bytes were being calculated, but never used for anything, so simply remove them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:33 +01:00
David Gibson	99f0be3cad	tcp: Combine two parts of passt tap send path together tcp_l2_buf_flush() open codes the "primary" send of message to the passt tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a short send. Combine these two passt-specific operations into tcp_l2_buf_flush_passt() which is a little cleaner and will enable furrther cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:30 +01:00
David Gibson	d3089eb0ea	pcap: Replace pcapm() with pcap_multiple() pcapm() captures multiple frames from a msghdr, however the only thing it cares about in the msghdr is the list of buffers, where it assumes there is one frame to capture per buffer. That's what we want for its single caller but it's not the only obvious choice here (one frame per msghdr would arguably make more sense in isolation). In addition pcapm() has logic that only makes sense in the context of the passt specific path its called from: it skips the first 4 bytes of each buffer, because those have the qemu vnet_len rather than the frame proper. Make this clearer by replacing pcapm() with pcap_multiple() which more explicitly takes one struct iovec per frame, and parameterizes how much of each buffer to skip (i.e. the offset of the frame within the buffer). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:27 +01:00
David Gibson	cb3c1ce307	pcap: Introduce pcap_frame() helper pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing to the capture file. The purpose of pcapm() and pcapmm() not calling pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid printing errors for every packet in a batch if there's a problem. We can accomplish that while still sharing code by adding a new helper which takes the packet timestamp as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:48:51 +01:00
David Gibson	b93d025d50	udp: Don't use separate sockets to listen for spliced packets Currently, when ports are forwarded inbound in pasta mode, we open two sockets for incoming traffic: one listens on the public IP address and will forward packets to the tuntap interface. The other listens on localhost and forwards via "splicing" (resending directly via sockets in the ns). Now that we've improved the logic about whether we "splice" any individual packet, we don't need this. Instead we can have a single socket bound to 0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal with each packet as appropriate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:12 +01:00
David Gibson	8d503e825f	udp: Decide whether to "splice" per datagram rather than per socket Currently we have special sockets for receiving datagrams from locahost which can use the optimized "splice" path rather than going across the tap interface. We want to loosen this so that sockets can receive sockets that will be forwarded by both the spliced and non-spliced paths. To do this, we alter the meaning of the @splice bit in the reference to mean that packets receieved on this socket can be spliced, not that they will be spliced. They'll only actually be spliced if they come from 127.0.0.1 or ::1. We can't (for now) remove the splice bit entirely, unlike with TCP. Our gateway mapping means that if the ns initiates communication to the gw address, we'll translate that to target 127.0.0.1 on the host side. Reply packets will therefore have source address 127.0.0.1 when received on the host, but these need to go via the tap path where that will be translated back to the gateway address. We need the @splice bit to distinguish that case from packets going from localhost to a port mapped explicitly with -u which should be spliced. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:09 +01:00
David Gibson	8a10f23720	udp: Unify udp_sock_handler_splice() with udp_sock_handler() These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one. This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:06 +01:00
David Gibson	f1ed8dbfa7	udp: Pre-populate msg_names with local address udp_splice_namebuf is now used only for spliced sending, and so it is only ever populated with the localhost address, either IPv4 or IPv6. So, replace the awkward initialization in udp_sock_handler_splice() with statically initialized versions for IPv4 and IPv6. We then just need to update the port number in udp_sock_handler_splice(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:03 +01:00
David Gibson	c9e193b5ae	udp: Don't handle tap receive batch size calculation within a #define UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a batch for sizing our buffers and control structures. The subtly different UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to receive at once for tap packets in the current configuration. This depends on the mode, meaning that the macro has a non-obvious dependency on the usual 'c' context variable being available. We only use it in one place, so it makes more sense to open code this. Add an explanatory comment while we're there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:07:01 +01:00
David Gibson	4eb54fd2e7	udp: Split receive from preparation and send in udp_sock_handler() The receive part of udp_sock_handler() and udp_sock_handler_splice() is now almost identical. In preparation for merging that, split the receive part of udp_sock_handler() from the part preparing and sending the frames for sending on the tap interface. The latter goes into a new udp_tap_send() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:58 +01:00
David Gibson	09c00f1d2a	udp: Split sending to passt tap interface into separate function The last part of udp_sock_handler() does the actual sending of frames to the tap interface. For pasta that's just a call to udp_tap_send_pasta() but for passt, it's moderately complex and open coded. For symmetry, move the passt send path into its own function, udp_tap_send_passt(). This will make it easier to abstract the tap interface in future (e.g. when we want to add vhost-user). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:55 +01:00
David Gibson	a6e919a951	udp: Move sending pasta tap frames to the end of udp_sock_handler() udp_sock_handler() has a surprising difference in flow between pasta and passt mode: For pasta we send each frame to the tap interface as we prepare it. For passt, though, we prepare all the frames, then send them with a single sendmmsg(). Alter the pasta path to also prepare all the frames, then send them at the end. We already have a suitable data structure for the passt case. This will make it easier to abstract out the tap backend difference in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-13 01:06:51 +01:00
Stefano Brivio	c196953f1e	test/perf/pasta_tcp: Add host to namespace cases for traffic via tap Similarly to UDP cases, these were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in connections being routed via the tap interface. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-01-05 15:08:11 +01:00
Stefano Brivio	37f82ccd9f	tcp: Explicitly check option length field values in tcp_opt_get() Reported by Coverity (CWE-606, Untrusted loop bound), and actually harmless because we'll exit the option-scanning loop if the remaining length is not enough for a new option, instead of reading past the header. In any case, it looks like a good idea to explicitly check for reasonable values of option lengths. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-01-05 15:08:07 +01:00
Stefano Brivio	08c01f5b4e	test/perf/pasta_udp: Add host to namespace cases for traffic via tap These were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in traffic being routed via the tap interface (as opposed to the loopback interface). We now clarified that's actually the case. Use same values and thresholds as the tests for loopback traffic, as throughput figures currently indicate there isn't much difference. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-01-05 15:07:54 +01:00
David Gibson	310bdbdcf4	udp: Factor out control structure management from udp_sock_fill_data_v[46] The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP and other headers we'll need to forward data onto the tap interface. In addition they update the control structures (iovec and mmsghdr) we'll need to send the messages, and in the case of pasta actually sends it. This leads the control structure management and the send itself awkwardly split between udp_sock_fill_data_v[46]() and their caller udp_sock_handler(). In addition, this tail part of udp_sock_fill_datav[46] is essentially common between the IPv4 and IPv6 versions, apart from which control array we're working on. Clean this up by reducing these functions to just construct the headers and renaming them to udp_update_hdr[46]() accordingly. The control structure updates are now all in the caller, and common for IPv4 and IPv6. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:22 +01:00
David Gibson	4b2d227c86	udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode Currently, we always populate udp[46]_l2_iov_tap[].iov_base with the very start of the header buffers, including space for the qemu vnet_len tag suitable for passt mode. That's ok because we don't actually use these iovecs for pasta mode. However, we do know the mode in udp_sock[46]_iov_init() so adjust these to the beginning of the headers we'll actually need for the mode: including the vnet_len tag for passt, but excluding it for pasta. This allows a slightly nicer way to locate the right buffer to send in the pasta case, and will allow some additional cleanups later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:19 +01:00
David Gibson	d533807027	udp: Better factor IPv4 and IPv6 paths in udp_sock_handler() Apart from which mh array they're operating on the recvmmsg() calls in udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are some of the control structure updates. By using some local variables to refer to the IP version specific control arrays, make some more logic common between the IPv4 and IPv6 paths. As well as slightly reducing the code size, this makes it less likely that we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we did in a recently fixed bug. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:16 +01:00
David Gibson	6af7ee74cf	udp: Fix incorrect use of IPv6 mh buffers in IPv4 path udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path. In fact this is harmless because this assignment is redundant (the 0th entry msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and won't change). There is also an incorrect usage of udp6_l2_mh_tap[] in udp_sock_fill_data_v4. This one can cause real problems, because we'll use stale iov_len values if we send multiple messages to the qemu socket. Most of the time that will be relatively harmless - we're likely to either drop UDP packets, or send duplicates. However, if the stale iov_len we use ends up referencing an uninitialized buffer we could desynchronize the qemu stream socket. Correct both these bugs. The UDP6 path appears to be correct, but it does have some comments that incorrectly reference the IPv4 versions, so fix those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:07 +01:00
David Gibson	34764ea4f3	udp: Correct splice forwarding when receiving from multiple sources udp_sock_handler_splice() reads a whole batch of datagrams at once with recvmmsg(). It then forwards them all via a single socket on the other side, based on the source port. However, it's entirely possible that the datagrams in the set have different source ports, and thus ought to be forwarded via different sockets on the destination side. In fact this situation arises with the iperf -P4 throughput tests in our own test suite. AFAICT we only get away with this because iperf3 is strictly one way and doesn't send reply packets which would be misdirected because of the incorrect source ports. Alter udp_sock_handler_splice() to split the packets it receives into batches with the same source address and send each batch with a separate sendmmsg(). For now we only look for already contiguous batches, which means that if there are multiple active flows interleaved this is likely to degenerate to batches of size 1. For now this is the simplest way to correct the behaviour and we can try to optimize later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:03 +01:00
David Gibson	2dec914209	udp: Split send half of udp_sock_handler_splice() from the receive half Move the part of udp_sock_handler_splice() concerned with sending out the datagrams into a new udp_splice_sendfrom() helper. This will make later cleanups easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:42:00 +01:00
David Gibson	fc7f91e709	udp: Unify buffers for tap and splice paths We maintain a set of buffers for UDP packets to be forwarded via the tap interface in udp[46]_l2_buf. We then have a separate set of buffers for packets to be "spliced" in udp_splice_buf[]. However, we only use one of these at a time, so we can share the buffer space. For the receiving splice packets we can not only re-use the data buffers but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures. For sending the splice packets we keep the same data buffers, but we need specific control structures. We create udp[46]_iov_splice - we can't reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing spliced packets, but the tap path expects iov_len to remain the same (it only uses it for receive). Likewise we create udp[46]_mh_splice with the mmsghdr structures for sending spliced packets. As well as needing to reference different iovs, these need to all reference udp_splice_namebuf instead of individual msg_name fields for each slot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:56 +01:00
David Gibson	c52ca4aecf	udp: Add helper to extract port from a sockaddr_in or sockaddr_in6 udp_sock_handler_splice() has a somewhat clunky if to extract the port from a socket address which could be either IPv4 or IPv6. Future changes are going to make this even more clunky, so introduce a helper function to do this extraction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-12-06 07:41:53 +01:00

... 10 11 12 13 14 ...

1487 commits