passt

Author	SHA1	Message	Date
David Gibson	68d1b0a152	tcp: Simplify packet length calculation when preparing headers tcp_fill_headers[46]() compute the L3 packet length from the L4 packet length, then their caller tcp_l2_buf_fill_headers() converts it back to the L4 packet length. We can just use the L4 length throughout. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:26 +02:00
David Gibson	5566386f5f	treewide: Standardise variable names for various packet lengths At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:23 +02:00
David Gibson	9e22c53aa9	checksum: Make csum_ip4_header() take a host endian length csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:21 +02:00
Laurent Vivier	95601237ef	tcp: Replace TCP buffer structure by an iovec array To be able to provide pointers to TCP headers and IP headers without worrying about alignment in the structure, split the structure into several arrays and point to each part of the frame using an iovec array. Using iovec also allows us to simply ignore the first entry when the vnet length header is not needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-04-19 11:21:09 +02:00
David Gibson	4988e2b406	tcp: Unconditionally force ACK for all !SYN, !RST packets Currently we set ACK on flags packets only when the acknowledged byte pointer has advanced, or we hadn't previously set a window. This means in particular that we can send a window update with no ACK flag, which doesn't appear to be correct. RFC 9293 requires a receiver to ignore such a packet [0], and indeed it appears that every non-SYN, non-RST packet should have the ACK flag. The reason for the existing logic, rather than always forcing an ACK seems to be to avoid having the packet mistaken as a duplicate ACK which might trigger a fast retransmit. However, earlier tests in the function mean we won't reach here if we don't have either an advance in the ack pointer - which will already set the ACK flag, or a window update - which shouldn't trigger a fast retransmit. [0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1 Link: https://github.com/containers/podman/issues/22146 Link: https://bugs.passt.top/show_bug.cgi?id=84 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-26 09:52:04 +01:00
David Gibson	5894a245b9	tcp: Never automatically add the ACK flag to RST packets tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets. This doesn't make sense for RST packets, where plain RST and RST+ACK have somewhat different meanings. AIUI, RST+ACK indicates an abrupt end to a connection, but acknowledges data already sent. Plain RST indicates an abort, when one end receives a packet that doesn't seem to make sense in the context of what it knows about the connection. All of the cases where we send RSTs are the second, so we don't want an ACK flag, but we currently could add one anyway. Change that, so we won't add an ACK to an RST unless the caller explicitly requests it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-26 09:51:58 +01:00
David Gibson	16c2d8da0d	tcp: Rearrange logic for setting ACK flag in tcp_send_flag() We have different paths for controlling the ACK flag for the SYN and !SYN paths. This amounts to sometimes forcing on the ACK flag in the !SYN path regardless of options. We can rearrange things to explicitly be that which will make things neater for some future changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-26 09:51:55 +01:00
David Gibson	99355e25b9	tcp: Split handling of DUP_ACK from ACK The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the setting of the ACK flag in the packet, even if we otherwise wouldn't. Secondly, it causes a duplicate of the flags packet to be sent immediately after the first. Setting the ACK flag to tcp_send_flag() also has the first effect, so instead of having DUP_ACK also do that, pass both flags when we need both operations. This slightly simplifies the logic of tcp_send_flag() in a way that makes some future changes easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-26 09:51:41 +01:00
David Gibson	d3eb0d7b59	tap: Rename tap_iov_{base,len} These two functions are typically used to calculate values to go into the iov_base and iov_len fields of a struct iovec. They don't have to be used for that, though. Rename them in terms of what they actually do: calculate the base address and total length of the complete frame, including both L2 and tap specific headers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-14 16:57:43 +01:00
David Gibson	2d0e0084b6	tap: Extend tap_send_frames() to allow multi-buffer frames tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Improve comment to rembufs calculation] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-14 16:57:28 +01:00
Laurent Vivier	bb11d15495	tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6() Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function tcp_fill_header(). Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to tcp_fill_headers4() and tcp_fill_headers6() Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-10-lvivier@redhat.com> [dwg: Correct commit message with new function names] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-06 08:03:52 +01:00
Laurent Vivier	6b22e10a26	tap: make tap_update_mac() generic Use ethhdr rather than tap_hdr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-9-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-06 08:03:49 +01:00
Laurent Vivier	7df624e79a	checksum: introduce functions to compute the header part checksum for TCP/UDP The TCP and UDP checksums are computed using the data in the TCP/UDP payload but also some informations in the IP header (protocol, length, source and destination addresses). We add two functions, proto_ipv4_header_psum() and proto_ipv6_header_psum(), to compute the checksum of the IP header part. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-8-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-06 08:03:47 +01:00
Laurent Vivier	feb4900c25	checksum: use csum_ip4_header() in udp.c and tcp.c We can find the same function to compute the IPv4 header checksum in tcp.c, udp.c and tap.c Use the function defined for tap.c, csum_ip4_header(), but with the code used in tcp.c and udp.c as it doesn't need a fully initialiazed IPv4 header, only protocol, tot_len, saddr and daddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-7-lvivier@redhat.com> [dwg: Fix weird cppcheck regression; it appears to be a problem in pre-existing code, but somehow this patch is exposing it] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-06 08:03:44 +01:00
Laurent Vivier	324bd46782	util: move IP stuff from util.[ch] to ip.[ch] Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-06 08:03:38 +01:00
David Gibson	3b9098aa49	fwd: Rename port_fwd.[ch] and their contents Currently port_fwd.[ch] contains helpers related to port forwarding, particular automatic port forwarding. We're planning to allow much more flexible sorts of forwarding, including both port translation and NAT based on the flow table. This will subsume the existing port forwarding logic, so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names within. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:48:27 +01:00
David Gibson	3b59b9748a	tcp: Validate TCP endpoint addresses TCP connections should typically not have wildcard addresses (0.0.0.0 or ::) nor a zero port number for either endpoint. It's not entirely clear (at least to me) if it's strictly against the RFCs to do so, but at any rate the socket interfaces often treat those values specially[1], so it's not really possible to manipulate such connections. Likewise they should not have broadcast or multicast addresses for either endpoint. However, nothing prevents a guest from creating a SYN packet with such values, and it's not entirely clear what the effect on passt would be. To ensure sane behaviour, explicitly check for this case and drop such packets, logging a debug warning (we don't want a higher level, because that would allow a guest to spam the logs). We never expect such an address on an accept()ed socket either, but just in case, check for it as well. [1] Depending on context as "unknown", "match any" or "kernel, pick something for me" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:48:17 +01:00
David Gibson	dc9a5d71e9	tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler() tcp_listen_handler() uses the epoll reference for the listening socket it handles, and also passes on one variant of it to tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter two functions only need a couple of specific fields from the reference. Pass those specific values instead of the whole reference, which localises the handling of the listening (as opposed to accepted) socket and its reference entirely within tcp_listen_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:48:15 +01:00
David Gibson	ee677e0a42	tcp_splice: Improve logic deciding when to splice This makes several tweaks to improve the logic which decides whether we're able to use the splice method for a new connection. * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we check for pasta mode within it, better localising the checks. * Previously if we got a connection from a non-loopback address we'd always fall back to the "tap" path, even if the connection was on a socket in the namespace. If we did get a non-loopback address on a namespace socket, something has gone wrong and the "tap" path certainly won't be able to handle it. Report the error and close, rather than passing it along to tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:48:13 +01:00
David Gibson	0f938c3b9a	flow: Clarify flow entry life cycle, introduce uniform logging Our allocation scheme for flow entries means there are some non-obvious constraints on when what things can be done with an entry. Add a big doc comment explaining the life cycle. In addition, make a FLOW_START() macro to mark one of the important transitions. This encourages correct usage, by making it natural to only access the flow type specific structure after calling it. It also logs that a new flow has been created, which is useful for debugging. We also add logging when a flow's lifecycle ends. This doesn't need a new helper, because it can only happen either from flow_alloc_cancel() or from the flow deferred handler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:48:01 +01:00
David Gibson	bb9bf0bb8f	tcp, udp: Don't precompute port remappings in epoll references The epoll references for both TCP listening sockets and UDP sockets includes a port number. This gives the destination port that traffic to that socket will be sent to on the other side. That will usually be the same as the socket's bound port, but might not if the -t, -u, -T or -U options are given with different original and forwarded port numbers. As we move towards a more flexible forwarding model for passt, it's going to become possible for that destination port to vary depending on more things (for example the source or destination address). So, it will no longer make sense to have a fixed value for a listening socket. Change to simpler semantics where this field in the reference gives the bound port of the socket. We apply the translations to the correct destination port later on, when we're actually forwarding. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:47:40 +01:00
David Gibson	f6e6e8ad40	inany: Introduce union sockaddr_inany There are a number of places where we want to handle either a sockaddr_in or a sockaddr_in6. In some of those we use a void *, which works ok and matches some standard library interfaces, but doesn't give a signature level hint that we're dealing with only sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of socket address. Other places we use a sockaddr_storage, which also works, but has the same problem in addition to allocating more on the stack than we need to. Introduce union sockaddr_inany to explictly handle this case: it has variants for sockaddr_in and sockaddr_in6. Use it in a number of places where it's easy to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:47:31 +01:00
David Gibson	0cf6b2d89d	inany: Provide more conveniently typed constants for special addresses Our inany_addr type is used in some places to represent either IPv4 or IPv6 addresses, and we plan to use it more widely. We don't yet provide constants of this type for special addresses (loopback and "any"). Add some of these, both the IPv4 and IPv6 variants of those addresses, but typed as union inany_addr. To avoid actually adding more things to .data we can use some macros and casting to overlay the IPv6 versions of these with the standard library's in6addr_loopback and in6addr_any. For the IPv4 versions we need to create new constant globals. For complicated historical reasons, the standard library doesn't provide constants for IPv4 loopback and any addresses as struct in_addr. It just has macros of type in_addr_t == uint32_t, which has some gotchas w.r.t. endianness. We can use some more macros to address this lack, using macros to effectively create these IPv4 constants as pieces of the inany constants above. We use this last to avoid some awkward temporary variables just used to get an address of an IPv4 loopback address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-29 09:47:28 +01:00
David Gibson	e5e6f29459	tcp: Don't store errnos in socket pool If tcp_sock_refill_pool() gets an error opening new sockets, it stores the negative errno of that error in the socket pool. This isn't especially useful: * It's inconsistent with the initial state of the pool (all -1) * It's inconsistent with the state of an entry that was valid and was then consumed (also -1) * By the time we did anything with this error code, it's now far removed from the situation in which the error occurred, making it difficult to report usefully We now have error reporting closer to when failures happen on the refill paths, so just leave a pool slot we can't fill as -1. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:53:30 +01:00
David Gibson	fe27ebce5c	tcp, tcp_splice: Helpers for getting sockets from the pools We maintain pools of ready-to-connect sockets in both the original and (for pasta) guest namespace to reduce latency when starting new TCP connections. If we exhaust those pools we have to take a higher latency path to get a new socket. Currently we open-code that fallback in the places we need it. To improve clarity encapsulate that into helper functions. While we're at it, give those helpers clearer error reporting. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:52:46 +01:00
David Gibson	fbe81decbd	tcp, tcp_splice: Issue warnings if unable to refill socket pool Currently if tcp_sock_refill_pool() is unable to fill all the slots in the pool, it will silently exit. This might lead to a later attempt to get fds from the pool to fail at which point it will be harder to tell what originally went wrong. Instead add warnings if we're unable to refill any of the socket pools when requested. We have tcp_sock_refill_pool() return an error and report it in the callers, because those callers have more context allowing for a more useful message. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:52:44 +01:00
David Gibson	554b3aefe7	tcp: Stop on first error when refilling socket pools Currently if we get an error opening a new socket while refilling a socket pool, we carry on to the next slot and try again. This isn't very useful, since by far the most likely cause of an error is some sort of resource exhaustion. Trying again will probably just hit the same error, and maybe even make things worse. So, instead stop on the first error while refilling the pool, making do with however many sockets we managed to open before the error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:52:17 +01:00
David Gibson	af303fdbff	tcp: Don't stop refilling socket pool if we find a filled entry Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the pool with a valid fd. This appears to makes sense: we always use fds from the front of the pool, so if we find a filled one, the rest of the pool should be filled as well. However, that's not quite correct. If a previous refill hit errors trying to open new sockets, it could leave gaps between blocks of valid fds. We're going to add some changes that could make that more likely. So, for robustness, instead skip over the filled entry but still try to refill the rest of the array. We expect simply iterating over the pool to be of small cost compared to even a single system call, so this shouldn't have much impact. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:52:14 +01:00
David Gibson	4e08d9b9c6	treewide: Use sa_family_t for address family variables Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 12:52:02 +01:00
David Gibson	78901eeae3	tcp: Fix subtle bug in fast re-transmit path When a duplicate ack from the tap side triggers a fast re-transmit, we set both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of the duplicate ack. Setting seq_to_tap is correct: this is what triggers the retransmit from this point onwards. Setting seq_ack_from_tap is not correct, though. In most cases setting seq_ack_from_tap will be redundant but harmless: it will have already been updated to the same value by tcp_update_seqack_from_tap() a few lines above. However that call can be skipped if tcp_sock_consume() fails, which is rare but possible. In that case this update will cause problems. We use seq_ack_from_tap to track two logically distinct things: how much of the stream has been acked by the guest, and how much of the stream from the socket has been read and discarded (as opposed to MSG_PEEKed). We attempt to keep those values the same, because we discard data exactly when it is acked by the guest. However tcp_sock_consume() failing means we weren't able to disard the acked data. To handle that case, we skip the usual update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get one which supersedes it soon enough. Setting seq_ack_from_tap in the fast retransmit path, however, means we now really will have the read/discard point in the stream out of sync with seq_ack_from_tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-11 18:30:01 +01:00
David Gibson	8981a720aa	flow: Avoid moving flow entries to compact table Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:37 +01:00
David Gibson	9c0881d4f6	flow: Enforce that freeing of closed flows must happen in deferred handlers Currently, flows are only evern finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:33 +01:00
David Gibson	4a849e9526	flow: Abstract allocation of new flows with helper function Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:31 +01:00
David Gibson	fb7c00169d	flow: Move flow_count from context structure to a global In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:29 +01:00
David Gibson	02e092b4fe	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:25 +01:00
David Gibson	36dfa8b8fb	flow, tcp: Add handling for per-flow timers tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:19 +01:00
David Gibson	b43e4483ed	flow, tcp: Add flow-centric dispatch for deferred flow handling tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:17 +01:00
David Gibson	c97bb527d6	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:15 +01:00
David Gibson	eebca1115f	tcp, tcp_splice: Remove redundant handling from tcp_timer() tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:13 +01:00
David Gibson	8563e7c870	treewide: Standardise on 'now' for current timestamp variables In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:10 +01:00
David Gibson	17bbab1c97	flow: Make flow_table.h #include the protocol specific headers it needs flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:34:55 +01:00
David Gibson	a179ca6707	treewide: Make a bunch of pointer variables pointers to const Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-16 21:49:27 +01:00
Laurent Vivier	6a348cb435	tcp: make tcp_sock_set_bufsize() static (again) `e5eefe7743` ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but `6ccab72d9b` has removed its use in tcp_splice.c, so we can set it static again. Fixes: `6ccab72d9b` ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:31:25 +01:00
David Gibson	546332786c	treewide: Use IN4ADDR_LOOPBACK_INIT more widely We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
David Gibson	1f2aab8aaa	tcp: Fix address type for tcp_sock_init_af() This takes a struct in_addr * (i.e. an IPv4 address), although it's explicitly supposed to handle IPv6 as well. Both its caller and sock_l4() which it calls use a void * for the address, which can be either an in_addr or an in6_addr. We get away with this, because we don't do anything with the pointer other than transfer it from the caller to sock_l4(), but it's misleading. And quite possibly technically UB, because C is like that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
David Gibson	5d5bb8c150	tcp: Don't account for hash table size in tcp_hash() Currently tcp_hash() returns the hash bucket for a value, that is the hash modulo the size of the hash table. Usually it's a bit more flexible to have hash functions return a "raw" hash value and perform the modulus in the callers. That allows the same hash function to be used for multiple tables of different sizes, or to re-use the hash for other purposes. We don't do anything like that with tcp_hash() at present, but we have some plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash() return raw hash values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
David Gibson	64e5459ba6	tcp: Implement hash table with indices rather than pointers We implement our hash table with pointers to the entry for each bucket (or NULL). However, the entries are always allocated within the flow table, meaning that a flow index will suffice, halving the size of the hash table. For TCP, just a flow index would be enough, but future uses will want to expand the hash table to cover indexing either side of a flow, so use a flow_sidx_t as the type for each hash bucket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
David Gibson	5913f26415	tcp: Switch hash table to linear probing instead of chaining Currently we deal with hash collisions by letting a hash bucket contain multiple entries, forming a linked list using an index in the connection structure. That's a pretty standard and simple approach, but in our case we can use an even simpler one: linear probing. Here if a hash bucket is occupied we just move onto the next one until we find a feww one. This slightly simplifies lookup and more importantly saves some precious bytes in the connection structure by removing the need for a link. It does require some additional complexity for hash removal. This approach can perform poorly with hash table load is high. However, we already size our hash table of pointers larger than the connection table, which puts an upper bound on the load. It's relatively cheap to decrease that bound if we find we need to. I adapted the linear probing operations from Knuth's The Art of Computer Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm R in Section 6.4. Note that there is an error in Algorithm R as printed, see errata at [0]. [0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
David Gibson	89fcb563fc	tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler() tcp_hash_lookup() expects the port numbers in host order, but the TCP header, of course, has them in network order, so we need to switch them. However we call htons() (host to network) instead of ntohs() (network to host). This works because those do the same thing in practice (they only wouldn't on very strange theoretical platforms which are neither big nor little endian). But, having this the "wrong" way around is misleading, so switch it around. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00
Stefano Brivio	fd29d62a9d	tcp: Cast timeval fields to unsigned long long for printing On x32, glibc defines time_t and suseconds_t (the latter, also known as __syscall_slong_t) as unsigned long long, whereas "everywhere else", including x86_64 and i686, those are unsigned long. See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for all the gory details. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-27 19:29:45 +01:00

1 2 3 4 5 ...

327 commits