passt

Author	SHA1	Message	Date
Stefano Brivio	71869e2912	passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE For some reason, this is only reported by clang-tidy 19.1.2 on Alpine: /home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors] 314 \| nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL); \| ^ We do have a suppression, but not on the line preceding it, because we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND for the clang-tidy suppression. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:52 +01:00
Stefano Brivio	099ace64ce	treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions For clock_gettime(), we shouldn't ignore errors if they happen at initialisation phase, because something is seriously wrong and it's not helpful if we proceed as if nothing happened. As we're up and running, though, it's probably better to report the error and use a stale value than to terminate altogether. Make sure we use a zero value if we don't have a stale one somewhere. For timerfd_gettime() and timerfd_settime() failures, just report an error, there isn't much else we can do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-30 12:37:31 +01:00
Stefano Brivio	2aea1da143	treewide: Allow additional system calls for i386/i686 I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-08-21 12:00:43 +02:00
David Gibson	905ecd2b0b	treewide: Rename MAC address fields for clarity c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-21 11:59:54 +02:00
Stefano Brivio	09603cab28	passt, util: Close any open file that the parent might have leaked If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>	2024-08-08 21:31:25 +02:00
Stefano Brivio	ee36266a55	log, passt: Keep printing to stderr when passt is running in foreground There are two cases where we want to stop printing to stderr: if it's closed, and if pasta spawned a shell (and --debug wasn't given). But if passt is running in foreground, we currently stop to report any message, even error messages, once we're ready, as reported by Laurent, because we set the log_runtime flag, which we use to indicate we're ready, regardless of whether we're running in foreground or not. Turn that flag (back) to log_stderr, and set it only when we really want to stop printing to stderr. Reported-by: Laurent Vivier <lvivier@redhat.com> Fixes: `afd9cdc9bb` ("log, passt: Always print to stderr before initialisation is complete") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-06 15:03:48 +02:00
Stefano Brivio	77c092ee5e	log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME We report relative timestamps in logs, so we want to avoid jumps in the system time. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:36 +02:00
Stefano Brivio	e5c37ba0f4	log: Initialise timestamp for relative log time also if we use a log file ...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-07-26 13:47:34 +02:00
David Gibson	882599e180	udp: Rename UDP listening sockets EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:34:01 +02:00
David Gibson	e0647ad80c	udp: Handle "spliced" datagrams with per-flow sockets When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-07-19 18:33:42 +02:00
Stefano Brivio	e7323e515a	conf, passt: Don't call __openlog() if a log file is used If a log file is configured, we would otherwise open a connection to the system logger (if any), print any message that we might have before we initialise the log file, and then keep that connection around for no particular reason. Call __openlog() as an alternative to the log file setup, instead. This way, we might skip printing some messages during the initialisation phase, but they're probably not really valuable to have in a system log, and we're going to print them to standard error anyway. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:46 +02:00
Stefano Brivio	dba7f0f5ce	treewide: Replace strerror() calls Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:44 +02:00
Stefano Brivio	92a22fef93	treewide: Replace perror() calls with calls to logging functions perror() prints directly to standard error, but in many cases standard error might be already closed, or we might want to skip logging, based on configuration. Our logging functions provide all that. While at it, make errors more descriptive, replacing some of the existing basic perror-style messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:43 +02:00
Stefano Brivio	afd9cdc9bb	log, passt: Always print to stderr before initialisation is complete After commit `15001b39ef` ("conf: set the log level much earlier"), we had a phase during initialisation when messages wouldn't be printed to standard error anymore. Commit `f67238aa86` ("passt, log: Call __openlog() earlier, log to stderr until we detach") fixed that, but only for the case where no log files are given. If a log file is configured, vlogmsg() will not call passt_vsyslog(), but during initialisation, LOG_PERROR is set, so to avoid duplicated prints (which would result from passt_vsyslog() printing to stderr), we don't call fprintf() from vlogmsg() either. This is getting a bit too complicated. Instead of abusing LOG_PERROR, define an internal logging flag that clearly represents that we're not done with the initialisation phase yet. If this flag is not set, make sure we always print to stderr, if the log mask matches. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:34 +02:00
Stefano Brivio	bca0fefa32	conf, passt: Make --stderr do nothing, and deprecate it The original behaviour of printing messages to standard error by default when running from a non-interactive terminal was introduced because the first KubeVirt integration draft used to start passt in foreground and get messages via standard error. For development purposes, the system logger was more convenient at that point, and passt was running from interactive terminals only if not started by the KubeVirt integration. This behaviour was introduced by `84a62b79a2` ("passt: Also log to stderr, don't fork to background if not interactive"). Later, I added command-line options in `1e49d194d0` ("passt, pasta: Introduce command-line options and port re-mapping") and accidentally reversed this condition, which wasn't a problem as --stderr could force printing to standard error anyway (and it was used by KubeVirt). Nowadays, the KubeVirt integration uses a log file (requested via libvirt configuration), and the same applies for Podman if one actually needs to look at runtime logs. There are no use cases left, as far as I know, where passt runs in foreground in non-interactive terminals. Seize the chance to reintroduce some sanity here. If we fork to background, standard error is closed, so --stderr is useless in that case. If we run in foreground, there's no harm in printing messages to standard error, and that accidentally became the default behaviour anyway, so --stderr is not needed in that case. It would be needed for non-interactive terminals, but there are no use cases, and if there were, let's log to standard error anyway: the user can always redirect standard error to /dev/null if needed. Before we're up and running, we need to print to standard error anyway if something happens, otherwise we can't report failure to start in any kind of usage, stand-alone or in integrations. So, make --stderr do nothing, and deprecate it. While at it, drop a left-over comment about --foreground being the default only for interactive terminals, because it's not the case anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:28 +02:00
Stefano Brivio	b74801645c	conf, passt: Don't try to log to stderr after we close it If we don't run in foreground, we close standard error as we daemonise, so it makes no sense to check if the controlling terminal is an interactive terminal or if --force-stderr was given, to decide if we want to log to standard error. Make --force-stderr depend on --foreground. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-06-21 15:32:15 +02:00
Laurent Vivier	0c335d751a	vhost-user: compare mode MODE_PASTA and not MODE_PASST As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:38 +02:00
Laurent Vivier	377b666dc9	udp: rename udp_sock_handler() to udp_buf_sock_handler() We are going to introduce a variant of the function to use vhost-user buffers rather than passt internal buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:34 +02:00
David Gibson	7094b91d10	Remove pointless macro parameters in CALL_PROTO_HANDLER The 'c' parameter is always passed exactly 'c'. The 'now' parameter is always passed exactly 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
Stefano Brivio	c9b2413465	conf, passt, tap: Open socket and PID files before switching UID/GID Otherwise, if the user runs us as root, and gives us paths that are only accessible by root, we'll fail to open them, which might in turn encourage users to change permissions or ownerships: definitely a bad idea in terms of security. Reported-by: Minxi Hou <mhou@redhat.com> Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:26 +02:00
Stefano Brivio	ba23b05545	passt, util: Move opening of PID file to its own function We won't call it from main() any longer: move it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:13 +02:00
Stefano Brivio	57d8aa8ffe	util: Rename write_pidfile() to pidfile_write() As I'm adding pidfile_open() in the next patch. The function comment didn't match, by the way. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:05 +02:00
Stefano Brivio	fcfb592adc	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initalisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-05-23 16:42:27 +02:00
lemmi	26c71db332	passt.c: explicitly include libgen.h for basename fixes implicit declaration warning on musl Signed-off-by: lemmi <lemmi@nerd2nerd.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-13 23:00:52 +02:00
Stefano Brivio	5d5208b67d	treewide: Compilers' name for armv6l and armv7l is "arm" When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the architecture name for, among others, seccomp.sh, I didn't realise that "armv6l" and "armv7l" are just Linux kernel names -- compilers just call that "arm". Fix the "syscalls" annotation we use to define seccomp profiles accordingly, otherwise pasta will be terminated on sigreturn() on armv6l and armv7l. Fixes: `213c397492` ("passt, pasta: Run-time selection of AVX2 build") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-04-11 17:34:04 +02:00
Stefano Brivio	f67238aa86	passt, log: Call __openlog() earlier, log to stderr until we detach Paul reports that, with commit `15001b39ef` ("conf: set the log level much earlier"), early messages aren't reported to standard error anymore. The reason is that, once the log mask is changed from LOG_EARLY, we don't force logging to stderr, and this mechanism was abused to have early errors on stderr. Now that we drop LOG_EARLY earlier on, this doesn't work anymore. Call __openlog() as soon as we know the mode we're running as, using LOG_PERROR. Then, once we detach, if we're not running from an interactive terminal and logging to standard error is not forced, drop LOG_PERROR from the options. While at it, check if the standard error descriptor refers to a terminal, instead of checking standard output: if the user redirects standard output to /dev/null, they might still want to see messages from standard error. Further, make sure we don't print messages to standard error reporting that we couldn't log to the system logger, if we didn't open a connection yet. That's expected. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: `15001b39ef` ("conf: set the log level much earlier") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-03-14 08:19:36 +01:00
David Gibson	4779dfe12f	icmp: Use 'flowside' epoll references for ping sockets Currently ping sockets use a custom epoll reference type which includes the ICMP id. However, now that we have entries in the flow table for ping flows, finding that is sufficient to get everything else we want, including the id. Therefore remove the icmp_epoll_ref type and use the general 'flowside' field for ping sockets. Having done this we no longer need separate EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 reference types, because we can easily determine which case we have from the flow type. Merge both types into EPOLL_TYPE_PING. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-12 01:49:05 +01:00
David Gibson	3af5e9fdba	icmp: Store ping socket information in flow table Currently icmp_id_map[][] stores information about ping sockets in a bespoke structure. Move the same information into new types of flow in the flow table. To match that change, replace the existing ICMP timer with a flow-based timer for expiring ping sockets. This has the advantage that we only need to scan the active flows, not all possible ids. We convert icmp_id_map[][] to point to the flow table entries, rather than containing its own information. We do still use that array for locating the right ping flows, rather than using a "flow native" form of lookup for the time being. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Update id_sock description in comment to icmp_ping_new()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-03-12 01:34:45 +01:00
Paul Holzinger	15001b39ef	conf: set the log level much earlier --quiet is supposed to silence the "No routable interface" message but it does not work because the log level was set long after conf_ip4/6() was called which means it uses the default level which logs everything. To address this move the log level logic directly after the option parsing in conf(). Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 14:08:33 +01:00
Paul Holzinger	b08716551a	passt: make --quiet set the log level to warning Based on the man page and help output --quiet hides informational messages. This means that warnings should still be logged. This was discussed in[1]. [1] https://archives.passt.top/passt-dev/20240216114304.7234a83f@elisabeth/T/#m42652824644973674e84baf9e0bf1d0e88104450 Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-27 14:08:00 +01:00
Stefano Brivio	8f3f8e190c	pasta: Add fallback timer mechanism to check if namespace is gone We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-02-16 08:47:14 +01:00
Stefano Brivio	7ee4e17267	log: setlogmask(0) can actually result in a system call, don't use it Before commit `32d07f5e59` ("passt, pasta: Completely avoid dynamic memory allocation"), we didn't store the current log mask in a variable, and we fetched it using setlogmask(0) wherever needed. But after that commit, we can use our log_mask copy instead. And we should: with recent glibc versions, setlogmask(0) actually results in a system call, which causes a substantial overhead with high transfer rates: we use setlogmask(0) even to decide we don't want to print debug messages. Now that we rely on log_mask in early stages, before setlogmask() is called, we need to initialise that variable to the special LOG_EMERG mask value right away: define LOG_EARLY to make this clearer, and, while at it, group conditions in vlogmsg() into something more terse. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-02-14 01:10:11 +01:00
David Gibson	a325121759	icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler() Currently we have separate handlers for ICMP and ICMPv6 ping replies. Although there are a number of points of difference, with some creative refactoring we can combine these together sensibly. Although it doesn't save a vast amount of code, it does make it clearer that we're performing basically the same steps for each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:36:51 +01:00
David Gibson	8981a720aa	flow: Avoid moving flow entries to compact table Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:37 +01:00
David Gibson	02e092b4fe	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:25 +01:00
David Gibson	70121ca1ec	epoll: Better handling of number of epoll types As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:22 +01:00
David Gibson	36dfa8b8fb	flow, tcp: Add handling for per-flow timers tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:19 +01:00
David Gibson	b43e4483ed	flow, tcp: Add flow-centric dispatch for deferred flow handling tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-01-22 23:35:17 +01:00
David Gibson	e21b6d69b1	tcp: "TCP" hash secret doesn't need to be TCP specific The TCP state structure includes a 128-bit hash_secret which we use for SipHash calculations to mitigate attacks on the TCP hash table and initial sequence number. We have plans to use SipHash in places that aren't TCP related, and there's no particular reason they'd need their own secret. So move the hash_secret to the general context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-12-04 09:51:32 +01:00
David Gibson	955dd3251c	tcp, udp: Don't pre-fill IPv4 destination address in headers Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-22 12:15:33 +02:00
David Gibson	ae5f6c8e1b	epoll: Use different epoll types for passt and pasta tap fds Currently we have a single epoll event type for the "tap" fd, which could be either a handle on a /dev/net/tun device (pasta) or a connected Unix socket (passt). However for the two modes we call different handler functions. Simplify this a little by using different epoll types and dispatching directly to the correct handler function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:20 +02:00
David Gibson	eda4f1997e	epoll: Split listening Unix domain socket into its own type tap_handler() actually handles events on three different types of object: the /dev/tap character device (pasta), a connected Unix domain socket (passt) or a listening Unix domain socket (passt). The last, in particular, really has no handling in common with the others, so split it into its own epoll type and directly dispatch to the relevant handler from the top level. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:17 +02:00
David Gibson	485b5fb8f9	epoll: Split handling of listening TCP sockets into their own handler tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:15 +02:00
David Gibson	e6f81e5578	epoll: Split handling of TCP timerfds into its own handler function tcp_sock_handler() actually handles several different types of fd events. This includes timerfds that aren't sockets at all. The handling of these has essentially nothing in common with the other cases. So, give the TCP timers there own epoll_type value and dispatch directly to their handler. This also means we can remove the timer field from tcp_epoll_ref, the information it encoded is now implicit in the epoll_type value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:13 +02:00
David Gibson	8271a2ed57	epoll: Tiny cleanup to udp_sock_handler() Move the test for c->no_udp into the function itself, rather than in the dispatching switch statement to better localize the UDP specific logic, and make for greated consistency with other handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:10 +02:00
David Gibson	05f606ab0b	epoll: Split handling of ICMP and ICMPv6 sockets We have different epoll type values for ICMP and ICMPv6 sockets, but they both call the same handler function, icmp_sock_handler(). However that function does essentially nothing in common for the two cases. So, split it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them separately from the top level. While we're there remove some parameters that the function was never using anyway. Also move the test for c->no_icmp into the functions, so that all the logic specific to ICMP is within the handler, rather than in the top level dispatch code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:07 +02:00
David Gibson	d850caab66	epoll: Fold sock_handler into general switch on epoll event fd When we process events from epoll_wait(), we check for a number of special cases before calling sock_handler() which then dispatches based on the protocol type of the socket in the event. Fold these cases together into a single switch on the fd type recorded in the epoll data field. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:56 +02:00
David Gibson	6a6735ece4	epoll: Always use epoll_ref for the epoll data variable epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:53 +02:00
David Gibson	3401644453	epoll: Generalize epoll_ref to cover things other than sockets The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:51 +02:00
David Gibson	8218d99013	Use C11 anonymous members to make poll refs less verbose to use union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:17:57 +02:00

1 2 3 4

164 commits