Add this helper to format an inany into either IPv4 or IPv6 text
format as appropriate.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add helpers to determine if an inany is loopback, unspecified or
multicast, regardless of whether it's a "true" IPv6 address or an IPv4
address represented as v4-mapped.
Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we determine we have sent a partial frame in tap_send_frames_passt(),
we call tap_send_remainder() to send the remainder of it. The logic in
that function is very similar to that in the more general write_remainder()
except that it uses send() instead of write()/writev(). But we are dealing
specifically with the qemu socket here, which is a connected stream socket.
In that case write()s do the same thing as send() with the options we were
using, so we can just reuse write_remainder().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently pcap_frame() assumes that if write() doesn't return an error, it
has written everything we want. That's not necessarily true, because it
could return a short write. That's not likely to happen on a regular file,
but there's not a lot of reason not to be robust here; it's conceivable we
might want to direct the pcap fd at a named pipe or similar.
So, make pcap_frame() handle short frames by using the write_remainder()
helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Formatting fix, and avoid gcc warning in pcap_frame()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have several places where we want to write(2) a buffer or buffers and we
handle short write()s by retrying until everything is successfully written.
Add a helper for this in util.c.
This version has some differences from the typical write_all() function.
First, take an IO vector rather than a single buffer, because that will be
useful for some of our cases. Second, allow it to take an parameter to
skip the first n bytes of the given buffers. This will be useful for some
of the cases we want, and also falls out quite naturally from the
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor formatting fixes in write_remainder()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Update the low-level helper pcap_frame() to take a struct iovec and
offset within it, rather than an explicit pointer and length for the
frame. This moves the handling of an offset (to skip vnet_len) from
pcap_multiple() to pcap_frame().
This doesn't accomplish a great deal immediately, but will make
subsequent changes easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Several of the IOV functions in iov.c, and also tap_send_frames_passt()
needs to determine which buffer element a byte offset into an IO vector
lies in. Split this out into a helper function iov_skip_bytes().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce functions to copy to/from a buffer from/to an iovec array,
to compute data length in in bytes of an iovec and to copy memory from
an iovec to another.
iov_from_buf(), iov_to_buf(), iov_size(), iov_copy().
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240217150725.661467-2-lvivier@redhat.com>
[dwg: Small changes to suppress cppcheck warnings]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the configured output address is unspecified, we don't set the bind
address to it when creating a new socket in udp_tap_handler(). That sounds
sensible, but what we're leaving the bind address as is, exactly, the
unspecified address, so this test makes no difference. Remove it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When forwarding IPv4 packets in udp_tap_handler(), we incorrectly use an
IPv6 address test on our IPv4 address (which could cause an out of bounds
access), and possibly set our bind interface to the IPv6 interface based on
it. Adjust to correctly look at the IPv4 address and IPv4 interface.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Streamline the logic here slightly, by introducing a 'src' temporary for
brevity. We also transform the logic for setting/clearing PORT_LOOPBACK.
This makes udp_update_hdr4() more closely match the corresponding logic
from udp_update_udp6().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The udp_epoll_ref contains a field for the pif to which the socket belongs.
We fill this in for permanent sockets created with udp_sock_init() and for
spliced sockets, however, we omit it for ephemeral sockets created for
tap originated flows.
This is a bug, although we currently get away with it, because we don't
consult that field for such flows. Correctly fill it in.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If an incoming packet has a source address of 0.0.0.0 we translate that to
the gateway address. This doesn't really make sense, because we have no
way to do a reverse translation for reply packets.
Certain UDP protocols do use an unspecified source address in some
circumstances (e.g. DHCP). These generally either require no reply, a
multicast reply, or provide a suitable reply address by other means.
In none of those cases does translating it in passt/pasta make sense. The
best we can really do here is just leave it as is.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...instead of implying that by stating that there's no routable
interface for a given IP version. There might be interfaces with
non-default routes.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Don't run cppcheck to find out if the --check-level=exhaustive option
is available, unless we're actually going to run cppcheck later.
To avoid this, move this check under the cppcheck target, and
implement it in shell script instead of using Makefile directives,
because we can't easily implement conditionals in recipes.
Reported-by: Rahil Bhimjiani <me@rahil.website>
Link: https://bugs.gentoo.org/920795
Fixes: 8640d62af7 ("cppcheck: Use "exhaustive" level checking when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
--quiet is supposed to silence the "No routable interface" message but
it does not work because the log level was set long after conf_ip4/6()
was called which means it uses the default level which logs everything.
To address this move the log level logic directly after the option
parsing in conf().
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If tcp_sock_refill_pool() gets an error opening new sockets, it stores the
negative errno of that error in the socket pool. This isn't especially
useful:
* It's inconsistent with the initial state of the pool (all -1)
* It's inconsistent with the state of an entry that was valid and was
then consumed (also -1)
* By the time we did anything with this error code, it's now far removed
from the situation in which the error occurred, making it difficult to
report usefully
We now have error reporting closer to when failures happen on the refill
paths, so just leave a pool slot we can't fill as -1.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We maintain pools of ready-to-connect sockets in both the original and
(for pasta) guest namespace to reduce latency when starting new TCP
connections. If we exhaust those pools we have to take a higher
latency path to get a new socket.
Currently we open-code that fallback in the places we need it. To improve
clarity encapsulate that into helper functions. While we're at it, give
those helpers clearer error reporting.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently if tcp_sock_refill_pool() is unable to fill all the slots in the
pool, it will silently exit. This might lead to a later attempt to get
fds from the pool to fail at which point it will be harder to tell what
originally went wrong.
Instead add warnings if we're unable to refill any of the socket pools when
requested. We have tcp_sock_refill_pool() return an error and report it
in the callers, because those callers have more context allowing for a
more useful message.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently if we get an error opening a new socket while refilling a socket
pool, we carry on to the next slot and try again. This isn't very useful,
since by far the most likely cause of an error is some sort of resource
exhaustion. Trying again will probably just hit the same error, and maybe
even make things worse.
So, instead stop on the first error while refilling the pool, making do
with however many sockets we managed to open before the error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the
pool with a valid fd. This appears to makes sense: we always use fds from
the front of the pool, so if we find a filled one, the rest of the pool
should be filled as well.
However, that's not quite correct. If a previous refill hit errors trying
to open new sockets, it could leave gaps between blocks of valid fds. We're
going to add some changes that could make that more likely.
So, for robustness, instead skip over the filled entry but still try to
refill the rest of the array. We expect simply iterating over the pool to
be of small cost compared to even a single system call, so this shouldn't
have much impact.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The code in udp_invert_portmap() is written based on an incorrect
understanding of C's (arcane) integer promotion rules. We calculate
'(in_port_t)i + delta' expecting the result to be of type in_port_t (16
bits). However "small integer types" (those narrower than 'int') are
always promoted to int for expressions, meaning this calculation can
overrun the rdelta[] array.
Fix this, and use a new intermediate for the index, to make it very clear
what it's type is. We also change i to unsigned, to avoid any possible
confusion from mixing signed and unsigned types.
Link: https://bugs.passt.top/show_bug.cgi?id=80
Reported-by: Laurent Jacquot <jk@lutty.net>
Suggested-by: Laurent Jacquot <jk@lutty.net>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
All the values in this ASSERT() are known at compile time, so this can be
converted to a static_assert().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We watch network namespace entries to detect when we should quit
(unless --no-netns-quit is passed), and these might stored in a tmpfs
typically mounted at /run/user/UID or /var/run/user/UID, or found in
procfs at /proc/PID/ns/.
Currently, we try to use inotify for any possible location of those
entries, but inotify, of course, doesn't work on pseudo-filesystems
(see inotify(7)).
The man page reflects this: the description of --no-netns-quit
implies that we won't quit anyway if the namespace is not "bound to
the filesystem".
Well, we won't quit, but, since commit 9e0dbc8948 ("More
deterministic detection of whether argument is a PID, PATH or NAME"),
we try. And, indeed, this is harmless, as the caveat from that
commit message states.
Now, it turns out that Buildah, a tool to create container images,
sharing its codebase with Podman, passes a procfs entry to pasta, and
expects pasta to exit once the network namespace is not needed
anymore, that is, once the original container process, also spawned
by Buildah, terminates.
Get this to work by using the timer fallback mechanism if the
namespace name is passed as a path belonging to a pseudo-filesystem.
This is expected to be procfs, but I covered sysfs and devpts
pseudo-filesystems as well, because nothing actually prevents
creating this kind of directory structure and links there.
Note that fstatfs(), according to some versions of man pages, was
apparently "deprecated" by the LSB. My reasoning for using it is
essentially this:
https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u
...that is, there was no such thing as an LSB deprecation, and
anyway there's no other way to get the filesystem type.
Also note that, while it might sound more obvious to detect the
filesystem type using fstatfs() on the file descriptor itself
(c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
matter what path was given to pasta. If we use the parent directory,
we'll typically have either tmpfs or procfs reported.
If the target namespace is given as a PID, or as a PID-based procfs
entry, we don't risk races if this PID is recycled: our handle on
/proc/PID/ns will always refer to the original namespace associated
with that PID, and we don't re-open this entry from procfs to check
it.
There's, however, a remaining race possibility if the parent process
is not the one associated to the network namespace we operate on: in
that case, the parent might pass a procfs entry associated to a PID
that was recycled by the time we parse it. This can't happen if the
namespace PID matches the one of the parent, because we detach from
the controlling terminal after parsing the namespace reference.
To avoid this type of race, if desired, we could add the option for
the parent to pass a PID file descriptor, that the parent obtained
via pidfd_open(). This is beyond the scope of this change.
Update the man page to reflect that, even if the target network
namespace is passed as a procfs path or a PID, we'll now quit when
the procfs entry is gone.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Partially equivalent to commit abf5ef6c22 ("apparmor: Allow pasta
to remount /proc, access entries under its own copy"): we should
allow pasta to remount /proc. It still works otherwise, but further
UID remapping in nested user namespaces (e.g. pasta in pasta) won't.
Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=79#c3
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...Podman users might get confused by the fact that if we can't
find a default route for a given IP version, we'll report that as a
warning message and possibly just before actual error messages.
However, a lack of routable interface for IPv4 or IPv6 can be a
normal circumstance: don't warn about it, just state that as
informational message, if those are displayed (they're not in
non-error paths in Podman, for example).
While at it, make it clear that we're disabling IPv4 or IPv6 if
there's no routable interface for the corresponding IP version.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We don't know how frequently this happens, but hitting
fs.inotify.max_user_watches or similar sysctl limits is definitely
not out of question, and Paul mentioned that, for example, Podman's
CI environments hit similar issues in the past.
Introduce a fallback mechanism based on a timer file descriptor: we
grab the directory handle at startup, and we can then use openat(),
triggered periodically, to check if the (network) namespace directory
still exists. If openat() fails at some point, exit.
Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...or similar, that is, if only excluded ranges are given (implying
we'll forward any other available port). In that case, we'll usually
forward large sets of ports, and it might be inconvenient for the
user to skip excluding single ports that are already taken.
The existing behaviour, that is, exiting only if we fail to bind all
the ports for one given forwarding option, turns out to be
problematic for several aspects raised by Paul:
- Podman merges ranges anyway, so we might fail to bind all the ports
from a specific range given by the user, but we'll not fail anyway
because Podman merges it with another one where we succeed to bind
at least one port. At the same time, there should be no semantic
difference between multiple ranges given by a single option and
multiple ranges given as multiple options: it's unexpected and
not documented
- the user might actually rely on a given port to be forwarded to a
given container or a virtual machine, and if connections are
forwarded to an unrelated process, this might raise security
concerns
- given that we can try and fail to bind multiple ports before
exiting (in case we can't bind any), we don't have a specific error
code we can return to the user, so we don't give the user helpful
indication as to why we couldn't bind ports.
Exit as soon as we fail to create or bind a socket for a given
forwarded port, and report the actual error.
Keep the current behaviour, however, in case the user wants to
forward all the (available) ports for a given protocol, or all the
ports with excluded ranges only. There, it's more reasonable that
the user is expecting partial failures, and it's probably convenient
that we continue with the ports we could forward.
Update the manual page to reflect the new behaviour, and the old
behaviour too in the cases where we keep it.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Usually automatically forwarded UDP outbound ports are set up by
udp_port_rebind_outbound() called from udp_timer(). However, the very
first time they're created and bound is by udp_sock_init_ns() called from
udp_init(). udp_sock_init_ns() is essentially an unnecessary cut down
version of udp_port_rebind_outbound(), so we can jusat remove it.
Doing so does require moving udp_init() below udp_port_rebind_outbound()'s
definition.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For automated inbound port forwarding in pasta mode we scan bound ports
within the guest namespace via /proc and bind matching ports on the host to
listen for packets. For UDP this is usually handled by udp_timer() which
calls port_fwd_scan_udp() followed by udp_port_rebind(). However there's
one initial scan before the the UDP timer is started: we call
port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting
ports in udp_sock_init_init() called from udp_init().
Unfortunately, the version in udp_sock_init_init() isn't correct. It
unconditionally opens a new socket for every forwarded port, even if a
socket has already been explicit created with the -u option. If the
explicitly forwarded ports have particular configuration (such as a
specific bound address address, or one implied by the -o option) those will
not be replicated in the new socket. We essentially leak the original
correctly configured socket, replacing it with one which might not be
right.
We could make udp_sock_init_init() use udp_port_rebind() to get that right,
but there's actually no point doing so:
* The initial bind was introduced by ccf6d2a7b4 ("udp: Actually bind
detected namespace ports in init namespace") at which time we didn't
periodically scan for bound UDP ports. Periodic scanning was introduced
in 457ff122e ("udp,pasta: Periodically scan for ports to automatically
forward") making the bind from udp_init() redundant.
* At the time of udp_init(), programs in the guest namespace are likely
not to have started yet (unless attaching a pre-existing namespace) so
there's likely not anything to scan for anyway.
So, simply remove the initial, broken socket create/bind, allowing
automatic port forwards to be created the first time udp_timer() runs.
Reported-by: Laurent Jacquot <jk@lutty.net>
Suggested-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=79
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
6c7623d07 ("netlink: Add support to fetch default gateway from multipath
routes") inadvertently introduced a new cppcheck warning for a variable
which could be a const pointer but isn't. This occurs with
cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least.
Fixes: 6c7623d07b ("netlink: Add support to fetch default gateway from multipath routes")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Before commit 32d07f5e59 ("passt, pasta: Completely avoid dynamic
memory allocation"), we didn't store the current log mask in a
variable, and we fetched it using setlogmask(0) wherever needed.
But after that commit, we can use our log_mask copy instead. And we
should: with recent glibc versions, setlogmask(0) actually results in
a system call, which causes a substantial overhead with high transfer
rates: we use setlogmask(0) even to decide we don't want to print
debug messages.
Now that we rely on log_mask in early stages, before setlogmask() is
called, we need to initialise that variable to the special LOG_EMERG
mask value right away: define LOG_EARLY to make this clearer, and,
while at it, group conditions in vlogmsg() into something more terse.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack. Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards. Setting seq_ack_from_tap is
not correct, though.
In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above. However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible. In
that case this update will cause problems.
We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed). We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest. However tcp_sock_consume() failing means we weren't
able to disard the acked data. To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough. Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the default route for a given IP version is a multipath one,
instead of refusing to start because there's no RTA_GATEWAY attribute
in the set returned by the kernel, we can just pick one of the paths.
To make this somewhat less arbitrary, pick the path with the highest
weight, if weights differ.
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/20927
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(),
and the logic to do that cleanup is open coded in that function. Similarly
new sockets are opened when we discover we don't have an existing one in
icmp_tap_handler(), and again the logic is open-coded.
That's not the worst thing, but it's a bit cleaner to have dedicated
functions for the creation and destruction of ping sockets. This will also
make things a bit easier for future changes we have in mind.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We access fields of packets received from ping sockets assuming they're
echo replies, without actually checking that. Of course, we don't expect
anything else from the kernel, but it's probably best to verify.
While we're at it, also check for short packets, or a receive address of
the wrong family.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we silently ignore an errors receiving a packet from a ping
socket. We don't expect that to happen, so it's probably worth reporting
if it does.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have separate handlers for ICMP and ICMPv6 ping replies.
Although there are a number of points of difference, with some creative
refactoring we can combine these together sensibly. Although it doesn't
save a vast amount of code, it does make it clearer that we're performing
basically the same steps for each case.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently icmp_tap_handler() consists of two almost disjoint paths for the
IPv4 and IPv6 cases. The only thing they share is an error message.
We can use some intermediate variables to refactor this to share some more
code between those paths.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we use icmp_act[] to scan for ICMP ids which might have an open
socket which could time out. However icmp_act[] contains no information
that's not already in icmp_id_map[] - it's just an "index" which allows
scanning for relevant entries with less cache footprint.
We only scan for ICMP socket expiry every 1s, though, so it's not clear
that cache footprint really matters. Furthermore, there's no strong reason
we need to scan even that often - the timeout is fairly arbitrary and
approximate.
So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and
compensate for the cache impact by reducing the scan frequency to once
every 10s.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
icmp_id_map[] contains, amongst other things, fds for "ping" sockets
associated with various ICMP echo ids. However, we only lazily open()
those sockets, so many will be missing. We currently represent that with
a 0, which isn't great, since that's technically a valid fd. Use -1
instead. This does require initializing the fields in icmp_id_map[] but
we already have an obvious place to do that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When forwarding pings from tap, currently we create a ping socket with
a socket address whose port is set to the ID of the ping received from the
guest. This causes the socket to send pings with the same ID on the host.
Although this seems look a good idea for maximum transparency, it's
probably unwise.
First, it's fallible - the bind() could fail, and we already have fallback
logic which will overwrite the packets with the expected guest id if the
id we get on replies doesn't already match. We might as well do that
unconditionally.
But more importantly, we don't know what else on the host might be using
ping sockets, so we could end up with an ID that's the same as an existing
socket. You'd expect that to fail the bind() with EADDRINUSE, which would
be fine: we'd fall back to rewriting the reply ids. However it appears the
kernel (v6.6.3 at least), does *not* fail the bind() and instead it's
"last socket wins" in terms of who gets the replies. So we could
accidentally intercept ping replies for something else on the host.
So, instead of using bind() to set the id, just let the kernel pick one
and expect to translate the replies back. Although theoretically this
makes the passt/pasta link a bit less "transparent", essentially nothing
cares about specific ping IDs, much like TCP source ports, which we also
don't preserve.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Linux ICMP "ping" sockets are very specific in what they do. They let
userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive
matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you
intercept or handle incoming ping requests.
In the case of passt/pasta that means we can process echo requests from tap
and forward them to a ping socket, then take the replies from the ping
socket and forward them to tap. We can't do the reverse: take echo
requests from the host and somehow forward them to the guest. There's
really no way for something outside to initiate a ping to a passt/pasta
connected guest and if there was we'd need an entirely different mechanism
to handle it.
However, we have some logic to deal with packets going in that reverse
direction. Remove it, since it can't ever be used that way. While we're
there use defines for the ICMPv6 types, instead of open coded type values.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We initialise the address portion of the sockaddr for sendto() to the
unspecified address, but then always overwrite it with the actual
destination address before we call the sendto().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We set the port to the ICMP id on the sendto() address when using ICMP
ping sockets. However, this has no effect: the ICMP id the kernel
uses is determined only by the "port" on the socket's *bound* address
(which is constructed inside sock_l4(), using the id we also pass to
it).
For unclear reasons this change triggers cppcheck 2.13.0 to give new
"variable could be const pointer" warnings, so make *ih const as well to
fix that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table. Doing this
sometimes requires moving an entry when one is freed. That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.
Implement a new way of managing the flow table that doesn't ever move
entries. It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers. Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers. If true, this indicates that the flow
code itself should free the flow.
This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections. This duplication isn't ideal and will get worse as we
add more protocols to the flow table. It also makes it harder to
experiment with different ways of handling flow table allocation.
Instead, introduce a function to allocate a new flow: flow_alloc(). In
some cases we currently check if we're able to allocate, but delay the
actual allocation. We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation. We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>