On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral(). This is because
NUM_PORTS evaluates to an unsigned integer. It's a fixed value, though
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouln't leave local
ports to use for outgoing connections. Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.
Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535. However, that's not the range used by default on
Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range
sysctl.
We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway. So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.
Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails. This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage. Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams. Because of that,
they're generally not good choices for listening servers to bind to.
Thefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports. Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.
Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).
The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start off, I don't think it really
matters.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The guest is usually assigned one of the host's IP addresses. That means
it can't access the host itself via its usual address. The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host. However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.
Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
fwd_nat_from_host() needs to adjust the source address for new flows coming
from an address which is not accessible to the guest. Currently we always
use our_tap_addr or our_tap_ll. However in cases where the address is
accessible to the guest via translation (i.e. via --map-host-loopback) then
it makes more sense to use that translation, rather than the fallback
mapping of our_tap_*.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway. We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT. So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.
Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances. So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However
when --no-map-gw is specified we leave them unspecified instead. This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
1. The router/gateway address as seen by the guest
2. An address to NAT to the host with --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense
Case 3 occurs in two situations:
a) for our DHCP responses - since they come from passt internally there's
no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
accessible (localhost or the guest's own address).
(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.
For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw). For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.
Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here (e.g. if the host is disconnected or only has a point to point
link with no gateway address). In that case we have to disable forwarding
of inbound connections with guest-inaccessible source addresses.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually avoid NAT, but in a few cases we need to apply address
translations. For inbound connections that happens for addresses which
make sense to the host but are either inaccessible, or mean a different
location from the guest's point of view.
Add some helper functions to determine such addresses, and use them in
fwd_nat_from_host(). In doing so clarify some of the reasons for the
logic. We'll also have further use for these helpers in future.
While we're there fix one unneccessary inconsistency between IPv4 and IPv6.
We always translated the guest's observed address, but for IPv4 we didn't
translate the guest's assigned address, whereas for IPv6 we did. Change
this to translate both in all cases for consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In every place we use our_tap_ll, we only use it as a fallback if the
IPv6 gateway address is not link-local. We can avoid that conditional at
use time by doing it at initialisation of our_tap_ll instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the
guest, but the former is an address for our use on the tap link. Rename it
accordingly, to 'our_tap_ll'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there correct an error in flow_initiate_af()s comments where
we referred to parameters by the wrong name).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side. Currently, however, only UDP
packets to port 53 are considered "DNS requests". This ignores DNS
requests over TCP - less common, but certainly possible. It also ignores
encrypted DNS requests on port 853.
Extend the DNS forwarding logic to handle both of those cases.
Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.
To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases. Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.
While at it, convert a few error messages from a scant perror style
to proper failure descriptions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the
architecture name for, among others, seccomp.sh, I didn't realise
that "armv6l" and "armv7l" are just Linux kernel names -- compilers
just call that "arm".
Fix the "syscalls" annotation we use to define seccomp profiles
accordingly, otherwise pasta will be terminated on sigreturn() on
armv6l and armv7l.
Fixes: 213c397492 ("passt, pasta: Run-time selection of AVX2 build")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures. Modify various files to include the new header ip.h when
it's needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding. We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table. This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>