In conf_ports() we have three different paths which actually do the setup
of an individual forwarded port: one for the "all" case, one for the
exclusions only case and one for the range of ports with possible
exclusions case.
We can unify those cases using a new helper which handles a single range
of ports, with a bitmap of exclusions. Although this is slightly longer
(largely due to the new helpers function comment), it reduces duplicated
logic. It will also make future improvements to the tracking of port
forwards easier.
The new conf_ports_range_except() function has a pretty prodigious
parameter list, but I still think it's an overall improvement in conceptual
complexity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers. We currently limit it to ETH_MAX_MTU
which sounds like it makes sense. But ETH_MAX_MTU is confusing: it's not
consistently used as to whether it means the maximum L3 datagram size or
the maximum L2 frame size. Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.
Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header. The latter can vary
depending on our tap backend, although it doesn't right now.
Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option. That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.
However, there are many other places in the main option parsing loop which
also rely on mode. We get away with those, because they happen to be able
to treat passt and vhost-user modes identically. This is potentially
confusing, though. So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.
To match, we move the parsing of --repair-path back into the main option
parsing loop.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
One of the first things we need to do is determine if we're in passt mode
or pasta mode. Currently this is open-coded in main(), by examining
argv[0]. We want to complexify this a bit in future to cover vhost-user
mode as well. Prepare for this, by moving the mode detection into a new
conf_mode() function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we rely on detecting our mode first and use different sets of
(single character) options for each. This means that if you use an option
valid in only one mode in another you'll get the generic usage() message.
We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we reject the -m option if given a value less than ETH_MIN_MTU
(68). That define is derived from the kernel, but its name is misleading:
it doesn't really have anything to do with Ethernet per se, but is rather
the minimum payload any L2 link must be able to handle in order to carry
IPv4. For IPv6, it's not sufficient: that requires an MTU of at least
1280.
Newer kernels have better named constants IPV4_MIN_MTU and IPv6_MIN_MTU.
Copy and use those constants instead, along with some more specific error
messages.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default. However, internally we use (c->mtu == -1) to represent that
state. We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.
This is unnecessarily confusing. We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case. This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We're a bit sloppy with parsing MTU which can lead to some surprising,
though fairly harmless, results:
* Passing a non-number like '-m xyz' will not give an error and act like
-m 0
* Junk after a number (e.g. '-m 1500pqr') will be ignored rather than
giving an error
* We parse the MTU as a long, then immediately assign to an int, so on
some platforms certain ludicrously out of bounds values will be
silently truncated, rather than giving an error
Be a bit more thorough with the error checking to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.
When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.
To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.
Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both DHCPv4 and DHCPv6 has the capability to pass the hostname to
clients, the DHCPv4 uses option 12 (hostname) while the DHCPv6 uses option 39
(client fqdn), for some virt deployments like kubevirt is expected to
have the VirtualMachine name as the guest hostname.
This change add the following arguments:
- -H --hostname NAME to configure the hostname DHCPv4 option(12)
- --fqdn NAME to configure client fqdn option for both DHCPv4(81) and
DHCPv6(39)
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.
By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:
0.0029: NAT to host 127.0.0.1: 192.168.100.1
But if the host gateway is also a DNS resolver:
0.0029: DNS:
0.0029: 192.168.100.1
then we'll send DNS queries directed to it to the host instead:
0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
0.0372: Flow 0 (TGT): INI -> TGT
0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
0.0373: Flow 0 (UDP flow): TGT -> TYPED
0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
which doesn't quite work, of course:
0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
0.0374: ICMP error on UDP socket 95: Connection refused
unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.
Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.
Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000
Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
sleep 100 && \
sleep 10 && \
podman rm -fa -t0
done
And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
#0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
#1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
#2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#3 0x000055f31b78c50b n/a (n/a + 0x0)
#4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#5 0x000055f31b78d5a2 n/a (n/a + 0x0)
Pasta got killed in exit(), it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.
Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.
Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
During testing it is sometimes useful to force traffic which would
normally be forwared by socket splicing through the tap interface.
In this commit, we add a command switch enabling such funtionality
for inbound local traffic.
For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.
To this end, if we don't find any usable host interface, instead of
exiting:
- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
as default gateway
- for IPv6, don't assign any address (forcibly disable DHCPv6), and
use the *first* link-local address we observe to represent the
guest/container. Advertise fe80::1 as default gateway
- use 'tap0' as default interface name for pasta
Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.
Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).
Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Fixes: 90e83d50a9 ("Don't take "our" MAC address from the host")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D. But NETNS_RUN_DIR is just a fixed string, it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile. Just move it to a plain #define in conf.c.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.
In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).
The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.
Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.
Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):
/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
43 | snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
577 | snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
579 | snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 | pidval);
| ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
105 | snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
242 | snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
243 | snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
1155 | snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning
Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to. This is
controlled by the c->ip[46].dns_host variables. This is set to the first
first nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.
Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.
Add a new --dns-host option to restore this functionality in a more
sensible way.
Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future. That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place. We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.
Add a --freebind command line argument to enable this socket option on
all listening sockets.
Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We already have an inany_ntop() function to format inany addresses into
text. Add inany_pton() to parse them from text, and use it in
conf_ports().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouln't leave local
ports to use for outgoing connections. Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.
Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535. However, that's not the range used by default on
Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range
sysctl.
We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway. So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.
Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails. This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage. Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When using -t all, -u all or exclude-only ranges, we'll attempt to forward
all non-ephemeral port numbers, including port 0. However, this won't work
as intended: bind() treats a zero port not as literal port 0, but as
"pick a port for me". Because of the special meaning of port 0, we mostly
outright exclude it in our handling.
Do the same for setting up forwards, not attempting to forward for port 0.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams. Because of that,
they're generally not good choices for listening servers to bind to.
Thefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports. Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.
Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).
The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start off, I don't think it really
matters.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Seen with krun: we get a file descriptor via --fd, but we close it and
happily use the same number for TCP files.
The issue is that if we also get other options before --fd, with
arguments, getopt_long() stops parsing them because it sees them as
non-option values.
Use the - modifier at the beginning of optstring (before :, which is
needed to avoid printing errors) instead of +, which means we'll
continue parsing after finding unrelated option values, but
getopt_long() won't reorder them anyway: they'll be passed with option
value '1', which we can ignore.
By the way, we also need to add : after F in the optstring, so that
we're able to parse the option when given as short name as well.
Now that we change the parsing mode between close_open_files() and
conf(), we need to reset optind to 0, not to 1, whenever we call
getopt_long() again in conf(), so that the internal initialisation
of getopt_long() evaluating GNU extensions is re-triggered.
Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828
Fixes: baccfb95ce ("conf: Stop parsing options at first non-option argument")
Fixes: 09603cab28 ("passt, util: Close any open file that the parent might have leaked")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The guest is usually assigned one of the host's IP addresses. That means
it can't access the host itself via its usual address. The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host. However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.
Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the host and guest share the same IP address with passt/pasta, it's
not possible for the guest to directly address the host. Therefore we
allow packets from the guest going to a special "NAT to host" address to be
redirected to the host, appearing there as though they have both source and
destination address of loopback.
Currently that special address is always the address of the default
gateway (or none). That can be a problem if we want that gateway to be
addressable by the guest. Therefore, allow the special "NAT to host"
address to be overridden on the command line with a new --map-host-loopback
option.
In order to exercise and test it, update the passt_in_ns and perf
tests to use this option and give different mapping addresses for the
two layers of the environment.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway. We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT. So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.
Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances. So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However
when --no-map-gw is specified we leave them unspecified instead. This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When sending frames to the guest over the tap link, we need a source MAC
address. Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
* We can't preserve the real MAC address of packets from anywhere
external so there's no transparency case here
* In fact, it's confusingly different from how we handle IP addresses:
whereas we give the guest the same IP as the host, we're making the
host's MAC the one MAC that the guest *can't* use for itself.
* We already need a fallback case if the host doesn't have an Ethernet
like MAC (e.g. if it's connected via a point to point interface, such
as a wireguard VPN).
Change to just just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55. It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps. This can still, of course, be overridden with the -M option.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In every place we use our_tap_ll, we only use it as a fallback if the
IPv6 gateway address is not link-local. We can avoid that conditional at
use time by doing it at initialisation of our_tap_ll instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the
guest, but the former is an address for our use on the tap link. Rename it
accordingly, to 'our_tap_ll'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Despite the names, addr_ll_seen does not relate to addr_ll the same
way addr_seen relates to addr. addr_ll_seen is an observed address
from the guest, whereas addr_ll is *our* link-local address for use on
the tap link when we can't use an external endpoint address. It's
used both for passt provided services (DHCPv6, NDP) and in some cases
for connections from addresses the guest can't access.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Although it's not 100% explicit in the man page, addresses given to
the --dns option are intended to be addresses as seen by the guest.
This differs from addresses taken from the host's /etc/resolv.conf,
which must be translated to guest accessible versions in some cases.
Our implementation is currently inconsistent on this: when using
--dns-forward, you must usually also give --dns with the matching address,
which is meaningful only in the guest's address view. However if you give
--dns with a loopback addres, it will be translated like a host view
address.
Move the remapping logic for DNS addresses out of add_dns4() and add_dns6()
into add_dns_resolv() so that it is only applied for host nameserver
addresses, not for nameservers given explicitly with --dns.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to
the given address, rather than the gateway address. This is doubly wrong:
- We've just established the given address is a host loopback address
the guest can't access
- We've just set ip6.dns[] to tell the guest to use the gateway address,
so it won't use the dns_match address we're setting
Correct this to use the gateway address, like IPv4.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
get_dns() is already quite deeply nested, and future changes I have in
mind will add more complexity. Prepare for this by splitting out the
adding of a single nameserver to the configuration into its own function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Every time we call add_dns[46] we need to first check if there's space in
the c->ip[46].dns array for the new entry. We might as well make that
check in add_dns[46]() itself.
In fact it looks like the calls in get_dns() had an off by one error, not
allowing the last entry of the array to be filled. So, that bug is also
fixed by the change.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
get_dns() counts the number of guest DNS servers it adds, and gives an
error if it couldn't add any. However, this count ignores the fact that
add_dns[46]() may in some cases *not* add an entry. Use the array indices
we're already tracking to get an accurate count.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently add_dns[46]() take a somewhat awkward double pointer to the
entry in the c->ip[46].dns array to update. It turns out to be easier to
work with indices into that array instead.
This diff does add some lines, but it's comments, and will allow some
future code reductions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We rely on C11 already, so we can use clearer and more type-checkable
struct assignment instead of mempcy() for copying IP addresses around.
This exposes some "pointer could be const" warnings from cppcheck, so
address those too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display. Add an eth_ntop() helper for
this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
f6d5a52392 moved handling of -D into a later loop. However as a side
effect it moved this from a switch block to an if block. I left a couple
of 'break' statements that don't make sense in the new context. They
should be 'continue' so that we go onto the next option, rather than
leaving the loop entirely.
Fixes: f6d5a52392 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.
Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine. However, they can also be called earlier
if an explicit -D command line option is given. In this case no_map_gw
and/or c->ip[46].gw may not get be initialised properly, leading to this
doing the wrong thing.
Luckily we already have a second pass of option parsing for things which
need addresses to already be configured. Move handling of -D to there.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>