Commit graph

907 commits

Author SHA1 Message Date
David Gibson
ca69c3f196 inany: Helper functions for handling addresses which could be IPv4 or IPv6
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union.  We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.

This approach is useful wider than the specific place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h.  Along with
that create a number of helper functions to make working with these "inany"
addresses easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:32 +01:00
David Gibson
b65d603e23 tcp: Don't store hash bucket in connection structures
Currently when we insert a connection into the hash table, we store its
bucket number so we can find it when removing entries.  However, we can
recompute the hash value from other contents of the structure so we don't
need to store it.  This brings the size of tcp_tap_conn down to 64 bytes
again, which means it will fit in a single cacheline on common machines.

This change also removes a non-obvious constraint that the hash table have
less than twice TCP_MAX_CONNS buckets, because of the way
TCP_HASH_BUCKET_BITS was constructed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:28 +01:00
David Gibson
233b95e90f tcp: Remove splice from tcp_epoll_ref
Currently the epoll reference for tcp sockets includes a bit indicating
whether the socket maps to a spliced connection.  However, the reference
also has the index of the connection structure which also indicates whether
it is spliced.  We can therefore avoid the splice bit in the epoll_ref by
unifying the first part of the non-spliced and spliced handlers where we
look up the connection state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:25 +01:00
David Gibson
d909fda1e8 tcp: Use the same sockets to listen for spliced and non-spliced connections
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address).  This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::).  This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.

However, it turns out we don't need to do this.  We can splice a connection
if and only if it originates from the loopback address.  Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback.  Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:22 +01:00
David Gibson
356c6e0677 tcp: Unify part of spliced and non-spliced conn_from_sock path
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else.  However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.

Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:19 +01:00
David Gibson
73d3a3e84e tcp: Separate helpers to create ns listening sockets
tcp_sock_init*() can create either sockets listening on the host, or in
the pasta network namespace (with @ns==1).  There are, however, a number
of differences in how these two cases work in practice though.  "ns"
sockets are only used in pasta mode, and they always lead to spliced
connections only.  The functions are also only ever called in "ns" mode
with a NULL address and interface name, and it doesn't really make sense
for them to be called any other way.

Later changes will introduce further differences in behaviour between these
two cases, so it makes more sense to use separate functions for creating
the ns listening sockets than the regular external/host listening sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:02 +01:00
David Gibson
433604a581 tcp: Unify the IN_EPOLL flag
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures.  However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.

Simplify things slightly by moving this bit into the common header of the
two structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:35:00 +01:00
David Gibson
34476511f7 tcp: Partially unify tcp_timer() and tcp_splice_timer()
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them.  Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:58 +01:00
David Gibson
0eef48c4be tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections.  To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:54 +01:00
David Gibson
ee8f8e9564 tcp: Unify spliced and non-spliced connection tables
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections.  This is a bit counter-intuitive.

More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases.  So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.

For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall.  If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.

For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them.  We'll improve that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:51 +01:00
David Gibson
181ce83d9b tcp: Improved helpers to update connections after moving
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another.  That requires some updates
in the entries themselves.  Add helpers to make all the necessary updates
for the spliced and non-spliced cases.  This will simplify later cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:48 +01:00
David Gibson
ff27fd63cd tcp: Add connection union type
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays.  We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection.  For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.

This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline.  This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:46 +01:00
David Gibson
3cf027bd59 tcp: Move connection state structures into a shared header
Currently spliced and non-spliced connections use completely independent
tracking structures.  We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.

This requires renaming some #defines with the same name but different
meanings between the two cases.  In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:43 +01:00
David Gibson
c6b822428a tcp_splice: Helpers for converting from index to/from tcp_splice_conn
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures.  Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules).  This
will make subsequent changes a bit safer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:40 +01:00
David Gibson
60be7438fa tcp: Better helpers for converting between connection pointer and index
The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking.  Replace it with an inline function, which means:
    - Better type checking
    - No danger of multiple evaluation of an @index with side effects

Also add a helper to perform the reverse translation: from connection
pointer to index.  Introduce a macro for this which will make later
cleanups easier and safer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:37 +01:00
David Gibson
46b78ce96a tcp: Remove unused TCP_MAX_SOCKS constant
Presumably it meant something in the past, but it's no longer used.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:33 +01:00
David Gibson
9ffa0184e3 tcp_splice: #include tcp_splice.h in tcp_splice.c
This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file.  This shows up
an old declaration for a function that is now static, and a duplicate

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:30 +01:00
David Gibson
698e4fd761 style: Minor corrections to function comments
Some style issues and a typo.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:26 +01:00
David Gibson
e6948822ba clang-tidy: Suppress warning about assignments in if statements
clang-tools 15.0.0 appears to have added a new warning that will always
complain about assignments in if statements, which we use in a number of
places in passt/pasta.  Encountered on Fedora 37 with
clang-tools-extra-15.0.0-3.fc37.x86_64.

Suppress the new warning so that we can compile and test.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-25 01:34:08 +01:00
Stefano Brivio
708b366016 README: Add link to weekly development meeting
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-21 10:55:33 +01:00
Stefano Brivio
3eb26fecf5 README: Fix left-over and indentation for Podman example command
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 23:36:55 +01:00
Stefano Brivio
ace074cf49 README: The upcoming version of Podman adds support for pasta
Don't update the demo yet: we could just wait until Podman 4.3.2 is
out and packaged.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 17:29:11 +01:00
Stefano Brivio
ab6f825889 util, pasta: Add do_clone() wrapper around __clone2() and clone()
Spotted in Debian's buildd logs: on ia64, clone(2) is not available:
the glibc wrapper is named __clone2() and it takes, additionally,
the size of the stack area passed by the caller.

Add a do_clone() wrapper handling the different cases, and also
taking care of pointing the child's stack in the middle of the
allocated area: on PA-RISC (hppa), handled by clone(), the stack
grows up, and on ia64 the stack grows down, but the register backing
store grows up -- and I think it might be actually used here.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 17:28:53 +01:00
Stefano Brivio
7d8c2fb916 test/lib/test: Clean up iperf3 JSON files before starting the server
...instead of doing it after the test. Now that we have pre-built
guest images, we might also have old JSON files from previous,
interrupted test runs.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 15:11:25 +01:00
Stefano Brivio
25dab96205 tap: Revert recently added checks in tap_handler_passt()
This reverts commit 198f87835d ("tap: Return -EIO from
tap_handler_passt() on inconsistent packet stream") and commit
510dace86c ("tap: Keep stream consistent if qemu length descriptor
spans two recv() calls").

I can hit occasional failures in perf/passt_tcp tests where we seem
to be getting excess data at the end of a recv(), and for some reason
I couldn't figure out yet, if we just ignore it, subsequent recv()
calls from qemu return correct data. If we close the connection, qemu
can't talk to us anymore, of course.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 15:11:22 +01:00
Stefano Brivio
b27d6d121c arp, tap, util: Don't use perror() after seccomp filter is installed
If stderr is closed, after we fork to background, glibc's
implementation of perror() will try to re-open it by calling dup(),
upon which the seccomp filter causes the process to terminate,
because dup() is not included in the list of allowed syscalls.

Replace perror() calls that might happen after isolation_postfork().
We could probably replace all of them, but early ones need a bit more
attention as we have to check whether log.c functions work in early
stages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:11:13 +01:00
Stefano Brivio
5f74465010 Remove contrib/debian, Debian package development now happens on Salsa
The development of the Debian package is now at:
  https://salsa.debian.org/sbrivio/passt

Drop contrib/debian, it's finally obsolete.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 15:11:09 +01:00
Stefano Brivio
fb7b71b86f contrib/apparmor: Merge pasta and passt profiles, update rules
AppArmor resolves executable links before profile attachment rules
are evaluated, so, as long as pasta is installed as a link to passt,
there's no way to differentiate the two cases. Merge the two profiles
and leave a TODO note behind, explaining two possible ways forward.

Update the rules so that passt and pasta are actually usable, once
the profile is installed. Most required changes are related to
isolation and sandboxing features.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-16 15:11:07 +01:00
Stefano Brivio
b6400db3a6 README: Add links to Debian package tracker
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:10:53 +01:00
Stefano Brivio
85d8ba3f24 Makefile: Change HPPA into PARISC while building PASST_AUDIT_ARCH
The AUDIT_ARCH defines in seccomp.h corresponding to HPPA are
AUDIT_ARCH_PARISC and AUDIT_ARCH_PARISC64.

Build error spotted in Debian's buildd log on
phantom.physik.fu-berlin.de.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:10:46 +01:00
Stefano Brivio
fb8376f4b5 Makefile: It's AUDIT_ARCH_MIPSEL64, not AUDIT_ARCH_MIPS64EL
On mips64el, gcc -dumpmachine correctly reports mips64el as
architecture prefix, but for some reason seccomp.h defines
AUDIT_ARCH_MIPSEL64 and not AUDIT_ARCH_MIPS64EL. Mangle AUDIT_ARCH
accordingly.

Build error spotted in Debian's buildd logs from Loongson build.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:10:35 +01:00
Stefano Brivio
b501a8f274 Makefile: Don't filter out -O2 from supplied flags for AVX2 builds
Drop it from the internal FLAGS variable, but honour -O2 if passed in
CFLAGS. In Debian packages, dpkg-buildflags uses it as hardening
flag, and we get a QA warning if we drop it:
  https://qa.debian.org/bls/bytag/W-dpkg-buildflags-missing.html

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:10:32 +01:00
Stefano Brivio
b3c9e76dab Makefile: Honour passed CPPFLAGS, not just CFLAGS
CPPFLAGS allow the user to pass pre-processor flags. This is unlikely
to be needed at the moment, but the Debian Hardening Walkthrough
reasonably requests it to be handled in order to fully support
hardened build flags:
  https://wiki.debian.org/HardeningWalkthrough#Handling_dpkg-buildflags_in_your_upstream_build_system

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:10:19 +01:00
Stefano Brivio
3a2afde87d conf, udp: Drop mostly duplicated dns_send arrays, rename related fields
Given that we use just the first valid DNS resolver address
configured, or read from resolv.conf(5) on the host, to forward DNS
queries to, in case --dns-forward is used, we don't need to duplicate
dns[] to dns_send[]:

- rename dns_send[] back to dns[]: those are the resolvers we
  advertise to the guest/container

- for forwarding purposes, instead of dns[], use a single field (for
  each protocol version): dns_host

- and rename dns_fwd to dns_match, so that it's clear this is the
  address we are matching DNS queries against, to decide if they need
  to be forwarded

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-16 15:09:31 +01:00
Stefano Brivio
4129764eca conf: Fix mask calculation from prefix_len in conf_print()
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: dd09cceaee ("Minor improvements to IPv4 netmask handling")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-10 11:18:09 +01:00
Stefano Brivio
817eedc28a tcp, udp: Don't initialise IPv6/IPv4 sockets if IPv4/IPv6 are not enabled
If we disable a given IP version automatically (no corresponding
default route on host) or administratively (--ipv4-only or
--ipv6-only options), we don't initialise related buffers and
services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap"
handlers will also ignore packets with a disabled IP version.

However, in commit 3c6ae62510 ("conf, tcp, udp: Allow address
specification for forwarded ports") I happily changed socket
initialisation functions to take AF_UNSPEC meaning "any enabled
IP version", but I forgot to add checks back for the "enabled"
part.

Reported by Paul: on a host without default IPv6 route, but IPv6
enabled, connect, using IPv6, to a port handled by pasta, which
tries to send data to a tap device without initialised buffers
for that IP version and exits because the resulting write() fails.

Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or
pasta -4 and inbound IPv6 connection.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 3c6ae62510 ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-11-10 11:17:50 +01:00
Stefano Brivio
6533a4a07b passt: Move __setlogmask() calls before output unrelated to configuration
...so that we avoid printing some lines twice because log-level is
still set to LOG_EMERG, as if logging configuration didn't happen
yet.

While at it, note that logging to stderr doesn't really depend on
whether debug mode is enabled or not.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
Stefano Brivio
198f87835d tap: Return -EIO from tap_handler_passt() on inconsistent packet stream
While it's important to fail in that case, it makes little sense to
fail quietly: it's better to tell qemu explicitly that something went
wrong and that we won't recover, by closing the socket.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
Stefano Brivio
510dace86c tap: Keep stream consistent if qemu length descriptor spans two recv() calls
I got all paranoid after triggering a divide-by-zero general
protection fault in passt with a qemu version without the virtio_net
TX hang fix, while flooding UDP. I start thinking this was actually
coming from some random changes I was playing with, but before
reaching this conclusion I reviewed once more the relatively short
path in tap_handler_passt() before we start using packet_*()
functions, and found this.

Never observed in practice, but artificially reproduced with changes
in qemu's socket interface: if we don't receive from qemu a complete
length descriptor in one recv() call, or if we receive a partial one
at the end of one call, we currently disregard the rest, which would
make the stream inconsistent.

Nothing really bad happens, except that from that point on we would
disregard all the packets we get until, if ever, we get the stream
back in sync by chance.

Force reading a complete packet length descriptor with a blocking
recv(), if needed -- not just a complete packet later.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-10 11:17:50 +01:00
Stefano Brivio
e308018bbe test/memory/passt: Change passt.avx2 path to /bin in test itself
Now that we install the binary in /bin, and we have a link from
/usr/bin, change the path in the test itself as well. Otherwise
it works with bash but not with dash for some reason.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
11efaefa1e passt, qrap, README: Update notes and documentation for AF_UNIX support in qemu
We can't get rid of qrap quite yet, but at least we should start
telling users it's not going to be needed anymore starting from qemu
7.2.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
6c1697764c test/perf: Finally drop workaround for virtio_net TX stall
Now that we require 13c6be96618c ("net: stream: add unix socket")
in qemu to run the tests, we can also assume that commit df8d07081718
("virtio-net: fix bottom-half packet TX on asynchronous completion")
is present, as it was merged before that one.

This fixes the issue we attempted to work around in passt TCP and
UDP performance tests: finally drop that stuff.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
667397db11 test: Switch to qemu -netdev stream option instead of using qrap
qemu commit 13c6be96618c ("net: stream: add unix socket") introduces
support for native AF_UNIX support, finally making qrap useless.

We can't quite drop that yet until a qemu release includes it, and
then we'll need to wait a while for users to switch anyway, but at
least for tests, we can use that support.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
1401962a37 test: Wait for network before starting passt in two_guests setup
As pasta now configures that target network namespace with
--config-net, we need to wait for addresses and routes to be actually
present. Just sending netlink messages doesn't mean this is done
synchronously.

A more elegant alternative, which probably makes sense regardless of
this test setup, would be to query, from pasta, addresses and routes
we added, and wait until they're there, before proceeding.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
db74679f98 udp: Check for answers to forwarded DNS queries before handling local redirects
Now that we allow loopback DNS addresses to be used as targets for
forwarding, we need to check if DNS answers come from those targets,
before deciding to eventually remap traffic for local redirects.

Otherwise, the source address won't match the one configured as
forwarder, which means that the guest or the container will refuse
those responses.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
73f50a76aa conf: Split the notions of read DNS addresses and offered ones
With --dns-forward, if the host has a loopback address configured as
DNS server, we should actually use it to forward queries, but, if
--no-map-gw is passed, we shouldn't offer the same address via DHCP,
NDP and DHCPv6, because it's not going to be reachable.

Problematic configuration:

* systemd-resolved configuring the usual 127.0.0.53 on the host: we
  read that from /etc/resolv.conf

* --dns-forward specified with an unrelated address, for example
  198.51.100.1

We still want to forward queries to 127.0.0.53, if we receive one
directed to 198.51.100.1, so we can't drop 127.0.0.53 from our list:
we want to use it for forwarding. At the same time, we shouldn't
offer 127.0.0.53 to the guest or container either.

With this change, I'm only covering the case of automatically
configured DNS servers from /etc/resolv.conf. We could extend this to
addresses configured with command-line options, but I don't really
see a likely use case at this point.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
Stefano Brivio
7656a6f888 conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway
Seen in a Google Compute Engine environment with a machine configured
via cloud-init-dhcp, while testing Podman integration for pasta: the
assigned address has a /32 netmask, and there's a default route,
which can be added on the host because there's another route, also
/32, pointing to the default gateway. For example, on the host:

  ip -4 address add 10.156.0.2/32 dev eth0
  ip -4 route add 10.156.0.1/32 dev eth0
  ip -4 route add default via 10.156.0.1

This is not a valid configuration as far as I can tell: if the
address is configured as /32, it shouldn't be used to reach a gateway
outside its derived netmask. However, Linux allows that, and
everything works.

The problem comes when pasta --config-net sources address and default
route from the host, and it can't configure the route in the target
namespace because the gateway is invalid. That is, we would skip
configuring the first route in the example, which results in the
equivalent of doing:

  ip -4 address add 10.156.0.2/32 dev eth0
  ip -4 route add default via 10.156.0.1

where, at this point, 10.156.0.1 is unreachable, and hence invalid
as a gateway.

Sourcing more routes than just the default is doable, but probably
undesirable: pasta users want to provide connectivity to a container,
not reflect exactly whatever trickery is configured on the host.

Add a consistency check and an adjustment: if the configured default
gateway is not reachable, shrink the given netmask until we can reach
it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:32 +01:00
David Gibson
de93acbe70 tcp: Correct function comments for address types
A number of functions describe themselves as taking a pointer to 'sin_addr
or sin6_addr'.  Those are field names, not type names.  Replace them with
the correct type names, in_addr or in6_addr.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:30 +01:00
David Gibson
f7653a1446 Use endian-safer typing in struct tap4_l4_t
We recently converted to using struct in_addr rather than bare in_addr_t
or uint32_t to represent IPv4 addresses in network order.  This makes it
harder forget to apply the correct endian conversions.

We omitted the IPv4 addresses stored in struct tap4_l4_t, however.  Convert
those as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:26 +01:00
David Gibson
7c7b68dbe0 Use typing to reduce chances of IPv4 endianness errors
We recently corrected some errors handling the endianness of IPv4
addresses.  These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.

To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses.  This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.

We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values.  This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:24 +01:00