Commit graph

790 commits

Author SHA1 Message Date
Stefano Brivio
e67039f712 checksum: Fix calculation for ICMP checksum on IPv4
We need to zero out the checksum field before calculating the
checksum, of course. I have no idea how this passed the "icmp" test
set, looking into it.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 67ab617172 ("Add csum_icmp4() helper for calculating ICMP checksums")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-26 06:28:06 +02:00
Stefano Brivio
c11277b94f conf: Don't pass leading ~ to parse_port_range() on exclusions
Commit 84fec4e998 ("Clean up parsing of port ranges") drops the
strspn() call before the parsing of excluded port ranges, because now
we're checking against any stray characters at every step.

However, that also has the effect of passing ~ as first character to
the new parse_port_range(), which makes no sense: we already checked
that ~ is the first character before the call, so skip it.

Alona reported this output:
  Invalid port specifier ~15000,~15001,~15006,~15008,~15020,~15021,~15090

while the whole specifier is indeed valid.

Reported-by: Alona Paz <alkaplan@redhat.com>
Fixes: 84fec4e998 ("Clean up parsing of port ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-24 14:37:22 +02:00
Stefano Brivio
b68da100ba util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
...instead of one fourth. On the main() -> conf() -> nl_sock_init()
call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
everything: nl_sock_init() is effectively part of main(), after
commit 3e2eb4337b ("conf: Bind inbound ports with
CAP_NET_BIND_SERVICE before isolate_user()").

This means we exceed the maximum stack size, and we get SIGSEGV,
under any condition, at start time, as reported by Andrea on a recent
build for CentOS Stream 9.

The calculation of NS_FN_STACK_SIZE, which is the stack size we
reserve for clones, was previously obtained by dividing the maximum
stack size by two, to avoid an explicit check on architecture (on
PA-RISC, also known as hppa, the stack grows up, so we point the
clone to the middle of this area), and then further divided by two
to allow for any additional usage in the caller.

Well, if there are essentially no function calls anymore, this is
not enough. Divide it by eight, which is anyway much more than
possibly needed by any clone()d callee.

I think this is robust, so it's a fix in some sense. Strictly
speaking, though, we have no formal guarantees that this isn't
either too little or too much.

What we should do, eventually: check cloned() callees, there are just
thirteen of them at the moment. Note down any stack usage (they are
mostly small helpers), bonus points for an automated way at build
time, quadruple that or so, to allow for extreme clumsiness, and use
as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.

Reported-by: Andrea Bolognani <abologna@redhat.com>
Fixes: 3e2eb4337b ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-22 08:46:57 +02:00
Andrea Bolognani
5715a297a7 Add git-publish configuration file
Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-22 03:45:50 +02:00
Andrea Bolognani
b944ca1855 qrap: Support JSON syntax for -device
Starting with version 8.1.0, libvirt uses JSON syntax when
generating the arguments to -device, so they will now look like

  {"driver":"virtio-scsi-pci","bus":"pci.3","addr":"0x0"}

instead of

  virtio-scsi-pci,bus=pci.3,addr=0x0

qrap needs to parse these arguments and extract the bus number
in order to figure out what address to use for the virtio-net
device it adds, and the libvirt change described above has
broken this parsing logic.

Tweak the code so that both styles are accepted and handled
correctly.

Note that, when JSON is in use, qrap needs to generate its own
command line options in that format as well or things will not
work as expected.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-21 11:43:45 +02:00
David Gibson
c6845f60a0 dhcp: Use tap_udp4_send() helper in dhcp()
The IPv4 specific dhcp() manually constructs L2 and IP headers to send its
DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the
tap_udp6_send() helper.  Now that we've broaded the parameters to
tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:35:00 +02:00
David Gibson
2dbc622f54 tap: Split tap_ip4_send() into UDP and ICMP variants
tap_ip4_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation.  By using a suitable
helper we can split it into tap_udp4_send() and tap_icmp4_send() functions
without greatly increasing the code size, this removing that layering
violation.

We make some small changes to the interface while there.  In both cases
we make the destination IPv4 address a parameter, which will be useful
later.  For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header.  For the ICMP variant we pass in the ICMP
header as before.  The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:56 +02:00
David Gibson
db07804d26 ndp: Use tap_icmp6_send() helper
We send ICMPv6 packets to the guest from both icmp.c and from ndp.c.  The
case in ndp() manually constructs L2 and IPv6 headers, unlike the version
in icmp.c which uses the tap_icmp6_send() helper from tap.c  Now that we've
broaded the parameters of tap_icmp6_send() we can use it in ndp() as well
saving some duplicated logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:53 +02:00
David Gibson
cb1edae3b5 ndp: Remove unneeded eh_source parameter
ndp() takes a parameter giving the ethernet source address of the packet
it is to respond to, which it uses to determine the destination address to
send the reply packet to.

This is not necessary, because the address will always be the guest's
MAC address.  Even if the guest has just changed MAC address, then either
tap_handler_passt() or tap_handler_pasta() - which are the only call paths
leading to ndp() will have updated c->mac_guest with the new value.

So, remove the parameter, and just use c->mac_guest, making it more
consistent with other paths where we construct packets to send inwards.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:51 +02:00
David Gibson
9d8dd8b6f4 tap: Split tap_ip6_send() into UDP and ICMP variants
tap_ip6_send() has special case logic to compute the checksums for UDP
and ICMP packets, which is a mild layering violation.  By using a suitable
helper we can split it into tap_udp6_send() and tap_icmp6_send() functions
without greatly increasing the code size, this removing that layering
violation.

We make some small changes to the interface while there.  In both cases
we make the destination IPv6 address a parameter, which will be useful
later.  For the UDP variant we make it take just the UDP payload, and it
will generate the UDP header.  For the ICMP variant we pass in the ICMP
header as before.  The inconsistency is because that's what seems to be
the more natural way to invoke the function in the callers in each case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:48 +02:00
David Gibson
f616ca231e Split tap_ip_send() into IPv4 and IPv6 specific functions
The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and
it turns out that every caller (statically) knows if it is using IPv4 or
IPv6.  So split into separate tap_ip4_send() and tap_ip6_send() functions.
Use a new tap_l2_hdr() function for the very small common part.

While we're there, make some minor cleanups:
  - We were double writing some fields in the IPv6 header, so that it
    temporary matched the pseudo-header for checksum calculation.  With
    recent checksum reworks, this isn't neccessary any more.
  - We don't use any IPv4 header options, so use some sizeof() constructs
    instead of some open coded values for header length.
  - The comment used to say that the flow label was for TCP over IPv6, but
    in fact the only thing we used it for was DHCPv6 over UDP traffic

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:45 +02:00
David Gibson
fb5d1c5d7d tap: Remove unhelpeful vnet_pre optimization from tap_send()
Callers of tap_send() can optionally use a small optimization by adding
extra space for the 4 byte length header used on the qemu socket interface.
tap_ip_send() is currently the only user of this, but this is used only
for "slow path" ICMP and DHCP packets, so there's not a lot of value to
the optimization.

Worse, having the two paths here complicates the interface and makes future
cleanups difficult, so just remove it.  I have some plans to bring back the
optimization in a more general way in future, but for now it's just in the
way.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:43 +02:00
David Gibson
f72b63e92f Remove support for TCP packets from tap_ip_send()
tap_ip_send() is never used for TCP packets, we're unlikely to use it for
that in future, and the handling of TCP packets makes other cleanups
unnecessarily awkward.  Remove it.

This is the only user of csum_tcp4(), so we can remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:40 +02:00
David Gibson
a2eb2d310a Add helpers for normal inbound packet destination addresses
tap_ip_send() doesn't take a destination address, because it's specifically
for inbound packets, and the IP addresses of the guest/namespace are
already known to us.  Rather than open-coding this destination address
logic, make helper functions for it which will enable some later cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:38 +02:00
David Gibson
3d8ccb44a6 Add csum_ip4_header() helper to calculate IPv4 header checksums
We calculate IPv4 header checksums in at least two places, in dhcp() and
in tap_ip_send.  Add a helper to handle this calculation in both places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:34 +02:00
David Gibson
bd4be308fc Add csum_udp4() helper for calculating UDP over IPv4 checksums
At least two places in passt fill in UDP over IPv4 checksums, although
since UDP checksums are optional with IPv4 that just amounts to storing
a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in
dhcp()).  For consistency, add a helper for this "calculation".

Just for the heck of it, add the option (compile time disabled for now) to
calculate real UDP checksums.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:32 +02:00
David Gibson
6905ac75ec Add csum_udp6() helper for calculating UDP over IPv6 checksums
Add a helper for calculating UDP checksums when used over IPv6
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed.  It also allows the UDP header and payload to
be in separate buffers, although we don't use this yet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:29 +02:00
David Gibson
67ab617172 Add csum_icmp4() helper for calculating ICMP checksums
Although tap_ip_send() is currently the only place calculating ICMP
checksums, create a helper function for symmetry with ICMPv6.  For
future flexibility it allows the ICMPv6 header and payload to be in
separate buffers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:26 +02:00
David Gibson
7abd2b0d72 Add csum_icmp6() helper for calculating ICMPv6 checksums
At least two places in passt calculate ICMPv6 checksums, ndp() and
tap_ip_send().  Add a helper to handle this calculation in both places.
For future flexibility, the new helper takes parameters for the fields in
the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to
be explicitly constructed.  It also allows the ICMPv6 header and payload to
be in separate buffers, although we don't use this yet.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-19 03:34:21 +02:00
Stefano Brivio
b3f359167b passt.1: Add David to AUTHORS
I just realised while reading the man page.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
3e2eb4337b conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()
Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in
the target user namespace as we isolate the process, which means
we're unable to bind to low ports at that point.

Bind inbound ports, and only those, before isolate_user(). Keep the
handling of outbound ports (for pasta mode only) after the setup of
the namespace, because that's where we'll bind them.

To this end, initialise the netlink socket for the init namespace
before isolate_user() as well, as we actually need to know the
addresses of the upstream interface before binding ports, in case
they're not explicitly passed by the user.

As we now call nl_sock_init() twice, checking its return code from
conf() twice looks a bit heavy: make it exit(), instead, as we
can't do much if we don't have netlink sockets.

While at it:

- move the v4_only && v6_only options check just after the first
  option processing loop, as this is more strictly related to
  option parsing proper

- update the man page, explaining that CAP_NET_BIND_SERVICE is
  *not* the preferred way to bind ports, because passt and pasta
  can be abused to allow other processes to make effective usage
  of it. Add a note about the recommended sysctl instead

- simplify nl_sock_init_do() now that it's called once for each
  case

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
40abd447c8 Rename pasta_setup_ns() to pasta_spawn_cmd()
pasta_setup_ns() no longer has much to do with setting up a namespace.
Instead it's really about starting the shell or other command we want to
run with pasta connectivity.  Rename it and its argument structure to be
less misleading.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
eb3d03a588 isolation: Only configure UID/GID mappings in userns when spawning shell
When in passt mode, or pasta mode spawning a command, we create a userns
for ourselves.  This is used both to isolate the pasta/passt process itself
and to run the spawned command, if any.

Since eed17a47 "Handle userns isolation and dropping root at the same time"
we've handled both cases the same, configuring the UID and GID mappings in
the new userns to map whichever UID we're running as to root within the
userns.

This mapping is desirable when spawning a shell or other command, so that
the user gets a root shell with reasonably clear abilities within the
userns and netns.  It's not necessarily essential, though.  When not
spawning a shell, it doesn't really have any purpose: passt itself doesn't
need to be root and can operate fine with an unmapped user (using some of
the capabilities we get when entering the userns instead).

Configuring the uid_map can cause problems if passt is running with any
capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to
allow it to forward low ports.  In this case the kernel makes files in
/proc/pid owned by root rather than the starting user to prevent the user
from interfering with the operation of the capability-enhanced process.
This includes uid_map meaning we are not able to write to it.

Whether this behaviour is correct in the kernel is debatable, but in any
case we might as well avoid problems by only initializing the user mappings
when we really want them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
fb449b16bd isolation: Prevent any child processes gaining capabilities
We drop our own capabilities, but it's possible that processes we exec()
could gain extra privilege via file capabilities.  It shouldn't be possible
for us to exec() anyway due to seccomp() and our filesystem isolation.  But
just in case, zero the bounding and inheritable capability sets to prevent
any such child from gainin privilege.

Note that we do this *after* spawning the pasta shell/command (if any),
because we do want the user to be able to give that privilege if they want.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
c22ebccba8 isolation: Replace drop_caps() with a version that actually does something
The current implementation of drop_caps() doesn't really work because it
attempts to drop capabilities from the bounding set.  That's not the set
that really matters, it's about limiting the abilities of things we might
later exec() rather than our own capabilities.  It also requires
CAP_SETPCAP which we won't usually have.

Replace it with a new version which uses setcap(2) to drop capabilities
from the effective and permitted sets.  For now we leave the inheritable
set as is, since we don't want to preclude the user from passing
inheritable capabilities to the command spawed by pasta.

Correctly dropping caps reveals that we were relying on some capabilities
we'd supposedly dropped.  Re-divide the dropping of capabilities between
isolate_initial(), isolate_user() and isolate_prefork() to make this work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
ceb2061587 isolation: Refactor isolate_user() to allow for a common exit path
Currently, isolate_user() exits early if the --netns-only option is given.
That works for now, but shortly we're going to want to add some logic to
go at the end of isolate_user() that needs to run in all cases: joining a
given userns, creating a new userns, or staying in our original userns
(--netns-only).

To avoid muddying those changes, here we reorganize isolate_user() to have
a common exit path for all cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
ea5936dd3f Replace FWRITE with a function
In a few places we use the FWRITE() macro to open a file, replace it's
contents with a given string and close it again.  There's no real
reason this needs to be a macro rather than just a function though.
Turn it into a function 'write_file()' and make some ancillary
cleanups while we're there:
  - Add a return code so the caller can handle giving a useful error message
  - Handle the case of short write()s (unlikely, but possible)
  - Add O_TRUNC, to make sure we replace the existing contents entirely

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
096e48669b isolation: Clarify various self-isolation steps
We have a number of steps of self-isolation scattered across our code.
Improve function names and add comments to make it clearer what the self
isolation model is, what the steps do, and why they happen at the points
they happen.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
6909a8e339 Remove unhelpful drop_caps() call in pasta_start_ns()
drop_caps() has a number of bugs which mean it doesn't do what you'd
expect.  However, even if we fixed those, the call in pasta_start_ns()
doesn't do anything useful:

* In the common case, we're UID 0 at this point.  In this case drop_caps()
  doesn't accomplish anything, because even with capabilities dropped, we
  are still privileged.
* When attaching to an existing namespace with --userns or --netns-only
  we might not be UID 0.  In this case it's too early to drop all
  capabilities: we need at least CAP_NET_ADMIN to configure the
  tap device in the namespace.

Remove this call - we will still drop capabilities a little later in
sandbox().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
01b4e71f7a pasta_start_ns() always ends in parent context
The end of pasta_start_ns() has a test against pasta_child_pid, testing
if we're in the parent or the child.  However we started the child running
the pasta_setup_ns function which always exec()s or exit()s, so if we
return from the clone() we are always in the parent, making that test
unnecessary.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
672a8cd80e pasta: More general way of starting spawned shell as a login shell
When invoked so as to spawn a shell, pasta checks explicitly for the
shell being bash and if so, adds a "-l" option to make it a login shell.
This is not ideal, since this is a bash specific option and requires pasta
to know about specific shell variants.

There's a general convention for starting a login shell, which is to
prepend a "-" to argv[0].  Use this approach instead, so we don't need bash
specific logic.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
David Gibson
f99e9a3338 test: Move slower tests to end of test run
The distro and performance tests are by far the slowest part of the passt
testsuite.  Move them to the end of the testsuite run, so that it's easier
to do a quick test during development by letting the other tests run then
interrupting the test runner.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
Stefano Brivio
7f2a7396e2 log.h: Avoid unnecessary GNU extension for token pasting
clang says:

  ./log.h:23:18: warning: token pasting of ',' and __VA_ARGS__ is a GNU extension [-Wgnu-zero-variadic-macro-arguments]

We need token pasting here just because of the 'format' in trace():
drop it.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
Stefano Brivio
55cdcc159b util.h: Add missing gcc pragma push before pragma pop
While building with clang:

  ./util.h:176:24: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
abbe01af59 icmp: Set sin6_scope_id for outbound ICMPv6 echo requests
If we ping a link-local address, we need to pass this to sendto(), as
it will obviously fail with -EINVAL otherwise.

If we ping other addresses, it's probably a good idea anyway to
specify the configured outbound interface here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
57e2c066e9 conf: Drop excess colons in usage for DHCP and DNS options
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
6acf89638b netlink: Disable duplicate address detection for configured IPv6 address
With default options, when we pass --config-net, the IPv6 address is
actually going to be recycled from the init namespace, so it is in
fact duplicated, but duplicate address detection has no way to find
out.

With a different configured address, that's not the case, but anyway
duplicate address detection will be unable to see this.

In both cases, we're wasting time for nothing.

Pass the IFA_F_NODAD flag as we configure globally scoped IPv6
addresses via netlink.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
6f3e38cac5 Don't create 'tap' socket for ports that are bound to loopback only
If the user specifies an explicit loopback address for a port
binding, we're going to use that address for the 'tap' socket, and
the same exact address for the 'spliced' socket (because those are,
by definition, only bound to loopback addresses).

This means that the second binding will fail, and, unexpectedly, the
port is forwarded, but via tap device, which means the source address
in the namespace won't be a loopback address.

Make it explicit under which conditions we're creating which kind of
socket, by refactoring tcp_sock_init() into two separate functions
for IPv4 and IPv6 and gathering those conditions at the beginning.

Also, don't create spliced sockets if the user specifies explicitly
a non-loopback address, those are harmless but not desired either.

Fixes: 3c6ae62510 ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
Stefano Brivio
d0dd0242a6 tcp, tcp_splice: Fix port remapping for inbound, spliced connections
In pasta mode, when we receive a new inbound connection, we need to
select a socket that was created in the namespace to proceed and
connect() it to its final destination.

The existing condition might pick a wrong socket, though, if the
destination port is remapped, because we'll check the bitmap of
inbound ports using the remapped port (stored in the epoll reference)
as index, and not the original port.

Instead of using the port bitmap for this purpose, store this
information in the epoll reference itself, by adding a new 'outbound'
bit, that's set if the listening socket was created the namespace,
and unset otherwise.

Then, use this bit to pick a socket on the right side.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 33482d5bf2 ("passt: Add PASTA mode, major rework")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
eab9d8d5d6 tcp, tcp_splice: Adjust comments to current meaning of inbound and outbound
For tcp_sock_init_ns(), "inbound" connections used to be the ones
being established toward any listening socket we create, as opposed
to sockets we connect().

Similarly, tcp_splice_new() used to handle "inbound" connections in
the sense that they originated from listening sockets, and they would
in turn cause a connect() on an "outbound" socket.

Since commit 1128fa03fe ("Improve types and names for port
forwarding configuration"), though, inbound connections are more
broadly defined as the ones directed to guest or namepsace, and
outbound the ones originating from there.

Update comments for those two functions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
346da48fe6 udp: Fix port and address checks for DNS forwarder
First off, as we swap endianness for source ports in
udp_fill_data_v{4,6}(), we want host endianness, not network
endianness. It doesn't actually matter if we use htons() or ntohs()
here, but the current version is confusing.

In the IPv4 path, when we remap DNS answers, we already swapped the
endianness as needed for the source port: don't swap it again,
otherwise we'll not map DNS answers for IPv4.

In the IPv6 path, when we remap DNS answers, we want to check that
they came from our upstream DNS server, not the one configured via
--dns-forward (which doesn't even need to exist for this
functionality to work).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
2970dc257c tap: Don't check sequence counts when adding packets to pool
This is a minor optimisation possibility I spotted while trying to
debug a hang in tap4_handler(): if we run out of space for packet
sequences, it's fine to add packets to an existing per-sequence pool.

We should check the count of packet sequences only once we realise
that we actually need a new packet sequence.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:36 +02:00
Stefano Brivio
cc65f31250 packet: Fix off-by-one in packet_get_do() sanity checks
An n-sized pool, or a pool with n entries, doesn't include index n,
only up to n - 1.

I'm not entirely sure this sanity check actually covers any
practical case, but I spotted this while debugging a hang in
tap4_handler() (possibly due to malformed sequence entries from
qemu).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
10236de486 conf: Report usage for --no-netns-quit
Fixes: 745a9ba428 ("pasta: By default, quit if filesystem-bound net namespace goes away")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
c1eff9a3c6 conf, tcp, udp: Allow specification of interface to bind to
Since kernel version 5.7, commit c427bfec18f2 ("net: core: enable
SO_BINDTODEVICE for non-root users"), we can bind sockets to
interfaces, if they haven't been bound yet (as in bind()).

Introduce an optional interface specification for forwarded ports,
prefixed by %, that can be passed together with an address.

Reported use case: running local services that use ports we want
to have externally forwarded:
  https://github.com/containers/podman/issues/14425

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
a62ed181db conf, tap: Add option to quit once the client closes the connection
This is practical to avoid explicit lifecycle management in users,
e.g. libvirtd, and is trivial to implement.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
9de65dd3f4 util: Check return value of lseek() while reading bound ports from procfs
Coverity now noticed we're checking most lseek() return values, but
not this. Not really relevant, but it doesn't hurt to check we can
actually seek before reading lines.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
Stefano Brivio
e23024ccff conf, log, Makefile: Add versioning information
Add a --version option displaying that, and also include this
information in the log files.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-10-15 02:10:28 +02:00
Stefano Brivio
2074b332f9 log: Add missing function comment for trace_init()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-14 17:38:58 +02:00
Stefano Brivio
01efc71ddd log, conf: Add support for logging to file
In some environments, such as KubeVirt pods, we might not have a
system logger available. We could choose to run in foreground, but
this takes away the convenient synchronisation mechanism derived from
forking to background when interfaces are ready.

Add optional logging to file with -l/--log-file and --log-size.

Unfortunately, this means we need to duplicate features that are more
appropriately implemented by a system logger, such as rotation. Keep
that reasonably simple, by using fallocate() with range collapsing
where supported (Linux kernel >= 3.15, extent-based ext4 and XFS) and
falling back to an unsophisticated block-by-block moving of entries
toward the beginning of the file once we reach the (mandatory) size
limit.

While at it, clarify the role of LOG_EMERG in passt.c.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-14 17:38:28 +02:00