Commit graph

1734 commits

Author SHA1 Message Date
David Gibson
8d4baa4446 Clarify which addresses in ip[46]_ctx are meaningful where
Some are guest visible addresses and may not be valid on the host, others
are host visible addresses and may not be valid on the guest.  Rearrange
and comment the ip[46]_ctx definitions to make it clearer which is which.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:19 +02:00
David Gibson
a42fb9c000 treewide: Change misleading 'addr_ll' name
c->ip6.addr_ll is not like c->ip6.addr.  The latter is an address for the
guest, but the former is an address for our use on the tap link.  Rename it
accordingly, to 'our_tap_ll'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:16 +02:00
David Gibson
c9f0ec3227 util: Correct sock_l4() binding for link local addresses
When binding an IPv6 socket in sock_l4() we need to supply a scope id
if the address is link-local.  We check for this by comparing the
given address to c->ip6.addr_ll.  This is correct only by accident:
while c->ip6.addr_ll is typically set to the host interface's link
local address, the actual purpose of it is to provide a link local
address for passt's private use on the tap interface.

Instead set the scope id for any link-local address we're binding to.
We're going to need something and this is what makes sense for sockets
on the host.  It doesn't make sense for PIF_SPLICE sockets, but those
should always have loopback, not link-local addresses.
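
A minimal sketch of the rule described above, assuming a hypothetical
helper fill_bind_addr6() and a caller-supplied interface index; this is
not the actual sock_l4() code:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: prepare a sockaddr_in6 for bind(), setting the scope
 * id only if the address is link-local (fe80::/10).
 */
void fill_bind_addr6(struct sockaddr_in6 *sa, const struct in6_addr *addr,
		     in_port_t port, unsigned int ifindex)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;
	sa->sin6_port = htons(port);
	sa->sin6_addr = *addr;

	/* A link-local address is ambiguous without an interface */
	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sin6_scope_id = ifindex;
}
```

The check depends only on the address being bound, not on a comparison
with any particular configured address.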

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:13 +02:00
David Gibson
57532f1ded conf: Remove incorrect initialisation of addr_ll_seen
Despite the names, addr_ll_seen does not relate to addr_ll the same
way addr_seen relates to addr.  addr_ll_seen is an observed address
from the guest, whereas addr_ll is *our* link-local address for use on
the tap link when we can't use an external endpoint address.  It's
used both for passt provided services (DHCPv6, NDP) and in some cases
for connections from addresses the guest can't access.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:10 +02:00
David Gibson
0b25cac94e conf: Treat --dns addresses as guest visible addresses
Although it's not 100% explicit in the man page, addresses given to
the --dns option are intended to be addresses as seen by the guest.
This differs from addresses taken from the host's /etc/resolv.conf,
which must be translated to guest accessible versions in some cases.

Our implementation is currently inconsistent on this: when using
--dns-forward, you must usually also give --dns with the matching address,
which is meaningful only in the guest's address view.  However if you give
--dns with a loopback address, it will be translated like a host view
address.

Move the remapping logic for DNS addresses out of add_dns4() and add_dns6()
into add_dns_resolv() so that it is only applied for host nameserver
addresses, not for nameservers given explicitly with --dns.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:08 +02:00
David Gibson
a6066f4e27 conf: Correct setting of dns_match address in add_dns6()
add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to
the given address, rather than the gateway address.  This is doubly wrong:
 - We've just established the given address is a host loopback address
   the guest can't access
 - We've just set ip6.dns[] to tell the guest to use the gateway address,
   so it won't use the dns_match address we're setting

Correct this to use the gateway address, like IPv4.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:06 +02:00
David Gibson
7c083ee41c conf: Move adding of a nameserver from resolv.conf into subfunction
get_dns() is already quite deeply nested, and future changes I have in
mind will add more complexity.  Prepare for this by splitting out the
adding of a single nameserver to the configuration into its own function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:04 +02:00
David Gibson
1d10760c9f conf: Move DNS array bounds checks into add_dns[46]
Every time we call add_dns[46] we need to first check if there's space in
the c->ip[46].dns array for the new entry.  We might as well make that
check in add_dns[46]() itself.

In fact it looks like the calls in get_dns() had an off by one error, not
allowing the last entry of the array to be filled.  So, that bug is also
fixed by the change.
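
As an illustration only, with a made-up dns_list structure and MAXNS
value rather than the real c->ip[46].dns arrays, a bounds check kept
inside the add function might look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAXNS 3		/* assumed array size, for illustration */

struct dns_list {
	uint32_t addr[MAXNS];
	size_t count;	/* index of the next free slot */
};

/* Returns false if the array is full: callers need no separate check */
bool dns_list_add(struct dns_list *l, uint32_t addr)
{
	if (l->count >= MAXNS)	/* '>= MAXNS - 1' would waste the last slot */
		return false;

	l->addr[l->count++] = addr;
	return true;
}
```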

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:02 +02:00
David Gibson
6852bd07cc conf: More accurately count entries added in get_dns()
get_dns() counts the number of guest DNS servers it adds, and gives an
error if it couldn't add any.  However, this count ignores the fact that
add_dns[46]() may in some cases *not* add an entry.  Use the array indices
we're already tracking to get an accurate count.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:00 +02:00
David Gibson
c679894668 conf: Use array indices rather than pointers for DNS array slots
Currently add_dns[46]() take a somewhat awkward double pointer to the
entry in the c->ip[46].dns array to update.  It turns out to be easier to
work with indices into that array instead.

This diff does add some lines, but it's comments, and will allow some
future code reductions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:58 +02:00
David Gibson
ceea52ca93 treewide: Use struct assignment instead of memcpy() for IP addresses
We rely on C11 already, so we can use clearer and more type-checkable
struct assignment instead of memcpy() for copying IP addresses around.

This exposes some "pointer could be const" warnings from cppcheck, so
address those too.
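
A small sketch of the pattern, using a hypothetical ctx_sketch structure
in place of the real context:

```c
#include <netinet/in.h>

/* Hypothetical stand-in for the address fields of the context */
struct ctx_sketch {
	struct in_addr addr4;
	struct in6_addr addr6;
};

void ctx_set_addrs(struct ctx_sketch *c,
		   const struct in_addr *a4, const struct in6_addr *a6)
{
	/* Was: memcpy(&c->addr4, a4, sizeof(c->addr4)); */
	c->addr4 = *a4;

	/* The compiler now rejects a source of the wrong type */
	c->addr6 = *a6;
}
```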

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:56 +02:00
David Gibson
905ecd2b0b treewide: Rename MAC address fields for clarity
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it.  Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac".  Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:54 +02:00
David Gibson
066e69986b util: Helper for formatting MAC addresses
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display.  Add an eth_ntop() helper for
this.
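
For illustration, a helper of this kind could look roughly like the
following. The name mac_to_str() and its inet_ntop()-like signature are
assumptions; the actual eth_ntop() may differ:

```c
#include <stddef.h>
#include <stdio.h>

/* Format a 6-byte MAC address as "xx:xx:xx:xx:xx:xx".
 * Returns dst, or NULL if the buffer is too small.
 */
const char *mac_to_str(const unsigned char mac[6], char *dst, size_t size)
{
	int n = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
			 mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

	return (n < 0 || (size_t)n >= size) ? NULL : dst;
}
```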

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:51 +02:00
David Gibson
e6feb5a892 treewide: Use "our address" instead of "forwarding address"
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing.  As discussed on a
recent call, let's try "our" instead.

(While we're there, correct an error in flow_initiate_af()'s comments, where
we referred to parameters by the wrong name.)

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:29 +02:00
Stefano Brivio
32c386834d netlink: Fix typo in function comment for nl_addr_set()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
f4e9f26480 pasta: Disable neighbour solicitations on device up to prevent DAD
As soon as the kernel notifier for IPv6 address configuration
(addrconf_notify()) sees that we bring the target interface up
(NETDEV_UP), it will schedule duplicate address detection, so, by
itself, setting the nodad flag later is useless, because that won't
stop a detection that's already in progress.

However, if we disable neighbour solicitations with IFF_NOARP (which
is a misnomer for IPv6 interfaces, but there's no possibility of
mixing things up), the notifier will not trigger DAD, because it can't
be done, of course, without neighbour solicitations.

Set IFF_NOARP as we bring up the device, and drop it after we've had a
chance to set the nodad attribute on the link.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
d6f0220731 netlink, pasta: Fetch link-local address from namespace interface once it's up
As soon as we bring up the interface, the Linux kernel will set up a
link-local address for it, so we can fetch it and start using it right
away, if we need a link-local address to communicate to the container
before we see any traffic coming from it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:52 +02:00
Stefano Brivio
74e508cf79 netlink, pasta: Disable DAD for link-local addresses on namespace interface
It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we'll
anyway not relay neighbour solicitations with an unspecified source
address.

While they perform duplicate address detection, the link-local address
is not usable, which prevents us from bringing up containers, in
particular, and communicating with them right away via IPv6.

This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.

As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.

Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-18 01:29:38 +02:00
Stefano Brivio
0c74068f56 netlink, pasta: Turn nl_link_up() into a generic function to set link flags
In the next patches, we'll reuse it to set flags other than IFF_UP.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:47 +02:00
Stefano Brivio
8231ce54c3 netlink, pasta: Split MTU setting functionality out of nl_link_up()
As we'll use nl_link_up() for more than just bringing up devices, it
will become awkward to carry empty MTU values around whenever we call
it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:43 +02:00
Stefano Brivio
b91d3373ac netlink: Fix typo in function comment for nl_addr_get()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:14:29 +02:00
Stefano Brivio
946206437a test: Speed up by cutting on eye candy and performance test duration
We have a number of delays when we switch to new layouts that were
added to make the tests visually easier to follow, together with
blinking status bars. Shorten the delays and avoid blinking the
status bar if $FAST is set to 1 (no demo mode).

Shorten delays in busy loops to 10ms, instead of 100ms, and skip the
one-second fixed delay when we wait for the status of a command.

Cut the duration of throughput and latency tests to one second, down
from ten. Somewhat surprisingly, the results we get are rather
consistent, and not significantly different from what we'd get with
10 seconds.

This, together with Podman's commit 20f3e8909e3a ("test/system:
pasta_test_do add explicit port check"), cuts the time needed on my
setup for a full test run from approximately 37 minutes to...:

  $ time ./run
  [exited]
  PASS: 165, FAIL: 0
  Log at /home/sbrivio/passt/test/test_logs/test.log

  real	15m34.253s
  user	0m0.011s
  sys	0m0.011s

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-15 09:13:15 +02:00
David Gibson
61c0b0d0f1 flow: Don't crash if guest attempts to connect to port 0
Using a zero port on TCP or UDP is dubious, and we can't really deal with
forwarding such a flow within the constraints of the socket API.  Hence
we ASSERT()ed that we had non-zero ports in flow_hash().

The intention was to make sure that the protocol code sanitizes such ports
before completing a flow entry.  Unfortunately, flow_hash() is also called
on new packets to see if they have an existing flow, so the unsanitized
guest packet can crash passt with the assert.

Correct this by moving the assert from flow_hash() to flow_sidx_hash()
which is only used on entries already in the table, not on unsanitized
data.

Reported-by: Matt Hamilton <matt@thmail.io>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-14 12:20:31 +02:00
David Gibson
baba284912 conf: Don't ignore -t and -u options after -D
f6d5a52392 moved handling of -D into a later loop.  However as a side
effect it moved this from a switch block to an if block.  I left a couple
of 'break' statements that don't make sense in the new context.  They
should be 'continue' so that we go onto the next option, rather than
leaving the loop entirely.
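
The difference can be shown with a toy loop over option characters
(a hypothetical count_t_opts(), not the conf() code):

```c
/* Count 't' options in a string of option characters, skipping 'D' */
int count_t_opts(const char *opts)
{
	int n = 0;

	for (; *opts; opts++) {
		if (*opts == 'D')
			continue;	/* 'break' would ignore all later options */

		if (*opts == 't')
			n++;
	}

	return n;
}
```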

Fixes: f6d5a52392 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-14 09:14:12 +02:00
AbdAlRahman Gad
c16141eda5 ndp.c: Turn NDP responder into more declarative implementation
- Add structs for NA, RA, NS, MTU, prefix info, option header,
  link-layer address, RDNSS, DNSSL and link-layer for RA message.

- Turn NA message from purely imperative, going byte by byte,
  to declarative by filling its struct.

- Turn part of RA message into declarative.

- Move packet_add() to be before the call of ndp() in tap6_handler()
  if the protocol of the packet is ICMPv6.

- Add a pool of packets as an additional parameter to ndp().

- Check the size of NS packet with packet_get() before sending an NA
  packet.

- Add documentation for the structs.

- Add an enum for NDP option types.

Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-13 19:46:16 +02:00
David Gibson
f6d5a52392 conf: Delay handling -D option until after addresses are configured
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.

Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine.  However, they can also be called earlier
if an explicit -D command line option is given.  In this case no_map_gw
and/or c->ip[46].gw may not yet be initialised properly, leading to this
doing the wrong thing.

Luckily we already have a second pass of option parsing for things which
need addresses to already be configured.  Move handling of -D to there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-12 21:29:36 +02:00
David Gibson
86bdd968ea Correct inaccurate comments on ip[46]_ctx::addr
These fields are described as being an address for an external, routable
interface.  That's not necessarily the case when using -a.  But, more
importantly, saying where the value comes from is not as useful as what
it's used for.  The real purpose of this field is as the address which we
assign to the guest via DHCP or --config-net.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-12 21:29:21 +02:00
Stefano Brivio
fecb1b65b1 log: Don't prefix message with timestamp on --debug if it's a continuation
If we prefix the second part of messages printed through
logmsg_perror() by the timestamp, on debug, we'll have two timestamps
and a weird separator in the result, such as this beauty:

  0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted

Add a parameter to logmsg() and vlogmsg() which indicates a message
continuation. If that's set, don't print the timestamp in vlogmsg().

Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-12 16:21:53 +02:00
Stefano Brivio
baccfb95ce conf: Stop parsing options at first non-option argument
Given that pasta supports specifying a command to be executed on the
command line, even without the usual -- separator as long as there's
no ambiguity, we shouldn't eat up options that are not meant for us.

Paul reports, for instance, that with:

  pasta --config-net ip -6 route

-6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'.
That's because getopt_long(), by default, shuffles the argument list
to shift non-option arguments at the end.

Avoid that by adding '+' at the beginning of 'optstring'.
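
A sketch of the effect, with a reduced, made-up option table (only
--config-net and -6) and a hypothetical parse_until_command() wrapper:

```c
#include <getopt.h>
#include <stddef.h>

/* Parse options up to the first non-option argument.
 * Returns the index of that argument, sets *ipv6_only if -6 came before it.
 */
int parse_until_command(int argc, char **argv, int *ipv6_only)
{
	static const struct option opts[] = {
		{ "config-net",	no_argument,	NULL,	1 },
		{ "ipv6-only",	no_argument,	NULL,	'6' },
		{ NULL,		0,		NULL,	0 },
	};
	int c;

	optind = 0;	/* restart scanning from the beginning */
	*ipv6_only = 0;

	/* Leading '+': stop at the first non-option, don't permute argv */
	while ((c = getopt_long(argc, argv, "+6", opts, NULL)) != -1) {
		if (c == '6')
			*ipv6_only = 1;
	}

	return optind;
}
```

With the arguments from the report above, parsing stops at 'ip', and
the following -6 is left for the command to interpret.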

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-08 21:34:06 +02:00
Stefano Brivio
09603cab28 passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.

This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.

Use close_range(2) to close all open files except for standard streams
and the one from --fd.

Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.

As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
take), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptors numbers that match standard streams,
while at it.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-08-08 21:31:25 +02:00
David Gibson
755f9fd911 nstool: Propagate SIGTERM to processes executed in the namespace
Particularly in shell it's sometimes natural to save the pid from a process
run and later kill it.  If doing this with nstool exec, however, it will
kill nstool itself, not the program it is running, which isn't usually what
you want or expect.

Address this by having nstool propagate SIGTERM to its child process.  It
may make sense to propagate some other signals, but some introduce extra
complications, so we'll worry about them when and if it seems useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:48 +02:00
David Gibson
5ca61c2f34 nstool: Fix some trivial typos
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:45 +02:00
David Gibson
a628cb93a7 log: Avoid duplicate calls to logtime()
We use logtime() to get a timestamp for the log in two places:
  - in vlogmsg(), which is used only for debug_print messages
  - in logfile_write() which is only used for messages to the log file

These cases are mutually exclusive, so we don't ever print the same message
with different timestamps, but that's not particularly obvious to see.
It's possible future tweaks to logging logic could mean we log to two
different places with different timestamps, which would be confusing.

Refactor to have a single logtime() call in vlogmsg() and use it for all
the places we need it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:40 +02:00
David Gibson
2c7558dc43 log: Handle errors from clock_gettime()
clock_gettime() can, theoretically, fail, although it probably won't until
2038 on old 32-bit systems.  Still, it's possible someone could run with
a wildly out of sync clock, or new errors could be added, or it could fail
due to a bug in libc or the kernel.

We don't handle this well.  In the debug_print case in vlogmsg we'll just
ignore the failure, and print a timestamp based on uninitialised garbage.
In logfile_write() we exit early and won't log anything at all, which seems
like a good way to make an already weird situation undebuggable.

Add some helpers to instead handle this by using "<error>" in place of a
timestamp if something goes wrong with clock_gettime().
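
Roughly, and with a hypothetical stamp() helper rather than the helpers
this patch actually adds, the fallback could be sketched as:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <time.h>

/* Write a timestamp for clock 'clk' into buf, or "<error>" on failure */
const char *stamp(clockid_t clk, char *buf, size_t size)
{
	struct timespec ts;

	if (clock_gettime(clk, &ts)) {
		/* Don't format uninitialised data, and don't drop the message */
		snprintf(buf, size, "<error>");
	} else {
		snprintf(buf, size, "%lli.%04li", (long long)ts.tv_sec,
			 (long)(ts.tv_nsec / 100000));
	}

	return buf;
}
```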

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:29 +02:00
David Gibson
b91bae1ded log: Correct formatting of timestamps
logtime_fmt_and_arg() is a rather odd macro, producing both a format
string and an argument, which can only be used in quite specific printf()
like formulations.  It also has a significant bug: it tries to display 4
digits after the decimal point (so down to tenths of milliseconds) using
%04i.  But the field width in printf() is always a *minimum* not maximum
field width, so this will not truncate the given value, but will redisplay
the entire tenth-of-milliseconds difference again after the decimal point.

Replace the macro with an snprintf() like function which will format the
timestamp, and use an explicit % to correct the display.
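
To make the arithmetic concrete, here is a sketch with a hypothetical
fmt_elapsed() function (not the actual logtime_fmt()), assuming a
non-negative difference in microseconds:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format a non-negative delta in microseconds as seconds, four decimals */
int fmt_elapsed(char *buf, size_t size, int64_t delta_us)
{
	int64_t tenths_ms = delta_us / 100;	/* units of 0.1 ms */

	/* %04lli only pads: the explicit % keeps the fraction to 4 digits */
	return snprintf(buf, size, "%lli.%04lli",
			(long long)(tenths_ms / 10000),
			(long long)(tenths_ms % 10000));
}
```

For a difference of 1234567µs this gives "1.2345". Without the modulo,
the fractional part would repeat the whole value, giving "1.12345".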

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Make logtime_fmt() static]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:16:10 +02:00
David Gibson
95569e4aa4 util: Some corrections for timespec_diff_us
The comment for timespec_diff_us() claims it will wrap after 2^64µs.  This
is incorrect for two reasons:
  * It returns a long long, which is probably 64-bits, but might not be
  * It returns a signed value, so even if it is 64 bits it will wrap after
    2^63µs

Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.
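
A sketch of such a function with an explicit int64_t return type. The
name diff_us() is illustrative and the body is not copied from util.c:

```c
#include <stdint.h>
#include <time.h>

/* Difference a - b in microseconds, as an explicitly 64-bit signed value */
int64_t diff_us(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_nsec < b->tv_nsec) {
		/* Borrow one second to keep the nanosecond part positive */
		return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
		       ((int64_t)a->tv_sec - b->tv_sec - 1) * 1000000;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       ((int64_t)a->tv_sec - b->tv_sec) * 1000000;
}
```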

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:15:40 +02:00
Stefano Brivio
fbb0c9523e conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.

That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.

Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.

While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-07 09:15:25 +02:00
Stefano Brivio
ee36266a55 log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).

But if passt is running in foreground, we currently stop reporting any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.

Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.

Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:03:48 +02:00
Stefano Brivio
3a082c4ecb tcp_splice: Fix side in OUT_WAIT flag setting
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to be waiting for is for
output readiness on side 1, not 0, and the other way around as well.

This causes random transfer failures for local TCP connections,
depending on whether we ever need to wait for output readiness.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
2024-08-06 15:03:31 +02:00
David Gibson
031df332e9 util: Use unsigned (size_t) value for iov length
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t.  Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder.  Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:01:46 +02:00
Laurent Vivier
e877f905e5 udp_flow: move all udp_flow functions to udp_flow.c
No code change.

They need to be exported to be available to the vhost-user version of
passt.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:38:17 +02:00
Laurent Vivier
623ceb1f2b udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()
To be used with the vhost-user version of udp.c, the udp_flow functions
need to be exported. To avoid also exporting udp_meta_t, which is specific
to the socket version of udp.c, don't pass udp_meta_t to
udp_flow_from_sock(), but only the field it needs, s_in.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 17:37:53 +02:00
David Gibson
a5bbefa6fb log: Make logfile_write() private
logfile_write() is not used outside log.c, nor should it be.  It should
only be used externally via the general logging functions.  Make it static
in log.c.  To avoid forward declarations this requires moving a bunch of
functions earlier in the file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-05 15:03:33 +02:00
Stefano Brivio
f30ed68c52 pasta: Save errno on signal handler entry, restore on return when needed
Ed reported this:

  # Error: pasta failed with exit code 1:
  # Couldn't drop cap 3 from bounding set
  # : No child processes

in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?

Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:

       Fetching and setting the value of errno is async-signal-safe
       provided that the signal handler saves errno on entry and
       restores its value before returning.

Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.
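
The save and restore pattern can be sketched as follows. This is a
simplified, hypothetical child_handler(), not pasta_child_handler()
itself:

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

/* SIGCHLD handler sketch: reap a child without clobbering errno */
void child_handler(int sig)
{
	int errno_save = errno;
	siginfo_t infop;

	(void)sig;
	infop.si_pid = 0;

	/* May fail with ECHILD, overwriting errno in the interrupted code */
	waitid(P_ALL, 0, &infop, WEXITED | WNOHANG);

	errno = errno_save;
}
```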

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-05 15:02:36 +02:00
Danish Prakash
0149d11cc5 pasta: modify hostname when detaching new namespace
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not, leaving
users a bit confused. This change modifies the hostname in the
new namespace, adding a "pasta-" prefix, to make it a bit more obvious.

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-30 17:27:43 +02:00
AbdAlRahman Gad
8fae3b73cb Fix typo in README file
- remove duplicated 'the' in the 'Services' section

Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-29 19:02:35 +02:00
Stefano Brivio
f87b11c7be fedora/rpkg: List myself as author for changelog entries
...instead of the latest author for contrib/fedora.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 16:40:41 +02:00
David Gibson
57a21d2df1 tap: Improve handling of partially received frames on qemu socket
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.
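
The approach can be sketched with a hypothetical consume_frames()
function working on a plain buffer. This is an illustration under
simplifying assumptions (no validation of the frame length, no actual
frame processing), not the tap.c implementation:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Walk complete frames (32-bit big-endian length, then payload) in
 * buf[0..len), then move any partial frame to the start of the buffer.
 * Returns the number of leftover bytes, sets *frames to the frame count.
 */
size_t consume_frames(unsigned char *buf, size_t len, unsigned *frames)
{
	size_t off = 0;

	*frames = 0;
	while (len - off >= sizeof(uint32_t)) {
		uint32_t l2len;

		memcpy(&l2len, buf + off, sizeof(l2len));
		l2len = ntohl(l2len);

		if (len - off - sizeof(l2len) < l2len)
			break;	/* Incomplete payload: wait for more data */

		/* A real handler would process the frame payload here */
		off += sizeof(l2len) + l2len;
		(*frames)++;
	}

	/* Keep the partial frame: the next recv() appends after it */
	memmove(buf, buf + off, len - off);
	return len - off;
}
```

The caller would pass buf plus the returned count as the destination of
its next recv(), so no blocking read is needed to complete a frame.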

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:42 +02:00
David Gibson
37e3b24d90 tap: Correctly handle frames of odd length
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.
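
One common way to write such a helper is with memcpy(), as in this
sketch. The name load_be32_unaligned() is made up, and the helper
introduced by the patch may be written differently:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit big-endian value from a possibly unaligned address */
uint32_t load_be32_unaligned(const void *p)
{
	uint32_t val;

	/* memcpy() has no alignment requirement on its source */
	memcpy(&val, p, sizeof(val));
	return ntohl(val);
}
```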

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:38 +02:00
David Gibson
4684f60344 tap: Don't use EPOLLET on Qemu sockets
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket.  It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling.  That consideration
doesn't apply to the way we handle the qemu socket however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
   we *do* set it when using pasta mode with --fd.  This inconsistency
   doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available
   is read (otherwise we may have data in the buffer but never get an event
   causing us to read it).  We do that with a rather ugly goto.

   Worse, our condition for that goto appears to be incorrect.  We'll only
   loop if rem is non-zero, which will only happen if we perform a blocking
   recv() for a partially received frame.  We'll only perform that second
   recv() if the original recv() resulted in a partially read frame.  As
   far as I can tell the original recv() could end on a frame boundary
   (never triggering the second recv()) even if there is additional data in
   the socket buffer.  In that circumstance we wouldn't goto redo and could
   leave unprocessed frames in the qemu socket buffer indefinitely.

   This doesn't seem to have caused any problems in practice, but since
   there's no obvious reason to use EPOLLET here anyway, we might as well
   get rid of it.
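
For reference, a level-triggered registration simply leaves EPOLLET out
of the event mask. A sketch with a hypothetical
watch_level_triggered() helper:

```c
#include <sys/epoll.h>

/* Register fd for input events, level-triggered (no EPOLLET) */
int watch_level_triggered(int epollfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = fd;
	return epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
}
```

With this, epoll_wait() keeps reporting the descriptor for as long as
unread data remains, so the handler doesn't need to loop until the
socket is drained.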

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:20 +02:00