I thought we could just set errno to 0, do a bunch of stuff, and check
that errno didn't change to infer we succeeded. But clang-tidy,
starting with LLVM 19, reports:
/home/sbrivio/passt/util.c:465:6: error: An undefined value may be read from 'errno' [clang-analyzer-unix.Errno,-warnings-as-errors]
465 | if (errno)
| ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
38 | # define errno (*__errno_location ())
| ^~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:446:6: note: Assuming the condition is false
446 | if (pid == -1) {
| ^~~~~~~~~
/home/sbrivio/passt/util.c:446:2: note: Taking false branch
446 | if (pid == -1) {
| ^
/home/sbrivio/passt/util.c:451:6: note: Assuming 'pid' is 0
451 | if (pid) {
| ^~~
/home/sbrivio/passt/util.c:451:2: note: Taking false branch
451 | if (pid) {
| ^
/home/sbrivio/passt/util.c:463:2: note: Assuming that 'close' is successful; 'errno' becomes undefined after the call
463 | close(devnull_fd);
| ^~~~~~~~~~~~~~~~~
/home/sbrivio/passt/util.c:465:6: note: An undefined value may be read from 'errno'
465 | if (errno)
| ^
/usr/include/errno.h:38:16: note: expanded from macro 'errno'
38 | # define errno (*__errno_location ())
| ^~~~~~~~~~~~~~~~~~~~~~
And the LLVM documentation for the unix.Errno checker (section 1.1.8.3,
unix.Errno (C)), at:
https://clang.llvm.org/docs/analyzer/checkers.html#unix-errno
mentions that:
The C and POSIX standards often do not define if a standard library
function may change value of errno if the call does not fail.
Therefore, errno should only be used if it is known from the return
value of a function that the call has failed.
which is, somewhat surprisingly, the case for close().
Instead of using errno, check the actual return values of the calls
we issue here.
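For illustration, the resulting pattern looks roughly like this (a
sketch with assumed names, not the actual util.c code):

#include <stdlib.h>
#include <unistd.h>

/* Sketch only: check each call's return value instead of errno */
static void redirect_to_devnull(int devnull_fd)
{
        if (dup2(devnull_fd, STDIN_FILENO)  < 0 ||
            dup2(devnull_fd, STDOUT_FILENO) < 0 ||
            dup2(devnull_fd, STDERR_FILENO) < 0 ||
            close(devnull_fd) < 0)
                exit(EXIT_FAILURE);
}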
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For clock_gettime(), we shouldn't ignore errors if they happen during
the initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.
Once we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.
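That is, roughly (a sketch with assumed names; fprintf() stands in for
our logging functions here):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Sketch only: 'last_now' is whatever stale (or zero) value we keep */
static void get_monotonic(struct timespec *now,
                          const struct timespec *last_now)
{
        if (clock_gettime(CLOCK_MONOTONIC, now)) {
                fprintf(stderr, "clock_gettime(): %s\n", strerror(errno));
                *now = *last_now;  /* stale, but better than terminating */
        }
}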
For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.
In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).
The checker suggests adding O_CLOEXEC to 'mode', and not to 'flags',
where we already have it.
Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().
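A rough sketch of the helper (the exact flags and permissions here are
assumptions, not the actual definition):

#include <fcntl.h>
#include <sys/stat.h>

int output_file_open(const char *path, int flags)
{
        /* O_CLOEXEC is passed in 'flags': the checker wrongly looks for
         * it in the 'mode' argument below, so suppress the warning.
         */
        /* NOLINTNEXTLINE(android-cloexec-open) */
        return open(path, O_CREAT | O_TRUNC | O_CLOEXEC | flags,
                    S_IRUSR | S_IWUSR);
}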
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.
Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.
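That is, something along these lines:

#include <stdio.h>

#define FPRINTF(f, ...)        (void)fprintf((f), __VA_ARGS__)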
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):
/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
43 | snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
577 | snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
579 | snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 | pidval);
| ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
105 | snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
242 | snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
243 | snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
1155 | snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning
Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
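A sketch of the idea (the actual helper's signature and error reporting
might differ slightly):

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

int snprintf_check(char *str, size_t size, const char *format, ...)
{
        va_list ap;
        int rc;

        va_start(ap, format);
        rc = vsnprintf(str, size, format, ap);
        va_end(ap);

        if (rc < 0) {
                errno = EIO;            /* output error */
                return -1;
        }

        if ((size_t)rc >= size) {
                errno = ENOBUFS;        /* message would be truncated */
                return -1;
        }

        return 0;
}

Callers that can't meaningfully recover then simply terminate when this
returns non-zero.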
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We'll deprecate qrap(1) soon, and addressing the warnings reported by
clang-tidy as of LLVM versions 16 and later would need a bunch of
changes there, mostly around CERT C rule ERR33-C and checking the
return code of snprintf().
It makes no sense to fix warnings in qrap just for the sake of it, so
officially declare the bitrotting season open.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.
Apart from reducing code and memory footprint, this change reduces
a potential risk of TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.
Tests with iperf3 show no measurable change in performance after this
change.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through their whole lifetime. The only
fields we need to update per message are things like payload size
and checksums.
If we want to reuse these entries between IPv4 and IPv6 messages, we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.
The same needs to be done for the ethernet header.
We do these changes here.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Remove debian-9-nocloud-amd64-daily-20200210-166.qcow2 and
openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 as they cannot be
downloaded anymore.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Remove the err label as there is only one caller, and move the code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can use len directly.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in the structure returned by the runtime kernel. We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.
However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel. We should instead check against
the size of the structure returned from the TCP_INFO getsockopt(), as we
already do for tcpi_snd_wnd. Switch from the compile time flags to a
runtime test.
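That is, roughly (a sketch: 'tcp_info_size' would be the length
returned by the TCP_INFO getsockopt(), recorded once at probe time, and
the kernel's struct tcp_info from linux/tcp.h is used here only to keep
the sketch self-contained):

#include <stddef.h>
#include <sys/socket.h>
#include <linux/tcp.h>

extern socklen_t tcp_info_size;        /* recorded by the probe at start-up */

/* Sketch only: a field is usable if the kernel's reply covers it */
#define tcp_info_has(field_)                                            \
        ((size_t)tcp_info_size >= offsetof(struct tcp_info, field_) +   \
                                  sizeof(((struct tcp_info *)0)->field_))

so that, for example, tcp_info_has(tcpi_min_rtt) tells us whether the
running kernel actually filled in that field.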
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).
In fact we should check that for every tcp_info field we want to use,
beyond the very old ones shared with BSD. Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure. We can then use that recorded
value to check for the presence of any field we need.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info. These defines are used for two
purposes, neither of which they accomplish well:
1) Determining if the tcp_info fields are available at runtime. For this
purpose the defines are Just Plain Wrong, since the runtime kernel may
not be the same as the compile time kernel. We corrected this for
tcpi_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt.
2) Allowing the source to compile against older kernel headers which don't
have the fields in question. This works in theory, but it does mean
we won't be able to use the fields, even if later run against a
newer kernel. Furthermore, it's quite fragile: without much more
thorough tests of builds in different environments than we're currently
set up for, it's very easy to miss cases where we're accessing a field
without protection from an #ifdef. For example we currently access
tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().
Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems). Don't compile against linux/tcp.h, using
netinet/tcp.h instead. Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in. That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.
This allows us to remove the HAS_SND_WND define entirely. We keep
HAS_BYTES_ACKED and HAS_MIN_RTT for now, since they're used for purpose
(1); we'll fix that in a later patch.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Don't report bogus failures (with --trace) just because the return
value is not zero.
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.
However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 98304 from read-side call
Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
Flow 0 (TCP connection (spliced)): -1 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
Flow 0 (TCP connection (spliced)): OUT_WAIT_0
we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 32768 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
we splice() 32768 more bytes from our receiving side to the pipe. At
some point:
pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
Flow 0 (TCP connection (spliced)): 1320 from read-side call
Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97457 bytes: we'll actually
send 1320 of those).
This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.
Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.
So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.
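In other words, roughly (a sketch; the names are assumed, with
'pipe_size' standing for the configured pipe capacity):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Sketch only: ask for the whole pipe capacity, the kernel moves at
 * most what is actually pending on the pipe. */
static ssize_t splice_pending(int pipe_rd, int sock_wr, size_t pipe_size)
{
        return splice(pipe_rd, NULL, sock_wr, NULL, pipe_size,
                      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
}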
Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.
Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we meet that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.
Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte. However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).
Rework this to use a structure and initialisers which make it a bit
clearer what's going on.
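The shape of the change is roughly this (a sketch, not the actual
definitions):

#include <arpa/inet.h>
#include <stdint.h>

/* Sketch only: MSS and window scaling options for SYN, SYN/ACK segments */
struct tcp_syn_opts {
        uint8_t kind_mss;       /* 2: Maximum Segment Size */
        uint8_t len_mss;        /* 4 */
        uint16_t mss;
        uint8_t kind_ws;        /* 3: Window Scale */
        uint8_t len_ws;         /* 3 */
        uint8_t ws;
        uint8_t nop;            /* 1: pad options to a multiple of 4 bytes */
} __attribute__((packed));

#define TCP_SYN_OPTS(mss_, ws_)                                         \
        ((struct tcp_syn_opts){ .kind_mss = 2, .len_mss = 4,            \
                                .mss = htons(mss_),                     \
                                .kind_ws = 3, .len_ws = 3,              \
                                .ws = (ws_), .nop = 1 })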
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1. This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace. That's not an inherent property of "spliced" connections
and arguably an undesirable one. Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.
Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.
Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour. Update it accordingly.
It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host. Use "loopback
address" to be more precise about this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add
one to make it easier to spot.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
After running dhclient -6 we expect the DHCPv6 assigned address to be
immediately usable. That's true with the Fedora dhclient-script (and the
upstream ISC DHCP one); however, it's not true with the Debian
dhclient-script. The Debian script can complete with the address still
in "tentative" state, and the address won't be usable until Duplicate
Address Detection (DAD) completes. That's arguably a bug in Debian (see
link below), but for the time being we need to work around it anyway.
We usually get away with this, because by the time we do anything where the
address matters, DAD has completed. However, it's not robust, so we should
explicitly wait for DAD to complete when we get a DHCPv6 address.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Getting a SLAAC address takes a little while because the kernel must
complete Duplicate Address Detection (DAD) before marking the address as
ready. In several places we have an explicit 'sleep 2' to wait for that
to complete.
Fixed length delays are never a great idea, although this one is pretty
solid. Still, it would be better to explicitly wait for DAD to complete
in case of long delays (which might happen on slow emulated hosts, or with
heavy load), and to speed the tests up if DAD completes quicker.
Replace the fixed sleeps with a loop waiting for DAD to complete. We do
this by looping waiting for all tentative addresses to disappear.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This fixes a number of harmless but slightly ugly warts in the ARP
resolution code:
* Use in4addr_any to represent 0.0.0.0 rather than hand constructing an
example.
* When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of
sizeof(am->tip) (same value, but makes more logical sense)
* Describe the guest's assigned address as such, rather than as "our
address" - that's not usually what we mean by "our address" these days
* Remove "we might have the same IP address" comment which I can't make
sense of in context (possibly it's relating to the statement below,
which already has its own comment?)
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.
We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.
However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.
This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude that it's
actually causing problems.
Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests
with TRACE=1 will not actually enable tracing output.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...instead of echo: otherwise, bash won't handle escape sequences we
use to colour messages (and 'echo -e' is not specified by POSIX).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When redirecting DNS queries with the --dns-forward option, passt/pasta
needs a host side nameserver to redirect the queries to. This is
controlled by the c->ip[46].dns_host variables. This is set to the first
first nameserver listed in the host's /etc/resolv.conf, and there isn't
currently a way to override it from the command line.
Prior to 0b25cac9 ("conf: Treat --dns addresses as guest visible
addresses") it was possible to alter this with the -D/--dns option.
However, doing so was confusing and had some nonsensical edge cases because
-D generally takes guest side addresses, rather than host side addresses.
Add a new --dns-host option to restore this functionality in a more
sensible way.
Link: https://bugs.passt.top/show_bug.cgi?id=102
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future. That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but setting that requires CAP_NET_ADMIN in
the first place. We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.
Add a --freebind command line argument to enable this socket option on
all listening sockets.
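The effect on each listening socket is roughly this (a sketch, with an
assumed helper name):

#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPV6_FREEBIND
#define IPV6_FREEBIND 78        /* from linux/in6.h, Linux 4.15 and later */
#endif

/* Sketch only: allow binding to addresses not (yet) configured */
static void sock_set_freebind(int s, sa_family_t af)
{
        int y = 1;

        if (af == AF_INET)
                setsockopt(s, IPPROTO_IP, IP_FREEBIND, &y, sizeof(y));
        else
                setsockopt(s, IPPROTO_IPV6, IPV6_FREEBIND, &y, sizeof(y));
}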
Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.
In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We also provide an offset, so that the array can contain the TCP frame
embedded in a lower level frame; the offset points to the TCP header
inside the iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The offset allows us to skip any headers that are not part of the data
to be checksummed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The offset is passed directly to pcap_frame() and allows us to skip
any headers that are not part of the frame to be captured.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing that the TCP header is
followed by the payload).
Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).
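For reference, the shape of the payload structure is roughly this (a
sketch: the array size and alignment here are assumptions, not the
actual definition):

#include <netinet/tcp.h>
#include <stdint.h>

struct tcp_payload_t {
        struct tcphdr th;                               /* TCP header... */
        uint8_t data[65536 - sizeof(struct tcphdr)];    /* ...plus payload */
} __attribute__((packed, aligned(__alignof__(unsigned int))));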
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is quite useful at least for myself as I'm usually running tests
using a guest kernel that's not the same as the one on the host.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We already have an inany_ntop() function to format inany addresses into
text. Add inany_pton() to parse them from text, and use it in
conf_ports().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::; now we use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.
Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket address. This case doesn't yet arise, but
it might cause us trouble in future.
Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic. This only makes sense if there is no specific
address given. We verify this at runtime, but it would be nicer if we
could enforce it structurally.
For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems. Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.
The new function takes the address as an inany *, with NULL indicating the
dual stack case. This does add some complexity in some of the callers;
however, future planned cleanups should make this go away again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible. However udp_sock_init()
attempts to do this in some cases that can't work. Specifically we can
only do this when listening on any address. That's never true for the
ns (splicing) case, because we always listen on loopback. For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.
This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path. But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
In some cases we don't need to set the TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In some cases we don't need to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
write_remainder() steps through the buffers in an IO vector writing out
everything past a certain byte offset. However, on each iteration it
rescans the buffer from the beginning to find out where we're up to. With
an unfortunate set of write sizes this could lead to quadratic behaviour.
In an even less likely set of circumstances (total vector length > maximum
size_t) the 'skip' variable could overflow. This is one factor in a
longstanding Coverity error we've seen (although I still can't figure out
the remainder of its complaint).
Rework write_remainder() to always work out our new position in the vector
relative to our old/current position, rather than starting from the
beginning each time. As a bonus this seems to fix the Coverity error.
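The core of the rework is keeping a cursor into the vector and
advancing it by what was just written, along these lines (a sketch, not
the actual implementation):

#include <stddef.h>
#include <sys/uio.h>

/* Sketch only: advance an (iov, offset) cursor by 'written' bytes,
 * instead of rescanning the whole vector from the start each time. */
static void iov_advance(const struct iovec **iov, size_t *iovcnt,
                        size_t *offset, size_t written)
{
        while (written && *iovcnt) {
                size_t left = (*iov)->iov_len - *offset;

                if (written < left) {
                        *offset += written;
                        return;
                }

                written -= left;
                (*iov)++;
                (*iovcnt)--;
                *offset = 0;
        }
}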
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
write(2) might not write all the data it is given. Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.
Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.
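A sketch of the helper (the error handling details are assumptions):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch only: keep calling write() until everything is out, or we fail */
static int write_all_buf(int fd, const void *buf, size_t len)
{
        const char *p = buf;
        size_t left = len;

        while (left) {
                ssize_t rc = write(fd, p, left);

                if (rc < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;
                }

                p += rc;
                left -= rc;
        }

        return 0;
}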
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This parameter is already treated as a boolean internally. Make it a
'bool' type for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function has a block conditional on !snd_wnd_cap shortly before
another block conditional on snd_wnd_cap.
Simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest. That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
Currently our probing for this is a bit odd. The HAS_SND_WND define
determines if our headers include the tcpi_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it. Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field. This is a bit hit and miss in two ways:
* Zero is a perfectly possible window for the peer to report, so we can
get false negatives
* We're reading TCP_INFO into a local variable, which might not be zero
initialised, so if the kernel _doesn't_ write it, it could have non-zero
garbage, giving us false positives.
We can use a more direct way of probing for this: getsockopt() reports the
length of the information retrieved. So, check whether that's long enough
to include the field. This lets us probe the availability of the field
once and for all during initialisation. That in turn allows the ctx
argument of tcp_prepare_flags() to become a const pointer, which
cascades through many other functions.
We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.
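The probe itself then becomes a one-off check at start-up, roughly (a
sketch with assumed names; the kernel's struct tcp_info from
linux/tcp.h is used here only to keep the sketch self-contained):

#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/tcp.h>
#include <unistd.h>

static bool snd_wnd_cap;        /* sketch of the global capability flag */

static void tcp_probe_snd_wnd_cap(void)
{
        struct tcp_info info;
        socklen_t len = sizeof(info);
        int s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);

        if (s < 0)
                return;

        if (!getsockopt(s, IPPROTO_TCP, TCP_INFO, &info, &len))
                snd_wnd_cap = len >= offsetof(struct tcp_info, tcpi_snd_wnd) +
                                     sizeof(info.tcpi_snd_wnd);

        close(s);
}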
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header. Make them static.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame. However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy(). This seems to have been
harmless so far, but passing overlapping ranges to memcpy() is undefined
behaviour,
so we really should avoid it.
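One way to avoid that (a sketch; the names here are assumed, and the
actual fix might differ) is simply to skip the shared buffer while
duplicating:

#include <string.h>
#include <sys/uio.h>

/* Sketch only: 'eth_idx' is the index of the shared Ethernet header
 * buffer (TCP_IOV_ETH); source and destination are the same there, so
 * don't copy it. */
static void dup_frame_bufs(struct iovec *dup, const struct iovec *iov,
                           size_t n, size_t eth_idx)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (i == eth_idx)
                        continue;
                memcpy(dup[i].iov_base, iov[i].iov_base, iov[i].iov_len);
        }
}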
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>