Commit graph

326 commits

Author SHA1 Message Date
Stefano Brivio
4f69efcfba README: .. doesn't actually work for comments in Markdown
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:34:02 +02:00
Stefano Brivio
7d24803fb3 conf: Always pass an empty buffer to line_read() in get_dns()
Given that get_dns() touches the buffer read by line_read(), we
can't optimise that by passing the existing buffer.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
b0b77118fe passt: Address warnings from Clang's scan-build
All false positives so far.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
1a563a0cbd passt: Address gcc 11 warnings
A mix of unchecked return values, a missing permission mask for
open(2) with O_CREAT, and some false positives from
-Wstringop-overflow and -Wmaybe-uninitialized.

Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
087b5f4dbb LICENSES: Add license text files, add missing notices, fix SPDX tags
SPDX tags don't replace license files. Some notices were missing,
and some tags didn't conform to the SPDX specification either.

With this, 'reuse lint' from the REUSE tool (https://reuse.software/)
passes.

Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
f154a0489a Makefile: Install man pages to /usr/share/man instead of /usr/man
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
2725003d45 Makefile: Prefix installation paths with $(DESTDIR)
Martin reports that DESTDIR is ignored in the install and uninstall
targets; see also:
	https://www.gnu.org/prep/standards/html_node/DESTDIR.html

Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:42:08 +02:00
Stefano Brivio
9df5027129 perf/passt_udp: Don't overshoot UDP bandwidth excessively on larger MTUs
...performance with 64KiB MTUs might look worse than with 9000-byte
MTUs in some configurations.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:30:42 +02:00
Stefano Brivio
7aaff3387a perf/passt_tcp: Don't exceed typical L3 cache sizes with buffers
...we might see misleading rate drops with larger MTUs otherwise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:28:44 +02:00
Stefano Brivio
e8ac8a3b7c test/perf: Use CPU frequency from /proc/cpuinfo instead of cpupower(1)
Make it work in nested virtualisation environments, too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:25:29 +02:00
Stefano Brivio
1bddcf3dd7 tcp: Fix for non-blocking splice() on older kernels
For some reason, on 4.19, splice() doesn't honour SOCK_NONBLOCK from
accept4() while reading from a TCP socket. Pass SPLICE_F_NONBLOCK
explicitly in all cases.
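
A minimal sketch of passing SPLICE_F_NONBLOCK on every call instead
of relying only on the socket having been created non-blocking. The
helper name splice_nb() is hypothetical; in the usage below, a pipe
stands in for the TCP socket, as splice() needs a pipe on one side
anyway:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from fd_in to fd_out, one of
 * which must be a pipe. SPLICE_F_NONBLOCK is passed explicitly on every
 * call, rather than relying on SOCK_NONBLOCK from accept4() being honoured
 * for the other descriptor.
 *
 * Returns the number of bytes moved, 0 on end of input, or -1 with errno
 * set, for example to EAGAIN if nothing can be moved without blocking. */
static ssize_t splice_nb(int fd_in, int fd_out, size_t len)
{
	return splice(fd_in, NULL, fd_out, NULL, len,
		      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
}
```

With an empty source pipe, splice_nb() fails with EAGAIN instead of
waiting for data.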

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:19:50 +02:00
Stefano Brivio
9e065b1448 tcp: Fix ACK reporting on older kernels (no tcp.kernel_snd_wnd case)
If the window isn't updated on !c->tcp.kernel_snd_wnd, we still
have to send ACKs if the ACK sequence was updated, or if an error
occurred while querying TCP_INFO.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:13:53 +02:00
Stefano Brivio
5496074862 netlink: NETLINK_GET_STRICT_CHK is not available on older kernels
For example, on 4.19. Don't fail if we can't set it; filter on the
interface index in nl_addr() instead.
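
A sketch of the fallback, with hypothetical helper names: try to set
the option, carry on if the kernel rejects it, and check the
interface index of each RTM_NEWADDR reply in userspace. The value 12
for NETLINK_GET_STRICT_CHK matches the UAPI header of Linux 4.20 and
later, and is only used if the installed headers lack it:

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/if_addr.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK		270
#endif
#ifndef NETLINK_GET_STRICT_CHK
#define NETLINK_GET_STRICT_CHK	12	/* Linux >= 4.20 */
#endif

/* Hypothetical helper: ask for strict checking of dump requests. Returns
 * false if that's not supported (as on 4.19): not an error, but the caller
 * then has to filter replies by itself. */
static bool nl_strict_chk(int s)
{
	int one = 1;

	return !setsockopt(s, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
			   &one, sizeof(one));
}

/* Without strict checking, an address dump can return entries for every
 * interface: keep only the ones for the interface we asked about. */
static bool nl_addr_match(const struct ifaddrmsg *ifa, unsigned int ifi)
{
	return ifa->ifa_index == ifi;
}
```

As the check is cheap, applying nl_addr_match() to every reply, even
when strict checking is available, keeps a single code path.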

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:08:06 +02:00
Stefano Brivio
8a874ecf58 passt: Include linux/seccomp.h and linux/audit.h instead of seccomp.h
We don't use libseccomp.

Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:03:35 +02:00
Stefano Brivio
17600d6d6e netlink, conf: Actually get prefix/mask length
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-19 09:01:27 +02:00
Stefano Brivio
1ac0d52820 tcp: Arm tcp_data_noack on insufficient window too, don't reset if ACK doesn't match
...and while at it, reverse the operands in the window equality
comparison to detect the need for fast re-transmit: it's easier
to read this way.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 16:58:16 +02:00
Stefano Brivio
85038e9410 passt: Add clock_gettime to list of allowed syscalls
...depending on the system clock source, glibc might use it to
fetch the wall time.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 16:54:23 +02:00
Stefano Brivio
2c7d1ce088 passt: Static builds: don't redefine __vsyslog(), skip getpwnam() and initgroups()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 16:53:40 +02:00
Stefano Brivio
1fd0c9b0e1 util, pasta: Don't read() and lseek() every single line in read_line()
...periodically checking bound ports becomes quite expensive
otherwise.
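
A minimal sketch of the idea, for illustration only (struct
line_reader, lr_init() and lr_get() are hypothetical, not the actual
passt code): read() fills a buffer in large chunks, for example from
/proc/net/tcp, and lines are then handed out from memory:

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LR_BUF_SIZE	4096

/* Hypothetical buffered line reader: no read() and lseek() per line */
struct line_reader {
	int fd;
	size_t start;			/* first byte not returned yet */
	size_t end;			/* end of valid data */
	char buf[LR_BUF_SIZE + 1];	/* +1: room for a final terminator */
};

static void lr_init(struct line_reader *lr, int fd)
{
	lr->fd = fd;
	lr->start = lr->end = 0;
}

/* Return the next line, without its newline, as a NUL-terminated string
 * pointing into lr->buf, valid until the next call. Returns NULL on end of
 * file or on read() error. Lines longer than LR_BUF_SIZE are split. */
static char *lr_get(struct line_reader *lr)
{
	for (;;) {
		char *line = lr->buf + lr->start;
		char *nl = memchr(line, '\n', lr->end - lr->start);
		ssize_t n;

		if (nl) {
			*nl = '\0';
			lr->start = (size_t)(nl - lr->buf) + 1;
			return line;
		}

		/* No complete line buffered: move the partial one to the front */
		memmove(lr->buf, line, lr->end - lr->start);
		lr->end -= lr->start;
		lr->start = 0;

		if (lr->end == LR_BUF_SIZE)	/* buffer full, no newline */
			break;

		n = read(lr->fd, lr->buf + lr->end, LR_BUF_SIZE - lr->end);
		if (n < 0)
			return NULL;
		if (n == 0) {
			if (!lr->end)
				return NULL;
			break;			/* last line, no newline */
		}
		lr->end += (size_t)n;
	}

	/* Hand out whatever is buffered as one (partial or final) line */
	lr->buf[lr->end] = '\0';
	lr->start = lr->end;
	return lr->buf;
}
```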

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 00:49:33 +02:00
Stefan Hajnoczi
14fe73e766 udp: drop bogus udp_tap_map ts assignment
The 'ts' field is a timestamp so assigning the socket file descriptor is
incorrect. There is no actual bug because the current time is assigned
just a few lines later:

      udp_tap_map[V4][src].sock = s;
      udp_tap_map[V4][src].ts = s;
      ^^^^^^^^^^^ bogus ^^^^^^^^^^
      bitmap_set(udp_act[V4][UDP_ACT_TAP], src);
  }

  udp_tap_map[V4][src].ts = now->tv_sec;
  ^^^^^^^^^^^^^^^ correct ^^^^^^^^^^^^^^

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
6231422dfb demo/pasta: Swap init>ns and ns>init flows
...make those short performance tests actually match table headers.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
a56721b61c util: Don't duplicate debug messages, they're already on stderr
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
6943d41d6c tcp: ...and so I got a socket called zero
I thought I'd get away with it, but no: after some clean-ups, I
finally got a socket with number 0. Fix up all the convenient, yet
botched, assumptions.
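
The kind of assumption at stake, as a sketch with a hypothetical
struct conn: descriptor 0 is a valid socket number (for example once
stdin is closed), so "no socket" has to be encoded as -1 and tested
explicitly:

```c
#include <stdbool.h>

/* Hypothetical connection entry */
struct conn {
	int sock;	/* -1: no socket */
};

static void conn_init(struct conn *c)
{
	c->sock = -1;
}

static bool conn_has_sock(const struct conn *c)
{
	return c->sock >= 0;	/* "if (c->sock)" would miss socket 0 */
}
```

Zero-initialised tables then need an explicit pass setting every
entry to -1; otherwise each entry appears to own socket 0.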

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
bd47b68ebf passt: Check if a PID file was actually requested before creating it
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
955fe812dc util: Define ROUND_UP()
...not actually used, just for completeness, as ROUND_DOWN() is
defined.
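
Assuming, as is usual for such macros, that the alignment y is a
power of two, the pair might be defined as below. These are
illustrative definitions, not necessarily the exact ones in util:

```c
/* Round x down/up to a multiple of y. Both assume y is a power of two,
 * so that ~((y) - 1) is a mask clearing the low-order bits. */
#define ROUND_DOWN(x, y)	((x) & ~((y) - 1))
#define ROUND_UP(x, y)		(((x) + (y) - 1) & ~((y) - 1))
```

For example, ROUND_DOWN(13, 8) is 8 and ROUND_UP(13, 8) is 16, while
values that are already aligned are left unchanged by both.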

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:18:48 +02:00
Stefano Brivio
2f4f29c5a7 tcp: Bump TCP_TAP_FRAMES back to 256
With a batched sendmsg(), this is now beneficial.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:17:57 +02:00
Stefano Brivio
38fbfdbcb9 tcp: Get rid of iov with cached MSS, drop sendmmsg(), add deferred flush
Caching iov_len for messages from the socket doesn't decrease
overhead in the slightest, and it added a lot of complexity. Drop
that.

Also drop sendmmsg(): we don't need to send multiple messages with
TCP. As long as we make sure that no message with a length
descriptor is sent partially, qemu is fine with it.

Just like it's done for segments without data (flags), also defer
the sendmsg() for sending data segments, to improve batching.
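
qemu's stream socket netdev expects every Ethernet frame to be
preceded by a 32-bit length in network byte order, the length
descriptor mentioned above. A minimal sketch of deferring frames into
a batch and flushing it with a single sendmsg(), without ever leaving
a frame half-sent; struct tap_batch, batch_add() and batch_flush()
are hypothetical names, and a blocking socket is assumed:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BATCH_MAX	64

/* Hypothetical batch: two iovecs per frame, length descriptor + frame */
struct tap_batch {
	uint32_t len_be[BATCH_MAX];
	struct iovec iov[BATCH_MAX * 2];
	int count;
};

/* Queue a frame; nothing is written yet (deferred flush). -1 if full. */
static int batch_add(struct tap_batch *b, void *frame, size_t len)
{
	if (b->count == BATCH_MAX)
		return -1;

	b->len_be[b->count] = htonl((uint32_t)len);
	b->iov[2 * b->count].iov_base = &b->len_be[b->count];
	b->iov[2 * b->count].iov_len = sizeof(uint32_t);
	b->iov[2 * b->count + 1].iov_base = frame;
	b->iov[2 * b->count + 1].iov_len = len;
	b->count++;
	return 0;
}

/* Send all queued frames with one sendmsg() call. If only part of the data
 * is accepted, keep writing the rest, so that no frame is left truncated
 * after its length descriptor. With a non-blocking socket, EAGAIN in the
 * middle of a frame would need extra handling, not shown here. */
static int batch_flush(int fd, struct tap_batch *b)
{
	struct iovec *iov = b->iov;
	int iovcnt = b->count * 2;

	while (iovcnt) {
		struct msghdr mh = { .msg_iov = iov, .msg_iovlen = iovcnt };
		ssize_t n = sendmsg(fd, &mh, MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Skip iovecs sent completely */
		while (iovcnt && (size_t)n >= iov->iov_len) {
			n -= (ssize_t)iov->iov_len;
			iov++;
			iovcnt--;
		}

		/* Advance within a partially sent iovec */
		if (iovcnt) {
			iov->iov_base = (char *)iov->iov_base + n;
			iov->iov_len -= (size_t)n;
		}
	}

	b->count = 0;
	return 0;
}
```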

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:13:23 +02:00
Stefano Brivio
54e6513d17 tcp: Clamp MSS depending on IP version, properly derive buffer sizes
It makes no sense to include an IPv6 header in the calculation for
clamping MSS on IPv4.
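
The arithmetic, as a sketch with a hypothetical mss_clamp() helper
and header sizes without options: with a 1500-byte MTU, subtracting
only the header of the IP version in use gives 1460 bytes for IPv4
and 1440 for IPv6, so using the IPv6 header for IPv4 would waste 20
bytes per segment:

```c
#define IPV4_HDR_LEN	20	/* no options */
#define IPV6_HDR_LEN	40	/* fixed header, no extension headers */
#define TCP_HDR_LEN	20	/* no options */

/* Hypothetical helper: largest MSS fitting in one frame of the given MTU.
 * The MSS option is 16 bits wide, hence the cap at 65535. */
static unsigned int mss_clamp(unsigned int mtu, int ipv6)
{
	unsigned int hdrs = (ipv6 ? IPV6_HDR_LEN : IPV4_HDR_LEN) + TCP_HDR_LEN;
	unsigned int mss;

	if (mtu <= hdrs)
		return 0;

	mss = mtu - hdrs;
	return mss > 65535 ? 65535 : mss;
}
```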

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:09:37 +02:00
Stefano Brivio
bf63832207 conf, pasta: Create a new namespace also if probing netns options failed
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:07:16 +02:00
Stefano Brivio
4b12cf94f0 checksum: Stream load into four registers at a time with > 128 bytes
...and further interleave register usage. This brings the csum()
overhead reported by perf(1) for 30 seconds of 64KiB TCP IPv4
frames, host to guest, from 7.2% to 5.8%.
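
For reference, a scalar sketch of the same principle (independent
accumulators, combined only at the end) applied to the RFC 1071
Internet checksum. This is not the AVX2 code from this commit, and
csum_fold() and csum_4acc() are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a sum of 16-bit words into 16 bits, adding carries back in */
static uint16_t csum_fold(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum 16-bit words into four independent accumulators, so that additions
 * in one iteration don't depend on each other. Words are read in host byte
 * order: as RFC 1071 notes, the result is correct once stored back in the
 * same byte order. */
static uint16_t csum_4acc(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t a = 0, b = 0, c = 0, d = 0;
	uint16_t w[4];

	for (; len >= 8; p += 8, len -= 8) {
		memcpy(w, p, 8);	/* no alignment assumptions */
		a += w[0];
		b += w[1];
		c += w[2];
		d += w[3];
	}
	for (; len >= 2; p += 2, len -= 2) {
		memcpy(w, p, 2);
		a += w[0];
	}
	if (len) {			/* odd trailing byte, zero-padded */
		w[0] = 0;
		memcpy(w, p, 1);
		a += w[0];
	}

	return (uint16_t)~csum_fold(a + b + c + d);
}
```

For the RFC 1071 example bytes 00 01 f2 03 f4 f5 f6 f7, the result,
stored in memory, reads 22 0d, and the checksum over the data with
those two bytes appended is 0.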

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:04:46 +02:00
Stefano Brivio
74f29d3148 checksum: Interleave lo/hi sums while folding into 128-bit sums, drop TODO
I left a TODO and never checked -- this actually seems to slightly
improve CPIs on AMD Naples (two 128-bit FMA units glued together).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 16:59:12 +02:00
Stefano Brivio
364cc313ea pasta: Allow nanosleep(2) and clock_nanosleep(2) syscalls too
...we need those to wait for terminating processes in the namespace.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 21:48:44 +02:00
Stefano Brivio
dca31d4206 netlink: Bring up interface even if neither MTU nor MAC address is configured
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 19:11:05 +02:00
Stefano Brivio
388435542e passt: Don't refuse to run if UID is 0 in non-init namespace
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 18:01:00 +02:00
Stefano Brivio
54a65e3693 pasta: Push pasta.h header
...I forgot to add this earlier.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:40:53 +02:00
Stefano Brivio
2e6e29a757 slirp4netns.sh: Introduce compatibility wrapper behaving like slirp4netns(1)
Warning: draft quality, not really tested, --enable-sandbox not
supported yet.

Example:

 $ unshare -rUn
 # echo $$
 3130879

 $ ./slirp4netns.sh -m 65520 -c 3130879 tap0
 sent tapfd=5 for tap0
 received tapfd=5
 Starting slirp
 * MTU:             65520
 * Network:         10.0.2.0
 * Netmask:         255.255.255.0
 * Gateway:         10.0.2.2
 * DNS:             10.0.2.3
 * Recommended IP:  10.0.2.100
 WARNING: 127.0.0.1:* on the host is accessible as 10.0.2.2 (set --disable-host-loopback to prohibit connecting to 127.0.0.1:*)

 # ip li sh
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 33: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
     link/ether 5e:9d:a0:c5:cf:67 brd ff:ff:ff:ff:ff:ff
 # ip ad sh
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
 33: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UNKNOWN group default qlen 1000
     link/ether 5e:9d:a0:c5:cf:67 brd ff:ff:ff:ff:ff:ff
     inet 10.0.2.0/24 scope global tap0
        valid_lft forever preferred_lft forever
     inet6 fe80::5c9d:a0ff:fec5:cf67/64 scope link
        valid_lft forever preferred_lft forever
 # ip ro sh
 default via 10.0.2.2 dev tap0
 10.0.2.0/24 dev tap0 proto kernel scope link src 10.0.2.0
 # ip -6 ro sh
 fe80::/64 dev tap0 proto kernel metric 256 pref medium
 # iperf3 -c 10.0.2.2 -l1M
 Connecting to host 10.0.2.2, port 5201
 [  5] local 10.0.2.0 port 43014 connected to 10.0.2.2 port 5201
 [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
 [  5]   0.00-1.00   sec  1.38 GBytes  11.8 Gbits/sec    0   9.96 MBytes
 [  5]   1.00-2.00   sec  1.59 GBytes  13.6 Gbits/sec    0   13.3 MBytes
 [  5]   2.00-3.00   sec  1.63 GBytes  14.0 Gbits/sec    0   13.3 MBytes
 [  5]   3.00-4.00   sec  1.78 GBytes  15.3 Gbits/sec    0   13.3 MBytes
 [  5]   4.00-5.00   sec  1.80 GBytes  15.5 Gbits/sec    0   15.8 MBytes
 [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0   15.8 MBytes
 [  5]   6.00-7.00   sec  1.65 GBytes  14.2 Gbits/sec    0   15.8 MBytes
 [  5]   7.00-8.00   sec  1.68 GBytes  14.4 Gbits/sec    0   15.8 MBytes
 [  5]   8.00-9.00   sec  1.60 GBytes  13.7 Gbits/sec    0   15.8 MBytes
 [  5]   9.00-10.00  sec  1.66 GBytes  14.3 Gbits/sec    0   15.8 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.00  sec  16.5 GBytes  14.1 Gbits/sec    0             sender
 [  5]   0.00-10.01  sec  16.4 GBytes  14.1 Gbits/sec                  receiver

 iperf Done.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
3c6d24dd30 netlink, pasta: Configure MTU of tap interface on --config-net
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
54a19002df conf: Add -P, --pid, to specify a file where own PID is written to
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
1cbd2c8c6b conf: Reset netns_only flag after probing
...if we check whether an option might be a namespace specification,
and it turns out not to be (e.g. with --pcap), we might set
netns_only, but we don't reset it back to 0 if it wasn't set.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
c61944a1f8 tcp: Explicitly align IP headers in tcp4_l2_{,flags}buf_t also in non-AVX2 build
Otherwise, tcp4_l2_flags_buf_t is not consistent with tcp4_l2_buf_t and
header fields get all mixed up in tcp_l2_buf_fill_headers().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
f45891cf26 conf, tcp, udp: Add --no-map-gw to disable mapping gateway address to host
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:19:52 +02:00
Stefano Brivio
3bb859c505 passt: Warn if we're running as root, abort if we can't change to nobody:nobody
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:19:25 +02:00
Stefano Brivio
fc93f97774 conf: Reset errno before checking port specifier with strtol(3)
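
strtol(3) reports overflow by setting errno to ERANGE, but never
clears errno on success, so a stale value left by an earlier call
would look like a parsing error. A minimal sketch of the pattern,
with a hypothetical parse_port() that accepts a single port number
only, no ranges:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parser: 0 on success, -1 on any invalid input */
static int parse_port(const char *s, unsigned short *port)
{
	char *end;
	long v;

	errno = 0;	/* strtol() never resets it by itself */
	v = strtol(s, &end, 10);
	if (errno || end == s || *end || v < 0 || v > 65535)
		return -1;

	*port = (unsigned short)v;
	return 0;
}
```
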
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:18:50 +02:00
Stefano Brivio
9f1724ad1e passt: Drop all capabilities that we might have, except for CAP_NET_BIND_SERVICE
While it's not recommended to give passt any capability, drop all the
ones we might have got by mistake, except for the only sensible one,
CAP_NET_BIND_SERVICE.
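
A minimal sketch of dropping everything but CAP_NET_BIND_SERVICE from
the effective and permitted sets, and clearing the inheritable set,
via the raw capget(2) and capset(2) syscalls, as glibc provides no
wrappers (libcap's cap_set_proc() is the usual library route).
drop_caps_except_bind() is a hypothetical name, this isn't
necessarily how passt does it, and the bounding set is left alone.
Lowering one's own capabilities requires no privileges:

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

static int drop_caps_except_bind(void)
{
	struct __user_cap_header_struct hdr = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0,	/* this process */
	};
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
	unsigned int i;

	if (syscall(SYS_capget, &hdr, data))
		return -1;

	/* Capability sets are split into 32-bit words */
	for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++) {
		__u32 keep = 0;

		if ((unsigned int)CAP_TO_INDEX(CAP_NET_BIND_SERVICE) == i)
			keep = CAP_TO_MASK(CAP_NET_BIND_SERVICE);

		data[i].effective &= keep;
		data[i].permitted &= keep;
		data[i].inheritable = 0;
	}

	return syscall(SYS_capset, &hdr, data) ? -1 : 0;
}
```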

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:17:43 +02:00
Stefano Brivio
32d07f5e59 passt, pasta: Completely avoid dynamic memory allocation
Replace libc functions that might dynamically allocate memory with
our own implementations or wrappers.

Drop brk(2) from list of allowed syscalls in seccomp profile.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:16:03 +02:00
Stefano Brivio
66d5930ec7 passt, pasta: Add seccomp support
The list of allowed syscalls comes from comments of the form:
	#syscalls <list>

for syscalls needed both in passt and pasta mode, and:
	#syscalls:pasta <list>
	#syscalls:passt <list>

for syscalls specifically needed in pasta or passt mode only.

seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.

While at it, clean up the Makefile a bit, using wildcards.
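
To show what such a filter does at run time, a much simpler sketch:
a linear allow-list of BPF comparisons, not the binary search tree
that seccomp.sh generates. Non-listed syscalls fail with EPERM rather
than killing the process, and seccomp_data.arch is not checked, which
a real filter must do. install_allowlist() and allowlist_demo() are
hypothetical names:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Allow getpid(), exit() and exit_group(), refuse anything else */
static int install_allowlist(void)
{
	struct sock_filter f[] = {
		/* Load the syscall number */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* On match, jump forward to the final "allow" statement */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 3, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(f) / sizeof(f[0]),
		.filter = f,
	};

	/* Needed to install a filter without CAP_SYS_ADMIN */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Install the filter in a child process, check that an allowed syscall
 * works and a non-listed one is refused. Returns 0 on success. */
static int allowlist_demo(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (!pid) {
		if (install_allowlist())
			_exit(2);
		if (syscall(SYS_getpid) <= 0)
			_exit(3);
		if (syscall(SYS_getppid) != -1 || errno != EPERM)
			_exit(4);
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status);
}
```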

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:46 +02:00
Stefano Brivio
f318174a93 test: Drop debugging left-overs in lib/util
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
d5c887de87 doc: Add to man page tip to grant passt the CAP_NET_BIND_SERVICE capability
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
4869d309e1 doc: Fix up note about missing tcpi_snd_wnd in man page
The behaviour without tcpi_snd_wnd changed: the only difference now
is the advertised window, which corresponds to the queried sending
buffer size.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
c9d57fee7c tcp: Decrease pool size for pipes to 16
This should be a reasonable balance between quick connection
establishment and a fast start-up.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00