1697 commits

Author SHA1 Message Date
Stefano Brivio
4b12cf94f0 checksum: Stream load into four registers at a time with > 128 bytes
...and further interleave register usage. This brings the csum()
overhead reported by perf(1) for 30 seconds of 64KiB TCP IPv4
frames, host to guest, from 7.2% to 5.8%.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 17:04:46 +02:00
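
The interleaving idea in the commit above can be illustrated with a scalar sketch. This is not passt's csum(), which uses AVX2 loads; csum_sketch() is a hypothetical helper that only shows how four independent accumulators can be summed and then folded into a 16-bit Internet checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar sketch, not the AVX2 code from the commit: four
 * independent accumulators avoid a single dependency chain in the main loop.
 */
uint16_t csum_sketch(const void *buf, size_t len)
{
	uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, sum;
	const uint8_t *p = buf;
	uint16_t w[4], t;

	/* Main loop: four 16-bit words per iteration, one per accumulator */
	while (len >= 8) {
		memcpy(w, p, 8);
		s0 += w[0];
		s1 += w[1];
		s2 += w[2];
		s3 += w[3];
		p += 8;
		len -= 8;
	}
	sum = s0 + s1 + s2 + s3;

	/* Remaining whole words */
	while (len >= 2) {
		memcpy(&t, p, 2);
		sum += t;
		p += 2;
		len -= 2;
	}

	/* Odd trailing byte, padded with a zero byte */
	if (len) {
		t = 0;
		memcpy(&t, p, 1);
		sum += t;
	}

	/* Fold carries back into the low 16 bits, then complement */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Words are summed in memory order, so the returned value can be stored into the checksum field as is, without byte swapping.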
Stefano Brivio
74f29d3148 checksum: Interleave lo/hi sums while folding into 128-bit sums, drop TODO
I left a TODO and never checked -- this actually seems to slightly
improve CPIs on AMD Naples (two 128-bit FMA units glued together).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 16:59:12 +02:00
Stefano Brivio
364cc313ea pasta: Allow nanosleep(2) and clock_nanosleep(2) syscalls too
...we need those to wait for terminating processes in the namespace.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 21:48:44 +02:00
Stefano Brivio
dca31d4206 netlink: Bring up interface even if neither MTU nor MAC address is configured
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 19:11:05 +02:00
Stefano Brivio
388435542e passt: Don't refuse to run if UID is 0 in non-init namespace
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 18:01:00 +02:00
Stefano Brivio
54a65e3693 pasta: Push pasta.h header
...I forgot to add this earlier.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:40:53 +02:00
Stefano Brivio
2e6e29a757 slirp4netns.sh: Introduce compatibility wrapper behaving like slirp4netns(1)
Warning: draft quality, not really tested, --enable-sandbox not
supported yet.

Example:

 $ unshare -rUn
 # echo $$
 3130879

 $ ./slirp4netns.sh -m 65520 -c 3130879 tap0
 sent tapfd=5 for tap0
 received tapfd=5
 Starting slirp
 * MTU:             65520
 * Network:         10.0.2.0
 * Netmask:         255.255.255.0
 * Gateway:         10.0.2.2
 * DNS:             10.0.2.3
 * Recommended IP:  10.0.2.100
 WARNING: 127.0.0.1:* on the host is accessible as 10.0.2.2 (set --disable-host-loopback to prohibit connecting to 127.0.0.1:*)

 # ip li sh
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 33: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
     link/ether 5e:9d:a0:c5:cf:67 brd ff:ff:ff:ff:ff:ff
 # ip ad sh
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
 33: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UNKNOWN group default qlen 1000
     link/ether 5e:9d:a0:c5:cf:67 brd ff:ff:ff:ff:ff:ff
     inet 10.0.2.0/24 scope global tap0
        valid_lft forever preferred_lft forever
     inet6 fe80::5c9d:a0ff:fec5:cf67/64 scope link
        valid_lft forever preferred_lft forever
 # ip ro sh
 default via 10.0.2.2 dev tap0
 10.0.2.0/24 dev tap0 proto kernel scope link src 10.0.2.0
 # ip -6 ro sh
 fe80::/64 dev tap0 proto kernel metric 256 pref medium
 # iperf3 -c 10.0.2.2 -l1M
 Connecting to host 10.0.2.2, port 5201
 [  5] local 10.0.2.0 port 43014 connected to 10.0.2.2 port 5201
 [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
 [  5]   0.00-1.00   sec  1.38 GBytes  11.8 Gbits/sec    0   9.96 MBytes
 [  5]   1.00-2.00   sec  1.59 GBytes  13.6 Gbits/sec    0   13.3 MBytes
 [  5]   2.00-3.00   sec  1.63 GBytes  14.0 Gbits/sec    0   13.3 MBytes
 [  5]   3.00-4.00   sec  1.78 GBytes  15.3 Gbits/sec    0   13.3 MBytes
 [  5]   4.00-5.00   sec  1.80 GBytes  15.5 Gbits/sec    0   15.8 MBytes
 [  5]   5.00-6.00   sec  1.69 GBytes  14.5 Gbits/sec    0   15.8 MBytes
 [  5]   6.00-7.00   sec  1.65 GBytes  14.2 Gbits/sec    0   15.8 MBytes
 [  5]   7.00-8.00   sec  1.68 GBytes  14.4 Gbits/sec    0   15.8 MBytes
 [  5]   8.00-9.00   sec  1.60 GBytes  13.7 Gbits/sec    0   15.8 MBytes
 [  5]   9.00-10.00  sec  1.66 GBytes  14.3 Gbits/sec    0   15.8 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.00  sec  16.5 GBytes  14.1 Gbits/sec    0             sender
 [  5]   0.00-10.01  sec  16.4 GBytes  14.1 Gbits/sec                  receiver

 iperf Done.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
3c6d24dd30 netlink, pasta: Configure MTU of tap interface on --config-net
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
54a19002df conf: Add -P, --pid, to specify a file where own PID is written to
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
1cbd2c8c6b conf: Reset netns_only flag after probing
...if we check whether an option might be a namespace specification,
and it turns out not to be (e.g. with --pcap), we might set
netns_only, but we don't reset it back to 0 if it wasn't set.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
c61944a1f8 tcp: Explicitly align IP headers in tcp4_l2_{,flags}buf_t also in non-AVX2 build
Otherwise, tcp4_l2_flags_buf_t is not consistent with tcp4_l2_buf_t and
header fields get all mixed up in tcp_l2_buf_fill_headers().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:20:34 +02:00
Stefano Brivio
f45891cf26 conf, tcp, udp: Add --no-map-gw to disable mapping gateway address to host
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:19:52 +02:00
Stefano Brivio
3bb859c505 passt: Warn if we're running as root, abort if we can't change to nobody:nobody
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:19:25 +02:00
Stefano Brivio
fc93f97774 conf: Reset errno before checking port specifier with strtol(3)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:18:50 +02:00
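
As a minimal sketch of why the reset matters: strtol(3) sets errno on overflow but never clears it, so a stale value from an earlier call can look like a parsing error. parse_port() below is a hypothetical helper, not the function changed by this commit.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal port number, return -1 on any error */
int parse_port(const char *s)
{
	char *end;
	long v;

	errno = 0;	/* strtol() doesn't clear errno on success */
	v = strtol(s, &end, 10);

	/* Reject overflow, empty input, trailing garbage, out-of-range values */
	if (errno || end == s || *end != '\0' || v < 0 || v > 65535)
		return -1;

	return (int)v;
}
```

Without the reset, a call such as parse_port("22") would fail whenever errno was left set by unrelated code.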
Stefano Brivio
9f1724ad1e passt: Drop all capabilities that we might have, except for CAP_NET_BIND_SERVICE
While it's not recommended to give passt any capability, drop all the
ones we might have got by mistake, except for the only sensible one,
CAP_NET_BIND_SERVICE.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:17:43 +02:00
Stefano Brivio
32d07f5e59 passt, pasta: Completely avoid dynamic memory allocation
Replace libc functions that might dynamically allocate memory with own
implementations or wrappers.

Drop brk(2) from list of allowed syscalls in seccomp profile.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:16:03 +02:00
Stefano Brivio
66d5930ec7 passt, pasta: Add seccomp support
List of allowed syscalls comes from comments in the form:
	#syscalls <list>

for syscalls needed both in passt and pasta mode, and:
	#syscalls:pasta <list>
	#syscalls:passt <list>

for syscalls specifically needed in pasta or passt mode only.

seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.

While at it, clean up the Makefile a bit using wildcards.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:46 +02:00
Stefano Brivio
f318174a93 test: Drop debugging left-overs in lib/util
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
d5c887de87 doc: Add to man page tip to grant passt the CAP_NET_BIND_SERVICE capability
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
4869d309e1 doc: Fix up note about missing tcpi_snd_wnd in man page
The behaviour without tcpi_snd_wnd changed: the only difference now
is the advertised window, which corresponds to the queried sending
buffer size.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
c9d57fee7c tcp: Decrease pool size for pipes to 16
This should be a reasonable balance between quick connection
establishment and a fast start-up.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
44ca4bcf3e util: Fix comment to bitmap_clear()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
675174d4ba conf, tap: Split netlink and pasta functions, allow interface configuration
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.

Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
dcd3605d14 conf: Don't get IPv{4,6} DNS addresses if IPv{4,6} is disabled
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-10 01:24:19 +02:00
Stefano Brivio
580581fd96 conf: Avoid getifaddrs(), split L2/L3 address fetching, get filtered dumps
getifaddrs() needs to allocate heap memory, and gets a ton of results
we don't need. Use explicit netlink messages with "strict checking"
instead.

While at it, separate L2/L3 address handling, so that we don't fetch
MAC addresses for IPv6, and also use netlink instead of ioctl() to
get the MAC address.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-10 01:13:27 +02:00
Stefano Brivio
e871fa9f22 README: Drop domain part in absolute links
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 15:14:22 +02:00
Stefano Brivio
40767a0da3 conf: Fix getopt_long() return value for --quiet
Only the short version actually worked.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:12:17 +02:00
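
The commit message doesn't say what the wrong value was. One plausible illustration, under that assumption, is that a long option is only handled like its short form if its struct option entry returns the same value. parse_quiet() is a hypothetical function, not code from conf.c.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical sketch: --quiet returns 'q', same as the short option -q */
int parse_quiet(int argc, char **argv)
{
	const struct option opts[] = {
		{ "quiet",	no_argument,	NULL,	'q' },
		{ NULL,		0,		NULL,	0 },
	};
	int c, quiet = 0;

	optind = 1;	/* Restart scanning, in case getopt was used before */

	while ((c = getopt_long(argc, argv, "q", opts, NULL)) != -1) {
		if (c == 'q')
			quiet = 1;
	}

	return quiet;
}
```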
Stefano Brivio
2da54a0292 pasta: Add second waitid() in pasta_child_handler()
We usually have up to one additional child exiting while we receive
a SIGCHLD. Instead of complicating this with tracking PIDs, just
add a second waitid() call.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:12:17 +02:00
Giuseppe Scrivano
9a175cc2ce pasta: Allow specifying paths and names of namespaces
Based on a patch from Giuseppe Scrivano, this adds the ability to:

- specify paths and names of target namespaces to join, instead of
  a PID, also for user namespaces, with --userns

- request to join or create a network namespace only, without
  entering or creating a user namespace, with --netns-only

- specify the base directory for netns mountpoints, with --nsrun-dir

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
 they're not created, implemented --netns-only and --nsrun-dir,
 updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:05:15 +02:00
Stefano Brivio
ab32838022 git: Add pre-push hook
I've been using this for a while, now it's all "nice" and clean.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-06 14:06:37 +02:00
Stefano Brivio
8131fc9175 tcp: Check if timestamp is passed also while sending FIN to tap/guest
...it's probably possible that we might need to reset a connection
together with a FIN segment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 23:21:40 +02:00
Stefano Brivio
ccbf13ed1b tcp: Drop EPOLLOUT for connections being established earlier
That's the first thing we have to do, before sending SYN, ACK:
if tcp_send_to_tap() fails, we'll get a lot of useless events
otherwise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
a909fd5e7a conf: Silence gcc -Os warning
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
16f4b983de passt: Shrink binary size by dropping static initialisers
...from 11MiB to 155KiB for 'make avx2', 95KiB with -Os and stripped.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
a26722b875 test/lib/term: Export PCAP and DEBUG variables for tmux sessions globally
Otherwise, this would depend on the local tmux configuration.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
8ec5adc989 test/lib/setup: Increase --max-stackframe in commented-out valgrind command
...so that I don't keep fighting with this for pasta clone() calls.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
d565082f84 tcp: Simplify ACK-sending conditions in tcp_data_from_tap()
Now that we have a proper function checking when and how to send
ACKs and window updates, we don't need to duplicate this logic in
tcp_data_from_tap().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
eda446ba54 tcp: Always probe SO_SNDBUF, second attempt
I fell for this already: the sending buffer might shrink later!

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
a4826ee04b tcp: Defer and coalesce all segments with no data (flags) to handler
...using pre-cooked buffers, just like we do with other segments.

While at it, remove some code duplication by having separate
functions for updating ACK sequence and window, and for filling in
buffer headers.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
371667fcfb tcp: Increase LOW_RTT_THRESHOLD to 10us
Sometimes we can get up to 6-7us minimum RTT for local connections too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
78631ceb99 tcp: Reduce size of socket pools
A large pool helps marginally with CRR latency, but has detrimental
effects on TCP memory pressure.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
cf9976beac tcp: Increase TCP_TAP_FRAMES once more
With an increased sending buffer size for the AF_UNIX socket, we
can get slightly lower overhead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
d4d61480b6 tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
eef4e82903 passt: Add handler for optional deferred tasks
We'll need this for TCP ACK coalescing on tap/guest-side. For
convenience, allow _handler() functions to be undefined, courtesy
of __attribute__((weak)).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 19:20:26 +02:00
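
A minimal sketch of the weak-symbol technique, with hypothetical names (tcp_defer_sketch(), udp_defer_sketch(), run_deferred()) and assuming GCC or Clang on an ELF platform such as Linux: a function declared weak and left undefined resolves to a null address, so the caller can skip it.

```c
#include <stddef.h>

/* Hypothetical handlers: weak declarations resolve to NULL at link time if
 * no definition is provided anywhere in the program.
 */
void tcp_defer_sketch(void) __attribute__((weak));
void udp_defer_sketch(void) __attribute__((weak));

static int tcp_calls;

/* Only the TCP handler is defined here; the UDP one is left undefined */
void tcp_defer_sketch(void)
{
	tcp_calls++;
}

/* Call every handler that exists, return how many were run */
int run_deferred(void)
{
	int ran = 0;

	if (tcp_defer_sketch) {
		tcp_defer_sketch();
		ran++;
	}

	if (udp_defer_sketch) {
		udp_defer_sketch();
		ran++;
	}

	return ran;
}
```

The caller doesn't need preprocessor conditionals or a registration table: adding a handler later only takes a definition with the expected name.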
Stefano Brivio
529b245d2b demo/pasta: Enter the right directory before issuing perf report -g
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
52054d8b37 tcp: Fix botched timeout comparison
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
98dfe1cdf4 tcp: Check pending ACK every two thirds of window, not every half
...to spare some syscalls. If it's not enough, the timer will take
care of it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
ffaf1d09f2 tcp: Don't set ACK flag while merely updating window value
The receiver might take this as a duplicate ACK otherwise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
81128241d6 tcp: Set TCP_TAP_FRAMES back to 32
Now that we fixed the issue with small receiving buffers, we can
safely increase this back and get slightly lower syscall overhead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
683043e200 tcp: Probe net.core.{r,w}mem_max, don't set SO_{RCV,SND}BUF if low
If net.core.rmem_max and net.core.wmem_max sysctls have low values,
we can get bigger buffers by not trying to set them high -- the
kernel would lock their values to what we get.

Try, instead, to get bigger buffers by queueing as much as possible,
and if maximum values in tcp_wmem and tcp_rmem are bigger than this,
that will work.

While at it, drop QUICKACK option for non-spliced sockets, I set
that earlier by mistake.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:20:43 +02:00
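
The probing idea can be sketched as follows. sndbuf_is_low() is a hypothetical helper, not the code from this commit: it requests a send buffer size on a throwaway socket, reads back what the kernel granted, and reports whether that is below a threshold, in which case the caller would leave SO_SNDBUF unset on real sockets.

```c
#include <sys/socket.h>

/* Hypothetical probe: ask for 'want' bytes of sending buffer on socket s,
 * return 1 if the size actually granted is below 'threshold' (or on error),
 * 0 otherwise. On Linux, the size is capped by net.core.wmem_max.
 */
int sndbuf_is_low(int s, int want, int threshold)
{
	socklen_t sl = sizeof(want);
	int v = want;

	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &v, sizeof(v)))
		return 1;

	/* Read back the effective value, it may differ from the request */
	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &v, &sl))
		return 1;

	return v < threshold;
}
```

The probe should run on a socket that is closed afterwards, because setting SO_SNDBUF explicitly fixes the buffer size for that socket.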