Commit graph

333 commits

Author SHA1 Message Date
Stefano Brivio
9f1724ad1e passt: Drop all capabilities that we might have, except for CAP_NET_BIND_SERVICE
While it's not recommended to give passt any capability, drop all the
ones we might have got by mistake, except for the only sensible one,
CAP_NET_BIND_SERVICE.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:17:43 +02:00
Stefano Brivio
32d07f5e59 passt, pasta: Completely avoid dynamic memory allocation
Replace libc functions that might dynamically allocate memory with own
implementations or wrappers.

Drop brk(2) from list of allowed syscalls in seccomp profile.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:16:03 +02:00
Stefano Brivio
66d5930ec7 passt, pasta: Add seccomp support
List of allowed syscalls comes from comments in the form:
	#syscalls <list>

for syscalls needed both in passt and pasta mode, and:
	#syscalls:pasta <list>
	#syscalls:passt <list>

for syscalls specifically needed in pasta or passt mode only.

seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.

While at it, clean up a bit the Makefile using wildcards.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:46 +02:00
Stefano Brivio
f318174a93 test: Drop debugging left-overs in lib/util
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
d5c887de87 doc: Add to man page tip to grant passt the CAP_NET_BIND_SERVICE capability
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
4869d309e1 doc: Fix up note about missing tcpi_snd_wnd in man page
The behaviour without tcpi_snd_wnd changed: the only difference now
is the advertised window, which corresponds to the queried sending
buffer size.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
c9d57fee7c tcp: Decrease pool size for pipes to 16
This should be a reasonable balance between quick connection
establishment and a fast start-up.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
44ca4bcf3e util: Fix comment to bitmap_clear()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
675174d4ba conf, tap: Split netlink and pasta functions, allow interface configuration
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.

Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
dcd3605d14 conf: Don't get IPv{4,6} DNS addresses if IPv{4,6} is disabled
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-10 01:24:19 +02:00
Stefano Brivio
580581fd96 conf: Avoid getifaddrs(), split L2/L3 address fetching, get filtered dumps
getifaddrs() needs to allocate heap memory, and gets a ton of results
we don't need. Use explicit netlink messages with "strict checking"
instead.

While at it, separate L2/L3 address handling, so that we don't fetch
MAC addresses for IPv6, and also use netlink instead of ioctl() to
get the MAC address.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-10 01:13:27 +02:00
Stefano Brivio
e871fa9f22 README: Drop domain part in absolute links
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 15:14:22 +02:00
Stefano Brivio
40767a0da3 conf: Fix getopt_long() return value for --quiet
Only the short version actually worked.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:12:17 +02:00
Stefano Brivio
2da54a0292 pasta: Add second waitid() in pasta_child_handler()
We usually have up to one additional child exiting while we receive
a SIGCHLD, instead of complicating this with tracking PIDs, just
add a second waitid() call.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:12:17 +02:00
Giuseppe Scrivano
9a175cc2ce pasta: Allow specifying paths and names of namespaces
Based on a patch from Giuseppe Scrivano, this adds the ability to:

- specify paths and names of target namespaces to join, instead of
  a PID, also for user namespaces, with --userns

- request to join or create a network namespace only, without
  entering or creating a user namespace, with --netns-only

- specify the base directory for netns mountpoints, with --nsrun-dir

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
 they're not created, implemented --netns-only and --nsrun-dir,
 updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:05:15 +02:00
Stefano Brivio
ab32838022 git: Add pre-push hook
I've been using this for a while, now it's all "nice" and clean.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-06 14:06:37 +02:00
Stefano Brivio
8131fc9175 tcp: Check if timestamp is passed also while sending FIN to tap/guest
...it's probably possible that we might need to reset a connection
together with a FIN segment.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 23:21:40 +02:00
Stefano Brivio
ccbf13ed1b tcp: Drop EPOLLOUT for connections being established earlier
That's the first thing we have to do, before sending SYN, ACK:
if tcp_send_to_tap() fails, we'll get a lot of useless events
otherwise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
a909fd5e7a conf: Silence gcc -Os warning
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
16f4b983de passt: Shrink binary size by dropping static initialisers
...from 11MiB to 155KiB for 'make avx2', 95KiB with -Os and stripped.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 21:22:59 +02:00
Stefano Brivio
a26722b875 test/lib/term: Export PCAP and DEBUG variables for tmux sessions globally
Otherwise, this would depend on the local tmux configuration.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
8ec5adc989 test/lib/setup: Increase --max-stackframe in commented-out valgrind command
...so that I don't keep fighting with this for pasta clone() calls.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
d565082f84 tcp: Simplify ACK-sending conditions in tcp_data_from_tap()
Now that we have a proper function checking when and how to send
ACKs and window updates, we don't need to duplicate this logic in
tcp_data_from_tap().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
eda446ba54 tcp: Always probe SO_SNDBUF, second attempt
I fell for this already: the sending buffer might shrink later!

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
a4826ee04b tcp: Defer and coalesce all segments with no data (flags) to handler
...using pre-cooked buffers, just like we do with other segments.

While at it, remove some code duplication by having separate
functions for updating ACK sequence and window, and for filling in
buffer headers.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
371667fcfb tcp: Increase LOW_RTT_THRESHOLD to 10us
Sometimes we can get up to 6-7us minimum RTT for local connections too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
78631ceb99 tcp: Reduce size of socket pools
A large pool helps marginally with CRR latency, but has detrimental
effects on TCP memory pressure.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
cf9976beac tcp: Increase TCP_TAP_FRAMES once more
With an increased sending buffer size for the AF_UNIX socket, we
can get slightly lower overhead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
d4d61480b6 tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
eef4e82903 passt: Add handler for optional deferred tasks
We'll need this for TCP ACK coalescing on tap/guest-side. For
convenience, allow _handler() functions to be undefined, courtesy
of __attribute__((weak)).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 19:20:26 +02:00
Stefano Brivio
529b245d2b demo/pasta: Enter the right directory before issuing perf report -g
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
52054d8b37 tcp: Fix botched timeout comparison
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
98dfe1cdf4 tcp: Check pending ACK every two thirds of window, not every half
...to spare some syscalls. If it's not enough, the timer will take
care of it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
ffaf1d09f2 tcp: Don't set ACK flag while merely updating window value
The receiver might take this as a duplicate ACK othewise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
81128241d6 tcp: Set TCP_TAP_FRAMES back to 32
Now that we fixed the issue with small receiving buffers, we can
safely increase this back and get slightly lower syscall overhead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:21:21 +02:00
Stefano Brivio
683043e200 tcp: Probe net.core.{r,w}mem_max, don't set SO_{RCV,SND}BUF if low
If net.core.rmem_max and net.core.wmem_max sysctls have low values,
we can get bigger buffers by not trying to set them high -- the
kernel would lock their values to what we get.

Try, instead, to get bigger buffers by queueing as much as possible,
and if maximum values in tcp_wmem and tcp_rmem are bigger than this,
that will work.

While at it, drop QUICKACK option for non-spliced sockets, I set
that earlier by mistake.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:20:43 +02:00
Stefano Brivio
e1a2e2780c tcp: Check if connection is local or low RTT was seen before using large MSS
If the connection is local or the RTT was comparable to the time it
takes to queue a batch of messages, we can safely use a large MSS
regardless of the sending buffer, but otherwise not.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:20:43 +02:00
Stefano Brivio
f6bff339a9 tcp: Adjust usage of sending buffer depending on its size
If we start with a very small sending buffer, we can make the kernel
expand it if we cause the congestion window to get bigger, but this
won't reliably happen if we use just half (other half is accounted
as overhead).

Scale usage depending on its own size, we might eventually get some
retransmissions because we can't queue messages the sender sends us
in-window, but it's better than keeping that small buffer forever.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-04 22:20:43 +02:00
Stefano Brivio
2408ddffa3 tcp: Derive MSS announced to guest/namespace from configured MTU if present
...and from the sending socket only if the MTU is not configured.

Otherwise, a connection to a host from a local guest, with a
non-loopback destination address, will get its MSS from the MTU of the
outbound interface with that address, which is unnecessary as we know
the guest can send us larger segments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-29 16:46:58 +02:00
Stefano Brivio
4e5129719d test: Record CI and demo videos in Xvfb by default, fix demo setup sequence
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-29 16:45:26 +02:00
Stefano Brivio
a8b767b06d README: Fix pasta anchor in Try it section
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-28 14:45:07 +02:00
Stefano Brivio
299737fa74 doc: Add source Excalidraw scene files for diagrams
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 15:11:14 +02:00
Stefano Brivio
061519b562 test: Add CI/demo scripts
Not really quick, definitely dirty.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 15:10:35 +02:00
Stefano Brivio
ca325e7583 README: Add demo section
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 13:45:17 +02:00
Stefano Brivio
9657b6ed05 conf, tcp: Periodic detection of bound ports for pasta port forwarding
Detecting bound ports at start-up time isn't terribly useful: do this
periodically instead, if configured.

This is only implemented for TCP at the moment, UDP is somewhat more
complicated: leave a TODO there.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 11:23:44 +02:00
Stefano Brivio
e69e13671d util: Fix parsing of next option in ipv6_l4hdr()
We need to update next header and header length as soon as we meet
a new option header.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:09 +02:00
Stefano Brivio
904b86ade7 tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes
This introduces a number of fundamental changes that would be quite
messy to split. Summary:

- advertised window scaling can be as big as we want, we just need
  to clamp window sizes to avoid exceeding the size of our "discard"
  buffer for unacknowledged data from socket

- add macros to compare sequence numbers

- force sending ACK to guest/tap on PSH segments, always in pasta
  mode, whenever we see an overlapping segment, or when we reach a
  given threshold compared to our window

- we don't actually use recvmmsg() here, fix comments and label

- introduce pools for pre-opened sockets and pipes, to decrease
  latency on new connections

- set receiving and sending buffer sizes to the maximum allowed,
  kernel will clamp and round appropriately

- defer clean-up of spliced and non-spliced connection to timer

- in tcp_send_to_tap(), there's no need anymore to keep a large
  buffer, shrink it down to what we actually need

- introduce SO_RCVLOWAT setting and activity tracking for spliced
  connections, to coalesce data moved by splice() calls as much as
  possible

- as we now have a compacted connection table, there's no need to
  keep sparse bitmaps tracking connection activity -- simply go
  through active connections with a loop in the timer handler

- always clamp the advertised window to half our sending buffer,
  too, to minimise retransmissions from the guest/tap

- set TCP_QUICKACK for originating socket in spliced connections,
  there's no need to delay them

- fix up timeout for unacknowledged data from socket

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:02 +02:00
Stefano Brivio
3c839bfc46 tcp: Drop TODO about sequence collision attacks
A random initial sequence number based on a secret has already been
there for a while.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:02 +02:00
Stefano Brivio
f004de4a9d tap: Don't leak file descriptor used to bring up loopback interface
...and while at it, set the socket as non-blocking directly on open().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:02 +02:00
Stefano Brivio
f1c1d40f90 tap: Fix comment for tap_sock_init_tun_ns()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:02 +02:00