Stefano Brivio 78631ceb99 tcp: Reduce size of socket pools
A large pool helps marginally with CRR latency, but has detrimental
effects on TCP memory pressure.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
doc doc: Add source Excalidraw scene files for diagrams 2021-09-27 15:11:14 +02:00
libvirt libvirt: Rebase to latest upstream 2021-04-22 13:38:32 +02:00
qemu qemu: Rebase patches on latest upstream 2021-04-22 13:36:53 +02:00
test demo/pasta: Enter the right directory before issuing perf report -g 2021-10-04 22:21:21 +02:00
arp.c arp: Don't resolve own, configured IPv4 address 2021-09-01 17:00:27 +02:00
arp.h qrap: Connect to the first available instance of passt, probe via ARP request 2021-05-21 11:14:52 +02:00
checksum.c checksum: Introduce AVX2 implementation, unify helpers 2021-07-26 07:18:50 +02:00
checksum.h checksum: Add checksum.h 2021-09-14 19:02:36 +02:00
conf.c conf, tcp: Periodic detection of bound ports for pasta port forwarding 2021-09-27 11:23:44 +02:00
conf.h conf, tcp: Periodic detection of bound ports for pasta port forwarding 2021-09-27 11:23:44 +02:00
dhcp.c conf, dhcp, ndp: Fix message about default MTU, make NDP consistent 2021-09-09 15:40:04 +02:00
dhcp.h passt: Assorted fixes from "fresh eyes" review 2021-02-21 11:55:49 +01:00
dhcpv6.c passt, pasta: Introduce command-line options and port re-mapping 2021-09-01 17:00:27 +02:00
dhcpv6.h passt: Introduce a DHCPv6 server 2021-04-13 22:37:40 +02:00
icmp.c tap: Completely de-serialise input message batches 2021-09-27 01:28:02 +02:00
icmp.h tap: Completely de-serialise input message batches 2021-09-27 01:28:02 +02:00
igmp.c passt: Create dummy igmp.c, mld.c files for image map in README 2021-04-13 22:41:04 +02:00
Makefile passt: Align pkt_buf to PAGE_SIZE (start and size), try to fit in huge pages 2021-09-27 01:28:02 +02:00
mld.c passt: Create dummy igmp.c, mld.c files for image map in README 2021-04-13 22:41:04 +02:00
ndp.c ndp: Set router lifetime to 9000s instead of 3600s 2021-09-27 01:28:02 +02:00
ndp.h passt: Assorted fixes from "fresh eyes" review 2021-02-21 11:55:49 +01:00
passt.1 conf, tcp: Periodic detection of bound ports for pasta port forwarding 2021-09-27 11:23:44 +02:00
passt.c tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00
passt.h tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00
pcap.c pcap: Drop O_DSYNC from pcap file descriptor 2021-09-27 01:28:02 +02:00
pcap.h udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket 2021-07-21 12:01:04 +02:00
qrap.1 passt, qrap: Add man pages 2021-09-01 17:00:27 +02:00
qrap.c qrap: Set x-txburst as temporary workaround for virtio-net TX stall 2021-09-09 15:40:04 +02:00
README.md README: Fix pasta anchor in Try it section 2021-09-28 14:45:07 +02:00
siphash.c passt: Add PASTA mode, major rework 2021-07-17 11:04:22 +02:00
siphash.h tcp: Add siphash implementation for initial sequence numbers 2021-03-17 10:57:36 +01:00
tap.c tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00
tap.h tap: Fill the IPv6 flow label field to represent flow association 2021-07-26 07:30:57 +02:00
tcp.c tcp: Reduce size of socket pools 2021-10-05 20:02:03 +02:00
tcp.h tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00
udp.c tap: Completely de-serialise input message batches 2021-09-27 01:28:02 +02:00
udp.h conf, tcp: Periodic detection of bound ports for pasta port forwarding 2021-09-27 11:23:44 +02:00
util.c tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00
util.h tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too 2021-10-05 20:02:03 +02:00

While functional and tested to some extent, this project is still in an early development phase: don't use it in production or critical environments yet.

passt: Plug A Simple Socket Transport

passt implements a translation layer between a Layer-2 network interface and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo) on a host. It doesn't require any capabilities or privileges, and it can be used as a simple replacement for Slirp.

pasta: Pack A Subtle Tap Abstraction

pasta (same binary as passt, different command) offers equivalent functionality, for network namespaces: traffic is forwarded using a tap interface inside the namespace, without the need to create further interfaces on the host, hence not requiring any capabilities or privileges.

It also implements a tap bypass path for local connections: packets with a local destination address are moved directly between Layer-4 sockets, avoiding Layer-2 translations, using the splice(2) and recvmmsg(2)/sendmmsg(2) system calls for TCP and UDP, respectively.

Motivation

passt

When container workloads are moved to virtual machines, the network traffic is typically forwarded by interfaces operating at data link level. Some components in the containers ecosystem (such as service meshes), however, expect applications to run locally, with visible sockets and processes, for the purposes of socket redirection, monitoring, port mapping.

To solve this issue, user mode networking, as provided e.g. by libslirp, can be used. Existing solutions implement a full TCP/IP stack, replaying traffic on sockets that are local to the pod of the service mesh. This creates the illusion of application processes running on the same host, possibly separated by user namespaces.

While being almost transparent to the service mesh infrastructure, that kind of solution comes with a number of downsides:

  • three different TCP/IP stacks (guest, adaptation and host) need to be traversed for every service request
  • addressing needs to be coordinated to create the pretense of consistent addresses and routes between guest and host environments. This typically needs a NAT with masquerading, or some form of packet bridging
  • the traffic seen by the service mesh and observable externally is a distant replica of the packets forwarded to and from the guest environment:
    • TCP congestion windows and network buffering mechanisms in general operate differently from what would be naturally expected by the application
    • protocols carrying addressing information might pose additional challenges, as the applications don't see the same set of addresses and routes as they would if deployed with regular containers

passt implements a thinner layer between guest and host, that only implements what's strictly needed to pretend processes are running locally. The TCP adaptation doesn't keep per-connection packet buffers, and reflects observed sending windows and acknowledgements between the two sides. This TCP adaptation is needed as passt runs without the CAP_NET_RAW capability: it can't create raw IP sockets on the pod, and therefore needs to map packets at Layer-2 to Layer-4 sockets offered by the host kernel.

The problem and this approach are illustrated in more detail, with diagrams, here.

pasta

On Linux, regular users can create network namespaces and run application services inside them. However, connecting namespaces to other namespaces and to external hosts requires the creation of network interfaces, such as veth pairs, which in turn needs elevated privileges or the CAP_NET_ADMIN capability. pasta, similarly to slirp4netns, solves this problem by creating a tap interface available to processes in the namespace, and mapping network traffic outside the namespace using native Layer-4 sockets.

Existing approaches typically implement a full, generic TCP/IP stack for this translation between data and transport layers, without the possibility of speeding up local connections, and usually requiring NAT. pasta:

  • avoids the need for a generic, full-fledged TCP/IP stack by coordinating TCP connection dynamics between sender and receiver
  • offers a fast bypass path for local connections: if a process connects to another process on the same host across namespaces, data is directly forwarded using pairs of Layer-4 sockets
  • with default options, maps routing and addressing information to the namespace, avoiding any need for NAT

Non-functional Targets

Security and maintainability goals:

  • no dynamic memory allocation
  • ~5 000 LoC target
  • no external dependencies

Interfaces and Environment

passt exchanges packets with qemu via UNIX domain socket, using the socket back-end in qemu. Currently, qemu can only connect to a listening process via TCP. Two temporary solutions are available:

  • a patch for qemu
  • a wrapper, qrap, that connects to a UNIX domain socket and starts qemu, which can now use the file descriptor that's already opened

This approach, compared to using a tap device, doesn't require any security capabilities, as we don't need to create any interface.
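As a rough illustration of exchanging Layer-2 frames over a stream-oriented UNIX domain socket, here is a minimal sketch. It assumes each frame is preceded by a 4-byte big-endian length, which is how qemu's stream socket back-end is commonly described; this README doesn't specify the framing, so treat that detail, and the names `encode_frame` and `decode_frames`, as assumptions made for this example only.

```python
import socket
import struct

# Assumption: every frame on the stream is prefixed by its length,
# encoded as a 4-byte unsigned integer in network byte order.
LEN_PREFIX = struct.Struct("!I")


def encode_frame(frame: bytes) -> bytes:
    """Prepend the length prefix to one Layer-2 frame."""
    return LEN_PREFIX.pack(len(frame)) + frame


def decode_frames(buf: bytes):
    """Split a receive buffer into complete frames.

    Returns (frames, remainder): the remainder holds a trailing partial
    frame that needs more data from the socket before it can be decoded.
    """
    frames = []
    offset = 0
    while len(buf) - offset >= LEN_PREFIX.size:
        (length,) = LEN_PREFIX.unpack_from(buf, offset)
        end = offset + LEN_PREFIX.size + length
        if end > len(buf):
            break  # frame not fully received yet
        frames.append(buf[offset + LEN_PREFIX.size:end])
        offset = end
    return frames, buf[offset:]


def demo():
    """Send two dummy frames across a connected pair of UNIX sockets."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        a.sendall(encode_frame(b"\x01" * 60) + encode_frame(b"\x02" * 42))
        frames, _ = decode_frames(b.recv(4096))
    return frames
```

Because a stream socket doesn't preserve message boundaries, the receiver has to carry the remainder over to the next read, which is why `decode_frames` returns it explicitly.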

pasta runs out of the box with any recent (post-3.8) Linux kernel.

Services

passt and pasta provide some minimalistic implementations of networking services:

  • ARP proxy, that resolves the address of the host (which is used as gateway) to the original MAC address of the host
  • DHCP server, a simple implementation handing out one single IPv4 address to the guest or namespace, namely, the same address as the first one configured for the upstream host interface, and passing the nameservers configured on the host
  • NDP proxy, which can also assign prefix and nameserver using SLAAC
  • DHCPv6 server: a simple implementation handing out one single IPv6 address to the guest or namespace, namely, the same address as the first one configured for the upstream host interface, and passing the nameservers configured on the host
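
To make the ARP proxy behaviour listed above more concrete, the following sketch builds a reply to an ARP request for the gateway address, answering with the host's MAC address. It only handles the standard 28-byte Ethernet/IPv4 ARP payload and is a simplified illustration, not the logic in arp.c; the function name `arp_proxy_reply` and its parameters are hypothetical.

```python
import struct

# Ethernet/IPv4 ARP payload: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP = struct.Struct("!HHBBH6s4s6s4s")
ARP_REQUEST, ARP_REPLY = 1, 2


def arp_proxy_reply(request: bytes, gateway_ip: bytes, host_mac: bytes):
    """Return an ARP reply payload, or None if the request is not for us.

    Assumption for this sketch: only requests for the gateway address
    are answered, using the MAC address of the host.
    """
    if len(request) < ARP.size:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = ARP.unpack(
        request[:ARP.size])
    if (htype, ptype, hlen, plen, oper) != (1, 0x0800, 6, 4, ARP_REQUEST):
        return None
    if tpa != gateway_ip:
        return None
    # Swap roles: the requested address becomes the sender of the reply,
    # and the original sender becomes the target.
    return ARP.pack(htype, ptype, hlen, plen, ARP_REPLY,
                    host_mac, tpa, sha, spa)
```

The Ethernet header that carries this payload is left out here; a complete responder would also need to fill in source and destination MAC addresses of the frame.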

Addresses

For IPv4, the guest or namespace is assigned, via DHCP, the same address as the upstream interface of the host, and the same default gateway as the default gateway of the host. Addresses are translated in case the guest is seen using a different address from the assigned one.

For IPv6, the guest or namespace is assigned, via SLAAC, the same prefix as the upstream interface of the host, the same default route as the default route of the host, and, if a DHCPv6 client is running in the guest or namespace, also the same address as the upstream address of the host. This means that, with a DHCPv6 client in the guest or namespace, addresses don't need to be translated. Should the client use a different address, the destination address is translated for packets going to the guest or to the namespace.

Local connections with passt

For UDP and TCP, for both IPv4 and IPv6, packets from the host addressed to a loopback address are forwarded to the guest with their source address changed to the address of the gateway or first hop of the default route. This mapping is reversed in the other direction.
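
The mapping described above can be sketched with Python's ipaddress module. This is only an illustration of the rule as stated in this section; the function names and the example addresses are hypothetical, and the real implementation operates on packet headers rather than on address strings.

```python
import ipaddress


def map_source_to_guest(src: str, gateway: str):
    """Host to guest: a loopback source is presented as the gateway."""
    addr = ipaddress.ip_address(src)
    return ipaddress.ip_address(gateway) if addr.is_loopback else addr


def map_destination_from_guest(dst: str, gateway: str):
    """Guest to host: traffic sent to the gateway goes to loopback."""
    addr = ipaddress.ip_address(dst)
    if addr == ipaddress.ip_address(gateway):
        # Pick the loopback address of the matching IP version
        return ipaddress.ip_address("127.0.0.1" if addr.version == 4
                                    else "::1")
    return addr
```

For example, with a gateway of 192.0.2.1, a connection made from the host to 127.0.0.1 appears in the guest as coming from 192.0.2.1, and the guest's replies to 192.0.2.1 are delivered to the loopback address on the host.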

Local connections with pasta

Packets addressed to a loopback address in either namespace are directly forwarded to the corresponding (or configured) port in the other namespace. Similarly to passt, packets from the non-init namespace addressed to the default gateway, which are therefore sent via the tap device, will have their destination address translated to the loopback address.

Protocols

passt and pasta support TCP, UDP and ICMP/ICMPv6 echo (requests and replies). More details about the TCP implementation are available here, and for the UDP implementation here.

An IGMP/MLD proxy is currently a work in progress.

Ports

passt

To avoid the need for explicit port mapping configuration, passt can bind to all unbound non-ephemeral (0-49152) TCP and UDP ports. Binding to low ports (0-1023) will fail without additional capabilities, and ports already bound (service proxies, etc.) will also not be used. Smaller subsets of ports, with port translations, are also configurable.

UDP ephemeral ports are bound dynamically, as the guest uses them.

If all ports are forwarded, service proxies and other services running in the container need to be started before passt starts.
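
The port ranges mentioned above, and the way binding can fail, can be sketched as follows. The boundary of the ephemeral range is an assumption here (49152, the start of the IANA dynamic range), and `port_class` and `try_bind_tcp` are hypothetical helpers for illustration, not functions from passt.

```python
import errno
import socket

PRIVILEGED_END = 1023    # low ports: binding needs additional capabilities
EPHEMERAL_START = 49152  # assumed start of the ephemeral range


def port_class(port: int) -> str:
    """Classify a port according to the ranges described in the text."""
    if not 0 <= port <= 65535:
        raise ValueError("port out of range")
    if port <= PRIVILEGED_END:
        return "privileged"
    if port < EPHEMERAL_START:
        return "forwardable"
    return "ephemeral"


def try_bind_tcp(port: int, host: str = "127.0.0.1"):
    """Try to bind a TCP socket; return it, or None if the port is
    unavailable (insufficient privileges, or already bound)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as e:
        sock.close()
        if e.errno in (errno.EACCES, errno.EADDRINUSE):
            return None
        raise
    return sock
```

A forwarder binding "all" ports would loop over the non-ephemeral range and simply skip the ports for which `try_bind_tcp` returns None, which is also why services that need a given port have to claim it before the forwarder starts.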

pasta

With default options, pasta scans for bound ports on init and non-init namespaces, and automatically forwards them from the other side. Port forwarding is fully configurable with command line options.
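
One way to detect bound ports on Linux is to read /proc/net/tcp and look for sockets in the LISTEN state (state code 0A). The sketch below parses that format; whether pasta uses exactly this mechanism isn't stated in this README, so take it as an illustration of the idea only. The function names are hypothetical.

```python
TCP_LISTEN = "0A"  # state code for listening sockets in /proc/net/tcp


def listening_ports(proc_net_tcp: str) -> set:
    """Extract local ports of listening sockets from /proc/net/tcp text."""
    ports = set()
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Entries start with an index such as "0:"; skip the header
        # and anything too short to be an entry.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        if state != TCP_LISTEN:
            continue
        # local_address is HEXADDR:HEXPORT
        ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


def scan(path: str = "/proc/net/tcp") -> set:
    """Read the given procfs file and return the listening ports."""
    with open(path) as f:
        return listening_ports(f.read())
```

Scanning a different namespace would mean reading the same file from within that namespace, and IPv6 sockets are listed separately in /proc/net/tcp6, which uses the same column layout.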

Demo

pasta

passt

Continuous Integration

Test logs here.

Performance

Try it

passt

  • build from source:

      git clone https://passt.top/passt
      cd passt
      make
    
    • alternatively, static builds for x86_64, with or without AVX2 instructions, as of the latest commit are also available for convenience here and here. Convenience, non-official packages for Debian (and derivatives) and RPM-based distributions are also available there. These binaries and packages are simply built with:

          CFLAGS="-static" make avx2
          make pkgs
          make static
          make pkgs
      
  • have a look at the man page for synopsis and options:

      man ./passt.1
    
  • run the demo script, which creates a network namespace called passt, sets up a veth pair and addresses, together with NAT for IPv4 and NDP proxying for IPv6, then starts passt in the network namespace:

      doc/demo.sh
    
  • from the same network namespace, start qemu. At the moment, qemu doesn't support UNIX domain sockets for the socket back-end. Two alternatives:

    • use the qrap wrapper, which maps a tap socket descriptor to passt's UNIX domain socket, for example:

          ip netns exec passt ./qrap 5 qemu-system-x86_64 ... -net socket,fd=5 -net nic,model=virtio ...
      
    • or patch qemu with this patch and start it like this:

          qemu-system-x86_64 ... -net socket,connect=/tmp/passt.socket -net nic,model=virtio
      
  • alternatively, you can use libvirt, with this patch, to start qemu (with the patch mentioned above), with this kind of network interface configuration:

      <interface type='client'>
        <mac address='52:54:00:02:6b:60'/>
        <source path='/tmp/passt.socket'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </interface>
    
  • and that's it, you should now have TCP connections, UDP, and ICMP/ICMPv6 echo working from/to the guest for IPv4 and IPv6

  • to connect to a service on the VM, just connect to the same port directly with the address of the network namespace. For example, to ssh to the guest, from the main namespace on the host:

      ssh 192.0.2.2
    

pasta

  • build from source:

      git clone https://passt.top/passt
      cd passt
      make
    
    • alternatively, static builds for x86_64, with or without AVX2 instructions, as of the latest commit are also available for convenience here and here. Convenience, non-official packages for Debian (and derivatives) and RPM-based distributions are also available there. These binaries and packages are simply built with:

          CFLAGS="-static" make avx2
          make pkgs
          make static
          make pkgs
      
  • have a look at the man page for synopsis and options:

      man ./pasta.1
    
  • start pasta with:

      ./pasta
    
  • you're now inside a new user and network namespace. For IPv6, SLAAC happens right away as pasta sets up the interface, but DHCPv6 support is available as well. For IPv4, configure the interface with a DHCP client:

      dhclient
    

    and, optionally:

      dhclient -6
    
  • and that's it, you should now have TCP connections, UDP, and ICMP/ICMPv6 echo working from/to the namespace for IPv4 and IPv6

  • to connect to a service inside the namespace, just connect to the same port using the loopback address.

Contribute

A public bug tracker and mailing lists are coming soon. For the time being, send patches and issue reports to sbrivio@redhat.com.