passt

Author	SHA1	Message	Date
Stefano Brivio	a041f6d920	pasta: Clean up namespace processes on exit, reap zombies from clone() If pasta created the namespace, it's probably expected that processes started in the same namespace are terminated once pasta exits. Scan procfs namespace links for corresponding processes, send SIGQUIT and SIGKILL (after one second) if found. While at it, make the signal handler reap otherwise-zombies resulting from clone(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-15 00:32:23 +02:00
Stefano Brivio	7d81b3c646	checksum: Add checksum.h I forgot to commit this. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 19:02:36 +02:00
Stefano Brivio	9af8e0a1a7	tcp: Request retransmission with updated sequence also on partial write to socket If we couldn't write the whole batch of received packets to the socket, and we have missing segments, we still need to request their retransmission right away, otherwise it will take ages for the guest to figure out we're missing them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	a616357c86	tcp: In ESTABLISHED state, acknowledge segments as they're sent to the socket ...instead of waiting for the remote peer to do that -- it's especially important in case we request retransmissions from the guest, but it also helps speeding up slow start. This should probably be a configurable behaviour in the future. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	621c589d36	tcp: Properly time out ACK wait from tap Seen with iperf3: a control connection is established, no data flows for a while, all segments are acknowledged. The socket starts closing it, and we immediately time out because the last ACK from tap was one minute before that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	7c82ea4dd9	tcp: Don't mistake a FIN segment with no data for a Fast Retransmit request It carries no data and usually duplicates the previous ACK sequence, but it's just a FIN. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	c162f1e801	tcp: Check errno on sendmmsg() failure, not just the return value Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	2c009e8e6f	tcp: Make sure sending window is initialised before sending to tap Seen with iperf3: the first packet from socket (data connection) is 65520 bytes and doesn't fit in the window. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-14 16:57:50 +02:00
Stefano Brivio	e0530a802f	qrap: Set x-txburst as temporary workaround for virtio-net TX stall Flooding a virtio-net interface connected to a socket back-end results in a TX stall I'm still debugging. The stall goes away by setting a higher value for x-txburst (256 by default). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	3994fc8f58	udp: Reset iov_base after sending partial message on sendmmsg() failure We set the length while processing messages, but the starting address is pre-initialised. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	ecf1f97564	udp: Fix comparison of seen IPv4 address for local connections c->addr4_seen is stored in network order. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	e58828f340	tcp: Fixes for closing states, spliced connections, out-of-order packets, etc. This fixes a number of issues found with some heavier testing with uperf and neper: - in most closing states, we can still accept data, check for EPOLLIN when appropriate - introduce a new state, ESTABLISHED_SOCK_FIN_SENT, to track the fact we already sent a FIN segment to the tap device, for proper sequence number bookkeeping - for pasta mode only: spliced connections also need tracking of (inferred) FIN segments and clean half-pipe shutdowns - streamline resetting epoll_wait bitmaps with a new function, tcp_tap_epoll_mask(), instead of repeating the logic all over the place - set EPOLLET for tap connections too, whenever we are waiting for EPOLLRDHUP or an event from the tap to proceed with data transfer, to avoid useless loops with EPOLLIN set - impose an additional limit on the sending window advertised to the guest, given by SO_SNDBUF: it makes no sense to completely fill the sending buffer and send a zero window: stop a bit before we hit that - handle all interrupted system calls as needed - simplify the logic for reordering of out-of-order segments received from tap: it's not a corner case, and the previous logic allowed for deadloops - fix comparison of seen IPv4 address when we get a new connection from a socket directed to the configured guest address Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	5e23b1ef44	tap: Fix calculation of number of tap scatter-gather IO messages Messages are typically smaller than ETH_MAX_MTU. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	089dec90ca	pasta: Set ping_group_range upon namespace creation ...this allows processes running as the only group available in the namespace to create ICMP Echo sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	9d19f5bc73	passt: Add epoll event indication and passt/pasta mode in socket debug message Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	3df5debf37	conf: Fix help message about default behaviour for UDP port forwarding Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	ec2b58ea4d	conf, dhcp, ndp: Fix message about default MTU, make NDP consistent Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	8e9333616a	udp: Fix retry mechanism on partial sendmmsg() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	04d62bb013	qrap: Drop debugging left-overs, enable timeout for connect() too Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	b15e97cb9d	conf: Introduce PASST_LEGACY_NO_OPTIONS ifdef for legacy Before introducing options, the default behaviour in passt mode was to forward all ports, to run in foreground and to log to stderr. Make it a bit more convenient to restore that at build time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	647a413794	tcp, udp: Restore usage of gateway for guest to connect to local host This went lost in a recent rework: if the guest wants to connect directly to the host, it can use the address of the default gateway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	f29c48db6b	Makefile: Make sure destination directories exist on install Mostly theoretical, but convenient for testing. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	77d4efa236	udp: Handle partial failure in sendmmsg() to UNIX domain socket Similarly to the handling introduced by commit "tcp: Proper error handling for sendmmsg() to UNIX domain socket" for TCP, we need to deal with partial sendmmsg() failures for UDP as well. Here, we can lose messages, but we need to make sure that the last message is delivered completely, otherwise qemu will fail to reassemble further packets. For UDP, this is somewhat complicated by the fact that one message might include multiple datagrams, and we need to respect message boundaries: go through headers, and calculate what we need to re-send, if anything. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	cd04d238b2	doc/demo: Also forward all UDP ports from namespace Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	564fbca39a	doc/demo: Explicitly run in foreground, drop pipe to cat Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	e1c94637ad	dhcp: Send option 121 if the default gateway is not on the assigned subnet This enables CirrOS with udhcpc to set up a route to a gateway that's not on the assigned subnet, together with: https://bugs.launchpad.net/cirros/+bug/1190372 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	7eb155ab8f	conf: Fix check for IPv6 DNS address being already set Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	64c0f20ab3	arp: Don't resolve own, configured IPv4 address DHCP clients might try to resolve the assigned address to check if it's already in use: don't resolve the configured IPv4 address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	77c72b31ed	Makefile: Quick hack to build convenience Debian and RPM packages Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	b9c6fca469	Makefile: Add install, uninstall targets Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	da20f57f19	passt, qrap: Add man pages Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	05432945bb	qrap: Minor fixes in comments and usage message Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	d5b3467056	pasta: If a new namespace is created, wait for it to be ready before proceeding Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	088d19fbb8	conf: Minor fixes for usage message Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	2166c5872e	arp: Don't answer announcements from guest or namespace Depending on the configuration, the host might have the same address. Don't answer them to avoid a duplicate IP address detection. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	1e49d194d0	passt, pasta: Introduce command-line options and port re-mapping Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	1b1b27c06a	tcp: Fixes for early data in SOCK_SYN_SENT, closing states, clamping window More details here after rebase. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 16:49:21 +02:00
Stefano Brivio	75a7239e5b	tap: Make sure we don't receive frames bigger than ETH_MAX_MTU from qemu And while at it, remove some attributes that are not needed anymore after introducing command line options. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 16:49:21 +02:00
Stefano Brivio	353185cd36	dhcpv6: Fix parsing for IA_ADDR suboptions of IA_NA/IA_TA Once we're past the IA_NA or IA_TA option itself, before we start looking for IA_ADDR suboptions, we need to subtract the length of the option we parsed so far, otherwise we might end up reading past the end of the message, or miss some parts. While at it, streamline calculations in dhcpv6_opt(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 16:49:21 +02:00
Stefano Brivio	d2272f74f7	tcp: Proper error handling for sendmmsg() to UNIX domain socket As data from socket is forwarded to the guest, sendmmsg() might send fewer bytes than requested in three different ways: - failing altogether with a negative error code -- ignore that, we'll get an error on the UNIX domain socket later if there's really an issue with it and reset the connection to the guest - sending less than 'vlen' messages -- instead of assuming success in that case and waiting for the guest to send a duplicate ACK indicating missing data, update the sequence number according to what was actually sent and spare some retransmissions - somewhat unexpectedly to me, sending 'vlen' or less than 'vlen' messages, returning up to 'vlen', with the last message being partially sent, and no further indication of errors other than the returned msg_len for the last partially sent message being less than iov_len. In this case, we would assume success and proceed as nothing happened. However, qemu would fail to parse any further message, having received a partial descriptor, and eventually close the connection, logging: serious error: oversized packet received,connection terminated. as the length descriptor for the next message would be sourced from the middle of the next successfully sent message, not from its header. Handle this by checking the msg_len returned for the last (even partially) sent message, and force re-sending the missing bytes, if any, with a blocking sendmsg() -- qemu must not receive anything else than that anyway. While at it, allow to send up to 64KiB for each message, the previous 32KiB limit isn't actually required, and just switch to a new message at each iteration on sending buffers, they are already MSS-sized anyway, so the check in the loop isn't really needed. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-26 23:30:22 +02:00
Stefano Brivio	cc2ebfd5f2	tcp: Never send ACK because of pending unacknowledged data when sending SYN With a kernel older than 5.3 (no_snd_wnd set), ack_pending in tcp_send_to_tap() might be true at the beginning of a new connection initiated by a socket. This means we send the first SYN segment to the tap together with ACK set, which is clearly invalid and triggers the receiver to reply with an RST segment right away. Set ack_pending to 0 whenever we're sending a SYN segment. In case of a SYN, ACK segment sent by the caller, the caller passes the ACK flag explicitly. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-24 18:27:24 +02:00
Stefano Brivio	f2e3b9defd	tcp: Drop EPOLLET for non-spliced connections Socket-facing functions don't guarantee that all data is handled before they return: stick to level-triggered mode for TCP sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-24 18:24:11 +02:00
Stefano Brivio	ce24fe0b3f	util: Don't close ping sockets if bind() fails ...they're still usable, thanks to the workaround implemented in icmp_tap_handler(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:44:58 +02:00
Stefano Brivio	a340e5336d	util: Fix millisecond logging timestamp calculation Four sub-second digits means 0.1ms units: divide nanoseconds by 10^5, not 10^6. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:39:00 +02:00
Stefano Brivio	539dcf5add	tcp: Fast re-transmit, more fixes for closing states and no_snd_wnd ...and while at it, fix an issue in the calculation of the last IOV buffer size: if we can't receive enough data to fill up the window, the last buffer can be filled completely. Also streamline the code setting iovec lengths if cached values are not matching. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:35:45 +02:00
Stefano Brivio	0017bc3c3e	tcp: Always allow ACKs when pending, fixes for no_snd_wnd and closing states We won't necessarily have another choice to ACK in a timely fashion if we skip ACKs from a number of states (including ESTABLISHED) when there's enough window left. Check for ACKed bytes as soon as it makes sense. If the sending window is not reported by the kernel, ACK as soon as we queue onto the socket, given that we're forced to use a rather small window. In FIN_WAIT_1_SOCK_FIN, we also have to account for the FIN flag sent by the peer in the sequence. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:29:59 +02:00
Stefano Brivio	c62490ffa8	tcp: Lower TCP_TAP_FRAMES to 32 Sending 64 frames in a batch looks quite bad when a duplicate ACK comes right at the beginning of it. Lowering this to 32 doesn't affect performance noticeably, with 16 the impact is more apparent. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:28:21 +02:00
Stefano Brivio	f57c2a72e4	doc/demo.sh: Pick IPv6 interface only if it has a nexthop route Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-08-04 01:22:27 +02:00
Stefano Brivio	dc169643a4	tcp: Full batched processing for tap messages Similar to UDP, but using a simple sendmsg() on iovec-style buffers from tap instead, as we don't need to preserve message boundaries. A quick test in PASTA mode, from namespace to init via tap: # ip link set dev pasta0 mtu 16384 # iperf3 -c 192.168.1.222 -t 60 [...] [ ID] Interval Transfer Bitrate [ 5] 0.00-60.00 sec 80.4 GBytes 11.5 Gbits/sec receiver # iperf3 -c 2a02:6d40:3cfc:3a01:2b20:4a6a:c25a:3056 -t 60 [...] [ ID] Interval Transfer Bitrate [ 5] 0.00-60.01 sec 39.9 GBytes 5.71 Gbits/sec receiver # ip link set dev pasta0 mtu 65520 # iperf3 -c 192.168.1.222 -t 60 [...] [ ID] Interval Transfer Bitrate [ 5] 0.00-60.01 sec 88.7 GBytes 12.7 Gbits/sec receiver # iperf3 -c 2a02:6d40:3cfc:3a01:2b20:4a6a:c25a:3056 -t 60 [...] [ ID] Interval Transfer Bitrate [ 5] 0.00-60.00 sec 79.5 GBytes 11.4 Gbits/sec receiver Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-27 01:35:58 +02:00
Stefano Brivio	fd5050ccba	tcp: Limit TCP_INFO getsockopt() syscalls There's no need to constantly query the socket for number of acknowledged bytes if we're far from exhausting the sending window, just do it if we're at least down to 90% of it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-27 00:50:53 +02:00
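
The partial-send handling described in commit d2272f74f7 ("tcp: Proper error handling for sendmmsg() to UNIX domain socket") can be illustrated with a minimal sketch. It is not the code from that commit: the helper name send_remainder() and its parameters are hypothetical, and it uses a plain send() on a socket assumed to be in blocking mode where the commit message mentions a blocking sendmsg(). Given the msg_len reported for the last message and the iov_len that was requested, it pushes out the missing tail so that the receiver never sees a truncated message followed by unrelated data.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, for illustration only: re-send the part of the last
 * message that a batched send left out.
 *
 * fd:      socket, assumed to be in blocking mode
 * base:    start of the buffer for the last message
 * iov_len: number of bytes that should have been sent
 * msg_len: number of bytes reported as actually sent
 *
 * Returns 0 once the whole remainder is sent (or nothing was missing),
 * -1 on error with errno set by send().
 */
static int send_remainder(int fd, const char *base, size_t iov_len,
			  size_t msg_len)
{
	const char *p;
	size_t left;

	if (msg_len >= iov_len)
		return 0;		/* Message was sent completely */

	p = base + msg_len;
	left = iov_len - msg_len;

	while (left) {
		ssize_t n = send(fd, p, left, MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* Interrupted: try again */
			return -1;
		}

		p += n;
		left -= (size_t)n;
	}

	return 0;
}
```

A caller would compare the msg_len field of the last struct mmsghdr filled in by sendmmsg() against the corresponding iov_len, and call a helper like this one before queueing anything else on the same socket.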

554 commits