Commit graph

16 commits

Author SHA1 Message Date
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions sinmultaneously, with patch:
========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions sinmultaneously, without patch:
============================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
Jon Maloy
ba38e67cf4 tcp: unify l2 TCPv4 and TCPv6 queues and structures
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.

Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.

Tests with iperf3 shows no measurable change in performance after this
change.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:44:08 +01:00
Jon Maloy
2053c36dec tcp: set ip and eth headers in l2 tap queues on the fly
l2 tap queue entries are currently initialized at system start, and
reused with preset headers through its whole life time. The only
fields we need to update per message are things like payload size
and checksums.

If we want to reuse these entries between ipv4 and ipv6 messages we
will need to set the pointer to the right header on the fly per
message, since the header type may differ between entries in the same
queue.

The same needs to be done for the ethernet header.

We do these changes here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-29 12:43:24 +01:00
Laurent Vivier
f43f7d5e89 tcp: cleanup tcp_buf_data_from_sock()
Remove the err label as there is only one caller, and move code
to the caller position. ret is not needed here anymore as it is
always 0.
Remove sendlen as we can user directly len.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-25 14:29:51 +02:00
David Gibson
9e5df350d6 tcp: Use structures to construct initial TCP options
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte.  However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).

Rework this to use a structure and initialisers which make it a bit
clearer what's going on.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by; Stefano Brivio <sbrivio@redhat.com>
2024-10-21 18:51:04 +02:00
Laurent Vivier
72e7d3024b tcp: Use tcp_payload_t rather than tcphdr
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).

Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 14:50:46 +02:00
Laurent Vivier
8f8c4d27eb tcp: Allow checksum to be disabled
We can need not to set TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:15:28 +02:00
David Gibson
bb41901c71 tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:55 +02:00
David Gibson
4aff6f9392 tcp: Clean up tcpi_snd_wnd probing
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is perfectly possible window the peer could report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retreived.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:47 +02:00
David Gibson
5ff5d55291 tcp: Avoid overlapping memcpy() in DUP_ACK handling
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame.  However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy().  This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:59 +02:00
David Gibson
1f414ed8f0 tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:46 +02:00
David Gibson
a09aeb4bd6 tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames
When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust
the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards.
Although it was discussed during development, somewhere during the shuffles
the case where we move the pointer backwards because we lost frames while
sending them to the guest.  This can happen, for example, if the socket
buffer on the Unix socket to qemu overflows.

Fixing this is slightly complicated because we need to pass a non-const
context pointer to some places we previously didn't need it.  While we're
there also fix a small stylistic issue in the function comment for
tcp_revert_seq() - it was using spaces instead of tabs.

Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-24 09:27:46 +02:00
David Gibson
52d45f1737 tcp: Obtain guest address from flowside
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will probably result in an RST from the guest.  Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:44 +02:00
Jon Maloy
e63d281871 tcp: leverage support of SO_PEEK_OFF socket option when available
>From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes is possible to call recv_msg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->name space direction.

In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.

Measurements with iperf3 shows that throughput increases with 15-20
percent in the host->namespace direction when this feature is used.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-15 17:57:03 +02:00
Laurent Vivier
0c335d751a vhost-user: compare mode MODE_PASTA and not MODE_PASST
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:38 +02:00
Laurent Vivier
fba2b544b6 tcp: move buffers management functions to their own file
Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c

With this change we can use existing TCP functions with a
different kind of memory storage as for instance the shared
memory provided by the guest via vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:05 +02:00