Otherwise, if all the pending data is acknowledged:
- tcp_update_seqack_from_tap() updates the current tap-side ACK
sequence (conn->seq_ack_from_tap)
- next, we compare the sequence we sent (conn->seq_to_tap) to the
ACK sequence (conn->seq_ack_from_tap) in tcp_data_from_sock() to
understand if there's more data we can send.
If they match, we conclude that we haven't sent any of that data,
and keep re-sending it.
We need, instead, to flush the socket (drop acknowledged data) before
calling tcp_update_seqack_from_tap(), so that once we update
conn->seq_ack_from_tap, we can be sure that all data until there is
gone from the socket.
Link: https://bugs.passt.top/show_bug.cgi?id=114
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 30f1e082c3 ("tcp: Keep updating window and checking for socket data after FIN from guest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp. In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.
Add code to dump and restore the timestamp across migration.
Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
During migration we extract the limit on segment size using TCP_MAXSEG,
and set it on the other side with TCP_REPAIR_OPTIONS. However, unlike most
32-bit values we transfer we transfer it in native endian, not network
endian. This is not correct; add it to the list of endian fixups we make.
In addition, while MAXSEG will be 32-bits in practice, and is given as such
to TCP_REPAIR_OPTIONS, the TCP_MAXSEG sockopt treats it as an 'int'. It's
not strictly safe to pass a uint32_t to a getsockopt() expecting an int,
although we'll get away with it on most (maybe all) platforms. Correct
this as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor coding style fix]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A number of places in the TCP code use general logging functions, instead
of the flow specific ones. That includes a few older ones as well as many
places in the new migration code. Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.
Convert many of these to use the existing flow specific helpers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_update_csum() is exposed in tcp_internal.h, but is only used in tcp.c.
Remove the unneded prototype and make it static.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it. However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.
This can arise in practice with migration, in cases where some error means
we have to discard a connection. We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately. This change
ensures an RST will be sent if the guest tries to use the connection again.
A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of. It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The flow label is a 20-bit field in the IPv6 header. The length and
alignment make it awkward to pass around as is. Obviously, it can be
packed into a 32-bit integer though, and we do this in two places. We
have some further upcoming places where we want to manipulate the flow
label, so make some helpers for marshalling and unmarshalling it to an
integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on. This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process. But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer one to one match to the flows which the source is
sending data for. This means that data for later flows will be mismatched
to a different flow. Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.
Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it. Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow. To make
what's going on clearer (and with less extraneous error messages), put
several explicit tests for a missing socket later in the migration path to
read the data associated with the flow but explicitly discard it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state. However, even if the tcp_send_flag() fails
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong. So we should still mark
the flow as CLOSED. That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are two small bugs in error returns from tcp_low_repair_socket(),
which is supposed to return a negative errno code:
1) On bind() failures, wedirectly pass on the return code from bind(),
which is just 0 or -1, instead of an error code.
2) In the caller, tcp_flow_migrate_target() we call strerror_() directly
on the negative error code, but strerror() requires a positive error
code.
Correct both of these.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default. However, internally we use (c->mtu == -1) to represent that
state. We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.
This is unnecessarily confusing. We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case. This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno. We didn't
have those for our flow specific logging helpers, though. Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure. However, the
former is trivially derived from the latter. Simplify the interface.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry. Our usual convention is that
such functions should take a struct tcp_tap_conn * instead. Convert it to
do so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.
Contents of pending queues are transferred as well.
The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.
Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.
Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.
This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.
Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).
The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.
Drop those settings.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.
If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.
Reproducer, server:
$ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'
and client:
$ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe
while at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
So that we can bind inbound sockets to specific addresses, like we
already do for outbound sockets.
While at it, change the error message in tcp_conn_from_tap() to match
this one.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This probably doesn't cover all the cases where we should send a
zero-window probe, but it's rather unobtrusive and obvious, so start
from here, also because I just observed this case (without the fix
from the previous patch, to take into account window information from
keep-alive segments).
If we hit the ACK timeout, and try re-sending data from the socket,
if the window is zero, we'll just fail again, go back to the timer,
and so on, until we hit the maximum number of re-transmissions and
reset the connection.
Don't do that: forcibly try to send something by implementing the
equivalent of a zero-window probe in this case.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It looks like a detail, but it's critical if we're dealing with
somebody, such as near-future self, using TCP_REPAIR to migrate TCP
connections in the guest or container.
The last packet sent from the 'source' process/guest/container
typically reports a small window, or zero, because the guest/container
hadn't been draining it for a while.
The next packet, appearing as the target sets TCP_REPAIR_OFF on the
migrated socket, is a keep-alive (also called "window probe" in CRIU
or TCP_REPAIR-related code), and it comes with an updated window
value, reflecting the pre-migration "regular" value.
If we ignore it, it might take a while/forever before we realise we
can actually restart sending.
Fixes: 238c69f9af ("tcp: Acknowledge keep-alive segments, ignore them for the rest")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For migration we need to get the specific local address and port for
connected sockets with getsockname(). We currently open code marshalling
the results into the flow entry.
However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it. Also report failures, which may make debugging
problems easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.
Similar for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to the equivalent interface as it was on the source.
Use getsockname() to fetch them.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.
In that case, we should go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.
Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set. There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored. It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places. Use
NULL instead for simplicity and consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Somewhat curiously, RFC 9293, section 3.10.7.3, states:
If the state is SYN-SENT, then
[...]
Second, check the RST bit:
- If the RST bit is set,
[...]
o If the ACK was acceptable, then signal to the user "error:
connection reset", drop the segment, enter CLOSED state,
delete TCB, and return. Otherwise (no ACK), drop the
segment and return.
which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:
/* No ACK in the segment */
if (th->rst) {
/* rfc793:
* "If the RST bit is set
*
* Otherwise (no ACK) drop the segment and return."
*/
goto discard_and_undo;
}
meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.
The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.
This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.
Note that non-SYN, non-RST segments were already covered by commit
4988e2b406 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").
Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl>
Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl>
Reported-by: Robert Floor <Robert.Floor@schiphol.nl>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Following up on 725acd111b ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.
TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.
Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.
Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.
After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
https://news.ycombinator.com/item?id=34180239
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.
Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.
As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).
Fixes: 904b86ade7 ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.
If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
also known as:
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.
That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.
In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.
If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.
This is definitely not elegant, but it's a minimal fix.
We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).
I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.
Fixes: bb70811183 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers. Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed. This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases. However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.
So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.
More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous. Furthermore, the
payload itself might not be contiguous.
So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently these expects both the TCP header and payload in a single IOV,
and goes to some trouble to locate the checksum field within it. In the
current caller we've already know where the TCP header is, so we might as
well just pass it in. This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function only has callers in tcp_buf.c. More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.
Therefore, move it to tcp_buf.c.
Slightly reformat the prototypes while we're at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
RFC 9293, 3.8.4 says:
Implementers MAY include "keep-alives" in their TCP implementations
(MAY-5), although this practice is not universally accepted. Some
TCP implementations, however, have included a keep-alive mechanism.
To confirm that an idle connection is still active, these
implementations send a probe segment designed to elicit a response
from the TCP peer. Such a segment generally contains SEG.SEQ =
SND.NXT-1 and may or may not contain one garbage octet of data. If
keep-alives are included, the application MUST be able to turn them
on or off for each TCP connection (MUST-24), and they MUST default to
off (MUST-25).
but currently, tcp_data_from_tap() is not aware of this and will
schedule a fast re-transmit on the second keep-alive (because it's
also a duplicate ACK), ignoring the fact that the sequence number was
rewinded to SND.NXT-1.
ACK these keep-alive segments, reset the activity timeout, and ignore
them for the rest.
At some point, we could think of implementing an approximation of
keep-alive segments on outbound sockets, for example by setting
TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single
keep-alive segment at approximately the same time, and never reset the
connection. That's beyond the scope of this fix, though.
Reported-by: Tim Besard <tim.besard@gmail.com>
Link: https://github.com/containers/podman/discussions/24572
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We enter the timer handler with the ACK_TO_TAP_DUE flag, call
tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we
acknowledged everything meanwhile, so we return early, but we also
need to reset that flag to avoid unnecessarily scheduling the timer
over and over again until more pending data appears.
I'm not sure if this fixes any real issue, but I've spotted this
in several logs reported by users, including one where we have some
unexpected bursts of high CPU load during TCP transfers at low rates,
from https://github.com/containers/podman/issues/23686.
Link: https://github.com/containers/podman/discussions/24572
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:
/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1139 | int fd = socket(AF_UNIX, SOCK_STREAM, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1158 | ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1413 | s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
188 | if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
| ^
| | SOCK_CLOEXEC
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers. There are several other cases where we can remove similar
compile time dependencies on kernel version. Prepare for that by renaming
tcp_info.h to linux_dep.h.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).
Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.
As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.
For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.
Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.
Tests with iperf3 shows no measurable change in performance after this
change.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in structure returned by the runtime kernel. We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.
However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel. We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd. Switch from the compile time flags to a runtime
test.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>