Internet Engineering Task Force (IETF) A. Zimmermann Request for Comments: 6069 A. Hannemann Category: Experimental RWTH Aachen University ISSN: 2070-1721 December 2010 Making TCP More Robust to Long Connectivity Disruptions (TCP-LCD) Abstract Disruptions in end-to-end path connectivity, which last longer than one retransmission timeout, cause suboptimal TCP performance. The reason for this performance degradation is that TCP interprets segment loss induced by long connectivity disruptions as a sign of congestion, resulting in repeated retransmission timer backoffs. This, in turn, leads to a delayed detection of the re-establishment of the connection since TCP waits for the next retransmission timeout before it attempts a retransmission. This document proposes an algorithm to make TCP more robust to long connectivity disruptions (TCP-LCD). It describes how standard ICMP messages can be exploited during timeout-based loss recovery to disambiguate true congestion loss from non-congestion loss caused by connectivity disruptions. Moreover, a reversion strategy of the retransmission timer is specified that enables a more prompt detection of whether or not the connectivity to a previously disconnected peer node has been restored. TCP-LCD is a TCP sender- only modification that effectively improves TCP performance in the case of connectivity disruptions. Status of This Memo This document is not an Internet Standards Track specification; it is published for examination, experimental implementation, and evaluation. This document defines an Experimental Protocol for the Internet community. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6069. Zimmermann & Hannemann Experimental [Page 1]
RFC 6069 Making TCP More Robust to LCDs December 2010 Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction ....................................................3 2. Terminology .....................................................4 3. Connectivity Disruption Indication ..............................5 4. Connectivity Disruption Reaction ................................7 4.1. Basic Idea .................................................7 4.2. Algorithm Details ..........................................8 5. Discussion of TCP-LCD ..........................................11 5.1. Retransmission Ambiguity ..................................12 5.2. Wrapped Sequence Numbers ..................................12 5.3. Packet Duplication ........................................13 5.4. Probing Frequency .........................................14 5.5. Reaction during Connection Establishment ..................14 5.6. Reaction in Steady-State ..................................14 6. Dissolving Ambiguity Issues Using the TCP Timestamps Option ....15 7. Interoperability Issues ........................................17 7.1. Detection of TCP Connection Failures ......................17 7.2. Explicit Congestion Notification (ECN) ....................17 7.3. TCP-LCD and IP Tunnels ....................................17 8. Related Work ...................................................18 9. Security Considerations ........................................19 10. Acknowledgments ...............................................20 11. References ....................................................20 11.1. Normative References .....................................20 11.2. Informative References ...................................21 Zimmermann & Hannemann Experimental [Page 2]
RFC 6069 Making TCP More Robust to LCDs December 2010 1. Introduction Connectivity disruptions can occur in many different situations. The frequency of connectivity disruptions depends on the properties of the end-to-end path between the communicating hosts. While connectivity disruptions can occur in traditional wired networks, e.g., disruption caused by an unplugged network cable, the likelihood of their occurrence is significantly higher in wireless (multi-hop) networks. Especially, end-host mobility, network topology changes, and wireless interferences are crucial factors. In the case of the Transmission Control Protocol (TCP) [RFC0793], the performance of the connection can experience a significant reduction compared to a permanently connected path [SESB05]. This is because TCP, which was originally designed to operate in fixed and wired networks, generally assumes that the end-to-end path connectivity is relatively stable over the connection's lifetime. Depending on their duration, connectivity disruptions can be classified into two groups [TCP-RLCI]: "short" and "long". A connectivity disruption is "short" if connectivity returns before the retransmission timer fires for the first time. In this case, TCP recovers lost data segments through Fast Retransmit and lost acknowledgments (ACKs) through successfully delivered later ACKs. Connectivity disruptions are declared as "long" for a given TCP connection if the retransmission timer fires at least once before connectivity is resumed. Whether or not path characteristics, like the round-trip time (RTT) or the available bandwidth, have changed when connectivity resumes after a disruption is another important aspect for TCP's retransmission scheme [TCP-RLCI]. The algorithm specified in this document improves TCP's behavior in the case of "long connectivity disruptions". In particular, it focuses on the period prior to the re-establishment of the connectivity to a previously disconnected peer node. The document does not describe any modifications to TCP's behavior and its congestion control mechanisms [RFC5681] after connectivity has been restored. When a long connectivity disruption occurs on a TCP connection, the TCP sender eventually does not receive any more acknowledgments. After the retransmission timer expires, the TCP sender enters the timeout-based loss recovery and declares the oldest outstanding segment (SND.UNA) as lost. Since TCP tightly couples reliability and congestion control, the retransmission of SND.UNA is triggered together with the reduction of the transmission rate. This is based on the assumption that segment loss is an indication of congestion [RFC5681]. As long as the connectivity disruption persists, TCP will repeat this procedure until the oldest outstanding segment has Zimmermann & Hannemann Experimental [Page 3]
RFC 6069 Making TCP More Robust to LCDs December 2010 successfully been acknowledged or until the connection has timed out. TCP implementations that follow the recommended retransmission timeout (RTO) management of RFC 2988 [RFC2988] double the RTO after each retransmission attempt. However, the RTO growth may be bounded by an upper limit, the maximum RTO, which is at least 60 s, but may be longer: Linux, for example, uses 120 s. If connectivity is restored between two retransmission attempts, TCP still has to wait until the retransmission timer expires before resuming transmission, since it simply does not have any means to know if the connectivity has been re-established. Therefore, depending on when connectivity becomes available again, this can waste up to a maximum RTO of possible transmission time. This retransmission behavior is not efficient, especially in scenarios with long connectivity disruptions. In the ideal case, TCP would attempt a retransmission as soon as connectivity to its peer has been re-established. In this document, we specify a TCP sender- only modification to provide robustness to long connectivity disruptions (TCP-LCD). The memo describes how the standard Internet Control Message Protocol (ICMP) can be exploited during timeout-based loss recovery to identify non-congestion loss caused by long connectivity disruptions. TCP-LCD's reversion strategy of the retransmission timer enables higher-frequency retransmissions and thereby a prompt detection when connectivity to a previously disconnected peer node has been restored. If no congestion is present, TCP-LCD approaches the ideal behavior. Experimental results of a Linux implementation of TCP-LCD have been presented in [ZimHan09]. The implementation has been incorporated into mainline Linux, and is already used within the Internet. Thus far, no negative experiences have been reported that could be attributed to the algorithm. However, we consider TCP-LCD as experimental until more real-life results have been obtained. Nevertheless, we encourage implementation of TCP-LCD under other operating systems to provide for broader testing and experimentation opportunities. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. The reader should be familiar with the algorithm and terminology from [RFC2988], which defines the standard algorithm that Transmission Control Protocol (TCP) senders are required to use to compute and manage their retransmission timer. In this document, the terms "retransmission timer" and "retransmission timeout" are used as Zimmermann & Hannemann Experimental [Page 4]
RFC 6069 Making TCP More Robust to LCDs December 2010 defined in [RFC2988]. The retransmission timer ensures data delivery in the absence of any feedback from the receiver. The duration of this timer is referred to as retransmission timeout (RTO). As defined in [RFC0793], the term "acceptable acknowledgment (ACK)" refers to a TCP segment that acknowledges previously unacknowledged data. The TCP sender state variable "SND.UNA" and the current segment variable "SEG.SEQ" are used as defined in [RFC0793]. SND.UNA holds the segment sequence number of the earliest segment that has not been acknowledged by the TCP receiver (the oldest outstanding segment). SEG.SEQ is the segment sequence number of a given segment. For the purposes of this specification, we define the term "timeout- based loss recovery", which refers to the state that a TCP sender enters upon the first timeout of the oldest outstanding segment (SND.UNA) and leaves upon the arrival of the *first* acceptable ACK. It is important to note that other documents use a different interpretation of the term "timeout-based loss recovery". For example, the NewReno modification to TCP's Fast Recovery algorithm [RFC3782] extends the period that a TCP sender remains in timeout- based loss recovery compared to the one defined in this document. This is because [RFC3782] attempts to avoid unnecessary multiple Fast Retransmits that can occur after an RTO. 3. Connectivity Disruption Indication If the queue of an intermediate router that is experiencing a link outage can buffer all incoming packets, a connectivity disruption will only cause a variation in delay, which is handled well by TCP implementations using either Eifel [RFC3522], [RFC4015] or Forward RTO-Recovery (F-RTO) [RFC5682]. However, if the link outage lasts for too long, the router experiencing the link outage is forced to drop packets, and finally may remove the corresponding next hop from its routing table. Means to detect such link outages include reacting to failed address resolution protocol (ARP) [RFC0826] queries, sensing unsuccessful links, and the like. However, this is solely the responsibility of the respective router. Note: The focus of this memo is on introducing a method of how ICMP messages may be exploited to improve TCP's performance; how different physical and link-layer mechanisms below the network layer may trigger ICMP destination unreachable messages are out of scope of this memo. Provided that no other route to the specific destination exists, an Internet Protocol version 4 (IPv4) [RFC0791] router will notify the corresponding sending host about the dropped packets via ICMP destination unreachable messages of code 0 (net unreachable) or Zimmermann & Hannemann Experimental [Page 5]
RFC 6069 Making TCP More Robust to LCDs December 2010 code 1 (host unreachable) [RFC1812]. Therefore, the sending host can use the ICMP destination unreachable messages of these codes as an indication of a connectivity disruption, since the reception of these messages provides evidence that packets were dropped due to a link outage. For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of the ICMP destination unreachable message of code 0 (net unreachable) and of code 1 (host unreachable) is the ICMPv6 destination unreachable message of code 0 (no route to destination) [RFC4443]. As with IPv4, a router should generate an ICMPv6 destination unreachable message of code 0 in response to a packet that cannot be delivered to its destination address because it lacks a matching entry in its routing table. Note that there are also other ICMP and ICMPv6 destination unreachable messages with different codes. Some of them are candidates for connectivity disruption indications, too, but need further investigation (for example, ICMP destination unreachable messages with code 5 (source route failed), code 11 (net unreachable for TOS (Type of Service)), or code 12 (host unreachable for TOS) [RFC1812]). On the other hand, codes that flag hard errors are of no use for this scheme, since TCP should abort the connection when those are received [RFC1122]. For the sake of simplicity, we will use, unless explicitly qualified with ICMPv4 or ICMPv6, the term "ICMP unreachable message" as a synonym for ICMP destination unreachable messages of code 0 or code 1 and ICMPv6 destination unreachable messages of code 0. This implies that all keywords from [RFC2119] that deal with the handling of received ICMP messages apply in the same way to ICMPv6 messages. The accurate interpretation of ICMP unreachable messages as a connectivity disruption indication is complicated by the following two peculiarities of ICMP messages. First, they do not necessarily operate on the same timescale as the packets, i.e., TCP segments that elicited them. When a router drops a packet due to a missing route, it will not necessarily send an ICMP unreachable message immediately, but will rather queue it for later delivery. Second, ICMP messages are subject to rate-limiting, e.g., when a router drops a whole window of data due to a link outage, it is unlikely to send as many ICMP unreachable messages as dropped TCP segments. Depending on the load of the router, it may not even send any ICMP unreachable messages at all. Both peculiarities originate from [RFC1812] for ICMPv4 and [RFC4443] for ICMPv6. Zimmermann & Hannemann Experimental [Page 6]
RFC 6069 Making TCP More Robust to LCDs December 2010 Fortunately, according to [RFC0792], ICMPv4 unreachable messages have to contain, in their body, the entire IPv4 header [RFC0791] of the datagram eliciting the ICMPv4 unreachable message, plus the first 64 bits of the payload of that datagram. This allows the sending host to match the ICMPv4 error message to the transport connection that elicited it. RFC 1812 [RFC1812] augments these requirements and states that ICMPv4 messages should contain as much of the original datagram as possible without the length of the ICMPv4 datagram exceeding 576 bytes. Therefore, in the case of TCP, at least the source port number, the destination port number, and the 32-bit TCP sequence number are included. This allows the originating TCP to demultiplex the received ICMPv4 message and to identify the affected connection. Moreover, it can identify which segment of the respective connection triggered the ICMPv4 unreachable message, unless there are several segments in flight with the same sequence number (see Section 5.1). For IPv6 [RFC2460], the payload of an ICMPv6 error message has to include as many bytes as possible from the IPv6 datagram that elicited the ICMPv6 error message, without making the error message exceed the minimum IPv6 MTU (1280 bytes) [RFC4443]. Thus, enough information is available to identify both the affected connection and the corresponding segment that triggered the ICMPv6 error message. A connectivity disruption indication in the form of an ICMP unreachable message associated with a presumably lost TCP segment provides strong evidence that the segment was not dropped due to congestion, but was successfully delivered as far as the reporting router. It therefore did not witness any congestion at least on that part of the path that was traversed by both the TCP segment eliciting the ICMP unreachable message and the ICMP unreachable message itself. 4. Connectivity Disruption Reaction Section 4.1 introduces the basic idea of TCP-LCD. The complete algorithm is specified in Section 4.2. 4.1. Basic Idea The goal of the algorithm is to promptly detect when connectivity to a previously disconnected peer node has been restored after a long connectivity disruption, while retaining appropriate behavior in case of congestion. TCP-LCD exploits standard ICMP unreachable messages during timeout-based loss recovery. This increases TCP's retransmission frequency by undoing one retransmission timer backoff whenever an ICMP unreachable message is received that contains a segment with a sequence number of a presumably lost retransmission. Zimmermann & Hannemann Experimental [Page 7]
RFC 6069 Making TCP More Robust to LCDs December 2010 This approach has the advantage of appropriately reducing the probing rate in case of congestion. If either the retransmission itself or the corresponding ICMP message is dropped, the previously performed retransmission timer backoff is not undone, which effectively halves the probing rate. 4.2. Algorithm Details A TCP sender that uses RFC 2988 [RFC2988] to compute TCP's retransmission timer MAY employ the following scheme to avoid over- conservative retransmission timer backoffs in case of long connectivity disruptions. If a TCP sender does implement the following steps, the algorithm MUST be initiated upon the first timeout of the oldest outstanding segment (SND.UNA) and MUST be stopped upon the arrival of the first acceptable ACK. The algorithm MUST NOT be re-initiated upon subsequent timeouts for the same segment. The scheme SHOULD NOT be used in SYN-SENT or SYN-RECEIVED states [