TCP流量控制主要由基于接收方确认的滑动窗口机制和基于网络情况的发送方的拥塞控制算法来实现!
TCP Traffic Control
TCP流量控制主要涉及滑动窗口,拥塞控制算法,RTT(Round-Trip Time,往返时间),RTO(Retransmission Timeout,重传超时)计算等等。
- 滑动窗口通常认为是接收方反馈给发送方的流量和重传控制(不考虑网络);
- 拥塞控制算法通常认为是发送方的流量控制算法(考虑网络时延丢包等);
- RTT计算会影响到RTO计算及影响某些拥塞算法运行结果;
- RTO计算会涉及到超时重传定时器设置。
滑动窗口
滑动窗口(sliding window)是数据传输时接收方通告窗口给发送方,然后发送方根据窗口决定一次发送多少数据的机制,本质是ARQ(Automatic Repeat Request, 停等协议),滑动窗口一方面利用接收方确认报文反馈的窗口大小控制每次发包个数,另一方面根据接收方确认报文确认号判断数据帧重传,滑动窗口仅通过接收方反馈的信息进行流量控制,不会考虑网络带宽延迟等因素。
ARQ主要有连续ARQGo-Back-N和选择ARQSelective Repeat两种,TCP中使用Go-back-N,Selective Repeat作为可选实现。Go-back-N和Selective Repeat区别在于,Go-back-N一次发包只有一个定时器,中间出现丢包,则丢的包之后的所有包都需要重传,Selective Repeat一次发包需要和包个数一样的定时器数量,中间出现丢包则仅重传丢失的那个包,简单来说Go-back-N比较耗网络带宽,Selective Repeat比较耗计算资源。
两种ARQ滑动窗口发送过程可参考如下网站的动画:
https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
TCP滑动机制原理可简述为:TCP发送方包含一个发送窗口,按序列号发送数据包,TCP接收方根据收到的包回复确认包,TCP发送包根据收到的确认包中通告窗口反馈来动态调节发送窗口大小、滑动窗口右移并发送后续的数据包。
如下tcpipguide网站TCP滑动窗口描述,TCP发送方发送缓存区分为4块区域:已发送已确认、已发送未确认、未发送接收方已准备好和未发送接收方未准备好,TCP接收方接收缓存区分为3块区域:已接收已确认、未接收被允许发送、未收到发生方可能未发送。
Linux内核4.x中滑动窗口代码示例,大体可以看出来和tcpipguid网站中描述一致:
/* Update our send window.
*
* Window update algorithm, described in RFC793/RFC1122 (used in linux-2.2
* and in FreeBSD. NetBSD's one is even worse.) is wrong.
*/
static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32 ack,
u32 ack_seq)
{
struct tcp_sock *tp = tcp_sk(sk);
int flag = 0;
u32 nwin = ntohs(tcp_hdr(skb)->window);
if (likely(!tcp_hdr(skb)->syn))
nwin rx_opt.snd_wscale;
if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
flag |= FLAG_WIN_UPDATE;
tcp_update_wl(tp, ack_seq);
if (tp->snd_wnd != nwin) {
tp->snd_wnd = nwin;
/* Note, it is the only place, where
* fast path is recovered for sending TCP.
*/
tp->pred_flags = 0;
tcp_fast_path_check(sk);
if (!tcp_write_queue_empty(sk))
tcp_slow_start_after_idle_check(sk);
if (nwin > tp->max_window) {
tp->max_window = nwin;
tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
}
}
}
tcp_snd_una_update(tp, ack);
return flag;
}
RTT&RTO计算
TCP流量控制中RTT及RTO计算是一个很重要的问题,ARQ中针对丢包都需要超时重传,超时的设置就会涉及到RTO,RTO设置过短(
网络就像高速公路网,时而拥堵时而畅通无阻,对应在网络里就是RTT抖动,如何计算RTT和RTO直接影响重传进而影响通信效率,其计算也经过长时间的理论和实践过程。
最初Phil Karn和Craig Partridge提出SRTT(Smoothed RTT,估算重传时间)和RTO计算如下:
SRTT = (α * SRTT) + ( (1-α) *RTT)
RTO = min[UBOUND, max[LBOUND,(β * SRTT)]]
其中α取0.8 ~ 0.9,β取1.3 ~ 2.0, UBOUND和LBOUND分别为上限时间和下限时间
后来Jacobson 和 Karels做了一定的修正,最终形成RFC6298
https://datatracker.ietf.org/doc/html/rfc6298,SRTT,DevRTT(RTT偏差) ,RTO计算公式,具体如下:
SRTT = SRTT + α * (RTT - SRTT)
DevRTT = (1-β) * DevRTT + β *(|RTT - SRTT|)
RTO = μ * SRTT + δ * DevRTT
其中α取0.125,β取0.25,μ取1,δ取4
Linux内核4.x中SRTT计算代码如下:
/* Called to compute a smoothed rtt estimate. The data fed to this
* routine either comes from timestamps, or from segments that were
* known _not_ to have been retransmitted [see Karn/Partridge
* Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
* piece by Van Jacobson.
* NOTE: the next three routines used to be one big routine.
* To save cycles in the RFC 1323 implementation it was better to break
* it up into three procedures. -- erics
*/
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
struct tcp_sock *tp = tcp_sk(sk);
long m = mrtt_us; /* RTT */
u32 srtt = tp->srtt_us;
/* The following amusing code comes from Jacobson's
* article in SIGCOMM '88. Note that rtt and mdev
* are scaled versions of rtt and mean deviation.
* This is designed to be as fast as possible
* m stands for "measurement".
*
* On a 1990 paper the rto value is changed to:
* RTO = rtt + 4 * mdev
*
* Funny. This algorithm seems to be very broken.
* These formulae increase RTO, when it should be decreased, increase
* too slowly, when it should be increased quickly, decrease too quickly
* etc. I guess in BSD RTO takes ONE value, so that it is absolutely
* does not matter how to _calculate_ it. Seems, it was trap
* that VJ failed to avoid. 8)
*/
if (srtt != 0) {
m -= (srtt >> 3); /* m is now error in rtt est */
srtt += m; /* rtt = 7/8 rtt + 1/8 new */
if (m mdev_us >> 2); /* similar update on mdev */
/* This is similar to one of Eifel findings.
* Eifel blocks mdev updates when rtt decreases.
* This solution is a bit different: we use finer gain
* for mdev in this case (alpha*beta).
* Like Eifel it also prevents growth of rto,
* but also it limits too fast rto decreases,
* happening in pure Eifel.
*/
if (m > 0)
m >>= 3;
} else {
m -= (tp->mdev_us >> 2); /* similar update on mdev */
}
tp->mdev_us += m; /* mdev = 3/4 mdev + 1/4 new */
if (tp->mdev_us > tp->mdev_max_us) {
tp->mdev_max_us = tp->mdev_us;
if (tp->mdev_max_us > tp->rttvar_us)
tp->rttvar_us = tp->mdev_max_us;
}
if (after(tp->snd_una, tp->rtt_seq)) {
if (tp->mdev_max_us rttvar_us)
tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
tp->rtt_seq = tp->snd_nxt;
tp->mdev_max_us = tcp_rto_min_us(sk);
}
} else {
/* no previous measure. */
srtt = m mdev_us = m rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
tp->mdev_max_us = tp->rttvar_us;
tp->rtt_seq = tp->snd_nxt;
}
tp->srtt_us = max(1U, srtt);
}
Linux内核4.x中RTO计算代码:
/* Calculate rto without backoff. This is the second half of Van Jacobson's
* routine referred to above.
*/
void tcp_set_rto(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
/* Old crap is replaced with new one. 8)
*
* More seriously:
* 1. If rtt variance happened to be less 50msec, it is hallucination.
* It cannot be less due to utterly erratic ACK generation made
* at least by solaris and freebsd. "Erratic ACKs" has _nothing_
* to do with delayed acks, because at cwnd>2 true delack timeout
* is invisible. Actually, Linux-2.4 also generates erratic
* ACKs in some circumstances.
*/
inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
/* 2. Fixups made earlier cannot be right.
* If we do not estimate RTO correctly without them,
* all the algo is pure shit and should be replaced
* with correct one. It is exactly, which we pretend to do.
*/
/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
* guarantees that rto is higher.
*/
tcp_bound_rto(sk);
}
static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
拥塞控制算法
由于滑动窗口仅考虑接收方的接收能力不考虑网络因素,网络带宽是共享的,网络延时是抖动的,网络的变化带来的问题势必会造成整个系统的崩溃,这就像两个火车站运输货物,仅考虑火车站的容纳量,不考虑火车轨道的承载能力,火车车次发的再多可能也是堵在轨道上,还可能导致轨道系统瘫痪。
拥塞控制算法是发送方根据当前网络情况调整发包速率的流量控制机制,TCP中拥塞控制算法发展至今,从经典的reno到google近年来提出的bbr,期间不断有新的拥塞控制算法提出来,拥塞控制也形成了如下RFC。
RFC5681
https://datatracker.ietf.org/doc/html/rfc5681
以linux内核为例,最新的内核版本内部实现了十多种拥塞控制算法实现,包括一些比较新的算法实现(比如bbr),linux较早版本内核使用reno,linux 2.6.8以后默认采用bic算法,linux 2.6.19以后又默认使用了cubic算法,目前linux 5.x依然默认使用cubic算法, windows同样也是支持多种拥塞控制算法的,目前windows11默认也是使用cubic算法。
拥塞算法原理(reno为例)
TCP拥塞算法整体框架基本一致,以经典的TCP reno拥塞控制算法(教科书通常讲的版本)进行简要介绍,TCP拥塞控制RFC5681规定都包含慢启动、拥塞避免、快重传、快恢复阶段,这又涉及到几个基本参数(cwnd、ssthresh),cwnd指拥塞窗口的大小,决定了一次发送多少数据包,ssthresh指慢启动阈值。
- 慢启动
慢启动阶段分为两种,一种是流刚开始cwndRFC3465
https://datatracker.ietf.org/doc/html/rfc3465
RFC3742
https://datatracker.ietf.org/doc/html/rfc3742
首次慢启动最初初始ssthresh都是设置为一个很大的值,这样直到丢包才会进入拥塞避免,后又出现了hystart优化(混合慢启动),比如预测出理想的ssthresh,从而让首次慢启动不至于丢包才进入拥塞避免(丢包代价太大)。
本文转载自:https://www.gylmap.com
- 拥塞避免
当cwnd>=ssthresh时会进入拥塞避免阶段,cwnd会随着RTT呈线性增长,这个起始是比较保守地试探最大网络能力,不至于网络崩溃。
- 快重传
拥塞避免阶段当收到3个连续ACK,表明可能出现了丢包,表明网络出现轻微拥堵,这个时候会进入快重传阶段,ssthresh会设置为0.5 * cwnd,cwnd会设置为ssthresh+3MSS,进行数据包重传进入快恢复阶段。
- 快恢复
快恢复阶段如果重传数据包后如果依然收不到新数据包ACK而且RTO超时了,表明网络并没有恢复,就会重新进入慢启动阶段,ssthresh会设置为0.5 * cwnd,cwnd会设置为初始值,如果收到了新数据的ACK包,表明网络已恢复,cwnd会设置为ssthresh,进入拥塞避免阶段。
reno拥塞控制算法状态图如下
reno算法iperf打流wireshark抓包io图如下:
拥塞算法对比
拥塞控制算法主要是基于网络丢包和延迟(RTT)来实现,所以有的算法丢包敏感,有的算法延迟敏感,有的结合丢包和延迟,不同的算法主要的区别可能在于拥塞避免阶段如何去拟合理想发送速率曲线又不至于丢包,如下
https://en.wikipedia.org/wiki/TCP_congestion_control关于不同拥塞算法对比。
拥塞算法 | 网络反馈 | 需要修改点 | 应用场景 | 公平性准则 |
(New) Reno | Loss | — | — | Delay |
Vegas | Delay | Sender | Less loss | Proportional |
High Speed | Loss | Sender | High bandwidth | |
BIC | Loss | Sender | High bandwidth | |
CUBIC | Loss | Sender | High bandwidth | |
C2TCP[10][11] | Loss/Delay | Sender | Ultra-low latency and high bandwidth | |
NATCP[12] | Multi-bit signal | Sender | Near Optimal Performance | |
Elastic-TCP | Loss/Delay | Sender | High bandwidth/short & long-distance | |
Agile-TCP | Loss | Sender | High bandwidth/short-distance | |
H-TCP | Loss | Sender | High bandwidth | |
FAST | Delay | Sender | High bandwidth | Proportional |
Compound TCP | Loss/Delay | Sender | High bandwidth | Proportional |
Westwood | Loss/Delay | Sender | L | |
Jersey | Loss/Delay | Sender | L | |
BBR[13] | Delay | Sender | BLVC, Bufferbloat | |
CLAMP | Multi-bit signal | Receiver, Router | V | Max-min |
TFRC | Loss | Sender, Receiver | No Retransmission | Minimum delay |
XCP | Multi-bit signal | Sender, Receiver, Router | BLFC | Max-min |
VCP | 2-bit signal | Sender, Receiver, Router | BLF | Proportional |
MaxNet | Multi-bit signal | Sender, Receiver, Router | BLFSC | Max-min |
JetMax | Multi-bit signal | Sender, Receiver, Router | High bandwidth | Max-min |
RED | Loss | Router | Reduced delay | |
ECN | Single-bit signal | Sender, Receiver, Router | Reduced loss |
这里还要提一下TCP流的公平性问题,有的拥塞算法可能会存在带宽抢占,基于延时相关的算法就很容易出现,因为是基于RTT来抢占带宽,RTT越小占用则更多,比如有两条TCP流,TCP流中延迟更小的会抢占更多的带宽,甚至完全占有所有带宽,其他流都无法正常工作。比如reno和bic都存在抢占问题,cubic是通过3次曲线来探测理想带宽,和RTT无关联,公平性做得较好。
如上图使用reno算法,使用iperf打流到两个不同服务端,stream2延时低直接抢占了大部分带宽
如上图使用cubic算法,同样使用iperf打流到两个不同服务端,steam2延时低抢占带宽则没有那么严重
Linux内核中拥塞算法
Linux 2.6.x内核版本具体可参考
https://linuxgazette.net/135/pfeiffer.html进行拥塞算法选择,最新的linux 5.x内核多支持了几种,内核代码Kconfig中也有简要描述,顺便提一下mptcp内核也包含了多路径场景的几种拥塞算法。
Linux 4.x内核中支持的tcp拥塞算法,默认使用cubic:
config DEFAULT_TCP_CONG
string
default "bic" if DEFAULT_BIC
default "cubic" if DEFAULT_CUBIC
default "htcp" if DEFAULT_HTCP
default "hybla" if DEFAULT_HYBLA
default "vegas" if DEFAULT_VEGAS
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
default "reno" if DEFAULT_RENO
default "dctcp" if DEFAULT_DCTCP
default "cdg" if DEFAULT_CDG
default "bbr" if DEFAULT_BBR
default "cubic"
Linux mptcp v0.95内核支持的mptcp拥塞算法:
config TCP_CONG_LIA
tristate "MPTCP Linked Increase"
depends on MPTCP
default n
---help---
MultiPath TCP Linked Increase Congestion Control
To enable it, just put 'lia' in tcp_congestion_control
config TCP_CONG_OLIA
tristate "MPTCP Opportunistic Linked Increase"
depends on MPTCP
default n
---help---
MultiPath TCP Opportunistic Linked Increase Congestion Control
To enable it, just put 'olia' in tcp_congestion_control
config TCP_CONG_WVEGAS
tristate "MPTCP WVEGAS CONGESTION CONTROL"
depends on MPTCP
default n
---help---
wVegas congestion control for MPTCP
To enable it, just put 'wvegas' in tcp_congestion_control
config TCP_CONG_BALIA
tristate "MPTCP BALIA CONGESTION CONTROL"
depends on MPTCP
default n
---help---
Multipath TCP Balanced Linked Adaptation Congestion Control
To enable it, just put 'balia' in tcp_congestion_control
config TCP_CONG_MCTCPDESYNC
tristate "DESYNCHRONIZED MCTCP CONGESTION CONTROL (EXPERIMENTAL)"
depends on MPTCP
default n
---help---
Desynchronized MultiChannel TCP Congestion Control. This is experimental
code that only supports single path and must have set mptcp_ndiffports
larger than one.
To enable it, just put 'mctcpdesync' in tcp_congestion_control
For further details see:
http://ieeexplore.ieee.org/abstract/document/6911722/
https://doi.org/10.1016/j.comcom.2015.07.010
Linux4.x内核中拥塞算法扩展时参考struct,拥塞算法主要重写里面函数来实现:
struct tcp_congestion_ops {
struct list_head list;
u32 key;
u32 flags;
/* initialize private data (optional) */
void (*init)(struct sock *sk);
/* cleanup private data (optional) */
void (*release)(struct sock *sk);
/* return slow start threshold (required) */
u32 (*ssthresh)(struct sock *sk);
/* do new cwnd calculation (required) */
void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
/* call before changing ca_state (optional) */
void (*set_state)(struct sock *sk, u8 new_state);
/* call when cwnd event occurs (optional) */
void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
/* call when ack arrives (optional) */
void (*in_ack_event)(struct sock *sk, u32 flags);
/* new value of cwnd after loss (required) */
u32 (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* override sysctl_tcp_min_tso_segs */
u32 (*min_tso_segs)(struct sock *sk);
/* returns the multiplier used in tcp_sndbuf_expand (optional) */
u32 (*sndbuf_expand)(struct sock *sk);
/* call when packets are delivered to update cwnd and pacing rate,
* after all the ca_state processing. (optional)
*/
void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
union tcp_cc_info *info);
char name[TCP_CA_NAME_MAX];
struct module *owner;
};
Linux4.x内核cubic算法重新如下函数来实现:
static struct tcp_congestion_ops cubictcp __read_mostly = {
.init = bictcp_init,
.ssthresh = bictcp_recalc_ssthresh,
.cong_avoid = bictcp_cong_avoid,
.set_state = bictcp_state,
.undo_cwnd = tcp_reno_undo_cwnd,
.cwnd_event = bictcp_cwnd_event,
.pkts_acked = bictcp_acked,
.owner = THIS_MODULE,
.name = "cubic",
};
Linux 系统流量控制重要系统参数
# 每个套接字所允许的最大缓冲区的大小
net.core.optmem_max = 20480
# 接收套接字缓冲区大小的缺省值(以字节为单位)。
net.core.rmem_default = 229376
# 接收套接字缓冲区大小的最大值(以字节为单位)。
net.core.rmem_max = 16777216
# socket监听的backlog(监听队列)上限
net.core.somaxconn = 128
# tcp默认拥塞算法
net.ipv4.tcp_congestion_control = cubic
# 是否开启tcp窗口扩大
net.ipv4.tcp_window_scaling = 1
# tcp内存大小区(以页为单位, 1页=4096字节), 3个数值分别为 low, pressure, high
# low: tcp使用低于该值页数时不考虑释放内存, pressure: tcp使用内存大于该值则试图稳定其使用内存low以下
# high: 允许tcp用排队缓存数据报文的页面量, 超过该值拒绝分配socket, 出现TCP: too many of orphaned sockets
net.ipv4.tcp_mem = 381705 508942 763410
# tcp读缓存区(以字节为单位), 3个数值分别为 默认值 最小值 最大值
net.ipv4.tcp_rmem = 10240 87380 16777216
# tcp写缓存区(以字节为单位), 3个数值分别为 默认值 最小值 最大值
net.ipv4.tcp_wmem = 10240 87380 16777216
Reference
https://tools.ietf.org/html/std7
https://datatracker.ietf.org/doc/html/rfc1122
https://datatracker.ietf.org/doc/html/rfc1323
https://datatracker.ietf.org/doc/html/rfc1379
https://datatracker.ietf.org/doc/html/rfc1948
https://datatracker.ietf.org/doc/html/rfc2018
https://datatracker.ietf.org/doc/html/rfc5681
https://datatracker.ietf.org/doc/html/rfc6247
https://datatracker.ietf.org/doc/html/rfc6298
https://datatracker.ietf.org/doc/html/rfc6824
https://datatracker.ietf.org/doc/html/rfc7323
https://datatracker.ietf.org/doc/html/rfc7414
https://en.wikipedia.org/wiki/Transmission_Control_Protocol
https://linuxgazette.net/135/pfeiffer.html
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech.htm
https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
https://zhuanlan.zhihu.com/p/144273871
http://lxr.linux.no/linux+v3.2.8/Documentation/networking/ip-sysctl.txt#L464