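Symptoms
Today the blog kept failing to load. Normally we would check the web server and the database first, but the web traffic turned out not to be particularly high. Below is a record of how the problem was tracked down.
Check the load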
top
Check the web log
tailf /data/wwwlogs/jiloc.com_nginx.log
Traffic was not high, yet the load was inexplicably high. The MySQL process showing high usage was expected.
Check the system log
tailf /var/log/messages
root@instance-642k0w3w:~# tailf /var/log/messages
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.460288] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.463060] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.464718] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.466232] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.467630] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.469126] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.470496] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.471864] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.473339] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:51:56 instance-642k0w3w kernel: [9113866.474759] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.464394] net_ratelimit: 27808 callbacks suppressed
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.464394] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.467535] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.469057] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.470924] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.473035] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.475067] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.476470] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.477816] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.479139] nf_conntrack: nf_conntrack: table full, dropping packet
Mar 25 10:52:01 instance-642k0w3w kernel: [9113871.481024] nf_conntrack: nf_conntrack: table full, dropping packet
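The log is full of nf_conntrack errors like these.
Cause
The server handles more connections than the kernel netfilter conntrack parameters allow: once the connection tracking table is full, the kernel drops IP packets and new connections cannot be established.
Check the nf_conntrack settings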
sysctl -a | grep conntrack
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.ens3.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
net.netfilter.nf_conntrack_acct = 0
net.netfilter.nf_conntrack_buckets = 8192
net.netfilter.nf_conntrack_checksum = 1
net.netfilter.nf_conntrack_count = 32768
net.netfilter.nf_conntrack_events = 1
net.netfilter.nf_conntrack_expect_max = 128
net.netfilter.nf_conntrack_frag6_high_thresh = 262144
net.netfilter.nf_conntrack_frag6_low_thresh = 196608
net.netfilter.nf_conntrack_frag6_timeout = 60
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_helper = 0
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_icmpv6_timeout = 30
net.netfilter.nf_conntrack_log_invalid = 0
net.netfilter.nf_conntrack_max = 32768
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_timestamp = 0
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.nf_conntrack_max = 32768
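The output above shows nf_conntrack_count (32768) has reached nf_conntrack_max (32768), meaning the table is 100% full. A minimal sketch for checking that ratio directly, assuming bash and the standard /proc/sys/net/netfilter paths (they only exist while the nf_conntrack module is loaded); the function name conntrack_usage is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print conntrack table usage as an integer percentage.
# Usage: conntrack_usage COUNT MAX
conntrack_usage() {
  local count=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo "invalid max: $max" >&2
    return 1
  fi
  echo $(( count * 100 / max ))
}

# Live check: read the current count and limit if the module is loaded.
dir=/proc/sys/net/netfilter
if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ]; then
  count=$(cat "$dir/nf_conntrack_count")
  max=$(cat "$dir/nf_conntrack_max")
  echo "nf_conntrack: ${count}/${max} entries ($(conntrack_usage "$count" "$max")%)"
fi
```

With the values above, conntrack_usage 32768 32768 prints 100; a value close to 100 is the point where "table full, dropping packet" messages start appearing.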
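Check the maximum number of tracked connections
Once the number of tracked connections reaches this value, packets for new connections are dropped.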
sudo sysctl net.netfilter.nf_conntrack_max
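# Defaults to nf_conntrack_buckets * 4
# The ratio max / buckets determines how long the linked list in each bucket gets, so the default list length is 4
On reasonably modern systems (Ubuntu 16+, CentOS 7+), a 64-bit machine with 8 GB of RAM typically defaults to max = 262144 and buckets = 65536. Both values double each time the memory doubles.
For a more detailed explanation, search online.
Solution
Temporary fix: raise net.nf_conntrack_max and watch how the system behaves. If everything looks normal, make the same change in the config file, because a value set this way is lost after a reboot.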
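The relationship described in the comments can be checked with plain shell arithmetic. A small sketch, assuming the default factor of 4 stated above; the function names default_max and avg_chain_len are hypothetical:

```shell
#!/usr/bin/env bash
# Default limit per the note above: nf_conntrack_buckets * 4
default_max() { echo $(( $1 * 4 )); }

# Average chain length per bucket when the table is full: max / buckets
# (integer division, so the result is rounded down)
avg_chain_len() { echo $(( $1 / $2 )); }

default_max 8192           # 32768, matching the sysctl output above
default_max 65536          # 262144, the 8 GB default quoted above
avg_chain_len 65535 8192   # 7: raising max alone makes each chain longer
```

Raising nf_conntrack_max without raising the bucket count leaves more entries per bucket (here close to 8 instead of 4), so lookups walk longer lists; the bucket count is the nf_conntrack module's hashsize parameter.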
sysctl net.nf_conntrack_max=65535
The load dropped considerably right away, and no new errors showed up in /var/log/messages.
Edit the config file
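To confirm the drops have stopped, one option is to count the kernel's "table full" messages over time. A minimal sketch, assuming the log format shown above; count_table_full is a hypothetical helper that reads the given file, or standard input when no file is given:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nf_conntrack "table full" drop messages.
# Reads the given file, or standard input if no argument is passed
# ("-" tells grep to read stdin).
count_table_full() {
  # grep -c prints the number of matching lines (0 if none match)
  grep -c 'nf_conntrack: table full, dropping packet' "${1:--}"
}
```

For example, as root: tail -n 2000 /var/log/messages | count_table_full. Keep in mind that the kernel rate-limits these messages (note the "net_ratelimit: 27808 callbacks suppressed" line above), so the count understates the real number of dropped packets; it is only useful for telling whether drops are still happening.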
vi /etc/sysctl.d/99-sysctl.conf
Add (or edit) the net.netfilter.nf_conntrack_max line and set it to 65535:
net.netfilter.nf_conntrack_max = 65535
All done, time to call it a day~
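Editing the file by hand is enough for one server. For scripted setups, a minimal sketch of an idempotent edit, assuming the simple "key = value" lines used by sysctl.conf files; the helper name set_sysctl_conf is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: set KEY = VALUE in a sysctl.conf-style file.
# Replaces any uncommented line for KEY (collapsing duplicates into one),
# otherwise appends a new line. Commented-out lines are left untouched.
set_sysctl_conf() {
  local file=$1 key=$2 value=$3 tmp
  [ -e "$file" ] || : > "$file"        # create the file if it is missing
  tmp=$(mktemp) || return 1
  awk -v k="$key" -v v="$value" '
    {
      name = $0
      sub(/=.*/, "", name)             # keep the text before the first "="
      sub(/^[ \t]+/, "", name)         # trim leading whitespace
      sub(/[ \t]+$/, "", name)         # trim trailing whitespace
      if (index($0, "=") > 0 && name == k) {
        if (!found) print k " = " v    # emit a single entry for KEY
        found = 1
        next
      }
      print
    }
    END { if (!found) print k " = " v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the original file's permissions
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

Run as root, for example set_sysctl_conf /etc/sysctl.d/99-sysctl.conf net.netfilter.nf_conntrack_max 65535, and then sysctl -p /etc/sysctl.d/99-sysctl.conf to load the file without rebooting.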