
etcd clientv3 lease keepalive: notes on debugging a frequent lease-renewal bug

莿鸟栖草堂

Background:


var err error
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
leaseResp, err := this.GetEtcdClient().Grant(ctx, 10) // lease TTL set to 10 seconds
cancel()
if err != nil {
	return err
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
kvc := etcdclient.NewKV(this.kapi)
var txnResp *etcdclient.TxnResponse
txnResp, err = kvc.Txn(ctx).
	If(etcdclient.Compare(etcdclient.CreateRevision(keyName), "=", 0)).
	Then(etcdclient.OpPut(keyName, value.GetRegisterInfo(),etcdclient.WithLease(etcdclient.LeaseID(leaseResp.ID)))).
	Commit()
_, err = this.kapi.Put(ctx, keyName, value.GetRegisterInfo())
cancel()

if err != nil {
	return err
}

if !txnResp.Succeeded {
	return fmt.Errorf("cannot register self: %s, the key already exists", keyName)
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
_, err = this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
cancel()
if err != nil {
	return err
}

Symptoms:

The monitoring graphs looked like this:

Starting with the etcd process on the etcd leader:

First, the suspected memory leak. etcd already exposes pprof endpoints, and the Go toolchain can read http pprof profiles directly, but it does not support HTTPS, so the profile has to be downloaded locally first and then loaded from disk:


curl --cacert /xxx/xxx/etcd/config/ssl/ca.pem --cert /xxx/xxx/etcd/config/ssl/client.pem --key /xxx/xxx/etcd/config/ssl/client-key.pem   -L https://leader_ip:2379/debug/pprof/profile > leader_ip_cpu.txt

The result:


cfc4n@cnxct:~/Downloads$go tool pprof leader_ip_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:13pm (CST)
Duration: 2.64mins, Total samples = 34.54mins (1306.43%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 23.98mins, 69.42% of 34.54mins total
Dropped 1702 nodes (cum <= 0.17mins)
Showing top 20 nodes out of 211
      flat  flat%   sum%        cum   cum%
  3.62mins 10.48% 10.48%   3.62mins 10.48%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  2.99mins  8.66% 19.13%   2.99mins  8.66%  math/big.divWVW /usr/local/go/src/math/big/arith_amd64.s
  2.34mins  6.78% 25.91%   2.34mins  6.78%  math/big.bitLen /usr/local/go/src/math/big/arith_amd64.s
  1.90mins  5.51% 31.42%  10.44mins 30.22%  math/big.nat.montgomery /usr/local/go/src/math/big/nat.go
  1.67mins  4.82% 36.24%   1.67mins  4.82%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
  1.46mins  4.22% 40.47%   3.57mins 10.33%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  1.36mins  3.93% 44.40%   1.44mins  4.17%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  1.30mins  3.78% 48.17%   1.30mins  3.78%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  1.04mins  3.01% 51.19%   1.04mins  3.02%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.94mins  2.71% 53.90%   0.94mins  2.71%  math/big.divWVW
  0.80mins  2.33% 56.23%   0.83mins  2.41%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.80mins  2.32% 58.55%   3.17mins  9.17%  math/big.nat.divLarge /usr/local/go/src/math/big/nat.go
  0.78mins  2.27% 60.82%   0.78mins  2.27%  math/big.addMulVVW /usr/local/go/src/math/big/arith_amd64.s
  0.57mins  1.66% 62.48%   0.57mins  1.66%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.57mins  1.64% 64.12%   0.57mins  1.64%  math/big.(*byteReader).Read <autogenerated>
  0.42mins  1.22% 65.33%   3.22mins  9.32%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.40mins  1.17% 66.50%   0.40mins  1.17%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
  0.34mins  0.99% 67.49%   0.34mins  0.99%  vendor/golang_org/x/crypto/curve25519.ladderstep /usr/local/go/src/vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
  0.34mins  0.99% 68.49%   0.34mins  0.99%  math/big.bitLen
  0.32mins  0.93% 69.42%   0.41mins  1.18%  sync.(*Pool).pin /usr/local/go/src/sync/pool.go

A large share of CPU time goes to functions in the math/big package, yet our business logic involves no mathematical computation and never imports math. The only plausible source is RSA verification of the HTTPS TLS certificates, which is heavy big-integer math. Exporting the pprof data as an SVG graph confirmed it: TLS certificate computation was eating the CPU.

But shouldn't an HTTPS connection be established once and then reused? Why so much CPU spent on handshakes? Could the huge TIME-WAIT count be related? Back to the TIME-WAIT question: at this point the leader node was xxxx-etcdserver-xxx02, and its TIME_WAIT count was extremely high:


[user@xxxx-etcdserver-xxx02 etcd]$ ss |grep 2380|awk '{print $5}'|awk -F ":" '{print $1}'|sort |uniq -c|sort -rn
2 192.168.198.234
7 192.168.227.202
9 192.168.255.210
7 192.168.58.204
5 192.168.137.210
6 192.168.128.205

The TIME-WAIT sockets on the leader all originated from the member nodes: tens of thousands per member on average, and still over 50,000 when queried again a few seconds later. On each member, meanwhile, there were over a thousand ESTABLISHED connections to the leader's port 2380, and connections really were being created at a high rate. Why were the members dialing so frequently? Time to capture some TCP traffic with tcpdump:

Looking at the capture above, what stands out?

What can you read from these packets? Take five minutes with the screenshot before reading on...

It is always the member that sends the FIN to close the connection toward the leader. Why? And is that all? Notice the gap between the member's data packet (acked by the leader) and the member's FIN: always 10 seconds. Why 10 seconds?

The leader acks, but never sends its own FIN,ACK back to the member. Why not? And if it doesn't, what state is that TCP five-tuple in on the leader? Right: CLOSE-WAIT. Could this be the source of the leader's pile of CLOSE-WAIT sockets? Hold that suspicion and read on...

As noted, the member sends the leader a PSH data packet. What is in it, and why doesn't the leader respond?

The payload is 138 bytes and looks opaque at first, but the first 3 bytes should ring a bell: 0x16, 0x03, 0x01. If you know HTTPS, these are TLS magic numbers: 0x16 is decimal 22, the Handshake content type (here, a ClientHello), and the following 0x03, 0x01 is the TLS record-layer version. Since member-to-leader traffic really is HTTPS, this must be a TLS ClientHello handshake packet. Combined with the 10s interval observed earlier, it means the TLS handshake between member and leader never completes: the leader never answers with a ServerHello. The rest of the packet is short and yields nothing else useful.
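The by-eye classification above can be captured in a few lines. A self-contained Go sketch (my own illustration, not part of the original investigation) that checks the record-header bytes seen in the capture:

```go
package main

import (
	"errors"
	"fmt"
)

// classifyTLSRecord inspects the first 3 bytes of a TCP payload the same way
// we read the capture: 0x16 is content type 22 (Handshake), followed by the
// 2-byte record-layer version (0x03 0x01 = TLS 1.0, typical for a ClientHello).
func classifyTLSRecord(payload []byte) (string, error) {
	if len(payload) < 3 {
		return "", errors.New("payload too short")
	}
	if payload[0] == 0x16 && payload[1] == 0x03 {
		return fmt.Sprintf("TLS handshake record, version 3.%d", payload[2]), nil
	}
	return "not a TLS handshake record", nil
}

func main() {
	// The leading bytes of the 138-byte ClientHello seen in the capture.
	kind, _ := classifyTLSRecord([]byte{0x16, 0x03, 0x01, 0x00, 0x86})
	fmt.Println(kind)
}
```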

Dropping HTTPS to lighten the CPU load:

Back to the CPU pprof: TLS certificate computation drives the high CPU usage, but the flood of new TCP connections is what triggers all that TLS math in the first place, and it is also behind the memory growth and message backlog. Still, we tried disabling TLS certificate verification between members and leader to relieve the leader's CPU. It made little difference; CPU stayed high, which was expected, since TLS was not the root cause. The new pprof:


cfc4n@cnxct:~/Downloads$ go tool pprof 111.111.111.111-0703_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:55pm (CST)
Duration: 30.01s, Total samples = 6.16mins (1231.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 4.29mins, 69.69% of 6.16mins total
Dropped 928 nodes (cum <= 0.03mins)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
  2.13mins 34.62% 34.62%   2.13mins 34.62%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  0.59mins  9.51% 44.13%   0.62mins 10.06%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  0.33mins  5.42% 49.55%   0.33mins  5.42%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.25mins  4.07% 53.62%   1.23mins 20.00%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
  0.13mins  2.18% 55.80%   0.13mins  2.18%  runtime.procyield /usr/local/go/src/runtime/asm_amd64.s
  0.11mins  1.80% 57.60%   0.23mins  3.72%  runtime.selectgoImpl /usr/local/go/src/runtime/select.go
  0.10mins  1.57% 59.17%   0.10mins  1.57%  runtime.osyield /usr/local/go/src/runtime/sys_linux_amd64.s
  0.09mins  1.39% 60.56%   1.47mins 23.85%  runtime.unlock /usr/local/go/src/runtime/lock_futex.go
  0.07mins  1.12% 61.69%   0.33mins  5.30%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.07mins  1.06% 62.74%   0.07mins  1.06%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  0.06mins  1.01% 63.76%   0.15mins  2.36%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  0.05mins  0.77% 64.53%   0.05mins  0.77%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
  0.04mins  0.71% 65.24%   0.04mins  0.71%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.04mins  0.71% 65.95%   0.07mins  1.14%  runtime.deferreturn /usr/local/go/src/runtime/panic.go
  0.04mins   0.7% 66.65%   1.35mins 21.86%  runtime.deltimer /usr/local/go/src/runtime/time.go
  0.04mins  0.67% 67.32%   0.12mins  1.92%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
  0.04mins  0.61% 67.93%   0.04mins  0.64%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.04mins   0.6% 68.53%   0.23mins  3.74%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
  0.04mins  0.59% 69.11%   0.04mins  0.59%  runtime.casgstatus /usr/local/go/src/runtime/proc.go
  0.04mins  0.57% 69.69%   0.04mins  0.57%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s

Now the top CPU consumers are runtime.futex, the syscall.Syscall system-call path, and runtime internals; heapBitsForObject, scanobject, and greyobject are Go GC functions, meaning garbage collection itself was taking a large share of CPU.

Let's look at the memory pprof as well:


cfc4n@cnxct:~/Desktop$ go tool pprof 10.3.5.52.txt
File: etcd
Type: inuse_space
Time: Jul 3, 2018 at 4:57pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 10665.34MB, 94.72% of 11259.98MB total
Dropped 368 nodes (cum <= 56.30MB)
Showing top 20 nodes out of 57
      flat  flat%   sum%        cum   cum%
 7156.71MB 63.56% 63.56%  7156.71MB 63.56%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*recvBuffer).put /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/transport.go
 2258.28MB 20.06% 83.61%  9417.49MB 83.64%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).HandleStreams.func4 /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go
  663.28MB  5.89% 89.51%   663.28MB  5.89%  net/http.http2putRequestBodyBuf /usr/local/go/src/net/http/h2_bundle.go
  277.15MB  2.46% 91.97%   277.15MB  2.46%  runtime.makechan /usr/local/go/src/runtime/chan.go
   66.11MB  0.59% 92.55%    66.11MB  0.59%  crypto/tls.(*block).reserve /usr/local/go/src/crypto/tls/conn.go
   65.24MB  0.58% 93.13%    68.74MB  0.61%  net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go
   56.52MB   0.5% 93.64%    56.52MB   0.5%  runtime.malg /usr/local/go/src/runtime/proc.go
   54.52MB  0.48% 94.12%    85.91MB  0.76%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
   23.53MB  0.21% 94.33%   113.68MB  1.01%  crypto/tls.(*Conn).readHandshake /usr/local/go/src/crypto/tls/conn.go
   14.64MB  0.13% 94.46%    74.58MB  0.66%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
   12.39MB  0.11% 94.57%      795MB  7.06%  net/http.(*http2Server).ServeConn /usr/local/go/src/net/http/h2_bundle.go
   11.03MB 0.098% 94.67%    65.23MB  0.58%  net/http.(*http2Framer).readMetaFrame /usr/local/go/src/net/http/h2_bundle.go
    2.03MB 0.018% 94.68%   990.71MB  8.80%  net/http.(*conn).serve /usr/local/go/src/net/http/server.go
    2.03MB 0.018% 94.70%    77.73MB  0.69%  net/http.(*http2serverConn).readFrames /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.71%   721.45MB  6.41%  net/http.(*http2serverConn).newWriterAndRequest /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.72%   755.99MB  6.71%  net/http.(*http2serverConn).serve /usr/local/go/src/net/http/h2_bundle.go
         0     0% 94.72%    64.32MB  0.57%  [etcd]
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).Handshake /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%    64.60MB  0.57%  crypto/tls.(*Conn).readRecord /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).serverHandshake /usr/local/go/src/crypto/tls/handshake_server.go

As shown above, roughly 64% of in-use memory sits in *recvBuffer.put:


// recvBuffer is an unbounded channel of recvMsg structs.
// Note recvBuffer differs from controlBuffer only in that recvBuffer
// holds a channel of only recvMsg structs instead of objects implementing "item" interface.
// recvBuffer is written to much more often than
// controlBuffer and using strict recvMsg structs helps avoid allocation in "recvBuffer.put"
type recvBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newRecvBuffer() *recvBuffer {
	b := &recvBuffer{
		c: make(chan recvMsg, 1),
	}
	return b
}

func (b *recvBuffer) put(r recvMsg) {
	b.mu.Lock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r:
			b.mu.Unlock()
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r)
	b.mu.Unlock()
}

The put method appends each received recvMsg to the b.backlog slice. Both (b *recvBuffer).put and (b *recvBuffer).load read and write b.backlog, so every access takes the mutex, burning CPU, hurting throughput, and adding system load. Worse, if (b *recvBuffer).load cannot keep up with the rate at which put appends, the slice keeps growing, producing what looks like a memory leak.
So the question becomes: why is put being called so heavily? Why the constant writes? Why?
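To see the failure mode in isolation, here is a stripped-down, runnable replica of the put path above (names mirror the grpc code; the reader goroutine is deliberately absent): once nobody drains the one-slot channel, every message after the first lands in backlog, which grows without bound.

```go
package main

import (
	"fmt"
	"sync"
)

type recvMsg struct{ data []byte }

// miniBuffer mirrors grpc's recvBuffer just enough to show the failure mode:
// put() first tries a non-blocking send into the 1-slot channel, and once the
// reader stops draining it, every further message is appended to backlog.
type miniBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newMiniBuffer() *miniBuffer { return &miniBuffer{c: make(chan recvMsg, 1)} }

func (b *miniBuffer) put(r recvMsg) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r: // the single slot: succeeds only for the first message
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r) // everything else piles up here
}

func main() {
	b := newMiniBuffer()
	for i := 0; i < 1000; i++ { // producer runs; no consumer drains b.c
		b.put(recvMsg{data: []byte{byte(i)}})
	}
	fmt.Println(len(b.backlog)) // 999: all but the first message are backlogged
}
```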

Locating the suspected cause of the memory blow-up:

grpc/server.go, func (s *Server) serveStreams(st transport.ServerTransport)
grpc/transport/handler_server.go line 284, inside func (ht *serverHandlerTransport) HandleStreams, around line 334:


	s.trReader = &transportReader{
		reader:        &recvBufferReader{ctx: s.ctx, recv: s.buf},
		windowHandler: func(int) {},
	}

	// readerDone is closed when the Body.Read-ing goroutine exits.
	readerDone := make(chan struct{})
	go func() {
		defer close(readerDone)

		// TODO: minimize garbage, optimize recvBuffer code/ownership
		const readSize = 8196
		for buf := make([]byte, readSize); ; {
			n, err := req.Body.Read(buf)
			if n > 0 {
				s.buf.put(recvMsg{data: buf[:n:n]})
				buf = buf[n:]
			}
			if err != nil {
				s.buf.put(recvMsg{err: mapRecvMsgError(err)})
				return
			}
			if len(buf) == 0 {
				buf = make([]byte, readSize)
			}
		}
	}()

This then reaches func (b *recvBuffer) put(r recvMsg). In other words, the message backlog builds up because a large number of HTTP requests are being received. The next step is to find out who is sending them.

Reading the etcd source:

Given the huge number of TCP TIME-WAIT connections between members, the investigation shifted to the member nodes.

There are many places in the source where an etcd member establishes TCP/HTTP connections, and I could not find a faster way to localize the bug, so I worked through the etcd source step by step.

As shown above, communication between etcd nodes is wired up through these handlers.

That means the bulk of member-to-member traffic has to flow through one of these handlers, unless I missed something in the source.
Judging by the product's feature set, what members mostly need to sync is data, so should the focus go on raftHandler? Daunting...

raftHandler is returned by s.RaftHandler(), whose value comes from *EtcdServer.r.transport.Handler().
*EtcdServer.r is assigned at line 414 of etcdserver/server.go:


r: *newRaftNode(
			raftNodeConfig{
				isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
				Node:        n,
				heartbeat:   heartbeat,
				raftStorage: s,
				storage:     NewStorage(w, ss),
			},
		),

At line 511 of etcdserver/server.go, *EtcdServer.r.transport = tr assigns a *rafthttp.Transport, whose Handler looks like this:


func (t *Transport) Handler() http.Handler {
	pipelineHandler := newPipelineHandler(t, t.Raft, t.ClusterID)
	streamHandler := newStreamHandler(t, t, t.Raft, t.ID, t.ClusterID)
	snapHandler := newSnapshotHandler(t, t.Raft, t.Snapshotter, t.ClusterID)
	mux := http.NewServeMux()
	mux.Handle(RaftPrefix, pipelineHandler)
	mux.Handle(RaftStreamPrefix+"/", streamHandler)
	mux.Handle(RaftSnapshotPrefix, snapHandler)
	mux.Handle(ProbingPrefix, probing.NewHandler())
	return mux
}

Several more handlers, all raft-related; the code is heavy going and the author's intent is hard to follow.

Printing the http dial stack:

Finding the connection-initiating code by reading onward from here would be very hard, so I flipped the approach: at the TCP Dial site itself, check whether addr is the etcd leader's address and, if so, print the call stack with debug.PrintStack() from runtime/debug. I also printed remote.addr after each connection was established, for cross-checking. Note that this modifies Go's standard library (remember to revert it afterwards), so tread carefully. Around line 882 of http/transport.go I added:


func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if addr == "leader_ip:2380" {
		debug.PrintStack()
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

I rebuilt etcd server with this change, deployed it to one member of the cluster, and load-tested that node to reproduce the mass of TIME-WAIT connections. Once the symptom reappeared, the logs contained the stack:

. net.(*Dialer).Dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/dial.go:323
. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_dialer.go:29
. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial-fm() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_transport.go:51
. net/http.(*Transport).dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:887
. net/http.(*Transport).dialConn() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:1060
. net/http.(*Transport).getConn.func4() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:943
. runtime.goexit() /usr/local/Cellar/go/1.9.1/libexec/src/runtime/asm_amd64.s:2337

While load-testing, I used netstat to sample the member's TCP connection table a few times, to cross-check against the log output.
In the test logs, every established TCP connection printed this stack plus the remote IP and port. Comparing against the netstat samples, the five-tuples of the connections to the leader's port 2380 matched the remote.addr five-tuples printed above.
So this is the function creating the flood of member-to-leader TCP connections. With both the dialing function and its call stack in hand, problem solved, right?

Not so fast: the stack above only reaches coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial and cannot show the full caller chain. Why?


// src/net/http/transport.go, lines 886-887: note the two t.Dial lines below
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

Why the etcd member actively closes the TCP connection after 10s:

In the standard library's *Transport.dial method (net/http/transport.go), if the Transport defines a custom Dial, that one is called. The custom Dial here is defined in github.com/coreos/etcd/pkg/transport/transport.go:


func NewTransport(info TLSInfo, dialtimeoutd time.Duration) (*http.Transport, error) {
	cfg, err := info.ClientConfig()
	if err != nil {
		return nil, err
	}

	t := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		Dial: (&net.Dialer{
			Timeout: dialtimeoutd,
			// value taken from http.DefaultTransport
			KeepAlive: 30 * time.Second,
		}).Dial,
		// value taken from http.DefaultTransport
		TLSHandshakeTimeout: 10 * time.Second,
		TLSClientConfig:     cfg,
	}

	dialer := (&net.Dialer{
		Timeout:   dialtimeoutd,
		KeepAlive: 30 * time.Second,
	})
	dial := func(net, addr string) (net.Conn, error) {
		return dialer.Dial("unix", addr)
	}

	tu := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		Dial:                dial,
		TLSHandshakeTimeout: 10 * time.Second, // note: this 10s value explains the 10s TLS timeout seen at the start
		TLSClientConfig:     cfg,
	}
	ut := &unixTransport{tu}

	t.RegisterProtocol("unix", ut)
	t.RegisterProtocol("unixs", ut)

	return t, nil
}

In dial := func(net, addr string) (net.Conn, error) ... a custom Dial method is defined and assigned to http.Transport.Dial, with the TLS timeout set via TLSHandshakeTimeout: 10 * time.Second. That is exactly why, in the earlier capture, the member closed the connection after 10 seconds: the TLS handshake exceeded the configured timeout.
The only caller of NewTransport I could find is NewTimeoutTransport:


func NewTimeoutTransport(info TLSInfo, dialtimeoutd, rdtimeoutd, wtimeoutd time.Duration) (*http.Transport, error) {
	tr, err := NewTransport(info, dialtimeoutd)
	if err != nil {
		return nil, err
	}

	if rdtimeoutd != 0 || wtimeoutd != 0 {
		// the timed out connection will timeout soon after it is idle.
		// it should not be put back to http transport as an idle connection for future usage.
		tr.MaxIdleConnsPerHost = -1
	} else {
		// allow more idle connections between peers to avoid unnecessary port allocation.
		tr.MaxIdleConnsPerHost = 1024
	}

	tr.Dial = (&rwTimeoutDialer{
		Dialer: net.Dialer{
			Timeout:   dialtimeoutd,
			KeepAlive: 30 * time.Second,
		},
		rdtimeoutd: rdtimeoutd,
		wtimeoutd:  wtimeoutd,
	}).Dial
	return tr, nil
}

So finding the callers of NewTimeoutTransport should pinpoint the code issuing all those TCP connections... but the earlier stack printing never captured those frames, so another approach was needed.

NewTimeoutTransport also sets tr.MaxIdleConnsPerHost = 1024, the cap on idle HTTP keepalive connections Go's http client may hold per host. etcd raises it to 1024, so an idle member may keep up to 1024 TCP connections to the leader ready for reuse. (An aside: some say HTTP is the best RPC; I disagree, if only because of connection reuse. HTTP/1.1 keepalive reuses a connection only after the previous response completes, so requests are serialized; HTTP/2 needs no such wait and offers true multiplexing, see Streams and Multiplexing. If you have ever implemented request/response matching for RPC over TCP, the difference between HTTP/1.1 and HTTP/2 connection reuse is immediately clear.)

Examining the traffic between etcd members and the etcd leader:

Back on track. I tried reading further to narrow this down, but etcd's source is complex and I got stuck at the raft data-sync path between nodes. Time for another angle: since TLS verification had already been removed between members for the CPU experiment, I could capture traffic on port 2380 on a member and see what is actually being sent.

The capture shows the member sending the leader a flood of POST requests to /leases. Searching for that URI leads to the sending code at line 252 of etcdserver/v3_server.go:


func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
	ttl, err := s.lessor.Renew(id)
	if err == nil { // already requested to primary lessor(leader)
		return ttl, nil
	}
	if err != lease.ErrNotPrimary {
		return -1, err
	}

	cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
	defer cancel()

	// renewals don't go through raft; forward to leader manually
	for cctx.Err() == nil && err != nil {
		leader, lerr := s.waitLeader(cctx)
		if lerr != nil {
			return -1, lerr
		}
		for _, url := range leader.PeerURLs {
			lurl := url + leasehttp.LeasePrefix
			ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
			if err == nil || err == lease.ErrLeaseNotFound {
				return ttl, err
			}
		}
	}
	return -1, ErrTimeout
}

This code calls leasehttp.RenewHTTP to forward the renewal to leader.PeerURLs, i.e. to the leader node. Following leasehttp.RenewHTTP:


func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
	// will post lreq protobuf to leader
	lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
	if err != nil {
		return -1, err
	}

	cc := &http.Client{Transport: rt}
	req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
	if err != nil {
		return -1, err
	}
	req.Header.Set("Content-Type", "application/protobuf")
	req.Cancel = ctx.Done()

	resp, err := cc.Do(req)
	if err != nil {
		return -1, err
	}
	// ... (rest of RenewHTTP elided)

The request goes out via the standard library's http.Client.Do, with http.Client.Transport set to the rt http.RoundTripper passed in.

Locating the call chain behind the 10s TLS timeout:

Going back up one level in the code:

The pieces now connect: the call chain that initiates the flood of TCP connections is confirmed, so the direction is right; what remains is to follow the thread to the reason for all these requests.
func (s *EtcdServer) LeaseRenew is called from around line 88 of github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:


func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			}
			return err
		}

		// Create header before we sent out the renew request.
		// This can make sure that the revision is strictly smaller or equal to
		// when the keepalive happened at the local server (when the local server is the leader)
		// or remote leader.
		// Without this, a lease might be revoked at rev 3 but client can see the keepalive succeeded
		// at rev 4.
		resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
		ls.hdr.fill(resp.Header)

		ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
		if err == lease.ErrLeaseNotFound {
			err = nil
			ttl = 0
		}

		if err != nil {
			return togRPCError(err)
		}

		resp.TTL = ttl
		err = stream.Send(resp)
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			}
			return err
		}
	}
}

After a long read, it becomes clear that the messages received here all come from clients. The variable naming gives hints: client-to-member paths carry stream in their names, while member-to-leader (and member-to-member) paths carry peer. Clients talk to etcd nodes over HTTP/2 with grpc+protobuf message encoding, while etcd members talk to each other over HTTP/1.1, also with protobuf-encoded messages.

When etcd lease renewals are sent:

So the flood of lease-renewal requests comes from clients. In my case the host lease TTL is set to 10s; could something be wrong there? Who sends the lease renewal, the client itself? How often? One second before expiry?

In the etcd client library, the renewal code is at line 474 of clientv3/lease.go:


	// send update to all channels
	nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
	ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
	for _, ch := range ka.chs {
		select {
		case ch <- karesp:
			ka.nextKeepAlive = nextKeepAlive
		default:
		}
	}

nextKeepAlive is set to one third of the lease TTL and marks when the next renewal should fire; the sendKeepAliveLoop function then loops every 500ms checking whether a renewal is due. In our setup, a 10s TTL means a renewal every ~3s. With 30k clients that is ~10k writes per second, while our earlier load tests showed the cluster sustaining over 40k qps (consistent with the official benchmarks), so it should cope; a loose end to keep in mind. Still, per the analysis so far, mass lease renewals cause mass TCP connection churn, which causes heavy allocation on the leader and heavy GC CPU, so it was worth optimizing: we raised the TTL to 600s (10 minutes), i.e. a renewal every 200 seconds.

Raising the lease TTL to ease cluster load:

We rolled the adjusted version out to over 6000 machines in the test environment and watched. Sure enough, memory dropped clearly after the rollout and CPU improved too, but about an hour later both metrics climbed right back.

The problem persists:

Why? Was the hour-long dip just an artifact of the rollout itself? Did the lease change help at all? I had tested it before rolling out, with a test report to show (agent etcd lease TTL adjustment): renewals were every 200s. So what is going on? To confirm, I captured traffic on one of the members again.

The capture is indeed full of lease packets. So: what is the interval per lease id, the expected 200 seconds? Are these packets all for the same lease id? How many distinct lease ids are there?

Analyzing the tcpdump pcapng file:

From the earlier source reading we know inter-node messages are protobuf-encoded, so the HTTP body bytes can be copied out and decoded with the corresponding protobuf struct. But between 2018-07-05 16:38:31.015981 and 2018-07-05 16:43:20.420401, about 5 minutes, there were 1.03 million packets; copying by hand was hopeless. So I used a pcap package to walk the whole tcpdump pcapng file, filter for application/protobuf bodies, decode each as a pb.LeaseKeepAliveRequest, and count total versus unique lease ids.
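The final counting step is simple once each request has been decoded; a minimal sketch with the pcap walking and protobuf decoding omitted (countLeaseIDs and its sample inputs are my own illustration):

```go
package main

import "fmt"

// countLeaseIDs reproduces the tally at the end of the pcap analysis: given
// the lease ID from each decoded LeaseKeepAliveRequest, count the total
// number of renewal requests and the number of distinct lease IDs.
func countLeaseIDs(ids []int64) (total, unique int) {
	seen := make(map[int64]struct{})
	for _, id := range ids {
		seen[id] = struct{}{}
	}
	return len(ids), len(seen)
}

func main() {
	// A handful of sample IDs: the same leases keep re-renewing.
	total, unique := countLeaseIDs([]int64{42, 42, 42, 7, 7})
	fmt.Println(total, unique) // 5 2
}
```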


Result: total lease ids: 77550; unique lease ids: 1382.

So the same lease ids were sending renewal requests over and over. I wrote another program to take these lease ids and query the cluster for their remaining TTL, expecting something in the 400-600s range. The answer was always 599s. Sampling one lease id every second, the TTL read 599 every single time, with grantTTL 600, meaning etcd members were renewing these leases non-stop. Since we had already established that member-side renewals are driven by client lease keepalive requests, the focus moved to the clients.

Sampling one client and capturing its traffic confirmed it was indeed sending /etcdserverpb.Lease/LeaseKeepAlive
requests frequently, every second. Very strange, and completely unlike my earlier tests, which showed one renewal per 200s. Why so frequent now?

Back to the etcd clientv3 code:

With such a strange symptom, the only way forward was the source again, focused on where /etcdserverpb.Lease/LeaseKeepAlive originates: sendKeepAliveLoop, the loop that decides whether each lease needs renewal:


// sendKeepAliveLoop sends keep alive requests for the lifetime of the given stream.
func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
	for {
		var tosend []LeaseID

		now := time.Now()
		l.mu.Lock()
		for id, ka := range l.keepAlives {
			if ka.nextKeepAlive.Before(now) {
				tosend = append(tosend, id)
			}
		}
		l.mu.Unlock()

		for _, id := range tosend {
			r := &pb.LeaseKeepAliveRequest{ID: int64(id)}
			if err := stream.Send(r); err != nil {
				// TODO do something with this error?
				return
			}
		}

		select {
		case <-time.After(500 * time.Millisecond):
		case <-stream.Context().Done():
			return
		case <-l.donec:
			return
		case <-l.stopCtx.Done():
			return
		}
	}
}

The loop wakes every 500 milliseconds, walks the l.keepAlives map, and sends a LeaseKeepAliveRequest for every LeaseID whose nextKeepAlive has passed. nextKeepAlive is assigned in the recvKeepAlive function of clientv3/lease.go, at line 473:


// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}

Yes, we saw this code before, but back then only the TTL arithmetic got attention. Look closely at ka.nextKeepAlive = nextKeepAlive: the next renewal time is assigned inside a select case guarded by ch <- karesp. In Go this means: if the channel can accept the write, the case body runs and ka.nextKeepAlive is updated; otherwise, no update... Otherwise, no update... Otherwise, no update...

So how long is this channel? Could writes outpace reads, so that the send fails and the next renewal time is never advanced? Following up, in the KeepAlive function at line 222 of clientv3/lease.go:


func (l *lessor) KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error) {
	ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize)

	l.mu.Lock()
	// ensure that recvKeepAliveLoop is still running
	select {
	case <-l.donec:
		err := l.loopErr
		l.mu.Unlock()
		close(ch)
		return ch, ErrKeepAliveHalted{Reason: err}
	default:
	}
	ka, ok := l.keepAlives[id]
	if !ok {
		// create fresh keep alive
		ka = &keepAlive{
			chs:           []chan<- *LeaseKeepAliveResponse{ch},

ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize) allocates the channel; leaseResponseChSize is a constant, 16. chs is a slice of length 1 holding just this one channel. Next: find the code that reads from ch. I read all of lease.go and found nothing that drains this channel, only a few close calls during cleanup. Which raises the question: if the ch channel fills up, no further writes succeed, ka.nextKeepAlive = nextKeepAlive is never reached, ka.nextKeepAlive stays stuck at a time in the past, and renewals fire over and over?
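The suspicion is easy to verify in isolation: a capacity-16 channel with no reader accepts exactly 16 non-blocking sends, and every later send falls through to default, the branch in which clientv3 skips the ka.nextKeepAlive update. A self-contained sketch:

```go
package main

import "fmt"

// fillNonBlocking performs `attempts` non-blocking sends into a channel of
// the given capacity that nobody ever reads (like the clientv3 KeepAlive
// response channel when the caller discards it) and returns how many sends
// actually landed in the buffer.
func fillNonBlocking(capacity, attempts int) int {
	ch := make(chan struct{}, capacity)
	delivered := 0
	for i := 0; i < attempts; i++ {
		select {
		case ch <- struct{}{}: // succeeds while the buffer has room
			delivered++
		default: // buffer full: from here on, this branch runs every time
		}
	}
	return delivered
}

func main() {
	// leaseResponseChSize in clientv3 is 16: the 17th and later sends are dropped.
	fmt.Println(fillNonBlocking(16, 20)) // 16
}
```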

Pinning it down: two lease keepalive requests per second:

Testing time: in the default branch of the ch write I logged a "cannot write" message, and after the for loop I logged the keepAlive's TTL and timing fields for cross-checking: log.Println(fmt.Sprintf("karesp.TTL:%d, nextKeepAlive:%s, ka.nextKeepAlive:%s, ka.deadline:%s", karesp.TTL, nextKeepAlive.String(), ka.nextKeepAlive.String(), ka.deadline.String()))

I rebuilt with this change, deployed it to one client, turned off DEBUG mode, and watched the log output (file: /xxxx/xxx/core/nohup.out.1). The program started at 2018/07/05 21:10:42, and after just over an hour of observation, at 2018/07/05 22:04:05, the lease renewals indeed went from one per 200s to two per second. Querying the corresponding lease ids on the cluster showed TTL 599 with grantTTL 600, confirming it was the client re-sending lease requests continuously. The first build didn't print ka.nextKeepAlive; after adding it, and shortening the wait by changing the 600s TTL to 60, I got the following log:

2018/07/05 22:40:02 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:02 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508,ka.nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508, ka.deadline:2018-07-05 22:41:02.811627601 +0800 CST m=+80.135092217

2018/07/05 22:40:22 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:22 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715,ka.nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715, ka.deadline:2018-07-05 22:41:22.922699917 +0800 CST m=+100.246164470

2018/07/05 22:40:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:40:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932,ka.nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932, ka.deadline:2018-07-05 22:41:43.330496602 +0800 CST m=+120.653961489

2018/07/05 22:41:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164,ka.nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164, ka.deadline:2018-07-05 22:42:03.341113662 +0800 CST m=+140.664578212

2018/07/05 22:41:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313,ka.nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313, ka.deadline:2018-07-05 22:42:23.350265498 +0800 CST m=+160.673730050

2018/07/05 22:41:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:41:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032,ka.nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032, ka.deadline:2018-07-05 22:42:43.572889271 +0800 CST m=+180.896353840
2018/07/05 22:42:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488,ka.nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488, ka.deadline:2018-07-05 22:43:03.868798687 +0800 CST m=+201.192263243
2018/07/05 22:42:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020,ka.nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020, ka.deadline:2018-07-05 22:43:23.877615637 +0800 CST m=+221.201080417

2018/07/05 22:42:43 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:42:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122,ka.nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122, ka.deadline:2018-07-05 22:43:43.884577361 +0800 CST m=+241.208041904

2018/07/05 22:43:03 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996,ka.nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996, ka.deadline:2018-07-05 22:44:03.893275148 +0800 CST m=+261.216739702

2018/07/05 22:43:23 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998,ka.nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998, ka.deadline:2018-07-05 22:44:24.058767168 +0800 CST m=+281.382231711
2018/07/05 22:43:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:43:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737,ka.nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737, ka.deadline:2018-07-05 22:44:44.41301386 +0800 CST m=+301.736478391

2018/07/05 22:44:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106,ka.nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106, ka.deadline:2018-07-05 22:45:04.420797607 +0800 CST m=+321.744262161

2018/07/05 22:44:24 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589,ka.nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589, ka.deadline:2018-07-05 22:45:24.428147239 +0800 CST m=+341.751612018

2018/07/05 22:44:44 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:44:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:45:44.436152168 +0800 CST m=+361.759616723

2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 cannot write
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.441563525 +0800 CST m=+341.765028110,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.44156426 +0800 CST m=+381.765028789
2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:04 cannot write
2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.941897369 +0800 CST m=+342.265361977,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.941898136 +0800 CST m=+382.265362694
2018/07/05 22:45:05 lease Id KeepAliveRequest:5904610818195269641
2018/07/05 22:45:05 cannot write

After 16 successful writes, i.e. from the 17th on, the renewals settle into twice per second. The bug is confirmed.
Which means: no matter how long a lease TTL you grant, the client eventually sends a lease renewal every 0.5 seconds!!!!!
Which means: no matter how long a lease TTL you grant, the client eventually sends a lease renewal every 0.5 seconds!!!!!
Which means: no matter how long a lease TTL you grant, the client eventually sends a lease renewal every 0.5 seconds!!!!!

After the earlier canary of version 134, CPU dropped for roughly an hour before climbing back. Why?

Explaining the one-hour CPU dip:

Do the math: with TTL 600, the renewal interval is 200s; ka.chs has length 1, and its single chan<- *LeaseKeepAliveResponse has capacity 16, so filling it takes 200*16/60 ≈ 53 minutes, matching the CPU graph in falcon. It also matches my own instrumented test: 2018/07/05 22:04:05 minus 2018/07/05 21:10:42 is about 53 minutes. Close to an hour, consistent with the falcon metrics.
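The 53-minute figure can be reproduced with a small simulation (a sketch of the observed behavior: TTL/3 scheduling, a 500ms loop, and a 16-slot response channel that is never drained):

```go
package main

import "fmt"

// stormStartSeconds simulates a client whose KeepAlive response channel
// (capacity chCap) is never read. Each buffered delivery advances the renewal
// clock by ttl/3; the function returns the time, in seconds, of the first
// renewal whose response no longer fits, after which the 500ms loop fires a
// request on every tick (two per second) forever.
func stormStartSeconds(ttlSec float64, chCap int) float64 {
	buffered := 0
	next := 0.0 // next-keepalive time, seconds from start
	for t := 0.0; ; t += 0.5 { // sendKeepAliveLoop ticks every 500ms
		if t < next {
			continue
		}
		if buffered == chCap {
			return t // channel full: nextKeepAlive is never advanced again
		}
		buffered++
		next = t + ttlSec/3 // clock advances only on a successful delivery
	}
}

func main() {
	secs := stormStartSeconds(600, 16)
	fmt.Printf("%.0fs = about %.0f minutes\n", secs, secs/60) // 3200s, ~53 minutes
}
```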

The culprit:

With the problem identified, only the fix remained. I never found where the ch channel is consumed, and perhaps I missed the author's intent, but in my view the next nextKeepAlive time must be updated whether or not the ch write succeeds. Moving ka.nextKeepAlive = nextKeepAlive out of the for/select is enough.

That said, with our product pushing to ship, we played it safe and dropped the lease feature altogether. After deployment the load improved dramatically: CPU fell from 40-50% to about 2-5%.

Looking back:

Reviewing every symptom the etcd cluster showed, everything now has an explanation. In order of cause and effect:

In the end, the last straw broke the camel's back. The bug was reported to etcd upstream: clientv3 sdk will send /etcdserverpb.Lease/LeaseKeepAlive twice every second for lease id after over time

A bug that haunted me for over a month; I won't forget this moment.

Thoughts on overload protection for etcd:

Disclaimer

Does this affect your etcd deployment:

Fixed upstream

CoreOS fixed this bug in etcd on July 24, 2018, across 3 etcd clientv3 release lines; an impressively fast turnaround. Details:

Fix lease keepalive interval updates when response queue is full.
If <-chan *clientv3.LeaseKeepAliveResponse from clientv3.Lease.KeepAlive was never consumed or channel is full, client was sending keepalive request every 500ms instead of expected rate of every "TTL / 3" duration.



License: this post was written by CFC4N of 莿鸟栖草堂 and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license, based on work at http://www.cnxct.com. Please credit the original when reposting: etcd clientv3的lease keepalive租约频繁续期bug 排查小记
