天天看點

Windows Networking 4: CloudMonitor 引發的網絡問題排查一則事情起因問題排查 + 分析原因分析

事情起因

近期一個比較有意思的問題,有使用者回報說他們的伺服器一直在連接配接阿裡雲的服務,把端口都占滿了,導緻網絡中斷,業務連接配接建立不起來,telnet本地端口也不通。

按照産品設計的角度來說,這種情況一般不可能發生,但是使用者的netstat的結果來看,的确顯示大部分的連接配接是跟阿裡雲的伺服器建立的。

Windows Networking 4: CloudMonitor 引發的網絡問題排查一則事情起因問題排查 + 分析原因分析

由于伺服器都是非阿裡雲執行個體,無法直接登陸排查,是以,準備了排查方案。

問題排查 + 分析

  1. 類似的從本地telnet本地端口不通的問題一般是動态端口用光。預設Windows Server 2008 之後的系統使用49152 - 65536範圍共計 16384 個端口。一般情況下是用不完的,如果有用光的情況常常伴随Port Leak(端口洩露)問題。

    netsh int ipv4 show dyn tcp

    Protocol tcp Dynamic Port Range

    Start Port : 49152

    Number of Ports : 16384

    驗證方案是使用指令 netsh int ipv4 set dynamicport tcp start=10000 num=50000 來開放更多的動态端口使用。

    由于 netsh 直接通過 NSI 接口調用操作 Windows Kernel 的 TCPIP 方法來修改系統配置,該指令立時生效。

  2. 若需要确認端口是否有洩露,一般是需要抓dump來分析端口洩露的原因。但這類問題我們其實處理過不少,絕大多數的原因是三方驅動的問題。是以,建議先檢查一下作業系統中是否有三方軟體,如果是的話,嘗試解除安裝後再觀察。

    不過,使用者很快回報是沒有任何三方軟體。

  3. 考慮到系統中多數動态端口在 TIME_WAIT 狀态,可以嘗試通過 TcpTimedWaitDelay 系統資料庫來縮短 TIME_WAIT 的釋放時間。
    https://docs.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance

    Key: HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters

    Value: TcpTimedWaitDelay

    Data Type: REG_DWORD

    Range: 30-300 (decimal)

    Default value: 0x78 (120 decimal)

    Recommended value: 30

    系統資料庫的配置需要重新開機才會生效。

  4. 使用者一直對他們伺服器通路的阿裡雲伺服器 IP 存有疑慮,不過大多數是 443 端口的通路,抓包很難證明确切是哪個應用産生的連接配接。是以,我們需要 Windows 提供的 ETL Trace 方法來擷取應用程式對 TCPIP 等 Network API 的調用資訊。
    netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:NetworkTrace.etl capture=yes packettruncatebytes=128
    從收集的日志看,我們發現請求對應的Process ID = 0x0600 = 1536,
    16767 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-TCPIP]TCP: endpoint 0xFFFFFA80C4CF0CF0 (Family=IPV6 , PID=1536) created with status = STATUS_SUCCESS. 
     16768 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]socket: 1: Process 0xFFFFFA80C3FB0860 (0x600), Endpoint 0xFFFFFA80C259FE50, Family 0, Type 0, Protocol 0, Seq 1013, Status STATUS_SUCCESS 
     16769 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x1, Seq 11002, Status STATUS_SUCCESS 
     16770 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address ::, Seq 7010, Status STATUS_SUCCESS 
     16771 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 acquired port number 10010. 
     16772 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) bound. 
     16773 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::]:10010, Seq 7022, Status STATUS_SUCCESS 
     16774 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]connect: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::ffff:120.55.35.39]:443, Seq 5023, Status STATUS_SUCCESS 
     16775 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosedState  to SynSentState , SndNxt = 0. 
     16776 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 attempted to acquire weak reference on port number 10010 inherited from endpoint 0xFFFFFA80C4CF0CF0. Successful = TRUE . 
     16777 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) requested to connect. 
     16779 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 is going to output SYN with ISN = 2438595996, RcvWnd = 8192, RcvWndScale = 8. 
     16780 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect proceeding. 
     16789 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from SynSentState  to EstablishedState , SndNxt = 2438595997. 
     16790 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1. ThSeq = 2246241665. 
     16791 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect completed. PID = 1536. 
     16793 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]connect: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 5024, Status STATUS_SUCCESS 
     16794 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x0, Seq 11002, Status STATUS_SUCCESS 
     16795 [3]0600.061C::?2018?-?12?-?24 20:15:23.009 [Microsoft-Windows-TCPIP]TCP: Option TCP_OFFLOAD_NOT_PREFERRED  is going to be set for connection 0xFFFFFA80C3956B20. 
     16796 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3047, Status STATUS_SUCCESS 
     16797 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3056, Status STATUS_SUCCESS 
     16798 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 send posted posted 279 bytes at 2438595997. 
     16807 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3051, Status STATUS_SUCCESS 
     16815 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3024, Status STATUS_SUCCESS 
     16818 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2C56830, Length 1452, Seq 9000 
     16819 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246241665. 
     16820 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246241665. 
     16823 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4115, Status STATUS_SUCCESS 
     16824 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2D36F50, Length 1452, Seq 9000 
     16825 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4116, Status STATUS_SUCCESS 
     16826 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246243117. 
     16827 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246243117. 
     ...
     17644 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from EstablishedState  to FinWait1State , SndNxt = 2438604647. 
     17650 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2002, Status STATUS_SUCCESS 
     17651 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) closed. 
     17652 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 released port number 10010. WeakReference = FALSE . 
     17653 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2003, Status STATUS_SUCCESS 
     17654 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2000, Status STATUS_SUCCESS 
     17655 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2001, Status STATUS_SUCCESS 
     17664 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from FinWait1State  to ClosingState , SndNxt = 2438604648. 
     17665 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 delivering FIN. RcvNxt = 2246247426. 
     17666 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-Winsock-AFD]disconnect indicated: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 12001 
     17671 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosingState  to TimeWaitState , SndNxt = 2438604648. 
     17673 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) disconnect completed. 
     17675 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) close issued. 
     17676 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) shutdown initiated (The transport connection was aborted by the local system.). PID = 1536. 
     17677 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from TimeWaitState  to ClosedState , SndNxt = 2438604648. 
     17678 [2]0004.0058::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C3956B20 released port number 10010. WeakReference = TRUE . 
               
    PID = 1536 通過 Process Explorer 我們能找到對應的Java.exe,
    Windows Networking 4: CloudMonitor 引發的網絡問題排查一則事情起因問題排查 + 分析原因分析
  5. 由于通過第四步我們找到去往阿裡雲伺服器 120.55.35.x 的連接配接請求是 CloudMonitor 觸發的。

    考慮到沒有配置 TcpTimedWaitDelay 的作業系統預設使用 2MSL (120s) TIME_WAIT timeout 時間,如果一個系統近 15000 的端口都處于 TIME_WAIT 狀态,理論上來說 Cloud Monitor 至少需要每秒産生 125 條以上的連接配接。而無論從 Cloud Monitor 的日志還是之前的 ETL Trace ,我們都沒有發現此類行為,Cloud Monitor 程式比較規律的每幾秒鐘建立連接配接用于送出 "名額"。

    41931919752 INFO 2018-12-24 17:57:17.748 [ricGatherServiceHttp] 送出名額完成,耗時:78ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

    41931935181 INFO 2018-12-24 17:57:33.177 [ricGatherServiceHttp] 送出名額完成,耗時:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

    41931950625 INFO 2018-12-24 17:57:48.621 [ricGatherServiceHttp] 送出名額完成,耗時:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

  6. 問題比較蹊跷,排除一切可能後,剩下的唯一可能就是 Windows 系統在 2MSL 後沒有釋放 TIME_WAIT 的端口,而後續的排查也确認了這一點。可惜,我們從 ETL Trace 日志中無法獲知具體原因,但是使用者無心的一句話卻提醒了我們,

    "這台伺服器運作了很長時間,一直都沒有出問題"

    具體有多長?我們通過事件日志 Eventlog 6013 确認已經運作了 43225197s 近 500 多天。這引起我們的注意,

    All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2

    https://support.microsoft.com/en-us/help/2553549/all-the-tcp-ip-ports-that-are-in-a-time-wait-status-are-not-closed-aft

至此,整個問題得到了完整的分析和解決。

原因分析

簡單說一下這個 bug 的産生的原因,在 Windows 系統中,使用目前系統運作時間加上 TIMEOUT 時間來決定什麼時候釋放 TIME_WAIT 狀态的 TCP Endpoint。在沒有安裝更新檔的系統中,32位寄存器所能存放的最大值 0xFFFFFFFF,寄存器中存放的數值為 "時間 * 0n100",也就是說在系統運作超過 "0XFFFFFFFF/0n100" 秒之後,就可能發生問題。

0xFFFFFFFF/(0n86400 * 0n100)= 497.1026961805...

繼續閱讀