SLS Machine Learning Best Practices: Log Clustering + Anomaly Alerting

0. Article series links

1. What tools do we have in hand?

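Extracting more value from logs has always been our team's focus. Building on the existing real-time log query capability, SLS rounded out the following DevOps features this year: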

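  • Context query
  • Real-time Tail and intelligent clustering, to make troubleshooting faster
  • A range of anomaly detection and forecasting functions for time series data, for smarter checks and predictions
  • Visualization of data analysis results
  • Powerful alert configuration and notification, with follow-up actions triggered through webhook calls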

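Today we focus on how intelligent log clustering and anomaly alerting work together to detect anomalies and alert on them more effectively.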

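2. Platform experiment

2.1 Experiment data

The experiment uses a set of raw Sys Log data with the log clustering service enabled. Its state is shown in the screenshot below: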


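Adjusting the value marked by red box 1 in the screenshot below changes the result in red box 2, but the finest-grained patterns do not change. In other words, each sub-pattern's result is stable and unique, so we can use a sub-pattern's Signature to find the corresponding raw log entries.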


2.2 Generating a time series for a sub-pattern

Suppose we want to monitor this sub-pattern:

msg:vm-111932.tc su: pam_unix(*:session): session closed for user root

Its signature_id: __log_signature__: 1814836459146662485

Querying by this signature returns the raw logs for the pattern. The histogram of their counts over time looks like this:


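The histogram above shows that logs of this pattern are unevenly distributed, and some time windows contain no logs at all. Counting the logs directly per one-minute window gives the following time series: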

__log_signature__: 1814836459146662485 |  
select 
    date_trunc('minute', __time__) as time, 
    COUNT(*) as num 
from log GROUP BY time order by time ASC limit 10000           
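
The chart above shows that the series is not continuous in time, so we need to fill in the missing points. In the query below, time_series pads every missing minute with 0.
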
__log_signature__: 1814836459146662485 | 
select 
    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
    avg(num) as num 
from  ( 
    select 
        __time__ - __time__ % 60 as time, 
        COUNT(*) as num 
    from log GROUP BY time order by time desc ) 
GROUP by time order by time ASC limit 10000           
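
To make the gap-filling step concrete outside of SLS, here is a minimal Python sketch of the same idea: bucket timestamps to the minute (as `__time__ - __time__ % 60` does) and pad empty minutes with 0. The function names and the sample timestamps are hypothetical, and the sketch only fills between the first and last observed bucket; how `time_series` treats the edges of the query range is not modelled here.

```python
from collections import Counter


def minute_counts(timestamps):
    """Count events per one-minute bucket, mirroring `__time__ - __time__ % 60`."""
    return Counter(ts - ts % 60 for ts in timestamps)


def fill_gaps(counts, step=60, fill=0):
    """Return (bucket, count) pairs for every bucket between the first and
    last observed one, using `fill` where no events were seen."""
    if not counts:
        return []
    start, end = min(counts), max(counts)
    return [(t, counts.get(t, fill)) for t in range(start, end + step, step)]


# Hypothetical Unix timestamps in seconds: two events in the first minute,
# none in the second, one in the third.
events = [1200, 1215, 1325]
series = fill_gaps(minute_counts(events))
print(series)  # [(1200, 2), (1260, 0), (1320, 1)]
```

Without the padding step, the second minute would simply be absent from the series, which is the discontinuity described above.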

2.3 Anomaly detection on the time series

Use the time series anomaly detection function: ts_predicate_arma

__log_signature__: 1814836459146662485 | 
select 
    ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') 
from  ( 
    select 
        time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
        avg(num) as num 
    from  ( 
        select 
            __time__ - __time__ % 60 as time, 
            COUNT(*) as num 
        from log GROUP BY time order by time desc ) 
    GROUP by time order by time ASC ) limit 10000           
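
The internals of `ts_predicate_arma` are not described in this article, so the following Python sketch deliberately swaps in a much simpler technique, a rolling mean with a band of k standard deviations, only to illustrate the general shape of the output: an observed value, a predicted value, an upper and lower bound, and an anomaly flag. It is not the ARMA model, and the function name, window size, k and the sample counts are all assumptions.

```python
from statistics import mean, pstdev


def flag_anomalies(values, window=5, k=3.0):
    """Rolling-band detector (NOT ARMA): for each point that has `window`
    predecessors, predict the mean of those predecessors and flag the point
    if it falls outside mean +/- k * population std of the same history.
    Returns (index, value, predicted, upper, lower, is_anomaly) tuples."""
    rows = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        pred = mean(history)
        spread = k * pstdev(history)
        upper, lower = pred + spread, pred - spread
        is_anomaly = not (lower <= values[i] <= upper)
        rows.append((i, values[i], pred, upper, lower, is_anomaly))
    return rows


# Hypothetical per-minute counts with one obvious spike at index 7.
counts = [2, 3, 2, 3, 2, 3, 2, 30, 2, 3]
flagged = [row[0] for row in flag_anomalies(counts) if row[5]]
print(flagged)  # [7]
```

Note that the spike inflates the band for the points that follow it, which is one reason a model-based function is preferable to this naive band in practice.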

2.4 How to configure alerts

  • Unpack the result of the machine learning function
__log_signature__: 1814836459146662485 | 
select 
    t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
from  ( 
    select 
        ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
    from  ( 
        select 
            time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
            avg(num) as num 
        from  ( 
            select 
                __time__ - __time__ % 60 as time, 
                COUNT(*) as num 
            from log GROUP BY time order by time desc ) 
        GROUP by time order by time ASC )) , unnest(res) as t(t1)           
  • Alert on the results of the most recent two minutes
__log_signature__: 1814836459146662485 | 
select 
    unixtime, src, pred, up, lower, prob 
from  ( 
    select 
        t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
    from  ( 
        select 
            ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
        from  ( 
            select 
                time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
                avg(num) as num 
            from  ( 
                select 
                    __time__ - __time__ % 60 as time, COUNT(*) as num 
                from log GROUP BY time order by time desc ) 
            GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
    where is_nan(src) = false order by unixtime desc limit 2           
  • Alert on rising points, with a fallback policy
__log_signature__: 1814836459146662485 | 
select 
    sum(prob) as sumProb, max(src) as srcMax, max(up) as upMax 
from ( 
    select 
        unixtime, src, pred, up, lower, prob 
    from  ( 
        select 
            t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
        from  ( 
            select 
                ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
            from  ( 
                select 
                    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, avg(num) as num 
                from  ( 
                    select 
                        __time__ - __time__ % 60 as time, COUNT(*) as num 
                    from log GROUP BY time order by time desc ) 
                GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
        where is_nan(src) = false order by unixtime desc limit 2 )           
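
The three queries above can be read as one small pipeline: unpack each result row into named fields, drop rows with no observed value, keep the two most recent rows, then aggregate them. The Python sketch below mirrors those steps on made-up data. The trigger condition in `should_alert` and its `hard_limit` are hypothetical, since the actual alert expression is not given in the text; they only illustrate one way a "rising point plus fallback" rule could be written.

```python
import math
from typing import NamedTuple


class Point(NamedTuple):
    # Field order follows t1[1]..t1[6] in the unpacking query.
    unixtime: float
    src: float
    pred: float
    up: float
    lower: float
    prob: float


def latest_points(res, n=2):
    """Unpack raw rows, drop rows whose src is NaN, keep the n newest."""
    points = [Point(*row) for row in res]
    observed = [p for p in points if not math.isnan(p.src)]
    return sorted(observed, key=lambda p: p.unixtime, reverse=True)[:n]


def summarize(points):
    """Aggregate like the last query: sum(prob), max(src), max(up).
    `points` must be non-empty."""
    return {
        "sumProb": sum(p.prob for p in points),
        "srcMax": max(p.src for p in points),
        "upMax": max(p.up for p in points),
    }


def should_alert(summary, hard_limit=100.0):
    """Hypothetical rule: fire on a flagged point above the upper bound,
    or, as a fallback, whenever the count exceeds a fixed limit."""
    rising = summary["sumProb"] > 0 and summary["srcMax"] > summary["upMax"]
    return rising or summary["srcMax"] > hard_limit


# Made-up rows; the last one has no observed value (src is NaN).
res = [
    [1000.0, 2.0, 2.1, 4.0, 0.5, 0.0],
    [1060.0, 3.0, 2.4, 4.2, 0.6, 0.0],
    [1120.0, 9.0, 2.6, 4.5, 0.7, 1.0],
    [1180.0, float("nan"), 3.0, 5.0, 1.0, 0.0],
]
latest = latest_points(res)
summary = summarize(latest)
print(summary, should_alert(summary))
```

The fallback branch exists so that an alert still fires when the count is extreme even if the detector does not flag the point.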

The specific alert settings are as follows:


3. Promotion time

3.1 Advanced log topics

Here are demos of the various Log Service features:

Log Service overview and assorted demos

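For more advanced log content, see:

Log Service learning path

3.2 Contact us

For corrections, or to contribute help documentation and best practices, contact: 悟冥

For questions, join the DingTalk group: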
