天天看點

Tomcat c3p0連接配接池故障一例

負責維護的系統出現c3p0連接配接池洩露的故障,導緻Oracle DB響應極慢,本文對當時故障處理做簡單記錄,雖然最後直接換掉tomcat版本、棄用c3p0連接配接池解決了故障,還是希望本文能為遇到類似問題的同學提供一些思路。

      • 環境
      • 故障現象
      • 目前c3p0連接配接池配置
      • 查詢session分布
      • Oracle Zabbix監控
      • Oracle AWR分析
      • 分析session的SQL記錄
      • 嘗試分析原因
      • 幾個c3p0配置的官方建議

環境

資料庫和應用在同一台伺服器上。

  • OS: Windows Server 2008R2 Enterprise
  • 應用: Tomcat 6.0.32,部署一個webservice接口服務
  • 資料庫: Oracle 11g R2 11.2.0.1 64bit

故障現象

每天上午08:00開始業務,約至10:00左右開始出現異常,tomcat catalina.log報錯”java.sql.SQLException: Io 異常: The Network Adapter could not establish the connection”,c3p0連接配接池無法擷取新的connection。同時Oracle Database響應極慢。

Tomcat c3p0連接配接池故障一例

目前c3p0連接配接池配置

檢視tomcat c3p0連接配接池配置參數如下。

c3p0.minPoolSize=1
c3p0.maxPoolSize=300    <----
c3p0.idleConnectionTestPeriod=35    <----
c3p0.maxIdleTime=35    <----
c3p0.maxStatements=0
c3p0.maxStatementsPerConnection=0
c3p0.preferredTestQuery=select 1 from dual    <----
c3p0.acquireIncrement=2
c3p0.acquireRetryDelay=1000
c3p0.acquireRetryAttempts=10
c3p0.initialPoolSize=1

           

查詢session分布

v$session,按machine統計Oracle database會話情況,發現90%都是本伺服器tomcat發起。其發起的session數在1000以上,遠超c3p0 maxPoolSize=300的配置,猜想是c3p0連接配接池配置問題,或者某種原因導緻連接配接池洩露。

SELECT b.MACHINE, b.PROGRAM, COUNT (*)
   FROM v$process a, v$session b
   WHERE a.ADDR = b.PADDR AND b.USERNAME IS NOT NULL
GROUP BY b.MACHINE, b.PROGRAM
ORDER BY COUNT (*) DESC;
           

Oracle Zabbix監控

在Zabbix(orabbix)監控上,看到session在08:00時約200個,此後開始迅速增長,到10:15左右達到1400個。中午12:00安排重新開機tomcat,下午13:00業務開始後,session依舊快速增長,故障再次出現。圖上還可以看出session主體是”inactive session”,”Active Session”、”System Session”任然算是在正常範圍内。想象一下有1000個active session!? 那資料庫早就崩了!

Tomcat c3p0連接配接池故障一例

Oracle AWR分析

從11月16日10:00~11:00 1個小時内的性能awr報告可以看到,平均每秒鐘31.4次使用者登入(logon),60min内有113040次登入。

Tomcat c3p0連接配接池故障一例

按照對CPU Time的消耗,對1小時内SQL語句統計排序↓,排在前5的語句,均是oracle在建立新session時,使用者登入階段擷取環境參數、權限資訊、連接配接資訊的底層SQL語句。這些語句分别執行的次數基本和登入次數一緻,其中“cm5vu20fhtnq1”執行次數約為登入次數的4倍。

SQL_ID SQL Text
459f3z9u4fb3u select value fromprops where name = GLOBAL_DB_NAME’
cm5vu20fhtnq1 select /+ connect_by_filtering / privilege#, level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0
0ws7ahf1d78qa select SYS_CONTEXT(‘USERENV’, ‘SERVER_HOST’), SYS_CONTEXT(‘USERENV’, ‘DB_UNIQUE_NAME’), SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’), SYS_CONTEXT(‘USERENV’, ‘SERVICE_NAME’), INSTANCE_NUMBER, STARTUP_TIME, SYS_CONTEXT(‘USERENV’, ‘DB_DOMAIN’) from v$instance where INSTANCE_NAME=SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’)
0k8522rmdzg4k select privilege# from sysauth$ where (grantee#=:1 or grantee#=1) and privilege#>0
5ur69atw3vfhj select decode(failover_method, NULL, 0 , ‘BASIC’, 1, ‘PRECONNECT’, 2 , ‘PREPARSE’, 4 , 0), decode(failover_type, NULL, 1 , ‘NONE’, 1 , ‘SESSION’, 2, ‘SELECT’, 4, 1), failover_retries, failover_delay, flags from service$ where name = :1

而應用業務有關的sql其實執行次數以及CPU、IO消耗很小↓↓。

Tomcat c3p0連接配接池故障一例

收集同一時段内,業務核心資料庫(2節點Oracle RAC)的AWR,每秒oracle使用者登入次數為0.31,是故障伺服器上資料庫的1/100;并且耗時較多的sql語句均是應用業務相關的。↓經過和開發同僚确認,該webservice的通路量遠小于核心應用,相應資料庫的logon次數不應該比核心庫還大的多。

Tomcat c3p0連接配接池故障一例

分析session的SQL記錄

那麼到底這些session在變為inactive之前做了什麼操作呢?到底這些inactive session空閑了多久?

需要查v$session,統計session的空閑時間,及其最後執行的SQL語句。v$session中的last_call_et字段,反映了session最後一次執行sql到目前時刻共逝去多少時間(機關:秒)。以下語句統計了目前所有session從最後一次執行sql以來的空閑時間,以及最後一次執行的sql語句ID。

select username,
        s.SID,
        s.SERIAL#,
        s.logon_time,
        s.PREV_SQL_ID,
        s.last_call_et
   from v$session s
  where s.username is not null
  order by s.LAST_CALL_ET desc;
           

可以看出業務使用者的所有1200多個session中,空閑時間從0秒~34秒不等,不超過35秒(c3p0 maxIdleTime = 35)。成功執行的最後一個sql語句都是(520mkxqpf15q8):select 1 from dual,也即是c3p0連接配接池的connect testquery語句/idleConnectionTest測試語句。↓↓↓

Tomcat c3p0連接配接池故障一例

嘗試分析原因

應用擷取資料庫連接配接(connection)的開銷比較大,是以才有了連接配接池的設計。建立連接配接池,就是希望申請到的connection存放在pool裡,達到“建立一次,循環使用”的狀态。

應用使用getConnection()方法向c3p0 pool申請connection,如果pool裡有空閑的,則應用得到一個connection;如果pool裡沒有空閑的則由c3p0到資料庫去擷取新的connection,然後配置設定給應用;應用程式的資料庫操作完成後,使用close()方法将connection放回到pool中,标記為idle,以便其他應用線程調用。

現場的情況是c3p0的poolsize逐漸增大,當poolsize達到預設值300時,并沒有停止而是繼續申請connection,并且通過idleConnectiontest不斷檢查pool中所有連接配接的狀态。

以下是從網上找到幾個可能的原因,但是都跟本次場景都不符合:

  • 從pool裡擷取connection時,使用了getConnection(username, password)格式,并且一個DataSource配置裡有多個使用者名。

    舉例,比如一個DataSource配置中含有dbuser1、dbuser2、dbuser3、dbuser4、dbuser5等5個,則實際的pool最大連接配接數是maxpoolsize * 5=1500。**

  • 程式歸還connection時close()方法有問題
  • tomcat的server.xml 配置裡autoDeploy=true,并且線上修改webapp應用程式的配置時,autoDeploy功能會導緻自動生成一個新的pool。

幾個c3p0配置的官方建議

  • Managing Pool Size and Connection Age

    maxConnectionAge/maxIdleTime不易過小

    Some general advice about all of these timeout parameters: Slow down! The point of Connection pooling is to bear the cost of acquiring a Connection only once, and then to reuse the Connection many, many times. Most databases support Connections that remain open for hours at a time. There’s no need to churn through all your Connections every few seconds or minutes.Setting maxConnectionAge or maxIdleTime to 1800 (30 minutes) is quite aggressive. For most databases, several hours may be more appropriate. You can ensure the reliability of your Connections by testing them, rather than by tossing them.

    maxIdleTime配置為30min就已經很激進了,建議以小時為機關。可以使用connection test定期去檢查連接配接的可靠性,檢測到不可用的connection時,再将其從pool裡剔除出去。

    懷疑是maxpoolsize=35太小,現場将其修改為3600,重新開機tomcat觀察故障依舊:(。

  • Simple advice on Connection testing

    If you’d like to improve performance by eliminating Connection testing from clients’ code path:

    檢查修正了以下的建議配置,重新開機tomcat問題依舊。最後開發同僚從别的伺服器上copy了更新版本的tomcat,棄用c3p0連接配接池,恢複業務,總算是解決了這個故障,故障根本原因就沒再花時間去測試了

a. Set testConnectionOnCheckout to false

b. Set testConnectionOnCheckin to true

c. Set idleConnectionTestPeriod to 30, fire up you application and observe. This is a pretty robust setting, all Connections will tested on check-in and every 30 seconds thereafter while in the pool. Your application should experience broken or stale Connections only very rarely, and the pool should recover from a database shutdown and restart quickly. But there is some overhead associated with all that Connection testing.

d. If database restarts will be rare so quick recovery is not an issue, consider reducing the frequency of tests by idleConnectionTestPeriod to, say, 300, and see whether clients are troubled by stale or broken Connections. If not, stick with 300, or try an even bigger number. Consider setting testConnectionOnCheckin back to false to avoid unnecessary tests on checkin. Alternatively, if your application does encounter bad Connections, consider reducing idleConnectionTestPeriod and set testConnectionOnCheckin back to true. There are no correct or incorrect values for these parameters: you are trading off overhead for reliability in deciding how frequently to test. The exact numbers are not so critical. It’s usually easy to find configurations that perform well. It’s rarely worth spending time in pursuit of “optimal” values here.

參考:

c3p0 connection pool queries

http://stackoverflow.com/questions/13942540/c3p0-connection-pool-queries

c3p0 connection pool is not shrinking

http://stackoverflow.com/questions/21343741/c3p0-connection-pool-is-not-shrinking?rq=1

C3P0連接配接池配置

http://blog.csdn.net/caihaijiang/article/details/6843496

Does c3p0 connection pooling ensures max pool size?

http://stackoverflow.com/questions/16932846/does-c3p0-connection-pooling-ensures-max-pool-size

c3p0 Connection Pool not closing connections

http://stackoverflow.com/questions/9824879/c3p0-connection-pool-not-closing-connections

c3p0 - JDBC3 Connection and Statement Pooling

http://www.mchange.com/projects/c3p0/#contents