
Tracking Down Continuously Rising ES Memory Usage: Investigating a High-Memory-Usage Alert

https://discuss.elastic.co/t/es-vs-lucene-memory/20959

I've read the recommendations for ES_HEAP_SIZE, which basically state to set -Xms and -Xmx to 50% of physical RAM. It says the rest should be left for Lucene to use (OS filesystem caching).

But I'm confused about how Lucene uses that. Doesn't Lucene run in the same JVM as ES? So they would share the same max heap setting of 50%.

nik9000 Nik Everett (Elastic Team Member), Nov '14

Lucene runs in the same JVM as Elasticsearch, but (by default) it mmaps files and then iterates over their content intelligently. That means most of its actual storage is "off heap" (it's a Java buzz-phrase). Anyway, Linux will serve reads from mmapped files out of its page cache. That is why you want to leave Linux a whole bunch of unused memory.

Nik

In short, the official advice is to leave some free memory for the underlying file system cache that ES (Lucene) depends on.
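As a concrete sketch of the 50/50 split (assuming a hypothetical machine with 32 GB of RAM; `ES_HEAP_SIZE` matches the 1.x-era thread quoted above, while newer releases set `-Xms`/`-Xmx` in `config/jvm.options`):

```shell
# Give the JVM heap ~50% of physical RAM (16g on a 32 GB box) and leave
# the other half to the OS page cache that serves Lucene's mmapped files.
export ES_HEAP_SIZE=16g          # 1.x/2.x style
# Equivalent jvm.options lines on 5.x and later:
#   -Xms16g
#   -Xmx16g

# Check how much RAM the OS is currently using as cache:
free -m
```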

pountz Adrien Grand (Elastic Team Member)

Indeed the behaviour is the same on Windows and Linux: memory that is not used by processes is used by the operating system in order to cache the hottest parts of the file system. The reason why the docs say that the rest should be left to Lucene is that most disk accesses that Elasticsearch performs are done through Lucene.

I used procexp and VMMap to double-check; yes, I think they are file system cache.

Is there any way to control the size of the file system cache? Because right now it easily drives up OS memory consumption. When it reaches 100%, the node fails to respond...

They, too, saw memory usage (Java heap + file system cache) on their ES machines reach 100%, but that thread did not offer a solution.

A similar ES memory issue appears at: https://discuss.elastic.co/t/jvm-memory-increase-1-time-more-xmx16g-while-elasticsearch-heap-is-stable/55413/4

In that case the workload was scheduled queries running every day.

Elasticsearch uses not only heap but also off-heap memory buffers because of Lucene.

I just read the Lucene blog post, and I already know that Lucene/ES use the file system cache (via MMapDirectory).

That's why in my memory graph you can see: free (in green) + used memory (in red) + cached memory (the FS cache, in blue).

https://discuss.elastic.co/t/memory-usage-of-the-machine-with-es-is-continuously-increasing/23537/2

That thread reports ES memory climbing by 200 MB every day, with everything returning to normal after an ES restart.

Note that when I restart ES it gets cleared (most of it; maybe the OS clears up this cache once it sees that the parent process has stopped).

When the underlying Lucene engine interacts with a segment, the OS will leverage free system RAM and keep that segment in memory. However, Elasticsearch/Lucene has no way to control OS-level caches.

This is an operating-system cache that ES itself cannot control.

Reposted from: http://farll.com/2016/10/high-memory-usage-alarm/

A high-memory-usage alert on a Linux CentOS server: the conclusion in the end was that a bug in nss-softokn made the dentry cache balloon after a large number of curl requests.

It started with a text message from the cloud platform warning that average memory usage had exceeded the alert threshold. Logging into the cloud monitoring console showed that, beginning two days earlier, the memory-usage curve had been rising slowly in a very regular pattern.


Running top and pressing M to sort by memory usage revealed no process consuming unusual amounts of memory, and H showed the threads were normal too; the biggest consumer was mysql at 6.9% of memory. Telnetting into memcached and running stats showed no excessive memory consumption or other anomalies either.


free -m showed a very high -/+ buffers/cache used figure, with only a bit over ten percent of memory free. So where had the memory gone?

cat /proc/meminfo gives more detailed memory-allocation information: Slab and SReclaimable had reached several GB. Slab is memory managed by the kernel's slab allocator; SReclaimable, as the name suggests, is the reclaimable portion of it.
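These checks can be reproduced with a couple of commands (slabtop ships with procps and needs root to read /proc/slabinfo):

```shell
# Slab-related counters from /proc/meminfo (values are in kB).
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo

# Break the slab down per cache, sorted by size; in this incident the
# 'dentry' cache dominated the list. Needs root, hence the fallback.
slabtop -o -s c 2>/dev/null | head -n 15 || true
```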


Since neither the server configuration nor the application code had changed recently, the first suspicion was that the AliyunUpdate auto-update agent had changed something. But stracing each of the Ali processes in turn revealed no heavy file activity. crond was reading and writing a few session and other temporary files, but far too few to account for this.

The official support ticket only suggested considering a server memory upgrade and, if the shortage was affecting the business, temporarily releasing the memory occupied by the slab, with reference steps provided:
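The steps themselves appear as a screenshot in the original post; a sketch of the usual way to temporarily release reclaimable slab objects (my assumption of what the ticket described, not its exact text) looks like this:

```shell
# Flush dirty filesystem data first so nothing is pinned.
sync
# 2 = reclaim dentries and inodes; 3 would also drop the page cache.
# Needs root; only clean caches are dropped, and the effect is temporary:
# the cache grows right back once the workload resumes.
echo 2 | tee /proc/sys/vm/drop_caches >/dev/null || echo "needs root"
```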

That was only a stopgap; surely we shouldn't have to set up a scheduled task that reclaims the dentry cache periodically. Still, we were closing in on the essence of the problem: excessive slab memory that could not be reclaimed was exhausting memory. We can raise the priority of slab reclamation through the vfs_cache_pressure parameter Linux provides. It defaults to 100; values above 100 make slab reclaim increasingly aggressive (run as root):
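A sketch of raising the reclaim pressure (the value 200 here is an illustrative choice, not taken from the original post):

```shell
# Current value; the kernel default is 100.
cat /proc/sys/vm/vfs_cache_pressure

# Make the kernel reclaim dentry/inode slabs more eagerly (needs root).
sysctl -w vm.vfs_cache_pressure=200 || echo "needs root"

# To persist across reboots, add to /etc/sysctl.conf:
#   vm.vfs_cache_pressure = 200
```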


It is not advisable to reclaim the slab cache through the min_free_kbytes and extra_free_kbytes parameters mentioned at the end of the Laravel queue article above; the kernel documentation states fairly clearly that vfs_cache_pressure is the knob intended for controlling reclamation of the directory and inode caches.

Although the excessive slab memory was now effectively contained and reclaimed in real time, we still had not found the root cause; in principle, the default vfs_cache_pressure value should strike a reasonable balance between available memory and the slab cache.

Finally, checking the site's service logs around that time revealed the real cause: starting from that point, a system crontab task had been continuously requesting an external API via curl, and that API was served over HTTPS. The culprit was a bug in NSS (Network Security Services), Mozilla's security library that ships with libcurl; NSS is only pulled in when curl requests SSL, i.e. HTTPS, resources.

Step 1: make sure nss-softokn is a version in which the bug has been fixed

Check whether the version is >= 3.16.0 (per the bug-fix link above, anything newer than nss-softokn-3.14.3-12.el6 also works):

<code>yum list nss-softokn</code>

If the version is too old, upgrade it:

<code>sudo yum update -y nss-softokn</code>

Step 2: set the environment variable NSS_SDB_USE_CACHE=yes
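Where to set the variable depends on what invokes curl; the placements below are common options (the php-fpm pool path is hypothetical):

```shell
# Simplest: export it in the environment of whatever runs curl.
export NSS_SDB_USE_CACHE=yes

# For PHP under php-fpm, set it in the pool config instead
# (e.g. /etc/php-fpm.d/www.conf):
#   env[NSS_SDB_USE_CACHE] = yes

# For cron jobs, put the assignment at the top of the crontab:
#   NSS_SDB_USE_CACHE=yes
```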

Step 3: restart the web server

Restart Apache, or restart Nginx and php-fpm. Otherwise curl_error may report <code>Problem with the SSL CA cert (path? access rights?)</code>.

With these settings in place, the dentry cache finally, gradually, returned to normal.

Reposted from 张昺华-sky's cnblogs blog; original link: http://www.cnblogs.com/bonelee/p/8066544.html. To republish, please contact the original author.