
Zabbix housekeeper processes more than 75% busy

Date: 2020-09-04 17:34:53


Root Cause Analysis

 

 

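To keep the database from growing without bound, Zabbix automatically deletes old history data through a mechanism called the housekeeper. When the housekeeper purges history data frequently, MySQL performance can degrade, and that is when this alert fires.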

 

Zabbix is usually set up to monitor the Zabbix server itself. As shown below, the "Zabbix server: Utilization of housekeeper internal processes, in %" graph shows that the housekeeper process suddenly became busy at around 10:32.

 

[Figure: "Zabbix server: Utilization of housekeeper internal processes, in %" graph]

 

A screenshot over a longer time range:

 

[Figure: the same graph over a longer time range]

 

 

For further analysis, check the MySQL slow query log:

 

mysql> show variables like '%slow_query%';
+---------------------+-------------------------------------+
| Variable_name       | Value                               |
+---------------------+-------------------------------------+
| slow_query_log      | ON                                  |
| slow_query_log_file | /mysql_data/mysql/xxxx-slow.log     |
+---------------------+-------------------------------------+
2 rows in set (0.01 sec)

 

 

# Note: unless configured otherwise, timestamps in the slow query log are in UTC, so they differ from UTC+8 local time.

# Time: 2020-08-26T02:34:56.354162Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 13.832335  Lock_time: 0.000088 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409282;
delete from history where itemid=37078 limit 5000;
# Time: 2020-08-26T02:35:00.377783Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 4.023518  Lock_time: 0.000126 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409296;
delete from history where itemid=37079 limit 5000;
# Time: 2020-08-26T02:35:36.848120Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 21.513432  Lock_time: 0.000094 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409315;
delete from history where itemid=37099 limit 5000;
# Time: 2020-08-26T02:35:46.705206Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 9.856468  Lock_time: 0.000124 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409336;
delete from history where itemid=37100 limit 5000;
# Time: 2020-08-26T02:36:43.856421Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 38.186585  Lock_time: 0.000039 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409365;
delete from history where itemid=38789 limit 5000;
# Time: 2020-08-26T02:36:59.432174Z
# User@Host: zabbix[zabbix] @ localhost [127.0.0.1]  Id: 345563
# Query_time: 8.542213  Lock_time: 0.000084 Rows_sent: 20  Rows_examined: 7298
SET timestamp=1598409410;
SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,er1.r_eventid FROM events e LEFT JOIN event_recovery er1 ON er1.eventid=e.eventid WHERE e.source='0' AND e.object='0' AND e.objectid=26811 AND e.eventid<='3437835' AND e.value=1 ORDER BY e.eventid DESC LIMIT 20;
# Time: 2020-08-26T02:37:02.317422Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 18.460853  Lock_time: 0.000101 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409403;
delete from history where itemid=38790 limit 5000;
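
The entries above can also be summarized programmatically. The following is a minimal Python sketch, not part of the original troubleshooting, that parses slow query log text in this format and converts the UTC timestamps to local time. It assumes the local zone is UTC+8 and that each statement fits on one line; `parse_slow_log` and `LOCAL_TZ` are illustrative names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server's local time zone is UTC+8, as in this article.
LOCAL_TZ = timezone(timedelta(hours=8))

TIME_RE = re.compile(r"^# Time: (\S+)")
QUERY_TIME_RE = re.compile(r"^# Query_time: ([\d.]+)")


def parse_slow_log(lines):
    """Yield (local_time, query_seconds, sql) for each slow log entry.

    Only the first line of each SQL statement is captured.
    """
    when = seconds = None
    for line in lines:
        line = line.rstrip("\n")
        m = TIME_RE.match(line)
        if m:
            # The log writes ISO 8601 timestamps with a trailing "Z" (UTC).
            utc = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
            when = utc.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)
            continue
        m = QUERY_TIME_RE.match(line)
        if m:
            seconds = float(m.group(1))
            continue
        # Skip other header lines, the SET timestamp line, and blank lines.
        if line.startswith("#") or line.startswith("SET timestamp=") or not line.strip():
            continue
        if when is not None and seconds is not None:
            yield when, seconds, line.strip()
            when = seconds = None


sample = """\
# Time: 2020-08-26T02:34:56.354162Z
# User@Host: zabbix[zabbix] @ localhost []  Id: 345463
# Query_time: 13.832335  Lock_time: 0.000088 Rows_sent: 0  Rows_examined: 5000
SET timestamp=1598409282;
delete from history where itemid=37078 limit 5000;
""".splitlines()

for when, seconds, sql in parse_slow_log(sample):
    print(when.strftime("%Y-%m-%d %H:%M:%S"), f"{seconds:.1f}s", sql)
```

For the first entry above, the UTC time 02:34:56 becomes 10:34:56 local time, which lines up with the moment the housekeeper graph shows the process becoming busy.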

 


 

In addition, Zabbix Server writes slow SQL statements to zabbix_server.log, as shown below.

 

# grep "slow query" zabbix_server.log

 

[Screenshot: slow query entries found in zabbix_server.log]

 

 

 

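The logs show that MySQL was deleting data from the history and history_uint tables. In this case the housekeeper suddenly became busy because I had deleted the template "Zabbix template for Microsoft SQL Server" with the Clear option checked, which left Zabbix Server with a large amount of data to delete. That was only the trigger, though; the more important cause is that the history table itself had grown very large. You can check the size of these tables with the following query.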

 

SELECT TABLE_SCHEMA
    ,  TABLE_NAME
    , (DATA_LENGTH/1024/1024)     AS DATA_SIZE_MB 
    , (INDEX_LENGTH/1024/1024)  AS INDEX_SIZE_MB
    , ((DATA_LENGTH+INDEX_LENGTH)/1024/1024) AS TABLE_SIZE_MB
    , TABLE_ROWS 
FROM INFORMATION_SCHEMA.TABLES 
WHERE table_schema = 'zabbix'
ORDER BY TABLE_SIZE_MB ASC;

 

Solution

 

 

If this alert shows up only briefly, it can generally be ignored. If it keeps firing, adjust the HousekeepingFrequency and MaxHousekeeperDelete parameters.

 

 

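In Zabbix 5.x, HousekeepingFrequency defaults to 1, meaning housekeeping runs once an hour. MaxHousekeeperDelete defaults to 5000, meaning at most 5000 rows are deleted per task in one housekeeping cycle, as shown below: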

 

# grep -C 1 HousekeepingFrequency /etc/zabbix/zabbix_server.conf
 
### Option: HousekeepingFrequency
#       How often Zabbix will perform housekeeping procedure (in hours).
#       Housekeeping is removing outdated information from the database.
#       To prevent Housekeeper from being overloaded, no more than 4 times HousekeepingFrequency
#       hours of outdated information are deleted in one housekeeping cycle, for each item.
#       To lower load on server startup housekeeping is postponed for 30 minutes after server start.
#       With HousekeepingFrequency=0 the housekeeper can be only executed using the runtime control option.
#       In this case the period of outdated information deleted in one housekeeping cycle is 4 times the
--
# Default:
# HousekeepingFrequency=1
 
 
# grep -C 1 MaxHousekeeperDelete  /etc/zabbix/zabbix_server.conf
 
### Option: MaxHousekeeperDelete
#       The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
#       [housekeeperid], [tablename], [field], [value].
#       No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
#       will be deleted per one task in one housekeeping cycle.
--
# Default:
# MaxHousekeeperDelete=5000

 

 

 

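MaxHousekeeperDelete has been supported since Zabbix 1.8.2. Setting it to 0 removes the limit on the number of rows deleted, which is not recommended. Note that the parameter only applies when deleting history and trend data for items that have already been deleted. The usual fix is to lengthen the interval between housekeeper runs and raise the number of rows deleted per run. There is no standard answer for the right values; they have to be determined by testing against the actual environment.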

 

 

HousekeepingFrequency=6             # run housekeeping every 6 hours

MaxHousekeeperDelete=10000          # maximum number of rows deleted per task
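
To get a feel for how these two parameters interact, here is a rough Python sketch that estimates how long it takes to drain a backlog of rows left behind by one deleted item. It relies only on the rule quoted from the configuration file (at most MaxHousekeeperDelete rows per task per cycle); the backlog size and the function names are hypothetical, and the estimate ignores how long each delete statement itself takes.

```python
import math


def cycles_to_drain(rows, max_housekeeper_delete):
    """Housekeeping cycles needed to delete `rows` rows for one task."""
    if max_housekeeper_delete == 0:
        return 1  # 0 means no per-cycle row limit
    return math.ceil(rows / max_housekeeper_delete)


def hours_to_drain(rows, max_housekeeper_delete, housekeeping_frequency):
    """Rough upper estimate of hours until the backlog is gone."""
    return cycles_to_drain(rows, max_housekeeper_delete) * housekeeping_frequency


backlog = 1_000_000  # hypothetical row count for one deleted item
for max_delete, frequency in [(5000, 1), (10000, 6)]:
    cycles = cycles_to_drain(backlog, max_delete)
    hours = hours_to_drain(backlog, max_delete, frequency)
    print(f"MaxHousekeeperDelete={max_delete}, HousekeepingFrequency={frequency}: "
          f"{cycles} cycles, about {hours} hours")
```

Under these assumptions, a longer interval with a larger batch means fewer housekeeper runs, but the same backlog takes more wall-clock time to clear. Which trade-off is acceptable depends on the workload, which is why the values need to be tested.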

 

 

 

In this case, after raising MaxHousekeeperDelete to 100000, the delete statements actually became much slower, as shown below:

 

836378:20200826:161213.441 slow query: 773.254950 sec, "delete from history where itemid=45251 limit 100000"

836378:20200826:162435.978 slow query: 742.537260 sec, "delete from history where itemid=46694 limit 100000"

836378:20200826:163329.011 slow query: 532.932137 sec, "delete from history where itemid=51313 limit 100000"

836378:20200826:163842.539 slow query: 313.528311 sec, "delete from history where itemid=52664 limit 100000"

 

With MaxHousekeeperDelete set to 10000, the deletes ran considerably faster, so it is worth testing several values.

 

 

943980:20200826:233157.246 slow query: 5.393617 sec, "delete from history where itemid=37769 limit 10000"

943980:20200826:233202.914 slow query: 5.667551 sec, "delete from history where itemid=38407 limit 10000"

943980:20200826:233208.044 slow query: 5.129767 sec, "delete from history where itemid=41283 limit 10000"

943980:20200826:233217.462 slow query: 7.011403 sec, "delete from history where itemid=37770 limit 10000"

943980:20200826:233222.516 slow query: 5.053935 sec, "delete from history where itemid=38408 limit 10000"

943980:20200826:233227.286 slow query: 4.769753 sec, "delete from history where itemid=41284 limit 10000"
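
The two sets of log lines can be compared more systematically. The sketch below, in Python, groups "slow query" delete entries from zabbix_server.log by their LIMIT value and reports the average duration and an approximate cost per row. It assumes every statement deleted exactly LIMIT rows, which the log does not confirm; `summarize` is an illustrative helper, not a Zabbix tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 836378:20200826:161213.441 slow query: 773.254950 sec, "delete from history where itemid=45251 limit 100000"
SLOW_RE = re.compile(
    r'slow query: ([\d.]+) sec, "delete from \w+ where itemid=\d+ limit (\d+)"'
)


def summarize(lines):
    """Return {limit: (count, avg_seconds, avg_ms_per_row)} for delete statements."""
    durations = defaultdict(list)
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            durations[int(m.group(2))].append(float(m.group(1)))
    summary = {}
    for limit, secs in durations.items():
        avg = sum(secs) / len(secs)
        # Assumption: each statement removed exactly `limit` rows.
        summary[limit] = (len(secs), avg, avg / limit * 1000)
    return summary


sample = [
    '836378:20200826:161213.441 slow query: 773.254950 sec, "delete from history where itemid=45251 limit 100000"',
    '836378:20200826:162435.978 slow query: 742.537260 sec, "delete from history where itemid=46694 limit 100000"',
    '943980:20200826:233157.246 slow query: 5.393617 sec, "delete from history where itemid=37769 limit 10000"',
    '943980:20200826:233202.914 slow query: 5.667551 sec, "delete from history where itemid=38407 limit 10000"',
]

for limit, (count, avg, ms_per_row) in sorted(summarize(sample).items()):
    print(f"limit={limit}: {count} statements, avg {avg:.1f}s, ~{ms_per_row:.2f} ms/row")
```

On these four sample lines the per-row cost comes out more than ten times higher at limit 100000 than at limit 10000, which matches the observation that the larger batch was slower here.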

 

 

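Other measures can also prevent or reduce this alert, such as shortening the history retention period or partitioning large tables like history. In my experience, once a table such as history has grown very large, tuning the parameters above has little visible effect; partitioning or manually purging historical data is what makes a real difference.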


Original source: https://www.cnblogs.com/kerrycode/p/13570463.html
