
PV counts come out wrong when set hive.map.aggr=true

Date: 2015-08-20 20:38:09


The task: group by on a single table, then compute a running total (PV) and a distinct count (UV):

For efficiency, the following were enabled: parallel execution with set hive.exec.parallel=true (optionally set hive.exec.parallel.thread.number=16), plus set hive.groupby.skewindata=true and set hive.map.aggr=true.

select plat, pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat, pagetype
union all
select plat, 'all' pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat
union all
select 'all' plat, pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by pagetype
union all
select 'all' plat, 'all' pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19'

The trouble came from set hive.map.aggr=true, the map-side aggregation setting:

the PV numbers it produced did not match the true values.
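One way to confirm that a setting is responsible is to rerun a plain count with map-side aggregation turned off and compare it with the PV total reported above. The snippet below is only a sketch of that check, not the author's procedure; it assumes dt is a string partition column and reuses the table and date from the queries above.

```sql
-- Sketch: baseline PV for the same day without map-side aggregation.
-- Assumption: dt is a string column, so the date literal is quoted.
set hive.map.aggr=false;

select count(*) pv
from client_pv_form
where dt = '2015-08-19';
```

If this baseline differs from the 'all' / 'all' row of the original result, the discrepancy is in how the aggregation ran rather than in the data.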

Changing the code to the following, with sum(1) in place of count(*), gives correct results:

select plat, pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat, pagetype
union all
select plat, 'all' pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat
union all
select 'all' plat, pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by pagetype
union all
select 'all' plat, 'all' pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19'
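Since count(*) and sum(1) should return the same number for every group, a quick way to see the discrepancy is to compute both in one query under the same settings. This is a hedged sketch: the aliases pv_count and pv_sum are made up for illustration, and dt is again assumed to be a string column.

```sql
-- Sketch: put both PV expressions side by side for each group.
-- pv_count and pv_sum are illustrative aliases, not from the original post.
select plat, pagetype,
       count(*) pv_count,
       sum(1) pv_sum,
       count(distinct userkey) uv
from client_pv_form
where dt = '2015-08-19'
group by plat, pagetype;
```

Any row where pv_count and pv_sum disagree points at the aggregation path rather than at the source data.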

 


Original article: http://www.cnblogs.com/sudz/p/4745985.html
