码迷,mamicode.com
首页 > 其他好文 > 详细

hive:子查询

时间:2014-10-22 10:55:14      阅读:1598      评论:0      收藏:0      [点我收藏+]

标签:blog   sp   div   2014   on   log   bs   ef   时间   

hive本身支持的子查询非常有限,Hive不支持where子句中的子查询,只允许子查询在from中出现

错误写法:

insert into table branch_atmzc_sum
Select  XT_OP_TRL, SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt as cr_tx_amt from branch_atmzc a join  branch_atmzc b  on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘), counts from branch_atmzc
group by XT_OP_TRL, SA_TX_DT,cr_tx_amt,counts;

正确写法:

insert into table branch_atmzc_sum
Select a.XT_OP_TRL, a.SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt ,b.counts+a.counts  from branch_atmzc a join  branch_atmzc b  on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘)

另外一种:考虑效率

SELECT
	t1.产品类型,
	COUNT(DISTINCT (IF(t2.用户ID IS NULL, NULL, t1.用户ID))) AS KEEP_UV
FROM
(
	SELECT
		产品类型,
		用户ID
	FROM 事实表
	WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228)
) t1
LEFT OUTER JOIN
(
	SELECT
		产品类型,
		用户ID
	FROM 事实表
	WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131)
) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID)
GROUP BY t1.产品类型

 本身表包含的字段信息多,时间跨度大。对于这种对于IN / EXISTS子查询(准确地说,这里是非相关子查询)有一种高效的实现,就是LEFT SEMI JOIN:

       LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way.

left join 

SELECT
	产品类型,
	COUNT(DISTINCT t1.用户ID) AS KEEP_UV
FROM
(
	SELECT
		产品类型,
		用户ID
	FROM 事实表
	WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228)
) t1
LEFT SEMI JOIN
(
	SELECT
		产品类型,
		用户ID
	FROM 事实表
	WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131)
) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID)
GROUP BY 产品类型

  

hive:子查询

标签:blog   sp   div   2014   on   log   bs   ef   时间   

原文地址:http://www.cnblogs.com/kxdblog/p/4042418.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!