TPCH Benchmark with Impala

Date: 2014-07-13 22:27:46

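1. Generate the test data
Download the dbgen tool from the TPC-H website, http://www.tpc.org/tpch/, to generate the data: http://www.tpc.org/tpch/spec/tpch_2_17_0.zip
Unzip it, change to the dbgen directory, and edit the makefile: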

################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC      = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS, 
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32 
# Current values for WORKLOAD are:  TPCH
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH

After compiling, run ./dbgen -help:

jfp4-1:/mnt/disk1/tpch_2_17_0/dbgen # ./dbgen -help
TPC-H Population Generator (Version 2.17.0 build 0)
Copyright Transaction Processing Performance Council 1994 - 2010
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
    [-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]

Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: 1)
-f     -- force. Overwrite existing files
-h     -- display this message
-q     -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to  <n> (default: 1) 
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v     -- enable VERBOSE mode

Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c   -- generate cutomers ONLY
-T l   -- generate nation/region ONLY
-T L   -- generate lineitem ONLY
-T n   -- generate nation ONLY
-T o   -- generate orders/lineitem ONLY
-T O   -- generate orders ONLY
-T p   -- generate parts/partsupp ONLY
-T P   -- generate parts ONLY
-T r   -- generate region ONLY
-T s   -- generate suppliers ONLY
-T S   -- generate partsupp ONLY

To generate the SF=1 (1GB), validation database population, use:
    dbgen -vf -s 1

To generate updates for a SF=1 (1GB), use:
    dbgen -v -U 1 -s 1

Run ./dbgen -s 1024 to generate 1 TB of data:

jfp4-1:/mnt/disk1/tpch_2_17_0/dbgen # ll *.tbl
-rw-r--r-- 1 root root  25384864295 Jul  3 23:04 customer.tbl
-rw-r--r-- 1 root root 833545019752 Jul  3 23:04 lineitem.tbl
-rw-r--r-- 1 root root         2224 Jul  3 23:04 nation.tbl
-rw-r--r-- 1 root root 185305368911 Jul  3 23:04 orders.tbl
-rw-r--r-- 1 root root  25329003396 Jul  3 23:04 part.tbl
-rw-r--r-- 1 root root 126691192078 Jul  3 23:04 partsupp.tbl
-rw-r--r-- 1 root root          389 Jul  3 23:04 region.tbl
-rw-r--r-- 1 root root   1473459356 Jul  3 23:04 supplier.tbl
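
A single dbgen process at this scale may be slow. The help text above lists -C (number of chunks) and -S (which step to build), so the generation could be split into independent runs. The following is a minimal sketch that only builds the command lines and does not execute them. The chunk count of 8 and the helper name dbgen_chunk_commands are assumptions for illustration, not something the original run used.

```python
def dbgen_chunk_commands(scale, chunks):
    """Build one dbgen command line per chunk, using -C (chunk count) and -S (step).

    Steps are assumed to be numbered from 1 to `chunks`.
    """
    return [
        ["./dbgen", "-s", str(scale), "-C", str(chunks), "-S", str(step)]
        for step in range(1, chunks + 1)
    ]


# Scale factor 1024 comes from the article; 8 chunks is an assumed value.
commands = dbgen_chunk_commands(scale=1024, chunks=8)
for cmd in commands:
    print(" ".join(cmd))
```

Each printed line could then be started as a separate process, for example one per disk or per core.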

2. Download the Impala version of the TPC-H scripts

Create the raw lineitem table over the text files, which total 776 GB:

jfp4-1:/mnt/disk1/tpch_2_17_0/dbgen # hdfs dfs -du /shaochen/tpch
25384864295   /shaochen/tpch/customer
833545019752  /shaochen/tpch/lineitem
2224          /shaochen/tpch/nation
185305368911  /shaochen/tpch/orders
25329003396   /shaochen/tpch/part
126691192078  /shaochen/tpch/partsupp
389           /shaochen/tpch/region
1473459356    /shaochen/tpch/supplier
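
The 776 GB figure is the lineitem byte count expressed in binary gigabytes (GiB). A small sketch of the conversion, using the byte counts from the hdfs dfs -du output above; the variable and function names are only illustrative:

```python
GIB = 1024 ** 3  # bytes per binary gigabyte

# Byte counts copied from the `hdfs dfs -du /shaochen/tpch` output above.
sizes = {
    "customer": 25384864295,
    "lineitem": 833545019752,
    "nation": 2224,
    "orders": 185305368911,
    "part": 25329003396,
    "partsupp": 126691192078,
    "region": 389,
    "supplier": 1473459356,
}


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


lineitem_gib = to_gib(sizes["lineitem"])
total_gib = to_gib(sum(sizes.values()))
print(f"lineitem: {lineitem_gib:.2f} GiB")
print(f"total:    {total_gib:.2f} GiB")
```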

 

CREATE EXTERNAL TABLE lineitem (
  L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
  L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE,
  L_RETURNFLAG STRING, L_LINESTATUS STRING,
  L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING,
  L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/shaochen/tpch/lineitem';

Count the rows in the raw text table:

[jfp4-1:21000] > select count(*) from lineitem;
Query: select count(*) from lineitem
+------------+
| count(*)   |
+------------+
| 6144008876 |
+------------+
Returned 1 row(s) in 856.47s

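While the query ran, cluster disk I/O averaged close to 1 GB/s. The raw data is 776 GB and the operation is I/O-bound, so it should finish in about 776 GB / 1 GB/s = 776 s, or roughly 800 s. The measured 856.47 s matches that expectation.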
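
The estimate above is a plain size-over-throughput calculation. A minimal sketch, using only the figures quoted in the text (the 1 GB/s rate is the observed approximate average, not a precise measurement):

```python
def scan_seconds(size_gb, rate_gb_per_s):
    """Time to read `size_gb` at a sustained rate, ignoring all other costs."""
    return size_gb / rate_gb_per_s


table_gb = 776.0      # raw lineitem text data
observed_rate = 1.0   # approximate cluster read rate in GB/s, from the text
measured_s = 856.47   # elapsed time reported by impala-shell

estimate_s = scan_seconds(table_gb, observed_rate)
effective_rate = table_gb / measured_s  # rate implied by the measured time

print(f"estimated: {estimate_s:.0f} s")
print(f"measured:  {measured_s:.2f} s")
print(f"effective read rate: {effective_rate:.3f} GB/s")
```

The effective rate comes out slightly below 1 GB/s, which is consistent with disk I/O "close to" 1 GB/s.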

Save the lineitem table in Parquet format:

[jfp4-1:21000] > insert overwrite lineitem_parquet select * from lineitem;
Query: insert overwrite lineitem_parquet select * from lineitem
Inserted 6144008876 rows in 3780.52s

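This statement converts the data to Parquet files with Snappy compression, so it is a mixed workload (both I/O-bound and CPU-bound). While it ran, the average cluster disk read rate was about 210 MB/s, which gives an estimate of 776 GB / 0.2 GB/s, or roughly 3,800 s. The measured 3780.52 s matches that expectation.
With a write rate of about 140 MB/s, the total volume written is about 3800 s * 0.14 GB/s = 532 GB. Dividing by the replication factor of 3 gives a Parquet size of about 180 GB.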

jfp4-1:/mnt/disk1/tpch_2_17_0/dbgen # hdfs dfs -du -h /user/hive/warehouse/tpch.db
200.9 G  /user/hive/warehouse/tpch.db/lineitem_parquet
546      /user/hive/warehouse/tpch.db/q1_pricing_summary_report

The actual Parquet size is about 200 GB, which is in line with the estimate.
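
The size estimate and its comparison with the actual result can be written out as follows. This is a sketch under the article's own assumptions: a steady 140 MB/s write rate, about 3,800 s of runtime, and an HDFS replication factor of 3. The compression ratio at the end is derived here from the reported sizes and is not a figure stated in the original.

```python
elapsed_s = 3800.0     # rounded runtime used in the article's estimate
write_rate = 0.14      # GB/s, observed average write rate
replication = 3        # HDFS replication factor
text_gb = 776.3        # size of the text table
actual_parquet_gb = 200.9  # from `hdfs dfs -du -h`

written_gb = elapsed_s * write_rate               # physical bytes written, all replicas
estimated_parquet_gb = written_gb / replication   # logical size of one replica
ratio = text_gb / actual_parquet_gb               # text size relative to Parquet size

print(f"written (all replicas): {written_gb:.0f} GB")
print(f"estimated Parquet size: {estimated_parquet_gb:.0f} GB")
print(f"actual Parquet size:    {actual_parquet_gb} GB")
print(f"text / Parquet ratio:   {ratio:.2f}")
```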

Count the rows again, this time on the Parquet table:

[jfp4-1:21000] > select count(*) from lineitem_parquet;
Query: select count(*) from lineitem_parquet
+------------+
| count(*)   |
+------------+
| 6144008876 |
+------------+
Returned 1 row(s) in 18.04s

 

Run Q1 on the text-format table:

[jfp4-1:21000] > INSERT OVERWRITE TABLE q1_pricing_summary_report
               > SELECT
               >   L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int)
               > FROM
               >   lineitem
               > WHERE
               >   L_SHIPDATE<='1998-09-02'
               > GROUP BY L_RETURNFLAG, L_LINESTATUS
               > ORDER BY L_RETURNFLAG, L_LINESTATUS
               > LIMIT 2147483647;
Query: insert OVERWRITE TABLE q1_pricing_summary_report SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int) FROM lineitem WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS LIMIT 2147483647
Inserted 4 rows in 823.57s

 

View the query plan:

[jfp4-1:21000] > explain INSERT OVERWRITE TABLE q1_pricing_summary_report
               > SELECT
               >   L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int)
               > FROM
               >   lineitem
               > WHERE
               >   L_SHIPDATE<='1998-09-02'
               > GROUP BY L_RETURNFLAG, L_LINESTATUS
               > ORDER BY L_RETURNFLAG, L_LINESTATUS
               > LIMIT 2147483647;
Query: explain INSERT OVERWRITE TABLE q1_pricing_summary_report SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int) FROM lineitem WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS LIMIT 2147483647
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain String                                                                                                                                                                                                                                                                               |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=208.13GB VCores=2                                                                                                                                                                                                                                    |
| WARNING: The following tables are missing relevant table and/or column statistics.                                                                                                                                                                                                           |
| tpch.lineitem                                                                                                                                                                                                                                                                                |
|                                                                                                                                                                                                                                                                                              |
| WRITE TO HDFS [tpch.q1_pricing_summary_report, OVERWRITE=true]                                                                                                                                                                                                                               |
| |  partitions=1                                                                                                                                                                                                                                                                              |
| |                                                                                                                                                                                                                                                                                            |
| 06:TOP-N [LIMIT=2147483647]                                                                                                                                                                                                                                                                  |
| |  order by: L_RETURNFLAG ASC, L_LINESTATUS ASC                                                                                                                                                                                                                                              |
| |                                                                                                                                                                                                                                                                                            |
| 05:EXCHANGE [PARTITION=UNPARTITIONED]                                                                                                                                                                                                                                                        |
| |                                                                                                                                                                                                                                                                                            |
| 02:TOP-N [LIMIT=2147483647]                                                                                                                                                                                                                                                                  |
| |  order by: L_RETURNFLAG ASC, L_LINESTATUS ASC                                                                                                                                                                                                                                              |
| |                                                                                                                                                                                                                                                                                            |
| 04:AGGREGATE [MERGE FINALIZE]                                                                                                                                                                                                                                                                |
| |  output: sum(sum(L_QUANTITY)), sum(sum(L_EXTENDEDPRICE)), sum(sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT))), sum(sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT) * (1.0 + L_TAX))), sum(count(L_QUANTITY)), sum(count(L_EXTENDEDPRICE)), sum(sum(L_DISCOUNT)), sum(count(L_DISCOUNT)), sum(count(1)) |
| |  group by: L_RETURNFLAG, L_LINESTATUS                                                                                                                                                                                                                                                      |
| |                                                                                                                                                                                                                                                                                            |
| 03:EXCHANGE [PARTITION=HASH(L_RETURNFLAG,L_LINESTATUS)]                                                                                                                                                                                                                                      |
| |                                                                                                                                                                                                                                                                                            |
| 01:AGGREGATE                                                                                                                                                                                                                                                                                 |
| |  output: sum(L_QUANTITY), sum(L_EXTENDEDPRICE), sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT)), sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT) * (1.0 + L_TAX)), count(L_QUANTITY), count(L_EXTENDEDPRICE), sum(L_DISCOUNT), count(L_DISCOUNT), count(1)                                              |
| |  group by: L_RETURNFLAG, L_LINESTATUS                                                                                                                                                                                                                                                      |
| |                                                                                                                                                                                                                                                                                            |
| 00:SCAN HDFS [tpch.lineitem]                                                                                                                                                                                                                                                                 |
|    partitions=1/1 size=776.30GB                                                                                                                                                                                                                                                              |
|    predicates: L_SHIPDATE <= '1998-09-02'                                                                                                                                                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Returned 28 row(s) in 0.15s
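
The plan aggregates in two phases: 01:AGGREGATE builds partial sums and counts on each host, 03:EXCHANGE repartitions them by the group-by key, and 04:AGGREGATE [MERGE FINALIZE] adds the partial results together. AVG is carried as a separate sum and count until the final step. The following toy sketch mimics that shape in plain Python for SUM, COUNT and AVG of one column. The rows and the two "hosts" are made up, and this is only an illustration of the idea, not Impala's implementation.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Phase 1 (like 01:AGGREGATE): per-host sum and count of quantity per group."""
    acc = defaultdict(lambda: [0.0, 0])
    for returnflag, linestatus, quantity in rows:
        state = acc[(returnflag, linestatus)]
        state[0] += quantity
        state[1] += 1
    return dict(acc)


def merge_finalize(partials):
    """Phase 2 (like 04:AGGREGATE [MERGE FINALIZE]): add partial states, then derive AVG."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (qty_sum, qty_count) in partial.items():
            merged[key][0] += qty_sum
            merged[key][1] += qty_count
    # Sorting the keys stands in for the ORDER BY on the group columns.
    return {
        key: {"sum_qty": s, "count": c, "avg_qty": s / c}
        for key, (s, c) in sorted(merged.items())
    }


# Made-up rows (returnflag, linestatus, quantity) on two hypothetical hosts.
host_a = [("A", "F", 10.0), ("N", "O", 4.0), ("A", "F", 20.0)]
host_b = [("A", "F", 30.0), ("N", "O", 6.0)]

result = merge_finalize([partial_aggregate(host_a), partial_aggregate(host_b)])
for key, values in result.items():
    print(key, values)
```

Because only one small state per group crosses the network, the exchange step is cheap for Q1, which has just four groups.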

Compute the table statistics:

[jfp4-1:21000] > compute stats lineitem;
Query: compute stats lineitem
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 16 column(s). |
+------------------------------------------+
Returned 1 row(s) in 5894.34s

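The result shows how much time compute stats takes. During the run, disk I/O was very high for the first 15 minutes, at around 900 MB/s, with essentially every disk in the cluster reading files at full load. After that, I/O stayed at around 130 MB/s. compute stats is clearly an expensive operation.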
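
A rough look at how much data those two phases imply. This is a back-of-the-envelope sketch that assumes the two quoted rates held steady over their phases; the rates are eyeballed averages from the text, so the volumes are indicative only.

```python
total_s = 5894.34       # elapsed time reported for compute stats
phase1_s = 15 * 60      # first 15 minutes
phase1_rate = 0.9       # GB/s, about 900 MB/s
phase2_rate = 0.13      # GB/s, about 130 MB/s
table_gb = 776.3        # size of the text table

phase1_gb = phase1_s * phase1_rate
phase2_s = total_s - phase1_s
phase2_gb = phase2_s * phase2_rate

print(f"phase 1: {phase1_gb:.0f} GB read in {phase1_s} s")
print(f"phase 2: {phase2_gb:.0f} GB read in {phase2_s:.0f} s")
print(f"phase 1 volume / table size: {phase1_gb / table_gb:.2f}")
```

Under these assumptions the first phase reads about as much data as one full pass over the table, while the second phase accounts for most of the elapsed time at a much lower read rate, which suggests that it is not limited by disk throughput.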

Compute statistics on the Parquet table:

[jfp4-1:21000] > compute stats lineitem_parquet;
Query: compute stats lineitem_parquet
Query aborted.
[jfp4-1:21000] > SET
> NUM_SCANNER_THREADS=2
> ;
NUM_SCANNER_THREADS set to 2
[jfp4-1:21000] > compute stats lineitem_parquet;
Query: compute stats lineitem_parquet
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 16 column(s). |
+------------------------------------------+
Returned 1 row(s) in 5176.29s
[jfp4-1:21000] >

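Note that NUM_SCANNER_THREADS had to be set (here to 2) before compute stats on the Parquet table would succeed; with the default setting the query was aborted.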

 


Original article: http://www.cnblogs.com/littlesuccess/p/3840594.html
