升级hive0.13之后发现job运行完成后Rows loaded的信息没有了。
rows loaded的信息在hive0.11中由HiveHistory类的printRowCount输出。HiveHistory类的主要用途是记录job运行的信息,包括task的counter等。默认的目录在/tmp/$user中。
hive0.11在SessionState 的start方法中会初始化HiveHistory的对象
if (startSs. hiveHist == null) {
startSs. hiveHist = new HiveHistory(startSs);
}而在hive0.13中HiveHistory是一个抽象类,其具体的实现在HiveHistoryImpl类中,其中初始化HiveHistoryImpl对象时增加了一层判断,判断hive.session.history.enabled的设置(默认为false),导致不会实例化HiveHistoryImpl类
if(startSs.hiveHist == null){
if (startSs.getConf().getBoolVar(HiveConf.ConfVars.HIVE_SESSION_HISTORY_ENABLED)) {
startSs.hiveHist = new HiveHistoryImpl (startSs);
}else {
//Hive history is disabled, create a no-op proxy
startSs.hiveHist = HiveHistoryProxyHandler .getNoOpHiveHistoryProxy();
}
}在fix这个配置之后,仍然没有发现rows loaded的信息,通过分析源码
printRowCount方法的实现如下:
public void printRowCount(String queryId) {
QueryInfo ji = queryInfoMap.get(queryId);
if (ji == null) { // 如果ji为空,则直接返回
return;
}
for (String tab : ji. rowCountMap.keySet()) {
console.printInfo(ji. rowCountMap.get(tab) + " Rows loaded to " + tab); // 从hashmap中获取数据
}
}在hive0.13中,这里获取的ji对象是空值。
近一步发现,是由于counter中没有TABLE_ID_(\\d+)_ROWCOUNT,导致不能匹配ROW_COUNT_PATTERN的正则。就不能正常获取的row count的值。
其中获取tasker count的rows loaded信息的getRowCountTableName方法内容如下:
private static final String ROW_COUNT_PATTERN = "TABLE_ID_(\\d+)_ROWCOUNT";
private static final Pattern rowCountPattern = Pattern.compile(ROW_COUNT_PATTERN);
......
String getRowCountTableName(String name) {
if (idToTableMap == null) {
return null;
}
Matcher m = rowCountPattern.matcher(name);
if (m.find()) { // //没有和TABLE_ID_xxxx match的counter导致,即counter没有打印出TABLE_ID_(\\d+)_ROWCOUNT导致。。
String tuple = m.group(1);
return idToTableMap.get(tuple);
}
return null;
}而TABLE_ID_(\\d+)_ROWCOUNT是由FileSinkOperator类负责写入的。hive0.11中相关的代码如下:
protected void initializeOp(Configuration hconf) throws HiveException {
..........
int id = conf.getDestTableId();
if ((id != 0) && (id <= TableIdEnum. values().length)) {
String enumName = "TABLE_ID_" + String.valueOf(id) + "_ROWCOUNT";
tabIdEnum = TableIdEnum.valueOf(enumName);
row_count = new LongWritable();
statsMap.put( tabIdEnum, row_count );
}而在hive0.13中这部分代码都被去掉了,找到了原因,fix也比较简单,把这个counter加回去就可了。
patch如下:
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
index 1dde78e..96860f7 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
@@ -68,13 +68,16 @@
import org.apache.hadoop.util.ReflectionUtils;
import com.google.common.collect.Lists;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
/**
* File Sink operator implementation.
**/
public class FileSinkOperator extends TerminalOperator<FileSinkDesc> implements
Serializable {
-
+ public static Log LOG = LogFactory.getLog("FileSinkOperator.class");
protected transient HashMap<String, FSPaths> valToPaths;
protected transient int numDynParts;
protected transient List<String> dpColNames;
@@ -214,6 +217,7 @@ public Stat getStat() {
protected transient FileSystem fs;
protected transient Serializer serializer;
protected transient LongWritable row_count;
+ protected transient TableIdEnum tabIdEnum = null;
private transient boolean isNativeTable = true;
/**
@@ -241,6 +245,23 @@ public Stat getStat() {
protected transient JobConf jc;
Class<? extends Writable> outputClass;
String taskId;
+ public static enum TableIdEnum {
+ TABLE_ID_1_ROWCOUNT,
+ TABLE_ID_2_ROWCOUNT,
+ TABLE_ID_3_ROWCOUNT,
+ TABLE_ID_4_ROWCOUNT,
+ TABLE_ID_5_ROWCOUNT,
+ TABLE_ID_6_ROWCOUNT,
+ TABLE_ID_7_ROWCOUNT,
+ TABLE_ID_8_ROWCOUNT,
+ TABLE_ID_9_ROWCOUNT,
+ TABLE_ID_10_ROWCOUNT,
+ TABLE_ID_11_ROWCOUNT,
+ TABLE_ID_12_ROWCOUNT,
+ TABLE_ID_13_ROWCOUNT,
+ TABLE_ID_14_ROWCOUNT,
+ TABLE_ID_15_ROWCOUNT;
+ }
protected boolean filesCreated = false;
@@ -317,7 +338,15 @@ protected void initializeOp(Configuration hconf) throws HiveException {
prtner = (HivePartitioner<HiveKey, Object>) ReflectionUtils.newInstance(
jc.getPartitionerClass(), null);
}
- row_count = new LongWritable();
+ //row_count = new LongWritable();
+ int id = conf.getDestTableId();
+ if ((id != 0) && (id <= TableIdEnum.values().length)) {
+ String enumName = "TABLE_ID_" + String.valueOf(id) + "_ROWCOUNT";
+ tabIdEnum = TableIdEnum.valueOf(enumName);
+ row_count = new LongWritable();
+ statsMap.put(tabIdEnum, row_count);
+ }
+
if (dpCtx != null) {
dpSetup();
}打完patch后,重新打包,替换线上的hive-exec-xxx.jar包之后测试,rows loaded的数据又回来了。
本文出自 “菜光光的博客” 博客,请务必保留此出处http://caiguangguang.blog.51cto.com/1652935/1528516
hive0.13 rows loaded为空问题源码分析及fix
原文地址:http://caiguangguang.blog.51cto.com/1652935/1528516