
Airbnb's open-source reAir tool: usage and source code walkthrough (Part 1)

Date: 2018-11-06 20:55:50


reAir supports both batch replication and incremental replication. This post looks at batch replication first.

How to run batch replication:

cd reair
./gradlew shadowjar -p main -x test
# A table list on the local disk also needs the file:/// prefix. If you pass --table-list ~/reair/local_table_list directly, the file must be on HDFS!
hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hive.MetastoreReplicationJob --config-files my_config_file.xml --table-list file:///reair/local_table_list
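
The `file:///` prefix matters because Hadoop resolves a path without a scheme against the cluster's default filesystem, which is normally HDFS. As a small sketch (the helper name `local_table_list_uri` is hypothetical and not part of reAir), the value for `--table-list` can be built from a local path like this:

```python
from pathlib import Path


def local_table_list_uri(path: str) -> str:
    """Return a file:// URI for a table list stored on the local disk.

    Hypothetical helper for illustration only; reAir just receives the
    resulting string through --table-list.
    """
    # expanduser() resolves "~"; resolve() makes the path absolute,
    # which as_uri() requires.
    return Path(path).expanduser().resolve().as_uri()


if __name__ == "__main__":
    print(local_table_list_uri("~/reair/local_table_list"))
```

The printed URI can be pasted after `--table-list` in the command above.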

1. Contents of table_list

db_name1.table_name1
db_name1.table_name2
db_name2.table_name3
...
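
A malformed entry in the table list is easy to miss until the job is already running. The sketch below checks the file up front, assuming only the one-entry-per-line `db_name.table_name` layout shown above; `parse_table_list` is a hypothetical helper, and reAir's own parsing may differ.

```python
from typing import Iterable, List, Tuple


def parse_table_list(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split 'db.table' lines into (db, table) pairs, skipping blank lines."""
    tables = []
    for line_no, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry:
            continue
        # Split on the first dot only: the left side is the database name.
        db, sep, table = entry.partition(".")
        if not sep or not db or not table:
            raise ValueError(
                f"line {line_no}: expected db_name.table_name, got {entry!r}"
            )
        tables.append((db, table))
    return tables


if __name__ == "__main__":
    sample = ["db_name1.table_name1", "db_name1.table_name2", "db_name2.table_name3"]
    for db, table in parse_table_list(sample):
        print(db, table)
```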
2. my_config_file.xml configuration
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>airbnb.reair.clusters.src.name</name>
    <value>ns8</value>
    <comment>
      Name of the source cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://192.168.200.140:9083</value>
    <comment>Source metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs://ns8/</value>
    <comment>Source cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.tmp</name>
    <value>hdfs://ns8/tmp</value>
    <comment>
      Directory for temporary files on the source cluster.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.name</name>
    <value>ns1</value>
    <comment>
      Name of the destination cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://dev04:9083</value>
    <comment>Destination metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs://ns1/</value>
    <comment>Destination cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
    <value>hdfs://ns1/tmp</value>
    <comment>
      Directory for temporary files on the destination cluster. Table / partition
      data is copied to this location before it is moved to the final location,
      so it should be on the same filesystem as the final location.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.output.dir</name>
    <value>/tmp/replica</value>
    <comment>
      This configuration must be provided. It gives location to store each stage
      MR job output.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
    <comment>
      Comma separated regex blacklist. dbname_regex:tablename_regex,...
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.metastore.parallelism</name>
    <value>5</value>
    <comment>
      The parallelism to use for jobs requiring metastore calls. This translates to the number of
      mappers or reducers in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.copy.parallelism</name>
    <value>5</value>
    <comment>
      The parallelism to use for jobs that copy files. This translates to the number of reducers
      in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.overwrite.newer</name>
    <value>true</value>
    <comment>
      Whether the batch job will overwrite newer tables/partitions on the destination. Default is true.
    </comment>
  </property>
 
  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

</configuration>
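
To make the `dbname_regex:tablename_regex` blacklist format concrete, here is a rough Python sketch that reads the properties from a trimmed copy of the config and tests a few table names against the blacklist. The helper names are hypothetical, and the sketch assumes each regex must match the whole database or table name; reAir's actual matching logic is in its Java source and may behave differently.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the config above, kept inline so the sketch is self-contained.
SAMPLE = """<configuration>
  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs://ns8/</value>
  </property>
  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
  </property>
</configuration>"""


def load_properties(xml_text: str) -> dict:
    """Collect <name>/<value> pairs from a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def parse_blacklist(value: str):
    """Turn 'db_regex:table_regex,...' into compiled pattern pairs."""
    pairs = []
    for entry in value.split(","):
        db_re, table_re = entry.strip().split(":", 1)
        pairs.append((re.compile(db_re), re.compile(table_re)))
    return pairs


def is_blacklisted(pairs, db: str, table: str) -> bool:
    # Assumption: both regexes must match the entire name (full match).
    return any(d.fullmatch(db) and t.fullmatch(table) for d, t in pairs)


if __name__ == "__main__":
    props = load_properties(SAMPLE)
    pairs = parse_blacklist(props["airbnb.reair.clusters.batch.metastore.blacklist"])
    for db, table in [("testdb", "test_orders"), ("tmp_etl", "anything"), ("proddb", "orders")]:
        print(db, table, is_blacklisted(pairs, db, table))
```

Under that full-match assumption, `testdb:test.*` excludes tables in `testdb` whose names start with `test`, and `tmp_.*:.*` excludes every table in any database whose name starts with `tmp_`.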

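I. Batch replication

Batch replication runs in three stages:

1. Read the user-supplied table-list (and fetch the partitions of each listed table from the source metastore), shuffle the entries to the reducers, and have each reducer read the metastore and cluster information, work out the file-copy mapping, and write it to HDFS.
2. Iterate over the task list produced by the first MR job, shuffle the entries to different reducers by path, and perform the copy.
3. Commit the changes to the Hive metastore. This stage uses mappers only.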
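
The three stages can be pictured with a toy, single-process simulation. This is only an illustration of the data flow described above, not reAir's code: the in-memory `SRC_METASTORE`, the paths, the function names, and the use of CRC32 for bucketing are all assumptions made for the sketch.

```python
import zlib
from collections import defaultdict

# Hypothetical stand-in for the source metastore:
# table -> list of (partition name, data directory).
SRC_METASTORE = {
    "db_name1.table_name1": [
        ("ds=2018-11-05", "/warehouse/db_name1/table_name1/ds=2018-11-05"),
        ("ds=2018-11-06", "/warehouse/db_name1/table_name1/ds=2018-11-06"),
    ],
    "db_name1.table_name2": [("", "/warehouse/db_name1/table_name2")],
}


def stage1_plan(table_list):
    """Stage 1: expand each listed table into per-partition copy tasks."""
    tasks = []
    for table in table_list:
        for partition, path in SRC_METASTORE[table]:
            tasks.append({"table": table, "partition": partition, "src_path": path})
    return tasks


def stage2_copy(tasks, num_reducers):
    """Stage 2: bucket tasks by path so each reducer handles its own share."""
    buckets = defaultdict(list)
    for task in tasks:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        reducer = zlib.crc32(task["src_path"].encode("utf-8")) % num_reducers
        buckets[reducer].append(task)
    copied = []
    for reducer in sorted(buckets):
        for task in buckets[reducer]:
            copied.append(task)  # a real job would copy the files here
    return copied


def stage3_commit(copied):
    """Stage 3: map-only step that records metadata on the destination."""
    dest_metastore = defaultdict(list)
    for task in copied:
        dest_metastore[task["table"]].append(task["partition"])
    return dict(dest_metastore)


if __name__ == "__main__":
    plan = stage1_plan(["db_name1.table_name1", "db_name1.table_name2"])
    done = stage2_copy(plan, num_reducers=5)
    print(stage3_commit(done))
```

The `num_reducers` argument plays the role that `airbnb.reair.batch.copy.parallelism` plays in the config: it bounds how many buckets the copy work is spread across.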

Official diagram: [image not available]


Original article: https://www.cnblogs.com/jiangxiaoxian/p/9917790.html
