3. Using Oozie to Execute the ETL Automatically on a Schedule

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following: workflow definitions, and currently running workflow instances, including their state and variables.
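As a quick sanity check that the server and its backing database are up, the Oozie CLI can be queried. A minimal sketch, assuming the Oozie server URL http://cdh2:11000/oozie used elsewhere in this article:

# Should print "System mode: NORMAL" when the server is healthy
oozie admin -oozie http://cdh2:11000/oozie -status
# Show the server build version (4.1.0 in CDH 5.7.0)
oozie admin -oozie http://cdh2:11000/oozie -version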
(4) Oozie in CDH 5.7.0
In CDH 5.7.0, Oozie is at version 4.1.0 and uses MySQL as its metadata store. For the Oozie-related properties in CDH 5.7.0, refer to the Cloudera documentation. Make sure YARN is configured with enough container memory for the workflow's launcher and MapReduce tasks, for example:

yarn.nodemanager.resource.memory-mb = 2000
yarn.scheduler.maximum-allocation-mb = 2000

Otherwise, workflow jobs will fail at run time with memory-related errors.
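If jobs fail at launch, it helps to confirm what the cluster is actually running with. A minimal sketch, assuming the standard CDH client configuration directory /etc/hadoop/conf (adjust the path for your cluster):

# Print the configured NodeManager memory and the per-container ceiling
grep -A 1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml
grep -A 1 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml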
For Oozie to run the incremental load, the saved Sqoop job must live in a shared metastore rather than the default private one. The concrete steps are as follows. First, start the Sqoop metastore service (on cdh2, matching the meta-connect URL used below):
sqoop metastore > /tmp/sqoop_metastore.log 2>&1 &

For background on the problem of Oozie being unable to run Sqoop jobs, see the following link: http://www.lamborryan.com/oozie-sqoop-fail/
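To confirm the metastore is reachable before going further, the jobs it stores can be listed. A minimal sketch, using the same connect string that appears below (16000 is Sqoop's default metastore port):

# List saved jobs held in the shared metastore on cdh2
sqoop job --list --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop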
Then recreate the incremental import job in the shared metastore:

sqoop job --show myjob_incremental_import | grep incremental.last.value
sqoop job --delete myjob_incremental_import
sqoop job --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop --create myjob_incremental_import -- import --connect "jdbc:mysql://cdh1:3306/source?useSSL=false&user=root&password=mypassword" --table sales_order --columns "order_number, customer_number, product_code, order_date, entry_date, order_amount" --hive-import --hive-table rds.sales_order --incremental append --check-column order_number --last-value 116

Here --last-value is the checkpoint left by the previous ETL run; the first command above displays that value.
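The recreated job can be exercised by hand before it is wired into the workflow; this is the same command the sqoop-sales_order action below issues:

# Run the saved incremental import once against the shared metastore
sqoop job --exec myjob_incremental_import --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop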
Next, create the workflow definition file workflow.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.1" name="regular_etl">
    <start to="fork-node"/>
    <!-- Run the three Sqoop extractions in parallel -->
    <fork name="fork-node">
        <path start="sqoop-customer" />
        <path start="sqoop-product" />
        <path start="sqoop-sales_order" />
    </fork>
    <!-- Full-refresh import of the customer table into Hive -->
    <action name="sqoop-customer">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>root</arg>
            <arg>--password</arg>
            <arg>mypassword</arg>
            <arg>--table</arg>
            <arg>customer</arg>
            <arg>--hive-import</arg>
            <arg>--hive-table</arg>
            <arg>rds.customer</arg>
            <arg>--hive-overwrite</arg>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <!-- Full-refresh import of the product table into Hive -->
    <action name="sqoop-product">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>root</arg>
            <arg>--password</arg>
            <arg>mypassword</arg>
            <arg>--table</arg>
            <arg>product</arg>
            <arg>--hive-import</arg>
            <arg>--hive-table</arg>
            <arg>rds.product</arg>
            <arg>--hive-overwrite</arg>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <!-- Incremental append import of sales_order, run as the saved Sqoop job
         stored in the shared metastore on cdh2 -->
    <action name="sqoop-sales_order">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>job --exec myjob_incremental_import --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop</command>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <!-- Wait for all three imports, then run the Hive transformation script -->
    <join name="joining" to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/tmp/hive-site.xml</job-xml>
            <script>/tmp/regular_etl.sql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Its DAG is shown in the figure below: the three Sqoop actions run in parallel between the fork and join nodes, followed by the Hive action.

Deploy the workflow and its supporting files to HDFS:

hdfs dfs -put -f workflow.xml /user/root/
hdfs dfs -put /etc/hive/conf.cloudera.hive/hive-site.xml /tmp/
hdfs dfs -put /root/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /tmp/
hdfs dfs -put /root/regular_etl.sql /tmp/
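To catch schema mistakes before submitting, the local copy of workflow.xml can be checked with the Oozie CLI. A minimal sketch (the validate subcommand checks a local workflow file against the Oozie schema):

# Validate the workflow definition against the Oozie workflow schema
oozie validate workflow.xml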
(7) Create the job properties file

Save the following as /root/job.properties:

nameNode=hdfs://cdh2:8020
jobTracker=cdh2:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}

(8) Run the workflow

oozie job -oozie http://cdh2:11000/oozie -config /root/job.properties -run

The running job can now be seen in the Oozie Web Console, as shown in the figure below.
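The same information is available from the command line. A minimal sketch; <workflow-job-id> is a placeholder for the job ID that the -run command prints:

# List recent workflow jobs, then inspect a specific one
oozie jobs -oozie http://cdh2:11000/oozie -jobtype wf
oozie job -oozie http://cdh2:11000/oozie -info <workflow-job-id>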
To execute this workflow automatically every day, define an Oozie coordinator job on top of it.

(1) Create the coordinator job properties file

Save the following as /root/job-coord.properties. Note that Oozie coordinator times are expressed in UTC; start=2016-07-11T06:00Z corresponds to 14:00 Beijing time (UTC+8).

nameNode=hdfs://cdh2:8020
jobTracker=cdh2:8032
queueName=default
oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/user/${user.name}
timezone=UTC
start=2016-07-11T06:00Z
end=2020-12-31T07:15Z
workflowAppUri=${nameNode}/user/${user.name}

(2) Create the coordinator job configuration file

Save the following as coordinator.xml:

<coordinator-app name="regular_etl-coord" frequency="${coord:days(1)}"
                 start="${start}" end="${end}" timezone="${timezone}"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

(3) Deploy the coordinator job

hdfs dfs -put -f coordinator.xml /user/root/

(4) Run the coordinator job
oozie job -oozie http://cdh2:11000/oozie -config /root/job-coord.properties -run

The Oozie Web Console now shows the coordinator job in PREP status, waiting for its start time, as shown in the figure below. Once the start time arrives, the coordinator materializes one run of the workflow per day.
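Coordinator progress can likewise be followed from the command line. A minimal sketch; <coordinator-job-id> is a placeholder for the ID printed by the -run command:

# Show the coordinator job and the daily workflow actions it materializes
oozie job -oozie http://cdh2:11000/oozie -info <coordinator-job-id>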
Original article: http://blog.csdn.net/wzy0623/article/details/51880687