
Spark WordCount: Document Word Frequency Count



1. Input Data

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

2. Implementation Code

package big.data.analyse.wordcount

import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/3/9.
  */
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount")
      .master("local[2]")
      .getOrCreate()
    // Load the data
    val textRDD = spark.sparkContext.textFile("src/big/data/analyse/wordcount/wordcount.txt")
    val result = textRDD.map(row => row.replace(",", "")) // strip commas so "Java," and "Java" count as the same word
      .flatMap(row => row.split(" "))                     // split each line into individual words
      .map(row => (row, 1))                               // map each word to a (word, 1) pair for counting
      .reduceByKey(_ + _)                                 // sum the counts per word
    // Print the result
    result.foreach(println)

    spark.stop()
  }
}
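The code above creates a SparkSession but only uses the RDD API through spark.sparkContext. For comparison, the same word count can also be expressed with the DataFrame/Dataset API. The following is a minimal sketch, not part of the original post; the object name WordCountDF is made up, and the input path is assumed to be the same file as above.

package big.data.analyse.wordcount

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, regexp_replace, split}

/**
  * Illustrative sketch: the same word count with the DataFrame/Dataset API.
  */
object WordCountDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountDF")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Read the text file as a Dataset[String]; each line becomes a row in column "value"
    val lines = spark.read.textFile("src/big/data/analyse/wordcount/wordcount.txt")

    // Strip commas, split on spaces, explode into one word per row, then group and count
    val counts = lines
      .select(explode(split(regexp_replace($"value", ",", ""), " ")).as("word"))
      .groupBy("word")
      .count()
      .orderBy($"count".desc)

    counts.show(50, truncate = false)

    spark.stop()
  }
}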

3. Results

(Spark,3)
(GraphX,1)
(graphs.,1)
(learning,1)
(general-purpose,1)
(Python,1)
(APIs,1)
(provides,1)
(that,1)
(is,1)
(a,2)
(R,1)
(high-level,1)
(general,1)
(processing,2)
(fast,1)
(including,1)
(higher-level,1)
(optimized,1)
(Apache,1)
(in,1)
(SQL,2)
(system.,1)
(Java,1)
(of,1)
(data,1)
(tools,1)
(cluster,1)
(also,1)
(graph,1)
(structured,1)
(execution,1)
(It,2)
(MLlib,1)
(for,3)
(Scala,1)
(an,1)
(computing,1)
(machine,1)
(supports,2)
(and,5)
(engine,1)
(set,1)
(rich,1)
(Streaming.,1)
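
The output above is unordered because reduceByKey makes no ordering guarantee. If a ranked list is wanted, the result RDD from section 2 can be sorted by count before printing; a minimal sketch, where result refers to the variable in the code above:

// Sort the (word, count) pairs by count in descending order, then print on the driver
result.sortBy(_._2, ascending = false)
  .collect()
  .foreach(println)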

 


Original post: https://www.cnblogs.com/yszd/p/10503597.html
