I've recently been digging into Spark's classification algorithms because our team does log-text classification. I couldn't find a suitable demo on the official site or anywhere else, and after more than a month of experimenting I finally got something working.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)

  // Load documents (one per line); the first token is the 0/1 label,
  // the remaining tokens are the document's terms.
  val lines = sc.textFile("/XXX/sample_libsvm_data.txt").map(_.split(" ").toSeq)

  // Hash each document's terms into a term-frequency vector and pair it
  // with its label, producing the RDD[LabeledPoint] the trainer expects.
  val hashingTF = new HashingTF()
  val data = lines.map { tokens =>
    LabeledPoint(tokens.head.toDouble, hashingTF.transform(tokens.tail))
  }

  // Hold out 10% of the documents for testing.
  val splits = data.randomSplit(Array(0.9, 0.1))
  val (trainingData, testData) = (splits(0), splits(1))

  // Train a DecisionTree model.
  // An empty categoricalFeaturesInfo indicates all features are continuous.
  val numClasses = 2
  val categoricalFeaturesInfo = Map[Int, Int]()
  val impurity = "gini"
  val maxDepth = 5
  val maxBins = 32

  val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
    impurity, maxDepth, maxBins)

  // Classify a new document: split it into terms and hash them with the
  // same HashingTF so the feature indices line up with training.
  val testDoc = "y z a c".split(" ").toSeq
  val prediction = model.predict(hashingTF.transform(testDoc))
  println("-----------------------------------------")
  println(prediction)
  println("-----------------------------------------")
}
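The split above already holds out 10% of the documents as testData, so measuring the test error is a natural next step. Here is a minimal sketch following the standard MLlib pattern (labelAndPreds is a name introduced here for illustration; it is not in the original code):

// Sketch: measure the error rate on the held-out split.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val testErr = labelAndPreds.filter { case (l, p) => l != p }.count().toDouble / testData.count()
println(s"Test Error = $testErr")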
Sample data (one document per line; the first token is the 0/1 label, the rest are terms):
0 a b c d
1 e f g h
1 i j k l
1 m n o p
1 r s t u
0 q v w x
1 y z a b
1 c d e f
0 a b y z
The classification algorithms only accept features of type Double, so the core problem is really just converting strings into vectors of Doubles. Spark 1.3.0 ships HashingTF for exactly that conversion, and once you use it the program becomes very simple.
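As a minimal sketch of what HashingTF does (using a deliberately small numFeatures so the output is readable; 1000 is an arbitrary choice for this illustration, the default is much larger):

import org.apache.spark.mllib.feature.HashingTF

// Each term is hashed to an index modulo numFeatures and its occurrences
// are counted, so an arbitrary bag of strings becomes a Vector of Doubles.
val tf = new HashingTF(1000)
val vec = tf.transform(Seq("a", "b", "a", "c"))
println(vec) // sparse vector: the slot for "a" holds 2.0, "b" and "c" hold 1.0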
Original post: http://www.cnblogs.com/qq27271609/p/4685297.html