I've recently been digging into Spark's classification algorithms because our team does log-text classification. I couldn't find a suitable demo on the official site or anywhere else, and after more than a month of experimenting I finally got something working.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)

  // Load documents (one per line); the first token is the 0/1 label,
  // the remaining tokens are the document's terms.
  val lines = sc.textFile("/XXX/sample_libsvm_data.txt").map(_.split(" ").toSeq)

  // Hash each document's terms into a term-frequency vector and pair it
  // with its label, producing the RDD[LabeledPoint] the trainer expects.
  val hashingTF = new HashingTF()
  val data = lines.map { tokens =>
    LabeledPoint(tokens.head.toDouble, hashingTF.transform(tokens.tail))
  }

  // Hold out 10% of the documents for testing.
  val splits = data.randomSplit(Array(0.9, 0.1))
  val (trainingData, testData) = (splits(0), splits(1))

  // Train a DecisionTree model.
  // An empty categoricalFeaturesInfo indicates all features are continuous.
  val numClasses = 2
  val categoricalFeaturesInfo = Map[Int, Int]()
  val impurity = "gini"
  val maxDepth = 5
  val maxBins = 32

  val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
    impurity, maxDepth, maxBins)

  // Classify a new document: split it into terms and hash them with the
  // same HashingTF so the feature indices line up with training.
  val testDoc = "y z a c".split(" ").toSeq
  val prediction = model.predict(hashingTF.transform(testDoc))
  println("-----------------------------------------")
  println(prediction)
  println("-----------------------------------------")
}
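The split above already holds out 10% of the documents as testData, so measuring the test error is a natural next step. Here is a minimal sketch following the standard MLlib pattern (labelAndPreds is a name introduced here for illustration; it is not in the original code):

// Sketch: measure the error rate on the held-out split.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val testErr = labelAndPreds.filter { case (l, p) => l != p }.count().toDouble / testData.count()
println(s"Test Error = $testErr")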
Sample data (one document per line; the first token is the 0/1 label, the rest are terms):
0 a b c d
1 e f g h
1 i j k l
1 m n o p
1 r s t u
0 q v w x
1 y z a b
1 c d e f
0 a b y z
The classification algorithms only accept features of type Double, so the core problem is really just converting strings into vectors of Doubles. Spark 1.3.0 ships HashingTF for exactly that conversion, and once you use it the program becomes very simple.
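As a minimal sketch of what HashingTF does (using a deliberately small numFeatures so the output is readable; 1000 is an arbitrary choice for this illustration, the default is much larger):

import org.apache.spark.mllib.feature.HashingTF

// Each term is hashed to an index modulo numFeatures and its occurrences
// are counted, so an arbitrary bag of strings becomes a Vector of Doubles.
val tf = new HashingTF(1000)
val vec = tf.transform(Seq("a", "b", "a", "c"))
println(vec) // sparse vector: the slot for "a" holds 2.0, "b" and "c" hold 1.0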
Original post: http://www.cnblogs.com/qq27271609/p/4685297.html