Spark GraphX Study Notes

Date: 2016-09-17 02:04:19

Overview

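GraphX is Spark's API for graphs (e.g. web graphs and social networks) and graph-parallel computation (e.g. PageRank and collaborative filtering). It can be viewed as a reimplementation and optimization of GraphLab (C++) and Pregel (C++) on top of Spark (Scala). Compared with other distributed graph-processing frameworks, GraphX's main contribution is a one-stop data solution on top of Spark that makes it convenient and efficient to run an entire graph-processing pipeline.
GraphX is an important component of the Spark ecosystem that combines the advantages of graph-parallel and data-parallel processing. In the pure computation stage its performance falls short of frameworks such as GraphLab, but across the whole graph-processing pipeline (graph construction, graph merging, and querying the final results) its performance is very competitive.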

Application scenarios for graph computing

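"Graph computing" is an abstraction, grounded in graph theory, that represents real-world structures as graphs, together with the computation model that operates on that data structure. The basic data structure is usually written G = (V, E, D), where V = vertices (nodes), E = edges, and D = data (weights).
Graphs express the relationships between data well, so problems in many applications can be abstracted as graphs and solved with graph-theoretic ideas or graph-based models.
Some application scenarios for graph computing:
PageRank: letting links "vote"
A distributed implementation of the FastUnfolding community detection algorithm on GraphX
http://bbs.pinggu.org/thread-3614747-1-1.html
Measuring relationship strength by triangle counting
Propagating user attributes by random walks
Applications at Taobao
Degree distribution, two-hop neighbor counts, connected components, multi-graph merging, energy propagation models
Every relationship can be viewed and processed as a graph, but how valuable is a given relationship? Is it healthy? Which scenarios does it suit?
A First Cut: Spark GraphX in Practice at Taobao
http://www.csdn.net/article/2014-08-07/2821097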

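Building graphs in Spark and basic graph operations

Building a simple property graph from vertex and edge RDDs

In the example graph, each vertex carries a user's name and occupation, and each labeled edge describes the relationship between two users.
[Figure: the example property graph]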

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object myGraphX {

  def main(args: Array[String]): Unit = {

    // Create the context
    val sparkConf = new SparkConf().setAppName("myGraphPractice").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // Vertex RDD: vertex ID plus the vertex attribute (name, occupation)
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    // Edge RDD: source -> destination, plus the edge attribute (the label)
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    // Default user, in case a relationship refers to a missing user
    val defaultUser = ("John Doe", "Missing")

    // Build a Graph from the RDDs (other data sources and construction methods are covered later)
    val graph = Graph(users, relationships, defaultUser)
  }
}
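
Once the graph exists, the vertex and triplet views can be queried like ordinary RDDs. The following is a minimal sketch that rebuilds the same four-user graph and asks two questions of it. The object and helper names (QueryPropertyGraph, build, countPostdocs, describe) are illustrative assumptions, and the code assumes Spark with GraphX is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of GraphX.
object QueryPropertyGraph {
  // Rebuilds the four-user property graph from the listing above.
  def build(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // Number of vertices whose occupation is "postdoc".
  def countPostdocs(g: Graph[(String, String), String]): Long =
    g.vertices.filter { case (_, (_, pos)) => pos == "postdoc" }.count()

  // One sentence per edge, built from the triplet view (source, edge, destination).
  def describe(g: Graph[(String, String), String]): Array[String] =
    g.triplets.map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}").collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueryPropertyGraph").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val g = build(sc)
    println(countPostdocs(g))
    describe(g).foreach(println)
    sc.stop()
  }
}
```

The triplet view is what makes the second query a one-liner: each triplet already carries both endpoint attributes next to the edge label, so no explicit join is needed.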

Building a graph from files

(1) Read the files, build the vertex and edge RDDs, then build the property graph

// Read the data files
val articles: RDD[String] = sc.textFile("E:/data/graphx/graphx-wiki-vertices.txt")
val links: RDD[String] = sc.textFile("E:/data/graphx/graphx-wiki-edges.txt")

// Parse the vertices and edges
val vertices = articles.map { line =>
    val fields = line.split('\t')
      (fields(0).toLong, fields(1))
    }

val edges = links.map { line =>
    val fields = line.split('\t')
      Edge(fields(0).toLong, fields(1).toLong, 0)
    }
// Build the graph
val graph = Graph(vertices, edges, "").persist()

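(2) Build the basic structure with GraphLoader.edgeListFile, then join the attributes
(a) First build the basic structure of the graph:
Besides building a graph from RDDs, you can use GraphLoader.edgeListFile to build the basic structure (all vertices and edges) from an edge-list file;
the vertex and edge attributes all default to 1.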

object GraphLoader {
  def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      minEdgePartitions: Int = 1)
    : Graph[Int, Int]
}

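// Usage:
val graph = GraphLoader.edgeListFile(sc, "/data/graphx/followers.txt")
The file format is as follows (each line lists the first vertex ID followed by the second vertex ID):
2 1
4 1
1 2

(b) Then read the attribute file into an RDD and join it with the structural graph from (a) to obtain the complete property graph.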
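
Step (b) can be sketched with outerJoinVertices. The example below is an illustration under assumptions: it uses Graph.fromEdgeTuples as a file-free stand-in for GraphLoader.edgeListFile (both give Graph[Int, Int] with attributes of 1), the attribute data is hard-coded rather than parsed from a file, and the names JoinAttributes, withNames and the "unknown" default are hypothetical. It assumes Spark with GraphX is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of GraphX.
object JoinAttributes {
  // Replaces the default Int vertex attribute with a name;
  // vertices that have no entry in `names` get "unknown".
  def withNames(structure: Graph[Int, Int], names: RDD[(VertexId, String)]): Graph[String, Int] =
    structure.outerJoinVertices(names) { (id, oldAttr, nameOpt) => nameOpt.getOrElse("unknown") }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinAttributes").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Stand-in for GraphLoader.edgeListFile: the three sample edges, all attributes set to 1.
    val structure: Graph[Int, Int] =
      Graph.fromEdgeTuples(sc.parallelize(Seq((2L, 1L), (4L, 1L), (1L, 2L))), 1)
    // Illustrative attribute data; in practice this RDD would be parsed from the attribute file.
    val names: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "BarackObama"), (2L, "ladygaga")))
    withNames(structure, names).vertices.collect().foreach(println)
    sc.stop()
  }
}
```

outerJoinVertices is used rather than joinVertices because it may change the vertex attribute type (Int to String here) and it forces an explicit decision for vertices that have no matching attribute row.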

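Three views and their operations

A graph in Spark can be accessed through three views: graph.vertices, graph.edges, and graph.triplets.

In Scala, case clauses provide concise yet powerful pattern matching.

// Assume the vertex attribute is (String, Int), i.e. (name, age), and each edge carries an Int weight
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
Pattern matching with case makes it easy to access the IDs and attributes of vertices and edges:
graph.vertices.map{
      case (id,(name,age))=> // destructure with case
        (age,name) // any transformation can go here
    }

graph.edges.map{
      case Edge(srcid,dstid,weight)=> // destructure with case
        (dstid,weight*0.01) // any transformation can go here
    }

Fields can also be accessed by position:

graph.vertices.map{
      v=>(v._1,v._2._1,v._2._2) // v._1, v._2._1, v._2._2 are the ID, name, and age
}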

graph.edges.map {
      e=>(e.attr,e.srcId,e.dstId)
}

graph.triplets.map{
      triplet=>(triplet.srcAttr._1,triplet.dstAttr._2,triplet.srcId,triplet.dstId)
    }

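// Instead of extracting the vertices with graph.vertices and then mapping, you can map the vertices directly with graph.mapVertices, which returns a new Graph with the same structure. Attributes are accessed in exactly the same way as above:

graph.mapVertices{
      case (id,(name,age))=> // destructure with case
        (age,name) // any transformation can go here
}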

graph.mapEdges(e=>(e.attr,e.srcId,e.dstId))

graph.mapTriplets(triplet=>(triplet.srcAttr._1))

Summary of graph functions in Spark GraphX

/** Summary of the functionality in the property graph */
class Graph[VD, ED] {
  // Information about the Graph (basic statistics) ===================================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]

  // Views of the graph as collections (the three views) =============================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]

  // Functions for caching graphs ==================================================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = true): Graph[VD, ED]
  // Change the partitioning heuristic  ============================================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]

  // Transform vertex and edge attributes (basic transformations) ====================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]

  // Modify the graph structure (only four basic operators; subgraph is the most important) ==========
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]

  // Join RDDs with the graph (two join operators that cover most needs) =============================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]

  // Aggregate information about adjacent triplets ===================================================
  // collectNeighborIds and collectNeighbors are inefficient; prefer aggregateMessages,
  // which is one of the most important operators in GraphX.
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]

  // Iterative graph-parallel computation ==========================================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]

  // Basic graph algorithms (currently four APIs in three categories) ================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}

Structural operators

As of Spark 2.0 there are only four basic structural operators; more are expected to be added in the future.

class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}

Subgraph

A subgraph is a basic concept in graph theory: a graph whose vertex set and edge set are subsets of the vertex set and edge set of another graph.
The subgraph operator extracts a subgraph by keeping the edge triplets that satisfy epred and/or the vertices that satisfy vpred. It restricts the graph to the vertices and edges of interest, for example to remove broken links.
The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate. The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or eliminate broken links. The following code filters an example graph by age and by edge weight:

// Assume graph has the following vertices and edges: each vertex is (id, (name, age)) and each edge carries an Int weight (attribute)
(4,(David,42))(6,(Fran,50))(2,(Bob,27)) (1,(Alice,28))(3,(Charlie,65))(5,(Ed,55))
Edge(2,1,7)Edge(2,4,2)Edge(3,2,4)Edge(3,6,3) Edge(4,1,1)Edge(5,2,2)Edge(5,3,8)Edge(5,6,3)

// Three ways to extract a subgraph that satisfies a condition
// Method 1: filter on vertices
val subGraph1=graph.subgraph(vpred=(id,attr)=>attr._2>30)
// vpred=(id,attr)=>attr._2>30 keeps the vertices whose second attribute (age) is greater than 30
subGraph1.vertices.foreach(print)
println
subGraph1.edges.foreach {print}
println
Output:
Vertices: (4,(David,42))(6,(Fran,50))(3,(Charlie,65))(5,(Ed,55))
Edges: Edge(3,6,3)Edge(5,3,8)Edge(5,6,3)

// Method 2: filter on EdgeTriplets
val subGraph2=graph.subgraph(epred=>epred.attr>2)
// keeps the edges whose attribute (weight) is greater than 2
Output:
Vertices: (4,(David,42))(6,(Fran,50))(2,(Bob,27))(1,(Alice,28)) (3,(Charlie,65))(5,(Ed,55))
Edges: Edge(5,3,8)Edge(5,6,3)Edge(2,1,7)Edge(3,2,4) Edge(3,6,3)
// An edge predicate can also compare the two endpoints
val subGraph2b=graph.subgraph(epred=>epred.srcAttr._2<epred.dstAttr._2)
// keeps the edges whose source vertex is younger than the destination vertex
Output:
Vertices: (1,(Alice,28))(4,(David,42))(3,(Charlie,65))(6,(Fran,50)) (2,(Bob,27))(5,(Ed,55))
Edges: Edge(5,3,8)Edge(2,1,7)Edge(2,4,2)

// Method 3: filter on vertices and edge triplets together, separating epred and vpred with a comma
val subGraph3=graph.subgraph(epred=>epred.attr>3,vpred=(id,attr)=>attr._2>30)
Output:
Vertices: (3,(Charlie,65))(5,(Ed,55))(4,(David,42))(6,(Fran,50))
Edges: Edge(5,3,8)

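Basic graph statistics: degree computation

Degree distribution is the most basic and important metric of a graph. Examining it mainly reveals the number and size of the "super nodes" in the graph, as well as the distribution curve of all node degrees. Super nodes strongly affect every kind of propagation algorithm (whether they help or hinder), so their scale should be estimated in advance. With GraphX's basic graph-information interface degrees: VertexRDD[Int] (along with inDegrees and outDegrees), this metric is easy to compute and to summarize in many ways (quoted from "A First Cut: Spark GraphX in Practice at Taobao").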

//----------------- Reduce over degrees to find the maximum -----------------
def max(a:(VertexId,Int),b:(VertexId,Int)):(VertexId,Int)={
            if (a._2>b._2) a  else b }

val totalDegree=graph.degrees.reduce((a,b)=>max(a, b))
val inDegree=graph.inDegrees.reduce((a,b)=>max(a,b))
val outDegree=graph.outDegrees.reduce((a,b)=>max(a,b))

println("max total Degree = "+totalDegree)
println("max in Degree = "+inDegree)
println("max out Degree = "+outDegree)
// Tip: how do you know that a and b have type (VertexId,Int)?
// After typing graph.degrees.reduce((a,b)=>, hover over a and b in the IDE:
// both are (VertexId,Int), and so is the value returned by reduce,
// which tells you how to define max.

// Mean degree
val sumOfDegree=graph.degrees.map(x=>(x._2.toLong)).reduce((a,b)=>a+b)
val meanDegree=sumOfDegree.toDouble/graph.vertices.count().toDouble
print("meanDegree "+meanDegree)
println

//------------------ Degree statistics with built-in RDD functions --------
// Maximum and minimum
val degree2=graph.degrees.map(a=>(a._2,a._1))
// graph.degrees is a VertexRDD[Int], i.e. an RDD of (VertexId,Int).
// The map above swaps each pair, giving RDD[(Int,VertexId)],
// so the degree (Int) becomes the key for the operations below
// (min, max, sortByKey, top, ...), which all compare the first element, i.e. the key.
// max degree
print("max degree = " + (degree2.max()._2,degree2.max()._1))
println

// min degree
print("min degree =" +(degree2.min()._2,degree2.min()._1))
println

// top(N) degrees: the "super nodes" (top already orders by key, so no prior sort is needed)
print("top 3 degrees:\n")
degree2.top(3).foreach(x=>print((x._2,x._1)))
println

/* Output:
 * max degree = (2,4) // (vertexId, degree)
 * min degree =(1,2)
 * top 3 degrees:
 * (2,4)(5,3)(3,3)
 */
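
The degree distribution curve mentioned above can be obtained by counting how many vertices share each degree value. This is a minimal sketch, assuming Spark with GraphX on the classpath; the names DegreeHistogram, histogram and sampleGraph are illustrative, and the sample data is the six-user graph used elsewhere in these notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, not part of GraphX.
object DegreeHistogram {
  // Maps each degree value to the number of vertices that have it.
  // Vertices with no edges do not appear in graph.degrees and are therefore not counted.
  def histogram(g: Graph[(String, Int), Int]): Map[Int, Long] =
    g.degrees.map { case (_, degree) => degree }.countByValue().toMap

  // The six-user, eight-edge sample graph.
  def sampleGraph(sc: SparkContext): Graph[(String, Int), Int] = {
    val vertices = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
      (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))))
    val edges = sc.parallelize(Seq(
      Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3),
      Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DegreeHistogram").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Print the histogram in order of increasing degree.
    histogram(sampleGraph(sc)).toSeq.sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

countByValue collects the result on the driver, which is reasonable here because the number of distinct degree values is normally far smaller than the number of vertices.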

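Neighborhood aggregation: aggregating messages

       Neighborhood aggregation
       A key step in graph analytics is aggregating information about the neighborhood of each vertex. For example, we might want to know how many followers each user has, or the average age of each user's followers. Many iterative graph algorithms (such as PageRank, shortest paths, and connected components) repeatedly aggregate the properties of neighboring vertices.
       aggregateMessages
The core aggregation operator in GraphX is aggregateMessages. It sends messages along edges to the adjacent vertices, merges the messages each vertex receives, and returns the result as a VertexRDD. The operator applies a user-defined sendMsg function to every edge triplet in the graph and then uses the mergeMsg function to aggregate the messages at their destination vertex.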

class Graph[VD, ED] {
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit, // (1) sendMsg: sends messages to adjacent vertices, like the map function in MapReduce
      mergeMsg: (Msg, Msg) => Msg, // (2) mergeMsg: merges the messages a vertex receives, like the reduce function
      tripletFields: TripletFields = TripletFields.All) // (3) optional: TripletFields.Src / Dst / All
    : VertexRDD[Msg] // (4) the returned VertexRDD of aggregated messages
}

(1) sendMsg:
sendMsg is applied to every edge triplet in the graph; its argument is the context of that edge.
The user defined sendMsg function takes an EdgeContext, which exposes the source and destination attributes along with the edge attribute and functions (sendToSrc, and sendToDst) to send messages to the source and destination attributes. Think of sendMsg as the map function in map-reduce.

// The key data structure EdgeContext, from the GraphX source code

package org.apache.spark.graphx

/**
 * Represents an edge along with its neighboring vertices and allows sending messages along the
 * edge. Used in [[Graph#aggregateMessages]].
 */
abstract class EdgeContext[VD, ED, A] { // type parameters: vertex attribute type, edge attribute type, message type
  /** The vertex id of the edge's source vertex. */
  def srcId: VertexId
  /** The vertex id of the edge's destination vertex. */
  def dstId: VertexId
  /** The vertex attribute of the edge's source vertex. */
  def srcAttr: VD
  /** The vertex attribute of the edge's destination vertex. */
  def dstAttr: VD
  /** The attribute associated with the edge. */
  def attr: ED

  /** Sends a message to the source vertex. */
  def sendToSrc(msg: A): Unit
  /** Sends a message to the destination vertex. */
  def sendToDst(msg: A): Unit

  /** Converts the edge and vertex properties into an [[EdgeTriplet]] for convenience. */
  def toEdgeTriplet: EdgeTriplet[VD, ED] = {
    val et = new EdgeTriplet[VD, ED]
    et.srcId = srcId
    et.srcAttr = srcAttr
    et.dstId = dstId
    et.dstAttr = dstAttr
    et.attr = attr
    et
  }
}

(2) mergeMsg:
The user defined mergeMsg function takes two messages destined to the same vertex and yields a single message. Think of mergeMsg as the reduce function in map-reduce.

(3) The optional tripletFields argument
        It indicates which data will be accessed: the source vertex attribute, the destination vertex attribute, or both, for example TripletFields.Src, TripletFields.Dst, or TripletFields.All.
     In addition, aggregateMessages takes an optional tripletsFields which indicates what data is accessed in the EdgeContext (i.e., the source vertex attribute but not the destination vertex attribute). The possible options for the tripletsFields are defined in TripletFields and the default value is TripletFields.All which indicates that the user defined sendMsg function may access any of the fields in the EdgeContext. The tripletFields argument can be used to notify GraphX that only part of the EdgeContext will be needed allowing GraphX to select an optimized join strategy. For example if we are computing the average age of the followers of each user we would only require the source field and so we would use TripletFields.Src to indicate that we only require the source field.

（4）返回值：
The aggregateMessages operator returns a VertexRDD[Msg] containing the aggregate message (of type Msg) destined to each vertex. Vertices that did not receive a message are not included in the returned VertexRDD.

// Assume the following graph has been defined:
// Vertices: (id, (name, age))
// (4,(David,18))(1,(Alice,28))(6,(Fran,40))(3,(Charlie,30))(2,(Bob,70))(5,(Ed,55))
// Edges: Edge(4,2,2)Edge(2,1,7)Edge(4,5,8)Edge(2,4,2)Edge(5,6,3)Edge(3,2,4)
//        Edge(6,1,2)Edge(3,6,3)Edge(6,2,8)Edge(4,1,1)Edge(6,4,3)

// Neighborhood aggregation: for each user, count the followers older than the user (count) and compute their average age (totalAge/count)
val olderFollowers=graph.aggregateMessages[(Int,Int)](
// The tuple type (Int,Int) in brackets is the message type, i.e. the (count, totalAge) value produced by the reduce function (mergeMsg)
        triplet=> {
          if(triplet.srcAttr._2>triplet.dstAttr._2){
              triplet.sendToDst((1,triplet.srcAttr._2))
          }
        }, // (1) the argument is the edge context, so the function runs on every edge triplet; messages go out via sendToSrc or sendToDst
        (a,b)=>(a._1+b._1,a._2+b._2), // (2) the reduce function: a and b are each a (count, age) tuple,
        // summed pairwise until the total count and totalAge remain
        TripletFields.All) // (3) optional: TripletFields.All/Src/Dst
olderFollowers.collect().foreach(println)
Output:
(4,(2,110)) // user 4 has 2 followers older than themselves, with a total age of 110
(6,(1,55))
(1,(2,110))

// Compute the average age
val averageOfOlderFollowers=olderFollowers.mapValues((id,value)=>value match{
      case (count,totalAge) =>(count,totalAge/count) // match-case destructures the (count, totalAge) tuple; the division is integer division
    })

averageOfOlderFollowers.foreach(print)
Output:
(1,(2,55))(4,(2,55))(6,(1,55)) // user 1 has 2 older followers, whose average age is 55
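
The simpler question raised earlier, how many followers each user has, needs no vertex or edge attributes at all, which makes it a compact illustration of the tripletFields hint. The following is a sketch under assumptions: Spark with GraphX on the classpath, the hypothetical names FollowerCount, followerCounts and sampleGraph, and the same eleven-edge example graph as above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}

// Hypothetical helper object, not part of GraphX.
object FollowerCount {
  // Counts the inbound edges (followers) of every vertex.
  def followerCounts(g: Graph[(String, Int), Int]): VertexRDD[Int] =
    g.aggregateMessages[Int](
      ctx => ctx.sendToDst(1), // map side: one message per edge, sent to the followed user
      _ + _,                   // reduce side: add up the messages that arrive at a vertex
      TripletFields.None)      // sendMsg reads no vertex or edge attribute

  // The example graph from the listing above.
  def sampleGraph(sc: SparkContext): Graph[(String, Int), Int] = {
    val vertices = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 70)), (3L, ("Charlie", 30)),
      (4L, ("David", 18)), (5L, ("Ed", 55)), (6L, ("Fran", 40))))
    val edges = sc.parallelize(Seq(
      Edge(4L, 2L, 2), Edge(2L, 1L, 7), Edge(4L, 5L, 8), Edge(2L, 4L, 2),
      Edge(5L, 6L, 3), Edge(3L, 2L, 4), Edge(6L, 1L, 2), Edge(3L, 6L, 3),
      Edge(6L, 2L, 8), Edge(4L, 1L, 1), Edge(6L, 4L, 3)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FollowerCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    followerCounts(sampleGraph(sc)).collect().foreach(println)
    sc.stop()
  }
}
```

Vertex 3 (Charlie) has no inbound edge, so it receives no message and is absent from the result, as described for the return value in (4).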

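Graph algorithm toolkit

1. Triangle counting

One of the main uses of TriangleCount is community detection:
[Figure: triangle counting for community detection]
For example, if the people you follow on Weibo also follow one another, the follow relationships contain many triangles, which indicates a strong, stable community whose members are closely connected. If instead you follow many people who do not follow one another, your social circle is very small. (Quoted from the book 《大数据Spark企业级实战》, Big Data Spark Enterprise Practice.)

graph.triangleCount().vertices.foreach(x=>print(x+"\n"))
    /* Output:
     * (1,1) // vertex 1 is part of 1 triangle
     * (3,2) // vertex 3 is part of 2 triangles
     * (5,2)
     * (4,1)
     * (6,1)
     * (2,2)
     */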
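
What a per-vertex triangle count means can be checked by hand on a small graph. The sketch below is plain Scala with no Spark dependency and is not how GraphX implements triangleCount; it is a brute-force illustration that treats edges as undirected. The names TriangleCountSketch and trianglesPerVertex are hypothetical, and the edge list is assumed to be the eight-edge sample graph used in these notes.

```scala
// Brute-force illustration only; GraphX uses a distributed neighbour-set intersection instead.
object TriangleCountSketch {
  // Returns vertex -> number of triangles that pass through it.
  def trianglesPerVertex(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Undirected neighbour sets, ignoring direction, duplicate edges and self-loops.
    val nbrs: Map[Long, Set[Long]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq((a, b), (b, a)) }
      .groupBy(_._1)
      .map { case (v, pairs) => (v, pairs.map(_._2).toSet) }
    nbrs.map { case (v, ns) =>
      // A triangle through v is a pair of v's neighbours that are themselves adjacent;
      // the condition a < b counts each such pair once.
      val count =
        (for (a <- ns.toSeq; b <- ns.toSeq if a < b && nbrs(a).contains(b)) yield 1).size
      (v, count)
    }
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L), (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L))
    trianglesPerVertex(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On this edge list the three triangles are {1,2,4}, {2,3,5} and {3,5,6}, so vertices 2, 3 and 5 each sit in two triangles and vertices 1, 4 and 6 in one.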

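2. Connected components

Real life is full of networks, such as social networks, transaction networks, and transport networks. Community detection on these networks is very valuable. In a social network it can reveal groups with different interests and backgrounds, which supports targeted publicity strategies. In a transaction network, different communities represent customer groups with different purchasing power, which helps operations staff recommend suitable products. In a funds network, a community may be a money-laundering ring or a rating-fraud alliance, which helps the security team respond. In a network of similar shops, community detection can uncover merchant groups and price alliances, which helps in guiding merchants. In short, community detection has important applications in many kinds of networks. Figure 1 shows an example of community detection based on the topology of a graph.

[Figure 1: community detection based on graph topology]

        Detecting connected components reveals how many connected parts a graph has and how many vertices each part contains. A large graph can then be split into several smaller graphs, with the tiny fragments discarded, so that finer-grained processing can run on the smaller subgraphs. GraphX currently provides the ConnectedComponents and StronglyConnectedComponents algorithms for computing them quickly.
        Connected components can be developed further into community detection algorithms; one criterion for judging such an algorithm is the modularity value Q of the partition it produces.
         A directed graph is strongly connected if every vertex is reachable from every other vertex along a directed path. A strongly connected component is a maximal strongly connected subgraph of a directed graph. The graph in the figure has three strongly connected components:
[Figure: a directed graph with three strongly connected components]

// Connected components
def connectedComponents(maxIterations: Int): Graph[VertexId, ED]
def connectedComponents(): Graph[VertexId, ED]

// Strongly connected components
// numIter: the maximum number of iterations to run for
def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]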
// Community detection with connected components
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object myConnectComponent {
  def main(args:Array[String]){    

    val sparkConf = new SparkConf().setAppName("myGraphPractice").setMaster("local[2]")
    val sc=new SparkContext(sparkConf) 
    // Suppress log output
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val graph=GraphLoader.edgeListFile(sc, "/spark-2.0.0-bin-hadoop2.6/data/graphx/followers.txt")    

    graph.vertices.foreach(print)
    println
    graph.edges.foreach(print)  
    println

    val cc=graph.connectedComponents().vertices
    cc.foreach(print)
    println 
    /* Output:
     * (VertexId, cc)
     * (4,1)(1,1)(6,3)(3,3)(2,1)(7,3)
     */

    // Strongly connected components: stronglyConnectedComponents
    val maxIterations=10 // the maximum number of iterations to run for
    val cc2=graph.stronglyConnectedComponents(maxIterations).vertices
    cc2.foreach(print)
    println 


    val path2="/spark-2.0.0-bin-hadoop2.6/data/graphx/users.txt"
    val users=sc.textFile(path2).map{ // a multi-line function body inside map must be wrapped in {}
      line=>
        val fields=line.split(",")
        (fields(0).toLong,fields(1)) // (id, name): the last expression of the block is the return value
    }    
    users.collect().foreach { println}
    println
    /* Output: (VertexId, name)
     * (1,BarackObama)
     * (2,ladygaga)
     * ...
     */


    val joint=cc.join(users)
    joint.collect().foreach { println}
    println

    /* Output:
     * (VertexId, (cc, name))
     * (4,(1,justinbieber))
     * (6,(3,matei_zaharia))
     */

    val name_cc=joint.map{
      case (id,(cc,name))=>(name,cc)
    }
    name_cc.foreach(print)   
    /*
     * (name,cc)
     * (BarackObama,1)(jeresig,3)(odersky,3)(justinbieber,1)(matei_zaharia,3)(ladygaga,1)
     */

  }  

}
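
The component label in the output is the smallest vertex ID in each component. The idea of propagating the minimum ID along edges can be sketched in plain Scala without Spark. This is an illustration only, not the GraphX implementation, which runs the propagation in parallel on Pregel. The names ConnectedComponentsSketch and components are hypothetical, and the edge list is illustrative data chosen to give two components; it is not claimed to be the contents of followers.txt.

```scala
// Sequential illustration of minimum-ID label propagation.
object ConnectedComponentsSketch {
  // Returns vertex -> smallest vertex ID in its component (edge direction is ignored).
  def components(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Every vertex starts in its own component, labelled with its own ID.
    var label: Map[Long, Long] =
      edges.flatMap { case (a, b) => Seq(a, b) }.distinct.map(v => (v, v)).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        // Both endpoints of an edge adopt the smaller of their two labels.
        val low = math.min(label(a), label(b))
        if (label(a) != low || label(b) != low) {
          label = label.updated(a, low).updated(b, low)
          changed = true
        }
      }
    }
    label
  }

  def main(args: Array[String]): Unit = {
    // Illustrative edge list with two components.
    val edges = Seq((2L, 1L), (4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L))
    components(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The loop stops once a full pass over the edges changes no label, at which point every vertex carries the minimum ID reachable from it.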

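3. PageRank: letting links "vote"

A page's "vote count" is determined by the importance of all the pages that link to it: a hyperlink to a page counts as one vote for that page. A page's PageRank is computed recursively from the importance of all the pages that link to it (its "inbound pages"). A page with many inbound links has a high rank, whereas a page with no inbound links has no rank.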

For a worked Spark GraphX example, see:
http://www.cnblogs.com/shishanyuan/p/4747793.html

def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
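// Two parameters:
// tol: the tolerance allowed at convergence (smaller => more accurate).
// A smaller tol gives a more accurate result but takes longer to compute.
// resetProb: the random reset probability (alpha)
// Returns a graph whose vertex attribute is the PageRank (Double) and whose edge attribute is the normalized edge weight (Double)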

Run a dynamic version of PageRank returning a graph with vertex attributes containing the PageRank and edge attributes containing the normalized edge weight.

val prGraph = graph.pageRank(tol=0.001).cache()
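
The voting idea can be made concrete with a few lines of plain Scala. This is a simplified sketch and not the GraphX implementation: it runs a fixed number of rounds on an in-memory edge list and uses the unnormalized update rank(v) = resetProb + (1 - resetProb) * (sum of rank(u) / outDegree(u) over pages u linking to v). The names PageRankSketch and pageRank and the three-page example are assumptions made for illustration.

```scala
// Simplified, single-machine illustration of PageRank voting.
object PageRankSketch {
  // edges are (source, destination) links; returns page -> rank after `iters` rounds.
  def pageRank(edges: Seq[(Long, Long)], iters: Int, resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDeg: Map[Long, Int] = edges.groupBy(_._1).map { case (s, es) => (s, es.size) }
    var ranks: Map[Long, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to iters) {
      // Each page splits its current rank evenly among the pages it links to.
      val votes: Map[Long, Double] = edges
        .map { case (s, d) => (d, ranks(s) / outDeg(s)) }
        .groupBy(_._1)
        .map { case (d, vs) => (d, vs.map(_._2).sum) }
      // A page with no inbound links receives no votes and keeps only the reset share.
      ranks = vertices.map(v => (v, resetProb + (1 - resetProb) * votes.getOrElse(v, 0.0))).toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Pages 1 and 2 both link to page 3; page 3 links back to page 1.
    pageRank(Seq((1L, 3L), (2L, 3L), (3L, 1L)), iters = 100).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In this formulation a page without inbound links ends at the floor value resetProb, which is the sense in which it "has no rank" in the description above.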

Pregel

     In iterative computations, releasing memory is essential for good performance: once a new graph has been produced, the old one must be released quickly and completely, otherwise memory leaks build up after a dozen or so iterations and soon exhaust the job's cache space. Because graphs are composed of multiple RDDs, unpersisting them correctly with Spark's cache, unpersist, and checkpoint APIs takes considerable skill. For iterative computation the official Spark documentation therefore recommends the Pregel API, which correctly unpersists intermediate results.
     Graphs are inherently recursive data structures as properties of vertices depend on properties of their neighbors which in turn depend on properties of their neighbors. As a consequence many important graph algorithms iteratively recompute the properties of each vertex until a fixed-point condition is reached. A range of graph-parallel abstractions have been proposed to express these iterative algorithms. GraphX exposes a variant of the Pregel API.

     At a high level the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. The Pregel operator executes in a series of super steps in which vertices receive the sum of their inbound messages from the previous super step, compute a new value for the vertex property, and then send messages to neighboring vertices in the next super step. Unlike Pregel, messages are computed in parallel as a function of the edge triplet and the message computation has access to both the source and destination vertex attributes. Vertices that do not receive a message are skipped within a super step. The Pregel operators terminates iteration and returns the final graph when there are no messages remaining.

Note, unlike more standard Pregel implementations, vertices in GraphX can only send messages to neighboring vertices and the message construction is done in parallel using a user defined messaging function. These constraints allow additional optimization within GraphX.

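// The main optimizations behind pregel in GraphX:
1. Caching for Iterative mrTriplets & Incremental Updates for Iterative mrTriplets: in many graph-analysis algorithms, different vertices converge at very different rates, and late in the iteration only a few vertices still change. For vertices that did not change, the EdgeRDD does not need to refresh its local cache of their values in the next mrTriplets computation, which greatly reduces communication.

2. Indexing Active Edges: a vertex that did not change does not need to resend messages to its neighbors in the next iteration. When mrTriplets scans the edges, it skips any edge whose adjacent vertices did not change in the previous iteration, which avoids a great deal of useless computation and communication.

3. Join Elimination: a triplet consists of an edge and its two adjacent vertices, and the map function applied to a triplet often needs only one of the two vertex values. In PageRank, for example, the update to a vertex depends only on the value of the source vertex, not on the value of the destination vertex it points to. The mrTriplets computation then needs only a 2-way join instead of a 3-way join of the VertexRDD and the EdgeRDD.

Together these optimizations bring the performance of GraphX close to that of GraphLab. A gap remains, but the integrated pipeline and rich programming interface make up for the small difference in performance.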

// Walking through the pregel computation:
class GraphOps[VD, ED] {
  def pregel[A]
      // pregel takes two parameter lists.
      // The first holds the configuration: the initial message, the maximum number of iterations,
      // and the edge direction along which messages are sent (out-edges by default in this listing).
      // VD: the vertex attribute type
      // ED: the edge attribute type
      // A: the Pregel message type
      // graph: the input graph
      // initialMsg: the message every vertex receives in the first iteration
      // maxIterations: the maximum number of iterations
      (initialMsg: A,
       maxIterations: Int = Int.MaxValue,
       activeDirection: EdgeDirection = EdgeDirection.Out)

      // The second parameter list holds the user-defined functions that receive messages (vprog),
      // compute messages (sendMsg), and merge messages (mergeMsg).
      // vprog: the user-defined vertex program; it runs on each vertex, receives the inbound
      // message, and computes the new vertex value.
      // In the first iteration vprog is invoked on every vertex with initialMsg;
      // in later iterations it is invoked only on vertices that received a message.
      (vprog: (VertexId, VD, A) => VD,
      // sendMsg: a user-supplied function applied to the out-edges of vertices that received messages in the current iteration
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      // mergeMsg: a user-supplied function that merges two messages of type A into one message of type A
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {

    // Receive the initial message at each vertex.
    // In later iterations vprog is not invoked on a vertex that received no message.
    var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()

    // Compute the messages with mapReduceTriplets (map with sendMsg, then reduce with mergeMsg)
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop until no messages remain or maxIterations is achieved
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      g = g.joinVertices(messages)(vprog).cache()
      val oldMessages = messages

      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      activeMessages = messages.count()
      i += 1
    }
    g
  }
}
The whole process is not easy to follow; for a more detailed analysis see:
Spark GraphX Study Notes: Pregel: http://www.ithao123.cn/content-3510265.html

In short, keep the overall iteration in mind:
the vertex program (vprog) runs on every vertex during initialization; after that it runs only on vertices that received a message, and this repeats until the termination condition is met.

// Shortest-path code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object myPregel {
  def main(args: Array[String]): Unit = {

    // Set up the runtime environment
    val conf = new SparkConf().setAppName("myGraphPractice").setMaster("local[4]")
    val sc=new SparkContext(conf)

    // Suppress log output
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val vertexArray = Array(
      (1L, ("Alice", 28)),(2L, ("Bob", 27)),(3L, ("Charlie", 65)),(4L, ("David", 42)),
      (5L, ("Ed", 55)),(6L, ("Fran", 50))
    )
    // Edge attribute type ED: Int
    val edgeArray = Array(
      Edge(2L, 1L, 7),Edge(2L, 4L, 2),Edge(3L, 2L, 4),Edge(3L, 6L, 3),
      Edge(4L, 1L, 1),Edge(5L, 2L, 2),Edge(5L, 3L, 8),Edge(5L, 6L, 3)
    )

    // Build vertexRDD and edgeRDD
    val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
    val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

    // Build the Graph[VD,ED]
    val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)


    val sourceId:VertexId=5L // the source vertex
    val initialGraph=graph.mapVertices((id,_)=>if (id==sourceId) 0.0 else Double.PositiveInfinity)
    // pregel takes two parameter lists
    val shortestPath=initialGraph.pregel(initialMsg=Double.PositiveInfinity,
                                        maxIterations=100,
                                        activeDirection=EdgeDirection.Out)(

                                 // 1. Vertex program: keep the smaller of the stored distance and the incoming one
                                 // (the vertex attribute holds the shortest known distance from the source)
                                 (id,dist,newDist)=>math.min(dist,newDist),

                                 // 2. Send message: for each edge, add the edge weight to the source vertex's distance,
                                 // i.e. the length of the path that reaches the destination through this neighbor,
                                 // and send it only if it improves on the destination's current distance
                                 triplet=>{
                                   if(triplet.srcAttr+triplet.attr<triplet.dstAttr){
                                     Iterator((triplet.dstId,triplet.srcAttr+triplet.attr))
                                   }else{
                                     Iterator.empty
                                   }
                                 },

                                 // 3. Merge message: the reduce function.
                                 // Takes the minimum over all messages sent to the same destination vertex;
                                 // the result arrives as newDist, so the destination then holds an old and a new distance
                                 // and step 1 can update it at the start of the next iteration
                                 (a,b)=>math.min(a,b))

    shortestPath.vertices.map(x=>(x._2,x._1)).top(30).foreach(print)

    /* Output: (shortest distance, vertexId)
     * (8.0,3)(5.0,1)(4.0,4)(3.0,6)(2.0,2)(0.0,5)
     */
  }
}

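Application example 1: community detection with the Louvain algorithm

      The example comes from the book 《Spark最佳实践》 (Spark Best Practices).
      The source code comes from
https://github.com/Sotera/spark-distributed-louvain-modularity
      The source needs changes to run on Spark 2.0; the version in the book already includes them.
      There are many references on community detection, and many algorithms:
http://blog.csdn.net/peghoty/article/details/9286905

Key concept: modularity
Many community detection algorithms are designed around modularity, which measures how reasonable a community partition is.
It evaluates a partition by the difference between the cohesion of the partition produced by an algorithm and the cohesion of a random partition.
Modularity is a measure of the quality of a partition of a network into communities. Its physical meaning is the difference between the number of edges inside communities and the number expected at random. Its value lies in [-1/2, 1), and it is defined as follows: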

$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i,c_j)$

$\delta(u,v) = \begin{cases} 1 & \text{when } u = v \\ 0 & \text{otherwise} \end{cases}$

where

$A_{ij}$ is the weight of the edge between node i and node j (in an unweighted network every edge weight can be taken as 1);

$k_i = \sum_j A_{ij}$ is the sum of the weights of all edges attached to node i (its degree);

$c_i$ is the community that node i belongs to;

$m = \frac{1}{2}\sum_{i,j}A_{ij}$ is the sum of the weights of all edges (the number of edges).

In the formula, $A_{ij} - \frac{k_i k_j}{2m} = A_{ij} - k_i\frac{k_j}{2m}$. The probability that a randomly chosen edge end belongs to node j is $\frac{k_j}{2m}$, and node i has degree $k_i$, so under random wiring the expected edge weight between i and j is $k_i\frac{k_j}{2m}$.
The definition of modularity can be simplified by grouping the terms by community:

$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i,c_j) = \frac{1}{2m}\sum_c\left[\sum_{i,j \in c}A_{ij} - \frac{\sum_{i \in c}k_i\sum_{j \in c}k_j}{2m}\right] = \frac{1}{2m}\sum_c\left[\Sigma_{in} - \frac{(\Sigma_{tot})^2}{2m}\right]$

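where $\Sigma_{in} = \sum_{i,j \in c}A_{ij}$ is the total weight of the edges inside community c (the double sum counts each internal edge in both directions), and $\Sigma_{tot} = \sum_{i \in c}k_i$ is the total weight of the edges attached to the nodes in community c.

The formula can be simplified further to:

$Q = \sum_c\left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2\right] = \sum_c\left[e_c - a_c^2\right]$
Modularity can therefore be read, for each community, as the fraction of edge weight that lies inside the community ($e_c$) minus the square of the fraction of edge weight attached to its nodes ($a_c$). For an undirected, unweighted graph this compares the internal degree of a community with the total degree of its nodes.

Most modularity-based community detection algorithms aim to maximize the modularity Q.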
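
The simplified formula translates almost directly into code. The following plain-Scala sketch evaluates Q for a given partition of a small undirected graph. It is an illustration under assumptions: edges are listed once, there are no self-loops, and the names Modularity and modularity as well as the two-triangle example are made up for this purpose; it is not taken from the Sotera implementation.

```scala
// Direct evaluation of Q = sum_c [ Sigma_in / 2m - (Sigma_tot / 2m)^2 ].
object Modularity {
  // edges: undirected weighted edges (u, v, weight), each listed once, no self-loops.
  // community: node -> community ID.
  def modularity(edges: Seq[(Int, Int, Double)], community: Map[Int, Int]): Double = {
    val m2 = 2.0 * edges.map(_._3).sum // 2m
    // Sigma_tot per community: every edge adds its weight to the community of each endpoint.
    val tot: Map[Int, Double] = edges
      .flatMap { case (u, v, w) => Seq((community(u), w), (community(v), w)) }
      .groupBy(_._1)
      .map { case (c, xs) => (c, xs.map(_._2).sum) }
    // Sigma_in per community: an internal edge is counted in both directions (A_ij and A_ji).
    val in: Map[Int, Double] = edges
      .collect { case (u, v, w) if community(u) == community(v) => (community(u), 2.0 * w) }
      .groupBy(_._1)
      .map { case (c, xs) => (c, xs.map(_._2).sum) }
    tot.keys.toSeq.map(c => in.getOrElse(c, 0.0) / m2 - math.pow(tot(c) / m2, 2)).sum
  }

  def main(args: Array[String]): Unit = {
    // Two triangles {1,2,3} and {4,5,6} joined by the single edge 3-4.
    val edges = Seq((1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
                    (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0))
    val byTriangle = Map(1 -> 0, 2 -> 0, 3 -> 0, 4 -> 1, 5 -> 1, 6 -> 1)
    println(modularity(edges, byTriangle))
  }
}
```

For the two-triangle graph, m = 7, and each triangle has $\Sigma_{in} = 6$ and $\Sigma_{tot} = 7$, so each contributes 6/14 - (7/14)^2 and Q = 5/14. Putting all six nodes into a single community gives Q = 0.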

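The Louvain algorithm
The idea behind the Louvain algorithm is simple:

  1) Treat every node in the graph as its own community, so that the number of communities initially equals the number of nodes.

  2) For each node i in turn, tentatively move i into the community of each of its neighbors, compute the change in modularity $\Delta Q$ between before and after the move, and record the neighbor with the largest $\Delta Q$. If $\max \Delta Q > 0$, move node i into the community of that neighbor; otherwise leave it where it is.

  3) Repeat 2) until no node changes community.

  4) Compress the graph: merge all nodes of a community into a single new node. The weights of the edges between nodes inside a community become the weight of a self-loop on the new node, and the weights of the edges between communities become the weights of the edges between the new nodes.

  5) Repeat from 1) until the modularity of the whole graph no longer changes.
The procedure yields a hierarchical community structure. Most of the computation time goes into the lowest level; once nodes have been merged by community, the numbers of edges and nodes shrink sharply. Moreover, the change in modularity from moving node i into the community of neighbor j depends only on the communities of i and j and not on any other community, so it is fast to compute.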
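
Steps 1) to 3) can be sketched as a sequential, single-machine loop. The code below is an illustration under assumptions and not the distributed implementation referenced above: it covers only the local-moving phase (no graph compression), assumes an undirected graph without self-loops, and uses hypothetical names (LouvainLocalMoving, localMoving). The gain of inserting an isolated node i into community c is taken as $\Delta Q = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\,k_i}{2m^2}$, which follows from the simplified modularity formula, with $k_{i,in}$ the weight of the edges from i into c.

```scala
// Sequential sketch of the Louvain local-moving phase (steps 1 to 3).
object LouvainLocalMoving {
  // Undirected weighted edge (u, v, weight), listed once; self-loops are assumed absent.
  type WEdge = (Int, Int, Double)

  def localMoving(edges: Seq[WEdge]): Map[Int, Int] = {
    val m = edges.map(_._3).sum
    // node -> (neighbour, weight) pairs, in both directions
    val adj: Map[Int, Seq[(Int, Double)]] = edges
      .flatMap { case (u, v, w) => Seq((u, (v, w)), (v, (u, w))) }
      .groupBy(_._1)
      .map { case (n, xs) => (n, xs.map(_._2)) }
    val nodes = adj.keys.toSeq.sorted
    val k: Map[Int, Double] = adj.map { case (n, ns) => (n, ns.map(_._2).sum) }

    var comm: Map[Int, Int] = nodes.map(n => (n, n)).toMap // step 1: singleton communities
    var tot: Map[Int, Double] = k                          // Sigma_tot of each community
    var moved = true
    while (moved) {                                        // step 3: repeat until stable
      moved = false
      for (i <- nodes) {                                   // step 2: try to move each node
        val old = comm(i)
        // k_{i,in}: weight of the edges from i into each neighbouring community
        val kIn: Map[Int, Double] = adj(i)
          .groupBy { case (j, _) => comm(j) }
          .map { case (c, xs) => (c, xs.map(_._2).sum) }
        // Sigma_tot with node i taken out of its current community
        val totWithoutI = tot.updated(old, tot(old) - k(i))
        // Gain in Q from inserting the isolated node i into community c.
        def gain(c: Int): Double =
          kIn.getOrElse(c, 0.0) / m - totWithoutI(c) * k(i) / (2 * m * m)
        val best = (kIn.keySet + old).maxBy(gain)
        // Move only if the best community is strictly better than staying put.
        if (best != old && gain(best) - gain(old) > 1e-12) {
          comm = comm.updated(i, best)
          tot = totWithoutI.updated(best, totWithoutI(best) + k(i))
          moved = true
        }
      }
    }
    comm
  }

  def main(args: Array[String]): Unit = {
    // Two triangles {1,2,3} and {4,5,6} joined by the single edge 3-4.
    val edges = Seq((1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
                    (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0))
    localMoving(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Only the totals of the candidate communities and the edges of node i enter the gain, which is why the move evaluation stays cheap, as noted above. Because every accepted move strictly increases Q, the loop terminates.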

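References

(1) Official Spark documentation
http://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
(2) Wang Jialin, 《大数据Spark企业级实战》 (Big Data Spark Enterprise Practice)
(3) Bottlenecks and analysis of GraphX iteration
http://blog.csdn.net/pelick/article/details/50630003
(4) An introduction to GraphX, the Spark-based graph computing framework
http://www.open-open.com/lib/view/open1420689305781.html
(5) Spark hands-on introduction series, part 9: Spark graph computing with GraphX, introduction and examples
http://www.cnblogs.com/shishanyuan/p/4747793.html
(6) A First Cut: Spark GraphX in Practice at Taobao
http://www.csdn.net/article/2014-08-07/2821097
(7) A distributed implementation of the FastUnfolding community detection algorithm on GraphX
http://bbs.pinggu.org/thread-3614747-1-1.html
(8) Some thoughts on graph computing and GraphX
http://www.tuicool.com/articles/3MjURj
(9) Topic modeling with LDA: when MLlib meets GraphX
http://blog.jobbole.com/86130/
(10) Spark GraphX Study Notes: Pregel
http://www.ithao123.cn/content-3510265.html
(11) Chen Huan and Lin Shifei, 《Spark最佳实践》 (Spark Best Practices)
(12) Community detection algorithms
http://blog.csdn.net/peghoty/article/details/9286905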

Original article: http://blog.csdn.net/qq_34531825/article/details/52324905
