Skip to content

Wrong PageRank results #143

@zma

Description

@zma

Hi all,

I tried the static PageRank computation on GraphX (in Spark 1.0.0) with the following code:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object ord extends Ordering[(VertexId, Double)] { def compare(a:(VertexId, Double), b:(VertexId, Double)) = a._2 compare b._2}

val graph = GraphLoader.edgeListFile(sc, "/lhome/zma/test.txt")
for (i <- 0 to 30) {
  println("iter: " + i)
  val ranks = graph.staticPageRank(i)
  ranks.vertices.top(100)(ord).foreach(println(_))
}

The /lhome/zma/test.txt contains a simple graph:

0 1
1 3
2 3
3 3

The PageRank implemenation ( https://github.com/amplab/graphx/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ) uses this PageRank definition:

PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum

However, the results I get are:

iter: 0
(0,0.15)
(1,0.15)
(3,0.15)
(2,0.15)
iter: 1
(3,0.5325)
(1,0.27749999999999997)
(0,0.15)
(2,0.15)
iter: 2
(3,0.8384999999999999)
(1,0.27749999999999997)
(0,0.15)
(2,0.15)
iter: 3
(3,0.862725)
(1,0.27749999999999997)
(0,0.15)
(2,0.15)
...

The results with iterNum as 0 and 1 are correct. However, for "iter: 2":

The PageRank value for 3 should be:

0.15 + 0.85 * (0.27749999999999997 + 0.15 + 0.5325)
= 0.966

while the result is (3,0.8384999999999999). It seems the PageRank of vertex 2 is not passed to the PageRank of vertex 3 (0.8384999999999999 == 0.15 + 0.85 * (0.27749999999999997 + 0.5325)).

I noticed that the Pregel implementation (https://github.com/amplab/graphx/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala) only sends messages from vertices changed in last superstep. I can understand this performance optimization for less messages. However, is this PageRank implementation on top of this model wrong? Will passing the delta out from changed vertices be more suitable for implementation on this "Pregel" model in GraphX?

Please correct me if I am wrong at understanding some parts. I will appreciate it if additional information or pointers to them are provided.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions