GitHub repo cloud-examples: https://github.com/sagemath/cloud-examples
License: MIT
PageRank in Apache Spark
This is one of the basic examples of how Apache Spark works and what its code looks like.
Make sure to read about Transformations and especially map, flatMap, join, ...
Read more about the RDD's Python API.
When using the "Apache Spark"-themed kernels in SageMathCloud, the object "sc" for the "Spark Context" is already initialized.
The data for your simplified link graph.
The initial rank of each node is 1.0.
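As a rough sketch of these two steps, the snippet below builds a small link graph and the initial ranks in plain Python, so it runs without a Spark cluster. The graph `link_data` is a hypothetical example and not necessarily the one used in this notebook; the comment shows what the equivalent call on the pre-initialized `sc` might look like.

```python
# Hypothetical link graph (an assumption for illustration):
# each key is a node, each value is the list of nodes it links to.
link_data = {0: [1], 1: [2], 2: [0, 1], 3: [0]}

# Every node starts with rank 1.0.
# With a Spark context this could be written as:
#   ranks = sc.parallelize(link_data.keys()).map(lambda node: (node, 1.0))
ranks = [(node, 1.0) for node in link_data]

print(ranks)  # [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]
```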
This initializes the edges of the graph data as links, which are modeled in Spark as key-value tuples.
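A minimal sketch of the key-value representation, again in plain Python and with a hypothetical graph: each node becomes a `(node, [targets])` tuple. In Spark, the same list of tuples could be handed to `sc.parallelize(...)` and cached, since it is reused in every iteration.

```python
# Hypothetical link graph (an assumption for illustration).
link_data = {0: [1], 1: [2], 2: [0, 1], 3: [0]}

# Key-value tuples: the key is the source node, the value its outgoing links.
# With a Spark context this could be written as:
#   links = sc.parallelize(link_data.items()).cache()
links = list(link_data.items())

print(links)  # [(0, [1]), (1, [2]), (2, [0, 1]), (3, [0])]
```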
This demonstrates what happens when the rank key-value tuples are joined with the links key-value tuples. Take a close look: the result is a list of tuples that contain tuples, which in turn contain lists!
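To make the nested structure concrete, here is a plain-Python stand-in for an inner join on keys. The helper `join` is hypothetical and only mimics what `links.join(ranks)` returns on RDDs; the join order (links first, ranks second) and the graph are assumptions. Note that Spark does not guarantee the order of the collected records.

```python
def join(left, right):
    """Inner join of two lists of (key, value) pairs, similar to RDD.join."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

# Hypothetical data (an assumption for illustration).
links = [(0, [1]), (1, [2]), (2, [0, 1]), (3, [0])]
ranks = [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]

# Each record has the shape (node, ([targets], rank)).
joined = join(links, ranks)
print(joined)
# [(0, ([1], 1.0)), (1, ([2], 1.0)), (2, ([0, 1], 1.0)), (3, ([0], 1.0))]
```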
Here is a debug printout to outline what the first operation in the code below is doing:
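The first operation is presumably a `flatMap` that turns each joined record into rank contributions: every node splits its current rank evenly among the nodes it links to. The sketch below imitates that step in plain Python; the function name `compute_contribs` and the input records are assumptions for illustration.

```python
def compute_contribs(record):
    """Yield (target, share) pairs for one (node, ([targets], rank)) record."""
    node, (targets, rank) = record
    for target in targets:
        # The node's rank is divided evenly among its outgoing links.
        yield target, rank / len(targets)

# Hypothetical joined records (an assumption for illustration).
joined = [(0, ([1], 1.0)), (1, ([2], 1.0)), (2, ([0, 1], 1.0)), (3, ([0], 1.0))]

# flatMap equivalent: apply the function to each record and flatten the results.
# With RDDs this could be written as: links.join(ranks).flatMap(compute_contribs)
contribs = [pair for record in joined for pair in compute_contribs(record)]
print(contribs)  # [(1, 1.0), (2, 1.0), (0, 0.5), (1, 0.5), (0, 1.0)]
```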
Executing the PageRank Algorithm
This takes a while to execute. Do
in a Terminal to see what's going on behind the scenes!
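For orientation, the whole iteration can be sketched in plain Python, without Spark. This is a simplified stand-in under stated assumptions, not the notebook's actual code: the graph, the damping factor 0.85, and the 50 iterations are all chosen for illustration. Each round sums the contributions per target node (the role of `reduceByKey` in Spark) and then applies the damping formula.

```python
from collections import defaultdict

# Hypothetical link graph and parameters (assumptions for illustration).
link_data = {0: [1], 1: [2], 2: [0, 1], 3: [0]}
DAMPING = 0.85
ITERATIONS = 50

ranks = {node: 1.0 for node in link_data}

for _ in range(ITERATIONS):
    # Sum the contributions each node receives (join + flatMap + reduceByKey).
    contribs = defaultdict(float)
    for node, targets in link_data.items():
        for target in targets:
            contribs[target] += ranks[node] / len(targets)

    # New rank: base value plus the damped sum of incoming contributions.
    # Unlike a pure join/reduceByKey pipeline, nodes without incoming links
    # are kept here and simply receive the base value.
    ranks = {node: (1 - DAMPING) + DAMPING * contribs[node] for node in link_data}

for node, rank in sorted(ranks.items()):
    print(node, round(rank, 4))
```

Because every node in this example graph has at least one outgoing link, the ranks always sum to the number of nodes, which is a handy sanity check.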
Comparison with NetworkX
Task
Now go back to the directed graph at the beginning and decide whether those numbers make sense. Why is the weight of node 1 higher than the weight of node 2?