With the proliferation of data driven applications over last few years, many applications have come to rely on advanced analyses of huge data sets. While MapReduce based frameworks, such as Hadoop, make it easier to scale your algorithms to big data sets, their restrictive programming framework makes it challenging to fit your existing computations to MapReduce’s model.
Array-based languages such as R, provide a much more natural and expressive framework in which to write your statistical analyses. Since it was introduced in 1996, R has gone on to become one the languages of choice for statisticians and data scientists worldwide. While the package boasts a thriving community and an interactive ecosystem, it also has significant limitations. R is single threaded and does not scale well with larger data sets. This disadvantage renders effective statistical packages, written by statisticians, useless for non-trivial datasets.
If your algorithm is very data intensive, then Hadoop proves to be a better solution. However, with some clever refactoring of your existing R code, it is possible to eke out better performance from R. With that being said, I decided to test out the MPI wraparound for R, Rmpi.
PageRank is the popular link analysis algorithm, which kick-started the Google powerhouse. The link analysis algorithm represents the likelihood of any person randomly surfing the Internet and arriving at a particular webpage. The PageRank computations require several passes before converging. When one node links into another node it contributes a part of its PageRank score to the score of the overall PageRank of that node. If a node with a higher PageRank links into a second node, then the second node automatically receives a higher PageRank. The update function of the algorithm is:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
PR(A) – PageRank of node A
d – dampening factor
PR(Tn)/C(Tn) – portion of Tn’s score contributed to PR(A) where T1… Tn are the nodes that link into A
The most naive version of PageRank assumes all nodes have the same rank initially, and that each node distributes equal scores to its neighbors until convergence. This algorithm converges when the difference between successive updates to the PageRank vector is less than some very small value.
Strategies for parallelization
There are two main data structures in this calculation. The adjacency matrix represents the graph, and the PageRank vector contains scores for all of the nodes. While each thread will function on only its portion of rows of the adjacency matrix, the entire PageRank vector needs to be made available to the slave threads after each iteration. This ensures that the updated PageRanks will be used for the next iteration. I used MPI operation scatterv to distribute the adjacency matrix to the slave threads, and the operation allgatherv was used to gather the results of the partial PageRank vector after each iteration.
Experiments on the Rescale platform
I used the Stanford Network Analysis Project data sets for my experiments, in particular the High Energy Physics Citation Network data set. The graph contains 34,546 nodes and 421,578 edges. The runtime for the vanilla sequential version was 653.89 seconds.
The runtime graph follows an exponential decaying distribution with the gains in the runtime petering out between 12 (86.733 seconds) and 16 (68.38 seconds) threads.
While it is still challenging to scale to very large data sets, R can be scaled for moderately sized data sets using frameworks, such as Rmpi, as demonstrated on the Rescale platform.
This article was written by Rescale.