This article is an analysis of the paper "Sampling from Large Graphs" by Jure Leskovec and Christos Faloutsos. Graph mining is an important part of web mining, especially given the huge number of connected nodes in social media. The question behind the paper is how to create a small but good sample from a massive graph with millions or billions of nodes while preserving its key properties.
Graph Sampling and its Challenges:
Graph sampling is the process of deriving a small, representative sample from a given huge real graph. There are a few major challenges in graph sampling that this paper tries to answer:
Which sampling method to use?
How small can the sample be?
How do we scale up measurements of the sample, such as its diameter?
How do we measure success?
There are two broad goals under which we can perform graph sampling:
Scale-down sampling: In scale-down sampling, we aim to derive a much smaller graph with similar graph properties, such as a similar degree distribution and a similar diameter.
Back-in-time sampling: In back-in-time sampling, we try to travel back in time and replicate a past version of the graph, of which we only get to observe the final, static snapshot.
The evaluation criteria for graph sampling are also quite varied. For static graph patterns, which are measured on a single snapshot of a graph, the major evaluation properties are:
In-degree and out-degree distributions
Sizes of weakly and strongly connected components
Number of hops needed to cover a distance (the hop plot)
Distribution of singular values of the graph adjacency matrix versus the rank
For the evaluation of temporal graph patterns, the properties that matter most are:
The number of edges versus the number of nodes over time
The effective diameter of the graph over time
The normalized size of the largest connected component over time
The largest singular value of the graph adjacency matrix over time
The average clustering coefficient over time
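To score how well a sample's distribution matches the true one (for example the degree distributions above), this line of work compares empirical distributions with the Kolmogorov-Smirnov D-statistic, the maximum gap between two empirical CDFs. A minimal sketch (function names are mine, not the paper's):

```python
import bisect

def ecdf(sample):
    """Return a function giving the empirical CDF of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    # fraction of observations <= v
    return lambda v: bisect.bisect_right(xs, v) / n

def ks_statistic(a, b):
    """Kolmogorov-Smirnov D-statistic: max CDF gap between two samples."""
    ca, cb = ecdf(a), ecdf(b)
    return max(abs(ca(v) - cb(v)) for v in set(a) | set(b))
```

For example, comparing the degree sequence of a sample against that of the full graph gives a single number in [0, 1], with 0 meaning a perfect match.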
The paper uses several distinct sampling algorithms, which fall into three main categories:
1. Sampling by random node selection: Here, we simply want to find the sampling method that gives the most similar degree distribution. Random node selection divides further into:
Random Node (RN) sampling: The most obvious way to create a sample graph is to select a set of nodes N uniformly at random; the sample is then the graph induced by the set of nodes N.
Random PageRank Node (RPN) sampling: A sample graph is created by setting the probability of a node being selected into the sample proportional to its PageRank weight.
Random Degree Node (RDN) sampling: In RDN sampling, the probability of a node being selected is proportional to its degree, so it has an even stronger bias towards high-degree nodes. Intuitively, this method will have problems matching the degree distribution, since there are too many high-degree nodes in the sample.
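The paper works with directed graphs; purely as an illustration, here is a minimal undirected sketch of RN and RDN sampling over a dict-of-sets adjacency representation (function names and the representation are my assumptions, not the paper's):

```python
import random

def random_node_sample(adj, k, rng=random):
    """RN: pick k nodes uniformly at random; return the induced subgraph."""
    keep = set(rng.sample(sorted(adj), k))
    return {u: {v for v in adj[u] if v in keep} for u in keep}

def random_degree_node_sample(adj, k, rng=random):
    """RDN: pick k distinct nodes with probability proportional to degree."""
    nodes = sorted(adj)
    weights = [len(adj[u]) or 1 for u in nodes]  # small weight for isolates
    keep = set()
    while len(keep) < k:
        keep.add(rng.choices(nodes, weights=weights)[0])
    return {u: {v for v in adj[u] if v in keep} for u in keep}
```

RPN would look like RDN with PageRank scores in place of degrees as the selection weights.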
2. Random Edge (RE) sampling: To build a sample graph, this technique selects edges at random, but the sampled graphs will be very sparsely connected, will thus have a large diameter, and will not respect community structure. To mitigate this, we can build a different version in the form of random node-edge (RNE) selection, which first selects a node at random and then uniformly selects an edge connected to that node.
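Continuing the same illustrative, undirected dict-of-sets sketch (again my own simplification, not the paper's code), RE and RNE look like this:

```python
import random

def random_edge_sample(adj, m, rng=random):
    """RE: keep m edges chosen uniformly at random."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    sample = {}
    for u, v in rng.sample(edges, m):
        sample.setdefault(u, set()).add(v)
        sample.setdefault(v, set()).add(u)
    return sample

def random_node_edge_sample(adj, m, rng=random):
    """RNE: m times, pick a node uniformly, then one incident edge uniformly."""
    nodes = [u for u in sorted(adj) if adj[u]]  # skip isolated nodes
    sample = {}
    for _ in range(m):
        u = rng.choice(nodes)
        v = rng.choice(sorted(adj[u]))
        sample.setdefault(u, set()).add(v)
        sample.setdefault(v, set()).add(u)
    return sample
```

Note that RNE damps RE's bias towards high-degree nodes, since the first pick is uniform over nodes rather than over edges.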
3. Sampling by exploration: The idea behind sampling by exploration is that we first select a node uniformly at random and then explore the nodes in its vicinity. A few sampling algorithms based on this technique are:
Random Node Neighbor (RNN) sampling: In RNN, we select a node uniformly at random together with all of its out-going neighbors. This makes it effective for very large graphs and for matching the out-degree distribution, but it fails to match the in-degree distribution.
Random Walk (RW) sampling: In RW, we pick a starting node uniformly at random and then simulate a random walk on the graph. At every step, with some probability, we fly back to the starting node and restart the random walk. Random walks, which normally require fewer resources per sample, can suffer from large estimation errors when the graph is disconnected or loosely connected.
Random Jump (RJ) sampling: Random jump differs from random walk in that, with some probability, we jump to a node chosen uniformly at random from the whole graph. Unlike the random walk, this algorithm does not get stuck or fail to visit enough nodes.
Forest Fire (FF) sampling: FF is a recursive process where we randomly pick a seed node and begin "burning" outgoing links and the corresponding nodes. If a link gets burned, the node at its other endpoint gets a chance to burn its own links, and so on recursively.
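The exploration methods can be sketched in the same illustrative undirected setting. Below, a minimal RW with a step cap (the cap makes the "getting stuck" failure mode visible) and a minimal FF where each node burns a geometrically distributed number of its links, as in the paper's description; the parameter defaults and function names are my assumptions:

```python
import random

def random_walk_sample(adj, k, fly_back=0.15, max_steps=10_000, rng=random):
    """RW: walk from a random start; with prob. fly_back, restart at it.

    The step cap illustrates the problem noted above: on a disconnected
    graph the walk may never collect k distinct nodes.
    """
    start = rng.choice(sorted(adj))
    visited, u = {start}, start
    for _ in range(max_steps):
        if len(visited) >= k:
            break
        if rng.random() < fly_back or not adj[u]:
            u = start
        else:
            u = rng.choice(sorted(adj[u]))
            visited.add(u)
    return {u: {v for v in adj[u] if v in visited} for u in visited}

def forest_fire_sample(adj, k, p_forward=0.7, rng=random):
    """FF: burn outward from random seeds until k nodes are collected."""
    burned = set()
    while len(burned) < k:
        seed = rng.choice(sorted(adj))   # restart if the fire dies out
        burned.add(seed)
        frontier = [seed]
        while frontier and len(burned) < k:
            u = frontier.pop()
            n_burn = 0                   # geometric number of links to burn
            while rng.random() < p_forward:
                n_burn += 1
            fresh = [v for v in sorted(adj[u]) if v not in burned]
            rng.shuffle(fresh)
            for v in fresh[:n_burn]:
                if len(burned) >= k:
                    break
                burned.add(v)
                frontier.append(v)
    return {u: {v for v in adj[u] if v in burned} for u in burned}
```

RJ would replace the fly-back step in RW with a jump to a uniformly random node of the whole graph.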
When performing scale-down sampling, RDN, RJ and RW turn out to be biased towards high-degree nodes and densely connected parts of the graph. On the other hand, FF, RPN and RN, which are not biased towards high-degree nodes, produce degree distributions that are well matched and have about the same shape as the true degree distribution. Algorithms based on random edge selection, such as RE and RNE, yield very sparsely connected samples at small sample sizes, and the slope of their degree distribution is too steep.
For back-in-time sampling, edge selection techniques (RE, RNE) result in graphs that are too sparse and then rapidly fill in the missing edges once the sample size gets large. FF, RN and RPN match the temporal densification of the true graph. By contrast, RW, RJ and RDN have a constant diameter, which is good for scale-down sampling, while FF, RN and RPN match the shrinking diameter of the true graph over time quite well. Similar observations hold for the size of the largest connected component and the average clustering coefficient over time.
The work concludes that no list of properties will ever be perfect, but much better performance can be achieved at smaller graph sizes. The major outcomes of the work are listed below:
The back-in-time methodology generates better sample graphs than the scale-down sampling methodology.
Forest Fire is the best-performing sampling algorithm.
Usually a sample of about 15% of the actual graph's size is enough to match its properties.