RHadoop

These notes cover installing RHadoop (the rmr and rhdfs packages) on Ubuntu with a Cloudera Hadoop distribution.

For CDH4, download the one-click-install package, add the repository, and install the MapReduce JobTracker:

    archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

    sudo dpkg -i Downloads/cdh4-repository_1.0_all.deb

    sudo apt-get update
    sudo apt-get upgrade

    sudo apt-get install hadoop-0.20-mapreduce-jobtracker

At the time these notes were written, the supported Ubuntu releases were Wily Werewolf (15.10), Vivid Vervet (15.04), Trusty Tahr (14.04; LTS), and Precise Pangolin (12.04; LTS).

See https://wiki.ubuntu.com/Releases for details.

# Installation

To obtain the latest R packages, add an entry like

    deb https://<my.favorite.cran.mirror>/bin/linux/ubuntu wily/
    
or

    deb https://<my.favorite.cran.mirror>/bin/linux/ubuntu vivid/

or

    deb https://<my.favorite.cran.mirror>/bin/linux/ubuntu trusty/
   
or

    deb https://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/


in your /etc/apt/sources.list file, replacing <my.favorite.cran.mirror> by the actual URL of your favorite CRAN mirror. See https://cran.r-project.org/mirrors.html for the list of CRAN mirrors. To install the complete R system:

    sudo apt-get install r-base


Users who need to compile R packages from source [e.g. package maintainers, or anyone installing packages with install.packages()] should also install the r-base-dev package. The rJava bindings and the libxml2 and libcurl development headers are needed to build some of the R packages installed below:

    sudo apt-get install r-base-dev
    sudo apt-get install r-cran-rjava
    sudo apt-get install libxml2-dev
    sudo apt-get install libcurl4-openssl-dev

Make sure Java is installed:

    java -version

Start R as root and install the R dependencies:

    sudo R

    install.packages("RCurl")
    install.packages(c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools'),
                     repos='http://cran.revolutionanalytics.com')


Download the rmr (rmr2) and rhdfs packages from https://github.com/RevolutionAnalytics/RHadoop/wiki.

Set JAVA_HOME (and add it to /etc/environment so it persists), then reconfigure R's Java support:

    export JAVA_HOME=/usr/lib/jvm/default-java

    sudo R CMD javareconf


Install Cloudera CDH 5: download the one-click-install package and add the repository:

    https://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
    sudo dpkg -i cdh5-repository_1.0_all.deb

Then download and run the Cloudera Manager installer:

    https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin

Once it finishes, open the Cloudera Manager web UI (here at http://192.168.2.129:7180/) and log in with the default credentials admin / admin.

To show the manual login option on the Ubuntu login screen, edit the LightDM configuration file, now located in /usr/share/lightdm/lightdm.conf.d and called 50-ubuntu.conf, and add the line:

    greeter-show-manual-login=true


To permit root login over SSH, edit the SSH daemon configuration:

    sudo nano /etc/ssh/sshd_config

and set:

    PermitRootLogin yes

In R, install the downloaded rhdfs package from the local tarball and set the Hadoop environment variables (adjust the paths to match your installation):

    install.packages("/home/hadoop/Downloads/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")

    Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
    Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.4.2.jar")
    Sys.setenv(JAVA_HOME="/usr/lib/jvm/jre-1.8.0-openjdk.x86_64")
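To sanity-check the setup, a minimal sketch like the following can be run in the same R session (assuming rmr2 and rhdfs were installed from the downloaded packages and the environment variables above are set); the words and values are purely illustrative:

    library(rhdfs)
    library(rmr2)
    hdfs.init()

    # Push a few words to HDFS, count them with one map-reduce pass,
    # and pull the result back into R.
    words  <- to.dfs(c("hadoop", "r", "hadoop", "rmr2"))
    counts <- mapreduce(
      input  = words,
      map    = function(k, v) keyval(v, 1),
      reduce = function(k, v) keyval(k, sum(v))
    )
    from.dfs(counts)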

Influence and Correlation in Social Networks

Motivation:
Social influence is the phenomenon that the actions of a user can induce his/her friends to behave in a similar way. We draw a line between influence on the one hand and homophily and other external, environmental factors on the other, and ask how each affects the behavior of users in a social network. Existing work has established that user actions are correlated with social affiliations, but it does not address the source of the correlation. The focus here is on analyzing the sources of the correlation, which can assist important decisions such as planning viral marketing campaigns.
Challenges:
Incompleteness of data, due to privacy concerns and anonymous activity, is the biggest challenge in evaluating social influence, and it makes it harder to distinguish influence from homophily and external factors. The paper proposes a statistical test (called the shuffle test) based on the intuition that if influence is not a likely source of correlation in a system, the timing of actions should not matter, and therefore reshuffling the time stamps of the actions should not significantly change the amount of correlation.
Models of social correlation:
Correlation between the behaviors of affiliated agents in a social network is a well-known phenomenon. Formally, this means that for two nodes u and v that are adjacent in a graph G, the event that u becomes active is correlated with the event that v becomes active. There are three primary explanations for this phenomenon:
  1. Homophily: the tendency to connect to people who share some similarity.
  2. Influence: the tendency to follow the behaviors of friends and adjacent users.
  3. Confounding: correlation arising from external influences from the environment. For example, two individuals living in the same city are more likely to become friends than two random individuals.
Methodology:
  1. Measuring social correlation: The first step in the analysis is to obtain a measure of social correlation between the actions of an individual and those of her friends in the network; at each time step, the probability that a user becomes active is modeled as a function of the number of her friends that became active in previous time steps. For most tags in the Flickr data set, a logistic function with the logarithm of the number of already-active friends as the explanatory variable provides a good fit for this probability.
  2. The shuffle test: The shuffle test is based on the idea that if social influence does not play a role, then even though an agent's probability of activation may depend on her friends, the timing of her activation should be independent of the timing of other agents. Let G be a social network and W the set of users who become active during the time range [0, T]. A correlation coefficient is estimated by logistic regression from, at each time step, the number of users who become active with a given number of already-active friends and the number of users with the same number of active friends who do not become active. The time stamps of the activations are then shuffled and the coefficient is re-estimated; if influence is not at work, the two estimates should be close.
    Theoretical analysis assumptions:
  • Distribution of the activation times is uniform over the time range [0, T].
  • Each future time step is chosen independently from the uniform distribution instead of using a permutation of the original time stamps.
  • There are enough data to gather statistics.
  3. The edge-reversal test: The edge-reversal test is another test for distinguishing influence, similar to the one used in the obesity study. We reverse the direction of all the edges and run the logistic regression on the data using the new graph. Correlation arising from homophily or confounding would not be expected to change significantly, since friends have common characteristics and are affected by the same external variables regardless of which of the two individuals named the other as a friend. Social influence, however, spreads in the direction specified by the edges of the graph, and hence reversing the edges should intuitively change the estimate of the correlation.
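The following is a minimal sketch of the shuffle test on a synthetic graph, assuming the igraph package; the graph, activation times and logistic model are illustrative stand-ins for the paper's Flickr data:

    library(igraph)
    set.seed(42)

    g     <- sample_gnp(300, p = 0.03)                 # toy social network
    times <- sample(1:30, vcount(g), replace = TRUE)   # activation time of each user

    # Correlation coefficient: logistic regression of "activates at step t"
    # on ln(a + 1), where a = number of friends already active before t.
    alpha <- function(times) {
      rows <- do.call(rbind, lapply(sort(unique(times)), function(t) {
        at_risk <- which(times >= t)                   # users not yet active at step t
        a <- sapply(at_risk, function(u) sum(times[as.integer(neighbors(g, u))] < t))
        data.frame(y = as.integer(times[at_risk] == t), a = a)
      }))
      coef(glm(y ~ log(a + 1), data = rows, family = binomial))[2]
    }

    alpha(times)            # estimate from the original time stamps
    alpha(sample(times))    # shuffle test: re-estimate after permuting time stamps
    # If the two estimates are similar, influence is an unlikely source of
    # the observed correlation.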
Generative simulation models:
  • No-correlation model: The network grows exactly as in the real data. In each time step, we look at the real data to see how many new agents use the tag, and pick the same number of agents uniformly at random from the set of agents that have already joined the network and have not been picked yet.
  • Influence model: The network and its growth pattern are kept as in the real data. In every time step, each node that has joined the network but is not yet active independently flips a coin to decide whether to become active in this time step, with a probability that depends on its number of already-active friends.
  • Correlation model (no influence): The network and the pattern of growth of the population are kept the same as in the real data, and the number of activations per step follows the pattern of a tag in the real data. The model is parameterized by a parameter L, and a number of centers are chosen at random before the generation of actions.
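A rough sketch of the influence model's coin-flipping step, assuming the igraph package, a static synthetic graph (the growth of the network in the real data is omitted) and an illustrative logistic activation probability:

    library(igraph)
    set.seed(3)
    g      <- sample_gnp(500, p = 0.02)
    active <- rep(FALSE, vcount(g))
    # Activation probability grows with the number of already-active friends a.
    p_of_a <- function(a, alpha = 1, beta = -3) plogis(alpha * log(a + 1) + beta)

    for (t in 1:20) {
      a_counts <- sapply(seq_len(vcount(g)),
                         function(u) sum(active[as.integer(neighbors(g, u))]))
      active <- active | (runif(vcount(g)) < p_of_a(a_counts))   # independent coin flips
    }
    mean(active)   # fraction of users active after 20 steps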

Conclusion:
This article analyzed Influence and Correlation in Social Networks, where the shuffle test and the edge-reversal test were carried out; the cumulative and frequency distributions produced by the two tests are nearly identical, reinforcing the conclusion about the source of the correlation. For the Flickr data set, the influence that is found cannot be given much weight, and the difference between the values in the two directions of a given edge is minimal, almost zero. The current work has not considered the possible role of social status in influence; giving users a position or rank might structure the influence and could also help in understanding the behavior of the edge-reversal test.

Scale Free Networks

Why is this interesting?

Networks are everywhere, whether it is our society, our brain cells, our metabolic system, or the Internet, and it is striking how most of these networks follow the same pattern without being related to each other in any obvious way. It would also be exciting to see whether studying one network can help us understand others, for example whether understanding the network of the World Wide Web can help us kill bacteria in our body.

Background
In 1998, when researchers set out to analyze the World Wide Web, they expected it to be a random network: people follow what interests them, and since people often pursue different interests, out of the tremendous number of pages most should have approximately the same number of links. Random network theory also predicts that despite the random placement of links, the resulting system will be deeply democratic: most nodes will have approximately the same number of links. But the outcome of the research showed that more than 80% of the pages had four or fewer links, while a small minority, less than 0.01% of all nodes, had more than 1,000 links. The links did not follow a normal distribution; the distribution was a continuously decreasing function, termed a power law. Specifically, a power law does not have a peak, as a bell curve does, and when plotted on a double-logarithmic scale, a power law is a straight line.

Scale-Free Networks Abound
The main takeaway from this research was that the Web follows a network pattern dominated by a few highly connected nodes, called hubs. Hubs have a tremendous number of connections to other nodes, whereas most nodes have just a few. The hubs can have thousands or even millions of links, and in this sense the network appears to have no characteristic scale, which is why such networks are called scale-free. The airline system can be seen as a scale-free network: a few major airports act as hubs and handle most of the traffic. Viral marketing campaigns, for instance, often specifically try to target hubs to speed the adoption of a product.
Researchers found a scale-free structure in the cellular metabolic networks of 43 different organisms. In such networks, each node is a particular molecule, and each link is a biochemical reaction. Most molecules participate in just one or two reactions, but a few (the hubs), such as water and adenosine triphosphate, play a role in most of them. Even the network of actors in Hollywood, with actors as nodes and joint appearances as links, follows the scale-free pattern. This raises the question: how can systems as fundamentally different as the cell and the Internet have the same architecture and obey the same laws?

Nature of Scale-Free Networks: Growth and Preferential Attachment
Two basic mechanisms, growth and preferential attachment, eventually lead to the system being dominated by hubs. Constant growth of the network gives older nodes a greater opportunity to acquire links; this "rich get richer" effect generally favors the early nodes, which are more likely to eventually become hubs. Furthermore, not all nodes are equal. When deciding where to link their web page, people can choose from a few billion locations. Yet most of us are familiar with only a tiny fraction of the full Web, and that subset tends to include the more connected sites because they are easier to find. By simply linking to those nodes, people exercise and reinforce a bias toward them. This process of "preferential attachment" occurs elsewhere. In Hollywood the more connected actors are more likely to be chosen for new roles. On the Internet the more connected routers, which typically have greater bandwidth, are more desirable for new users.
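A minimal sketch of these two mechanisms, assuming the igraph package: a graph grown by preferential attachment (the Barabási-Albert model) whose degree distribution is roughly a straight line on a double-logarithmic plot, as described above.

    # Growth + preferential attachment: each new node attaches to existing
    # nodes with probability proportional to their current degree.
    library(igraph)
    set.seed(1)
    g   <- sample_pa(10000, m = 2, directed = FALSE)
    deg <- degree(g)
    dd  <- table(deg) / length(deg)                 # empirical P(k)
    plot(as.integer(names(dd)), as.numeric(dd), log = "xy",
         xlab = "degree k", ylab = "P(k)",
         main = "Degree distribution on a log-log scale")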

Reliability of Scale Free Networks 
Can these networks, dominated by a few hubs, be considered reliable? The research says that complex systems with a tremendous number of nodes are amazingly resilient against accidental failures because of their inhomogeneous topology: a random removal of nodes will take out mainly the small ones, because they are much more plentiful than hubs. That is why hundreds of routers routinely malfunction on the Internet at any moment yet the network rarely suffers major disruptions, and why people rarely notice the consequences of thousands of errors in their cells, ranging from mutations to mis-folded proteins. In fact, as many as 80 percent of randomly selected Internet routers can fail and the remaining ones will still form a compact cluster in which there is a path between any two nodes.
But a reliance on hubs has a serious drawback: vulnerability to attacks. Studies show that the removal of just a few key hubs from the Internet splinters the system into tiny groups of hopelessly isolated routers. Similarly, knockout experiments in yeast have shown that the removal of the more highly connected proteins has a significantly greater chance of killing the organism than the deletion of other nodes. These hubs are crucial: if mutations make them dysfunctional, the cell will most likely die.
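This contrast can be illustrated with a small sketch, assuming the igraph package and a synthetic scale-free graph: remove 10% of the nodes either at random or by targeting the highest-degree hubs and compare the size of the largest remaining component.

    library(igraph)
    set.seed(1)
    g <- sample_pa(5000, m = 2, directed = FALSE)

    # Size of the largest connected component after deleting a set of nodes.
    largest_component <- function(g, ids) max(components(delete_vertices(g, ids))$csize)

    k          <- 500                                       # 10% of the nodes
    random_ids <- sample(vcount(g), k)                      # accidental failures
    hub_ids    <- order(degree(g), decreasing = TRUE)[1:k]  # targeted attack on hubs

    largest_component(g, random_ids)   # stays large: tolerant of random failures
    largest_component(g, hub_ids)      # much smaller: fragile to hub removal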
Apart from robustness and the reliance on hubs, I find that scale-free networks also raise an issue of trustworthiness: if information spreads through hubs, how trustworthy are those hubs? For instance, only a few media houses control the news all over the world and the rest of the media follows them, so we also have to be sure that these media houses are dependable.
Conclusion
For transportation, transmission and communications systems (such as the Internet), congestion along specific links is a major consideration: too much traffic on a particular link can cause it to break down, leading to the potential failure of other links that must then handle the spillover. Determining whether a network is scale-free is important in understanding the system's behavior, but other significant parameters merit attention, too. For instance, immunizing hubs might not be sufficient to stop the spread of a disease; a more effective solution might be found by considering not just the number of connections a person has but also the frequency and duration of contact for those links. The behavior of scale-free networks encourages researchers from different areas to collaborate on modeling such networks, identifying their patterns, and working to enhance their reliability.




Sampling from Large Graphs

This article is an analysis of the paper published by Jure Leskovec and Christos Faloutsos on Sampling from Large Graphs. Graph mining is a very important part of web mining, especially considering the huge number of connected nodes in social media. The idea behind the paper is how to create a small but good sample, with similar properties, from a massive graph with millions or billions of nodes.
Graph Sampling and its Challenges:
Graph sampling is the process of deriving a small and representative sample from a given huge real graph. There are a few major challenges in graph sampling that this paper tries to answer:
  1. Which sampling method should be used?
  2. How small can the sample size be?
  3. How do we scale up the measurements of the sample, such as the diameter?
  4. How do we measure success?
There are two broader goals for graph sampling:
  1. Scale-down sampling: In scale-down sampling, the target is to derive a much smaller graph with similar graph properties, such as a similar degree distribution and a similar diameter.
  2. Back-in-time sampling: In back-in-time sampling, we try to travel back in time and replicate a past version of the graph, of which we only get to observe the final, static snapshot.
The evaluation criteria for graph sampling are also varied. For static graph patterns, measured on a single snapshot of the graph, the major properties are the in-degree and out-degree distributions, the sizes of weakly and strongly connected components, the number of hops needed to cover a distance, and the distribution of singular values of the graph adjacency matrix versus the rank. For temporal graph patterns, what matters most are the number of edges versus the number of nodes over time, the effective diameter of the graph over time, the normalized size of the largest connected component over time, the largest singular value of the adjacency matrix over time, and the average clustering coefficient over time.
Sampling Algorithms
The paper uses several distinct sampling algorithms, grouped into three categories (a small sketch of two of them follows the list):
1. Sampling by random node selection: Here we want to find the sampling method that gives the most similar degree distribution. Random node selection can be divided further:
  1. Random Node (RN) sampling: The most obvious way to create a sample graph: select a set of nodes N uniformly at random; the sample is then the graph induced by the set of nodes N.
  2. Random PageRank Node (RPN) sampling: A sample graph is created by making the probability that a node is selected into the sample proportional to its PageRank weight.
  3. Random Degree Node (RDN) sampling: In RDN sampling, the probability of a node being selected is proportional to its degree, so it has an even stronger bias towards high-degree nodes. Intuitively this method will have problems matching the degree distribution, since there are too many high-degree nodes in the sample.
2. Random edge (RE) sampling: To build a sample graph, this technique performs random selection of edges, but the sampled graphs are very sparsely connected, have a large diameter, and do not respect community structure. To overcome this, a variant called random node-edge (RNE) selection first selects a node uniformly at random and then uniformly selects an edge connected to that node.
3. Sampling by exploration: The idea behind sampling by exploration is to first select a node uniformly at random and then explore the nodes in its vicinity. A few sampling algorithms based on this technique are:
  1. Random Node Neighbor (RNN) sampling: In RNN, we select a node uniformly at random together with all of its out-going neighbors. This works well for very large graphs and matches the out-degree distribution, but fails to match the in-degree distribution.
  2. Random Walk (RW) sampling: In RW, we pick a starting node uniformly at random and then simulate a random walk on the graph. At every step, with some probability we fly back to the starting node and restart the random walk. Random walks, which normally require fewer resources per sample, can suffer from large estimation errors on disconnected or loosely connected graphs.
  3. Random Jump (RJ) sampling: Random Jump differs from Random Walk in that, with some probability, we jump to a node chosen uniformly at random from the whole graph, so the algorithm does not get stuck or fail to visit enough nodes, unlike a pure random walk.
  4. Forest Fire (FF) sampling: FF is a recursive process, where we randomly pick a seed node and begin burning its outgoing links and the corresponding nodes. If a link gets burned, the node at the other end gets a chance to burn its own links, and so on recursively.
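As promised above, here is a rough sketch of two of these strategies, Random Node (RN) and Random Walk (RW) sampling, on a synthetic graph, assuming the igraph package; the 15% fly-back probability and the sample size are illustrative choices:

    library(igraph)
    set.seed(7)
    g        <- sample_pa(5000, m = 3, directed = FALSE)   # stand-in for a large graph
    n_sample <- 500                                        # ~10% of the nodes

    # RN: induced subgraph on nodes chosen uniformly at random.
    rn <- induced_subgraph(g, sample(vcount(g), n_sample))

    # RW: walk from a random seed, flying back to the seed with probability 0.15.
    seed <- cur <- sample(vcount(g), 1)
    visited <- c()
    while (length(visited) < n_sample) {
      visited <- union(visited, cur)
      nb  <- as.integer(neighbors(g, cur))
      cur <- if (runif(1) < 0.15) seed else nb[sample.int(length(nb), 1)]
    }
    rw <- induced_subgraph(g, visited)

    # Compare degree distributions to the original graph with a two-sample
    # Kolmogorov-Smirnov statistic (smaller D = closer match).
    ks.test(degree(rn), degree(g))$statistic
    ks.test(degree(rw), degree(g))$statistic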

In scale-down sampling, the outcomes for RDN, RJ and RW are biased towards high-degree nodes and densely connected parts of the graph. Algorithms that are not biased towards high-degree nodes, such as FF, RPN and RN, match the degree distribution well: the sampled distribution has about the same shape as the true degree distribution. For the algorithms based on random edge selection, such as RE and RNE, the samples at small sample sizes are very sparsely connected and the slope of the degree distribution is too steep.
For back-in-time sampling, the edge selection techniques (RE, RNE) result in graphs that are too sparse and that rapidly fill in the missing edges once the sample size gets large. FF, RN and RPN match the temporal densification of the true graph. On the contrary, RW, RJ and RDN have roughly constant diameter, which is good for scale-down sampling, while FF, RN and RPN match the shrinking diameter of the true graph over time quite well. Similar observations hold for the size of the largest connected component and the average clustering coefficient over time.
Conclusion
This work concludes that no list of properties will ever be perfect, but a much smaller graph can match the properties of the original quite well. The major outcomes of the work are:
  1. The back-in-time methodology generates better sample graphs than the scale-down sampling methodology.
  2. Forest Fire is the best-performing sampling algorithm.
  3. Usually a sample of about 15% of the original graph is enough to match its properties.

Mining Social Media with Social Theories: A Survey

Why is it interesting?
It is interesting to see how traditional social theories can be combined with modern computational tools and data mining techniques to form a better understanding of social media data, especially since the nature of social media data differs significantly from the data used in traditional data mining.

What is Social media mining and its challenges?
Social Media Mining is the process of representing, analyzing, and extracting actionable patterns from social media data to provide better and customized services to social media users. The major challenge in social media mining is the nature of the data, which can be described as:
  1. Big Data: approximately 500 million tweets per day and around 200 billion tweets per year.
  2. Linked: data (content and users) is not independent, which breaks the independence assumptions of traditional data mining methods.
  3. Noisy: variable quality of user-generated content, spammers, and ambiguous connections.
  4. Unstructured: short texts, typos, spacing errors, emoticons, "h r u?".
  5. Incomplete: due to privacy concerns, social media data can be incomplete and extremely sparse.
What are social theories?
Social theories are rules of our society; combined with data mining techniques applied to social media data, they can help form a better understanding of that data and customize services. Three main social theories are discussed in this paper:
1. Social Correlation Theory: It states that, based on behavior, attributes and activities, adjacent users share a correlation, and that similar users have a better chance of forming a connection than two random people. Social correlation can be explained through three processes:
  1. Homophily: the tendency to connect to people who share some similarity.
  2. Influence: the tendency to follow the behaviors of friends and adjacent users.
  3. Confounding: correlation arising from external influences from the environment. For example, two individuals living in the same city are more likely to become friends than two random individuals.
2. Balance Theory: This theory is based on the intuition that "the friend of my friend is my friend" and "the enemy of my enemy is my friend", which drives relationships toward psychological balance.
3. Social Status Theory: It considers the position or rank of a user in a social community, and represents the degree of honor or prestige attached to the position of each individual.
To give a sense of how the differences between status and balance arise, consider the situation in which a user A links positively to a user B, and B in turn links positively to a user C. If C then forms a link to A, what sign should we expect this link to have? Balance theory predicts that since C is a friend of A's friend, we should see a positive link from C to A. Status theory, on the other hand, predicts that A regards B as having higher status, and B regards C as having higher status, so C should regard A as having low status and hence be inclined to link negatively to A. In other words, the two theories suggest opposite conclusions in this case.
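The A, B, C example above can be written down as a tiny sketch (signs are +1 or -1; this is just the reasoning restated in R, not code from the survey):

    # Predicted sign of the closing link C -> A, given the signs of A -> B and B -> C.
    predict_balance <- function(s_ab, s_bc) s_ab * s_bc   # product of triad signs must be positive
    predict_status  <- function(s_ab, s_bc) {
      d <- s_ab + s_bc                  # status of C relative to A
      if (d == 0) NA else -sign(d)      # C links positively only to a higher-status A
    }

    predict_balance(+1, +1)   # +1: the friend of my friend is my friend
    predict_status(+1, +1)    # -1: C sees A as having lower status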

Applying Social theories on Social Data
1. Social theories in User related tasks:
  1. Community detection: This is the process of finding implicit groups of users that are more densely connected to each other than to the rest of the network. Among the social theories, homophily suggests that similar users are likely to be linked, and influence indicates that linked users will influence each other and become more similar.
  2. User classification: Social correlation theory suggests that the labels of linked users should be correlated, so classification can be performed to infer unknown information about a user from correlated users in the same network who exhibit similar behavior.
  3. Social Spammer Detection: Based on social correlation theory, spammers behave differently from their neighbors, since most of their neighbors are normal users, whereas normal users behave similarly to their neighbors. Hence two connected normal users should be close in the latent space, while spammers should be far away from their neighbors in the latent space.
2. Social Theories in Relation Related Tasks:
  1. Link Prediction: Link prediction is commonly used in friend recommendation services in social media. Following homophily, trust relations can be predicted from users' activities and behavior by placing users in a latent space: the stronger the homophily between two users, the smaller the distance between them in the latent space. Status theory, on the other hand, suggests that new links are more likely to be created from users with low status to users with high status.
  2. Social Tie Prediction: Social tie prediction aims to automatically infer the type of a social relation from a user's activity and interactions with other users in order to provide better services; for example, a user's work style may be influenced mainly by her/his colleagues, while daily life habits may be strongly affected by her/his family.
  3. Tie Strength Prediction: Among the heterogeneous relationships on social media, social correlation theory can be used to determine how strong the relation between two users is, by assigning a continuous value between 0 and 1 rather than a binary label.
3.  Social Theories in Content Related Tasks:
  1. Social Recommendation: Social correlation theory suggests that a user's preferences are similar to, or influenced by, those of their directly connected friends, and ensemble methods use this intuition to predict a user's missing values based on his or her social network.
  2. Feature Selection: By analyzing user-generated content together with its social context, based on social correlation theory, a feature selection framework can handle high-dimensional social media data effectively.
  3. Sentiment Analysis: Social correlation theory indicates that the sentiments of two linked users are likely to be similar, so user-user social relations and user-tweet relations can be used to propagate sentiment labels from labeled to unlabeled tweets.
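A hedged sketch of this last idea on a synthetic user-user graph, assuming the igraph package (the survey does not prescribe this particular algorithm): unlabeled users repeatedly take the average sentiment of their neighbors.

    library(igraph)
    set.seed(5)
    g <- sample_gnp(100, p = 0.08)                     # toy user-user graph
    sentiment <- rep(NA_real_, vcount(g))              # sentiment in [-1, 1]; NA = unlabeled
    sentiment[sample(vcount(g), 30)] <- sample(c(-1, 1), 30, replace = TRUE)

    # Iteratively assign each unlabeled user the mean sentiment of its neighbors.
    for (iter in 1:10) {
      for (u in which(is.na(sentiment))) {
        nb <- sentiment[as.integer(neighbors(g, u))]
        if (any(!is.na(nb))) sentiment[u] <- mean(nb, na.rm = TRUE)
      }
    }
    table(sign(sentiment), useNA = "ifany")            # propagated sentiment polarity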

In this article, we reviewed three key social theories, i.e., social correlation theory, balance theory and status theory and stated the possibilities of integrating these social theories with computational models. As future directions, more existing social theories, such as structural hole theory and weak tie theory could be employed or new social theories could be discovered to advance social media mining.