Influence and Correlation in Social Networks

Motivation:
Social influence is the phenomenon in which a user's actions induce his or her friends to behave in a similar way. It must be distinguished from homophily and from external factors in the environment, both of which can also make the behavior of connected users look alike. Existing work has established that user actions correlate with social affiliations, but it does not address the source of that correlation. The focus here is on identifying that source, which can inform important decisions such as the design of viral marketing campaigns.
Challenges:
Incomplete data, caused by privacy concerns and anonymous activity, is the biggest challenge in evaluating social influence, because it makes influence harder to distinguish from homophily and from external factors. We propose a statistical test, called the shuffle test, based on the following intuition: if influence is not a likely source of correlation in a system, the timing of actions should not matter, so reshuffling the time stamps of the actions should not significantly change the amount of correlation.
Models of social correlation:
Correlation between the behaviors of affiliated agents in a social network is a well-known phenomenon. Formally, for two nodes u and v that are adjacent in a graph G, the event that u becomes active is correlated with the event that v becomes active. There are three primary explanations for this phenomenon:
  1. Homophily: the tendency to connect to people who share similar characteristics.
  2. Influence: the tendency to follow the behavior of friends and adjacent users.
  3. Confounding: correlation induced by external factors in the environment. For example, two individuals living in the same city are more likely to become friends than two random individuals, and their shared surroundings may also lead them to act alike.
Methodology:
  1. Measuring social correlation: The first step in our analysis is to obtain a measure of the social correlation between the actions of an individual and those of her friends in the network. At each time step, we estimate the probability that an inactive user becomes active as a function of the number of her friends who became active in previous time steps. Flickr records the actions, and for most tags in the Flickr data set a logistic function with the logarithm of the number of active friends as the explanatory variable provides a good fit for this probability.
  2. The shuffle test: The shuffle test is based on the idea that if social influence does not play a role, then even though an agent's probability of activation could depend on her friends, the timing of that activation should be independent of the timing of other agents' activations. Let G be a social network and W the set of users activated during the time range [0, T]. For each time step and each number of active friends, we count the users who had that many active friends at the beginning of the step and became active during it, and likewise the users who had that many active friends but remained inactive. Logistic regression on these counts gives the correlation estimate. We then reshuffle the activation time stamps among the users in W and repeat the estimation; if influence is not at work, the two estimates should be close.
    Theoretical analysis assumptions:
  • Distribution of the activation times is uniform over the time range [0, T].
  • Each reshuffled time stamp is drawn independently from the uniform distribution instead of being taken from a permutation of the original time stamps.
  • There is enough data to gather reliable statistics.
  3. The edge-reversal test: The edge-reversal test distinguishes influence from other sources of correlation and is similar to the test used in the obesity study. We reverse the direction of all the edges and run logistic regression on the data using the new graph. If the correlation comes from homophily or confounding, the estimate should not change significantly, since friends have common characteristics and are affected by the same external variables regardless of which of the two individuals named the other as a friend. Social influence, however, spreads in the direction specified by the edges of the graph, so reversing the edges should intuitively change the estimate of the correlation.
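The three steps above can be sketched in Python under some loudly stated assumptions. The explanatory variable ln(a + 1) (the +1 keeps the logarithm defined when a user has no active friends), the discrete time steps 0..T-1, and every function name below (exposure_counts, fit_alpha, shuffle_times, reverse_edges) are illustrative choices, not details confirmed by the text. The fit is a small hand-written Newton iteration standing in for whatever logistic regression routine was actually used.

```python
import math
import random

def exposure_counts(friends, act_time, T):
    """Build (a, activated) observations for every user and step while inactive.

    friends[u]  : set of users whose activity u can observe (hypothetical layout)
    act_time[u] : activation step of u in [0, T); users missing never activate
    """
    rows = []
    for u in friends:
        tu = act_time.get(u, T)  # T stands for "never activated in the window"
        for t in range(min(tu + 1, T)):
            # Friends count as active if they activated strictly before step t.
            a = sum(1 for v in friends[u] if act_time.get(v, T) < t)
            rows.append((a, 1 if t == tu else 0))
    return rows

def fit_alpha(rows, iters=50):
    """Fit p(a) = sigmoid(alpha * ln(a + 1) + beta) by Newton's method.

    Returns alpha, used here as the measure of social correlation.
    The ln(a + 1) form is an assumption of this sketch.
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for a, y in rows:
            x = math.log(a + 1)
            z = max(-30.0, min(30.0, alpha * x + beta))  # avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g_a += (y - p) * x
            g_b += (y - p)
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:  # degenerate data, stop rather than divide by ~0
            break
        alpha += (h_bb * g_a - h_ab * g_b) / det
        beta += (h_aa * g_b - h_ab * g_a) / det
    return alpha

def shuffle_times(act_time, rng):
    """Shuffle test: permute activation times among the activated users W."""
    users = list(act_time)
    times = [act_time[u] for u in users]
    rng.shuffle(times)
    return dict(zip(users, times))

def reverse_edges(friends):
    """Edge-reversal test: v in friends[u] becomes u in reversed[v]."""
    rev = {u: set() for u in friends}
    for u, vs in friends.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)
    return rev
```

To run the shuffle test with this sketch, one would compare fit_alpha(exposure_counts(friends, act_time, T)) with the same estimate computed on shuffle_times(act_time, random.Random(0)). For the edge-reversal test, the comparison is against the estimate computed on reverse_edges(friends). Estimates that stay close in both comparisons would point away from influence.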
Generative simulation model:
  • No-correlation model: The network grows exactly as in the real data. In each time step, we look at the real data to see how many new agents use the tag, and pick the same number of agents uniformly at random from the set of agents that have already joined the network and have not been picked yet.
  • Influence model: The network and its growth pattern are kept as in the real data. In every time step, each node that has joined the network but has not yet been activated flips a coin independently to decide whether to become active in this time step.
  • Correlation model (no influence): We keep the network and the growth pattern of the population the same as in the real data. The model is parameterized by L and follows the activation pattern of a tag in the real data. A number of centers are chosen at random before the actions are generated.
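As a rough illustration of the first of these models, the no-correlation model can be sketched as follows. The dictionary layout (join_time, real_act_time), the integer time steps, and the function name are assumptions made for this example only.

```python
import random

def no_correlation_model(join_time, real_act_time, T, rng):
    """Simulate activations that match the real per-step counts but ignore the graph.

    join_time[u]     : step at which u joined the network (hypothetical layout)
    real_act_time[u] : step in [0, T) at which u first used the tag in the real data
    """
    # How many new agents used the tag at each step in the real data.
    per_step = [0] * T
    for t in real_act_time.values():
        per_step[t] += 1

    sim_act_time = {}
    for t in range(T):
        # Agents that have already joined and have not been picked yet.
        pool = sorted(u for u, j in join_time.items()
                      if j <= t and u not in sim_act_time)
        k = min(per_step[t], len(pool))
        # Pick the same number of agents as in the real data, uniformly at random.
        for u in rng.sample(pool, k):
            sim_act_time[u] = t
    return sim_act_time
```

Because the picks ignore friendships entirely, running the correlation estimate on data generated this way gives a baseline for what "no social correlation" looks like on the same network.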

Conclusion:
This article analyzes influence and correlation in social networks by applying the shuffle test and the edge-reversal test. The results show that the cumulative distribution and the frequency distribution of the correlation estimates are nearly identical before and after each test, which reinforces the view that the correlation is not driven by influence. For the Flickr data set, the influence found therefore cannot be given much weight: the difference between the values in the two directions of a given edge is minimal, almost zero. The current work does not consider the possible role of social status in influence. Giving users a position or rank might help organize the influence and could also help in understanding the behavior of the edge-reversal test.

Mining Social Media with Social Theories: A Survey

Why is it interesting?
It is interesting to see how traditional social theories can be combined with modern computational tools and data mining techniques to better understand social media data, especially since the nature of social media data differs significantly from the data used in traditional data mining.

What is social media mining and what are its challenges?
Social media mining is the process of representing, analyzing, and extracting actionable patterns from social media data in order to provide better and customized services to social media users. The major challenge in social media mining is handling the data, which can be described as:
  1. Big data: approximately 500 million tweets per day, or around 200 billion tweets per year.
  2. Linked data: the data (content and users) are not independent, which violates the independence assumption of traditional data mining methods.
  3. Noisy: user-generated content varies in quality, and the data contain spammers and ambiguous connections.
  4. Unstructured: short texts, typos, spacing errors, emoticons, and abbreviations such as "h r u?".
  5. Incomplete: because of privacy concerns, social media data can be incomplete and extremely sparse.
What are social theories?
Social theories are rules about how our society behaves. Data mining techniques that build on them can form a better understanding of social media data and help customize services. This paper discusses three main social theories:
1. Social Correlation Theory: It states that adjacent users are correlated in their behavior, attributes, and activities, and that such similar users are more likely to be connected than two random people. Social correlation can be explained by three processes:
  1. Homophily: the tendency to connect to people who share similar characteristics.
  2. Influence: the tendency to follow the behavior of friends and adjacent users.
  3. Confounding: correlation induced by external factors in the environment. For example, two individuals living in the same city are more likely to become friends than two random individuals.
2. Balance Theory: This theory is based on the intuition that “the friend of my friend is my friend” and “the enemy of my enemy is my friend”, a tendency that drives relationships toward psychological balance.
3. Social Status Theory: It considers the position or rank of a user in a social community, and represents the degree of honor or prestige attached to the position of each individual.
To give a sense of how the differences between status and balance arise, consider the situation in which a user A links positively to a user B, and B in turn links positively to a user C. If C then forms a link to A, what sign should we expect this link to have? Balance theory predicts that since C is a friend of A’s friend, we should see a positive link from C to A. Status theory, on the other hand, predicts that A regards B as having higher status, and B regards C as having higher status, so C should regard A as having low status and hence be inclined to link negatively to A. In other words, the two theories suggest opposite conclusions in this case.
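The A, B, C example can be turned into a small sketch that predicts the sign of the closing link C to A under each theory. The function name, the encoding of signs as +1 and -1, and the assumption that every link changes status by one unit step are choices made for this illustration.

```python
def predict_closing_sign(sign_ab, sign_bc):
    """Predict the sign of the link C->A given signed links A->B and B->C.

    Signs are +1 (positive) or -1 (negative). Returns a dict with the
    prediction of each theory; 0 means the theory makes no prediction.
    """
    if sign_ab not in (1, -1) or sign_bc not in (1, -1):
        raise ValueError("signs must be +1 or -1")

    # Balance theory: a triangle is balanced when the product of its
    # three signs is positive, so the closing sign is the product of the other two.
    balance = sign_ab * sign_bc

    # Status theory (unit-step assumption): a positive link u->v says v ranks
    # one step above u, a negative link says v ranks one step below u.
    c_minus_a = sign_ab + sign_bc  # status of C relative to A
    if c_minus_a > 0:
        status = -1   # A ranks below C, so C links negatively to A
    elif c_minus_a < 0:
        status = 1    # A ranks above C, so C links positively to A
    else:
        status = 0    # same rank, no prediction
    return {"balance": balance, "status": status}

# The case described in the text: A->B positive and B->C positive.
print(predict_closing_sign(1, 1))  # {'balance': 1, 'status': -1}
```

The two theories disagree exactly when both given links have the same sign under these assumptions; with mixed signs, balance theory predicts a negative link while the unit-step status model is undecided.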

Applying social theories to social data
1. Social theories in user-related tasks:
  1. Community detection: The process of finding implicit groups of users who are more densely connected to each other than to the rest of the network. Homophily suggests that similar users are likely to be linked, and influence indicates that linked users will influence each other and become more similar.
  2. User classification: Social correlation theory suggests that the labels of linked users in social media should be correlated, so classification can infer the unknown information of a user from the correlated users in the same network, which exhibit similar behavior.
  3. Social spammer detection: Based on social correlation theory, spammers behave differently from their neighbors, since most of their neighbors are normal users, whereas normal users behave similarly to their neighbors. Hence two connected normal users should be close in the latent space, while spammers should be far away from their neighbors in the latent space.
2. Social theories in relation-related tasks:
  1. Link prediction: It is commonly used in friend recommendation services in social media. To exploit homophily when predicting trust relations from users' activities and behavior, users are embedded in a latent space: the stronger the homophily between two users, the smaller the distance between them in the latent space. Status theory, on the other hand, suggests that new links are more likely to go from users with low status to users with high status.
  2. Social tie prediction: It aims to automatically infer the types of social relations from a user's activity and interaction with other users in order to provide better services. For example, one user’s work style may be mainly influenced by her or his colleagues, while daily life habits may be strongly affected by her or his family.
  3. Tie strength prediction: Among the heterogeneous relationships on social media, social correlation theory can be used to determine how strong the relation between two users is, by assigning a continuous value between 0 and 1 rather than a binary label.
3. Social theories in content-related tasks:
  1. Social recommendation: Social correlation theory suggests that a user’s preferences are similar to or influenced by those of their directly connected friends. Ensemble methods use this intuition to predict a user's missing values from the user's social network.
  2. Feature selection: By analyzing user-generated content together with its social context, a feature selection framework based on social correlation theory can handle high-dimensional social media data effectively.
  3. Sentiment analysis: Social correlation theory indicates that the sentiments of two linked users are likely to be similar. User-user social relations and user-tweet relations can therefore be used to assign sentiment labels to unlabeled tweets.
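Several of the tasks above, user classification and sentiment analysis in particular, rest on the same intuition that linked nodes tend to share labels. The following is a minimal sketch of that intuition only, using a plain majority vote among labeled neighbors. It is a deliberately simple stand-in and not one of the methods covered by the survey; the function name and data layout are hypothetical.

```python
from collections import Counter

def infer_labels(neighbors, known):
    """Guess missing labels from the labels of directly linked nodes.

    neighbors[u] : iterable of nodes linked to u (hypothetical layout)
    known[u]     : label of u, for the nodes whose label is known
    Returns a dict with a guessed label for each unlabeled node that has
    at least one labeled neighbor.
    """
    inferred = {}
    for u, nbrs in neighbors.items():
        if u in known:
            continue
        votes = Counter(known[v] for v in nbrs if v in known)
        if not votes:
            continue  # no labeled neighbor, so no evidence to use
        top = max(votes.values())
        # Break ties deterministically by taking the smallest label.
        inferred[u] = min(label for label, c in votes.items() if c == top)
    return inferred

# Toy network: "a" has two positive neighbors and one negative neighbor.
links = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"], "e": []}
labels = {"b": "pos", "c": "pos", "d": "neg"}
print(infer_labels(links, labels))  # {'a': 'pos'}
```

The same vote could be applied to tweets instead of users by treating user-tweet relations as links, which is one way to read the sentiment analysis item above.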

In this article, we reviewed three key social theories, i.e., social correlation theory, balance theory, and status theory, and stated the possibilities of integrating these social theories with computational models. As future directions, more existing social theories, such as structural hole theory and weak tie theory, could be employed, or new social theories could be discovered, to advance social media mining.