The messiness of public Facebook data: What is it that people do?

Written by Jakob Bæk Kristensen, researcher.

Social media sites that allow for interaction and the public sharing of opinions between people are often viewed as uncontrollable or anarchistic spaces that fuel infantile comments or hateful slurs with no concrete purpose or direction. People are voicing their opinions, and neither why they do it nor whether others care about it is exactly clear. Still, millions and millions of likes and comments are made on public pages every day.

Although it would be presumptuous to expect easily explainable, unified patterns to arise from the totality of public interaction, the sheer magnitude of seemingly random actions provokes a basic curiosity, which can be expressed with the questions: What is it that people do? Is there any general meaning to be extracted from it?

The very simple questions posed above provided the essential motivation for the thesis project I chose for completing my degree in Media Studies at the University of Copenhagen. I am of course not the first to ask these questions, but outside of Facebook’s own research group, very few studies have tried to tap into the potential of large-scale, publicly available Facebook data.

My project was an experiment and an exploration rather than a case argument or the testing of a hypothesis. The ambition was to derive a general overview from public Facebook data and find patterns where patterns were not apparent. It was not a study of specific cases, but an attempt to map out the totality of public Facebook activity in all its glorious messiness.

In undertaking this challenge, a few problems quickly became apparent. First, I needed a focus, or an angle. It is impossible to find meaningful patterns in Facebook activity without some expectation about the kind of meaning that can be extracted from it. User interactions must be seen in relation to something.

Another problem was technical capacity. Including too much data made it very time-consuming to sort and analyze on a single laptop. This was partly due to my own lack of technical knowledge about Big Data technologies such as Hadoop and NoSQL. Instead, all tools for the project were built with Python and standard SQL.

In response to the two problems, I chose to focus on politically motivated user behavior on public Facebook pages. Data would be extracted from (roughly) the 148 most popular political pages and the 88 most popular media pages. These pages include party pages and individual politicians as well as television, radio, newspaper and magazine media. To stay true to my objective of mapping the messiness, all activities such as posts, likes, comments, comment-likes, replies and tags were collected from the pages. The finished collection of all data from these pages in 2015 amounted to 18 GB and more than 70 million unique data points.

Now what? Just having an angle and a fixed collection of pages is not enough. It is necessary to have methods for parsing, analyzing and visualizing the data in a meaningful way.

The design of methods and tools is intricately bound to the specific metrics and parameters you want to explore. So how is politically motivated user behavior best explored?

I first felt I needed a way to categorize users in relation to their political affiliation. To do this I tapped into the users’ public like-history. This roughly entails looking at how many times a user has liked posts on political pages that correspond to any of the 9 political parties in the Danish parliament. In a somewhat surprising turn, early testing showed that most users’ like-histories conform neatly to a polarized political landscape divided into right and left wing supporters. It also showed that more than 600,000 Danish users have liked a post on any given political page at least 7 times. This is quite a large number for a country as small as Denmark.
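
The categorization step can be sketched roughly as follows. This is a minimal illustration and not the code used in the thesis: the placeholder party labels, the party-to-wing mapping, the function name `classify_user` and the majority rule for picking a wing are all assumptions. Only the idea of counting likes per party and the threshold of at least 7 likes come from the description above.

```python
from collections import Counter

# Hypothetical mapping from party label to wing. The project covered the
# 9 parties in the Danish parliament; placeholder labels are used here.
PARTY_WING = {
    "party_a": "left",
    "party_b": "left",
    "party_c": "right",
    "party_d": "right",
}

MIN_LIKES = 7  # minimum number of political likes, as mentioned in the text


def classify_user(liked_parties, min_likes=MIN_LIKES):
    """Return 'left', 'right' or None for one user's like-history.

    liked_parties: iterable of party labels, one entry per liked post.
    """
    wings = Counter(PARTY_WING[p] for p in liked_parties if p in PARTY_WING)
    total = sum(wings.values())
    if total < min_likes:
        return None  # too little activity to categorize
    left, right = wings["left"], wings["right"]
    if left == right:
        return None  # no clear leaning (assumed tie rule)
    return "left" if left > right else "right"


# Example: six likes on a left wing party page, two on a right wing one.
history = ["party_a"] * 6 + ["party_c"] * 2
print(classify_user(history))
```

A simple majority rule like this is only one possible choice; a stricter rule, for instance requiring a clear margin between the two wings, would leave more users uncategorized.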

For this reason right and left wing users became the main categories and served as the relation that would inform most of the interaction patterns to be derived. The left/right wing categories also provided the first significant step in reducing complexity and conquering the messiness of the data. However, the act of reducing complexity, especially in binary categories such as right/left wing, should be a point of criticism since the reduction might produce a map of reality that is overly simplistic.

As mentioned, the main motivation for doing the project was to map out and produce a readable overview of messy public Facebook data. I realized that even with the right/left wing categories as a basic focal point it was difficult to create an extensive overview using just a single tool or method. I therefore elected to design a range of simple tools that would show different aspects of the data and provide complementary views.

All the tools were built in Python and designed specifically to work with the data collected for the project. Leaving aside their limits, results and possibilities for interpretation, the tools designed for the project were the following.

Tools overview:

User activity distribution (bar chart): Measures whether some activities, e.g. comment-likes and replies, elicit different distributions of activity between highly active and less active users. This was done specifically in relation to the 1% most active, the 9% highly active and the 90% least active users.
Interaction between users with diverging political views over time (line graph): Measures what percentage of e.g. comment-likes or replies are made between users where one user is left wing and the other is right wing. The measure is shown as it changes over time.
Likes and comments per post in relation to political affiliation and gender (scatter plot): Plots individual posts from a selection of pages according to the percentage of right or left wing users who like or comment on the post, as well as the percentage of men and women. The patterns that arise on different pages can then be compared.
Keyword frequencies (line graph): Measures how much a word is used by right or left wing users over time in comments and replies. (Tool not included in the final version of the project due to page count restrictions.)
Network analysis (network graph): Measures the connections between right/left wing users and pages through different activities, e.g. comments, tags and replies. It provides patterns of interconnectivity and centrality within the emerging networks that are produced through user interaction on public pages.
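
To make the first two tools more concrete, here is a minimal sketch of the measures behind them. It is an illustration under stated assumptions, not the project's actual code: the function names `activity_share_by_tier` and `cross_wing_share`, the input shapes and the rounding used to cut the 1% and 10% boundaries are all hypothetical.

```python
from collections import Counter


def activity_share_by_tier(user_ids):
    """Share of all actions made by the top 1%, next 9% and bottom 90% of users.

    user_ids: iterable with one user id per action (e.g. per comment-like).
    """
    counts = sorted(Counter(user_ids).values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return {"top_1%": 0.0, "next_9%": 0.0, "bottom_90%": 0.0}
    n_users = len(counts)
    top1 = max(1, round(n_users * 0.01))      # assumed cut-off rule
    top10 = max(top1, round(n_users * 0.10))  # assumed cut-off rule
    return {
        "top_1%": sum(counts[:top1]) / total,
        "next_9%": sum(counts[top1:top10]) / total,
        "bottom_90%": sum(counts[top10:]) / total,
    }


def cross_wing_share(interactions, wing_of):
    """Fraction of interactions between a left wing and a right wing user.

    interactions: iterable of (actor_id, target_id) pairs.
    wing_of: dict mapping user id to 'left' or 'right'.
    Pairs involving an uncategorized user are ignored.
    """
    known = [(a, b) for a, b in interactions if a in wing_of and b in wing_of]
    if not known:
        return 0.0
    cross = sum(1 for a, b in known if wing_of[a] != wing_of[b])
    return cross / len(known)


# Toy data: one very active user, nine fairly active, ninety with one action.
actions = ["u0"] * 50
for i in range(1, 10):
    actions += [f"u{i}"] * 5
actions += [f"u{i}" for i in range(10, 100)]
tiers = activity_share_by_tier(actions)

wings = {"a": "left", "b": "right", "c": "right"}
replies = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "unknown")]
share = cross_wing_share(replies, wings)
```

To follow the measure over time, as the line graph does, `cross_wing_share` could be applied separately to the interactions of each week or month.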

The individual results produced with the tools cannot be covered in this short blog post. The image below is an example of one of the network graphs that were employed in the thesis. It shows that a fair number of political pages are central nodes in the network, meaning that they have a significant percentage of visitors from the opposite political wing. However, politicians such as Pia Kjærsgaard, Anders Samuelsen and Johanne Schmidt-Nielsen have rather large congregations of supporters who only visit their respective pages.

The graph shows relations between pages and users. In this particular graph, relations represent all types of interaction: likes, comments, comment-likes, replies, reply-likes and tags. Bigger nodes denote pages and smaller ones are single users. Blue nodes signify right-wingers and red nodes signify left-wingers. The data included come from a random selection of 5 days. Repeated simulations yield a graph that is 97% similar.
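
The reading of the graph above rests on how large a share of a page's visitors come from the opposite political wing. The sketch below computes that share directly from a list of user-page relations. It is not the network analysis used in the thesis, only a simplified stand-in; the function name `opposite_wing_visitor_share` and the input format are assumptions.

```python
from collections import defaultdict


def opposite_wing_visitor_share(edges, user_wing, page_wing):
    """For each page, the fraction of distinct visitors from the other wing.

    edges: iterable of (user_id, page_id) pairs, one per interaction.
    user_wing, page_wing: dicts mapping ids to 'left' or 'right'.
    """
    visitors = defaultdict(set)
    for user, page in edges:
        if user in user_wing and page in page_wing:
            visitors[page].add(user)  # count each user once per page
    shares = {}
    for page, users in visitors.items():
        opposite = sum(1 for u in users if user_wing[u] != page_wing[page])
        shares[page] = opposite / len(users)
    return shares


# Toy data with hypothetical ids.
edges = [
    ("u1", "page_x"), ("u2", "page_x"), ("u3", "page_x"), ("u3", "page_x"),
    ("u4", "page_y"),
]
user_wing = {"u1": "left", "u2": "right", "u3": "right", "u4": "left"}
page_wing = {"page_x": "right", "page_y": "left"}
shares = opposite_wing_visitor_share(edges, user_wing, page_wing)
```

In this toy data `page_x` draws visitors from both wings, while `page_y` is only visited by its own side, which corresponds to the kind of isolated congregation of supporters described above.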

The example above gives only a small taste of the results obtained. The full set of results must be understood as a complete narrative where each tool covers just one aspect.

In sum, the study has shown that user activity is very unevenly distributed. In some cases the 10% most active users are responsible for 80% of the activity. Public interaction mirrors a very polarized political landscape; in many cases, however, people actively seek out pages and users who do not share their own political views. Yet posts that tend to center on more concrete political issues generate significantly fewer comments that gain support from the opposite political wing. Thus agonism and political strife, in contrast to deliberation and consensus, seem to constitute the predominant pattern.

All in all, the attempt to derive meaningful patterns from the messiness of public Facebook data was generally successful. The approach taken in this project, based on careful collection and exploration of the dataset, has shown the potential of experimental, computational methods for analyzing social data and finding new patterns. While public Facebook pages represent only an infinitesimal part of people’s social reality, the patterns found can provide insights into the potential of the communicative infrastructure of Facebook and similar constructs.

The project represents an alternate lens for viewing human culture and extracting meaningful patterns from underneath the brutal randomness of cheap, easily accessible and undirected forms of public interaction found on Facebook.

Questions about the methods developed or the specific results produced by the project can be directed to ETHOS Lab. Please write to us at ethos@itu.dk.
