Analyzing #GamerGate

I’m going to detail an easy way to analyse #GamerGate tweets. I’ve heard they have a “fake data scientist”; I’m not sure if this is true as, being a Womble, I only pick bits and pieces up here and there. But one thing is for sure: if he really did only suck in 5K tweets for his “analysis” then he is not the best data scientist on the common today. I’ll show you how you can easily hoover up vast numbers of GG tweets, for whatever data analysis requirements you may have. Read on for a whirlwind overview of how to out-data-scientist the #GamerGate data scientist.

Let me introduce you to something called ElasticSearch: a system for storing structured, semi-structured and unstructured data, and for indexing and retrieving it with queries. It also handles JSON quite nicely, which happens to be the native format of any tweets you might download. Let me also introduce you to something called Kibana; this works with ElasticSearch to give you a nice funky interface to query and visualise the data.

Now, you’ve got some Linux, right? Seriously, you need Linux for this tutorial, go, get it!

Download ElasticSearch, download Kibana…. Usually the next line would be to download Logstash, but in the burrow we don’t get much power, so our WomblePC is fairly low powered and Logstash uses large amounts of memory (not that ES and Kibana don’t, but our little server was straining!). So I went for Fluentd (in its td-agent packaging) instead. Download that as well…

I’m going to assume you know how to extract and run these using “nohup” so they run in the background. If you don’t, you can skip to the end to see what sort of data you can get, as this might be a little too technical for you…. If you get stuck, leave me a comment!
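For those who do want to follow along, it looks roughly like this (the paths and version numbers are only examples, adjust them to whatever you actually downloaded):

tar -xzf elasticsearch-1.4.4.tar.gz
nohup elasticsearch-1.4.4/bin/elasticsearch > es.log 2>&1 &
tar -xzf kibana-4.0.1-linux-x64.tar.gz
nohup kibana-4.0.1-linux-x64/bin/kibana > kibana.log 2>&1 &
curl localhost:9200

That last curl should come back with a small JSON blob if ElasticSearch is up and listening.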

For the Fluentd Twitter plugin you need eventmachine, install it.

/opt/td-agent/embedded/bin/fluent-gem install eventmachine

If that fails you may need gcc / build-essential; install those and retry.
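On a Debian/Ubuntu flavoured burrow that is something along the lines of (other distros will have their own package names):

sudo apt-get install build-essential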

Then the plugin:

/opt/td-agent/embedded/bin/fluent-gem install fluent-plugin-twitter

Now you are ready to GO!

… Oh wait, not quite yet. Do you have a Twitter application? No? Go to https://apps.twitter.com/ and create one (you need a Twitter account with a verified phone number attached). Then copy the OAuth details off and save them into the Fluentd configuration –

vi /etc/td-agent/td-agent.conf

<source>
  type twitter
  consumer_key BlahBlahHereNoQuotes
  consumer_secret BlahBlahHereNoQuotes
  oauth_token 1884422173-BlahBlahHereNoQuotes
  oauth_token_secret BlahBlahHereNoQuotes
  tag input.twitter.gg
  # use the streaming "tracking" mode and follow the #GamerGate keyword
  timeline tracking
  keyword #GamerGate
  output_format nest
</source>

<match input.twitter.**>
  # ship anything tagged input.twitter.* into ElasticSearch (default localhost:9200),
  # using logstash-<date> index names so Kibana picks them up
  type elasticsearch
  logstash_format true
  flush_interval 10s
</match>

Excellent, you are almost ready to start analysing some tweets…. Restart the td-agent:

sudo service td-agent restart

Check the log for errors:

tail -f /var/log/td-agent/td-agent.log

Now you’ll be getting in tweets, hopefully… How many have I got? In Kibana, go to Settings -> Indices and add one; a search should show there is an index called “logstash-<date>”. Add an index pattern of “logstash*” so it picks up every date.
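You can also peek from the command line before Kibana gets involved; ElasticSearch will happily list its indices and their document counts (this assumes it is sitting on the default localhost:9200):

curl 'localhost:9200/_cat/indices?v'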

Drop into the “Discover” tab, and you can see the tweets from the last 15 mins by default. In mine, however, I can see 285,751 tweets, as I’ve been running it a while; somewhat better than 5K. Other than a few glitches trying to get Logstash to run on the underpowered WomblePC it works quite well!

(Screenshot, 2015-03-27: the Kibana Discover view of the collected tweets.)

How about a bit of fun? Who in this period has tweeted more to #GamerGate than anyone else? Go to Visualize -> From a new search -> choose your index pattern -> choose vertical bar chart. Add an aggregation on the X-Axis and choose “Terms”; you can then select a field, the obvious one being user.screen_name. Who cares most about “Ethics in Journalism”?

(Screenshot, 2015-03-27: Kibana bar chart of the top #GamerGate tweeters by user.screen_name.)

Click the ^ icon at the bottom of the chart and you can get the raw numbers, even see the call made to ES and the raw JSON reply back. In this case 4rtt5ty is the, err, winner.

4rtt5ty 1500
b00nes 1234
a_man_in_green_ 1142
_wcs_ 1053
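Incidentally, the call Kibana makes underneath is just a terms aggregation. A stripped-down sketch of the same thing fired at ES directly looks something like this (Kibana’s real request has extra time filters and bells on it):

curl -XPOST 'localhost:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "top_tweeters": {
      "terms": { "field": "user.screen_name", "size": 10 }
    }
  }
}'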

But what about most influential? Someone must be getting lots of RTs and favs, who could it be … Favorites first, as we all need a favorite in our lives!

(Screenshot, 2015-03-27: Kibana bar chart of the most favourited accounts.)

 

sushilulutwitch 903674
professorf 670464
davidmdraiman 571973
_icze4r 542447

Huh, don’t know what to say. A #GamerGate “neutral” is the most fav’d by #GamerGate, seems legit. (Note the figure is not accurate, as the favourite count gets added up again for every RT, but it is still related to popularity.) How about RTs? Well, I won’t bother pasting them here as it is quite clear who the tribe of #GamerGate feels speaks to them the most: @sushilulutwitch…. So is she a neutral? All I know is my gut says, maybe.
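If you fancy getting that favourites number without clicking around Kibana, a terms bucket with a sum metric underneath does the same job. The field names here are my assumption based on the standard tweet JSON (retweeted_status.user.screen_name and retweeted_status.favorite_count); check your own mapping before trusting the numbers:

curl -XPOST 'localhost:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "most_favd": {
      "terms": { "field": "retweeted_status.user.screen_name", "size": 10 },
      "aggs": {
        "fav_total": { "sum": { "field": "retweeted_status.favorite_count" } }
      }
    }
  }
}'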

Finally, there are all sorts of interesting things you can do with the data once it is in ElasticSearch. I hope to show you sometime, when I’m not picking up all the rubbish you humans leave around my home 😡

 

PS, the most RT’d tweets in all this morass show GG is unfortunately (for them) fighting an uphill battle; they are not well liked. The top four are against GamerGate… The fifth one isn’t, but I can’t paste it here as I’d lose the SJW from SJWomble, and WTF is an Omble anyway?

 


3 thoughts on “Analyzing #GamerGate”

  1. Hi, thanks for the amazing tutorial.

    I set everything up correctly and can visualise the top RT usernames and the users who tweet the most. But when I try to list the 10 most RT’d tweets it only shows one word. It seems like td-agent doesn’t provide a raw value the way Logstash does? Or did I set something up wrong?


    1. There is an issue with how it handles raw values; everything can end up assigned to an “analyzed” field. The way ElasticSearch decides what mapping to use for a given index is based on the template, so naming your index so it matches “logstash-*” will make it use the logstash template, if one is set up on ElasticSearch.

      You can also force it not to analyze any fields by setting up a template of your own.

      You can see what it is currently using with:

      curl -XGET localhost:9200/_template/

      If there is a logstash one in there, make sure your index is called logstash-{date}; that might fix it.

      If not, then you can set one that makes sure no fields are analyzed, which helps reduce space and CPU usage as well! DISCLAIMER: if you are using ES for anything else this could fuck it up 😀

      curl -XPUT localhost:9200/_template/template_1 -d '{
        "template": "*",
        "settings": {
          "index.refresh_interval": "5s"
        },
        "mappings": {
          "_default_": {
            "_all": {
              "enabled": true
            },
            "dynamic_templates": [
              {
                "string_fields": {
                  "match": "*",
                  "match_mapping_type": "string",
                  "mapping": {
                    "index": "not_analyzed",
                    "omit_norms": true,
                    "type": "string"
                  }
                }
              }
            ],
            "properties": {
              "@version": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }'

      Of course you could also use Logstash; my mini PC server ran out of memory running it, but it is very easy to set up to suck in tweets from Twitter too!

