After writing the previous article on Twitter Sentiment Analysis of #RoyalWedding, I thought: why not analyse the ABC News website and see if we can uncover some interesting insights? It is good practice for data scraping, text mining and applying a few algorithms.

Below is the step by step guide:

To start with, we will scrape headlines from ABC News for the period from 1 January 2018 onwards. To get historical headlines, we will scrape the ABC News homepage via the Wayback Machine. After some data crunching and manipulation we will end up with clean headlines.

Step 1:
Load all the libraries we will be using in this exercise.

#Importing Libraries
library(stringr)
library(jsonlite)
library(httr)
library(rvest)
library(dplyr)
library(V8)
library(tweenr)
library(syuzhet) 
library(tidyverse)

#Path where files output will be stored
path<-'/Mithun/R/WebScraping/'

Step 2:

Call the Wayback Machine CDX API for abc.net.au/news and parse the JSON response from the text content.
Here we will also filter the timestamps to keep only dates from 1 January 2018 onwards.

#Internet WAYBACK API CALL http://www.abc.net.au/news/
AU_url<-'http://web.archive.org/cdx/search/cdx?url=abc.net.au/news/&output=json'

#API request
req <- httr::GET(AU_url, timeout(20))

#Get data
json <- httr::content(req, as = "text")

api_dat <- fromJSON(json)

#Get timestamps which will be used to pass in API
time_stamps<-api_dat[-1,2]

#Reverse order (so recent first)
time_stamps<-rev(time_stamps)

#Preview the first few timestamps
head(time_stamps,n=50)

#Filter time_stamps to keep dates from 1 Jan 2018 onwards
time_stamps<-time_stamps[as.numeric(substr(time_stamps,1,8))>=20180000]
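Wayback CDX timestamps are 14-digit strings of the form YYYYMMDDhhmmss, so comparing the first 8 digits numerically is enough to filter by date. A quick self-contained check of the filter above, using made-up timestamps:

```r
# Hypothetical CDX timestamps (YYYYMMDDhhmmss format)
sample_stamps <- c("20171231235959", "20180101000130", "20180521120000")

# Same filter as above: keep stamps whose date part falls in 2018 or later
kept <- sample_stamps[as.numeric(substr(sample_stamps, 1, 8)) >= 20180000]

kept  # only the two 2018 stamps survive
```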

Step 3
We will create a loop that passes the URL with each timestamp to collect all headlines published on the ABC News homepage from 1 Jan to 21 May 2018. We will also remove any duplicate headlines picked up while scraping the website.

#Data frame to store output, and loop to get headlines
abc_scrap_all <- NULL
for(s in 1:length(time_stamps)){

 Sys.sleep(1)

feedurl<-paste0('https://web.archive.org/web/',time_stamps[s],'/http://www.abc.net.au/news/')

  print(feedurl)

  if(!is.na(feedurl)){

   print('Valid URL')

    #Scrape the page; skip this timestamp if the request fails
    try(feed_dat<-read_html(feedurl),silent=TRUE)

    if(exists('feed_dat')){

      #Select links pointing to 2018 news articles
      initial<-html_nodes(feed_dat,"[href*='/news/2018']")

      #Date
      Date <- substr(time_stamps[s],1,8)

      #Get headlines
      headlines<-initial  %>% html_text()

      #Combine
      comb<-data.frame(Date,headlines,stringsAsFactors = FALSE)

      #Remove NA headlines
      comb<-comb[!(is.na(comb$headlines) | comb$headlines=="" | comb$headlines==" ") ,]

      #As a df
      comb<-data.frame(comb)

      #Remove duplicates on daily level
      comb <-  comb[!duplicated(comb$headlines),]

      if(length(comb$headlines)>0){

        #Save with the rest
        abc_scrap_all<-rbind(abc_scrap_all,comb)
      }
      rm(comb)
      rm(feed_dat)

     }
  }

}

#Remove duplicates
abc_scrap_all_final <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]
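duplicated() marks the second and later occurrences of a value, so subsetting with !duplicated() keeps the first copy of each headline. A tiny illustration with made-up rows:

```r
# Toy data frame with one repeated headline (hypothetical values)
toy <- data.frame(
  Date = c("20180101", "20180102", "20180102"),
  headlines = c("PM visits flood zone", "Markets rally", "PM visits flood zone"),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each headline
deduped <- toy[!duplicated(toy$headlines), ]

nrow(deduped)  # 2 rows remain; the repeat from 20180102 is dropped
```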

Step 4:
We will run sentiment analysis on the headlines using the syuzhet package and see whether we can draw some conclusions.

library('syuzhet')
abc_scrap_all_final$headlines <- str_replace_all(abc_scrap_all_final$headlines,"[^[:graph:]]", " ")
Sentiment <-get_nrc_sentiment(abc_scrap_all_final$headlines)

td<-data.frame(t(Sentiment))
#Sum sentiment scores across all headline columns
td_Rowsum <- data.frame(rowSums(td))

#Transformation and  cleaning
names(td_Rowsum)[1] <- "count"
td_Rowsum <- cbind("sentiment" = rownames(td_Rowsum), td_Rowsum)
rownames(td_Rowsum) <- NULL
td_Plot<-td_Rowsum[1:10,]

#Visualisation
library("ggplot2")

qplot(sentiment, data=td_Plot, weight=count, geom="bar", fill=sentiment)+ggtitle("ABC News headlines sentiment analysis")

Conclusion on Sentiment Analysis:
The human brain tends to be more attentive to negative information. To grab readers' attention, most media houses focus on negative and fear-related news. That's exactly what we see when analysing the ABC News headlines.

Let's analyse the headlines further:

Word cloud of frequently used words in headlines

library(tm)
library(wordcloud)
  corpus = Corpus(VectorSource(tolower(abc_scrap_all_final$headlines)))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, stopwords("english"))

  frequencies = DocumentTermMatrix(corpus)
  word_frequencies = as.data.frame(as.matrix(frequencies))

  words <- colnames(word_frequencies)
  freq <- colSums(word_frequencies)
  wordcloud(words, freq,
            min.freq=sort(freq, decreasing=TRUE)[[100]],
            colors=brewer.pal(10, "Paired"),
            random.color=TRUE) 


Surprisingly for an Australian news agency, the most frequently used words in headlines relate to Donald Trump (the US President), followed by police, Commonwealth Games, sport and Australia.
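The same frequency check can be done in base R without tm. The sketch below uses made-up headlines; with the real data you would pass abc_scrap_all_final$headlines instead:

```r
# Hypothetical headlines standing in for the scraped data
headlines <- c("Donald Trump meets police chief",
               "Police investigate Commonwealth Games incident",
               "Trump comments on Australia trade")

# Tokenise on non-letter characters after lower-casing, drop empty tokens
tokens <- unlist(strsplit(tolower(headlines), "[^a-z]+"))
tokens <- tokens[nchar(tokens) > 0]

# Term frequencies, most frequent first
head(sort(table(tokens), decreasing = TRUE), 5)
```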

Find word associations:

If there is a specific word of interest, findAssocs() helps us identify the terms most highly correlated with it. If two words always appear together, the correlation is 1.0; in our example we will look for words with at least 0.3 (30%) correlation.

findAssocs(frequencies, "tony", corlimit=0.3)
## $tony
##      abbott headbutting       cooke        30th     hansons  benneworth 
##        0.64        0.35        0.30        0.30        0.30        0.30 
##     mocking       astro        labe 
##        0.30        0.30        0.30
#0.3 means 30% correlation with the word "tony"

That’s all for now. In my next post we will look further into text analytics using udpipe and see if we can build more on word associations and analytics.

Please let me know your feedback, and whether there is any particular topic you would like me to write about.

Do subscribe to Tabvizexplorer.com to keep receive regular updates.

This week's #MakeoverMonday challenge used sports data: English Premier League predictions versus actual outcomes for the 2017-18 season.

The data was shared by the Guardian.

Here is the original report and how it looks:

 

The data is available on data.world, week 21.

Here is what I tried to do:

  • Used a Gantt and circle chart in Tableau to display actual vs predicted results, highlighting each with a different colour to make the chart self-explanatory.
  • This helps show how many predictions were on target and how many were off target.

Below is the Tableau file:

Thanks!!
Do subscribe to receive regular updates

After a long break of five weeks I am back to blogging. Today we will go through Twitter sentiment analysis using R on #RoyalWedding.

The last few years have seen an interesting revolution in social media: it is no longer just a platform where people talk to one another, but one where people:

  • Express interests
  • Share views
  • Show dissent
  • Praise or criticize companies or politicians

So in this article we will learn how to analyse what people post on Twitter and come up with a solution that helps us understand public sentiment.

How to create Twitter app

Twitter provides an API which we can use to analyse tweets posted by users along with their underlying metadata. The API returns data in a structured format which can easily be analysed.

To create a Twitter app, you need a Twitter account; once you have one, visit the Twitter apps page and create an application to access data. A step-by-step guide is available at the following link:

https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/

Once you have created the app, you will get the following four keys:
a. Consumer key (API key)
b. Consumer secret (API Secret)
c. Access Token
d. Access Token Secret

We will use these keys to extract data from Twitter for analysis.

Implementing Sentiment Analysis in R

Now we will go through the step-by-step process in R to extract tweets from Twitter and perform sentiment analysis on them. We will pick #RoyalWedding as our topic of analysis.

Extracting tweets using the Twitter application
Install the necessary packages:

# Install packages
install.packages("twitteR", repos = "http://cran.us.r-project.org")
install.packages("RCurl", repos = "http://cran.us.r-project.org")
install.packages("httr", repos = "http://cran.us.r-project.org")
install.packages("syuzhet", repos = "http://cran.us.r-project.org")

# Load the required Packages
library(twitteR)
library(RCurl)
library(httr)
library(tm)
library(wordcloud)
library(syuzhet)

The next step is to authenticate with the Twitter API using the app we created, passing the keys and access tokens to get the data.

# authorisation keys
consumer_key = "ABCD12345690XXXXXXXXX" #Consumer key from twitter app
consumer_secret = "ABCD12345690XXXXXXXXX" #Consumer secret from twitter app
access_token = "ABCD12345690XXXXXXXXX" #access token from twitter app
access_secret ="ABCD12345690XXXXXXXXX" #access secret from twitter app

# set up
setup_twitter_oauth(consumer_key,consumer_secret,access_token, access_secret)
## [1] "Using direct authentication"
# search for tweets in english language
tweets = searchTwitter("#RoyalWedding", n = 10000, lang = "en")
# store the tweets into dataframe
tweets.df = twListToDF(tweets)

The code above invokes the Twitter app and extracts tweets containing "#RoyalWedding". The royal wedding is the flavour of the season and the talk of the world, with everyone expressing their views on Twitter.

Data cleaning for further analysis
We will remove hashtags, junk characters, other Twitter handles and URLs from the tweets using the gsub function, so that only clean text remains for analysis.

# CLEANING TWEETS

tweets.df$text = gsub("&amp", "", tweets.df$text)
tweets.df$text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets.df$text)
tweets.df$text = gsub("@\\w+", "", tweets.df$text)
tweets.df$text = gsub("[[:punct:]]", "", tweets.df$text)
tweets.df$text = gsub("[[:digit:]]", "", tweets.df$text)
tweets.df$text = gsub("http\\w+", "", tweets.df$text)
tweets.df$text = gsub("[ \t]{2,}", "", tweets.df$text)
tweets.df$text = gsub("^\\s+|\\s+$", "", tweets.df$text)

tweets.df$text <- iconv(tweets.df$text, "UTF-8", "ASCII", sub="")

Now we have only the relevant part of each tweet, which we can use for analysis.
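To see what the chain of gsub calls actually does, here is the same sequence applied to a single made-up tweet (the text and result are illustrative only):

```r
# A hypothetical raw tweet
raw <- "RT @someone: Loving the #RoyalWedding so much https://t.co/abc123"

clean <- gsub("&amp", "", raw)                           # strip HTML-escaped ampersands
clean <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean)  # retweet prefix and handle
clean <- gsub("@\\w+", "", clean)                        # any remaining handles
clean <- gsub("[[:punct:]]", "", clean)                  # punctuation (also mangles URLs)
clean <- gsub("[[:digit:]]", "", clean)                  # digits
clean <- gsub("http\\w+", "", clean)                     # what is left of the URL
clean <- gsub("[ \t]{2,}", "", clean)                    # runs of spaces/tabs
clean <- gsub("^\\s+|\\s+$", "", clean)                  # leading/trailing whitespace

clean  # "Loving the RoyalWedding so much"
```

Note that removing punctuation first is what collapses the URL into a single `http...` token, so the later `http\\w+` pattern can catch all of it.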

Getting sentiments score for each tweet

Let's score the emotions in each tweet; syuzhet's NRC implementation breaks sentiment into 10 categories (8 emotions plus positive and negative polarity).

# Emotions for each tweet using NRC dictionary
emotions <- get_nrc_sentiment(tweets.df$text)
emo_bar = colSums(emotions)
emo_sum = data.frame(count=emo_bar, emotion=names(emo_bar))
emo_sum$emotion = factor(emo_sum$emotion, levels=emo_sum$emotion[order(emo_sum$count, decreasing = TRUE)])

After the steps above, we are ready to visualise which types of emotion dominate the tweets.

# Visualize the emotions from NRC sentiments
library(plotly)
p <- plot_ly(emo_sum, x=~emotion, y=~count, type="bar", color=~emotion) %>%
  layout(xaxis=list(title=""), showlegend=FALSE,
         title="Emotion Type for hashtag: #RoyalWedding")
api_create(p,filename="Sentimentanalysis")

Here we see that the majority of people are talking positively about the royal wedding, which is a good indicator for the analysis.

Lastly, let's see which words contribute to which emotion:

# Create comparison word cloud data

wordcloud_tweet = c(
  paste(tweets.df$text[emotions$anger > 0], collapse=" "),
  paste(tweets.df$text[emotions$anticipation > 0], collapse=" "),
  paste(tweets.df$text[emotions$disgust > 0], collapse=" "),
  paste(tweets.df$text[emotions$fear > 0], collapse=" "),
  paste(tweets.df$text[emotions$joy > 0], collapse=" "),
  paste(tweets.df$text[emotions$sadness > 0], collapse=" "),
  paste(tweets.df$text[emotions$surprise > 0], collapse=" "),
  paste(tweets.df$text[emotions$trust > 0], collapse=" ")
)

# create corpus
corpus = Corpus(VectorSource(wordcloud_tweet))

# remove punctuation, convert every word in lower case and remove stop words

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)

# create document term matrix

tdm = TermDocumentMatrix(corpus)

# convert to a matrix and keep only terms shorter than 11 characters
tdm = as.matrix(tdm)
tdmnew <- tdm[nchar(rownames(tdm)) < 11,]

# column name binding
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
colnames(tdmnew) <- colnames(tdm)
comparison.cloud(tdmnew, random.order=FALSE,
                 colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
                 title.size=1, max.words=250, scale=c(2.5, 0.4),rot.per=0.4)


This is what the word cloud for tweets with #RoyalWedding looks like. Using R, we can analyse sentiment on social media, and this approach can be extended to a particular handle or product to see what people are saying and whether it is negative or positive.

Please feel free to ask any questions, or let me know if you would like me to write on any specific topic.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

This week’s #WorkoutWednesday was about building a frequency matrix, using colour to represent frequency intensity.

Requirements

  • Use sub-categories
  • Dashboard size is 1000 x 900; tiled; 1 sheet
  • Distinctly count the number of orders that have purchases from both sub-categories
  • Sort the categories from highest to lowest frequency
  • White out when the sub-category matches and include the number of orders
  • Calculate the average sales per order for each sub-category
  • Identify in the tooltip the highest average spend per sub-category (see Phones & Tables)
  • If it’s the highest average spend for both sub-categories, identify with a dot in the square
  • Match formatting & tooltips – special emphasis on tooltip verbiage

This week uses the Superstore dataset. You can get it at data.world.

Below is my attempt to meet the above requirements:

Thanks for reading 🙂

Do subscribe to the blog to keep receiving updates.