After writing my previous article on Twitter sentiment analysis of #royalwedding, I thought: why not analyse the ABC News website and see if we can uncover some interesting insights? It is also good practice for data scraping, text mining and a few algorithms.
Below is the step-by-step guide:
To start with, we will scrape headlines from ABC News covering 2018 so far. To get historical headlines, we will scrape the ABC News homepage via the Wayback Machine. After a bit of data crunching and manipulation we will have a clean set of headlines.
Step 1:
Load all the libraries we will be using in this exercise.
#Importing Libraries
library(stringr)    #string manipulation
library(jsonlite)   #parse the JSON returned by the Wayback CDX API
library(httr)       #HTTP requests
library(rvest)      #HTML scraping
library(dplyr)      #data manipulation
library(V8)         #embedded JavaScript engine
library(tweenr)     #tweening/animation helpers
library(syuzhet)    #sentiment analysis
library(tidyverse)  #general data wrangling and ggplot2
#Path where files output will be stored
path<-'/Mithun/R/WebScraping/'
Step 2:
Call the Wayback Machine CDX API for abc.net.au/news and parse the JSON response to get the snapshot timestamps.
Here we will also filter the timestamps to keep only dates from 1 January 2018 onwards.
#Wayback Machine CDX API call for http://www.abc.net.au/news/
AU_url<-'http://web.archive.org/cdx/search/cdx?url=abc.net.au/news/&output=json'
#API request
req <- httr::GET(AU_url, timeout(20))
#Get data
json <- httr::content(req, as = "text")
api_dat <- fromJSON(json)
#Get timestamps which will be used to pass in API
time_stamps<-api_dat[-1,2]
#Reverse order (so recent first)
time_stamps<-rev(time_stamps)
#Scrape each snapshot URL to get headlines from the abc.net.au/news website
head(time_stamps,n=50)
#Filter time_stamps to keep dates from 2018 onwards
time_stamps<-time_stamps[as.numeric(substr(time_stamps,1,8))>=20180000]
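Before moving on, it can be worth a quick sanity check on the filtered snapshots. A minimal sketch, using the time_stamps vector created above, prints how many snapshots remain and the date range they cover:
#Quick sanity check on the filtered snapshots (assumes time_stamps from above)
length(time_stamps)                                            #number of Wayback snapshots kept
range(as.Date(substr(time_stamps, 1, 8), format = "%Y%m%d"))   #earliest and latest snapshot dates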
Step 3:
We will create a loop that passes the URL with the necessary timestamp to collect all headlines published on the ABC News website from 1 January up to 21 May 2018. We will also remove any duplicate headlines picked up while scraping.
#Dataframe to store output and loop to get headlines
abc_scrap_all <- NULL
for(s in 1:length(time_stamps)){
  Sys.sleep(1)
  feedurl <- paste0('https://web.archive.org/web/', time_stamps[s], '/http://www.abc.net.au/news/')
  print(feedurl)
  if(!is.na(feedurl)){
    print('Valid URL')
    #Scrape the page from the Wayback Machine snapshot
    try(feed_dat <- read_html(feedurl), silent = TRUE)
    if(exists('feed_dat')){
      #Keep only anchors that link to 2018 news articles
      initial <- html_nodes(feed_dat, "[href*='/news/2018']")
      #Date of the snapshot
      Date <- substr(time_stamps[s], 1, 8)
      #Get headlines
      headlines <- initial %>% html_text()
      #Combine
      comb <- data.frame(Date, headlines, stringsAsFactors = FALSE)
      #Remove empty or NA headlines
      comb <- comb[!(is.na(comb$headlines) | comb$headlines == "" | comb$headlines == " "), ]
      #As a data frame
      comb <- data.frame(comb)
      #Remove duplicates at the daily level
      comb <- comb[!duplicated(comb$headlines), ]
      if(length(comb$headlines) > 0){
        #Append to the rest
        abc_scrap_all <- rbind(abc_scrap_all, comb)
      }
      rm(comb)
      rm(feed_dat)
    }
  }
}
#Remove duplicates
abc_scrap_all_final <- abc_scrap_all[!duplicated(abc_scrap_all$headlines),]
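At this point it is worth saving the scraped headlines so the scraping step does not have to be rerun. A minimal sketch, reusing the path variable defined in Step 1 (the file name abc_headlines_2018.csv is just an example):
#Save the de-duplicated headlines for later reuse (file name is an example)
write.csv(abc_scrap_all_final, paste0(path, 'abc_headlines_2018.csv'), row.names = FALSE)
#Quick look at how many unique headlines we collected
nrow(abc_scrap_all_final)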
Step 4:
We will now run sentiment analysis on the headlines using the syuzhet package and see what conclusions we can draw.
library('syuzhet')
abc_scrap_all_final$headlines <- str_replace_all(abc_scrap_all_final$headlines,"[^[:graph:]]", " ")
Sentiment <-get_nrc_sentiment(abc_scrap_all_final$headlines)
td<-data.frame(t(Sentiment))
td_Rowsum <- data.frame(rowSums(td)) #sum each sentiment across all headlines
#Transformation and cleaning
names(td_Rowsum)[1] <- "count"
td_Rowsum <- cbind("sentiment" = rownames(td_Rowsum), td_Rowsum)
rownames(td_Rowsum) <- NULL
td_Plot<-td_Rowsum[1:10,]
#Vizualisation
library("ggplot2")
qplot(sentiment, data=td_Plot, weight=count, geom="bar", fill=sentiment) + ggtitle("ABC News headlines sentiment analysis")
Conclusion on Sentiment Analysis:
The human brain tends to be more attentive to negative information. To grab readers' attention, most media outlets focus on negative and fear-related news. That is exactly what we see when we analyse the ABC News headlines as well.
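To put a rough number on that observation, one quick check (assuming the Sentiment data frame from Step 4 is still in memory) is to compare the overall positive and negative scores across all headlines:
#Compare overall positive vs negative scores across all headlines
colSums(Sentiment[, c("negative", "positive")])
#Share of headlines scoring more negative than positive
mean(Sentiment$negative > Sentiment$positive)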
Let's analyse the headlines further:
Word cloud of frequently used words in the headlines
library(tm)
library(wordcloud)
#Build a corpus from the lower-cased headlines
corpus = Corpus(VectorSource(tolower(abc_scrap_all_final$headlines)))
#Remove punctuation and common English stop words
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
#Document-term matrix: one row per headline, one column per word
frequencies = DocumentTermMatrix(corpus)
word_frequencies = as.data.frame(as.matrix(frequencies))
#Word frequencies across all headlines
words <- colnames(word_frequencies)
freq <- colSums(word_frequencies)
wordcloud(words, freq,
min.freq=sort(freq, decreasing=TRUE)[[100]],
colors=brewer.pal(10, "Paired"),
random.color=TRUE)
Surprisingly for an Australian news outlet, the most frequently used words in the headlines relate to Donald Trump (the American president), followed by police, the Commonwealth Games, sport and Australia.
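If you want to see the exact counts behind the word cloud, a quick sketch using the freq vector computed above lists the most frequent terms:
#Top 15 most frequent words behind the word cloud
head(sort(freq, decreasing = TRUE), 15)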
Find word associations:
If there is a specific word of interest for the analysis, we can identify the terms most highly correlated with it. If two words always appear together, the correlation is 1.0; in our example we will look for words correlated with a given term at 30% or more.
findAssocs(frequencies, "tony", corlimit=0.3)
## $tony
##      abbott headbutting       cooke        30th     hansons  benneworth
##        0.64        0.35        0.30        0.30        0.30        0.30
##     mocking       astro        labe
##        0.30        0.30        0.30
#0.3 means 30% correlation with word "tony"
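The same call works for any other term of interest. For example, given how prominent Trump was in the word cloud, you could look at its associations as well (the actual output will depend on the headlines you scraped):
#Word associations for another frequent term (output depends on your scraped data)
findAssocs(frequencies, "trump", corlimit = 0.3)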
That's all for now. In my next post we will look further into text analytics using udpipe and see if we can build more on text association and analytics.
Please do let me know your feedback, and whether there is any particular topic you would like me to write about.
Do subscribe to Tabvizexplorer.com to keep receiving regular updates.