In my previous post, we learned about text mining and sentiment analysis on news headlines using web scraping and R. Text analytics has long been one of the black boxes of analytics. In this post, we will dive into text analysis of headlines with simple Natural Language Processing (NLP) using UDPipe in R. UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, all of which are essential parts of natural language processing.

Step 1: Dataset
We will use the same dataset we created in the last article from ABC News: all the headlines published on ABC News in 2018.

#install.packages("udpipe")

library(dplyr)
library(ggplot2)

abc_scrap_all <- readRDS(file = "abc_scrap_all.rds")
#Remove duplicates
news <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]

news %>% group_by(Date) %>% count() %>% arrange(desc(n))
## # A tibble: 121 x 2
## # Groups:   Date [121]
##    Date         n
##    <chr>    <int>
##  1 20180523   213
##  2 20180504   172
##  3 20180320   155
##  4 20180220   153
##  5 20180227   152
##  6 20180301   148
##  7 20180423   147
##  8 20180208   145
##  9 20180321   144
## 10 20180322   137
## # ... with 111 more rows
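
As an optional check (not part of the original walkthrough), we can quickly visualize how many headlines were captured per day, since dplyr and ggplot2 are already loaded:

#Optional: number of headlines captured per day
news %>%
  count(Date) %>%
  ggplot(aes(x = as.Date(Date, format = "%Y%m%d"), y = n)) +
  geom_col() +
  labs(x = "Date", y = "Headlines")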

Step 2: Pre-trained UDPipe model
The UDPipe package ships with pre-trained models for more than 50 languages. We can download a model using the udpipe_download_model() function.

library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ud-2.0-170801.udpipe')
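
If you would rather not hard-code the model file name, the data frame returned by udpipe_download_model() contains the path to the downloaded file in its file_model column, so the two steps can be chained (a small optional variation):

#Optional: download (once) and load the model without hard-coding the file name
model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = model$file_model)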

Step 3: Annotate the input text
Use udpipe_annotate() to annotate the given text; the output can then be converted into a data frame with data.frame().

# use udpipe_annotate() for analysis 
textanalysis <- udpipe_annotate(udmodel_english, news$headlines)
#data frame for the output
textframe <- data.frame(textanalysis)
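
Before moving on, it is worth glancing at the annotated output: each row is one token, together with its lemma, universal POS tag and dependency relation. A quick illustrative check:

#Each row of the annotation is one token
head(textframe[, c("doc_id", "sentence_id", "token", "lemma", "upos", "dep_rel")], 10)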

Step 4: Universal POS (Part of Speech)
We will plot the frequency of part-of-speech tags across the given headlines.

## POS
library(lattice)
POS <- txt_freq(textframe$upos)
POS$key <- factor(POS$key, levels = rev(POS$key))
barchart(key ~ freq, data = POS, col = "yellow", 
         main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", 
         xlab = "Freq")

(Plot: UPOS (Universal Parts of Speech) frequency of occurrence)

Step 5: Frequently used Nouns in headlines
Let's plot the most frequently used nouns in the headlines.

## NOUNS
noun <- subset(textframe, upos %in% c("NOUN")) 
noun <- txt_freq(noun$token)
noun$key <- factor(noun$key, levels = rev(noun$key))
barchart(key ~ freq, data = head(noun, 20), col = "cadetblue", 
         main = "Frequently used nouns", xlab = "Freq")

(Plot: frequently used nouns)
More than half of the top nouns used in headlines seem to point to a negative atmosphere.
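
As a small variation (not in the original analysis), counting lemmas instead of raw tokens merges inflected forms of the same noun:

#Variation: count noun lemmas rather than raw tokens
noun_lemma <- subset(textframe, upos %in% c("NOUN"))
noun_lemma <- txt_freq(noun_lemma$lemma)
head(noun_lemma, 10)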

Step 6: Frequently used Adjectives in headlines
Let's analyze the adjectives used in headlines; as a news website, ABC has every incentive to magnify and inflate stories with adjectives.

## ADJECTIVES
adj <- subset(textframe, upos %in% c("ADJ")) 
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20), col = "purple", 
         main = "Frequently used Adjectives", xlab = "Freq")

(Plot: frequently used adjectives)

Step 7: Frequently used Verbs in headlines
Do headlines bring any sign of optimism, or do they just infuse pessimism? The kind of verbs a media house uses can certainly help indicate the direction of optimism or pessimism.

## VERBS
verbs <- subset(textframe, upos %in% c("VERB")) 
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20), col = "gold", 
         main = "Most occurring Verbs", xlab = "Freq")

(Plot: most occurring verbs)

With words like dies, killed, charged and accused topping the list, it does not look like ABC News is interested in building an optimistic mindset amongst its readers. It acts like a typical media house, chasing hot, sensational or burning news to gain viewership.

Step 8: Automated keyword extraction using RAKE
Rapid Automatic Keyword Extraction (RAKE) is one of the most popular unsupervised algorithms for extracting keywords in information retrieval. It finds candidate keywords as contiguous sequences of words that contain no stop words, and scores each candidate by the frequency and co-occurrence (degree) of the words it contains.

## Using RAKE
rake <- keywords_rake(x = textframe, term = "lemma", group = "doc_id", 
                      relevant = textframe$upos %in% c("NOUN", "ADJ"))
rake$key <- factor(rake$keyword, levels = rev(rake$keyword))
barchart(key ~ rake, data = head(subset(rake, freq > 3), 20), col = "red", 
         main = "Keywords identified by RAKE", 
         xlab = "Rake")

(Plot: keywords identified by RAKE)
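
If you prefer numbers to a chart, the rake column of the output holds the RAKE score itself; a quick illustrative look at the same keywords plotted above:

#'freq' is how often the keyword occurs; 'rake' is its RAKE score
head(subset(rake, freq > 3), 10)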

Step 9: Phrases
Now we will extract phrases: sequences of parts of speech, here adjectives and nouns forming simple noun phrases. Let us bring out the top phrases that serve as keywords or topics for this headlines data.

## Using a sequence of POS tags (noun phrases / verb phrases)
textframe$phrase_tag <- as_phrasemachine(textframe$upos, type = "upos")
phrases <- keywords_phrases(x = textframe$phrase_tag, term = textframe$token, 
                          pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                          is_regex = TRUE, detailed = FALSE)
phrases <- subset(phrases, ngram > 1 & freq > 3)
phrases$key <- factor(phrases$keyword, levels = rev(phrases$keyword))
barchart(key ~ freq, data = head(phrases, 20), col = "magenta", 
         main = "Keywords - simple noun phrases", xlab = "Frequency")

To conclude, we see "commonwealth games" and "gold coast" as the top phrases, since the Gold Coast hosted the Commonwealth Games this year. There is also a clear US influence on the headlines, with "wall street", "white house" and "donald trump" appearing frequently.

Hope this post helped you get started with text analytics and NLP in R.

Please do let me know your feedback, and whether there is any particular topic you would like me to write on.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

After writing the previous article on Twitter sentiment analysis of #royalwedding, I thought: why not analyze the ABC News website and see if we can uncover some interesting insights? It is also good practice for data scraping, text mining and applying a few algorithms.

Below is the step by step guide:

To start with, we will scrape headlines from ABC News for 2018 so far. To get historical headlines, we will scrape the ABC News homepage via the Wayback Machine. After some data crunching and manipulation we will end up with clean headlines.

Step 1:
Load all the libraries we will be using in this exercise.

#Importing Libraries
library(stringr)
library(jsonlite)
library(httr)
library(rvest)
library(dplyr)
library(V8)
library(tweenr)
library(syuzhet) 
library(tidyverse)

#Path where files output will be stored
path<-'/Mithun/R/WebScraping/'

Step 2:

Use the Wayback Machine CDX API for abc.net.au/news and parse the JSON text content of the response.
Here we will also filter the timestamps to keep only dates from 1st Jan '18 onwards.

#Wayback Machine CDX API call for http://www.abc.net.au/news/
AU_url<-'http://web.archive.org/cdx/search/cdx?url=abc.net.au/news/&output=json'

#API request
req <- httr::GET(AU_url, timeout(20))

#Get data
json <- httr::content(req, as = "text")

api_dat <- fromJSON(json)

#Get the snapshot timestamps (column 2), dropping the header row
time_stamps<-api_dat[-1,2]

#Reverse order (so recent first)
time_stamps<-rev(time_stamps)

#Preview the first 50 timestamps
head(time_stamps,n=50)

#Filter time_stamps to have dates after 2018
time_stamps<-time_stamps[as.numeric(substr(time_stamps,1,8))>=20180000]

Step 3:
We will create a loop that passes the URL with each timestamp to collect all the headlines published on the ABC News homepage in 2018. We will also remove any duplicate headlines picked up while scraping the snapshots.

#Data frame to store output and loop to get headlines
abc_scrap_all <- NULL
for(s in 1:length(time_stamps)){

  Sys.sleep(1)

  feedurl <- paste0('https://web.archive.org/web/', time_stamps[s], '/http://www.abc.net.au/news/')

  print(feedurl)

  if(!is.na(feedurl)){

    print('Valid URL')

    #Scrape the snapshot page; skip silently if the request fails
    try(feed_dat <- read_html(feedurl), silent = TRUE)

    if(exists('feed_dat')){

      #Keep only links pointing to 2018 news articles
      initial <- html_nodes(feed_dat, "[href*='/news/2018']")

      #Date of the snapshot
      Date <- substr(time_stamps[s], 1, 8)

      #Get headlines
      headlines <- initial %>% html_text()

      #Combine
      comb <- data.frame(Date, headlines, stringsAsFactors = FALSE)

      #Remove NA or empty headlines
      comb <- comb[!(is.na(comb$headlines) | comb$headlines == "" | comb$headlines == " "), ]

      #Remove duplicates on a daily level
      comb <- comb[!duplicated(comb$headlines), ]

      if(length(comb$headlines) > 0){

        #Append to the rest
        abc_scrap_all <- rbind(abc_scrap_all, comb)
      }
      rm(comb)
      rm(feed_dat)

    }
  }

}

#Remove duplicates
abc_scrap_all_final <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]
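
To reuse these headlines in the follow-up UDPipe analysis (which reads them back from an RDS file), they can be persisted to disk. A minimal sketch, assuming the file name abc_scrap_all.rds:

#Save the de-duplicated headlines for later analysis (assumed file name)
saveRDS(abc_scrap_all_final, file = "abc_scrap_all.rds")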

Step 4:
We will run sentiment analysis on the headlines using the syuzhet package and see what conclusions we can draw.

library('syuzhet')
abc_scrap_all_final$headlines <- str_replace_all(abc_scrap_all_final$headlines,"[^[:graph:]]", " ")
Sentiment <-get_nrc_sentiment(abc_scrap_all_final$headlines)

td<-data.frame(t(Sentiment))
td_Rowsum <- data.frame(rowSums(td[2:1781])) 

#Transformation and  cleaning
names(td_Rowsum)[1] <- "count"
td_Rowsum <- cbind("sentiment" = rownames(td_Rowsum), td_Rowsum)
rownames(td_Rowsum) <- NULL
td_Plot<-td_Rowsum[1:10,]

#Vizualisation
library("ggplot2")

qplot(sentiment, data=td_Plot, weight=count, geom="bar",fill=sentiment)+ggtitle("Abc News headlines sentiment analysis")
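
As an alternative sketch (not the original code), the per-sentiment totals can be computed with colSums() directly on the Sentiment data frame, which avoids the hard-coded column range td[2:1781] used above:

#Alternative: total each sentiment across all headlines without hard-coded indices
sentiment_totals <- data.frame(sentiment = colnames(Sentiment),
                               count = colSums(Sentiment))
ggplot(sentiment_totals, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  ggtitle("ABC News headlines sentiment analysis")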

Conclusion on Sentiment Analysis:
The human brain tends to be more attentive to negative information. To grab readers' attention, most media houses focus on negative and fear-related news. That is exactly what we see when we analyze the ABC News headlines as well.

Let's analyze the headlines further:

Wordcloud for frequently used words in headlines

library(tm)
library(wordcloud)

#Build a corpus from the headlines and clean it
corpus = Corpus(VectorSource(tolower(abc_scrap_all_final$headlines)))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))

#Document-term matrix and word frequencies
frequencies = DocumentTermMatrix(corpus)
word_frequencies = as.data.frame(as.matrix(frequencies))

words <- colnames(word_frequencies)
freq <- colSums(word_frequencies)
wordcloud(words, freq,
          min.freq=sort(freq, decreasing=TRUE)[[100]],
          colors=brewer.pal(10, "Paired"),
          random.color=TRUE)

(Plot: word cloud of frequently used words in headlines)

Surprisingly for an Australian news agency, the most frequently used words in headlines relate to Donald Trump (the American President), followed by police, commonwealth games, sport and australia.

Find word associations:

If there is a specific word of interest for the analysis, we can identify the words that are most highly correlated with that term. If two words always appear together, the correlation is 1.0; in our example we will find words with at least 30% correlation.

findAssocs(frequencies, "tony", corlimit=0.3)
## $tony
##      abbott headbutting       cooke        30th     hansons  benneworth 
##        0.64        0.35        0.30        0.30        0.30        0.30 
##     mocking       astro        labe 
##        0.30        0.30        0.30
#0.3 means 30% correlation with word "tony" 

That’s all for now. In my next post we will look further into text analytics using UDPipe and see if we can build more on text associations and analytics.

Please do let me know your feedback, and whether there is any particular topic you would like me to write on.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

This week's #makeovermonday challenge was sports data: English Premier League predictions versus the actual outcomes of the 2017-18 season.

The data was shared by the Guardian.

Here is the original report and how it looks:

 

The data is available on data.world (week 21).

Here is what I tried to do:

  • Used a Gantt chart and a circle chart in Tableau to display the actual vs predicted results, highlighting each with a different color to make the view self-explanatory.
  • This helps make inferences about how many predictions were on target and how many were off target.

Below is the Tableau file:

Thanks!!
Do subscribe to receive regular updates

This week’s #workoutwednesday was about building a frequency matrix, using color to represent frequency intensity.

Requirements

  • Use sub-categories
  • Dashboard size is 1000 x 900; tiled; 1 sheet
  • Distinctly count the number of orders that have purchases from both sub-categories
  • Sort the categories from highest to lowest frequency
  • White out when the sub-category matches and include the number of orders
  • Calculate the average sales per order for each sub-category
  • Identify in the tooltip the highest average spend per sub-category (see Phones & Tables)
  • If it’s the highest average spend for both sub-categories, identify with a dot in the square
  • Match formatting & tooltips – special emphasis on tooltip verbiage

This week uses the Superstore dataset. You can get it at data.world.

Below is my attempt to meet the above requirements:

Thanks for reading 🙂

Do subscribe to the blog to keep receiving updates