In my previous post, we learned about text mining and sentiment analysis of news headlines using web scraping and R. Text analytics has long been one of the black boxes of analytics. In this post, we will dive into text analysis of headlines with simple Natural Language Processing (NLP) using UDPipe in R. UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which are essential parts of natural language processing.
Step 1: Dataset
We will use the same dataset we created in the last article by scraping ABC News: all headlines published on ABC News during 2018.
#install.packages("udpipe")
library(dplyr)
library(ggplot2)
abc_scrap_all <- readRDS(file = "abc_scrap_all.rds")
#Remove duplicates
news <- abc_scrap_all[!duplicated(abc_scrap_all$headlines),]
news %>% group_by(Date) %>% count() %>% arrange(desc(n))
## # A tibble: 121 x 2
## # Groups:   Date [121]
##    Date         n
##    <chr>    <int>
##  1 20180523   213
##  2 20180504   172
##  3 20180320   155
##  4 20180220   153
##  5 20180227   152
##  6 20180301   148
##  7 20180423   147
##  8 20180208   145
##  9 20180321   144
## 10 20180322   137
## # ... with 111 more rows
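Since ggplot2 is already loaded, here is a quick optional sketch (assuming Date is stored as "YYYYMMDD" character strings, as the output above shows) to visualise headline volume per day:
# plot the number of headlines per day across 2018
news %>%
  count(Date) %>%
  mutate(Date = as.Date(Date, format = "%Y%m%d")) %>%
  ggplot(aes(x = Date, y = n)) +
  geom_line(colour = "steelblue") +
  labs(title = "ABC News headlines per day (2018)",
       x = "Date", y = "Number of headlines")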
Step 2: Pre-trained UDPipe model
The UDPipe package comes with pre-trained models for more than 50 languages. We can download a model using the udpipe_download_model() function.
library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ud-2.0-170801.udpipe')
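As a side note, udpipe_download_model() returns a data frame whose file_model column holds the path of the downloaded model file, so the download and load steps can also be chained (a minimal sketch; the exact file name depends on the UDPipe release you download):
# model <- udpipe_download_model(language = "english")
# udmodel_english <- udpipe_load_model(file = model$file_model)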
Step 3: Annotate the input text
We use udpipe_annotate() to annotate the given text: it tokenizes each headline and tags every token with its lemma, part of speech and dependency relations. The output can then be converted into a data frame with data.frame().
# use udpipe_annotate() for analysis
textanalysis <- udpipe_annotate(udmodel_english, news$headlines)
#data frame for the output
textframe <- data.frame(textanalysis)
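It is worth inspecting the annotated output before moving on: each row is one token, with columns such as doc_id, sentence_id, token, lemma, upos (the universal part-of-speech tag) and dep_rel (the dependency relation).
# one row per token, with its lemma, POS tag and dependency relation
head(textframe[, c("doc_id", "token", "lemma", "upos", "dep_rel")])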
Step 4: Universal POS (Part of Speech)
We will plot the frequency of part-of-speech tags across the given headlines.
## POS
library(lattice)
POS <- txt_freq(textframe$upos)
POS$key <- factor(POS$key, levels = rev(POS$key))
barchart(key ~ freq, data = POS, col = "yellow",
main = "UPOS (Universal Parts of Speech)\n frequency of occurrence",
xlab = "Freq")
Step 5: Frequently used Nouns in headlines
Let's plot the most frequently used nouns in the headlines.
## NOUNS
noun <- subset(textframe, upos %in% c("NOUN"))
noun <- txt_freq(noun$token)
noun$key <- factor(noun$key, levels = rev(noun$key))
barchart(key ~ freq, data = head(noun, 20), col = "cadetblue",
main = "Frequently used nouns", xlab = "Freq")
More than half of the top nouns used in the headlines seem to indicate a negative atmosphere.
Step 6: Frequently used Adjectives in headlines
Let's analyze the adjectives used in the headlines; as a news website, ABC is likely to magnify and inflate stories with colourful adjectives.
## ADJECTIVES
adj <- subset(textframe, upos %in% c("ADJ"))
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20), col = "purple",
main = "Frequently used Adjectives", xlab = "Freq")
Step 7: Frequently used Verbs in headlines
Do headlines bring in any sign of optimism, or do they just infuse pessimism? The kinds of verbs a media house uses can certainly help reveal which way it leans.
## VERBS
verbs <- subset(textframe, upos %in% c("VERB"))
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20), col = "gold",
main = "Most occurring Verbs", xlab = "Freq")
With words like dies, killed, charged and accused topping the list, it does not look like ABC News is interested in building an optimistic mindset among its readers. It is acting like a typical media house, chasing hot, sensational or burning news to gain further viewership.
Step 8: Automated keyword extraction using RAKE
Rapid Automatic Keyword Extraction (RAKE) is one of the most popular unsupervised algorithms for extracting keywords in information retrieval. It finds candidate keywords by looking for contiguous sequences of relevant words, uninterrupted by irrelevant ones, and scores them based on the frequency and co-occurrence of their member words.
## Using RAKE
rake <- keywords_rake(x = textframe, term = "lemma", group = "doc_id",
                      relevant = textframe$upos %in% c("NOUN", "ADJ"))
rake$key <- factor(rake$keyword, levels = rev(rake$keyword))
barchart(key ~ rake, data = head(subset(rake, freq > 3), 20), col = "red",
main = "Keywords identified by RAKE",
xlab = "Rake")
Step 9: Phrases
Now we will extract simple noun phrases: sequences in which adjectives and nouns (possibly linked by prepositions and determiners) combine into a single unit. Let us bring out the top phrases, which serve as keywords or topics for this headlines data.
## Using a sequence of POS tags (noun phrases / verb phrases)
textframe$phrase_tag <- as_phrasemachine(textframe$upos, type = "upos")
phrases <- keywords_phrases(x = textframe$phrase_tag, term = textframe$token,
pattern = "(A|N)*N(P+D*(A|N)*N)*",
is_regex = TRUE, detailed = FALSE)
phrases <- subset(phrases, ngram > 1 & freq > 3)
phrases$key <- factor(phrases$keyword, levels = rev(phrases$keyword))
barchart(key ~ freq, data = head(phrases, 20), col = "magenta",
main = "Keywords - simple noun phrases", xlab = "Frequency")
To conclude, we see "commonwealth games" and "gold coast" as the top phrases, since the Gold Coast hosted the Commonwealth Games this year. We also see the US influence on the headlines, with "wall street", "white house" and "donald trump" appearing frequently.
Hope this post helped you get started with text analytics and NLP in R.
Please let me know your feedback, and whether there are any particular topics you would like me to write about.
Do subscribe to Tabvizexplorer.com to keep receiving regular updates.