Natural Language Processing using UDPIPE in R

In my previous post, we learned about text mining and sentiment analysis on News headlines using web scraping and R. Text analytics has been one of the black boxes of analytics. In this post, we will dive into text analysis of headlines with simple Natural Language Processing(NLP) using UDPipe in R. UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part in natural language processing.

Step 1: Dataset
We will use the same dataset which we created in last article from ABC News. We will take all the headlines which have been published on ABC news for the year 2018.



abc_scrap_all <- readRDS(file = paste0("abc_scrap_all.rds"))
#Remove duplicates
news <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]

news %>% group_by(Date) %>% count() %>% arrange(desc(n))
## # A tibble: 121 x 2
## # Groups:   Date [121]
##    Date         n
##    <chr>    <int>
##  1 20180523   213
##  2 20180504   172
##  3 20180320   155
##  4 20180220   153
##  5 20180227   152
##  6 20180301   148
##  7 20180423   147
##  8 20180208   145
##  9 20180321   144
## 10 20180322   137
## # ... with 111 more rows

Step 2: pre-trained UDPipe model
UDPipe package comes with pre trained model for more than 50 spoken languages. We can download the model using udpipe_download_model() function

#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ud-2.0-170801.udpipe')

Step 3: Annotate the input text
use udpipe_annotate() to start with udpipe and this will annotates the given data and put the variable into data frame format with data.frame()

# use udpipe_annotate() for analysis 
textanalysis <- udpipe_annotate(udmodel_english, news$headlines)
#data frame for the output
textframe <- data.frame(textanalysis)

Step 4: Universal POS (Part of Speech)
We will plot Part of speech tags for the given headlines

## POS
POS <- txt_freq(textframe$upos)
POS$key <- factor(POS$key, levels = rev(POS$key))
barchart(key ~ freq, data = POS, col = "yellow", 
         main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", 
         xlab = "Freq")

plot of chunk unnamed-chunk-3

Step 5: Frequently used Nouns in headlines
Lets plot the most frequently used nouns in headlines

noun <- subset(textframe, upos %in% c("NOUN")) 
noun <- txt_freq(noun$token)
noun$key <- factor(noun$key, levels = rev(noun$key))
barchart(key ~ freq, data = head(noun, 20), col = "cadetblue", 
         main = "Frequently used nouns", xlab = "Freq")

plot of chunk unnamed-chunk-4
More than half of the top nouns used in headlines seem to be indicating negative atmosphere

Step 6: Frequently used Adjective in headlines
Let’s analyze the Adjective used in headlines as its a news website which will love to magnify and inflate using several adjectives

adj <- subset(textframe, upos %in% c("ADJ")) 
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20), col = "purple", 
         main = "Frequently used Adjectives", xlab = "Freq")

plot of chunk unnamed-chunk-5

Step 7: Frequently used Verb in headlines
Do headlines bring in any sign of optimism or just infuse pessimism? The kind of Verb used by media house can certainly help in highlighting direction of optimism or pessimism.

verbs <- subset(textframe, upos %in% c("VERB")) 
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20), col = "gold", 
         main = "Most occurring Verbs", xlab = "Freq")

plot of chunk unnamed-chunk-6

With words like dies, killed, charged, accused and many more it does not look like ABC news is not interested in building an optimistic mindset amongst its citizen. It is just acting like a media house which will look into hot, sensational or burning news to gain further viewership.

Step 8: Automated keywords extraction using RAKE
Rapid Automatic Keyword Extraction(RAKE) algorithm is one of the most popular(unsupervised) algorithms for extracting keywords in Information retrieval. It looks for keywords by looking to a contiguous sequence of words which do not contain irrelevant words.

## Using RAKE
rake <- keywords_rake(x = textframe, term = "lemma", group = "doc_id", 
                       relevant = x$upos %in% c("NOUN", "ADJ"))
rake$key <- factor(rake$keyword, levels = rev(rake$keyword))
barchart(key ~ rake, data = head(subset(rake, freq > 3), 20), col = "red", 
         main = "Keywords identified by RAKE", 
         xlab = "Rake")

plot of chunk unnamed-chunk-7

Step 9: Phrases
Now, we will extract phrases in which a noun and a verb forming a phrase. Let us bring out the top phrases that are just keywords or topic for this headlines data.

## Using a sequence of POS tags (noun phrases / verb phrases)
textframe$phrase_tag <- as_phrasemachine(textframe$upos, type = "upos")
phrases <- keywords_phrases(x = textframe$phrase_tag, term = textframe$token, 
                          pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                          is_regex = TRUE, detailed = FALSE)
phrases <- subset(phrases, ngram > 1 & freq > 3)
phrases$key <- factor(phrases$keyword, levels = rev(phrases$keyword))
barchart(key ~ freq, data = head(phrases, 20), col = "magenta", 
         main = "Keywords - simple noun phrases", xlab = "Frequency")

To conclude, we see here is commonwealth games and gold coast being top used phrases as Gold coast was hosting commonwealth games this year. Also, US influence on the news headlines with Wall street, white house and Donald trump being used frequently in headlines.

Hope this post helped you to get started with text analytics and NLP in R.

Please do let me know your feedback and if any particular topic you would like me to write on.

Do subscribe to to keep receive regular updates.