This week's #workoutwednesday challenge was set by Ann Jackson. It was a slightly difficult challenge, as it was about finding signals or patterns in data using a control chart: finding statistical signals based on standard deviations and the mean or median, with indicators that flag a change in pattern or trend, or an outlier within the dataset. This is a good way to analyze data and get statistical insights and signals about its patterns and behavior.

Requirements: 

Available here

Data for the workout can be downloaded from here

Here is my output for the challenge (Click on Image for interactive version):

It took me time to understand the requirement and figure out how to design the visualisation. LOOKUP() and window functions helped a lot in building it.

Let me share the steps required to create this visualisation once the data is imported into Tableau:

Step 1: Create two parameters for the requirement

Select a Middle Line: This parameter provides a drop-down to choose either the median or the mean.

Select a Test: This parameter selects the type of test we want to perform (Outliers, Trend or Change).

Step 2: Create the calculated fields required for the visualisation

Middle Line: This field is based on the "Select a Middle Line" parameter; depending on the selection it returns either the median or the mean of sales.

 

+3SD: This field calculates the middle line plus three standard deviations.

-3SD: This field calculates the middle line minus three standard deviations.

TEST – Outliers: Boolean field to highlight the outliers in the dataset, i.e. whether the sales value is above +3SD or below -3SD.

 

TEST – Trend: Boolean field to highlight a trend, i.e. whether current sales > previous month > the month before that, or current sales < previous month < the month before that.

TEST – Change: Boolean field to highlight whether three consecutive points are above or below the middle line.

TEST – Selection: This field is based on the "Select a Test" parameter and uses the above three calculated fields to return the required value.

Tooltip – Signal: Tooltip to show a signal if the sales value matches the selected test's criteria.

Show Text: This field shows the necessary text based on the parameter selection.
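
To make this more concrete, here is a minimal sketch of how a few of these calculations could be written as Tableau table calculations (computed along the month axis). The exact formulas in the workbook may differ, and the parameter values used below ("Median", "Mean", "Trend", "Change") are assumptions based on the descriptions above:

// Middle Line
IF [Select a Middle Line] = "Median" THEN WINDOW_MEDIAN(SUM([Sales]))
ELSE WINDOW_AVG(SUM([Sales])) END

// +3SD (and -3SD with a minus sign)
[Middle Line] + 3 * WINDOW_STDEV(SUM([Sales]))

// TEST – Outliers
SUM([Sales]) > [+3SD] OR SUM([Sales]) < [-3SD]

// TEST – Trend: three consecutive rises or three consecutive falls, using LOOKUP()
(SUM([Sales]) > LOOKUP(SUM([Sales]), -1) AND LOOKUP(SUM([Sales]), -1) > LOOKUP(SUM([Sales]), -2))
OR (SUM([Sales]) < LOOKUP(SUM([Sales]), -1) AND LOOKUP(SUM([Sales]), -1) < LOOKUP(SUM([Sales]), -2))

// TEST – Change: three consecutive points on the same side of the middle line
(SUM([Sales]) > [Middle Line] AND LOOKUP(SUM([Sales]), -1) > [Middle Line] AND LOOKUP(SUM([Sales]), -2) > [Middle Line])
OR (SUM([Sales]) < [Middle Line] AND LOOKUP(SUM([Sales]), -1) < [Middle Line] AND LOOKUP(SUM([Sales]), -2) < [Middle Line])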

Step 3: There are three or four more fields we need to create, but let's start building the visualisation for the first two sheets:

Drag Order Date to the Columns shelf (convert it to month)

Drag SUM(Sales) and TEST – Selection to the Rows shelf, then make it a dual axis and synchronise the axes

Drag Middle Line, +3SD and -3SD to Detail on the Marks card

Add three reference lines based on Middle Line, +3SD and -3SD

Show the parameter controls for "Select a Middle Line" and "Select a Test"

Then format the sheet as required, with tooltips, and it will look like the following:

Create a new sheet for the text and drag the Show Text field to Text

Step 4: Final sheet for Monthly strip chart

For the strip chart, we will need to add the following calculated fields:

TREND: Boolean field which we will use to highlight the pattern for the Trend test.

CHANGE: Boolean field which we will use to highlight the pattern for the Change test.

SymbolColor: Based on the test selected in the parameter, this field returns a value for that test, which we also use to segregate the data into three colour buckets (blue, amber and orange) to show the following (see the sketch after this list):

  • Meets the test criteria
  • Part of test pattern
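
As noted above, here is a rough sketch of how SymbolColor could be written, assuming TEST – Selection returns the Boolean result of the selected test; the third bucket label ("Neither") is illustrative only:

// SymbolColor
IF [TEST – Selection] THEN "Meets the test criteria"
ELSEIF ([Select a Test] = "Trend" AND [TREND]) OR ([Select a Test] = "Change" AND [CHANGE]) THEN "Part of test pattern"
ELSE "Neither"
END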

Finally, we will create the third sheet with the strip chart:

Drag Order Date to the Columns shelf (convert it to month)

Then drag SymbolColor to Colour and Shape on the Marks card, and create the tooltip to show the values.

Step 5: Create the Dashboard

Add all three sheets to the dashboard as per the image below:

Now the visualisation is ready, with the user experience we wanted to provide.

Click here for Tableau file

Thanks Ann Jackson for this workout.

Happy Data Visualisation!!!!

Thanks for visiting this post. Please let me know your feedback, and if you have any questions about the blog do not hesitate to contact me on Twitter (@Desaimithun).

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

In this post we will learn how to develop a predictive model to predict deal or no deal using a Shark Tank dataset (a US-based show).

Problem Statement

Shark Tank is a US-based show in which entrepreneurs and founders pitch their businesses to investors (aka Sharks), who decide whether or not to invest based on multiple parameters.

Here, we have a dataset of Shark Tank episodes with 495 records, one for each entrepreneur's pitch to the investors (aka Sharks). Using multiple algorithms, we will predict, given the description of a new pitch, how likely it is to convert into a deal.

Import, represent and clean the dataset
Import the Shark Tank dataset into R

# Read in the data

Sharktank = read.csv("Shark Tank Companies-1.csv", stringsAsFactors=FALSE)

Load all the libraries required for text mining

# Load Library

library(tm)
library(SnowballC)

To use the tm package we first need to transform the dataset into a corpus built from the required variable, i.e. description. Next we normalize the text in the descriptions:
1. Switch to lower case
2. Remove punctuation marks and stopwords
3. Remove extra whitespace
4. Stem the documents

# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case
corpus = tm_map(corpus, tolower)

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords
library(wordcloud)
wordcloud(corpus,colors=rainbow(7),max.words=100) 
# Remove stopwords, the, and
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))

# Remove extra whitespaces if any
corpus = tm_map(corpus, stripWhitespace)

# Stem document 
corpus = tm_map(corpus, stemDocument)


# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)

To analyze the texts, we need a DTM (Document-Term Matrix): each document becomes a row, each term/word becomes a column, and each cell holds the frequency of that term in that document. This helps us identify the unique words used frequently in the corpus.

#Document term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 495, terms: 3501)>>
## Non-/sparse entries: 9531/1723464
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)

To reduce the dimensionality of the DTM, we will remove infrequent words using removeSparseTerms with a sparsity threshold of 0.995, i.e. we keep only terms that appear in at least about 0.5% of the documents.

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)

Convert this matrix into a data.frame and add the dependent variable deal to the data frame as the final step of data preparation.

# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))

# Add dependent variable
descSparse$deal = Sharktank$deal

#Get no of deals
table(descSparse$deal)
## 
## FALSE  TRUE 
##   244   251

Predictive modelling
To predict whether the investors (aka Sharks) will invest in a business, we will use deal as the output variable and build CART, logistic regression and random forest models, then measure the performance and accuracy of each model.

CART Model

# Build CART model

library(rpart)
library(rpart.plot)

SharktankCart = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCart, extra=2)

[Figure: CART tree for deal prediction]

# Evaluate the performance of the CART model
predictCART = predict(SharktankCart, data=descSparse, type="class")

CART_initial <- table(descSparse$deal, predictCART)

# Model accuracy
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)

Random Forest Model

# Random forest model
library(randomForest)
set.seed(123)

SharktankRF = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Make predictions (deal is logical, so randomForest ran in regression mode and returns scores that we threshold at 0.5):
predictRF = predict(SharktankRF, data=descSparse)

# Evaluate the performance of the Random Forest
RandomForestInitial <- table(descSparse$deal, predictRF>= 0.5)

# Model accuracy
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)

[Figure: Variable Importance Plot: Shark Tank]

Logistic Regression Model

# Logistic Regression model

set.seed(123)

Sharktanklogistic = glm(deal~., data = descSparse)

# Make predictions:
predictLogistic = predict(Sharktanklogistic, data=descSparse)

# Evaluate the performance of the logistic regression
LogisticInitial <- table(descSparse$deal, predictLogistic> 0.5)

# Model accuracy
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)

Now let's add an additional variable called ratio, derived as askedFor/valuation, and then re-run the models to see whether it improves their accuracy.

# Add ratio variable into descSparse
descSparse$ratio = Sharktank$askedFor/Sharktank$valuation

#re-run the models to see if any changes

########CART Model###########
SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCartRatio, extra=2)

[Figure: CART tree with the ratio variable included]

# Evaluate the performance of the CART model
predictCARTRatio = predict(SharktankCartRatio, data=descSparse, type="class")

CART_ratio <- table(descSparse$deal, predictCARTRatio)

# Model accuracy
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)


#########Random Forest#############
#Random Forest model
SharktankRFRatio = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
#Make predictions:
predictRFRatio = predict(SharktankRFRatio, data=descSparse)

# Evaluate the performance of the Random Forest
RandomForestRatio <- table(descSparse$deal, predictRFRatio>= 0.5)

# Model accuracy
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)

[Figure: Variable Importance Plot: Shark Tank with Ratio]

#########Logistic Regression##########
#Logistic Model
SharktanklogisticRatio = glm(deal~., data = descSparse)

# Make predictions:
predictLogisticRatio = predict(SharktanklogisticRatio, data=descSparse)

# Evaluate the performance of the logistic regression
LogisticRatio <- table(descSparse$deal, predictLogisticRatio>= 0.5)

# Model accuracy
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)

Conclusion
Let's look at the accuracy of each model before and after the ratio column was added to the dataset.

####CART MODEL
#Before Ratio Column
BaseAccuracyCart
## [1] 0.6565657
#After Ratio Column
BaseAccuracyRatio
## [1] 0.6606061
####Logistic Regression
#Before Ratio Column
BaseAccuracyLogistic
## [1] 0.9979798
#After Ratio Column
BaseAccuracyLogisticRatio
## [1] 1
####RandomForest
#Before Ratio Column
BaseAccuracyRF
## [1] 0.5535354
#After Ratio Column
BaseAccuracyRFRatio
## [1] 0.5575758

With the CART model we were able to predict with about 65.66% and 66.06% accuracy using only the description and description+ratio respectively. With random forest, accuracy was 55.35% and 55.76% for description only and description+ratio respectively.

Logistic regression appears to give near-100% accuracy with both sets of predictors. However, note that glm() without family = "binomial" actually fits an ordinary (Gaussian) linear model, and with so many term columns relative to the 495 pitches it can essentially memorise the training data. This result therefore needs further validation on held-out data, keeping only significant variables and removing unnecessary ones, before it can be treated as a meaningful measure.
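
As a rough illustration of such validation (not part of the original analysis), one could hold out a test set and compare out-of-sample accuracy, for example:

# Hold-out validation sketch (illustrative only)
set.seed(123)
train_idx <- sample(seq_len(nrow(descSparse)), size = round(0.7 * nrow(descSparse)))
train <- descSparse[train_idx, ]
test  <- descSparse[-train_idx, ]

# Refit CART on the training split and evaluate it on the held-out split
cartHoldout <- rpart(deal ~ ., data = train, method = "class")
predHoldout <- predict(cartHoldout, newdata = test, type = "class")
holdoutTable <- table(test$deal, predHoldout)
sum(diag(holdoutTable)) / sum(holdoutTable)  # out-of-sample accuracy

The same split can be reused for the random forest and the logistic model (ideally refit with family = binomial) for a fairer comparison.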

I would urge readers to apply the knowledge from this post to their own text analyses and to solving various problems.

That's all for now. Please do let me know your feedback, and whether there is any particular topic you would like me to write on.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

In my previous post, we learned about text mining and sentiment analysis on news headlines using web scraping and R. Text analytics has been one of the black boxes of analytics. In this post, we will dive into text analysis of headlines with simple Natural Language Processing (NLP) using UDPipe in R. UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which are essential parts of natural language processing.

Step 1: Dataset
We will use the same dataset we created in the last article from ABC News: all the headlines published on ABC News in 2018.

#install.packages("udpipe")

library(dplyr)
library(ggplot2)

abc_scrap_all <- readRDS(file = paste0("abc_scrap_all.rds"))
#Remove duplicates
news <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]

news %>% group_by(Date) %>% count() %>% arrange(desc(n))
## # A tibble: 121 x 2
## # Groups:   Date [121]
##    Date         n
##    <chr>    <int>
##  1 20180523   213
##  2 20180504   172
##  3 20180320   155
##  4 20180220   153
##  5 20180227   152
##  6 20180301   148
##  7 20180423   147
##  8 20180208   145
##  9 20180321   144
## 10 20180322   137
## # ... with 111 more rows

Step 2: Pre-trained UDPipe model
The UDPipe package comes with pre-trained models for more than 50 languages. We can download a model using the udpipe_download_model() function.

library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ud-2.0-170801.udpipe')

Step 3: Annotate the input text
We use udpipe_annotate() to annotate the given text, and then convert the output into a data frame with data.frame().

# use udpipe_annotate() for analysis 
textanalysis <- udpipe_annotate(udmodel_english, news$headlines)
#data frame for the output
textframe <- data.frame(textanalysis)

Step 4: Universal POS (Part of Speech)
We will plot the part-of-speech tags for the given headlines.

## POS
library(lattice)
POS <- txt_freq(textframe$upos)
POS$key <- factor(POS$key, levels = rev(POS$key))
barchart(key ~ freq, data = POS, col = "yellow", 
         main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", 
         xlab = "Freq")

[Figure: UPOS (Universal Parts of Speech) frequency of occurrence]

Step 5: Frequently used nouns in headlines
Let's plot the most frequently used nouns in the headlines.

## NOUNS
noun <- subset(textframe, upos %in% c("NOUN")) 
noun <- txt_freq(noun$token)
noun$key <- factor(noun$key, levels = rev(noun$key))
barchart(key ~ freq, data = head(noun, 20), col = "cadetblue", 
         main = "Frequently used nouns", xlab = "Freq")

[Figure: Frequently used nouns]
More than half of the top nouns used in headlines seem to indicate a negative atmosphere.

Step 6: Frequently used adjectives in headlines
Let's analyse the adjectives used in the headlines; as a news website, it loves to magnify and inflate stories using adjectives.

## ADJECTIVES
adj <- subset(textframe, upos %in% c("ADJ")) 
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20), col = "purple", 
         main = "Frequently used Adjectives", xlab = "Freq")

[Figure: Frequently used adjectives]

Step 7: Frequently used verbs in headlines
Do the headlines bring any sign of optimism, or just infuse pessimism? The kind of verbs a media house uses can certainly help indicate the direction of optimism or pessimism.

## VERBS
verbs <- subset(textframe, upos %in% c("VERB")) 
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20), col = "gold", 
         main = "Most occurring Verbs", xlab = "Freq")

[Figure: Most occurring verbs]

With words like dies, killed, charged, accused and many more, it does not look like ABC News is interested in building an optimistic mindset among its readers. It is acting like a media house that chases hot, sensational or burning news to gain further viewership.

Step 8: Automated keyword extraction using RAKE
Rapid Automatic Keyword Extraction (RAKE) is one of the most popular (unsupervised) algorithms for extracting keywords in information retrieval. It identifies keywords as contiguous sequences of words that do not contain irrelevant (stop) words.

## Using RAKE
rake <- keywords_rake(x = textframe, term = "lemma", group = "doc_id", 
                       relevant = textframe$upos %in% c("NOUN", "ADJ"))
rake$key <- factor(rake$keyword, levels = rev(rake$keyword))
barchart(key ~ rake, data = head(subset(rake, freq > 3), 20), col = "red", 
         main = "Keywords identified by RAKE", 
         xlab = "Rake")

[Figure: Keywords identified by RAKE]

Step 9: Phrases
Now we will extract phrases: sequences of nouns and adjectives that form simple noun phrases. Let's bring out the top phrases that serve as keywords or topics for this headlines data.

## Using a sequence of POS tags (noun phrases / verb phrases)
textframe$phrase_tag <- as_phrasemachine(textframe$upos, type = "upos")
phrases <- keywords_phrases(x = textframe$phrase_tag, term = textframe$token, 
                          pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                          is_regex = TRUE, detailed = FALSE)
phrases <- subset(phrases, ngram > 1 & freq > 3)
phrases$key <- factor(phrases$keyword, levels = rev(phrases$keyword))
barchart(key ~ freq, data = head(phrases, 20), col = "magenta", 
         main = "Keywords - simple noun phrases", xlab = "Frequency")

To conclude, the top phrases are "commonwealth games" and "gold coast", as the Gold Coast hosted the Commonwealth Games this year. We also see US influence on the headlines, with "wall street", "white house" and "donald trump" appearing frequently.

Hope this post helped you to get started with text analytics and NLP in R.

Please do let me know your feedback, and whether there is any particular topic you would like me to write on.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

After writing the previous article on Twitter sentiment analysis of #royalwedding, I thought: why not analyse the ABC News website and see if we can uncover some interesting insights? It is good practice in data scraping, text mining and applying a few algorithms.

Below is the step-by-step guide:

To start with, we will scrape headlines from ABC News for 2018 so far. To get historical headlines, we will scrape the ABC News homepage via the Wayback Machine. After some data crunching and manipulation, we will have clean headlines.

Step 1:
Load all the important libraries we will be using in this exercise.

#Importing Libraries
library(stringr)
library(jsonlite)
library(httr)
library(rvest)
library(dplyr)
library(V8)
library(tweenr)
library(syuzhet) 
library(tidyverse)

#Path where files output will be stored
path<-'/Mithun/R/WebScraping/'

Step 2:

We make a Wayback Machine (CDX) API call for abc.net.au/news and parse the JSON response.
Here we will also filter the timestamps to keep dates from 1st Jan 2018 onwards.

#Wayback Machine CDX API call for http://www.abc.net.au/news/
AU_url<-'http://web.archive.org/cdx/search/cdx?url=abc.net.au/news/&output=json'

#API request
req <- httr::GET(AU_url, timeout(20))

#Get data
json <- httr::content(req, as = "text")

api_dat <- fromJSON(json)

#Get timestamps which will be used to pass in API
time_stamps<-api_dat[-1,2]

#Reverse order (so recent first)
time_stamps<-rev(time_stamps)

#Preview the snapshot timestamps we will use to scrape headlines from abc.net.au/news
head(time_stamps,n=50)

#Filter time_stamps to have dates after 2018
time_stamps<-time_stamps[as.numeric(substr(time_stamps,1,8))>=20180000]

Step 3
We will create a loop that requests each timestamped snapshot URL to collect all the headlines published on the ABC News homepage from 1st January 2018 onwards. We will also remove duplicate headlines picked up while scraping the website.

#Dataframe to store output and loop to get headlines
abc_scrap_all <-NA
for(s in 1:length(time_stamps)){

 Sys.sleep(1)

feedurl<-paste0('https://web.archive.org/web/',time_stamps[s],'/http://www.abc.net.au/news/')

  print(feedurl)

  if(!is.na(feedurl)){

   print('Valid URL')

    #Scrape the snapshot page
    try(feed_dat<-read_html(feedurl),silent=TRUE)

    if(exists('feed_dat')){

      #Select links pointing to 2018 news articles
      initial<-html_nodes(feed_dat,"[href*='/news/2018']")

      #Date
      Date <- substr(time_stamps[s],1,8)

      #Get headlines
      headlines<-initial  %>% html_text()

      #Combine
      comb<-data.frame(Date,headlines,stringsAsFactors = FALSE)

      #Remove NA headlines
      comb<-comb[!(is.na(comb$headlines) | comb$headlines=="" | comb$headlines==" ") ,]

      #As a df
      comb<-data.frame(comb)

      #Remove duplicates on daily level
      comb <-  comb[!duplicated(comb$headlines),]

      if(length(comb$headlines)>0){

        #Save with the rest
        abc_scrap_all<-rbind(abc_scrap_all,comb)
      }
      rm(comb)
      rm(feed_dat)

     }
  }

}

#Remove duplicates
abc_scrap_all_final <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]

Step 4:
We will use the headlines to do sentiment analysis with the syuzhet package and see if we can draw some conclusions.

library('syuzhet')
abc_scrap_all_final$headlines <- str_replace_all(abc_scrap_all_final$headlines,"[^[:graph:]]", " ")
Sentiment <-get_nrc_sentiment(abc_scrap_all_final$headlines)

td<-data.frame(t(Sentiment))
td_Rowsum <- data.frame(rowSums(td[2:1781])) 

#Transformation and  cleaning
names(td_Rowsum)[1] <- "count"
td_Rowsum <- cbind("sentiment" = rownames(td_Rowsum), td_Rowsum)
rownames(td_Rowsum) <- NULL
td_Plot<-td_Rowsum[1:10,]

#Visualisation
library("ggplot2")

qplot(sentiment, data=td_Plot, weight=count, geom="bar",fill=sentiment)+ggtitle("Abc News headlines sentiment analysis")

Conclusion on Sentiment Analysis:
The human brain tends to be more attentive to negative information. To grab readers' attention, most media houses focus on negative and fear-related news. That is what we see when we analyse the ABC News headlines as well.

Let's analyse the headlines further:

Wordcloud for frequently used words in headlines

library(tm)
library(wordcloud)
  corpus = Corpus(VectorSource(tolower(abc_scrap_all_final$headlines)))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, stopwords("english"))

  frequencies = DocumentTermMatrix(corpus)
  word_frequencies = as.data.frame(as.matrix(frequencies))

  words <- colnames(word_frequencies)
  freq <- colSums(word_frequencies)
  wordcloud(words, freq,
            min.freq=sort(freq, decreasing=TRUE)[[100]],
            colors=brewer.pal(10, "Paired"),
            random.color=TRUE) 

[Figure: Word cloud of frequently used headline words]

Surprisingly, for an Australian news agency, the most frequently used words in headlines relate to Donald Trump (the American president), followed by police, the Commonwealth Games, sport and Australia.

Find word associations:

If we have a specific word of interest, word associations help identify the terms most highly correlated with it. If two words always appear together, the correlation is 1.0; in our example we will look for words with at least 30% correlation with our chosen term.

findAssocs(frequencies, "tony", corlimit=0.3)
## $tony
##      abbott headbutting       cooke        30th     hansons  benneworth 
##        0.64        0.35        0.30        0.30        0.30        0.30 
##     mocking       astro        labe 
##        0.30        0.30        0.30
#0.3 means 30% correlation with word "tony" 

That's all for now. In my next post we will look further into text analytics using UDPipe and see if we can build more on word associations and analytics.

Please do let me know your feedback, and whether there is any particular topic you would like me to write on.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

After a long break of five weeks I am back to blogging. Today we will go through Twitter sentiment analysis on #RoyalWedding using R.

The last few years have seen an interesting revolution in social media: it is not just a platform where people talk to one another, it has become a platform where people:

  • Express interests
  • Share views
  • Show dissent
  • Praise or criticize companies or politicians

So in this article we will learn how to analyze what people post on Twitter and come up with a solution that helps us understand public sentiment.

How to create a Twitter app

Twitter provides an API which we can use to analyze tweets posted by users along with their underlying metadata. This API helps us extract data in a structured format which can easily be analyzed.

To create a Twitter app, you need a Twitter account; once you have one, visit the Twitter apps page and create an application to access the data. A step-by-step guide is available at the following link:

https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/

Once you have created the app, you will get the following four keys:
a. Consumer key (API key)
b. Consumer secret (API Secret)
c. Access Token
d. Access Token Secret

We will use these keys to extract data from Twitter for the analysis.

Implementing Sentiment Analysis in R

Now we will go through the step-by-step process in R to extract tweets from Twitter and perform sentiment analysis on them. We will use #RoyalWedding as our topic of analysis.

Extracting tweets using Twitter application
Install the necessary packages

# Install packages
install.packages("twitteR", repos = "http://cran.us.r-project.org")
install.packages("RCurl", repos = "http://cran.us.r-project.org")
install.packages("httr", repos = "http://cran.us.r-project.org")
install.packages("syuzhet", repos = "http://cran.us.r-project.org")

# Load the required Packages
library(twitteR)
library(RCurl)
library(httr)
library(tm)
library(wordcloud)
library(syuzhet)

The next step is to set up access to the Twitter API using the app we created, passing the keys and access tokens to get the data.

# authorisation keys
consumer_key = "ABCD12345690XXXXXXXXX" #Consumer key from twitter app
consumer_secret = "ABCD12345690XXXXXXXXX" #Consumer secret from twitter app
access_token = "ABCD12345690XXXXXXXXX" #access token from twitter app
access_secret ="ABCD12345690XXXXXXXXX" #access secret from twitter app

# set up
setup_twitter_oauth(consumer_key,consumer_secret,access_token, access_secret)
## [1] "Using direct authentication"
# search for tweets in english language
tweets = searchTwitter("#RoyalWedding", n = 10000, lang = "en")
# store the tweets into dataframe
tweets.df = twListToDF(tweets)

The above code invokes the Twitter app and extracts tweets containing "#RoyalWedding". The royal wedding is the flavour of the season and the talk of the world, with everyone expressing their views on Twitter.

Cleaning tweets for further analysis
We will remove hashtags, junk characters, other Twitter handles and URLs from the tweets using the gsub() function, so that we have clean text for further analysis.

# CLEANING TWEETS

tweets.df$text = gsub("&amp", "", tweets.df$text)
tweets.df$text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets.df$text)
tweets.df$text = gsub("@\\w+", "", tweets.df$text)
tweets.df$text = gsub("[[:punct:]]", "", tweets.df$text)
tweets.df$text = gsub("[[:digit:]]", "", tweets.df$text)
tweets.df$text = gsub("http\\w+", "", tweets.df$text)
tweets.df$text = gsub("[ \t]{2,}", "", tweets.df$text)
tweets.df$text = gsub("^\\s+|\\s+$", "", tweets.df$text)

tweets.df$text <- iconv(tweets.df$text, "UTF-8", "ASCII", sub="")

Now we have only the relevant parts of the tweets, which we can use for analysis.

Getting a sentiment score for each tweet

Let's score the emotions of each tweet; syuzhet's NRC dictionary breaks sentiment into 10 categories (eight emotions plus positive and negative).

# Emotions for each tweet using NRC dictionary
emotions <- get_nrc_sentiment(tweets.df$text)
emo_bar = colSums(emotions)
emo_sum = data.frame(count=emo_bar, emotion=names(emo_bar))
emo_sum$emotion = factor(emo_sum$emotion, levels=emo_sum$emotion[order(emo_sum$count, decreasing = TRUE)])

After the above steps, we are ready to visualize which types of emotion are dominant in the tweets.

# Visualize the emotions from NRC sentiments
library(plotly)
p <- plot_ly(emo_sum, x=~emotion, y=~count, type="bar", color=~emotion) %>%
  layout(xaxis=list(title=""), showlegend=FALSE,
         title="Emotion Type for hashtag: #RoyalWedding")
api_create(p,filename="Sentimentanalysis")

Here we see that the majority of people are talking positively about the royal wedding, which is a good indicator for the analysis.

Lastly, let's see which words contribute to which emotion:

# Create comparison word cloud data

wordcloud_tweet = c(
  paste(tweets.df$text[emotions$anger > 0], collapse=" "),
  paste(tweets.df$text[emotions$anticipation > 0], collapse=" "),
  paste(tweets.df$text[emotions$disgust > 0], collapse=" "),
  paste(tweets.df$text[emotions$fear > 0], collapse=" "),
  paste(tweets.df$text[emotions$joy > 0], collapse=" "),
  paste(tweets.df$text[emotions$sadness > 0], collapse=" "),
  paste(tweets.df$text[emotions$surprise > 0], collapse=" "),
  paste(tweets.df$text[emotions$trust > 0], collapse=" ")
)

# create corpus
corpus = Corpus(VectorSource(wordcloud_tweet))

# convert every word to lower case, remove punctuation and remove stop words

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)

# create term document matrix

tdm = TermDocumentMatrix(corpus)

# convert as matrix
tdm = as.matrix(tdm)
tdmnew <- tdm[nchar(rownames(tdm)) < 11,]

# column name binding
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
colnames(tdmnew) <- colnames(tdm)
comparison.cloud(tdmnew, random.order=FALSE,
                 colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
                 title.size=1, max.words=250, scale=c(2.5, 0.4),rot.per=0.4)

[Figure: Comparison word cloud of words by emotion]

This is how the word cloud for tweets with #RoyalWedding looks. Using R, we can analyse sentiment on social media, and this can be extended to a particular handle or product to see what people are saying and whether it is negative or positive.

Please feel free to ask any questions, or let me know if you would like me to write on any specific topic.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.