This week's #makeovermonday tackled a tough and sensitive topic: the gender pay gap in the UK.

The data was shared by Gov.uk.

Here is the original report and how it looks:

Graphic displaying the mean bonus pay gap for HMRC

 

The data is available on data.world (week 23).

Here is what I did:

  • In my first attempt, I created a few calculations and pivoted the data within Tableau to build bar charts, which did not yield any useful output.
  • I then tried scatter plots and various other forms of visualisation, but something was still missing.
  • My final attempt was a Gantt chart showing the difference in the male-to-female ratio across the pay scale quartiles.

Below is a screenshot of the Tableau file (click on the image for the interactive version):

Thanks for visiting the blog. Please let me know your feedback and whether there is any particular topic you would like me to write about.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

This week’s #workoutwednesday was about a problem that can only be solved using table calculations. The idea was to find which city contributes the most sales in each state.

Requirements

  • Use only table calculations
  • The bar length is the total sales of each state
  • City must be included in the view.
  • Display only one mark per state.
  • Label each bar by the city with the highest sales, sales for that city, and the total sales for that state.
  • No level of detail calculations allowed.

This week uses the superstore dataset.  You can get it here at data.world
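Outside Tableau, the same question (total sales per state plus the city with the highest sales) can be sketched in a few lines of R. This is only an illustration of the logic, not the Tableau solution: the data frame below is made up for the example and is not taken from the Superstore dataset. The `ave()` calls play a role loosely similar to a window calculation partitioned by state.

```r
# Made-up sales figures for illustration only (not Superstore data)
sales <- data.frame(
  state = c("Ohio", "Ohio", "Ohio", "Texas", "Texas"),
  city  = c("Columbus", "Cleveland", "Akron", "Houston", "Dallas"),
  sales = c(500, 300, 200, 900, 400)
)

# Per-state total and per-state maximum, repeated on every row of that state
sales$state_total <- ave(sales$sales, sales$state, FUN = sum)
sales$state_max   <- ave(sales$sales, sales$state, FUN = max)

# Keep one row per state: the city whose sales equal the state maximum
top_city <- sales[sales$sales == sales$state_max,
                  c("state", "city", "sales", "state_total")]
print(top_city)
```

Each remaining row carries everything the label needs: the top city, its sales, and the state total that drives the bar length.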

Below is my attempt at a solution that meets the above requirements:


This week in #makeovermonday, we are looking at real estate price data to see the world’s most expensive prime property.

The data was shared by the World Economic Forum (weforum).

Here is the original report and how it looks:

 

The data is available on data.world (week 22).

Here is what I did:

  • I used a simple chart to show how much space USD 1 million buys in each place. I chose to display the data in a simple way rather than complicate things.

Below is the Tableau file:


In this post, we will learn how to develop a predictive model that predicts deal or no deal using the Shark Tank dataset (a US-based show).

Problem Statement

Shark Tank is a US-based show in which entrepreneurs and founders pitch their businesses to investors (aka Sharks), who decide whether or not to invest based on multiple parameters.

Here we have a dataset of Shark Tank episodes with 495 records, each one an entrepreneur's pitch to the investors. Using multiple algorithms, we will predict, from the description of a new pitch, how likely it is to end in a deal.

Import Dataset and Representation along with data cleaning
Import the shark tank dataset into R

# Read in the data

Sharktank = read.csv("Shark Tank Companies-1.csv", stringsAsFactors=FALSE)

Load all the libraries required for text mining

# Load Library

library(tm)
library(SnowballC)

To use the tm package, we first need to transform the dataset into a corpus built from the required variable, i.e. description. Next, we normalize the text of the descriptions:
1. Switch to lower case
2. Remove punctuation marks and stopwords
3. Remove extra whitespaces
4. Stem the documents

# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case (wrap base functions in content_transformer() for tm_map)
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords
library(wordcloud)
wordcloud(corpus, colors=rainbow(7), max.words=100)

# Remove stopwords, the, and
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))

# Remove extra whitespaces if any
corpus = tm_map(corpus, stripWhitespace)

# Stem document 
corpus = tm_map(corpus, stemDocument)


# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)
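To see what these cleaning steps do to a single piece of text, here is a minimal base R sketch of the same idea. The sentence and the tiny stop-word list are made up for illustration (they are not the Shark Tank data or tm's `stopwords("english")`), and stemming is left out because it needs SnowballC.

```r
# Hypothetical pitch description, not taken from the dataset
text <- "The BEST reusable bottle, and the easiest way to stay hydrated!"

# Tiny illustrative stop-word list (an assumption, much shorter than tm's)
stop_words <- c("the", "and", "to", "a", "of")

clean <- tolower(text)                             # 1. lower case
clean <- gsub("[[:punct:]]", "", clean)            # 2. drop punctuation
words <- strsplit(clean, "[[:space:]]+")[[1]]      # split into words
words <- words[!words %in% stop_words]             # 2. drop stop words
clean <- paste(words, collapse = " ")              # 3. single spaces only

print(clean)
```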

To analyze the text, we use a DTM (Document-Term Matrix): each document becomes a row, each term/word becomes a column, and each cell holds the frequency of that term in that document. This helps us identify the words that are used frequently across the corpus.

#Document term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 495, terms: 3501)>>
## Non-/sparse entries: 9531/1723464
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)
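For intuition, the same structure can be built by hand in base R. This is a sketch on three made-up documents, not the way tm stores its matrix internally (tm uses a sparse format); it only shows what rows, columns and sparsity mean.

```r
# Three made-up documents, already cleaned
docs <- c(d1 = "smart water bottle",
          d2 = "water filter",
          d3 = "smart phone case")

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab
print(dtm)

# Sparsity: share of cells that are zero
sparsity <- mean(dtm == 0)
print(sparsity)
```

Even with three short documents more than half of the cells are empty, which is why a real corpus like ours ends up at 99% sparsity.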

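To reduce the dimensions of the DTM, we will remove the less frequent words using removeSparseTerms with a sparsity threshold of 0.995, which keeps only the terms whose sparsity is below 0.995 (roughly, terms that appear in more than 0.5% of the documents).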

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
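The arithmetic behind the 0.995 threshold can be checked quickly. The sketch below reflects my reading of the removeSparseTerms documentation (a term is kept when the number of documents containing it exceeds N * (1 - sparse)); the term names and document counts are hypothetical.

```r
n_docs    <- 495
threshold <- 0.995

# A term must appear in more than this many documents to be kept
cutoff   <- n_docs * (1 - threshold)   # about 2.475
min_docs <- floor(cutoff) + 1          # so at least 3 documents
print(min_docs)

# Hypothetical document counts for three terms
doc_counts <- c(shark = 40, gadget = 3, zebra = 2)
kept <- names(doc_counts)[doc_counts > cutoff]
print(kept)
```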

As the final step of data preparation, convert this matrix into a data frame and add the dependent variable deal to it.

# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))

# Add dependent variable
descSparse$deal = Sharktank$deal

# Get the number of deals
table(descSparse$deal)
## 
## FALSE  TRUE 
##   244   251
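These counts give a useful yardstick before any modelling: a "model" that always predicts the majority class would already be right a little over half the time. A quick check, using only the two counts printed above:

```r
# Class counts copied from the table() output above
deal_counts <- c(`FALSE` = 244, `TRUE` = 251)

# Accuracy of always predicting the majority class
majority_baseline <- max(deal_counts) / sum(deal_counts)
print(round(majority_baseline, 4))   # 0.5071
```

Any model accuracy reported later is best read against this figure of roughly 50.7%.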

Predictive modelling
To predict whether the investors (aka sharks) will invest in a business, we will use deal as the output variable and compare the accuracy of CART, logistic regression and random forest models.

CART Model

# Build CART model

library(rpart)
library(rpart.plot)

SharktankCart = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCart, extra=2)


# Evaluate the performance of the CART model on the training data
# (predict() expects newdata=, not data=; the result is the same here)
predictCART = predict(SharktankCart, newdata=descSparse, type="class")

CART_initial <- table(descSparse$deal, predictCART)

# Accuracy on the training data
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)
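The accuracy line above divides the diagonal of the confusion matrix (correct predictions) by the total number of cases. A small sketch on made-up vectors shows that this is the same as the share of matching predictions; the values below are hypothetical and unrelated to the Shark Tank results.

```r
# Hypothetical actual outcomes and predictions
actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

confusion <- table(actual, predicted)
print(confusion)

# Diagonal = correct predictions (FALSE/FALSE and TRUE/TRUE)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)

# Same number without building the table
print(mean(actual == predicted))
```

One caveat: if a model predicts only one class, the table is no longer square and `diag()` no longer picks out the correct predictions, so `mean(actual == predicted)` is the safer form.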

Random Forest Model

# Random forest model
library(randomForest)
set.seed(123)

SharktankRF = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
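
# Make predictions. predict() for a random forest expects newdata=, not data=,
# so the original data=descSparse was ignored; without newdata the
# out-of-bag predictions are returned, which is what we use here.
predictRF = predict(SharktankRF)

# Evaluate the performance of the Random Forest
RandomForestInitial <- table(descSparse$deal, predictRF >= 0.5)

# Out-of-bag accuracy
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)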

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)


Logistic Regression Model

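# Logistic Regression model
# Note: glm() defaults to family = gaussian, so as written this fits a linear
# model on the 0/1 outcome; a true logistic regression needs family = binomial.

set.seed(123)

Sharktanklogistic = glm(deal~., data = descSparse)

# Make predictions on the training data
# (predict() expects newdata=, not data=; without it the fitted values are returned)
predictLogistic = predict(Sharktanklogistic)

# Evaluate the performance of the regression model
LogisticInitial <- table(descSparse$deal, predictLogistic > 0.5)

# Accuracy on the training data
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)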

Now let’s add an additional variable called ratio, derived as askedFor/valuation, and re-run the models to see whether their accuracy improves.

# Add ratio variable into descSparse
descSparse$ratio = Sharktank$askedFor/Sharktank$valuation
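As a quick sanity check of what this variable represents: the amount asked for divided by the valuation is the share of the company implied by the pitch. The two rows below are hypothetical values, not records from the dataset; only the column names askedFor and valuation come from the code above.

```r
# Hypothetical pitches for illustration only
pitches <- data.frame(askedFor  = c(100000, 250000),
                      valuation = c(1000000, 1000000))

pitches$ratio <- pitches$askedFor / pitches$valuation
print(pitches)

# Guard worth adding on real data: a zero or missing valuation gives Inf/NaN/NA
print(all(is.finite(pitches$ratio)))
```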

#re-run the models to see if any changes

########CART Model###########
SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCartRatio, extra=2)


# Evaluate the performance of the CART model on the training data
# (predict() expects newdata=, not data=; the result is the same here)
predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")

CART_ratio <- table(descSparse$deal, predictCARTRatio)

# Accuracy on the training data
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)


#########Random Forest#############
#Random Forest Model
SharktankRFRatio = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

# Make predictions (out-of-bag, since predict() is called without newdata=)
predictRFRatio = predict(SharktankRFRatio)

# Evaluate the performance of the Random Forest
RandomForestRatio <- table(descSparse$deal, predictRFRatio >= 0.5)

# Out-of-bag accuracy
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)


#########Logistic Regression##########
#Logistic Model
SharktanklogisticRatio = glm(deal~., data = descSparse)

# Make predictions on the training data (fitted values, no newdata= given)
predictLogisticRatio = predict(SharktanklogisticRatio)

# Evaluate the performance of the regression model
LogisticRatio <- table(descSparse$deal, predictLogisticRatio >= 0.5)

# Accuracy on the training data
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)

Conclusion
Let's look at the accuracy of each model before and after the ratio column was added to the text-mining dataset.

####CART MODEL
#Before Ratio Column
BaseAccuracyCart
## [1] 0.6565657
#After Ratio Column
BaseAccuracyRatio
## [1] 0.6606061
####Logistic Regression
#Before Ratio Column
BaseAccuracyLogistic
## [1] 0.9979798
#After Ratio Column
BaseAccuracyLogisticRatio
## [1] 1
####RandomForest
#Before Ratio Column
BaseAccuracyRF
## [1] 0.5535354
#After Ratio Column
BaseAccuracyRFRatio
## [1] 0.5575758

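With the CART model, we predicted 65.66% and 66.06% of the outcomes correctly using description only and description + ratio, respectively. With the random forest, the figures were 55.35% and 55.76%. Note that the CART figures are measured on the training data, while the random forest figures come from its out-of-bag predictions, so the two are not directly comparable.

The regression model gave 99.80% and 100% accuracy for the two variable sets. These figures are also computed on the same data used to fit the model, so they most likely reflect overfitting to the many word columns rather than real predictive power. The model needs further validation, keeping the significant variables and removing the unnecessary ones, before it yields a measurable result.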
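One way to start that validation is a simple hold-out split: fit on one part of the data and measure accuracy on the rest. The sketch below uses simulated data (the predictors x1 and x2 and their coefficients are assumptions made up for the example, not the Shark Tank columns) and fits a logistic regression with family = binomial.

```r
set.seed(123)

# Simulated stand-in data: two made-up predictors and a TRUE/FALSE outcome
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$deal <- rbinom(n, 1, plogis(0.8 * sim$x1 - 0.5 * sim$x2)) == 1

# 70/30 train/test split
train_idx <- sample(n, size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Logistic regression needs family = binomial
fit <- glm(deal ~ ., data = train, family = binomial)

# Predicted probabilities on unseen rows, then accuracy at a 0.5 cut-off
prob <- predict(fit, newdata = test, type = "response")
test_accuracy <- mean((prob >= 0.5) == test$deal)
print(test_accuracy)
```

Applied to descSparse, the same pattern would give a test-set accuracy that can be compared fairly across the three models.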

I would urge readers to apply what they learned in this post to their own text analyses and use it to solve other problems.

That’s all for now. Please let me know your feedback and whether there is any particular topic you would like me to write about.


In my previous post, we learned about text mining and sentiment analysis of news headlines using web scraping and R. Text analytics has been one of the black boxes of analytics. In this post, we will dive into text analysis of headlines with simple Natural Language Processing (NLP) using UDPipe in R. UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, all of which are essential steps in natural language processing.

Step 1: Dataset
We will use the same ABC News dataset that we created in the last article. We will take all the headlines published on ABC News in 2018.

#install.packages("udpipe")

library(dplyr)
library(ggplot2)

abc_scrap_all <- readRDS(file = "abc_scrap_all.rds")
#Remove duplicates
news <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]

news %>% group_by(Date) %>% count() %>% arrange(desc(n))
## # A tibble: 121 x 2
## # Groups:   Date [121]
##    Date         n
##    <chr>    <int>
##  1 20180523   213
##  2 20180504   172
##  3 20180320   155
##  4 20180220   153
##  5 20180227   152
##  6 20180301   148
##  7 20180423   147
##  8 20180208   145
##  9 20180321   144
## 10 20180322   137
## # ... with 111 more rows

Step 2: pre-trained UDPipe model
The UDPipe package provides pre-trained models for more than 50 languages. We can download a model using the udpipe_download_model() function.

library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ud-2.0-170801.udpipe')

Step 3: Annotate the input text
Use udpipe_annotate() to annotate the headlines, then convert the result into a data frame with data.frame().

# use udpipe_annotate() for analysis 
textanalysis <- udpipe_annotate(udmodel_english, news$headlines)
#data frame for the output
textframe <- data.frame(textanalysis)

Step 4: Universal POS (Part of Speech)
We will plot the frequency of the part-of-speech tags found in the headlines.

## POS
library(lattice)
POS <- txt_freq(textframe$upos)
POS$key <- factor(POS$key, levels = rev(POS$key))
barchart(key ~ freq, data = POS, col = "yellow", 
         main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", 
         xlab = "Freq")
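If you are curious what txt_freq() does with the tags, it is essentially a sorted frequency table with the columns key, freq and freq_pct. Here is a rough base R equivalent on a made-up vector of tags (the tags are hypothetical, not the annotated headlines), shown only to make the shape of the result clear.

```r
# Hypothetical UPOS tags, standing in for textframe$upos
upos <- c("NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB", "PROPN")

# Count each tag and sort from most to least frequent
counts <- sort(table(upos), decreasing = TRUE)

pos_freq <- data.frame(
  key      = names(counts),
  freq     = as.integer(counts),
  freq_pct = 100 * as.integer(counts) / length(upos)
)
print(pos_freq)
```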


Step 5: Frequently used Nouns in headlines
Let's plot the most frequently used nouns in the headlines.

## NOUNS
noun <- subset(textframe, upos %in% c("NOUN")) 
noun <- txt_freq(noun$token)
noun$key <- factor(noun$key, levels = rev(noun$key))
barchart(key ~ freq, data = head(noun, 20), col = "cadetblue", 
         main = "Frequently used nouns", xlab = "Freq")

More than half of the top nouns used in the headlines seem to indicate a negative atmosphere.

Step 6: Frequently used Adjective in headlines
Let’s analyze the adjectives used in the headlines, since a news website tends to magnify and inflate its stories with adjectives.

## ADJECTIVES
adj <- subset(textframe, upos %in% c("ADJ")) 
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20), col = "purple", 
         main = "Frequently used Adjectives", xlab = "Freq")


Step 7: Frequently used Verb in headlines
Do the headlines bring any sign of optimism, or do they just infuse pessimism? The kind of verbs a media house uses can certainly help show which direction it leans.

## VERBS
verbs <- subset(textframe, upos %in% c("VERB")) 
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20), col = "gold", 
         main = "Most occurring Verbs", xlab = "Freq")


With words like dies, killed, charged, accused and many more, it does not look like ABC News is interested in building an optimistic mindset among its audience. It is acting like a media house that looks for hot, sensational or burning news to gain further viewership.

Step 8: Automated keywords extraction using RAKE
Rapid Automatic Keyword Extraction (RAKE) is one of the most popular unsupervised algorithms for extracting keywords in information retrieval. It looks for keywords as contiguous sequences of words that do not contain irrelevant words.

## Using RAKE
# relevant must be computed from textframe itself (there is no object called x here)
rake <- keywords_rake(x = textframe, term = "lemma", group = "doc_id", 
                       relevant = textframe$upos %in% c("NOUN", "ADJ"))
rake$key <- factor(rake$keyword, levels = rev(rake$keyword))
barchart(key ~ rake, data = head(subset(rake, freq > 3), 20), col = "red", 
         main = "Keywords identified by RAKE", 
         xlab = "Rake")
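To get a feel for how RAKE scores a phrase, here is a toy version in base R. It sketches the scoring idea as I understand it (word score = degree / frequency, phrase score = sum of its word scores); it is not udpipe's implementation, and the tokens and relevance flags below are made up.

```r
# Made-up tokens; TRUE marks a relevant word (e.g. a noun or adjective)
tokens   <- c("gold", "coast", "wins", "gold", "medal", "tally", "for", "coast")
relevant <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

# Irrelevant words break the text into candidate phrases
run_id  <- cumsum(!relevant)
phrases <- split(tokens[relevant], run_id[relevant])

# Frequency: how often a word occurs; degree: total length of the phrases it occurs in
word_freq   <- table(unlist(phrases))
words       <- names(word_freq)
word_degree <- sapply(words, function(w)
  sum(sapply(phrases, function(p) length(p) * sum(p == w))))
word_score  <- word_degree / as.numeric(word_freq[words])

# Phrase score = sum of the scores of its words
rake_toy <- data.frame(
  keyword = sapply(phrases, paste, collapse = " "),
  rake    = sapply(phrases, function(p) sum(word_score[p])),
  row.names = NULL
)
rake_toy <- rake_toy[order(-rake_toy$rake), ]
print(rake_toy)
```

Longer phrases made of words that mostly appear inside phrases score highest, which is why multi-word keywords tend to rise to the top.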


Step 9: Phrases
Now we will extract phrases by matching sequences of part-of-speech tags; the pattern below picks out simple noun phrases. Let us bring out the top phrases, which serve as keywords or topics for this headline data.

## Using a sequence of POS tags (noun phrases / verb phrases)
textframe$phrase_tag <- as_phrasemachine(textframe$upos, type = "upos")
phrases <- keywords_phrases(x = textframe$phrase_tag, term = textframe$token, 
                          pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                          is_regex = TRUE, detailed = FALSE)
phrases <- subset(phrases, ngram > 1 & freq > 3)
phrases$key <- factor(phrases$keyword, levels = rev(phrases$keyword))
barchart(key ~ freq, data = head(phrases, 20), col = "magenta", 
         main = "Keywords - simple noun phrases", xlab = "Frequency")
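The pattern works on a string of one-letter tags (A for adjective, N for noun, P for preposition, D for determiner), so the idea can be tried with a plain regular expression. The sketch below is a simplification: the tokens and their one-letter tags are assigned by hand for illustration, and it only returns the longest left-to-right matches, whereas keywords_phrases() may enumerate more candidates.

```r
# Hand-tagged toy headline (tags are assumptions, not udpipe output)
tokens <- c("Gold", "Coast", "hosts", "the", "Commonwealth", "Games")
tags   <- c("N",    "N",     "V",     "D",   "N",            "N")

# One character per token, so match positions map straight back to tokens
tag_string <- paste(tags, collapse = "")
pattern    <- "(A|N)*N(P+D*(A|N)*N)*"

m      <- gregexpr(pattern, tag_string, perl = TRUE)[[1]]
starts <- as.integer(m)
lens   <- attr(m, "match.length")

# Turn each matched tag run back into the words it covers
noun_phrases <- mapply(function(s, l) paste(tokens[s:(s + l - 1)], collapse = " "),
                       starts, lens)
print(noun_phrases)
```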

To conclude, Commonwealth Games and Gold Coast are among the most used phrases, as the Gold Coast hosted the Commonwealth Games this year. The US influence on the headlines also shows, with Wall Street, White House and Donald Trump appearing frequently.

Hope this post helped you to get started with text analytics and NLP in R.
