This week's #makeovermonday tackled a tough and sensitive topic: the gender pay gap in the UK.

The data was shared by Gov.uk.

Here is the original report and how it looks:

Graphic displaying the mean bonus pay gap for HMRC

 

The data is available on data.world (week 23).

Here is what I did:

  • In my first attempt I created a few calculations and pivoted the data within Tableau to build bar charts, which did not yield any productive output.
  • I then tried scatter plots and various other forms of visualisation, but something was still missing.
  • The final attempt was a Gantt chart showing the difference between male and female shares in each pay-scale quartile.

Below is a screenshot of the Tableau file (click on the image for the interactive version):

Thanks for visiting the blog. Please do let me know your feedback and whether there is any particular topic you would like me to write about.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

In this post we will learn how to develop a predictive model that predicts deal or no deal using a Shark Tank dataset (a US-based show).

Problem Statement

Shark Tank is a US-based show in which entrepreneurs and founders pitch their businesses in front of investors (aka Sharks), who decide whether or not to invest based on multiple parameters.

Here we have a dataset of Shark Tank episodes with 495 records, one for each entrepreneur making a pitch to the investors (aka sharks). Using multiple algorithms, we will predict, given the description of a new pitch, how likely the pitch is to convert into a deal.

Importing the dataset, representation and data cleaning
Import the Shark Tank dataset into R:

# Read in the data

Sharktank = read.csv("Shark Tank Companies-1.csv", stringsAsFactors=FALSE)
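As a quick optional sanity check, we can peek at the columns used later in this post (description for the text mining, deal as the outcome, and askedFor/valuation for the ratio feature added towards the end):

# Inspect the fields used in the rest of this post
str(Sharktank[, c("description", "deal", "askedFor", "valuation")])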

Load all the libraries required for text mining

# Load Library

library(tm)
library(SnowballC)

To use the tm package we first need to transform the dataset into a corpus built from the required variable, i.e. description. Next we normalize the text of the descriptions:
1. Switch to lower case
2. Remove punctuation marks and stopwords
3. Remove extra whitespaces
4. Stem the documents

# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case (content_transformer keeps the document structure intact in newer tm versions)
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords
library(wordcloud)
wordcloud(corpus,colors=rainbow(7),max.words=100) 
# Remove stopwords ("the", "and", plus the standard English stopword list)
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))

# Remove extra whitespaces if any
corpus = tm_map(corpus, stripWhitespace)

# Stem document 
corpus = tm_map(corpus, stemDocument)


# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)

To analyze the text we build a DTM (Document-Term Matrix): documents become rows, terms/words become columns, and each cell holds the frequency of that term in that document. This helps us identify which words are used frequently across the corpus.
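As a tiny standalone illustration (made up for this post, not part of the Shark Tank pipeline), this is what a DTM looks like for two short documents:

# Toy example only: two made-up documents to show the DTM structure
toyCorpus = Corpus(VectorSource(c("great product great pitch",
                                  "new product for pet owners")))
toyDTM = DocumentTermMatrix(toyCorpus)
inspect(toyDTM)   # rows = documents, columns = terms, cells = term counts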

#Document term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 495, terms: 3501)>>
## Non-/sparse entries: 9531/1723464
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)

To reduce the dimensionality of the DTM, we remove infrequent words with removeSparseTerms using a sparsity threshold of 0.995, i.e. we keep only terms that appear in at least roughly 0.5% of the documents.

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)

As the final data-preparation step, convert the matrix into a data.frame and add the dependent variable deal to it.

# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))

# Add dependent variable
descSparse$deal = Sharktank$deal

#Get no of deals
table(descSparse$deal)
## 
## FALSE  TRUE 
##   244   251

Predictive modelling
To predict whether the investors (aka sharks) will invest in a business, we use deal as the output variable and build CART, logistic regression and random forest models, then compare their performance and accuracy.

CART Model

# Build CART model

library(rpart)
library(rpart.plot)

SharktankCart = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCart, extra=2)

CART decision tree for the deal variable

# Evaluate the performance of the CART model on the training data
predictCART = predict(SharktankCart, newdata=descSparse, type="class")

CART_initial <- table(descSparse$deal, predictCART)

# Accuracy on the training data
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)

Random Forest Model

# Random forest model
library(randomForest)
set.seed(123)

SharktankRF = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Make predictions (deal is logical, so randomForest ran in regression mode;
# without newdata, predict() returns the out-of-bag predictions)
predictRF = predict(SharktankRF)

# Evaluate the performance of the random forest by thresholding at 0.5
RandomForestInitial <- table(descSparse$deal, predictRF >= 0.5)

# Out-of-bag accuracy
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)

Variable importance plot: Shark Tank random forest
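The regression warning above appears because deal is stored as TRUE/FALSE, so randomForest treats the problem as regression and we threshold the predicted values at 0.5. As an alternative sketch (not used for the accuracies reported in this post), the response can be converted to a factor so that randomForest runs in classification mode:

# Alternative sketch: classification forest on a copy of the data
descSparseClass = descSparse
descSparseClass$deal = as.factor(descSparseClass$deal)
SharktankRFClass = randomForest(deal ~ ., data=descSparseClass)

# With a factor response, predict() returns class labels
# (out-of-bag predictions when no newdata is supplied)
predictRFClass = predict(SharktankRFClass)
table(descSparseClass$deal, predictRFClass)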

Logistic Regression Model

# Logistic Regression model

set.seed(123)

Sharktanklogistic = glm(deal~., data = descSparse)

# Make predictions on the training data
predictLogistic = predict(Sharktanklogistic, newdata=descSparse)

# Evaluate the performance of the model
LogisticInitial <- table(descSparse$deal, predictLogistic > 0.5)

# Accuracy on the training data
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)
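One thing to note: glm() without a family argument defaults to family = gaussian, so the model above is really a linear probability model rather than a logistic regression. A minimal sketch of an actual logistic regression (not used for the results reported below) would be:

# Sketch only: a true logistic regression with family = binomial
# (expect separation warnings, since there are far more terms than pitches)
SharktankLogit = glm(deal ~ ., data=descSparse, family=binomial)

# type = "response" returns predicted probabilities between 0 and 1
predictLogit = predict(SharktankLogit, type="response")
table(descSparse$deal, predictLogit > 0.5)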

Now let's add an additional variable called ratio, derived as askedFor/valuation, and re-run the models to see whether it improves their accuracy.

# Add ratio variable into descSparse
descSparse$ratio = Sharktank$askedFor/Sharktank$valuation

#re-run the models to see if any changes

########CART Model###########
SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")

#CART Diagram
prp(SharktankCartRatio, extra=2)

CART decision tree after adding the ratio variable

# Evaluate the performance of the CART model on the training data
predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")

CART_ratio <- table(descSparse$deal, predictCARTRatio)

# Accuracy on the training data
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)


#########Random Forest#############
#Random Forest model
SharktankRFRatio = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
#Make predictions (out-of-bag, as before)
predictRFRatio = predict(SharktankRFRatio)

# Evaluate the performance of the random forest by thresholding at 0.5
RandomForestRatio <- table(descSparse$deal, predictRFRatio >= 0.5)

# Out-of-bag accuracy
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)

#variable importance as measured by a Random Forest 
varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)

Variable importance plot: Shark Tank random forest with the ratio variable

#########Logistic Regression##########
#Logistic Model
SharktanklogisticRatio = glm(deal~., data = descSparse)

# Make predictions on the training data
predictLogisticRatio = predict(SharktanklogisticRatio, newdata=descSparse)

# Evaluate the performance of the model
LogisticRatio <- table(descSparse$deal, predictLogisticRatio >= 0.5)

# Accuracy on the training data
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)

Conclusion
Let's look at the accuracy of each model before and after the ratio column was added to the dataset.

####CART MODEL
#Before Ratio Column
BaseAccuracyCart
## [1] 0.6565657
#After Ratio Column
BaseAccuracyRatio
## [1] 0.6606061
####Logistic Regression
#Before Ratio Column
BaseAccuracyLogistic
## [1] 0.9979798
#After Ratio Column
BaseAccuracyLogisticRatio
## [1] 1
####RandomForest
#Before Ratio Column
BaseAccuracyRF
## [1] 0.5535354
#After Ratio Column
BaseAccuracyRFRatio
## [1] 0.5575758
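For easier comparison, the six accuracies above can also be collected into one small table using the objects already created:

# Gather the accuracies computed above into a single comparison table
accuracySummary = data.frame(
  Model = c("CART", "Logistic Regression", "Random Forest"),
  DescriptionOnly = c(BaseAccuracyCart, BaseAccuracyLogistic, BaseAccuracyRF),
  DescriptionPlusRatio = c(BaseAccuracyRatio, BaseAccuracyLogisticRatio, BaseAccuracyRFRatio)
)
accuracySummary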

With the CART model we predicted around 65.66% and 66.06% of outcomes correctly using only the description and description + ratio respectively. With the random forest, the corresponding figures were 55.35% and 55.76%.

Logistic regression gave close to 100% accuracy with both feature sets. However, this is accuracy measured on the training data with far more terms than pitches, so it is almost certainly overfitting; it needs further validation on held-out data, keeping only significant variables and removing unnecessary ones, before the result can be treated as meaningful.
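A minimal sketch of how such a validation could look, using a simple random train/test split on the data frame built above (any of the three models could be evaluated this way):

# Hold out 30% of the pitches and measure accuracy on unseen data
set.seed(123)
trainIdx = sample(seq_len(nrow(descSparse)), size = floor(0.7 * nrow(descSparse)))
train = descSparse[trainIdx, ]
test  = descSparse[-trainIdx, ]

cartHoldout = rpart(deal ~ ., data=train, method="class")
predHoldout = predict(cartHoldout, newdata=test, type="class")
confHoldout = table(test$deal, predHoldout)
sum(diag(confHoldout)) / sum(confHoldout)   # out-of-sample accuracy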

I would urge readers to apply the techniques from this post to their own text analyses and problems.

That's all for now. Please do let me know your feedback and whether there is any particular topic you would like me to write about.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

After writing the previous article on Twitter sentiment analysis of #royalwedding, I thought: why not analyse the ABC News website and see if we can uncover some interesting insights? It is good practice for data scraping, text mining and trying out a few algorithms.

Below is the step-by-step guide:

To start with, we will scrape headlines from ABC News for 2018 (up to the date of scraping). To get historical headlines, we scrape the ABC News homepage via the Wayback Machine. After a bit of data crunching and manipulation we end up with clean headlines.

Step 1:
Load all the important libraries we will be using in this exercise

#Importing Libraries
library(stringr)
library(jsonlite)
library(httr)
library(rvest)
library(dplyr)
library(V8)
library(tweenr)
library(syuzhet) 
library(tidyverse)

#Path where output files will be stored
path<-'/Mithun/R/WebScraping/'

Step 2:

Make a Wayback Machine CDX API call for abc.net.au/news and parse the JSON response into text content.
Here we also filter the timestamps to keep only dates from 1st January 2018 onwards.

#Internet WAYBACK API CALL http://www.abc.net.au/news/
AU_url<-'http://web.archive.org/cdx/search/cdx?url=abc.net.au/news/&output=json'

#API request
req <- httr::GET(AU_url, timeout(20))

#Get data
json <- httr::content(req, as = "text")

api_dat <- fromJSON(json)

#Get timestamps which will be used to pass in API
time_stamps<-api_dat[-1,2]

#Reverse order (so recent first)
time_stamps<-rev(time_stamps)

#Preview the timestamps that will be used to build archived abc.net.au/news URLs
head(time_stamps,n=50)

#Filter time_stamps to keep dates from 2018 onwards
time_stamps<-time_stamps[as.numeric(substr(time_stamps,1,8))>=20180000]

Step 3
We create a loop that passes each timestamped URL to collect all headlines published between 1st January and 21st May 2018 on the ABC News website. We also remove any duplicate headlines picked up while scraping.

#Container to collect the scraped headlines, and loop over the timestamps
abc_scrap_all <- NULL
for(s in 1:length(time_stamps)){

 Sys.sleep(1)

feedurl<-paste0('https://web.archive.org/web/',time_stamps[s],'/http://www.abc.net.au/news/')

  print(feedurl)

  if(!is.na(feedurl)){

   print('Valid URL')

    #Scrape the archived URL (skip silently on failure)
    try(feed_dat<-read_html(feedurl), silent=TRUE)

    if(exists('feed_dat')){

      #Select anchor nodes that link to 2018 news articles
      initial<-html_nodes(feed_dat,"[href*='/news/2018']")

      #Date
      Date <- substr(time_stamps[s],1,8)

      #Get headlines
      headlines<-initial  %>% html_text()

      #Combine
      comb<-data.frame(Date,headlines,stringsAsFactors = FALSE)

      #Remove NA headlines
      comb<-comb[!(is.na(comb$headlines) | comb$headlines=="" | comb$headlines==" ") ,]

      #As a df
      comb<-data.frame(comb)

      #Remove duplicates on daily level
      comb <-  comb[!duplicated(comb$headlines),]

      if(length(comb$headlines)>0){

        #Save with the rest
        abc_scrap_all<-rbind(abc_scrap_all,comb)
      }
      rm(comb)
      rm(feed_dat)

     }
  }

}

#Remove duplicates
abc_scrap_all_final <-  abc_scrap_all[!duplicated(abc_scrap_all$headlines),]
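A quick check of how many unique headlines were collected:

#Number of unique headlines scraped, and a peek at the first few
nrow(abc_scrap_all_final)
head(abc_scrap_all_final)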

Step 4:
We take the headlines, run sentiment analysis on them using the syuzhet package and see what conclusions we can draw.

library('syuzhet')
abc_scrap_all_final$headlines <- str_replace_all(abc_scrap_all_final$headlines,"[^[:graph:]]", " ")
Sentiment <-get_nrc_sentiment(abc_scrap_all_final$headlines)

td<-data.frame(t(Sentiment))
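#Sum each sentiment across all scraped headlines (the 1781 columns correspond to this particular scrape)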
td_Rowsum <- data.frame(rowSums(td[2:1781])) 

#Transformation and  cleaning
names(td_Rowsum)[1] <- "count"
td_Rowsum <- cbind("sentiment" = rownames(td_Rowsum), td_Rowsum)
rownames(td_Rowsum) <- NULL
td_Plot<-td_Rowsum[1:10,]

#Vizualisation
library("ggplot2")

qplot(sentiment, data=td_Plot, weight=count, geom="bar",fill=sentiment)+ggtitle("Abc News headlines sentiment analysis")

Conclusion on sentiment analysis:
The human brain tends to be more attentive to negative information, and to grab readers' attention most media houses focus on negative and fear-related news. That is exactly what we see when we analyse the ABC News headlines.

Let's analyse the headlines further:

Wordcloud for frequently used words in headlines

library(tm)
library(wordcloud)
corpus = Corpus(VectorSource(tolower(abc_scrap_all_final$headlines)))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))

frequencies = DocumentTermMatrix(corpus)
word_frequencies = as.data.frame(as.matrix(frequencies))

words <- colnames(word_frequencies)
freq <- colSums(word_frequencies)
wordcloud(words, freq,
          min.freq=sort(freq, decreasing=TRUE)[[100]],
          colors=brewer.pal(10, "Paired"),
          random.color=TRUE)

Word cloud of the most frequent words in ABC News headlines

Surprisingly, for an Australian news agency, the most frequently used words in the headlines relate to Donald Trump (the American president), followed by police, the Commonwealth Games, sport and Australia.
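To back up the word-cloud impression with actual counts, the term frequencies computed for the word cloud can be sorted directly:

# Ten most frequent terms across the scraped 2018 headlines
head(sort(freq, decreasing=TRUE), 10)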

Find word associations:

If there is a specific word of interest for the analysis, we can identify the words most highly correlated with that term. If two words always appear together the correlation is 1.0; in our example we look for words with at least a 30% correlation.

findAssocs(frequencies, "tony", corlimit=0.3)
## $tony
##      abbott headbutting       cooke        30th     hansons  benneworth 
##        0.64        0.35        0.30        0.30        0.30        0.30 
##     mocking       astro        labe 
##        0.30        0.30        0.30
#0.3 means 30% correlation with word "tony" 

That's all for now. In my next post we will look further into text analytics using udpipe and see if we can build more on word associations and analytics.

Please do let me know your feedback and whether there is any particular topic you would like me to write about.

Do subscribe to Tabvizexplorer.com to keep receiving regular updates.

This week's #makeovermonday challenge was sports data: English Premier League data comparing predictions with the actual outcomes of the 2017-18 season.

The data was shared by the Guardian.

Here is the original report and how it looks:

 

The data is available on data.world (week 21).

Here is what I tried to do:

  • Used Gantt and circle charts in Tableau to display actual vs predicted results, highlighting them in different colours to make the view self-explanatory.
  • This makes it easy to see how many predictions were on target and how many were off target.

Below is the Tableau file:

Thanks!!
Do subscribe to receive regular updates

This week's #workoutwednesday was about comparing the last two periods' sales for each sub-category, using date functions and customised tooltips.

Requirements

  • Dashboard size should be 1000 × 600.
  • Sub-Category on the Rows Shelf, and Year of Order Date on the Columns Shelf
  • There are 3 measures, Sales, % of Total (in year) and % Difference
  • Only the Sales Measure has color (which colors you use don’t matter, just that Sales is the only measure colored).
  • Users can select a Year and a Measure and sort the table in Descending order by that measure in that year.

The dataset used was the Superstore dataset that ships with Tableau.

Below is my attempt to meet the above requirements:

Thanks for reading 🙂

Do subscribe to the blog to keep receiving updates.