In this post we will develop a predictive model to predict deal or no deal using a Shark Tank (US-based show) dataset.
Problem Statement
Shark Tank is a US-based show in which entrepreneurs and founders pitch their businesses to investors (aka Sharks), who decide whether or not to invest based on multiple parameters.
Here, we have a dataset of 495 Shark Tank pitches, one record per entrepreneur pitching to the investors (aka Sharks). Using multiple algorithms, we will predict, given the description of a new pitch, how likely the pitch is to result in a deal.
Dataset Import, Representation and Cleaning
Import the Shark Tank dataset into R:
# Read in the data
Sharktank = read.csv("Shark Tank Companies-1.csv", stringsAsFactors=FALSE)
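Before cleaning anything, it is worth a quick look at the fields this post relies on later (description, deal, askedFor and valuation); a minimal sanity check, assuming those column names match your CSV:
# Peek at the columns used in the rest of the analysis
str(Sharktank[, c("description", "deal", "askedFor", "valuation")])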
Load all the libraries required for text mining
# Load Library
library(tm)
library(SnowballC)
To use the tm package, we first need to transform the dataset into a corpus built from the required variable, i.e. description. Next we normalize the text in the descriptions:
1. Switch to lower case
2. Remove punctuation marks and stopwords
3. Remove extra whitespace
4. Stem the documents
# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))
# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
# Word cloud before removing stopwords
library(wordcloud)
wordcloud(corpus, colors=rainbow(7), max.words=100)
# Remove stopwords such as "the" and "and"
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))
# Remove extra whitespace, if any
corpus = tm_map(corpus, stripWhitespace)
# Stem document
corpus = tm_map(corpus, stemDocument)
# Word cloud after removing stopwords and cleaning
wordcloud(corpus, colors=rainbow(7), max.words=100)
To analyze the text, we build a DTM (Document-Term Matrix): documents become rows, terms/words become columns, and each cell holds the frequency of that term in that document. This helps us identify the words used frequently across the corpus.
#Document term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 495, terms: 3501)>>
## Non-/sparse entries: 9531/1723464
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)
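Before trimming the matrix, you can peek at which words dominate; tm's findFreqTerms lists every term that appears at least a given number of times (the cutoff of 20 below is an arbitrary choice):
# Terms appearing at least 20 times across all descriptions
findFreqTerms(frequencies, lowfreq = 20)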
To reduce the dimensionality of the DTM, we remove infrequent words with removeSparseTerms, keeping only terms with sparsity below 0.995 (i.e. terms that appear in at least roughly 0.5% of the documents):
# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
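Printing the trimmed matrix confirms how much the vocabulary shrank; with the 0.995 threshold you should see far fewer than the original 3,501 terms:
# Inspect the reduced document-term matrix
sparse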
Convert this matrix into a data.frame and add the dependent variable deal as the final step of data preparation:
# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))
# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))
# Add dependent variable
descSparse$deal = Sharktank$deal
# Count deals vs. no deals
table(descSparse$deal)
## 
## FALSE  TRUE 
##   244   251
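Since the classes are nearly balanced (244 FALSE vs. 251 TRUE), the majority-class baseline, i.e. always predicting the more common outcome, is only about 50.7%; the model accuracies below should be judged against this number:
# Majority-class baseline accuracy (251 deals out of 495 pitches)
max(table(descSparse$deal)) / nrow(descSparse)
## [1] 0.5070707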
Predictive Modelling
To predict whether the investors (aka Sharks) will invest in a business, we use deal as the outcome variable and fit CART, logistic regression and random forest models, then measure each model's performance and accuracy.
CART Model
# Build CART model
library(rpart)
library(rpart.plot)
SharktankCart = rpart(deal ~ ., data=descSparse, method="class")
#CART Diagram
prp(SharktankCart, extra=2)
# Evaluate the performance of the CART model
predictCART = predict(SharktankCart, newdata=descSparse, type="class")
CART_initial <- table(descSparse$deal, predictCART)
# Training-set accuracy
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)
Random Forest Model
# Random forest model
library(randomForest)
set.seed(123)
SharktankRF = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
# Make predictions (no newdata supplied, so these are out-of-bag predictions):
predictRF = predict(SharktankRF)
# Evaluate the performance of the random forest
RandomForestInitial <- table(descSparse$deal, predictRF >= 0.5)
# Out-of-bag accuracy
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)
# Variable importance as measured by the random forest
varImpPlot(SharktankRF, main='Variable Importance Plot: Shark Tank', type=2)
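The regression warning above appears because deal is a logical vector, so randomForest fits a regression and we threshold its scores at 0.5. If you prefer classification mode instead, a minimal sketch is to convert the response to a factor (note that the accuracies reported in this post come from the regression-mode fit, not from this variant):
# Alternative: classification-mode random forest on a factor response
SharktankRFClass = randomForest(as.factor(deal) ~ ., data=descSparse)
# With no newdata, predict() returns out-of-bag class predictions
predictRFClass = predict(SharktankRFClass)
table(descSparse$deal, predictRFClass)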
Logistic Regression Model
# Logistic regression model
set.seed(123)
Sharktanklogistic = glm(deal ~ ., data = descSparse, family = binomial)
# Make predictions:
predictLogistic = predict(Sharktanklogistic, newdata = descSparse, type = "response")
# Evaluate the performance of the logistic regression model
LogisticInitial <- table(descSparse$deal, predictLogistic > 0.5)
# Training-set accuracy
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)
Now let's add an additional variable called ratio, derived as askedFor/valuation, and re-run the models to see whether accuracy improves.
# Add ratio variable into descSparse
descSparse$ratio = Sharktank$askedFor/Sharktank$valuation
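Before refitting, a quick sanity check of the new feature; since askedFor/valuation is the equity stake implied by the ask, the values should lie between 0 and 1:
# Distribution of the implied equity stake
summary(descSparse$ratio)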
# Re-run the models to see whether accuracy changes
########CART Model###########
SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")
#CART Diagram
prp(SharktankCartRatio, extra=2)
# Evaluate the performance of the CART model
predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")
CART_ratio <- table(descSparse$deal, predictCARTRatio)
# Training-set accuracy
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)
######### Random Forest #############
# Random forest model
SharktankRFRatio = randomForest(deal ~ ., data=descSparse)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
# Make predictions (out-of-bag, as before):
predictRFRatio = predict(SharktankRFRatio)
# Evaluate the performance of the random forest
RandomForestRatio <- table(descSparse$deal, predictRFRatio >= 0.5)
# Out-of-bag accuracy
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)
# Variable importance as measured by the random forest
varImpPlot(SharktankRFRatio, main='Variable Importance Plot: Shark Tank with Ratio', type=2)
#########Logistic Regression##########
# Logistic regression model
SharktanklogisticRatio = glm(deal ~ ., data = descSparse, family = binomial)
# Make predictions:
predictLogisticRatio = predict(SharktanklogisticRatio, newdata = descSparse, type = "response")
# Evaluate the performance of the logistic regression model
LogisticRatio <- table(descSparse$deal, predictLogisticRatio > 0.5)
# Training-set accuracy
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)
Conclusion
Let's look at the accuracy of each model before and after the ratio column was added to the dataset.
####CART MODEL
#Before Ratio Column
BaseAccuracyCart
## [1] 0.6565657
#After Ratio Column
BaseAccuracyRatio
## [1] 0.6606061
####Logistic Regression
#Before Ratio Column
BaseAccuracyLogistic
## [1] 0.9979798
#After Ratio Column
BaseAccuracyLogisticRatio
## [1] 1
####RandomForest
#Before Ratio Column
BaseAccuracyRF
## [1] 0.5535354
#After Ratio Column
BaseAccuracyRFRatio
## [1] 0.5575758
With the CART model we achieved around 65.65% and 66.06% accuracy using only description and description+ratio respectively. Using random forest, we achieved 55.35% and 55.76% accuracy for the same two feature sets.
Logistic regression gave close to 100% accuracy with both feature sets. However, the CART and logistic figures are measured on the very data the models were trained on, so the near-perfect logistic result signals overfitting; the model requires further validation, keeping only significant variables and removing unnecessary ones, before its output can be trusted.
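A fairer comparison would fit each model on one part of the data and score it on another. A minimal sketch using a hold-out split for the CART model (the 70/30 split and the seed are arbitrary choices, not part of the analysis above):
# Hold out 30% of pitches and measure test-set accuracy for CART
set.seed(123)
trainIdx = sample(seq_len(nrow(descSparse)), size = floor(0.7 * nrow(descSparse)))
train = descSparse[trainIdx, ]
test = descSparse[-trainIdx, ]
cartHoldout = rpart(deal ~ ., data=train, method="class")
predHoldout = predict(cartHoldout, newdata=test, type="class")
confHoldout <- table(test$deal, predHoldout)
sum(diag(confHoldout))/sum(confHoldout)  # honest accuracy estimate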
I would urge readers to implement and apply the techniques from this post to their own text analyses and problems.
That’s all for now. Please do let me know your feedback, and whether there is any particular topic you would like me to write about.
Do subscribe to Tabvizexplorer.com to keep receiving regular updates.