"Drop the Mic" machine learning (part 2)

January 10, 2018
R modelling fun

I wrote about an ‘algorithm’ that I concocted in my head for determining “Drop the Mic” winners in an earlier post: (“Drop the Mic” winner algorithm!). In that post I found that my silly little algorithm has accurately predicted the winner 85% of the time. 👍

To rehash that algorithm I said:

  1. If one contestant is female and the other is male, the female wins!
  2. If both are male, or both are female, or if they are a group, the one standing to the left of the screen at the end wins!

That was fun, but it wasn’t very data science-y. So… let’s apply a more proper machine learning method to determine the winner algorithm.

We start again with scraping the data from IMDb:

library(rvest)
library(tidyverse)
library(knitr)
library(kableExtra)

# Scrape data from IMDb
data.dropmic <- read_html("http://www.imdb.com/title/tt6396094/episodes?season=1&ref_=tt_eps_sn_1") %>%
        html_nodes("strong a") %>%
        html_text() %>%
        as.data.frame() %>%
        .[grepl(" vs", .[,1]),] %>%
        data.frame(Episode = .)

data.clean <- str_split(data.dropmic$Episode, "&| and ") %>%
        unlist() %>%
        data.frame(Battles = .)

# Display results
kable(data.clean, "html") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Battles
Halle Berry vs. James Corden
Usher vs. Anthony Anderson
James Van Der Beek vs. Randall Park
Rob Gronkowski vs. Gina Rodriguez
Niecy Nash vs. Cedric the Entertainer
Liam Payne vs. Jason Derulo
Tony Hale vs Timothy Simons
Rascal Flatts vs. Boyz II Men
Nicole Scherzinger vs. Lil Rel Howery
Charlie Puth vs. Backstreet Boys
Vanessa Hudgens vs. Michael Bennett
James Corden vs. Nicole Richie
Mayim Bialik vs. Kunal Nayyar
Ashley Tisdale vs. Nick Lachey
Wayne Brady vs. Jake Owen
Kenny G vs. Richard Marx
David Arquette vs. Brian Tyree Henry
Jesse Tyler Ferguson vs. Chrissy Metz
Pentatonix vs. Bell Biv Devoe
Padma Lakshmi vs. Randy Jackson
Danielle Fishel vs. Jonathan Lipnicki
Shania Twain vs. Meghan Trainor

The Battles were properly scraped in the above table. Let’s grab the following features from it:

  1. Who are the contestants?
  2. What are the genders of the contestants (we shall designate the first name in the battle as contestant 1)?
  3. Where is contestant 1 standing (left or right on the screen)?
# One data cleaning step needs to be done... make all "vs. " to "vs "
data.clean$Battles <- gsub("vs. ", "vs ", data.clean$Battles)

## 1. Who are the contestants?
# Separate contestants
features <- str_split(data.clean$Battles, "vs ")

# Use purrr::map to add Contestants to data.clean
data.model <- cbind(data.clean, 
                    Contestant.1 = purrr::map(features, 1) %>% unlist(),
                    Contestant.2 = purrr::map(features, 2) %>% unlist())

# Display results
kable(data.model, "html") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Battles Contestant.1 Contestant.2
Halle Berry vs James Corden Halle Berry James Corden
Usher vs Anthony Anderson Usher Anthony Anderson
James Van Der Beek vs Randall Park James Van Der Beek Randall Park
Rob Gronkowski vs Gina Rodriguez Rob Gronkowski Gina Rodriguez
Niecy Nash vs Cedric the Entertainer Niecy Nash Cedric the Entertainer
Liam Payne vs Jason Derulo Liam Payne Jason Derulo
Tony Hale vs Timothy Simons Tony Hale Timothy Simons
Rascal Flatts vs Boyz II Men Rascal Flatts Boyz II Men
Nicole Scherzinger vs Lil Rel Howery Nicole Scherzinger Lil Rel Howery
Charlie Puth vs Backstreet Boys Charlie Puth Backstreet Boys
Vanessa Hudgens vs Michael Bennett Vanessa Hudgens Michael Bennett
James Corden vs Nicole Richie James Corden Nicole Richie
Mayim Bialik vs Kunal Nayyar Mayim Bialik Kunal Nayyar
Ashley Tisdale vs Nick Lachey Ashley Tisdale Nick Lachey
Wayne Brady vs Jake Owen Wayne Brady Jake Owen
Kenny G vs Richard Marx Kenny G Richard Marx
David Arquette vs Brian Tyree Henry David Arquette Brian Tyree Henry
Jesse Tyler Ferguson vs Chrissy Metz Jesse Tyler Ferguson Chrissy Metz
Pentatonix vs Bell Biv Devoe Pentatonix Bell Biv Devoe
Padma Lakshmi vs Randy Jackson Padma Lakshmi Randy Jackson
Danielle Fishel vs Jonathan Lipnicki Danielle Fishel Jonathan Lipnicki
Shania Twain vs Meghan Trainor Shania Twain Meghan Trainor

We have our contestants! Let’s now add the genders of both contestants (I’m doing this manually… training a convolutional neural network to recognize females is a bit out of scope for this post 😛) as well as the positions and note the winners:

Battles Contestant.1 Contestant.2 Con1.gender Con2.gender Con1.position winner
Halle Berry vs James Corden Halle Berry James Corden female male right 1
Usher vs Anthony Anderson Usher Anthony Anderson male male left 1
James Van Der Beek vs Randall Park James Van Der Beek Randall Park male male left 2
Rob Gronkowski vs Gina Rodriguez Rob Gronkowski Gina Rodriguez male female left 2
Niecy Nash vs Cedric the Entertainer Niecy Nash Cedric the Entertainer female male left 1
Liam Payne vs Jason Derulo Liam Payne Jason Derulo male male left 1
Tony Hale vs Timothy Simons Tony Hale Timothy Simons male male right 1
Rascal Flatts vs Boyz II Men Rascal Flatts Boyz II Men group group left 2
Nicole Scherzinger vs Lil Rel Howery Nicole Scherzinger Lil Rel Howery female male right 1
Charlie Puth vs Backstreet Boys Charlie Puth Backstreet Boys male group left 1
Vanessa Hudgens vs Michael Bennett Vanessa Hudgens Michael Bennett female male right 1
James Corden vs Nicole Richie James Corden Nicole Richie male female left 2
Mayim Bialik vs Kunal Nayyar Mayim Bialik Kunal Nayyar female male right 1
Ashley Tisdale vs Nick Lachey Ashley Tisdale Nick Lachey female male left 1
Wayne Brady vs Jake Owen Wayne Brady Jake Owen male male left 1
Kenny G vs Richard Marx Kenny G Richard Marx male male left 1
David Arquette vs Brian Tyree Henry David Arquette Brian Tyree Henry male male right 2
Jesse Tyler Ferguson vs Chrissy Metz Jesse Tyler Ferguson Chrissy Metz male female right 2
Pentatonix vs Bell Biv Devoe Pentatonix Bell Biv Devoe group group left 2
Padma Lakshmi vs Randy Jackson Padma Lakshmi Randy Jackson female male left 1
Danielle Fishel vs Jonathan Lipnicki Danielle Fishel Jonathan Lipnicki female male right NA
Shania Twain vs Meghan Trainor Shania Twain Meghan Trainor female female left NA

Nice! We are now ready to train a machine learning model to predict the winner:

library(caret)
set.seed(1)

# Create training dataset
train.index <- createDataPartition(data.model.comp$winner, 
                                   p = 0.8,
                                   list = F)

data.train <- data.model.comp[train.index,]
data.test <- data.model.comp[-train.index,]

## Establish caret parameters
# Training controls
caret.train.control <- trainControl(
        method = "repeatedcv",
        number = 10,
        repeats = 10
        )
# Training machine learning model
doParallel::registerDoParallel(cores = 6)

machine.model <- train(winner ~ Con1.gender + Con2.gender + Con1.position, 
                       data = data.train, 
                       method = "rf",
                       trControl = caret.train.control)

# Model details
machine.model
## Random Forest 
## 
## 17 samples
##  3 predictor
##  2 classes: '1', '2' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 16, 16, 15, 15, 14, 15, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8916667  0.7078125
##   3     0.8866667  0.6969231
##   5     0.8483333  0.6223881
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The model, based on training data, is reporting a 89% accuracy. But the real test is on the test dataset, data.test:

# Predicting winners on test dataset
mm.predict <- predict(machine.model, newdata = data.test)

results <- cbind(data.test, mm.predict)

# Display results
kable(results, "html") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Battles Contestant.1 Contestant.2 Con1.gender Con2.gender Con1.position winner mm.predict
James Van Der Beek vs Randall Park James Van Der Beek Randall Park male male left 2 1
Niecy Nash vs Cedric the Entertainer Niecy Nash Cedric the Entertainer female male left 1 1
Mayim Bialik vs Kunal Nayyar Mayim Bialik Kunal Nayyar female male right 1 1

The machine learning algorithm predicted that contestant 1 would win all 3 battles. It was right 2/3 times.

Main Battle

When I made the earlier post, the results of the last two battles were not released yet. Now that the last episode has been posted, we can set up a battle of Machine vs. Human… Joshua, give me a beat!

Human prediction (based on algo developed in part 1):

  1. Danielle Fishel will beat Jonathan Lipnicki (female)
  2. Shania will beat Meghan (Shania is standing left)
data.battle <- data.model[(nrow(data.model) - 1): nrow(data.model),]

# Machine prediction
mm.predict <- predict(machine.model, data.battle)

results <- cbind(data.battle, mm.predict)

# Display results
kable(results, "html") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Battles Contestant.1 Contestant.2 Con1.gender Con2.gender Con1.position winner mm.predict
Danielle Fishel vs Jonathan Lipnicki Danielle Fishel Jonathan Lipnicki female male right NA 1
Shania Twain vs Meghan Trainor Shania Twain Meghan Trainor female female left NA 2

Machine prediction:

  1. Danielle Fishel will beat Jonathan Lipnicki
  2. Meghan will beat Shania

(I was a bit worried that the machine learning algo would make the same prediction as me… that would’ve been quite… anti-climactic)

Battle 1: Danielle Fishel vs. Jonathan Lipnicki


And the winner is Danielle! Both human and machine score a point!

Battle 2: Tie-breaker: Shania Twain vs. Meghan Trainor


And the winner is… both…?!?!?!?!?!? That was anti-climactic!!

A look at teacher misconduct in Canada...

August 8, 2018
R data

RNN vs. Dense neural networks for time-series

June 27, 2018
R modelling

Auto neural networks vs. Manual Keras neural model

June 11, 2018
R modelling