"Drop the Mic" machine learning (part 2)
January 10, 2018
R modelling funI wrote about an ‘algorithm’ that I concocted in my head for determining “Drop the Mic” winners in an earlier post: (“Drop the Mic” winner algorithm!). In that post I found that my silly little algorithm has accurately predicted the winner 85% of the time. 👍
To rehash that algorithm I said:
- If one contestant is female and the other is male, the female wins!
- If both are male, or both are female, or if they are a group, the one standing to the left of the screen at the end wins!
That was fun, but it wasn’t very data science-y. So… let’s apply a more proper machine learning method to determine the winner algorithm.
We start again with scraping the data from IMDb:
library(rvest)
library(tidyverse)
library(knitr)
library(kableExtra)
# Scrape data from IMDb
data.dropmic <- read_html("http://www.imdb.com/title/tt6396094/episodes?season=1&ref_=tt_eps_sn_1") %>%
html_nodes("strong a") %>%
html_text() %>%
as.data.frame() %>%
.[grepl(" vs", .[,1]),] %>%
data.frame(Episode = .)
data.clean <- str_split(data.dropmic$Episode, "&| and ") %>%
unlist() %>%
data.frame(Battles = .)
# Display results
kable(data.clean, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Battles |
---|
Halle Berry vs. James Corden |
Usher vs. Anthony Anderson |
James Van Der Beek vs. Randall Park |
Rob Gronkowski vs. Gina Rodriguez |
Niecy Nash vs. Cedric the Entertainer |
Liam Payne vs. Jason Derulo |
Tony Hale vs Timothy Simons |
Rascal Flatts vs. Boyz II Men |
Nicole Scherzinger vs. Lil Rel Howery |
Charlie Puth vs. Backstreet Boys |
Vanessa Hudgens vs. Michael Bennett |
James Corden vs. Nicole Richie |
Mayim Bialik vs. Kunal Nayyar |
Ashley Tisdale vs. Nick Lachey |
Wayne Brady vs. Jake Owen |
Kenny G vs. Richard Marx |
David Arquette vs. Brian Tyree Henry |
Jesse Tyler Ferguson vs. Chrissy Metz |
Pentatonix vs. Bell Biv Devoe |
Padma Lakshmi vs. Randy Jackson |
Danielle Fishel vs. Jonathan Lipnicki |
Shania Twain vs. Meghan Trainor |
The Battles were properly scraped in the above table. Let’s grab the following features from it:
- Who are the contestants?
- What are the genders of the contestants (we shall designate the first name in the battle as contestant 1)?
- Where is contestant 1 standing (left or right on the screen)?
# One data cleaning step needs to be done... make all "vs. " to "vs "
data.clean$Battles <- gsub("vs. ", "vs ", data.clean$Battles)
## 1. Who are the contestants?
# Separate contestants
features <- str_split(data.clean$Battles, "vs ")
# Use purrr::map to add Contestants to data.clean
data.model <- cbind(data.clean,
Contestant.1 = purrr::map(features, 1) %>% unlist(),
Contestant.2 = purrr::map(features, 2) %>% unlist())
# Display results
kable(data.model, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Battles | Contestant.1 | Contestant.2 |
---|---|---|
Halle Berry vs James Corden | Halle Berry | James Corden |
Usher vs Anthony Anderson | Usher | Anthony Anderson |
James Van Der Beek vs Randall Park | James Van Der Beek | Randall Park |
Rob Gronkowski vs Gina Rodriguez | Rob Gronkowski | Gina Rodriguez |
Niecy Nash vs Cedric the Entertainer | Niecy Nash | Cedric the Entertainer |
Liam Payne vs Jason Derulo | Liam Payne | Jason Derulo |
Tony Hale vs Timothy Simons | Tony Hale | Timothy Simons |
Rascal Flatts vs Boyz II Men | Rascal Flatts | Boyz II Men |
Nicole Scherzinger vs Lil Rel Howery | Nicole Scherzinger | Lil Rel Howery |
Charlie Puth vs Backstreet Boys | Charlie Puth | Backstreet Boys |
Vanessa Hudgens vs Michael Bennett | Vanessa Hudgens | Michael Bennett |
James Corden vs Nicole Richie | James Corden | Nicole Richie |
Mayim Bialik vs Kunal Nayyar | Mayim Bialik | Kunal Nayyar |
Ashley Tisdale vs Nick Lachey | Ashley Tisdale | Nick Lachey |
Wayne Brady vs Jake Owen | Wayne Brady | Jake Owen |
Kenny G vs Richard Marx | Kenny G | Richard Marx |
David Arquette vs Brian Tyree Henry | David Arquette | Brian Tyree Henry |
Jesse Tyler Ferguson vs Chrissy Metz | Jesse Tyler Ferguson | Chrissy Metz |
Pentatonix vs Bell Biv Devoe | Pentatonix | Bell Biv Devoe |
Padma Lakshmi vs Randy Jackson | Padma Lakshmi | Randy Jackson |
Danielle Fishel vs Jonathan Lipnicki | Danielle Fishel | Jonathan Lipnicki |
Shania Twain vs Meghan Trainor | Shania Twain | Meghan Trainor |
We have our contestants! Let’s now add the genders of both contestants (I’m doing this manually… training a convolutional neural network to recognize females is a bit out of scope for this post 😛) as well as the positions and note the winners:
Battles | Contestant.1 | Contestant.2 | Con1.gender | Con2.gender | Con1.position | winner |
---|---|---|---|---|---|---|
Halle Berry vs James Corden | Halle Berry | James Corden | female | male | right | 1 |
Usher vs Anthony Anderson | Usher | Anthony Anderson | male | male | left | 1 |
James Van Der Beek vs Randall Park | James Van Der Beek | Randall Park | male | male | left | 2 |
Rob Gronkowski vs Gina Rodriguez | Rob Gronkowski | Gina Rodriguez | male | female | left | 2 |
Niecy Nash vs Cedric the Entertainer | Niecy Nash | Cedric the Entertainer | female | male | left | 1 |
Liam Payne vs Jason Derulo | Liam Payne | Jason Derulo | male | male | left | 1 |
Tony Hale vs Timothy Simons | Tony Hale | Timothy Simons | male | male | right | 1 |
Rascal Flatts vs Boyz II Men | Rascal Flatts | Boyz II Men | group | group | left | 2 |
Nicole Scherzinger vs Lil Rel Howery | Nicole Scherzinger | Lil Rel Howery | female | male | right | 1 |
Charlie Puth vs Backstreet Boys | Charlie Puth | Backstreet Boys | male | group | left | 1 |
Vanessa Hudgens vs Michael Bennett | Vanessa Hudgens | Michael Bennett | female | male | right | 1 |
James Corden vs Nicole Richie | James Corden | Nicole Richie | male | female | left | 2 |
Mayim Bialik vs Kunal Nayyar | Mayim Bialik | Kunal Nayyar | female | male | right | 1 |
Ashley Tisdale vs Nick Lachey | Ashley Tisdale | Nick Lachey | female | male | left | 1 |
Wayne Brady vs Jake Owen | Wayne Brady | Jake Owen | male | male | left | 1 |
Kenny G vs Richard Marx | Kenny G | Richard Marx | male | male | left | 1 |
David Arquette vs Brian Tyree Henry | David Arquette | Brian Tyree Henry | male | male | right | 2 |
Jesse Tyler Ferguson vs Chrissy Metz | Jesse Tyler Ferguson | Chrissy Metz | male | female | right | 2 |
Pentatonix vs Bell Biv Devoe | Pentatonix | Bell Biv Devoe | group | group | left | 2 |
Padma Lakshmi vs Randy Jackson | Padma Lakshmi | Randy Jackson | female | male | left | 1 |
Danielle Fishel vs Jonathan Lipnicki | Danielle Fishel | Jonathan Lipnicki | female | male | right | NA |
Shania Twain vs Meghan Trainor | Shania Twain | Meghan Trainor | female | female | left | NA |
Nice! We are now ready to train a machine learning model to predict the winner:
library(caret)
set.seed(1)
# Create training dataset
train.index <- createDataPartition(data.model.comp$winner,
p = 0.8,
list = F)
data.train <- data.model.comp[train.index,]
data.test <- data.model.comp[-train.index,]
## Establish caret parameters
# Training controls
caret.train.control <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10
)
# Training machine learning model
doParallel::registerDoParallel(cores = 6)
machine.model <- train(winner ~ Con1.gender + Con2.gender + Con1.position,
data = data.train,
method = "rf",
trControl = caret.train.control)
# Model details
machine.model
## Random Forest
##
## 17 samples
## 3 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 16, 16, 15, 15, 14, 15, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8916667 0.7078125
## 3 0.8866667 0.6969231
## 5 0.8483333 0.6223881
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The model, based on training data, is reporting a 89% accuracy. But the real test is on the test dataset, data.test
:
# Predicting winners on test dataset
mm.predict <- predict(machine.model, newdata = data.test)
results <- cbind(data.test, mm.predict)
# Display results
kable(results, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Battles | Contestant.1 | Contestant.2 | Con1.gender | Con2.gender | Con1.position | winner | mm.predict |
---|---|---|---|---|---|---|---|
James Van Der Beek vs Randall Park | James Van Der Beek | Randall Park | male | male | left | 2 | 1 |
Niecy Nash vs Cedric the Entertainer | Niecy Nash | Cedric the Entertainer | female | male | left | 1 | 1 |
Mayim Bialik vs Kunal Nayyar | Mayim Bialik | Kunal Nayyar | female | male | right | 1 | 1 |
The machine learning algorithm predicted that contestant 1 would win all 3 battles. It was right 2/3 times.
Main Battle
When I made the earlier post, the results of the last two battles were not released yet. Now that the last episode has been posted, we can set up a battle of Machine vs. Human… Joshua, give me a beat!
Human prediction (based on algo developed in part 1):
- Danielle Fishel will beat Jonathan Lipnicki (female)
- Shania will beat Meghan (Shania is standing left)
data.battle <- data.model[(nrow(data.model) - 1): nrow(data.model),]
# Machine prediction
mm.predict <- predict(machine.model, data.battle)
results <- cbind(data.battle, mm.predict)
# Display results
kable(results, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Battles | Contestant.1 | Contestant.2 | Con1.gender | Con2.gender | Con1.position | winner | mm.predict |
---|---|---|---|---|---|---|---|
Danielle Fishel vs Jonathan Lipnicki | Danielle Fishel | Jonathan Lipnicki | female | male | right | NA | 1 |
Shania Twain vs Meghan Trainor | Shania Twain | Meghan Trainor | female | female | left | NA | 2 |
Machine prediction:
- Danielle Fishel will beat Jonathan Lipnicki
- Meghan will beat Shania
(I was a bit worried that the machine learning algo would make the same prediction as me… that would’ve been quite… anti-climactic)
Battle 1: Danielle Fishel vs. Jonathan Lipnicki
And the winner is Danielle! Both human and machine score a point!
Battle 2: Tie-breaker: Shania Twain vs. Meghan Trainor
And the winner is… both…?!?!?!?!?!? That was anti-climactic!!