Hello and welcome to the second deliverable of my portfolio project on app store analytics and app ratings. This deliverable focuses on model planning and building, working toward a predictive model that sets up the final phase of the portfolio.
First I want to write a function called include() which loads a library if it is already installed, but downloads the package first if it is not present on the machine running this file. Then I will load the packages needed for this deliverable.
include <- function(library_name){
  # Install the package first if it is not already present, then attach it.
  # character.only = TRUE lets us pass the package name as a string.
  if( !(library_name %in% installed.packages()) )
    install.packages(library_name)
  library(library_name, character.only = TRUE)
}
include("knitr")
purl("deliverable1.Rmd", output = "deliv1.r")
##
##
## processing file: deliverable1.Rmd
## output file: deliv1.r
## [1] "deliv1.r"
source("deliv1.r")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## id = col_double(),
## track_name = col_character(),
## size_bytes = col_double(),
## currency = col_character(),
## price = col_double(),
## rating_count_tot = col_double(),
## rating_count_ver = col_double(),
## user_rating = col_double(),
## user_rating_ver = col_double(),
## ver = col_character(),
## cont_rating = col_character(),
## prime_genre = col_character(),
## sup_devices.num = col_double(),
## ipadSc_urls.num = col_double(),
## lang.num = col_double(),
## vpp_lic = col_double(),
## game_enab = col_double()
## )
include("caret")
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
include("tidyverse")
include("rvest")
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
include("ggplot2")
include("reticulate")
Based on the first deliverable, I shifted my focus from how microtransactions affect app sales to how age ratings affect an app's overall rating. With that in mind, I would like to see whether I can predict certain app analytics, such as the overall rating or the rating of the latest update. These predictions matter because, if a model can predict an app's rating from its content rating, then future app developers can target the age rating associated with higher ratings and reviews.
After looking online, I found a JSON file of Google Play Store app analytics. This is a different file type than my last data source, which was a CSV file. Using the jsonlite
package, I will import the JSON file and then convert it to a data frame with the as.data.frame
function to begin tidying the data. This data set complements my first data source by adding data from the Google Play Store, the most popular app store other than the Apple App Store. It also records the number of installs each app has, which could be beneficial for analysis.
googleimport <- jsonlite::fromJSON(txt = "data/googleplay.json")
asFrame <- as.data.frame(googleimport)
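Before tidying, it can help to confirm that the import produced the expected shape. A minimal sketch using dim() and dplyr's glimpse(); these checks are optional and nothing below depends on them.
# Quick sanity check on the imported data frame: row/column counts and types
dim(asFrame)
glimpse(asFrame)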
Now that the data is successfully loaded into the environment, it is time to tidy the data and get it ready for analysis.
First, I need to understand the data set before I clean it, so I will describe each variable and its data type.
App is character data: the name of the app.
Category is character data: the genre of the app.
Rating is numeric data: the average rating of the app.
Reviews is the number of reviews for the app, stored as character data (converted to numeric below).
Size is the size of the app in megabytes or gigabytes.
Installs is the number of installs the app has, stored as character strings such as "1,000+".
Type is character data: whether the app is free or paid.
Price is character data: the price of the app.
Content Rating is character data: the age rating of the app.
Genre is character data: another genre field; Category will be kept instead.
Last Updated is the date on which the app was last updated.
Current Ver is character data: the current version of the app.
Android Ver is character data: the Android version the app requires.
First I want to check each variable for possible problems and then remove any problem rows.
table(asFrame$Category)
##
## 1.9 ART_AND_DESIGN AUTO_AND_VEHICLES
## 1 65 85
## BEAUTY BOOKS_AND_REFERENCE BUSINESS
## 53 231 460
## COMICS COMMUNICATION DATING
## 60 387 234
## EDUCATION ENTERTAINMENT EVENTS
## 156 149 64
## FAMILY FINANCE FOOD_AND_DRINK
## 1972 366 127
## GAME HEALTH_AND_FITNESS HOUSE_AND_HOME
## 1144 341 88
## LIBRARIES_AND_DEMO LIFESTYLE MAPS_AND_NAVIGATION
## 85 382 137
## MEDICAL NEWS_AND_MAGAZINES PARENTING
## 463 283 60
## PERSONALIZATION PHOTOGRAPHY PRODUCTIVITY
## 392 335 424
## SHOPPING SOCIAL SPORTS
## 260 295 384
## TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
## 843 258 175
## WEATHER
## 82
After checking Category, I see that one entry in this column is 1.9, so I will look into that row to find the possible error. To do this I will use filter().
asFrame %>% filter(Category == "1.9")
Looking into this specific row, it appears that the whole row has been shifted one column to the left, so I will delete it.
problem1 <- asFrame %>% filter(Category != "1.9")
Now that that row is removed, I will move on to the next variable, Reviews. I want to make sure every entry in this column is a number, so I will convert it to character and then to numeric, which turns any non-numeric entries into NA. Then I use which() to find any entries that are not numeric.
problem1$Reviews <- as.numeric(as.character(problem1$Reviews))
which(!grepl('^[0-9]',problem1$Reviews))
## integer(0)
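An equivalent, slightly more direct check is to count how many entries failed the numeric conversion; a one-line sketch:
# Entries that became NA during conversion; 0 means every entry was numeric
sum(is.na(problem1$Reviews))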
The next variable I want to address is Installs. To make it more useful for linear regression, I will manually create a new variable Install_Group: apps with 0 through 1,000+ installs are category 1, 5,000+ through 50,000+ are category 2, 100,000+ through 5,000,000+ are category 3, and apps with 10,000,000+ installs or more are category 4.
problem1 <- problem1 %>% mutate(Install_Group = Installs)
problem1$Install_Group[problem1$Install_Group %in% c("0",
"0+",
"1+",
"10+",
"100+",
"50+",
"5+",
"500+",
"1,000+")] <- "1"
problem1$Install_Group[problem1$Install_Group %in% c("5,000+",
"10,000+",
"50,000+")] <- "2"
problem1$Install_Group[problem1$Install_Group %in% c("100,000+",
"500,000+",
"1,000,000+",
"5,000,000+")] <- "3"
problem1$Install_Group[problem1$Install_Group %in% c("50,000,000+",
"10,000,000+",
"1,000,000,000+",
"100,000,000+",
"500,000,000+")] <- "4"
table(problem1$Install_Group)
##
## 1 2 3 4
## 2711 2010 4039 2080
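As an aside, the same bucketing can be written more compactly by parsing the install strings into numbers and binning them with cut(). This is only a sketch of an alternative; installs_n and Install_Group_alt are hypothetical names not used below.
# Parse strings like "1,000+" into numbers, then bin into the same 4 groups
installs_n <- readr::parse_number(problem1$Installs)
Install_Group_alt <- cut(installs_n,
                         breaks = c(-Inf, 1000, 50000, 5e6, Inf),
                         labels = c("1", "2", "3", "4"))
table(Install_Group_alt)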
Now that the new variable and its categories have been made, I will split it into 4 separate columns of 0/1 indicators for linear regression.
problem1 <- problem1 %>%
  mutate(cat1 = Install_Group == "1",
         cat2 = Install_Group == "2",
         cat3 = Install_Group == "3",
         cat4 = Install_Group == "4")
problem1$cat1 <- as.numeric(problem1$cat1)
problem1$cat2 <- as.numeric(problem1$cat2)
problem1$cat3 <- as.numeric(problem1$cat3)
problem1$cat4 <- as.numeric(problem1$cat4)
Now that the Installs variable is cleaned, I will move on to Content Rating and repeat the same indicator-column process.
table(problem1$`Content Rating`, useNA = "always")
##
## Adults only 18+ Everyone Everyone 10+ Mature 17+
## 3 8714 414 499
## Teen Unrated <NA>
## 1208 2 0
problem1 <- problem1 %>%
  mutate(adult = `Content Rating` == "Adults only 18+",
         everyone = `Content Rating` == "Everyone",
         everyone_10 = `Content Rating` == "Everyone 10+",
         mature_17 = `Content Rating` == "Mature 17+",
         teen = `Content Rating` == "Teen",
         unrated = `Content Rating` == "Unrated")
problem1$adult <- as.numeric(problem1$adult)
problem1$everyone <- as.numeric(problem1$everyone)
problem1$everyone_10 <- as.numeric(problem1$everyone_10)
problem1$mature_17 <- as.numeric(problem1$mature_17)
problem1$teen <- as.numeric(problem1$teen)
problem1$unrated <- as.numeric(problem1$unrated)
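As a side note, hand-typing each comparison like this is easy to get wrong, and base R can generate one 0/1 column per level automatically. A minimal sketch; rating_dummies is a hypothetical name not used elsewhere.
# One indicator column per Content Rating level; -1 drops the intercept column
rating_dummies <- model.matrix(~ `Content Rating` - 1, data = problem1)
head(rating_dummies)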
Now that all the variables needed for the predictions have been cleaned, I will create a final tibble ready for analysis.
google_play <- tibble(app_name = problem1$App,
genre = problem1$Category,
app_rating = problem1$Rating,
num_reviews = problem1$Reviews,
cat1 = problem1$cat1,
cat2 = problem1$cat2,
cat3 = problem1$cat3,
cat4 = problem1$cat4,
adult = problem1$adult,
everyone = problem1$everyone,
everyone_10 = problem1$everyone_10,
mature_17 = problem1$mature_17,
teen = problem1$teen,
unrated = problem1$unrated)
After creating the table, I realized that app_rating still contains NaN values for some rows, so I will remove the rows where app_rating is NaN.
table(google_play$app_rating, useNA = "always")
##
## 1 1.2 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6
## 16 1 3 3 4 8 8 13 12 8 14 20 19 21 25
## 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1
## 25 42 45 83 69 64 102 128 163 174 239 303 386 568 708
## 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 NaN <NA>
## 952 1076 1109 1038 823 499 234 87 274 1474 0
google_play <- google_play %>% filter(!is.nan(app_rating))
Now that the data is prepped and ready for analysis, I will partition it for training and testing. I want to predict app_rating, so I will create the partition on app_rating. The partition randomly splits the data into two groups, one with 70% of the rows and the other with 30%.
sample_selection <- createDataPartition(google_play$app_rating, p=0.70, list = FALSE)
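One reproducibility tip worth noting: calling set.seed() before createDataPartition() fixes the random split, so the numbers below would come out the same on every run. A sketch; the seed value is arbitrary.
# Fix the random number generator state so the 70/30 split repeats exactly
set.seed(385)
sample_selection <- createDataPartition(google_play$app_rating, p = 0.70, list = FALSE)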
After creating the partition, I will make the training and test data sets. The training set holds 70% of the data and will be used to fit the model; the remaining 30% is held out for testing.
train <- google_play[sample_selection,]
test <- google_play[-sample_selection,]
Now I will create the model using the 4 install-group indicators and the 6 content-rating indicators created above.
train_model <- lm(app_rating ~ cat1 + cat2 + cat3 + cat4 + adult + everyone + everyone_10 + mature_17 + teen + unrated, data = train)
summary(train_model)
##
## Call:
## lm(formula = app_rating ~ cat1 + cat2 + cat3 + cat4 + adult +
## everyone + everyone_10 + mature_17 + teen + unrated, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1973 -0.1841 0.0650 0.3159 0.9650
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.36762 0.02153 202.862 < 2e-16 ***
## cat1 -0.14440 0.02169 -6.659 2.98e-11 ***
## cat2 -0.30667 0.01952 -15.714 < 2e-16 ***
## cat3 -0.15758 0.01659 -9.499 < 2e-16 ***
## cat4 NA NA NA NA
## adult 0.33996 0.36239 0.938 0.34823
## everyone -0.02593 0.02010 -1.290 0.19709
## everyone_10 0.02172 0.03517 0.618 0.53687
## mature_17 -0.10897 0.03448 -3.160 0.00158 **
## teen NA NA NA NA
## unrated 0.03905 0.51224 0.076 0.93924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5117 on 6549 degrees of freedom
## Multiple R-squared: 0.03952, Adjusted R-squared: 0.03835
## F-statistic: 33.68 on 8 and 6549 DF, p-value: < 2.2e-16
In the results of the model, cat4 and teen come back as NA because of singularities: each group of indicator columns sums to one, so with an intercept in the model at least one column per group is linearly redundant (the dummy-variable trap). In my revised model I will drop those two, along with adult, which has only 3 observations and is far from significant.
train_revised_model <- lm(app_rating ~ cat1 + cat2 + cat3 + everyone + everyone_10 + mature_17 + unrated, data = train)
summary(train_revised_model)
##
## Call:
## lm(formula = app_rating ~ cat1 + cat2 + cat3 + everyone + everyone_10 +
## mature_17 + unrated, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1972 -0.1842 0.0651 0.3158 0.9651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.36838 0.02151 203.042 < 2e-16 ***
## cat1 -0.14432 0.02169 -6.655 3.06e-11 ***
## cat2 -0.30660 0.01952 -15.710 < 2e-16 ***
## cat3 -0.15731 0.01659 -9.484 < 2e-16 ***
## everyone -0.02683 0.02008 -1.336 0.18150
## everyone_10 0.02083 0.03516 0.592 0.55361
## mature_17 -0.10988 0.03447 -3.188 0.00144 **
## unrated 0.03822 0.51224 0.075 0.94052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5117 on 6550 degrees of freedom
## Multiple R-squared: 0.03939, Adjusted R-squared: 0.03836
## F-statistic: 38.37 on 7 and 6550 DF, p-value: < 2.2e-16
predictions <- train_revised_model %>% predict(test)
R2(predictions,test$app_rating)
## [1] 0.03244455
MAE(predictions,test$app_rating)
## [1] 0.3520071
The R-squared value measures how well the model explains the variation in the test data; here, the model explains about 3% of that variation.
The mean absolute error (MAE) measures the average absolute distance, whether positive or negative, between the actual ratings and the predictions the model makes. In this case the MAE is about 0.35 stars out of 5.
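For intuition, the MAE reported above is just the mean of the absolute prediction errors, which can be checked in one line of base R:
# Should match caret::MAE(predictions, test$app_rating) above
mean(abs(predictions - test$app_rating))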
Now that the model has been built in R, I want to rebuild it in Python, where I can produce clearer visuals for the model.
python <- problem1 %>% filter(!is.nan(Rating))
python_final <- tibble(app_rating = python$Rating,
content_rating = python$`Content Rating`)
write.csv(python_final,"data/python_model.csv", row.names = FALSE)
After importing the data into Python and fitting a model, I produced a scatterplot of app rating by content rating for the Google Play data set. The Everyone category shows a wide spread of ratings, while the Everyone 10+ category has fewer low-rated apps.
These numbers are the model's predicted average rating for apps in each content-rating group. The model reports the average for the Adults only 18+ category first, and the 5 following coefficients are offsets relative to that baseline; adding each offset to the baseline gives the averages below (a small R check of this coding follows the list).
Adults only 18+ = 4.29 Stars
Everyone = 4.18 Stars
Everyone 10+ = 4.26 Stars
Mature 17+ = 4.11 Stars
Teen = 4.23 Stars
Unrated = 4.10 Stars
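This baseline-plus-offsets coding is exactly what R's lm() produces for a single factor predictor, so the Python model can be sanity-checked from R. A minimal sketch, assuming content_rating is converted to a factor whose first level is "Adults only 18+"; content_model is a hypothetical name.
# Intercept = mean rating of the reference level ("Adults only 18+");
# the remaining coefficients are offsets from that baseline
content_model <- lm(app_rating ~ factor(content_rating), data = python_final)
coef(content_model)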
This model suffers from several limitations. Some content-rating groups contain only a few data points, such as Adults only 18+ (3 apps) and Unrated (2 apps). The data also carries a self-selection bias: users typically review an app only after an extreme experience on either end of the spectrum, so reviews cluster at 5 and 1 stars because few people take the time to leave a 2- or 3-star review. This skews the data toward the high end of the rating scale, with most apps averaging above 4 stars. As seen in the Python model, apps in every content-rating group average near 4.25 stars, which reflects the voluntary-response nature of the data.