Hello and welcome to the second deliverable of my portfolio project on app store analytics and app ratings. This deliverable focuses on model planning and building, working toward a predictive model that sets me up for the final phase of the portfolio.

First, I want to write a function called include() that loads a package if it is already installed and installs it first if it is not present on the machine running this file. Then I will load the packages needed for this deliverable.

include <- function(library_name){
  # Install the package first if it is not already present on this machine
  if( !(library_name %in% rownames(installed.packages())) )
    install.packages(library_name)
  # Load it; character.only = TRUE lets the package name be passed as a string
  library(library_name, character.only = TRUE)
}
include("knitr")
purl("deliverable1.Rmd", output = "deliv1.r")
## processing file: deliverable1.Rmd
## output file: deliv1.r
## [1] "deliv1.r"
source("deliv1.r")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Parsed with column specification:
## cols(
##   id = col_double(),
##   track_name = col_character(),
##   size_bytes = col_double(),
##   currency = col_character(),
##   price = col_double(),
##   rating_count_tot = col_double(),
##   rating_count_ver = col_double(),
##   user_rating = col_double(),
##   user_rating_ver = col_double(),
##   ver = col_character(),
##   cont_rating = col_character(),
##   prime_genre = col_character(),
##   sup_devices.num = col_double(),
##   ipadSc_urls.num = col_double(),
##   lang.num = col_double(),
##   vpp_lic = col_double(),
##   game_enab = col_double()
## )
include("caret")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
include("tidyverse")
include("rvest")
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
include("ggplot2")
include("reticulate")

Predictions from First Deliverable

Based on the first deliverable, I shifted my focus from how microtransactions affect app sales to how age ratings affect an app's overall rating. Knowing this, I would like to see whether I can predict certain app analytics, such as the overall rating or the rating of the latest version. These predictions matter because, if I can create a model that predicts an app's rating from its content rating, then future app developers can focus their content on the age ratings associated with higher ratings and reviews.

New Data Source

After looking online, I found a JSON file of Google Play Store app analytics. This is a different data type from my last data source, which was a CSV file. Using the jsonlite package, I will import the JSON file and then convert it to a data frame with the as.data.frame function so I can begin tidying the data. This data set will complement my first data source by adding the Google Play Store, the other most popular app store besides the Apple App Store. It also records the number of installs each app has, which could be useful for analysis.

googleimport <- jsonlite::fromJSON(txt = "data/googleplay.json")
asFrame <- as.data.frame(googleimport)

Now that the data is successfully loaded into the environment, it is time to tidy the data and get it ready for analysis.

Tidying the Data

First, I need to understand the data set before I clean it. To do this I will explain the variables and their data types below, and then verify them with a quick structural check after the list.

App is character data of the name of the app.

Category is character data of the genre of the app.

Rating is numeric data of the average rating of the app.

Reviews is numeric data of the number of reviews for the app.

Size is numeric data for the size of the app in megabytes and gigabytes.

Installs is character data for the number of installs the app has, stored in ranges such as 10,000+.

Type is character data indicating whether the app is free or paid.

Price is character data for the price of the app.

Content Rating is character data for the age rating of the app.

Genre is character data for the genre of the app but Category will be kept instead.

Last Updated is the date on which the app was last updated.

Current Ver is character data for the current version of the app.

Android Ver is character data for the Android version of the app.
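
To verify these descriptions against the imported data frame, a quick structural check like the following could be run; this is only a sketch, and its output is not shown here.

# Sketch: confirm the size of the data set and the storage type of each column.
dim(asFrame)
sapply(asFrame, class)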

Checking the Variables for Odd Inputs

First I want to check the variables for odd inputs and then remove any problematic entries.

table(asFrame$Category)
## 
##                 1.9      ART_AND_DESIGN   AUTO_AND_VEHICLES 
##                   1                  65                  85 
##              BEAUTY BOOKS_AND_REFERENCE            BUSINESS 
##                  53                 231                 460 
##              COMICS       COMMUNICATION              DATING 
##                  60                 387                 234 
##           EDUCATION       ENTERTAINMENT              EVENTS 
##                 156                 149                  64 
##              FAMILY             FINANCE      FOOD_AND_DRINK 
##                1972                 366                 127 
##                GAME  HEALTH_AND_FITNESS      HOUSE_AND_HOME 
##                1144                 341                  88 
##  LIBRARIES_AND_DEMO           LIFESTYLE MAPS_AND_NAVIGATION 
##                  85                 382                 137 
##             MEDICAL  NEWS_AND_MAGAZINES           PARENTING 
##                 463                 283                  60 
##     PERSONALIZATION         PHOTOGRAPHY        PRODUCTIVITY 
##                 392                 335                 424 
##            SHOPPING              SOCIAL              SPORTS 
##                 260                 295                 384 
##               TOOLS    TRAVEL_AND_LOCAL       VIDEO_PLAYERS 
##                 843                 258                 175 
##             WEATHER 
##                  82

After checking Category, I see that one entry in this column is 1.9, so I will look into that row to see what the error might be. To do this I will use filter().

asFrame %>% filter(Category == "1.9")

Looking at this specific row, it seems the whole row has been shifted one column to the left, so I will delete it.

problem1 <- asFrame %>% filter(Category != "1.9")

Now that that row is removed, I will move on to the next variable, Reviews. I want to make sure that every entry in this column is a number, so I will convert the column to character and then to numeric so that any non-numeric entries become NA. Then I use which() to list the positions of any entries that did not convert to a number; an empty result means there are none.

# Coerce Reviews to numeric; any non-numeric entry becomes NA
problem1$Reviews <- as.numeric(as.character(problem1$Reviews))
# Positions of entries that do not look numeric; integer(0) means none
which(!grepl('^[0-9]', problem1$Reviews))
## integer(0)
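
As a more direct follow-up check (a sketch, not part of the original workflow), counting the NA values introduced by the coercion should also come back as zero.

# Sketch: any non-numeric Reviews entry would have become NA above,
# so this count should be 0.
sum(is.na(problem1$Reviews))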

The next variable I want to take care of is Installs. To make it better suited for linear regression, I will manually create a new variable, Install_Group, where apps with 0 to 1,000+ installs are category 1, apps with 5,000+ to 50,000+ are category 2, apps with 100,000+ to 5,000,000+ are category 3, and apps with 10,000,000+ or more are category 4.

problem1 <- problem1 %>% mutate(Install_Group = Installs)
problem1$Install_Group[problem1$Install_Group %in% c("0",
                                                   "0+",
                                                   "1+",
                                                   "10+",
                                                   "100+",
                                                   "50+",
                                                   "5+",
                                                   "500+", 
                                                   "1,000+")] <- "1"
problem1$Install_Group[problem1$Install_Group %in% c("5,000+",
                                                   "10,000+",
                                                   "50,000+")] <- "2"
problem1$Install_Group[problem1$Install_Group %in% c("100,000+",
                                                   "500,000+",
                                                   "1,000,000+",
                                                   "5,000,000+")] <- "3"
problem1$Install_Group[problem1$Install_Group %in% c("50,000,000+",
                                                   "10,000,000+",
                                                   "1,000,000,000+",
                                                   "100,000,000+",
                                                   "500,000,000+")] <- "4"
table(problem1$Install_Group)
## 
##    1    2    3    4 
## 2711 2010 4039 2080
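
As a side note, the same grouping could be built more compactly by parsing the install counts as numbers and using dplyr's case_when(). This is only a sketch for comparison, and install_group_alt is a hypothetical column that is not used anywhere below.

# Sketch: derive the same four install groups without listing every label.
problem1 %>%
  mutate(installs_numeric  = as.numeric(gsub("[+,]", "", Installs)),
         install_group_alt = case_when(installs_numeric <= 1000    ~ "1",
                                       installs_numeric <= 50000   ~ "2",
                                       installs_numeric <= 5000000 ~ "3",
                                       TRUE                        ~ "4")) %>%
  count(install_group_alt)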

Now that the new variable and its categories have been created, I will split it into four separate columns of 0 and 1 values for linear regression.

problem1 <- problem1 %>%
  mutate(cat1 = Install_Group == "1") %>%
  mutate(cat2 = Install_Group == "2") %>%
  mutate(cat3 = Install_Group == "3") %>%
  mutate(cat4 = Install_Group == "4")
problem1$cat1 <- as.numeric(problem1$cat1)
problem1$cat2 <- as.numeric(problem1$cat2)
problem1$cat3 <- as.numeric(problem1$cat3)
problem1$cat4 <- as.numeric(problem1$cat4)
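
Alternatively, base R can build the same set of 0/1 indicator columns in a single call. This is a sketch for comparison only; the manually created cat1 through cat4 columns are what the models below actually use.

# Sketch: model.matrix() expands a categorical variable into indicator columns;
# the "- 1" drops the intercept so every group gets its own column.
head(model.matrix(~ Install_Group - 1, data = problem1))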

Now that the Installs variable is cleaned, I will move on to Content Rating and repeat the same process of creating indicator columns.

table(problem1$`Content Rating`, useNA = "always")
## 
## Adults only 18+        Everyone    Everyone 10+      Mature 17+ 
##               3            8714             414             499 
##            Teen         Unrated            <NA> 
##            1208               2               0
problem1 <- problem1 %>%
  mutate(adult       = `Content Rating` == "Adults only 18+") %>%
  mutate(everyone    = `Content Rating` == "Everyone") %>%
  mutate(everyone_10 = `Content Rating` == "Everyone 10+") %>%
  mutate(mature_17   = `Content Rating` == "Mature 17+") %>%
  mutate(teen        = `Content Rating` == "Teen") %>%
  mutate(unrated     = `Content Rating` == "Unrated")
problem1$adult <- as.numeric(problem1$adult)
problem1$everyone <- as.numeric(problem1$everyone)
problem1$everyone_10 <- as.numeric(problem1$everyone_10)
problem1$mature_17 <- as.numeric(problem1$mature_17)
problem1$teen <- as.numeric(problem1$teen)
problem1$unrated <- as.numeric(problem1$unrated)

Now that all the variables needed for the predictions have been cleaned, I will create a final tibble that is ready for analysis.

google_play <- tibble(app_name = problem1$App,
                      genre = problem1$Category,
                      app_rating = problem1$Rating,
                      num_reviews = problem1$Reviews,
                      cat1 = problem1$cat1,
                      cat2 = problem1$cat2,
                      cat3 = problem1$cat3,
                      cat4 = problem1$cat4,
                      adult = problem1$adult,
                      everyone = problem1$everyone,
                      everyone_10 = problem1$everyone_10,
                      mature_17 = problem1$mature_17,
                      teen = problem1$teen,
                      unrated = problem1$unrated)

After creating the table, I realized that app_rating still has NaN values for some of the rows, so I will remove the rows where app_rating is NaN.

table(google_play$app_rating, useNA = "always")
## 
##    1  1.2  1.4  1.5  1.6  1.7  1.8  1.9    2  2.1  2.2  2.3  2.4  2.5  2.6 
##   16    1    3    3    4    8    8   13   12    8   14   20   19   21   25 
##  2.7  2.8  2.9    3  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9    4  4.1 
##   25   42   45   83   69   64  102  128  163  174  239  303  386  568  708 
##  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9    5  NaN <NA> 
##  952 1076 1109 1038  823  499  234   87  274 1474    0
google_play <- google_play %>% filter(app_rating != "NaN")
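
As a quick confirmation (a sketch, assuming app_rating is numeric as the lm() fit below requires), no NaN ratings should remain after the filter.

# Sketch: this count should now be 0.
sum(is.nan(google_play$app_rating))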

Performing Linear Regression

Now that the data is prepped and ready for analysis, I will partition it into two sets for training and testing. I want to predict app_rating, so I will create the partition on app_rating. The partition randomly splits the data into two groups, one with 70% of the rows and the other with 30%.

sample_selection <- createDataPartition(google_play$app_rating, p=0.70, list = FALSE)

After creating the partition, I will make the training and test data sets. The training set will contain 70% of the data and will be used to fit the model; the remaining 30% will be held out to evaluate it.

train <- google_play[sample_selection,]
test <- google_play[-sample_selection,]
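
As a quick sanity check (a sketch; output not shown), the relative sizes of the two sets can be confirmed. Note that createDataPartition() samples randomly, so a set.seed() call before the split would be needed to reproduce these exact results.

# Sketch: the training share should be roughly 0.70 and the test share roughly 0.30.
nrow(train) / nrow(google_play)
nrow(test) / nrow(google_play)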

Now I will create the model using the four categories I made for number of downloads and the six indicators I made for content rating.

train_model <- lm(app_rating ~ cat1 + cat2 + cat3 + cat4 + adult + everyone + everyone_10 + mature_17 + teen + unrated, data = train)
summary(train_model)
## 
## Call:
## lm(formula = app_rating ~ cat1 + cat2 + cat3 + cat4 + adult + 
##     everyone + everyone_10 + mature_17 + teen + unrated, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1973 -0.1841  0.0650  0.3159  0.9650 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.36762    0.02153 202.862  < 2e-16 ***
## cat1        -0.14440    0.02169  -6.659 2.98e-11 ***
## cat2        -0.30667    0.01952 -15.714  < 2e-16 ***
## cat3        -0.15758    0.01659  -9.499  < 2e-16 ***
## cat4              NA         NA      NA       NA    
## adult        0.33996    0.36239   0.938  0.34823    
## everyone    -0.02593    0.02010  -1.290  0.19709    
## everyone_10  0.02172    0.03517   0.618  0.53687    
## mature_17   -0.10897    0.03448  -3.160  0.00158 ** 
## teen              NA         NA      NA       NA    
## unrated      0.03905    0.51224   0.076  0.93924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5117 on 6549 degrees of freedom
## Multiple R-squared:  0.03952,    Adjusted R-squared:  0.03835 
## F-statistic: 33.68 on 8 and 6549 DF,  p-value: < 2.2e-16

After seeing the results of the model, I see that cat4 and teen were not defined because of singularities: the four install-group indicators always sum to one, so lm() cannot estimate all of them alongside the intercept, and teen turned out to be redundant with the other content-rating indicators as well. In my revised model I will take out cat4, adult, and teen.
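
A quick check makes this collinearity concrete (a sketch; output not shown).

# Sketch: the four install-group indicators always sum to 1, so together they
# duplicate the intercept and lm() cannot estimate all of them.
all(train$cat1 + train$cat2 + train$cat3 + train$cat4 == 1)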

train_revised_model <- lm(app_rating ~ cat1 + cat2 + cat3 + everyone + everyone_10 + mature_17 + unrated, data = train)
summary(train_revised_model)
## 
## Call:
## lm(formula = app_rating ~ cat1 + cat2 + cat3 + everyone + everyone_10 + 
##     mature_17 + unrated, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1972 -0.1842  0.0651  0.3158  0.9651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.36838    0.02151 203.042  < 2e-16 ***
## cat1        -0.14432    0.02169  -6.655 3.06e-11 ***
## cat2        -0.30660    0.01952 -15.710  < 2e-16 ***
## cat3        -0.15731    0.01659  -9.484  < 2e-16 ***
## everyone    -0.02683    0.02008  -1.336  0.18150    
## everyone_10  0.02083    0.03516   0.592  0.55361    
## mature_17   -0.10988    0.03447  -3.188  0.00144 ** 
## unrated      0.03822    0.51224   0.075  0.94052    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5117 on 6550 degrees of freedom
## Multiple R-squared:  0.03939,    Adjusted R-squared:  0.03836 
## F-statistic: 38.37 on 7 and 6550 DF,  p-value: < 2.2e-16
predictions <- train_revised_model %>% predict(test)
R2(predictions,test$app_rating)
## [1] 0.03244455
MAE(predictions,test$app_rating)
## [1] 0.3520071

The R-squared value measures how well our model can explain the variation in the test data. Here the model explains only about 3% of the variation in the test set.

The mean absolute error (MAE) is the average absolute distance between the actual ratings and the ratings the model predicts. In this case the MAE is about 0.35 stars out of 5.
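
To make these definitions concrete, the same two quantities can be computed by hand (a sketch; caret's R2() uses the squared correlation by default, so the values should match those above).

# Sketch: recompute the metrics directly from the predictions.
errors <- predictions - test$app_rating
mean(abs(errors))                    # mean absolute error
cor(predictions, test$app_rating)^2  # squared correlation, caret's default R2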

Exporting the Train Data for Python Analysis

Now that the model has been made in R, I want to rebuild it in Python, where I can more easily produce visuals for the model.

python <- problem1 %>% filter(Rating != "NaN")
python_final <- tibble(app_rating = python$Rating,
                                  content_rating = python$`Content Rating`)
write.csv(python_final,"data/python_model.csv", row.names = FALSE)
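
To confirm the export round-trips cleanly, the file can be read back in (a sketch; output not shown).

# Sketch: read the exported CSV back and inspect the first few rows.
head(read.csv("data/python_model.csv"))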

After importing the data into Python and creating a model, this is the scatterplot I produced of App Rating against Content Rating within the Google Play Store data set. As you can see, the Everyone category has a wide spread of ratings, while the Everyone 10+ category has fewer low-rated apps.

Python Graph

Python Numbers

These numbers are what the model predicts as the average rating of apps within each group of Content Rating. The first number is the average for the Adults only 18+ category, and the five numbers that follow are offsets relative to it. The resulting averages are:

Adults only 18+ = 4.29 Stars

Everyone = 4.18 Stars

Everyone 10+ = 4.26 Stars

Mature 17+ = 4.11 Stars

Teen = 4.23 Stars

Unrated = 4.10 Stars

Limitations of the Model

In this case the model suffers from several limitations. Certain values of Content Rating have only a few data points, such as the Adults only 18+ and Unrated categories. The model also suffers from response bias: apps tend to be reviewed mainly by users who had an extreme experience, good or bad, so most reviews are 1 or 5 stars because few people take the time to review an app just to give it 2 or 3 stars. This skews the data toward the high end of the rating scale, with the bulk of apps averaging 4 stars or more. As seen in the Python model, every content rating group averages roughly 4.1 to 4.3 stars, and this is largely a consequence of the voluntary-response nature of the data.