**This page has been revised. To view revision, Click here

include <- function(library_name){
  if( !(library_name %in% installed.packages()) )
    install.packages(library_name) 
  library(library_name, character.only=TRUE)
}

\(\color{red}{\text{Introduction}}\)

…. ….. ……. In the previous deliverable, we discovered that the amount of harmful molecules in the atmosphere in California has actually been decreasing over the last 15 years. We now want to get a more in-depth idea of the major causes of pollution. The goal of this deliverable is to determine if human population has any effect on the amount of air pollution in a given area. To get started, we’ll first load in our progress from Phase 1.

include("tidyverse")
include("knitr")
include("caret")
purl("deliv1.Rmd", output = "part1.r")
source("part1.r")

\(\color{red}{\text{Source Analysis}}\)

We then read in data from our secondary source. This source we will use is from an online database query for county populations in California from 2010 to 2018. This data is from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk. Because this data is from the United States Census Bureau, a common source for datasets, I believe this data is credible.

new_data <- read_csv("./PEP_2018_PEPANNRES_with_ann.csv")  #read in data

\(\color{red}{\text{Cleaning Data}}\)

This dataset’s first row is an extra header row that we will not be using, so we can remove it.

head(new_data) #view data

new_data <- new_data[-1,]  #remove the first row
head(new_data) #view data

We also observe that there are 2 different columns representing geographical ID, and that the 2nd column GEO.id2 is simply a shorter representation of the 1st column, so we don’t need the first column.

#select all columns to be used
new_data <- new_data %>% select("GEO.id2", "GEO.display-label", "respop72010", "respop72011", "respop72012", "respop72013", "respop72014", "respop72015", "respop72016", "respop72017", "respop72018")

The next step is to change the names of the columns so that they are easier to understand.

#rename columns
colnames(new_data)[colnames(new_data)=="GEO.id2"] <- "GEO_ID" #Geographical ID
colnames(new_data)[colnames(new_data)=="GEO.display-label"] <- "County" 
colnames(new_data)[colnames(new_data)=="respop72010"] <- "2010" 
colnames(new_data)[colnames(new_data)=="respop72011"] <- "2011" 
colnames(new_data)[colnames(new_data)=="respop72012"] <- "2012" 
colnames(new_data)[colnames(new_data)=="respop72013"] <- "2013" 
colnames(new_data)[colnames(new_data)=="respop72014"] <- "2014" 
colnames(new_data)[colnames(new_data)=="respop72015"] <- "2015" 
colnames(new_data)[colnames(new_data)=="respop72016"] <- "2016" 
colnames(new_data)[colnames(new_data)=="respop72017"] <- "2017" 
colnames(new_data)[colnames(new_data)=="respop72018"] <- "2018" 
colnames(new_data)

##  [1] "GEO_ID" "County" "2010"   "2011"   "2012"   "2013"   "2014"  
##  [8] "2015"   "2016"   "2017"   "2018"

Now that we have all our columns named properly, we notice that the last 9 columns are all representing a common variable: year. We can use the gather function to tidy this part of the table.

#group all the year columns into a single column and put the data values into a new column
new_data <- gather(new_data, key = "Year", value = "Population", "2010":"2018", convert = FALSE, factor_key = FALSE)
head(new_data)

Since all the counties in this table are located within California, that makes " County, California" irrelevant in the County column. Using sapply() and regular expressions, we can remove the " County, California" part of each description of the county.

#0 or many characters before ' County' and 0 or many characters after that.
pattern <- "(.*) County.*"       
#function to shorten County column
new_data$County <- sapply(new_data$County, function(County){
  return(gsub(pattern, "\\1", County)) #replace with part of pattern in ()
})
head(new_data)

Finally, the last step to cleaning this data is to make sure that each column is being represented as the correct type. Because GEO_ID is representative of a specific location, it should be considered a factor. County is already represented as a character, so that is good. Population needs to be changed from a character to an integer. Year needs to be changed to a double to match the type from the data tibble.

new_data$GEO_ID <- as.factor(new_data$GEO_ID)
new_data$Population <- as.integer(new_data$Population)
new_data$Year <- as.double(new_data$Year)

\(\color{red}{\text{Models}}\)

Now that we have our 2 sources of data, we can add the new source to our previous data, and see if there are any correllations that can be observed.

new_data1 <- inner_join(data, new_data, by=c("County", "Year"))

## Warning: Column `County` joining factor and character vector, coercing into
## character vector

Now that all the populations for those counties have been added to the table, we can create a model to determine how much effect Population has on harmful molecules in the air.

set.seed(385) #this is the number to start with for random number generator

sample_selection <- createDataPartition(new_data1$Concentration, p = 0.70, list = FALSE) # p to tell 70% split

train <- new_data1[sample_selection,]
test <- new_data1[-sample_selection,]

train_model <- lm(Concentration ~ SITE_ID + Year + WEEK + DATEON + DATEOFF + Molecule + County + Population, data=train)
summary(train_model)

## 
## Call:
## lm(formula = Concentration ~ SITE_ID + Year + WEEK + DATEON + 
##     DATEOFF + Molecule + County + Population, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5319 -0.2668 -0.0236  0.1546  9.7445 
## 
## Coefficients: (5 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.014e+03  1.373e+03   2.195   0.0282 *  
## SITE_IDLAV410        -4.959e+00  2.725e+00  -1.820   0.0688 .  
## SITE_IDPIN414        -4.932e+00  2.901e+00  -1.700   0.0891 .  
## SITE_IDSEK430        -3.673e+00  2.321e+00  -1.583   0.1135    
## SITE_IDYOS404        -5.195e+00  2.960e+00  -1.755   0.0792 .  
## Year                 -1.527e+00  6.969e-01  -2.192   0.0284 *  
## WEEK                 -2.756e-02  1.335e-02  -2.064   0.0390 *  
## DATEON                4.845e-08  2.208e-08   2.194   0.0283 *  
## DATEOFF                      NA         NA      NA       NA    
## MoleculeSO4           2.936e-01  1.923e-02  15.267  < 2e-16 ***
## MoleculeNO3           2.109e-01  1.919e-02  10.990  < 2e-16 ***
## MoleculeHNO3          2.672e-01  1.921e-02  13.906  < 2e-16 ***
## MoleculeTNO3          8.993e-01  1.928e-02  46.632  < 2e-16 ***
## MoleculeNH4          -1.349e-01  1.922e-02  -7.023 2.27e-12 ***
## MoleculeCa           -2.949e-01  1.926e-02 -15.311  < 2e-16 ***
## MoleculeNa           -2.149e-01  1.927e-02 -11.151  < 2e-16 ***
## MoleculeMg           -3.870e-01  1.934e-02 -20.009  < 2e-16 ***
## MoleculeK            -3.736e-01  1.922e-02 -19.432  < 2e-16 ***
## MoleculeCl           -2.975e-01  1.927e-02 -15.440  < 2e-16 ***
## CountySan Benito             NA         NA      NA       NA    
## CountySan Bernardino         NA         NA      NA       NA    
## CountyShasta                 NA         NA      NA       NA    
## CountyTulare                 NA         NA      NA       NA    
## Population           -2.432e-06  1.460e-06  -1.666   0.0957 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4921 on 14428 degrees of freedom
## Multiple R-squared:  0.4418, Adjusted R-squared:  0.4411 
## F-statistic: 634.4 on 18 and 14428 DF,  p-value: < 2.2e-16

As we can see from the above model, we have a bunch of variables that are not good predictors of the Concentration level of a molecule. We can refine our model by removing these insignificant variables.

train_model <- lm(Concentration ~ Year + WEEK + DATEON + Molecule + Population, data=train)
summary(train_model)

## 
## Call:
## lm(formula = Concentration ~ Year + WEEK + DATEON + Molecule + 
##     Population, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7135 -0.1974 -0.0493  0.0827 10.0131 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.639e+03  1.477e+03   4.495 7.00e-06 ***
## Year         -3.370e+00  7.497e-01  -4.495 7.00e-06 ***
## WEEK         -6.291e-02  1.436e-02  -4.380 1.19e-05 ***
## DATEON        1.070e-07  2.375e-08   4.503 6.75e-06 ***
## MoleculeSO4   2.951e-01  2.081e-02  14.182  < 2e-16 ***
## MoleculeNO3   2.176e-01  2.076e-02  10.486  < 2e-16 ***
## MoleculeHNO3  2.716e-01  2.078e-02  13.067  < 2e-16 ***
## MoleculeTNO3  9.073e-01  2.086e-02  43.495  < 2e-16 ***
## MoleculeNH4  -1.333e-01  2.079e-02  -6.415 1.45e-10 ***
## MoleculeCa   -2.900e-01  2.083e-02 -13.921  < 2e-16 ***
## MoleculeNa   -2.092e-01  2.085e-02 -10.035  < 2e-16 ***
## MoleculeMg   -3.854e-01  2.092e-02 -18.421  < 2e-16 ***
## MoleculeK    -3.637e-01  2.080e-02 -17.491  < 2e-16 ***
## MoleculeCl   -2.941e-01  2.085e-02 -14.109  < 2e-16 ***
## Population    2.891e-07  1.314e-08  22.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5324 on 14432 degrees of freedom
## Multiple R-squared:  0.3465, Adjusted R-squared:  0.3459 
## F-statistic: 546.7 on 14 and 14432 DF,  p-value: < 2.2e-16

Based on these results, we can see that the Molecule, Year, Week, Date on, and population are all significant in predicting the concentration level of harmful particles in the air. The p-value is much lower than the cutoff of .05, suggesting that our model is statistically significant. However, the R squared value is fairly small, suggesting that only 34% of variation can be explained by our model.

\(\color{red}{\text{Testing}}\)

We can now use this model to help us make preditions about the concentration of harmful molecules in the air. We can use the R2 function to calculate the R squared value, or the square of the correllation. This will always be a value between 0 and 1. The closer it it is to 1, the higher the amount of correllation. Because our R squared predicition is so low, we can conclude that we need more information in order to make an accurate prediction.

predictions <- train_model %>% predict(test)
R2(predictions, as.numeric(test$Concentration))

## [1] 0.3364283

We can also use the MAE function to determine the amount of error that our model gives us from predicted values to actual values. For this, the closer the result is to 0, the better our model works.

MAE(test$Concentration, predictions)

## [1] 0.2967832

\(\color{red}{\text{Visualizations}}\)

In order to show that population is a good predictor of the concentration of harmful particles, we can make a few plots. This first plot shows how population affects the concentration of just SO2.

s02 <- new_data1 %>% filter(Molecule == "SO2")
ggplot() + geom_point(data = s02, mapping = aes(x = Population, y = Concentration))

This 2nd plot shows how population effects the concentration of NO4.

NH4 <- new_data1 %>% filter(Molecule == "NH4")
ggplot() + geom_point(data = NH4, mapping = aes(x = Population, y = Concentration))

This 3rd plot shows how population effects the concentration of HNO3.

HNO3 <- new_data1 %>% filter(Molecule == "HNO3")
ggplot() + geom_point(data = HNO3, mapping = aes(x = Population, y = Concentration))

Based on these plots, it is difficult to see that population is a significant predictor of the concentration level of harmful particles. We can conclude that this is caused either by limitations in our model (described in phase 3), or that population needs other variables to help make an accurate prediction of the concentration of harmful molecules.

Index DELIVERABLE 1 DELIVERABLE 2 REVISED DELIVERABLE 3

p02 Model Planning and Building

Kaleb Crisci

November 21, 2019