DELIVERABLE 3
# Helper: install a package if it is missing, then attach it
include <- function(library_name){
  if( !(library_name %in% rownames(installed.packages())) )
    install.packages(library_name)
  library(library_name, character.only = TRUE)
}
In the previous deliverable, we discovered that the concentration of harmful molecules in California's atmosphere has actually been decreasing over the last 15 years. We now want a more in-depth picture of the major causes of pollution. The goal of this deliverable is to determine whether human population has any effect on the amount of air pollution in a given area. To get started, we first load our progress from Phase 1.
include("tidyverse")
include("knitr")
include("caret")
purl("deliv1.Rmd", output = "part1.r")
source("part1.r")
We then read in data from our secondary source: an online database query of county populations in California from 2010 to 2018, available at https://www.counties.org/data-and-research. This dataset provides populations by county in California. The format is an xlsx file, so we will need to load the library gdata.
include("gdata")
new_data <- read.xls("./population_by_county.xlsx", sheet="Population by County", header = TRUE)
head(new_data)
The first step in tidying this data is to remove any whitespace that may be present in a column. To do this, we will use the trimws function.
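As a minimal sketch of that step (the column name and values here are illustrative, not the actual dataset's), trimws strips leading and trailing whitespace from a character column:

```r
# toy data frame with stray whitespace in a character column
df <- data.frame(County = c(" Inyo ", "Shasta "), stringsAsFactors = FALSE)
df$County <- trimws(df$County)  # remove leading/trailing whitespace
df$County
```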
We also notice that the values representing populations contain commas and therefore cannot be converted to integers directly. To remove the commas, we use the gsub function and then convert with as.numeric.
new_data$Population <- as.numeric(gsub(",","",new_data$Population))
The next step is to make sure that each column has the correct type. The first column represents a county, which should be represented as a factor. The second column represents a year, which should also be represented as a factor. Finally, the last column represents population in that county, which should be represented as an integer.
new_data$County <- as.factor(new_data$County)
new_data$Year <- as.factor(new_data$Year)
new_data$Population <- as.integer(new_data$Population)
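A quick way to confirm the conversions took effect is to inspect each column's class. A small sketch with made-up values mirroring the County / Year / Population structure:

```r
# toy frame with the same three column types we want
df <- data.frame(
  County     = as.factor(c("Inyo", "Shasta")),
  Year       = as.factor(c("2010", "2011")),
  Population = as.integer(c(18000, 180000))
)
sapply(df, class)  # County and Year should be "factor", Population "integer"
```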
We also observe that there are many more counties listed in this dataset than we need. We can filter out the ones we don’t need by using the vector of counties that we created earlier: values.
#view what counties we are using
values
## [1] "San Bernardino" "Inyo" "San Bernadino" "Shasta"
## [5] "San Benito" "Tulare" "Tulare" "Mariposa"
#select all counties to be used
new_data <- new_data %>% filter(County %in% values)
#output resulting table
head(new_data)
Now that we have our 2 sources of data, we can join the new source to our previous data and see whether there are any correlations to be observed.
new_data1 <- inner_join(data, new_data, by=c("County", "Year"))
Because we joined on factors with different levels, the join coerced some of our factor columns to character. To ensure all columns have the correct types, we will re-assign their types.
new_data1$SITE_ID <- as.factor(new_data1$SITE_ID)
new_data1$WEEK <- as.numeric(new_data1$WEEK)
new_data1$County <- as.factor(new_data1$County)
new_data1$Year <- as.numeric(new_data1$Year)
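The re-typing above works because the join coerced the factor columns to character first. Note that calling as.numeric directly on a factor returns the internal level codes rather than the printed values, so a column that is still a factor should be routed through as.character first:

```r
yr <- factor(c("2012", "2010", "2011"))
as.numeric(yr)                # level codes 3 1 2, not the years
as.numeric(as.character(yr))  # the actual years: 2012 2010 2011
```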
Now that all the populations for those counties have been added to the table, we can create a model to determine how much effect Population has on harmful molecules in the air.
set.seed(385) #this is the number to start with for random number generator
sample_selection <- createDataPartition(new_data1$Concentration, p = 0.70, list = FALSE) # p to tell 70% split
train <- new_data1[sample_selection,]
test <- new_data1[-sample_selection,]
train_model <- lm(Concentration ~ SITE_ID + Year + WEEK + DATEON + DATEOFF + Molecule + County + Population, data=train)
summary(train_model)
##
## Call:
## lm(formula = Concentration ~ SITE_ID + Year + WEEK + DATEON +
## DATEOFF + Molecule + County + Population, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7884 -0.3115 -0.0266 0.1944 10.7144
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.035e+03 1.257e+03 4.005 6.23e-05 ***
## SITE_IDDEV412 -3.731e+00 3.482e-01 -10.714 < 2e-16 ***
## SITE_IDLAV410 -3.708e+00 3.216e-01 -11.529 < 2e-16 ***
## SITE_IDPIN414 -3.551e+00 3.432e-01 -10.344 < 2e-16 ***
## SITE_IDSEK402 -2.780e+00 2.821e-01 -9.856 < 2e-16 ***
## SITE_IDSEK430 -2.534e+00 2.738e-01 -9.253 < 2e-16 ***
## SITE_IDYOS404 -3.819e+00 3.499e-01 -10.914 < 2e-16 ***
## Year -2.553e+00 6.382e-01 -4.001 6.32e-05 ***
## WEEK -4.725e-02 1.223e-02 -3.864 0.000112 ***
## DATEON 8.062e-08 2.022e-08 3.986 6.73e-05 ***
## DATEOFF NA NA NA NA
## MoleculeSO4 3.789e-01 1.703e-02 22.246 < 2e-16 ***
## MoleculeNO3 2.905e-01 1.705e-02 17.036 < 2e-16 ***
## MoleculeHNO3 5.064e-01 1.698e-02 29.828 < 2e-16 ***
## MoleculeTNO3 1.220e+00 1.700e-02 71.778 < 2e-16 ***
## MoleculeNH4 -9.657e-02 1.699e-02 -5.683 1.33e-08 ***
## MoleculeCa -2.866e-01 1.707e-02 -16.791 < 2e-16 ***
## MoleculeNa -2.192e-01 1.704e-02 -12.862 < 2e-16 ***
## MoleculeMg -3.884e-01 1.702e-02 -22.823 < 2e-16 ***
## MoleculeK -3.710e-01 1.699e-02 -21.833 < 2e-16 ***
## MoleculeCl -3.221e-01 1.704e-02 -18.897 < 2e-16 ***
## CountyMariposa NA NA NA NA
## CountySan Benito NA NA NA NA
## CountySan Bernardino NA NA NA NA
## CountyShasta NA NA NA NA
## CountyTulare NA NA NA NA
## Population -1.740e-06 1.790e-07 -9.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6118 on 28547 degrees of freedom
## Multiple R-squared: 0.441, Adjusted R-squared: 0.4406
## F-statistic: 1126 on 20 and 28547 DF, p-value: < 2.2e-16
As we can see from the model above, several variables are not good predictors of the Concentration level of a molecule. We can refine our model by removing these insignificant variables. We can also remove County, which returns NAs because it is collinear with another of our variables (SITE_ID). Because the County column was generated from the SITE_ID column, we can assume that any correlation with County will also appear in SITE_ID.
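Those NA coefficients are what lm reports when a predictor is perfectly collinear with (aliased to) another; the alias function makes the dependency explicit. A self-contained sketch with two deliberately collinear predictors:

```r
# x2 is an exact linear function of x, so lm cannot estimate both
d <- data.frame(x = 1:10)
d$y  <- 2 * d$x + rnorm(10)
d$x2 <- 3 * d$x
fit <- lm(y ~ x + x2, data = d)
alias(fit)        # reports the dependency: x2 = 3 * x
coef(fit)["x2"]   # NA, just like the County terms above
```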
train_model <- lm(Concentration ~ SITE_ID + Molecule + Population + Year + WEEK, data=train)
summary(train_model)
##
## Call:
## lm(formula = Concentration ~ SITE_ID + Molecule + Population +
## Year + WEEK, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8095 -0.3109 -0.0263 0.1933 10.7257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.292e+01 1.845e+00 12.426 < 2e-16 ***
## SITE_IDDEV412 -3.821e+00 3.475e-01 -10.994 < 2e-16 ***
## SITE_IDLAV410 -3.794e+00 3.210e-01 -11.820 < 2e-16 ***
## SITE_IDPIN414 -3.642e+00 3.426e-01 -10.633 < 2e-16 ***
## SITE_IDSEK402 -2.872e+00 2.812e-01 -10.213 < 2e-16 ***
## SITE_IDSEK430 -2.605e+00 2.733e-01 -9.532 < 2e-16 ***
## SITE_IDYOS404 -3.913e+00 3.492e-01 -11.204 < 2e-16 ***
## MoleculeSO4 3.787e-01 1.704e-02 22.231 < 2e-16 ***
## MoleculeNO3 2.901e-01 1.705e-02 17.010 < 2e-16 ***
## MoleculeHNO3 5.063e-01 1.698e-02 29.814 < 2e-16 ***
## MoleculeTNO3 1.220e+00 1.700e-02 71.753 < 2e-16 ***
## MoleculeNH4 -9.678e-02 1.700e-02 -5.695 1.25e-08 ***
## MoleculeCa -2.865e-01 1.707e-02 -16.785 < 2e-16 ***
## MoleculeNa -2.195e-01 1.705e-02 -12.875 < 2e-16 ***
## MoleculeMg -3.887e-01 1.702e-02 -22.830 < 2e-16 ***
## MoleculeK -3.714e-01 1.700e-02 -21.850 < 2e-16 ***
## MoleculeCl -3.223e-01 1.705e-02 -18.902 < 2e-16 ***
## Population -1.790e-06 1.786e-07 -10.021 < 2e-16 ***
## Year -9.327e-03 9.656e-04 -9.659 < 2e-16 ***
## WEEK 1.484e-03 2.415e-04 6.144 8.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.612 on 28548 degrees of freedom
## Multiple R-squared: 0.4407, Adjusted R-squared: 0.4403
## F-statistic: 1184 on 19 and 28548 DF, p-value: < 2.2e-16
Based on these results, we can see that SITE_ID, Molecule, WEEK, Year, and Population are all significant in predicting the concentration level of harmful particles in the air. The model's p-value is far below the cutoff of .05, suggesting that our model is statistically significant. However, the R-squared value is fairly small, indicating that only about 44% of the variation can be explained by our model.
We can now use this model to help us make predictions about the concentration of harmful molecules in the air. We can use the R2 function to calculate the R-squared value, the square of the correlation between predicted and observed values. This is always a value between 0 and 1; the closer it is to 1, the stronger the correlation. Because our R-squared value is so low, we can conclude that we need more information in order to make an accurate prediction.
predictions <- train_model %>% predict(test)
R2(predictions, as.numeric(test$Concentration))
## [1] 0.4453836
We can also use the MAE function to measure the average error between our model's predicted values and the actual values. Here, the closer the result is to 0, the better our model works.
MAE(test$Concentration, predictions)
## [1] 0.3583829
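Both metrics are simple to compute by hand, which makes their meaning concrete. With caret's default settings, R2 is the squared correlation between predictions and observations, and MAE is the mean absolute difference; a toy sketch with made-up vectors:

```r
pred <- c(1.0, 2.0, 3.0)
obs  <- c(1.1, 1.9, 3.2)
r2  <- cor(pred, obs)^2       # matches caret's R2() default ("corr" formula)
mae <- mean(abs(obs - pred))  # matches caret's MAE()
c(r2 = r2, mae = mae)
```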
In order to show whether population is a good predictor of the concentration of harmful particles, we can make a few plots. This first plot shows how population affects the concentration of just SO2.
s02 <- new_data1 %>% filter(Molecule == "SO2")
ggplot() + geom_point(data = s02, mapping = aes(x = Population, y = Concentration))
This second plot shows how population affects the concentration of NH4.
NH4 <- new_data1 %>% filter(Molecule == "NH4")
ggplot() + geom_point(data = NH4, mapping = aes(x = Population, y = Concentration))
This third plot shows how population affects the concentration of HNO3.
HNO3 <- new_data1 %>% filter(Molecule == "HNO3")
ggplot() + geom_point(data = HNO3, mapping = aes(x = Population, y = Concentration))
Based on these plots, it is difficult to see that population is a significant predictor of the concentration level of harmful particles. We can conclude that this is caused either by limitations in our model (described in Phase 3), or by population needing other variables alongside it in order to accurately predict the concentration of harmful molecules.