Linear Regression is a supervised modeling technique for continuous data. The model fits a line that is closest to all observation in the dataset. The basic assumption here is that functional form is the line and it is possible to fit the line that will be closest to all observation in the dataset. Please note that if the basic assumption about the linearity of the model is away from reality then there is bound to have an error (bias towards linearity) in the model however best one will try to fit the model.
The aim of linear regression is to find a mathematical equation for a continuous response variable Y as a function of one or more X variable(s). So that you can use this regression model to predict the Y when only the X is known.
This mathematical equation can be generalised as follows:
Y=β1+β2 X+ϵ
where, β1 is the intercept and β2 is the slope.
Collectively, they are called regression coefficients and ϵ is the error term, the part of Y the regression model is unable to explain
How to apply linear regression
The coefficient for linear regression is calculated based on the sample data. The basic assumption here is that the sample is not biased. This assumption makes sure that the sample does not necessarily always overestimate or underestimate the coefficients. The idea is that a particular sample may overestimate or underestimate but if one takes multiple samples and try to estimate the coefficient multiple times, then the average of co-efficient from multiple samples will be spot on.
Problem:
For this post, let’s take the Boston dataset that is part of the MASS library in R Studio. Following are the features available in Boston dataset. The problem statement is to predict ‘medv’ based on the set of input features.
#Extract the data and create the training and testing sample
library(MASS)
library(ggplot2)
attach(Boston)
> names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm"
[7] "age" "dis" "rad" "tax" "ptratio" "black"
[13] "lstat" "medv"
Graphical Analysis
Lets try to understand these variables graphically.
Typically, for each of the predictors, the following plots help visualise the patterns:
Scatter plot: Visualise the linear relationship between the predictor and response
Density plot: To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred.
Let us see how to make each one of them.
Using Scatter Plot To Visualise The Relationship
Scatter plots can help visualise linear relationships between the response and predictor variables.
Ideally, if you have many predictor variables, a scatter plot is drawn for each one of them against the response, along with the line of best fit as seen below. (Below screenshot for the ‘crim’ against ‘medv’ variable)
Using BoxPlot To Check For Outliers
Generally, an outlier is any datapoint that lies outside the 1.5 * inter quartile range (IQR).
IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
Generally, an outlier is any datapoint that lies outside the 1.5 * inter quartile range (IQR).
IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
Multivariate Model Approach
Declaring an observation as an outlier based on a just one (rather unimportant) feature could lead to unrealistic inferences. When you have to decide if an individual entity (represented by row or observation) is an extreme value or not, it better to collectively consider the features (X’s) that matter. This is when Cook’s Distance play a role.
Cooks Distance
Cook’s distance is a measure computed with respect to a given regression model and therefore is impacted only by the X variables included in the model. But, what does cook’s distance mean? It computes the influence exerted by each data point (row) on the predicted outcome.
The cook’s distance for each observation i measures the change in Y^Y^ (fitted Y) for all observations with and without the presence of observation i, so we know how much the observation i impacted the fitted values. Mathematically, cook’s distance Di for observation i is computed as:
where,
Y^jY^j is the value of jth fitted response when all the observations are included.
Y^j(i)Y^j(i) is the value of jth fitted response, where the fit does not include observation i.
MSE is the mean squared error.
p is the number of coefficients in the regression model.
mod <- lm(ozone_reading ~ ., data=ozone)
cooksd <- cooks.distance(model1) #model1 develop using the baseline approach
Influence measures
In general use, those observations that have a cook’s distance greater than 4 times the mean may be classified as influential. This is not a hard boundary.
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") # add labels
If you extract and examine each influential row 1-by-1 (from below output), you will be able to reason out why that row turned out influential. It is likely that one of the X variables included in the model had extreme values.
Using Density Plot To Check If Response Variable Is Close To Normal
Let’s check for the distribution of response variable ‘medv’. The following figure shows the three distributions of ‘medv’ original, log transformation and square root transformation. We can see that both ‘log’ and ‘sqrt’ does a decent job to transform ‘medv’ distribution closer to normal. In the following model, I have selected ‘log’ transformation but it is also possible to try out ‘sqrt’ transformation.
Correlation Analysis
Correlation analysis studies the strength of relationship between two continuous variables. It involves computing the correlation coefficient between the the two variables.
So what is correlation? And how is it helpful in linear regression? Correlation is a statistical measure that shows the degree of linear dependence between two variables.
In order to compute correlation, the two variables must occur in pairs.
Correlation can take values between -1 to +1.
If one variables consistently increases with increasing value of the other, then they have a strong positive correlation (value close to +1).
Similarly, if one consistently decreases when the other increase, they have a strong negative correlation (value close to -1).
A value closer to 0 suggests a weak relationship between the variables.
A low correlation (-0.2 < x < 0.2) probably suggests that much of variation of the response variable (Y) is unexplained by the predictor (X). In that case, you should probably look for better explanatory variables.If you observe the Boston data set in the R console, for every instance where rm increases, the medv also increases along with it. That means, there is a strong positive relationship between them. So, the correlation between them will be closer to 1.However, correlation doesn’t imply causation. In other words, if two variables have high correlation, it does not mean one variable ’causes’ the value of the other variable to increase.Correlation is only an aid to understand the relationship. You can only rely on logic and business reasoning to make that judgement.
So, how to compute correlation in R?
Simply use the cor() function with the two numeric variables as arguments.
cor(Boston$rm, Boston$medv) # calculate correlation between rm and medv
#> [1] 0.6953599
Split the sample data and make the model
Let’s continue with model building. Split the input data into training and evaluation set and make the model for the training dataset. It can be seen that training dataset has 404 observations and testing dataset has 102 observations based on 80–20 split.
##Sample the dataset.
set.seed(1)
row.number <- sample(1:nrow(Boston), 0.8*nrow(Boston))
train = Boston[row.number,]
test = Boston[-row.number,]
> dim(train)
[1] 404 14
> dim(test)
[1] 102 14
Model Building — Model 1
Now as a first step we will fit the multiple regression models. We will start by taking all input variables in the multiple regression.
#Let’s make default model.#Let’s make default model.
model1 = lm(log(medv)~., data=train)
summary(model1)
par(mfrow=c(2,2))
for(i in 1:4)
plot(model1, which = i)
Call:
lm(formula = log(medv) ~ ., data = train)Residuals:
Min 1Q Median 3Q Max
-0.71392 -0.10435 -0.00913 0.10259 0.83290Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5387176 0.2331496 19.467 < 2e-16 ***
crim -0.0110546 0.0015143 -7.300 1.63e-12 ***
zn 0.0014176 0.0006281 2.257 0.024574 *
indus 0.0020512 0.0028308 0.725 0.469120
chas 0.0853159 0.0402646 2.119 0.034732 *
nox -0.9285807 0.1693534 -5.483 7.52e-08 ***
rm 0.0589055 0.0189839 3.103 0.002056 **
age 0.0002373 0.0006075 0.391 0.696247
dis -0.0598220 0.0091621 -6.529 2.06e-10 ***
rad 0.0152004 0.0030216 5.031 7.47e-07 ***
tax -0.0005681 0.0001709 -3.325 0.000968 ***
ptratio -0.0427382 0.0059698 -7.159 4.07e-12 ***
black 0.0003423 0.0001209 2.831 0.004885 **
lstat -0.0319466 0.0022703 -14.071 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.193 on 390 degrees of freedom
Multiple R-squared: 0.7933, Adjusted R-squared: 0.7864
F-statistic: 115.1 on 13 and 390 DF, p-value: < 2.2e-16
Gives this plot:
Observation from summary (model1)
Is there a relationship between predictor and response variables? We can answer this using F stats. This defines the collective effect of all predictor variables on the response variable. In this model, F=102.3 is far greater than 1, and so it can be concluded that there is a relationship between predictor and response variable.
Which of the predictor variables are significant? Based on the ‘p-value’ we can conclude on this. The lesser the ‘p’ value the more significant is the variable. From the ‘summary’ dump we can see that ‘zn’, ‘age’ and ‘indus’ are less significant features as the ‘p’ value is large for them. In next model, we can remove these variables from the model.
Is this model fit? We can answer this based on R2 (multiple-R-squared) value as it indicates how much variation is captured by the model. R2 closer to 1 indicates that the model explains the large value of the variance of the model and hence a good fit. In this case, the value is 0.7933 (closer to 1) and hence the model is a good fit.
Observation from the plot
Fitted vs Residual graph Residuals plots should be random in nature and there should not be any pattern in the graph. The average of the residual plot should be close to zero. From the above plot, we can see that the red trend line is almost at zero except at the starting location.
Normal Q-Q Plot Q-Q plot shows whether the residuals are normally distributed. Ideally, the plot should be on the dotted line. If the Q-Q plot is not on the line then models need to be reworked to make the residual normal. In the above plot, we see that most of the plots are on the line except at towards the end.
Scale-Location This shows how the residuals are spread and whether the residuals have an equal variance or not.
Residuals vs Leverage The plot helps to find influential observations. Here we need to check for points that are outside the dashed line. A point outside the dashed line will be influential point and removal of that will affect the regression coefficients.
Model Building — Model 2
As the next step, we can remove the four lesser significant features (‘zn’, age’ and ‘indus’ ) and check the model again.
# remove the less significant feature
model2 = update(model1, ~.-zn-indus-age)
summary(model2)
par(mfrow=c(2,2))
plot(model2)
Call:
lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = train)Residuals:
Min 1Q Median 3Q Max
-0.71008 -0.10962 -0.01296 0.10330 0.84030Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5489757 0.2327067 19.548 < 2e-16 ***
crim -0.0107965 0.0015120 -7.140 4.54e-12 ***
chas 0.0854786 0.0400433 2.135 0.033407 *
nox -0.9117698 0.1566753 -5.819 1.23e-08 ***
rm 0.0644686 0.0184026 3.503 0.000513 ***
dis -0.0523327 0.0072616 -7.207 2.95e-12 ***
rad 0.0143808 0.0028974 4.963 1.03e-06 ***
tax -0.0004624 0.0001505 -3.073 0.002263 **
ptratio -0.0464620 0.0055623 -8.353 1.16e-15 ***
black 0.0003416 0.0001211 2.821 0.005026 **
lstat -0.0314967 0.0021598 -14.583 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.1935 on 393 degrees of freedom
Multiple R-squared: 0.7905, Adjusted R-squared: 0.7851
F-statistic: 148.3 on 10 and 393 DF, p-value: < 2.2e-16
Gives this plot:
Observation from summary (model1)
Is there a relationship between predictor and response variable? F=131.2 is far greater than 1 and this value is more than the F value of the previous model. It can be concluded that there is a relationship between predictor and response variable.
Which of the variable are significant? Now in this model, all the predictors are significant.
Is this model fit? R2 =0.7905 is closer to 1 and so this model is a good fit. Please note that this value has decreased a little from the first model but this should be fine as removing three predictors caused a drop from 0.7933 to 0.7905 and this is a small drop. In other words, the contribution of three predictors towards explaining the variance is an only small value(0.0028) and hence it is better to drop the predictor.
Observation of the plot All the four plots look similar to the previous model and we don’t see any major effect.
Model Building — Model 3 & Model 4
We can now enhance the model by adding a square term to check for non-linearity. We can first try model3 by introducing square terms for all features ( from model 2). And in the next iteration, we can remove the insignificant feature from the model.
#Lets make default model and add square term in the model.
model3 = lm(log(medv)~crim+chas+nox+rm+dis+rad+tax+ptratio+
black+lstat+ I(crim^2)+ I(chas^2)+I(nox^2)+ I(rm^2)+ I(dis^2)+
I(rad^2)+ I(tax^2)+ I(ptratio^2)+ I(black^2)+ I(lstat^2), data=train)
summary(model3)
Call:
lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat + I(crim^2) + I(chas^2) + I(nox^2) +
I(rm^2) + I(dis^2) + I(rad^2) + I(tax^2) + I(ptratio^2) +
I(black^2) + I(lstat^2), data = train)Residuals:
Min 1Q Median 3Q Max
-0.74190 -0.08032 -0.00534 0.09105 0.75405Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.266e+00 9.571e-01 8.636 < 2e-16 ***
crim -2.754e-02 4.094e-03 -6.726 6.33e-11 ***
chas 7.821e-02 3.670e-02 2.131 0.033686 *
nox -1.723e+00 1.099e+00 -1.567 0.117826
rm -6.355e-01 1.406e-01 -4.518 8.31e-06 ***
dis -1.267e-01 2.534e-02 -5.001 8.71e-07 ***
rad 2.150e-02 1.051e-02 2.046 0.041442 *
tax -1.140e-03 5.835e-04 -1.954 0.051428 .
ptratio -1.543e-01 7.778e-02 -1.984 0.047922 *
black 1.622e-03 5.486e-04 2.957 0.003299 **
lstat -4.756e-02 6.012e-03 -7.910 2.76e-14 ***
I(crim^2) 2.052e-04 5.299e-05 3.873 0.000126 ***
I(chas^2) NA NA NA NA
I(nox^2) 4.512e-01 8.146e-01 0.554 0.579939
I(rm^2) 5.402e-02 1.105e-02 4.888 1.50e-06 ***
I(dis^2) 6.531e-03 2.017e-03 3.238 0.001307 **
I(rad^2) -1.448e-04 4.148e-04 -0.349 0.727159
I(tax^2) 7.655e-07 7.322e-07 1.045 0.296450
I(ptratio^2) 3.347e-03 2.216e-03 1.510 0.131741
I(black^2) -3.056e-06 1.203e-06 -2.540 0.011487 *
I(lstat^2) 5.130e-04 1.684e-04 3.046 0.002477 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.1741 on 384 degrees of freedom
Multiple R-squared: 0.8343, Adjusted R-squared: 0.8261
F-statistic: 101.7 on 19 and 384 DF, p-value: < 2.2e-16##Removing the insignificant variables.
model4=update(model3, ~.-nox-rad-tax-I(crim^2)-I(chas^2)-I(rad^2)-
I(tax^2)-I(ptratio^2)-I(black^2))
summary(model4)par(mfrow=c(2,2))
plot(model4)
Call:
lm(formula = log(medv) ~ crim + chas + rm + dis + ptratio + black +
lstat + I(nox^2) + I(rm^2) + I(dis^2) + I(lstat^2), data = train)Residuals:
Min 1Q Median 3Q Max
-0.73918 -0.09787 -0.00723 0.08868 0.82585Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6689322 0.4622249 14.428 < 2e-16 ***
crim -0.0104797 0.0013390 -7.827 4.72e-14 ***
chas 0.0909580 0.0379316 2.398 0.016954 *
rm -0.7616673 0.1445966 -5.268 2.29e-07 ***
dis -0.0918453 0.0239087 -3.842 0.000143 ***
ptratio -0.0308017 0.0049005 -6.285 8.72e-10 ***
black 0.0002632 0.0001134 2.321 0.020824 *
lstat -0.0440988 0.0060377 -7.304 1.57e-12 ***
I(nox^2) -0.6629324 0.1148901 -5.770 1.61e-08 ***
I(rm^2) 0.0654052 0.0113152 5.780 1.52e-08 ***
I(dis^2) 0.0045419 0.0020195 2.249 0.025063 *
I(lstat^2) 0.0003587 0.0001657 2.165 0.030997 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.1834 on 392 degrees of freedom
Multiple R-squared: 0.8123, Adjusted R-squared: 0.807
F-statistic: 154.2 on 11 and 392 DF, p-value: < 2.2e-16
Observation from summary (model4)
Is there a relationship between predictor and response variables? F-Stat is 154.2 and it is far greater than 1. So there is a relationship between predictor and response variable.
Which of the predictor variable are significant? All predictor variables are significant.
Is this model fit? R2 is 0.8123 and this is more ( and better ) than our first and second model.
Prediction
Till now we were checking training-error but the real goal of the model is to reduce the testing error. As we already split the sample dataset into training and testing dataset, we will use test dataset to evaluate the model that we have arrived upon. We will make a prediction based on ‘Model 4’ and will evaluate the model. As the last step, we will predict the ‘test’ observation and will see the comparison between predicted response and actual response value. RMSE explains on an average how much of the predicted value will be from the actual value. Based on RMSE = 3.682, we can conclude that on an average predicted value will be off by 3.682 from the actual value.
pred1 <- predict(model4, newdata = test)
rmse <- sqrt(sum((exp(pred1) - test$medv)^2)/length(test$medv))
c(RMSE = rmse, R2=summary(model4)$r.squared)
RMSE R2
3.6825564 0.8122965
par(mfrow=c(1,1))
plot(test$medv, exp(pred1))
Gives this plot:
How to know which regression model is best fit for the data?
The most common metrics to look at while selecting the model are:
Conclusion
The example shows how to approach linear regression modeling. The model that is created still has scope for improvement as we can apply techniques like removal of outliers using imputation methods, correlation detection to further improve the accuracy of more accurate prediction. There are other methods to check whether the accuracy can be further improved for the model like Random Forest and Boosting technique.
Comments