Linear Regression
Simple Linear Regression
Barbara Yam
6 May 2020
Triangle <- read.csv("Triangle.csv")
head(Triangle)
## Store AvgDispIncome Sales
## 1 1 22.3 3.7
## 2 2 36.6 3.9
## 3 3 55.5 6.7
## 4 4 46.7 9.5
## 5 5 32.4 3.4
## 6 6 31.7 5.6
# Q1a. creating a scatterplot between Average Disposable Income($000) and Annual Sales (Millions $)
plot(Triangle$AvgDispIncome, Triangle$Sales,
     main = "Scatterplot between Income and Annual Sales",
     xlab = "Average Disposable Income", ylab = "Sales",
     pch = 19)
# Q1b. compute the correlation coefficient
cor(Triangle$AvgDispIncome, Triangle$Sales)
## [1] 0.6982346
Q1c. Can a simple linear regression be applied? The correlation coefficient of 0.698, close to 0.7, suggests a moderately strong positive linear correlation between average disposable income and sales.
The scatter plot also suggests that sales tend to be higher when average disposable income is higher, so a simple linear regression is a reasonable model to try.
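As a rough numeric check on the strength of this correlation, the coefficient can be converted into a t statistic. The sketch below assumes n = 14 stores, which is inferred from the 12 residual degrees of freedom reported in the regression summary and is not stated directly in the data description.

```r
# Correlation copied from the output above; n = 14 is an assumption
# inferred from the 12 residual degrees of freedom in the summary.
r <- 0.6982346
n <- 14

# t statistic for H0: the true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_value), 4)
```

With the full data loaded, cor.test(Triangle$AvgDispIncome, Triangle$Sales) should report the same test.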
linear_Model <- lm(Sales~AvgDispIncome,data=Triangle)
linear_Model
##
## Call:
## lm(formula = Sales ~ AvgDispIncome, data = Triangle)
##
## Coefficients:
## (Intercept) AvgDispIncome
## -1.9412 0.1929
summary(linear_Model)
##
## Call:
## lm(formula = Sales ~ AvgDispIncome, data = Triangle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4518 -1.6089 -0.1991 1.4032 3.7079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.94122 2.37999 -0.816 0.43060
## AvgDispIncome 0.19295 0.05711 3.379 0.00548 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.235 on 12 degrees of freedom
## Multiple R-squared: 0.4875, Adjusted R-squared: 0.4448
## F-statistic: 11.42 on 1 and 12 DF, p-value: 0.005481
The model is as follows: Sales ($ millions) = 0.19295 * AvgDispIncome ($'000) - 1.94122
The intercept means that with $0 average disposable income, predicted sales would be -$1.94122 million. This is not meaningful: negative sales are impossible, and an income of $0 is an extrapolation beyond the dataset.
The positive slope of 0.19295 means that for every $1,000 increase in average disposable income, sales are predicted to increase by $192,950. Average disposable income is positively related to sales.
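The unit conversion and the fitted line can be reproduced with a few lines of arithmetic. This is a minimal sketch using the coefficients reported in the summary; the helper predict_sales is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# Coefficients copied from summary(linear_Model)
intercept <- -1.94122   # $ millions
slope     <-  0.19295   # $ millions of sales per $1,000 of income

# Hypothetical helper: predicted sales ($ millions) for income in $'000
predict_sales <- function(income_thousands) {
  intercept + slope * income_thousands
}

# Change in sales, in dollars, per $1,000 increase in income
slope_in_dollars <- slope * 1e6
slope_in_dollars

# Predicted sales at incomes of $0, $40,000 and $65,000
predict_sales(c(0, 40, 65))
```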
Q1d. Determine the R-square. R-square is a goodness-of-fit measure for linear regression models. Here R-square = 0.4875, which means that 48.75% of the variance of the dependent variable, Sales, is explained by the independent variable, Average Disposable Income.
Generally, an R-square value close to 1 (for example, above 0.7) is preferred. Here, a value of 0.4875 leaves about half of the variation in sales unexplained, so predictions carry considerable uncertainty. More analysis is necessary to improve the goodness of fit.
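The R-square values in the summary can be cross-checked against the correlation coefficient. The sketch below assumes n = 14 observations (12 residual degrees of freedom plus 2 estimated coefficients), which is an inference from the output and not a figure stated in the report.

```r
r_squared <- 0.4875      # Multiple R-squared from the summary
r         <- 0.6982346   # correlation coefficient from cor()
n         <- 14          # assumption: 12 residual df + 2 coefficients
k         <- 1           # number of predictors

# In simple linear regression, R-squared equals the squared correlation
r^2

# Adjusted R-squared penalises for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
adj_r_squared
```

Both results should agree with the summary output up to rounding.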
# Q1e.perform a residual analysis and evaluate the regression assumptions
plot(linear_Model)
The Residuals vs Fitted graph shows a good dispersion of points about the line Residuals = 0. The graph is relatively shapeless, with no clear patterns and no obvious outliers, and the points are roughly symmetrically distributed around the 0 line without particularly large residuals.
The data points on the QQ plot follow the diagonal line, which indicates that the residuals are approximately normally distributed.
The assumptions of linear regression appear to be met:
1. The scatter plot shows that Sales and Average Disposable Income have an approximately linear relationship.
2. There is homoscedasticity, as the variance of the residuals is roughly constant across the fitted values.
3. The residuals follow an approximately normal distribution.
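The visual checks above can be supplemented with simple numeric ones. The sketch below uses only the six rows printed by head(Triangle), because the full file is not reproduced here, so its results will not match the model fitted on all stores. The correlation between absolute residuals and fitted values is an informal check chosen for illustration, not a formal test.

```r
# Only the six rows shown by head(Triangle); NOT the full data set,
# so this fit differs from linear_Model in the report.
sample_rows <- data.frame(
  AvgDispIncome = c(22.3, 36.6, 55.5, 46.7, 32.4, 31.7),
  Sales         = c(3.7, 3.9, 6.7, 9.5, 3.4, 5.6)
)
fit <- lm(Sales ~ AvgDispIncome, data = sample_rows)
res <- residuals(fit)

# Normality: a large p-value gives no evidence against normal residuals
shapiro.test(res)

# Informal constant-variance check: a value near 0 suggests the spread
# of the residuals does not grow or shrink with the fitted values
cor(abs(res), fitted(fit))
```

On the full data, the same calls could be applied to residuals(linear_Model) and fitted(linear_Model).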
Q1f. At the 0.05 level of significance, is there evidence of a linear relationship between mean disposable income and sales?
Since p = 0.00548 < 0.05, there is sufficient evidence at the 0.05 level of significance of a linear relationship: Average Disposable Income is a statistically significant predictor of Sales.
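The same conclusion can be read from a 95% confidence interval for the slope: if the interval excludes 0, the slope is significant at the 0.05 level. This is a minimal sketch built from the estimate, standard error and degrees of freedom reported in the summary.

```r
# Values copied from summary(linear_Model)
estimate  <- 0.19295   # slope for AvgDispIncome
std_error <- 0.05711
df_resid  <- 12

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

With the model object available, confint(linear_Model) should give the same interval up to rounding.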
Q2. Should average disposable income be used to predict sales? Yes. The model shows a statistically significant positive relationship between disposable income and sales. However, more variables should be considered, because the R-square of 0.4875 shows that only 48.75% of the variance of Sales is explained by Average Disposable Income.
Q3. Should the claims be accepted? (Shops will do no less than 10.6 million dollars in sale.) The model is as follows:
Sales (in Millions) = 0.1929 * Average Disposable Income (in Thousands) -1.9412
Given that the Average Disposable Income in the areas is no less than 65k, the model predicts sales of about 10.6 million dollars at an income of 65k.
new_data <- data.frame(AvgDispIncome=65)
predict(linear_Model,newdata=new_data,interval="confidence")
## fit lwr upr
## 1 10.6004 7.267873 13.93294
However, based on the 95% confidence interval, with 65k in disposable income, mean sales could be expected to lie between 7.27 million and 13.93 million dollars.
Because the lower bound is well below 10.6 million, it is not advisable to accept the claim that sales would be no less than 10.6 million dollars.
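The interval above describes mean sales across stores at that income level. For a single new store, a prediction interval is the more relevant range, and it is wider. The sketch below back-calculates it from the reported output only; it assumes the standard relation between the standard error of the fitted mean and the residual standard error.

```r
# Values copied from the predict() and summary() output
fit      <- 10.6004    # predicted sales at 65k ($ millions)
ci_lwr   <- 7.267873   # lower 95% confidence bound
sigma    <- 2.235      # residual standard error
df_resid <- 12

t_crit <- qt(0.975, df = df_resid)

# Standard error of the fitted mean, recovered from the CI half-width
se_fit <- (fit - ci_lwr) / t_crit

# A single new observation adds the residual variance
se_pred <- sqrt(se_fit^2 + sigma^2)

pred_interval <- fit + c(-1, 1) * t_crit * se_pred
round(pred_interval, 2)
```

With the model object available, predict(linear_Model, newdata = new_data, interval = "prediction") should return this interval directly.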
Q4. Are there other factors not mentioned that might be relevant to the store leasing decision? Yes. Relevant factors may include operating costs such as logistics and store rental, the size of the store, and whether there are similar stores in the area.