---
title: "Dunn book reading group Chapter 02"
author: "Ed Harris"
date: "20/02/2020"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(GLMsData)
```

# Dunn Ch02 notes

***

## 2.1 Linear regression (special case of GLM)
* Notation and assumptions
* Least squares
* Multiple regression
* Examples in R
* Variations and model selection

***

## 2.2 Definitions
* random component $y$ 
* systematic component: $p$ explanatory vars $x_1, x_2, ..., x_p$
* Assumption of constant variance (across) 
**or**
* known *prior weights* exist

$u_i = β_0 + β_1x_{1i}, β_2x_{2i}, ..., β_px_{pi}$ (2.1)  
(aka multiple linear regression)

$E[y_i] = u_i$

$var[y_i] = σ^2 / w_i$ specify prior weights

$β_0$ the *intercept* ($y$ value when all explanatory vars are zero)

***

$μ = β_0 + β_1x_1$ *simple, plain old* linear regression
(i.e., all prior weights == 1)

***
**Assumptions**  
* **SUITABILITY** same regression model valid for all obs.  
* **LINEARITY** model is linear...  
* **INDEPENDENCE** of observations  
* **CONSTANT VARIANCE**  
![Constant variance](Fig2.1.png)

 

***  
**Example 2.1**  
library(GLMsData); data(gestation); str(gestation)  
Weight = human birthweight (Kg); Age = gestational age (weeks)  



```{r, fig.width = 7, fig.height = 5, echo = F }
data("gestation")
plot(Weight ~ Age, data = gestation, 
     ylab = "Baby weight (Kg)",
     xlab = "Gestational age (Weeks)",
     pch = ifelse(Births < 20, 1, 19), cex = 2,
     cex.lab =1.5)

legend(x = 22, y = 3.5,
       legend = c("n = 20+", "n < 20"),
       pch = c(19,1), cex = 1.5)

```

***  
NB, imbalance of observations
```{r}
gestation
sum(gestation$Births)
```

***  
##2.2 Simple regression and Least-Squares  

For our model, birth **Weight ~ Age**  
$μ_i = y_i − β_0 − β_1x_i$  
  
**residual error in our model**
$e_i = y_i − μ_i = y_i − β_0 − β_1x_i$   (2.3)  

**Sum of Squares equation in "general form" explicitly including prior weights**  
$S(β_0, β_1) = \sum_{i=1}^{n} w_i(y_i - u_i)^2$  (SSE)

***  
**Example 2.2 - look at the effect of different slopw and intercept estimates**  

```{r}
y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
# Try these values for beta0 and beta1
beta0.A <- -0.9 #intercept
beta1.A <- 0.1 #slope
mu.A <- beta0.A + beta1.A * x
SA <- sum( wts*(y - mu.A)^2 ) #same as (SSE)
SA # goals is to minimise this!


```
***  
```{r, fig.width = 7, fig.height = 5, echo = F }
data("gestation")
plot(Weight ~ Age, data = gestation, 
     ylab = "Baby weight (Kg)",
     xlab = "Gestational age (Weeks)",
     pch = 19, cex = 1,
     cex.lab =1.5, ylim=c(0,4), xlim=c(20,45))

mylmNoWt <- lm(Weight ~ Age, data = gestation)


abline(a = -.9, b = .1, lty=3)
abline(a = -3, b = .15, lty=2)
abline(a = -2.687, b = .1538, lty=1, lwd=1.5)
abline(mylmNoWt, col="red", lty=2)

anova(mylmNoWt)
coefficients(mylmNoWt)

mylmWt <- lm(Weight ~ Age, weights = Births,
               data = gestation)
anova(mylmWt)
summary(mylmWt)
coefficients(mylmWt)

```




***
**NB derivation of SSE for $B_0$ and $B_1$:   
***  
## 2.3.3 Variance estimation
**NB the estimate of variance $s^2$ is given as Mean Sq. error for residuals using anova()**  
**NB the estimate of standard deviation $s$ is given as residual standard error for residuals using summary()**  

***  
##2.4 Multiple Regression
*Estimation of coefficients analogous to simple linear reg.  
Lung data example  

```{r, fig.width=8, fig.height=4, echo = F}
par(mfrow=c(1,2))
data("lungcap")
scatter.smooth( lungcap$Ht, lungcap$FEV, las=1, col="grey",
ylim=c(0, 6), xlim=c(45, 75), # Use similar scales for comparisons
main="FEV", xlab="Height (in inches)", ylab="FEV (in L)" )

scatter.smooth( lungcap$Ht, log(lungcap$FEV), las=1, col="grey",
ylim=c(-0.5, 2), xlim=c(45, 75), # Use similar scales for comparisons
main="log of FEV", xlab="Height (in inches)", ylab="log of FEV (in L)")


```
**Discuss appropriateness of logged y var to regression assumptions**  
##2.5 Matrix Notation "*"  
***  
##2.6 Linear regression in R  
```{r}
gest.wtd <- lm( Weight ~ Age, data=gestation,
weights=Births) # The prior weights
summary(gest.wtd)
```

*"~" = "as a function of"  
*note weights  
```{r}
gest.ord <- lm( Weight ~ Age, data=gestation)
coef(gest.ord)
```
**Better fit with weights?**
```{r, fig.height=7, fig.width=7}
plot( Weight ~ Age, data=gestation, type="n",
las=1, xlim=c(20, 45), ylim=c(0, 4),
xlab="Gestational age (weeks)", ylab="Mean birthweight (in kg)" )
points( Weight[Births< 20] ~ Age[Births< 20], pch=1, data=gestation )
points( Weight[Births>=20] ~ Age[Births>=20], pch=19, data=gestation )
abline( coef(gest.ord), lty=2, lwd=2)
abline( coef(gest.wtd), lty=1, lwd=2)
legend("topleft", lwd=c(2, 2), bty="n",
lty=c(2, 1, NA, NA), pch=c(NA, NA, 1, 19), # NA shows nothing
legend=c("Ordinary regression", "Weighted regression",
"Based on 20 or fewer obs.","Based on more than 20 obs."))
```

***  
##Smoking example 2.15
Consider fitting the Model (2.14) to the lung capacity data
(lungcap), using age, height, gender and smoking status as explanatory variables,
and log(fev) as the response.

```{r}
# Recall, Smoke has been declared previously as a factor
lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap )
mylm <- lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap )
summary(mylm)
```
*** 
##2.7 Interpreting regression coefficients

**Challenge yourself with your own data:** take responsibility for the analysis and the NUMERICAL OUTPUTS to make sense.  I consider this to be the principle distinction between someone who is merely good at performing statistics, and someone who is a good scientist...  

***  
* Implications of logging our Y var in previous example
* coefficient values and predictions
* explicit interpretation of Smoke variable  


```{r} 
#Coefficient est. of Smoke on log FEV
# -0.04607
#Now, relate back to actual FEV
exp(-0.04607) #yawn not that large

#help(lungcap) 
# FEV
# the forced expiratory volume in LITRES, a measure of lung capacity; a numeric vector

#omfg this is in Litres...

```
##2.8 Inference of linear models  
Author pretends to now that we have only bee interested in the parameter estimates, without any actual interest in making a statistical inference.  
  
Here is where We need to start being explicit about the distribution of residuals.   
$y_i ∼ N(μ_i, σ^2/w_i)$
  
In practice, we formally relate this to the estimation of the model parameters, also assuming an explicit distribution.

$\hat{β_j} ∼ N(β_j , var[ \hatβ_j ])$
  
**We use this estimate for hypothesis testing - this is where we get our p-value from**  
  
**Is the estimate of a particular coefficient different to zero...?**  

```{r}
mylm1 <- lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap )
confint(mylm1)
```