--- title: "Dunn book reading group Chapter 02" author: "Ed Harris" date: "20/02/2020" output: pdf_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) require(GLMsData) ``` # Dunn Ch02 notes *** ## 2.1 Linear regression (special case of GLM) * Notation and assumptions * Least squares * Multiple regression * Examples in R * Variations and model selection *** ## 2.2 Definitions * random component $y$ * systematic component: $p$ explanatory vars $x_1, x_2, ..., x_p$ * Assumption of constant variance (across) **or** * known *prior weights* exist $u_i = β_0 + β_1x_{1i}, β_2x_{2i}, ..., β_px_{pi}$ (2.1) (aka multiple linear regression) $E[y_i] = u_i$ $var[y_i] = σ^2 / w_i$ specify prior weights $β_0$ the *intercept* ($y$ value when all explanatory vars are zero) *** $μ = β_0 + β_1x_1$ *simple, plain old* linear regression (i.e., all prior weights == 1) *** **Assumptions** * **SUITABILITY** same regression model valid for all obs. * **LINEARITY** model is linear... * **INDEPENDENCE** of observations * **CONSTANT VARIANCE** ![Constant variance](Fig2.1.png) *** **Example 2.1** library(GLMsData); data(gestation); str(gestation) Weight = human birthweight (Kg); Age = gestational age (weeks) ```{r, fig.width = 7, fig.height = 5, echo = F } data("gestation") plot(Weight ~ Age, data = gestation, ylab = "Baby weight (Kg)", xlab = "Gestational age (Weeks)", pch = ifelse(Births < 20, 1, 19), cex = 2, cex.lab =1.5) legend(x = 22, y = 3.5, legend = c("n = 20+", "n < 20"), pch = c(19,1), cex = 1.5) ``` *** NB, imbalance of observations ```{r} gestation sum(gestation$Births) ``` *** ##2.2 Simple regression and Least-Squares For our model, birth **Weight ~ Age** $μ_i = y_i − β_0 − β_1x_i$ **residual error in our model** $e_i = y_i − μ_i = y_i − β_0 − β_1x_i$ (2.3) **Sum of Squares equation in "general form" explicitly including prior weights** $S(β_0, β_1) = \sum_{i=1}^{n} w_i(y_i - u_i)^2$ (SSE) *** **Example 2.2 - look at the effect of different slopw and intercept estimates** ```{r} y <- gestation$Weight x <- gestation$Age wts <- gestation$Births # Try these values for beta0 and beta1 beta0.A <- -0.9 #intercept beta1.A <- 0.1 #slope mu.A <- beta0.A + beta1.A * x SA <- sum( wts*(y - mu.A)^2 ) #same as (SSE) SA # goals is to minimise this! ``` *** ```{r, fig.width = 7, fig.height = 5, echo = F } data("gestation") plot(Weight ~ Age, data = gestation, ylab = "Baby weight (Kg)", xlab = "Gestational age (Weeks)", pch = 19, cex = 1, cex.lab =1.5, ylim=c(0,4), xlim=c(20,45)) mylmNoWt <- lm(Weight ~ Age, data = gestation) abline(a = -.9, b = .1, lty=3) abline(a = -3, b = .15, lty=2) abline(a = -2.687, b = .1538, lty=1, lwd=1.5) abline(mylmNoWt, col="red", lty=2) anova(mylmNoWt) coefficients(mylmNoWt) mylmWt <- lm(Weight ~ Age, weights = Births, data = gestation) anova(mylmWt) summary(mylmWt) coefficients(mylmWt) ``` *** **NB derivation of SSE for $B_0$ and $B_1$: *** ## 2.3.3 Variance estimation **NB the estimate of variance $s^2$ is given as Mean Sq. error for residuals using anova()** **NB the estimate of standard deviation $s$ is given as residual standard error for residuals using summary()** *** ##2.4 Multiple Regression *Estimation of coefficients analogous to simple linear reg. Lung data example ```{r, fig.width=8, fig.height=4, echo = F} par(mfrow=c(1,2)) data("lungcap") scatter.smooth( lungcap$Ht, lungcap$FEV, las=1, col="grey", ylim=c(0, 6), xlim=c(45, 75), # Use similar scales for comparisons main="FEV", xlab="Height (in inches)", ylab="FEV (in L)" ) scatter.smooth( lungcap$Ht, log(lungcap$FEV), las=1, col="grey", ylim=c(-0.5, 2), xlim=c(45, 75), # Use similar scales for comparisons main="log of FEV", xlab="Height (in inches)", ylab="log of FEV (in L)") ``` **Discuss appropriateness of logged y var to regression assumptions** ##2.5 Matrix Notation "*" *** ##2.6 Linear regression in R ```{r} gest.wtd <- lm( Weight ~ Age, data=gestation, weights=Births) # The prior weights summary(gest.wtd) ``` *"~" = "as a function of" *note weights ```{r} gest.ord <- lm( Weight ~ Age, data=gestation) coef(gest.ord) ``` **Better fit with weights?** ```{r, fig.height=7, fig.width=7} plot( Weight ~ Age, data=gestation, type="n", las=1, xlim=c(20, 45), ylim=c(0, 4), xlab="Gestational age (weeks)", ylab="Mean birthweight (in kg)" ) points( Weight[Births< 20] ~ Age[Births< 20], pch=1, data=gestation ) points( Weight[Births>=20] ~ Age[Births>=20], pch=19, data=gestation ) abline( coef(gest.ord), lty=2, lwd=2) abline( coef(gest.wtd), lty=1, lwd=2) legend("topleft", lwd=c(2, 2), bty="n", lty=c(2, 1, NA, NA), pch=c(NA, NA, 1, 19), # NA shows nothing legend=c("Ordinary regression", "Weighted regression", "Based on 20 or fewer obs.","Based on more than 20 obs.")) ``` *** ##Smoking example 2.15 Consider fitting the Model (2.14) to the lung capacity data (lungcap), using age, height, gender and smoking status as explanatory variables, and log(fev) as the response. ```{r} # Recall, Smoke has been declared previously as a factor lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap ) mylm <- lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap ) summary(mylm) ``` *** ##2.7 Interpreting regression coefficients **Challenge yourself with your own data:** take responsibility for the analysis and the NUMERICAL OUTPUTS to make sense. I consider this to be the principle distinction between someone who is merely good at performing statistics, and someone who is a good scientist... *** * Implications of logging our Y var in previous example * coefficient values and predictions * explicit interpretation of Smoke variable ```{r} #Coefficient est. of Smoke on log FEV # -0.04607 #Now, relate back to actual FEV exp(-0.04607) #yawn not that large #help(lungcap) # FEV # the forced expiratory volume in LITRES, a measure of lung capacity; a numeric vector #omfg this is in Litres... ``` ##2.8 Inference of linear models Author pretends to now that we have only bee interested in the parameter estimates, without any actual interest in making a statistical inference. Here is where We need to start being explicit about the distribution of residuals. $y_i ∼ N(μ_i, σ^2/w_i)$ In practice, we formally relate this to the estimation of the model parameters, also assuming an explicit distribution. $\hat{β_j} ∼ N(β_j , var[ \hatβ_j ])$ **We use this estimate for hypothesis testing - this is where we get our p-value from** **Is the estimate of a particular coefficient different to zero...?** ```{r} mylm1 <- lm( log(FEV) ~ Age + Ht + Gender + factor(Smoke), data=lungcap ) confint(mylm1) ```