---
title: "Dunn book reading group Chapter 01"
author: "Ed Harris"
date: "12/02/2020"
output: html_document
---

***

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(GLMsData)
```

# Dunn Ch01 notes

***

## Preface

* Book is intended for a wide audience
    + (a "second stats course" up to more advanced)
* Regression approach
* Builds on regression towards the generalized linear model
* {GLMsData} package

***

## 1.2 Data (and conventions)

**lungcap data**

**lungcap {GLMsData}** The health and smoking habits of 654 youths

**Age** the age of the subject in completed years; a numeric vector

**FEV** the forced expiratory volume in litres, a measure of lung capacity; a numeric vector

**Ht** the height in inches; a numeric vector

**Gender** the gender of the subjects: a factor with females coded as F and males as M

**Smoke** the smoking status of the subject: a numeric vector with non-smokers coded as 0 and smokers as 1

```{r lungcap-data}
library(GLMsData)   # Load the {GLMsData} package
data(lungcap)       # Load the lungcap data into memory
dim(lungcap)        # Dimensions (rows and columns) of the data
head(lungcap)       # Print the first 6 rows
```

***

### Notes on notation

$y_i = x_{1i} + x_{2i}$

*equivalent to*

dependent_var = first_ind_var + second_ind_var

*equivalent to*

FEV = Age + Ht

**NB** variable addressing notation: $y_i$ is the $i$th observation of variable $y$

*e.g.* $y_3$ *equivalent to*

```{r data-index}
lungcap$FEV[3]
```

***

### Factor issues with variable Smoke

```{r smoke-factor}
# Convert Smoke from 0/1 codes to a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                        levels = c(0, 1),                    # The values of Smoke
                        labels = c("Non-smoker", "Smoker"))  # The labels to use
table(lungcap$Smoke)
```

***

## 1.3 Plotting data

#### Univariate

```{r univariate-plots, fig.width=7, fig.height=7, echo=FALSE}
par(mfrow = c(2, 2))

plot(FEV ~ Age, data = lungcap,
     xlab = "Age (in years)",   # The x-axis label
     ylab = "FEV (in L)",       # The y-axis label
     main = "FEV vs age",       # The main title
     xlim = c(0, 20),           # Explicitly set x-axis limits
     ylim = c(0, 6),            # Explicitly set y-axis limits
     las = 1)                   # Makes axis labels horizontal

plot(FEV ~ Ht, data = lungcap,
     main = "FEV vs height",
     xlab = "Height (in inches)",
     ylab = "FEV (in L)",
     las = 1, ylim = c(0, 6))

plot(FEV ~ Gender, data = lungcap,
     main = "FEV vs gender",
     ylab = "FEV (in L)",
     las = 1, ylim = c(0, 6))

plot(FEV ~ Smoke, data = lungcap,
     main = "FEV vs smoking status",
     ylab = "FEV (in L)",
     xlab = "Smoking status",
     las = 1, ylim = c(0, 6))
```

***

#### Multivariate

```{r multivariate-plots, fig.width=7, fig.height=7, echo=FALSE}
par(mfrow = c(2, 2))

# Offset non-smokers' ages slightly so the two groups do not overplot
AgeAdjust <- lungcap$Age + ifelse(lungcap$Smoke == "Smoker", 0, 0.5)

plot(FEV ~ AgeAdjust, data = lungcap,
     pch = ifelse(Smoke == "Smoker", 1, 16),
     xlab = "Age (in years)",
     ylab = "FEV (in L)",
     main = "FEV vs age",
     las = 1)
legend("topleft", pch = c(16, 1),
       legend = c("Non-smokers", "Smokers"), bty = "n")

boxplot(lungcap$FEV ~ lungcap$Smoke + lungcap$Gender,
        ylab = "FEV (in L)",
        main = "FEV, by gender\n and smoking status",
        las = 2,   # Keeps labels perpendicular to the axes
        names = c("F:\nNon", "F:\nSmoker", "M:\nNon", "M:\nSmoker"),
        xlab = "")

interaction.plot(lungcap$Smoke, lungcap$Gender, lungcap$FEV,
                 xlab = "Smoking status",
                 ylab = "FEV (in L)",
                 main = "Mean FEV, by gender\n and smoking status",
                 trace.label = "Gender",
                 las = 1, legend = FALSE)
legend(x = 1, y = 3.6, legend = c("M", "F"), bty = "n", lty = c(1, 2))

interaction.plot(lungcap$Smoke, lungcap$Gender, lungcap$Age,
                 xlab = "Smoking status",
                 ylab = "Age (in years)",
                 main = "Mean age, by gender\n and smoking status",
                 trace.label = "Gender",
                 las = 1, legend = FALSE)
legend(x = 1, y = 13.5, legend = c("M", "F"), bty = "n", lty = c(1, 2))
```

***

**To make any further progress quantifying the relationship between the variables, mathematics is necessary to create a *statistical model*.**

***

## 1.4 Factors

**Concept of contrasts, "dummy vars"**

```{r gender-contrasts}
contrasts(lungcap$Gender)
```

**The contrasts reference level is arbitrary (alphabetical) unless you set it.**

```{r relevel-contrasts}
contrasts(relevel(lungcap$Gender, "M"))  # Now M is the reference level

# Smoke was already converted to a factor above, so its contrasts can be
# inspected directly (re-running factor() with levels 0/1 on the converted
# variable would turn every value into NA)
contrasts(lungcap$Smoke)
```

## 1.5 Statistical models

**random component** Describes the distribution of the dependent variable.

**systematic component** Describes how the mean of the dependent variable depends on the explanatory variables, e.g.

$\mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}$

Here:

* $\mu_i$ is the expected (mean) value of observation $y_i$
* $\beta_0, \ldots, \beta_4$ are the regression parameters: the intercept and the slopes
* $x_{1i}, \ldots, x_{4i}$ are the values of the explanatory variables Age, Ht, Gender and Smoke

***

Assumptions:

* $\text{var}[y_i] = \sigma^2$ **constant variance** (likely true given these data?)
* $y_i \sim N(\mu_i, \sigma^2)$ **Gaussian residuals** (merely popular...)

***
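Not code from this chapter of the notes, but a minimal sketch of what the systematic component above looks like when fitted as an ordinary linear regression (constant variance, Gaussian) with `lm()`. It assumes `lungcap` is loaded and `Smoke` has been converted to a factor as above; the object name `lm_fev` is just illustrative.

```{r lm-sketch}
# Sketch only: the Sect. 1.5 systematic component fitted by least squares,
# i.e. the constant-variance, Gaussian special case noted in Sect. 1.6 below
lm_fev <- lm(FEV ~ Age + Ht + Gender + Smoke, data = lungcap)
coef(lm_fev)            # estimates of beta_0 (intercept) and beta_1 ... beta_4
summary(lm_fev)$sigma   # the estimated residual standard deviation (sigma-hat)
```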
***

## 1.6 Regression models

* linear regression (e.g. assumes constant variance, Gaussian residuals, etc.)
* generalized linear model (GLM: other options for modelling the variance and the distribution)

**NB** linear regression *is* a GLM, in the specific case where we model constant variance and Gaussian residuals.

***

### Poisson model for count data

```{r nminer-plot}
data(nminer)   # Noisy miner counts, from {GLMsData}
head(nminer)

plot(jitter(Minerab) ~ Eucs, data = nminer,
     las = 1, ylim = c(0, 20),
     xlab = "Number of eucalypts per 2 ha",
     ylab = "Number of noisy miners")
```

***
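Not code from this point in the book, but a minimal sketch of how such counts might be modelled with a Poisson GLM via `glm()`, using only the variables plotted above. The log link is simply the default for `family = poisson`, and the object name `nminer_glm` is illustrative.

```{r nminer-glm-sketch}
# Sketch only: Poisson GLM for the noisy miner counts against eucalypt numbers
# (Poisson random component; the log link is the family default)
nminer_glm <- glm(Minerab ~ Eucs, data = nminer, family = poisson)
summary(nminer_glm)
```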
***

## 1.8 All Models Are Wrong, but Some Are Useful

Good quote.

* Prediction versus understanding
* Complexity, *Occam's Razor*
* Experiments versus observational studies

## Problems

1.1. The plots in Fig. 1.7 (data set: paper) show the strength of Kraft paper [7, 8] for different percentages of hardwood concentrations. Which systematic component, if any, appears most suitable for modelling the data? Explain.

**Cubic: better than quadratic, simpler than quartic.**

1.2. The plots in Fig. 1.8 (data set: heatcap) show the heat capacity of solid hydrogen bromide y measured as a function of temperature x [6, 16]. Which systematic component, if any, appears best for modelling the data? Explain.

**Cubic: better than linear or quadratic, and simpler than quartic.**

1.3. Consider the data plotted in Fig. 1.9. In the panels, quadratic, cubic and quartic systematic components are shown with the data. Which systematic component appears best for modelling the data? Explain.

**Cubic.**

The data are actually randomly generated using the systematic component $\mu = 1 + 10\exp(-x/2)$ (with added randomness), which is not a polynomial at all. Explain what this demonstrates about fitting systematic components.

**Demonstrates the trial-and-error nature of choosing a systematic component: a polynomial can describe the variation well even though the true relationship is not a polynomial at all.**

1.4. Consider the data plotted in Fig. 1.10 (data set: toxo). The data show the proportion of the population y testing positive to toxoplasmosis against the annual rainfall x for 34 cities in El Salvador [5]. Analysis suggests a cubic model fits the data reasonably well (though substantial variation still exists). What important features of the data are evident from the plot? Which of the plotted systematic components appears better? Explain.

**The observations are spread unevenly across the range of rainfall. The more extreme cubic linear-regression fit looks overfitted where the data are sparse, compared with the "gentler" GLM fit. There is also a range/extrapolation issue.**

1.5. For the following systematic components used in a regression model, determine if they are appropriate for regression models linear in the parameters, linear regression models, and/or generalized linear models. In all cases, $\beta_j$ refers to model parameters, $\mu$ is the expected value of the response variable, while $x$, $x_1$ and $x_2$ refer to explanatory variables.

**Probably 2, 3 and 4 are okay; 1 is definitely non-linear in the parameters!**

## Problem 1.6

1. Use names() to determine the names of the variables in the data frame.

```{r turbines-names}
library(GLMsData)
data(turbines)
names(turbines)
```

***

2. Determine which variables are quantitative and which are qualitative.

```{r turbines-str}
str(turbines)
```

**None are qualitative, but possibly Hours should be treated as qualitative - discuss?**

***

3. For any qualitative variables, define appropriate dummy variables using treatment coding.

**Not needed here, since none of the variables are qualitative; otherwise, treatment coding via factor() and contrasts() would be used, as in Sect. 1.4.**

***

4. Use R to summarize each variable.

```{r turbines-summary}
summary(turbines)
```

***

5. Use R to create a plot of the proportion of failures (turbines with fissures) against run-time.

```{r turbines-plot}
plot(Fissures/Turbines ~ Hours, data = turbines)
```

***

6. Determine the important features of the data evident from the plot.

**Not quite linear, and the variance increases with Hours. The data are described as an experiment - was the number of hours run manipulated?**

***

7. Would a linear regression model seem appropriate for modelling the data? Explain.

**Probably not unreasonable as a rough description, though it violates some assumptions (the response is a proportion and the variance is not constant).**

***

8. Read the help for the data frame (use ?turbines after loading the GLMsData package in R), and determine whether the data come from an observational or experimental study, then discuss the implications.

**This is an experiment, so we can make predictions about fissures as a function of hours run, and infer causation.**

***

## Problem 1.7

```{r humanfat-plots, fig.width=7, fig.height=7, echo=FALSE}
library(GLMsData)
data(humanfat)
names(humanfat)

contrasts(humanfat$Gender)  # relevel or label?
summary(humanfat)           # not many males

par(mfrow = c(2, 2))
plot(Percent.Fat ~ Age, data = humanfat)     # big gap in Age from 28 to ~40
plot(Percent.Fat ~ Gender, data = humanfat)  # F > M; male distribution wider
plot(Percent.Fat ~ BMI, data = humanfat)     # odd hump; outliers
plot.new()                                   # fill the empty fourth panel
# ?humanfat

# Probably need more information and a larger sample to generalize!
```
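***

Picking up the `# relevel or label?` note in the chunk above (and part 3 of Problem 1.6): a minimal sketch of treatment (dummy) coding for `Gender` in `humanfat`. It assumes only that `humanfat$Gender` is a factor, as the `contrasts()` call above implies, and switches the reference level by position rather than by a hard-coded label.

```{r humanfat-treatment-coding-sketch}
# Treatment (dummy) coding: the first level is the reference level by default
levels(humanfat$Gender)
contrasts(humanfat$Gender)

# Switch the reference level to the second level (by position, to avoid
# assuming particular level labels), as done for lungcap$Gender in Sect. 1.4
contrasts(relevel(humanfat$Gender, ref = levels(humanfat$Gender)[2]))
```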