---
title: "Dunn book reading group Chapter 01"
author: "Ed Harris"
date: "12/02/2020"
output: html_document
---

***

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(GLMsData)
```

# Dunn Ch01 notes

***

## Preface

* Book is intended for a wide audience
    + (a "second stats course" up to more advanced)
* Regression approach
* Builds on regression towards the generalized linear model
* {GLMsData} package

***

## 1.2 Data (and conventions)

**lungcap data**

**lungcap {GLMsData}** The health and smoking habits of 654 youths

**Age** the age of the subject in completed years; a numeric vector

**FEV** the forced expiratory volume in litres, a measure of lung capacity; a numeric vector

**Ht** the height in inches; a numeric vector

**Gender** the gender of the subjects: a factor with females coded as F and males as M

**Smoke** the smoking status of the subject: a numeric vector with non-smokers coded as 0 and smokers as 1

```{r lungcap-data}
library(GLMsData)   # Load the {GLMsData} package
data(lungcap)       # Load the lungcap data into memory
dim(lungcap)        # Dimensions (rows and columns) of the data
head(lungcap)       # Print the first 6 rows
```

***

### Notes on notation

$y_i = x_{1i} + x_{2i}$

*equivalent to*

dependent_var = first_ind_var + second_ind_var

*equivalent to*

FEV = Age + Ht

**NB** variable addressing notation: $y_i$ is the $i$th observation of variable $y$

*e.g.* $y_3$ *equivalent to*

```{r data-index}
lungcap$FEV[3]
```

***

### Factor issues with variable Smoke

```{r smoke-factor}
# Convert Smoke from 0/1 codes to a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                        levels = c(0, 1),                    # The values of Smoke
                        labels = c("Non-smoker", "Smoker"))  # The labels to use
table(lungcap$Smoke)
```

***

## 1.3 Plotting data

#### Univariate

```{r univariate-plots, fig.width=7, fig.height=7, echo=FALSE}
par(mfrow = c(2, 2))

plot(FEV ~ Age, data = lungcap,
     xlab = "Age (in years)",   # The x-axis label
     ylab = "FEV (in L)",       # The y-axis label
     main = "FEV vs age",       # The main title
     xlim = c(0, 20),           # Explicitly set x-axis limits
     ylim = c(0, 6),            # Explicitly set y-axis limits
     las = 1)                   # Makes axis labels horizontal

plot(FEV ~ Ht, data = lungcap,
     main = "FEV vs height",
     xlab = "Height (in inches)",
     ylab = "FEV (in L)",
     las = 1, ylim = c(0, 6))

plot(FEV ~ Gender, data = lungcap,
     main = "FEV vs gender",
     ylab = "FEV (in L)",
     las = 1, ylim = c(0, 6))

plot(FEV ~ Smoke, data = lungcap,
     main = "FEV vs smoking status",
     ylab = "FEV (in L)",
     xlab = "Smoking status",
     las = 1, ylim = c(0, 6))
```

***

#### Multivariate

```{r multivariate-plots, fig.width=7, fig.height=7, echo=FALSE}
par(mfrow = c(2, 2))

# Offset non-smokers' ages slightly so the two groups do not overplot
AgeAdjust <- lungcap$Age + ifelse(lungcap$Smoke == "Smoker", 0, 0.5)

plot(FEV ~ AgeAdjust, data = lungcap,
     pch = ifelse(Smoke == "Smoker", 1, 16),
     xlab = "Age (in years)",
     ylab = "FEV (in L)",
     main = "FEV vs age",
     las = 1)
legend("topleft", pch = c(16, 1),
       legend = c("Non-smokers", "Smokers"), bty = "n")

boxplot(lungcap$FEV ~ lungcap$Smoke + lungcap$Gender,
        ylab = "FEV (in L)",
        main = "FEV, by gender\n and smoking status",
        las = 2,   # Keeps labels perpendicular to the axes
        names = c("F:\nNon", "F:\nSmoker", "M:\nNon", "M:\nSmoker"),
        xlab = "")

interaction.plot(lungcap$Smoke, lungcap$Gender, lungcap$FEV,
                 xlab = "Smoking status",
                 ylab = "FEV (in L)",
                 main = "Mean FEV, by gender\n and smoking status",
                 trace.label = "Gender",
                 las = 1, legend = FALSE)
legend(x = 1, y = 3.6, legend = c("M", "F"), bty = "n", lty = c(1, 2))

interaction.plot(lungcap$Smoke, lungcap$Gender, lungcap$Age,
                 xlab = "Smoking status",
                 ylab = "Age (in years)",
                 main = "Mean age, by gender\n and smoking status",
                 trace.label = "Gender",
                 las = 1, legend = FALSE)
legend(x = 1, y = 13.5, legend = c("M", "F"), bty = "n", lty = c(1, 2))
```

***

**To make any further progress quantifying the relationship between the variables, mathematics is necessary to create a *statistical model*.**

***

## 1.4 Factors

**Concept of contrasts, "dummy vars"**

```{r gender-contrasts}
contrasts(lungcap$Gender)
```

**The contrasts reference level is arbitrary (alphabetical) unless you set it.**

```{r relevel-contrasts}
contrasts(relevel(lungcap$Gender, "M"))  # Now M is the reference level

# Smoke was already converted to a factor above, so its contrasts can be
# inspected directly (re-running factor() with levels 0/1 on the converted
# variable would turn every value into NA)
contrasts(lungcap$Smoke)
```

## 1.5 Statistical models

**random component** Describes the distribution of the dependent variable.

**systematic component** Describes how the mean of the dependent variable depends on the explanatory variables, e.g.

$\mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}$

Here:

* $\mu_i$ is the expected (mean) value of observation $y_i$
* $\beta_0, \ldots, \beta_4$ are the regression parameters: the intercept and the slopes
* $x_{1i}, \ldots, x_{4i}$ are the values of the explanatory variables Age, Ht, Gender and Smoke

***

Assumptions:

* $\text{var}[y_i] = \sigma^2$ **constant variance** (likely true given these data?)
* $y_i \sim N(\mu_i, \sigma^2)$ **Gaussian residuals** (merely popular...)

***
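Not code from this chapter of the notes, but a minimal sketch of what the systematic component above looks like when fitted as an ordinary linear regression (constant variance, Gaussian) with `lm()`. It assumes `lungcap` is loaded and `Smoke` has been converted to a factor as above; the object name `lm_fev` is just illustrative.

```{r lm-sketch}
# Sketch only: the Sect. 1.5 systematic component fitted by least squares,
# i.e. the constant-variance, Gaussian special case noted in Sect. 1.6 below
lm_fev <- lm(FEV ~ Age + Ht + Gender + Smoke, data = lungcap)
coef(lm_fev)            # estimates of beta_0 (intercept) and beta_1 ... beta_4
summary(lm_fev)$sigma   # the estimated residual standard deviation (sigma-hat)
```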
***

## 1.6 Regression models

* linear regression (e.g. assumes constant variance, Gaussian residuals, etc.)
* generalized linear model (GLM: other options for modelling the variance and the distribution)

**NB** linear regression *is* a GLM, in the specific case where we model constant variance and Gaussian residuals.

***

### Poisson model for count data

```{r nminer-plot}
data(nminer)   # Noisy miner counts, from {GLMsData}
head(nminer)

plot(jitter(Minerab) ~ Eucs, data = nminer,
     las = 1, ylim = c(0, 20),
     xlab = "Number of eucalypts per 2 ha",
     ylab = "Number of noisy miners")
```

***
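Not code from this point in the book, but a minimal sketch of how such counts might be modelled with a Poisson GLM via `glm()`, using only the variables plotted above. The log link is simply the default for `family = poisson`, and the object name `nminer_glm` is illustrative.

```{r nminer-glm-sketch}
# Sketch only: Poisson GLM for the noisy miner counts against eucalypt numbers
# (Poisson random component; the log link is the family default)
nminer_glm <- glm(Minerab ~ Eucs, data = nminer, family = poisson)
summary(nminer_glm)
```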
***

## 1.8 All Models Are Wrong, but Some Are Useful

Good quote.

* Prediction versus understanding
* Complexity, *Occam's Razor*
* Experiments versus observational studies

## Problems

1.1. The plots in Fig. 1.7 (data set: paper) show the strength of Kraft paper [7, 8] for different percentages of hardwood concentrations. Which systematic component, if any, appears most suitable for modelling the data? Explain.

**Cubic: better than quadratic, simpler than quartic.**

1.2. The plots in Fig. 1.8 (data set: heatcap) show the heat capacity of solid hydrogen bromide y measured as a function of temperature x [6, 16]. Which systematic component, if any, appears best for modelling the data? Explain.

**Cubic: better than linear or quadratic, and simpler than quartic.**

1.3. Consider the data plotted in Fig. 1.9. In the panels, quadratic, cubic and quartic systematic components are shown with the data. Which systematic component appears best for modelling the data? Explain.

**Cubic.**

The data are actually randomly generated using the systematic component $\mu = 1 + 10\exp(-x/2)$ (with added randomness), which is not a polynomial at all. Explain what this demonstrates about fitting systematic components.

**Demonstrates the trial-and-error nature of choosing a systematic component: a polynomial can describe the variation well even though the true relationship is not a polynomial at all.**

1.4. Consider the data plotted in Fig. 1.10 (data set: toxo). The data show the proportion of the population y testing positive to toxoplasmosis against the annual rainfall x for 34 cities in El Salvador [5]. Analysis suggests a cubic model fits the data reasonably well (though substantial variation still exists). What important features of the data are evident from the plot? Which of the plotted systematic components appears better? Explain.

**The observations are spread unevenly across the range of rainfall. The more extreme cubic linear-regression fit looks overfitted where the data are sparse, compared with the "gentler" GLM fit. There is also a range/extrapolation issue.**

1.5. For the following systematic components used in a regression model, determine if they are appropriate for regression models linear in the parameters, linear regression models, and/or generalized linear models. In all cases, $\beta_j$ refers to model parameters, $\mu$ is the expected value of the response variable, while $x$, $x_1$ and $x_2$ refer to explanatory variables.

**Probably 2, 3 and 4 are okay; 1 is definitely non-linear in the parameters!**

## Problem 1.6

1. Use names() to determine the names of the variables in the data frame.

```{r turbines-names}
library(GLMsData)
data(turbines)
names(turbines)
```

***

2. Determine which variables are quantitative and which are qualitative.

```{r turbines-str}
str(turbines)
```

**None are qualitative, but possibly Hours should be treated as qualitative - discuss?**

***

3. For any qualitative variables, define appropriate dummy variables using treatment coding.

**Not needed here, since none of the variables are qualitative; otherwise, treatment coding via factor() and contrasts() would be used, as in Sect. 1.4.**

***

4. Use R to summarize each variable.

```{r turbines-summary}
summary(turbines)
```

***

5. Use R to create a plot of the proportion of failures (turbines with fissures) against run-time.

```{r turbines-plot}
plot(Fissures/Turbines ~ Hours, data = turbines)
```

***

6. Determine the important features of the data evident from the plot.

**Not quite linear, and the variance increases with Hours. The data are described as an experiment - was the number of hours run manipulated?**

***

7. Would a linear regression model seem appropriate for modelling the data? Explain.

**Probably not unreasonable as a rough description, though it violates some assumptions (the response is a proportion and the variance is not constant).**

***

8. Read the help for the data frame (use ?turbines after loading the GLMsData package in R), and determine whether the data come from an observational or experimental study, then discuss the implications.

**This is an experiment, so we can make predictions about fissures as a function of hours run, and infer causation.**

***

## Problem 1.7

```{r humanfat-plots, fig.width=7, fig.height=7, echo=FALSE}
library(GLMsData)
data(humanfat)
names(humanfat)

contrasts(humanfat$Gender)  # relevel or label?
summary(humanfat)           # not many males

par(mfrow = c(2, 2))
plot(Percent.Fat ~ Age, data = humanfat)     # big gap in Age from 28 to ~40
plot(Percent.Fat ~ Gender, data = humanfat)  # F > M; male distribution wider
plot(Percent.Fat ~ BMI, data = humanfat)     # odd hump; outliers
plot.new()                                   # fill the empty fourth panel
# ?humanfat

# Probably need more information and a larger sample to generalize!
```
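***

Picking up the `# relevel or label?` note in the chunk above (and part 3 of Problem 1.6): a minimal sketch of treatment (dummy) coding for `Gender` in `humanfat`. It assumes only that `humanfat$Gender` is a factor, as the `contrasts()` call above implies, and switches the reference level by position rather than by a hard-coded label.

```{r humanfat-treatment-coding-sketch}
# Treatment (dummy) coding: the first level is the reference level by default
levels(humanfat$Gender)
contrasts(humanfat$Gender)

# Switch the reference level to the second level (by position, to avoid
# assuming particular level labels), as done for lungcap$Gender in Sect. 1.4
contrasts(relevel(humanfat$Gender, ref = levels(humanfat$Gender)[2]))
```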