-Learn as you go philosophy
-Data exploration
-Linear regression modelling
-Linear regression assumptions
26/03/2020
Nereis data
Variables here, I think:
concentration (numeric) nutrient concentration
biomass (numeric) polychaete biomass
nutrient (factor) nutrient type - reads in as numeric but is actually a categorical factor
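A minimal sketch of the numeric-versus-factor point above. The data frame `toy` is simulated (not the real Nereis data) and the column name `fNutrient` is made up for illustration.

```r
# Simulated stand-in for the Nereis data; the values are invented
set.seed(1)
toy <- data.frame(
  concentration = round(runif(9, 0, 3), 3),
  biomass       = rep(c(0, 0.5, 1), times = 3),
  nutrient      = rep(1:3, each = 3)   # numeric codes, as they would read in
)

str(toy$nutrient)                      # integer codes, treated as numbers

# Convert the codes to a categorical factor (hypothetical column name)
toy$fNutrient <- factor(toy$nutrient)
levels(toy$fNutrient)

# Why it matters: numeric coding fits one slope across the codes,
# factor coding fits a separate mean for each nutrient level
n_coef_numeric <- length(coef(lm(concentration ~ nutrient,  data = toy)))
n_coef_factor  <- length(coef(lm(concentration ~ fNutrient, data = toy)))
c(numeric = n_coef_numeric, factor = n_coef_factor)
```

With the numeric coding the model assumes nutrient 2 sits exactly halfway between nutrients 1 and 3, which is rarely what is meant for a nutrient type.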
##   concentration biomass nutrient
## 1         0.050     0.0        1
## 2         0.105     0.0        1
## 3         0.105     0.0        1
## 4         0.790     0.5        1
## 5         0.210     0.5        1
## 6         2.100     0.5        1
par(mfrow = c(1, 2))
Nereis <- read.table(file = "Nereis.txt", header = TRUE)
# Cleveland dotplot of concentration, grouped by nutrient type
dotchart(Nereis$concentration,
         groups = factor(Nereis$nutrient),
         ylab = "Nutrient", xlab = "Concentration",
         main = "Cleveland dotplot",
         pch = Nereis$nutrient)
# Same data as a jittered scatterplot
plot(jitter(nutrient) ~ concentration, data = Nereis,
     pch = Nereis$nutrient)
# NB possible effect (bigger mean in nutrient 2);
# also, variance seems lower at nutrient level 3
boxplot(concentration ~ factor(nutrient), data = Nereis,
        ylab = "Concentration", xlab = "Nutrient")
\(Y_i = \alpha + \beta \times X_i +\epsilon _i\)
\(Y_i\) = dependent (response) variable
\(X_i\) = explanatory variable
\(\alpha\) and \(\beta\) = intercept and slope
\(\epsilon _i\) = residual error
\(\epsilon _i \sim N(0, \sigma ^2)\)
Assumption: the residual error is Gaussian,
with expected value 0 and variance \(\sigma ^2\)
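The model above can be sketched by simulating from it and refitting. The parameter values (`alpha`, `beta`, `sigma`) and the sample size are arbitrary choices for illustration, not taken from any dataset in these notes.

```r
set.seed(42)
n     <- 200
alpha <- 2     # assumed "true" intercept
beta  <- 0.5   # assumed "true" slope
sigma <- 1     # assumed residual standard deviation

X   <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # epsilon_i ~ N(0, sigma^2)
Y   <- alpha + beta * X + eps          # Y_i = alpha + beta * X_i + epsilon_i

fit <- lm(Y ~ X)
coef(fit)            # estimates of alpha and beta
summary(fit)$sigma   # estimate of sigma
```

The estimates should land near the values used in the simulation, and least-squares residuals average to zero whenever the model includes an intercept.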
-assume Gaussian residual error
-homoscedasticity of error
-no weird values
RIKZ data, R = spp richness, NAP = tide height
P = probability density
Vanilla linear model full assumptions
-Gaussian residuals
-Homogeneous variance
-“fixed” X (discuss briefly)
-Independence
-Correct model specification…
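A rough sketch of how the first two items in the list above are usually eyeballed, using simulated data (all names and values below are illustrative). Independence and the "fixed" X assumption cannot be read off these plots; they depend on how the data were collected.

```r
set.seed(2020)
n_obs <- 100
x_sim <- runif(n_obs, 0, 10)
y_sim <- 3 + 1.5 * x_sim + rnorm(n_obs, sd = 2)

fit_sim <- lm(y_sim ~ x_sim)
res     <- resid(fit_sim)

par(mfrow = c(1, 2))
# Homogeneous variance: look for an even band around zero, no funnel shape
plot(fitted(fit_sim), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# Gaussian residuals: points should follow the reference line
qqnorm(res)
qqline(res)
par(mfrow = c(1, 1))
```

A curved pattern in the left-hand panel would also hint at incorrect model specification (e.g. a missing non-linear term).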
“the underlying concept of normality is grossly misunderstood by many researchers. The linear regression model requires normality of the data, and therefore of the residuals at each X value”
Important but is it black and white? (Sokal and Rohlf, 1995; Zar, 1999)
“heterogeneity (violation of homogeneity), also called heteroscedasticity, happens if the spread of the data is not the same at each X value, and this can be checked by comparing the spread of the residuals for the different X values”
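The check described in the quote (comparing residual spread across X values) can be sketched as follows. The data are simulated so that the error standard deviation grows with X; the numbers are arbitrary.

```r
set.seed(7)
x_het <- rep(1:5, each = 40)                   # 5 distinct X values, 40 obs each
y_het <- 1 + 2 * x_het +
  rnorm(length(x_het), sd = 0.5 * x_het)       # error SD increases with X

fit_het <- lm(y_het ~ x_het)

# Standard deviation of the residuals at each X value
spread <- tapply(resid(fit_het), x_het, sd)
round(spread, 2)
```

Under homogeneity the five values would be roughly equal; here they increase with X, which is the heterogeneity pattern.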
-Concept of fixed versus random “effects”
-Explanatory variables are fixed if:
1) experimentally assigned
2) low error in the sample estimate relative to the population
-Can be serious (ref to Faraway 2005)
Violation of independence is the most serious of the assumption violations in linear models, and it is very common too.
2 related causes:
-Dependence structure inherent in the model
(e.g. multiple samples in a plot)
-Other dependence in the data
(e.g. measuring growth at multiple points in time)
NB we fix this with a mixed effects model…
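A minimal sketch of the first cause (multiple samples in a plot) and the mixed-model fix, on simulated data. It assumes the `nlme` package (shipped with standard R installations) and uses a random intercept per plot; the variable names and effect sizes are invented, and this is one possible specification rather than the one used later in the course.

```r
library(nlme)

set.seed(3)
n_plots <- 10
n_per   <- 5
plot_id <- factor(rep(seq_len(n_plots), each = n_per))

# Each plot gets its own random shift, so samples within a plot are correlated
plot_effect <- rnorm(n_plots, sd = 2)
x_mix <- runif(n_plots * n_per, 0, 10)
y_mix <- 1 + 0.5 * x_mix + plot_effect[as.integer(plot_id)] +
  rnorm(n_plots * n_per, sd = 1)
d_mix <- data.frame(y_mix, x_mix, plot_id)

# Ordinary linear model: ignores the grouping, treats all 50 rows as independent
m_lm  <- lm(y_mix ~ x_mix, data = d_mix)

# Mixed effects model: random intercept for each plot
m_lme <- lme(y_mix ~ x_mix, random = ~ 1 | plot_id, data = d_mix)

fixef(m_lme)     # fixed intercept and slope
VarCorr(m_lme)   # between-plot and residual variance components
```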
Clams <- read.table("Clams.txt", header = TRUE)
str(Clams)
## 'data.frame':    398 obs. of  5 variables:
##  $ MONTH   : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ LENGTH  : num  28.4 16.6 13.7 17.4 11.8 ...
##  $ AFD     : num  0.248 0.052 0.028 0.07 0.022 0.187 0.361 0.05 0.087 0.128 ...
##  $ LNLENGTH: num  3.35 2.81 2.62 2.86 2.47 ...
##  $ LNAFD   : num  -1.39 -2.96 -3.57 -2.65 -3.83 ...
MONTH - month of measurement
LENGTH - length (mm?)
AFD - weight
LNLENGTH - log(LENGTH)
LNAFD - log(AFD)
models:
LNAFD ~ LNLENGTH + MONTH
LNAFD ~ LNLENGTH * MONTH
Clams$MONTH <- factor(Clams$MONTH)
M1 <- lm(LNAFD ~ LNLENGTH * MONTH, data = Clams)
drop1(M1, test = "F")
## Single term deletions
##
## Model:
## LNAFD ~ LNLENGTH * MONTH
##                Df Sum of Sq    RSS     AIC F value  Pr(>F)
## <none>                      6.4490 -1616.8
## LNLENGTH:MONTH  5   0.20328 6.6523 -1614.4  2.4334 0.03444 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
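What `drop1()` is doing for the interaction term can be sketched by comparing the two candidate models directly with `anova()`. The data below are simulated with a month-specific slope (three made-up months, invented coefficients), so the numbers will not match the Clams output.

```r
set.seed(11)
n_clam <- 120
month  <- factor(rep(c("m1", "m2", "m3"), each = n_clam / 3))
lnlen  <- rnorm(n_clam, mean = 3, sd = 0.4)
slope  <- c(2.5, 2.8, 3.1)[as.integer(month)]      # slope differs by month
lnafd  <- -9 + slope * lnlen + rnorm(n_clam, sd = 0.2)
toy_clams <- data.frame(lnafd, lnlen, month)

m_add <- lm(lnafd ~ lnlen + month, data = toy_clams)  # additive model
m_int <- lm(lnafd ~ lnlen * month, data = toy_clams)  # with interaction

d1 <- drop1(m_int, test = "F")   # F test for dropping the interaction
a1 <- anova(m_add, m_int)        # same nested-model comparison, done by hand

d1["lnlen:month", "F value"]
a1[2, "F"]
```

The two F values agree because dropping the interaction from `m_int` gives exactly `m_add`. Note that `drop1()` only offers the interaction here: the main effects are not tested while an interaction containing them is still in the model.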