Set up

Data Exploration

Below code uses Zuurs code for the two cleveland dotplots but then ggplot2 for the boxplot. Scale_colour_manual is used to dictate the colours of the box plots

Nereis <- read.delim("C:/Users/Joe/OneDrive/Documents/R practice/ZUUR/ZuurDataMixedModelling/Nereis.txt")

dotchart(Nereis$concentration, ylab = "Order of observations", xlab = "Concentration", main = "Cleveland dotplot") 

dotchart(Nereis$concentration, groups = factor(Nereis$nutrient), ylab = "Nutrient", xlab = "Concentration", main = "Cleveland dotplot", pch = Nereis$nutrient)

ggplot(Nereis,aes(factor(nutrient),concentration))+geom_boxplot(outlier.shape = NA, aes(colour = factor(nutrient)))+geom_jitter(width = 0.1)+
  labs(title = "Boxplot with geom_jitter layered above", 
       x= "Nutrient used as a factor", 
       y="Concentration")+
  scale_colour_manual(name="Nutrient", values = c("hotpink", "grey20", "red"))

##Clams and residuals

Looking at the wedge clam data initially as a straight forward scatter plot with AFD plotted against length.
The ggplot code below colours the months differently, there is no analytical output from it, it is just a visual for us to see if there is any obvious grouping. The colour is assigned to different months by adding the “fill = factor(MONTH)” code into the aesthetic function in the first line (note, shape 21 can accept a fill, not all shapes can), shape 21 is then specified in the geom_point() function and scale_fill_manual() is used to dictate the fill colour.

Clams <- read.delim("C:/Users/Joe/OneDrive/Documents/R practice/ZUUR/ZuurDataMixedModelling/Clams.txt")
ggplot(Clams,aes(LENGTH,AFD,fill=factor(MONTH)))+
  geom_point(shape=21, alpha =0.6)+
  scale_fill_manual(name="Month",values = c("hotpink","yellow", "navyblue", "grey60","red", "black"))+
  labs(title = "Unmodified length vs AFD", subtitle = "looks to not be linear", x="Length")

Do the colours in the above graph suggest anything about different variances between months? Certainly a tight group at the start. The textbook then switches to log values for AFD and Length to give a linear relationship.

ggplot(Clams,aes(LNLENGTH,LNAFD,fill=factor(MONTH)))+
  geom_point(shape=21, alpha =0.6)+
  scale_fill_manual(name="Month",values = c("hotpink","yellow", "navyblue", "grey60","red", "black"))+
  labs(title = "Unmodified length vs AFD", subtitle = "looks to not be linear", x="Ln Length", y="Ln AFD")

The below code generates a linear model based on logged values of length and AFD as well as using month as a factor, this is effectively a linear regression with groups. The coplot does the same job as the facet wrap functiion below in separating the plots by month. The function drop1() then removes month as a factor and compares that simpler model to the original model with groups, you can see in the output that the residual sum of squares is lower for the model (includes month as a factor) and that has been calculated as signficantly so with the F test specified in the drop1() function.

Clams$LNAFD <- log(Clams$AFD)
Clams$LNLENGTH <- log(Clams$LENGTH)
Clams$fMONTH <- factor(Clams$MONTH)
library(lattice) 
coplot(LNAFD ∼ LNLENGTH | fMONTH, data = Clams)

M1 <- lm(LNAFD ∼ LNLENGTH * fMONTH, data = Clams)
drop1(M1,test = "F")
## Single term deletions
## 
## Model:
## LNAFD ~ LNLENGTH * fMONTH
##                 Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>                       6.3592 -1622.3                  
## LNLENGTH:fMONTH  5   0.22558 6.5848 -1618.5  2.7385 0.01906 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The graphing can also be run in ggplot using the facet wrap function. And another way to calculate the interaction term between Lnlength and month previously down with the drop1() function is to use the anova() function to create an ANOVA output table from the M1 linear model.

ggplot(Clams,aes(LNLENGTH,LNAFD))+
  geom_point()+
  facet_wrap(~fMONTH)

M1 <- lm(LNAFD ∼ LNLENGTH * fMONTH, data = Clams)
anova(M1)
## Analysis of Variance Table
## 
## Response: LNAFD
##                  Df Sum Sq Mean Sq    F value  Pr(>F)    
## LNLENGTH          1 478.31  478.31 29033.2481 < 2e-16 ***
## fMONTH            5   2.15    0.43    26.0500 < 2e-16 ***
## LNLENGTH:fMONTH   5   0.23    0.05     2.7385 0.01906 *  
## Residuals       386   6.36    0.02                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The below code generates the residual plots, the first plot is generated using the plot(M1) code, the M1 model was created in the first code chunk on clams. All the code after plot(M1….) actually just removes three plots that the plot function when applied to a linear model generates, one of those includes a qqplot. But that first plot is plotting the residuls against the fitted or predicted values from the model for every given value of x (log clam length), this is looking at the uniformity of the varience around the fitted model for every value of x. Then there is a historgram of residuals, residuals plotted against the log length (x values) in a scatter plot, and the last is a boxplot.

p <- par(mfrow = c(2, 2), mar = c(5, 4, 1, 2)) 
plot(M1, add.smooth = FALSE, which = 1) 
E <- resid(M1) 
hist(E, xlab = "Residuals", main = "") 
plot(Clams$LNLENGTH, E, xlab = "Log(Length)", ylab = "Residuals") 
plot(Clams$fMONTH, E, xlab = "Month", ylab = "Residuals") 

Below are all the plots generated by Plot(M1)

plot(M1)

Below is code to generate qqnorm and residuals against dependant variable plots

qqnorm(E)

plot(Clams$LNLENGTH,E)

The above plots suggest that there might be heteroscedasticy according to month. The box plot highlights this quight well, and the residuals vs fitted values also shows a widening of the residuals after -3 log AFD.

The fitted vs residual plot can be replicated in ggplot by using the M2 object in the ggplot() function as the sole argument and specifying x and y as .fitted and .resid. To do that though

ggplot(M1)+geom_point(aes(x=.fitted, y=.resid), shape =21)+
  geom_hline(yintercept=0, linetype="dashed")+labs(title="ggplot fitted values vs residual values")

We can look at histograms of the residuals for each month to see if they behave the same, and use the bartlett test to test the homoscedasticy

To generate the histograms per month I had to add the residuals to the clams df with the cbind function. For this mark up I left the code for the cbind() but not the output as its massive.

Examining the residuals per month

ggplot(Clams,aes(E))+geom_histogram()+facet_wrap(~MONTH)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#looking at the bartlett test as a decider of the homoscadasticy
bartlett.test(E,Clams$fMONTH)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  E and Clams$fMONTH
## Bartlett's K-squared = 34.286, df = 5, p-value = 2.089e-06

##Moby’s Teeth

Text books residual plots using leverage which I don’t understand

M2 <- lm(X15N ∼ Age, subset = (TN$Tooth == "Moby"), data = TN)
op <- par(mfrow = c(2, 2))
plot(M2, add.smooth = FALSE)

par(op)

The top left plot suggests that there is a violation of independence. This plot can be replicated in ggplot by using the M2 object in the ggplot() function as the sole argument and specifying x and y as .fitted and .resid.

ggplot(M2)+geom_point(aes(x=.fitted, y=.resid), shape=21)+
  geom_hline(yintercept = 0, linetype="dashed")+
  labs(title = "Residuals plotted against fitted values based om M2 model", x="Fitted Values", y="Residuals")

The below plot uses the filter() function (dplyr) to filter the data to just the Moby teeth within the ggplot code. This is a bit more efficient than the text books code which generates two new vectors first.

N.Moby <- TN$X15N[TN$Tooth == "Moby"]
Age.Moby <- TN$Age[TN$Tooth == "Moby"]

plot(y = N.Moby, x = Age.Moby, xlab = "Estimated age Moby", ylab ="N Moby")
abline(M2)

ggplot(filter(TN,Tooth=="Moby"),aes(Age,X15N))+geom_smooth(se=F, method = lm, colour="black")+geom_point(shape=21)+labs(title="ggplot")+theme_classic()

summary(M2)
## 
## Call:
## lm(formula = X15N ~ Age, data = TN, subset = (TN$Tooth == "Moby"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07102 -0.28706  0.04346  0.33820  1.13724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.748940   0.163559   71.83   <2e-16 ***
## Age          0.113794   0.006186   18.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4859 on 40 degrees of freedom
## Multiple R-squared:  0.8943, Adjusted R-squared:  0.8917 
## F-statistic: 338.4 on 1 and 40 DF,  p-value: < 2.2e-16
anova(M2)
## Analysis of Variance Table
## 
## Response: X15N
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Age        1 79.902  79.902  338.43 < 2.2e-16 ***
## Residuals 40  9.444   0.236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The mean square of the residuals is calculated by Residual (or Error) sum of squares/degrees of freedom
Residual_Mean_sq<-9.4444/40;Residual_Mean_sq
## [1] 0.23611
#residual standard error is also called Root mean square error, stanard error and standard error of the estimate
#It is calculated by square rooting the mean square error
Residual_standard_error<-sqrt(0.236);Residual_standard_error
## [1] 0.4857983

The standard erroe is the average distance the data points fall from the regression line in the units of the dependant variable (X15N in this example).

##Nereis

 Nereis$fbiomass <- factor(Nereis$biomass)
Nereis$fnutrient <- factor(Nereis$nutrient)
M3 <- lm(concentration ∼ fbiomass * fnutrient, data = Nereis)
drop1(M3, test = "F")
## Single term deletions
## 
## Model:
## concentration ~ fbiomass * fnutrient
##                    Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>                          13.630 -23.746                    
## fbiomass:fnutrient  8    11.553 25.183 -12.121  3.1785 0.009899 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
op <- par(mfrow = c(1, 2))
plot(resid(M3) ∼ Nereis$fbiomass, xlab = "Biomass", ylab = "Residuals")
plot(resid(M3) ∼ Nereis$fnutrient, xlab = "Nutrient", ylab = "Residuals")

par(op)

anova(M3)
## Analysis of Variance Table
## 
## Response: concentration
##                    Df  Sum Sq Mean Sq F value    Pr(>F)    
## fbiomass            4  4.2509  1.0627  2.3390  0.077837 .  
## fnutrient           2 19.0662  9.5331 20.9820 1.996e-06 ***
## fbiomass:fnutrient  8 11.5530  1.4441  3.1785  0.009899 ** 
## Residuals          30 13.6304  0.4543                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above plots have been replicated in ggplot2, however ggplot requires data to be in a df rather than as vectors so the residuals had to be added to the Nereis df, below I used the residual() function on M3 followed by cbind() to add that vector to the Nereis df. This looks to work but if you run the code chunk more than once the cbind() function keeps adding columns to the Nereis df and then the ggplot code won’t work.

R<-residuals(M3)
Nereis<-cbind(Nereis,R)
ggplot(Nereis,aes(factor(biomass),R))+
geom_boxplot(outlier.shape = NA)+geom_jitter(width=0.2)+
  labs(title="ggplot of Biomass residuals", x="Biomass", y="Residuals")

ggplot(Nereis,aes(factor(nutrient),R))+
geom_boxplot(outlier.shape = NA)+geom_jitter(width=0.2)+
  labs(title="ggplot of Biomass residuals", x="Nutrient", y="Residuals")