A short introduction to R (for ecologists)
Tutorial by Benjamin Blonder (bblonder@email.arizona.edu)
In this tutorial we will learn basic data analysis procedures via an exploration of leaf economics traits: how do leaves use resources like carbon and nitrogen, and how does this vary across biomes and phylogeny?

This tutorial is not a statistics course, nor is it a comprehensive guide to R. It assumes you have some background with statistics (i.e. you know what analysis you want to do) and hopes that an immersive experience with R will be a useful way to learn. Not every code example I provide is fully annotated - think about how it might work, and experiment to resolve your questions!


You will need to have copies of glopnet_final.csv, shrinkage.csv and shrinkage.tree to use this tutorial. These files should be located in your R working directory (see below).

The GLOPNET data for this tutorial are modified from Wright et al. - The worldwide leaf economics spectrum (Nature 2004) and represent a compilation of leaf trait measurements (carbon assimilation rate, nitrogen content, life span, respiration rate, etc.) on several thousand species from hundreds of globally-distributed sites. All other data are my own unpublished measurements and come from Colorado and an elevational gradient in Costa Rica.

You may be tempted to copy and paste code from this tutorial into R. If you do, the line breaks may be lost when pasting, and any multi-line code segment will then generate an error. It is better to paste the code into a new script document within R (see the File menu), restore the line breaks manually, and then run the code from there.

Basic syntax and arithmetic
Preliminaries
To get help on any function, type its name preceded by a ?.
?sum
To search for help on a topic, type the topic preceded by ??. Executing ?? can take a long time to run; don't panic if you're waiting for it to find help.
??column
Lines in an R file can be either comments or code. Comment lines begin with # and are not interpreted by R.
# 3 + 4
Arithmetic
Try some simple numerical operations
6+3
cos(pi)
exp(1)
sqrt(16)
Variables
We can assign the results of operations to variables
x <- 17-38.5
x
And we can make sequences
y <- 1:10
y
We can add numbers to vectors
x+y
And concatenate vectors together
z <- c(x,y)
z
We can also work with binary variables
a <- c(TRUE, TRUE, FALSE)
a
Creating and removing objects
Let's remove that object
rm(a)
a
We can also create data frames, which store multiple columns of data of multiple types including categorical data (factors):
df <- data.frame(species=c("fish","frog","duck"), bodymass=c(2.5,0.5,10), color=factor(c("red","red","blue")))
df
Sometimes it is useful to create repeated or empty variables for later usage:
empty <- rep(3, 10)
empty
emptymatrix <- matrix(NA, nrow=10, ncol=3)
emptymatrix
We can also coerce variables between different formats. Some functions will require data to be stored as a numeric array, vector, matrix, character array, factor, list, data frame, etc... in general, as.type will convert to the correct type for you.
?as
df$color
as.numeric(df$color)
as.character(df$color)
as.matrix(df)
as.list(df)
And finally, some functions will either require data as row-vectors (1 x n) or column vectors (n x 1). This is annoying but conversion is easy with the transpose operator t.
?t
t(df)
Control structures (if, for, while)
You may often want to perform a task repeatedly or iterate through some sets of items. You can use loops for this purpose. First, a for loop runs one time for each item in its argument:
for (i in 1:10)
{
    print(10*i)
}
for (i in c("a","b","c"))
{
    print(toupper(i))
}
Second, a while loop runs whilst its argument is true:
done = FALSE
while (!done)
{
    # choose a random number between 0 and 1
    randomnumber <- runif(1, min=0, max=1)
    print(randomnumber)
    done = (randomnumber > 0.9) # stop when our number is bigger than 0.9
}
Third, we can use if/else statements to control scripts:
for (i in 1:10)
{
    print(i)
    if (i %% 2 == 0) # if i is divisible by two
    {
        print("Even number")
    }
    else
    {
        print("Odd number")
    }
}
If you would like prettier printing with more formatting options, look at ?sprintf.
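For example, a minimal sprintf call (the format string here is just for illustration):
sprintf("The square root of 2 is approximately %.3f", sqrt(2))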
Data input/output
Reading data
Let's get started with some real analyses. First, we need to set a working directory, which tells R where to find data files. Make sure to select the folder that encloses your files, not the file you want to open. You can press Command-D (Mac) or use File Menu > Change Dir (Windows) to choose a folder with a standard graphical interface, or use the 'setwd' command from the command line.
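For example (the folder path below is hypothetical - substitute the folder that contains your tutorial files):
setwd("~/Documents/R_tutorial") # hypothetical path
getwd() # confirm the current working directory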
You should have a datafile in the tutorial folder called 'glopnet_final.csv'. The easiest way to read files into R is to have them already stored in CSV format. If your data are in Excel you can Save As... CSV. Comma-separated-value files are platform independent and will never become obsolete, as opposed to the often-changing proprietary Excel format. Excel also often puts weird formatting-related things in datafiles that can get in the way of analysis.
Let's first learn how to read in data. The 'read.csv' command will be useful. Look to see how we should use this command. The ? operator will give you help on any function.
?read.csv
You should have settled on something like the command below. We are reading in the file and assigning its contents to a new object called, conveniently, 'glopnet'. The 'header' argument tells R that our file has a header row whose names will be used as column labels for the resulting object.
glopnet <- read.csv('glopnet_final.csv', header=TRUE)
Quick looks at data
Let's see what we read in
glopnet
That last command printed the entire, very large, data set - let's get some simple summary information instead. What are the names of each column?
names(glopnet)
We can also access some simple summary statistics about the data. 'summary' is a command which will work on many different types of objects, doing different tasks depending on the type of the object.
summary(glopnet)
We can also look 'under the hood' and see the actual structure of the objects, if we need to know if data are continuous or categorical, for example
str(glopnet)
Writing files
Outputting data from R back to a CSV file is also very simple. Let's say we wanted to output a complete alphabetized species list for the data set:
uniquespecieslist <- sort(unique(paste(glopnet$Genus, glopnet$Species)))
?write.table
write.table(uniquespecieslist, file='specieslist.csv', quote=FALSE, sep=",", row.names=F, col.names=F)
And there should now be a new file called specieslist.csv in your working directory!
Descriptive statistics and data manipulation
Categorical variables (factors)
Notice that variables like Genus, Biome, Species... are all called factors. This means they are categorical variables. R will automatically assign categorical status to your data, but it can sometimes make mistakes. For example, suppose BIOME were actually not a factor - let's force R to do that temporarily:
glopnet$BIOME <- as.character(glopnet$BIOME)
str(glopnet)
Now we can convert it back:
glopnet$BIOME <- as.factor(glopnet$BIOME)
Take a look at ?factor and ?levels for more information about this. You can define ordered and unordered factors for any set of levels. Checking levels is always good - you may have had a typo in your datasheet that will add an extra level to your analysis and reduce power!
levels(glopnet$BIOME)
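As a small illustration using made-up data (not the GLOPNET dataset), you can also build an ordered factor and check its levels:
sizes <- factor(c("small","large","medium","small"), levels=c("small","medium","large"), ordered=TRUE)
sizes
levels(sizes)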
Summary statistics
Let's try learning some simple things about our data - what, for example, is the average log.LMA (log10 leaf mass per leaf area)? Here we need the $ operator to access a column of the data frame. We'll use the mean command to get an average.
mean(glopnet$log.LMA)
That didn't work, because some rows of the dataset have missing values for log.LMA. We can find them and return the rows of the data that have NA values. The which command returns the indices of the TRUE entries of a vector.
is.na(glopnet$log.LMA)
which(is.na(glopnet$log.LMA))
Let's calculate the mean on only the non-NA values of log.LMA now, by first getting a subset of the data. We use [] to access different rows and columns of the dataset. The ! operator means NOT (i.e. it converts TRUE to FALSE and FALSE to TRUE).
glopnetlma_not_na <- glopnet$log.LMA[!is.na(glopnet$log.LMA)]
mean(glopnetlma_not_na)
Now we have an answer - but there is an easier way too. Read the help for the mean command, and you'll find that it can automatically remove missing values.
mean(glopnet$log.LMA, na.rm=TRUE)
We can also get standard deviation, median, etc. the same way
sd(glopnet$log.LMA, na.rm=TRUE)
median(glopnet$log.LMA, na.rm=T)
Counting data
Maybe more broadly we are curious about the coverage of the dataset - does it include as many deciduous as evergreen species? We can count these using the table function:
table(glopnet$Decid.E.green)
How many rows does our dataset have?
nrow(glopnet)
We can also count the number of items in each individual column/row vector using 'length':
length(glopnet$log.LMA)
Subsetting data and analyzing each subset automatically
This dataset has a column (BIOME) that assigns a categorical variable to each point. Suppose we wanted to find the mean log.LMA in each biome. Here we apply the 'mean' function to log.LMA for each unique value of BIOME.
lma_by_biome <- tapply(glopnet$log.LMA, glopnet$BIOME, mean, na.rm=T)
Suppose we only wanted to study oak (Quercus) species in temperate forests. Let's subset the data to find only these points. First, what is the temperate forest biome called?
levels(glopnet$BIOME)
Now, let's use the == (equality test) operator along with the & (logical and) operator to get the subset
temperateoaks <- subset(glopnet, Genus == "Quercus" & BIOME == "TEMP_FOR")
temperateoaks
These temperate oaks can have either deciduous or evergreen leaves. Let's look at the LMA of the evergreen ones using the [] and $ operators together. First we will choose all rows that have evergreen leaves using the [,] operator (leaving a blank before or after the comma means take all values: [x,] selects the rows satisfying x and keeps all columns, while [,x] selects the columns x and keeps all rows).
temperateoaks[temperateoaks$Decid.E.green == "E",]$log.LMA
We can also do more complicated things to different categories within a dataframe. Suppose we wanted to calculate the ratio of deciduous to evergreen leaf lifespans within each biome. This is not too hard - we can use the 'by' command and then write a custom function that will return the right ratio. We can define a new function just like defining a variable:
?by
ratio_of_ll_means <- function(x)
{
    decid_ll <- subset(x, Decid.E.green == "D")$log.LL
    evergreen_ll <- subset(x, Decid.E.green == "E")$log.LL
    ratio_of_means <- mean(decid_ll, na.rm=T) / mean(evergreen_ll, na.rm=T)
    return(data.frame(rom=ratio_of_means))
}
Now we will use the 'by' command to apply this function to each unique BIOME of the data. We then have to call 'unlist' to transform the result from a list (another data type) to a data frame (much easier to work with)
decid_evergreen_lls <- unlist(by(glopnet, glopnet$BIOME, ratio_of_ll_means))
Now let's sort the output in decreasing order to see which biome has the highest ratio of leaf lifespans:
sort(decid_evergreen_lls, decreasing=TRUE)
So in tropical forests leaf lifespans are most equivalent between deciduous and evergreen species!
Finding parts of the dataset
Suppose we wanted something simpler, like the name of the species with the highest value of leaf lifespan:
highestll <- glopnet[which.max(glopnet$log.LL),]
That returned the entire row - let's output the full name of the species and the actual lifespan (un-logged) of that species' leaves:
paste(highestll$Genus, highestll$Species, sep=" ")
10^(highestll$log.LL)
288 months - 24 years is really long! Check it out at http://www.esu.edu/~milewski/intro_biol_two/lab_3_seed_plts/Araucaria.html
Let's try using the [,] and $ operators more. These are very useful for subsetting and manipulating your data!
glopnet[11,4] # the 11th row, 4th column
glopnet[11,"Genus"]
These two commands should give the same output.
All rows in the Genus column:
glopnet[,"Genus"]
Use a logical test to select rows, and take two columns of the result
glopnet[glopnet$Angio == FALSE,c("Genus","Species")]
Only get a single column this time
glopnet$Genus[glopnet$Angio == FALSE]
Sorting data frames
Sometimes it is useful to be able to sort a dataframe by multiple columns. Let's sort by biome, then by decreasing LL. We use the 'order' command to generate a list of indices that we then use as row IDs:
sortedglopnet <- glopnet[order(glopnet$BIOME, glopnet$log.LL, decreasing=TRUE),]
sortedglopnet
A lot of output, but you can see it's ordered first by BIOME, then by LL!
Random samples and bootstrapping
We can also get a random subset of rows - here, choose 10 random values from the integers ranging from 1 to nrow(glopnet)
randomrows <- sample(1:nrow(glopnet), 10)
randomrows
Try running these two previous commands again - you should get different values this time.
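If you want the random draws to be reproducible, you can set the random number seed first (the seed value here is arbitrary):
set.seed(1) # any fixed integer makes the subsequent random draws repeatable
sample(1:nrow(glopnet), 10)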
We can get the mean log leaf lifespan from a random subset of the data:
mean(glopnet[randomrows,"log.LL"], na.rm=T)
Now suppose we wanted to see how variable the data are (i.e. bootstrap/resampling). Let's write a loop that will repeatedly get a new random subset from the data:
numRandomTries <- 100
llrandom <- rep(NA, numRandomTries) # initialize an empty vector that can store numRandomTries results
for (i in 1:numRandomTries) # loop through numRandomTries times, setting i=1, i=2, i=3... each time in the loop
{
    randomrows <- sample(1:nrow(glopnet), 10)
    llrandom[i] <- mean(glopnet[randomrows,"log.LL"], na.rm=T) # store the mean value in the i'th position of llrandom
}
llrandom # print all the stored values
sd(llrandom, na.rm=T) # shows how variable our random subsets are
You can of course write a function that will do many more complex things to random subsets of the data - and also choose stratified random subsets, etc. Loops are very powerful! 'while' and 'if' are also good control structures to use.
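For example, a minimal sketch of a stratified resample - here taking five random rows within each biome (this assumes every biome contains at least five rows):
stratified_means <- by(glopnet, glopnet$BIOME, function(x)
{
    randomrows <- sample(1:nrow(x), 5) # five random rows within this biome
    mean(x[randomrows, "log.LL"], na.rm=TRUE)
})
stratified_means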
Inferential statistics and regression
Tests of significance
Is there a significant difference in leaf lifespan between deciduous and evergreen species? First, let's get the right data:
sp_decid <- subset(glopnet, Decid.E.green == "D")$log.LL
sp_everg <- subset(glopnet, Decid.E.green == "E")$log.LL
Now let's try a two-sided t-test at the alpha = 0.05 level for no difference in means:
?t.test
t.test(sp_decid, sp_everg, alternative="two.sided", conf.level=0.95)
Of course we really should have tested to see if the data met the assumptions of a t test, the most important of which is normality. We can test normality using a Shapiro-Wilk test:
shapiro.test(sp_decid)
shapiro.test(sp_everg)
Not looking so good!
Let's do a Mann-Whitney (Wilcoxon rank sum) non-parametric test instead, which doesn't assume the data are normally distributed:
wilcox.test(sp_decid, sp_everg, alternative="two.sided",conf.level = 0.95)
And the difference is still significant!
We can also test for correlations between variables. For example, perhaps leaf lifespan is significantly correlated with leaf nitrogen content?
ll_n_test <- cor.test(glopnet$log.LL, glopnet$log.Nmass)
ll_n_test
We can also directly access different parts of the test result - for example, the p value:
str(ll_n_test)
ll_n_test$p.value
You can find documentation on many more of these tests by googling. We're going to focus instead on linear models, which encompass many of these tests and also include things such as ANOVA, ANCOVA, MANOVA.
Linear models (most important section)
The most important command to learn is lm, which generates linear models. These models are assumed to represent data points with normally distributed errors, homoscedastic variances, independent data points, random sampling... etc. Your data may not match these assumptions (for example, count data (e.g. number of marmots seen today) will probably have Poisson-distributed errors, and presence/absence data may require a logistic regression) and you may have multiple non-independent points in your dataset. But don't worry! R can handle these things too (scroll down). Let's just explore the simple cases first.
Let's see if there is a relationship between photosynthetic rate (log.Aarea) and nitrogen content (log.Nmass). This might be expected because photosynthesis depends on the Rubisco protein, which is very nitrogen-rich.
m_an <- lm(log.Aarea ~ log.Nmass, data=glopnet)
Notice how we specify a formula with the dependent variable on the left of the ~ and the independent variable on the right.
m_an
This prints the coefficients of the regression
summary(m_an)
A much more useful output. Looks like a significant positive relationship!
summary(m_an)$r.squared
Try str(summary(m_an)) to see what else we can extract from the model summary.
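For example, the full coefficient table (estimates, standard errors, t statistics, and p values) is stored in the summary object:
summary(m_an)$coefficients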
It's very important to verify that the data meet all the assumptions of a linear model. We can check residuals, leverage, etc:
plot(m_an)
This will generate several plots you can step through with the mouse click or the return key on the console. Probably the simplest diagnostic is the normal quantile-quantile plot - the quantiles of your data should match the quantiles of a scaled and translated normal distribution, generating a straight line. If you don't want to see all the linear model diagnostics, just try
qqnorm(glopnet$log.Nmass)
which looks pretty good.
This last model included by default an intercept term (in the language of linear models, Y ~ B0 + B1 X). You might not want an intercept term, so try instead adding '-1' to the formula:
m_an_noint <- lm(log.Aarea ~ log.Nmass - 1, data=glopnet)
summary(m_an_noint)
More fun things about linear models
Creating a linear model object lets us access a number of useful additional quantities (here using the m_an model from above):
The model coefficients
coef(m_an)
The residuals at each data point
resid(m_an)
Confidence intervals for the coefficients
confint(m_an)
The effects
effects(m_an)
Models with interaction and nonlinear terms
We can also specify much more complex formulas. You should know the I(), :, and * notation. * adds both linear and interaction terms. : adds only interaction terms. I() adds terms 'as-is', evaluated numerically before the linear model is fit (for example, adding a quadratic term to the model). Let's try a few:
m_quadratic <- lm(log.Aarea ~ log.Nmass + I(log.Nmass^2), data=glopnet)
summary(m_quadratic)
You can try removing the I from the above model - notice it won't keep a quadratic term for you!
Now let's consider the potential of a three-way interaction between Nmass, LMA, and LL. We can specify a very full model that includes all these terms as such:
m_threeway <- lm(log.Aarea ~ log.Nmass*log.LMA*log.LL, data=glopnet)
summary(m_threeway)
Notice how some of the interaction terms in the result are not individually significant. This model is probably too complex to be realistic. It looks like the Nmass:LMA interaction is real though, so let's try a simpler model with additive terms and just one interaction:
m_oneint <- lm(log.Aarea ~ log.Nmass + log.LMA + log.LL + log.LMA:log.LL, data=glopnet)
summary(m_oneint)
Now even this interaction term is not significant. Perhaps our very complex model misled us.
ANOVAs
We can also perform an ANOVA to look at sums of squares and see which of the interactions and terms are overall significant in a model:
anova(m_threeway)
This approach is especially useful if we are using categorical variables. In R, you can treat categorical variables just like continuous ones. Let's see if there is an interaction between BIOME and leaf nitrogen that influences Aarea:
m_biome <- lm(log.Aarea ~ log.Nmass*BIOME, data=glopnet)
anova(m_biome)
Yes, a significant interaction was found!
You do need to be careful about something here - this is type I ANOVA that works by sequentially adding terms to look at the effect on the residual sum of squares. So the order in which you specify the variables will potentially impact your conclusions.
You need to think about type II or type III ANOVA if you don't think this is appropriate. Unbalanced designs will also require other types of ANOVAs. Fortunately this is possible, but we need to bring in some extra help. You will want to install the extra 'car' package for R by going up to the Packages & Data Menu, and selecting Package Installer. There are hundreds of packages which extend R's functionality. All of these are freely available, and are documented to varying extents. Some work better than others - google before using! 'car' is very well-respected. Once you have done the one-time download of the package, you need to load it into this R session:
library(car)
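As an aside, the one-time package installation can also be done from the command line rather than the menu:
install.packages("car") # only needs to be run once per computer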
We will now use the Anova command in 'car' (note the new capitalization) to try a type III sum of squares:
Anova(m_threeway, type="III")
Compare this to anova(m_threeway) - notice the different results! You can try ?Anova to understand more about what is going on.
You may also want to explore the 'aov' function for performing analyses of variance. However, this requires a balanced data set - I recommend using 'lme' instead (see later). 'lm' can also handle contrasts if you are into those.
We can also compare between models using the standard 'anova' function (not Anova in 'car')
anova(m_oneint,m_threeway,test="Chisq")
There is a very tricky thing going on here - how do we decide how to compare models? You can do an F test, a Chi-square test, or use Mallows' Cp. Think carefully about what you want.
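For example, to use an F test instead for the same pair of models:
anova(m_oneint, m_threeway, test="F")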
Model selection
Probably a better way to compare models is to use the Akaike Information Criterion (a penalized log-likelihood). For a given dependent variable and dataset, we can assess which set of predictors (i.e. which model) is best. Lower AIC values are better, but there is no absolute scale - AIC only allows comparison between candidate models.
allaics <- AIC(m_an, m_quadratic, m_threeway, m_biome, m_oneint)
allaics
We can also programmatically find the minimum AIC (i.e. the best model)
allaics[which.min(allaics$AIC),]
We can also programmatically find a good model using AIC-based model selection, stepwise deleting terms from a model until one is found where no further deletions improve the AIC:
library(MASS) # stepAIC is provided by the MASS package
m_final <- stepAIC(lm(log.Aarea ~ log.Nmass*log.LMA*log.LL*BIOME, data=glopnet))
summary(m_final)
which shows the 'best' model by an AIC criterion
And it's also easy to use a linear model for future prediction. Suppose we wanted to make a prediction at some new data points. We need to define a new data frame with all of the required predictors:
newvals <- data.frame(log.Nmass=c(0.01, 0.05, 0.1), log.LMA=c(50, 100, 75), log.LL=c(10, 30, 20))
predict.lm(m_oneint, newdata=newvals)
Robust regression
Ordinary least-squares may not always be the most appropriate type of regression. While OLS estimators are maximum likelihood estimators under many conditions, they are not ideal under all conditions. For example, OLS is very sensitive to outlying points, so that minimizing squared residuals may not work well for smaller high-variance samples. In this case, you may want to instead use robust regression in the MASS package (you will need to install it).
library(MASS)
m_lm <- lm(log.Amass~log.Nmass, data=subset(glopnet, Angio==FALSE))
m_rlm <- rlm(log.Amass~log.Nmass, data=subset(glopnet, Angio==FALSE))
coef(m_lm)
coef(m_rlm)
Here you can see for the relatively unsampled non-Angiosperm taxa, the robust-regression method estimates a higher intercept and lower slope. This estimate may be more useful if the data have known but unremovable outliers.
Type II regression
OLS regression (Type I regression) is useful if you want to compare the likelihoods of different models, or make predictions. However, if you are more interested in obtaining a true estimate of regression coefficients (e.g. slope) and/or your data have uncertainty in both x and y, you may want to use Type II regression instead. Type II regression minimizes (optionally rescaled) perpendicular, rather than just vertical, residuals. This is called major axis or standardized/reduced major axis regression. Fortunately there is a package for that (you will need to install it first):
library(smatr)
The smatr package doesn't let you enter formulas or use a data= argument, so we have to manually specify the y and x variables. See ?line.cis for details.
m_sma <- line.cis(glopnet$log.Amass, glopnet$log.Nmass, method='SMA')
m_sma
Notice that this method estimates a higher slope than the OLS method did.
We can also do slope/intercept tests on Type II regressions. Suppose we hypothesize that the true slope of the A/N relationship is 1, because photosynthesis is proportional to nitrogen (maybe)
slope.test(glopnet$log.Amass,glopnet$log.Nmass,method='SMA',test.value=1)
We get p=0, so the true slope is not 1!
Linear mixed models (dealing with groups, nesting, and pseudoreplication)
Let's move on to more complex types of linear models. Suppose our data were not all independent points but rather were pseudoreplicated. To deal with this, and more generally with unbalanced designs, we need to start working with mixed models. Let's first load the nlme package and use the lme command for linear mixed models.
library(nlme)
We can now specify fixed and nested random effects. Suppose (wrongly) that we need random effects of species grouped within genus when considering the relationship between Aarea and Nmass:
m_nested <- lme(log.Aarea~log.Nmass, random=~1|Genus/Species, data=glopnet, na.action=na.omit)
summary(m_nested)
You can also specify random slopes, random intercepts at different levels of hierarchy... but the syntax gets very complex very quickly, and documentation is poor. You can consider also using the 'lmer' command in the 'lme4' library. Mixed models are controversial and poorly documented in R unfortunately.
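As a rough sketch only (lme4 must be installed first; the random-effect structure here simply mirrors the lme call above):
library(lme4)
m_nested2 <- lmer(log.Aarea ~ log.Nmass + (1|Genus/Species), data=glopnet, na.action=na.omit)
summary(m_nested2)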
Generalized linear models (non-normal error distribution)
Let's instead consider generalized linear models, which allow non-normally distributed errors. For example, suppose we were considering Aarea instead of logAarea - Aarea varies from 0 to infinity, so might be best described as Gamma-distributed. We can account for this quite easily:
AareaUntransformed <- 10^glopnet$log.Aarea
m_gamma <- glm(AareaUntransformed~log.Nmass, family=Gamma, data=glopnet)
summary(m_gamma)
Notice how it calculates a dispersion for the gamma distribution and deals with missing values automatically...
You can learn more about other more complex models (mostly exponential family distributions) with ?family. Notice the options for probit, logit, logistic regression if your data merit these options. GLM is a very powerful way to analyze your data!
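For example, a minimal sketch of a logistic regression - modeling the probability that a species is evergreen as a function of LMA (this assumes Decid.E.green is coded 'D'/'E' as used earlier; the new 'evergreen' column is just a helper for this example):
glopnet$evergreen <- as.numeric(glopnet$Decid.E.green == "E")
m_logistic <- glm(evergreen ~ log.LMA, family=binomial, data=glopnet)
summary(m_logistic)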
Generalized least squares (correlation structure in the data)
Another very useful command is gls in the nlme package, for generalized least squares. This function lets you break the assumption of uncorrelated data - for example, dealing with phylogenetic correlation, temporal autocorrelation, etc. Here's an example of using GLS to account for phylogenetic structure (an analysis equivalent to creating phylogenetic independent contrasts). We have data on leaf area shrinkage (after drying) and LMA for multiple species, at the leaf level. First let's calculate species level means:
lmadata <- read.csv('shrinkage.csv')
# make sure the names are in the same format as our phylogenetic tree (lower case, genus_species)
lmadata$LatinName <- tolower(paste(lmadata$Genus, lmadata$Species, sep="_"))
lma_spmean <- tapply(lmadata$LMA.g.m2, lmadata$LatinName, mean, na.rm=T)
shrinkage_spmean <- tapply(lmadata$Dry.area.cm2 / lmadata$Wet.area.cm2, lmadata$LatinName, mean, na.rm=T)
Now let's read in a phylogenetic tree that contains these species using a phylogenetic analysis package (ape, please install it!)
library(ape)
tree <- read.tree('shrinkage.tree')
# remove polytomies in the tree randomly
tree <- multi2di(tree, random=TRUE)
# trim and order the species means to include only those species in the phylogeny
lma_spmean <- lma_spmean[tree$tip.label]
shrinkage_spmean <- shrinkage_spmean[tree$tip.label]
# run the models
m_phylo <- gls(lma_spmean~shrinkage_spmean, correlation=corBrownian(phy=tree))
m_nophylo <- lm(lma_spmean~shrinkage_spmean)
Let's compare the model with and without phylogenetic independent contrasts (Brownian evolution correlation structure)
summary(m_phylo)
summary(m_nophylo)
In both models, the shrinkage effect is significant, but is slightly stronger (57 vs. 56) when phylogeny is taken into account. So in this case the relationship is probably not strongly affected by evolutionary history. You may also want to use gls if you have spatially or temporally autocorrelated data.
R has packages that can do many other types of modeling (classification/regression trees, generalized additive models, generalized linear mixed models, LOESS, quantile regression, whatever you like). To find packages that do what you want, look on CRAN (http://cran.r-project.org/web/packages/).
Principal components analysis
One last very important function is principal components analysis (PCA). Note that R also has packages for things like non-metric multidimensional scaling, canonical correlation analysis, redundancy analysis... you will again want to look at CRAN for these more complex ordination techniques. But for simple multidimensional descriptive statistics on normally distributed data, PCA is probably your best bet.
?princomp
Suppose we wanted to understand variation in multiple leaf traits like LMA, Aarea, Nmass, and LL. How many axes of variation are there, and what traits correlate with each other? We can specify a one-sided formula (no dependent variable). If our data are not all in the same units we should probably look at correlations rather than covariances between them to assess the PCA axes. Our data are also already log-transformed to meet normality assumptions, but you may need to do this to your data before running a PCA.
leafpca <- princomp(~log.LMA + log.Amass + log.Nmass + log.LL, data=glopnet, cor=TRUE)
summary(leafpca)
This is incredible - the first principal component explains 79.7% of the total variation in the dataset. So all these leaf traits are very correlated with each other. Let's see in which direction we find these correlations:
loadings(leafpca)
So the first, most important axis is dominated by a tradeoff between high LMA, low Amass, low Nmass, and high LL. This is the global leaf economics spectrum right here!
Graphing
Point, line, and scatterplots
R is capable of making publication-quality figures. The problem is that it often takes a lot of work to set up all the margins, sizes, fonts, labels, colors, etc. just as you would like. However it is worth the trouble to learn how!
In this tutorial we are going to cover just the most basic types of graphs. There are many packages out there which can make nearly any graph you like. To find out how, I recommend browsing the R graph gallery (http://addictedtor.free.fr/graphiques/thumbs.php), which shows examples of graph types and demo code for generating them.
Let's first make a new window to plot in. If you don't specify this command before plotting, R will open a default-sized window. But often if you are making figures for publication (e.g. Nature requires figures that are exactly 3.503 inches wide, good luck doing that in Powerpoint) you will appreciate this extra control.
dev.new(width=5, height=5)
Now let's plot the relationship between LMA and LL:
plot(log.LL ~ log.LMA, data=glopnet)
Of course we can have much more control over this plot - let's relabel the axes, add a title, and keep 1:1 aspect ratio, for starters
plot(log.LL ~ log.LMA, data=glopnet,xlab="Log10 LMA",ylab="Log10 LL",main="GLOPNET dataset",asp=1)
We can also label the points based on any criteria we like. Here we will show the Angiosperms as black circles (pch=16) and gymnosperms as open triangles (pch=2). See if you can understand how this bit of code works!
plot(log.LL ~ log.LMA, data=glopnet,xlab="Log10 LMA",ylab="Log10 LL",main="GLOPNET dataset",pch=c(2,16)[glopnet$Angio+1],asp=1)
We can also change point color and size - how about we try angiosperms in red, gymnosperms in blue:
plot(log.LL ~ log.LMA, data=glopnet,xlab="Log10 LMA",ylab="Log10 LL",main="GLOPNET dataset",col=c("blue","red")[glopnet$Angio+1],asp=1)
We can also do more complex coloring - let's instead color the points with a rainbow color map that varies with each species' nitrogen content. Here we will define a rainbow color map (?rainbow) and take different index values from the cut function to get our plotting colors.
ncolors <- 10
plot(log.LL ~ log.LMA, data=glopnet, xlab="Log10 LMA", ylab="Log10 LL", main="GLOPNET dataset", col=rainbow(ncolors)[cut(glopnet$log.Nmass, breaks=ncolors)], asp=1)
As you can see leaves that have higher LMA and LL also have lower Nmass.
Plot legends
We should probably add a legend to the plot. Let's do this for the simpler gymno/angio plot:
plot(log.LL ~ log.LMA, data=glopnet, xlab="Log10 LMA", ylab="Log10 LL", main="GLOPNET dataset", pch=c(2,16)[glopnet$Angio+1], asp=1)
legend("bottomright", c("Gymno","Angio"), pch=c(2,16))
Drawing text and fancy labels
We can also directly put text on the graph. Let's draw a graph with just the gymnosperm points, labeled by their Latin name in a small font and rotated 45 degrees:
notangio <- subset(glopnet, Angio==FALSE)
plot(log.LL ~ log.LMA, data=notangio, col="gray")
text(notangio$log.LMA, notangio$log.LL, paste(notangio$Genus, notangio$Species, sep="\n"), cex=0.5, srt=-45)
Now we can even overlay the angiosperm points on this graph - R lets us continually add to graphs.
points(log.LL ~ log.LMA, data=subset(glopnet,Angio==TRUE),col="red",cex=0.25)
You might also want to investigate the lines and segments functions. segments is good at drawing error bars, for example.
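For example, a minimal sketch of error bars drawn with segments, using biome means plus or minus one standard deviation of log.LL (the plotting choices here are just illustrative):
ll_mean <- tapply(glopnet$log.LL, glopnet$BIOME, mean, na.rm=TRUE)
ll_sd <- tapply(glopnet$log.LL, glopnet$BIOME, sd, na.rm=TRUE)
plot(1:length(ll_mean), ll_mean, pch=16, xlab="Biome (index)", ylab="Log10 LL", ylim=range(c(ll_mean-ll_sd, ll_mean+ll_sd), na.rm=TRUE))
segments(1:length(ll_mean), ll_mean-ll_sd, 1:length(ll_mean), ll_mean+ll_sd) # vertical error bars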
We can also put fancier Greek symbols and subscripts/superscripts on the graph - look at ?plotmath for instructions. For example, let's subscript the Log10 in the axes:
plot(log.LL ~ log.LMA, data=glopnet,xlab=expression(paste("Log"[10]," LMA")),ylab=expression(paste("Log"[10]," LL")),main="GLOPNET dataset",col=c("blue","red")[glopnet$Angio+1],asp=1)
Multi-panel figures and margin control
We can also put multiple figures on one plot, and control margins and other such options. Try looking at ?par:
dev.new(width=6, height=6)
par(mfrow=c(2,2))
plot(log.LL ~ log.LMA, data=glopnet)
plot(log.LL ~ log.Nmass, data=glopnet)
plot(log.Amass ~ log.Nmass, data=glopnet)
plot(log.Amass ~ log.LMA, data=glopnet)
title("GLOPNET", outer=TRUE, line=-1)
If you ever want to control figure margins, font size, font face, etc... look in par first!
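For example, to shrink the margins around a single plot (the values below are just illustrative):
par(mar=c(4,4,1,1)) # margins in lines of text: bottom, left, top, right
plot(log.LL ~ log.LMA, data=glopnet)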
Boxplots (never bar plots)
Another very useful type of plot is the boxplot, which will show the distribution of your data by a categorical variable. You may be tempted to show a bar plot, but bar plots don't show uncertainty ranges - and if you don't know an uncertainty range, you might not want to be making a graph yet anyway!
dev.new()
par(cex=0.5) # reduce font size to fit all the labels
boxplot(log.LL ~ BIOME, data=glopnet)
We can also easily add multiple combinations of categorical variables - here all the combinations of C3/C4 vs Angio/notAngio
dev.new()
par(cex=0.5)
boxplot(log.LL ~ C3C4*Angio, data=glopnet)
Pair correlation plots (for data exploration)
What if we want to see lots of correlations between lots of variables at once? Fortunately this is a built-in function:
pairs(~log.LL + log.LMA + log.Amass + log.Nmass + BIOME, data=glopnet)
With this function you can assess data distribution, correlations, and so on - very useful for initial exploratory work.
There is a more useful version of this function in the 'car' library:
library(car) # in case you haven't loaded it yet
scatterplotMatrix(~log.LL + log.LMA + log.Amass + log.Nmass + BIOME, data=glopnet)
Here we also see histograms of data distributions and linear fits. You can explore the 'car' library for many more nice plotting functions. ggplot2 and lattice also have some nice plotting functionality if you want to explore them. I will instead show you how to generate this functionality using base functions.
Drawing regression lines over data on plots
Maybe we want to draw a line of best-fit through the data. Suppose we are interested in the relationship between Amass and Nmass:
plot(log.Amass~log.Nmass, data=glopnet)
# generate a linear model
m_linear <- lm(log.Amass~log.Nmass, data=glopnet)
Let's compare to the plot we would get from non-log-transformed data:
qqnorm(10^glopnet$log.Nmass)
This plot is very curvy, indicating a data transformation (here, obviously, a log) is necessary.
We can now draw the fit on the graph with a double line width
abline(coef(m_linear),col="red",lwd=2)
Strictly speaking we might not want to draw this line, because the regression is not valid at points beyond the range of the predictors. So let's draw a better line by first generating a sequence of evenly spaced points in the x-range, then predicting their y-values:
xvals <- seq(min(glopnet$log.Nmass, na.rm=T), max(glopnet$log.Nmass, na.rm=T), length.out=100)
yvals <- predict(m_linear, data.frame(log.Nmass=xvals))
lines(xvals, yvals, col="blue", lwd=4)
Nonlinear fitting and graphical display
Generating x values and predicting y-values works for nonlinear models too, whereas abline only works for linear models. Suppose we try to fit a shifted-scaled square-root (k0*sqrt(x-k1)) model to the data. This uses the nonlinear least-squares command, which can also be useful for other data-fitting procedures.
m_nonlinear <- nls(log.Amass~k0*sqrt(log.Nmass-k1), data=glopnet, start=list(k0=4, k1=-0.5))
yvalsnls <- predict(m_nonlinear, data.frame(log.Nmass=xvals))
lines(xvals, yvalsnls, col="green", lwd=8)
Now you can really see the curvature!
Local smoothing
Let's finally add a locally-smoothed line to the graph using LOESS. This can be useful for very complex datasets for which you don't have an a priori model to fit.
ml <- loess(log.Amass~log.Nmass, data=glopnet)
lines(xvals, predict(ml, xvals), col="orange", lwd=8)
Histograms and kernel density estimates
We can also make histograms. Suppose we wanted to visualize the distribution of LMA. We can draw a histogram using the hist function. You can choose the number of 'breaks' or bins in the histogram, as well as do many fancier plots too - look at the help!
hist(glopnet$log.LMA, breaks=50)
Histograms are rarely a good thing to display - the story they tell depends very heavily on the number of bins/breaks the user chooses. A better option is a kernel density estimate:
plot(density(glopnet$log.LMA,na.rm=TRUE))
You can choose smoothing bandwidths to generate a better looking density estimate, but the default options are usually quite good.
Let's superimpose the distributions of the angios and gymnos on the whole plot:
lines(density(glopnet$log.LMA[glopnet$Angio==TRUE], na.rm=T), col="red")
lines(density(glopnet$log.LMA[glopnet$Angio==FALSE], na.rm=T), col="blue")
And now for a nice trick, let's graph the LMA distributions by each biome type - in three lines of code!
plot(density(glopnet$log.LMA, na.rm=TRUE), type="n", ylim=c(0,3), main="Distribution of LMA by biome", xlab="Log10 LMA", ylab="Frequency")
by(glopnet, glopnet$BIOME, function(x) {
    lines(density(x$log.LMA, na.rm=T), lwd=2, col=rainbow(length(levels(glopnet$BIOME)))[as.numeric(x$BIOME)])
})
legend("topright", levels(glopnet$BIOME), col=rainbow(length(levels(glopnet$BIOME))), lwd=2)
Displaying principal component analyses with biplots
It's also often useful to visualize the results of a principal components analysis to assess pairwise axes of variation. First set dimension names so the biplot looks nice:
dimnames(glopnet)[[1]] <- paste(glopnet$Genus, glopnet$Species, dimnames(glopnet)[[1]],sep="\n")
Now do the pca and biplot (defaulting to showing PCA1 and PCA2)
leafpca <- princomp(~log.LMA + log.Amass + log.Nmass + log.LL, data=glopnet, cor=TRUE)
biplot(leafpca, col=c("gray","red"), cex=c(0.25,1))
Phylogenetic analysis and tree plotting
Last up - phylogenetic analyses. If you haven't already loaded it for an earlier part of the tutorial, we need the 'ape' and 'picante' packages, which you can install if needed.
library(ape)
library(picante)
tree <- read.tree('shrinkage.tree')
# remove polytomies in the tree randomly - only dichotomous trees work in this package
tree <- multi2di(tree, random=TRUE)
# read in some traits to hang on the tree
lmadata <- read.csv('shrinkage.csv')
# make sure the names are in the same format as our phylogenetic tree (lower case, genus_species)
lmadata$LatinName <- tolower(paste(lmadata$Genus, lmadata$Species, sep="_"))
lma_spmean <- tapply(lmadata$LMA.g.m2, lmadata$LatinName, mean, na.rm=T)
shrinkage_spmean <- tapply(lmadata$Dry.area.cm2 / lmadata$Wet.area.cm2, lmadata$LatinName, mean, na.rm=T)
# for these packages the traits need to be in the same order as the tree's tip labels -
# we can order the data (assuming that the two data objects share the same names,
# which we cleverly ensured they did using the tolower/paste syntax) as below:
lma_spmean <- lma_spmean[tree$tip.label]
shrinkage_spmean <- shrinkage_spmean[tree$tip.label]
Now plot species-mean LMA on the tree, using a 'fan' layout
plot.phylo(tree, cex=0.4, type="f", no.margin=TRUE, underscore=FALSE, label.offset=5)
tiplabels(pch=16, col=rainbow(50)[cut(lma_spmean, 50)])
Phylogenetic signal
We can also test for phylogenetic signal in the data using Blomberg's K statistic (you can also find tests for Pagel's lambda, but it seems to be less used these days):
# this requires that the values in the trait be in the same order as the tips of the tree
# (tree$tip.label) - but you will see above we have already ordered the data like this
phylosignal(as.vector(lma_spmean), tree)
Don't worry about the warning. You can see here we get K=0.45, so this trait evolves significantly more rapidly (P = 0.001) than expected under a Brownian evolution model.
Phylogenetic independent contrasts
You can also calculate phylogenetic independent contrasts if you don't like using GLS with correlational structure (see above).
lmapic <- pic(lma_spmean, tree)
shrinkagepic <- pic(shrinkage_spmean, tree)
Now we can look for a phylogeny-independent relationship between LMA and shrinkage using regression through the origin:
m_pic <- lm(lmapic~shrinkagepic - 1)
summary(m_pic)
A relationship still exists (but is weak).

Test your understanding

Additional packages
Here are some suggestions on other packages you may want to investigate. These are all available on the Comprehensive R Archive Network (CRAN) (which can be accessed from the Package Installer within R).
Next steps
To really make the most of R, you will want to learn how to
Where to find help
Additional resources include
Last modified August 15, 2011.