# Statistics and R

### From BiodivBorneo09

We will learn/revise statistical basics while learning R.

Core steps:

- Command line, rather than menu-driven: a scripting language: you will be programmers!
- R objects, sub-objects and indices
- Data: variation (one-dimensional)
- (Graphics)
- Histograms
- Two-dimensional data
- Populations and samples
- Random samples, stratification, pseudoreplication
- Tests, statistical significance. E.g., `the probability that we sampled two populations that were identical and obtained samples that were as different as these two are is less than 5%'
- One dimensional variation, parametric/non-parametric
- 2D: correlation
- 2D: Linear models (regression)
- Treatments, factors and levels
- Multiple factors in a linear model

R-script from Cam:

# basic objects: a0 <- 1 a0 a1 <- c(0,1,2) a1 b1 <- c("a", "b", "c", "d", "e") b1 # concatenating vectors: new <- c(a0, a1) new # column bind: a2 <- cbind( c(0,1,2), c(10,11,12) ) a2 # vector indices: a1 a1[2] # matrix indices: a2 a2[2,2] # row bind: a3 <- rbind( c(0,1,2), c(10,11,12) ) a3 # select a row: a3[1, ] # sequences: 1:3 1:10 -111:-211 hundred <- 111:211 hundred hundred[45:52] # list objects: ls() # drop members: a1[-2] hundred[-32:-54] hundred[c(-32, -42, -52)] drop <- c(-32, -42, -52) hundred[drop] # logical tests: b1 b1[1] == "c" b1[3] == "c" b1 == "c" hundred > 155 # subsetting: b1[b1 == "c"] x <- cbind( c("a", "b", "c", "d", "c"), c(123, 43, 52, 110, 21)) x[,1] == "c" x[ x[,1] == "c", ] x[ x[,1] == "c", 2] # data frames x <- data.frame( c("a", "b", "c", "d", "c"), c(123, 43, 52, 110, 21)) colnames(x) <- c("species", "diam") x x$diam x[,2] x$species

We entered sample data into a spreadsheet. We discussed why spreadsheets are dangerous as databases, but by setting a range you can partially protect the data when sorting. Save as CSV. Check in a text editor:

"diam","moisture","light" 50,1,"hi" 51,1,"hi" 60,1,"lo" 61,1,"lo" 70,2,"hi" 72,2,"hi" 80,2,"lo" 81,2,"lo" 100,3,"hi" 101,3,"hi" 120,3,"lo" 124,3,"lo"

Change the working directory of R to match the place you save the file.

# reading in data: mydata <- read.table("bb09test.csv", header=T, sep=",") mydata # variation: rnorm(n=30, mean=1, sd=1) r <- rnorm(n=30, mean=1, sd=1) mean(r) sd(r) mean( mydata$diam ) sd( mydata$diam ) sd( rnorm(n=30, mean=1, sd=1) ) sd( rnorm(n=30, mean=1, sd=1) ) sd( rnorm(n=30, mean=1, sd=1) ) # graphics: demo(graphics) # histogram: hist(r) hist( mydata$diam ) # 2D data rx <- rnorm(n=30, mean=1, sd=1) ry <- rnorm(n=30, mean=1, sd=1) plot( rx, ry ) # test extracted parameters against population mean t.test(rx , mu = 1) t.test(rx , mu = 1.3) # compare two samples: rx <- rnorm(n=30, mean=1, sd=0.5) ry <- rnorm(n=30, mean=1.4, sd=0.5) hist( rx ) hist( ry ) # multiple plots: par(mfrow=c(2,1)) hist(rx, breaks=10, xlim=c(-1,4), ylim=c(0,15)) hist(ry, breaks=10, xlim=c(-1,4), ylim=c(0,15)) par(mfrow=c(1,1)) # t-test t.test( rx, ry ) ry <- rnorm(n=30, mean=1.5, sd=1) t.test( rx, ry ) # non-parametric (rank based), for non-normal distributions wilcox.test(rx, ry ) plot( rx, ry ) # correlated data diam <- rnorm(n=30, mean=1, sd=1) fecund <- (diam * 0.5) + rnorm(n=30, mean=0.5, sd=0.5) plot( diam , fecund ) # correlation test cor.test(diam, fecund) # R-squared: cor.test(diam, fecund)$estimate^2 # linear model lm <- lm(fecund ~ diam) summary(lm) abline(lm, col="red") # our data plot(mydata$diam ~ mydata$moisture) cor.test(mydata$diam , mydata$moisture, method="spearman") # linear model: lm <- lm(mydata$diam ~ mydata$moisture) summary(lm) abline(lm) # two factors: lm <- lm(mydata$diam ~ mydata$moisture + mydata$light) summary(lm) # plot the two factors separately: plot(mydata$diam ~ mydata$moisture) plot(mydata$diam ~ mydata$light) # interaction term: lm <- lm(mydata$diam ~ mydata$moisture + mydata$light + mydata$light:mydata$moisture) summary(lm)