Statistics and R

From BiodivBorneo09

Jump to: navigation, search

We will learn/revise statistical basics while learning R.

Core steps:

  • Command line, rather than menu-driven: a scripting language: you will be programmers!
  • R objects, sub-objects and indices
  • Data: variation (one-dimensional)
  • (Graphics)
  • Histograms
  • Two-dimensional data
  • Populations and samples
  • Random samples, stratification, pseudoreplication
  • Tests, statistical significance. E.g., `the probability that we sampled two populations that were identical and obtained samples that were as different as these two are is less than 5%'
  • One dimensional variation, parametric/non-parametric
  • 2D: correlation
  • 2D: Linear models (regression)
  • Treatments, factors and levels
  • Multiple factors in a linear model

R-script from Cam:

# basic objects:
a0 <- 1
a0
a1 <- c(0,1,2)
a1
b1 <- c("a", "b", "c", "d", "e")
b1

# concatenating vectors:
new <- c(a0, a1)
new

# column bind:
a2 <- cbind( c(0,1,2), c(10,11,12) )
a2

# vector indices:
a1
a1[2]

# matrix indices:
a2
a2[2,2]

# row bind:
a3 <- rbind( c(0,1,2), c(10,11,12) )
a3

# select a row:
a3[1, ]

# sequences:
1:3
1:10
-111:-211
hundred <- 111:211
hundred
hundred[45:52]

# list objects:
ls()

# drop members:
a1[-2]
hundred[-32:-54]
hundred[c(-32, -42, -52)]
drop <- c(-32, -42, -52)
hundred[drop]

# logical tests:
b1
b1[1] == "c"
b1[3] == "c"
b1 == "c"
hundred > 155

# subsetting:
b1[b1 == "c"]
x <- cbind( c("a", "b", "c", "d", "c"), c(123, 43, 52, 110, 21))
x[,1] == "c"
x[ x[,1] == "c",  ]
x[ x[,1] == "c", 2]

# data frames
x <- data.frame( c("a", "b", "c", "d", "c"), c(123, 43, 52, 110, 21))
colnames(x) <- c("species", "diam")
x
x$diam
x[,2]
x$species

We entered sample data into a spreadsheet. We discussed why spreadsheets are dangerous as databases, but by setting a range you can partially protect the data when sorting. Save as CSV. Check in a text editor:

"diam","moisture","light"
50,1,"hi"
51,1,"hi"
60,1,"lo"
61,1,"lo"
70,2,"hi"
72,2,"hi"
80,2,"lo"
81,2,"lo"
100,3,"hi"
101,3,"hi"
120,3,"lo"
124,3,"lo"

Change the working directory of R to match the place you save the file.

# reading in data:
mydata <- read.table("bb09test.csv", header=T, sep=",")
mydata

# variation:
rnorm(n=30, mean=1, sd=1)
r <- rnorm(n=30, mean=1, sd=1)
mean(r)
sd(r)
mean( mydata$diam )
sd( mydata$diam )
sd( rnorm(n=30, mean=1, sd=1) )
sd( rnorm(n=30, mean=1, sd=1) )
sd( rnorm(n=30, mean=1, sd=1) )

# graphics:
demo(graphics)

# histogram:
hist(r)
hist( mydata$diam )

# 2D data
rx <- rnorm(n=30, mean=1, sd=1)
ry <- rnorm(n=30, mean=1, sd=1)
plot( rx, ry )

# test extracted parameters against population mean
t.test(rx , mu = 1)
t.test(rx , mu = 1.3)

# compare two samples:
rx <- rnorm(n=30, mean=1, sd=0.5)
ry <- rnorm(n=30, mean=1.4, sd=0.5)
hist( rx )
hist( ry )

# multiple plots:
par(mfrow=c(2,1))
hist(rx, breaks=10, xlim=c(-1,4), ylim=c(0,15))
hist(ry, breaks=10, xlim=c(-1,4), ylim=c(0,15))
par(mfrow=c(1,1))

# t-test
t.test( rx, ry )
ry <- rnorm(n=30, mean=1.5, sd=1)
t.test( rx, ry )

# non-parametric (rank based), for non-normal distributions
wilcox.test(rx, ry )
plot( rx, ry )

# correlated data
diam <- rnorm(n=30, mean=1, sd=1)
fecund <- (diam * 0.5) + rnorm(n=30, mean=0.5, sd=0.5)
plot( diam , fecund )

# correlation test
cor.test(diam, fecund)

# R-squared:
cor.test(diam, fecund)$estimate^2

# linear model
lm <- lm(fecund ~ diam)
summary(lm)
abline(lm, col="red")

# our data
plot(mydata$diam ~ mydata$moisture)
cor.test(mydata$diam , mydata$moisture, method="spearman")

# linear model:
lm <- lm(mydata$diam ~ mydata$moisture)
summary(lm)
abline(lm)

# two factors:
lm <- lm(mydata$diam ~ mydata$moisture + mydata$light)
summary(lm)

# plot the two factors separately:
plot(mydata$diam ~ mydata$moisture)
plot(mydata$diam ~ mydata$light)

# interaction term:
lm <- lm(mydata$diam ~ mydata$moisture + mydata$light + mydata$light:mydata$moisture)
summary(lm)