## R Notes

These are notes I took from various sources, including the Coursera Data Analysis course and R’s online help (opened with `help.start()`).

# Basics

R objects have attributes, which can be inspected with the `attributes()` function.

`<-` is the assignment operator.

`:` is used to create integer sequences. `1:4` gives `1 2 3 4`.

The `c` (concatenate) function creates vectors from different kinds of objects. `c(TRUE, FALSE)` creates a logical vector. `c(1+3i, 4+8i, 3-5i)` creates a complex vector. Type coercion happens if different kinds of objects are mixed.

The `as.*` functions (`as.numeric`, `as.character`, ...) can be used to coerce between kinds of data. `as.numeric(TRUE)` returns 1.
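As a small sketch of these coercion rules (the variable names are made up for illustration):

```r
# Mixing types in c() coerces everything to the most general type
mixed_num <- c(1.5, TRUE)      # logical -> numeric
mixed_chr <- c(1.5, "a")       # numeric -> character

class(mixed_num)   # "numeric"
class(mixed_chr)   # "character"

# Explicit coercion with the as.* family
as.numeric(TRUE)         # 1
as.character(1:3)        # "1" "2" "3"
as.logical(c(0, 1, 2))   # FALSE TRUE TRUE
```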

`matrix(ncol = 3, nrow = 4)` creates a matrix with 4 rows and 3 columns (filled with `NA`).

`cbind` and `rbind` are other ways to create a matrix from vectors, binding them as columns or rows.

Factors are categorical data, like `male/female`. They are created with the `factor` function. The `table` function counts the observations in each level, and `unclass` exposes the underlying integer codes.

The `levels` parameter of the `factor` function determines which integer code each level corresponds to.
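A minimal sketch of how factors, `table`, `unclass` and `levels` fit together, using made-up data:

```r
# A character vector of categories (made-up data for illustration)
sex <- c("male", "female", "female", "male", "female")

# levels fixes which category gets integer code 1, 2, ...
f <- factor(sex, levels = c("male", "female"))

table(f)      # counts per level: male 2, female 3
unclass(f)    # underlying integer codes 1 2 2 1 2, plus a levels attribute
levels(f)     # "male" "female"
```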

The `is.nan` and `is.na` functions check whether vector values are `NaN` or `NA`.
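The two checks are not symmetric, which a short example makes visible (the vector `v` is only for illustration):

```r
v <- c(1, NA, 0/0, 4)   # 0/0 evaluates to NaN

is.na(v)    # FALSE TRUE TRUE FALSE   (NaN also counts as NA)
is.nan(v)   # FALSE FALSE TRUE FALSE  (NA is not NaN)
```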

Data frames (`data.frame`) are used to store tabular data, like matrices. Unlike matrices, they can store a different type of data in each column: the first column can be numeric, the second a factor, the third logical, and so on.

Data frames are usually created by the `read.table` and `read.csv` functions. Each row has a name, which can be accessed with `row.names`. A data frame can be converted to a matrix with `data.matrix`.

The `nrow` and `ncol` functions return the number of rows and columns of a `data.frame`.

The `str` and `summary` functions give concise information about a structure.
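A small sketch that builds a data frame by hand and inspects it with the functions mentioned so far (the column and row names are invented for the example):

```r
# A small data frame with a numeric, a factor and a logical column
df <- data.frame(
  height = c(1.62, 1.80, 1.75),
  sex    = factor(c("female", "male", "male")),
  smoker = c(FALSE, TRUE, FALSE),
  row.names = c("ann", "bob", "carl")
)

nrow(df)         # 3
ncol(df)         # 3
row.names(df)    # "ann" "bob" "carl"
str(df)          # compact description of each column
summary(df)      # per-column summary statistics
m <- data.matrix(df)   # every column converted to numeric
```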

Use `getwd` to report the working directory, and `setwd` to change it.

The `ls` function displays the names of objects in your workspace:

```
> x <- 10
> y <- 50
> z <- c("three", "blind", "mice")
> f <- function(n,p) sqrt(p*(1-p)/n)
> ls()
[1] "f" "x" "y" "z"
```

The `rm` function permanently removes one or more objects from the workspace.

# Scripts

R executes the `.Rprofile` script when it starts. The location of `.Rprofile` depends on your platform: on Linux / Unix, save the file in your home directory as `~/.Rprofile`.

The `source` function instructs R to read a text file and execute its contents:

```
source("myScript.R")
```

From the command line, a script can be run in batch mode with:

```
$ R CMD BATCH /home/jim/psych/adoldrug/partyuse1.R
```

Managing various objects used in R can be challenging. Using the directory structure to sort these objects into sensible categories can be a big help. Instead of starting the R session in a particular directory, you may wish to keep a directory of R scripts and allow these to change the working directory to suit whatever task they perform.

`par(ask=TRUE)` requires you to hit Enter before each plot is displayed.

`readline("Press <Enter> to continue")` presents a prompt and waits for input.

# Vectors

Vectors are created like `v <- c(1.1, 2.2, 3.3)`. Vectors can be used in arithmetic expressions, e.g. `x <- v + 2 * w`. In arithmetic expressions, a shorter vector is *recycled* until it reaches the length of the longer vector.
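To make the recycling rule concrete, here is a tiny sketch with invented values for `v` and `w`:

```r
v <- c(1.1, 2.2, 3.3, 4.4)
w <- c(10, 20)          # shorter vector: used as 10 20 10 20

x <- v + 2 * w          # 21.1 42.2 23.3 44.4
```

R emits a warning when the longer length is not a multiple of the shorter one.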

`range` returns the minimum and maximum elements of a vector.

`sort` sorts a vector in increasing order.

`sqrt(-17)` returns `NaN` (with a warning), but `sqrt(-17+0i)` returns a complex result.

Regular sequences are generated by the `:` operator. `4:10` returns `4 5 6 7 8 9 10`. It is a shorthand for the simplest case of the `seq` function, which also accepts step size (`by`) and length (`length.out`) parameters.

The `rep` function repeats the supplied elements to create a vector.
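A few illustrative calls to `seq` and `rep` (the result names are arbitrary):

```r
4:10                                # 4 5 6 7 8 9 10
s1 <- seq(4, 10, by = 2)            # 4 6 8 10
s2 <- seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00

r1 <- rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"
r2 <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```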

# Arrays

If `z` is a vector with 1500 elements (e.g. `z <- 1:1500`), then `dim(z) <- c(3, 5, 100)` makes it a 3D array with dimensions 3 × 5 × 100.

Another way to create an array is `x <- array(1:20, dim=c(4,5))`.
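The following sketch shows how the values end up arranged; R fills arrays with the first index varying fastest:

```r
z <- 1:1500
dim(z) <- c(3, 5, 100)   # reshape in place

dim(z)        # 3 5 100
z[2, 1, 1]    # 2  (first index varies fastest)
z[1, 2, 1]    # 4

x <- array(1:20, dim = c(4, 5))
x[4, 5]       # 20
```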

# Matrices

Two matrices `A` and `B` can be multiplied with `A %*% B`.

A linear system of the form `b <- A %*% x` can be solved for `x` with `solve(A, b)`.
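A minimal round trip, assuming a small invertible matrix chosen just for this example:

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
x_true <- c(1, 2)

b <- A %*% x_true      # right-hand side: (4, 7)
x <- solve(A, b)       # recovers x_true as a 2 x 1 matrix
```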

# Lists

A list can be created with the `list` function. List elements don’t have to be of the same type. They can be anything from characters to vectors.

```
> mylist <- list(name="Fred", no.children=3, child.ages=c(4, 7,
9))
```

Components can be accessed by index, like `mylist[[1]]`, or by component name, like `mylist$no.children`. `mylist[["no.children"]]` (with the name quoted) can also be used.

Lists are a lot like structs.

The `c` function can be used to concatenate lists.
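Putting the access forms side by side, with one invented extra component (`city`) to show concatenation:

```r
mylist <- list(name = "Fred", no.children = 3, child.ages = c(4, 7, 9))

mylist[[1]]                 # "Fred"
mylist$no.children          # 3
mylist[["child.ages"]][2]   # 7  (the name must be quoted inside [[ ]])

# c() concatenates lists; 'city' is a made-up component
longer <- c(mylist, list(city = "unknown"))
names(longer)   # "name" "no.children" "child.ages" "city"
```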

# Arbitrary Functions

An arbitrary function (similar to a lambda) can be created like `f <- function(x, y) x + y`.
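As a short illustration, a function can be bound to a name or passed inline without one (`squares` is just an example name):

```r
# Named function
f <- function(x, y) x + y
f(2, 3)                                   # 5

# Anonymous function passed directly to sapply
squares <- sapply(1:4, function(n) n^2)   # 1 4 9 16
```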

# Statistical Functions for Discrete Distributions

`library(distrEx)` provides the functions `E`, `var` and `sd`, which calculate the mean, variance and standard deviation of a distribution.

Uniform random events can be simulated with the `sample` function. It has three main parameters. The first determines the set of values to draw from, the second (`size`) determines the number of draws, and the third (`replace`) sets whether we replace the balls we have drawn.

1000 dice: `sample(6, size=1000, replace=TRUE)`

50 random numbers from 1000 to 2000: `sample(1000:2000, size=50, replace=TRUE)`

Flip a fair coin 100 times: `sample(c("H", "T"), size=100, replace=TRUE)`

# Reading and Writing Data

`read.table` and `read.csv` read tabular data from text files.

`readLines` reads lines of text.

`source` and `dget` read R code files.

`load` and `unserialize` are used to read binary objects.

`dump` and `dput` are the inverses of `source` and `dget`. They include the object's metadata in the output.
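A hedged sketch of a `dput` / `dget` round trip; the object and the temporary file are invented for the demo:

```r
obj <- data.frame(a = 1:3, b = c("x", "y", "z"))

path <- tempfile(fileext = ".R")   # temporary file, just for the demo
dput(obj, file = path)             # write an R-code representation
restored <- dget(path)             # read it back

all.equal(obj, restored)           # compare original and restored object
```

Because the file contains R code rather than raw values, column types and attributes travel with the data.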

`file` is used to open files. `gzfile` opens gzipped files, and `bzfile` opens bzip2 files. The `url` function opens a connection to a web page.

`read.table` is the most important function for inputting data. Its main arguments:

`file` is the name of the file or *connection*.

`header` indicates whether the input has a header line.

`sep` indicates the separator (comma, tab, etc.).

`colClasses` is a vector of column data types. Specifying it can make reading about twice as fast.

`nrows` is the number of rows in the dataset.

`comment.char` indicates the comment character.

`skip` is the number of lines to skip from the beginning.

`read.csv` is the same as `read.table`, except that the default separator is a comma and `header` defaults to `TRUE`.

```
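# Read a small sample first to discover the column classes
initial <- read.table("datatable.txt", nrows = 100)
classes <- sapply(initial, class)
# Then read the full file with the classes supplied up front
tabAll <- read.table("datatable.txt", colClasses = classes)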
```
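A self-contained variant of the same idea can be sketched by first writing a small demo file. The file, its columns and the `header = TRUE` setting are assumptions made only for this example:

```r
# Write a small demo file so the example is self-contained
path <- tempfile(fileext = ".txt")
demo <- data.frame(id = 1:200, value = seq(0.5, 100, by = 0.5))
write.table(demo, path, row.names = FALSE)

# Read only the first rows to learn the column classes...
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)    # id: "integer", value: "numeric"

# ...then read the whole file with the classes given up front
tabAll <- read.table(path, header = TRUE, colClasses = classes)
```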

# Plotting

`plot(x, y)` plots the values in `x` against the corresponding values in `y`.

Additional parameters to `plot` can be used to configure the visual settings.

Use the `density` function to approximate the sample density. Use `lines` to draw the approximation:

```
hist(x, prob=T)
lines(density(x))
```

# Installing R packages

## Method 1: Install from source

Download the add-on R package, say mypkg, and type the following command in a Unix console to install it to `/my/own/R-packages/`:

```
$ R CMD INSTALL mypkg -l /my/own/R-packages/
```

## Method 2: Install from CRAN directly

Type the following command in the R console to install it to `/my/own/R-packages/` directly from CRAN:

```
> install.packages("mypkg", lib="/my/own/R-packages/")
```

## Load the library

Type the following command in the R console to load the package:

```
> library("mypkg", lib.loc="/my/own/R-packages/")
```

# Statistics

## `~` notation for relations between variables

R has a special notation for describing relationships between variables. Suppose that you are assuming a linear model for a variable $y$, predicted from the variables $x_1, x_2, \dots, x_n$. (Statisticians usually refer to $y$ as the dependent variable, and $x_1, x_2, \dots, x_n$ as the independent variables.) In equation form, this implies a relationship like:

$$y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$$

In R, you would write the relationship as `y ~ x1 + x2 + ... + xn`, which is a formula object.
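A formula is an ordinary R object that can be stored and inspected before any model is fitted. A minimal sketch, with placeholder variable names:

```r
f <- y ~ x1 + x2      # nothing is evaluated yet; y, x1, x2 need not exist

class(f)       # "formula"
all.vars(f)    # "y" "x1" "x2"
```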

# Working with data

## Creating Data Frame

```
> points <- data.frame(label=c("Low", "Mid", "High"),
lbound=c(0, 0.67, 1.64),
ubound=c(0.674, 1.64, 2.33))
```

## The print function

Lets you vary the number of printed digits using the digits parameter:

```
> print(pi, digits=4)
```

## The cat function

The cat function does not give you direct control over formatting. Instead, use the format function to format your numbers before calling cat:

```
> cat(format(pi, digits=4), "\n")
```

## The `list.files` function

Shows the contents of your working directory:

```
> list.files()
```

## The `write.csv` function

Writes a CSV file:

```
> write.csv(x, file="filename", row.names=FALSE)
```

## Factor analysis

Factor analysis is available in R through the function `factanal` in the stats package:

```
factanal(x, factors, data = NULL, covmat = NULL, n.obs = NA,
         subset, na.action, start = NULL,
         scores = c("none", "regression", "Bartlett"),
         rotation = "varimax", control = NULL, ...)
```

## PCA

Principal components analysis breaks a set of (possibly correlated) variables into a set of uncorrelated variables. In R, principal components analysis is available through the function `prcomp` in the stats package.
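As a rough sketch, `prcomp` can be tried on the built-in `USArrests` data set. The data set and the scaling choice are assumptions for the example, not part of the notes:

```r
# Built-in USArrests data; scale. = TRUE standardizes the variables first
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)            # proportion of variance explained per component
head(pca$x)             # the data expressed in principal-component coordinates
round(cor(pca$x), 10)   # off-diagonal correlations are zero: uncorrelated components
```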

## distributions in R

| Distribution | R name | Parameters |
|---|---|---|
| Binomial | binom | size = number of trials; prob = probability of success for one trial |
| Geometric | geom | prob = probability of success for one trial |
| Hypergeometric | hyper | m = number of white balls in urn; n = number of black balls in urn; k = number of balls drawn from urn |
| Negative binomial (NegBinomial) | nbinom | size = number of successful trials; either prob = probability of successful trial or mu = mean |
| Poisson | pois | lambda = mean |
| Beta | beta | shape1; shape2 |
| Cauchy | cauchy | location; scale |
| Chi-squared (Chisquare) | chisq | df = degrees of freedom |
| Exponential | exp | rate |
| F | f | df1 and df2 = degrees of freedom |
| Gamma | gamma | shape; either rate or scale |
| Log-normal (Lognormal) | lnorm | meanlog = mean on logarithmic scale; sdlog = standard deviation on logarithmic scale |
| Logistic | logis | location; scale |
| Normal | norm | mean; sd = standard deviation |
| Student’s t (TDist) | t | df = degrees of freedom |
| Uniform | unif | min = lower limit; max = upper limit |
| Weibull | weibull | shape; scale |
| Wilcoxon | wilcox | m = number of observations in first sample; n = number of observations in second sample |
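Each R name listed above is combined with a prefix: `d` for the density or probability mass, `p` for the cumulative distribution, `q` for quantiles and `r` for random draws. A few illustrative calls (the particular parameter values are arbitrary):

```r
dbinom(3, size = 10, prob = 0.5)   # P(X = 3) for 10 fair coin flips
pnorm(1.96)                        # P(Z <= 1.96), about 0.975
qnorm(0.975)                       # quantile, about 1.96
set.seed(1)                        # arbitrary seed for reproducibility
draws <- rpois(5, lambda = 2)      # five random Poisson(2) values
```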

## combination calculation

A common problem in computing probabilities of discrete variables is counting combinations: the number of distinct subsets of size $k$ that can be created from $n$ items. The number is given by $\frac{n!}{k!(n - k)!}$. However, it’s much more convenient to use the `choose` function, especially as $n$ and $k$ grow larger:

```
> choose(5,3) # How many ways can we select 3 items from 5 items?
[1] 10
> choose(50,3) # How many ways can we select 3 items from 50 items?
[1] 19600
> choose(50,30) # How many ways can we select 30 items from 50 items?
[1] 4.712921e+13
```

These numbers are also known as binomial coefficients.

## generating combinations

When you want to generate all combinations of $n$ items taken $k$ at a time, use the `combn` function:

```
> combn(items, k)
```
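A concrete instance, with a made-up `items` vector, shows that `combn` returns one combination per column:

```r
items <- c("a", "b", "c", "d")
pairs <- combn(items, 2)   # one combination per column

ncol(pairs)                # 6, which equals choose(4, 2)
pairs[, 1]                 # "a" "b"
```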

## selecting $n$ items from a vector

The sample function will randomly select $n$ items from a vector:

```
> sample(vec, n)
```

## dotplot() of lattice

The `dotplot()` function in `library(lattice)` is useful for displaying labeled quantitative values.

## measuring intelligence

The classic examples come from the social sciences. Suppose that you wanted to measure intelligence. It’s not possible to directly measure an abstract concept like intelligence, but it is possible to measure performance on different tests. You could use factor analysis to analyze a set of test scores (the observed values) to try to determine intelligence (the hidden value).

## correlation

Correlation measures range between −1 and 1. A value of 1 means that one variable is a (positive) linear function of the other. 0 means the two variables aren’t correlated at all. −1 means that one variable is a negative linear function of the other (the two move in completely opposite directions).
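The two extremes can be reproduced with the `cor` function on constructed data (the vectors here are invented for illustration):

```r
x <- 1:10

cor(x,  2 * x + 1)   #  1: exact positive linear relationship
cor(x, -3 * x + 5)   # -1: exact negative linear relationship

set.seed(42)          # arbitrary seed
noise <- rnorm(10)
cor(x, noise)         # strictly between -1 and 1
```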

## bootstrapping

Would we get a similar result if we were to omit a few points? What is the range of values for the statistic? It is possible to answer these questions for an arbitrary statistic using a technique called bootstrapping.

Formally, bootstrap resampling is a technique for estimating the bias of an estimator. An estimator is a statistic calculated from a data sample that provides an estimate of a true underlying value, often a mean, standard deviation, or a hidden parameter. Bootstrapping works by repeatedly selecting random observations from a data sample (with replacement) and recalculating the statistic. In R, you can use bootstrap resampling through the boot function in the boot package.
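The resampling idea can be sketched in base R without the boot package. This is a hand-rolled illustration on simulated data, not the `boot` function's interface, and the choice of the median as the statistic is an assumption:

```r
set.seed(123)                                   # arbitrary seed
data_sample <- rnorm(50, mean = 10, sd = 2)     # simulated data

# Resample with replacement 1000 times and recompute the median each time
boot_medians <- replicate(1000, median(sample(data_sample, replace = TRUE)))

bias_est <- mean(boot_medians) - median(data_sample)  # bootstrap bias estimate
quantile(boot_medians, c(0.025, 0.975))               # rough range of the statistic
```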

## paste

The paste function allows you to concatenate multiple character vectors into a single vector.
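A few illustrative calls, showing the `sep` and `collapse` arguments:

```r
p1 <- paste("a", "b")                         # "a b"  (default separator is a space)
p2 <- paste("x", 1:3, sep = "")               # "x1" "x2" "x3"
p3 <- paste(c("a", "b", "c"), collapse = "-") # "a-b-c"
```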

## chi-squared distribution

The chi-squared distribution allows for statistical tests of categorical data. Among these tests are those for goodness of fit and independence.

## chi-square test

The chi-square test for homogeneity performs an analysis similar to the chi-square test for independence. For each cell it computes an expected count and then compares this to the observed frequency.
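In R, both variants can be run with `chisq.test` on a table of counts. A minimal sketch with a made-up 2 x 2 table:

```r
# Made-up 2 x 2 contingency table of counts
counts <- matrix(c(30, 10,
                   20, 40), nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))

result <- chisq.test(counts)
result$expected   # expected count per cell under the null hypothesis
result$p.value    # a small p-value suggests the rows differ
```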

## plotting the regression line

To plot the regression line, make a plot of the data and then add a line with the `abline` command:

```
> plot(x,y)
> abline(lm.result)
```

## coefficients of regression

To access the coefficients, use the `coef` function, which returns a vector of coefficients.

```
> coef(lm.result)
(Intercept) x
210.0484584 -0.7977266
```
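The `lm.result` object used in these snippets is never defined in the notes; presumably it is the value returned by `lm`. A minimal sketch under that assumption, with simulated data:

```r
set.seed(1)                              # arbitrary seed
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)     # simulated data: intercept 3, slope 2

lm.result <- lm(y ~ x)    # fit the simple linear regression
coef(lm.result)           # named vector with (Intercept) and x
```

The same object can then be passed to `abline` after `plot(x, y)`.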

## anova

The method called analysis of variance (ANOVA) allows one to compare means for more than two independent samples, for example to test whether there is a difference between control and treatment groups.
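As an illustration, a one-way ANOVA can be run with `aov` on the built-in `PlantGrowth` data, which has one control and two treatment groups. The choice of data set is an assumption for the example:

```r
# Built-in PlantGrowth data: plant weight for a control and two treatment groups
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)     # F test of equal group means
```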

## regression analysis

Regression analysis is used for explaining or modeling the relationship between a single variable $y$, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, $x_1, \dots, x_p$. When $p=1$, it is called simple regression, but when $p>1$ it is called multiple regression or sometimes multivariate regression. When there is more than one $y$, it is called multivariate multiple regression.

## i.i.d

independent and identically distributed (i.i.d.)

## linear models

Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. The term linear is often used in everyday speech as almost a synonym for simplicity. This gives the casual observer the impression that linear models can only handle small simple datasets. This is far from the truth—linear models can easily be expanded and modified to handle complex datasets. Linear is also used to refer to straight lines, but linear models can be curved.
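The claim that linear models can be curved can be sketched with a quadratic term: the model stays linear in its coefficients even though the fitted curve is a parabola. The data below are simulated and the true coefficients are arbitrary:

```r
set.seed(2)                                          # arbitrary seed
x <- seq(-3, 3, length.out = 50)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(50, sd = 0.3)     # simulated curved data

# Still a linear model: linear in the coefficients, not in x
fit <- lm(y ~ x + I(x^2))
coef(fit)    # estimates close to 1, 0.5 and 2
```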

## nonlinear models

Truly nonlinear models are rarely absolutely necessary and most often arise from a theory about the relationships between the variables, rather than an empirical investigation.

## failing to reject the null hypothesis

A failure to reject the null hypothesis is not the end of the game—you must still investigate the possibility of nonlinear transformations of the variables and of outliers which may obscure the relationship. Even then, you may just have insufficient data to demonstrate a real effect, which is why we must be careful to say “fail to reject” the null rather than “accept” the null. It would be a mistake to conclude that no real relationship exists.

## lurking variable Z that affects two seemingly related variables

The first objection is that there may be some lurking variable $Z$ that is the real driving force behind $y$ that also happens to be associated with $x_1$. Once $Z$ is accounted for, there may be no relationship between $x_1$ and $y$. Unfortunately, we can usually never be certain that such a $Z$ does not exist.

## Statistical inference is based on the assumption that

The data must be independent and identically distributed, that is, multinomial with some specified probability distribution. If these assumptions are satisfied, then the $\chi^2$ statistic is approximately $\chi^2$ distributed with $n - 1$ degrees of freedom.

## using null hypothesis for independence testing

The same statistic can also be used to study if two rows in a contingency table are “independent”. That is, the null hypothesis is that the rows are independent and the alternative hypothesis is that they are not independent.