The R programming language

What is programming?

Programming is problem solving: breaking down an implicit, indefinite problem statement into explicit, well-defined steps to solve it.

Breaking down into what? Programming has a few basic building blocks that exist in most languages. The ones covered here are variables, functions, data types, and conditions.

Every fancy computery thing boils down to some assortment of these components, no exceptions. The programming bit is figuring out which ones to use (and how) to accomplish some task or solve some problem.

Variables

Whenever we're working with things, we need a place to put stuff. If you're writing notes, you need some paper to write them on. If you're doing it digitally, you need a digital document.

Likewise, when we do stuff with a computer, we need a place to put stuff.
In R, that place is a variable: a named slot in your environment that holds a value.

Let's look at the RStudio interface:

The RStudio user interface

So now that we know where things are, let's write some R: how do we define a variable?

Defining a variable

It's pretty simple:

  1. Type a name for the variable (use underscores instead of spaces!)
  2. Type the assignment operator <- (it assigns the value on its right to the name on its left)
  3. Type the value you want to assign to the variable

And let's see it in practice:

number_one <- 1

If you type this code into the console (and press enter), you will be able to see number_one, with the value 1, in your environment.

What kind of things can I assign to a variable?

For now, we'll only look at literal values, i.e. values typed directly into the code, such as the number 1 or the text "hello".

Super important note: if you want to write text in your code, you must surround it with quotes!
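
As a small illustration of the note above, here is a sketch that assigns a few literal values to variables. The variable names are made up for this example.

```r
# A number literal
number_one <- 1

# A text literal: note the surrounding quotes
greeting <- "Hello, world"

# A Boolean literal: written in capitals and without quotes
is_raining <- FALSE

# Typing a variable's name on its own line prints its value
greeting
```

Without the quotes, R would treat Hello as the name of a variable and report an error if no such variable exists.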

Functions (using)

Just defining some stuff in our environment isn't very useful. We want to do stuff with it! That's where functions come in...

A function is a list of operations that is performed "on" or using some data. Usually, it's up to us to "give" the function that data. The values we give it are called parameters (R's own documentation also calls them arguments). A function will usually return a value to us, which we can either output directly, or assign to a variable for use later.

Going back to data input, it would be pretty limiting to assign a single data point to a single variable. So let's assign a whole load of data points to a single variable:

c(1, 2, 3)
[1] 1 2 3

If you type the first line into the console and press enter, you should see the second line as output in the console.

The c() function will create a vector (an ordered collection of values) from the things you pass it as parameters. In this example, we have given it the literal values 1, 2, and 3, and it returns a vector with the elements 1, 2, and 3. What's a vector? We'll talk about that soon... (it's a data type)

Huh? How? Why?

How do you know you're calling a function?

  1. Type the name of the function c
  2. Type opening and closing parentheses ()

How do you "pass parameters" to a function?

  1. Type the name of the function c
  2. Type an opening parenthesis (
  3. Type a parameter
  4. Add a comma , between each subsequent parameter
  5. Type a closing parenthesis )

How do you get output returned from a function?

If the function call is on a line all by itself, the output will pop up in the console (or below the code block, in an .Rmd file). If you want to do something else with it (or just keep it for later), you can assign it to a variable just like before:

one_two_three <- c(1, 2, 3)

This will assign a vector containing the elements 1, 2, and 3 to a variable called "one_two_three" in your environment.
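
To tie the steps above together, here is a short sketch that passes the stored vector to a few of R's built-in functions. The variable names on the left of each <- are chosen just for this illustration.

```r
one_two_three <- c(1, 2, 3)

# Pass the variable to a function as a parameter
how_many <- length(one_two_three)   # number of elements
total <- sum(one_two_three)         # adds the elements together
average <- mean(one_two_three)      # arithmetic mean

# A function's return value can be passed straight into another function,
# and a function can take more than one parameter (separated by commas)
rounded_root <- round(sqrt(total), 2)

how_many       # 3
total          # 6
average        # 2
rounded_root   # 2.45
```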

Data (importing)

Typing in all our data is a bit tedious. It would be great to grab it from somewhere else... oh wait, we can!

The read.csv() function will need the name of a .csv file as a parameter, and will return the contents of it as a dataframe.

all_the_data <- read.csv("some_dataset.csv")

Now we've got a variable, called "all_the_data" in our environment, which is a dataframe containing everything in the "some_dataset.csv" file. It's up to you to make sure that the variable and file names are correct for what you're doing.
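
Here is a minimal sketch of the whole round trip. So that it runs on its own, it first writes a tiny .csv file to a temporary folder; the file contents and column names are invented for illustration, and in practice you would point read.csv() at your own file.

```r
# Create a small example file so this sketch is self-contained
csv_path <- file.path(tempdir(), "some_dataset.csv")
writeLines(c("name,score", "alice,3.5", "bob,4.0"), csv_path)

# Read the file back in as a dataframe
all_the_data <- read.csv(csv_path)

# Inspect what was imported
str(all_the_data)    # column names and types
head(all_the_data)   # the first few rows
nrow(all_the_data)   # number of rows
```

Checking the result with str() or head() straight after importing is a cheap way to confirm the file was read the way you expected.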

Data types

Every variable that we define has a type, according to the type of data that it contains (its value). We've covered a few already, in their literal form, but here's a list with explanation:

Boolean

The simplest data type: it is either TRUE or FALSE. R itself calls this type logical.

Integer

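A type of numeric, it can only be a whole number (positive or negative), such as 1. Note that R stores a plain 1 as a numeric; to get a true integer, add an L suffix and write 1L.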

Numeric

A numeric is just any number, such as 1.5

String

A string is a "string" (sequence) of characters, a.k.a. "text", such as "Here's some text." R itself calls this type character.

Vector

A vector is like a list: it keeps any number of values within it, but they all have to be of the same type, e.g. c(1.0, 1.5, 2.0)

Dataframe

A dataframe is a bit like a spreadsheet, but slightly more limited. Under-the-hood, it is just an assortment of named columns, each of which is a vector.
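
The types above can be checked with the built-in class() function. The sketch below uses made-up variable names; the comments show the name R reports for each type.

```r
class(TRUE)                 # "logical"   (Boolean)
class(1L)                   # "integer"   (the L suffix asks for an integer)
class(1)                    # "numeric"   (a plain 1 is stored as a numeric)
class(1.5)                  # "numeric"
class("Here's some text.")  # "character" (string)

# A vector holds values of one type
some_numbers <- c(1.0, 1.5, 2.0)
class(some_numbers)         # "numeric"
is.vector(some_numbers)     # TRUE

# A dataframe is a set of named columns, each of which is a vector
some_table <- data.frame(id = c(1L, 2L, 3L), value = some_numbers)
class(some_table)           # "data.frame"
class(some_table$value)     # "numeric"
```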

Conditions and scope

Code that only ever does one thing (or sequence of things) is pretty boring, and not very useful. What if we want it to change what it does, depending on the data?

if (...) {

}

Introducing the if statement. It evaluates a Boolean expression, and executes code depending on the result. If the expression evaluates to TRUE, then it will execute the enclosed block of code. If the expression evaluates to FALSE, then the code execution skips over the enclosed block of code.

Expression evaluation

The Boolean expression is implicitly compared to TRUE. In programming languages, the typical Boolean algebra notation (symbols) you may be familiar with is not used. Instead, it is replaced with symbols that are present on a standard keyboard.

Here are the Boolean operators, and comparison operators:

Operation               Algebra   Code
AND                     ∧         &&
OR                      ∨         ||
NOT                     ¬         !
EQUIVALENT              =         ==
NOT EQUIVALENT          ≠         !=
LESS THAN               <         <
LESS THAN OR EQUAL      ≤         <=
GREATER THAN            >         >
GREATER THAN OR EQUAL   ≥         >=
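
To see these operators in action, the sketch below evaluates a few expressions in the console. The variables day and month are just example values.

```r
day <- 29
month <- 2

day == 29                 # TRUE
day != 29                 # FALSE
day >= 28 && month == 2   # TRUE: both sides are TRUE
day < 10 || month == 2    # TRUE: at least one side is TRUE
!(month == 2)             # FALSE: NOT flips the result

# The result of an expression can be stored like any other value
is_leap_day <- day == 29 && month == 2
is_leap_day               # TRUE
```

Note the difference between <- (assign a value) and == (compare two values).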

Block of code

What was that block of code I mentioned? It's everything between the opening bracket { and the closing bracket }.

When lines of code are placed within a block like this, it defines the limits of our code, such as where to jump to if the expression evaluates to FALSE, i.e. the line after the closing bracket }.

How do we write an if statement?

There are three parts to it:

  1. The keyword if
  2. The Boolean expression, in parentheses (...)
  3. The code block to execute if the Boolean expression is TRUE {...}

Here's an example:

if (day == 29) {
	month <- 2
}
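
The example above assumes that day already exists. Here is a runnable sketch that sets it first and shows both outcomes; the starting values and the message_text variable are made up for illustration.

```r
day <- 29
month <- 0   # placeholder value, changed below if the condition holds

if (day == 29) {
	month <- 2
}
month          # 2, because day == 29 was TRUE

day <- 15
message_text <- "unchanged"
if (day == 29) {
	message_text <- "changed"
}
message_text   # "unchanged", because the block was skipped
```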

Functions (defining)

Writing a whole bunch of code is pretty tedious. It'd be great if we could write it to do one thing, and then reuse that code again. Copy / paste? Nah, there's a way better way... functions!

Functions are special blocks of code that can be called from any place you want to execute them. The big benefit comes from being able to change the data that the function uses, through special variables called parameters. These variables only exist within the function's enclosed block of code, and are set each and every time the function is executed.

How do we write a function?

It's a mix of defining a variable and writing an if statement. There are three key parts to it (the keyword, the parameters, and the code block), and five steps overall:

  1. Type a name for the function, just like for a variable
  2. Type the assignment operator <-
  3. Type the keyword function
  4. Define the parameters you want to use, inside parentheses (...)
  5. Define the code that will use these parameters, inside brackets {...}

Let's have an example:

is_it_the_date <- function (day, month) {
	if (day == 29 && month == 2) {
		return(TRUE)
	}
	return(FALSE)
}
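
Once defined, the function is called like any built-in one. The sketch below repeats the definition so that it runs on its own; the leap_day variable is made up for illustration.

```r
is_it_the_date <- function (day, month) {
	if (day == 29 && month == 2) {
		return(TRUE)
	}
	return(FALSE)
}

# Parameters are matched by position...
is_it_the_date(29, 2)                 # TRUE
is_it_the_date(1, 2)                  # FALSE

# ...or by name, in which case the order does not matter
is_it_the_date(month = 2, day = 29)   # TRUE

# The returned value can be assigned to a variable
leap_day <- is_it_the_date(29, 2)
```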

ANOVA

linear_fit <- lm(dependent_variable ~ independent_variable, a_dataframe)
anova(linear_fit)

Before we can do an ANOVA test, we need to fit the data to a linear model, with the lm() function. It will need the variables to compare, and the data to use, as two parameters.

To specify which variables to compare, just use their names from their column in the dataframe. Specify the dependent variable first, on the left side of the tilde ~ operator. All the independent variables go on the right side (with plus signs + in between). The second parameter is the dataframe to be used. You can either just put this in as the second parameter, or explicitly pass it as the data by prefixing it with data=.

Once we've got the linear model fit stored in a variable, use the anova() function with it as a parameter.

Sample output:

            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
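
For a version you can run straight away, here is a sketch that uses PlantGrowth and mtcars, two datasets that ship with R, as stand-ins for your own dataframe. It produces a table with the same layout as the sample output above.

```r
# PlantGrowth: 30 plant weights across 3 treatment groups
linear_fit <- lm(weight ~ group, data = PlantGrowth)
anova_table <- anova(linear_fit)
anova_table

# Individual values can be pulled out of the table by column name
p_value <- anova_table[["Pr(>F)"]][1]
p_value < 0.05   # TRUE

# Several independent variables are joined with +
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
anova(multi_fit)
```

Storing the table in a variable, rather than only printing it, makes it possible to reuse the p-value later in your code.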

Correlation matrix

data_correlation_matrix <- cor(data)

It is often useful to calculate the correlation matrix of some data, which can be done with the cor() function. It only requires one parameter: a dataframe containing the data. Every column in it must be numeric.

Sample output:

           column_1   column_2
column_1  1.0000000 -0.1784702
column_2 -0.1784702  1.0000000
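
As a runnable illustration, the sketch below uses two columns of the built-in mtcars dataset as stand-in data, then shows one way to drop non-numeric columns before calling cor(). The mixed dataframe is invented for this example.

```r
# Stand-in data: two numeric columns from a dataset that ships with R
data <- mtcars[, c("mpg", "wt")]
data_correlation_matrix <- cor(data)
data_correlation_matrix

# cor() stops with an error on text columns, so keep only the numeric ones
mixed <- data.frame(a = c(1, 2, 3), b = c(2, 4, 7), label = c("x", "y", "z"))
numeric_only <- mixed[, sapply(mixed, is.numeric)]
cor(numeric_only)
```

The diagonal is always 1 (each column correlates perfectly with itself), and the matrix is symmetric.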

EFA

library(psych)  # provides fa() and fa.diagram()

data_efa <- fa(data_correlation_matrix, 3, rotate = "oblimin")
data_efa  # print the results
fa.diagram(data_efa, digits = 2)  # draw the factor diagram

To run an exploratory factor analysis (EFA), use the fa() function from the psych package. It requires three parameters: the data correlation matrix, the number of factors to analyse for, and a rotation to apply (if any).

The data correlation matrix can be computed with the cor() function, see above.

The number of factors is the second parameter to pass. You can also prefix it with nfactors=.

The third parameter is the rotation to apply, if any. It must be passed by name, i.e. prefixed by rotate=.

Sample output:

Factor Analysis using method =  minres
Call: fa(r = data_correlation_matrix, nfactors = 3, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
           MR1  MR2 MR3   h2   u2 com
column_1 -0.43 0.15   0 0.21 0.79 1.2
column_2  0.43 0.13   0 0.20 0.80 1.2
column_3  0.00 0.36   0 0.13 0.87 1.0

                       MR1  MR2  MR3
SS loadings           0.37 0.17 0.00
Proportion Var        0.12 0.06 0.00
Cumulative Var        0.12 0.18 0.18
Proportion Explained  0.69 0.31 0.00
Cumulative Proportion 0.69 1.00 1.00

 With factor correlations of 
      MR1   MR2 MR3
MR1  1.00 -0.06   0
MR2 -0.06  1.00   0
MR3  0.00  0.00   1

Mean item complexity =  1.1
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are  3  and the objective function was  0.03
The degrees of freedom for the model are -3  and the objective function was  0 

The root mean square of the residuals (RMSR) is  0 
The df corrected root mean square of the residuals is  NA 

Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                    MR1   MR2 MR3
Correlation of (regression) scores with factors    0.56  0.41   0
Multiple R square of scores with factors           0.32  0.17   0
Minimum correlation of possible factor scores     -0.36 -0.67  -1

Figure: graph of factor analysis as produced by RStudio

Principal components

principal(data_correlation_matrix, 2, rotate = "none")

The principal() function, also from the psych package, performs a principal component analysis (PCA) for a number of components.

It requires three parameters: a correlation or covariance matrix, the number of principal components to analyse, and a rotation to apply, respectively.

The second parameter you specify is the number of principal components you will see in the output, labelled as PC1, PC2, etc. for no rotation, or as RC1, RC2, etc. if they have been rotated. You could also prefix the parameter with nfactors=.

The third parameter is the rotation to apply, if any. It must be passed by name, i.e. prefixed by rotate=. Be careful: if you do not specify this parameter, it will apply varimax rotation by default.

Sample output:

    PC1   PC2   h2   u2 com
Q1 0.73 -0.28 0.61 0.39 1.3
Q2 0.78 -0.29 0.61 0.31 1.3
Q3 0.71  0.58 0.69 0.16 1.9
Q4 0.77 -0.25 0.84 0.34 1.2
Q5 0.78 -0.22 0.66 0.34 1.2
Q6 0.72  0.59 0.66 0.13 1.9
Q7 0.73 -0.30 0.87 0.37 1.3
Q8 0.82 -0.24 0.63 0.26 1.2
Q9 0.74  0.52 0.74 0.18 1.8
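
Here is a sketch of the full PCA workflow, assuming the psych package is installed. It uses four columns of the built-in mtcars dataset as stand-in data, so its numbers will differ from the sample output above.

```r
if (requireNamespace("psych", quietly = TRUE)) {
	# Stand-in data: four numeric columns from a dataset that ships with R
	data_correlation_matrix <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

	# Two unrotated principal components
	data_pca <- psych::principal(data_correlation_matrix, nfactors = 2, rotate = "none")
	data_pca

	# The loadings alone, one column per component
	unclass(data_pca$loadings)
} else {
	message("Install the psych package first: install.packages(\"psych\")")
}
```

The psych:: prefix calls the function without loading the whole package; calling library(psych) first and then principal() works just as well.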