The R programming language
Table of contents:
- Variables
- Functions (using)
- Data (importing)
- Data types
- Conditions and scope
- Functions (defining)
- ANOVA
- Correlation matrix
- EFA
- Principal components
What is programming?
Programming is problem solving: breaking down an implicit, indefinite problem statement into explicit, well-defined steps to solve it.
Breaking down into what? Programming has a few defined basic building blocks, that exist in most languages:
- Numbers
- Conditions
- Loops
- Functions
- Classes
Every fancy computery thing boils down to some assortment of these components, no exceptions. The programming bit is figuring out which ones to use (and how) to accomplish some task or solve some problem.
Variables
Whenever we're working with things, we need a place to put stuff. If you're writing notes, you need some paper to write it on. If you're doing it digitally, you need a digital document.
Likewise, when we do stuff with a computer, we need a place to put stuff.
In R, the analogy is thus:
- a desk to write on becomes the environment
- the paper you are currently writing a thing on becomes a script
- a single line / idea on the paper becomes a single variable
Let's look at the RStudio interface:
- In the bottom-right, there is the files view - it shows all your folders and files and stuff that are on your computer. You can use it to browse around, and make .R and .Rmd files.
- In the bottom-left, there is the R console - this is R itself. Every line of code will be run here, literally line-by-line (you can even watch it happen!). If you don't want to bother making a whole file, or just to quickly test something, you can type into it just like you would do for a regular R file.
- In the top-left, there is the currently open R file - or perhaps a dataframe view. If you're looking at a .Rmd file, there will also be two buttons, to switch between the raw code, and the nice rendered view.
- In the top-right, there is the environment - it shows you every variable that is currently "known" by R.
So now that we know where things are, let's write some R: how do we define a variable?
Defining a variable
It's pretty simple:
- Type a name for the variable (use underscores instead of spaces!)
- Type the assignment operator <-
- Type the value you want to assign to the variable
And let's see it in practice:
number_one <- 1
If you type this code into the console (and press enter), you will be able to see number_one with the value 1 in your environment.
What kind of things can I assign to a variable?
For now, we'll only look at literal values:
- An integer: 1L (the L suffix matters; a plain 1 is stored as a numeric)
- A number: 1.0
- Some text: "Hello there!"
Super important note: if you want to write text in your code, you must surround it with quotes!
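Here's a small sketch of assigning each kind of literal to a variable and asking R what type it ended up with. The variable names are made up for illustration. One thing to watch: R stores a plain 1 as a numeric, and the L suffix is how you write an integer literal.

```r
# Hypothetical variable names, chosen only for this illustration.
whole_number <- 1L            # the L suffix makes this an integer
decimal_number <- 1.0         # a plain number is stored as a numeric
greeting <- "Hello there!"    # text must be wrapped in quotes

# class() reports the type of the value stored in each variable.
print(class(whole_number))
print(class(decimal_number))
print(class(greeting))
```

After running this, all three variables should show up in the environment view.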
Functions (using)
Just defining some stuff in our environment isn't very useful. We want to do stuff with it! That's where functions come in...
A function is a list of operations that is performed "on" or using some data. Usually, it's up to us to "give" the function that data. The pieces of data we give it are called parameters. A function will usually return a value to us, which we can either output directly, or assign to a variable for use later.
Going back to data input, it would be pretty limiting to assign a single data point to a single variable. So let's assign a whole load of data points to a single variable:
c(1, 2, 3)
[1] 1 2 3
If you type the first line into the console and press enter, you should see the second line as output in the console.
The c() function will create a vector (or list) of some data - the things you pass it as parameters. In this example, we have given it the literal values of 1, 2, and 3, and it returns a vector, with elements of 1, 2, and 3. What's a vector? We'll talk about that soon... (it's a data type)
Huh? How? Why?
How do you know you're calling a function?
- Type the name of the function: c
- Type opening and closing parentheses: ()
How do you "pass parameters" to a function?
- Type the name of the function: c
- Type an opening parenthesis: (
- Type a parameter
- Add a comma , between each subsequent parameter
- Type a closing parenthesis: )
How do you get output returned from a function?
If it is on a line all by itself, the output will pop up in the console (or below the codeblock, in an .Rmd file). If you want to do something else with it (or just keep it for later), you can assign it to a variable just like before:
one_two_three <- c(1, 2, 3)
This will assign a vector containing the elements 1, 2, and 3 to a variable called "one_two_three" in your environment.
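Once a vector is stored in a variable, it can be passed to other functions as a parameter. A minimal sketch, using three functions that come with R (length(), sum() and mean()); the result variable names are made up for this example:

```r
one_two_three <- c(1, 2, 3)

# Pass the vector to other functions as a parameter,
# and keep each returned value in its own variable.
how_many <- length(one_two_three)   # number of elements in the vector
total <- sum(one_two_three)         # all elements added together
average <- mean(one_two_three)      # arithmetic mean of the elements

print(how_many)
print(total)
print(average)
```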
Data (importing)
Typing in all our data is a bit tedious. It would be great to grab it from somewhere else... oh wait, we can!
The read.csv() function will need the name of a .csv file as a parameter, and will return the contents of it as a dataframe.
all_the_data <- read.csv("some_dataset.csv")
Now we've got a variable, called "all_the_data" in our environment, which is a dataframe containing everything in the "some_dataset.csv" file. It's up to you to make sure that the variable and file names are correct for what you're doing.
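If you don't have a .csv file to hand, here's a self-contained sketch to try it out. It first writes a tiny made-up dataset to a temporary file (the column names and values are invented for this example), and then reads it back in with read.csv().

```r
# Write a tiny, made-up dataset to a temporary file so the example runs anywhere.
example_path <- tempfile(fileext = ".csv")
made_up_data <- data.frame(participant = c(1, 2, 3), score = c(4.5, 3.0, 5.5))
write.csv(made_up_data, example_path, row.names = FALSE)

# Read it back in, exactly as you would with your own file name.
all_the_data <- read.csv(example_path)

print(nrow(all_the_data))    # number of rows that were read
print(names(all_the_data))   # column names taken from the file's first line
str(all_the_data)            # compact summary of every column
```

With your own data, you would skip the first half and just pass your file name to read.csv().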
Data types
Every variable that we define has a type, according to the type of data that it contains (its value). We've covered a few already, in their literal form, but here's a list with explanation:
Boolean
The simplest data type: it is either TRUE or FALSE
Integer
A type of numeric that can only be a whole number (positive or negative), such as 1L
Numeric
A numeric is just any number, such as 1.5
String
A string is a "string" (sequence) of characters, a.k.a. "text", such as "Here's some text."
Vector
A vector is like a list: it keeps any number of values within it, but they all have to be of the same type, e.g. 1.0, 1.5, 2.0
Dataframe
A dataframe is a bit like a spreadsheet, but slightly more limited. Under-the-hood, it is just an assortment of named columns, each of which is a vector.
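To see these types in action, here's a quick sketch that builds one value of several types and checks each with class(). The names and values are made up. Note that R has its own names for two of them: a Boolean is reported as "logical", and a string as "character".

```r
# Made-up values, one for each of several types.
is_raining <- TRUE                             # Boolean
scores <- c(1.0, 1.5, 2.0)                     # vector of numerics
people <- c("Ada", "Grace", "Alan")            # vector of strings
small_table <- data.frame(name = people, score = scores)  # dataframe

print(class(is_raining))
print(class(scores))
print(class(people))
print(class(small_table))

# Each column of a dataframe is itself a vector.
print(small_table$score)
```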
Conditions and scope
Code that only ever does one thing (or sequence of things) is pretty boring, and not very useful. What if we want it to change what it does, depending on the data?
if (...) {
}
Introducing the if statement. It evaluates a Boolean expression, and executes code depending on the result. If the expression evaluates to TRUE, then it will execute the enclosed block of code. If the expression evaluates to FALSE, then the code execution skips over the enclosed block of code.
Expression evaluation
The Boolean expression is implicitly compared to TRUE. In programming languages, the typical Boolean algebra notation (symbols) you may be familiar with is not used. Instead, it is replaced by symbols that are present on a standard keyboard.
Here are the Boolean operators, and comparison operators:
Operation | Algebra | Code |
---|---|---|
AND | ∧ | && |
OR | ∨ | || |
NOT | ¬ | ! |
EQUIVALENT | ≡ | == |
NOT EQUIVALENT | ≢ | != |
LESS THAN | < | < |
LESS THAN OR EQUAL | ≤ | <= |
GREATER THAN | > | > |
GREATER THAN OR EQUAL | ≥ | >= |
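Here's a short sketch of the operators from the table, using two made-up variables. Each expression evaluates to either TRUE or FALSE, and the results are kept in variables (with invented names) so you can inspect them.

```r
day <- 29
month <- 2

is_equal <- day == 29                  # EQUIVALENT
is_not_equal <- day != 29              # NOT EQUIVALENT
both_true <- day == 29 && month == 2   # AND: both sides must be TRUE
either_true <- day < 29 || month > 2   # OR: at least one side must be TRUE
negated <- !(month >= 2)               # NOT: flips TRUE to FALSE

print(is_equal)
print(is_not_equal)
print(both_true)
print(either_true)
print(negated)
```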
Block of code
What was that block of code I mentioned? It's everything between the opening bracket { and the closing bracket } .
When lines of code are placed within a block like this, it defines the limits of our code, such as where to jump to if the expression evaluates to FALSE, i.e. the line after the closing bracket } .
How do we write an if statement?
There are three parts to it:
- The keyword if
- The Boolean expression, in parentheses (...)
- The code block to execute if the Boolean expression is TRUE, in brackets {...}
Here's an example:
if (day == 29) {
  month <- 2
}
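To watch the block being executed and then skipped, here's a sketch that runs the same if statement twice with different values of day. The starting value of month and the result variable names are assumptions made for this example.

```r
# First run: the expression is TRUE, so the block is executed.
day <- 29
month <- 1
if (day == 29) {
  month <- 2
}
month_after_match <- month

# Second run: the expression is FALSE, so the block is skipped.
day <- 15
month <- 1
if (day == 29) {
  month <- 2
}
month_after_mismatch <- month

print(month_after_match)      # month was changed inside the block
print(month_after_mismatch)   # month kept its starting value
```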
Functions (defining)
Writing a whole bunch of code is pretty tedious. It'd be great if we could write it to do one thing, and then reuse that code again. Copy / paste? Nah, there's a way better way... functions!
Functions are special blocks of code that can be called any place you want to execute them. The big benefit comes from being able to change the data that the function uses, through special variables called parameters. These variables only exist within the function's enclosed block of code, and are set each and every time the function is executed.
How do we write a function?
It's a mix of defining a variable, and writing an if statement: there are three key parts to it, and five steps overall:
- Type a name for the function, just like for a variable
- Type the assignment operator <-
- Type the keyword function
- Define the parameters you want to use, inside parentheses (...)
- Define the code that will use these parameters, inside brackets {...}
Let's have an example:
is_it_the_date <- function(day, month) {
  if (day == 29 && month == 2) {
    return(TRUE)
  }
  return(FALSE)
}
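Defining a function doesn't run it; you still have to call it. A small sketch of calling the function above, once with the parameters in order and once with them named (the result variable names are made up):

```r
is_it_the_date <- function(day, month) {
  if (day == 29 && month == 2) {
    return(TRUE)
  }
  return(FALSE)
}

# Parameters passed in order: day first, then month.
leap_day <- is_it_the_date(29, 2)

# Parameters passed by name, which makes the call easier to read.
other_day <- is_it_the_date(day = 1, month = 2)

print(leap_day)
print(other_day)
```

The function's code is written once, and the only thing that changes between the two calls is the data passed in.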
ANOVA
linear_fit <- lm(dependent_variable ~ independent_variable, a_dataframe)
anova(linear_fit)
Before we can do an ANOVA test, we need to fit the data to a linear model with the lm() function. It needs two parameters: the variables to compare, and the data to use.
To specify which variables to compare, just use their column names from the dataframe. Specify the dependent variable first, on the left side of the tilde ~ operator. All the independent variables go on the right side (with plus signs + in between). The second parameter is the dataframe to be used. You can either just put this in as the second parameter, or explicitly pass it as the data by prefixing it with data= .
Once we've got the linear model fit stored in a variable, use the anova() function with it as a parameter.
Sample output:
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
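To try this without your own data, here's a sketch using two datasets that ship with R: PlantGrowth (plant weights in three treatment groups) and mtcars. Choosing these datasets is an assumption made for illustration. The second fit shows how several independent variables are joined with + in the formula.

```r
# One independent variable: does weight differ between groups?
linear_fit <- lm(weight ~ group, data = PlantGrowth)
anova_table <- anova(linear_fit)
print(anova_table)

# The table behaves like a dataframe, so single values can be pulled out by column name.
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)

# Two independent variables, joined with + on the right side of the tilde.
two_predictor_fit <- lm(mpg ~ wt + hp, data = mtcars)
two_predictor_table <- anova(two_predictor_fit)
print(two_predictor_table)
```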
Correlation matrix
data_correlation_matrix <- cor(data)
It is often useful to calculate the correlation matrix of some data, which can be done with the cor() function. It only requires one parameter: a dataframe containing the data. All of its columns must be numeric.
Sample output:
column_1 column_2
column_1 1.0000000 -0.1784702
column_2 -0.1784702 1.0000000
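A runnable sketch, assuming the built-in mtcars dataset and three of its numeric columns (picked arbitrarily for illustration):

```r
# Keep only a few numeric columns; cor() will not accept text columns.
numeric_columns <- mtcars[, c("mpg", "wt", "hp")]

data_correlation_matrix <- cor(numeric_columns)

# Round for readability. The diagonal is always 1, because every
# column correlates perfectly with itself.
print(round(data_correlation_matrix, 2))
```

The matrix is symmetric, so the value for a pair of columns appears twice, once on each side of the diagonal.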
EFA
library(psych)  # fa() and fa.diagram() come from the psych package
data_efa <- fa(data_correlation_matrix, 3, rotate = "oblimin")
data_efa
fa.diagram(data_efa, digits = 2)
To run an exploratory factor analysis, use the fa() function from the psych package. It requires three parameters: the data correlation matrix, the number of factors to analyse for, and a rotation to apply (if any).
The data correlation matrix can be computed with the cor() function, see above.
The number of factors is the second parameter to pass. You can also prefix it with nfactors= .
The third parameter is the rotation to apply, if any. It must be prefixed by rotate= !
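Here's a hedged sketch of the same call with named parameters, assuming the psych package is installed and using six numeric columns of the built-in mtcars dataset (an arbitrary choice for illustration). It uses rotate = "none" and two factors only to keep the example small and free of extra package requirements; it also passes n.obs, the number of observations, since fa() is given a correlation matrix rather than the raw data.

```r
# Only run the example if the psych package is available.
if (requireNamespace("psych", quietly = TRUE)) {
  numeric_columns <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
  correlation_matrix <- cor(numeric_columns)

  efa_result <- psych::fa(
    correlation_matrix,
    nfactors = 2,                      # number of factors to analyse for
    n.obs = nrow(numeric_columns),     # how many rows the correlations are based on
    rotate = "none"                    # no rotation in this sketch
  )

  # One row per column of the data, one column per factor.
  print(efa_result$loadings)
}
```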
Sample output:
Factor Analysis using method = minres
Call: fa(r = data_correlation_matrix, nfactors = 3, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 MR3 h2 u2 com
column_1 -0.43 0.15 0 0.21 0.79 1.2
column_2 0.43 0.13 0 0.20 0.80 1.2
column_3 0.00 0.36 0 0.13 0.87 1.0
MR1 MR2 MR3
SS loadings 0.37 0.17 0.00
Proportion Var 0.12 0.06 0.00
Cumulative Var 0.12 0.18 0.18
Proportion Explained 0.69 0.31 0.00
Cumulative Proportion 0.69 1.00 1.00
With factor correlations of
MR1 MR2 MR3
MR1 1.00 -0.06 0
MR2 -0.06 1.00 0
MR3 0.00 0.00 1
Mean item complexity = 1.1
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 3 and the objective function was 0.03
The degrees of freedom for the model are -3 and the objective function was 0
The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is NA
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR1 MR2 MR3
Correlation of (regression) scores with factors 0.56 0.41 0
Multiple R square of scores with factors 0.32 0.17 0
Minimum correlation of possible factor scores -0.36 -0.67 -1
Principal components
library(psych)  # principal() comes from the psych package
principal(data_correlation_matrix, 2, rotate = "none")
The principal() function, from the psych package, performs a principal component analysis (PCA) for a number of components.
It requires three parameters: a correlation or covariance matrix, the number of principal components to analyse, and a rotation to apply, respectively.
The second parameter you specify is the number of principal components you will see in the output, labelled as PC1, PC2, etc. for no rotation, or as RC1, RC2, etc. if they have been rotated. You could also prefix nfactors= before the parameter.
The third parameter is the rotation to apply, if any. It must be prefixed by rotate= ! Be careful: if you do not specify this parameter, it will apply varimax rotation by default.
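A small sketch of the call with named parameters, assuming the psych package is installed and again using a few numeric columns of the built-in mtcars dataset (an arbitrary choice for illustration):

```r
# Only run the example if the psych package is available.
if (requireNamespace("psych", quietly = TRUE)) {
  numeric_columns <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
  correlation_matrix <- cor(numeric_columns)

  pca_result <- psych::principal(
    correlation_matrix,
    nfactors = 2,      # number of principal components to keep
    rotate = "none"    # override the default varimax rotation
  )

  # One row per column of the data, one column per component.
  print(pca_result$loadings)
}
```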
Sample output:
PC1 PC2 h2 u2 com
Q1 0.73 -0.28 0.61 0.39 1.3
Q2 0.78 -0.29 0.61 0.31 1.3
Q3 0.71 0.58 0.69 0.16 1.9
Q4 0.77 -0.25 0.84 0.34 1.2
Q5 0.78 -0.22 0.66 0.34 1.2
Q6 0.72 0.59 0.66 0.13 1.9
Q7 0.73 -0.30 0.87 0.37 1.3
Q8 0.82 -0.24 0.63 0.26 1.2
Q9 0.74 0.52 0.74 0.18 1.8