Fundamentals of Coding in R

Download R

R. The R Project for Statistical Computing. Available at: https://cran.r-project.org/
Select: “Download R for (Linux/macOS/Windows)” and follow the instructions.
RStudio Desktop. Available at: https://posit.co/download/rstudio-desktop/
Scroll down to find your installer, e.g. “RSTUDIO-2024.04.2-764.EXE” for Windows.

Starting R

Once you have installed both R and RStudio, you will be able to start the RStudio application. The console, where you can begin to execute commands, is in the bottom-left pane and shows the > prompt.

Recommended Reading: The R Book, Chapter 2 “Essentials of the R Language”
Available here

Keeping Your Code Clean

Documenting your code well is important, both when sharing it and for your own sanity. R ignores everything on a line after a ‘#’, so a line that begins with ‘#’ is skipped entirely.

x = 5
# x = 4
x

> [1] 5

x = 4
x

> [1] 4

R as a Calculator

The basic operations are “*, +, -, /, ^”

2+3

> [1] 5

2*3

> [1] 6

2/3

> [1] 0.667

2^3

> [1] 8

2*(3+1)^2

> [1] 32

Variables

You can store results in variables and use them in calculations. You can print the value of a variable by using it as a command.

x = 2+3

x

> [1] 5

y = 1+2
x*y

> [1] 15

z = x^y
z

> [1] 125

R has another notation for assignment: the arrow: <- . Many R programmers use this. It may seem odd to programmers coming from other languages.

x <- 3
x

> [1] 3

x <- 5.412
x

> [1] 5.41

Look here for an explanation of the differences between = and <-.

Variable Types

Character: free text, a.k.a. string
Factor: Categorical values; actual values may be text (North/South/East/West) or numbers (1/2/3/4)
Logical: Binary (TRUE/FALSE)
Integer: Only whole numbers (1L/2L/3L/4L, the letter ‘L’ declares these as Integers)
Numeric: Decimal Numbers (1.5/2.5/3.5/4.5)

We can change between variable types easily in R. Sometimes when importing data, you will import unexpected datatypes. For example, you may find a column has been imported as a character instead of a number.

temp_char = "3" #Initialize as character
temp_char*2 #Returns an error

> Error in temp_char * 2: non-numeric argument to binary operator

temp_char = "3"
temp_char = as.integer(temp_char)
temp_char*2

> [1] 6

Be careful when converting a numeric variable into an integer: you will lose everything after the decimal point.

temp_numeric = 3.14
as.integer(temp_numeric) #Only prints 3

> [1] 3
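As a small sketch of the warning above (temp_values is a made-up name for illustration): as.integer() truncates toward zero rather than rounding, so round() may be the better choice when the nearest whole number is wanted.

```r
# Illustrative values; temp_values is a hypothetical name for this sketch
temp_values = c(3.14, 3.99, -3.99)

# as.integer() drops everything after the decimal point (truncates toward zero)
truncated = as.integer(temp_values)
truncated   # 3 3 -3

# round() goes to the nearest whole number, but the result is still numeric
rounded = round(temp_values)
rounded     # 3 4 -4

# class() reports the type of each result
class(truncated)   # "integer"
class(rounded)     # "numeric"
```

Note that round() alone does not change the storage type; wrap it in as.integer() if an integer is required.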

If you are working with whole numbers, it is a good habit to store variables as integers. This will save space and computation time, especially in large datasets.

The example below compares the size of ten million numerics with ten million integers. Variable x, stored as numeric, is twice the size of y despite containing the same information.

x = rep(as.numeric(1), 1e7)
y = rep(as.integer(1), 1e7)

object.size(x)/object.size(y)

> 2 bytes
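The “bytes” label in the output above appears only because object.size() returns a labelled size object; the result is really a unitless ratio. A minimal sketch, assuming smaller vectors of one million entries (an arbitrary choice) and using as.numeric() to strip the label:

```r
# Hypothetical names; one million entries keeps the example quick to run
doubles  = rep(as.numeric(1), 1e6)   # numerics use 8 bytes per entry
integers = rep(as.integer(1), 1e6)   # integers use 4 bytes per entry

# as.numeric() turns the size objects into plain numbers (in bytes)
size_doubles  = as.numeric(object.size(doubles))
size_integers = as.numeric(object.size(integers))

# The ratio is very close to 2; a small fixed overhead keeps it from being exact
size_ratio = size_doubles / size_integers
round(size_ratio, 2)
```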

Note: we learn more about functions later.

Vectors

A vector stores a collection of values that all share one data type. You create vectors by using the c() function (concatenate).

# A vector with 4 entries
c(1, 2, 3, 4)

> [1] 1 2 3 4

x = c(1.1, 0.0, 3.14, 2.718)
x

> [1] 1.10 0.00 3.14 2.72
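All entries of a vector share one data type, so mixing types inside c() makes R convert every entry to the most general type present. A short sketch (the variable names are illustrative only):

```r
# integer + numeric -> everything becomes numeric
mixed_numbers = c(1L, 2.5)

# anything + character -> everything becomes character
mixed_text = c(1, "a", TRUE)

class(mixed_numbers)   # "numeric"
class(mixed_text)      # "character"
mixed_text             # "1" "a" "TRUE"
```

This is often the reason a column of numbers is imported as character: a single stray text entry is enough.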

One way to access a specific value within a vector is to use its index (R starts counting at 1):

x[1]

> [1] 1.1

x[4]

> [1] 2.72

Or you can access a range of values:

x[2:4]

> [1] 0.00 3.14 2.72
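Square brackets accept more than a single index or a range. The sketch below reuses the vector x from above and shows three other common forms; the names picked, dropped and large are hypothetical.

```r
x = c(1.1, 0.0, 3.14, 2.718)

# A vector of indices picks out several entries at once
picked = x[c(1, 3)]    # first and third entries

# A negative index removes that entry instead of selecting it
dropped = x[-1]        # everything except the first entry

# A logical condition keeps only the entries for which it is TRUE
large = x[x > 1]       # entries greater than 1

picked
dropped
large
```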

Sequences of integers are so common that there is a shortcut for making them.

1:4

> [1] 1 2 3 4

9:2

> [1] 9 8 7 6 5 4 3 2

# or 
seq(1,2,0.1)

>  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

A long vector will be displayed over several lines. The number at the start of each line in brackets is the index of the first entry on that line.

x = 1:40
x

>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

Matrices, Arrays, Data Frames

Matrices

A matrix is a two-dimensional set of data.

You can specify matrices of any size. The nrow parameter tells R the number of rows, and ncol tells R the number of columns.

myMatrix = matrix(data = 0, nrow = 3, ncol = 2)

As with a vector, we can access matrix items using square brackets; however, we now need two inputs, one for each dimension.

myMatrix

>      [,1] [,2]
> [1,]    0    0
> [2,]    0    0
> [3,]    0    0

myMatrix[1,1] = 2
myMatrix[1,2] = 3
myMatrix

>      [,1] [,2]
> [1,]    2    3
> [2,]    0    0
> [3,]    0    0

We can also access entire rows or columns of matrices at once, by leaving the input blank.

myMatrix[1, ]

> [1] 2 3

myMatrix[ , 1]

> [1] 2 0 0
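The data argument of matrix() can also be a vector of values rather than a single number. A brief sketch, with the hypothetical names byColumn and byRow: R fills the matrix column by column unless byrow = TRUE is given.

```r
# Default: values 1 to 6 are filled down the columns
byColumn = matrix(data = 1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills across the rows instead
byRow = matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)

byColumn[1, ]   # first row: 1 3 5
byRow[1, ]      # first row: 1 2 3
```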

We can add rows or columns using the rbind() and cbind() functions.

my2ndMatrix = cbind(myMatrix,c(1,2,3))
my2ndMatrix

>      [,1] [,2] [,3]
> [1,]    2    3    1
> [2,]    0    0    2
> [3,]    0    0    3

my2ndMatrix = rbind(myMatrix, c(1,2))
my2ndMatrix

>      [,1] [,2]
> [1,]    2    3
> [2,]    0    0
> [3,]    0    0
> [4,]    1    2

Useful functions and operators for matrices:

#Dimensions
dim(myMatrix)

> [1] 3 2

nrow(myMatrix) #Display the number of rows

> [1] 3

ncol(myMatrix) #Display the number of columns

> [1] 2

length(myMatrix)

> [1] 6

#Check if item exists
1 %in% myMatrix

> [1] FALSE

2 %in% myMatrix

> [1] TRUE

Arrays

Arrays are very similar to matrices. The key difference is they can have more than two dimensions.

# Creating a 4x3x2 array
myArray = array(data = 0, dim = c(4,3,2))
myArray[1,1,1] = 1
myArray[1,2,1] = 2
myArray[1,1,2] = 3
myArray[1,2,2] = 4
myArray[2, , ] = 5
myArray

> , , 1
> 
>      [,1] [,2] [,3]
> [1,]    1    2    0
> [2,]    5    5    5
> [3,]    0    0    0
> [4,]    0    0    0
> 
> , , 2
> 
>      [,1] [,2] [,3]
> [1,]    3    4    0
> [2,]    5    5    5
> [3,]    0    0    0
> [4,]    0    0    0

Data Frames

Data frames are similar to matrices but more flexible: each column can contain a different data type. Some statistics functions will require either a data frame or a matrix.

The data you will be using in this class will be imported as data frames.

You can use the same functions and methods of access on data frames as you can on matrices, with the addition of some new ones.

myDF = data.frame(col1 = 1:3, col2 = 4:6, col3 = c("a", "b", "c"))
myDF

>   col1 col2 col3
> 1    1    4    a
> 2    2    5    b
> 3    3    6    c

#Request a summary of the data frame
summary(myDF)

>       col1          col2         col3          
>  Min.   :1.0   Min.   :4.0   Length:3          
>  1st Qu.:1.5   1st Qu.:4.5   Class :character  
>  Median :2.0   Median :5.0   Mode  :character  
>  Mean   :2.0   Mean   :5.0                     
>  3rd Qu.:2.5   3rd Qu.:5.5                     
>  Max.   :3.0   Max.   :6.0

#Access specific named columns
myDF$col1

> [1] 1 2 3

myDF$col3

> [1] "a" "b" "c"
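A short sketch of matrix-style access on the same data frame, plus one data-frame-specific step (adding a column with $). The column name total is made up for this example.

```r
myDF = data.frame(col1 = 1:3, col2 = 4:6, col3 = c("a", "b", "c"))

# Matrix-style access: row 2 of the column named col3
myDF[2, "col3"]

# Keep only the rows where col1 is greater than 1 (all columns)
myDF[myDF$col1 > 1, ]

# Add a new column computed from existing ones
myDF$total = myDF$col1 + myDF$col2
myDF$total   # 5 7 9

# Check the data type of every column
sapply(myDF, class)
```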

Functions

R has all the functions you know and love. (Most of them can be used on vectors.)

sin(1)

> [1] 0.841

sin(1.4)

> [1] 0.985

sin(3)

> [1] 0.141

# R knows about pi
pi

> [1] 3.14

sin(pi/2)

> [1] 1

# The exponential function
exp(0)

> [1] 1

exp(1)

> [1] 2.72

# factorial:  n!
factorial(8)

> [1] 40320

# n choose k 
factorial(8)/(factorial(3)*factorial(8-3))

> [1] 56

# a built in function!
choose(8,3)

> [1] 56
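To illustrate the remark that most functions can be used on vectors, the sketch below passes whole vectors to sin(), sqrt() and choose(). Each function is applied to every entry; the variable names are illustrative.

```r
angles = c(0, pi/6, pi/2)

sines = sin(angles)          # approximately 0.0 0.5 1.0
roots = sqrt(c(1, 4, 9))     # 1 2 3

# 8 choose k for k = 0, 1, 2, 3 in a single call
combos = choose(8, 0:3)      # 1 8 28 56

sines
roots
combos
```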

The sum and mean functions operate on vectors. They return the sum and the average, respectively, of the vector’s entries.

x = 1:6
x

> [1] 1 2 3 4 5 6

sum(x)

> [1] 21

mean(x)

> [1] 3.5

Example: find the sum of the integers from 1 to 1024.

x = 1:1024
sum(x)

> [1] 524800

# This can be done in one command.
sum(1:1024)

> [1] 524800

Functions can have required arguments, which must be passed in, and optional arguments, which have default values. For example, the function ‘mean’ has an optional argument na.rm that defaults to FALSE. If the data in the vector Y includes NA, the default call to mean returns NA rather than a number.

Y <- c(1, 2, 3, NA, 5, 6, NA, 9, 10, NA)
mean(Y)

> [1] NA

# Same as the default
mean(Y, na.rm=FALSE)

> [1] NA

# Changing the argument to TRUE
mean(Y, na.rm=TRUE)

> [1] 5.14
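Before removing missing values it can help to know how many there are. A minimal sketch using is.na() on the same vector; na.rm works the same way for sum(). The names missing and nMissing are hypothetical.

```r
Y = c(1, 2, 3, NA, 5, 6, NA, 9, 10, NA)

# is.na() returns TRUE for every missing entry
missing = is.na(Y)

# TRUE counts as 1 when summed, so this is the number of NA values
nMissing = sum(missing)
nMissing                        # 3

# sum() also accepts na.rm
total = sum(Y, na.rm = TRUE)
total                           # 36
```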

Custom Functions

When writing code, you will often find that you are using the same lines of code more than once. In these cases it can help readability and usability to create your own function for the repeated lines of code.

Below is a simple function showcasing the ability to use custom inputs and return a value after applying some transformations.

someFunction = function(input1, input2){
  output = (input1 + input2) * 5
  
  #return() tells your function which value to send back (without it, R returns the value of the last expression)
  return(output)
}

With your custom function initialized, you may now call upon it whenever you like.

someFunction(1,2)

> [1] 15

someFunction(5,10)

> [1] 75
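Custom functions can have optional arguments too, just like na.rm in mean. The sketch below is a hypothetical variant of someFunction, called scaleSum here, in which the multiplier defaults to 5 but can be overridden.

```r
# scaleSum and multiplier are made-up names for this illustration
scaleSum = function(input1, input2, multiplier = 5){
  output = (input1 + input2) * multiplier
  return(output)
}

scaleSum(1, 2)                    # uses the default: 15
scaleSum(1, 2, multiplier = 10)   # overrides the default: 30

# Because + and * work on vectors, so does the function
scaleSum(c(1, 5), c(2, 10))       # 15 75
```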

If Statements and Loops

Logical Conditions

R supports logical conditions:

Operator	Name	Example
==	Equal	x == y
!=	Not equal	x != y
>	Greater than	x > y
<	Less than	x < y
>=	Greater than or equal to	x >= y
<=	Less than or equal to	x <= y

It is important to distinguish between a single = and two ==. One is assignment, two is logical comparison.

1==1

> [1] TRUE

1==0

> [1] FALSE

1>0

> [1] TRUE
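Comparisons also work on vectors, returning one TRUE or FALSE per entry. A brief sketch with a made-up vector named scores:

```r
scores = c(1, 5, 10)

scores > 4        # FALSE TRUE TRUE
scores == 5       # FALSE TRUE FALSE

# TRUE counts as 1, so sum() counts how many entries pass the test
sum(scores > 4)   # 2

# any() and all() reduce the result to a single TRUE or FALSE
any(scores > 4)   # TRUE
all(scores > 4)   # FALSE
```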

If Statements

We can use logical conditions in many ways; the most common is within if statements. The body of an if statement only executes if its condition is TRUE.

x = 1
if(1==1) {
  x = 2
}
x

> [1] 2

if(1==0) {
  x = 3
}
x

> [1] 2

We can also provide code that will run should the condition be FALSE.

x = 1
if(1==0){ 
  x = 2 #IF TRUE 
} else {
  x = 3 #IF FALSE
}
x

> [1] 3

We can add multiple conditions.

x = 1
if(1==0){ 
  x = 2 #Check one first
}else if(1==1){
  x = 3 #Check two second
}else{ 
  x = 4 #If neither is TRUE
}
x

> [1] 3

Logical Operators

R supports logical operators as well:

Operator	Description
&	Element-wise Logical AND operator
&&	Logical AND operator - Returns TRUE if both statements are TRUE
|	Element-wise Logical OR operator.
||	Logical OR operator. It returns TRUE if one of the statements is TRUE.
!	Logical NOT - returns FALSE if statement is TRUE

Note the difference between a single & and a double &&. The single & is more common when dealing with vectors, as it compares the vectors element by element, whereas && works on single TRUE/FALSE values and is typically used in if statements.

1||0

> [1] TRUE

1&&0

> [1] FALSE

!1||0

> [1] FALSE
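A sketch of the element-wise operators on a vector, followed by && inside an if statement. The vector ages and the cut-offs 18 and 65 are hypothetical values chosen for illustration.

```r
ages = c(15, 30, 70)

# & and | compare entry by entry and return one result per entry
workingAge = ages >= 18 & ages < 65    # FALSE TRUE FALSE
dependent  = ages < 18 | ages >= 65    # TRUE FALSE TRUE

# ! flips every entry
!workingAge                            # TRUE FALSE TRUE

# && combines two single TRUE/FALSE values, which is what if() expects
if(length(ages) > 0 && ages[1] < 18){
  firstIsMinor = TRUE
} else {
  firstIsMinor = FALSE
}
firstIsMinor                           # TRUE
```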

Loops

There are two commonly used types of loops in R.

While loops
For loops

While

While loops execute as long as a condition remains TRUE.

x = 1
while(x < 5){
  print(x)
  x = x + 1
}

> [1] 1
> [1] 2
> [1] 3
> [1] 4

It is important to be careful with loops. For example, forgetting to increment x can cause an infinite loop.
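One defensive pattern, sketched here under the assumption that an upper bound on the number of iterations is known, is to add a counter to the loop condition so the loop stops even if the main condition never becomes FALSE. The names iterations and maxIterations are made up.

```r
x = 1
iterations = 0
maxIterations = 100   # arbitrary safety limit for this sketch

# The loop ends when x reaches 5 OR the safety limit is hit, whichever comes first
while(x < 5 && iterations < maxIterations){
  x = x + 1
  iterations = iterations + 1
}

x            # 5
iterations   # 4
```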

For

For loops allow for iterating over a sequence. The for statement will execute code once for each item.

for(x in 1:5){
  print(x)
}

> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
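The sequence does not have to be numbers. The sketch below loops directly over a character vector, then uses seq_along() to loop over positions when the index itself is needed. The vector regions reuses the North/South/East/West example; numbered is a hypothetical name.

```r
regions = c("North", "South", "East", "West")

# Loop over the values themselves
for(region in regions){
  print(region)
}

# Loop over the positions 1, 2, 3, 4 and store a result for each
numbered = character(length(regions))
for(i in seq_along(regions)){
  numbered[i] = paste(i, regions[i])
}
numbered   # "1 North" "2 South" "3 East" "4 West"
```

seq_along(regions) behaves like 1:length(regions) but is safer for empty vectors, where 1:0 would count down from 1 to 0.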

Break and Next

The break and next statements allow for more control within our loops. Break will stop our loop, and next will skip the current iteration without stopping the remainder.

for(x in 1:5){
  if(x==4) break
  print(x)
}

> [1] 1
> [1] 2
> [1] 3

for(x in 1:5){
  if(x==4) next
  print(x)
}

> [1] 1
> [1] 2
> [1] 3
> [1] 5

More complex example

data = read.csv("Datasets/families.csv")
head(data)

>   TYPE PERSONS CHILDREN INCOME REGION EDUCATION
> 1    1       2        0  43450      1        39
> 2    1       2        0  79000      1        40
> 3    1       2        0  51306      1        39
> 4    1       4        2  24850      1        41
> 5    1       4        2  65145      1        43
> 6    3       3        2  23015      1        40

#Count number of families with more than two people
count = 0
for(x in 1:nrow(data)){
  if(data[x, "PERSONS"]>2) count = count + 1
}
count

> [1] 25335
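The same count can be obtained without a loop by comparing the whole column at once and summing the result. Since the families file is not available here, this sketch uses a small made-up data frame with the same PERSONS column; its values are illustrative only.

```r
# Made-up stand-in for the families data
families = data.frame(PERSONS  = c(2, 2, 4, 4, 3),
                      CHILDREN = c(0, 0, 2, 2, 2))

# Loop version, as in the example above
loopCount = 0
for(x in 1:nrow(families)){
  if(families[x, "PERSONS"] > 2) loopCount = loopCount + 1
}

# Vectorized version: TRUE counts as 1 when summed
vectorCount = sum(families$PERSONS > 2)

loopCount     # 3
vectorCount   # 3
```

On large datasets the vectorized form is usually both shorter and faster.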

Getting Help

R and RStudio have complete documentation on all R functions. The lower right pane in RStudio has a help tab you can use. The help contains a lot of information, so you will have to learn to filter out what you don’t need. Try to use the R Documentation before taking your query to your favourite search engine.

# You can also open the help for a function directly from the console.
help(mean)
# or
?mean