R. The R Project for Statistical Computing. Available at: https://cran.r-project.org/
Select: “Download R for (Linux/macOS/Windows)” and follow the
instructions.
RStudio Desktop. Available at: https://posit.co/download/rstudio-desktop/
Scroll down to find your installer, e.g. “RSTUDIO-2024.04.2-764.EXE” for Windows.
Once you have installed both R and RStudio, you will be able to start the RStudio application. The console, where you can begin to execute commands, is the bottom-left pane; it shows the > prompt.
Recommended Reading: The R Book, Chapter 2 “Essentials of the R
Language”
Available here
It is important to document your code well, both for sharing it and for your own sanity. R ignores everything on a line after a ‘#’, so a line that begins with ‘#’ is skipped entirely.
x = 5
# x = 4
x
> [1] 5
x = 4
x
> [1] 4
The basic operations are “*, +, -, /, ^”
2+3
> [1] 5
2*3
> [1] 6
2/3
> [1] 0.667
2^3
> [1] 8
2*(3+1)^2
> [1] 32
You can store results in variables and use them in calculations. You can print the value of a variable by using it as a command.
x = 2+3
x
> [1] 5
y = 1+2
x*y
> [1] 15
z = x^y
z
> [1] 125
R has another notation for assignment, the arrow: <-. Many R programmers use this. It may seem odd to programmers coming from other languages.
x <- 3
x
> [1] 3
x <- 5.412
x
> [1] 5.41
Look here for an explanation of the differences between = and <-.
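One practical difference can be sketched as follows: inside a function call, = names an argument, while <- performs an assignment in the calling environment and then passes the value along. The helper sum_sq and the variable vals are hypothetical names used only for this illustration.

```r
# Hypothetical helper used only for this illustration.
sum_sq <- function(vals) sum(vals^2)

# '=' inside a call names the argument; no variable 'vals' is created.
a <- sum_sq(vals = 1:3)
exists_after_equals <- exists("vals", inherits = FALSE)

# '<-' inside a call assigns 'vals' in the calling environment,
# then passes the assigned value as the first argument.
b <- sum_sq(vals <- 1:3)
exists_after_arrow <- exists("vals", inherits = FALSE)

print(c(a, b))
print(c(exists_after_equals, exists_after_arrow))
```

Both calls return the same result; only the second leaves a variable behind, which is why = is the safer choice inside function calls.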
Character: free text, a.k.a. string
Factor: categorical values; actual values may be text (North/South/East/West) or numbers (1/2/3/4)
Logical: binary (TRUE/FALSE)
Integer: only whole numbers (1L/2L/3L/4L; the letter ‘L’ declares these as integers)
Numeric: decimal numbers (1.5/2.5/3.5/4.5)
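The types listed above can be checked with the class() function. The sketch below builds one value of each type (the names in the list are illustrative) and reports how R stores it.

```r
# One example value per data type; the list names are only labels.
values <- list(
  character = "North",
  factor    = factor(c("North", "South")),
  logical   = TRUE,
  integer   = 3L,
  numeric   = 3.5
)

# class() reports the type of each value.
types <- sapply(values, class)
print(types)

# A whole number typed without the L suffix is stored as numeric.
plain_three <- class(3)
print(plain_three)
```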
We can change between variable types easily in R. Sometimes when importing data, you will import unexpected datatypes. For example, you may find a column has been imported as a character instead of a number.
temp_char = "3" #Initialize as character
temp_char*2 #Returns an error
> Error in temp_char * 2: non-numeric argument to binary operator
temp_char = "3"
temp_char = as.integer(temp_char)
temp_char*2
> [1] 6
Be careful when converting a numeric variable to an integer: everything after the decimal point is discarded, not rounded.
temp_numeric = 3.14
as.integer(temp_numeric) #Only prints 3
> [1] 3
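To make the difference between truncating and rounding concrete, here is a minimal sketch comparing as.integer() with round(). The variable names are illustrative. It also shows that text which is not a number converts to NA (R issues a warning, suppressed here).

```r
vals <- c(3.14, 3.99, -3.99)

# as.integer() drops the fractional part (truncates toward zero).
truncated <- as.integer(vals)
print(truncated)

# round() goes to the nearest whole number instead.
rounded <- round(vals)
print(rounded)

# Text that cannot be read as a number becomes NA.
bad <- suppressWarnings(as.integer("three"))
print(bad)
```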
If you are working with whole numbers, it is a good habit to store variables as integers. This will save space and computation time, especially in large datasets.
Example comparing the size of ten million numerics and ten million integers. Variable x, stored as numeric, is twice the size of y, stored as integer, despite containing the same information.
x = rep(as.numeric(1), 1e7)
y = rep(as.integer(1), 1e7)
object.size(x)/object.size(y)
> 2 bytes
Note: we learn more about functions later.
A vector stores a collection of values that all share one data type. You create vectors by using the c() function (concatenate).
# A vector with 4 entries
c(1, 2, 3, 4)
> [1] 1 2 3 4
x = c(1.1, 0.0, 3.14, 2.718)
x
> [1] 1.10 0.00 3.14 2.72
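All entries of a vector created with c() end up with a single type. As a small sketch (variable names are illustrative), mixing types makes R convert every entry to the most general type present:

```r
# Mixing numbers, text and logicals: everything becomes character.
mixed <- c(1, "a", TRUE)
print(mixed)
print(class(mixed))

# Mixing numbers and logicals: TRUE/FALSE become 1/0.
num_lgl <- c(1.5, TRUE, FALSE)
print(num_lgl)
print(class(num_lgl))
```

This is worth remembering when importing data: a single stray text value can turn a whole numeric column into characters.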
One way to access specific values within a vector is to use the index of that value. Indices in R start at 1:
x[1]
> [1] 1.1
x[4]
> [1] 2.72
Or you can access a range of values:
x[2:4]
> [1] 0.00 3.14 2.72
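Beyond single indices and ranges, a few other indexing forms are commonly used. The sketch below reuses the same values as the vector above; the result names are illustrative.

```r
x <- c(1.1, 0.0, 3.14, 2.718)

# A vector of indices picks out several entries at once.
first_and_third <- x[c(1, 3)]

# A negative index drops that entry and keeps the rest.
all_but_first <- x[-1]

# A logical condition keeps only the entries where it is TRUE.
above_1_5 <- x[x > 1.5]

print(first_and_third)
print(all_but_first)
print(above_1_5)
```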
Sequences of integers are so common that there is a shortcut for making them.
1:4
> [1] 1 2 3 4
9:2
> [1] 9 8 7 6 5 4 3 2
# or
seq(1,2,0.1)
> [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
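The seq() function can also be told how many values to produce rather than the step size. A minimal sketch (variable names are illustrative), which also shows why seq_len() is a safer way to count from 1 when the upper limit might be 0:

```r
# Same sequence, specified by step size or by number of values.
by_step   <- seq(from = 0, to = 1, by = 0.25)
by_length <- seq(from = 0, to = 1, length.out = 5)
print(by_step)
print(by_length)

# 1:n counts backwards when n is 0; seq_len(n) returns an empty sequence.
n <- 0
backwards <- 1:n
empty     <- seq_len(n)
print(backwards)
print(length(empty))
```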
A long vector will be displayed over several lines. The number at the start of each line in brackets is the index of the first entry on that line.
x = 1:40
x
> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
A matrix is a two-dimensional set of data. You can specify matrices of any size. The nrow parameter tells R the number of rows, and ncol tells R the number of columns.
myMatrix = matrix(data = 0, nrow = 3, ncol = 2)
As with a vector, we can access matrix items using square brackets; however, we now need two indices, one for each dimension (row first, then column).
myMatrix
> [,1] [,2]
> [1,] 0 0
> [2,] 0 0
> [3,] 0 0
myMatrix[1,1] = 2
myMatrix[1,2] = 3
myMatrix
> [,1] [,2]
> [1,] 2 3
> [2,] 0 0
> [3,] 0 0
We can also access entire rows or columns of a matrix at once by leaving the other index blank.
myMatrix[1, ]
> [1] 2 3
myMatrix[ , 1]
> [1] 2 0 0
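When the data argument of matrix() is a vector rather than a single value, R fills the matrix column by column unless told otherwise. The sketch below (matrix names are illustrative) contrasts the default with byrow = TRUE.

```r
# Default: values fill the first column, then the second.
col_major <- matrix(1:6, nrow = 3, ncol = 2)
print(col_major)

# byrow = TRUE: values fill the first row, then the next.
row_major <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
print(row_major)

# The same position holds different values in the two layouts.
print(col_major[2, 1])
print(row_major[2, 1])
```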
We can add rows or columns using the rbind() and cbind() functions.
my2ndMatrix = cbind(myMatrix,c(1,2,3))
my2ndMatrix
> [,1] [,2] [,3]
> [1,] 2 3 1
> [2,] 0 0 2
> [3,] 0 0 3
my2ndMatrix = rbind(myMatrix, c(1,2))
my2ndMatrix
> [,1] [,2]
> [1,] 2 3
> [2,] 0 0
> [3,] 0 0
> [4,] 1 2
Useful functions and operators on matrices:
#Dimensions
dim(myMatrix)
> [1] 3 2
nrow(myMatrix) #Display the number of rows
> [1] 3
ncol(myMatrix) #Display the number of columns
> [1] 2
length(myMatrix)
> [1] 6
#Check if item exists
1 %in% myMatrix
> [1] FALSE
2 %in% myMatrix
> [1] TRUE
Arrays are very similar to matrices. The key difference is they can have more than two dimensions.
# Creating a 4x3x2 array
myArray = array(data = 0, dim = c(4,3,2))
myArray[1,1,1] = 1
myArray[1,2,1] = 2
myArray[1,1,2] = 3
myArray[1,2,2] = 4
myArray[2, , ] = 5
myArray
> , , 1
>
> [,1] [,2] [,3]
> [1,] 1 2 0
> [2,] 5 5 5
> [3,] 0 0 0
> [4,] 0 0 0
>
> , , 2
>
> [,1] [,2] [,3]
> [1,] 3 4 0
> [2,] 5 5 5
> [3,] 0 0 0
> [4,] 0 0 0
Data frames are more complex matrices. They allow for each column to contain a different data type. Some statistics functions will require either a data frame or a matrix.
The data you will be using in this class will be imported as data frames.
You can use the same functions and access methods on data frames as on matrices, along with some new ones.
myDF = data.frame(col1 = 1:3, col2 = 4:6, col3 = c("a", "b", "c"))
myDF
> col1 col2 col3
> 1 1 4 a
> 2 2 5 b
> 3 3 6 c
#Request a summary of the data frame
summary(myDF)
> col1 col2 col3
> Min. :1.0 Min. :4.0 Length:3
> 1st Qu.:1.5 1st Qu.:4.5 Class :character
> Median :2.0 Median :5.0 Mode :character
> Mean :2.0 Mean :5.0
> 3rd Qu.:2.5 3rd Qu.:5.5
> Max. :3.0 Max. :6.0
#Access specific named columns
myDF$col1
> [1] 1 2 3
myDF$col3
> [1] "a" "b" "c"
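As a brief sketch of the matrix-style access mentioned above, the example below rebuilds the same small data frame under the illustrative name df, selects by row and column, filters rows with a condition, and adds a new column (total is a hypothetical column name).

```r
df <- data.frame(col1 = 1:3, col2 = 4:6, col3 = c("a", "b", "c"))

# Matrix-style access: row 2 of the column named col2.
one_value <- df[2, "col2"]
print(one_value)

# Keep only the rows where col1 is greater than 1.
filtered <- df[df$col1 > 1, ]
print(filtered)

# Add a new column computed from existing ones.
df$total <- df$col1 + df$col2
print(df)
```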
R has all the functions you know and love. (Most of them can be used on vectors.)
sin(1)
> [1] 0.841
sin(1.4)
> [1] 0.985
sin(3)
> [1] 0.141
# R knows about pi
pi
> [1] 3.14
sin(pi/2)
> [1] 1
# The exponential function
exp(0)
> [1] 1
exp(1)
> [1] 2.72
# factorial: n!
factorial(8)
> [1] 40320
# n choose k
factorial(8)/(factorial(3)*factorial(8-3))
> [1] 56
# a built in function!
choose(8,3)
> [1] 56
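Because most of these functions accept vectors, they can be applied to many values in one call. A minimal sketch (variable names are illustrative):

```r
# sin() is applied to each entry of the vector.
angles <- c(0, pi / 2, pi)
sines <- sin(angles)
print(sines)

# sqrt() works the same way.
roots <- sqrt(c(1, 4, 9))
print(roots)

# choose() accepts a vector for k: 8 choose 0, 1, 2 and 3.
combos <- choose(8, 0:3)
print(combos)
```

Note that sin(pi) prints as a tiny non-zero number rather than exactly 0, because pi is stored with finite precision.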
The sum and mean functions work on vectors. They return the sum and the average, respectively, of the vector’s entries.
x = 1:6
x
> [1] 1 2 3 4 5 6
sum(x)
> [1] 21
mean(x)
> [1] 3.5
Example: find the sum of the integers from 1 to 1024.
x = 1:1024
sum(x)
> [1] 524800
# This can be done in one command.
sum(1:1024)
> [1] 524800
Functions can have required and optional arguments. For example, the documentation for ‘mean’ gives its default usage as mean(x, trim = 0, na.rm = FALSE, ...): x is required, while trim and na.rm are optional because they have default values. If the data in the vector Y includes NA, the default call to mean returns NA.
Y <- c(1, 2, 3, NA, 5, 6, NA, 9, 10, NA)
mean(Y)
> [1] NA
# Same as the default
mean(Y, na.rm=FALSE)
> [1] NA
# Changing the argument to TRUE
mean(Y, na.rm=TRUE)
> [1] 5.14
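To see what na.rm = TRUE is doing, the sketch below counts the missing values with is.na() and recomputes the mean by hand from the non-missing entries. The names n_missing, total and manual_mean are illustrative.

```r
Y <- c(1, 2, 3, NA, 5, 6, NA, 9, 10, NA)

# is.na() flags missing entries; summing the flags counts them.
n_missing <- sum(is.na(Y))
print(n_missing)

# sum() also accepts na.rm, like mean().
total <- sum(Y, na.rm = TRUE)
print(total)

# Mean over the non-missing entries only.
manual_mean <- total / sum(!is.na(Y))
print(manual_mean)
```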
When writing code, you will often find that you are using the same lines of code more than once. In these cases it can help readability and usability to create your own function for the repeated lines of code.
Below is a simple function showcasing the ability to use custom inputs and return a value after applying some transformations.
someFunction = function(input1, input2){
  output = (input1 + input2) * 5
  #return() sends this value back to the caller; without it, R returns the last evaluated expression
  return(output)
}
With your custom function initialized, you may now call upon it whenever you like.
someFunction(1,2)
> [1] 15
someFunction(5,10)
> [1] 75
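Custom functions can have optional arguments too, in the same way mean has na.rm. The sketch below is a hypothetical variant of the function above (scaledSum and multiplier are illustrative names) in which the multiplier has a default value and can be overridden.

```r
# 'multiplier' is optional because it has a default value of 5.
scaledSum <- function(input1, input2, multiplier = 5) {
  output <- (input1 + input2) * multiplier
  return(output)
}

# Uses the default multiplier.
default_result <- scaledSum(1, 2)
print(default_result)

# Overrides the default by naming the argument.
custom_result <- scaledSum(1, 2, multiplier = 10)
print(custom_result)
```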
R supports logical conditions:
Operator | Name | Example |
---|---|---|
== | Equal | x == y |
!= | Not equal | x != y |
> | Greater than | x > y |
< | Less than | x < y |
>= | Greater than or equal to | x >= y |
<= | Less than or equal to | x <= y |
It is important to distinguish between a single = and a double ==. The single = is assignment; the double == is logical comparison.
1==1
> [1] TRUE
1==0
> [1] FALSE
1>0
> [1] TRUE
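Comparisons also work element by element on vectors, which makes counting easy. The sketch below (variable names are illustrative) also shows a common pitfall: decimal numbers are stored approximately, so == can give a surprising FALSE, and all.equal() is the usual workaround.

```r
x <- c(1, 5, 10)

# Each entry is compared separately.
is_large <- x > 4
print(is_large)

# TRUE counts as 1, so sum() counts the matches.
n_large <- sum(is_large)
print(n_large)

# Exact comparison of decimals can fail; all.equal() allows a tiny tolerance.
exact  <- (0.1 + 0.2) == 0.3
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(c(exact, approx))
```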
We can use logical conditions in many ways; the most common is within if statements. The body of an if statement only executes if its condition is TRUE.
x = 1
if(1==1) {
  x = 2
}
x
> [1] 2
if(1==0) {
  x = 3
}
x
> [1] 2
We can also provide code that will run should the condition of the if statement be FALSE.
x = 1
if(1==0){
  x = 2 #IF TRUE
} else {
  x = 3 #IF FALSE
}
x
> [1] 3
We can add multiple conditions.
x = 1
if(1==0){
  x = 2 #Check one first
}else if(1==1){
  x = 3 #Check two second
}else{
  x = 4 #If neither is TRUE
}
x
> [1] 3
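In practice the conditions test variables rather than constants. The sketch below wraps an if / else if / else chain in a hypothetical function named classify, and then uses ifelse(), which applies a condition to every entry of a vector at once.

```r
# if / else if / else chain on a single value.
classify <- function(n) {
  if (n < 0) {
    label <- "negative"
  } else if (n == 0) {
    label <- "zero"
  } else {
    label <- "positive"
  }
  return(label)
}

print(classify(-2))
print(classify(0))
print(classify(7))

# ifelse() is the vectorized form: one result per entry.
signs <- ifelse(c(-2, 0, 3) > 0, "positive", "not positive")
print(signs)
```

An ordinary if statement expects a single TRUE or FALSE, so ifelse() is the tool to reach for when the condition is a whole vector.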
R supports logical operators as well:
Operator | Description |
---|---|
& | Element-wise Logical AND operator |
&& | Logical AND operator - Returns TRUE if both statements are TRUE |
| | Element-wise Logical OR operator. |
|| | Logical OR operator. It returns TRUE if one of the statements is TRUE. |
! | Logical NOT - returns FALSE if statement is TRUE |
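Note the difference between a single & and a double &&. The single & is used more commonly when dealing with vectors, as it compares them element by element and returns a vector of results. The double && expects a single TRUE/FALSE value on each side and returns a single value, which makes it the usual choice inside if conditions.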
1||0
> [1] TRUE
1&&0
> [1] FALSE
!1||0
> [1] FALSE
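The element-wise and single-value operators can be compared side by side. In this sketch (variable names are illustrative), & and | return one result per entry, any() and all() collapse a logical vector to a single value, and && stops evaluating as soon as the answer is known.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

# Element-wise: one result per pair of entries.
both   <- a & b
either <- a | b
print(both)
print(either)

# Collapse a logical vector to a single TRUE/FALSE.
print(any(both))
print(all(either))

# && short-circuits: the right-hand side is never evaluated here,
# so the stop() call does not raise an error.
short <- FALSE && stop("never evaluated")
print(short)
```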
We cover two types of loops in R: while loops and for loops.
While loops execute as long as a condition remains TRUE.
x = 1
while(x < 5){
  print(x)
  x = x + 1
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
It is important to be careful with loops. For example, forgetting to increment x can cause an infinite loop.
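One defensive pattern, sketched here under the assumption that a fixed upper limit on iterations is acceptable, is to add a counter to the loop condition so the loop ends even if the main condition never becomes FALSE. The names iterations and max_iterations are illustrative.

```r
x <- 1
iterations <- 0
max_iterations <- 100   # safety cap chosen arbitrarily for this sketch

# The loop stops when x reaches 5 or when the cap is hit, whichever is first.
while (x < 5 && iterations < max_iterations) {
  x <- x * 2
  iterations <- iterations + 1
}

print(x)
print(iterations)
```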
For loops allow for iterating over a sequence. The for statement will execute its code once for each item.
for(x in 1:5){
  print(x)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
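A for loop can iterate over any vector, not only a range of integers. The sketch below (with illustrative names) loops over the entries of a character vector directly, and then over their positions with seq_along() when the index is needed to store results.

```r
fruits <- c("apple", "pear", "plum")

# Loop over the entries themselves.
for (fruit in fruits) {
  print(fruit)
}

# Loop over positions to fill a result vector of the same length.
name_lengths <- integer(length(fruits))
for (i in seq_along(fruits)) {
  name_lengths[i] <- nchar(fruits[i])
}
print(name_lengths)
```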
The break and next statements allow for more control within our loops. break will stop the loop entirely, and next will skip the current iteration without stopping the remaining ones.
for(x in 1:5){
  if(x==4) break
  print(x)
}
> [1] 1
> [1] 2
> [1] 3
for(x in 1:5){
  if(x==4) next
  print(x)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 5
data = read.csv("Datasets/families.csv")
head(data)
> TYPE PERSONS CHILDREN INCOME REGION EDUCATION
> 1 1 2 0 43450 1 39
> 2 1 2 0 79000 1 40
> 3 1 2 0 51306 1 39
> 4 1 4 2 24850 1 41
> 5 1 4 2 65145 1 43
> 6 3 3 2 23015 1 40
#Count number of families with more than two people
count = 0
for(x in 1:nrow(data)){
  if(data[x, "PERSONS"]>2) count = count + 1
}
count
> [1] 25335
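The same count can usually be obtained without a loop by comparing the whole column at once and summing the TRUE values. The sketch below uses a small stand-in data frame named families, holding only the six PERSONS values shown in the head() output above, since the full file is not reproduced here.

```r
# Stand-in data: the first six PERSONS values from the output above.
families <- data.frame(PERSONS = c(2, 2, 2, 4, 4, 3))

# Loop version, as in the example above.
count_loop <- 0
for (i in seq_len(nrow(families))) {
  if (families[i, "PERSONS"] > 2) count_loop <- count_loop + 1
}

# Vectorized version: compare the whole column, then count the TRUEs.
count_vec <- sum(families$PERSONS > 2)

print(count_loop)
print(count_vec)
```

On large datasets the vectorized form is typically shorter to write and faster to run.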
R and RStudio have complete documentation on all R functions. The lower right pane in RStudio has a help tab you can use. The help contains a lot of information, so you will have to learn to filter out what you don’t need. Try to use the R Documentation before taking your query to your favourite search engine.
# You can also access the help directly from the console.
help(mean)
# or
?mean