R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.
There are many different ways we can subset any kind of object, and three different subsetting operators for different data structures.
Let’s start by examining subsetting in the simplest data structure, the vector.
Subsetting a vector always returns another vector.
First, let’s create a vector:
x <- 4:7
x
[1] 4 5 6 7
[ and element indices
To extract elements of a vector we can use the square bracket operator ([) and the target element index, starting from one (as R is a 1-indexed language):
x[1]
[1] 4
x[4]
[1] 7
It may look different, but the square brackets operator is a function and means “get me the nth element”.
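To illustrate that the bracket operator really is a function, here is a minimal sketch that calls it with ordinary function syntax. The variable name first is only an illustrative choice.

```r
x <- 4:7

# Backticks let us refer to the operator by name and call it like any function
first <- `[`(x, 1)

# Both spellings return the same value
identical(first, x[1])
```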
If we ask for an index beyond the length of the vector, R will return a missing value (NA):
x[6]
[1] NA
If we ask for the 0th element, we get an empty vector:
x[0]
integer(0)
We can also ask for multiple elements at once:
x[c(1, 3)]
[1] 4 6
Or slices of the vector:
x[2:4]
[1] 5 6 7
We can ask for the same element multiple times:
x[c(1,1,3)]
[1] 4 4 6
If we use a negative number as the index of a vector, R will return every element except for the one specified:
x[-2]
[1] 4 6 7
We can skip multiple elements:
x[c(-1, -5)] # or x[-c(1,5)]
[1] 5 6 7
In general, be aware that the result of subsetting using indices could change if the vector is reordered.
If the vector has a name attribute, we can subset the vector more precisely using the element’s name.
Subsetting using names is the most robust way to extract elements. The position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!
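As a minimal sketch of name-based subsetting, the example below assumes the elements of x are named a to d, which is what the printed outputs in the rest of this section suggest. The names by_name and reordered are illustrative only.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")  # assumed names, consistent with later output

# Select elements by name rather than by position
by_name <- x[c("a", "c")]
by_name

# Reversing the vector changes what position 1 refers to...
reordered <- rev(x)
reordered[1]

# ...but the name "a" still finds the same value
reordered["a"]
```

Positional indices follow the order of the vector, while names travel with the values, which is why they survive reordering.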
We can also use any logical vector to subset:
x[c(FALSE, FALSE, TRUE, TRUE)]
c d
6 7
Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.
x[x > 5]
c d
6 7
Breaking it down, this statement first evaluates x > 5, generating a logical vector c(FALSE, FALSE, TRUE, TRUE), and then selects the elements of x corresponding to the TRUE values.
We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):
x[names(x) == "a"]
a
4
Avoid using == to compare numbers unless they are integers! See function dplyr::near() instead.
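The sketch below shows why exact comparison of non-integer numbers is fragile. It uses base R (all.equal() and an explicit tolerance) rather than dplyr::near() so that it runs without extra packages; the tolerance 1e-8 and the variable names are assumptions made for illustration.

```r
a <- sqrt(2)^2

# Typically FALSE: the result carries a tiny floating point rounding error
a == 2

# TRUE: all.equal() compares within a small tolerance
isTRUE(all.equal(a, 2))

# The same idea used for subsetting: compare against a tolerance, not with ==
v <- c(0.1 + 0.2, 0.3, 0.5)
is_close <- abs(v - 0.3) < 1e-8
v[is_close]
```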
We might also want to subset using a vector of candidate values, some of which may have no match in x.
In this case we can use %in%:
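A minimal sketch of this, assuming x is named a to d as in the surrounding outputs. The vector wanted is hypothetical and deliberately includes a name that does not exist in x.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")  # assumed names

wanted <- c("a", "c", "e")         # "e" has no match in names(x)

# %in% returns one TRUE/FALSE per name of x, never NA
keep <- names(x) %in% wanted
keep

# Only the matching elements are returned; the unmatched "e" is simply ignored
x[keep]
```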
Excluding or removing named elements is a little harder.
If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:
x[-"a"]
Error in -"a": invalid argument to unary operator
However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:
x[names(x) != "a"]
b c d
5 6 7
Excluding multiple named indices requires a different tactic, though.
To perform such a subset robustly, we need to combine %in% and !.
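One way to write that combination is sketched here, again assuming x is named a to d. The name remaining is illustrative.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")  # assumed names

# %in% binds more tightly than !, so this reads as !(names(x) %in% c("a", "c"))
remaining <- x[!names(x) %in% c("a", "c")]
remaining
```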
The expression names(x) %in% c("a","c") checks whether each name of x matches any of the values in c("a","c"), returning TRUE where it does. The ! then negates the selection, returning only the elements whose names are not contained in c("a","c").
As matrices are just vectors with two dimensions, all the subsetting operations using [ can also be applied to matrices.
Let’s create a matrix
m <- matrix(1:12, ncol=4, nrow=3)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Indexing matrices with [ takes two arguments: the first expression is applied to the rows, the second to the columns.
Say we want the 2nd and 3rd rows of the last and first columns (in that order) of our matrix. We can use all the subsetting we learned for vectors and apply it to each dimension of our matrix.
m[2:3, c(4,1)]
     [,1] [,2]
[1,]   11    2
[2,]   12    3
We can leave the first or second arguments blank to retrieve all the rows or columns respectively:
m[, c(2,3)]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
m[c(2,3),]
     [,1] [,2] [,3] [,4]
[1,]    2    5    8   11
[2,]    3    6    9   12
If we only access one row or column, R will automatically convert the result to a vector:
m[3,]
[1] 3 6 9 12
If we want to keep the output as a matrix, we need to specify a third argument, drop = FALSE:
m[3, , drop=FALSE]
     [,1] [,2] [,3] [,4]
[1,]    3    6    9   12
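The effect of drop can be checked directly. This short sketch compares the two results; the names row_vec and row_mat are illustrative.

```r
m <- matrix(1:12, ncol = 4, nrow = 3)

row_vec <- m[3, ]                # dimensions are dropped
row_mat <- m[3, , drop = FALSE]  # dimensions are kept

is.matrix(row_vec)  # FALSE: a plain integer vector
dim(row_vec)        # NULL
dim(row_mat)        # 1 4: still a matrix with one row
```

Keeping the dimensions is mostly useful in code that expects a matrix regardless of how many rows were selected.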
Tip: Higher dimensional arrays
When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, in a 3D array, the first three arguments correspond to the row, column, and depth dimensions.
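As a small sketch of the tip above, the hypothetical array arr below has 2 rows, 3 columns and 4 layers, filled with the numbers 1 to 24 in column-major order.

```r
arr <- array(1:24, dim = c(2, 3, 4))

# Row 1, column 2, layer 3
arr[1, 2, 3]    # 15

# Leaving arguments blank keeps the whole dimension: the full first layer
layer1 <- arr[, , 1]
dim(layer1)     # 2 3
```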
There are three functions used to subset lists and extract individual elements: [, [[, and $.
Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.
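The list examples in this section use an object called xlist whose definition is not shown. Judging from the printed output, it could have been built roughly as follows; this construction is an assumption, not taken from the text.

```r
# Assumed definition, reconstructed from the output printed in this section:
# a single string, the integers 1 to 10, and the first six rows of iris
xlist <- list(a = "ACCE DTP Course", b = 1:10, data = head(iris))

names(xlist)
```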
As with vectors, we can use element indices and [ to subset lists.
xlist[1]
$a
[1] "ACCE DTP Course"
This returns a list with one element.
We can use multiple indices to subset multiple list elements:
xlist[1:2]
$a
[1] "ACCE DTP Course"
$b
[1] 1 2 3 4 5 6 7 8 9 10
We can also use names:
xlist[c("a", "b")]
$a
[1] "ACCE DTP Course"
$b
[1] 1 2 3 4 5 6 7 8 9 10
Using a single [ accesses the list as if it were a vector and returns a list.
Comparison operations involving the contents of list elements, however, won’t work, as the contents are not accessible at the level of [ indexing.
Extracting individual elements allows us to access the objects contained in a list, which can be any type of object. Hence the result depends on the object each element contains.
To extract individual elements of a list, we use the double-square bracket function: [[.
Again we can use element indices to extract the object contained in an element.
xlist[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
Notice that now the result is a vector, not a list, which is what the second element contained.
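The difference between the two operators can be made explicit by inspecting what each one returns. The sketch assumes xlist is defined as the printed output suggests.

```r
# Assumed definition of xlist, reconstructed from the output in this section
xlist <- list(a = "ACCE DTP Course", b = 1:10, data = head(iris))

class(xlist[2])     # "list": a one-element list that still wraps the vector
class(xlist[[2]])   # "integer": the vector itself

length(xlist[2])    # 1
length(xlist[[2]])  # 10
```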
You can’t extract more than one element at once:
xlist[[1:2]]
Error in xlist[[1:2]]: subscript out of bounds
Nor use it to skip elements:
xlist[[-1]]
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
We can however use single names to extract elements:
xlist[["a"]]
[1] "ACCE DTP Course"
$ operator
The $ operator is a shorthand way for extracting single elements by name:
xlist$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Given the xlist list used above, and using your knowledge of both list and vector subsetting, extract the number 2 from xlist.
Hint: the number 2 is contained within the “b” item in the list.
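One possible approach to this exercise is sketched below. It assumes xlist is defined as the earlier output suggests, and shows three equivalent spellings; the names by_dollar, by_name and by_index are illustrative.

```r
# Assumed definition of xlist, reconstructed from the output in this section
xlist <- list(a = "ACCE DTP Course", b = 1:10, data = head(iris))

# First extract the "b" element (a vector), then take its second value
by_dollar <- xlist$b[2]
by_name   <- xlist[["b"]][2]
by_index  <- xlist[[2]][2]

by_dollar
```

Each version works in two steps: list extraction returns the integer vector, and ordinary vector subsetting then picks the value.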
Data frames are lists under the hood, so similar subsetting rules apply. However, they are also two-dimensional objects.
[ to subset
Using the [ operator with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data.frame:
trees[1]
Girth
1 8.3
2 8.6
3 8.8
4 10.5
5 10.7
6 10.8
7 11.0
8 11.0
9 11.1
10 11.2
11 11.3
12 11.4
13 11.4
14 11.7
15 12.0
16 12.9
17 12.9
18 13.3
19 13.7
20 13.8
21 14.0
22 14.2
23 14.5
24 16.0
25 16.3
26 17.3
27 17.5
28 17.9
29 18.0
30 18.0
31 20.6
trees["Girth"]
Girth
1 8.3
2 8.6
3 8.8
4 10.5
5 10.7
6 10.8
7 11.0
8 11.0
9 11.1
10 11.2
11 11.3
12 11.4
13 11.4
14 11.7
15 12.0
16 12.9
17 12.9
18 13.3
19 13.7
20 13.8
21 14.0
22 14.2
23 14.5
24 16.0
25 16.3
26 17.3
27 17.5
28 17.9
29 18.0
30 18.0
31 20.6
[[ to extract
Similarly, [[ will act to extract a single column as a vector:
trees[[1]]
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
trees[["Girth"]]
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
And $ provides a convenient shorthand to extract columns by name:
trees$Girth
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
With two arguments, [ behaves the same way as for matrices:
trees[1:5, c("Girth", "Volume")]
Girth Volume
1 8.3 10.3
2 8.6 10.3
3 8.8 10.2
4 10.5 16.4
5 10.7 18.8
If we subset a single row, the result will be a data.frame (because a row can contain elements of mixed types):
trees[3,]
Girth Height Volume
3 8.8 63 10.2
But for a single column the result will be a vector.
trees[, "Girth"]
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
This can be changed with the third argument, drop = FALSE.
trees[, "Girth", drop = FALSE]
Girth
1 8.3
2 8.6
3 8.8
4 10.5
5 10.7
6 10.8
7 11.0
8 11.0
9 11.1
10 11.2
11 11.3
12 11.4
13 11.4
14 11.7
15 12.0
16 12.9
17 12.9
18 13.3
19 13.7
20 13.8
21 14.0
22 14.2
23 14.5
24 16.0
25 16.3
26 17.3
27 17.5
28 17.9
29 18.0
30 18.0
31 20.6
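To summarise the data frame cases above, this sketch checks the class of each result using the built-in trees dataset. It only inspects classes and dimensions and does not change the data.

```r
# One argument: list-style subsetting, the result stays a data.frame
class(trees[1])                        # "data.frame"

# [[ and $ extract the column itself
class(trees[[1]])                      # "numeric"
class(trees$Girth)                     # "numeric"

# Two arguments: a single column is dropped to a vector by default
class(trees[, "Girth"])                # "numeric"

# drop = FALSE keeps the data.frame structure
one_col <- trees[, "Girth", drop = FALSE]
class(one_col)                         # "data.frame"
dim(one_col)                           # 31 1
```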