7 Subsetting columns
Using select() to subset columns
A very common task when working with dataframes is to keep or remove particular columns. In R, we can do this with the verb select() from the dplyr package. Within the select() function, we first enter the dataframe name, and then the instructions on what to select.
There are lots of different instructions we can give to the select() function to select our desired columns.
Selecting by name
One simple way to select columns is by the column names. After specifying the dataset, simply enter each column you want to keep, separated by a comma. To select the Age and Sex columns from the titanic_df dataset, use the code below.
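A minimal sketch of selecting by name, assuming the dplyr package is installed. The small titanic_df built here is a made-up stand-in with hypothetical values, not the real dataset:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  Pclass = c(3, 1, 3),
  Name   = c("Passenger A", "Passenger B", "Passenger C"),
  Sex    = c("male", "female", "female"),
  Age    = c(22, 38, 26)
)

# Dataframe first, then the columns to keep, separated by commas
age_sex <- select(titanic_df, Age, Sex)
print(age_sex)
```

The columns come back in the order they are listed inside select(), not in the order they appear in the dataframe.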
Try it yourself: Select the Pclass and Name columns from the titanic_df dataset.
Deselecting
We can also remove (deselect) columns, using a minus (-) sign:
To select everything except the Age column, do the following:
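As a rough sketch, assuming dplyr is installed and using a made-up stand-in for titanic_df (hypothetical values), dropping the Age column might look like this:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  Pclass = c(3, 1, 3),
  Name   = c("Passenger A", "Passenger B", "Passenger C"),
  Sex    = c("male", "female", "female"),
  Age    = c(22, 38, 26)
)

# The minus sign removes Age and keeps every other column
no_age <- select(titanic_df, -Age)
print(names(no_age))
```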
Try it yourself: below, select everything except the Pclass column.
Selecting by position
You can also select columns by their position. Enter the positions as numbers, separated by commas. This selects columns 1 and 5:
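A small sketch of selecting by position, assuming dplyr is installed. The stand-in titanic_df below is made up, and its column order is an assumption, so the positions may point at different columns in the real dataset:

```r
library(dplyr)

# Made-up stand-in for titanic_df; values and column order are assumptions
titanic_df <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female")
)

# Keep the first and the fifth column
first_and_fifth <- select(titanic_df, 1, 5)
print(names(first_and_fifth))
```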
Exercises:
Edit this code cell to select the Name, Age, and Sex columns in the titanic_df dataframe, by position.
Deselect the fifth column in the titanic_df dataframe.
Select a range
You can also select a range of columns using a colon (:). The below command selects columns 1 through to 5:
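To illustrate, here is a sketch that assumes dplyr is installed and uses a made-up six-column stand-in for titanic_df (hypothetical values and column order):

```r
library(dplyr)

# Made-up stand-in for titanic_df; values and column order are assumptions
titanic_df <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, 26)
)

# 1:5 covers columns 1, 2, 3, 4 and 5 (both ends included)
first_five <- select(titanic_df, 1:5)
print(names(first_five))
```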
Combine select instructions
You can also combine multiple select instructions. This will select the first to the third columns, and the eighth column:
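One way this could look, assuming dplyr is installed and using a made-up eight-column stand-in for titanic_df (hypothetical values and column order):

```r
library(dplyr)

# Made-up stand-in for titanic_df; values and column order are assumptions
titanic_df <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, 26),
  SibSp       = c(1, 1, 0),
  Parch       = c(0, 0, 0)
)

# A range (columns 1 to 3) combined with a single position (column 8)
combined <- select(titanic_df, 1:3, 8)
print(names(combined))
```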
If you have programmed with Python before, this method of selecting using two colon-separated numbers is similar to something called list slicing. There are two important differences. First, Python starts counting from zero, not 1. Second, a range like this in Python includes everything up to but excluding the end number. So in Python, the range 1:5 would give the second, third, fourth, and fifth items in the sequence, whereas in R it gives the first through to the fifth.
Tidyselect functions
You can also select columns using more flexible methods. There is a group of ‘tidyselect’ helpers that select columns based on patterns in their names. The most useful are:
starts_with()
ends_with()
contains()
First, starts_with() will look for any columns whose names start with some chosen text. It works as follows:
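A sketch of starts_with(), assuming dplyr is installed. The prefix "S" and the made-up stand-in for titanic_df are chosen only for illustration:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, 26),
  SibSp       = c(1, 1, 0)
)

# Keep every column whose name begins with "S"
s_columns <- select(titanic_df, starts_with("S"))
print(names(s_columns))
```

By default the helper ignores case, so "s" and "S" match the same columns.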
Try it yourself: In the cell below, write a select function using starts_with() that selects the PassengerID and Parch columns, but not the Pclass column.
ends_with() works in a similar way, selecting columns which end with chosen characters:
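For instance, assuming dplyr is installed and using a made-up stand-in for titanic_df (hypothetical values), a sketch with the illustrative suffix "d":

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Age         = c(22, 38, 26),
  Embarked    = c("S", "C", "S")
)

# Keep every column whose name ends with "d"
d_columns <- select(titanic_df, ends_with("d"))
print(names(d_columns))
```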
Try it yourself: In the cell below, write a select function that selects the Age, Fare, and Name columns.
Lastly, contains() will select any columns which contain a given pattern of letters:
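A brief sketch of contains(), assuming dplyr is installed. The pattern "ss" and the made-up stand-in for titanic_df are illustrative choices only:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  PassengerId = 1:3,
  Pclass      = c(3, 1, 3),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, 26),
  Fare        = c(7.25, 71.28, 7.92)
)

# Keep every column whose name contains "ss" anywhere
ss_columns <- select(titanic_df, contains("ss"))
print(names(ss_columns))
```

Note that the helper matches against the column names, not the values stored in the columns, so the passenger names themselves are not searched.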
To rename columns, we can use the rename() function with the syntax rename(dataset, new_name = old_name). For example, to rename the Name column to passenger_names in the Titanic dataframe, we can use:
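A minimal sketch of this rename, assuming dplyr is installed and using a made-up stand-in for titanic_df (hypothetical values):

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  Pclass = c(3, 1, 3),
  Name   = c("Passenger A", "Passenger B", "Passenger C"),
  Age    = c(22, 38, 26)
)

# New name on the left, existing name on the right
renamed <- rename(titanic_df, passenger_names = Name)
print(names(renamed))
```

Unlike select(), rename() keeps all of the other columns and leaves their order unchanged.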
If you want to, you can rename and select columns in one go. To do this, use the same new_name = old_name syntax as rename(), but inside the select() function. The following will select the Name and Pclass columns, and rename them:
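A sketch of selecting and renaming together, assuming dplyr is installed. The new name passenger_class is a hypothetical choice for illustration, and the titanic_df here is a made-up stand-in:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  Pclass = c(3, 1, 3),
  Name   = c("Passenger A", "Passenger B", "Passenger C"),
  Age    = c(22, 38, 26)
)

# Only the listed columns are kept, each under its new name
selected <- select(titanic_df,
                   passenger_names = Name,
                   passenger_class = Pclass)
print(names(selected))
```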
However, note that nothing will actually change in the original dataset until you save the result as an object in your environment, which we will look at when we move to Posit Cloud.
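As a preview, here is a sketch of saving a result with the assignment arrow, assuming dplyr is installed. The object name titanic_names is hypothetical, and titanic_df is a made-up stand-in:

```r
library(dplyr)

# Made-up stand-in for titanic_df (hypothetical values, not the real data)
titanic_df <- data.frame(
  Pclass = c(3, 1, 3),
  Name   = c("Passenger A", "Passenger B", "Passenger C"),
  Age    = c(22, 38, 26)
)

# select() returns a new dataframe; store it with <- to keep it
titanic_names <- select(titanic_df, Name, Pclass)

# The original dataframe still has all of its columns
print(names(titanic_df))
print(names(titanic_names))
```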