18 Joins
left_join(), inner_join(), anti_join()
Another set of commands we’ll use quite often in this course are the join()
‘family’. Joins are a very powerful but simple way of selecting certain subsets of data, and adding information from multiple tables together.
Remember the titanic dataset had a column ‘embarked’, which contained either Q, S or C, which told where the passenger got on?
Let’s say we wanted to display the full name of the embarking location, plus some other information. We have another dataset, called locations_df, which contains the code and the equivalent full name, plus the country name and coordinates of the location. You can see what this looks like by running the cell below:
We can combine these two tables of information, so that they are merged, or joined together - the result will be that the correct extra information for each place code will be added to the titanic_df data.
This is done using a set of functions called joins.
Joining, or using relational data, is a really fundamental concept to many programming languages and data models. Excel has a similar feature which they call pivot tables. Even if you never use R or make a visualisation again, understanding the principles behind this is incredibly useful if you ever work with data in your future career.
In R, we can do these joins using the functions left_join()
, right_join()
and inner_join()
. These are variations on the same thing, and they differ by the way that they merge the two datasets together.
Joins need a common key, a column which allows the join to match the data tables up. It’s important that these are unique (a person’s name makes a bad key by itself, for example, because it’s likely more than one person will share the same name). Usually, we use codes as the join keys. If the columns containing the join keys have different names (as ours do), specify them using the syntax below:
The join function is a bit complex, so it’s worth looking at what it is doing.
Exercises
- Calculate the number of men and women which embarked from each country, first merging the new table with the existing one.