9  Ordering and sorting dataframes

Ordering data

Another common task is to put our data in a chosen order. This is particularly useful in exploratory data analysis, where we want to see what the data looks like and which of its properties might be worth analysing.

The arrange() function

We can sort data using the verb arrange(). As with select() and slice(), the first argument to the function is the dataset, followed by the column or columns we want to sort the data by. For example, this will sort the titanic_df dataframe by the column Age:
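A minimal sketch of that call is shown here. The small titanic_df built below is a made-up stand-in with illustrative rows, not the real dataset, and the Name column is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for titanic_df: a few made-up rows, not the real data
titanic_df <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Age  = c(38, 22, 54, 2)
)

# Sort the rows by Age; ascending order is the default
sorted_by_age <- arrange(titanic_df, Age)
sorted_by_age
```

With real data, note that arrange() always places missing values (NA) at the end of the result.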

Sorting is ascending by default, but you can specify descending by using a minus sign before the column you want to sort by:
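As a rough illustration of the minus-sign form, again using a small made-up stand-in for titanic_df (the rows and the Name column are assumptions, not the real data):

```r
library(dplyr)

# Hypothetical stand-in for titanic_df, including one missing Age
titanic_df <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Age  = c(38, 22, NA, 54)
)

# A minus sign before a numeric column sorts it in descending order
oldest_first <- arrange(titanic_df, -Age)
oldest_first
```

The row with the missing Age still ends up last, because arrange() sorts NA values to the end in either direction.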

What happens if you sort by a character column instead of a numeric column?

The minus sign only works on numeric columns, because R cannot negate text. If we want reverse order in a character column, we need to wrap the column in a helper function, desc().
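A short sketch of desc() on a character column follows. The Name column and the surnames below are illustrative assumptions rather than rows taken from the real titanic_df.

```r
library(dplyr)

# Hypothetical stand-in for titanic_df with a character column
titanic_df <- tibble(
  Name = c("Abbott", "Zimmerman", "Moran", "Brown"),
  Age  = c(35, 29, 41, 18)
)

# desc() reverses the sort order, so names run from Z to A
reverse_alpha <- arrange(titanic_df, desc(Name))
reverse_alpha
```

desc() also works on numeric columns, so it can be used in place of the minus sign everywhere if a single style is preferred.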

Look at the result of the name column now. I’ve included 20 rows rather than the usual 5. Do you notice anything strange? Why are van Melkebeke and del Carlo placed before Zimmerman?

In recent versions of dplyr, arrange() sorts text using the C locale by default, in which every uppercase letter sorts before every lowercase letter. In descending order this is reversed, so names that start with a lowercase letter appear first. This is something we would need to fix if we were working properly with the data, for instance by converting everything to lowercase before sorting.
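One way to apply the lowercase fix is to sort on a lowercased copy of the column without changing the stored values. This is only a sketch: the Name column is an assumption, and the four surnames are a tiny illustrative sample.

```r
library(dplyr)

# Hypothetical stand-in for titanic_df with mixed-case surnames
titanic_df <- tibble(
  Name = c("Zimmerman", "van Melkebeke", "del Carlo", "Abbott")
)

# Sort on tolower(Name) so that case no longer affects the order;
# the Name column itself keeps its original capitalisation
case_insensitive <- arrange(titanic_df, desc(tolower(Name)))
case_insensitive
```

Sorting on an expression like this leaves the data untouched, which is usually preferable to overwriting the names with lowercase versions.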

You can sort by multiple things. Before you run the code, what do you think the following does?
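One possible example of sorting by more than one column is sketched below. The choice of columns (Pclass, then descending Age) is an assumption made for illustration, as are the made-up rows standing in for titanic_df.

```r
library(dplyr)

# Hypothetical stand-in for titanic_df: made-up rows, assumed column names
titanic_df <- tibble(
  Name   = c("Passenger A", "Passenger B", "Passenger C", "Passenger D", "Passenger E"),
  Pclass = c(3, 1, 3, 1, 2),
  Age    = c(22, 38, 4, 54, 30)
)

# Columns are applied left to right: rows are ordered by Pclass first,
# and rows with the same Pclass are then ordered by Age, oldest first
by_class_then_age <- arrange(titanic_df, Pclass, desc(Age))
by_class_then_age
```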

Exercise

  • Sort by the Fare column. What is unusual about this and why? How do you think you could fix it? (Hint: use glimpse() to look at the data types)

  • Use arrange() to find the 5 oldest male passengers.

  • Use one of the slice_ functions (for example slice_min()) to find the 20 youngest passengers, and put them in alphabetical order.