Chapter 2 Data wrangling comparative data in R
In the Primer, we took some data on chameleons and manipulated/wrangled it so it was ready for analyses. In this exercise we will show you how we did this in R. We use functions from a collection of packages called the tidyverse - these are amazing for data manipulation etc. It’s beyond the scope of this Primer to cover these in detail, but we will show you what we did to get the tidied dataset in the book.
Before you start
- Open the
02-DataWrangling.RProj
file in the02-DataWrangling
folder to open your R Project for this exercise.
You will also need to install the following packages:
tidyverse
- for reading, manipulating and plotting datameasurements
- to convert inches to cm
2.1 Chameleon data
Let’s imagine we have the following three sources of chameleon body length and life history data.
SOURCE 1: Picard et al. 2020.
- Chamaeleo chamaeleon. 245 mm, 80 eggs, sexually dimorphic.
- Brookesia minima. 33 mm, 2 eggs, sexually dimorphic.
- Calumma parsonii. 650 mm, 50 eggs, sexually dimorphic.
SOURCE 2: Janeway et al. 1995.
- Chamaeleo chamaeleon. Up to 250 cm, Up to 100 eggs.
- Brookesia minima. Up to 34 mm, 2 eggs.
- Calumma parsonii. Up to 695 mm, Up to 50 eggs.
SOURCE 3: Kirk et al. 1966.
- Chamaeleo chamaeleon. Approximately 10 inches.
- Brookesia minima. Approximately 1 inch.
- Calumma parsonii. Approximately 24 inches.
We would tend to record these data in a table (probably in Excel or another spreadsheet program). Here we have saved this file as chameleon-data-raw.csv
We would also record the full reference for each source in a separate table. Recording the data like this makes it easy to enter in Excel, and also makes it easy for us to exclude certain types of data or sources should we decide we don’t trust them enough to include in our analyses.
Note that this will not be the ideal format for using the data in R, so you will need to manipulate or wrangle the data first to make it tidy. In R, the tidyr
and dplyr
packages are useful for doing this. Or you could use something like Excel if you are less confident with R. Make sure to keep the raw data too in case you accidentally introduce any errors, and so that you can repeat the data collection and analyses if needed.
Both dplyr
and tidyr
packages are part of the tidyverse
so we can save some time by loading them both at the same time via library(tidyverse)
# Load packages
library(tidyverse)
Next we need to read in the chameleon data from our folder. This is saved in the file chameleon-data-raw.csv
so we can read it in using read_csv
as usual.
# Read in the data
<- read_csv("data/chameleon-data-raw.csv") chameleon
## Rows: 18 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Species, Measurement, Type, Units, Quality, Source
## dbl (1): Value
## lgl (1): Approx
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When we read in data we get output that tells us how R has interpreted each column. Here it thinks Species
is a character (i.e. words), whereas Value
is col_double
which means it’s a number with decimal places. These all look right to me, but if something looks off here (i.e. characters where you were expecting numbers) it’s worth checking your data to make sure you haven’t made any errors. To save having lots of non-needed output I often suppress these messages in this book, though they’ll pop up if you run the code yourself.
# Take a look at it
glimpse(chameleon)
## Rows: 18
## Columns: 8
## $ Species <chr> "Chamaeleo chamaeleon", "Chamaeleo chamaeleon", "Chamaeleo chamaeleon", …
## $ Measurement <chr> "length", "clutch size", "dimorphic", "length", "clutch size", "dimorphi…
## $ Value <dbl> 245, 80, 1, 33, 2, 1, 650, 50, 1, 250, 100, 34, 2, 695, 50, 10, 1, 24
## $ Type <chr> "mean", "mean", "truefalse", "mean", "mean", "truefalse", "mean", "mean"…
## $ Units <chr> "mm", "eggs", NA, "mm", "eggs", NA, "mm", "eggs", NA, "mm", "eggs", "mm"…
## $ Quality <chr> "high", "high", "high", "high", "high", "high", "high", "high", "high", …
## $ Approx <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ Source <chr> "Picard et al 2020", "Picard et al 2021", "Picard et al 2022", "Picard e…
To wrangle this data, we would first convert the units to a standard unit for each measurement. Here we have length in mm and inches, so we would convert the inches into mm. The three altered rows are shown below. Note that converting from inches to mm gives us the false impression of precision, i.e. two values are now 25.4 and 609.6, rather than whole numbers making it look like these were measured accurately rather than approximated. It’s worth looking out for this in other data, especially curated databases where this kind of conversion is typical.
We can do this easily using the function conv_unit
in the package measurements
. Let’s load the library here so we can do this.
# Load packages
library(measurements)
Now let’s use conv_unit
and some dplyr
functions to convert all inches to mm.
# Create a new dataset called chameleon2
<-
chameleon2 # Start with the chameleon data
%>%
chameleon # Convert inches into mm within the Value column
mutate(Value = ifelse(Units == "inches", conv_unit(Value, "inch", "mm"), Value)) %>%
# Remove Units and Source columns
::select(-Units, -Source) dplyr
Here mutate
modifies the variable Value. The ifelse
statement tells R that if Units is in inches, we should use conv_unit
to convert from inch to mm, and place this value in the Value column. If Units is not in inches it just records the original value in Value.
Finally, select
is used to remove the Units and Source columns. These are important to keep in our raw data, but we don’t need them for our analyses, so it’s neater to remove them.
The observant among you may have noticed I used
dplyr::select
not just select
in the code
above. This tells R to look in the package dplyr
and use
the select
function from that package. There are several
different packages with select
functions, and I use another
one later in this book. To ensure that R knows to use the
dplyr
version here I have used the ::
.
# Just show the last three entries to check what we have done looks correct
tail(chameleon2, n = 3)
## # A tibble: 3 × 6
## Species Measurement Value Type Quality Approx
## <chr> <chr> <dbl> <chr> <chr> <lgl>
## 1 Chamaeleo chamaeleon length 254 mean medium TRUE
## 2 Brookesia minima length 25.4 mean medium TRUE
## 3 Calumma parsonii length 610. mean medium TRUE
Next we might decide to exclude low quality records, and remove any approximations. We might need to do something more complicated if we only have one record for a species and it is an approximation. In this case we are going to keep the approximations, and use all the data. But you should choose what is sensible for your dataset.
To make our data tidier and remove unnecessary columns we will use select
again to remove the Data_Quality and Approximation columns.
# Create a new dataset called chameleon3
<-
chameleon3 # Start with chameleon2
%>%
chameleon2 # Remove the Data Quality and Approximation columns
::select(-Quality, -Approx)
dplyr
# Look at the data
head(chameleon3)
## # A tibble: 6 × 4
## Species Measurement Value Type
## <chr> <chr> <dbl> <chr>
## 1 Chamaeleo chamaeleon length 245 mean
## 2 Chamaeleo chamaeleon clutch size 80 mean
## 3 Chamaeleo chamaeleon dimorphic NA truefalse
## 4 Brookesia minima length 33 mean
## 5 Brookesia minima clutch size 2 mean
## 6 Brookesia minima dimorphic NA truefalse
Finally, we want to summarise the data for each species, to get means/medians and possibly also minima and maxima. To summarise we use the function summarise
(or summarize
if you prefer the US spelling). But first we need to group the data into the groupings we want the summary data for. In this case we want to know the median values for each species, for each different measurement, and for each different type of measurement, i.e. averages, maximums, minimums etc. We can tell R to group data using group_by
. To do all of this we use the following code:
# Create a new dataset called chameleon4
<-
chameleon4 # Start with chameleon3
%>%
chameleon3 # Group the records together by Species, Measurement, and Data_Type
group_by(Species, Measurement, Type) %>%
# Get the median and max for Value in each group constructed by Species, Measurement, and Data_Type.
summarise(median = median(Value),
max = max(Value))
# Look at the output
head(chameleon4)
## # A tibble: 6 × 5
## # Groups: Species, Measurement [4]
## Species Measurement Type median max
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Brookesia minima clutch size max 2 2
## 2 Brookesia minima clutch size mean 2 2
## 3 Brookesia minima dimorphic truefalse NA NA
## 4 Brookesia minima length max 34 34
## 5 Brookesia minima length mean 29.2 33
## 6 Calumma parsonii clutch size max 50 50
Note that we now have a median and maximum value, for each measurement and data type for each species.
This is fine, but to run analyses in R we are going to want a different column for each of the measurement x data type combinations. To do this we can use the function pivot_wider
, because the change will make the table wider…
# Create a new dataset called chameleon5
<-
chameleon5 # Start with chameleon4
%>%
chameleon4 # Reshape the data so measurement X data types combinations are their own columns
pivot_wider(names_from = c(Measurement, Type), values_from = c(median, max))
# Look at the output
chameleon5
## # A tibble: 3 × 11
## # Groups: Species [3]
## Species media…¹ media…² media…³ media…⁴ media…⁵ max_c…⁶ max_c…⁷ max_d…⁸ max_l…⁹ max_l…˟
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brookesia m… 2 2 NA 34 29.2 2 2 NA 34 33
## 2 Calumma par… 50 50 NA 695 630. 50 50 NA 695 650
## 3 Chamaeleo c… 100 80 NA 250 250. 100 80 NA 250 254
## # … with abbreviated variable names ¹`median_clutch size_max`, ²`median_clutch size_mean`,
## # ³median_dimorphic_truefalse, ⁴median_length_max, ⁵median_length_mean,
## # ⁶`max_clutch size_max`, ⁷`max_clutch size_mean`, ⁸max_dimorphic_truefalse,
## # ⁹max_length_max, ˟max_length_mean
Finally, we probably don’t need all of these columns. The summary data you use and how you calculate it will vary depending on your inputs for example here we might want to extract:
- Median of mean length values. This will tell us about the overall mean value for length across all sources.
- Maximum of maximum length values. This will tell us about the overall maximum value for length across all sources.
- Median of mean clutch size values. This will tell us about the overall mean value for clutch size across all sources.
- Maximum of maximum clutch size values. This will tell us about the overall maximum value for clutch size across all sources.
- Median dimorphism score. This will tell use whether the species is predominantly considered to be dimorphic or not.
We can select these columns only using select
:
# Create a new dataset called chameleon6
<-
chameleon6 # Start with chameleon5
%>%
chameleon5 # Select only the required columns
::select(Species, median_length = median_length_mean, max_length = max_length_max,
dplyrmedian_clutchsize = `median_clutch size_mean`, max_clutchsize = `max_clutch size_max`,
dimorphic = median_dimorphic_truefalse)
# Look at the output
chameleon6
## # A tibble: 3 × 6
## # Groups: Species [3]
## Species median_length max_length median_clutchsize max_clutchsize dimorphic
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brookesia minima 29.2 34 2 2 NA
## 2 Calumma parsonii 630. 695 50 50 NA
## 3 Chamaeleo chamaeleon 250. 250 80 100 NA
Finally if this were a real dataset, we might want to save this clean and tidy version so that we don’t have to run all of this code again every time we want to use it. We can do this using the function write_csv
:
write_csv(x = chameleon6, file = "data/chameleon-data-for-analyses.csv")
Note that, to demonstrate how the code works we’ve separated it into chunks above, but you could use %>%
to do all of these in one go if you wanted to:
<-
chameleon_final # Start with chameleon
%>%
chameleon # Convert inches into mm within the Value column
mutate(Value = ifelse(Units == "inches", conv_unit(Value, "inch", "mm"), Value)) %>%
# Remove the Units, Source, Data Quality and Approximation columns
::select(-Units, -Source, -Quality, -Approx) %>%
dplyr# Group the records together by Species, Measurement, and Data_Type
group_by(Species, Measurement, Type) %>%
# Get the median and max for Value in each group constructed by Species, Measurement, and Data_Type.
summarise(median = median(Value),
max = max(Value)) %>%
# Reshape the data so measurement X data types combinations are their own columns
pivot_wider(names_from = c(Measurement, Type), values_from = c(median, max)) %>%
# Select only the required columns
::select(Species, median_length = median_length_mean, max_length = max_length_max,
dplyrmedian_clutchsize = `median_clutch size_mean`, max_clutchsize = `max_clutch size_max`,
dimorphic = median_dimorphic_truefalse)
# Look at the output
chameleon_final
## # A tibble: 3 × 6
## # Groups: Species [3]
## Species median_length max_length median_clutchsize max_clutchsize dimorphic
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brookesia minima 29.2 34 2 2 NA
## 2 Calumma parsonii 630. 695 50 50 NA
## 3 Chamaeleo chamaeleon 250. 250 80 100 NA
# Write to file
# I've commented this out as we already made this above
# write_csv(x = chameleon_final, file = "data/chameleon-data-for-analyses.csv")
2.2 Summary
This exercise should have introduced you to some basic skills for wrangling comparative data in R.
2.3 Practical exercise
Using the frog-eyes-raw-data.csv
dataset and R, create a new dataset that contains only the columns genus_species
, Family
, and two new columns containing the species mean values for rootmass
and eyemean
. We will talk more about this dataset in later exercises, but it comes from K. N. Thomas et al. (2020).