Chapter 2 Data wrangling comparative data in R

In the Primer, we took some data on chameleons and manipulated/wrangled it so it was ready for analyses. In this exercise we will show you how we did this in R. We use functions from a collection of packages called the tidyverse - these are amazing for data manipulation etc. It’s beyond the scope of this Primer to cover these in detail, but we will show you what we did to get the tidied dataset in the book.

Before you start

Open the 02-DataWrangling.RProj file in the 02-DataWrangling folder to open your R Project for this exercise.

You will also need to install the following packages:

tidyverse - for reading, manipulating and plotting data
measurements - to convert inches to cm

2.1 Chameleon data

Let’s imagine we have the following three sources of chameleon body length and life history data.

SOURCE 1: Picard et al. 2020.

Chamaeleo chamaeleon. 245 mm, 80 eggs, sexually dimorphic.
Brookesia minima. 33 mm, 2 eggs, sexually dimorphic.
Calumma parsonii. 650 mm, 50 eggs, sexually dimorphic.

SOURCE 2: Janeway et al. 1995.

Chamaeleo chamaeleon. Up to 250 cm, Up to 100 eggs.
Brookesia minima. Up to 34 mm, 2 eggs.
Calumma parsonii. Up to 695 mm, Up to 50 eggs.

SOURCE 3: Kirk et al. 1966.

Chamaeleo chamaeleon. Approximately 10 inches.
Brookesia minima. Approximately 1 inch.
Calumma parsonii. Approximately 24 inches.

We would tend to record these data in a table (probably in Excel or another spreadsheet program). Here we have saved this file as chameleon-data-raw.csv

We would also record the full reference for each source in a separate table. Recording the data like this makes it easy to enter in Excel, and also makes it easy for us to exclude certain types of data or sources should we decide we don’t trust them enough to include in our analyses.

Note that this will not be the ideal format for using the data in R, so you will need to manipulate or wrangle the data first to make it tidy. In R, the tidyr and dplyr packages are useful for doing this. Or you could use something like Excel if you are less confident with R. Make sure to keep the raw data too in case you accidentally introduce any errors, and so that you can repeat the data collection and analyses if needed.

Both dplyr and tidyr packages are part of the tidyverse so we can save some time by loading them both at the same time via library(tidyverse)

# Load packages
library(tidyverse)

Next we need to read in the chameleon data from our folder. This is saved in the file chameleon-data-raw.csv so we can read it in using read_csv as usual.

# Read in the data
chameleon <- read_csv("data/chameleon-data-raw.csv")

## Rows: 18 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Species, Measurement, Type, Units, Quality, Source
## dbl (1): Value
## lgl (1): Approx
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

When we read in data we get output that tells us how R has interpreted each column. Here it thinks Species is a character (i.e. words), whereas Value is col_double which means it’s a number with decimal places. These all look right to me, but if something looks off here (i.e. characters where you were expecting numbers) it’s worth checking your data to make sure you haven’t made any errors. To save having lots of non-needed output I often suppress these messages in this book, though they’ll pop up if you run the code yourself.

# Take a look at it
glimpse(chameleon)

## Rows: 18
## Columns: 8
## $ Species     <chr> "Chamaeleo chamaeleon", "Chamaeleo chamaeleon", "Chamaeleo chamaeleon", …
## $ Measurement <chr> "length", "clutch size", "dimorphic", "length", "clutch size", "dimorphi…
## $ Value       <dbl> 245, 80, 1, 33, 2, 1, 650, 50, 1, 250, 100, 34, 2, 695, 50, 10, 1, 24
## $ Type        <chr> "mean", "mean", "truefalse", "mean", "mean", "truefalse", "mean", "mean"…
## $ Units       <chr> "mm", "eggs", NA, "mm", "eggs", NA, "mm", "eggs", NA, "mm", "eggs", "mm"…
## $ Quality     <chr> "high", "high", "high", "high", "high", "high", "high", "high", "high", …
## $ Approx      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ Source      <chr> "Picard et al 2020", "Picard et al 2021", "Picard et al 2022", "Picard e…

To wrangle this data, we would first convert the units to a standard unit for each measurement. Here we have length in mm and inches, so we would convert the inches into mm. The three altered rows are shown below. Note that converting from inches to mm gives us the false impression of precision, i.e. two values are now 25.4 and 609.6, rather than whole numbers making it look like these were measured accurately rather than approximated. It’s worth looking out for this in other data, especially curated databases where this kind of conversion is typical.

We can do this easily using the function conv_unit in the package measurements. Let’s load the library here so we can do this.

# Load packages
library(measurements)

Now let’s use conv_unit and some dplyr functions to convert all inches to mm.

# Create a new dataset called chameleon2
chameleon2 <-
  # Start with the chameleon data
  chameleon %>% 
  # Convert inches into mm within the Value column
  mutate(Value = ifelse(Units == "inches", conv_unit(Value, "inch", "mm"), Value)) %>%
  # Remove Units and Source columns
  dplyr::select(-Units, -Source)

Here mutate modifies the variable Value. The ifelse statement tells R that if Units is in inches, we should use conv_unit to convert from inch to mm, and place this value in the Value column. If Units is not in inches it just records the original value in Value.

Finally, select is used to remove the Units and Source columns. These are important to keep in our raw data, but we don’t need them for our analyses, so it’s neater to remove them.

The observant among you may have noticed I used dplyr::select not just select in the code above. This tells R to look in the package dplyr and use the select function from that package. There are several different packages with select functions, and I use another one later in this book. To ensure that R knows to use the dplyr version here I have used the ::.

# Just show the last three entries to check what we have done looks correct
tail(chameleon2, n = 3)

## # A tibble: 3 × 6
##   Species              Measurement Value Type  Quality Approx
##   <chr>                <chr>       <dbl> <chr> <chr>   <lgl> 
## 1 Chamaeleo chamaeleon length      254   mean  medium  TRUE  
## 2 Brookesia minima     length       25.4 mean  medium  TRUE  
## 3 Calumma parsonii     length      610.  mean  medium  TRUE

Next we might decide to exclude low quality records, and remove any approximations. We might need to do something more complicated if we only have one record for a species and it is an approximation. In this case we are going to keep the approximations, and use all the data. But you should choose what is sensible for your dataset.

To make our data tidier and remove unnecessary columns we will use select again to remove the Data_Quality and Approximation columns.

# Create a new dataset called chameleon3
chameleon3 <-
  # Start with chameleon2
  chameleon2 %>%
  # Remove the Data Quality and Approximation columns
  dplyr::select(-Quality, -Approx)

# Look at the data
head(chameleon3)

## # A tibble: 6 × 4
##   Species              Measurement Value Type     
##   <chr>                <chr>       <dbl> <chr>    
## 1 Chamaeleo chamaeleon length        245 mean     
## 2 Chamaeleo chamaeleon clutch size    80 mean     
## 3 Chamaeleo chamaeleon dimorphic      NA truefalse
## 4 Brookesia minima     length         33 mean     
## 5 Brookesia minima     clutch size     2 mean     
## 6 Brookesia minima     dimorphic      NA truefalse

Finally, we want to summarise the data for each species, to get means/medians and possibly also minima and maxima. To summarise we use the function summarise (or summarize if you prefer the US spelling). But first we need to group the data into the groupings we want the summary data for. In this case we want to know the median values for each species, for each different measurement, and for each different type of measurement, i.e. averages, maximums, minimums etc. We can tell R to group data using group_by. To do all of this we use the following code:

# Create a new dataset called chameleon4
chameleon4 <-
  # Start with chameleon3
  chameleon3 %>%
  # Group the records together by Species, Measurement, and Data_Type
  group_by(Species, Measurement, Type) %>%
  # Get the median and max for Value in each group constructed by Species, Measurement, and Data_Type.
  summarise(median = median(Value),
                max = max(Value))

# Look at the output
head(chameleon4)

## # A tibble: 6 × 5
## # Groups:   Species, Measurement [4]
##   Species          Measurement Type      median   max
##   <chr>            <chr>       <chr>      <dbl> <dbl>
## 1 Brookesia minima clutch size max          2       2
## 2 Brookesia minima clutch size mean         2       2
## 3 Brookesia minima dimorphic   truefalse   NA      NA
## 4 Brookesia minima length      max         34      34
## 5 Brookesia minima length      mean        29.2    33
## 6 Calumma parsonii clutch size max         50      50

Note that we now have a median and maximum value, for each measurement and data type for each species.

This is fine, but to run analyses in R we are going to want a different column for each of the measurement x data type combinations. To do this we can use the function pivot_wider, because the change will make the table wider…

# Create a new dataset called chameleon5
chameleon5 <-
  # Start with chameleon4
  chameleon4 %>%
  # Reshape the data so measurement X data types combinations are their own columns
  pivot_wider(names_from = c(Measurement, Type), values_from = c(median, max)) 

# Look at the output
chameleon5

## # A tibble: 3 × 11
## # Groups:   Species [3]
##   Species      media…¹ media…² media…³ media…⁴ media…⁵ max_c…⁶ max_c…⁷ max_d…⁸ max_l…⁹ max_l…˟
##   <chr>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Brookesia m…       2       2      NA      34    29.2       2       2      NA      34      33
## 2 Calumma par…      50      50      NA     695   630.       50      50      NA     695     650
## 3 Chamaeleo c…     100      80      NA     250   250.      100      80      NA     250     254
## # … with abbreviated variable names ¹`median_clutch size_max`, ²`median_clutch size_mean`,
## #   ³median_dimorphic_truefalse, ⁴median_length_max, ⁵median_length_mean,
## #   ⁶`max_clutch size_max`, ⁷`max_clutch size_mean`, ⁸max_dimorphic_truefalse,
## #   ⁹max_length_max, ˟max_length_mean

Finally, we probably don’t need all of these columns. The summary data you use and how you calculate it will vary depending on your inputs for example here we might want to extract:

Median of mean length values. This will tell us about the overall mean value for length across all sources.
Maximum of maximum length values. This will tell us about the overall maximum value for length across all sources.
Median of mean clutch size values. This will tell us about the overall mean value for clutch size across all sources.
Maximum of maximum clutch size values. This will tell us about the overall maximum value for clutch size across all sources.
Median dimorphism score. This will tell use whether the species is predominantly considered to be dimorphic or not.

We can select these columns only using select:

# Create a new dataset called chameleon6
chameleon6 <-
  # Start with chameleon5
  chameleon5 %>%
  # Select only the required columns
  dplyr::select(Species, median_length = median_length_mean, max_length = max_length_max,
         median_clutchsize = `median_clutch size_mean`, max_clutchsize = `max_clutch size_max`,
         dimorphic = median_dimorphic_truefalse)

# Look at the output
chameleon6

## # A tibble: 3 × 6
## # Groups:   Species [3]
##   Species              median_length max_length median_clutchsize max_clutchsize dimorphic
##   <chr>                        <dbl>      <dbl>             <dbl>          <dbl>     <dbl>
## 1 Brookesia minima              29.2         34                 2              2        NA
## 2 Calumma parsonii             630.         695                50             50        NA
## 3 Chamaeleo chamaeleon         250.         250                80            100        NA

Finally if this were a real dataset, we might want to save this clean and tidy version so that we don’t have to run all of this code again every time we want to use it. We can do this using the function write_csv:

write_csv(x = chameleon6, file = "data/chameleon-data-for-analyses.csv")

Note that, to demonstrate how the code works we’ve separated it into chunks above, but you could use %>% to do all of these in one go if you wanted to:

chameleon_final <-
  # Start with chameleon
  chameleon %>%
  # Convert inches into mm within the Value column
  mutate(Value = ifelse(Units == "inches", conv_unit(Value, "inch", "mm"), Value)) %>%
  # Remove the Units, Source, Data Quality and Approximation columns
  dplyr::select(-Units, -Source, -Quality, -Approx) %>%
  # Group the records together by Species, Measurement, and Data_Type
  group_by(Species, Measurement, Type) %>%
  # Get the median and max for Value in each group constructed by Species, Measurement, and Data_Type.
  summarise(median = median(Value),
                max = max(Value)) %>%
  # Reshape the data so measurement X data types combinations are their own columns
  pivot_wider(names_from = c(Measurement, Type), values_from = c(median, max)) %>%
  # Select only the required columns
  dplyr::select(Species, median_length = median_length_mean, max_length = max_length_max,
         median_clutchsize = `median_clutch size_mean`, max_clutchsize = `max_clutch size_max`,
         dimorphic = median_dimorphic_truefalse)

# Look at the output
chameleon_final

## # A tibble: 3 × 6
## # Groups:   Species [3]
##   Species              median_length max_length median_clutchsize max_clutchsize dimorphic
##   <chr>                        <dbl>      <dbl>             <dbl>          <dbl>     <dbl>
## 1 Brookesia minima              29.2         34                 2              2        NA
## 2 Calumma parsonii             630.         695                50             50        NA
## 3 Chamaeleo chamaeleon         250.         250                80            100        NA

# Write to file
# I've commented this out as we already made this above
# write_csv(x = chameleon_final, file = "data/chameleon-data-for-analyses.csv")

2.2 Summary

This exercise should have introduced you to some basic skills for wrangling comparative data in R.

2.3 Practical exercise

Using the frog-eyes-raw-data.csv dataset and R, create a new dataset that contains only the columns genus_species, Family, and two new columns containing the species mean values for rootmass and eyemean. We will talk more about this dataset in later exercises, but it comes from K. N. Thomas et al. (2020).

References

Thomas, Kate N, David J Gower, Rayna C Bell, Matthew K Fujita, Ryan K Schott, and Jeffrey W Streicher. 2020. “Eye Size and Investment in Frogs and Toads Correlate with Adult Habitat, Activity Pattern and Breeding Ecology.” Proceedings of the Royal Society B 287 (1935): 20201393.