13 Common mistakes using PCMs in R
Throughout these exercises I’ve tried to mention common mistakes you might come across when working with your own data. I’ve collated some of them here for easy reference. Note that all of these issues are really common; I do most of them on a daily basis! So don’t feel bad if you make them too. It’s all part of learning!
13.1 Good practice to prevent errors
There are a couple of ways you can help yourself avoid errors.
- Look at your data (and tree) before beginning any analysis to check it read in correctly.
- Check the exact spelling (and capitalisation) of your variable names. For example, in the datasets we have used the tip names have been called `tiplabel`, `Species` and `Binomial`. It’s important to check what things are called in your data before you start the analyses.
- Check (using `glimpse` or `str` or `class`) what kind of data R thinks each of your variables is. If a function expects a factor, it will not work if R thinks your variable is a character.
- Sometimes species names (and variable names) are separated by spaces, sometimes by `.`, and other times by `_`. Check this in your data and use something consistent. You can use `stringr` functions to easily change capitalisation or replace `" "` with `"_"` or vice versa.
- Finally, make sure to run through your code slowly and carefully, making copious notes and comments to yourself (preceded by `#` so R ignores them) to remind yourself what you are doing and why. Always check the output at every stage of the analysis. Every time you modify the data or tree, check that this happened as you expected. This can save you from lots of downstream issues.
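As a rough illustration of these checks, the sketch below uses a small made-up data frame (the object `mydata` and its values are invented for this example). It uses base R only, with `gsub` standing in for the `stringr` functions mentioned above.

```r
# A tiny made-up dataset; the names and values are purely illustrative
mydata <- data.frame(
  Species = c("Homo sapiens", "Pan troglodytes", "Gorilla gorilla"),
  Mass = c(62, 45, 120),
  Diet = c("omnivore", "omnivore", "herbivore")
)

# Check exactly what the variables are called and what type R thinks they are
names(mydata)
str(mydata)
class(mydata$Diet)

# Make species names consistent by replacing spaces with underscores
# (stringr::str_replace_all(mydata$Species, " ", "_") would do the same job)
mydata$Species <- gsub(" ", "_", mydata$Species)

# Always check the change happened as expected
mydata$Species
```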
13.2 Common errors
13.2.1 Standard R issues
- Typos.
- Incorrectly spelled variable names.
- Missing brackets, parentheses, commas, or quotation marks.
- The dreaded `+`.
  - If you see the `+` rather than the prompt `>` at the start of the line of code you’re trying to run, it suggests something didn’t get completed in the code above. Maybe a missing parenthesis or comma or quotation mark? You should fix this before moving forwards.
  - To quickly get rid of the `+` just put your cursor into the Console tab and then press Esc (escape).
- R cannot find `name of function`.
  - Could this be a typo? Check the exact spelling of the function.
  - Have you loaded the package that contains the function? Remember you need to tell R to load the packages, using the function `library`, every time you start a new R session and want to use functions from these packages.
  - Did you install the package that contains the function? Install the package using `install.packages("package name")`. See Chapter 1 for more details of common problems installing packages.
- R cannot find your data.
  - Is there a typo in the name?
  - Did you unzip the data? R cannot work with stuff in zipped files.
  - Is the data in the place R is looking for it? Check the Files tab in the bottom right hand panel in a standard RStudio set up. Can you see your data there? Is it in the correct folder?
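If you prefer to check from the Console where R is looking, something like the following sketch may help. The file name `example-data.csv` and the use of a temporary folder are assumptions made so that the example runs anywhere; swap in your own folder and file name.

```r
# Where is R currently looking for files?
getwd()

# For illustration only, create a small file in a temporary folder
data_file <- file.path(tempdir(), "example-data.csv")
write.csv(data.frame(Species = c("A", "B"), Mass = c(1.5, 2.5)),
          data_file, row.names = FALSE)

# Check the file exists before trying to read it
file.exists(data_file)

# List the csv files R can see in that folder
list.files(tempdir(), pattern = "\\.csv$")

# Read the data in and look at it
mydata <- read.csv(data_file)
str(mydata)
```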
- Using the wrong data.
  - Are you working with the correct data? If you use generic names like `mydata` and `mytree` (like I’ve done throughout this book) it’s easy to accidentally use the wrong `mydata`. You can avoid this by giving your objects more descriptive names.
- Data uses `,` rather than `.` for decimals.
  - This is common in mainland Europe. Luckily there’s an easy fix. When you read in your data use `read_csv2` or `read.csv2` or similar, and R will read it in with this information in mind.
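To see this in action, the sketch below writes a small made-up file that uses `;` between columns and `,` for decimals, then reads it back with `read.csv2`. The file name and contents are invented for illustration.

```r
# For illustration only, write a file that uses ; between columns
# and , for decimals
csv_file <- file.path(tempdir(), "decimal-comma.csv")
writeLines(c("Species;Mass", "A;1,5", "B;2,75"), csv_file)

# read.csv2 assumes ; separates columns and , marks decimals
mydata <- read.csv2(csv_file)

# Check that Mass has been read in as a number, not as text
str(mydata)
mydata$Mass
```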
- Functions with the same names from different packages act differently.
  - Some functions, especially those with useful names like `select`, appear in multiple packages and work differently in each. To solve this problem you can declare which package you would like the function to come from. For example, to ask R to use the version of the function `select` from the package `dplyr`, use `dplyr::select`.
- Arguments are in the wrong order.
  - R functions expect arguments to be entered into them in a specific order. This is known as “syntax” and you can see the correct syntax in the help files for a function.
  - For example, if the syntax of a function is `myfunction(phy, node)` then it expects the `phy` to be entered first and the `node` second. So if we have a phylogeny called `mytree` and we want to investigate node 10, we could type `myfunction(mytree, 10)` and the function will do what we expect. However, if we forget what the syntax of the function is and try `myfunction(10, mytree)` the function will not work.
  - A simple (and highly recommended) solution to this is to always specify the arguments of the function, then you can’t go wrong! `myfunction(phy = mytree, node = 10)` will work, and so will `myfunction(node = 10, phy = mytree)` because we have specified the argument names so the order is no longer important.
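The argument order example can be tried out with a toy version of `myfunction`. Everything in this sketch is made up for illustration: the function simply pastes its inputs into a sentence, and `mytree` is a text label standing in for a real phylogeny. Note that in this toy case the wrong order does not produce an error, just a wrong answer, which can be even harder to spot.

```r
# A made-up function for illustration: it expects phy first and node second
myfunction <- function(phy, node) {
  paste("Investigating node", node, "of", phy)
}

# Here the "tree" is just a label standing in for a real phylogeny
mytree <- "mytree"

# Arguments in the expected order
myfunction(mytree, 10)

# Arguments in the wrong order: R does not complain here, but the
# result is not what we wanted
myfunction(10, mytree)

# Named arguments give the same answer in either order
myfunction(phy = mytree, node = 10)
myfunction(node = 10, phy = mytree)
```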
13.2.2 Issues with the tree or data
- Tree is not ultrametric.
  - Use `is.ultrametric` to check.
  - Fix using `force.ultrametric` in the `phytools` package.
  - Remember that `force.ultrametric` should only be used to correct rounding errors; it is not a substitute for time calibrating the tree.
- Tree is not rooted.
  - Use `is.rooted` to check.
  - Fix using `root`.
- Tree is not fully bifurcating (it contains polytomies).
  - Use `is.binary` to check.
  - Fix using `multi2di`.
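The three tree checks above can be sketched with the `ape` package as follows. The trees are tiny made-up examples written directly in Newick format, and the choice of species `A` as the outgroup is arbitrary. `force.ultrametric` is not run here because it lives in `phytools`, and this sketch assumes only `ape` is installed.

```r
library(ape)

# Small made-up trees written in Newick format, for illustration only
ultra_tree <- read.tree(text = "((A:1,B:1):1,C:2);")
unrooted_tree <- read.tree(text = "(A,B,(C,D));")
polytomy_tree <- read.tree(text = "((A,B,C),D);")

# Is the tree ultrametric?
is.ultrametric(ultra_tree)

# Is the tree rooted? If not, root it on an outgroup of your choice
is.rooted(unrooted_tree)
rooted_tree <- root(unrooted_tree, outgroup = "A", resolve.root = TRUE)
is.rooted(rooted_tree)

# Is the tree fully bifurcating? If not, resolve polytomies at random
is.binary(polytomy_tree)
binary_tree <- multi2di(polytomy_tree)
is.binary(binary_tree)
```

After any of these fixes it is worth plotting the tree again to check that it changed as you expected.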
- Species names in the tree and the data do not match. See Chapter 4.
  - Use `name.check` in the `geiger` package to check, or use `setdiff(phy$tip.label, data$species)`.
  - If they do not match, but they should do, check for spaces rather than underscores in species names, differences in capitalisation, or any words (like family names or numbers) added to tip label names. Also ensure you use the variable name from the data set that contains the species names.
  - Note that when you plot the tree the `_` will be omitted from tip labels so you will need to check these within the actual tree file or using `phy$tip.label`.
  - Fix using code in Chapter 4.
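As a minimal sketch of the `setdiff` approach, using a made-up tree and dataset (the species and values are invented for illustration), you could check for mismatches in both directions, fix the spaces, and check again:

```r
library(ape)

# Made-up tree and data for illustration only
mytree <- read.tree(text = "((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);")
mydata <- data.frame(
  Species = c("Homo sapiens", "Pan troglodytes", "Pongo abelii"),
  Mass = c(62, 45, 80)
)

# Look at the real tip labels (plots may hide the underscores)
mytree$tip.label

# Species in the tree but not in the data
setdiff(mytree$tip.label, mydata$Species)

# Species in the data but not in the tree
setdiff(mydata$Species, mytree$tip.label)

# Nothing matches because the data uses spaces, so replace them
mydata$Species <- gsub(" ", "_", mydata$Species)

# Check again: only the genuine mismatches should remain
setdiff(mytree$tip.label, mydata$Species)
setdiff(mydata$Species, mytree$tip.label)
```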
- Check you have the species names attached to any dataset/variable you are working with.
  - Some functions require that the species names are rownames. Some functions require named vectors (e.g. `fitContinuous` in Chapter 8). Use `setNames` to generate names for these.
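Both ways of attaching species names can be sketched in base R. The data below are made up for illustration.

```r
# Made-up data for illustration only
mydata <- data.frame(
  Species = c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"),
  Mass = c(62, 45, 120)
)

# Option 1: a named vector, where each value is labelled with its species
mass <- setNames(mydata$Mass, mydata$Species)
mass

# Option 2: species names as the row names of the data frame
rownames(mydata) <- mydata$Species
head(mydata)
```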
- Some functions require the data and the tree to be ordered so that species are in the same order in both.
  - Fix using `mydata <- mydata[match(mytree$tip.label, mydata$Species), ]`
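To see what the `match` fix does, here is a small sketch in which a made-up vector `tip_labels` stands in for `mytree$tip.label`:

```r
# Made-up tip labels standing in for mytree$tip.label
tip_labels <- c("Gorilla_gorilla", "Homo_sapiens", "Pan_troglodytes")

mydata <- data.frame(
  Species = c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"),
  Mass = c(62, 45, 120)
)

# Reorder the rows of the data to follow the order of the tip labels
mydata <- mydata[match(tip_labels, mydata$Species), ]

# Check the orders now agree
identical(mydata$Species, tip_labels)
```

Note that this assumes every tip label appears in the data. Any tip missing from the data would produce a row of `NA`s, so it is safest to check the names match first.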
- Data is a tibble not a dataframe. See Chapter 4.
  - Most functions for PCMs require a dataframe as input.
  - Check using `class` or `str`.
  - Convert to a data frame using `as.data.frame`.
- Variable is character not a factor.
  - Check using `glimpse` or `str`.
  - Fix using `as.factor`.
  - If you need to convert something to numeric use `as.numeric`, or if you need to convert something to character use `as.character` and so on.
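The conversions above might look something like this in practice. The data are made up, and base R is used throughout so the example starts from an ordinary data frame rather than a tibble.

```r
# Made-up data for illustration only
mydata <- data.frame(
  Species = c("A", "B", "C"),
  Diet = c("omnivore", "herbivore", "omnivore"),
  Mass = c("1.5", "2.5", "4"),
  stringsAsFactors = FALSE
)

# Check what kind of object and variables you have
class(mydata)
str(mydata)

# Convert a character variable to a factor
mydata$Diet <- as.factor(mydata$Diet)
levels(mydata$Diet)

# Convert numbers that were read in as text to numeric
mydata$Mass <- as.numeric(mydata$Mass)

# as.data.frame converts other table-like objects (such as tibbles)
# to a plain data frame; here it simply returns a data frame again
mydata <- as.data.frame(mydata)
class(mydata)
```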
- Model fitting gives errors or warnings about parameter bounds.
  - This generally happens when the likelihood profile for one of your parameters is really flat (i.e. it could be one of many different values all of which are equally likely), and the model is getting stuck near one of the bounds (i.e. limits) of the parameter.
  - To fix this error you need to change the bounds (i.e. upper and lower values) on the parameter being optimized to restrict where the function is looking. First establish which bound is the problem, then change it to something a little bigger/smaller than the default upper/lower bound until it works. See the appropriate chapters for more details on specific functions.
- Parameter estimates do not make sense.
  - As mentioned in the relevant chapters, you must check your parameter estimates make sense. This can take time to work out, so discuss with your supervisor/colleagues and think carefully about what the numbers mean.
  - If these are wildly unrealistic it suggests you don’t have enough data to fit a model of the complexity you’re trying to fit.
  - There’s no solution to this, aside from gathering more data.
- Tree formats.
  - Check the format of your tree. If it is a NEXUS file use `read.nexus`; otherwise use `read.tree`, which reads Newick format trees. Note that some NEXUS files contain only the data (morphological matrices or molecular data). If there is no tree block in the file it cannot be used as a phylogeny.
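As a small sketch of reading the two formats with `ape`, the example below first saves a made-up tree in both formats (the file names are invented, and a temporary folder is used so the example runs anywhere) and then reads each file back with the matching function.

```r
library(ape)

# For illustration only, save a small made-up tree in both formats
mytree <- read.tree(text = "((A:1,B:1):1,C:2);")
newick_file <- file.path(tempdir(), "example-tree.tre")
nexus_file <- file.path(tempdir(), "example-tree.nex")
write.tree(mytree, file = newick_file)
write.nexus(mytree, file = nexus_file)

# NEXUS files start with the line #NEXUS
readLines(nexus_file, n = 1)

# Read each file with the matching function
tree_from_newick <- read.tree(newick_file)
tree_from_nexus <- read.nexus(nexus_file)

# Check both trees contain the species you expect
tree_from_newick$tip.label
tree_from_nexus$tip.label
```

Opening the tree file in a text editor and looking at the first line is often the quickest way to tell which format you have.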
13.3 What to do if you get an Error message
Error messages will have the word “Error” in the message after you try to run the code. This means R did not run the code. You will need to fix this before you can move on.
- Don’t panic! Error messages are common and there might be an easy fix. Read the message carefully as some will indicate the problem and you’ll be able to fix it quickly.
- If you don’t know what issue the message is referring to, first make sure the basics are correct. Check your code carefully for typos, missing parentheses etc. Ensure that you are using the right data and tree, and that you have used the correct variable names and function names etc.
- Run through all your code again slowly to check you didn’t miss an important step. Look at the data (and the tree) at every stage to make sure they are changing as you expect them to. Make sure you didn’t accidentally overwrite an object with something that has the same name.
- Restart RStudio, and clear the Global Environment by clicking the little broom button on the top right hand panel in a standard RStudio set up. Then try to run the code again.
- If none of these basic fixes work move onto the Error message itself. Read the message again and see if you can work out what the issue is.
- If you can’t work out what the error means, try Googling it (remove any words specific to your data from it first). Google may be able to tell you what the issue is and help you find solutions on websites like Stack Overflow.
- Finally, if none of this helps, ask for help. I would first advise asking local colleagues/supervisors etc. and then, if needed, emailing the package maintainer, raising an issue on GitHub, or posting a question on an online forum like Stack Overflow. To get help you will need to provide a reproducible example so the person helping you can run the code on their computer. This means you need to provide the code, the data (and the tree), and the error message you are getting. If the data are very large, provide a subsample instead.
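One possible way to share a small piece of data as part of a reproducible example is `dput`, which prints R code that recreates an object. The sketch below uses made-up data and base R only.

```r
# Made-up data standing in for the data that triggers your error
mydata <- data.frame(
  Species = c("A", "B", "C", "D", "E", "F"),
  Mass = c(1.5, 2.5, 4, 3.2, 8.1, 0.7)
)

# If the data are very large, take a small subsample that still
# reproduces the problem
small_data <- mydata[1:3, ]

# dput prints R code that recreates the object, so you can
# paste it into an email or forum post
dput(small_data)

# Include details of your R version and loaded packages too
sessionInfo()
```

Before sending, run your example in a fresh R session to check it really does reproduce the error.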
13.4 What to do if you get a Warning message
Warning messages will not have the word “Error” in the message after you try to run the code. Many will have the word “Warning”. Some will just print a message without the word “Warning”. A warning message means that R has run the code as you asked BUT it thinks there is some information you need to consider. Maybe you did something silly? Or maybe it just wants to clarify to you exactly what it did.
In PCMs the warnings may indicate unrealistic parameters or poor convergence.
- Don’t panic! Warning messages are common and there’s probably an easy fix.
- Read the warning carefully. Try to understand what it means. You may need to Google it. In many cases it is nothing to worry about, but it may be alerting you to a serious issue with your analysis.
- Always check warning messages, do not ignore them.
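To illustrate the difference between a warning and an error, here is a small made-up example in base R. The code runs and returns a result, but R signals a warning because one value could not be converted. The handler used here is just one way to display the warning text, not something you need in everyday analyses.

```r
# Made-up example: text that cannot be converted to a number
values <- c("1.5", "2.5", "unknown")

# R runs the code and returns a result, but signals a warning.
# Here the warning text is printed as a message so it can be read
# (or searched for online).
mass <- withCallingHandlers(
  as.numeric(values),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Look at the result: the value that could not be converted is now NA
mass

# Check how many values were affected
sum(is.na(mass))
```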
13.5 Memory issues
If your tree/dataset is too large, R may crash, especially for more computationally demanding analyses. In this situation, first check your code works by running it on a smaller subset of the tree/dataset. This will also give you an idea of how long the analysis might take on the whole dataset. You then will need to think about either:
- Finding a better computer with more memory
- Parallelisation, i.e. running small analyses that can then be stuck back together again. This is often not very useful in PCMs but worth considering.
- Using High Performance Computing facilities to run the code remotely.
Sometimes analyses just aren’t feasible on certain datasets.
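As a rough sketch of testing code on a smaller subset first, the example below uses `ape` to simulate a random tree standing in for a large phylogeny, keeps a random subset of tips, and times one step on the small tree. The tree size, the subset size, and the use of `cophenetic` as the timed step are all arbitrary choices for illustration.

```r
library(ape)

# A made-up random tree standing in for a large phylogeny
set.seed(1)
mytree <- rcoal(200)

# Keep a random subset of 20 species to test the code quickly
test_species <- sample(mytree$tip.label, 20)
small_tree <- keep.tip(mytree, test_species)
Ntip(small_tree)

# Time a step of the analysis on the small tree before scaling up;
# cophenetic (pairwise distances between tips) is just a stand-in here
system.time(distances <- cophenetic(small_tree))
dim(distances)
```

Remember to subset the data to the same species as the tree before running any analysis on the subset.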