automated phylogenetics for the brave



I strongly advise you to check the output from each step of pG2. Building a phylogeny is pointless if your alignment is awful, and using a phylogeny is pointless if you're not confident in the accuracy of that phylogeny (via bootstrapping, replicates, and the like). In version 1, I went to some pains to provide warnings if you went wrong. I do not do that here; if you are not a phylogeneticist, I strongly advise you to start with phyloGenerator1. Version 2 is what I use; if you know what you're doing, and want to build a large phylogeny, I sense you will find it useful.

phyloGenerator2 goes in a different direction to phyloGenerator1: whereas before I went to great pains to neaten output, check what you were doing, and provide a tidy user interface, I don't do that here. This means you get to the good stuff much quicker, but it also means there is a burden on you, as the user, to somewhat understand what you're doing. For example, I don't neaten the RAxML output: you get everything, raw. If you want a single, one-click, ecologist-friendly program please use phyloGenerator1, or the family in pez; these options will get you an accurate phylogeny without any fuss, and once you're comfortable with them you're probably ready for phyloGenerator2.

Please check the FAQ and read this guide carefully. If you have any questions, please get in touch with me.

New Methods


My usual approach to phylogenetics is to build an alignment, identify the sequences in it that are obviously nonsense because they:

...and then I remove them. referenceDownload in pG1 handled some of these, but the last two things still cropped up occasionally (but rarely). In pG2, I've added the Hawkeye method, which actively looks for sequences with gaps where the consensus alignment doesn't have them, and then removes them. In my own experiments, I've found it means I spend a tenth of the time cleaning alignments that I used to; Your Mileage May Vary, but I encourage you to play around with its options and see what you end up with.
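To make the idea concrete, here is a minimal Python sketch of a gap-based filter in the spirit of Hawkeye. It is not pG2's actual implementation: the function name `flag_gappy_sequences`, the two thresholds, and the dictionary input are all assumptions made for illustration.

```python
def flag_gappy_sequences(alignment, gap_char="-",
                         consensus_gap_frac=0.5, max_extra_gaps=0.1):
    """Return names of sequences that have gaps where most sequences do not.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Both thresholds are illustrative assumptions, not pG2 defaults.
    """
    if not alignment:
        return []
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")

    # Columns where the consensus is "no gap": fewer than
    # consensus_gap_frac of the sequences have a gap there.
    solid_columns = [
        i for i in range(length)
        if sum(alignment[n][i] == gap_char for n in names) / len(names)
        < consensus_gap_frac
    ]
    if not solid_columns:
        return []

    flagged = []
    for name in names:
        # Count this sequence's gaps in columns the consensus fills.
        extra = sum(alignment[name][i] == gap_char for i in solid_columns)
        if extra / len(solid_columns) > max_extra_gaps:
            flagged.append(name)
    return flagged


example = {
    "sp_a": "ACGTACGT",
    "sp_b": "ACGTACGT",
    "sp_c": "ACGTACGA",
    "sp_d": "AC----GT",  # gaps where everything else has bases
}
suspects = flag_gappy_sequences(example)
```

Whatever thresholds you settle on, look at the flagged sequences before discarding them; a real insertion or deletion in one clade can look much like a bad sequence.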


phyloGenerator2 forces you to use the referenceDownload method from the original phyloGenerator. This means that you must supply it with examples of what sequences for each locus should look like, and pG2 will find sequences that align well with those references (as defined by the parameters to follow). Finding such reference sequences is easy: go to GenBank and find some! Just a handful of sequences will do the job, and while more is probably better, with too many (>100) sequence download could get a bit slow. This is not an unusual requirement for a program like pG2; phlawd and NCBIminer have the same requirement. pG2 uses alignment to find these sequences; I think this is faster, and I like that it's based around what you're actually going to use in phylogenetic construction and doesn't rely on similarity of sequences (similar sequences tend to come from closely related species, so a similarity search could affect our answer). Your Mileage May Vary!


You need three things to do a phyloGenerator2 run:

The demo folder contains examples of all of these things. Remember to always use absolute paths within your parameter file, and to enter your own email address into it; otherwise you'll get into trouble with GenBank. You can write comments into your parameter file by putting a # at the start of them. The parameter file is just a YAML file, and the demo parameter file (demo.yml) contains everything you need to know to get started. All parameters are set with a colon (:); for example, email : would set your email address. Spaces don't matter.
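Since pG2 itself won't warn you, a quick pre-flight check of the two rules above (absolute paths, a real email address) can save a wasted run. The Python sketch below is not part of pG2. It assumes the parameter file has already been parsed into a dictionary by a YAML library of your choice, and that each gene block holds its reference sequences under `ref_file`; the function name `check_params` is hypothetical.

```python
import os


def check_params(params):
    """Return a list of human-readable problems in a parsed parameter file.

    params: dict as produced by a YAML parser (assumed layout:
    top-level 'email', plus 'genes' -> gene name -> 'ref_file').
    """
    problems = []

    email = params.get("email")
    if not email or "@" not in str(email):
        problems.append("email is missing or does not look like an address")

    for gene, settings in (params.get("genes") or {}).items():
        ref_file = (settings or {}).get("ref_file")
        if ref_file is None:
            problems.append(f"{gene}: no ref_file given")
        elif not os.path.isabs(ref_file):
            problems.append(
                f"{gene}: ref_file is not an absolute path: {ref_file}"
            )
    return problems


bad = {"email": "me@example.org",
       "genes": {"rbcL": {"ref_file": "demo/rbcL.fasta"}}}
good = {"email": "me@example.org",
        "genes": {"rbcL": {"ref_file": "/home/me/demo/rbcL.fasta"}}}
bad_report = check_params(bad)
good_report = check_params(good)
```

An empty list means neither rule was broken; it says nothing about whether the files exist or the other parameters are sensible.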

DNA Download

In the parameter file, this section consists of a genes block, within which there is another block for each gene you want to download data from. In the demo, you have: genes: rbcL:

...which basically just means "start my genes block", and "my first gene is called rbcL". Each of the parameters after that are:

There are no hard and fast rules for how to choose these parameters; I tend to set the minimum sequence length to be just less than half the expected length of the locus to get partial sequences (if I want them), and I set ref_max to be about 120% of the length of my reference sequences. fussy is useful if you're worried about weird GenBank naming conventions; I find that 16S, 18S, etc. sequences don't have useful annotations, and so limiting my search on the basis of them means I never find anything. phyloGenerator1 handles all this for you automatically, but many of you wanted flexibility so here it is!
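As a worked example of that rule of thumb, the Python sketch below turns an expected locus length and a set of reference-sequence lengths into a lower and upper length bound. It is only arithmetic, not pG2 code: the function name `length_window`, the 45% figure standing in for "just less than half", and the choice of the longest reference as the basis for the upper bound are all assumptions.

```python
def length_window(expected_locus_length, reference_lengths,
                  min_percent=45, max_percent=120):
    """Suggest (lower, upper) sequence-length bounds in base pairs.

    lower: just under half the expected locus length, so partial
           sequences are still accepted (45% is an assumed value).
    upper: about 120% of the longest reference sequence.
    Integer arithmetic keeps the results exact.
    """
    if not reference_lengths:
        raise ValueError("need at least one reference sequence length")
    lower = expected_locus_length * min_percent // 100
    upper = max(reference_lengths) * max_percent // 100
    return lower, upper


# Hypothetical numbers: a locus of about 1400 bp and two references.
lower, upper = length_window(1400, [1350, 1400])
```

Raise `min_percent` towards 100 if you don't want partial sequences at all.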


The following options are all optional, and most of them turn on what I consider to be useful features of pG2. The most useful one is arguably cache; using it you download once, and then play with different options for subsetting sequences and building phylogenies.

phy_options - each phylogenetic construction program has special options that can be passed to it. I've not written a parser for each program, as I did in pG1, but I have written some examples of common things you might want to do that essentially replicate the work of pG1. Experience tells me most users want to do something I hadn't thought of; you wanted the power, now you've got it!


The output varies a little depending on what options you choose, but if you do everything in one step you'll get three folders: seqs with the best sequences, phylo with the phylogeny output, and bad_seqs with the output from Hawkeye. In almost all cases, all sequences are in FASTA format, and GenBank IDs are inside the files themselves. The naming of the files should be fairly self-explanatory!

Tips and Tricks