Guide

How to use phyloGenerator

phyloGenerator is simple enough to use, but it can take time to download everything from GenBank. Whenever you need to input a file or directory, you'll need to give phyloGenerator the 'absolute path' to the file. Something like '/Users/will/Documents/dna.fasta', or 'C:\Documents and Settings\dna.fasta'. If you're on a Mac or Linux computer, do not use something like '~/Documents/dna.fasta' - only '/Users/will/Documents/dna.fasta' will work.
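If you're unsure what the absolute form of a path is, a few lines of Python will tell you. This is only a minimal sketch using the standard library; `to_absolute` is a hypothetical helper name and is not part of phyloGenerator.

```python
import os

def to_absolute(path):
    """Expand a leading '~' and any relative parts into an absolute path."""
    return os.path.abspath(os.path.expanduser(path))

# Prints the absolute form of the path, which you can paste into phyloGenerator
print(to_absolute("~/Documents/dna.fasta"))
```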

There are some example files included with phyloGenerator - they're in the 'Demos' folder, and usage examples are in the 'tests' file.

Please register for the phyloGenerator mailing list. This way you can get help easily, I can keep you informed of updates, and you can make feature requests more easily.

My advice would be to just try using the thing, and then refer back to this when stuck. You can navigate through quite happily by just whacking enter lots of times. Why not do this once to see what you get?

This is intended to be a fairly exhaustive guide, and so makes the program seem a little intimidating. Bear in mind that phylogenetics is complicated - I don't go into any detail about how RAxML works, for example, because it's the product of several people's entire research careers. I don't have that much space!

Some people prefer video walkthroughs, so I've made two below. In the first (below left), I show the basics of running phyloGenerator.

In the second (below right), I show how to use the sequence checking options, and a brief introduction to alignment checking.

If you want any more, please just ask. I'm sorry about my slightly croaky voice in these videos, I wasn't feeling very well!

Double-click 'phyloGenerator.exe' in Windows, or run './phyloGenerator.app/Contents/MacOS/phyloGenerator' from a Mac. On a Mac, do not double click the application.

Stem Name

phyloGenerator will automatically output all your files, but you need to give it a 'stem name' so that it knows what to call everything. I'd suggest 'test' for your first attempt!

Working Directory

phyloGenerator needs to know where to put your files. Give a full, absolute path.

Gene Name(s)

Enter the abbreviated names of the genes you're interested in, one per line, and then hit enter on a blank line to continue. You can also use presets for plants (rbcL and matK), and for animals (COI).

Not all sequences on GenBank are annotated in the same way; for example, a cytochrome oxidase one sequence might be labelled as a 'COI', 'Cox1', or 'cytochrome oxidase one' sequence. You can use 'gene aliasing' to account for this, in this example specifying the gene's name as "COI-Cox1-cytochrome_oxidase_one" - aliases are split using '-', and spaces are replaced with '_'.
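The aliasing format described above can be illustrated with a short Python sketch. It only demonstrates the '-' and '_' conventions; the function names are hypothetical, and this is not necessarily how phyloGenerator parses the string internally.

```python
def parse_gene_aliases(gene_string):
    """Split an alias string on '-' and turn '_' back into spaces."""
    return [alias.replace("_", " ") for alias in gene_string.split("-") if alias]

def build_gene_string(aliases):
    """Join plain-text aliases into the 'A-B-C_D' form, replacing spaces with '_'."""
    return "-".join(alias.replace(" ", "_") for alias in aliases)

print(parse_gene_aliases("COI-Cox1-cytochrome_oxidase_one"))
print(build_gene_string(["COI", "Cox1", "cytochrome oxidase one"]))
```

One consequence of this format is that an alias containing a hyphen of its own cannot be written this way, because the hyphen would be read as a separator.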

DNA Input

If you've already downloaded or sequenced your data, but not aligned it, tell phyloGenerator where that FASTA formatted file is now. Otherwise, just press enter.

If you want to put data from more than one locus (gene) into pG, have a separate file for each gene and supply all the files at the same time, separated by commas. For example, '/home/will/first_gene.fasta,/home/will/second_gene.fasta'. The names of the sequences will be used to match the data up. The output pG gives you after you search for more than one gene will be sufficient for this.
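Because sequence names are used to match the loci up, it can be worth checking that the names agree across your files before you start. The sketch below is a hedged illustration, not pG code: the helper names are hypothetical, and it assumes the whole header line after '>' is the sequence name.

```python
def fasta_names(lines):
    """Return sequence names (header lines without the leading '>')."""
    return [line[1:].strip() for line in lines if line.startswith(">")]

def unmatched_names(files):
    """For each file label, list names present in other files but missing here.

    `files` maps a label (e.g. a path) to an iterable of that file's lines;
    an open file handle works, as does a plain list of strings.
    """
    names = {label: set(fasta_names(lines)) for label, lines in files.items()}
    everything = set().union(*names.values())
    return {label: sorted(everything - found) for label, found in names.items()}

def pg_input(paths):
    """Build the comma-separated string to give phyloGenerator."""
    return ",".join(paths)

first = [">Quercus_robur", "ACGT", ">Quercus_ilex", "ACGA"]
second = [">Quercus_robur", "TTGA"]
print(unmatched_names({"first_gene.fasta": first, "second_gene.fasta": second}))
print(pg_input(["/home/will/first_gene.fasta", "/home/will/second_gene.fasta"]))
```

A species missing from one file is not necessarily an error, since not every species has every gene, but an unexpected mismatch often points to a misspelt name.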

If you have DNA alignment(s) you really want to use, you can give them to pG as command-line arguments, but pG won't ask you for an alignment while it's running interactively. Something like './phyloGenerator.py -existingAlignment /home/will/first_gene.fasta,/home/will/second_gene.fasta' would work well. Note that pG is expecting the output from a previous pG run, which means alignments that don't have the same number of taxa, in the same order, will lead to problems! You will not be given an opportunity to alter these alignments - if you're providing an alignment, I presume you're happy with it (...I'm not sure why you're using pG in that case, but you asked for it, so you've got it!)

DNA Download

If you don't have DNA sequences already, you'll need to download some. Tell phyloGenerator where a plain text file containing the binomial names of the species you want is. Each species must be on a separate line, with no formatting of any kind in the file. Otherwise, just press enter.
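A species list is easy to get subtly wrong (stray tabs, blank lines, a genus with no species), so a quick check before a long download can save time. This is a sketch only: `clean_species_list` is a hypothetical helper, and treating "binomial" as "exactly two words" is an assumption made here, not a rule taken from phyloGenerator.

```python
def clean_species_list(lines):
    """Return (species, problems) from the raw lines of a species list."""
    species, problems = [], []
    for line in lines:
        name = " ".join(line.split())  # collapse stray spaces and tabs
        if not name:
            continue  # skip blank lines
        if len(name.split(" ")) == 2:
            species.append(name)
        else:
            problems.append(name)  # not a two-word name; check by hand
    return species, problems

raw = ["Quercus robur\n", "  Quercus \t ilex \n", "\n", "Quercus\n"]
species, problems = clean_species_list(raw)
print("\n".join(species))  # one species per line, ready to save as plain text
print("Check these:", problems)
```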

If you're downloading DNA, you'll need to provide a valid email address so Entrez can contact you about your downloads - this only happens if you download too much, and has never happened to me.

You may have to wait some time before your sequences have all been downloaded, but phyloGenerator will tell you how it's doing. By default, it waits ten seconds after downloading each DNA sequence so that it doesn't irritate the servers at GenBank.

You may see warnings relating to 'NCBI's DTD' files. This is nothing to worry about.

You will be asked if you want to use the new 'referenceDownload' method; this is not described in the pG paper, so if you're in any doubt just hit enter to carry on without it. The referenceDownload method is intended to make building extremely large phylogenies much easier, and I'd strongly recommend you check out the FAQ about it if you're interested in doing that. There's an example of how to use the method in the Silwood_Plants demo.

You can now supply phyloGenerator with the GenBank TaxonIDs of species to search for, instead of binomial names. Just specify the '-taxonIDs' option from the command line, and you're away. There's an example of how to use the method in the Silwood_Plants demo. You should think about doing this if you're concerned that taxonomic mismatches on GenBank are going to cause problems for your analysis, and you can find out more about TaxonIDs here.

This table shows each DNA sequence's 'ID', name used when searching GenBank (if appropriate), the name it was given on GenBank (usually an ID number), and its length. Sequences that are in the top or bottom 5% of sequence lengths are highlighted in the list. Be sensible: if you've downloaded an entire mitochondrial genome (~16000 base pairs) then this won't happily align with a single gene, and you'll crash most of the programs that phyloGenerator uses. Trim those long sequences! Remember that there will always be a 'longest' sequence - but by trimming your sequences you can be sure they're not too long.
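To make the idea of "top or bottom 5% of sequence lengths" concrete, here is a hedged Python sketch. The exact rule phyloGenerator uses is not given above, so the rank-based rule here (flag the shortest and longest 5% of sequences, rounded up) is an assumption, and `flag_extreme_lengths` is a hypothetical name.

```python
import math

def flag_extreme_lengths(lengths, fraction=0.05):
    """Return the indices of the shortest and longest `fraction` of sequences."""
    k = math.ceil(fraction * len(lengths))
    if k == 0:
        return set()
    # Indices ordered from shortest to longest sequence
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return set(order[:k]) | set(order[-k:])

# Ten sequences: one fragment (300 bp) and one whole mitochondrial genome
lengths = [600, 620, 610, 16000, 590, 605, 615, 300, 612, 608]
for seq_id in sorted(flag_extreme_lengths(lengths)):
    print(seq_id, lengths[seq_id])
```

As noted above, something is always flagged as longest and shortest under a rule like this, so the flag is a prompt to look at the sequence, not proof that it is bad.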

There are five modes when 'checking' DNA: delete, reload, trim, replace, and merge. You can enter any of the modes by typing their names. To continue, simply hit enter.

delete
Enter a single sequence's 'Seq ID' to delete it.
Enter 'gene' and then a gene name to delete that gene entirely. This makes it easy to check to see what genes are available on GenBank for your species, without committing you to use those sequences. Note that you can also run phyloGenerator with the 'nGene' argument to let phyloGenerator select the n genes that give the best coverage of your species from the list of possible genes you provide.
Enter 'output' to write out the sequences you've downloaded so far; but note they will all be outputted at the end of the run anyway.
reload
Enter a single sequence's 'Seq ID' to download another sequence from GenBank with the search term originally used. If you put a gene name after the SeqID, e.g. '0rbcL', you'll just reload one gene. You can also use the '>' and '<' operators to reload unusually large/small sequences.
You also have a number of options 'rnd', 'median', 'max', 'min', 'targetLength' that determine how phyloGenerator picks sequences for you from GenBank. For example, if you have an 'nCheck' of 20 and had selected 'median', phyloGenerator will download the first 20 sequences it finds for your species from GenBank (...fewer if there aren't that many!) and use the sequence whose length is closest to the median of them. This can be useful if you're trying to replace your shorter sequences with longer ones.
trim
Enter a single sequence's 'Seq ID' to trim a sequence down, first using any annotations the people who uploaded it to GenBank gave it, and then secondly to its Open Reading Frame (ORF). phyloGenerator will only do one of these things each time you ask it to trim - so if you're still not happy with a gene, try trimming again. However, I would warn that you're unlikely to get a good result from a sequence that wasn't annotated correctly. If you want to trim all the sequences (often a good idea), enter 'EVERYTHING' and press enter. DO NOT attempt to trim a non-coding gene down to its ORF - it doesn't have one! If you trim, and find that you're left with a sequence of length 0, this means phyloGenerator could find neither an ORF nor any annotations on the sequence, and has deleted the sequence (because it's likely rubbish). Consider reloading that species' sequence.
To change the translation table for the sequence you're working with (it might be from a plant's chloroplast, for example), type 'type' and select the appropriate number for each gene. It's important you do this before trimming, because the translation table determines what a sequence's ORF is.
replace
If there are no sequences for a species on GenBank, type its SeqID to replace it with a congener. Alternatively, type 'EVERYTHING' to replace all missing species with a congener, or 'THOROUGH' to replace all species without any sequences with a relative it is more related to than any other species according to the GenBank taxonomy. This will work even if a species isn't on GenBank at all, but typically not if there's no congener in the database.
merge
If there are no sequences for a species on GenBank, you can merge it with another species that does have sequences. At the end of the run, phyloGenerator will replace the merged species with a polytomy, splitting them half-way along the branch of the species with sequence data. Just enter the SeqIDs of the species you want to merge, separated by commas.
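The 'median' selection described under reload can be sketched in Python as follows. Only the 'median' behaviour is described in this guide; the meanings given to 'max', 'min', 'targetLength' and 'rnd' below are assumptions made for illustration, and `pick_sequence` is a hypothetical function, not phyloGenerator's own code.

```python
import random
import statistics

def pick_sequence(lengths, method="median", n_check=20, target_length=None):
    """Return the index of the chosen sequence among the first `n_check` found."""
    candidates = lengths[:n_check]  # fewer if there aren't that many
    if not candidates:
        raise ValueError("no sequences to choose from")
    indices = range(len(candidates))
    if method == "rnd":
        return random.choice(list(indices))
    if method == "max":
        return max(indices, key=lambda i: candidates[i])
    if method == "min":
        return min(indices, key=lambda i: candidates[i])
    if method == "median":
        goal = statistics.median(candidates)
    elif method == "targetLength":
        goal = target_length
    else:
        raise ValueError("unknown method: " + method)
    # Choose the sequence whose length is closest to the goal length
    return min(indices, key=lambda i: abs(candidates[i] - goal))

lengths = [500, 1400, 900, 1000, 300]
print(pick_sequence(lengths, "median"))
print(pick_sequence(lengths, "targetLength", target_length=1050))
```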

Alignment

Simply press enter, and phyloGenerator will align your DNA using MAFFT. However, you have a choice of programs: MUSCLE, MAFFT, Clustal-Omega, or PRANK. To use any of them, simply enter muscle, mafft, clustalo, or prank respectively. To compare the alignments of several programs, enter everything, or, to use just the first three, enter quick. Note that PRANK can take a very long time, and I often use 'quick'.

When you're ready to continue, just hit enter. You'll pick an alignment for each gene (you can only have one alignment) by entering the alignment number for each gene, or if you've only used one alignment method phyloGenerator will just continue straight through.

Remember you can return to the sequence download stage, or align sequences again, if you're unhappy with what you have.

Alignment Checking

It is strongly recommended that you look at your alignment before using it. Summary statistics for your alignment(s) have been outputted, and you may notice that one alignment has more gaps than another, or that they all seem OK. You can write out your alignment(s) by typing output. If you wish to return to the alignment or DNA download stages, just type DNA or align. There are several other options available to you: trimAl, raxml=X, metal and clustal-x2.

These statistics are calculated across all the sequences in each alignment: 'Med. Gaps' is the median number of gaps per sequence, 'SD Gaps' the standard deviation of the number of gaps, 'Min-Max Gaps' the smallest and largest number of gaps in one sequence, 'Med. Gap Fraction' the median of the fraction of each sequence made of gaps, 'M-M Gap Fraction' the smallest and largest fraction of a sequence made of gaps, and 'Warn?' indicates if your alignment is unusually long, given the maximum length of the sequences going into it. Look at these numbers; in particular, warning flags are something to think about.
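If you'd like to see how gap statistics like these can be calculated, here is a minimal Python sketch. It assumes gaps are written as '-' and uses the sample standard deviation; both are assumptions, as is the function name `gap_summary`. The 'Warn?' flag is left out because the rule behind it isn't given here.

```python
import statistics

def gap_summary(alignment):
    """Summarise gaps in a list of equal-length aligned sequences."""
    gaps = [seq.count("-") for seq in alignment]
    fractions = [n / len(seq) for n, seq in zip(gaps, alignment)]
    return {
        "Med. Gaps": statistics.median(gaps),
        "SD Gaps": statistics.stdev(gaps),  # sample SD; needs 2+ sequences
        "Min-Max Gaps": (min(gaps), max(gaps)),
        "Med. Gap Fraction": statistics.median(fractions),
        "M-M Gap Fraction": (min(fractions), max(fractions)),
    }

alignment = ["ACGT--AC", "ACGTACAC", "AC--A--C"]
for name, value in gap_summary(alignment).items():
    print(name, value)
```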

trimAl
trimAl finds difficult to align regions, and automatically cuts them out. It's a very neat exploratory tool, but a good general rule of thumb is that if you've thrown away a lot of your alignment, the sequences you were using aren't likely to be much good!
raxml=X
Different alignments give different results when run through phylogeny-building programs. RAxML is very quick, and so you can use it to run a number (replace the 'X' with the number of runs you want) of searches and examine the mean Robinson-Foulds distance between trees built using the different alignments. Depending on how many alignments and genes you have, this can take quite some time, so do a small run (maybe 5) to gauge how long this will take. You'll be shown a distance matrix, where each element is the mean distance between runs.
metal
Comparing sequence alignments is tricky, but metal calculates the 'SSP' summary statistic for you and gives a distance matrix, like with RAxML.
clustal-x2
This will open the Clustal X-2 website in your browser, so you can view your alignments more easily if you've outputted them.

It is strongly advised that you use a constraint tree in your analysis. Generating novel phylogenetic hypotheses without a fair amount of knowledge is a tricky thing, so by constraining your tree to fit in with the general consensus of how your group fits together, you're reducing the likelihood that your results will be untenable.

If you already have a constraint tree, type newick, then tell phyloGenerator where the phylogeny is.

Building a Constraint Tree

To build a constraint tree with Phylomatic, type phylomatic, then give the path to the reference phylogeny and taxonomy when asked. These need to match up perfectly with what you've entered into phyloGenerator, which can be difficult if you've merged taxa. Phylomatic can be a little fiddly, but phyloGenerator attempts to make some changes to the phylogeny to make things run smoothly. You might like to try using Phylomatic outside phyloGenerator if you hit any major problems.

You can also use GenBank's taxonomy service to find the taxonomy of the species in your phylogeny. Just type 'taxonomy', wait a while, and at the end of your phyloGenerator run a taxonomy of your species will be outputted. You can use this to manually construct a constraint tree, perhaps putting members of the same family together.

Checking

Constraint trees alter the way BEAST and RAxML search for trees, and alter your result. In an ideal world, the genes you've picked convey the 'truth' about the phylogeny of your species, but you might want to know whether that's true. If so, just enter the number of times you want to run a RAxML search both with and without the constraint tree (hit enter to not do so), and you'll be given the mean Robinson-Foulds distance between searches with and without constraints, as well as the variation within the searches.
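For readers unfamiliar with the Robinson-Foulds distance, it counts the splits (groupings of taxa) found in one tree but not the other. The sketch below is a simplified illustration and not what phyloGenerator runs: trees are written as nested Python tuples instead of Newick files, both trees are assumed to contain the same taxa, and all function names are hypothetical.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`, recording each internal clade in `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves

def splits(tree):
    """Return the non-trivial splits of a tree, ignoring where it is rooted."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    reference = min(all_leaves)
    found = set()
    for clade in clades:
        # Store each split as the side that lacks the reference taxon
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
    return found

def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

def mean_rf(trees_a, trees_b):
    """Mean distance over every pairing of a tree from each set of searches."""
    distances = [rf_distance(a, b) for a in trees_a for b in trees_b]
    return sum(distances) / len(distances)

constrained = ((("A", "B"), "C"), ("D", "E"))
unconstrained = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(constrained, unconstrained))
print(mean_rf([constrained, unconstrained], [constrained]))
```

A mean distance between constrained and unconstrained searches that is no bigger than the variation within each set of searches suggests the constraint is not fighting your data.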

Overview

This is it! By simply pressing enter, you will make a phylogeny using RAxML. That means you'll generate a maximum likelihood tree, with branch lengths proportional to mutation rates in your genes (you can 'rate smooth' this tree later if you wish). You can alter RAxML's parameters by typing raxml, or build a Bayesian tree with BEAST by typing beast.

RAxML

You need to chain the options you want to use together, for example accelBootstrap-noPartitions.

If you want a good result, you'll have to 'restart' your search a few times (processing power is cheap; why not 100?). Alternatively, you can use the 'accelBootstrap' method, which will conduct multiple searches and also give bootstrapped confidences on each of the nodes of your phylogeny; this is my preferred option and tends to be quicker than restarts, so why not do this 200 times?

BEAST

The defaults (GTR-GAMMA) are likely to be fine for most uses. You'll be using a relaxed log-normal clock, and you won't be estimating any dates or rates of mutation. At this moment in time, phyloGenerator can't handle multiple searches - this will change in the future.

phyloGenerator does not check MCMC chains for convergence - you will need to do this yourself. This is a deliberate choice; a method that is never wrong does not exist, and I don't want to give people false hope. If you are not comfortable checking the output of a BEAST run, then do not use this option. If someone finds a method (...) I'll incorporate it, and buy them an evening's worth of drinks.

If you use a constraint tree, phyloGenerator will set strong priors on the ages of all named nodes. For example, the node '(Quercus_robur:30, Quercus_ilex:30)Quercus:44' would be drawn from a normal distribution with a mean of 74 and an SD of 1.

A user kindly directed me to a bug that may have affected some BEAST runs using constraint trees in version 1.1a or lower of phyloGenerator. This bug is very easy to detect - all clades have posterior probabilities of 1, and only one topology is found in the posterior. This bug seems quite rare, but if you're in any way concerned please contact me with a DropBox link to your output, and I will check your output for you, and re-run the analysis if necessary. I'm sorry for any inconvenience this may cause.

Overview

RAxML creates phylogenies whose branch lengths are proportional to the number of mutations that have taken place along them. However, for most purposes, you will want a phylogeny whose branch lengths are proportional to time. We can do this either by using PATHd8, which essentially averages out variation in branch lengths (very quickly), or by conducting a BEAST search with the topology constrained to what was found by RAxML. All the caveats above about BEAST searches apply to this method of rate smoothing, although I've never had any convergence issues using BEAST in this way.

PATHd8

By default, the root age is set to 1, and PATHd8 is run like this. I have no plans to incorporate node-specific ages by default into phyloGenerator - it is probably easier to just do this by hand. I would argue that in my field, community phylogenetics, setting the root age to 1 is better than working with a phylogeny with branch lengths proportional to mutation rate.

Note that you have to set an outgroup with this method, which is not the case for BEAST.

BEAST

All the options available to you when performing a normal BEAST run are now available. It might be argued that deviating from the default model choice (GTR-GAMMA) makes no sense, however, as this is (roughly) the model that RAxML uses. As in a normal BEAST search, a log-normal clock is assumed.