How to use phyloGenerator
phyloGenerator is simple enough to use, but it can take time to download everything from GenBank. Whenever you need to input a file or directory, you'll need to give phyloGenerator the 'absolute path' to the file. Something like '/Users/will/Documents/dna.fasta', or 'C:\Documents and Settings\dna.fasta'. If you're on a Mac or Linux computer, do not use something like '~/Documents/dna.fasta' - only '/Users/will/Documents/dna.fasta' will work.
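If you're unsure what the absolute version of a path is, a few lines of Python will tell you. This is only a sketch using Python's standard library and is not part of phyloGenerator; the helper name to_absolute_path is made up for illustration.

```python
import os

def to_absolute_path(path):
    """Expand a leading '~' and return an absolute, normalised path.

    Hypothetical helper for illustration; phyloGenerator does not provide it.
    """
    return os.path.abspath(os.path.expanduser(path))

# Prints the full path that you can paste into phyloGenerator
print(to_absolute_path("~/Documents/dna.fasta"))
```

Paste the printed path into phyloGenerator instead of the '~' version.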
There are some example files included with phyloGenerator - they're in the 'Demos' folder, and usage examples are in the 'tests' file.
Please register for the phyloGenerator mailing list. This way you can get help easily, I can keep you informed of updates, and you can make feature requests more easily.
My advice would be to just try using the thing, and then refer back to this when stuck. You can navigate through quite happily by just whacking enter lots of times. Why not do this once to see what you get?
This is intended to be a fairly exhaustive guide, and so makes the program seem a little intimidating. Bear in mind that phylogenetics is complicated - I don't go into any detail about how RAxML works, for example, because it's the product of several people's entire research careers. I don't have that much space!
Some people prefer video walkthroughs, so I've made two below. In the first (below left), I show the basics of running phyloGenerator.
In the second (below right), I show how to use the sequence checking options, and a brief introduction to alignment checking.
If you want any more, please just ask. I'm sorry about my slightly croaky voice in these videos, I wasn't feeling very well!
phyloGenerator will automatically output all your files, but you need to give it a 'stem name' so that it knows what to call everything. I'd suggest 'test' for your first attempt!
phyloGenerator needs to know where to put your files. Give a full, absolute path.
Enter the abbreviated names of the genes you're interested in, one per line, and then hit enter on a blank line to continue. You can also use presets for plants (rbcL and matK), and for animals (COI).
Not all sequences on GenBank are annotated in the same way; for example, a cytochrome oxidase one sequence might be labelled as a 'COI', 'Cox1', or 'cytochrome oxidase one' sequence. You can use 'gene aliasing' to account for this, in this example specifying the gene's name as "COI-Cox1-cytochrome_oxidase_one" - aliases are split using '-', and spaces are replaced with '_'.
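The aliasing rule above (aliases joined with '-', spaces replaced with '_') is easy to get wrong by hand, so here is a minimal sketch of it in Python. The function name gene_alias is hypothetical and is not something phyloGenerator provides.

```python
def gene_alias(names):
    """Join alternative gene names into one alias string.

    Spaces inside a name become '_' and names are separated by '-',
    following the rule described in this guide.
    """
    return "-".join(name.strip().replace(" ", "_") for name in names)

print(gene_alias(["COI", "Cox1", "cytochrome oxidase one"]))
# COI-Cox1-cytochrome_oxidase_one
```

You would then type the printed string as the gene name when phyloGenerator asks for it.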
If you've already downloaded or sequenced your data, but not aligned it, tell phyloGenerator where that FASTA formatted file is now. Otherwise, just press enter.
If you want to put data from more than one locus (gene) into phyloGenerator (pG), have a separate file for each gene and supply all the files at the same time, separated by commas. For example, '/home/will/first_gene.fasta,/home/will/second_gene.fasta'. The names of the sequences will be used to match the data up. The files pG writes out after a search for more than one gene are already suitable for this.
If you have DNA alignment(s) you really want to use, these can be given to pG as command-line arguments, but pG won't ask you for an alignment while it's running interactively. Something like './phyloGenerator.py -existingAlignment /home/will/first_gene.fasta,/home/will/second_gene.fasta' would work well. Note that pG is expecting the output from a previous pG run, which means alignments that don't have the same number of taxa in the same order will lead to problems! You will not be given an opportunity to alter these alignments - if you're providing an alignment, I presume you're happy with it (...I'm not sure why you're using pG in that case, but you asked for it, so you've got it!).
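Because sequence names are used to match loci up, it can be worth checking them before you start. The sketch below is an assumption-laden illustration and not phyloGenerator's own code: it treats everything after '>' on a header line as the sequence name, and the function names are made up.

```python
def fasta_names(text):
    """Return the sequence names (header lines without '>') in FASTA text."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]

def names_missing_per_file(fasta_texts):
    """Map each file label to the names seen in other files but not in it."""
    names = {label: set(fasta_names(text)) for label, text in fasta_texts.items()}
    all_names = set().union(*names.values())
    return {label: sorted(all_names - found) for label, found in names.items()}

# Two tiny, made-up FASTA files held as strings
first = ">Quercus_robur\nACGT\n>Quercus_ilex\nACGA\n"
second = ">Quercus_robur\nTTGA\n"
print(names_missing_per_file({"first_gene": first, "second_gene": second}))

# The comma-separated form that is typed at the prompt
print(",".join(["/home/will/first_gene.fasta", "/home/will/second_gene.fasta"]))
```

A species missing from one file is not necessarily a problem, but a misspelt name will show up here as two different species.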
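Since existing alignments need the same taxa in the same order, a quick check before running pG can save a confusing failure. This is a rough sketch, assuming FASTA-formatted alignments whose header lines hold the taxon names; the function names are hypothetical and not part of pG.

```python
def fasta_names(text):
    """Return the sequence names (header lines without '>') in FASTA text."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]

def same_taxa_same_order(alignment_texts):
    """True if every alignment lists exactly the same names in the same order."""
    name_lists = [fasta_names(text) for text in alignment_texts]
    return all(names == name_lists[0] for names in name_lists[1:])

# Made-up alignments: the second has its taxa in a different order
gene_one = ">Quercus_robur\nAC-T\n>Quercus_ilex\nACGT\n"
gene_two = ">Quercus_ilex\nTT-A\n>Quercus_robur\nTTGA\n"
print(same_taxa_same_order([gene_one, gene_one]))  # True
print(same_taxa_same_order([gene_one, gene_two]))  # False
```

To use it on real files, read each file's contents into a string first.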
If you don't have DNA sequences already, you'll need to download some. Tell phyloGenerator where a plain text file containing the binomial names of the species you want is. Each species must be on a separate line, with no formatting of any kind in the file. Otherwise, just press enter.
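If your species names live in a spreadsheet or a script, one way to get a clean plain text file is to write it from Python. This is a sketch under stated assumptions: the species names and the file name 'species.txt' are invented, the helper is not part of phyloGenerator, and it only tidies stray whitespace.

```python
import os
import tempfile

def write_species_list(species, path):
    """Write one binomial name per line as plain text, with no other formatting."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name in species:
            # Collapse stray whitespace so each line is just 'Genus species'
            handle.write(" ".join(name.split()) + "\n")

species = ["Quercus robur", "Quercus  ilex ", "Fagus sylvatica"]
out_path = os.path.join(tempfile.gettempdir(), "species.txt")
write_species_list(species, out_path)
print(out_path)  # an absolute path you can give to phyloGenerator
```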
If you're downloading DNA, you'll need to provide a valid email address so Entrez can contact you about your downloads - they would only do this if you download too much, and it has never happened to me.
You may have to wait some time before your sequences have all been downloaded, but phyloGenerator will tell you how it's doing. By default, it waits ten seconds after downloading each DNA sequence so that it doesn't irritate the servers at GenBank.
You may see warnings relating to 'NCBI's DTD' files. This is nothing to worry about.
You will be asked if you want to use the new 'referenceDownload' method; this is not described in the pG paper, so if you're in any doubt just hit enter to carry on without it. The referenceDownload method is intended to make building extremely large phylogenies much easier, and I'd strongly recommend you check out the FAQ about it if you're interested in doing that. There's an example of how to use the method in the Silwood_Plants demo.
You can now supply phyloGenerator with GenBank TaxonIDs of species, instead of binomial names, which it will search for. Just specify the '-taxonIDs' option from the command line, and you're away. There's an example of how to use the method in the Silwood_Plants demo. You should think about doing this if you're concerned that taxonomic mismatches on GenBank are going to cause problems for your analysis, and you can find out more about TaxonIDs here.
This table shows each DNA sequence's 'ID', the name used when searching GenBank (if appropriate), the name it was given on GenBank (usually an ID number), and its length. Sequences that are in the top or bottom 5% of sequence lengths are highlighted in the list. Be sensible: if you've downloaded an entire mitochondrial genome (~16000 base pairs) then this won't happily align with a single gene, and you'll crash most of the programs that phyloGenerator uses. Trim those long sequences! Remember that there will always be a 'longest' sequence - but by trimming your sequences you can be sure they're not too long.
There are five modes when 'checking' DNA: delete, reload, trim, replace, and merge. You can enter any of the modes by typing their names. To continue, simply hit enter.
Simply press enter, and phyloGenerator will align your DNA using MAFFT. However, you have a choice of programs: MUSCLE, MAFFT, Clustal-Omega, or PRANK. To use any of them, simply enter muscle, mafft, clustalo, or prank respectively. To compare the alignments of several programs, enter everything; to use just the first three, type quick. Note that PRANK can take a very long time, and I often use 'quick'. When you're ready to continue, just hit enter.
You'll pick an alignment for each gene (you can only have one alignment) by entering the alignment number for each gene; if you've only used one alignment method, phyloGenerator will just continue straight through.
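To get a feel for what the length highlighting in the sequence table picks out, here is a rough sketch of flagging sequences in the top or bottom 5% of lengths. It is an illustration only: phyloGenerator's exact percentile rule isn't documented here, so the use of Python's statistics.quantiles and the function name are assumptions.

```python
import statistics

def flag_extreme_lengths(lengths):
    """Return True for each length below the 5th or above the 95th percentile.

    Uses statistics.quantiles with 20 groups, so the first and last cut
    points are the 5% and 95% boundaries. Needs at least two lengths.
    """
    cuts = statistics.quantiles(lengths, n=20)
    low, high = cuts[0], cuts[-1]
    return [length < low or length > high for length in lengths]

# Twenty gene-sized sequences plus one whole mitochondrial genome (made-up data)
lengths = [600 + i for i in range(20)] + [16000]
flags = flag_extreme_lengths(lengths)
print([length for length, flagged in zip(lengths, flags) if flagged])
```

Note that something is always flagged, which is the point made above: there will always be a 'longest' sequence, so use the flags as a prompt to look, not as a verdict.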
Remember you can return to the sequence download stage, or align sequences again, if you're unhappy with what you have. It is strongly recommended that you look at your alignment before using it. Summary statistics for your alignment(s) have been outputted, and you may notice that one alignment has more gaps than another, or that they all seem OK. You can write out your alignment(s) by typing output. If you wish to return to the DNA download or alignment stages, just type DNA or align. There are several other options available to you: trimAl, raxml=X and metal.
For each alignment, across all of its sequences, 'Med. Gaps' is the median number of gaps, 'SD Gaps' the standard deviation of the number of gaps, 'Min-Max Gaps' the smallest and largest number of gaps in one sequence, 'Med. Gap Fraction' the median of the fraction of each sequence made up of gaps, 'M-M Gap Fraction' the smallest and largest fraction of a sequence made up of gaps, and 'Warn?' indicates if your alignment is unusually long, given the maximum length of the sequences going into it. Look at these numbers; in particular, warning flags are something to think about.
It is strongly advised that you use a constraint tree in your analysis. Generating novel phylogenetic hypotheses without a fair amount of knowledge is a tricky thing, so by constraining your tree to fit in with the general consensus of how your group fits together, you're reducing the likelihood that your results will be untenable.
If you already have a constraint tree, type newick, then tell phyloGenerator where the phylogeny is. To build a constraint tree with Phylomatic, type phylomatic, then give the path to the reference phylogeny and taxonomy when asked. These need to match up perfectly with what you've entered into phyloGenerator, which can be difficult if you've merged taxa. Phylomatic can be a little fiddly, but phyloGenerator attempts to make some changes to the phylogeny to make things run smoothly.
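If you'd like to see where the gap columns in the alignment summary come from, the sketch below computes similar numbers for a toy alignment. It is my reading of the column descriptions, not phyloGenerator's code: the function name is hypothetical, gaps are assumed to be '-' characters, and the sample standard deviation is an assumption. The 'Warn?' column is left out because its threshold isn't described here.

```python
import statistics

def gap_summary(sequences):
    """Summarise gaps ('-') across the aligned sequences of one alignment."""
    gaps = [seq.count("-") for seq in sequences]
    fractions = [seq.count("-") / len(seq) for seq in sequences]
    return {
        "median_gaps": statistics.median(gaps),
        "sd_gaps": statistics.stdev(gaps),  # sample SD; needs 2+ sequences
        "min_max_gaps": (min(gaps), max(gaps)),
        "median_gap_fraction": statistics.median(fractions),
        "min_max_gap_fraction": (min(fractions), max(fractions)),
    }

# A made-up three-sequence alignment
alignment = ["AC-T", "A--T", "ACGT"]
print(gap_summary(alignment))
```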
You might like to try using Phylomatic outside phyloGenerator if you hit any major problems.
You can also use GenBank's taxonomy service to find the taxonomy of the species in your phylogeny. Just type 'taxonomy', wait a while, and at the end of your phyloGenerator run a taxonomy of your species will be outputted. You can use this to manually construct a constraint tree, perhaps putting members of the same family together.
Constraint trees alter the way BEAST and RAxML search for trees, and alter your result. In an ideal world, the genes you've picked convey the 'truth' about the phylogeny of your species, but you might want to know whether that's true. If so, just enter the number of times you want to run a RAxML search both with and without the constraint tree (hit enter to not do so), and you'll be given the mean Robinson-Foulds distance between searches with and without constraints, as well as the variation within the searches.
This is it! By simply pressing enter, you will make a phylogeny using RAxML. That means you'll generate a maximum likelihood tree, with branch lengths proportional to mutation rates in your genes (you can 'rate smooth' this tree later if you wish). You can alter RAxML's parameters by typing raxml, or build a Bayesian tree with BEAST by typing beast.
With RAxML, you need to chain the options you want to use together, for example accelBootstrap-noPartitions. If you want a good result, you'll have to 'restart' your search a few times (processing power is cheap; why not 100?). Alternatively, you can use the 'accelBootstrap' method, which will conduct multiple searches and also give bootstrapped confidences on each of the nodes of your phylogeny; this is my preferred option and tends to be quicker than restarts, so why not do this 200 times?
With BEAST, the defaults (GTR-GAMMA) are likely to be fine for most uses. You'll be using a relaxed log-normal clock, and you won't be estimating any dates or rates of mutation.
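The constraint tree check mentioned earlier reports Robinson-Foulds distances, which count the groupings found in one tree but not the other. The sketch below shows the idea on small rooted trees written as nested tuples; it is a simplified rooted variant for illustration, not what RAxML runs (RAxML compares bipartitions of unrooted trees), and the function names are made up.

```python
def clades(tree):
    """Return the set of non-trivial clades (as frozensets of tip names)."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    found.discard(leaves(tree))  # the clade holding every tip tells us nothing
    return found

def rf_distance(tree_a, tree_b):
    """Count the clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

# Two made-up four-species trees that disagree about who is sister to A
tree_one = ((("A", "B"), "C"), "D")
tree_two = ((("A", "C"), "B"), "D")
print(rf_distance(tree_one, tree_one))  # 0
print(rf_distance(tree_one, tree_two))  # 2
```

A distance of 0 means the two searches agree on every grouping; larger values mean the constraint is changing more of the tree.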
At this moment in time, phyloGenerator can't handle multiple searches - this will change in the future. phyloGenerator does not check MCMC chains for convergence - you will need to do this yourself. This is a deliberate choice; a method that is never wrong does not exist, and I don't want to give people false hope. If you are not comfortable checking the output of a BEAST run, then do not use this option. If someone finds a method (...) I'll incorporate it, and buy them an evening's worth of drinks.
If you use a constraint tree, phyloGenerator will set strong priors on the ages of all named nodes. For example, the node '(Quercus_robur:30, Quercus_ilex:30)Quercus:44' would be drawn from a normal distribution with a mean of 74 and an SD of 1.
A user kindly directed me to a bug that may have affected some BEAST runs using constraint trees in version 1.1a or lower of phyloGenerator. This bug is very easy to detect - all clades have posterior probabilities of 1, and only one topology is found in the posterior. This bug seems quite rare, but if you're in any way concerned please contact me with a DropBox link to your output, and I will check your output for you, and re-run the analysis if necessary. I'm sorry for any inconvenience this may cause.
RAxML creates phylogenies whose branch lengths are proportional to the number of mutations that have taken place along them. However, for most purposes, you will want a phylogeny whose branch lengths are proportional to time. We can do this either using PATHd8, which essentially averages out variation in branch lengths (very quickly), or by conducting a BEAST search with the topology constrained to what was found by RAxML. All the caveats above about BEAST searches apply to this method of rate smoothing, although I've never had any convergence issues using BEAST in this way.
By default, the root age is set to 1, and PATHd8 is run like this.
I have no plans to incorporate node-specific ages by default into phyloGenerator - it is probably easier to just do this by hand. I would argue that in my field, community phylogenetics, setting the root age to 1 is better than working with a phylogeny with branch lengths proportional to mutation rate. Note that you have to set an outgroup with this method, which is not the case for BEAST.
All the options available to you when performing a normal BEAST run are available here too. It might be argued that deviating from the default model choice (GTR-GAMMA) makes no sense, however, as this is (roughly) the model that RAxML uses. As in a normal BEAST search, a log-normal clock is assumed.
You're now done! phyloGenerator has produced intermediate files (sequences downloaded, aligned sequences, a log of what was done) and your final phylogeny/phylogenies with the stem name you specified at the beginning. Check your phylogeny. Does it look how you'd expect? Are there unusually long branches? Are there polytomies (lots of branches from the same node)? The FAQ has some hints on what to do if you're not happy.
phyloGenerator is very much a work in progress. If there are features you'd like to see in the program, please list them on the feature tracking list (if they're not already there). phyloGenerator does exactly what I want it to do, but you likely want other things. We're happy to give them to you, but we need to know what you need! Thank you! In particular, I'm working on more 'sanity checks' for the output of phyloGenerator, so I'd be grateful for any datasets you've run through the program. For the time being, in exchange for letting me see your results, I'll discuss them with you and give you some help!
DNA Checking
Alignment
Alignment
Alignment Checking
Constraint Tree
Building a Constraint Tree
Checking
Phylogeny Generation
Overview
RAxML
BEAST
Rate Smoothing
Overview
PATHd8
BEAST
Output
End of the line!
Feedback