phyloGenerator2

Preamble

I strongly advise you to check the output from each step of pG2. Building a phylogeny is pointless if your alignment is awful, and using a phylogeny is pointless if you're not confident in the accuracy of that phylogeny (via bootstrapping, replicates, and the like). In version 1, I went to some pains to provide warnings if you went wrong. I do not do that here; if you are not a phylogeneticist, I strongly advise you to start with phyloGenerator1. Version 2 is what I use; if you know what you're doing, and want to build a large phylogeny, I sense you will find it useful.

phyloGenerator2 goes in a different direction to phyloGenerator1: whereas before I went to great pains to neaten output, check what you were doing, and provide a tidy user interface, I don't do that here. This means you get to the good stuff much quicker, but it also means there is a burden on you, as the user, to somewhat understand what you're doing. For example, I don't neaten the RAxML output: you get everything, raw. If you want a single, one-click, ecologist-friendly program please use phyloGenerator1, or the phy.build family in pez; these options will get you an accurate phylogeny without any fuss, and once you're comfortable with them you're probably ready for phyloGenerator2.

Please check the FAQ and read this guide carefully. If you have any questions, please get in touch with me.

New Methods

Hawkeye

My usual approach to phylogenetics is to build an alignment, identify the sequences in it that are obviously nonsense because they:

are vastly longer than the other sequences
don't align with them (i.e., are in one long stream at the end)
have somehow aligned (badly) with some regions and have vast gaps between those regions (a common problem with MUSCLE, I find, but other methods too)

...and then I remove them. referenceDownload in pG1 handled some of these, but the last two problems still cropped up occasionally. In pG2, I've added the Hawkeye method, which actively looks for sequences with gaps where the consensus alignment doesn't have them, and then removes them. In my own experiments, I've found it means I spend a tenth of the time cleaning alignments that I used to; Your Mileage May Vary, but I encourage you to play around with its options and see what you end up with.
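To make the idea concrete, here is a minimal sketch of this kind of check. It is not pG2's actual implementation: the function names, the 50% rule for deciding where the consensus has a residue, and the default thresholds (max_runs, min_run) are all assumptions made for illustration.

```python
def consensus_columns(alignment, threshold=0.5):
    """Mark each column True if more than `threshold` of sequences have a residue there."""
    n_seqs = len(alignment)
    length = len(next(iter(alignment.values())))
    occupied = []
    for i in range(length):
        residues = sum(1 for seq in alignment.values() if seq[i] != "-")
        occupied.append(residues / n_seqs > threshold)
    return occupied


def count_gap_runs(seq, occupied, min_run):
    """Count runs of at least `min_run` gaps that fall where the consensus has residues."""
    runs, current = 0, 0
    for char, col_occupied in zip(seq, occupied):
        if not col_occupied:
            continue  # the consensus is gapped here too, so ignore this column
        if char == "-":
            current += 1
        else:
            if current >= min_run:
                runs += 1
            current = 0
    if current >= min_run:
        runs += 1  # a run that reaches the end of the sequence
    return runs


def flag_bad_sequences(alignment, max_runs=1, min_run=3):
    """Return the names of sequences with more than `max_runs` qualifying gap runs."""
    occupied = consensus_columns(alignment)
    return sorted(name for name, seq in alignment.items()
                  if count_gap_runs(seq, occupied, min_run) > max_runs)


# Hypothetical toy alignment (equal-length strings, "-" marks a gap)
alignment = {
    "sp_a": "ACGTACGTACGT",
    "sp_b": "ACGTACGTACGT",
    "sp_c": "ACGTACGTACGT",
    "sp_d": "A---AC---CGT",
}
print(flag_bad_sequences(alignment))
```

In this toy case only sp_d is flagged, because it has two runs of three gaps in columns where every other sequence has a base. Loosening either threshold lets it through, which is the kind of trade-off worth exploring with the real options.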

referenceDownload

phyloGenerator2 forces you to use the referenceDownload method from the original phyloGenerator. This means that you must supply it with an example of what some sequences for each locus should look like, and pG2 will find sequences that align well with those sequences (as defined by the parameters to follow). Finding such reference sequences is easy: go to GenBank and find some! Just a handful of sequences will do the job, and while more is probably better, with too many (>100) sequence download could get a bit slow. This is not an unusual requirement for a program like pG2; phlawd and NCBIminer have the same requirement. pG2 uses alignment to find these sequences; I think this is faster, and I like that it's based around what you're actually going to use in phylogenetic construction and doesn't rely on similarity of sequences (similarity --> closely related, so could affect our answer). Your Mileage May Vary!

Basics

You need three things to do a phyloGenerator2 run:

A parameter file
A species list
Reference sequences for each locus you want to download

The demo folder contains examples of all of these things. Remember to always use absolute paths within your parameter file, and to enter your own email address into it; otherwise you'll get into trouble with GenBank. You can write comments into your parameter file by putting a # at the start of them. The parameter file is just a YAML file. The demo parameter file (demo.yml) contains everything you need to know to get started. All parameters are set with :; for example email : me@gmail.com would set your email address. Spaces don't matter.
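If you want to catch the two most common mistakes (relative paths and a missing email address) before starting a run, something like the following sketch could help. It is not part of pG2: check_params is a hypothetical helper, and it assumes the parameter file has already been loaded into a Python dict (for example with a YAML library) with email at the top level and a ref_file inside each gene block, as described in this guide.

```python
import os


def check_params(params):
    """Return a list of human-readable problems found in a loaded parameter dict."""
    problems = []

    # An email address is needed for GenBank; only a crude "@" check is done here.
    email = str(params.get("email") or "")
    if "@" not in email:
        problems.append("email is missing or does not look like an address")

    # Every reference file should be given as an absolute path.
    for gene, block in (params.get("genes") or {}).items():
        ref_file = str(block.get("ref_file") or "")
        if not os.path.isabs(ref_file):
            problems.append(f"{gene}: ref_file is not an absolute path: {ref_file!r}")

    return problems


# Hypothetical parameters, mirroring the layout described in this guide
params = {
    "email": "me@gmail.com",
    "genes": {"rbcL": {"ref_file": "refs/rbcL.fasta"}},
}
for problem in check_params(params):
    print(problem)
```

An empty list means neither problem was found; it says nothing about whether the rest of the file is valid.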

DNA Download

In the parameter file, this section consists of a genes block, within which there is another block for each gene you want to download data for. In the demo, you have: genes: rbcL:

...which basically just means "start my genes block", and "my first gene is called rbcL". The parameters after that are:

ref_file - the location of your reference sequence file (FASTA format), as described in the referenceDownload section above
ref_min - the minimum length a downloaded sequence must be
ref_max - the maximum length of an alignment of your sequence when aligned with ref_file. This will also be used for the hawkeye method (see below).
max_dwn - maximum number of sequences to try for each species
fussy - (optional) if present and set to false, don't use GenBank gene and organism annotations, and don't attempt to trim a sequence down using GenBank annotations.
aliases - an array of alternative names for your gene. For example, if your block was called COI, and you specified aliases : [cytochrome oxidase one, cox1], pG2 would search for COI, cytochrome oxidase one, and cox1. You must use the square brackets ([]); otherwise bad things will happen.
gap_length - (hawkeye only) maximum number of gaps in sequence when aligned
gap_length - (hawkeye only) length that defines a gap in gap_length
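The aliases behaviour described above amounts to building a list of search names from the block name plus its aliases. A minimal sketch follows, assuming the gene block has been loaded into a Python dict; search_terms is a hypothetical helper, not a pG2 function. The type check illustrates one reason the square brackets matter: without them, a YAML parser returns a single string rather than a list.

```python
def search_terms(gene_name, gene_block):
    """Return the block name followed by its aliases, skipping duplicates."""
    aliases = gene_block.get("aliases", [])
    if not isinstance(aliases, list):
        # e.g. the brackets were left out, so the parser produced one string
        raise TypeError("aliases must be a list, e.g. [cytochrome oxidase one, cox1]")
    terms = [gene_name]
    for alias in aliases:
        if alias not in terms:
            terms.append(alias)
    return terms


coi_block = {"aliases": ["cytochrome oxidase one", "cox1"]}
print(search_terms("COI", coi_block))
```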

There are no hard and fast rules for how to choose these parameters; I tend to set ref_min to be just less than half the expected length of the locus to get partial sequences (if I want them), and I set ref_max to be about 120% of the length of my reference sequences. fussy is useful if you're worried about weird GenBank naming conventions; I find that 16S, 18S, etc. sequences don't have useful annotations, and so limiting my search on the basis of them means I never find anything. phyloGenerator1 handles all this for you automatically, but many of you wanted flexibility so here it is!
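The rule of thumb above is simple arithmetic, sketched below. The helper names are hypothetical, "just less than half" is taken to mean 45% of the expected locus length, and the 120% is applied to the longest reference sequence; all three are assumptions, so adjust them to taste.

```python
def fasta_lengths(text):
    """Return the length of each sequence in a FASTA-formatted string."""
    lengths, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0  # start counting a new sequence
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def suggest_lengths(ref_lengths, expected_locus_length):
    """Suggest (ref_min, ref_max) using integer arithmetic to avoid rounding surprises."""
    ref_min = expected_locus_length * 45 // 100  # a little under half the locus
    ref_max = max(ref_lengths) * 120 // 100      # about 120% of the longest reference
    return ref_min, ref_max


# Hypothetical reference lengths and expected locus length
print(suggest_lengths([1400, 1428], expected_locus_length=1400))
```

To use it on a real reference file, read the file's contents and pass them to fasta_lengths first.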

Options

The following options are all optional, and most of them turn on what I consider to be useful features of pG2. The most useful one is arguably cache; using it you download once, and then play with different options for subsetting sequences and building phylogenies.

hawkeye - if present and set to true, runs the new hawkeye sequence check. The parameters are described in the previous section (DNA Download), and the new method is described at the top of this page. If you use this option, you'll get a new folder (hawkeye) in your output, with a load of species that are re-named to mark them as 'bad'. Anything still in the seqs folder has passed the hawkeye check.
cache - skip DNA download for all species that have at least one DNA sequence (as outputted by pG2) in this folder. Note that a single sequence will be sufficient to stop further searches. There's no point in searching for the same species over and over again if you didn't find anything the first time and the settings haven't changed; just remove those species from your species list the second time.
phy_method - either raxml, examl, or exabayes, depending on which program you want to use to build your phylogeny. You don't have to supply this option; giving nothing will just download sequences.
If using ExaBayes, bear in mind that pG2 does no summarising of the posterior, unlike pG1. The reason for this is simple: despite my warnings, people who had no idea how to conduct a Bayesian analysis were using pG, and so many were not checking for mixing, convergence, that they had sufficient samples, etc. I then received a number of very angry emails from people who hadn't followed instructions (and had often ignored pG1's warnings) that their results were "obviously false". I don't enjoy being shouted at, so I dropped support for this.
constraint - location of (Newick) constraint tree to be used in your build. Note that only RAxML supports a constraint tree at this time, and nothing supports any kind of dating (yet; I plan to add it). Your constraint tree must contain all of the species you are attempting to download data for, but if you can't find some data for some species pG2 will just drop those species for you.
partition - if true, a separate rate matrix will be fit to each locus, along with gamma parameters, etc. I can't imagine you would want to set this to false.

phy_options - each phylogenetic construction program has special options that can be passed to it. I've not written a parser for each program, as I did in pG1, but I have written some examples of common things you might want to do that essentially replicate the work of pG1. Experience tells me most users want to do something I hadn't thought of; you wanted the power, now you've got it!
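The cache rule described above (one existing sequence is enough to skip a species) can be sketched as follows. This is an illustration, not pG2's code: species_to_download is a hypothetical helper, and the file naming it assumes (species name with underscores, then the locus, ending in .fasta) is a guess based on the *_locus.fasta pattern used elsewhere in this guide, so check it against your own seqs folder.

```python
import os


def species_to_download(species, cache_dir):
    """Return the species that have no sequence file at all in cache_dir."""
    cached_files = os.listdir(cache_dir)
    todo = []
    for name in species:
        # Assumed naming: "Quercus robur" -> "Quercus_robur_<locus>.fasta"
        prefix = name.replace(" ", "_") + "_"
        has_any = any(f.startswith(prefix) and f.endswith(".fasta")
                      for f in cached_files)
        if not has_any:
            todo.append(name)  # nothing cached, so this species still needs a search
    return todo
```

Note that the check is per species, not per locus, which matches the warning that a single sequence is sufficient to stop further searches.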

Output

The output varies a little depending on what options you choose, but if you do everything in one step you'll get three folders: seqs with the best sequences, phylo with the phylogeny output, and bad_seqs with the output from hawkeye. In almost all cases, all sequences are in FASTA format, and GenBank IDs are inside the files themselves. The naming of the files should be fairly self-explanatory!

Tips and Tricks

To make a FASTA file with each locus' sequences, just cat *_locus.fasta > combined_locus.fasta.
If you want to add more loci, just make a new search with all your species and just those loci. Then, copy the sequences from this run into the same folder as those from your last run(s), and you have a cache folder you can use.
If you want a concatenated alignment, look in the phylo folder; there should be one there (in various file formats that are partially a function of the phylogenetic program you're using).
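If you would rather script the per-locus combination than use cat, a minimal Python equivalent is sketched below. combine_locus is a hypothetical helper; it assumes files are named *_<locus>.fasta as in the cat example. Unlike the one-liner, it skips the combined file itself, so re-running it does not append the previous output to the new one.

```python
import glob
import os


def combine_locus(seq_dir, locus):
    """Concatenate every *_<locus>.fasta file in seq_dir into combined_<locus>.fasta."""
    out_path = os.path.join(seq_dir, f"combined_{locus}.fasta")
    pattern = os.path.join(seq_dir, f"*_{locus}.fasta")
    # The output name also matches the pattern, so leave it out of the inputs.
    inputs = sorted(p for p in glob.glob(pattern)
                    if os.path.abspath(p) != os.path.abspath(out_path))
    with open(out_path, "w") as out:
        for path in inputs:
            with open(path) as handle:
                text = handle.read()
            if text:
                # Make sure each file ends with a newline before the next header
                out.write(text if text.endswith("\n") else text + "\n")
    return out_path
```

For example, combine_locus("/path/to/seqs", "rbcL") would write combined_rbcL.fasta next to the input files; the path here is a placeholder.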