© Springer International Publishing AG 2017
Christoph Bleidorn, Phylogenomics, 10.1007/978-3-319-54064-1_9

9. Sources of Error and Incongruence in Phylogenomic Analyses

Christoph Bleidorn
(1)
Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), Madrid, Spain
 
  • Phylogenomic analyses can be performed by analysing gene trees separately and using coalescent or supertree analyses or a concatenation of all genes (supermatrix approach).
  • Several sources of systematic error may bias phylogenomic studies due to the violation of substitution model assumptions, including problems with compositional heterogeneity, among-lineage rate variation and heterotachy.
  • Missing data is usually less problematic for phylogenomic studies, but taxon sampling can be critical.
  • Data and taxa should be carefully selected for analysis; highly saturated genes as well as phylogenetically unstable (rogue) taxa should be avoided.
  • Discordance of gene trees and species trees is not rare, and potential causes are incomplete lineage sorting, hybridization or horizontal gene transfer.
  • Coalescent-based methods are able to infer species trees when gene trees are incongruent due to incomplete lineage sorting.

9.1 Incongruence in Phylogenomic Analyses

At the end of the 1990s and in the early 2000s, molecular phylogenetic analyses revolutionized phylogenetic systematics. Many results changed textbook knowledge about the evolutionary relationships of plants and animals and enabled a new picture of the phylogeny of eukaryotes as a whole (Donoghue and Doyle 2000; Halanych 2004; Adl et al. 2005). Many of these early analyses were based on a single or a few genes, leaving many nodes, especially those deep in time, unsupported or unresolved. Current practice in phylogenomic analyses can be broadly classified into two approaches: supermatrix and gene tree-based analyses of hundreds or thousands of genes (Liu et al. 2015). In supermatrix analyses, all gene alignments are concatenated into a single matrix, which is subsequently analysed using the chosen phylogenetic method. In gene tree-based analyses, all genes are analysed separately, and in a second step, the resulting topologies are (subsequently or simultaneously) used to construct a supertree (Bininda-Emonds 2004) or a species tree based on coalescent theory (► see Sect. 9.4). Phylogenomic approaches are able to produce precise estimates of phylogeny; however, precision does not mean that the result reflects the true evolutionary history (Kumar et al. 2012), as several factors can mislead phylogenetic analyses even when a massive amount of data is available.
The era of phylogenomic analyses to resolve relationships among organisms was basically kick-started in 2003. By analysing 106 different genes to resolve the phylogeny of yeast, Rokas et al. (2003) found incongruence among the gene trees, some of which strongly supported competing hypotheses (◘ Fig. 9.1). The incongruence disappeared when all genes were combined in a genome-scale approach. Moreover, it was shown that a concatenation of any 20 of these 106 genes always recovered the best topology with bootstrap values of at least 95% for each node. Even though details of this study have been criticized as unrealistic (Gatesy et al. 2007), it clearly supported the idea that phylogenomic approaches could end incongruence in phylogenetics (Gee 2003). Whereas genome-scale approaches for most groups of non-model organisms remained a pipe dream in 2003, the availability of next-generation sequencing (NGS) techniques has since allowed huge datasets to be gathered for basically every taxon of interest (Rokas and Abbot 2009).
Fig. 9.1
Incongruence among gene trees from a phylogenomic analysis of yeast relationships (Reprinted by permission from Macmillan Publishers Ltd.: [Nature] Rokas et al. (2003), copyright 2003)
There are several reasons why trees inferred from single genes (i.e. gene trees) might differ from each other (Jeffroy et al. 2006). First, this might be a stochastic error associated with a lack of sufficient phylogenetic signal, which could be overcome by combining more (informative) genes. This approach assumes that combining more genes into a single data matrix should increase the phylogenetic signal-to-noise ratio compared to single genes (de Queiroz and Gatesy 2007). Second, a gene tree can differ from the species tree because of a violation of the orthology assumption, incomplete lineage sorting or horizontal gene transfer. Certain methods detect such problems and deal with them in phylogenomic datasets (► see Sect. 9.4). Third, systematic errors present in single genes might also lead to artefacts in the phylogenetic reconstruction (► see Sect. 9.2). Such systematic errors are usually due to the violation of assumptions of the underlying model, including (I) heterogeneity of the nucleotide/amino acid composition among lineages (compositional signal), (II) variation of the substitution rate among lineages (rate signal) and (III) variation in the substitution rate within nucleotide positions over time (heterotachous signal). All these patterns are generally not accounted for by the evolutionary model and might negatively impact phylogenetic reconstruction.
Often, high statistical support (e.g. bootstrapping) is taken as a measure that the tree is correct. However, it is important to remember that these measures assess the stability of the obtained relationships to sampling error (White et al. 2007). Bootstrap analyses detect if datasets contain a pattern and how strong this is but are not able to decide whether or not this pattern represents genuine phylogenetic signal. Systematic error can negatively affect phylogenetic inference even with single genes, but it becomes stronger when multiple genes are combined into a supermatrix, simply because the addition of more (biased) genes will increase the support for a biased (wrong) result. As expressed by Jeffroy et al. (2006), phylogenomic analyses, rather than resolving the entire tree of life, might in fact be the beginning of incongruence (► see Infobox 9.1 for an example). Furthermore, combined datasets from hundreds of genes often contain large amounts of missing data (Roure et al. 2013), which could additionally influence the analysis (► see Sect. 9.3).

9.1.1 Infobox 9.1 Which Taxon Is the Sister Group of All Other Animals?

It was basically written in stone that sponges (Porifera) represent the sister taxon of all other animals, and the discussion rather concerned whether sponges are monophyletic or whether different sponge taxa branch off successively at the base of the animal tree (Sperling et al. 2007; Philippe et al. 2009). However, some phylogenomic analyses surprisingly started to find that the enigmatic Ctenophora (known as comb jellies or sea gooseberries) could represent the sister taxon of all other animals (Dunn et al. 2008; Moroz et al. 2014). This placement has important implications for how the evolution of several organ systems is understood (◘ Fig. 9.2) (Telford et al. 2016). Under the latter hypothesis, it has to be assumed either that the nervous system, muscles and epithelia evolved twice convergently or that all these characters were already present in the last common ancestor of animals and were lost in sponges. This controversy led to a heated debate about phylogenomic methodology and systematic error and about how much trust can be put into phylogenomic analyses of very deep divergences. Proponents of the «Porifera-sister» scenario claimed that the result supporting the «Ctenophora-sister» hypothesis represents a long-branch attraction (LBA) artefact, which might be introduced by a poor fit of the evolutionary models to the analysed data, as well as by the out-group choice (Pisani et al. 2015). In contrast, proponents of the «Ctenophora-sister» hypothesis analysed the sensitivity of phylogenomic analyses to model and gene choice (Whelan et al. 2015) and used an extensive taxon sampling. Their analysis of possible sources of systematic error found no biases affecting the position of Ctenophora as sister taxon to all other animals. Instead, some genes included in previous analyses supporting the «Porifera-sister» hypothesis were identified as introducing conflicting signal, thereby supporting a possibly wrong hypothesis for the placement of Ctenophora.
This result is in line with a previous study by Nosenko et al. (2013), who by modifying gene and out-group taxon sampling were able to recover three different but well-supported phylogenies of non-bilaterian animals. The controversy remains unresolved (Giribet 2016) and has shifted to the questions of which models are suited to analysing datasets with massive substitutional heterogeneity and how to perform phylogenomic analyses for deep phylogenies (Whelan and Halanych 2016).
Fig. 9.2
Competing hypotheses regarding which taxon represents the sister group of all other animals and its evolutionary implications (Reprinted by permission from Macmillan Publishers Ltd.: [Nature] (Telford et al. 2016), copyright 2016)

9.2 Systematic Errors

The problem of systematic errors biasing phylogenetic analyses was recognized early on by Felsenstein (1978), who described conditions under which maximum parsimony (MP) inference is misled by the attraction of long branches in a tree irrespective of the true relationships (◘ Fig. 9.3). This phenomenon was termed «long edges attract» by Hendy and Penny (1989), and it is nowadays generally known as long-branch attraction (LBA). Many simulation studies have shown that MP is the method most sensitive to the LBA artefact, whereas maximum likelihood (ML) and Bayesian inference (BI) are more robust (Philippe et al. 2005b). Nevertheless, probabilistic phylogenetic reconstruction methods can also be affected by LBA when the assumptions of the underlying model are violated by the data (Huelsenbeck 1995). Even though LBA is often invoked when phylogenetic analyses lead to unexpected results, a clear (statistically based) definition of the phenomenon is missing. Some authors defined LBA loosely as a condition where analyses are biased due to a combination of short and long branches (Sanderson et al. 2000; Bergsten 2005), which basically translates to a bias due to variation of the substitution rate across lineages. Parks and Goldman (2014) systematically analysed the placement of long branches using simulation studies and found that even single long branches are difficult to place in a phylogeny, even when using ML. Interestingly, they also found that there is no attraction between two long branches, even though they seem to be joined together disproportionately often. This observation has an impact on several approaches which were proposed to detect LBA in real datasets. For example, a common method was to remove one of the long branches from the analysis and to see if the placement of the other long branch remains consistent (Pol and Siddall 2001).
However, as the placement of single long branches is also difficult, this might not be a good test. Other approaches to reduce LBA are the exclusion of terminals with very long branches (not an option when they are the taxon of interest) or the exclusion of fast-evolving genes or sites (Bergsten 2005; Pisani 2004; Rivera-Rivera and Montoya-Burgos 2016). In particular, classifying all genes (or alignment sites) according to their evolutionary rate and successively removing them from the analysis, starting with the fastest class, gives a good overview of whether analyses are biased by the rate signal (Brinkmann et al. 2005). Finally, as LBA is basically a problem of model misspecification, the use of more sophisticated models is recommended. It has been shown that site-heterogeneous CAT models are less affected by LBA due to their ability to better anticipate homoplasy in alignment site patterns (Lartillot et al. 2007), but ML analyses with carefully selected partitions (and models for each partition) also seem to be promising (Whelan and Halanych 2016). In summary, LBA is a very common yet not fully understood phenomenon, and the placement of long branches in phylogenetic analyses remains a difficult task.
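The successive removal of fast-evolving sites can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: the function names `column_scores` and `remove_fastest` are hypothetical, and the number of distinct character states per alignment column is used as a crude stand-in for the evolutionary rate, which in a real analysis would be estimated under a substitution model on a tree.

```python
def column_scores(alignment):
    """Crude rate proxy (an assumption of this sketch): the number of
    distinct states per column, ignoring gaps ('-') and missing data ('?')."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    return [len({s[i] for s in seqs} - {"-", "?"}) for i in range(length)]


def remove_fastest(alignment, fraction):
    """Return a copy of the alignment without the given fraction of
    columns that score highest on the rate proxy."""
    scores = column_scores(alignment)
    n_remove = int(len(scores) * fraction)
    # Rank columns from most to least variable; ties are broken by position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    drop = set(ranked[:n_remove])
    return {
        taxon: "".join(ch for i, ch in enumerate(seq) if i not in drop)
        for taxon, seq in alignment.items()
    }


# Toy alignment (taxon name -> aligned sequence), purely illustrative.
toy = {"A": "AAAC", "B": "AAGC", "C": "ATTC", "D": "ACCG"}
for fraction in (0.0, 0.25, 0.5):
    print(fraction, remove_fastest(toy, fraction))
```

Repeating the tree search on each reduced alignment and comparing the resulting topologies would then show whether the placement of a long-branched taxon depends on the fastest sites.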
Fig. 9.3
a Unrooted four-taxon tree illustrating the classical example of long-branch attraction (LBA), with two long and two short branches, both unrelated. b A valid rooted tree of the unrooted topology shown in a. c Analyses are often misled by LBA, clustering together the long-branched terminals. This rooted topology is a typical artefact occurring in studies with the tree shown in a as the underlying true tree
Variation in the substitution rate across lineages (rate signal) can lead to the LBA phenomenon (Jeffroy et al. 2006), but this bias can often be handled by using models incorporating rate heterogeneity (Yang 1996). Additionally, the evolutionary rate of an alignment site can vary over time (heterotachy) (Lopez et al. 2002), and this process can also produce LBA (Lockhart and Steel 2005). A specific case of this phenomenon is known as the covarion hypothesis of molecular evolution, which states that substitutions at one alignment site may alter the substitution probability at other sites (Miyamoto and Fitch 1995). Kolaczkowski and Thornton (2004) used a clever simulation scheme to mimic another case of heterotachy. They simulated two sets of sequence alignments using the same topology, but under completely different models of DNA substitutions. By combining these two datasets and giving different weights to the two data partitions, different levels of heterotachy were simulated (◘ Fig. 9.4). Interestingly, these authors found that under higher levels of heterotachy, MP outperforms ML in recovering the correct tree. However, subsequent studies criticized this study for choosing very special and unrealistic parameters for the simulation, as well as for the way ML analyses were conducted (Philippe et al. 2005b; Spencer et al. 2005). Instead, it could be shown that for realistic simulations of heterotachous datasets, ML always outperforms MP and should therefore be the preferred method (Philippe et al. 2005b). Heterotachy has been demonstrated to be common in real datasets, where it affects phylogenetic reconstruction (Lopez et al. 2002; Whelan et al. 2011). Some statistical tests for the detection of heterotachy have been proposed (Wu and Susko 2011; Wang et al. 2011).
Approaches specifically dealing with heterotachy are the CAT-BP model (Blanquart and Lartillot 2008), as well as a model allowing changing the rate heterogeneity as modelled by the gamma distribution along branches (Bouckaert and Lockhart 2015).
Fig. 9.4
Scheme for the simulation of different levels of heterotachy as used in Kolaczkowski and Thornton (2004). a Sequences are simulated under two different sets of branch lengths, including opposing sets of long (p) and short terminal branches. b Sequence alignments generated under this simulation scheme can be combined under different weights (w) to simulate different degrees of heterotachy (Figure reprinted from Philippe et al. (2005b))
Another systematic error violating model assumptions is compositional bias, which describes significant differences in the nucleotide or amino acid composition across taxa. Most evolutionary models assume that the composition is homogeneous across taxa. Several tests for compositional homogeneity are available, including frequency-dependent significance tests, matched-pairs tests or analyses based on Monte Carlo simulations of estimates of the standard deviation of the mean nucleotide or amino acid composition (Steel et al. 1993; Jermiin et al. 2004; Ababneh et al. 2006). With the software SEQVIS, it is possible to visualize compositional heterogeneity in nucleotide alignments (Ho et al. 2006).
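As a minimal illustration of a frequency-based check, the following Python sketch computes a chi-square statistic on the taxon-by-base count table of a nucleotide alignment. It is not one of the published tests cited above; the function name `composition_chi2` is hypothetical, and a simple contingency-table test of this kind treats sites and taxa as independent observations, which ignores their shared phylogenetic history, so it should be read as a rough screen only.

```python
from collections import Counter

BASES = "ACGT"


def composition_chi2(alignment):
    """Chi-square statistic and degrees of freedom for the taxa x bases
    count table; gaps and ambiguity codes are ignored."""
    counts = {
        taxon: Counter(ch for ch in seq.upper() if ch in BASES)
        for taxon, seq in alignment.items()
    }
    row_totals = {t: sum(c.values()) for t, c in counts.items()}
    col_totals = {b: sum(c[b] for c in counts.values()) for b in BASES}
    total = sum(row_totals.values())

    chi2 = 0.0
    for taxon, c in counts.items():
        for base in BASES:
            # Expected count if all taxa shared the pooled composition.
            expected = row_totals[taxon] * col_totals[base] / total
            if expected > 0:
                chi2 += (c[base] - expected) ** 2 / expected

    # Nominal degrees of freedom; bases absent from the whole alignment
    # would reduce this number in a more careful treatment.
    df = (len(counts) - 1) * (len(BASES) - 1)
    return chi2, df


# Toy example: taxon C is GC-rich compared with A and B.
toy = {"A": "ATATATGC", "B": "ATTAATCG", "C": "GCGCGCAT"}
print(composition_chi2(toy))
```

The statistic would be compared against a chi-square distribution with the returned degrees of freedom; large values point to taxa whose composition deviates from the pooled composition.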
A typical example of how compositional bias misleads phylogenetic analyses is that unrelated taxa with convergently evolved elevated GC content might group together, e.g. as demonstrated for drosophilids (Tarrío et al. 2001). Using simulation studies, Jermiin et al. (2004) found that the frequency of successful phylogenetic reconstruction is related not only to the difference in GC content (or base composition) but also to the length of internal branches. Analyses with short internal branches are more easily misled. Compositional bias is also related to rate variation, as especially fast-evolving sites are frequently compositionally biased (Rodríguez-Ezpeleta et al. 2007). Fittingly, third codon positions in protein-coding genes often have a stronger bias in composition, and their removal sometimes increases the accuracy of the phylogenetic analysis. One of the many negative effects of compositional heterogeneity can be the accumulation of convergences. For example, transitions (replacement of a purine by a purine or of a pyrimidine by a pyrimidine) are usually more frequently observed than transversions (replacement of a purine by a pyrimidine or vice versa), leading to coincident substitutions. It has been shown that recoding all nucleotides to R (purines, A and G) and Y (pyrimidines, C and T) reduces this misleading effect of compositional bias (Phillips and Penny 2003). Recoding can, for example, be conducted with the software BMGE (Criscuolo and Gribaldo 2010), which furthermore is able to identify and exclude characters which contribute to compositional biases based on a matched-pairs test of marginal symmetry. Finally, non-homogeneous nonstationary models that account for variations in the base composition can be used.
The model of DNA sequence evolution by Galtier and Gouy (1998), which is implemented in PHYML (Boussau and Gouy 2006), allows varying equilibrium GC contents among lineages and estimation of five parameters: (I) ancestral GC content, (II) location of the root in its branch, (III) transition/transversion ratio, (IV) branch lengths and (V) equilibrium GC contents in each branch. Compositional bias was expected to be more frequent and also misleading on the nucleotide level, as only four different states exist and convergence is to be expected (Hasegawa and Hashimoto 1993; Foster and Hickey 1999). However, compositional bias on the protein level seems also to be frequent and thereby a problem for phylogenetic analyses as well (Lartillot and Philippe 2008; Nesnidal et al. 2010). Kück and Struck (2014) developed a package of scripts to analyse phylogenomic datasets (BACOCA), which can be used to investigate the compositional bias among amino acids. As with nucleotides, recoding of amino acids can reduce the compositional bias. The most commonly used recoding classifies amino acids according to six groups identified by Dayhoff et al. (1978), which tend to replace each other (Susko and Roger 2007). Furthermore, using the CAT-BP model for amino acid data allows lineage-specific compositional shifts across the phylogeny and thus deals with heterogeneous amino acid sequence compositions (Blanquart and Lartillot 2008).

9.3 Missing Data, Phylogenetic Information Content and Taxon Sampling

9.3.1 Missing Data

A typical way to compile a dataset for phylogenomic studies involves the generation of transcriptomes and subsequent selection of putative orthologs for the analyses. Ortholog sets often range from 100 to more than 1000 genes, and it is not unusual that not all genes are (completely) recovered for all taxa. As such, orthologs are often found to be incomplete when using transcriptome sequencing (◘ Fig. 9.5a). In most cases, genes are missing because of insufficient sequencing depth of the transcriptome or because they are not expressed in the sampled specimen (Roure et al. 2013). Moreover, many genes might have been lost in some taxa during evolution (◘ Fig. 9.5b). Percentages of missing data of up to 80% have been reported for phylogenomic studies (Hejnol et al. 2009). The discussion of whether missing data should be reduced in phylogenetic analyses, e.g. by excluding the most incomplete taxa and/or characters, has a long tradition in the literature (Wiens 2003; Wiens and Morrill 2011; Philippe et al. 2004; Wiens 1998). Initially, the question arose whether incompletely sampled taxa should be included in phylogenetic analyses of one or a few genes or in morphological character matrices. In the latter case, the discussion often centred on fossils, for which it was usually impossible to analyse all characters found in recent taxa. Later the discussion was expanded to genomic datasets, where often substantial amounts of data are missing. Even though some publications addressed missing data as problematic (Lemmon et al. 2009), most studies using real or simulated data could show that the inclusion of incomplete taxa is usually advantageous. One simple reason is that an improved taxon sampling helps to break long branches (Roure et al. 2013). By analysing a large dataset covering diverse eukaryotes, Philippe et al. (2004) could show that 25% of missing data in the original dataset did not negatively impact the analyses.
Subsequent random deletion of 50% of the character matrix did not alter the outcome of the analysis, and even when analysing with up to 90% of missing data, similar trees could be obtained. Jiang et al. (2014) found that adding incomplete data is particularly helpful for resolving poorly supported nodes and showed that missing data does not consistently bias branch lengths. Finally, Hovmöller et al. (2013) have shown that species tree reconstruction methods relying on coalescent approaches (► see Sect. 9.4) are also remarkably robust in the presence of up to 50% of missing data. However, if missing data is nonrandomly distributed over the matrix, it may bias analyses, leading to many trees (or subtrees) which are nearly indistinguishable by their likelihood values (Sanderson et al. 2010). A tool for the visualization of the completeness of the supermatrix (◘ Fig. 9.5b), as well as for the exclusion of incompletely sampled genes, is the software MARE (Misof et al. 2013). Using such an approach, differently covered data matrices can be constructed and analysed, and the sensitivity of phylogenomic analyses to missing data can be assessed (Weigert et al. 2014).
Fig. 9.5
Missing data in phylogenomic analyses. a Single gene alignment based on transcriptomic data often includes highly incomplete and partially nonoverlapping gene sequences. b The gene coverage (columns) is often highly uneven for taxa (rows) included in a phylogenomic study. Blue squares show presence of genes, white squares show absent genes. Matrix based on data from Weigert et al. (2014) constructed with MARE (Misof et al. 2013)
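How missing genes translate into missing data in a supermatrix can be illustrated with a short Python sketch. It is not related to the MARE implementation; the function names `concatenate` and `missing_fraction` are hypothetical, taxa lacking a gene are padded with «?», and alignment gaps are counted as missing, which is a choice of this example.

```python
def concatenate(gene_alignments, taxa):
    """Concatenate gene alignments (each a dict taxon -> aligned sequence)
    into a supermatrix; taxa lacking a gene are padded with '?'."""
    rows = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            rows[taxon].append(alignment.get(taxon, "?" * length))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


def missing_fraction(supermatrix, missing="-?"):
    """Fraction of missing characters per taxon."""
    return {
        taxon: sum(ch in missing for ch in seq) / len(seq)
        for taxon, seq in supermatrix.items()
    }


# Toy example with two genes and three taxa.
genes = [
    {"A": "ACGT", "B": "AC-T"},   # gene 1: taxon C not recovered
    {"A": "GG", "C": "GA"},       # gene 2: taxon B not recovered
]
matrix = concatenate(genes, ["A", "B", "C"])
print(matrix)
print(missing_fraction(matrix))
```

Per-taxon (or, analogously, per-gene) fractions of this kind are the quantities one would inspect before deciding whether to exclude sparsely covered taxa or genes.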

9.3.2 More Genes or More Taxa?

Taxon sampling was profusely discussed in the phylogenetic literature prior to the genomic era, in particular whether it is better to centre efforts on obtaining more data for a limited number of taxa or on sampling more taxa with relatively fewer data (Rokas and Carroll 2005; Mitchell et al. 2000). This discussion lost momentum with the (comparatively) low cost of NGS technologies, which allow the recovery of large amounts of sequence data for non-model taxa, so that in most cases adding more data is no longer a bottleneck. The first phylogenomic analyses often relied on a handful of model taxa for which complete genomes were available. For example, focussing on animal relationships, these analyses seemed to support the so-called Coelomata hypothesis (arthropods + deuterostomes) and not the widely accepted Ecdysozoa hypothesis (arthropods + nematodes) (Philip et al. 2005). However, these results have been clearly demonstrated to be an artefact related to limited taxon sampling (Philippe et al. 2005a). The discussion of experimental design has now shifted to which genes and which taxa to include in an analysis (Philippe et al. 2011).

9.3.3 Taxon Sampling

The importance of taxon sampling for phylogenetic analyses is widely acknowledged (Heath et al. 2008; Pollock et al. 2002; Zwickl and Hillis 2002), with only a few studies coming to a different conclusion (Rosenberg and Kumar 2001). Rannala et al. (1998) demonstrated in a simulation study that a decrease in taxon sampling leads to an increase in the average branch length of terminals, which could make analyses more susceptible to LBA. This is in line with the finding that the estimation of rate heterogeneity is highly sensitive to taxon sampling (Sullivan et al. 1999). Moreover, estimation of branch lengths also becomes more challenging under limited taxon sampling due to the so-called node density effect (Hugall and Lee 2007). This effect often leads to an underestimation of branch lengths in sparsely sampled tree regions, because less information is available to infer multiple substitutions, which could have been revealed in the presence of additional nodes. However, not all included taxa are equally helpful to improve phylogenetic analyses. Certain taxa, so-called rogue taxa, can show phylogenetically unstable behaviour, characterized by widely different positions in tree topologies estimated from the same dataset (e.g. within bootstrap replicates) (Sanderson and Shaffer 2002). Often, but not always, rogue taxa are characterized by large amounts of missing data. Inclusion of such rogue taxa can have a negative impact on support values (especially when using bootstrap), but could also influence tree reconstruction in general (Mariadassou et al. 2012). In fact, Aberer et al. (2013) demonstrated that exclusion of rogue taxa increases the accuracy of phylogenetic analyses. These authors developed an algorithm for the identification and subsequent pruning of rogue taxa, implemented in the software ROGUENAROK. The idea behind the algorithm is to identify taxa whose exclusion results in an increase of support in bootstrap consensus trees.
The measure of change in support is called the relative bipartition information criterion (RBIC), which is the sum of all support values divided by the maximum support in a fully bifurcating tree of the original dataset. Taxa or combinations of taxa yielding the highest change in RBIC are excluded from the analysis. This analysis can be iteratively repeated until no significant change is observed. Alternatively, the leaf stability index (LSI) has been used to identify rogue taxa. The LSI uses the occurrence of taxon triplets in trees from bootstrap analyses (Thorley and Wilkinson 1999). Three different possibilities for the relationship of three taxa (A, B, C) exist in a rooted, bifurcating tree: ((A, B), C), ((A, C), B) and ((B, C), A). The LSI is calculated as the difference between the relative frequencies of the most common and the second most common triplet, averaged over all triplets containing a certain taxon. LSI values of 1 or close to 1 indicate stable taxa, whereas values closer to 0 indicate instability. An LSI cut-off value can be defined for rogue taxa to be excluded from the analysis. Inference of the LSI is, for example, incorporated in the software PHYUTILITY (Smith and Dunn 2008). A third approach, called multiple co-inertia analysis (MCOA), has been explored by de Vienne et al. (2012); it is based on the comparison of pairwise distances between species in all gene tree topologies to identify rogue taxa (described as outlier taxa in that publication).
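The LSI calculation described above can be sketched in Python for a small set of rooted trees. This is an illustration of the verbal definition given here, not the PHYUTILITY implementation: trees are represented as nested tuples of taxon names, all trees are assumed to contain the same taxa (at least three), and the function names are hypothetical.

```python
from itertools import combinations


def clusters(tree):
    """Return (leaf set of this subtree, leaf sets of all internal nodes)
    for a rooted tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def triplet_topology(cluster_list, a, b, c):
    """Return the pair of taxa grouped to the exclusion of the third,
    or None if the triplet is unresolved in this tree."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        if any(x in cl and y in cl and z not in cl for cl in cluster_list):
            return frozenset((x, y))
    return None


def leaf_stability(trees):
    """Average, per taxon, of (frequency of most common triplet topology
    minus frequency of second most common) over all triplets with it."""
    taxa = sorted(clusters(trees[0])[0])
    all_clusters = [clusters(tree)[1] for tree in trees]
    per_taxon = {taxon: [] for taxon in taxa}
    for a, b, c in combinations(taxa, 3):
        counts = {}
        for cluster_list in all_clusters:
            topology = triplet_topology(cluster_list, a, b, c)
            if topology is not None:
                counts[topology] = counts.get(topology, 0) + 1
        freqs = sorted((n / len(trees) for n in counts.values()), reverse=True)
        freqs += [0.0, 0.0]  # pad in case fewer than two topologies occur
        for taxon in (a, b, c):
            per_taxon[taxon].append(freqs[0] - freqs[1])
    return {taxon: sum(v) / len(v) for taxon, v in per_taxon.items()}


# Two toy "bootstrap" trees in which C and D swap positions.
trees = [((("A", "B"), "C"), "D"), ((("A", "B"), "D"), "C")]
print(leaf_stability(trees))
```

In this toy case, taxa A and B obtain higher values than C and D, reflecting that the positions of C and D change between the two trees.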

9.3.4 Gene Sampling

Gene alignments can differ in their missing data, sequence saturation or phylogenetic information content. DNA and protein sequences are regarded as saturated when sites have undergone multiple substitutions and the number of observed differences no longer reflects «true» evolutionary distances. Slight levels of saturation are corrected by the use of models of sequence evolution, but more saturated sequence alignments can mislead phylogenetic reconstruction. When analysing highly saturated sequences, phylogenetic inference can be driven to a large extent by sequence composition rather than true phylogeny (Xia et al. 2003). DNA sequences are normally more affected by saturation because only four different character states exist compared to the 20 states of amino acids (Philippe et al. 2011). However, saturation can also be problematic at the amino acid level (Van de Peer et al. 2002). A simple method to check for the presence of saturation in nucleotide sequences is to plot the observed proportions of transitions and of transversions (uncorrected p-distances) separately for all pairwise comparisons of taxa in an alignment against their genetic (usually ML-corrected) distance (Struck et al. 2008). For most protein-coding genes, transitions occur more frequently than transversions and thus are more likely saturated (◘ Fig. 9.6). Formalized measures of substitution saturation have been introduced by Xia et al. (2003), as implemented in the software DAMBE (Xia 2013), and Struck et al. (2008), as implemented in the BACOCA package of scripts (Kück and Struck 2014). Possible strategies to deal with saturated sequences are the use of amino acids, exclusion of the saturated data or recoding (e.g. RY coding or the use of Dayhoff categories for amino acids).
Fig. 9.6
Saturation at different codon positions. Uncorrected pairwise distances are plotted for pairs of taxa, separately for transitions (left) and transversions (right) and first (grey), second (black) and third (white) codon positions. For unsaturated sequences, the number of substitutions should increase linearly with time (e.g. transversions on first and second positions), whereas for saturated sequences, no increase in the number of substitutions is detected with increasing genetic distance (e.g. transitions on third codon positions) (Reprinted from Dávalos and Perkins (2008), with permission from Elsevier)
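The y-axis values of such a saturation plot can be computed with a few lines of Python. The sketch below is an illustration with hypothetical function names; it only yields the uncorrected proportions of transitions and transversions per pair of taxa, whereas the corrected distances for the x-axis would have to come from a model-based estimate obtained with a phylogenetics program.

```python
from itertools import combinations

PURINES, PYRIMIDINES = set("AG"), set("CT")


def ts_tv_p_distance(seq1, seq2):
    """Uncorrected proportions of transitions and transversions between
    two aligned sequences; positions with gaps or ambiguity codes in
    either sequence are skipped. Returns None if nothing is comparable."""
    ts = tv = compared = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    if compared == 0:
        return None
    return ts / compared, tv / compared


def saturation_points(alignment):
    """Transition and transversion p-distances for all pairs of taxa."""
    return {
        (a, b): ts_tv_p_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


toy = {"A": "AACCGT", "B": "GACAGT", "C": "GGTAGC"}
for pair, (ts, tv) in saturation_points(toy).items():
    print(pair, round(ts, 3), round(tv, 3))
```

To examine codon positions separately, as in ◘ Fig. 9.6, the same functions could be applied to every third column of a codon-aligned dataset.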
It is important to remember the relationship between sequence saturation and sequence divergence: one gene might be saturated for old divergences but well suited to resolve young divergences, whereas a slower-evolving gene might not be saturated for old divergences but totally uninformative for young ones. The usefulness of a given gene for phylogenetic analyses can be estimated by its phylogenetic informativeness (PI) (Townsend 2007). Briefly summarized, PI estimates the probability that a character resolves a dated four-taxon alignment (more than four taxa can be analysed by providing a consistent topology). Thereby, PI provides an estimate of the amount of phylogenetic signal relative to noise across time (◘ Fig. 9.7). PI can be analysed using the software PHYDESIGN (López-Giráldez and Townsend 2011), which is available online, by providing an alignment, as well as an ultrametric tree as input. Some updates and modifications for the calculation of PI are available in the R package PHYLINFORMR (Dornburg et al. 2016). As an example of how to use PI, ◘ Fig. 9.7 compares the utility of different classes of phylogenetic markers from percomorph fishes (Gilbert et al. 2015).
Fig. 9.7
Phylogenetic informativeness and its 95% confidence interval of three different classes of phylogenetic markers from percomorph fishes (UCE core regions, UCE flanking regions, protein-coding genes) plotted against time. Core regions of ultraconserved elements (UCEs) are basically uninformative, whereas flanking regions of UCE show a higher PI than protein-coding genes, with the highest resolution power for divergences between 20 and 40 million years old (Reprinted from (Gilbert et al. 2015), with permission from Elsevier)
A different approach to investigate and visualize phylogenetic information content is based on likelihood mapping (Strimmer and von Haeseler 1997). This method analyses possible four-taxon cases of a given dataset, called quartets. For every quartet, there are three possible fully resolved tree topologies, and the posterior probability of each of them can be estimated using Bayes’ theorem. The three posterior probabilities are then used as coordinates to locate a point within a triangular graph where each corner represents one topology. This calculation is repeated for all possible quartets, which are subsequently plotted in the triangle. In the case of an uninformative quartet (starlike evolution), all three probabilities are the same and the point is located in the middle of the triangle. If one tree topology is clearly supported with a probability close to 1, the point lies close to one of the corners of the triangle (according to the supported topology). If two topologies gain similar probability, whereas one topology gets a probability close to 0, the point is located at one edge of the triangle, between the corners representing the two supported topologies. By analysing all possible quartets of a dataset, the phylogenetic information content can be visualized. The more quartets are located in the corners of the triangle, the higher is the information content of the dataset (◘ Fig. 9.8). Likelihood mapping is implemented in the software TREE-PUZZLE (Schmidt et al. 2002).
Fig. 9.8
Likelihood mapping using TREE-PUZZLE (Schmidt et al. 2002) for datasets with differences in phylogenetic information content. a, b In a dataset with low information content, a high percentage (30.9%) of quartets represent starlike evolution. c, d In this dataset 7.9% of the quartets represent starlike evolution, whereas 2.5% + 2.3% + 2.4% of quartets are in an area where it is difficult to distinguish between two of the three possible tree alternatives. e, f Most quartets (33.7% + 32.2% + 33.1%) are in well-resolved areas of the tree distribution, indicating high phylogenetic information content. a, c, e show distribution patterns of mapped quartets; b, d, f show occupancies (in percent) for seven areas of interest
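The geometry of likelihood mapping can be sketched in a few lines of Python. The example below is a minimal illustration and not the TREE-PUZZLE implementation: it assumes that log-likelihoods for the three topologies of a quartet are already available, applies Bayes’ theorem with a uniform prior on the three topologies, and assigns each point to the nearest of seven reference points (three corners, three edge midpoints and the centre) as a simple stand-in for the seven areas of interest. All function names and the nearest-point rule are assumptions made for this illustration.

```python
import math

# Corners of an equilateral triangle with unit side; each corner stands for one topology.
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

# Reference points in probability space (hypothetical labels for the seven areas).
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T1/T2": (0.5, 0.5, 0.0),
    "T1/T3": (0.5, 0.0, 0.5),
    "T2/T3": (0.0, 0.5, 0.5),
    "star": (1 / 3, 1 / 3, 1 / 3),
}


def posterior_weights(log_likelihoods):
    """Posterior probabilities of the three quartet topologies under a uniform prior."""
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    relative = [math.exp(value - best) for value in log_likelihoods]
    total = sum(relative)
    return [value / total for value in relative]


def triangle_point(weights):
    """Map the three probabilities to x/y coordinates inside the triangle."""
    x = sum(w * corner[0] for w, corner in zip(weights, CORNERS))
    y = sum(w * corner[1] for w, corner in zip(weights, CORNERS))
    return (x, y)


def classify(weights):
    """Return the label of the closest reference point (Euclidean distance)."""
    def squared_distance(label):
        return sum((w - a) ** 2 for w, a in zip(weights, ATTRACTORS[label]))
    return min(ATTRACTORS, key=squared_distance)


if __name__ == "__main__":
    # Three hypothetical quartets: starlike, well resolved, and torn between two topologies.
    for log_liks in ([-100.0, -100.0, -100.0],
                     [-100.0, -120.0, -120.0],
                     [-100.0, -100.0, -130.0]):
        weights = posterior_weights(log_liks)
        print(log_liks, triangle_point(weights), classify(weights))
```

Counting how many quartets fall into each of the seven labels and dividing by the total number of quartets gives occupancies analogous to those shown in panels b, d and f.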
Different strategies have been used to select sets of orthologous genes for phylogenetic analyses. Some authors recommend including only highly informative genes in the analysis (Salichos and Rokas 2013), whereas others suggest that phylogenetic signal can be extracted from basically all ortholog alignments when they are combined in a supermatrix (Gatesy and Baker 2005). PI represents a possible way to select genes which are suitable for both supermatrix and coalescent-based methods. Shen et al. (2016) systematically analysed the association of sequence-based, gene function-based and gene tree-based properties with phylogenetic information content. The goal was to identify those properties which best predict the phylogenetic signal of a gene. Even though most of the investigated properties correlate with each other, a set of properties with the highest relevance could be identified. Interestingly, the most important property for predicting phylogenetic signal is gene alignment length, followed by the number of parsimony-informative sites and the number of variable sites. This result could be interpreted in favour of binning genes for coalescent analyses (see above), but also in favour of the supermatrix approach, which basically combines all alignments into a highly informative «supergene».

9.4 Incongruence Between Gene Trees and Species Trees

Gene trees may differ from the species tree simply because of the stochastic sampling of alleles during speciation events (Degnan and Rosenberg 2009), a phenomenon known as incomplete lineage sorting or deep coalescence (◘ Fig. 9.9). The term «hemiplasy» has been coined to describe incorrect inference of character-state evolution due to genetic polymorphisms which are retained across speciation events (Avise and Robinson 2008; Hahn and Nakhleh 2016). This term is meant to reflect that in this case similarity does not reflect common ancestry, even though the considered character states are homologous (and apomorphic!).
Fig. 9.9
Incomplete lineage sorting can lead to incongruence between gene trees and species trees. The gene tree is drawn in colour inside the species tree (black). The last common ancestor of taxa a–c had two paralogs of a gene (X and Y). Duplicates were lost before the split of the three species, but paralog sorting is incongruent with the species tree
It has been demonstrated that discordance between gene trees and species trees is common, especially in cases where speciation events happened within short time spans, i.e. are separated by short branches (Degnan and Rosenberg 2006). A good example of incomplete lineage sorting is provided by the genome-scale analyses of the bird phylogeny, which includes a rapid radiation characterized by many short internal branches. For this phylogeny, not a single gene tree was found to match the reconstructed species tree (Jarvis et al. 2014). Later, incomplete lineage sorting was shown to be frequent in the evolutionary history of birds, and a phylogenetic network was used to illustrate their complex history (◘ Fig. 9.10) (Suh et al. 2015).
Fig. 9.10
Phylogenetic network analyses of rare genomic change markers reveal a strong discordance of markers, which can be explained by high levels of incomplete lineage sorting (Figure reprinted from Suh et al. (2015))
Several other evolutionary processes can lead to disagreement between gene trees and species trees, including horizontal gene transfer (HGT), gene duplication and hybridization (Maddison 1997; Knowles and Kubatko 2010). HGT is a process where genes are transferred from one species to another across the phylogeny. Whereas HGT is rather rare in eukaryotes and therefore less problematic for phylogenetic reconstruction, it is common among prokaryotes (Ku and Martin 2016). Gene duplication complicates the inference of orthology (Philippe et al. 2011). Hybridization and introgression are biological processes by which the genetic material of two different species gives rise to hybrids and sometimes new species. Hybridization is most commonly found in plants, but many examples have also been described for animals (Mallet 2007).
Several phylogenetic methods have been developed to detect and deal with incongruence between gene trees and species trees. In contrast to the supermatrix approach, where genes are concatenated into a single matrix, these methods are usually based on the separate reconstruction of gene trees, which are subsequently (or simultaneously) used to infer the species tree. Most species tree inference methods are rooted in coalescent theory, a model which has been developed to follow the history of genes (or alleles) back in time. Coalescent models are commonly used in population genetics and are often based on the Wright-Fisher model of genetic drift, assuming nonoverlapping generations, neutral evolution and random joining of populations back in time (Degnan and Rosenberg 2009). The multispecies coalescent (MSC) is used to estimate the probability distribution of gene trees evolving along the branches of a species tree. Each branch of a species tree represents a single population, and lineages of genes entering these populations are traced back through time to a common ancestor at rates given by the model. The coalescence of different gene lineages of the gene trees finally provides the signal for the inference of the overlying species tree (Liu et al. 2015). The MSC has been implemented in ML approaches, e.g. STEM (Kubatko et al. 2009) or MP-EST (Liu et al. 2010), and in a Bayesian framework, e.g. BEST (Liu 2008) or BEAST (Drummond et al. 2012). The performance of species tree inference methods is a matter of debate. Gatesy and Springer (2014) argued that species tree inference is often misled by unreliable gene trees, especially in phylogenetic analyses at deep timescales.
Similar to the idea that the phylogenetic signal-to-noise ratio is improved by concatenating single gene alignments into a supermatrix, statistical binning of genes with a similar signal has been proposed to reduce gene tree estimation errors for species tree inference (Mirarab et al. 2014). Several simulation studies show a superior performance of species tree inference using a Bayesian framework in comparison with other methods, especially when a high probability of gene tree discordance is simulated (Leaché and Rannala 2011). Interestingly, comparisons of species tree inference and supermatrix methods on real datasets often show rather consistent results (Liu et al. 2015).
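The link between short internal branches and gene tree discordance can be made concrete for the simplest case of three species. Under the multispecies coalescent with one sampled lineage per species, the two lineages entering the internal branch fail to coalesce within it with probability exp(-T), where T is the length of that branch in coalescent units; in that case each of the three gene tree topologies is equally likely. The following sketch computes the resulting probabilities. The function name and the convention for coalescent units (generations divided by 2Ne for a diploid population) are assumptions of this example and are not taken from any of the software packages mentioned above.

```python
import math


def three_taxon_gene_tree_probabilities(internal_branch_length):
    """Gene tree topology probabilities for a rooted three-species tree ((A,B),C).

    internal_branch_length is the length T of the branch ancestral to A and B,
    measured in coalescent units (assumed here: generations / (2 * Ne)).
    """
    if internal_branch_length < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages do NOT coalesce on the internal branch.
    no_coalescence = math.exp(-internal_branch_length)
    # If they do not coalesce, all three topologies are equally likely (1/3 each).
    discordant_each = no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant_each,  # matches the species tree
        "((A,C),B)": discordant_each,
        "((B,C),A)": discordant_each,
    }


if __name__ == "__main__":
    # Shorter internal branches give more discordance.
    for t in (0.1, 0.5, 1.0, 3.0):
        probs = three_taxon_gene_tree_probabilities(t)
        print(t, round(probs["((A,B),C)"], 3))
```

For an internal branch of length zero, all three topologies have probability 1/3, so a gene tree carries no information about the species tree; as the branch grows longer, the concordant topology approaches probability 1.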
For the quantification of incongruence in phylogenomic datasets, Salichos and Rokas (2013) developed a measure called internode certainty (IC). Here, incongruence for a given internal node is measured from the frequency with which the bipartition found in the best tree occurs in a given set of gene trees, together with the frequency of conflicting bipartitions in these gene trees. Values close to 0 indicate the presence of strong conflict, whereas values close to 1 indicate the absence of conflicting signal. Summing over all ICs gives the tree certainty (TC). The calculation of IC and TC is implemented in the software RAxML (Stamatakis 2014; Kobert et al. 2016).
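As a worked illustration, IC can be computed from two counts: the number of gene trees containing the bipartition of the reference tree and the number containing the most prevalent conflicting bipartition. The sketch below follows the entropy-based definition as we understand it from Salichos and Rokas (2013) for this two-bipartition case, IC = 1 + p1 log2(p1) + p2 log2(p2), where p1 and p2 are the two counts normalized to sum to 1. It is not the RAxML implementation; the function names and the example counts are hypothetical.

```python
import math


def internode_certainty(support_count, conflict_count):
    """IC for one internode from the counts of the reference bipartition and
    of its most prevalent conflicting bipartition in a set of gene trees."""
    total = support_count + conflict_count
    if total <= 0:
        raise ValueError("at least one gene tree must contain one of the bipartitions")
    ic = 1.0  # log2(2), the maximum entropy for two alternatives
    for count in (support_count, conflict_count):
        p = count / total
        if p > 0:  # the term p * log2(p) tends to 0 as p approaches 0
            ic += p * math.log2(p)
    return ic


def tree_certainty(count_pairs):
    """TC as the sum of IC over all internodes of the reference tree."""
    return sum(internode_certainty(s, c) for s, c in count_pairs)


if __name__ == "__main__":
    # Hypothetical counts (support, conflict) for three internodes.
    internodes = [(100, 0), (75, 25), (50, 50)]
    for support, conflict in internodes:
        print(support, conflict, round(internode_certainty(support, conflict), 3))
    print("TC:", round(tree_certainty(internodes), 3))
```

An internode without any conflict receives an IC of 1, whereas an internode whose bipartition is exactly as frequent as its strongest competitor receives an IC of 0, in line with the interpretation given above.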
References
Ababneh F, Jermiin LS, Ma C, Robinson J (2006) Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22:1225–1231
Aberer AJ, Krompass D, Stamatakis A (2013) Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst Biol 62:162–166
Adl SM, Simpson AGB, Farmer MA, Andersen RA, Anderson OR, Barta JR, Bowser SS, Brugerolle G, Fensome RA, Fredericq S, James TY, Karpov S, Kugrens P, Krug J, Lane CE, Lewis LA, Lodge J, Lynn DH, Mann DG, McCourt RM, Mendoza L, Moestrup Ø, Mozley-Standridge SE, Nerad TA, Shearer CA, Smirnov AV, Spiegel FW, Taylor MFJR (2005) The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J Eukaryot Microbiol 52:399–451
Avise JC, Robinson TJ (2008) Hemiplasy: a new term in the lexicon of phylogenetics. Syst Biol 57:503–507
Bergsten J (2005) A review of long-branch attraction. Cladistics 21:163–193
Bininda-Emonds ORP (2004) The evolution of supertrees. Trends Ecol Evol 19:315–322
Blanquart S, Lartillot N (2008) A site- and time-heterogeneous model of amino acid replacement. Mol Biol Evol 25:842–858
Bouckaert R, Lockhart P (2015) Capturing heterotachy through multi-gamma site models. bioRxiv. doi.org/10.1101/018101
Boussau B, Gouy M (2006) Efficient likelihood computations with nonreversible models of evolution. Syst Biol 55:756–768
Brinkmann H, van der Giezen M, Zhou Y, de Raucourt GP, Philippe H (2005) An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst Biol 54:743–757
Criscuolo A, Gribaldo S (2010) BMGE (Block mapping and gathering with entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol 10:210
Dávalos LM, Perkins SL (2008) Saturation and base composition bias explain phylogenomic conflict in Plasmodium. Genomics 91:433–442
Dayhoff M, Schwarz R, Orcutt B (1978) A model of evolutionary change in proteins. In: Dayhoff M (ed) Atlas of protein sequence and structure, vol 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp 345–352
de Queiroz A, Gatesy J (2007) The supermatrix approach to systematics. Trends Ecol Evol 22:34–41
de Vienne DM, Ollier S, Aguileta G (2012) Phylo-MCOA: a fast and efficient method to detect outlier genes and species in phylogenomics using multiple co-inertia analysis. Mol Biol Evol 29:1587–1598
Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2:e68
Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol 24:332–340
Donoghue MJ, Doyle JA (2000) Seed plant phylogeny: demise of the anthophyte hypothesis? Curr Biol 10:R106–R109
Dornburg A, Fisk JN, Tamagnan J, Townsend JP (2016) PhyInformR: phylogenetic experimental design and phylogenomic data exploration in R. BMC Evol Biol 16:262
Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969–1973
Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sorensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G (2008) Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745–750
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401–410
Foster PG, Hickey DA (1999) Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J Mol Evol 48:284–290
Galtier N, Gouy M (1998) Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol 15:871–879
Gatesy J, Baker RH (2005) Hidden likelihood support in genomic data: can forty-five wrongs make a right? Syst Biol 54:483–492
Gatesy J, DeSalle R, Wahlberg N (2007) How many genes should a systematist sample? Conflicting insights from a phylogenomic matrix characterized by replicated incongruence. Syst Biol 56:355–363
Gatesy J, Springer MS (2014) Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol Phylogenet Evol 80:231–266
Gee H (2003) Evolution: ending incongruence. Nature 425:782–782
Gilbert PS, Chang J, Pan C, Sobel EM, Sinsheimer JS, Faircloth BC, Alfaro ME (2015) Genome-wide ultraconserved elements exhibit higher phylogenetic informativeness than traditional gene markers in percomorph fishes. Mol Phylogenet Evol 92:140–146
Giribet G (2016) Genomics and the animal tree of life: conflicts and future prospects. Zool Scr 45:14–21
Hahn MW, Nakhleh L (2016) Irrational exuberance for resolved species trees. Evolution 70:7–17
Halanych KM (2004) The new view of animal phylogeny. Annu Rev Ecol Syst 35:229–256
Hasegawa M, Hashimoto T (1993) Ribosomal RNA trees misleading? Nature 361:23–23
Heath TA, Hedtke SM, Hillis DM (2008) Taxon sampling and the accuracy of phylogenetic analyses. J Syst Evol 46:239–257
Hejnol A, Obst M, Stamatakis A, Ott M, Rouse GW, Edgecombe GD, Martinez P, Baguñà J, Bailly X, Jondelius U, Wiens M, Müller WEG, Seaver E, Wheeler WC, Martindale MQ, Giribet G, Dunn CW (2009) Assessing the root of bilaterian animals with scalable phylogenomic methods. Proc R Soc Lond B Biol Sci 276:4261–4270
Hendy MD, Penny D (1989) A framework for the quantitative study of evolutionary trees. Syst Biol 38:297–309
Ho JWK, Adams CE, Lew JB, Matthews TJ, Ng CC, Shahabi-Sirjani A, Tan LH, Zhao Y, Easteal S, Wilson SR, Jermiin LS (2006) SeqVis: visualization of compositional heterogeneity in large alignments of nucleotides. Bioinformatics 22:2162–2163
Hovmöller R, Lacey Knowles L, Kubatko LS (2013) Effects of missing data on species tree estimation under the coalescent. Mol Phylogenet Evol 69:1057–1062
Huelsenbeck JP (1995) Performance of phylogenetic methods in simulation. Syst Biol 44:17–48
Hugall AF, Lee MSY (2007) The likelihood node density effect and consequence for evolutionary studies of molecular rates. Evolution 61:2293–2307
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SYW, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, da Fonseca RR, Li J, Zhang F, Li H, Zhou L, Narula N, Liu L, Ganapathy G, Boussau B, Bayzid MS, Zavidovych V, Subramanian S, Gabaldón T, Capella-Gutiérrez S, Huerta-Cepas J, Rekepalli B, Munch K, Schierup M, Lindow B, Warren WC, Ray D, Green RE, Bruford MW, Zhan X, Dixon A, Li S, Li N, Huang Y, Derryberry EP, Bertelsen MF, Sheldon FH, Brumfield RT, Mello CV, Lovell PV, Wirthlin M, Schneider MPC, Prosdocimi F, Samaniego JA, Velazquez AMV, Alfaro-Núñez A, Campos PF, Petersen B, Sicheritz-Ponten T, Pas A, Bailey T, Scofield P, Bunce M, Lambert DM, Zhou Q, Perelman P, Driskell AC, Shapiro B, Xiong Z, Zeng Y, Liu S, Li Z, Liu B, Wu K, Xiao J, Yinqi X, Zheng Q, Zhang Y, Yang H, Wang J, Smeds L, Rheindt FE, Braun M, Fjeldsa J, Orlando L, Barker FK, Jønsson KA, Johnson W, Koepfli K-P, O’Brien S, Haussler D, Ryder OA, Rahbek C, Willerslev E, Graves GR, Glenn TC, McCormack J, Burt D, Ellegren H, Alström P, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MTP, Zhang G (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346:1320–1331
Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends Genet 22:225–231
Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2004) The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst Biol 53:638–643
Jiang W, Chen S-Y, Wang H, Li D-Z, Wiens JJ (2014) Should genes with missing data be excluded from phylogenetic analyses? Mol Phylogenet Evol 80:308–318
Knowles LL, Kubatko LS (2010) Estimating species trees: an introduction to concepts and models. In: Knowles LL, Kubatko LS (eds) Estimating species trees: practical and theoretical aspects. Wiley-Blackwell, Hoboken, pp 1–14
Kobert K, Salichos L, Rokas A, Stamatakis A (2016) Computing the internode certainty and related measures from partial gene trees. Mol Biol Evol 33:1606–1617
Kolaczkowski B, Thornton JW (2004) Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980–984
Ku C, Martin WF (2016) A natural barrier to lateral gene transfer from prokaryotes to eukaryotes revealed from genomes: the 70% rule. BMC Biol 14:89
Kubatko LS, Carstens BC, Knowles LL (2009) STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 25:971–973
Kück P, Struck TH (2014) BaCoCa – a heuristic software tool for the parallel assessment of sequence biases in hundreds of gene and taxon partitions. Mol Phylogenet Evol 70:94–98
Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K (2012) Statistics and truth in phylogenomics. Mol Biol Evol 29:457–472
Lartillot N, Brinkmann H, Philippe H (2007) Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 7:S4
Lartillot N, Philippe H (2008) Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philos Trans R Soc Lond Ser B Biol Sci 363:1463–1472
Leaché AD, Rannala B (2011) The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol 60:126–137
Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM (2009) The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol 58:130–145
Liu L (2008) BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 24:2542–2543
Liu L, Xi Z, Wu S, Davis CC, Edwards SV (2015) Estimating phylogenetic trees from genome-scale data. Ann N Y Acad Sci 1360:36–53
Liu L, Yu L, Edwards SV (2010) A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 10:302
Lockhart P, Steel M (2005) A tale of two processes. Syst Biol 54:948–951
López-Giráldez F, Townsend JP (2011) PhyDesign: an online application for profiling phylogenetic informativeness. BMC Evol Biol 11:152
Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Mol Biol Evol 19:1–7
Maddison WP (1997) Gene trees in species trees. Syst Biol 46:523–536
Mallet J (2007) Hybrid speciation. Nature 446:279–283
Mariadassou M, Bar-Hen A, Kishino H (2012) Taxon influence index: assessing taxon-induced incongruities in phylogenetic inference. Syst Biol 61:337–345
Mirarab S, Bayzid MS, Boussau B, Warnow T (2014) Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 346:1250463
Misof B, Meyer B, von Reumont BM, Kück P, Misof K, Meusemann K (2013) Selecting informative subsets of sparse supermatrices increases the chance to find correct trees. BMC Bioinformatics 14:348
Mitchell A, Mitter C, Regier JC (2000) More taxa or more characters revisited: combining data from nuclear protein-encoding genes for phylogenetic analyses of Noctuoidea (Insecta: Lepidoptera). Syst Biol 49:202–224
Miyamoto MM, Fitch WM (1995) Testing the covarion hypothesis of molecular evolution. Mol Biol Evol 12:503–513
Moroz LL, Kocot KM, Citarella MR, Dosung S, Norekian TP, Povolotskaya IS, Grigorenko AP, Dailey C, Berezikov E, Buckley KM, Ptitsyn A, Reshetov D, Mukherjee K, Moroz TP, Bobkova Y, Yu F, Kapitonov VV, Jurka J, Bobkov YV, Swore JJ, Girardo DO, Fodor A, Gusev F, Sanford R, Bruders R, Kittler E, Mills CE, Rast JP, Derelle R, Solovyev VV, Kondrashov FA, Swalla BJ, Sweedler JV, Rogaev EI, Halanych KM, Kohn AB (2014) The ctenophore genome and the evolutionary origins of neural systems. Nature 510:109–114
Nesnidal MP, Helmkampf M, Bruchhaus I, Hausdorf B (2010) Compositional heterogeneity and phylogenomic inference of metazoan relationships. Mol Biol Evol 27:2095–2104
Nosenko T, Schreiber F, Adamska M, Adamski M, Eitel M, Hammel J, Maldonado M, Müller WEG, Nickel M, Schierwater B, Vacelet J, Wiens M, Wörheide G (2013) Deep metazoan phylogeny: when different genes tell different stories. Mol Phylogenet Evol 67:223–233
Parks SL, Goldman N (2014) Maximum likelihood inference of small trees in the presence of long branches. Syst Biol 63:798–811
Philip GK, Creevey CJ, McInerney JO (2005) The Opisthokonta and the Ecdysozoa may not be clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol Biol Evol 22:1175–1184
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D (2011) Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9:e1000602
Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Quéinnec E, Da Silva C, Wincker P, Le Guyader H, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Wörheide G, Manuel M (2009) Phylogenomics revives traditional views on deep animal relationships. Curr Biol 19:706–712
Philippe H, Lartillot N, Brinkmann H (2005a) Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol 22:1246–1253
Philippe H, Snell EA, Bapteste E, Lopez P, Holland PWH, Casane D (2004) Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol 21:1740–1752
Philippe H, Zhou Y, Brinkmann H, Rodrigue N, Delsuc F (2005b) Heterotachy and long-branch attraction in phylogenetics. BMC Evol Biol 5:50
Phillips MJ, Penny D (2003) The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol 28:171–185
Pisani D (2004) Identifying and removing fast-evolving sites using compatibility analysis: an example from the Arthropoda. Syst Biol 53:978–989
Pisani D, Pett W, Dohrmann M, Feuda R, Rota-Stabelli O, Philippe H, Lartillot N, Wörheide G (2015) Genomic data do not support comb jellies as the sister group to all other animals. Proc Natl Acad Sci U S A 112:15402–15407
Pol D, Siddall ME (2001) Biases in maximum likelihood and parsimony: a simulation approach to a 10-taxon case. Cladistics 17:266–281
Pollock DD, Zwickl DJ, McGuire JA, Hillis DM (2002) Increased taxon sampling is advantageous for phylogenetic inference. Syst Biol 51:664–671
Rannala B, Huelsenbeck JP, Yang Z, Nielsen R (1998) Taxon sampling and the accuracy of large phylogenies. Syst Biol 47:702–710
Rivera-Rivera CJ, Montoya-Burgos JI (2016) LS3: a method for improving phylogenomic inferences when evolutionary rates are heterogeneous among taxa. Mol Biol Evol 33:1625–1634
Rodríguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H (2007) Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol 56:389–399
Rokas A, Abbot P (2009) Harnessing genomics for evolutionary insights. Trends Ecol Evol 24:192–200
Rokas A, Carroll SB (2005) More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol 22:1337–1344
Rokas A, Williams B, King N, Carroll S (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804
Rosenberg MS, Kumar S (2001) Incomplete taxon sampling is not a problem for phylogenetic inference. Proc Natl Acad Sci U S A 98:10751–10756
Roure B, Baurain D, Philippe H (2013) Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol Biol Evol 30:197–214
Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497:327–331
Sanderson MJ, McMahon MM, Steel M (2010) Phylogenomics with incomplete taxon coverage: the limits to inference. BMC Evol Biol 10:155
Sanderson MJ, Shaffer HB (2002) Troubleshooting molecular phylogenetic analyses. Annu Rev Ecol Syst 33:49–72
Sanderson MJ, Wojciechowski MF, Hu J-M, Khan TS, Brady SG (2000) Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Mol Biol Evol 17:782–797
Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504
Shen X-X, Salichos L, Rokas A (2016) A genome-scale investigation of how sequence, function, and tree-based gene properties influence phylogenetic inference. Genome Biol Evol 8:2565–2580
Smith SA, Dunn CW (2008) Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics 24:715–716
Spencer M, Susko E, Roger AJ (2005) Likelihood, parsimony, and heterogeneous evolution. Mol Biol Evol 22:1161–1164
Sperling EA, Pisani D, Peterson KJ (2007) Poriferan paraphyly and its implications for Precambrian palaeobiology. Geol Soc Lond Spec Publ 286:355–368
Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313
Steel MA, Lockhart PJ, Penny D (1993) Confidence in evolutionary trees from biological sequence data. Nature 364:440–442
Strimmer K, von Haeseler A (1997) Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci U S A 94:6815–6819
Struck TH, Nesnidal MP, Purschke G, Halanych KM (2008) Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa). Mol Phylogenet Evol 48:628–645
Suh A, Smeds L, Ellegren H (2015) The dynamics of incomplete lineage sorting across the ancient adaptive radiation of neoavian birds. PLoS Biol 13:e1002224
Sullivan J, Swofford D, Naylor G (1999) The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol Biol Evol 16:1347
Susko E, Roger AJ (2007) On reduced amino acid alphabets for phylogenetic inference. Mol Biol Evol 24:2139–2150
Tarrío R, Rodríguez-Trelles F, Ayala FJ (2001) Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. Mol Biol Evol 18:1464–1473
Telford MJ, Moroz LL, Halanych KM (2016) Evolution: a sisterly dispute. Nature 529:286–287
Thorley JL, Wilkinson M (1999) Testing the phylogenetic stability of early tetrapods. J Theor Biol 200:343–344
Townsend JP (2007) Profiling phylogenetic informativeness. Syst Biol 56:222–231
Van de Peer Y, Frickey T, Taylor JS, Meyer A (2002) Dealing with saturation at the amino acid level: a case study based on anciently duplicated zebrafish genes. Gene 295:205–211
Wang H-C, Susko E, Roger AJ (2011) Fast statistical tests for detecting heterotachy in protein evolution. Mol Biol Evol 28:2305–2315
Weigert A, Helm C, Meyer M, Nickel B, Arendt D, Hausdorf B, Santos SR, Halanych KM, Purschke G, Bleidorn C, Struck TH (2014) Illuminating the base of the annelid tree using transcriptomics. Mol Biol Evol 31:1391–1401
Whelan NV, Halanych KM (2016) Who let the CAT out of the bag? Accurately dealing with substitutional heterogeneity in phylogenomic analyses. Syst Biol 52:696–704
Whelan NV, Kocot KM, Moroz LL, Halanych KM (2015) Error, signal, and the placement of Ctenophora sister to all other animals. Proc Natl Acad Sci U S A 112:5773–5778
Whelan S, Blackburne BP, Spencer M (2011) Phylogenetic substitution models for detecting heterotachy during plastid evolution. Mol Biol Evol 28:449–458
White W, Hills S, Gaddam R, Holland B, Penny D (2007) Treeness triangles: visualizing the loss of phylogenetic signal. Mol Biol Evol 24:2029–2039
Wiens JJ (1998) Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst Biol 47:625–640
Wiens JJ (2003) Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol 52:528–538
Wiens JJ, Morrill MC (2011) Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst Biol 60:719–731
Wu J, Susko E (2011) A test for heterotachy using multiple pairs of sequences. Mol Biol Evol 28:1661–1673
Xia X (2013) DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol 30:1720–1728
Xia X, Xie Z, Salemi M, Chen L, Wang Y (2003) An index of substitution saturation and its application. Mol Phylogenet Evol 26:1–7
Yang Z (1996) Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol 11:367–372
Zwickl DJ, Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51:588–598