Skip to main content
Open Access

Geographic Range Dynamics Drove Ancient Hybridization in a Lineage of Angiosperms

Abstract

Elucidating the dynamic distribution of organismal lineages has been central to biology since the nineteenth century, yet the difficulty of combining biogeographic methods with shifts in habitat suitability remains a limitation. This integration, however, is critical to understanding geographic distributions, present and past, as well as the time-extended trajectories of lineages. Here, we link previous advances in phyloclimatic modeling to develop a framework that overcomes existing methodological gaps by predicting potential ecological and geographic overlap directly from estimated ancestral trait distributions. We show the utility of this framework by focusing on a clade in the montane angiosperm genus Heuchera, which is noteworthy in that it experienced ancient introgression from circumboreally distributed species of Mitella, lineages now ~1,300 km disjunct. Using this system, we demonstrate an application of ancestral state reconstruction to assess geographic range dynamics in a lineage lacking a fossil record. We test hypotheses regarding inferred past geographic distributions and examine the potential for ancient geographic contact. Application of this multifaceted approach suggests potential past contact between species of Heuchera and Mitella in western North America during cooler periods of the Pleistocene. Integration of niche models and phylogenetic estimates suggests that climatic cooling may have promoted range contact and gene flow between currently highly disjunct species. Our approach has wide applicability for testing hypotheses concerning organismal co-occurrences in deep time.

Online enhancements:   supplemental materials. Dryad data: http://dx.doi.org/10.5061/dryad.7cr0c76.

Introduction

Geographic ranges of organisms have been and continue to be driven dramatically by climate change (Schönswetter et al. 2005; Lenoir et al. 2008; Alvarez et al. 2009; Birks 2010; Dullinger et al. 2012), sometimes resulting in surprising disjunct distributions. A remarkable example in plants is the ancestral occurrence of hybridization between taxa that are currently widely allopatric, with past co-occurrence hypothesized to have been driven by Pleistocene climate change often increasing suitable habitat in montane organisms (López-Alvarez et al. 2015; Klein and Kadereit 2016; Marques et al. 2016).

For extant lineages, these scenarios can be tested using paleoclimatic projections and existing ecological niche modeling (ENM). However, assessing ancestral range dynamics requires that the object of inference be the degree of overlap in niche envelope of two lineages that are no longer extant, such that ENM methods must be combined with ancestral niche reconstruction (phyloclimatic modeling; Yesson and Culham 2006a). If these methods are extended to provide not only point estimates but also variance for environmental predictors for ancestral lineages (e.g., Evans et al. 2009; Smith and Donoghue 2010; Hinojosa et al. 2016), they can be used to estimate the ancestral niche envelope directly. Thus, ancestral niche characteristics can be inferred via reconstructing a multivariate ecological niche space (E-space), for which the only hard limit in temporal depth is the ability to obtain confident ancestral reconstructions under available models. In addition, for time periods for which paleoclimatic data are available, an inferred E-space can be translated into geographic space (G-space; i.e., geographic projection; e.g., Yesson and Culham 2006a; Williams et al. 2017). Hence, testing the potential for ancestral co-occurrence requires combining existing approaches. Connecting these disparate approaches to phyloclimatic modeling—particularly the direct use of nodal variance estimates and geographic projection of reconstructed niches at nodes—provides an opportunity to test not only hypothesized niche shifts but also environmental and geographic overlap through time, reconstructing the potential for historical biotic interactions.

The flowering plant genus Heuchera (alumroots; Saxifragaceae) has experienced an extreme level of gene flow among lineages, both contemporary and ancestral, making it an excellent test case for ancient hybridization (Rosendahl et al. 1936; Wells 1984; Soltis et al. 1991; Folk et al. 2017). An earlier study (Folk et al. 2017) using coalescent simulations concluded that ancient introgression, rather than lineage sorting, better explained well-resolved but extremely discordant phylogenomic data. The most dramatic of these instances features a disjunction of ~1,300 km and involves the well-supported capture of the chloroplast genome of the ancestor of two species of Mitella (bishop’s cap, hereafter the “Mitella clade,” =Mitella sensu stricto with an extant distribution that is circumboreal and extends into the eastern United States) by the ancestor of a clade of five southern Californian species of Heuchera (hereafter “California Heuchera”; fig. 1; Soltis et al. 1991; Folk et al. 2017). Species of these two genera are remarkably distinct morphologically (fig. 1), yet species in different genera of Saxifragaceae are known to hybridize and can be crossed artificially (Doyle et al. 1985; Soltis and Soltis 1986; Folk et al. 2017; reviewed in Spongberg 1972).

Figure 1.
Figure 1.

Extant distribution (a) of Mitella (red) and California Heuchera (blue) and a phylogeny (b) plotting the focal clades and inferred introgression event. Exemplar taxa (most recent common ancestors marked by diamonds) for the focal clades (marked by boxes) are illustrated by exemplar photographs: blue = H. elegans, red = M. diphylla.

We developed a computational pipeline that combines existing methods in a single integrated framework to estimate ancestral niches and translates these to habitat suitability using paleoclimatic models. We then applied these methods to the Heuchera-Mitella hybridization event and asked whether the ancestors of the Mitella clade and California Heuchera (1) overlapped in niche space and (2) whether and where they overlapped in geographic space under glacial maximum conditions compared with interglacial conditions. We then integrated these estimates with a time-calibrated phylogeny and biogeographic reconstructions to refine further areas of potential co-occurrence that might have facilitated hybridization, demonstrating how these methods have broad applicability in assessing potential past co-occurrences among species.

Methods

Phylogenetics

The phylogenetic framework for this study was built on previous results (Folk et al. 2015, 2016, 2017) by supplementing outgroup sampling with key taxa to strengthen our identification of introgressed lineages and inferences of ancestral states. Briefly, an earlier study sequenced 277 nuclear loci via targeted enrichment (total target length ~400,000 bp; Folk et al. 2015). We sampled 42 of 43 known species of Heuchera. To improve outgroup representation, we mined transcriptome resources for five species from the 1,000 Plants project (1KP; https://onekp.com) and generated new genomic targeted sequencing and transcriptome data for 13 additional outgroup taxa, summarized in tables S1 and S2.

Sequencing

Sequencing followed Folk and Freudenstein (2015) with the following modifications: library preparation was performed by Rapid Genomics (Gainesville, FL; using TruSeq-like adapters as in Folk et al. 2015), the targeted insert size was >200 bp, and sequencing used a 300-cycle (150-bp read) kit for a HiSeq 3000 instrument. Outgroup sampling (21 taxa; table S1) was improved greater than fivefold (Folk et al. 2017) and included representatives of all lineages that have been hypothesized to undergo hybridization in the Heuchera group of Saxifragaceae (Soltis et al. 1991).

Intronic sequence can be recovered from RNAseq data (Ameur et al. 2011) but has consistently lower coverage than exons, and noncoding read dropout can be high when the assembly target is divergent from the reference (R. A. Folk, personal observation). For this reason, all intronic positions in assemblies were excluded from the matrix after alignment. For nuclear analyses, reads were assembled with 277 references comprising the gene sequences used for bait design, with intronic sites removed. Assembly methods otherwise followed a previously developed Burrows-Wheeler-Aligner-based approach (BWA; Folk et al. 2015). Assembly for target-enriched data again followed Folk and Freudenstein (2015), inferring a single consensus sequence that combined allelic variants by majority consensus. For chloroplast genome analyses, the reference plastid genome of Heuchera parviflora var. saurensis (Folk et al. 2015) was stripped of all intronic and intergenic sequence; reads were assembled to 113 references comprising coding sequences in the chloroplast genome (protein-coding genes, ribosomal DNA, transfer RNAs), following methods for nuclear data.

Alignment and Concatenation

All consensus sequences were aligned using MAFFT (Katoh et al. 2009) using “–auto” and default gap cost options. Manual editing was deemed unnecessary for the plastid analysis and for nuclear genes obtained through target enrichment. However, for several nuclear genes, exon-intron boundaries were incorrectly aligned in transcriptomic data; these were identified and manually edited in Geneious (ver. R9; Kearse et al. 2012).

Species Tree Estimate

Given the similarity of phylogenetic results from concatenation and coalescence methods for this data set (Folk et al. 2017), we used the former here to facilitate downstream analyses. To summarize state reconstructions on a single optimal tree, we inferred the phylogeny in RAxML (using all individuals sampled for 277 nuclear genes) and took as the plotting tree the best maximum likelihood (ML) tree, pruned to a single individual per species at random (tree available at https://github.com/ryanafolk/ambitus; see also fig. S1). Bootstrapping (1,000 replicates) on the pruned data set was used to integrate phylogenetic uncertainty in ancestral estimation. Because noncoding sites were excluded and gene-wise partitioning did not previously have a major impact on topology or support (Folk et al. 2017), we did not use data set partitions.

Chloroplast Gene Tree Estimate

To test the chloroplast capture scenario (Folk et al. 2017), we constructed and analyzed an augmented chloroplast gene matrix with RAxML rapid bootstrapping and best tree search (option -f a) with 5,000 replicates and a GTR-gamma model, with coding and noncoding nucleotide positions set as distinct partitions, following Folk et al. (2017).

Phylogenetic Dating

We used phylogenetic dating approaches to identify the ages of the Heuchera and Mitella lineages of interest and the possible time frame for ancient hybridization between them. Because fossils suitable for phylogenetic dating do not exist for any close relatives in the family Saxifragaceae, we used secondary calibrations in BEAST version 1.8.3 (Drummond and Rambaut 2007) following Deng et al. (2015). We used two prior formulations on the time calibration, and both chloroplast and nuclear data were used to estimate the implied time frame of gene flow independently; in all, four dating analyses were performed. We used a secondary date (Deng et al. 2015): 7.27 myr for the Heuchera group of genera (i.e., all taxa but Peltoboykinia and Telesonix). This calibration was treated under two priors: lognormal and truncated normal (1–28.55 myr, the latter being the estimated age of the Heuchera group of genera; calibration priors with significant probability density at 0 are problematic). Both prior treatments were set with a mean of 7.27 myr and standard deviation of 1.645. Node age prior standard deviation was from posterior density data (Deng et al. 2015), namely, [95% interval upper limit] − mean / 2, to impose the approximate uncertainty of the source date. Otherwise, we used a fixed tree topology (the same topology used above to plot ancestral reconstructions), the GTR-gamma nucleotide substitution model (four gamma categories, estimated base frequencies, exponential prior on alpha with mean 0.5), a lognormal uncorrelated relaxed clock (exponential standard deviation prior with mean 0.33 and approximate reference mean prior), and Yule speciation (priors on Yule birthrate and root height uniform [0, ∞]).

Markov chain Monte Carlo (MCMC) was run for 50 million generations, sampling every 1,000 generations, with 20% of this as burn-in; six independent chains for nuclear data and eight for chloroplast data were run with these settings, and chains for nuclear data and chloroplast data were respectively combined for separate summary trees for each partition (i.e., 240–300 million generations retained or 240,000–300,000 samples). All effective sample size (ESS; calculated with Tracer ver. 1.6; Drummond and Rambaut 2007) values were >100 for chloroplast analyses and >200 for nuclear analyses. Under an introgression scenario, the most recent common ancestor (MRCA) of the source and introgressant in a gene tree is expected to date to the hybridization event for any locus that was transferred. Therefore, as a complementary dating approach, we used the chloroplast genome alignment to estimate the age of the MRCA of Mitella and California Heuchera. All settings and MCMC lengths were the same as for the nuclear analysis.

Point Record Synthesis

We synthesized available point records from all taxa in the species tree in major online data repositories relevant to the study region (California, Pacific Northwest, and Intermountain Consortia of Herbaria; GBIF; S-NET [http://science-net.kahaku.go.jp/specimen_en/collection/]; OS [https://herbarium.osu.edu/online-data-access]; SEINet; and SERNEC). After assessing sampling weaknesses in point records from these repositories, strategic taxa of Heuchera were identified, imaged, and georeferenced from specimen loans from the following herbaria: ARIZ, ASC, ASU, BRIT, CAS, CS, DUKE, F, GH, MEXU, MICH, MINN, MNA, MO, NCU, NMC, NY, RM, RSA, SIU, TENN, TEX, UNM, US, UTC, UTEP, VT, WCUH, WIS, XAL. The records for several species (H. acutifolia, H. glomerulata, H. inconstans, H. longipetala, H. mexicana, H. rosendahlii, H. sanguinea, H. soltisii, and H. wellsiae) were taken from previously published occurrence maps (Folk 2013; Folk and Freudenstein 2014b; Folk and Alexander 2015); additional unpublished identified records were added for Mexican H. versicolor. We also used occurrence data from previous fieldwork; for H. longiflora, H. missouriensis, H. parviflora, H. puberula, and H. soltisii, we mostly or entirely used field-collected GPS records (Folk and Alexander 2015; Folk and Freudenstein 2015; Folk et al. 2018a).

To georeference localities of specimens lacking GPS coordinates, we primarily used GeoLocate, supplemented by verification in Google Earth and Google Maps, following best practices including, briefly, using only unique and unambiguous place names, discarding localities that refer to large metropolitan areas or cultivated records, verifying that cardinal directions and distances were correctly calculated, and verifying by satellite imagery that broadly suitable habitat was present. We removed all point records calculated to the nearest degree. Spurious records were removed manually by reference to known ranges in the literature (cited above); in particular, a large number of European records were removed because this group of plants contains common garden subjects; other botanical garden records were found by scanning locality fields and removed. Final occurrence record numbers, deduplicated pixel-wise (30-arcsecond resolution), are given in table S3. Species delimitations were conservative and primarily followed current taxonomic works (Rosendahl et al. 1936; Rosendahl 1951; Wells 1984; Wells and Elvander 2009; Folk 2013; Folk and Freudenstein 2014b, 2015; Folk and Alexander 2015).

Ecological Predictor Assembly

We assembled 35 layers at 30-second resolution, capturing aspects of climate, soil, landcover, and topography. These comprised 19 well-known temperature and precipitation variables (Bioclim; http://www.worldclim.org/bioclim gives the complete list of variables and units), three topographical layers (elevation, GTOPO30; https://lta.cr.usgs.gov/GTOPO30; aspect and slope calculated from this layer in GDAL, using a Z-factor of 111,120), seven soil layers (bulk density, coarse fragment percentage, clay percentage, silt percentage, sand percentage, pH, organic carbon content, SoilGrids1km; averaged in qGIS across layers at 2.5-, 10-, 22.5-, and 45-cm core depths; https://soilgrids.org/), and six landcover classes (percentage cover of needle-leaf, evergreen broadleaf, deciduous broadleaf, mixed trees, shrubs, herbaceous; http://www.earthenv.org/landcover). Forthwith, “environmental variables” refer to our climatic and nonclimatic factors jointly.

Layer Correlation

We calculated layer correlation in R using Pearson’s r on layers downsampled to 2.5-minute resolution (Python library GDAL, http://www.gdal.org/); using nearest-neighbor sampling, and clipped to the combined range of sampled extant taxa using the training region method noted below on pooled point records; convex hulls were calculated separately for disjunct range portions in Asia and North America and merged for the correlation calculation. Among highly correlated (Pearson’s r ≥ 0.75) environmental layer pairs, we deleted one of the layers using a random-number generator. This resulted in 22 layers, which was still excessive given the limited locality data we had for many species, so we retained from this set the following 12 layers chosen to capture multiple climatic, edaphic, and topographic aspects relevant to Heuchera (Wells 1984; Folk and Freudenstein 2014b): mean annual temperature, temperature mean annual range, annual mean precipitation, mean precipitation of driest quarter (i.e., Bioclim 1, 7, 12, and 17), elevation, slope, mean coarse fragment percentage, mean pH, mean sand percentage, mean organic carbon content, needle-leaf landcover percentage, and herbaceous landcover percentage.

Training Region Development

To set up training regions for our models, we combined ecoregional and dispersal distance concepts. We used the Nature Conservancy terrestrial ecoregion data set (http://maps.tnc.org/gis_data.html; based on Olson et al. 2001), taking those that contained points for the species of interest. To incorporate dispersal, we calculated a convex hull containing all points; in the absence of specific dispersal data from Heuchera or its relatives (which are expected to be poorly dispersed; cf. Wells 1984), we conservatively used a distance of 0.5 degrees for a buffer distance around this hull (60 pixels at 30-second resolution or ~42.7 km at 40° latitude). The final training region was taken as the intersection of the kept ecoregions and buffered hull. All calculations were performed in qGIS (http://www.qgis.org/) and calculated per species. Effectively, the training region comprised all biomes from which a species is known, trimmed to 0.5 degrees from the most peripheral occurrences; this is summarized in figure S2. Models based on these training regions are hereafter referred to as “narrow-buffer” models. As a model treatment to assess the sensitivity of the results to these fairly tight training region definitions, we applied a further 1-degree buffer to the 0.5-degree training regions, referred to hereafter as “wide-buffer” models. Training regions were clipped from global rasters using ENMGadgets (https://github.com/vijaybarve/ENMGadgets; degree-based buffers) or GDAL (50-km buffers).

Ecological Niche Modeling

Maxent (Phillips et al. 2004, 2006) was run with a maximum of 5,000 iterations, extrapolation of extant species’ geographic projections (used only for quality control) disabled, and points with missing data allowed. For all modeled taxa, 25% of the data were set aside for model testing, and the remainder were used for training. Models were averaged over 10 bootstrap replicates. To evaluate model performance, we calculated jackknifes and response curves as well as the default receiver operating characteristic curve and omission plots; model statistics are given in table S3. Models were inferred for 30-second-resolution data; a trial of 2.5-minute models (not shown) resulted in poorer fit statistics. To assess the sensitivity of our results to the alternative projection and buffer distance treatments generated above, we ran models and all downstream analyses in duplicate with the narrow-buffer and wide-buffer training regions.

Additionally, we developed a third set of niche models to assess the sensitivity of our results to a degree-based coordinate reference system and buffer. We recalculated slope and aspect data sets in GDAL based on a Mollweide (equal area) projection to account for potential latitudinal distortion of these metrics. We repeated models with this slope data set and a 50-km buffer (similar in size to a 0.5-degree buffer). These models hereafter are denoted as “distance-buffer” models.

Predicted Niche Occupancy Profiles

To calculate the distribution of niche space occupancy in the models, we calculated predicted niche occupancy profiles (PNOs; first developed in Evans et al. 2009) for all retained layers, using the R package phyloclim (Heibl and Calenge 2013). A PNO profile is a density histogram that represents an integration of a single environmental layer over the suitability surface predicted by an ecological niche model.

For 10 included taxa, we could not obtain more training points than the number of layers (i.e., n<15,=12+25%). Given the extensive loans undertaken from critical collections for western North America and Mexico, this likely represents the limits of our knowledge of occurrences for these taxa. We addressed this issue by circumventing niche modeling for these species and sampling directly from layer values at occurrence coordinates (cf. Ogburn and Edwards 2015), excluding pixel-wise duplicates, and created a PNO profile solely of these observed values sampled under a uniform distribution.

Ancestral Reconstruction

To incorporate the statistical distribution of niche space occupancy for extant species into ancestral reconstruction, we sampled the environmental value of each of these bins proportionally (cf. Evans et al. 2009; Nyári and Reddy 2013) for 1,000 samples. Ancestral reconstruction was performed in BayesTraits 2.0 (http://www.evolution.rdg.ac.uk/BayesTraits.html), an MCMC implementation of ancestral reconstruction that allows for directly estimating the distribution of ancestral values. Each sample was taken as the true point value for each extant taxon in a single ancestral reconstruction, and this process was repeated for each input variable, resulting in 12,000 ancestral reconstructions. To incorporate phylogenetic uncertainty, ancestral reconstruction was run on a sample of 1,000 RAxML fast bootstrap trees, generated (option -f a) from a sequence matrix that included one randomly selected individual per species (see above); results were plotted on the optimal phylogenetic estimate (best ML tree with all individuals, pruned to the same randomly selected samples, described above). For both the model and the reconstruction step, MCMC chains were run for 5 million generations, preceded by 5 million generations of burn-in. All priors were uniform over [−300, 5,000], the approximate range of environmental data. For tree summary statistics, the median, 2.5, and 97.5 percentiles were calculated for every node in each MCMC chain; these were then averaged across MCMCs to summarize average median and credibility intervals across MCMCs. Output was formatted in BEAST-style extended NEXUS format for visualization. Because more variable intronic data were excluded, not all clades were highly supported; we directly incorporated phylogenetic uncertainty by searching over the 1,000 bootstrap trees with branch lengths in the MCMC analysis.

To evaluate the presence of niche conservatism (i.e., that closely related species are more similar in niche space than expected by chance), we tested for phylogenetic signal in two ways. First, we used the λ test (Pagel 1999) as implemented in the R package phytools (Revell 2012) to individually test niche variables as independent univariate predictors. Second, we used the multivariate Kmult test (Adams 2014), as implemented in the R package phylocurve, to test whether there was evidence for niche conservatism in our data set as a whole. Trait values for tip taxa were the medians (i.e., 50th percentile) of the PNO histograms. Because phylogenetic signal among outgroup taxa or between the outgroup and ingroup would not be relevant to the inferences made here, all of these analyses were restricted to the group of interest by removing outgroups Telesonix and Peltoboykinia.

For BayesTraits reconstructions, we used a Brownian single-rate model. Although not implemented to date in the continuous ancestral reconstruction module of BayesTraits, Ornstein-Uhlenbeck (OU) models (Hansen 1997) are also commonly used (e.g., reviewed in Pennell and Harmon 2013). To compare these models, we tested the model fit of Brownian motion against OU models for all 12 variables using the corrected Akaike information criterion (AICc; Akaike 1974; Sugiura 2007) implemented in the fitContinuous function of the R package phytools. We also compared the Brownian model against a directional model (using a nonultrametric tree) that is implemented in BayesTraits using AICc, calculated from likelihoods estimated in BayesTraits. These analyses were performed using median values from the PNOs and restricted to ingroup taxa as with niche conservatism tests.

Ancestral Suitability Overlap

We examined ancestral suitability overlap for the two ancestral taxa of the Mitella and California Heuchera clades in both E-space and G-space. For E-space, we quantified the intersection (Cha 2007) of probability densities, using probability-density histograms in 50 equal bins from the minimum to maximum of observed values, comparing the two MRCAs for each variable to evaluate the relative impact of the environmental factors we used on potential overlap.

We developed new approaches implemented as a suite of postprocessing tools, with the aim of projecting nodal variable distributions into G-space. We developed the following two alternative approaches: (1) a binary approach based on an n-dimensional hypervolume comprising credibility intervals and (2) a binned-probabilities approach that directly estimates pixel-wise joint posterior probabilities. To assess where this range overlap might have been distributed geographically in the past, we considered climatic layers from the Last Glacial Maximum (LGM), mid-Holocene, and last interglacial (available at http://www.worldclim.org/paleo-climate1). For the former two, we used the CCSM4 model at 2.5-minute resolution, and for the latter, we used that from Otto-Bliesner et al. (2006), sampled to 2.5-minute resolution with GDAL, averaging pixels. Of variables used to model niche space, all four retained Bioclim layers were available for these periods.

For the binary map (1), intended as representing a literal Hutchinsonian niche (an n-dimensional hypercube in E-space) and close analog of the Bioclim method (Nix 1986; cf. Yesson and Culham 2006b), we calculated 95% credibility intervals (2.5 and 97.5 percentiles) from pooled MCMC chains for both ancestral taxa, thinned as described above. We then calculated a binary map in qGIS for each species model by scoring as present only pixels in a paleoclimatic model projection that were in the credibility interval. We then combined these rasters by taking the intersection (in this case, the product), returning all pixels that were within 95% of the distribution of all four Bioclim variables in the paleoclimatic projection.

We also calculated a binned-probabilities map (2), intended to incorporate distributions of suitability rather than ranges, and hence closer to commonly used modeling approaches (e.g., Maxent), to quantify joint posterior probability of niche suitability. We calculated probability-density histograms for the two ancestral taxa of interest, using 50 evenly spaced bins from the minimum to maximum observed MCMC value (lower limit inclusive). The four Bioclim rasters were then classified by the corresponding bin probability; the rasters were multiplied to yield joint probabilities, then probabilities were normalized to sum to 1. Viewing the result as representative of a posterior probability density function in geographic space, we calculated niche suitability overlap as the intersection IS of their probability densities (Cha 2007):

IS=i=1nmin(Ai, Bi),
where A and B are raster layers with values corresponding to ancestral posterior probability and n is the number of pixels.

Histogram and credibility interval calculations were performed in R (base package); all raster operations were performed in GDAL. For all subsequent analyses, the input data comprised a pool of all 1,000 MCMC chains, thinned to include every fiftieth sample (100 samples per chain, required for memory management).

Temporal Mismatch and Extrapolation

While the hybridization event we assessed likely dates to the early Pleistocene (cf. “Results”), published paleoclimatic data (e.g., WorldClim data; http://www.worldclim.org/) generally are not publicly available at both high spatial and temporal resolution earlier than the late Pleistocene, limiting our ability to match our phylogenetic data temporally with an exact glacial cycle. Furthermore, levels of uncertainty in dating are typically too wide to distinguish among climatic oscillations that are on the scale of tens of thousands of years apart (cf. credibility intervals in “Results”; all >1 myr). Hence, we are not attempting to interpret our reconstructions as actual distributions at the time line of interest. Rather, we used an LGM paleoclimatic layer as representative of a typical Pleistocene cool period; the other two layers were used as two alternative historical and present warm scenarios (cf. Waltari et al. 2007).

To address the temporal mismatch between available paleoclimatic data and our phylogenetic dating results, we explored whether the patterns we recovered for the late Pleistocene were generalizable to earlier glacial cycles in the Pleistocene. We built a linear regression model of the geographic range intersection for our three reconstructed paleoclimate scenarios, based on intersection statistics from the narrow-buffer models and constrained to BioGeoBEARS results (see below) against a recent data set of temperature anomaly for the Northern Hemisphere in the past 5 myr (de Boer et al. 2014; available at https://www.ncdc.noaa.gov/paleo-search/study/19564). We then used this model to extrapolate the range intersection of the Mitella and California Heuchera clades across the period of the Pleistocene when introgression could have occurred based on dating above.

Pipeline Construction

We automated the steps outlined above (summarized in fig. 2) by designing a data-analysis pipeline in Python, called ambitus, that serves as a wrapper for BayesTraits, to handle the thousands of ancestral reconstructions required to incorporate PNOs from extant taxa, parallelizing the computationally intensive reconstructions, and automating summary statistics and tree plotting. Some initial data formatting is required (e.g., variable and taxon label standardization), but otherwise operation is straightforward. The program (available at https://github.com/ryanafolk/ambitus/) is compatible with Linux and OS X and allows customization of a large number of analysis parameters, as well as optionally producing extensive diagnostic MCMC output. We also provide a Python-based postprocessing script (https://github.com/ryanafolk/ambitus-postprocessor) to enable ancestral geographic projection and overlap quantification.

Figure 2.
Figure 2.

Conceptual summary of the estimation of ancestral niche space performed in ambitus.

Biogeography

A major potential weakness of niche modeling approaches is that they do not explicitly attempt to model dispersal; a reasonable way to integrate dispersal into analyses in a phylogenetic context is to use biogeographic models (e.g., Meseguer et al. 2015). Estimates of ancestral biogeographic states were performed in the R package BioGeoBEARS (Matzke 2013) under both the DEC and DEC+J models, using our dated nuclear tree under the truncated normal prior. Biogeographic regions were coded from ecoregion concepts but conceived more broadly to include the main montane physiographic regions that correlate with Heuchera distributions in a manageable number of states for the DEC model. The following six states were recognized: Temperate East Asia; Eastern United States; Pacific-Montane (California Coast Ranges, Pacific Northwest, Sierra Nevada, Transverse Ranges); Sierra Madre; Rockies; and Basin and Range (Great Basin and adjacent mountain ranges of Southwest North America north of Mexico and east of the Sierra Nevada). This coding scheme uses physiographic regions for the United States (https://water.usgs.gov/GIS/metadata/usgswrd/XML/physio.xml; sensu Fenneman and Johnson 1946) and is congruent with a previous biogeographic analysis (Folk and Freudenstein 2014a; cf. fig. S3). Two further major regions present in the data, the Great Plains and Himalayas, were classified as the nearest neighbors of Eastern United States and Temperate East Asia, respectively, because only one taxon occurred in each. The maximum allowed ancestral range size was 6 (among extant taxa, the largest was 4). The two models were compared by AIC and likelihood ratio tests (LRT).

We explicitly combined biogeographic results with geographic projections of niche suitability by constraining our inference of suitable habitat to the ancestral accessible area (sensu Soberón and Peterson 2005) as inferred by BioGeoBEARS. Using the vector of biogeographic regions (fig. S3), we assigned zero probability to all pixels outside the single optimal ancestral region reconstructed in biogeographic analyses. We calculated overlap statistics both globally and for this ancestral accessible area.

Data and Code Availability

All raw sequence data have been deposited in the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra); all DNA assemblies, phylogenetic trees, occurrences, Maxent suitability layers, and ancestral reconstructions have been deposited in Dryad: http://dx.doi.org/10.5061/dryad.7cr0c76 (Folk et al. 2018c). Scripts and data sets to reproduce ancestral reconstruction analyses are available at GitHub (https://github.com/ryanafolk/ambitus).

Results

Phylogenetic Estimate

Our nuclear phylogenetic estimates (fig. S1) were consistent with previous results, while greatly increasing taxon sampling (Okuyama et al. 2012; Folk et al. 2017); additional outgroup taxa were mostly placed confidently in both the nuclear and chloroplast trees. Significantly, the chloroplast phylogeny (fig. S4) resolved the California clade of Heuchera as sister to a clade comprising Mitella diphylla and Mitella nuda, confirming that the ancestor of these Mitella species (rather than only M. diphylla; Folk et al. 2017) donated its chloroplast genome to the ancestor of the California Heuchera clade.

Phylogenetic Dating

In the nuclear gene tree, 95% credibility intervals for crown MRCA nodes were 2.1784 myr [0.433–4.5639] and 0.7266 myr [0.1696–1.5062], respectively, for Mitella and California Heuchera under the lognormal calibration prior (fig. S5) and 2.8 myr [1.0915–4.5766] and 0.9166 myr [0.3791–1.4773] under the truncated normal prior (fig. S6). Chloroplast estimates for the date of genome transfer were somewhat younger than nuclear estimates: 1.8792 myr [0.334–4.0491] under the lognormal calibration prior (fig. S7) and 2.3031 myr [0.7142–4.0726] under the truncated normal prior (fig. S8). Hence, dating was compatible with an early Pleistocene time frame for co-occurrence of the ancestors of these two clades and for introgression.

Niche Modeling Metrics

The mean of pixel-wise deduplicated occurrence records available per taxon (table S3) was 140. The mean and minimum areas under the curve (AUC) for the narrow-buffer models were 0.9263 and 0.843, respectively; for wide-buffer models, they were 0.9471 and 0.882; and for distance-buffer models, they were 0.9245 and 0.803. As expected, including more unsuitable habitat in training regions drove fit statistics up; the narrow-buffer training regions were fairly conservative given that no more than ~40 km were included beyond observations.

Ancestral Niche Reconstruction

Ancestral reconstructions of environmental parameters (figs. 3, 4) showed apparent phylogenetic structure on visual inspection (fig. 4), and tests for phylogenetic signal in niche space occupancy confirmed this. The univariate λ tests (table S4) were highly significant after Bonferroni correction for all variables (P<.001 in all cases) with the exception of herbaceous landcover (P prior to correction = .0917), which was not used for ancestral geographic projection. The Kmult test also showed strong structure considering the niche variables as multivariate data (P<.0000001). AICc model comparisons favored Brownian motion over both the OU model (table S5) and the directional model (table S6) for all 12 variables.

Figure 3.
Figure 3.

Histograms summarizing pooled Markov chain Monte Carlo distributions across all 12 variables examined for ancestors of Mitella (red) and California Heuchera (blue). The Y-axis represents absolute sample number. These histograms reflect narrow-buffer models; wide-buffer model projections are given in figure S9; distance-buffer model projections are given in figure S10; and predicted niche occupancy histograms for extant taxa are given in figure S11.

Figure 4.
Figure 4.

Median values of niche space for all ancestral nodes and tip taxa, using narrow-buffer models. Branches are colored by median values scaled from the minimum (red) to the maximum (blue). Black circles at nodes represent by their diameters the relative credibility interval breadth of the nodes. Species labels are not shown for compact representation, but tips follow the exact sequence of figure S1. Asterisks denote focal clades—blue for California Heuchera and red for Mitella. Ancestral reconstructions for wide-buffer and distance-buffer models are given in figures S12 and S13.

Ancestral Overlap

The E-spaces of the MRCAs of Mitella and California Heuchera overlapped for all variables examined (table 1; fig. 3). When the ancestral distributions are projected into environmental conditions estimated for the Holocene, LGM, and interglacial periods (fig. 5), areas of greatest potential geographic overlap include northern California and the Sierra Madre of Mexico. The cumulative overlap (quantified by the intersection of the joint probability density in G-space, see “Methods”; table 2) was highest under LGM conditions, which was the coolest paleoclimatic scenario we examined. This was equally true for models under the three training region buffer treatments although the absolute magnitude differed.

Table 1.

Quantification of variable-wise environmental (ecological niche spaces) overlap as intersections of joint posterior probabilities of occurrence between the two taxa (defined in “Methods”)

 Narrow-buffer modelsWide-buffer modelsDistance-buffer models
Mean annual temperature.4013.4149.3844
Temperature annual range.3031.3133.2972
Mean annual precipitation.4083.3988.4124
Precipitation of driest quarter.1594.1592.1527
Carbon content.2648.3252.2798
Coarse fragment percentage.3450.3378.3349
Elevation.1378.1505.1355
Herbaceous landcover percentage.6608.6594.6716
Needle-leaf landcover percentage.6033.5867.5866
pH.3537.3539.3535
Soil sand percentage.3654.3677.3679
Slope.3057.3130.3776
Figure 5.
Figure 5.

a, Ancestral geographic projections of estimated niche space for ancestors of Mitella (red) and California Heuchera (blue), at all three time slices with both projection methods. b, Niche overlap of projected distributions (calculated as the intersection; see “Methods”). These projections show narrow-buffer models; wide-buffer model projections are given in figure S15; those for distance-buffer models are given in figure S16. Last interglacial = 120–140 thousand years ago [kya], Last Glacial Maximum = 22 kya, Holocene Optimum = 6 kya.

Table 2.

Quantification of geographic space overlap between the two ancestral lineages as intersections of pixel-wise joint posterior probabilities of occurrence (defined in “Methods”)

 Global cumulative overlapAncestral region cumulative overlap
Narrow-buffer models:  
 Holocene Optimum (6 kya).0403.0428
 Last Glacial Maximum (22 kya).0639.1429
 Last interglacial (120–140 kya).0403.0428
Wide-buffer models:  
 Holocene Optimum (6 kya).0574.0623
 Last Glacial Maximum (22 kya).0658.1438
 Last interglacial (120–140 kya).0424.0458
Distance-buffer models:  
 Holocene Optimum (6 kya).0517.0558
 Last Glacial Maximum (22 kya).0616.1396
 Last interglacial (120–140 kya).0382.0387

Note. Under both training region formulations, overlap probability is greatest during glacial conditions (boldface). This result was obtained when overlap was assessed globally (left column), as well as when assessed only in the ancestral region inferred by BioGeoBEARS.

View Table Image

We used a regression approach with a 5-myr temperature anomaly data set to predict geographic overlap in past glacial maxima prior to those for which we performed geographic projection, focusing in particular on the intervals between the two chloroplast MRCA dates that directly dated chloroplast capture (fig. 6, truncated normal prior = solid vertical line, 2,303.1 thousand years ago [kya]; and lognormal prior = dashed vertical line, 1,879.2 kya). Temperature anomalies between these two dates are comparable to those for the Holocene, LGM, and interglacial climatic models (−1.717°C, −15.379°C, −7.413°C, respectively), and extrapolated range intersections (ranging from 0.0254 [warmest interglacial] to 0.1127 [coolest glacial maximum]); cf. table 2 and fig. 6) were similar in scale. This suggests that, at least on the basis of temperature anomaly alone, glacial maxima in the period of interest had comparable impacts on shared habitat to the LGM data we used directly for geographic projection.

Figure 6.
Figure 6.

Temperature anomaly and extrapolated range overlap of the Mitella and California Heuchera clades throughout the Pleistocene and late Pliocene. The black trace and left-hand vertical axis show temperature anomaly compared with the present. The red trace and right-hand vertical axis show extrapolated range intersection, on the scale [0, 1] as defined in “Methods.” The time interval plotted represents the credibility interval recovered from phylogenetic dating of chloroplast data with a lognormal prior (i.e., 4,049 kya to 334 kya). The vertical lines show the inferred date of chloroplast capture based on chloroplast dating analyses, the solid line shows the point estimate with a truncated normal prior, and the dashed line shows the point estimate with a lognormal prior.

Biogeographic Inference

AIC and LRT favored DEC+J (fig. S14), yet for the focal ancestral taxa, DEC and DEC+J did not differ qualitatively in geographic range inference; this was also largely true for the tree as a whole. The reconstructed distributions of the MRCAs of both the Mitella and California Heuchera clades contain the Pacific region but have essentially no probability of presence in the Sierra Madre. Hence, when dispersal and vicariance are considered, suitable habitat for the MRCAs in the Sierra Madre is ruled out, leaving northern California as the remaining region of highest overlap probability. As previously observed with parsimony reconstruction (Folk and Freudenstein 2014a), the reconstructed distributions of the nodes ancestral to Heuchera and other genera contained the Pacific region without exception. The use of the DEC model relaxes the assumption of single-state geographic ranges, yet a substantial portion of nodes ancestral to, and early-diverging within, Heuchera are still reconstructed as solely Pacific. Niche overlap calculations restricted to the ancestral region predicted by BioGeoBEARS were congruent with those calculated using global niche suitability projections (table 2).

Discussion

Using a synthesis of biogeographic and paleoclimatic approaches, we demonstrated that in taxa that are presently widely separated, potential distributions in the Pleistocene might have been conducive to gene flow between the ancestors of the two lineages. First, the ancestors of the California Heuchera and Mitella clades had broadly overlapping reconstructed distributions of all examined environmental predictor variables (fig. 3; table 1). Biogeographic results also reconstructed shared range between the two focal clades (fig. S14). Therefore, in terms of both potentially shared habitat suitability during the Pleistocene and reconstructed historical distributions, our results are consistent with the existence of opportunities for ancient gene flow between the ancestors of what are now highly disjunct taxa that exhibit clear molecular evidence of ancient hybridization.

Under both cool and warm Pleistocene scenarios, we assessed geographic overlap of these ancestral Heuchera and Mitella clades (fig. 5b; table 2). However, overlap probabilities were highest during the LGM scenario (table 2), particularly compared with interglacial conditions, indicating that cool conditions during the Pleistocene (either at the LGM or in a prior glacial maximum) might have most greatly facilitated overlap and produced the greatest opportunity for gene flow. The suitable range of the MRCA of the Mitella clade was much more southerly and narrow than that of the extant species, which is consistent with typical responses to glaciation for lowland taxa in eastern North America (Soltis et al. 2006). The range suitability of the ancestor of the California Heuchera clade was much broader than the present range—an expected feature for high-montane organisms in cooler climates that have sky-island distributions in the present (Guralnick 2007). Hence, although suitable habitat for the California Heuchera clade expanded during the Pleistocene, the pronounced southward shift of the Mitella clade and more continuous distribution of the California Heuchera clade at this time compared with the present primarily drove geographic overlap between these clades. Overall, while the suitability prediction for California Heuchera oscillated considerably in different time slices, Mitella habitat suitability showed more extreme latitudinal shifts. Large amounts of suitable climate were inferred for both lineages in the Sierra Madre of Mexico and northern California; suitability in other regions had almost no overlap, suggesting historical contact in areas of local sympatry.

Temporal Mismatch and Extrapolation

Because we could not directly match the temporal scale of paleoclimatic and phylogenetic data, we used late-Pleistocene paleoclimatic models for geographic projections to represent how general cool and warm periods of the Pleistocene might have impacted range overlap (see also Waltari et al. 2007). Our extrapolation of range overlap to the approximate date of gene flow suggested by phylogenetic dating (fig. 6) suggests that earlier glacial maxima resulted in similar increases in shared range. This indicates that, at least on the basis of a temperature predictor, our results were generalizable to earlier time windows in the Pleistocene and that geographic overlap was likely facilitated similarly by earlier glacial cool periods. A critical point here is that direct temporal alignment of events based on phylogenetic inferences and climatic events is challenging in particular because of phylogenetic uncertainty. A comparison of alternative phylogenetic dating approaches (e.g., two prior formulations on the same data set in fig. 6; see also “Results”) and credibility intervals for each analysis resulted in variance spanning several glacial-interglacial cycles, highlighting this challenge.

Comparison with Other Approaches to Ancestral Niche Reconstruction

Our pipeline synthesizes previous developments in phyloclimatic reconstruction, offering two major advances over current methods. First, ambitus accomplishes tasks that previously were attempted individually but not combined, including incorporating extant taxon niche breadth (e.g., Nyári and Reddy 2013; Ogburn and Edwards 2015) and phylogenetic uncertainty (e.g., Smith and Donoghue 2010). Furthermore, it integrates these sources of variance in a Bayesian approach, as opposed to ML and squared-change parsimony approaches used solely to date in paleoclimatic modeling. Incorporating uncertainty in ancestral reconstruction is crucial for interpreting the results of downstream analyses (Schluter et al. 1997). This is particularly true of uncertainty in the trait model, rarely accounted for to date, because uncertainty has a clear a priori expectation, with the earliest tree nodes possessing the greatest uncertainty (Garland et al. 1999).

The second advance of our approach is that it enables ancestral geographic range predictions to be calculated directly from distributions of ancestral niche reconstructions. Some methods have previously been proposed for inferring ancestral ranges using paleoclimatic reconstructions. Most of these have used the min-max coding workflow of Yesson and Culham (2006a; see discussion in Lawing and Matzke 2014 of shortcomings in these types of approaches); in one case, paleoclimatic models were used to map the Mahalanobis distance from each pixel to the point estimate of reconstructed ancestral niche attributes (Williams et al. 2017). The former approach results in a binary map where each pixel is either suitable or unsuitable, while the latter approach results in a distance measure that is difficult to interpret directly as a suitability statistic. Neither directly uses distributions of ancestral reconstructions; the pipeline we present allows these ancestral reconstructions to inform predictions of potential paleodistribution and niche overlap.

New Prospects for Detecting Hybridization and Contextualizing Events

The approach outlined here has broad utility in the study of hybridization. The detection of hybridization in recent phylogenetic studies has largely been an exercise in molecular data analysis (Sang and Zhong 2000; Holder and Anderson 2001; Joly et al. 2009; Meng and Kubatko 2009; Gerard et al. 2011; Yu et al. 2013; Folk et al. 2017; reviewed in Folk et al. 2018b), with few exceptions (Marques et al. 2016). Our study suggests that evaluating ancestral niches and ranges is promising as a complementary procedure for validating and contextualizing hypotheses of hybridization identified through the use of molecular data. In the Heuchera case study presented here, the data are consistent with cooling climatic trends maximizing opportunities for hybridization, factors that could not be inferred from phylogenetic analyses of molecular data alone. For most hybrids that have been identified, climatic correlates of shared geographic range are typically unknown; an integrative approach to detecting hybridization that uses ecological, paleoclimatic, and genetic data can facilitate the identification of geographic regions and climatic conditions in which particular events could have taken place.

Future Directions for Ancestral Habitat Reconstruction

Researchers have been limited by a lack of programs enabling analysis and postprocessing of ancestral reconstructions beyond visualization of results. Future work focusing on explicit integration of biogeographic methods with paleoclimatic modeling will facilitate new forms of hypothesis testing relevant to shallow- and deep-scale diversification, particularly in relation to historical climate change. Direct assessment of ancestral dynamics in geographic space relies on the availability of historical climate layers for the period of interest. However, niche space occupancy estimates do not share this requirement and thus can potentially be used in deeper time intervals than those we considered. We hope that our case study demonstrates the value of synthesizing developments to date in paleoclimatic modeling to generate distributions of ancestral suitability. In particular, inference of ancestral distributions enables testing hypotheses about geographic range dynamics and lineage overlap and offers glimpses into the geographic potential for historical-biotic interactions, which are perhaps more elusive than other factors that biologists have invoked for explaining present-day distributions and organismal properties.

We thank H. Owens and N. Barve for discussions on ecological niche modeling methods, concepts, and layer sources; B. Drew and M. Gitzendanner for discussions on missing data; C. A. McCormick (University of North Carolina Herbarium) and the curators of the herbaria noted in “Methods” for assisting with occurrence record acquisition; B. Stucky for discussions on Brownian motion and statistical sampling; J. V. Freudenstein for advice concerning homology; E. B. Sessa for reading an early draft; the 1,000 Plants project for granting access to several transcriptomes; Y. Okuyama for sharing DNA materials of Japanese Mitella; and the Florida Museum of Natural History for supporting R.A.F. as a visiting researcher. This research was supported by National Science Foundation (NSF) DBI 1523667 and DEB 1406721 to R.A.F. (NSF PRFB and DDIG programs) and NSF DBI 1458640 to D.E.S. and P.S.S.

Literature Cited

Top left, Heuchera abramsii (San Gabriel alum-root), a high alpine species found only in a very small part of the Transverse Ranges of southern California. This is a member of the group that received genetic material from Mitella species through hybridization. Top right, Mitella diphylla (two-leaf mitrewort), a woodland species found broadly in eastern North America. The ancestor of this and its sister species (M. nuda) experienced hybridization with California Heuchera and transferred genetic material from its chloroplast and mitochondrion. Bottom, Habitat of H. elegans (urn-flowered alum-root), in crevices on loose stony slopes, as with other Heuchera species in southern California. Photo credit: R. A. Folk.

Associate Editor: Stephen B. Heard

Editor: Alice A. Winn