MATERIALS AND METHODS


Plant material

PI416937 pedigree analysis

For this study we looked to obtain high yielding lines that were derived from PI416937 as well as there parental lines. This combination of PI416937 derived lines and parental lines formed what we will refer to as a trio. Each entire trio was genotyped for the pedigree analysis. One of the parental lines must obviously have been derived from PI416937. We used 52 trios in our pedigree analysis. For seven of the trios, both parents were derived from PI416937. By pedigree, percentage PI416937 ranged from 12.5% to 50% in the high yielding progeny line of the trios. There were several lines that were derived from the same parent combinations. We became concerned of biasing our results for certain regions under selection if they were inherited to multiple progeny from the same parental combination, but these progeny were derived from the same F2 plant or an even later generation. This would indicate that these multiple progeny were no longer multiple independent occurrences where a breeder selected a high yielding line with a particular genomic region from PI416937 so they should not be treated as such. In order to identify lines we believed to be derived from the same F2 generation or later, we looked for sister lines which contained what we deemed to be an excess of shared recombination events. As long as we could determine within a reasonable doubt that these lines were derived from different F2 plants, we kept them in the analysis. The high yielding progeny lines were chosen based on inclusion in the USDA Southern States Uniform Tests (UT). This was deemed an acceptable criteria by us because the primary breeding objective for soybean breeders is to develop high yielding breeding lines. Breeders in the southeastern U.S. will take their highest yielding breeding lines developed each year and test them against the material of other breeders in the UT for at least one and possibly multiple years to determine which lines are most elite to possibly release as germplasm or a cultivar. At the very least, these high yielding lines would most likely be recycled as parents in future crosses by interested breeders. Lines chosen for the analysis were present in the UT as early as 1994 and as recently as 2015, covering a roughly twenty year timespan. N93-110-6 was not present in any of the UT, but was still chosen to include because of it's elite pedigree and the fact that this line was bred for superior yield, albeit under specifically drought conditions (Devi 2014 reference). Forty-four of the lines included in the analysis were bred at USDA Raleigh while the remaining eight were bred at the University of Georgia (UGA). Seed for the genotyping of these lines was obtained from their respective source institutions. Not all of the lines in these trios needed to be procured for growing out plants for genotyping because genotypic data was already available from Soybase. 

RIL pedigree and yield analysis

The UGA soybean breeding program developed five F5 derived RIL populations with the original intention of developing lines for germplasm release or eventual cultivar release. These populations underwent a traditional breeding scheme which began with the initial crosses, single pod descent to the F5 generation, single-plant selections, and then single-plant-derived-row selections. Four of the RIL populations (RIL-1, RIL-2, RIL-3, RIL-4) placed into yield testing were comprised of 84 lines each. The fifth RIL population (RIL-5) was derived from a more diverse cross and this population was composed of 150 lines. RIL-1 will refer to the population derived from a cross of AUO2-3104 x G00-3213RR2Y. RIL-2 will refer to the population derived from a cross of G93-2225 x G09PR-54329RR2Y. RIL-3 will refer to the population derived from a cross of G10PR-56248RR2Y x G10PR-56389RR2Y. RIL-4 will refer to the population derived from a cross of G10PR-10 x G10PR-56389RR2Y. RIL-5 will refer to the population derived from a cross of G00-3213 x LG04-6000. The term trio can be used here as well to refer to an individual RIL and both of the parental lines. 
Include pedigree information on these RIL populations

RIL-1
RIL-2
RIL-3
RIL-4
RIL-5

NIL yield analysis

NIL-1

The UGA soybean breeding program also developed two near isogenic populations composed of near isogenic lines (NILs) segregating for the YLD1 region for the purpose of yield testing. The first NIL population (NIL-1) originating from a single breeding line from the USDA Raleigh breeding program. This line was N01-11828 and was found to be heterogeneous for the YLD1 locus. N01-11828 was derived from a cross of 'Graham' (PI594922) x N96-7031. 'Graham' is a MG V elite line developed at USDA Raleigh that was released in 1997 (Carter 1997). 'Graham' was developed from a cross of N77-114 x 'Pixie' (PI543856). N96-7031 is an elite MG VIII line developed at USDA Raleigh. N96-7031 was developed from a cross between 'N7001' (PI615694) x N90-7241. 'N7001' was developed from a cross of N77-114 x PI416937. N90-7241 was derived from a cross of 'GaSoy17' (PI553046) x PI416937. Thus, by pedigree, N96-7031 was 50% PI416937. F6 generation seed from N01-11828 were screened with SSR markers linked to the YLD1 locus and found several individual seed which were heterozygous for the target QTL. Satt333 was an SSR marker we used for genotyping of the YLD1 region. Satt333 was located from 39,911,032 to 39,911,097 bp on chromosome 8. This is 1.9 Mb away from the SNP alleles we associated with the YLD1 region and thus we considered this SSR in close linkage with the YLD1 region of interest in the latter yield analysis of the YLD1 region. NILs were grown out from these heterozygous seed and genotyped again at the YLD1 locus to identify which lines inherited the PI416937 genomic region and which had not. Twenty NILs were used for yield testing. The distribution for the YLD1 locus was 15 lines contained the PI416937 allele and five lines contained the 'Graham' allele. We could have had an equal distribution of the genotypes but there was a second allele, located on chromosome 6, we were testing that needed to be relatively balanced as well.

NIL-2

We developed a population of 150 near isogenic lines (NILs) in order to test the region we referred to as YLD1 which was indicated as our threshold region for significance in our pedigree analysis. We cross 'Boggs' (PI602597) by 'Woodruff' in order to create this NIL population. 'Boggs' is a MG VI elite line developed at UGA that was released in 1997 (Boerma 2000). 'Boggs' was developed from a cross of G81-152 and  'Coker 237' (PI556536). 'Woodruff,' also referred to as G00-3209, is a MG VII elite line developed at UGA that was patented in 2010 (Boerma 2010). 'Woodruff' was developed from a cross of  'N7001' (Carter 2003) and 'Boggs.' 'N7001' was developed form a cross of N77-114 x PI416937. Thus, by pedigree, Woodruff was 25% PI416937. Satt333 was an SSR marker we used for genotyping of the YLD1 region. 'Woodruff' and PI416937 had also been genotyped using the SoySNP50k chip to ensure that this region of Woodruff containing YLD1 traced back to PI416937 and not 'Boggs.' The development of the NIL population was initiated from the crossing of 'Boggs' x 'Woodruff' in the summer of 2010 at the University of Georgia Plant Sciences Farm near Athens, GA. We obtained pollen from the 10 F1's that were true crosses and backcrossed to 'Boggs' in the UGA greenhouse during the winter of 2010-2011. We obtained three true BC1F1's from this first backcross of F1's to 'Boggs.' Satt333 was used to ensure all BC1F1's were heterozygous at the YLD1 locus. In summer 2011, at the UGA Plant Sciences Farm, we performed a second backcross to 'Boggs' using pollen from our BC1F1's. We ran satt333 and selected seven BC2F1's which were heterozygous at the the YLD1 locus. These BC2F1's were selfed in the UGA greenhouse facility during the 2011-2012 winter. During the summer of 2012, The BC2F2's were grown out at the UGA Plant Sciences Farm. We obtained 441 BC2F2 plants and single plant threshed each plant to create BC2F2:3's. During the 2012-2013 winter, these BC2F2:3's were sent to the USDA winter nursery in Puerto Rico to be grown in individual BC2F2:3 plant rows. Two hundred and eighty rows were randomly selected for genotyping. Leaf tissue from these rows was bulked by row and we ran markers to see the distribution of the different YLD1 genotypes. We selected 150 lines for yield testing that had a relatively equal distribution of all three possible YLD1 genotypes. The distribution broke down into 59 lines with the 'Boggs' allele, 40 heterozygous, and 51 with the PI416937 allele based upon genotyping with Satt333. Similar to the previous NIL population, we could have had an equal distribution of the genotypes but there was a second allele, located on chromosome 6, we were testing that needed to be relatively balanced as well.

Genotypic data

PI416937 pedigree analysis and RIL pedigree analysis

The USDA Soybean Germplasm Collection was genotyped using the SoySNP50K iSelect BeadChips (). Genotypic data utilized in this manuscript was obtained from Soybase on. Genotypic data for lines that were not present in Soybase was procured courtesy of USDA Beltsivlle as well as Michigan State University (MSU). These lines were also genotyped using the SoySNP50K iSelect BeadChips. Due to discrepancies in SNPs between USDA Beltsville and MSU, a final set of BLANK SNPs was established based upon USDA Beltsivlle which had the fewer number of SNPs between the two genotyping facilities. The five RIL populations were genotyped using the SoySNP6K iSelect BeadChips courtesly of USDA Beltsville. The physical distances of SNPs were originally based on genome assembly version Glyma.Wm82.a1 (Gmax1.01) (Schmutz, 2010) but were converted to version Glyma.Wm82.a2(Gmax2.0) for the analysis. SNPs which were mapped in Gmax1.01 but not Gmax2.0, were excluded from the analysis.
DNA was isolated from RILs and lines included in pedigree based analysis for which there was no previous genotypic data available. For each line, 20 seeds were planted in cups in a University of Georgia greenhouse facility. Once seedling were three weeks old, tissue was bulk harvested in 50ml falcon tubes for each line. Leaf tissue was lyophilized and ground into fine powder. DNA was extracted using the 1:0.1 CTAB (cetyl trim ethyl ammonium bromide): 2ME (betamercaptoethanol) method according to (Keim et al. (1988)), with some modifications. Modifications were made to improve purity of DNA which is required for successful genotyping using the SoySNP50K and SoySNP6K iSelect BeadChips. Some key modifications include adding  Edwards Extraction Buffer, NaCl, polyvinypyrrolidone, and proteinase k to the CTAB 2ME buffer. Another modification was to add a second 24:1 chloroform:isoamyl alcohol step to further remove proteins and polysaccharides. Isopropanol precipitatiton was performed at negative 20 degrees Celsius to improve concentrations. An additional 75% ethanol wash was performed to remove excess salts that fell out of solution during the isopropanol precipitation. 

PCA and phylogenetic tree
NIL marker work


Pedigree analysis to identify PI416937 regions under selection in high yielding PI416937 derived lines and RIL populations

This process was used for the pedigree analysis performed on the high yielding PI416937 derived lines as well as the five RIL populations. In this section, I will be referring to the high yielding PI416937 derived lines and the RILs as progeny. The first step was to identify for each progeny line, which alleles were inherited from which parent. To do this, we took each line and began matching alleles to each parent. It should be noted that for all genotypic data, SNPs were in order based upon their physical map positions from Gmax2.0. No genetic maps were made for the RIL populations so later we could have congruency with the original analysis of the high yielding PI416937 derived lines. Our goal was to calculate a matching score and whichever parent held a higher score for a genomic region was called the parental source for that region. Each matched allele was worth a point. If a locus in either the parent or progeny line was heterozygous, this was called as half of a point. Missing values were worth zero points. This matching continued until the matching was broken by an opposite allele being present in the parent. Then for a given matched segment, a score was calculated based upon the quality of the match and the match with the higher score was called the parent of origin. The parental match had to be at least two markers longer in one parent to be considered definitively one parent versus the other. In theory, a genomic region called for one parent could be from a shorter consecutive match because the longer match to the other parent had an excess of heterozygous loci or missing data points. We performed simulations to ensure this was an appropriate methodology for parental genomic origin calling in the progeny. If a given region had the same score in the two parents or the match was not two markers longer in one parent, we were unable to definitively call this region for either parent and it was called as ambiguous. Once the parental regions were identified, we looked to identify the origin of these regions with relation to PI416937 as well as to the predominant ancestor lines of North American southern elite material according to Vaughn et al. (Vaughn 2016). We chose North American southern ancestors versus all North American ancestors because the genotypes used in our analysis were comprised predominantly of southern germplasm and this also lead to less ambiguity in identifying ancestor sources for the genomic regions. We were now able to trace back the genomic origins of our progeny lines to their parental sources and then to their original ancestor sources. The regions we then became interested in were regions in which one parent contained an allele from PI416937 at a a particular locus and the other parent did not. This then became a test of this PI416937 allele to determine if it had been transmitted to the progeny during the breeding selection process. In some cases the ancestor source of an allele in a parental line was identified as ambiguous but the most likely sources, though similar to each other, were not similar to PI416937 so this remained a test if the other parent contained an allele at this locus definitively IBD with an allele from PI416937. To obtain p-values for alleles under selection, we performed a two sided binomial test in R (R Core Team, 2015). The number of successes was entered as the number of times the PI416937 allele was inherited. The number of trials was entered as the number of trios in which the PI416937 allele was tested in the parents. The hypothesized probability of success was entered as 0.5 because we expected that by chance, with no selection pressure, the allele from PI416937 would be inherited 50% of the time it was tested in the parents. We set our significance threshold at 0.05 and performed a multiple test correction utilizing the Li and Ji methodology to account for collinearity among linked markers (Li 2005). For the analysis utilizing the high yielding PI416937 derived lines, we were unable to discover regions that were statistically significant after the multiple testing correction so we decided to focus on regions which were at least as significant as a previously discovered genomic region associated with yield from PI416937 referred to as Yld1 (David publication). For a majority of the RIL populations we were able to find regions that were statistically significant using the multiple test corrected threshold. Regions under selection were defined as any run of consecutive markers surpassing our chosen statistical significance threshold. There were several situations in which markers showed varying levels of significance within a run of consecutive markers. The markers with the highest levels of significance within these runs of consecutive markers have the most evidence of selection in a positive or negative direction. We also did not include markers in our results that were not tested in at least 10 of our trios.

Plant material characterization

PI416937 pedigree analysis
Pedigree tree (Helium)
Phylogenetic tree
GAPIT PCA

We traced the pedigrees of the high yielding PI416937 derived lines as far back as was possible utilizing germplasm/cultivar releases, USDA Southern States UT publications, and USDA GRIN. We used Helium to display this information (Shaw 2014).  We conducted PCA of the trios used in the analysis along with North American ancestor lines and a sample of relatively modern public and private soybean materials using the GAPIT R package (Lipka 2012). The PCA was plotted using TIBCO Spotfire (reference). We also constructed a distance matrix using TASSEL v5.2.29 (Bradbury 2007). We used the neighbor joining clustering method and then we created a circular phylogenetic tree using FigTree v1.4.3 (reference). 

QTL identification

PI416937 pedigree analysis and RIL pedigree analysis

Once we identified genomic regions from PI416937 under favorable selection, we examined the literature to see if any reported QTL from mapping studies involving PI416937 had mapped QTL to the regions we found in our analysis. There are several studies that have been performed in the past involving PI416937 looking to map the following traits: WUE (Mian 1996), aluminum tolerance (Bianchi-Hall 2000), root morphology (Abdel-Haleem 2010), drought tolerance (Carpentieri-Pipolo 2011), and canopy wilt (Abdel-Haleem 2012). Using soybase, we identified the physical positions of the nearest markers to the QTL of interest to estimate the QTL physical positions. If the nearest marker was an SSR, we used the physical positions of the primer sequences for SSR markers located 3' or 5' of the QTL in order to estimate the physical positions of the QTL. If the nearest marker was a SNP marker, the exact SNP position for the SNP located 3' or 5' of the QTL was indicated. We then looked to see if any portion of these QTL were located within regions we found to be under positive selection. We determined it was not relevant to look for QTL mapped using PI416937 that were overlapping with PI416937 regions we found under negative selection. These QTL mapping studies were designed to find favorable genomic regions of PI416937, not the alternative parents. This being said, several of these mapping studies found genomic regions from the alternative parent to be favorable for a couple of these traits of interest. For these cases, we looked to see if these QTL were in regions under negative selection from PI416937, indicating that there is something detrimental from PI416937 which was identified as unfavorable in both a bi-parental QTL mapping study and our complex pedigree analysis. 

Yield analysis

RIL yield analysis
NIL yield analysis

NIL-1

The 20 NILs were yield tested across several different locations in GA and NC over a period of five years (2007-2011). The Georgia yield tests were grown at UGA Plant Science Farm locations while the North Carolina yield test were grown at North Carolina Agricultural Research Service locations. During the summer of 2007, the NILs were yield tested in Athens, GA; Plains, GA; and Caswell, NC. In 2008-2010, yield testing occurred in the same three location with the addition of Plymouth, NC. In 2011, yield testing occurred only in Athens, GA and Plains, GA. In addition to the 20 NILs, elite checks were grown in the yield trials. Three blocks of each line were grown in a randomized complete block design in 2007 while four blocks were grown in 2008-2011. The elite checks varied somewhat from year to year. In 2007, two replications of 'Boggs' and two replications of 'Benning' were grown per block. An additional elite check, 'Vance,' was grown per block in Caswell, NC. In 2008 and 2011, two replications of both 'Boggs' and 'Benning' were grown at each location in each block. In 2009-2010, two replications of 'Boggs' and 'Benning' were grown in each block. Yield tests from 2009-2010 in Caswell, NC and Plymouth, NC were tested with elite check, 'N7103' as well. Yield trials at the Georgia locations were grown in two row plots which were 4.9 m long and later end trimmed to 3.7 m near the end of the vegetative stage. Rows were planted 76 cm apart and were harvested for yield evaluation. For North Carolina locations, lines were planted in three row plots which were 6.1 m long and row spacing was 97 cm between rows. These plots were end trimmed to 4.6 m near the completion of the vegetative growth stage. Only the middle row was harvested for yield evaluation. Various traits we measured were seed yield, plant height, lodging score, seed size, seed protein content, seed oil content, and maturity date.
Elite checks which were in the experimental design for the yield test, were not genotyped for the yield analysis because we only wanted to examine the yield differences among NILs for YLD1. We initially ran a mixed model with YLD1, location, year, YLD1*location, and YLD1*year as fixed effects and blocks within location*year as a random effect to determine if there was significant GxE.

NIL-2

 The BC2F2:4 plant rows of these 150 lines were yield tested at the UGA Plant Sciences Farms in Athens, GA and Plains, GA in the summer of 2013. These 150 lines were broken up into three equal sets of fifty based upon maturity date. Each set also contained two replications of 'Boggs' and  two replications of 'Woodruff.' There were two blocks of the experiment per location set up in a randomized complete block design. These yield trials were grown in two row plots which were 4.9 m long and later end trimmed to 3.7 m near the end of the vegetative stage. Rows were planted 76 cm apart and were harvested for yield evaluation. Various traits we measured were seed yield, plant height, lodging score, seed size, seed protein content, seed oil content, and maturity date. Seed yield, plant height, lodging score, seed size, seed protein content, and seed oil content was measured for all blocks at each location. Maturity dates were measured only at the Athens location.
Parental lines which were in in the experimental design for the yield test, were not genotyped for the yield analysis because we only wanted to examine the yield differences among NILs for YLD1. We initially ran a mixed model with YLD1, location, and YLD1*location as fixed effects and blocks nested within location as a random effect to determine if there was GxE.  Since there was significant evidence of GxE we ran the location separately to see what the effect of YLD1 was within each location. The models for the individual locations had YLD1 as a fixed effect, while block was a random effect. For each location, we then ran a Tukey HSD multiple means comparison test to identify which groups statistically were higher yielding within each location. For the two individual locations, we performed the same analysis for plant height, lodging score, seed size, seed protein content, seed oil content, and maturity date. 

Comparison of PI416937 regions under selection with regions of low diversity in North American elite breeding lines