Browsing by Subject "Cross-validation"

Development of MidDRIFTS methodologies to support mapping of physico-chemical soil properties at the regional scale
(2014) Mirzaeitalarposhti, Reza; Müller, Torsten
Changing climate conditions and land-use change severely affect key ecosystem processes in soils. Hence, regular monitoring of essential soil properties is required to implement appropriate soil management in agro-ecosystems. However, characterizing soil properties at different spatial scales remains challenging, requiring a large amount of geo-referenced data obtained by intensive sampling. Mid-infrared diffuse reflectance infrared Fourier transform spectroscopy (midDRIFTS) in combination with partial least squares regression (PLSR) was applied as a rapid-throughput method to quantify soil properties and to assess soil spatial patterns at the regional scale in two agro-ecological areas. A pre-sampling at the regional scale was done to develop the most efficient midDRIFTS-PLSR prediction models by testing two different calibration procedures, i.e. cross-validation and independent validation, to quantify essential soil properties with 126 sample points. A generic midDRIFTS-PLSR prediction model was developed to predict most soil properties of “unknown” samples accurately using an independent validation approach. The next step was the integration of midDRIFTS-PLSR with geostatistics to facilitate regional soil property mapping. The developed midDRIFTS-PLSR models were used to predict TC, TIC, TOC and soil texture contents (clay, silt and sand) of the 1170 soil samples. The midDRIFTS-PLSR models accurately predicted all soil properties. Furthermore, the integration of midDRIFTS-PLSR-based predicted data with geostatistics resulted in high-resolution maps of soil carbon and texture at the regional scale, which are an improvement over the existing maps. As a further development of midDRIFTS approaches for soil quality assessment, spectral-based indexes for characterizing SOM quality and quantifying carbonate at the regional scale were explored. MidDRIFTS peak areas corresponding to SOM functional groups (2930, 1620, 1520 and 1159 cm-1) were assessed to study the composition of SOM. 
The peak assigned to the aliphatic C-H bond (2930 cm-1) was an appropriate index to investigate SOM fractions if the interference of carbonates was taken into consideration. The regression performance obtained between the peak at 2930 cm-1 and SOM fractions (e.g., R2 = 0.31 for Cmic) increased to R2 = 0.65 when samples with a high carbonate content (total inorganic carbon > 1%) were excluded. The most accurate spectral index for carbonate was the peak area at 713 cm-1 when related to TIC obtained by Scheibler's method (R2 = 0.98). In conclusion, it was demonstrated that midDRIFTS-PLSR is a rapid-throughput method for providing high-quality predictions of soil properties to update regional digital soil property mapping by integration with geostatistics. It opens a new possibility to gain high-resolution data coverage of soil C and N pools, which is relevant for the application of SOM simulation models on a regional scale. However, to up-scale the approach for extended geographical areas, further efforts are needed to establish a national-level spectral library by considering standardization of sampling, analytical reference analyses and midDRIFT spectroscopy techniques.
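The difference between the two calibration procedures named in this abstract, cross-validation and independent validation, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis workflow: ordinary least squares stands in for PLSR, the predictors and responses are synthetic random numbers rather than spectra or soil data, and all function names (`fit_ols`, `predict`, `rmse`, `kfold_cv_rmse`, `independent_validation_rmse`) are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept (assumed stand-in for PLSR)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted coefficients to new predictor rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    """Root mean squared error of prediction."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def kfold_cv_rmse(X, y, k=5, seed=0):
    """Cross-validation: each sample is predicted once by a model
    that was fitted without the fold containing that sample."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    preds = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef = fit_ols(X[train_mask], y[train_mask])
        preds[test_idx] = predict(coef, X[test_idx])
    return rmse(y, preds)

def independent_validation_rmse(X_cal, y_cal, X_val, y_val):
    """Independent validation: fit once on a calibration set and
    evaluate on a separate set that played no role in fitting."""
    coef = fit_ols(X_cal, y_cal)
    return rmse(y_val, predict(coef, X_val))

# Synthetic data for illustration only (not measurements from the thesis).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + rng.normal(scale=0.1, size=120)

cv_error = kfold_cv_rmse(X, y, k=5)
iv_error = independent_validation_rmse(X[:80], y[:80], X[80:], y[80:])
print(f"cross-validation RMSE:       {cv_error:.3f}")
print(f"independent validation RMSE: {iv_error:.3f}")
```

In cross-validation every sample contributes to both fitting and evaluation (in different folds), whereas independent validation keeps a separate set aside, which is closer to the "unknown samples" situation described above.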
Model selection by cross-validation in multi-environment trials
(2017) Hadasch, Steffen; Piepho, Hans-Peter
In plant breeding, estimating the performance of genotypes across a set of tested environments (genotype means) and estimating the environment-specific performances of the genotypes (genotype-environment means) are important tasks. For this purpose, breeders conduct multi-environment trials (MET) in which a set of genotypes is tested in a set of environments. The data from such experiments are typically analysed by mixed models because such models allow, for example, modelling the genotypes using random effects which may be correlated according to their genetic information. The data from MET are often high-dimensional and the covariance matrix of the data may contain many parameters that need to be estimated. To circumvent computational burdens, the data can be analysed in a stage-wise fashion. In the stage-wise analysis, the covariance matrix of the data needs to be taken into account in the estimation of the individual stages. In the analysis of MET data, there is usually a set of candidate models from which the one that best fits the objective of the breeder needs to be determined. Such a model selection can be done by cross-validation (CV). In the application of CV schemes, different objectives of the breeder can be evaluated using an appropriate sampling strategy. In the application of a CV, both the sampling strategy and the evaluation of the model need to take the correlation of the data into account to evaluate the model performance adequately. This work focused on two different types of models that are used for the analysis of MET. In Chapter 2, models that use genetic marker information to estimate the genotype means were considered. 
In Chapters 3 and 4, the focus was on the estimation of genotype-environment means using models that include multiplicative terms to describe the genotype-environment interaction, namely the additive main effects and multiplicative interaction (AMMI) model and the genotype and genotype-environment interaction (GGE) model. In all chapters, the models were estimated in a stage-wise fashion. Furthermore, CV was used in Chapters 2 and 3 to determine the most appropriate model from a set of candidate models. In Chapter 2, two traits of a biparental lettuce (Lactuca sativa L.) population were analysed by models for (i) phenotypic selection, (ii) marker-assisted selection using QTL-linked markers, (iii) genomic prediction using all available molecular markers, and (iv) a combination of genomic prediction and QTL-linked markers. Using different sampling strategies in a CV, the predictive performances of these models were compared in terms of different objectives of a breeder, namely predicting unobserved genotypes, predicting genotypes in an unobserved environment, and predicting unobserved genotypes in an unobserved environment. Generally, the genomic prediction model outperformed marker-assisted and phenotypic selection when there were only a few markers with large effects, while marker-assisted selection outperformed genomic prediction when the number of markers with large effects increased. Furthermore, the results obtained for the different objectives indicate that the predictive performance of the models in terms of predicting (unobserved) genotypes in an unobserved environment is reduced due to the presence of genotype-environment interaction. In AMMI/GGE models, the number of multiplicative terms can be determined by CV. In Chapter 3, different CV schemes were compared in a simulation study in terms of recovering the true (simulated) number of multiplicative terms, and in terms of the mean squared error of the estimated genotype-environment means. 
The data were simulated using the estimated variance components of a randomized complete block design and a resolvable incomplete block design. The effects of the experimental design (replicates and blocks) need to be taken into account in the application of a CV in order to evaluate the predictive performance of the model adequately. In Chapter 3, the experimental design was accounted for by an adjustment of the data for the design effects estimated from all data before applying a CV scheme. The results of the simulation study show that an adjustment of the data is required to determine the number of multiplicative terms in AMMI/GGE models. Furthermore, the results indicate that different CV schemes can be used with similar efficiencies provided that the data were adjusted adequately. AMMI/GGE models are typically estimated in a two-stage analysis in which the first stage consists of estimating the genotype-environment means while the second stage consists of estimating main effects of genotypes and environments and the multiplicative interaction. The genotype-environment means estimated in the first stage are not independent when effects of the experimental design are modelled as random effects. In such a case, estimation of the second stage should be done by a weighted (generalized least squares) estimation where a weighting matrix is used to take the covariance matrix of the estimated genotype-environment means into account. In Chapter 4, three different algorithms which can take the full covariance matrix of the genotype-environment means into account are introduced to estimate the AMMI/GGE model in a weighted fashion. To investigate the effectiveness of the weighted estimation, the algorithms were implemented using different weighting matrices, including (i) an identity matrix (unweighted estimation), (ii) a diagonal approximation of the inverse covariance matrix of the genotype-environment means, and (iii) the full inverse covariance matrix. 
The different weighting strategies were compared in a simulation study in terms of the mean squared error of the estimated genotype-environment means, multiplicative interaction effects, and biplot coordinates. The results of the simulation study show that weighted estimation of the AMMI/GGE model generally outperformed unweighted estimation. Furthermore, the effectiveness of a weighted estimation increased when the heterogeneity in the covariance matrix of the estimated genotype-environment means increased. The analysis of MET in a stage-wise fashion is an efficient procedure to estimate a model for MET data, but the covariance structure of the data needs to be taken into account in each stage in order to estimate the model appropriately. When correlated data are used in a CV, the correlation can be taken into account by an appropriate choice of training and validation data, by an adjustment of the data before applying a CV scheme, and by the success criterion used in a CV scheme.
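The three breeder objectives mentioned in this abstract (predicting unobserved genotypes, predicting genotypes in an unobserved environment, and predicting unobserved genotypes in an unobserved environment) correspond to different ways of choosing training and validation records. The sketch below shows one plausible way to build such splits for a single fold. It is an assumption-laden illustration, not the sampling code of the thesis: the function name `met_cv_split`, the objective labels, and the genotype and environment labels are all hypothetical.

```python
import itertools
import numpy as np

def met_cv_split(genotype, environment, test_genotypes, test_environments, objective):
    """Return boolean (train, validation) masks for one CV fold.

    objective (hypothetical labels):
      "new_genotype"    : validate on held-out genotypes, train on the others
      "new_environment" : validate on held-out environments, train on the others
      "new_both"        : validate on held-out genotypes in held-out environments;
                          train only on records sharing neither with the validation set
    """
    genotype = np.asarray(genotype)
    environment = np.asarray(environment)
    g_out = np.isin(genotype, list(test_genotypes))
    e_out = np.isin(environment, list(test_environments))

    if objective == "new_genotype":
        return ~g_out, g_out
    if objective == "new_environment":
        return ~e_out, e_out
    if objective == "new_both":
        # Records with exactly one held-out factor are used in neither set.
        return ~g_out & ~e_out, g_out & e_out
    raise ValueError(f"unknown objective: {objective!r}")

# Toy layout: 4 genotypes tested in each of 3 environments (12 records).
records = list(itertools.product(["G1", "G2", "G3", "G4"], ["E1", "E2", "E3"]))
geno = [g for g, _ in records]
env = [e for _, e in records]

for objective in ("new_genotype", "new_environment", "new_both"):
    train, valid = met_cv_split(geno, env, {"G4"}, {"E3"}, objective)
    print(objective, "train:", int(train.sum()), "validation:", int(valid.sum()))
```

In the "new_both" case the training and validation sets share neither genotypes nor environments, which is one way of respecting the correlation structure of MET data when choosing training and validation data.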
QTL mapping of resistance to Sclerotinia sclerotiorum (Lib.) De Bary in sunflower (Helianthus annuus L.)
(2005) Micic, Zeljko; Melchinger, Albrecht E.
Sclerotinia sclerotiorum (Lib.) de Bary is one of the most important pathogens of sunflower. Three different disease symptoms can be caused by S. sclerotiorum: Sclerotinia wilt, midstalk rot and head rot. An improvement of the resistance against S. sclerotiorum would contribute to yield security and thus increase the profitability of sunflower cultivation. We investigated resistance to midstalk rot with respect to the prospects of marker-assisted selection (MAS). The objectives were to (1) identify quantitative trait loci (QTL) involved in resistance against Sclerotinia sclerotiorum, (2) map their position in the genome, (3) characterize their gene effects, and (4) estimate their consistency across generations of the cross NDBLOSsel x CM625. Two sunflower lines with a high resistance level to S. sclerotiorum and different genetic origins (NDBLOSsel and TUB-5-3234) were used as parents. They were crossed with the highly susceptible line CM625 to develop two mapping populations. A modified leaf test was used for the evaluation of midstalk-rot resistance. Three resistance traits and two morphological traits were measured. Disease resistance of 354 F3 families of the population NDBLOSsel x CM625 was screened in field trials with two different sowing times in 1999. A total of 317 recombinant inbred lines (RIL) derived from F3 families were tested in 2002/2003. The 434 F3 families of the cross CM625 x TUB-5-3234 were screened in 2000/2001. The field trials were conducted using generalized lattice designs with three replications and five infected plants per replication. Highly significant genetic variation between F3 families and RIL was observed for the resistance traits in all field trials. Heritabilities were highest for stem lesion and lowest for leaf lesion in all three experiments. The resistance traits were moderately correlated with each other. 
For the construction of the genetic map of population NDBLOSsel x CM625, 352 F2 individuals were analyzed with 117 SSR marker loci. On the basis of results from the QTL mapping in F3 families, 41 markers were selected and genotyped in 248 RIL. A "selective genotyping" (SG) approach was used for population CM625 x TUB-5-3234. Based on the results measured in F3 families for stem lesion, the SSR genotype at 72 marker loci was determined for the 60 most resistant and 60 most susceptible F2 individuals. For QTL mapping and estimation, the method of "composite interval mapping" was used. For stem lesion in the population NDBLOSsel x CM625, eight QTL were detected explaining 33.7% of the genetic variance. The QTL on LG8 explained 36.7% of the phenotypic variance (R2adj). All other QTL for this trait explained between 3.3 and 6.0% of R2adj. Nine QTL were detected for leaf lesion. The proportion of the phenotypic variance explained by individual QTL ranged from 3.4 to 11.3%. All detected QTL for leaf lesion explained 25.3% of the genetic variance in cross-validation. For speed of fungal growth, six QTL were detected, which explained from 4.6 to 10.2% of R2adj. In cross-validation, they explained 24.4% of the genetic variance. Most QTL showed additive gene action. QTL occurring consistently across generations can be recommended for MAS; therefore, the QTL results of the RIL and the F3 families of population NDBLOSsel x CM625 were compared. One common QTL was identified for leaf lesion, two for stem lesion and three for speed of fungal growth. In population CM625 x TUB-5-3234, four QTL for stem lesion, three QTL for leaf lesion and three QTL for speed of fungal growth were identified. Owing to the SG approach, we conjecture that not all QTL were found. The comparison of QTL results between the two F3 populations showed two common QTL for stem lesion on LG4 and LG8. The QTL on LG4 originated from the susceptible parent CM625. 
The QTL on LG8 probably corresponds to the QTL with the largest effect determined in the population NDBLOSsel x CM625. Regarding MAS, our results indicate that two QTL detected for stem lesion and speed of fungal growth in population NDBLOSsel x CM625 are promising. They were consistent across environments and showed no adverse correlation to leaf morphology in trials with the RIL. In mapping population CM625 x TUB-5-3234, it remained unclear whether TUB-5-3234 can contribute new alleles with sufficiently large effects for resistance that were not identified in line NDBLOSsel and would be useful in MAS. The genomic region on LG10 should be analyzed in more detail with respect to its importance for resistance in multiple plant parts (head and stalk) and to verify its association with leaf morphology. Resistance breeding of sunflower against S. sclerotiorum is difficult due to the complex inheritance of the trait. This study showed that both the resistance source NDBLOSsel and the identified markers are promising for improving resistance by MAS. For a broader resistance against S. sclerotiorum, it is necessary to detect new resistance genes from different sources and to pyramid them in elite lines.
Statistical methods for analysis of multienvironment trials in plant breeding: accuracy and precision
(2021) Buntaran, Harimurti; Piepho, Hans-Peter
Multienvironment trials (MET) are carried out every year in different environmental conditions to evaluate a vast number of cultivars, e.g., for yield, because different cultivars perform differently in various environmental conditions, a phenomenon known as genotype×environment interaction. MET aim to provide accurate information on cultivar performance so that a recommendation of which cultivar performs best under a grower's field conditions can be made available. MET data are often analysed via mixed models, which allow the cultivar effect to be random. The random effect of cultivar enables genetic correlation to be exploited across zones and the trials’ heterogeneity to be considered. A zone can be viewed as a larger target population of environments. It is crucial to evaluate the accuracy and precision of the cultivar predictions. The prediction accuracy can be evaluated via a cross-validation (CV) study, and the model selection can be done based on the lowest mean squared error of prediction (MSEP). Also, since the trials’ locations hardly ever coincide with growers’ fields, the precision of predictions needs to be evaluated via standard errors of predictions of cultivar values (SEPV) and standard errors of the predictions of pairwise differences of cultivar values (SEPD). The central objective of this thesis is to assess the model performance and conduct model selection via a CV study for zone-based cultivar predictions. Chapter 2 compared the performance of empirical best linear unbiased estimation (EBLUE) and empirical best linear unbiased prediction (EBLUP) for zone-based prediction. Different CV schemes were applied to the single-year and multi-year datasets to mimic the practice. A complex covariance structure such as factor-analytic (FA) was imposed to account for the heterogeneity of the cultivar×zone (CZ) effect. The MSEP showed that the EBLUP models outperformed the EBLUE models. 
The zonation was necessary since it improved the accuracy and was preferable for making cultivar recommendations. The FA structure did not improve the accuracy compared to the simpler covariance structure, and so the EBLUP model with a simple covariance structure is sufficient for the single-year and multi-year datasets. Chapter 3 assessed the single-stage and stagewise analyses. In the stagewise analysis, three weighting methods, namely two diagonal approximation methods and the fully efficient method, were compared with the unweighted analysis. The assessment was based on the MSEP instead of Pearson’s and Spearman’s correlation coefficients since the correlation coefficients are often very close between the compared models. The MSEP showed that the single-stage EBLUP and the stagewise weighting EBLUP strategy were very similar. Thus, the loss of information due to diagonal approximation is minor. In fact, compared to the correlation coefficients, the MSEP showed a more apparent distinction between the single-stage and stagewise weighting analyses on the one hand and the unweighted EBLUE on the other. The simple compound-symmetric covariance structure was sufficient for the CZ effect compared with the more complex structures. The choice between the single-stage and stagewise weighting analysis thus depends on the computational resources and the practicality of data handling. Chapter 4 assessed the accuracy and precision of the predictions for new locations. The environmental covariates were combined with the EBLUP in the random coefficient (RC) models since the covariates provide more information for the new locations. The RC models did not have the smallest MSEP, but they had the lowest SEPV and SEPD. Thus, the model selection can be done by joint consideration of the MSEP, SEPV, and SEPD. The models with EBLUE and covariate interaction effects performed poorly regarding the MSEP. 
The EBLUP models without RC performed best, but their SEPV and SEPD were large and thus considered unreliable. The covariate scale and selection are essential to obtain a positive definite covariance matrix. Employing an unstructured covariance in the RC is crucial to maintaining the RC models’ invariance feature. The RC framework is suitable to be implemented with GIS data to provide an accurate and precise projection of cultivar performance for new locations or environments. To conclude, the EBLUP model for zone-based predictions should be preferred to obtain predictions and rankings closer to the true values and rankings. The stagewise weighting analysis can be recommended due to its practicality and its computational efficiency. Furthermore, projecting cultivar performances to new locations should be done to provide more targeted information for growers. The available environmental covariates can be utilised to improve the predictions’ accuracy and precision in the new locations in the RC model framework. Such information is certainly more valuable for growers and breeders than just providing means across a whole target population of environments.
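The abstract's point that the MSEP can separate models whose correlation coefficients are nearly identical can be illustrated with a small sketch. The numbers below are made up for illustration and are not results from the thesis; the helper names (`msep`, `pearson`, `select_by_msep`) and the model labels are hypothetical. The second set of predictions is a shrunken and shifted copy of the first, so Pearson's correlation with the observations is the same for both, while the MSEP differs clearly.

```python
import numpy as np

def msep(observed, predicted):
    """Mean squared error of prediction over the validation records."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

def pearson(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def select_by_msep(observed, predictions):
    """Return the name of the model with the lowest MSEP and all scores."""
    scores = {name: msep(observed, pred) for name, pred in predictions.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up validation values (illustration only).
observed = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
model_a = observed + np.array([0.1, -0.1, 0.0, 0.1, -0.1])
model_b = 0.5 * model_a + 1.0  # same ranking and correlation, biased scale

predictions = {"model_a": model_a, "model_b": model_b}
best, scores = select_by_msep(observed, predictions)

for name, pred in predictions.items():
    print(f"{name}: r = {pearson(observed, pred):.4f}, MSEP = {scores[name]:.4f}")
print("selected by lowest MSEP:", best)
```

Because correlation is unchanged by a positive linear rescaling of the predictions, it cannot detect the bias in the second model, whereas the MSEP penalizes it. Precision measures such as SEPV and SEPD come from the fitted mixed model and are not covered by this sketch.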