Browsing by Subject "Mixed models"

Biometrical approaches for analysing gene bank evaluation data on barley (Hordeum spec.)
(2007) Hartung, Karin; Piepho, Hans-Peter
This thesis explored methods to statistically analyse phenotypic data of gene banks. Traits of the barley data (Hordeum spp.) of the gene bank of the IPK-Gatersleben were evaluated. The data of years 1948-2002 were available. Within this period the ordinal scale changed from a 0-5 to a 1-9 scale after 1993. At most gene banks reproduction of accessions is currently done without any experimental design. With data of a single year only rarely do accessions have replications and there are only few replications of a single check for winter and summer barley. The data of 2002 were analysed separately for winter and summer barley using geostatistical methods. For the traits analysed four types of variogram model (linear, spherical, exponential and Gaussian) were fitted to the empirical variogram using non-linear regression. The spatial parameters obtained by non-linear regression for every variogram model then were implemented in a mixed model analysis and the four model fits compared using Akaike's Information Criterion (AIC). The approach to estimate the genetical parameter by Kriging can not be recommended. The first points of the empirical variogram should be explained well by the fitted theoretical variogram, as these represent most of the pairwise distances between plots and are most crucial for neighbour adjustments. The most common well-fitting geostatistical models were the spherical and the exponential model. A nugget effect was needed for nearly all traits. The small number of check plots for the available data made it difficult to accurately dissect the genetical effect from environmental effects. The threshold model allows for joint analysis of multi-year data from different rating scales, assuming a common latent scale for the different rating systems. The analysis suggests that a mixed model analysis which treats ordinal scores as metric data will yield meaningful results, but that the gain in efficiency is higher when using a threshold model. 
The threshold model may also be used when there is a metric scale underlying the observed ratings. The Laplace approximation as a numerical method to integrate the log-likelihood for random effects worked well, but it is recommended to increase the number of quadrature points until the change in parameter estimates becomes negligible. Three rating methods (1%, 5%, 9-point rating) were assessed by persons untrained (A) and experienced (B) in rating. Every person had to rate several pictograms of diseased leaves. The highest accuracy was found with Group B using the 1% scale and with Group A using the 5% scale. With a percentage scale, Group A tended to use values that are multiples of 5%. In terms of the time needed per leaf assessment, Group B was fastest when using the 5% rating scale. From a statistical point of view, both percentage ratings performed better than the ordinal rating scale, and the possible error made by the rater is calculable and usually smaller than with ratings by coarser methods. Directly rating percentages whenever possible therefore leads to smaller overall estimation errors, and with proper training accuracy and precision can be further improved. For gene banks, augmented designs as proposed by Federer and by Lin et al. are a natural choice, so an overview is given. The augmented designs proposed by Federer have the advantage of an unbiased error estimate, but the random allocation of checks is a problem. The augmented design by Lin et al. always places checks in the centre plot of every whole plot. None of the methods is based on an explicit statistical model, however, so there is no well-founded decision criterion to select between them. Spatial analysis can be used to find an optimal field layout for an augmented design, i.e. a layout that yields small least significant differences (LSD).
The average variance of a difference and the average squared LSD were used to compare competing designs, using a theoretical approach based on variations of two anisotropic models and different rotations of anisotropy axes towards field reference axes. Based on theoretical calculations, up to five checks per block are recommended. The nearly isotropic combinations led to designs with large quadratic blocks. With strongly anisotropic combinations the optimal design depends on degree of anisotropy and rotation of anisotropy axes: without rotation small elongated blocks are preferred; the closer the rotation is to 45° the more squarish blocks and the more checks are appropriate. The results presented in this thesis may be summarised as follows: Cultivation for regeneration of accessions should be based on a meaningful and statistically analysable experimental field design. The design needs to include checks and a random sample of accessions from the gene pool held at the gene bank. It is advisable to utilise metric or percentage rating scales. It can be expected that using a threshold model increases the quality of multivariate analysis and association mapping studies based on phenotypic gene bank data.
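The four variogram models named in the abstract above (linear, spherical, exponential and Gaussian, each with a nugget effect) can be sketched as follows. This is an illustration only, not the thesis code: the function names, the parameterisation of the range (some texts use `exp(-3h/a)` instead of `exp(-h/a)`) and the residual sum of squares helper are assumptions made for this sketch.

```python
import math

# Each model returns the semivariance at lag distance h.
# By convention gamma(0) = 0; the nugget is the jump just above h = 0.

def linear(h, nugget, slope):
    """Linear variogram: unbounded, no sill."""
    return 0.0 if h == 0 else nugget + slope * h

def spherical(h, nugget, partial_sill, rng):
    """Spherical variogram: reaches the sill exactly at h = rng."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + partial_sill
    r = h / rng
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, partial_sill, rng):
    """Exponential variogram: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / rng))

def gaussian(h, nugget, partial_sill, rng):
    """Gaussian variogram: parabolic behaviour near the origin."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-((h / rng) ** 2)))

def residual_ss(model, params, lags, gamma_hat):
    """Residual sum of squares between a model and an empirical variogram.

    A non-linear least squares routine would minimise this over params.
    """
    return sum((g - model(h, *params)) ** 2 for h, g in zip(lags, gamma_hat))
```

A non-linear regression routine would minimise `residual_ss` over the parameters of each model; the abstract then describes passing the fitted spatial parameters to a mixed model and comparing the four fits by AIC, a step this sketch does not cover.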
Estimating heritability in plant breeding programs
(2019) Schmidt, Paul; Piepho, Hans-Peter
Heritability is an important notion in, e.g., human genetics, animal breeding and plant breeding, since the focus of these fields lies on the relationship between phenotypes and genotypes. A phenotype is the composite of an organism’s observable traits, which is determined by its underlying genotype, by environmental factors and by genotype-environment interactions. For a set of genotypes, the notion of heritability expresses the proportion of the phenotypic variance that is attributable to the genotypic variance. Furthermore, as it is an intraclass correlation, heritability can also be interpreted as, e.g., the squared correlation between phenotypic and genotypic values. It is important to note that heritability was originally proposed in the context of animal breeding where it is the individual animal that represents the basic unit of observation. This stands in contrast to plant breeding, where multiple observations for the same genotype are obtained in replicated trials. Furthermore, trials are usually conducted as multi-environment trials (MET), where an environment denotes a year × location combination and represents a random sample from a target population of environments. Hence, the observations for each genotype first need to be aggregated in order to obtain a single phenotypic value, which is usually done by obtaining some sort of mean value across trials and replicates. As a consequence, heritability in the context of plant breeding is referred to as heritability on an entry-mean basis and its standard estimation method is a linear combination of variances and trial dimensions. Ultimately, I find that there are two main uses for heritability in plant breeding: The first is to predict the response to selection and the second is as a descriptive measure for the usefulness and precision of cultivar trials. 
Heritability on an entry-mean basis is suited for both purposes as long as three main assumptions hold: (i) the trial design is completely balanced/orthogonal, (ii) genotypic effects are independent and (iii) variances and covariances are constant. In the last decades, however, many advancements in the methodology of experimental design for and statistical analysis of plant breeding trials took place. As a consequence it is seldom the case that all three of above mentioned assumptions are met. Instead, the application of linear mixed models enables the breeder to straightforwardly analyze unbalanced data with complex variance structures. Chapter 2 exemplarily demonstrates some of the flexibility and benefit of the mixed model framework for typically unbalanced MET by using a bivariate mixed model analyses to jointly analyze two MET for cultivar evaluation, which differ in multiple crucial aspects such as plot size, trial design and general purpose. Such an approach can lead to higher accuracy and precision of the analysis and thus more efficient and successful breeding programs. It is not clear, however, how to define and estimate a generalized heritability on an entry-mean basis for such settings. Therefore, multiple alternative methods for the estimation of heritability on an entry-mean basis have been proposed. In Chapter 3, six alternative methods are applied to four typically unbalanced MET for cultivar evaluation and compared to the standard method. The outcome suggests that the standard method over-estimates heritability, while all of the alternative methods show similar, lower estimates and thus seem able to handle this kind of unbalanced data. Finally, it is argued in Chapter 4 that heritability in plant breeding is not actually based on or aiming at entry-means, but on the differences between them. 
Moreover, an estimation method for this new proposal of heritability on an entry-difference basis (H_Delta^2/h_Delta^2) is derived and discussed, as well as exemplified and compared to other methods via analyzing four different datasets for cultivar evaluation which differ in their complexity. I argue that regarding the use of heritability as a descriptive measure, H_Delta^2/h_Delta^2, can on the one hand give a more detailed and meaningful insight than all other heritability methods and on the other hand reduces to other methods under certain circumstances. When it comes to the use of heritability as a means to predict the response to selection, the outcome of this work discourages this as a whole. Instead, response to selection should be simulated directly and thus without using any ad hoc heritability measure.
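For orientation, the standard entry-mean heritability that the abstract above calls "a linear combination of variances and trial dimensions" is commonly written, for a balanced multi-environment trial, as H² = σ²_g / (σ²_g + σ²_ge / e + σ²_ε / (e·r)). The formula is not spelled out in the abstract itself, and the function and argument names below are hypothetical; this is a minimal sketch under the balanced-design assumption, not the estimator proposed in the thesis.

```python
def entry_mean_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Standard entry-mean heritability for a balanced MET.

    var_g  : genotypic variance
    var_ge : genotype-by-environment interaction variance
    var_e  : residual (plot error) variance
    n_env  : number of environments (e)
    n_rep  : number of replicates per environment (r)
    """
    if n_env < 1 or n_rep < 1:
        raise ValueError("n_env and n_rep must be at least 1")
    # Variance of a genotype mean across environments and replicates.
    phenotypic = var_g + var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / phenotypic


# Illustrative variance components (made up for this example).
h2 = entry_mean_heritability(var_g=2.0, var_ge=1.0, var_e=4.0, n_env=4, n_rep=2)
print(round(h2, 4))
```

Because the formula relies on a single number of environments and replicates, it has no obvious counterpart for unbalanced data, which is the gap the alternative methods discussed in the abstract address.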
Evaluation of alternative statistical methods for genomic selection for quantitative traits in hybrid maize
(2012) Schulz-Streeck, Torben; Piepho, Hans-Peter
The efficacy of several contending approaches for genomic selection (GS) was tested using different simulated and empirical maize breeding datasets. Here, GS is viewed as a general approach, incorporating all the different stages from the phenotypic analysis of the raw data to the marker-based prediction of the breeding values. The overall goal of this study was to develop and comparatively evaluate different approaches for accurately predicting genomic breeding values in GS. In particular, the specific objectives were to: (1) Develop different approaches for using information from analyses preceding the marker-based prediction of breeding values for GS. (2) Extend and/or suggest efficient implementations of statistical methods used at the marker-based prediction stage of GS, with a special focus on improving the predictive accuracy of GS in maize breeding. (3) Compare different approaches to reliably evaluate and compare methods for GS. An important step in the analyses preceding the marker-based prediction is the phenotypic analysis stage. One way of combining phenotypic analysis and marker-based prediction into a single-stage analysis is presented. However, a stagewise analysis is typically computationally more efficient than a single-stage analysis. Several different weighting schemes for minimizing information loss in stagewise analyses are therefore proposed and explored. It is demonstrated that orthogonalizing the adjusted means before submitting them to the next stage is the most efficient way within the set of weighting schemes considered. Furthermore, when using stagewise approaches, it may suffice to omit the marker information until the very last stage, if the marker-by-environment interaction has only a minor influence, as was found to be the case for the datasets considered in this thesis. It is also important to ensure that genotypic and phenotypic data for GS are of sufficiently high quality.
This can be achieved by using appropriate field trial designs and carrying out adequate quality controls to detect and eliminate observations deemed to be outlying based on various diagnostic tools. Moreover, it is shown that pre-selection of markers is less likely to be of high practical relevance to GS in most cases. Furthermore, the use of semivariograms to select models with the greatest strength of support in the data for GS is proposed and explored. It is shown that several different theoretical semivariogram models were all well supported by an example dataset and no single model was selected as being clearly the best. Several methods and extensions of GS methods have been proposed for marker-based prediction in GS. Their predictive accuracies were similar to that of the widely used ridge regression best linear unbiased prediction method (RR-BLUP). It is thus concluded that RR-BLUP, spatial methods, machine learning methods, such as componentwise boosting, and regularized regression methods, such as elastic net and ridge regression, have comparable performance and can therefore all be routinely used for GS for quantitative traits in maize breeding. Accounting for environment-specific or population-specific marker effects had only minor influence on predictive accuracy contrary to findings of several other studies. However, accuracy varied markedly among populations, with some populations showing surprisingly very low levels of accuracy. Combining different populations prior to marker-based prediction improved prediction accuracy compared to doing separate population-specific analyses. Moreover, polygenetic effects can be added to the RR-BLUP model to capture genetic variance not captured by the markers. However, doing so yielded minor improvements, especially for high marker densities. 
To relax the assumption of homogeneous variance of markers, the RR-BLUP method was extended to accommodate heterogeneous marker variances, but this had negligible influence on the predictive accuracy of GS for a simulated dataset. The widely used information-theoretic model selection criterion, namely the Akaike information criterion (AIC), ranked models in terms of their predictive accuracies similarly to cross-validation in the majority of cases. But further tests would be required to definitively determine whether the computationally more demanding cross-validation may be substituted with the more efficient model selection criteria, such as AIC, without much loss of accuracy. Overall, a stagewise analysis, in which the markers are omitted until the very last stage, is recommended for GS for the tested datasets. The particular method used for marker-based prediction from the set of those currently in use is of minor importance. Hence, the widely used and thoroughly tested RR-BLUP method would seem adequate for GS for most practical purposes, because it is easy to implement using widely available software packages for mixed models and it is computationally efficient.
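To make the RR-BLUP method recommended in the abstract above concrete, here is a minimal sketch that solves Henderson's mixed model equations for a fixed intercept and random marker effects with a common variance. It assumes the variance ratio `lam` (residual variance divided by marker variance) is already known, whereas in practice it would be estimated, for example by REML in a mixed model package. All names are hypothetical and the code is not taken from the thesis.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination.

    Uses partial pivoting; assumes the system is non-singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rr_blup(Z, y, lam):
    """Return (intercept, marker_effects) for y = mu + Z u + e.

    Z   : list of rows of marker codes (one row per genotype)
    y   : list of phenotypic values (e.g. adjusted means)
    lam : assumed known ratio of residual to marker variance
    """
    n, p = len(Z), len(Z[0])
    col_sums = [sum(row[j] for row in Z) for j in range(p)]
    # Left-hand side: [[1'1, 1'Z], [Z'1, Z'Z + lam * I]]
    lhs = [[float(n)] + col_sums]
    for j in range(p):
        row = [col_sums[j]]
        for k in range(p):
            ztz = sum(r[j] * r[k] for r in Z)
            row.append(ztz + (lam if j == k else 0.0))
        lhs.append(row)
    # Right-hand side: [1'y, Z'y]
    rhs = [sum(y)] + [sum(r[j] * yi for r, yi in zip(Z, y)) for j in range(p)]
    sol = solve(lhs, rhs)
    return sol[0], sol[1:]


def predict(z_new, mu, u):
    """Genomic prediction for one genotype with marker codes z_new."""
    return mu + sum(zj * uj for zj, uj in zip(z_new, u))
```

Larger values of `lam` shrink all marker effects more strongly towards zero, which is the ridge-regression behaviour that gives the method its name.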
Statistical methods for analysis of multienvironment trials in plant breeding: accuracy and precision
(2021) Buntaran, Harimurti; Piepho, Hans-Peter
Multienvironment trials (MET) are carried out every year in different environmental conditions to evaluate a vast number of cultivars, e.g., for yield, because different cultivars perform differently in various environmental conditions, a phenomenon known as genotype×environment interaction. MET aim to provide accurate information on cultivar performance so that a recommendation of which cultivar performs best under a grower’s field conditions can be made. MET data are often analysed via mixed models, which allow the cultivar effect to be random. The random effect of cultivar enables genetic correlation to be exploited across zones and the trials’ heterogeneity to be taken into account. A zone can be viewed as a larger target population of environments. It is crucial to evaluate the accuracy and precision of the cultivar predictions. The prediction accuracy can be evaluated via a cross-validation (CV) study, and the model selection can be based on the lowest mean squared error of prediction (MSEP). Also, since the trials’ locations hardly ever coincide with growers’ fields, the precision of predictions needs to be evaluated via standard errors of predictions of cultivar values (SEPV) and standard errors of the predictions of pairwise differences of cultivar values (SEPD). The central objective of this thesis is to assess the model performance and conduct model selection via a CV study for zone-based cultivar predictions. Chapter 2 compared the performance of empirical best linear unbiased estimations (EBLUE) and empirical best linear unbiased predictions (EBLUP) for zone-based prediction. Different CV schemes were applied to the single-year and multi-year datasets to mimic practice. A complex covariance structure such as factor-analytic (FA) was imposed to account for the heterogeneity of the cultivar×zone (CZ) effect. The MSEP showed that the EBLUP models outperformed the EBLUE models.
Zonation was necessary since it improved the accuracy and was preferable for making cultivar recommendations. The FA structure did not improve the accuracy compared to the simpler covariance structure, and so the EBLUP model with a simple covariance structure is sufficient for the single-year and multi-year datasets. Chapter 3 assessed the single-stage and stagewise analyses. Three weighting methods were compared in the stagewise analysis, namely two diagonal approximation methods and the fully efficient method, alongside the unweighted analysis. The assessment was based on the MSEP instead of Pearson’s and Spearman’s correlation coefficients since the correlation coefficients are often very close between the compared models. The MSEP showed that the single-stage EBLUP and the stagewise weighting EBLUP strategy were very similar. Thus, the loss of information due to diagonal approximation is minor. In fact, the MSEP showed a more apparent distinction than the correlation coefficients between the single-stage and stagewise weighting analyses on the one hand and the unweighted EBLUE on the other. For the CZ effect, the simple compound-symmetric covariance structure was sufficient, and the more complex structures were not needed. The choice between the single-stage and stagewise weighting analysis thus depends on the computational resources and the practicality of data handling. Chapter 4 assessed the accuracy and precision of the predictions for new locations. The environmental covariates were combined with the EBLUP in random coefficient (RC) models since the covariates provide more information for the new locations. The RC models did not have the smallest MSEP, but they had the lowest SEPV and SEPD. Thus, the model selection can be done by joint consideration of the MSEP, SEPV, and SEPD. The models with EBLUE and covariate interaction effects performed poorly regarding the MSEP.
The EBLUP models without RC performed best, but their SEPV and SEPD were large and were therefore considered unreliable. The covariate scale and selection are essential to obtain a positive definite covariance matrix. Employing an unstructured covariance in the RC models is crucial to maintaining the RC models’ invariance feature. The RC framework is suitable to be implemented with GIS data to provide an accurate and precise projection of cultivar performance for new locations or environments. To conclude, the EBLUP model for zone-based predictions should be preferred to obtain predictions and rankings closer to the true values and rankings. The stagewise weighting analysis can be recommended due to its practicality and its computational efficiency. Furthermore, projecting cultivar performance to new locations should be done to provide more targeted information for growers. The available environmental covariates can be utilised to improve the predictions’ accuracy and precision in the new locations in the RC model framework. Such information is certainly more valuable for growers and breeders than just providing means across a whole target population of environments.
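The model selection rule described in the abstract above (choose the model with the lowest MSEP in a cross-validation study) can be illustrated as follows. This is a sketch under simple assumptions: predictions for the held-out observations are taken as given, the function names are hypothetical, and the SEPV and SEPD that the thesis considers jointly with the MSEP are not computed here.

```python
def msep(observed, predicted):
    """Mean squared error of prediction over held-out observations."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def select_by_msep(observed, predictions_by_model):
    """Return the name of the model with the lowest MSEP and all scores.

    predictions_by_model maps a model label to its list of predictions
    for the same held-out observations.
    """
    scores = {name: msep(observed, pred) for name, pred in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Illustrative held-out values and predictions (made up for this example).
held_out = [7.1, 6.4, 8.0, 7.5]
candidates = {
    "model_a": [7.6, 5.8, 8.7, 7.0],
    "model_b": [7.2, 6.6, 7.8, 7.4],
}
best, scores = select_by_msep(held_out, candidates)
print(best, scores)
```

Unlike a correlation coefficient, the MSEP penalises bias and scale errors in the predictions, which is consistent with the abstract's observation that it separated the compared models more clearly.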
Studying stem rust and leaf rust resistances of self-fertile rye breeding populations
(2022) Gruner, Paul; Witzke, Anne; Flath, Kerstin; Eifler, Jakob; Schmiedchen, Brigitta; Schmidt, Malthe; Gordillo, Andres; Siekmann, Dörthe; Fromme, Franz Joachim; Koch, Silvia; Piepho, Hans-Peter; Miedaner, Thomas
Stem rust (SR) and leaf rust (LR) are currently the two most important rust diseases of cultivated rye in Central Europe, and resistant cultivars promise to prevent yield losses caused by those pathogens. To secure long-lasting resistance, ideally pyramided monogenic resistances and race-nonspecific resistances are applied. To find respective genes, we screened six breeding populations and one testcross population for resistance to artificially inoculated SR and naturally occurring LR in multi-environmental field trials. Five populations were genotyped with a 10K SNP marker chip and one with DArTseq™. In total, ten SR-QTLs were found that caused a reduction of 5–17 percentage points in stem coverage with urediniospores. Four of these QTLs were mapped to positions of already known SR-QTLs. An additional gene at the distal end of chromosome 2R, Pgs3.1, that caused a reduction of 40 percentage points in SR infection, was validated. One SR-QTL on chromosome 3R, QTL-SR4, was found in three populations linked with the same marker. Further QTLs at similar positions, but from different populations, were also found on chromosomes 1R, 4R, and 6R. For SR, seedling tests were additionally used to distinguish between adult-plant and all-stage resistances, and a statistical method accounting for the ordinal-scaled seedling test data was used to map seedling resistances. However, only Pgs3.1 could be detected based on seedling test data, even though genetic variance was observed in another population, too. For LR, in three of the populations, two new large-effect loci (Pr7 and Pr8) on chromosomes 1R and 2R were mapped that caused 34 and 21 percentage points reduction in leaf area covered with urediniospores, along with one new QTL on chromosome 1R causing a reduction of 9 percentage points.