Browsing by Subject "Biostatistik"
Publication
Biometrical approaches for analysing gene bank evaluation data on barley (Hordeum spec.)
(2007) Hartung, Karin; Piepho, Hans-Peter
This thesis explored methods to statistically analyse phenotypic data of gene banks. Traits of the barley data (Hordeum spp.) of the gene bank of the IPK-Gatersleben were evaluated. The data of years 1948-2002 were available. Within this period the ordinal scale changed from a 0-5 to a 1-9 scale after 1993. At most gene banks reproduction of accessions is currently done without any experimental design. With data of a single year, accessions only rarely have replications, and there are only a few replications of a single check for winter and summer barley. The data of 2002 were analysed separately for winter and summer barley using geostatistical methods. For the traits analysed, four types of variogram model (linear, spherical, exponential and Gaussian) were fitted to the empirical variogram using non-linear regression. The spatial parameters obtained by non-linear regression for every variogram model were then implemented in a mixed model analysis, and the four model fits were compared using Akaike's Information Criterion (AIC). The approach of estimating the genetic parameter by kriging cannot be recommended. The first points of the empirical variogram should be explained well by the fitted theoretical variogram, as these represent most of the pairwise distances between plots and are most crucial for neighbour adjustments. The most common well-fitting geostatistical models were the spherical and the exponential model. A nugget effect was needed for nearly all traits. The small number of check plots in the available data made it difficult to accurately separate the genetic effect from environmental effects. The threshold model allows for joint analysis of multi-year data from different rating scales, assuming a common latent scale for the different rating systems.
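The idea of a common latent scale behind different rating scales can be illustrated with a minimal Python sketch. It assumes a standard normal latent variable and hypothetical cut-points for the 0-5 and 1-9 scales; the threshold values, the latent mean and all function names are illustrative assumptions and are not taken from the thesis.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probabilities(latent_mean, thresholds):
    """Category probabilities for a latent N(latent_mean, 1) variable.

    thresholds must be strictly increasing; k thresholds give k + 1 categories.
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cumulative = [normal_cdf(c - latent_mean) for c in cuts]
    return [cumulative[i + 1] - cumulative[i] for i in range(len(cuts) - 1)]


# Hypothetical cut-points on the shared latent scale (illustrative only).
THRESHOLDS_0_TO_5 = [-1.5, -0.5, 0.5, 1.5, 2.5]                  # scores 0..5
THRESHOLDS_1_TO_9 = [-2.0, -1.4, -0.8, -0.2, 0.4, 1.0, 1.6, 2.2]  # scores 1..9

latent_mean = 0.3  # assumed latent value of one accession
probs_old_scale = category_probabilities(latent_mean, THRESHOLDS_0_TO_5)
probs_new_scale = category_probabilities(latent_mean, THRESHOLDS_1_TO_9)
```

Both probability vectors are driven by the same latent mean, which is what allows ratings from before and after the scale change to inform one set of accession effects in such a model.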
The analysis suggests that a mixed model analysis which treats ordinal scores as metric data will yield meaningful results, but that the gain in efficiency is higher when using a threshold model. The threshold model may also be used when there is a metric scale underlying the observed ratings. The Laplace approximation as a numerical method to integrate the log-likelihood for random effects worked well, but it is recommended to increase the number of quadrature points until the change in parameter estimates becomes negligible. Three rating methods (1%, 5%, 9-point rating) were assessed by raters untrained (Group A) and experienced (Group B) in rating. Every rater had to rate several pictograms of diseased leaves. The highest accuracy was found with Group B using the 1% scale and with Group A using the 5% scale. With a percentage scale Group A tended to use values that are multiples of 5%. In terms of the time needed per leaf assessment, Group B was fastest when using the 5% rating scale. From a statistical point of view both percentage ratings performed better than the ordinal rating scale; the possible error made by the rater is calculable and usually smaller than with coarser rating methods. Directly rating percentages whenever possible therefore leads to smaller overall estimation errors, and with proper training accuracy and precision can be further improved. For gene banks, augmented designs as proposed by Federer and by Lin et al. are a natural choice, so an overview is given. The augmented designs proposed by Federer have the advantage of an unbiased error estimate, but the random allocation of checks is a problem. The augmented design by Lin et al. always places checks in the centre plot of every whole plot. However, none of the methods is based on an explicit statistical model, so there is no well-founded decision criterion to select between them. Spatial analysis can be used to find an optimal field layout for an augmented design, i.e.
a layout that yields small least significant differences (LSD). The average variance of a difference and the average squared LSD were used to compare competing designs, using a theoretical approach based on variations of two anisotropic models and different rotations of anisotropy axes towards field reference axes. Based on theoretical calculations, up to five checks per block are recommended. The nearly isotropic combinations led to designs with large square blocks. With strongly anisotropic combinations the optimal design depends on the degree of anisotropy and the rotation of the anisotropy axes: without rotation small elongated blocks are preferred; the closer the rotation is to 45°, the more square the blocks and the more checks are appropriate. The results presented in this thesis may be summarised as follows: Cultivation for regeneration of accessions should be based on a meaningful and statistically analysable experimental field design. The design needs to include checks and a random sample of accessions from the gene pool held at the gene bank. It is advisable to utilise metric or percentage rating scales. It can be expected that using a threshold model increases the quality of multivariate analysis and association mapping studies based on phenotypic gene bank data.

Publication
Evaluation of alternative statistical methods for genomic selection for quantitative traits in hybrid maize
(2012) Schulz-Streeck, Torben; Piepho, Hans-Peter
The efficacy of several contending approaches for genomic selection (GS) was tested using different simulated and empirical maize breeding datasets. Here, GS is viewed as a general approach, incorporating all the different stages from the phenotypic analysis of the raw data to the marker-based prediction of the breeding values. The overall goal of this study was to develop and comparatively evaluate different approaches for accurately predicting genomic breeding values in GS.
In particular, the specific objectives were to: (1) Develop different approaches for using information from analyses preceding the marker-based prediction of breeding values for GS. (2) Extend and/or suggest efficient implementations of statistical methods used at the marker-based prediction stage of GS, with a special focus on improving the predictive accuracy of GS in maize breeding. (3) Compare different approaches to reliably evaluate and compare methods for GS. An important step in the analyses preceding the marker-based prediction is the phenotypic analysis stage. One way of combining phenotypic analysis and marker-based prediction into a single stage analysis is presented. However, a stagewise analysis is typically computationally more efficient than a single stage analysis. Several different weighting schemes for minimizing information loss in stagewise analyses are therefore proposed and explored. It is demonstrated that orthogonalizing the adjusted means before submitting them to the next stage is the most efficient way within the set of weighting schemes considered. Furthermore, when using stagewise approaches, it may suffice to omit the marker information until the very last stage, if the marker-by-environment interaction has only a minor influence, as was found to be the case for the datasets considered in this thesis. It is also important to ensure that genotypic and phenotypic data for GS are of sufficiently high-quality. This can be achieved by using appropriate field trial designs and carrying out adequate quality controls to detect and eliminate observations deemed to be outlying based on various diagnostic tools. Moreover, it is shown that pre-selection of markers is less likely to be of high practical relevance to GS in most cases. Furthermore, the use of semivariograms to select models with the greatest strength of support in the data for GS is proposed and explored. 
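The theoretical semivariogram models are not spelled out in the abstract. As a rough sketch, three commonly used forms (spherical, exponential, Gaussian) can be written as below. The parameterisation (nugget, partial sill, range), the parameter values and the distances are assumptions chosen for illustration; other texts scale the range parameter differently.

```python
import math


def spherical(h, nugget, partial_sill, range_):
    """Spherical semivariogram; reaches nugget + partial_sill at h = range_."""
    if h >= range_:
        return nugget + partial_sill
    ratio = h / range_
    return nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)


def exponential(h, nugget, partial_sill, range_):
    """Exponential semivariogram; approaches the sill asymptotically."""
    return nugget + partial_sill * (1.0 - math.exp(-h / range_))


def gaussian(h, nugget, partial_sill, range_):
    """Gaussian semivariogram; flat near the origin, then rising smoothly."""
    return nugget + partial_sill * (1.0 - math.exp(-((h / range_) ** 2)))


# Hypothetical distances and parameters (illustrative only).
distances = [0.5, 1.0, 2.0, 4.0]
params = dict(nugget=0.1, partial_sill=1.0, range_=2.0)

curves = {
    "spherical": [spherical(h, **params) for h in distances],
    "exponential": [exponential(h, **params) for h in distances],
    "gaussian": [gaussian(h, **params) for h in distances],
}
```

In a model comparison, each candidate form would be fitted to the empirical semivariogram and the fits compared, for example by an information criterion; the sketch only evaluates the curves at fixed parameters.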
It is shown that several different theoretical semivariogram models were all well supported by an example dataset and no single model was selected as being clearly the best. Several methods and extensions of GS methods have been proposed for marker-based prediction in GS. Their predictive accuracies were similar to that of the widely used ridge regression best linear unbiased prediction method (RR-BLUP). It is thus concluded that RR-BLUP, spatial methods, machine learning methods, such as componentwise boosting, and regularized regression methods, such as elastic net and ridge regression, have comparable performance and can therefore all be routinely used for GS for quantitative traits in maize breeding. Accounting for environment-specific or population-specific marker effects had only a minor influence on predictive accuracy, contrary to the findings of several other studies. However, accuracy varied markedly among populations, with some populations showing surprisingly low levels of accuracy. Combining different populations prior to marker-based prediction improved prediction accuracy compared to doing separate population-specific analyses. Moreover, polygenic effects can be added to the RR-BLUP model to capture genetic variance not captured by the markers. However, doing so yielded only minor improvements, especially for high marker densities. To relax the assumption of homogeneous variance of markers, the RR-BLUP method was extended to accommodate heterogeneous marker variances, but this had negligible influence on the predictive accuracy of GS for a simulated dataset. The widely used information-theoretic model selection criterion, namely the Akaike information criterion (AIC), ranked models in terms of their predictive accuracies similarly to cross-validation in the majority of cases.
But further tests would be required to definitively determine whether the computationally more demanding cross-validation may be substituted with the more efficient model selection criteria, such as AIC, without much loss of accuracy. Overall, a stagewise analysis, in which the markers are omitted until the very last stage, is recommended for GS for the tested datasets. The particular method used for marker-based prediction from the set of those currently in use is of minor importance. Hence, the widely used and thoroughly tested RR-BLUP method would seem adequate for GS for most practical purposes, because it is easy to implement using widely available software packages for mixed models and it is computationally efficient.

Publication
Factors influencing the accuracy of genomic prediction in plant breeding
(2017) Schopp, Pascal; Melchinger, Albrecht E.
Genomic prediction (GP) is a novel statistical tool to estimate breeding values of selection candidates without the necessity to evaluate them phenotypically. The method calibrates a prediction model based on data of phenotyped individuals that were also genotyped with genome-wide molecular markers. Forgoing an explicit identification of causal polymorphisms in the DNA sequence allows GP to explain significantly larger amounts of the genetic variance of complex traits than previous mapping-based approaches employed for marker-assisted selection. For these reasons, GP rapidly revolutionized dairy cattle breeding, where the method was originally developed and first implemented. By comparison, plant breeding is characterized by often intensively structured populations and more restricted resources routinely available for model calibration.
This thesis addresses important issues related to these peculiarities to further promote an efficient integration of GP into plant breeding.

Publication
Mixed modelling for phenotypic data from plant breeding
(2011) Möhring, Jens; Piepho, Hans-Peter
Phenotypic selection and genetic studies require an efficient and valid analysis of phenotypic plant breeding data. Therefore, the analysis must take the mating design, the field design and the genetic structure of tested genotypes into account. In Chapter 2 unbalanced multi-environment trials (METs) in maize using a factorial design are analysed. The dataset from 30 years is subdivided into periods of up to three years. Variance component estimates for general and specific combining ability are calculated for each period. While mean grain yield increased with ongoing inter-pool selection, no changes for the mean of dry matter yield or for variance component estimate ratios were found. The continuous preponderance of general combining ability variance allows a hybrid selection based on general combining ability effects. The analysis of large datasets is often performed in stage-wise fashion by analysing each trial or location separately and estimating adjusted genotype means per trial or location. These means are then submitted to a mixed model to calculate genotype main effects across trials or locations. Chapter 3 studies the influence of stage-wise analysis on genotype main effect estimates for models which take account of the typical genetic structure of genotype effects within plant breeding data. For comparison, the genetic effects were assumed both fixed and random. The performance of several weighting methods for the stage-wise analysis is assessed by correlating the two-stage estimates with the results of one-stage analysis and by calculating the mean square error (MSE) between both types of estimate.
In the case of random genetic effects, the genetic structure is modelled in one of three ways, either by using the numerator relationship matrix, a marker-based kinship matrix or by using crossed and nested genetic effects. It was found that stage-wise analysis results in comparable genotype main effect estimates for all weighting methods and for the assumption of random or fixed genetic effects if the model for analysis is valid. If an invalid model is chosen, e.g., if the missing data pattern is informative, both analyses are invalid and the results can differ. An informative missing data pattern can result from ignoring information used either for selecting the analysed genotypes or for selecting the test environments of genotypes, if not all genotypes are tested in all environments. While correlated information from relatives is rarely directly used for analysis of plant breeding data, it is often used implicitly by the breeder for selection decisions, e.g. by looking at the performance of a genotype and the average performance of the underlying cross. Chapter 4 proposes a model with a joint variance-covariance structure for related genotypes in the analysis of diallels. This model is compared to other diallel models based on assumptions regarding the inheritance of several independent genes, i.e. on genetic models with more restrictive assumptions on the relationship between relatives. The proposed diallel model using a joint variance-covariance structure for parents and parental effects in crosses is shown to be a general model subsuming other more specialized diallel models, as these latter models can be obtained from the general model by adding restrictions on the variance-covariance structure. If no a priori information about the genetic model is available, the proposed general model can outperform the more restrictive models. Using restrictive models can result in biased variance component estimates if the restrictions are not fulfilled by the data analysed.
Chapter 5 evaluates whether a subdivision of 21 triticale genotypes into heterotic pools is preferable. Subdividing genotypes into heterotic pools implies a factorial mating design between heterotic pools and a diallel mating design within each heterotic pool. For two (or more) heterotic pools the model is extended by assuming a joint variance-covariance structure for parental effects and general combining ability effects within the diallel and within the factorials. It is shown that a model with two heterotic pools has the best model fit. The variance component estimates for the general combining ability decrease within the heterotic pools and increase between heterotic pools. The results in Chapters 2 to 5 show that an efficient and valid analysis of phenotypic plant breeding data is an essential part of the plant breeding process. The analysis can be performed in one or two stages. The mixed models used, which account for the field and mating design and the genetic structure, can be used for answering questions about the genetic variance in cultivar populations under selection and about the number of heterotic pools. The proposed general diallel model using a joint variance-covariance structure between related effects can further be modified for factorials and other mating designs with related genotypes.

Publication
Statistical methods for analysis of multienvironment trials in plant breeding: accuracy and precision
(2021) Buntaran, Harimurti; Piepho, Hans-Peter
Multienvironment trials (MET) are carried out every year in different environmental conditions to evaluate a vast number of cultivars, e.g., for yield, because different cultivars perform differently in various environmental conditions, a phenomenon known as genotype×environment interaction. MET aim to provide accurate information on cultivar performance so that a recommendation of which cultivar performs best under a grower’s field conditions can be made.
MET data is often analysed via mixed models, which allow the cultivar effect to be random. The random effect of cultivar enables genetic correlation to be exploited across zones and the trials’ heterogeneity to be taken into account. A zone can be viewed as a larger target population of environments. It is crucial to evaluate the accuracy and precision of the cultivar predictions. The prediction accuracy can be evaluated via a cross-validation (CV) study, and the model selection can be based on the lowest mean squared error of prediction (MSEP). Also, since the trials’ locations hardly ever coincide with growers’ fields, the precision of predictions needs to be evaluated via standard errors of predictions of cultivar values (SEPV) and standard errors of the predictions of pairwise differences of cultivar values (SEPD). The central objective of this thesis is to assess the model performance and conduct model selection via a CV study for zone-based cultivar predictions. Chapter 2 compared the performance of empirical best linear unbiased estimation (EBLUE) and empirical best linear unbiased prediction (EBLUP) for zone-based prediction. Different CV schemes were used for the single-year and multi-year datasets to mimic practice. A complex covariance structure such as factor-analytic (FA) was imposed to account for the heterogeneity of the cultivar×zone (CZ) effect. The MSEP showed that the EBLUP models outperformed the EBLUE models. The zonation was necessary since it improved the accuracy and was preferable for making cultivar recommendations. The FA structure did not improve the accuracy compared to the simpler covariance structure, and so the EBLUP model with a simple covariance structure is sufficient for the single and multi-year datasets. Chapter 3 assessed the single-stage and stagewise analyses. Three weighting methods, namely two diagonal approximation methods and the fully efficient method, were compared with the unweighted analysis in the stagewise analysis.
The assessment was based on the MSEP instead of Pearson’s and Spearman’s correlation coefficients since the correlation coefficients are often very close between the compared models. The MSEP showed that the single-stage EBLUP and the stagewise weighting EBLUP strategy were very similar. Thus, the loss of information due to diagonal approximation is minor. In fact, compared to the correlation coefficients, the MSEP showed a more apparent distinction between the single-stage and the stagewise weighting analyses on the one hand and the unweighted EBLUE on the other. The simple compound-symmetric covariance structure was sufficient for the CZ effect; the more complex structures were not needed. The choice between the single-stage and stagewise weighting analysis thus depends on the computational resources and the practicality of data handling. Chapter 4 assessed the accuracy and precision of the predictions for new locations. The environmental covariates were combined with the EBLUP in random coefficient (RC) models since the covariates provide more information for the new locations. The RC models did not have the smallest MSEP, but they had the lowest SEPV and SEPD. Thus, the model selection can be done by joint consideration of the MSEP, SEPV, and SEPD. The models with EBLUE and covariate interaction effects performed poorly regarding the MSEP. The EBLUP models without RC performed best in terms of MSEP, but their SEPV and SEPD were large and are therefore considered unreliable. The covariate scale and selection are essential to obtain a positive definite covariance matrix. Employing an unstructured covariance in the RC models is crucial to maintaining their invariance feature. The RC framework is suitable to be implemented with GIS data to provide an accurate and precise projection of cultivar performance for new locations or environments.
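Why MSEP can separate models that correlation coefficients cannot is easy to see in a toy example. The sketch below uses invented observed and predicted values (not data from the thesis) and hypothetical function names: two sets of predictions with the same ranking and the same Pearson correlation, one of which carries a constant bias.

```python
import math


def msep(observed, predicted):
    """Mean squared error of prediction."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented validation data for four cultivars (illustrative only).
observed = [5.0, 6.0, 7.0, 8.0]
pred_a = [5.1, 5.9, 7.2, 7.8]
pred_b = [p + 1.0 for p in pred_a]  # same ranking, constant bias of 1.0

r_a, r_b = pearson(observed, pred_a), pearson(observed, pred_b)
msep_a, msep_b = msep(observed, pred_a), msep(observed, pred_b)
```

Here the two correlations coincide while the MSEP values differ by a factor of about 40, which mirrors the motivation given above for using MSEP as the selection criterion.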
To conclude, the EBLUP model for zone-based predictions should be preferred to obtain predictions and rankings closer to the true values and rankings. The stagewise weighting analysis can be recommended due to its practicality and its computational efficiency. Furthermore, projecting cultivar performances to new locations should be done to provide more targeted information for growers. The available environmental covariates can be utilised to improve the predictions’ accuracy and precision in the new locations in the RC model framework. Such information is certainly more valuable for growers and breeders than just providing means across a whole target population of environments.

Publication
Weighting methods for variance heterogeneity in phenotypic and genomic data analysis for crop breeding
(2019) Damesa, Tigist Mideksa; Piepho, Hans-Peter
In plant breeding programmes multi-environment trials (MET) form the backbone for phenotypic selection, genomic selection (GS) and genome-wide association studies (GWAS). Efficient analysis of MET is fundamental to get accurate results from phenotypic selection, GS and GWAS. On the other hand, inefficient analysis of MET data may have consequences such as biased ranking of genotype means in phenotypic data analysis, low accuracy of GS and wrong identification of quantitative trait loci (QTL) in GWAS. A combined analysis of MET is performed using either single-stage or stage-wise (two-stage) approaches based on the linear mixed model framework. While single-stage analysis is a fully efficient approach, MET data can also be suitably analyzed using stage-wise methods. MET data often show within-trial and between-trial variance heterogeneities, which contradict the homogeneity of variance assumption of linear models, and these heterogeneities require corrections. In addition, it is well documented that spatial correlations are inherent to most field trials.
Appropriate remedial techniques for variance heterogeneities and proper accounting for spatial correlation are useful to improve the accuracy and efficiency of MET analysis. Chapter 2 studies methods for simultaneous handling of within-trial variance heterogeneity and within-trial spatial correlation. This study is based on three maize trials from Ethiopia. To stabilize the variance, the Box-Cox transformation was considered. The results show that, while the Box-Cox transformation was suitable for stabilizing the variance, it is difficult to report results on the original scale. As an alternative, variance models, i.e. power-of-the-mean (POM) and exponential models, were used to address the variance heterogeneity problem. Unlike the Box-Cox method, the variance models considered in this study succeeded in dealing simultaneously with both spatial correlation and heterogeneity of variance. For analysis of MET data, two-stage analysis is often favored in practice over single-stage analysis because of its shorter computation time and its ability to easily account for any specifics of each trial (variance heterogeneity, spatial correlation, etc.). Stage-wise analyses are approximate in that they cannot fully reproduce a single-stage analysis, because the variance–covariance matrix of adjusted means from the first-stage analysis is sometimes ignored and sometimes approximated, and the approximation may not be efficient. The discrepancy between the results of single-stage and two-stage analysis increases when the variance between trials is heterogeneous. In stage-wise analysis one of the major challenges is how to account for heterogeneous variance between trials at the second stage. To account for heterogeneous variance between trials, a weighted mixed model approach is used for the second-stage analysis. The weights are derived from the variances and covariances of adjusted means from the first-stage analysis.
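The role of the second-stage weights can be sketched in a strongly simplified form. The snippet below assumes diagonal weighting only (each adjusted mean is weighted by the inverse of its estimated variance, covariances are ignored) and a single genotype with a fixed mean across trials. The thesis works within a linear mixed model framework, so this is an illustration of the weighting principle, not the method itself; all numbers and names are invented.

```python
def weighted_mean(adjusted_means, variances):
    """Inverse-variance weighted mean and its variance (diagonal weights)."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * m for w, m in zip(weights, adjusted_means)) / total_weight
    return estimate, 1.0 / total_weight


# Invented first-stage output for one genotype in two trials.
adjusted_means = [8.0, 10.0]   # adjusted means per trial
variances = [1.0, 4.0]         # estimated variances of those means

unweighted = sum(adjusted_means) / len(adjusted_means)
weighted, weighted_variance = weighted_mean(adjusted_means, variances)
```

The precise trial dominates the weighted estimate (8.4 versus an unweighted 9.0), which is the effect the second-stage weights are meant to achieve when trial variances are heterogeneous.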
In Chapter 3 we compared single-stage analysis and two-stage analysis. A new fully efficient weighting matrix and a diagonal weighting matrix are used for weighting in the second stage. The methods are explored using two different types of maize datasets. The results indicate that single-stage analysis and two-stage analysis give nearly identical results provided that the full information on all effect estimates and their associated estimated variances and covariances is carried forward from the first to the second stage. GWAS and GS analysis can be conducted using a single-stage or a stage-wise approach. The computational demand for GWAS and GS increases compared to purely phenotypic analysis because of the addition of marker data. Usually researchers compute genotype means from phenotypic MET data in stage-wise analysis (with or without weighting) and then forward these means to GWAS or GS analysis, often without any weighting. In Chapter 4 weighted stage-wise analysis is compared with unweighted stage-wise analysis for GWAS and GS using phenotypic and genotypic maize data. Fully efficient and diagonal weighting are used. Results show that weighted analysis is preferred over unweighted analysis for both GS and GWAS. In conclusion, stage-wise analysis is a suitable approach for practical analysis of MET, GS and GWAS. Single-stage and two-stage analysis of MET yield very similar results. Stage-wise analysis can be nearly as efficient as single-stage analysis when using optimal weighting, i.e., fully efficient weighting. Spatial variation and within-trial variance heterogeneity are common in MET data. This study illustrated that both can be resolved simultaneously using a weighting approach for the variance heterogeneity and spatial modeling for the spatial variation. Finally, besides applying weighting in the analysis of phenotypic MET data, it is recommended to use weighting in the actual GS and GWAS analysis stage.