Browsing by Person "Piepho, Hans-Peter"
Biometrical approaches for analysing gene bank evaluation data on barley (Hordeum spec.)
(2007) Hartung, Karin; Piepho, Hans-Peter
This thesis explored methods to statistically analyse phenotypic data of gene banks. Traits of the barley data (Hordeum spp.) of the gene bank of the IPK-Gatersleben were evaluated. The data of years 1948-2002 were available. Within this period the ordinal scale changed from a 0-5 to a 1-9 scale after 1993. At most gene banks reproduction of accessions is currently done without any experimental design. With data of a single year only rarely do accessions have replications and there are only few replications of a single check for winter and summer barley. The data of 2002 were analysed separately for winter and summer barley using geostatistical methods. For the traits analysed four types of variogram model (linear, spherical, exponential and Gaussian) were fitted to the empirical variogram using non-linear regression. The spatial parameters obtained by non-linear regression for every variogram model then were implemented in a mixed model analysis and the four model fits compared using Akaike's Information Criterion (AIC). The approach to estimate the genetical parameter by Kriging can not be recommended. The first points of the empirical variogram should be explained well by the fitted theoretical variogram, as these represent most of the pairwise distances between plots and are most crucial for neighbour adjustments. The most common well-fitting geostatistical models were the spherical and the exponential model. A nugget effect was needed for nearly all traits. The small number of check plots for the available data made it difficult to accurately dissect the genetical effect from environmental effects. The threshold model allows for joint analysis of multi-year data from different rating scales, assuming a common latent scale for the different rating systems. 
The analysis suggests that a mixed model analysis which treats ordinal scores as metric data will yield meaningful results, but that the gain in efficiency is higher when using a threshold model. The threshold model may also be used when there is a metric scale underlying the observed ratings. The Laplace approximation as a numerical method to integrate the log-likelihood for random effects worked well, but it is recommended to increase the number of quadrature points until the change in parameter estimates becomes negligible. Three rating methods (1%, 5%, 9-point rating) were assessed by persons untrained (A) and experienced (B) in rating. Every person had to rate several pictograms of diseased leaves. The highest accuracy was found with Group B using the 1%-scale and with Group A using the 5%-scale. With a percentage scale Group A tended to use values that are multiples of 5%. For the time needed per leaf assessment the Group B was fastest when using the 5% rating scale. From a statistical point of view both percent ratings performed better than the ordinal rating scale and the possible error made by the rater is calculable and usually smaller than with ratings by rougher methods. So directly rating percentages whenever possible leads to smaller overall estimation errors, and with proper training accuracy and precision can be further improved. For gene banks augmented designs as proposed by Federer and by Lin et al. offer themselves, so an overview is given. The augmented designs proposed by Federer have the advantage of an unbiased error estimate. But the random allocation of checks is a problem. The augmented design by Lin et al. always places checks in the centre plot of every whole plot. But none of the methods is based on an explicit statistical model, so there is no well-founded decision criterion to select between them. Spatial analysis can be used to find an optimal field layout for an augmented design, i.e. 
a layout that yields small least significant differences. The average variance of a difference and the average squared LSD were used to compare competing designs, using a theoretical approach based on variations of two anisotropic models and different rotations of anisotropy axes towards field reference axes. Based on theoretical calculations, up to five checks per block are recommended. The nearly isotropic combinations led to designs with large square blocks. With strongly anisotropic combinations the optimal design depends on the degree of anisotropy and the rotation of the anisotropy axes: without rotation small elongated blocks are preferred; the closer the rotation is to 45° the more squarish blocks and the more checks are appropriate. The results presented in this thesis may be summarised as follows: Cultivation for regeneration of accessions should be based on a meaningful and statistically analysable experimental field design. The design needs to include checks and a random sample of accessions from the gene pool held at the gene bank. It is advisable to utilise metric or percentage rating scales. It can be expected that using a threshold model increases the quality of multivariate analysis and association mapping studies based on phenotypic gene bank data.

Biometrical tools for heterosis research
(2010) Schützenmeister, André; Piepho, Hans-Peter
Molecular biological technologies are frequently applied for heterosis research. Large datasets are generated, which are usually analyzed with linear models or linear mixed models. Both types of model make a number of assumptions, and it is important to ensure that the underlying theory applies for datasets at hand. Simultaneous violation of the normality and homoscedasticity assumptions in the linear model setup can produce highly misleading results of associated t- and F-tests. Linear mixed models assume multivariate normality of random effects and errors. 
These distributional assumptions enable (restricted) maximum likelihood based procedures for estimating variance components. Violations of these assumptions lead to results, which are unreliable and, thus, are potentially misleading. A simulation-based approach for the residual analysis of linear models is introduced, which is extended to linear mixed models. Based on simulation results, the concept of simultaneous tolerance bounds is developed, which facilitates assessing various diagnostic plots. This is exemplified by applying the approach to the residual analysis of different datasets, comparing results to those of other authors. It is shown that the approach is also beneficial, when applied to formal significance tests, which may be used for assessing model assumptions as well. This is supported by the results of a simulation study, where various alternative, non-normal distributions were used for generating data of various experimental designs of varying complexity. For linear mixed models, where studentized residuals are not pivotal quantities, as is the case for linear models, a simulation study is employed for assessing whether the nominal error rate under the null hypothesis complies with the expected nominal error rate. Furthermore, a novel step within the preprocessing pipeline of two-color cDNA microarray data is introduced. The additional step comprises spatial smoothing of microarray background intensities. It is investigated whether anisotropic correlation models need to be employed or isotropic models are sufficient. A self-versus-self dataset with superimposed sets of simulated, differentially expressed genes is used to demonstrate several beneficial features of background smoothing. 
In combination with background correction algorithms, which avoid negative intensities and which have already been shown to be superior, this additional step increases the power in finding differentially expressed genes, lowers the number of false positive results, and increases the accuracy of estimated fold changes.

Estimating heritability in plant breeding programs
(2019) Schmidt, Paul; Piepho, Hans-Peter
Heritability is an important notion in, e.g., human genetics, animal breeding and plant breeding, since the focus of these fields lies on the relationship between phenotypes and genotypes. A phenotype is the composite of an organism’s observable traits, which is determined by its underlying genotype, by environmental factors and by genotype-environment interactions. For a set of genotypes, the notion of heritability expresses the proportion of the phenotypic variance that is attributable to the genotypic variance. Furthermore, as it is an intraclass correlation, heritability can also be interpreted as, e.g., the squared correlation between phenotypic and genotypic values. It is important to note that heritability was originally proposed in the context of animal breeding where it is the individual animal that represents the basic unit of observation. This stands in contrast to plant breeding, where multiple observations for the same genotype are obtained in replicated trials. Furthermore, trials are usually conducted as multi-environment trials (MET), where an environment denotes a year × location combination and represents a random sample from a target population of environments. Hence, the observations for each genotype first need to be aggregated in order to obtain a single phenotypic value, which is usually done by obtaining some sort of mean value across trials and replicates. 
As a consequence, heritability in the context of plant breeding is referred to as heritability on an entry-mean basis and its standard estimation method is a linear combination of variances and trial dimensions. Ultimately, I find that there are two main uses for heritability in plant breeding: The first is to predict the response to selection and the second is as a descriptive measure for the usefulness and precision of cultivar trials. Heritability on an entry-mean basis is suited for both purposes as long as three main assumptions hold: (i) the trial design is completely balanced/orthogonal, (ii) genotypic effects are independent and (iii) variances and covariances are constant. In the last decades, however, many advancements in the methodology of experimental design for and statistical analysis of plant breeding trials took place. As a consequence it is seldom the case that all three of the above-mentioned assumptions are met. Instead, the application of linear mixed models enables the breeder to straightforwardly analyze unbalanced data with complex variance structures. Chapter 2 demonstrates by example some of the flexibility and benefit of the mixed model framework for typically unbalanced MET by using a bivariate mixed model analysis to jointly analyze two MET for cultivar evaluation, which differ in multiple crucial aspects such as plot size, trial design and general purpose. Such an approach can lead to higher accuracy and precision of the analysis and thus more efficient and successful breeding programs. It is not clear, however, how to define and estimate a generalized heritability on an entry-mean basis for such settings. Therefore, multiple alternative methods for the estimation of heritability on an entry-mean basis have been proposed. In Chapter 3, six alternative methods are applied to four typically unbalanced MET for cultivar evaluation and compared to the standard method. 
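For orientation, the standard entry-mean heritability referred to above is commonly written, for a balanced MET, in the form sketched below. The notation is an assumption chosen for illustration and is not taken from the thesis: \sigma_g^2 is the genotypic variance, \sigma_{ge}^2 the genotype-by-environment interaction variance, \sigma_\varepsilon^2 the error variance, e the number of environments and r the number of replicates per environment.

```latex
% Textbook form of heritability on an entry-mean basis (balanced data assumed).
% Symbols are illustrative assumptions, not the thesis's own notation.
H^2 \;=\; \frac{\sigma_g^2}
               {\sigma_g^2 \;+\; \dfrac{\sigma_{ge}^2}{e} \;+\; \dfrac{\sigma_\varepsilon^2}{e\,r}}
```

The denominator is the variance of a genotype mean across e environments and r replicates, which is why this expression relies on the balanced, independent and constant-variance assumptions listed above.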
The outcome suggests that the standard method over-estimates heritability, while all of the alternative methods show similar, lower estimates and thus seem able to handle this kind of unbalanced data. Finally, it is argued in Chapter 4 that heritability in plant breeding is not actually based on or aiming at entry-means, but on the differences between them. Moreover, an estimation method for this new proposal of heritability on an entry-difference basis (H_Delta^2/h_Delta^2) is derived and discussed, as well as exemplified and compared to other methods via analyzing four different datasets for cultivar evaluation which differ in their complexity. I argue that regarding the use of heritability as a descriptive measure, H_Delta^2/h_Delta^2 can on the one hand give a more detailed and meaningful insight than all other heritability methods and on the other hand reduces to other methods under certain circumstances. When it comes to the use of heritability as a means to predict the response to selection, the outcome of this work discourages this as a whole. Instead, response to selection should be simulated directly and thus without using any ad hoc heritability measure.

Evaluation of alternative statistical methods for genomic selection for quantitative traits in hybrid maize
(2012) Schulz-Streeck, Torben; Piepho, Hans-Peter
The efficacy of several contending approaches for genomic selection (GS) was tested using different simulation and empirical maize breeding datasets. Here, GS is viewed as a general approach, incorporating all the different stages from the phenotypic analysis of the raw data to the marker-based prediction of the breeding values. The overall goal of this study was to develop and comparatively evaluate different approaches for accurately predicting genomic breeding values in GS. 
In particular, the specific objectives were to: (1) Develop different approaches for using information from analyses preceding the marker-based prediction of breeding values for GS. (2) Extend and/or suggest efficient implementations of statistical methods used at the marker-based prediction stage of GS, with a special focus on improving the predictive accuracy of GS in maize breeding. (3) Compare different approaches to reliably evaluate and compare methods for GS. An important step in the analyses preceding the marker-based prediction is the phenotypic analysis stage. One way of combining phenotypic analysis and marker-based prediction into a single stage analysis is presented. However, a stagewise analysis is typically computationally more efficient than a single stage analysis. Several different weighting schemes for minimizing information loss in stagewise analyses are therefore proposed and explored. It is demonstrated that orthogonalizing the adjusted means before submitting them to the next stage is the most efficient way within the set of weighting schemes considered. Furthermore, when using stagewise approaches, it may suffice to omit the marker information until the very last stage, if the marker-by-environment interaction has only a minor influence, as was found to be the case for the datasets considered in this thesis. It is also important to ensure that genotypic and phenotypic data for GS are of sufficiently high-quality. This can be achieved by using appropriate field trial designs and carrying out adequate quality controls to detect and eliminate observations deemed to be outlying based on various diagnostic tools. Moreover, it is shown that pre-selection of markers is less likely to be of high practical relevance to GS in most cases. Furthermore, the use of semivariograms to select models with the greatest strength of support in the data for GS is proposed and explored. 
It is shown that several different theoretical semivariogram models were all well supported by an example dataset and no single model was selected as being clearly the best. Several methods and extensions of GS methods have been proposed for marker-based prediction in GS. Their predictive accuracies were similar to that of the widely used ridge regression best linear unbiased prediction method (RR-BLUP). It is thus concluded that RR-BLUP, spatial methods, machine learning methods, such as componentwise boosting, and regularized regression methods, such as elastic net and ridge regression, have comparable performance and can therefore all be routinely used for GS for quantitative traits in maize breeding. Accounting for environment-specific or population-specific marker effects had only minor influence on predictive accuracy contrary to findings of several other studies. However, accuracy varied markedly among populations, with some populations showing surprisingly very low levels of accuracy. Combining different populations prior to marker-based prediction improved prediction accuracy compared to doing separate population-specific analyses. Moreover, polygenetic effects can be added to the RR-BLUP model to capture genetic variance not captured by the markers. However, doing so yielded minor improvements, especially for high marker densities. To relax the assumption of homogenous variance of markers, the RR-BLUP method was extended to accommodate heterogeneous marker variances but this had negligible influence on the predictive accuracy of GS for a simulated dataset. The widely used information-theoretic model selection criterion, namely the Akaike information criterion (AIC), ranked models in terms of their predictive accuracies similar to cross-validation in the majority of cases. 
But further tests would be required to definitively determine whether the computationally more demanding cross-validation may be substituted with the more efficient model selection criteria, such as AIC, without much loss of accuracy. Overall, a stagewise analysis, in which the markers are omitted until the very last stage, is recommended for GS for the tested datasets. The particular method used for marker-based prediction from the set of those currently in use is of minor importance. Hence, the widely used and thoroughly tested RR-BLUP method would seem adequate for GS for most practical purposes, because it is easy to implement using widely available software packages for mixed models and it is computationally efficient.

Extensions and applications of generalized linear mixed models for network meta-analysis of randomized controlled trials
(2022) Wiksten, Anna; Piepho, Hans-Peter
Network meta-analysis of published clinical trials has received increased attention over the past years, with some meta-analytic publications having had a big impact on the cost-benefit assessment of important drugs. Much of the research has been based on Bayesian analysis using the so-called baseline contrast model. The research in network meta-analysis methodology has in part been isolated from other fields of mathematical statistics and is lacking an integrative framework clearly separating statistical models and assumptions, inferential principles, and computational algorithms. The very extensive past research on ANOVA and MANOVA of unbalanced designs, variance component models, and generalised linear models with fixed and/or random effects provides a wealth of useful approaches and insights. These models are especially common in agricultural statistics and this thesis extended the use of the general statistical methods mainly applied in agricultural statistics to applications of network meta-analysis of clinical trials. 
The methods were applied to four different research problems in separate manuscripts. The first manuscript was based on a simulated case (based on a real example) where some of the trials provided individual patient data and some only aggregated data. The outcome type considered was continuous normally distributed data. This manuscript provides models for jointly modelling the individual patient data and aggregated data. It was also explored how much information is lost if data are aggregated and how to quantify the amount of lost information. The second manuscript was based on a real-life dataset with pain medications used in acute postoperative pain. The outcome of interest was binomial, whether a subject experienced pain relief or not. The dataset used for NMA included 261 trials with 52 different treatment and dose combinations, making it an extraordinarily rich and large network. The third manuscript developed methods for a case of a time-to-event outcome extracted from published Kaplan-Meier curves of survival analyses. This re-generated individual patient data was then used to model and compare the Kaplan-Meier curves and hazards of different treatments. The fourth manuscript of the thesis tackled the problem of between-trial variance estimation for the Hartung-Knapp method in classical two-treatment meta-analysis. The main finding of the paper was that in some cases random effect meta-analysis using the Hartung-Knapp method may yield shorter confidence intervals for the combined treatment effect than fixed effect meta-analysis, and therefore the recommendation is to always compare results from the Hartung-Knapp method with fixed effect meta-analysis. This thesis explored and developed the use of generalized linear mixed models in a setting of network meta-analysis of randomized clinical trials. In practice the most popular analysis method in the field of network meta-analysis has been the baseline contrast model which is usually fitted in a Bayesian framework. 
The baseline contrast model and Bayesian estimation provide great flexibility, but also come with some unnecessary complications for certain types of analyses. This thesis showed how methods originally developed and extensively used in agricultural research can be used in other fields, providing efficient calculation, estimation, and inference. Some of the examples used in this thesis arose from analyses needed for real applications in drug development and were directly used in medical research.

Genomic prediction in rye
(2017) Bernal-Vasquez, Angela-Maria; Piepho, Hans-Peter
Technical progress in the genomic field is accelerating developments in plant and animal breeding programs. The access to high-dimensional molecular data has facilitated acquisition of knowledge of genome sequences in many economically important species, which can be used routinely to predict genetic merit. Genomic prediction (GP) has emerged as an approach that allows predicting the genomic estimated breeding value (GEBV) of an unphenotyped individual based on its marker profile. The approach can considerably increase the genetic gain per unit time, as not all individuals need to be phenotyped. The accuracy of the predictions is influenced by several factors and requires proper statistical models able to overcome the problem of having more predictor variables than observations. Plant breeding programs run for several years and genotypes are evaluated in multi-environment trials. Selection decisions are based on the mean performance of genotypes across locations and, later on, across years. Under these conditions, linear mixed models offer a suitable and flexible framework to undertake the phenotypic and genomic prediction analyses using a stage-wise approach, allowing refinement of each particular stage. In this work, outlier detection methods, phenotypic analyses and GP models were evaluated and compared. 
In particular, it was studied whether, at the plot level, identification and removal of possible outlying observations has an impact on the predictive ability; further, whether an enhancement of phenotypic models by spatial trends leads to improvement of GP accuracy; and finally, whether the use of the kinship matrix can enhance the dissection of GEBVs from genotype-by-year (GY) interaction effects. Here, the methods related to the mentioned objectives are compared using experimental datasets from a rye hybrid breeding program. Outlier detection methods widely used in many German plant breeding companies were assessed in terms of control of the family-wise error rate and their merits evaluated in a GP framework (Chapter 2). The benefit of implementation of the methods based on a robust scale estimate was that in routine analysis, such procedures reliably identified spurious data. This outlier detection approach per trial at the plot level is conservative and ensures that adjusted genotype means are not severely biased due to outlying observations. Whenever it is possible, breeders should manually flag suspicious observations based on subject-matter knowledge. Further, removing the flagged outliers identified by the recommended methods did not reduce predictive abilities estimated by cross validation (GP-CV) using data of a complete breeding cycle. A crucial step towards an accurate calibration of the genomic prediction procedure is the identification of phenotypic models capable of producing accurate adjusted genotype mean estimates across locations and years. Using a two-year dataset connected through a single check, a three-stage GP approach was implemented (Chapter 3). In the first stage, spatial and non-spatial models were fitted per location and year to obtain adjusted genotype-tester means. In the second stage, adjusted genotype means were obtained per year, and in the third stage, GP models were evaluated. 
The Akaike information criterion (AIC) and predictive abilities estimated from GP-CV were used as model selection criteria in the first and in the third stage. These criteria were used in the first stage, because a choice had to be made between the spatial and non-spatial models and in the third stage, because the predictive abilities allow a comparison of the results of the complete analysis obtained by the alternative stage-wise approaches presented in this thesis. The second stage was a transitional stage where no model selection was needed for a given method of stage-wise analysis. The predictive abilities displayed a different ranking pattern for the models than the AIC, but both approaches pointed to the same best models. The highest predictive abilities obtained for the GP-CV at the last stage did not coincide with the models that AIC and predictive ability of GP-CV selected in the first stage. Nonetheless, GP-CV can be used to further support model selection decisions that are usually based only upon AIC. There was a trend of models accounting for row and column variation to have better accuracies than the counterpart model without row and column effects, thus suggesting that row-column designs may be a potential option to set up breeding trials. While bulking multi-year data allows increasing the training set size and covering a wider genetic background, it remains a challenge to separate GEBVs from GY effects when there are no common genotypes across years, i.e., years are poorly connected or totally disconnected. In a first approach, which considered the two-year dataset connected through a single check, adjusted genotype means were computed per year and submitted to the GP stage (Chapter 3). The year adjustment was done in the GP model by assuming that the mean across genotypes in a given year is a good estimate of the year effect. This assumption is valid because the genotypes evaluated in a year are a sample of the population. 
Results indicated that this approach is more realistic than relying on the adjustment of a single check. A further approach entailed the use of kinship to dissect GY effects from GEBVs (Chapter 4). It was not obvious which method best models the GY effect, thus several approaches were compared and evaluated in terms of predictive abilities in forward validation (GP-FV) scenarios. It was found that for training sets formed by several disconnected years’ data, the use of kinship to model GY effects was crucial. In training sets where two or three complete cycles were available (i.e. there were some common genotypes across years within a cycle), using kinship or not yielded similar predictive abilities. It was further shown that predictive abilities are higher for scenarios with a high degree of relatedness between training and validation sets, and that predicting a selection of top-yielding genotypes was more accurate than predicting the complete validation set when kinship was used to model GY effects. In conclusion, stage-wise analysis is recommended and it is stressed that the careful choice of phenotypic and genomic prediction models should be made case by case based on subject matter knowledge and specificities of the data. The analyses presented in this thesis provide general guidelines for breeders to develop phenotypic models integrated with GP. The methods and models described are flexible and allow extensions that can be easily implemented in routine applications.

Improvement of breeding strategies for the trait vase life in cut carnations (Dianthus caryophyllus L.)
(2018) Boxriker, Maike; Piepho, Hans-Peter
Carnation (Dianthus caryophyllus L.) is one of the ten most famous cut flowers worldwide. A single big flower characterizes standard carnations, while mini carnations possess multiple flowers per stem. 
Vase life (VL) is one of the most important breeding objectives in carnations due to the need for long transportation times and its direct influence on customers. But VL is a complex trait with several effects influencing it. Two-phase traits like VL are traits where the assessment is done in a second phase, in the laboratory, while the plants are cultivated in the greenhouse in the first phase. Many experiments have a two-phase character, but little research has been conducted to develop experimental designs in the second phase. To improve breeding efficiency, molecular markers and genomic selection are used in agricultural science, but they are so far not common in ornamental breeding. The goal of this thesis was the implementation of SNP-based molecular markers for the trait VL to improve selection of long-lasting, transportable cut carnations. For marker association, 1,500 carnation genotypes were screened for VL behavior in an experimental design in both phases. Response to selection was used to assess efficiency. The second-phase experimental design was more important for precise data analyses. This highlights the research need on this topic. Furthermore, it was possible to suggest row-column designs for VL trials. Row-column designs are more flexible in the case of positional effects compared with one-dimensional blocking and can be easily analyzed like an α-design. The easiest way to design the following phases is to apply the design one-to-one. The carnation types, mini and standard, showed an influence on VL. The mini carnations last 0.5 d longer than the standard carnations. The same conclusion was drawn based on the molecular data. Transcriptome data was generated with two different sequencing methods. By independent analysis of both carnation types, different results than via the analysis of the whole data set were found. This indicates that the analysis of carnations should be done separately for each carnation type. 
Association of the phenotypic and genotypic data was so far not possible. As an alternative to molecular markers, genetic correlations between VL and other breeding-relevant traits were calculated for use in indirect selection. For the first time, bivariate analysis was conducted in two-phase experiments. The genotypic correlation between VL and FD was high, but indirect selection would be less effective than direct selection. However, the information can provide an indication of the performance and the effort to measure FD is small. The calculated high heritability of VL and the differences in VL of up to 15 d found between the best and worst genotypes showed the potential of improving the population mean by using improved selection strategies like marker-assisted selection or auxiliary traits and the use of statistical methods like experimental designs in all phases of the experiment. The influence of carnation type was shown in this thesis and indicates that the implementation of molecular markers must be done independently for each carnation type. The importance of experimental designs in multi-phase experiments was highlighted and statistical analysis by mixed models and a bivariate analysis of different traits was performed. Until now, no molecular marker for VL has been identified, but a further research project will address this by generating more genotypic data and constructing a genetic map.

Management effect on the weed control efficiency in double cropping systems
(2023) Schmidt, Fruzsina; Böhm, Herwart; Graß, Rüdiger; Wachendorf, Michael; Piepho, Hans-Peter
There are often negative side-effects associated with the traditional (silage) maize cropping system related to the unprotected soil surface. Reducing soil disturbance could enhance system sustainability. Yet, increased weed pressure and decreased nitrogen availability, particularly in organic agriculture, may limit the implementation of alternative management methods. 
Therefore, a field experiment was conducted at two distinct locations to evaluate the weed control efficiency of 18 organically managed silage maize cropping systems. Examined parameters were relative weed groundcover (GCweed) and its correlation with maize dry matter yield (DMY), relative proportion of dominant weed species (DWS) and their groups by life form (DWSgroup). Treatment factors comprised first crop (FC—winter pea, hairy vetch, and their mixtures with rye, control (sole silage maize cropping system—SCS)), management—incorporating FC use and tillage (double cropping system no-till (DCS NT), double cropping system reduced till (DCS RT), double cropped, mulched system (DCMS Roll) and SCS control), fertilization, mechanical weed control and row width (75 cm and 50 cm). The variation among environments was high, but similar patterns occurred across locations: Generally low GCweed occurred (below 28%) and, therefore, typically no correlation to maize DMY was observed. The number of crops (system), system:management and occasionally management:FC (group) influenced GCweed and DWS(group). Row width had inconsistent and/or marginal effects. Results suggest differences related to the successful inclusion of DCS and DCMS into the rotation, and to the altered soil conditions, additional physical destruction by shallow tillage operations, especially in the early season, which possibly acts through soil thermal and chemical properties, as well as light conditions. DCS RT could successfully reduce GCweed below 5%, whereas DCS NT and particularly DCMS (Mix) suffered from inadequate FC management. 
Improvements in DCMS may comprise the use of earlier maturing legumes, especially hairy vetch varieties, further reduction/omission of the cereal companion in the mixture and/or more destructive termination of the FC.

Publication Mixed modelling for phenotypic data from plant breeding (2011) Möhring, Jens; Piepho, Hans-Peter
Phenotypic selection and genetic studies require an efficient and valid analysis of phenotypic plant breeding data. Therefore, the analysis must take the mating design, the field design and the genetic structure of tested genotypes into account. In Chapter 2 unbalanced multi-environment trials (METs) in maize using a factorial design are analysed. The dataset from 30 years is subdivided into periods of up to three years. Variance component estimates for general and specific combining ability are calculated for each period. While mean grain yield increased with ongoing inter-pool selection, no changes for the mean of dry matter yield or for variance component estimate ratios were found. The continuous preponderance of general combining ability variance allows a hybrid selection based on general combining effects. The analysis of large datasets is often performed in stage-wise fashion by analysing each trial or location separately and estimating adjusted genotype means per trial or location. These means are then submitted to a mixed model to calculate genotype main effects across trials or locations. Chapter 3 studies the influence of stage-wise analysis on genotype main effect estimates for models which take account of the typical genetic structure of genotype effects within plant breeding data. For comparison, the genetic effects were assumed both fixed and random. The performance of several weighting methods for the stage-wise analysis is analysed by correlating the two-stage estimates with results of one-stage analysis and by calculating the mean square error (MSE) between both types of estimate. 
In case of random genetic effects, the genetic structure is modelled in one of three ways, either by using the numerator relationship matrix, a marker-based kinship matrix or by using crossed and nested genetic effects. It was found that stage-wise analysis results in comparable genotype main effect estimates for all weighting methods and for the assumption of random or fixed genetic effect if the model for analysis is valid. In case of choosing invalid models, e.g., if the missing data pattern is informative, both analyses are invalid and the results can differ. Informative missing data pattern can result from ignoring information either used for selecting the analysed genotypes or for selecting the test environments of genotypes, if not all genotypes are tested in all environments. While correlated information from relatives is rarely directly used for analysis of plant breeding data, it is often used implicitly by the breeder for selection decisions, e.g. by looking at the performance of a genotype and the average performance of the underlying cross. Chapter 4 proposed a model with a joint variance-covariance structure for related genotypes in analysis of diallels. This model is compared to other diallel models based on assumptions regarding the inheritance of several independent genes, i.e. on genetic models with more restrictive assumptions on the relationship between relatives. The proposed diallel model using a joint variance-covariance structure for parents and parental effects in crosses is shown to be a general model subsuming other more specialized diallel models, as these latter models can be obtained from the general model by adding restrictions on the variance-covariance structure. If no a priori information about the genetic model is available the proposed general model can outperform the more restrictive models. Using restrictive models can result in biased variance component estimates, if restrictions are not fulfilled by the data analysed. 
Chapter 5 evaluates whether a subdivision of 21 triticale genotypes into heterotic pools is preferable. Subdividing genotypes into heterotic pools implies a factorial mating design between heterotic pools and a diallel mating design within each heterotic pool. For two (or more) heterotic pools the model is extended by assuming a joint variance-covariance structure for parental effects and general combining ability effects within the diallel and within the factorials. It is shown that a model with two heterotic pools has the best model fit. The variance component estimates for the general combining ability decrease within the heterotic pools and increase between heterotic pools. The results in Chapters 2 to 5 show that an efficient and valid analysis of phenotypic plant breeding data is an essential part of the plant breeding process. The analysis can be performed in one or two stages. The mixed models used, which recognize the field and mating design and the genetic structure, can be used for answering questions about the genetic variance in cultivar populations under selection and about the number of heterotic pools. The proposed general diallel model using a joint variance-covariance structure between related effects can further be modified for factorials and other mating designs with related genotypes.

Publication Model selection by cross-validation in multi-environment trials (2017) Hadasch, Steffen; Piepho, Hans-Peter
In plant breeding, estimation of the performance of genotypes across a set of tested environments (genotype means), and the estimation of the environment-specific performances of the genotypes (genotype-environment means) are important tasks. For this purpose, breeders conduct multi-environment trials (MET) in which a set of genotypes is tested in a set of environments. 
The data from such experiments are typically analysed by mixed models, as such models allow, for example, modelling the genotypes using random effects which may be correlated according to their genetic information. The data from MET are often high-dimensional and the covariance matrix of the data may contain many parameters that need to be estimated. To circumvent computational burdens, the data can be analysed in a stage-wise fashion. In the stage-wise analysis, the covariance matrix of the data needs to be taken into account in the estimation of the individual stages. In the analysis of MET data, there is usually a set of candidate models from which the one that best fits the objective of the breeder needs to be determined. Such a model selection can be done by cross validation (CV). In the application of CV schemes, different objectives of the breeder can be evaluated using an appropriate sampling strategy. In the application of a CV, both the sampling strategy and the evaluation of the model need to take the correlation of the data into account to evaluate the model performance adequately. This work focused on two different types of models that are used for the analysis of MET. In Chapter 2, models that use genetic marker information to estimate the genotype means were considered. Chapters 3 and 4 focused on the estimation of genotype-environment means using models that include multiplicative terms to describe the genotype-environment interaction, namely the additive main effects and multiplicative interaction (AMMI) model and the genotype and genotype-environment interaction (GGE) model. In all chapters, the models were estimated in a stage-wise fashion. Furthermore, CV was used in Chapters 2 and 3 to determine the most appropriate model from a set of candidate models. In Chapter 2, two traits of a biparental lettuce (Lactuca sativa L.) 
population were analysed by models for (i) phenotypic selection, (ii) marker-assisted selection using QTL-linked markers, (iii) genomic prediction using all available molecular markers, and (iv) a combination of genomic prediction and QTL-linked markers. Using different sampling strategies in a CV, the predictive performances of these models were compared in terms of different objectives of a breeder, namely predicting unobserved genotypes, predicting genotypes in an unobserved environment, and predicting unobserved genotypes in an unobserved environment. Generally, the genomic prediction model outperformed marker assisted and phenotypic selection when there are only a few markers with large effects, while the marker assisted selection outperformed genomic prediction when the number of markers with large effects increases. Furthermore, the results obtained for the different objectives indicate that the predictive performance of the models in terms of predicting (unobserved) genotypes in an unobserved environment is reduced due to the presence of genotype-environment interaction. In AMMI/GGE models, the number of multiplicative terms can be determined by CV. In Chapter 3, different CV schemes were compared in a simulation study in terms of recovering the true (simulated) number of multiplicative terms, and in terms of the mean squared error of the estimated genotype-environment means. The data were simulated using the estimated variance components of a randomized complete block design and a resolvable incomplete block design. The effects of the experimental design (replicates and blocks) need to be taken into account in the application of a CV in order to evaluate the predictive performance of the model adequately. In Chapter 3, the experimental design was accounted for by an adjustment of the data for the design effects estimated from all data before applying a CV scheme. 
The results of the simulation study show that an adjustment of the data is required to determine the number of multiplicative terms in AMMI/GGE models. Furthermore, the results indicate that different CV schemes can be used with similar efficiencies provided that the data were adjusted adequately. AMMI/GGE models are typically estimated in a two-stage analysis in which the first stage consists of estimating the genotype-environment means while the second stage consists of estimating main effects of genotypes and environments and the multiplicative interaction. The genotype-environment means estimated in the first stage are not independent when effects of the experimental design are modelled as random effects. In such a case, estimation of the second stage should be done by a weighted (generalized least squares) estimation where a weighting matrix is used to take the covariance matrix of the estimated genotype-environment means into account. In Chapter 4, three different algorithms which can take the full covariance matrix of the genotype-environment means into account are introduced to estimate the AMMI/GGE model in a weighted fashion. To investigate the effectiveness of the weighted estimation, the algorithms were implemented using different weighting matrices, including (i) an identity matrix (unweighted estimation), (ii) a diagonal approximation of the inverse covariance matrix of the genotype-environment means, and (iii) the full inverse covariance matrix. The different weighting strategies were compared in a simulation study in terms of the mean squared error of the estimated genotype-environment means, multiplicative interaction effects, and Biplot coordinates. The results of the simulation study show that weighted estimation of the AMMI/GGE model generally outperformed unweighted estimation. Furthermore, the effectiveness of a weighted estimation increased when the heterogeneity in the covariance matrix of the estimated genotype-environment means increased. 
The analysis of MET in a stage-wise fashion is an efficient procedure to estimate a model for MET data, provided that the covariance structure of the data is taken into account in each stage in order to estimate the model appropriately. When correlated data are used in a CV, the correlation can be taken into account by an appropriate choice of training and validation data, by an adjustment of the data before applying a CV scheme and by the success criterion used in a CV scheme.

Publication Optimizing the prediction of genotypic values accounting for spatial trend and population structure (2010) Müller, Bettina Ulrike; Piepho, Hans-Peter
Different effects, like the design of the field trial, agricultural practice, competition between neighbouring plots, climate as well as the spatial trend, contribute to the non-genotypic variation of the genotype. Through this non-genotypic variation, these effects influence the prediction of the genotypic value. The error, which results from the influence of the non-genotypic variation, can be separated from the phenotypic value by field design and statistical models. The integration of different information, like spatial trend or marker data, can lead to an improved prediction of genotypic values. The present work consists of four studies from the area of plant breeding and crop science, in which the prediction of the genotypic values was optimized with inclusion of the above-mentioned aspects. 
Goals of the work were: (1) to compare the different spatial models and to find one model, which is applicable as routine in plant breeding analysis, (2) to optimize the analysis of unreplicated trials of plant breeding experiments by improving the allocation of replicated check genotypes, (3) to improve the analysis of intercropping experiments by using spatial models and to detect the neighbour effect between the different cultivars, and (4) to optimize the calculation of the genome-wide error rate in association mapping experiments by using an approach which regards the population structure. Different spatial models and a baseline model, which reflects the randomization of the field trial, were compared in three of the four studies. In one study the models were compared on basis of different efficiency criteria with the goal to find a model, which is applicable as routine in plant breeding experiments. In the second study the different spatial models and the baseline model were compared on unreplicated trials, which are used in the early generation of the plant breeding process. Adjacent to the comparison of the models in this study different designs were compared with the goal to see if a non-systematic allocation of check genotypes is more preferable than a systematic allocation of check genotypes. In the third study these different models were tested for intercropping experiments. In this study it should be tested, if an improvement is expectable for these non randomized or restricted randomized trials by using a spatial analysis. The results of the three studies are that no spatial model could be found, which is preferable over all other spatial models. In a lot of cases the baseline model, which regards only the randomization, but no spatial trend, was better than the spatial models, also for the restricted or non-randomized intercropping trials. 
In all three studies the basic principle was to start with the baseline model, which is based on randomization theory, and then to extend it by spatial trend if the model fit could be improved. In the second study the systematic and non-systematic allocation of check plots in unreplicated trials were compared to answer the question of whether a non-systematic allocation leads to more efficient estimates of genotypes than the systematic allocation. The non-systematic allocation of check plots led to an unbiased estimation in three of four uniformity trials. In the third study it was also analysed whether the border plots of the different cultivars are influenced by the neighbouring cultivar and whether there are significant differences to the inner plots. The position of the cultivars, border plot or inner plot, had a significant influence on the yield. If maize was cultivated adjacent to pea, the yield of the border plot of maize was much higher than that of the inner plot of maize. When wheat was cultivated behind maize, there were no significant differences in yield between border plots and inner plots. In addition to optimizing the field design for unreplicated trials and extending the models by spatial trend, marker information was integrated in a fourth study. An approach was proposed in this study which calculates the genome-wide error rate for association mapping experiments and accounts for the population structure. Advantages of this approach over previously published approaches are that it is not too conservative and that it accounts for the population structure. The adherence to the genome-wide error rate was tested on three datasets, which were provided by different plant breeding companies. 
The results of the studies in this thesis show that the different extensions, like integration of spatial trend and marker information, and modifications of the field design can achieve an improved prediction of the genotypic values.

Publication Statistical methods for analysis of multienvironment trials in plant breeding : accuracy and precision (2021) Buntaran, Harimurti; Piepho, Hans-Peter
Multienvironment trials (MET) are carried out every year in different environmental conditions to evaluate a vast number of cultivars, e.g., for yield, because different cultivars perform differently in various environmental conditions, a phenomenon known as genotype×environment interaction. MET aim to provide accurate information on cultivar performance so that a recommendation of which cultivar performs the best in a growers’ field condition can be available. MET data is often analysed via mixed models, which allow the cultivar effect to be random. The random effect of cultivar enables genetic correlation to be exploited across zones and the trials’ heterogeneity to be considered. A zone can be viewed as a larger target population of environments. It is crucial to evaluate the accuracy and precision of the cultivar predictions. The prediction accuracy can be evaluated via a cross-validation (CV) study, and the model selection can be done based on the lowest mean squared error of prediction (MSEP). Also, since the trials’ locations hardly coincide with growers’ fields, the precision of predictions needs to be evaluated via standard errors of predictions of cultivar values (SEPV) and standard errors of the predictions of pairwise differences of cultivar values (SEPD). The central objective of this thesis is to assess the model performance and conduct model selection via a CV study for zone-based cultivar predictions. 
Chapter 2 compared the performance of empirical best linear unbiased estimations (EBLUE) and empirical best linear unbiased predictions (EBLUP) for zone-based prediction. Different CV schemes were used for the single-year and multi-year datasets to mimic the practice. A complex covariance structure such as factor-analytic (FA) was imposed to account for the heterogeneity of the cultivar×zone (CZ) effect. The MSEP showed that the EBLUP models outperformed the EBLUE models. The zonation was necessary since it improved the accuracy and was preferable for making cultivar recommendations. The FA structure did not improve the accuracy compared to the simpler covariance structure, and so the EBLUP model with a simple covariance structure is sufficient for the single and multi-year datasets. Chapter 3 assessed the single-stage and stagewise analyses. Three weighting methods were compared in the stagewise analysis, namely two diagonal approximation methods and the fully efficient method, alongside the unweighted analysis. The assessment was based on the MSEP instead of Pearson’s and Spearman’s correlation coefficients since the correlation coefficients are often very close between the compared models. The MSEP showed that the single-stage EBLUP and the stagewise weighting EBLUP strategy were very similar. Thus, the loss of information due to diagonal approximation is minor. In fact, compared to the correlation coefficients, the MSEP showed a more apparent distinction between the single-stage and stagewise weighting analyses on the one hand and the unweighted EBLUE on the other. The simple compound-symmetric covariance structure was sufficient for the CZ effect, and the more complex structures were not needed. The choice between the single-stage and stagewise weighting analysis, thus, depends on the computational resources and the practicality of data handling. Chapter 4 assessed the accuracy and precision of the predictions for the new locations. 
The environmental covariates were combined with the EBLUP in the random coefficient (RC) models since the covariates provide more information for the new locations. The MSEP showed that the RC models were not the models with the smallest MSEP, but the RC models had the lowest SEPV and SEPD. Thus, the model selection can be done by joint consideration of the MSEP, SEPV, and SEPD. The models with EBLUE and covariate interaction effects performed poorly regarding the MSEP. The EBLUP models without RC performed best regarding the MSEP, but their SEPV and SEPD were large, so their predictions were considered unreliable. The covariate scale and selection are essential to obtain a positive definite covariance matrix. Employing an unstructured covariance in the RC is crucial to maintaining the RC models’ invariance feature. The RC framework is suitable to be implemented with GIS data to provide an accurate and precise projection of cultivar performance for the new locations or environments. To conclude, the EBLUP model for zone-based predictions should be preferred to obtain predictions and rankings closer to the true values and rankings. The stagewise weighting analysis can be recommended due to its practicality and its computational efficiency. Furthermore, projecting cultivar performances to the new locations should be done to provide more targeted information for growers. The available environmental covariates can be utilised to improve the predictions’ accuracy and precision in the new locations in the RC model framework. 
Such information is certainly more valuable for growers and breeders than just providing means across a whole target population of environments.

Publication Studying stem rust and leaf rust resistances of self-fertile rye breeding populations (2022) Gruner, Paul; Witzke, Anne; Flath, Kerstin; Eifler, Jakob; Schmiedchen, Brigitta; Schmidt, Malthe; Gordillo, Andres; Siekmann, Dörthe; Fromme, Franz Joachim; Koch, Silvia; Piepho, Hans-Peter; Miedaner, Thomas
Stem rust (SR) and leaf rust (LR) are currently the two most important rust diseases of cultivated rye in Central Europe and resistant cultivars promise to prevent yield losses caused by those pathogens. To secure long-lasting resistance, ideally pyramided monogenic resistances and race-nonspecific resistances are applied. To find respective genes, we screened six breeding populations and one testcross population for resistance to artificially inoculated SR and naturally occurring LR in multi-environmental field trials. Five populations were genotyped with a 10K SNP marker chip and one with DArTseq™. In total, ten SR-QTLs were found that caused a reduction of 5–17 percentage points in stem coverage with urediniospores. Four QTLs thereof were mapped to positions of already known SR QTLs. An additional gene at the distal end of chromosome 2R, Pgs3.1, that caused a reduction of 40 percentage points in SR infection, was validated. One SR-QTL on chromosome 3R, QTL-SR4, was found in three populations linked with the same marker. Further QTLs at similar positions, but from different populations, were also found on chromosomes 1R, 4R, and 6R. For SR, seedling tests were additionally used to separate adult-plant from all-stage resistances and a statistical method accounting for the ordinal-scaled seedling test data was used to map seedling resistances. However, only Pgs3.1 could be detected based on seedling test data, even though genetic variance was observed in another population, too. 
For LR, in three of the populations, two new large-effect loci (Pr7 and Pr8) on chromosomes 1R and 2R were mapped that caused 34 and 21 percentage points reduction in leaf area covered with urediniospores and one new QTL on chromosome 1R causing 9 percentage points reduction.

Publication The development of phenotypic protocols and adjustment of experimental designs in Pelargonium zonale breeding (2018) Molenaar, Heike; Piepho, Hans-Peter
Ornamental plant variety improvement is limited by current phenotyping approaches and the lack of use of experimental designs. Robust phenotypic data obtained from experiments laid out to best control local variation by blocking allow adequate statistical analysis and are crucial for any breeding purpose, including MAS. Often experiments consist of multiple phases, as in P. zonale breeding, where in the first phase stock plants are cultivated to obtain the stem cutting count and in the second phase the stem cuttings are further assessed for root formation. The first analyses of rooting experiments raised questions regarding options for improving the two-phase experimental layout, for example whether there is a disadvantage to using exactly the same design in both phases. The other question was whether a design can be optimized across both phases, instead of generating a separate layout for each phase, such that the MVD can be decreased. Moreover, optimal selection methods that maximize selection gain in P. zonale breeding based on available data collected from unreplicated trials and containing pedigree information were sought. This thesis was conducted to evaluate the benefits of using two-phase experimental designs and corresponding analysis in P. zonale for production-related traits, for which it was necessary to establish phenotyping protocols. To optimize the rooting experiments with their two-phase nature, alternative approaches were explored involving two-phase design generation either in phase-wise order or across phases. 
Furthermore, selection methods considering pedigree information (family-index selection) or not (individual selection) were evaluated to enhance selection efficiency in P. zonale breeding. The benefits of using experimental designs in P. zonale breeding were shown by the simulated response to selection. Alternative designs were evaluated by the MVD obtained by the intra-block analysis and the joint inter-block-intra-block analysis. The efficiency of individual and family-index selection was evaluated in terms of heritability obtained from linear mixed models implementing the selection methods. Simulated response to selection varied greatly, depending on the genotypic variances of the breeding population and traits. However, by using efficient designs allowing adequate analysis, a varietal improvement of over 20% of stock plant reduction is possible for stem cutting count, root formation, branch count and flower count. The smallest MVD for alternative designs was most frequently obtained for designs generated across phases rather than for each phase separately, in particular when both phases of the design were separated with a single pseudolevel. Family-index selection was superior to individual selection in P. zonale, indicating that the pedigree-based BLUP procedure can further enhance selection efficiency in production-related traits in P. zonale. The quantification of genotypic variation by phenotypic protocols and the optimized two-phase designs for estimating genotypic values were necessary and successful steps in laying the foundation for effective MAS. Phenotypic protocols effectively characterized the genetic material on an observational unit level, while the two-phase experimental designs enabled effective characterization on a genotype level by adjusting entry means using linear mixed models. 
The resulting adjusted entry means are the basis for future genotype-phenotype association for MAS.

Publication Weighting methods for variance heterogeneity in phenotypic and genomic data analysis for crop breeding (2019) Damesa, Tigist Mideksa; Piepho, Hans-Peter
In plant breeding programmes MET form the backbone for phenotypic selection, GS and GWAS. Efficient analysis of MET is fundamental to get accurate results from phenotypic selection, GS and GWAS. On the other hand, inefficient analysis of MET data may have consequences such as biased ranking of genotype means in phenotypic data analysis, low accuracy of GS and wrong identification of QTL in GWAS analysis. A combined analysis of MET is performed using either single-stage or stage-wise (two-stage) approaches based on the linear mixed model framework. While single-stage analysis is a fully efficient approach, MET data is suitably analyzed using stage-wise methods. MET data often show within-trial and between-trial variance heterogeneities, which contradict the homogeneity of variance assumption of linear models, and these heterogeneities require corrections. In addition, it is well documented that spatial correlations are inherent to most field trials. Appropriate remedial techniques for variance heterogeneities and proper accounting of spatial correlation are useful to improve accuracy and efficiency of MET analysis. Chapter 2 studies methods for simultaneous handling of within-trial variance heterogeneity and within-trial spatial correlation. This study is based on three maize trials from Ethiopia. To stabilize the variance, the Box-Cox transformation was considered. The result shows that, while the Box-Cox transformation was suitable for stabilizing the variance, it is difficult to report results on the original scale. As an alternative, variance models, i.e. power-of-the-mean (POM) and exponential models, were used to fix the variance heterogeneity problem. 
Unlike the Box-Cox method, the variance models considered in this study were successful to deal simultaneously with both spatial correlation and heterogeneity of variance. For analysis of MET data, two-stage analysis is often favored in practice over single-stage analysis because of its suitability in terms of computation time, and its ability to easily account for any specifics of each trial (variance heterogeneity, spatial correlation, etc). Stage-wise analyses are approximate in that they cannot fully reproduce a single-stage analysis because the variance–covariance matrix of adjusted means from the first-stage analysis is sometimes ignored or sometimes approximated and the approximation may not be efficient. Discrepancy of results between single-stage and two-stage analysis increases when the variance between trials is heterogeneous. In stage-wise analysis one of the major challenges is how to account for heterogeneous variance between trials at the second stage. To account for heterogeneous variance between trials, a weighted mixed model approach is used for the second-stage analysis. The weights are derived from the variances and covariances of adjusted means from the first-stage analysis. In Chapter 3 we compared single-stage analysis and two-stage analysis. A new fully efficient and a diagonal weighting matrix are used for weighting in the second stage. The methods are explored using two different types of maize datasets. The result indicates that single-stage analysis and two-stage analysis give nearly identical results provided that the full information on all effect estimates and their associated estimated variances and covariances is carried forward from the first to the second stage. GWAS and GS analysis can be conducted using a single-stage or a stage-wise approach. The computational demand for GWAS and GS increases compared to purely phenotypic analysis because of the addition of marker data. 
Usually researchers compute genotype means from phenotypic MET data in stage-wise analysis (with or without weighting) and then forward these means to GWAS or GS analysis, often without any weighting. In Chapter 4 weighted and unweighted stage-wise analyses are compared for GWAS and GS using phenotypic and genotypic maize data. Fully efficient and diagonal weighting are used. Results show that weighted analysis is preferred over unweighted analysis for both GS and GWAS. In conclusion, stage-wise analysis is a suitable approach for practical analysis of MET, GS and GWAS. Single-stage and two-stage analysis of MET yield very similar results. Stage-wise analysis can be nearly as efficient as single-stage analysis when using optimal weighting, i.e., fully efficient weighting. Spatial variation and within-trial variance heterogeneity are common in MET data. This study illustrated that both can be resolved simultaneously using a weighting approach for the variance heterogeneity and spatial modeling for the spatial variation. Finally, besides applying weighting in the analysis of phenotypic MET data, it is recommended to use weighting in the actual GS and GWAS analysis stage.
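
As a rough illustration of the weighted second stage described in the abstract above, the following minimal sketch assumes one simple diagonal scheme in which each adjusted mean from the first stage is weighted by the inverse of its squared standard error. The function name `second_stage_genotype_means`, the example numbers, and the purely fixed-effects model (genotype plus trial, fitted by weighted least squares) are illustrative assumptions; the thesis itself works with linear mixed models and also with fully efficient weighting, neither of which is reproduced here.

```python
import numpy as np

def second_stage_genotype_means(adj_means, std_errors, weighted=True):
    """Fit adj_means[i, j] = genotype_i + trial_j + error by least squares.

    adj_means  : (genotypes x trials) adjusted means from a first-stage analysis
    std_errors : standard errors of those adjusted means (same shape)
    weighted   : if True, use diagonal weights 1 / SE^2 (an assumed scheme)
    Returns the fitted genotype means averaged over trials.
    """
    adj_means = np.asarray(adj_means, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    n_geno, n_trial = adj_means.shape

    # Design matrix: one column per genotype, one per trial except trial 0,
    # which serves as the reference level.
    rows = []
    for i in range(n_geno):
        for j in range(n_trial):
            x = np.zeros(n_geno + n_trial - 1)
            x[i] = 1.0
            if j > 0:
                x[n_geno + j - 1] = 1.0
            rows.append(x)
    X = np.array(rows)
    y = adj_means.ravel()  # row-major order matches the loops above

    w = 1.0 / std_errors.ravel() ** 2 if weighted else np.ones_like(y)
    sw = np.sqrt(w)
    # Weighted least squares via ordinary least squares on rescaled data.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

    geno_in_trial0 = beta[:n_geno]
    trial_effects = np.concatenate([[0.0], beta[n_geno:]])
    return geno_in_trial0 + trial_effects.mean()

# Hypothetical first-stage output: 3 genotypes in 2 trials.
means = np.array([[8.1, 6.9], [7.4, 7.2], [9.0, 7.5]])
ses = np.array([[0.3, 0.9], [0.3, 0.8], [0.4, 1.1]])  # trial 2 is noisier
weighted_est = second_stage_genotype_means(means, ses, weighted=True)
unweighted_est = second_stage_genotype_means(means, ses, weighted=False)
```

With equal standard errors the weighted and unweighted estimates coincide; they differ only when precision varies across trials, which is the between-trial heterogeneity situation the abstract refers to.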