How to deal with genotype uncertainty in variance component quantitative trait loci analyses

XIA SHEN; LARS RÖNNEGÅRD; ÖRJAN CARLBORG

doi:10.1017/S0016672311000152

Published online by Cambridge University Press: 18 July 2011

XIA SHEN*: The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; School of Technology and Business Studies, Dalarna University, Borlänge, Sweden
LARS RÖNNEGÅRD: School of Technology and Business Studies, Dalarna University, Borlänge, Sweden; Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
ÖRJAN CARLBORG: The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
*Corresponding author: The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden. E-mail: xia.shen@lcb.uu.se

Summary

Dealing with genotype uncertainty is an ongoing issue in genetic analyses of complex traits. Here we consider genotype uncertainty in quantitative trait loci (QTL) analyses for large crosses in variance component models, where the genetic information is included in identity-by-descent (IBD) matrices. An IBD matrix is one realization from a distribution of potential IBD matrices given available marker information. In QTL analyses, its expectation is normally used, resulting in potentially reduced accuracy and loss of power. Previously, IBD distributions have been included in models for small human full-sib families. We develop an Expectation–Maximization (EM) algorithm for estimating a full model based on Monte Carlo imputation for applications in large animal pedigrees. Our simulations show that the bias of variance component estimates obtained with the traditional expected IBD matrix can be adjusted by accounting for the distribution, and that the calculations are computationally feasible for large pedigrees.

Type: Research Papers
Information: Genetics Research, Volume 93, Issue 5, October 2011, pp. 333–342

DOI: https://doi.org/10.1017/S0016672311000152
Copyright: Copyright © Cambridge University Press 2011

1. Introduction

Variance component models have played an important role in detecting quantitative trait loci (QTL) for the last couple of decades in both animal breeding (Fernando & Grossman, 1989; Goddard, 1992; Arendonk et al., 1994; Wang et al., 1995; George et al., 2000) and human genetics (Goldgar, 1990; Schork, 1993; Fulker & Cardon, 1994; Olson, 1995; Xu & Atchley, 1995; Blangero et al., 2001). To construct the variance–covariance matrix of the random QTL effect, identity-by-descent (IBD) probabilities are required. The IBD probabilities describe the correlation structure between individuals with respect to the frequency of their shared (common) alleles. The genetic variance component estimates, and the corresponding likelihoods, are usually calculated using an estimated IBD matrix.

The IBD matrix can be estimated from marker information using either deterministic (Wang et al., 1995; Pong-Wong et al., 2001; Besnier & Carlborg, 2007) or stochastic algorithms (Thompson & Heath, 1999; Pérez-Enciso et al., 2000; Mao & Xu, 2005). All these methods calculate an average IBD matrix, where each entry is the average frequency of shared alleles, based on partially informative markers. That is, all the IBD values are treated as known in the statistical models. Instead of using the average IBD matrix, which we refer to as the expectation method (Xu, 1996), the uncertainty of the IBD matrix itself may also be included in the likelihood. Such a method, which accounts for the uncertainty of the IBD matrix, is referred to as the distribution method. The likelihood function used by the distribution method is called the full likelihood function, in contrast to the approximated likelihood function used by the expectation method.

Comparison of distribution methods with expectation methods has been a thoroughly investigated problem in human genetics, especially for regression models in QTL analysis and genome-wide association studies (GWAS). Using genotype imputation, GWAS can gain power at positions with uncertain genotypes (Marchini & Howie, 2010) where the expectation method gives good power and accuracy (Kutalik et al., 2011) by using SNP probabilities as covariates. For QTL analyses based on regression models, the QTL effect is treated as fixed, and several studies have applied the idea of a full likelihood function (Elston & Stewart, 1971; Morton & Maclean, 1974; Lander & Botstein, 1989), which is referred to as the maximum likelihood (ML) method in QTL analysis and has been implemented in, for instance, MAPMAKER-QTL (Lander & Botstein, 1989). The implementation is based on an Expectation–Maximization (EM) algorithm (Dempster et al., 1977). However, ML estimates based on a regression model can be approximated very well by the simple Haley–Knott (HK) regression (Haley & Knott, 1992), which is the corresponding expectation method using line-origin probabilities as covariates.

In random effect models, the QTL effect is regarded as random, and considering a full likelihood method is still important to avoid losing statistical power (Schork, 1993). Replacing the IBD matrix with its expectation can only approximate the ML estimates, and the approximation error was shown, by means of simulations, to be non-negligible in the analyses of sib-pairs (Kruglyak & Lander, 1995). To resolve these problems, a weighted likelihood approach has been implemented in the software package Mx (Eaves et al., 1996) for the analysis of small human pedigrees, where the probabilities of IBD states are used as weights. However, knowing the distribution of the IBD matrix is crucial for deriving the full likelihood function. In human full-sib studies, the closed form of the joint distribution of the additive IBD matrix and the dominance IBD matrix has been derived, but this is feasible only for pedigrees including small full-sib families (Gessler & Xu, 1996; Xu, 1996). These earlier studies show that the full likelihood function is statistically more powerful and often gives higher likelihood at the QTL.

Three problems arise from previous studies. First, for animal pedigrees, deriving the distribution of the IBD matrix is infeasible because of the large pedigree size. Therefore, approximating the full likelihood function using a Monte Carlo strategy is a reasonable idea but has not been implemented (Xu, 1996). Second, full-sib studies in humans calculate IBD probabilities for F₁ individuals. For a crossing design in animals, where e.g. F₂ individuals are studied, deriving the IBD distribution is difficult even for small pedigrees. An application of the distribution method to F₂ individuals has therefore not been investigated before, even though the theory was claimed to extend to different kinds of crosses. Third, when there is inbreeding, for instance, in an F₂ intercross, diagonal elements of the IBD matrix need to be adjusted. After adjusting for inbreeding, the full likelihood theory still holds. If the marker density is low or the markers are partially informative, the difference between the distribution method and the expectation method might be substantial for experimental crosses as well.

The aim of our study is to evaluate the performance of the distribution method in animal intercross designs by assessing the magnitude and direction of bias for the expectation method. We try to account for the above problems and investigate the full likelihood function for animal pedigrees. The rest of this paper is arranged as follows. We first describe the statistical model that our study is based on and introduce the theory of the full likelihood. Two illustrative examples of F₂ pedigrees are simulated, where one is used to show the difference from full-sib studies, and the other is used to show the performance of the distribution method in adjusting the bias of heritability estimates. We then compare the distribution method with the expectation method for a real experimental dataset, and use simulations based on real genotypes to compare the power of the two methods. The paper concludes by discussing possible applications and suggesting future developments.

2. Methods and materials

(i) Models and likelihoods

We consider the variance component QTL model (Fernando & Grossman, 1989; Goldgar, 1990):

$$y = X\beta + Z\gamma + \varepsilon \tag{1}$$

where y is the trait response vector for N individuals, β is the fixed effect vector, γ is the multivariate normal-distributed random QTL effect with a zero mean and variance–covariance matrix Iσ_g² and ε is the normal-distributed error term with a zero mean and variance–covariance matrix Iσ_e². X and Z are the design matrices. σ_g² is the genotypic variance and σ_e² is the residual variance. Z relates individuals and their inherited allelic substitution effects. Given the IBD matrix Π, the variance–covariance matrix of the phenotype y is V = Πσ_g² + Iσ_e². The relationship between Π and Z is given by (Rönnegård & Carlborg, 2007):

$$\Pi = ZZ' \tag{2}$$

To adjust the bias of variance component estimates due to estimating the fixed effects, restricted maximum likelihood (REML) is commonly used instead of ML. If θ=(σ_g², σ_e²)′ is the vector of variance components, the likelihood function is given by

$$L(\theta; y, \Pi) \propto |V|^{-1/2}\,|X'V^{-1}X|^{-1/2}\exp\left\{-\tfrac{1}{2}(y - X\hat{\beta})'V^{-1}(y - X\hat{\beta})\right\} \tag{3}$$

where β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y is the generalized least-squares estimate of β.
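As a minimal sketch of how likelihood (3) might be evaluated numerically, the following NumPy function computes the restricted log-likelihood up to an additive constant for a given incidence matrix. It assumes Π = ZZ′ and V = Πσ_g² + Iσ_e²; the function name and arguments are illustrative and are not taken from the authors' software.

```python
import numpy as np

def reml_loglik(theta, y, X, Z):
    """Restricted log-likelihood (up to a constant) for
    y = X beta + Z gamma + eps, assuming V = Z Z' sigma_g^2 + I sigma_e^2."""
    sigma_g2, sigma_e2 = theta
    n = len(y)
    V = sigma_g2 * (Z @ Z.T) + sigma_e2 * np.eye(n)

    # Solve linear systems instead of forming V^{-1} explicitly.
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    XtVinvX = X.T @ Vinv_X

    # Generalized least-squares estimate of the fixed effects.
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv_y)
    resid = y - X @ beta_hat

    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XVX = np.linalg.slogdet(XtVinvX)
    quad = resid @ np.linalg.solve(V, resid)
    return -0.5 * (logdet_V + logdet_XVX + quad)
```

Working on the log scale with `slogdet` avoids overflow or underflow of the determinants when N is large.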

If the alleles cannot be traced unambiguously through the pedigree, e.g. because the markers are not fully informative, the conditional expectation of Π has been used as a known matrix for the estimation of likelihood (3) (expectation method). Regarding Π as random, a joint distribution f(y,Π|θ) is considered for estimating the likelihood (distribution method), i.e. the full likelihood. In this paper, the terms ‘distribution method’ and ‘full likelihood method’ are used interchangeably. ℓ is used to denote the logarithm of the corresponding likelihood L.

Given the incomplete marker information, there is a probability space in which the IBD matrix Π is distributed. The expectation method uses an approximated likelihood

$$L_E(\theta) = L(\theta; y, E[\Pi]) \tag{4}$$

where the variation of Π is not considered since E[Π] is inserted as a known matrix. Instead of calculating the expected IBD matrix E[Π], the full likelihood function takes the variation of Π into account by considering the joint distribution of θ and Π. Inference of θ should be made from the marginal likelihood of θ integrating out Π. Hence, based on profile likelihood (3), the distribution method uses the marginal likelihood

$$L_D(\theta) = E_{\Pi}\left[L(\theta; y, \Pi)\right] \tag{5}$$

Thus, the difference between L_E and L_D is the difference between calculating the function of an expectation and calculating an expectation of the function. Non-linearity of the function L with respect to Π affects the difference between L_E and L_D.
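The distinction between a function of an expectation and an expectation of a function can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes a single hypothetical pair of relatives whose off-diagonal IBD value is 0, 0·5 or 1 with probabilities 1/4, 1/2 and 1/4, uses a plain zero-mean multivariate normal density in place of the REML likelihood (3), and picks arbitrary variance components and phenotypes.

```python
import numpy as np

def mvn_density(y, V):
    """Density of a zero-mean multivariate normal with covariance V at y."""
    n = len(y)
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)
    return np.exp(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad))

# Hypothetical IBD distribution for one pair (an assumption for illustration).
ibd_values = np.array([0.0, 0.5, 1.0])
ibd_probs = np.array([0.25, 0.5, 0.25])
sigma_g2, sigma_e2 = 1.0, 1.0      # assumed variance components
y = np.array([1.0, 1.0])           # assumed centred phenotypes

def covariance(r):
    """Phenotypic covariance for a pair with off-diagonal IBD value r."""
    Pi = np.array([[1.0, r], [r, 1.0]])
    return sigma_g2 * Pi + sigma_e2 * np.eye(2)

# Expectation method: likelihood evaluated at E[Pi].
L_E = mvn_density(y, covariance(ibd_probs @ ibd_values))

# Distribution method: expectation of the likelihood over Pi.
L_D = sum(p * mvn_density(y, covariance(r))
          for p, r in zip(ibd_probs, ibd_values))
```

Because the density is a non-linear function of the IBD value, `L_E` and `L_D` differ here even though both use the same IBD distribution.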

(ii) Computation of the full likelihood

Since the distribution of Π is rather complicated, marginal likelihood (5), involving an expectation with respect to Π, is hardly derivable unless the number of individuals is extremely small. Therefore, we propose a Monte Carlo strategy that approximates likelihood (5) by

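$$\hat{L}_D(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(\theta; y, \Pi_i) \tag{6}$$

where m is the number of imputed IBD matrices drawn based on the marker information. Each impute Π_i corresponds to an incidence matrix Z_i, and eqn (2) holds, so that Π_i = Z_i Z_i′. $\hat{L}_D(\hat{\theta})$ converges to $L_D(\hat{\theta})$ as m→∞, where θ̂ is the ML estimate of θ.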
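In practice the imputed likelihoods are usually available on the log scale, and averaging them directly can underflow. A minimal sketch of the Monte Carlo average, assuming the per-impute log-likelihoods have already been computed (the function name is illustrative):

```python
import numpy as np

def log_mean_exp(logliks):
    """Log of the average of exp(logliks), computed stably.

    `logliks` holds the log-likelihood of each imputed IBD matrix
    evaluated at the same theta."""
    logliks = np.asarray(logliks, dtype=float)
    peak = logliks.max()
    # Subtract the largest value before exponentiating to avoid underflow.
    return peak + np.log(np.mean(np.exp(logliks - peak)))
```

Subtracting the maximum before exponentiating leaves the result unchanged mathematically but keeps every exponent at or below zero.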

The estimate of θ is identical for all the imputes of Π; that is, instead of maximizing each imputed likelihood, the entire sum needs to be maximized. Let L_i = L(θ; y, Π_i) and ℓ_i = log L_i. The first and second derivatives of ℓ_i with respect to θ have closed-form solutions (Harville, 1977). ℓ, the logarithm of the Monte Carlo likelihood (6), is the target log-likelihood to maximize. Using the derivatives ∂ℓ_i/∂θ and ∂²ℓ_i/∂θ∂θ′, a Newton–Raphson-based EM algorithm can be used to estimate θ.

Algorithm. Given a set of imputed IBD matrices (or corresponding incidence matrices), estimation of variance components θ using the full likelihood function can be made via the following steps:

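(i) Find an initial estimate θ^(0).
(ii) Loop on k until convergence.
$$\theta^{(k+1)} = \theta^{(k)} - \delta\, H^{-1} g \tag{7}$$
where δ is a step size constant, ℓ_i(θ) = log L(θ; y, Π_i), and
$$g = \sum_{i=1}^{m} w_i \left.\frac{\partial \ell_i}{\partial \theta}\right|_{\theta = \theta^{(k)}} \tag{8}$$

$$H = \sum_{i=1}^{m} w_i \left.\frac{\partial^2 \ell_i}{\partial \theta\, \partial \theta'}\right|_{\theta = \theta^{(k)}} \tag{9}$$
In eqns (8) and (9), the weights are defined as
$$w_i = \frac{L(\theta^{(k)}; y, \Pi_i)}{\sum_{j=1}^{m} L(\theta^{(k)}; y, \Pi_j)} \tag{10}$$
(iii) Take the converged estimate θ̂ as the final variance component estimates.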

This algorithm is based on Newton–Raphson iterations but is an EM algorithm since the weights in the gradient and Hessian need to be updated using the current variance component estimates. There are two advantages with this implementation. First, imputing incidence matrices (Z_i) is much easier than creating the IBD matrices (Π_i), which are not required by the proposed algorithm. Second, for large pedigrees with a few founders, when the marker is partially informative, the rank of the expected IBD matrix E[Π] is much greater than the rank of Z_i. A low rank of Z_i increases the computational efficiency for maximizing the likelihood function (Rönnegård et al., 2007).
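One iteration of the scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the per-impute log-likelihoods, gradients and Hessians are taken as given inputs (the closed-form REML derivatives are not reproduced here), the weights are taken to be the normalized imputed likelihoods, and the gradient and Hessian are taken to be weighted sums of the per-impute quantities. All function names are hypothetical.

```python
import numpy as np

def em_weights(logliks):
    """Normalized likelihood weights w_i from per-impute log-likelihoods."""
    logliks = np.asarray(logliks, dtype=float)
    w = np.exp(logliks - logliks.max())   # shift for numerical stability
    return w / w.sum()

def newton_em_step(theta, logliks, grads, hessians, delta=1.0):
    """One weighted Newton-Raphson update of the variance components.

    logliks  : (m,)      log-likelihood of each impute at theta
    grads    : (m, p)    gradient of each imputed log-likelihood at theta
    hessians : (m, p, p) Hessian of each imputed log-likelihood at theta
    """
    w = em_weights(logliks)
    g = np.tensordot(w, np.asarray(grads, dtype=float), axes=1)     # sum_i w_i g_i
    H = np.tensordot(w, np.asarray(hessians, dtype=float), axes=1)  # sum_i w_i H_i
    return np.asarray(theta, dtype=float) - delta * np.linalg.solve(H, g)
```

The weights are recomputed from the current estimate at every iteration, which is the step that gives the procedure its EM character; a step size δ below 1 can be used to damp the update if the iterations overshoot.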

(iii) Simple illustrative examples

Two pedigrees were simulated to show different properties of the method. Pedigree I was simulated to show that the non-negligible bias from the expectation method can be adjusted by the distribution method. Pedigree II was simulated to show that the proven relationship between the likelihood estimate from the expectation method and that from the distribution method for full-sib families does not hold for F₂ intercrosses.

(a) Pedigree I

In an F₂ intercross (Fig. 1a), a single pair of grandparents were mated to produce one male and two female F₁ progeny. They were thereafter mated to obtain five F₂ offspring. At a putative QTL, genotypes were simulated for each individual, and there are four alleles (A, B, C and D) throughout the pedigree. The phenotypic value for individual i was simulated by

$$y_i = \mu + \gamma_{i1} + \gamma_{i2} + \varepsilon_i \tag{11}$$

where μ=50, ε_i is the residual error term, and γ_i1 and γ_i2 correspond to the paternal and maternal allele substitution effects. Five sets of allele substitution effects were simulated according to five different σ_g² values. For instance, given a sample variance of σ_g²=15, the allelic effects can be assigned as γ_A=3, γ_B=6, γ_C=9 and γ_D=12, which give a consistent estimate for the genetic variance, and the simulated phenotypic values were 64·87, 56·21, 61·28, 69·20 and 62·08 for individuals 6–10, respectively, for this particular set of allelic effects. In Fig. 1a, the kinship information is known but the genotypes are not observed. The (narrow sense) heritability (Lynch & Walsh, 1997) of the studied trait is defined as the intra-class correlation by
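The worked example of allelic effects can be checked with a few lines of NumPy. The sketch below confirms that the effects 3, 6, 9 and 12 have a sample variance of 15 and shows how a phenotype could be generated from an additive model with mean 50. The residual standard deviation `sigma_e` is a placeholder chosen for illustration, not a value from the study, and the function name is hypothetical.

```python
import numpy as np

# Allele substitution effects from the worked example in the text.
allele_effects = {"A": 3.0, "B": 6.0, "C": 9.0, "D": 12.0}

# Sample variance (divisor n - 1) of the four allelic effects.
sample_var = np.var(list(allele_effects.values()), ddof=1)

def simulate_phenotype(genotype, rng, mu=50.0, sigma_e=1.0):
    """Overall mean plus the two allelic effects plus a normal residual.

    `sigma_e` is an assumed placeholder value."""
    paternal, maternal = genotype
    residual = rng.normal(0.0, sigma_e)
    return mu + allele_effects[paternal] + allele_effects[maternal] + residual

rng = np.random.default_rng(2011)
example = simulate_phenotype(("A", "D"), rng)
```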

(12)

which measures how large a proportion of the trait variation is determined by inheritance. We used both the expectation and the distribution method to estimate h² using the five animals in the F₂ generation. Note that, equivalently, this example compares the variance component estimates from the two methods given an uninformative marker (no marker information) in QTL analyses.
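With no marker information, one way to draw an imputed incidence matrix is gene dropping: each founder is given two unique alleles and every descendant inherits one randomly chosen allele from each parent. The sketch below illustrates this idea under assumptions; it is not the authors' imputation code. The mating structure in `pedigree` is hypothetical (the text does not state which F₁ female is the dam of each F₂ animal), Z is coded as allele counts, and Π is formed as ZZ′.

```python
import numpy as np

def gene_drop_incidence(pedigree, rng):
    """Draw one incidence matrix Z by gene dropping.

    pedigree : list of (sire, dam) index pairs, (None, None) for founders,
               with parents listed before their offspring.
    Returns an N x 2f matrix of founder-allele counts (f = number of founders).
    """
    n_founders = sum(1 for sire, _ in pedigree if sire is None)
    alleles = []        # the two founder-allele labels carried by each individual
    next_label = 0
    for sire, dam in pedigree:
        if sire is None:
            alleles.append((next_label, next_label + 1))
            next_label += 2
        else:
            # Mendelian sampling: one random allele from each parent.
            alleles.append((alleles[sire][rng.integers(2)],
                            alleles[dam][rng.integers(2)]))
    Z = np.zeros((len(pedigree), 2 * n_founders))
    for i, pair in enumerate(alleles):
        for label in pair:
            Z[i, label] += 1.0
    return Z

# Hypothetical layout: 2 founders, 3 F1 (one male, two females), 5 F2.
pedigree = [(None, None), (None, None),
            (0, 1), (0, 1), (0, 1),
            (2, 3), (2, 3), (2, 3), (2, 4), (2, 4)]
Z = gene_drop_incidence(pedigree, np.random.default_rng(1))
Pi = Z @ Z.T    # one imputed IBD matrix under the assumed coding
```

Repeating the draw m times gives the set of imputes over which the likelihood is averaged; Z has only 2f columns, which keeps each likelihood evaluation cheap when there are few founders.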

Fig. 1. Pedigree I – a simulated F₂ intercross with 10 individuals including two founders. (a) A three-generation intercross with five offspring, three parents and two grandparents. Squares and circles denote male and female animals, respectively, with indices inside. Bubbles with arrows pointing to each animal indicate the true genotypes that are not observed. (b) For the distribution method and the expectation method, the asymptotic trend of bias in estimating the narrow sense heritability is displayed with respect to the heritability of the trait.

(b) Pedigree II

One male was mated to three females to produce six F₁ individuals, and the F₁s were mated to obtain 10 F₂ offspring (Fig. 2). At a putative QTL, allele substitution effects for the eight alleles of the founders were simulated from a normal distribution with a zero mean and a standard deviation of 3, and they were inherited through the pedigree (Table 1). The phenotypes were calculated by summing the allele substitution effects, an overall mean of 200 and an error term drawn from . Assuming complete linkage to the QTL, two sets of marker genotypes were simulated (Markers I and II). We computed the log-likelihood function using both the expectation and the distribution method.

Fig. 2. Pedigree II – a simulated F₂ intercross with 18 individuals including four founders. Squares and circles denote male and female animals, respectively, with indices inside. Each dashed curve connects the same individual for the purpose of clear display.

Table 1. The tabular form of pedigree II. Two markers that are completely linked to the QTL were simulated. ℓ_D and ℓ_E are the log-likelihoods from the distribution method and the expectation method, respectively. Given marker I, ℓ_D > ℓ_E, while given marker II, ℓ_D < ℓ_E.

(iv) Analyses of experimental data

(a) Simulation using real genotypes

An F₂ cross was bred from two European wild boars mated to eight large white sows (Andersson et al., 1994). Four F₁ boars were mated to 22 F₁ sows to produce 191 recorded F₂ offspring in 26 families. The genetic information on chromosome 6 came from 22 genotyped micro-satellite markers at 0·0, 8·6, 36·6, 49·7, 50·5, 62·9, 79·2, 80·4, 83·7, 84·1, 84·8, 90·6, 95·4, 100·7, 101·9, 115·9, 116·7, 119·0, 120·2, 124·0, 127·0 and 170·9 cM. In order to investigate the power of the distribution method and the expectation method for this real dataset, we simulated a QTL at 25 cM, flanked by the markers at 8·6 and 36·6 cM. This simulated QTL position has low marker information since it is in the middle of the long interval between two markers with low information content. IBD probabilities at the QTL were calculated taking into account all the marker information across the chromosome. We simulated the phenotype under three different scenarios for the heritability: 90%, 10% and 0%. For each scenario, 1000 replicates were simulated, and for each replicate we calculated the LRT statistic for both the expectation and the distribution methods.

(b) Analysis of a real phenotype

A meat quality trait (reflectance value, EEL) recorded in the pig cross is strongly affected by the halothane gene located on chromosome 6 at position 80·4 cM. One of the founder boars was heterozygous (Hal^N/Hal^n) for this gene while all the other founders were homozygous (Hal^N/Hal^N) for the wild-type allele. In our analyses, we include sex, litter and slaughter weight as fixed effects (Knott et al., 1998). The causal mutation underlying this QTL had previously been evidenced by Lundström et al. (1995), and the dataset is here used to illustrate the properties of both the expectation and the distribution method. The statistic used for the QTL scan was the likelihood-ratio test (LRT) statistic, which is equivalent to the maximized likelihood function value but always non-negative. A permutation test was performed to obtain a significance threshold.
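A generic permutation threshold for a scan statistic can be sketched as below. The details are assumptions rather than the authors' procedure: `stat_fn` is a hypothetical callable that returns the maximum LRT over the scanned positions for a given phenotype vector, phenotypes are shuffled as a whole without any special handling of fixed effects, and the LRT helper uses the common form of twice the log-likelihood difference, truncated at zero.

```python
import numpy as np

def lrt_statistic(loglik_full, loglik_null):
    """Twice the log-likelihood difference, truncated at zero (assumed form)."""
    return max(0.0, 2.0 * (loglik_full - loglik_null))

def permutation_threshold(y, stat_fn, n_perm=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the scan statistic under permutation.

    stat_fn : hypothetical callable mapping a phenotype vector to the
              maximum test statistic over all scanned positions.
    """
    rng = np.random.default_rng(seed)
    null_stats = np.array([stat_fn(rng.permutation(y)) for _ in range(n_perm)])
    return np.quantile(null_stats, 1.0 - alpha)
```

Shuffling phenotypes breaks the link between trait and genotype while keeping the marker data intact, so the resulting quantile reflects the null distribution of the maximum statistic for this particular marker map and pedigree.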

To understand the informativeness of each marker, the polymorphism information content (PIC) was used and determined using the following equation:

$$\mathrm{PIC} = 1 - \sum_{i=1}^{t} p_i^2 - \sum_{i=1}^{t-1}\sum_{j=i+1}^{t} 2\,p_i^2\,p_j^2 \tag{13}$$

where p_i is the frequency of the ith allele and t is the number of alleles (Botstein et al., 1980).
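The PIC of a marker can be computed directly from its allele frequencies. The short function below is a sketch of the standard Botstein et al. formula; the function name is illustrative.

```python
def polymorphism_information_content(freqs):
    """PIC from allele frequencies p_1..p_t (assumed to sum to 1)."""
    freqs = list(freqs)
    t = len(freqs)
    # Probability that two random alleles are identical in state.
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over all pairs of distinct alleles.
    cross_term = sum(2.0 * freqs[i] ** 2 * freqs[j] ** 2
                     for i in range(t - 1)
                     for j in range(i + 1, t))
    return 1.0 - homozygosity - cross_term
```

For example, a marker with two equally frequent alleles has a PIC of 0·375, and a monomorphic marker has a PIC of 0.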

3. Results and discussion