Why I’m wary of candidate gene studies

Candidate genes

The candidate gene approach to finding genes that determine phenotypic variation is hugely attractive. Rather than performing whole genome scans, involving large amounts of time and money (and a statistical burden of multiple testing), why not just look at those genes which are good candidates? Candidates can be identified by looking through the literature to find genes that influence similar traits in other organisms such as humans, mice, fruit flies or Arabidopsis. There is no doubt that the candidate gene approach has yielded some nice results, perhaps most obviously in studies of plumage and coat colour polymorphism in vertebrates [1-3]. However, at least in the evolutionary genetics literature, most successes are for traits with a simple Mendelian basis, and where there is a well-established pathway of genes that interact to produce the phenotype of interest. In other words, the candidate gene approach works well for simple traits.

Of course, most traits are complex, and are determined by many genes of small effect. I worry – only a little bit, not in the ‘I can’t sleep at night’ sense – that the candidate gene approach applied to these traits can send people on a wild goose chase. As I explain below, I think there are already a number of studies in the molecular ecology field where people are investing too much effort into following up candidate gene results which are, at best, equivocal. The example I’m going to use comes from studies of the DRD4 gene and its possible role in behavioural variation in wild birds. I’ve chosen this example because I think the authors of the original studies have been extremely careful. If the results are not clear cut when this much care is taken, I think the problems I describe will be exacerbated in less rigorous studies (of which there are plenty).

Population structure – the known unknown

There are two main statistical problems when assessing whether candidate gene associations are significant. The first is the well-known issue that population structure can lead to false positive associations unless it is properly controlled for. In a study involving many thousands of tests, it is easy to see when population structure is an issue. A comparison of observed test statistics (or P values) against their expected distribution shows when datasets are prone to false positives. The slope of the regression of observed against expected values gives an inflation factor (lambda), which measures the degree of the problem. The plot below shows two lines, each with 100,000 statistical tests – one where there is no issue (lambda = 1) and another where there is a lot of inflation (lambda = 3), caused by population structure.
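To make the inflation factor concrete, here is a minimal Python sketch of one way to estimate lambda as a regression slope. It is not the code behind the plots in this post. It assumes chi-square statistics with 1 df, a regression through the origin of sorted observed values on expected quantiles, and simulated statistics in which structure simply scales a null chi-square by a constant; the function names are made up for illustration.

```python
import random
from statistics import NormalDist


def expected_chi2_quantiles(n):
    """Expected quantiles of n draws from a chi-square distribution with 1 df."""
    norm = NormalDist()
    # The chi-square (1 df) quantile at probability p is the square of the
    # standard normal quantile at (1 + p) / 2.
    return [norm.inv_cdf((1 + (i - 0.5) / n) / 2) ** 2 for i in range(1, n + 1)]


def inflation_factor(observed_chi2):
    """Slope of the regression (through the origin) of observed on expected."""
    observed = sorted(observed_chi2)
    expected = expected_chi2_quantiles(len(observed))
    numerator = sum(o * e for o, e in zip(observed, expected))
    denominator = sum(e * e for e in expected)
    return numerator / denominator


rng = random.Random(1)
# Assumption: no structure gives a null chi-square; structure multiplies it by 3.
null_stats = [rng.gauss(0, 1) ** 2 for _ in range(20000)]
inflated_stats = [3 * rng.gauss(0, 1) ** 2 for _ in range(20000)]

lambda_null = inflation_factor(null_stats)
lambda_inflated = inflation_factor(inflated_stats)
print(round(lambda_null, 2), round(lambda_inflated, 2))
```

With tens of thousands of tests the two estimates should sit close to 1 and 3 respectively; with only a handful of tests the slope is far too noisy to be informative.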


You might be thinking that an inflation factor of 3 is pretty extreme – it probably is – but it certainly isn’t unprecedented. Susan Johnston (@SuseJohnston) published a very nice paper [4] a few years ago where she performed a GWAS for the age at which wild salmon migrated back to freshwater from the sea. In that study, lambda was 3.24, and without controlling for the genetic structure there would have been many false positive results. We can see the effect of failing to control for structure below:

Here, I have simulated 100,000 datapoints (imaginary SNPs), assuming either no structure (lambda = 1) or salmon-like structure (lambda = 3). The plot below shows the distribution of chi-square values following association testing. For clarity, I’ve restricted the x-axis to range from 0.5 to 20, but in fact the difference between the histograms is more extreme than it looks, because the structured dataset has a long tail, while more than half of the lambda = 1 datapoints have chi-square values < 0.5. In the structured dataset, if we (naively) assumed our test statistic followed a chi-square distribution with 1 df we would have a massive false positive problem. Almost 26% of the datapoints are ‘significant’ at P < 0.05, 13.7% are significant at P < 0.01 and 5.7% are significant at P < 0.001. In other words, we are in real danger of assigning biological importance to what are actually chance results.
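A simulation along these lines can be sketched in a few lines of Python. This is an illustration rather than the original simulation: it assumes that structure inflates every null chi-square statistic (1 df) by the same factor lambda, and it uses the identity that the upper-tail probability of a 1 df chi-square value x is erfc(sqrt(x / 2)).

```python
import math
import random


def chi2_1df_pvalue(x):
    """Upper-tail P value for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def fraction_significant(stats, alpha):
    """Proportion of test statistics with a naive P value below alpha."""
    return sum(chi2_1df_pvalue(x) < alpha for x in stats) / len(stats)


rng = random.Random(42)
n_snps = 100_000
lam = 3.0  # assumed inflation factor for the structured dataset

# Squared standard normal draws are chi-square with 1 df.
unstructured = [rng.gauss(0, 1) ** 2 for _ in range(n_snps)]
structured = [lam * rng.gauss(0, 1) ** 2 for _ in range(n_snps)]

for alpha in (0.05, 0.01, 0.001):
    print(alpha,
          round(fraction_significant(unstructured, alpha), 4),
          round(fraction_significant(structured, alpha), 4))
```

Under these assumptions the unstructured fractions should sit near the nominal alpha values, while the structured fractions should land close to the percentages quoted above.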


Fortunately, when we perform a GWAS, there are methods for controlling for the population genetic structure – the simplest is just dividing the test statistic by the inflation factor before calculating the P value. There are other more sophisticated approaches, but they all do a good job of dealing with the effects of structure. So what’s the problem for candidate gene studies then?
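The simple correction described above can be sketched as follows. This is a hedged illustration, assuming a chi-square test statistic with 1 df and an inflation factor that has already been estimated from genomewide data; the function names and the example values are made up. Treating a lambda below 1 as 1 is a common convention that is assumed here, not something taken from the studies discussed.

```python
import math


def chi2_1df_pvalue(x):
    """Upper-tail P value for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def genomic_control_pvalue(chi2_stat, inflation_factor):
    """Divide the statistic by lambda, then convert it to a P value."""
    # Assumed convention: never inflate a statistic when lambda < 1.
    lam = max(inflation_factor, 1.0)
    return chi2_1df_pvalue(chi2_stat / lam)


# Hypothetical test statistic of 6.0 in a dataset with lambda = 3.
raw_p = chi2_1df_pvalue(6.0)
corrected_p = genomic_control_pvalue(6.0, 3.0)
print(round(raw_p, 4), round(corrected_p, 4))
```

In this made-up case a result that looks significant at P < 0.05 before correction is nowhere near significant afterwards. The catch is that the correction needs an estimate of lambda in the first place.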

The problem is that we only know we have a problem of inflated test statistics when we have performed genomewide scans with many independent datapoints. When we do a candidate gene study, we typically look at one or a few genes, and so it is virtually impossible to know whether our significant results are real or simply an artefact of unappreciated genetic structure causing false positives.

So, one of the problems with candidate gene studies is that too few statistical tests have been performed for us to be able to say whether a gene really does have an effect on the trait. Strangely, the second statistical problem can arise when we have performed relatively few, yet still too many, tests: the multiple testing problem can then make it difficult to know whether a result is statistically significant or not. Typically, this happens when multiple SNPs are examined, sometimes in tandem with several traits and several different genetic models. The DRD4 studies are informative in this regard.
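A small worked example shows how quickly a modest number of tests erodes a nominal threshold. The sketch below assumes independent tests (which, for several SNPs and genetic models on the same birds, is at best an approximation) and uses a hypothetical design of 2 polymorphisms and 3 genetic models; the function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


# Hypothetical design: 2 polymorphisms x 3 genetic models = 6 tests.
n_tests = 2 * 3
print(round(familywise_error_rate(n_tests), 3))
print(round(bonferroni_threshold(n_tests), 4))
```

With 6 independent tests, the chance of at least one nominally significant result under the null is roughly one in four, and the Bonferroni threshold drops to about 0.008. Correlated tests make Bonferroni conservative, so the true picture lies somewhere between the corrected and uncorrected thresholds.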

DRD4 in wild birds – background

The paper that started the enthusiasm for studying DRD4 in birds is by Fidler and colleagues and was published in 2007 in Proc. R. Soc. B. [5]. The paper has been very influential; at the time of writing it has been cited more than 80 times. Briefly, the authors studied novelty-seeking behaviour in great tits, which was recorded by observing how long it took birds to explore novel objects in a cage or room. Data were collected on hand-reared wild birds, and on birds from selection lines for fast or slow ‘early exploratory behaviour’ (EEB). EEB had previously been shown to be heritable, and to respond to artificial selection in this population, hence the existence of divergent lines. The authors reasoned that the dopamine receptor D4 gene (DRD4) was a good candidate for explaining variation in behaviour because it has been associated with novelty-seeking behaviour in humans (see refs within Fidler).

The authors sequenced the full coding sequence of DRD4 in great tits – no easy task. In total they found 73 polymorphisms (66 SNPs and 7 indels). However, most of their attention was on the 3rd exon, because it was this region that was associated with relevant traits in other vertebrates. They first showed that the exon contained a single synonymous SNP (hereafter SNP830) which differed in frequency between the fast (n=29) and slow (n=21) lines. As the authors pointed out, genetic drift could cause these kinds of effects. Therefore, they analysed a further 91 hand-reared but unselected birds that were wild-caught as nestlings, from 17 nests.

The results of an association study on the wild birds are shown below – this figure is taken directly from the paper. Genotype TT is associated with the highest EEB score, which is consistent with the trend from the selection lines. This study was conducted in the days before many of the ways we correct for population structure were fully developed, but the authors were well aware of the problem. It is unlikely that there was much population-level structure in the data because great tits have a very large effective population size, there is very low genetic structure across Europe, and the birds were sampled from a single location. However, by sampling 91 birds from just 17 nests, structure is introduced by the inclusion of relatives. The authors addressed this problem by fitting nest as a random effect in their models. Today, we might take a slightly different approach and fit a relationship matrix (derived from the pedigree or markers) in the model, but the approach Fidler and colleagues took here probably deals with the problem adequately, and it certainly demonstrates an awareness of the problem and how to deal with it.


At face value, this looks like a pretty good story. The selection line and wild bird analyses are consistent, and the SNP is in the functionally relevant part of the gene. It’s worth stating here though, that the authors actually fitted three different genetic models, and they also typed an indel in the wild birds, so they actually performed 6 different (but perhaps not independent) tests on the data, which does make the statistical significance pretty marginal (even ignoring the possibility of inflated lambda). To their credit, Fidler and colleagues raised the possibility that the results could be an artefact, caused by population genetic structure. I have no beef with this paper – it was ground-breaking, potentially important, and yet circumspect. However, my worry is that subsequent studies seem to assume that the DRD4 story is both robust and repeatable elsewhere. I don’t think it is.

DRD4 in great tits – followup studies

The first follow-up paper to Fidler was by the same group – this time led by Peter Korsten [6]. This is a really nice example of an attempt to replicate the original results. Not only do the authors type a new set of wild birds from the original population (Westerheide, in the Netherlands), but they also look at three other great tit populations – Wytham Woods in Oxford, Boshoek in Belgium and Lauwersmeer in the Netherlands. Similar tests of exploratory behaviour were used in each population. The original SNP – SNP830 – and the indel that significantly differed between the fast and slow EEB lines in Fidler were typed in all 4 populations.

The results from Korsten et al are quite striking. First of all, the SNP830 result in Westerheide was replicated in the new set of 77 birds sampled from the wild. Second, the associations could not be replicated in any of the other 3 populations. An analysis pooling data from all 4 populations did not show an association between SNP830 and exploratory behaviour. There was a significant interaction term between SNP genotype and population, but this could mean either that there are genuine gene by environment interactions, or that the original Westerheide result was a false positive and the others are non-significant. Again, the association between SNP830 and exploratory behaviour in Westerheide is only nominally significant, and it would be non-significant if a correction for the number of loci, populations or genetic models tested were applied. Regardless though, the evidence for SNP830 being associated with EEB in this population is strengthened by this study.

One of the puzzles from Korsten et al is why the original result wasn’t replicated elsewhere. One possibility that the authors suggested is that SNP830 isn’t the causal mutation, but is in LD with an unknown causal mutation close by. If the LD structure varied between populations, then one might not expect to see similar associations with SNP830 in other populations. Another possibility is that there is no real association in any of the populations, including Westerheide.

In 2013, the same authors (led by Jakob Mueller) typed a much larger number of polymorphisms in the DRD4 region, in the same birds as used by Korsten [7]. Again, this is a very substantial and impressive piece of work. This time, around 98 polymorphisms were studied, which made it possible to define linkage disequilibrium around DRD4 and to compare patterns of LD between populations. In addition, there were now enough tests to get some feel for whether test statistics were inflated due to, for example, population structure or the presence of relatives. In all 4 populations, LD declined quite rapidly, and patterns were quite similar across populations. This means that (i) the different SNPs can be regarded as fairly independent tests (especially if they are in different haploblocks) and (ii) the suggestion that Westerheide shows an association with SNP830, but the others don’t, due to heterogeneity in the LD between SNP830 and a causal variant, is not really supported.

The data from Mueller et al are on Dryad, and so I have had a look at them in a little more detail. I was interested to know whether (i) there is much evidence for test statistic inflation (i.e. lambda >1), which could be driving false positive results and (ii) whether the significant results look very compelling. For simplicity, I have only looked at additive models for a couple of reasons. First, the original SNP830 association in Westerheide was more significant using this model compared to models that assumed dominance or overdominance. Second, I don’t find the reasons for fitting overdominance models very compelling, although the reasons for that can wait for a future post.

I followed the model fitting procedure used by Mueller; SNP genotypes were coded 0, 1 or 2 and fitted as a covariate in models of association with personality score. For the Westerheide population, brood was fitted as a random effect. Plots showing observed against expected P values (assuming a uniform distribution of P values between 0 and 1) are shown below. Each SNP is a separate datapoint and SNP830 is shown as a red dot.
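For readers who want to build this kind of plot themselves, here is a minimal Python sketch of how the points can be computed. It is not the code used for the plots in this post: the -log10 scale is an assumption (it is the usual convention for such plots), the expected value for the k-th smallest of n P values is taken as k / (n + 1), and the example P values are made up.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale, smallest P first."""
    n = len(p_values)
    observed = sorted(p_values)
    points = []
    for rank, p in enumerate(observed, start=1):
        # Expected k-th smallest of n uniform(0, 1) draws is k / (n + 1).
        expected = rank / (n + 1)
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Made-up P values standing in for per-SNP association tests.
example_p = [0.81, 0.02, 0.47, 0.33, 0.64]
points = qq_points(example_p)
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points lying along the 1:1 line are behaving as expected under the null, a general upward tilt suggests inflation, and a single SNP sitting well above an otherwise well-behaved line is the pattern one hopes to see for a real association. Note that a P value of exactly zero would need special handling before taking logs.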




Let’s first look at the Westerheide population – the combined data from both years (i.e. the Fidler study and the Korsten study) shows that SNP830 was more strongly associated with exploratory behaviour than any other locus. Not only that, but the data fit pretty well on the lambda = 1 line, but with SNP830 some way above it. This looks pretty convincing to me, although compared to the most significant loci in a genome wide association study, it is not very significant. It’s also noteworthy that when the data are restricted to birds sampled in 2007 – these are the birds in the Korsten follow-up study – SNP830 is still the most significantly associated SNP.

In the three other populations, SNP830 is unremarkable. There is no evidence whatsoever that it is associated with exploratory behaviour. The other populations do not show serious problems with inflation of test statistics, although there is a hint of it in the Wytham Woods population. The most significant SNP in Lauwersmeer is more significant than expected by chance, but not by much. It should be said that none of these SNPs reach significance after applying a Bonferroni correction or similar procedure.

On balance then, I think the SNP830 association in Westerheide looks reasonably convincing, but I don’t think associations between DRD4 and exploratory behaviour in other great tit populations have been demonstrated. The authors of these papers have always been very careful to highlight alternative explanations, and overall it’s an impressive body of work. What I find surprising is that, despite these fairly equivocal results, there seems to be a slew of follow-up DRD4 candidate gene studies in other bird species, with what look to me to be fairly wishful interpretations of the results. I worry that a narrative is building up that DRD4 explains lots of variation in personality traits in lots of different bird species. What do these other papers show?

DRD4 in other bird populations

I’ll briefly discuss some of the other studies.

Gillingham and colleagues looked at DRD4 in relation to body condition in flamingoes [8]. They argue that DRD4 genotypes are associated with the phenotype, but it is pretty hard to tell whether there is much going on. They use an unconventional (for association studies) approach of AIC-based multi-model parameter estimation and they only test DRD4. I dislike this approach for association studies, because by studying a single locus (as is the case here) or fitting one locus at a time it is really easy for a locus that doesn’t affect a trait to appear to explain a few % of the variation. The authors actually typed 10 microsatellites, and it is my guess that fitting genotypes at the microsatellites would have yielded similar results to DRD4. Certainly, I think the claim that “This is to our knowledge, the first study to show an association between exon 3 DRD4 polymorphism and body condition in non-human animals” needs some more validation before it can be believed.

In a study of adaptation to urban environments in blackbirds [9], Mueller et al (2013) examined 16 candidate polymorphisms (including 5 in DRD4) in 12 populations, but there was no compelling evidence that DRD4 played a role in adaptation.

Mueller and colleagues also scored the effects of DRD4 polymorphisms on exploratory behaviour in yellow-crowned bishops [10]. They replicated the study in Portuguese and Spanish populations. The sample sizes were fairly small, and there was no control for population structure. A large proportion of SNPs were non-significant, but two SNPs were associated with behavioural variation. One of these SNPs was significant in both populations; the other was significant in one population and approached significance in the other. The directions of the associations were the same in both populations. If both SNPs had been significant in both populations (they narrowly missed), the result would have been experiment-wide significant. This study looks worthy of follow-up, although I’d be a little wary of population structure being an issue given the demographic history of this species and the way the birds were sampled/obtained. The jury’s still out for me.

Garamszegi and colleagues looked at two DRD4 SNPs in relation to three behavioural traits in collared flycatchers [11]. One association was significant and another marginally so – again though, we know very little about what test statistics would be generated by other loci and whether population structure is a confounding issue.

In a study of invasive starling populations in Australia [12], Rollins used propensity to enter a trap as a proxy for boldness. There were no significant associations between DRD4 and boldness – sample sizes were quite large in this study.

There has been a new great tit study this year [13], with an Estonian population typed at SNP830 and measured for feeding times/delays when presented with a novel object near the nest. In males, birds with the CC genotype delayed feeding longer than other birds, but the effect was not observed in females. The sample sizes were modest. It’s a tantalising result.

Finally, an attempt to examine DRD4 in relation to personality traits in Seychelles Warblers found no DRD4 polymorphisms [14].

Final Thoughts

Overall, I think some of the follow-up studies provide some tentative evidence for associations between DRD4 and personality traits, but I’m not convinced that a randomly chosen gene would have given much weaker evidence. Would I take a candidate gene approach to studying behavioural variation in birds? Probably not. If we cannot replicate results across different great tit populations (where genetic structure is low), it doesn’t seem likely that we can easily detect variation in other species. Is there good evidence that DRD4 explains behavioural variation in other bird species? I don’t think so.

I do think that the main problem with candidate gene studies is the more general one, that we just don’t know how significant something is until we have seen data from across the genome. Fortunately, with whole genome scans becoming ever easier, the need to even perform candidate gene studies is diminishing. For what it’s worth, I would avoid them unless the trait has a really simple Mendelian basis and the genetic pathways underlying the trait are well understood.


1. Nachman et al. (2003) The genetic basis of adaptive melanism in pocket mice. PNAS 100: 5268–5273.
2. Mundy et al. (2004) Conserved genetic basis of a quantitative plumage trait involved in mate choice. Science 303: 1870–1873.
3. Gratten et al. (2007) Compelling evidence that a single nucleotide polymorphism in TYRP1 is responsible for coat colour polymorphism in a free-living population of Soay sheep. Proc. R. Soc. Lond B. 274: 619–626.
4. Johnston et al. (2014) Genome-wide SNP analysis reveals a genetic basis for sea-age variation in a wild population of Atlantic salmon (Salmo salar). Molecular Ecology 23: 3452–3468.
5. Fidler et al. (2007) Drd4 gene polymorphisms are associated with personality variation in a passerine bird. Proc. R. Soc. Lond B. 274: 1685–1691.
6. Korsten et al. (2010) Association between DRD4 gene polymorphism and personality variation in great tits: a test across four wild populations. Molecular Ecology 19: 832–843.
7. Mueller et al. (2013) Haplotype structure, adaptive history and associations with exploratory behaviour of the DRD4 gene region in four great tit (Parus major) populations. Molecular Ecology 22: 2797–2808.
8. Gillingham et al. (2012) Genetic polymorphism in dopamine receptor D4 is associated with early body condition in a large population of greater flamingos, Phoenicopterus roseus. Molecular Ecology 21: 4024–4037.
9. Mueller et al. (2013) Candidate gene polymorphisms for behavioural adaptations during urbanization in blackbirds. Molecular Ecology 22: 3629–3637.
10. Mueller et al. (2014) Behaviour-related DRD4 polymorphisms in invasive bird populations. Molecular Ecology 23: 2876–2885.
11. Garamszegi et al. (2014) The relationship between DRD4 polymorphisms and phenotypic correlations of behaviors in the collared flycatcher. Ecology and Evolution 4: 1466–1479.
12. Rollins et al. (2015) Is there evidence of selection in the dopamine receptor D4 gene in Australian invasive starling populations? Current Zoology 61: 505–519.
13. Timm et al. (2015) DRD4 gene polymorphism in great tits: gender-specific association with behavioural variation in the wild. Behavioral Ecology and Sociobiology 69: 729–735.
14. Edwards et al. (2015) No Association between Personality and Candidate Gene Polymorphisms in a Wild Bird Population. PLoS ONE 10: e0138439.