A question on exclusion of study participants for an Exome genotyping array

I'm reading a paper that used whole exome sequencing on African American and European populations to discover novel low-frequency and rare variants associated with lipid levels and the risk of coronary heart disease.

They used the Illumina Human Exome genotyping array, which was designed from coding variants discovered by sequencing the exomes of 12,000 individuals. They therefore selected their participants from a study whose members were not among the 12,000 individuals used to design the array.

My question is: why would they select individuals who were not among the 12,000 used to design the array? (I'm thinking it might otherwise produce false-positive associations or some other bias?)

When you design an array, you need probes on the surface that are complementary to the sequence you want to detect. Depending on what you want to detect, you design these probes with a known sequence at a known position. If you want to detect single nucleotide polymorphisms (SNPs), you need a library of known SNPs on your chip, which basically means the position of each SNP and its surrounding sequence.

SNPs can be roughly grouped into two subgroups: common (as I call them) and rare (frequency in the population lower than 0.1%). The problem with this chip-based method is that it can only detect SNPs that are already known (or that are at least located close to known SNPs with which they are in linkage disequilibrium). So if you want to detect rare SNPs in your population sample, you need a big group of people from which you take the SNPs for your chip design. If you look at the numbers, a rare variant at 0.1% frequency will only be present in about 12 of the 12,000 persons used in the design.
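The arithmetic behind that last figure can be sketched as follows. This is only an illustration: it treats the 0.1% as the fraction of people carrying the variant (as the answer does), and the function names are made up for the example.

```python
def expected_carriers(n_people, carrier_freq):
    """Expected number of people carrying a variant in a panel of n_people."""
    return n_people * carrier_freq


def prob_seen_at_least_once(n_people, carrier_freq):
    """Chance that at least one person in the panel carries the variant,
    assuming carriers occur independently."""
    return 1.0 - (1.0 - carrier_freq) ** n_people


# Design panel from the answer: 12,000 people, variant carried by 0.1% of them.
print(expected_carriers(12_000, 0.001))       # about 12 carriers
print(prob_seen_at_least_once(12_000, 0.001))  # very close to 1
print(prob_seen_at_least_once(100, 0.001))     # a small panel usually misses it
```

With only 100 people the variant would be seen less than one time in ten, which is why a large discovery panel is needed before a rare variant can be placed on the chip.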

Your study group is then different from the group of people who were used to make the chip design. Here you take people with a certain background (high lipid levels, for example) and then check whether the people who had coronary heart problems have SNPs in common that might be connected to the disease. This can help to identify a mutation in a protein that acts as the risk factor.

dbGaP Collection: NHLBI Heart Failure Related dbGaP Data (No IRB requirement)

The National Heart, Lung, and Blood Institute (NHLBI) Big Data Analysis Challenge: Creating New Paradigms for Heart Failure Research has concluded. The NHLBI will continue to make this collection of data available to the research community.

Past information regarding the Data Collection for the NHLBI Big Data Analysis Challenge: Creating New Paradigms for Heart Failure Research is below.

The National Heart, Lung, and Blood Institute (NHLBI), part of the National Institutes of Health (NIH), is inviting novel Solutions for the NHLBI Big Data Analysis Challenge: Creating New Paradigms for Heart Failure Research. The goal of the challenge is to foster innovation in computational analysis and machine learning approaches utilizing large-scale NHLBI-funded datasets to identify new paradigms in heart failure research. The challenge aims to address the need for new open source disease models that can define sub-categorizations of adult heart failure to serve as a springboard for new research hypotheses and tool development in areas of heart failure research from basic to clinical settings.

NHLBI has a history of making considerable investments in the creation of deep data resources including: long-standing, deeply-phenotyped epidemiological cohorts, innovative clinical trials, and large-scale precision medicine efforts that have generated whole genome sequencing and "other omics" data for more than one-hundred thousand individuals. To provide challenge participants with streamlined access to data across NHLBI's numerous studies containing heart failure data, the NHLBI has created this dbGaP study collection. Data access for this collection is controlled by NHLBI's Data Access Committee. Challenge participants are required to follow the use restrictions and acknowledgement instructions from the original dataset(s). Please note that the data in this collection are not harmonized across studies or otherwise altered from the original study.

The NHLBI Big Data Analysis Challenge: Heart Failure Data Collection contains all NHLBI studies currently in dbGaP that contain data that may be relevant to research on heart failure. The studies in this collection are approved for General Research Use (GRU), Health/Medical/Biomedical Use (HMB), or Disease-Specific Use (DS) that permits research on heart failure. These studies span a variety of study designs, inclusion and exclusion criteria, sample sizes, and provide a wide breadth and depth of phenotype data on study participants. The available genomic data in these studies also varies, including genotyping arrays, sequencing (targeted, exome, whole genome), and additional -omic data (e.g., RNA or metabolite profiles). Please refer to each study's individual accession page to learn more about how study data were collected.

Challenge participants are reminded that, in addition to dbGaP, NHLBI's Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) contains other studies that have collected data relevant to heart failure. Participants may wish to utilize BioLINCC data in the development of their Solution.

Please refer to the individual accession page for each study in this collection to learn more about the history of each study.

Surveying the contribution of rare variants to the genetic architecture of human disease through exome sequencing of 177,882 UK Biobank participants

The UK Biobank (UKB) represents an unprecedented population-based study of 502,543 participants with detailed phenotypic data and linkage to medical records. While the release of genotyping array data for this cohort has bolstered genomic discovery for common variants, the contribution of rare variants to this broad phenotype collection remains relatively unknown. Here, we use exome sequencing data from 177,882 UKB participants to evaluate the association between rare protein-coding variants and 10,533 binary and 1,419 quantitative phenotypes. We performed both a variant-level phenome-wide association study (PheWAS) and a gene-level collapsing analysis-based PheWAS tailored to detecting the aggregate contribution of rare variants. The latter revealed 911 statistically significant gene-phenotype relationships, with a median odds ratio of 15.7 for binary traits. Among the binary trait associations identified using collapsing analysis, 83% were undetectable using single variant association tests, emphasizing the power of collapsing analysis to detect signal in the setting of high allelic heterogeneity. As a whole, these genotype-phenotype associations were significantly enriched for loss-of-function mediated traits and currently approved drug targets. Using these results, we summarise the contribution of rare variants to common diseases in the context of the UKB phenome and provide an example of how novel gene-phenotype associations can aid in therapeutic target prioritisation.


Between 2006 and 2011, 656 patients with bacterial meningitis were included in the MeninGene study 21. Of these patients, 469 (72%) had pneumococcal meningitis. After quality control filters, 408 pneumococcal meningitis patients and 2072 controls were included in the genetic association analysis for meningitis disease susceptibility. Demographic and clinical data of the successfully genotyped patients can be found in Table 1. After exclusion of all monomorphic and sex chromosome variants, a total of 100,464 single nucleotide polymorphisms (SNPs) passed quality control thresholds of >95% call rate and Hardy-Weinberg equilibrium and were incorporated in the association analysis.

The genomic control parameter λ was 0.31 when all variants were included. This was driven by the rare variants: excluding variants with a minor allele frequency below 0.01% increased λ to 0.93 (Supplementary Fig. 1). For our single marker analysis we therefore included only variants with a minor allele frequency higher than 0.01%.
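For readers unfamiliar with λ, a minimal sketch of the usual definition is given here: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The excerpt does not say how the authors computed it, so the conversion from two-sided p-values and the function names are assumptions for illustration.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_from_p(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = -NormalDist().inv_cdf(p / 2.0)
    return z * z


def genomic_control_lambda(p_values):
    """Median observed chi-square divided by the median expected under the null."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# Evenly spread p-values behave like a well-calibrated null: lambda is close to 1.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_control_lambda(null_like))

# P-values piled up near 1 (as rare variants with few carriers tend to give)
# push lambda well below 1.
print(genomic_control_lambda([0.9] * 1000))
```

A value far below 1, such as the 0.31 reported above, indicates that the test statistics are deflated rather than inflated.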

None of the tested genetic variants reached the Bonferroni corrected significance threshold (p-value < 5 × 10⁻⁷). Six genetic variants associated with pneumococcal meningitis reached p-values lower than 1 × 10⁻⁴ (Fig. 1). Our strongest signal was the missense variant rs139064549 in the collagen type XI alpha 1 (COL11A1) gene (p = 1.51 × 10⁻⁶; G allele OR 3.21 [95% CI 2.05–5.02]) and the second strongest was the intron variant rs9309464 in the exocyst complex component 6B (EXOC6B) gene (p = 6.01 × 10⁻⁵; G allele OR 0.66 [95% CI 0.54–0.81]; Table 2). Of these six variants with a p-value lower than 1 × 10⁻⁴, three were located in the fibrous sheath CABYR binding protein (FSCB) gene, namely rs3809429 (p = 6.80 × 10⁻⁵; A allele OR 1.65 [95% CI 1.30–2.09]), rs3825630 (p = 6.80 × 10⁻⁵; G allele OR 1.65 [95% CI 1.30–2.09]) and rs1959379 (p = 8.81 × 10⁻⁵; A allele OR 1.64 [95% CI 1.30–2.09]). The sixth variant, rs617169 (p = 7.33 × 10⁻⁵; G allele OR 0.72 [95% CI 0.61–0.85]), was not located within a gene, but the nearest gene was protein kinase N2 (PKN2) located on chromosome 2.
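The quoted threshold is consistent with a plain Bonferroni correction over the 100,464 variants that passed quality control. The sketch below assumes a family-wise error rate of 0.05, which the excerpt does not state explicitly.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumed alpha of 0.05 spread over the 100,464 SNPs passing quality control.
threshold = bonferroni_threshold(0.05, 100_464)
print(threshold)  # close to 5e-7

# The strongest single-marker signal (COL11A1 rs139064549) stays above the threshold.
print(1.51e-6 < threshold)
```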

Manhattan plot of p-values.

The y-axis indicates the −log10 (p-values) of the SNPs in the association analysis and the x-axis indicates the chromosomal position. The horizontal blue line indicates the threshold of p = 1 × 10⁻⁴. All markers of the three genes with the highest association signal, namely COL11A1 (rs139064549, p-value = 1.51 × 10⁻⁶, location chromosome 1), EXOC6B (rs9309464, p-value = 6.01 × 10⁻⁵, location chromosome 2) and FSCB (with rs3809429, rs3825630 and rs1959379, p-value = 6.80 × 10⁻⁵, 6.80 × 10⁻⁵ and 8.81 × 10⁻⁵, location chromosome 14), have been colored green.

In the single marker analysis we did not include variants with a MAF below 0.01%. To assess the role of these rare variants in the association testing of pneumococcal susceptibility, we used the region-based sequence kernel association test (SKAT), which allowed us to include all common and rare variants within a gene region. Using the SKAT analysis, we found one significantly associated gene, namely COL11A1 (p = 1.03 × 10⁻⁷). In this analysis we tested a total of 12,968 gene sets, and after correcting for these multiple tests COL11A1 reaches a significance of p = 0.001. No difference in the p-value of COL11A1 was observed when correcting for the proportion of case-control imbalance. The second strongest signal in the SKAT analysis was the polymerase (DNA directed) lambda (POLL) gene (p = 8.10 × 10⁻⁵), which was not significant after multiple testing correction. When this p-value was corrected for the proportion of case-control imbalance, the association was slightly more significant (p = 7.05 × 10⁻⁵). We also examined whether the proportion of common and rare variants drove the associations with pneumococcal meningitis susceptibility when the rare variant threshold was defined as a function of the total sample size 22. This analysis showed only for POLL (p = 1.90 × 10⁻⁵) that the association was based on the proportion of common and rare variants, although this did not reach genome-wide significance after correction for multiple testing.
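The corrected gene-level p-values can be reproduced approximately by multiplying each raw p-value by the number of gene sets tested. The excerpt does not name the correction method, so treating it as Bonferroni-style is an assumption here.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


N_GENE_SETS = 12_968

# COL11A1: raw SKAT p = 1.03e-7, adjusted to roughly 0.0013.
col11a1_adjusted = bonferroni_adjust(1.03e-7, N_GENE_SETS)
print(col11a1_adjusted)

# POLL: raw SKAT p = 8.10e-5, adjusted value hits the cap of 1.
poll_adjusted = bonferroni_adjust(8.10e-5, N_GENE_SETS)
print(poll_adjusted)
```

Under this assumption COL11A1 remains below 0.05 after correction while POLL does not, which matches the conclusions drawn in the text.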

Methods and analysis

This GWAS follows the Strengthening the Reporting of Genetic Association Studies guidelines. 25 A flow chart of the study procedure is detailed in figure 1.

Study procedure. The graph details the study procedure from recruitment to data analysis. Yellow boxes represent new procedures; green boxes indicate data generation and collection; blue boxes indicate a procedural step. AITC, allyl isothiocyanate; BMI, body mass index; CANDELA, Consortium for the Analysis of the Diversity and Evolution of Latin America; CDT, cold detection thresholds; GWAS, genome-wide association study; HPT, heat pain threshold; MPT, mechanical pain threshold; PCs, principal components; PPT, pressure pain threshold; QST, quantitative sensory testing; TSL, thermal sensory limen; VDT, vibration detection threshold; VAS, Visual Analogue Scale; WDT, warm detection thresholds; WUR, wind-up ratio.


Healthy participants aged 18–… will be recruited in Medellin, Colombia via public noticeboards at local universities, distribution of flyers and through the local print media. In addition, we are inviting previous participants from the Consortium for the Analysis of the Diversity and Evolution of Latin America (CANDELA) 26 GWAS to participate in this project.

Recruiting healthy young participants has advantages in GWAS studies. They are less likely to have undetected illnesses or other problems that may influence their biological pathway of pain sensitivity. Young people will also have less overall accumulated exposure or risks from environmental (external) factors which may affect their pain sensitivity. Such factors increase the overall variability of participants’ pain perception response and reduce the power of detecting genetic causes. Since most traits are affected by a combination of genetic and environmental factors, many studies including CANDELA tend to use young participants for genetic variant discovery. 27

Participants will be excluded if they have chronic pain or any chronic medical condition (eg, diabetes, neurodegenerative, musculoskeletal or psychiatric conditions). Participants currently taking analgesics, anti-inflammatories, opioids, antihistamines, antidepressants or antiepileptic drugs will be excluded. Women who are pregnant or in their menstrual phase (self-report) will be excluded from the study. Participants will be advised to not smoke or consume coffee within 1 hour of testing, and to avoid psychoactive substances or alcohol within 8 hours prior to testing. Further exclusion criteria include current or past self-inflicted injuries, as well as dermatomal, traumatic or infectious conditions affecting the arm, and a history of severe allergic reactions to any kind of medication, materials, food or insect bites. Participants with moderate to severe anxiety (… on the Hamilton Anxiety Rating Scale 28) or severe depression (>15 on the 16-item Quick Inventory of Depressive Symptomatology Self-Reported (QIDS-SR16) 29) will be excluded from the study. Recruitment started in January 2013 and is predicted to take approximately 5–7 years.


Participants will attend a single appointment at the quantitative sensory testing (QST) laboratory at the Universidad de Antioquia, Medellín. Following informed consent, age and self-reported gender will be recorded and participants will answer questions regarding their self-reported ancestry (see online supplementary appendix 1). Height and weight will be measured and body mass index (BMI) calculated. Since psychological factors such as anxiety can influence pain perception during experimental pain testing, 30 participants will complete the Spanish version of the Hamilton Anxiety Rating Scale and the QIDS-SR16. The QIDS-SR16 has acceptable internal consistency and moderate to strong concurrent validity compared with other depression scores 31 and its Spanish version shows adequate test–retest reliability and high internal consistency. 32 The Hamilton Anxiety Rating Scale has been shown to have high inter-rater and test–retest reliability 33 and good construct validity. 34

Evaluation of sensory function in the naïve state

We will determine sensory function in the naïve state and following nociceptive sensitisation. Baseline sensory function will be evaluated using specific static and dynamic QST. These include cold detection thresholds and warm detection thresholds, thermal sensory limen and heat pain thresholds (HPT) using a ThermoTester (Q-sense, Medoc, Israel, 30×30 mm thermode size). Recording of thermal thresholds will strictly follow published QST guidelines. 35 Mechanical pain thresholds (MPT) will be evaluated using a 20 piece von Frey hair set (Touch Test, North Coast, USA) which exerts differing forces (9.8, 13.7, 19.6, 39.2, 58.8, 78.5, 98.1, 147.1, 255.0, 588.4, 980.7, 1765.2, 2942.0 mN). The von Frey hairs will be applied at a rate of 2 s on, 2 s off in ascending order starting from the 9.8 mN baseline stimulus until participants first perceive the stimulus as sharp (pricking). Subsequently, the hairs will be applied in descending order until the stimulus is perceived as blunt. The geometric mean of five series of ascending and descending stimuli is defined as the MPT. Wind-up ratio will be determined with numerical pain ratings on a Visual Analogue Scale (VAS 0–100) for a single stimulus followed by the average pain rating for a train of 10 stimuli applied at 1 Hz within the same 1 cm² area using a 255 mN von Frey hair. This will be repeated five times and the ratio will be established as the mean rating of the trains of stimuli divided by the mean rating of the single stimuli. Vibration detection thresholds (VDT) will be determined by recording the mean of 3 disappearance thresholds with a Rydel-Seiffer tuning fork. Pressure pain thresholds (PPT) will be recorded in triplicate with a manual algometer (Wagner Instruments, Greenwich, Connecticut, USA) and their mean used for analysis.
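The two derived measures in this paragraph, the MPT and the wind-up ratio, can be sketched as below. The function names and the example readings are hypothetical; only the definitions (geometric mean of the ascending and descending thresholds, and mean train rating divided by mean single-stimulus rating) come from the text.

```python
from statistics import geometric_mean, mean


def mechanical_pain_threshold(thresholds_mN):
    """Geometric mean of the 'sharp' and 'blunt' threshold forces (in mN)
    collected over the five ascending and descending series."""
    return geometric_mean(thresholds_mN)


def wind_up_ratio(single_ratings, train_ratings):
    """Mean pain rating of the stimulus trains divided by the mean rating of
    the single stimuli. Undefined if all single-stimulus ratings are zero."""
    single_mean = mean(single_ratings)
    if single_mean == 0:
        raise ValueError("wind-up ratio is undefined when single ratings are all 0")
    return mean(train_ratings) / single_mean


# Hypothetical readings for one participant.
sharp = [58.8, 78.5, 58.8, 98.1, 78.5]   # first force perceived as sharp, per series
blunt = [39.2, 58.8, 39.2, 58.8, 58.8]   # first force perceived as blunt, per series
print(mechanical_pain_threshold(sharp + blunt))
print(wind_up_ratio([10, 12, 8, 10, 10], [20, 30, 20, 30, 25]))
```

A geometric rather than arithmetic mean suits these data because the von Frey forces are spaced roughly logarithmically.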

The side to be tested will be randomised and patients will first be familiarised with the sensory tests on the forearm on the control side, before performing the actual measurements on the test arm. All tests will be performed halfway over the volar side of the forearm except for VDT (ulnar styloid) and PPT (thenar muscles).

Mustard oil evoked nociceptive sensitisation

After the baseline sensory measures, an acetate template will be used to mark a star with eight spokes each containing eight points at 1 cm increments on the volar forearm (figure 2). The skin temperature will be standardised by placing the 32°C warm thermode over the center of the star for 5 min before starting. We will then apply a sensitisation paradigm using mustard oil (AITC (Sigma), diluted at 30% in olive oil) as previously performed. 36 AITC, the active component of mustard oil, activates the ion channel TRPA1 and evokes skin flare and nociceptive sensitisation. 37 A small cotton swab soaked in mustard oil will be applied to a 0.64 cm² area on the volar forearm and held in place with a Tegaderm (3M) for 10 min. 36 During this time, pain scores will be recorded every 30 s using an electronic VAS ranging from 0 to 100. After 10 min, the mustard oil will be removed and the area of the skin flare will be recorded to the nearest 0.5 cm at each spoke. 7 Eight triangular shapes will be created by joining the points on adjacent spokes and the total area will be calculated by adding all triangular segments. The area of mustard oil application will be subtracted from the total area to determine the area of secondary flare (flare area).
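The triangle-based area calculation can be sketched as follows, assuming the eight spokes are evenly spaced (45° apart) and that the flare extent is recorded as a distance from the center along each spoke. Both assumptions, and the function name, are illustrative rather than taken from the protocol.

```python
import math


def secondary_flare_area_cm2(extents_cm, application_area_cm2=0.64):
    """Sum the triangles spanned by adjacent spokes, then subtract the area
    where the mustard oil was applied.

    extents_cm: flare extent from the center along each spoke, in order
    around the star (spokes assumed evenly spaced).
    """
    n = len(extents_cm)
    angle = 2.0 * math.pi / n  # 45 degrees for eight spokes
    total = 0.0
    for i in range(n):
        a = extents_cm[i]
        b = extents_cm[(i + 1) % n]  # wrap around to close the star
        total += 0.5 * a * b * math.sin(angle)
    return total - application_area_cm2


# Hypothetical flare measured to the nearest 0.5 cm on each of the eight spokes.
print(secondary_flare_area_cm2([2.0, 2.5, 3.0, 2.5, 2.0, 1.5, 1.5, 2.0]))
```

The same routine would apply to the brush-evoked and punctate hypersensitivity areas, since they are mapped on the same spokes and corrected for the same application area.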

Method to determine area of flare, punctuate hyperalgesia and allodynia. (A) An acetate template is used to mark a star with eight spokes containing eight points at 1 cm increments on the volar forearm. (B) A small cotton swab soaked in 30% mustard oil is applied in the center of the star and (C) held in place with a Tegaderm for 10 min. During this time, pain scores are recorded every 30 s. (D and E) After removal of the mustard oil, the skin flare will be marked and the area calculated. (F) The area of brush-evoked and punctuate hypersensitivity will be determined with a brush and a 98.1 mN von Frey hair respectively (pictured) by testing potential hypersensitivity at each point on the eight spokes.

After mapping of the flare, the area of brush-evoked hypersensitivity will be determined with a brush (Nr 5 Senselab, Somedic, Sweden) by applying 1 cm long strokes at each of the points on the eight spokes, starting from the outside and moving towards the sensitised centre. The area of punctuate hypersensitivity will be determined with a 98.1 mN filament (Bailey Instruments, UK) following the same procedure. 36 As for the flare, the primary area of mustard oil application will be subtracted from both hypersensitive areas such that the recorded areas represent secondary hyperalgesia/hypersensitivity.

Following mustard oil sensitisation, the MPT and HPT will be repeated using the same methods as described above. All postsensitisation tests will be performed within 5 min of mustard oil removal.

Reliability of naïve and sensitised sensory function protocol

To determine intratester reliability, we repeated the sensory function protocol performed by the same investigator in n=12 healthy volunteers on two different occasions within 2–6 weeks. Intraclass correlation coefficients (3,1) revealed good to excellent agreement for all sensory testing variables (table 1). 38

Table 1

Intratester reliability of sensory function protocol

Measure                   Intraclass correlation coefficient   95% CI            P value
CDT                       0.728                                0.277 to 0.914    0.003
WDT                       0.764                                0.351 to 0.927    <0.0001
TSL                       0.638                                0.161 to 0.878    0.005
HPT                       0.752                                0.339 to 0.922    0.002
MPT                       0.928                                0.767 to 0.979    <0.0001
WUR                       0.634                                0.113 to 0.880    0.012
VDT                       0.956                                0.860 to 0.987    <0.0001
PPT                       0.734                                0.305 to 0.915    0.002
VAS                       0.893                                0.667 to 0.970    <0.0001
Flare area                0.610                                0.095 to 0.869    0.015
Brush-evoked allodynia    0.756                                0.365 to 0.922    0.001
Punctuate hyperalgesia    0.615                                0.094 to 0.871    0.013
Postsensitisation MPT     0.941                                0.808 to 0.983    <0.0001
Postsensitisation HPT     0.758                                0.339 to 0.924    0.002

CDT, cold detection threshold; HPT, heat pain threshold; MPT, mechanical pain threshold; PPT, pressure pain threshold; TSL, thermal sensory limen; VDT, vibration detection threshold; VAS, Visual Analogue Scale; WDT, warm detection threshold; WUR, wind-up ratio.
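As an illustration of how a test–retest coefficient of this kind can be obtained, the sketch below computes a two-way mixed, single-measure, consistency-type intraclass correlation from a participants × sessions table of scores. Reading the protocol's "(3,1)" as this particular form is an assumption, and the example scores are invented.

```python
def icc_3_1(scores):
    """Two-way mixed, single-measure, consistency ICC.

    scores: one row per participant, one column per testing session.
    """
    n = len(scores)       # participants
    k = len(scores[0])    # sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-participants mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Invented heat pain thresholds (degrees C) for four participants, two sessions.
example = [[44.1, 44.8], [46.0, 45.5], [42.3, 43.0], [47.2, 46.9]]
print(icc_3_1(example))
```

Because the session effect is removed before the ratio is formed, a constant shift between the two sessions does not lower this coefficient.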


Each participant will donate blood or saliva (Oragene OG-500, Genotek, Canada) for DNA extraction. DNA samples will be genotyped on the Illumina HumanOmniExpress chip containing 700,000 markers. In volunteers who already participated in the CANDELA GWAS, 39 genotype data from blood samples genotyped on the same chip are already available and will be reused.

Whole-genome genotype data from the Illumina array will undergo quality control 40 to exclude any markers or samples that fail stringent thresholds. Quality metrics provided by the genotype calling algorithm in the Illumina GenomeStudio software, 41 such as the GenTrain score, cluster separation score, and excess heterozygosity rates will be used to filter poorly genotyped SNPs. Subsequent SNP-level and sample-level quality control thresholds such as missingness will be applied. Sex mismatch between records and genetic data of X and Y chromosomes will be checked. Only samples and SNPs that pass all criteria will be retained for analysis. Details of the currently used quality control protocol for CANDELA genotyped samples are provided in online supplementary appendix 2.

Statistical analysis

Sample size calculation

The power for GWAS of experimental pain phenotypes for varying sample and effect sizes was estimated following the formulae described in Visscher et al. 42 Estimated power is shown for a range of effect sizes for experimental pain phenotypes taken from existing experimental pain studies. The statistical software R V.3.4.1 43 was used to perform the calculations and produce the figures. The codes are published on

In whole-genome SNP-based GWAS studies, the association analysis is usually conducted with a multivariate linear regression model, where the trait values are regressed onto an SNP genotype (with additive coding) and other covariates which commonly include age, gender, BMI and genetic principal components (PCs). The p value threshold 39 42 for genome-wide significant associations is commonly 5 × 10⁻⁸, while the threshold for a suggestive significant association is commonly 10⁻⁵. Formulae to calculate power in GWAS with genome-wide and suggestive significance thresholds are presented in online supplementary appendix 3, and power calculated for the current GWAS setting is shown in figure 3.
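The protocol's own calculations were done in R and the formulae sit in the supplementary appendix, which is not reproduced here. As a rough Python sketch of a power calculation of this kind, the code below assumes a 1-degree-of-freedom chi-square test whose non-centrality parameter is n·q²/(1 − q²), where q² is the proportion of trait variance explained by the marker. The function name and that parameterisation are assumptions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gwas_power(n, q2, alpha=5e-8):
    """Approximate power to detect a marker explaining a fraction q2 of trait
    variance in n samples at significance threshold alpha (1-df test).

    Assumes non-centrality parameter ncp = n * q2 / (1 - q2).
    """
    ncp = n * q2 / (1.0 - q2)
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)  # two-sided critical value
    shift = math.sqrt(ncp)
    # A non-central chi-square with 1 df is the square of a shifted normal,
    # so the tail probability splits into two normal tails.
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - shift)
    lower = _STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


for n in (1500, 2000):
    for alpha in (5e-8, 1e-5):
        print(n, alpha, round(gwas_power(n, 0.02, alpha), 3))
```

Under these assumptions, power rises with both sample size and the less stringent suggestive threshold, which is the qualitative pattern described for figure 3.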

Estimated power (in percentage) under the standard genome-wide association studies (GWAS) settings of using whole-genome genotyping data. (A) Estimated power (in percentage) as a heatmap, setting the significance threshold at 5 × 10⁻⁸, the commonly used threshold for genome-wide significance in GWAS studies. (B) Estimated power with the significance threshold set at 10⁻⁵, the commonly used threshold for suggestive significance. In panels A-B, the x-axis denotes a range of sample sizes (n) in a GWAS, and the y-axis represents the proportion of trait variance (q²) explained by a marker. Power of detecting the marker at a specific (n, q²) combination is represented by a colour gradient. Contour lines for power at 10% intervals are also shown. Panels C-D show power curves for the expected sample sizes for this study. (C) Expected power at genome-wide and suggestive significance thresholds for a sample size of n=1500. (D) Estimated power for a sample size of n=2000. In panels C-D, the x-axis denotes the proportion of trait variance (q²) explained by a marker, and the y-axis represents estimated power (in percentage). The two curves correspond to the two commonly used GWAS thresholds. In each panel, the point for 80% power is indicated with a green triangle, so that the necessary parameter configurations can be read from the graph. In panels A-B, the contour corresponding to 80% power is also marked in green.

Figure 3A shows estimated power (in percentage) as a heatmap under the standard GWAS settings of using whole-genome genotyping data and a p value significance threshold of 5 × 10⁻⁸. Sample size (n) varies from 100 to 5000, while the proportion of trait variance explained by the marker (q², in percentage) varies from 0.01% to 6%. As sample size increases, power increases quickly for a range of trait variance values to reach 100%.

Figure 3B shows estimated power (in percentage) under the same settings but a suggestive p value significance threshold of 10⁻⁵. As expected, power is higher at similar sample and effect sizes for this less stringent threshold.

Simplified power estimates are shown as power curves in figure 3C,D for the expected sample sizes for this study. Figure 3C shows expected power at genome-wide and suggestive significance thresholds for a sample size of n=1500 at varying effect sizes, while figure 3D shows estimated power for a sample size of n=2000.

The range of trait variance has been taken from Doehring et al, 44 which provides estimates for the proportion of trait variance explained by an SNP for several experimental pain phenotypes and multiple markers. The values ranged from 0.02% to 6%. Some of the traits were the same as the traits investigated here, while some other traits were different. Nevertheless, the distributions of trait variance for the two groups of traits are very similar, as seen in figure 4A .

(A) Distributions of trait variance explained by a single marker from Doehring et al 44 for traits included in our study and those not included. (B) Allele frequency distributions of loci associated with experimental pain in previously published cohorts, for Europeans and Colombians.

Power of a GWAS depends on the allele frequency of the SNPs through their effect on the test statistic. While the majority of GWAS studies are conducted in European-origin individuals, including the experimental pain study used to determine sample size here, 44 our population of interest is an admixed Latin American population. Therefore, we wanted to assess the distribution of allele frequencies in Europeans versus Colombians for SNPs studied for or associated with experimental pain in various studies. 44 45 Minor allele frequencies were obtained for all such reported SNPs from the 1000 Genomes project database 17 for Western Europeans (from Britain (GBR), Utah residents from Northern and Western Europe (CEU), Spain (IBS), and Tuscany in Italy (TSI)) and Colombians (from Medellin in Colombia, (CLM), where this study will be performed). Allele frequency distributions for both Europeans and Colombians are shown in figure 4B . The two distributions are quite similar, with the Colombian distribution slightly more spread out. This is somewhat expected as the Colombians have on average 60% European (Spanish) ancestry. 26 Having a well spread out distribution of allele frequencies is important in a GWAS as low-frequency alleles have lower power for a given sample and effect size. 46 Here, the comparison to European allele frequency distribution suggests that the current Colombian cohort will have nearly equivalent power to any European-based cohort. In contrast to European-only cohorts though, our cohort will have the advantage that alleles present in other continental populations such as sub-Saharan Africans or Native Americans that are not present in Europeans would also be detectable in a GWAS, and could be followed up in replication cohorts of specific ethnicities.

The CANDELA project includes genotypes of 2000 patients from Medellin. We anticipate contacting and phenotyping 50%–75% of these participants, as well as recruiting an additional 500 participants, to bring the initial sample to 1500–2000.

Data analysis plan

The cleaned genetic data will first be merged with reference samples worldwide, such as the 1000 Genomes Project, 17 Simons Genome Diversity Project, 47 Estonian Biocentre Human Genome Diversity Panel, 48 and additional European and Native American samples that are particularly relevant for Latin American populations. 49 The merged dataset will be checked for genetic outliers, through genetic PCs and continental ancestry proportions (using supervised Admixture 50 ), and for unexpected genetic similarities. These steps can often detect any sample misplacement or contamination, which might be reflected in sex mismatch, unexpected genetic similarities or inflated heterozygosity rates. 40 51 Genetic ancestry estimates will be compared with self-reported ancestry information (see online supplementary appendix 1), particularly for genetic outliers or samples showing unexpected results. Participants self-reporting for ethnicities rare in Colombians, such as East Asian or South Asian, would also be excluded as outliers. The authors have extensive experience in conducting association analysis in admixed populations, including several GWAS publications on a wide range of phenotypes which contain detailed protocols on how to conduct such analyses. 27 39 52 53 Further details of the currently used quality control protocol for CANDELA samples 53 are provided in online supplementary appendix 2.

The genetic data allow estimation of the narrow-sense heritability of any quantitative trait, which is the fraction of trait variance that is explained by the genetic data. Estimates of heritability, obtained using the software GCTA, 54 will indicate which traits have more of a biological basis and which are more environmentally determined, and thus which traits would be more amenable to genetic analysis for discovery of associated genetic variants. Note, however, that a relatively precise estimate of heritability (low SDs) by this method requires several thousand samples, 52 so the currently proposed sample size might be underpowered to estimate heritability accurately.

To facilitate better identification of associated loci, the genotype data will be imputed to approximately 10 million loci using the 1000 Genomes phase 3 imputation reference panel 52 by first haplotype phasing using SHAPEIT2 55 and then imputation using Impute2. 56 Quality control of the imputed genotypes will be performed using recommended thresholds on imputation quality score, concordance metrics and proportion of high-probability calls. Details of the currently used imputation protocol for CANDELA samples 52 are provided in online supplementary appendix 2.

GWAS will be conducted in Plink2 57 as single-locus association tests for each trait individually, across the whole genome, using an additive multivariate linear regression model. 42 Covariates will be included in the regression to adjust for other sources of trait variability, such as age, sex and BMI, and genetic PCs will be used to control for population substructure. 52 58
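As a rough illustration of what a single-locus additive test involves, the sketch below regresses a trait on allele dosage (0, 1 or 2 copies) plus covariates using ordinary least squares. It is not the Plink2 implementation: the function names, the toy data and the normal approximation used for the p value are all assumptions made for this example.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def additive_test(dosage, trait, covariates):
    """Return (beta, se, p) for the dosage term of an OLS fit.

    dosage: allele counts (0/1/2) per individual
    trait: quantitative phenotype per individual
    covariates: one list of covariate values per individual (e.g. age, PCs)
    """
    n = len(trait)
    # Design matrix columns: intercept, dosage, then the covariates.
    X = [[1.0, float(g)] + [float(c) for c in cov]
         for g, cov in zip(dosage, covariates)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * trait[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [trait[i] - sum(X[i][j] * beta[j] for j in range(k))
             for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    # Variance of the dosage coefficient: sigma2 times the matching
    # diagonal entry of (X'X)^-1, obtained by solving X'X v = e_1.
    e1 = [0.0] * k
    e1[1] = 1.0
    v = solve(xtx, e1)
    se = math.sqrt(sigma2 * v[1])
    z = beta[1] / se
    # Assumption: two-sided p value from a normal approximation to the t statistic.
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta[1], se, p

# Toy data (hypothetical): trait = 1 + 0.5 * dosage + 0.02 * age + noise
dosage = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]
age = [30, 42, 35, 51, 28, 44, 39, 60, 33, 47]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.0]
trait = [1.0 + 0.5 * g + 0.02 * a + e for g, a, e in zip(dosage, age, noise)]
beta, se, p = additive_test(dosage, trait, [[a] for a in age])
```

In a real analysis this test is repeated for every locus, and genetic PCs would be appended to each individual's covariate list in the same way as age is here.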

The number of genetic PCs to be included in the regression depends on the sample composition, such as variation in ancestry and presence/absence of genetic outliers. It would be determined by inspecting the proportion of variance explained by each PC (displayed on a scree plot) and by checking PC scatter plots.
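The quantity shown on a scree plot can be computed directly from the PCA eigenvalues. The sketch below is only a heuristic stand-in for the visual inspection described above: the helper names and the `min_share` cut-off are hypothetical, and it assumes the full list of eigenvalues is available.

```python
def variance_explained(eigenvalues):
    """Proportion of total variance explained by each PC."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def suggest_num_pcs(eigenvalues, min_share=0.01):
    """Count leading PCs that each explain at least min_share of the variance.

    Assumption: eigenvalues are sorted in decreasing order. The threshold
    is an illustrative choice, not a value taken from the protocol.
    """
    count = 0
    for share in variance_explained(eigenvalues):
        if share < min_share:
            break
        count += 1
    return count

# Hypothetical eigenvalues from a genetic PCA
eigenvalues = [50.0, 20.0, 5.0, 0.5, 0.3, 0.2]
shares = variance_explained(eigenvalues)
n_pcs = suggest_num_pcs(eigenvalues)
```

A fixed cut-off like this cannot replace checking the PC scatter plots, since outliers can drive a PC that explains a sizeable share of variance.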

In addition to being used as exclusion criteria, anxiety and depression scores could be used as covariates in GWAS. The exact set of covariates to be used will be determined based on initial diagnostic analyses such as correlation analysis.

These single-locus association results, obtained as p values, will be visualised via the Manhattan plot. Commonly used p value thresholds for selecting associated loci are 5×10⁻⁸ for genome-wide significance and 10⁻⁵ for suggestive significance. 42
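A minimal sketch of how p values might be labelled against the conventional thresholds (5×10⁻⁸ for genome-wide and 10⁻⁵ for suggestive significance), together with the −log10 transform used for the vertical axis of a Manhattan plot. The function names and labels are assumptions for illustration.

```python
import math

GENOME_WIDE = 5e-8   # conventional genome-wide significance threshold
SUGGESTIVE = 1e-5    # conventional suggestive significance threshold

def classify_p(p):
    """Label a single-locus p value against the two thresholds."""
    if p < GENOME_WIDE:
        return "genome-wide"
    if p < SUGGESTIVE:
        return "suggestive"
    return "not significant"

def neg_log10(p):
    """Height of a locus on a Manhattan plot."""
    return -math.log10(p)

labels = [classify_p(p) for p in (1e-9, 1e-6, 0.01)]
```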

An extension of this additive multivariate linear regression model, still within the single-trait single-locus setting, called the mixed linear model analysis which better controls for any cryptic relatedness or population substructure, will also be performed in GCTA. 54

There are several extensions of the single-trait single-locus association studies that increase power for detecting associated loci: combining several related traits that may share a biological basis, using multivariate Wald tests as implemented in MultiPhen 59 or gene-based tests that combine signals across all loci in a gene to increase signal strength and reduce the burden of multiple testing, such as set-based models implemented in Plink2 57 or fastBAT implemented in GCTA. 60 The admixed nature of the sample might be used in detecting associations by the method of admixture mapping, 61 though the potential of success of this method in detecting associated variants depends on the extent of stratification of the variant’s allele frequency across continents. These analyses might help detect additional loci that are underpowered in classical GWAS due to smaller effect sizes.

Handling of missing data

It might not be possible to record some traits in some individuals, even though the completeness of the first 100 samples suggests that missingness will be low. The single-trait methods used in traditional GWAS analyses automatically exclude individuals from the analysis of a trait who have missing values for that trait. The same applies to individuals having missing genotypes for any particular SNP. However, genotyping success rate using the Illumina HumanOmniExpress chip in the CANDELA cohort is very high (>99.8%), so the number of excluded individuals in any analysis would be very low overall.
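The per-analysis exclusion described here amounts to a complete-case filter applied separately for each trait and SNP pair. A small sketch, assuming missing values are encoded as `None` (the function name and encoding are illustrative):

```python
def complete_cases(trait_values, genotypes):
    """Indices of individuals with both the trait and the genotype observed."""
    return [
        i
        for i, (t, g) in enumerate(zip(trait_values, genotypes))
        if t is not None and g is not None
    ]

# Hypothetical data for four individuals; None marks a missing value.
trait = [1.2, None, 0.8, 2.0]
genotype = [0, 1, None, 2]
kept = complete_cases(trait, genotype)
```

Because the filter is applied per trait and per SNP, an individual dropped from one test can still contribute to every other test where their data are present.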

Some multivariate analyses, such as PCs applied to the set of phenotypes, require recorded values of all phenotypes for an individual. Instead of using only the subset of individuals who have the complete set of phenotypes recorded, which would incur some loss in sample size, the missing phenotype data for each individual will be imputed following standard statistical procedures as implemented in the R package ‘mice’. 62 When the proportion of missing data is small, imputation is preferable to sample exclusion in such multivariate analyses, and is routinely applied to genetic data, for example when calculating genetic PCs. 58


This GWAS including a well-defined cohort of healthy participants will provide important insights into the genetic aspects underlying experimental pain sensitivity in the naïve and sensitised state. This may allow further exploration of potential biological mechanisms underlying pain sensitivity. Future studies will be required to extrapolate these findings to patient populations with chronic pain.

Patient and public involvement

No patient involvement is performed during this study.

Ethics and dissemination

Findings will be disseminated to commissioners, clinicians and service users via papers and presentations at international conferences such as the biennial World Congress of International Association for the Study of Pain. We will also post our findings to the publicly available database.



Department of Health Sciences, University of Leicester, Leicester, UK

Iain R. Timmins & Frank Dudbridge

Diabetes Research Centre, University of Leicester, Leicester, UK

Francesco Zaccardi & Thomas Yates

Department of Cardiovascular Sciences, University of Leicester, Leicester, UK

NIHR Leicester Biomedical Research Centre, University Hospitals of Leicester NHS Trust & University of Leicester, Leicester, UK

Christopher P. Nelson & Thomas Yates

Department of Clinical Sciences, Lund University, Lund, Sweden


I.R.T. developed methods, performed analyses, interpreted results and wrote the manuscript. F.Z. conceived the study, interpreted results and edited the manuscript. C.P.N. interpreted results and edited the manuscript. P.W.F. interpreted results and edited the manuscript. T.Y. conceived the study, interpreted results and edited the manuscript. F.D. conceived the study, interpreted results and edited the manuscript.



Epithelial ovarian cancer (EOC) has a strong heritable component, with an estimated three-fold increased risk among women with a first-degree relative having the disease (1). The excess familial risk that is not attributed to high-penetrance mutations in genes such as BRCA1 and BRCA2 may be due to a combination of common and rare alleles that confer low to moderate penetrance (2, 3). Genome-wide association studies (GWAS) of EOC that have been conducted using most of the samples included in the current investigation have identified common variants at approximately 22 loci that collectively account for 4% of the estimated heritability (4–13). Few data exist regarding the contribution of rare (minor allele frequency (MAF) <0.5%) and low-frequency (MAF 0.5–5%) protein-coding variants to EOC risk. This reflects the fact that protein-coding variants have not been targeted by conventional GWAS (14), despite predictions that their effects could be substantial (15), and that imputation is known to be challenging for rare variants (16).
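The frequency classes used above can be written as a small helper. This is a sketch: the cut-offs for rare and low-frequency variants follow the text, while the "common" label for MAF above 5% and the function name are assumptions.

```python
def classify_maf(maf):
    """Classify a variant by minor allele frequency (a fraction, not a percentage)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    if maf < 0.005:       # MAF < 0.5%
        return "rare"
    if maf <= 0.05:       # MAF 0.5-5%
        return "low frequency"
    return "common"       # assumption: everything above 5%

classes = [classify_maf(m) for m in (0.001, 0.02, 0.3)]
```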

Following GWAS arrays of the mid-2000s, exome-based arrays were developed in 2012. The Affymetrix Axiom® Exome Genotyping Array and the Illumina HumanExome Beadchip each contain >245,000 putative functional coding variants and other categories of variants selected from 16 exome sequencing initiatives that included approximately 12,000 individuals of diverse ethnic backgrounds and a range of diseases (17) (Supplementary Material, Table 1). Variants were included as ‘fixed’ content on the arrays if they occurred at least three times and were seen in two or more of the 16 studies (17). Here, we report the first large-scale genetic association study of uncommon exome-wide variants and EOC risk among nearly 20,000 women (Supplementary Material, Table 2).
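The inclusion rule for ‘fixed’ content can be expressed as a simple check on per-study observation counts. The sketch below assumes the counts are available as a mapping from study name to number of times the variant was seen; the study names and the function name are hypothetical.

```python
def is_fixed_content(counts_by_study, min_total=3, min_studies=2):
    """True if a variant was seen at least min_total times overall
    and in at least min_studies different sequencing studies."""
    observed = [c for c in counts_by_study.values() if c > 0]
    return sum(observed) >= min_total and len(observed) >= min_studies

# Hypothetical observation counts for three variants
seen_in_two = is_fixed_content({"study_A": 2, "study_B": 1})   # 3 times, 2 studies
single_study = is_fixed_content({"study_A": 5, "study_B": 0})  # only 1 study
too_few = is_fixed_content({"study_A": 1, "study_B": 1})       # only 2 times
```

A rule like this means that very rare variants private to one of the sequenced studies never reach the array, which is one reason array-based studies have limited coverage of the rarest coding variation.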


In this EWAS of sporadic DCM we confirmed associations with variants in ZBTB17-HSPB7 and BAG3 and identified six novel loci. Statistical analyses, cardiac tissue expression, and physiology suggest that the most likely causal genes are HSPB7, BAG3, TTN, SLC39A8, MLIP, FLNC, ALPK3 and FHOD3.

Our data provide evidence that non-coding variants close to or within HSPB7 are more likely to account for the observed association at the ZBTB17 locus. The genetic mechanism linking the risk haplotype to HSPB7 functional modulation in the absence of a detectable eQTL is unknown. HSPB7 (commonly referred to as cardiovascular heat shock protein, cvHSP) is a member of the small HSPB family of molecular chaperones. It is a potent polyglutamine aggregation suppressor that assists the loading of misfolded proteins or small protein aggregates into autophagosomes [15]. In addition, our in vitro experiments demonstrate a physical interaction of BAG3 with HSPB7 (Fig 4), suggesting a functional relationship between the two proteins that may be relevant to their genetic implication in DCM pathophysiology. The strongest association with sporadic DCM in our EWAS involved rs2234962, which encodes a p.Cys151Arg substitution in BAG3. The interaction signals of the BAG3 Arg151 and Cys151 isoforms with HSPB7 were similar (data not shown), suggesting no direct effect of the polymorphism on HSPB7 binding. The p.Cys151Arg variant is located between two conserved Ile–Pro–Val (IPV) motifs involved in BAG3 complex formation with HSPB6 and HSPB8. Interestingly, a p.Pro209Leu mutation responsible for myofibrillar myopathy associated with cardiomyopathy is located in one of the two IPV motifs [16]. Whether p.Cys151Arg modifies the interaction of BAG3 with its HSPB partners and affects the functional potential of the complex is currently unknown.

Common haplotype-tagging variants of the titin gene were associated with small differences in DCM risk in this EWAS. This extends the spectrum of TTN genetic variants that affect DCM risk, from highly penetrant mutations responsible for familial DCM [12] to common haplotypes with low penetrance associated with sporadic DCM. A potential consequence of common DCM-associated TTN variants, in line with the pathogenic mechanisms suggested by the candidate genes identified in this EWAS, is the proteotoxic effect of accumulating truncated or aggregate prone mutant TTN in cardiomyocytes.

The rs13107325 SNV in SLC39A8 has been shown in GWAS to be associated with several traits affecting cardiovascular risk, including blood pressure. It is therefore conceivable that it has systemic consequences that raise the risk of DCM but were not considered in our disease exclusion criteria. Because SLC39A8 encodes a zinc transporter [17], its association with DCM may also be related to the cardioprotective role of zinc [18].

In the nuclear envelope, MLIP (also known as CIP) directly interacts with the N-terminal region of lamin (LMNA) [19]. Dominant mutations in LMNA cause DCM and other hereditary multisystemic diseases and several pathogenic mutations of LMNA are located in its MLIP interacting domain [20]. In mice, Mlip interacts with Isl1, a transcription factor required for cardiomyocyte differentiation, and represses its transcriptional activity [21]. Notably, the DCM-associated SNV (rs4712056, p.Val159Ile) is located within the Isl1-interacting region of MLIP. MLIP has recently been shown to be a key regulator of cardiomyopathy that has potential as a therapeutic target to attenuate heart failure progression [22].

Filamin C is involved in the organization of actin filaments, serves as a scaffold for signaling proteins and interacts with several Z-disk proteins. FLNC mutations in humans and mice cause hypertrophic cardiomyopathy [23] and myofibrillar myopathy, a form of muscular dystrophy with concurrent cardiomyopathy [24]. These pathologies are characterized by myofibrillar disorganization, accumulation of myofibrillar degradation products and ectopic expression of multiple proteins [25]. FLNC mutations induce massive protein aggregates within skeletal muscle fibers and altered expression of chaperone proteins and components of proteasomal and autophagic degradation pathways. Interestingly, functional interactions between FLNC and HSPB7 or BAG3, two genes confirmed by this study, have been previously reported [26,27]. Together with the fact that BAG3 mutations also cause myofibrillar myopathy, this suggests the hypothesis that dysregulation of proteostasis could be a common mechanism underlying myofibrillar myopathy and DCM.

Cardiac FHOD3 plays a crucial role in the sarcomere organization of cardiomyocytes, is essential for heart myofibrillogenesis [28] and is required for the maintenance of the contractile structures in heart muscle. A cardiac isoform of FHOD3 is targeted to thin actin filaments via phosphorylation of a tyrosine residue, preventing autophagy-dependent degradation [29]. Most DCM-associated SNVs in our EWAS are clustered in the 3' region of FHOD3 that encodes the Formin FH2 domain of the protein, which is implicated in actin polymerization [30]. A FHOD3 variant, Y1249N, has been reported in a Japanese patient with a dominant form of DCM. In vivo functional analysis showed that this variant may impair actin filament assembly, thus providing some support for the implication of FHOD3 in the pathogenesis of DCM [31].

Alpha-kinase 3 (ALPK3/MIDORI) was initially described as a myocyte-specific gene that promotes differentiation of P19CL6 cells into cardiomyocytes [32]. The pattern of expression of ALPK3 in the nuclei of differentiating cardiomyocytes is similar to that of transcription factors specific to the cardiogenic lineage [33], but its function is still largely unknown. Recently, recessive mutations in ALPK3 have been reported to cause pediatric DCM [34].

In addition to TTN and BAG3, MYBPC3 was present in the cardiomyopathy gene-set [2] and harbored variants associated with sporadic DCM in our EWAS. MYBPC3 is an actin, myosin and titin interacting protein of the M-band of the sarcomere. Mutations in this gene are a major cause of hypertrophic cardiomyopathy and have also been reported in familial forms of DCM [35]. Coding variants in MYBPC3 may affect actin-myosin interaction [36] and concurrently interfere with the ubiquitin proteasome system and autophagy in humans and animal models [37]. Both mechanisms could account for the association of common SNVs in MYBPC3 with sporadic DCM.

Considered as a whole, both rare and common variants with elevated CADD scores in the cardiomyopathy gene-set were associated with sporadic DCM (S3 Table), indicating that loci other than those found in this EWAS are involved in sporadic DCM. Identifying the responsible genes, however, will require larger studies.

Proteostasis might be important for DCM

Three of the DCM-associated genes, FLNC, TTN (through its kinase activity) and cardiac-specific FHOD3, encode maintenance partners of the sarcomere and sarcomere-related structures, including Z-disk or F-actin myofibrils [29,38,39], which are disorganized or degraded in experimental models of cardiomyopathy [40]. Moreover, the cellular levels of FLNC, FHOD3 and TTN kinase targets such as MuRF2 appear to be regulated by proteostasis mechanisms [26,29,39]. One of these mechanisms, BAG3-associated chaperone-assisted selective autophagy (CASA), is described as a central adaptation mechanism that responds to acute physical exercise and to repeated mechanical stimulation [41]. BAG3 inactivation also leads to Z-disk disruption in mice and fruit flies [26]. Based on the functional similarities (also pertaining to HSPB7 and MYBPC3) characterizing several of the DCM-associated genes identified in this study, we hypothesize that abnormal cardiomyocyte sarcomere maintenance and regulation of autophagy is a potential mechanism involved in DCM pathophysiology. Further experimental exploration of this hypothesis may yield novel therapeutic targets for DCM.

Limitations of this study

This study has some limitations. The recruitment was focused on a priori homogeneous sets of patients and controls of European ancestry and outliers were excluded based on genomic data. We also conducted a meta-analysis which did not reveal any significant heterogeneity across populations. Despite these precautions, we cannot fully exclude undetected population stratification. In addition, given the rather low prevalence of DCM, we conducted exome-wide genotyping in all available patients and controls instead of using a two-step discovery/replication design. As a consequence, even if we provide a series of arguments supporting the identified genes as plausible candidates, independent studies would certainly further refine and extend our results. It is likely that a genome-wide tagging array not limited to exon regions would identify other DCM-associated variants and loci than those reported here. Finally, as our power analyses show, our EWAS had limited power to detect the collective effect of rare variants present in our data set at the gene level.


We identified six novel loci associated with sporadic DCM and confirmed two previously reported associations with variants located within the ZBTB17-HSPB7 and BAG3 genes. Fine mapping revealed that HSPB7 is likely the implicated gene at the ZBTB17 locus. The lead SNVs at all associated loci are common variants, and conditioning on them considerably reduced the associations of other variants in the regions of interest with DCM. We provide evidence that seven of the DCM-associated genes are very plausible candidates from a pathogenic perspective.
