Difference between NCBI's /genomes and /1000genomes

Difference between NCBI's /genomes and /1000genomes

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

Wondering what the difference is in the data hosted here:

Also (sidenote), would be interested to note what the difference betweenftp-traceand justftpis. But there is probably over 1TB of data in/genomes, and I assume something of "a thousand genomes" is in/1000genomes, but I'm not sure what data exactly. Would like to know what is i thse two folders, and where the 1000genomes data is exactly. It looks like /1000genomes/ftp/data/ has "sequence data", but I'm not sure how that compares to the/genomeslist of species names (and I'm also not sure what "sequence data" entails exactly).

The simplest difference is the scale and range. The aim of 1000 genome project was to provide comprehensive library of human genetic variation. DNA of individuals coming from different ethnic groups, geographic locations were sequenced and the results of the study were published here.

Genomes from NCBI come from different organisms and the number of sequenced individuals differ between species (including human). This database is constantly growing. Detailed description can be found here.

In fact these two things are very different and I would say quite opposite.

NCBI genomes are collection of information on genomes of various species including sequences, maps, chromosomes, assemblies, and annotations. Basically it's a catalog of reference genomes.

1KG is a catalog of human genetic variations from a genome-wide association study.

Copy Number Variation

Differences between genomes can be due to single nucleotide variants (SNPs), translocations, inversions and copy number variants (CNVs, gain or loss of DNA). The latter can range from sub-microscopic events to complete chromosomal aneuploidies. Small CNVs are often benign but those larger than 250 kb are strongly associated with morbid consequences such as developmental disorders and cancer. Detecting CNVs within and between populations is essential to better understand the plasticity of our genome and to elucidate its possible contribution to disease or phenotypic traits.While the link between SNPs and disease susceptibility has been well studied, to date there are still very few published CNV genome-wide association studies probably owing to the fact that CNV analysis remains a slightly more complex task than SNP analysis (both in term of bioinformatics workflow and uncertainty in the CNV calling leading to high false positive rates and unknown false negative rates). This chapter aims at explaining computational methods for the analysis of CNVs, ranging from study design, data processing and quality control, up to genome-wide association study with clinical traits.

Keywords: Copy number variation DNA Deletion Duplication Genome-wide association studies Human disease Human genetics Structural variation.


1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

Thorisson, G.A., Smith, A.V., Krishnan, L. & Stein, L.D. The International HapMap Project Web site. Genome Res. 15, 1592–1593 (2005).

Rosenbloom, K.R. et al. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 38, D620–D625 (2010).

Washington, N.L. et al. The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details. Database (Oxford) 2011, bar023 (2011).

Baker, M. Next-generation sequencing: adjusting to data overload. Nat. Methods 7, 495–499 (2010).

Shumway, M., Cochrane, G. & Sugawara, H. Archiving next generation sequencing data. Nucleic Acids Res. 38, D870–D871 (2010).

Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature 461, 168–170 (2009). The Toronto Agreement describes a set of best practices for prepublication data sharing. These practices have been adopted by the 1000 Genomes Project and have helped drive the widespread use of the data.

Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011).

Flicek, P. et al. Ensembl 2011. Nucleic Acids Res. 39, D800–D806 (2011).

McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010). The Ensembl Variant Effect Predictor (VEP) is a flexible and regularly updated method to annotate all newly discovered variants and provides information about how such variants impact genes, regulatory regions and other key genomic features.

Kumar, P., Henikoff, S. & Ng, P.C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).

Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).

Foelo, M.L. & Sherry, S.T. NCBI dbSNP database: content and searching. in Genetic Variation: A Laboratory Manual (eds., Weiner, M.P., Gabriel, S.B. & Stephens, J.C.) 41–61 (Cold Spring Harbor Laboratory Press, 2007).

Church, D.M. et al. Public data archives for genomic structural variation. Nat. Genet. 42, 813–814 (2010).


In this section, we start with a summary and characterization of our VNTR predictions, followed by identification of commonly polymorphic VNTRs and an enrichment analysis of their association with genomic functional regions and genes sets. We next report on the effect of VNTRs on expression of nearby genes, and then identify population-specific VNTR alleles and show that they are predictive of ancestry. We conclude this section with evidence confirming the accuracy of our predictions using several validation methods.

About one in five minisatellite TRs are variable in the human population

WGS datasets from 2770 human genomes were analyzed with VNTRseek to detect VNTRs. Overall, 184 315 out of 191 286 singleton reference TR loci (�%) were genotyped across all samples (Table  2 ) while 5% of the loci had TR arrays too long to fit within the longest reads and could only be genotyped if they lost a sufficient number of copies.

Table 2.

TRs and VNTRs detected, by dataset. TRsgenotyped is the number of distinct TR loci genotyped across all individuals within a dataset. (All other numbers are also per dataset.) Multis are TR loci genotyped in a single individual with more than the expected number of alleles. They could be artifacts or indicate copy number variation in a genomic segment. Multis were excluded from further analysis on a per sample basis. VNTRsdetected is the number of TR loci, excluding multis, with a detected allele different from the reference

DatasetSamplesTRs genotypedMultisVNTRs detected
1000 Gen.30178 3953668761
NYGC2504177 612118133 403
SGDP253156 8032219944
GIAB7178 8042396736
CHM11159 5631751118
CHM131170 8056321977
Tumor-Normal4150 531211291
Totals2800184 315-35 638

A total of 5 198 392 loci with non-reference alleles were detected, corresponding to 35 638 (�%) distinct VNTR loci, indicating wide occurrence of these variable repeats. Their occurrence within genes was common, totaling 7698 protein coding genes, and 3512 exons. The resulting genotypes were output in VCF format files (see Data Availability Section) and summarized for each genome (Supplementary Material Summary_of_results.txt ). A website is under development to view the VNTR alleles (

The number of TRs genotyped and VNTRs detected depends on coverage and read length

To determine the effect of coverage and read length on genotyping, we measured two quantities: the percentage of reference singleton TRs that were genotyped, and the total number of singleton VNTRs that were detected in each genome. Only singleton loci were considered in all further analyses. Figure  1A shows that there was a strong positive correlation between coverage and the ability to genotype TRs. A strong correlation with read length was also apparent, however, the effect was larger, primarily due to the ability of longer reads to span, and thus allow VNTRseek to detect, longer TR arrays. These results suggest that our VNTR numbers are undercounts.

(A) TR Genotyping sensitivity. Graph shows the relationship between coverage, read length, and the percentage of TRs in the reference set that were genotyped. Each symbol represents a single sample and specific samples are labeled. Increasing read length had the largest effect on sensitivity because many reference TR alleles could not be detected at the shorter read lengths (see Supplementary Table S1). (B) VNTRs detected. Graph shows the relationship between coverage, read length, and the number of VNTR loci detected. Read length and coverage both had large effects. Coloring of symbols shows that population also had a strong effect, reflecting distance from the reference, which is primarily European. Note the reduced numbers for CHM1 (150਋p) and CHM13 (250਋p). Because they are ‘haploid’ genomes, parental heterozygous loci with one reference allele would appear to be VNTRs, on average, only about half the time. (C) Alleles detected per locus. Each bar represents a specific number of alleles detected across all datasets. Coloring shows that proportion of loci where the reference allele was or was not observed. (D) Copies gained or lost. Each bar represents a specific number of copies gained or lost in non-reference VNTR alleles relative to the reference allele. Loss was always more frequently encountered. (E) VNTR locus sample support. Data shown are the common loci from the 2504 sample NYGC dataset. Each bar represents the number of samples calling a locus as a VNTR. Bin size is 100. Bar height is number of loci with that sample support. Red line indicates the 5% cutoff for common loci (126 samples).

VNTR detection was similarly dependent on coverage and read length, as shown in Figure  1B . However, detection was also positively correlated with population, which seemed likely due to the evolutionary distance of populations from the reference genome, which is primarily European (125). For example, in the 250਋p trios with comparable coverage, the African Yoruban genomes (YRI) had the highest number of VNTRs detected, followed by the Ashkenazi Jewish genomes (AJ), and finally, the Utah genomes (CEU). Notably, within each trio, the VNTR counts were similar.

The ‘haploid’ genomes CHM1 (150਋p) and CHM13 (250਋p) had greatly reduced VNTR counts relative to genomes with similar coverage and read length. This was because in these genomes, which consist of two copies of an underlying haploid genome, the single allele represented at any VNTR locus would frequently be a reference allele and so the locus would not be called as a VNTR.

More than two alleles are common in VNTRs

Two alleles were detected in the majority of VNTR loci across all datasets (Figure  1C ). However at 10 698 loci (29%), three or more alleles were detected. In a substantial number of loci (5395), the reference allele was never seen, but in only 105 of these (2%) was the reference allele in the VNTRseek detectable range for the 150ਊnd 100਋p reads, which made up the bulk of our data. Interestingly, in 1166 loci, the reference allele, although detectable, was not the major allele (Supplementary Material Major_genotypes.txt ).

Loss of VNTR copies relative to the reference is more common than gain

Overall, VNTRseek found approximately 1.8-fold more alleles with copy losses (3 444 128), with respect to the reference copy number, than gains (1 958 250). Loss of one copy (2 263 608) was the most common type of VNTR polymorphism (Figure  1D ). Although there were allele lengths that VNTRseek could not detect, this bias persisted even when restricting the loci to only those where gain and loss could both be observed (Supplementary Figure S5). The overabundance of VNTR copy loss may actually be an underestimate. Since VNTRseek required a read to span a TR array for it to be detected, gain of one copy would have been possible in approximately 68%, 82%ਊnd 92% of loci for samples with read lengths of 100, 150ਊnd 250਋p, respectively. By contrast, the reference locus needed to have a minimum of 2.8 copies for a loss of one copy to be observed by TRF, and only 16% of the reference loci met this criterion. Higher observed copy loss could be explained by a bias in the reference genome towards including higher copy number repeats (126), or by an overall mutational preference for copy loss.

VNTRs have high heterozygosity

High heterozygosity in human populations suggests higher genetic variability and may have beneficial effects on a range of traits associated with human health and disease (127). Since calculating heterozygosity for VNTRs is not straightforward (because of limitations on discovering alleles, especially within shorter reads), we used the percentage of detected, per-sample heterozygous VNTRs as an estimate for heterozygosity. At read length 250਋p, per-sample heterozygous VNTR loci comprised approximately 46�% of the total, which is comparable to previous theoretical estimates of 43�% (19). At shorter read lengths, the bottom of the range extended lower (��% for 150਋p reads, ��% for 100਋p reads, Supplementary Figure S1), as expected, because longer alleles were undetectable if they did not fit within a single read.

Interestingly, despite the previous comment, within genomes that were comparable in read length and coverage, the fraction of heterozygous loci clustered within populations (Supplementary Figures S2–S4), with African genomes generally having more heterozygous calls and East Asians fewer. This result is consistent with previous findings of population differences in SNP heterozygosity among Yoruban and Ashkenazi Jewish individuals with respect to European individuals (128,129), and suggests higher genomic diversity among African genomes, as has been previously noted (130).

Loss of heterozygosity observed in tumor samples

A significant loss of heterozygosity (LOH) was observed in predicted VNTRs of one of the tumor tissues compared to its matching normal tissue (sample ID HC1187). The percentage of heterozygous VNTRs was roughly double in the normal tissue (�% versus �%) (Supplementary Tables S2 and S3). Extreme loss of heterozygosity in small variants has previously been reported in these samples by Illumina Basespace (131) with the number of heterozygous small variants in HC1187 being four times lower in the tumor tissue compared to the normal. Taken together, these results suggest that VNTR LOH could be linked to tumor progression.

Additionally, in both tumors a large number of loci exhibited loss of both alleles in comparison to the normal tissue (Supplementary Table S2). Given that the coverage for the tumor samples was significantly higher than for the normal tissue, it is unlikely that these observations were due to artifacts. Also, the tumor samples did not show a higher percentage of filtered multi VNTRs (too many alleles) than the normal samples (1.37% and 1.23% in normal tissue versus 1.72% and 1.71% in tumor tissue).

Knowledge of gene associations with somatic tumor mutations (VNTR alleles present in a tumor, but not normal tissue) could be useful as indicators of cancer prognosis and for therapy. In the HC2218 individual, somatic tumor mutations overlapped with lncRNAs ( <"type":"entrez-protein","attrs":<"text":"ACO73336.1","term_id":"226714198","term_text":"ACO73336.1">> ACO73336.1, <"type":"entrez-nucleotide","attrs":<"text":"AC107959.2","term_id":"21431382","term_text":"AC107959.2">> AC107959.2, <"type":"entrez-nucleotide","attrs":<"text":"AL355388.2","term_id":"7940027","term_text":"AL355388.2">> AL355388.2), introns (C3orf67, COX17, DHRS3, DPP6, GAN, PCGF3, RGS12, SLC25A13, SLC6A19, TACR2, TEPP), and promoter regions (TRIM24, DUSP4). Of these, DHRS3, DUSP4, GAN, RGS12, and TRIM14 are known oncogenes or tumor suppressors (as indicated by CancerMine (132)). TRIM24 has been associated with prognosis in breast cancer (133�) and over-expression of DUSP4 has been shown to improve the outcome of chemotherapy and overall survival (136,137).

In the HC1187 individual, somatic tumor mutations overlapped with lncRNAs (LINC01708, AC1058290.1, <"type":"entrez-nucleotide","attrs":<"text":"AC104596.1","term_id":"17647043","term_text":"AC104596.1">> AC104596.1), exons (THNSL2), introns (AJAP1, SMAD1, FLT4, PTPN3, ADAMTSL2, ANO2, SOX5, SGCG, WDR72, NQO1, CCDC200, ARHGAP45, <"type":"entrez-nucleotide","attrs":<"text":"AC005258.1","term_id":"3289992","term_text":"AC005258.1">> AC005258.1, PEAK3) and promoters (HFM1, TBK1, GNS, LEMD3, FGFR3, VIPR2). Of these, AJAP1, FGFR3, GNS, NQO1, PTPN3, SMAD1, SOX5ਊnd TBK1 are known oncogenes or tumor suppressors (132). TBK1 and FGFR3 have been used as treatment targets for HER2+ breast cancer (138,139).

Common versus private VNTRs

Following methodology used with SNPs (140), we classified VNTR loci in the 2504 healthy, unrelated individuals from the NYGC dataset (150਋p and coverage 㸰×) as commonly polymorphic, or 𠆌ommon VNTRs,’ if non-reference alleles occurred in at least 5% of a population (126 individuals) and as private VNTRs if they occurred in less than 1% (25 individuals).

We classified 5676 VNTRs as common (17% of the 33 403 VNTRs detected in this population) and 68% as private. In each sample, we detected, on average, 1951 VNTR loci, and among those, 1783 were common VNTRs (median 1,677) and 46 were private VNTRs (median 17). A total of 3627 common VNTRs overlapped with 2173 protein coding genes including 254 exons. Interestingly, increasing the threshold for common VNTRs did not reduce the number dramatically (Supplementary Figure S6), suggesting that these VNTRs have not occurred randomly, but rather have undergone natural selection. Widespread occurrence of common VNTRs indicates a fitness for use in Genome Wide Association Studies (GWAS). A list of common and private VNTRs can be found in Supplementary Material Common_VNTRs.txt and Private_VNTRs.txt .

Common VNTR enrichment in functionally annotated regions

To determine possible functional effects of the common VNTRs, we classified the overlap of reference TRs with various functionally annotated genomic regions: upstream and downstream of genes, 3′ UTRs, 5′ UTRs, introns, exons, transcription factor binding site (TFBS) clusters, CpG islands, and DNAse clusters (Supplementary Material Reference_set.txt ).

Our reference TR set comprised only 0.52% of the genome, however, 49% of human genes contained at least one TR and 5% of all the TFBS clusters overlapped with TRs. Moreover, high proportions of our TR reference set and common VNTRs intersected with genes (63% and 64%, respectively), TFBS clusters (38% and 51%), and DNAse clusters (21% and 28%) (Supplementary Table S5).

In comparison to TRs, VNTR loci were positively enriched in 1 kb upstream and downstream regions of genes, 5′ and 3′ UTRs, coding exons, TFBS clusters, DNAse clusters, and CpG islands (P-values < 0.05) (Supplementary Tables S4 and S5). The common VNTRs, on the other hand, compared to all VNTRs, were enriched in 1 kb upstream regions of genes, TFBS, and CpG islands, suggesting regulatory function. Private VNTRs were less likely to occur in 1 kb upstream or downstream regions, inside TFBS clusters, open DNAse clusters, or CpG islands.

Focusing on the common VNTRs, we used the LOLAweb (141) online tool to perform enrichment analysis with various curated feature sets (Supplementary Figures S7–S12 in Supplementary Section S5.2), Among the results, DNAse enrichments by tissue type (142) pointed to brain, muscle, epithelial, fibroblast, bone, hematopoietic, cervix, skin, and endothelial (Supplementary Figure S11) with brain showing up multiple times, consistent with findings in the literature (24). These results suggest that VNTR alleles may affect gene regulation in multiple tissues.

VNTR genotypes are correlated with gene expression differences

To detect association between VNTR genotypes and expression of nearby genes, we paired VNTRs to any gene within 10 kb and after removing genes with low expression and controlling for confounders, applied a one-way ANOVA test to determine if there was a significant difference between the average gene expression levels for the VNTR genotypes. A total of 1071 gene-VNTR pairs were tested and 193 pairs (187 genes, 188 VNTRs) exhibited significant expression differences at FDR π.05 (Supplementary Figure S25 and Supplementary Material RNA_VNTRs.txt )). The top 10 genes by mean expression difference were: DPYSL4, KLF11, B4GALNT3, PIP5K1B, DNAJA4, THNSL2, CD151, MXRA7, HEBP1ਊnd FARP1. Of these eQTL VNTRs, 47 also exhibited population-specific alleles (next section and Supplementary Table S15) and 81 overlapped with peaks for histone marks and DNAse hypersensitive sites (see Supplementary Table S13 and Supplementary Figure S31), a significant level of enrichment in these marks (Supplementary Table S14).

Three of the top genes are shown in Figure  2 . Gene MXRA7 is associated with a VNTR (id 182606303) in the 5’ UTR exon, DPYSL4 is associated with a VNTR (id 182316137) in the first intron, and CSTB is associated with an upstream VNTR (id 182814480). The VNTR region in MXRA7 is a target site for transcription factors METTL23 and JMJD6. METTL23 is known to function as a regulator in the transcriptional pathway for human cognition (143) and has been associated with mental retardation and intellectual disability (144). JMJD6 is associated with pancreatitis (145) and tumorgenesis (146,147). Copy number expansions in the VNTR upstream of CSTB have been previously associated with progressive myoclonic epilepsy (EPM1) (148). For this VNTR, we observed the 𢄡 and 0 alleles (2 and 3 copies, respectively), which are common in healthy individuals. However, 201 individuals had genotypes outside of our detection range which likely represented longer expansions and these individuals showed higher expression of this gene. More examples are given in Supplementary Figures S26–S30.

Gene expression differences and VNTR genotype. Shown are violin plots of gene expression values (log2 normalized TPM) for three genes which displayed significant differential expression when samples were partitioned by VNTR allele genotype. Additional examples are shown in Supplementary Figures S26–S30. Genotype is indicated in labels on the X-axis and numbers refer to copies gained or lost relative to the reference allele. ‘Other’ indicates a partition with undetected alleles presumed outside the range of VNTRseek detection (see text). Number of samples in each partition is shown in parenthesis. In these examples, the effect size for at least one genotype class was significant. Top: VNTR 182606303 is upstream of MXRA7 and partially overlaps the 5’ UTR exon. Middle: VNTR 182316137 occurs inside the first intron of DPYSL4. Bottom: VNTR 182814480 occurs upstream of CSTB.

One in five common VNTR loci have population-specific alleles

We further investigated whether VNTR alleles are population-specific and whether they can be used to predict ancestry. Understanding the occurrence of population-specific VNTR alleles will be useful when controlling for population effects in GWAS, and more generally in interpreting gene expression differences among people of different ancestry.

A total of 4605 alleles from the common VNTR loci were classified as common if they were detected in at least 5% of the population (NYGC). We then constructed a matrix of presence/absence of each allele by sample and clustered the samples using Principal Component Analysis. We found that the first, second, fourth, and fifth principal components (PCs) separated the super populations as shown in Figure  3 . Each PC captured a small fraction of the variation in the dataset, suggesting that there was substantial variation between individuals from the same population.

Principal Component Analysis (PCA) of common VNTR alleles in the NYGC population (150਋p). PCA was performed to reduce the dimensions of the data. Left: PC1 captured 𢏅% of the variation and separated Africans from the other super-populations, suggesting that they had the greatest distance from the others. PC2 separated East Asian and European populations but left individuals from the Americas and South Asia mixed. Right: PC4 separated the South Asian population and PC5 separated the American populations. PC3 (not shown) captured batch effects due to differences in coverage. Some American sub-populations proved hardest to separate, likely due to ancestry mixing.

The first PC separated Africans, suggesting furthest evolutionary distance. The second PC separated East Asians. The third PC captured coverage bias. The fourth and fifth PCs separated South Asians and Americans, respectively. The American population had a sub-population of Puerto Ricans that clustered with the Iberian Spanish population, suggesting mixed ancestry (149). To show the power of these alleles to predict ancestry, we next trained a decision tree model (Supplementary Figure S19) using the top 10 PCs (11% of the total variation) and achieved a recall of 㺘% on every population when applied to the 30% test partition (Supplementary Table S10).

A one-sided Fisher’s exact test was applied to determine the population-specific VNTR alleles that were over-represented in one population versus all the others. A total of 3850 VNTR alleles were identified as population-specific in one or more super-populations, corresponding to 1096 VNTR loci (Supplementary Figures S20 and S21). The complete list of population-specific alleles can be found in Supplementary Material Superpopulation_VNTRs.txt . These loci overlapped with 689 genes and 51 coding exons. Africans had the highest number of population-specific alleles (266), followed by East Asians (65), while Americans had the lowest (13), suggesting more mixed ancestry. We observed 63 loci that had a population-specific allele in each population. Figure  4 illustrates seven of the top population-specific loci in a ‘virtual gel’ representation, mimicking the appearance of bands on an agarose gel for easier interpretation. Forty-eight genes that displayed expression differences correlated with VNTR genotype were associated with population-specific VNTR loci (Supplementary Table S15), including the VNTR 182316137 associated with the gene DPYSL4, discussed in the previous section, which exhibited seven different alleles, five of which were population-specific.

‘Virtual gel’ representation of seven population-specific VNTR alleles. Each dot represents an allele in one sample. Samples are separated vertically by super-population. Dots are jiggered in a rectangular area to reduce overlap. Population-specific alleles show up as bands over-represented in one population. Numbers and labels at bottom are VNTR locus ids with nearby genes indicated and the population-specific allele expressed as copy number change (+1, 𢄢, etc.) from the reference. For example, in the leftmost column, the +1 allele was over-represented in the African population. Note that the allele bias towards pattern copy loss relative to the reference allele is apparent and that at one locus (second from left) the reference allele was the population-specific allele since almost no reference alleles were observed in the four other populations. The details of these seven loci are given in Supplementary Table S11.

Finally, to identify potential functional roles of the population-specific VNTR loci we performed Gene Set Enrichment Analysis (GSEA) for the associated genes against the Broad Institute MSigDB (150). Genes overlapping with the population-specific VNTRs were enriched for Endocytosis (hsa04144), Fatty acid metabolism (hsa01212), and Arrhythmogenic right ventricular cardiomyopathy (ARVC) (hsa05412) pathways (Supplementary Table S12). Among the GO biological processes affected by these genes were neurogenesis (GO:0022008 FDR =ਃ.62e 𢄨 ), neuron differentiation (GO:0030182 FDR =ਂ.52e 𢄧 ), and neuron development (GO:0048666 FDR = 4.31e 𢄧 ). These processes are potentially related to other findings that have linked VNTRs to neurodegenerative disorders and cognitive abilities (28,36,56�,151) (Supplementary Material Population_specific_Go_BP.xlsx ). The GO term behavior (GO:0007610 FDR = 2.22e 𢄤 ) was also found, which could be related to the association of VNTR loci with aggressive behavior (152�). Other notable GO terms were regulation of muscle contraction (GO:0006937) and neuromuscular processes related to balancing (GO:0050885) with FDR ρ%. The genes were also highly enriched in midbrain neurotype cell gene signatures (FDR = 5.49e � ), which might affect movement and emotions (155�).

Accuracy of VNTR predictions

To show the reliability of our results, we experimentally validated VNTR predictions at 13 loci in the three related AJ genomes, and also compared VNTRseek predictions to alleles experimentally validated in the literature. We additionally used accurate long reads on one genome (HG002) to find evidence of the predicted alleles. Separately, we showed the consistency of our predictions in two ways: first, we looked at inheritance consistency among four trios (mother, father, child), and second, we compared result for genomes sequenced on two different platforms.

Experimental validation

All but one of the 66 predicted VNTR alleles were confirmed at 13 loci in the three related AJ genomes (child HG002, father HG003ਊnd mother HG004). In the remaining case, two predicted alleles were separated by only 15 nucleotides and could not be distinguished. At two loci, other bands were also observed. In one, all three family members contained an allele outside the detectable range of VNTRseek (longer than the reads). In the other, one allele that was detectable was missed in two family members (see Table  3 for a summary of results, Supplementary Table S6 for details of the experiment, and Supplementary Figures S13–S17 for gel images).

Table 3.

Experimental validation results

VNTRseek predictedExperimental validation
#TR idPattern sizeRef. copiesGeneCFMCFM
1*1823161811054STK32C𢄢𢄢𢄢𢄢, −1𢄢, −1𢄢, +1
2182316985276LINC01168 + 𢄣,਀𢄡,਀𢄣, 𢄢yesyesyes
3182453735302DNAJC30, +10, +10, +1yesyesyes
4182461997387RASA3𢄥, 𢄤𢄥, 𢄤𢄥, 𢄢yesyesyes
5182493720703BEGAIN000, 𢄡yesyesyes
6182515357348MEGF11𢄥𢄥, 𢄦𢄥yesyesyes
7182608886276RPTOR0, 𢄢0, 𢄣𢄢, +1yesyesyes
8182620950483RNF1380, 𢄡0, 𢄡0, 𢄡yesyesyes
9182982510344SLC12A70, 𢄡00, 𢄡yesyesyes
10 # 183046759384ARL1000, 𢄡𢄡0, +1yes𢄡, +1
11 † 183081195154TENT5A+2, 𢄡+2, +1+2, 𢄡yes+2yes
12183117043179MRM2 + , LFNG + 𢄥, 𢄣𢄥, 𢄣0, 𢄣yesyesyes
13183169331154IRF5𢄢𢄢, 0𢄢yesyesyes

Thirteen VNTR loci were selected for experimental validation in the AJ trio. All but one of the 66 bands predicted by VNTRseek were validated. † For the remaining band, the results were questionable because the two predicted alleles for the father were only 15 nucleotides different in length, which was too close to distinguish in the image. *For all three individuals, the gel contained bands (bold) not predicted (or detectable) by VNTRseek. The extra band for the son corresponded to the 𢄡 allele as found in the PacBio reads. The father’s extra band appeared to match with the 𢄡 allele. The mother’s extra band appeared to be a +1 allele. # An extra band for the mother and son (bold) was not predicted by VNTRseek, although it seemed to match the +1 allele that was detectable. +These VNTRs overlapped regulatory sites that target the given genes.

We also compared VNTRseek predictions in three datasets from the NA12878 genome with VNTRs validated in the adVNTR paper (95). Out of the original 17 VNTR loci experimentally validated in that paper, four were not included in our reference set and for one, the matching TR could not be determined. In total, 11 out of 16 detectable alleles were correctly predicted, four were not found in the NA12878 sample with sufficient read size (250਋p), and one was incorrectly predicted in the HG001 sample and not found in the other two (Supplementary Table S7).

Validation of predicted VNTRs using long reads

PacBio Circular Consensus Sequencing reads from the HG002 genome (103), with an average length of 13.5 kb and an estimated 99.8% sequence accuracy, were computationally tested to determine if they confirmed VNTRseek predicted alleles for the GIAB Illumina reads from the same genome. Overall, 㺗% of predicted alleles were confirmed, and at the predicted VNTR loci, 㺇% of alleles were confirmed (Supplementary Table S8).

VNTR predictions are consistent with Mendelian inheritance

We compared the predicted alleles in four trios (CEU and YRI trios from 1000 Genomes Chinese HAN and AJ from GIAB), testing loci on autosomes and X and Y chromosomes (see Materials and Methods). In all cases, only a handful of loci were inconsistent (Supplementary Table S9).

VNTR non-reference allele predictions showed substantial consistency across platforms

In 2015, the 1000 Genomes Phase 3 sequenced 30 genomes using Illumina HiSeq2500 at read length 250਋p. In 2020, 27 of those 30 genomes were resequenced by NYGC using Illumina Novaseq 6000 at read length 150਋p. Comparing VNTR loci genotyped in both platforms and only non-reference alleles detectable at both read lengths, agreement ranged from 76% toꂑ% (Supplementary Figure S18). When including reference alleles, the agreement increased to greater than 99%. Note, however, that read coverage was not the same for both datasets, causing variation in statistical power.

Because of the complementary nature of base-pairing between nucleic acid polymers, a double-stranded DNA molecule will be composed of two strands with sequences that are reverse complements of each other. To help molecular biologists specifically identify each strand individually, the two strands are usually differentiated as the "sense" strand and the "antisense" strand. An individual strand of DNA is referred to as positive-sense (also positive (+) or simply sense) if its nucleotide sequence corresponds directly to the sequence of an RNA transcript which is translated or translatable into a sequence of amino acids (provided that any thymine bases in the DNA sequence are replaced with uracil bases in the RNA sequence). The other strand of the double-stranded DNA molecule is referred to as negative-sense (also negative (-) or antisense), and is reverse complementary to both the positive-sense strand and the RNA transcript. It is actually the antisense strand that is used as the template from which RNA polymerases construct the RNA transcript, but the complementary base-pairing by which nucleic acid polymerization occurs means that the sequence of the RNA transcript will look identical to the sense strand, apart from the RNA transcript's use of uracil instead of thymine.

Sometimes the phrases coding strand and template strand are encountered in place of sense and antisense, respectively, and in the context of a double-stranded DNA molecule the usage of these terms is essentially equivalent. However, the coding/sense strand need not always contain a code that is used to make a protein both protein-coding and non-coding RNAs may be transcribed.

The terms "sense" and "antisense" are relative only to the particular RNA transcript in question, and not to the DNA strand as a whole. In other words, either DNA strand can serve as the sense or antisense strand. Most organisms with sufficiently large genomes make use of both strands, with each strand functioning as the template strand for different RNA transcripts in different places along the same DNA molecule. In some cases, RNA transcripts can be transcribed in both directions (i.e. on either strand) from a common promoter region, or be transcribed from within introns on either strand (see "ambisense" below). [1] [2] [3]

Antisense DNA Edit

The DNA sense strand looks like the messenger RNA (mRNA) transcript, and can therefore be used to read the expected codon sequence that will ultimately be used during translation (protein synthesis) to build an amino acid sequence and then a protein. For example, the sequence "ATG" within a DNA sense strand corresponds to an "AUG" codon in the mRNA, which codes for the amino acid methionine. However, the DNA sense strand itself is not used as the template for the mRNA it is the DNA antisense strand which serves as the source for the protein code, because, with bases complementary to the DNA sense strand, it is used as a template for the mRNA. Since transcription results in an RNA product complementary to the DNA template strand, the mRNA is complementary to the DNA antisense strand.

Hence, a base triplet 3′-TAC-5′ in the DNA antisense strand (complementary to the 5′-ATG-3′ of the DNA sense strand) is used as the template which results in a 5′-AUG-3′ base triplet in the mRNA. The DNA sense strand will have the triplet ATG, which looks similar to the mRNA triplet AUG but will not be used to make methionine because it will not be directly used to make mRNA. The DNA sense strand is called a "sense" strand not because it will be used to make protein (it won't be), but because it has a sequence that corresponds directly to the RNA codon sequence. By this logic, the RNA transcript itself is sometimes described as "sense".

Example with double-stranded DNA Edit

Some regions within a double-stranded DNA molecule code for genes, which are usually instructions specifying the order in which amino acids are assembled to make proteins, as well as regulatory sequences, splicing sites, non-coding introns, and other gene products. For a cell to use this information, one strand of the DNA serves as a template for the synthesis of a complementary strand of RNA. The transcribed DNA strand is called the template strand, with antisense sequence, and the mRNA transcript produced from it is said to be sense sequence (the complement of antisense). The untranscribed DNA strand, complementary to the transcribed strand, is also said to have sense sequence it has the same sense sequence as the mRNA transcript (though T bases in DNA are substituted with U bases in RNA).

3′CGCTATAGCGTTT 5′ DNA antisense strand (template/noncoding) Used as a template for transcription.
5′GCGATATCGCAAA 3′ DNA sense strand (nontemplate/coding) Complementary to the template strand.
5′GCGAUAUCGCAAA 3′ mRNA sense transcript RNA strand that is transcribed from the noncoding (template/antisense) strand. Note 1 : Except for the fact that all thymines are now uracils (T → U), it is complementary to the noncoding (template/antisense) DNA strand and identical to the coding (nontemplate/sense) DNA strand.
3′CGCUAUAGCGUUU 5′ mRNA antisense transcript RNA strand that is transcribed from the coding (nontemplate/sense) strand. Note: Except for the fact that all thymines are now uracils (T → U), it is complementary to the coding (nontemplate/sense) DNA strand and identical to the noncoding (template/antisense) DNA strand.

The names assigned to each strand actually depend on which direction you are writing the sequence that contains the information for proteins (the "sense" information), not on which strand is depicted as "on the top" or "on the bottom" (which is arbitrary). The only biological information that is important for labeling strands is the relative locations of the terminal 5′ phosphate group and the terminal 3′ hydroxyl group (at the ends of the strand or sequence in question), because these ends determine the direction of transcription and translation. A sequence written 5′-CGCTAT-3′ is equivalent to a sequence written 3′-TATCGC-5′ as long as the 5′ and 3′ ends are noted. If the ends are not labeled, convention is to assume that both sequences are written in the 5′-to-3′ direction. The "Watson strand" refers to 5′-to-3′ top strand (5′→3′), whereas the "Crick strand" refers to the 5′-to-3′ bottom strand (3′←5′). [4] Both Watson and Crick strands can be either sense or antisense strands depending on the specific gene product made from them.

For example, the notation "YEL021W", an alias of the URA3 gene used in the National Center for Biotechnology Information (NCBI) database, denotes that this gene is in the 21st open reading frame (ORF) from the centromere of the left arm (L) of Yeast (Y) chromosome number V (E), and that the expression coding strand is the Watson strand (W). "YKL074C" denotes the 74th ORF to the left of the centromere of chromosome XI and that the coding strand is the Crick strand (C). Another confusing term referring to "Plus" and "Minus" strand is also widely used. Whether the strand is sense (positive) or antisense (negative), the default query sequence in NCBI BLAST alignment is "Plus" strand.

A single-stranded genome that is used in both positive-sense and negative-sense capacities is said to be ambisense. Some viruses have ambisense genomes. Bunyaviruses have three single-stranded RNA (ssRNA) fragments, some of them containing both positive-sense and negative-sense sections arenaviruses are also ssRNA viruses with an ambisense genome, as they have three fragments that are mainly negative-sense except for part of the 5′ ends of the large and small segments of their genome.

An RNA sequence that is complementary to an endogenous mRNA transcript is sometimes called "antisense RNA". In other words, it is a non-coding strand complementary to the coding sequence of RNA this is similar to negative-sense viral RNA. When mRNA forms a duplex with a complementary antisense RNA sequence, translation is blocked. This process is related to RNA interference. Cells can produce antisense RNA molecules naturally, called microRNAs, which interact with complementary mRNA molecules and inhibit their expression. The concept has also been exploited as a molecular biology technique, by artificially introducing a transgene coding for antisense RNA in order to block the expression of a gene of interest. Radioactively or fluorescently labelled antisense RNA can be used to show the level of transcription of genes in various cell types.

Some alternative antisense structural types have been experimentally applied as antisense therapy. In the United States, the Food and Drug Administration (FDA) has approved the phosphorothioate antisense oligonucleotides fomivirsen (Vitravene) [5] and mipomersen (Kynamro) [6] for human therapeutic use.

In virology, the term "sense" has a slightly different meaning. The genome of an RNA virus can be said to be either positive-sense, also known as a "plus-strand", or negative-sense, also known as a "minus-strand". In most cases, the terms "sense" and "strand" are used interchangeably, making terms such as "positive-strand" equivalent to "positive-sense", and "plus-strand" equivalent to "plus-sense". Whether a viral genome is positive-sense or negative-sense can be used as a basis for classifying viruses.

Positive-sense Edit

Positive-sense (5′-to-3′) viral RNA signifies that a particular viral RNA sequence may be directly translated into viral proteins (e.g., those needed for viral replication). Therefore, in positive-sense RNA viruses, the viral RNA genome can be considered viral mRNA, and can be immediately translated by the host cell. Unlike negative-sense RNA, positive-sense RNA is of the same sense as mRNA. Some viruses (e.g. Coronaviridae) have positive-sense genomes that can act as mRNA and be used directly to synthesize proteins without the help of a complementary RNA intermediate. Because of this, these viruses do not need to have an RNA polymerase packaged into the virion—the RNA polymerase will be one of the first proteins produced by the host cell, since it is needed in order for the virus's genome to be replicated.

Negative-sense Edit

Negative-sense (3′-to-5′) viral RNA is complementary to the viral mRNA, thus a positive-sense RNA must be produced by an RNA-dependent RNA polymerase from it prior to translation. Like DNA, negative-sense RNA has a nucleotide sequence complementary to the mRNA that it encodes also like DNA, this RNA cannot be translated into protein directly. Instead, it must first be transcribed into a positive-sense RNA that acts as an mRNA. Some viruses (e.g. influenza viruses) have negative-sense genomes and so must carry an RNA polymerase inside the virion.

Gene silencing can be achieved by introducing into cells a short "antisense oligonucleotide" that is complementary to an RNA target. This experiment was first done by Zamecnik and Stephenson in 1978 [7] and continues to be a useful approach, both for laboratory experiments and potentially for clinical applications (antisense therapy). [8] Several viruses, such as influenza viruses [9] [10] [11] [12] Respiratory syncytial virus (RSV) [9] and SARS coronavirus (SARS-CoV), [9] have been targeted using antisense oligonucleotides to inhibit their replication in host cells.

If the antisense oligonucleotide contains a stretch of DNA or a DNA mimic (phosphorothioate DNA, 2′F-ANA, or others) it can recruit RNase H to degrade the target RNA. This makes the mechanism of gene silencing catalytic. Double-stranded RNA can also act as a catalytic, enzyme-dependent antisense agent through the RNAi/siRNA pathway, involving target mRNA recognition through sense-antisense strand pairing followed by target mRNA degradation by the RNA-induced silencing complex (RISC). The R1 plasmid hok/sok system provides yet another example of an enzyme-dependent antisense regulation process through enzymatic degradation of the resulting RNA duplex.

The 1000 Genomes Project more than doubles catalog of human genetic variation

Genetic variation explains part of why people look different and vary in their risk for diseases. The goal of the 1000 Genomes Project is to identify and compile variants in the human genome that occur at a frequency of at least 1 in 50 people. Although most of these genetic variants cause little if any effect, some contribute to disease, and others are beneficial. An example of a beneficial difference is a rare genetic variant that blocks the human immunodeficiency virus from infecting white blood cells and, thus, protects people exposed to HIV who carry this variant. "

The 1000 Genomes Project is a large, international effort aiming to characterize human genetic variation, including people from many different populations," said Eric D. Green, M.D., Ph.D., NHGRI director. "The newly published findings provide deeper insights about the presence and pattern of variants in different people's genomes, which is critical information for studying the genomic basis of human disease."

The expanded catalog allows medical researchers to locate genetic differences contributing to rare and common diseases more precisely. Identifying the genetic underpinnings of disease will help lead to new diagnostic tests and, in some cases, treatments.

"I view this project as a Lewis and Clark expedition to the interior of the human genome," said Stephen Sherry, Ph.D., chief of the Reference Collections Section, Information Engineering Branch at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine. "We knew the outlines and contours (of the genome). Now, we're trying to document all the fine details such as the rivers and tributaries."

So far, project researchers have sequenced the genomes of 1,092 people from 14 populations in Europe, East Asia, sub-Saharan Africa and the Americas. Ultimately, they will study more than 2,500 individuals from 26 populations.

All of the participants consented to inclusion, in an open online database, of sequence data derived from their anonymous DNA samples. Each part of these genomes were read (or sequenced) an average of six times, which provides accurate information about common genetic variants but misses many rare variants.

To identify rare variants in the exome, the part of the genome that codes for proteins, the researchers sequenced the exons of 15,000 genes in each genome an average of 80 times. The study discovered 99.8 percent of exome variants with a frequency of at least 1 percent and 99.3 percent of variants elsewhere in the genome with a frequency of at least 1 percent.

Phase one of the 1000 Genomes Project, the subject of the Nature paper, has produced a massive amount of genomic data. Simply recording the raw information takes some 180 terabytes of hard-drive space, enough to fill more than 40,000 DVDs. All of the information is freely available on the Internet through public databases such as ones at the National Center for Biotechnology Information at the U.S. National Library of Medicine in Bethesda, Md., and the European Bioinformatics Institute in Hinxton, England. Data from the project have been available to researchers since 2008. The massive dataset became available in the cloud this year via Amazon Web Services (AWS). Cloud access enables users to analyze large amounts of the data much more quickly, as it eliminates the time-consuming download of data and because users can run their analyses over many servers at once. Researchers pay only for the additional AWS resources they need to further process or analyze the data.

"With this project, we have succeeded in making sure that information about our shared genetic heritage, and the common DNA variants we carry, are freely available for researchers to use to benefit patients around the world," said David Altshuler, M.D., Ph.D., an endocrinologist at Massachusetts General Hospital who directs the Broad Institute's Program in Medical and Population Genetics, and who co-leads the 1000 Genomes Project. "Moreover, the tools and methods that this project has helped foster are being used now in disease-oriented genetics research and will be used increasingly in clinical care."

All the genetic information for making an organism resides in the DNA, which is a set of long molecules made of units called bases. Each base is a chemical unit abbreviated A, C, G or T. For this paper, the researchers identified 38 million single-nucleotide polymorphisms, or SNPs (pronounced "snips"), which are DNA variants that occur when a particular basein the genome sequence differs among people.

These variants are the most common genetic differences among people. Each SNP is like a landmark, reflecting a specific position in the genome where the DNA spelling differs by one letter among people.

They also identified variants in the linear structure of the DNA, including 1.4 million short indels (insertions or deletions of DNA as small as a single base or as large as 50 bases) and 14,000 large deletions of DNA.

SNPs and structural variants can help explain an individual's susceptibility to disease, response to drugs, or reaction to environmental factors such as air pollution or stress. Other studies have found an elevated rate of indels in diseases such as autism and schizophrenia, although it's not yet clear how they affect those diseases.

"Project researchers discovered that each person carries a handful of rare variants that would currently be recognized as disease-causing and a few hundred more rare variants that are likely to have a detrimental effect on how genes work," said Gilean McVean, Ph.D., professor of statistical genetics at the University of Oxford in England and co-leader of the 1000 Genomes Project Analysis Group. "It's fortunate that most of us usually carry only one copy of these variants since two copies might lead to disease."

Another large NHGRI-funded effort, the ENCODE Project, recently published a series of papers showing that large parts of the human genome outside of protein-coding regions affect gene regulation. The patterns of variation in these regions that the 1000 Genomes Project found provide additional evidence about the functionality of these regions.

Data analysis is a vital part of the project, and about 260 analysts participated in analyzing the data reported in the recent Nature publication. They mapped and assembled the raw DNA sequence data relative to the reference human genome sequence. They then analyzed the aligned sequences to locate SNPs and structural variants. SNPs are relatively easy to find structural variants (such as insertions, deletions and copy number differences) are much harder to find. A number of research groups are working to establish how to go from the raw sequence data to identifying structural variants.

Many research groups contributed to the generation of genome sequence data for this project, including NHGRI's large-scale sequencing centers: the Human Genome Sequencing Center at the Baylor College of Medicine, Houston The Broad Institute of MIT and Harvard University in Cambridge, Mass., and The Genome Institute at the Washington University School of Medicine in St. Louis. Other groups included the Wellcome Trust Sanger Institute in Hinxton, England BGI Shenzhen in Shenzhen, China the Max Planck Institute for Molecular Genetics in Berlin and Illumina, Inc., in San Diego.

The 1000 Genomes Project eliminates time-consuming steps for researchers trying to find genetic variants that affect a disease. Genome-wide association studies aim to find regions of the genome that contain DNA variants relevant to a disease. They use technologies that provide information about hundreds of thousands to a couple of million SNPs in each studied genome they can combine these data with 1000 Genomes Project data on tens of millions of variants to find regions affecting the disease more precisely. The 1000 Genomes Project data then can be used to greatly enhance such studies by providing more detailed information about known variants. Instead of sequencing the genomes of all the people in a study - still an expensive prospect for thousands of people - researchers can use the 1000 Genomes Project data to find most of the variants in those regions that are associated with the disease.

"Once researchers find genes and variants of interest associated with disease by using the 1000 Genomes Project data, they have to return to basic biology to study them one at a time, to establish which genes and variants are causal for the disease and not just along for the ride," said Lisa D. Brooks, Ph.D., program director of the Genetic Variation Program in NHGRI's Division of Genome Sciences. "The 1000 Genomes Project data accelerate their ability to close in on those genes and variants."

Planning for the $120 million project began in 2007. In 2010, researchers published data on three pilot studies. The 2012 data set will be followed by the last addition to the catalog in 2013.


PGAP general structure and input

The PGAP pipeline is designed to annotate both complete genomes and draft genomes comprising multiple contigs. PGAP is deeply integrated into NCBI infrastructure and processes, and uses a modular software framework, GPipe, developed at NCBI for execution of all annotation tasks, from fetching of raw and curated data from public repositories (the Sequence and Assembly databases) through sequence alignment and model-based gene prediction, to submission of annotated genomic data to public NCBI databases.

On input, PGAP accepts an assembly (either draft or complete) with a predefined NCBI Taxonomy ID that defines the genetic code of the organism. PGAP also accepts a predetermined clade identifier, matching the genome in question to a species-specific clade. Clade IDs are computed using a series of 23 universal ribosomal protein markers and are independent of taxonomy. In the absence of a clade ID, we can infer the ID from taxonomy in the majority of cases. The clade ID determines the realm of core proteins used as the target protein set. PGAP annotation of a new genomic sequence can be requested at the time of submission to GenBank. Taxonomic and clade identifiers are determined outside of the annotation pipeline, and are influenced by GenBank curatorial decisions. The clade-dependent sets of protein clusters as well as sets of curated structural ribosomal RNAs (5S, 16S and 23S) are generated and maintained outside of PGAP. More details on the PGAP workflow are provided below.

Pan-genome approach

The genome sequencing revolution has radically altered the field of microbiology. Whole-genome sequencing for prokaryotes became a standard method of study ever since the first complete genome of free-living organism, Haemophilus influenza, was sequenced in 1995 ( 14). Due to the widespread use of the next generation sequencing (NGS) techniques, thousands of genomes of prokaryotic species are now available, including genomes of multiple isolates of the same species, typically human pathogens. Thus, the mere density of comparative genomic information for high interest organisms provides an opportunity to introduce a pan-genome based approach to prediction of the protein complement of a species.

The collection of prokaryotic genomes available at NCBI is growing exponentially and shows no signs of abating: as of January 2016 NCBI's assembly resource contains 57 890 genome assemblies representing 8047 species (see genome browser, for the up-to-date information). Notably, genomes of different strains of the same species can vary considerably in size, gene content and nucleotide composition. In 2005, Tettelin et al. ( 15) introduced the concept of pan-genome, aiming to provide a compact description of the full complement of genes of all the strains of a species. Genes common to all pan-genome members (or to the vast majority of them) are called core genes those present in just a few clade members are termed accessory or dispensable genes genes specific to a particular genome (strain) are termed unique genes ( 16).

In PGAP we define the pan-genome of a clade at a species or higher level ( 17). To be included as a core gene for a species-level pan-genome, we require the gene to be present in the vast majority—at least 80%—of all genomes in the clade. A set of core genes gives rise to a set of core proteins. We show in Figure 1 how the number of protein clusters, for each of four well studied large clades, depends on the fraction of the clade members that contribute proteins to the cluster. There are three critical regions in this analysis: (i) unique genes, present in less than 1% of all clade members (ii) dispensable genes, present in 1–20% of genomes and (iii) core genes, found in at least 80% of the represented genomes. Based on our analysis, there are very few clusters appearing in at least 20% of the members of a clade but no more than 80% of the members. The use of a cutoff of 80% was chosen to capture a wide set of genes conserved within the whole clade while eliminating genes having less abundant representation. We further subject the core proteins to clustering using USearch to reduce the total number of proteins required to represent the full protein complement of the pan-genome ( 18). We use the representative core proteins to infer genes for homologous core proteins in a newly sequenced genome ( 19).

Cumulative number of protein clusters (Y) is defined for a given X (%) as the number of clusters containing proteins from fraction x ≥ X of all members of the clade. Data are presented for the four well studied clades.

Cumulative number of protein clusters (Y) is defined for a given X (%) as the number of clusters containing proteins from fraction x ≥ X of all members of the clade. Data are presented for the four well studied clades.

The notion of the pan-genome can be generalized beyond a species level and applies, in fact, to any taxonomy level (from genus to phylum to kingdom). Notably, in the pan-genomes of Archaea and Bacteria, the universally conserved ribosomal genes make a group of core genes. The main practical value of the pan-genome approach is in formulating an efficient framework for comparative analysis of large groups of closely related organisms separated by small evolutionary distances as defined by ribosomal protein markers ( 20, 21).

Prediction of RNA genes (structural RNA, tRNA, small ncRNA)

Structural rRNAs (5S, 16S and 23S) are highly conserved in closely related prokaryotic species. The NCBI RefSeq Targeted Loci collection ( 22) contains curated sets of the three types of rRNA gene sequences, which serve as reference sets for PGAP ( To identify genes for 16S and 23S rRNAs PGAP uses members of the reference sets as queries in BLASTn ( 23). Hits that correspond to partial alignments are dropped if they fall below a certain coverage and identity thresholds with respect to the average length of the corresponding rRNA (50% coverage and 70% identity for 16S rRNA 50% coverage and 60% identity for 23S rRNA). Borders of predicted rRNA genes are defined by a voting mechanism similar to the one mentioned below for identifying gene starts among several alternative start codons.

For prediction of 5S rRNAs and small ncRNAs, PGAP uses cmsearch (ver. 1.1.1) along with covariance models, score thresholds and recommended command line options from the Rfam database (release 12.0 ( 7)). Current execution of this cmsearch version has been optimized to permit direct use of the tool without a preliminary BLASTn search ( 5–7).

For prediction of tRNA sequences, PGAP relies on tRNAscan-SE. The input genomic sequence is split into overlapping fragments long enough to cover a tRNA gene with possible introns. These fragments are used as inputs to tRNAscan-SE ( 8), currently one of the most widely used tRNA gene identification tools. Domain specific parameters of tRNAscan-SE are selected automatically for each genome ( 8).

All predicted RNA genes from the above steps are collected and presented to GeneMarkS+ as a set of RNA gene ‘footprints’ (Figure 2). GeneMarkS+ has several labels (‘M’, ‘N’ and ‘R’) for RNA gene footprints the labels specify different types of possible overlaps between protein-coding genes and RNA genes.

A fragment of the PGAP execution graph: prediction of structural RNA genes (ncRNA, tRNA, 5S-, 16S-, 23S- rRNA).

A fragment of the PGAP execution graph: prediction of structural RNA genes (ncRNA, tRNA, 5S-, 16S-, 23S- rRNA).

Repetitive regions (CRISPRs) and mobile genetic elements (prophages)

Clustered regularly interspaced short palindromic repeats (CRISPRs), along with associated proteins, comprise a prokaryotic defense system. Interest in CRISPRs has recently blossomed due to the opportunity to use the system as programmable restriction enzymes. CRISPRs are commonly found in almost all archaeal genomes while they are less frequent in bacterial genomes ( 24, 25). For CRISPRs prediction and annotation PGAP uses a combination of the CRISPR recognition tool (CRT) ( 26) and PILER-CR ( 27).

Phage and plasmid genes are frequent subjects of horizontal transfer and may be difficult to predict by gene finding tools that are tuned to identify native genes or foreign genes adapted to the genomic context in evolution to ( 1). Some phage and plasmid genes can be identified by alignment based methods. In PGAP, we utilize a curated set of phage and plasmid proteins and map these proteins to a genome in question using tBLASTn and ProSplign as described in the section below. The reference set of phage proteins was created earlier in a project on inference and curation of phage protein clusters ( 4). High-scoring phage and plasmid protein alignments form another set of footprints in the input to GeneMarkS+.

Protein alignments and integration of multiple evidence types into genome annotation with GeneMarkS+

GeneMarkS+, a self-training gene finder, was designed to detect protein-coding patterns in genomic sequence and to infer gene locations compatible with externally supplied evidence, the hints or footprints, indicating presence of various functional elements in specific sequence intervals. Examples of hints include protein-coding intervals suggested by protein-to-genome alignments intervals covering RNA genes identified by specialized tools or intervals recognized as repetitive sequences, e.g. elements of CRISPR. The core of the integration algorithm is the ab initio gene finding tool GeneMarkS ( 1), which implements the Viterbi algorithm for a hidden semi-Markov model (HSMM) of prokaryotic genomic sequence ( 28). External hints transformed into sequence labels restrict possible parses of a genome in question into protein-coding and non-protein-coding regions. Of note, these constraints reduce the set of possible parses and, thus, help accelerate execution of the Viterbi algorithm. Species specific parameters of GeneMarkS are determined by iterative unsupervised training ( 1) on the whole genomic sequence submitted for annotation (see Figure 3) thus the training occurs without any hints defined restrictions.

Flowchart of PGAP. The red dotted line indicates separation between pass one and pass two (see text for details).

Flowchart of PGAP. The red dotted line indicates separation between pass one and pass two (see text for details).

In the first round (or pass) of annotation we align representative proteins from the clade specific core clusters using tBLASTn ( 22) to the genome. High scoring protein alignments are further refined by ProSplign, a frameshift-aware protein to genome aligner. An added benefit of ProSplign is its ability to identify cases in which newly defined genes for core proteins are frameshifted ( The core protein alignments are then transformed into footprints (interval hints) for consumption by GeneMarkS+. In the presence of a frameshift, the pipeline generates a ‘corrected’ sequence of the genome, adding or subtracting bases to repair the translation frame. This repaired sequence is consumed by GeneMarkS+. During this first pass (top half of Figure 3) the core protein footprints are used as interval hints (with the label ‘C’) that carry information on the frame of either an intact mapped gene, a repaired pseudogene or a repaired gene with sequencing error. Notably, the frameshifted regions are ‘repaired’ not only in the first pass but in the second pass as well, when the focus moves to identification of genes of non-core proteins. The first pass of the pipeline concludes with execution of the first run of GeneMarkS+ making the initial prediction of the genome-wide complement of genes and proteins.

In the second pass (bottom half of Figure 3), proteins predicted by GeneMarkS+ in locations that were not covered by the footprints of core proteins are considered as the ‘seeds’ or ‘prototypes’ of the non-core genes. These seeds are searched via BLASTp against a broader database of RefSeq proteins to identify a subset of seeds supported by evolutionarily conserved non-core proteins. Each identified seed protein is aligned back to the genome using ProSplign. The ProSplign non-core protein alignments are combined with the core protein alignments from the first pass and the predictions of all protein-coding regions are transformed into footprints (intervals with ‘C’ label) defined on the repaired genomic sequence.

Importantly, the core genes predicted in the first round as well as the first round ab initio predicted genes are evaluated in the second round with BLASTp searching against a broader set of representative proteins from all prokaryotic protein clusters. High-scoring BLASTp hits are then mapped back to the genome, and proteins that map incompletely or with disruptions are realigned to the genome using tBLASTn and ProSplign. The rationale of repeating this analysis is that the larger set of all protein clusters provides not only new but also more extensive evidence for predicted gene models thus improving the accuracy. To give an example (Figure 4) in place of overlapping CDS features predicted by ab initio in the first round (panel A), frameshifts can be identified when, in the second round, proteins homologous to the predicted CDS sequences are aligned by ProSplign to genomic sequence (panel B). In another example (Figure 5) one of the alternative start codons (panel A) may get more supporting evidence than the other in the second round of alignments (panel B). Genomic intervals predicted to be protein-coding by the two rounds of alignments make the set of ‘protein footprints’ used as input in the second run of GeneMarkS+.

A region in the Deinococcus radiodurans R1 genome assembly (GCA_000008565.1) contains three overlapping ORFs predicted ab initio as CDSs in the first pass of PGAP. Automatic evaluation of the cross-species protein evidence through the second pass of PGAP reveals proteins bearing homology to all three fragments. Alignment of the proteins to the genome reveals otherwise unpredicted frameshifts. Green bars represent genes, red bars – coding regions grey bars – alignments with red vertical bars indicating mismatches. (A) A region of Chromosome 1 of D. radiodurans (AE000513.1) containing the three CDS features is displayed alongside the six-frame translation. (B) The same region, updated to include final annotation markup with a frameshifted CDS as well as supporting proteins that demonstrate a consistent pattern and location of two frameshifts (marked by arrows at positions 100 733 and 100 959).

A region in the Deinococcus radiodurans R1 genome assembly (GCA_000008565.1) contains three overlapping ORFs predicted ab initio as CDSs in the first pass of PGAP. Automatic evaluation of the cross-species protein evidence through the second pass of PGAP reveals proteins bearing homology to all three fragments. Alignment of the proteins to the genome reveals otherwise unpredicted frameshifts. Green bars represent genes, red bars – coding regions grey bars – alignments with red vertical bars indicating mismatches. (A) A region of Chromosome 1 of D. radiodurans (AE000513.1) containing the three CDS features is displayed alongside the six-frame translation. (B) The same region, updated to include final annotation markup with a frameshifted CDS as well as supporting proteins that demonstrate a consistent pattern and location of two frameshifts (marked by arrows at positions 100 733 and 100 959).

Annotation of genome of Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (NC_003197). Protein alignment provides support for gene start selection. See legend to Figure 4 for description of the meaning of green, red and gray bars. (A) the first round of alignments of protein representatives from the ‘core’ protein clusters doesn't give enough evidence for gene start selection. (B) the second round of alignments clearly supports a shorter gene model which does not overlap with the upstream gene.

Annotation of genome of Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (NC_003197). Protein alignment provides support for gene start selection. See legend to Figure 4 for description of the meaning of green, red and gray bars. (A) the first round of alignments of protein representatives from the ‘core’ protein clusters doesn't give enough evidence for gene start selection. (B) the second round of alignments clearly supports a shorter gene model which does not overlap with the upstream gene.

These final protein-coding footprints are combined with information on other types of hints, including RNA genes and CRISPR sequences, prior to the second pass of GeneMarkS+. The presence of RNA genes (rRNA, tRNA, small ncRNA) prohibits prediction of protein-coding genes in the same location, less small overlaps. This restriction is particularly important for low GC content genomes in which sequences of RNA genes frequently trigger prediction of false protein-coding genes due to the relatively high GC content of RNA genes.

We also considered other types of hints that occur rarely and may affect one or two genes in a whole genome. Examples include non-canonical start sites or sites of non-conventional amino acid translation, such as selenocysteine. While these hints do not have matching labels, GeneMarkS+ is able to integrate such hints outside the labeling routine by assigning a non-canonical start codon or by making a stop codon at specified location a part of the protein-coding region (stop codon read-through).

In PGAP, the second run of GeneMarkS+ produces the final structural gene annotation in which all identified protein-coding and non-coding regions are compatible with the whole set of externally defined hints (labels). One critical aspect of the algorithm revolves around protein-coding region start site selection. Within PGAP, we determine start sites using a combination of ab initio and alignment evidence. In the presence of even a single aligned protein with sufficiently strong score we give more weight to the start site determined by the alignment in the absence of protein alignment information, GeneMarkS+ prediction determines the start site. There are two specific cases that we must then consider within pipeline execution. First, the protein evidence we use, as described above, is generated by a sample of representatives of the clustered pan-genome proteins. As a result, each such protein carries a weight proportional to the number of proteins it represents. Therefore, when alignments suggest alternative start sites we use a simple voting algorithm to favor the start site with the largest weight. Second, we must filter out cases when the protein homology in some regions is too weak to reliably support protein-coding hints. These weak alignments are filtered out to allow GeneMarkS+ to operate in such regions in ab initio mode to identify gene models in all possible frames.

In practice, for genomes from well-studied species such as E. coli and S. typhimurium that have large sets of core genes, evidence from core proteins is abundant and the role of ab initio prediction is limited. Conversely, in less well studied species where the full complement of core proteins less well known or reduced to clusters of ribosomal proteins, the protein-coding regions are predicted mostly by GeneMarkS+ running in the GeneMarkS mode (ab initio). Thus, the two-pass approach outlined here is robust in the face of a wide variety of taxa and genome features.

As new organisms are sequenced and annotated, PGAP updates clusters of core proteins. In turn, the use of newly updated clusters allows for iterative improvement of annotation of all prokaryotic genomes in the database.

Protein naming

Protein naming is a critical part of the PGAP pipeline, since the protein name has to reflect the protein function. We use a BLASTp search for all newly identified protein products against a specialized database comprised of representatives of all automatically derived prokaryotic protein clusters ( 4), reviewed proteins from the UniProt-SwissProt Protein Knowledgebase ( 29) and all curated bacteriophage proteins from the RefSeq collection. The resulting BLASTp alignments are filtered to remove those that fall below thresholds of identity (25%) and coverage (the alignment region should comprise at least 70% of both query and subject (target) length). Each BLASTp alignment is assigned a distance value, varying between 0 (identical) and 1 (entirely dissimilar) based on the ratio of the BLASTp score to the expected BLASTp score for a perfect match alignment. A query protein is assigned to the closest identification cluster situated at least at 0.5 distance from the query protein. However, no such identification is made if within a distance of 0.1 from the query protein there is another ‘competing’ identification cluster. Thus, we favor unique and unambiguous assignments: if a query protein matches best to one and only one identification cluster, we accept the match if there is an ambiguous placement (i.e. a close unrelated identification cluster), we do not accept the match. The NCBI protein naming conventions use similar rules that have been adopted by the UniProt-SwissProt Protein Knowledgebase ( 29, 30) (

Assignment of protein names, however, important this part of the pipeline is, is not in the focus of this paper. The full set of rules of functional annotation along with any additional related information will be described in a forthcoming publication.

The PGAP execution system

The new PGAP uses a robust, high-performance execution framework (GPipe) developed for in-house use at NCBI. This system provides distributed parallel computing, robust tracking of all execution tasks and optimization of compute-intensive steps. Tracking of execution and decision-making is a critical feature that permits easy retrieval of the evidence behind key algorithmic decisions. The modular structure of PGAP allows for easy incorporation of new algorithms.

At its heart, GPipe is a workflow management tool that describes collections of tasks connected by dataflow between programs. In the GPipe model, execution consists of generating a build (statement of intent to complete a workflow). Each build owns one or more build-runs (an actual attempt to complete the workflow). A build-run contains an execution graph, which connects execution of specific tasks to other tasks using strongly-type data connections as edges. Each task in turn contains potentially multiple executions of a series of programs. All parameters and dataflow connections for all executions are tracked in a relational database and can be queried to identify historical usage patterns and deviations from expected executions. Workflows can be subdivided into specialized subgraphs that execute repetitive tasks these subgraphs can be created many times in each build-run to handle varying types of evidence using identical sets of actions (e.g. Figure 2).

The pipeline execution environment consists of four major components: (i) a database in which tasks are organized as builds and build-runs (ii) a series of graphs and graph templates used to organize execution tasks (iii) C++ object code implementing the execution tasks and (iv) an execution application that reads build definitions from the database and executes the appropriate tasks.

PGAP output: validation and final annotation

In addition to identifying structural components and establishing functional identification, PGAP produces reports in a wide variety of formats used by curators, submitters, as well as internal submission tools. Included in the reports produced are: (i) the primary annotated genome objects, represented in NCBI's ASN.1 data model and suitable for direct submission into GenBank (ii) a report on annotation markup discrepancies requiring submitter or curator attention (iii) genome annotation in GenBank flat file format ready for manual review and public display and (iv) statistics from the annotation process along with citation of supporting evidence for each gene model. Concomitant with the ASN.1 markup produced in PGAP, we record the evidence selected for each model to support the selection of the specific start/stop site as well as evidence used to support the functional identification of each inferred object (e.g. Figure 6).

A summary of PGAP genome annotation process is provided in the COMMENT section of GenBank and RefSeq records. The example is given for Listeria monocytogenes strain CFSAN010068, complete genome NZ_CP014250.1.

9.1C: Viral Genomes

  • Contributed by Boundless
  • General Microbiology at Boundless

Viral diseases have an enormous impact on human health worldwide. Genomic technologies are providing infectious disease researchers an unprecedented capability to study at a genetic level the viruses that cause disease and their interactions with infected hosts. An enormous variety of genomic structures can be seen among viral species as a group, they contain more structural genomic diversity than plants, animals, archaea, or bacteria. There are millions of different types of viruses, although only about 5,000 of them have been described in detail.

Figure: Diagram of a virus: The location of the genome inside the virus.

A virus has either DNA or RNA genes and is called a DNA virus or a RNA virus, respectively. The vast majority of viruses have RNA genomes. Plant viruses tend to have single-stranded RNA genomes and bacteriophages tend to have double-stranded DNA genomes. Viral genomes are circular, as in the polyomaviruses, or linear, as in the adenoviruses. The type of nucleic acid is irrelevant to the shape of the genome. Among RNA viruses and certain DNA viruses, the genome is often divided up into separate parts, in which case it is called segmented. For RNA viruses, each segment often codes for only one protein, and they are usually found together in one capsid. However, all segments are not required to be in the same virion for the virus to be infectious, as demonstrated by the brome mosaic virus and several other plant viruses.

A viral genome, irrespective of nucleic acid type, is almost always either single-stranded or double-stranded. Single-stranded genomes consist of an unpaired nucleic acid, analogous to one-half of a ladder split down the middle. Double-stranded genomes consist of two complementary paired nucleic acids, analogous to a ladder. The virus particles of some virus families, such as those belonging to the Hepadnaviridae, contain a genome that is partially double-stranded and partially single-stranded.

The Genomic GPS is available at [12] under the MIT license (Software DOI: [13]). Genotype data for 1000Genomes Phase 1 was downloaded from Imputed haplotype data of 1000Genomes Phase 3 was downloaded from [5]. WTCCC data used for sample overlap simulation can be accessible via EGA accession number EGAD00000000001 for 1958 British Birth Cohort and EGAD00000000008 for Type 1 Diabetes (T1D) samples ( [7]. POPRES data used for PC map analysis is accessible via dbGaP Study accession number phs000145.v4.p2 ( [10]. Detailed description of datasets used in this research can be found in Additional file 1: Supplementary Note.

Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nat Rev Genet. 201415(6):409–21.

Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013339(6117):321.

He D, Furlotte NA, Hormozdiari F, Joo JWJ, Wadia A, Ostrovsky R, et al. Identifying genetic relatives without compromising privacy. Genome Res. 201424(4):664–72.

Mantilla-Gaviria I, Leonardi M, Galati G, Balbastre J. Localization algorithms for multilateration (MLAT) systems in airport surface surveillance. SIViP. 20149:1–10.

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015526:68–74.

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 201188(1):76–82.

The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007447:661–78.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008456(7218):98–101.

Yang W-Y, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nat Genet. 201244(6):725–31.

Nelson MR, Bryc K, King KS, Indap A, Boyko AR, Novembre J, et al. The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am J Hum Genet. 200883(3):347–58.

Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 200919(9):1655–64.

Kim K, Baik H, Jang CS, Roh JK, Eskin E, Han B. Genomic GPS: using genetic distance from individuals to public data for genomic analysis without disclosing personal genomes. Github. 2019 Available from: Accessed 30 July 2019.

Kim K, Baik H, Jang CS, Roh JK, Eskin E, Han B. Genomic GPS: using genetic distance from individuals to public data for genomic analysis without disclosing personal genomes. Zenodo. 2019

Databases for genomic analysis

  • Nucleic acid sequences
  • genomic and mRNA, including ESTs
  • Protein sequences
  • Protein structures
  • Genetic and physical maps

Figure 4.15. Example of mapping information at NCBI. Genetic map around MYOD1, 11p15.4

Sequences and annotation of the human genome

Ensemble (European Bioinformatics Institute (EMBL) and Sanger Centre)



Figure 4.16. Sample views from servers displaying the human genome. (A) View from the Human Genome Browser. The region shown is part of chromosome 22 with the genes PNUTL1, TBX1and others. Extensive annotation for exons, repeats, single nucleotide polymorphisms, homologous regions in mouse and other information is available for all the sequenced genome. (B) Comparable information in a different format is available at the ENSEMBL server.

Watch the video: NCBIs 1000 Genomes Browser: Introduction (August 2022).