Is the human genome fully understood and annotated?

No, the human genome is not fully understood or annotated. While the sequence is now complete, vast regions remain functionally mysterious.

Direct answer

No, the human genome is not fully understood or annotated. While the Telomere-to-Telomere (T2T) Consortium finally completed the full DNA sequence in 2022, adding nearly 200 million base pairs [4], understanding what all that DNA does is a separate, ongoing challenge. For example, in a 2006 community assessment (EGASP), even the best computational gene-finding methods correctly predicted at least one transcript for only about 70% of annotated genes and were far less accurate for alternative splicing [7]. Large-scale studies also show that most of the genome's non-coding regions have unknown functions [9].

10 sources cited

This article was generated with WisPaper-powered search and paper analysis.

The DNA sequence is finally complete, but what does it all mean?

Think of the human genome as a massive library. For decades, we only had access to about 92% of the books on the shelves. The Telomere-to-Telomere (T2T) Consortium finally gave us the complete set of books in 2022, filling in the missing 8% and adding nearly 200 million new 'pages' of DNA [4]. This was a monumental technical achievement, resolving notoriously tricky regions like the centromeres (the 'waist' of chromosomes) and the short arms of five chromosomes [4][6]. However, having all the books on the shelf is not the same as having read and understood every one of them.

The process of 'annotation' (figuring out which stretches of DNA are actual genes, what they do, and how they are regulated) is far from finished. A major community experiment from 2006 (EGASP) found that even the best computer programs at the time correctly predicted at least one transcript, that is, one 'version' of a gene, for only about 70% of annotated genes. Their accuracy at capturing the different ways a single gene can be spliced (alternative splicing) was only 40-50% [7]. Tools have improved since then, but the result illustrates why automated annotation remains imperfect and still requires extensive manual curation and experimental validation.
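
To make the difference between those two numbers concrete, here is a minimal sketch in Python of how a gene-level score and a transcript-level score can diverge. It is a toy illustration, not the EGASP evaluation protocol: the gene names, the coordinates, and the rule that a prediction counts only if its exon structure matches exactly are all assumptions invented for this example.

```python
# Toy illustration of gene-level vs. transcript-level sensitivity.
# Gene names and coordinates are hypothetical, not real annotations.

# Each transcript is a tuple of (start, end) exon coordinates.
annotated = {
    "geneA": {((100, 200), (300, 400)), ((100, 200), (350, 400))},
    "geneB": {((1000, 1100), (1200, 1300))},
    "geneC": {((5000, 5100),)},
}
predicted = {
    "geneA": {((100, 200), (300, 400))},      # one of two isoforms found
    "geneB": {((1000, 1100), (1250, 1300))},  # wrong exon boundary
    "geneC": {((5000, 5100),)},               # exact match
}


def gene_level_sensitivity(annotated, predicted):
    """Fraction of annotated genes with at least one exactly matched transcript."""
    hits = sum(
        1
        for gene, transcripts in annotated.items()
        if transcripts & predicted.get(gene, set())
    )
    return hits / len(annotated)


def transcript_level_sensitivity(annotated, predicted):
    """Fraction of all annotated transcripts that were exactly matched."""
    total = sum(len(transcripts) for transcripts in annotated.values())
    hits = sum(
        len(transcripts & predicted.get(gene, set()))
        for gene, transcripts in annotated.items()
    )
    return hits / total


gene_sn = gene_level_sensitivity(annotated, predicted)
transcript_sn = transcript_level_sensitivity(annotated, predicted)
print(f"gene-level: {gene_sn:.2f}, transcript-level: {transcript_sn:.2f}")
```

With these toy inputs the gene-level score is 0.67 and the transcript-level score is 0.50. Finding one correct version of a gene is enough to count at the gene level, so the score drops as soon as every alternative splice form has to be right as well.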

The newly completed regions are the most mysterious and complex

The very regions that were hardest to sequence are also the most difficult to understand. These include vast stretches of repetitive DNA, like the centromeres and segmental duplications (large, nearly identical copies of DNA blocks). The T2T assembly revealed that segmental duplications make up 7% of the genome, not the 5.4% previously estimated [2]. These regions are hotbeds of structural variation — large-scale rearrangements like inversions and deletions that differ between individuals [5][8].
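
As a quick sanity check on these proportions, the short Python sketch below divides the region sizes reported in the sources (218 Mbp of segmental duplications [2], 189.9 megabases of centromeres [6]) by the 3.055 billion base pair genome length [4]. Using that single denominator is an assumption made here; the original papers may have computed their percentages against a slightly different total, so small differences in the last digit are to be expected.

```python
# Rough check of the genome fractions quoted in the sources.
# Assumption: every percentage is taken against the 3.055 Gbp
# T2T-CHM13 length [4]; the cited papers may use other denominators.

GENOME_BP = 3.055e9

regions_bp = {
    "segmental duplications": 218e6,  # 218 Mbp, from source [2]
    "centromeres": 189.9e6,           # 189.9 megabases, from source [6]
}


def percent_of_genome(region_bp, genome_bp=GENOME_BP):
    """Return the size of a region as a percentage of the genome length."""
    return 100 * region_bp / genome_bp


percentages = {name: percent_of_genome(bp) for name, bp in regions_bp.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}% of the genome")
```

The results come out close to the 7% and 6.2% figures cited above. The two categories may overlap, so the percentages should not be added together.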

This complexity matters for medicine. For instance, near-complete assemblies of 65 diverse genomes allowed researchers to fully resolve the SMN1/SMN2 gene region, which is critical for spinal muscular atrophy, and the AMY1/AMY2 region, linked to starch digestion and obesity [1]. A 2023 study found that a deletion in the KLRC gene cluster, a region only fully resolved in the T2T genome, occurs in about 20% of humans and is associated with natural killer cell differentiation [8]. This shows that the 'dark matter' of the genome is not junk; it contains medically relevant genes that we are only beginning to explore.

A complete reference genome is a game-changer for genetic testing, but it's not a complete understanding

Having a complete, accurate reference genome (like T2T-CHM13) dramatically improves our ability to find genetic variants in individuals. A 2022 study showed that using the T2T reference instead of the older GRCh38 reference eliminated tens of thousands of spurious variants per person and reduced false positives in 269 medically relevant genes by up to a factor of 12 [10]. This means fewer false alarms and more accurate diagnoses when sequencing a patient's genome.

However, even with a perfect reference, we still cannot interpret most of the variants we find. A massive 2023 study of over 76,000 human genomes created a 'constraint map' of the genome, showing which regions are so important that mutations are rarely tolerated [9]. While this map helps identify functional regions, it also confirms that the vast majority of the non-coding genome shows no signs of constraint, meaning its function (if any) remains unknown. Furthermore, a 2025 benchmark of the complete HG002 genome showed that even state-of-the-art methods still struggle, with de novo assemblies outperforming traditional variant calling by an order of magnitude, yet still making about one error per 100,000 base pairs in the most complex regions [3]. We have the complete map, but we are still learning to read it.

Sources used in this answer

1

Complex genetic variation in nearly complete human genomes

Sequencing 65 diverse genomes and building 130 haplotype-resolved assemblies closed 92% of previous assembly gaps and fully resolved complex loci like MHC and centromeres, revealing up to 30-fold variation in centromere array length.

2

Segmental duplications and their variation in a complete human genome

The complete T2T genome showed segmental duplications account for 7.0% of the genome (218 Mbp), up from the previous estimate of 5.4%, and that 91% of the newly resolved duplication sequence better represents human copy number variation.

3

A complete diploid human genome benchmark for personalized genomics

A telomere-to-telomere benchmark for the diploid HG002 genome achieved near-perfect accuracy across 99.4% of the genome, adding 15.3% of sequence absent from prior benchmarks and showing de novo assembly outperforms variant calling by an order of magnitude.

4

The complete sequence of a human genome

The T2T Consortium produced a complete 3.055 billion-base pair human genome sequence, adding nearly 200 million base pairs of sequence containing 1,956 gene predictions, 99 of which are predicted to be protein-coding.

5

Inversion polymorphism in a complete human genome assembly

Remapping data from 41 genomes against the T2T reference found a ~21% increase in sensitivity for detecting inversions, identifying 26 misorientations in the older GRCh38 reference.

6

Complete genomic and epigenetic maps of human centromeres

Complete maps of human centromeres revealed they constitute 6.2% of the genome (189.9 megabases) and uncovered multimegabase structural rearrangements and high degrees of structural, epigenetic, and sequence variation across individuals.

7

EGASP: the human ENCODE Genome Annotation Assessment Project

The EGASP experiment found the best computational methods correctly predicted at least one transcript for ~70% of annotated genes, but multiple-transcript accuracy (accounting for alternative splicing) reached only ~40-50%.

8

Characterization of large-scale genomic differences in the first complete human genome

Analysis of large-scale differences between T2T-CHM13 and GRCh38 found 67 additional discrepant regions (~21.6 Mbp) and identified a deletion in the KLRC gene cluster associated with natural killer cell differentiation in ~20% of humans.

9

A genomic mutational constraint map using variation in 76,156 human genomes

Aggregating 76,156 human genomes from gnomAD built a genome-wide constraint map, showing that constrained non-coding regions are enriched for known regulatory elements and variants implicated in complex diseases.

10

A complete reference genome improves analysis of human genetic variation

Using the T2T-CHM13 reference universally improved read mapping and variant calling for thousands of globally diverse samples, eliminating tens of thousands of spurious variants per sample and reducing false positives in 269 medically relevant genes by up to a factor of 12.