Probably A (Possibly C, G or T)

Two decades ago, I worked on the Human Genome Project. All day, I scrutinized eye-glazing stretches of A’s, G’s, C’s and T’s, cloned bits from a composite human genome. Our team analyzed the sequence, closed gaps, and deposited the polished versions of those genomic bits in GenBank every evening. A scientist anywhere in the world could then begin to interpret the biology of the sequence. This year, a magazine asked me to sum up what the completion of the Human Genome Project meant.

 

On April 25, 1953, the journal Nature published a one-page paper by two scientists from Cambridge University, James Watson and Francis Crick. The paper, titled “A Structure for Deoxyribose Nucleic Acid,” didn’t make headline news right away, but it went on to set in motion a revolution in biology. The structure of DNA, the authors wrote, suggests “a possible copying mechanism for the genetic material.”

The chemical composition of DNA was well known, but its structure was not. Watson and Crick showed that deoxyribose nucleic acid (DNA) is a double helix. DNA is a polymer made of two strands of molecules called nucleotides. The sugar and phosphate parts of the nucleotides form the two strands of the helix, and the nucleotide bases point into the helix, where they stack on top of each other. DNA molecules have four kinds of nucleotide bases. These bases pair with great specificity. Adenine (A) pairs with Thymine (T). Guanine (G) pairs with Cytosine (C). The pairing is key – it is the basis by which DNA molecules are copied when cells divide. In humans, DNA is packaged into 23 pairs of chromosomes, with one chromosome of each pair coming from each parent. Each chromosome has its share of genes – the functional units of heredity. The genome is the sum of all the DNA in the nucleus.
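
The pairing rule is simple enough to write down as a lookup table. The sketch below is only an illustration of the idea, not of how a cell does it: the names `PAIRS` and `complement` are made up for this example, and it ignores the fact that the two strands of the helix run in opposite directions. It shows why one strand is enough to rebuild the other.

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the partner strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)

template = "GATTACA"
partner = complement(template)

print(partner)              # CTAATGT
print(complement(partner))  # GATTACA: pairing twice recovers the template
```

Because each base has exactly one partner, either strand fully determines the other, which is the “possible copying mechanism” Watson and Crick hinted at.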

Crick was again the first to realize that the seemingly random sequence of the four bases in the genomic DNA formed a code and provided a template for protein synthesis. Other scientists would finally “break the genetic code” and describe how three nucleotide bases in DNA code for each of the twenty amino acids, the building blocks of the proteins from which all life is constructed. Crick, however, did not foresee that entire genomes would be decoded.
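
A quick count suggests why the code uses three bases rather than one or two. The sketch below simply enumerates every possible “word” of a given length over the four bases; the helper name `words` is invented for this illustration.

```python
from itertools import product

BASES = "ACGT"

def words(length):
    """List every possible sequence of `length` bases."""
    return ["".join(p) for p in product(BASES, repeat=length)]

for n in (1, 2, 3):
    # One base gives 4 words, two give 16, three give 64.
    print(n, len(words(n)))
```

Two-base words could label at most 16 amino acids, which is too few for twenty. Three-base words give 64 combinations, more than enough.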

The DNA sequence provides the blueprint for development from a single cell to a complex, integrated organism. Determining the entire genomic sequence could help scientists gain molecular-level insights into the workings of any organism, but, at that point, sequencing entire genomes was unthinkable. It would take a series of advances in molecular biology, technology, and computing to make the sequencing of genomes, large and small, a reality.

The DNA sequence of a virus, with fewer than 10,000 base pairs, would be published a full 24 years after the discovery of the DNA double helix. In 1977, Frederick Sanger devised a method to sequence DNA. Sanger and his colleagues determined the genetic sequence of the bacteriophage phiX174, which had 5,386 base pairs. A rough estimate indicated that, without automation, it would take 1,500 scientists working for a century to sequence the human genome, which is some 3 billion base pairs long. But the Sanger method was automated, and the first commercial sequencing machines hit the market in 1986.

In the mid-1980s, some visionary scientists proposed sequencing the entire human genome. Many agreed, in principle, that it would be useful to determine the order and spacing of all the genes that make up the genome, but some biologists thought the project was too ambitious and that it would end up generating plenty of useless data. Still, there was no denying that the large-scale discovery of disease-causing genes would help in medical research and clinical care. The grant-making agencies decided to green-light the big science project.

The Human Genome Project (HGP) officially began in 1990. The National Institutes of Health (NIH) in the United States estimated that the project would take 15 years to complete at a cost of $3 billion. Labs across the globe joined in and formed an international consortium. Apart from sequencing the human genome, the consortium also aimed to identify all the genes it contained – the estimated number was 100,000 genes.

In the first phase of the Human Genome Project, scientists began sequencing model organisms that had long been used in the lab, such as the bacterium Escherichia coli and the yeast Saccharomyces cerevisiae. The consortium published the sequences of the bacterium and the yeast in 1996. But a year before that, Craig Venter, a maverick scientist, came along and completed the sequencing of the bacterium Haemophilus influenzae – it was the first free-living organism whose genome was sequenced. Meanwhile, the consortium laid the groundwork to sequence the roundworm Caenorhabditis elegans, a multicellular organism with about 100 million base pairs in its genome.

The Race Begins

In 1998, the consortium was well placed to begin the large-scale sequencing of the human genome. They had created a physical map which showed identifiable landmarks on chromosomes – such as the positions of disease-causing genes. The roundworm genome project had just been completed. Things were going well. Once again, Venter, who had by then founded a private company called Celera Genomics, burst onto the scene. His team, he said, was poised to finish sequencing the human genome in a couple of years. The consortium immediately moved the deadline to 2003, two years ahead of the scheduled finish in 2005. Watson reminded the world that 2003 would be the 50th anniversary of the discovery of the double helix – so it was, in fact, an ideal date for the completion of the project.

Francis Collins, head of the NIH consortium, planned to stick to its structured, map-based approach, in which the DNA is broken into fragments and the position of each fragment is mapped on the chromosome first. In Celera’s shotgun method, by contrast, the genome was broken into millions of DNA fragments and pieced together in one go, without creating any map – the assembly called for sophisticated algorithms and greater computing power.

In the summer of 2000, at a gala event at the White House, both groups announced that they had arrived at a working draft of the human genome. The following year, the consortium and Celera would publish their results in the journals Nature and Science, respectively. The consortium had completed only 85 percent of the genome; Celera was not much further ahead. Both versions had known gaps and errors. One thing, however, was clear – the human genome had fewer than 25,000 protein-coding genes.

Celera moved on to other things, but the consortium kept pegging away at the draft. In 2003, on the day of the agreed-upon deadline, the human genome was declared complete once again. This version, too, was not error-free. Even the most complete genome sequence was still missing about eight percent of the genome. In 2019, a group called the Telomere-to-Telomere (T2T) consortium decided to take on the challenge of arriving at the complete sequence.

Closing The Gaps

When the Human Genome Project began in the 1990s, sequencing machines could read only short stretches of DNA at a time – fewer than a thousand base pairs. So, a genome was broken into suitably small fragments, and each fragment was sequenced individually. A computer program would look for overlaps at the ends of the sequences and fit the fragments together in the right order to reconstitute the stretch.
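
The overlap idea can be sketched in a few lines. This is a toy greedy merger written for illustration, not the software the consortium or Celera actually ran. The function names, the `min_len` cutoff and the example fragments are all invented here, and real assemblers also have to cope with sequencing errors, both strands and millions of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: the pieces stay separate (a gap)
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

# Three made-up reads, deliberately out of order.
reads = ["TGCAATCC", "ATGGCGT", "GCGTGCA"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATCC']
```

When no overlap is found, the function returns more than one piece. That is a miniature version of the gaps that remained in the early drafts.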

But repeats in certain parts of the genome were so long that it was hard to figure out where the sequences fit. The problematic repeats occurred in biologically important regions: telomeres (regions at the ends of chromosomes), centromeres (which typically reside in the middle of the chromosome), and the short arms of five chromosomes, where centromeres are skewed toward one end. Transposable elements, the mysterious sequences that can move around the genome, are also full of repeats.
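
A toy example makes the difficulty concrete. The “genome” below is invented for illustration: a short repeat unit copied six times between two unique flanks. The function name `find_placements` is also made up here. A read that lies entirely inside the repeat fits in several places, while a read long enough to reach the unique sequence on both sides fits in only one.

```python
def find_placements(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions

# Made-up sequence: unique flank + six copies of "CAG" + unique flank.
genome = "GATTC" + "CAG" * 6 + "TTGAC"

short_read = "CAGCAG"                    # lies entirely inside the repeat
long_read = "TTC" + "CAG" * 6 + "TTG"    # spans the repeat into both flanks

print(find_placements(genome, short_read))  # [5, 8, 11, 14, 17]
print(find_placements(genome, long_read))   # [2]
```

With five equally good placements for the short read, an assembler cannot tell how many copies of the repeat the real sequence contains.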

The source DNA used by the consortium also posed problems. The consortium had collected DNA from many anonymous individuals to get a mosaic human genome. Because of the hundreds of thousands of variations in sequence between individuals, some artificial gaps were created in the consortium’s genome. Celera’s DNA, it is reported, largely came from a single individual – its founder, Venter. The T2T consortium, which used an unusual cell type that has DNA inherited only from one parent, sidestepped the variation issue altogether.

Today’s long-read sequencers can read 10,000 bases accurately at a time – some can read 100,000 base pairs accurately enough. Having reads that can span the length of the repetitive sequences made it far easier to place the segments correctly in the genome. So, the T2T scientists arrived at a more complete sequence of the human genome and published their results in the Science issue dated April 1, 2022. T2T-CHM13, as the new reference genome is called, represents the most complete, accurate human genome sequence there is yet, although its DNA source did not have a Y chromosome. Finally, scientists can confidently say that the human genome has 3.05 billion base pairs. They have added, or fixed, more than 200 million base pairs in the reference genome. They estimate that our genome contains 19,969 protein-coding genes.

With the complete genome, researchers can finally study variation in DNA in individuals. Still, one reference genome does not convey the genomic diversity of the human species. We need many reference genomes – a pangenome. This monumental undertaking, called the Human Pangenome, is already under way and is poised to redefine the future of genomic research and human health. Ultimately, the goal is for every person to be able to have their complete genome sequenced as part of their medical record – faster, cheaper and without huge machines that take up a lot of room.

A much clearer, high-resolution picture of the genome has now emerged. Why is such a small part of the genome’s total length devoted to protein-coding? What is the function of the repeats? What do the non-repetitive, non-gene-coding parts of the genome do? Such questions promise to keep biologists occupied for a long time. The secret of life, which Watson and Crick thought they had stumbled upon when they discovered the double-helical structure of DNA, is yet to be fully deciphered. The revolution in biology is still chugging along.