9.17.2007

"

What is DNA sequencing?

DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st-century scientists to explore human biology and other complex phenomena.

Meeting Human Genome Project sequencing goals by 2003 required continual improvements in sequencing speed, reliability, and costs. Previously, standard methods were based on separating DNA fragments by gel electrophoresis, which was extremely labor intensive and expensive. Total sequencing output in the community was about 200 Mb for 1998. In January 2003, the DOE Joint Genome Institute alone sequenced 1.5 billion bases for the month.
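
To put the two throughput figures above on the same footing, here is a small arithmetic sketch. It assumes the 1998 output was spread evenly over 12 months and takes the genome size as a round 3 billion bases; the function names are illustrative only.

```python
# Figures quoted in the paragraph above.
COMMUNITY_OUTPUT_1998 = 200_000_000    # about 200 Mb for the whole of 1998
JGI_OUTPUT_JAN_2003 = 1_500_000_000    # 1.5 billion bases in a single month
HUMAN_GENOME_BASES = 3_000_000_000     # about 3 billion bases


def monthly_rate(bases_per_year: float) -> float:
    """Average bases per month, assuming an even spread over 12 months."""
    return bases_per_year / 12


def speedup() -> float:
    """Ratio of the January 2003 monthly output to the 1998 monthly average."""
    return JGI_OUTPUT_JAN_2003 / monthly_rate(COMMUNITY_OUTPUT_1998)


def months_for_one_pass(bases_per_month: float) -> float:
    """Months needed to read 3 billion raw bases once (1x) at a given rate."""
    return HUMAN_GENOME_BASES / bases_per_month


print(f"monthly speedup: about {speedup():.0f}x")
print(f"1x pass at 1998 rate: {months_for_one_pass(monthly_rate(COMMUNITY_OUTPUT_1998)):.0f} months")
print(f"1x pass at Jan 2003 rate: {months_for_one_pass(JGI_OUTPUT_JAN_2003):.0f} months")
```

Under these assumptions, one institute in January 2003 produced roughly 90 times the whole community's average monthly output in 1998. A single pass counts raw bases only; it ignores the repeated coverage that a usable sequence requires.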

Newer sequencers run the same gel-based electrophoretic separations in multiple tiny (capillary) tubes instead of on slab gels. These separations are much faster because the tubes dissipate heat well, which allows much higher electric fields and therefore shorter run times.
See a figure depicting this technology.


Whose genome was sequenced in the public (HGP) and private projects?

The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting point for broad comparisons across humanity. The knowledge obtained is applicable to everyone because all humans share the same basic set of genes and genomic regulatory regions that control the development and maintenance of their biological structures and processes.

In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of many collected samples were processed as DNA resources. Thus the donor identities were protected so neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project.

Technically, it is much easier to prepare DNA cleanly from sperm than from other cell types because of the much higher ratio of DNA to protein in sperm and the much smaller volume in which purifications can be done. Using sperm does provide all chromosomes for study, including equal numbers of sperm with the X (female) or Y (male) sex chromosomes. However, HGP scientists also used white cells from the blood of female donors so as to include female-originated samples.

In the Celera Genomics private-sector project, DNA samples from a few different genomes were pooled and processed for sequencing. The DNA resources used for these studies came from anonymous donors of European, African, American (North, Central, South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged that his DNA was one of those in the pool.

Many small regions of DNA that vary among individuals (called polymorphisms) also were identified during the HGP, mostly single nucleotide polymorphisms (SNPs). Most SNPs are without physiological effect, although a minority contribute to the delightful and beneficial diversity of humanity. A much smaller minority of polymorphisms affect an individual’s susceptibility to disease and response to medical treatments.
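
As a rough illustration of what a single-base difference looks like in data, the sketch below compares two sequences position by position. It assumes the two sequences are already aligned and of equal length, and the function name and example sequences are made up for this illustration. A real SNP call also needs evidence from many individuals, which this toy comparison does not provide.

```python
from typing import List, Tuple


def find_single_base_differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes both sequences are pre-aligned and equal in length; insertions
    and deletions are outside the scope of this sketch.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences that differ at one position.
print(find_single_base_differences("ATGCCGTA", "ATGTCGTA"))  # [(4, 'C', 'T')]
```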

Although the HGP has been completed, SNP studies continue in the International HapMap Project, whose goal is to identify patterns of SNP groups (called haplotypes, or “haps”). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource.

[Answer supplied by Dr. Marvin Stodolsky, U.S. DOE Office of Biological and Environmental Research, Office of Science]


Who sequenced the human genome?

Human Genome Project research was funded at many laboratories around the U.S. by the Department of Energy (DOE), the National Institutes of Health (NIH), or both. A list of the major U.S. Human Genome Project research sites can be found here.

Other researchers at numerous colleges, universities, and laboratories throughout the United States have also received DOE and NIH funding for human genome research. At any given time, the DOE Human Genome Program has funded about 100 separate principal investigators. For DOE-funded projects, see Research. To see a list of NIH-funded projects, visit their grants database.

In addition, many large and small private U.S. companies are conducting genome research. For more on the genomics research partnership between the public and private sectors, see the Human Genome Project and the Private Sector Fact Sheet. At least 18 other countries have participated in the Human Genome Project. See the list.


How is DNA sequencing done?

Download a PDF illustration courtesy of the Department of Energy's Joint Genome Institute.

  • Chromosomes, which range in size from 50 million to 250 million bases, must first be broken into much shorter pieces (subcloning step).

  • Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps).

    See a figure depicting the sequencing reaction.

  • The fragments in a set are separated by gel electrophoresis (separation step).

    New fluorescent dyes, one color per base, allow all four sets of fragments to be separated in a single lane on the gel.

    See an example of an electropherogram using fluorescent dyes. Click on the image for a caption.

  • The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step.

    Automated sequencers analyze the resulting electropherograms, and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases.

    After the bases are "read," computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics.

    To read about all the trouble researchers go through to "finish" this raw sequence from automated sequencers, click here (and scroll to the section near the bottom that begins "Here are our definitions of...").

    Finished sequence is submitted to major public sequence databases, such as GenBank. Human Genome Project sequence data are thus freely available to anyone around the world.
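
The base-calling and assembly steps above can be sketched in a few lines. This is a toy illustration under strong assumptions, not the method used by the Human Genome Project: signal intensities are given as one dictionary per peak, reads are error-free, and assembly is a simple greedy merge of the longest suffix-to-prefix overlaps. Production base callers and assemblers also model signal quality, sequencing errors, and repeated regions, which this sketch ignores. All names and data here are hypothetical.

```python
from typing import Dict, List


def call_bases(peaks: List[Dict[str, float]]) -> str:
    """Pick the base with the strongest dye signal at each peak position."""
    return "".join(max(peak, key=peak.get) for peak in peaks)


def overlap(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the remaining contigs once no pair overlaps by `min_overlap`.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical four-channel intensities for three peaks.
peaks = [
    {"A": 0.90, "C": 0.05, "G": 0.02, "T": 0.03},
    {"A": 0.04, "C": 0.03, "G": 0.08, "T": 0.85},
    {"A": 0.10, "C": 0.02, "G": 0.80, "T": 0.08},
]
print(call_bases(peaks))  # ATG

# Three short overlapping reads taken from a made-up 14-base sequence.
reads = ["GTCAAC", "ATTGCCAG", "CCAGGTCA"]
print(greedy_assemble(reads))  # ['ATTGCCAGGTCAAC']
```

Greedy merging is the simplest way to show how overlaps stitch reads together, but it can misassemble repeated regions, which is one reason the finishing work described above is needed.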


In May 2006, Human Genome Project (HGP) researchers announced the completion of the DNA sequence for the last of the 24 human chromosomes. How does this differ from the finished human genome announced by HGP researchers in 2003?

The DNA sequences announced in 2003 were only rough drafts for each human chromosome. While these drafts have already accelerated medical research, more detail was needed. The draft genomic sequences can be compared broadly to a cross-country road excavated by a bulldozer, which leaves behind many gaps across difficult terrain that will require bridges and other refinements.

So, too, with charting the landscape of the human genome. Researchers have now filled in the gaps and provided far more detail for each chromosome. Much of this was accomplished by comparing particular DNA sequences across populations in genomic areas that may have contained anomalies in the initial samples. For example, some DNA segments have proven unstable during the process of copying them (cloning) for use in sequencing machines. (See an example.) Correction of minor errors (estimated at 1 error in every 10,000 DNA subunits) and cataloging of mutations will continue for some time to come.

The entire collection of human chromosome DNA sequences is freely available to the worldwide research community.



What is the difference between draft sequence and finished sequence?

In generating the draft sequence (released in June 2000), scientists determined the order of base pairs in each chromosomal area at least 4 to 5 times (4x to 5x) to ensure data accuracy and to help with reassembling DNA fragments in their original order. This repeated sequencing is known as genome "depth of coverage." Draft sequence data are mostly in the form of fragments of roughly 10,000 base pairs whose approximate chromosomal locations are known.

To generate the high-quality reference sequence, completed in April 2003, additional sequencing was done to close gaps, reduce ambiguities, and allow for only a single error every 10,000 bases, the agreed-upon standard for the HGP. Investigators believe that a high-quality sequence is critical for recognizing regulatory components of genes that are very important in understanding human biology and such disorders as heart disease, cancer, and diabetes. The finished version provides an estimated 8x to 9x coverage of each chromosome.
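
The coverage and error-rate figures above lend themselves to a short worked calculation. The sketch below uses the 3-billion-base genome size and the roughly 500-base read length mentioned elsewhere in this FAQ. The estimate of unsequenced bases uses a Poisson approximation that assumes reads land uniformly at random; that model is an assumption added here, not something the FAQ states, and real genomes deviate from it because some regions clone or sequence poorly.

```python
import math

GENOME_SIZE = 3_000_000_000   # about 3 billion bases
READ_LENGTH = 500             # approximate read length in bases
ERROR_RATE = 1 / 10_000       # finished-sequence standard: 1 error per 10,000 bases


def depth_of_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is read: (reads x length) / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases never read, under the uniform-random (Poisson) assumption."""
    return math.exp(-coverage)


def expected_errors(genome_size: int, error_rate: float) -> float:
    """Expected number of wrong bases at a given per-base error rate."""
    return genome_size * error_rate


for target in (4, 8):
    print(
        f"{target}x coverage: {reads_needed(target, READ_LENGTH, GENOME_SIZE):,} reads, "
        f"about {expected_uncovered_fraction(target):.4%} of bases unread"
    )
print(f"errors allowed at finished standard: about {expected_errors(GENOME_SIZE, ERROR_RATE):,.0f}")
```

Under these assumptions, doubling coverage from 4x to 8x cuts the expected unread fraction from about 1.8 percent to about 0.03 percent, which is consistent with the extra sequencing described above being aimed at closing gaps. Even at the finished standard, a 3-billion-base sequence may still contain on the order of 300,000 single-base errors.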


What genomes have been sequenced completely?

The small genomes of several viruses and bacteria and the much larger genomes of three higher organisms have been completely sequenced: bakers' or brewers' yeast (Saccharomyces cerevisiae), the roundworm (Caenorhabditis elegans), and the fruit fly (Drosophila melanogaster). Scientists finished the first genetic sequence of a plant, that of the weed Arabidopsis thaliana, in December 2000, and in October 2001 the draft sequence of the pufferfish Fugu rubripes, the first vertebrate after the human, was completed. Many more genomes have been completed since then.

For information on published and unpublished genomes, see Genomes Online Database (GOLD).


What nonhuman genome sequencing projects are supported by the U.S. Department of Energy?

A list of microbial genome sequencing projects supported by the U.S. Department of Energy Microbial Genome Program is available here.


What happens now that the human genome sequence is completed?

The working draft DNA sequence and the more polished 2003 version represent an enormous achievement, akin in scientific importance, some say, to developing the periodic table of elements. And, as in most major scientific advances, much work remains to realize the full potential of the accomplishment.

Early explorations into the human genome, now joined by projects on the genomes of a number of other organisms, are generating data whose volume and complex analyses are unprecedented in biology. Genomic-scale technologies will be needed to study and compare entire genomes, sets of expressed RNAs or proteins, gene families from a large number of species, variation among individuals, and the classes of gene regulatory elements.

Deriving meaningful knowledge from DNA sequence will define biological research through the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists, among others. The following is a sampling of research challenges in genetics: what we still won't know, even with the full human sequence in hand.

  • Gene number, exact locations, and functions
  • Gene regulation
  • DNA sequence organization
  • Chromosomal structure and organization
  • Noncoding DNA types, amount, distribution, information content, and functions
  • Coordination of gene expression, protein synthesis, and post-translational events
  • Interaction of proteins in complex molecular machines
  • Predicted vs experimentally determined gene function
  • Evolutionary conservation among organisms
  • Protein conservation (structure and function)
  • Proteomes (total protein content and function) in organisms
  • Correlation of SNPs (single-base DNA variations among individuals) with health and disease
  • Disease-susceptibility prediction based on gene sequence variation
  • Genes involved in complex traits and multigene diseases
  • Complex systems biology including microbial consortia useful for environmental restoration
  • Developmental genetics, genomics
"