Human genome sequence

Also released the same week:
Draft sequence of the publicly funded Human Genome Project
Nature 409, 860-921 (15 February 2001)

http://www.nature.com/genomics/human/papers/articles.html
(Links are to abstracts, but you can download PDF or HTML versions of papers from there).

Genome was sequenced using a shotgun approach
DNA sequence obtained from random clones of sheared genomic DNAs
The size of the genome (3 billion base pairs) makes assembling short stretches of sequence (500 to 600 base pairs) a computational challenge
27 million sequencing reactions were performed, yielding 15 billion base pairs of sequence
Sequence was obtained from five individuals of different ethnic backgrounds
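
The figures above imply an average coverage and an average read length. A minimal sketch of that arithmetic, using only the rounded numbers quoted in these notes (the variable and function names are illustrative, not from the paper):

```python
# Rounded figures quoted in the notes above.
GENOME_SIZE_BP = 3_000_000_000        # approximate size of the genome
NUM_READS = 27_000_000                # sequencing reactions performed
TOTAL_SEQUENCED_BP = 15_000_000_000   # total sequence obtained


def coverage(total_bp, genome_bp):
    """Average number of times each base of the genome was sequenced."""
    return total_bp / genome_bp


def mean_read_length(total_bp, num_reads):
    """Average number of base pairs obtained per sequencing reaction."""
    return total_bp / num_reads


fold_coverage = coverage(TOTAL_SEQUENCED_BP, GENOME_SIZE_BP)
avg_read_bp = mean_read_length(TOTAL_SEQUENCED_BP, NUM_READS)

print(f"coverage: {fold_coverage:.1f}x")
print(f"mean read length: {avg_read_bp:.0f} bp")
```

The mean read length that falls out of these numbers sits inside the 500 to 600 base pair range given above.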

Key to assembling sequence: pairs of sequencing reads obtained from opposite ends of a cloned insert ("mate-pairs").
Clones were obtained from three different size-specific libraries: 2 kilobases, 10 kilobases, and 50 kilobases
Knowing the distance between pairs of reads from individual clones helps in assembling contiguous sequences
Raw sequence was trimmed of known vector sequence, E. coli sequence, and mitochondrial sequence
Additional sequence data was obtained by shredding sequence from the public Human Genome Project
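
How a known insert size constrains assembly can be sketched as a simple consistency check on two placed reads. This is an illustration only: the PlacedRead type, the function name, and the 20% tolerance are assumptions made for the example, not the procedure used in the paper.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read in the assembly
    end: int       # rightmost coordinate (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mate_pair_consistent(a, b, insert_size, tolerance=0.2):
    """Return True if two mates face each other and span about insert_size.

    tolerance is an assumed allowance (fraction of the insert size); a real
    assembler would use the measured size distribution of each library.
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Reads from opposite ends of one clone point towards each other.
    if not (left.forward and not right.forward):
        return False
    span = right.end - left.start
    return abs(span - insert_size) <= tolerance * insert_size


# Two reads placed about 50 kilobases apart, checked against a 50 kb library.
read_1 = PlacedRead(start=1_000, end=1_550, forward=True)
read_2 = PlacedRead(start=50_400, end=50_950, forward=False)
print(mate_pair_consistent(read_1, read_2, insert_size=50_000))
```

A pair that fails this check suggests that one of the two reads has been placed in the wrong copy of a repeat, or that the clone is chimeric.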

Genome sequences were assembled in 2 ways:

A whole genome assembly in which the entire sequence was assembled into the chromosomes
An alternative strategy sorted data based upon homology to known chromosomes, allowing assembly of individual chromosomes
A key hurdle in assembling sequence is avoiding misalignments caused by repeated sequences in the genome and chimeric clones.
Either of these problems can result in sequences from different regions of the genome being joined together.
Repeated sequences were screened by removing known repeated sequences
Unknown repeats were also identified by over-representation of particular sequences
Unique sequences were then assembled and oriented using mate-pair information
Sequence scaffolds could be assembled based upon distances between pairs of reads
The initial scaffold covered 73% of the genome
Gaps were filled using reads which showed homology to a scaffold at one end
Additional gaps were filled with known sequence from the public database
Sequence scaffolds were mapped to chromosomes by homology to known chromosome markers
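
Spotting unknown repeats by over-representation can be sketched with a k-mer count: short words that occur far more often than the sequencing coverage would predict are likely to come from repeated sequence. The function names, the toy reads, and the max_count cutoff below are assumptions for illustration, not the screening method used in the paper.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def overrepresented_kmers(reads, k, max_count):
    """Return k-mers seen more than max_count times (candidate repeats)."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n > max_count}


# Toy reads (made up for this example); ACGT occurs in all of them.
toy_reads = ["ACGTACGTAC", "TTACGTGG", "GGACGTCC"]
print(overrepresented_kmers(toy_reads, k=4, max_count=2))
```

In practice the cutoff would be chosen relative to the expected coverage, so that a word seen many times more often than a unique sequence should be is set aside before assembly.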

The assembled sequence was assessed for completeness and correctness

Completeness was assessed by comparison to known finished sequences for chromosome 21 and chromosome 22; completeness was better than 95%
Accuracy was determined by comparison to known chromosome 21 and chromosome 22 sequences and estimated at 99.96%
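
To put the accuracy estimate in perspective, it can be converted into an error rate. A small sketch of the arithmetic, using exact fractions to avoid floating-point rounding; applying the rate uniformly to a 3 billion base pair genome is an assumption made purely for illustration:

```python
from fractions import Fraction

ACCURACY = Fraction("0.9996")    # 99.96%, as quoted above
GENOME_SIZE_BP = 3_000_000_000   # approximate genome size from these notes

error_rate = 1 - ACCURACY                      # fraction of bases in error
bases_per_error = 1 / error_rate               # average spacing between errors
expected_errors = error_rate * GENOME_SIZE_BP  # assumes a uniform error rate

print(f"about 1 error per {int(bases_per_error):,} bases")
print(f"about {int(expected_errors):,} errors across the genome")
```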

Gene prediction

Genes were detected by homology searches of protein databases and expressed sequence tags
The total number of genes was estimated at between 25 and 30 thousand
Genes are found associated with GC rich sequences
Genes tend to cluster, and "desert" regions of more than 500 kilobases without genes have been found
CpG islands also correlate with gene locations
A number of gene duplications caused by retrotransposition were detected
These are genes which have only a single exon, are flanked by repeated sequence, and are homologous to known multi-exon genes
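
The GC and CpG signals mentioned above can be measured directly on a stretch of sequence. The sketch below computes the GC fraction and the commonly used observed/expected CpG ratio; this heuristic is brought in for illustration and is not stated in these notes as the method used in the paper. The function names and example sequences are made up.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G content.

    expected = (number of C * number of G) / sequence length
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


cpg_rich = "CGCGCGATAT"  # toy sequence containing several CpG dinucleotides
cpg_poor = "CATGCATGAT"  # toy sequence with C and G but no CpG

for name, seq in (("CpG-rich", cpg_rich), ("CpG-poor", cpg_poor)):
    print(name, f"GC={gc_fraction(seq):.2f}", f"obs/exp={cpg_obs_exp(seq):.2f}")
```

A window with both a high GC fraction and a high observed/expected ratio is a candidate CpG island, and so a hint that a gene may lie nearby.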

Duplication

Chromosome duplications were detected by concatenation of predicted protein sequences for each of the 24 chromosomes and searching by homology between chromosomes
Clusters of related genes on different chromosomes were indicative of ancient chromosomal duplications
Most appeared to have occurred before the divergence of primate and rodent lineages
Duplicated genes which are related to genes of known function are interesting candidates for further study
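
The idea of looking for clusters of related genes shared between chromosomes can be sketched as a tally of homologous gene pairs per chromosome pair. Everything here is hypothetical: the chromosome labels, gene names, and the min_hits cutoff are toy values, and the homology search itself is assumed to have been done already.

```python
from collections import Counter


def chromosome_pair_hits(homolog_pairs):
    """Count homologous gene pairs linking each pair of different chromosomes."""
    hits = Counter()
    for (chrom_a, _gene_a), (chrom_b, _gene_b) in homolog_pairs:
        if chrom_a != chrom_b:
            hits[tuple(sorted((chrom_a, chrom_b)))] += 1
    return hits


def candidate_duplications(homolog_pairs, min_hits=3):
    """Chromosome pairs sharing at least min_hits homologous gene pairs."""
    hits = chromosome_pair_hits(homolog_pairs)
    return {pair for pair, n in hits.items() if n >= min_hits}


# Toy input: each entry pairs (chromosome, gene) with its homolog elsewhere.
toy_pairs = [
    (("A", "gene1"), ("B", "gene1b")),
    (("A", "gene2"), ("B", "gene2b")),
    (("B", "gene3b"), ("A", "gene3")),
    (("A", "gene4"), ("C", "gene4c")),
    (("C", "gene5"), ("C", "gene5c")),  # same chromosome, ignored here
]
print(candidate_duplications(toy_pairs))
```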

Nearly 50% of the predicted genes are of unknown function