These browsers will be updated as the draft genome sequence is refined and corrected as additional annotations are developed.
Please note that this figure is too large to display in image form. Instead it has been split into four PDFs. The Figure shows the occurrences of twelve important types of feature across the human genome. Large grey blocks represent centromeres and centromeric heterochromatin (size not precisely to scale).
Each of the feature types is depicted in a track, from top to bottom as follows. Red, areas covered by finished clones; yellow, areas covered by predraft sequence. Regions covered by draft sequenced clones are in orange, with darker shades reflecting increasing shotgun sequence coverage. Percentage of bases in a 20, base window that are C or G. The SNPs were detected by sequencing and alignments of random genomic reads.
Rigorous analysis of SNP density requires comparing the number of SNPs identified to the precise number of bases surveyed. Regions of homology with the pufferfish T. The starts of known genes from the RefSeq database are shown in blue.
Known disease genes from the OMIM database are red, other genes blue. This Figure is based on an earlier version of the draft genome sequence than analysed in the text, owing to production constraints. We are aware of various errors in the Figure, including omissions of some known genes and misplacements of others. Some genes are mapped to more than one location, owing to errors in assembly, close paralogues or pseudogenes.
Manual review was performed to select the most likely location in these cases and to correct other regions. In addition to using the Genome Browsers, one can download from these sites the entire draft genome sequence together with the annotations in a computer-readable format.
The sequences of the underlying sequenced clones are all available through the public sequence databases. URLs for these and other genome websites are listed in Box 2.
An introduction to using the draft genome sequence, as well as associated databases and analytical tools, is provided in an accompanying paper. In addition, the human cytogenetic map has been integrated with the draft genome sequence as part of a related project. The BAC Resource Consortium established dense connections between the maps using more than 7, sequenced large-insert clones that had been cytogenetically mapped by FISH; the average density of the map is 2. Although the precision of the integration is limited by the resolution of FISH, the links provide a powerful tool for the analysis of cytogenetic aberrations in inherited diseases and cancer.
These cytogenetic links can also be accessed through the Genome Browsers. The existence of GC-rich and GC-poor regions in the human genome was first revealed by experimental studies involving density gradient separation, which indicated substantial variation in average GC content among large fragments.
Subsequent studies have indicated that these GC-rich and GC-poor regions may have different biological properties, such as gene density, composition of repeat sequences, correspondence with cytogenetic bands and recombination rate. Many of these studies were indirect, owing to the lack of sufficient sequence data.
The draft genome sequence makes it possible to explore the variation in GC content in a direct and global manner. Visual inspection Fig. Fluctuations would be modest, with the standard deviation being halved as the window size is quadrupled—for example, 0. The draft genome sequence, however, contains many regions with much more extreme variation. There are also examples of large shifts in GC content between adjacent multimegabase regions. Long-range variation in GC content is evident not just from extreme outliers, but throughout the genome.
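As a minimal sketch of the window analysis described above, GC content can be computed for each non-overlapping window and compared with the spread expected from a uniform process. The function names and the toy inputs are illustrative assumptions, not the authors' code. Under a uniform process with independent bases, the standard deviation of a window's GC fraction is sqrt(p(1-p)/n), so quadrupling the window size n halves the standard deviation.

```python
import math


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C; None if none are called."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq, window):
    """GC fraction of each complete, non-overlapping window of the given size."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


def uniform_sd(p, window):
    """Standard deviation of the GC fraction in a window of `window` bases,
    assuming each base is independently G/C with probability p."""
    return math.sqrt(p * (1.0 - p) / window)
```

Windows that consist entirely of uncalled bases (for example, gaps written as N) return None and would need to be filtered out before computing summary statistics.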
The distribution of average GC content in kb windows across the draft genome sequence is shown in Fig. The spread is fold larger than predicted by a uniform process. Moreover, the standard deviation barely decreases as window size increases by successive factors of four—5. We studied the draft genome sequence to see whether strict isochores could be identified.
For example, the sequence was divided into kb windows, and each window was subdivided into kb subwindows. About three-quarters of the genome-wide variance among kb windows can be statistically explained by the average GC content of kb windows that contain them, but the residual variance among subwindows standard deviation, 2. In fact, the hypothesis of homogeneity could be rejected for each kb window in the draft genome sequence. Similar results were obtained with other window and subwindow sizes.
Some of the local heterogeneity in GC content is attributable to transposable element insertions see below. Such repeat elements typically have a higher GC content than the surrounding sequence, with the effect being strongest for the most recent insertions. These results rule out a strict notion of isochores as compositionally homogeneous. Instead, there is substantial variation at many different scales, as illustrated in Fig. This region is AT-rich overall. Top, the GC content of the entire Mb region analysed in non-overlapping kb windows.
At this scale, gaps in the sequence can be seen. Fickett et al. Churchill has proposed that the boundaries between GC content domains can in some cases be predicted by a hidden Markov model, with one state representing a GC-rich region and one representing an AT-rich region. We found that this approach tended to identify only very short domains of less than a kilobase (data not shown), but variants of this approach deserve further attention. The correlation between GC content domains and various biological properties is of great interest, and this is likely to be the most fruitful route to understanding the basis of variation in GC content.
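The two-state hidden Markov model idea can be illustrated with a short Viterbi decoder. This is a sketch under stated assumptions rather than the model evaluated in the text: the emission probabilities (`gc_rich_p`, `at_rich_p`), the self-transition probability (`stay`) and the function name are all hypothetical, and any base other than G or C (including N) is treated as A/T.

```python
import math


def viterbi_gc_domains(seq, gc_rich_p=0.6, at_rich_p=0.4, stay=0.999):
    """Label each base 'G' (GC-rich state) or 'A' (AT-rich state) by Viterbi decoding.

    gc_rich_p / at_rich_p: assumed probability of emitting G or C in each state.
    stay: assumed probability of remaining in the same state at each step.
    """
    seq = seq.upper()
    if not seq:
        return ""
    states = ("G", "A")
    p_gc = {"G": gc_rich_p, "A": at_rich_p}

    def emit(state, base):
        # Log-probability of the observed base class (G/C versus anything else).
        p = p_gc[state] if base in "GC" else 1.0 - p_gc[state]
        return math.log(p)

    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    score = {s: math.log(0.5) + emit(s, seq[0]) for s in states}
    back = []  # back[i][s] = best previous state for state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            other = "A" if s == "G" else "G"
            stay_score = score[s] + log_stay
            switch_score = score[other] + log_switch
            if stay_score >= switch_score:
                best, pointers[s] = stay_score, s
            else:
                best, pointers[s] = switch_score, other
            new_score[s] = best + emit(s, base)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

The length of the domains found depends strongly on `stay`: values close to 1 penalize switching and favour long domains, whereas smaller values allow short domains of the kind noted above.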
As described below, we confirm the existence of strong correlations with both repeat content and gene density. Using the integration between the draft genome sequence and the cytogenetic map described above, it is possible to confirm a statistically significant correlation between GC content and Giemsa bands (G-bands).
Estimated band locations can be seen in Fig. A related topic is the distribution of so-called CpG islands across the genome. The deficit occurs because most CpG dinucleotides are methylated on the cytosine base, and spontaneous deamination of methyl-C residues gives rise to T residues.
Spontaneous deamination of ordinary cytosine residues gives rise to uracil residues that are readily recognized and repaired by the cell.
As a result, methyl-CpG dinucleotides steadily mutate to TpG dinucleotides. We searched the draft genome sequence for CpG islands. Ideally, they should be defined by directly testing for the absence of cytosine methylation, but that was not practical for this report.
There are various computer programs that attempt to identify CpG islands on the basis of primary sequence alone. These programs differ in some important respects such as how aggressively they subdivide long CpG-containing regions , and the precise correspondence with experimentally undermethylated islands has not been validated. Nevertheless, there is a good correlation, and computational analysis thus provides a reasonable picture of the distribution of CpG islands in the genome.
To identify CpG islands, we used the definition proposed by Gardiner-Garden and Frommer and embodied in a computer program. We searched the draft genome sequence for CpG islands, using both the full sequence and the sequence masked to eliminate repeat sequences. The number of regions satisfying the definition of a CpG island was 50, in the full sequence and 28, in the repeat-masked sequence.
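A sequence-based test of this kind can be sketched as follows. The thresholds shown (length of at least 200 bp, GC content of at least 50% and an observed/expected CpG ratio of at least 0.6) are the commonly quoted form of the Gardiner-Garden and Frommer criteria and are supplied here as an assumption, because the exact values are not given in the text above. The function names are hypothetical, and this is not the program used for the counts reported here.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_island_criteria(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """True if the sequence satisfies the assumed length, GC and CpG-ratio thresholds."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = (seq.count("C") + seq.count("G")) / len(seq)
    return gc >= min_gc and cpg_obs_exp(seq) >= min_ratio
```

A genome-wide scan would apply such a test to sliding windows and then merge adjacent qualifying windows; how aggressively long regions are merged or subdivided is one of the points on which programs differ.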
The difference reflects the fact that some repeat elements notably Alu are GC-rich. Although some of these repeat elements may function as control regions, it seems unlikely that most of the apparent CpG islands in repeat sequences are functional. Accordingly, we focused on those in the non-repeated sequence. The count of 28, CpG islands is reasonably close to the previous estimate of about 35, ref.
The smaller islands are consistent with their previously hypothesized function, but the role of these larger islands is uncertain. The density of CpG islands varies substantially among some of the chromosomes. Most chromosomes have 5—15 islands per Mb, with a mean of However, chromosome Y has an unusually low 2. The extreme outlier is chromosome 19, with 43 islands per Mb.
Similar trends are seen when considering the percentage of bases contained in CpG islands. The relative density of CpG islands correlates reasonably well with estimates of relative gene density on these chromosomes, based both on previous mapping studies involving ESTs Fig. Chromosomes 16, 17, 22 and particularly 19 are clear outliers, with a density of CpG islands that is even greater than would be expected from the high gene counts for these four chromosomes. The draft genome sequence makes it possible to compare genetic and physical distances and thereby to explore variation in the rate of recombination across the human chromosomes.
We focus here on large-scale variation. Finer variation is examined in an accompanying paper. The genetic and physical maps are integrated by 5, polymorphic loci from the Marshfield genetic map, whose positions are known in terms of centimorgans (cM) and Mb along the chromosomes. Figure 15 shows the comparison of the draft genome sequence for chromosome 12 with the male, female and sex-averaged maps.
One can calculate the approximate ratio of cM per Mb across a chromosome reflected in the slopes in Fig. Female, male and sex-averaged maps are shown.
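The slope calculation can be sketched as the ratio of genetic to physical distance between consecutive markers. The function name and the marker positions in the usage example are made up for illustration.

```python
def recombination_rates(markers):
    """Estimate cM per Mb between consecutive markers.

    markers: iterable of (physical_position_mb, genetic_position_cm) pairs.
    Returns a list of (start_mb, end_mb, cm_per_mb) tuples, ordered by physical position.
    """
    ordered = sorted(markers)
    rates = []
    for (mb0, cm0), (mb1, cm1) in zip(ordered, ordered[1:]):
        if mb1 > mb0:  # skip markers placed at the same physical position
            rates.append((mb0, mb1, (cm1 - cm0) / (mb1 - mb0)))
    return rates


# Toy chromosome: steep slopes near both ends, flat in the middle.
toy_markers = [(0.0, 0.0), (10.0, 30.0), (60.0, 40.0), (70.0, 70.0)]
toy_rates = recombination_rates(toy_markers)
```

A negative rate for an interval indicates that the genetic and physical orders disagree, which is how discordant markers would show up in such a comparison.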
Female recombination rates are much higher than male recombination rates. The increased slopes at either end of the chromosome reflect the increased rates of recombination per Mb near the telomeres. Conversely, the flatter slope near the centromere shows decreased recombination there, especially in male meiosis.
Discordant markers may be map, marker placement or assembly errors. Two striking features emerge from analysis of these data. First, the average recombination rate increases as the length of the chromosome arm decreases Fig. A similar trend has been seen in the yeast genome , , despite the fact that the physical scale is nearly times as small.
Moreover, experimental studies have shown that lengthening or shortening yeast chromosomes results in a compensatory change in recombination rate. For large chromosomes, the average recombination rates are very similar, but as chromosome arm length decreases, average recombination rates rise markedly.
The increase is most pronounced in the male meiotic map. The effect can be seen, for example, from the higher slope at both ends of chromosome 12 Fig. Regional and sex-specific effects have been observed for chromosome 21. Why is recombination higher on smaller chromosome arms? A higher rate would increase the likelihood of at least one crossover during meiosis on each chromosome arm, as is generally observed in human chiasmata counts. Crossovers are believed to be necessary for normal meiotic disjunction of homologous chromosome pairs in eukaryotes.
An extreme example is the pseudoautosomal regions on chromosomes Xp and Yp, which pair during male meiosis; this physical region of only 2. Mechanistically, the increased rate of recombination on shorter chromosome arms could be explained if, once an initial recombination event occurs, additional nearby events are blocked by positive crossover interference on each arm.
Evidence from yeast mutants in which interference is abolished shows that interference plays a key role in distributing a limited number of crossovers among the various chromosome arms in yeast. An alternative possibility is that a checkpoint mechanism scans for and enforces the presence of at least one crossover on each chromosome arm.
Variation in recombination rates along chromosomes and between the sexes is likely to reflect variation in the initiation of meiosis-induced double-strand breaks (DSBs) that initiate recombination.
DSBs in yeast have been associated with open chromatin rather than with specific DNA sequence motifs. With the availability of the draft genome sequence, it should be possible to explore in an analogous manner whether variation in human recombination rates reflects systematic differences in chromosome accessibility during meiosis. A puzzling observation in the early days of molecular biology was that genome size does not correlate well with organismal complexity.
For example, Homo sapiens has a genome that is times as large as that of the yeast S. This mystery (the C-value paradox) was largely resolved with the recognition that genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes. These regions are intentionally under-represented in the draft genome sequence and are not discussed here. However, they actually represent an extraordinary trove of information about biological processes.
The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection. As active agents, repeats have reshaped the genome by causing ectopic rearrangements, creating entirely new genes, modifying and reshuffling existing genes, and modulating overall GC content.
They also shed light on chromosome structure and dynamics, and provide tools for medical genetic and population genetic studies. The human is the first repeat-rich genome to be sequenced, and so we investigated what information could be gleaned from this majority component of the human genome. Although some of the general observations about repeats were suggested by previous studies, the draft genome sequence provides the first comprehensive view, allowing some questions to be resolved and new mysteries to emerge.
Most human repeat sequence is derived from transposable elements. To describe our analyses of interspersed repeats, it is necessary briefly to review the relevant features of human transposable elements. In mammals, almost all transposable elements fall into one of four types Fig. LINEs are one of the most ancient and successful inventions in eukaryotic genomes. The LINE machinery is believed to be responsible for most reverse transcription in the genome, including the retrotransposition of the non-autonomous SINEs and the creation of processed pseudogenes. Only LINE1 is still active.
These non-autonomous transposons are thought to use the LINE machinery for transposition. LTR retroposons are flanked by long terminal direct repeats that contain all of the necessary transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol genes, which encode a protease, reverse transcriptase, RNase H and integrase. Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a cellular envelope gene (env). Transposition occurs through the retroviral mechanism with reverse transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear location and chromosomal priming of LINEs).
Although a variety of LTR retrotransposons exist, only the vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian genome. Mammalian retroviruses fall into three classes (I–III), each comprising many families with independent origins. DNA transposons tend to have short life spans within a species. By contrast, DNA transposons cannot exercise a cis-preference: the encoded transposase is produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from inactive elements.
As inactive copies accumulate in the genome, transposition becomes less efficient. This checks the expansion of any DNA transposon family and in due course causes it to die out. To survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is considerable evidence for such transfer. Transposable elements employ different strategies to ensure their evolutionary survival. DNA transposons are more promiscuous, requiring relatively frequent horizontal transfer.
LTR retroposons use both strategies, with some being long-term active residents of the human genome such as members of the ERVL family and others having only short residence times. This program scans sequences to identify full-length and partial members of all known repeat families represented in RepBase Update version 5.
Table 11 shows the number of copies and fraction of the draft genome sequence occupied by each of the four major classes and the main subclasses. The precise count of repeats is obviously underestimated because the genome sequence is not finished, but their density and other properties can be stated with reasonable confidence.
We expect these densities to grow as more repeat families are recognized, among which will be lower copy number LTR elements and DNA transposons, and possibly high copy number ancient highly diverged repeats.
The ancestry and approximate age of each fossil can be inferred by exploiting the fact that each copy is derived from, and therefore initially carried the sequence of, a then-active transposon and, being generally under no functional constraint, has accumulated mutations randomly and independently of other copies.
We can infer the sequence of the ancestral active elements by clustering the modern derivatives into phylogenetic trees and building a consensus based on the multiple sequence alignment of a cluster of copies.
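The consensus step can be illustrated with a simple majority vote over copies that have already been aligned. This is a deliberately reduced sketch: it skips the phylogenetic clustering described above, assumes equal-length aligned strings with '-' for gaps, and uses hypothetical function names.

```python
from collections import Counter


def consensus(aligned_copies):
    """Majority base in each alignment column; gaps ('-') are ignored when voting."""
    result = []
    for column in zip(*aligned_copies):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


def percent_divergence(copy, cons):
    """Percentage of aligned, ungapped positions where the copy differs from the consensus."""
    pairs = [(a, b) for a, b in zip(copy, cons) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    mismatches = sum(a != b for a, b in pairs)
    return 100.0 * mismatches / len(pairs)


copies = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTAC-T"]
ancestor = consensus(copies)
divergences = [percent_divergence(c, ancestor) for c in copies]
```

Because each copy accumulates mutations independently, the majority base at each position approximates the sequence of the element that was active at the time of insertion, and the divergence of each copy from that consensus serves as a proxy for its age.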
Using available consensus sequences for known repeat subfamilies, we calculated the per cent divergence from the inferred ancestral active transposon for each of three million interspersed repeats in the draft genome sequence. The percentage of sequence divergence can be converted into an approximate age in millions of years (Myr) on the basis of evolutionary information. Care is required in calibrating the clock, because the rate of sequence divergence may not be constant over time or between lineages. The relative-rate test can be used to calculate the sequence divergence that accumulated in a lineage after a given timepoint, on the basis of comparison with a sibling species that diverged at that time and an outgroup species.
For example, the substitution rate over roughly the last 25 Myr in the human lineage can be calculated by using old world monkeys (which diverged about 25 Myr ago) as a sibling species and new world monkeys as an outgroup. We have used currently available calibrations for the human lineage, but the issue should be revisited as sequence information becomes available from different mammals.
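The conversion from divergence to age can be sketched as a multiple-hit correction followed by division by a substitution rate. Both ingredients here are assumptions: a Jukes-Cantor correction is used for illustration, and the rate in the usage example is a placeholder, not a calibration taken from the text.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site corrected for multiple hits, from observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def approximate_age_myr(percent_divergence, subs_per_site_per_myr):
    """Approximate age of a repeat copy, given its per cent divergence from the
    consensus and an assumed substitution rate for the lineage."""
    distance = jukes_cantor_distance(percent_divergence / 100.0)
    return distance / subs_per_site_per_myr


# Placeholder rate (substitutions per site per Myr), for illustration only.
example_age = approximate_age_myr(10.0, 0.002)
```

Because the copy is compared with its inferred ancestor rather than with a sister copy, the distance accumulates along a single lineage and is divided by the rate once rather than by twice the rate.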
Figure 18a shows the representation of various classes of transposable elements in categories reflecting equal amounts of sequence divergence. In Fig. Figure 19 shows the mean ages of various subfamilies of DNA transposons.
Several facts are apparent from these graphs. First, most interspersed repeats in the human genome predate the eutherian radiation. This is a testament to the extremely slow rate with which nonfunctional sequences are cleared from vertebrate genomes (see below concerning comparison with the fly).
Bases covered by interspersed repeats were sorted by their divergence from their consensus sequence which approximates the repeat's original sequence at the time of insertion. This model tends to underestimate higher substitution levels.
There is a different correspondence between substitution levels and time periods owing to different rates of nucleotide substitution in the two species. The correspondence between substitution levels and time periods was largely derived from three-way species comparisons relative rate test , with the age estimates based on fossil data.
Unlike retroposons, DNA transposons are thought to have a short life span in a genome. Thus, the average or median divergence of copies from the consensus is a particularly accurate measure of the age of the DNA transposon copies.
Third, there were two major peaks of DNA transposon activity Fig. The first involved Charlie elements and occurred long before the eutherian radiation; the second involved Tigger elements and occurred after this radiation. Because DNA transposons can produce large-scale chromosome rearrangements, it is possible that widespread activity could be involved in speciation events.
Fourth, there is no evidence for DNA transposon activity in the past 50 Myr in the human genome. Finally, LTR retroposons appear to be teetering on the brink of extinction, if they have not already succumbed. In the draft genome sequence, we can identify only three full-length copies with all ORFs intact (the final total may be slightly higher owing to the imperfect state of the draft genome sequence).
More generally, the overall activity of all transposons has declined markedly over the past 35–50 Myr, with the possible exception of LINE1 Fig. Indeed, apart from an exceptional burst of activity of Alus peaking around 40 Myr ago, there would appear to have been a fairly steady decline in activity in the hominid lineage since the mammalian radiation.
The extent of the decline must be even greater than it appears because old repeats are gradually removed by random deletion and because old repeat families are harder to recognize and likely to be under-represented in the repeat databases. We confirmed that the decline in transposition is not an artefact arising from errors in the draft genome sequence, which, in principle, could increase the divergence level in recent elements.
First, the sequence error rate (Table 9) is far too low to have a significant effect on the apparent age of recent transposons; and second, the same result is seen if one considers only finished sequence. What explains the decline in transposon activity in the lineage leading to humans? We return to this question below, in the context of the observation that there is no similar decline in the mouse genome. We compared the complement of transposable elements in the human genome with those of the other sequenced eukaryotic genomes.
We analysed the fly, worm and mustard weed genomes for the number and nature of repeats Table 12 and the age distribution Fig. The human genome stands in stark contrast to the genomes of the other organisms. The repeats in the other organisms may have been slightly underestimated because the repeat databases for the other organisms are less complete than for the human, especially with regard to older elements; on the other hand, recent additions to these databases appear to increase the repeat content only marginally.
The difference is most marked with the fly, but is clear for the other genomes as well. The rate of large deletions has not been systematically compared, but seems likely also to differ markedly. Instead, the worm, fly and mustard weed genomes all contain many transposon families, each consisting of typically hundreds to thousands of elements.
These features of the human genome are probably general to all mammals. The relative lack of horizontally transmitted elements may have its origin in the well developed immune system of mammals, as horizontal transfer requires infectious vectors, such as viruses, against which the immune system guards. We also looked for differences among mammals, by comparing the transposons in the human and mouse genomes.
As with the human genome, care is required in calibrating the substitution clock for the mouse genome.
Clones are then selected for shotgun sequencing and the whole genome sequence is reconstructed by map-guided assembly of overlapping clone sequences 3.
The availability of the whole-genome clone-based map assisted the sequencing of the human genome in many respects. The fingerprinted BAC map made it possible to select clones for sequencing that would ensure comprehensive coverage of the genome and reduce sequencing redundancy. In addition, the challenge of sequence assembly was minimized by restricting random shotgun sequencing to individual clones. Furthermore, the clone-based map also enabled the identification of large segments of the genome that are repeated, thereby simplifying the assembly.
Many IHGSC centres had developed chromosomal maps and resources that were not integrated, so it was essential to have a unifying genome map to enable localization of clones, with respect to previously sequenced clones, before they were sequenced.
The accurate fingerprinting and sizing of each clone enabled us to verify the accuracy of shotgun sequence 4 assembly of each clone. The human genome presented unique challenges for the development of a clone-based physical map.
Its size of 3. Its greater complexity also made it more difficult to distinguish true overlaps, which was further complicated by the repeat-rich nature of the genome. Early efforts to construct clone-based regional and even chromosomal physical maps of the human genome using cosmid libraries derived from isolated human chromosomes met with limited success 5,6. By contrast, maps based on sequence-tagged site (STS) landmarks provided greater coverage of the genome 7,8,9, as did genetic maps based on variations in simple sequence repeats in STS landmarks 10. The development of P1-artificial chromosome (PAC) 12 and bacterial artificial chromosome (BAC) 13 cloning systems was pivotal to the success of the whole-genome map.
They provided larger inserts, more stable clones and better coverage of the genome. Clone-based maps similar to that described here have been important in the sequencing of most large genomes, including those of Saccharomyces cerevisiae 1, Caenorhabditis elegans 2 and Arabidopsis thaliana. A clone-based map also contributed to the sequencing of the Drosophila melanogaster genome 15,16 and a combined mapping and sequencing strategy is being applied to the mouse genome 17. This work illustrates the benefit of using the clone-based map in the assembly of the human genome sequence.
The pilot phase of the sequencing project began in , at which time efforts were renewed to develop clone-based maps covering specific regions of the genome. To construct these regional maps, we screened PAC and BAC clones for STS markers, fingerprinted the positive clones, integrated them into the existing maps, and selected the largest, intact clones with minimal overlap for sequencing.
To keep pace with the ramping up of the sequencing effort in , the ongoing efforts to construct the whole-genome BAC map were increased approximately tenfold. The whole-genome BAC map was constructed in several steps. First we collected fingerprint data for a large sample of random clones from a genome-wide BAC library. We then assembled the BAC map, first by using the fingerprint data to cluster highly related clones automatically, then by further refining them manually, and last by merging contigs with related clones at their ends.
Finally, in parallel with construction of the BAC map, we mapped the chromosomal positions of individual clones on the basis of landmarks from existing landmark maps. Redundancy of sampling was vital to achieve high continuity in the final map Assuming an average BAC insert size of , base pairs bp and a genome size of 3.
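The relationship between the number of clones sampled, insert size and clone coverage can be sketched as below. The numbers in the usage example are round placeholders, not the values used for this map, and the uncovered-fraction estimate assumes that clones are placed at random (a Poisson approximation that is not stated in the text).

```python
import math


def fold_coverage(n_clones, insert_bp, genome_bp):
    """Clone coverage: total cloned bases divided by genome size."""
    return n_clones * insert_bp / genome_bp


def clones_for_coverage(target_coverage, insert_bp, genome_bp):
    """Number of clones needed to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_bp / insert_bp)


def expected_uncovered_fraction(coverage):
    """Fraction of the genome expected in no clone, assuming random clone placement."""
    return math.exp(-coverage)


# Placeholder values: a 1,000,000,000 bp genome and 100,000 bp inserts.
one_fold = fold_coverage(10_000, 100_000, 1_000_000_000)
needed = clones_for_coverage(15, 100_000, 1_000_000_000)
```

The exponential fall in the uncovered fraction with increasing coverage is one way to see why redundant sampling matters for map continuity.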
The library was derived from male DNA, providing full coverage of all 24 human chromosomes but with half as much coverage of the sex chromosomes as of the autosomes. Our experience with the library found it to be of high quality with uniformly large-insert clones, few non-recombinant clones and little cross-contamination of source plates. With these and other improvements in the fingerprinting technology and resources, we increased throughput tenfold to process more than 20, fingerprints (which equates to approximately onefold clone coverage of the human genome) each week.
This provided differential sampling of the genome, given the different distribution of the restriction enzyme sites within the genome. Every fifth lane contains a mixture of marker DNAs; the sizes of selected marker fragments are indicated. We experimented with various strategies for automated assembly that would be as complete and as consistent as possible see Supplementary Information.
First, we edited the fingerprint data itself. We therefore removed fragments smaller than bp before assembly. To reduce the variability between the number of bands called in these multiplet situations and thus increase the reliability with which related clones are correctly overlapped, these fragments were collapsed to only a single band in the resulting fingerprint.
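The two editing steps described above, followed by a comparison of two cleaned fingerprints, can be sketched as below. The size cutoff and the tolerance for treating bands as the same are hypothetical parameters, and the simple shared-band count stands in for whatever overlap statistic the assembly software actually applies.

```python
def clean_fingerprint(fragments, min_size, tolerance):
    """Drop fragments below min_size and collapse bands within `tolerance`
    of the previously kept band into a single band."""
    kept = sorted(f for f in fragments if f >= min_size)
    collapsed = []
    for size in kept:
        if collapsed and size - collapsed[-1] <= tolerance:
            continue  # treat as part of the same multiplet
        collapsed.append(size)
    return collapsed


def shared_bands(fp_a, fp_b, tolerance):
    """Count bands that match one-to-one between two sorted fingerprints."""
    i = j = shared = 0
    while i < len(fp_a) and j < len(fp_b):
        diff = fp_a[i] - fp_b[j]
        if abs(diff) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return shared
```

Collapsing multiplets to one band means that two clones are not penalized when the band caller counts a doublet in one lane and a triplet in another.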
We compared the clusters obtained for consistency with known regions and with other mapping data for the fingerprinted clones (primarily radiation hybrid chromosomal localization data from the Stanford Human Genome Center, SHGC). The remaining unincorporated clones (singletons) were excluded, as they contained too few bands to be included by automated assembly under these conditions or simply had no closely related clones.
These latter clones included artefacts such as clones that had rearranged or had poor quality data, as well as rare clones representing poorly sampled portions of the genome. As fingerprints from new clones were added after the initial assembly, there was a disproportionate increase in the number of singletons (Table 2). These new data were only incorporated into existing clusters or contigs if they added needed depth or helped to join contigs. One possible explanation is that these new libraries encompass regions of the genome not represented in the initial RPCI library.
Most clones Although only about two-thirds of the fingerprint data are derived from DNA from a single individual, we did not experience any problems in assembly arising from polymorphisms between the individuals from whom the DNA was obtained.
The goals of the manual editing were to refine the ordering of the clones within clusters to create contigs, to disassemble larger chimaeric contigs representing clusters of two or more sets of non-overlapping clones and to join contigs. This process involved first editing the fingerprint assemblies using the tools encapsulated in FPC to ensure that every clone within a contig was properly situated with respect to its most highly related neighbours, defined by fingerprint similarity 14 (see Supplementary Information).
About chimaeric clusters were identified and disassembled. Joins were incorporated into the map if the fingerprinting data was logically consistent with the proposed map order Fig. Portion of contig shown is localized to chromosomal region 8q21, composed of BAC clones ordered by restriction fingerprint mapping. Only of the clones are displayed.
The contig contains markers; 77 clones have been selected for sequencing. Green: specifically associated with clone NE06 (aqua in c). There are 69, markers currently in the database associated with clones, largely by ePCR. Only one marker of the 62 shown is inconsistent with the 8q21 localization of this contig (D17S, red underline). This is probably not a unique marker in the genome, as the clone with which it is associated also contains several chromosome 8 markers.
Blue, example clones selected for sequencing. These clones were believed to overlap as they shared several restriction fragments; overlaps have been confirmed by working draft sequence.
GenBank accession numbers are indicated. Sequences were mapped to the associated clone using in silico restriction digests, BAC end sequences and sequence overlap.
Around The incorrect clone name referenced in their sequence records is indicated. Several are associated with clones in this contig (c), further positioning this contig within the genome.
The most notable effect of the intensive editing was the greater than fivefold reduction in total contigs, from a high of 7, contigs after chimaeric contigs had been disassembled, to 1, by the 7 October data freeze of the draft genome sequence (ref. 3; Table 2).
At the time of writing, the number of contigs had fallen further, to just contigs. As the contigs became accurately positioned and oriented with respect to one another see below and with the emergence of the draft sequence, end clones of adjacent contigs with overlapping sequence were recognized. After inspection of the sequence overlap to rule out shared sequence resulting from internal repeated segments, about half of the candidate joins were well supported by the fingerprint data and were integrated into the map.
Another 62 had unconvincing evidence of overlap based on fingerprints but were tagged as overlapping on the basis of sequence alone. The contigs appeared to be appropriately distributed among the chromosomes on the basis of the expected size of the chromosomes. The number of contigs per chromosome varies with the size of the chromosomes and the efforts made at closure (Table 3).
Chromosomes 6, 7, 13, 14, 15, 20 and Y have relatively few remaining gaps, with 21, 29, 15, 21, 19, 10 and 8 contigs, respectively. To increase the utility of the whole-genome BAC map, we incorporated various map data to anchor the contigs along the 24 chromosomes. This enabled us to position 96, different BAC clones as genome anchor points for the contigs. In addition, because the RPCI library was used for other genome initiatives, much additional marker information was available from other laboratories.
Cox, unpublished data, with many of these selected deliberately because they came from clones in unlocalized contigs. In addition, chromosomal assignment and integration of cytogenetic map positions were achieved by utilizing 3, BACs mapped by fluorescence in situ hybridization (FISH) data. As the working draft sequence accumulated, known markers within the sequence were readily identified by electronic PCR (ePCR), a program that searches sequence for STSs by identifying the associated primer sequences in the correct orientation and with correct spacing. These data were incorporated into the FPC database.
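The ePCR search described above can be illustrated with a minimal sketch: look for the forward primer, then for the reverse complement of the reverse primer downstream of it, and accept the pair only if the implied product size is close to the expected size. This is not the ePCR program itself; the function names, the `margin` parameter and the toy sequence are assumptions, and the sketch searches one strand with exact matching only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, product_size) for each primer pair hit.

    A hit requires the forward primer followed, on the same strand, by
    the reverse complement of the reverse primer (correct orientation),
    with a product size within `margin` of `expected_size` (correct
    spacing). `margin` is an assumed tolerance.
    """
    seq = sequence.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse)
    hits = []
    start = seq.find(fwd)
    while start != -1:
        end = seq.find(rev_rc, start + len(fwd))
        while end != -1:
            size = end + len(rev_rc) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = seq.find(rev_rc, end + 1)
        start = seq.find(fwd, start + 1)
    return hits


# Hypothetical 30-base sequence containing one STS with a 22-base product.
toy_sequence = "TTTT" + "ACGTAC" + "GGGGGGGGGG" + "TTCAGA" + "CCCC"
hits = find_sts(toy_sequence, "ACGTAC", "TCTGAA", expected_size=22, margin=0)
```

A practical tool would also scan the opposite strand and tolerate a small number of primer mismatches; both are omitted here to keep the orientation and spacing checks easy to follow.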
Once sequenced clones could be reliably associated with the fingerprinted clones (ref. 3), we could use the marker content of sequenced clones determined by ePCR to order and orient contigs more reliably. The regional mapping data included those for chromosomes 12 ref. Telomeric contigs were identified and positioned where possible, as described elsewhere in this issue. These clones included those from regions of chromosomes 5 (J. Cheng),
8 (A. Rosenthal and N. Shimizu, ref. 35), 11 (Y. Sakaki) and 17 J. In addition, we used computer-generated restriction digests, or in silico digests, of sequences in GenBank to incorporate these clones into the whole-genome BAC map.
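An in silico digest of the kind mentioned above can be sketched as follows: find every occurrence of an enzyme's recognition site in a sequence, cut at a fixed offset within each site, and report the resulting fragment lengths, which can then be compared with an experimental fingerprint. The enzyme is an assumption for illustration (HindIII, which recognizes AAGCTT and cuts after the first A); the function name and the toy sequence are hypothetical, and the sequence is treated as linear.

```python
def in_silico_digest(sequence, site="AAGCTT", cut_offset=1):
    """Return sorted fragment lengths from cutting a linear sequence.

    `site` is the recognition sequence and `cut_offset` is the position
    within the site where the strand is cut. The defaults correspond to
    HindIII (A^AGCTT), assumed here for illustration only.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    # Fragment boundaries: sequence start, every cut, sequence end.
    bounds = [0] + cuts + [len(seq)]
    return sorted(bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1))


# Hypothetical 19-base sequence with two recognition sites.
toy_sequence = "GG" + "AAGCTT" + "CCCC" + "AAGCTT" + "T"
fragments = in_silico_digest(toy_sequence)
```

Real fingerprints resolve only a limited range of fragment sizes, so a comparison against gel data would normally filter the predicted fragments to that range first.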
Of the 96 BACs examined, 87 were successfully assigned to a single chromosome band. The remaining clones either failed to label (six) or were associated with multiple chromosome bands (three). A single BAC mapped to one of the two positions that were equally well supported by the marker content of its associated contig.