Subject: GenBank: The Nucleotide Sequence Database
Original poster, posted: 2009-02-24

GenBank: The Nucleotide Sequence Database

1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003, contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built from direct submissions by individual laboratories and from bulk submissions by large-scale sequencing centers.
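As a rough illustration of the growth figure quoted above, the sketch below projects database size under the idealized assumption of a constant 10-month doubling time, starting from the Release 134 base count. The function name `projected_bases` is hypothetical, and actual growth need not follow this curve.

```python
def projected_bases(start_bases: float, months: float,
                    doubling_months: float = 10.0) -> float:
    """Idealized exponential growth: size doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return start_bases * 2 ** (months / doubling_months)


# Release 134 (February 2003) contained over 29.3 billion bases.
RELEASE_134_BASES = 29.3e9

if __name__ == "__main__":
    # Print the projected size at a few offsets from Release 134.
    for months in (0, 10, 20, 30):
        size = projected_bases(RELEASE_134_BASES, months)
        print(f"+{months} months: {size / 1e9:.1f} billion bases")
```

Under this assumption the size grows by a factor of 2^(t/10) after t months, so three doubling periods (30 months) give an eightfold increase.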
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form, or the stand-alone submission program, Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff assigns an Accession number to the sequence and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often submitted in bulk by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.
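To make the retrieval step concrete, here is a minimal sketch of fetching a released record by Accession number. The text above names only Entrez and FTP; the E-utilities `efetch` endpoint used here is NCBI's programmatic interface to Entrez and is an assumption not stated in the original, as are the helper names `build_efetch_url` and `fetch_flatfile`.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the NCBI E-utilities efetch endpoint (not named in the text).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Build a URL requesting the GenBank flat file for one accession."""
    params = {
        "db": "nuccore",     # nucleotide records
        "id": accession,
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_flatfile(accession: str, timeout: float = 30.0) -> str:
    """Download the record text; requires network access."""
    with urlopen(build_efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_flatfile("U49845")` would request one record over the network; the accession is used only as an example.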
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early 1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook the task of scanning the literature for sequences and manually typing the sequences into the database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence. Nearly all of the sequences are now deposited directly by the labs that generate them. This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession number can be cited and the sequence can be retrieved when the article is published. NCBI began accepting direct submissions to GenBank in 1993 and received data from LANL until 1996. Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in addition to the approximately 200,000 bulk submissions that are processed automatically.
NCBI Handbook GenBank
International Collaboration
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence Database Collaboration with the EMBL database (European Bioinformatics Institute [http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL, Los Alamos, NM). Subsequently, the GSDB was removed from the Collaboration (by the National Center for Genome Resources, Santa Fe, NM), and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group. Each database has its own set of submission and retrieval tools, but the three databases exchange data daily so that all three should contain the same set of sequences. Members of the DDBJ, EMBL, and GenBank staff meet annually to discuss technical issues, and an international advisory board meets with the database staff to provide additional guidance. To avoid conflicting data at the three sites, an entry can be updated only by the database that initially prepared it.
The Collaboration created a Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html] that outlines legal features and syntax for the DDBJ, EMBL, and GenBank feature tables. The purpose of this document is to standardize annotation across the databases. The presentation and format of the data differ among the three databases; however, the underlying biological information is the same.
Confidentiality of Data
When scientists submit data to GenBank, they have the opportunity to keep their data confidential for a specified period of time. This helps to allay concerns that the availability of their data in GenBank before publication may compromise their work. When the article containing the citation of the sequence or its Accession number is published, the sequence record is released. The database staff request that submitters notify GenBank of the date of publication so that the sequence can be released without delay. The request to release should be sent to gb-admin@ncbi.nlm.nih.gov.
Direct Submissions
The typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence with annotations. The annotations are meant to provide an adequate representation of the biological information in the record. The GenBank Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html] describes the various features and their associated qualifiers agreed upon by the International Nucleotide Sequence Database Collaboration.
Currently, only nucleotide sequences are accepted for direct submission to GenBank. These include mRNA sequences with coding regions, fragments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters. If part of the nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated. The span of the CDS feature is mapped to the nucleotide sequence encoding the protein. A protein Accession number (/protein_id) is assigned to the translation product, which will subsequently be added to the protein databases.
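The idea of a conceptual translation over a CDS span can be sketched as follows. This is a simplified illustration, assuming the standard genetic code, a forward-strand CDS given as 1-based inclusive coordinates, and a span that begins in frame; the function name `translate_cds` is hypothetical and this is not GenBank's own translation code.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the CDS at 1-based, inclusive positions start..end."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("CDS span lies outside the sequence")
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    # Codons containing ambiguous bases are reported as 'X'.
    protein = [CODON_TABLE.get(cds[i:i + 3], "X")
               for i in range(0, len(cds), 3)]
    # A terminal stop codon is not part of the protein product.
    if protein and protein[-1] == "*":
        protein.pop()
    return "".join(protein)
```

For example, `translate_cds("GGATGGCCAAATAGCC", 3, 14)` translates the codons ATG, GCC, AAA, and TAG within the span and drops the final stop.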
Multiple sequences can be submitted together. Such batch submissions of non-related sequences may be processed together but will be displayed in Entrez (Chapter 15) as single records. Alternatively, by using the Sequin submission tool (Chapter 12), a submitter can specify that several sequences are biologically related. Such sequences are classified as environmental sample sets, population sets, phylogenetic sets, mutation sets, or segmented sets. Each sequence within a set is assigned its own Accession number and can be viewed independently in Entrez. However, with the exception of segmented sets, each set is also indexed within the PopSet division of Entrez, thus allowing scientists to view the relationship between the sequences.
What defines a set? Environmental sample, population, phylogenetic, and mutation sets all contain a group of sequences that spans the same gene or region of the genome. Environmental samples are derived from a group of unclassified or unknown organisms. A population set contains sequences from different isolates of the same organism. A phylogenetic set contains sequences from different organisms that are used to determine the phylogenetic relationship between them. Sequencing multiple mutations within a single gene gives rise to a mutation set.
All sets, except segmented sets, may contain an alignment of the sequences within them and might include external sequences already present in the database. In fact, the submitter can begin with an existing alignment to create a submission to the database using the Sequin submission tool. Currently, Sequin accepts FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and NEXUS Contiguous alignments. Submitted alignments will be displayed in the PopSet section of Entrez.
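As an illustration of what an alignment input might look like before submission, the sketch below reads a gapped FASTA alignment and checks that all rows have the same aligned length. It assumes that FASTA+GAP means ordinary FASTA records whose sequences contain "-" gap characters; the function name `parse_gapped_fasta` is hypothetical and this is not Sequin's parser.

```python
def parse_gapped_fasta(text: str) -> dict[str, str]:
    """Parse a gapped FASTA alignment into {sequence_id: aligned_row}."""
    chunks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current = fields[0]
            if current in chunks:
                raise ValueError(f"duplicate identifier: {current}")
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data before the first header")
        else:
            chunks[current].append(line)
    aligned = {name: "".join(parts) for name, parts in chunks.items()}
    # Every row of an alignment must span the same number of columns.
    if len({len(row) for row in aligned.values()}) > 1:
        raise ValueError("aligned rows differ in length")
    return aligned
```

A length check of this kind catches a common problem, a row that lost or gained gap characters, before the alignment is handed to a submission tool.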
A segmented set is a collection of noncontiguous sequences that cover a specified genetic region. The most common example is a set of genomic sequences containing exons from a single gene where part or all of the intervening regions have not been sequenced. Each member record within the set contains the appropriate annotation, exon features in this case. However, the mRNA and CDS will be annotated as joined features across the individual records. Segmented sets themselves can be part of an environmental sample, population, phylogenetic, or mutation set.
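A joined feature across member records can be sketched as below: the location is written in the feature table's `join(...)` form with each span prefixed by its record identifier, and the spliced sequence is the concatenation of those spans. The record identifiers `REC_A` and `REC_B`, their sequences, and the helper names are placeholders invented for illustration; spans are assumed to be 1-based, inclusive, and on the forward strand.

```python
Segment = tuple[str, int, int]  # (record identifier, start, end)


def join_location(segments: list[Segment]) -> str:
    """Format spans from several records as one join(...) location."""
    spans = ",".join(f"{rec}:{start}..{end}" for rec, start, end in segments)
    return f"join({spans})"


def spliced_sequence(records: dict[str, str], segments: list[Segment]) -> str:
    """Concatenate the listed spans, in order, across member records."""
    parts = []
    for rec, start, end in segments:
        seq = records[rec]
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"span {start}..{end} is outside record {rec}")
        parts.append(seq[start - 1:end])
    return "".join(parts)


if __name__ == "__main__":
    # Placeholder records standing in for two exon-containing segments.
    records = {"REC_A": "AAATGGCC", "REC_B": "TTAAATAGGG"}
    exons = [("REC_A", 3, 8), ("REC_B", 3, 8)]
    print(join_location(exons))
    print(spliced_sequence(records, exons))
```

Keeping the spans as data and deriving both the location string and the spliced sequence from them avoids the two drifting apart when a span is edited.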