1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular
Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA
Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to
grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003,
contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built
from direct submissions made by individual laboratories, as well as bulk submissions from large-scale
sequencing centers.
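As a rough illustration of what a 10-month doubling time implies, the following sketch extrapolates from the Release 134 figure quoted above. It assumes the doubling time stays constant, which is a simplification. The function name `projected_bases` is hypothetical, and the values it prints are extrapolations, not reported release statistics.

```python
def projected_bases(initial_bases: float,
                    months_elapsed: float,
                    doubling_months: float = 10.0) -> float:
    """Project database size under constant exponential growth.

    Assumption: size doubles every `doubling_months` months, so
    N(t) = N0 * 2 ** (t / doubling_months).
    """
    return initial_bases * 2 ** (months_elapsed / doubling_months)


# Release 134 (February 2003) base count, taken from the text.
RELEASE_134_BASES = 29.3e9

if __name__ == "__main__":
    # Illustrative extrapolation only; not actual release data.
    for months in (0, 10, 20, 30):
        size = projected_bases(RELEASE_134_BASES, months)
        print(f"{months:>2} months after Release 134: {size / 1e9:.1f} billion bases")
```

Under this assumption the database would quadruple in 20 months; actual growth between releases can differ.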
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
which is a Web-based form, or the stand-alone submission program, Sequin
[http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff
assigns an Accession number to the sequence and performs quality assurance checks. The
submissions are then released to the public database, where the entries are retrievable by Entrez or
downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site
(STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data
most often come from large-scale sequencing centers. The GenBank direct submissions group also
processes complete microbial genome sequences.
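For readers who want to retrieve a released entry programmatically, the sketch below builds a request for a single record by Accession number. It assumes NCBI's E-utilities EFetch service, which is not named in this chapter, as the programmatic route into Entrez. The helper names `efetch_url` and `fetch_record` are hypothetical, and U49845 is used only as an example accession.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the E-utilities EFetch endpoint is used to reach Entrez.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for one nucleotide record.

    rettype "gb" requests the GenBank flat file; "fasta" requests FASTA.
    """
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,      # Accession number assigned by GenBank staff
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb") -> str:
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the URL only; call fetch_record() to actually download.
    print(efetch_url("U49845"))
```

Building the URL separately from the download keeps the request easy to inspect before any network call is made.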
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early
1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook
the task of scanning the literature for sequences and manually typing the sequences into the
database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence.
Nearly all of the sequences are now deposited directly by the labs that generate the sequences.
This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences
first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession
number can be cited and the sequence can be retrieved when the article is published. NCBI began
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in
addition to the approximately 200,000 bulk submissions that are processed automatically.
International Collaboration