1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular
Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA
Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to
grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003,
contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built
from direct submissions by individual laboratories and from bulk submissions by large-scale
sequencing centers.
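The growth claim above can be turned into a small worked calculation. The following is a minimal sketch, assuming the 10-month doubling time holds constant and using the Release 134 base count as the starting point; the function names are hypothetical, and the projections are an idealized extrapolation, not actual release statistics.

```python
import math

DOUBLING_MONTHS = 10.0        # doubling time quoted in the text
BASES_RELEASE_134 = 29.3e9    # "over 29.3 billion" bases in Release 134 (February 2003)


def projected_bases(months_elapsed, start=BASES_RELEASE_134, doubling=DOUBLING_MONTHS):
    """Idealized database size after months_elapsed months of steady doubling.

    Implements N(t) = N0 * 2 ** (t / T), where T is the doubling time.
    """
    return start * 2.0 ** (months_elapsed / doubling)


def months_to_reach(target, start=BASES_RELEASE_134, doubling=DOUBLING_MONTHS):
    """Months needed to grow from start to target under the same assumption."""
    return doubling * math.log2(target / start)


if __name__ == "__main__":
    # Illustrative extrapolation only; real releases need not follow this curve.
    for months in (0, 10, 20, 30):
        size = projected_bases(months) / 1e9
        print(f"{months:>2} months after Release 134: ~{size:.1f} billion bases")
```

Under this assumption the size doubles after 10 months and quadruples after 20; any real departure from a constant doubling time would change these figures.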
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
which is a Web-based form, or the stand-alone submission program, Sequin
[http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff
assigns an Accession number to the sequence and performs quality assurance checks. The
submissions are then released to the public database, where the entries are retrievable by Entrez or
downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site
(STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are
most often submitted by large-scale sequencing centers. The GenBank direct submissions group also
processes complete microbial genome sequences.
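Retrieval of a released entry by Accession number can also be scripted. The text mentions only Entrez and FTP; the sketch below assumes NCBI's E-utilities EFetch endpoint as the programmatic route into Entrez, which is not described in this chapter. The helper names are hypothetical, and U49845 is used purely as an example accession.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the E-utilities EFetch endpoint, not mentioned in the chapter itself.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an EFetch URL for one nucleotide record (rettype 'gb' or 'fasta')."""
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,      # Accession number assigned at submission
        "rettype": rettype,   # record format to return
        "retmode": "text",    # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


def fetch_record(accession, rettype="gb", timeout=30):
    """Download the record as text. Requires network access."""
    with urlopen(efetch_url(accession, rettype), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_record() to perform the download.
    print(efetch_url("U49845"))
```

Building the URL separately from the download keeps the request easy to inspect and lets the same helper serve both flat-file (`gb`) and FASTA (`fasta`) retrieval.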
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early
1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook
the task of scanning the literature for sequences and manually typing the sequences into the
database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence.
Nearly all of the sequences are now deposited directly by the labs that generate the sequences.
This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences
first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession
number can be cited and the sequence can be retrieved when the article is published. NCBI began
NCBI Handbook GenBank
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in
addition to the approximately 200,000 bulk submissions that are processed automatically.
International Collaboration
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence
Database Collaboration with the EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL,
Los Alamos, NM). Subsequently, the GSDB was removed from the Collaboration (by the National
Center for Genome Resources, Santa Fe, NM), and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima,
Japan) joined the group. Each database has its own set of submission and retrieval tools, but the
three databases exchange data daily so that all three should contain the same set of
sequences. Members of the DDBJ, EMBL, and GenBank staff meet annually to discuss technical
issues, and an international advisory board meets with the database staff to provide additional
guidance. To avoid conflicting data at the three sites, an entry can be updated only by the
database that initially prepared it.
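The update rule described above (only the originating database may change an entry, while all three sites hold a copy) can be illustrated with a small data-structure sketch. This is a toy model under stated assumptions, not how the collaborating databases are implemented: the `Entry` class, the `apply_update` function, and the accession `EXAMPLE0001` are all hypothetical.

```python
from dataclasses import dataclass

# The three collaborating databases named in the text.
DATABASES = ("DDBJ", "EMBL", "GenBank")


@dataclass
class Entry:
    accession: str   # hypothetical identifier used for illustration
    owner: str       # database that initially prepared the entry
    sequence: str


def apply_update(entry, requesting_db, new_sequence):
    """Update an entry only if the request comes from the owning database."""
    if requesting_db not in DATABASES:
        raise ValueError(f"unknown database: {requesting_db}")
    if requesting_db != entry.owner:
        # Other sites hold a copy through the daily exchange but may not edit it.
        raise PermissionError(
            f"{entry.accession} is owned by {entry.owner}; "
            f"{requesting_db} cannot update it"
        )
    entry.sequence = new_sequence
    return entry


if __name__ == "__main__":
    record = Entry(accession="EXAMPLE0001", owner="EMBL", sequence="ACGT")
    apply_update(record, "EMBL", "ACGTT")        # allowed: EMBL prepared the entry
    try:
        apply_update(record, "GenBank", "AAAA")  # rejected: GenBank is not the owner
    except PermissionError as err:
        print(err)
```

Restricting writes to a single owner means the daily exchange only ever has to propagate one authoritative version of each entry.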
The Collaboration created a Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html]
that outlines legal features and syntax for the DDBJ, EMBL, and GenBank feature tables.
The purpose of this document is to standardize annotation across the databases. The presentation
and format of the data are different in the three databases; however, the underlying biological
information is the same.
Confidentiality of Data
When scientists submit data to GenBank, they have the opportunity to keep their data confidential
for a specified period of time. This helps to allay concerns that the availability of their data in
GenBank before publication may compromise their work. When the article containing the citation of the
sequence or its Accession number is published, the sequence record is released. The database
staff request that submitters notify GenBank of the date of publication so that the sequence can be
released without delay. The request to release should be sent to gb-admin@ncbi.nlm.nih.gov.
Direct Submissions
The typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence
with annotations. The annotations are meant to provide an adequate representation of the biological
information in the record. The GenBank Feature Table Definition [
http://www.ncbi.nlm.nih.gov/col-