1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular
Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA
Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to
grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003,
contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built
from direct submissions by individual laboratories as well as bulk submissions from large-scale
sequencing centers.
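The doubling time quoted above implies a simple exponential model. As a rough illustration only, and assuming growth continued at exactly that rate (an idealization), the projected number of bases t months after Release 134 would be about 29.3 billion × 2^(t/10). The function name and the choice of Release 134 as the baseline are this sketch's own.

```python
# Illustrative only: assumes a constant 10-month doubling time, which the
# text reports as a trend, not a guarantee.
RELEASE_134_BASES = 29.3e9   # approximate base count in Release 134 (February 2003)
DOUBLING_MONTHS = 10.0

def projected_bases(months_after_release: float) -> float:
    """Extrapolate the base count under constant exponential growth."""
    return RELEASE_134_BASES * 2 ** (months_after_release / DOUBLING_MONTHS)

if __name__ == "__main__":
    for months in (0, 10, 20):
        print(f"{months:>2} months: {projected_bases(months):.3g} bases")
```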
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
which is a Web-based form, or the stand-alone submission program, Sequin
[http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff
assigns an Accession number to the sequence and performs quality assurance checks. The
submissions are then released to the public database, where the entries are retrievable by Entrez or
downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site
(STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are
most often made by large-scale sequencing centers. The GenBank direct submissions group also
processes complete microbial genome sequences.
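Retrieval through Entrez can also be scripted. The sketch below assumes NCBI's E-utilities `efetch` service, which this chapter does not describe, and only builds the request URL for one record in flat-file format. The helper name `genbank_flatfile_url` is hypothetical, and U49845 is used purely as an example Accession number.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities efetch endpoint; it is not described in this chapter.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def genbank_flatfile_url(accession: str) -> str:
    """Build an efetch URL requesting one nucleotide record as flat-file text."""
    params = {
        "db": "nuccore",     # Entrez nucleotide database
        "id": accession,     # Accession number of the record
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"

if __name__ == "__main__":
    print(genbank_flatfile_url("U49845"))
```

Passing the resulting URL to `urllib.request.urlopen` would download the record. NCBI's usage guidelines on request rates should be checked before doing so in bulk.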
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early
1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook
the task of scanning the literature for sequences and manually typing the sequences into the
database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence.
Nearly all of the sequences are now deposited directly by the labs that generate the sequences.
This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences
first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession
number can be cited and the sequence can be retrieved when the article is published. NCBI began
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in
addition to the approximately 200,000 bulk submissions that are processed automatically.
NCBI Handbook GenBank
International Collaboration
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence
Database Collaboration with the EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL,
Los Alamos, NM). Subsequently, the GSDB was removed from the Collaboration (by the National
Center for Genome Resources, Santa Fe, NM), and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima,
Japan) joined the group. Each database has its own set of submission and retrieval tools, but the
three databases exchange data daily so that all three databases should contain the same set of
sequences. Members of the DDBJ, EMBL, and GenBank staff meet annually to discuss technical
issues, and an international advisory board meets with the database staff to provide additional
guidance. To avoid conflicting data at the three sites, an entry can be updated only by the database
that initially prepared it.
The Collaboration created a Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html]
that outlines legal features and syntax for the DDBJ, EMBL, and GenBank feature tables.
The purpose of this document is to standardize annotation across the databases. The presentation
and format of the data are different in the three databases; however, the underlying biological
information is the same.
Confidentiality of Data
When scientists submit data to GenBank, they have the opportunity to keep their data confidential
for a specified period of time. This helps to allay concerns that the availability of their data in
GenBank before publication may compromise their work. When the article containing the citation of the
sequence or its Accession number is published, the sequence record is released. The database
staff request that submitters notify GenBank of the date of publication so that the sequence can be
released without delay. The request to release should be sent to gb-admin@ncbi.nlm.nih.gov.
Direct Submissions
The typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence
with annotations. The annotations are meant to provide an adequate representation of the biological
information in the record. The GenBank Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html] describes the various features and subsequent
qualifiers agreed upon by the International Nucleotide Sequence Database Collaboration.
Currently, only nucleotide sequences are accepted for direct submission to GenBank. These
include mRNA sequences with coding regions, fragments of genomic DNA with a single gene or
multiple genes, and ribosomal RNA gene clusters. If part of the nucleotide sequence encodes a
protein, a conceptual translation, called a CDS (coding sequence), is annotated. The span of the
CDS feature is mapped to the nucleotide sequence encoding the protein. A protein Accession
number (/protein_id) is assigned to the translation product, which will subsequently be added to the
protein databases.
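To make the idea of a conceptual translation concrete, here is a minimal sketch that extracts a CDS span from a nucleotide sequence and translates it with the standard genetic code. It assumes a forward-strand, single-interval CDS with 1-based inclusive coordinates. The function name and the toy sequence are invented for illustration, and real records may use other genetic codes, the complementary strand, or partial codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the span start..end (1-based, inclusive) of a nucleotide sequence."""
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    # Codons containing ambiguous bases are reported as X.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein.rstrip("*")  # drop any trailing stop codon(s)

if __name__ == "__main__":
    toy_sequence = "GGATGGCCTAAGG"   # made-up sequence; the CDS spans 3..11
    print(translate_cds(toy_sequence, 3, 11))
```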
Multiple sequences can be submitted together. Such batch submissions of non-related
sequences may be processed together but will be displayed in Entrez (Chapter 15) as single records.
Alternatively, by using the Sequin submission tool (Chapter 12), a submitter can specify that several
sequences are biologically related. Such sequences are classified as environmental sample sets,
population sets, phylogenetic sets, mutation sets, or segmented sets. Each sequence within a set
is assigned its own Accession number and can be viewed independently in Entrez. However, with
the exception of segmented sets, each set is also indexed within the PopSet division of Entrez, thus
allowing scientists to view the relationship between the sequences.
What defines a set? Environmental sample, population, phylogenetic, and mutation sets all
contain a group of sequences that spans the same gene or region of the genome. Environmental
samples are derived from a group of unclassified or unknown organisms. A population set contains
sequences from different isolates of the same organism. A phylogenetic set contains sequences
from different organisms that are used to determine the phylogenetic relationship between them.
Sequencing multiple mutations within a single gene gives rise to a mutation set.
All sets, except segmented sets, may contain an alignment of the sequences within them and
might include external sequences already present in the database. In fact, the submitter can begin
with an existing alignment to create a submission to the database using the Sequin submission tool.
Currently, Sequin accepts FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and NEXUS
Contiguous alignments. Submitted alignments will be displayed in the PopSet section of Entrez.
A segmented set is a collection of noncontiguous sequences that cover a specified genetic
region. The most common example is a set of genomic sequences containing exons from a single
gene where part or all of the intervening regions have not been sequenced. Each member record
within the set contains the appropriate annotation, exon features in this case. However, the mRNA
and CDS will be annotated as joined features across the individual records. Segmented sets
themselves can be part of an environmental sample, population, phylogenetic, or mutation set.
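A joined feature across segment records can be pictured as a list of spans, each qualified by the record it lies in. The following sketch handles only a simplified `join(ID:start..end,...)` form on the forward strand with 1-based inclusive coordinates. The record identifiers, sequences, and function name are made up for illustration, and the full location syntax in the Feature Table Definition covers many more cases.

```python
import re

def assemble_join(location: str, records: dict) -> str:
    """Concatenate the spans named in a simplified join(...) location.

    Only join(ID:start..end,ID:start..end,...) is handled, with 1-based,
    inclusive coordinates on the forward strand.
    """
    match = re.fullmatch(r"join\((.*)\)", location.strip())
    if match is None:
        raise ValueError("expected a join(...) location")
    pieces = []
    for part in match.group(1).split(","):
        record_id, span = part.strip().split(":")
        start, end = (int(value) for value in span.split(".."))
        pieces.append(records[record_id][start - 1:end])
    return "".join(pieces)

# Hypothetical segment records; exons are shown in upper case for readability.
segments = {
    "SEG0001": "ccATGGCgt",   # exon 1 occupies positions 3..7
    "SEG0002": "agCTAAtt",    # exon 2 occupies positions 3..6
}
coding_sequence = assemble_join("join(SEG0001:3..7,SEG0002:3..6)", segments)
```

Here the coding sequence is assembled from two records even though the intervening region between the exons was never sequenced, which is the situation segmented sets are meant to represent.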