1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular
Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA
Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to
grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003,
contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built
from direct submissions by individual laboratories and from bulk submissions by large-scale
sequencing centers.
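
The growth figures above imply a simple doubling formula, N(t) = N0 * 2^(t / 10) with t in months. The following is a minimal sketch of that arithmetic, assuming the 10-month doubling time stays constant; the function name and the projection horizons are illustrative and do not come from GenBank release notes.

```python
def projected_bases(initial_bases: float, months_elapsed: float,
                    doubling_months: float = 10.0) -> float:
    """Project database size under constant exponential doubling.

    Assumption: the doubling time quoted in the text (10 months) holds
    for the whole projection window, which real growth need not do.
    """
    return initial_bases * 2 ** (months_elapsed / doubling_months)


# Release 134 (February 2003) held over 29.3 billion bases.
release_134_bases = 29.3e9

for months in (10, 20, 30):
    size = projected_bases(release_134_bases, months)
    print(f"{months:>2} months after Release 134: about {size / 1e9:.1f} billion bases")
```

This is only an extrapolation from the two numbers quoted in the text, not a statement about the size of any later release.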
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
which is a Web-based form, or the stand-alone submission program, Sequin
[http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff
assigns an Accession number to the sequence and performs quality assurance checks. The
submissions are then released to the public database, where the entries are retrievable by Entrez or
downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site
(STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data
most often come from large-scale sequencing centers. The GenBank direct submissions group also
processes complete microbial genome sequences.
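
Once a record is public, it can be looked up by its Accession number. The sketch below builds a retrieval URL for NCBI's E-utilities EFetch service, which this chapter does not describe; the endpoint and the parameter values (db=nuccore, rettype=gb, retmode=text) are assumptions about the current public interface, and the accession U49845 is used purely as an illustration. The code only constructs the URL and does not contact the network.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities EFetch service (not named in this chapter).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Return a URL that requests one nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # Accession number assigned by GenBank staff
        "rettype": "gb",     # GenBank flat-file representation
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


print(genbank_flatfile_url("U49845"))
```

The resulting URL can be passed to any HTTP client; check the current E-utilities documentation for usage limits before scripting repeated requests.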
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early
1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook
the task of scanning the literature for sequences and manually typing the sequences into the
database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence.
Nearly all of the sequences are now deposited directly by the labs that generate the sequences.
This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences
are first deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession
number can be cited and the sequence can be retrieved when the article is published. NCBI began
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in
addition to the approximately 200,000 bulk submissions that are processed automatically.
NCBI Handbook GenBank
International Collaboration
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence
Database Collaboration with the EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL,
Los Alamos, NM). Subsequently, the GSDB was removed from the Collaboration (by the National
Center for Genome Resources, Santa Fe, NM), and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima,
Japan) joined the group. Each database has its own set of submission and retrieval tools, but the
three databases exchange data daily so that all three should contain the same set of
sequences. Members of the DDBJ, EMBL, and GenBank staff meet annually to discuss technical
issues, and an international advisory board meets with the database staff to provide additional
guidance. To avoid conflicting data at the three sites, an entry can be updated only by the database
that initially prepared it.
The Collaboration created a Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html] that outlines legal features and syntax for the
DDBJ, EMBL, and GenBank feature tables. The purpose of this document is to standardize annotation
across the databases. The presentation and format of the data differ among the three databases;
however, the underlying biological information is the same.
Confidentiality of Data
When scientists submit data to GenBank, they have the opportunity to keep their data confidential
for a specified period of time. This helps to allay concerns that the availability of their data in
GenBank before publication may compromise their work. When the article containing the citation of the
sequence or its Accession number is published, the sequence record is released. The database
staff request that submitters notify GenBank of the date of publication so that the sequence can be
released without delay. The request to release should be sent to gb-admin@ncbi.nlm.nih.gov.
Direct Submissions
The typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence
with annotations. The annotations are meant to provide an adequate representation of the biological
information in the record. The GenBank Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html] describes the various features and subsequent
qualifiers agreed upon by the International Nucleotide Sequence Database Collaboration.
Currently, only nucleotide sequences are accepted for direct submission to GenBank. These
include mRNA sequences with coding regions, fragments of genomic DNA with a single gene or
multiple genes, and ribosomal RNA gene clusters. If part of the nucleotide sequence encodes a
protein, a conceptual translation, called a CDS (coding sequence), is annotated. The span of the
CDS feature is mapped to the nucleotide sequence encoding the protein. A protein Accession
number (/protein_id) is assigned to the translation product, which will subsequently be added to the
protein databases.
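
To make the idea of a conceptual translation concrete, here is a minimal sketch that translates a CDS span into protein. It assumes the standard genetic code, a forward-strand CDS in a single interval, and 1-based inclusive coordinates; the function name and the toy sequence are illustrative and are not part of any GenBank tool.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the span start..end (1-based, inclusive) on the forward strand.

    Translation stops at the first stop codon; codons containing
    ambiguous bases are reported as 'X'.
    """
    coding = sequence[start - 1:end].upper().replace("U", "T")
    if len(coding) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of three")
    protein = []
    for i in range(0, len(coding), 3):
        amino_acid = CODON_TABLE.get(coding[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the product
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy record: the CDS spans bases 3..11 (ATG GCC TAA).
print(translate_cds("CCATGGCCTAAGG", 3, 11))
```

Real CDS features can be on the complementary strand, span several intervals, start in a different codon position, or use another genetic code, none of which this sketch handles.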
Multiple sequences can be submitted together. Such batch submissions of unrelated sequences
may be processed together but will be displayed in Entrez (Chapter 15) as single records.
Alternatively, by using the Sequin submission tool (Chapter 12), a submitter can specify that several
sequences are biologically related. Such sequences are classified as environmental sample sets,
population sets, phylogenetic sets, mutation sets, or segmented sets. Each sequence within a set
is assigned its own Accession number and can be viewed independently in Entrez. However, with
the exception of segmented sets, each set is also indexed within the PopSet division of Entrez, thus
allowing scientists to view the relationship between the sequences.
What defines a set? Environmental sample, population, phylogenetic, and mutation sets all
contain a group of sequences that spans the same gene or region of the genome. Environmental
samples are derived from a group of unclassified or unknown organisms. A population set contains
sequences from different isolates of the same organism. A phylogenetic set contains sequences
from different organisms that are used to determine the phylogenetic relationship between them.
Sequencing multiple mutations within a single gene gives rise to a mutation set.
All sets, except segmented sets, may contain an alignment of the sequences within them and
might include external sequences already present in the database. In fact, the submitter can begin
with an existing alignment to create a submission to the database using the Sequin submission tool.
Currently, Sequin accepts FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and NEXUS
Contiguous alignments. Submitted alignments will be displayed in the PopSet section of Entrez.
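
As an illustration of what an alignment input looks like, the sketch below reads a gapped FASTA alignment and checks that every row has the same length. It assumes that FASTA+GAP means ordinary FASTA records whose sequences contain '-' as the gap character; the function name and sample identifiers are hypothetical, and this is not Sequin's own parser.

```python
def read_gapped_fasta(text: str) -> dict[str, str]:
    """Parse gapped FASTA text into {identifier: aligned sequence}.

    Raises ValueError if sequence data precedes the first header, a
    header has no identifier, or the aligned rows differ in length.
    """
    records: dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without an identifier")
            name = fields[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name] += line  # sequences may wrap over several lines
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return records


alignment = """\
>isolate_A
ATGC--GTA
>isolate_B
ATGCCAGTA
"""
for identifier, row in read_gapped_fasta(alignment).items():
    print(identifier, row)
```

The equal-length check is the property that distinguishes an alignment from a plain batch of FASTA sequences.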
A segmented set is a collection of noncontiguous sequences that cover a specified genetic
region. The most common example is a set of genomic sequences containing exons from a single
gene where part or all of the intervening regions have not been sequenced. Each member record
within the set contains the appropriate annotation, exon features in this case. However, the mRNA
and CDS will be annotated as joined features across the individual records. Segmented sets
themselves can be part of an environmental sample, population, phylogenetic, or mutation set.
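
A joined feature across member records can be pictured as an ordered list of spans, each tied to the record it comes from. The sketch below formats such a list in the join(accession:start..end,...) style used for feature locations and concatenates the corresponding bases; the accessions, sequences, and function names are hypothetical and exist only for illustration.

```python
Span = tuple[str, int, int]  # (accession.version, start, end), 1-based inclusive


def join_location(spans: list[Span]) -> str:
    """Format spans from several records as a single join(...) location string."""
    parts = [f"{accession}:{start}..{end}" for accession, start, end in spans]
    return "join(" + ",".join(parts) + ")"


def spliced_sequence(records: dict[str, str], spans: list[Span]) -> str:
    """Concatenate the bases of each span, in order, across the member records."""
    return "".join(records[accession][start - 1:end] for accession, start, end in spans)


# Two hypothetical segments, each carrying one exon of the same gene.
segments = {
    "XX000001.1": "NNNNATGGCC",   # exon 1 at bases 5..10
    "XX000002.1": "GATTAANNN",    # exon 2 at bases 1..6
}
exons = [("XX000001.1", 5, 10), ("XX000002.1", 1, 6)]

print(join_location(exons))
print(spliced_sequence(segments, exons))
```

The unsequenced intervening regions never appear in the spliced product, which is why the mRNA and CDS can be annotated even though the segments are not contiguous.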