Searching Genbank for Sequences

How to search and retrieve from the NIH Genbank

To search the data banks at the NIH

The [space] symbol in the message below is there to draw your attention to the fact that there should be exactly one space between the two words on either side of it. Do not type the brackets or the word "space" itself.

1) Send your enquiry to the following email address:

blast@ncbi.nlm.nih.gov

2) Send the following message EXACTLY AS WRITTEN, with each item on a separate line and with NO subject heading. The example shown is for a nucleotide search:

[space] indicates where you should leave a single space in your text

Program[space]blastn (or blastp for protein searches)
DATALIB[space]nr
alignments[space]20 (limits the number of matches reported: default is 50)
expect[space]20 (the number of matches you are willing to expect by chance alone; matches with a larger expectation value are not reported: default is 10)
Begin
>any description you will find handy (even a blank line), but this is a required line and you have to start the line with the '>' symbol
(on a new line, put your sequence here, between 30 and 100 bases, and then hit return twice)

3) Click the send button and away she goes.
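
As a rough illustration of the layout in step 2, the short Python sketch below assembles the message text. The function name build_blast_message and the placeholder sequence are hypothetical, chosen only for this example. The sketch only builds the text; it does not send any email or check that the server accepts it.

```python
def build_blast_message(sequence, description="my query", program="blastn",
                        datalib="nr", alignments=20, expect=20):
    """Return the message body in the line-by-line layout described in step 2."""
    sequence = sequence.strip()
    # The instructions ask for a sequence of between 30 and 100 bases.
    if not 30 <= len(sequence) <= 100:
        raise ValueError("sequence should be between 30 and 100 bases")
    lines = [
        f"Program {program}",        # exactly one space between the two words
        f"DATALIB {datalib}",
        f"alignments {alignments}",
        f"expect {expect}",
        "Begin",
        f">{description}",           # required line, must start with '>'
        sequence,
    ]
    # End with a blank line, matching "hit return twice".
    return "\n".join(lines) + "\n\n"


# Placeholder 40-base sequence, not a real gene.
print(build_blast_message("ACGT" * 10, description="test search"))
```

Paste the printed text into the body of your email and leave the subject line empty.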

Go to "How to Retrieve Genbank Files"

You will get a huge response very quickly, within 1 to 15 minutes. The first thing you will notice is a histogram drawn with ":" and "=" characters. It summarizes how many database sequences were found at each expectation threshold; it does not rate any individual match. Next comes a listing of the best matches, each with a score and the probability that a match this good would arise by chance (e.g. a score of 230 and a P of 3.6e-16 mean that a random match this good has a probability of 3.6 x 10^-16, or roughly 1 chance in 2.8 x 10^15; not bloody likely). Each line of the listing begins with an accession number set off by "|" signs (e.g. |J45678|). This is the sequence's official "name" in Genbank. After the listing come alignments of your sequence against the best-matching Genbank sequences, each with a % identity. The section "A Partial Example of the Information You Will Get Back" shows part of the output from a GFP search.
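
To make the probability arithmetic concrete, here is a small Python sketch. It assumes that a value written as 3.6 e-16 is ordinary scientific notation (3.6 x 10^-16); the helper name one_in is made up for this example.

```python
def one_in(p):
    """Turn a probability into the N of '1 chance in N'."""
    return 1.0 / p


# The P value used as the example in the text.
print(f"P = 3.6e-16 is about 1 chance in {one_in(3.6e-16):.1e}")
# prints: P = 3.6e-16 is about 1 chance in 2.8e+15
```

The smaller the P value, the less likely it is that the match arose by chance.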

A Partial Example of the Information You Will Get Back

Query= GFP search
(44 letters)
Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
234,430 sequences; 66,380,649 total letters.
Searching..................................................done
Observed Numbers of Database Sequences Satisfying
Various EXPECTation Thresholds (E parameter values)

Histogram units: = 45 Sequences : less than 45 sequences

EXPECTation Threshold
(E parameter)
|
V Observed Counts-->
10000 7512 2735 |=====================================================
6310 4777 1111 |========================
3980 3666 786 |=================
2510 2880 675 |===============
1580 2205 536 |===========
1000 1669 300 |======
631 1369 228 |=====
398 1141 190 |====
251 951 137 |===
158 814 139 |===
100 675 89 |=
63.1 586 98 |==
39.8 488 111 |==
25.1 377 82 |=
>>>>>>>>>>>>>>>>>>>>> Expect = 20.0, Observed = 339 <<<<<<<<<<<<<<<<
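
The three number columns in the histogram are not labelled. One reading that fits this sample (an inference from the numbers, not something the output states) is: the E threshold, then the running total of sequences at or below that threshold, then the number of sequences in that row's own bin. The Python sketch below checks that reading against the first four rows; the function name is made up for this example.

```python
# (E threshold, running total, count in this bin), copied from the sample.
rows = [
    (10000, 7512, 2735),
    (6310, 4777, 1111),
    (3980, 3666, 786),
    (2510, 2880, 675),
]


def totals_are_consistent(rows):
    """True if each running total minus its bin count equals the next total."""
    return all(cur[1] - cur[2] == nxt[1] for cur, nxt in zip(rows, rows[1:]))


print(totals_are_consistent(rows))  # prints: True
```

On this reading, the bar on each row reflects the third column, at one "=" per 45 sequences.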

                                                         Smallest
                                                           Sum
                                                 High   Probability
Sequences producing High-scoring Segment Pairs:  Score     P(N)     N

(Note: the combination of letters and numbers that begins each line tells you two things. First, the database containing this information is abbreviated by two or three letters (e.g. sp in the first example). Second, there is a vertical line followed by a letter and five digits. This is the sequence's accession number. In gi entries the number after the vertical line is a gi number, and the accession number appears in parentheses (e.g. M62654). You will need all of this information when you want to retrieve the entire DNA or protein sequence.)

sp|P42212|GFP_AEQVI GREEN FLUORESCENT PR >pir||JQ15... 226 7.5e-25 1
gi|155663 (M62654) green-fluorescent protein [... 226 7.5e-25 1
gi|1490529 (U62636) GFPuv [Cloning vector pGFPu... 226 7.5e-25 1
gi|1619751 (U70496) soluble-modified red-shifte... 226 7.5e-25 1
gi|1619753 (U70497) soluble-modified blue fluor... 226 7.5e-25 1
gi|1669868 (U73901) green fluorescent protein m... 226 7.5e-25 1
gi|1354498 (U53602) green fluorescent protein-g... 226 2.6e-24 1
gnl|PID|e228230 (X96418) green fluorescent protein [... 222 2.8e-24 1
gi|1289375 (U43284) green fluorescent protein, ... 222 2.8e-24 1
gi|1373322 (U57608) Enhanced Green Fluorescent ... 222 2.8e-24 1
gi|887957 (U19276) green fluorescent protein [... 222 2.9e-24 1
gi|632521 (U19278) green fluorescent protein [... 222 3.6e-24 1
gi|632527 (U19280) green fluorescent protein [... 222 3.6e-24 1
gi|632530 (U19281) green fluorescent protein [... 222 3.6e-24 1
gi|1019891 (U36201) green fluorescent protein, ... 222 3.6e-24 1
gi|1019894 (U36202) green fluorescent protein, ... 222 3.6e-24 1
gi|1373316 (U57606) Enhanced Green Fluorescent ... 222 3.6e-24 1
gi|1373319 (U57607) Enhanced Green Fluorescent ... 222 3.6e-24 1
gi|1377915 (U55763) enhanced green fluorescent ... 222 3.6e-24 1
gi|1490533 (U62637) GFPuv [Cloning vector pBAD-... 221 3.9e-24 1
pir||S48693 fluorescent protein - hydromedusa (A... 218 1.0e-23 1
prf||2008181A green fluorescent protein [Aequorea ... 218 1.0e-23 1
gi|1277124 (U50974) gfp gene product [Cloning v... 217 1.4e-23 1
pir||S51330 green fluorescent protein - hydromed... 217 1.4e-23 1
pir||S51331 green fluorescent protein - hydromed... 213 5.2e-23 1
sp|Q04508|AMOB_NITEU AMMONIA MONOOXYGENASE >gi|806408 (L0... 49 0.21 2
gi|1173628 (U34746) glycine-rich protein [Phala... 57 0.22 1
sp|P21325|RT67_ECOLI RNA-DIRECTED DNA POLYMERASE FROM RET... 51 0.22 2
gi|598115 (M24363) open reading frame [Escheri... 51 0.22 2
pir||S16654 hypothetical protein - Escherichia coli 51 0.24 2
pir||A25029 outer membrane protein F - Escherich... 29 0.24 2
sp|Q06373|YPLC_CLOPE HYPOTHETICAL 55.7 KD PROTEIN IN PLC ... 46 0.31 2
sp|Q05443|LUM_BOVIN LUMICAN PRECURSOR (LUM) (KERATAN SUL... 45 0.36 2
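
If you want to pull the database abbreviation, accession number, score and probability out of lines like those in the listing above, a sketch along the following lines may help. It is written only against the line shapes seen in this sample and is not an official parser; the function name parse_hit_line is hypothetical. It assumes that a code in parentheses right after the identifier (as on the gi lines) is the accession number.

```python
import re


def parse_hit_line(line):
    """Split one line of the hit listing into its main parts."""
    tokens = line.split()
    # The last three columns are Score, P(N) and N.
    score, p_value, n = int(tokens[-3]), float(tokens[-2]), int(tokens[-1])
    # The identifier is the first token, with fields separated by '|'.
    fields = tokens[0].split("|")
    database = fields[0]
    ids = [field for field in fields[1:] if field]  # skip empty fields (pir||...)
    # Assumption: a parenthesised code in second position is the accession.
    match = re.fullmatch(r"\((\w+)\)", tokens[1])
    accession = match.group(1) if match else ids[0]
    return {"database": database, "accession": accession,
            "score": score, "p": p_value, "n": n}


hit = parse_hit_line("gi|155663 (M62654) green-fluorescent protein [... 226 7.5e-25 1")
print(hit["database"], hit["accession"], hit["score"], hit["p"])
```

Lines in other shapes may need extra handling, so check the result against the raw line before relying on it.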

>sp|P42212|GFP_AEQVI GREEN FLUORESCENT PROTEIN >pir||JQ1514 green-fluorescent
protein - hydromedusa (Aequorea victoria) >gi|155661 (M62653)
green-fluorescent protein [Aequorea victoria] >gi|894140 (U28417)
green fluorescent protein [Cloning vector p35S-GFP] >gi|1289498
(U50963) green fluorescent protein [synthetic construct]
>gi|1335938 (U54830) green fluorescent protein [synthetic
construct]
Length = 238

Score = 226 (102.2 bits), Expect = 7.5e-25, P = 7.5e-25
Identities = 44/44 (100%), Positives = 44/44 (100%)

Query: 1 MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTL 44
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTL
Sbjct: 1 MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTL 44

>gi|155663 (M62654) green-fluorescent protein [Aequorea victoria]
Length = 238

Score = 226 (102.2 bits), Expect = 7.5e-25, P = 7.5e-25
Identities = 44/44 (100%), Positives = 44/44 (100%)

Statistics:
Query         Expected        Observed          HSPs        HSPs
Frame  MatID  High Score      High Score        Reportable  Reported
+0     0      51 (23.1 bits)  226 (102.2 bits)  659         20

Query         Neighborhd  Word     Excluded  Failed      Successful  Overlaps
Frame  MatID  Words       Hits     Hits      Extensions  Extensions  Excluded
+0     0      780         6632756  1272902   5308205     51649       146

Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
Release date: December 20, 1996
Posted date: 12:26 PM EST Dec 20, 1996
# of letters in database: 66,380,649
# of sequences in database: 234,430
# of database sequences satisfying E: 339
No. of states in DFA: 319 (32 KB)
Total size of DFA: 40 KB (64 KB)
Time to generate neighborhood: 0.00u 0.01s 0.01t Real: 00:00:00
No. of processors used: 4
Time to search database: 14.16u 0.27s 14.43t Real: 00:00:04
Total cpu time: 14.19u 0.32s 14.51t Real: 00:00:04
