Contents
NCBI SNP FTP SITE
Definition of Directories and subdirectories
Directories
- /bin software tools for using ASN.1 binaries
- /specs ASN.1 and XML specifications for dbSNP docsum data structure
- /ss_fasta fasta format for all submissions in dbSNP
- /organisms/{organism} organism-specific data in multiple report formats
- Top-level Organism-specific Directories:
- human
- mouse
- rat
- chimpanzee
- plasmodium
Subdirectories
- --subdirectories of /organisms/{organism} by format--
- /ASN1_bin RefSNP docsum in ASN.1 binary format.
- /ASN1_flat RefSNP docsum from ASN.1 binary in human-readable flatfile format
- /chr_rpts RefSNPs per chromosome sorted by chromosome location
- /rs_fasta fasta format for non-redundant refSNP clusters by chromosome
- /XML submission format and XML exchange format for dbSNP refSNP clusters, including:
- submissions (ss#'s) in cluster, mapping information, gene function information computed from analysis of reference genome sequence, snp-links, accessions, submitter comments, comments on meth-failure, submitter defined gene contexts, flanking sequence and alleles, population definitions, and allele frequencies. The XML DTD is available in the /specs directory (above).
- /genome_reports Summary reports on SNPs in genes, SNP density on the genome, and intervals of genome sequence with little or no SNP content.
- /database database dump of all organism-specific tables
FASTA format and data structure
ss record
deflines for FASTA records start with ">"; the fields of the defline below are, in order:
Field Data
gnl object-type=general
dbSNP database name
ss271 unique id
ss= ss#
pos= offset of SNP in sequence
len= total length of sequence
handle= Submitter
subid= SubmitterSNPID
taxid= organism
mol= molecule type
class= class of variation
alleles= list of alleles
defline: >gnl|dbSNP|ss271 ss=271|pos=51|len=101|handle="DEBNICK"|subid="lp03022"|taxid=9606|mol="Genomic"|class=1|alleles="G/A"
5' sequence: CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT
variation: R
3' sequence: TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT
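The defline layout above can be split into its fields with a few lines of Python. This is a minimal sketch, not an official dbSNP parser: the function name `parse_defline` and the dictionary keys `object_type`, `database`, and `id` are hypothetical, and the sketch assumes that quoted values never contain a "|" character.

```python
def parse_defline(defline: str) -> dict:
    """Split a dbSNP FASTA defline into a dictionary of fields."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    # The sequence identifier and the attribute list are separated by a space.
    seq_id, _, attr_text = defline[1:].strip().partition(" ")
    object_type, database, unique_id = seq_id.split("|")
    # Hypothetical key names for the three parts of the identifier.
    record = {"object_type": object_type, "database": database, "id": unique_id}
    # Assumption: attributes are key=value pairs separated by "|".
    for field in attr_text.split("|"):
        if not field:
            continue
        key, _, value = field.partition("=")
        value = value.strip('"')
        record[key] = int(value) if value.isdigit() else value
    return record


EXAMPLE = ('>gnl|dbSNP|ss271 ss=271|pos=51|len=101|handle="DEBNICK"|'
           'subid="lp03022"|taxid=9606|mol="Genomic"|class=1|alleles="G/A"')

record = parse_defline(EXAMPLE)
print(record["id"], record["pos"], record["len"], record["alleles"])
```

The same function also accepts the rs and no variation deflines shown in this document, because it does not depend on which attributes are present or on their order.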
rs record
deflines for FASTA records start with ">"; the fields of the defline below are, in order:
Field Data
gnl object-type=general
dbSNP database name
rs271 unique id
rs= rs#
pos= offset of SNP in sequence
len= total length of sequence
taxid= organism
mol= molecule type
class= class of variation
alleles= list of alleles
source= source of sequence
defline: >gnl|dbSNP|rs271 rs=271|pos=51|len=101|taxid=9606|mol="Genomic"|class=1|alleles="G/A"|source="dbSNP"
5' sequence: CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT
variation: R
3' sequence: TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT
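The relationship between the flanking sequences, the variation character, and the pos and len fields can be checked with a short Python sketch. It assumes that pos is the 1-based offset of the variation character in the concatenated sequence and that the variation character is the IUPAC ambiguity code for the listed alleles; both assumptions are consistent with the example above but are not stated by this document. The names `IUPAC_CODES` and `iupac_code` are hypothetical.

```python
# IUPAC nucleotide ambiguity codes for sets of two or more bases.
IUPAC_CODES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
_BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def iupac_code(alleles: str) -> str:
    """Return the ambiguity code for an allele list such as "G/A".

    Only lists of two or more single-base alleles are handled; anything
    else (for example an insertion/deletion allele) raises ValueError.
    """
    bases = frozenset(alleles.upper().split("/"))
    try:
        return _BASES_TO_CODE[bases]
    except KeyError:
        raise ValueError(f"no single-letter code for alleles {alleles!r}")


five_prime = "CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT"
three_prime = "TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT"
pos, length = 51, 101  # values taken from the example defline

sequence = five_prime + iupac_code("G/A") + three_prime
print(len(sequence) == length)   # total length matches len=
print(sequence[pos - 1])         # character at the 1-based SNP offset
```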
no variation record
deflines for FASTA records start with ">"; the fields of the defline below are, in order:
Field Data
gnl object-type=general
dbSNP database name
rs16598 unique id
type= novar
rs= rs#
len= total length of sequence
taxid= organism
defline: >gnl|dbSNP|rs16598 type="novar"|rs=16598|len=241|taxid=9606
sequence: cacctccaacacccttcTTTTCTTTGAACAAGATTTTTCCTTAATTCCCCAATACTCCCT
TTGAATATATGATTTTAGCCACCATCATAGCGAATTGCATCGTCCTCGCACTGGAGCAGC
ATCTGCCTGATGATGACAAGACCCCGATGTCTGAACGGCTGGTGAGTGATGTCTTTTCTC
AGGGTCTTCTCCTTGGCTTTAGCAGGACATTAATTTTTGGGGGAGTggagcagggcacag
Chromosome Report
Chromosome reports provide an ordered list of RefSNPs in approximate
chromosome coordinates (the same coordinate system used for the
NCBI genome MapViewer). Each line gives the following information
for a single RefSNP in tab-delimited columns:
Column Data
1 RefSNP id (rs#)
2 mapweight where
1 = unmapped
2 = mapped to single position in genome
3 = mapped to 2 positions on a single chromosome
4 = mapped to 3-10 positions in genome (possible paralog hits)
5 = mapped to >10 positions in genome
3 snp_type where
0 = not withdrawn
1 = withdrawn. A RefSNP can be withdrawn for several reasons; the
withdrawn status is fully defined in the ASN.1, flatfile,
and XML descriptions of the RefSNP. See /specs/docsum_2005.asn
for the full definition of snp-type values.
4 total number of chromosomes hit by this RefSNP during mapping
5 total number of contigs hit by this RefSNP during mapping
6 total number of hits to genome by this RefSNP during mapping
7 chromosome for this hit to genome
8 contig accession for this hit to genome
9 version number of contig accession for this hit to genome
10 contig ID for this hit to genome
11 position of RefSNP in contig coordinates
12 position of RefSNP in chromosome coordinates (used to order report)
Locations are specified in the NCBI sequence location convention, where:
x, a single number indicates a feature at base position x
x..y, a feature that spans from x to y inclusive
x^y, a feature that is inserted between bases x and y
13 genes at this same position on the chromosome
14 average heterozygosity of this RefSNP
15 standard error of average heterozygosity
16 maximum reported probability that the RefSNP is real (for computationally
predicted submissions)
17 validated status
0 = no validation information
1 = cluster has 2+ submissions, with 1+ submission assayed
with a non-computational method
2 = at least one subsnp in cluster has frequency data submitted
3 = non-computational method in cluster and frequency data present
4 = at least one subsnp in cluster has been experimentally
validated by submitter
For other validation status values, please see:
ftp://ftp.ncbi.nih.gov/snp/database/organism_shared_data/SnpValidationCode.bcp.gz
18 genotypes available in dbSNP for this RefSNP
1 = yes, 0 = no
19 linkout available to submitter website for further data on the RefSNP
1 = yes, 0 = no
20 dbSNP build ID when the refSNP was first created (i.e. create date)
21 dbSNP build ID of most recent change to the refSNP cluster (update date)
where dates are reckoned in dbSNP build IDs
22 mapped to reference or alternate assembly (e.g. Celera)
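The column layout above can be read with a short Python sketch. Everything named here is an assumption made for illustration: the labels in `COLUMNS` are hypothetical field names, not identifiers used by dbSNP, and the sketch assumes each data line holds exactly 22 tab-delimited fields. Real report files may carry header lines that this sketch does not handle.

```python
# Hypothetical labels for the 22 columns described above, in order.
COLUMNS = [
    "rs_id", "mapweight", "snp_type", "chr_hits", "contig_hits", "total_hits",
    "chromosome", "contig_accession", "contig_version", "contig_id",
    "contig_position", "chr_position", "genes", "avg_heterozygosity",
    "se_heterozygosity", "max_probability", "validated",
    "genotypes_available", "linkout_available", "create_build",
    "update_build", "assembly",
]

# Meaning of the mapweight codes listed under column 2.
MAPWEIGHT = {
    "1": "unmapped",
    "2": "mapped to single position in genome",
    "3": "mapped to 2 positions on a single chromosome",
    "4": "mapped to 3-10 positions in genome (possible paralog hits)",
    "5": "mapped to >10 positions in genome",
}


def parse_location(text: str) -> tuple:
    """Interpret a location written as x, x..y, or x^y."""
    text = text.strip()
    if "^" in text:
        start, end = text.split("^")
        return ("insertion", int(start), int(end))
    if ".." in text:
        start, end = text.split("..")
        return ("range", int(start), int(end))
    return ("point", int(text), int(text))


def parse_row(line: str) -> dict:
    """Map one tab-delimited report line onto the column labels."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["mapweight_text"] = MAPWEIGHT.get(row["mapweight"], "unknown")
    return row


# Illustrative location strings, not taken from a real report.
for example in ("100", "100..105", "100^101"):
    print(example, parse_location(example))
```

Values are kept as strings in `parse_row` because some columns may be empty for unmapped RefSNPs; callers can pass a position column to `parse_location` once they know it is populated.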








