NCBI SNP FTP SITE

Definition of Directories and subdirectories

Directories

  • /bin software tools for using ASN.1 binaries
  • /specs ASN.1 and XML specifications for dbSNP docsum data structure
  • /ss_fasta FASTA format for all submissions in dbSNP
  • /organisms/{organism} organism-specific data in multiple report formats
    • Top-level Organism-specific Directories:
      • human
      • mouse
      • rat
      • chimpanzee
      • plasmodium

Subdirectories

  • --subdirectories of /organisms/{organism} by format--
  • /ASN1_bin RefSNP docsum in ASN.1 binary format
  • /ASN1_flat RefSNP docsum from ASN.1 binary in human-readable flatfile format
  • /chr_rpts RefSNPs per chromosome sorted by chromosome location
  • /rs_fasta FASTA format for non-redundant refSNP clusters by chromosome
  • /XML submission format and XML exchange format for dbSNP refSNP clusters, including:
    • submissions (ss#'s) in cluster, mapping information, gene function information computed from analysis of reference genome sequence, snp-links, accessions, submitter comments, comments on meth-failure, submitter-defined gene contexts, flanking sequence and alleles, population definitions, and allele frequencies
    • XML DTD available in /specs directory (above)
  • /genome_reports Summary reports on SNPs in genes, SNP density on the genome, and intervals of genome sequence with little or no SNP content
  • /database database dump of all organism-specific tables

FASTA format and data structure

ss record

             deflines for FASTA records start with ">"
             | object-type=general
             | |                             total length
             | |   database name             of sequence                                                               list of
             | |    |              offset of SNP |    Submitter                       organism    molecule   class of   alleles
             | |    |unique id  ss# in sequence  |        |         SubmitterSNPID        |         type     variation      |
             | |    |    |       |       |       |        |               |               |           |          |          |
    defline: >gnl|dbSNP|ss271 ss=271|pos=51|len=101|handle="DEBNICK"|subid="lp03022"|taxid=9606|mol="Genomic"|class=1|alleles="G/A"
5' sequence: CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT 
  variation: R
3' sequence: TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT 
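
The "variation" line in the ss271 example appears to be the IUPAC ambiguity code for the alleles listed in the defline (R stands for A or G). The sketch below shows that relationship and checks it against the pos= and len= fields. The helper name iupac_code is hypothetical, and the reading of pos= as a 1-based offset is an assumption drawn from this one example (50 flanking bases, then the variation at position 51).

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases each code stands for.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
IUPAC = {frozenset(bases): code for bases, code in _CODES.items()}


def iupac_code(alleles):
    """Return the IUPAC code for an allele string such as "G/A".

    Hypothetical helper: returns None when the alleles are not plain
    single bases (for example an insertion/deletion allele).
    """
    return IUPAC.get(frozenset(alleles.upper().split("/")))


# Flanking sequences copied from the ss271 example above.
five_prime = "CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT"
three_prime = "TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT"

record = five_prime + iupac_code("G/A") + three_prime
pos, length = 51, 101  # pos= and len= values from the defline

assert len(record) == length
assert record[pos - 1] == "R"  # assumes pos= is a 1-based offset
```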

rs record

             deflines for FASTA records start with ">"
             | object-type=general
             | |                            total length
             | |   database name             of sequence                                list of
             | |    |               offset of SNP|    organism    molecule   class of   alleles     source of
             | |    |unique id  rs#  in sequence |        |         type     variation     |         sequence
             | |    |    |       |       |       |        |          |          |          |            |
    defline: >gnl|dbSNP|rs271 rs=271|pos=51|len=101|taxid=9606|mol="Genomic"|class=1|alleles="G/A"|source="dbSNP"
5' sequence: CTGCATCACATGTACTGATTCTGTCCATTGGAACAGAGATGATGACTGGT 
  variation: R
3' sequence: TTACTAAACCCTGAGCCCTGGTGTTTCTGTTGATAGGGGGTTGCATTGAT 

no variation record

             deflines for FASTA records start with ">"
             | object-type=general
             | |                                  total length
             | |   database name                  of sequence                                
             | |    |                                   |      organism      
             | |    |unique id    novar       rs#       |          |      
             | |    |    |          |          |        |          |     
    defline: >gnl|dbSNP|rs16598 type="novar"|rs=16598|len=241|taxid=9606
   sequence: cacctccaacacccttcTTTTCTTTGAACAAGATTTTTCCTTAATTCCCCAATACTCCCT 
             TTGAATATATGATTTTAGCCACCATCATAGCGAATTGCATCGTCCTCGCACTGGAGCAGC 
             ATCTGCCTGATGATGACAAGACCCCGATGTCTGAACGGCTGGTGAGTGATGTCTTTTCTC 
             AGGGTCTTCTCCTTGGCTTTAGCAGGACATTAATTTTTGGGGGAGTggagcagggcacag 
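
The three deflines shown above share one layout: a "|"-separated identifier (object type, database name, unique id), a space, then "|"-separated key=value fields. A minimal sketch of a parser for that layout follows. The function name parse_defline and the output key names object_type, database, and id are hypothetical, and the code assumes that quoted values never contain a "|" character, which holds for the examples here but is not stated in the format description.

```python
def parse_defline(defline):
    """Split a dbSNP FASTA defline into a dict of its fields."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")

    # ">gnl|dbSNP|ss271 ss=271|pos=51|..." -> identifier, then attributes
    seq_id, _, attr_text = defline[1:].strip().partition(" ")
    object_type, database, unique_id = seq_id.split("|")
    fields = {"object_type": object_type, "database": database, "id": unique_id}

    for item in attr_text.split("|"):
        if not item:
            continue
        key, _, value = item.partition("=")
        if value.startswith('"'):
            fields[key] = value.strip('"')   # quoted values stay strings
        elif value.isdigit():
            fields[key] = int(value)         # unquoted digits become ints
        else:
            fields[key] = value
    return fields


ss = parse_defline(
    '>gnl|dbSNP|ss271 ss=271|pos=51|len=101|handle="DEBNICK"|subid="lp03022"'
    '|taxid=9606|mol="Genomic"|class=1|alleles="G/A"'
)
novar = parse_defline('>gnl|dbSNP|rs16598 type="novar"|rs=16598|len=241|taxid=9606')

print(ss["handle"], ss["pos"], ss["alleles"])
print(novar.get("type"), novar["len"])
```

A record can then be recognized as a no-variation record by checking whether the parsed fields contain type equal to "novar".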

Chromosome Report

Chromosome reports provide an ordered list of RefSNPs in approximate
chromosome coordinates (the same coordinate system used for the
NCBI genome MapViewer). Each line gives the following information
for a single RefSNP in tab-delimited columns:

Column   Data
  1      RefSNP id (rs#)
  2      mapweight where
            1 = unmapped
            2 = mapped to single position in genome
            3 = mapped to 2 positions on a single chromosome
            4 = mapped to 3-10 positions in genome (possible paralog hits)
            5 = mapped to >10 positions in genome
  3      snp_type where
            0 = not withdrawn
            1 = withdrawn. A RefSNP can be withdrawn for several reasons;
                the withdrawn status is fully defined in the ASN.1, flatfile,
                and XML descriptions of the RefSNP. See /specs/docsum_2005.asn
                for the full definition of snp-type values.
  4      total number of chromosomes hit by this RefSNP during mapping
  5      total number of contigs hit by this RefSNP during mapping
  6      total number of hits to genome by this RefSNP during mapping
  7      chromosome for this hit to genome
  8      contig accession for this hit to genome
  9      version number of contig accession for this hit to genome
 10      contig ID for this hit to genome
 11      position of RefSNP in contig coordinates
 12      position of RefSNP in chromosome coordinates (used to order report)
           Locations are specified in NCBI sequence location convention, where:
               x, a single number indicates a feature at base position x
            x..y, a feature that spans from x to y inclusive
             x^y, a feature that is inserted between bases x and y
 13      genes at this same position on the chromosome
 14      average heterozygosity of this RefSNP
 15      standard error of average heterozygosity
 16      maximum reported probability that RefSNP is real. (For computationally-
             predicted submissions)
 17      validated status
             0 = no validation information
             1 = cluster has 2+ submissions, with 1+ submission assayed 
                 with a non-computational method
             2 = at least one subsnp in cluster has frequency data submitted
             3 = non-computational method in cluster and frequency data present
             4 = at least one subsnp in cluster has been experimentally 
                 validated by submitter
             For other validation status values, see:
             ftp://ftp.ncbi.nih.gov/snp/database/organism_shared_data/SnpValidationCode.bcp.gz
 18      genotypes available in dbSNP for this RefSNP
             1 = yes, 0 = no
 19      linkout available to submitter website for further data on the RefSNP
             1 = yes, 0 = no
 20      dbSNP build ID when the refSNP was first created (i.e. create date)
 21      dbSNP build ID of most recent change to the refSNP cluster (update date)
             where dates are reckoned in dbSNP build IDs 
 22      mapped to reference or alternate assembly (e.g. Celera)
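
The column layout above can be read with a few lines of code. The sketch below splits one tab-delimited report line into the 22 columns and interprets the three location forms (x, x..y, x^y) described for column 12. All column names and both function names are hypothetical labels chosen for this example; the report itself numbers its columns and does not name them. Values are kept as strings because the text does not say how empty fields or header lines are written in the actual files.

```python
# Hypothetical names for the 22 documented columns, in order.
COLUMNS = [
    "rs_id", "mapweight", "snp_type", "chromosome_hits", "contig_hits",
    "genome_hits", "chromosome", "contig_accession", "contig_version",
    "contig_id", "contig_position", "chromosome_position", "genes",
    "avg_heterozygosity", "heterozygosity_se", "max_probability",
    "validated", "has_genotypes", "has_linkout", "create_build",
    "update_build", "assembly",
]


def parse_report_line(line):
    """Map one tab-delimited chromosome report line to a dict of strings."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def parse_location(location):
    """Interpret an NCBI-style location as (kind, start, end).

    "x"    -> ("point", x, x)       feature at base position x
    "x..y" -> ("range", x, y)       feature spanning x to y inclusive
    "x^y"  -> ("insertion", x, y)   feature inserted between bases x and y
    """
    for separator, kind in (("..", "range"), ("^", "insertion")):
        if separator in location:
            start, end = location.split(separator)
            return kind, int(start), int(end)
    position = int(location)
    return "point", position, position


# Illustrative coordinates only, not taken from a real report.
for text in ("1500", "1500..1504", "1500^1501"):
    print(text, "->", parse_location(text))
```

Lines that do not have exactly 22 tab-separated values raise an error, so any header lines in a real report would need to be skipped before calling parse_report_line.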

SNPDiscovery/NcbiSNP (last edited 2012-03-17 17:55:13 by localhost)









