SNPnexus

Barts Cancer Institute

  • Genomic Mapping
  • Gene/Protein Consequences
  • Effect on Protein Function
  • HapMap Population Data
  • Regulatory Elements
  • Conservation
  • Phenotype & Disease Association
  • Structural Variations

SNPnexus was designed to simplify and assist in the selection of functionally relevant Single Nucleotide Polymorphisms (SNPs) for large-scale genotyping studies of multifactorial disorders. The tool was upgraded in 2011 to support multiple nucleotide substitutions and insertions/deletions (indels), covering a wider range of variation data.

SNPnexus allows single queries using dbSNP identifiers or chromosomal regions for annotating known variants. Users can also submit novel in-house SNPs/indels using genomic coordinates on clones, contigs and chromosomes. For practical purposes, SNPnexus also accepts batch queries of SNP data given as dbSNP identifiers or genomic coordinates. SNPnexus is updated regularly to stay synchronized with the UCSC human genome annotation database, and it provides the scientific community with a friendly web interface to compute the following data:

1. Genomic Mapping and additional annotations

SNPnexus provides genomic coordinates for the queried SNPs/indels in terms of their physical (on chromosome and contig) and cytogenetic positions. When novel in-house SNPs are submitted, the tool checks whether they overlap with existing publicly available known SNPs and subsequently provides links, if any, to dbSNP (Sherry et al., 2001) and HapMap populations (The International HapMap Consortium, 2007).

2. Gene/Protein Consequences

A wide range of possible functional consequences is computed on the major gene annotation systems: NCBI RefSeq (Pruitt et al., 2007), UCSC Known Genes (Hsu et al., 2006), Ensembl (Hubbard et al., 2007), Vega (Wilming et al., 2008), AceView (Thierry-Mieg and Thierry-Mieg, 2006), CCDS (Pruitt et al., 2009) and H-Invitational (Yamasaki et al., 2010). The predicted functional effect falls into one of the following consequences:

Transcript Type | Predicted Function                | Description
Coding          | coding                            | In coding region
                | intronic                          | In intron
                | intronic (splice_site)            | Within 2 bp of an intron/exon junction
                | 5'UTR                             | In 5' untranslated region
                | 3'UTR                             | In 3' untranslated region
                | 5-upstream                        | Within 2 kb upstream of the 5' end of a transcript
                | 3-downstream                      | Within 2 kb downstream of the 3' end of a transcript
Non-Coding      | non-coding                        | In exon
                | non-coding intronic               | In intron
                | non-coding intronic (splice_site) | Within 2 bp of an intron/exon junction
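
The categories in the table above can be illustrated with a minimal Python sketch. This is not the SNPnexus implementation: the function name `predicted_function`, the 1-based inclusive coordinates, the exon/CDS representation and the `None` return value for positions that match no category are all assumptions made for illustration.

```python
SPLICE_WINDOW = 2   # bp from an intron/exon junction (from the table)
FLANK = 2000        # 2 kb upstream/downstream window (from the table)


def predicted_function(pos, exons, cds=None, strand="+"):
    """Classify a genomic position against one transcript.

    exons  : sorted list of (start, end), 1-based inclusive (assumed layout)
    cds    : (start, end) of the coding region, or None for a non-coding transcript
    strand : "+" or "-"
    Returns a label from the table, or None if no category applies.
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]

    # Outside the transcript: the table lists flanking labels for coding transcripts only.
    if pos < tx_start or pos > tx_end:
        before = pos < tx_start
        dist = tx_start - pos if before else pos - tx_end
        if cds is None or dist > FLANK:
            return None
        upstream = before if strand == "+" else not before
        return "5-upstream" if upstream else "3-downstream"

    prefix = "" if cds is not None else "non-coding "

    # Inside the transcript but not in an exon: intronic, possibly a splice site.
    if not any(s <= pos <= e for s, e in exons):
        dist = min(min(abs(pos - s), abs(pos - e)) for s, e in exons)
        label = prefix + "intronic"
        return label + " (splice_site)" if dist <= SPLICE_WINDOW else label

    # Exonic positions.
    if cds is None:
        return "non-coding"
    if cds[0] <= pos <= cds[1]:
        return "coding"
    five_prime = pos < cds[0] if strand == "+" else pos > cds[1]
    return "5'UTR" if five_prime else "3'UTR"


# Hypothetical transcript used only for this example.
exons = [(1000, 1200), (1500, 1800), (2500, 3000)]
cds = (1100, 2700)
print(predicted_function(1202, exons, cds))   # position 2 bp into the first intron
```

The intronic branch also yields the distance to the nearest exon boundary (`dist`), the same quantity the tool reports for intronic SNPs.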

For intronic SNPs, the distance to the splice site is reported. For coding variants, the coordinates of the first affected nucleotide position within the cDNA and CDS, as well as the resulting first amino acid position in the peptide chain, are reported.

Since coding variants are of special interest, we provide further information about the mutation type, such as whether a single substitution is synonymous or non-synonymous. We also report whether a non-synonymous substitution results in an immediate stop-codon gain or loss. When an insertion, deletion or block substitution occurs within a coding region, we report it as a frameshift if the total number of nucleotides to be replaced is not a multiple of 3, in which case we also report any early-stop or stop-loss scenario. If the total number of nucleotides to be replaced is a multiple of 3, we report it as a peptide shift. In all these cases, we show the change of amino acids in the reported region. The reference/altered protein sequence can be found in the resultant Excel file.

Transcripts with an incomplete ORF (with a missing or premature stop codon) and incomplete proteins are identified in the "note" column (representing the effect of the mutation) by a "*" symbol. Unrecognisable alleles containing characters other than IUPAC base characters and "-" are identified in the "note" column as "Unknown". The predicted function for these cases is based only on the SNP position on the gene.
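
The rules above for coding variants can be sketched in Python as follows. This is an illustrative sketch, not the SNPnexus code: the function names are hypothetical, the standard genetic code is assumed, and "number of nucleotides to be replaced" is interpreted here as the net length change between the reference and altered alleles, which is an assumption.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a single-nucleotide substitution from the codons before and after."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-synonymous (stop gain)"
    if ref_aa == "*":
        return "non-synonymous (stop loss)"
    return "non-synonymous"


def classify_length_change(ref_allele, alt_allele):
    """Classify an insertion/deletion/block substitution within a coding region.

    "-" denotes an empty allele. Assumption: the reading frame is judged by
    the net length change between the two alleles.
    """
    ref_len = len(ref_allele.replace("-", ""))
    alt_len = len(alt_allele.replace("-", ""))
    return "frameshift" if (alt_len - ref_len) % 3 else "peptide shift"


print(classify_substitution("TGG", "TGA"))   # Trp codon changed to a stop codon
print(classify_length_change("ATG", "-"))    # 3-nucleotide deletion keeps the frame
```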

Users can also download all the results in Excel format, which includes an additional column containing the protein sequences before and after each substitution, separated by '|'.
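
As a small illustration of working with the '|'-separated column described above, the following Python sketch splits one cell into the two protein sequences and locates the first residue that differs. The function name `first_difference` and the example sequences are hypothetical; only the '|' separator comes from the description above.

```python
def first_difference(cell):
    """Return the 1-based position of the first differing residue, or None.

    cell: "<sequence before>|<sequence after>" as a single string.
    """
    before, after = cell.split("|", 1)
    for position, (a, b) in enumerate(zip(before, after), start=1):
        if a != b:
            return position
    # Identical over the shared length: a difference exists only if lengths differ.
    if len(before) != len(after):
        return min(len(before), len(after)) + 1
    return None


print(first_difference("MKTAYIAK|MKTVYIAK"))  # made-up sequences differing at residue 4
```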

3. Effect on Protein Function

For non-synonymous single amino acid substitutions, we provide the predicted effect on protein function (Tolerated or Damaging) based on the SIFT prediction (Kumar et al., 2009). Predictions are only shown for complete Ensembl proteins. No predictions are shown for non-synonymous substitutions resulting in stop-gain or stop-loss, as these fundamentally change the protein sequence.

4. HapMap Population Data

For known SNPs, the tool provides related genotypes and allele frequency estimates retrieved from the HapMap population data provided by the International HapMap Project. On the hg18 assembly, the following four populations are supported:

  • Yoruba in Ibadan, Nigeria (YRI)
  • Japanese in Tokyo, Japan (JPT)
  • Han Chinese in Beijing, China (CHB)
  • Utah residents with ancestry from northern and western Europe (CEU)

On the hg19 assembly, seven more populations are supported:

  • African Ancestry in SouthWestern US (ASW)
  • Chinese Ancestry in Metropolitan Denver, US (CHD)
  • Gujarati Indians in Houston, Texas (GIH)
  • Luhya in Webuye, Kenya (LWK)
  • Mexican Ancestry in Los Angeles, US (MEX)
  • Masai in Kinyawa, Kenya (MKK)
  • Toscani in Italia (TSI)

The HapMap data provided by SNPnexus is based on the combined Phase II and III data from the International HapMap Project, release 27.
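
To show how allele frequency estimates relate to genotype data of the kind described above, here is a minimal Python sketch that derives allele frequencies from genotype counts within one population. The function name `allele_frequencies`, the input layout and the counts are assumptions for illustration; they are not HapMap values or the SNPnexus output format.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Estimate allele frequencies from diploid genotype counts.

    genotype_counts: mapping such as {"AA": 25, "AG": 50, "GG": 25},
    where each key is a two-allele genotype and each value is the number
    of individuals carrying it.
    """
    allele_counts = Counter()
    for genotype, n_individuals in genotype_counts.items():
        for allele in genotype:          # each individual contributes two alleles
            allele_counts[allele] += n_individuals
    total = sum(allele_counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in allele_counts.items()}


# Made-up counts for a single population.
print(allele_frequencies({"AA": 25, "AG": 50, "GG": 25}))
```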

5. Regulatory Elements

Queried SNPs can be checked for overlap with the following regulatory elements:

  • Transcription Factor Binding Sites conserved in the human/mouse/rat alignment.
  • Exons, Promoters and CpG windows predicted by the FirstEF (First Exon Finder) program (Davuluri et al., 2001). Available only for the hg18 assembly.
  • Published microRNA sequences available from miRBase (Kozomara and Griffiths-Jones, 2011) release 18.
  • Vista Enhancers (Pennacchio et al., 2006): Non-coding distant-acting transcriptional enhancers in the human genome identified as conserved in human, mouse, and rat.
  • CpG Islands (Bird, 1986)
  • Conserved mammalian microRNA regulatory sites in the 3' UTR regions of RefSeq Genes, as predicted by TargetScanS (Lewis et al., 2003)
  • microRNAs from miRBase, snoRNAs and scaRNAs from snoRNABase (Lestrade and Weber, 2006)
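
The overlap check against annotated elements listed above can be pictured as a simple interval comparison. The Python sketch below is an assumption-laden illustration: the function name `overlapping_elements`, the tuple layout, the 1-based inclusive coordinates and the element names are all hypothetical, and a production tool would typically use an indexed lookup rather than a linear scan.

```python
def overlapping_elements(chrom, start, end, elements):
    """Return the names of annotated elements that overlap a queried variant.

    chrom, start, end : variant location, 1-based inclusive (assumed convention)
    elements          : iterable of (chrom, start, end, name) tuples
    """
    return [
        name
        for el_chrom, el_start, el_end, name in elements
        # Two closed intervals overlap when each starts before the other ends.
        if el_chrom == chrom and el_start <= end and start <= el_end
    ]


# Hypothetical annotation records, for illustration only.
elements = [
    ("chr1", 1000, 1500, "CpG island A"),
    ("chr1", 1400, 1450, "TFBS B"),
    ("chr2", 1000, 1500, "enhancer C"),
]
print(overlapping_elements("chr1", 1420, 1420, elements))
```

The same comparison applies to any interval-based annotation, such as structural variation regions.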

6. Conservation

SNPnexus shows the estimated probability that a variant belongs to a conserved region, based on the multiple alignments of 44/46 vertebrate species, computed with the phastCons method from the PHAST package.

7. Phenotype & Disease Association

SNPnexus retrieves the connection between queried SNPs/indels and the following phenotype & disease association databases:

  • GAD, The Genetic Association Database. GAD is an archive of published scientific papers on human genetic association studies of complex diseases and disorders. When investigating the role of variants, the user can mine GAD and extract any information related to the gene of interest (Becker et al., 2004).
  • COSMIC, The Catalogue Of Somatic Mutations In Cancer. COSMIC is an online database of somatic mutations found in human cancers. Users can investigate the relationship between the queried variants and cancer phenotypes (Forbes et al., 2011).
  • GWAS Catalogue, The Catalogue of Published Genome-Wide Association Studies. SNPs identified by published GWAS are collected in this catalogue which represents a useful resource for mining SNP-trait associations (Hindorff et al., 2009).

8. Structural Variations

SNPnexus checks for any overlap with putative copy number polymorphisms (CNPs), insertions/deletions (InDels), inversions and inversion breakpoints determined by various methods, as annotated by the Database of Genomic Variants (DGV) via UCSC.


References

Sherry,S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861.

Pruitt,K.D. et al. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.

Hsu,F. et al. (2006) The UCSC Known Genes. Bioinformatics, 22, 1036–1046.

Hubbard,T.J. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.

Wilming,L.G. et al. (2008) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res., 36, D753–D760.

Thierry-Mieg,D. and Thierry-Mieg,J. (2006) AceView: a comprehensive cDNA supported gene and transcripts annotation. Genome Biol., 7 (Suppl. 1), S12.

Pruitt,K.D. et al. (2009) The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19, 1316–1323.

Yamasaki,C. et al. (2010) H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res., 38, D626–D632.

Kumar,P. et al. (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc., 4, 1073–1081.

Davuluri,R.V. et al. (2001) Computational identification of promoters and first exons in the human genome. Nat. Genet., 29, 412–417.

Kozomara,A. and Griffiths-Jones,S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res., 39, D152–D157.

Pennacchio,L.A. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499–502.

Lewis,B.P. et al. (2003) Prediction of mammalian microRNA targets. Cell, 115, 787–798.

Bird,A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature, 321, 209–213.

Lestrade,L. and Weber,M.J. (2006) snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res., 34, D158–D162.

Becker,K.G. et al. (2004) The genetic association database. Nat. Genet., 36, 431–432.

Forbes,S.A. et al. (2011) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res., 39 (Suppl. 1), D945–D950.

Hindorff,L.A. et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA, 106, 9362–9367.

Copyright © 2008
Barts Cancer Institute