SNPnexus

Barts Cancer Institute


SNPnexus was designed to simplify and assist in the selection of functionally relevant Single Nucleotide Polymorphisms (SNPs) for large-scale genotyping studies of multifactorial disorders. The tool was upgraded in 2011 to provide additional support for multiple nucleotide substitutions and insertions/deletions (indels), covering a wider range of variation data. A further upgrade in December 2017 provided options for variant annotation in three human genome reference systems, along with new annotation categories. For the latest upgrade, in December 2019, we released a completely redesigned version, both in its internal architecture and in its user interface. This upgrade maintains the basic features but also adds new functionalities in the results section, as well as two new annotation categories. From this version on, we are focusing our update efforts on the two latest human genome assemblies (GRCh37/hg19 and GRCh38/hg38). For variant annotation using the NCBI36/hg18 human assembly, it is still possible to use the legacy version of SNPnexus.

SNPnexus allows single queries using dbSNP identifiers or chromosomal regions for annotating known variants. Users can also provide novel in-house SNPs/indels using genomic coordinates on clones, contigs and chromosomes. For practical purposes, SNPnexus allows batch queries comprising SNP data using dbSNP identifiers or genomic coordinates. SNPnexus is updated on a regular basis to stay synchronized with the external annotation databases. It provides the scientific community with a friendly web interface to extract the broadest annotations for their query variants, all from a single location.

To read more about the input and output formats and the options available in the filtering system, please consult the User Guide. To learn how to use the tool with an example, please go to the Example page.

Please note that SNPnexus uses data produced by external independent tools and databases to produce comprehensive annotation of query variants submitted by users. As such, it is prudent for users to use their own discretion when interpreting findings reported in SNPnexus. If necessary, please consult the individual resources and related peer-reviewed publications to check the viability of the result/data provided by the tools/databases.

SNPnexus provides genomic coordinates for the queried SNPs/indels in terms of their physical (on chromosome and contig) and cytogenetic positions. When novel in-house SNPs are submitted, the tool checks whether they overlap with existing, publicly available known SNPs and, if so, provides the links to dbSNP (Sherry et al., 2001) and, when available, the minor allele and minor allele frequency for the global population.

Furthermore, the tool maps each queried variant to its closest gene, which is either an overlapped gene or a downstream or upstream gene. For an overlapped gene, it also shows the type of gene and the predicted consequence.
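The closest-gene mapping described above can be illustrated with a minimal sketch. It is not the SNPnexus implementation: the `Gene` record, the `nearest_gene` function and the convention that "upstream" and "downstream" are taken relative to the gene's strand are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    # Hypothetical record; coordinates are 1-based and inclusive.
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def nearest_gene(pos: int, genes: List[Gene]) -> Optional[Tuple[str, str, int]]:
    """Return (gene name, relation, distance in bp) for the gene closest to pos."""
    best = None
    for g in genes:
        if g.start <= pos <= g.end:
            # An overlapped gene always wins over a flanking one.
            return (g.name, "overlapped", 0)
        if pos < g.start:
            distance = g.start - pos
            # Assumed convention: lower coordinates are upstream of a "+" gene.
            relation = "upstream" if g.strand == "+" else "downstream"
        else:
            distance = pos - g.end
            relation = "downstream" if g.strand == "+" else "upstream"
        if best is None or distance < best[2]:
            best = (g.name, relation, distance)
    return best
```

With genes `A` (100-200, "+") and `B` (500-600, "-"), position 150 is reported as overlapping `A`, while position 650 is reported as lying 50 bp upstream of `B`.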

Depending on the choice of genome assembly, a wide range of possible functional consequences is computed on the major gene annotation systems from NCBI RefSeq (Pruitt et al., 2007), UCSC Known Genes (Hsu et al., 2006), Ensembl (Hubbard et al., 2007), Vega (Wilming et al., 2008), AceView (Thierry-Mieg and Thierry-Mieg, 2006), CCDS (Pruitt et al., 2009) and H-Invitational (Yamasaki et al., 2010). The predicted functional effect falls into one of the following consequences:

Transcript Type   Predicted Function                  Description
Coding            coding                              In coding region
                  intronic                            In intron
                  intronic (splice_site)              Within 2 bp of an intron/exon junction
                  5'UTR                               In 5' untranslated region
                  3'UTR                               In 3' untranslated region
                  5-upstream                          Within 2 kb upstream of the 5' end of a transcript
                  3-downstream                        Within 2 kb downstream of the 3' end of a transcript
Non-Coding        non-coding                          In exon
                  non-coding intronic                 In intron
                  non-coding intronic (splice_site)   Within 2 bp of an intron/exon junction
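As a rough illustration of how the categories for a coding transcript could be assigned, the sketch below classifies a single position against one transcript. The function name, the transcript representation and the reading of "within 2 bp of an intron/exon junction" as the two intronic bases next to an exon are assumptions, not the documented SNPnexus logic.

```python
from typing import List, Optional, Tuple

FLANK = 2000  # "within 2 kb" of the transcript ends
SPLICE = 2    # "within 2 bp" of an intron/exon junction


def predict_function(pos: int, exons: List[Tuple[int, int]],
                     cds_start: int, cds_end: int,
                     strand: str = "+") -> Optional[str]:
    """Classify a 1-based genomic position against one coding transcript.

    exons: sorted, non-overlapping (start, end) pairs, inclusive.
    cds_start/cds_end: genomic bounds of the coding region (cds_start <= cds_end).
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]
    plus = strand == "+"
    if pos < tx_start:
        if tx_start - pos <= FLANK:
            return "5-upstream" if plus else "3-downstream"
        return None
    if pos > tx_end:
        if pos - tx_end <= FLANK:
            return "3-downstream" if plus else "5-upstream"
        return None
    for start, end in exons:
        if start <= pos <= end:
            if pos < cds_start:
                return "5'UTR" if plus else "3'UTR"
            if pos > cds_end:
                return "3'UTR" if plus else "5'UTR"
            return "coding"
    # Inside the transcript but outside every exon: the position is intronic.
    near_junction = any(0 < start - pos <= SPLICE or 0 < pos - end <= SPLICE
                        for start, end in exons)
    return "intronic (splice_site)" if near_junction else "intronic"
```

Positions further than 2 kb from the transcript return `None` here, since the table defines no category for them.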

For intronic SNPs, the distance to the splice site is reported. For coding variants, the coordinates of the first nucleotide position within the cDNA and CDS, as well as the resultant first amino acid position in the peptide chain, are reported.

Since coding variants are of special interest, we provide further information about the mutation type, such as whether a single substitution is synonymous or non-synonymous. We also report whether a non-synonymous substitution results in an immediate stop-codon gain or loss. For an insertion, deletion or block substitution occurring within a coding region, we report the occurrence as a frameshift if the total number of nucleotides to be replaced is not a multiple of 3, in which case we also report any early stop or stop loss. If the total number of nucleotides to be replaced is a multiple of 3, we report it as a peptide shift. In all these cases, we show the change of amino acids in the reported region. The reference/altered protein sequence can be found in the resultant Excel file. Transcripts with an incomplete ORF (with a missing or premature stop codon) and incomplete proteins are identified in the "Detail" column (representing the effect of mutation) by a "*" symbol. Unrecognisable alleles containing characters other than IUPAC base characters and "-" are identified in the "Detail" column as "Unknown". The predicted function for these cases is based only on the SNP position on the gene.
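The rules in this paragraph can be sketched in a few lines. The example below uses the standard genetic code; the function names and the returned labels are illustrative assumptions and are not the strings SNPnexus prints.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-nucleotide substitution from the codons before/after."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-synonymous (stop-gain)"
    if ref_aa == "*":
        return "non-synonymous (stop-loss)"
    return "non-synonymous"


def classify_length_change(n_replaced: int) -> str:
    """Apply the multiple-of-3 rule to the number of nucleotides replaced."""
    return "peptide shift" if n_replaced % 3 == 0 else "frameshift"
```

For instance, `classify_substitution("GAA", "TAA")` yields a stop-gain because `TAA` is a stop codon, and `classify_length_change(4)` yields a frameshift.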

Users can also download all the results in a tab-separated text format, where we report an additional column containing the protein sequences before and after each substitution separated by '|'.
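As a small, hedged example of working with that column, the sketch below takes one field value of the form "before|after" and locates the first amino acid that differs. The function name is hypothetical, and the exact column name and file layout are not assumed here; only the '|' separator comes from the description above.

```python
from typing import Optional, Tuple


def first_protein_difference(field: str) -> Optional[Tuple[int, str, str]]:
    """Return (1-based position, residue before, residue after) or None.

    A "-" stands for a missing residue when one sequence is shorter.
    """
    before, after = field.split("|", 1)
    for i, (a, b) in enumerate(zip(before, after), start=1):
        if a != b:
            return i, a, b
    if len(before) != len(after):
        # One sequence is a strict prefix of the other.
        i = min(len(before), len(after)) + 1
        return i, before[i - 1:i] or "-", after[i - 1:i] or "-"
    return None
```

Applied to the value `MKTAY|MKSAY`, this reports position 3 with `T` replaced by `S`.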

For variations such as deletions and block substitutions that may span more than one functional region of the transcript, we predict the function of the variation based on the first nucleotide position of the variation. However, we provide additional information such as the regions over which the variation spans. For example, if a deletion removes nucleotides starting in a coding exon and continuing into the next intron, then we predict the function of this variation as coding, but in the "Detail" column it is referred to as coding-intronic. Currently, in these more complicated cases, we do not provide the resultant amino acid changes even if the variation possibly affects the coding region of a transcript. Users can try the batch query example on the home page to see how it works on the hg19 assembly.
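A minimal sketch of this behaviour, assuming the transcript has already been split into contiguous labelled regions in genomic order, might look as follows. The `annotate_span` function and the region tuples are illustrative only.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int, str]  # (start, end, label), 1-based inclusive


def annotate_span(start: int, end: int,
                  regions: List[Region]) -> Optional[Tuple[str, str]]:
    """Return (predicted function, detail) for a variation spanning start..end.

    Assumes regions are sorted, non-overlapping and contiguous, so the first
    overlapping region is the one containing the first nucleotide.
    """
    spanned: List[str] = []
    for r_start, r_end, label in regions:
        overlaps = r_start <= end and start <= r_end
        if overlaps and (not spanned or spanned[-1] != label):
            spanned.append(label)
    if not spanned:
        return None
    # Function comes from the first position; detail lists every region crossed.
    return spanned[0], "-".join(spanned)
```

A deletion covering positions 195 to 205 over a coding exon ending at 199 and an intron starting at 200 would be returned as `("coding", "coding-intronic")`.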

For non-synonymous single amino acid substitutions, we provide the predicted effect on protein function (Tolerated or Damaging) based on the SIFT (Kumar et al., 2009) and PolyPhen (Adzhubei et al., 2010) predictions. Predictions are only shown for complete Ensembl proteins. Also, no predictions are shown for non-synonymous substitutions resulting in stop-gain or stop-loss, as these fundamentally change the protein sequence. For known dbSNPs we provide both SIFT and PolyPhen predictions. For novel variants we only provide the SIFT prediction.
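The availability rules in this paragraph can be summarised as a small decision function. This is only a restatement of the text in code; the function name, its parameters and the consequence labels are hypothetical.

```python
from typing import List


def available_predictors(consequence: str, is_known_dbsnp: bool,
                         complete_ensembl_protein: bool) -> List[str]:
    """List which effect predictions would be shown for a coding variant."""
    # Stop-gain/stop-loss and synonymous changes get no prediction,
    # and predictions require a complete Ensembl protein.
    if consequence != "non-synonymous" or not complete_ensembl_protein:
        return []
    # Known dbSNP variants get both predictors; novel variants only SIFT.
    return ["SIFT", "PolyPhen"] if is_known_dbsnp else ["SIFT"]
```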

For known SNPs, the tool provides related genotypes and allele frequency retrieved from data provided by the HapMap Project (The International HapMap Consortium, 2007), the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) and Exome data from the Genome Aggregation Database (gnomAD) (Karczewski et al., 2019).

For HapMap, SNPnexus provides 12 populations for both human assemblies:

  • Yoruba in Ibadan, Nigeria (YRI)
  • Japanese in Tokyo, Japan (JPT)
  • Han Chinese in Beijing, China from HapMap phase 3 (CHB)
  • Unrelated Han Chinese in Beijing, China from the International HapMap project (HCB)
  • Utah residents with ancestry from northern and western Europe (CEU)
  • African Ancestry in SouthWestern US (ASW)
  • Chinese Ancestry in Metropolitan Denver, US (CHD)
  • Gujarati Indians in Houston, Texas (GIH)
  • Luhya in Webuye, Kenya (LWK)
  • Mexican Ancestry in Los Angeles, US (MEX)
  • Masai in Kinyawa, Kenya (MKK)
  • Toscani in Italia (TSI)

The HapMap project was discontinued after it paved the way for the 1000 Genomes Project, which uses many of the same populations. Therefore, we also report 1000 Genomes population data for the following five super populations:

  • African (AFR)
  • Admixed American (AMR)
  • East Asian (EAS)
  • European (EUR)
  • South Asian (SAS)

The Genome Aggregation Database (gnomAD) is a coalition of investigators seeking to aggregate and harmonize exome and genome sequencing data from a wide variety of large-scale sequencing projects. SNPnexus provides annotations from exome data for 7 different ancestries:

  • African / African American (AFR)
  • Latino (AMR)
  • East Asian (EAS)
  • Finnish (FIN)
  • Non-Finnish European (NFE)
  • South Asian (SAS)
  • Other populations (OTH)

Regulatory SNPs can be queried against any overlap with the following regulatory elements on different assemblies:

  • Transcription Factor Binding Sites (TFBS) conserved in the human/mouse/rat alignment.
  • Exons, Promoters and CpG windows predicted by FirstEF (First Exon Finder) program (Davuluri et al., 2001). Available only for hg18 assembly.
  • Published microRNA sequences available from miRBase (Kozomara and Griffiths-Jones, 2011) release 18.
  • Vista Enhancers (Pennacchio et al., 2006): Non-coding distant-acting transcriptional enhancers in the human genome identified as conserved in human, mouse, and rat.
  • CpG Islands (Bird, 1986)
  • Conserved mammalian microRNA regulatory sites in the 3' UTR regions of RefSeq genes, as predicted by TargetScanS (Lewis et al., 2003)
  • microRNAs from miRBase, snoRNAs and scaRNAs from snoRNABase (Lestrade and Weber, 2006)

Most of these regulatory elements, such as TFBS, enhancers, promoters and microRNAs, have become an integral part of research on non-coding genome regions in recent years. With many different types of regulatory features being explored in non-coding research, we have introduced three new annotation categories that encompass the broadest range of regulatory feature classes and types, including:

  • Promoters
  • Promoter flanking regions
  • Enhancers
  • CTCF binding sites
  • Transcription factor binding sites
  • Open chromatin regions

The resources available for both human assemblies are:

  • ENCODE (Encyclopedia of DNA Elements): a comprehensive list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control the cells and circumstances in which a gene is active.
  • Roadmap Epigenomics: a public resource of human epigenomic data
  • Ensembl Regulatory Build: a genome-wide set of regions that are likely to be involved in gene regulation.

SNPnexus shows the estimated probability that a variant belongs to a conserved region, based on the multiple alignment of 100 vertebrate species using the phastCons method from the PHAST package. SNPnexus can also scan variants against conserved regions identified as GERP elements (Davydov et al., 2010) and shows the Rejected Substitution score of the element. GERP elements are not available for the hg38 assembly.

SNPnexus retrieves the connection between queried SNPs/indels and the following phenotype & disease association databases:

  • GAD, The Genetic Association Database. GAD is an archive of published scientific papers on human genetic association studies of complex diseases and disorders. When investigating the role of variants, the user can mine GAD and extract any information related to the gene of interest (Becker et al., 2004). GAD data is not available on hg38 assembly.
  • COSMIC, The Catalogue Of Somatic Mutations In Cancer. COSMIC is an online database of somatic mutations found in human cancers. Users can investigate the relationship between the queried variants and cancer phenotypes (Forbes et al., 2011).
  • GWAS Catalogue, The Catalogue of Published Genome-Wide Association Studies. SNPs identified by published GWAS are collected in this catalogue which represents a useful resource for mining SNP-trait associations (Hindorff et al., 2009).
  • ClinVar. A public archive of reports of the relationships among human variations and phenotypes, with supporting evidence (Landrum et al. 2016).

SNPnexus checks any overlap with putative copy number variations (CNVs) such as gains/losses, insertions/deletions (InDels), duplications, inversions and complex types determined from various methods, as annotated by the Database of Genomic Variants (DGV).

Going beyond SIFT and PolyPhen predictions for the deleterious effect of coding variants on protein function, SNPnexus users can now obtain the predicted functional impact of noncoding variants from eight popular noncoding variant scoring algorithms:

  • CADD (Kircher et al. 2014)
  • DeepSEA (Zhou and Troyanskaya 2015)
  • Eigen (Ionita-Laza et al. 2016)
  • FATHMM-MKL (Shihab et al. 2015)
  • FitCons (Gulko et al. 2015)
  • FunSeq2 (Fu et al. 2014)
  • GWAVA (Ritchie et al. 2014)
  • ReMM (Smedley et al. 2016)

Each of these systems uses diverse criteria and computational methods to provide a simple continuous functional score for noncoding variants/regions, placing these within the spectrum of being non-functional/benign/non-deleterious and functional/pathogenic/deleterious.

From this release, SNPnexus uses Reactome (Fabregat et al. 2018) data to link the genes involved in the queried variants with their biological pathways. For each pathway, we also provide a p-value to facilitate an Enrichment Analysis. This p-value is determined by a Fisher's Exact Test taking into account all the genes associated with the original query set.
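To make the enrichment p-value concrete, the sketch below computes a one-sided Fisher's Exact Test as the upper tail of the hypergeometric distribution. The choice of a one-sided test, the function name and the `background_size` parameter (the total number of genes considered) are assumptions; the exact configuration used by SNPnexus is not stated here.

```python
from math import comb


def enrichment_p_value(hits: int, query_size: int,
                       pathway_size: int, background_size: int) -> float:
    """P(X >= hits) when drawing query_size genes from background_size genes,
    of which pathway_size belong to the pathway."""
    max_hits = min(query_size, pathway_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, query_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total
```

For example, with a background of 10 genes, a pathway of 5 genes and a query of 5 genes that all fall in the pathway, the p-value is 1/252, since only one of the 252 possible query sets lies entirely inside the pathway.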

As thousands of tumour genomes are sequenced around the world every year, it becomes increasingly necessary to annotate and identify which sequenced variants could have a role in tumourigenesis and treatment response. Using the Cancer Genome Interpreter (Tamborero et al., 2018), SNPnexus now links the queried variants with known and likely tumourigenic alterations, including the assessment of variants of unknown significance, and reports the variants that constitute state-of-the-art biomarkers of drug response.


References

Sherry,S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

Pruitt,K.D. et al. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.

Hsu,F. et al. (2006) The UCSC Known Genes. Bioinformatics, 22, 1036–1046.

Hubbard,T.J. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.

Wilming,L.G. et al. (2008) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res., 36, D753–D760.

Thierry-Mieg,D. and Thierry-Mieg,J. (2006) AceView: a comprehensive cDNA supported gene and transcripts annotation. Genome Biol., 7 (Suppl. 1), S12.

Pruitt,K.D. et al. (2009) The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19, 1316–1323.

Yamasaki,C. et al. (2010) H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res., 38, D626–D632.

Kumar,P. et al. (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc., 4, 1073–1081.

Adzhubei,I.A. et al. (2010) A method and server for predicting damaging missense mutations. Nat Methods, 7, 248–249.

The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861.

The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526, 68-74.

Karczewski,K. et al. (2019) Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv, 531210.

Davuluri,R.V. et al. (2001) Computational identification of promoters and first exons in the human genome. Nat Genet., 29, 412–417.

Kozomara,A. and Griffiths-Jones,S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res., 39, D152–D157.

Pennacchio,L.A. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499–502.

Lewis,B.P. et al. (2003) Prediction of mammalian microRNA targets. Cell, 115, 787–798.

Vlachos,I.S. et al. (2014) DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res., 43, D153–D159.

Bird,A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature, 321, 209–213.

Lestrade,L. and Weber,M.J. (2006) snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res., 34, D158–D162.

Davydov,E.V. et al. (2010) Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol., 6(12), e1001025.

Becker,K.G. et al. (2004) The genetic association database. Nat Genet., 36, 431–432.

Forbes,S.A. et al. (2011) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res., 39 (Suppl. 1), D945–D950.

Hindorff,L.A. et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci., 106, 9362–9367.

Landrum,M.J. et al. (2016) ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res., 44(D1), D862–D868.

Kircher,M. et al. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet., 46, 310–315.

Zhou,J. and Troyanskaya,O.G. (2015) Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods, 12, 931-934.

Ionita-Laza,I. et al. (2016) A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet., 48(2), 214-20.

Shihab,H.A. et al. (2015) An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics, 31, 1536-43.

Gulko,B. et al. (2015) A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet., 47, 276-83.

Fu,Y. et al. (2014) FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol., 15, 480.

Ritchie,G.R. et al. (2014) Functional annotation of noncoding sequence variants. Nat Methods, 11, 294-6.

Smedley,D. et al. (2016) A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. Am J Hum Genet., 99, 595-606.

Fabregat,A. et al. (2018) The Reactome Pathway Knowledgebase. Nucleic Acids Res., 46(D1), D649-D655.

Tamborero,D. et al. (2018) Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med., 10, 25.