Here we present a practical example of SNPnexus usage, demonstrating the selection of query options and annotation categories in the input interface, followed by the assessment of the potential functional roles of query variants from the results available in the output interface.
The results shown in this example are available from here.
Users can optionally provide an email address and a name for the query dataset. If a valid email address is provided, notifications of acceptance and completion of the request are sent by email. The dataset name helps users keep track of individual queries when multiple requests are submitted.
The variant data can be submitted in three different forms: genomic position, chromosomal region or dbSNP identifier. Queries can be made in both single and batch mode.
When dealing with large numbers of variants, a batch query is the suitable option. Users can either paste the variant list directly into the designated text area or upload a text file (.txt) containing the query variants. Currently, the maximum number of variants in a single batch query is limited to 100,000. Alternatively, users can upload VCF files (.vcf) containing SNPs, InDels and block substitutions.
For the quick reassessment of a single SNP or known SNPs in a given region, the appropriate Single Query option can be chosen.
SNPnexus supports the three recent human genome assemblies: GRCh38/hg38 (default), GRCh37/hg19 and NCBI36/hg18. The output annotation categories available for selection under each assembly appear in dedicated navigation tabs. Users should click the navigation tab for the genome assembly against which the queried variants are to be annotated. After specifying or uploading the query variants, at least one of the available output annotation options must be selected before submission. In this example, various annotation categories are selected under the GRCh37/hg19 assembly.
Once the request is submitted, the user is immediately notified of the submission.
The request is placed in a queue to be picked up for processing. However, if the server is overloaded with requests previously submitted by other users, the current request might need to stay in the queue for longer. The user is immediately notified in such circumstances.
Once resources are available, the server picks up the request for processing and a notification is displayed on the screen.
If the query does not contain any valid SNPs, then the user is immediately notified.
This usually happens when the query variants are provided in a format that does not match the accepted SNPnexus input format (e.g., the keyword dbsnp is missing before the rsID).
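The two input constraints mentioned so far (the dbsnp keyword before an rsID and the 100,000-variant batch limit) can be checked before submission. The following is a minimal sketch, not part of SNPnexus itself. It assumes that a dbSNP query line is written as the keyword followed by a space and the rsID. The function names normalise_rsid_line and split_into_batches are hypothetical helpers introduced only for illustration.

```python
import re

MAX_BATCH_SIZE = 100_000  # batch limit stated in the text
RSID_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)


def normalise_rsid_line(line: str) -> str:
    """Prefix a bare rsID with the 'dbsnp' keyword; return other lines stripped.

    Assumption: a dbSNP query line has the form 'dbsnp <rsID>'.
    """
    stripped = line.strip()
    if RSID_PATTERN.match(stripped):
        return f"dbsnp {stripped}"
    return stripped


def split_into_batches(lines, batch_size=MAX_BATCH_SIZE):
    """Drop empty lines, normalise the rest, and split them into batches."""
    cleaned = [normalise_rsid_line(line) for line in lines if line.strip()]
    return [cleaned[i:i + batch_size] for i in range(0, len(cleaned), batch_size)]


if __name__ == "__main__":
    # rsIDs taken from the example in this guide; a tiny batch size is used
    # here only to make the splitting visible.
    queries = ["rs8028529", "dbsnp rs13129", "", "rs2476601"]
    for batch in split_into_batches(queries, batch_size=2):
        print(batch)
```

Each resulting batch could then be pasted into the text area or saved as a separate .txt file for upload.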
In the general case, when the request contains valid SNPs, it remains in the processing stage until completion. How long it stays in this state depends on the query size and on a few other factors, such as the number and types of selected annotation categories and the number of requests from other users being processed.
For each selected output annotation category, detailed results are shown separately in an interactive tabular format with filtering, pagination and sorting options with links to the related web data sources, when available.
Results are available for download as archived tab-delimited text or VCF files (depending on the selection in the input page).
Results can also be downloaded as a single Excel file composed of separate worksheets, one per selected output annotation, allowing further flexibility for additional investigations.
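Downloaded tab-delimited result files can be processed further with standard tools. The sketch below reads a tab-delimited table with Python's csv module and filters rows by a column value. The sample content and the column names (Variant, Gene, Consequence) are placeholders chosen for illustration and are not copied from an actual SNPnexus download, so they would need to be replaced by the headers found in the real file.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited result table. Column names and
# values are illustrative only, not real SNPnexus output.
SAMPLE = (
    "Variant\tGene\tConsequence\n"
    "var1\tGENE_A\tcoding\n"
    "var2\tGENE_B\tintronic\n"
    "var3\tGENE_A\t5utr\n"
)


def read_table(handle):
    """Read a tab-delimited table into a list of dictionaries keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_rows(rows, column, value):
    """Keep only the rows whose given column equals the given value."""
    return [row for row in rows if row.get(column) == value]


if __name__ == "__main__":
    rows = read_table(io.StringIO(SAMPLE))
    for row in filter_rows(rows, "Gene", "GENE_A"):
        print(row["Variant"], row["Consequence"])
```

For a file on disk, the same read_table function can be given a handle opened with open(path, newline="").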
The top of the result page shows the number of distinct valid SNPs extracted from the user input, together with the archive file and Excel file for download.
It is followed by a Navigation Tree that allows the user to easily navigate through the subsequent result tables. The result tables are initially in the collapsed state. Clicking on a link in the Navigation Tree will expand the respective table.
For the gene/protein consequences category, an additional graphical representation is available in the Navigation Tree to quickly inspect the distribution of predicted functional consequences.
The first table provides the genomic mapping with both physical (on chromosome and contig) and cytogenetic positions. It also shows whether the variants are already deposited in public databases. In this example, the first variant chr3:30713853:G/T:1 is the only one not found in dbSNP.
The second table provides the relative positions of the variants within the genome, i.e., whether a variant overlaps a gene or falls within an intergenic region. For gene-overlapping variants, it provides the corresponding gene type and a summary list of annotations for the transcript isoforms. For intergenic variants, it displays the nearest upstream and downstream genes and the distances to them. The Ensembl gene annotation system is used as the reference for this annotation. In this example, the known SNP rs8028529 is the only intergenic variant.
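The gene-overlap versus intergenic distinction described above can be illustrated with a small interval lookup. This is a simplified sketch and not the SNPnexus implementation: the gene coordinates and names are hypothetical, genes are assumed not to overlap one another, and "upstream" and "downstream" are taken to mean lower and higher forward-strand coordinates, ignoring gene orientation.

```python
from bisect import bisect_left

# Hypothetical, non-overlapping gene intervals on one chromosome:
# (start, end, name), with inclusive coordinates.
GENES = [
    (1_000, 2_000, "GENE_A"),
    (5_000, 6_500, "GENE_B"),
    (9_000, 9_800, "GENE_C"),
]


def locate(position, genes=GENES):
    """Report the overlapping gene, or the nearest flanking genes and distances."""
    genes = sorted(genes)
    for start, end, name in genes:
        if start <= position <= end:
            return {"overlaps": name, "upstream": None, "downstream": None}
    starts = [gene[0] for gene in genes]
    idx = bisect_left(starts, position)  # index of first gene starting after position
    left = genes[idx - 1] if idx > 0 else None
    right = genes[idx] if idx < len(genes) else None
    return {
        "overlaps": None,
        # distance from the end of the gene at lower coordinates
        "upstream": (left[2], position - left[1]) if left else None,
        # distance to the start of the gene at higher coordinates
        "downstream": (right[2], right[0] - position) if right else None,
    }


if __name__ == "__main__":
    print(locate(5_500))  # falls inside GENE_B
    print(locate(3_000))  # intergenic, between GENE_A and GENE_B
```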
SNPnexus considers different gene annotation systems to assess the functional consequences on the possible isoforms. The first variant (chr3:30713853:G/T) in the example was implicated in pancreatic tumourigenesis by altering genes involved in the TGF-β signalling pathway. Using SNPnexus, we assessed in detail the functional consequences of this variant on RefSeq, Ensembl and AceView. All three annotation systems agree on the coding non-synonymous effect (C393F, C418F) of the variant on two alternative transcripts of the transforming growth factor beta receptor II (TGFBR2) gene. SNPnexus additionally identifies possible downstream effects on two AceView transcripts.
The corresponding protein alterations are predicted as damaging by both SIFT and PolyPhen.
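To make the notation concrete, the sketch below shows how a single G to T substitution can turn a cysteine codon into a phenylalanine codon, which is what a label such as C393F expresses. The reference codon is not given in the text, so TGC with the substitution at its second base is an assumption made purely for illustration (TGT would behave the same way), and the helper names are hypothetical.

```python
# Minimal codon table covering only the codons needed for this sketch.
CODON_TABLE = {"TGC": "C", "TGT": "C", "TTC": "F", "TTT": "F"}


def apply_substitution(codon, offset, ref, alt):
    """Replace the base at the given 0-based offset, checking the reference base."""
    if codon[offset] != ref:
        raise ValueError(f"expected {ref} at offset {offset}, found {codon[offset]}")
    return codon[:offset] + alt + codon[offset + 1:]


def protein_change(codon, offset, ref, alt, residue_number):
    """Return a label such as 'C393F' for a substitution within one codon."""
    mutated = apply_substitution(codon, offset, ref, alt)
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return f"{before}{residue_number}{after}"


if __name__ == "__main__":
    # Assumed codon TGC (Cys); G>T at the second base gives TTC (Phe).
    print(protein_change("TGC", 1, "G", "T", 393))
    print(protein_change("TGC", 1, "G", "T", 418))
```

The two residue numbers correspond to the two alternative transcripts mentioned above, where the same genomic change lands at different positions in the protein sequence.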
The variant chr3:9791667:AGA/- (rs3218998) has 5′-UTR and upstream effects on the 8-oxoguanine DNA glycosylase (OGG1) gene, but it could also have a downstream effect on the neighbouring bromodomain and PHD finger containing protein 1 (BRPF1) gene. Interestingly, a potential peptide shift is detected on one AceView isoform, which is not captured by RefSeq or Ensembl.
This is an example of how SNPnexus enables discovery of new potential functions that should be considered in any further analysis.
SNPnexus can be used to identify variants that occur in highly conserved promoter or regulatory regions. SNPnexus located rs3218998 (chr3:9791667:AGA/-) in a predicted CpG island, potentially affecting the transcriptional regulation of the OGG1 gene.
The same SNP is also found to be located in the ENCODE-defined active promoter and enhancer regions, based on H3K4Me1, H3K9Ac, H3K4Me3, H3K27Ac chromatin marks.
The variant chr3:30713853:G/T overlaps with predicted binding sites of regulatory factor X1 (RFX1). SNP rs13129 is found within predicted binding sites of the transcriptional activator Forkhead box J2 (FOXJ2).
Both SNPs are also located in conserved regions.
rs13129 is also a potential miRSNP occurring in the 3′-UTR of the AGRN gene, a putative target site of miR-224.
SNPnexus users can now obtain the predicted functional impact of noncoding variants from eight popular noncoding variant scoring algorithms. Each of these systems uses its own criteria and computational methods to provide a simple continuous functional score for noncoding variants/regions, placing them on a spectrum from non-functional/benign/non-deleterious to functional/pathogenic/deleterious. In this example, the noncoding variant rs13129 is consistently scored as potentially functional/pathogenic across the different scoring algorithms.
All scores from the selected noncoding variant scoring methods can be viewed together as a single sheet in the downloadable Excel file.
SNPnexus can report direct and indirect links between variants and known diseases/phenotypes using GAD, ClinVar, COSMIC and GWAS resources. Direct links indicate whether the given variant has been reported in previous studies. Indirect links are based on the gene containing the variant rather than the variant itself. For example, rs2476601 is a variant found in the sequence of the PTPN22 gene. SNPnexus identified 269 GAD entries connecting rs2476601 (+1858G>A) with various diseases, most of which are indirect links, as it is the gene PTPN22 that has been studied in these reports. Only 43 entries are direct links, as they focus on rs2476601.
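The direct versus indirect distinction can be expressed as a simple rule: an entry is direct if it names the query variant and indirect if it only names the gene containing it. If every one of the 269 entries falls into one of these two groups, the counts given above leave 269 - 43 = 226 indirect links. The sketch below applies the rule to a few made-up records. The record fields (variant, gene, phenotype) and the placeholder phenotypes are hypothetical and do not reflect the actual GAD schema or content.

```python
from collections import Counter


def classify_link(entry, query_variant, host_gene):
    """Label an association entry as direct, indirect or unrelated."""
    if entry.get("variant") == query_variant:
        return "direct"    # the study reports the variant itself
    if entry.get("gene") == host_gene:
        return "indirect"  # the study reports only the containing gene
    return "unrelated"


def summarise_links(entries, query_variant, host_gene):
    """Count how many entries fall into each link type."""
    return Counter(classify_link(e, query_variant, host_gene) for e in entries)


# Made-up records for illustration only; not real database entries.
ENTRIES = [
    {"variant": "rs2476601", "gene": "PTPN22", "phenotype": "phenotype_1"},
    {"variant": None, "gene": "PTPN22", "phenotype": "phenotype_2"},
    {"variant": None, "gene": "PTPN22", "phenotype": "phenotype_3"},
]

if __name__ == "__main__":
    print(summarise_links(ENTRIES, "rs2476601", "PTPN22"))
```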
Mining the ClinVar database reveals rs2476601 as a potential risk factor for several diseases, including Rheumatoid Arthritis, Diabetes mellitus, Addison disease and Systemic Lupus Erythematosus.
Combining disease association database results with population genotype data is also possible. We considered the association between rs2476601 and Rheumatoid Arthritis in European populations, which has been reported in several studies. Overlaying HapMap data shows a minor allele frequency (MAF) for the +1858A allele of 11.96% in European CEU and <2% for most other populations (except Mexican and Toscani).
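For readers unfamiliar with how a MAF value is obtained, the sketch below derives allele frequencies from genotype counts for a biallelic SNP. The genotype counts are invented for illustration and are not the HapMap data behind the 11.96% figure, and the function names are hypothetical.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Convert diploid genotype counts (e.g. {'GG': 80}) to allele frequencies."""
    allele_counts = Counter()
    for genotype, n_individuals in genotype_counts.items():
        for allele in genotype:  # each individual contributes two alleles
            allele_counts[allele] += n_individuals
    total = sum(allele_counts.values())
    return {allele: count / total for allele, count in allele_counts.items()}


def minor_allele_frequency(genotype_counts):
    """Return the frequency of the less common allele (0.0 if monomorphic)."""
    freqs = allele_frequencies(genotype_counts)
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())


if __name__ == "__main__":
    # Illustrative counts only: 80 GG, 18 GA and 2 AA individuals.
    counts = {"GG": 80, "GA": 18, "AA": 2}
    print(allele_frequencies(counts))
    print(minor_allele_frequency(counts))
```

With these made-up counts, the A allele is carried on 18 + 2 × 2 = 22 of 200 chromosomes, giving a frequency of 0.11.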
Studies have suggested the importance of CNVs in complex phenotypes and the need to assess them independently from SNPs. Both rs13129 and rs2476601 are found to overlap with several CNV regions, suggesting that association studies relating these SNPs to diseases need to consider the impact of the corresponding CNVs as well.