NGS-SNP - annotate_SNPs.pl

Note

If you find that the annotation process sometimes fails due to lost connections to the Ensembl database or other required resources try using the script 'run_annotate_SNPs.pl' in place of this script. The same command-line parameters are used for both scripts. When the 'run_annotate_SNPs.pl' script is used the annotation process can restart automatically if problems are encountered, starting at the first unannotated SNP.

By default this script obtains sequences and annotations from Ensembl, UniProt, and NCBI. To speed up the annotation process you may want to set up the required Ensembl databases locally. For the sample data in the test directory the required Ensembl databases are bos_taurus_core, bos_taurus_variation, bos_taurus_funcgen, homo_sapiens_core, homo_sapiens_variation, ensembl_compara, and ensembl_ontology. For more information on installing Ensembl databases see the NGS-SNP installation instructions.

Description

This script annotates SNPs identified by the sequencing of genomic DNA or transcripts. It is designed to accept SNPs from any other program that can provide a reference sequence identifier, a SNP position, the base found at that position in the reference sequence, and the bases found at that position among the sequencing reads. More information on input formats is given below. Using a locally installed version of Ensembl this script can annotate 4,000,000 SNPs in about two days on a standard desktop system. Thus this script is suitable for annotating the SNPs arising from genome resequencing projects.

Setup

See setup instructions for NGS-SNP.

Input

The SNPs provided as input to this script must have been obtained from a species contained in the Ensembl database. The reference sequences used for SNP detection can be chromosomes or transcripts. If the reference sequences are chromosomes, chromosome names should be used as the reference identifiers. If the reference sequences are transcripts, Ensembl transcript IDs should be used as the reference identifiers (see input examples below). Several different input formats are accepted. The fields in the input file should be separated by tabs, although commas or multiple spaces can be used for column separation for some formats.

SAMtools consensus pileup

The following is an example of the SAMtools consensus pileup format obtained using the 'samtools pileup' command with the '-c' option (see SAMtools - manual). In this example, chromosome sequences were used as the reference for SNP detection. Note that only the rows describing SNPs are processed by NGS-SNP. The rows describing positions that match the reference are skipped. Use '-format sam' to process files in this format:

chr6   60357580  C  T  66  0  99  13  ...........^~.^~.   9<<55<;<<<<<<
chr27  17496636  A  A  72  0  99  15  .$..............    <;;,55;<<<<<<<<
chr30  8507830   G  W  72  0  99  15  .............^~.^y. (;975&;<<<<<<<<
    

To speed up the processing of this file format you may want to remove the non-SNP rows before running annotate_SNPs.pl. The second row in the above example is a non-SNP row. The awk program can be used to remove non-SNP rows as follows:

awk '{if ($3 != $4) print $0}' consensus_pileup.txt > snps.txt
     

Maq

The following is an example of the Maq format obtained using the 'cns2snp' or 'maq.pl SNPfilter' commands (see Maq - manual). In this example, transcript sequences were used as the reference for SNP detection. Use '-format maq' to process files in this format:

ENSBTAT00000002263  3068    G   K   22  10  1.00    52  44  T   72  G
ENSBTAT00000044491  123     A   C   28  6   1.00    52  30  M   62  A
ENSBTAT00000012002  1846    G   T   10  11  1.00    53  32  K   31  G
    

diBayes

The following is an example of the diBayes format (see AB diBayes SNP tool). In this example, chromosome sequences were used as the reference for SNP detection. Use '-format dibayes' to process files in this format:

#Chr    Source          Type    Pos_Start       Pos_End   Score      Strand  Phase   Attributes
chr1    SOLiD_diBayes   SNP     3231            3231      0.000000   .       .       genotype=T;reference=A;coverage=5;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=5;novelAlleleStarts=4;novelAlleleMeanQV=18;diColor1=13;diColor2=13;het=0;flag=
chr1    SOLiD_diBayes   SNP     4010            4010      0.003399   .       .       genotype=Y;reference=C;coverage=10;refAlleleCounts=7;refAlleleStarts=5;refAlleleMeanQV=18;novelAlleleCounts=3;novelAlleleStarts=3;novelAlleleMeanQV=18;diColor1=03;diColor2=21;het=1;flag=
chr6    SOLiD_diBayes   SNP     60357580        60357580  1.000000   .       .       genotype=T;reference=C;coverage=;refAlleleCounts=;refAlleleStarts=;refAlleleMeanQV=;novelAlleleCounts=;novelAlleleStarts=;novelAlleleMeanQV=;diColor1=;diColor2=;het=0;flag=
    

Generic

The generic format requires four columns in the following order: the name of the reference chromosome or the ID of the reference transcript, the position of the SNP in the reference, the base in the reference, and the bases in the reads (represented by a single-letter). The following is an example of generic input (created from the data in the Maq and diBayes examples above). Use '-format generic' to process files in this format:

ENSBTAT00000002263  3068        G   K
ENSBTAT00000044491  123         A   C
ENSBTAT00000012002  1846        G   T
chr1                3231        A   T
chr1                4010        C   Y
chr6                60357580    C   T
    

VCF

The following is an example of the "VCF" format (see VCF (Variant Call Format) version 4.0). For annotation purposes NGS-SNP uses the "CHROM", "POS", "REF", and "ALT" columns. The "QUAL", "FILTER", and "INFO" columns are not used but are included in the annotated output. The first five columns must appear in the order depicted below. Note that only the rows describing SNPs are processed by NGS-SNP. The header should include a line starting with "##fileformat=VCF". Use '-format vcf' to process files in this format:

##fileformat=VCFv4.0
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
    

Output fields in annotated SNP files

All of the fields present in the input file are reproduced in the annotated SNP list. The first field added by NGS-SNP is called "Functional_Class"--this and the other fields are described below:

Functional_Class
the type of SNP ('missense_variant' for example). They are a subset of the variation classes used by Ensembl, and are ordered below according to the level of severity assigned by Ensembl (and annotate_SNPs.pl), from most severe to least severe (note that the terms changed in Ensembl 68):
New Term
(Ensembl 68 and newer)
Old Term
(pre-Ensembl 68)
Description
splice_donor_variant Essential splice site A splice variant that changes the 2 base region at the 5' end of an intron.
splice_acceptor_variant Essential splice site A splice variant that changes the 2 base region at the 3' end of an intron.
stop_gained Stop gained A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript.
stop_lost Stop lost A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript.
initiator_codon_variant Non synonymous coding A codon variant that changes at least one base of the first codon of a transcript.
missense_variant Non synonymous coding A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.
splice_region_variant Splice site A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron.
incomplete_terminal_codon_variant Partial codon A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed.
synonymous_variant Synonymous coding A sequence variant where there is no resulting change to the encoded amino acid.
stop_retained_variant Synonymous coding A sequence variant where at least one base in the terminator codon is changed, but the terminator remains.
coding_sequence_variant Coding unknown A sequence variant that changes the coding sequence.
mature_miRNA_variant Within mature miRNA A transcript variant located with the sequence of the mature miRNA.
5_prime_UTR_variant 5prime UTR A UTR variant of the 5' UTR.
3_prime_UTR_variant 3prime UTR A UTR variant of the 3' UTR.
intron_variant Intronic A transcript variant occurring within an intron.
non_coding_exon_variant Within non coding gene A sequence variant that changes non-coding exon sequence.
nc_transcript_variant Within non coding gene A transcript variant of a non coding RNA.
NMD_transcript_variant NMD transcript A variant in a transcript that is the target of NMD.
upstream_gene_variant Upstream A sequence variant located 5' of a gene.
downstream_gene_variant Downstream A sequence variant located 3' of a gene.
TF_binding_site_variant Regulatory region A sequence variant located within a transcription factor binding site.
regulatory_region_variant Regulatory region A sequence variant located within a regulatory region.
intergenic_variant Intergenic A sequence variant located in the intergenic region, between genes.
Chromosome
the chromosome containing the SNP.
Chromosome_Position
the position of the SNP on the chromosome.
Chromosome_Strand
the strand of the SNP alleles reported in the 'Chromosome_Reference' and 'Chromosome_Reads' columns ('forward' or 'reverse'). 'forward' indicates that the alleles given in the 'Chromosome_Reference' and 'Chromosome_Reads' columns can be found on the forward strand of the reference chromosome sequence. 'reverse' indicates that the alleles can be found on the reverse strand of the reference chromosome sequence.
Chromosome_Reference
the nucleotide found in the reference genome at the position described by 'Chromosome', 'Chromosome_Position', and 'Chromosome_Strand'.
Chromosome_Reads
the nucleotides observed by resequencing, as they would be observed at the position described by 'Chromosome', 'Chromosome_Position', and 'Chromosome_Strand'.
Gene_Description
a short description of the relevant gene.
Ensembl_Gene_ID
the Ensembl Gene ID of the relevant gene.
Entrez_Gene_Name
the Entrez Gene name of the relevant gene.
Entrez_Gene_ID
the Entrez Gene ID of the relevant gene.
Ensembl_Transcript_ID
the Ensembl Transcript ID of the affected transcript.
Transcript_SNP_Position
the position of the SNP, on the transcript.
Transcript_SNP_Reference
the reference genome allele, as it would appear in the mature transcipt.
Transcript_SNP_Reads
a nucleotide found by resequencing, as it would appear in the mature transcript.
Transcript_To_Chromosome_Strand
the strand of the transcript compared to the chromosome. 'forward' indicates that the alleles given in 'Transcript_SNP_Reference' and 'Transcript_SNP_Reads' refer to the forward strand of the reference chromosome sequence. 'reverse' indicates that the alleles refer to the reverse strand of the reference chromosome sequence.
Ensembl_Protein_ID
the Ensembl Protein ID of the relevant protein.
UniProt_ID
the UniProt ID of the relevant protein.
Amino_Acid_Position
the position of the relevant amino acid in the protein sequence.
Overlapping_Protein_Domains
protein domains that overlap with the position of the affected amino acid.
Overlapping_Protein_Features
protein features, obtained from UniProt, that overlap with the position of the relevant amino acid. This field is not calculated if the -no_uniprot option is specified.
Amino_Acid_Reference
the amino acid encoded by the reference sequence.
Amino_Acid_Reads
the amino acid encoded by the variant sequence.
Amino_Acids_In_Orthologues
the amino acids aligned with the reference amino acid in orthologous sequences.
Alignment_Score_Change
the alignment score for the variant amino acid vs. the orthologous amino acids minus the alignment score for the reference amino acid vs. the orthologous amino acids. A positive value indicates that the variant amino acid better resembles the orthologues than does the reference amino acid, whereas a negative value indicates that the reference amino acid better resembles the orthologues than does the variant amino acid. The value is scaled to between -1 and 1.
C_blosum
a measure of the conservation of the reference amino acid with the aligned amino acids in orthologous sequences. The alignment score for the reference amino acid vs. the orthologous amino acids is divided by the alignment score that would be obtained if all the orthologous residues matched the reference. Higher values tend to be associated with changes to the amino acid having a greater functional consequence. The formula used is equivalent to the C_blosum formula given in Kowarsch A et al. (2010 PLoS Comput Biol 6(9): e1000923), except that in NGS-SNP scoring matrices other than BLOSUM62 can be used.
Context_Conservation
the average percent identity obtained when the region of the reference protein containing the SNP-affected residue is aligned with the orthologous region from other species. The size of the region examined is determined by the -cf option, which specifies how much sequence on either side of the SNP to examine. For example, if '-cf 10' is used, the size of the region is 10 + 1 + 10 = 21.
Orthologue_Species
the species from which sequences were obtained to generate the 'Amino_Acid_In_Orthologues', 'Alignment_Score_Change', 'C_Blosum', and 'Context_Conservation' values. The order of the species matches the order used to generate the 'Amino_Acid_In_Orthologues' value.
Gene_Ontology
GO slim IDs and terms associated with the relevant transcript.
Model_Annotations
provides information transferred from the model species (specified using the -model option) orthologue of the relevant gene, or from the relevant gene itself if the input SNPs are from the model species. Information in this field is given in the form of key-value pairs.
Phenotypes_Position phenotypic information associated with known variation in the model species at the SNP position. If the input SNPs are from the model species then their locations are used to identify known variations at the same locations, and any phenotypic information linked to these variations is reported. If the input SNPs are not from the model species and the input SNP alters a protein, then protein alignment is used to find the orthologous genomic region from the model. Model species variations that alter this region are obtained, and any phenotypic information linked to these variations is reported.
Phenotypes_Gene phenotypic information associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
Interactions_Count the number of reported interactions involving proteins produced by the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
KEGG KEGG pathways associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
Overlapping_Protein_Features protein features from the model species gene, obtained from UniProt, that overlap with the position of the relevant amino acid. The value for this key is not calculated if the -no_uniprot option is specified.
Comments
provides additional information about the SNP in the form of key-value pairs. The contents of this field are likely to change as NGS-SNP is updated with new annotation capabilities. By placing new annotations in this field the output format of NGS-SNP can remain relatively stable as the program is updated.
Genotype genotype information parsed from the input VCF file.
Gene_Biotype a gene classification to indicate the biological significance of the gene (e.g. whether it is protein coding or a pseudogene).
Transcript_Biotype a transcript classification to indicate the biological significance of the transcript (e.g. whether it is protein coding or a pseudogene).
Gene_Status a "Known" gene has at least one transcript with a sequence match in a sequence repository external to Ensembl for the same species. "Known by Projection" refers to genes that are homologous, based on Ensembl comparative analysis, to genes with "Known" status in another species (usually human genes). A "Novel" gene contains only transcripts that have a sequence match outside Ensembl for an alternate species. "Merged" indicates that the gene has at least one merged (gold) transcript. A merged transcript is a case where an identical sequence to the Ensembl prediction has been determined by the VEGA/Havana project. Two distinctions are made. If the coding sequence (CDS) is the same, but the UTR differs, the link to the Havana ID (OTT...) will read: "Havana transcript having same CDS". If the entire transcript (both UTR and CDS) is the same in Havana and Ensembl, the link to the Havana transcript will read: "Transcript having exact match between Ensembl and Havana". Merged genes are only available for human, mouse, and zebrafish.
Transcript_Status a "Known" transcript has a sequence match in a sequence repository external to Ensembl for the same species. A "Novel" transcript has a sequence match outside Ensembl for an alternate species.
Distance_From_Transcript the distance in bases from the INDEL to the transcript. A qualifier further describing the location of the INDEL relative to the transcript is given in parentheses. Distances are reported when the INDEL is upstream or downstream of the transcript.
Protein_Length_Decrease gives the length in amino acids of the protein segment that is lost because of an allele that introduces a stop codon (i.e. functional class is 'stop_gained'). The value given in parentheses is the length of the lost protein segment expressed as a percentage of the length of the reference protein.
Protein_Sequence_Lost the peptide sequence removed from the reference sequence because of an allele that introduces a stop codon (i.e. functional class is 'stop_gained').
BLAST_Protein_Sequence_Lost the most significant BLAST E-value obtained when the 'Protein_Sequence_Lost' is compared to proteins from other species using blastp. The intention of this search is to help identify 'stop_gained' SNPs that remove an evolutionarily conserved protein sequence segment. The search is done remotely using NCBI. The results are given in the form 'E-value(hit description)(hit accession)', or as 'no hit'. The value for this key is not calculated if the -no_blast option is specified.
Protein_Length_Increase gives the length in amino acids of the protein segment that is gained because of an allele that removes a stop codon (i.e. functional class is 'stop_lost'). The value given in parentheses is the length of the gained protein segment expressed as a percentage of the length of the reference protein.
Protein_Sequence_Gained the peptide sequence added to the reference sequence because of an allele that removes a stop codon (i.e. functional class is 'stop_lost').
BLAST_Protein_Sequence_Gained the most significant BLAST E-value obtained when the 'Protein_Sequence_Gained' is compared to proteins from other species using blastp. The intention of this search is to help identify 'stop_lost' SNPs that add an evolutionarily conserved protein sequence segment. The search is done remotely using NCBI. The results are given in the form 'E-value(hit description)(hit accession)', or as 'no hit'. The value for this key is not calculated if the -no_blast option is specified.
Protein_OMES_Orthologues reports the residues in the reference sequence that are most mutationally correlated with the altered residue, as determined using the Observed Minus Expected Squared (OMES) method described in Fodor and Aldrich (2004 Proteins 56(2): 211-221). Orthologous protein sequences are obtained from Ensembl and aligned with the reference using Muscle. All column pairs in the resulting multiple alignment are then compared to identify correlated residues. The output for each correlated residue is in the form 'score(X)(Y)' where 'score' is the raw OMES score, 'X' is the position of the SNP-altered residue in the reference sequence, and 'Y' is the position of the correlated residue in the reference sequence. 'Y' may sometimes be given as '-', which indicates that the correlated position detected in the multiple alignment is absent from the reference sequence. The value for this key is only calculated if the -run_omes option is specified.
Reference_Splice_Site the sequence of the splice site that is altered by the SNP. The splice site bases (i.e. the first two and last two bases in the intron) are given as they appear on the sense strand of the reference sequence. This value is reported when the functional class is 'splice_donor_variant' or 'splice_acceptor_variant'.
Variant_Splice_Site the sequence of the splice site that is altered by the SNP. The splice site bases (i.e. the first two and last two bases in the intron) are given as they appear on the sense strand of the variant sequence. This value is reported when the functional class is 'splice_donor_variant' or 'splice_acceptor_variant'.
Distance_From_Transcript the distance in bases from the SNP to the transcript. A qualifier further describing the location of the SNP relative to the transcript is given in parentheses. Distances are reported when the SNP is upstream or downstream of the transcript.
SIFT_Prediction a prediction of whether or not the amino acid change is deleterious. This score is calculated using the approach described by Ng and Henikoff (2001 Genome Research 11(5): 863-874), using an alignment of orthologous sequences. The prediction is given in the form 'prediction(score)' where 'prediction' is the SIFT prediction regarding the effect of the change (for example 'TOLERATED' or 'DAMAGING'), and 'score' is the raw SIFT score. The SIFT score ranges from 0 to 1. The prediction is given as 'DAMAGING' when the score is <= 0.05, and 'TOLERATED' when the score is > 0.05. The SIFT_Prediction value is given for SNPs of function class 'missense_variant' or 'initiator_codon_variant'. A prediction may not be made for some SNPs if there are insufficient orthologous sequences available. The value for this key is not calculated if the -no_sift option is specified. Note that the SIFT_Prediction value can be determined for any species (not just human). If a SIFT prediction is already available in Ensembl, the prediction from Ensembl is used instead of performing the calculation locally. The prediction from Ensembl, if available, is given as the 'SIFT_Prediction_Ensembl' value.
SIFT_Prediction_Ensembl a prediction of whether or not the amino acid change is deleterious. The format is 'prediction(score)'. See the SIFT web site for more information. This value is obtained from Ensembl and may not be available for all species. The value for this key is not provided if the -no_sift option is specified.
PolyPhen_Prediction_Ensembl a prediction of whether or not the amino acid change is deleterious. The format is 'prediction(score)'. See the PolyPhen web site for more information. This value is obtained from Ensembl and may not be available for all species. The value for this key is not provided if the -no_sift option is specified.
Regulatory_Features features such as transcription factor binding sites that overlap with the SNP position.
Known_SNP_Allele_Frequencies the frequencies of the alleles described in the 'Chromosome_Reads' field, in genotyped populations. This information is available for some known SNPs. Frequencies are given in the form 'allele/frequency/population'.
Known_SNP_Genotype_Frequencies the frequencies of genotypes containing the alleles described in the 'Chromosome_Reads' field, in genotyped populations. This information is available for some known SNPs. Frequencies are given in the form 'allele1/allele2/frequency/population'.
Other_Consequences additional functional consequences for this variant. Multiple consequences can be predicted, for example, when sequence features overlap. Values are given in the form 'function_class(gene_id,transcript_id,protein_id)' where the gene, transcript, and protein IDs are those obtained from Ensembl and indicate which features may be affected by the consequence.
Ref_SNPs
refSNP (rs) IDs of known SNPs at the same position and sharing at least one allele with the SNP.
Is_Fully_Known
whether or not the alleles in the input SNP are fully described by known refSNPs.

Output columns in flanking sequence files

The flanking sequence file consists of two columns:

SNP_ID
a name assigned to the SNP by NGS-SNP. The name consists of four values separated by underscores: chromosome, position, base on the forward strand in the reference sequence, and the single base representation of the alleles observed at the SNP site, as they would appear on the forward strand.
Sequence
the flanking sequence of the SNP. The SNP alleles are given in square brackets, in the middle of the sequence.

Directories and files produced when a directory name is supplied using the -split option

Annotated SNPs and flanking sequences grouped by functional class
a directory called "All" is created inside of the -split directory. Annotated SNPs and flanking sequences grouped by SNP functional class are added to the "All" directory. For example, all stop_gained SNPs can be found in the "All" directory, in one or more files with names beginning with "All_stop_gained" (multiple files for the same functional class are created when the number of SNPs in the functional class exceeds 100,000--each file is given a distinct numerical suffix). For each annotated SNP file there is a matching flanking sequence file.
Annotated SNPs and flanking sequences grouped by functional class and chromosome
directories named after chromosomes are created inside of the -split directory. Within these chromosome directories are SNPs and flanking sequences organized by chromosome and SNP functional class. For example, all missense SNPs on chromosome 1 can be found in the "1" directory, in one or more files with names beginning with "1_missense_variant".
SNP summary files
within the -split directory two files are created, called "chromosome_function_stats.txt" and "function_stats.txt". These files give the number of SNPs assigned to each functional class, and the number of known and novel SNPs. In the "chromosome_function_stats.txt" file the counts are given separately for each chromosome, whereas in the "function_stats.txt" file the entire dataset is summarized. Note that if the -all option is used the interpretation of these files differs, since more than one output SNP can be generated for each input SNP. For example, a single input SNP might generate a stop_gained SNP and an intron_variant SNP when the -all option is used. Both of these SNP consequences will be counted in the summary files. When the -all option is not used, only one consequence is reported for each input SNP.

Tips

Usage

USAGE:
perl annotate_SNPs.pl [-arguments]

-i [STRING]        : file containing SNPs to annotate. The script will attempt
                     to determine the input file format automatically, unless
                     the -format option is used to indicate format (Required).
-o [FILE]          : the output file to create (Required).
-s [STRING]        : the source species in the form genus_species (e.g. 
                     'homo_sapiens', 'bos_taurus', 'sus_scrofa'). To view the
                     complete list of available species use the -show_species
                     option (Required).
-f [FILE]          : write genomic flanking sequence for each SNP to this 
                     file (Optional).
-n [STRINGS]       : chromosome name changes to make when querying Ensembl. 
                     For example, 'chr30=chrX' causes all instances of 
                     'chr30' in the input file to be treated as 'chrX' 
                     for Ensembl searching. To view the chromosome names
                     recognized by Ensembl use the -show_chr option 
                     (Optional).
-d [INTEGER]       : amount of flanking genomic sequence to write on each 
                     side of SNPs when -f option is used (Optional; default 
                     is 100 bases).
-c                 : begin by counting the number of records already present 
                     in the output file and skip that number of input 
                     records. This option can be useful if the script was 
                     disrupted during a previous run. Do not use if the -all 
                     option was used when creating the existing output file.
                     Use of this option is no longer recommended--use the -p
                     option instead (Optional).
-p [FILE]          : file for tracking annotation progress. If the file does 
                     not exist, the script will create the file and use it to
                     keep track of which records have been annotated as
                     annotation progresses. If the file already exists, the
                     script will parse it to determine where the annotation 
                     process last ended, so that it can continue. Use this
                     option so that the annotation can continue with the first
                     unannotated SNP in the event of a disrupted network
                     connection or error (Optional).
-L [FILE]          : log file to contain additional SNP information and 
                     processing messages (Optional).
-v                 : provide verbose output (Optional).
-sk [FILE]         : file to contain list of input SNPs that were skipped
                     because they could not be processed (Optional).
-header            : input file contains a line of column titles that does 
                     not start with '#' (Optional).
-split [DIRECTORY] : split up the results based on chromosome and functional 
                     class and write to this directory (Optional).
-format [STRING]   : format of input file ('sam', 'maq', 'dibayes', 'generic',
                     or 'vcf') (Optional; default is to determine format from
                     the input).
-compara_db [STRING] : the name of the compara database to be used. The value 
                     'Multi' should be specified unless the SNPs to be 
                     annotated are from a species only available in one of the 
                     specialized 'Ensembl Genomes' databases, as 
                     'Ensembl Fungi'. When an 'Ensembl Genomes' species is
                     used, the supported -compara_db values are the following:
                     'bacteria', 'fungi', 'metazoa', 'pan_homology', 'plants',
                     and 'protists' (Optional; default is 'Multi').
-model [STRING]    : the model species to use when filling in the 
                     'Model_Annotations' column in the output. The model
                     species can be the same as the source species. See the -s
                     option for information on species name format and on 
                     available species. (Optional; default value is 
                     'homo_sapiens').
-matrix [FILE]     : the scoring matrix file to use for amino acid 
                     comparisons (Optional; default is to use 
                     'annotate_SNPs/data/blosum62.mat').
-omes [FILE]       : the location of the OMES_score.pl script (Optional; 
                     default is to locate automatically).
-sift [FILE]       : the location of the SIFT_prediction.pl script (Optional; 
                     default is to locate automatically).
-ncbi [FILE]       : the location of the ncbi_search.pl script (Optional; 
                     default is to locate automatically).
-uniprot [FILE]    : the location of the ebi_fetch.pl script (Optional; 
                     default is to locate automatically).
-blast [FILE]      : the location of the remote_blast_client.pl script 
                     (Optional; default is to locate automatically).
-run_omes          : perform the calculations used to fill in the 
                     'Protein_OMES_Orthologues' values in the 'Comments' field. 
                     Using this option significantly slows down the annotation 
                     process for protein-altering SNPs (Optional; default is to 
                     skip these calculations).
-no_sift           : do not perform the annotation steps needed to determine
                     SIFT and PolyPhen values in the 'Comments' field.
                     (Optional; default is to perform these steps).
-no_ncbi           : do not perform annotation steps that involve retrieving
                     data from NCBI. Using this option will slightly speed up
                     the annotation process but will decrease the amount of
                     information provided in the 'Model_Annotations' column.
                     Do not run multiple instances of this script when not 
                     using this option (Optional; default is to perform steps 
                     that require data from NCBI).
-no_uniprot        : do not perform annotation steps that involve retrieving
                     data from UniProt. Using this option will slightly speed 
                     up the annotation process but will decrease the amount of 
                     information provided in the 
                     'Overlapping_Protein_Features' and 'Model_Annotations' 
                     columns. Do not run multiple instances of this script
                     when not using this option (Optional; default is to 
                     perform steps that require data from UniProt).
-no_blast          : do not perform annotation steps that involve performing
                     remote BLAST searches. Using this option will slightly
                     speed up the processing of SNPs that remove or add
                     protein sequence, but will prevent some values in the
                     'Comments' column from being determined. Do not run 
                     multiple instances of this script when not using this 
                     option (Optional; default is to perform steps that
                     require remote BLAST searches).
-cf [INTEGER]      : amount of flanking protein sequence on either side of 
                     SNP-affected residue to use for determining conservation
                     of region containing coding SNP (Optional; default is 10
                     residues).
-all               : return all consequences for each SNP. A single input SNP 
                     can have multiple consequences for a variety of reasons 
                     (one of the alleles may lead to a synonymous change while 
                     another introduces a stop codon; a single SNP may be 
                     located in the exon of one gene and the intron of 
                     another). The -all option causes each consequence to be 
                     described by a separate row in the output file. 
                     (Optional; default is to try to return the most severe 
                     consequence for each SNP).
-status [STRING]   : the types of genes and transcripts to be used during 
                     annotation. Values "known" or "novel" can be specified. 
                     (Optional; default is to use all genes and transcripts).
-df                : indicate known SNP sites within flanking sequence as 
                     lowercase IUPAC DNA bases (Optional; default is to 
                     return unmodified reference sequence).
-assembly [STRING] : the name of the assembly that was used as the reference
                     for SNP discovery, for example, 'Btau_4.0' or 'NCBI34'.
                     If the name supplied using this option does not match
                     the name of the default assembly available in Ensembl 
                     an attempt is made to convert the SNP locations and
                     alleles to the Ensembl assembly. To view the assembly
                     names recognized, use the -show_assem option (Optional; 
                     default is to assume the default Ensembl assembly was 
                     used for SNP discovery).
-mismatch [INTEGER]: program behaviour when a reference allele in the input 
                     file disagrees with the genome assembly in Ensembl. 1 = 
                     exit with error; 2 = use the reference allele from the 
                     Ensembl assembly. Such mismatches can arise when the 
                     -assembly option is used to convert the SNP locations 
                     and alleles to the Ensembl assembly (Optional; default 
                     is 1).
-host [STRING]     : the Ensembl database host (Optional; default is 
                     'ensembldb.ensembl.org').
-user [STRING]     : the Ensembl database user (Optional; default is 
                     'anonymous').
-pass [STRING]     : the Ensembl database password (Optional; default is '').
-port [INTEGER]    : the TCP port to use when connecting to the Ensembl 
                     database (Optional; default is to let the script choose 
                     the port).
-ann_db [DIRECTORY]: create or use flat file databases in this directory for
                     storing the annotated SNPs and flanking sequences 
                     generated by this script, so that future SNPs located at
                     the same position and involving the same alleles can be
                     annotated more quickly. The same directory can always be 
                     specified regardless of the species and other options 
                     being used--previously calculated annotations and 
                     flanking sequences are only applied to a new SNP when 
                     appropriate (i.e. same species, alleles, etc.). This 
                     option is particularly useful when numerous SNPs from 
                     different individuals of the same species are to be
                     annotated by repeated use of this script.
-help              : provide list of arguments and exit (Optional).
-show_species      : show the species values that can be used with the -s, 
                     -cs, and -model options and then exit. The -host option
                     should also be given, unless the default host is 
                     to be used for annotation (Optional).
-show_assem        : show the assembly values that can be used with the 
                     -assembly option and then exit. The -s option is 
                     required with this option, to specify the species 
                     (Optional).
-show_chr          : show the chromosome names recognized by Ensembl and then
                     exit. This information is useful when using the -n
                     option. The -s option is required with this option, to
                     specify the species (Optional).

perl annotate_SNPs.pl -i snps.tab -o results.tab -s bos_taurus \
-n 'chr30=chrX' 'chr31=chrY' -p progress.txt

perl annotate_SNPs.pl -show_species

perl annotate_SNPs.pl -s homo_sapiens -show_assem

perl annotate_SNPs.pl -s homo_sapiens -show_chr

perl annotate_SNPs.pl -host mysql.ebi.ac.uk -user anonymous -port 4157 \
-compara_db fungi -show_species

perl annotate_SNPs.pl -host mysql.ebi.ac.uk -user anonymous -port 4157 \
-i snps.tab -o out.tab -compara_db fungi -s schizosaccharomyces_pombe