This script generates three figures of bar plots showing the distribution of INDELs by length: one for all INDELs, one for deletions, and one for insertions. Each figure also has a separate plot for the subset of INDELs occurring in coding regions (CDS).
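The kind of tally behind such plots can be sketched in a few lines of Python. This is only an illustration, not the script's own code: the function name `length_distributions` and the input shape (pairs of reference and alternate allele strings) are assumptions made for the example and do not reflect the annotation file format.

```python
from collections import Counter


def length_distributions(indels):
    """Tally INDEL lengths for all INDELs, insertions, and deletions.

    `indels` is an iterable of (ref, alt) allele strings. This input shape
    is an assumption for illustration, not the script's annotation format.
    """
    counts = {"all": Counter(), "insertions": Counter(), "deletions": Counter()}
    for ref, alt in indels:
        diff = len(alt) - len(ref)
        if diff == 0:
            continue  # same length: not an INDEL, so skip it
        length = abs(diff)
        counts["all"][length] += 1
        # A longer alternate allele is an insertion, a shorter one a deletion.
        counts["insertions" if diff > 0 else "deletions"][length] += 1
    return counts


# Made-up alleles: one 3 bp insertion, two 3 bp deletions, one 1 bp insertion.
example = [("A", "ATTT"), ("ACGT", "A"), ("G", "GC"), ("TAAA", "T")]
dist = length_distributions(example)
for category, counter in dist.items():
    print(category, sorted(counter.items()))
```

Each of the three counters corresponds to one figure; restricting the input to INDELs in coding regions would give the CDS subset.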

Previous studies have shown that INDELs whose length is a multiple of three (3n) are enriched in coding regions (first described in the Nature Genetics paper "Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing"). This enrichment arises because 3n INDELs preserve the reading frame.
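The reading-frame argument comes down to a modulo-3 check on INDEL length. The following Python sketch is illustrative only; the helper names `preserves_reading_frame` and `fraction_in_frame` and the sample lengths are hypothetical and are not taken from the script or from the cited study.

```python
def preserves_reading_frame(indel_length):
    """Return True if an INDEL of this length keeps codons in frame."""
    return indel_length % 3 == 0


def fraction_in_frame(lengths):
    """Fraction of INDEL lengths that are multiples of three (3n)."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    in_frame = sum(1 for n in lengths if preserves_reading_frame(n))
    return in_frame / len(lengths)


# Made-up INDEL lengths for a set of coding-region INDELs.
cds_lengths = [3, 3, 6, 1, 2, 9]
cds_fraction = fraction_in_frame(cds_lengths)
print(f"3n fraction in CDS: {cds_fraction:.2f}")
```

Comparing this fraction between coding-region INDELs and all INDELs is one simple way to see the enrichment that the bar plots display.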


See setup instructions for NGS-SNP.

In addition, this script requires R. If you are using the NGS-SNP virtual machine, R is already installed; otherwise, see the R site for installation instructions.


Usage: perl 
Arguments:
       -i [FILE] : input INDEL annotation file (Required).
       -o [DIRECTORY] : output folder (Required).
       -c [INT] : cutoff for INDEL length (default is 12).
       -w [INT] : width of figure (default is 1400).
       -l [INT] : height of figure (default is 700).
       -r [FILE] : the location of the Length_Distribution_Plot.R script (Optional;
                     default is to locate automatically).
example: perl -i indels.vcf.annotated -o test 


An INDEL annotation file as created by the script.


Four files will be generated in the output directory: