Supplementary Table 1. Comparison of SEQSIM with representative alignment-free (AF) and pairwise similarity methods. Methods are compared by algorithmic approach, emphasis on contiguous sequence matches, output format, typical use cases, and relation to SEQSIM. The comparison highlights SEQSIM's unique ability to detect long, highly homologous regions in promoter sequences at genome scale.

Method

Algorithmic Approach

Contiguous Match Emphasis

Output

Typical Use Cases

Relation to SEQSIM

SEQSIM

Needleman-Wunsch-inspired scoring without dynamic programming; rewards long matches.

High: prioritizes extended high-identity regions

Pairwise similarity index (%)

Promoter / promoter comparisons across genome

N/A (reference method)

N2 (1)

k-mer frequencies with mismatch neighborhoods (word vector correlation)

Low: counts shared k-mers (mismatches allowed) and does not require contiguity

Similarity/distance matrix (% similarity)

Enhancer/promoter clustering; motif content comparison

Broad motif-level similarity; complements SEQSIM by finding functional similarity even without long exact matches. Code is no longer available through the publication.

ACS (2)

Longest exact common substrings (averaged length)

High: requires exact contiguous matches (no mismatches)

Distance or similarity (%) based on average match length

Closely related sequences; genomic phylogeny when moderate conservation exists

Limiting case of SEQSIM (no mismatches tolerated). Could be used to check whether SEQSIM's similarities arise from exact runs. Less useful when sequences carry mutations.

KMACS (3)

Longest common substrings allowing k mismatches (k-mismatch ACS)

High: rewards long, nearly contiguous regions (few mismatches allowed)

Distance matrix (alignment-free phylogenetic distance)

Sequence sets with some divergence; phylogeny of genes/genomes with mutations

Appears comparable to SEQSIM: both highlight long high-identity regions. The source code is no longer available at http://kmacs.gobics.de/, so a sample run was not possible.

FSWM (4)

Filtered spaced-word matches (pattern-based ungapped alignment)

High: finds local alignments under a spaced mask (tolerates mismatches at don't-care positions)

Distance matrix or dendrogram (phylogenetic)

Whole-genome or gene set comparisons; finds homologous regions with substitutions

Same target as SEQSIM (local homology) but a different technique. A good cross-check for SEQSIM-detected segments, especially those with mismatches. No longer available at http://fswm.gobics.de/

Clustal Omega (5)

Not alignment-free: performs MSA using guide trees and HMM profile alignments

Medium-High: alignment-based; rewards global similarity with gaps allowed

Multiple sequence alignment and pairwise percent identity matrix

Protein and nucleotide MSAs; phylogenetics; sequence homology visualization

Not alignment-free, but its percent identity matrix offers a familiar comparison format. SEQSIM's similarity matrix resembles Clustal Omega's output both visually and numerically (Figure 4), which justifies using Clustal Omega as a reference baseline. Of all the methods listed, Clustal Omega is the most widely documented and cited.

MatGAT (6)

Pairwise global alignments (PAM/BLOSUM scoring), no MSA required

Medium-High: scores extended similarity using scoring matrices

Similarity/identity matrix (%)

Small sets of DNA/protein sequences without MSA

Closest in spirit to SEQSIM: similar output and pairwise comparison, but not very scalable and not truly AF (see Supplementary Figure 3).

Mash (7)

MinHash sketch of k-mers (Jaccard similarity of k-mer sets)

Medium: long exact matches yield many shared k-mers, but mismatches break k-mers

Distance or similarity (% identity estimate, with p-value)

Massive comparisons (10k+ sequences); fast clustering, database search

Fast screening tool; shows overall patterns similar to SEQSIM's, but with some major differences (see Supplementary Figure 3) that stem from the different algorithms.
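
To illustrate the contiguous-match emphasis column, the following minimal Python sketch contrasts how two exact-match measures respond to a single substitution. It is not the ACS or Mash implementation: `average_common_substring` is a naive quadratic version of the average longest-match length that underlies ACS (2), and `jaccard` computes the exact Jaccard index of k-mer sets, which Mash (7) estimates from MinHash sketches rather than computing exactly. The function names, the toy sequences, and the choice k = 4 are assumptions made for illustration only; `mash_distance` applies the Mash distance formula D = -(1/k) ln(2j / (1 + j)).

```python
import math


def average_common_substring(a: str, b: str) -> float:
    """Mean, over start positions i in a, of the longest prefix of a[i:] found in b.

    Naive illustration only; real ACS tools use suffix-based indexes.
    """
    if not a:
        return 0.0
    total = 0
    for i in range(len(a)):
        length = 0
        # Extend the match while a[i:i+length+1] still occurs somewhere in b.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)


def kmer_set(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard index of the two k-mer sets (Mash estimates this value)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return len(ka & kb) / len(union)


def mash_distance(j: float, k: int) -> float:
    """Mash distance from a Jaccard index j; returns 1.0 when no k-mers are shared."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    ref = "ACGTTGCAAT"      # hypothetical toy sequence
    mutated = "ACGTAGCAAT"  # same sequence with one substitution (T -> A)
    print(average_common_substring(ref, ref), average_common_substring(ref, mutated))
    j = jaccard(ref, mutated, k=4)
    print(j, mash_distance(j, k=4))
```

For the toy pair above, the single substitution lowers the average match length from 5.5 to 2.6 and the 4-mer Jaccard index from 1 to 3/11, which is the sensitivity to mismatches summarized for ACS and Mash in the table.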


Supplementary Table References:

1. Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics. 2012 Mar 1;28(5):656–63.

2. Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006 Mar;13(2):336–50.

3. Leimeister CA, Morgenstern B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014 Jul 15;30(14):2000–8.

4. Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics. 2017 Apr 1;33(7):971–9.

5. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11;7:539.

6. Campanella JJ, Bitincka L, Smalley J. MatGAT: An application that generates similarity/identity matrices using protein or DNA sequences. BMC Bioinformatics. 2003 Jul 10;4(1):29.

7. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132.