How do I calculate visual similarity between domains?

Domain names that look similar can be used in phishing attacks. Learn how to calculate visual similarity and detect homograph attacks.

By Inventive HQ Team

Understanding Visual Similarity in Domains

Homograph attacks use domain names that look visually similar to legitimate domains to trick users into visiting fraudulent sites. For example, "rn" looks similar to "m", and attackers exploit this confusion. Calculating visual similarity helps identify and defend against these attacks.

Understanding visual similarity metrics enables organizations to protect users, register protective variants, and detect spoofing attempts early.

Common Visual Similarity Techniques

1. Character Substitution (Homograph Attacks)

Visually similar characters:

  • l (lowercase L) vs. I (uppercase i) vs. 1 (one)
  • 0 (zero) vs. O (uppercase O)
  • rn vs. m
  • cl vs. d
  • ɪ (Latin small capital I, U+026A) vs. I (uppercase i)

Examples:

Legitimate: amazon.com
Attacks:
- amaz0n.com (zero instead of o)
- amzon.com (missing 'a'; a typosquat rather than a homograph, but easy to misread)
- αmazon.com (Greek alpha instead of 'a')
- аmаzon.com (Cyrillic 'a' instead of ASCII 'a')

2. Internationalized Domain Names (IDN)

Using non-ASCII characters:

English: example.com
Cyrillic lookalike: exаmple.com (Cyrillic 'a' U+0430 instead of ASCII 'a')
Greek lookalike: εхаmple.com (mixed Greek and Cyrillic)

Unicode homoglyphs:

  • Look identical to users but different characters
  • Different character codes
  • Registered as different domains
  • Often intentional attack vectors
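The points above can be checked directly. The following is a minimal sketch using only the Python standard library: it compares an ASCII domain with a lookalike, lists the code points that differ, and prints the ASCII (Punycode) form of the lookalike. The variable names are illustrative.

```python
import unicodedata

latin = "example.com"
lookalike = "ex\u0430mple.com"  # U+0430 CYRILLIC SMALL LETTER A replaces 'a'

# The strings render alike but are not equal
print(latin == lookalike)

# List every position where the code points differ
for i, (x, y) in enumerate(zip(latin, lookalike)):
    if x != y:
        print(i, f"U+{ord(x):04X}", unicodedata.name(x),
              "vs", f"U+{ord(y):04X}", unicodedata.name(y))

# The stdlib "idna" codec produces the xn-- (Punycode) form of the name
ascii_form = lookalike.encode("idna").decode("ascii")
print(ascii_form)
```

Logging the `xn--` form alongside the Unicode form makes lookalike registrations easier to spot in reports.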

3. Domain Structure Manipulation

Exploiting domain structure:

Legitimate: example.com
Attacks:
- exаmple.co.m (extra dot splits the TLD)
- exаmple-real.com (legitimate-looking suffix appended)
- exаmple.com.fake (legitimate domain used as a prefix of an attacker-controlled domain)
- subdomain.exаmple.com (subdomain of a lookalike domain)
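Structural tricks like these can be flagged by checking whether a protected name appears in a hostname that is not actually under the protected domain. The sketch below is a rough illustration: `registrable_domain` and `embeds_brand` are hypothetical helpers, and taking the last two labels is a simplifying assumption (production code should consult the Public Suffix List).

```python
def registrable_domain(host: str) -> str:
    """Naive guess at the registrable domain: the last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def embeds_brand(host: str, protected: str) -> bool:
    """True if the protected brand appears in a host outside the protected domain."""
    host = host.lower().rstrip(".")
    if registrable_domain(host) == protected:
        return False  # genuinely under the protected domain
    brand = protected.split(".")[0]
    return brand in host


for candidate in ["example.com.fake", "example-real.com",
                  "login.example.com", "other.org"]:
    print(candidate, embeds_brand(candidate, "example.com"))
```

A substring test is deliberately crude and will flag unrelated names that happen to contain the brand, so treat hits as candidates for review.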

Metrics for Calculating Visual Similarity

1. Levenshtein Distance

Measures character-level differences between strings.

Algorithm:

  • Count minimum edits needed to transform one string to another
  • Edit types: insert, delete, substitute
  • Lower distance = more similar

Example:

"example" vs. "exаmple" (one Cyrillic character)
Distance: 1 (one substitution needed)

"amazon" vs. "amаzоn" (two substitutions)
Distance: 2

"example" vs. "amazon"
Distance: 6 (unrelated names share few characters in order)

Threshold:

  • Distance ≤ 2: Suspicious, possible lookalike or typosquat (short names produce more false positives)
  • Distance ≤ 1: Very suspicious, investigate
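For reference, here is a minimal sketch of the edit-distance calculation using the standard dynamic-programming approach with a rolling row. The function name `levenshtein` is illustrative, and non-ASCII characters are written as escapes so the substitutions are visible.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of inserts, deletes, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("example", "ex\u0430mple"))      # one Cyrillic character
print(levenshtein("amazon", "am\u0430z\u043en"))   # two Cyrillic characters
```

Plain edit distance treats every substitution the same, so a Cyrillic lookalike and an obviously wrong letter both cost 1. It works best combined with a homoglyph-aware check.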

2. Damerau-Levenshtein Distance

Enhanced Levenshtein distance allowing transpositions.

Includes:

  • Insertions
  • Deletions
  • Substitutions
  • Transpositions (character swaps)

Example:

"example" vs. "exmaple" (transposed 'a' and 'm')
Damerau-Levenshtein: 1 (one transposition)
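The sketch below implements the restricted form of Damerau-Levenshtein, often called optimal string alignment (OSA). It counts a swap of two adjacent characters as one edit but never edits the same substring twice. The name `osa_distance` is illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent swap: "ma" <-> "am"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("example", "exmaple"))  # 1: a single transposition
```

Plain Levenshtein would score the same pair as 2, which is why the transposition-aware variant suits typo detection better.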

3. Jaro-Winkler Similarity

Scores similarity on a scale from 0 (no match) to 1 (identical).

How it works:

  • Counts characters that match within a limited distance of each other and penalizes transpositions
  • Boosts the score for strings that share a prefix (up to four characters)
  • Higher values = more similar

Example:

"example.com" vs. "exаmple.com"
Similarity: ≈0.95 (one different character, shared prefix "ex")
0.9+ = Suspicious similarity
0.95+ = Likely homograph attack
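As a rough illustration, the following sketch implements Jaro similarity and the Winkler prefix boost with the commonly used scaling factor of 0.1 and a four-character prefix cap. Some libraries apply the boost only above a minimum Jaro score, which this sketch omits. The function names are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they sit within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


# One Cyrillic 'a' (U+0430) in an 11-character domain: roughly 0.95
print(round(jaro_winkler("example.com", "ex\u0430mple.com"), 3))
```

Because the prefix boost rewards a matching start, a substitution near the beginning of a name lowers the score more than one near the end.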

4. Visual Character Confusion Matrix

Classifies characters by visual similarity:

Group    Characters
Group 1  l, I, 1, |, ɪ
Group 2  0, O (looks circular)
Group 3  m, rn (similar shape)
Group 4  a, ɑ, α, а (various 'a' characters)
Group 5  e, е, ё (various 'e' characters)
Group 6  o, ο, о, ӧ (various 'o' characters)

Similarity score:

  • Same group: High similarity
  • Different group: Low similarity
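One way to apply such groups is to fold every character to a canonical member of its group and then compare the folded strings. The sketch below assumes a small hand-picked mapping (not a complete table), and `skeleton` is an illustrative name. It folds display strings as written, so case is not normalized first.

```python
# Canonical character -> visually similar members (small illustrative subset)
GROUPS = {
    "l": ["l", "I", "1"],
    "o": ["o", "0", "O", "\u03bf", "\u043e"],  # Greek omicron, Cyrillic o
    "a": ["a", "\u0251", "\u03b1", "\u0430"],  # Latin alpha, Greek alpha, Cyrillic a
    "e": ["e", "\u0435"],                      # Cyrillic ie
}
CANONICAL = {ch: canon for canon, members in GROUPS.items() for ch in members}


def skeleton(domain: str) -> str:
    """Fold each character to its group representative, then fold 'rn' to 'm'."""
    folded = "".join(CANONICAL.get(ch, ch) for ch in domain)
    return folded.replace("rn", "m")


protected = "amazon.com"
for candidate in ["amaz0n.com", "\u0430m\u0430zon.com", "arnazon.com", "google.com"]:
    print(repr(candidate), skeleton(candidate) == skeleton(protected))
```

Two names with the same folded form belong to the same visual group at every position, which corresponds to the "high similarity" case above.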

5. Unicode Confusable Characters

Based on Unicode Confusables list:

  • Maintained by Unicode Consortium
  • Lists characters that look identical
  • Used for security checks

Example confusables:

U+0041 (ASCII A) ≈ U+0391 (Greek Alpha)
U+0435 (Cyrillic 'e') ≈ U+0065 (ASCII 'e')
U+043E (Cyrillic 'o') ≈ U+006F (ASCII 'o')

Check against the Unicode confusables data (confusables.txt):

Input: а (Cyrillic, U+0430)
Confusable with: a (ASCII, U+0061)
Risk: High (visual homograph)
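The Unicode Consortium publishes the confusables data as a text file of `source ; target ; type # comment` lines. The sketch below parses a few lines written in that format for illustration (they are embedded samples, not downloaded from the real file) and uses them to fold a domain. The function names are illustrative. The full skeleton algorithm in Unicode Technical Standard #39 also applies normalization, which is omitted here.

```python
# Sample lines in the confusables.txt layout (illustrative subset)
SAMPLE = """\
# source ; target ; type # comment
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0435 ; 0065 ; MA # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
03BF ; 006F ; MA # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
"""


def parse_confusables(text: str) -> dict:
    """Map each source sequence to its target sequence."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def to_skeleton(text: str, table: dict) -> str:
    """Replace each confusable character with its target (single characters only)."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_confusables(SAMPLE)
print(to_skeleton("\u0430m\u0430z\u043en.com", table))
```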

Practical Homograph Detection

1. Domain Registration Monitoring

Detect registered homographs:

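from difflib import SequenceMatcher

def similarity_ratio(a, b):
    # ratio() is 2*M/T: M matched characters, T total characters in both strings
    return SequenceMatcher(None, a, b).ratio()

# Monitor registrations
new_domain = "amаzоn.com"  # Cyrillic 'a' and 'o'
protected_domain = "amazon.com"

# Two swapped characters out of ten give a ratio of 0.8, so a 0.95
# threshold would miss this domain; 0.75 is an illustrative cutoff
score = similarity_ratio(new_domain, protected_domain)
if new_domain != protected_domain and score > 0.75:
    print("ALERT: Homograph domain registered!")
    print(f"Similarity: {score:.2f}")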

2. Email Address Spoofing Detection

Identify spoofed email addresses:

Legitimate: john.smith@example.com
Spoofed:    jоhn.smith@еxamplе.com
            (Cyrillic о and е in place of Latin o and e)

Detection:
- Compare with Levenshtein distance
- Calculate visual similarity
- Flag for user verification
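A complementary check is to flag addresses that mix scripts. Python's standard library does not expose the Unicode Script property, so this sketch approximates it with the first word of each character's Unicode name (for example LATIN or CYRILLIC). That is a heuristic assumption, not an official classification, and the function names are illustrative.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Approximate the set of scripts among the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def is_mixed_script(address: str) -> bool:
    return len(scripts_used(address)) > 1


legit = "john.smith@example.com"
spoofed = "j\u043ehn.smith@\u0435xampl\u0435.com"  # Cyrillic o and ie
print(is_mixed_script(legit), is_mixed_script(spoofed))
```

Mixed-script detection catches substitutions that edit distance alone cannot distinguish from ordinary typos, but it misses names written entirely in one lookalike script.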

3. User Input Validation

Warn users about similar domains:

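// Levenshtein distance using a rolling single-row table
function levenshteinDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// When user types URL (showWarning defaults to console.warn; swap in a UI hook)
function checkVisualSimilarity(userInput, showWarning = console.warn) {
  const knownDomains = [
    'amazon.com', 'apple.com', 'google.com'
  ];
  const input = userInput.trim().toLowerCase();

  for (const known of knownDomains) {
    const distance = levenshteinDistance(input, known);

    // Distance 0 is an exact match, so warn only on near misses
    if (distance > 0 && distance <= 2) {
      showWarning(
        `Did you mean ${known}? ` +
        `You typed something very similar.`
      );
    }
  }
}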

Advanced Similarity Calculations

Visual Homoglyph Matrix

Create matrix of visually similar characters:

Matrix:
  l: [I, 1, |, ɪ]
  o: [0, O, ο, О, о]
  a: [α, а, ɑ]
  e: [е, ё, ε]
  m: [rn, м]
  n: [и, η, ո]

Similarity score:

  • Characters in the same group: score 0.8 for that position (identical characters score 1.0)
  • Weight each character position by importance
  • Combine the per-position scores into an overall domain similarity
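The scoring idea can be sketched as follows, assuming equal weights for every position and comparing only names of the same length. The table is a small subset, and `char_similarity` and `visual_similarity` are illustrative names.

```python
# Base character -> lookalikes (subset of the matrix, written as escapes)
HOMOGLYPHS = {
    "l": {"I", "1", "|"},
    "o": {"0", "O", "\u03bf", "\u041e", "\u043e"},
    "a": {"\u03b1", "\u0430", "\u0251"},
    "e": {"\u0435", "\u0451", "\u03b5"},
}


def char_similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    if y in HOMOGLYPHS.get(x, ()) or x in HOMOGLYPHS.get(y, ()):
        return 0.8  # same visual group
    return 0.0


def visual_similarity(a: str, b: str) -> float:
    """Average per-position similarity; 0.0 when lengths differ (simplification)."""
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    return sum(char_similarity(x, y) for x, y in zip(a, b)) / len(a)


print(visual_similarity("amazon.com", "am\u0430zon.com"))  # homoglyph swap
print(visual_similarity("amazon.com", "amaxon.com"))       # ordinary typo
```

With this scoring, a homoglyph substitution ranks closer to the original than an ordinary typo, which plain edit distance cannot express.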

Contextual Similarity

Consider domain context:

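def levenshtein_distance(a, b):
    # Minimum number of single-character inserts, deletes, and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def contextual_similarity(domain1, domain2):
    # Extract components (assumes each domain contains at least one dot)
    name1, tld1 = domain1.rsplit('.', 1)
    name2, tld2 = domain2.rsplit('.', 1)

    # Different TLDs = lower similarity (1.0 when equal, so identical domains score 1.0)
    tld_similarity = 1.0 if tld1 == tld2 else 0.5

    # Name similarity
    name_distance = levenshtein_distance(name1, name2)
    name_similarity = 1.0 - (name_distance / max(len(name1), len(name2)))

    # Combined score
    return (name_similarity * 0.8) + (tld_similarity * 0.2)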

Tools for Visual Similarity Analysis

Online Tools

  • Inventive HQ Domain Spoofing Detector
  • Phishable.com - Tests domain spoofing
  • Squatm0nkey - Domain typosquatting scanner
  • SecurityTrails - Domain similarity analysis

Command-Line Tools

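# Show the Punycode (xn--) form of an internationalized domain
python3 -c "print('amаzon.com'.encode('idna').decode())"

# Whois check for registered variants
whois amazon.com
whois amаzon.com  # Cyrillic variant; some clients need the xn-- form instead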

# Unicode analysis
echo "amаzon.com" | od -An -tx1
# Shows byte representation

Programmatic Approaches

import unicodedata
from difflib import SequenceMatcher

def analyze_domain_similarity(suspicious, legitimate):
    # Unicode normalization (folds compatibility forms; it does NOT map
    # Cyrillic or Greek lookalikes to Latin)
    sus_norm = unicodedata.normalize('NFKD', suspicious)
    leg_norm = unicodedata.normalize('NFKD', legitimate)

    # Basic similarity
    similarity = SequenceMatcher(None, sus_norm, leg_norm).ratio()

    # Report non-ASCII letters, the usual carriers of homoglyphs
    for char in suspicious:
        if char.isalpha() and not char.isascii():
            name = unicodedata.name(char, 'UNKNOWN')
            print(f"Character {char!r}: U+{ord(char):04X} {name}")

    return similarity

# Example
result = analyze_domain_similarity(
    "amаzon.com",  # Cyrillic 'a'
    "amazon.com"   # ASCII domain
)
print(f"Similarity: {result:.2f}")

Protective Measures Based on Similarity

1. Register Protective Variants

Preemptively register similar-looking domains:

Primary domain: amazon.com
Variants to protect:
- amаzon.com (Cyrillic 'a')
- аmazon.com (Cyrillic 'a' at start)
- amaz0n.com (zero instead of 'o')
- amazоn.com (Cyrillic 'o')
- am4zon.com (4 instead of 'a')
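A list like this can be generated rather than written by hand. The sketch below expands a small, assumed substitution table into every combination for the name part of a domain. `SUBSTITUTES` and `variants` are illustrative names, and the table would need extending for real use.

```python
from itertools import product

# Character -> itself plus lookalikes (illustrative subset)
SUBSTITUTES = {
    "a": ["a", "\u0430", "4"],  # Cyrillic a, digit four
    "o": ["o", "\u043e", "0"],  # Cyrillic o, digit zero
}


def variants(domain: str) -> list:
    """All lookalike spellings of the name part, excluding the original."""
    name, _, tld = domain.rpartition(".")
    choices = [SUBSTITUTES.get(ch, [ch]) for ch in name]
    out = {"".join(combo) + "." + tld for combo in product(*choices)}
    out.discard(domain)
    return sorted(out)


candidates = variants("amazon.com")
print(len(candidates))
print([c for c in candidates if c.isascii()])
```

The number of combinations grows quickly with name length and table size, so in practice the output is usually filtered to the most plausible variants before registering or monitoring them.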

2. User Education

  • Teach users to check address bar carefully
  • Highlight domain in email clients
  • Verify domain when suspicious

3. Email Authentication

  • Implement DMARC strictly (p=reject)
  • Use DKIM for email signatures
  • Verify sender domain completely

4. Technical Detection

  • Monitor domain registrations
  • Set alerts for similar domain registration
  • Check certificate transparency logs
  • Monitor phishing databases

Real-World Homograph Examples

Cyrillic 'е' Attack (U+0435)

Original: ebay.com
Attack: еbay.com (Cyrillic е)
Looks identical to users
Completely different domain

Confused 'o' and '0'

Original: microsoft.com
Attack: micr0s0ft.com (zeros instead of o's)
Typosquatting variation
Users might not notice

Greek Letter Substitution

Original: paypal.com
Attack: payρal.com (Greek rho instead of 'p')
Extremely similar appearance
Easy to miss

Conclusion

Calculating visual similarity between domains enables detection and prevention of homograph attacks. By understanding similarity metrics, implementing detection systems, and registering protective variants, organizations can:

  • Identify spoofing attempts early
  • Protect users from phishing
  • Defend brand reputation
  • Prevent credential theft

Whether using simple string distance calculations or advanced Unicode confusable detection, visual similarity analysis is essential for modern phishing prevention and domain security.
