Understanding Visual Similarity in Domains
Homograph attacks use domain names that look visually similar to legitimate domains to trick users into visiting fraudulent sites. For example, "rn" looks similar to "m", and attackers exploit this confusion. Calculating visual similarity helps identify and defend against these attacks.
Understanding visual similarity metrics enables organizations to protect users, register protective variants, and detect spoofing attempts early.
Common Visual Similarity Techniques
1. Character Substitution (Homograph Attacks)
Visually similar characters:
l(lowercase L) vs.I(uppercase i) vs.1(one)0(zero) vs.O(uppercase O)rnvs.mvs.rnclvs.dɪ(Latin letter) vs.i(ASCII i)
Examples:
Legitimate: amazon.com
Attacks:
- amaz0n.com (zero instead of o)
- amzon.com (missing 'a', looks similar)
- αmazon.com (Greek alpha instead of 'a')
- аmаzon.com (Cyrillic 'a' instead of ASCII 'a')
2. Internationalized Domain Names (IDN)
Using non-ASCII characters:
English: example.com
Cyrillic lookalike: exаmple.com (Cyrillic 'a' U+0430 instead of ASCII 'a')
Greek lookalike: εхаmple.com (mixed Greek and Cyrillic)
Unicode homoglyphs:
- Look identical to users but different characters
- Different character codes
- Registered as different domains
- Often intentional attack vectors
3. Domain Structure Manipulation
Exploiting domain structure:
Legitimate: example.com
Attacks:
- exаmple.co.m (split with dots)
- exаmple-real.com (add legitimate-looking suffix)
- exаmple.com.fake (add TLD to legitimate domain)
- subdomain.exаmple.com (looks like subdomain)
Metrics for Calculating Visual Similarity
1. Levenshtein Distance
Measures character-level differences between strings.
Algorithm:
- Count minimum edits needed to transform one string to another
- Edit types: insert, delete, substitute
- Lower distance = more similar
Example:
"example" vs. "exаmple" (one Cyrillic character)
Distance: 1 (one substitution needed)
"amazon" vs. "amаzоn" (two substitutions)
Distance: 2
"example" vs. "amazon"
Distance: 3 (e→a, x→m, l→z)
Threshold:
- Distance ≤ 2: Suspicious, likely homograph attack
- Distance ≤ 1: Very suspicious, definitely investigate
2. Damerau-Levenshtein Distance
Enhanced Levenshtein distance allowing transpositions.
Includes:
- Insertions
- Deletions
- Substitutions
- Transpositions (character swaps)
Example:
"example" vs. "exmaple" (transposed 'a' and 'm')
Damerau-Levenshtein: 1 (one transposition)
3. Jaro-Winkler Similarity
Measures similarity as a number 0-1.
How it works:
- Considers matching characters at different positions
- Weights matches based on position
- Higher values = more similar
Example:
"example.com" vs. "exаmple.com"
Similarity: 0.995 (nearly identical, one different character)
0.9+ = Suspicious similarity
0.95+ = Likely homograph attack
4. Visual Character Confusion Matrix
Classifies characters by visual similarity:
| Group | Characters |
|---|---|
| Group 1 | l, I, 1, |
| Group 2 | 0, O (looks circular) |
| Group 3 | m, rn, ɪ (similar shape) |
| Group 4 | a, ɑ, α, а (various 'a' characters) |
| Group 5 | e, е, ё (various 'e' characters) |
| Group 6 | o, ο, о, ӧ (various 'o' characters) |
Similarity score:
- Same group: High similarity
- Different group: Low similarity
5. Unicode Confusable Characters
Based on Unicode Confusables list:
- Maintained by Unicode Consortium
- Lists characters that look identical
- Used for security checks
Example confusables:
U+0041 (ASCII A) ≈ U+0391 (Greek Alpha)
U+0435 (Cyrillic 'e') ≈ U+0065 (ASCII 'e')
U+043E (Cyrillic 'o') ≈ U+006F (ASCII 'o')
Check using Unicode Confusables API:
Input: а (Cyrillic, U+0430)
Confusable with: a (ASCII, U+0061)
Risk: High (visual homograph)
Practical Homograph Detection
1. Domain Registration Monitoring
Detect registered homographs:
from difflib import SequenceMatcher
def similarity_ratio(a, b):
return SequenceMatcher(None, a, b).ratio()
# Monitor registrations
new_domain = "amаzоn.com" # Cyrillic 'a' and 'o'
protected_domain = "amazon.com"
if similarity_ratio(new_domain, protected_domain) > 0.95:
print("ALERT: Homograph domain registered!")
print(f"Similarity: {similarity_ratio(new_domain, protected_domain)}")
2. Email Address Spoofing Detection
Identify spoofed email addresses:
Legitimate: [email protected]
Spoofed: jоhn.smith@еxamplе.com
(Cyrillic о, е, е, е)
Detection:
- Compare with levenshtein distance
- Calculate visual similarity
- Flag for user verification
3. User Input Validation
Warn users about similar domains:
// When user types URL
function checkVisualSimilarity(userInput) {
const knownDomains = [
'amazon.com', 'apple.com', 'google.com'
];
for (let known of knownDomains) {
const distance = levenshteinDistance(
userInput.toLowerCase(),
known
);
if (distance <= 2) {
showWarning(
`Did you mean ${known}? ` +
`You typed something very similar.`
);
}
}
}
Advanced Similarity Calculations
Visual Homoglyph Matrix
Create matrix of visually similar characters:
Matrix:
l: [I, 1, |]
o: [0, O, ο, О, о]
a: [α, а, ɑ]
e: [е, ё, ε]
m: [rn, ɪ, м]
n: [и, η, ո]
Similarity score:
- If characters in same group: +0.8 similarity
- Each character position: weight by importance
- Calculate overall domain similarity
Contextual Similarity
Consider domain context:
def contextual_similarity(domain1, domain2):
# Extract components
name1, tld1 = domain1.rsplit('.', 1)
name2, tld2 = domain2.rsplit('.', 1)
# Different TLDs = lower similarity
tld_similarity = 0.9 if tld1 == tld2 else 0.5
# Name similarity
name_distance = levenshtein_distance(name1, name2)
name_similarity = 1.0 - (name_distance / max(len(name1), len(name2)))
# Combined score
return (name_similarity * 0.8) + (tld_similarity * 0.2)
Tools for Visual Similarity Analysis
Online Tools
- Inventive HQ Domain Spoofing Detector
- Phishable.com - Tests domain spoofing
- Squatm0nkey - Domain typosquatting scanner
- SecurityTrails - Domain similarity analysis
Command-Line Tools
# Check confusable characters
python3 -m idna example.com
# Whois check for registered variants
whois amazon.com
whois amаzon.com # Cyrillic variant
# Unicode analysis
echo "amаzon.com" | od -An -tx1
# Shows byte representation
Programmatic Approaches
import unicodedata
from difflib import SequenceMatcher
def analyze_domain_similarity(suspicious, legitimate):
# Unicode normalization
sus_norm = unicodedata.normalize('NFKD', suspicious)
leg_norm = unicodedata.normalize('NFKD', legitimate)
# Basic similarity
similarity = SequenceMatcher(None, sus_norm, leg_norm).ratio()
# Check for confusable characters
for char in suspicious:
cat = unicodedata.category(char)
if cat.startswith('L'): # Letter category
print(f"Character {char}: {cat}")
return similarity
# Example
result = analyze_domain_similarity(
"amаzon.com", # Cyrillic 'a'
"amazon.com" # ASCII domain
)
print(f"Similarity: {result}")
Protective Measures Based on Similarity
1. Register Protective Variants
Preemptively register similar-looking domains:
Primary domain: amazon.com
Variants to protect:
- amаzon.com (Cyrillic 'a')
- аmazon.com (Cyrillic 'a' at start)
- amaz0n.com (zero instead of 'o')
- amazоn.com (Cyrillic 'o')
- am4zon.com (4 instead of 'a')
2. User Education
- Teach users to check address bar carefully
- Highlight domain in email clients
- Verify domain when suspicious
3. Email Authentication
- Implement DMARC strictly (p=reject)
- Use DKIM for email signatures
- Verify sender domain completely
4. Technical Detection
- Monitor domain registrations
- Set alerts for similar domain registration
- Check certificate transparency logs
- Monitor phishing databases
Real-World Homograph Examples
Cyrillic 'a' Attack (U+0430)
Original: ebay.com
Attack: еbay.com (Cyrillic е)
Looks identical to users
Completely different domain
Confused 'o' and '0'
Original: microsoft.com
Attack: micr0s0ft.com (zeros instead of o's)
Typosquatting variation
Users might not notice
Greek Letter Substitution
Original: paypal.com
Attack: payρal.com (Greek rho instead of 'p')
Extremely similar appearance
Easy to miss
Conclusion
Calculating visual similarity between domains enables detection and prevention of homograph attacks. By understanding similarity metrics, implementing detection systems, and registering protective variants, organizations can:
- Identify spoofing attempts early
- Protect users from phishing
- Defend brand reputation
- Prevent credential theft
Whether using simple string distance calculations or advanced unicode confusable detection, visual similarity analysis is essential for modern phishing prevention and domain security.

