Why Create a Custom File Type Detection Database
Organizations often work with proprietary, custom, or non-standard file formats that the standard file magic number libraries don't recognize. A financial institution might have a custom trading data format. A media company might use specialized video containers. A healthcare provider might have custom patient record formats. An industrial company might have proprietary sensor data formats.
When these custom formats appear in unexpected locations, during breach investigations, or in suspicious contexts, being able to identify and analyze them is crucial. A custom file type detection database allows your organization to automatically identify these files during security analysis, forensic investigation, and data classification.
This guide walks you through creating and maintaining a custom file type detection database that integrates with your existing security tools and forensic workflows.
Understanding File Type Detection Fundamentals
Magic Numbers and File Signatures
Every file type has identifying characteristics. Most files start with a magic number—the first few bytes that uniquely identify the file type.
Standard magic numbers:
PDF: 25 50 44 46 "%PDF"
PNG: 89 50 4E 47 "‰PNG"
JPEG: FF D8 FF (plus more bytes)
ZIP: 50 4B 03 04 "PK.."
EXE: 4D 5A "MZ"
GIF: 47 49 46 38 "GIF8"
For custom formats, you need to identify the unique magic number that marks files of that type.
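The standard signatures above can be captured in a small lookup table. This is a minimal sketch (the table and function names are illustrative, not from any particular library):

```python
# Minimal magic-number lookup for the standard signatures listed above.
MAGIC_TABLE = {
    b"%PDF": "PDF",
    b"\x89PNG": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP",
    b"MZ": "EXE",
    b"GIF8": "GIF",
}

def identify(data: bytes) -> str:
    # Check longer signatures first so a long prefix is never shadowed
    # by a shorter one.
    for magic, name in sorted(MAGIC_TABLE.items(), key=lambda kv: -len(kv[0])):
        if data.startswith(magic):
            return name
    return "unknown"
```

For example, identify(b"%PDF-1.7") returns "PDF", while unrecognized bytes return "unknown". Production tools like libmagic do far more than this, but the core mechanism is the same prefix match.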
File Structures and Footers
Beyond magic numbers, files have internal structures and often have footer signatures marking the end:
- JPEG: starts with FF D8 FF, ends with FF D9
- ZIP: starts with 50 4B, contains a central directory, ends with 50 4B 05 06 (end of central directory record)
- PDF: starts with %PDF, ends with %%EOF
A robust file type detector uses both start and end signatures plus internal structure validation.
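Using JPEG as the example, combining the start and end signatures looks like this (a sketch; real validators also walk the internal marker structure):

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Validate a JPEG using both its header and footer signatures."""
    return (
        len(data) >= 5
        and data.startswith(b"\xff\xd8\xff")  # SOI marker plus next marker byte
        and data.endswith(b"\xff\xd9")        # EOI marker
    )
```

A file that merely begins with FF D8 FF but is truncated before the FF D9 footer fails this check, which is exactly the kind of corruption or partial carve you want to catch.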
Step 1: Identify Your Custom File Types
Inventory Custom Formats
First, document all custom file formats your organization uses:
- List all proprietary formats: What file types does your organization create?
- Collect samples: Gather example files of each type
- Document format specifications: Do you have format documentation? If not, you'll need to reverse-engineer it
- Determine file extensions: What extensions are typically used? (.dat, .custom, .prop, etc.)
- Document usage: What applications create/use these formats? On which systems?
Example: Custom Trading Data Format
An investment firm uses a custom .trd (Trading Data) format:
- Contains encrypted trading records
- Created by a proprietary trading platform
- Stored in the /data/trades/ directory
- Files typically range from 4 KB to 100 MB
- No public documentation available
- Occasionally found in unintended locations during breach investigations
Step 2: Reverse-Engineer File Signatures
If you don't have format documentation, analyze example files to identify signatures.
Hexadecimal Analysis
Use a hex editor to examine file contents:
hexdump -C example.trd | head -30
Output might show:
00000000  54 52 44 41 01 00 20 54  20 20 20 20 20 20 20 20  |TRDA.. T        |
00000010  20 20 20 20 00 00 01 00  00 00 00 00 00 00 00 c0  |    ............|
00000020  01 00 00 00 01 00 00 00  ff ff ff ff 00 00 00 00  |................|
Analysis:
- 54 52 44 41 = "TRDA": this looks like a magic number
- 01 00 20 followed by ASCII spaces suggests a version field and a padded text field
- Repeated patterns suggest structured data
Statistical Analysis
Examine multiple files to identify consistent patterns:
for file in *.trd; do
echo "=== $file ==="
hexdump -C "$file" | head -5
done
Compare the first 100 bytes of multiple files:
- Are the first bytes always the same?
- Which byte offsets vary between files, and which stay constant?
- Are there timestamp or version fields?
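A small script can answer these questions automatically. This hypothetical helper reports which offsets in the file head hold the same value across every sample, making them signature candidates:

```python
from pathlib import Path

def invariant_bytes(paths, length=100):
    """Return offsets (within the first `length` bytes) whose value is
    identical across every sample file."""
    headers = [Path(p).read_bytes()[:length] for p in paths]
    shortest = min(len(h) for h in headers)
    return [
        off for off in range(shortest)
        if len({h[off] for h in headers}) == 1
    ]
```

Running it over a directory of samples (e.g. invariant_bytes(glob.glob("*.trd"))) typically shows the magic number as a solid run of constant offsets at the start, with timestamp and ID fields appearing as gaps.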
File Structure Mapping
Once you identify the start signature, map the file structure:
Byte 0-3: Magic number ("TRDA")
Byte 4-5: Version number
Byte 6-135: Header (130 bytes)
    6-10: Timestamp
    11-50: Trading account ID
    51-135: Reserved/padding
Byte 136-end: Encrypted data block
Last 32 bytes: Checksum/signature
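Once mapped, the layout translates directly into a parser. This sketch follows the offsets above; remember they are assumptions from reverse engineering, not a published specification, so the field encodings (byte order, timestamp format) are guesses to verify against more samples:

```python
def parse_trd_header(data: bytes) -> dict:
    """Parse the reverse-engineered TRD layout (offsets are assumptions)."""
    if len(data) < 136 or data[0:4] != b"TRDA":
        raise ValueError("not a TRD file")
    return {
        "magic": data[0:4],
        "version": int.from_bytes(data[4:6], "little"),  # byte order assumed
        "timestamp": data[6:11],                          # 5 raw bytes, encoding unknown
        "account_id": data[11:51].rstrip(b"\x00 ").decode("ascii", "replace"),
        "checksum": data[-32:],                           # last 32 bytes per the map
    }
```

Parsing a few dozen samples and eyeballing the decoded fields is a quick sanity check of the layout: if account IDs come out as garbage, an offset or encoding assumption is wrong.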
Step 3: Create the Database Format
Choose how to store your custom file type definitions. Several approaches are available:
Option 1: Regex-Based Configuration File
Create a configuration file with regular expressions for each format:
File: custom-formats.conf
# Custom Trading Data Format
[TRD]
name=Trading Data Format
magic=54524441 # TRDA
magic_offset=0
extension=.trd
min_size=4096
max_size=104857600 # 100MB
version_offset=4
version_length=2
description=Investment firm trading data records
severity=high # Data classification
action=quarantine # On discovery action
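A file in this shape loads cleanly with Python's standard configparser; setting inline_comment_prefixes handles the trailing "# ..." comments used above. A minimal loader sketch (reading only a subset of the keys):

```python
import configparser

def load_formats(path: str) -> dict:
    """Load custom-formats.conf into a dict keyed by section name."""
    cp = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cp.read(path)
    formats = {}
    for section in cp.sections():
        s = cp[section]
        formats[section] = {
            "name": s["name"],
            "magic": bytes.fromhex(s["magic"]),
            "magic_offset": s.getint("magic_offset"),
            "min_size": s.getint("min_size"),
            "max_size": s.getint("max_size"),
        }
    return formats
```

Converting the hex magic to bytes at load time keeps the per-file matching loop free of string parsing.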
Option 2: JSON Database
{
"custom_formats": [
{
"id": "trd_trading_data",
"name": "Trading Data Format",
"magic": {
"hex": "54524441",
"offset": 0,
"ascii": "TRDA"
},
"extension": ".trd",
"file_size": {
"min": 4096,
"max": 104857600
},
"structure": {
"version_offset": 4,
"version_length": 2,
"footer_magic": "ffffffff",
"footer_offset": -32
},
"classification": {
"data_type": "financial",
"sensitivity": "highly-confidential",
"severity": "high"
},
"detection_rules": [
"magic_match",
"size_range_match",
"structure_validation"
]
}
]
}
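Consuming the JSON database is straightforward. This sketch applies the magic and size rules from an entry shaped like the one above (structure_validation would be a further per-format check):

```python
import json

def load_json_db(path: str):
    with open(path) as f:
        return json.load(f)["custom_formats"]

def match_entry(entry: dict, data: bytes) -> bool:
    """Apply the magic_match and size_range_match rules from a JSON entry."""
    magic = bytes.fromhex(entry["magic"]["hex"])
    off = entry["magic"]["offset"]
    size = entry["file_size"]
    return (
        data[off:off + len(magic)] == magic
        and size["min"] <= len(data) <= size["max"]
    )
```

Because the rules live in data rather than code, adding a new format is a JSON edit plus a test sample, with no redeployment of the detector itself.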
Option 3: Custom Database Tool
Create a specialized tool (Python, Go, etc.) that manages the database:
from dataclasses import dataclass
from typing import Optional

@dataclass
class FileTypeSignature:
    format_id: str
    name: str
    magic_hex: str              # Magic number in hex
    magic_offset: int           # Where the magic appears
    file_extension: str
    min_size: int               # Minimum file size in bytes
    max_size: int               # Maximum file size in bytes
    footer_hex: Optional[str]   # End signature, if any
    severity: str               # Data sensitivity level

    def matches(self, file_data: bytes) -> bool:
        """Check if the file data matches this signature."""
        magic_bytes = bytes.fromhex(self.magic_hex)
        start = self.magic_offset
        end = start + len(magic_bytes)
        if end > len(file_data):
            return False
        return file_data[start:end] == magic_bytes

# Create the database
database = {
    "trd": FileTypeSignature(
        format_id="trd_trading_data",
        name="Trading Data Format",
        magic_hex="54524441",
        magic_offset=0,
        file_extension=".trd",
        min_size=4096,
        max_size=104857600,
        footer_hex="ffffffff",
        severity="high",
    )
}
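In use, the database drives a scanner that reads only the head of each file. This stand-alone sketch uses a trimmed signature class (named Sig here so the example runs on its own) with the same matching logic:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sig:
    # Trimmed stand-alone version of the signature record above.
    format_id: str
    magic: bytes
    offset: int
    min_size: int
    max_size: int

def scan_file(path: str, db) -> Optional[str]:
    """Return the format_id of the first matching signature, else None."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4096)  # the head is enough for magic checks
    for sig in db:
        blob = head[sig.offset:sig.offset + len(sig.magic)]
        if sig.min_size <= size <= sig.max_size and blob == sig.magic:
            return sig.format_id
    return None
```

Reading a fixed-size head rather than whole files keeps scans of large directory trees fast; footer checks, when needed, can seek to the tail separately.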
Step 4: Validation and Testing
Create test samples to validate your detection database:
Unit Testing
def test_trading_data_detection():
    database = load_custom_database()
    trd_sig = database["trd"]

    # Test 1: Valid TRD file
    with open("valid_example.trd", "rb") as f:
        valid_data = f.read()
    assert trd_sig.matches(valid_data), "Failed to detect valid TRD file"

    # Test 2: Non-TRD file
    with open("document.pdf", "rb") as f:
        pdf_data = f.read()
    assert not trd_sig.matches(pdf_data), "False positive: detected PDF as TRD"

    # Test 3: Corrupted magic number
    corrupted = b"XXDA" + valid_data[4:]
    assert not trd_sig.matches(corrupted), "Accepted corrupted data"

if __name__ == "__main__":
    test_trading_data_detection()
    print("All tests passed!")
False Positive Testing
Test for false positives by scanning diverse files:
# Scan all system files looking for false positives
find /usr -type f -exec ./custom_detector {} \; 2>/dev/null | grep "TRD" | head -20
# Manually verify any false positives found
hexdump -C /usr/bin/suspected_false_positive | head -10
Step 5: Integration with Security Tools
File Analysis Workflows
Integrate your custom database with forensic and security tools:
Foremost/Scalpel Integration:
Modify the Scalpel configuration file to include custom signatures:
# scalpel.conf - custom formats section
trd    y    104857600    TRDA    \xff\xff\xff\xff
The fields are:
- Extension given to carved files (trd)
- Case sensitivity (y = header/footer matched case-sensitively)
- Maximum carve size in bytes (104857600 = 100MB)
- Header (TRDA; can also be written as \x54\x52\x44\x41)
- Footer (\xff\xff\xff\xff)
Binwalk Custom Signature:
Binwalk identifies embedded data using libmagic-style signature files, so the most portable integration is a custom magic file (Binwalk's Python plugin API has changed between releases, so check the documentation for your installed version before writing a module):
# custom_trd.magic
0    string    TRDA    Trading Data Format (TRD)
Scan with the custom signature file:
binwalk --magic custom_trd.magic disk_image.bin
SIEM and Log Analysis
Add detection to your SIEM (Splunk, ELK, Sumo Logic) for real-time monitoring:
Splunk Search:
source="/var/log/file_access/*" file_extension=".trd"
NOT file_path="/data/trades/*"
NOT user IN ("trading_app", "finance_team")
| table _time, user, file_path, file_size
Field names such as file_extension and file_path depend on how your file-access logs are parsed at index time.
This alerts when TRD files appear in unexpected locations.
Step 6: Maintenance and Updates
Version Control
Track changes to your database:
git init custom_file_database
git add custom-formats.json
git commit -m "Initial version with TRD format"
# Later update
# ... modify custom-formats.json ...
git commit -m "Add healthcare format HCR with updated magic bytes"
Documentation
Maintain clear documentation:
# Custom File Type Database
## TRD - Trading Data Format
- **Created**: January 2024
- **Last Updated**: October 2024
- **Owned By**: Trading Platform Team
- **Magic Number**: 54 52 44 41 (TRDA)
- **Version History**:
- v1.0: Initial definition (Jan 2024)
- v1.1: Added footer validation (Oct 2024)
- **Detection Status**: Production
- **Notes**: Can be encrypted; applies to files created after 2020
Regular Audits
Periodically review and update:
- Quarterly: Test detection accuracy against new file samples
- Annually: Review format specifications for changes
- Per incident: Update database when new formats discovered during investigations
Example: Complete Custom Format Definition
Here's a complete example for a custom healthcare patient record format:
{
"format_id": "hcr_patient_record",
"name": "Healthcare Patient Record",
"organization": "MedCare Hospital",
"created_date": "2023-06-15",
"last_updated": "2024-10-31",
"detection": {
"magic_number": {
"hex": "48435245",
"offset": 0,
"ascii": "HCRE"
},
"file_extension": ".hcr",
"mime_type": "application/x-hcre",
"size_range": {
"minimum": 1024,
"maximum": 52428800
},
"structure": {
"header_size": 256,
"version_offset": 4,
"version_length": 2,
"record_count_offset": 8,
"checksum_offset": 240,
"footer_magic": "454452"
}
},
"classification": {
"data_sensitivity": "PHI",
"regulatory_framework": "HIPAA",
"handling_instructions": "Encrypt in transit, encrypt at rest",
"notification_required": true
},
"detection_rules": {
"required": ["magic_match", "structure_validation"],
"optional": ["checksum_validation"],
"when_found": {
"expected_locations": ["/data/patients/*", "/backups/patients/*"],
"alert_if": "found_outside_expected_locations",
"severity": "critical"
}
}
}
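The structure_validation rule from this definition can be sketched as follows. The footer position is an assumption here (the JSON gives a footer_magic but no offset, so this check simply requires it at the end of the file):

```python
def validate_hcr_structure(data: bytes) -> bool:
    """Sketch of 'structure_validation' for the HCR definition above:
    magic at offset 0, room for the 256-byte header, and the footer
    magic at the end of the file (footer position is an assumption)."""
    return (
        len(data) >= 256 + 3                          # header plus footer magic
        and data[0:4] == bytes.fromhex("48435245")    # "HCRE"
        and data.endswith(bytes.fromhex("454452"))    # footer_magic
    )
```

Splitting rules into "required" and "optional" tiers, as the JSON does, lets a scanner flag a file on the cheap checks and reserve expensive ones (like checksum_validation) for confirmation.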
Advanced Techniques
Yara Rules for Complex Detection
For more sophisticated detection, use Yara rules:
rule custom_trd_format {
    strings:
        $magic = "TRDA"
        $version = { 01 00 20 }
        $footer = { FF FF FF FF }
    condition:
        $magic at 0 and $version and $footer
}

rule encrypted_trd {
    strings:
        $magic = "TRDA"
        $encryption_marker = "ENC"
    condition:
        $magic at 0 and $encryption_marker at 50
}
Note that positional constraints like "at 0" belong in the condition section, not in the string definitions.
Machine Learning Approach
For very complex formats, use ML to learn signatures:
import numpy as np
from sklearn.ensemble import IsolationForest

HEAD_LEN = 1000

def head_features(path):
    """First HEAD_LEN bytes as a fixed-length feature vector (zero-padded)."""
    with open(path, "rb") as f:
        data = f.read(HEAD_LEN).ljust(HEAD_LEN, b"\x00")
    return np.frombuffer(data, dtype=np.uint8)

# Extract features from known TRD files
features = [head_features(p) for p in known_trd_files]

# Train a one-class model on the known-good samples
model = IsolationForest(contamination=0.1)
model.fit(features)

# Test an unknown file: predict() returns 1 for inliers, -1 for outliers
if model.predict([head_features("suspect_file")])[0] == 1:
    print("File matches TRD pattern")
Conclusion
Creating a custom file type detection database enables your organization to identify proprietary and non-standard file formats automatically. By systematically identifying magic numbers, validating file structures, and integrating with your existing security infrastructure, you can:
- Detect data exfiltration of custom formats
- Identify unauthorized file types in unexpected locations
- Automate forensic analysis of custom formats
- Build institutional knowledge of your organization's data types
- Respond more effectively to security incidents
The effort to create and maintain a custom database pays dividends through improved threat detection, faster incident response, and better data governance. Combined with proper access controls and monitoring, a comprehensive file type detection strategy becomes a powerful tool for protecting your organization's most sensitive data.


