How do I create a custom file type detection database?

Learn how to build and maintain a custom file type detection database for identifying files with non-standard signatures or proprietary formats.

By Inventive HQ Team

Why Create a Custom File Type Detection Database

Organizations often work with proprietary, custom, or non-standard file formats that the standard file magic number libraries don't recognize. A financial institution might have a custom trading data format. A media company might use specialized video containers. A healthcare provider might have custom patient record formats. An industrial company might have proprietary sensor data formats.

When these custom formats appear in unexpected locations, during breach investigations, or in suspicious contexts, being able to identify and analyze them is crucial. A custom file type detection database allows your organization to automatically identify these files during security analysis, forensic investigation, and data classification.

This guide walks you through creating and maintaining a custom file type detection database that integrates with your existing security tools and forensic workflows.

Understanding File Type Detection Fundamentals

Magic Numbers and File Signatures

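Most binary file formats have identifying characteristics. Many start with a magic number: a fixed sequence of bytes at the beginning of the file that identifies the format. Short magic numbers are not guaranteed to be unique, so treat a match as strong evidence rather than proof.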

Standard magic numbers:

PDF:        25 50 44 46        "%PDF"
PNG:        89 50 4E 47        "‰PNG"
JPEG:       FF D8 FF           (plus more bytes)
ZIP:        50 4B 03 04        "PK.."
EXE:        4D 5A              "MZ"
GIF:        47 49 46 38        "GIF8"

For custom formats, you need to identify the unique magic number that marks files of that type.
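As a rough illustration of how a magic-number lookup works, the sketch below checks the leading bytes of a buffer against the signatures listed above. The names `KNOWN_MAGIC`, `identify_by_magic`, and `identify_file` are made up for this example and do not come from any particular library.

```python
# Signature table taken from the list above; extend it with your own formats.
KNOWN_MAGIC = {
    "PDF": bytes.fromhex("25504446"),
    "PNG": bytes.fromhex("89504E47"),
    "JPEG": bytes.fromhex("FFD8FF"),
    "ZIP": bytes.fromhex("504B0304"),
    "EXE": bytes.fromhex("4D5A"),
    "GIF": bytes.fromhex("47494638"),
}


def identify_by_magic(header: bytes):
    """Return the first format name whose magic number prefixes `header`, else None."""
    for name, magic in KNOWN_MAGIC.items():
        if header.startswith(magic):
            return name
    return None


def identify_file(path: str):
    """Read only as many leading bytes as the longest signature needs."""
    longest = max(len(magic) for magic in KNOWN_MAGIC.values())
    with open(path, "rb") as f:
        return identify_by_magic(f.read(longest))
```

A custom format would be added as one more entry in the table once its magic number is known.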

File Structures and Footers

Beyond magic numbers, files have internal structures and often have footer signatures marking the end:

  • JPEG: Starts with FF D8 FF, ends with FF D9
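  • ZIP: Starts with 50 4B 03 04, contains a central directory, and ends with an end-of-central-directory record that begins with 50 4B 05 06 (the record's fields and an optional comment follow the signature, so it sits near the end of the file rather than at the final bytes)
  • PDF: Starts with %PDF, ends with %%EOF (often followed by a trailing newline)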

A robust file type detector uses both start and end signatures plus internal structure validation.
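A minimal sketch of a combined start and end check might look like the following. The function name `has_signatures` and the `tail_window` parameter are assumptions for illustration. The footer is searched for within a window at the end of the data, because real files often carry a few extra bytes after the end marker.

```python
def has_signatures(data: bytes, header: bytes, footer: bytes,
                   tail_window: int = 1024) -> bool:
    """True if `data` starts with `header` and `footer` occurs near the end.

    `tail_window` is how many trailing bytes are searched for the footer.
    """
    if not data.startswith(header):
        return False
    # Slicing with a negative index returns the whole buffer for short inputs.
    return footer in data[-tail_window:]


# Small synthetic buffers standing in for real files.
jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 10 + b"\xff\xd9"
pdf_like = b"%PDF-1.4\n1 0 obj\n%%EOF\n"
truncated = jpeg_like[:-2]  # end marker cut off
```

A truncated or partially overwritten file passes the header check but fails the footer check, which is why the second signature adds value during carving and triage.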

Step 1: Identify Your Custom File Types

Inventory Custom Formats

First, document all custom file formats your organization uses:

  1. List all proprietary formats: What file types does your organization create?
  2. Collect samples: Gather example files of each type
  3. Document format specifications: Do you have format documentation? If not, you'll need to reverse-engineer it
  4. Determine file extensions: What extensions are typically used? (.dat, .custom, .prop, etc.)
  5. Document usage: What applications create/use these formats? On which systems?

Example: Custom Trading Data Format

An investment firm uses a custom format .TRD (Trading Data):

  • Contains encrypted trading records
  • Created by proprietary trading platform
  • Stored in /data/trades/ directory
  • Files are 4KB-100MB typically
  • No public documentation available
  • Occasionally found in unintended locations when investigating breaches

Step 2: Reverse-Engineer File Signatures

If you don't have format documentation, analyze example files to identify signatures.

Hexadecimal Analysis

Use a hex editor to examine file contents:

hexdump -C example.trd | head -30

Output might show:

00000000  54 52 44 41 01 00 20 54  20 20 20 20 20 20 20 20  |TRDA.. T        |
00000010  20 20 20 20 00 00 01 00  00 00 00 00 00 00 00 c0  |    ............|
00000020  01 00 00 00 01 00 00 00  ff ff ff ff 00 00 00 00  |................|

Analysis:

  • 54 52 44 41 = "TRDA" - This looks like a magic number
  • 01 00 at offset 4 could be a 2-byte little-endian version number (1); the 0x20 bytes that follow are ASCII spaces, which suggests a space-padded text field
  • Repeated patterns suggest structured data

Statistical Analysis

Examine multiple files to identify consistent patterns:

for file in *.trd; do
  echo "=== $file ==="
  hexdump -C "$file" | head -5
done

Compare the first 100 bytes of multiple files:

  • Are the first bytes always the same?
  • Which bytes vary between files, and at which offsets?
  • Are there timestamp or version fields?
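The comparison can also be automated. The sketch below reports which offsets hold the same byte in every sample. The function name `constant_offsets` is hypothetical, and the samples are assumed to be the leading bytes of known-good files already read into memory.

```python
def constant_offsets(samples, length=100):
    """Map offset -> byte value for offsets whose byte is identical in every sample."""
    if not samples:
        return {}
    # Never index past the end of the shortest sample.
    length = min(length, min(len(s) for s in samples))
    constant = {}
    for offset in range(length):
        values = {s[offset] for s in samples}
        if len(values) == 1:
            constant[offset] = values.pop()
    return constant


# Synthetic stand-ins for the first bytes of three files.
samples = [
    b"TRDA\x01\x00AAAA",
    b"TRDA\x01\x00BBBB",
    b"TRDA\x02\x00CCCC",
]
stable = constant_offsets(samples)
```

A run of stable offsets at the start of the file is a candidate magic number. Offsets that vary slowly between samples are candidates for version or timestamp fields.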

File Structure Mapping

Once you identify the start signature, map the file structure:

Byte 0-3:     Magic number (TRDA)
Byte 4-5:     Version number
Byte 6-135:   Header (130 bytes)
  6-10:       Timestamp
  11-50:      Trading account ID
  51-135:     Reserved/padding
Byte 136-end: Encrypted data block
  Last 32 bytes: Checksum/signature
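Once the layout is mapped, it can be expressed as a parser. The sketch below uses Python's struct module and follows the byte ranges above. The little-endian byte order, the field names, and the choice to leave the timestamp as raw bytes are assumptions, since the layout does not specify them.

```python
import struct
from typing import NamedTuple

# magic (4) + version (2) + timestamp (5) + account ID (40) + reserved (85) = 136 bytes
HEADER = struct.Struct("<4sH5s40s85s")


class TrdHeader(NamedTuple):
    magic: bytes
    version: int
    timestamp_raw: bytes  # encoding unknown, kept as raw bytes
    account_id: bytes
    reserved: bytes


def parse_trd_header(data: bytes) -> TrdHeader:
    """Parse the fixed-size header, rejecting short or mismatched input."""
    if len(data) < HEADER.size:
        raise ValueError("file is shorter than the 136-byte header")
    header = TrdHeader(*HEADER.unpack_from(data, 0))
    if header.magic != b"TRDA":
        raise ValueError("magic number mismatch")
    return header


# Synthetic header for demonstration.
sample = (b"TRDA" + struct.pack("<H", 1) + b"\x00" * 5
          + b"ACCT-001".ljust(40) + b"\x00" * 85 + b"payload")
parsed = parse_trd_header(sample)
```

Parsing the header this way doubles as structure validation: a file that carries the magic number but fails to parse is worth a closer look.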

Step 3: Create the Database Format

Choose how to store your custom file type definitions. Several approaches are available:

Option 1: Key-Value Configuration File

Create an INI-style configuration file with one section of key-value pairs for each format:

File: custom-formats.conf

# Custom Trading Data Format
[TRD]
name=Trading Data Format
magic=54524441                 # TRDA
magic_offset=0
extension=.trd
min_size=4096
max_size=104857600            # 100MB
version_offset=4
version_length=2
description=Investment firm trading data records
severity=high                 # Data classification
action=quarantine             # On discovery action
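One way to read a file like this is Python's configparser. The sketch below is an assumption about how a loader could look, and the function name `load_conf_formats` is hypothetical. Note that configparser does not strip inline `#` comments by default, so the loader enables them explicitly.

```python
import configparser

SAMPLE_CONF = """
[TRD]
name=Trading Data Format
magic=54524441                 # TRDA
magic_offset=0
extension=.trd
min_size=4096
max_size=104857600            # 100MB
"""


def load_conf_formats(text: str) -> dict:
    """Parse format definitions into a dict keyed by section name."""
    # inline_comment_prefixes lets "value   # comment" parse as "value".
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    formats = {}
    for section in parser.sections():
        entry = parser[section]
        formats[section] = {
            "name": entry.get("name", section),
            "magic": bytes.fromhex(entry["magic"]),
            "magic_offset": entry.getint("magic_offset", fallback=0),
            "extension": entry.get("extension", ""),
            "min_size": entry.getint("min_size", fallback=0),
            "max_size": entry.getint("max_size", fallback=None),
        }
    return formats


formats = load_conf_formats(SAMPLE_CONF)
```

Converting the hex string to bytes at load time means a malformed magic value fails early rather than during a scan.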

Option 2: JSON Database

{
  "custom_formats": [
    {
      "id": "trd_trading_data",
      "name": "Trading Data Format",
      "magic": {
        "hex": "54524441",
        "offset": 0,
        "ascii": "TRDA"
      },
      "extension": ".trd",
      "file_size": {
        "min": 4096,
        "max": 104857600
      },
      "structure": {
        "version_offset": 4,
        "version_length": 2,
        "footer_magic": "ffffffff",
        "footer_offset": -32
      },
      "classification": {
        "data_type": "financial",
        "sensitivity": "highly-confidential",
        "severity": "high"
      },
      "detection_rules": [
        "magic_match",
        "size_range_match",
        "structure_validation"
      ]
    }
  ]
}
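A database in this shape can be applied with a few lines of Python. The sketch below is only illustrative: the function names are hypothetical, and it evaluates just the `magic_match` and `size_range_match` rules. `structure_validation` is skipped here because it depends on format-specific parsing.

```python
import json


def load_json_formats(json_text: str) -> list:
    """Return the list of format definitions from the database text."""
    return json.loads(json_text)["custom_formats"]


def check_rules(fmt: dict, data: bytes) -> dict:
    """Evaluate the listed rules that can be checked from raw bytes alone."""
    magic = bytes.fromhex(fmt["magic"]["hex"])
    offset = fmt["magic"]["offset"]
    size = fmt["file_size"]
    results = {
        "magic_match": data[offset:offset + len(magic)] == magic,
        "size_range_match": size["min"] <= len(data) <= size["max"],
    }
    # Rules without an implementation here (e.g. structure_validation) are skipped.
    return {rule: results[rule] for rule in fmt["detection_rules"] if rule in results}


def matches(fmt: dict, data: bytes) -> bool:
    """True if every evaluated rule passes."""
    return all(check_rules(fmt, data).values())


# Cut-down database with small size limits so the demo buffers stay short.
DEMO_DB = """{"custom_formats": [{
    "id": "trd_trading_data",
    "magic": {"hex": "54524441", "offset": 0},
    "file_size": {"min": 8, "max": 64},
    "detection_rules": ["magic_match", "size_range_match", "structure_validation"]
}]}"""
trd = load_json_formats(DEMO_DB)[0]
```

Returning per-rule results rather than a single boolean makes it easier to log why a file was or was not classified as a given format.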

Option 3: Custom Database Tool

Create a specialized tool (Python, Go, etc.) that manages the database:

from dataclasses import dataclass
from typing import Optional

@dataclass
class FileTypeSignature:
    format_id: str
    name: str
    magic_hex: str          # Magic number in hex
    magic_offset: int       # Where magic appears
    file_extension: str
    min_size: int           # Minimum file size
    max_size: int           # Maximum file size
    footer_hex: Optional[str]  # End signature
    severity: str           # Data sensitivity level

    def matches(self, file_data: bytes) -> bool:
        """Check if file matches this signature"""
        magic_bytes = bytes.fromhex(self.magic_hex)
        start = self.magic_offset
        end = start + len(magic_bytes)

        if end > len(file_data):
            return False

        return file_data[start:end] == magic_bytes

# Create database
database = {
    "trd": FileTypeSignature(
        format_id="trd_trading_data",
        name="Trading Data Format",
        magic_hex="54524441",
        magic_offset=0,
        file_extension=".trd",
        min_size=4096,
        max_size=104857600,
        footer_hex="ffffffff",
        severity="high"
    )
}

Step 4: Validation and Testing

Create test samples to validate your detection database:

Unit Testing

def test_trading_data_detection():
    # load_custom_database() is assumed to return the dict built in Step 3
    database = load_custom_database()
    trd_sig = database["trd"]

    # Test 1: Valid TRD file
    with open("valid_example.trd", "rb") as f:
        valid_data = f.read()

    assert trd_sig.matches(valid_data), "Failed to detect valid TRD file"

    # Test 2: Non-TRD file
    with open("document.pdf", "rb") as f:
        pdf_data = f.read()

    assert not trd_sig.matches(pdf_data), "False positive: detected PDF as TRD"

    # Test 3: Corrupted TRD (magic number overwritten)
    corrupted = b"XXDA" + valid_data[4:]
    assert not trd_sig.matches(corrupted), "Accepted corrupted data"

if __name__ == "__main__":
    test_trading_data_detection()
    print("All tests passed!")

False Positive Testing

Test for false positives by scanning diverse files:

# Scan all system files looking for false positives
find /usr -type f -exec ./custom_detector {} \; 2>/dev/null | grep "TRD" | head -20

# Manually verify any false positives found
hexdump -C /usr/bin/suspected_false_positive | head -10
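If no standalone detector binary exists yet, the same sweep can be approximated in Python. The sketch below walks a directory tree and yields files that begin with the TRD magic number but do not carry the expected extension. The function name `find_magic_hits` is hypothetical. Each hit is only a candidate for manual review: it may be a false positive or a genuine TRD file that has been renamed or moved.

```python
import os


def find_magic_hits(root: str, magic: bytes = b"TRDA", extension: str = ".trd"):
    """Yield paths under `root` that start with `magic` but lack `extension`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip sockets, FIFOs, and broken symlinks.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    head = f.read(len(magic))
            except OSError:
                continue  # unreadable file, e.g. permission denied
            if head == magic and not name.lower().endswith(extension):
                yield path
```

Reading only the first few bytes of each file keeps the sweep fast even on large trees.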

Step 5: Integration with Security Tools

File Analysis Workflows

Integrate your custom database with forensic and security tools:

Foremost/Scalpel Integration:

Modify the Scalpel configuration file to include custom signatures:

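# scalpel.conf - added to custom formats section
trd	y	104857600	\x54\x52\x44\x41	\xff\xff\xff\xff

The columns are:

  • Extension name (trd)
  • Case-sensitive header/footer matching (y/n)
  • Maximum carve size in bytes (104857600 = 100MB)
  • Header bytes, written as \x hex escapes (TRDA)
  • Footer bytes, optional (ff ff ff ff)

Check the comments at the top of the configuration file shipped with your version, since option details differ between Scalpel releases and Foremost.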

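Binwalk Plugin:

A custom Binwalk module can recognize your format. The sketch below is illustrative pseudocode: the base class and the scan() signature are placeholders, so check the Binwalk documentation for the plugin interface of your installed version. Binwalk 2.x can also load custom signatures from a magic file with the --magic option.

# custom_trd_module.py
# Illustrative pseudocode: binwalk.Module and scan() are placeholders,
# not Binwalk's documented plugin interface.
import binwalk

class TRDModule(binwalk.Module):
    def init(self):
        self.description = "Trading Data Format"

    def scan(self, stream, callback):
        stream.seek(0)
        data = stream.read()

        # Report every occurrence of the magic number,
        # not only those aligned to a 4-byte boundary
        offset = data.find(b'TRDA')
        while offset != -1:
            callback(
                offset=offset,
                description="Trading Data Format",
                valid=True
            )
            offset = data.find(b'TRDA', offset + 1)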

SIEM and Log Analysis

Add detection to your SIEM (Splunk, ELK, Sumo Logic) for real-time monitoring:

Splunk Search:

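source="/var/log/file_access/*" file_extension=".trd"
    NOT file_path="/data/trades/*"
    NOT user IN ("trading_app", "finance_team")
| table timestamp, user, file_path, file_size

Saved as a scheduled alert, this search flags TRD files that appear outside /data/trades/ and are accessed by accounts other than the expected ones. The field names (file_extension, file_path, user, file_size) depend on how your file access logs are parsed.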

Step 6: Maintenance and Updates

Version Control

Track changes to your database:

git init custom_file_database
cd custom_file_database
# ... copy custom-formats.json into the repository ...
git add custom-formats.json
git commit -m "Initial version with TRD format"

# Later update
# ... modify custom-formats.json ...
git commit -am "Add healthcare format HCR with updated magic bytes"

Documentation

Maintain clear documentation:

# Custom File Type Database

## TRD - Trading Data Format
- **Created**: January 2024
- **Last Updated**: October 2024
- **Owned By**: Trading Platform Team
- **Magic Number**: 54 52 44 41 (TRDA)
- **Version History**:
  - v1.0: Initial definition (Jan 2024)
  - v1.1: Added footer validation (Oct 2024)
- **Detection Status**: Production
- **Notes**: Can be encrypted; applies to files created after 2020

Regular Audits

Periodically review and update:

  1. Quarterly: Test detection accuracy against new file samples
  2. Annually: Review format specifications for changes
  3. Per incident: Update database when new formats discovered during investigations

Example: Complete Custom Format Definition

Here's a complete example for a custom healthcare patient record format:

{
  "format_id": "hcr_patient_record",
  "name": "Healthcare Patient Record",
  "organization": "MedCare Hospital",
  "created_date": "2023-06-15",
  "last_updated": "2024-10-31",

  "detection": {
    "magic_number": {
      "hex": "48435245",
      "offset": 0,
      "ascii": "HCRE"
    },
    "file_extension": ".hcr",
    "mime_type": "application/x-hcre",
    "size_range": {
      "minimum": 1024,
      "maximum": 52428800
    },
    "structure": {
      "header_size": 256,
      "version_offset": 4,
      "version_length": 2,
      "record_count_offset": 8,
      "checksum_offset": 240,
      "footer_magic": "454452"
    }
  },

  "classification": {
    "data_sensitivity": "PHI",
    "regulatory_framework": "HIPAA",
    "handling_instructions": "Encrypt in transit, encrypt at rest",
    "notification_required": true
  },

  "detection_rules": {
    "required": ["magic_match", "structure_validation"],
    "optional": ["checksum_validation"],
    "when_found": {
      "expected_locations": ["/data/patients/*", "/backups/patients/*"],
      "alert_if": "found_outside_expected_locations",
      "severity": "critical"
    }
  }
}
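The `when_found` block is what turns a detection into an alert decision. A minimal sketch of that step, assuming the definition above has been loaded into a Python dict, could look like this. The function name `evaluate_location` and the shape of the returned alert are made up for illustration.

```python
from fnmatch import fnmatchcase


def evaluate_location(definition: dict, file_path: str):
    """Return an alert dict if `file_path` is outside the expected locations, else None."""
    rule = definition["detection_rules"]["when_found"]
    # In fnmatch patterns "*" also matches "/", so "/data/patients/*"
    # covers files in nested subdirectories as well.
    expected = any(fnmatchcase(file_path, pattern)
                   for pattern in rule["expected_locations"])
    if expected:
        return None
    return {
        "format_id": definition["format_id"],
        "path": file_path,
        "alert": rule["alert_if"],
        "severity": rule["severity"],
    }


# Only the keys this function reads are reproduced here.
hcr_definition = {
    "format_id": "hcr_patient_record",
    "detection_rules": {
        "when_found": {
            "expected_locations": ["/data/patients/*", "/backups/patients/*"],
            "alert_if": "found_outside_expected_locations",
            "severity": "critical",
        }
    },
}
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive regardless of the operating system running the check.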

Advanced Techniques

YARA Rules for Complex Detection

For more sophisticated detection, use YARA rules. Position constraints such as "at 0" belong in the condition section, not in the string definitions:

rule custom_trd_format {
    strings:
        $magic = "TRDA"
        $version = {01 00 20}
        $footer = {FF FF FF FF}

    condition:
        $magic at 0 and $version at 4 and $footer
}

rule encrypted_trd {
    strings:
        $magic = "TRDA"
        $encryption_marker = "ENC"

    condition:
        $magic at 0 and $encryption_marker at 50
}

Machine Learning Approach

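For very complex formats, a one-class model trained on known samples can flag files that resemble the format. Treat the result as a heuristic to combine with signature checks, not as a replacement for them:

import glob
import numpy as np
from sklearn.ensemble import IsolationForest

N_BYTES = 1000  # Leading bytes used as features

def load_features(path):
    """Read the first N_BYTES of a file, zero-padded to a fixed length."""
    with open(path, 'rb') as f:
        data = f.read(N_BYTES)
    return np.frombuffer(data.ljust(N_BYTES, b'\x00'), dtype=np.uint8)

# Extract features from known TRD files
known_trd_files = glob.glob("samples/*.trd")  # Adjust to your sample directory
features = np.stack([load_features(p) for p in known_trd_files])

# Train model
model = IsolationForest(contamination=0.1, random_state=0)
model.fit(features)

# Test on unknown file; predict() returns 1 for inliers and -1 for outliers
test_data = load_features("suspect_file")

if model.predict([test_data])[0] == 1:
    print("File matches TRD pattern")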

Conclusion

Creating a custom file type detection database enables your organization to identify proprietary and non-standard file formats automatically. By systematically identifying magic numbers, validating file structures, and integrating with your existing security infrastructure, you can:

  • Detect data exfiltration of custom formats
  • Identify unauthorized file types in unexpected locations
  • Automate forensic analysis of custom formats
  • Build institutional knowledge of your organization's data types
  • Respond more effectively to security incidents

The effort to create and maintain a custom database pays dividends through improved threat detection, faster incident response, and better data governance. Combined with proper access controls and monitoring, a comprehensive file type detection strategy becomes a powerful tool for protecting your organization's most sensitive data.
