Unmasking the Moon: Comparing LunaStealer Samples with MalChela and Claude

As one tends to do on Saturday mornings with coffee in hand, I was reviewing two samples attributed to the LunaStealer / LunaGrabber family. Originally I was just validating that tiquery worked with the MCP configuration, but what started as a quick TI check turned into a full static analysis session — and it gave me a good opportunity to put the MalChela MCP integration through its paces in a real workflow. This post walks through how the investigation unfolded, what the pivot points were, and what we found at the bottom of the rabbit hole.


The Setup

If you haven’t seen the MalChela MCP plugin before, the short version is this: MalChela is a Rust-based malware analysis toolkit I’ve been building for a while — tools like tiquery, fileanalyzer, mstrings, and others. The MCP server exposes all of those tools to Claude Desktop natively, so instead of dropping to the terminal for every command, I can run analysis steps conversationally and let Claude help interpret the results and suggest next moves.

This is not replacing the terminal — it’s augmenting it. The pivot decisions still come from the analyst. But having a reasoning layer that can look at mstrings output and say “that SetDllDirectoryW + GetTempPathW combination is staging behavior, and here’s the ATT&CK mapping” is genuinely useful when you’re moving fast.

Both samples were sitting in a folder on my Desktop. I had SHA-256 hashes. Let’s go.


Phase 1: Threat Intelligence Query

First move is always TI. The MalChela tiquery tool hits MalwareBazaar, VirusTotal, Hybrid Analysis, MetaDefender, and Triage simultaneously and returns a combined results matrix. Two calls, two answers.

Sample 1 (4f3b8971...) came back confirmed LunaStealer across all five sources. First seen 2025-12-01. Original filename sdas.exe. VT tagged it trojan.generickdq/python — already telling us something about the build.

Sample 2 (d4f57b42...) was more interesting. MalwareBazaar returned both LunaGrabber and LunaStealer tags. Triage clustered it with BlankGrabber, GlassWorm, IcedID, and Luca-Stealer. The original filename was loader.exe. That’s a different kind of name than sdas.exe. One sounds like a throwaway test artifact. The other sounds deliberate.

The TI results alone suggested these weren’t just two copies of the same thing. They were potentially different components of the same campaign.


Phase 2: Static PE Analysis

fileanalyzer and mstrings on both samples.

The first thing that jumped out was the imphash — f3c0dbc597607baa2ea891bc3a114b19 — identical on both. Same section layout, same section sizes, same import count (146), same 7 PE sections including the .fptable section that PyInstaller uses for its frozen module table. These two samples were compiled from the same PyInstaller loader template with different payloads bundled inside.
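An identical imphash is strong evidence of a shared build template: the hash covers only the normalized import table, so two PyInstaller bundles produced from the same loader stub collide on it regardless of payload. A rough sketch of the scheme (the pair list is illustrative, and pefile's real normalization handles a few extra cases):

```python
import hashlib

def imphash(imports):
    """Compute an import hash from (dll, function) pairs taken in
    import-table order: lowercase everything, strip common library
    extensions, join as 'dll.func' with commas, then MD5."""
    parts = []
    for dll, func in imports:
        name = dll.lower()
        for ext in (".dll", ".ocx", ".sys"):
            if name.endswith(ext):
                name = name[: -len(ext)]
        parts.append(f"{name}.{func.lower()}")
    return hashlib.md5(",".join(parts).encode()).hexdigest()

# Identical import tables yield identical hashes;
# reordering the table changes the hash.
a = imphash([("KERNEL32.dll", "GetTempPathW"), ("ADVAPI32.dll", "OpenProcessToken")])
b = imphash([("ADVAPI32.dll", "OpenProcessToken"), ("KERNEL32.dll", "GetTempPathW")])
```

In practice you would let pefile compute this (`pe.get_imphash()`); the point is that the payload bytes never enter the calculation.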

But the entropy diverged sharply. Sample 1 (sdas.exe) came in at 3.9 — low, even for a PyInstaller bundle. Sample 2 (loader.exe) was 6.9 — high, indicating the embedded payload is compressed or encrypted more aggressively. Combined with the file size difference (47 MB vs 22 MB), this was the first signal that what was inside each bundle was meaningfully different.
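For reference, the entropy values fileanalyzer reports are Shannon entropy in bits per byte: values near 8.0 indicate compressed or encrypted content, low values indicate sparse or repetitive data. A minimal stdlib sketch of the metric:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte: 0.0 for constant data, 8.0 for uniform random."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy(b"\x00" * 1024))          # 0.0
print(shannon_entropy(bytes(range(256)) * 4))   # 8.0
```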

mstrings gave us 22–23 ATT&CK-mapped detections across both samples — largely the same set: IsDebuggerPresent, QueryPerformanceCounter, SetDllDirectoryW, GetTempPathW, ExpandEnvironmentStringsW, OpenProcessToken. Standard infostealer staging behavior. Tcl_CreateThread showed up in both, which is a PyInstaller artifact from bundling Python with Tkinter. The VT python family tag made more sense in context.


Phase 3: PyInstaller Extraction

Both samples were extracted with pyinstxtractor-ng. This is where the two samples started to diverge clearly.

Sample 1 entry point: sdas.pyc — Python 3.13, 112 files in the CArchive, 752 modules in the PYZ archive.

Sample 2 entry point: cleaner.pyc — Python 3.11, 113 files, 760 modules.

The name cleaner.pyc inside a file called loader.exe is a tell. That’s not a stealer payload name. That’s something that runs after.

The bundled library sets were nearly identical between both — requests, requests_toolbelt, Cryptodome, cryptography, psutil, PIL, sqlite3, win32 — same stealer framework. But Sample 2 had a unique addition: an l.js reference (mapped to T1059 — Command and Scripting Interpreter). A JavaScript component not present in the December build. The OpenSSL versions also differed: Sample 1 bundled libcrypto-3.dll (OpenSSL 3.x), Sample 2 had libcrypto-1_1.dll (OpenSSL 1.1). Different build environments, roughly one month apart.

At this point the working theory was solid: Sample 1 is a standalone stealer. Sample 2 is a later-generation dropper/installer with an updated payload and additional capability.


Phase 4: Bytecode Decompilation

decompile3 couldn’t handle Python 3.11 or 3.13 bytecode. That’s a known limitation. pycdc (Decompyle++) handles both.

sdas.pyc decompiled cleanly — the import stack made the capability set immediately obvious:

from win32crypt import CryptUnprotectData
from Cryptodome.Cipher import AES
from PIL import Image, ImageGrab
from requests_toolbelt.multipart.encoder import MultipartEncoder
import sqlite3

CryptUnprotectData for browser master key decryption. AES for the decryption itself. ImageGrab for screenshots. MultipartEncoder for structured exfiltration. Classic infostealer, nothing surprising.

cleaner.pyc was a different story. The decompiler output opened with this:

__________ = eval(getattr(__import__(bytes([98,97,115,101,54,52]).decode()), ...

Heavy obfuscation — byte arrays used to reconstruct eval, getattr, and __import__ at runtime so none of those strings appear in plain text. The approach is designed to evade static string detection. Decode the byte arrays and you get:

bytes([98,97,115,101,54,52]) → "base64"
bytes([90,88,90,104,98,65,61,61]) → b64decode("ZXZhbA==") → "eval"
bytes([90,50,86,48,...]) → "getattr"
bytes([88,49,57,112,...]) → "__import__"
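Nothing exotic is needed to verify these decodings: the fully recovered arrays above resolve with two standard-library calls (the truncated ones are left as found).

```python
import base64

# Stage 1: raw byte arrays keep the strings out of static scans
assert bytes([98, 97, 115, 101, 54, 52]).decode() == "base64"

# Stage 2: some names carry an extra base64 layer on top
blob = bytes([90, 88, 90, 104, 98, 65, 61, 61]).decode()  # "ZXZhbA=="
assert base64.b64decode(blob).decode() == "eval"
```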

Standard Python malware obfuscation. But buried further down in the decompile output was a large binary blob — a bytes literal starting with \xfd7zXZ. That’s the LZMA magic header.


Phase 5: LZMA Stage 2 Extraction

The blob was located at offset 0x17d4 in the pyc file. Extract and decompress it:

import lzma
blob = open('cleaner.pyc', 'rb').read()
idx = blob.find(b'\xfd7zXZ')
decompressed = lzma.decompress(blob[idx:])
# → 102,923 bytes

One important detail: the decompression is wrapped in a try/except LZMAError block with os._exit(0) on failure. If the decompression fails — as it would in some emulated sandbox environments — the process exits silently with no error. That’s the anti-sandbox mechanism.
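That fail-silent gate is easy to model. A sketch of the mechanism under the behavior described above (the function name is mine; the sample does this inline):

```python
import lzma
import os

def load_stage2(blob: bytes) -> bytes:
    """Find the embedded LZMA stream (xz magic bytes) and inflate it.
    Mirrors the sample's anti-sandbox behavior: any failure exits
    silently with status 0 instead of raising a visible error."""
    idx = blob.find(b"\xfd7zXZ")
    if idx == -1:
        os._exit(0)  # no payload found: die quietly
    try:
        return lzma.decompress(blob[idx:])
    except lzma.LZMAError:
        os._exit(0)  # emulator mangled the blob: die quietly

# Round-trip demo with a stand-in payload
stage2 = b"print('stage 2 source')"
carrier = b"pyc header junk" + lzma.compress(stage2)
```

A sandbox that stubs out or corrupts LZMA decompression gets a clean exit and nothing to report.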

The decompressed payload was another obfuscated Python source using a custom alphabet substitution encoding. The final execution chain was compile() + exec(). Decoding the full stage 2 revealed everything:

The injection URL:

https://raw.githubusercontent.com/Smug246/luna-injection/main/obfuscated-injection.js

This is the live Discord injection payload. The stage 2 pulls this JavaScript file from GitHub and injects it into the Discord desktop client’s core module, persisting across restarts.

The capability set from stage 2:

  • Anti-analysis checks on startup: process blacklist (~30 entries including wireshark, processhacker, vboxservice, ollydbg, x96dbg, pestudio), MAC address blacklist (80+ VM prefixes), HWID blacklist, IP blacklist, username/PC name blacklists
  • Discord token theft from all three release channels (stable, canary, PTB)
  • Browser credential theft across 20+ Chromium and non-Chromium browsers
  • Roblox session cookie harvesting (.ROBLOSECURITY= targeting with API validation)
  • Desktop screenshot capture
  • Self-destruct: ping localhost -n 3 > NUL && del /F "{path}"

The ping delay is a simple trick — the 3-second wait lets the process fully exit before the delete fires, so the file removes itself cleanly after execution.
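Reconstructed as a helper, the command is just string assembly; the function name is mine:

```python
def build_self_destruct(path: str) -> str:
    """Reconstructs the observed self-delete one-liner: three pings
    to localhost delay execution long enough for the stealer process
    to exit, then del removes the now-unlocked executable."""
    return f'ping localhost -n 3 > NUL && del /F "{path}"'

cmd = build_self_destruct(r"C:\Users\victim\loader.exe")
print(cmd)  # ping localhost -n 3 > NUL && del /F "C:\Users\victim\loader.exe"
```

The sample hands the assembled string to a detached shell, so the delete fires after its own process is gone.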


What MalChela + MCP Added to This Workflow

The honest answer is: speed and synthesis.

tiquery hitting five TI sources in one call versus five separate browser tabs or CLI invocations is a meaningful time saving, but that’s the surface benefit. The deeper value showed up in the mstrings step — getting ATT&CK-mapped output with technique IDs alongside the raw strings meant the behavioral picture came together faster than manually correlating imports against the ATT&CK matrix.

The MCP integration meant each of those steps — TI query, PE analysis, string extraction — could happen within the same conversation context. Claude could see the fileanalyzer output and the mstrings output together and note that the entropy difference between the two samples was significant, that the identical imphash meant shared loader infrastructure, that the staging imports in mstrings were consistent with the exfil approach suggested by the TI tags. That cross-tool synthesis is where the integration earns its keep.

The parts that still required manual work: pyinstxtractor-ng, pycdc, the LZMA extraction, and decoding the stage 2. Those are terminal steps on the Mac.


IOCs at a Glance

Samples:

SHA-256          Filename     Family
4f3b8971...d0    sdas.exe     LunaStealer
d4f57b42...24    loader.exe   LunaGrabber

Injection URL:

https://raw.githubusercontent.com/Smug246/luna-injection/main/obfuscated-injection.js

Self-destruct pattern:

ping localhost -n 3 > NUL && del /F "{executable}"

Imphash (shared loader stub):

f3c0dbc597607baa2ea891bc3a114b19

A full IOC list including ~60 C2 IPs, MAC address blacklists, and HWID blacklists is in the analysis report linked below.

Downloads

  • 📄 [Full Analysis Report] — Complete investigation narrative, sample properties, capability breakdown, IOC documentation, campaign timeline, and recommendations. (LunaStealer_Analysis_Report.pdf)
  • 🛡️ [YARA Rules — PE] — Four rules targeting the PE samples: exact hash match, shared PyInstaller stub (imphash-based), infostealer payload strings, generic PyInstaller infostealer. (lunastealer_pe.yar)

If you’re running MalChela in your environment and want to reproduce the TI query steps, the MalChela MCP plugin source is on GitHub at github.com/dwmetz/MalChela. Questions or additions to the IOC list — find me on the usual channels.


MalChela 2.2 “REMnux” Release

MalChela’s 2.2 update is packed with practical and platform-friendly improvements. It includes native support for REMnux, better tool settings, and deeper integrations with analysis tools like YARA-X, Tshark, Volatility3, and the newly improved fileanalyzer module.

🦀 REMnux Edition: Built-In Support, Zero Tweaks

When the GUI loads a REMnux-specific tools.yaml profile, it enters REMnux mode.

Screenshot of yaml configuration applying REMnux mode

Native binaries and Python scripts like capa, oledump.py, olevba, and FLOSS are loaded into the MalChela tools menu, allowing you to mix and match operations with the embedded MalChela utilities and the full REMnux tool stack. No manual configuration needed—just launch and go. MalChela currently supports the following REMnux programs right out of the box:

Tool Name      Description
binwalk        Firmware analysis and extraction tool
capa           Identifies capabilities in executable files
radare2        Advanced reverse engineering framework
Volatility 3   Memory forensics framework for RAM analysis
exiftool       Extracts metadata from images, documents, and more
TShark         Terminal-based network packet analyzer (Wireshark CLI)
mraptor        Detects malicious macros in Office documents
oledump        Parses OLE files and embedded streams
oleid          Identifies features in OLE files that may indicate threats
olevba         Extracts and analyzes VBA macros from Office files
rtfobj         Extracts embedded objects from RTF documents
zipdump        Inspects contents of ZIP files, including suspicious payloads
pdf-parser     Analyzes structure and contents of suspicious PDFs
FLOSS          Reveals obfuscated and decoded strings in binaries
clamscan       On-demand virus scanner using ClamAV engine
strings        Extracts printable strings from binary files
YARA-X         Next-generation high-performance YARA rule scanner

If you only need a subset of tools, you can easily save and restore a custom profile.


TShark Panel with Built-In Reference

Tshark and the integrated field reference

A new TShark integration exposes features including:

  • A filter builder panel
  • Commonly used fields reference
  • Tooltip hints for each example (e.g., `ip.addr == 192.168.1.1` shows “Any traffic to or from 192.168.1.1”)
  • One-click copy support

This helps analysts build and understand filters quickly—even if TShark isn’t something they use every day. Filters built with the syntax builder can be pasted directly into TShark or Wireshark.


YARA-X Support (Install Guide Included)

YARA-X module in MalChela

Support for YARA-X (via the `yr` binary) is now built in. YARA-X is not bundled with REMnux by default, but install instructions are included in the User Guide for both macOS and Linux users.

Once installed, MalChela allows for rule-based scanning from the GUI, and with YARA-X it’s faster than ever.


fileanalyzer: Fuzzy Hashing, PE Metadata, and More

Updated FileAnalyzer Module

MalChela’s fileanalyzer tool has also been updated to include:

  • Fuzzy hashing support via `ssdeep`
  • BLAKE3 hashing for fast, secure fingerprints
  • Expanded PE analysis, including:
      ◦ Import and Export Table parsing (list of imported and exported functions)
      ◦ Compilation Timestamp (for detection of suspicious or forged build times)
      ◦ Section Characteristics (flags like IMAGE_SCN_MEM_EXECUTE, IMAGE_SCN_CNT_CODE, etc., for detecting anomalous sections)

These improvements provide deeper insight into executable structure, helping analysts detect anomalies such as packers, suspicious timestamps, or unexpected imports/exports. Useful for everything from sample triage to correlation, fileanalyzer now digs deeper—without slowing down.


Memory Forensics Gets a Boost: Volatility 3 Now Supported

With the 2.2 release, MalChela introduces support for Volatility 3, the modern Python-based memory forensics framework. Whether you’re running MalChela in REMnux or on a customized macOS or Linux setup, you can now access the full power of Volatility directly from the MalChela GUI.

Volatility 3 in MalChela

There’s an intuitive plugin selector that dynamically adjusts the available arguments based on your chosen plugin. You can search, sort, and browse available plugins, and even toggle output options like --dump-dir with ease.

As with the TShark panel, there is an added plugin reference panel with searchable descriptions and argument overviews — a real time-saver when navigating Volatility’s deep and often complex toolset.

Volatility Plugin Reference

Smarter Tool Configuration via YAML

The tool configuration system continues to evolve:

  • Tools now declare their input type (file, folder, or hash)
  • The GUI dynamically adjusts the interface to match
  • Alternate profiles (like REMnux setups) can be managed simply by swapping `tools.yaml` files via the GUI
  • Easily backup or restore your custom setups
  • Restore the default toolset to get back to basics

This structure helps keep things clean—whether you’re testing, teaching, or deploying in a lab environment.


Embedded Documentation Access

The GUI now includes a link to the full MalChela User Guide in PDF. You can also access the documentation online.

From tool usage and CLI flags to configuration tips and install steps, it’s all just a click away—especially useful in offline environments or when onboarding new analysts. I’ll be honest, this is likely the most comprehensive user guide I’ve ever written.


Whether you’re reviewing binaries, building hash sets, or exploring network captures—MalChela 2.2 is designed to bring together the tools you need and make it easier to interoperate between them.

The new REMnux mode makes it even easier to get up and running with dozens of third party integrations.

Have an idea for a feature or application you’d like to see supported? Reach out to me.


GitHub: REMnux Release

MalChela User Guide: Online, PDF, Web

Shop: T-shirts, hats, stickers, and more

Creating custom hash sets with YARA and Python

I don’t like to brag, he said, but you should see the size of my malware library.

For a recent project, I wanted to produce a hash set for all the malware files in my repository. The library includes malware samples for Windows and other platforms. It also contains a lot of PDFs with write-ups corresponding to different samples, plus zip files from which the malware samples haven’t been extracted yet.

I didn’t want hashes for any of the PDF documents or zip files. I wanted one hash set for the malware specific to Windows, and a second set for all the other samples.

Using YARA and Python, and some (a lot of) AI coaching, I wound up with three scripts. Two of them are used to create the hash sets, and a third that does counting and indexing on the source directory for different file headers.

Windows Malware Hashing

The majority of Windows malware has an MZ header. The first Python script uses YARA to recursively search a directory. For any files with an MZ header, it will write the MD5 hash to a file.

launching MZMD5.py with Python
Completion of the MZMD5.py script showing 22789 hashes generated for MZ files.

MZMD5.py

import os
import yara
import hashlib
import sys

def compile_yara_rules():
    """
    Compile YARA rules for detecting MZ headers.
    
    Returns:
        yara.Rules: Compiled YARA rules.
    """
    rules = """
rule mz_header {
    meta:
        description = "Matches files with MZ header (Windows Executables)"
    strings:
        $mz = {4D 5A}  // MZ header in hex
    condition:
        $mz at 0  // Match if MZ header is at the start of the file
}
"""
    return yara.compile(source=rules)

def calculate_md5(file_path):
    """
    Calculate the MD5 hash of a file.
    
    Args:
        file_path (str): Path to the file.
    
    Returns:
        str: MD5 hash in hexadecimal format.
    """
    md5_hash = hashlib.md5()
    try:
        with open(file_path, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                md5_hash.update(byte_block)
        return md5_hash.hexdigest()
    except Exception:
        # Unreadable file; skip it quietly
        return None

def scan_and_hash_files(directory, rules, output_file):
    """
    Scan files in a directory using YARA rules, calculate MD5 hashes for matches,
    and write results to an output file.
    
    Args:
        directory (str): Path to the directory to scan.
        rules (yara.Rules): Compiled YARA rules.
        output_file (str): Path to the output file where results will be saved.
    
    Returns:
        int: Total number of hashes written to the output file.
    """
    hash_count = 0
    with open(output_file, "w") as out_file:
        # Walk through the directory and its subdirectories
        for root, _, files in os.walk(directory):
            for file in files:
                file_path = os.path.join(root, file)
                
                try:
                    # Match YARA rules against the file
                    matches = rules.match(file_path)
                    if any(match.rule == "mz_header" for match in matches):
                        # Calculate MD5 hash if the file matches the MZ header rule
                        md5_hash = calculate_md5(file_path)
                        if md5_hash:
                            out_file.write(f"{md5_hash}\n")
                            # Print hash value and flush output immediately
                            print(md5_hash, flush=True)
                            hash_count += 1
                except Exception as e:
                    pass  # Suppress error messages
    
    return hash_count

if __name__ == "__main__":
    # Prompt user for directory to scan
    directory_to_scan = input("Enter the directory you want to scan: ").strip()
    
    # Verify that the directory exists
    if not os.path.isdir(directory_to_scan):
        print("Error: The specified directory does not exist.")
        exit(1)
    
    # Set output file path to MZMD5.txt in the current working directory
    output_file_path = "MZMD5.txt"
    
    # Check if the output file already exists
    if os.path.exists(output_file_path):
        overwrite = input(f"The file '{output_file_path}' already exists. Overwrite? (y/n): ").strip().lower()
        if overwrite != 'y':
            print("Operation canceled.")
            exit(0)
    
    # Compile YARA rules
    yara_rules = compile_yara_rules()
    
    # Scan directory, calculate MD5 hashes, and write results to an output file
    total_hashes = scan_and_hash_files(directory_to_scan, yara_rules, output_file_path)
    
    # Report total number of hashes written and location of the output file
    print(f"\nScan completed.")
    print(f"Total number of hashes written: {total_hashes}")
    print(f"Output file location: {os.path.abspath(output_file_path)}")

Non-Windows Malware Hashing

The second script is a little more complicated. Again we use YARA to determine the file type, but in this case we want to exclude anything with an MZ header, as well as any zip files or PDFs. Based on the contents of the library, this should produce a hash set for all the other binaries that aren’t targeted at Windows.

Launching XMZMD5.py in Python
Results of XMZMD5.py showing 5988 hashes calculated.

XMZMD5.py

import os
import yara
import hashlib

def compile_yara_rules():
    """
    Compile YARA rules for MZ, PDF, and ZIP headers.
    
    Returns:
        yara.Rules: Compiled YARA rules.
    """
    rules = """
rule mz_header {
    meta:
        description = "Matches files with MZ header (Windows Executables)"
    strings:
        $mz = {4D 5A}  // MZ header in hex
    condition:
        $mz at 0  // Match if MZ header is at the start of the file
}

rule pdf_header {
    meta:
        description = "Matches files with PDF header"
    strings:
        $pdf = {25 50 44 46}  // PDF header in hex (%PDF)
    condition:
        $pdf at 0  // Match if PDF header is at the start of the file
}

rule zip_header {
    meta:
        description = "Matches files with ZIP header"
    strings:
        $zip = {50 4B 03 04}  // ZIP header in hex
    condition:
        $zip at 0  // Match if ZIP header is at the start of the file
}
"""
    return yara.compile(source=rules)

def calculate_md5(file_path):
    """
    Calculate the MD5 hash of a file.

    Args:
        file_path (str): Path to the file.

    Returns:
        str: MD5 hash of the file, or None if an error occurs.
    """
    hasher = hashlib.md5()
    try:
        with open(file_path, 'rb') as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hasher.update(chunk)
        return hasher.hexdigest()
    except Exception as e:
        print(f"[ERROR] Unable to calculate MD5 for {file_path}: {e}")
        return None

def scan_directory(directory, rules, output_file):
    """
    Scan a directory for files that do not match YARA rules and calculate their MD5 hashes.

    Args:
        directory (str): Path to the directory to scan.
        rules (yara.Rules): Compiled YARA rules.
        output_file (str): File to save MD5 hashes of unmatched files.
    """
    hash_count = 0  # Counter for total number of hashes written

    try:
        with open(output_file, 'w') as out:
            for root, _, files in os.walk(directory):
                for file in files:
                    file_path = os.path.join(root, file)
                    try:
                        # Check if the file matches any YARA rule
                        matches = rules.match(file_path)
                        if not matches:  # Only process files that do not match any rule
                            md5_hash = calculate_md5(file_path)
                            if md5_hash:
                                print(md5_hash)  # Print hash to console
                                out.write(md5_hash + '\n')  # Write only hash to output file
                                hash_count += 1
                    except yara.Error as ye:
                        print(f"[WARNING] YARA error scanning {file_path}: {ye}")
                    except Exception as e:
                        print(f"[ERROR] Unexpected error scanning {file_path}: {e}")
        
        # Report total number of hashes written and location of the output file
        print(f"\nScan completed.")
        print(f"Total number of hashes written: {hash_count}")
        print(f"Output file location: {os.path.abspath(output_file)}")

    except Exception as e:
        print(f"[ERROR] Failed to write to output file {output_file}: {e}")

if __name__ == "__main__":
    # Prompt user for directory to scan
    directory_to_scan = input("Enter directory to scan: ").strip()
    
    # Compile YARA rules
    try:
        yara_rules = compile_yara_rules()
    except Exception as e:
        print(f"[ERROR] Failed to compile YARA rules: {e}")
        exit(1)

    # Output filename for unmatched files' MD5 hashes
    output_filename = "XMZMD5.txt"

    # Check if the output file already exists and prompt user for action
    if os.path.exists(output_filename):
        overwrite_prompt = input(f"[WARNING] The file '{output_filename}' already exists. Do you want to overwrite it? (yes/no): ").strip().lower()
        
        if overwrite_prompt not in ['yes', 'y']:
            print("[INFO] Operation canceled by user.")
            exit(0)

    # Scan the directory
    if os.path.isdir(directory_to_scan):
        scan_directory(directory_to_scan, yara_rules, output_filename)
    else:
        print(f"[ERROR] The provided path is not a valid directory: {directory_to_scan}")

The third script is for counting and validation. I wanted to know the total number of files, and how many had the MZ header, were zip or pdf files, or none of the above. Based on the counts, the hash lists should contain a matching number of entries: the MZ files for the Windows malware samples, and the “Neither Header Files” for the remaining binaries.

Note: to run this script you will need to have the Python module “tabulate” installed (pip install tabulate). There are two output options available: Detailed and Table View.

MZCount.py Table View
MZCount.py Detailed View.
Completed MZCount in Table View.
Completed MZCount in Detailed View.

MZCount.py

import os
import yara
import time


def compile_yara_rules():
    """
    Compile YARA rules for MZ, PDF, and ZIP headers.
    
    Returns:
        yara.Rules: Compiled YARA rules.
    """
    rules = """
rule mz_header {
    meta:
        description = "Matches files with MZ header (Windows Executables)"
    strings:
        $mz = {4D 5A}  // MZ header in hex
    condition:
        $mz at 0  // Match if MZ header is at the start of the file
}

rule pdf_header {
    meta:
        description = "Matches files with PDF header"
    strings:
        $pdf = {25 50 44 46}  // PDF header in hex (%PDF)
    condition:
        $pdf at 0  // Match if PDF header is at the start of the file
}

rule zip_header {
    meta:
        description = "Matches files with ZIP header"
    strings:
        $zip = {50 4B 03 04}  // ZIP header in hex
    condition:
        $zip at 0  // Match if ZIP header is at the start of the file
}
"""
    try:
        return yara.compile(source=rules)
    except yara.SyntaxError as e:
        print(f"Error compiling YARA rules: {e}")
        raise


def display_table(counts):
    """
    Display the counts in a simple table format.
    
    Args:
        counts (dict): Dictionary containing counts for each file type.
    """
    # Clear console before displaying new table
    os.system('cls' if os.name == 'nt' else 'clear')  # Clears terminal for Windows ('cls') or Linux/Mac ('clear')
    
    # Print updated table
    print("\n+----------------------+---------+")
    print("| File Type           | Count   |")
    print("+----------------------+---------+")
    print(f"| Total Files         | {counts['total_files']:<7} |")
    print(f"| MZ Header Files     | {counts['mz_header']:<7} |")
    print(f"| PDF Header Files    | {counts['pdf_header']:<7} |")
    print(f"| ZIP Header Files    | {counts['zip_header']:<7} |")
    print(f"| Neither Header Files| {counts['neither_header']:<7} |")
    print("+----------------------+---------+")


def scan_and_count_files(directory, rules, use_table_display):
    """
    Scan files in a directory using YARA rules and count matches by header type.
    
    Args:
        directory (str): Path to the directory to scan.
        rules (yara.Rules): Compiled YARA rules.
        use_table_display (bool): Whether to use table display for live updates.
    
    Returns:
        dict: A dictionary with counts for total files, MZ headers, PDF headers, ZIP headers, and neither headers.
    """
    counts = {
        "total_files": 0,
        "mz_header": 0,
        "pdf_header": 0,
        "zip_header": 0,
        "neither_header": 0
    }

    # Walk through the directory and its subdirectories
    for root, _, files in os.walk(directory):
        for file in files:
            counts["total_files"] += 1
            file_path = os.path.join(root, file)
            
            try:
                # Open file in binary mode for YARA matching
                with open(file_path, "rb") as f:
                    data = f.read()
                
                # Match YARA rules against file content
                matches = rules.match(data=data)

                # Process matches
                if matches:
                    matched_rules = {match.rule for match in matches}
                    if "mz_header" in matched_rules:
                        counts["mz_header"] += 1
                    if "pdf_header" in matched_rules:
                        counts["pdf_header"] += 1
                    if "zip_header" in matched_rules:
                        counts["zip_header"] += 1
                else:
                    counts["neither_header"] += 1

            except Exception as e:
                print(f"Error scanning {file_path}: {e}")
                # Decrement total_files if an error occurs
                counts["total_files"] -= 1
            
            # Display updated output after processing each file
            if use_table_display:
                display_table(counts)
            else:
                print(f"Scanned: {file_path}")
                print(f"Current Counts: {counts}")
            
            time.sleep(0.1)  # Optional: Add a small delay for smoother updates
    
    return counts


if __name__ == "__main__":
    # Prompt user for directory to scan
    directory_to_scan = input("Enter directory to scan: ").strip()
    
    # Check if the directory exists
    if not os.path.isdir(directory_to_scan):
        print(f"Error: The directory '{directory_to_scan}' does not exist. Please enter a valid directory.")
        exit(1)
    
    # Prompt user for display format preference
    display_choice = input("Choose output format - (1) Detailed, (2) Table Display: ").strip()
    
    # Determine whether to use table display or original output format
    use_table_display = display_choice == "2"
    
    # Compile YARA rules
    yara_rules = compile_yara_rules()
    
    # Scan directory and count matches
    results = scan_and_count_files(directory_to_scan, yara_rules, use_table_display)
    
    # Final results display after completion
    print("\nFinal Results:")
    
    if use_table_display:
        display_table(results)
        
        # Handle case where no results were found
        if results["total_files"] == 0:
            print("No files were scanned. Please check your directory.")
            
    else:
        print(f"Total files scanned: {results['total_files']}")
        print(f"Files with MZ header: {results['mz_header']}")
        print(f"Files with PDF header: {results['pdf_header']}")
        print(f"Files with ZIP header: {results['zip_header']}")
        print(f"Files with neither MZ, PDF, nor ZIP header: {results['neither_header']}")

Double Checking the Hash Files

Finally, we can use a regular expression to count the number of MD5 hashes in each file. The regex matches strings of exactly 32 hexadecimal digits (0-9 and A-F) on word boundaries.

grep -oE '\b[a-fA-F0-9]{32}\b' [filename].txt | wc -l
Regex output showing counts for hashes, 22789 and 5988 respectively.

The number of hashes in the MZMD5.txt hash list matches the number of MZ files identified by YARA. Additionally, the number of non-MZ binaries in the hash list, XMZMD5.txt, matches the number of files when we exclude the Windows binaries and the pdf and zip files.
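If you'd rather stay in Python than shell out to grep, the same count can be sketched like this. `count_md5_hashes` is a hypothetical helper, and the filenames in the usage comment assume the hash lists described above:

```python
import re

# Matches exactly 32 hex digits (an MD5 hash) on word boundaries,
# mirroring the grep -oE pattern used above.
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")

def count_md5_hashes(text: str) -> int:
    """Count MD5-shaped strings in a blob of text."""
    return len(MD5_RE.findall(text))

# Usage against a hash list file, e.g.:
# print(count_md5_hashes(open("MZMD5.txt").read()))
```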


There you have it, the fruits of my labors combining a few of my favorite things (cue John Coltrane): YARA, malware, Python, and using AI as a tool to develop my coding skills. If you’d like to download the scripts for your own use, they can be found at https://github.com/dwmetz/Toolbox/ (miscellaneous PowerShell and Python scripts related to YARA and malware analysis).

Beyond Hashes: Simplifying Malware Identification with Python and MpCmdRun

In an earlier post titled Growing Your Malware Corpus, I outlined methods for building a comprehensive test corpus of malware for detection engineering. It covers using sources like VX-Underground for malware samples and details how to organize and unzip these files using Python scripts.

In today’s post we’re going to cover using Python to apply a standard naming methodology to all our malware samples.

Depending on where you curate your samples from, they could be named by their hash, or as they were identified during an investigation, like invoice.exe. Whatever the size of your collection, it’s highly unlikely the names follow a consistent format.

I don’t know about you, but a title that indicates the malware family and platform is a lot more useful to me than a hash value when perusing the corpus for a juicy malware sample. We can rename all our malware files using Python and the command line utility for Windows Defender.

Step 1: You’ll need to install Python on a Windows box that has Windows Defender.

Install Python

If you don’t have Python installed on your Windows machine, you can do so by downloading the installer from python.org, or alternatively, installing from the Windows store.

Windows Store installer for Python versions 3.7 to 3.12

Directory Exclusion

Within the Windows Defender Virus & Threat protection settings, add an exclusion for the directory you’re going to be using with the malware. Make sure the exclusion is in place before connecting the drive with the malware so it doesn’t get nuked.

Note: Doing this assumes you’ve evaluated the potential risks associated with handling malware, even in controlled settings, and have taken safety precautions. This is not an exercise to be conducted on your corporate workstation.

Screenshot of the D:\Malware Directory being excluded from Windows Defender.

Automatic Sample submission

It’s up to you whether you want to disable Automatic Sample Submission. Even if you do, you may still get prompted to send some samples.

Automatic Sample Submission turned off in Windows Defender Configuration.
Windows Defender requesting to send samples to Microsoft for further analysis.

Rename_Malware.py

The star of this show is the python script that was shared on twitter from vx-underground.

The post walks through various options for using the Windows Defender command-line utility, MpCmdRun.exe. Using that information, a Python script was developed to loop through the contents of a directory, analyze each file with Windows Defender, and then rename the files based on the malware identification.

Python code for rename_malware.py in VS Code.

You can grab the code from the linked post, or a copy on my Github here.
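As a rough sketch of the approach (not the vx-underground script itself): the `-Scan -ScanType 3 -File -DisableRemediation` flags are real MpCmdRun.exe options for scanning a single file without quarantining it, but the exact text of the scan output varies by Defender version, so the detection-name regex below is an assumption based on Defender's `Category:Platform/Family` naming convention (e.g. `Trojan:Win32/Emotet`):

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

MPCMDRUN = r"C:\Program Files\Windows Defender\MpCmdRun.exe"

# Defender detection names look like Category:Platform/Family,
# e.g. "Trojan:Win32/Emotet". The scan output format varies by
# Defender version, so treat this pattern as an assumption.
DETECTION_RE = re.compile(r"[A-Za-z]+:[A-Za-z0-9]+/[A-Za-z0-9._!-]+")

def extract_detection(scan_output: str) -> Optional[str]:
    """Pull the first detection name out of MpCmdRun's text output."""
    match = DETECTION_RE.search(scan_output)
    return match.group(0) if match else None

def rename_by_detection(sample: Path) -> None:
    """Scan one file (remediation disabled) and rename it if detected."""
    result = subprocess.run(
        [MPCMDRUN, "-Scan", "-ScanType", "3",
         "-File", str(sample), "-DisableRemediation"],
        capture_output=True, text=True)
    detection = extract_detection(result.stdout)
    if detection:
        # ":" and "/" aren't legal in Windows filenames
        safe = detection.replace(":", ".").replace("/", ".")
        sample.rename(sample.with_name(f"{safe}{sample.suffix}"))
```

Note that a corpus will have many samples sharing one detection name, so a real script needs to append a counter or hash suffix to avoid rename collisions.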

Once you’ve got Python installed, the directory exclusion configured, and a pocketful of kryptonite (malware), you’re ready to go.

python rename_malware.py D:\Malware

Windows Defender command line will run through each file and rename them based on its detection.

The script recursively renames the analyzed files.

I’m running this on a copy of my malware corpus of 30,000+ malware samples.

Counting the Corpus

A bit of handy PowerShell math. Before and after the process I wanted to be sure of how many files were present to ensure that the antivirus didn’t remove any. I also wanted to exclude counting pdfs as many of the samples in my corpus also have accompanying write-ups.

Using PowerShell for selective file counting.
Get-ChildItem -Recurse -File | Where-Object { $_.Extension -ne ".pdf" } | Measure-Object | Select-Object Count
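For cross-checking from Python instead, the same selective count can be sketched with pathlib:

```python
from pathlib import Path

def count_non_pdf_files(root: str) -> int:
    """Recursively count files, skipping the PDF write-ups."""
    return sum(1 for p in Path(root).rglob("*")
               if p.is_file() and p.suffix.lower() != ".pdf")

# Usage, e.g.:
# print(count_non_pdf_files(r"D:\Malware"))
```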

Back at the console the script is still running.

The script continues recursively renaming the analyzed files.
Energizer Bunny: “Still going!”

Finally, and not at all begrudgingly considering over 30,000 samples were analyzed, the script reached the end of the samples.

Script has reached the end of the files.

If we do a directory listing on the contents of the malware directory, we see that the majority of the files have been renamed based on their malware identification.

File listing showing malware files named Trojan.Powershell… Trojan.Script… etc.

Hooray!

That makes it much easier to search and query through the malware repository.

The last step… make a BACKUP. 😉