Beyond Hashes: Simplifying Malware Identification with Python and MpCmdRun

In an earlier post titled Growing Your Malware Corpus, I outlined methods for building a comprehensive test corpus of malware for detection engineering. It covers using sources like VX-Underground for malware samples and details how to organize and unzip these files using Python scripts.

In today’s post we’re going to cover using Python to apply a standard naming methodology to all our malware samples.

Depending on where you curate your samples from, they could be named by their hash, or as they were identified during investigation, like invoice.exe. Depending on the size of your collection, I’d surmise it’s highly unlikely that they have a consistent naming format.

I don’t know about you, but a title that indicates the malware family and platform is a lot more useful to me than a hash value when perusing the corpus for a juicy malware sample. We can rename all our malware files using Python and the command line utility for Windows Defender.

Step 1: You’ll need to install Python on a Windows box that has Windows Defender.

Install Python

If you don’t have Python installed on your Windows machine, you can do so by downloading the installer from python.org, or alternatively, installing from the Windows store.

Windows Store installer for Python versions 3.7 to 3.12

Directory Exclusion

Within the Windows Defender Virus & Threat protection settings, add an exclusion for the directory you’re going to be using with the malware. Make sure the exclusion is in place before connecting the drive with the malware so it doesn’t get nuked.

Note: Doing this assumes you’ve evaluated the potential risks associated with handling malware, even in controlled settings, and have taken safety precautions. This is not an exercise to be conducted on your corporate workstation.

Screenshot of the D:\Malware Directory being excluded from Windows Defender.

Automatic Sample submission

It’s up to you if you want to disable Automatic Sample submission. If you do, you may still get prompted to send some.

Automatic Sample Submission turned off in Windows Defender Configuration.
Windows Defender requesting to send samples to Microsoft for further analysis.

Rename_Malware.py

The star of this show is the Python script that vx-underground shared on Twitter.

The post walks through various options for the Windows Defender command line utility, MpCmdRun.exe. Using that information, a Python script was developed to loop through the contents of a directory, analyze those files with Windows Defender, and then rename each file based on its malware identification.

Python code for rename_malware.py in VS Code.

You can grab the code from the linked post, or a copy on my Github here.

Once you’ve got Python installed, the directory exclusion configured, and a pocketful of kryptonite (malware), you’re ready to go.

python rename_malware.py D:\Malware

The Windows Defender command line will run through each file, and the script renames it based on its detection.
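To give a feel for what a script like this does under the hood, here is a minimal sketch. It is not the vx-underground script. It assumes MpCmdRun.exe lives at its usual install path, that a custom scan (-ScanType 3) prints a line beginning with "Threat" followed by a colon and the detection name, and the naming scheme (detection name plus the original file name) is my own illustration.

```python
import re
import subprocess
from pathlib import Path

# Assumption: default install location; adjust for your system.
MPCMDRUN = r"C:\Program Files\Windows Defender\MpCmdRun.exe"

# Assumption: detections are reported on a line such as
# "Threat                  : Trojan:Win32/Example"
THREAT_LINE = re.compile(r"^Threat\s*:\s*(?P<name>\S.*?)\s*$", re.MULTILINE)


def parse_threat_name(scan_output):
    """Return the first detection name in the scan output, or None."""
    match = THREAT_LINE.search(scan_output)
    return match.group("name") if match else None


def safe_filename(threat_name):
    """Swap characters that Windows forbids in file names for dots."""
    return re.sub(r'[<>:"/\\|?*!\s]+', ".", threat_name).strip(".")


def scan_file(path):
    """Scan one file without remediation and return the detection name."""
    result = subprocess.run(
        [MPCMDRUN, "-Scan", "-ScanType", "3", "-File", str(path),
         "-DisableRemediation"],
        capture_output=True, text=True,
    )
    return parse_threat_name(result.stdout)


def rename_by_detection(root):
    """Walk root recursively and prefix each detected file with its name."""
    # Build the list first so renamed files are not visited twice.
    for path in [p for p in Path(root).rglob("*") if p.is_file()]:
        threat = scan_file(path)
        if threat is None:
            continue  # no detection, leave the file name alone
        target = path.with_name(f"{safe_filename(threat)}_{path.name}")
        if not target.exists():
            path.rename(target)
```

Calling rename_by_detection(r"D:\Malware") would mirror the command above. Keeping the original name as a suffix is a design choice: the hash stays searchable and two samples with the same detection never collide.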

The script recursively renames the analyzed files.

I’m running this on a copy of my malware corpus of 30,000+ malware samples.

Counting the Corpus

A bit of handy PowerShell math. Before and after the process I wanted to count the files present to make sure the antivirus didn’t remove any. I also wanted to exclude PDFs from the count, as many of the samples in my corpus have accompanying write-ups.

Using PowerShell for selective file counting.
Get-ChildItem -Recurse -File | Where-Object { $_.Extension -ne ".pdf" } | Measure-Object | Select Count
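If you would rather stay in Python for the before and after counts, the same check can be sketched with pathlib. The function name here is my own, and the comparison is case-insensitive to match how PowerShell compares strings.

```python
from pathlib import Path


def count_non_pdf_files(root):
    """Count files under root (recursively), skipping PDF write-ups."""
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() != ".pdf"
    )
```

Run it once before the rename and once after, for example count_non_pdf_files(r"D:\Malware"), and compare the two numbers.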

Back at the console the script is still running.

The script continues recursively renaming the analyzed files.
Energizer Rabbit. “Still Going!”

Finally… not begrudgingly at all considering over 30,000 samples were analyzed, the script has reached the end of the samples.

Script has reached the end of the files.

If we do a directory listing on the contents of the malware directory, we see that the majority of the files have been renamed based on their malware identification.

File listing showing malware files named Trojan.Powershell… Trojan.Script… etc.

Hooray!

That makes it much easier to search and query through the malware repository.

The last step… make a BACKUP. 😉

Growing Your Malware Corpus

If you’re writing YARA rules or doing other kinds of detection engineering, you’ll want to have a test bed that you can run your rules against.  This is known as a corpus. For your corpus you’ll want to have both Goodware (known good operating system files), as well as a library of malware files.

One source to get a lot of malware samples is from VX-Underground.  What I really appreciate about VX-Underground is that in addition to providing lots of malware samples, they also produce an annual archive of samples and papers. You can download a whole year’s worth of samples and papers, from 2010 to 2023.

Pandora’s Box

Just to understand the structure here, I have a USB device called “Pandora.” On the root of the drive is a folder called “APT”, and within that is a “Samples” directory. Inside the samples directory is the .7z download for 2023 from VX-Underground. There’s also a python script… we’ll get to that soon enough.

The first thing we’ll need to do is unzip the download with the usual password.

7zz x 2023.7z

Once the initial extraction is complete you can delete the original 2023.7z archive.

Within the archive for each year, there is a directory for the sample, with sub-directories of ‘Samples’ and ‘Papers.’  Every one of the samples is also a password-protected zip file.

This makes sense from a safety perspective, but it makes it impossible to scan against all the files at once.

Python to the Rescue

We can utilize a Python script to recursively go through the contents of our malware folder and unzip all the password protected files, while keeping those files in their original directories.
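As a rough illustration of the idea (this is a sketch, not the ExtractSamples.py linked at the end of this post), the extraction can be driven from Python by calling the same 7zz binary used earlier. The function names are hypothetical, the default pattern assumes the samples are .7z archives, and you supply the archive password yourself.

```python
import subprocess
from pathlib import Path


def build_extract_command(archive, password):
    """Build a 7zz command that extracts an archive next to itself."""
    archive = Path(archive)
    return [
        "7zz", "x",
        f"-p{password}",        # archive password
        f"-o{archive.parent}",  # extract into the archive's own directory
        "-y",                   # answer yes to any prompts
        str(archive),
    ]


def extract_all(root, password, pattern="*.7z"):
    """Recursively extract every matching archive under root."""
    for archive in sorted(Path(root).rglob(pattern)):
        subprocess.run(build_extract_command(archive, password), check=True)
```

Shelling out to 7zz rather than using Python's zipfile module is deliberate here: it handles 7z archives and stronger encryption that zipfile cannot decrypt.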

You may have noticed in the first screenshot that I have a script called ExtractSamples.py in my APT directory.

We will use this for the recursive password protected extractions.

Python ExtractSamples.py

A flurry of code goes by, and you congratulate yourself on your Python prowess. Now if we look again at our contents, we’ve got the extracted sample and the original zip file.

Let’s get rid of all the zip files as we don’t need them cluttering up the corpus.

We can start by running a find command to identify all the 7zip files.

find . -type f -name '*.7z' -print

After you’ve checked the output and verified the command above is only grabbing the 7z files you want to delete, we can update the command to delete the found files.

find . -type f -name '*.7z' -delete

One more directory listing to verify:

Success. All the 7z files are removed and all the sample files are intact.

GitHub Link: ExtractSamples.py

Time to go write some new detections!

Huntress CTF: Week 1 – Malware: Hot Off The Press, HumanTwo, PHP Stager & Zerion

Hot Off The Press

To start with let’s see what kind of file this is.

“UHARC is a compression/archiving system for PC platforms, which appears to be neglected since around 2005. It achieves better compression than most other archivers, at the expense of being much slower.”

http://fileformats.archiveteam.org/wiki/UHARC

I scoured the internet looking for a copy of UHARC to download. I’m not going to link any here as many if not all may contain malware. Since this is a Windows-only tool (or Wine under Linux), we’ll open this one in a sandboxed Windows system.

When the file extracts we are presented with hot_off_the_press.ps1.

OMG that’s a lot of obfuscation! Let’s see if we can clean this up and make it more readable. First let’s remove all the ”+”

That’s a little bit better. There’s another obfuscation method going on where numbered placeholders are used to represent different letters. Originally, I tried to determine the substitution by completing terms I knew. Early on I saw (”Sc{2}i’pt{1}loc{0}Logging”), which to me reads like ScriptBlockLogging. So all {2}’s are r’s, {1}’s are B’s, and {0}’s are k’s. I do a find/replace through the script with replacements on {0}, {1}, and {2}. Now it looks like a block of Base64 in the middle block. I copy it over to CyberChef and … NADA. Something’s not right.

If you look closer at the code, you’ll see that each one of the strings that had a {#} substitution in it ends with “-f” followed by other letters in quotations. The first character after -f is substituted for {0}, the next for {1}, etc. So I run the same substitution pattern on the script using the correct letters for this string this time.

Replace the {0} with L.

Replace the {1} with E.
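The manual find/replace can also be scripted. Here is a small sketch of the two clean-up steps, joining the '+' concatenations and filling in the -f placeholders. The function names are my own, and each format string has to be paired with its own argument list, which was exactly the mistake I made the first time around.

```python
import re


def join_concatenated(text):
    """Collapse PowerShell string concatenation such as 'Scr'+'ipt'."""
    return re.sub(r"'\s*\+\s*'", "", text)


def apply_format(template, args):
    """Fill {0}, {1}, ... with the arguments that follow -f."""
    return re.sub(r"\{(\d+)\}", lambda m: args[int(m.group(1))], template)


# The ScriptBlockLogging example from above: {0}=k, {1}=B, {2}=r
print(apply_format("Sc{2}ipt{1}loc{0}Logging", ["k", "B", "r"]))
```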

Now we’ve got a nice clean block of Base64.

Bring that over to CyberChef for decoding and:

We’ve got a script within the script.

If you scroll down in the output, you’ll see that there’s something else encoded as well.

We’ll run that through CyberChef.

Interesting, we have an encoded_flag. Let’s add URL decode to the recipe.
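For anyone who prefers a script to a CyberChef recipe, the last two operations can be sketched in Python, assuming the layer is Base64 on the outside and URL encoding on the inside. The function name and the sample value in the usage note are made up, not the challenge data.

```python
import base64
from urllib.parse import unquote


def decode_layer(blob):
    """Base64-decode a blob, then URL-decode the resulting text."""
    decoded = base64.b64decode(blob).decode("utf-8")
    return unquote(decoded)
```

Feeding it the Base64 of a percent-encoded string such as flag%7Bdemo%7D returns the text with the braces restored.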


HumanTwo

There were 1,000 files in the zip container. Easy comparison options like file size, modification date etc. don’t help as they are the same for all the files. It’s something in the content that has to be different. How the ‘f’ am I going to find the outlier in 1,000 files?! Meld and diff are two options coming up in the Discord. I install Meld, which is really a GUI for diff, and start getting a feel for it. You can compare files or directories. If doing files you could do a 3 way comparison between 3 files. But not 1,000. As I was looking through the files with Meld it struck me that all of the file contents were also the same with the exception of one line.

Let’s run through all the files with the_silver_searcher and isolate on String.Equals

Scrolling down through the output we see that one is a definite outlier, or as we like to say around here, an Irregular.
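Scrolling through 1,000 lines by eye works, but the hunt can be sketched in Python too. This version assumes the odd one out has a different length from every other String.Equals line, which is an assumption on my part rather than something the challenge guarantees. The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def find_outlier_lines(folder, needle="String.Equals"):
    """Return (file name, line) pairs whose line length is unique."""
    hits = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            if needle in line:
                hits.append((path.name, line.strip()))
    # Count how many matching lines share each length.
    length_counts = Counter(len(line) for _, line in hits)
    return [(name, line) for name, line in hits
            if length_counts[len(line)] == 1]
```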

Once more to CyberChef, this time from Hex.


PHP Stager

Heavily obfuscated PHP. This is going to be fun.

Let’s see if ChatGPT can give some insight into what’s going on here.

After several hours of back and forth from PHP to Python to PowerShell, online IDEs, more ChatGPT, googling, and back again, I was able to roughly reproduce the PHP in Python and get it to execute.

Looks like we’re not done yet. In the middle of the output we can see another block of Base64. What happens if we toss that into CyberChef?

Great! Now we have a Perl script. How far down does this challenge go? It’s like those Matryoshka dolls from Russia. One inside another inside another. But wait… there’s something interesting in the Perl script.

There’s a reference to UU encoding and a string. We’ll copy the string and bring it over to another of my favorite decoding sites, dcode.fr.

Sure enough it handles the decoding and we have our flag.
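If you want to stay offline, Python's standard library can handle a single uuencoded line as well. This is a minimal sketch using binascii. The function name is my own, and a multi-line string would need to be decoded line by line.

```python
import binascii


def uu_decode_line(line):
    """Decode one line of uuencoded data into text."""
    return binascii.a2b_uu(line).decode("utf-8", errors="replace")
```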


Zerion

Yay (said no one), another crazy PHP file.

Looks to be using Base64 encoding, Rot13, and some other options to obfuscate the code. Back to school (ChatGPT) to see what’s going on.

Let’s copy the large encoded text block to CyberChef. We’ll apply Rot13, then Reverse the text by Character, and finally decode using Base64.
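The same three-step recipe can be sketched in a few lines of Python. The function name is hypothetical, and the order of operations follows the recipe above: ROT13, reverse, then Base64 decode.

```python
import base64
import codecs


def decode_rot13_reverse_b64(blob):
    """Apply ROT13, reverse the characters, then Base64-decode."""
    rotated = codecs.decode(blob, "rot_13")
    reversed_text = rotated[::-1]
    return base64.b64decode(reversed_text).decode("utf-8")
```

ROT13 only touches letters, so the digits, plus signs, and padding in the Base64 text pass through it unchanged.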

And that’s our flag!


Use the tag #HuntressCTF on BakerStreetForensics.com to see all related posts and solutions for the 2023 Huntress CTF.

Creating YARA files with Python

When I’m researching a piece of malware, I’ll have a notepad open (usually VS Code), where I’m capturing strings that might be useful for a detection rule. When I have a good set of indicators, the next step is to turn them into a YARA rule.

It’s easy enough to create a YARA file by hand. My objective was to streamline the boring stuff like formatting and generating a string identifier ($s1 = “stringOne”) for each string. Normally PowerShell is my go-to, but this week I’m branching out and wanted to work on my Python coding.

The code relies on you having a file called strings.txt. One string per line.

When you run the script it will prompt for (metadata):

  • rule name
  • author
  • description
  • hash

It then takes the contents of strings.txt and combines those with the metadata to produce a cleanly formatted YARA rule.

Caveats:

If the strings have special characters that need to be escaped, you may need to tweak the strings in the rule after it’s created.
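If you would rather not tweak by hand, a small helper can escape the usual suspects before the strings go into the rule. This is a sketch and not part of CreateYARA.py. The function name is my own, and it only covers backslashes, double quotes, tabs, and newlines.

```python
def escape_yara_string(text):
    """Escape characters that would break a YARA text string."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
            .replace('"', '\\"')
            .replace("\t", "\\t")
            .replace("\n", "\\n")
    )
```

In CreateYARA.py you could wrap line.strip() in this call when each string is added to the rule.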

The script will define the condition “any of them”. If you prefer to have all strings required, you can change line 22 from

yara_rule += '\t\tany of them\n}\n'

to

yara_rule += '\t\tall of them\n}\n'

CreateYARA.py

def get_user_input():
    rule_name = input("Enter the rule name: ")
    author = input("Enter the author: ")
    description = input("Enter the description: ")
    hash_value = input("Enter the hash value: ")
    return rule_name, author, description, hash_value

def create_yara_rule(rule_name, author, description, hash_value, strings_file):
    yara_rule = f'''rule {rule_name} {{
    meta:
    \tauthor = "{author}"
    \tdescription = "{description}"
    \thash = "{hash_value}"

    strings:
    '''
    with open(strings_file, 'r') as file:
        for id, line in enumerate(file, start=1):
            yara_rule += f'\t$s{id} = "{line.strip()}"\n\t'
    yara_rule += '\n'
    yara_rule += '\tcondition:\n'
    yara_rule += '\t\tany of them\n}\n'

    return yara_rule

def main():
    rule_name, author, description, hash_value = get_user_input()
    strings_file = 'strings.txt'  

    yara_rule = create_yara_rule(rule_name, author, description, hash_value, strings_file)
    print("Generated YARA rule:")
    print(yara_rule)
    
    yar_filename = f'{rule_name}.yar'
    with open(yar_filename, 'w') as yar_file:
        yar_file.write(yara_rule)

    print(f"YARA rule saved to {yar_filename}")

if __name__ == "__main__":
    main()
Sample strings.txt file used as input for the YARA rule
Running CreateYARA.py
YARA rule created from Python script, viewed in VS Code.