Growing Your Malware Corpus

If you’re writing YARA rules or doing other kinds of detection engineering, you’ll want to have a test bed that you can run your rules against.  This is known as a corpus. For your corpus you’ll want to have both Goodware (known good operating system files), as well as a library of malware files.

One source to get a lot of malware samples is from VX-Underground.  What I really appreciate about VX-Underground is that in addition to providing lots of malware samples, they also produce an annual archive of samples and papers. You can download a whole year’s worth of samples and papers, from 2010 to 2023.

Pandora’s Box

Just to understand the structure here, I have a USB device called “Pandora.” On the root of the drive is a folder called “APT”, and within that is a “Samples” directory. Inside the samples directory is the .7z download for 2023 from VX-Underground. There’s also a python script… we’ll get to that soon enough.

The first thing we’ll need to do is unzip the download with the usual password.

7zz x 2023.7z

Once the initial extraction is complete you can delete the original 2023.7z archive.

Within the archive for each year, there is a directory for the sample, with sub-directories of ‘Samples’ and ‘Papers.’  Every one of the samples is also password protected zip file.

This makes sense from a safety perspective, but it makes it impossible to scan against all the files at once.

Python to the Rescue

We can utilize a Python script to recursively go through the contents of our malware folder and unzip all the password protected files, while keeping those files in their original directories.

You may have noticed in the first screenshot that I have a script called ExtractSamples.py in my APT directory.

We will use this for the recursive password protected extractions.

Python ExtractSamples.py

A flurry of code goes by, and you congratulate yourself on you Python prowess. Now if we look again at our contents, we’ve got the extracted sample and the original zip file. 

Let’s get rid of all the zip files as we don’t need them cluttering up the corpus.

We can start by running a find command to identify all the 7zip files.

find . -type f -name '*.7z' -print

After you’ve checked the output and verified the command above is only grabbing the 7z files you want to delete, we can update the command to delete the found files.

find . -type f -name '*.7z' -delete

One more a directory listing to verify:

Success. All the 7z files are removed and all the sample files are intact.

GitHub Link: ExtractSamples.py

Time to go write some new detections!

Creating YARA files with Python

When I’m researching a piece of malware, I’ll have a notepad open (usually VS Code), where I’m capturing strings that might be useful for a detection rule. When I have a good set of indicators, the next step is to turn them into a YARA rule.

It’s easy enough to create a YARA file by hand. My objective was to streamline the boring stuff like formatting and generating a string identifier ($s1 = “stringOne”) for each string. Normally PowerShell is my goto, but this week I’m branching out and wanted to work on my Python coding.

The code relies on you having a file called strings.txt. One string per line.

When you run the script it will prompt for (metadata):

  • rule name
  • author
  • description
  • hash

It then takes the contents of strings.txt and combines those with the metadata to produce a cleanly formatted YARA rule.

Caveats:

If the strings have special characters that need to be escaped, you may need to tweak the strings in the rule after it’s created.

The script will define the condition “any of them”. If you prefer to have all strings required, you can change line 22 from

yara_rule += '\t\tany of them\n}\n'

to

yara_rule += '\t\tall of them\n}\n'

CreateYARA.py

def get_user_input():
    rule_name = input("Enter the rule name: ")
    author = input("Enter the author: ")
    description = input("Enter the description: ")
    hash_value = input("Enter the hash value: ")
    return rule_name, author, description, hash_value

def create_yara_rule(rule_name, author, description, hash_value, strings_file):
    yara_rule = f'''rule {rule_name} {{
    meta:
    \tauthor = "{author}"
    \tdescription = "{description}"
    \thash = "{hash_value}"

    strings:
    '''
    with open(strings_file, 'r') as file:
        for id, line in enumerate(file, start=1):
            yara_rule += f'\t$s{id} = "{line.strip()}"\n\t'
    yara_rule += '\n'
    yara_rule += '\tcondition:\n'
    yara_rule += '\t\tany of them\n}\n'

    return yara_rule

def main():
    rule_name, author, description, hash_value = get_user_input()
    strings_file = 'strings.txt'  

    yara_rule = create_yara_rule(rule_name, author, description, hash_value, strings_file)
    print("Generated YARA rule:")
    print(yara_rule)
    
    yar_filename = f'{rule_name}.yar'
    with open(yar_filename, 'w') as yar_file:
        yar_file.write(yara_rule)

    print(f"YARA rule saved to {yar_filename}")

if __name__ == "__main__":
    main()
Sample strings.txt file used as input for the YARA rule
Running CreateYARA.py
YARA rule created from Python script, viewed in VS Code.

AXIOM, YARA, GitHub – Oh My!

Version 6 of Magnet Axiom added support for YARA rules. By default the installation ships with the free Open-Source YARA rules from Reversing Labs. These YARA rules may be updated within Axiom periodically. In addition to the included rules, AXIOM supports adding your own YARA source folders.

If you need to update the included rules on demand, you can do so with a PowerShell script and the GitHub CLI. The script below can be used to update the included rules, as well as other YARA sources you may be using within Axiom.

Prerequisites:

  • Prior to running the script you’ll need to install GitHub CLI
  • Once installed run gh auth login to establish authentication with GitHub
  • When running the script you will need to run as an Administrator in order for the file-copy to ~\ProgramFiles to be successful

Set the working directory to the local git repository for the YARA rules

Set-Location C:\GitHub\reversinglabs-yara-rules\

Sync the repository; requires github CLI https://cli.github.com/

gh repo sync

Create local archive directory

mkdir C:\Archives -Force

Backup the existing YARA rules in Axiom

Get-ChildItem -Path "C:\Program Files\Magnet Forensics\Magnet AXIOM\YARA" | Compress-Archive -DestinationPath C:\Archives\AxiomYARA.zip

Variable for date/time

$timestamp = Get-Date -Format o | ForEach-Object { $_ -replace ":", "." }

Set the working directory to the Archives location

Set-Location "C:\Archives"

Rename the archive with timestamp

Get-ChildItem -Filter 'AxiomYARA' -Recurse | Rename-Item -NewName {$_.name -replace 'AxiomYARA', $timestamp }

Copy new YARA rules to Axiom

robocopy /s C:\GitHub\reversinglabs-yara-rules\yara "C:\Program Files\Magnet Forensics\Magnet AXIOM\YARA\ReversingLabs"


Now let’s run it all together in a single script:

Set-Location C:\GitHub\reversinglabs-yara-rules\
gh repo sync
mkdir C:\Archives -Force
Get-ChildItem -Path "C:\Program Files\Magnet Forensics\Magnet AXIOM\YARA" | Compress-Archive -DestinationPath C:\Archives\AxiomYARA.zip
$timestamp = Get-Date -Format o | ForEach-Object { $_ -replace ":", "." }
Set-Location "C:\Archives"
Get-ChildItem -Filter 'AxiomYARA' -Recurse | Rename-Item -NewName {$_.name -replace 'AxiomYARA', $timestamp }
robocopy /s C:\GitHub\reversinglabs-yara-rules\yara "C:\Program Files\Magnet Forensics\Magnet AXIOM\YARA\ReversingLabs"


That’s all there is to it. If you’ve got multiple repositories to sync, just add lines to cd (Set-Location) into those directories and repeat the gh repo sync command.

Feel free to copy the code above, or you can download directly from my GitHub.

Are you utilizing YARA rules within AXIOM? If so, leave a comment on what are some that you’ve found useful.