Using local LLM and Ghidra to analyze malware
Using LLM and Ghidra to analyze malware (Part 1)
I am not trying to put myself out of a job, I promise.
Chun Kit (Tim) Lau
Mar 18, 2026

Motivation

After I splurged a significant amount of money on used GPUs for hosting my own local LLM (hint: buying a GeForce RTX 3090 from Yahoo Japan Auction is about $100 cheaper than buying from eBay), I need to justify my purchases. A certain Hacker News post gave me an idea. I spent 10 years analyzing various malware at my previous job. Let’s see if I can put my malware analyst self from 15 years ago out of a job.

Malware Analyst

My previous company spends 8 months (!!) training new employees in various reverse engineering techniques. Since the firm is a security firm, the training was geared towards analyzing malicious binaries. You are basically being paid to crack a constant stream of crackmes. Fun times.

Malware binaries are rather unlike normal programs, for the following reasons:

- The malware writers probably do not have a QA department, so the code quality varies.
- Malware needs to constantly mutate to evade detection, so the authors can’t be too ‘creative’ or else they won’t be able to release new variants fast enough. The bulk of AV detection still relies on signature-based detection, so you just need to change enough to evade it. (Yes, AI behavioral analysis has been the hype since forever, but if a binary is not doing anything (maybe it’s dormant), how can an AV say for certain the binary is malicious?)
- Most malware is not the first stage payload. The first stage payload is what executes the exploit against the victim system (say, a malicious .docx file that crashes Word’s COM serializer). After the first stage gains access to the system, other malware can be (down)loaded. This later-stage malware doesn’t need to be as advanced as the first stage payload.
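To make the signature point above concrete, here is a minimal sketch, assuming the crudest possible signature: an exact SHA-256 match against a blocklist. Real AV signatures are usually byte patterns or fuzzy hashes rather than whole-file hashes, so treat this as an illustration only. The byte buffer is a harmless made-up stand-in, not a real sample, and the names `sha256_hex`, `flip_one_bit` and `matches_blocklist` are hypothetical helpers.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_one_bit(data: bytes, offset: int) -> bytes:
    """Return a copy of data with the lowest bit of one byte flipped."""
    mutated = bytearray(data)
    mutated[offset] ^= 0x01
    return bytes(mutated)


def matches_blocklist(data: bytes, blocklist: set[str]) -> bool:
    """Exact-hash 'signature' check: True only if the digest is already known."""
    return sha256_hex(data) in blocklist


# Harmless stand-in for a binary (assumption: not a real sample).
original = b"MZ" + bytes(range(256)) * 4
variant = flip_one_bit(original, offset=100)

# Pretend the AV vendor has already seen and fingerprinted the original.
blocklist = {sha256_hex(original)}

print("original flagged:", matches_blocklist(original, blocklist))
print("variant flagged: ", matches_blocklist(variant, blocklist))
```

The two buffers differ in a single bit, yet only the original is on the blocklist. That is the whole game for exact-match signatures, and the reason similarity tools exist.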
Malware Analysis

My former employer is the first result when I Google the term “Malware Analysis”. I will use them as the guide. Since this post is not intended to be a full-blown malware analysis course, I will keep this section short. I should also add that this post has nothing to do with my previous employer.

Malware analysis is broken down into the following steps:

1. Static Code Analysis

Here we are just trying to gain information from the binary itself, without running it. Items we can extract and infer include:

- Strings such as web domains, commands, application names, file names
- Library/function imports/exports
- Compiler metadata
- Hashes
- Similarity to other binaries (e.g. with BinDiff and ssdeep)

In addition, more in-depth info can be found in the code:

- Common encryption routines/PKI keys
- Any anti-debug techniques (this is an excellent overview of techniques)
- Code obfuscation (often has its own encryption/decryption routines)
- System call map
- Custom VM/interpreter implementation for extra functionality obfuscation
- Memory scanner
- Communication with other peers/C&C

The more info we can glean in this phase, the easier the next phase will be.

2. Dynamic Code Analysis

Here is where we actually run the binary (hopefully in a sandbox) and see what happens. During my time I devised a few different ways to conduct the test, but the main idea is to capture all the system calls, IPC/RPC channels such as named pipes or Netlink messages, all network communication, and all file system accesses (which would show up in the system calls).

Note that most reputable security companies would not allow their analysts to modify the malicious binary, even for the sake of analysis. Imagine you patched out the first stage decryption routine of a malware sample for analysis, but that sample somehow escaped your test system (owned by your corporation!) and wreaked havoc on the internet. Who is Bruce Schneier going to blame? This extends to many of the resources you might find in the code.
For example, the WannaCry malware has an unregistered kill-switch domain stored in clear text in the code. During analysis many researchers discovered it immediately, but we weren’t allowed to actually register the domain. Other people are more adventurous. All I can say is that it was a good thing the domain was a kill switch, not a trigger for a complete system wipe. I might still be a bit bitter, heh.

Anyway, dynamic code analysis will not be the focus of this post.

The Tools

Fortinet’s post on malware analysis already has an excellent list of tools, so I am not going to add much. There are more useful tools you can use, including unpackers, debuggers, code injectors (remember, no writing to the binary), syscall monitors (Sysmon and Process Monitor), network traffic analysers (e.g. Burp Suite, but I prefer Fiddler for malware analysis), any sandbox systems you care to run, etc. This post will focus on Ghidra, since no one likes paying for IDA, and Binary Ninja, while great and cheap, still costs money to use commercially.

Ghidra

Ghidra is the first open source decompiler that produces usable decompiled code. It does many of the things IDA can do, albeit more slowly. I am glad I don’t have to spend any more time doing you-know-what to you-know-what.

[Image: Ghidra]

Ghidra does everything you need to analyze malware. It can dump all the strings in the code, both Unicode and classic variants. It can disassemble machine code to assembly, then decompile it into pseudo-C. It can display the import and export tables. It can cross-reference resources in different segments. It works across all the ISAs you will care about (x86, ARM, MIPS, PPC). It is even scriptable.

The scriptable part is what’s interesting. If you can control aspects of a debugger, then you can get very far in your debugging adventure.
Even WinDbg, with its hideous legacy scripting syntax, is pretty much a requirement to master if you want to get anywhere in the Windows kernel (yes, you can also use the Python script plugin, but it is not as reliable imho).

But I don’t want to write scripts. I might not even know what I am looking for. Can I just… ask an LLM to do the work?

LLM MCP

Some kind-hearted souls have written Ghidra plugins that expose the internals of Ghidra via MCP servers. Here is the list of tools I found on GitHub:

- https://github.com/LaurieWired/GhidraMCP
- https://github.com/bethington/ghidra-mcp/ (a fork of the above)
- https://github.com/cyberkaida/reverse-engineering-assistant (ReVa)
- https://github.com/symgraph/GhidrAssistMCP

If you are not familiar with MCP, just think of it as an API for LLM agents to communicate with. The protocol is nothing special, other than the clear lack of thought given to security or any notion of secure design by its creators. But I digress.

The ghidra-mcp tool claims to support 179 tools (!!), all powered by Ghidra. I am not a scientist myself, but I believe if a little is good, a lot is better. I am also not going to sit and go through all 179 tools and learn each of them inside out (I want to, but I recently got laid off, so other things come first). Perfect for an LLM, then.

The Setup

I have created a sandbox VM environment for maximum protection (Debian 13, if you care). You can use Windows as well, but some malware exploits vulnerabilities in Explorer and can result in no-click exploitation. As such I usually recommend doing your analysis on Linux. Finally, a good reason to have 5 different window managers.

I have the following hardware:

- 2x GeForce RTX 3090 24GB
- i7-1260P 12th gen Intel CPU
- 64GB RAM

I have installed the following applications:

- XFCE
- Ghidra 12.04
- 3 of the 4 plugins above (ghidra-mcp is a fork of GhidraMCP)
- Python 3.10 for 1 of the plugins’ MCP server
- x2go for remote GUI access

On my LLM side, llama.cpp recently added support for MCP.
I will be using:

- llama.cpp b8226
- Qwen 3.5 27b Q4 model in Thinking mode with 200k token context (if you use any llama.cpp version after b8226, you can’t disable Thinking mode on Qwen 3 due to a bug). I have also loaded its vision mmproj, but this should not matter.
- DuckDuckGo MCP
- Caddy proxy to get around CORS protection when using llama.cpp’s web interface. llama.cpp’s MCP implementation supports a CORS proxy, but it has been buggy. All the issues should have been fixed by now, ymmv.

[Images: llama.cpp working; llama.cpp web interface; MCPs added]

Malware Sources

One thing you might (or might not) find surprising is that it’s hard to get the latest malware samples if you don’t have the right connections. VirusTotal, the industry standard for malware identification, does not provide sample downloads to the general populace. You can upload anything you want, but once it’s on their servers, you can’t get anything back. And even if you are part of the industry, they have strict upload/download ratio requirements that you must adhere to. As such I will use an alternative source, MalwareBazaar. Thank you, folks at MalwareBazaar.

Looking at the sample list, Mirai as always continues to top the chart in terms of popularity. Initially I wanted to use historically infamous malware (ILOVEYOU, WannaCry, the Target PoS malware BlackPoS, EternalBlue, etc.), but MalwareBazaar doesn’t have them. So instead I will just use what I find on their latest sample page.

Sample 1: ValleyRAT

Getting the Sample

I randomly picked a RAT (Remote Access Trojan) sample from the page, called ValleyRAT. According to my excellent former colleagues at Fortinet, ValleyRAT is a multi-stage malware that utilizes diverse techniques to monitor and control its victims and deploy arbitrary plugins to cause further damage. Another noteworthy characteristic of this malware is its heavy usage of shellcode to execute its many components directly in memory, significantly reducing its file footprint in the victim’s system.
On its Signature page, I picked an exe sample (SHA256: 82135c88fd0af3de7551f55d6cf7670e206b9383fc0a0d0b4166fbbc8c5ace27).

[Images: 4th entry; Sample page]

VirusTotal claims that, as of 2 hours ago, 40/71 participating AV scanners have detected the sample.

[Image: VirusTotal Sample Page]

Note that the sample has already been analysed by an LLM agent and the following has been discovered:

It ensures persistence by adding itself to the Windows 'Run' registry key under the value 'qidong' (a common artifact in Chinese-authored malware). The binary performs a mutex check ('MyUniqueMutexName') to avoid multiple instances and then uses wininet.dll to download a payload from 'http://38.148.242.188/output_86.bin'. It employs an anti-analysis execution technique: it registers a Vectored Exception Handler (VEH), intentionally triggers an access violation, and within the handler, it allocates RWX memory via VirtualAlloc to copy and execute the downloaded shellcode by modifying the exception context EIP. Relevant IOCs include the C2 IP 38.148.242.188 and the registry value 'qidong'.

Let’s see if my local LLM setup also picks up on these signals.

Loading the Sample

One of the plugins allows the user to load the sample remotely. I prefer to do it manually to make sure the sample is at a known location.

[Image: Let Ghidra analyze the binary when importing]

Fun with the Sample

Now that the sample is loaded into Ghidra and analyzed, let’s switch to llama.cpp’s web interface. I have enabled the DuckDuckGo MCP and the 3 Ghidra MCPs in the MCP configuration popup. I have also changed “Agentic loop max turns” to 1000. I am running this on a local LLM, so I don’t care how many tokens the analysis will consume.

In the chatbox, I use the following prompt to start the analysis:

I have loaded a binary file 82135c88fd0af3de7551f55d6cf7670e206b9383fc0a0d0b4166fbbc8c5ace27 into Ghidra. Analyze the file and tell me what you found.

[Image: LLM analysing the binary]

After 5 minutes, the initial analysis is done.
Based on my analysis of the binary file 82135c88fd0af3de7551f55d6cf7670e206b9383fc0a0d0b4166fbbc8c5ace27.exe, here’s what I found:

Summary: This appears to be a MALICIOUS executable (likely a trojan/downloader)

Key Technical Details:

- File Size: 142KB (142,180 bytes)
- Architecture: 32-bit Windows executable (x86)
- Compiler: Visual Studio 2010
- Functions: 431 total functions
- Entry Point: Standard C runtime startup (___tmainCRTStartup)

🚨 Malicious Indicators Found:

- Persistence Mechanism:
  - String: "Software\\Microsoft\\Windows\\CurrentVersion\\Run" - Registry key for auto-start on Windows boot
- Network C2 Communication:
  - String: "http://38.148.242.188/output_86.bin" - Downloads a second-stage payload from a remote server
  - Strings related to WinINET API: InternetOpenA, InternetOpenUrlA, InternetReadFile, InternetCloseHandle
  - References to wininet.dll
- Anti-Analysis Techniques:
  - IsDebuggerPresent - Detects if running under a debugger
  - AddVectoredExceptionHandler - Exception handling often used for anti-debugging
- Process Injection/Loading:
  - Strings: VirtualAlloc, LoadLibraryA, GetProcAddress - Dynamic API loading for stealth
- Mutex for Single Instance:
  - String: "MyUniqueMutexName" with CreateMutexA - Ensures only one instance runs
- Author/Group Signature:
  - String: "qidong" - Possibly the author or group name
- Standard Malware Behaviors:
  - HeapSetInformation with termination on corruption
  - Console control handler installation (SetConsoleCtrlHandler)
  - Runtime check failures (RTC messages in debug builds)

Function Analysis:

The main entry flow goes: entry → ___tmainCRTStartup → thunk_FUN_004147ee (likely the main malware logic)

Recommendation:

This file should be treated as MALWARE. It exhibits classic downloader/dropper behavior:

- Establishes persistence via Windows Registry Run key
- Downloads additional payload from C2 server
- Uses anti-debugging techniques
- Dynamically loads APIs to evade detection

Do not execute this file.
If this was found on a system, the system should be considered compromised and isolated for forensic analysis.

It’s more or less what the VirusTotal AI blurb says as well. Let’s do a deeper dive.

Prompt: Find all the interesting strings you can find in the binary.

Response:

CONCLUSION

This is clearly malware with the following capabilities:

- Downloader/Dropper - Downloads output_86.bin from 38.148.242.188
- Persistence - Installs to Windows Registry Run key
- Anti-Analysis - Uses debugger detection and exception handlers
- Dynamic Loading - Resolves APIs at runtime to evade signature-based detection
- Single Instance - Uses mutex to prevent multiple copies

The IP address 38.148.242.188 should be investigated/blocked. This appears to be a known malicious C2 infrastructure.

[Image: My response]

Let’s find out what the malware actually does.

Prompt: What are the capabilities of this malware?

I didn’t copy most of the output. Suffice it to say, it would take me a couple of hours to produce the same data from manual static analysis. The sample is only a second stage payload, so it doesn’t do anything interesting beyond downloading more payloads for further exploitation. Remember that the VirusTotal page only has 40/71 AV companies detecting the sample. I think it’s safe to say that reverse engineering skills are at risk from LLMs.

EDIT: Someone asked me how much the analysis cost. I added it all up: the prompts and the analysis output came to about 8000 tokens and consumed 60k of context. According to this calculator, 8000 tokens is about $0.201 (roughly 20 cents) on Opus 4.5. But since I used a local LLM, I suppose about 2 cents of electricity? The malware is so simple I don’t think you need Opus or Sonnet for this.

To Be Continued

Part 2 will include more samples for different architectures, and maybe even a mystery sample. Stay tuned.
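A footnote on the string-hunting prompt above: if you want to sanity-check what the LLM reports without any tooling, a rough first pass can be sketched in plain Python. This is only an illustration under stated assumptions. The function names and regexes are hypothetical and deliberately loose, only ASCII and UTF-16LE strings are handled, and the demo buffer is a synthetic blob containing the indicators quoted earlier, not the actual sample. Ghidra’s own string view does a much better job.

```python
import re

# Runs of 4+ printable ASCII characters.
ASCII_RE = re.compile(rb"[\x20-\x7e]{4,}")
# Runs of 4+ printable ASCII characters stored as UTF-16LE (char, NUL).
UTF16_RE = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")

# Hypothetical, deliberately loose indicator patterns.
IOC_PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "run_key": re.compile(r"CurrentVersion\\+Run", re.IGNORECASE),
}


def extract_strings(data: bytes) -> list[str]:
    """Return printable ASCII and UTF-16LE strings found in data."""
    found = [m.group().decode("ascii") for m in ASCII_RE.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in UTF16_RE.finditer(data)]
    return found


def find_iocs(strings: list[str]) -> dict[str, set[str]]:
    """Group regex hits by indicator type."""
    hits: dict[str, set[str]] = {name: set() for name in IOC_PATTERNS}
    for s in strings:
        for name, pattern in IOC_PATTERNS.items():
            hits[name].update(pattern.findall(s))
    return hits


# Synthetic demo buffer (assumption: stands in for the sample, which it is not).
demo = (
    b"\x00\x01http://38.148.242.188/output_86.bin\x00\xff"
    + "Software\\Microsoft\\Windows\\CurrentVersion\\Run".encode("utf-16-le")
    + b"\x00\x00\x02MyUniqueMutexName\x00"
)

iocs = find_iocs(extract_strings(demo))
for name, values in iocs.items():
    print(name, sorted(values))
```

Run against a real file, you would read the bytes with `open(path, "rb").read()` inside the sandbox VM and expect far more noise than this toy buffer produces. The regexes only surface candidates. Deciding that 'qidong' is a registry value name rather than an author tag still takes a human (or an LLM) reading the code around it.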