Deobfuscating JavaScript Malware Using Abstract Syntax Trees

nikhilh-20.github.io · Nikhil Hegde · 1 year ago · research
Happy Diwali!

At the time of publishing this blog, it is the Hindu festival of Diwali, which symbolizes the victory of light over darkness. I wish you and your family a very happy and prosperous Diwali!

Image credits: https://i.pinimg.com/474x/ab/3a/46/ab3a46973f797165f2b2f3776405b21c.jpg

Metadata

MintsLoader SHA256: 013296e33be6f884abaaeabab90a3d3089b7411c7f934f5d978917b3b789a635

Table of Contents

- Introduction
- Abstract Syntax Trees
- Installing Prerequisites
- Deobfuscation using ASTs
  - Remove Comment Lines
  - Remove Unused Variables
  - Simplifying Functions
  - Signaturizing ASTs
- Bringing It All Together to Deobfuscate MintsLoader
- Summary

Introduction

Malware based on scripting languages such as VBScript and JavaScript (JS) can be heavily obfuscated. Statically analyzing such malware can take considerable time and is often a frustrating experience for analysts. In this blog, we'll explore how abstract syntax trees (ASTs) can be leveraged to deobfuscate JS-based malware. We'll then apply these techniques to partially deobfuscate a recently reported JS-based malware called MintsLoader.

Disclosure: The code was generated by an LLM. I provided the ideas and directed the model to create the relevant code. Without this assistance, it would have taken me longer to codify the concept, given that I'm not deeply versed in JS.

Abstract Syntax Trees

When we look at any language we understand, whether a natural language like English or a programming language like JavaScript, we can identify the individual elements that make up a sentence or statement of that language. For example, consider this English sentence:

    I ate an apple.

Some people can break down this sentence into parts of speech like nouns and verbs. Others may not know each element's label but can still recognize the individual components of the sentence.
In compiler theory, each of these individual elements is called a token, and the process of breaking source code into tokens is known as tokenization, or lexical analysis. A token might be a mathematical operator like +, an identifier like a variable name, and so on. Naturally, each language has its own set of tokens.

Once source code is broken into tokens, we need a way to preserve its syntactical structure. For example, it would be problematic if the structure of the sentence, I ate an apple, were lost during compilation. An AST represents this syntactical structure in a tree format, with each node representing a syntactic construct in the source code. Consider the following JS source code:

    let x = 6;
    let y = x + 2;

The figure below shows two AST visualizations. The one on the left was generated using AST Explorer, and the one on the right was generated using JointJS. Each node represents a syntactic construct. In this example:

- VariableDeclarator associates an identifier (variable name) with a value or expression.
- Literal represents a fixed value like a number, string, or boolean.
- BinaryExpression represents an expression with two operands and an operator, like addition (+).

And that is the extent of my knowledge about ASTs. With just this basic information, we are now capable of performing JavaScript deobfuscation.

Installing Prerequisites

In this blog, our goal is to develop JS deobfuscation tooling, which will require certain JS libraries. I'm using a Windows 10 host and npm for installing these libraries. We need to install three libraries:

- esprima: a library for generating ASTs from JS code.
- estraverse: a library for traversing and analyzing a given AST.
- escodegen: a library for converting an AST back into source code.
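To make the division of labor between these libraries a little more concrete, here is a dependency-free sketch. It is not how esprima or estraverse are implemented: the `ast` object is a hand-written, simplified stand-in for the tree a parser would return for `let x = 6; let y = x + 2;` (location data and other fields are omitted), and `walk` is a hypothetical helper that mimics the idea of calling an "enter" callback on every node.

```javascript
// Hand-written stand-in for the tree a parser such as esprima would return
// for: let x = 6; let y = x + 2;
// Assumption: simplified ESTree shape, many real fields are omitted.
const ast = {
  type: "Program",
  body: [
    {
      type: "VariableDeclaration",
      kind: "let",
      declarations: [
        {
          type: "VariableDeclarator",
          id: { type: "Identifier", name: "x" },
          init: { type: "Literal", value: 6 },
        },
      ],
    },
    {
      type: "VariableDeclaration",
      kind: "let",
      declarations: [
        {
          type: "VariableDeclarator",
          id: { type: "Identifier", name: "y" },
          init: {
            type: "BinaryExpression",
            operator: "+",
            left: { type: "Identifier", name: "x" },
            right: { type: "Literal", value: 2 },
          },
        },
      ],
    },
  ],
};

// Hypothetical helper: depth-first, pre-order walk over every object that
// has a string "type" field, calling enter(node, depth) on each one.
function walk(node, enter, depth = 0) {
  if (node === null || typeof node !== "object") return;
  if (Array.isArray(node)) {
    node.forEach((child) => walk(child, enter, depth));
    return;
  }
  if (typeof node.type !== "string") return;
  enter(node, depth);
  for (const key of Object.keys(node)) {
    walk(node[key], enter, depth + 1);
  }
}

// Record and print the node types in visiting order, indented by depth.
const visited = [];
walk(ast, (node, depth) => {
  visited.push(node.type);
  console.log("  ".repeat(depth) + node.type);
});
```

Running this prints an indented outline of node types, with the BinaryExpression for `x + 2` sitting under the second VariableDeclarator. The scripts in the rest of this blog use the real libraries instead of these stand-ins.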
All three can be installed with a single npm command:

    > npm install esprima estraverse escodegen

    added 5 packages, and audited 6 packages in 979ms

    found 0 vulnerabilities

Deobfuscation using ASTs

Remove Comment Lines

Consider the following JS source code:

    // This is a comment
    var x = 6;
    var y = 2;

Our goal is to transform the source code into an AST, print the variable names, and then transform the AST back into source code. The JS code below achieves this.

    const esprima = require("esprima");
    const estraverse = require("estraverse");
    const escodegen = require("escodegen");
    const fs = require("fs");

    function print_variable_names(ast) {
        estraverse.traverse(ast, {
            enter(node, parent) {
                if (node.type === "Identifier")
                    console.log(node.name);
            }
        });
    }

    function process_js_file(input_path) {
        // Read source code from input JS file
        const code = fs.readFileSync(input_path, "utf-8");
        console.log("Source code in file:");
        console.log(code);

        // Generate AST
        const ast = esprima.parseScript(code);

        // Print variable names by traversing AST
        console.log("\n\nVariable names:");
        print_variable_names(ast);

        // Generate JS source code from AST
        const cleaned_code = escodegen.generate(ast);
        console.log("\n\nSource code:");
        console.log(cleaned_code);
    }

    process_js_file("test.js");

The output of the above is shown below. An astute reader will notice that the source code generated from the AST no longer contains comments. esprima does not retain comments in the AST unless asked to, so simply transforming JS source code into an AST and then back removes comment lines, which are often used in JS-based malware as an anti-analysis layer for obfuscation and to artificially increase file size.

    > node .\testBuilder.js
    Source code in file:
    // This is a comment
    var x = 6;
    var y = 2;


    Variable names:
    x
    y


    Source code:
    var x = 6;
    var y = 2;

Remove Unused Variables

Consider the JS source code below. It is obvious that the variables y and z are initialized once and never reused.
This kind of code falls under the umbrella of junk code. Like comments, junk code serves as an obfuscation layer, but it is more problematic because you must check the entire source code to ensure the variables haven't been referenced elsewhere.

    var x = 6;
    var y = 2;
    var z = x + 2;
    x += 2;

Our goal is to transform the source code into an AST, identify unused variables, remove them from the AST, and then transform the AST back into source code. The JS code below achieves this. Note that it compares variable names only, without any scope analysis, and it discards the initializer of every removed variable.

    const esprima = require("esprima");
    const estraverse = require("estraverse");
    const escodegen = require("escodegen");
    const fs = require("fs");

    function find_declared_variables(ast) {
        const declared_vars = new Set();
        estraverse.traverse(ast, {
            enter(node) {
                // Find variable declarations and track their names
                if (node.type === "VariableDeclarator" && node.id.type === "Identifier")
                    declared_vars.add(node.id.name);
            }
        });
        return declared_vars;
    }

    function find_used_variables(ast) {
        const used_vars = new Set();
        estraverse.traverse(ast, {
            enter(node, parent) {
                // Only add an identifier to used_vars if it is not the name being declared
                if (node.type === "Identifier" &&
                    !(parent && parent.type === "VariableDeclarator" && parent.id === node))
                    used_vars.add(node.name);
            }
        });
        return used_vars;
    }

    function remove_unused_variables(ast, declared_vars, used_vars) {
        // unused_vars are those variables which were declared but not reused
        const unused_vars = new Set([...declared_vars].filter(var_name => !used_vars.has(var_name)));
        estraverse.replace(ast, {
            enter(node) {
                // Get the declaration statement
                if (node.type === "VariableDeclaration") {
                    /* Remove declaration statements of unused variables and
                       replace the current node's declarations with the result */
                    node.declarations = node.declarations.filter(decl => !unused_vars.has(decl.id.name));

                    // If the unused variable was the only declaration, then remove the entire statement
                    if (node.declarations.length === 0)
                        // Remove current node from parent array
                        return this.remove();
                }
            }
        });
    }

    function process_js_file(input_path) {
        // Read source code from input JS file
        const code = fs.readFileSync(input_path, "utf-8");
        console.log("Source code in file:");
        console.log(code);

        // Generate AST
        const ast = esprima.parseScript(code);

        // Find all declared and used variables
        // Then remove unused variables based on the difference
        const declared_vars = find_declared_variables(ast);
        const used_vars = find_used_variables(ast);
        remove_unused_variables(ast, declared_vars, used_vars);

        // Generate JS source code from AST
        const cleaned_code = escodegen.generate(ast);
        console.log("\n\nCleaned source code:");
        console.log(cleaned_code);
    }

    process_js_file("test.js");

The output of the above is shown below. As we can see, the unused variables y and z no longer appear in the cleaned code.

    > node .\testBuilder.js
    Source code in file:
    var x = 6;
    var y = 2;
    var z = x + 2;
    x += 2;


    Cleaned source code:
    var x = 6;
    x += 2;

Simplifying Functions

Consider the JS source code below. To someone knowledgeable in JS, it is quickly apparent that wrapper_func() simply returns the array stored in variable x. However, this is still a roundabout way of implementing the function. Like the other techniques discussed previously, it serves as an obfuscation layer, consuming more of an analyst's time.

    function wrapper_func() {
        var x = ["this", "is", "a", "wrapper"];
        wrapper_func = function () {
            return x;
        };
        return wrapper_func();
    }
    var x = wrapper_func();

Our goal is to transform the source code into an AST, simplify functions like the one above, and then transform the AST back into source code. The JS code below achieves this.
    const esprima = require("esprima");
    const estraverse = require("estraverse");
    const escodegen = require("escodegen");
    const fs = require("fs");

    function simplify_self_invoking_functions(ast) {
        estraverse.replace(ast, {
            enter(node) {
                // Check if a function has exactly three statements
                if (node.type === "FunctionDeclaration" && node.body.body.length === 3) {
                    const [var_decl, self_assign, return_call] = node.body.body;

                    // Check if the first statement is a variable declaration
                    if (var_decl.type === "VariableDeclaration" && var_decl.declarations.length === 1) {
                        const var_declarator = var_decl.declarations[0];

                        /* Check if the second statement reassigns the current function's name
                           to a new function that only returns the previously declared variable.
                           The `argument &&` guard avoids a TypeError on a bare `return;` */
                        if (self_assign.type === "ExpressionStatement" &&
                            self_assign.expression.type === "AssignmentExpression" &&
                            self_assign.expression.operator === "=" &&
                            self_assign.expression.left.name === node.id.name &&
                            self_assign.expression.right.type === "FunctionExpression" &&
                            self_assign.expression.right.body.body.length === 1 &&
                            self_assign.expression.right.body.body[0].type === "ReturnStatement" &&
                            self_assign.expression.right.body.body[0].argument &&
                            self_assign.expression.right.body.body[0].argument.name === var_declarator.id.name) {

                            // Check if the third statement is a return statement that calls the modified function
                            if (return_call.type === "ReturnStatement" &&
                                return_call.argument &&
                                return_call.argument.type === "CallExpression" &&
                                return_call.argument.callee.name === node.id.name) {
                                /* All conditions have been satisfied. Simplify the current function
                                   to a single return statement returning the variable's initialization value */
                                return {
                                    type: "FunctionDeclaration",
                                    id: node.id,
                                    params: [],
                                    body: {
                                        type: "BlockStatement",
                                        body: [
                                            {
                                                type: "ReturnStatement",
                                                argument: var_declarator.init,
                                            },
                                        ],
                                    },
                                    generator: false,
                                    expression: false,
                                    async: false,
                                };
                            }
                        }
                    }
                }
            },
        });
    }

    function process_js_file(input_path) {
        // Read source code from input JS file
        const code = fs.readFileSync(input_path, "utf-8");
        console.log("Source code in file:");
        console.log(code);

        // Generate AST
        const ast = esprima.parseScript(code);

        // Simplify self-invoking functions
        simplify_self_invoking_functions(ast);

        // Generate JS source code from AST
        const cleaned_code = escodegen.generate(ast);
        console.log("\n\nCleaned source code:");
        console.log(cleaned_code);
    }

    process_js_file("test.js");

The output of the above is shown below. As we can see, the obfuscated function has been simplified.

    > node .\testBuilder.js
    Source code in file:
    function wrapper_func() {
        var x = ["this", "is", "a", "wrapper"];
        wrapper_func = function () {
            return x;
        };
        return wrapper_func();
    }
    var x = wrapper_func();


    Cleaned source code:
    function wrapper_func() {
        return [
            'this',
            'is',
            'a',
            'wrapper'
        ];
    }
    var x = wrapper_func();

Signaturizing ASTs

Consider the JS source code below. We would like to analyze the function func_name once and then programmatically recognize it the next time we encounter it. After analysis, I determined that func_name decodes a hex string into an array of decimal byte values.

    function func_name(arg) {
        if (typeof arg !== "string") {
            throw new Error("Check input");
        }
        if (arg.length % 2 !== 0) {
            throw new Error("Incorrect length");
        }
        var decimal_array = [];
        for (var i = 0; i < arg.length; i += 2) {
            var slice = arg.substr(i, 2);
            var decimal_val = 0;
            for (var j = 0; j < slice.length; j++) {
                decimal_val <<= 4;
                var c = slice.charAt(j);
                if (c >= "0" && c <= "9") {
                    decimal_val |= (c.charCodeAt(0) - "0".charCodeAt(0));
                } else if (c >= "A" && c <= "F") {
                    decimal_val |= (c.charCodeAt(0) - "A".charCodeAt(0) + 10);
                } else if (c >= "a" && c <= "f") {
                    decimal_val |= (c.charCodeAt(0) - "a".charCodeAt(0) + 10);
                } else {
                    throw new Error("Throw something");
                }
            }
            console.log(decimal_val)
            decimal_array.push(decimal_val);
        }
        return decimal_array;
    }
    var x = func_name("4D616C776172652026204D7573696E6773")

On execution, x will contain the decimal array below.

    [ 77, 97, 108, 119, 97, 114, 101, 32, 38, 32, 77, 117, 115, 105, 110, 103, 115 ]

Our goal is to transform the source code into an AST, identify the sequence of statements (in other words, a signature) that represents hex decoding, rename the function and its references accordingly, and then transform the AST back into source code. The JS code below achieves this.

    const esprima = require("esprima");
    const estraverse = require("estraverse");
    const escodegen = require("escodegen");
    const fs = require("fs");

    function matches_hex_decode_signature(function_node) {
        let step = 0;
        let numeric = false, upper_alpha = false, lower_alpha = false;

        estraverse.traverse(function_node.body, {
            enter(node) {
                switch (step) {
                    case 0:
                        // The first pattern to match is the update expression in the for loop
                        if (node.type === "ForStatement" &&
                            node.update &&
                            node.update.type === "AssignmentExpression" &&
                            node.update.operator === "+=" &&
                            node.update.right.type === "Literal" &&
                            node.update.right.value === 2)
                            step++;
                        break;
                    case 1:
                        // The second pattern to match is a substr() call
                        if (node.type === "CallExpression" &&
                            node.callee &&
                            node.callee.property &&
                            node.callee.property.name === "substr" &&
                            node.arguments.length === 2 &&
                            node.arguments[1].type === "Literal" &&
                            node.arguments[1].value === 2)
                            step++;
                        break;
                    case 2:
                        // The third pattern to match is the update expression in another for loop
                        if (node.type === "ForStatement" &&
                            node.update &&
                            node.update.type === "UpdateExpression" &&
                            node.update.operator === "++")
                            step++;
                        break;
                    case 3:
                        // The fourth pattern to match is a charAt() call
                        if (node.type === "CallExpression" &&
                            node.callee &&
                            node.callee.property &&
                            node.callee.property.name === "charAt" &&
                            node.arguments.length === 1 &&
                            node.arguments[0].type === "Identifier")
                            step++;
                        break;
                    case 4:
                        /* The fifth, sixth and seventh patterns can occur in any order
                           and involve comparison with alphanumeric characters */
                        if (node.type === "BinaryExpression") {
                            if ((node.operator === ">=" && node.right.type === "Literal" && node.right.value === "0") ||
                                (node.operator === "<=" && node.right.type === "Literal" && node.right.value === "9"))
                                numeric = true;
                            if ((node.operator === ">=" && node.right.type === "Literal" && node.right.value === "A") ||
                                (node.operator === "<=" && node.right.type === "Literal" && node.right.value === "F"))
                                upper_alpha = true;
                            if ((node.operator === ">=" && node.right.type === "Literal" && node.right.value === "a") ||
                                (node.operator === "<=" && node.right.type === "Literal" && node.right.value === "f"))
                                lower_alpha = true;
                        }
                        if (numeric && upper_alpha && lower_alpha)
                            step++;
                        break;
                }
            }
        });

        // Only return true if all steps in the sequence matched
        return step === 5;
    }

    function rename_function(ast) {
        let original_function_name = null;

        // First pass: rename the function itself if it matches the signature
        estraverse.traverse(ast, {
            enter(node) {
                if (node.type === "FunctionDeclaration") {
                    if (matches_hex_decode_signature(node)) {
                        original_function_name = node.id.name;
                        node.id.name = "hex_decode";
                    }
                }
            },
        });

        // Second pass: rename any CallExpression references to the original function
        if (original_function_name) {
            estraverse.traverse(ast, {
                enter(node) {
                    if (node.type === "CallExpression" &&
                        node.callee.type === "Identifier" &&
                        node.callee.name === original_function_name)
                        node.callee.name = "hex_decode";
                },
            });
        }
    }

    function process_js_file(input_path) {
        // Read source code from input JS file
        const code = fs.readFileSync(input_path, "utf-8");
        console.log("Source code in file:");
        console.log(code);

        // Generate AST
        const ast = esprima.parseScript(code);

        // Rename function based on signature
        rename_function(ast);

        // Generate JS source code from AST
        const cleaned_code = escodegen.generate(ast);
        console.log("\n\nCleaned source code:");
        console.log(cleaned_code);
    }

    process_js_file("test.js");

The output of the above is shown below. This can be considered function similarity detection via signatures. There are two important points to note:

- The above "signature" is not the only viable one; another analyst might write a different signature, and that's okay.
- Different implementations of the function may require different signatures, meaning this detection mechanism requires significant maintenance.
    > node .\testBuilder.js
    Source code in file:
    function func_name(arg) {
        if (typeof arg !== "string") {
            throw new Error("Check input");
        }
        if (arg.length % 2 !== 0) {
            throw new Error("Incorrect length");
        }
        var decimal_array = [];
        for (var i = 0; i < arg.length; i += 2) {
            var slice = arg.substr(i, 2);
            var decimal_val = 0;
            for (var j = 0; j < slice.length; j++) {
                decimal_val <<= 4;
                var c = slice.charAt(j);
                if (c >= "0" && c <= "9") {
                    decimal_val |= (c.charCodeAt(0) - "0".charCodeAt(0));
                } else if (c >= "A" && c <= "F") {
                    decimal_val |= (c.charCodeAt(0) - "A".charCodeAt(0) + 10);
                } else if (c >= "a" && c <= "f") {
                    decimal_val |= (c.charCodeAt(0) - "a".charCodeAt(0) + 10);
                } else {
                    throw new Error("Throw something");
                }
            }
            console.log(decimal_val)
            decimal_array.push(decimal_val);
        }
        return decimal_array;
    }
    var x = func_name("4D616C776172652026204D7573696E6773")


    Cleaned source code:
    function hex_decode(arg) {
        if (typeof arg !== 'string') {
            throw new Error('Check input');
        }
        if (arg.length % 2 !== 0) {
            throw new Error('Incorrect length');
        }
        var decimal_array = [];
        for (var i = 0; i < arg.length; i += 2) {
            var slice = arg.substr(i, 2);
            var decimal_val = 0;
            for (var j = 0; j < slice.length; j++) {
                decimal_val <<= 4;
                var c = slice.charAt(j);
                if (c >= '0' && c <= '9') {
                    decimal_val |= c.charCodeAt(0) - '0'.charCodeAt(0);
                } else if (c >= 'A' && c <= 'F') {
                    decimal_val |= c.charCodeAt(0) - 'A'.charCodeAt(0) + 10;
                } else if (c >= 'a' && c <= 'f') {
                    decimal_val |= c.charCodeAt(0) - 'a'.charCodeAt(0) + 10;
                } else {
                    throw new Error('Throw something');
                }
            }
            console.log(decimal_val);
            decimal_array.push(decimal_val);
        }
        return decimal_array;
    }
    var x = hex_decode('4D616C776172652026204D7573696E6773');

Bringing It All Together to Deobfuscate MintsLoader

The MintsLoader JS code contains all of the obfuscation measures we previously examined. Combining our deobfuscation measures into a single script produces the before-and-after comparison shown in the following figure.
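The combined script itself is not reproduced here, so the following is only a sketch of one way such passes could be chained, not the author's implementation. It assumes each pass is a function that mutates the AST in place, and it repeats the whole list until a round changes nothing, since removing one unused variable can leave another one unused. To stay dependency-free, it uses hand-built ESTree-shaped objects and hypothetical helpers (`walk`, `remove_unused_top_level_vars`, `run_passes`) rather than esprima and estraverse.

```javascript
// Hypothetical helper: visit every ESTree-shaped node, passing its parent node.
function walk(node, enter, parent = null) {
  if (node === null || typeof node !== "object") return;
  if (Array.isArray(node)) {
    node.forEach((child) => walk(child, enter, parent));
    return;
  }
  if (typeof node.type !== "string") return;
  enter(node, parent);
  for (const key of Object.keys(node)) walk(node[key], enter, node);
}

// Simplified stand-in for the unused-variable pass: it only handles
// top-level declarations and ignores scoping and initializer side effects.
function remove_unused_top_level_vars(ast) {
  const used = new Set();
  walk(ast, (node, parent) => {
    const is_declared_name =
      parent && parent.type === "VariableDeclarator" && parent.id === node;
    if (node.type === "Identifier" && !is_declared_name) used.add(node.name);
  });
  ast.body = ast.body.filter((stmt) => {
    if (stmt.type !== "VariableDeclaration") return true;
    stmt.declarations = stmt.declarations.filter((d) => used.has(d.id.name));
    return stmt.declarations.length > 0;
  });
}

// Apply every pass in order and repeat until a full round changes nothing.
// Returns the number of rounds that were run (capped at max_rounds).
function run_passes(ast, passes, max_rounds = 10) {
  let rounds = 0;
  let before;
  do {
    before = JSON.stringify(ast);
    passes.forEach((pass) => pass(ast));
    rounds++;
  } while (JSON.stringify(ast) !== before && rounds < max_rounds);
  return rounds;
}

// Hand-built AST for: var a = 1; var b = a; var c = 2; c += 1;
const id = (name) => ({ type: "Identifier", name });
const lit = (value) => ({ type: "Literal", value });
const decl = (name, init) => ({
  type: "VariableDeclaration",
  kind: "var",
  declarations: [{ type: "VariableDeclarator", id: id(name), init }],
});
const ast = {
  type: "Program",
  body: [
    decl("a", lit(1)),
    decl("b", id("a")),
    decl("c", lit(2)),
    {
      type: "ExpressionStatement",
      expression: {
        type: "AssignmentExpression",
        operator: "+=",
        left: id("c"),
        right: lit(1),
      },
    },
  ],
};

const rounds = run_passes(ast, [remove_unused_top_level_vars]);
const remaining = ast.body
  .filter((stmt) => stmt.type === "VariableDeclaration")
  .map((stmt) => stmt.declarations[0].id.name);
console.log(rounds, remaining);
```

Here the first round removes b, the second removes a (which only b referenced), and a third round confirms nothing is left to change, so only c survives. In a real pipeline the passes array would hold the esprima-based functions from the earlier sections, in whatever order works best for the sample.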
Summary

In this blog, we explored the potential of ASTs for deobfuscating JS-based malware. Some of the techniques described are more resilient to new implementations, while others, such as AST signatures, are less resilient.