I Built a Security Scanner That Goes Beyond Regex — Here’s Why (and How)
Parag Bagade · ~10 min read · March 31, 2026 (Updated: March 31, 2026)
There's a moment every penetration tester knows well. You're reviewing a C codebase, you grep for `strcpy`, you find fifty hits, and now you have to manually trace each one to figure out which of them actually leads to a reachable, exploitable overflow. You open Burp Suite in one tab, a terminal in another, and you wonder — why doesn't a tool just do this thinking for me?
That frustration is where OverflowGuard was born.
---
The Problem With Existing Tools
I've spent years doing bug bounty hunting across HackerOne and Bugcrowd. One pattern shows up constantly: most open-source SAST tools are, at their core, glorified `grep` wrappers. They scan for dangerous function names, maybe check if they're inside a comment, and call it a day.
That works fine for junior developers writing checklists. It doesn't work for serious research.
The issues I hit repeatedly:
- False positives everywhere: A tool flags `strcpy` inside a test fixture, inside dead code behind `if(0)`, or inside a function whose input is already bounds-checked three calls up. You spend more time triaging noise than finding real bugs.
- No cross-language support: Modern projects mix Python backends, Java microservices, Go APIs, and TypeScript frontends. You need one tool that understands all of them — not five tools with five different output formats.
- No depth: Regex tells you where a dangerous call is. It doesn't tell you whether attacker-controlled data actually reaches it.
- No context: IaC misconfigurations, hardcoded secrets, vulnerable dependencies — these all live in the same repo but get scanned by completely separate tools with no unified view.
So I built OverflowGuard to fix all of that.
---
Seeing It In Action
Before diving into the internals, here's what it actually looks like when you run OverflowGuard against a directory of vulnerable sample files:
![OverflowGuard v11.0 scanning sample C files — multi-engine findings including Symbolic, Taint, and cppcheck results]
OverflowGuard v11.0 startup and first scan pass — Symbolic engine, Taint engine, cppcheck, and the automated fuzzer all running in parallel across C source files.
![Multi-file scan output showing use-after-free, heap overflow, and off-by-one findings with fuzzer crashes]
Multiple files being analyzed — use-after-free confirmed by both cppcheck (CWE-416) and Taint engine; fuzzer crashes with `AAAAAA…` payload confirm the findings are real and exploitable.
Every `[!!!]` line is a confirmed finding. Every `[~]` is a symbolic engine result with a proof attempt. Every `[!]` is a cppcheck or clang-tidy result. Multiple engines corroborating the same finding dramatically increases confidence.
---
What OverflowGuard Actually Is
OverflowGuard is a polyglot security orchestration framework — not just a scanner. It currently supports 14 programming languages (C, C++, Python, Go, Rust, Java, JavaScript, TypeScript, PHP, Ruby, C#, Kotlin, Swift, and Scala) and combines multiple layers of analysis into a single pipeline:
- Real AST parsing via tree-sitter
- CFG-based dataflow analysis with gen/kill semantics
- Symbolic execution using the Z3 SMT solver
- Source-to-sink taint tracking (Checkmarx/CodeQL-style)
- IaC scanning across Terraform, Kubernetes, Dockerfile, CloudFormation, and Ansible
- SCA (dependency vulnerability checks via OSV API)
- Secrets scanning (30+ patterns + Shannon entropy)
- SBOM generation (CycloneDX 1.4)
- SARIF 2.1.0 export for GitHub Code Scanning and Azure DevOps
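To make the entropy-based secrets item above concrete, here is a minimal Shannon-entropy scorer (the sample strings are illustrative; OverflowGuard pairs this idea with its pattern list and a tuned threshold):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Entropy in bits per character: random key-like tokens score high,
    # ordinary dictionary words score low.
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

# Illustrative strings; a real scanner combines this with regex patterns.
print(round(shannon_entropy("password"), 2))                  # 2.75 (low)
print(round(shannon_entropy("g9X2kQ7vLm4pZr8TbN1wYc5J"), 2))  # 4.58 (likely a secret)
```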
Every finding comes with a CWE, a CVSS risk score, a confidence badge, and a secure alternative code snippet — not just "this is bad," but "here's exactly what to use instead."
---
How It Works: The Analysis Pipeline
Let me walk through what actually happens when you point OverflowGuard at a target.
Stage 0 — Real AST + CFG
The first thing that runs is the tree-sitter engine. Unlike regex, tree-sitter builds a proper **syntax tree** for the code. This means OverflowGuard understands the *structure* of code — it knows the difference between a function call, a variable declaration, and a string literal. It doesn't get fooled by comments, string contents, or unusual whitespace.
From the AST, the CFG builder constructs **Control-Flow Graphs** — basic blocks connected by conditional branches, loop edges, exception handlers, and return edges. The dominator tree is computed using the Cooper-Harvey-Kennedy algorithm. This is the same foundation used by production compilers and serious academic static analyzers.
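To see why structure matters, here is a minimal sketch using Python's built-in `ast` module rather than tree-sitter (the principle is the same): a regex happily matches dangerous names inside comments and strings, while an AST walk only counts real call nodes. The sample source and dangerous-name set are invented for illustration.

```python
import ast
import re

source = '''
def handler(cmd):
    # os.system(cmd)  <- a regex would flag this comment
    msg = "never call os.system here"  # ...and this string literal
    eval(cmd)  # the only real dangerous call
'''

# Naive regex: matches the comment and the string literal too (3 hits).
regex_hits = re.findall(r'os\.system|eval', source)

# AST walk: only genuine Call nodes whose callee is a dangerous name count.
ast_hits = [
    node.func.id
    for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Name)
    and node.func.id in {"eval", "exec"}
]
print(len(regex_hits), ast_hits)  # 3 ['eval']
```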
Stage 0a — Real Dataflow Analysis
With a proper CFG, OverflowGuard runs reaching definitions analysis with gen/kill semantics. This isn't heuristic guessing — it's a proper fixpoint computation that tracks exactly where each variable was last defined as execution reaches any given point in the program.
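As a rough illustration of what a gen/kill fixpoint looks like, here is reaching definitions on a toy four-block diamond CFG (block names and definitions are invented, not OverflowGuard's internals):

```python
# Each block has GEN (definitions it creates) and KILL (definitions of the
# same variable it overwrites). Hypothetical diamond: B1 -> {B2, B3} -> B4.
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
gen  = {"B1": {"x@B1"}, "B2": {"x@B2"}, "B3": set(), "B4": set()}
kill = {"B1": set(), "B2": {"x@B1"}, "B3": set(), "B4": set()}

IN = {b: set() for b in preds}
OUT = {b: set() for b in preds}
changed = True
while changed:  # iterate the transfer functions to a fixpoint
    changed = False
    for b in preds:
        IN[b] = set().union(*(OUT[p] for p in preds[b])) if preds[b] else set()
        new_out = gen[b] | (IN[b] - kill[b])
        if new_out != OUT[b]:
            OUT[b], changed = new_out, True

# Both x@B1 (surviving via B3) and x@B2 (via B2) reach the join block B4.
print(sorted(IN["B4"]))
```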
Taint propagation follows: sources (network input, `stdin`, HTTP request bodies, environment variables) are marked tainted. That taint propagates through assignments, function returns, and data transformations. When a tainted value reaches a dangerous sink (`system()`, `memcpy`, SQL execute, `eval`), it's flagged — with the complete taint-flow path shown in the report.
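A stripped-down sketch of that source-to-sink idea, over a hypothetical three-statement program with made-up source and sink names:

```python
# Toy taint walk over straight-line assignments. Names are illustrative.
SOURCES = {"read_input"}
SINKS = {"system"}
# (assignment target, expression dependencies, called function) triples
stmts = [
    ("data", [], "read_input"),   # data = read_input()
    ("cmd",  ["data"], None),     # cmd = "ls " + data
    (None,   ["cmd"], "system"),  # system(cmd)
]

tainted, findings = set(), []
for target, deps, call in stmts:
    is_tainted = call in SOURCES or any(d in tainted for d in deps)
    if call in SINKS and is_tainted:
        findings.append(f"tainted data reaches {call}() via {deps}")
    if target and is_tainted:
        tainted.add(target)   # taint propagates through the assignment

print(findings)  # ["tainted data reaches system() via ['cmd']"]
```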
The key improvement over most open-source tools: dominator-based sanitizer verification. If a sanitizer (bounds check, input validation, shell escape) dominates all paths from a source to a sink on the CFG, the finding is suppressed. This is what kills the false-positive problem.
Stage 0b — Symbolic Execution
For high-value vulnerability classes (buffer overflows, integer overflows, off-by-one errors), OverflowGuard calls in the Z3 SMT solver.
Z3 works with 64-bit bitvector arithmetic. OverflowGuard feeds it the path constraints accumulated at each branch and asks it to prove or refute whether the bad condition is reachable. If Z3 says "yes, here's a concrete input that triggers the overflow" — you get a counterexample in the report. That's not a guess. That's a mathematical proof that the bug is real and exploitable.
If Z3 isn't installed, the engine gracefully falls back to interval abstract interpretation. You still get findings, just without SMT-proved certainty.
Stages 1–6 — Multi-layer SAST
On top of the real analysis engine, OverflowGuard also runs:
- libclang for C/C++ AST-based sink/source tracking
- cppcheck , clang-tidy , semgrep , and Facebook Infer for external SAST
- An SSA-style def-use dataflow engine for second-order taint flows
- Interprocedural taint that follows data across function boundaries
- Concurrency analysis for data races and lock-order inversions in C/C++ and Go
- A concolic fuzzer combining angr symbolic execution, AFL++ mutation, and ASAN
Multiple engines seeing the same finding dramatically increases confidence. A finding flagged by both Z3 symbolic execution and cppcheck with a confirmed taint path is almost certainly real.
Stages 7–14 — The v11.0 Layer
The most recent version expanded OverflowGuard from a code scanner into a full security posture framework:
IaC Scanning covers 44 rules across Terraform, Kubernetes, Dockerfiles, CloudFormation, and Ansible — things like S3 buckets with public access, Kubernetes containers running as root, hardcoded secrets in `ENV` instructions, and wildcard IAM policies.
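A toy rule in that spirit (the Dockerfile content and the rule itself are invented for illustration, not taken from OverflowGuard's 44-rule set):

```python
# Miniature IaC check: flag Dockerfile ENV instructions whose variable
# name suggests a hardcoded secret.
import re

dockerfile = """\
FROM alpine:3.20
ENV API_KEY=sk_live_12345
ENV APP_MODE=production
"""
SECRET_NAMES = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD)", re.I)

findings = [
    (lineno, line.strip())
    for lineno, line in enumerate(dockerfile.splitlines(), 1)
    if line.startswith("ENV") and SECRET_NAMES.search(line.split("=")[0])
]
print(findings)  # [(2, 'ENV API_KEY=sk_live_12345')]
```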
Cross-file Taint Analysis builds a file-level call graph from imports and includes, then propagates taint findings across file boundaries. This catches the class of vulnerabilities where user input enters through `api.py`, gets passed to `db_utils.py`, and triggers SQL injection in `query_builder.py` — three hops that single-file analysis would completely miss.
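That cross-file propagation reduces to reachability on the import graph. A minimal sketch, with hypothetical file names matching the example above:

```python
# Build a file-level graph from imports, then BFS from the file where
# user input enters to see whether the sink file is reachable.
from collections import deque

imports = {                      # file -> files it imports
    "api.py": ["db_utils.py"],
    "db_utils.py": ["query_builder.py"],
    "query_builder.py": [],
}
taint_enters = "api.py"          # user input enters here
sink_file = "query_builder.py"   # the raw SQL execute lives here

reachable, queue = {taint_enters}, deque([taint_enters])
while queue:
    for nxt in imports[queue.popleft()]:
        if nxt not in reachable:
            reachable.add(nxt)
            queue.append(nxt)

# True: the three-hop flow that single-file analysis would miss.
print(sink_file in reachable)
```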
Auto-fix Patch Generation produces unified diff patches for 18 common vulnerability patterns. You don't just learn what's wrong — you get a `.patch` file you can apply directly.
Severity Trend Tracking via SQLite records every scan result historically and acts as a quality gate: if critical or high findings increase since the last scan, the CI pipeline fails.
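A quality gate like that can be sketched in a few lines of SQLite (the table schema and the finding counts here are invented for illustration):

```python
# Store per-scan severity counts, then fail the gate if the newest scan
# introduced more critical or high findings than the previous one.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, crit INT, high INT)")
db.executemany("INSERT INTO scans (crit, high) VALUES (?, ?)",
               [(23, 77), (25, 77)])  # previous scan, then current scan

prev, curr = db.execute(
    "SELECT crit, high FROM scans ORDER BY id DESC LIMIT 2").fetchall()[::-1]
gate = "PASS" if (curr[0] <= prev[0] and curr[1] <= prev[1]) else "FAIL"
print(gate)  # FAIL: criticals rose from 23 to 25
```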
OWASP Top 10 Coverage Reporting maps all findings to OWASP categories using a 200+ CWE-to-OWASP mapping table and shows you exactly which categories you're covered on and which are blind spots.
After all stages complete, the Final Audit Scorecard gives you a per-file breakdown:
![OverflowGuard Final Audit Scorecard showing 225 findings across 17 files with CRIT/HIGH/MED/LOW breakdown]
Final Audit Scorecard — 17 files scanned, 225 total findings (CRIT:23, HIGH:77, MED:29, LOW:96), with per-file severity breakdown and quality gate status.
The scorecard also shows the trend delta vs. the previous scan (commit `f00494e1`): zero new findings introduced, quality gate PASS. This is the kind of signal that belongs in a CI pipeline — not just "vulnerabilities exist" but "did this commit make things worse?"
![OWASP Top 10 coverage report and cross-file taint flows showing 40 boundary-crossing taint paths]
Cross-file taint: 40 boundary-crossing flows detected, with explicit chain notation (`sample.c:gets(line 31) → sample.c:main()(line 34) → saft.c:main(body)`). OWASP Top 10 coverage at 50% with A03:Injection leading at 175 findings.
---
Real-World Use Cases
Penetration Testing Engagements
The most immediate use case. During an engagement on a mixed Python/Go web application, OverflowGuard's cross-file taint analysis traced a user-controlled HTTP parameter through three files before it reached an unsanitized SQL execute call — a finding that would have taken hours of manual code review to locate. The report came with the complete injection path, CWE-89, CVSS risk score, and a remediation snippet using prepared statements.
Secure Code Review for Developers
OverflowGuard integrates directly into CI/CD via ready-made templates for GitLab CI, Jenkins, Bitbucket Pipelines, and Azure Pipelines. Developers get findings annotated on pull requests through SARIF export to GitHub Code Scanning. The secure-alternative cards mean a junior developer who hits a `gets()` finding immediately sees: here's `fgets()`, here's why, here's the exact replacement.
Supply Chain Risk Assessment
Before pulling a new dependency into a production codebase, you can point OverflowGuard at the package's GitHub repo directly. It scans the source, queries the OSV API for known CVEs, generates a CycloneDX SBOM, flags any GPL/AGPL license contamination risks, and checks for hardcoded secrets — all in one run. This satisfies US Executive Order 14028 SBOM requirements as a side effect.
Infrastructure Hardening
For DevSecOps teams, the IaC scanner catches the configuration vulnerabilities that slip through code review because "it's just Terraform." An open security group ingress rule, a Kubernetes pod with `hostPID: true`, a Dockerfile that pulls a script with `curl | bash` — these are CWE-16 findings that belong in the same unified report as the injection vulnerabilities in the application code.
CTF and Vulnerability Research
For security students and CTF players, the differential scanning mode (`--diff`) is particularly useful when studying patches: scan the vulnerable version, then the patched version, and compare. The counterexamples from Z3 give you concrete exploit inputs — effectively a starting point for PoC development.
The remediation guidance doesn't stop at flagging — it tells you exactly what to replace and why:
![Top Remediation Hints showing secure alternatives for heap-buffer-overflow, use-after-free, double-free, null-pointer, and off-by-one]
Top Remediation Hints — every finding type comes with a concrete fix: `strcpy()` → `strncpy() + NUL`, `free(ptr)` called multiple times → `set pointer to NULL after free()`, `i <= count` → `use strict < comparison`. Actionable, not just informational.
---
The Architecture Decision I'm Most Proud Of
When I was designing the false-positive filtering strategy, the easy solution was the ±10-line heuristic: if a sanitizer function call appears within 10 lines of the vulnerable call, suppress the finding. Almost every open-source scanner does this.
I didn't do that.
Instead, OverflowGuard uses dominator-based sanitizer verification on the actual CFG. A sanitizer only suppresses a finding if it dominates all paths from the taint source to the sink — meaning no execution path can reach the sink without passing through the sanitizer first. This is the correct definition. A bounds check that happens on only one branch of an if-else does not protect the other branch, and the ±10-line heuristic would incorrectly suppress it.
The difference in false-positive rates between these two approaches is significant. And for a security tool — where every missed vulnerability matters — correctness isn't optional.
---
What's Coming Next
OverflowGuard is already at v11.0 after a rapid development sprint, but the roadmap has several directions I'm actively exploring:
Cross-language taint propagation — modern microservice architectures have Python services calling Go APIs calling Java backends. Taint that enters at an HTTP endpoint in one service and reaches a sink in another is currently invisible to single-language analysis. True cross-language taint tracking is the next frontier.
LLM-assisted remediation — beyond static code snippets, integrating a local LLM (I use Llama locally) to generate context-aware fix suggestions that understand the surrounding code semantics, not just the vulnerability pattern in isolation.
Binary analysis mode — for black-box penetration testing scenarios where source code isn't available. Integrating Ghidra's P-Code IR or LLVM IR from decompilation as an analysis target.
Plugin ecosystem — a YAML/Python plugin API so the security community can contribute custom detection rules, custom sanitizer databases, and custom report formats without touching the core engine.
Android/iOS security — given my work on BLE and Wear OS projects, extending OverflowGuard's taint analysis to cover Android-specific sources and sinks (Intent extras, SharedPreferences, ContentProviders) is a natural next step.
---
Try It
OverflowGuard is MIT-licensed and available on GitHub:
[github.com/parag25mcf10022/OverflowGuard]( https://github.com/parag25mcf10022/OverflowGuard)
Getting started takes three commands:
```bash
git clone https://github.com/parag25mcf10022/OverflowGuard.git
cd OverflowGuard
chmod +x setup.sh && ./setup.sh
```
Then scan any GitHub repo directly:
```bash
python3 main.py
# Enter Path/File/GitHub Repo: torvalds/linux
```
Or drop it into your CI pipeline with one of the included templates.
---
If you've ever spent a Friday afternoon manually triaging 200 false positives from a regex-based scanner, OverflowGuard is for you. If you want to go beyond pattern matching and understand whether attacker-controlled data actually reaches a dangerous sink — with a mathematical proof — it's for you.
Security research should be about finding real bugs, not managing tool noise.
---
I am a security researcher, ethical hacker, and M.Tech student in Cyber Security & Digital Forensics at VIT Bhopal. Recognized bug bounty researcher acknowledged by 23+ organizations including the Government of India, with findings spanning XSS, RCE, IDOR, and authentication bypass vulnerabilities.
Connect on GitHub: [parag25mcf10022]( https://github.com/parag25mcf10022)
---
#cybersecurity #penetration-testing #sast #static-analysis #bug-bounty