I ran an AI pentester on a vibe-coded quiz app and found 22 vulnerabilities

github.com · francesco_sta · 4 days ago

numasec

AI pentester that actually finds vulnerabilities. Open source. Runs in your terminal.

numasec running a pentest against OWASP Juice Shop

MIT License Python 3.11+ Build Release PyPI

96% vulnerability recall on OWASP Juice Shop · 10 specialized agents · 21 security tools · PTES methodology

---

## Quickstart

```bash
pip install numasec
numasec
```

Or with Docker:

```bash
docker run -it francescosta/numasec
```

Or from source:

```bash
curl -fsSL https://numasec.dev/install | bash
```

Type `/target https://yourapp.com` and watch it work. The AI scans, finds vulnerabilities, chains attacks together, and writes the report. You watch, approve, and steer.

Works with **Claude, GPT-4, Gemini, DeepSeek, Mistral**, or any OpenAI-compatible model.

---

## Why numasec

Most "AI security tools" wrap a single scanner and call it AI. numasec is different: it's a **team of 10 specialized agents** running **21 offensive security tools** through an actual penetration testing methodology.

It doesn't just find vulnerabilities. It **chains them**: a leaked API key in JavaScript → SSRF → cloud metadata → account takeover. Then it writes a professional report with CVSS scores, CWE IDs, OWASP categories, and remediation guidance.

**Benchmarked against real targets:**

| Target | Vulnerabilities Found | Coverage |
|---|---|---|
| OWASP Juice Shop v17 | 25/26 ground-truth vulns | **96% recall** |
| DVWA | 7/7 vulnerability categories | **100%** |
| WebGoat | 20+ vulnerabilities across all modules | **Full coverage** |

---

## What it finds


**Injection**

- SQL injection (blind, time-based, union, error-based)
- NoSQL injection
- OS command injection
- Server-Side Template Injection
- XXE injection
- GraphQL introspection & injection

**Authentication & Access**

- JWT attacks (alg:none, weak HS256, kid traversal)
- OAuth misconfiguration
- Default credentials & password spray
- IDOR
- CSRF
- Privilege escalation

**Client & Server Side**

- XSS (reflected, stored, DOM)
- SSRF with cloud metadata detection
- CORS misconfiguration
- Path traversal / LFI
- Open redirect
- HTTP request smuggling
- Race conditions
- File upload bypass

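
To make one entry from the lists above concrete, here is a minimal sketch of what an "alg:none" JWT looks like and how a check might recognize one. It uses only the Python standard library. The function names are hypothetical and this is not numasec's scanner code; it only illustrates the token format that such a test targets.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, the encoding JWT segments use."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def jwt_header(token: str) -> dict:
    """Return the decoded JOSE header (the first dot-separated segment)."""
    header_segment = token.split(".")[0]
    return json.loads(b64url_decode(header_segment))


def declares_alg_none(token: str) -> bool:
    """True if the token's header declares the unsigned 'none' algorithm."""
    alg = jwt_header(token).get("alg", "")
    return isinstance(alg, str) and alg.lower() == "none"


def make_unsigned_token(claims: dict) -> str:
    """Build a token with alg=none and an empty signature segment.

    A server that accepts such a token is skipping signature verification.
    """
    header = b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    return f"{header}.{payload}."
```

A test along these lines would send `make_unsigned_token({...})` to an authenticated endpoint of a target you are authorized to test and flag the endpoint if the request succeeds.
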

Every finding is auto-enriched with **CWE ID**, **CVSS 3.1 score**, **OWASP Top 10 category**, **MITRE ATT&CK technique**, and actionable **remediation guidance**.

---

## Multi-Agent Architecture

numasec isn't a single bot. It's a coordinated team of specialized agents, each with distinct roles and permissions:

### Primary Agents

| Agent | Role | What it does |
|---|---|---|
| 🔴 **pentest** | Full PTES methodology | Recon → Discovery → Vuln Assessment → Exploitation → Reporting |
| 🔵 **recon** | Intelligence gathering | Port scanning, fingerprinting, subdomain enum, service probing (no exploitation) |
| 🟠 **hunt** | OWASP Top 10 hunter | Systematic, aggressive testing across all 10 OWASP categories |
| 🟡 **review** | Secure code review | Static analysis of source code, diffs, commits, PRs |
| 🟢 **report** | Report & findings | Finding management, severity validation, report generation |

### Subagents

| Agent | Role |
|---|---|
| **scanner** | Executes automated vulnerability scans (passive → semi-active → active) |
| **analyst** | Validates results, eliminates false positives, correlates attack chains |
| **reporter** | Generates SARIF / Markdown / HTML / JSON reports |
| **explore** | CVE research, exploit documentation, knowledge base queries |

Each agent has **tailored permissions**: the recon agent can't run exploits, and the review agent can't launch scanners. The analyst agent filters false positives using strict evidence criteria before any finding enters the report.

---

## Security Tooling

21 purpose-built security tools and 38 async scanners run under the hood, covering reconnaissance, injection testing, authentication attacks, access control, file upload bypass, race conditions, request smuggling, out-of-band detection, and more. The AI selects and orchestrates them automatically based on what it discovers about your target.

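
The relationship between a tool and the async scanners behind it can be pictured with a small `asyncio` sketch. Everything here is an assumption made for illustration: the `Finding` type, the three stand-in scanners, and `run_tool` are hypothetical names, not numasec's internal API.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    scanner: str
    title: str


# Hypothetical scanners: each is a coroutine taking a target URL and
# returning a list of findings. sleep(0) stands in for real network I/O.
async def header_scanner(target: str) -> list[Finding]:
    await asyncio.sleep(0)
    return [Finding("headers", f"Missing CSP header on {target}")]


async def cors_scanner(target: str) -> list[Finding]:
    await asyncio.sleep(0)
    return []  # nothing found


async def broken_scanner(target: str) -> list[Finding]:
    raise RuntimeError("simulated scanner failure")


async def run_tool(target: str, scanners) -> list[Finding]:
    """Run all scanners concurrently and merge whatever they report."""
    results = await asyncio.gather(
        *(scanner(target) for scanner in scanners),
        return_exceptions=True,  # one failing scanner must not sink the rest
    )
    findings: list[Finding] = []
    for result in results:
        if isinstance(result, BaseException):
            continue  # a real tool would log the failure here
        findings.extend(result)
    return findings
```

Calling `asyncio.run(run_tool("https://example.test", [header_scanner, cors_scanner, broken_scanner]))` returns the single header finding. Passing `return_exceptions=True` is one design option for keeping partial results when a scanner crashes.
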

A built-in **knowledge base of 34 templates** covers detection patterns, exploitation techniques, payloads, and remediation, so the AI looks up attack methodology instead of hallucinating it. Extensible with your own templates and plugins.

---

## Reports

Four output formats, all auto-generated:

| Format | Use case |
|---|---|
| **SARIF** | Drop into GitHub Code Scanning, GitLab SAST, or any SARIF viewer |
| **HTML** | Self-contained report to share with your team |
| **Markdown** | Paste into tickets, docs, or wikis |
| **JSON** | Feed into your pipeline or dashboard |

Every report includes an executive summary with risk score (0-100), severity breakdown, OWASP coverage matrix, attack chain documentation, and per-finding remediation.

---

## OWASP Top 10 Coverage

The TUI header tracks real-time testing coverage across all 10 OWASP categories as the pentest progresses. Each category is automatically mapped to the relevant tools, so you always know what's been tested and what's left.

---

## Installation

### pip (recommended)

```bash
pip install numasec
numasec
```

Downloads the TUI binary automatically on first run. No Bun, Node, or other runtime needed.

### Docker

```bash
docker run -it francescosta/numasec
```

Full TUI + all 21 security tools. Multi-arch (amd64, arm64).


### From source

```bash
curl -fsSL https://numasec.dev/install | bash
```

Or manually:

```bash
git clone https://github.com/FrancescoStabile/numasec.git
cd numasec
pip install -e ".[all]"                   # Python backend
cd agent && bun install && bun run build  # TUI
```

---

## Usage

```bash
numasec   # Start interactive TUI
```

### Slash Commands

| Command | Description |
|---|---|
| `/target ` | Set target and begin reconnaissance |
| `/findings` | List all discovered vulnerabilities |
| `/report ` | Generate report (markdown, html, sarif, json) |
| `/coverage` | Show OWASP Top 10 coverage matrix |
| `/creds` | List discovered credentials |
| `/evidence ` | Show evidence for a specific finding |
| `/review` | Security review of code changes |
| `/init` | Analyze app and create security profile |

### Agent Modes

Switch between agents for different tasks:

- **pentest**: full methodology, default
- **recon**: reconnaissance only, no exploitation
- **hunt**: aggressive OWASP Top 10 testing
- **review**: secure code review (no network scanning)
- **report**: finding management and deliverables

---

## LLM Providers

| Provider | Models |
|---|---|
| Anthropic | Claude Opus, Sonnet, Haiku |
| OpenAI | GPT-4o, GPT-4, o1 |
| Google | Gemini Pro, Flash |
| AWS Bedrock | Claude, Llama |
| Azure OpenAI | GPT-4, GPT-4o |
| Mistral | Large, Medium |
| DeepSeek | V2, Coder |
| OpenRouter | Any model via aggregation |
| GitHub Copilot | Copilot models |
| Google Vertex | Gemini via Vertex |
| GitLab | GitLab models |

---

## Development

```bash
pip install -e ".[all]"

# Tests (1273 unit + 3 benchmark suites)
pytest tests/ -v
pytest tests/ -m "not slow and not benchmark"   # fast run

# Lint & type check
ruff check numasec/
ruff format numasec/
mypy numasec/

# TypeScript TUI
cd agent && bun install
cd packages/numasec && bun run typecheck
cd packages/numasec && bun test
```

### Benchmarks

```bash
# Juice Shop (96% recall)
JUICE_SHOP_URL=http://localhost:3000 pytest tests/benchmarks/test_juice_shop.py -v

# DVWA (100% coverage)
DVWA_TARGET=http://localhost:8080 pytest tests/benchmarks/test_dvwa.py -v

# WebGoat
WEBGOAT_TARGET=http://localhost:8081/WebGoat pytest tests/benchmarks/test_webgoat.py -v
```

### Extend with plugins

Drop a Python file with a `register(registry)` function into `~/.numasec/plugins/` or a YAML scanner template into `~/.numasec/templates/`.

---

## How it works

```
┌─────────────────────────────────────────────────────────────┐
│                        Terminal TUI                         │
│   (TypeScript/Bun • SolidJS reactive UI • 5 agent modes)    │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                       Security Engine                       │
│  ┌─────────────┐  ┌───────────────┐  ┌───────────────────┐  │
│  │ 21 Security │  │ 34 Knowledge  │  │ Session Store     │  │
│  │    Tools    │  │ Base Templates│  │                   │  │
│  └──────┬──────┘  └───────────────┘  └───────────────────┘  │
│         │                                                   │
│  ┌──────▼──────────────────────────────────────────────┐    │
│  │                      38 Skills                      │    │
│  │  Injection · Auth · Access · Recon · Fuzzing        │    │
│  │  Client-side · Server-side · Out-of-band · ...      │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```

The TUI drives the AI conversation. The AI calls security tools. Each tool orchestrates one or more async scanners. Findings are auto-enriched (CWE → CVSS → OWASP → MITRE ATT&CK), deduplicated, and grouped into attack chains. Reports are generated from the session store.

**No hallucinated methodology.** The knowledge base provides real detection patterns, exploitation techniques, and payloads. The deterministic planner (based on the CHECKMATE paper) selects tests based on detected technologies, with no LLM involved in test selection.

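
The idea of deterministic test selection can be sketched as a plain table lookup: the same set of detected technologies always yields the same ordered plan, with no model call in the loop. The mapping, test names, and `plan_tests` function below are hypothetical and chosen only for illustration. This is not the CHECKMATE algorithm or numasec's actual planner.

```python
# Hypothetical technology -> test mapping. The real planner's rules are
# not shown in this README, so these entries are assumptions.
TESTS_BY_TECHNOLOGY: dict[str, list[str]] = {
    "graphql": ["graphql_introspection", "graphql_injection"],
    "jwt": ["jwt_alg_none", "jwt_weak_secret"],
    "mongodb": ["nosql_injection"],
    "mysql": ["sql_injection"],
}

# Tests that run regardless of the detected stack (also an assumption).
BASELINE_TESTS: list[str] = ["cors_misconfiguration", "open_redirect"]


def plan_tests(detected: set[str]) -> list[str]:
    """Return an ordered, duplicate-free test plan for the detected stack.

    Technologies are normalized to lowercase and visited in sorted order,
    so the output does not depend on set iteration order.
    """
    plan = list(BASELINE_TESTS)
    for tech in sorted(name.lower() for name in detected):
        for test in TESTS_BY_TECHNOLOGY.get(tech, []):
            if test not in plan:
                plan.append(test)
    return plan
```

For example, `plan_tests({"JWT", "mysql"})` yields the two baseline tests followed by the JWT tests and then `sql_injection`, and repeating the call gives the same list every time.
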

---

**Built by [Francesco Stabile](https://www.linkedin.com/in/francesco-stabile-dev).**

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=flat-square&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/francesco-stabile-dev) [![X](https://img.shields.io/badge/X-000000?style=flat-square&logo=x&logoColor=white)](https://x.com/Francesco_Sta)

[MIT License](LICENSE)