bug-bounty (589)
xss (342)
exploit (241)
google (209)
microsoft (166)
facebook (162)
rce (153)
web3 (129)
writeup (114)
malware (106)
apple (105)
open-source (91)
cve (91)
csrf (89)
sqli (77)
browser (66)
ai-agents (63)
account-takeover (58)
cloudflare (51)
ssrf (50)
node (48)
supply-chain (48)
tool (46)
phishing (45)
dos (44)
privacy (44)
aws (44)
reverse-engineering (42)
idor (41)
pentest (39)
privilege-escalation (38)
llm (37)
cloud (37)
oauth (36)
opinion (35)
automation (33)
lfi (32)
machine-learning (32)
auth-bypass (32)
race-condition (31)
ctf (31)
code-generation (31)
infrastructure (31)
clickjacking (28)
cors (28)
react (28)
access-control (27)
docker (25)
performance-optimization (24)
rust (24)
NVIDIA researchers present a concept-driven synthetic-data-generation workflow that creates 15 million Python programming problems from a hierarchical taxonomy of programming concepts, achieving a 6-point improvement on the HumanEval benchmark when the data is included in Nemotron-Nano-v3 pretraining. The method enables targeted LLM training: by combining and distilling specific programming concepts, it controls data difficulty, diversity, and conceptual balance.
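The summary describes sampling concepts from a hierarchical taxonomy and combining them into generation prompts. A minimal sketch of that idea is below; the taxonomy contents, function names, and prompt template are illustrative assumptions, not NVIDIA's actual Nemotron pipeline.

```python
import random

# Assumed toy taxonomy: top-level categories mapping to leaf concepts.
# The real taxonomy is hierarchical and far larger.
TAXONOMY = {
    "data structures": ["lists", "dicts", "heaps"],
    "control flow": ["recursion", "loops", "early exit"],
    "algorithms": ["sorting", "dynamic programming", "two pointers"],
}

def sample_concepts(k, seed=None):
    """Pick k leaf concepts from k distinct categories to balance coverage."""
    rng = random.Random(seed)
    categories = rng.sample(sorted(TAXONOMY), k)
    return [rng.choice(TAXONOMY[c]) for c in categories]

def build_prompt(concepts):
    """Format a prompt asking an LLM to write a problem combining the concepts.

    Combining more concepts (larger k) is one lever for raising difficulty;
    varying the sampled categories is one lever for diversity.
    """
    joined = ", ".join(concepts)
    return ("Write a self-contained Python programming problem that requires "
            f"the following concepts: {joined}. Include a reference solution.")

prompt = build_prompt(sample_concepts(2, seed=0))
print(prompt)
```

Each prompt would then be sent to a teacher model (the entities below mention GPT-OSS 120B) to produce a problem and solution pair for the pretraining corpus.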
synthetic-data-generation
llm-training
code-generation
python
machine-learning
dataset
pretraining
programming-concepts
benchmark-improvement
data-quality
NVIDIA
Joseph Jennings
Brandon Norick
Nemotron-Pretraining-Code-Concepts
Nemotron-Nano-v3
HumanEval
GPT-OSS 120B