bug-bounty (645)
xss (410)
exploit (324)
google (266)
rce (218)
facebook (196)
microsoft (192)
writeup (159)
web3 (129)
cve (124)
apple (112)
malware (102)
csrf (101)
browser (101)
open-source (91)
sqli (81)
account-takeover (80)
ssrf (68)
ai-agents (63)
node (58)
cloudflare (58)
dos (53)
supply-chain (53)
ctf (52)
aws (51)
oauth (51)
lfi (50)
pentest (50)
privilege-escalation (49)
reverse-engineering (46)
tool (46)
phishing (45)
cloud (45)
react (44)
privacy (44)
auth-bypass (42)
idor (42)
race-condition (39)
llm (37)
cors (36)
opinion (35)
clickjacking (33)
docker (33)
automation (33)
machine-learning (32)
infrastructure (31)
code-generation (31)
open-redirect (28)
access-control (27)
wordpress (27)
NVIDIA researchers present a concept-driven synthetic data generation workflow that creates 15 million Python programming problems from a hierarchical taxonomy of programming concepts, achieving a 6-point improvement on the HumanEval benchmark when the data is included in Nemotron-Nano-v3 pretraining. The method enables targeted LLM training by combining and distilling specific programming concepts to control the difficulty, diversity, and conceptual balance of the generated data.
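The core idea — sampling concept combinations from a taxonomy to steer problem generation — can be sketched as follows. This is a minimal illustration, not NVIDIA's pipeline: the toy taxonomy, the pairwise sampling, and the prompt template are all assumptions; the real workflow uses a far larger hierarchical taxonomy and an LLM to generate the actual problems.

```python
import random

# Hypothetical miniature taxonomy; the real one is hierarchical and much larger.
TAXONOMY = {
    "data structures": ["dict", "heap", "linked list"],
    "control flow": ["recursion", "generators"],
    "string processing": ["parsing", "regex"],
}

def sample_concept_pair(rng):
    """Pick two leaf concepts from different categories; combining concepts
    is one lever for controlling problem difficulty and diversity."""
    cat_a, cat_b = rng.sample(sorted(TAXONOMY), 2)
    return rng.choice(TAXONOMY[cat_a]), rng.choice(TAXONOMY[cat_b])

def make_prompt(concepts):
    """Render a generation prompt asking an LLM for a self-contained problem."""
    return (
        "Write a self-contained Python programming problem that requires "
        + " and ".join(concepts)
        + ", plus a reference solution and unit tests."
    )

# Seeded RNG so the sampled concept mix is reproducible.
rng = random.Random(0)
prompts = [make_prompt(sample_concept_pair(rng)) for _ in range(3)]
for p in prompts:
    print(p)
```

Scaling the number of sampled combinations per taxonomy branch is what lets the data's conceptual balance be tuned rather than left to chance.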
synthetic-data-generation
llm-training
code-generation
python
machine-learning
dataset
pretraining
programming-concepts
benchmark-improvement
data-quality
NVIDIA
Joseph Jennings
Brandon Norick
Nemotron-Pretraining-Code-Concepts
Nemotron-Nano-v3
HumanEval
GPT-OSS 120B