benchmark-improvement

1 article



0 3/10

A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds

research

NVIDIA researchers present a concept-driven workflow for synthetic data generation that produces 15 million Python programming problems from a hierarchical taxonomy of programming concepts. Including the dataset in Nemotron-Nano-v3 pretraining yields a 6-point improvement on the HumanEval benchmark. By combining and distilling specific programming concepts, the method supports targeted LLM training with control over data difficulty, diversity, and conceptual balance.
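As a rough illustration of seeding problems from a concept taxonomy, the sketch below samples concepts from distinct categories and turns them into a generation prompt. The taxonomy contents, the function names (`leaf_concepts`, `sample_seed`, `build_prompt`), and the prompt wording are all hypothetical; the summary does not describe the actual taxonomy or pipeline, so this is only one plausible reading of the idea.

```python
import random

# Hypothetical two-level taxonomy; the real one is not given in the summary.
TAXONOMY = {
    "data structures": ["lists", "dictionaries", "sets"],
    "control flow": ["loops", "recursion", "exceptions"],
    "algorithms": ["sorting", "binary search", "dynamic programming"],
}


def leaf_concepts(taxonomy):
    """Flatten the taxonomy into 'category/concept' paths."""
    return [f"{cat}/{leaf}" for cat, leaves in taxonomy.items() for leaf in leaves]


def sample_seed(taxonomy, k, rng):
    """Pick one concept from each of k distinct categories.

    Assumption: combining more concepts (larger k) gives a harder problem.
    """
    categories = rng.sample(sorted(taxonomy), k)
    return [f"{cat}/{rng.choice(taxonomy[cat])}" for cat in categories]


def build_prompt(concepts):
    """Turn a concept seed into a (hypothetical) problem-generation prompt."""
    joined = ", ".join(concepts)
    return f"Write a Python programming problem that combines these concepts: {joined}."


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampling is reproducible
    seed = sample_seed(TAXONOMY, k=2, rng=rng)
    print(build_prompt(seed))
```

Sampling across categories rather than across all leaves is one simple way to keep seeds conceptually balanced; varying `k` is a stand-in for the difficulty control the summary mentions.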

synthetic-data-generation llm-training code-generation python machine-learning dataset pretraining programming-concepts benchmark-improvement data-quality

NVIDIA · Joseph Jennings · Brandon Norick · Nemotron-Pretraining-Code-Concepts · Nemotron-Nano-v3 · HumanEval · GPT-OSS 120B

huggingface.co · ibobev · 6 days ago