LLM_log #021: How LLMs Learn to Reason — From Chain-of-Thought to Self-Rewarding and Meta-Judges
Highlights:
- Jason Weston traces the arc from early neural language models to self-improving LLMs that generate their own training data and evaluate their own reasoning
- System 1 vs System 2: fixed-compute pattern-matching vs deliberate multi-step reasoning, and why the same LLM implements both
- Chain-of-Thought prompting: adding “Let’s think step by step” raises GSM8K accuracy from ~10% to 40–50%; few-shot CoT hits 90%+ on MultiArith
- CoVe + S2A: Chain-of-Verification reduces hallucinations 3× on knowledge list…