Webinar · Feb 26, 2025 at 17:00 CET · Online

Best practices for benchmarking and fine-tuning: discover how mathematical reasoning unlocks new LLM applications. Join experts from Google DeepMind, Toloka, Gradarius, and Stevens Institute of Technology as they share their findings and take an in-depth look at approaches to evaluating and enhancing LLM proficiency in university-level mathematical reasoning. We’ll also share new benchmarking results showing how DeepSeek and o1 stack up against Gemini, GPT-4, Qwen2-Math, and more.

Agenda and key topics

What you'll learn:

  • Mathematical reasoning for LLMs: Key use cases and areas where math reasoning fundamentally enhances language model capabilities.

  • Auto-formalization in math: AlphaProof, Lean, and the applicability of automated verifiers for LLM training and evaluation.

  • Expert-curated data collection: Best practices for sourcing high-quality datasets to assess and enhance model performance, including for university-level mathematical reasoning.

  • Designing effective benchmarks: Developing robust evaluation metrics, assessing model performance across different domains, and extracting actionable insights.

  • Boosting LLM performance with new data: Optimally leveraging a custom university-level training dataset to enhance the mathematical proficiency of top-performing LLMs.

  • Practical applications: Applying the results in academic, educational, and product contexts.