
Webinar · 30 April, 7 PM CET · Online
Overview
Agentic AI systems can plan, use tools, and act across multi-step workflows, but their reliability in production remains an open problem. The current limits on scaling trustworthy agents aren't just a model issue; they're a data issue.
Toloka specializes in combining human expert knowledge with technology to evaluate frontier-pushing LLMs and AI Agents.
Our focus on real-world evaluation tasks and environments helps you better understand the actual capabilities and limitations of your models.
Speaking session
1. Why training data for standard LLMs is insufficient for agents operating across multiple steps and modalities
2. Plan validation and trajectory optimization: using human feedback to verify tool selection, reasoning chains, and intermediate steps
3. RLHF for agentic workflows: adapting preference ranking to sequential, multi-step task evaluation
4. Robotics as a case study: annotation pipelines for physical agents, from manipulation video labeling to frame-level failure tagging
5. Continuous monitoring: how human annotators catch failure modes that automated metrics miss in production
6. Tendem via MCP: connecting a live agent stack to verified domain experts as a programmable reliability layer
7. Live demo: Toloka Arena, RL Gym walkthroughs, and self-service robotics annotation presets
Session info
Presentation and Q&A: 60 minutes
Live expert panel discussion: 30 minutes
Audience
Data scientists, ML engineers, AI developers
Level
All experience levels
Hosts
© 2025 Toloka AI BV
