
Agentic AI & the Future of Coding
We kick off the first episode of the Toloka podcast with an insightful conversation between Ilya Kochik, our VP of Business Development, and Pierre Tempel, Director of Product at GitHub. Together, they explore how to move AI beyond flashy autocomplete toward useful tools that understand context and adapt to real-world workflows.
They talk through GitHub's evolving philosophy around AI integration, from the early days of Copilot as a helpful assistant to today's push toward more intelligent, trustworthy, and workflow-aware systems that support real engineering teams. Along the way, they hit on topics like:
Why GitHub thinks of its AI tools in terms of making models "less dumb" and "more smart"
How enterprise codebases break most public benchmarks, and what GitHub does instead
The difference between building an AI that can call tools vs. one that knows when and why to use them
What happens when you start seeing Copilot not as a feature, but as a collaborator
Pierre brings a rare mix of technical depth and product honesty, sharing tradeoffs, lessons learned, and what it takes to make AI more useful within real-world development environments.
They also dig into:
What GitHub learned by watching how developers use (and misuse) AI
The role of human feedback in scaling secure and reliable AI systems
The hidden cost of tool-calling agents (spoiler: it's not just compute)
The risks of overfitting to public leaderboards, and how GitHub avoids that trap
If you're building AI-first products, designing evaluation frameworks, or trying to understand what's real behind the AI agent hype, this conversation is packed with pragmatic insight from the frontlines.
Ilya: Hi, everyone. Today is our first episode of the Toloka podcast. We have a fantastic guest with us, and I'll introduce him in a second. We'll be talking about coding, LLM agents, evaluation, training, and lots of other cool topics.
I'm Ilya Kochik, VP of Business Development at Toloka. At Toloka, we create human data to train and evaluate GenAI models, working with coding, agents, and LLMs. Our guest today is Pierre Tempel, Director of Product Management at GitHub. Welcome, Pierre.
Pierre: Thanks for having me. Hi everyone. I'm Pierre, or Turbo, as some people know me. I’m Director of Product at GitHub, focused on our detection and remediation engines. We have a product called Advanced Security, which checks for vulnerabilities and code quality issues. We develop static analysis engines like CodeQL and also AI-native engines like Copilot Autofix, which automatically fix vulnerabilities, as well as Copilot code review, which gives AI-based suggestions in your pull requests.
Ilya: Amazing. Great to have you. I've followed GitHub since its early days, and it really changed how we think about software engineering. Copilot seems to be another leap. Just to start simply: how has AI changed how you work? Do you use it daily? What do your workflows look like?
Pierre: It's moving so fast that even a few weeks ago, my notes for this podcast would've looked completely different. At GitHub, the way teams use AI depends a lot on their workflows. Personally, AI helps me with tasks like market research and competitive analysis. It also lets me spend more time on deep thinking and user interaction. When we're benchmarking or evaluating models, AI really helps gather data and understand real user behavior.
I also maintain a side project called FindSide AI, a book search engine with a few thousand users. It helps me build and learn more about these systems. It's been pretty transformative for me.
Ilya: That's interesting. When you use AI for product and market research, do you build your own workflows? Any tips or tricks?
Pierre: Yes, I tend to build my own systems. My background is in development, and I still write code. At GitHub and Microsoft, we have powerful tools to orchestrate custom workflows. Things move so quickly that relying on two-year-old analyst reports just doesn't cut it anymore. I use systems that can autonomously track new developments, aggregate user feedback from multiple channels like Twitter or internal forums, and notify me. We use tools like Copilot Spaces to create custom AI workflows inside GitHub, and people on my team build their own workflows around Copilot too. So we're also sharing these workflows internally.
Ilya: I agree. Even a few months ago, my answers would've been different. Our work at Toloka has evolved from $1 crowd-labeled image classification tasks to $1,000+ complex expert-driven data points involving virtual companies and synthetic environments. It’s all happened in just a couple of years. So from your side, what do you see happening in the next 6 to 12 months?
Pierre: One major trend is that a lot of traditional machine learning workflows are becoming obsolete. Smaller, hyper-specialized models are being replaced by larger, more capable models that transfer knowledge better. At GitHub, we're focusing on two priorities: making models "less dumb" and "more smart."
"Less dumb" means avoiding mistakes, like hallucinations, missed context, etc. “More smart” is about making models more context-sensitive and senior-engineer-like, rather than just autocomplete tools. We’re aiming to change the workflow itself, meaning testing, UI design, and feedback can all happen while you write code. That expansion of the “development bubble” is where we're headed.
Ilya: I like that framing: less dumb and more smart. Do you think the future is general-purpose autonomous agents or more specialized user journey tools?
Pierre: It depends on the user need. At Microsoft Build, we launched two products: Copilot Coding Agent, which you can assign an issue to and get a pull request back, and Copilot Autofix, an agentless system that runs tests and adjusts fixes without AI planning. Both were driven by user needs. Sometimes you want autonomy, sometimes you want linear, predictable workflows. So the answer depends on what the user is trying to solve. That’s where we start.
Ilya: That's interesting. When it comes to evaluation, let’s start with Copilot. Some enterprise clients feel it underperforms in complex scenarios. What’s your take?
Pierre: That's valid. We offer security capabilities to both open-source and enterprise users, and the usage patterns are very different. Enterprise codebases are larger and more complex, with different workflows, more dependencies, and bigger pull requests. Most public benchmarks don't reflect this. For Copilot Autofix, we analyze metadata like repo size and vulnerability types to match enterprise distributions without seeing the actual code. That helped us improve performance significantly.
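To make Pierre's point concrete, here is a minimal Python sketch of the general idea: stratified sampling of candidate repos so a benchmark's metadata distribution matches an enterprise profile, without ever reading the code. The field names, buckets, and function names are hypothetical, not GitHub's actual schema or pipeline.

```python
import random
from collections import defaultdict

# Hypothetical repo-level metadata: no source code involved, only stats.
# Field names and buckets are illustrative, not GitHub's actual schema.
candidate_pool = [
    {"repo": "oss/app-1", "size_bucket": "small", "vuln_type": "sql-injection"},
    {"repo": "oss/app-2", "size_bucket": "large", "vuln_type": "xss"},
    # ... thousands more open-source candidates
]

def stratum(record):
    """Combine the metadata fields that define an enterprise profile."""
    return f'{record["size_bucket"]}/{record["vuln_type"]}'

def sample_to_match(pool, target_shares, n):
    """Stratified-sample n repos so the benchmark matches target shares,
    e.g. {"large/xss": 0.4, "large/sql-injection": 0.35, "small/xss": 0.25}."""
    by_stratum = defaultdict(list)
    for rec in pool:
        by_stratum[stratum(rec)].append(rec)
    sample = []
    for key, share in target_shares.items():
        # Never take more repos than a stratum actually has.
        k = min(round(share * n), len(by_stratum[key]))
        sample.extend(random.sample(by_stratum[key], k))
    return sample
```

Evaluating a fixer on a sample drawn this way approximates enterprise conditions while customer code never enters the loop.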
Ilya: What other differences have you observed between open source and enterprise?
Pierre: Language usage, types of vulnerabilities, and workflow structures. Enterprises often have monorepos or large numbers of microservices. Some do massive PRs with thousands of files, which is very different from typical open-source PRs. Plus, internal tools and custom libraries are more common in enterprises.
Ilya: Right. We've seen that too. Things like dependency migrations, large-scale refactors, or early-stage prototypes rarely show up in public repos. Do you work with enterprise clients to co-develop tools?
Pierre: Yes, especially for big features like security campaigns or Autofix. We build prototypes with key clients to validate workflows and outputs. We try not to overfit to any one client, but we do allow customization. For instance, we help model proprietary libraries for better AI analysis.
Ilya: That's helpful. With so many variables in play (multiple LLMs, tool calls, prompts) and limited user feedback, how do you optimize it all?
Pierre: GitHub's infrastructure lets us scale evaluation massively, with thousands of nightly evaluations across real-world repos. But more importantly, we have domain experts. For security, quality, and programming languages, we use expert-triaged data to guide tuning and testing, and AI to scale the test volume.
Ilya: So it's hybrid red teaming: humans set the standard, then AI scales the checks?
Pierre: Exactly. We ground everything in expert-rated examples. We also analyze feedback from experts in different fields, like programming and security, and use those insights to improve. That combination helps us spot blind spots and train models more effectively.
Ilya: Will agents play a bigger role in this kind of red teaming?
Pierre: They can scale the process, yes. But we still need human expertise to plan and curate the evaluations. Agents help explore, simulate, and triage, but the gold standard needs a human foundation.
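The pattern Pierre describes, where experts define the gold standard and AI scales the checks, can be sketched in a few lines of Python. Everything here is illustrative: the function names and the agreement threshold are assumptions, not GitHub's pipeline.

```python
def grade_nightly_run(outputs, expert_gold, llm_judge, agreement_floor=0.9):
    """Scale checks with an AI judge while keeping experts as the gold standard.

    `outputs` maps case IDs to model outputs, `expert_gold` maps a triaged
    subset of those IDs to expert verdicts, and `llm_judge` wraps a model
    call that returns a verdict. All names here are hypothetical.
    """
    # Step 1: calibrate the judge against the expert-triaged subset.
    overlap = [cid for cid in outputs if cid in expert_gold]
    if not overlap:
        raise ValueError("No expert-graded cases to calibrate against.")
    agreement = sum(
        llm_judge(outputs[cid]) == expert_gold[cid] for cid in overlap
    ) / len(overlap)
    if agreement < agreement_floor:
        # A drifting judge is a blind spot, not a scaling win.
        raise RuntimeError("Judge disagrees with experts; re-ground it first.")

    # Step 2: only a calibrated judge grades the full run at scale.
    return {cid: llm_judge(output) for cid, output in outputs.items()}
```

The design choice mirrors the conversation: the AI judge never becomes the standard, it only amplifies one that humans set.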
Ilya: When it comes to tool calling, does it create evaluation challenges? External tools evolve, so results might differ over time.
Pierre: That's true. Tool calling alone doesn't solve problems; you need to evaluate what the agent is trying to achieve. If the workflow is repeatable, we often "lock" it into a structured agentless loop. It's easier to evaluate, more predictable, and scalable. Free-roam agents are harder to trust in production unless they're tightly scoped.
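A minimal sketch of what "locking" a repeatable workflow into an agentless loop might look like, in Python. The step functions are hypothetical stand-ins for LLM calls and test runners; the point is the fixed structure and bounded budget, not any specific implementation.

```python
MAX_ATTEMPTS = 3  # a fixed budget keeps the loop predictable and cheap to evaluate

def agentless_fix_loop(vulnerability, propose_fix, run_tests, refine_fix):
    """A 'locked' workflow: fixed steps, fixed budget, no open-ended planning.

    `propose_fix` and `refine_fix` wrap model calls; `run_tests` is the
    deterministic check that grounds the loop. All names are hypothetical.
    """
    fix = propose_fix(vulnerability)
    for _ in range(MAX_ATTEMPTS):
        result = run_tests(fix)
        if result.passed:
            return fix  # ship only a verified candidate fix
        # Feed test failures back in; the model never picks its own tools.
        fix = refine_fix(vulnerability, fix, result.failures)
    return None  # fail closed instead of letting the loop roam
```

Because every run follows the same steps, each stage can be benchmarked in isolation, which is much harder with a free-roam agent that plans its own path.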
Ilya: Do you simulate environments for evaluation?
Pierre: Sometimes, but GitHub is massive, so we often have real-world data to work with. Cleanup is harder than generation, and we prefer anonymized real-world data over fully synthetic mocks where possible.
Ilya: That makes sense. How do you avoid overfitting on benchmarks? Especially with all these leaderboards.
Pierre: It's a real issue. We've seen small models top public benchmarks because they were overfitted. They collapse in real-world scenarios. The solution is better evaluation grounded in human feedback and domain understanding. Experts aren't going away any time soon.
Ilya: Agreed. What about agent use cases? Where do you see real benefits?
Pierre: I think agents will become persistent collaborators. Instead of you prompting them every time, they'll have ongoing context, knowledge of past successes and failures, and be deeply integrated into your workflow. It's less about asking and more about co-developing. The agent will always be aware, always helpful.
Ilya: That could really help with onboarding and knowledge transfer. So much gets lost when people leave teams.
Pierre: Exactly. Development is about information flow. Agents can retain and surface that knowledge continuously. It's about removing dead ends for developers and keeping them in flow.
Ilya: Last question. Do you think we'll eventually stop coding in programming languages and just use natural language?
Pierre: Yes, especially for outcome-driven development. Models that can translate across languages will increase developer flexibility. But we'll also need a strong industry around verification and security, especially when AI is involved in mission-critical systems.
Ilya: Amazing. Thank you, Pierre, for the insightful conversation.
Pierre: Thanks, Ilya. It was a pleasure.
Rethinking AI Workflows? Start With the Right Data.
Toloka helps you build and test AI systems that align with real-world use, not just benchmarks. Reach out for a custom solution.