HomER v2: A larger, more diverse egocentric dataset for robotics research

on June 15, 2026

The first HomER dataset was built around a straightforward premise: egocentric household video, collected at scale with consistent perspective and verified quality, is more useful for embodied AI training than anything scraped from the open web. Getting that right requires a large, distributed workforce and a quality process that catches failures before they ever enter your pipeline. HomER v2 is built on the same premise, with 12 times the original volume and a broader spread of household activity.

What's in HomER v2

HomER v2 contains 765 first-person videos — approximately 100 hours of real-world household activity, all recorded from an egocentric perspective. Every video includes a natural-language description of the performed task, and each record carries the activity category and duration alongside URLs for preview and download.

The dataset covers 10 household activity domains:

Food preparation
Cleaning
Laundry and garment care
Home organization
Dining table tasks
Desk and office tasks
Repair and assembly
Personal care and bathroom activities
Indoor plant activities
Crafts and fine manipulation

The category structure is broader than in v1, which used 17 narrowly defined task labels. The v2 taxonomy was redesigned around the range of manipulation a household robot has to handle, so fine motor tasks like crafts sit alongside gross motor work like repair and assembly. The categories map more directly onto the physical reasoning challenges that embodied AI models encounter.
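To make the record layout concrete, here is a minimal Python sketch of one record and a per-domain tally of hours. The HomerRecord class and its field names (description, category, duration_s, preview_url, download_url) are illustrative assumptions based on the description above, not the published schema, and the sample records and URLs are made up. Check the dataset card for the actual column names and units.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# The ten activity domains listed in this post.
DOMAINS = (
    "Food preparation",
    "Cleaning",
    "Laundry and garment care",
    "Home organization",
    "Dining table tasks",
    "Desk and office tasks",
    "Repair and assembly",
    "Personal care and bathroom activities",
    "Indoor plant activities",
    "Crafts and fine manipulation",
)


@dataclass(frozen=True)
class HomerRecord:
    """One dataset record. Field names are hypothetical, not the real schema."""
    description: str   # natural-language description of the performed task
    category: str      # expected to be one of DOMAINS
    duration_s: float  # clip length in seconds (the unit is an assumption)
    preview_url: str
    download_url: str


def hours_per_domain(records: Iterable[HomerRecord]) -> Dict[str, float]:
    """Sum clip durations by domain and return the totals in hours."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        if record.category not in DOMAINS:
            raise ValueError(f"unknown category: {record.category!r}")
        totals[record.category] += record.duration_s / 3600.0
    return dict(totals)


# Made-up sample records; the URLs are placeholders.
sample = [
    HomerRecord("Chop an onion", "Food preparation", 1800.0,
                "https://example.com/p/1", "https://example.com/d/1"),
    HomerRecord("Fold towels", "Laundry and garment care", 900.0,
                "https://example.com/p/2", "https://example.com/d/2"),
    HomerRecord("Boil pasta", "Food preparation", 1800.0,
                "https://example.com/p/3", "https://example.com/d/3"),
]
print(hours_per_domain(sample))
```

As a sanity check on the headline numbers, 765 videos over roughly 100 hours works out to a mean clip length of about 7.8 minutes (100 × 60 / 765).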

Why egocentric data specifically

Viewpoint is part of what a model learns. Footage shot from a fixed third-person angle or a handheld phone differs systematically from a forward-facing, head-mounted view — in framing, in what stays occluded, and in where the hands and the manipulated object sit in the frame. Each viewpoint has its uses, but for embodied manipulation an egocentric view is the closest practical proxy for the perspective a robot operates from: the camera moves with the actor, and the hands and the worked object stay roughly where an onboard camera would see them. That narrows the distribution gap to a real embodied perspective — not to zero, since human hands and head-cam placement still differ from any given robot, but considerably more than third-person footage does.

The payoff is largest in fine manipulation. A third-person view of detailed hand work often loses the close, line-of-sight angle on contact and coordination. An egocentric view keeps the manipulation in frame as the person performing it sees it, which is closer to the signal a robot needs.

That perspective is also what we can enforce at scale. HomER v2 is collected from a large distributed crowd, and every clip runs through automated checks at upload against the capture spec: head-mounted, forward-facing, both hands visible. Clips that don't meet it are rejected before they enter the dataset, so what ships is dense with on-spec egocentric signal rather than diluted with mixed-perspective footage.

Collection methodology

HomER v2 was collected from over 400 contributors across 49 countries. Every clip runs through automated quality checks at the point of upload, and rejection happens before a clip ever enters the dataset.

Our quality specs cover seven dimensions:

Footage stability. Every recording is verified to be steady and easy to follow, with hands and objects clearly visible throughout and no camera motion obscuring the action.

Hand visibility. Both hands — wrists to fingertips — must remain in frame for at least 80% of the recording, with the primary manipulation clearly observable for the vast majority of the clip.

Valid physical task. Each video must capture a genuine, productive physical task with continuous hand-object manipulation that matches the declared category and meets duration requirements.

Natural environment. Footage must be recorded in an authentic residential setting. Staged or non-residential environments are rejected.

Camera angle. Correct head-mounted POV framing is enforced: landscape orientation, a downward view of the workspace, arms entering from the bottom with the backs of the hands visible, and no face or shoulders in frame.

No personal information. Every video is screened to confirm that no faces, identity documents, or legible personal information appear anywhere in the footage.

Original recording. Each submission is verified as original first-person footage — not a re-recording or screen capture — based on natural motion, framing, and absence of playback artifacts.
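The seven dimensions above amount to an all-or-nothing gate: a clip is accepted only if every check passes. The Python sketch below shows one way such a gate could be structured. It is not Toloka's actual pipeline. The ClipSignals class and its fields are hypothetical stand-ins for whatever upstream detectors produce, and measuring the 80% hand-visibility requirement as a fraction of sampled frames (rather than of elapsed time) is an assumption. Only the 80% threshold and the landscape requirement come from the specs above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

MIN_HAND_VISIBLE_FRACTION = 0.80  # "at least 80% of the recording"


@dataclass
class ClipSignals:
    """Hypothetical per-clip signals from upstream detectors."""
    width: int                          # frame width in pixels
    height: int                         # frame height in pixels
    both_hands_visible: Sequence[bool]  # one flag per sampled frame
    stable: bool
    valid_task: bool
    residential: bool
    pov_framing_ok: bool
    no_personal_info: bool
    original_recording: bool


def hand_visible_fraction(flags: Sequence[bool]) -> float:
    """Fraction of sampled frames in which both hands are visible."""
    if not flags:
        raise ValueError("clip has no sampled frames")
    return sum(flags) / len(flags)


def evaluate_clip(s: ClipSignals) -> Dict[str, bool]:
    """Return a pass/fail result for each of the seven quality dimensions."""
    hands_ok = (
        hand_visible_fraction(s.both_hands_visible) >= MIN_HAND_VISIBLE_FRACTION
    )
    return {
        "footage_stability": s.stable,
        "hand_visibility": hands_ok,
        "valid_physical_task": s.valid_task,
        "natural_environment": s.residential,
        # Landscape orientation is part of the camera-angle spec.
        "camera_angle": s.pov_framing_ok and s.width > s.height,
        "no_personal_information": s.no_personal_info,
        "original_recording": s.original_recording,
    }


def failed_checks(s: ClipSignals) -> List[str]:
    """Names of the dimensions a clip fails; an empty list means accept."""
    return [name for name, ok in evaluate_clip(s).items() if not ok]


# A portrait clip with both hands visible in only 7 of 10 sampled frames.
clip = ClipSignals(
    width=1080, height=1920,
    both_hands_visible=[True] * 7 + [False] * 3,
    stable=True, valid_task=True, residential=True,
    pov_framing_ok=True, no_personal_info=True, original_recording=True,
)
print(failed_checks(clip))
```

Returning the list of failed dimensions, rather than a single boolean, lets a contributor see which part of the spec to fix before re-recording.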

What's changed from v1

The original HomER covered 17 task categories with an emphasis on kitchen and food handling scenarios, manipulation-heavy activities that were a natural first target for robotic learning. HomER v2 expands the scope significantly:

From kitchen-focused to full household coverage
From 17 narrow task labels to 10 broader, manipulation-relevant domains
Natural language descriptions added for every video, enabling video-language model evaluation
Diverse household environments across contributors

The natural-language descriptions extend the dataset's use beyond perception pipelines: they make HomER v2 directly usable for video-language alignment work and task-reasoning benchmarks.

Intended use

HomER v2 is designed for:

Robotics perception and policy training
Embodied AI evaluation
Video understanding benchmarks
Activity recognition
Human-object interaction analysis
Video-language model evaluation
Task understanding and reasoning

Access the dataset

HomER v2 is available on Hugging Face under the CC BY 4.0 license. If you use it, please attribute Toloka and cite the dataset in publications, benchmarks, or derivative works.

Need more egocentric data — or something different?

HomER v2 is only a starting point. Toloka has over 25,000 hours of egocentric video data available off the shelf. If your research requires a specific activity scope, additional annotation layers — keypoints, action segmentation, object labels, trajectory data — or a fully custom collection built to your hardware and task spec, reach out to discuss what we can put together for you.

Want to collect your own data?

Toloka's self-serve platform gives you access to the same data collection infrastructure — no sales cycle, no minimums. Describe your task, set your quality bar, and 200,000+ contributors across 100+ countries do the rest. First batch in 3 hours.
