Single Image 3D Scene Reconstruction: Recent Advances

Denis Fraltsov

Computer vision is a rapidly developing field of artificial intelligence, particularly in the area of 3D. This overview will consider an applied task: transitioning between 2D and 3D environments.

Direct task

To begin, we will look at how the direct problem of computer graphics is solved, namely creating a 2D image from a 3D model, and get acquainted with the basic concepts.

Rendering is the process of moving from a 3D model to its 2D projection. You've probably heard of some of the common rendering methods:

  • Rasterization is one of the earliest and fastest rendering methods. It treats the model as a mesh of polygons whose vertices carry information such as position, texture, and color. These vertices are then projected onto an image plane perpendicular to the viewing direction. Naive rasterization has problems with overlapping objects: if surfaces overlap, whichever one is drawn last ends up on screen, so the wrong object may be displayed. This problem is solved with z-buffering (in fact, the z-buffer is a depth map).

  • Ray casting. Unlike rasterization, ray casting does not suffer from the overlapping-surface problem. As the name suggests, it casts rays from the camera's point of view into the scene, one through each pixel of the image plane. The surface that a ray hits first is the one that gets rendered, and any intersections beyond it are ignored.

  • Ray tracing. Despite its advantages, ray casting cannot correctly model shadows, reflections, and refractions. Ray tracing was developed to address this. It works in a similar way to ray casting, except that it models the behavior of light more faithfully. Primary rays are cast from the camera's point of view toward the models, and wherever one hits a surface it spawns secondary rays: shadow rays, reflected rays, or refracted rays, depending on the surface properties.
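
To make the overlap problem from the rasterization item concrete, here is a minimal sketch of a depth test, assuming the scene has already been reduced to per-pixel fragments. The function name and the `(x, y, depth, color_id)` fragment layout are illustrative choices, not part of any particular renderer.

```python
import numpy as np

def rasterize_with_zbuffer(fragments, width, height, background=0):
    """Draw (x, y, depth, color_id) fragments; the nearest fragment wins per pixel."""
    color = np.full((height, width), background, dtype=np.int64)
    zbuf = np.full((height, width), np.inf)  # the z-buffer doubles as a depth map
    for x, y, depth, color_id in fragments:
        if depth < zbuf[y, x]:  # depth test: keep only the closest surface
            zbuf[y, x] = depth
            color[y, x] = color_id
    return color, zbuf

# Two surfaces cover pixel (x=1, y=1): the near one (depth 2.0) is drawn first
# and the far one (depth 5.0) last. Without the depth test the far one would win.
fragments = [(1, 1, 2.0, 7), (1, 1, 5.0, 9), (0, 0, 3.0, 9)]
color, zbuf = rasterize_with_zbuffer(fragments, width=3, height=2)
```

Because of the depth test, the result no longer depends on the order in which the fragments are drawn, and `zbuf` holds the depth of the visible surface at every covered pixel.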

Now that we’ve considered the direct problem of building a 2D image from a 3D model, let’s look at ways to solve the inverse problem: building a 3D model from a 2D image.

Inverse problem

A two-dimensional photograph is a projection of a three-dimensional scene. A 3D scene is a collection of 3D meshes, vertices, faces, texture maps, and a light source viewed from a camera or viewpoint. For simplicity, let's limit the scene to a single 3D object. If we were able to restore the original 3D scene from which the 2D photo was created, we should be able to verify this by projecting the 3D object onto 2D using the same point of view that was used to create the input 2D photo.
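
The verification step described above can be sketched with a pinhole camera model: project a candidate 3D object with the known camera and measure how far the result lands from the observed 2D points. The intrinsics `K`, the identity pose, and the three sample points are made-up values for illustration only.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Pinhole projection of Nx3 world points to Nx2 pixel coordinates."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

def reprojection_error(points_3d, observed_2d, K, R, t):
    """Mean pixel distance between projected and observed points."""
    projected = project_points(points_3d, K, R, t)
    return float(np.mean(np.linalg.norm(projected - observed_2d, axis=1)))

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
true_object = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 1.0, 5.0]])
observed = project_points(true_object, K, R, t)         # stand-in for the input photo
wrong_guess = true_object + np.array([0.1, 0.0, 0.0])   # candidate shifted by 10 cm

error_true = reprojection_error(true_object, observed, K, R, t)
error_wrong = reprojection_error(wrong_guess, observed, K, R, t)
```

The correct object reprojects with zero error, while the shifted candidate is off by 10 pixels, which is the kind of signal a reconstruction method can try to minimize.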

To reconstruct an object by exhaustive search, you would need to enumerate all possible combinations of vertices, faces, light sources, and textures and find one that, when projected to 2D from the same camera position, gives an image equivalent to the input image. This is essentially a search problem. But the number of possible combinations of vertices, faces, texture maps, and lighting is enormous, so brute force is not an option.

Let's look at the existing approaches to solving this problem.


DIB-R

DIB-R is a differentiable renderer that models pixel values using a differentiable rasterization algorithm. It assigns pixel values in two ways: one for foreground pixels and another for background pixels.

In standard rendering, a pixel simply takes its value from the nearest face that covers it. DIB-R instead treats foreground rasterization as an interpolation of vertex attributes: for each foreground pixel, a z-buffer test selects the nearest covering face, and the pixel is influenced by that face only.

So each foreground pixel is calculated as an interpolation of the three vertices of its nearest covering face, using a weight for each vertex.

For background pixels, i.e. pixels that are not covered by any face of the 3D object, the value is calculated based on the distance from the pixel to the nearest face.
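
The two pixel rules can be sketched in 2D as follows. This is an illustration under stated assumptions, not the DIB-R implementation: the foreground rule uses ordinary barycentric weights, and the exponential falloff, the `delta` parameter, and the way faces are combined in `background_alpha` are assumed forms chosen to show a distance-based soft value.

```python
import numpy as np

def barycentric_weights(p, tri):
    """Barycentric weights of 2D point p with respect to triangle tri (3x2)."""
    a, b, c = tri
    m = np.column_stack((b - a, c - a))
    w1, w2 = np.linalg.solve(m, p - a)
    return np.array([1.0 - w1 - w2, w1, w2])

def foreground_value(p, tri, vertex_attrs):
    """Interpolate per-vertex attributes (3xC) at a pixel covered by the face."""
    return barycentric_weights(p, tri) @ vertex_attrs

def point_segment_distance(p, a, b):
    """Distance from point p to the segment between a and b."""
    ab = b - a
    s = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + s * ab))

def background_alpha(p, triangles, delta=1.0):
    """Soft coverage of an uncovered pixel; decays with distance to the faces."""
    keep = 1.0
    for tri in triangles:
        d = min(point_segment_distance(p, tri[i], tri[(i + 1) % 3]) for i in range(3))
        keep *= 1.0 - np.exp(-d / delta)  # assumed exponential falloff
    return 1.0 - keep

tri = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
colors = np.eye(3)  # red, green, and blue at the three vertices
inside = foreground_value(np.array([1.0, 1.0]), tri, colors)
alpha_near = background_alpha(np.array([2.5, 2.5]), [tri])
alpha_far = background_alpha(np.array([5.0, 5.0]), [tri])
```

Both rules are smooth functions of the vertex positions, which is what lets gradients flow from pixel values back to the geometry.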

Architecture scheme from the official paper

DIB-R can generate images with realistic lighting and shading effects that are difficult to achieve with traditional rendering.

Official paper

Link to the solution

Im2Struct: SMN+SRN

A structure masking network (SMN) produces an object mask from the input 2D image at multiple scales. It is a multi-layer convolutional neural network (CNN) whose task is to preserve information about the object's shape while filtering out irrelevant information such as background and textures.

A structure recovery network (SRN) recursively reconstructs the hierarchy of object parts as a structure of cuboids. The SRN takes the features from the SMN, combines them with CNN features of the 2D image, and passes the result to a recursive neural network (RvNN) that decodes it into a 3D structure. The output is a set of three-dimensional cuboids in a plausible spatial configuration.

Architecture scheme from the official paper

Im2Struct has several advantages over traditional 3D scanning methods, as it can recover the 3D structure of an object from a single 2D image, which is often faster and less expensive than scanning an object from multiple viewpoints.

Official paper


ATLAS

The method takes as input a sequence of RGB images of arbitrary length. The camera intrinsics and pose are known for each image. The images are passed through a 2D CNN backbone for feature extraction. The features are then back-projected into a 3D voxel volume and accumulated using a running average. Once the image features are fused in 3D, a 3D CNN regresses the TSDF (truncated signed distance function) directly.
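
The accumulation step can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: it assumes nearest-pixel sampling, a flat list of voxel centers, and toy feature maps, and the function name and array layout are hypothetical.

```python
import numpy as np

def backproject_features(volume, counts, voxel_xyz, feat_map, K, R, t):
    """Fuse one view into the voxel volume with a running average.

    volume: (N, C) accumulated features, counts: (N,) views seen per voxel,
    voxel_xyz: (N, 3) voxel centers in world coordinates, feat_map: (H, W, C).
    """
    h, w, _ = feat_map.shape
    cam = voxel_xyz @ R.T + t                 # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                       # ignore voxels behind the camera
    uvw = cam @ K.T
    u = np.full(len(z), -1, dtype=np.int64)
    v = np.full(len(z), -1, dtype=np.int64)
    u[in_front] = np.round(uvw[in_front, 0] / z[in_front]).astype(np.int64)
    v[in_front] = np.round(uvw[in_front, 1] / z[in_front]).astype(np.int64)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sampled = feat_map[v[valid], u[valid]]    # nearest-pixel feature lookup
    n = counts[valid][:, None]
    volume[valid] = (volume[valid] * n + sampled) / (n + 1)  # running average
    counts[valid] += 1
    return volume, counts

K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
voxels = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]])
volume = np.zeros((3, 1))
counts = np.zeros(3, dtype=np.int64)
for value in (2.0, 4.0):                      # two views with constant features
    feat_map = np.full((3, 3, 1), value)
    volume, counts = backproject_features(volume, counts, voxels, feat_map, K, R, t)
```

Only the first voxel is visible in both views, so it ends up with the average of the two features, while the voxel behind the camera and the one outside the image stay empty.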

Architecture scheme from the official paper

ATLAS is useful in a variety of industries, including manufacturing, engineering, and archaeology. One limitation of ATLAS 3D is that it requires the object being scanned to be stationary, which may not always be feasible in certain applications. Additionally, the system may struggle to capture fine details and textures on objects with highly reflective or transparent surfaces.

Official paper

Link to the solution

Mesh R-CNN

The framework uses a two-stage approach: in the first stage, it detects and segments the object in the image using a convolutional neural network (CNN), similar to the popular Mask R-CNN framework. In the second stage, it regresses a set of 3D vertices for each object instance using a mesh prediction network.

Architecture scheme from the official paper

One of the main advantages of Mesh R-CNN is its ability to reconstruct detailed 3D meshes of objects, including their fine-grained geometry and texture. This makes it useful for applications such as virtual reality, augmented reality, and 3D printing.

Official paper

Link to the solution

Reconstructing a 3D scene from a single image

To recover regions of the 3D scene that are occluded in the 2D image, this approach learns the missing information from synthetically generated high-resolution data. It uses a deep network designed specifically for volumetric TSDF data, built around a tree-structured network architecture. The framework can handle a 3D resolution of 512^3 thanks to a dedicated compression technique based on a modified autoencoder.
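
To give a feel for what a TSDF volume is and why compression matters at this resolution, here is a small sketch that builds the TSDF of a sphere on a coarse grid and computes the size of a dense 512^3 grid. The sphere, the truncation distance, and the float32 storage are assumptions for illustration and are unrelated to the paper's actual data.

```python
import numpy as np

def sphere_tsdf(resolution, radius=0.5, truncation=0.1):
    """TSDF of a centered sphere on a resolution^3 grid spanning [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    signed_distance = np.sqrt(x**2 + y**2 + z**2) - radius  # negative inside
    return np.clip(signed_distance / truncation, -1.0, 1.0)

tsdf = sphere_tsdf(resolution=33)
near_surface = float(np.mean(np.abs(tsdf) < 1.0))  # fraction of informative voxels

dense_bytes = 512**3 * 4  # one float32 per voxel in a dense 512^3 grid
```

A dense float32 grid at 512^3 takes 512 MiB for a single scene, and most voxels are saturated at -1 or 1, which is why a compact representation is attractive.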

Link to the solution

We’ve covered several state-of-the-art solutions for solving the inverse graphics problem. All of these solutions can help you solve a wide variety of tasks, like reconstructing a room, creating a 3D local map, reconstructing a 3D scene from a single image, or even estimating the height and depth of crops or terrain to guide planting, harvesting, and irrigation decisions.

Keep in mind that all these solutions are based on different approaches to rendering, voxel prediction, mesh prediction, and so on. But they all have a common need to build or predict a depth map in one form or another.

That’s why I also propose to separately consider the problem of constructing a depth map.

Depth estimation

There are several ways to get a depth map:

  • Use a stereo pair of RGB images.
  • Use an RGB-D camera.
  • Teach a model to predict the depth map based on one RGB image.
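
For the stereo option, depth follows from disparity through the standard relation Z = f * B / d for a rectified pair. The sketch below assumes the disparity map has already been computed; the focal length and baseline are made-up values.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair; zero disparity maps to inf."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

disparity = np.array([[70.0, 35.0], [7.0, 0.0]])  # pixels
depth = depth_from_disparity(disparity, focal_px=700.0, baseline_m=0.1)
```

Large disparities correspond to nearby points and small disparities to distant ones, so depth resolution degrades with distance.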

Let's look at several state-of-the-art solutions for predicting depth maps.

Monocular depth estimation - GLPN

GLPN is an architecture with global and local feature paths running through the entire network. The overall structure of the framework is as follows: a transformer encoder enables the model to learn global dependencies, and the proposed decoder restores the extracted features into the target depth map by constructing the local path through skip connections and a feature fusion module.

A fragment of an image with the result from the official paper

Official paper

Link to the solution

Dense depth model

In the encoder, the RGB input image is encoded into a feature vector using a DenseNet-169 network pre-trained on ImageNet.

This vector is then fed into a sequence of upsampling layers that build a final depth map at half the input resolution. These upsampling layers and their associated skip connections form the decoder.

A fragment of an image with the result from the official paper

Official paper

Link to the solution


DPT

The architecture uses a vision transformer (ViT) as its backbone while keeping the overall encoder-decoder structure that has been successful for dense prediction in the past. The input image is converted to tokens either by extracting non-overlapping patches and applying a linear projection to their flattened representation (DPT-Base and DPT-Large), or by applying a ResNet-50 feature extractor (DPT-Hybrid).

The image embedding is supplemented with a positional embedding and a patch-independent readout token. The tokens pass through several transformer stages. Tokens from different stages are reassembled into image-like representations at multiple resolutions (Reassemble). Fusion modules then progressively merge and upsample these representations to produce a fine-grained prediction.
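
The patch-based tokenization can be sketched in plain NumPy. The sizes here are deliberately tiny, and the random projection matrix, positional embedding, and zero-initialized extra token are stand-ins for learned parameters, so this shows only the shapes involved, not the real model.

```python
import numpy as np

def image_to_tokens(image, patch, w_proj, pos_embed, extra_token):
    """Split an HxWxC image into non-overlapping patches and embed them."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    tokens = patches @ w_proj                            # linear projection
    tokens = np.vstack([extra_token[None, :], tokens])   # patch-independent token
    return tokens + pos_embed                            # positional embedding

rng = np.random.default_rng(0)
patch, dim = 16, 8                                # small illustrative sizes
image = rng.random((64, 48, 3))
n_patches = (64 // patch) * (48 // patch)         # 4 * 3 = 12 patches
w_proj = rng.normal(size=(patch * patch * 3, dim))
pos_embed = rng.normal(size=(n_patches + 1, dim))
extra_token = np.zeros(dim)
tokens = image_to_tokens(image, patch, w_proj, pos_embed, extra_token)
```

A 64x48 image with 16x16 patches yields 12 patch tokens plus one extra token, each of dimension `dim`, which is the sequence a transformer encoder would then process.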

Official paper

Link to the solution

All of these solutions can help you get a depth map and use it in your own way. For instance, you might want to build 3D scenes with PyTorch3D.
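
One common first step toward a 3D scene is to back-project the depth map into a point cloud. The sketch below uses plain NumPy rather than any specific 3D library and assumes metric depth and known pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), which a monocular model does not necessarily provide.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map to an (H*W, 3) point cloud in camera space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((2, 3), 2.0)  # a flat wall two meters from the camera
points = depth_to_point_cloud(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
```

The resulting points can then be passed to whatever meshing or rendering tool you prefer.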

The hand of crowdsourcing

In the last case, MiDaS achieved its results by combining data sources in a way that no one had implemented before. Diverse depth datasets are difficult to collect at scale, so the authors introduced tools for combining complementary data sources. In addition, a new dataset based on 3D movies provides reliable information about a variety of dynamic scenes.

This brings me to the problem I want to focus on: data for 3D tasks. Every developer faces this problem and has to work around it somehow, including at the architecture level. All of the solutions described above rely on almost the same set of open datasets.

That set is not enough, because complex, high-quality 3D data is hard to collect for a variety of reasons, such as occlusions, poor lighting conditions, and limited viewpoints. When there is not enough data available, it becomes difficult to accurately estimate the depth and structure of the scene, which leads to inaccurate or incomplete 3D reconstructions.

Crowdsourcing can be used as a potential solution to address the problem of not enough data for 3D reconstruction. By leveraging the collective effort of a large number of individuals, crowdsourcing can provide additional data and perspectives on a scene, which can improve the accuracy and completeness of the 3D reconstruction.

For example, a crowdsourcing platform could be used to collect multiple images of a scene taken from different viewpoints by a large number of contributors. These images could then be processed using multi-view stereo or structure from motion techniques to create a more accurate 3D reconstruction of the scene.
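
The geometric core of such multi-view pipelines is triangulation: recovering a 3D point from its projections in two or more images with known cameras. Here is a minimal sketch of linear (DLT) triangulation for two views; the camera matrices and the test point are made up, and a real pipeline would also have to estimate the camera poses and match features between photos.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two views."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]  # null-space direction is the homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # first photo
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])  # camera 0.5 m to the right
X_true = np.array([0.2, -0.1, 4.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free observations the point is recovered exactly; with real photos, more views from more contributors help average out the noise.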

This is exactly what was implemented in the Neatsy project to partially compensate for a lack of 3D data. Neatsy develops AI software for virtually sizing shoes. They used the Toloka crowdsourcing platform for additional data collection (more than 50 thousand new photos) and made improvements to the model's metrics. Their software creates a 3D model of your feet using around 50 different measurements and helps you find the perfect pair of sneakers. The project has moved on and now they can also diagnose health problems in feet, all thanks to data from people in the crowd. This is just one example of the strong potential of developing 3D technology.

Real-world applications of 3D reconstruction technologies in robotics

Here are some examples of how 3D reconstruction technologies are being applied:

  • Robotics for agriculture: In farm management applications, 3D reconstruction from a single image can be used to estimate the growth and density of crops or to analyze the terrain, which can help with planting, harvesting, and irrigation decisions.
  • Navigation: By using a 3D scene generated from a single image, an autonomous robot can more accurately perceive the depth and location of its surroundings. This information can be used to create more accurate navigation plans, such as taking a detour or choosing the best route to a destination.
  • Path planning: A 3D scene generated from a single image can be used to help autonomous robots plan paths that are more efficient and safer. For example, by understanding the height and depth of objects in the environment, robots can more easily identify obstacles and bypass them.
  • Simulation: Reconstructing real 3D scenes for use in simulation allows you to create more realistic and accurate virtual environments to test and validate your algorithms. This could lead to more efficient autonomous robotic systems in the real world.
  • Object recognition: By using a 3D scene generated from a single image, autonomous robots can better understand the geometry and shape of objects in the environment. This can improve the accuracy of object recognition, especially for objects with complex shapes or textures.

Finally, combining multiple sources of information, such as pairing single-image 3D reconstruction with data from lidars or an RGB-D camera, can result in even more accurate 3D information that will improve robot perception and navigation.


There are many state-of-the-art solutions available for forward and inverse graphics, as well as for predicting depth maps. We looked at the approach behind each one and at their practical applications, and also noted the limitations caused by the lack of 3D data. Crowdsourcing platforms have the potential to solve the data collection problem and support the development of 3D technologies for real-life computer vision applications.
