How to digitize archives when you're short on resources: 6 steps to transcribe text in Toloka

by Toloka Team

Until recently, most documents were either handwritten or prepared on a typewriter, which means that many of the world’s records aren’t readily usable by computers. After all, you can’t run a quick text search on a scanned document. OCR technology has come a long way in automatically recognizing text and converting it to digital files, but sometimes it falls short and you need real people involved, as in a recent case with the Archives of Latvian Folklore. The good news is that crowdsourcing can help with digitization, and Toloka is here to explain how: simply follow the basic steps below.

Step 1: Decompose the task and classify content

Let’s assume you have a large batch of scanned documents on your hands (if you have paper documents, you need to scan them first). Before you can build an efficient pipeline and proceed with transcribing text, you need to break the work down by content type. This is important because your content is likely to contain drawings, musical notes, or other non-text material.

You can start by classifying all of the documents into two categories: text and non-text. This will help with organizing your pipeline, and you won’t have to waste your resources transcribing text that’s not there. A number of options are available on Toloka to meet this goal, like the “Image classification” template shown below.

Image
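Once each scan has been labeled by several performers, the labels need to be combined into one decision per scan. The sketch below shows one simple way to do that with a majority vote. The input layout, file names, and the `aggregate_labels` helper are assumptions made for illustration, not the format Toloka exports.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return the majority label per scan, or None when there is no clear winner."""
    result = {}
    for scan_id, labels in votes.items():
        ranked = Counter(labels).most_common()
        if not ranked:
            # No responses yet for this scan.
            result[scan_id] = None
        elif len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            # The two most frequent labels are tied, so nobody wins.
            result[scan_id] = None
        else:
            result[scan_id] = ranked[0][0]
    return result


# Hypothetical responses: each scan was labeled by two or three performers.
votes = {
    "scan_001.jpg": ["text", "text", "non-text"],
    "scan_002.jpg": ["non-text", "non-text", "non-text"],
    "scan_003.jpg": ["text", "non-text"],
}
labels = aggregate_labels(votes)
text_scans = [scan for scan, label in labels.items() if label == "text"]
needs_review = [scan for scan, label in labels.items() if label is None]
```

Only the scans in `text_scans` move on to segmentation; tied scans in `needs_review` can be sent out again or checked by hand.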

Step 2: Divide content into segments

The next step is to break the documents you now have into smaller, more manageable chunks. This is a good idea because longer tasks are more expensive and harder to do, especially when it comes to transcribing old handwriting. With longer tasks, crowd performers (called Tolokers on Toloka) tend to get fatigued and may start making mistakes, which is something you want to avoid.

Image

Step 3: Divide images into smaller segments

Now that you have the images that contain text and an understanding of the overall workload, it’s time to divide each image into smaller components. You can segment at different levels of granularity, but one of the most common approaches is to outline paragraphs. You can use our “Selecting a region in an image” template for this.

Paragraph outlining is common when smaller task portions have to be created in images with text.

Image
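After paragraphs have been outlined, each outline has to be turned into a crop of the original scan. Here is a minimal sketch of that conversion. It assumes that outlines arrive as rectangles with coordinates relative to the image size (values from 0 to 1); that layout, the `Region` class, and `to_pixel_boxes` are illustrative assumptions rather than a documented Toloka format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle in relative coordinates (0..1), assumed for this example."""
    left: float
    top: float
    width: float
    height: float


def to_pixel_boxes(regions, image_width, image_height):
    """Convert regions to (left, upper, right, lower) pixel boxes in reading order."""
    boxes = []
    # Sort top to bottom, then left to right, so segments keep their reading order.
    for r in sorted(regions, key=lambda r: (r.top, r.left)):
        x0 = max(0, round(r.left * image_width))
        y0 = max(0, round(r.top * image_height))
        x1 = min(image_width, round((r.left + r.width) * image_width))
        y1 = min(image_height, round((r.top + r.height) * image_height))
        # Skip degenerate rectangles that would produce an empty crop.
        if x1 > x0 and y1 > y0:
            boxes.append((x0, y0, x1, y1))
    return boxes


regions = [Region(0.125, 0.5, 0.75, 0.25), Region(0.125, 0.125, 0.75, 0.25)]
boxes = to_pixel_boxes(regions, image_width=800, image_height=1200)
```

The `(left, upper, right, lower)` order matches what Pillow's `Image.crop` expects, so each box can be passed to it directly if you use Pillow for cropping.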

Step 4: Verify segmentation

Before you can proceed with the actual transcribing of text, you need to make sure that your text-containing images have been segmented correctly. Otherwise all of your subsequent work may be in jeopardy, and you may have to start over. The best way to avoid this tricky situation is to launch a simple binary classification project and ask the crowd performers to verify each other’s work.

Image

We recommend enabling the “Non-automatic acceptance” option in the segmentation project, so that responses are reviewed before they are accepted. This makes it easy to transfer the segmentation results to a new project that serves as the verification task. Be sure to include clear and detailed instructions with unambiguous examples of both good and bad segmentation (you will need to decide for yourself what counts as each). You should also specify a suitable time frame for the verification task in the “Deadline” field. This way the Tolokers will know when to expect both feedback and payment.
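The review flow described above can be pictured as a small state machine: a segmentation stays in a waiting state until the verification project returns a verdict, and only then is it accepted or rejected. The sketch below is a simplified model of that idea; the status names, IDs, and the `review` function are hypothetical and do not mirror Toloka's API.

```python
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"  # waiting for a verification verdict
    ACCEPTED = "accepted"
    REJECTED = "rejected"


def review(submission_ids, verdicts):
    """Apply verification verdicts; submissions without a verdict stay SUBMITTED."""
    reviewed = {}
    for sub_id in submission_ids:
        verdict = verdicts.get(sub_id)
        if verdict is None:
            reviewed[sub_id] = Status.SUBMITTED
        elif verdict:
            reviewed[sub_id] = Status.ACCEPTED
        else:
            reviewed[sub_id] = Status.REJECTED
    return reviewed


# Hypothetical verdicts from the verification project (True = segmentation is fine).
statuses = review(["seg-1", "seg-2", "seg-3"], {"seg-1": True, "seg-2": False})
ready_for_transcription = [s for s, st in statuses.items() if st is Status.ACCEPTED]
redo = [s for s, st in statuses.items() if st is Status.REJECTED]
```

Rejected segmentations in `redo` can be sent back for another attempt, while `seg-3` simply waits until its verdict arrives.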

Step 5: Transcribe text from images

Great, you’re finally ready to transcribe text from your segmented and verified images. Use our “Text recognition from an image” template, which comes with reliable quality control techniques, including golden sets and majority voting.

Image
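To make the golden set idea concrete: a golden set is a group of tasks whose correct transcription you already know, so each performer's answers on those tasks can be scored. The following sketch shows one possible way to compute that score offline. The data layout and function names are assumptions, and ignoring case and extra whitespace is a design choice you may want to tighten for archival text.

```python
import unicodedata


def normalize(text):
    """Normalize Unicode form, collapse whitespace, and ignore case."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def golden_accuracy(answers, golden):
    """Share of golden tasks each performer transcribed correctly."""
    scores = {}
    for performer, tasks in answers.items():
        checked = [task_id for task_id in tasks if task_id in golden]
        if not checked:
            # This performer has not seen any golden task yet.
            continue
        correct = sum(
            normalize(tasks[task_id]) == normalize(golden[task_id])
            for task_id in checked
        )
        scores[performer] = correct / len(checked)
    return scores


# Hypothetical golden tasks and performer answers.
golden = {"g1": "Riga, 1931", "g2": "Old song"}
answers = {
    "p1": {"g1": "riga,  1931", "g2": "Old song", "t9": "unchecked"},
    "p2": {"g1": "Riga 1931", "g2": "old song"},
    "p3": {"t9": "unchecked"},
}
scores = golden_accuracy(answers, golden)
```

Here `p1` matches both golden tasks, `p2` misses the comma in one of them, and `p3` gets no score because none of their tasks were golden.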

Step 6: Add quality checks

Quality control is the next crucial step. Here’s what you can do to ensure you get the highest quality results with as few hiccups as possible:

  • Impose a restriction on fast responses to make sure no one is rushing through text.
  • Add control tasks to filter out poorly performing Tolokers who make frequent mistakes.
  • Use overlap (the number of performers who complete the same task) alongside control tasks: comparing responses to the same task is a good way to see whether your Tolokers are on the same page in terms of their skills and understanding.
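The checks in this list can also be reproduced on exported results to see what they would catch. The sketch below flags performers with too many fast responses and measures how well answers agree per task. The record fields, thresholds, and function names are all assumptions for illustration; on Toloka these rules are configured in the project settings rather than written by hand.

```python
from collections import Counter, defaultdict


def flag_fast_performers(responses, min_seconds, max_fast):
    """Performers with more than `max_fast` responses quicker than `min_seconds`."""
    fast = Counter(r["performer"] for r in responses if r["seconds"] < min_seconds)
    return sorted(p for p, count in fast.items() if count > max_fast)


def agreement_by_task(responses):
    """Share of responses per task that match the task's most common answer."""
    by_task = defaultdict(list)
    for r in responses:
        by_task[r["task"]].append(r["answer"])
    return {
        task: Counter(task_answers).most_common(1)[0][1] / len(task_answers)
        for task, task_answers in by_task.items()
    }


# Hypothetical exported responses; task t1 has an overlap of 3, t2 an overlap of 2.
responses = [
    {"task": "t1", "performer": "p1", "seconds": 40, "answer": "Old song"},
    {"task": "t1", "performer": "p2", "seconds": 3, "answer": "Old song"},
    {"task": "t1", "performer": "p3", "seconds": 35, "answer": "Old sang"},
    {"task": "t2", "performer": "p1", "seconds": 50, "answer": "Riga"},
    {"task": "t2", "performer": "p2", "seconds": 4, "answer": "Riga"},
]
too_fast = flag_fast_performers(responses, min_seconds=10, max_fast=1)
agreement = agreement_by_task(responses)
```

Tasks with low agreement are good candidates for a higher overlap or a closer look at the instructions.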

Give it a go

Here’s what your full pipeline should look like:

Image
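The same flow can be written down as a chain of stages, which makes the hand-offs explicit: classification filters scans, segmentation fans each scan out into several pieces, verification filters those pieces, and transcription produces the final text. This is only a rough sketch; each callable stands in for a crowd project, and `run_pipeline` is a hypothetical helper.

```python
def run_pipeline(scans, classify, segment, verify, transcribe):
    """Chain the stages; each callable stands in for one crowd project."""
    # Step 1: keep only the scans that contain text.
    text_scans = [scan for scan in scans if classify(scan) == "text"]
    # Steps 2-3: split every text scan into smaller segments.
    segments = [seg for scan in text_scans for seg in segment(scan)]
    # Step 4: keep only the segments that pass verification.
    verified = [seg for seg in segments if verify(seg)]
    # Step 5: transcribe each verified segment.
    return {seg: transcribe(seg) for seg in verified}


# Stub stages for illustration only; real stages would collect crowd responses.
result = run_pipeline(
    scans=["a.jpg", "b.jpg"],
    classify=lambda scan: "text" if scan == "a.jpg" else "non-text",
    segment=lambda scan: [f"{scan}#1", f"{scan}#2"],
    verify=lambda seg: seg.endswith("#1"),
    transcribe=lambda seg: f"<text of {seg}>",
)
```

Quality checks from Step 6 apply inside each stage rather than as a separate link in the chain.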

That’s it! The process is probably easier than you expected. Now it’s your turn to have a go: follow these steps, and be sure to check out our new self-study guide for more hands-on help, including video tutorials.
