Until recently, most documents were either handwritten or prepared on a typewriter, which means that many of the world’s records can’t be read by computers. After all, you can’t run a quick search on a scanned document, at least not in the sense we’re used to. OCR technology has come a long way in helping us automatically recognize text and convert it to digital files, but it sometimes falls short, and you need real people involved, as in a recent case with the Archives of Latvian Folklore. The good news is that crowdsourcing can speed up digitization considerably, and Toloka is here to explain how: simply follow the basic steps below.
Let’s assume you have a whole load of scanned documents on your hands (if you have paper documents, you need to scan them first). Before you can build an efficient pipeline and proceed with transcribing text, you need to decompose all of the content. This is important because your content is likely to contain drawings, musical notes, or something else other than text.
You can start by classifying all of the documents into two categories: text and non-text. This will help with organizing your pipeline, and you won’t waste resources trying to transcribe text that’s not there. A number of options are available on Toloka to meet this goal, such as the “Image classification” template.
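Once the answers come back, you still need a single label per scan. Here is a minimal sketch of one common way to get it, assuming each scan was shown to several performers and the answers were exported as (image, label) pairs. The file names, the label values, and the `aggregate_labels` helper are illustrative, not part of any Toloka API.

```python
from collections import Counter, defaultdict

def aggregate_labels(responses):
    """Majority-vote the labels given to each image.

    `responses` is an iterable of (image_id, label) pairs, one per
    performer answer. Returns {image_id: label}; ties map to None so
    those images can be sent for another round of labeling.
    """
    votes = defaultdict(Counter)
    for image_id, label in responses:
        votes[image_id][label] += 1

    result = {}
    for image_id, counter in votes.items():
        ranked = counter.most_common(2)
        # A tie between the top two labels is left undecided.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result[image_id] = None
        else:
            result[image_id] = ranked[0][0]
    return result

# Hypothetical answers: up to three performers per scan.
responses = [
    ("scan_001.jpg", "text"), ("scan_001.jpg", "text"), ("scan_001.jpg", "non-text"),
    ("scan_002.jpg", "non-text"), ("scan_002.jpg", "non-text"), ("scan_002.jpg", "non-text"),
    ("scan_003.jpg", "text"), ("scan_003.jpg", "non-text"),
]
labels = aggregate_labels(responses)

# Only scans confidently labeled as text move on to the next stage.
text_scans = [image_id for image_id, label in labels.items() if label == "text"]
```

Leaving ties undecided, rather than picking a side, keeps doubtful scans out of the transcription stage until someone has taken another look.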
The next step is to break the documents you now have into smaller, more manageable chunks. This is a good idea because longer tasks are more expensive, and they are also harder to do, especially when it comes to transcribing old handwriting. With longer tasks, crowd performers tend to get fatigued and may start making mistakes, which is something you want to avoid.
Now that you have the images that contain text and an understanding of the overall workload, it’s time to divide each image into smaller components. Different scales can be used to achieve this, but one of the most common approaches is to outline paragraphs. You can use our “Selecting a region in an image” template for great results.
Paragraph outlining is common when smaller task portions have to be created in images with text.
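To turn the outlined paragraphs into separate task images, you need pixel coordinates for each outline. The sketch below assumes every outlined region arrives as a rectangle with relative `left`, `top`, `width`, and `height` values between 0 and 1. That format is an assumption, so check what your project actually returns. The helper names and sample values are illustrative.

```python
def to_pixel_box(region, image_width, image_height):
    """Convert a relative rectangle to a (left, upper, right, lower) pixel box."""
    left = round(region["left"] * image_width)
    upper = round(region["top"] * image_height)
    right = round((region["left"] + region["width"]) * image_width)
    lower = round((region["top"] + region["height"]) * image_height)
    # Clamp to the image so outlines that slightly overshoot stay valid.
    left, right = max(0, left), min(image_width, right)
    upper, lower = max(0, upper), min(image_height, lower)
    return left, upper, right, lower

def paragraph_boxes(regions, image_width, image_height):
    """Return pixel boxes sorted top to bottom, skipping empty ones."""
    boxes = [to_pixel_box(r, image_width, image_height) for r in regions]
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# Two hypothetical paragraph outlines on a 2000 x 3000 pixel scan,
# listed out of reading order on purpose.
regions = [
    {"left": 0.1, "top": 0.55, "width": 0.8, "height": 0.3},
    {"left": 0.1, "top": 0.1, "width": 0.8, "height": 0.4},
]
boxes = paragraph_boxes(regions, image_width=2000, image_height=3000)
```

The (left, upper, right, lower) order matches what the `Image.crop` method in Pillow expects, so each box can be passed to it directly if you use that library to cut the scans.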
Before you can proceed with the actual transcribing of text, you need to make sure that your text-containing images have been segmented correctly. Otherwise all of your subsequent work may be in jeopardy, and you may have to start over. The best way to avoid this tricky situation is to launch a simple binary classification project and ask the crowd performers to verify each other’s work.
We recommend the “Non-automatic acceptance” option. This makes it easy to transfer your segmentation stage content to a new project, which will subsequently become a verification task. Be sure to include clear and detailed instructions with unambiguous examples of both good and bad segmentation (which is something you have to determine yourself). You should also specify a suitable time frame for the verification task in the “Deadline” field. This way the Tolokers will know when to expect both feedback and payment.
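The verification answers then have to be turned into accept or reject decisions. The following is a rough sketch of that logic, assuming each segmentation was checked by several performers who answered "good" or "bad". The thresholds, identifiers, and the `review_segmentations` helper are hypothetical and should be tuned to your own project.

```python
from collections import defaultdict

def review_segmentations(verdicts, min_share=0.6, min_votes=3):
    """Split segmentation assignments into accepted, rejected, and pending.

    `verdicts` is an iterable of (assignment_id, is_ok) pairs, one per
    verification answer. An assignment is decided only after it has
    collected `min_votes` answers; it is accepted when at least
    `min_share` of them are positive.
    """
    tally = defaultdict(lambda: [0, 0])  # assignment_id -> [ok votes, total votes]
    for assignment_id, is_ok in verdicts:
        tally[assignment_id][0] += int(is_ok)
        tally[assignment_id][1] += 1

    decisions = {"accepted": [], "rejected": [], "pending": []}
    for assignment_id, (ok, total) in tally.items():
        if total < min_votes:
            decisions["pending"].append(assignment_id)
        elif ok / total >= min_share:
            decisions["accepted"].append(assignment_id)
        else:
            decisions["rejected"].append(assignment_id)
    return decisions

# Hypothetical verification answers for three segmentation assignments.
verdicts = [
    ("seg-a", True), ("seg-a", True), ("seg-a", False),
    ("seg-b", False), ("seg-b", False), ("seg-b", True),
    ("seg-c", True),
]
decisions = review_segmentations(verdicts)
```

Rejected assignments can go back to the segmentation project to be redone, while pending ones simply wait for more verification answers.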
Great – you’re finally ready to transcribe text from your now segmented and verified images. All you need to do is use our “Text recognition from an image” template, which comes with trustworthy quality control techniques, including golden sets and majority voting.
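To give a feel for how golden sets and majority voting work together, here is a simplified sketch in plain Python. It is not how the template implements them: it assumes answers exported as (performer, task, text) triples and a small dictionary of golden tasks with known transcriptions, and all names, sample texts, and thresholds are made up for illustration.

```python
from collections import Counter, defaultdict

def normalize(text):
    """Collapse whitespace so trivially different answers compare equal."""
    return " ".join(text.split())

def golden_accuracy(answers, golden):
    """Share of golden-set tasks each performer transcribed correctly."""
    hits, seen = Counter(), Counter()
    for performer, task_id, text in answers:
        if task_id in golden:
            seen[performer] += 1
            hits[performer] += normalize(text) == normalize(golden[task_id])
    return {performer: hits[performer] / seen[performer] for performer in seen}

def vote_transcriptions(answers, golden, min_accuracy=0.5):
    """Majority-vote transcriptions from performers who pass the golden set.

    Performers with no golden-set answers are treated as unverified and
    skipped. On a tie, the variant that was submitted first wins.
    """
    accuracy = golden_accuracy(answers, golden)
    votes = defaultdict(Counter)
    for performer, task_id, text in answers:
        if task_id in golden:
            continue  # golden tasks only measure quality
        if accuracy.get(performer, 0.0) >= min_accuracy:
            votes[task_id][normalize(text)] += 1
    return {task_id: c.most_common(1)[0][0] for task_id, c in votes.items()}

# Hypothetical data: one golden task ("g1") and one real task ("t1").
golden = {"g1": "Riga, 1923"}
answers = [
    ("ann", "g1", "Riga, 1923"),
    ("bob", "g1", "Riga,  1923 "),
    ("cid", "g1", "Rigo 1928"),
    ("ann", "t1", "Old song about the sea"),
    ("bob", "t1", "Old  song about the sea"),
    ("cid", "t1", "Old sang abaut the see"),
]
accuracy = golden_accuracy(answers, golden)
final = vote_transcriptions(answers, golden)
```

Exact-match voting is strict for free text, so for long or hard-to-read passages you may prefer shorter segments or a similarity-based comparison instead.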
Quality control is the next crucial step. Here’s what you can do to ensure you get the highest quality results with as few hiccups as possible:
Here’s what your full pipeline should look like:
That’s it! It’s probably much easier than you imagined. Now it’s your turn to have a go. Follow these steps, and be sure to check out our new self-study guide for even more hands-on help, including video tutorials.