This tutorial is a detailed guide to collecting data for aligning large language models (LLMs) with low-resource languages (LRLs). It addresses the data scarcity these languages face and introduces a pipeline for generating high-quality data, using Swahili as the primary example. It covers strategies for dataset collection and for aligning LLMs to LRLs, showing how to produce and use high-quality data for language technology development in under-resourced languages.

Materials

Notebooks