As part of the master’s thesis titled "Machine Translation for a less-resourced language: Manipuri", we worked on equipping the Manipuri language (locally known as Meiteilon) with resources for machine translation. The main source for these resources is The Sangai Express (see thanks at the bottom of this page). The resources we make available are listed below:
EM Corpus, an abbreviation of Ema-lon Manipuri Corpus, which translates to ‘our mother tongue Manipuri corpus’. This is the first comparable corpus built for the Manipuri (mni)-English (eng) language pair, from sentences crawled and collected from The Sangai Express between August 2020 and March 2021.
Monolingual data: 1,034,715 Manipuri sentences and 846,796 English sentences in version 1, and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. Together these form a comparable corpus in the two languages.
Parallel data: 124,975 Manipuri-English aligned sentences extracted from version 2 of the comparable data. We encourage anyone to help improve the quantity and quality of the aligned data and so contribute to Manipuri language resources.
EM-ALBERT is the first ALBERT model available for the Manipuri language. It is trained on 1,034,715 Manipuri sentences (from the first version of our EM Corpus).
EM-FT is the first FastText word embedding available for the Manipuri language. It is trained on 1,880,035 Manipuri sentences.
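As a rough illustration of how the parallel data listed above might be consumed, here is a minimal sketch. It assumes the aligned sentences are available as two line-aligned UTF-8 text files, one sentence per line, where line i of the Manipuri file corresponds to line i of the English file. This page does not state the file format, so that layout, the function name `read_parallel`, and the file names in the usage note are all assumptions.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(mni_path: Path, eng_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (Manipuri, English) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8, one sentence per line, same line count.
    """
    with open(mni_path, encoding="utf-8") as mni, \
            open(eng_path, encoding="utf-8") as eng:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of being silently truncated.
        for line_no, (m, e) in enumerate(zip_longest(mni, eng), start=1):
            if m is None or e is None:
                raise ValueError(f"files differ in length at line {line_no}")
            m, e = m.strip(), e.strip()
            if m and e:  # skip pairs where either side is empty
                yield m, e
```

With hypothetical file names, usage could look like `pairs = list(read_parallel(Path("em_corpus.mni"), Path("em_corpus.eng")))`. Raising an error on mismatched lengths is a deliberate choice: a silent truncation would misalign every pair after the first missing line.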
Kindly cite our papers if you use data and/or models from this distribution.
R. Huidrom, Y. Lepage, and K. Khomdram. EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC). Recent Advances in Natural Language Processing (RANLP), September 2021.