EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Manipuri resources

As part of the master’s thesis titled "Machine Translation for a less-resourced language: Manipuri", we worked on equipping the Manipuri language (locally known as Meiteilon) with resources for machine translation. The main source for these resources is The Sangai Express (see the acknowledgement further down this page). The resources we make available are listed below:

EM Corpus, short for Ema-lon Manipuri Corpus, which translates to ‘our mother tongue Manipuri corpus’. This is the first comparable corpus built for the Manipuri (mni)-English (eng) language pair, from sentences crawled and collected from The Sangai Express between August 2020 and March 2021.
- Monolingual data: 1,034,715 Manipuri sentences and 846,796 English sentences in version 1; 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. Together, these form a comparable corpus in the two languages.
- Parallel data: 124,975 Manipuri-English aligned sentences extracted from version 2 of the comparable data. We encourage anyone to improve the quantity and quality of the aligned data and so contribute to Manipuri language resources.
EM-ALBERT is the first ALBERT model available for the Manipuri language. It is trained on 1,034,715 Manipuri sentences (the first version of our EM Corpus).
EM-FT is the first FastText word embedding available for the Manipuri language. It is trained on 1,880,035 Manipuri sentences.
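
The sentence counts above can be checked against a downloaded monolingual file without decompressing it to disk. The following is a minimal sketch, assuming the file is UTF-8 text with one sentence per line (this page does not state the layout); the function name and the file name in the usage note are hypothetical.

```python
import gzip


def count_sentences(path):
    """Count non-empty lines in a gzip-compressed UTF-8 text file.

    Assumption: one sentence per line. If the released files use a
    different layout, the count will not match the figures on this page.
    """
    count = 0
    # Mode "rt" decodes the gzip stream as text, one line at a time,
    # so the whole file never has to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                count += 1
    return count
```

For example, `count_sentences("manipuri_v1.gz")` (hypothetical file name) could be compared with the version 1 figure of 1,034,715 Manipuri sentences.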

Downloads (all files are gzip files)

EM Corpus
1. Manipuri monolingual data ( version 1 , version 2 )
2. English monolingual data ( version 1 , version 2 )
3. Link to the parallel Manipuri-English data
EM-ALBERT
- Link to the ALBERT model
EM-FT
- Link to the FastText word embedding model ( cc.mni.300.vec.gz , cc.mni.300.bin.gz )
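
As an illustration of how the gzip-compressed `.vec` file might be used, the sketch below reads word vectors with the Python standard library only. It assumes the file follows the usual fastText text format (a header line with the vocabulary size and the dimension, then one word and its values per line, separated by spaces); this page does not confirm the format, and the helper names `load_vec` and `cosine` are hypothetical.

```python
import gzip
import math


def load_vec(path, limit=None):
    """Load vectors from a gzip-compressed fastText-style .vec file.

    Assumed format: first line "<vocab_size> <dim>", then one line per
    word: "<word> <v1> ... <v_dim>". Lines that do not match are skipped.
    """
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().split()
        dim = int(header[1])
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # malformed or unexpected line
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A call such as `load_vec("cc.mni.300.vec.gz", limit=50000)` would keep only the first 50,000 entries, which limits memory use; `cosine` can then compare any two of the loaded vectors. The `.bin.gz` file is a different, binary format and is not handled by this sketch.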

Kindly cite our paper if you use data and/or models from this distribution.

R. Huidrom, Y. Lepage, and K. Khomdram. EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC). Recent Advances in Natural Language Processing (RANLP), September 2021.

License

All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

We sincerely thank the team of The Sangai Express, who granted permission to release the EM Corpus to the NLP community.

Contact

Please do not hesitate to contact us for additional information:

rudali.huidrom@ruri.waseda.jp, yves.lepage@waseda.jp, khogen.kh@gmail.com

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287